Remove duplicates according to their frequency in column

corfuitl · October 19, 2015, 5:09am

Hi all,

I have huge a tab-delimited file with the following format and I want to remove the duplicates according to their frequency based on Column2 and Column3.

Column1 Column2 Column3 Column4 Column5 Column6 Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

In this case, the result should be:

1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3

because user1 with access1 occur twice. Moreover, in case the original list contains the following entry:

5    user1    access2    word    word    2    1

The result should be

2    user2    access2    word    word    5    3
5    user1    access2    word    word    2    1

because user1 with access1 and user2 with access2 occur twice, so the smaller numbers of Column6 and Column7 should be taken into consideration.

Thanks in advance for your time and consideration.

RudiC · October 19, 2015, 6:07am

Any attempts from your side?

---------- Post updated at 12:07 ---------- Previous update was at 12:03 ----------

And, why should line 5 be preferred to line 4? Except for field 1, they're identical.

corfuitl · October 19, 2015, 6:16am

Hi,

Thank you for your reply. Lines 4 and 5 are identical, so no problem, it will be correct if it extracts line 4.

I am not familiar with awk but I have found the following command from a similar post but it seems that it doesn't work in my case.

awk '(NR==1);a[$2]<$3||d[$2]<$4{a[$2]=$3;d[$2]=$4;b[$2]=$0};END{for(i in b)if(b !~ /ID/){print b}}'

Thanks

RudiC · October 19, 2015, 7:20am

Well, try:

awk '
NR==1           {print
                 next
                }

                {LINE[$2,$3]=$0
                 FREQ[$2,$3]++
                 SUM[$2,$3]=$6+$7
                 if (FREQ[$2,$3] > MAX[$2]) MAX[$2] = FREQ[$2,$3]
                 if (MIN[$2] == 0 ||
                     SUM[$2,$3]  < MIN[$2]) MIN[$2] = SUM[$2,$3]
                }
END             {for (f in FREQ)        {split (f, TMP, SUBSEP)
                                         if     (FREQ[f] == MAX[TMP[1]] &&
                                                 SUM[f] == MIN[TMP[1]])
                                            print LINE[f]
                                        }
                }
' FS="\t" SUBSEP="\t" file
Column1    Column2    Column3    Column4    Column5    Column6    Column7
5    user1    access2    word    word    2    1
2    user2    access2    word    word    5    3

corfuitl · October 19, 2015, 7:34am

Hi, thanks for your prompt reply

Unfortunately, it seems that the result is not completely correct. The correct result should have a unique user, so in Column2 user1 should appear only once based on the number of occurrences of Column 3. In case the number of occurrences is duplicated, then the smallest numbers of Column6 and Column7 should be taken into consideration.
I am sorry, but it is complicated and may be I didn't express my thought.

thanks

RudiC · October 19, 2015, 7:39am

There's only one single user1 and one single user2 in above result!?

corfuitl · October 19, 2015, 7:56am

Hi,

The above result is correct, but when I tried with:

1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

I got:

1    user1    access1    word    word    3    2
4    user1    access2    word    word    2    1

What is going wrong?

Best regards,

RudiC · October 19, 2015, 8:21am

That's because you dropped the header line.

---------- Post updated at 14:21 ---------- Previous update was at 14:20 ----------

Output with header line in file:

Column1    Column2    Column3    Column4    Column5    Column6    Column7
3    user1    access1    word    word    3    1
2    user2    access2    word    word    5    3

corfuitl · October 20, 2015, 6:22am

Hi,

Once again thanks for your help. It seems that it works with:

Column1    Column2    Column3    Column4    Column5    Column6    Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1
5       user1    access2    word    word    2    1

but not with

Column1    Column2    Column3    Column4    Column5    Column6    Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

Moreover, is it possible not to include the header line in the results?

Thanks

RudiC · October 20, 2015, 6:36am

Well, I've noticed that as well. Try

awk '
NR==1           {# print
                 next
                }

                {LINE[$2,$3]=$0
                 FREQ[$2,$3]++
                 SUM[$2,$3]=$6+$7
                 if (FREQ[$2,$3] == MAX[$2])    if (SUM[$2,$3]  < MIN[$2]) MIN[$2] = SUM[$2,$3]
                 if (FREQ[$2,$3] >  MAX[$2])    {MAX[$2] = FREQ[$2,$3]
                                                 MIN[$2] = SUM[$2,$3]
                                                }
                }
END             {for (f in FREQ)        {split (f, TMP, SUBSEP)
                                         if     (FREQ[f] == MAX[TMP[1]] &&
                                                 SUM[f] == MIN[TMP[1]])
                                            print LINE[f]
                                        }
                }
' FS="\t" SUBSEP="\t" file

corfuitl · October 20, 2015, 7:03am

Hi again,

Works perfectly now

Thanks for your time!

Best,
Kon