Remove duplicates according to their frequency in column

Hi all,

I have huge a tab-delimited file with the following format and I want to remove the duplicates according to their frequency based on Column2 and Column3.

Column1 Column2 Column3 Column4 Column5 Column6 Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

In this case, the result should be:

1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3

because user1 with access1 occur twice. Moreover, in case the original list contains the following entry:

5    user1    access2    word    word    2    1

The result should be

2    user2    access2    word    word    5    3
5    user1    access2    word    word    2    1

because user1 with access1 and user2 with access2 occur twice, so the smaller numbers of Column6 and Column7 should be taken into consideration.

Thanks in advance for your time and consideration.

Any attempts from your side?

---------- Post updated at 12:07 ---------- Previous update was at 12:03 ----------

And, why should line 5 be preferred to line 4? Except for field 1, they're identical.

1 Like

Hi,

Thank you for your reply. Lines 4 and 5 are identical, so no problem, it will be correct if it extracts line 4.

I am not familiar with awk but I have found the following command from a similar post but it seems that it doesn't work in my case.

awk '(NR==1);a[$2]<$3||d[$2]<$4{a[$2]=$3;d[$2]=$4;b[$2]=$0};END{for(i in b)if(b !~ /ID/){print b}}'

Thanks

Well, try:

awk '
NR==1           {print
                 next
                }

                {LINE[$2,$3]=$0
                 FREQ[$2,$3]++
                 SUM[$2,$3]=$6+$7
                 if (FREQ[$2,$3] > MAX[$2]) MAX[$2] = FREQ[$2,$3]
                 if (MIN[$2] == 0 ||
                     SUM[$2,$3]  < MIN[$2]) MIN[$2] = SUM[$2,$3]
                }
END             {for (f in FREQ)        {split (f, TMP, SUBSEP)
                                         if     (FREQ[f] == MAX[TMP[1]] &&
                                                 SUM[f] == MIN[TMP[1]])
                                            print LINE[f]
                                        }
                }
' FS="\t" SUBSEP="\t" file
Column1    Column2    Column3    Column4    Column5    Column6    Column7
5    user1    access2    word    word    2    1
2    user2    access2    word    word    5    3
1 Like

Hi, thanks for your prompt reply :slight_smile:

Unfortunately, it seems that the result is not completely correct. The correct result should have a unique user, so in Column2 user1 should appear only once based on the number of occurrences of Column 3. In case the number of occurrences is duplicated, then the smallest numbers of Column6 and Column7 should be taken into consideration.
I am sorry, but it is complicated and may be I didn't express my thought.

thanks

There's only one single user1 and one single user2 in above result!?

Hi,

The above result is correct, but when I tried with:

1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

I got:

1    user1    access1    word    word    3    2
4    user1    access2    word    word    2    1

What is going wrong?

Best regards,

That's because you dropped the header line.

---------- Post updated at 14:21 ---------- Previous update was at 14:20 ----------

Output with header line in file:

Column1    Column2    Column3    Column4    Column5    Column6    Column7
3    user1    access1    word    word    3    1
2    user2    access2    word    word    5    3
1 Like

Hi,

Once again thanks for your help. It seems that it works with:

Column1    Column2    Column3    Column4    Column5    Column6    Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1
5       user1    access2    word    word    2    1

but not with

Column1    Column2    Column3    Column4    Column5    Column6    Column7
1    user1    access1    word    word    3    2
2    user2    access2    word    word    5    3
3    user1    access1    word    word    3    1
4    user1    access2    word    word    2    1

Moreover, is it possible not to include the header line in the results?

Thanks

Well, I've noticed that as well. Try

awk '
NR==1           {# print
                 next
                }

                {LINE[$2,$3]=$0
                 FREQ[$2,$3]++
                 SUM[$2,$3]=$6+$7
                 if (FREQ[$2,$3] == MAX[$2])    if (SUM[$2,$3]  < MIN[$2]) MIN[$2] = SUM[$2,$3]
                 if (FREQ[$2,$3] >  MAX[$2])    {MAX[$2] = FREQ[$2,$3]
                                                 MIN[$2] = SUM[$2,$3]
                                                }
                }
END             {for (f in FREQ)        {split (f, TMP, SUBSEP)
                                         if     (FREQ[f] == MAX[TMP[1]] &&
                                                 SUM[f] == MIN[TMP[1]])
                                            print LINE[f]
                                        }
                }
' FS="\t" SUBSEP="\t" file
1 Like

Hi again,

Works perfectly now :slight_smile:

Thanks for your time!

Best,
Kon