I have a huge tab-delimited file with the following format, and I want to remove the duplicates according to their frequency, based on Column2 and Column3.
Column1 Column2 Column3 Column4 Column5 Column6 Column7
1 user1 access1 word word 3 2
2 user2 access2 word word 5 3
3 user1 access1 word word 3 1
4 user1 access2 word word 2 1
In this case, the result should be:
1 user1 access1 word word 3 2
2 user2 access2 word word 5 3
because user1 with access1 occurs twice. Moreover, if the original list also contains the following entry:
5 user1 access2 word word 2 1
The result should be:
2 user2 access2 word word 5 3
5 user1 access2 word word 2 1
because user1 with access1 and user1 with access2 both occur twice, so the smaller numbers in Column6 and Column7 should be taken into consideration.
Thanks in advance for your time and consideration.
Unfortunately, it seems that the result is not completely correct. Each user should be unique in the result, so user1 should appear only once in Column2, with its row chosen by the number of occurrences of its Column3 value. If the number of occurrences is tied, then the smallest numbers in Column6 and Column7 should be taken into consideration.
I am sorry, but it is complicated, and maybe I didn't express my thoughts clearly.
Once again thanks for your help. It seems that it works with:
Column1 Column2 Column3 Column4 Column5 Column6 Column7
1 user1 access1 word word 3 2
2 user2 access2 word word 5 3
3 user1 access1 word word 3 1
4 user1 access2 word word 2 1
5 user1 access2 word word 2 1
but not with
Column1 Column2 Column3 Column4 Column5 Column6 Column7
1 user1 access1 word word 3 2
2 user2 access2 word word 5 3
3 user1 access1 word word 3 1
4 user1 access2 word word 2 1
Moreover, is it possible not to include the header line in the results?
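To make the rule concrete, here is a minimal Python sketch of how I read it. It is not a definitive implementation. The function name `dedupe` and its `has_header` parameter are made up for illustration. It assumes Column6 and Column7 are integers, that a tie in occurrences is broken by the smallest (Column6, Column7) pair seen for each Column2/Column3 combination, and that the first row of the winning combination is the one to print. The examples above do not pin down that last point, so it is only an assumption. The header line is skipped and never written to the result.

```python
import csv


def dedupe(lines, has_header=True):
    """Keep one row per Column2 value (hypothetical helper, see assumptions above)."""
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # drop the header so it never reaches the output

    # (Column2, Column3) -> [count, smallest (Column6, Column7), position, first row]
    groups = {}
    for position, row in enumerate(reader):
        if not row:
            continue  # ignore blank lines
        key = (row[1], row[2])
        weights = (int(row[5]), int(row[6]))
        if key not in groups:
            groups[key] = [1, weights, position, row]
        else:
            entry = groups[key]
            entry[0] += 1
            entry[1] = min(entry[1], weights)

    best = {}
    for (user, _access), (count, weights, position, row) in groups.items():
        # Higher count wins; on a tie the smaller (Column6, Column7) wins;
        # if that is tied too, the combination seen first wins.
        rank = (-count, weights, position)
        if user not in best or rank < best[user][0]:
            best[user] = (rank, position, row)

    # Emit the kept rows in their original file order.
    kept = sorted(best.values(), key=lambda item: item[1])
    return [row for _rank, _position, row in kept]


if __name__ == "__main__":
    sample = [
        "Column1\tColumn2\tColumn3\tColumn4\tColumn5\tColumn6\tColumn7\n",
        "1\tuser1\taccess1\tword\tword\t3\t2\n",
        "2\tuser2\taccess2\tword\tword\t5\t3\n",
        "3\tuser1\taccess1\tword\tword\t3\t1\n",
        "4\tuser1\taccess2\tword\tword\t2\t1\n",
    ]
    for kept_row in dedupe(sample):
        print("\t".join(kept_row))
```

For a real file, an open file object can be passed in place of the list, e.g. `dedupe(open("input.tsv", newline=""))`, where `input.tsv` is a placeholder name. Only one entry per distinct Column2/Column3 combination is held in memory, which should help with a huge file. With the five-row sample, this sketch keeps row 4 rather than row 5 for user1; the two rows differ only in Column1.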