How to ignore relatively few occurrences of a field value?

abercrom · December 17, 2013, 12:05pm

Hi experts,

I have a very long file that looks roughly like this:

aaad_1577 64000
aaad_1577 72000
aaad_1577 72000
aaad_1577 65000
aaad_1577 65000
(...aaad about a thousand times...)
bbbd_2002 56000
bbbd_2002 57000
bbbd_3045 57000
cccd_3452 150000
dddd_6014 150000
dddd_6014 150000
dddd_6014 150000
(...dddd about a thousand times...)

I want to ignore the rows whose first-column value occurs fewer than a handful of times, say 5 times.

It would be helpful if I could see how many occurrences I'm getting before I ignore them so I can go from this:

aaad_1577 64000 1005
aaad_1577 72000 1005
aaad_1577 72000 1005
aaad_1577 65000 1005
aaad_1577 65000 1005
(...aaad about a thousand times...)
bbbd_2002 56000 2
bbbd_2002 57000 2
bbbd_3045 57000 1
cccd_3452 150000 1
dddd_6014 150000 1003
dddd_6014 175000 1003
dddd_6014 150000 1003
(...dddd about a thousand times...)

to using this:

awk '{ if ($3>3) print $0}' [file]

and get this:

aaad_1577 64000 1005
aaad_1577 72000 1005
aaad_1577 72000 1005
aaad_1577 65000 1005
aaad_1577 65000 1005
(...aaad about a thousand times...)
dddd_6014 150000 1003
dddd_6014 175000 1003
dddd_6014 150000 1003
(...dddd about a thousand times...)

Thank you!

Akshay_Hegde · December 17, 2013, 12:14pm

Try:

$ cat file
aaad_1577 64000
aaad_1577 72000
aaad_1577 72000
aaad_1577 65000
aaad_1577 65000
bbbd_2002 56000
bbbd_2002 57000
bbbd_3045 57000
cccd_3452 150000
dddd_6014 150000
dddd_6014 150000
dddd_6014 150000

$ awk 'FNR==NR{A[$1]++;next}{print $0,A[$1]}' file file

aaad_1577 64000 5
aaad_1577 72000 5
aaad_1577 72000 5
aaad_1577 65000 5
aaad_1577 65000 5
bbbd_2002 56000 2
bbbd_2002 57000 2
bbbd_3045 57000 1
cccd_3452 150000 1
dddd_6014 150000 3
dddd_6014 150000 3
dddd_6014 150000 3
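
If you would rather count and filter in one command instead of piping into a second awk, something like the sketch below should work. It uses the same two-pass idea: the file is named twice, the first pass (FNR==NR) only counts column 1, and the second pass prints a row only when its count reaches the threshold. The `min` variable, the `count` array name, and the temporary sample file are my own choices for illustration, and the threshold of 5 is just the figure mentioned in the question.

```shell
# Sample data copied from the post above, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
aaad_1577 64000
aaad_1577 72000
aaad_1577 72000
aaad_1577 65000
aaad_1577 65000
bbbd_2002 56000
bbbd_2002 57000
bbbd_3045 57000
cccd_3452 150000
dddd_6014 150000
dddd_6014 150000
dddd_6014 150000
EOF

min=5   # assumed threshold: keep keys that occur at least this many times

# Pass 1 (FNR==NR is true only while reading the first file argument):
#   count how often each column-1 value occurs.
# Pass 2: print the row followed by its count, but only if count >= min.
result=$(awk -v min="$min" '
    FNR == NR { count[$1]++; next }
    count[$1] >= min { print $0, count[$1] }
' "$sample" "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

With this small sample only the five aaad_1577 rows survive, because dddd_6014 appears just three times here. Since the file is read twice, this form needs a real file name and will not work on piped input.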

abercrom · December 17, 2013, 12:45pm

That'll do it!
Thanks, Akshay!