Filtering data using uniq and sed

narachaid · October 18, 2013, 1:41am

Hello,

Does anyone know an easy way to filter this type of file? I want to keep every line whose score (column 2) is 100.00 and get rid of duplicates (for example, gi|332198263|gb|EGK18963.1| below), so I guess uniq can be used for this?

gi|3379182634|gb|EGK18561.1|	100.00
gi|332198263|gb|EGK18963.1|	100.00
gi|332936633|ref|EGK16471.1|	97.00
gi|3329991602|ref|EGK11733.1|	100.00
gi|332198263|gb|EGK18963.1|	100.00
gi|332302583|gb|EGK13714.1|	98.00

I want to keep these:

gi|3379182634|gb|EGK18561.1|	100.00
gi|332198263|gb|EGK18963.1|	100.00
gi|3329991602|ref|EGK11733.1|	100.00

In the end, I want the output to look like this. Is it possible to use sed for this?
OUTPUT:

EGK18561.1
EGK18963.1
EGK11733.1

Can anyone please help? Thanks so much in advance!

Akshay_Hegde · October 18, 2013, 1:58am

You could try an awk solution; it's easy. Others might have a sed solution.

$ awk -F'[|]' '$5==100 && !A[$4]++ {print $4}' file

OR

$ grep -E '[[:space:]]100\.00$' file | cut -d '|' -f4 | sort | uniq

Resulting

EGK18561.1
EGK18963.1
EGK11733.1
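
For the sed route asked about in the original question, a minimal sketch might look like the following. It assumes the ID and the score are separated by a tab and that the score is the last thing on the line; the temporary file and the variable names are made up for illustration. Note that sort | uniq returns the IDs in sorted order, not in their original order.

```shell
# Hypothetical sample file reproducing the six rows from the question
# (ID and score separated by a tab).
sample=$(mktemp)
printf '%s\t%s\n' \
  'gi|3379182634|gb|EGK18561.1|' 100.00 \
  'gi|332198263|gb|EGK18963.1|' 100.00 \
  'gi|332936633|ref|EGK16471.1|' 97.00 \
  'gi|3329991602|ref|EGK11733.1|' 100.00 \
  'gi|332198263|gb|EGK18963.1|' 100.00 \
  'gi|332302583|gb|EGK13714.1|' 98.00 > "$sample"

# -n suppresses the default output; the p flag prints only the lines on
# which the substitution succeeded, i.e. lines ending in exactly 100.00.
# \1 is the text between the last two pipes (the accession).
sed_ids=$(sed -n 's/^.*|\([^|]*\)|[[:space:]]*100\.00$/\1/p' "$sample" | sort | uniq)
printf '%s\n' "$sed_ids"

rm -f "$sample"
```

uniq only collapses adjacent duplicate lines, which is why the sort comes first.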

narachaid · October 18, 2013, 2:03am

Hello Akshay,
Thanks for your response. It didn't seem to work and gave me this error message: "A[: Event not found."

Akshay_Hegde · October 18, 2013, 2:06am

It works for me. Which OS are you on, and how did you run it?

Scrutinizer · October 18, 2013, 2:08am

That is because of history expansion in bash, which interprets the exclamation mark, even within double quotes. You need to use single quotes like Akshay showed, not double quotes.
You can turn off history expansion altogether with

set +H

--
What constitutes a duplicate: $4 (the 4th field), or the entire line? Does $2 have a 1:1 correlation with $4? If it is the entire line, for example, you could use !A[$0]++ instead of !A[$4]++ in Akshay's suggestion.
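
To illustrate the difference between the two keys, here is a small sketch. The three rows are invented for the example (they are not from the original file): the same accession appears under two different gi numbers, and one row is repeated verbatim.

```shell
# Hypothetical rows: one accession, two distinct lines, one exact repeat.
rows=$(mktemp)
printf '%s\t%s\n' \
  'gi|1|gb|EGK00001.1|' 100.00 \
  'gi|2|ref|EGK00001.1|' 100.00 \
  'gi|1|gb|EGK00001.1|' 100.00 > "$rows"

# Key on the 4th field: the accession is printed once.
by_field=$(awk -F'[|]' '$5==100 && !A[$4]++ {print $4}' "$rows")

# Key on the whole line: only the verbatim repeat is dropped, so the
# accession is printed twice (once per distinct line).
by_line=$(awk -F'[|]' '$5==100 && !A[$0]++ {print $4}' "$rows")

printf 'by field:\n%s\n' "$by_field"
printf 'by line:\n%s\n' "$by_line"

rm -f "$rows"
```

Which key is right depends on whether two rows with the same accession but different gi numbers should count as one hit.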

narachaid · October 18, 2013, 3:21am

All clear now, thank you so much!

summer_cherry · October 21, 2013, 2:17am

awk '$2==100' a | sort | uniq
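
The command above prints whole lines. A possible variation, assuming default whitespace splitting (so $1 is the pipe-delimited ID and $2 is the score), extracts the accession and drops duplicates in a single awk pass while keeping the original order. The file and variable names here are hypothetical.

```shell
# Hypothetical sample file reproducing the six rows from the question.
input=$(mktemp)
printf '%s\t%s\n' \
  'gi|3379182634|gb|EGK18561.1|' 100.00 \
  'gi|332198263|gb|EGK18963.1|' 100.00 \
  'gi|332936633|ref|EGK16471.1|' 97.00 \
  'gi|3329991602|ref|EGK11733.1|' 100.00 \
  'gi|332198263|gb|EGK18963.1|' 100.00 \
  'gi|332302583|gb|EGK13714.1|' 98.00 > "$input"

# $2==100 keeps the perfect scores; !seen[$1]++ is true only the first
# time a given ID is seen; split() breaks the ID on "|" so that p[4]
# holds the accession.
ordered_ids=$(awk '$2==100 && !seen[$1]++ {split($1, p, "|"); print p[4]}' "$input")
printf '%s\n' "$ordered_ids"

rm -f "$input"
```

Because no sort is involved, the IDs come out in the order they first appear in the file.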