Filtering data using uniq and sed

Hello,

Does anyone know an easy way to filter this type of file? I want to keep every line with a score (column 2) of 100.00 and get rid of duplicates (for example gi|332198263|gb|EGK18963.1| below), so I guess uniq can be used for this?

gi|3379182634|gb|EGK18561.1|	100.00
gi|332198263|gb|EGK18963.1|	100.00
gi|332936633|ref|EGK16471.1|	97.00
gi|3329991602|ref|EGK11733.1|	100.00
gi|332198263|gb|EGK18963.1|	100.00
gi|332302583|gb|EGK13714.1|	98.00

I want to keep these:

gi|3379182634|gb|EGK18561.1|	100.00
gi|332198263|gb|EGK18963.1|	100.00
gi|3329991602|ref|EGK11733.1|	100.00

In the end, I want the output to look like this. Is it possible to use sed for this?
OUTPUT:

EGK18561.1
EGK18963.1
EGK11733.1

Can anyone please help? Thanks so much in advance!

You could try this awk solution, it's easy. Others might have a sed solution.

$ awk -F'[|]' '$5==100 && !A[$4]++ {print $4}' file

OR (note that the output is sorted rather than in input order)

$ grep -E '[[:space:]]100\.00$' file | cut -d '|' -f4 | sort -u

Resulting

EGK18561.1
EGK18963.1
EGK11733.1
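
For anyone reading this later, here is a rough, commented sketch of how I understand the awk one-liner. The variable names rows, result and seen are my own, and the sample rows are copied from the question. I add 0 to the score field to force a numeric comparison, which is an extra precaution rather than something the original command needs.

```shell
# Sample rows from the question: ID, a tab, then the score.
rows=$(printf '%s\t%s\n' \
  'gi|3379182634|gb|EGK18561.1|' '100.00' \
  'gi|332198263|gb|EGK18963.1|' '100.00' \
  'gi|332936633|ref|EGK16471.1|' '97.00' \
  'gi|3329991602|ref|EGK11733.1|' '100.00' \
  'gi|332198263|gb|EGK18963.1|' '100.00' \
  'gi|332302583|gb|EGK13714.1|' '98.00')

# Split on "|": field 4 is the accession, field 5 is the tab plus the score.
result=$(printf '%s\n' "$rows" | awk -F'[|]' '
  $5 + 0 == 100 {        # keep only rows whose score is numerically 100
    if (!seen[$4]++)     # true only the first time this accession shows up
      print $4
  }')

printf '%s\n' "$result"
```

Because the array is checked before it is incremented, the first occurrence of each accession is printed and later ones are skipped, so the input order is preserved without sorting.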

Hello Akshay,
Thanks for your response. It didn't seem to work and gave me this error message: "A[: Event not found."

It works for me. Which OS are you on, and how did you run it?

That is because of history expansion in bash, which interprets the exclamation mark, even within double quotes. You need to use single quotes like Akshay showed, not double quotes.
You can turn off history expansion altogether with

set +H
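
If you would rather not touch shell settings, another option is to write the test without an exclamation mark at all, so there is nothing for history expansion to pick up. This is only a sketch: the two accessions AAA.1 and BBB.1 are made-up test rows, and seen and out are names I chose.

```shell
# Same filter as above, written without "!": a count of 0 means "not seen yet".
out=$(printf 'gi|1|gb|AAA.1|\t100.00\ngi|2|gb|BBB.1|\t97.00\ngi|1|gb|AAA.1|\t100.00\n' |
  awk -F'[|]' '$5 + 0 == 100 && seen[$4]++ == 0 { print $4 }')

printf '%s\n' "$out"
```

The comparison seen[$4]++ == 0 reads the old count and then increments it, so it behaves like !seen[$4]++ for this purpose.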

--
What constitutes a duplicate: $4 (the 4th field) or the entire line? Does $2 have a 1:1 correlation with $4? If it is the entire line, for example, you could use !A[$0]++ instead of !A[$4]++ in Akshay's suggestion.
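
To illustrate why the choice of key matters, here is a small sketch with made-up rows (the accession X.1 and the gi numbers are hypothetical): the same accession appears under two different gi numbers, and one row is repeated verbatim.

```shell
# Hypothetical rows: lines 1 and 3 are identical, line 2 shares only the accession.
data=$(printf 'gi|1|gb|X.1|\t100.00\ngi|2|ref|X.1|\t100.00\ngi|1|gb|X.1|\t100.00\n')

# Key on the whole line: only exact repeats are dropped.
by_line=$(printf '%s\n' "$data" | awk '!seen[$0]++' | wc -l)

# Key on field 4: every row with an already seen accession is dropped.
by_id=$(printf '%s\n' "$data" | awk -F'[|]' '!seen[$4]++' | wc -l)

echo "unique by line: $by_line, unique by accession: $by_id"
```

Keying on $0 keeps two rows here, while keying on $4 keeps one, so the answer depends on which of the two you consider a duplicate.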


All clear now, thank you so much!

awk '$2==100' a | sort | uniq
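
Since the original question asked about sed: sed can do the filtering and the extraction in one substitution, but it has no convenient way to drop non-adjacent duplicates, so this sketch hands that part to awk. It assumes the score is always written exactly as 100.00 at the end of the line; the variable names input and ids are mine.

```shell
# Sample rows from the question: ID, a tab, then the score.
input=$(printf '%s\t%s\n' \
  'gi|3379182634|gb|EGK18561.1|' '100.00' \
  'gi|332198263|gb|EGK18963.1|' '100.00' \
  'gi|332936633|ref|EGK16471.1|' '97.00' \
  'gi|3329991602|ref|EGK11733.1|' '100.00' \
  'gi|332198263|gb|EGK18963.1|' '100.00' \
  'gi|332302583|gb|EGK13714.1|' '98.00')

# -n with the p flag prints only lines where the substitution matched,
# replacing the whole line with the last "|"-delimited field before the score.
ids=$(printf '%s\n' "$input" |
  sed -n 's/^.*|\([^|]*\)|[[:space:]]*100\.00$/\1/p' |
  awk '!seen[$0]++')      # drop repeats while keeping first-seen order

printf '%s\n' "$ids"
```

If the order does not matter, sort -u in place of the final awk step would also work.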