Parsing a list

narachaid · October 9, 2013, 5:49pm

Hello,

I have a very long list of lines in a file (see the input below). I only need the first "chunk" of each line, up to the first space; the rest should be dropped. The leading > sign also needs to be removed. Can anyone help me, please?

Thank you so much!

INPUT:

>gi|24976465|gb|AL935113.1|AL935113 AL935113 Homo sapiens library
>gi|24978364|gb|AL93981336.1|AL93981336 AL93981336 Homo sapiens library
>gi|24973415|gb|AL931542.1|AL931542 AL931542 Homo sapiens library
>gi|24939375|gb|AL93376241.1|AL93376241 AL93376241 Homo sapiens library
>gi|24937965|gb|AL9343716.1|AL9343716 AL9343716 Homo sapiens library

OUTPUT:

gi|24976465|gb|AL935113.1|AL935113
gi|24978364|gb|AL93981336.1|AL93981336
gi|24973415|gb|AL931542.1|AL931542
gi|24939375|gb|AL93376241.1|AL93376241
gi|24937965|gb|AL9343716.1|AL9343716

Scott · October 9, 2013, 5:54pm

If you use > and a space as field separators, the part you want is the second field:

$ awk -F"[> ]" '{print $2}' file
gi|24976465|gb|AL935113.1|AL935113
gi|24978364|gb|AL93981336.1|AL93981336
gi|24973415|gb|AL931542.1|AL931542
gi|24939375|gb|AL93376241.1|AL93376241
gi|24937965|gb|AL9343716.1|AL9343716
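
To see why the ID lands in $2 rather than $1, it can help to print the first few fields separately. The snippet below is only a sketch: the helper names first_chunk and show_fields are made up for illustration and are not part of the thread, and the sample line is copied from the input above. Because every line starts with >, the text before that first separator is empty, so $1 is an empty string.

```shell
# Hypothetical helper: same awk call as above, wrapped in a function.
first_chunk() {
  awk -F'[> ]' '{ print $2 }' "$@"
}

# Hypothetical helper: print the first three fields in brackets
# so the empty field before ">" is visible.
show_fields() {
  awk -F'[> ]' '{ printf "1=[%s] 2=[%s] 3=[%s]\n", $1, $2, $3 }' "$@"
}

line='>gi|24976465|gb|AL935113.1|AL935113 AL935113 Homo sapiens library'

printf '%s\n' "$line" | show_fields
printf '%s\n' "$line" | first_chunk
```

Note that this relies on each line starting with >. A line without the leading > would have its first word in $1, and $2 would be the second word instead.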

blackrageous · October 9, 2013, 5:54pm

If the input is in a file named y.y, then:

sed -e 's/^>//' y.y | awk '{print $1}'

Yoda · October 9, 2013, 6:03pm

sed 's#^>##;s# .*##' file

disedorgue · October 9, 2013, 6:30pm

Hi,
Another sed solution:

sed 's/^>\| .*//g' file

Or a version that may be faster:

sed '/^>\| .*/s///g' file

Regards.
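
One caveat on the two sed commands above: \| as alternation in a basic regular expression is a GNU extension, so it may not be recognised by every sed. A more portable sketch, assuming nothing beyond basic sed, uses two separate substitutions instead. The function name strip_header is made up for illustration.

```shell
# Hypothetical helper name; not from the thread.
strip_header() {
  # First expression: drop a leading ">".
  # Second expression: drop everything from the first space onward.
  sed -e 's/^>//' -e 's/ .*//' "$@"
}

printf '%s\n' \
  '>gi|24976465|gb|AL935113.1|AL935113 AL935113 Homo sapiens library' \
  '>gi|24973415|gb|AL931542.1|AL931542 AL931542 Homo sapiens library' |
  strip_header
```

Unlike the awk field-separator approach, this also handles a line that has no leading >, since the first substitution simply does nothing on such a line.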