Extract specific content from a file

patrick87 · October 7, 2009, 6:31am

My input file:

>sequence_1
ASSSSSSSSSSSDDDDDDDDDDDCCCCCCC
ASDSFDFFDFDFFWERERERERFSDFESFSFD
>sequence_2
ASDFDFDFFDDFFDFDSFDSFDFSDFSDFDSFASDSADSADASD
ASDFFDFDFASFASFASFAFSFFSDASFASFASFAFS
>sequence_3
VEDFGSDGSDGSDGSDGSDGSDGSDG
dDFSDFSDFSDFSDFSDFSDFSDFSDF
SDGFDGSFDGSGSDGSDGSDGSDGSDG

My desired output file:

>sequence_2
ASDFDFDFFDDFFDFDSFDSFDFSDFSDFDSFASDSADSADASD
ASDFFDFDFASFASFASFAFSFFSDASFASFASFAFS

I only want to extract the sequence_2 header and its content.
Does anybody have an idea how to do it?
Would awk be faster if the file has a long list of contents?
Thanks for all of your suggestions.

thegeek · October 7, 2009, 6:37am

sed -n -e '/>sequence_3/q' -e '/>sequence_2/,/>sequence_3/p' t1

Put your input & output in code tags for better visibility.

radoulov · October 7, 2009, 6:54am

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris:

awk 'END { if (r ~ p) print r }
/^>sequence/ { if (r ~ p) print r; r = x }
{ r = (r ? r RS : x) $0 }
' p="sequence_2" infile

---------- Post updated at 12:47 PM ---------- Previous update was at 12:41 PM ----------

Yes,
thegeek's sed approach should be faster.
Assuming progressive sequence numbers with fixed format,
you could add parameters:

start="sequence_2"
stop="$(( ${start##*_} + 1 ))"

sed -n "
  /$stop/q
  /$start/,/$stop/p
  " infile

---------- Post updated at 12:51 PM ---------- Previous update was at 12:47 PM ----------

A similar approach with awk:

awk '$0 ~ stop { exit }
  $0 ~ start, $0 ~ stop {
    if ($0 !~ stop) print 
    }' start="sequence_2" \
stop="$(( ${start##*_} + 1 ))" infile

---------- Post updated at 12:54 PM ---------- Previous update was at 12:51 PM ----------

Notice that the sed and the second awk versions assume an input in numeric (by sequence number) order just like the example in the original post.
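
If the headers are not in numeric order, a flag-based awk can stop at whatever header comes next instead of at a computed "stop" name. This is only a sketch: the sample input is hypothetical, and the names `id`, `found`, `infile` and `out` are illustrative, not from the posts above.

```shell
# Hypothetical input whose headers are deliberately NOT in numeric order.
infile=$(mktemp)
cat > "$infile" <<'EOF'
>sequence_3
AAAA
>sequence_2
CCCC
DDDD
>sequence_1
EEEE
EOF

out=$(awk -v id="sequence_2" '
  /^>/ {
    if (found) exit                 # header after the wanted record: stop reading
    found = ($0 == (">" id))        # exact match on the whole header line
  }
  found                             # print every line of the wanted record
' "$infile")

printf '%s\n' "$out"
rm -f "$infile"
```

Like the sed version it still stops reading as soon as the wanted record is complete.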

danmero · October 7, 2009, 7:02am

awk '/_3$/{exit}/_2$/{f=1}f' file

jnetnix · October 7, 2009, 7:03am

If your grep supports -A (--after-context) you could try this:

grep -A 2 "sequence_2" infile

My Ubuntu distro has it but I know I had to grab it for my Solaris boxes.

radoulov · October 7, 2009, 7:16am

Well,
for a one-shot solution it could even be:

awk '/_3$/{exit}/_2$/,0' infile

Or:

awk '/_3$/{exit}/_2$/,_' infile

patrick87 · October 7, 2009, 8:49pm

Hi thegeek,
Thanks for your suggestion. It worked nicely.
Could you roughly explain how the code works?

sed -n -e '/>sequence_3/q' -e '/>sequence_2/,/>sequence_3/p' t1

For example, if I have a long list of contents and only want to extract specific contents based on the header I am interested in, can I use the sed code that you recommended as well?

thegeek · October 9, 2009, 2:19pm

The idea is very simple:

Print from sequence_2 to sequence_3, and quit as soon as the pattern sequence_3 is found.

So I would recommend it: once sequence_3 is reached, sed quits and the rest of your file is never read, so it is efficient too.

> In sed terminology this is PATTERN addressing.
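
A minimal annotated sketch of that command, run against the sample from the first post. The `^` and `$` anchors are an addition (so that a header such as `>sequence_20` would not match), and the temporary file and the variable `out` are illustrative.

```shell
# Recreate the sample input from the first post in a temporary file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
>sequence_1
ASSSSSSSSSSSDDDDDDDDDDDCCCCCCC
ASDSFDFFDFDFFWERERERERFSDFESFSFD
>sequence_2
ASDFDFDFFDDFFDFDSFDSFDFSDFSDFDSFASDSADSADASD
ASDFFDFDFASFASFASFAFSFFSDASFASFASFAFS
>sequence_3
VEDFGSDGSDGSDGSDGSDGSDGSDG
dDFSDFSDFSDFSDFSDFSDFSDFSDF
SDGFDGSFDGSGSDGSDGSDGSDGSDG
EOF

# -n                       : print nothing unless a "p" command asks for it
# /^>sequence_3$/q         : quit at the next header, before it can be printed
# /^>sequence_2$/,/.../p   : print every line inside the address range
out=$(sed -n -e '/^>sequence_3$/q' -e '/^>sequence_2$/,/^>sequence_3$/p' "$infile")

printf '%s\n' "$out"
rm -f "$infile"
```

The quit expression comes first so that the closing header of the range is never printed.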

patrick87 · October 9, 2009, 8:29pm

Thanks a lot, thegeek.
I understand it now, hehe...
Do you have any idea how to solve this thread:
http://www.unix.com/shell-programming-scripting/120999-execution-problem-repeat-same-program-two-set-same-data.html#post302360533
It seems more difficult and complicated.
Thanks a lot for your advice.

steadyonabix · October 10, 2009, 2:54am

Hi Radoulov

Once again I am baffled by the brevity of your code!

You explained this one to me a few days ago in another post:

awk '/_3$/{exit}/_2$/{f=1}f' file

I just don't get these two at all though, why do they work?

awk '/_3$/{exit}/_2$/,0' infile

Or:

awk '/_3$/{exit}/_2$/,_' infile

What is the ,0 and ,_ about?

danmero · October 10, 2009, 4:57am

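0 is false, and the variable _ is never set, so it is empty and therefore also false.
Literally, awk will print from the first pattern to the end of the input (the end pattern never matches), but it exits as soon as the /_3$/ pattern matches.

That's why I like radoulov's solutions: you have to ask yourself why.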
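
A small sketch of that point, using a hypothetical four-line input: a range whose end pattern is the constant 0 or an unset variable never closes, so both forms print from the first match to the end of the input. The variable names `with_zero` and `with_unset` are illustrative.

```shell
# Range starts at line 2; the end pattern is always false, so it never closes.
with_zero=$(printf '%s\n' a b c d | awk 'NR==2,0')

# Same range, but the end pattern is the unset (hence false) variable "_".
with_unset=$(printf '%s\n' a b c d | awk 'NR==2,_')

printf '%s\n' "$with_zero"
printf '%s\n' "$with_unset"
```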

Scrutinizer · October 10, 2009, 7:09am

Another road to Rome:

mawk 'BEGIN {RS="\n>"; printf">"} /_2/' infile

The following is more generic: it also works if the actual label is not "sequence_2" and the OP simply means the second record, with a ">" at the beginning of a line marking the start of a new record's label:

mawk 'BEGIN {RS="\n>"; printf">"} NR==2' infile

This also works with gawk. As danmero pointed out, it does not work with standard awk, nawk or POSIX awk: those versions only accept a single character for RS.
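
For awks that only accept a single-character RS, a similar record-based sketch can split on ">" alone and put the ">" back when printing. It assumes ">" never occurs inside the sequence data; the temporary file and the variable `out` are illustrative.

```shell
# Sample input from the first post.
infile=$(mktemp)
cat > "$infile" <<'EOF'
>sequence_1
ASSSSSSSSSSSDDDDDDDDDDDCCCCCCC
ASDSFDFFDFDFFWERERERERFSDFESFSFD
>sequence_2
ASDFDFDFFDDFFDFDSFDSFDFSDFSDFDSFASDSADSADASD
ASDFFDFDFASFASFASFAFSFFSDASFASFASFAFS
>sequence_3
VEDFGSDGSDGSDGSDGSDGSDGSDG
dDFSDFSDFSDFSDFSDFSDFSDFSDF
SDGFDGSFDGSGSDGSDGSDGSDGSDG
EOF

# RS=">" makes each record "label NEWLINE sequence lines";
# $1 is the label, and printf restores the ">" that RS consumed.
out=$(awk 'BEGIN { RS = ">" } $1 ~ /_2$/ { printf ">%s", $0 }' "$infile")

printf '%s\n' "$out"
rm -f "$infile"
```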

steadyonabix · October 11, 2009, 2:40pm

I see :rolleyes: Why the single , before the 0 though, what does that mean in this context?

awk '/_3$/{exit}/_2$/,0' infile

summer_cherry · October 11, 2009, 11:21pm

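use strict; use warnings;
local $/ = ">";                        # read one ">"-delimited record at a time
open my $fh, '<', 'a.txt' or die "Cannot open a.txt: $!";
while (<$fh>) {
  chomp;                               # drop the trailing ">" separator
  print ">$_" if /^sequence_2\n/;      # restore the leading ">" on output
}
close $fh;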

radoulov · October 12, 2009, 2:55am

It's the awk range pattern, as described in Effective AWK Programming.

In the above code it means:

from the record that matches the _2$ pattern to the end of the input (0 -> false -> never -> eof).

And of course, we exit prematurely because of the previous action.
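
Spelled out with one rule per line, the same program might look like this. It is a sketch run on a shortened, hypothetical input; the variable `out` is illustrative.

```shell
out=$(printf '%s\n' '>sequence_1' 'AAAA' '>sequence_2' 'CCCC' 'DDDD' '>sequence_3' 'EEEE' |
awk '
  /_3$/ { exit }   # rule 1: stop reading at the first line that ends in _3
  /_2$/, 0         # rule 2: range from a line ending in _2 to "never";
                   #         with no action given, matching lines are printed
')

printf '%s\n' "$out"
```

Rule 1 runs before rule 2 on every line, so the `>sequence_3` header ends the program before the open-ended range can print it.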

Just a few words about the beauty of the programming code ...
We often try to play golf[1] here and we're doing it for fun.
In my opinion, a piece of code or a program is beautiful when:

it's self documenting (!)
concise and simple (simple as possible)
it takes advantage of the full functionality/potential of the given programming language

That said, at least as far as my posts are concerned, you should take those obfuscated and golfed samples for what they are.
Try to understand them, use them on the command line, but don't use them in scripts and/or production code.
Think about the next maintainer of that code.

[1] Perl - Wikipedia, section "Perl golf"

steadyonabix · October 13, 2009, 5:42am

Thanks, that's good advice about readability.

What prompted me to join this forum is the need to learn to write code that runs as quickly as possible. At work I am now writing tools in ksh and nawk that run against gigabytes of data. I have been learning to optimise code recently and am astonished by the improvement in speed that can be achieved, particularly by avoiding the creation of extra processes in a loop.

One script I optimised recently went from 5+ hours to 20 minutes of run time simply by minimising the processes being kicked off in two loops! Hence my interest in writing "lean" code.

Cheers
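
As a rough illustration of the kind of change described (not the actual script, which is not shown in the thread), the first loop below starts an extra process for every input line, while the second form handles all lines in a single awk process. The file of labels and the names `list`, `slow` and `fast` are hypothetical.

```shell
# Hypothetical list of labels, one per line.
list=$(mktemp)
printf '%s\n' sequence_1 sequence_2 sequence_3 > "$list"

# Slow pattern: a new cut process is started for every single line.
slow=$(while IFS= read -r line; do
  printf '%s\n' "$line" | cut -d_ -f2
done < "$list")

# Lean pattern: one awk process reads the whole file.
fast=$(awk -F_ '{ print $2 }' "$list")

printf '%s\n' "$fast"
rm -f "$list"
```

Both forms give the same result; the difference only becomes visible in run time once the input grows large.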

patrick87 · October 14, 2009, 7:00am

Hi danmero,

My input file:
>sequence_1
ASSSSSSSSSSSDDDDDDDDDDDCCCCCCC
ASDSFDFFDFDFFWERERERERFSDFESFSFD
>sequence_2
ASDFDFDFFDDFFDFDSFDSFDFSDFSDFDSFASDSADSADASD
ASDFFDFDFASFASFASFAFSFFSDASFASFASFAFS
>sequence_3
VEDFGSDGSDGSDGSDGSDGSDGSDG
dDFSDFSDFSDFSDFSDFSDFSDFSDF
SDGFDGSFDGSGSDGSDGSDGSDGSDG
>ABC_6
SAASASASASASASTSDGSDGSDGSDG
dDFSDFSDFSDFSDFSDFSDFSDFSDF
>SDF_7
TASDASDAFSDFSDFSDFSDFSDFSDF
SDGFDGSFDGSGSDGSDGSDGSDGSDG

My desired output file:
>sequence_2
ASDFDFDFFDDFFDFDSFDSFDFSDFSDFDSFASDSADSADASD
ASDFFDFDFASFASFASFAFSFFSDASFASFASFAFS
>ABC_6
SAASASASASASASTSDGSDGSDGSDG
dDFSDFSDFSDFSDFSDFSDFSDFSDF
>SDF_7
TASDASDAFSDFSDFSDFSDFSDFSDF
SDGFDGSFDGSGSDGSDGSDGSDGSDG

If I have a long file, how can I use your script to extract only the contents of sequence_2, ABC_6 and SDF_7?
Do you have any idea how I can extract only specific records from a long file?
When I tried it, the awk script that you suggested could only extract sequence_2.
Thanks again :)

danmero · October 14, 2009, 10:27am

awk '$1 ~ /sequence_2|ABC_6|SDF_7/{$1=">"$1;print}' RS=">" ORS="" FS="\n" OFS="\n" file

---------- Post updated at 10:27 AM ---------- Previous update was at 09:24 AM ----------

To keep the forums high quality for all users, please take the time to format your posts correctly.

Use Code Tags when you post any code or data samples so others can easily read your code.
You can easily do this by highlighting your code and then clicking on the # in the editing menu. (You can also type code tags by hand.)
Avoid adding color or different fonts and font size to your posts.
Selective use of color to highlight a single word or phrase can be useful at times, but using color, in general, makes the forums harder to read, especially bright colors like red.
Be careful when you cut-and-paste, edit any odd characters and make sure all links are working properly.

Thank You.

The UNIX and Linux Forums

patrick87 · November 4, 2009, 10:46pm

Hi, danmero.
I just found a problem when using the code that you suggested:

awk '$1 ~ /sequence_2|ABC_6|SDF_7/{$1=">"$1;print}' RS=">" ORS="" FS="\n" OFS="\n" file

If my file also has headers like ABC_61, ABC_605 or SDF_750, the code extracts all of them as well.
Do you have a better idea for extracting only sequence_2, ABC_6 and SDF_7? Many thanks for your suggestion ^^

danmero · November 5, 2009, 8:11am

Let's try a workaround:

awk '/^>/{f=0} /^>(sequence_2|ABC_6|SDF_7)$/{f=1} f' file
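
For a longer list of wanted headers, one possible sketch keeps the names in a separate file and looks each header up exactly, so ABC_61 or SDF_750 are not picked up. The file names, the shortened sample data and the variable names (`ids`, `infile`, `want`, `keep`, `out`) are all hypothetical, and the list file is assumed to be non-empty.

```shell
# Hypothetical list of wanted headers, one per line, without the ">".
ids=$(mktemp)
printf '%s\n' sequence_2 ABC_6 SDF_7 > "$ids"

# Hypothetical input with look-alike headers mixed in.
infile=$(mktemp)
cat > "$infile" <<'EOF'
>sequence_1
AAAA
>sequence_2
CCCC
DDDD
>ABC_61
EEEE
>ABC_6
FFFF
>SDF_750
GGGG
>SDF_7
HHHH
EOF

out=$(awk '
  NR == FNR { want[">" $1]; next }    # first file: remember each wanted header
  /^>/      { keep = ($0 in want) }   # header line: exact whole-line lookup
  keep                                # print while inside a wanted record
' "$ids" "$infile")

printf '%s\n' "$out"
rm -f "$ids" "$infile"
```

Because the lookup compares the whole header line, no regular expression has to be edited when the list of names changes.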