awk used to extract data between text

jaldo0805 · June 17, 2013, 10:50am

Hello all,
I have a file (filename.txt) with some data (in two columns X and Y) which looks like this:

##########
'Header1'
'Sub-header1'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

'Sub-header2'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

'Sub-header3'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

#######
'Header2' 
'Sub-header1'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

'Sub-header2'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

'Sub-header3'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

...and so on...

The three 'Sub-headers' under each header are the same every time. What I want is to extract the data that is between the 'Sub-headers'. What I am doing right now is applying the following command:

awk '/Sub-header1/ {getline;getline}{j++}j==1{flag=1;next} /Sub-header2/ {i++}i==1{flag=0} flag {print}' filename.txt > ofile.txt

I am using the {getline;getline} commands to skip the 'Sub-header1' and 'X Y' lines. It does skip those two lines, but it also prints 'Header1' (something I really don't get) along with the data I wanted.
The reason I want just the data is that I want to use it to make a plot with Python (but that's another story). I would also like to get rid of the blank line at the bottom of the data set I am extracting. I tried using the blank line (\/n) as the second pattern instead of 'Sub-header2', but it didn't work.
I've been told not to "abuse" the getline command, since it can give unexpected results unless you really understand what it does. I also found the option of using 'c&&!--c;/Sub-header1/ {c=3} etc... to skip to the third line after the pattern (Sub-header1), but this gives me something even more unexpected.
Hopefully someone followed me until this point :),
Thank you very much!
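
For reference, the c&&!--c idiom mentioned in the question is a countdown that prints only the single line N lines after the match, not everything from that line onwards, which may explain the unexpected result. A minimal sketch, assuming a made-up sample file with two data rows per block (the values, the temporary path and the variable names are illustrative only, and only the visible part of the quoted snippet is used):

```shell
# Made-up sample data in a temporary directory (illustrative values only).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
cat > "$sample" <<'EOF'
##########
'Header1'
'Sub-header1'
X Y
1.0 10.0
2.0 20.0

'Sub-header2'
X Y
3.0 30.0

#######
'Header2'
'Sub-header1'
X Y
4.0 40.0
5.0 50.0

'Sub-header2'
X Y
6.0 60.0
EOF

# c=3 arms the countdown on the match; c&&!--c is true only on the line
# where the countdown reaches zero, i.e. the 3rd line after the match.
nth_line_out=$(awk 'c&&!--c; /Sub-header1/ {c=3}' "$sample")

printf '%s\n' "$nth_line_out"
rm -rf "$tmpdir"
```

With this sample, only the second data row of each 'Sub-header1' block (2.0 20.0 and 5.0 50.0) is printed.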

MadeInGermany · June 17, 2013, 11:21am

What do you actually want to print?
The following prints all sections following /Sub-header1/; it stops printing when it meets an empty line, /^$/:

awk '/Sub-header1/ {getline;getline;flag=1} /^$/ {flag=0} flag {print}' filename.txt

Without getline:

awk '/Sub-header1/ {flag=1;c=3} /^$/ {flag=0} flag && !(c && --c) {print}' filename.txt
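
To see what the getline-free version does, here is a small sketch that runs it on a made-up sample file (the values, the temporary path and the variable names are assumptions for illustration). The countdown c suppresses printing for the sub-header line and the X Y line, and flag keeps printing on until the next empty line:

```shell
# Made-up sample data in a temporary directory (illustrative values only).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
cat > "$sample" <<'EOF'
##########
'Header1'
'Sub-header1'
X Y
1.0 10.0
2.0 20.0

'Sub-header2'
X Y
3.0 30.0

#######
'Header2'
'Sub-header1'
X Y
4.0 40.0
5.0 50.0

'Sub-header2'
X Y
6.0 60.0
EOF

# While c is still counting down (c && --c is true), nothing is printed;
# once c has reached 0, every line is printed until flag is cleared
# by an empty line.
countdown_out=$(awk '/Sub-header1/ {flag=1; c=3}
/^$/ {flag=0}
flag && !(c && --c) {print}' "$sample")

printf '%s\n' "$countdown_out"
rm -rf "$tmpdir"
```

On this sample the output is the four data rows under the two 'Sub-header1' sections, without headers or blank lines.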

jaldo0805 · June 17, 2013, 11:30am

Thanks for your reply. What I want to print is the data that follows the first 'Sub-header1', up to just before 'Sub-header2'; that's why I added the counters {j++}j==1 (and then I will modify it to print into a second file the data between the second set of 'Sub-header1' and 'Sub-header2', by changing j==1 to j==2). I tried using the line you gave me, and I see what it does: it prints all the sets of data between these patterns together. I will now try adding my counters to see if I get what I wanted.
Thanks,
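
A possible explanation for the 'Header1' line in the original output: in {j++}j==1{flag=1;next}, the {j++} block has no pattern, so it runs for every input line and j counts lines rather than 'Sub-header1' matches. j==1 is then true on the very first line of the file, which switches the flag on before 'Header1' is read. A small sketch on a made-up sample file (values, path and variable names are assumptions for illustration):

```shell
# Made-up sample data in a temporary directory (illustrative values only).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
cat > "$sample" <<'EOF'
##########
'Header1'
'Sub-header1'
X Y
1.0 10.0

'Sub-header2'
X Y
3.0 30.0

#######
'Header2'
'Sub-header1'
X Y
4.0 40.0
EOF

# A block without a pattern runs on every line: j ends up as the line count.
lines_counted=$(awk '{j++} END {print j}' "$sample")

# Tying the increment to the pattern counts only the matching lines.
matches_counted=$(awk '/Sub-header1/ {j++} END {print j}' "$sample")

# The command from the question: its first output line is the header.
first_printed=$(awk '/Sub-header1/ {getline;getline}{j++}j==1{flag=1;next} /Sub-header2/ {i++}i==1{flag=0} flag {print}' "$sample" | head -n 1)

echo "pattern-less {j++}:  $lines_counted"
echo "/Sub-header1/ {j++}: $matches_counted"
echo "first line printed:  $first_printed"
rm -rf "$tmpdir"
```

Writing the counter as part of the pattern, for example /Sub-header1/ && ++j==1, makes it count occurrences instead of lines.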

MadeInGermany · June 17, 2013, 1:23pm

Below is your example; my awk script will print the lines marked with <this

##########
'Header1'
'Sub-header1'
X                    Y
xxxx.xx       yyyy.yyy   <this
xxxx.xx       yyyy.yyy   <this
....                 ... <this

'Sub-header2'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

'Sub-header3'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

#######
'Header2' 
'Sub-header1'
X                    Y
xxxx.xx       yyyy.yyy   <this
xxxx.xx       yyyy.yyy   <this
....                 ... <this

'Sub-header2'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

'Sub-header3'
X                    Y
xxxx.xx       yyyy.yyy
xxxx.xx       yyyy.yyy
....                 ...

jaldo0805 · June 17, 2013, 1:28pm

Thanks for the explanation. Now, what can I do if I want to print only the first set of lines marked with <this, or only the second set of lines marked with <this?
Thanks again!

MadeInGermany · June 17, 2013, 1:36pm

This prints the 2nd occurrence:

awk '/Sub-header1/ && ++n==2 {flag=1; c=3} /^$/ {flag=0} flag && !(c && --c) {print}' filename.txt

You can also give the search pattern and the occurrence number as variable assignments on the command line:

awk '$0~search && ++n==num {flag=1; c=3} /^$/ {flag=0} flag && !(c && --c) {print}' search="Sub-header1" num=2 filename.txt
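
Since the goal was one output file per data set, the parameterised command can be run in a loop, once per occurrence. This is only a sketch: the sample data, the temporary directory and the block_N.txt file names are made up for illustration.

```shell
# Made-up sample data in a temporary directory (illustrative values only).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
cat > "$sample" <<'EOF'
##########
'Header1'
'Sub-header1'
X Y
1.0 10.0
2.0 20.0

'Sub-header2'
X Y
3.0 30.0

#######
'Header2'
'Sub-header1'
X Y
4.0 40.0
5.0 50.0

'Sub-header2'
X Y
6.0 60.0
EOF

# Extract the 1st and 2nd 'Sub-header1' block into separate files.
# search and num are set by the var=value arguments placed before the
# file name; awk assigns them just before it opens that file.
for num in 1 2; do
  awk '$0~search && ++n==num {flag=1; c=3}
  /^$/ {flag=0}
  flag && !(c && --c) {print}' search="Sub-header1" num="$num" "$sample" \
    > "$tmpdir/block_$num.txt"
done

block1=$(cat "$tmpdir/block_1.txt")
block2=$(cat "$tmpdir/block_2.txt")
printf 'block 1:\n%s\nblock 2:\n%s\n' "$block1" "$block2"
rm -rf "$tmpdir"
```

Each pass rereads the whole input, which is fine for small files; for a large file, a single awk run that redirects print to a per-block file name would avoid the repeated reads.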

jaldo0805 · June 17, 2013, 1:41pm

Thank you so much, I spent hours yesterday trying to figure this out myself! Using awk is fun and simplifies a lot of work (when you know how to use it; in the meantime, it can be painful after some hours of trial and error).
Thanks!

MadeInGermany · June 17, 2013, 3:59pm

Instead of counting the occurrences, you could restrict the search to the section under a given header.
This time I broke the awk code into multiple lines, which IMHO is more readable.
And I have introduced next, which immediately starts a new cycle with the next input line.
That means c=2, not 3: next already skips the sub-header line itself, so the countdown only has to skip the X Y line.

awk '$0~header {n=1; next}
n && $0~subheader {n=0; flag=1; c=2; next}
/^$/ {flag=0}
flag && !(c && --c) {print}
' header="Header2" subheader="Sub-header1" filename.txt
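
A quick way to check the header-scoped version is to run it on a small made-up file with different header and subheader values. The sample data, temporary path and variable names below are assumptions for illustration:

```shell
# Made-up sample data in a temporary directory (illustrative values only).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.txt"
cat > "$sample" <<'EOF'
##########
'Header1'
'Sub-header1'
X Y
1.0 10.0
2.0 20.0

'Sub-header2'
X Y
3.0 30.0

#######
'Header2'
'Sub-header1'
X Y
4.0 40.0
5.0 50.0

'Sub-header2'
X Y
6.0 60.0
EOF

# n marks "inside the wanted header"; the first matching sub-header after
# it turns printing on, and c=2 skips only the X Y line because next has
# already dropped the sub-header line.
prog='$0~header {n=1; next}
n && $0~subheader {n=0; flag=1; c=2; next}
/^$/ {flag=0}
flag && !(c && --c) {print}'

h2_s1=$(awk "$prog" header="Header2" subheader="Sub-header1" "$sample")
h1_s2=$(awk "$prog" header="Header1" subheader="Sub-header2" "$sample")

printf 'Header2 / Sub-header1:\n%s\n' "$h2_s1"
printf 'Header1 / Sub-header2:\n%s\n' "$h1_s2"
rm -rf "$tmpdir"
```

Note that header and subheader are used as regular expressions matched anywhere in the line, and the match is case-sensitive, so "Header1" does not match the 'Sub-header1' lines.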