get part of file with unique & non-unique string

andrewsc · September 17, 2009, 10:48am

I have an archive file that holds a batch of statements. I would like to be able to extract a certain statement based on the unique customer # (e.g. 123456). The end of each statement is marked by "ENDSTM".

I can find the line number for the beginning of the statement section with sed.

start=`sed -n '/123456/=' filename`

This gets me the number of the line that contains the unique customer #. The statement actually starts 5 lines before that line. From here I would like to find the next occurrence of "ENDSTM" after the line where the customer # was found. Then I could use sed to grab the section of the file I need based on the starting and ending line #s.
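
For reference, the two-step idea described above (find the customer line, find the next "ENDSTM", then print that range) might be sketched as follows. This is only a sketch: the function name extract_stmt, its arguments, and the choice to use the first matching line are assumptions for illustration, and the customer # is treated as a sed regular expression.

```shell
# Hypothetical helper: extract_stmt FILE CUSTOMER [LINES_BEFORE]
extract_stmt() {
    file=$1 cust=$2 before=${3:-0}

    # Line number of the first line containing the customer number.
    match=$(sed -n "/$cust/=" "$file" | head -n 1)
    [ -n "$match" ] || return 1

    # Line number of the first ENDSTM at or after that line.
    end=$(awk -v s="$match" 'NR >= s && /ENDSTM/ {print NR; exit}' "$file")
    [ -n "$end" ] || return 1

    # Back up the requested number of lines, but never past line 1.
    start=$((match - before))
    [ "$start" -ge 1 ] || start=1

    # Print the inclusive range start..end.
    sed -n "${start},${end}p" "$file"
}
```

Under these assumptions, something like extract_stmt filename 123456 5 would print from 5 lines above the customer line through the ENDSTM line.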

How would I do this? grep, sed, awk?

Or is there a way to get what I need with one command?

Thanks,
Andrew

durden_tyler · September 17, 2009, 12:04pm

Yes, here's one way to do it with Perl:

$
$ cat -n f1
     1  456789
     2  stm1 - line 1
     3  stm1 - line 2
     4  stm1 - line 3
     5  ENDSTM
     6
     7  123456
     8  stm2 - line 1
     9  stm2 - line 2
    10  stm2 - line 3
    11  stm2 - line 4
    12  ENDSTM
    13
    14  345678
    15  stm3 - line 1
    16  stm3 - line 2
    17  ENDSTM
    18
    19  567890
    20  stm4 - line 1
    21  stm4 - line 2
    22  stm4 - line 3
    23  stm4 - line 4
    24  ENDSTM
    25
$
$ perl -ne 'BEGIN{undef $/} {/.*123456.(.*?)ENDSTM/s && print $1}' f1
stm2 - line 1
stm2 - line 2
stm2 - line 3
stm2 - line 4
$
$

You can also use awk on that data file:

$
$ ##
$ awk 'BEGIN{x=0}
>      /123456/ {x=1; getline}
>      /ENDSTM/ && x==1 {x=0}
>      x==1 {print}' f1
stm2 - line 1
stm2 - line 2
stm2 - line 3
stm2 - line 4
$
$
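
If the customer line and the ENDSTM line should both be part of the output, a sed address range is another way to do it in one command. The sketch below is only an illustration: the function name stmt_range and its arguments are made up here, and the customer # is used as a sed regular expression.

```shell
# Hypothetical helper: stmt_range FILE CUSTOMER
stmt_range() {
    # Print from the first line matching the customer number through the
    # next line matching ENDSTM, then quit so later matches are ignored.
    sed -n "/$2/,/ENDSTM/{
p
/ENDSTM/q
}" "$1"
}
```

Called as stmt_range f1 123456, this should print the 123456 line, the statement lines, and the closing ENDSTM line.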

HTH,
tyler_durden

andrewsc · September 17, 2009, 1:37pm

durden_tyler:

You can also use awk on that data file:

$
$ ##
$ awk 'BEGIN{x=0}
>      /123456/ {x=1; getline}
>      /ENDSTM/ && x==1 {x=0}
>      x==1 {print}' f1
stm2 - line 1
stm2 - line 2
stm2 - line 3
stm2 - line 4
$
$

The awk gets the page(s) I need. However, on the first page I actually need to start 5 lines above the "123456" (6 including that line).

durden_tyler · September 17, 2009, 2:32pm

So in the case of file f1, you want to start from line (7-5=) 2:

$
$ cat -n f1
     1  456789
     2  stm1 - line 1
     3  stm1 - line 2
     4  stm1 - line 3
     5  ENDSTM
     6
     7  123456
     8  stm2 - line 1
     9  stm2 - line 2
    10  stm2 - line 3
    11  stm2 - line 4
    12  ENDSTM
    13
    14  345678
    15  stm3 - line 1
    16  stm3 - line 2
    17  ENDSTM
    18
    19  567890
    20  stm4 - line 1
    21  stm4 - line 2
    22  stm4 - line 3
    23  stm4 - line 4
    24  ENDSTM
    25
$
$
$ ##
$ perl -ne 'BEGIN{undef $/} {/.*\n(([^\n]*\n){5}123456.*?)ENDSTM/s && print $1}' f1
stm1 - line 1
stm1 - line 2
stm1 - line 3
ENDSTM
 
123456
stm2 - line 1
stm2 - line 2
stm2 - line 3
stm2 - line 4
$
$

tyler_durden

andrewsc · September 17, 2009, 3:11pm

The Perl code does work for me. Can the awk be modified to grab the section starting 5 rows earlier, and also include the last line that has the ENDSTM?

Thanks

durden_tyler · September 17, 2009, 3:43pm

Sorry, flexibility of regular expressions is one of the (many) reasons I tend to favor Perl.

Anyway -

$
$ ##
$ awk 'BEGIN {x = -1}
>      /123456/ {x=0; n=NR}
>      /ENDSTM/ && x==0 {x=1; s[NR]=$0}
>      x<1 {s[NR]=$0}
>      END {for (i=n-5; i<=length(s); i++) {print s[i]}}' f1
stm1 - line 1
stm1 - line 2
stm1 - line 3
ENDSTM
 
123456
stm2 - line 1
stm2 - line 2
stm2 - line 3
stm2 - line 4
ENDSTM
$
$
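
A possible variant of the awk above, offered only as a sketch: it keeps a rolling buffer of the last few lines instead of storing everything up to ENDSTM, and it avoids calling length() on an array, which not every awk supports. The function name stmt_with_context and the awk variables cust and before are made up for illustration, and the customer # is matched as a plain substring rather than a regular expression.

```shell
# Hypothetical helper: stmt_with_context FILE CUSTOMER LINES_BEFORE
stmt_with_context() {
    awk -v cust="$2" -v before="$3" '
        # After the customer line has been seen, print through ENDSTM and stop.
        found {
            print
            if (/ENDSTM/) exit
            next
        }
        # First line containing the customer number: flush the buffered
        # context lines (oldest first), then print the customer line itself.
        index($0, cust) {
            first = NR - before
            if (first < 1) first = 1
            for (i = first; i < NR; i++) print buf[i % before]
            print
            found = 1
            next
        }
        # Otherwise remember this line, overwriting the oldest slot.
        before > 0 { buf[NR % before] = $0 }
    ' "$1"
}
```

With the sample file, stmt_with_context f1 123456 5 should print the same lines as the transcript above, from 5 lines before the customer line through ENDSTM.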

tyler_durden