Blocks of text in a file - extract when matches...

Bashingaway · May 26, 2014, 9:24am

I sat down yesterday to write this script and have just realised that my methodology is broken........

In essence I have.....

-----------------------------------------------------------------  (This line really is in the file)               
                     Service ID: 12345                
                        Event ID: 67890               
                      start_date: 0xdde8 21:00:00 (Sat May 31 22:00:00 2014)                
                        duration:  01:00:00                
                             name:     Any old name info could be in here                
                                text:        Extended information description......        
                 Content Type:  Specific set of 10 different flag types            
                    Event CRID:  /XXYYZZ 
----------------------------------------------------------------              
                      Service ID: 54321                
                         Event ID: 09876               
                       start_date: 0xdde9 20:00:00 (Sun Jun 1 21:00:00 2014)                
                         duration:  02:00:00                
                              name:     Any old name info could be in here                
                                 text:        Extended information description......        
                  Content Type:  Specific set of 10 different flag types
                     Event CRID:  /YYZZXX 
---------------------------------------------------------------------------------

Note: other new fields can be introduced by the source without notice, and I have no control over their naming.

And so on repeated up to a couple of thousand times.

What I want to do is the following..

Match against the Service ID variables (specific values, got from different external source) and partial match against the name for specific words.....

So I had coded this as a:-

cat /imputfile.txt | while read -r LINEVAR ; do

if [[ $LINEVAR == "Service ID:"* ]] && [[  $LINEVAR ==  *$VARIABLE1*  ||  $LINEVAR == *$VARIABLE2*  ]]; then

echo "Found a match........"

# do some stuff

fi

done

I've just realised my flawed thinking.......I need the data from the:-

start_date
duration
name
text
event crid

Lines to be able to complete my data process and of course as I'm reading this on a LINE by LINE basis I can't then identify the correct subsequent fields in the source file....basically I'm a twit!!

So I have to come up with a different method but I'm having a brain f*rt and can't think, partly because I don't do a lot of bash so syntax always has to be re-looked up.

Ideas?

vbe · May 26, 2014, 10:24am

Using IFS=":" (after having saved the old IFS value, of course...)
you could read each line as 2 variables, say VAR1 and VAR2.
You can then test if
[ "$VAR1" = "Service ID" ] (once the leading spaces have been stripped from VAR1), then save VAR2 etc...
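
A minimal sketch of the IFS=":" idea, assuming the block layout shown in the first post. The sample data and the function names trim, flush_block and extract are made up for illustration, as is the output format. Because IFS is set only for the read command itself, there is no old value to save and restore. The key still carries the file's leading spaces, so it is trimmed before being compared.

```shell
#!/usr/bin/env bash
# Hypothetical sample file in the layout from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-----------------------------------------------------------------
     Service ID: 12345
       Event ID: 67890
     start_date: 0xdde8 21:00:00 (Sat May 31 22:00:00 2014)
       duration:  01:00:00
----------------------------------------------------------------
     Service ID: 54321
       Event ID: 09876
     start_date: 0xdde9 20:00:00 (Sun Jun 1 21:00:00 2014)
       duration:  02:00:00
----------------------------------------------------------------
EOF

# Strip leading and trailing whitespace from $1.
trim() {
    _s=$1
    _s=${_s#"${_s%%[![:space:]]*}"}
    _s=${_s%"${_s##*[![:space:]]}"}
    printf '%s' "$_s"
}

# Print the buffered fields if the block just finished has the wanted
# Service ID ($1), then reset the buffer for the next block.
flush_block() {
    if [ "$service_id" = "$1" ]; then
        printf '%s | %s\n' "$start_date" "$duration"
    fi
    service_id='' start_date='' duration=''
}

# extract FILE WANTED_ID: split each line at the first colon and buffer
# the interesting fields until the next separator line.
extract() {
    service_id='' start_date='' duration=''
    while IFS=: read -r key value; do
        case $key in
            -----*) flush_block "$2"; continue ;;
        esac
        key=$(trim "$key")
        case $key in
            'Service ID') service_id=$(trim "$value") ;;
            start_date)   start_date=$(trim "$value") ;;
            duration)     duration=$(trim "$value") ;;
        esac
    done < "$1"
    flush_block "$2"   # in case the file does not end with a separator
}

result=$(extract "$sample" 54321)
printf '%s\n' "$result"
rm -f "$sample"
```

Values such as the start_date keep their internal colons because read hands the rest of the line to the last variable.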

RudiC · May 26, 2014, 11:50am

Try this and adapt/extend:

awk -vRS="-------" '/Service ID: (54321|12345)/ {match ($0, /start_date[^\n]*/); print substr ($0, RSTART, RLENGTH)} ' file
start_date: 0xdde8 21:00:00 (Sat May 31 22:00:00 2014)                
start_date: 0xdde9 20:00:00 (Sun Jun 1 21:00:00 2014)

Bashingaway · May 29, 2014, 10:57am

This is almost there, but how do I do it if the Service ID values are variables? I also want to extract multiple lines in a single awk statement. Would the following work?

awk -vRS="-------" '/Service ID: ($VARIABLE1|$VARIABLE2)/ {match ($0, /start_date[^\n]*/); print substr ($0, RSTART, RLENGTH)} {match ($0,/duration[^\n]*/); print substr ($0, RSTART, RLENGTH)}' input.file

I've tried various formats of using single or double quotes to get the $VARIABLEx values to read properly but my lack of familiarity with awk syntax precision means I'm going round in circles a little....when I get the $VARIABLEx values correct it causes syntax problems later in the line.......

CarloM · May 29, 2014, 11:23am

You can pass the shell variables in using -v.

Like:

$ export SERVICEID=54321; awk -vRS="-------" -vVAR1=$SERVICEID '$0 ~ "Service ID: ("VAR1")" {match ($0, /start_date[^\n]*/); print substr ($0, RSTART, RLENGTH)} ' input.txt
start_date: 0xdde9 20:00:00 (Sun Jun 1 21:00:00 2014)

Bashingaway · May 29, 2014, 4:12pm

Carlo

I don't think you've fully taken on board what I'm trying to do....

I'm searching for the text phrase "Service ID:" AND then the $VARIABLEx value (not Service ID=value).

"Service ID:" effectively marks the start of a block that ends with ---------. The $VARIABLEx value (more than one value is valid for $VARIABLEx) then determines whether it's a valid match.

If it is I want to extract start_date, duration and other lines from that text block......that's why I gave the example I gave in my reply and asked how I can get multiple extractions in one awk line when I get a match.

Hope that makes it clearer.

CarloM · May 30, 2014, 4:33am

It's not supposed to be a perfect solution - it's an example of how to use shell variables in awk scripts.

If you want to use more than one variable then you need to pass them in individually and extend the regex.
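
For instance, two IDs can be passed in separately and joined inside the pattern. This is only a sketch: the sample file and the awk variable names id1 and id2 are invented here, and it assumes an awk such as gawk or mawk that treats a multi-character RS as a regular expression. The trailing [ \t]*\n in the pattern is an addition so that 12345 does not also match a longer ID such as 123456.

```shell
#!/bin/sh
# Hypothetical sample file; the third block is made up and must not match.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-----------------------------------------------------------------
     Service ID: 12345
       Event ID: 67890
     start_date: 0xdde8 21:00:00 (Sat May 31 22:00:00 2014)
       duration:  01:00:00
----------------------------------------------------------------
     Service ID: 54321
       Event ID: 09876
     start_date: 0xdde9 20:00:00 (Sun Jun 1 21:00:00 2014)
       duration:  02:00:00
----------------------------------------------------------------
     Service ID: 99999
       Event ID: 11111
     start_date: 0xddea 19:00:00 (Mon Jun 2 20:00:00 2014)
       duration:  00:30:00
----------------------------------------------------------------
EOF

VARIABLE1=12345
VARIABLE2=54321

# Each shell variable gets its own -v assignment; the regex is then
# assembled as a string inside awk, so no shell quoting tricks are needed.
result=$(awk -v RS='-------' -v id1="$VARIABLE1" -v id2="$VARIABLE2" '
    $0 ~ ("Service ID: (" id1 "|" id2 ")[ \t]*\n") {
        if (match($0, /start_date[^\n]*/))
            print substr($0, RSTART, RLENGTH)
    }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The values are used as part of a regular expression, so this assumes the IDs contain no regex metacharacters.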

RudiC · May 30, 2014, 4:58am

Try this:

awk     -v RS="-------" \
        -v SID="54321|12345" \
        '$0 ~ "Service ID: ("SID")"     {match ($0, /start_date[^\n]*/)
                                         print substr ($0, RSTART, RLENGTH)}
        ' file
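
Extending the same idea to pull several lines out of each matching block in one awk call might look like the sketch below. It is not a tested solution for the real data: the sample file, the programme names, the 99999 block and the awk variable name want are all invented, and it again assumes an awk (gawk, mawk) that accepts a multi-character RS. Instead of one match() per field, each record is split into lines and every line whose label is listed in want is printed, so fields the source adds later are simply ignored.

```shell
#!/bin/sh
# Hypothetical sample file in the layout from the first post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
-----------------------------------------------------------------
     Service ID: 12345
       Event ID: 67890
     start_date: 0xdde8 21:00:00 (Sat May 31 22:00:00 2014)
       duration:  01:00:00
           name:     Evening News
           text:        Extended information description......
   Content Type:  Specific set of 10 different flag types
     Event CRID:  /XXYYZZ
----------------------------------------------------------------
     Service ID: 99999
       Event ID: 11111
     start_date: 0xdde8 23:00:00 (Sun Jun 1 00:00:00 2014)
       duration:  00:30:00
           name:     Shopping Hour
     Event CRID:  /ZZXXYY
----------------------------------------------------------------
     Service ID: 54321
       Event ID: 09876
     start_date: 0xdde9 20:00:00 (Sun Jun 1 21:00:00 2014)
       duration:  02:00:00
           name:     Late Film
     Event CRID:  /YYZZXX
---------------------------------------------------------------------------------
EOF

VARIABLE1=12345
VARIABLE2=54321
SID="$VARIABLE1|$VARIABLE2"                    # alternation of wanted IDs
WANT='Service ID|start_date|duration|name'     # labels to print

result=$(awk -v RS='-------' -v SID="$SID" -v want="$WANT" '
    # Only blocks whose Service ID line ends in one of the wanted IDs.
    $0 ~ ("Service ID: (" SID ")[ \t]*\n") {
        n = split($0, line, "\n")
        for (i = 1; i <= n; i++) {
            pos = index(line[i], ":")          # first colon ends the label
            if (pos == 0) continue
            label = substr(line[i], 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", label)
            if (label ~ ("^(" want ")$")) {
                value = substr(line[i], pos + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print label ": " value
            }
        }
    }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Adding "text" or "Event CRID" to WANT would print those lines as well, without touching the awk program.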