awk extract strings matching multiple patterns

chrissycc · October 13, 2013, 5:13am

Hi,

I wasn't quite sure how to title this one! Here goes:

I have some partially parsed log files that I now need to extract information from. Because of the original layout and the processing already applied, I can't make any assumptions about the number of fields or the exact format. All I know is that I can look for certain patterns. An extract of the source is:

Job <1>, Job Name <BLAH>, Queue-- MEMLIMIT 10 G Fri Oct 11 09:55:48: Started on <cn035>, -- The CPU time is 12 seconds. MEM: 1 Gbytes; 
Job <2>, Job Name <BLAH>, Queue-- MEMLIMIT 10 G Fri Oct 11 09:55:48: Started on <cn069>, -- The CPU time is 10 seconds. MEM: 1 Gbytes; 
Job <3>, Job Name <BLAH>,  MEMLIMIT 10 G Fri Oct 11 09:55:48: Started on <cn049>, ;-- The CPU time is 13 seconds. MEM: 2 Gbytes; 
Job <4>, Job Name <BLAH>,  Status <RUN>,  Command <-- The CPU time is 76 seconds. MEM: 3 Gbytes; 
Job <7>, Job Name <BLAH>,  Stat us <RUN>,  Command <-- The CPU time is 49 seconds. MEM: 1014 Mbytes; 
Job <8>, Job Name <BLAH> , Status <RUN>, -- MEMLIMIT 10 G Fri Oct 11 22:13:19: Started on <cn014>;-- The CPU time is 12 seconds. MEM: 391 Mbytes; 
Job <9>, Job Name <BLAH>,  Status <RUN >,  Command <: Started on <cn026>,-- The CPU time is 71 seconds. MEM: 13 Mbytes; 
Job <10>, Job Name <BLAH>,  Sta tus <RUN>,  Command <#!/bi-- MEMLIMIT 22 G  Started on <cn064>, -- The CPU time is 25 seconds. MEM: 12 Gbytes;

I want to extract based on:

Started on <____>,
MEMLIMIT __ G
MEM: ___ bytes;

The first line example being:

MEMLIMIT 10 G Fri Oct 11 09:55:48: Started on <cn035>, -- The CPU time is 12 seconds. MEM: 1 Gbytes;

Each line may contain all, some or none of the above. My ideal output based on the above would be something like:

Started: cn035 MEMLIMIT: 10 G MEM: 1 G
Started: cn069 MEMLIMIT: 10 G MEM: 1 G 
etc
etc

(ideally, if there is no MEMLIMIT found on a line for example):
Started: cn026 MEMLIMIT: 0 G MEM: 13 M

I've messed around with gsub in awk to extract a single instance but couldn't work out how to select on multiple patterns...

Any help as always would be appreciated!

Scrutinizer · October 13, 2013, 5:46am

Like this?

sed -n 's/.*\(MEMLIMIT [^ ]* [^ ]*\).*Started on <\([^>]*\).*\(MEM: [^ ]* .\).*/Started: \2 \1 \3/p' file

chrissycc · October 13, 2013, 6:46am

Thanks for that, Scrutinizer, so very close to what I need! If I've understood it correctly, it only prints a line when all three patterns are found. Ideally it would print every line with one or more matches:

Started: cn026 MEMLIMIT: 0 G MEM: 13 M

or just a blank rather than 0 G for the MEMLIMIT. Basically every entry _should_ have a 'Started on' and a 'MEM:', but not necessarily a MEMLIMIT.
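
One possible way to print every line that has at least one match is to test each pattern on its own with match() and fall back to a default when it is missing. The following is only a sketch in plain awk, not any poster's solution. The function name extract_job_fields is hypothetical, and it assumes the values are whole numbers followed by a k, M or G unit, with "-" standing in for a missing host or MEM value:

```shell
# extract_job_fields is a hypothetical helper name; it reads log lines
# from the files given as arguments, or from stdin if there are none.
extract_job_fields() {
  awk '
  {
    # Defaults used when a pattern is missing from the line (assumed).
    host = "-"; limit = "0 G"; mem = "-"; found = 0

    # "Started on <" is 12 characters; the closing ">" is 1 more.
    if (match($0, /Started on <[^>]*>/)) {
      host = substr($0, RSTART + 12, RLENGTH - 13); found++
    }
    # "MEMLIMIT " is 9 characters; keep the number and unit letter.
    if (match($0, /MEMLIMIT [0-9]+ [kMG]/)) {
      limit = substr($0, RSTART + 9, RLENGTH - 9); found++
    }
    # "MEM: " is 5 characters and "bytes" is 5 more to drop.
    if (match($0, /MEM: [0-9]+ [kMG]bytes/)) {
      mem = substr($0, RSTART + 5, RLENGTH - 10); found++
    }
    # Print every line on which at least one pattern matched.
    if (found)
      printf "Started: %s MEMLIMIT: %s MEM: %s\n", host, limit, mem
  }' "$@"
}
```

Run it as: extract_job_fields file. A line with a host and MEM but no MEMLIMIT would then come out as "Started: cn026 MEMLIMIT: 0 G MEM: 13 M".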

bartus11 · October 13, 2013, 8:33am

If you are OK with a Perl solution, put this into "script.pl":

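#!/usr/bin/perl
use strict;
use warnings;
# Read the log file named on the command line.
open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
while (my $line = <$in>) {
  chomp $line;
  # Only report lines that name a host.
  next unless $line =~ /Started on <([^>]+)/;
  my $started = $1;
  # MEMLIMIT is optional; default to 0 when it is absent.
  my $memlimit = $line =~ /MEMLIMIT (\d+) G/ ? $1 : 0;
  # Capture the MEM value up to the semicolon, e.g. "1 Gbytes".
  my $mem = $line =~ /MEM: ([^;]+)/ ? $1 : '';
  print "Started: $started MEMLIMIT: $memlimit G MEM: $mem\n";
}
close $in;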

Then run: perl script.pl file

RudiC · October 13, 2013, 9:12am

Straightforward awk:

awk     'match ($0, /Started on/)       {C++; X=substr ($0, RSTART+RLENGTH,10); gsub (/^.*<|>.*$/, "", X)}
         match ($0, /MEMLIMIT/)         {C++; Y=substr ($0, RSTART+RLENGTH,10); gsub (/^ |[^kMG]*$/, "", Y)}
         match ($0, /MEM:/)             {C++; Z=substr ($0, RSTART+RLENGTH,10);  sub (/[bB].*$/, "", Z)}
         C >=2                          {printf "Started: %s MEMLIMIT: %6s MEM: %6s\n", X, Y, Z}
                                        {C=X=Y=Z=0}
        ' file
Started: cn035 MEMLIMIT:   10 G MEM:    1 G
Started: cn069 MEMLIMIT:      0 MEM:    1 G
Started: cn049 MEMLIMIT:   10 G MEM:    2 G
Started: cn014 MEMLIMIT:   10 G MEM:  391 M
Started: cn026 MEMLIMIT:      0 MEM:   13 M
Started: cn064 MEMLIMIT:   22 G MEM:   12 G

MadeInGermany · October 13, 2013, 12:12pm

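As RudiC pointed out, the following only works with the Solaris /usr/xpg4/bin/awk, because it relies on "\1" in the sub() replacement acting as a backreference: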

/usr/xpg4/bin/awk '{
started=$0; if (!sub(".*Started on <([^>]*).*","\1",started)) started="-"
memlimit=$0; if (!sub(".*MEMLIMIT ([^ ]* [^ ;]*).*","\1",memlimit)) memlimit="-"
mem=$0; if (!sub(".*MEM: ([^ ]* [^ ;]*).*","\1",mem)) mem="-"
printf "Started on: %-8s MEMLIMIT: %-8s MEM: %-8s\n",started,memlimit,mem
}' file
Started on: cn035    MEMLIMIT: 10 G     MEM: 1 Gbytes
Started on: cn069    MEMLIMIT: 10 G     MEM: 1 Gbytes
Started on: cn049    MEMLIMIT: 10 G     MEM: 2 Gbytes
Started on: -        MEMLIMIT: -        MEM: 3 Gbytes
Started on: -        MEMLIMIT: -        MEM: 1014 Mbytes
Started on: cn014    MEMLIMIT: 10 G     MEM: 391 Mbytes
Started on: cn026    MEMLIMIT: -        MEM: 13 Mbytes
Started on: cn064    MEMLIMIT: 22 G     MEM: 12 Gbytes

RudiC · October 13, 2013, 12:34pm

@MadeInGermany: Which awk version are you using? Mine (mawk) takes "\1" as the "\001" character.

MadeInGermany · October 13, 2013, 12:41pm

You are right, it only works with the Solaris /usr/xpg4/bin/awk
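
For awks that do not treat "\1" in a sub() replacement as a backreference, the same kind of capture can be approximated with match() and substr(). This is a rough sketch only. The names capture and capture_fields are made up for illustration, the head argument is assumed to be literal text, and unlike the greedy ".*" version it picks the first occurrence on the line, not the last:

```shell
# capture_fields is a hypothetical wrapper; it reads the files given as
# arguments, or stdin if there are none.
capture_fields() {
  awk '
  # capture(s, head, body): find head immediately followed by body in s
  # and return only the body part, or "-" when there is no match.
  # Both head and body are used as regular expressions (assumed safe).
  function capture(s, head, body) {
    if (!match(s, head body)) return "-"
    s = substr(s, RSTART, RLENGTH)   # the whole match: head plus body
    sub("^" head, "", s)             # drop the head, keep the body
    return s
  }
  {
    printf "Started on: %-8s MEMLIMIT: %-8s MEM: %-8s\n",
      capture($0, "Started on <", "[^>]*"),
      capture($0, "MEMLIMIT ", "[^ ]* [^ ;]*"),
      capture($0, "MEM: ", "[^ ]* [^ ;]*")
  }' "$@"
}
```

Run it as: capture_fields file. GNU awk also has gensub(), which accepts "\\1" in the replacement, but that function is a gawk extension and not available in mawk.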

chrissycc · October 13, 2013, 2:38pm

Firstly, thanks everyone for the responses. Cheers for the Perl solution; unfortunately, to answer your question, I'm not OK with Perl (can never get my head round it!!).

I'm working on CentOS 5, and my awk is GNU Awk 3.1.5, which I'm pleased to say works with RudiC's suggestion:

awk     'match ($0, /tarted on/)       {C++; X=substr ($0, RSTART+RLENGTH,10); gsub (/^.*<|>.*$/, "", X)}
         match ($0, /MEMLIMIT/)         {C++; Y=substr ($0, RSTART+RLENGTH,10); gsub (/^ |[^kMG]*$/, "", Y)}
         match ($0, /MEM:/)             {C++; Z=substr ($0, RSTART+RLENGTH,10);  sub (/[bB].*$/, "", Z)}
         C >=2                          {printf "Started: %s MEMLIMIT: %6s MEM: %6s\n", X, Y, Z}
                                        {C=X=Y=Z=0}
        ' file.test | head
Started: cn103 MEMLIMIT:      0 MEM:   13 M
Started: cn103 MEMLIMIT:      0 MEM:  652 M
Started: cn103 MEMLIMIT:      0 MEM:  273 M
Started: cn103 MEMLIMIT:      0 MEM:  623 M
Started:  16 Hosts/ MEMLIMIT:      0 MEM:    4 G
Started:  64 Hosts/ MEMLIMIT:      0 MEM:    9 G
Started: cn133 MEMLIMIT:   39 G MEM:   24 G
Started: cn104 MEMLIMIT:      0 MEM:    2 M
Started: cn104 MEMLIMIT:      0 MEM:   10 M
Started: cn104 MEMLIMIT:      0 MEM:  217 M
etc

Brilliant, thanks very much!