Grep a couple of consecutive lines if each line contains a certain string

Hello,

I want to extract lines from a file that looks like this:

20120530025502914 | REQUEST | whatever
20120530025502968 | RESPONSE | whatever
20120530025502985 | RESPONSE | whatever
20120530025502996 | REQUEST | whatever
20120530025503013 | REQUEST | whatever
20120530025503045 | RESPONSE | whatever

I want to extract all groups of 2 lines in which the first line contains 'REQUEST' and the line immediately below it contains 'RESPONSE'.

Basically, from the above file I would like to extract the following:

20120530025502914 | REQUEST | whatever
20120530025502968 | RESPONSE | whatever
20120530025503013 | REQUEST | whatever
20120530025503045 | RESPONSE | whatever

(Please note the timestamps in the first field, so you can identify which lines were extracted from the initial file.)
I'm not completely empty-handed; I have this snippet as a starting point ('stolen' a while ago from the internet :stuck_out_tongue: ):

nawk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=0 a=1 s="string" file

This takes the line that contains the "string" (in my case "REQUEST") and the next line after it, but I don't know where to put the condition that the following line must contain "RESPONSE", and, if it does, to extract that group of 2 lines.
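
From what I can tell (and I may be reading it wrong), b is the number of context lines to print before a match, a the number to print after, and s the search pattern; spread out with comments, it looks roughly like this:

nawk '
  c-- > 0                          # still printing "after" lines from an earlier match
  $0 ~ s {                         # current line matches the pattern s
    if (b)                         # if "before" context was asked for,
      for (c = b+1; c > 1; c--)
        print r[(NR-c+1) % b]      #   print the buffered preceding lines
    print                          # print the matching line itself
    c = a                          # schedule the next a lines for printing
  }
  b { r[NR % b] = $0 }             # keep a rolling buffer of the last b lines
' b=0 a=1 s="string" file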

Try this:

awk -F"|" '$2 ~ "REQUEST" {s=$0;f=1;next} f && $2 ~ "RESPONSE" {print s RS $0;f=0}' file

Thanks for the response, but I think there must be a syntax error, because I get this:

 echo kkk | awk -F"|" '$2 ~ "REQUEST" {s=$0;f=1;next} f && $2 ~ "RESPONSE" {print s RS $0;f=0}'
awk: syntax error near line 1
awk: bailing out near line 1 

On Solaris, use nawk or /usr/xpg4/bin/awk rather than awk.
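
For example, either of these should work there:

nawk -F"|" '$2 ~ "REQUEST" {s=$0;f=1;next} f && $2 ~ "RESPONSE" {print s RS $0;f=0}' file
/usr/xpg4/bin/awk -F"|" '$2 ~ "REQUEST" {s=$0;f=1;next} f && $2 ~ "RESPONSE" {print s RS $0;f=0}' file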

I recognise my own code from this post: grep and display few lines before and after Post: 302098992

Since I wrote that in 2006, I notice it has propagated over the internet in other forums and blogs, and now seems to have taken on a life of its own.

It's not applicable in this case.

Hi.

I often use cgrep for complex matching and manipulation. It extends some of the features of GNU/grep and is comparable in speed. The heart of the following script is the cgrep call; the surrounding code displays the environment under which it was run and compares the results:

#!/usr/bin/env bash

# @(#) s1	Demonstrate matching on successive lines, cgrep.
# See: http://sourceforge.net/projects/cgrep/

# Section 1, setup, pre-solution, $Revision: 1.25 $".
# Infrastructure details, environment, debug commands for forum posts. 
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin" HOME=""
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
db() { : ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
C=$HOME/bin/context && [ -f $C ] && $C cgrep

set -o nounset
pe

FILE=${1-data1}

# Display sample of data file, with edges or head & tail as a last resort.
db " Section 1: display of input data and expected output."
pe " || start sample [ specimen first:middle:last ] $FILE"
specimen $FILE expected-output.txt 2>/dev/null \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"

# Section 2, solution.
pl " Results:"
db " Section 2: solution."
cgrep -a 'REQUEST.*\n.*RESPONSE' $FILE |
tee f1

# Section 3, post-solution, check results, clean-up, etc.
v1=$(wc -l <expected-output.txt)
v2=$(wc -l < f1)
pl " Comparison of $v2 created lines with $v1 lines of desired results:"
db " Section 3: validate generated calculations with desired results."

pl " Comparison with desired results:"
if [ ! -f expected-output.txt -o ! -s expected-output.txt ]
then
  pe " Comparison file \"expected-output.txt\" zero-length or missing."
  exit
fi
if cmp expected-output.txt f1
then
  pe " Succeeded -- files have same content."
else
  pe " Failed -- files not identical -- detailed comparison follows."
  if diff -b expected-output.txt f1
  then
    pe " Succeeded by ignoring whitespace differences."
  fi
fi

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
cgrep ATT cgrep 8.15

 db,  Section 1: display of input data and expected output.
 || start sample [ specimen first:middle:last ] data1
Whole: 5:0:5 of 6 lines in file "data1"
20120530025502914 | REQUEST | whatever
20120530025502968 | RESPONSE | whatever
20120530025502985 | RESPONSE | whatever
20120530025502996 | REQUEST | whatever
20120530025503013 | REQUEST | whatever
20120530025503045 | RESPONSE | whatever

Whole: 5:0:5 of 4 lines in file "expected-output.txt"
20120530025502914 | REQUEST | whatever
20120530025502968 | RESPONSE | whatever
20120530025503013 | REQUEST | whatever
20120530025503045 | RESPONSE | whatever
 || end

-----
 Results:
 db,  Section 2: solution.
20120530025502914 | REQUEST | whatever
20120530025502968 | RESPONSE | whatever
20120530025503013 | REQUEST | whatever
20120530025503045 | RESPONSE | whatever

-----
 Comparison of 4 created lines with 4 lines of desired results:
 db,  Section 3: validate generated calculations with desired results.

-----
 Comparison with desired results:
 Succeeded -- files have same content.

I like awk for its flexibility (and especially for its readability compared to sed for complicated jobs), but I don't like one-off (nonce) scripts, and my measurements indicate that awk uses about 5 times as much CPU and 5 times as much wall-clock time as most members of the grep family for similar tasks (although cgrep does use more system time, about twice as much).

See the sourceforge link for the compilable source if it is not in an available repository.

Best wishes ... cheers, drl

@drl, grep cannot do this, and I do not think cgrep is present on Solaris, is it? cgrep looks nice though, and it is fast indeed. I presume cgrep was tested against gawk, which is one of the slowest awks. Perhaps you could compare it to the fastest awk, which is mawk?

Hi, Scrutinizer.

I have only the old Solaris-X86 running in a VM:

OS, ker|rel, machine: SunOS, 5.10, i86pc
Distribution        : Solaris 10 10/08 s10x_u6wos_07b X86

There are a number of repos which may have it, but I have not searched extensively. I can try to see if cgrep will compile on Solaris (it was an easy make on Linux, both 32- and 64-bit), but that will be a low-priority task.

An excerpt from a searching benchmark on a 100MB file shows:

By cpu:
        code   cpu   real system real/cpu cpu/best real/best sys/best
       cgrep  0.15   0.30   0.12     2.00     1.00      1.30     2.40
   fgrep (2)  0.16   0.24   0.06     1.50     1.07      1.04     1.20
        grep  0.16   0.23   0.06     1.44     1.07      1.00     1.20
       agrep  0.20   0.28   0.05     1.40     1.33      1.22     1.00
         awk  1.32   1.45   0.08     1.10     8.80      6.30     1.60
        mawk  1.36   1.46   0.06     1.07     9.07      6.35     1.20
        perl  1.36   1.50   0.07     1.10     9.07      6.52     1.40
         sed  1.36   1.46   0.06     1.07     9.07      6.35     1.20
        ruby  1.96   2.14   0.12     1.09    13.07      9.30     2.40
        java  2.19   2.51   0.12     1.15    14.60     10.91     2.40

So for that task the versions used were:

mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
gawk GNU Awk 3.1.5

Best wishes ... cheers, drl

Strange, are you sure the mawk numbers are correct? They should not be anywhere near the gawk numbers. I ran these tests on another 100 MB file:

cgrep -a '\| REQUEST \|.*\n.*\| RESPONSE \|' infile
mawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' infile
gawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' infile

and I got (in seconds):

code   real     user    sys  
cgrep  11.160   10.845  0.300        
mawk   16.995   16.505  0.464
gawk   98.290   97.578  0.548

--
cgrep version 8.15, mawk 1.3.3, GNU Awk 3.1.6 on Ubuntu 10.04 LTS

Hi, Scrutinizer.

Thanks for spotting that anomaly. In fact, I had been running GNU awk where I thought I was running mawk. The new (interim) excerpt of the searching benchmark is:

By cpu:
        code   cpu   real system real/cpu cpu/best real/best sys/best
        grep  0.13   0.23   0.08     1.77     1.00      1.00     1.60
   fgrep (2)  0.16   0.23   0.05     1.44     1.23      1.00     1.00
       cgrep  0.17   0.30   0.11     1.76     1.31      1.30     2.20
       agrep  0.20   0.28   0.06     1.40     1.54      1.22     1.20
        mawk  0.51   0.64   0.09     1.25     3.92      2.78     1.80
         awk  1.33   1.42   0.06     1.07    10.23      6.17     1.20
        perl  1.37   1.49   0.10     1.09    10.54      6.48     2.00
         sed  1.37   1.47   0.06     1.07    10.54      6.39     1.20
        ruby  1.98   2.45   0.11     1.24    15.23     10.65     2.20
        java  2.01   3.15   0.17     1.57    15.46     13.70     3.40

which shows that for this task, mawk is 2-3 times faster than gawk in CPU time (although, like cgrep, the system time is greater).

I'm sure that Michael appreciates you defending his code's honor :slight_smile:

Best wishes ... cheers, drl

Still quite a discrepancy, because I get a factor of 5 to 5.5. Maybe you have an unusual build, or could there be a caching effect with the others?

I just ran the tests on OSX as well, with these results:

code   real     user    sys  
cgrep  0.8665   0.847  0.017        
mawk   0.954    0.923  0.031
awk(*) 4.582    4.492  0.037 

--
cgrep version 8.15, mawk 1.3.3, (*)BWK Awk 20070501 on OSX 10.7.4

Hi.

This is a quickly-put-together script:

#!/usr/bin/env bash

# @(#) s2	Demonstrate comparison among cgrep, gawk, mawk.

pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
cs() { echo "$1" | perl -wp -e '1 while s/^([-+]?\d+)(\d{3})/$1,$2/; ' ; }
clock() { /usr/bin/time --format="real %e\nuser %U\nsys %S" $*; }
C=$HOME/bin/context && [ -f $C ] && $C cgrep gawk mawk

FILE=${1-/tmp/100-mb.txt}
lines=$( wc -l < $FILE )
chars=$( wc -c < $FILE )
pl " Input file $FILE is $( cs $lines ) lines, $( cs $chars ) characters:"
specimen $FILE

pl " Results for cgrep:"
time cgrep -a '\| REQUEST \|.*\n.*\| RESPONSE \|' $FILE

pl " Results for gawk:"
time gawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' $FILE

pl " Results for mawk:"
time mawk -F\| '$2 ~ "REQUEST"{s=$0;next} s && $2~"RESPONSE"{print s RS $0; s=x}' $FILE

exit 0

producing:

% ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
cgrep ATT cgrep 8.15
gawk GNU Awk 3.1.5
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan

-----
 Input file /tmp/100-mb.txt is 1,777,700 lines, 120,540,400 characters:
Edges: 5:0:5 of 1777700 lines in file "/tmp/100-mb.txt"
Preliminary Matter.  

This text of Melville's Moby-Dick is based on the Hendricks House edition.
It was prepared by Professor Eugene F. Irey at the University of Colorado.
Any subsequent copies of this data must include this notice  
   ---
AND FLOATED BY MY SIDE. +BUOYED UP BY THAT COFFIN, FOR ALMOST ONE WHOLE DAY
AND NIGHT, +I FLOATED ON A SOFT AND DIRGE-LIKE MAIN. +THE UNHARMING SHARKS,
THEY GLIDED BY AS IF WITH PADLOCKS ON THEIR MOUTHS; THE SAVAGE SEA-HAWKS SAILE
D WITH SHEATHED BEAKS. +ON THE SECOND DAY, A SAIL DREW NEAR, NEARER, AND PIC
KED ME UP AT LAST. +IT WAS THE DEVIOUS-CRUISING +RACHEL, THAT IN HER RETRACIN

-----
 Results for cgrep:

real	0m0.224s
user	0m0.104s
sys	0m0.100s

-----
 Results for gawk:

real	0m1.453s
user	0m1.328s
sys	0m0.092s

-----
 Results for mawk:

real	0m1.105s
user	0m0.988s
sys	0m0.096s

If anything gets a caching benefit here, it would be the wc calls, or at most the cgrep (the first command timed) ... cheers, drl

With an input file similar to your Moby-Dick one, not directly related to the problem at hand in this thread (and in which there were no matches), I also get a factor-of-5 difference between gawk and mawk, so your earlier result may be a build issue. The difference between cgrep and mawk is a factor of 6.

With an input file that is a scaled-up version of the input file from the problem in this thread, mawk and cgrep are about the same speed (mawk being 5-10% faster than cgrep), while the difference between mawk and gawk is still a factor of 5 to 5.5.

Thanks, this worked for me


Hello,

Thank you very much for your effort; it looks like very good craftsmanship. Unfortunately I cannot test it anywhere, as I don't have cgrep on any of my machines.

Hi, black_fender.

You're welcome. Glad to see that you have a solution.

It is true that one needs to do some work to have cgrep available. It's worth it if one needs to do searching where GNU/grep does not have the facilities. On the other hand, the awk family is certainly worth investing time in learning.

Best wishes ... cheers, drl