How to extract a paragraph containing a given string?

delphys · June 29, 2016, 7:22am

Hello:

Have a very annoying problem:

Need to extract paragraphs with a specific string in them from a very large file
with a repeating record separator.

Example data: a file called test.out

CREATE VIEW view1
AS something
FROM table1 ,table2 as A, table3 (something FROM table4)
FROM table5, table6
USING file1
;
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table9
something
something
FROM table5 ,table (something FROM table4 ,table5(this is something FROM table8)
USING file2
;
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table8
something
something
FROM table5 ,table (something FROM table4 ,table5(this is something FROM table8)
USING file2
;
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table6
something
something
FROM table5 ,table (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

If I want to extract a paragraph containing the string "table7"

 
awk -v RS="CREATE VIEW" '/table7/' test.out

 view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

The problem is that the RS variable always cuts out the RS value itself, as you can see..

How do I tell awk to print the RS value too .. ??

Thnx in advance.
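
One way to keep the separator text in the output is to print it back in front of each matching record. The following is only a sketch: it assumes an awk that accepts a multi-character RS (gawk and mawk do, POSIX leaves it unspecified), and the two-statement sample file is made up for illustration.

```shell
# Hypothetical sample file with two CREATE VIEW statements.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
CREATE VIEW view1
FROM table1 ,table6
USING file1
;
CREATE VIEW view2
FROM table1 ,table7
USING file2
;
EOF

# awk strips the RS text from each record, so printf puts it back in front.
# printf (not print) is used because the record already ends with a newline.
result=$(awk -v RS="CREATE VIEW" '/table7/ { printf "%s%s", RS, $0 }' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Only the statement that mentions table7 should be printed, starting with the CREATE VIEW text that RS removed.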

RudiC · June 29, 2016, 7:30am

You can set the ORS variable equal to RS, but I doubt you'd be happy with the result of this applied to your problem.
Wouldn't ; lend itself as an ORS/RS character in this case? Although I know it can show up in other spots in DDL as well ...

delphys · June 29, 2016, 8:13am

Yes I tried setting RS & ORS the same value but the output is the same, I am afraid..
Still cuts out the RS value..:

awk -v RS="CREATE" -v ORS="CREATE" '/table7/' test.out

==========================
 VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

smoofy · June 29, 2016, 9:47am

What about:

awk 'BEGIN{RS=ORS=";\n"}/table7/' text

CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

I just can't get rid of the leading empty line....

delphys · June 29, 2016, 10:06am

Thnx.. But when I try it on the real data, it just greps the string ..
Not the whole paragraph..

Aia · June 29, 2016, 10:08am

perl -ne 'BEGIN{$/=";\n"} /table7/ and print' file

Output:

CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

drl · June 29, 2016, 10:43am

Hi.

A grep-like code from ATT, cgrep, allows 3 patterns: the pattern in which you are primarily interested, and the 2 end-point patterns of an enclosing window. Here is an example:

#!/usr/bin/env bash

# @(#) s1       Demonstrate extraction by matching token in paragraph.
# cgrep source:
# http://sourceforge.net/projects/cgrep/ (verified: 2016.06.29)

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cgrep

FILE=${1-data1}

pl " Sample input data file $FILE:"
head $FILE

pl " Results:"
cgrep -D -w '^CREATE' +w '^;' table7 $FILE

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.4 (jessie) 
bash GNU bash 4.3.30
cgrep ATT cgrep 8.15

-----
 Sample input data file data1:
CREATE VIEW view1
AS something
FROM table1 ,table2 as A, table3 (something FROM table4)
FROM table5, table6
USING file1
;
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table9
something
something

-----
 Results:
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

You will need to compile cgrep. I have done it several times in both 32-bit and 64-bit without trouble.

Best wishes ... cheers, drl

Aia · June 29, 2016, 4:16pm

smoofy:

What about:

awk 'BEGIN{RS=ORS=";\n"}/table7/' text

CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

I just can't get rid of the leading empty line....

From the POSIX standard: if RS contains more than one character, the results are unspecified.

How many characters are you assigning to RS? Two. Your AWK implementation ignores the "\n"; other implementations might do something different.
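
A variant that stays within a single-character RS could look like the sketch below. It assumes that ";" never appears inside a statement (otherwise the statement would be split there), and the sample file is invented for the example.

```shell
# Hypothetical sample file; ";" appears only as the statement terminator.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
CREATE VIEW view1
FROM table1 ,table6
USING file1
;
CREATE VIEW view2
FROM table1 ,table7
USING file2
;
EOF

# RS is the single character ";", so every awk should split records the same way.
# Each record after the first starts with the newline that followed the previous ";",
# which is where the leading empty line comes from; sub() removes it.
result=$(awk 'BEGIN { RS = ";" }
/table7/ { sub(/^\n/, ""); printf "%s;\n", $0 }' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The printf adds the ";" back, since awk drops the separator from the record.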

delphys · July 4, 2016, 7:22am

Thank you all..

Yeah, tried the perl command too but it didn't work ..

Looks like there is no solution..

pravin27 · July 4, 2016, 7:41am

Delphys,

What's the issue with the perl code in post #6? Could you please post some more information (like error, o/p)?

~ Pravin ~

Scrutinizer · July 4, 2016, 3:14pm

Try:

awk '1; NF==1 && $1==";"{print x}' test.out | awk '/table7/' RS=

--
Note: On Solaris use /usr/xpg4/bin/awk rather than awk

RudiC · July 4, 2016, 4:30pm

awk '/CREATE VIEW/,/;/ {TMP=TMP DL $0; DL=ORS } /;/ {if (TMP ~ /table7/) print TMP; TMP=DL=""}' file
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

bakunin · July 5, 2016, 5:32am

Just for the record: the grep utility in AIX has a "-p" option which makes the output the surrounding paragraph instead of only the line with the match. If you happen to be on an AIX system this should pretty much do what you want.

Because grep -p works only on empty-line-separated paragraphs you will have to "massage" your file a bit first, something like:

sed '/^;$/ G' /your/file | grep -p '<regexp>'

I hope this helps.
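
On systems without grep -p, roughly the same effect can be had with awk's paragraph mode (RS set to the empty string) after the same sed step. This is a sketch on a made-up sample file, not a tested replacement for the AIX option.

```shell
# Hypothetical sample file with two statements, each ending in a line holding only ";".
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
CREATE VIEW view1
FROM table1 ,table6
USING file1
;
CREATE VIEW view2
FROM table1 ,table7
USING file2
;
EOF

# sed G appends an empty line after every line that consists of ";" alone.
# An empty RS puts awk into paragraph mode: records are separated by blank lines.
result=$(sed '/^;$/G' "$tmp" | awk 'BEGIN { RS = "" } /table7/')
printf '%s\n' "$result"
rm -f "$tmp"
```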

bakunin

delphys · July 5, 2016, 8:08am

ok, so I need to apologize for being a little reticent..

I am trying to extract paragraphs of data from the wireshark pcap
for our Client's traffic, hence the sensitivity issue..
Doing this on Redhat Box.

So if the extraction is successful, I should get:

Date:  yymmdd:time ==> record separator
.
.
.
  var logonId = "xxyyzzww";
.
.

so if I try:

awk '/Date/,/;/ {TMP=TMP DL $0; DL=ORS } /;/ {if (TMP ~ /var logonId/) print TMP; TMP=DL=""}' /root/rbc-pcap-strings

I get empty results.

When I try the perl oneliner:

 perl -ne 'BEGIN{$/=";\n"} /var logonId/ and print' /root/rbc-pcap-strings
  var logonId = "xxyyzzww";
  var logonId = "yyzzwwqq";

I do not get the paragraph..

Thnx
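
If each record really begins with a line starting with "Date:", a line-by-line accumulation avoids relying on ";" at all. A possible reason the perl one-liner returns single lines is that many lines in the capture end in ";", so each of them closes a record. The sketch below assumes the layout described above; the sample data and the variable names rec and pat are invented for illustration.

```shell
# Hypothetical capture excerpt: three records, each starting with a "Date:" line.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Date: 160705:0801
GET /login
  var logonId = "aaaa";
Date: 160705:0802
GET /index
Date: 160705:0803
POST /login
  var logonId = "bbbb";
  var other = 1;
EOF

# Collect lines into rec; when the next "Date:" line (or end of input) arrives,
# print the collected record only if it contains the search pattern.
result=$(awk -v pat='var logonId' '
/^Date:/ { if (rec ~ pat) printf "%s", rec; rec = "" }
         { rec = rec $0 "\n" }
END      { if (rec ~ pat) printf "%s", rec }
' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

The first and third records should come out whole, including their Date: lines; the second has no logonId and is dropped.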

pravin27 · July 5, 2016, 8:19am

Delphys,
The Perl one-liner works at my end. It would be helpful if you could attach part of your i/p file.
~Pravin~

drl · July 6, 2016, 8:46am

Hi.

Comments:
1) the cgrep solution is not trivial, but not difficult, provided one can compile C code. Another grep-like code, sift ( sift - a fast and powerful alternative to grep ), is one of the fastest greps, but I could not entice it to execute as I wished.

2) the suggested solutions in awk and perl worked for me.

3) for a sense of completeness and continuing bakunin's note about grep in AIX with paragraph mode, here are 3 perl codes that support paragraph mode and should work on AIX and any other system with a recent version of perl:

#!/usr/bin/env bash

# @(#) s2       Demonstrate extraction by matching token in paragraph.
# Sources, verified 2016.07.05
# tcgrep:
# https://api.metacpan.org/source/CWEST/ppt-0.14/html/commands/grep/tcgrep
# xtcgrep:
# http://cpansearch.perl.org/src/MNEYLON/File-Grep-0.01/Grep.pm
# peg:
# http://www.cpan.org/authors/id/A/AD/ADAVIES/peg-3.10

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C perl tcgrep xtcgrep peg

FILE=${1-data1}

pl " Sample input data file $FILE:"
head $FILE

pl " Results, tcgrep:"
tcgrep -P ';' table7 $FILE

pl " Results, xtcgrep:"
xtcgrep -P '";"' -e table7 $FILE

pl " Results, peg:"
peg -/ ';\n' table7 $FILE

exit 0

producing:

$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.4 (jessie) 
bash GNU bash 4.3.30
perl 5.20.2
tcgrep - ( local: RepRev 1.2, ~/bin/tcgrep, 2012-02-06 )
xtcgrep (local) 1.5
peg (local) 3.10

-----
 Sample input data file data1:
CREATE VIEW view1
AS something
FROM table1 ,table2 as A, table3 (something FROM table4)
FROM table5, table6
USING file1
;
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table9
something
something

-----
 Results, tcgrep:

CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
--------------------

-----
 Results, xtcgrep:

CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;
--------------------

-----
 Results, peg:
CREATE VIEW view1
FROM table1 ,table2 ,table6 ,table7
something
something
FROM table5 ,table7 (something FROM table4 ,table5(this is something FROM table8)
USING file2
;

As you can see, there may be some cleanup necessary for the separating delimiter in some runs. Some codes may have in-line help that describes options for discarding the output separator lines.

My names for the codes are not necessarily the same as the content pointed to by the links.

Best wishes ... cheers, drl