[Awk] Extract blocks with a particular pattern

sandeepk1611 · February 14, 2011, 4:58pm

Hi,

I have some CVS log files, which are divided into blocks. Each block has many fields of information and I want to extract those blocks with a pattern. Here is the sample input.

RCS file: /cvsroot/eclipse/org.eclipse.debug.core/core/org/eclipse/debug/core/DebugPlugin.java,v
head: 1.174
branch:
locks: strict
access list:
keyword substitution: o
total revisions: 181;    selected revisions: 16
description:
----------------------------
revision 1.149
date: 2007-04-16 11:06:45 -0500;  author: darin;  state: Exp;  lines: +51 -132;  commitid: 611546239f144567;
Bug 178902 Setting Stop in main does not stop when launched
----------------------------
revision 1.148
date: 2007-03-26 20:47:29 -0500;  author: darin;  state: Exp;  lines: +1 -1;  commitid: 61604608779a4567;
update copyrights
----------------------------
revision 1.147
date: 2007-01-18 10:57:34 -0600;  author: darin;  state: Exp;  lines: +5 -0;  commitid: 458f45afa6fd4567;
tracing for debug events
----------------------------
revision 1.146
date: 2007-01-17 09:01:45 -0600;  author: darin;  state: Exp;  lines: +7 -0;  commitid: 614345ae3a564567;
javadoc settings and fixes
=============================================================================
RCS file: /cvsroot/eclipse/org.eclipse.debug.core/core/org/eclipse/debug/core/DebugException.java,v
head: 1.17
branch:
locks: strict
access list:
keyword substitution: o
total revisions: 18;    selected revisions: 2
description:
----------------------------
revision 1.14
date: 2006-06-12 15:42:24 -0500;  author: darin;  state: Exp;  lines: +2 -2;
copyright updates
----------------------------
revision 1.13
date: 2006-05-16 09:34:00 -0500;  author: darin;  state: Exp;  lines: +1 -1;
javadoc spelling errors
=============================================================================

After the word "description", there is information for each revision. I only want those revisions where the last field (which is free text) contains the pattern "Bug" or "Fix", or "####", meaning a number without any preceding letters or words. The last field may span one or two lines.

The above input has the data for 2 files. For each file, I want to retain the information up to the word "description"; after that, I want only the revisions whose text contains these patterns.

The expected output is

RCS file: /cvsroot/eclipse/org.eclipse.debug.core/core/org/eclipse/debug/core/DebugPlugin.java,v
head: 1.174
branch:
locks: strict
access list:
keyword substitution: o
total revisions: 181;    selected revisions: 16
description:
----------------------------
revision 1.149
date: 2007-04-16 11:06:45 -0500;  author: darin;  state: Exp;  lines: +51 -132;  commitid: 611546239f144567;
Bug 178902 Setting Stop in main does not stop when launched
=============================================================================
RCS file: /cvsroot/eclipse/org.eclipse.debug.core/core/org/eclipse/debug/core/DebugException.java,v
head: 1.17
branch:
locks: strict
access list:
keyword substitution: o
total revisions: 18;    selected revisions: 2
description:
=============================================================================

Sorry for the long question. I would appreciate any help.

Thank you very much.

Sandeep

rdcwayx · February 14, 2011, 6:28pm

awk '
BEGIN{RS="==*\n";FS="--*\n"}
{for (i=1;i<=NF;i++) {if ($i~/[Bug|Fix|####] [0-9]/||$i~/RCS file:/) print $i OFS}}
' OFS="----------------------------"  infile
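
A note for readers on the pattern above: inside /.../, square brackets form a bracket expression, so [Bug|Fix|####] matches any single character from the set B u g | F i x #, not the words themselves. A small sketch of the difference, using made-up test strings:

```shell
# [Bug|Fix|####] matches ONE character from the set, so the "x 1" in an
# unrelated message is enough to match. The test strings are made up.
bracket=$(printf '%s\n' 'Bug 178902 stop in main' 'tweak box 1 layout' 'update copyrights' |
  awk '/[Bug|Fix|####] [0-9]/')

# Parentheses group alternatives, so only the whole words match.
grouped=$(printf '%s\n' 'Bug 178902 stop in main' 'tweak box 1 layout' 'update copyrights' |
  awk '/(Bug|Fix) [0-9]/')

printf 'bracket:\n%s\ngrouped:\n%s\n' "$bracket" "$grouped"
```

The bracket version keeps both the "Bug 178902" line and the "tweak box 1" line; the grouped version keeps only the first.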

Chubler_XL · February 14, 2011, 7:31pm

Great solution rdcwayx,

Just a couple of slight tweaks to stop false positives (the Bug/Fix text must start the third line of a block) and to support a number without preceding letters (I think that is what #### was supposed to represent):

awk '
BEGIN{RS="==*\n";FS="--*\n"}
{for (i=1;i<=NF;i++) {if ($i~/[^\n]*\n[^\n]*\n(Bug |Fix |)[0-9]/||$i~/^RCS file:/) print $i OFS}}
' OFS="----------------------------"  infile

yinyuemi · February 14, 2011, 8:07pm

awk -v p=0 -v label1="----------------------------" -v label2="=============================================================================" '
$0==label1{p++;y=p}
/RCS file/{x++;p=1}
/===*/{p=""}{a[x" "p]=a[x" "p]"\n##"$0}
END{
for(m=1;m<=x;m++) {for(n=1;n<=y;n++) if(a[m" "n]~/RCS file|##Bug|##Fix|##[0-9]/) print gensub("##","","g",a[m" "n]);print label2}
}' file
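
Worth knowing if this is run outside gawk: gensub() is a gawk extension. A rough portable equivalent, assuming you only need the "##" markers stripped, is to copy the value into a variable and call gsub() on it (the sample string is made up):

```shell
# gsub() edits its third argument in place and is available in POSIX awk.
stripped=$(awk 'BEGIN {
  s = "##RCS file: demo\n##head: 1.174"   # made-up sample value
  gsub(/##/, "", s)                       # remove every "##" marker
  print s
}')
printf '%s\n' "$stripped"
```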

alister · February 14, 2011, 8:07pm

A couple of things to keep in mind, in case the solutions don't work for the OP:

Using a regular expression or multi-character string as RS is a gawk extension; POSIX leaves the result unspecified when RS is longer than one character.
Since at least one field is free text, it's probably a good idea to anchor the FS regular expression.

Regards,
Alister
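
For anyone without gawk, here is a minimal line-by-line sketch that avoids a regex RS and anchors both separators to the whole line. It is not one of the posted solutions. It assumes each revision block is laid out as in the sample (separator, "revision", "date:", then the message), and that a match means the first message line starts with "Bug " or "Fix " followed by a digit, or with a digit. The function name filter_cvs_log is made up for illustration.

```shell
# Hypothetical helper: reads a CVS log from stdin or from file arguments.
filter_cvs_log() {
  awk '
    # Print the buffered revision block only if its message matched.
    function emit() { if (inrev && keep) printf "%s", buf; buf = ""; inrev = 0; keep = 0 }

    /^-----+$/ { emit(); inrev = 1; n = 0; buf = $0 "\n"; next }  # revision separator
    /^=====+$/ { emit(); print; next }                            # file separator
    inrev {
      buf = buf $0 "\n"
      # Line 1 is "revision ...", line 2 is "date: ...", line 3 starts the message.
      if (++n == 3 && $0 ~ /^((Bug|Fix) )?[0-9]/) keep = 1
      next
    }
    { print }                                                     # header lines
    END { emit() }
  ' "$@"
}

# Trimmed sample taken from the question.
out=$(filter_cvs_log <<'EOF'
RCS file: /cvsroot/eclipse/org.eclipse.debug.core/core/org/eclipse/debug/core/DebugPlugin.java,v
description:
----------------------------
revision 1.149
date: 2007-04-16 11:06:45 -0500;  author: darin;  state: Exp;  lines: +51 -132;
Bug 178902 Setting Stop in main does not stop when launched
----------------------------
revision 1.148
date: 2007-03-26 20:47:29 -0500;  author: darin;  state: Exp;  lines: +1 -1;
update copyrights
=============================================================================
EOF
)
printf '%s\n' "$out"
```

Because the separators are matched against whole lines, a dash or equals sign inside a free-text message cannot split a block; only a message line made up entirely of dashes or equals signs would still be misread.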

Chubler_XL · February 14, 2011, 8:36pm

Agreed. This should anchor things down, and it also keeps the ===== and ----- delimiters from the original file.

awk '
BEGIN{RS="=============================================================================\n";
 FS="----------------------------";OFS=FS}
# Always pass data through "%s": a "%" in a log message would otherwise be read as a format specifier.
{for (i=1;i<=NF;i++) {if($i~/^RCS file:/) printf "%s", $1; if($i~/[^\n]*\n[^\n]*\n(Bug |Fix |)[0-9]/) printf "%s%s", OFS, $i} printf "%s", RS} ' infile

sandeepk1611 · February 15, 2011, 11:07am

Thanks everyone for the replies. I will try out these solutions on the data I have.

@Chubler_XL,

I have a question. Why did you first assign FS="----------------------------" and then OFS=FS? Can you explain that?

Thanks,
Sandeep

Chubler_XL · February 15, 2011, 3:50pm

It's just a leftover from rdcwayx's original code. You could get rid of OFS=FS and use FS directly in the printf statement.
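
Some background on why the assignment is optional there: awk only inserts OFS on its own when print is given comma-separated arguments, or when $0 is rebuilt after a field assignment. With printf or plain concatenation, OFS is just another variable. A small sketch with a made-up input line:

```shell
# print with a comma puts OFS between the arguments.
with_comma=$(echo 'a:b' | awk -F: -v OFS='--' '{ print $1, $2 }')
# Plain concatenation (no comma) does not use OFS at all.
concat=$(echo 'a:b' | awk -F: -v OFS='--' '{ print $1 $2 }')
# Assigning to a field rebuilds $0 with OFS between the fields.
rebuilt=$(echo 'a:b' | awk -F: -v OFS='--' '{ $1 = $1; print }')

printf '%s\n%s\n%s\n' "$with_comma" "$concat" "$rebuilt"
```

This prints a--b, ab and a--b, in that order.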