Extract if pattern matches

Raynon · October 19, 2007, 10:36am

Hi All,

I have the input below. I tried the awk code below, but it's not working. Can anybody help?
My idea is to find the 2nd field of the last line matching the pattern " ** XXX ccc ccc cc cc ccc 2007 ". In this case the 2nd field is " XXX ". With "XXX" stored in a variable, I want to print every line whose 2nd field is "XXX", plus the lines that follow each of them and contain " k= ". The expected output is shown after the input.

Input:

wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1

Output:

** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1

My AWK code:

$NF == "2007" && $1 == "**" && NF == "8" {Field2 = $2}

$1 == "**" && $8 == "2007" && $2 == Field2   {
print ;
flag = 1;
next;
}
flag == 1 && $2 ~ /k=/ {print}

$1 == "**" && $8 == "2007" && $2 != Field2 {flag = 0}

Raynon · October 19, 2007, 8:28pm

Hi All,

Actually, my main problem is assigning the 2nd field of the last line that follows the pattern " ** XXX ccc ccc cc cc ccc 2007 ". I could have done it with the END block shown below, but I can't, because I need to use the Field2 variable in the lines of code that follow. Can anybody help?

{$NF == "2007" && $1 == "**" && NF == "8"} END {Field2 = $2}
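
A note on the END attempt: the END block runs only after the last record has been read, so a value assigned there arrives too late for the main rules. One way to stay in a single pass, sketched here on the assumption that the whole input fits in memory, is to buffer the lines and do the printing inside END. The sample data is a shortened version of the input above, and the names buf, key, show and f are made up for illustration.

```shell
#!/bin/sh
# Shortened sample of the input from the first post.
data=$(mktemp)
cat > "$data" <<'EOF'
0999 k= 1
** XXX ccc ccc cc cc ccc 2007
wwwwww
0001 k= 1
** abc ccc cc cc cc cc 2007
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
wwwwww
0004 k= 1
EOF

result=$(awk '
    { buf[NR] = $0 }                            # remember every line
    $1 == "**" && $NF == "2007" { key = $2 }    # the last header seen wins
    END {
        # The key is final now, so replay the buffered lines.
        for (i = 1; i <= NR; i++) {
            split(buf[i], f, " ")
            if (f[1] == "**") {
                show = (f[2] == key)            # switch printing on or off
                if (show) print buf[i]
            } else if (show && f[2] == "k=") {
                print buf[i]
            }
        }
    }
' "$data")
rm -f "$data"
printf '%s\n' "$result"
```

The trade-off is memory: every line is held until END, whereas reading the file twice keeps nothing but the key.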

drl · October 19, 2007, 9:14pm

Hi.

I often think that awk's automatic read can get in the way (perhaps it just gets in the way of my thought processes).

Here's a perl script that produces your specified output from the given input:

#!/usr/bin/perl

# @(#) p1       Demonstrate extraction after pattern match.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

my ($lines) = 0;

my (@a);

# Read until ** XXX, then turn over control to function to scan
# for other pattern.

while (<>) {
  $lines++;
  chomp;
  @a = split;
  if ( $a[0] eq "**" && $a[1] eq "XXX" ) {
    print " Found XXX line at $.\n" if $debug;
    print "$_\n";
    last if not extract_k();
  }
}

print STDERR " ( Lines read: $lines )\n";

# Extract k= lines until line with "**".

sub extract_k {
  my (@a);
  while (<>) {
    chomp();
    @a = split;
    return 1 if $a[0] eq "**";    # not EOF
    print "$_\n" if /k=/;
  }
  return 0;                       # EOF
}

exit(0);

Running against data in file data1:

% ./p1 data1
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
 ( Lines read: 13 )

This makes the assumption that the ** lines alternate; more work will be necessary if that's wrong ... cheers, drl

Raynon · October 19, 2007, 9:28pm

Hi drl,

I got the following output.
I renamed your Perl script to "myperl" and the input file to "input".
But there seems to be a problem. Can you give some guidance?

$ perl myperl input
 ( Lines read: 24 )

ghostdog74 · October 19, 2007, 10:33pm

awk 'FNR==NR&&/^\*\*/{line=$2;next}
     FNR!=NR&&$0~line{
      print 
      f=1
     }
     f&&$0~/^\*\*/{ 
       if($2 !~ line) f=0
     }
     f&&$2=="k="{print}
' "file" "file"

output;

# ./test.sh
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1

drl · October 19, 2007, 11:25pm

Hi, Raynon.

No, I cannot reproduce your failed result with perl code p1. I have amended and extended the code (calling it p2), added a few lines to the data to make sure that consecutive ** XXX line series will be handled, and ran it as you did:

#!/usr/bin/perl

# @(#) p2       Demonstrate extraction after pattern match.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

our ($lines) = 0;

my (@a);

# Read until ** XXX, then turn over control to function to scan
# for other pattern.

while (<>) {
  $lines++;
  chomp;
  @a = split;
  if ( $a[0] eq "**" && $a[1] eq "XXX" ) {
    print " Found XXX line at $.\n" if $debug;
    print "$_\n";

    # last if not extract_k();
    $_ = extract_k();
    if ( not $_ ) {
      last;
    }
    else {
      print " cycling with line $. ", $_ if $debug;
      redo;
    }
  }
}

print STDERR " ( Lines read: $lines )\n";

# Extract k= lines until line with "**".

sub extract_k {
  our ($lines);
  my (@a);
  while (<>) {
    $lines++;
    chomp();
    @a = split;
    return "$_\n" if $a[0] eq "**";    # not EOF
    print "$_\n" if /k=/;
  }
  return 0;                            # EOF
}

exit(0);

Producing:

% perl p2 data2
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
** XXX ccc ccc cc cc ccc 2007
0006 k= 1
0007 k= 1
 ( Lines read: 32 )

If you cannot get my code to work, then it looks like the awk script from ghostdog74 will work -- and it's far shorter than the perl code.

Best wishes ... cheers, drl

Raynon · October 20, 2007, 1:25am

Hi GhostDog,

Your code works!
But I don't really understand the FNR==NR condition. Can you help me understand it?
Also, why does this awk code need the same input file twice?

ghostdog74 · October 20, 2007, 3:30am

I am bad at explaining; please read the awk man page for the definitions of FNR and NR.

So FNR==NR roughly means "while going over the first file's records". Passing the input file twice is a way to get back to the beginning of the file (unless there's another way), because awk processes records in the forward direction only.
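
To make that concrete, here is a small sketch (not from the thread) that reads a three-line sample file twice. NR counts records across all files, while FNR restarts at 1 for each file, so the two are equal only during the first pass. The sample lines and the variable name key are assumptions made for the demonstration.

```shell
#!/bin/sh
data=$(mktemp)
printf '%s\n' '** XXX ccc 2007' '0001 k= 1' '** abc ccc 2007' > "$data"

result=$(awk '
    # Pass 1: NR and FNR are equal only while the first file is read.
    FNR == NR {
        if ($1 == "**") key = $2    # ends up holding the last header key
        next                        # skip the pass-2 rules below
    }
    # Pass 2: FNR has restarted at 1, NR keeps counting.
    FNR == 1 { print "last key seen in pass 1: " key }
    { print "NR=" NR " FNR=" FNR ": " $0 }
' "$data" "$data")
rm -f "$data"
printf '%s\n' "$result"
```

Run on the sample, this prints the key abc first and then the three lines of the second pass with NR running from 4 to 6 and FNR from 1 to 3. Note that the idiom assumes the first file is not empty; with an empty first file, FNR==NR would also be true throughout the second one.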

Raynon · October 20, 2007, 4:46am

Hi GhostDog,

Thanks for your explanation, but it still seems abstract to me.

Can I say that the statement " FNR==NR&&/^\*\*/{line=$2;next} " operates only on the 1st input file, since the conditions in the rest of the code are false for the 1st input file?

And the next part of the code, starting with " FNR!=NR&&$0~line .... ", operates only on the 2nd input file?

Am I understanding it correctly?

And what are the values of FNR and NR while the 1st input file is being processed?
And what are they while the 2nd input file is being processed?

ghostdog74 · October 20, 2007, 10:10am

It's very easy to understand if you just print them out!

awk 'FNR==NR{print "File processing now: " FILENAME "  FNR: "FNR " NR: "NR ;print $0;next}
     { print "File processing now: " FILENAME " NR: " NR " FNR:  " FNR " : "$0 }
' "file1" "file2"

Raynon · October 20, 2007, 8:06pm

Hi GhostDog,

Thanks for that! It really helps me understand the code better.

summer_cherry · October 21, 2007, 10:12pm

Hi,

I think this one should be ok for you!

code:

awk 'BEGIN{flag=0}
{
if ($2=="XXX")
{
	print
	flag=1
}
if ($1=="**" && $2!="XXX")
	flag=0
if (flag==1 && $2=="k=")
	print
}' filename

Raynon · October 21, 2007, 11:50pm

Hi Summer,

Thanks for your code. But we do not know the value of the 2nd field (XXX here) in advance, so your code doesn't apply here.

Hi GhostDog,

I have a little problem here. I have added some more data to my input file (the last five lines of the input below).
If the 2nd field of the last occurrence of a pattern such as " ** abc ccc cc cc cc cc 2007 " does not start with " XX ", then the output should be as below (that is, only the very last block matching the pattern is printed).

Can you help ?

Input:

wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1
** abc ccc cc cc cc cc 2007
0001 k= 1
wwwwww
0002 k= 1
0003 k= 1

Output:

** abc ccc cc cc cc cc 2007
0001 k= 1
0002 k= 1
0003 k= 1

drl · October 22, 2007, 12:31am

Hi, Raynon.

Thanks for your emphasis on the function of your notation "XXX". Here is an amended perl script to account for that:

#!/usr/bin/perl

# @(#) p3       Demonstrate extraction after pattern match.

use warnings;
use strict;

my ($debug);
$debug = 1;
$debug = 0;

our ($lines) = 0;
my ($key_string);

my (@a);

# Get second field of first line that begins with **, as in:
#
# ** XXX
#
# then use that second field as the key_string. Anytime that
# key_string appears, we begin scanning for "k=" lines, and only
# stopping when another "**" line appears.
#
# So, read until ** XXX, then turn over control to function to
# scan for other pattern.

my ($first) = 1;
while (<>) {
  if ($first) {
    @a = split;
    if ( $a[0] eq "**" ) {
      $key_string = $a[1];
      $first      = 0;
      redo;
    }
  }
  $lines++;
  chomp;
  @a = split;
  if ( $a[0] eq "**" && $a[1] eq $key_string ) {
    print " Found XXX line at $.\n" if $debug;
    print "$_\n";

    # last if not extract_k();
    $_ = extract_k();
    if ( not $_ ) {
      last;
    }
    else {
      print " cycling with line $. ", $_ if $debug;

      # Adjust line count to avoid counting twice.
      $lines--;
      redo;
    }
  }
}

print STDERR " ( Lines read: $lines )\n";

# Extract k= lines until line with "**".

sub extract_k {
  our ($lines);
  my (@a);
  while (<>) {
    $lines++;
    chomp();
    @a = split;
    return "$_\n" if $a[0] eq "**";    # not EOF
    print "$_\n" if /k=/;
  }
  return 0;                            # EOF
}

exit(0);

Running on the new data in file data3:

% ./p3 data3
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
 ( Lines read: 29 )

cheers, drl

( edit 1: corrected line count )

Raynon · October 22, 2007, 4:12am

Hi GhostDog,

It seems I am pretty close to my target.
But there's still a constraint. If the line " ** abc ccc ccc cc cc ccc 2007 " occurs more than 2 times, every block from the 2nd one onwards is printed, because of these 2 statements:
occur++;
if (occur > 1) print;
Is there any way to find the final value of the " occur " variable, so that only the last occurrence is printed?

FNR==NR&&/^\*\*/{line=$2; CODE = substr ($2,1,2); next}

FNR != NR && $0 ~ line {
      print
      flag=1
}
flag == 1 && $0 ~ /^\*\*/ && CODE == "XX" {
      if ($2 !~ line) flag=0
}
flag == 1 && $2 == "k=" {print}

FNR != NR && $2 ~ line && CODE != "XX" {
      flag=2;
      occur++;
      if (occur > 1) print;
}
flag==2 && occur > 1 && $2 == "k=" { print }
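
On the question of knowing the final value of the counter: since the file is read twice anyway, one possibility is to count the headers per key during the first pass and compare against that total during the second. The sketch below assumes exact matching on the 2nd field, and the names total, seen and show are invented for the example.

```shell
#!/bin/sh
data=$(mktemp)
cat > "$data" <<'EOF'
** abc ccc cc cc cc cc 2007
0001 k= 1
** abc ccc cc cc cc cc 2007
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0003 k= 1
EOF

result=$(awk '
    # Pass 1: remember the last key and how often each key occurs.
    FNR == NR {
        if ($1 == "**") { key = $2; total[$2]++ }
        next
    }
    # Pass 2: print a header only when it is the final one for the key.
    $1 == "**" {
        show = 0
        if ($2 == key && ++seen == total[key]) { show = 1; print }
        next
    }
    show && $2 == "k="              # k= lines of the selected block
' "$data" "$data")
rm -f "$data"
printf '%s\n' "$result"
```

With three "** abc" headers in the sample, only the third one and the k= line after it are printed.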

ghostdog74 · October 22, 2007, 5:39am

So "XXX" is actually what you want to get?

awk 'FNR==NR&&/^\*\*/&&$2=="XXX"{line=$2;next}
     FNR!=NR&&$0~line{
      print 
      f=1
     }
     f&&$0~/^\*\*/{ 
       if($2 !~ line) f=0
     }
     f&&$2=="k="{print}
' "file" "file"

output:

# ./testnew.sh
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1

Raynon · October 22, 2007, 8:07am

Let me illustrate with 2 examples.

Scenario 1:
Here the last occurrence of the pattern is " ** XXX ccc ccc cc cc ccc 2007 ", whose 2nd field is " XXX ". Since the first 2 characters of the 2nd field are " XX ", all occurrences of this pattern are printed, along with the lines containing " k= " that follow each of them.

Input:
wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1

Output:

** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1

Scenario 2:
Here the last occurrence of the pattern is " ** abc ccc cc cc cc cc 2007 ", whose 2nd field is " abc ". Since the first 2 characters of the 2nd field DO NOT match " XX ", only the last occurrence of this pattern is printed, along with the lines containing " k= " that follow it.

Input:

wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1
** abc ccc cc cc cc cc 2007
0001 k= 1
wwwwww
0002 k= 1
0003 k= 1

Output:

** abc ccc cc cc cc cc 2007
0001 k= 1
0002 k= 1
0003 k= 1

My code below does the trick, but if " ** abc ccc cc cc cc cc 2007 " appears more than 2 times in the input file, it no longer works. Please help me.

FNR==NR&&/^\*\*/{line=$2; CODE = substr ($2,1,2); next}

FNR != NR && $0 ~ line {
      print
      flag=1
}
flag == 1 && $0 ~ /^\*\*/ && CODE == "XX" {
      if ($2 !~ line) flag=0
}
flag == 1 && $2 == "k=" {print}

FNR != NR && $2 ~ line && CODE != "XX" {
      flag=2;
      occur++;
      if (occur > 1) print;
}
flag==2 && occur > 1 && $2 == "k=" { print }

ghostdog74 · October 22, 2007, 8:22am

If you run the amended code I posted in #16, what happens?

drl · October 22, 2007, 9:37am

Hi.

I thought that I understood the requirements about matching the second field of the "**" lines, but now you're writing about specifically matching (only) 2 characters. This is confusing to me ... cheers, drl

Raynon · October 22, 2007, 8:48pm

Hi GhostDog,

It all depends on the input, as I mentioned earlier.

Your code gives the output below with the input (file2) from scenario 2.
This output would be correct for the input (file1) from scenario 1.

% nawk -f awking file2 file2
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1

But I would expect the output below for the input (file2) from scenario 2.

** abc ccc cc cc cc cc 2007
0001 k= 1
0002 k= 1
0003 k= 1
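
Both scenarios can probably be handled by the same two-pass idea. The sketch below is one reading of the requirement, not a tested solution from the thread: the first pass records the key and line number of the last "**" header, and the second pass prints every block with that key when it starts with "XX", or only the final block otherwise. The function name extract and the variables key, last, all and show are hypothetical, the sample files are shortened versions of the two scenarios, and the 2nd field is compared exactly rather than as a regular expression.

```shell
#!/bin/sh
# extract FILE: hypothetical helper that reads FILE twice.
extract() {
    awk '
        # Pass 1: key and line number of the last "**" header.
        FNR == NR {
            if ($1 == "**") { key = $2; last = FNR }
            next
        }
        # Pass 2, first line: print all blocks only for an "XX" key.
        FNR == 1 { all = (substr(key, 1, 2) == "XX") }
        $1 == "**" {
            show = ($2 == key && (all || FNR == last))
            if (show) print
            next
        }
        show && $2 == "k="
    ' "$1" "$1"
}

tmp=$(mktemp -d)
# Scenario 1: the last header has key XXX.
cat > "$tmp/file1" <<'EOF'
0999 k= 1
** XXX ccc ccc cc cc ccc 2007
wwwwww
0001 k= 1
** abc ccc cc cc cc cc 2007
0001 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
EOF
# Scenario 2: same data plus a trailing abc block.
cp "$tmp/file1" "$tmp/file2"
cat >> "$tmp/file2" <<'EOF'
** abc ccc cc cc cc cc 2007
0001 k= 1
wwwwww
0002 k= 1
EOF

out1=$(extract "$tmp/file1")
out2=$(extract "$tmp/file2")
rm -rf "$tmp"
printf 'Scenario 1:\n%s\nScenario 2:\n%s\n' "$out1" "$out2"
```

For the first file this prints both XXX blocks; for the second, only the final abc block with its two k= lines.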