I have the input below. I tried to use the awk below but it seems that it's not working. Can anybody help?
My idea is to find the 2nd field of the last occurrence of a pattern like " ** XXX ccc ccc cc cc ccc 2007 ". In this case, the 2nd field is " XXX ". With this "XXX" term stored in a variable, I want to print all lines whose 2nd field is " XXX ", plus the subsequent lines containing " k= ". The expected output is highlighted in bold red in the input.
Input:
wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1
Output:
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
Actually my main problem is assigning the 2nd field of the last occurrence of a line that follows this pattern " ** XXX ccc ccc cc cc ccc 2007 ". I could have done it with the END option shown below, but I can't because I need to use the Field2 variable in the lines that come before END runs. Can anybody help?
I often think that awk's automatic read can get in the way (perhaps it just gets in the way of my thought processes).
Here's a perl script that produces your specified output from the given input:
#!/usr/bin/perl
# @(#) p1 Demonstrate extraction after pattern match.
use warnings;
use strict;
my ($debug);
$debug = 1;
$debug = 0;
my ($lines) = 0;
my (@a);
# Read until ** XXX, then turn over control to function to scan
# for other pattern.
while (<>) {
    $lines++;
    chomp;
    @a = split;
    if ( $a[0] eq "**" && $a[1] eq "XXX" ) {
        print " Found XXX line at $.\n" if $debug;
        print "$_\n";
        last if not extract_k();
    }
}
print STDERR " ( Lines read: $lines )\n";
# Extract k= lines until line with "**".
sub extract_k {
    my (@a);
    while (<>) {
        chomp();
        @a = split;
        return 1 if $a[0] eq "**"; # not EOF
        print "$_\n" if /k=/;
    }
    return 0; # EOF
}
exit(0);
Running against data in file data1:
% ./p1 data1
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
( Lines read: 13 )
This makes the assumption that the ** lines alternate; more work will be necessary if that's wrong ... cheers, drl
I got the following output.
I renamed your perl code to "myperl" and the input file to "input".
But there seems to be a problem. Can you give me some guidance?
No, I cannot reproduce your failed result with perl code p1. I have amended and extended the code (calling it p2), added a few lines to the data to make sure that consecutive ** XXX line series will be handled, and ran it as you did:
#!/usr/bin/perl
# @(#) p2 Demonstrate extraction after pattern match.
use warnings;
use strict;
my ($debug);
$debug = 1;
$debug = 0;
our ($lines) = 0;
my (@a);
# Read until ** XXX, then turn over control to function to scan
# for other pattern.
while (<>) {
    $lines++;
    chomp;
    @a = split;
    if ( $a[0] eq "**" && $a[1] eq "XXX" ) {
        print " Found XXX line at $.\n" if $debug;
        print "$_\n";
        # last if not extract_k();
        $_ = extract_k();
        if ( not $_ ) {
            last;
        }
        else {
            print " cycling with line $. ", $_ if $debug;
            redo;
        }
    }
}
print STDERR " ( Lines read: $lines )\n";
# Extract k= lines until line with "**".
sub extract_k {
    our ($lines);
    my (@a);
    while (<>) {
        $lines++;
        chomp();
        @a = split;
        return "$_\n" if $a[0] eq "**"; # not EOF
        print "$_\n" if /k=/;
    }
    return 0; # EOF
}
exit(0);
Producing:
% perl p2 data2
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
** XXX ccc ccc cc cc ccc 2007
0006 k= 1
0007 k= 1
( Lines read: 32 )
If you cannot get my code to work, then it looks like the awk script from ghostdog74 will work -- and it's far shorter than the perl code.
Your code works!!
But I don't really understand the FNR==NR statement. Can you help me understand it?
And also, why is there a need to give the same input file twice for this awk code?
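I am bad at explaining, so please read the awk man page for the definitions of FNR and NR. In short, NR counts the records read so far across all input files, while FNR counts the records read from the current file only.
So FNR==NR is true only while awk is going over the first file's records. Naming the same input file twice is a way to get back to the beginning of the file (unless there's another way), because awk processes records in the forward direction only.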
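To make the two-pass idea concrete, here is a minimal sketch of it. This is not the original script from the thread, only my reading of the idiom, and the file name demo_input.txt and the variable names key and show are made up for the example. The first pass remembers field 2 of the last "**" line, and the second pass prints the matching "**" lines together with the "k=" lines that follow them.

```shell
#!/bin/sh
# Hypothetical sample data, shortened from the input in this thread.
cat > demo_input.txt <<'EOF'
wwwwww
0999 k= 1
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
wwwwww
** abc ccc cc cc cc cc 2007
0001 k= 1
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
EOF

# The same file is named twice, so awk reads it twice in a row.
result=$(awk '
    # Pass 1: FNR == NR holds only while the first file argument is read.
    # Remember field 2 of every "**" line; the last one seen wins.
    FNR == NR {
        if ($1 == "**") key = $2
        next
    }

    # Pass 2: each "**" line switches printing on or off, depending on
    # whether its second field equals the key found in pass 1.
    $1 == "**" { show = ($2 == key) }

    # Print the matching "**" lines and the "k=" lines inside their blocks.
    show && ($1 == "**" || /k=/)
' demo_input.txt demo_input.txt)

printf '%s\n' "$result"
```

Because the key is only known once the whole file has been read, the printing has to wait for the second pass; that is the reason for naming the file twice.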
Thanks for your explanation. But it still seems abstract to me.
Can I say that this statement " FNR==NR&&/^\*\*/{line=$2;next} " will only operate on the 1st input file, since the conditions in the rest of the code are false for the 1st input file?
As for the next part of the code starting with " FNR!=NR&&$0~line .... ", will it only operate on the 2nd input file?
Am I understanding it correctly?
But what would the values of FNR and NR be when operating on the 1st input file?
And what would the values of FNR and NR be when operating on the 2nd input file?
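One way to see the values for yourself is to let awk print both counters for every record. A small sketch, assuming a throwaway 3-line file (nr_demo.txt is a made-up name) that is passed twice:

```shell
#!/bin/sh
# Hypothetical 3-line file, read twice by the same awk invocation.
printf 'a\nb\nc\n' > nr_demo.txt

# NR keeps counting across both reads; FNR restarts at 1 for each file
# argument, so FNR == NR is true only during the first read.
trace=$(awk '{
    printf "NR=%d FNR=%d pass=%s\n", NR, FNR, (FNR == NR ? "first" : "second")
}' nr_demo.txt nr_demo.txt)

printf '%s\n' "$trace"
# NR=1 FNR=1 pass=first
# NR=2 FNR=2 pass=first
# NR=3 FNR=3 pass=first
# NR=4 FNR=1 pass=second
# NR=5 FNR=2 pass=second
# NR=6 FNR=3 pass=second
```

So on the 1st input file the two counters are always equal, and on the 2nd one NR is ahead of FNR by the number of records in the first file.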
Thanks for your code. But we do not know the value of the 2nd field (which is XXX here) in advance, so your code can't be applied here.
Hi GhostDog,
I have a little problem here. I have added some more data to my input file (highlighted in blue).
If the 2nd field of the last occurrence of this pattern " ** abc ccc cc cc cc cc 2007 " does not start with " XX ", then the output below is expected (that is, only the very last block that matches the pattern will be printed).
Can you help?
Input:
wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1
** abc ccc cc cc cc cc 2007
0001 k= 1
wwwwww
0002 k= 1
0003 k= 1
Output:
** abc ccc cc cc cc cc 2007
0001 k= 1
0002 k= 1
0003 k= 1
Thanks for your emphasis on the function of your notation "XXX". Here is an amended perl script to account for that:
#!/usr/bin/perl
# @(#) p3 Demonstrate extraction after pattern match.
use warnings;
use strict;
my ($debug);
$debug = 1;
$debug = 0;
our ($lines) = 0;
my ($key_string);
my (@a);
# Get second field of first line that begins with **, as in:
#
# ** XXX
#
# then use that second field as the key_string. Anytime that
# key_string appears, we begin scanning for "k=" lines, and only
# stopping when another "**" line appears.
#
# So, read until ** XXX, then turn over control to function to
# scan for other pattern.
my ($first) = 1;
while (<>) {
    if ($first) {
        @a = split;
        if ( $a[0] eq "**" ) {
            $key_string = $a[1];
            $first = 0;
            redo;
        }
    }
    $lines++;
    chomp;
    @a = split;
    if ( $a[0] eq "**" && $a[1] eq $key_string ) {
        print " Found XXX line at $.\n" if $debug;
        print "$_\n";
        # last if not extract_k();
        $_ = extract_k();
        if ( not $_ ) {
            last;
        }
        else {
            print " cycling with line $. ", $_ if $debug;
            # Adjust line count to avoid counting twice.
            $lines--;
            redo;
        }
    }
}
print STDERR " ( Lines read: $lines )\n";
# Extract k= lines until line with "**".
sub extract_k {
    our ($lines);
    my (@a);
    while (<>) {
        $lines++;
        chomp();
        @a = split;
        return "$_\n" if $a[0] eq "**"; # not EOF
        print "$_\n" if /k=/;
    }
    return 0; # EOF
}
exit(0);
Running on the new data in file data3:
% ./p3 data3
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
( Lines read: 29 )
It seems that I am pretty close to my target.
But there's still a constraint. If the term " ** abc ccc ccc cc cc ccc 2007 " occurs more than 2 times, every block from the 2nd one onwards will be output because of these 2 statements.
occur++;
if (occur > 1) print;
Is there any way I could find out the final value of the " occur " variable and make sure that only the last occurrence will be printed out?
Scenario 1:
Here the last occurrence of the pattern is " ** XXX ccc ccc cc cc ccc 2007 ". In this pattern, the 2nd field is " XXX ". Since the first 2 characters of the 2nd field are " XX ", it will print out all occurrences of such patterns and the lines containing " k= " that appear after each of them.
Input:
wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1
Output:
** XXX ccc ccc cc cc ccc 2007
0001 k= 1
0002 k= 1
** XXX ccc ccc cc cc ccc 2007
0003 k= 1
0004 k= 1
0005 k= 1
Scenario 2:
Here the last occurrence of the pattern is " ** abc ccc cc cc cc cc 2007 ". In this pattern, the 2nd field is " abc ". Since the first 2 characters of the 2nd field DO NOT match " XX ", it will only print out the last occurrence of this pattern and the lines containing " k= " that appear after it.
Input:
wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1
** abc ccc cc cc cc cc 2007
0001 k= 1
wwwwww
0002 k= 1
0003 k= 1
Output:
** abc ccc cc cc cc cc 2007
0001 k= 1
0002 k= 1
0003 k= 1
My code below actually does the trick, but if " ** abc ccc cc cc cc cc 2007 " appears more than 2 times in the input file, it does not work any more. Please help me.
I thought that I understood the requirements about matching the second field of the "**" lines, but now you're writing about specifically matching (only) 2 characters. This is confusing to me ... cheers, drl
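For what it's worth, if the two scenarios are read literally, they can be handled without counting occurrences at all. Here is a rough two-pass awk sketch under that reading; it is not code from this thread, and the function name extract_blocks, the variables key, lastrec and show, and the file names scen1.txt and scen2.txt are all made up for the example. Pass 1 records field 2 and the record number of the last "**" line; pass 2 prints every block whose 2nd field equals the key when the key starts with "XX", and otherwise only the block that starts at the recorded line.

```shell
#!/bin/sh
# Scenario 1 input, copied from the post above.
cat > scen1.txt <<'EOF'
wwwwww
0999 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
wwwwww
0001 k= 1
wwwwww
0002 k= 1
** abc ccc cc cc cc cc 2007
wwwwww
0001 k= 1
wwwwww
0002 k= 1
wwwwww
wwwwww
0003 k= 1
wwwwww
** XXX ccc ccc cc cc ccc 2007
wwwwww
0003 k= 1
wwwwww
0004 k= 1
0005 k= 1
EOF

# Scenario 2 input: the same data plus a trailing "** abc" block.
{
    cat scen1.txt
    printf '%s\n' '** abc ccc cc cc cc cc 2007' '0001 k= 1' 'wwwwww' \
        '0002 k= 1' '0003 k= 1'
} > scen2.txt

# extract_blocks FILE: read FILE twice with one awk invocation.
extract_blocks() {
    awk '
        # Pass 1: remember field 2 and the line number of the last "**" line.
        FNR == NR {
            if ($1 == "**") { key = $2; lastrec = FNR }
            next
        }
        # Pass 2: decide at every "**" line whether its block is wanted.
        $1 == "**" {
            if (key ~ /^XX/) show = ($2 == key)       # all blocks with the key
            else             show = (FNR == lastrec)  # only the last block
        }
        # Print wanted "**" lines and the "k=" lines inside their blocks.
        show && ($1 == "**" || /k=/)
    ' "$1" "$1"
}

extract_blocks scen1.txt
echo "----"
extract_blocks scen2.txt
```

On the data from the post this should reproduce the two expected outputs, with the "----" line only separating the two runs. If the real rule is different from "the 2nd field starts with XX", only the test inside the pass 2 block would need to change.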