print chunk of lines only if there is a pattern match in between them

Hi All,
Please find the sample file below:

NAME                                                                                               ID NUMBER
--------------------------------------------------------------------------------------------------       ---------
abcdefgheija;lksdf                                                                                 11000000 
*** LOCKED *** 
    PARENT                            3887                                                        
    TRAN NO:                                03                                                      
    PARENT 
    PARENT 
NAME                                                                                               ID NUMBER
-------------------------------------------------------------------------------------------------- ---------
bbcdeqgheija;lksdf                                                                                 11300000 
                                            
    DATE:                            02-MAR-2010                                     
    SOURCE CORRECTION:                      Y                                                             
NAME                                                                                               ID NUMBER
-------------------------------------------------------------------------------------------------- ---------
vbcdewgheija;lksdf                                                                                 10200000 
                                           
    DATE:                            15-DEC-2010                                      
    
NAME                                                                                               ID NUMBER
-------------------------------------------------------------------------------------------------- ---------
cvbcdeegheija;lksdf                                                                                10300000 
                                           
*** LOCKED *** 
    MIDDLE INITIAL:                                                                                    
    DATE:                            03-JUN-2010                      
<<<<< EOF>>>>

I am trying to print the lines starting from name till next name is encountered, but only if "LOCKED" term is present in the details of that person.

Any input or pseudocode for a Perl script is welcome. Right now my code either prints the whole file or prints only the lines containing the LOCKED term.

Thanks in advance,
Niel.

Try:

perl -n0e 'while (/NAME.*?((?=NAME)|(?=$))/sg) {$x=$&;print $x if $x=~/LOCKED/}' file

Hi Bartus11,
thanks very much for the code.
I am new to Perl, so it would be a great help if you could explain your code to me.

Thanks,
Niel.

Would something kludgy work?:

grep -B4 -A3 LOCKED <file>

Play with the -B (before) and -A (after)

Well I don't know if it would make much sense, because this code is using some quite advanced regex techniques, like look ahead or sequential matching, which require some regex understanding. If you are still interested I can break it down for you.

After giving it a bit of thought I decided to do it anyway :)

perl -n0e 'while (/NAME.*?((?=NAME)|(?=$))/sg) {$x=$&;print $x if $x=~/LOCKED/}' file

-n - wrap the script in a loop that reads the input one record at a time into the $_ variable (nothing is printed unless the script prints it)
-0 - set the input record separator to the null character, so a text file containing no null bytes is read into $_ as a single record. Without that perl would divide the file into lines and process them one by one
-e - execute the script given on the command line

while (/NAME.*?((?=NAME)|(?=$))/sg) - keep going through the $_ variable (g option), matching blocks of it that start with "NAME" and that have "NAME" right after their end. This is the look ahead part, (?=NAME). To also match the last block in the variable (file), which starts with "NAME" but has no "NAME" after it, there is an alternative look ahead, (?=$), which means the end of the variable. The /s regex option allows . to match newline characters, which lets the regex match across multiple lines. .*? matches non-greedily all the characters between "NAME" and the next "NAME". If the ? was missing, the regex would perform a greedy match, which would match the whole variable in a single run.

The while's body is quite easy. $x=$& assigns the most recent match to the $x variable. This is done to avoid losing its contents when /LOCKED/ is run. So now $x consists of a block of lines extracted from the $_ variable (the file's contents) that starts with "NAME" and ends just before the next "NAME". This block is tested by the $x=~/LOCKED/ expression to check whether it contains the word LOCKED, and if it does, print $x prints it on screen.

Hi.

The non-standard utility cgrep -- "context grep" -- may be easier to understand. You probably know that standard GNU grep can list lines before and after the matched line. cgrep lets you specify the limits of a window to print around the matched line, and the limits can themselves be regular expressions. So:

cgrep pattern filename

works just like GNU grep, but

cgrep -w beginning-of-window-pattern +w end-of-window-pattern main-pattern filename

will extract blocks of text that contain main-pattern.

Here is a demo script that uses a width-decreased version of your data, and the core of the script uses cgrep as noted above. The other lines are supporting code, showing the data, environment, versions, etc.

#!/usr/bin/env bash

# @(#) s1	Demonstrate block extraction with embedded match constraint.
# cgrep home: http://sourceforge.net/projects/cgrep/

# Section 1, setup, pre-solution.
# Infrastructure details, environment, commands for forum posts. 
# Uncomment export command to test script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for i;do printf "%s" "$i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
C=$HOME/bin/context && [ -f $C ] && . $C cgrep
set -o nounset
pe

FILE=${1-data1}
# cut width to look better in demo.
cut -c 1-78 data0 > $FILE

# Section 2, display input file.
# Display sample of data file, with head & tail as a last resort.
pe " || start [ first:middle:last ]"
specimen $FILE \
|| { pe "(head/tail)"; head -n 5 $FILE; pe " ||"; tail -n 5 $FILE; }
pe " || end"

# Section 3, solution.
pl " Results:"
cgrep -D -w NAME +w NAME LOCKED $FILE

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.7 (lenny) 
GNU bash 3.2.39
cgrep - (local: ~/executable/cgrep May 29 2009 )

 || start [ first:middle:last ]
Edges: 5:0:5 of 28 lines in file "data1"
NAME                                                                          
------------------------------------------------------------------------------
abcdefgheija;lksdf                                                            
*** LOCKED *** 
    PARENT                            3887                                    
   ---
                                           
*** LOCKED *** 
    MIDDLE INITIAL:                                                           
    DATE:                            03-JUN-2010                      
<<<<< EOF>>>>
 || end

-----
 Results:
NAME                                                                          
------------------------------------------------------------------------------
abcdefgheija;lksdf                                                            
*** LOCKED *** 
    PARENT                            3887                                    
    TRAN NO:                                03                                
    PARENT 
    PARENT 
NAME                                                                          
NAME                                                                          
------------------------------------------------------------------------------
cvbcdeegheija;lksdf                                                           
                                           
*** LOCKED *** 
    MIDDLE INITIAL:                                                           
    DATE:                            03-JUN-2010                      
<<<<< EOF>>>>

The code needs to be obtained from the cgrep project page at SourceForge (http://sourceforge.net/projects/cgrep/) and compiled with a C compiler. I have done it on 32-bit and 64-bit systems without trouble. I have found cgrep to be very useful in situations like this.

Good luck ... cheers, drl

( Edit 1: clarification, minor typo )

awk '/NAME/{if(l)print s;l=0;s=$0;next}/LOCKED/{l=1}{s=s RS $0}END{if(l)print s}' infile
awk 'BEGIN{RS="NAME";FS="\n"} /LOCKED/ {print RS $0}' infile

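Note: POSIX awk only guarantees that RS works as a single character, so the second command relies on an extension (gawk and mawk support a multi-character RS).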
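If a multi-character RS is not available, the first command's approach still works everywhere. Below is a spelled-out sketch of that accumulate-and-flush idea; the function name locked_blocks_awk and the sample records are invented for the demo, and the header test is anchored with ^ (an addition, not in the one-liner above). It assumes the file starts with a NAME header line.

```shell
# Made-up sample data for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
NAME alice
*** LOCKED ***
  DATE: 01
NAME bob
  DATE: 02
NAME carol
*** LOCKED ***
EOF

# Hypothetical helper: portable awk, no multi-character RS needed.
locked_blocks_awk() {
    awk '
        /^NAME/ {                      # a header line starts a new block
            if (locked) print block    # flush the previous block if it had the marker
            locked = 0
            block = $0
            next
        }
        /LOCKED/ { locked = 1 }        # remember that this block contains the marker
        { block = block "\n" $0 }      # append the line to the current block
        END { if (locked) print block } # the last block has no following header
    ' "$1"
}

locked_blocks_awk "$sample"
```

The END rule is the part that is easy to forget: the last record is never followed by another NAME line, so it has to be flushed separately.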

Awesome explanation, bartus11.
It's clear now.

thanks again.

Hi Bartus11,

Could you write it as a small script instead of a command line?

I tried doing the same in a script, but did not succeed.

Please find below the script I used.

#!/usr/bin/perl 
use warnings;
#$path="H:\\01042011.txt";   
  
 open (T01,">H:\\T01.txt" ) || die ("Could not open file. $!");    
    $path=shift @ARGV;        
    
     open (FILE,$path) or die "can not open";                                  
           while (<FILE>) 
           {
             $line=$_;
               while (/ID NUMBER.*?((?=ID NUMBER)|   (?=$))/sg)
               {
               $x=$&;
               print $x if ($x=~/LOCKED/)
               print T01 ($x,"\n");               
               } 
            }
  
close (FILE);    
close (T01);    

Remember that the whole file has to be loaded into a single variable (the "local $/;" and "$_=<FILE>;" lines below). Your script is splitting it into separate lines. Try my code:

#!/usr/bin/perl 
use warnings;
#$path="H:\\01042011.txt"; 

open (T01,">H:\\T01.txt" ) || die ("Could not open file. $!"); 
$path=shift @ARGV; 
open (FILE,$path) or die "can not open";
local $/;
$_=<FILE>;
while (/ID NUMBER.*?((?=ID NUMBER)|(?=$))/sg){
  $x=$&;
  print $x if ($x=~/LOCKED/);
  print T01 ($x,"\n");
}
close (FILE); 
close (T01);

Thanks Bartus,

Your code is working great.
But I tried using ^NAME to match NAME only at the beginning of a line, and it did not give me any output.
This is what I used:

while (/^NAME.*?((?=^NAME)|(?=$))/sg){

I used this because I need to capture the details starting from a NAME at the beginning of a line until another NAME at the beginning of a line is encountered.

Let's say the given file has the following data.
It has NAME at the beginning of a line and in the middle as well, and also unnecessary free-text lines (marked in red in the original post), such as "This line is an excess line."

NAME                                                                                               ID NUMBER
--------------------------------------------------------------------------------------------------       ---------
abcdefgheija;lksdf                                                                                 11000000 
*** LOCKED *** 
    PARENT                            3887                                                        
    TRAN NO:                                03                                                      
    PARENT NAME
    PARENT 
This line is an excess line.  This is sample
NAME                                                                                               ID NUMBER
-------------------------------------------------------------------------------------------------- ---------
bbcdeqgheija;lksdf                                                                                 11300000 
                                            
    DATE:                            02-MAR-2010                                     
    SOURCE CORRECTION:                      Y                                                             
    SIBLING NAME
As per the current record, this line should not be there in final file.
NAME                                                                                               ID NUMBER
-------------------------------------------------------------------------------------------------- ---------
vbcdewgheija;lksdf                                                                                 10200000 
    
    PARENT NAME                                       
    DATE:                            15-DEC-2010                                      
Here is one more line which is not needed in output.
One more here
2011-2015 data analysis 
NAME                                                                                               ID NUMBER
-------------------------------------------------------------------------------------------------- ---------
cvbcdeegheija;lksdf                                                                                10300000 
                                           
*** LOCKED *** 
    MIDDLE NAME:                                                                                    
    DATE:                            03-JUN-2010                      
<<<<< EOF>>>>

How do I get each block that starts with NAME at the beginning of a line and has *** LOCKED *** inside it, while skipping all the red lines?

I really appreciate your help,
Thanks,
Niel.

Try:

while (/^NAME.*?((?=^NAME)|(?=\z))/sgm){