Search backwards to a certain string

Hi,
I'm using the following to do a backwards search of a file for a string

sed s/^M//g FILE | nawk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=10 a=0 s="9005"|grep "policy "|sort -u |awk '{print $4}'|cut -c2-10

My issue is that because I'm looking back 10 lines, it pulls in more data than I want. The 10 lines include lines containing the word policy for other policies, whereas I'm only interested in the first occurrence of policy in the reverse search.

So, for example, my string 9005 is located in 2 different parts of the file. The first occurrence works fine (because only one line in the preceding 10 contains policy), but the second occurrence pulls in 2 lines in addition to the one I want.

I'm wondering how do I break out of the search when the first occurrence of policy is reached for each 9005 or alternatively instead of searching back 10 lines search back to the word policy for each 9005 ?

Thanks in advance.

Hi SaltyDog,

I don't understand your explanation, but to search for a word backwards, use:

$ tac infile | sed -n '/word/ { p ; q }'

The OS is Sun (Solaris) and tac is not available.

Is Perl OK? You may have to install the File::ReadBackwards module.

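#!/usr/bin/perl
use File::ReadBackwards;

# Open the file for reading from the last line towards the first.
$x = File::ReadBackwards->new('inputfile.txt') or die "can't read inputfile.txt: $!";

while ( defined($line = $x->readline) )
{
  # Match against $line (not $_); $line already ends with a newline.
  if ($line =~ /pattern/) { print $line }
}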

I'm not clear about your input file, but maybe try this:

# awk 'NR>9005-10&&NR<=9005&&$0~/policy/{x=$0}END{print x}' infile

Hi.

Providing representative samples of your data and expected output invites fast and accurate responses. Otherwise the answers either will use no data, or will use individual and / or eccentric datasets. Here is how I interpreted your question.

Nonstandard utility glark has options for this kind of task:

#!/usr/bin/env bash

# @(#) s1	Demonstrate extraction of lines for patterns within n lines of each other.
# See: http://www.incava.org/projects/glark/

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C glark

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
glark --text-color off --and 2 2009 policy $FILE

exit 0

producing sets of lines for 2009 and policy within 2 lines of one another:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
glark version 1.8.0

-----
 Input data file data1:
apple
  policy
banana
cherry
date
  2009
fig
grape
kiwi
lemon
mango
nectarine
  policy
orange
  2009
peach
rhubarb

-----
 Results:
   13   policy
   14 orange
   15   2009

The glark command was in the Debian repository. Otherwise, see the web page noted in the script for more examples and downloads. The glark code is in ruby, so that needs to be available.

Best wishes ... cheers, drl


Unfortunately we don't have ruby/glark/tac etc installed which is why I was looking towards an sed/awk solution.

Sorry if the question was more confusing than necessary.

Basically here's some sample data in a file :

CS02010002 Policy 9999998599
CS13000008 Tax processing was done for 17/03/2012.
CS95869005 No BC record found. Please review urgently
CS02010002 Policy 9999998599
SS00200001 Change of adress processed
CS13000008 Tax processing was done for 18/03/2012.
CS02010002 Policy 9999999609
CS02010002 Policy 9999999619
CS02010002 Policy 9999999629
CS43500005 Payout Number A0002 is being processed now.
CS43500005 Payout Number A0003 is being processed now.
CS02010002 Policy 9999999639
CS43500005 Payout Number A0001 is being processed now.
CS02010002 Policy 9999999759
CS02010002 Policy 9999999899
CS43500005 Payout Number A0003 is being processed now.
CS13000008 Tax processing was done for 17/03/2012.
CS95869005 No BC record found. Please review urgently

The output I'm looking for is

9999998599
9999999899

corresponding to the previous policy reference before the "CS95869005 No BC record found. Please review urgently" line

but when I run the command

 sed s/^M//g test | nawk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print ;c=a}b{r[NR%b]=$0}' b=10 a=0 s="9005"|grep "Policy "|sort -u |awk '{print $3}'

I get

9999998599
9999999619
9999999629
9999999639
9999999759
9999999899

I understand why I'm getting the extra policy numbers (due to the b=10) but I can't shorten the gap as I don't know how many lines will be between the message "CS95869005 No BC record found. Please review urgently" and the previous "CS02010002 Policy " message.

This is why I was looking for a stop at the first occurrence in the backwards search or something to that effect.

Hi.

I'm still not confident that I understand your question, so perhaps these meta-answers may help.

There was a suggested solution from balajesuri in perl that you may have missed. I don't know about the definite timing of the reading-backwards module, and you seem to not be able to install items, but I think that module might be standard. (The sample code I tried in 2010 was almost instantaneous, but that was with a very short file.) If you do not have it, there are other solutions.

Assuming that the suggestion from birei is correct, you could try various means for reversing a file.

You may not have tac or rev, but there are versions available in perl from one of the CPAN projects.

So you could try using any of those:
PPT: tac
( link to rev removed )

There are other approaches to producing a reverse copy of a file. Here is one in sed:

sed -n '1{
h
}
1 !{
x
H
}
${
x
H
p
}' inputfile 

and you could also add line numbers (cat -n), sort in reverse, and remove the line numbers (cut) to get a reverse copy of a file.
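
As a rough sketch of the cat -n / sort / cut idea just mentioned (the function name reverse_lines is made up for illustration, and it assumes your cat accepts -n and separates the number from the text with a tab):

```shell
# Sketch only: print a file with its lines in reverse order, without tac.
# reverse_lines is a hypothetical helper name, not a standard utility.
reverse_lines() {
    # $1 = input file
    # cat -n   : prefix every line with its line number and a tab
    # sort -rn : sort numerically on that number, largest first
    # cut -f2- : drop the number, keep everything after the first tab
    cat -n "$1" | sort -rn | cut -f2-
}
```

Something like reverse_lines infile | sed -n '/word/ { p ; q }' would then stand in for the tac pipeline suggested earlier in the thread.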

Best wishes ... cheers, drl

# cat data
..............
....
.......
................
....................
CS02010002 Policy "9999998599"
CS13000008 Tax processing was done for 17/03/2012.
..............
....
.......
................
....................
....................
....................
.......................................
CS95869005 No BC1 record found. Please review urgently

========================================================
CS02010002 Policy 9999998599
SS00200001 Change of adress processed
CS13000008 Tax processing was done for 18/03/2012.
CS02010002 Policy 9999999609
CS02010002 Policy 9999999619
CS02010002 Policy 9999999629
CS43500005 Payout Number A0002 is being processed now.
CS43500005 Payout Number A0003 is being processed now.
CS02010002 Policy 9999999639
CS43500005 Payout Number A0001 is being processed now.
CS02010002 Policy 9999999759
========================================================

CS02010002 Policy "9999999899"
....
.......
....
.......
..............
CS43500005 Payout Number A0003 is being processed now.
CS13000008 Tax processing was done for 17/03/2012.
CS95869005 No BC2 record found. Please review urgently
....
.......
....
.......
..............
# awk -v start='CS02010002' -v fstop="No BC" -v lstop=" record found. Please review urgently" '
$0~start{policy=$3} $0~(fstop "[0-9]*" lstop){print policy}' data
"9999998599"
"9999999899"

This is exactly what I was looking for but I can't get it working. I tried the above and then amended it slightly to the following

awk -v start='CS02010002' -v fstop="CS95869005"  '{while(getline){if($0~start)policy=$3;if($0~fstop)print policy}}'  FILE

but this just gives me

awk: syntax error near line 1
awk: bailing out near line 1

On Solaris, use /usr/xpg4/bin/awk or nawk; the default /usr/bin/awk is the old awk and does not accept -v. Otherwise try this:

awk '$2=="Policy"{p=$3}$0~s{print p}' s=CS95869005 infile

or for example:

awk '$2=="Policy"{p=$3}$0~s{print p}' s="No BC record found" infile
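
For anyone reading along, the same idea can be spelled out with comments. This is only a sketch: the wrapper name last_policy_before is hypothetical, and it assumes the policy lines have the word Policy in field 2 and the number in field 3, as in the sample data posted above.

```shell
# Sketch only: print the most recent policy number seen before each line
# that matches the search string. last_policy_before is a made-up name.
last_policy_before() {
    # $1 = search string (treated as a regular expression), $2 = input file
    awk '
        $2 == "Policy" { p = $3 }     # remember the latest policy number
        $0 ~ s         { print p }    # on a match, print the remembered one
    ' s="$1" "$2"
}
```

Because awk reads forwards and simply overwrites p on every Policy line, no look-back window is needed, so it does not matter how many lines separate the policy line from the error line. Run as last_policy_before CS95869005 FILE on the sample data, it should print 9999998599 and 9999999899.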

Here is a completely different approach, though it is unsuitable for very large files because it makes an extra pass over the file for each error found.
It works by numbering the lines in the input stream, finding each occurrence of "No BC record found", and then scanning the ten lines above that record for the last occurrence of a record containing "Policy".

cat -n filename.txt | grep "No BC record found"|awk '{print $1}' | while read E1
do
        # Line ten lines above "No BC record found"
        E2=$((${E1} - 10))
        if [ ${E2} -le 0 ]
        then
                E2=1
        fi
        # Line number one line above "No BC record found"
        E3=$((E1 -1))
        # Search 10 line block to just above "No BC record found"
        sed -n "${E2},${E3}p;${E3}q" filename.txt | \
                grep "Policy" | tail -1 | awk '{print $3}'
done

./scriptname
9999998599
9999999899

... and I know that it has a "cat" command in it !
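
If the ten-line limit matters but the extra passes do not appeal, the same window can probably be enforced in a single pass. The following is a sketch under assumptions, not a tested replacement for the script above: policy_within_window and its window-size argument are hypothetical names, and the field positions are taken from the sample data.

```shell
# Sketch only: single-pass variant of the "look back at most N lines" idea.
# policy_within_window is a made-up name; $2 is a hypothetical window size.
policy_within_window() {
    # $1 = input file, $2 = maximum number of lines to look back
    awk '
        # Remember the latest policy number and the line it was seen on.
        $2 == "Policy" { p = $3; pline = NR }
        # Print it only if a policy was seen and it lies within n lines.
        /No BC record found/ && pline && NR - pline <= n { print p }
    ' n="$2" "$1"
}
```

Called as policy_within_window filename.txt 10, this reads the file once and skips any error line whose nearest preceding Policy line is more than ten lines back.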