Need help with the syntax using awk+grep

kthri · November 4, 2005, 3:05pm

Hi,
I need to extract information from a 4 GB file based on the following conditions:

1) Check for the presence of a set of account numbers

Each account number is present along with other information within
a PAGESTART and PAGEEND.

The file looks like this:
PAGESTART
ACCOUNT NO 123
DATE 10-01-2004
money 10982
PAGEEND
PAGESTART
ACCOUNT NO 245
DATE 10-03-2005
MONEY 254
PAGEEND

2) If an account number is present, the information between its PAGESTART and PAGEEND must be extracted.

If one of the specified account numbers is 123,
I require the following information:
PAGESTART
ACCOUNT NO 123
DATE 10-01-2004
money 10982
PAGEEND

Can anyone help with this?

Abhishek_Ghose · November 5, 2005, 12:29am

This may not be efficient, but it works.
I am assuming that all the account numbers are listed in a file, one per line. For example, if you need data for accounts 123 and 456, the file looks like this:

acct_file:
123
456

Use sed to reformat the file like this:
sed 's/^/ACCOUNT NO /g' acct_file|sed 's/$/,/g' >temp

temp:
ACCOUNT NO 123,
ACCOUNT NO 456,

Use this as a pattern file for grep, and run paste on the file containing the account information (let's call it acct_info):

paste -s -d",,,,\n" acct_info|grep -f temp

(paste -s joins every 5 lines of the file into one comma-separated line, so this assumes each page is exactly 5 lines long. grep then uses temp as its pattern file.)
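
As a rough illustration of the steps above on the five-line sample pages from the first post, here is a self-contained sketch. The temporary directory, the function name extract_pages and the final tr step (which turns the commas back into line breaks) are my own additions, not part of the original commands. It only holds when every page is exactly 5 lines long and no field contains a comma.

```shell
#!/bin/sh
# Sketch of the sed + paste + grep pipeline on the sample data.
# Assumes every page is exactly 5 lines long and no field contains a comma.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

cat > "$workdir/acct_info" <<'EOF'
PAGESTART
ACCOUNT NO 123
DATE 10-01-2004
money 10982
PAGEEND
PAGESTART
ACCOUNT NO 245
DATE 10-03-2005
MONEY 254
PAGEEND
EOF

printf '123\n456\n' > "$workdir/acct_file"

# extract_pages INFO_FILE ACCT_FILE  (hypothetical helper name)
extract_pages() {
    # Build one fixed-string pattern per account, e.g. "ACCOUNT NO 123,".
    # The trailing comma stops 123 from matching 1234.
    sed -e 's/^/ACCOUNT NO /' -e 's/$/,/' "$2" > "$workdir/temp"
    # Join every 5 lines into one row, keep the matching rows,
    # then turn the commas back into line breaks.
    paste -s -d ',,,,\n' "$1" | grep -F -f "$workdir/temp" | tr ',' '\n'
}

extract_pages "$workdir/acct_info" "$workdir/acct_file"
```

With about 80 fields per page the delimiter list would need one comma per joined line, which is why a page-buffering approach tends to be easier for the real file.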

Hope this helps.

kthri · November 5, 2005, 12:38am

Thanks Abhishek for your response,

The separator is a tag.

ACCOUNT NO |12345
INVOICE NO |578

There are about 80 fields between the PAGESTART and PAGEEND,
all of which have to be retrieved for a matching account.

PAGESTART
...
ACCOUNT NAME | Business Level
ACCOUNT NO |1234
MONEY |54
...
PAGEEND

ranj1 · November 5, 2005, 2:54am

Try this. At the command line:
awk -F'|' -f awkfile s=acct_no ip_filename

where awkfile contains:
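$1 ~ /PAGESTART/ {page=""; found=0}   # new page: start an empty buffer
{page = page $0 "\n"}   # keep every line of the current page, including those before ACCOUNT NO
$1 ~ /ACCOUNT NO/ {n=$2; gsub(/[ \t]/,"",n); if (n == s) found=1}   # exact comparison, so 123 does not match 1234
$1 ~ /PAGEEND/ && found {printf "%s", page; exit}   # print the whole page once, then stop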

Maybe you can use this in a loop over the required numbers and print each page out.

Abhishek_Ghose · November 5, 2005, 12:28pm

Guys try perl!

Ygor · November 6, 2005, 8:24pm

Try....

awk -v RS=PAGEEND '/ACCOUNT NO 123/{print $0 RS}' file1
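
Two caveats with the one-liner above: a multi-character RS is an extension (gawk and mawk support it, POSIX leaves it unspecified), and the pattern 123 also matches 1234. Here is a minimal sketch that avoids both, loads the whole account list once and makes a single pass over the big file. The file names, the function name print_matching_pages and the sample data are assumptions for illustration. The account number is taken to be whatever follows "ACCOUNT NO", with an optional | separator.

```shell
#!/bin/sh
# Single-pass sketch: load the account list, then buffer each page and
# print it only if its ACCOUNT NO is in the list (exact match).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Made-up sample data in the pipe-separated layout.
cat > "$workdir/acct_info" <<'EOF'
PAGESTART
ACCOUNT NAME | Business Level
ACCOUNT NO |1234
MONEY |54
PAGEEND
PAGESTART
ACCOUNT NAME | Retail
ACCOUNT NO |123
MONEY |10982
PAGEEND
EOF

printf '123\n456\n' > "$workdir/acct_nos"

# print_matching_pages ACCT_LIST DATA_FILE  (hypothetical helper name)
# Note: the NR == FNR test assumes ACCT_LIST is not empty.
print_matching_pages() {
    awk '
        # First file: remember every wanted account number.
        NR == FNR { gsub(/[ \t\r]/, ""); if ($0 != "") want[$0] = 1; next }

        # Second file: start a fresh buffer at each PAGESTART.
        /^[ \t]*PAGESTART/ { page = ""; found = 0 }

        # Buffer every line of the current page.
        { page = page $0 "\n" }

        /^[ \t]*ACCOUNT NO/ {
            acct = $0
            sub(/^[ \t]*ACCOUNT NO[ \t|]*/, "", acct)   # drop the label and separator
            sub(/[ \t\r]+$/, "", acct)                  # drop trailing whitespace
            if (acct in want) found = 1
        }

        # End of page: print the buffer only if the account was wanted.
        /^[ \t]*PAGEEND/ { if (found) printf "%s", page; page = ""; found = 0 }
    ' "$1" "$2"
}

print_matching_pages "$workdir/acct_nos" "$workdir/acct_info"
```

Because the wanted numbers sit in an awk array, each page costs one lookup regardless of how many accounts are listed, and the 4 GB file is read only once.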

Abhishek_Ghose · November 7, 2005, 4:01am

This is a quick and dirty method (I doubt its efficiency for data of your size).
(I am assuming the file containing the account numbers is called acct_nos, with one account number per line, like this:
acct_nos:
123
456
and that the file containing the account information is called "acct_info".)

The following statements should work (they rely on certain special characters, again assuming that your data does not use the characters "#" and "@". If it does, replace them with characters that are not in use).

Here's another way, with Perl. This should ideally be faster, and better, because it takes care of a lot of whitespace worries. For example, if the acct_nos file lists the numbers as:
123
456
it wouldn't be affected. The script also works regardless of whether the ACCOUNT NO line has whitespace at the start or before the "pipe" delimiter (or tag, as you might say), though it is assumed that "ACCOUNT" and "NO" are separated by exactly one space. The same goes for the account number:

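find_acct.pl:
#!/usr/bin/perl
use strict;
use warnings;

open(ACCT_INFO, "<", "acct_info") or die "Cannot open acct_info: $!";
open(ACCT_NOS, "<", "acct_nos") or die "Cannot open acct_nos: $!";

# Read the wanted account numbers, one per line
my @acct_nos = <ACCT_NOS>;
close(ACCT_NOS);

my $acct_present = "no";
my @buffer;    # lines of the page currently being read

while (<ACCT_INFO>)
{
    chomp($_);

    my @chk_pagestart_or_acct = split(/\|/);
    push(@chk_pagestart_or_acct, "") unless @chk_pagestart_or_acct;    # blank line

    if ($chk_pagestart_or_acct[0] =~ /^\s*PAGESTART\s*$/)
    {
        # A new page starts: print the previous page if its account matched
        if ($acct_present eq "yes")
        { print(@buffer); }
        @buffer = ();
        $acct_present = "no";
    }
    elsif ($chk_pagestart_or_acct[0] =~ /^\s*ACCOUNT NO\s*$/)
    {
        my $acct = defined $chk_pagestart_or_acct[1] ? $chk_pagestart_or_acct[1] : "";
        $acct =~ s/^\s+//;
        $acct =~ s/\s+$//;
        # Exact match against the list, ignoring surrounding whitespace
        my @found = length($acct) ? grep(/^\s*\Q$acct\E\s*$/, @acct_nos) : ();
        $acct_present = (@found ? "yes" : "no");
    }

    push(@buffer, $_ . "\n");
}

# Print the last page if its account matched
if ($acct_present eq "yes")
{ print(@buffer); }

close(ACCT_INFO);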