File Parsing

jsusheel · September 24, 2007, 3:01pm

Hi All,
I have a couple of files ( ascii ) with the following data

File 1
#lport1:dc1:lport2:dc2 - All records were delimited by :
6300:ADEF12:6305:ATNE59
3411:EGFE31:3499:GDEF21
. . . .
. . . .
total of 55,000 Records

File 2
#seqno:lport1:id:dlc1:vid:lport2:nni:dc2:ci - All records delimited by :
60568:3411:98:EGFE31:965:3499:3799:GDEF21:432
. . . . . . . . .
. . . . . . . . .
total of 58,000 Records

I need to compare the lport1, dc1, lport2, dc2 values of file1 with the lport1, dc1, lport2, dc2 values of file2 and, if there is a match, write the entire line from file2 to another file. I tried writing a Perl script under Solaris 2.5.8, which took almost 6 hours to finish.
Could any of you help me get this task to run much faster, i.e. in less than 15 minutes, using an awk/shell script?
Thanks in Advance.

vgersh99 · September 24, 2007, 4:55pm

Assuming:

File 2
#seqno:lport1:id:dlc1:vid:lport2:nni:dc2:ci - All records delimited by :

actually means:

File 2
#seqno:lport1:id:dc1:vid:lport2:nni:dc2:ci - All records delimited by :

nawk -f jsusheel.awk file1 file2

jsusheel.awk:

BEGIN {
   FS=OFS=":"
}
NR==FNR { f1[$1, $2, $3, $4]; next }
($2 SUBSEP $4 SUBSEP $6 SUBSEP $8) in f1

jsusheel · September 24, 2007, 6:37pm

Hi Vgersh99,
Thanks for the reply. Yes, your assumption is correct: it should be dc1 instead of dlc1. Sorry for the typo.
When I executed the awk script, there was no matching output. The block starting with NR==FNR works perfectly, reading all the input records from file1; I verified this using print $0.
However, I do not have any clue about the line ($2 SUBSEP $4 SUBSEP $6 SUBSEP $8) in f1. Could you please help me decipher this line, as I am not very comfortable with awk?
Also, please note that a record in file1 will not match a record in file2 on a one-to-one basis, i.e. the first record in file1 may match the 100th record in file2 and the second record in file1 may match the 40123rd record in file2.
Again, thank you for sparing your time...

summer_cherry · September 25, 2007, 2:41am

Hi,
I have an idea for your requirements, but it may be very slow when the files contain too many records.
Just for your reference.

Input:

first.txt:
1:a:2:b
3:c:4:d
5:e:6:f
7:g:8:h

second.txt:
60568:1:98:a:965:2:3799:b:432
60568:1:98:f:965:2:3799:b:432
60568:3:98:c:965:4:3799:d:432
60568:3:98:c:965:4:3799:w:432
60568:5:98:e:965:6:3799:f:432

Output:

60568:1:98:a:965:2:3799:b:432
60568:3:98:c:965:4:3799:d:432
60568:5:98:e:965:6:3799:f:432

Code:

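awk 'BEGIN{FS=":"}
{
    if (NF<=4)
        pre[NR]=$0              # file1 record: keep the whole line
    else
    {
        # file2 record: rebuild the 4-field key from fields 2, 4, 6 and 8
        a=sprintf("%s:%s:%s:%s",$2,$4,$6,$8)
        for (i in pre)
            if (pre[i]==a)
                print $0
    }
}' first.txt second.txt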
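
The inner for loop above compares every file2 record against every stored file1 line, which for 55,000 and 58,000 records is about 3.2 billion string comparisons. A minimal sketch of one possible alternative, assuming the same sample data: store each file1 line as an array index and test membership with "in", so each file2 record needs a single lookup. The temporary directory and the names seen and matches are only illustrative, and plain awk is assumed here rather than a specific implementation.

```shell
# Recreate the sample input from this post in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '1:a:2:b' '3:c:4:d' '5:e:6:f' '7:g:8:h' > "$tmp/first.txt"
printf '%s\n' \
    '60568:1:98:a:965:2:3799:b:432' \
    '60568:1:98:f:965:2:3799:b:432' \
    '60568:3:98:c:965:4:3799:d:432' \
    '60568:3:98:c:965:4:3799:w:432' \
    '60568:5:98:e:965:6:3799:f:432' > "$tmp/second.txt"

matches=$(awk -F: '
    # Short record (file1): remember the whole line as an array index.
    NF <= 4 { seen[$0]; next }
    # Long record (file2): print it when its rebuilt key is a known index.
    ($2 ":" $4 ":" $6 ":" $8) in seen
' "$tmp/first.txt" "$tmp/second.txt")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

On the sample data this should print the same three lines listed under Output above.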

vgersh99 · September 25, 2007, 10:01am

f1:

6300:ADEF12:6305:ATNE59
3411:EGFE31:3499:GDEF21

f2:

60568:3411:98:EGFE31:965:3499:3799:GDEF21:432
60568:3422:98:EGFE31:965:3499:3799:GDEF21:432

produces:

60568:3411:98:EGFE31:965:3499:3799:GDEF21:432

Looks good to me given your original description of the fields and the matching criteria.

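The '($2 SUBSEP $4 SUBSEP $6 SUBSEP $8)' expression is the matching key for file2: fields 2, 4, 6 and 8 of each file2 record, concatenated with SUBSEP, form the key that is looked up in the associative array 'f1'. The index f1[$1, $2, $3, $4] built from file1 is stored in the same SUBSEP-joined form, so the two keys compare equal whenever the four fields match.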
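
For anyone who wants to try the key lookup in isolation, here is a small self-contained sketch using the two-record samples above. It assumes a POSIX awk is available as awk (on Solaris, nawk would be the equivalent); the scratch directory and the variable names key and result are illustrative, and the key is assigned to a variable only to make the two steps visible.

```shell
# Sample files copied from the f1/f2 records shown in this post.
tmp=$(mktemp -d)
printf '%s\n' '6300:ADEF12:6305:ATNE59' '3411:EGFE31:3499:GDEF21' > "$tmp/f1"
printf '%s\n' '60568:3411:98:EGFE31:965:3499:3799:GDEF21:432' \
              '60568:3422:98:EGFE31:965:3499:3799:GDEF21:432' > "$tmp/f2"

result=$(awk -F: '
    # First file: f1[a, b, c, d] is the same index as f1[a SUBSEP b SUBSEP c SUBSEP d].
    NR == FNR { f1[$1, $2, $3, $4]; next }
    # Second file: build the same kind of key from fields 2, 4, 6 and 8.
    { key = $2 SUBSEP $4 SUBSEP $6 SUBSEP $8 }
    # Print the record only if that key was stored while reading the first file.
    key in f1 { print }
' "$tmp/f1" "$tmp/f2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the lookup goes by key rather than by line position, it does not matter where in file2 the matching record sits.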

jsusheel · September 25, 2007, 10:25am

Hi,
Many thanks to summer_cherry and vgersh99 for the responses.
Again, these scripts consume a lot of CPU and take a long time
to complete. I have decided to run these scripts at midnight.
Thanks a lot...