Script to parse and compare information in two fields of file

GERMANOS · August 28, 2015, 4:49am

Hello,
I am parsing a large input file, file1, which contains a field called CFA.
I have to compare the CFA field of file1 (bytes 88-96) with the content of file2 (which contains only one field) and write the matching rows
to another file.
Here is my code and sample input file:

#########################################
# F.ne: CheckNBS
#########################################

function CheckNBS
{
    
writeInfo "************************************************************************************************"
writeInfo "----------------- CHECK FILE NBS FILE2 Start: $d ------------------"    
FILE2="$DIR_OUT"/"FILE2"_"${NamingDate}.data"
FILE_OUT="$DIR_OUT"/"OUT_CAMPIONE_NBS"_"${NamingDate}.ctrl"


ListFILE=`cat "$NBSPATH"/"*"${DATA_RIFERIMENTO}"*"`
for FILE1 in ${ListFILE}
do

writeInfo "Elaborazione FILE1 : ${FILE1}"

ListCFA=`cat ${FILE2}`
for CFA in ${ListCFA}
do

zcat "$NBSPATH"/"$FILE1" | grep $CFA | awk '$1 == "201" { print $0 }' >> ${FILE_OUT}

done
done
}

Execution is very slow. Can I also use awk on compressed files?

Can you help me?

sea · August 28, 2015, 4:53am

You could remove the grep (untested):

zcat "$NBSPATH"/"$FILE1" | awk -v CFA="$CFA" '$0 ~ CFA && $1 == "201" { print $0 }' >> ${FILE_OUT}

Not sure if this would bring you much of a performance gain though.
hth
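
A note on passing a shell value into awk, since it is easy to get wrong: inside an awk program, a pattern written between slashes is a regex literal, so the name of a variable set with -v is not expanded there. The small sketch below (the sample line and the variable name key are made up for illustration) contrasts that with index(), which searches for the variable's value as a plain substring:

```shell
# Made-up sample record and key, for illustration only.
line='201 xx 888011181727 yy'
key=888011181727

# /key/ is a regex literal: it looks for the letters "key", not the variable's value.
literal=$(printf '%s\n' "$line" | awk -v key="$key" '/key/ { print "hit" }')

# index() searches for the value of the awk variable as a fixed substring.
by_value=$(printf '%s\n' "$line" | awk -v key="$key" 'index($0, key) { print "hit" }')

printf 'literal=[%s] by_value=[%s]\n' "$literal" "$by_value"
```

Here only the index() form should report a hit. Writing $0 ~ key also uses the variable, but treats its value as a regular expression rather than a fixed string.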

RudiC · August 28, 2015, 5:29am

You're zcat-ing "$NBSPATH"/"$FILE1" and running grep | awk once for every CFA in $FILE2. That consumes a lot of resources. Why don't you uncompress once into a temp file and use e.g. grep -f $FILE2 on the temp file? Does your system offer the zgrep command?
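
The suggestion above could look roughly like the sketch below. The file names (keys.txt, data.gz, data.txt) and their contents are stand-ins invented for the demonstration, not the real FILE1 and FILE2, and gzip -dc is used in place of zcat only because it behaves the same way on more systems:

```shell
# Sketch: decompress once, then match every key in a single grep pass.
# All file names and contents here are made up for the demonstration.
workdir=$(mktemp -d)

# Stand-in for FILE2: one key per line.
printf '%s\n' 888011181727 999001166011 > "$workdir/keys.txt"

# Stand-in for a compressed FILE1.
printf '%s\n' \
    '201 aaa 888011181727 zzz' \
    '201 bbb 111111111111 zzz' \
    '900000891' | gzip > "$workdir/data.gz"

# One decompression and one grep, instead of one zcat | grep per key.
# -F treats the keys as fixed strings; -f reads them from a file.
gzip -dc "$workdir/data.gz" > "$workdir/data.txt"
matches=$(grep -F -f "$workdir/keys.txt" "$workdir/data.txt")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

Note that grep finds a key anywhere in the line, not only at a fixed byte position.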

GERMANOS · August 28, 2015, 6:05am

But the CFA field in FILE1 is positioned at byte 88 (and is 12 bytes long),
so it is not identified that way; that is why I used grep.

Can I take a substring with awk?
thanks a lot

RudiC · August 28, 2015, 6:49am

I'm not sure I understand. Please provide (abbreviated, reasonable) samples of the input files.

GERMANOS · August 28, 2015, 7:01am

I'll explain:

FILE2:

888011193163
888011087843
888011198112
888011126841
888010319633
888010887347
888011103891
888011174045
888011181727
999001166011
888010522751
888010534587
888010751405
888010824309
888000744563
888000995836
888010941118
888011026395
888010224776
888010344784

FILE1(COMPRESSED):

12015060700000009     
201  3358447808                      2015-05-14-02.07.22.000000000000012000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000001                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541448263017051
201  3666678887                      2015-05-12-14.28.06.000000000000009000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000000                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541425923651051
201  3666678887                      2015-05-14-10.57.54.000000000000010000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000000                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541448351096051
201  335357257                       2015-05-12-17.15.43.000000000000005000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000000                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541425957517051
201  3389474079                      2015-05-13-01.22.00.000000000000010000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000000                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541426042602051
201  3389474079                      2015-05-14-16.19.01.000000000000009000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000000                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541448418547051
201  3389474079                      2015-05-14-05.28.12.000000000000010000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000000                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541448287851051
201  3356067312                      2015-05-14-23.08.56.000000000000009000000000000024999000875455LOS000000001020001320119   P          0SNG000CO  000000000                    10000004  GUNB-GST01  GPRSN20150603000000000000024132||||    01444617541448499799051
201  3386401372                      2015-05-10-13.20.33.000000000000025000000000000001888010471777NES000000003112701320119EC9P          0PAM001CO  000000003                    10000004  GUNB-GSTTZ  GPRSN20150603000000000000001132||||    01444617541413713855098
201  3386401372                      2015-05-11-07.40.33.000000000000024000000000000001888010471777NES000000003112701320119EC9P          0PAM001CO  000000003                    10000004  GUNB-GSTTZ  GPRSN20150603000000000000001132||||    01444617541413895397098
900000891

I have to look for the values from FILE2 inside FILE1 (starting at the 88th byte, for 12 bytes);

when they are equal, the entire line of FILE1 must be written to FILE_OUT.

thanks a lot

RudiC · August 28, 2015, 7:51am

Please use code tags as required by forum rules!

None of the strings in FILE2 is found in FILE1. If, by sheer coincidence, strings from file2 did exist in file1, this might work:

grep -Ff file2 file1
201 3386401372 2015-05-11-07.40.33.000000000000024000000000000001888010471777NES0000000031127888011181727 0PAM001CO 000000003 10000004 GUNB-GSTTZ GPRSN20150603000000000000001132|||| 01444617541413895397098

Don_Cragun · August 28, 2015, 3:25pm

Yes, you can use the awk substr() function to grab substrings. If your uncompressed FILE1 contained any of the strings in FILE2, the following awk script would print every line of FILE1 whose 12 bytes starting at byte 88 match one of them:

awk '
FNR == NR {
	CPA[$1]
	next
}
substr($0, 88, 12) in CPA' FILE2 FILE1

but, as has already been stated, no lines in your sample files match.

If we add the following line to your sample FILE2:

123456789012

and the following line to your sample FILE1:

201  3386401372                      2015-05-11-07.40.33.000000000000024000000000000001123456789012NES000000003112701320119EC9P          0PAM001CO  000000003                    10000004  GUNB-GSTTZ  GPRSN20150603000000000000001132||||    01444617541413895397098

then the above code prints the line added to FILE1.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
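
On the original question of using awk on compressed files: awk cannot read gzip data itself, but it can read the decompressed stream from standard input when "-" is given as a file name. The sketch below combines that with the substr() test and the $1 == "201" check from the first post. It is only an illustration under assumptions: the file names and the fixed-width sample records are generated for the demonstration, and gzip -dc stands in for zcat.

```shell
# Sketch: stream the compressed file into awk once and test bytes 88-99.
# File names and records are made up for the demonstration.
workdir=$(mktemp -d)

# Stand-in for FILE2: one 12-byte key per line.
printf '%s\n' 123456789012 888011181727 > "$workdir/keys.txt"

# Stand-in for compressed FILE1: '%-87s' pads the record type to 87 bytes,
# so the 12-byte key that follows starts at byte 88.
{
    printf '%s\n' '12015060700000009'                   # short header line
    printf '%-87s%s%s\n' 201 123456789012 NES           # key present at byte 88
    printf '%-87s%s%s\n' 201 000000000000 NES           # key not in the list
    printf '%-87s%s%s\n' 200 888011181727 NES           # key present, wrong record type
    printf '%-80s%s\n'   201 123456789012               # key present, wrong position
} | gzip > "$workdir/data.gz"

# First input (the key file) fills the array; "-" is the decompressed stream.
gzip -dc "$workdir/data.gz" |
awk '
FNR == NR { if (NF) keys[$1]; next }                    # skip blank key lines
$1 == "201" && (substr($0, 88, 12) in keys)             # default action: print the line
' "$workdir/keys.txt" - > "$workdir/out.txt"

result=$(cat "$workdir/out.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Only the second generated record should be printed. The FNR == NR idiom assumes the key file is not empty; with an empty key file the stream from standard input would be mistaken for the key list.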