Lookup in another file and conditionally modify it inline

Hi,

I have an issue where I need to look up transactions from a given file and, if the same transaction is found in another file, replace a few columns with other values.
Finally, the changed and unchanged lines must be printed and stored back in the same source file.

For example:

f1
100
200
300
f2 (tran number is second field)
1,100,AAA,BBB,X,CCC
5,200,AAA,BBB,Y,CCC
3,400,AAA,BBB,X,CCC

The output should be:

1,100_P,AAA,BBB,X,CCC
5,200_T,AAA,BBB,Y,CCC
3,400,AAA,BBB,X,CCC

As you can see, only the 1st and 2nd records have a matching tran number in f1, hence they are changed; record 3 will remain as is.

I have done the following, but it's a very direct approach and takes a huge amount of time.

cp main_input.txt tmp1.txt;
while read tran_num
do 
    awk  'BEGIN{FS=","; OFS=","} { if ( $2=='${tran_num}' && $5 == "X")  {$2=$2"_P";print}  else if ($2=='${tran_num}' && $5 == "Y") {$2=$2"_T"; print} else {print} }' tmp1.txt > tmp2.txt;
    cp tmp2.txt tmp1.txt;
done < lookup_tran_file.txt
mv tmp1.txt file_input_file.txt

The above works, but I don't like the fact that the tmp file is recreated every time to preserve the previous modification done by awk. In other words, if f1 contains 10000 records to be checked, the tmp file gets overwritten 10000 times. This solution also takes 1-2 hours to finish, depending on the input file size.

You are right, looping through the above shell script 10000 times is an enormous waste of resources and time. It creates 20000 processes to run 2 commands 10000 times. How about doing it in one command, leaving the looping to it? If $5 can have ONLY the values "X" and "Y", try

awk -F, -vOFS=, 'NR==FNR {T[$1]; next} $2 in T {$2=$2 ($5=="X"?"_P":"_T")} 1' file1 file2
1,100_P,AAA,BBB,X,CCC
5,200_T,AAA,BBB,Y,CCC
3,400,AAA,BBB,X,CCC

Should there be other values in $5 for which $2 should not be modified, try

awk -F, -vOFS=, 'NR==FNR {T[$1]; next} $2 in T {$2=$2 ($5=="X"?"_P":$5=="Y"?"_T":"")} 1' file1 file2
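Since the result must end up back in the same source file, one way is to write awk's output to a temporary file and then replace the source file with a single mv, rather than copying a tmp file on every iteration. A minimal sketch, reusing the file names from the original script (the sample data here is created inline just for the demonstration):

```shell
# Sample data (file names taken from the original script)
printf '100\n200\n300\n' > lookup_tran_file.txt
printf '1,100,AAA,BBB,X,CCC\n5,200,AAA,BBB,Y,CCC\n3,400,AAA,BBB,X,CCC\n' > main_input.txt

# One awk pass over both files; write to a temp file,
# then replace the source file in one move
awk -F, -vOFS=, 'NR==FNR {T[$1]; next} $2 in T {$2=$2 ($5=="X"?"_P":"_T")} 1' \
    lookup_tran_file.txt main_input.txt > main_input.tmp &&
mv main_input.tmp main_input.txt

cat main_input.txt
```

This runs awk exactly once regardless of how many transaction numbers are in the lookup file, which is why it avoids the 10000-iteration overhead.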

Assuming that your first sample file is named f1 (rather than f1 being the first line of your first file), and that your second input file is named f2 (rather than its first line being f2 (tran number is second field) ), you can use whichever of the suggestions RudiC provided that matches the actual input you need to process. If those two lines are actually in your input files, the following seems to produce the output you want:

awk '
BEGIN {	FS = OFS = ","
}
FNR == 1 {
	next
}
FNR == NR {
	tn[$1]
	next
}
$2 in tn {
	if($5 == "X")
		$2 = $2 "_P"
	else if($5 == "Y")
		$2 = $2 "_T"
}
1' f1 f2

If you want to try this (or either of RudiC's suggestions) on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .
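To illustrate, here is the script above run against the sample files with those two lines included as literal headers (commented for clarity; the logic is unchanged). It should print exactly the three output lines shown earlier in the thread:

```shell
# Recreate the sample files WITH the literal header lines
printf 'f1\n100\n200\n300\n' > f1
printf 'f2 (tran number is second field)\n1,100,AAA,BBB,X,CCC\n5,200,AAA,BBB,Y,CCC\n3,400,AAA,BBB,X,CCC\n' > f2

awk '
BEGIN { FS = OFS = "," }
FNR == 1 { next }          # skip the header line of each file
FNR == NR { tn[$1]; next } # first file: remember transaction numbers
$2 in tn {                 # second file: tag matching transactions
    if ($5 == "X") $2 = $2 "_P"
    else if ($5 == "Y") $2 = $2 "_T"
}
1' f1 f2
```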

Thank you, Rudi!

Thank you, Don!
@Don: f1 and f2 aren't part of the data; they are just the file names. Also, your solution is removing 1 record from the output, meaning if my input has 1000 records, I get only 999 as output after the updates/changes. I am trying to see why that is happening.

Both solutions worked on my actual data, though. This is amazing; I never expected the reply to be so quick! Thank you so much. This helps me a lot and gives me ideas for my future file processing tasks.

Hi mansoorcfc,
That is exactly what I said in post #3 in this thread:

The code I posted was to be used only if the 1st line in each file is to be treated as some kind of header line that should be ignored when producing output. The code in my script:

FNR == 1 {
	next
}

skips the 1st line in each input file. ( FNR is set by awk to the record number of the current input line within the current input file.) Therefore, the output file will have one line less than your 2nd input file (and whatever transaction number is on the 1st line of your 1st input file will not be recognized as a known transaction number when processing the 2nd input file).
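A quick way to see the difference between NR and FNR, using two throwaway files (names here are arbitrary):

```shell
# demo1 has two lines, demo2 has one
printf 'a\nb\n' > demo1
printf 'c\n' > demo2

# NR counts records across all files; FNR restarts at 1 for each file
awk '{print FILENAME, NR, FNR}' demo1 demo2
# demo1 1 1
# demo1 2 2
# demo2 3 1
```

That reset is why FNR == NR is only true while awk is reading the first file, and why FNR == 1 fires once per input file rather than once per run.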

Hi Don,

Thank you for your response and the clarification. Sorry, I am a little new to working with awk and couldn't see that right away. I have always used sed or other direct shell scripting approaches for file editing, but have always wanted to learn and use awk.

Thank you again so much for your help and responses. Much appreciated.
