AWK: matching patterns in 2 different files

asanjuan · September 14, 2010, 4:52am

In a directory, there are files with two different extensions (*.txt and *.xyz) whose names are matching numeric strings. There are 5000 *.txt files and 5000 *.xyz files. Each file has around 4000 rows and 8 columns, with several unique string patterns in the 5th column.

Files *.txt

1.txt
2.txt
3.txt

Files *.xyz

1.xyz
2.xyz
3.xyz

3.txt

OT   3328   CA    CT   268       5.800      7.500      4.700      
OT   3329   HA    CT   268       8.500      8.900      3.600      
OT   3330   NB    CT   268       6.700      5.500      7.600      
OT   3331   O     AT   269       1.200      7.700      5.500      
OT   3332   C1    AT   269       3.800      5.800      5.200 
OT   3333   C2    AT   269       8.800      0.800      0.200 
OT   3334   O     VT   270       9.800      2.800      5.600 
OT   3335   C1    VT   270       5.200      5.132      2.031
OT   3336   C2    VT   270       0.236      5.234      8.351

3.xyz

OT   3328   CA    CT   268       5.800      7.500      4.700        
OT   3329   NB    CT   268       6.700      5.500      7.600      
OT   3330   O     AT   269       1.200      7.700      5.500

Tasks:

(Step-1) For each value in the 5th column of the '3.xyz' file, find all rows in the '3.txt' file that have the same value in the 5th column.

(Step-2) Write each entire matching row from step 1 into a new file.

This is what the output should look like:

newfile.dat

OT   3328   CA    CT   268       5.800      7.500      4.700      
OT   3329   HA    CT   268       8.500      8.900      3.600      
OT   3330   NB    CT   268       6.700      5.500      7.600      
OT   3331   O     AT   269       1.200      7.700      5.500      
OT   3332   C1    AT   269       3.800      5.800      5.200 
OT   3333   C2    AT   269       8.800      0.800      0.200

I tried awk to get the expected output: it builds an array from the 5th field, compares the *.txt file with the *.xyz file, and prints the matching rows into a new file.

awk  '{FS="|"} NF==5 {acc[$5]=5} NF>1 {if( ( $5 in acc ) ) {print  $1"|"$2"|"$3"|"$4"|"$5"|"$6"|"$7|"$8} }' 3.xyz 3.txt

And, to iterate over multiple files in the directory:

#!/bin/bash

for  d in `ls *`
do
  awk '{FS="|"} NF==5 {acc[$5]=5} NF>1 {if( (  $5 in acc ) ) {print $1"|"$2"|"$3"|"$4|"$5"|"$6"|"$7|"$8} }' $.xyz $.txt  $d > EF_$d
done

Awk reports an 'unterminated string' error, but I checked the code and couldn't find the cause. Please help.

Thank you for your time and attention.

-A

pravin27 · September 14, 2010, 5:10am

try this,

 awk 'NR==FNR{a[$5]=++i;next} { if ( $5 in a) {print $0}}' 3.xyz 3.txt
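
A rough sketch of how this one-liner behaves, assuming the sample rows from the question. The scratch directory and the array name `seen` are illustrative and not from the thread. `NR==FNR` is true only while awk reads the first file, so the first block collects the column-5 values of 3.xyz, and the second pattern prints every 3.txt row whose column 5 was collected. The 'unterminated string' error in the original attempt appears to come from unbalanced double quotes in its print statement (`$7|"$8` is missing a `"` before the `|`).

```shell
#!/bin/bash
# Illustrative demo: recreate the sample data from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > 3.txt <<'EOF'
OT   3328   CA    CT   268       5.800      7.500      4.700
OT   3329   HA    CT   268       8.500      8.900      3.600
OT   3330   NB    CT   268       6.700      5.500      7.600
OT   3331   O     AT   269       1.200      7.700      5.500
OT   3332   C1    AT   269       3.800      5.800      5.200
OT   3333   C2    AT   269       8.800      0.800      0.200
OT   3334   O     VT   270       9.800      2.800      5.600
OT   3335   C1    VT   270       5.200      5.132      2.031
OT   3336   C2    VT   270       0.236      5.234      8.351
EOF

cat > 3.xyz <<'EOF'
OT   3328   CA    CT   268       5.800      7.500      4.700
OT   3329   NB    CT   268       6.700      5.500      7.600
OT   3330   O     AT   269       1.200      7.700      5.500
EOF

# First file (3.xyz): NR==FNR is true, so remember each column-5 value and move on.
# Second file (3.txt): print the whole row when its column 5 was remembered.
awk 'NR==FNR { seen[$5]; next } $5 in seen' 3.xyz 3.txt > newfile.dat

cat newfile.dat
```

With this data only the rows whose 5th column is 268 or 269 end up in newfile.dat; the 270 rows are dropped because 270 never occurs in 3.xyz.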

asanjuan · September 14, 2010, 5:31am

Thanks, Pravin, for the helpful reply. The code works perfectly at the command prompt. Would you please help further with an iteration script for multiple files in a directory, using the code you gave? I tried this:

#!/bin/bash

for d in `ls *`
do
  awk 'NR==FNR{a[$5]=++i;next} { if ( $5 in a) {print $0}}' $.xyz $.txt  > EF_$d
done

The $.xyz and $.txt files have the same numeric string; only the file extension differs.
EF_$d is where the corresponding output is written.

Thanks so much for your kind help.

-A

pravin27 · September 14, 2010, 6:04am

try this,

#!/bin/bash
# {1..4} is bash brace expansion, so run this with bash rather than plain sh.

for i in {1..4}
do
  awk 'NR==FNR{a[$5]=++i;next} { if ( $5 in a) {print $0}}' "$i.xyz" "$i.txt" > "EF_$i"
done
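
With 5000 pairs, a hard-coded range may be awkward to maintain. One possible variant, sketched below, loops over whatever *.xyz files exist and derives the matching .txt name from each. The function name `filter_pairs` and the array name `seen` are hypothetical, not from the thread, and the sketch assumes every pair shares the same base name.

```shell
#!/bin/bash
# filter_pairs is a hypothetical helper name, not something from the thread.
# It runs the column-5 filter for every N.xyz / N.txt pair found in one directory.
filter_pairs() {
    local dir=${1:-.} xyz base
    for xyz in "$dir"/*.xyz
    do
        base=${xyz%.xyz}                 # strip the extension: "dir/3.xyz" -> "dir/3"
        [ -f "$base.txt" ] || continue   # skip if there is no matching .txt (or no .xyz at all)
        # Collect column 5 from the .xyz file, then print matching rows of the .txt file.
        awk 'NR==FNR { seen[$5]; next } $5 in seen' "$xyz" "$base.txt" > "$dir/EF_${base##*/}"
    done
}
```

Calling `filter_pairs .` from the data directory would write EF_1, EF_2, and so on next to the inputs, the same output names as the loop above.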

raghunsi · September 14, 2010, 6:11am

I used this command to get something quite close to the output I need, but not exactly.

gzmore dealerlistservice_LogS[1-5]_2010*.log.gz |  grep -i ClientError.700 |  cut -b2470-2590 | awk -RS '/<fault(c|s).*>.*<\/fault.*>/'

So here is the output from the above command.

Can we still fine-tune this output?

asanjuan · September 14, 2010, 6:15am

Thanks so much, pravin! It works well.

-A