In a directory, there are files with two different extensions (.txt and .xyz) whose base names are the same numerical strings (e.g. 1, 2, 3). There are 5000 .txt files and 5000 .xyz files. Each file has around 4000 rows and 8 columns, with several distinct values in the 5th column.
Files *.txt
1.txt
2.txt
3.txt
Files *.xyz
1.xyz
2.xyz
3.xyz
3.txt
OT 3328 CA CT 268 5.800 7.500 4.700
OT 3329 HA CT 268 8.500 8.900 3.600
OT 3330 NB CT 268 6.700 5.500 7.600
OT 3331 O AT 269 1.200 7.700 5.500
OT 3332 C1 AT 269 3.800 5.800 5.200
OT 3333 C2 AT 269 8.800 0.800 0.200
OT 3334 O VT 270 9.800 2.800 5.600
OT 3335 C1 VT 270 5.200 5.132 2.031
OT 3336 C2 VT 270 0.236 5.234 8.351
3.xyz
OT 3328 CA CT 268 5.800 7.500 4.700
OT 3329 NB CT 268 6.700 5.500 7.600
OT 3330 O AT 269 1.200 7.700 5.500
Tasks:
(Step-1) For each value in the 5th column of '3.xyz', find all rows of '3.txt' that have the same value in their 5th column.
(Step-2) Write every row matched in step-1, in full, to a new file.
This is what the expected output looks like:
newfile.dat
OT 3328 CA CT 268 5.800 7.500 4.700
OT 3329 HA CT 268 8.500 8.900 3.600
OT 3330 NB CT 268 6.700 5.500 7.600
OT 3331 O AT 269 1.200 7.700 5.500
OT 3332 C1 AT 269 3.800 5.800 5.200
OT 3333 C2 AT 269 8.800 0.800 0.200
I tried awk to get the expected output. The idea is to build an array from the 5th column of the .xyz file, compare the 5th column of the .txt file against it, and print the matching rows into a new file.
awk '{FS="|"} NF==5 {acc[$5]=5} NF>1 {if( ( $5 in acc ) ) {print $1"|"$2"|"$3"|"$4"|"$5"|"$6"|"$7|"$8} }' 3.xyz 3.txt
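For comparison, the lookup described above can be sketched as a two-file awk call with every string literal closed. This is only a sketch under assumptions: the fields are taken to be whitespace-separated (the sample rows contain no "|" characters), the first file supplies the 5th-column keys, and the function name filter_rows is hypothetical.

```shell
# filter_rows KEYFILE DATAFILE   (hypothetical helper name)
# Print every row of DATAFILE whose 5th column also occurs in the
# 5th column of KEYFILE.
# Assumption: fields are separated by whitespace, as in the sample rows.
filter_rows() {
    awk '
        NR == FNR { keys[$5]; next }   # first file: remember each 5th-column value
        $5 in keys                     # second file: print rows whose 5th column was seen
    ' "$1" "$2"
}

# Example call for one pair of files:
#   filter_rows 3.xyz 3.txt > newfile.dat
```

The NR == FNR test is true only while awk is still reading the first file, so the array is filled from the .xyz file before any row of the .txt file is compared. Rows are printed unchanged, so there is no need to rebuild them field by field.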
And, to iterate over all the files in the directory:
#!/bin/bash
for d in `ls *`
do
awk '{FS="|"} NF==5 {acc[$5]=5} NF>1 {if( ( $5 in acc ) ) {print $1"|"$2"|"$3"|"$4|"$5"|"$6"|"$7|"$8} }' $.xyz $.txt $d > EF_$d
done
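The iteration could also be driven by the .xyz file names, so that each N.xyz is paired with its N.txt. The following is a rough sketch, not a tested solution for the real data: it assumes whitespace-separated fields and that both files of a pair sit in the current directory. The function name filter_all and the .dat suffix on the output are hypothetical; only the EF_ prefix is taken from the loop above.

```shell
# filter_all   (hypothetical driver name)
# For every N.xyz in the current directory that has a matching N.txt,
# write the rows of N.txt whose 5th column occurs in N.xyz to EF_N.dat.
# Assumption: the .dat suffix is chosen here only for illustration.
filter_all() {
    for keyfile in ./*.xyz; do
        [ -e "$keyfile" ] || continue      # the glob matched nothing
        base=${keyfile##*/}                # drop the leading "./"
        base=${base%.xyz}                  # drop the extension, e.g. "3"
        datafile=./$base.txt
        [ -e "$datafile" ] || continue     # skip a .xyz file without a .txt partner
        awk 'NR == FNR { keys[$5]; next } $5 in keys' \
            "$keyfile" "$datafile" > "EF_$base.dat"
    done
}
```

Looping over a glob instead of the output of ls keeps file names intact, and the shell parameter expansion ${base%.xyz} recovers the shared numeric name that links the two files.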
Awk reports an 'unterminated string' error, but I have checked the code and couldn't find the cause. Please help.
Thank you for your time and attention.
-A