Match string from two files and print line

Hi,
I have been trying to find help with my issue and I'm thinking awk may be able to do it.
I have two files eg

file1.txt
STRING1    230    400    0.36
STRING2    400    230    -0.13
STRING3    130    349      1

file2.txt

CUFFFLINKS    1   1394    93932   .   +     STRING1
CUFFFLINKS    1   94055   96078   .   +     STRING1
CUFFFLINKS    1   7654    9000   .     +     STRING3
CUFFFLINKS    1   6544    93932   .   +     STRING4

desired.txt
CUFFFLINKS    1   1394    93932   .   +    STRING1
CUFFFLINKS    1   94055   96078   .   +    STRING1
CUFFFLINKS    1   7654    9000   .   +     STRING3

I would like to loop through all entries in column 1 of file 1 and if the string matches any entry in column 7 of file 2 to print out the line of file 2

I have tried:

awk 'NR == FNR { a[$1]++ } NR != FNR { for (e in a) for (i=1;i<NF;i++) if (e ~ $i) print $0 }' file1.txt file2.txt

but this doesn't seem to work.

My understanding is that NR will only == FNR when the first file is read in so this populates the 'a' array. Then when NR != FNR (eg when the second file is read in) then there is a loop to try to match every element of 'a' and if this matched to print out the line. I can't see how I can get this specific to column 7 in the second file??
I'm a complete beginner so any help would be really appreciated!
Thanks.

Welcome to the forum.
Thanks for (partly) using CODE tags, but please do so consistently.
Did you consider the links at the bottom left of this page? They usually offer a good starting point...
Howsoever, try

awk 'NR == FNR {a[$1]; next} $7 in a' file1 file2
CUFFFLINKS    1   1394    93932   .   +     STRING1
CUFFFLINKS    1   94055   96078   .   +     STRING1
CUFFFLINKS    1   7654    9000   .     +     STRING3
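
For anyone following along, here is the same one-liner spread over several lines with comments, run against the sample data from the first post. This is only a sketch: the scratch directory and the shell variable names (tmpdir, matched) are my own and not part of the suggestion above.

```shell
# Sample data copied from the first post, written to a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1.txt" <<'EOF'
STRING1    230    400    0.36
STRING2    400    230    -0.13
STRING3    130    349      1
EOF
cat > "$tmpdir/file2.txt" <<'EOF'
CUFFFLINKS    1   1394    93932   .   +     STRING1
CUFFFLINKS    1   94055   96078   .   +     STRING1
CUFFFLINKS    1   7654    9000   .     +     STRING3
CUFFFLINKS    1   6544    93932   .   +     STRING4
EOF

matched=$(awk '
    # NR counts lines across all files, FNR restarts for each file, so the
    # two are equal only while the first file is being read.
    # Referencing a[$1] is enough to create the key; "next" skips the rule below.
    NR == FNR { a[$1]; next }
    # Only reached for the second file. A pattern without an action prints
    # the whole line when the pattern is true.
    $7 in a
' "$tmpdir/file1.txt" "$tmpdir/file2.txt")

printf '%s\n' "$matched"
rm -rf "$tmpdir"
```

The lookup is an exact comparison of the whole of column 7 against the keys, so there is no need to loop over the array or over the fields as in the original attempt.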

Thanks RudiC. It's not quite working and I think I might know why. I've realised column 7 in file 2 is actually wrapped in ""; so I'm guessing the strings won't match exactly? E.g.

file2.txt
CUFFFLINKS    1   1394    93932   .   +     "STRING1";
CUFFFLINKS    1   94055   96078   .   +     "STRING1";
CUFFFLINKS    1   7654    9000   .     +     "STRING3";
CUFFFLINKS    1   6544    93932   .   +     "STRING4";

Would this prevent the strings from matching? If so, how can I remove these before performing the match? I thought about trying to open file 2, remove them with sed somehow and pipe into the command that you suggested but I can't do this as awk takes in file2 at the end of the command???

awk 'FNR==NR {f1[$1];next} $(NF-1) in f1' file1 FS='"' file2
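
As an aside, the sed idea from the question should also be workable, because awk treats a lone - operand as standard input. A rough sketch, assuming the quoted sample data above; the scratch directory and the variable names (tmpdir, piped) are made up for the illustration:

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/file1.txt" <<'EOF'
STRING1    230    400    0.36
STRING2    400    230    -0.13
STRING3    130    349      1
EOF
cat > "$tmpdir/file2.txt" <<'EOF'
CUFFFLINKS    1   1394    93932   .   +     "STRING1";
CUFFFLINKS    1   94055   96078   .   +     "STRING1";
CUFFFLINKS    1   7654    9000   .     +     "STRING3";
CUFFFLINKS    1   6544    93932   .   +     "STRING4";
EOF

# sed deletes every double quote and semicolon, then awk reads the cleaned
# text from standard input because of the "-" operand after file1.
piped=$(sed 's/[";]//g' "$tmpdir/file2.txt" |
    awk 'NR == FNR { a[$1]; next } $7 in a' "$tmpdir/file1.txt" -)

printf '%s\n' "$piped"
rm -rf "$tmpdir"
```

Note that the printed lines are the cleaned ones, so the quotes and semicolons are gone from the output as well. Changing FS keeps the original lines intact.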

Thanks vgersh99 but I can't quite get this to work. Could you explain the code?

It's very similar to RudiC's solution, with the exception of the field separator (FS) being a double quote when file2 is processed.
It works with the files you've posted so far. What files are you passing through, and what do you get as output?
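
To see what the double-quote separator does to a single line, a small sketch like the following may help. It uses -F rather than the FS= operand only because there is just one input here; the variable names (line, fields) are placeholders of mine.

```shell
# One line from the quoted file2 sample in this thread.
line='CUFFFLINKS    1   1394    93932   .   +     "STRING1";'

# With the double quote as separator the line splits into three fields:
# the text before the first quote, the quoted string, and the trailing ";".
fields=$(printf '%s\n' "$line" |
    awk -F'"' '{ printf "NF=%d|$2=%s|$(NF-1)=%s\n", NF, $2, $(NF-1) }')

printf '%s\n' "$fields"
```

In the two-file command, FS='"' is written between file1 and file2 because awk applies such an assignment when it reaches it in the argument list, so file1 is still split on whitespace and only file2 is split on quotes.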


Apologies, this is my first post. I thought it would be easier to explain with an amended file but I've learnt this confuses things. My file 2 looks like this.

1       StringTie       exon    18887   19382   .       +       .       transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";
1       StringTie       exon    189836  191490  .       +       .       transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "2";
1       StringTie       exon    18887   19382   .       +       .       transcript_id "MSTRG.49.4"; gene_id "MSTRG.5"; exon_number "1";
1       StringTie       exon    189836  191490  .       +       .       transcript_id "MSTRG.49.4"; gene_id "MSTRG.5"; exon_number "2";

If I run

awk '{print $10}' file2

this prints out the "MSTRG###"; string that I want to match on (I don't want to match the string from gene_id).

try - not tested (yet)

awk 'FNR==NR {f1[$1];next} $2 in f1' file1 FS='"' file2



sorry - counted the fields wrong and don't have the matching file1, but:

awk 'FNR==NR {f1[$1];next} $4 in f1' file1 FS='"' file2

It should be either $2 or $4 depending on what MSTRG you want to match on......

Can you explain the code (the $4 in particular)?
My file 1 is like:

MSTRG.1.1       233     0       0       0
MSTRG.5.1       2151    300     0.7186  
MSTRG.13.1      1705    261     0.4076
MSTRG.49.1      1746    357     1.189
MSTRG.50.1      1809    273     1.0285
MSTRG.50.2      890     201     1.1133  
MSTRG.50.3      466     75      0.7497  
MSTRG.49.4      1743    246     1.0052
MSTRG.49.5      885     246     1.0052

Given file1 and file2 as posted above, the code should be:

awk 'FNR==NR {f1[$1];next} $2 in f1' file1 FS='"' file2

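The field separator (FS) for file2 is the double quote ("). When file2 is processed (the $2 in f1 part), given FS='"', everything before the first quote is $1, so $2 becomes the FIRST quoted string WITHOUT the quotes, i.e. the transcript_id value. $4 would be the second quoted string, the gene_id value.
Not sure if my explanation is consumable tho :wink: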
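
Here is a sketch that runs the command on the data posted above. I have used only the first three rows of file1, so that some file2 lines are actually filtered out. The second awk call is my own variation, not something suggested earlier in the thread: it picks the ID by the transcript_id name with match() instead of by position, which might be safer if the attribute order ever differs. The scratch directory and the variable names (tmpdir, by_position, by_name) are assumptions for the illustration.

```shell
tmpdir=$(mktemp -d)
# First three rows of file1 as posted (a deliberate subset).
cat > "$tmpdir/file1" <<'EOF'
MSTRG.1.1       233     0       0       0
MSTRG.5.1       2151    300     0.7186
MSTRG.13.1      1705    261     0.4076
EOF
cat > "$tmpdir/file2" <<'EOF'
1       StringTie       exon    18887   19382   .       +       .       transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";
1       StringTie       exon    189836  191490  .       +       .       transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "2";
1       StringTie       exon    18887   19382   .       +       .       transcript_id "MSTRG.49.4"; gene_id "MSTRG.5"; exon_number "1";
1       StringTie       exon    189836  191490  .       +       .       transcript_id "MSTRG.49.4"; gene_id "MSTRG.5"; exon_number "2";
EOF

# The command from the thread: $2 is the first quoted string in file2.
by_position=$(awk 'FNR==NR {f1[$1];next} $2 in f1' \
    "$tmpdir/file1" FS='"' "$tmpdir/file2")

# Variation: locate the transcript_id attribute by name.
by_name=$(awk '
    FNR == NR { f1[$1]; next }
    match($0, /transcript_id "[^"]+"/) {
        # Skip the 15 characters of the attribute name, space and opening
        # quote, and drop the closing quote.
        id = substr($0, RSTART + 15, RLENGTH - 16)
        if (id in f1) print
    }
' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$by_position"
rm -rf "$tmpdir"
```

With this subset only the two MSTRG.5.1 exon lines are printed. With the full file1 from the post, the MSTRG.49.4 lines would match as well.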
