Comparing specific columns between two files

Dear All,

I have two files: file-A has 5 columns and file-B has 2 columns.
I want to match the 4th column of file-A against both columns of file-B and print all contents of file-A plus the matching lines of file-B as output.

file-A

30.00   12      gi|49483390|ref|YP_040614.1|    DIP-29721N|refseq:NP_683750|uniprot:Q8R418      2e-08
30.00   13      gi|49484704|ref|YP_041928.1|    DIP-33449N|uniprot:Q8WZ42       3e-09
30.00   16      gi|49483425|ref|YP_040649.1|    DIP-23879N|refseq:NP_650366|uniprot:Q9VFJ3      4e-06
30.00   17      gi|49484107|ref|YP_041331.1|    DIP-46805N|uniprot:P70388       1e-06
30.00   21      gi|49482259|ref|YP_039483.1|    DIP-25107N|refseq:NP_495440     2e-15
30.00   22      gi|49482976|ref|YP_040200.1|    DIP-22713N|refseq:NP_524108     1e-06
30.00   26      gi|49483184|ref|YP_040408.1|    DIP-17056N|refseq:NP_651605     1e-09
30.00   31      gi|49484099|ref|YP_041323.1|    DIP-29200N|refseq:NP_005436|uniprot:Q9UQE7      6e-12

file-B

DIP-10000N|refseq:NP_417192|uniprotkb:P30131    DIP-31848N|uniprotkb:P0A9B2
DIP-10000N|refseq:NP_417192|uniprotkb:P30131    DIP-36429N|uniprotkb:P0AAM7
DIP-10001N|refseq:NP_418748|uniprotkb:P39377    DIP-10001N|refseq:NP_418748|uniprotkb:P39377
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10003N|refseq:NP_290325|uniprotkb:P29209
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10149N|refseq:NP_417877|uniprotkb:P06993
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10397N|refseq:NP_416719|uniprotkb:P06996
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10467N|refseq:NP_415423|uniprotkb:P09373
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10557N|refseq:NP_416344|uniprotkb:P23865
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10573N|refseq:NP_414736|uniprotkb:P16659
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-10783N|refseq:NP_417800|uniprotkb:P02359
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-11097N|refseq:NP_290066|uniprotkb:P28242
DIP-10003N|refseq:NP_290325|uniprotkb:P29209    DIP-11354N|refseq:NP_415140|uniprotkb:P39177

Is it possible? I'd be highly thankful if someone can help me.

Matching fields in one file against a field in another file is easy with awk. But, given that there is only one line in file-B where both columns are the same and there is no line in file-A that contains the value that appears in that line in file-B, there is no output matching your request. Or did I misunderstand what you're trying to do?

And, if there were lines in your input files that met your criteria, your description of the output you want is not clear.

Please describe more clearly what you are trying to do and show us the sample output you are trying to produce from your sample inputs.

Any attempts from your side?

Based on wild guesses, and appreciating what Don Cragun said (NO matches!), and having removed the DOS <CR> line terminators in file-A, this seemed to do something like what you wanted:

awk 'FNR==NR {T[$1]=$0; T[$2]=$0; next} {print $0, T[$4]}' file-B file-A
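If editing file-A beforehand is inconvenient, the DOS <CR> terminators can also be dropped inside awk itself. The following is only a minimal sketch under assumptions: a.txt and b.txt are made-up file names holding tiny invented sample data, and, unlike the one-liner above, it prints only the lines that actually have a partner line.

```shell
# Hypothetical sample data; a.txt and b.txt are made-up names for illustration.
printf 'x 1 g1 DIP-1N|uniprot:P1 2e-08\r\ny 2 g2 DIP-9N|uniprot:P9 1e-06\r\n' > a.txt
printf 'DIP-1N|uniprot:P1 DIP-2N|uniprot:P2\n' > b.txt

out=$(awk '
    { sub(/\r$/, "") }                          # drop a trailing CR, if any
    FNR == NR { T[$1] = $0; T[$2] = $0; next }  # index each b.txt line by both columns
    ($4 in T) { print $0, T[$4] }               # print a.txt lines that have a partner line
' b.txt a.txt)
printf '%s\n' "$out"
```

Remove the `($4 in T)` condition to get back the original behaviour of printing every line of the first-column file, matched or not.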

Actually, the two columns of file-B are interacting protein partners.
I want to match column 4 of file-A against both columns of file-B, to see which proteins in column 4 of file-A are also present in either column of file-B, along with their corresponding protein partners.

This is how I want the output to look:

30.00  17  gi|49484107|ref|YP_041331.1|   DIP-46805N|uniprot:P70388  1e-06 DIP-44775N|refseq:NP_006210|uniprotkb:P42338    DIP-46805N|uniprotkb:P70388

i.e. all columns of file-A + the matching LINE (both columns) of file-B (if either column contains the same value as column 4 of file-A).
In the given output, note that column 2 of file-B has the same value as column 4 of file-A.

Hope I was able to explain my question better.
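Since a protein can appear on several lines of file-B (DIP-10003N does in the sample), storing a single line per key keeps only the last one seen. One possible way to print every matching line is sketched below; it is an illustration, not anything posted in this thread, and a.txt and b.txt are made-up names with invented sample data.

```shell
# Hypothetical sample data; DIP-1N appears on two lines of b.txt.
printf 'x 1 g1 DIP-1N|uniprot:P1 2e-08\n' > a.txt
printf 'DIP-1N|uniprot:P1 DIP-2N|uniprot:P2\nDIP-3N|uniprot:P3 DIP-1N|uniprot:P1\n' > b.txt

out=$(awk '
    FNR == NR {
        # append this line under each of its two keys, newline-separated
        T[$1] = (($1 in T) ? T[$1] "\n" : "") $0
        if ($2 != $1) T[$2] = (($2 in T) ? T[$2] "\n" : "") $0
        next
    }
    ($4 in T) {
        n = split(T[$4], M, "\n")               # one element per stored b.txt line
        for (i = 1; i <= n; i++) print $0, M[i]
    }
' b.txt a.txt)
printf '%s\n' "$out"
```

With this sample the single a.txt line is printed twice, once for each b.txt line that mentions DIP-1N.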

Your description makes it sound like the first five fields of the output you say you want should appear in file-A (and they do appear as the 4th line in your sample) and the last two fields should appear in file-B (but they do not). There is no match in the 1st field nor in the 2nd field on any line in file-B for the 4th field on any line in file-A in your sample.

And, even if the last two fields of your desired output did appear in file-B, there would still be no match... The 4th field in file-A:

DIP-46805N|uniprot:P70388

and the last field you have shown in your desired output:

DIP-46805N|uniprotkb:P70388

do NOT match.
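If the intent is to pair entries that share a DIP identifier even though the remaining annotations differ (uniprot vs. uniprotkb), one option is to compare only the text before the first "|". This is a sketch under that assumption, not something confirmed in the thread; dip() is a hypothetical helper, a.txt and b.txt are made-up file names, and the two sample lines are taken from the desired output shown earlier.

```shell
# Sample lines copied from the desired output; file names are made up.
printf '30.00 17 gi|49484107|ref|YP_041331.1| DIP-46805N|uniprot:P70388 1e-06\n' > a.txt
printf 'DIP-44775N|refseq:NP_006210|uniprotkb:P42338 DIP-46805N|uniprotkb:P70388\n' > b.txt

out=$(awk '
    function dip(s) { sub(/[|].*/, "", s); return s }   # keep text before the first "|"
    FNR == NR { T[dip($1)] = $0; T[dip($2)] = $0; next } # index b.txt lines by DIP id
    (dip($4) in T) { print $0, T[dip($4)] }              # compare on the DIP id only
' b.txt a.txt)
printf '%s\n' "$out"
```

Matching on the DIP identifier alone is looser than an exact field comparison, so it is only appropriate if that identifier is unique per protein in the real data.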

Yep, that's a mistake on my part. They sure don't match exactly.
The contents of file-A and file-B that I gave in my initial question were just an example (and not a very good one) of a very large data set (I forgot to mention that).

I will certainly try to be more detailed and exact next time.

Anyhow, the code RudiC suggested works fine, just the way I wanted.

Thanks anyway. :)