How to join one file with multiple files in a directory in UNIX?

Dear folks, hello.

I have one file (file1) whose structure looks like this:

1       gi|358484521|ref|NW_003764373.1|
1       gi|358484520|ref|NW_003764374.1|
1       gi|358484519|ref|NW_003764375.1|
.
.
.
30   gi|368484519|ref|NW_00449375.1|

In addition, I have around 300 files in a directory (call one of them file2), each with this structure:

gi|358484513|ref|NW_003764381.1| 1751 T C 0 1 1
gi|358484498|ref|NW_003764396.1| 481 A C 1 1 2
gi|358484498|ref|NW_003764396.1| 2603 T G 0 1 1

My desired output joins the second column of file1 with the first column of file2:

1  gi|358484513|ref|NW_003764381.1| gi|358484513|ref|NW_003764381.1| 1751 T C 0 1 1
1  gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 481 A C 1 1 2
1  gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 2603 T G 0 1 1

I tried a for loop to produce the joined files for all 300 files in the directory, but it did not work.

I would appreciate any suggestions.

You could try:

awk 'NR==FNR{A[FNR]=$0; next} {print A[FNR], $0}' file1 otherfile*

That works if it does not exceed the maximum command line argument length. Otherwise, try:

echo otherfile* | xargs awk 'NR==FNR{A[NR]=$0; next} {print A[FNR], $0}' file1
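Note that both one-liners pair records purely by line number: line N of file1 is glued onto line N of each data file, with no matching on the id key. Spelled out with comments, the first variant reads:

awk '
        NR==FNR { A[FNR] = $0; next }   # first file (file1): store each line under its line number
        { print A[FNR], $0 }            # remaining files: prepend the stored file1 line with the same number
' file1 otherfile*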

Thanks for your suggestions. I tried those commands, but they do not meet my need.

I made a correction to the second suggestion.

What does not work exactly? Can you elaborate?

As I mentioned, I have 300 files like file2 in the folder, all ending in *.geno, and each has a different number of rows. I want to know how many of the id fields (like gi|358484521|ref|NW_003764373.1|) are present in each of these 300 files.
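(As an aside: if a per-file count is all that is needed in the end, a minimal sketch along these lines would report it, assuming the ids sit in column 2 of file1 and in column 1 of each .geno file:)

for f in *.geno
do      printf '%s\t' "$f"
        awk 'NR==FNR { id[$2]; next } $1 in id { n++ } END { print n+0 }' file1 "$f"
done

The n+0 forces a numeric 0 to print for files with no matches at all.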

... which doesn't work with the samples you have given. With a slightly modified file2, try

awk 'FNR==NR {T[$2]=$0;next} $1 in T {print T[$1], $0}' file1 file2
1       gi|358484521|ref|NW_003764373.1| gi|358484521|ref|NW_003764373.1| 3303 T G 0 1 1
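The idea, spelled out with comments:

awk '
        FNR==NR { T[$2] = $0; next }    # file1: index each whole line by the gi|...| key in field 2
        $1 in T { print T[$1], $0 }     # file2: on a key match in field 1, print both lines side by side
' file1 file2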

Is that what you're after?
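As an aside, for a single pair of files the standard join utility can produce much the same result. A sketch, assuming bash for the process substitutions; note the output comes back sorted by key rather than in file2's original order:

join -1 2 -2 1 -o 1.1,1.2,2.1,2.2,2.3,2.4,2.5,2.6,2.7 \
        <(sort -k2,2 file1) <(sort -k1,1 file2)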

Dear Rudic

Thanks for your command, but it did not give me any output. Also, since I have 300 files, I think I would have to run this command 300 times.

Your requirement is unclear to me. In the example you gave, there is no relation between columns 2 and 3 in the output file.

---
*EDIT*
OK, I see you have just edited your post #1 and now the requirements are different. Please do not do that; it makes the thread hard to follow. And please try to get your specification right from the start.

So you mean something like this then?

awk 'NR==FNR{A[$2]=$0; next} $1 in A{print A[$1], $0}' file1 *.geno

or try an xargs approach...
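For example (a sketch; plain xargs splits on whitespace, so this assumes the .geno filenames contain none):

printf '%s\n' *.geno | xargs awk 'NR==FNR{A[$2]=$0; next} $1 in A{print A[$1], $0}' file1

Each awk invocation that xargs launches gets file1 as its first operand, so the NR==FNR pass rebuilds the key table before any .geno data is read.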


Sorry, Scrutinizer, if I sent you the wrong way. This time the command works well, but when I put it inside a loop, this code did not work:

ls *.geno | while read FN;  awk 'NR==FNR{A[$2]=$0; next} $1 in A{print A[$1], $0}' unlocol_accessions $FN > ${FN/geno/geno2}; done

Do you have any suggestion for putting your command in a loop?

Instead of saying "the code did not work", it would help everyone reading your thread if you posted the diagnostics the shell printed, which explain why it didn't work.

In this case, using correct syntax would seem to fix your problem:

ls *.geno | while read FN
do      awk '
                NR==FNR{A[$2]=$0; next}
                $1 in A{print A[$1], $0}
        ' unlocol_accessions "$FN" > "${FN/geno/geno2}"
done

or, if you insist on single line code instead of readable code:

ls *.geno | while read FN;do awk 'NR==FNR{A[$2]=$0; next} $1 in A{print A[$1], $0}' unlocol_accessions "$FN" > "${FN/geno/geno2}"; done

Or, more simply:

for FN in *.geno
do      awk '
                NR==FNR{A[$2]=$0; next}
                $1 in A{print A[$1], $0}
        ' unlocol_accessions "$FN" > "${FN/geno/geno2}"
done
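A quick sanity check afterwards (a sketch) shows how many joined lines each run produced, which also bears on the per-file counts asked about earlier:

wc -l *.geno2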

When I tried to merge the second column of file1 with the first column of file2, the command ran fine, but I have a problem. My file1 has 11580 ids and file2 has 1805 ids. file2 is the original file and I want to merge file2 with file1, but instead of getting 11580 ids in my merged file, I get only 1613 common ids. Note that in file1 there are ids repeated more than once. Does anyone have a suggestion?

You lost me.

Please show us a sample set of representative input files and the output (or outputs) that should be produced from those input files.

This is file1 (11580 ids):

gi|358468608|ref|NW_003780270.1| 3935 T C 0 1 1
gi|358468608|ref|NW_003780270.1| 4071 C T 0 1 1
gi|358468608|ref|NW_003780270.1| 4110 C T 0 1 1
gi|358468608|ref|NW_003780270.1| 4377 C G 1 1 2
gi|358468608|ref|NW_003780270.1| 4387 C A 0 1 1
gi|358468608|ref|NW_003780270.1| 4476 A G 1 1 2
gi|358468610|ref|NW_003780268.1| 2707 G A 0 1 1
gi|358468610|ref|NW_003780268.1| 3290 C T 0 1 1
gi|358468610|ref|NW_003780268.1| 5909 A G 0 1 1
gi|358468610|ref|NW_003780268.1| 5950 G A 0 1 1
gi|358468610|ref|NW_003780268.1| 6085 T A 0 1 1
gi|358468624|ref|NW_003780254.1| 392 T C 1 1 2
gi|358468624|ref|NW_003780254.1| 600 A G 0 1 1
gi|358468624|ref|NW_003780254.1| 924 C T 0 1 1
gi|358468624|ref|NW_003780254.1| 972 A G 0 1 1
gi|358468629|ref|NW_003780249.1| 681 A C 0 1 1
gi|358468631|ref|NW_003780247.1| 138 A G 1 1 2
gi|358468631|ref|NW_003780247.1| 327 T G 0 1 1
gi|358468631|ref|NW_003780247.1| 511 A T 0 1 1
gi|358468631|ref|NW_003780247.1| 513 C G 0 1 1
gi|358468633|ref|NW_003780245.1| 1076 T C 0 1 1
gi|358468633|ref|NW_003780245.1| 348 T C 0 1 1
gi|358468633|ref|NW_003780245.1| 460 A G 1 1 2
gi|358468633|ref|NW_003780245.1| 591 G C 0 1 1
.
.
.
gi|358484429|ref|NW_003764465.1| 927 T G 0 1 1
gi|358484430|ref|NW_003764464.1| 366 G A 1 1 2
gi|358484430|ref|NW_003764464.1| 662 C G 0 1 1
gi|358484430|ref|NW_003764464.1| 664 C G 0 1 1
gi|358484430|ref|NW_003764464.1| 709 A G 0 1 1
gi|358484430|ref|NW_003764464.1| 782 T C 0 1 1
gi|358484431|ref|NW_003764463.1| 1295 A G 1 1 2
gi|358484431|ref|NW_003764463.1| 1868 G A 0 1 1
gi|358484431|ref|NW_003764463.1| 1921 G A 0 1 1
gi|358484431|ref|NW_003764463.1| 1980 A G 0 1 1
gi|358484431|ref|NW_003764463.1| 2003 T C 1 1 2
gi|358484431|ref|NW_003764463.1| 3595 T C 0 1 1
gi|358484431|ref|NW_003764463.1| 607 A G 0 1 1
gi|358484431|ref|NW_003764463.1| 686 C G 0 1 1
gi|358484431|ref|NW_003764463.1| 844 C G 0 1 1
gi|358484432|ref|NW_003764462.1| 541 A G 0 1 1
gi|358484451|ref|NW_003764443.1| 1126 G A 0 1 1
gi|358484451|ref|NW_003764443.1| 988 T C 0 1 1
gi|358484469|ref|NW_003764425.1| 785 G T 0 1 1
gi|358484470|ref|NW_003764424.1| 440 C T 1 1 2
gi|358484470|ref|NW_003764424.1| 735 T A 0 1 1
gi|358484471|ref|NW_003764423.1| 551 G T 1 1 2
gi|358484498|ref|NW_003764396.1| 2503 A G 0 1 1
gi|358484498|ref|NW_003764396.1| 381 A G 1 1 2
gi|358484513|ref|NW_003764381.1| 1351 T C 0 1 1

This is file2 (1805 ids):

1       gi|358484521|ref|NW_003764373.1|
1       gi|358484520|ref|NW_003764374.1|
1       gi|358484519|ref|NW_003764375.1|
1       gi|358484518|ref|NW_003764376.1|
1       gi|358484517|ref|NW_003764377.1|
1       gi|358484516|ref|NW_003764378.1|
1       gi|358484515|ref|NW_003764379.1|
1       gi|358484514|ref|NW_003764380.1|
1       gi|358484513|ref|NW_003764381.1|
1       gi|358484512|ref|NW_003764382.1|
1       gi|358484511|ref|NW_003764383.1|
1       gi|358484510|ref|NW_003764384.1|
1       gi|358484509|ref|NW_003764385.1|
1       gi|358484508|ref|NW_003764386.1|
1       gi|358484507|ref|NW_003764387.1|
1       gi|358484506|ref|NW_003764388.1|
1       gi|358484505|ref|NW_003764389.1|
1       gi|358484504|ref|NW_003764390.1|
1       gi|358484503|ref|NW_003764391.1|
1       gi|358484502|ref|NW_003764392.1|
1       gi|358484501|ref|NW_003764393.1|
1       gi|358484500|ref|NW_003764394.1|
1       gi|358484499|ref|NW_003764395.1|
1       gi|358484498|ref|NW_003764396.1|
1       gi|358484497|ref|NW_003764397.1|
1       gi|358484496|ref|NW_003764398.1|
. 
.
.
LGE64   gi|358482732|ref|NW_003766162.1|
LGE64   gi|358482731|ref|NW_003766163.1|
LGE64   gi|358482730|ref|NW_003766164.1|
LGE64   gi|358482729|ref|NW_003766165.1|
LGE64   gi|358482728|ref|NW_003766166.1|
LGE64   gi|358482727|ref|NW_003766167.1|
LGE64   gi|358482726|ref|NW_003766168.1|
LGE64   gi|358482725|ref|NW_003766169.1|
LGE64   gi|358482724|ref|NW_003766170.1|
LGE64   gi|358482723|ref|NW_003766171.1|
LGE64   gi|358482722|ref|NW_003766172.1|
LGE64   gi|358482721|ref|NW_003766173.1|
LGE64   gi|358482720|ref|NW_003766174.1|
LGE64   gi|358482719|ref|NW_003766175.1|
LGE64   gi|358482718|ref|NW_003766176.1|
LGE64   gi|358482717|ref|NW_003766177.1|

Now, my original file is file2. I want to bring column 2 of file2 in front of file1. Notice that file1 has ids that repeat.

This was vice versa in post #1. Looking at these files, there's no match between file1's field 2 and file2's field 1.

Please exercise much more care when specifying your problem.

Dear Rudic

I did not post the whole data sets of file1 and file2 here; that is why you think these two files do not match. The point is that file2 has unique ids in its second column, but in file1 an id can be repeated in the first column. I want to know why, after merging these two files, I do not get the same lines that file1 had before.

Applying my proposal in post#6 to the above files yields

awk 'FNR==NR {T[$2]=$0;next} $1 in T {print T[$1], $0}' file2 file1
1       gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 2503 A G 0 1 1
1       gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 381 A G 1 1 2
1       gi|358484513|ref|NW_003764381.1| gi|358484513|ref|NW_003764381.1| 1351 T C 0 1 1

What exactly is wrong with this? It combines the entire lines from file1 and file2, and it looks exactly like what you requested in post #1. What does "I do not get the same line in file1 which it had before" mean?

Dear Rudic

When I applied your command, I found 1613 common ids between file1 and file2. However, file1 originally has 1805 ids and file2 originally has 11580 ids. Because file1 is the master id file, after merging with file2 I expect 11580 lines, but your command gives me only 1613 lines.

It prints only those lines in which file1's $1 matches file2's $2. If that's not the desired operation, please explain what your requirement from post #1 means.
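To see where the count drops, you could list the file1 lines whose key has no counterpart in file2. And if the goal is to keep every file1 line regardless, an outer-join style variant would do it; NO_MATCH below is just an arbitrary placeholder. Both are sketches using the same field layout as above:

# file1 lines whose field-1 key is absent from file2's field 2
awk 'FNR==NR {T[$2]; next} !($1 in T)' file2 file1

# keep every file1 line; prefix the matching file2 line, else the NO_MATCH marker
awk 'FNR==NR {T[$2]=$0; next} {print (($1 in T) ? T[$1] : "NO_MATCH"), $0}' file2 file1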

Dear Rudic

As in my first post, it is clear that I want to join the common ids between these two files (file1 and file2). But the point is that file2 has ids with replication and file1 has just a unique number for each id. Usually after joining two files, the unique ids should match their repeated counterparts in file1. In conclusion, I am trying to merge file1 into file2 without changing the lines of file2.

As in your 1st post, it is NOT clear.

And, I repeat:

If you are unwilling to select a representative set of data from your input files that shows lines that should match, lines that should not match, and lines that have multiple matches, and to show us the exact output you want from that sample input, then we are all wasting our time here.

Show us sample input. Show us the exact output that should be produced from that sample input. If you are unwilling to do that, I will close this thread.