In addition, I have around 300 files which can called (file2) in a directory with this structure:
gi|358484513|ref|NW_003764381.1| 1751 T C 0 1 1
gi|358484498|ref|NW_003764396.1| 481 A C 1 1 2
gi|358484498|ref|NW_003764396.1| 2603 T G 0 1 1
Now my desire output is join the Second column of file1 with first column of file2.
1 gi|358484513|ref|NW_003764381.1| gi|358484513|ref|NW_003764381.1| 1751 T C 0 1 1
1 gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 481 A G 1 1 2
1 gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 2603 A G 0 1 1
I was trying a for loop to get the join files for all of the 300 files in the directory but it does not happened.
I will appreciate if any one give me a suggestion.
As I mentioned, I have 300 files in the folders which are like file2 and they all have the same ending file name which is *.geno . now each of these file have different rows. I want to know how many of the id fields like (gi|358484521|ref|NW_003764373.1|) are present in each of these 300 files.
Thanks for your command. your command did not give me anything in the output. By the way, if when I have 300 files, I think with this command, I should to run 300 times.
Your requirement is unclear to me. In the example you gave there is no relation between column 2 and 3 in the output file..
---
*EDIT*
OK I see you have just edited your post #1 and now the requirements are different. Please do not do that, it makes the thread hard to follow. And please try to get your specification right from the start.
So you mean something like this then?
awk 'NR==FNR{A[$2]=$0; next} $1 in A{print A[$1], $0}' file1 *.geno
Instead of saying "code did not work" it would help everyone reading your thread if you would post the diagnostics that were printed by the shell that explains why it didn't work.
In this case, using correct syntax would seem to fix your problem:
ls *.geno | while read FN
do awk '
NR==FNR{A[$2]=$0; next}
$1 in A{print A[$1], $0}
' unlocol_accessions "$FN" > "${FN/geno/geno2}"
done
or, if you insist on single line code instead of readable code:
ls *.geno | while read FN;do awk 'NR==FNR{A[$2]=$0; next} $1 in A{print A[$1], $0}' unlocol_accessions "$FN" > "${FN/geno/geno2}"; done
Or, more simply:
for FN in *.geno
do awk '
NR==FNR{A[$2]=$0; next}
$1 in A{print A[$1], $0}
' unlocol_accessions "$FN" > "${FN/geno/geno2}"
done
As I tried to merge the second column of the file1 with first column of file2, I was able to run the command perfectly but I have a problem here. My file1 have 11580 id and file2 have 1805 id. the file2 is the original file and I want to merge file2 with file1 but instead I get 11580 id in my merge file, I will get 1613 common id in my merge file. I need to note in file1, I have id's which repeated more than one time. Does anyone have a suggestion for my problem?
gi|358468608|ref|NW_003780270.1| 3935 T C 0 1 1
gi|358468608|ref|NW_003780270.1| 4071 C T 0 1 1
gi|358468608|ref|NW_003780270.1| 4110 C T 0 1 1
gi|358468608|ref|NW_003780270.1| 4377 C G 1 1 2
gi|358468608|ref|NW_003780270.1| 4387 C A 0 1 1
gi|358468608|ref|NW_003780270.1| 4476 A G 1 1 2
gi|358468610|ref|NW_003780268.1| 2707 G A 0 1 1
gi|358468610|ref|NW_003780268.1| 3290 C T 0 1 1
gi|358468610|ref|NW_003780268.1| 5909 A G 0 1 1
gi|358468610|ref|NW_003780268.1| 5950 G A 0 1 1
gi|358468610|ref|NW_003780268.1| 6085 T A 0 1 1
gi|358468624|ref|NW_003780254.1| 392 T C 1 1 2
gi|358468624|ref|NW_003780254.1| 600 A G 0 1 1
gi|358468624|ref|NW_003780254.1| 924 C T 0 1 1
gi|358468624|ref|NW_003780254.1| 972 A G 0 1 1
gi|358468629|ref|NW_003780249.1| 681 A C 0 1 1
gi|358468631|ref|NW_003780247.1| 138 A G 1 1 2
gi|358468631|ref|NW_003780247.1| 327 T G 0 1 1
gi|358468631|ref|NW_003780247.1| 511 A T 0 1 1
gi|358468631|ref|NW_003780247.1| 513 C G 0 1 1
gi|358468633|ref|NW_003780245.1| 1076 T C 0 1 1
gi|358468633|ref|NW_003780245.1| 348 T C 0 1 1
gi|358468633|ref|NW_003780245.1| 460 A G 1 1 2
gi|358468633|ref|NW_003780245.1| 591 G C 0 1 1
.
.
.
gi|358484429|ref|NW_003764465.1| 927 T G 0 1 1
gi|358484430|ref|NW_003764464.1| 366 G A 1 1 2
gi|358484430|ref|NW_003764464.1| 662 C G 0 1 1
gi|358484430|ref|NW_003764464.1| 664 C G 0 1 1
gi|358484430|ref|NW_003764464.1| 709 A G 0 1 1
gi|358484430|ref|NW_003764464.1| 782 T C 0 1 1
gi|358484431|ref|NW_003764463.1| 1295 A G 1 1 2
gi|358484431|ref|NW_003764463.1| 1868 G A 0 1 1
gi|358484431|ref|NW_003764463.1| 1921 G A 0 1 1
gi|358484431|ref|NW_003764463.1| 1980 A G 0 1 1
gi|358484431|ref|NW_003764463.1| 2003 T C 1 1 2
gi|358484431|ref|NW_003764463.1| 3595 T C 0 1 1
gi|358484431|ref|NW_003764463.1| 607 A G 0 1 1
gi|358484431|ref|NW_003764463.1| 686 C G 0 1 1
gi|358484431|ref|NW_003764463.1| 844 C G 0 1 1
gi|358484432|ref|NW_003764462.1| 541 A G 0 1 1
gi|358484451|ref|NW_003764443.1| 1126 G A 0 1 1
gi|358484451|ref|NW_003764443.1| 988 T C 0 1 1
gi|358484469|ref|NW_003764425.1| 785 G T 0 1 1
gi|358484470|ref|NW_003764424.1| 440 C T 1 1 2
gi|358484470|ref|NW_003764424.1| 735 T A 0 1 1
gi|358484471|ref|NW_003764423.1| 551 G T 1 1 2
gi|358484498|ref|NW_003764396.1| 2503 A G 0 1 1
gi|358484498|ref|NW_003764396.1| 381 A G 1 1 2
gi|358484513|ref|NW_003764381.1| 1351 T C 0 1 1
I did not bring the whole data set of file1 and file2 here. That's what you think these two file do not match together. The question is file2 have unique number of id in the second column but in file1 there could be repeated time of ids in column first. I want to know why after merging these two file, I do not get the same line in file1 which it had before.
Applying my proposal in post#6 to the above files yields
awk 'FNR==NR {T[$2]=$0;next} $1 in T {print T[$1], $0}' file2 file1
1 gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 2503 A G 0 1 1
1 gi|358484498|ref|NW_003764396.1| gi|358484498|ref|NW_003764396.1| 381 A G 1 1 2
1 gi|358484513|ref|NW_003764381.1| gi|358484513|ref|NW_003764381.1| 1351 T C 0 1 1
What exactly is wrong with this? It is the entire lines from file1 and file2 combined. It looks exactly like what you requested in post#1. What does "I do not get the same line in file1 which it had before" mean?
When I applied your command, I found 1613 common id between file1 and file2. However, file1 has originally 1805 id and file2 has originally 11580 id. because file1 is the master id file, after merging with file2 we expect 11580 lines which I got 1613 line by your command.
As my first post, it is clear I want to join the common id between these two files (file1 and file2). But the point is file2 has id with replication and file1 just have unique number for each id. Usually after join two files, the unique ids should match with their ids which repeated in file1. In conclusion, I am trying to merge file1 to file2 without changing the lines of file2.
If you are unwilling to select a representative set of data from your input files that shows lines that should match, lines that should not match, and lines that have multiple matches; and show us the exact output you want from that sample input, then we are all wasting our time here.
Show us sample input. Show us the exact output that should be produced from that sample input If you are unwilling to do that, I will close this thread.