I have two text files.
text file 1:
ID filePath col1 col2 col3
1 10584588.mol 269.126 190.958 23.237
2 10584549.mol 281.001 200.889 27.7414
3 10584511.mol 408.824 158.316 29.8561
4 10584499.mol 245.632 153.241 25.2815
5 10584459.mol 290.476 133.699 28.631
6 10584426.mol 440.552 150.846 30.1827
7 10584298.mol 243.248 164.409 21.5715
8 10584286.mol 283.078 230.034 24.3697
9 10584278.mol 287.807 198.625 27.7414
10 10584197.mol 224.356 184.317 24.3616
text file 2:
ID filePath SUB_ID ChBrg_REGID
1 10584588.mol 10584588 9070369
2 10584549.mol 10584549 9070193
3 10584499.mol 10584499 9069982
4 10584459.mol 10584459 9069773
5 10584426.mol 10584426 9069641
6 10584278.mol 10584278 9069060
7 10584197.mol 10584197 9068744
I need to merge the two, keeping only the rows that appear in both files (the shorter list could be the index). The column filePath is the index, so the final file should look like.
ID filePath SUB_ID ChBrg_REGID col1 col2 col3
1 10584588.mol 10584588 9070369 269.126 190.958 23.237
2 10584549.mol 10584549 9070193 281.001 200.889 27.7414
4 10584499.mol 10584499 9069982 245.632 153.241 25.2815
5 10584459.mol 10584459 9069773 290.476 133.699 28.631
6 10584426.mol 10584426 9069641 440.552 150.846 30.1827
9 10584278.mol 10584278 9069060 287.807 198.625 27.7414
10 10584197.mol 10584197 9068744 224.356 184.317 24.3616
I am guessing this could be done in awk, and certainly in perl, but I'm not sure how do to the alignment by the index.
LMHmedchem
Hi
Assuming your input files are a1 and a2:
awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}' a1 a2
ID filePath SUB_ID ChBrg_REGID col1 col2 col3
1 10584588.mol 10584588 9070369 269.126 190.958 23.237
2 10584549.mol 10584549 9070193 281.001 200.889 27.7414
4 10584499.mol 10584499 9069982 245.632 153.241 25.2815
5 10584459.mol 10584459 9069773 290.476 133.699 28.631
6 10584426.mol 10584426 9069641 440.552 150.846 30.1827
9 10584278.mol 10584278 9069060 287.807 198.625 27.7414
10 10584197.mol 10584197 9068744 224.356 184.317 24.3616
Guru
That worked great, except the header row never made it to the output file.
---------- Post updated at 02:16 PM ---------- Previous update was at 12:24 AM ----------
I have been working on the header row. If I do,
awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t", $1, $2, $3, $4, $5 }' $a2 > temp.txt
awk 'NR==1 {$1=$2=""}1' $a1 >> temp.txt
This comes close, but prints the entire a1 file to temp.txt, not just the first row. This takes the first 5 fields from file a2 and then is supposed to add from field 3 to the last field of file a1. This will gob together the header row, and then I can use the command above to fill in the rest of the file.
---------- Post updated at 02:20 PM ---------- Previous update was at 02:16 PM ----------
This seems to work, but overall this seems an odd way of adding the header row.
awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t", $1, $2, $3, $4, $5 }' $a2 > temp.txt
awk '{$1=$2=""} NR==1{print $0}' $a1 >> temp.txt
LMHmedchem
I'm no awk expert but..
Everything works fine here with your input files and guru's awk.
I'm getting header as it should be or ? :
After running the code vs your input (a1 and a2) i'm getting.
ID filePath SUB_ID ChBrg_REGID col1 col2 col3
1 10584588.mol 10584588 9070369 269.126 190.958 23.237
2 10584549.mol 10584549 9070193 281.001 200.889 27.7414
4 10584499.mol 10584499 9069982 245.632 153.241 25.2815
5 10584459.mol 10584459 9069773 290.476 133.699 28.631
6 10584426.mol 10584426 9069641 440.552 150.846 30.1827
9 10584278.mol 10584278 9069060 287.807 198.625 27.7414
Also for generating headers via awk, remove the header from the data and use BEGIN block or put another pair of { } bracers infront of guru's code (you will need to remove header from data first)
Something like :
{
if ( NR == 1 )
print $1 $2 $3 $4 $5 ## use printf to format as you wish
}
or
BEGIN {
printf ... ## you will need to write headers here yourself, since you can't use them from input file(s)
}
<other code>
I found a bug in my data where the first two files had a different header name for one header. The header row is now correct, more or less.
There still seems to be an issue in that the last column has three columns of space delimited data in it.
sumSO2Am SUB_ID SOURCE
0 10584046 ChemBridge
0 10580948 ChemBridge
0 10580812 ChemBridge
0 10580337 ChemBridge
0 10579979 ChemBridge
0 10579233 ChemBridge
The last two, SUB_ID and SOURCE are duplicate cols (already occur at $3,$4). These come from $3, $4 in a2. Each row should end with the sumSO2Am field.
I don't see where that is happening in the command, or I just don't get it. I see how the first 7 fields are being printed, but not the rest of each row. I can post some short test files if that would help.
LMHmedchem
you mean you have SUB_ID and SOURCE in both the file?
sample input files with which you tried the code would help.
regards,
Ahamed
Yes, apparently, my input files were not quite what I thought, so the field identifiers were not quite right. I have everything corrected now. The test files I am using have more cols than the sample I posted. I thought I had this working a bit ago, but now it seems it's not working again.
I have attached .zip with 4 files. There are the a1 and a2 input files, the output I am hoping for, and the output (incorrect) I am now getting. I am just trying to add the values from the cols SUB_ID, SOURCE, ChBrg_REGID from file a2 to file a1. File a1 has fewer rows, so it is necessary to look for the values in a1"filePath" to match the right row. This is basically looking up the values for the 3 cols in the a2 file and adding them in right after the a1"filePath" col. I won't always want all of the rest of the cols from a1, but it's just as well to leave them in, since I can edit the file further with cut, etc.
This is the command I am using,
awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}' a1_temp.txt a2_temp.txt > output_temp.txt
This is part of a more involved script, but I'm just trying to get this part working.
I really thought I had it for a bit, but I guess not.
LMHmedchem
I don't think there is a problem here. If you see your a1 input header it is
and a2 header
and the output
SUB_ID and SOURCE is not repeated anywhere.
If you want SUB_ID and SOURCE at 3rd and 4th column, then change the code accordingly.
Post the code you use, something is wrong in that!
regards,
Ahamed
That's more or less what I am trying to figure out how to do. I can get the three cols I want, but they are the last three, not cols 3 4 and 5. There is also a blank col.
Just to clarify, based on the command (different from the last post)
awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s\t%-15s\t%-10s\t%-15s\t%-10s\t%-10s\t%-10s\n",b[1],b[2],$3,$4,$5,b[3],b[4],b[5]}}' a1_temp.txt a2_temp.txt > output_temp.txt
I assume that in the print statement, $3 refers to the 3rd field of file a1_temp, and b[1] refers to the 1st field of file a2_temp???
If that is correct, in the printout I should get cols 1 and 2 of a2_temp (b[1],b[2]), which should be "id", "filePath". Then I should get cols 3, 4, and 5 from a1_temp, followed by cols 3,4, and 5 from a2_temp. Instead, I am getting all 39 cols from a1_temp, followed by a blank col, followed by the three cols I want from a2_temp. This isn't a disaster, as I can further edit the file with cut, but I would like to better understand the syntax of what I'm trying to do here.
LMHmedchem