merge two text files of different size on common index

LMHmedchem · April 30, 2011, 10:22pm

I have two text files.

text file 1:

ID  filePath       col1      col2      col3
1   10584588.mol   269.126   190.958   23.237
2   10584549.mol   281.001   200.889   27.7414
3   10584511.mol   408.824   158.316   29.8561
4   10584499.mol   245.632   153.241   25.2815
5   10584459.mol   290.476   133.699   28.631
6   10584426.mol   440.552   150.846   30.1827
7   10584298.mol   243.248   164.409   21.5715
8   10584286.mol   283.078   230.034   24.3697
9   10584278.mol   287.807   198.625   27.7414
10  10584197.mol   224.356   184.317   24.3616

text file 2:

ID   filePath       SUB_ID     ChBrg_REGID
1    10584588.mol   10584588   9070369
2    10584549.mol   10584549   9070193
3    10584499.mol   10584499   9069982
4    10584459.mol   10584459   9069773
5    10584426.mol   10584426   9069641
6    10584278.mol   10584278   9069060
7    10584197.mol   10584197   9068744

I need to merge the two, keeping only the rows that appear in both files (the shorter list could be the index). The column filePath is the index, so the final file should look like.

ID  filePath       SUB_ID     ChBrg_REGID   col1      col2      col3
1   10584588.mol   10584588   9070369       269.126   190.958   23.237
2   10584549.mol   10584549   9070193       281.001   200.889   27.7414
4   10584499.mol   10584499   9069982       245.632   153.241   25.2815
5   10584459.mol   10584459   9069773       290.476   133.699   28.631
6   10584426.mol   10584426   9069641       440.552   150.846   30.1827
9   10584278.mol   10584278   9069060       287.807   198.625   27.7414
10  10584197.mol   10584197   9068744       224.356   184.317   24.3616

I am guessing this could be done in awk, and certainly in perl, but I'm not sure how do to the alignment by the index.

LMHmedchem

guruprasadpr · April 30, 2011, 11:39pm

Hi
Assuming your input files are a1 and a2:

awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}' a1 a2  
ID filePath        SUB_ID     ChBrg_REGID     col1       col2       col3
1  10584588.mol    10584588   9070369         269.126    190.958    23.237
2  10584549.mol    10584549   9070193         281.001    200.889    27.7414
4  10584499.mol    10584499   9069982         245.632    153.241    25.2815
5  10584459.mol    10584459   9069773         290.476    133.699    28.631
6  10584426.mol    10584426   9069641         440.552    150.846    30.1827
9  10584278.mol    10584278   9069060         287.807    198.625    27.7414
10 10584197.mol    10584197   9068744         224.356    184.317    24.3616

Guru

LMHmedchem · May 1, 2011, 2:20pm

That worked great, except the header row never made it to the output file.

---------- Post updated at 02:16 PM ---------- Previous update was at 12:24 AM ----------

I have been working on the header row. If I do,

   awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t",  $1, $2, $3, $4, $5 }' $a2 > temp.txt
   awk 'NR==1 {$1=$2=""}1' $a1 >> temp.txt

This comes close, but prints the entire a1 file to temp.txt, not just the first row. This takes the first 5 fields from file a2 and then is supposed to add from field 3 to the last field of file a1. This will gob together the header row, and then I can use the command above to fill in the rest of the file.

---------- Post updated at 02:20 PM ---------- Previous update was at 02:16 PM ----------

This seems to work, but overall this seems an odd way of adding the header row.

   awk 'NR==1 {printf "%s\t%s\t%s\t%s\t%s\t",  $1, $2, $3, $4, $5 }' $a2 > temp.txt
   awk '{$1=$2=""} NR==1{print $0}' $a1 >> temp.txt

LMHmedchem

Peasant · May 1, 2011, 3:45pm

I'm no awk expert but..
Everything works fine here with your input files and guru's awk.
I'm getting header as it should be or ? :

After running the code vs your input (a1 and a2) i'm getting.

ID  filePath        SUB_ID     ChBrg_REGID     col1       col2       col3      
1   10584588.mol    10584588   9070369         269.126    190.958    23.237    
2   10584549.mol    10584549   9070193         281.001    200.889    27.7414   
4   10584499.mol    10584499   9069982         245.632    153.241    25.2815   
5   10584459.mol    10584459   9069773         290.476    133.699    28.631    
6   10584426.mol    10584426   9069641         440.552    150.846    30.1827   
9   10584278.mol    10584278   9069060         287.807    198.625    27.7414

Also for generating headers via awk, remove the header from the data and use BEGIN block or put another pair of { } bracers infront of guru's code (you will need to remove header from data first)
Something like :

{
if ( NR == 1 )
 print $1 $2 $3 $4 $5 ## use printf to format as you wish
}

or

BEGIN {
printf ... ## you will need to write headers here yourself, since you can't use them from input file(s)
}
<other code>

LMHmedchem · May 1, 2011, 3:59pm

I found a bug in my data where the first two files had a different header name for one header. The header row is now correct, more or less.

There still seems to be an issue in that the last column has three columns of space delimited data in it.

sumSO2Am                 SUB_ID     SOURCE                                          
0                 10584046   ChemBridge                                      
0                 10580948   ChemBridge                                      
0                 10580812   ChemBridge                                      
0                 10580337   ChemBridge                                      
0                 10579979   ChemBridge                                      
0                 10579233   ChemBridge

The last two, SUB_ID and SOURCE are duplicate cols (already occur at $3,$4). These come from $3, $4 in a2. Each row should end with the sumSO2Am field.

I don't see where that is happening in the command, or I just don't get it. I see how the first 7 fields are being printed, but not the rest of each row. I can post some short test files if that would help.

LMHmedchem

ahamed101 · May 1, 2011, 5:28pm

you mean you have SUB_ID and SOURCE in both the file?
sample input files with which you tried the code would help.

regards,
Ahamed

LMHmedchem · May 1, 2011, 6:45pm

Yes, apparently, my input files were not quite what I thought, so the field identifiers were not quite right. I have everything corrected now. The test files I am using have more cols than the sample I posted. I thought I had this working a bit ago, but now it seems it's not working again.

I have attached .zip with 4 files. There are the a1 and a2 input files, the output I am hoping for, and the output (incorrect) I am now getting. I am just trying to add the values from the cols SUB_ID, SOURCE, ChBrg_REGID from file a2 to file a1. File a1 has fewer rows, so it is necessary to look for the values in a1"filePath" to match the right row. This is basically looking up the values for the 3 cols in the a2 file and adding them in right after the a1"filePath" col. I won't always want all of the rest of the cols from a1, but it's just as well to leave them in, since I can edit the file further with cut, etc.

This is the command I am using,

awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s %-15s %-10s %-15s %-10s %-10s %-10s\n",b[1],b[2],$3,$4,b[3],b[4],b[5]}}'  a1_temp.txt a2_temp.txt > output_temp.txt

This is part of a more involved script, but I'm just trying to get this part working.

I really thought I had it for a bit, but I guess not.

LMHmedchem

ahamed101 · May 1, 2011, 7:35pm

I don't think there is a problem here. If you see your a1 input header it is

and a2 header

and the output

SUB_ID and SOURCE is not repeated anywhere.

If you want SUB_ID and SOURCE at 3rd and 4th column, then change the code accordingly.

Post the code you use, something is wrong in that!

regards,
Ahamed

LMHmedchem · May 1, 2011, 8:45pm

That's more or less what I am trying to figure out how to do. I can get the three cols I want, but they are the last three, not cols 3 4 and 5. There is also a blank col.

Just to clarify, based on the command (different from the last post)

awk 'NR==FNR{a[$2]=$0;next;}{if ($2 in a){split(a[$2],b," *");printf "%-2s\t%-15s\t%-10s\t%-15s\t%-10s\t%-10s\t%-10s\n",b[1],b[2],$3,$4,$5,b[3],b[4],b[5]}}' a1_temp.txt a2_temp.txt  > output_temp.txt

I assume that in the print statement, $3 refers to the 3rd field of file a1_temp, and b[1] refers to the 1st field of file a2_temp???

If that is correct, in the printout I should get cols 1 and 2 of a2_temp (b[1],b[2]), which should be "id", "filePath". Then I should get cols 3, 4, and 5 from a1_temp, followed by cols 3,4, and 5 from a2_temp. Instead, I am getting all 39 cols from a1_temp, followed by a blank col, followed by the three cols I want from a2_temp. This isn't a disaster, as I can further edit the file with cut, but I would like to better understand the syntax of what I'm trying to do here.

LMHmedchem