Join not working properly

I want to join two files , with file 1 col 3 and file 2 col 1 as key.
The join command is erratic for some reason. File 2 is a master file having all the names, and file 1 has some values. I want to add the names from fil2 in file 1. If I use the original master file, some output is missing.

For example Medtr1g004990 exists in the master file but does not come in the output.

However, if just use a truncated master file that has 5 records including Medtr1g004990, the output seems to be correct.

I have used sorted files also , same problem. Please help solve this, if join doesnt work,please let me know if some similar command to join would.

I have attached the original master file and pasted the truncated one.

File 1

# more Medtr1g006600.exp
XLOC_000005     XLOC_000005     Medtr1g004990   chr1:35909-40554        q1      q2      OK      0.520378        6.91484 3.73206 6.85797 5e-05   0.000126299     yes     down
XLOC_000006     XLOC_000006     Medtr1g006490   chr1:44429-46280        q1      q2      OK      16.1083 122.606 2.92814 10.2969 5e-05   0.000126299     yes     down
XLOC_000008     XLOC_000008     Medtr1g006600   chr1:51360-54977        q1      q2      OK      6.94505 3.84361 -0.853525       -2.49824        0.0001  0.000244358     yes     up
XLOC_000010     XLOC_000010     Medtr1g006660   chr1:70777-71741        q1      q2      OK      1.15476 2.47771 1.10142 2.07776 0.0045  0.00841718      yes     down
XLOC_000014     XLOC_000014     Medtr1g006975   chr1:129007-136403      q1      q2      OK      0.389401        0.166262        -1.2278 -2.00092        0.0017  0.00343409      yes     up

File 2 (truncated from attached file)

# more Medtr1g006600.annot
Medtr1g004990   casein kinase
Medtr1g006490   major intrinsic protein %28MIP%29 family transporter
Medtr1g006590   tonoplast intrinsic protein
Medtr1g006600   exostosin family protein
Medtr1g006605   hypothetical protein
Medtr1g006660   AP2 domain class transcription factor

Command and output with master file

 # join -a1 -1 3 -2 1  Medtr1g006600.exp mt4.genenames.txt
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
Medtr1g006660 XLOC_000010 XLOC_000010 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down AP2 domain class transcription factor
Medtr1g006975 XLOC_000014 XLOC_000014 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up disease resistance protein %28CC-NBS-LRR class%29 family protein

Command and output with truncated file

# join -a1 -1 3 -2 1  Medtr1g006600.exp Medtr1g006600.annot 
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down casein kinase
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down major intrinsic protein %28MIP%29 family transporter
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
Medtr1g006660 XLOC_000010 XLOC_000010 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down AP2 domain class transcription factor
Medtr1g006975 XLOC_000014 XLOC_000014 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up

Using your original files, I get this:

join -a1 -1 3 -2 1  file /tmp/mt4.genenames.txt
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down
join: /tmp/mt4.genenames.txt:6: is not sorted: Medtr1g004980    hypothetical protein
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
etc.

After sort /tmp/mt4.genenames.txt >/tmp/MED , I get

join -a1 -1 3 -2 1  file /tmp/MED 
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down casein kinase
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down major intrinsic protein %28MIP%29 family transporter
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
Medtr1g006660 XLOC_000010 XLOC_000010 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down AP2 domain class transcription factor
Medtr1g006975 XLOC_000014 XLOC_000014 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up disease resistance protein %28CC-NBS-LRR class%29 family protein

, which is close to what you need, I believe.

1 Like

Thanks, sorting solves the problem. I was previously sorting with -k3,3 which doesnt work but -k3 works.
Actually another problem is with some of the data.
Some of the key field in file 1 are concatenated by a comma

For example

XLOC_017038     XLOC_017038     Medtr3g056615,Medtr3g056620     chr3:22631993-22636677  q1      q2      OK      0.588602        0.125583        -2.22866        -2.09056        0.0099  0.0372138       yes     up

which actually has Medtr3g056615 and Medtr3g056620 , is it possible to look up both names from the master file and output names separated by a '"," ?

The corresponding entries in the master files are

Medtr3g056615   NB-ARC domain disease resistance protein
Medtr3g056620   NB-ARC domain disease resistance protein

So the output for these lines should be

Medtr3g056615,Medtr3g056620 XLOC_017038 XLOC_017038 chr3:22631993-22636677 q1 q2 OK 0.588602 0.125583 -2.22866 -2.09056 0.0099 0.0372138 yes up NB-ARC domain disease resistance protein,NB-ARC domain disease resistance protein

Try this - you may want to add some formatting, but in principle, it works:

awk     'NR==FNR        {T[$1]=($1="")$0; next}
                        {C=split ($3,I,",")
                         n=NF+1
                         for (i=1; i<=C; i++) $n=$n($n?",":"")T[i]}
         1
        ' /tmp/mt4.genenames.txt file1
XLOC_000005 XLOC_000005 Medtr1g004990 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down  casein kinase
XLOC_000006 XLOC_000006 Medtr1g006490 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down  major intrinsic protein %28MIP%29 family transporter
XLOC_000008 XLOC_000008 Medtr1g006600 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up  exostosin family protein
XLOC_000010 XLOC_000010 Medtr1g006660 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down  AP2 domain class transcription factor
XLOC_000014 XLOC_000014 Medtr1g006975 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up  disease resistance protein %28CC-NBS-LRR class%29 family protein
XLOC_017038 XLOC_017038 Medtr3g056615,Medtr3g056620 chr3:22631993-22636677 q1 q2 OK 0.588602 0.125583 -2.22866 -2.09056 0.0099 0.0372138 yes up  NB-ARC domain disease resistance protein, NB-ARC domain disease resistance protein
1 Like

Ok, I used the code removing a $n from line 4,

awk     'NR==FNR        {T[$1]=($1="")$0; next}                         
                                        {C=split ($3,I,",")                         
 n=NF+1                          for (i=1; i<=C; i++) $n=($n?",":"")T[i]}       
   1        
 ' /tmp/mt4.genenames.txt file1

The output doesnt pick up anything from genenames, it is the same as the input.

#   awk     'NR==FNR        {T[$1]=($1="")$0; next}
>                           {
>                           C=split ($3,I,",")
>                        n = NF+1
>                            for (i=1; i<=C; i++) $n=($n?",":"")T[i]}
>            1
>         ' mt4.genenames.sorted.txt file1 | head
XLOC_048267 XLOC_048267 Medtr0001s0120 scaffold0001:49523-50287 q1 q2 OK 1.84679 0 -inf nan 5e-05 0.000126299 yes up
XLOC_048258 XLOC_048258 Medtr0001s0190 scaffold0001:64218-69070 q1 q2 OK 3.61301 0 -inf nan 5e-05 0.000126299 yes up
XLOC_048271 XLOC_048271 Medtr0001s0430 scaffold0001:188781-195705 q1 q2 OK 9.48928 3.76233 -1.33467 -4.06109 5e-05 0.000126299 yes up
XLOC_048260 XLOC_048260 Medtr0001s0490 scaffold0001:397820-401891 q1 q2 OK 10.303 4.14216 -1.31462 -3.77643 5e-05 0.000126299 yes up
XLOC_048263 XLOC_048263 Medtr0001s0570 scaffold0001:430832-437206 q1 q2 OK 7.21593 4.49932 -0.681479 -2.09584 0.00025 0.00057981 yes up
XLOC_048274 XLOC_048274 Medtr0001s0660 scaffold0001:502646-507542 q1 q2 OK 17.1506 2.11969 -3.01634 -7.77595 5e-05 0.000126299 yes up
XLOC_048321 XLOC_048321 Medtr0002s0270 scaffold0002:187389-188523 q1 q2 OK 0.232771 4.20249 4.17426 5.6741 0.0001 0.000244358 yes down
XLOC_048337 XLOC_048337 Medtr0002s0900 scaffold0002:341925-343403 q1 q2 OK 69.3725 28.7422 -1.27119 -4.90762 5e-05 0.000126299 yes up
XLOC_048340 XLOC_048340 Medtr0002s1060 scaffold0002:400363-401875 q1 q2 OK 1.527 160.669 6.71725 14.7088 5e-05 0.000126299 yes down
XLOC_048308 XLOC_048308 Medtr0002s1200 scaffold0002:454553-457418 q1 q2 OK 11.0499 0.148847 -6.21406 -6.09651 0.00165 0.00334041 yes up

Something is missing in the code in this line, after the second $n

for (i=1; i<=C; i++) $n=$n($n?",":"")T[i]}

Don't remove anything, it needs the code as presented. Nothing is missing in that line which uses a conditional assignment; it appends the T[I[i]] array elements to the new last field $n.
Test the code with a small subset of both files; come back with error msgs on failure.

The error is the following with any subset of data, or for that matter any filename that i provide, it is not getting to the files at all.

awk: cmd. line:3: (FILENAME=file1 FNR=1) fatal: function `n' not defined

with code

awk     'NR==FNR {T[$1]=($1="")$0; next}
                        {C=split ($3,I,",")
                         n= $NF+1
                         for (i=1; i<=C; i++) $n=$n($n?",":"")T[i]}
         1
        ' mt4.genenames.sorted.txt file1

What OS and what awk version do you use?

# awk --version
GNU Awk 3.1.5

OS

Red Hat Enterprise Linux Server release 5.9 (Tikanga)

Strange. gawk should know conditional assignments. Try

                         n= $NF+1; Sep=""
                         for (i=1; i<=C; i++) {$n=$n Sep T[i]; Sep=","}}
1 Like

I guess this line need correction, isn't it?

$n=$n($n?",":"")T[i]}
1 Like

This time there are no errors but but output = input with no name appended

  awk     'NR==FNR        {T[$1]=($1="")$0; next}
                          {
                              C=split ($3,I,",")
                         n= $NF+1; Sep=""
                         for (i=1; i<=C; i++) {$n=$n Sep T[i]; Sep=","}}
           1
        ' mt4.genenames.sorted.txt file1 | head

---------- Post updated at 05:39 PM ---------- Previous update was at 05:36 PM ----------

I think the bracket after $n in that line makes it assume that n is a function.

Try this

NR==FNR        {X=$1; $1=""; T[X]=$0; next}
1 Like

awesome !! seems to work perfectly now... will get back if I notice anything else..thank you so much !

Seems like gawk 3.1.5 has some idiosyncratic ideas of parentheses...

@RudiC : I tried like this, this might be problem, where as with mawk no error message

awk 'BEGIN{
           hello="hello"; y="world" 
           # Space is there between o <SPACE> (
           x = hello (y)

           print x

           # No space between o and ( treating as function in my GNU Awk 3.1.8
           x = hello(y)
          }'
helloworld
awk: cmd. line:8: fatal: function `hello' not defined
mawk 'BEGIN{
           hello="hello"; y="world" 
           # Space is there between o <SPACE> (
           x = hello (y)

           print x

           # No space between o and ( treating as function in my GNU Awk 3.1.8
           x = hello(y)
           print x
          }'
helloworld
helloworld

Interesting! Maybe ritakadm can adapt and run earlier posts' scripts to verify that?