Joining multiple files based on one column with different and similar values (shell or perl)

Hi,

I have nine files looking similar to file1 & file2 below.

File1:
  1 ABCA1
   1 ABCC8
   1 ABR:N
   1 ACACB
   1 ACAP2
   1 ACOT1
   1 ACSBG
   1 ACTR1
   1 ACTRT
   1 ADAMT
   1 AEN:N
   1 AKAP1
File2:
   1 A4GAL
   1 ACTBL
   1 ACTL7
   1 ACTR1
   2 ADAMT
   1 AGPAT
   1 AHNAK
   1 AKAP1
   1 AKR1B
   1 AMT:N
   1 AOX1:
   1 APAF1

I would like to obtain a file like this (with either shell or perl?)

gene    file1    file2    file3 etc 
A4GAL    0    1    ?
ABCA1    1    0    ?
ABCC8    1    0
ABR:N    1    0
ACACB    1    0
ACAP2    1    0
ACOT1    1    0
ACSBG    1    0
ACTL7    0    1
ACTR1    1    1
etc

Unfortunately, I don't have an idea myself so any help is appreciated!

Best wishes, seqbiologist

See man join. With join, you must use flat files, as it seeks in anticipation of Cartesian products. Files need to be header free, delimited and sorted. If this is a many to one or one to one deal (a simple merge), you could use my m1join.c tool and all piped data for the sort and header removal: Is there a 'fuzzy search' facility in Linux?

Another alternative is the JDBC and unixODBC drivers that treat flat text or CSV files as database tables, so you can express your desire in SQL to jisql or isql (unixODBC).

1 Like
awk '{gene[$2];file[FILENAME];count[$2 FS FILENAME]=$1;}
     END { printf "gene"; for (i in file) printf OFS i; printf RS
           {for (i in gene )
              { printf i;
                 { for (j in file) printf count[i FS j]?OFS count[i FS j]:OFS "0"}
                 printf RS
              }
            }
         }' OFS="\t" file1 file2 file3
1 Like

Hi, DGPickett/

I looked at that thread and I don't see the m1join source. -- or did you mean that you mentioned it there? ... cheers, drl

"join" did the trick - although in a very incompetent way:

join -a1 -a2 -1 2 -2 2 -o 0,1.1,2.1 -e "0" ${genelist001} ${genelist002} | join -a1 -a2 -1 1 -2 2 -o 0,1.2,1.3,2.1 -e "0" - ${genelist003} | join -a1 -a2 -1 1 -2 2 -o 0,1.2,1.3,1.4,2.1 -e "0" - ${genelist004} etc

Thanks!