Help with joining files and adding headers to files

rrdavis · May 22, 2012, 1:00pm

Hi,

I have about 20 tab-delimited text files with non-sequential numbering, such as:

UCD2.summary.txt
UCD45.summary.txt
UCD56.summary.txt

The first column has the same content and the same number of lines in every file. The next two columns hold data points.

For example, UCD2.summary.txt:

a   8.9   9.6
b   5.6   68
c   8.5   52

UCD45.summary.txt:

a   4.2   8.5
b   5.5   56
c   5.6   12

There are no headers for these files. I would like to join all these files together since the first column has the same data. However I need to be able to tell which file each value came from so I need to add headers.

The output file would look like this, with headers:

probeID   UCD2-value1   UCD2-value2   UCD45-value1   UCD45-value2
a         8.9           9.6           4.2            8.5
b         5.6           68            5.5            56
c         8.5           52            5.6            12

I am very new to Linux and Perl and would love some help producing the output above. Thanks!

Ryan

Corona688 · May 22, 2012, 1:09pm

$ echo header > filename
$ join file1 file2 >> filename
$ cat filename

header
a 8.9 9.6 4.2 8.5
b 5.6 68 5.5 56
c 8.5 52 5.6 12

$
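
Two details about `join` matter for this approach: it expects both inputs to be sorted on the join field, and by default it splits on blanks and writes single spaces between fields. Below is a minimal sketch, assuming tab-delimited inputs that may not be sorted yet. The helper name `join_tab` and the demo file names are made up for illustration.

```shell
#!/bin/sh
# join_tab FILE1 FILE2  (hypothetical helper, not part of any tool)
# Sorts both inputs on column 1, then joins them with tab as the separator.
join_tab() {
    tab=$(printf '\t')
    a=$(mktemp) && b=$(mktemp) || return 1
    # Use the same collation for sort and join so they agree on the order
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$a"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$b"
    LC_ALL=C join -t "$tab" "$a" "$b"
    rm -f "$a" "$b"
}

# Demo: file1 is deliberately out of order
dir=$(mktemp -d)
printf 'b\t5.6\t68\na\t8.9\t9.6\n' > "$dir/file1"
printf 'a\t4.2\t8.5\nb\t5.5\t56\n' > "$dir/file2"
join_tab "$dir/file1" "$dir/file2"
rm -rf "$dir"
```

The output stays tab-separated, so it can be fed to later tab-aware steps without reformatting.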

rrdavis · May 22, 2012, 2:03pm

Thanks for the quick reply.
However this won't work because I don't have a header for each of the columns. I need to know for each column where the data came from.

Thanks!

Corona688 · May 22, 2012, 2:41pm

You'll have to get that information from somewhere, and nothing in your post suggests where it does come from, so I think we need more information.

rrdavis · May 22, 2012, 3:00pm

Sorry for not being clear.

I would like the output to be as so:

probeID   "FILENAMEA-info1"   "FILENAMEA-info2"   "FILENAMEB-info1"   "FILENAMEB-info2"
a         value               value               value               value
b         value               value               value               value
c         value               value               value               value

The first column would have the header "probeID".
The 2nd column would have the filename+info as a header,
the 3rd column would have the filename+info as a header,
and so on.

Does that make sense?
Thanks

Corona688 · May 22, 2012, 3:11pm

Yes, I see what you want now, sorry for being dense.

Working on something.

Corona688 · May 22, 2012, 3:20pm

$ cat jn.awk

BEGIN { OFS="\t";       }

F!=FILENAME {
        F=FILENAME;

        for(N=1; N<=NF; N++)    COL=COL OFS FILENAME"-info"N;
}

{
        D[$1]=D[$1] " " $0;
        if(!($1 in O))
        {
                O[++ORDER]=$1;
                O[$1]=1
        }
}

END {
        print substr(COL,2);
        for(N=1; N<=ORDER; N++)
        {
                $0=substr(D[O[N]], 2);
                $1=$1;
                print;
        }
}

$ awk -f jn.awk data1 data2

data1-info1     data1-info2     data1-info3     data2-info1     data2-info2    data2-info3
a       8.9     9.6     a       4.2     8.5
b       5.6     68      b       5.5     56
c       8.5     52      c       5.6     12

$

Perhaps not the most efficient but an all-in-one solution.

rrdavis · May 22, 2012, 7:55pm

Thanks.

It's almost there. The column that contains

a
b
c

does not need to be repeated.

Is there a way to have that first column labeled "probe" for all the files, and then use a simple join command to join all the files together?

Thanks

Corona688 · May 24, 2012, 3:58pm

$ cat header.sh

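#!/bin/sh
# Print a header row, then join all the given files on their first column.
# join(1) accepts only two files at a time, so the files are folded in one by one.
# Each file must already be sorted on its first column.

FILES="$*"
COL="probe"
TMP=`mktemp` || exit 1

for FILE in $FILES
do
        # Count the data columns from the first line (skipping the key column)
        N=1
        read LINE <"$FILE"
        set -- $LINE ; shift
        while [ "$#" -gt 0 ]
        do
                COL="$COL $FILE-$N"
                N=`expr $N + 1`
                shift
        done

        # The first file seeds the result; later files are joined onto it
        if [ -s "$TMP" ]
        then
                join "$TMP" "$FILE" >"$TMP.new" && mv "$TMP.new" "$TMP"
        else
                cat "$FILE" >"$TMP"
        fi
done

echo $COL
cat "$TMP"
rm -f "$TMP"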

$ ./header.sh data*

probe data1-1 data1-2 data2-1 data2-2
a 8.9 9.6 4.2 8.5
b 5.6 68 5.5 56
c 8.5 52 5.6 12

$
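
For the layout asked for in the first post (a single "probeID" column followed by FILE-valueN columns), a one-pass awk merge is another option. This is only a sketch under stated assumptions: every file contains the same keys in column 1, the label is derived by stripping ".summary.txt" from the file name, and the function name `merge_summaries` is hypothetical. It keeps keys in first-seen order, so the inputs do not need to be sorted.

```shell
#!/bin/sh
# merge_summaries FILE...  (hypothetical helper name)
# Prints one header row plus one row per key, tab-separated.
merge_summaries() {
    awk '
    BEGIN { OFS = "\t"; header = "probeID" }
    FNR == 1 {
        # New file: build a label by dropping the directory and ".summary.txt"
        label = FILENAME
        sub(/.*\//, "", label)
        sub(/\.summary\.txt$/, "", label)
        for (i = 2; i <= NF; i++)
            header = header OFS label "-value" (i - 1)
    }
    {
        # Remember first-seen key order, then append the data columns
        if (!($1 in row)) { order[++n] = $1; row[$1] = $1 }
        for (i = 2; i <= NF; i++)
            row[$1] = row[$1] OFS $i
    }
    END {
        print header
        for (k = 1; k <= n; k++)
            print row[order[k]]
    }' "$@"
}

# Demo with the two sample files from the first post
dir=$(mktemp -d)
printf 'a\t8.9\t9.6\nb\t5.6\t68\nc\t8.5\t52\n' > "$dir/UCD2.summary.txt"
printf 'a\t4.2\t8.5\nb\t5.5\t56\nc\t5.6\t12\n' > "$dir/UCD45.summary.txt"
merge_summaries "$dir"/UCD*.summary.txt
rm -rf "$dir"
```

If a key were missing from one of the files, its row would end up with fewer columns than the header, so it is worth checking that all files really share the same first column before relying on this.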