Hi,
I have about 20 tab-delimited text files with non-sequential numbering, such as:
UCD2.summary.txt
UCD45.summary.txt
UCD56.summary.txt
The first column has the same number of lines and the same content in every file. The next two columns hold data points.
For example, UCD2.summary.txt:
a 8.9 9.6
b 5.6 68
c 8.5 52
UCD45.summary.txt:
a 4.2 8.5
b 5.5 56
c 5.6 12
There are no headers in these files. I would like to join all of them together, since the first column holds the same data in each. However, I need to be able to tell which file each value came from, so I need to add headers.
The output file would look like this, with headers:
probeID UCD2-value1 UCD2-value2 UCD45-value1 UCD45-value2
a 8.9 9.6 4.2 8.5
b 5.6 68 5.5 56
c 8.5 52 5.6 12
I am very new to Linux and Perl and would love some help producing the output above. Thanks!
Ryan
$ echo header > filename
$ join file1 file2 >> filename
$ cat filename
header
a 8.9 9.6 4.2 8.5
b 5.6 68 5.5 56
c 8.5 52 5.6 12
$
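For anyone trying this, here is a minimal sketch of the same idea as a reusable function. It assumes the two inputs are already sorted on the first column (join requires that) and passes a literal tab to -t so tab-delimited input stays tab-delimited in the output. The function name join_with_header and its arguments are only illustrative.

```shell
# Sketch: print a one-line header, then join two tab-delimited files
# on their first column. Both files are assumed to be sorted on that column.
join_with_header() {
    header=$1
    left=$2
    right=$3
    tab=$(printf '\t')            # literal tab to use as the field separator
    printf '%s\n' "$header"
    join -t "$tab" "$left" "$right"
}
```

It would be called as: join_with_header "header" file1 file2 > filename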
Thanks for the quick reply.
However, this won't work because it doesn't give me a header for each column. I need to know where the data in each column came from.
Thanks!
You'll have to get that information from somewhere, and nothing in your post suggests where it does come from, so I think we need more information.
Sorry for not being clear.
I would like the output to look like this:
probeID "FILENAMEA-info1" "FILENAMEA-info2" "FILENAMEB-info1" "FILENAMEB-info2"
a value value value value
b value value value value
c value value value value
The first column would have the header "probeID".
The 2nd column would have the filename+info as a header.
The 3rd column would have the filename+info as a header.
And so on.
Does that make sense?
Thanks
Yes, I see what you want now, sorry for being dense.
Working on something.
$ cat jn.awk
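BEGIN { OFS="\t"; }
# On the first line of each new file, add one header label per column
F!=FILENAME {
    F=FILENAME;
    for(N=1; N<=NF; N++) COL=COL OFS FILENAME"-info"N;
}
{
    # Append this line to whatever has been collected for the same key
    D[$1]=D[$1] " " $0;
    # Remember the order in which keys first appear
    if(!($1 in O))
    {
        O[++ORDER]=$1;
        O[$1]=1
    }
}
END {
    print substr(COL,2);    # drop the leading separator
    for(N=1; N<=ORDER; N++)
    {
        $0=substr(D[O[N]], 2);
        $1=$1;              # force the record to be rebuilt with OFS
        print;
    }
}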
$ awk -f jn.awk data1 data2
data1-info1 data1-info2 data1-info3 data2-info1 data2-info2 data2-info3
a 8.9 9.6 a 4.2 8.5
b 5.6 68 b 5.5 56
c 8.5 52 c 5.6 12
$
Perhaps not the most efficient but an all-in-one solution.
Thanks.
It's almost there. The column that contains
a
b
c
does not need to be repeated.
Is there a way to have that first column labeled "probe" for all the files, and then use a simple join command to join all the files together?
Thanks
$ cat header.sh
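#!/bin/sh
# Print a header built from the file names, then join the files on column 1.
FILES="$*"                  # file names must not contain whitespace
COL="probe"
for FILE in $FILES
do
    N=1
    read LINE <"$FILE"      # the first line shows how many columns the file has
    set -- $LINE ; shift    # drop the key column, keep the data columns
    while [ "$#" -gt 0 ]
    do
        COL="$COL $FILE-$N"
        N=`expr $N + 1`
        shift
    done
done
echo "$COL"
# join takes exactly two input files, both sorted on the join field
join $FILES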
$ ./header.sh data*
probe data1-1 data1-2 data2-1 data2-2
a 8.9 9.6 4.2 8.5
b 5.6 68 5.5 56
c 8.5 52 5.6 12
$
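Note that join accepts only two files at a time, so with about 20 files the join step would have to be chained. As an alternative, here is a minimal sketch that does the whole merge in awk for any number of files. It assumes, as the original question states, that every file has the same keys in its first column. The function name join_all is only illustrative, and the header labels are my assumption: the text before the first dot in the file name plus "-valueN", to match the requested output.

```shell
# Sketch: merge any number of whitespace- or tab-delimited files that share
# the same first column, and print a header naming the source of each column.
join_all() {
    awk '
    BEGIN { OFS = "\t"; header = "probeID" }
    FNR == 1 {
        # First line of a new file: build a label from the file name,
        # dropping any directory part and everything after the first dot.
        label = FILENAME
        sub(/.*\//, "", label)
        sub(/\..*$/, "", label)
        for (i = 2; i <= NF; i++)
            header = header OFS label "-value" (i - 1)
    }
    {
        if (!seen[$1]++) order[++n] = $1    # remember first-seen key order
        for (i = 2; i <= NF; i++)
            row[$1] = row[$1] OFS $i        # append only the data columns
    }
    END {
        print header
        for (k = 1; k <= n; k++)
            print order[k] row[order[k]]
    }' "$@"
}
```

It would be called as: join_all UCD*.txt > merged.txt
Unlike join, this does not need sorted input, but a key missing from one file would shift that row's columns to the left, so check that first if the files might differ.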