Hello,
I have 40 data files where the first three columns are the same (in theory) and the 4th column is different. Here is an example of three files,
file 1: A_f0_r179_pred.txt
Id Group Name E0
1 V N(,)'1 0.2904
2 V N(,)'2 0.3180
3 V N(,)'3 0.3277
4 V N(,)'4 0.3675
5 V N(,)'5 0.3456
file 2: A_f1_r173_pred.txt
Id Group Name E0
1 V N(,)'1 0.2916
2 V N(,)'2 0.3123
3 V N(,)'3 0.3234
4 V N(,)'4 0.3475
5 V N(,)'5 0.3294
file 3: A_f3_r243_pred.txt
Id Group Name E0
1 V N(,)'1 0.2581
2 V N(,)'2 0.2903
3 V N(,)'3 0.2988
4 V N(,)'4 0.3496
5 V N(,)'5 0.3390
In reality these files could have any number of rows.
What I need to do is aggregate the E0 columns into a single file, along with the Id and Name columns:
Id Name E0 E0 E0
1 N(,)'1 0.2904 0.2916 0.2581
2 N(,)'2 0.3180 0.3123 0.2903
3 N(,)'3 0.3277 0.3234 0.2988
4 N(,)'4 0.3675 0.3475 0.3496
5 N(,)'5 0.3456 0.3294 0.3390
The trick is that I want to check the "Name" column value of each row every time a new column is added; it is very important that this data stays in registration. It would also help to take something from each file name to use for a header in place of E0, because I think having all of the columns named the same is asking for trouble. It would be very easy to have the script change this in each file beforehand if that would make more sense.
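For what it's worth, this could also be done in one awk pass over all the files, deriving a per-file column header from the file name and aborting if a Name field ever disagrees. This is just a sketch under assumed names (A_f0_pred.txt, A_f1_pred.txt, merged.txt) with tab-separated input; the real run would list all 40 *_pred.txt files:

```shell
# sample inputs (tab-separated); stand-ins for the real 40 *_pred.txt files
printf 'Id\tGroup\tName\tE0\n1\tV\tN1\t0.29\n2\tV\tN2\t0.31\n' > A_f0_pred.txt
printf 'Id\tGroup\tName\tE0\n1\tV\tN1\t0.30\n2\tV\tN2\t0.32\n' > A_f1_pred.txt

awk -F'\t' -v OFS='\t' '
FNR == 1 {
    nf++                               # one output column per input file
    hdr[nf] = FILENAME                 # use the file name as the column header
    sub(/_pred\.txt$/, "", hdr[nf])    # e.g. A_f0_pred.txt -> A_f0
    next
}
{
    r = FNR - 1
    if (nf == 1) { id[r] = $1; name[r] = $3; nrows = r }
    else if (name[r] != $3) {          # registration check on the Name column
        printf "Name mismatch: %s line %d\n", FILENAME, FNR > "/dev/stderr"
        exit 1
    }
    e0[nf, r] = $4
}
END {
    line = "Id" OFS "Name"
    for (f = 1; f <= nf; f++) line = line OFS hdr[f]
    print line
    for (r = 1; r <= nrows; r++) {
        line = id[r] OFS name[r]
        for (f = 1; f <= nf; f++) line = line OFS e0[f, r]
        print line
    }
}' A_f0_pred.txt A_f1_pred.txt > merged.txt
```

Because awk holds everything in arrays keyed by file and row, the rows cannot fall out of registration silently; any disagreement stops the run with an error.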
My current thought was to use cut or paste to merge all of the columns I want, including the name columns, into one file like,
Id Name E0 Name E0 Name E0
1 N(,)'1 0.2904 N(,)'1 0.2916 N(,)'1 0.2581
2 N(,)'2 0.3180 N(,)'2 0.3123 N(,)'2 0.2903
3 N(,)'3 0.3277 N(,)'3 0.3234 N(,)'3 0.2988
4 N(,)'4 0.3675 N(,)'4 0.3475 N(,)'4 0.3496
5 N(,)'5 0.3456 N(,)'5 0.3294 N(,)'5 0.3390
Then I could use IFS=$'\t' read -r -a to grab each line into an array and test the Name fields to make sure they are all the same for each row. (Note that it has to be $'\t', not '\t'; the latter sets the separator to the two literal characters backslash and t.) If they all match, I could output the data columns to a new file. I think that would work, but it would be pretty awkward.
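That paste-then-verify step could look something like the sketch below, assuming tab-separated files and the hypothetical names A_f0_pred.txt, A_f1_pred.txt, wide.txt, and checked.txt:

```shell
# sample inputs (tab-separated), names assumed for illustration
printf 'Id\tGroup\tName\tE0\n1\tV\tN1\t0.29\n2\tV\tN2\t0.31\n' > A_f0_pred.txt
printf 'Id\tGroup\tName\tE0\n1\tV\tN1\t0.30\n2\tV\tN2\t0.32\n' > A_f1_pred.txt

# build the wide file: Id, Name, E0 from the first file; Name, E0 from the rest
paste <(cut -f1,3,4 A_f0_pred.txt) <(cut -f3,4 A_f1_pred.txt) > wide.txt

# verify registration row by row, then drop the duplicate Name columns
while IFS=$'\t' read -r -a row; do
    # Name copies sit at 0-based indices 3, 5, 7, ...; all must equal row[1]
    for ((i = 3; i < ${#row[@]}; i += 2)); do
        if [[ ${row[i]} != "${row[1]}" ]]; then
            echo "name mismatch in row starting ${row[0]}" >&2
            exit 1
        fi
    done
    # keep Id, Name, and the E0 fields at indices 2, 4, 6, ...
    out="${row[0]}"$'\t'"${row[1]}"
    for ((i = 2; i < ${#row[@]}; i += 2)); do
        out+=$'\t'"${row[i]}"
    done
    printf '%s\n' "$out"
done < wide.txt > checked.txt
```

The header line passes through the same check unharmed, since its repeated Name fields also match each other.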
At some point, I also need to create a new column with the average of all of the data columns for each row.
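The average column could be a small separate awk pass over the merged file; a sketch, assuming the merged file is called merged.txt and the data columns start at field 3:

```shell
# merged file as in the example above (name and layout assumed)
printf 'Id\tName\tA_f0\tA_f1\n1\tN1\t0.29\t0.30\n2\tN2\t0.31\t0.32\n' > merged.txt

# append a Mean column: the average of fields 3..NF on each data row
awk -F'\t' -v OFS='\t' '
NR == 1 { print $0, "Mean"; next }     # extend the header
{
    sum = 0
    for (i = 3; i <= NF; i++) sum += $i
    printf "%s%s%.4f\n", $0, OFS, sum / (NF - 2)
}' merged.txt > merged_avg.txt
```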
I had an older thread on something similar some time ago. The final script didn't actually work, but I thought I would post it anyway in case it is helpful. That script was supposed to allow the header value of the index key to be passed in the call to the script, along with the header names of the columns to be output.
There are a great many ways to do this, so suggestions are greatly appreciated.
LMHmedchem
The script below was kindly suggested by Chubler_XL. I believe it would work for what I need, but the output has the Id column out of order and includes many blank rows interspersed with the data.
#!/bin/bash
# script data_merge_awk.sh
INDEX=$1          # header name of the key column (e.g. Name)
INDEX_FILE=$2     # file whose rows define the output order
MERGE_FILE=$3     # file to merge in
INCLUDE=${4:-.*}  # regex list of column headers to output
EXCLUDE=${5:-}    # regex list of column headers to skip

awk -v IDX="$INDEX" -v O="$INCLUDE" -v N="$EXCLUDE" '
FNR==1 {
    # header line: work out which columns of this file to keep
    split(O, hm)     # include patterns
    split(N, skip)   # exclude patterns
    split("", p)     # p[i] = output slot for input column i, reset per file
    for (i = 1; i <= NF; i++) {
        if ($i == IDX) keypos = i
        if ($i in have) { p[i] = have[$i]; continue }  # header seen before
        for (t in hm) {
            x = ""
            if (!(i in p) && match($i, hm[t])) {
                for (x in skip) if (match($i, skip[x])) break
                if (x != "" && match($i, skip[x])) continue
                o[++headers] = $i    # remember the output header
                p[i] = headers       # map input column i to that slot
                have[$i] = headers
                break
            }
        }
    }
    next
}
keypos {
    key = $keypos
    if (!(key in K)) { K[key]; ord[++nkeys] = key }  # keep first-seen row order
    for (c in p) OUT[key, p[c]] = $c
}
END {
    $0 = ""
    for (i = 1; i <= headers; i++) $i = o[i]
    print
    for (k = 1; k <= nkeys; k++) {
        $0 = ""
        for (i = 1; i <= headers; i++) $i = OUT[ord[k], i]
        print
    }
}' FS='\t' OFS='\t' "$INDEX_FILE" "$MERGE_FILE"

# call with,
# data_merge_awk.sh index_key index_file merge_file [fields] [exclude]