Merge multiple tab delimited files with index checking

Hello,

I have 40 data files where the first three columns are the same (in theory) and the 4th column is different. Here is an example of three files,

file 1: A_f0_r179_pred.txt

Id	Group	Name	E0
1	V	N(,)'1	0.2904
2	V	N(,)'2	0.3180
3	V	N(,)'3	0.3277
4	V	N(,)'4	0.3675
5	V	N(,)'5	0.3456

file 2: A_f1_r173_pred.txt

Id	Group	Name	E0
1	V	N(,)'1	0.2916
2	V	N(,)'2	0.3123
3	V	N(,)'3	0.3234
4	V	N(,)'4	0.3475
5	V	N(,)'5	0.3294

file 3: A_f3_r243_pred.txt

Id	Group	Name	E0
1	V	N(,)'1	0.2581
2	V	N(,)'2	0.2903
3	V	N(,)'3	0.2988
4	V	N(,)'4	0.3496
5	V	N(,)'5	0.3390

In reality these files could have any number of rows.

What I need to do is to aggregate the E0 columns into a single file along with the Id and Name columns,

Id	Name	E0	E0	E0
1	N(,)'1	0.2904	0.2916	0.2581
2	N(,)'2	0.3180	0.3123	0.2903
3	N(,)'3	0.3277	0.3234	0.2988
4	N(,)'4	0.3675	0.3475	0.3496
5	N(,)'5	0.3456	0.3294	0.3390

The trick is that I want to check the "Name" column value of each row every time a new column is added. It is very important that this data stay in registration. It would also help to take something from each file name to use for a header in place of E0, because I think having all of the columns be named the same is asking for trouble. It would be very easy to have the script change this in the file beforehand if that would make more sense.
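For example, something like this sed loop could do the renaming in place beforehand (untested; assumes GNU sed for -i, and just takes the tag from the front of the file name):

# replace the E0 header in each file with e.g. A_f0_r179, taken from the file name
for f in *_pred.txt; do
   tag=${f%_pred.txt}
   sed -i "1s/\tE0$/\t${tag}/" "$f"
done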

My current thought was to use cut or paste to merge all of the columns I want, including the name columns, into one file like,

Id	Name	E0	Name	E0	Name	E0
1	N(,)'1	0.2904	N(,)'1	0.2916	N(,)'1	0.2581
2	N(,)'2	0.3180	N(,)'2	0.3123	N(,)'2	0.2903
3	N(,)'3	0.3277	N(,)'3	0.3234	N(,)'3	0.2988
4	N(,)'4	0.3675	N(,)'4	0.3475	N(,)'4	0.3496
5	N(,)'5	0.3456	N(,)'5	0.3294	N(,)'5	0.3390

Then I could use IFS=$'\t' read -a to grab each line into an array and test the name fields to make sure they are all the same for each row. If they are, I could output the data columns to a new file. I think that would work, but it would be pretty awkward.
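The check step would be roughly like this (untested sketch; merged.txt is a placeholder name for the pasted file, where the Name fields sit at 0-based indexes 1, 3, 5, ...):

# compare every later Name field against the first one, row by row
while IFS=$'\t' read -r -a FIELD; do
   for ((i = 3; i < ${#FIELD[@]}; i += 2)); do
      if [ "${FIELD[$i]}" != "${FIELD[1]}" ]; then
         echo "name mismatch in row with Id ${FIELD[0]}" >&2
         exit 1
      fi
   done
done < merged.txt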

At some point, I also need to create a new column with the average of all of the data columns for each row.
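Something like this awk could probably handle the averaging on the final file (untested; merged.txt is a placeholder, and it assumes the data columns run from field 3 to NF):

# append a mean column: sum fields 3..NF for each data row
awk 'NR == 1 {print $0, "mean"; next}
     {s = 0; for (i = 3; i <= NF; i++) s += $i
      printf "%s\t%.4f\n", $0, s / (NF - 2)}' FS='\t' OFS='\t' merged.txt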

I had an older thread on something similar some time ago,

The final script didn't actually work but I thought I would post it anyway in case it would be helpful. This script was supposed to allow the header value of the index key to be passed in the call to the script along with the header names of the columns to be output.

There are a great many ways to do this, so suggestions are greatly appreciated.

LMHmedchem

The script below was kindly suggested by Chubler_XL. I believe it would work for what I need, but the output has the id column out of order and includes many blank rows interspersed with the data.

#!/bin/bash

# script data_merge_awk.sh

INDEX=$1
INDEX_FILE=$2
MERGE_FILE=$3
INCLUDE=${4:-.*}
EXCLUDE=${5:-}
 
awk -vIDX="$INDEX" -vO="$INCLUDE" -vN="$EXCLUDE" '
FNR==1 {
   split(O, hm)
   split(N, skip)
   split("", p)
   for(i=1;i<=NF;i++) {
       if ($i==IDX) keypos=i
       if ($i in have) continue;
       for (t in hm) {
           x=""
           if (!(i in p) && match($i, hm[t])) {
               for(x in skip) if (match($i, skip[x])) break;
               if (x && match($i, skip[x])) continue;
               o[++headers]=$i
               p[i]=headers
               have[$i]
               break
           }
       }
   }
   next;
}
keypos { for(c in p) {K[$keypos]; OUT[$keypos,p[c]]= $(c) } }
END {
    $0=""
    for(i=1;i<=headers;i++) $i=o[i]
    print
    $0=""
    for(key in K) {
    for(i=1;i<=headers;i++) $i=OUT[key,i]
    print
    }
}' FS='\t' OFS='\t' "$INDEX_FILE" "$MERGE_FILE"

# call with,
# data_merge_awk.sh index_key index_file merge_file [fields] [exclude]

What if you pipe the output through a sort operation?
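For example (a sketch; merged_raw.txt and merged_sorted.txt are placeholder names -- split off the header, drop the blank rows, and sort the data rows numerically on the Id column):

./data_merge_awk.sh Id A_f0_r179_pred.txt A_f1_r173_pred.txt > merged_raw.txt
head -n 1 merged_raw.txt > merged_sorted.txt
tail -n +2 merged_raw.txt | grep -v '^[[:space:]]*$' | sort -t$'\t' -k1,1n >> merged_sorted.txt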

Hi.

It looks like you have a number of requests for help / requirements:

1) aggregate the E0 fields into a single file along with the Id and Name columns -- for 40 files -- a join operation
2) create a new column with the average of all of the data columns for each row
3) take something from each file name to use for a header in place E0

You seem to like to use awk, but given your heavy use of what are essentially CSV files (with TABs in place of commas), I think that acquiring and learning a CSV-specific tool would be useful. That's up to you, of course.

I found that I could use csvtool to at least start on this. Its join is far better than the system join (which deals with only two files). So here is, without the supporting scaffolding listed, what csvtool could easily do with your 3 sample files.

csvtool -t TAB -u TAB join 1,2,3 4 data[1-3]

producing:

1       V       N(,)'1  0.2904  0.2916  0.2581
2       V       N(,)'2  0.3180  0.3123  0.2903
3       V       N(,)'3  0.3277  0.3234  0.2988
4       V       N(,)'4  0.3675  0.3475  0.3496
5       V       N(,)'5  0.3456  0.3294  0.3390
Id      Group   Name    E0      E0      E0

However, csvtool does not do arithmetic directly. Incorporating the file name or some other distinguishing feature to replace the E0 also does not seem to be doable. I may look at csvfix, ffe, CRUSH, etc. to see how they might apply.
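For the arithmetic, though, one could at least pipe the join output into awk; a sketch against the 3 sample files (the data fields start at column 4 here, and the header row, wherever the join leaves it, just gets a "mean" label appended):

csvtool -t TAB -u TAB join 1,2,3 4 data[1-3] |
awk -F'\t' -v OFS='\t' '
$1 == "Id" {print $0, "mean"; next}
           {s = 0; for (i = 4; i <= NF; i++) s += $i
            printf "%s\t%.4f\n", $0, s / (NF - 3)}'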

Best wishes ... cheers, drl


Try this little awk proposal - very specific to your problem, and not as versatile and flexible as Chubler_XL's script:

awk '
NR == 1         {HD = $1
                }
FNR == 1        {split (FILENAME, T, "_")
                 HD = HD OFS $3 OFS $4 "_" T[2]
                }

                {IX  = FNR - 1
                 MAX = IX>MAX?IX:MAX 
                }

FNR == NR       {ID[IX]   = $1
                 NAME[IX] = $3
                }
$1 == ID[IX] &&
$3 == NAME[IX]  {OUT[IX]  = OUT[IX] $3 OFS $4 OFS
                 next
                }

                {OUT[IX]  = OUT[IX] OFS OFS
                }

END             {print HD
                 for (i=1; i<=MAX; i++) print ID[i], OUT[i]
                }
' OFS="\t" A_*_pred.txt
Id	Name	E0_f0	Name	E0_f1	Name	E0_f3
1	N(,)'1	0.2904	N(,)'1	0.2916	N(,)'1	0.2581	
2	N(,)'2	0.3180	N(,)'2	0.3123	N(,)'2	0.2903	
3	N(,)'3	0.3277	N(,)'3	0.3234	N(,)'3	0.2988	
4	N(,)'4	0.3675	N(,)'4	0.3475	N(,)'4	0.3496	
5	N(,)'5	0.3456	N(,)'5	0.3294	N(,)'5	0.3390	

Below is a script I put together last night.

The runtime for this was ~40 seconds for 40 input files, each with 2500 rows. That's not too awful, but I think this code is a bit ghastly. It would be faster if I collected all of the data in memory instead of writing it to a file and then reading it back in.

This solution also uses sed in the pipe to replace the E0 values with a value read from the file name as the data is passed to the new file. That is almost the only thing about this script that I like. The code is not generalized but could be made a bit more so in a few places.

RudiC, I will check out your latest post in a few minutes.

LMHmedchem

#!/bin/bash

# name of output file
output_file=$1

# collect names of all pred output files in array, files are in pwd with script
pred_file_list=( *_pred.txt )

# the first file forms the base of the output, so capture the name here
first_file=${pred_file_list[0]}

# get set, fold, rnd from file name
unset FIELD; IFS='_' read -a FIELD <<< "$first_file"
set_fold_rnd=${FIELD[0]}'_'${FIELD[1]}'_'${FIELD[2]}

# use the first output file as the base file for the rest
# collect columns 1,3,and 4 and pipe to aggregate file
# change E0 to set fold and rnd ini from file name
cut -f1,3,4 ${pred_file_list[0]} | sed "s/E0/$set_fold_rnd/1" > tmp_output1.txt

# loop through file list 
for pred_file in "${pred_file_list[@]}"
do
   # don't enter the first file twice
   if [ "$pred_file" != "$first_file" ]; then
      # get set, fold, rnd ini from filename
      unset FIELD; set_fold_rnd='';
      # create substitute column header value from filename
      IFS='_' read -a FIELD <<< "$pred_file"
      set_fold_rnd=${FIELD[0]}'_'${FIELD[1]}'_'${FIELD[2]}
      # collect columns 3 and 4 and pipe to temp file
      # change E0 to set fold and rnd ini from file name
      cut -f3,4 "./$pred_file" | sed "s/E0/$set_fold_rnd/1" > tmp_output2.txt
      # merge temp file with aggregate file to create second temp
      paste tmp_output1.txt  tmp_output2.txt > tmp_output3.txt
      # rename second temp back to aggregate file name
      mv tmp_output3.txt  tmp_output1.txt
      # cleanup
      rm -f tmp_output2.txt tmp_output3.txt
   fi
done

# tmp_output1.txt now contains all of the renamed data columns and all of the name columns

# name columns to check
# this could be dynamic by reading the header line and recording the positions where "Name" is found (see the sketch after this script)
declare -a field_check_array=(3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79)

# data columns to output
# this could be dynamic too, by recording the positions of the data columns (renamed from E0 by the sed above)
declare -a output_cols_array=(0 1 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80)

# process the resulting aggregate file
while IFS= read -r line; do

   # reinitialize array and output line string
   unset FIELD; output_line='';
   # read tab separated line into array
   IFS=$'\t' read -a FIELD <<< "$line"

   # for each line check the value of each field in field_check_array against the first field
   # check name fields to make sure they are all the same, exit if they are not
   for field_check in "${field_check_array[@]}"
   do
      if [ "${FIELD[1]}" != "${FIELD[$field_check]}" ]; then
         echo "names do not match"
         echo "FIELD[1]: " ${FIELD[1]}
         echo "FIELD["$field_check"]: " ${FIELD[$field_check]}
         exit 1
      fi
   done

   # if all name fields check for this row
   # add fields in output_cols_array to output_line string
   for output_col in "${output_cols_array[@]}"
   do
      # get value for next field
      cell="${FIELD[$output_col]}"

      # if this is the first column, the size of the output string will be 0, no tab
      if [ -z "$output_line" ]; then
         output_line="$cell"
      else
         # concatenate with row string
         output_line="$output_line"$'\t'"$cell"
      fi
   done

   # append the row to the output file, creating it on the first write
   # (touch and then append prevents empty column from newline???)
   [ -f "$output_file" ] || touch "$output_file"
   echo "${output_line}" >> "$output_file"

done < tmp_output1.txt

# cleanup
rm -f tmp_output1.txt
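The dynamic version mentioned in the comments above might look something like this (untested sketch; it rebuilds both arrays from the header row of tmp_output1.txt, keeping the first Name column for output and marking every later one for checking):

# read the header into an array and sort the positions into the two arrays
unset HDR; IFS=$'\t' read -r -a HDR < tmp_output1.txt
field_check_array=(); output_cols_array=(); seen_name=0
for i in "${!HDR[@]}"; do
   if [ "${HDR[$i]}" = "Name" ]; then
      if [ "$seen_name" -eq 0 ]; then
         output_cols_array+=("$i"); seen_name=1
      else
         field_check_array+=("$i")
      fi
   else
      output_cols_array+=("$i")
   fi
done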


I made a few modifications to the script posted by RudiC.

This just changes the code that creates the substitute header from,
HD = HD OFS $3 OFS $4 "_" T[2]
to
HD = HD OFS $3 OFS T[1] "_" T[2] "_" T[3]

For the filename "A_f0_r179_pred.txt", this results in the header "A_f0_r179" instead of the header "E0_f0".

It also changes the input file glob from,
A_*_pred.txt
to
*_*_pred.txt
because there are file names that start with letters other than A.

#!/bin/bash

# name of output file
output_file=$1

awk '
NR == 1         {HD = $1
                }
FNR == 1        {split (FILENAME, T, "_")
                 HD = HD OFS $3 OFS T[1] "_" T[2] "_" T[3]
                }

                {IX  = FNR - 1
                 MAX = IX>MAX?IX:MAX 
                }

FNR == NR       {ID[IX]   = $1
                 NAME[IX] = $3
                }
$1 == ID[IX] &&
$3 == NAME[IX]  {OUT[IX] = OUT[IX] $3 OFS $4 OFS
                 next
                }

                {OUT[IX]  = OUT[IX] OFS OFS
                }

END             {print HD
                 for (i=1; i<=MAX; i++) print ID[i], OUT[i]
                }
' OFS="\t" *_*_pred.txt > "$output_file"

This runs in 0.2 seconds (compared to 40 seconds for my script). The only issue is that the Name columns are still appearing in the final output and I only need the Name once.

I could add more code to process the output and remove all of the "Name" columns except the first one.

LMHmedchem

How about

awk '
NR == 1         {HD = $1 OFS $3
                }
FNR == 1        {split (FILENAME, T, "")
                 HD = HD OFS $4 "_" T[1] "_" T[2] "_" T[3]
                }

                {IX  = FNR - 1
                 MAX = IX>MAX?IX:MAX 
                }

FNR == NR       {ID[IX]   = $1
                 NAME[IX] = $3
                }
$1 == ID[IX] &&
$3 == NAME[IX]  {OUT[IX]  = OUT[IX]  $4 OFS
                 next
                }

                {OUT[IX]  = OUT[IX] OFS
                }

END             {print HD
                 for (i=1; i<=MAX; i++) print ID[i], NAME[i], OUT[i]
                }
' OFS="\t" *_*_pred.txt

Thanks, I got this working.

I made two changes,

split (FILENAME, T, "")
to
split (FILENAME, T, "_")

to split on underscore.

and,

HD = HD OFS $4 "_" T[1] "_" T[2] "_" T[3]
to
HD = HD OFS T[1] "_" T[2] "_" T[3]

to skip the original "E0" in the new header name.

Run time was 0.2 seconds to process 40 files with 2500 rows and 43 columns.

#!/bin/bash

# name of output file
output_file=$1

awk '
NR == 1         {HD = $1 OFS $3
                }
FNR == 1        {split (FILENAME, T, "_")
                 HD = HD OFS T[1] "_" T[2] "_" T[3]
                }

                {IX  = FNR - 1
                 MAX = IX>MAX?IX:MAX 
                }

FNR == NR       {ID[IX]   = $1
                 NAME[IX] = $3
                }
$1 == ID[IX] &&
$3 == NAME[IX]  {OUT[IX]  = OUT[IX]  $4 OFS
                 next
                }

                {OUT[IX]  = OUT[IX] OFS
                }

END             {print HD
                 for (i=1; i<=MAX; i++) print ID[i], NAME[i], OUT[i]
                }
' OFS="\t" *_*_pred.txt  > "$output_file"
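It is called with the output file name as the only argument, e.g. (the script name is just what I saved it as):

./merge_pred_files.sh aggregate_pred.txt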

I can more or less follow what this script is doing. I guess you could make it more general by using variables for the columns you are checking and outputting?

Could you do some things like,

-v var1='$1' -v var2='$2'
ID[IX] = var1
NAME[IX] = var2

LMHmedchem

Not quite. But it should be doable, like this (variable names are exchangeable):

awk -v C1=1 -v C2=3 ' ... 

and then, inside the script, replace every occurrence of $1 with $C1 and $3 with $C2. The result should be identical to what you got above. Then try using other columns.
If you want to convey columns via shell positional parameters, e.g. $1 and/or $2, use double quotes around the awk script instead of single quotes.
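For example (untested; this should reproduce the last script above, with a hypothetical CD variable added for the data column):

awk -v C1=1 -v C2=3 -v CD=4 '
NR == 1         {HD = $C1 OFS $C2
                }
FNR == 1        {split (FILENAME, T, "_")
                 HD = HD OFS T[1] "_" T[2] "_" T[3]
                }

                {IX  = FNR - 1
                 MAX = IX>MAX?IX:MAX
                }

FNR == NR       {ID[IX]   = $C1
                 NAME[IX] = $C2
                }
$C1 == ID[IX] &&
$C2 == NAME[IX] {OUT[IX]  = OUT[IX] $CD OFS
                 next
                }

                {OUT[IX]  = OUT[IX] OFS
                }

END             {print HD
                 for (i=1; i<=MAX; i++) print ID[i], NAME[i], OUT[i]
                }
' OFS="\t" *_*_pred.txt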

Thanks a lot, RudiC, for this nice script. I know it has been some days since this post, but I wanted to add an explanation here so that everybody can take advantage of this nice code snippet.

awk '
NR == 1         {HD = $1
#### The condition NR==1 is TRUE only on the very first line of the very first file.
#### There we put $1's value into the variable HD. Basically we start building the headings here.
                }
#### FNR==1 is TRUE whenever the first line of each file is read, since FNR is reset
#### each time awk starts reading the next file. split then breaks the current Input_file's
#### name into an array named T on the delimiter "_". These values are appended to HD, so HD
#### will be something like "Id Name E0_f0" after the first file, and it keeps growing as the headers of all Input_files are concatenated.
FNR == 1        {split (FILENAME, T, "_")
                 HD = HD OFS $3 OFS $4 "_" T[2]
                }
#### Create a variable named IX whose value is 1 less than FNR, and a variable named MAX
#### (which tracks the maximum number of data lines in any Input_file): if MAX is already
#### greater than IX it is left alone, otherwise it is replaced by IX's current value.
                {IX  = FNR - 1
                 MAX = IX>MAX?IX:MAX
                }
#### FNR==NR is TRUE only while the very first file is being read. Create an array named ID
#### indexed by IX, so ID[0]=Id, ID[1]=1 and so on...
#### Also create an array named NAME indexed by IX, so NAME[0]=Name, NAME[1]=N(,)'1 and so on...
FNR == NR       {ID[IX]   = $1
                 NAME[IX] = $3
                }
#### Check whether $1 equals ID[IX] and $3 equals NAME[IX].
#### If so, append $3's and $4's values to the array element OUT[IX]; next skips all further statements.
$1 == ID[IX] &&
$3 == NAME[IX]  {OUT[IX]  = OUT[IX] $3 OFS $4 OFS
                 next
                }
#### If the above condition is NOT TRUE, no match was found for the current IX and the $1/$3
#### values, so only OFS separators are appended in that place, leaving those cells empty.
                {OUT[IX]  = OUT[IX] OFS OFS
                }
#### Print HD (the combined heading of all files), then loop up to MAX (the maximum number
#### of data lines in a file), printing the collected ID and OUT values for each row.
END             {print HD
                 for (i=1; i<=MAX; i++) print ID[i], OUT[i]
                }
#### Set the output field separator to tab and name all the files which need to be passed to awk for reading.
' OFS="\t" A_*_pred.txt
 

NOTE: The above code is for explanation purposes only, not for running; use the actual code to run it and get the output.
The actual code can be taken from the link Merge multiple tab delimited files with index checking Post: 302986828

Thanks,
R. Singh
