Combine and complete multiple CSV files based on 1 parameter

Xterra · August 19, 2018, 11:54am

I have to create a new CSV file based on the value listed on the 3rd column from different CSV files. This is what I need:

I should substitute the first column from each file, excluding the headers, with the file name InputXX .
Then, I need to look for rows with 0 on the third column in the first file Input1.csv , and substitute the entire row with the same corresponding row from the second file Input2.csv . This process should be repeated for all rows till the end of the file (96 entries).
If the value on the third column from the second file is also 0, then, the value from the third file is the one that is used to complete the entry on the first file. If the value on the third file is also 0,then the value from the forth file is used to complete the entry. So on and so on. If the value for that particular row is 0 in all files, then, the original entry is kept.
Once the first file is completed, the process is repeated for the second file using data from the remaining files.
The rows from the last file that have been used to complete the first files are kept the same on that last file but the value on the third column is changed to 0 since they have been already used.
Finally, all data is combined into the one file keeping the headers at the very top.

The original files look like this:
Input1.csv

Source,Well_Source,Volume,Destination_Well,Destination
PCR_Plate,1,5,1,Deadpool
PCR_Plate,2,7,1,Deadpool
PCR_Plate,3,8,1,Deadpool
PCR_Plate,4,3,1,Deadpool
PCR_Plate,5,5,1,Deadpool
PCR_Plate,6,0,1,Deadpool
PCR_Plate,7,4.,1,Deadpool
PCR_Plate,8,7.1,1,Deadpool
PCR_Plate,9,0,1,Deadpool
PCR_Plate,10,0,1,Deadpool

Input2.csv

Source,Well_Source,Volume,Destination_Well,Destination
PCR_Plate,1,0,1,Deadpool
PCR_Plate,2,0,1,Deadpool
PCR_Plate,3,8,1,Deadpool
PCR_Plate,4,3,1,Deadpool
PCR_Plate,5,5,1,Deadpool
PCR_Plate,6,0,1,Deadpool
PCR_Plate,7,4.,1,Deadpool
PCR_Plate,8,7.1,1,Deadpool
PCR_Plate,9,10,1,Deadpool
PCR_Plate,10,0,1,Deadpool

Input3.csv

Source,Well_Source,Volume,Destination_Well,Destination
PCR_Plate,1,5,1,Deadpool
PCR_Plate,2,0,1,Deadpool
PCR_Plate,3,8,1,Deadpool
PCR_Plate,4,0,1,Deadpool
PCR_Plate,5,0,1,Deadpool
PCR_Plate,6,0,1,Deadpool
PCR_Plate,7,4.,1,Deadpool
PCR_Plate,8,7.1,1,Deadpool
PCR_Plate,9,10,1,Deadpool
PCR_Plate,10,0,1,Deadpool

Input4.csv

Source,Well_Source,Volume,Destination_Well,Destination
PCR_Plate,1,5,1,Deadpool
PCR_Plate,2,0,1,Deadpool
PCR_Plate,3,8,1,Deadpool
PCR_Plate,4,5,1,Deadpool
PCR_Plate,5,5,1,Deadpool
PCR_Plate,6,0,1,Deadpool
PCR_Plate,7,4.,1,Deadpool
PCR_Plate,8,7.1,1,Deadpool
PCR_Plate,9,10,1,Deadpool
PCR_Plate,10,10,1,Deadpool

The resulting CSV file should look like this:

Source,Well_Source,Volume,Destination_Well,Destination
Input1,1,5,1,Deadpool
Input1,2,7,1,Deadpool
Input1,3,8,1,Deadpool
Input1,4,3,1,Deadpool
Input1,5,5,1,Deadpool
Input1,6,0,1,Deadpool
Input1,7,4.,1,Deadpool
Input1,8,7.1,1,Deadpool
Input2,9,10,1,Deadpool
Input4,10,10,1,Deadpool
Input3,1,5,1,Deadpool
Input2,2,0,1,Deadpool
Input2,3,8,1,Deadpool
Input2,4,3,1,Deadpool
Input2,5,5,1,Deadpool
Input2,6,0,1,Deadpool
Input2,7,4.,1,Deadpool
Input2,8,7.1,1,Deadpool
Input3,9,10,1,Deadpool
Input2,10,0,1,Deadpool
Input4,1,5,1,Deadpool
Input3,2,0,1,Deadpool
Input3,3,8,1,Deadpool
Input4,4,5,1,Deadpool
Input4,5,5,1,Deadpool
Input3,6,0,1,Deadpool
Input3,7,4.,1,Deadpool
Input3,8,7.1,1,Deadpool
Input4,9,10,1,Deadpool
Input3,10,0,1,Deadpool
Input4,1,0,1,Deadpool
Input4,2,0,1,Deadpool
Input4,3,8,1,Deadpool
Input4,4,0,1,Deadpool
Input4,5,0,1,Deadpool
Input4,6,0,1,Deadpool
Input4,7,4.,1,Deadpool
Input4,8,7.1,1,Deadpool
Input4,9,0,1,Deadpool
Input4,10,0,1,Deadpool

I have been focusing on trying to get it to work using two files with little to no success:

awk -F, 'FNR==NR{a[$0]=$3 FS $3;next}{ print $0, a[$1]}'

Any help will be greatly appreciated

RudiC · August 19, 2018, 12:00pm

Why are 0 lines, e.g.

Input2,2,0,1,Deadpool
Input2,6,0,1,Deadpool
Input2,10,0,1,Deadpool

in the output file?

Xterra · August 19, 2018, 12:19pm

Only when the other files also have 0 for that particular row. Otherwise, it should be populated with the rows that come from the remaining files

RudiC · August 19, 2018, 4:25pm

Input2,10,0,1,Deadpool

?

Xterra · August 20, 2018, 10:31am

Only input file 4 shows a value >0 in row number 10. That value should be used to complete the first input file. The goal is try to complete all 10 rows in the first file. Once the first file is completed, the goal changes to try to complete the second input file. However, the data present in the new completed infile 1 cannot be used any longer

RudiC · August 20, 2018, 11:25am

OK, now that everthing is more or less clear, let's try

awk -F, '
FNR == 1        {FLNR++
                 FN[FLNR] = FILENAME
                 next
                }

                {V[FLNR, $2] = $3
                 D[FLNR, $2] = $4
                 if ($2 > MX2) MX2 = $2
                }

END             {print "Source,Well_Source,Volume,Destination_Well,Destination"
                 for (f=1; f<=FLNR; f++)
                   for (i=1; i<=MX2; i++)       {ft = f
                                                 while ((ft <= FLNR) && (V[ft,i] == 0)) ft++
                                                 if (ft > FLNR) ft = f
                                                 print FN[ft], i, V[ft,i], D[f,i], "Deadpool"
                                                 V[ft, i] = 0
                                                }
                }

'  OFS=, Input[1-4]
Source,Well_Source,Volume,Destination_Well,Destination
Input1,1,5,1,Deadpool
Input1,2,7,1,Deadpool
Input1,3,8,1,Deadpool
Input1,4,3,1,Deadpool
Input1,5,5,1,Deadpool
Input1,6,0,1,Deadpool
Input1,7,4.,1,Deadpool
Input1,8,7.1,1,Deadpool
Input2,9,10,1,Deadpool
Input4,10,10,1,Deadpool
Input3,1,5,1,Deadpool
Input2,2,0,1,Deadpool
Input2,3,8,1,Deadpool
Input2,4,3,1,Deadpool
Input2,5,5,1,Deadpool
Input2,6,0,1,Deadpool
Input2,7,4.,1,Deadpool
Input2,8,7.1,1,Deadpool
Input3,9,10,1,Deadpool
Input2,10,0,1,Deadpool
Input4,1,5,1,Deadpool
Input3,2,0,1,Deadpool
Input3,3,8,1,Deadpool
Input4,4,5,1,Deadpool
Input4,5,5,1,Deadpool
Input3,6,0,1,Deadpool
Input3,7,4.,1,Deadpool
Input3,8,7.1,1,Deadpool
Input4,9,10,1,Deadpool
Input3,10,0,1,Deadpool
Input4,1,0,1,Deadpool
Input4,2,0,1,Deadpool
Input4,3,8,1,Deadpool
Input4,4,0,1,Deadpool
Input4,5,0,1,Deadpool
Input4,6,0,1,Deadpool
Input4,7,4.,1,Deadpool
Input4,8,7.1,1,Deadpool
Input4,9,0,1,Deadpool
Input4,10,0,1,Deadpool

Xterra · August 20, 2018, 5:19pm

Rudy
That worked like a charm! Any chance you can explain your code?
Appreciate it very much!

RudiC · August 20, 2018, 5:52pm

There you are:

awk -F, '                                                                                       # set field separator to comma
FNR == 1        {FLNR++                                                                         # on 1. line of each file: count files
                 FN[FLNR] = FILENAME                                                            # save respective file name
                 next                                                                           # don't process each file's first line further
                }
                {V[FLNR, $2] = $3                                                               # save Volume data in array indexed by file No. and Source Well No.
                 D[FLNR, $2] = $4                                                               # save Dest Well with same index
                 if ($2 > MX2) MX2 = $2                                                         # save max. Source Well No.
                }
END             {print "Source,Well_Source,Volume,Destination_Well,Destination"                 # print header
                 for (f=1; f<=FLNR; f++)                                                        # process files 1 - max. file No.
                   for (i=1; i<=MX2; i++)       {ft = f                                         # process 1 - MX2 wells in files; assign temp file No.
                                                 while ((ft <= FLNR) && (V[ft,i] == 0)) ft++    # test if Volume is not zero in actual and following files
                                                                                                # and we don't exceed max file No.
                                                 if (ft > FLNR) ft = f                          # if exceeded, use original f No.
                                                 print FN[ft], i, V[ft,i], D[f,i], "Deadpool"   # print actual (saved) file name, source well No., first non-
                                                                                                # zero Volume, respective dest well, and string constant
                                                 V[ft, i] = 0                                   # set consumed Volume to zero
                                                }
                }
'  OFS=, file[1-4]                                                                              #  set output field separator to comma, specify 1 - n file names