Reconciling two CSV files using shell scripting

hustler · November 12, 2019, 4:26pm

I have two CSV files, file1 and file2, as below:

File 1:
Key, Value1, Value2, Value3, Value4,......value
A,50,100,50,40,....,100

File 2:
Key, value1,Value2,Value3,Value 4 so on...
A,50,80,45,50.....

Now, I want to check whether each key from file 1 is present in file 2. If it is, I want to create a new file with the following headers and data:

Key, diff columns, f1_value1,f2_value1,value1_diff, f1_value2,f2_value2, value2_diff......
A,"Value2,Value3,Value4",50,50,0,100,80,20,50,45,5 and so on.

My files have more than 50k lines and around 60 columns...

Can someone suggest how to achieve this? I am new to shell.

Neo · November 12, 2019, 8:34pm

You can easily do this with PHP. For example, you can read the CSV files into arrays and compare the arrays:

Quick examples to get you started (untested, for example only):

<?php
$csvFile1 = file('../somefile.csv');
$csvarray1 = [];
 foreach ($csvFile1 as $line) {
     $csvarray1[] = str_getcsv($line);
 }

$csvFile2 = file('../someotherfile.csv');
$csvarray2 = [];
 foreach ($csvFile2 as $line) {
     $csvarray2[] = str_getcsv($line);
 }

Then you can use one or two of the myriad PHP array functions to check for the existence of keys, differences between arrays, etc. See, for example, these functions:

<?php
array_diff();
array_keys();
array_diff_ukey();
array_key_exists()

Then, once you have the new array you want, you can simply write it back out as CSV, for example:

<?php
fputcsv ();

In a nutshell, it is easy to process CSV files with PHP, either from a script, directly from the command line, or interactively from the command line: convert the CSV files to arrays, do the array operations, and convert the result back to a CSV file.

So personally I would do this in PHP and not use shell scripts because PHP is built to do this kind of processing easily.

OBTW, these days I tend to quickly prototype and test my PHP ideas interactively in the shell as follows:

php -a

Then in the shell in interactive mode, I test and debug logic quickly and easily.

This is how I process CSV files. You can also do this same type of CSV processing easily in Python, BTW.

Others may have more "shell script-like" approaches for you which do not use PHP or Python; I am only describing how I approach these types of issues with CSV, JSON or other standard file formats. Since most of my work touches the Internet somehow (web servers), and those servers are mostly PHP based, I like to stick to code I can reuse and debug together, so that is why I tend to use PHP over Python. Actually, if my apps were not mostly PHP based, I would use Python more.

RudiC · November 13, 2019, 4:14am

How about

awk -F, '
FNR == 1        {FCNT++
                }
FNR > 1         {KEYS[$1]
                 for (i=2; i<=NF; i++) W[$1,FCNT,i] = $i
                }
END             {printf "Key, diff columns"
                 for (i=1; i<NF; i++) printf ",f1_value%d,f2_value%d,value%d_diff", i, i, i
                 printf RS
                 for (k in KEYS)        {for (i=NF; i>1; i--)   {N1 = W[k,1,i]
                                                                 N2 = W[k,2,i]
                                                                 D1 = N2 - N1
                                                                 OUT = sprintf (",%s,%s,%s", N1, N2, D1) OUT
                                                                 if (D1) COLS = ",Value" i-1 COLS
                                                                }
                                         print k, "\"" substr (COLS,2) "\"", substr (OUT, 2)
                                         OUT = COLS = ""
                                        }
                }
' OFS=, SUBSEP=, file[12]
Key, diff columns,f1_value1,f2_value1,value1_diff,f1_value2,f2_value2,value2_diff,f1_value3,f2_value3,value3_diff,f1_value4,f2_value4,value4_diff
A,"Value2,Value3,Value4",50,50,0,100,80,-20,50,45,-5,40,50.....,10

hustler · November 14, 2019, 3:48pm

Thanks Rudic,

Could you please explain the code? I could not understand it completely.

RudiC · November 14, 2019, 4:35pm

awk -F, '
FNR == 1        {FCNT++                                                                                         # inc file counter with every new file
                }
FNR > 1         {KEYS[$1]                                                                                       # keep $1 in an array; overwrite duplicates
                 for (i=2; i<=NF; i++) W[$1,FCNT,i] = $i                                                        # keep fields in array indexed by key, file No., field No.
                }
END             {printf "Key, diff columns"                                                                     # start printing header
                 for (i=1; i<NF; i++) printf ",f1_value%d,f2_value%d,value%d_diff", i, i, i                     # complete header line for all fields
                 printf RS
                 for (k in KEYS)        {for (i=NF; i>1; i--)   {N1 = W[k,1,i]                                  # for all keys, for all fields, get values
                                                                 N2 = W[k,2,i]                                  # for both files,
                                                                 D1 = N2 - N1                                   # and calc difference
                                                                 OUT = sprintf (",%s,%s,%s", N1, N2, D1) OUT    # collect all those in temp var OUT
                                                                 if (D1) COLS = ",Value" i-1 COLS               # if diff exist, collect fields in temp var COLS
                                                                }
                                         print k, "\"" substr (COLS,2) "\"", substr (OUT, 2)                    # print all those, cutting off leading comma
                                         OUT = COLS = ""                                                        # reset temp vars
                                        }
                }
' file[12]                                                                                                      # OFS and SUBSEP relict from development, not needed
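
To see the bookkeeping on its own, here is a minimal sketch of the two ideas the script relies on: a file counter that is bumped on the first line of each file, and an array indexed by key, file number and field number. The file names f1.csv and f2.csv, the temp directory and the sample values are made up for illustration and are not from the original post.

```shell
# Hypothetical sample data; names and values are for illustration only.
tmp=$(mktemp -d)
printf 'Key,Value1,Value2\nA,50,100\n' > "$tmp/f1.csv"
printf 'Key,Value1,Value2\nA,50,80\n'  > "$tmp/f2.csv"

out=$(awk -F, '
FNR == 1 { FCNT++; next }                                   # new file: bump counter, skip header
         { for (i = 2; i <= NF; i++) W[$1, FCNT, i] = $i }  # index by key, file no., field no.
END      { print W["A", 1, 3], W["A", 2, 3], W["A", 1, 3] - W["A", 2, 3] }
' "$tmp/f1.csv" "$tmp/f2.csv")

echo "$out"        # field 3 of key A from file 1, from file 2, and file 1 minus file 2
rm -rf "$tmp"
```

Note that this sketch subtracts file 2 from file 1, which matches the sign asked for in post #1; the script above computes N2 - N1 instead.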

Scrutinizer · November 14, 2019, 5:14pm

Try:

awk '
  NR==FNR {                                         # When reading the first file (then NR is equal to FNR)
    A[$1]=$0                                        # Store the first file in array A with key $1
    next
  } 

  FNR==1 {                                          # On the first line of the second file
    split($0,Header)                                # Split the header labels in array "Header"
    $1=$1 OFS "diff columns"                        # Create the first 2 field headers
    for(i=2; i<=NF; i++)
      $i=sprintf("f1_%s,f2_%s,%s_diff",$i, $i, $i)  # Create the rest of the field headers
    print                                           # Print the field headers
  } 

  FNR>1 {                                           # Processing the content of file 2
    diffs=""                                        # Set the differences to ""
    if($1 in A) {                                   # if the key in $1 of file2 also occurs in file1
      split(A[$1], F)                               # Split the corresponding line of file 1 into Fields in array F
      for(i=2; i<=NF; i++) {                        # For field 2 until the last field
        if($i!=F[i])                                # if there is a value difference for that field
          diffs=diffs (diffs?OFS:"") Header[i]      # Add the corresponding header label to the differences
        $i=F[i] OFS $i OFS (F[i]-$i)                # Prepend the value of file1 and append the subtraction of file1 val - file2 val
      } 
      $1=$1 OFS "\"" diffs "\""                     # When all differences found, append them to field 1
      print                                         # print the result
    }
  }
' FS=', *' OFS=, file1 file2                        # set FS to a comma with spaces, set OFS to a comma and read file 1 and file2

Key,diff columns,f1_value1,f2_value1,value1_diff,f1_Value2,f2_Value2,Value2_diff,f1_Value3,f2_Value3,Value3_diff,f1_Value4,f2_Value4,Value4_diff,...
A,"Value2,Value3,Value4",50,50,0,100,80,20,50,45,5,40,50,-10,...
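
The two mechanisms this relies on can be tried in isolation. The following is only a sketch with made-up file names (f1.csv, f2.csv) and values: NR==FNR is true only while the first file is being read, and assigning to a field makes awk rebuild the record using OFS.

```shell
# Hypothetical sample data; names and values are for illustration only.
tmp=$(mktemp -d)
printf 'Key, Value1\nA,50\nB,7\n' > "$tmp/f1.csv"
printf 'Key, Value1\nA,45\nC,9\n' > "$tmp/f2.csv"

out=$(awk '
NR == FNR { A[$1] = $2; next }          # first file only: remember the value by key
FNR > 1 && ($1 in A) {                  # second file: only keys also seen in file 1
  $2 = A[$1] OFS $2 OFS (A[$1] - $2)    # assigning to a field rebuilds $0 with OFS
  print
}
' FS=', *' OFS=, "$tmp/f1.csv" "$tmp/f2.csv")

echo "$out"
rm -rf "$tmp"
```

Keys that exist in only one of the files (B and C here) produce no output line, which is the behaviour asked for in post #1.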

hustler · November 15, 2019, 8:07am

I tried implementing this, but I am getting an error saying it cannot read the input file.

RudiC · November 15, 2019, 8:17am

Do the files exist? Did you adapt the filenames? You gave this in post #1:

hustler · November 16, 2019, 5:20am

Hey Rudic,

Thanks for the help, but I am still not getting the desired output. It seems the code is only reading file 1.

Along with that, I am getting a fatal error:

{FILENAME=file2.csv FNR=2} FATAL: function 'diffs' not defined

Scrutinizer · November 16, 2019, 6:55am

What is your OS and version?
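
The thread so far does not establish the cause of that error. One possibility, and it is only an assumption, is that the awk in use treats a variable name followed by an opening parenthesis as a function call rather than as a concatenation. A minimal sketch of a way to write the same concatenation so that no bare name stands directly before a parenthesis (the labels are made up for illustration):

```shell
out=$(awk 'BEGIN {
  OFS = ","
  n = split("Value2,Value3,Value4", Header, ",")       # hypothetical header labels
  diffs = ""
  for (i = 1; i <= n; i++)
    diffs = (diffs == "" ? "" : diffs OFS) Header[i]   # no variable name directly before "("
  print "\"" diffs "\""
}')

echo "$out"
```

If the error persists with this form, the cause lies elsewhere, for example in how the script was copied.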