Compare two csv's with column based

rohit_shinez · March 9, 2017, 5:12am

Hi,

I am having below two CSV's

col_1,col_2,col_3
1,2,4
1,3,6

col_1,col_3,col2,col_5,col_6
1,2,3,4,5
1,6,3,,,

I need to compare based on the columns where the mismatch is

expected output

col_1,col_2,col_3
1,2,4

the above output is compared based on column since the column names are shuffled among other csv

here 1,3,6 got matached because it is same as other 2.csv

i tried using diff but not able to handle with coulmns also used join

diff 1.csv 2.csv
join -t, <(sort 1.csv) <(sort 2.csv)

drl · March 9, 2017, 10:33am

Hi.

Is col2 in col_1,col_3,col2,col_5,col_6 a typo which should be col_2 , or is this a different column name? ... cheers, drl

RudiC · March 9, 2017, 4:56pm

Not sure this is the most elegant solution, but the result is as desired. Try

awk '
NR == FNR       {if (NR == 1)   {print
                                 for (MX=n=NF; n>0; n--) REF[$n]=n
                                }
                 else           TMP[NR] = $0
                 next
                }

FNR == 1        {for (n=NF; n>0; n--) if ($n in REF) CMP[n]=REF[$n]
                 next
                }

                {TSTR = TFS = ""
                 split ("", IT)
                 for (n=NF; n>0; n--)   if (CMP[n]) IT[CMP[n]] = $n
                 for (i=1; i<=MX; i++)  {TSTR = TSTR TFS IT
                                         TFS  = FS
                                        }
                 if (TSTR != TMP[FNR]) print TMP[FNR]
                }


' FS="," file[12]
col_1,col_2,col_3
1,2,4

drl · March 10, 2017, 9:31am

Hi.

I agree that as soon as one knows that fields are involved, one should consider awk .

However, if you are going to be dealing with csv files frequently, you might want to consider csv-specific utilities. Here's an example of one, along with a field-specific utility.

#!/usr/bin/env bash

# @(#) s1       Demonstrate field differences, csv-format file.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C csvfix dwdiff

pl " Input data files:"
head data1 data2

csvfix order -fn col_1,col_2,col_3 data1 >new1
csvfix order -fn col_1,col_2,col_3 data2 >new2

pl " Results, csvfix to generate named column files in order:"
head new1 new2

pl " Results, delimited-word diff:"
dwdiff -3 -d',' new1 new2

exit 0

producing:

$ ./s1 

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
csvfix - ( local: ~/executable/csvfix, 2014-05-17 )
dwdiff 2.0.9

-----
 Input data files:
==> data1 <==
col_1,col_2,col_3
1,2,4
1,3,6

==> data2 <==
col_1,col_3,col_2,col_5,col_6
1,2,3,4,5
1,6,3,,,

-----
 Results, csvfix to generate named column files in order:
==> new1 <==
"col_1","col_2","col_3"
"1","2","4"
"1","3","6"

==> new2 <==
"col_1","col_2","col_3"
"1","3","2"
"1","3","6"

-----
 Results, delimited-word diff:
======================================================================
[-"2"-]{+"3"+}
======================================================================
[-"4"-]{+"2"+}
======================================================================

The csvfix selects the named columns, placing them in order, and dwdiff compares the fields (words) separately.

So "2" is replaced by "3", and "4" is replaced by "2". There are a number of options on dwdiff , see man page.

Here are some details on those utilities:

csvfix  Manipulate csv files (doc)
Path    : ~/executable/csvfix
Version : - ( local: ~/executable/csvfix, 2014-05-17 )
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Home    : https://neilb.bitbucket.io/csvfix/

dwdiff  a delimited word diff program (man)
Path    : /usr/bin/dwdiff
Version : 2.0.9
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with --help
Repo    : Debian 8.7 (jessie)

and here are some other csv-specific tools:

CSV, comma separated variable file (comma typical, it can be almost anything)

        1) csvfix
           https://neilb.bitbucket.io/csvfix/ (checked  2017.03)

        2) csvtool

        3) Text::CSV -- perl module, DIY

        4) crush, Google collection
           http://crush-tools.googlecode.com/files/crush-tools-2013-04.tar.gz

        5) csvkit
           https://github.com/wireservice/csvkit

Best wishes ... cheers, drl