Comparison of Cells in EXCEL using awk

meva · February 12, 2010, 8:08am

Hi

I have 2 csv files which looks like the following and i have to compare the 2 CSVs and create a third file such that if the value of the 1st cell in A.CSV and the value of the first cell in the B.CSV are same, it should print "SAME" in the third file or else print NOT SAME. Likewise i need to create the third file with the SAME/NOT SAME corresponding to each cell in the 2 CSVs.

A.CSV
"1","2009/1/12","DATA",+10.00
"2","2009/1/12","DATA",+11.00
"3","2009/1/12","DATA",+12.00
"4","2009/1/12","DATA",+13.00

B.CSV
"1","2009/1/12","DATA",+99.00
"2","2009/1/12","DATA",+11.00
"3","2009/1/12","DATA",+12.00
"4","2009/1/12","DATA",+13.00

I used the code as below

awk -F',' 'FNR==NR {_[$4]=$0;next} 
                   { split(_[$4],ar);
                     for(i=1;i<=NF;i++) 
                        if(ar=$i) $i="SAME,"; 
                        else  $i="NOT SAME,"; 
                     print 
                   }' A.csv B.csv>> C.csv

but it always showed the following output. whether the data was same or not same.

OUTPUT
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,

but it is expected to show

SAME,SAME,SAME,NOT SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,

Thanks in advance for the help.
Regards
Meva
whereas the expected

Franklin52 · February 12, 2010, 8:25am

Try this:

awk -F, '{
  getline name < "B.csv"
  split(name, a, ",")
  for(i=1;i<=NF;i++) {
    if(a==$i) {
      printf("SAME,")
    } 
    else {
      printf("NOT SAME,")
    }
  }
  print ""
}' A.csv > C.csv

meva · February 14, 2010, 3:30am

Hi Franklin.....

This is working...Thanks a lot.....

But we still have a problem...File A looks like the one given below

A.CSV
"1","2009/1/12","DATA",+10.00
"2","2009/1/12","DATA",+11.00
"3","2009/1/12","DATA",+12.00
"4","2009/1/12","DATA",+13.00

and file B looks like this

B.CSV
1,"2009/1/12","DATA",+99.00
2,"2009/1/12","DATA",+11.00
3,"2009/1/12","DATA",+12.00
4,"2009/1/12","DATA",+13.00

so as a result of excuting the code we get the following output

NOT SAME,SAME,SAME,NOT SAME,
NOT SAME,SAME,SAME,SAME,
NOT SAME,SAME,SAME,SAME,
NOT SAME,SAME,SAME,SAME,

But we expect the output to be

SAME,SAME,SAME,NOT SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,

Please help with your thoughts on how to get the expected result.

Thanks In Advance
Meva.

malcomex999 · February 14, 2010, 5:08am

Try...

awk -F, '{gsub("\"","",$1)}
{
  getline name < "B.csv"
  split(name, a, ",")
  for(i=1;i<=NF;i++) {
    if(a==$i) {
      printf("SAME,")
    } 
    else {
      printf("NOT SAME,")
    }
  }
  print ""
}' A.csv

meva · February 14, 2010, 5:34am

Still not working....

malcomex999 · February 14, 2010, 5:42am

Sorry, didn't try that before but this one is working now...

awk -F, '{gsub("\"","",$1)}
{
  getline name < "B.csf"
  split(name, a, ",")
  for(i=1;i<=NF;i++) {
    if(a==$i) {
      printf("SAME,")
    } 
    else {
      printf("NOT SAME,")
    }
  }
  print ""
}' A.csv

meva · February 14, 2010, 6:06am

Thanks a Lot....Its working:)

Meva

EAGL · February 14, 2010, 6:35am

Hello Friends,

I need to find out how this code exactly works, please help me confirming or correcting my comments regarding to code above as im not really good at using getline in awk.

Field seperator is "," and gsub strips double quotes from 1st column of A.csv
"Getline" gets each line of B.csv and assigns them to variable "name"
"Split" splits the variable "name" namely each line of B.csv,assign each field to array"a"
Each field of lines of A.csv is compared to corresponding fields of lines of B.csv, then
prints SAME or NOT SAME as a result.

Thanks in Advance,
Best regards

Franklin52 · February 14, 2010, 9:14am

Completely correct!

BTW: You can add a line after the getline statement to compare the whole lines first to make it faster if you have huge files.

If they are equal it isn't necessary to do the loop:

awk -F, '{
  gsub("\"","",$1)
  getline name < "B.csv"
  if(name == $0){print "SAME,SAME,SAME,SAME,"; next}
  split(name, a, ",")
  for(i=1;i<=NF;i++) {
    if(a==$i) {
      printf("SAME,")
    } 
    else {
      printf("NOT SAME,")
    }
  }
  print ""
}' A.csv

meva · February 15, 2010, 8:41am

Thanks for the inputs:)

ahmad.diab · February 15, 2010, 11:09am

in perl:-

perl -e '
open(FH1,"< $ARGV[0]") or die ;
open(FH2,"< $ARGV[1]") or die ;
while ( ($f1=<FH1>) && ($f2=<FH2>) ) {
chomp ($f1,$f2) ;
@a=split/,/,$f1;
@b=split/,/,$f2 ;
$end=@a ;
$a[0] =~ s/\"//g;
for ($i=0;$i<$end;$i++) {
        printf "Not_Same," if($a[$i] ne $b[$i]);
        printf "Same," if ($a[$i] eq $b[$i]) ;
              } ;
  print "\n" ;
} ;
' A.csv B.csv

;);)

meva · February 26, 2010, 7:59am

Hi Guys,

Thanks for the inputs, i have one more requirement for this. Currently we are putting the contents of A.CSV,B.csv and C.csv to be put into a fourth CSV D.csv such that they were one below the other but the expected D.csv should look in the following format

Expected result

<Content of A.csv>   <Content of B.csv>    <Content of C.csv>

I want the data in the A.csv,B.csv and the C.csv to appear side by side in the fourth csv D.csv so that when we have a NOT SAME value we can easily uderstand which value in the A.csv and B.csv is NOT SAME.

Kindly help me

Thanks again
Meva

EAGL · February 26, 2010, 8:30am

paste -d "\t" A.csv B.csv C.csv  > D.csv

meva · February 26, 2010, 9:40am

thanks Eagle

steadyonabix · February 26, 2010, 1:27pm

     nawk -F, '
        {_[NR] = $0}
        NR != FNR {
                printf("<%s> <%s> <", $0, _[FNR] )
                gsub("\"","",$1)
                split(_[FNR],a)
                for ( i = 1 ; i <= NF ; i++ )
                        printf("%s%s", a == $i ? "SAME" : "NOT SAME", i == NF ? ">\n" : "," )   ## without trailing , 
        } ' B.CSV A.CSV

If you want the trailing , then use this as the final print statement: -

 printf("%s%s", a == $i ? "SAME," : "NOT SAME,", i == NF ? ">\n" : "" ) ## with trailing ,

Output: -

<"1","2009/1/12","DATA",+10.00> <1,"2009/1/12","DATA",+99.00> <SAME,SAME,SAME,NOT SAME,>
<"2","2009/1/12","DATA",+11.00> <2,"2009/1/12","DATA",+11.00> <SAME,SAME,SAME,SAME,>
<"3","2009/1/12","DATA",+12.00> <3,"2009/1/12","DATA",+12.00> <SAME,SAME,SAME,SAME,>
<"4","2009/1/12","DATA",+13.00> <4,"2009/1/12","DATA",+13.00> <SAME,SAME,SAME,SAME,>

meva · March 1, 2010, 1:40am

Hi

Thanks for the response.We do not have nawk programming installed in our Server so unable to use the code.

we used the code

paste -d "\t" A.csv B.csv C.csv  > D.csv

But the output has a special character (looking like a box) also being printed such as following first line

<first line of A.csv> <(special char)first line of B.csv> <(special char)first line of C.csv>

the special character appears only for the first column of the B.csv and C.csv

Please help!!!

Thanks again.
Meva

EAGL · March 1, 2010, 7:49am

I think your OS does not support "tab" character "\t". Simply try to change Delimeter, you can use only a space if its ok for you:

paste -d " " A.csv B.csv C.csv  > D.csv

durden_tyler · March 1, 2010, 10:46am

meva:

...
we used the code
paste -d "\t" A.csv B.csv C.csv  > D.csv
But the output has a special character (looking like a box) also being printed such as following first line
<first line of A.csv> <(special char)first line of B.csv> <(special char)first line of C.csv>
the special character appears only for the first column of the B.csv and C.csv

...

In case you want to investigate; what is the output of the following command ?

head -1 D.csv | od -bc

tyler_durden

steadyonabix · March 1, 2010, 3:42pm

Did you try removing the n from nawk and seeing if it runs?

alister · March 1, 2010, 4:03pm

I know it's a couple of weeks old and that the original problem has been solved, but the revival of this thread brought it to my attention and i thought i'd toss one in anyway

$ paste -d, a b | awk -F, '{for(i=1;i<=NF/2;i++) printf("%sSAME,",$i!=$(i+NF/2)?"NOT ":""); print ""}'
SAME,SAME,SAME,NOT SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,
SAME,SAME,SAME,SAME,

Cheers,
Alister