Compare two files based on integer part only

yale_work · September 9, 2010, 6:57pm

How can I do the following?

File A (three columns):

X1,Y1,1.01
X2,Y2,2.02
X3,Y3,4.03

File B (three columns):

X1,Y1,1
X2,Y2,2
X3,Y3,4.0005

Now I have to compare files A and B based on the integer part of column 3. That means the first two rows should be OK and the third row should not satisfy the criteria. The first two columns together identify a unique row, so no row is repeated within a file, and the same first two columns appear in both files. So I need logic that compares the integer part of the third column for each row (matched on columns 1 and 2). Thanks.

durden_tyler · September 9, 2010, 7:27pm

yale_work:

...

File A (three columns):
X1,Y1,1.01
X2,Y2,2.02
X3,Y3,4.03
File B (three columns):
X1,Y1,1
X2,Y2,2
X3,Y3,4.0005
Now I have to compare files A and B based on the integer part of column 3. That means the first two rows should be OK and the third row should not satisfy the criteria. ...

Why not?
Integer part of row 3, column 3 in file A = int(4.03) = 4
Integer part of row 3, column 3 in file B = int(4.0005) = 4

So, by your logic, row 3 in both files should be considered a match.

tyler_durden

yale_work · September 9, 2010, 7:31pm

Sorry, yes, you are right. Please consider the files to be as follows:
File A (three columns):

X1,Y1,1.01
X2,Y2,2.02
X3,Y3,4.03

File B (three columns):

X1,Y1,1
X2,Y2,2
X3,Y3,5.0005

durden_tyler · September 9, 2010, 11:57pm

Here's an idea -

$ 
$ 
$ cat filea
x1,y1,1.01
x2,y2,2.02
x3,y3,4.03
x4,y4,7.0001
x5,y5,9.9997
$ 
$ 
$ cat fileb
x1,y1,1
x2,y2,2
x3,y3,5.0005
x4,y4,7.9998
x5,y5,4.0003
$ 
$ 
$ awk -F, 'NR==FNR {x[NR]=$0}
           NR!=FNR {split(x[FNR],a,",");
                    if(int(a[3]) != int($3)) {printf("ROW %d\n< %s\n---\n> %s\n",FNR,x[FNR],$0)}
                   }' filea fileb
ROW 3
< x3,y3,4.03
---
> x3,y3,5.0005
ROW 5
< x5,y5,9.9997
---
> x5,y5,4.0003
$ 
$ 
$

tyler_durden
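
A note on the approach above: it pairs rows by line number, while the original question says that columns 1 and 2 together identify a row. If the two files could ever be in a different order, one possible variation is to key the lookup on those two columns instead. The following is only a sketch under that assumption; the function name compare_int_part, the temporary directory, and the shuffled fileb are made up for illustration.

```shell
# Sample rows taken from the post above; fileb is deliberately shuffled
# to show that row order does not matter with a key-based lookup.
tmp=$(mktemp -d)
printf '%s\n' 'x1,y1,1.01' 'x2,y2,2.02' 'x3,y3,4.03' > "$tmp/filea"
printf '%s\n' 'x3,y3,5.0005' 'x1,y1,1' 'x2,y2,2' > "$tmp/fileb"

# First file: remember int(column 3) and the full line under the key "col1,col2".
# Second file: report keys present in both files whose integer part differs.
compare_int_part() {
    awk -F, '
        NR == FNR { want[$1 FS $2] = int($3); line[$1 FS $2] = $0; next }
        (($1 FS $2) in want) && want[$1 FS $2] != int($3) {
            printf("< %s\n> %s\n", line[$1 FS $2], $0)
        }
    ' "$1" "$2"
}

compare_int_part "$tmp/filea" "$tmp/fileb"
```

With this sample data, only the x3,y3 pair is reported. Rows that exist in one file but not the other are silently skipped in this sketch.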

rdcwayx · September 10, 2010, 1:08am

awk -F \. '{a=$1;b=$0 ;getline< "fileb"}{if ($1!=a)print b "|" $0}' filea

yale_work · September 10, 2010, 2:25pm

The suggested command below gets the integer part by splitting each line on the decimal point (.):

awk -F \. '{a=$1;b=$0 ;getline< "fileb"}{if ($1!=a)print b "|" $0}' filea

In my case, columns 1 and 2 also contain dots (.), so this command does not return correct values. I have to compare on column 3 only. My files are as follows:

filea

X1.T1,Y1,1.01
X2,Y2.T2,2.02
X3.T3,Y3.T4,4.03

fileb

X1.T1,Y1,1
X2,Y2.T2,2
X3.T3,Y3.T4,5.03

Need to compare integer value of column 3 only.
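
To make the problem concrete: with "." as the field separator, the first field ends at the first dot anywhere in the line, which here falls inside column 1. A small sketch using one line from filea (the variable names are illustrative only):

```shell
line='X1.T1,Y1,1.01'

# Splitting on "." cuts inside column 1, so $1 is only "X1".
first_by_dot=$(printf '%s\n' "$line" | awk -F. '{ print $1 }')

# Splitting on "," keeps column 3 whole; int() then drops the fraction.
int_by_comma=$(printf '%s\n' "$line" | awk -F, '{ print int($3) }')

printf 'dot split, field 1:     %s\n' "$first_by_dot"
printf 'comma split, int(col3): %s\n' "$int_by_comma"
```

Since the matching lines in both files begin with the same text up to the first dot, comparing $1 under the dot separator would not detect a change in column 3.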

durden_tyler · September 10, 2010, 6:38pm

$
$
$ cat filea
X1.T1,Y1,1.01
X2,Y2.T2,2.02
X3.T3,Y3.T4,4.03
$
$ cat fileb
X1.T1,Y1,1
X2,Y2.T2,2
X3.T3,Y3.T4,5.03
$
$
$ awk -F, 'NR==FNR {x[NR]=$0}
           NR!=FNR {split(x[FNR],a,",");
                    if(int(a[3]) != int($3)) {printf("ROW %d\n< %s\n---\n> %s\n",FNR,x[FNR],$0)}
                   }' filea fileb
ROW 3
< X3.T3,Y3.T4,4.03
---
> X3.T3,Y3.T4,5.03
$
$

tyler_durden

yale_work · September 10, 2010, 7:02pm

I have tried this, but it does not work on my actual data. I have sent you one line of real data by private message; the code reports a difference even when both files are identical.

durden_tyler · September 10, 2010, 7:16pm

Post your real data over here.

tyler_durden

yale_work · September 10, 2010, 9:46pm

Even when the files are identical, the code reports a difference (the files are tab-delimited):

filea

Mechanical.Markdown.Directed.POS.$ WK17 10.5

fileb

Mechanical.Markdown.Directed.POS.$ WK17 10.5

awk -F \t 'NR==FNR {x[NR]=$0} NR!=FNR {split(x[FNR],a,"\t"); if(int(a[3]) != int($3)) {printf("ROW %d\n< %s\n",FNR,x[FNR],$0)} }' filea fileb
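
One thing that may be worth checking in the command above is the unquoted -F \t. In a POSIX shell, a backslash outside quotes is removed before awk ever sees the argument, so awk receives a plain t rather than \t. What awk then does with a bare t depends on the implementation (gawk documents that it treats -Ft as a tab), so quoting the separator is the safer choice. The sketch below only shows what the shell passes along; the variable names are illustrative.

```shell
# Unquoted: the shell strips the backslash, leaving the single letter t.
unquoted=$(printf '%s\n' \t)

# Quoted: the two characters \ and t survive, and awk can read them as a tab.
quoted=$(printf '%s\n' '\t')

printf 'unquoted -> %s\n' "$unquoted"
printf 'quoted   -> %s\n' "$quoted"
```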

durden_tyler · September 10, 2010, 11:35pm

Nope, it doesn't. Check this out -

$ 
$ 
$ cat filea
Mechanical.Markdown.Directed.POS.$    WK17    10.5
$ 
$ cat fileb
Mechanical.Markdown.Directed.POS.$    WK17    10.5
$ 
$ ## show the contents of these files with ^I for TAB characters and $ for end-of-line
$ cat -et filea
Mechanical.Markdown.Directed.POS.$^IWK17^I10.5$
$ 
$ cat -et fileb
Mechanical.Markdown.Directed.POS.$^IWK17^I10.5$
$ 
$ 
$ ## now try the awk script, tweaked a little bit so that it displays a message for lines that match
$ awk -F"\t" 'NR==FNR {x[NR]=$0}
              NR!=FNR {split(x[FNR],a,"\t");
                       if(int(a[3]) != int($3)) {printf("ROW %d\n< %s\n---\n> %s\n",FNR,x[FNR],$0)}
                       else {print "ROW ",FNR,"is the same in both files"}
                      }' filea fileb
ROW  1 is the same in both files
$ 
$ 
$ 
$ ## now try the other case - edit one file so that the last field is different
$ 
$ sed 's/10.5/11.5/' filea >tmp && mv tmp filea
$ 
$ ## check the contents of both files again
$ cat filea
Mechanical.Markdown.Directed.POS.$    WK17    11.5
$ 
$ cat fileb
Mechanical.Markdown.Directed.POS.$    WK17    10.5
$ 
$ ## finally, try the awk script once again
$ awk -F"\t" 'NR==FNR {x[NR]=$0}
              NR!=FNR {split(x[FNR],a,"\t");
                       if(int(a[3]) != int($3)) {printf("ROW %d\n< %s\n---\n> %s\n",FNR,x[FNR],$0)}
                       else {print "ROW ",FNR,"is the same in both files"}
                      }' filea fileb
ROW 1
< Mechanical.Markdown.Directed.POS.$    WK17    11.5
---
> Mechanical.Markdown.Directed.POS.$    WK17    10.5
$ 
$

If your results are different, then my best guess is that one or both of the files aren't truly tab-delimited.
Check the octal dump of each file to see exactly what is in there.

od -bc filea
od -bc fileb

tyler_durden
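
As a complement to the od check, a quick way to see whether awk actually finds three tab-separated columns is to print the field count for each line. This is a sketch with made-up file names (tabs.txt, spaces.txt) and a hypothetical helper, count_tab_fields; a line that is really space-separated shows up as a single field.

```shell
tmp=$(mktemp -d)

# One file with real TAB characters, one with single spaces instead.
printf 'Mechanical.Markdown.Directed.POS.$\tWK17\t10.5\n' > "$tmp/tabs.txt"
printf 'Mechanical.Markdown.Directed.POS.$ WK17 10.5\n'   > "$tmp/spaces.txt"

# Print "line number: number of tab-separated fields" for every line.
count_tab_fields() {
    awk -F'\t' '{ print FNR ": " NF }' "$1"
}

count_tab_fields "$tmp/tabs.txt"
count_tab_fields "$tmp/spaces.txt"
```

If the count is not 3 for a line, then $3 is empty for that line and int($3) evaluates to 0, which can make the comparison misleading.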

rdcwayx · September 11, 2010, 8:27am

With new input:

awk -F, '
{split($3,x,".");a=$1 FS $2 FS x[1] ;b=$0}
{getline < "fileb" ; split ($3,y,".");}
{if (a!=$1 FS $2 FS y[1]) print b "|" $0}
' filea

X3.T3,Y3.T4,4.03|X3.T3,Y3.T4,5.03

yale_work · September 13, 2010, 9:38am

Thanks durden_tyler and rdcwayx.