Compare fields in two files line by line

dhruvmohan · September 22, 2013, 12:06pm

I am new to awk scripting.

I want to do a word-by-word (field-by-field) comparison of two files, File1.txt and File2.txt.

The files contain lists of | (pipe) separated fields.

**File 1:
-------------------

aaa|bbb|ccc|eee|fff
lll|mmm|nnn|ooo|ppp
rrr|sss|ttt|uuu|vvv**

File 2:
-------------------

aaa|bbb|ccc|eee|fff
rrr|sss|ttt|uuu|vvv
rrr|sss|ttt|uuu|uuu

We compare the same line number in both files.
The fields in line 1 of both files match.

In line 2, none of the fields (lll, mmm, nnn, ooo, ppp) match the fields (rrr, sss, ttt, uuu, vvv) in line 2 of File 2. Similarly, the 5th field of line 3 does not match between the two files.

Hence line 2 and line 3 should get echoed by bash.

Both files will follow an order.

Don_Cragun · September 22, 2013, 12:49pm

Is this a homework item? It seems like a strange set of requirements for any non-classroom project.

You haven't shown the output that you want from the input given above, but it seems that you are asking for entire lines from File 1 or entire lines from File 2 to be written to standard output if any field is different. If that is the case, why should we compare fields instead of just comparing lines? Comparing lines should give the same results, is easier to program, and will probably run faster.
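
To make the line-level alternative concrete, here is a minimal sketch rather than a finished script. The function name diff_lines and the array name other are made up for illustration. It assumes the asterisks are not part of the data and that the second file is not empty (the NR == FNR test misfires when the first operand is empty).

```shell
# Hypothetical helper: print every line of the first file that differs
# from the same-numbered line of the second file.
# Usage: diff_lines File1.txt File2.txt
diff_lines() {
  awk '
    # First operand (the second file): remember each line by its number.
    NR == FNR { other[FNR] = $0; next }
    # Second operand (the first file): report lines that differ.
    # Appending "" forces a string comparison even for numeric-looking lines.
    ($0 "") != (other[FNR] "") { print "Line " FNR ": " $0 }
  ' "$2" "$1"
}
```

With the sample files from the first post, diff_lines File1.txt File2.txt would report lines 2 and 3 in full, without saying which fields differ.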

If you are using awk to compare files, why is it important that bash use the echo command to print the results instead of having awk print the results directly?

dhruvmohan · September 22, 2013, 1:12pm

Thanks. You got my question absolutely right.

But I want to compare field by field instead of line by line, so that if a field does not match I can echo that field in the output.
Hope that makes sense.

Don_Cragun · September 22, 2013, 1:44pm

I repeat:

1). Is this a homework assignment? If not, what is the real-world project that is driving this?
2). Show us the output you want (using CODE tags) corresponding to the sample input files you provided.
3). Explain why you require us to use the bash echo command to print the results that an awk script is going to calculate.

dhruvmohan · September 22, 2013, 2:40pm

Thanks Don for your quick response.
Please find my comments below:

1). Well, this is not a homework assignment. It is part of my performance testing project for an investment bank in the US. I have a .csv file as the output of one application run, and I need to verify the contents of this .csv file against the expected data.

2). Output will look like this:

Line 2: lll,mmm,nnn,ooo,ppp
Line 3: vvv

3). By "echo out the output" I meant that I want to print the output to a file or to the $ prompt (through a print statement in awk). I did not mean literally using the "echo" command.

Hope I have answered all your questions. Please get back to me if you have any further questions.

Don_Cragun · September 22, 2013, 3:16pm

In "File 1" line 3 you have:

rrr|sss|ttt|uuu|vvv**

and in "File 2" line 3 you have:

rrr|sss|ttt|uuu|uuu

Why is the output showing the differences in line 3 supposed to be:

Line 3: vvv

instead of:

Line 3: vvv**

Scrutinizer · September 22, 2013, 5:52pm

If the asterisks are not part of file 1 you could try something like this:

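awk '
{
  s = ""                          # mismatched fields collected for this line
  for (i = 1; i <= NF; i++)
    if (NR == FNR) {
      A[NR, $i]                   # first operand (file2): note each field value per line
    }
    else if (!((FNR, $i) in A))   # second operand (file1): value absent from that line of file2
      s = (s ? s OFS : "") $i
  if (s) print "Line " FNR ": " s
}
' FS=\| OFS=, file2 file1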
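
Note that the script above tests whether a field value occurs anywhere in the same line of file2, not whether it sits in the same position. If a strictly positional comparison is wanted (field N against field N), a variant along the following lines might do. This is only a sketch under that assumption: the names diff_fields and ref are made up for illustration, and it again assumes the asterisks are not part of the data and that the second file is not empty.

```shell
# Hypothetical helper: compare field N of line M in the first file with
# field N of line M in the second file and list the fields that differ.
# Usage: diff_fields File1.txt File2.txt
diff_fields() {
  awk -F'|' '
    # First operand (the second file): store every field by line and position.
    NR == FNR {
      for (i = 1; i <= NF; i++) ref[FNR, i] = $i
      next
    }
    # Second operand (the first file): collect fields that differ by position.
    {
      s = ""; n = 0
      for (i = 1; i <= NF; i++)
        if (($i "") != (ref[FNR, i] ""))      # "" forces a string comparison
          s = s (n++ ? "," : "") $i
      if (n) print "Line " FNR ": " s
    }
  ' "$2" "$1"
}
```

On the sample data this gives the same two lines as the requested output. The two approaches diverge when fields are swapped: for a line a|b in the first file against b|a in the second, the positional version reports both fields, while the membership test reports nothing. Extra trailing fields that exist only in the second file are not reported by this sketch.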

drl · September 22, 2013, 9:04pm

Hi.

Here is a non-awk solution:

#!/usr/bin/env bash

# @(#) s1	Demonstrate comparison, "diff", at field level.
# See: http://os.ghalkes.nl/dwdiff.html

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C dwdiff

pl " Input data files:"
head data1 data2

pl " Expected output:"
cat expected-output.txt

pl " Results:"
dwdiff -L -d '|' --no-common --no-inserted data1 data2

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
dwdiff 1.8.2

-----
 Input data files:
==> data1 <==
aaa|bbb|ccc|eee|fff
lll|mmm|nnn|ooo|ppp
rrr|sss|ttt|uuu|vvv**

==> data2 <==
aaa|bbb|ccc|eee|fff
rrr|sss|ttt|uuu|vvv
rrr|sss|ttt|uuu|uuu

-----
 Expected output:
Line 2: lll,mmm,nnn,ooo,ppp
Line 3: vvv

-----
 Results:
======================================================================
   1:1    
   2:1    lll|mmm|nnn|ooo|ppp
======================================================================
   3:2    vvv**
======================================================================

The utility dwdiff is available at the URL noted in the script comment, as well as in "recent" versions of Ubuntu, Debian, antiX, FreeBSD, and Fedora.

Best wishes ... cheers, drl