Comparing two files

Hi,

I have two files in this format. The files contain the statistics of tables as seen below. The other file is also in this format. I need to compare both the files and if there is a mismatch i need to display the contents within the break lines from both the files for that corresponding table.

Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_G
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER): LENGTH(DEQUEUE_USER): 50
------------------------------------------------------------------------------------------------------------------------------

Can anyone help me on this?

A possible way using awk :

awk '
/^Table Name/  { table = $3 ; tables++                }
! NF || /^-+/  {                                        next }
NR == FNR      { stats1 = stats1 $0 ORS ; next }
               { stats2 = stats2 $0 ORS ; next }
END {
   for (t in tables) {
      if (stats1[t] != stats2[t]) {
         out = "==========================================" ORS
         out = out stats1[t] ORS stats2[t]
         print out
      }
   }
}
' stats1.dat stats2.dat

Input file 1:

Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260

------------------------------------------------------------------------------------------------------------------------------
Table Name: AQ$_FT_Q_BECMD_G
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER):
------------------------------------------------------------------------------------------------------------------------------

Input file 2:

Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260

------------------------------------------------------------------------------------------------------------------------------
Table Name: AQ$_FT_Q_BECMD_G
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------
Table Name: AQ$_FT_Q_BECMD_K
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 1 SUM(SUBSCRIBER#): 11 MAX(SUBSCRIBER#): 11 MIN(SUBSCRIBER#): 11 AVG(SUBSCRIBER#): 11 LENGTH(NAME) 24 :SUM(ADDRESS#): MAX(ADDRESS#): 0 MIN(ADDRESS#): 0 AVG(ADDRESS#):  0 LENGTH(TRANSACTION_ID): 4 SUM(DEQUEUE_USER): 0 MAX(DEQUEUE_USER): 0
------------------------------------------------------------------------------------------------------------------------------

Ouput:

==========================================
Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER):

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 1 SUM(SUBSCRIBER#): 11 MAX(SUBSCRIBER#): 11 MIN(SUBSCRIBER#): 11 AVG(SUBSCRIBER#): 11 LENGTH(NAME) 24 :SUM(ADDRESS#): MAX(ADDRESS#): 0 MIN(ADDRESS#): 0 AVG(ADDRESS#):  0 LENGTH(TRANSACTION_ID): 4 SUM(DEQUEUE_USER): 0 MAX(DEQUEUE_USER): 0

==========================================

Table Name: AQ$_FT_Q_BECMD_K
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

Jean-Pierre.

Hi,

The code is working very fine. Can you please explain each line of the code?

:eek:

awk '

/^Table Name/ {                             # Select lines starting with 'Table Name'
   table = $3;                              # Memorize table name into variable
   tables++                          #   and array
}                                           #

! NF || /^-+/ {                             # Select empty and delimiter lines 
   next                                     # Proceed next line (skip selected lines)
}                                           #

NR == FNR {                                 # Select lines from first input file 
   stats1 = stats1 $0 ORS;    # Memorize stats's table
   next                                     # Proceed next line
}

{                                           # Lines comes from second input file 
   stats2 = stats2 $0 ORS;    # Memorize stats's table
   next                                     # Proceed next line
}                                           #

END {                                       # All files have been read
   for (t in tables) {                      # For all memorized tables
      if (stats1[t] != stats2[t]) {         #    If stats mismatch
         out = "==========================================" ORS;
         out = out stats1[t] ORS stats2[t]; #
         print out                          #       Output sep line and stats
      }                                     #
   }                                        #
}                                           #

' stats1.dat stats2.dat

Jean-Pierre.

Hi aigles,

Thanks a lot.:smiley:

Hi Jean-Pierre,

Does stats1 and stats2 refer to the two input files? Will the code work if i give the path name like /home/frk/ragav/stats1 and /home/frk/ragav/stats2 instead of the file names? Then how should the code be modified?

If i assign the path like /home/frk/ragav/stats1 to a variable how can i call the path in the code?

When i assigned the file name to a variable like
a=stats1.txt
b=stats2.txt
and changed the code to

nawk '
/^Table Name/ { table = $3 ; tables[table]++ }
! NF || /^-+/ { next }
NR == FNR { $e[table] = $e[table] $0 ORS ; next }
{ $f[table] = $f[table] $0 ORS ; next }
END {
for (t in tables) {
if ($e[t] != $f[t]) {
out = "-----------------------------------------------------------------" ORS
out = out $e[t] ORS $f[t]
print out
}
}
}
' $e $f >> result.out

i am getting this error.

nawk: illegal field $()
input record number 1, file startendcut1.txt
source line number 4

Can you please help on the above two ways of modifying the code?

In my script, stats1 and stats2 inside awk code are arrays.
stats1.dat and stats2.dat are the input files.

The inputfiles can be specified with the full path.
Don't use $e and $f as arrays in your awk code, use fixed names (stats1 and stats2 or what you want like first_array and second_array...)

a=stats1.txt
b=stats2.txt

nawk '
/^Table Name/ { table = $3 ; tables++ }
! NF || /^-+/ { next }
NR == FNR     { stats1 = stats1 $0 ORS ; next }
              {stats2 = stats2 $0 ORS ; next }
END {
   for (t in tables) {
      if (stats1[t] != stats2[t]) {
         out = "-----------------------------------------------------------------" ORS
         out = out stats1[t] ORS stats2[t]
         print out
      }
   }
}
' $e $f >> result.out

Jean-Pierre.

Hi,

Instead of displaying the entire content from both the files like this

==========================================
Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER):

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 1 SUM(SUBSCRIBER#): 11 MAX(SUBSCRIBER#): 11 MIN(SUBSCRIBER#): 11 AVG(SUBSCRIBER#): 11 LENGTH(NAME) 24 :SUM(ADDRESS#): MAX(ADDRESS#): 0 MIN(ADDRESS#): 0 AVG(ADDRESS#): 0 LENGTH(TRANSACTION_ID): 4 SUM(DEQUEUE_USER): 0 MAX(DEQUEUE_USER): 0 LENGTH(DEQUEUE_USER): 50

==========================================

Can it be displayed this way?

SUM(SUBSCRIBER#) from the table AQ$_FT_Q_BECMD_H has mismatches in both the files
SUM(SUBSCRIBER#): 11
SUM(SUBSCRIBER#):

MAX(SUBSCRIBER#) from the table AQ$_FT_Q_BECMD_H has mismatches in both the files
MAX(SUBSCRIBER#): 11
MAX(SUBSCRIBER#):

in addition to this if a particular field is present in one and not in the other file then can the output be added

LENGTH(DEQUEUE_USER): 50 from the table AQ$_FT_Q_BECMD_H is present in file two and not in file 1.

Can you help? and is there a way to ignore the spacing difference?

Hi,

Can anyone please help me? Its really urgent.

Hi,

The spacing in the input files which i have received is not uniform. There are spacing differences in both the files.So the output was not correct.How can this script be modified to ignore spacing difference? Please help.I should deliver it in another six hours.

nawk '
/^Table Name/ { table = $3 ; tables[table]++ }
! NF || /^-+/ { next }
NR == FNR { stats1[table] = stats1[table] $0 ORS ; next }
{stats2[table] = stats2[table] $0 ORS ; next }
END {
for (t in tables) {
if (stats1[t] != stats2[t]) {
out = "-----------------------------------------------------------------" ORS
out = out stats1[t] ORS stats2[t]
print out
}
}
}
' $e $f >> result.out

The content of the input files which i have received

File 1

Table Name: AAA
Row Count: 96 ~~SUM(F1): 3739 ~~MAX(F1): 77 ~~MIN(F1):0 ~~AVG(F1): 38.94791666666666666666666666667 ~~LENGTH(LINE): 2260

File 2

Table Name: AAA
Row Count: 96 ~~SUM(F1): 3739 ~~MAX(F1): 77 ~~MIN(F1): 0 ~~AVG(F1): 38.94791666666666666666666666667 ~~LENGTH(LINE):2260

Please help.

It's not at all clear where there are spacing differences. Can you point out what works and what doesn't?

There are spaces between the fields. For example considering this
SUM(F1): 3739. There could be spacing between these two words. But the spacing difference is not constant in both the files.And the spacing is not constant between the field and the field value.Can you help me in bringing the output into a particular single format in both the files ignoring the spacing differences and compare and only print exactly the mismatches?I am not able to show the spacing differences here.

With code tags you can.

How about simply run the text through tr -s " "?

Hi,

I have attached both the files.I have also attached the code which i have used.Kindly help me.

Hi,

The problem here is if suppose for example this is one line in file1

Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260.

This is a single line in file1.

But in the other file this one line is broken into two.
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167
LENGTH(LINE): 2260
Hence the script is showing this as a mismatch.

Can this be ignored and compared?

I think this is one of the problem.

Please help.

Can anyone help me please?

This is a single line in the file2.txt you posted. However, there are other examples where a long line has been folded (e.g. AQ$_FT_Q_BECMD_I in file1.txt)

I would propose that you normaize file1.txt and file2.txt so that these differences are removed. Change runs of more than one space to a single space, and remove a newline just before tilde (optionally with spaces before it). Then compare the resulting files instead.

perl -0777 -pi~ -e 's/ *\n *~/ ~/g; s/  */ /g' file1.txt file2.txt

The original files will be left as file1.txt~ and file2.txt~ in case you need to get them back.

(This will also modify the double spaces in the Start Time: and End Time: lines; hope that's not a problem.)

Hi era,

When i ran that code before running my comparison code
perl -0777 -pi~ -e 's/ *\n *~/ ~/g; s/ */ /g' file1.txt file2.txt

All lines starting with that tilde "~~" symbol were removed off for tables having huge number of columns.I dont want this to happen. They were not getting added to a single line.I think it is because of the line limit of the file.

Few lines are getting deleted.

They are not removed, they are merged with the previous line. However, seems that you have DOS carriage returns in there too, so you need to change \n to \r?\n in the script -- sorry for missing that.

The file format is kind of messy, so you might need additional normalizations still. Possibly add something like*s/:\s*(\d+)\s*(~|$)/: $1 $2/g to make all numbers between a colon and a tilde (or end of line) separated by whitespace on both sides.