Comparing two files

ragavhere · April 21, 2008, 5:09am

Hi,

I have two files in this format. The files contain the statistics of tables as seen below. The other file is also in this format. I need to compare both the files and if there is a mismatch i need to display the contents within the break lines from both the files for that corresponding table.

Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_G
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER): LENGTH(DEQUEUE_USER): 50
------------------------------------------------------------------------------------------------------------------------------

Can anyone help me on this?

aigles · April 21, 2008, 8:58am

A possible way using awk :

awk '
/^Table Name/  { table = $3 ; tables++                }
! NF || /^-+/  {                                        next }
NR == FNR      { stats1 = stats1 $0 ORS ; next }
               { stats2 = stats2 $0 ORS ; next }
END {
   for (t in tables) {
      if (stats1[t] != stats2[t]) {
         out = "==========================================" ORS
         out = out stats1[t] ORS stats2[t]
         print out
      }
   }
}
' stats1.dat stats2.dat

Input file 1:

Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260

------------------------------------------------------------------------------------------------------------------------------
Table Name: AQ$_FT_Q_BECMD_G
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER):
------------------------------------------------------------------------------------------------------------------------------

Input file 2:

Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260

------------------------------------------------------------------------------------------------------------------------------
Table Name: AQ$_FT_Q_BECMD_G
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------
Table Name: AQ$_FT_Q_BECMD_K
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

------------------------------------------------------------------------------------------------------------------------------

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 1 SUM(SUBSCRIBER#): 11 MAX(SUBSCRIBER#): 11 MIN(SUBSCRIBER#): 11 AVG(SUBSCRIBER#): 11 LENGTH(NAME) 24 :SUM(ADDRESS#): MAX(ADDRESS#): 0 MIN(ADDRESS#): 0 AVG(ADDRESS#):  0 LENGTH(TRANSACTION_ID): 4 SUM(DEQUEUE_USER): 0 MAX(DEQUEUE_USER): 0
------------------------------------------------------------------------------------------------------------------------------

Ouput:

==========================================
Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER):

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 1 SUM(SUBSCRIBER#): 11 MAX(SUBSCRIBER#): 11 MIN(SUBSCRIBER#): 11 AVG(SUBSCRIBER#): 11 LENGTH(NAME) 24 :SUM(ADDRESS#): MAX(ADDRESS#): 0 MIN(ADDRESS#): 0 AVG(ADDRESS#):  0 LENGTH(TRANSACTION_ID): 4 SUM(DEQUEUE_USER): 0 MAX(DEQUEUE_USER): 0

==========================================

Table Name: AQ$_FT_Q_BECMD_K
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):

Jean-Pierre.

ragavhere · April 22, 2008, 3:06am

Hi,

The code is working very fine. Can you please explain each line of the code?

aigles · April 22, 2008, 5:13am

awk '

/^Table Name/ {                             # Select lines starting with 'Table Name'
   table = $3;                              # Memorize table name into variable
   tables++                          #   and array
}                                           #

! NF || /^-+/ {                             # Select empty and delimiter lines 
   next                                     # Proceed next line (skip selected lines)
}                                           #

NR == FNR {                                 # Select lines from first input file 
   stats1 = stats1 $0 ORS;    # Memorize stats's table
   next                                     # Proceed next line
}

{                                           # Lines comes from second input file 
   stats2 = stats2 $0 ORS;    # Memorize stats's table
   next                                     # Proceed next line
}                                           #

END {                                       # All files have been read
   for (t in tables) {                      # For all memorized tables
      if (stats1[t] != stats2[t]) {         #    If stats mismatch
         out = "==========================================" ORS;
         out = out stats1[t] ORS stats2[t]; #
         print out                          #       Output sep line and stats
      }                                     #
   }                                        #
}                                           #

' stats1.dat stats2.dat

Jean-Pierre.

ragavhere · April 22, 2008, 8:57am

Hi aigles,

Thanks a lot.

ragavhere · April 26, 2008, 5:26am

Hi Jean-Pierre,

Does stats1 and stats2 refer to the two input files? Will the code work if i give the path name like /home/frk/ragav/stats1 and /home/frk/ragav/stats2 instead of the file names? Then how should the code be modified?

If i assign the path like /home/frk/ragav/stats1 to a variable how can i call the path in the code?

When i assigned the file name to a variable like
a=stats1.txt
b=stats2.txt
and changed the code to

nawk '
/^Table Name/ { table = $3 ; tables[table]++ }
! NF || /^-+/ { next }
NR == FNR { $e[table] = $e[table] $0 ORS ; next }
{ $f[table] = $f[table] $0 ORS ; next }
END {
for (t in tables) {
if ($e[t] != $f[t]) {
out = "-----------------------------------------------------------------" ORS
out = out $e[t] ORS $f[t]
print out
}
}
}
' $e $f >> result.out

i am getting this error.

nawk: illegal field $()
input record number 1, file startendcut1.txt
source line number 4

Can you please help on the above two ways of modifying the code?

aigles · April 26, 2008, 6:39am

In my script, stats1 and stats2 inside awk code are arrays.
stats1.dat and stats2.dat are the input files.

The inputfiles can be specified with the full path.
Don't use $e and $f as arrays in your awk code, use fixed names (stats1 and stats2 or what you want like first_array and second_array...)

a=stats1.txt
b=stats2.txt

nawk '
/^Table Name/ { table = $3 ; tables++ }
! NF || /^-+/ { next }
NR == FNR     { stats1 = stats1 $0 ORS ; next }
              {stats2 = stats2 $0 ORS ; next }
END {
   for (t in tables) {
      if (stats1[t] != stats2[t]) {
         out = "-----------------------------------------------------------------" ORS
         out = out stats1[t] ORS stats2[t]
         print out
      }
   }
}
' $e $f >> result.out

Jean-Pierre.

ragavhere · May 11, 2008, 5:49am

Hi,

Instead of displaying the entire content from both the files like this

==========================================
Table Name: AQ$_FT_Q_BECMD_H
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#): LENGTH(TRANSACTION_ID): SUM(DEQUEUE_USER): MAX(DEQUEUE_USER):

Table Name: AQ$_FT_Q_BECMD_H
Row Count: 1 SUM(SUBSCRIBER#): 11 MAX(SUBSCRIBER#): 11 MIN(SUBSCRIBER#): 11 AVG(SUBSCRIBER#): 11 LENGTH(NAME) 24 :SUM(ADDRESS#): MAX(ADDRESS#): 0 MIN(ADDRESS#): 0 AVG(ADDRESS#): 0 LENGTH(TRANSACTION_ID): 4 SUM(DEQUEUE_USER): 0 MAX(DEQUEUE_USER): 0 LENGTH(DEQUEUE_USER): 50

==========================================

Can it be displayed this way?

SUM(SUBSCRIBER#) from the table AQ$_FT_Q_BECMD_H has mismatches in both the files
SUM(SUBSCRIBER#): 11
SUM(SUBSCRIBER#):

MAX(SUBSCRIBER#) from the table AQ$_FT_Q_BECMD_H has mismatches in both the files
MAX(SUBSCRIBER#): 11
MAX(SUBSCRIBER#):

in addition to this if a particular field is present in one and not in the other file then can the output be added

LENGTH(DEQUEUE_USER): 50 from the table AQ$_FT_Q_BECMD_H is present in file two and not in file 1.

Can you help? and is there a way to ignore the spacing difference?

ragavhere · May 12, 2008, 12:16am

Hi,

Can anyone please help me? Its really urgent.

ragavhere · June 9, 2008, 1:27am

Hi,

The spacing in the input files which i have received is not uniform. There are spacing differences in both the files.So the output was not correct.How can this script be modified to ignore spacing difference? Please help.I should deliver it in another six hours.

nawk '
/^Table Name/ { table = $3 ; tables[table]++ }
! NF || /^-+/ { next }
NR == FNR { stats1[table] = stats1[table] $0 ORS ; next }
{stats2[table] = stats2[table] $0 ORS ; next }
END {
for (t in tables) {
if (stats1[t] != stats2[t]) {
out = "-----------------------------------------------------------------" ORS
out = out stats1[t] ORS stats2[t]
print out
}
}
}
' $e $f >> result.out

The content of the input files which i have received

File 1

Table Name: AAA
Row Count: 96 ~~SUM(F1): 3739 ~~MAX(F1): 77 ~~MIN(F1):0 ~~AVG(F1): 38.94791666666666666666666666667 ~~LENGTH(LINE): 2260

File 2

Table Name: AAA
Row Count: 96 ~~SUM(F1): 3739 ~~MAX(F1): 77 ~~MIN(F1): 0 ~~AVG(F1): 38.94791666666666666666666666667 ~~LENGTH(LINE):2260

Please help.

era · June 9, 2008, 4:28am

It's not at all clear where there are spacing differences. Can you point out what works and what doesn't?

ragavhere · June 9, 2008, 4:55am

There are spaces between the fields. For example considering this
SUM(F1): 3739. There could be spacing between these two words. But the spacing difference is not constant in both the files.And the spacing is not constant between the field and the field value.Can you help me in bringing the output into a particular single format in both the files ignoring the spacing differences and compare and only print exactly the mismatches?I am not able to show the spacing differences here.

era · June 9, 2008, 7:09am

With code tags you can.

How about simply run the text through tr -s " "?

ragavhere · June 9, 2008, 8:05am

Hi,

I have attached both the files.I have also attached the code which i have used.Kindly help me.

ragavhere · June 9, 2008, 8:19am

Hi,

The problem here is if suppose for example this is one line in file1

Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260.

This is a single line in file1.

But in the other file this one line is broken into two.
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167
LENGTH(LINE): 2260
Hence the script is showing this as a mismatch.

Can this be ignored and compared?

I think this is one of the problem.

Please help.

ragavhere · June 10, 2008, 1:33am

Can anyone help me please?

era · June 10, 2008, 2:34am

This is a single line in the file2.txt you posted. However, there are other examples where a long line has been folded (e.g. AQ$_FT_Q_BECMD_I in file1.txt)

I would propose that you normaize file1.txt and file2.txt so that these differences are removed. Change runs of more than one space to a single space, and remove a newline just before tilde (optionally with spaces before it). Then compare the resulting files instead.

perl -0777 -pi~ -e 's/ *\n *~/ ~/g; s/  */ /g' file1.txt file2.txt

The original files will be left as file1.txt~ and file2.txt~ in case you need to get them back.

(This will also modify the double spaces in the Start Time: and End Time: lines; hope that's not a problem.)

ragavhere · June 10, 2008, 3:21am

Hi era,

When i ran that code before running my comparison code
perl -0777 -pi~ -e 's/ *\n *~/ ~/g; s/ */ /g' file1.txt file2.txt

All lines starting with that tilde "~~" symbol were removed off for tables having huge number of columns.I dont want this to happen. They were not getting added to a single line.I think it is because of the line limit of the file.

Few lines are getting deleted.

era · June 10, 2008, 3:31am

They are not removed, they are merged with the previous line. However, seems that you have DOS carriage returns in there too, so you need to change \n to \r?\n in the script -- sorry for missing that.

era · June 10, 2008, 3:35am

The file format is kind of messy, so you might need additional normalizations still. Possibly add something like*s/:\s*(\d+)\s*(~|$)/: $1 $2/g to make all numbers between a colon and a tilde (or end of line) separated by whitespace on both sides.