Comparing similar columns in two different files

ragavhere · April 11, 2008, 2:51am

Hi,

I have two text files.The first and the 2nd file have data in the same format
For e.g. The first file has
table_name1 column1 sum(column1) max(column1) min(column1)
table_name1 column2 sum(column2) max(column2) min(column2)
table_name1 coulmn3 sum(column3) max(column3) min(column3)

table_name2 column1 sum(column1) max(column1) min(column1)
table_name2 column2 sum(column2) max(column2) min(column2)
.
.

The 2nd file has the same data but few additional columns might be present in it.

I need to compare the values of both the files for common columns and report the mismatches to a separate file.
I am new to unix. Can anyone tell me how to do this?

era · April 11, 2008, 2:59am

Assuming your columns are separated by a single space (or fixed width), you can use cut to extract the ones which should be in both files, and compare.

cut -d ' ' -f1-5 file2 | diff -u file1 - | less

You probably don't want diff -u output as the end result, but that should get you going.

ragavhere · April 11, 2008, 7:01am

Hi,

I dont understand ur reply. I can probably give you the relevant content of the file.

File1 contains

BOOKS COUNT: 40
BOOKS AUTHOR1 SUM:1018 MAX:47 MIN:1 AVG:25.45
BOOKS AUTHOR3 SUM:181 MAX:48 MIN:3 AVG:18.1

Note:Read it as Table columnname sum(column) max(column) min(column)

File2 contains

BOOKS COUNT: 40
BOOKS AUTHOR1 SUM:1018 MAX:45 MIN:2 AVG:23.55
BOOKS AUTHOR2 SUM:561 MAX:49 MIN:1 AVG:21.5769231
BOOKS AUTHOR3 SUM:181 MAX:48 MIN:3 AVG:18.1

The file 2 contains the additional column named AUTHOR2.

Comparison should be such that the column AUTHOR1 in file 1 is compared with the column AUTHOR1 in file 2.

Here The coumn AUTHOR! in the table BOOKS mismatches with the other file.

I should write a script to compare the columns of the table as said and report the mismatches to a different file and also report the additioanl columns present in the 2nd file.

The output should be in this format.

Column AUTHOR1 in the tabloe BOOKS mismatches.
BOOKS AUTHOR1 SUM:1018 MAX:47 MIN:1 AVG:25.45
BOOKS AUTHOR1 SUM:1018 MAX:45 MIN:2 AVG:23.55

Column AUTHOR2 is additional.

Please help me in writing a script to do this job.

Thanks.

ragavhere · April 11, 2008, 7:03am

Hi,

File 1 and 2 contains the statistics for many other tables in the format mentioned before. And the column are separated by a single space.

krishmaths · April 11, 2008, 8:01am

comm can be used to handle common lines between two files.

comm -3 File1 File2

Further help in

man comm

radoulov · April 11, 2008, 8:26am

awk 'NR == FNR { 
f1[$1,$2] = $0
next
}
($1 SUBSEP $2) in f1 && $0 != f1[$1,$2] {
  printf "Column %s in the table %s mismatches\n%s\n%s\n", 
  $2, $1, f1[$1,$2], $0
}
!(($1 SUBSEP $2) in f1) {
  printf "Columns %s is additional\n", $2 
}' file1 file2

Use nawk or /usr/xpg4/bin/awk on Solaris.

ragavhere · April 14, 2008, 3:31am

;)Hi Radoulov,

Thanks for your help. The output of your code is almost what i expected.

The ouput i get for the additional columns present is in this format

column 'column_name' is additional.

But this doesn't tell from which table.What can be done to make my output look like this.

column 'column_name' from the table 'table_name' is additional.

ragavhere · April 14, 2008, 3:43am

I got the code to do that.

I modified a part of the code.

"printf "Columns %s is additional\n", $2" to

"printf "Columns %s in the table %s is additional\n", $2, $1.

Thanks so much for your help.

ragavhere · April 14, 2008, 4:44am

Hi,

Can you explain how this code works?

Thanks.

radoulov · April 14, 2008, 4:47pm

Yes.

NR == FNR { 
f1[$1,$2] = $0
next
}

While reading the first input file (NR == FNR) populate the associative array f1 with keys based on the first two fields and values/elements based on the entire record. The next statement prevents subsequent actions (without explicit constraints regarding the first file) from being performed.

($1 SUBSEP $2) in f1 && $0 != f1[$1,$2] {
  printf "Column %s in the table %s mismatches\n%s\n%s\n", 
  $2, $1, f1[$1,$2], $0
}

While reading the rest of the input:
if the first two fileds match any key in the f1 array AND the entire record is different from the corresponding f1 value , print "column in table mismatches and the desired fields ($2, $1 and then the value of f1[$1,$2] and the entire record).

!(($1 SUBSEP $2) in f1) {
  printf "Columns %s is additional\n", $2 
}

If the first two fields don't match any key in the f1 array,
print "additional".

Hope this helps.

ragavhere · April 15, 2008, 3:09am

Hi radoulov

Thanks. I need additional information. Please help me.

The two files are obtained from two test environments. Here one table and its statistics in the format said might be present in file1 but not in file2. Same way table and its statistics present in file2 might not be present in file1.

In this case the code has to be modified in such a way that there is additional information in the output saying that the table present in one file is not present in the other.

For e.g

The contents of file1 are

FT_T_CAVJ COUNT:22117022
FT_T_CAVJ CURR_TL SUM:5.2621E+13 MAX:2432919 MIN:2336485 AVG:2379195.7
FT_T_CAVJ JRNL_CRVL_AMT SUM:5.4881E+10 MAX:495687276 MIN:-178235551 AVG:2481.41038
FT_T_CAVJ ORIG_CRVL_AMT SUM:0 MAX:0 MIN:0 AVG:0

EMBARCADERO_EXPLAIN_PLAN COUNT:0
EMBARCADERO_EXPLAIN_PLAN PARTITION_ID SUM:5.4881E+10 MAX:495687276 MIN:-178235551 AVG:2481.41038
EMBARCADERO_EXPLAIN_PLAN PARTITION_START SUM:0 MAX:0 MIN:0 AVG:0

File2 contains

FT_T_CAVJ COUNT:22117022
FT_T_CAVJ CURR_TL SUM:5.2621E+13 MAX:2432919 MIN:2336585 AVG:2379195.7
FT_T_CAVJ JRNL_CRVL_AMT SUM:5.4881E+10 MAX:495697276 MIN:-178335551 AVG:2481.41038
FT_T_CAVJ CRVL_JRNL_ID SUM:221170220
FT_T_CAVJ CRVL_TYP

CAVJ_VRFY_PRE COUNT:115
CAVJ_VRFY_PRE BAL_DIFFERENCE SUM:6409.44 MAX:457.01 MIN:-399.99 AVG:55.7342609
CAVJ_VRFY_PRE CAVA_AMT SUM:182628580 MAX:33744800.9 MIN:-2851167.2 AVG:1588074.61

Here file2 contains additional columns which are not present in file1. They are CRVL_JRNL_ID and CRVL_TYP .

File1 contain a column ORIG_CRVL_AMT which is not present in file2.

And file2 also contains an additional table named "CAVJ_VRFY_PRE " and its statistics. This is not present in file1.

Table name EMBARCADERO_EXPLAIN_PLAN is present in file1 but not in file2.

The output should be in this format

Column CURR_TL in the table FT_T_CAVJ has mismatches
FT_T_CAVJ CURR_TL SUM:5.2621E+13 MAX:2432919 MIN:2336485 AVG:2379195.7
FT_T_CAVJ CURR_TL SUM:5.2621E+13 MAX:2432919 MIN:2336585 AVG:2379195.7

Column CRVL_JRNL_ID is present in file2 but not in file1.

Column CRVL_TYP is present in file2 but not in file1.

Table EMBARCADERO_EXPLAIN_PLAN is present in file1 but not in file2.
EMBARCADERO_EXPLAIN_PLAN COUNT:0
EMBARCADERO_EXPLAIN_PLAN PARTITION_ID SUM:5.4881E+10 MAX:495687276 MIN:-178235551 AVG:2481.41038
EMBARCADERO_EXPLAIN_PLAN PARTITION_START SUM:0 MAX:0 MIN:0 AVG:0

Table CAVJ_VRFY_PRE is present in file2 but not in file1.
CAVJ_VRFY_PRE COUNT:115
CAVJ_VRFY_PRE BAL_DIFFERENCE SUM:6409.44 MAX:457.01 MIN:-399.99 AVG:55.7342609
CAVJ_VRFY_PRE CAVA_AMT SUM:182628580 MAX:33744800.9 MIN:-2851167.2 AVG:1588074.61

Kindly help me in doing this.

radoulov · April 15, 2008, 8:57am

Something like this:

awk 'END {
  for (t1_ in t1) {
    if (!(t1_ in t2))
      printf "Table %s is present only in %s\n%s\n",
      t1_, ARGV[1], t1[t1_]
  }
}
NR == FNR { t1[$1] = t1[$1] ? t1[$1]RS$0 : $0; c1[$1,$2] = $0; next }
($1 SUBSEP $2) in c1 && c1[$1,$2] != $0 {
    printf "Column %s in the table %s mismatches.\n%s\n%s\n",
    $2, $1, c1[$1,$2], $0
}
($1 in t1) && !(($1 SUBSEP $2) in c1) {
    printf "Columns %s is present only in %s\n",
    $2, ARGV[2]
}
!($1 in t1) {
    __++
    if (__ == 1) {
    printf "Table %s is present only in %s\n",
    $1, ARGV[2]
    }
    else
      print  
}
{ t2[$1] }' file1 file2

ragavhere · April 16, 2008, 2:24am

Hi radoulov,

Thanks so much for ur help. The code is working fine.

ragavhere · April 16, 2008, 8:53am

Hi,

The format of my two files have changed. I need to compare these two files.
Both the files are in the same format.

Start Time: Thu Apr 17 03:41:21 PDT 2008
Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260
End Time: Thu Apr 17 03:41:22 PDT 2008

------------------------------------------------------------------------------------------------------------------------------

Start Time: Thu Apr 17 03:41:22 PDT 2008
Table Name: AQ$_FT_Q_BECMD_G
Row Count: 0 SUM(SUBSCRIBER#):MAX(SUBSCRIBER#): MIN(SUBSCRIBER#): AVG(SUBSCRIBER#):LENGTH(NAME):SUM(ADDRESS#): MAX(ADDRESS#):MIN(ADDRESS#):AVG(ADDRESS#):
End Time: Thu Apr 17 03:41:23 PDT 2008

------------------------------------------------------------------------------------------------------------------------------
I need to ignore the start time line and end time line and need to compare
the other two lines.The table present in one file might not be present in the other file. Similarly
few columns present in one file might not be present in the other.After comparing the output should be like this

E.g.

column SUBSCRIBER# has mismatches for the sum value
below this the information for the corresponding table of this column should be present.

Column F1 present in file1 is not present in other
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260

Table AAA is present in file 1 but not in the file2.
Start Time: Thu Apr 17 03:41:21 PDT 2008
Table Name: AAA
Row Count:96 SUM(F1): 3739 MAX(F1):77 MIN(F1): 0 AVG(F1): 38.9479167 LENGTH(LINE): 2260
End Time: Thu Apr 17 03:41:22 PDT 2008

Can you give me a script to do this job? Please help me. I am stuck. Its urgent.