File Comparison- Need help

uihnybgte · March 2, 2009, 10:03am

I have two text files which have records of thousand rows. Each row is having around 40 columns. Each column is tab delimited. Each row is delimited by newline character.

My requirement is to find for each row i need to find whether any column is different between the two files. For each row i need to find which columns are different. Example is as below

Output should be in tabular form
Difference

Row ColumnNumber Value in File1 Value in File2
1 2 check chck
1 5 672 670
2 1 2 3
2 3 this ok

Please help me.

Let me know if information is not sufficient.

vgersh99 · March 2, 2009, 10:38am

Seems like your files are '|' delimited - not 'tab' delimited you stated.
nawk -F'|' -f ui.awk file1 file2

ui.awk:

FNR==NR { f1[FNR]=$0; next }
{
   f1N=split(f1[FNR], arr, FS)
   for(i=1;i<=NF; i++)
     if ($i != arr)
       print FNR, i, arr, $i
}

uihnybgte · March 3, 2009, 4:50am

Thanks for the reply. Sorry for mentioning it as | instead of tab.

I am able to run the command properly and the results are coming as expected. One problem i am facing is that there is possibility that files have data which are in different order means 1 st row in file 1 could point to 5th row in file 2.
What i could of think of now as we should take the key column (number) from the user and then sort the file on the basis of that. Is there any other way of doing the same.
Can it be done on the basis of filteration also means for file 1 we will take the primary key and then filter the file 2 on the basis of that but i think it will be cumbersome. Sorting the whole file on the basis of primary key will be better option.

Can you please provide any better way of doing this. what will the unix commands for the same. Thanking you in advance for helping me out on this.

vgersh99 · March 3, 2009, 6:20am

What does it mean when you say that '1 st row in file 1 could point to 5th row in file 2'?
Is there a common key (a common row cell OR a combination of cells) that relates 2 rows from 2 different files?
If you know that, you can rewrite the initial script - no need for sorting.

uihnybgte · March 3, 2009, 11:43am

File2
1|chck|test|plan|670
3|jst|no|pln|400
4|checked|ok|plan|610

User will provide the common key (let's say the user has provided 1, it means first column in each file is the primary key to identify records)

Output should be in tabular form
Difference

Row ColumnNumber Value in File1 Value in File2
1 2 check chck
1 5 672 670
2 3 this ok
3 2 just jst
3 4 plan pln
3 5 612 400

*Row - Here row should row number of first file and also if possible we can display the primary key also

One more thing here is user can provide the combination of keys also.

Hope i have cleared my question now.

vgersh99 · March 3, 2009, 12:39pm

ok, something along these lines:

default value for the key is '1' - first column. you can specify a key/column on cli like below:

# key - the value in the FIRST column (default)
nawk -f ui.awk file1 file2

# key - the value in the THIRD column
nawk -v key=3 -f ui.awk file1 file2

# key - the value in the FORTH column
nawk -v key=4 -f ui.awk file1 file2

ui.awk

BEGIN {
  FS="|"
  if (key=="") key="1"
}
FNR==NR { f1r[$key]=FNR; f1v[$key]=$0; next }
{
   f1N=split(f1v[$key], arrV, FS)
   for(i=1;i<=NF; i++)
     if ( ($key in f1r) && $i != arrV)
       print $key, f1r[$key], i, arrV, $i
}

The output will be in the format:

keyValue rowFile1 columnFile2 file1value file2value

uihnybgte · March 6, 2009, 5:39am

Thanks a lot. It perfectly works fine.
Now the problem which i am facing in this that some of the rows are missing in the file2 due to which it does not come in the report. The current code tells the difference between columns of each row. There should be also some report which tells these rows (means primary key in file1 is not found in file2) are missing in file2.
Hope i have clear my question

vgersh99 · March 6, 2009, 7:42am

Hopefully by now you understand the current implementation and can modify it to fulfill your changing requirements.
Do come back with any specific concerns.
Good luck.

uihnybgte · April 13, 2009, 4:04am

Can you please help me on my previuos question.

What if there is composite key. Means row is identified based on multiple key columns.

What should i pass in this if the primary key column is 1,4 (composite one)
# key - the value in the FORTH column
nawk -v key=4 -f ui.awk file1 file2

Do we need to change anything in ui.awk. Please suggest