awk to compare flat files and print output to another file

Hello,

I have been struggling for quite some time to compare flat files with over 1 million records; could anyone please help me?

I want to compare two pipe-delimited flat files, file1 against file2, and output the unmatched rows from file2 to file3.

Sample File1:

210|testdata1|0003431231|sjhhfskdf|5-NOV-99|
010|testdata2|00234001231|sjhhfskdf|5-NOV-99|
111|testdata3|0003421101231|sjhhfskdf|5-NOV-99|

Sample File2:

210|xxxxxxxxxxxx|0003431231|sjhhfskdf|5-NOV-99|
010|testdata2|00234001231|sjhhfskdf|5-DEC-99|
111|testdata3|0003421101231|sjhhfskdf|5-NOV-99|

Sample output in file3:

210|xxxxxxxxxxxx|0003431231|sjhhfskdf|5-NOV-99|
010|testdata2|00234001231|sjhhfskdf|5-DEC-99|

Can this be done using awk? Any help on this is much appreciated.
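For reference, one common awk idiom for this kind of whole-line comparison is to load every record of file1 into an associative array and then print each record of file2 that is not in the array. This is only a minimal sketch, and it assumes file1 fits comfortably in memory (an array of around 1.8 million short records can run to a few hundred MB):

# remember every full record of file1, then print the file2 records not seen in file1
awk 'NR==FNR {seen[$0]; next} !($0 in seen)' file1 file2 > file3

The output preserves the original order of file2.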

Try...

grep -vf File2 File1 > File3
cat file1 > final_file
cat file2 >> final_file

sort final_file | uniq -u >> result_file

Thank you dazdseg. With this I can find the unique/matching records across both files; is there a way I can move the difference/unmatched records into a new file?

Sorry to be a wet blanket, but neither the grep nor the uniq approach will fulfill the requirement, even if the data were in sorted order (which it isn't).

1) Do both files have exactly the same number of records and are you just looking for records which have changed? Does the order of the output into file3 matter?
2) If there can be more or fewer records in file2 than in file1, does the order of the output into file3 matter?
Are you also interested in records which exist in file1 but do not exist in file2?
3) What percentage of differences do you expect? (This is really a performance question because some approaches would involve multiple lookups).
4) If this proves too difficult for shell programming, do you have a mainstream database engine?

---------- Post updated at 15:05 ---------- Previous update was at 14:20 ----------

One shell approach if the order of the output does not matter.
Tried with two approx 5-million-record files of 500 MB each. Took about 5 mins to run, and the output only shows the mismatched records from file2. Actual performance will depend on how fast your computer is and how much memory you can give to sort.

#!/bin/ksh
# sort both files so comm can compare them line by line
sort file1 > sortfile1
sort file2 > sortfile2
# -13 suppresses lines unique to sortfile1 and lines common to both,
# leaving only the lines which appear in file2 but not in file1
comm -13 sortfile1 sortfile2

When sorting large files be sure to set $TMPDIR to somewhere with enough space for at least twice the size of the file being sorted.
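For example (the directory below is just a placeholder for any filesystem with roughly twice the input size free):

# hypothetical scratch area with enough free space for sort's work files
export TMPDIR=/bigdisk/tmp
sort file1 > sortfile1

# most sort implementations also accept -T to name the work directory directly
sort -T /bigdisk/tmp file2 > sortfile2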

Thanks for your time on this, it's much appreciated.

1) Do both files have exactly the same number of records and are you just looking for records which have changed? Does the order of the output into file3 matter?

File1 has 1803077 records
file2 has 1795370 records

2) If there can be more or fewer records in file2 than in file1, does the order of the output into file3 matter?

I would prefer the 1st row in file3 to come from file1, the 2nd row from file2, and so on.

Are you also interested in records which exist in file1 but do not exist in file2?

Yes, and vice versa as well; it would be good if we could copy the records to different files, say recordsonlyonfile1.txt and recordsonlyonfile2.txt.

3) What percentage of differences do you expect? (This is really a performance question because some approaches would involve multiple lookups).

There are huge changes in the files; it could be over 50%.

4) If this proves too difficult for shell programming, do you have a mainstream database engine?

I have an Informix database, but I am not sure it would help me as there is no unique key in the records.

The database approach is winning because of the requirement to retain the original random order and to get enough speed to check each record in turn. Can't see an obvious way to keep any other form of compare in step when there is such a high volume of differences.

1) Load each external flat file into a separate table, using the whole record as the key.
2) Re-read each external flat file in turn, checking the result of a seek for each record against the opposing table, and output non-matches to a respective further external file.
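The same whole-record lookup can also be sketched in awk, with an associative array in place of a database table, assuming one file's records fit in memory as array keys; run in both directions it produces the two files requested earlier, each in its original order:

# records present in file2 but missing from file1
awk 'NR==FNR {seen[$0]; next} !($0 in seen)' file1 file2 > recordsonlyonfile2.txt

# and the reverse direction
awk 'NR==FNR {seen[$0]; next} !($0 in seen)' file2 file1 > recordsonlyonfile1.txt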

Hi methyl,

I see performance issues with the DB approach. I have an Informix DB, and loading one of the files took me over 30 minutes. Loading the data and running SQL to find matched and unmatched records is not very efficient, and if I have to compare another set of files, I have to drop the indexes and tables and reload everything again.

Is there any other approach I can follow? Maybe a shell script?

Sorry, and others may disagree, but I think a shell-script cross-check is a non-starter for this project because of the need to check each record with serial reads through a large flat file in high volume. The number of serial record comparisons would be the product of the number of records in each file - say one million times one million, i.e. on the order of 10^12. The database "seek" approach avoids a serial pass through a large flat file for every record, provided that the database engine is reasonably efficient and that a "seek" does not retrieve the record - all we want to know is whether the record exists.
If the order of the output is unimportant, then sorting the files prior to processing in a unix script is a simple option.
Overall this is an application design problem.

What Methyl said is true. I just wanted to drop in and check whether you are done with your requirement. If you are not and you still need some help, and it's in Unix, please post it and I will try again.