Compare large files and write the differences to separate files

jubaier · March 15, 2012, 5:29am

I have a very large system-generated file, around 500K rows and 100 MB in size, like the following:

  
  HOME|ALICE STREET|3||NEW LISTING
  HOME|NEWPORT STREET|1||NEW LISTING
  HOME|KING STREET|5||NEW LISTING
  HOME|WINSOME AVENUE|4||MODIFICATION
  CAR|TOYOTA|4||NEW LISTING
  CAR|FORD|4||NEW LISTING
  COMPUTER|HP|1||NEW LISTING
  COMPUTER|APPLE|1||NEW LISTING

The file is generated once a day. Every day some rows are deleted, some are modified, and some are added, as in the following (line 2 of the first file is deleted, line 5 of the first file is modified, and line 6 of the second file is added):

   
  HOME|ALICE STREET|3||NEW LISTING
  HOME|KING STREET|5||NEW LISTING
  HOME|WINSOME AVENUE|4||MODIFICATION
  CAR|TOYOTA|5||NEW LISTING
  CAR|FORD|4||NEW LISTING
  CAR|HONDA|4||NEW LISTING
  COMPUTER|HP|1||NEW LISTING
  COMPUTER|APPLE|1||NEW LISTING

I want to write the deleted rows to one file, and the modified and added rows to another file.

Diff File 1 should be

 
  HOME|NEWPORT STREET|1||NEW LISTING

And Diff File 2 should be

   
  CAR|TOYOTA|5||NEW LISTING
  CAR|HONDA|4||NEW LISTING

I am very new to shell scripting. Any help is very much appreciated.

bartus11 · March 15, 2012, 5:34am

Take a look at the output of

diff file1 file2

methyl · March 15, 2012, 9:38am

I don't think that the output format from "diff" is suitable, and "diff" tends to lose its place on large unsorted files.

The file must be a proper unix text file. In your sample, the data has no particular sorted order.

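You'll need to "sort" both files to produce two new sorted output files in a work area, then compare the two sorted files with two separate unix "comm" commands: one to list the lines found only in the first file, and one to list the lines found only in the second. The differences will come out in sorted order, not in the order of the original file.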
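
A minimal sketch of the sort-then-comm approach described above, assuming the two daily files are named file1 and file2. The directory comm_demo and the output file names are made up for illustration, and the sample rows are the ones from the first post. Whole lines are compared, so the old version of a modified row shows up in the first output file alongside the true deletions.

```shell
#!/bin/sh
# Sketch only: comm_demo and the only_in_* names are illustrative.
d=comm_demo
mkdir -p "$d" || exit 1

# Sample data from the first post (file1 = yesterday, file2 = today).
cat > "$d/file1" <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF
cat > "$d/file2" <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF

# comm needs both inputs sorted in the same collation order,
# so the locale is pinned to C for sort and comm alike.
LC_ALL=C sort "$d/file1" > "$d/file1.sorted"
LC_ALL=C sort "$d/file2" > "$d/file2.sorted"

# -23 hides columns 2 and 3, leaving lines found only in file1
# (deleted rows, plus the old version of modified rows).
LC_ALL=C comm -23 "$d/file1.sorted" "$d/file2.sorted" > "$d/only_in_file1.txt"

# -13 hides columns 1 and 3, leaving lines found only in file2
# (added rows, plus the new version of modified rows).
LC_ALL=C comm -13 "$d/file1.sorted" "$d/file2.sorted" > "$d/only_in_file2.txt"
```

With the sample data, only_in_file1.txt holds the NEWPORT STREET row and the old TOYOTA row, and only_in_file2.txt holds the HONDA row and the new TOYOTA row.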

drl · March 15, 2012, 10:08am

Hi.

I think I would still try GNU diff first. If it works, then you are done; if not, then something else can be tried, such as the suggestion from methyl.

If your files are described as:

   When the files you are comparing are large and have small groups of
changes scattered throughout them, you can use the
`--speed-large-files' option to make a different modification to the
algorithm that `diff' uses.  If the input files have a constant small
density of changes, this option speeds up the comparisons without
changing the output.  If not, `diff' might produce a larger set of
differences; however, the output will still be correct.

-- excerpt from info diff, q.v.

then you may want to experiment with that.
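
For reference, a small sketch of how that option might be tried. It assumes GNU diff; the guard falls back to plain diff where the option is not recognised. The directory diff_demo and the file changes.diff are illustrative names, and the rows are a cut-down version of the sample data.

```shell
#!/bin/sh
# Sketch only: diff_demo and changes.diff are illustrative names.
d=diff_demo
mkdir -p "$d" || exit 1

printf '%s\n' \
    'HOME|ALICE STREET|3||NEW LISTING' \
    'HOME|NEWPORT STREET|1||NEW LISTING' \
    'HOME|KING STREET|5||NEW LISTING' > "$d/file1"
printf '%s\n' \
    'HOME|ALICE STREET|3||NEW LISTING' \
    'HOME|KING STREET|5||NEW LISTING' > "$d/file2"

# diff exits with status 1 when the files differ, so only a status
# greater than 1 is treated as a failure here.
if diff --speed-large-files /dev/null /dev/null >/dev/null 2>&1; then
    # GNU diff: use the heuristic meant for large files with scattered changes.
    diff --speed-large-files "$d/file1" "$d/file2" > "$d/changes.diff" \
        || [ $? -eq 1 ] || exit 2
else
    # Option not recognised (assumed non-GNU diff): use the default algorithm.
    diff "$d/file1" "$d/file2" > "$d/changes.diff" \
        || [ $? -eq 1 ] || exit 2
fi
```

Timing both variants against the real 100 MB files would show whether the option makes any difference in practice.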

Best wishes ... cheers, drl

jubaier · March 20, 2012, 1:13am

I am basically trying to identify the rows deleted from the first file.

$ cat file1
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
$ cat file2
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
$ diff file1 file2
2d1
< HOME|NEWPORT STREET|1||NEW LISTING
5c4
< CAR|TOYOTA|4||NEW LISTING
---
> CAR|TOYOTA|5||NEW LISTING
6a6
> CAR|HONDA|4||NEW LISTING
$

I am trying to get an output file containing only the lines deleted from the first file, which in this instance is the following:

HOME|NEWPORT STREET|1||NEW LISTING

pravin27 · March 20, 2012, 1:36am

try this,

 
awk 'NR==FNR{a[$0]++;next} !a[$0]' file2 file1

rangarasan · March 20, 2012, 2:05am

Hi,

Try this one. It treats the first two columns as the key. It is just a small modification of pravin's post.

awk 'BEGIN{FS="|";}NR==FNR{a[$1$2]++;next}!a[$1$2]' file2 file1

Output:

HOME|NEWPORT STREET|1||NEW LISTING

Cheers,
Ranga:)

jubaier · March 20, 2012, 6:19pm

Hi rangarasan/pravin27

Thanks for your post. I am getting the following error from the code.

$ awk 'BEGIN{FS="|";}NR==FNR{a[$1$2]++;next}!a[$1$2]' file2 file1
awk: syntax error near line 1
awk: bailing out near line 1
$

mjf · March 21, 2012, 9:26am

jubaier,
Using the results of the diff command, you can select the records that are in file 1 but not in file 2, and vice versa, and write the results to individual files. As others have mentioned, you should sort before doing the diff. For example:

diff file1.txt file2.txt 

2d1 
< HOME|NEWPORT STREET|1||NEW LISTING 
5c4 
< CAR|TOYOTA|4||NEW LISTING 
--- 
> CAR|TOYOTA|5||NEW LISTING 
6a6 
> CAR|HONDA|4||NEW LISTING

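Output those records in file2.txt that are not in file1.txt (anchoring the pattern at the start of the line avoids matching a ">" or "<" that appears inside the data):

diff file1.txt file2.txt | grep '^> ' | cut -b 3- > add.txt

Output those records in file1.txt that are not in file2.txt:

diff file1.txt file2.txt | grep '^< ' | cut -b 3- > drop.txt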

cat add.txt 
CAR|TOYOTA|5||NEW LISTING 
CAR|HONDA|4||NEW LISTING 

cat drop.txt 
HOME|NEWPORT STREET|1||NEW LISTING 
CAR|TOYOTA|4||NEW LISTING

Note that this comparison matches on the entire record, which may not meet your requirement, since you don't want the original version of a changed record ("CAR|TOYOTA|4||NEW LISTING") written to the drop.txt file. You can then look at the join command, which lets you match on particular fields/keys in your file to determine whether a record was truly dropped or merely changed.
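
One way to make that key-based distinction, sketched here with awk rather than join because the key spans two fields. It assumes that fields 1 and 2 together identify a record (as suggested earlier in the thread) and that neither input file is empty; the directory key_demo is made up for illustration. Each awk pass holds the keys or lines of one file in memory, which should be manageable for 500K rows on most current machines, though that is not something tested here.

```shell
#!/bin/sh
# Sketch only: key_demo is an illustrative name; the key choice is an assumption.
d=key_demo
mkdir -p "$d" || exit 1

cat > "$d/file1.txt" <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF
cat > "$d/file2.txt" <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF

# Dropped records: the key (fields 1 and 2) exists in file1 but not in file2.
# file2 is read first to collect its keys, then file1 is filtered against them.
awk -F'|' '
    { key = $1 FS $2 }
    NR == FNR { seen[key] = 1; next }
    !(key in seen)
' "$d/file2.txt" "$d/file1.txt" > "$d/drop.txt"

# Added or changed records: the whole line exists in file2 but not in file1.
# file1 is read first to collect its lines, then file2 is filtered against them.
awk '
    NR == FNR { seen[$0] = 1; next }
    !($0 in seen)
' "$d/file1.txt" "$d/file2.txt" > "$d/add.txt"
```

With the sample data, drop.txt contains only the NEWPORT STREET record, and add.txt contains the new TOYOTA record followed by the HONDA record, in the order they appear in file2.txt. If the system awk is an old implementation that reports syntax errors on scripts like this, nawk or another POSIX awk may be needed; that is an assumption about the environment, not something confirmed in this thread.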

mjf

drl · March 21, 2012, 10:17am

Hi.

Here is a complex script that attempts to satisfy the original requirements:

#!/usr/bin/env bash

# @(#) s1	Demonstrate comparison and extraction of small differences.

# Section 1, setup, pre-solution.
# Infrastructure details, environment, debug commands for forum posts. 
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
C=$HOME/bin/context && [ -f $C ] && $C specimen diff awk
set -o nounset

FILE1=${1-data1}
shift
FILE2=${1-data2}

# Display sample data files (specimen is a local utility; substitute cat if it is not installed).
pe
specimen $FILE1 $FILE2 
# edges 3 $FILE1
# edges 3 $FILE2

# Section 2, solution.
pl " Preparation and pipeline:"
db " Section 2: solution."
diff -u $FILE1 $FILE2 |
tee f1 |
awk '
/^-[^-]/	{
			# print "debug for - working on",NR,$0
			# Queue the line: it is a deletion unless a "+" line follows directly.
			pending = pending "deleted " $0 "\n" ; next }
/^[+][^+]/	{
			# print "debug for + working on",NR,$0
			# A "+" line directly after queued "-" lines replaces them: a change.
			if ( pending != "" ) {
				action = "changed"
				pending = ""
			} else {
				action = "inserted"
			}
			print action, $0
			next
		}
pending != ""	{
				# print "debug for not +-",NR,$0
				# Any other line ends the run: the queued "-" lines were deletions.
				printf "%s", pending ; pending = "" }
END		{ printf "%s", pending }	# flush deletions at the very end of the diff
' |
tee f2 |
awk '
/^deleted/	{ sub(/^deleted [-]/, "") ; print > "f.deleted" ; next }
/^(changed|inserted)/	{ sub(/^(changed|inserted) [+]/,"") ; print > "f.changed" ; next }
'

pl " Results, deletions file:"
cat f.deleted
pl " Results, insertions and changes file:"
cat f.changed

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
specimen (local) 1.17
diff (GNU diffutils) 2.8.1
awk GNU Awk 3.1.5

Whole: 5:0:5 of 8 lines in file "data1"
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING

Whole: 5:0:5 of 8 lines in file "data2"
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING

-----
 Preparation and pipeline:

-----
 Results, deletions file:
HOME|NEWPORT STREET|1||NEW LISTING

-----
 Results, insertions and changes file:
CAR|TOYOTA|5||NEW LISTING
CAR|HONDA|4||NEW LISTING

This uses the unified format for the diff. It obviously works for the sample files, but I don't know if it will work on far larger instances. You can look at files f1 and f2 to see the intermediate data.

If it does not work, then perhaps a sort and diff would be the best approach -- I just dislike making passes over files when I don't have to, especially if they are large. However, these days, 100 MB is not overwhelming.

Best wishes ... cheers, drl