Hi.
Here is a complex script that attempts to satisfy the original requirements:
#!/usr/bin/env bash
# @(#) s1 Demonstrate comparison and extraction of small differences.
# Section 1, setup, pre-solution.
# Infrastructure details, environment, debug commands for forum posts.
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
C=$HOME/bin/context && [ -f $C ] && $C specimen diff awk
set -o nounset
FILE1=${1-data1}
shift
FILE2=${1-data2}
# Display sample data files.
pe
specimen $FILE1 $FILE2
# edges 3 $FILE1
# edges 3 $FILE2
# Section 2, solution.
pl " Preparation and pipeline:"
db " Section 2: solution."
diff -u $FILE1 $FILE2 |
tee f1 |
awk '
/^-[^-]/ {
  # print "debug for - working on",NR,$0
  # Flush a pending deletion so that consecutive "-" lines are not lost.
  if ( previous != 0 ) { print action, line }
  previous = NR ; action = "deleted"; line = $0 ; next }
/^[+][^+]/ {
  # print "debug for + working on",NR,$0
  if ( previous == NR-1 ) {
    # A "+" line directly after a "-" line counts as a change.
    action = "changed"
  } else {
    action = "inserted"
  }
  print action, $0
  previous = 0
  next
}
previous != 0 {
  # print "debug for not +-",NR,$0
  print action, line ; previous = 0 }
END {
  # Flush a deletion that sits on the last line of the diff.
  if ( previous != 0 ) { print action, line } }
' |
tee f2 |
awk '
/^deleted/ { sub(/^deleted [-]/, "") ; print > "f.deleted" ; next }
/^(changed|inserted)/ { sub(/^(changed|inserted) [+]/,"") ; print > "f.changed" ; next }
'
pl " Results, deletions file:"
cat f.deleted
pl " Results, insertions and changes file:"
cat f.changed
exit 0
producing:
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
bash GNU bash 3.2.39
specimen (local) 1.17
diff (GNU diffutils) 2.8.1
awk GNU Awk 3.1.5
Whole: 5:0:5 of 8 lines in file "data1"
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
Whole: 5:0:5 of 8 lines in file "data2"
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
-----
Preparation and pipeline:
-----
Results, deletions file:
HOME|NEWPORT STREET|1||NEW LISTING
-----
Results, insertions and changes file:
CAR|TOYOTA|5||NEW LISTING
CAR|HONDA|4||NEW LISTING
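This uses the unified format for the diff. It works for the sample files, but I have not tested it on far larger instances, so I don't know if it will hold up there. You can look at file f1 (the raw diff output) and file f2 (the same lines tagged as deleted, changed, or inserted) to see the intermediate data.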
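One rough way to gain confidence on bigger inputs is to cross-check the two result files against the inputs with a different tool. The sketch below is only an illustration: it rebuilds the sample files and results from the run above in a scratch directory (the names check_dir, expect.changed and expect.gone are made up for this example), and it assumes neither data file contains duplicate lines. On very large files the grep pattern list may need a lot of memory.

```shell
#!/usr/bin/env bash
# Sketch only: cross-check f.changed and f.deleted against the inputs.
# Assumption: no duplicate lines within either data file.
set -o nounset
LC_ALL=C ; export LC_ALL

check_dir=$(mktemp -d)      # hypothetical scratch directory
cd "$check_dir" || exit 1

# Sample inputs and results copied from the run shown above.
cat > data1 <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF
cat > data2 <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF
printf '%s\n' 'HOME|NEWPORT STREET|1||NEW LISTING' > f.deleted
printf '%s\n' 'CAR|TOYOTA|5||NEW LISTING' 'CAR|HONDA|4||NEW LISTING' > f.changed

# Lines of data2 with no identical line in data1: insertions plus changes.
grep -F -x -v -f data1 data2 | sort > expect.changed
# Lines of data1 with no identical line in data2: deletions plus the
# old versions of changed lines.
grep -F -x -v -f data2 data1 | sort > expect.gone

status=ok
# f.changed should hold exactly the expected lines (order ignored).
sort f.changed | cmp -s - expect.changed || status=bad
# Every line of f.deleted should be among the lines that left data1.
[ -z "$(sort f.deleted | comm -23 - expect.gone)" ] || status=bad
echo "cross-check: $status"
```

For real data, drop the here-documents and run the checks in the directory that holds the inputs and the f.deleted and f.changed files written by the script.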
If it does not work, then perhaps a sort and diff would be the best approach -- I just dislike making extra passes over files when I don't have to, especially if they are large. However, these days, 100 MB is not overwhelming.
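A minimal sketch of that sort-based alternative follows, using comm on the sorted files rather than diff. It rests on an assumption that is not stated in the original requirements: that the first two fields (for example HOME|KING STREET) identify a record, so a line whose key survives in the second file is a change rather than a deletion. The names sort_dir, only1, s1.sorted and s2.sorted are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: sort both files, then split the differences with comm.
# Assumption: fields 1 and 2 together form a unique record key.
set -o nounset
LC_ALL=C ; export LC_ALL

sort_dir=$(mktemp -d)       # hypothetical scratch directory
cd "$sort_dir" || exit 1

cat > data1 <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF
cat > data2 <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
EOF

sort data1 > s1.sorted
sort data2 > s2.sorted

# Lines only in data2: insertions plus the new versions of changed lines.
comm -13 s1.sorted s2.sorted > f.changed
# Lines only in data1: deletions plus the old versions of changed lines.
comm -23 s1.sorted s2.sorted > only1

# Keep a line from only1 as a deletion only if its key (fields 1-2)
# does not reappear among the inserted or changed lines.
awk -F'|' '
FILENAME == ARGV[1] { key[$1 FS $2] ; next }
!( ($1 FS $2) in key )
' f.changed only1 > f.deleted

echo "deletions:" ; cat f.deleted
echo "insertions and changes:" ; cat f.changed
```

On the sample data this yields the same two result files as the diff pipeline, except that the insertions and changes come out in sorted order (HONDA before TOYOTA) instead of file order.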
Best wishes ... cheers, drl