Deleting duplicate records from file 1 if records from file 2 match

vestport · May 5, 2012, 1:43am

I have 2 files

"File 1" is delimited by ";" and "File 2" is delimited by "|".

File 1 below (3 records shown):

Doc1;03/01/2012;New York;6 Main Street;Mr. Smith 1;Mr. Jones
Doc2;03/01/2012;Syracuse;876 Broadway;John Davis;Barbara Lull
Doc3;03/01/2012;Buffalo;779 Old Windy Road;Charles O'Brien;Bill Rudd

File 2 below (4 records shown):

6 Main Street|New York
345 Tipp Road|Brewser
885 Peartree|Buffalo
779 Old Windy Road|Buffalo

"File 1" is fairly small, "File 2" is huge.

My problem: line by line, I need to compare each record in "File 1", the 3rd field (city) and 4th field (address), against the matching field data in "File 2", the 1st field (address) and 2nd field (city), to make sure that there are no record matches.

All records that do not match should be copied out or > redirected to a new file (the edited file). If there is a match then that record should not be copied out to the edited file.

In other words given the example data above from "File 1" and "File 2" the "new edited file" should look like this:

Doc2;03/01/2012;Syracuse;876 Broadway;John Davis;Barbara Lull

The other 2 records below would be discarded, as they matched "File 2":

Doc1;03/01/2012;New York;6 Main Street;Mr. Smith 1;Mr. Jones
Doc3;03/01/2012;Buffalo;779 Old Windy Road;Charles O'Brien;Bill Rudd

I hope that is not too confusing. I know this can probably be done with awk but I am as rusty as the Titanic with coding and lucky I got as far as I did with this project. Many thanks to "agama" for helping out on the last issue!

Thanks in advance for any replies!

Art

Peasant · May 5, 2012, 3:04am

First, convert the separator in File 1 to a pipe; use tr or sed, it's fairly simple.

Then try this code :

awk -F"|" 'NR==FNR { s=$1FS$2; a[s] = $0; next } ! a[$4FS$3] { print > "nonmatch.txt" }' file2 file1
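
If you would rather skip the conversion step, here is a minimal sketch of the same idea that reads each file with its own separator, so the kept records stay in the original ";" format. It assumes the sample data from the first post; the names file1.txt, file2.txt and edited.txt are made up for the example. Like the one-liner above, it still holds the File 2 keys in memory.

```shell
# Recreate the sample data from the first post (file names are assumptions).
cat > file1.txt <<'EOF'
Doc1;03/01/2012;New York;6 Main Street;Mr. Smith 1;Mr. Jones
Doc2;03/01/2012;Syracuse;876 Broadway;John Davis;Barbara Lull
Doc3;03/01/2012;Buffalo;779 Old Windy Road;Charles O'Brien;Bill Rudd
EOF
cat > file2.txt <<'EOF'
6 Main Street|New York
345 Tipp Road|Brewser
885 Peartree|Buffalo
779 Old Windy Road|Buffalo
EOF

# A var=value operand placed between file names is applied when awk reaches it,
# so FS is "|" while reading file2.txt and ";" while reading file1.txt.
awk '
    NR == FNR { seen[$1 "|" $2]; next }   # file2.txt: remember each address|city key
    !(($4 "|" $3) in seen)                # file1.txt: print only records with no match
' FS='|' file2.txt FS=';' file1.txt > edited.txt

cat edited.txt
```

With the sample data, only the Doc2 record should end up in edited.txt.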

Hope that helps

Regards
Peasant.

vestport · May 5, 2012, 9:20am

Peasant, thanks so much for that! It worked perfectly!

What I did, as you suggested, was first convert the ";" delimiters in the one file to "|" to get a common delimiter, since your code uses the -F"|" option:

cat FileThatNeedsConverting | sed 's/;/|/g' > ConvertedFile
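
Peasant also mentioned tr, so here is a small sketch of that variant for anyone who prefers it. The file names demo_semicolon.txt and demo_pipe.txt are made up for the example. Note that, like the sed command, it changes every ";" on the line, so it assumes ";" never appears inside the field data itself.

```shell
# One sample record from the first post, written to a hypothetical demo file.
printf '%s\n' 'Doc1;03/01/2012;New York;6 Main Street;Mr. Smith 1;Mr. Jones' > demo_semicolon.txt

# tr translates single characters, which is all that is needed here.
tr ';' '|' < demo_semicolon.txt > demo_pipe.txt

cat demo_pipe.txt
```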

I was going to ask how to see discarded data but a simple "diff" between the 2 files (original and nonmatch.txt) accomplishes that.

diff OriginalFile nonmatch.txt
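
Another way to see the discarded records, as a rough sketch rather than anything definitive: let awk write the matches to their own file in the same pass. This assumes the pipe-converted sample data from the first post; all the demo_* file names are made up for the example.

```shell
# Pipe-converted sample data from the first post (file names are assumptions).
cat > demo_converted.txt <<'EOF'
Doc1|03/01/2012|New York|6 Main Street|Mr. Smith 1|Mr. Jones
Doc2|03/01/2012|Syracuse|876 Broadway|John Davis|Barbara Lull
Doc3|03/01/2012|Buffalo|779 Old Windy Road|Charles O'Brien|Bill Rudd
EOF
cat > demo_file2.txt <<'EOF'
6 Main Street|New York
345 Tipp Road|Brewser
885 Peartree|Buffalo
779 Old Windy Road|Buffalo
EOF

awk -F'|' '
    NR == FNR { seen[$1 FS $2]; next }                 # first file: remember address|city
    ($4 FS $3) in seen { print > "demo_match.txt"; next }   # discarded records
    { print > "demo_nonmatch.txt" }                    # kept records
' demo_file2.txt demo_converted.txt

# The two output counts should add up to the input count.
wc -l demo_converted.txt demo_match.txt demo_nonmatch.txt
```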

Also by doing a:

wc -l OriginalFile

and a:

wc -l nonmatch.txt

you can see that records were shaved off. I just wanted to add that in case it helps someone else verify the results of a similar project.

Many thanks man!

Art