use python or awk to match names 'with error tolerance'

grossgermany · January 19, 2009, 10:28pm

I am facing what I think is a very challenging problem, and I have no idea how to deal with it.
Suppose I have two CSV files:

A.csv
Toyota Camry,1998,blue
Honda Civic,1999,blue

B.csv
Toyota Inc. Camry, 2000km
Honda Corp Civic,1500km

I want to generate C.csv
Toyota Camry,1998,blue ,2000km
Honda Civic,1999,blue,1500km

The worst part of the task is that the matching needs error tolerance to deal with variations in the company name:
1. extra spaces
2. extra dots
3. words such as Inc. or Corp.

Is this mission impossible?

summer_cherry · January 19, 2009, 10:39pm

#!/usr/bin/perl
open FH,"<a.csv";
while(<FH>){
	chomp;
	my @tmp=split(",",$_);
	$hash{$tmp[0]}=$_;
}
close FH;
open FH,"<b.csv";
while(<FH>){
	chomp;
	my @tmp=split(",",$_,2);
	$tmp[0]=~s/(Inc|Corp)\.* //;
	$hash{$tmp[0]}.=",".$tmp[1];
}
for $key (keys %hash){
	print $hash{$key},"\n";
}

angheloko · January 19, 2009, 10:56pm

Lemme give it a try:

cat a.csv | while read x; do
echo -n "$x,";grep `echo ^$x | awk '{print $1}'` b.csv | awk -F, '{print $NF}' | sed 's/^ *//g;s/ *$//g'
done

grossgermany · February 26, 2009, 5:51pm

I don't know Perl. Would you please do it in Python or SAS?

rikxik · February 27, 2009, 2:55am

import re

f1, f2 = 'A.csv', 'B.csv'
sep = ','
# Words to drop from the names in B.csv before matching
excl = {'Inc', 'Corp'}

# Map each name in A.csv to the rest of its line
ah = {}
with open(f1) as a:
    for line in a:
        name, rest = line.strip().split(sep, 1)
        ah[name] = rest

with open(f2) as b:
    for line in b:
        l = line.strip().split(sep, 1)
        # Remove dots and commas, drop excluded words, collapse extra spaces
        n = re.sub(r"[.,]", "", l[0])
        s = " ".join(w for w in n.split() if w not in excl)
        if s in ah:
            print(sep.join([s, ah[s], l[1]]))
        else:
            print("Could not match", s, "with", f1)

Output:

C:\Projects\Python>type A.csv
Toyota Camry,1998,blue
Honda Civic,1999,blue

C:\Projects\Python>type B.csv
Toyota  Inc. Camry, 2000km
Honda Corp.     Civic,1500km

C:\Projects\Python>match.py
Toyota Camry,1998,blue, 2000km
Honda Civic,1999,blue,1500km
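
If the variations go beyond a fixed list of words to drop, approximate string matching is another option. The following is only a minimal sketch using `difflib.get_close_matches` from the Python standard library; the helper name `closest_name` and the cutoff of 0.6 are assumptions for illustration, not something tested on real data, and the cutoff would need tuning.

```python
import difflib

def closest_name(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if none reaches cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Names taken from the first column of A.csv
a_names = ["Toyota Camry", "Honda Civic"]

print(closest_name("Toyota Inc. Camry", a_names))
print(closest_name("Zzz", a_names))
```

A lower cutoff tolerates more noise but raises the risk of pairing the wrong rows, so unmatched and borderline names are worth reviewing by hand.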

summer_cherry · February 27, 2009, 3:35am

nawk 'BEGIN{FS=","}
{
if(NR==FNR)
  _[$1]=$0
else
{
  sub(/(Inc\.?|Corp\.?) /,"",$1)
  _[$1]=sprintf("%s,%s",_[$1],$2)
}
}
END{
  for(i in _)
  print _[i]
}' a b

grossgermany · March 1, 2009, 7:42pm

Thanks a lot for the reply, but is it possible to create manual translation tables?

Suppose the files are now:
A.csv
Toyota Camry,1998,blue
Honda Civic,1999,blue
Acura Inf,2000,yellow

B.csv
Toyota Inc. Camry, 2000km
Honda Corp Civic,1500km
HondaUSA Inf, 2000, 2300km

I want to generate C.csv
Toyota Camry,1998,blue ,2000km
Honda Civic,1999,blue,1500km
HondaUSA Inf,2000,yellow,2300km

How can I set up a translation table that would say, for example, that Acura translates to HondaUSA?

rikxik · March 1, 2009, 11:04pm

In the same way that the "excl" hash was used in the Python script to remove common words such as "Corp." and "Inc", you can create a "trans" hash, e.g.:

trans = {"Acura": "HondaUSA", "somethingElse": "something"}

Then use this to translate. I think it would do you good to try this yourself if you are really interested in solving problems (current and future ones) with Python.
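
For reference, one way the "trans" hash could be wired in is sketched below. This is a sketch under assumptions, not a definitive solution: the helper names `normalize` and `merge` are made up for illustration, the table is applied to the names from A.csv so that they match the spelling in B.csv, and only the last field of each B.csv line is kept (which is what the desired C.csv suggests).

```python
import re

EXCL = {"Inc", "Corp"}          # words dropped from names, as in "excl"
TRANS = {"Acura": "HondaUSA"}   # assumed direction: name in A.csv -> name in B.csv

def normalize(name, trans=None):
    """Strip dots and commas, drop excluded words, optionally translate words."""
    tokens = re.sub(r"[.,]", "", name).split()
    tokens = [t for t in tokens if t not in EXCL]
    if trans:
        tokens = [trans.get(t, t) for t in tokens]
    return " ".join(tokens)

def merge(a_lines, b_lines, sep=","):
    """Join A and B rows on the normalized name; unmatched B rows are skipped."""
    a = {}
    for line in a_lines:
        name, rest = line.strip().split(sep, 1)
        a[normalize(name, TRANS)] = rest
    merged = []
    for line in b_lines:
        name, rest = line.strip().split(sep, 1)
        key = normalize(name)
        if key in a:
            last = rest.split(sep)[-1].strip()  # keep only the final field of B
            merged.append(sep.join([key, a[key], last]))
    return merged

a_lines = ["Toyota Camry,1998,blue", "Honda Civic,1999,blue", "Acura Inf,2000,yellow"]
b_lines = ["Toyota Inc. Camry, 2000km", "Honda Corp Civic,1500km", "HondaUSA Inf, 2000, 2300km"]
for row in merge(a_lines, b_lines):
    print(row)
```

With the sample data from the question this prints the three merged rows, the last one being `HondaUSA Inf,2000,yellow,2300km`.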