Complex file matching

panyam · April 27, 2011, 10:02am

Hello All,

I have two files on which I need to do "pattern"-based matching, writing the records to "matched" and "unmatched" output files respectively.

Here we go:

 
cat file1
ft , * , *, prem , odacc
ftpr , * ,* , prem , odacc
ft,aa,*,*,odacc
ft,*,*,*,*
*,*,*,odacc
 
cat file2 
abc,*,*,prem,odacc
* , bcd , * , prem , odacc

Now, I have to search file1 against file2 such that, if every field in "file1" is equal to the corresponding field in "file2", or one of the two fields is "*", the file1 record goes into "matched"; otherwise it goes into "unmatched".

Here, "*" is a universal wildcard: it always matches, irrespective of the corresponding field value in the other file.
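
The rule above can be sketched as a pair of small shell functions. This is only an illustration: the names trim, field_match and record_match are made up for this sketch, and it assumes that blanks around a field are insignificant and that records with different field counts never match.

```shell
#!/bin/sh
# Sketch only: trim, field_match and record_match are hypothetical helper names.

# Strip leading and trailing blanks from a single field.
trim() {
    printf '%s\n' "$1" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# Succeeds if two fields are equal or either one is the wildcard "*".
field_match() {
    [ "$1" = "$2" ] || [ "$1" = "*" ] || [ "$2" = "*" ]
}

# Succeeds if every field pair of two comma-separated records matches.
# Assumption: records with different field counts never match.
record_match() {
    a=$1 b=$2
    while :; do
        fa=$(trim "${a%%,*}")    # first remaining field of each record
        fb=$(trim "${b%%,*}")
        field_match "$fa" "$fb" || return 1
        case $a in *,*) a=${a#*,} more_a=1 ;; *) more_a=0 ;; esac
        case $b in *,*) b=${b#*,} more_b=1 ;; *) more_b=0 ;; esac
        [ "$more_a" = "$more_b" ] || return 1   # field counts differ
        [ "$more_a" = 1 ] || return 0           # no fields left
    done
}

record_match 'ft , * , *, prem , odacc' '* , bcd , * , prem , odacc' && echo matched
record_match 'ft,aa,*,*,odacc' '* , bcd , * , prem , odacc' || echo unmatched
```

Because the wildcard test is applied to both arguments, the comparison is symmetric: it does not matter which file a "*" comes from.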

The required outcome in file "matched":

 
ft , * , *, prem , odacc ## matches with * , bcd , * , prem , odacc from file2
ftpr , * ,* , prem , odacc ## matches with * , bcd , * , prem , odacc from file2
ft,*,*,*,* ## matches with * , bcd , * , prem , odacc from file2
*,*,*,odacc ## matches with * , bcd , * , prem , odacc from file2

In file "unmatched":

 
ft,aa,*,*,odacc

DGPickett · April 27, 2011, 3:45pm

Well, without the *, this would be a straight sort and comm, but with it, this is more of a Cartesian-product (NxM) problem. The usual JDBC/unixODBC SQL solutions do not work cleanly, since LIKE is unidirectional and this wildcard is bidirectional.

If one file is much shorter, it could be loaded into a two-dimensional string array, and the longer file could then be filtered against that array to decide which report each record is written to: for each incoming record, iterate through all the records and fields in the array and test a=* or b=* or a=b. Empty file cells would get , or is this a typo, since prem does not match odacc in column 4?
*,*,,odacc ## matches with * , bcd , * , prem , odacc from file2
Even with the wildcards, a small optimization is possible by sorting so that the * entries sort low, which lets you give up as soon as the input's first field > the array's first field. The spaces in one file might mess this up a bit.
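
A minimal sketch of the array approach, using awk from a shell function. The function name split_matches is made up for illustration, the output names matched and unmatched follow the thread, and the sketch assumes that blanks around fields should be ignored and that records with different field counts never match. It does not include the sort optimization, and it expects the shorter file to be non-empty.

```shell
#!/bin/sh
# Sketch only: split_matches is a hypothetical name.
# usage: split_matches SHORT_FILE LONG_FILE
# Writes each record of LONG_FILE to ./matched or ./unmatched.
split_matches() {
    : > matched
    : > unmatched
    awk -F',' '
        # Remove blanks around a field so "ft " and "ft" compare equal.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        # First file: store every field in a two-dimensional array.
        NR == FNR {
            n++
            nf[n] = NF
            for (i = 1; i <= NF; i++) pat[n, i] = trim($i)
            next
        }

        # Second file: test the record against each stored record.
        {
            hit = 0
            for (r = 1; r <= n && !hit; r++) {
                if (nf[r] != NF) continue      # assumption: field counts must agree
                ok = 1
                for (i = 1; i <= NF && ok; i++) {
                    a = trim($i); b = pat[r, i]
                    ok = (a == b || a == "*" || b == "*")
                }
                hit = ok
            }
            print > (hit ? "matched" : "unmatched")
        }
    ' "$1" "$2"
}
```

With the files from the first post this would be called as split_matches file2 file1, since file2 is the shorter one and the file1 records are the ones being sorted into the two reports.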

panyam · April 28, 2011, 7:46am

Hello DGPickett,

Sorry, it's a typo.

I tried the script below and it seems to be working OK so far (it might produce duplicates, however, which I still need to get rid of).

 
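#!/usr/bin/ksh
# Start from empty output files.
rm -f matched unmatched
# Outer loop: one record of file1 at a time (four comma-separated fields).
while IFS=, read -r f1 f2 f3 f4
do
    c=0    # becomes 1 once a record in file2 matches
    # Inner loop: compare against every record of file2.
    while IFS=, read -r e1 e2 e3 e4
    do
        # A field pair matches if the values are equal or either one is "*".
        if [[ "$e1" = "$f1" || "$f1" = "*" || "$e1" = "*" ]] &&
           [[ "$e2" = "$f2" || "$f2" = "*" || "$e2" = "*" ]] &&
           [[ "$e3" = "$f3" || "$f3" = "*" || "$e3" = "*" ]] &&
           [[ "$e4" = "$f4" || "$f4" = "*" || "$e4" = "*" ]]
        then
            c=1
            print -r -- "$f1,$f2,$f3,$f4" >> matched
            break    # stop at the first match so the record is written once
        fi
    done < file2
    if [ "$c" -eq 0 ]; then
        print -r -- "$f1,$f2,$f3,$f4" >> unmatched
    fi
done < file1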

As the number of fields is fixed in my case and there is no possibility of extra spaces, the solution seems to be OK. I have yet to check its performance.
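
For the duplicates, one possible approach is to filter each output file afterwards, keeping only the first occurrence of every line. This is just a sketch: dedupe_file is a made-up helper name, and it assumes mktemp is available on the system.

```shell
#!/bin/sh
# Sketch only: dedupe_file is a hypothetical helper name.
# Rewrites FILE in place, keeping the first occurrence of each line
# and preserving the original order.
dedupe_file() {
    tmp=$(mktemp) || return 1
    # seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
    # is true only for first occurrences.
    awk '!seen[$0]++' "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}
```

It could be run once at the end of the script, for example as dedupe_file matched followed by dedupe_file unmatched.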

Thanks for looking into this.

Regards
Ravi