Help required with a length-based lookup

rramkrishnas · December 18, 2014, 6:32am

Hi,

I have two files: abc.txt has approximately 28k records and bcd.txt has approximately 112k records. The record lengths vary in both files.

I am trying to look up the abc.txt records in bcd.txt based on length. I can successfully find the records where abc.txt matches bcd.txt, but I am unable to fetch the records that do not match.

abc.txt

bcd.txt

I want the matches and mismatches for each file, as below:

abc.txt matches with bcd.txt
120
1201
1203
122
1224
123
12345
abc.txt not matches with bcd.txt
191
1890
1245
 
bcd.txt not matches with abc.txt
199

Below is the script I tried for matching the records. It takes almost 5 hours, and I am still unable to find the mismatched records for both files.

awk -F"," 'BEGIN{OFS=","}
{
if(NR==FNR){
a[FNR]=$0;max=FNR;next}
if(NR!=FNR)
 {
if (FNR==1) print $0;
 for ( i=1;i<=max;i++)
 {  
 tmp = a[i];
 len = length(tmp);
 if(substr($1,1,len) ==tmp)
 {print $0;}
 } #End For
 } #End if
}' abc.txt bcd.txt > abc_matches_bcd.txt;

Please help me with this; it will save a lot of manual work on my end.

Regards,
Ram

RavinderSingh13 · December 18, 2014, 6:50am

Hello rramkrishnas,

The following may help.

awk 'BEGIN{print "abc.txt matches with bcd.txt"} FNR==NR{X[$1]=$1;next} {Y[$1]=$1} {for(i in X){for(i in Y){if(X[i]){print X[i];delete X[i];delete Y[i]}}}} END{print "abc.txt not matches with bcd.txt"; for(u in X){if(X[u]){print X[u]}};print "bcd.txt not matches with abc.txt"; for(v in Y){print Y[v]}}' abc.txt bcd.txt

Output will be as follows.

abc.txt matches with bcd.txt
120
122
123
abc.txt not matches with bcd.txt
1890
1234
123456
1245
191
bcd.txt not matches with abc.txt
1201
1203
12345
121
1224
199

EDIT: Adding a multi-line form of the same.

awk 'BEGIN{print "abc.txt matches with bcd.txt"}
     FNR==NR{X[$1]=$1;next}
     {Y[$1]=$1}
     {for(i in X)
        {for(i in Y)
                {if(X[i])
                        {print X[i];
                         delete X[i];
                         delete Y[i]
                        }
                }
        }
     }
     END{
        print "abc.txt not matches with bcd.txt";
        for(u in X){
                        if(X[u])
                                {print X[u]}
                   };
        print "bcd.txt not matches with abc.txt";
        for(v in Y){
                        print Y[v]
                   }
        }' abc.txt bcd.txt

Thanks,
R. Singh

RudiC · December 18, 2014, 6:51am

With some wild guessing I presume that you want to match entries based on the smallest common substring. But some questions remain:
Will abc always have the smallest substring or could that be in bcd as well?
Will the smallest substring always precede the longer ones?
Where is the 121 entry from bcd in the outputs? Where 123456 from abc?

pravin27 · December 18, 2014, 7:00am

awk 'NR==FNR{a[$1]++;next}
{if(a[$1]){m[$1]++}else{notM[$1]++}delete a[$1]} 
END {print "matching"; for (i in m) {print i}
print "abc not match with bcd"; for ( j in a) {print j;}
print "bcd not match with abc"; for (k in notM) { print k}}' abc.txt bcd.txt

rramkrishnas · December 19, 2014, 1:07am

Dear RudiC,

With some wild guessing I presume that you want to match entries based on the smallest common substring. But some questions remain:

Below are my answers to your questions.
Will abc always have the smallest substring or could that be in bcd as well?

abc.txt records will have the smallest substring, and the same string will also appear in bcd.txt.

Will the smallest substring always precede the longer ones?
Yes, and it will be present in bcd.txt.

Where is the 121 entry from bcd in the outputs? Where 123456 from abc?
121 is an extra entry in bcd.txt. 123456 is present in bcd.txt; however, since the record 123 is present in abc.txt, 123456 should be a match case.

---------- Post updated 12-19-14 at 11:37 AM ---------- Previous update was 12-18-14 at 05:40 PM ----------

Can anyone please help me with this?

RudiC · December 19, 2014, 5:40am

Your requirements still are far from clear. Why do 1201, 1203 and 1224 from bcd show up in the "match" result? Where is 1234 from abc?

rramkrishnas · December 19, 2014, 5:47am

Dear RudiC,

If you look at my abc.txt, there is a record with the value 120, whereas bcd.txt has the values 1201 and 1203. Since 120 from abc.txt matches the first 3 digits of 1201 and 1203 in bcd.txt, those count as matches; likewise for 1224.

Hope I am clear in my requirement now.

Regards,
Ram
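
The rule described above amounts to a leading-substring test: a bcd.txt record matches when some abc.txt record equals its first few characters. A minimal sketch of that single test, assuming plain string comparison is what is wanted (is_prefix is a hypothetical helper name, not something from the thread):

```shell
# is_prefix is a hypothetical helper name, not from the thread.
# Exit status 0 when $1 is a leading substring of $2, 1 otherwise.
is_prefix() {
    # index() returns the 1-based position of the first occurrence, or 0;
    # position 1 means the long value starts with the short one.
    awk -v short="$1" -v long="$2" 'BEGIN { exit !(index(long, short) == 1) }'
}

is_prefix 120 1203 && echo "120 matches 1203"
is_prefix 120 1120 || echo "120 does not match 1120"
```

Using index() rather than a regex such as "^"short keeps the comparison literal, which would matter if the keys ever contained characters that are special in a regular expression.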

RudiC · December 19, 2014, 6:24am

This is as far as I can get. The output covers most of what you require. Please make sure your next request is way more precise. Try

awk     'FNR==NR        {A[$1]=length($1);next}
                        {B[$1]=length($1)}
         END            {print "match"
                         for (b in B) {
                           for (a in A) if (b~"^"a) {print b; delete B[b]; DEL[a]++;break}
                          }
                         for (d in DEL) delete A[d]
                         print "a"
                         for (a in A) print a
                         print "b"
                         for (b in B) print b
                        }
        ' abc bcd
match
1224
120
122
123
12345
1201
1203
a
1245
123456
191
1890
b
121
199
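
Since run time was part of the original complaint, here is a sketch of a variant that avoids comparing every bcd record against every abc record. It assumes the key is the first comma-separated field (as in the original script's -F",") and that abc holds the shorter values; prefix_lookup and the section labels "abc only" and "bcd only" are made up for illustration. Instead of looping over all abc keys, it cuts each bcd key to every length that occurs in abc and checks the result with a hash lookup.

```shell
# prefix_lookup is a hypothetical wrapper name.
# $1 = file with the short keys (abc), $2 = file with the long keys (bcd).
prefix_lookup() {
    awk -F, '
    FNR == NR {                       # first file: remember every key
        n = length($1)
        if (n == 0) next              # skip empty lines
        A[$1] = 1
        if (min == 0 || n < min) min = n
        if (n > max) max = n
        next
    }
    length($1) > 0 {                  # second file: test each leading substring
        hit = 0
        for (n = min; n <= max && n <= length($1); n++) {
            p = substr($1, 1, n)
            if (p in A) { used[p] = 1; hit = 1 }
        }
        if (hit) matched[++m] = $0; else unmatched[++u] = $0
    }
    END {
        print "match"
        for (i = 1; i <= m; i++) print matched[i]
        print "abc only"              # order of this section is unspecified
        for (k in A) if (!(k in used)) print k
        print "bcd only"
        for (i = 1; i <= u; i++) print unmatched[i]
    }' "$1" "$2"
}
```

It would be called as prefix_lookup abc.txt bcd.txt. Each bcd record then costs at most (max - min + 1) array lookups, where min and max are the shortest and longest key lengths in abc, rather than one comparison per abc record. Matched and unmatched bcd records keep their input order; an empty first file is not handled.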

ongoto · December 19, 2014, 9:30am

Just for comparison...
This produces the same output as the awk solutions except 123456 is gone. Can't get rid of 121 though.

while read x
do
    grep "${x:0:3}" bcd >> match
    if ! grep -q "${x:0:3}" bcd; then echo "$x" >> nomatch; fi
done < abc
sort -u match
echo
sort -u nomatch
echo

while read z
do
    if ! grep -q "${z:0:3}" abc; then echo "$z" >> noabc; fi
done < bcd
sort -u noabc

# ------------
# 120
# 1201
# 1203
# 122
# 1224
# 123
# 12345
# 
# 1245
# 1890
# 191
# 
# 121
# 199