Duplicate identification using partial matches

ahmedwaseem2000 · December 11, 2014, 10:46pm

Hi,

I have a column of names. I want to find names that match each other either completely or partially and capture the matches in a separate column, like below.

Input		
Abc		
dbc		
abc xyz		
def		
bcd		
abc ggg	
xxx abc xxx

Output| Duplicate	
Abc|abc xyz
     |abc ggg
     |xxx abc xxx
dbc|	
def|	
bcd|

Thanks for your help!!

junior-helper · December 12, 2014, 10:27am

What have you tried? I'm asking mainly so that I can understand the logic, i.e. how the matching is supposed to work.

Anyways, after some wild guessing I can offer a potential solution.

The awk code flow is as follows:

Read file (first run)
Read the file line by line. If a line contains only a single word, store it in array A; otherwise ignore that line. Each single word is stored as a pair: the key is the word in all-lowercase characters, the value is the word in its original format.

Read file (second run)
Read the file line by line again, this time ignoring lines with only a single word. For each line with more than one word, check whether the lowercased line matches any of the lowercase words stored as keys in array A. If so, store the pair "word in original format" | "whole line that word appears in" as a key in a second array, B.

END section: print every entry of array B, split each pair at the vertical bar, and store the word part in array D.
Then look up every word from array A in array D; if a word is not found there (it had no partial matches), print that word followed by a vertical bar.

The gsub calls simply delete any trailing horizontal tabs.
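
One thing to be aware of: the test in the second run is a plain regex match on the lowercased line, so a stored word also matches when it is only part of a longer word. A minimal sketch of that behaviour follows; match_demo is a hypothetical helper and the sample strings are made up for illustration, they are not part of the solution itself.

```shell
# match_demo is a hypothetical helper (not part of the solution below).
# It prints "match" when the lowercased line contains the lowercased word
# anywhere, even inside a longer word; otherwise it prints "no match".
match_demo() {
    awk -v word="$1" -v line="$2" 'BEGIN {
        if (tolower(line) ~ tolower(word)) print "match"
        else                               print "no match"
    }'
}

match_demo "Abc" "xxx ABC xxx"   # word appears on its own
match_demo "Abc" "xabcx yyy"     # word is only part of a longer word
match_demo "Abc" "def ghi"       # word does not appear at all
```

If only whole-word matches are wanted, the test would have to compare the individual fields instead of the whole line.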

awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort

Demo:

$ awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
        for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort
Abc|abc ggg
Abc|abc xyz
Abc|xxx abc xxx
bcd|
dbc|
def|
$
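
The output above repeats the name on every line, whereas the layout asked for in the first post shows the name only once per group. A minimal sketch of a post-processing step, assuming the sorted "name|duplicate" lines are piped in; collapse_keys is a hypothetical helper name, and it leaves the repeated name empty rather than padding it with spaces.

```shell
# collapse_keys is a hypothetical helper: it reads sorted "name|duplicate"
# lines and empties the name column when it repeats the previous line's name.
collapse_keys() {
    awk -F'|' -v OFS='|' '{
        name = $1                              # remember the original name
        if (NR > 1 && name == prev) $1 = ""    # same group: blank the name
        prev = name
        print
    }'
}

# Usage with the demo output from above as input:
printf 'Abc|abc ggg\nAbc|abc xyz\nAbc|xxx abc xxx\nbcd|\ndbc|\ndef|\n' | collapse_keys
```

This relies on the input being sorted so that all lines of one group are adjacent.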