Hi,
I have a column with names. I would like to find names that match each other either completely or partially and capture the matches in a separate column, as shown below.
Input
Abc
dbc
abc xyz
def
bcd
abc ggg
xxx abc xxx
Output|Duplicate
Abc|abc xyz
|abc ggg
|xxx abc xxx
dbc|
def|
bcd|
Thanks for your help!!
What have you tried? I'm asking mainly so that I can understand the logic, i.e. how the matching is supposed to work.
Anyway, after some guessing, I can offer a possible solution.
The awk code flow is as follows:
- Read file (first run)
Read the file line by line. If a line contains only a single word, store it in array A; otherwise ignore the line. Each entry maps the word in all lowercase characters (the index) to the word in its original spelling (the value).
- Read file (second run)
Read the file line by line, this time ignoring lines that contain only a single word. For each line with more than one word, check whether it contains any of the words stored in array A (compared in lowercase). For every hit, add an entry to a second array (B) whose index is the string "word in original spelling|whole line the word appears in".
- END section
Print every index of array B, split each one at the vertical bar and record the single word in array D. Then look up every word from array A in array D; if a word is not found there, it had no match, so print that word followed by a vertical bar.
The gsub calls simply delete any trailing horizontal tabs.
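To check what the first run would store in array A before running the whole thing, a minimal sketch like the following can help. The function name list_keys is only an illustration (it is not part of the solution), and it prints the "lowercase -> original" pairs instead of storing them.

```shell
# Sketch of the first run only: show each single-word line as
# "lowercase form -> original form", i.e. the pairs kept in array A.
# list_keys is a hypothetical helper name chosen for this example.
list_keys() {
  awk '
    { gsub(/\t+$/, "") }                      # drop trailing tabs, as in the solution
    NF == 1 { print tolower($0) " -> " $0 }   # single-word lines only
  ' "$1"
}
```

Call it as `list_keys file`; lines with more than one word produce no output.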
awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort
Demo:
$ awk 'NR==FNR{ gsub(/\t+$/,"",$0); if (NF==1) A[tolower($0)]=$0; next }
{ gsub(/\t+$/,"",$0); if (NF==1) next; for (i in A) if (tolower($0) ~ i) B[A[i]"|"$0]++
} END { for (b in B) {split(b,C,"|"); print b; D[C[1]]++}
for (a in A) if (!(A[a] in D)) print A[a]"|" }
' file file | sort
Abc|abc ggg
Abc|abc xyz
Abc|xxx abc xxx
bcd|
dbc|
def|
$
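The output above repeats the name on every matching line and is sorted, whereas the layout requested in the question lists each name once, in input order, and leaves the first column empty for further matches. A minimal sketch of such a variant follows. The function name match_names, the numbered arrays key and dup, and the use of index() instead of a regex match are my own assumptions for illustration, not part of the solution above.

```shell
# Sketch: same two-pass idea, but keep input order and print the name only once.
# match_names, key and dup are hypothetical names chosen for this example.
match_names() {
  awk '
    # First run: remember single-word lines in the order they appear.
    NR == FNR {
      sub(/[ \t]+$/, "")            # drop trailing blanks and tabs
      if (NF == 1) key[++n] = $0
      next
    }
    # Second run: attach every multi-word line to each name it contains.
    {
      sub(/[ \t]+$/, "")
      if (NF < 2) next
      for (k = 1; k <= n; k++) {
        # index() does a plain substring test, so regex metacharacters
        # in a name are taken literally; comparison ignores case.
        if (index(tolower($0), tolower(key[k]))) {
          if (dup[k] == "") dup[k] = $0
          else dup[k] = dup[k] "\n|" $0   # later matches start with an empty first column
        }
      }
    }
    END {
      for (k = 1; k <= n; k++) print key[k] "|" dup[k]
    }
  ' "$1" "$1"
}
```

Call it as `match_names file`. Because the names are kept in a numbered array, no sort is needed and the original order survives. Like the solution above, this is a substring match, so a name such as abc would also be found inside a longer word.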