Hi,
My input looks like that:
A|123|qwer
A|456|tyui
A|456|wsxe
B|789|dfgh
Using awk, I am trying to get:
A|123;456|qwer;tyui;wsxe
B|789|dfgh
For records with same $1, group all the $2 in a field (without replicates), and all the $3 in a field (without replicates).
What I have tried:
echo -e "A|123|qwer\nA|456|tyui\nA|456|wsxe\nB|789|dfgh" | gawk 'BEGIN{FS=OFS="|"}{a[$1]=sprintf("%s%s", a[$1], a[$1] ~ /$2/ ? "":";"$2); b[$1]=sprintf("%s%s", b[$1], b[$1] ~ /$3/ ? "":";"$3)}END{for(i in a){print i FS a FS b}}'
(Wrong) output:
A|;123;456;456|;qwer;tyui;wsxe
B|;789|;dfgh
However, I still cannot manage to remove the duplicated strings inside fields $2 and $3.
RudiC
July 6, 2018, 8:22am
2
Do yourself a favour and start indenting / structuring your code for readability and understandability. Try
awk -F\| '
{if (!(a[$1] ~ $2)) a[$1] = a[$1] DL[$1] $2
if (!(b[$1] ~ $3)) b[$1] = b[$1] DL[$1] $3
DL[$1] = ";"
}
END {for (i in a) {print i, a, b
}
}
' OFS="|" file
A|123;456|qwer;tyui;wsxe
B|789|dfgh
1 Like
The following variant does a precise lookup (to supress duplicates),
and does not need an array of delimiters:
awk '
BEGIN {
FS=OFS="|"
dl=";"
}
function strjoin(i, j){
if (i=="") return j # first element
if (index((dl i dl), (dl j dl))) return i # duplicate
return (i dl j) # join element
}
{
s2[$1]=strjoin (s2[$1], $2)
s3[$1]=strjoin (s3[$1], $3)
}
END {
for (i in s2) print i, s2, s3
}
' file
This is a good demonstration of a function
1 Like
I don't understand this statement. Both solutions seem to work just fine.
Is one more prone to errors than the other?
The regular expression search ~
is different from the string search via index
.
You'll see differences e.g. with the following input files
A|123|qwer
A|456|tyui
A|45|wsxe
B|789|dfgh
A|123|qwer
A|455|tyui
A|45*|wsxe
B|789|dfgh
1 Like
Very good point !
I got it now, thanks !
RudiC
July 6, 2018, 5:00pm
7
You can "sharpen" or "narrow down" the regex to avoid false positive matches like
awk -F\| '
{if (!(a[$1] ~ "(^|;)" $2 "(;|$)")) a[$1] = a[$1] DL[$1] $2
if (!(b[$1] ~ "(^|;)" $3 "(;|$)")) b[$1] = b[$1] DL[$1] $3
DL[$1] = ";"
}
END {for (i in a) {print i, a, b
}
}
' OFS="|" file