Awk: group multiple fields from different records

beca123456 · July 6, 2018, 6:51am

Hi,

My input looks like this:

A|123|qwer
A|456|tyui
A|456|wsxe
B|789|dfgh

Using awk, I am trying to get:

A|123;456|qwer;tyui;wsxe
B|789|dfgh

For records with the same $1, group all the $2 values into one field (without duplicates), and all the $3 values into one field (without duplicates).

What I have tried:

echo -e "A|123|qwer\nA|456|tyui\nA|456|wsxe\nB|789|dfgh" | gawk 'BEGIN{FS=OFS="|"}{a[$1]=sprintf("%s%s", a[$1], a[$1] ~ /$2/ ? "":";"$2); b[$1]=sprintf("%s%s", b[$1], b[$1] ~ /$3/ ? "":";"$3)}END{for(i in a){print i FS a[i] FS b[i]}}'

(Wrong) output:

A|;123;456;456|;qwer;tyui;wsxe
B|;789|;dfgh

However, I still cannot manage to remove the duplicated strings inside fields $2 and $3.

RudiC · July 6, 2018, 8:22am

Do yourself a favour and start indenting / structuring your code for readability and understandability. Try

awk -F\| '
        {if (!(a[$1] ~ $2)) a[$1] = a[$1] DL[$1] $2
         if (!(b[$1] ~ $3)) b[$1] = b[$1] DL[$1] $3
         DL[$1] = ";"
        }
END     {for (i in a)   {print i, a[i], b[i]
                        }
        }
' OFS="|"  file
A|123;456|qwer;tyui;wsxe
B|789|dfgh
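
One reason the original attempt never detected a duplicate is that /$2/ is a regex literal: inside the slashes, $ is the end-of-string anchor, not a field reference. A minimal sketch of the difference follows; the function name match_kinds and the sample list in s are made up for illustration.

```shell
# s is a hypothetical accumulated list; the input line supplies $2 = 456.
match_kinds() {
  echo 'A|456|tyui' | awk -F'|' '{
    s = "123;456"
    lit = (s ~ /$2/) ? "match" : "no match"   # regex literal: "$" anchor, then the character 2
    dyn = (s ~ $2)   ? "match" : "no match"   # dynamic regex built from the value of field 2
    print "literal: " lit
    print "dynamic: " dyn
  }'
}
match_kinds
```

This should print "literal: no match" followed by "dynamic: match".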

MadeInGermany · July 6, 2018, 11:50am

The following variant does a precise lookup (to suppress duplicates),
and does not need an array of delimiters:

awk '
BEGIN {
  FS=OFS="|"
  dl=";"
}
function strjoin(i, j){
  if (i=="") return j  # first element
  if (index((dl i dl), (dl j dl))) return i # duplicate
  return (i dl j) # join element
} 
{
  s2[$1]=strjoin (s2[$1], $2)
  s3[$1]=strjoin (s3[$1], $3)
}
END {
  for (i in s2) print i, s2[i], s3[i]
}
' file

This is a good demonstration of a function.

beca123456 · July 6, 2018, 2:13pm

I don't understand this statement. Both solutions seem to work just fine.
Is one more prone to errors than the other?

MadeInGermany · July 6, 2018, 2:44pm

The regular expression search ~ is different from the string search via index().
You'll see differences e.g. with the following input files:

A|123|qwer
A|456|tyui
A|45|wsxe
B|789|dfgh

A|123|qwer
A|455|tyui
A|45*|wsxe
B|789|dfgh
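
To make the difference concrete, here is a small sketch that runs both tests on the critical values from the two files above. The helper name check and the variables list and new are illustrative only; "dup" means the test would treat the new value as already present.

```shell
# Compare a regex test with a delimiter-wrapped index() test.
# check, list and new are illustrative names, not from the solutions above.
check() {
  awk -v list="$1" -v new="$2" 'BEGIN {
    re  = (list ~ new) ? "dup" : "new"                       # unanchored regex match
    idx = index(";" list ";", ";" new ";") ? "dup" : "new"   # exact element lookup
    print re, idx
  }'
}
check "123;456" "45"    # first file:  45 arrives after 456
check "123;455" "45*"   # second file: 45* arrives after 455
```

Both calls should print "dup new": the regex test wrongly drops 45 (a substring of 456) and 45* (a pattern that matches 455), while the index() lookup keeps them.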

beca123456 · July 6, 2018, 2:54pm

Very good point!
I get it now, thanks!

RudiC · July 6, 2018, 5:00pm

You can "sharpen" or "narrow down" the regex to avoid false positive matches, like this:

awk -F\| '
        {if (!(a[$1] ~ "(^|;)" $2 "(;|$)")) a[$1] = a[$1] DL[$1] $2
         if (!(b[$1] ~ "(^|;)" $3 "(;|$)")) b[$1] = b[$1] DL[$1] $3
         DL[$1] = ";"
        }
END     {for (i in a)   {print i, a[i], b[i]
                        }
        }
' OFS="|"  file
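
Note that the anchored pattern still interprets the field as a regex, so a value such as 45* would still match an existing 455. If the data may contain regex metacharacters, one alternative (a different technique from the ones above, sketched here under that assumption) is to track seen values in arrays keyed by ($1, value). The names group_fields, seen2, seen3 and keys are made up for this example.

```shell
# Group $2 and $3 per $1 using exact "seen" lookups instead of pattern matching.
group_fields() {
  awk -F'|' -v OFS='|' '
  {
    if (!($1 in a)) keys[++n] = $1     # remember first-seen order of $1
    if (!seen2[$1, $2]++) a[$1] = (a[$1] == "" ? $2 : a[$1] ";" $2)
    if (!seen3[$1, $3]++) b[$1] = (b[$1] == "" ? $3 : b[$1] ";" $3)
  }
  END { for (k = 1; k <= n; k++) print keys[k], a[keys[k]], b[keys[k]] }
  '
}
printf '%s\n' 'A|123|qwer' 'A|455|tyui' 'A|45*|wsxe' 'B|789|dfgh' | group_fields
```

For this input it should print A|123;455;45*|qwer;tyui;wsxe and B|789|dfgh. Unlike for (i in a), the keys array also prints the groups in input order.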