Implementing Listagg like function in shell

Hi,

Basically what I am trying to do is making multiple fields of the same type comma-separated.

i.e. for a data like this:

B00000	abc
B00001	abc,def
B00001	ghi
B00001	jkl
B00002	abc
B00002	def
B00003	xyz

Output should be like:

B00000	abc
B00001	abc,def,ghi,jkl
B00002	abc,def
B00003	xyz

I found a solution on stackoverflow for a very similar problem here

but none of the specified solutions in there work for me. They just print whatever the input is given, as it is.

Any ideas on what's wrong and how can I make it work?

#!/usr/bin/awk -f
BEGIN {
    FS="\t"; OFS=FS
    deduplist=ARGV[1]
    ARGV[1]=""
    split(deduplist,tmp," ")
    for (i in tmp) dedup[tmp]=1
}
{
    for (i=1; i<=NF;i++)
        if (i in dedup) {
            if ($i == prev)
                $i = ""
            else
                prev = $i
        }
    # prevent printing lines that are completely blank because 
    # it's an exact duplicate of the preceding line and all fields 
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/) 
        print
}

Thanks.

Your help is greatly appreciated.

No surprise it doesn't work for you, as it by no means fits the problem. Try

awk '{F[$1] = F[$1] DL[$1] $2; DL[$1] = ","} END {for (f in F) print f, F[f]}' file
B00000 abc
B00001 abc,def,ghi,jkl
B00002 abc,def
B00003 xyz

As the order in which the array elements are retrieved is unspecified, you may need to sort the output. And, you may want to define a different output field separator.

1 Like

Thanks @RudiC that worked great.

Sorry for the late reply. I didn't get time yesterday to check the code.

Can we make it more generalised like passing arguments such as in the example since I have some results with multiple columns in between.

Thanks again. I know its a piece of cake for you.

You need to become more specific. As I said, the example DOESN'T fit.

What I mean is, some tables are like:

B00000	1	...	abc
B00001	1	...	abc,def
B00001	2	...	ghi
B00001	2	...	jkl
B00002	1	...	abc
B00002	2	...	def
B00003	1	...	xyz 

Can we edit the code such that we can pass which columns should be comma-separated and which not?

In the table that I gave now, I may not need to remove duplicates from the second column but only from the last column or vice versa.

Possible solutions are:
1.

B00000	1	...	abc
B00001	1,2	...	abc,def,ghi,jkl
B00002	1,2	...	abc,def
B00003	1	...	xyz  
B00000	1	...	abc
B00001	1	...	abc,def
B00001	2	...	ghi,jkl
B00002	1	...	abc
B00002	2	...	def
B00003	1	...	xyz 

Basically, the syntax would be like:

awk 'command' arguments(specifying columns to be comma-separated) file

Thanks.

Those two requests are different: The first one has the columns aggregated with unique key $1 , the second with key $1 $2 .
What about the other columns (indicated by "..." in your sample)? Which contents should be retained?

Yes that is exactly the question.

Can't we have a more generalised command that cover all the above scenarios?
i.e specify which column to aggregate with and which to comma separate.

Am I being clear or is this too complex?

This is how far I can get right now:

awk '
NR == 1         {KN = split (KEYCOL, KC, ",")
                 AN = split (AGGCOL, AC, ",")
                }
                {KEY = ""
                 for (i=1; i<=KN; i++)  {KEY = KEY $KC OFS
                                         $KC = ""
                                        }
                 for (i=1; i<=AN; i++)  {F[KEY,i] = F[KEY,i] DL[KEY,i] $AC
                                         DL[KEY,i] = ","
                                         $AC = "\001" i
                                        }
                 LINE[KEY] = $0
                }
END             {for (l in LINE)        {for (i=1; i<=AN; i++)  sub ("\001" i, F[l,i], LINE[l])
                                         print l, LINE[l]
                                        }
                }
' OFS="\t" KEYCOL="1" AGGCOL="2,4" file
B00000			1	...	abc
B00001			1,2,2	...	abc,def,ghi,jkl
B00002			1,2	...	abc,def
B00003			1	...	xyz

Result with KEYCOL="1,2" AGGCOL="4" :

B00002	2				...	def
B00003	1				...	xyz
B00000	1				...	abc
B00001	1				...	abc,def
B00001	2				...	ghi,jkl
B00002	1				...	abc
1 Like

Prohank,

Suppose file contains 5 columns, key fields are 1, 2 and agg field is 5.
What about field no 3 and 4. Do we suppose to agg or ignore ?

Thanks
Pravin

This is brilliant. Thanks @RudiC. This has reduced a lot of manual work.

@pravin27 I think for the current situation its ok for me to mention all the columns as either a key or an agg field as there are not too many columns.

It also avoids confusion if others use it.

Next version: duplicate field contents removed; multiple OFS reduced to single:

awk '
NR == 1         {KN = split (KEYCOL, KC, ",")
                 AN = split (AGGCOL, AC, ",")
                }
                {KEY = ""
                 for (i=1; i<=KN; i++)  {KEY = KEY $KC OFS
                                         $KC = ""
                                        }
                 for (i=1; i<=AN; i++)  {if (F[KEY,i] !~ "(^|,)" $AC "(,|$)")        {F[KEY,i] = F[KEY,i] DL[KEY,i] $AC
                                                                                         DL[KEY,i] = ","
                                                                                        }
                                         $AC = "\001" i
                                        }
                 LINE[KEY] = $0
                }
END             {for (l in LINE)        {for (i=1; i<=AN; i++)  sub ("\001" i, F[l,i], LINE[l])
                                         OP = sprintf ("%s%s%s", l, OFS, LINE[l])
                                         gsub (OFS OFS "*", OFS, OP)
                                         print OP
                                        }
                }
' OFS="\t" KEYCOL="1" AGGCOL="2,4" file
B00000	1	...	abc
B00001	1,2,3	...	abc,def,ghi,jkl,ybc,xab
B00002	1,2	...	abc,def
B00003	1	...	xyz

(tested with an extended input file)

1 Like

Thanks RudiC,

Your solution works great with tab as delimiter but I am using | and it is failing with the below error:

awk: 0602-521 There is a regular expression error.
        *?+ not preceded by valid expression
 The input line number is 24. The file is /tmp/abc.del.
 The source line number is 1.

The script I posted has 23 lines, so it is difficult to track an error in line 24 without seeing a) the script you ran b) the input file (or a representative extraction of it).

You could run it with <TAB>s and then tr '\t' '|' the result.

To clarify a little bit, 24 is the last line of my input file and source line is 1 because I ran your script by eliminating all the new lines and replacing them with semi colons.

When I run it the you have given the error is on line 18 which is:

gsub (OFS OFS "*", OFS, OP)

Actually I changed the input file to test the code, it comes with | by default.

Which I thought it might be. Tough luck - | is the "alternation" char in regexes. So you have to taylor your script. Try escaping the char like "\|".

Nuhuh already tried, doesn't work

So, I guess there's no solution for this then...

Of course there is and RudiC has already mentioned it to you (in post #13): use tr or any other text filter to change single characters) to change the separator to something else, then eventually reset it (if needed). Which character is safe to use will have to be decided by you after analysing your data. I suggest trying "@", which is fairly uncommon in normal text if no email addresses are mentioned, otherwise "�", "�" or the like.

I hope this helps.

bakunin

Well yes I read it but as I said the default file comes with | by default and it would be preferred if that wasn't replaced with something else as the files are very huge and it is not advised to make their copies.

But may be that's the only thing possible right now. Cheers.

Try

awk '
NR == 1         {KN = split (KEYCOL, KC, ",")
                 AN = split (AGGCOL, AC, ",")
                }
                {KEY = DX = ""
                 for (i=1; i<=KN; i++)  {KEY = KEY DX $KC
                                         DX = OFS
                                         $KC = ""
                                        }
                 for (i=1; i<=AN; i++)  {if (F[KEY,i] !~ "(^|,)" $AC "(,|$)")        {F[KEY,i] = F[KEY,i] DL[KEY,i] $AC
                                                                                         DL[KEY,i] = ","
                                                                                        }
                                         $AC = "\001" i
                                        }
                 $0 = $0
                 LINE[KEY] = $0
                }
END             {for (l in LINE)        {for (i=1; i<=AN; i++)  sub ("\001" i, F[l,i], LINE[l])
                                         OP = sprintf ("%s%s", l, LINE[l])
                                         print OP
                                        }
                }
' KEYCOL="1" AGGCOL="2,4" FS="\|+" OFS="|" file
1 Like