Implementing Listagg like function in shell

prohank · June 27, 2016, 5:57am

Hi,

Basically what I am trying to do is making multiple fields of the same type comma-separated.

i.e. for a data like this:

B00000	abc
B00001	abc,def
B00001	ghi
B00001	jkl
B00002	abc
B00002	def
B00003	xyz

Output should be like:

B00000	abc
B00001	abc,def,ghi,jkl
B00002	abc,def
B00003	xyz

I found a solution on stackoverflow for a very similar problem here

but none of the specified solutions in there work for me. They just print whatever the input is given, as it is.

Any ideas on what's wrong and how can I make it work?

#!/usr/bin/awk -f
BEGIN {
    FS="\t"; OFS=FS
    deduplist=ARGV[1]
    ARGV[1]=""
    split(deduplist,tmp," ")
    for (i in tmp) dedup[tmp]=1
}
{
    for (i=1; i<=NF;i++)
        if (i in dedup) {
            if ($i == prev)
                $i = ""
            else
                prev = $i
        }
    # prevent printing lines that are completely blank because 
    # it's an exact duplicate of the preceding line and all fields 
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/) 
        print
}

Thanks.

Your help is greatly appreciated.

RudiC · June 27, 2016, 6:25am

No surprise it doesn't work for you, as it by no means fits the problem. Try

awk '{F[$1] = F[$1] DL[$1] $2; DL[$1] = ","} END {for (f in F) print f, F[f]}' file
B00000 abc
B00001 abc,def,ghi,jkl
B00002 abc,def
B00003 xyz

As the order in which the array elements are retrieved is unspecified, you may need to sort the output. And, you may want to define a different output field separator.

prohank · June 28, 2016, 3:58am

Thanks @RudiC that worked great.

Sorry for the late reply. I didn't get time yesterday to check the code.

Can we make it more generalised like passing arguments such as in the example since I have some results with multiple columns in between.

Thanks again. I know its a piece of cake for you.

RudiC · June 28, 2016, 4:02am

You need to become more specific. As I said, the example DOESN'T fit.

prohank · June 28, 2016, 5:02am

What I mean is, some tables are like:

B00000	1	...	abc
B00001	1	...	abc,def
B00001	2	...	ghi
B00001	2	...	jkl
B00002	1	...	abc
B00002	2	...	def
B00003	1	...	xyz

Can we edit the code such that we can pass which columns should be comma-separated and which not?

In the table that I gave now, I may not need to remove duplicates from the second column but only from the last column or vice versa.

Possible solutions are:
1.

B00000	1	...	abc
B00001	1,2	...	abc,def,ghi,jkl
B00002	1,2	...	abc,def
B00003	1	...	xyz

B00000	1	...	abc
B00001	1	...	abc,def
B00001	2	...	ghi,jkl
B00002	1	...	abc
B00002	2	...	def
B00003	1	...	xyz

Basically, the syntax would be like:

awk 'command' arguments(specifying columns to be comma-separated) file

Thanks.

RudiC · June 28, 2016, 5:12am

Those two requests are different: The first one has the columns aggregated with unique key $1 , the second with key $1 $2 .
What about the other columns (indicated by "..." in your sample)? Which contents should be retained?

prohank · June 28, 2016, 5:21am

Yes that is exactly the question.

Can't we have a more generalised command that cover all the above scenarios?
i.e specify which column to aggregate with and which to comma separate.

Am I being clear or is this too complex?

RudiC · June 28, 2016, 6:44am

This is how far I can get right now:

awk '
NR == 1         {KN = split (KEYCOL, KC, ",")
                 AN = split (AGGCOL, AC, ",")
                }
                {KEY = ""
                 for (i=1; i<=KN; i++)  {KEY = KEY $KC OFS
                                         $KC = ""
                                        }
                 for (i=1; i<=AN; i++)  {F[KEY,i] = F[KEY,i] DL[KEY,i] $AC
                                         DL[KEY,i] = ","
                                         $AC = "\001" i
                                        }
                 LINE[KEY] = $0
                }
END             {for (l in LINE)        {for (i=1; i<=AN; i++)  sub ("\001" i, F[l,i], LINE[l])
                                         print l, LINE[l]
                                        }
                }
' OFS="\t" KEYCOL="1" AGGCOL="2,4" file
B00000			1	...	abc
B00001			1,2,2	...	abc,def,ghi,jkl
B00002			1,2	...	abc,def
B00003			1	...	xyz

Result with KEYCOL="1,2" AGGCOL="4" :

B00002	2				...	def
B00003	1				...	xyz
B00000	1				...	abc
B00001	1				...	abc,def
B00001	2				...	ghi,jkl
B00002	1				...	abc

pravin27 · June 28, 2016, 7:04am

Prohank,

Suppose file contains 5 columns, key fields are 1, 2 and agg field is 5.
What about field no 3 and 4. Do we suppose to agg or ignore ?

Thanks
Pravin

prohank · June 28, 2016, 8:29am

This is brilliant. Thanks @RudiC. This has reduced a lot of manual work.

@pravin27 I think for the current situation its ok for me to mention all the columns as either a key or an agg field as there are not too many columns.

It also avoids confusion if others use it.

RudiC · June 28, 2016, 10:19am

Next version: duplicate field contents removed; multiple OFS reduced to single:

awk '
NR == 1         {KN = split (KEYCOL, KC, ",")
                 AN = split (AGGCOL, AC, ",")
                }
                {KEY = ""
                 for (i=1; i<=KN; i++)  {KEY = KEY $KC OFS
                                         $KC = ""
                                        }
                 for (i=1; i<=AN; i++)  {if (F[KEY,i] !~ "(^|,)" $AC "(,|$)")        {F[KEY,i] = F[KEY,i] DL[KEY,i] $AC
                                                                                         DL[KEY,i] = ","
                                                                                        }
                                         $AC = "\001" i
                                        }
                 LINE[KEY] = $0
                }
END             {for (l in LINE)        {for (i=1; i<=AN; i++)  sub ("\001" i, F[l,i], LINE[l])
                                         OP = sprintf ("%s%s%s", l, OFS, LINE[l])
                                         gsub (OFS OFS "*", OFS, OP)
                                         print OP
                                        }
                }
' OFS="\t" KEYCOL="1" AGGCOL="2,4" file
B00000	1	...	abc
B00001	1,2,3	...	abc,def,ghi,jkl,ybc,xab
B00002	1,2	...	abc,def
B00003	1	...	xyz

(tested with an extended input file)

prohank · June 30, 2016, 8:03am

Thanks RudiC,

Your solution works great with tab as delimiter but I am using | and it is failing with the below error:

awk: 0602-521 There is a regular expression error.
        *?+ not preceded by valid expression
 The input line number is 24. The file is /tmp/abc.del.
 The source line number is 1.

RudiC · July 1, 2016, 3:00am

The script I posted has 23 lines, so it is difficult to track an error in line 24 without seeing a) the script you ran b) the input file (or a representative extraction of it).

You could run it with <TAB>s and then tr '\t' '|' the result.

prohank · July 1, 2016, 5:23am

To clarify a little bit, 24 is the last line of my input file and source line is 1 because I ran your script by eliminating all the new lines and replacing them with semi colons.

When I run it the you have given the error is on line 18 which is:

gsub (OFS OFS "*", OFS, OP)

Actually I changed the input file to test the code, it comes with | by default.

RudiC · July 1, 2016, 6:04am

Which I thought it might be. Tough luck - | is the "alternation" char in regexes. So you have to taylor your script. Try escaping the char like "\|".

prohank · July 1, 2016, 6:22am

Nuhuh already tried, doesn't work

prohank · July 5, 2016, 6:36am

So, I guess there's no solution for this then...

bakunin · July 5, 2016, 7:28am

Of course there is and RudiC has already mentioned it to you (in post #13): use tr or any other text filter to change single characters) to change the separator to something else, then eventually reset it (if needed). Which character is safe to use will have to be decided by you after analysing your data. I suggest trying "@", which is fairly uncommon in normal text if no email addresses are mentioned, otherwise "�", "�" or the like.

I hope this helps.

bakunin

prohank · July 5, 2016, 8:44am

Well yes I read it but as I said the default file comes with | by default and it would be preferred if that wasn't replaced with something else as the files are very huge and it is not advised to make their copies.

But may be that's the only thing possible right now. Cheers.

RudiC · July 5, 2016, 10:10am

Try

awk '
NR == 1         {KN = split (KEYCOL, KC, ",")
                 AN = split (AGGCOL, AC, ",")
                }
                {KEY = DX = ""
                 for (i=1; i<=KN; i++)  {KEY = KEY DX $KC
                                         DX = OFS
                                         $KC = ""
                                        }
                 for (i=1; i<=AN; i++)  {if (F[KEY,i] !~ "(^|,)" $AC "(,|$)")        {F[KEY,i] = F[KEY,i] DL[KEY,i] $AC
                                                                                         DL[KEY,i] = ","
                                                                                        }
                                         $AC = "\001" i
                                        }
                 $0 = $0
                 LINE[KEY] = $0
                }
END             {for (l in LINE)        {for (i=1; i<=AN; i++)  sub ("\001" i, F[l,i], LINE[l])
                                         OP = sprintf ("%s%s", l, LINE[l])
                                         print OP
                                        }
                }
' KEYCOL="1" AGGCOL="2,4" FS="\|+" OFS="|" file