How to concat lines that have the same key field

redwing · March 29, 2010, 10:12pm

Hi, I have a file like this:

ABC 123
ABC 456
ABC 321
CDE 789
CDE 345
FGH 111
FGH 222
FGH 333
XYZ 678

I need the output like this:

ABC 123,456
CDE 789,345
FGH 111,222
XYZ 678

That is, I want to concatenate the lines that share the same first column, but I only need the first two values for each key. If a key (like ABC above) has more than two lines, I only want to concatenate the first two.

Any idea how to do this?

Thanks!!

rdcwayx · March 29, 2010, 10:25pm

awk '{if (b[$1]<2) {a[$1]=a[$1] (b[$1]++ ? "," : "") $2}} END {for (i in a) print i,a[i]|"sort"}' urfile

redwing · March 30, 2010, 11:48am

Hi Rdcwayx,

Thanks for the quick reply. My question is: I am dealing with a large amount of data, probably about 10 million lines. Will this have performance issues?

Thanks

Franklin52 · March 30, 2010, 11:53am

Are the lines sorted by the first column?

alister · March 30, 2010, 12:50pm

You should just run it and find out whether performance and resource consumption are acceptable. What does and doesn't count as a performance issue depends on the person, the situation, the priority of the job, the hardware, etc.

The only thing that can be stated by looking at the awk/sort code is that it will require something on the order of 2 x 10 x avg_line_length megabytes of memory for a 10 million line data set, since both awk and sort will require full copies of the data in memory (worst case scenario). If the average line length is 50 characters, it could approach 1 gigabyte of RAM.
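The arithmetic behind that estimate can be sketched in the shell. The line count and average line length below are assumptions taken from this thread, not measurements, and the factor of two is the worst case of one full copy held by awk and one by sort:

```shell
lines=10000000   # assumed number of input lines
avg_len=50       # assumed average line length in bytes
copies=2         # worst case: one full copy in awk, one in sort

# bytes -> megabytes (using 1 MB = 1,000,000 bytes for a round figure)
est_mb=$(( lines * avg_len * copies / 1000000 ))
echo "worst-case estimate: ${est_mb} MB"
```

With these assumed numbers the estimate comes to 1000 MB. If the input has few distinct keys, the real footprint should be far smaller, because the awk arrays keep at most two values per key.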

Regards,
Alister

Franklin52 · March 30, 2010, 1:12pm

If the data is ordered by the first column, this should work fine:

awk '$1==k{s=s "," $2; next}
s{print s}
{s=$0;k=$1}
END{print s}' file

alister · March 30, 2010, 1:32pm

Hello, Franklin52:

Actually, that code isn't quite right. The original problem statement requires that at most two values are collected per key. This solution keeps appending every value it finds. Changing

s{print s}

to

s{match(s,/^[^ ]* [^,]*(,[^,]*)?/); print substr(s, RSTART, RLENGTH)}

works around that by discarding unwanted values at print time (the print in the END block would need the same truncation). A counter is probably a nicer fix, though:

awk '$1==k {if (++i<3) s=s "," $2; next}
s{print s}
{s=$0;k=$1;i=1}
END{print s}' file
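As a quick check, here is the counter version run against the sample data from the first post. The variable names input and result are only for this illustration, and the data is fed through a pipe instead of a file:

```shell
# Sample data from the first post, already grouped by key.
input='ABC 123
ABC 456
ABC 321
CDE 789
CDE 345
FGH 111
FGH 222
FGH 333
XYZ 678'

result=$(printf '%s\n' "$input" | awk '$1==k {if (++i<3) s=s "," $2; next}
s{print s}
{s=$0;k=$1;i=1}
END{print s}')

printf '%s\n' "$result"
# ABC 123,456
# CDE 789,345
# FGH 111,222
# XYZ 678
```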

Regards,
Alister

redwing · March 30, 2010, 1:49pm

Hi Franklin52 & Alister,

We would sort the file by key field first.
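A sketch of what that pipeline might look like, assuming a sort that supports the -s (stable) option. GNU and BSD sort have it, but it is not required by POSIX. Stability matters here because "the first two lines" per key should stay the first two after sorting. The sample input is made up for illustration:

```shell
# Unsorted sample: keys are interleaved.
input='CDE 789
ABC 123
XYZ 678
ABC 456
CDE 345
ABC 321'

# Sort on the key field only, keeping the original order within each key,
# then feed the grouped lines to the counter version of the awk script.
result=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k1,1 |
awk '$1==k {if (++i<3) s=s "," $2; next}
s{print s}
{s=$0;k=$1;i=1}
END{print s}')

printf '%s\n' "$result"
# ABC 123,456
# CDE 789,345
# XYZ 678
```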

Alister,

Could you please explain your code a little bit?

Thanks

alister · March 30, 2010, 2:46pm

It's Franklin52's code, with a minor tweak, but I'll explain it.

Quick primer, in case it's needed. AWK scripts consist of a series of pattern-action pairs. For each input line, awk will evaluate the pattern-action pairs in the order they occur in the script. The action (the code within curly braces) will only execute for a given input line if its pattern expression evaluates to a boolean true value. Either the pattern or the action may be absent, but not both. An absent pattern defaults to true and therefore an action without a pattern will execute for every line read. An absent action defaults to printing the current line, so a pattern without an action will print out the current line, $0, if the pattern evaluates to true.
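A tiny illustration of the three forms described above. The two input lines and the 500 threshold are made up for the example:

```shell
result=$(printf 'ABC 123\nCDE 789\n' | awk '
$1 == "ABC" { print "pattern and action:", $2 }   # runs only when field 1 is ABC
$2 > 500                                          # no action: the line is printed
{ n++ }                                           # no pattern: runs for every line
END { print "lines read:", n }')

printf '%s\n' "$result"
# pattern and action: 123
# CDE 789
# lines read: 2
```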

Now, on to this particular script.

The first field of each input line is treated as a key value. The active key, that is, the key for which values are being collected at any given time, is stored in k.

i is a counter that keeps track of how many values we've collected for the active key.

s is a string containing the current key followed by its value (or two values delimited with a comma).

Given the way this code works, it makes sense to start with the last pattern-action pair before the END pattern-action pair.

{s=$0;k=$1;i=1}

This pattern-action pair is missing its pattern, so it will match every line by default. It is the first action to execute: when the script starts, the preceding pattern-action pairs do not match because k and s are still unset. This line initializes (and later, whenever the key changes, reinitializes) s to the current line; the active key, k, to the first field of the line, $1; and the counter, i, which tracks how many values have been collected in s, to 1.

$1==k {if (++i<3) s=s "," $2; next}

If the current line's key ($1) is equal to the active key, k, and if fewer than three values (including the current line's value) have been seen for this key, then we append its value ($2) to the string s. Regardless of how many values have been seen for this key, skip the rest of the pattern-action pairs (the 'next' statement), read the next line of input, and keep repeating this first step for as long as the new input line's key is equal to the active key.

s{print s}

If we reached this point in the awk script, it's because the current line's key is not the same as the active key. Time to print the old key and its values. The next step will be to execute the pattern-action pair with which I began this explanation (to reinitialize all of the script's variables).

END{print s}

All the input has been read, so print the key which was active when we reached the end of the file.
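To watch the variables change, here is a sketch of the same script with a tracing helper added. The show() function and the "print -> " prefix are additions for illustration only and are not part of the original code:

```shell
trace=$(printf 'ABC 123\nABC 456\nABC 321\nCDE 789\n' | awk '
# Same key as the active one: append while fewer than three values seen.
$1==k {if (++i<3) s=s "," $2; show("same key"); next}
# Key changed and something was collected: flush it.
s {print "print -> " s}
# Start collecting for the new key.
{s=$0; k=$1; i=1; show("new key")}
END {print "print -> " s}
# Hypothetical helper: dump the state after each input line.
function show(why) { printf "%-8s k=%s i=%d s=%s\n", why, k, i, s }')

printf '%s\n' "$trace"
```

On the third ABC line the trace shows i reaching 3 while s stays at "ABC 123,456", which is the counter refusing a third value. The accumulated string is only printed once the CDE line arrives, and the last key is flushed by the END block.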

I hope that helped,
Alister

Franklin52 · March 30, 2010, 3:50pm

Hi alister,

You're right, I misread the question. Thanks for the fix and explanation.

Regards

redwing · March 30, 2010, 9:35pm

Hi Alister & Franklin52,

Thanks so much to you both for the code and detailed explanation. Appreciate your help!