Merging multiple lines into columns with awk, while inserting commas for missing lines

Hello all,
I have a large CSV file with four types of rows that I need to merge into one row per person, with a column for each possible code / type of row, even if that code/row isn't present for that person.

In the CSV, a person may be listed from one to four times consecutively. Each row in which they are listed has a number of hours and a code (WE, MO, CE, ES) that indicates the type of hours.

I found a thread with a solution to a similar problem. The forum won't let me post links yet, but you can append /shell-programming-and-scripting/208027-merge-multiple-lines-same-file-common-key-using-awk.html to the base URL to open that thread.

However, that script tacks each of the optional lines onto the end of the first line, which messes up the CSV columns: after merging, the same data no longer ends up in the same column for every person.

Here is a simplified version of my input data. The real file has many more fields per line, but for each person, every field except the type-of-hours code and the number of hours is identical across the lines that person appears on:

1111,John Smith,WE,105,Fourty Two Cherry Lane,Nowhere,NY
1111,John Smith,MO,23,Fourty Two Cherry Lane,Nowhere,NY
2222,Elane Boozler,WE,70,Twelve Peach Court,Elsehere,NY
2222,Elane Boozler,MO,24,Twelve Peach Court,Elsehere,NY
2222,Elane Boozler,CE,30,Twelve Peach Court,Elsehere,NY
2222,Elane Boozler,ES,8,Twelve Peach Court,Elsehere,NY
3333,Hortense Gable,CE,25,Three Pinapple Ln,Somewhere,NY
3333,Hortense Gable,ES,14,Three Pinapple Ln,Somewhere,NY

I need the output to be in the form:

1111,John Smith,105,23,,,Fourty Two Cherry Lane,Nowhere,NY
2222,Elane Boozler,70,24,30,8,Twelve Peach Court,Elsehere,NY
3333,Hortense Gable,,,25,14,Three Pinapple Ln,Somewhere,NY

Please let me know how I might get this kind of output from this kind of input using awk. Thanks so much for your help!

Based on the solution given in the thread you mentioned, try:

awk -F, '
NR!=1 && p1!=$1         {print prev
                         delete TMP
                        }
                        {p1=$1
                         TMP[$3]=$4
                         $4=sprintf("%s,%s,%s,%s", TMP["WE"], TMP["MO"], TMP["CE"], TMP["ES"])
                         sub ($3 FS, "")
                         prev=$0
                        }
END                     {print prev}
' OFS="," file
1111,John Smith,105,23,,,Fourty Two Cherry Lane,Nowhere,NY
2222,Elane Boozler,70,24,30,8,Twelve Peach Court,Elsehere,NY
3333,Hortense Gable,,,25,14,Three Pinapple Ln,Somewhere,NY
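
To show what each step does, here is the same script with comments added (same logic, just annotated):

awk -F, '
NR!=1 && p1!=$1 {              # first row of a new person (the ID in $1 changed)
    print prev                 # print the line assembled for the previous person
    delete TMP                 # clear the hours stored for the previous person
}
{
    p1 = $1                    # remember the current ID for the comparison above
    TMP[$3] = $4               # store the hours under the code in $3 (WE, MO, CE or ES)
    # rewrite the hours field as four fixed slots; codes not seen yet stay empty
    $4 = sprintf("%s,%s,%s,%s", TMP["WE"], TMP["MO"], TMP["CE"], TMP["ES"])
    sub($3 FS, "")             # remove the code field and the comma after it
    prev = $0                  # keep the rebuilt record (assigning $4 rebuilds $0 with OFS)
}
END { print prev }             # print the line for the last person
' OFS="," file

Note this relies on the rows for each person being consecutive in the file, as in your sample.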

Great! Thanks so much for your help : ) and sorry for the delayed response.

Could you please answer one more question? If the ID field is not the first field in the file (say there are six fields to the left of it), I would have thought I could just shift the field identifiers ($1 to $7, $3 to $9, and $4 to $10) to have the script keep operating on the same fields, as shown below.

awk -F, '
NR!=1 && p1!=$7         {print prev
                         delete TMP
                        }
                        {p1=$9
                         TMP[$9]=$10
                         $10=sprintf("%s,%s,%s,%s", TMP["WE"], TMP["MO"], TMP["CE"], TMP["ES"])
                         sub ($9 FS, "")
                         prev=$0
                        }
END                     {print prev}
' OFS="," TestIDNotFirst.csv

However, if I run that on a version of the file where the ID is the 7th field, like the one below, it doesn't work.

red,up,cold,in,Japanamation,Digeridoo,1111,John Smith,WE,105,Fourty Two Cherry Lane,Nowhere,NY
red,up,cold,in,Japanamation,Digeridoo,1111,John Smith,MO,23,Fourty Two Cherry Lane,Nowhere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,WE,70,Twelve Peach Court,Elsehere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,MO,24,Twelve Peach Court,Elsehere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,CE,30,Twelve Peach Court,Elsehere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,ES,8,Twelve Peach Court,Elsehere,NY
blue,right,hot,in,Smitten,Kitten,3333,Hortense Gable,CE,25,Three Pinapple Ln,Somewhere,NY
blue,right,hot,in,Smitten,Kitten,3333,Hortense Gable,ES,14,Three Pinapple Ln,Somewhere,NY

It produces the output below, where the WE, MO, CE, and ES hours aren't being put into the correct columns:

red,up,cold,in,Japanamation,Digeridoo,1111,John Smith,105,,,,Fourty Two Cherry Lane,Nowhere,NY
red,up,cold,in,Japanamation,Digeridoo,1111,John Smith,,23,,,Fourty Two Cherry Lane,Nowhere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,70,,,,Twelve Peach Court,Elsehere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,,24,,,Twelve Peach Court,Elsehere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,,,30,,Twelve Peach Court,Elsehere,NY
green,down,hot,out,Collapso,Polanko,2222,Elane Boozler,,,,8,Twelve Peach Court,Elsehere,NY
blue,right,hot,in,Smitten,Kitten,3333,Hortense Gable,,,25,,Three Pinapple Ln,Somewhere,NY
blue,right,hot,in,Smitten,Kitten,3333,Hortense Gable,,,,14,Three Pinapple Ln,Somewhere,NY

In reality, I need to output specific fields other than the ID and number of hours per category. In pseudo-code, what I want to write is something like:

awk '{ print $2, $4, $7, $8 }

NR!=1 && p1!=$7         {print prev
                         delete TMP
                        }
                        {p1=$9
                         TMP[$9]=$10
                         $10=sprintf("%s,%s,%s,%s", TMP["WE"], TMP["MO"], TMP["CE"], TMP["ES"])
                         prev=$0
                        }
END                     {print prev}
{ print $10, $12}
' OFS="," InputFile.csv

And I'd want the real code that works like that to output the following:

up,in,1111,John Smith,105,23,,,Fourty Two Cherry Lane,NY
down,out,2222,Elane Boozler,70,24,30,8,Twelve Peach Court,NY
right,in,3333,Hortense Gable,,,25,14,Three Pinapple Ln,NY

You've already gone above and beyond the call of duty, but if you could turn my pseudocode into real code that outputs this, I'll send a pizza to the address of your choice!

Try,

awk -F, '
NR != 1 && p1 != $8 {
    print prev
    delete TMP
}
{
    p1 = $8
    TMP[$9] = $10
    prev = $2 FS $4 FS $7 FS $8 FS TMP["WE"] FS TMP["MO"] FS TMP["CE"] FS TMP["ES"] FS $11 FS $13
}
END {
    print prev
}' OFS="," file

Awesome! That works! I was serious about the pizza; if you would like it, DM me your address!

No need, thank you. Keep at it.

Your translation of RudiC's suggestion was close... I think you would have gotten what you expected if the line that sets p1 in the main action block of your code above had been:

                        {p1=$7

instead of:

                        {p1=$9

The code Aia suggested should work fine as long as your data never contains two people with the same name but different IDs. RudiC's code was matching on the ID (e.g., 1111, 2222, and 3333) while Aia's code is matching on the name (e.g., John Smith, Elane Boozler, and Hortense Gable). With the sample data you provided you could have matched on any field except $3, $9, and $10; but I assume that the ID ($7) and the name ($8) are the only two reasonable choices to consider for your real data.
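
For instance, with hypothetical rows like these sitting next to each other in the file:

blue,right,hot,in,Smitten,Kitten,4444,John Smith,WE,10,One Elm St,Anywhere,NY
blue,right,hot,in,Smitten,Kitten,5555,John Smith,MO,5,Two Oak St,Anywhere,NY

matching on the name ($8) would treat both rows as one person and merge their hours into a single output line, while matching on the ID ($7) keeps them separate.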

Note also that the awk statement delete array_name (i.e., delete TMP) is an extension to the standards and will not work on some awk implementations. If you ever need to port this to a system that doesn't accept that statement, the following should work on any standards-conforming version of awk:

                         for(i in TMP) delete TMP[i]
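
For example, in the script you posted above, the first action block would then read:

NR!=1 && p1!=$7         {print prev
                         for(i in TMP) delete TMP[i]
                        }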

Hello RalphNY,

If the order doesn't matter to you, you can try the following.

awk -F, '{A[$1 FS $2]=$1 FS $2;B[$1 FS $2]=$(NF-2) FS $(NF-1) FS $NF;C[$1 FS $2 FS $3]=$4;A[$1 FS $2]=A[$1 FS $2] FS C[$1 FS $2 FS "WE"] FS C[$1 FS $2 FS "MO"] FS C[$1 FS $2 FS "CE"] FS C[$1 FS $2 FS "ES"] FS B[$1 FS $2]} END{for(i in A){print A[i]}}' Input_file

Output will be as follows.

3333,Hortense Gable,,,25,14,Three Pinapple Ln,Somewhere,NY
1111,John Smith,105,23,,,Fourty Two Cherry Lane,Nowhere,NY
2222,Elane Boozler,70,24,30,8,Twelve Peach Court,Elsehere,NY

If you need the output in sorted order, then the following may help you.

awk -F, '{A[$1 FS $2]=$1 FS $2;B[$1 FS $2]=$(NF-2) FS $(NF-1) FS $NF;C[$1 FS $2 FS $3]=$4;A[$1 FS $2]=A[$1 FS $2] FS C[$1 FS $2 FS "WE"] FS C[$1 FS $2 FS "MO"] FS C[$1 FS $2 FS "CE"] FS C[$1 FS $2 FS "ES"] FS B[$1 FS $2]} END{for(i in A){print A[i]}}' Input_file | sort -k1

Output will be as follows.

1111,John Smith,105,23,,,Fourty Two Cherry Lane,Nowhere,NY
2222,Elane Boozler,70,24,30,8,Twelve Peach Court,Elsehere,NY
3333,Hortense Gable,,,25,14,Three Pinapple Ln,Somewhere,NY

EDIT: Adding a non-one-liner version of the same solution now.

awk -F, '{
                A[$1 FS $2]=$1 FS $2
                B[$1 FS $2]=$(NF-2) FS $(NF-1) FS $NF
                C[$1 FS $2 FS $3]=$4
                A[$1 FS $2]=A[$1 FS $2] FS C[$1 FS $2 FS "WE"] FS C[$1 FS $2 FS "MO"] FS C[$1 FS $2 FS "CE"] FS C[$1 FS $2 FS "ES"] FS B[$1 FS $2]
         }
         END{
                for(i in A){
                        print A[i]
                }
         }
        ' Input_file | sort -k1
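
One small note: since the output is comma separated and the IDs are numeric, you may prefer to tell sort about the delimiter and sort numerically on the first field, e.g. by piping the awk output through:

sort -t, -k1,1n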

Thanks,
R. Singh

Don,
Thanks for pointing those things out! I hadn't noticed the latest version was matching the name instead of the ID. In the real file, there definitely would be some people with identical names, so it would have to match the ID.

Just changing the $8 to $7 to match the ID rather than the name seems to have worked. Could you tell me if there is anything wrong with the code below?

awk -F, '
NR != 1 && p1 != $7 {
    print prev
    delete TMP
}
{
    p1 = $7
    TMP[$9] = $10
    prev = $2 FS $4 FS $7 FS $8 FS TMP["WE"] FS TMP["MO"] FS TMP["CE"] FS TMP["ES"] FS $11 FS $13
}
END {
    print prev
}' OFS="," file

I'm using Cygwin awk / GNU awk. I don't see any errors when running with delete TMP, but that should be compatible with GNU awk, right? Thanks again : )

I'm glad to have helped.

As I said before, the above code should be fine as long as it doesn't print any diagnostics (and GNU awk is one of many that have the extension that accepts delete array_name in addition to the delete array_element that is required by the standards). Just keep the delete element loop in the back of your mind if you ever need to port your code to a system that doesn't support that extension.