Pivot Column using awk

morbid_angel · January 27, 2015, 12:26am

Hello everyone

I have a 20M file which is having the below sample layout

1111,ABC,100
1111,DEF,200
1111,XYZ,300
4444,LMN,100
4444,AKH,500
4444,WCD,400
2222,ABC,100
7777,DEF,300
7777,WCD,300

I need to covert this to below format
Output

Party_ID|ABC|AKH|DEF|LMN|WCD|XYZ
1111|100|0|200|0|0|300
2222|100|0|0|0|0|0
4444|0|500|0|100|400|0
7777|0|0|300|0|300|0

How can I make this through UNIX and it should be pretty faster.My current script take huge time to pivot for 20M records

anbu23 · January 27, 2015, 3:08am

Can you post your script

Don_Cragun · January 27, 2015, 3:33am

In addition to what anbu23 has already requested, please also tell us some more information:

Are the input file field2 values always the same, or do the headings in the output file have to be determined from the input file?
If the headings have to be determined from the input, do the headings have to be sorted alphabetically?
Are all input records for a given Party_ID adjacent in your input files (as they are in your sample input)?
Are Party_ID values all numeric?
Does the output need to be sorted by Party_ID, or can the output be in the same order as your input data?

RudiC · January 27, 2015, 3:49am

Try this for a starting point:

awk     '       {LN[$1]; HD[$2]; MX[$1,$2]=$3}
         END    {               printf "%10s", ""; for (i in HD) printf "%10s", i; print "";
                 for (j in LN) {printf "%10s",j;   for (i in HD) printf "%10s", MX[j,i]; print ""}
                }
        ' FS=, file
                 WCD       AKH       DEF       LMN       XYZ       ABC
      4444       400       500                 100                    
      2222                                                         100
      7777       300                 300                              
      1111                           200                 300       100

morbid_angel · January 27, 2015, 2:20pm

This is my script and its too slow

cat File|awk -F',' '{print $1}' |sort -u >UniqueIds
cat File|awk -F',' '{print $2}' |sort -u >UniqueColumns


Counter=1
cat UniqueIds|
{
while read Line
do

cat UniqueColumns|
{
while read Col 
do
if [[ $(grep $Line File|grep $Col) == "" ]];then
echo $Line,$Col,0 >>File
fi
done
}

COL_LIST=$(cat File|grep $Line|sort -t',' -k2|awk -F',' '{print $2}' ORS='|'|awk -F'|' '{NF=NF-1;$1=$1}1' OFS="|")
VALUE_LIST=$(cat File|grep $Line|sort -t',' -k2|awk -F',' '{print $3}' ORS='|'|awk -F'|' '{NF=NF-1;$1=$1}1' OFS="|")
if [[ $Counter == 1 ]];then
echo Party_ID"|"$COL_LIST
fi
echo $Line"|"$VALUE_LIST
((Counter=Counter+1))
done
}
rm UniqueIds UniqueColumns

---------- Post updated at 10:18 AM ---------- Previous update was at 10:17 AM ----------

Done . I send my script. I need to sort the second column

---------- Post updated at 03:10 PM ---------- Previous update was at 10:18 AM ----------

GREAT THANKS Rudi

My Script takes 3 minute for processing 30Million records .I modified little bit and here is the one

awk   '{LN[$1]; HD[$2]; MX[$1,$2]=$3}
END{
printf "%s", "PARTY_ID"; for (i in HD) printf "|%s", i; print "";
for (j in LN) {printf "%s",j;
for (i in HD) printf "|%s", MX[j,i]; print ""}
}
' FS=, file

---------- Post updated at 03:20 PM ---------- Previous update was at 03:10 PM ----------

I have made small modification to print zero too

awk   '{LN[$1]; HD[$2]; MX[$1,$2]=$3}
END{
printf "%s", "PARTY_ID"; for (i in HD) printf "|%s", i; print "";
for (j in LN) 
{
printf "%s",j;
for (i in HD) 
if (MX[j,i] =="")
  printf "|%s",0
else 
  printf "|%s", MX[j,i]; 
  print ""
}
}' FS=, file

MadeInGermany · January 27, 2015, 2:40pm

A little trick enforces a NULL value to be casted to an integer, giving 0 :

printf "|%s", MX[j,i]+0

Or printf can specify an integer or floating point in its format argument:

printf "|%d", MX[j,i]

printf "|%.f", MX[j,i]