Pivoting csv field from pipe delimited file

svks1985 · November 20, 2014, 11:51am

Hello All,
Thanks for taking time to read through the thread and for providing any possible solution.
I am trying to pivot a comma separated field in a pipe delimited file. Data looks something like this:

Field1|Field2
123|345,567,789
234|563,560
345|975,098,985,397,984
456|736

Desired output:

Field1|Field2
123|345
123|567
123|789
234|563
234|560
345|975
345|098
345|985
345|397
345|984
456|736

Please note that the comma-separated field can contain any number of values. That is, there is no fixed upper limit (such as 10) on the number of commas in Field2.
The only thing I could think of is:

cut -d'|' -f2 test.dat | awk -F',' '{for(i=1;i<=NF;++i)print $i;}'

But this prints only the Field2 values, one per line. I need Field1 as well, since that is my key field; without it the pivoting makes no sense.

Any help would be greatly appreciated!
Thanks very much.

junior-helper · November 20, 2014, 12:15pm

Try

awk -F'[\|,]' 'BEGIN {OFS="|"} NR==1 {print;next} {for (i=2;i<=NF;i++) print $1, $i}' file

durden_tyler · November 20, 2014, 12:15pm

$
$ cat test.dat
Field1|Field2
123|345,567,789
234|563,560
345|975,098,985,397,984
456|736
$
$
$ awk -F"|" '{n=split($2,a,","); for(i=1;i<=n;++i){print $1"|"a[i]}}' test.dat
Field1|Field2
123|345
123|567
123|789
234|563
234|560
345|975
345|098
345|985
345|397
345|984
456|736
$
$

svks1985 · November 20, 2014, 12:29pm

Thanks very much junior-helper and durden_tyler.
Both solutions worked. However, with junior-helper's solution I get a warning message:

awk: warning: escape sequence '\|' treated as plain `|'

Also, I would really appreciate it if you could explain the code as well.
Thanks again!

junior-helper · November 20, 2014, 12:58pm

awk -F'[\|,]'
Telling awk to use pipe and comma as field separator/delimiter for the input file.
I used backslash "\" to escape the pipe, but obviously it's not mandatory here, you can remove the backslash to avoid the warning.

'BEGIN {OFS="|"}
Defining "Output Field Separator" as pipe. This portion is executed only once.
Hint: Delete this part to see the difference.
This is one way of defining the OFS. Alternatively, one can "hard-code" it in the print command, e.g. print $1"|"$i

NR==1 {print;next}
NR is a built-in awk variable meaning "Number of Records", i.e. the number of the line currently being read.
The above line means: if the line number is 1, print the line unmodified and move on to the next line.
This portion is executed only once too.

{for (i=2;i<=NF;i++) print $1, $i}'
NF is a built-in awk variable meaning the (total) "Number of Fields" in the current line.
awk loops here from field 2 to the last field, printing $1, $i each time
($1 is the first field; on the first iteration $i is the second field, on the next iteration the third field, and so on)

Hope I was clear.
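The pieces explained above can be put back together as one commented script. This is only a minimal sketch: the function name pivot and the here-document with sample rows are made up for illustration, and the backslash is dropped from the bracket expression to avoid the warning.

```shell
#!/bin/sh
# pivot: hypothetical wrapper name; reads pipe-delimited data on stdin
pivot() {
    awk -F'[|,]' '
        BEGIN { OFS = "|" }            # output separator, set once before any input
        NR == 1 { print; next }        # header line: print unchanged, skip the loop
        { for (i = 2; i <= NF; i++)    # every field after the key
              print $1, $i }           # key, OFS, then one value per output line
    '
}

# Sample rows taken from the first post
pivot <<'EOF'
Field1|Field2
123|345,567,789
456|736
EOF
```

Run against real data with something like: pivot < test.dat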

shamrock · November 20, 2014, 1:02pm

Yet another way of doing the same thing with awk..

awk -F\| '{OFS="|";gsub(",",RS$1FS,$2);print}' file
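For readers wondering how the one-liner above works, here is a commented sketch of the same idea. The function name pivot_gsub is made up for illustration. The trick, as far as I can tell, is that each comma in $2 is replaced by a newline (RS), the key ($1) and a pipe (FS), so a single print emits several lines.

```shell
#!/bin/sh
# pivot_gsub: hypothetical wrapper name; reads pipe-delimited data on stdin
pivot_gsub() {
    awk -F'|' '
        {
            OFS = "|"                  # used when awk rebuilds the line after $2 changes
            gsub(",", RS $1 FS, $2)    # each comma becomes: newline, key, pipe
            print                      # the rebuilt line now spans several output lines
        }
    '
}

printf '123|345,567,789\n' | pivot_gsub
```

One caveat with this approach: an "&" inside Field1 would be interpreted by gsub as "the matched text", so it is safest when the key is purely alphanumeric.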

durden_tyler · November 20, 2014, 4:13pm

awk -F"|" '{n=split($2,a,","); for(i=1;i<=n;++i){print $1"|"a[i]}}' test.dat

The awk code performs 3 steps for every line that it reads from "test.dat":

Step 1:
Split the line on the "|" character, since -F"|" has been specified. After splitting, the variables $1 and $2 are set to the two values. Each line will have only two values since there is exactly one "|" per line.

Step 2:
Use the "split" function on the value of $2 from Step 1. Use the comma "," as separator here. After splitting, set the values to the array "a". Set the value of "n" to the size of the array "a".

Step 3:
Run the "for" loop from value of i = 1 to "n" that was determined in Step 2. For each iteration, print the value of $1 from Step 1, the pipe character "|" and the value of a[i]. "a" was determined in Step 2 and i is the iterator value.

Once you understand these 3 steps, you can trace them through the first couple of lines read.

---------------------------------------------------
Line 1 => Read "Field1|Field2"
---------------------------------------------------
Step 1:
After splitting on "|" character, value of $1 = Field1 and that of $2 = Field2

Step 2:
After splitting $2 = Field2 on the "," character, the array "a" has only one element. a[1] = Field2 and n = 1.

Step 3:
Loop from i=1 to n i.e. 1. Print $1 then "|" then a[1] i.e. print "Field1|Field2"

---------------------------------------------------
Line 2 => Read "123|345,567,789"
---------------------------------------------------

Step 1:
After splitting on "|" character, value of $1 = 123 and that of $2 = 345,567,789

Step 2:
After splitting $2 = 345,567,789 on the "," character, the array "a" has 3 elements.
a[1] = 345
a[2] = 567
a[3] = 789
n = 3

And so on...
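The walk-through above can also be reproduced by letting awk print what it sees at each step. This is a small tracing sketch, not part of the solution itself; the function name trace_split and the output layout are made up for illustration.

```shell
#!/bin/sh
# trace_split: hypothetical helper that shows n and a[i] for every input line
trace_split() {
    awk -F'|' '
        {
            n = split($2, a, ",")              # Step 2: fill a[1..n]; n is the element count
            printf "line %d: $1=%s n=%d\n", NR, $1, n
            for (i = 1; i <= n; ++i)           # Step 3: one output line per element
                printf "  a[%d]=%s -> %s|%s\n", i, a[i], $1, a[i]
        }
    '
}

printf 'Field1|Field2\n123|345,567,789\n' | trace_split
```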

Since -F forces awk to split the line anyway, the "split" function could be avoided like so:

awk -F'[|,]' '{for(i=2;i<=NF;++i){print $1"|"$i}}' test.dat

A test run follows:

$
$ cat test.dat
Field1|Field2
123|345,567,789
234|563,560
345|975,098,985,397,984
456|736
$
$ awk -F'[|,]' '{for(i=2;i<=NF;++i){print $1"|"$i}}' test.dat
Field1|Field2
123|345
123|567
123|789
234|563
234|560
345|975
345|098
345|985
345|397
345|984
456|736
$
$

The OFS variable could also be used, as others have shown.

svks1985 · November 23, 2014, 8:33pm

Thanks junior-helper and durden_tyler. The explanation really made things clear to me. I made some changes, implemented the same approach, and it looks like things are coming out the way I was expecting.
I really appreciate all your help!!

ongoto · November 24, 2014, 10:34am

...and in pure bash
Same same...

# File fldsdata=
123|345,567,789
234|563,560
345|975,098,985,397,984
456|736

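#!/bin/bash
# Read every line of fldsdata into the MAPFILE array (-t strips the newlines)
mapfile -t < fldsdata
for f1 in "${MAPFILE[@]}"
do
    # ${f1#*\|} is the part after the pipe; tr puts one value per line
    for f2 in $(echo "${f1#*\|}" | tr ',' '\n')
    do
        # ${f1%\|*} is the key before the pipe
        echo "${f1%\|*}|$f2"
    done
done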

output
--------
123|345
123|567
123|789
234|563
234|560
345|975
345|098
345|985
345|397
345|984
456|736
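The same loop can also be written without calling tr, by letting the shell split on the delimiters itself. This is a minimal sketch under the assumption that only POSIX shell features are wanted (so it runs under bash as well as sh); the function name pivot_sh is made up for illustration.

```shell
#!/bin/sh
# pivot_sh: hypothetical name; the ( ... ) body runs in a subshell so the
# IFS and set -f changes do not leak into the calling shell
pivot_sh() (
    set -f                                   # no filename globbing on unquoted $values
    while IFS='|' read -r key values; do     # key = Field1, values = the rest of the line
        IFS=','                              # split the second field on commas
        for v in $values; do
            printf '%s|%s\n' "$key" "$v"
        done
    done
)

pivot_sh <<'EOF'
123|345,567,789
456|736
EOF
```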