Transposing data based on 1st column

I do have a big tab delimited file of the following format

aa 344 456
aa 34 67
bb 34 90
bb 23 100
bb 1 89
d 0 12
e 45 678
e 78 90
e 56 90
....
....
....

I would like to transpose the data based on the category on column one and get the output file in the following tab delimited format format:

a 344-456, 34-67
b 34-90,23-100,1-89
d 0-12
e 45-678,78-90,56-90


Please let me know the best way to do this using awk or sed

Hello kanja,

If you don't bother of the sequence of the output like how it is in Input_file then following may help you in same.

awk '{A[$1]=A[$1]?A[$1] OFS $2"-"$3:$2"-"$3} END{for(i in A){print i FS A}}' OFS=", "   Input_file

Output will be as follows.

bb 34-90, 23-100, 1-89
d 0-12
e 45-678, 78-90, 56-90
aa 344-456, 34-67

If you want output in same sequence as Input_file then following may help you in same.

awk 'FNR==NR{A[$1]=A[$1]?A[$1] OFS $2"-"$3:$2"-"$3;next} ($1 in A){print $1 FS A[$1];delete A[$1]}' OFS=", "  Input_file  Input_file

Output will be as follows.

aa 344-456, 34-67
bb 34-90, 23-100, 1-89
d 0-12
e 45-678, 78-90, 56-90

Thanks,
R. Singh

awk '{i=substr($0,1,1) ;v=v?v","$2"-"$3:$2"-"$3;} END{ for (i in v) print i"\t"v}' input_file

You said that your input file has <tab> delimited fields, but the sample input and output you provided is <space> delimited.

If your input file has all records for a given first character of the first field on adjacent lines (as in your example), you could try this simpler approach which uses less memory (and, therefore, should run a little faster) and keeps the output order the same as the input order. It assumes that you want a <tab> separating fields in the output, but I assume you can see how to change that to a <space> if that is what you want:

awk '
last != substr($1, 1, 1) {
	if(NR > 1) print ""
	last = substr($1, 1, 1)
	printf("%s\t%s-%s", last, $2, $3)
	next
}
{	printf(",%s-%s", $2, $3)
}
END {	if(NR > 0) print ""
}' file

which, with your sample input, produces the output:

a	344-456,34-67
b	34-90,23-100,1-89
d	0-12
e	45-678,78-90,56-90

For any of the awk scripts suggests in this thread, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

Thank you all

I see that you all assume that the first column only contains characters. If it had also some numbers along with the character in the first column , how do I modify the awk command?. Also the input file is tab-delimited.

for example:

aa123 344 456
aa123 34 67
bb34 34 90
bb34 23 100
bb34 1 89
d3 0 12
e55 45 678
e55 78 90
e55 56 90
....
....
....

@Kanja

No assumptions have been made - only answers based upon your input data/expected output

1 Like

Your sample data in post #1 in this thread explicitly showed that no matter how many characters were in field 1 in your input file, you only wanted the 1st character of field 1 to appear in your output file. Have you now changed you mind on which lines are to be grouped together??? If so, please explicitly state your new requirements.

Also note that you say your input file is <tab> delimited, but every sample file you have shown us is delimited by a single <space> character; not a <tab>.

1 Like

Sorry, I realize my mistake. I need the entire characters on column 1 to categorize the data in the output file;

aa123 344-456, 34-67
bb34 34-90,23-100,1-89
d3 0-12
e55 45-678,78-90,56-90

It was the half-baked specification that made people resort to assumptions based on your sample output. On top, RavinderSingh13's proposals in post#2 DO provide for the entire first field. Adapt Don Cragun's proposal to allow for the entire field1:

awk '
last != $1      {if (NR > 1) print ""
                 last = $1
                 printf("%s\t%s-%s", last, $2, $3)
                 next
                }
                {printf(",%s-%s", $2, $3)
                }
END             {if (NR > 0) printf RS
                }
' file
1 Like