Group by a column and first 2 characters of another

Hello all,

I want to consolidate my data group by cities into 4 rolled up code categories (first 2 characters of 2nd col), namely PR, GR, TY and others.

CHICAGO	PR1
CHICAGO	GR3
CHICAGO	GR4
CHICAGO	HT5
CHICAGO	HT6
CHICAGO	TY5
HOUSTON	PR3
HOUSTON	PR9
HOUSTON	TY4
HOUSTON	HJ5
DESMOINES	PR5
DESMOINES	PR6
DESMOINES	GR7
DESMOINES	HT7

Now I want to have

	PR	GR 	TY	OTHERS
CHICAGO	PR1	GR3|GR4	TY5	HT5|HT6
HOUSTON	PR3|PR9		TY4	HJ5
DESMOINES	PR5|PR6	GR7		HT7

I tried to achieve this by the following code

awk -F"\t" '{if(a[$1]){a[$1]=a[$1]"|"$2} else { a[$1]=$2}} END {for (i in a) {print i"\t"a}}' 

but I`m not sure how to include PR, GR , TY and all others separately. Please help.

You could try something like:

awk '
BEGIN {	print "\tPR\tGR\tTY\tOTHERS"
	grp["PR"]; grp["GR"]; grp["TY"]
}
{	if(!((g = substr($2, 1, 2)) in grp))
		g = "OTHERS"
	if(!($1 in city))
		citylist[++cc] = city[$1] = $1
	data[$1, g] = data[$1, g] == "" ? $2 : data[$1, g] "|" $2
}
END {	for(i = 1; i <= cc; i++)
		printf("%s\t%s\t%s\t%s\t%s\n", citylist,
		    data[citylist, "PR"], data[citylist, "GR"],
		    data[citylist, "TY"], data[citylist, "OTHERS"])
}' file

which will work even if the input lines for a given city are not all contiguous. If the data for all cities are grouped on contiguous lines (as in your sample data), this script could be simplified.

This code produces the output you said you wanted except it does not put a space in the heading line after the GR before the tab field separator.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

2 Likes