Here's the code with comments in hopes that you can figure out what is going on. I added a sort (some awk versions don't have a built-in sort function, so I wrote one that shouldn't be too inefficient based on the number of rows/columns you've indicated.
#!/usr/bin/env ksh
awk -F "," -v c1=${1:-1} -v c2=${2:-2} '
function sort_order( what, l, i, j )
{
for( i = 0; i < l; i++ ) # big pass across array
{
big = 0; # set first entry as the largest value
for( j = 1; j < l-i; j++ ) # small pass (up to length - i)
if( what[j] > what[big] ) # if this value larger, mark it
big = j;
if( big != j-1 ) # if largest is not the last checked
{
hold = what[j-1]; # swap the last for the largest
what[j-1] = what[big];
what[big] = hold;
}
}
}
{
if( $(c2) == "" ) # if value in c2 is missing, we dont count
next;
if( $(c1) == "" )
$(c1) = "BLANK"; # easy eycatcher for empty field
if( !seen_1[$(c1)]++ )
order_1[o1idx++] = $(c1); # order each c1 was observed
if( !seen_2[$(c2)]++ )
order_2[o2idx++] = $(c2); # order each c2 was observed
count[$(c1),$(c2)]++; # count the number of times the pair (c1,c2) appear together
}
END {
sort_order( order_1, length( order_1 ) ); # sort values from c1, and c2
sort_order( order_2, length( order_2 ) );
printf( "%15s ", "COLA/COLB" ); # print the header line using the order_2 list
for( j=0; j < o2idx; j++ )
printf( "%15s ", order_2[j] ); # %15s will align columns based on a width of 15
printf( "\n" ); # new line for first row of matrix
for( i = 0; i < o1idx; i++ ) # print matrix -- for each row (c1 value)
{
printf( "%15s ", order_1 ); # print the c1 value (again width of 15)
for( j=0; j < o2idx; j++ ) # for each column (c2 values)
printf( "%15d ", count[order_1,order_2[j]] ); # print each (width 15 again)
printf( "\n" ); # end the row by printing a newline
}
}
'
exit
As for it not lining up correctly, that seems odd. Each field is printed with a width of 15 characters (%15d or %15s) and provided that none of the data is more than 15 characters wide it should space out nicely. You might need to reduce the setting (from 15 to 8 or something like that) in order to suit your needs. The programme also prints the word "BLANK" for empty data (in column A) and does not count anything if column B is missing. I assumed that missing columns were adjacent comas (,,).
For testing I ran this small set of data:
6000,1000,100,5000,3000,100,1000,1000,2000,4000
2000,7000,4000,,8000,9000,3000,100,1000,2000
4000,6000,6000,9000,100,7000,8000,7000,9000,8000
4000,4000,5000,,8000,1000,6000,6000,5000,9000
9000,5000,8000,100,6000,4000,8000,8000,9000,7000
7000,1000,100,7000,8000,4000,6000,2000,5000,6000
100,5000,6000,3000,5000,5000,100,4000,4000,6000
7000,9000,4000,6000,4000,8000,8000,6000,6000,9000
9000,1000,4000,,5000,1000,1000,2000,9000,8000
7000,1000,5000,1,5000,3000,100,3000,4000,1000
to ensure sorting and blanks worked. With the command test_script 4 5 <test_data
the following was the output:
COLA/COLB 100 3000 4000 5000 6000 8000
1 0 0 0 1 0 0
100 0 0 0 0 1 0
3000 0 0 0 1 0 0
5000 0 1 0 0 0 0
6000 0 0 1 0 0 0
7000 0 0 0 0 0 1
9000 1 0 0 0 0 0
BLANK 0 0 0 1 0 2
The columns are 15 characters wide, and the BLANK value will sort out last if the data in the input is all numeric. You can replace "BLANK" with " "
if you just want a space in the output. Unfortunately, I cannot say what might be happening if your columns are misaligned as it works for me in my test environment.
Hope this has been of some value.