convert data into matrix- awk

is it possible to count the number of keys based on state and cell and output it as a simple matrix.

Ex: cell1-state1 has 2 keys
cell3-state1 has 4 keys.

Note: Insert 0 if no data available.

input

key    states    cell
key1    state1    cell1
key1    state2    cell1
key1    state3    cell1
key1    state4    cell1
key1    state5    cell1
key1    state6    cell1
key1    state7    cell1
key1    state8    cell1
key1    state9    cell1
key1    state10    cell1
key1    state11    cell1
key1    state12    cell1
key1    state13    cell1
key1    state14    cell1
key1    state15    cell1
key2    state1    cell1
key2    state2    cell1
key2    state3    cell1
key2    state4    cell1
key2    state5    cell1
key2    state6    cell1
key2    state7    cell1
key2    state8    cell1
key2    state9    cell1
key2    state10    cell1
key2    state11    cell1
key2    state12    cell1
key2    state13    cell1
key2    state14    cell1
key2    state15    cell1
key1    state1    cell2
key1    state2    cell2
key1    state3    cell2
key1    state4    cell2
key1    state5    cell2
key1    state6    cell2
key1    state7    cell2
key1    state8    cell2
key1    state9    cell2
key1    state10    cell2
key1    state11    cell2
key1    state12    cell2
key1    state13    cell2
key1    state14    cell2
key1    state15    cell2
key1    state1    cell3
key1    state14    cell3
key2    state1    cell3
key3    state1    cell3
key4    state1    cell3

output

cell    state1    state2    state3    state4    state5    state6    state7    state8    state9    state9    state10    state11    state12    state13    state14    state15
cell1    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2
cell2    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
cell3    4    0    0    0    0    0    0    0    0    0    0    1    0    1    0

I think it is. are the values of column 2 & 3 static values you know about?

col2 has state1 to state15
col3 has cell1-cell9

and

col1 has many keys 1- several thousands

maybe you can create a loop and concatenate the value of col2&3 to a name. this name can be used as variable and set.
for example:
1st row -> found key1, state3, cell5 -> $state3cell5 = "_key1"

before adding the key to the var check if the key(substring) already exists.
after the loop you can count in each variable matches of "/_key\d{1,3}/".

I don't quiet get it. ?

Hi,

Try this one,

awk '{f1[$3]++;f2[$2]++;a[$3"^"$2]++;}END{ORS="";print "      ";for(j in f2){print j" ";c++;if(c>5){print " ";}}print "\n";for(i in f1){print i;for(k in f2){l=i"^"k;if(a[l] == ""){print "        ";}else{print "   "a[l]"    ";}}print "\n";}}' file

My Output is,

 
      state7 state8 state9 state10 state11 state12  state1  state13  state2  state14  state3  state15  state4  state5  state6
cell1   2       2       2       2       2       2       2       2       2       2       2       2       2       2       2
cell2   1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
cell3                                                   4                       1

if you have gawk, you can sort the header with below code,

gawk '{f1[$3]++;f2[$2]++;a[$3"^"$2]++;}END{asort(f1);ORS="";print "      ";for(j in f2){print j" ";c++;if(c>5){print " ";}}print "\n";for(i in f1){print i;for(k in f2){l=i"^"k;if(a[l] == ""){print "        ";}else{print "   "a[l]"    ";}}print "\n";}}' file

i dont have gwk installed here, so i didn't used that.

Cheers,
Ranga:)

2 Likes

gawk is installed but not getting output

$ gawk '{f1[$3]++;f2[$2]++;a[$3"^"$2]++;}END{asort(f1);ORS="";print "      ";for(j in f2){print j" ";c++;if(c>5){print " ";}}print "\n";for(i in f1){print i;for(k in f2){l=i"^"k;if(a[l] == ""){print "        ";}else{print "   "a[l]"    ";}}print "\n";}}' rm2
      state12 state7 state13 state8 state14 state9  state15  state1  state2  states  state3  state4  state10  state5  state11  state6  
4                                                                                                                                
1                                                                                                                                
2                                                                                                                                
3 

but awk version is working great.

key1    state1    cell3 
key1    state14    cell3 
key2    state1    cell3 
key3    state1    cell3 
key4    state1    cell3

What exactly is it doing?

please refer to the previous post. I just updated it. Thanx

Just make a try, i am not sure it will work perfectly since i dont have gawk.

your code is useful to count combination of col 2&3.. :b:

but doesn't count the occurrence of how many different keys for these pair of values are in the file?

but doesn't count the occurrence of how many different keys for these pair of values are in the file?

I didn't get your point clearly. we have counts for each combiantion right ?

I didn't take care of the matrix format, but all the infos you need are in the output :

awk '{i=$3 FS $2}!a[$0]++{c++}END{for(i in c) print i,c}' yourfile

If you need the detail of which keys :

awk '{i=$3 FS $2}!a[$0]++{c++;x=d;d=(x?x FS:z) $1}END{for(i in c) print i,c,d}' yourfile

You can add a | sort with the options of your choice to make it ordered by cells, or by states etc ...

yes. and we have different keys for combinations.
it was asked for the amount of different keys for each if I understood correctly..

awk '{f1[$3]++;f2[$2]++;a[$3"^"$2]++;}END
[snip...]

f1 is the different cell and f2 is different keys in the input.

Cheers,
Ranga:)

@Ranga: I think cell3 states are wrong in the output of your code.

Your expected output is wrong, just check that we dont have key for state12 and cell3. is that typo error ?

key1    state1    cell3
key1    state14    cell3
key2    state1    cell3
key3    state1    cell3
key4    state1    cell3

You mean 0 if the key is not found,
Yeah i have missed that requirment.

Note: Insert 0 if no data available.

Check with the below code for that requirment.

awk '{f1[$3]++;f2[$2]++;a[$3"^"$2]++;}END{ORS="";print "      ";for(j in f2){print j" ";c++;if(c>5){print " ";}}print "\n";for(i in f1){print i;for(k in f2){l=i"^"k;if(a[l] == ""){print "   0    ";}else{print "   "a[l]"    ";}}print "\n";}}' Input_File

Cheers,
Ranga:)

1 Like

that's confusing to me. I just noticed you didn't used or checked for '$1' (col1 contains keys?)
how to not increase without check if same key for a combination occurred again?

Also can you add this minor change please. normalize the output matrix by dividing total no.of unique keys * 100 ? with a different command by piping

command1 input | command2

For ex: cell1-state1 (2/4)*100

f1[$3]++;f2[$2]++;

we are not using the above arrays count of particular keys. even we can use the below

f1[$3]=1;f2[$2]=1;

yeah you have got point that if i have same key exist more than once for the same combination of cell and state

key1    state1    cell3
key1    state14    cell3
key2    state1    cell3
key3    state1    cell3
key4    state1    cell3
key4    state1    cell3

The above code will display the count as 5 for cell3 and state1 but the actual count count is 4 bcos the same key is exist. but thats not in the input which we have too. if its the case, check with the below code,

awk '{
f1[$3]=1;f2[$2]=1;if(chk[$3"^"$2] != $1){a[$3"^"$2]++;chk[$3"^"$2]=$1;}
}END{ORS="";print "      ";for(j in f2){print j" ";c++;if(c>5){print " ";}}print "\n";for(i in f1){print i;for(k in f2){l=i"^"k;if(a[l] == ""){print "   0    ";}else{print "   "a[l]"    ";}}print "\n";}}' Input_File

Cheers,
Ranga:)

---------- Post updated at 04:49 AM ---------- Previous update was at 04:44 AM ----------

2 is count of keys.
4 is total no of keys.

right ?

where you need this output ?
Expected output format ?

Cheers,
Ranga :slight_smile: