Help with awk in counting characters based on a column

Homa · November 26, 2012, 8:31am

Hello,
I am using awk on Ubuntu 12.04.

My file has 2172 rows and 44707 columns and looks like the sample below. ABO and GPO are the names of my populations.

ABO_1  1  2
ABO_1  1  2
ABO_2  1  1 
ABO_2  1  2
GPO_1   1  1 
GPO_1  2  2
GPO_2   1  0 
GPO_2  2  0

For each population, I want to count the number of 1s and 2s in every column. Any 0s should be ignored, but a count of 0 should be printed when a column has no 1s or no 2s. The output should look like this:

4 0 2 2 
1 3 1 1

Here 4 0 is the number of 1s and 2s in the second column for the first population, 1 3 is the number of 1s and 2s in the third column for the first population, and so on.
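
For reference, the layout described above (one output row per data column, with a pair of counts per population) could be produced along these lines. This is only a sketch: the function name count_by_column is made up for illustration, and it assumes whitespace-separated input, population labels of the form NAME_n, and enough memory to hold two counters per population and column.

```shell
# count_by_column: hypothetical helper name, not from the thread.
count_by_column() {
    awk '
    {
        pop = $1
        sub(/_[^_]*$/, "", pop)            # ABO_1 -> ABO
        if (!(pop in seen)) {              # keep populations in first-seen order
            seen[pop] = 1
            order[++npop] = pop
        }
        for (i = 2; i <= NF; i++)          # 0s and anything else are ignored
            if ($i == "1") one[pop, i]++
            else if ($i == "2") two[pop, i]++
        if (NF > maxnf) maxnf = NF
    }
    END {
        for (i = 2; i <= maxnf; i++) {     # one output row per data column
            for (p = 1; p <= npop; p++)
                printf "%s%d %d", (p > 1 ? " " : ""), one[order[p], i] + 0, two[order[p], i] + 0
            print ""
        }
    }' "$@"
}
```

Run as count_by_column file on the eight sample rows, this should print 4 0 2 2 and then 1 3 1 1.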

Thank you very much for your help.

pamu · November 26, 2012, 8:41am

Try

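awk -F "[_ ]" '
# Print the counts for the population just finished, then reset them.
function print_o() {
    print X[1]?X[1]:0, X[2]?X[2]:0, Y[1]?Y[1]:0, Y[2]?Y[2]:0
    delete X[1]; delete Y[1]
    delete X[2]; delete Y[2]
}
$1 != s && NR > 1 { print_o() }      # population name changed
{ X[$3]++; Y[$4]++; s = $1 }         # X counts the first data column, Y the second
END { print_o() }' file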

Homa · November 26, 2012, 9:00am

Thank you. I tried it on the test file posted above and it is extremely slow: it has not finished yet, so it would take even longer on my real, much larger file. I have written the following code myself:

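{
    # count 1s and 2s separately for each column (column 1 is the name)
    for (i = 2; i <= NF; i++)
        if ($i == "1") c_one[i]++
        else if ($i == "2") c_two[i]++
}
END {
    # one output line per column: number of 1s, then number of 2s
    for (i = 2; i <= NF; i++)
        printf("%d %d\n", c_one[i], c_two[i])
}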

But this only works when the populations are in separate files, that is, ABO in one file and GPO in another. Maybe it can be modified to handle a single file that contains all the populations.

---------- Post updated at 09:00 AM ---------- Previous update was at 08:53 AM ----------

Sorry, I made a mistake. It is not slow, but it gives me these numbers:

0 0 6 2

for the file above, which is not correct.
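
One thing worth checking when the counts come out wrong: the sample in the first post has runs of two or three spaces between some values, and with -F "[_ ]" every single space is a field separator, so $3 and $4 can end up empty. Whether that is what happened here is only a guess. The sketch below (show_fields is a hypothetical helper) compares that separator with "[_ ]+", which treats a run of spaces as one separator.

```shell
# show_fields: hypothetical helper that prints NF and fields 3 and 4
# in brackets for the field-separator regex given as its first argument.
show_fields() {
    awk -F "$1" '{ print NF, "[" $3 "]", "[" $4 "]" }'
}

line='GPO_1   1  1 '                          # row from the first post, spacing kept
printf '%s\n' "$line" | show_fields '[_ ]'    # every single space separates
printf '%s\n' "$line" | show_fields '[_ ]+'   # a run of spaces counts once
```

The first call prints 8 [] [] and the second prints 5 [1] [1].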

pamu · November 26, 2012, 9:10am

Please check:

$ cat file
ABO_1 1 2
ABO_1 1 2
ABO_2 1 1
ABO_2 1 2
GPO_1 1 1
GPO_1 2 2
GPO_2 1 0
GPO_2 2 0
ABO_1 1 2
ABO_1 1 2
ABO_2 1 1
ABO_2 1 2

$ awk -F "[_ ]" 'function print_o(){
print X[1]?X[1]:0,X[2]?X[2]:0,Y[1]?Y[1]:0,Y[2]?Y[2]:0;
delete X[1];
delete Y[1];
delete X[2];
delete Y[2];
}
$1 != s && NR > 1{print_o()}
{X[$3]++;Y[$4]++;s=$1}END{print_o()}' file
4 0 1 3
2 2 1 1
4 0 1 3

$ awk -F "[_ ]" 'function print_o(){
print X[1,fn],X[2,fn],Y[1,fn],Y[2,fn];
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{X[$3,fn]++;Y[$4,fn]++;s=$1}END{print_o()}' OFS="\t" file
4               1       3
2       2       1       1
4               1       3

$ awk -F "[_ ]" 'function print_o(){
print X[1,fn]?X[1,fn]:0,X[2,fn]?X[2,fn]:0,Y[1,fn]?Y[1,fn]:0,Y[2,fn]?Y[2,fn]:0;
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{X[$3,fn]++;Y[$4,fn]++;s=$1}END{print_o()}'  file
4 0 1 3
2 2 1 1
4 0 1 3

Choose whichever option you prefer.
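
The second and third options differ only in what gets printed for a count that was never incremented. A small sketch of that behaviour, using a made-up array name seen: an unset element prints as an empty string, and adding 0 is a shorter alternative to the X[k]?X[k]:0 test.

```shell
# zero_fill_demo: hypothetical name, for illustration only.
zero_fill_demo() {
    awk 'BEGIN {
        seen[1] = 4                      # only the count for value 1 exists
        print seen[1], seen[2]           # unset entry prints as an empty string
        print seen[1] + 0, seen[2] + 0   # adding 0 forces a numeric 0
    }'
}
```

The second line of output is 4 0, while the first line shows only the 4.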

pamu

Homa · November 26, 2012, 9:38am

That works, thanks a lot. I am sorry, but I still have a problem: my original file contains 47 populations. The script works well on the test file, but on my original file it prints 4 columns where there should be 47*2 columns. Sorry for the basic questions.

pamu · November 26, 2012, 9:50am

Try something like this.
The loop starts at i=3 because, with -F "[_ ]", fields 1 and 2 hold the population name and its number, which should not be counted as occurrences of 1 and 2.

awk -F "[_ ]" 'function print_o(){
for(i=3;i<=NF;i++){
print X[1,fn,i]?X[1,fn,i]:0,X[2,fn,i]?X[2,fn,i]:0,Y[1,fn,i]?Y[1,fn,i]:0,Y[2,fn,i]?Y[2,fn,i]:0;
}
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{for(i=3;i<=NF;i++){X[$i,fn,i]++;Y[$i,fn,i]++;s=$1}}END{print_o()}'  file

Homa · November 26, 2012, 10:01am

Unfortunately, it still gives me 4 columns. I will try to separate the populations into different files.

pamu · November 26, 2012, 10:19am

Sorry, I made a mistake there.

Now try:

$ cat file
ABO_1 1 2 2 1 1 0
ABO_1 1 2 2 1 0 2
ABO_1 1 2 2 1 1 1
ABO_1 1 2 2 1 2 2
KK_1 1 2 2 1 1 0
KK_1 1 2 2 1 0 2
KK_1 1 2 2 1 1 1
KK_1 1 2 2 1 2 2

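$ awk -F "[_ ]" '
# Print the counts of 1s and 2s for every data column of the group just read.
function print_o() {
    for (i = 3; i <= NF; i++)
        printf "%s %s ", X[1,fn,i] ? X[1,fn,i] : 0, X[2,fn,i] ? X[2,fn,i] : 0
    print ""
    fn = NR                                  # key for the next group
}
$1 != s && NR > 1 { print_o() }              # population name changed
NR == 1 { fn = NR }
# X[value, group, field] counts each value per field; fields 1-2 are the name
{ for (i = 3; i <= NF; i++) X[$i,fn,i]++; s = $1 }
END { print_o() }' file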
4 0 0 4 0 4 4 0 2 1 1 2
4 0 0 4 0 4 4 0 2 1 1 2
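
If memory becomes a concern on the full-size file, a variant that clears its counters after each population might help. This is a sketch rather than a drop-in replacement: count_by_population and emit are made-up names, it uses awk's default whitespace splitting and strips the _n suffix itself instead of using -F "[_ ]", and like the script above it assumes that all rows of a population are adjacent.

```shell
# count_by_population: hypothetical helper name, not from the thread.
count_by_population() {
    awk '
    # emit: print "ones twos" for each data column, then clear the counters
    function emit(   i) {
        for (i = 2; i <= nf; i++)
            printf "%s%d %d", (i > 2 ? " " : ""), one[i] + 0, two[i] + 0
        print ""
        split("", one); split("", two)   # portable way to empty both arrays
    }
    {
        pop = $1
        sub(/_[^_]*$/, "", pop)          # ABO_1 -> ABO
        if (NR > 1 && pop != prev) emit()
        prev = pop
        nf = NF                          # remember the width of this group
        for (i = 2; i <= NF; i++)
            if ($i == "1") one[i]++
            else if ($i == "2") two[i]++
    }
    END { if (NR > 0) emit() }' "$@"
}
```

Called as count_by_population file on the sample above, it should print the same two lines of counts, without the trailing space.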