Lookup horizontally and vertically and calculate counts

Hello,

Please help create the following report.

I have a matrix

-  S1 S2 S3 S4  
M1 AA AA TT -
M2 AG AG AA GG
M3 GG TT - -

a first lookup table

M3 chr7 4.456
M1 chr7 28.9
M2 chr8 129.678

a second lookup table

S1 GGHBBGG/DEDD(@DCCD)
S2 GGHBBGG/DEDD(@DCCD)//B-
S3 GGHBBGG/DEDD(@HH?)//B1@NNN
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1

I want to count each nucleotide (As, Ts, Cs and Gs) for each row, and a few variables calculated as

total = NF-1
missing = total number of "-"
mono = total number of ( AA + GG + CC + TT)
mix = total number of ( AT + AC + AG + CT + CG + GT)
m = 2nd highest among (A,T,C,G) / total (A,T,C,G)
data = total - missing

An example calculation of m for M1 is there are 4 As and 2 Ts for M1. The rest are 0s
So m for M1 = total number of Ts, which is 2nd highest ( 2 ) / total A,T,G,C (6) = 0.33

Here is what is my report should look like
I cant seem to line up the table for some reason, it is space delimited.

NAME				                                  M1 M2 M3
CHR				                                  chr7 chr8 chr7 
POS				                                  28.9 129.678 4.456
A	                                                          4 4 0			        
T			                                          2 0 2			
G				                                  0 4 2				
C				                                  0 0 0
-				                                  1 0 2
m				                                  0.33 0.5 0.5
data				                                  3 4 2
mono				                                  3 2 2 
mixed				                                  0 2 0
total				                                  4 4 4
S1 GGHBBGG/DEDD(@DCCD)		                                  AA AG GG
S2 GGHBBGG/DEDD(@DCCD)//B-                                        AA AG TT
S3 GGHBBGG/DEDD(@HH?)//B1@NNN                                     TT AA -
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1                                      - GG -

I tried this, please help

 awk 'NR==FNR{ l[$1]=$2 FS $3;next} $1 in l { $1=$1 FS l[$1]}1' lookup1 file |
	       	     awk '{ for (i=2;i++;i<=NF)
	       	     		if ($i=="AA" || $i=="GG" || $i=="CC" || $i=="TT")
	       	     			mono=mono+1
	       	     				if ($i=="AA")
	       	     					a=a+2
	       	     				else if ($i=="GG")
	       	     					g=g+2
	       	     				if ($i=="CC")
	       	     					c=c+2
	       	     				else if ($i=="TT")
	       	     					t=t+2
	       	     				fi
	       	     		else if ($i=="AT" || $i=="AG" || $i=="AC" || $i=="CT" || $i=="CG" || $i=="GT")
	       	     				mix=mix+1
	       	     		
	       	     		else if ($i="-")
	       	     				missing=missing+1	
	       	     		fi
	       	     		total=NF-1
	       	     		data=(NF-1)-missing
	       	     		$1= $1,a,c,t,g,mono,mix,missing,total,data
	       	     		}1' | awk '
		     { 
		         for (i=1; i<=NF; i++)  {
		             a[NR,i] = $i
		         }
		     }
		     NF>p { p = NF }
		     END {    
		         for(j=1; j<=p; j++) {
		             str=a[1,j]
		             for(i=2; i<=NR; i++){
		                 str=str" "a[i,j];
		             }
		             print str
		         }
	             }'  | awk  'NR==FNR{ l[$1]=$2 ;next} $1 in l { $1=$1 FS l[$1]}1' lookup2 - > final_report
	             
        

You're not printing anything in the first awk so there's nothing to pipe to the second. Even if there were, the second awk just reads file1 but not stdin from the pipe (you can use "-" to supply stdin as a "file name"). On top, it doesn't print either. Please be aware that the variable assignments are lost between different awk scripts.

1 Like

Thanks for your suggestions, would you please look at the script now? Also please help with the calculation of m , I feel I have the structure of the code right, but not providing with the desired report. How to include the row names like "NAME", "CHR" ,..etc ?

Try

awk     'FNR==1         {FILE++}

         FILE==1        {CHR[$1]=$2
                         POS[$1]=$3
                         next
                        }

         FILE==2        {S2[$1]=$0
                         next
                        }

         FILE==3        {if (FNR==1)    {for (i=2; i<=NF; i++) S[i-1]=$i
                                         next
                                        }
                         M[$1]
                         for (i=2; i<=NF; i++)  {S1[$1,S[i-1]]=$i
                                                 total[$1]++
                                                 if ($i == "-")  missing[$1]++
                                                 else           {data[$1]++
                                                                 C1=substr($i,1,1)
                                                                 C2=substr($i,2,1)
                                                                 GN[C1]
                                                                 GN[C2]
                                                                 if (C1 == C2)  {mono[$1]++
                                                                                 G[$1,C1]+=2
                                                                                }
                                                                 else           {mixed[$1]++
                                                                                 G[$1,C1]++
                                                                                 G[$1,C2]++
                                                                                }
                                                                }
                                                }
                        }

         END            {printf "%-40s", "NAME"
                           for (m in M) printf "\t%s", m
                           printf "\n"
                         printf "%-40s",  "CHR"
                           for (m in M) printf "\t%s", CHR[m]
                           printf "\n"
                         printf "%-40s",  "POS"
                           for (m in M) printf "\t%s", POS[m]
                           printf "\n"

                         for (g in GN)  {printf  "%-40s", g
                                           for (m in M) printf "\t%s", G[m, g]+0
                                           printf "\n"
                                        }

                         printf  "%-40s", "-"
                           for (m in M) printf "\t%s", missing[m]+0
                           printf "\n"

                         printf "m"
                           printf "\n"

                         printf  "%-40s", "data"
                           for (m in M) printf "\t%s", data[m]+0
                           printf "\n"

                         printf  "%-40s", "mono"
                           for (m in M) printf "\t%s", mono[m]+0
                           printf "\n"

                         printf  "%-40s", "mixed"
                           for (m in M) printf "\t%s", mixed[m]+0
                           printf "\n"

                         printf  "%-40s", "total"
                           for (m in M) printf "\t%s", total[m]+0
                           printf "\n"

                         for (s in S)   {printf  "%-40s", S2
                                           [S]for (m in M) printf "\t%s", S1[m, S]
                                           printf "\n"
                                        }
                        }
        ' lookup1 lookup2 matrix
NAME                                        M1    M2    M3
CHR                                         chr7    chr8    chr7
POS                                         28.9    129.678    4.456
A                                           4    4    0
G                                           0    4    2
T                                           2    0    2
-                                           1    0    2
m
data                                        3    4    2
mono                                        3    2    2
mixed                                       0    2    0
total                                       4    4    4
S1 GGHBBGG/DEDD(@DCCD)                      AA    AG    GG
S2 GGHBBGG/DEDD(@DCCD)//B-                  AA    AG    TT
S3 GGHBBGG/DEDD(@HH?)//B1@NNN               TT    AA    -
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1               -    GG    -

Base C is missing as it doesn't occur in the matrix. It is assumed that the data field count is constant across the matrix. Please excuse the very short variable/array names - I didn't konw the meaning behind them so could find a meaningful name.

I don't have the sligtest idea on how to compute m . The denominator is easy: 2 * (total - missing). But how to find the numerator, the second highest base count?

1 Like

Yes, the numerator is the minor base count, or the second highest base count.,,

M1 it is T (2)
M2 and M3 , the highest   and the second highest are the same , so its 4

That doesn't necessarily help. I've seen that, but don't have an idea for an algorithm to find it.

Could I suggest these changes in red for m calculation:

                         for (g in GN)  {printf  "%-40s", g
                                           for (m in M) {
                                               printf "\t%s", G[m, g]+0
                                               if(G[m, g] > Max1[m]) {
                                                  Max2[m]=Max1[m]+0
                                                  Max1[m]=G[m, g]
                                               } else if (G[m, g] > Max2[m]) Max2[m]=G[m, g]
                                           }
                                           printf "\n"
                                        }

                         printf  "%-40s", "-"
                           for (m in M) printf "\t%s", missing[m]+0
                           printf "\n"

                         printf  "%-40s", "m"
                           for (m in M) printf "\t%4.2f", 
                               missing[m]>=total[m]?0:(Max2[m]+0)/(2*(total[m]-missing[m]))
                           printf "\n"
2 Likes

Brilliant! I seem to have been a bit worn out last night...

1 Like

thanks, working great ! I will try to learn from this