awk to count and rename based on fields

In the awk below, using the tab-delimited input, I am trying to count the - symbol in $5 and output that count under the renamed category ins. I also want to count the - symbol in $6 and output that count under the renamed category del. Finally, I want to count the rows where $5 and $6 both contain letters, and output that count under the renamed category snp.

input

Index    Mutation Call    Start    End    Ref    Alt    Func.refGene    Gene.refGene    ExonicFunc.refGene    Sanger
13    c.[1035-3T>C]+[1035-3T>C]    166170127    166170127    T    C    intronic    SCN2A        
16    c.[2994C>T]+[=]    166210776    166210776    C    T    exonic    SCN2A    synonymous SNV    
19    c.[4914T>A]+[4914T>A]    166245230    166245230    T    A    exonic    SCN2A    synonymous SNV    
20    c.[5109C>T]+[=]    166245425    166245425    C    T    exonic    SCN2A    synonymous SNV    
21    c.[5139C>T]+[=]    166848646    166848646    G    A    exonic    SCN1A    synonymous SNV    
22    c.3152_3153insAACCACT    166892841    166892841    -    AGTGGTT    exonic    SCN1A    frameshift insertion    TP
23    c.2044-5delT    166898947    166898947    A    -    intronic    SCN1A        
25    c.1530_1531insA    166901684    166901684    -    T    exonic    SCN1A    frameshift insertion    FP

current output

Category  Count
ins       del    2

desired output

Category  Count
ins           2
del           1
snp           5

Post the awk program that you used.


sorry:


awk -F'\t' '$5=="-"{count++}
            $4=="-"{count++} 
                  END{print "Category","Count"; 
                      print "indel",count+0}' input | # replace nulls with zero
  column -t > count # print out tab-delimited

Hello cmccabe,

Sorry to say, but I am not able to understand this. Here are some questions:

i- What do you mean by "renamed" ins and del here?
ii- Are you trying to fill any field with the above-mentioned keywords?
iii- I can see the strings del and ins in the rows with Index 23 and 25 respectively, so is it related to those? They appear in the second column, though (assuming the field separator is a space or tab here).

Please post more meaningful data samples and expected output samples, so that we can try to help you.

Thanks,
R. Singh

awk -F'\t' '$5=="-"{count++}   # count - in $5
            $6=="-"{count++}   # count - in $6
            END{print "Category","Count"
                print "indel",count+0}   # count+0 prints 0 instead of an empty string
' out |
  column -t > count   # align the output into columns

i- Since I am just counting -, I rename the count based on which field the - was found in. For example:
if the - was counted in $5, it is printed as ins
if the - was counted in $6, it is printed as del
if $5 and $6 both contain letters, the row is counted and printed as snp

ii- I am not filling the fields with data, rather using the data already there to output the result.

iii- Those keywords happen to be in field $2 in this example, but that is not always the case.
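
The renaming rules in points i to iii can be sketched as a per-row classification. This is only an illustration under some assumptions: the sample file holds just the first six columns of three rows from the input above, the `labels` and `sample` variable names are hypothetical, and a row is given the first label that matches.

```shell
# Hypothetical sample: header plus three rows from the thread, first six columns only.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    Index 'Mutation Call' Start End Ref Alt \
    13 'c.[1035-3T>C]+[1035-3T>C]' 166170127 166170127 T C \
    22 'c.3152_3153insAACCACT' 166892841 166892841 - AGTGGTT \
    23 'c.2044-5delT' 166898947 166898947 A - > "$sample"

# Print the Index ($1) of each row together with the category it falls into.
labels=$(awk -F'\t' '
    NR == 1   { next }                       # skip the header line
    $5 == "-" { print $1, "ins"; next }      # - in Ref ($5) is reported as ins
    $6 == "-" { print $1, "del"; next }      # - in Alt ($6) is reported as del
    $5 ~ /^[A-Za-z]+$/ && $6 ~ /^[A-Za-z]+$/ {
        print $1, "snp"                      # letters in both fields are reported as snp
    }
' OFS='\t' "$sample")

printf '%s\n' "$labels"
rm -f "$sample"
```

With this sample, Index 13 is labelled snp, 22 is labelled ins, and 23 is labelled del. Nothing is written back into the fields; the labels exist only in the output.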

Thank you :).

Try this:-

awk -F'\t' '
        NR == 1 {
                print "Category", "Count"
                next
        }
        $5 == "-" {
                ++A["ins"]
        }
        $6 == "-" {
                ++A["del"]
        }
        $5 != "-" && $6 != "-" {
                ++A["snp"]
        }
        END {
                for ( k in A )
                        print k, A[k]
        }
' OFS='\t' file
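
One thing to keep in mind with the array version: `for (k in A)` visits the keys in an unspecified order, and a category that never occurs is not printed at all. If the fixed ins, del, snp order of the desired output matters, a variant with plain counters may be easier. The sketch below is an assumption-laden illustration, not a replacement for the answer above: the `count_variants` function and the `n_ins`, `n_del`, `n_snp` counters are hypothetical names, and the sample file keeps only the first six columns of the posted input.

```shell
# Hypothetical helper: count ins/del/snp rows in a tab-delimited file ($1 of the function).
count_variants() {
    awk -F'\t' '
        NR == 1   { next }                        # skip the header line
        $5 == "-" { n_ins++ }                     # - in Ref ($5)
        $6 == "-" { n_del++ }                     # - in Alt ($6)
        $5 != "-" && $6 != "-" { n_snp++ }        # neither field is -
        END {
            print "Category", "Count"
            print "ins", n_ins + 0                # + 0 prints 0 for an unused counter
            print "del", n_del + 0
            print "snp", n_snp + 0
        }
    ' OFS='\t' "$1"
}

# Sample data: the eight rows from the thread, first six columns only.
sample=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    Index 'Mutation Call' Start End Ref Alt \
    13 'c.[1035-3T>C]+[1035-3T>C]' 166170127 166170127 T C \
    16 'c.[2994C>T]+[=]' 166210776 166210776 C T \
    19 'c.[4914T>A]+[4914T>A]' 166245230 166245230 T A \
    20 'c.[5109C>T]+[=]' 166245425 166245425 C T \
    21 'c.[5139C>T]+[=]' 166848646 166848646 G A \
    22 'c.3152_3153insAACCACT' 166892841 166892841 - AGTGGTT \
    23 'c.2044-5delT' 166898947 166898947 A - \
    25 'c.1530_1531insA' 166901684 166901684 - T > "$sample"

counts=$(count_variants "$sample")
printf '%s\n' "$counts"
rm -f "$sample"
```

On the posted rows this prints ins 2, del 1, snp 5 under the Category/Count header, in that order.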

Thank you very much :).