awk to calculate total and percent of field in file

cmccabe · March 30, 2017, 12:37pm

Trying to use awk to print the lines in file that have either REF or SNV in $3 , add a header line, and sort by $4 in numerical order. The code below does that already, but I am stuck on the last part: the total number of lines is counted and printed under Total_Targets , the count of all targets under 15 in $4 goes under Total_less_than , and Percent_less_than is the value of Total_Targets / Total_less_than * 100. There is probably a better way to do this, but my awk is below. Thank you :).

file tab-delimited

Position	Gene	Type	Reads	Total_Targets	Total_less_than	Percent_less_than
chr10:89720664	PTEN	REF	15
chr3:10183752	VHL	REF	20
chr3:10183734	VHL	REF	21
chr3:10183763	VHL	REF	28
chr3:10183754	VHL	REF	20
chr3:10183758	VHL	REF	20
chr10:89720663	PTEN	REF	15
chr3:10183759	VHL	REF	20
chr3:10183764	VHL	REF	28
chr3:10183765	VHL	REF	28
chr10:89720764	PTEN	CN	25
chr10:89721664	PTEN	CN	15

awk

awk -F'\t' -v OFS='\t' '$3=="REF" || $3=="SNV" {print $1,$2,$3,$4}' file | awk 'BEGIN {print  "Position\tGene\tType\tReads\tTotal_Targets\tTotal_less_than\tPercent_less_than"}1' | sort -t $'\t' -k4,4n > out

desired out tab-delimited

Position	Gene	Type	Reads	Total_Targets	Total_less_than	Percent_less_than
chr10:89720663	PTEN	REF	15	10	2	20
chr10:89720664	PTEN	REF	15
chr3:10183752	VHL	REF	20
chr3:10183754	VHL	REF	20
chr3:10183758	VHL	REF	20
chr3:10183759	VHL	REF	20
chr3:10183734	VHL	REF	21
chr3:10183763	VHL	REF	28
chr3:10183764	VHL	REF	28
chr3:10183765	VHL	REF	28

vgersh99 · March 30, 2017, 8:03pm

awk -F'\t' -v OFS='\t' '$3=="REF" || $3=="SNV" {print $1,$2,$3,$4}' file | sort -t $'\t' -k4,4n | (print "Position\tGene\tType\tReads\tTotal_Targets\tTotal_less_than\tPercent_less_than" ; cat -)

cmccabe · March 30, 2017, 8:53pm

I will give this a try tomorrow.

What does ; cat - do? That is new to me. Thank you very much :).

Don_Cragun · March 30, 2017, 10:22pm

I don't see how the code vgersh99 suggested performs the calculations you requested.

The - pathname operand to the cat utility causes cat to copy the contents of standard input to standard output. When there is only one operand, the command cat - produces exactly the same results as the command cat .
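
To see the ( ... ; cat -) idiom in isolation, here is a minimal sketch. It uses printf rather than print, because print is a ksh built-in that bash does not provide. The helper name with_header and the sample numbers are made up for illustration.

```shell
# with_header: write a header line, then copy standard input unchanged.
# (Hypothetical helper name, used only for this illustration.)
with_header() {
    printf '%s\n' "$1"   # the header comes first
    cat -                # then everything arriving on standard input
}

# Usage: sort two numbers and put a header above the result.
printf '20\n15\n' | sort -n | with_header 'Reads'
```

The header is printed before cat starts copying, so it ends up above the sorted lines even though it is produced at the end of the pipeline.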

Moving back to your original problem... If by "under 15" you mean "less than or equal to 15" (rather than "less than 15", which is how I would normally read it) and you really want the common definition of percentage (instead of the formula you specified), then the following seems to do what you want:

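#!/bin/ksh
# Temporary file that receives the sorted records; removed on exit.
TFN=${0##*/}.$$
trap 'rm -f "$TFN"' EXIT

awk -v tfn="$TFN" '
BEGIN {	FS = OFS = "\t"
	# Sort numerically on field 4 (tab-separated) into the temporary file.
	sort_cmd = "sort -t\"\t\" -k4,4n -o \"" tfn "\""
}
NR == 1 {
	# Pass the header line through unchanged.
	print
	next
}
$3 == "REF" || $3 == "SNV" {
	tt++			# total targets
	if($4 <= 15)
		tlt++		# targets with 15 or fewer reads
	print  | sort_cmd
}
END {	# Wait for sort to finish, then read back the sorted records.
	# The totals are appended to the first sorted record only.
	close(sort_cmd)
	while((getline line < tfn) == 1)
		if(++nr == 1)
			print line, tt, tlt, 100 * tlt / tt
		else	print line
}' file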

and it produces exactly the output you said you want. Note that it only invokes awk once (not twice like your script does).

This was written and tested using a Korn shell, but will work with any shell that uses Bourne shell syntax and performs the basic parameter expansions required by the POSIX standards. As always, if you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .
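
As a quick check of the arithmetic, the sketch below only counts the targets and prints the percentage, without any sorting. The function name pct_check is hypothetical, and the cutoff is taken as "less than or equal to 15", as assumed above.

```shell
# pct_check FILE: count REF/SNV targets and those with Reads <= 15.
# (Hypothetical helper; the cutoff of 15 follows the assumption above.)
pct_check() {
    awk -F'\t' '
    NR > 1 && ($3 == "REF" || $3 == "SNV") {
        tt++                        # total targets
        if ($4 + 0 <= 15) tlt++     # targets at or below the cutoff
    }
    END {
        if (tt == 0) { print "no targets"; exit }
        printf "total=%d less=%d percent=%g\n", tt, tlt, 100 * tlt / tt
    }' "$1"
}

# Usage: pct_check file
```

For the sample file in the first post this reports 10 targets, 2 of them at or below 15, and 20 percent. The formula as originally written (Total_Targets / Total_less_than * 100) would give 500 instead.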

Scrutinizer · March 31, 2017, 12:57am

Try using numeric array enumeration order in GNU awk :

awk -v lim=15 '
  NR==1 {
    PROCINFO["sorted_in"] = "@val_num_asc"
    h=$0
    next
  }

  $3=="REF" || $3=="SNV" {
    A[$0]=$4
    tt++
    if($4<=lim)
      tl++
  }

  END{
    print h
    for (i in A)
      if(!n++) print i, tt, tl, tl*100/tt
      else print i
  }
' FS='\t' OFS='\t' infile
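
The effect of PROCINFO["sorted_in"] is easiest to see on a tiny array. This is a sketch that requires GNU awk, invoked here as gawk (an assumption about how it is installed); the function name and the data are made up for illustration.

```shell
# sorted_demo: print array elements ordered by their numeric values.
# Requires GNU awk; in other awks PROCINFO is an ordinary array and
# the setting has no effect on traversal order.
sorted_demo() {
    gawk 'BEGIN {
        PROCINFO["sorted_in"] = "@val_num_asc"  # traverse by value, ascending
        A["c"] = 28; A["a"] = 9; A["b"] = 100
        for (i in A) print i, A[i]
    }'
}

# Usage: sorted_demo
```

Because the script above uses the whole line as the array index, two identical input lines would collapse into a single element, so it relies on every selected line being unique.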

--

With a GNU awk coprocess you can also avoid the temporary file when using an external sort.
For example:

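awk -v lim=15 '
  NR==1 {
    cmd="sort -k4,4n -k1,1"   # coprocess: sort by field 4, then by field 1
    h=$0                      # save the header line
    next
  }

  $3=="REF" || $3=="SNV" {
    tt++                      # total targets
    if($4<=lim)
      tl++                    # targets at or below the limit
    print |& cmd              # send the record to the coprocess
  }

  END{
    print h
    close(cmd, "to")          # close the write end so sort sees end of input
    while (( cmd |& getline)>0) {
      if(!n++) print $0, tt, tl, tl*100/tt
      else print
    }
  }
' FS='\t' OFS='\t' infile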
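
If GNU awk is not available, one alternative sketch is to let the shell connect the stages and have a second awk buffer the sorted lines until the totals are known. This is not what the scripts above do; the function name report is hypothetical, and the limit defaults to 15 as in the scripts above.

```shell
# report FILE [LIMIT]: filter, sort by Reads, append totals to the first row.
# (Hypothetical helper; uses only POSIX awk, sort and head.)
report() {
    tab=$(printf '\t')
    {
        head -n 1 "$1"                               # header passes through
        awk -F'\t' 'NR > 1 && ($3 == "REF" || $3 == "SNV")' "$1" |
            sort -t "$tab" -k4,4n -k1,1
    } | awk -F'\t' -v OFS='\t' -v lim="${2:-15}" '
    NR == 1 { print; next }                          # header
    { line[++n] = $0; if ($4 + 0 <= lim) tl++ }      # buffer sorted rows
    END {
        for (i = 1; i <= n; i++)
            if (i == 1) print line[i], n, tl + 0, tl * 100 / n
            else        print line[i]
    }'
}

# Usage: report file > out
```

The second awk has to hold all sorted rows in memory before printing, which is the price for not using a temporary file or a coprocess.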