I am trying to use awk to print the lines in file that have either REF or SNV in $3, add a header line, and sort by $4 in numerical order. The code below does that already. Where I am stuck is the last part: the total number of lines is counted and printed under Total_Targets; under Total_less_than goes the count (just the count) of all targets under 15 in $4; and under Percent_less_than goes the value of Total_Targets / Total_less_than * 100. There is probably a better way to do this, but my awk is below. Thank you :).
file tab-delimited
Position Gene Type Reads Total_Targets Total_less_than Percent_less_than
chr10:89720664 PTEN REF 15
chr3:10183752 VHL REF 20
chr3:10183734 VHL REF 21
chr3:10183763 VHL REF 28
chr3:10183754 VHL REF 20
chr3:10183758 VHL REF 20
chr10:89720663 PTEN REF 15
chr3:10183759 VHL REF 20
chr3:10183764 VHL REF 28
chr3:10183765 VHL REF 28
chr10:89720764 PTEN CN 25
chr10:89721664 PTEN CN 15
awk
awk -F'\t' -v OFS='\t' '$3=="REF" || $3=="SNV" {print $1,$2,$3,$4}' file | awk 'BEGIN {print "Position\tGene\tType\tReads\tTotal_Targets\tTotal_less_than\tPercent_less_than"}1' | sort -t $'\t' -k4,4n > out
desired out tab-delimited
Position Gene Type Reads Total_Targets Total_less_than Percent_less_than
chr10:89720663 PTEN REF 15 10 2 20
chr10:89720664 PTEN REF 15
chr3:10183752 VHL REF 20
chr3:10183754 VHL REF 20
chr3:10183758 VHL REF 20
chr3:10183759 VHL REF 20
chr3:10183734 VHL REF 21
chr3:10183763 VHL REF 28
chr3:10183764 VHL REF 28
chr3:10183765 VHL REF 28
awk -F'\t' -v OFS='\t' '$3=="REF" || $3=="SNV" {print $1,$2,$3,$4}' file | sort -t $'\t' -k4,4n | (print "Position\tGene\tType\tReads\tTotal_Targets\tTotal_less_than\tPercent_less_than" ; cat -)
I will give that a try tomorrow. What does ; cat - do? That is new to me. Thank you very much :).
I don't see how the code vgersh99 suggested performs the calculations you requested.
The - pathname operand to the cat utility causes cat to copy the contents of standard input to standard output. When - is the only operand, the command cat - produces exactly the same results as the command cat with no operands.
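As a small illustration of that behaviour, the sketch below puts a header in front of sorted input by printing it first and then letting cat - copy standard input through. The function name with_header and the two sample rows are made up for the example, and printf is used here so the sketch does not depend on the ksh print built-in.

```shell
# Prepend a header line to whatever arrives on standard input.
# with_header and the sample rows below are hypothetical, for illustration only.
with_header() {
    printf 'Position\tReads\n'   # the header is written first
    cat -                        # then standard input is copied through unchanged
}

# Sort two tab-separated rows numerically on field 2, then add the header.
printf 'chr3:1\t20\nchr10:2\t15\n' |
    sort -t "$(printf '\t')" -k2,2n |
    with_header
```

Because the header is written after sort has run, it never takes part in the sort and always stays on the first line.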
Moving back to your original problem... If by "under 15" you mean "less than or equal to 15" (rather than "less than 15", which is how I would normally interpret it) and you really want the common definition of percentage (instead of the formula you specified), then the following seems to do what you want:
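#!/bin/ksh
# Temporary file that receives the sorted records; removed when the script exits.
TFN=${0##*/}.$$
trap 'rm -f "$TFN"' EXIT
awk -v tfn="$TFN" '
BEGIN { FS = OFS = "\t"
        sort_cmd = "sort -t\"\t\" -k4,4n -o \"" tfn "\""
}
NR == 1 {
        # Pass the header line through unchanged.
        print
        next
}
$3 == "REF" || $3 == "SNV" {
        tt++                    # total targets
        if($4 <= 15)
                tlt++           # targets with 15 reads or fewer
        print | sort_cmd
}
END {   close(sort_cmd)         # wait until sort has finished writing tfn
        while((getline line < tfn) == 1)
                if(++nr == 1)
                        print line, tt, tlt, 100 * tlt / tt
                else    print line
}' file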
and it produces exactly the output you said you want. Note that it only invokes awk once (not twice like your script does).
This was written and tested using a Korn shell, but will work with any shell that uses Bourne shell syntax and performs the basic parameter expansions required by the POSIX standards. As always, if you want to run this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
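To see why the reading of "under 15" matters, here is a minimal sketch that counts the same way as the script above but lets you choose between <= and <. The helper names count_under and sample, and the le/lt switch, are hypothetical; the five rows are taken from the sample file in the question.

```shell
# count_under LIMIT MODE: read tab-separated rows on standard input and print
# "total count percent" for REF/SNV rows whose field 4 is <= LIMIT (MODE=le)
# or < LIMIT (MODE=lt). The names and the MODE switch are illustrative only.
count_under() {
    awk -F'\t' -v lim="$1" -v mode="$2" '
        $3 == "REF" || $3 == "SNV" {
            tt++
            if ((mode == "le" && $4 <= lim) || (mode == "lt" && $4 < lim))
                tlt++
        }
        END { if (tt) printf "%d %d %g\n", tt, tlt, 100 * tlt / tt }
    '
}

# Five rows copied from the question; the CN row is ignored by the filter.
sample() {
    printf 'chr10:89720664\tPTEN\tREF\t15\n'
    printf 'chr3:10183752\tVHL\tREF\t20\n'
    printf 'chr3:10183734\tVHL\tREF\t21\n'
    printf 'chr3:10183763\tVHL\tREF\t28\n'
    printf 'chr10:89720764\tPTEN\tCN\t25\n'
}

sample | count_under 15 le
sample | count_under 15 lt
```

With these rows the le reading reports 4 1 25 and the lt reading reports 4 0 0, so the choice of comparison changes the answer whenever a target has exactly 15 reads.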
Try using numeric array enumeration order in GNU awk:
awk -v lim=15 '
  NR==1 {
    PROCINFO["sorted_in"] = "@val_num_asc"
    h=$0
    next
  }
  $3=="REF" || $3=="SNV" {
    A[$0]=$4
    tt++
    if($4<=lim)
      tl++
  }
  END{
    print h
    for (i in A)
      if(!n++) print i, tt, tl, tl*100/tt
      else print i
  }
' FS='\t' OFS='\t' infile
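For anyone who has not used PROCINFO["sorted_in"] before, here is a stripped-down sketch of just that part. It assumes gawk is installed (this is a GNU extension and, as far as I know, needs gawk 4.0 or later); the function name sorted_by_value is made up, and the three positions and read counts come from the question's file.

```shell
# Walk an array in ascending numeric order of its VALUES, not its indices.
# Requires GNU awk; sorted_by_value is a hypothetical name for this demo.
sorted_by_value() {
    gawk '
        BEGIN {
            reads["chr3:10183763"]  = 28
            reads["chr10:89720664"] = 15
            reads["chr3:10183734"]  = 21
            # From here on, every for-in loop visits elements by numeric value.
            PROCINFO["sorted_in"] = "@val_num_asc"
            for (pos in reads)
                print pos, reads[pos]
        }
    '
}

sorted_by_value
```

The loop visits the elements with 15, 21 and 28 reads in that order, which is why the script above can print the records already sorted without calling an external sort.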
--
With GNU awk co-processing you can also avoid using a temporary file when using an external sort.
For example:
awk -v lim=15 '
  NR==1 {
    cmd="sort -k4,4n -k1,1"
    h=$0
    next
  }
  $3=="REF" || $3=="SNV" {
    tt++
    if($4<=lim)
      tl++
    print |& cmd
  }
  END{
    print h
    close(cmd, "to")
    while (( cmd |& getline)>0) {
      if(!n++) print $0, tt, tl, tl*100/tt
      else print
    }
  }
' FS='\t' OFS='\t' infile