Sort file by field 1 that has text as well as a number

I am using the awk command below, which produces the output that follows:

awk '{k=$1 OFS $2; s[k]+=$4; c[k]++} END{for(i in s) print i, s[i]/c[i]}' input.txt > output.txt

output.txt

chr20:43625799-43625957 STK4:exon.6;STK4:exon.7 310.703
chr20:36770455-36770611 TGM2:exon.6;TGM2:exon.7 614.756
chr20:19945585-19945678 RIN2:exon.6;RIN2:exon.7 175.258
chr20:10632768-10632908 JAG1:exon.5;JAG1:exon.7 319.586
chr20:8630010-8630106 PLCB1:exon.1;PLCB1:exon.7 183.188
chr19:17952438-17952581 JAK3:exon.2;JAK3:exon.6;JAK3:exon.7 306.566
chr19:13051547-13051711 CALR:exon.3;CALR:exon.6;CALR:exon.7 337.811
chr19:13006795-13006945 GCDH:exon.5;GCDH:exon.6;GCDH:exon.7 628.62
chr19:11491549-11491657 EPOR:exon.1;EPOR:exon.6;EPOR:exon.7 301.87
chr18:3456341-3456588 TGIF1:exon.1;TGIF1:exon.2;TGIF1:exon.3 430.332
chr15:90630333-90630505 IDH2:exon.5;IDH2:exon.7 516.128 

I cannot seem to pipe the result into a sort on the first column that would reorder the file in ascending order. I think the "chr" text is messing up the sort, but I'm not sure. Thank you :).

Desired output

chr1
chr1
chr2
chr2
chr3
....
....

EDIT: I thought the command below would sort numerically on the first column, starting at its fourth character, but that's not working.

awk '{k=$1 OFS $2; s[k]+=$4; c[k]++} END{for(i in s) print i, s[i]/c[i]}' input > output.txt | sort -k1.4 -n output.txt
awk ' .... ' | awk -F: '{print substr($1, 4), $0}' OFS=':' | sort -t: -k1,1n | cut -d: -f2-
awk '{k=$1 OFS $2; s[k]+=$4; c[k]++} END{for(i in s) print i, s[i]/c[i]}' input | sort -k1.4 -n > output.txt
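
For anyone comparing this with the attempt in the EDIT: there, `> output.txt` sends awk's output to the file, so nothing travels through the pipe, and sort then reads output.txt and writes to the terminal instead of back to the file. Piping first and redirecting last avoids that. A minimal sketch follows; the four-column sample input is made up for illustration (the real input.txt was never posted), with the value to average assumed to be in column 4.

```shell
#!/bin/sh
# Fixed locale so numeric parsing in sort is predictable.
LC_ALL=C; export LC_ALL

# Hypothetical sample input: column 3 ("x") is a placeholder, column 4 is the value.
sorted=$(awk '{k=$1 OFS $2; s[k]+=$4; c[k]++}
              END{for(i in s) print i, s[i]/c[i]}' <<'EOF' |
chr20:100-200 STK4:exon.6 x 300
chr20:100-200 STK4:exon.6 x 320
chr3:50-90 GENE3:exon.1 x 10
chr15:10-20 IDH2:exon.5 x 500
EOF
  sort -k1.4 -n)   # key starts at character 4 of field 1, just past "chr"

printf '%s\n' "$sorted"
```

With this sample, chr3 should come before chr15 and chr20, which a plain text sort would not give.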

Try:

sort -nt: -k1.4,1 -k2,2 

to sort the ranges as well
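
As a quick check of how the two keys behave, here is a small sketch run on four lines taken from the output posted in the question. With `-t:` the fields are split on colons, `-k1.4,1` compares the digits after "chr" in field 1, and `-k2,2` compares the leading number of field 2 (the range start); `-n` applies to both keys.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# First key: chromosome number; second key: start of the range.
ranges_sorted=$(sort -nt: -k1.4,1 -k2,2 <<'EOF'
chr20:43625799-43625957 STK4:exon.6;STK4:exon.7 310.703
chr20:8630010-8630106 PLCB1:exon.1;PLCB1:exon.7 183.188
chr19:17952438-17952581 JAK3:exon.2;JAK3:exon.6;JAK3:exon.7 306.566
chr15:90630333-90630505 IDH2:exon.5;IDH2:exon.7 516.128
EOF
)

printf '%s\n' "$ranges_sorted"
```

The two chr20 lines should come out with 8630010 ahead of 43625799, which is the difference from sorting on the chromosome number alone.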


Hi.

The msort utility allows fields to be described as hybrid, that is, a mix of characters and numbers:

#!/usr/bin/env bash

# @(#) s1	Demonstrate sort of mixed field, "hybrid", with msort.
# If msort is not in repository:
# http://freecode.com/projects/msort

LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C msort

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Results of msort:"
msort -l -q -j -d: -n 1 -c hybrid $FILE

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
msort 8.44

-----
 Input data file data1:
chr20:43625799-43625957 STK4:exon.6;STK4:exon.7 310.703
chr20:36770455-36770611 TGM2:exon.6;TGM2:exon.7 614.756
chr20:19945585-19945678 RIN2:exon.6;RIN2:exon.7 175.258
chr20:10632768-10632908 JAG1:exon.5;JAG1:exon.7 319.586
chr20:8630010-8630106 PLCB1:exon.1;PLCB1:exon.7 183.188
chr19:17952438-17952581 JAK3:exon.2;JAK3:exon.6;JAK3:exon.7 306.566
chr19:13051547-13051711 CALR:exon.3;CALR:exon.6;CALR:exon.7 337.811
chr19:13006795-13006945 GCDH:exon.5;GCDH:exon.6;GCDH:exon.7 628.62
chr19:11491549-11491657 EPOR:exon.1;EPOR:exon.6;EPOR:exon.7 301.87
chr18:3456341-3456588 TGIF1:exon.1;TGIF1:exon.2;TGIF1:exon.3 430.332
chr15:90630333-90630505 IDH2:exon.5;IDH2:exon.7 516.128

-----
 Results of msort:
chr15:90630333-90630505 IDH2:exon.5;IDH2:exon.7 516.128
chr18:3456341-3456588 TGIF1:exon.1;TGIF1:exon.2;TGIF1:exon.3 430.332
chr19:17952438-17952581 JAK3:exon.2;JAK3:exon.6;JAK3:exon.7 306.566
chr19:13051547-13051711 CALR:exon.3;CALR:exon.6;CALR:exon.7 337.811
chr19:13006795-13006945 GCDH:exon.5;GCDH:exon.6;GCDH:exon.7 628.62
chr19:11491549-11491657 EPOR:exon.1;EPOR:exon.6;EPOR:exon.7 301.87
chr20:43625799-43625957 STK4:exon.6;STK4:exon.7 310.703
chr20:10632768-10632908 JAG1:exon.5;JAG1:exon.7 319.586
chr20:8630010-8630106 PLCB1:exon.1;PLCB1:exon.7 183.188
chr20:36770455-36770611 TGM2:exon.6;TGM2:exon.7 614.756
chr20:19945585-19945678 RIN2:exon.6;RIN2:exon.7 175.258

See the link listed in the script if msort is not in your repository ... cheers, drl
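
If msort is not available, one possible substitute (a different technique, not msort's hybrid comparison) is the version sort in GNU sort, `-V`, which compares runs of digits inside text numerically. A minimal sketch, assuming a sort that supports `-V`; the chr2 line is made up to show a single-digit chromosome, the other lines come from the data above.

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL

# Version-sort on field 1 only (fields split on ":").
natural_sorted=$(sort -t: -k1,1V <<'EOF'
chr20:8630010-8630106 PLCB1:exon.1;PLCB1:exon.7 183.188
chr2:100-200 GENE_A:exon.1 1.0
chr19:11491549-11491657 EPOR:exon.1;EPOR:exon.6;EPOR:exon.7 301.87
chr15:90630333-90630505 IDH2:exon.5;IDH2:exon.7 516.128
EOF
)

printf '%s\n' "$natural_sorted"
```

Lines that share a chromosome are not ordered by range here; that would need a second key, as in the earlier reply.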


Thank you all :).