Awk: get upper and lower bound per group

Hi all,

I have data like this:

22      51018157        51018157        exonic  CHKB    nonsynonymous SNV
22      51018204        51018204        exonic  CHKB    nonsynonymous SNV
22      51018428        51018428        exonic  CHKB    nonsynonymous SNV
22      51018814        51018814        exonic  CHKB    nonsynonymous SNV
22      51019001        51019001        exonic  CHKB    nonsynonymous SNV
22      51019849        51019849        exonic  CHKB    nonsynonymous SNV
22      51020736        51020736        exonic  CHKB    nonsynonymous SNV
22      51021027        51021027        exonic  CHKB    nonsynonymous SNV
22      51021197        51021197        exonic  CHKB    nonsynonymous SNV
22      51063758        51063758        exonic  ARSA    nonsynonymous SNV
22      51063778        51063778        exonic  ARSA    nonsynonymous SNV
22      51063820        51063820        exonic  ARSA    nonsynonymous SNV
22      51063845        51063845        exonic  ARSA    nonsynonymous SNV
22      51064416        51064416        exonic  ARSA    nonsynonymous SNV
22      51064489        51064489        exonic  ARSA    nonsynonymous SNV
22      51065266        51065266        exonic  ARSA    nonsynonymous SNV
22      51065287        51065287        exonic  ARSA    nonsynonymous SNV
22      51065341        51065341        exonic  ARSA    nonsynonymous SNV
22      51065361        51065361        exonic  ARSA    nonsynonymous SNV
22      51066194        51066194        exonic  ARSA    nonsynonymous SNV
22      51143462        51143462        exonic  SHANK3  nonsynonymous SNV
22      51153371        51153371        exonic  SHANK3  nonsynonymous SNV
22      51159778        51159778        exonic  SHANK3  nonsynonymous SNV
22      51160154        51160154        exonic  SHANK3  nonsynonymous SNV
22      51169684        51169684        exonic  SHANK3  nonsynonymous SNV
22      51176664        51176664        exonic  ACR     nonsynonymous SNV
22      51176734        51176734        exonic  ACR     nonsynonymous SNV
22      51177812        51177812        exonic  ACR     nonsynonymous SNV
22      51178286        51178286        exonic  ACR     nonsynonymous SNV

The data is tab-separated.
Column one is the chromosome.
Column two is the start position and column three is the end position. Column five is the gene name.

My desired output is

22 CHKB 51018157 51021197
22 ARSA 51063758 51066194
22 SHANK3 51143462 51169684
22 ACR 51176664 51178286


That is, for each gene, I get the smallest number from column 2 and largest from column 3.

I could only get my head around this much:

 cat small_d.txt | awk '{a[$5]=$1} END {for (i in a) {print i,a}}'

I simply can't think in awk. I could write a Python script, but I would like to learn these magic tricks.

First off, you don't need cat's help to read a file; awk can read files perfectly well on its own. The same goes for nearly any other program.

$ awk '($3 > MAX[$5]) { MAX[$5]=$3 }
        (!($5 in MIN) || ($2 < MIN[$5] )) { MIN[$5]=$2 }
        END { for(X in MIN) print X, MIN[X], MAX[X] }' inputfile

ARSA 51063758 51066194
ACR 51176664 51178286
CHKB 51018157 51021197
SHANK3 51143462 51169684

$
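Note that the output above does not include the chromosome and comes out in whatever order `for (X in MIN)` happens to visit the array. A minimal sketch of one way to get closer to the requested output, assuming the input is tab-separated as described: key on chromosome plus gene and remember the order in which keys are first seen. The file names `sample.txt` and `bounds.txt` and the array names `lo`, `hi` and `order` are made up for illustration; the rows are taken from the question, with the two ARSA rows deliberately swapped to show that sorted input is not required.

```shell
# Hypothetical sample: four rows from the question, tab-separated.
printf '22\t51018157\t51018157\texonic\tCHKB\tnonsynonymous SNV\n'  > sample.txt
printf '22\t51021197\t51021197\texonic\tCHKB\tnonsynonymous SNV\n' >> sample.txt
printf '22\t51066194\t51066194\texonic\tARSA\tnonsynonymous SNV\n' >> sample.txt
printf '22\t51063758\t51063758\texonic\tARSA\tnonsynonymous SNV\n' >> sample.txt

awk -F'\t' '
    {
        key = $1 " " $5                  # group on chromosome + gene
        if (!(key in lo)) {              # first row seen for this group
            order[++n] = key             # remember first-seen order
            lo[key] = $2
            hi[key] = $3
        }
        # "+ 0" forces a numeric comparison
        if ($2 + 0 < lo[key] + 0) lo[key] = $2
        if ($3 + 0 > hi[key] + 0) hi[key] = $3
    }
    END {
        for (i = 1; i <= n; i++)         # print groups in input order
            print order[i], lo[order[i]], hi[order[i]]
    }
' sample.txt > bounds.txt

cat bounds.txt
```

The extra `order` array is only there because awk makes no promise about the traversal order of `for (k in array)`.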

You need a min and a max variable per group to track the smallest and largest position:

awk -F'\t' '
        {
                idx = $1 FS $5
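                if ( idx in A_min )
                {
                        # smallest start position (column 2)
                        if ( A_min[idx] > $2 )
                                A_min[idx] = $2
                        # largest end position (column 3)
                        if ( A_max[idx] < $3 )
                                A_max[idx] = $3
                }
                else
                {
                        # first row for this chromosome/gene pair
                        A_min[idx] = $2
                        A_max[idx] = $3
                }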
        }
        END {
                for ( k in A_min )
                        print k, A_min[k], A_max[k]
        }
' small_d.txt

If the files are always grouped and in sorted/increasing order per group, then something like this might suffice:

awk '{i=$1 FS $5; if(i!=p) {if(p) print p,l,h; l=$2; p=i} h=$3} END{print p,l,h}' file

This would also keep the groups in the same order as in the input file.
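For anyone trying to follow the one-liner, here is the same logic spread over several lines with comments. This is only a sketch: the variables are renamed (`cur`, `prev`, `low`, `high` instead of `i`, `p`, `l`, `h`), the test for "no previous group yet" is written as `prev != ""`, and the `END` block gets a guard so empty input prints nothing. The file names `grouped_sample.txt` and `ranges.txt` are made up, and the rows are the first and last CHKB and SHANK3 rows from the question.

```shell
# Hypothetical sample: rows from the question, already grouped and sorted.
printf '22\t51018157\t51018157\texonic\tCHKB\tnonsynonymous SNV\n'    > grouped_sample.txt
printf '22\t51021197\t51021197\texonic\tCHKB\tnonsynonymous SNV\n'   >> grouped_sample.txt
printf '22\t51143462\t51143462\texonic\tSHANK3\tnonsynonymous SNV\n' >> grouped_sample.txt
printf '22\t51169684\t51169684\texonic\tSHANK3\tnonsynonymous SNV\n' >> grouped_sample.txt

awk '
    {
        cur = $1 FS $5              # group key: chromosome, separator, gene
        if (cur != prev) {          # key changed, so a new group starts here
            if (prev != "")         # nothing to print before the first group
                print prev, low, high
            low  = $2               # sorted input: first row holds the minimum
            prev = cur
        }
        high = $3                   # sorted input: latest row holds the maximum
    }
    END {
        if (prev != "")             # flush the last group
            print prev, low, high
    }
' grouped_sample.txt > ranges.txt

cat ranges.txt
```

Each group is printed only when the next group begins, which is why the `END` block is needed to print the final one. Because nothing is stored in an array, the output follows the input order, but the result is only correct if each gene's rows are contiguous and sorted by position.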

Thank you corona.
Do you think you can help me understand how this is working?

Hi Scrutinizer

Thank you. This works exactly as I needed and prints the chromosome number as well.
Can you please help me understand your code?