Printing a certain element of a column

"File1.txt"

CHR SNP BP ANNOT
8 rs1878374 127974042 MYC(-843.5kb)|FAM84B(+334.4kb)
2 rs2042077 16883103 VSNL1(-702.2kb)|SMC6(-825.5kb)|RAD51AP2(-672.4kb)|MYCN(+878.5kb)|MSGN1(-978.2kb)|GEN1(-915.6kb)|FAM49A(+172.5kb)
12 rs10431347 3023955 TULP3(+103.4kb)|TSPAN9(-32.86kb)|TEAD4(+3.852kb)|PRMT8(-446.7kb)|PARP11(-764.3kb)|NRIP2(+209.5kb)|ITFG2(+219.6kb)|FOXM1(+167.4kb)|FKBP4(+240.6kb)|EFCAB4B(-603.4kb)|CACNA1C(+346.6kb)|C12orf32(+155.1kb)
5 rs10071904 41255697 TTC33(+463.9kb)|RPL37(+384.6kb)|PTGER4(+526.1kb)|PRKAA1(+421.6kb)|PLCXD3(-87.11kb)|OXCT1(-510.2kb)|LOC285636(-684.5kb)|FLJ40243(+148.5kb)|FBXO4(-705.4kb)|CARD6(+364.5kb)|C7(+236.9kb)|C6(0)

For each row, I want to print only the ANNOT entry whose kb value has the smallest absolute value,
so the desired output would be:

CHR SNP BP ANNOT
8 rs1878374 127974042 FAM84B
2 rs2042077 16883103 FAM49A
12 rs10431347 3023955 TEAD4
5 rs10071904 41255697 C6

Thanks!

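awk '
BEGIN { FS="[()]" }            # split on parentheses: names in odd fields, distances in even fields
NR==1 { print }                # header line goes out unchanged
NR>1 {
 fl=$1;                        # "CHR SNP BP GENE1"
 sub(/ *$/,"",fl);
 sub(/ *[^ ]*$/,"",fl);        # drop the first gene name, keeping "CHR SNP BP"
 la=10000000;                  # smallest absolute distance seen so far
 of="";                        # gene name belonging to la
 for (f=2; f<=NF; f+=2) {
  pf=$(f-1);                   # text preceding "(", ending in the gene name
  sub(/ *$/,"",pf);
  sub(/.*[ |]/,"",pf);         # keep only what follows the last space or "|"
  v=$f;
  sub(/kb$/,"",v);
  v=v+0;                       # force numeric
  af=(v<0) ? -v : v;           # absolute value
  if (af<la) {
     la=af;
     of=pf;
  }
 }
 print fl,of;
}
' File1.txt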

Seems to work great! What does la=10000000 mean?

It just sets a very high starting value for the running minimum, so that the first distance compared is always smaller. There are probably other ways to accomplish the same.
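
One such way, as a rough sketch rather than a tested replacement: take the first distance on each line as the starting minimum, so no large constant is needed. The function name closest_annot is made up for illustration, and the sketch assumes the ANNOT column contains no spaces and that awk converts a string like "-843.5kb)" to the number -843.5 (the usual numeric-prefix conversion).

```shell
# closest_annot is a hypothetical helper name, not something from the thread.
# Reads the table on stdin; prints the header, then CHR SNP BP followed by
# the gene whose distance has the smallest absolute value.
closest_annot() {
  awk '
    NR == 1 { print; next }                      # header goes out unchanged
    {
      n = split($4, parts, "|")                  # one "GENE(distance)" entry per element
      best = ""
      for (i = 1; i <= n; i++) {
        name = parts[i]; sub(/\(.*/, "", name)   # text before "("
        d = parts[i];    sub(/^[^(]*\(/, "", d)  # text after "("
        d += 0                                   # keep the numeric prefix only
        if (d < 0) d = -d                        # absolute value
        if (i == 1 || d < min) { min = d; best = name }   # first entry seeds the minimum
      }
      print $1, $2, $3, best
    }
  '
}
```

It would be called as closest_annot < File1.txt. On ties it keeps the first entry, the same as the strict < comparison in the script above.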


A perl alternative:

perl -lane 'if (/(\d+\s+\w+\s+\d+\s+)(.*)/) { $f=$1; $p=$2;
  printf "%s", $f; $min=10000000;
  while ($p =~ m/(\w+)\(([^(kb]+)(kb)?\)/g) {
    $a=abs($2); if ($a <= $min) { $min=$a; $val=$1 } }
  print $val } else { print }' infile
CHR SNP BP ANNOT
8 rs1878374 127974042 FAM84B
2 rs2042077 16883103 FAM49A
12 rs10431347 3023955 TEAD4
5 rs10071904 41255697 C6