shell scripting

hravisankar · November 3, 2010, 6:32pm

Hi,
How can I remove repeated 'Chr' entries that occur in different samples? In the example below, Chr11 appears in two samples with the same position, 100949523. I want to keep only one of those entries and print a range for each Chr: the position minus 20 as the minimum and the position plus 120 as the maximum. For Chr19 the position given is 52245923, so the output line would be
Chr19:52245903-52246043
The sign (+ or -) before the number in the input should be ignored. The expected final output is given below for clarity.
Input file:

sample1:1:1:1058:8130#0 5 830
Chr19 +52245923 1
Chr17 +69679873 1
Chr23 +52121254 1
Chr11 +100949523 1
Chr8 +28333267 1
sample1:1:1:1060:13599#0 1 1
Chr11 +100949523 1
Chr12 -19596251 2
sample1:1:1:1067:10266#0 5 284
Chr18 -52922341 0
Chr28 -25960086 0
Chr20 -19916978 0
Chr13 +3874326 0

Output is:

Chr19:52245903-52246043
Chr17:69679853-69679993
Chr23:52121234-52121374
Chr11:100949503-100949643
Chr8:28333247-28333387
Chr12:19596231-19596371
Chr18:52922321-52922461
Chr28:25960066-25960206
Chr20:19916958-19917098
Chr13:3874306-3874446

Chubler_XL · November 3, 2010, 8:33pm

If order is not important:

$ awk ' $0 !~ "^sample" { val[$1]=$2<0?-$2:$2; } END { for(i in val) print i":"(val[i]-20)"-"(val[i]+120) }' inputfile
Chr28:25960066-25960206
Chr19:52245903-52246043
Chr8:28333247-28333387
Chr20:19916958-19917098
Chr11:100949503-100949643
Chr12:19596231-19596371
Chr13:3874306-3874446
Chr23:52121234-52121374
Chr17:69679853-69679993
Chr18:52922321-52922461

rdcwayx · November 4, 2010, 7:00am

To keep the original order:

awk '/^Chr/&&!a[$1]++ {sub(/[+-]/,"");print $1":"$2-20"-"$2+120}' infile
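
For readers who want to see the steps spelled out, here is a minimal sketch of the same order-preserving idea in a longer, commented form. The function name `chr_ranges` and the variable names `seen` and `pos` are my own and do not come from the thread; the arithmetic is wrapped in parentheses only to make the grouping explicit.

```shell
# chr_ranges: hypothetical helper name, not from the thread.
# usage: chr_ranges inputfile
chr_ranges() {
    awk '
        # Only lines starting with "Chr"; skip a Chr name already printed.
        /^Chr/ && !seen[$1]++ {
            pos = $2
            sub(/^[+-]/, "", pos)   # drop the leading sign, if any
            # range is position-20 to position+120
            print $1 ":" (pos - 20) "-" (pos + 120)
        }
    ' "$@"
}
```

Like both one-liners above, this treats the Chr name alone as the duplicate key, which matches the sample data where the repeated Chr11 has the same position in both samples. If the same Chr could appear with different positions and both should be kept, one option would be to key on the name and position together, for example `seen[$1 FS $2]`.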