Hello,
I want to retrieve one row per unique count (column 4) for every ref gene (column 7), choosing between tied rows on the basis of strand (column 8) and tss (column 5).
If several rows of a ref gene have the same count and the gene is on the negative strand, keep the row with the highest tss.
Likewise, if several rows of a ref gene have the same count and the gene is on the positive strand, keep the row with the lowest tss.
I am working on data in this format:
CHR TSS-25bp TSS+25bp count tss Ensemble transcript refgene strand
chr15 79554474 79554524 2 79554499 ENSMUST00000089311 Sun2 -
chr15 79554475 79554525 2 79554500 ENSMUST00000100439 Sun2 -
chr15 79554477 79554527 2 79554502 ENSMUST00000046259 Sun2 -
chr15 79569054 79569104 1 79569079 ENSMUST00000159660 Sun2 -
chr15 79570243 79570293 4 79570268 ENSMUST00000160355 Sun2 -
chr17 44914075 44914125 2 44914100 ENSMUST00000050630 Supt3h +
chr17 44914248 44914298 3 44914273 ENSMUST00000130623 Supt3h +
chr17 44914319 44914369 3 44914344 ENSMUST00000127798 Supt3h +
chr11 87551028 87551078 2 87551053 ENSMUST00000152700 Supt4h1 +
chr11 87551029 87551079 2 87551054 ENSMUST00000141169 Supt4h1 +
chr7 29099891 29099941 2 29099916 ENSMUST00000003527 Supt5h -
chr11 78020504 78020554 3 78020529 ENSMUST00000108314 Supt6h -
I would expect this in the output:
CHR TSS-25bp TSS+25bp count tss Ensemble transcript refgene strand
chr15 79554477 79554527 2 79554502 ENSMUST00000046259 Sun2 -
chr15 79569054 79569104 1 79569079 ENSMUST00000159660 Sun2 -
chr15 79570243 79570293 4 79570268 ENSMUST00000160355 Sun2 -
chr17 44914075 44914125 2 44914100 ENSMUST00000050630 Supt3h +
chr17 44914248 44914298 3 44914273 ENSMUST00000130623 Supt3h +
chr11 87551028 87551078 2 87551053 ENSMUST00000152700 Supt4h1 +
chr7 29099891 29099941 2 29099916 ENSMUST00000003527 Supt5h -
chr11 78020504 78020554 3 78020529 ENSMUST00000108314 Supt6h -
So far I have this:
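#!/bin/bash
# Assumes Workbook4.txt is tab-delimited, has one header line,
# and has all rows of a gene next to each other.
example=Workbook4.txt
head -1 "$example"    # keep the header line
for gene in $(tail -n +2 "$example" | cut -f7 | uniq)
do
    # strand of this gene, taken from its first row
    sign=$(awk -F'\t' -v g="$gene" '$7 == g {print $8; exit}' "$example")
    # distinct counts of this gene, in order of first appearance
    for count in $(awk -F'\t' -v g="$gene" '$7 == g && !seen[$4]++ {print $4}' "$example")
    do
        # rows with this gene and count, sorted numerically by tss (column 5)
        rows=$(awk -F'\t' -v g="$gene" -v c="$count" '$7 == g && $4 == c' "$example" | sort -t$'\t' -k5,5n)
        if [ "$sign" = "-" ]
        then
            echo "$rows" | tail -1    # negative strand: highest tss
        else
            echo "$rows" | head -1    # positive strand: lowest tss
        fi
    done
done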
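For comparison, the same rule can probably be written as a single awk pass instead of nested loops. This is only a sketch under some assumptions: the function name pick_tss is made up for illustration, the file is assumed to be tab-delimited with one header line, and the columns are assumed to be laid out as in the sample above (count = 4, tss = 5, refgene = 7, strand = 8).

```shell
#!/bin/bash
# Hypothetical helper: pick_tss FILE
# Assumes FILE is tab-delimited with one header line.
pick_tss() {
    awk -F'\t' '
        NR == 1 { print; next }            # pass the header through
        {
            key = $7 SUBSEP $4             # one group per (refgene, count)
            if (!(key in best)) {
                order[++n] = key           # remember first-seen order
                best[key] = $0; tss[key] = $5
            } else if (($8 == "-" && $5 + 0 > tss[key] + 0) ||
                       ($8 == "+" && $5 + 0 < tss[key] + 0)) {
                best[key] = $0; tss[key] = $5   # better tss for this strand
            }
        }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$1"
}
```

It could be called as `pick_tss Workbook4.txt > filtered.txt`. Rows come out in the order in which each (refgene, count) pair first appears in the input, so the input does not need to be sorted beforehand.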
In my script, I am not sure about the lines that pick the row with head and tail. It would be nice if you could help me solve this.
Thanks for your time
Kirthi