Find biggest values on replicates

giuliangiuseppe · July 2, 2014, 7:40am

Dear All
I was wondering if someone of you know how to resolve an issue that I met.
In particular I have a file like this:

ENSMUSG01 chr1 77837902 77853530
ENSMUSG02 chr2 18780447 18811972
ENSMUSG02 chr2 18780453 18811626
ENSMUSG02 chr2 18807356 18811987
ENSMUSG03 chr3 142575634 142576538
ENSMUSG03 chr3 142576507 142578095
ENSMUSG03 chr3 142576296 142576910
ENSMUSG03 chr3 142575558 142578120
ENSMUSG03 chr3 142575529 142578143

What I would like to obtain is a file like this, in which for each replicates ID (column 1), I would like to report only those with the biggest lenght(and the relative coordinates)

ENSMUSG00000000003 chrX 77837902 77853530
ENSMUSG00000000028 chr16 18780447 18811972
ENSMUSG00000000031 chr7 142575529 142578143

I hope that my explanation was clear.

Thank you for your help!

Giuliano

CarloM · July 2, 2014, 7:56am

How are you determining which is 'biggest'? It's not clear from your example.

Scrutinizer · July 2, 2014, 7:58am

Hi, it is not clear to me what you are looking for and also, how do you get from the input file that you specified to for example ENSMUSG00000000003 and chrX ?

giuliangiuseppe · July 2, 2014, 9:06am

Hi
I am so sorry!!! I was in a hurry before and I did not check my message.

So my input file is like that:

ENSMUSG01 chr1 77837902 77853530 ENSMUSG02 chr2 18780447 18811972 ENSMUSG02 chr2 18780453 18811626 ENSMUSG02 chr2 18807356 18811987 ENSMUSG03 chr3 142575634 142576538 ENSMUSG03 chr3 142576507 142578095 ENSMUSG03 chr3 142576296 142576910 ENSMUSG03 chr3 142575558 142578120 ENSMUSG03 chr3 142575529 142578143

And my desired output should be like that:

ENSMUSG01 chr1 77837902 77853530 ENSMUSG02 chr2 18780447 18811972
ENSMUSG03 chr3 142575529 142578143

What I am looking for is for each ID (first column) calculate the difference between the column 3 and 4 and keep only the lane in which the difference is bigger.
Thank you again and if you have further question do not hesitate to post a reply!

Giuliano

---------- Post updated at 03:06 PM ---------- Previous update was at 03:03 PM ----------

o
Hi
I am so sorry!!! I was in a hurry before and I did not check my message.

So my input file is like that:

ENSMUSG01 chr1 77837902 77853530 
ENSMUSG02 chr2 18780447 18811972 
ENSMUSG02 chr2 18780453 18811626 
ENSMUSG02 chr2 18807356 18811987 
ENSMUSG03 chr3 142575634 142576538 
ENSMUSG03 chr3 142576507 142578095 
ENSMUSG03 chr3 142576296 142576910 
ENSMUSG03 chr3 142575558 142578120 
ENSMUSG03 chr3 142575529 142578143

And my desired output should be like that:

ENSMUSG01 chr1 77837902 77853530 
ENSMUSG02 chr2 18780447 18811972 
ENSMUSG03 chr3 142575529 142578143

What I am looking for is for each ID (first column) calculate the difference between the column 3 and 4 and keep only the lane in which the difference is bigger.
Thank you again and if you have further question do not hesitate to post a reply!

Giuliano

CarloM · July 2, 2014, 11:21am

awk '{diff=$4-$3; diff=diff >= 0 ?  diff:-diff; if (diff > diffs[$1]) {diffs[$1]=diff;lines[$1]=$0}} END {for (i in lines) {print lines}}' file

If field 4 is always going to be greater than field 3 then you can shorten it a bit by not bothering to calculate the absolute value. Also, it's not guaranteed to preserve the ordering of the records in the file.

Akshay_Hegde · July 2, 2014, 12:23pm

akshay@Aix:/tmp$ awk '{d=$4-$3; if(A[$1]<d){ A[$1]=d; B[$1]=$0}}END{for(i in B)print B}' file
ENSMUSG01 chr1 77837902 77853530
ENSMUSG02 chr2 18780447 18811972
ENSMUSG03 chr3 142575529 142578143

Scrutinizer · July 2, 2014, 12:32pm

To preserve order, presuming an input file grouped by the first field:

awk '{d=$4-$3} $1!=p{if(p)print s; p=$1; m=0} d>m{s=$0; m=d} END{print s}' file