Split a file in more files based on score content

paolo.kunder · November 19, 2013, 7:58am

Dear All,
I have the following file tabulated:

ID	distanceTSS	score
8434	571269	10
10122	393912	9
7652	6	10
4863	1451	9
8419	39	2
9363	564	21
9333	7714	22
9638	8334	9
1638	1231	11
10701	918	1000
6587	32056	111

What I would like to do is the following, create 100 new files based on content of the second column,
The first file should contain all the lines with a distance between 0 and 1000,
the second between 1000 and 2000, and so on untile 99000 and 10000,

Finally for each new file I would like to calculate the median of the third column(score)

Is there a rapid way to do so? I tried witha perl script but it seems really slow,

thanks for your help,
Paolo

pravin27 · November 19, 2013, 8:34am

Could this help you ?

 awk '{ print > int($2/1000)+1"_distance.txt"}'  filename

paolo.kunder · November 19, 2013, 9:21am

Thanks,
That's really help me!
Another issue concerning the output file:
let's imagine I have the following output for 1_distance:

ID	distanceTSS	score
8434	8	10
8434	999	10
7652	6	9
7652	451	9

As you can notice there are ID present more than once, I would like to retain only one ID, for example:

ID	distanceTSS	score
8434	8	10
7652	6	9

And finally for each new file created (distance_*) I would like to calculate the median of the third column,

Thanks again,
Paolo

RudiC · November 19, 2013, 9:59am

Which of the duplicate values do you want to retain? The first? The smallest?
Does the output order matter, esp. reg. median calculation?

paolo.kunder · November 20, 2013, 3:20am

Hi,
Is the same because they have the same score

varathaneie · November 20, 2013, 5:29am

Hi Pravin,

Is it possible for you to explain the awk commadn which u have provided...