Algorithm predicted_gene start_point end_point
A x 65 85
B x 70 80
C x 75 85
D x 10 20
B y 125 130
C y 120 140
D y 200 210
The input has four tab-separated columns. The first column is the algorithm used for the prediction; there are four of them, A-D. The second column is the predicted target (these are genes), x or y. The third and fourth columns give the start and the end of the predicted site in the gene sequence.
I need to collapse the rows for each entry in column 2 whose ranges in columns 3 and 4 overlap, producing something like this:
Algorithm predicted_gene start_point end_point Number_of_algorithms_predicting_this_site
A, B, C x 65 85 3
D x 10 20 1
B, C y 120 140 2
D y 200 210 1
For example, the first line groups algorithms A, B and C, which all predict gene x at the same site: positions 70-80 (algorithm B) and 75-85 (algorithm C) both lie inside the 65-85 range predicted by algorithm A. The last column gives the number of algorithms that predicted this site. In contrast, the site predicted by algorithm D for gene x (10-20) does not overlap the others, so it is reported on a separate line. The results for gene y follow the same logic.
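To make the grouping rule concrete, here is a minimal Python sketch of one way to do it. It is not the script discussed in this thread. It assumes that two sites belong together when their ranges overlap or touch, that overlaps chain transitively within a gene, and that output rows should keep the order in which each site first appears in the input. The function name `merge_predictions` and the in-memory `rows` list are made up for illustration.

```python
from collections import defaultdict


def merge_predictions(rows):
    """Merge overlapping predicted sites per gene.

    rows: iterable of (algorithm, gene, start, end) tuples.
    Returns (algorithms, gene, start, end, n_algorithms) tuples.
    """
    by_gene = defaultdict(list)
    for order, (algo, gene, start, end) in enumerate(rows):
        by_gene[gene].append((int(start), int(end), algo, order))

    clusters = []
    for gene, sites in by_gene.items():
        sites.sort()  # by start position, then end position
        current = None
        for start, end, algo, order in sites:
            # Assumption: a site joins the current cluster when it starts
            # at or before the cluster's current end (overlap or touch).
            if current is not None and start <= current["end"]:
                current["end"] = max(current["end"], end)
                current["algos"].add(algo)
                current["order"] = min(current["order"], order)
            else:
                current = {"gene": gene, "start": start, "end": end,
                           "algos": {algo}, "order": order}
                clusters.append(current)

    # Assumption: report sites in order of first appearance in the input.
    clusters.sort(key=lambda c: c["order"])
    return [(", ".join(sorted(c["algos"])), c["gene"], c["start"],
             c["end"], len(c["algos"])) for c in clusters]


# The sample input from the post, without the header line.
rows = [
    ("A", "x", 65, 85), ("B", "x", 70, 80), ("C", "x", 75, 85),
    ("D", "x", 10, 20), ("B", "y", 125, 130), ("C", "y", 120, 140),
    ("D", "y", 200, 210),
]
result = merge_predictions(rows)
for record in result:
    print("\t".join(str(field) for field in record))
```

On the sample above this gives the four expected lines. On real data the outcome depends on the overlap rule: if sites should only be merged when one range lies entirely inside another, the `start <= current["end"]` test would need to change.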
Thank you, Yoda, for your time. I edited my first post because it was said to be unclear. The input file has four columns with four headers, and the output has one more column, so five in total; both files are tab-delimited. Could you please modify your script accordingly? Thanks
Add your headers in the BEGIN block and edit the print statement. Try to learn ... After I have provided 99.99% of the code, if you can't edit a small piece of header information, what can I tell you? Please don't expect others to complete your task; put in a little effort.
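The same idea (write the header once, before the loop that prints the data rows) can be sketched in Python, which has no BEGIN block. This is only an illustration: `write_table` and `HEADER` are hypothetical names, and the two data rows are taken from the expected output in the first post.

```python
import csv
import io

HEADER = ["Algorithm", "predicted_gene", "start_point", "end_point",
          "Number_of_algorithms_predicting_this_site"]


def write_table(records, out):
    """Write a tab-delimited table with a single header line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)       # header goes out once, before the loop
    for record in records:
        writer.writerow(record)   # one line per merged site


# Write to an in-memory buffer here; an open file object works the same way.
buffer = io.StringIO()
write_table([("A, B, C", "x", 65, 85, 3), ("D", "x", 10, 20, 1)], buffer)
output_text = buffer.getvalue()
print(output_text, end="")
```

If the header call sits inside the loop instead, it is repeated for every row.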
Hi Yoda,
I tried your script, and it works perfectly on the simplified sample I presented here. However, on my real samples the result is not precise: it does not include all the numbers in the range. I've attached an example of the input and the expected output. Could you please have a look at it?
Thank you very much for the help. Your script is working well, with two minor problems. First, in the output file it adds the header for every row. Second, although it works very well on the simplified example, it does not on the real samples. I've attached two files, the input and the expected output. I'd appreciate it if you could modify the script.