Dear All,
I would really appreciate it if you could help me with this file comparison.
I have two files:
file1:
chr start end ID gene_name
chr1 2020 3030 1 test1
chr1 900 5000 2 test1
chr2 5000 8000 3 test2
chr3 6000 12000 4 test3
chr3 6000 15000 5 test3
file2:
chr start end gene_name
chr1 2000 6000 test1
chr2 3500 9000 test2
chr3 5000 12000 test3
I would like to create a new file similar to file 1 but with the following criteria:
-the comparison should be made on gene_name (column 5 of file 1, column 4 of file 2)
-keep entries only if the difference between the second column of file 2 and file 1 is less than 1001; for example, line number 2 of file 1 is removed because the difference is 1100 (2000-900)
-keep entries only if the third column of file 1 is equal to or smaller than the third column of file 2; for example, the fifth line of file 1 is removed because 15000 is greater than 12000
the output file should look like this:
file3:
chr start end ID gene_name
chr1 2020 3030 1 test1
chr2 5000 8000 3 test2
chr3 6000 12000 4 test3
Any suggestions?
Thanks,
Paolo
I may have misunderstood your requirements, but the script I came up with skips not only the file1 lines with ID 2 and 5, but also the line with ID 3. The difference between the start columns for gene_name test2 is 1500 (5000-3500), and you said to keep entries only if the difference between the second column of file 2 and file 1 is less than 1001.
If this doesn't do what you want, maybe it will at least give you something to easily modify to get what you want:
#!/bin/ksh
awk 'FNR==NR{if(NR != 1) {
        # Save fields from the 1st file operand (file2) for comparison
        # with the 2nd file operand (file1)...
        key[$4] = NR
        start[NR] = $2
        end[NR] = $3
    }
    next
}
{   if(FNR == 1) {
        # Copy the header line to the new file.
        print
        next
    }
    if(!($5 in key)) {
        if(debug) printf("No entry found for key %s: %s\n", $5, $0)
        next
    }
    entry = key[$5]
    # Absolute difference between the two start columns.
    diff = $2 > start[entry] ? $2 - start[entry] : start[entry] - $2
    if(diff > 1000) {
        if(debug) printf("Start diffe |%d-%d| > 1000: %s\n",
            $2, start[entry], $0)
        next
    }
    if($3 > end[entry]) {
        if(debug) printf("End field too big: (%d > %d) %s\n",
            $3, end[entry], $0)
        next
    }
    # We passed all the tests, add entry to output file.
    print
}' debug=1 file2 file1
When run in debug mode (as specified by the last line of the script above), the output is:
chr start end ID gene_name
chr1 2020 3030 1 test1
Start diffe |900-2000| > 1000: chr1 900 5000 2 test1
Start diffe |5000-3500| > 1000: chr2 5000 8000 3 test2
chr3 6000 12000 4 test3
End field too big: (15000 > 12000) chr3 6000 15000 5 test3
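The debug=1 on the last line of the script is an awk command-line variable assignment. awk applies such an assignment when it reaches it in the argument list, so it only affects the files named after it. A minimal sketch of that behavior, using two throwaway files (a.txt and b.txt are made-up names, not part of the original problem):

```shell
#!/bin/sh
# Two tiny input files (hypothetical names, for illustration only).
printf 'one\n' > a.txt
printf 'two\n' > b.txt

# debug is unset while a.txt is read; debug=1 is applied before b.txt is opened.
awk '{ state = debug ? "on" : "off"; print "debug " state ": " $0 }' a.txt debug=1 b.txt
# prints:
# debug off: one
# debug on: two
```

In the script above, debug=1 comes before both file2 and file1, so the debug messages are enabled for the whole run.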
When run with debugging turned off and output redirected to file3
by changing the last line of the script from:
}' debug=1 file2 file1
to:
}' file2 file1 > file3
file3 will contain:
chr start end ID gene_name
chr1 2020 3030 1 test1
chr3 6000 12000 4 test3
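The expected output in the question keeps the line with ID 3. That would be consistent with a signed comparison (file2 start minus file1 start less than 1001) instead of the absolute difference used above. This reading is only an assumption about what was intended. A minimal sketch under that assumption, with the sample data recreated so it can be run as-is (file3_signed is a made-up output name):

```shell
#!/bin/sh
# Recreate the sample input files from the question.
cat > file1 <<'EOF'
chr start end ID gene_name
chr1 2020 3030 1 test1
chr1 900 5000 2 test1
chr2 5000 8000 3 test2
chr3 6000 12000 4 test3
chr3 6000 15000 5 test3
EOF
cat > file2 <<'EOF'
chr start end gene_name
chr1 2000 6000 test1
chr2 3500 9000 test2
chr3 5000 12000 test3
EOF

awk '
FNR == NR {
    # First operand (file2): remember start and end per gene_name, skipping the header.
    if (FNR > 1) { start[$4] = $2 + 0; end[$4] = $3 + 0 }
    next
}
# Second operand (file1): copy the header line.
FNR == 1 { print; next }
# Assumption: signed test, file2 start minus file1 start must be below 1001,
# and the file1 end must not exceed the file2 end.
($5 in start) && (start[$5] - $2 < 1001) && ($3 + 0 <= end[$5])
' file2 file1 > file3_signed

cat file3_signed
```

With the sample data this keeps the lines with ID 1, 3 and 4, which matches the file3 shown in the question. If the absolute difference is what is really wanted, the script above is the one to use.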
Thanks a lot, it is working fine!!
That really helps!