file comparison

paolo.kunder · September 13, 2012, 9:47am

Dear All,
I would really appreciate it if you could help me solve this file comparison problem.

I have two files:

file1:
chr	start		end 		ID	gene_name
chr1	2020		3030		1	test1
chr1	900		5000		2	test1
chr2	5000		8000		3	test2
chr3	6000		12000		4	test3
chr3	6000		15000		5	test3

file2:
chr	start		end 		gene_name
chr1	2000		6000		test1
chr2	3500		9000		test2
chr3	5000		12000		test3

I would like to create a new file similar to file 1 but with the following criteria:
-the comparison should be made on gene_name (column 5 of file 1, column 4 of file 2)
-keep entries only if the difference between the second column of file 2 and file 1 is less than 1001; for example, line number 2 of file 1 is removed because the difference is 1100 (2000-900)
-keep entries only if the third column of file 1 is equal to or smaller than the third column of file 2; for example, the fifth line of file 1 is removed because 15000 is greater than 12000

the output file should look like this:

file3:
chr	start		end 		ID	gene_name
chr1	2020		3030		1	test1
						
chr2	5000		8000		3	test2
chr3	6000		12000		4	test3

Any suggestion?
Thanks,
Paolo

Don_Cragun · September 15, 2012, 10:38pm

I may have misunderstood your requirements, but the script I came up with skips not only the file1 lines with ID 2 and 5, but also the line with ID 3. The difference between the start columns for gene_name test2 is 1500, and you said to keep entries only if the difference between the second column of file 2 and file 1 is less than 1001.
If this doesn't do what you want, maybe it will at least give you something to easily modify to get what you want:

#!/bin/ksh
awk 'FNR==NR{if(NR != 1) {
                # Save fields from 1st file for comparison with the 2nd file...
                key[$4] = NR
                start[NR] = $2
                end[NR] = $3
        }
        next
}
 {      if(FNR == 1) {
                # Copy the header line to the new file.
                print
                next
        }
        if(!($5 in key)) {
                if(debug) printf("No entry found for key %s: %s\n", $5, $0)
                next
        }
        entry = key[$5]
        diff = $2 > start[entry] ? $2 - start[entry] : start[entry] - $2
        if(diff > 1000) {
                if(debug) printf("Start diffe |%d-%d| > 1000: %s\n",
                        $2, start[entry], $0)
                next
        }
        if($3 > end[entry]) {
                if(debug) printf("End field too big: (%d > %d) %s\n",
                        $3, end[entry], $0)
                next
        }
        # We passed all the tests, add entry to output file.
        print
}' debug=1 file2 file1

When run in debug mode (as specified by the last line of the script above), the output is:

chr	start		end 		ID	gene_name
chr1	2020		3030		1	test1
Start diffe |900-2000| > 1000: chr1	900		5000		2	test1
Start diffe |5000-3500| > 1000: chr2	5000		8000		3	test2
chr3	6000		12000		4	test3
End field too big: (15000 > 12000) chr3	6000		15000		5	test3

When run with debugging turned off and output redirected to file3 by changing the last line of the script from:

}' debug=1 file2 file1

to:

}' file2 file1 > file3

file3 will contain:

chr     start           end             ID      gene_name
chr1    2020            3030            1       test1
chr3    6000            12000           4       test3
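
If the test2 line (ID 3) is meant to survive, as in the file3 shown in the question, one reading that reproduces that output is a signed difference: subtract the file1 start from the file2 start and drop the line only when the result is greater than 1000. That is an assumption about the intent, not something confirmed in this thread. A minimal sketch under that assumption follows; the function name filter_signed, the maxdiff variable and the scratch directory are illustrative only, and the sample data is copied from the question with single spaces in place of tabs.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: "difference" means file2 start minus file1 start
# (signed), not the absolute difference used in the script above.
# filter_signed is a hypothetical helper; usage: filter_signed file2 file1
filter_signed() {
        awk -v maxdiff=1000 '
        FNR == NR {                            # first file named: file2
                if (FNR > 1) {                 # skip its header line
                        start2[$4] = $2 + 0
                        end2[$4] = $3 + 0
                }
                next
        }
        FNR == 1 { print; next }               # copy the file1 header
        !($5 in start2) { next }               # gene_name not present in file2
        (start2[$5] - $2) > maxdiff { next }   # file1 starts too far before file2
        ($3 + 0) > end2[$5] { next }           # file1 end is beyond file2 end
        { print }
        ' "$1" "$2"
}

# Sample data from the question, written to a scratch directory
# (mktemp -d is assumed to be available).
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
chr start end ID gene_name
chr1 2020 3030 1 test1
chr1 900 5000 2 test1
chr2 5000 8000 3 test2
chr3 6000 12000 4 test3
chr3 6000 15000 5 test3
EOF
cat > "$tmp/file2" <<'EOF'
chr start end gene_name
chr1 2000 6000 test1
chr2 3500 9000 test2
chr3 5000 12000 test3
EOF

result=$(filter_signed "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data this prints the header plus the lines with ID 1, 3 and 4. ID 2 is dropped by the start test (2000 - 900 = 1100) and ID 5 by the end test (15000 > 12000).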

paolo.kunder · September 18, 2012, 4:31am

Thanks a lot, it is working fine!!
That really helps!