The awk below executes as expected if the id in $4 (as in f) is unique. However, most of my data looks like f1, where the same id can appear multiple times, and I think that is why the awk is not working as expected. I added a comment on the line that I cannot change without causing the script to abort.

Each line in f2 is searched and must contain the id, in this case COL1A2, but that id may not be a single entry. That is, the id may appear 5 times, and each line in f2 that contains it is still searched. The $4 in f1 is used as the id, and the $1, $2, and $3 values of each line are read into the variables min and max ($1 and the id form the key, $2 is stored in min and $3 in max).

The $4 in f2 is then split on the _ and read into array. The same id from f1 may appear in multiple lines of f2, but each of those lines will have unique $2 and $3 values. The id obtained from the split (array[1]) will match a $4 id in f1. The min and max must be on the same $1 as the f2 line and fall between the $2 and $3 values in f2. An exact match is not needed; it is enough that either the min or the max falls within $2 and $3. If that is true, exon is printed in $5 of f2; if it is not true, intron is printed in $5.

Most of this works as expected. I just did not account for the possibility of multiple entries and am not sure how to adjust for it.
For example, using the contents of f1, where COL1A2 appears 3 times, each entry or line is searched in f2. Currently, I believe that since COL1A2 is not unique, no match is found in f2, as the min and max are not set per entry or line. Thank you :).
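To illustrate what I think is going wrong, here is a minimal sketch that runs only the first block of the awk on the three f1 lines and prints what is left in min and max afterwards. The function name show_overwrite is made up for this illustration and is not part of the script.

```shell
# Hypothetical helper: feeds the three f1 lines through the same
# min/max assignments as the script and dumps the arrays at the end.
show_overwrite() {
    printf '%s\n' \
        'chr7 94027591 94027701 COL1A2' \
        'chr7 94027799 94027811 COL1A2' \
        'chr7 94030799 94030847 COL1A2' |
    awk '
    BEGIN { SUBSEP = "," }
    {
        # Same key for all three lines, so each line replaces the previous one
        min[$1, $NF] = $2
        max[$1, $NF] = $3
    }
    END {
        for (k in min) print k, min[k], max[k]
    }'
}

show_overwrite
```

Only one key (chr7,COL1A2) survives and it holds the values of the last f1 line, 94030799 and 94030847, so the first two intervals are never compared against f2.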
awk w/ desired output
awk '
BEGIN{
SUBSEP=","
}
FNR==NR{
max[$1,$NF]=$3
min[$1,$NF]=$2
next
}
{
split($4,array,"_") # How do I change/modify this so it only looks at each line with this id `COL1A2` in it?
}
(($1,array[1]) in max){
if(($2>min[array[5],array[1]] && $2<max[array[5],array[1]]) || ($3>min[array[5],array[1]] && $3<max[array[5],array[1]])){
print array[5],array[1],min[array[5],array[1]],max[array[5],array[1]],"exon"
next
}
}
{
print $0,"intron"}' f f2
chr7 94024333 94024423 COL1A2_cds_0_0_chr7_94024344_f 0 + intron
chr7 94027049 94027080 COL1A2_cds_1_0_chr7_94027060_f 0 + intron
chr7 COL1A2 94027591 94027701 exon
awk w/ current output
.... }' f1 f2
chr7 94024333 94024423 COL1A2_cds_0_0_chr7_94024344_f 0 + intron
chr7 94027049 94027080 COL1A2_cds_1_0_chr7_94027060_f 0 + intron
chr7 94027683 94027718 COL1A2_cds_2_0_chr7_94027694_f 0 + intron
contents of f (single COL1A2 entry)
chr7 94027591 94027701 COL1A2
contents of f1 (multiple COL1A2 entries; most of the actual data looks like this, though a few ids are single entries)
chr7 94027591 94027701 COL1A2
chr7 94027799 94027811 COL1A2
chr7 94030799 94030847 COL1A2
contents of f2 (always the same format)
chr7 94024333 94024423 COL1A2_cds_0_0_chr7_94024344_f 0 +
chr7 94027049 94027080 COL1A2_cds_1_0_chr7_94027060_f 0 +
chr7 94027683 94027718 COL1A2_cds_2_0_chr7_94027694_f 0 +
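One direction I am considering is sketched below: instead of one min and one max per id, keep a counter per id and store every interval from f1, then loop over all stored intervals for each f2 line. This is only a sketch under assumptions, not a confirmed fix. The names classify_exons, n, lo, hi, and parts are made up here, and the test used is an inclusive overlap of the two ranges, which is a guess at what "min or max falls within $2 and $3" should mean.

```shell
# Hypothetical helper: classify_exons F1 F2
classify_exons() {
    awk '
    # First file (f1): remember every interval for each (chromosome, id) pair
    FNR == NR {
        key = $1 SUBSEP $4
        n[key]++                    # number of intervals seen for this key
        lo[key, n[key]] = $2 + 0    # +0 forces a numeric value
        hi[key, n[key]] = $3 + 0
        next
    }
    # Second file (f2): the id is the first piece of $4 when split on "_"
    {
        split($4, parts, "_")
        key = $1 SUBSEP parts[1]
        for (i = 1; i <= n[key] + 0; i++) {
            # Assumed rule: the f1 interval overlaps the f2 range $2..$3
            if (lo[key, i] <= $3 + 0 && hi[key, i] >= $2 + 0) {
                print $1, parts[1], lo[key, i], hi[key, i], "exon"
                next
            }
        }
        print $0, "intron"
    }' "$1" "$2"
}

# Sample data copied from f1 and f2 above, written to a temporary directory
tmp=$(mktemp -d)
printf '%s\n' \
    'chr7 94027591 94027701 COL1A2' \
    'chr7 94027799 94027811 COL1A2' \
    'chr7 94030799 94030847 COL1A2' > "$tmp/f1"
printf '%s\n' \
    'chr7 94024333 94024423 COL1A2_cds_0_0_chr7_94024344_f 0 +' \
    'chr7 94027049 94027080 COL1A2_cds_1_0_chr7_94027060_f 0 +' \
    'chr7 94027683 94027718 COL1A2_cds_2_0_chr7_94027694_f 0 +' > "$tmp/f2"

classify_exons "$tmp/f1" "$tmp/f2"
```

On the sample data this prints the first two f2 lines with intron appended and chr7 COL1A2 94027591 94027701 exon for the third, which is the same as the desired output obtained with f. The exon line reports the first stored interval that overlaps; whether an f2 line overlapping several f1 intervals should report all of them is left open.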