rkk
May 17, 2013, 4:02pm
1
Hello,
I have a file with two columns like the following
FILE1:
chr1 61042
chr1 61153
chr1 61446
chr1 61457
chr1 61621
chr10 61646
chr10 61914
chr10 62024
chr10 62782
Alos, I have another file
FILE2:
chr1 61150 61600
chr10 61675 62200
Now for each line specifying range in FILE2, I want to count the number of lines from FILE1 that falls with in the range.
Sample output:
chr1 61150 61600 3
chr10 61675 62200 2
Could any one suggest how to solve this.
Thanks in advance,
rkk,
Try this :
while read line
do
fst=`echo $line|awk '{print $1}'`
min=`echo $line|awk '{print $2}'`
max=`echo $line|awk '{print $3}'`
cnt=0
while read line1
do
val=`echo $line1|awk '{print $2}'`
if [[ $val -le $max && $val -ge $min ]]
then
cnt=`expr $cnt + 1`
fi
done < file1
echo "$fst $min $max $cnt" >> output
done<file2
Thanks,
Vijay
Yoda
May 17, 2013, 4:33pm
3
An awk approach:
awk -F'\t' '
NR == FNR {
A[$1] = $2 "," $3
next
}
$1 in A {
n = split( A[$1], V, "," )
if ( $2 >= V[1] && $2 <= V[2] )
R[$1 OFS V[1] OFS V[2]]++
}
END {
for ( k in R )
print k, R[k]
}
' file2 file1
---------- Post updated at 15:33 ---------- Previous update was at 15:29 ----------
@vmenon , if you writing a shell script always use shell built-ins where ever possible.
No need to use awk to split each fields in a file. You can read each field separately in a while loop read statement:
while read f1 f2 f3
do
....
done < file2
1 Like
rossi
May 17, 2013, 4:38pm
4
use bedtools intersect. that will be easy
Jotne
May 17, 2013, 4:47pm
5
Another
awk
awk 'NR==FNR{aL[$1]=$2;aH[$1]=$3;next} ($1 in aL) {if ($2>=aL[$1] && $2<=aH[$1]) c[$1]++} END { for (i in c) {print i,c}}' f2 f1
chr1 3
chr10 2
Edit: some more readable.
awk '
NR==FNR {
low[$1]=$2
high[$1]=$3
next}
($1 in low) {
if ($2>=low[$1] && $2<=high[$1])
c[$1]++}
END {
for (i in c) {
print i,c}
}
' f2 f1
rkk
May 17, 2013, 5:10pm
6
Thanks all..
Yoda's script gives me what I wanted. But I am missing something.
I think I need to give more details:
FILE1:
chr1 61042
chr1 61153
chr1 61446
chr1 61457
chr1 61621
chr1 100010
chr1 100138
chr1 100145
chr1 100150
chr1 100280
chr10 61646
chr10 61914
chr10 62024
chr10 62782
Alos, I have another file
FILE2:
chr1 61150 61600
chr1 100100 100200
chr10 61675 62200
Now for each line specifying range in FILE2, I want to count the number of lines from FILE1 that falls with in the range.
Sample output:
chr1 61150 61600 3
chr1 100100 100200 3
chr10 61675 62200 2
But from Yoda's code I got
chr1 100100 100200 3
chr10 61675 62200 2
I am missing the first line..
Can you suggest any modifications?
Thanks,
RudiC
May 17, 2013, 5:11pm
7
Adapting Jotne's approach we get to the result that the requestor requested:
awk 'NR==FNR {low[$1]=$2
high[$1]=$3
next}
$2 >= low [$1] &&
$2 <= high[$1] {c[$1]++}
END {for (i in c) print i, low, high, c}
' file2 file1
chr1 61150 61600 3
chr10 61675 62200 2
rkk
May 17, 2013, 5:24pm
8
Thanks Rudi..,
I reposted my message with little modifications.. could you please check and repost the code
Best
Yoda
May 17, 2013, 6:00pm
9
Here is the modified code:
awk '
NR == FNR {
A[$1","$2] = $3
next
}
{
for ( k in A )
{
n = split ( k, V, "," )
if ( $1 == V[1] )
{
if ( $2 >= V[2] && $2 <= A[$1","V[2]] )
R[$1 OFS V[2] OFS A[$1","V[2]]]++
}
}
}
END {
for ( k in R )
print k, R[k]
}
' file2 file1
rkk
May 17, 2013, 6:21pm
10
Hi Yoda,
Thanks for the code. The complexity for this code very high as I have 9million lines in file1 and 11000 lines in file2.
Is there any quick way in awk??
Best
Yoda
May 17, 2013, 6:31pm
11
I'm not sure if there is another way without for
loop visiting each record in file2 for comparison.
May be someone else in this forum has a better idea.
By the way what system are you on and how long it is taking to complete execution?
You could try this approach (which avoids cycling through every record in file2):
awk '
NR==FNR {
j=++Ranges[$1]
Low[$1,j]=$2
High[$1,j]=$3
next
}
$1 in Ranges {
for(j=1; j<=Ranges[$1]; j++) if (Low[$1,j]<=$2 && $2<=High[$1,j]) Number[$1,j]++
}
END {
for(i in Ranges) for(j=1; j<=Ranges; j++) print i, Low[i,j], High[i,j], Number[i,j]
}
' OFS='\t' file2 file1
I presumed that a range match is only valid if the key in field 1 matches as well.
awk
gawk '{
n=split($0,arr," ")
if(NR==FNR){
min[arr[1]]=arr[2]
max[arr[1]]=arr[3]
cnt[arr[1]]=0
}
else{
if(arr[2] >= min[arr[1]] && arr[2] <= max[arr[1]])
cnt[arr[1]]++
}
}
END{
for (i in cnt)
print i" "min" "max" "cnt
}' b a
python
dic={}
cnt=0
with open("b.txt") as f:
for line in f:
line=line.replace("\n","")
words = line.split(" ")
dic[words[0]]={'MIN':words[1],'MAX':words[2],'CNT':0}
print(dic)
with open("a.txt") as f:
for line in f:
line=line.replace("\n","")
words = line.split(" ")
if words[1]>=dic[words[0]]['MIN'] and words[1]<=dic[words[0]]['MAX']:
dic[words[0]]['CNT']+=1
for i in dic:
print(i,dic['MIN'],dic['MAX'],dic['CNT'])
@summer_cherry . The gawk / awk does not produce the correct output for multiple ranges with the same $1
(see post #6 ).
---
The python script produces (Python 2.7.2):
line 7, in <module>
dic[words[0]]={'MIN':words[1],'MAX':words[2],'CNT':0}
IndexError: list index out of range