Count lines

Hello,

I have a file with two columns like the following

FILE1:

chr1	61042
chr1	61153
chr1	61446
chr1	61457
chr1	61621
chr10	61646
chr10	61914
chr10	62024
chr10	62782

Alos, I have another file

FILE2:

chr1	61150	61600
chr10	61675	62200

Now for each line specifying range in FILE2, I want to count the number of lines from FILE1 that falls with in the range.

Sample output:

chr1	61150	61600       3
chr10	61675	62200        2

Could any one suggest how to solve this.

Thanks in advance,

rkk,
Try this :

while read line
do
fst=`echo $line|awk '{print $1}'`
min=`echo $line|awk '{print $2}'`
max=`echo $line|awk '{print $3}'`
cnt=0
while read line1
do
val=`echo $line1|awk '{print $2}'`
if [[ $val -le $max && $val -ge $min ]]
 then
  cnt=`expr $cnt + 1`
fi
done < file1
echo "$fst $min $max $cnt" >> output
done<file2

Thanks,
Vijay

An awk approach:

awk -F'\t' '
        NR == FNR {
                A[$1] = $2 "," $3
                next
        }
        $1 in A {
                n = split( A[$1], V, "," )
                if ( $2 >= V[1] && $2 <= V[2] )
                        R[$1 OFS V[1] OFS V[2]]++
        }
        END {
                for ( k in R )
                        print k, R[k]
        }
' file2 file1

---------- Post updated at 15:33 ---------- Previous update was at 15:29 ----------

@vmenon, if you writing a shell script always use shell built-ins where ever possible.

No need to use awk to split each fields in a file. You can read each field separately in a while loop read statement:

while read f1 f2 f3
do
       ....
done < file2
1 Like

use bedtools intersect. that will be easy

Another

awk
awk 'NR==FNR{aL[$1]=$2;aH[$1]=$3;next} ($1 in aL) {if ($2>=aL[$1] && $2<=aH[$1]) c[$1]++} END { for (i in c) {print i,c}}' f2 f1
chr1 3
chr10 2

Edit: some more readable.

awk '
	NR==FNR	{
		low[$1]=$2
		high[$1]=$3
		next} 
	($1 in low) {
		if ($2>=low[$1] && $2<=high[$1])
			c[$1]++} 
	END {
		for (i in c) {
			print i,c}
	}
' f2 f1

Thanks all..

Yoda's script gives me what I wanted. But I am missing something.

I think I need to give more details:

FILE1:

chr1	61042
chr1	61153
chr1	61446
chr1	61457
chr1	61621
chr1 100010
chr1 100138
chr1 100145
chr1 100150
chr1 100280
chr10	61646
chr10	61914
chr10	62024
chr10	62782

Alos, I have another file

FILE2:

chr1	61150	61600
chr1 100100        100200
chr10	61675	62200

Now for each line specifying range in FILE2, I want to count the number of lines from FILE1 that falls with in the range.

Sample output:

chr1	61150	61600       3
chr1 100100        100200     3
chr10	61675	62200       2

But from Yoda's code I got

chr1 100100        100200     3
chr10	61675	62200       2

I am missing the first line..

Can you suggest any modifications?

Thanks,

Adapting Jotne's approach we get to the result that the requestor requested:

awk     'NR==FNR        {low[$1]=$2
                         high[$1]=$3
                         next}
         $2 >= low [$1] &&
         $2 <= high[$1] {c[$1]++}
         END            {for (i in c) print i, low, high, c}
        ' file2 file1
chr1 61150 61600 3
chr10 61675 62200 2

Thanks Rudi..,

I reposted my message with little modifications.. could you please check and repost the code

Best

Here is the modified code:

awk '
        NR == FNR {
                A[$1","$2] = $3
                next
        }
        {
                for ( k in A )
                {
                        n = split ( k, V, "," )
                        if ( $1 == V[1] )
                        {
                                if ( $2 >= V[2] && $2 <= A[$1","V[2]] )
                                        R[$1 OFS V[2] OFS A[$1","V[2]]]++
                        }
                }
        }
        END {
                for ( k in R )
                        print k, R[k]
        }
' file2 file1

Hi Yoda,

Thanks for the code. The complexity for this code very high as I have 9million lines in file1 and 11000 lines in file2.

Is there any quick way in awk??

Best

I'm not sure if there is another way without for loop visiting each record in file2 for comparison.

May be someone else in this forum has a better idea.

By the way what system are you on and how long it is taking to complete execution?

You could try this approach (which avoids cycling through every record in file2):

awk '
  NR==FNR {
    j=++Ranges[$1]
    Low[$1,j]=$2
    High[$1,j]=$3
    next
  }
  $1 in Ranges {
    for(j=1; j<=Ranges[$1]; j++) if (Low[$1,j]<=$2 && $2<=High[$1,j]) Number[$1,j]++
  }
  END {
    for(i in Ranges) for(j=1; j<=Ranges; j++) print i, Low[i,j], High[i,j], Number[i,j]
  }
' OFS='\t' file2 file1

I presumed that a range match is only valid if the key in field 1 matches as well.

awk

gawk '{
        n=split($0,arr," ")
        if(NR==FNR){
         min[arr[1]]=arr[2]
         max[arr[1]]=arr[3]
         cnt[arr[1]]=0
        }
        else{
         if(arr[2] >= min[arr[1]] && arr[2] <= max[arr[1]])
           cnt[arr[1]]++
        }
}
END{
 for (i in cnt)
  print i" "min" "max" "cnt
}' b a

python

dic={}
cnt=0
with open("b.txt") as f:
 for line in f:
  line=line.replace("\n","")
  words = line.split(" ")
  dic[words[0]]={'MIN':words[1],'MAX':words[2],'CNT':0}
print(dic)
with open("a.txt") as f:
 for line in f:
  line=line.replace("\n","")
  words = line.split(" ")
  if words[1]>=dic[words[0]]['MIN'] and words[1]<=dic[words[0]]['MAX']:
   dic[words[0]]['CNT']+=1
for i in dic:
 print(i,dic['MIN'],dic['MAX'],dic['CNT'])

@summer_cherry. The gawk / awk does not produce the correct output for multiple ranges with the same $1 (see post #6).

---
The python script produces (Python 2.7.2):

line 7, in <module>
    dic[words[0]]={'MIN':words[1],'MAX':words[2],'CNT':0}
IndexError: list index out of range