Count lines

rkk · May 17, 2013, 4:02pm

Hello,

I have a file with two columns like the following

FILE1:

chr1	61042
chr1	61153
chr1	61446
chr1	61457
chr1	61621
chr10	61646
chr10	61914
chr10	62024
chr10	62782

Alos, I have another file

FILE2:

chr1	61150	61600
chr10	61675	62200

Now for each line specifying range in FILE2, I want to count the number of lines from FILE1 that falls with in the range.

Sample output:

chr1	61150	61600       3
chr10	61675	62200        2

Could any one suggest how to solve this.

Thanks in advance,

vmenon · May 17, 2013, 4:25pm

rkk,
Try this :

while read line
do
fst=`echo $line|awk '{print $1}'`
min=`echo $line|awk '{print $2}'`
max=`echo $line|awk '{print $3}'`
cnt=0
while read line1
do
val=`echo $line1|awk '{print $2}'`
if [[ $val -le $max && $val -ge $min ]]
 then
  cnt=`expr $cnt + 1`
fi
done < file1
echo "$fst $min $max $cnt" >> output
done<file2

Thanks,
Vijay

Yoda · May 17, 2013, 4:33pm

An awk approach:

awk -F'\t' '
        NR == FNR {
                A[$1] = $2 "," $3
                next
        }
        $1 in A {
                n = split( A[$1], V, "," )
                if ( $2 >= V[1] && $2 <= V[2] )
                        R[$1 OFS V[1] OFS V[2]]++
        }
        END {
                for ( k in R )
                        print k, R[k]
        }
' file2 file1

---------- Post updated at 15:33 ---------- Previous update was at 15:29 ----------

@vmenon, if you writing a shell script always use shell built-ins where ever possible.

No need to use awk to split each fields in a file. You can read each field separately in a while loop read statement:

while read f1 f2 f3
do
       ....
done < file2

rossi · May 17, 2013, 4:38pm

use bedtools intersect. that will be easy

Jotne · May 17, 2013, 4:47pm

Another

awk

awk 'NR==FNR{aL[$1]=$2;aH[$1]=$3;next} ($1 in aL) {if ($2>=aL[$1] && $2<=aH[$1]) c[$1]++} END { for (i in c) {print i,c}}' f2 f1
chr1 3
chr10 2

Edit: some more readable.

awk '
	NR==FNR	{
		low[$1]=$2
		high[$1]=$3
		next} 
	($1 in low) {
		if ($2>=low[$1] && $2<=high[$1])
			c[$1]++} 
	END {
		for (i in c) {
			print i,c}
	}
' f2 f1

rkk · May 17, 2013, 5:10pm

Thanks all..

Yoda's script gives me what I wanted. But I am missing something.

I think I need to give more details:

FILE1:

chr1	61042
chr1	61153
chr1	61446
chr1	61457
chr1	61621
chr1 100010
chr1 100138
chr1 100145
chr1 100150
chr1 100280
chr10	61646
chr10	61914
chr10	62024
chr10	62782

Alos, I have another file

FILE2:

chr1	61150	61600
chr1 100100        100200
chr10	61675	62200

Now for each line specifying range in FILE2, I want to count the number of lines from FILE1 that falls with in the range.

Sample output:

chr1	61150	61600       3
chr1 100100        100200     3
chr10	61675	62200       2

But from Yoda's code I got

chr1 100100        100200     3
chr10	61675	62200       2

I am missing the first line..

Can you suggest any modifications?

Thanks,

RudiC · May 17, 2013, 5:11pm

Adapting Jotne's approach we get to the result that the requestor requested:

awk     'NR==FNR        {low[$1]=$2
                         high[$1]=$3
                         next}
         $2 >= low [$1] &&
         $2 <= high[$1] {c[$1]++}
         END            {for (i in c) print i, low, high, c}
        ' file2 file1
chr1 61150 61600 3
chr10 61675 62200 2

rkk · May 17, 2013, 5:24pm

Thanks Rudi..,

I reposted my message with little modifications.. could you please check and repost the code

Best

Yoda · May 17, 2013, 6:00pm

Here is the modified code:

awk '
        NR == FNR {
                A[$1","$2] = $3
                next
        }
        {
                for ( k in A )
                {
                        n = split ( k, V, "," )
                        if ( $1 == V[1] )
                        {
                                if ( $2 >= V[2] && $2 <= A[$1","V[2]] )
                                        R[$1 OFS V[2] OFS A[$1","V[2]]]++
                        }
                }
        }
        END {
                for ( k in R )
                        print k, R[k]
        }
' file2 file1

rkk · May 17, 2013, 6:21pm

Hi Yoda,

Thanks for the code. The complexity for this code very high as I have 9million lines in file1 and 11000 lines in file2.

Is there any quick way in awk??

Best

Yoda · May 17, 2013, 6:31pm

I'm not sure if there is another way without for loop visiting each record in file2 for comparison.

May be someone else in this forum has a better idea.

By the way what system are you on and how long it is taking to complete execution?

Scrutinizer · May 18, 2013, 1:31am

You could try this approach (which avoids cycling through every record in file2):

awk '
  NR==FNR {
    j=++Ranges[$1]
    Low[$1,j]=$2
    High[$1,j]=$3
    next
  }
  $1 in Ranges {
    for(j=1; j<=Ranges[$1]; j++) if (Low[$1,j]<=$2 && $2<=High[$1,j]) Number[$1,j]++
  }
  END {
    for(i in Ranges) for(j=1; j<=Ranges; j++) print i, Low[i,j], High[i,j], Number[i,j]
  }
' OFS='\t' file2 file1

I presumed that a range match is only valid if the key in field 1 matches as well.

summer_cherry · May 20, 2013, 4:01am

awk

gawk '{
        n=split($0,arr," ")
        if(NR==FNR){
         min[arr[1]]=arr[2]
         max[arr[1]]=arr[3]
         cnt[arr[1]]=0
        }
        else{
         if(arr[2] >= min[arr[1]] && arr[2] <= max[arr[1]])
           cnt[arr[1]]++
        }
}
END{
 for (i in cnt)
  print i" "min" "max" "cnt
}' b a

python

dic={}
cnt=0
with open("b.txt") as f:
 for line in f:
  line=line.replace("\n","")
  words = line.split(" ")
  dic[words[0]]={'MIN':words[1],'MAX':words[2],'CNT':0}
print(dic)
with open("a.txt") as f:
 for line in f:
  line=line.replace("\n","")
  words = line.split(" ")
  if words[1]>=dic[words[0]]['MIN'] and words[1]<=dic[words[0]]['MAX']:
   dic[words[0]]['CNT']+=1
for i in dic:
 print(i,dic['MIN'],dic['MAX'],dic['CNT'])

Scrutinizer · May 20, 2013, 5:10am

@summer_cherry. The gawk / awk does not produce the correct output for multiple ranges with the same $1 (see post #6).

---
The python script produces (Python 2.7.2):

line 7, in <module>
    dic[words[0]]={'MIN':words[1],'MAX':words[2],'CNT':0}
IndexError: list index out of range