Count number of pattern matches per line for all files in directory

pathunkathunk · April 23, 2014, 4:43pm

I have a directory of files, each with a variable (though small) number of lines. I would like to go through each line in each file and print:
- the file name
- the line number
- the number of matches to the pattern /comp[0-9]/ on that line

Two example files:

cat ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 
m.174408g.174408ORFg.174408m.174408type:internallen:82(+)comp664012_c0_seq1:2(-)250(+)	 Phy00425YH_ACYPI	 

m.28514g.28514ORFg.28514m.28514type:completelen:172(+)comp42344_c0_seq1:416(-)931(+)	 m.28517g.28517ORFg.28517m.28517type:3prime_partiallen:112(+)comp42344_c0_seq2:416(-)754(+)	 Phy00422JU_ACYPI	 Phy0042C6U_ACYPI	 Phy00423KN_ACYPI	 m.14126g.14126ORFg.14126m.14126type:internallen:133(-)comp32693_c0_seq1:3(-)401(-)	 m.167269g.167269ORFg.167269m.167269type:3prime_partiallen:54(-)comp457687_c0_seq1:1(-)162(-)	 

Phy00423KN_ACYPI	 m.14126g.14126ORFg.14126m.14126type:internallen:133(-)comp32693_c0_seq1:3(-)401(-)	 m.167269g.167269ORFg.167269m.167269type:3prime_partiallen:54(-)comp457687_c0_seq1:1(-)162(-)	 

cat ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 
m.30099g.30099ORFg.30099m.30099type:internallen:216(-)comp42976_c0_seq1:1(-)648(-)	 Phy0041ZCK_ACYPI	 m.42296g.42296ORFg.42296m.42296type:3prime_partiallen:81(+)comp46573_c0_seq1:157(-)402(+)	 

Phy0041ZCK_ACYPI	 m.42296g.42296ORFg.42296m.42296type:3prime_partiallen:81(+)comp46573_c0_seq1:157(-)402(+)

Desired output (tab-separated) is:

ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   1

I've tried using awk so far. This code prints the file name and number of matches in the file, but I'm not sure how to go about breaking it down by line.

cat ../IDs
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt

while read file
do 
awk '{while (sub(/comp[0-9]/,":")) t++}END{print FILENAME,t}' ${file}
done < ../IDs

Any ideas out there?

P.S. A bonus answer would include a fourth output column: the largest number of consecutive fields with pattern matches. For example, line 3 in the first file (line 2 is blank) has four matches, but at most two of these matches are in consecutive fields. Output in this case would be:

ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4   2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2   1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1

Don_Cragun · April 23, 2014, 5:08pm

Is this a homework assignment?

If not, please explain why you need this data.

bartus11 · April 23, 2014, 5:15pm

I'd say this is some kind of bioinformatics data. Anyway, you can try this in a directory containing your files:

perl -lne '$,=" ";@x=/comp[0-9]+/g;/([^\t]*comp[0-9]+[^\t]*\t?)+/;$tmp=$&;@y=$tmp=~/comp[0-9]+/g;print $ARGV,$.,($#x+1),($#y+1) if ($#x+1);$.=0 if eof' *

pathunkathunk · April 23, 2014, 6:47pm

bartus11, this works, thank you. It's become clear that I need to spend some time learning perl.

Don Cragun, I am a biologist. This request is to help me parse the results of an analysis I did of data that I generated. I hope to soon be able to do everything from field work to wet lab work to all of the analysis...but I'm not quite there.

Don_Cragun · April 23, 2014, 9:46pm

Assuming that I am correct in believing that the desired bonus output you provided:

ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4   2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2   1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1

should have been:

ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   1   1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   4   2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   5   2   2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   1   2   1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt   3   1   1

and with the sets of three spaces changed to tabs, the following script (using awk instead of perl) also seems to do what you want:

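#!/bin/ksh
awk '
{	# nm: fields on this line that match; nc: current run of
	# consecutive matching fields; ncM: longest run seen so far
	nm = nc = ncM = 0
	for(i = 1; i <= NF; i++)
		if(match($i, /comp[0-9]/)) {
			nm++
			if(++nc > ncM)
				ncM = nc
		} else	nc = 0	# a non-matching field ends the run
	# skip lines with no matches (including blank lines)
	if(nm)	printf("%s\t%d\t%d\t%d\n", FILENAME, FNR, nm, ncM)
}' $(cat IDs)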

producing the output:

ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	1	1	1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	3	4	2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	5	2	2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	1	2	1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt	3	1	1

If someone wants to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
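
One thing to keep in mind when adapting the script above: it counts fields that contain a match, so a field holding two comp IDs would be counted once. With the sample data that makes no difference. If your real files can have more than one match per field, a minimal sketch along the following lines may help. It assumes the fields are tab-separated, and the function name count_comp is made up for this example. It uses the return value of gsub(), which is the number of substitutions made, to count every match.

```shell
#!/bin/sh
# Sketch only: count_comp is a hypothetical helper name, and
# tab-separated fields are an assumption about the input.
# Prints: file name, line number, total matches on the line, and the
# longest run of consecutive fields that contain at least one match.
count_comp() {
    awk -F'\t' -v OFS='\t' '
    {
        total = run = longest = 0
        for (i = 1; i <= NF; i++) {
            field = $i                          # work on a copy so $i is untouched
            n = gsub(/comp[0-9]+/, "", field)   # gsub returns the number of matches replaced
            total += n
            if (n > 0) {
                if (++run > longest)
                    longest = run
            } else {
                run = 0                         # a field without a match ends the run
            }
        }
        if (total)                              # skip lines with no matches
            print FILENAME, FNR, total, longest
    }' "$@"
}
```

After defining the function you could call it as count_comp *.txt in the directory holding the files. Passing the names as quoted arguments ("$@") also avoids the word splitting that $(cat IDs) relies on, which only matters if a file name contains whitespace.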