I have a directory of files, each with a variable (though small) number of lines. I would like to go through each line in each file, and print the:
-file name
-line number
-number of matches to the pattern /comp[0-9]/ for each line.
Two example files (in both, the even-numbered lines are blank):
cat ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt
m.174408g.174408ORFg.174408m.174408type:internallen:82(+)comp664012_c0_seq1:2(-)250(+) Phy00425YH_ACYPI
m.28514g.28514ORFg.28514m.28514type:completelen:172(+)comp42344_c0_seq1:416(-)931(+) m.28517g.28517ORFg.28517m.28517type:3prime_partiallen:112(+)comp42344_c0_seq2:416(-)754(+) Phy00422JU_ACYPI Phy0042C6U_ACYPI Phy00423KN_ACYPI m.14126g.14126ORFg.14126m.14126type:internallen:133(-)comp32693_c0_seq1:3(-)401(-) m.167269g.167269ORFg.167269m.167269type:3prime_partiallen:54(-)comp457687_c0_seq1:1(-)162(-)
Phy00423KN_ACYPI m.14126g.14126ORFg.14126m.14126type:internallen:133(-)comp32693_c0_seq1:3(-)401(-) m.167269g.167269ORFg.167269m.167269type:3prime_partiallen:54(-)comp457687_c0_seq1:1(-)162(-)
cat ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt
m.30099g.30099ORFg.30099m.30099type:internallen:216(-)comp42976_c0_seq1:1(-)648(-) Phy0041ZCK_ACYPI m.42296g.42296ORFg.42296m.42296type:3prime_partiallen:81(+)comp46573_c0_seq1:157(-)402(+)
Phy0041ZCK_ACYPI m.42296g.42296ORFg.42296m.42296type:3prime_partiallen:81(+)comp46573_c0_seq1:157(-)402(+)
Desired output (tab-separated) is:
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 1 1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 3 4
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 5 2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 1 2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 3 1
I've tried using awk so far. This code prints the file name and number of matches in the file, but I'm not sure how to go about breaking it down by line.
cat ../IDs
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt
while read -r file
do
    awk '{while (sub(/comp[0-9]/,":")) t++}END{print FILENAME,t}' "${file}"
done < ../IDs
Any ideas out there?
P.S. A bonus answer would include a fourth output column: the largest number of consecutive fields with pattern matches. For example, line 3 in the first file (line 2 is blank) has four matches, but at most two of these matches are in consecutive fields. Output in this case would be:
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 1 1 1
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 3 4 2
ACYPI55796-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 5 2 2
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 1 2 1
ACYPI000008-PA.aa.afa.afa.trim_phyml_tree_fullnames_fullhomolog.txt 3 1 1
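One possible direction, as a minimal sketch rather than a definitive solution: awk's gsub() returns the number of substitutions it makes, so running it on a copy of each line gives a per-line count, and a loop over the fields can track the longest run of consecutive matching fields. The function name count_comp is made up for illustration, and the sketch assumes fields are whitespace-separated and that blank lines should be skipped while still counting toward the line number.

```shell
# Hypothetical helper: prints file name, line number, match count, and
# longest run of consecutive matching fields, tab-separated.
count_comp() {
  awk -v OFS='\t' '
    NF {                                  # skip blank lines; FNR still counts them
      line = $0
      n = gsub(/comp[0-9]/, "&", line)    # gsub returns the number of matches replaced
      run = best = 0
      for (i = 1; i <= NF; i++) {
        if ($i ~ /comp[0-9]/) {           # field contains at least one match
          if (++run > best) best = run
        } else {
          run = 0                         # a non-matching field breaks the run
        }
      }
      print FILENAME, FNR, n, best
    }' "$@"
}
```

With the ../IDs list above, this could be called as `while read -r file; do count_comp "$file"; done < ../IDs`. Note that a non-blank line with no matches would still be printed, with 0 in the last two columns.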