Counting the lines matching a pattern between two patterns, and generating a table

Hi all,

I'm looking for some help. I have a (very long) file that is organized as shown below:

>Cluster 0
0 283nt, >01_FRYJ6ZM12HMXZS... at +/99%
1 279nt, >01_FRYJ6ZM12HN12A... at +/99%
2 281nt, >01_FRYJ6ZM12HM4TS... at +/99%
3 283nt, >01_FRYJ6ZM12HM946... at +/99%
4 279nt, >01_FRYJ6ZM12HJD9N... at +/99%
5 283nt, >01_FRYJ6ZM12HMM35... at +/99%
6 280nt, >01_FRYJ6ZM12HK26A... at +/99%
7 280nt, >01_FRYJ6ZM12HJ4UN... at +/99%
8 280nt, >01_FRYJ6ZM12HOKP6... at +/99%
9 283nt, >01_FRYJ6ZM12HCH1I... at +/99%
10 280nt, >01_FRYJ6ZM12HKTVV... at +/99%
11 280nt, >01_FRYJ6ZM12HL7IW... at +/98%
12 290nt, >01_FRYJ6ZM12HFI8R... *
13 281nt, >01_FRYJ6ZM12HLLN4... at +/98%
14 280nt, >01_FRYJ6ZM12HI82W... at +/99%
15 267nt, >01_FRYJ6ZM12HISC6... at +/98%
16 270nt, >01_FRYJ6ZM12HMKQG... at +/98%
17 290nt, >01_FRYJ6ZM12HJUQE... at +/98%
18 283nt, >01_FRYJ6ZM12HFSMR... at +/99%
19 280nt, >01_FRYJ6ZM12HK595... at +/99%
20 283nt, >01_FRYJ6ZM12HL768... at +/99%
21 266nt, >01_FRYJ6ZM12HMTF3... at +/100%
22 280nt, >02_FRYJ6ZM12HLE98... at +/99%
23 290nt, >04_FRYJ6ZM12HL1JH... at +/97%
24 275nt, >05_FRYJ6ZM12HE7XC... at +/99%
25 276nt, >05_FRYJ6ZM12HNA0I... at +/98%
26 271nt, >05_FRYJ6ZM12HL9ET... at +/99%
27 275nt, >05_FRYJ6ZM12HH0U0... at +/99%
28 271nt, >05_FRYJ6ZM12HL1AP... at +/99%
29 279nt, >06_FRYJ6ZM12HNECQ... at +/99%
30 278nt, >06_FRYJ6ZM12HMUTE... at +/99%
31 279nt, >06_FRYJ6ZM12HKY06... at +/99%
32 281nt, >08_FRYJ6ZM12HHVLF... at +/99%
33 290nt, >08_FRYJ6ZM12HL1JH... at +/100%
34 276nt, >08_FRYJ6ZM12HLIA7... at +/100%
35 286nt, >08_FRYJ6ZM12HNF98... at +/98%
36 290nt, >08_FRYJ6ZM12HIMCK... at +/100%
37 290nt, >08_FRYJ6ZM12HKJII... at +/100%
38 270nt, >08_FRYJ6ZM12HDIK1... at +/100%
39 279nt, >10_FRYJ6ZM12HEE9R... at +/99%
40 280nt, >10_FRYJ6ZM12HKXEK... at +/98%
41 279nt, >10_FRYJ6ZM12HLZN6... at +/99%
42 275nt, >14_FRYJ6ZM12HGC5C... at +/98%
43 276nt, >15_FRYJ6ZM12HI550... at +/98%
44 271nt, >19_FRYJ6ZM12HMU2M... at +/98%
>Cluster 1
0 290nt, >01_FRYJ6ZM12HKQWR... *
1 281nt, >02_FRYJ6ZM12HNJ2B... at +/100%
2 266nt, >03_FRYJ6ZM12HMQY1... at +/100%
3 266nt, >05_FRYJ6ZM12HMPA8... at +/100%
4 280nt, >05_FRYJ6ZM12HE9N5... at +/99%
5 280nt, >05_FRYJ6ZM12HKTHG... at +/100%
6 280nt, >05_FRYJ6ZM12HKP1Z... at +/99%
7 280nt, >05_FRYJ6ZM12HIF2F... at +/99%
8 279nt, >05_FRYJ6ZM12HJ9MO... at +/97%
9 280nt, >05_FRYJ6ZM12HIQQH... at +/100%
10 281nt, >06_FRYJ6ZM12HLHZL... at +/99%
11 280nt, >06_FRYJ6ZM12HH9O0... at +/99%
12 281nt, >06_FRYJ6ZM12HK2SZ... at +/99%
13 281nt, >06_FRYJ6ZM12HJNW4... at +/100%
14 279nt, >06_FRYJ6ZM12HJUIE... at +/97%
15 280nt, >06_FRYJ6ZM12HFHXR... at +/97%
16 281nt, >06_FRYJ6ZM12HND03... at +/99%
17 282nt, >06_FRYJ6ZM12HHC7G... at +/98%
18 280nt, >06_FRYJ6ZM12HF5CY... at +/100%
19 280nt, >06_FRYJ6ZM12HEVGT... at +/99%
20 281nt, >06_FRYJ6ZM12HLILE... at +/99%
21 278nt, >06_FRYJ6ZM12HLWHQ... at +/99%
22 280nt, >06_FRYJ6ZM12HIU71... at +/100%
23 279nt, >06_FRYJ6ZM12HM3GZ... at +/99%
24 281nt, >06_FRYJ6ZM12HF238... at +/99%
25 273nt, >06_FRYJ6ZM12HDO08... at +/98%
26 276nt, >06_FRYJ6ZM12HE3OI... at +/98%
27 280nt, >06_FRYJ6ZM12HHQ56... at +/100%
28 280nt, >06_FRYJ6ZM12HFYQT... at +/100%
29 271nt, >06_FRYJ6ZM12HLGT2... at +/100%
30 281nt, >06_FRYJ6ZM12HM69N... at +/99%
31 281nt, >06_FRYJ6ZM12HG1WU... at +/99%
32 276nt, >06_FRYJ6ZM12HMHA6... at +/98%
33 245nt, >06_FRYJ6ZM12HHDL4... at +/99%
34 281nt, >06_FRYJ6ZM12HMQZI... at +/98%
35 281nt, >06_FRYJ6ZM12HNAR8... at +/100%
36 279nt, >06_FRYJ6ZM12HN5DI... at +/100%
37 280nt, >06_FRYJ6ZM12HGLSU... at +/98%
38 286nt, >11_FRYJ6ZM12HPCXJ... at +/98%
39 290nt, >11_FRYJ6ZM12HGPWI... at +/99%
40 285nt, >11_FRYJ6ZM12HM9YT... at +/98%
41 286nt, >11_FRYJ6ZM12HI2GG... at +/97%
42 290nt, >11_FRYJ6ZM12HMG2Y... at +/99%
43 281nt, >15_FRYJ6ZM12HKZNJ... at +/100%
44 280nt, >15_FRYJ6ZM12HE9QN... at +/99%
45 265nt, >17_FRYJ6ZM12HJRPI... at +/100%
46 275nt, >17_FRYJ6ZM12HLDLG... at +/98%
47 279nt, >17_FRYJ6ZM12HG1RZ... at +/99%
48 279nt, >17_FRYJ6ZM12HI1H8... at +/98%
49 280nt, >17_FRYJ6ZM12HNISU... at +/99%
50 280nt, >17_FRYJ6ZM12HMIHP... at +/99%
51 280nt, >17_FRYJ6ZM12HI58U... at +/99%
52 280nt, >17_FRYJ6ZM12HILMN... at +/100%
53 242nt, >17_FRYJ6ZM12HKVKQ... at +/98%
54 279nt, >17_FRYJ6ZM12HL1B9... at +/99%
55 280nt, >17_FRYJ6ZM12HEW7F... at +/98%
56 271nt, >17_FRYJ6ZM12HGGML... at +/99%
57 280nt, >17_FRYJ6ZM12HPMJM... at +/98%
58 277nt, >17_FRYJ6ZM12HH5V2... at +/99%
59 267nt, >17_FRYJ6ZM12HIDX1... at +/100%
60 271nt, >17_FRYJ6ZM12HHBYP... at +/98%
61 281nt, >17_FRYJ6ZM12HMHMF... at +/99%
62 282nt, >17_FRYJ6ZM12HLC9P... at +/99%
63 282nt, >17_FRYJ6ZM12HDDJ5... at +/99%
64 276nt, >17_FRYJ6ZM12HKV2F... at +/100%
65 276nt, >17_FRYJ6ZM12HK5OD... at +/99%
66 280nt, >17_FRYJ6ZM12HG1JG... at +/99%
67 281nt, >17_FRYJ6ZM12HMHDW... at +/99%
68 264nt, >17_FRYJ6ZM12HCHVO... at +/100%
69 280nt, >17_FRYJ6ZM12HHT9Y... at +/100%
70 280nt, >17_FRYJ6ZM12HGIYR... at +/100%
71 280nt, >17_FRYJ6ZM12HGR8Y... at +/100%
72 278nt, >17_FRYJ6ZM12HE3PW... at +/98%
73 197nt, >17_FRYJ6ZM12HIYK4... at +/100%
>Cluster 2
0 286nt, >04_FRYJ6ZM12HPCXJ... at +/99%
1 290nt, >04_FRYJ6ZM12HGPWI... *
2 285nt, >04_FRYJ6ZM12HM9YT... at +/98%
3 266nt, >04_FRYJ6ZM12HJK88... at +/100%
4 281nt, >04_FRYJ6ZM12HKZNJ... at +/97%
5 286nt, >04_FRYJ6ZM12HI2GG... at +/98%
>Cluster 3
0 286nt, >04_FRYJ6ZM12HD3BT... *
1 286nt, >06_FRYJ6ZM12HD3BT... at +/97%
>Cluster 4
0 286nt, >23_FRYJ6ZM12HI2GG... *
>Cluster 5
0 280nt, >04_FRYJ6ZM12HO3WD... at +/97%
1 285nt, >04_FRYJ6ZM12HGI5Z... *
2 285nt, >15_FRYJ6ZM12HGI5Z... at +/97%
.......

This is only part of the file. Here we have 6 clusters (numbered 0 to 5; my full file has 1200 in total). Within each cluster, the first column ($1) is a counter, the second column ($2) is the length of the sequence, and the third column ($3) is the name of the sequence.

For each cluster I have what I call the representative sequence (name of the sequence followed by a *).

In the name of the sequence, I have first the symbol >, followed by 2 digits (01-24), and then letters.

Here is what I need:

Create a table (probably using awk) containing:

  • in col 1: the name of the representative sequence (one per cluster: the sequence name followed by *). I can use the command:
    awk '/\*/ {print $3}' file

  • in col 2 to 25: the count of sequences that belong to group 01 to group 24. Basically, in col 2 I want to know how many times a sequence beginning with >01_ occurs in cluster 0; in col 3, how many times a sequence beginning with >02_ occurs in cluster 0; and so on.
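
Printing $3 from the lines that end in * leaves the trailing dots attached to the name (for example >01_FRYJ6ZM12HFI8R...), while the wanted table shows the name without them. A minimal sketch that trims the dots, assuming the representative line always ends in * and the name is always the third whitespace-separated field (rep_names is a hypothetical helper name, not something from the thread):

```shell
# rep_names: hypothetical helper; reads the cluster file(s) given as
# arguments, or standard input when called without arguments.
rep_names() {
    # Keep only lines ending in "*", strip the trailing dots from
    # field 3, and print what is left (the representative name).
    awk '/\*$/ { sub(/\.+$/, "", $3); print $3 }' "$@"
}
```

Called as `rep_names file`, this should give one representative name per cluster, in file order.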

Here is the output I want for the file given above:
>01_FRYJ6ZM12HFI8R 22 1 0 1 5 3 0 7 0 3 0 0 0 1 1 0 0 0 1 0 0 0 0 0
>01_FRYJ6ZM12HKQWR 1 1 1 0 7 28 0 0 0 0 5 0 0 0 2 0 29 0 0 0 0 0 0 0
>04_FRYJ6ZM12HGPWI 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>04_FRYJ6ZM12HD3BT 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>23_FRYJ6ZM12HI2GG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
>04_FRYJ6ZM12HGI5Z 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

I need all the fields to be separated by tabs, of course.

Thank you for your help, if anybody knows what I should do or whom to ask.
Feel free to ask me questions...

Diane

Please post in a subforum that has something in common with your problem! I've moved the thread to shell scripting...

Welcome to unix.com
DN2

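#!/usr/bin/perl
use strict;
use warnings;

# %hash maps cluster number -> { NAME => representative, '01'..'24' => counts }
my (%hash, $cluster);
open my $fh, '<', 'a.txt' or die "Cannot open a.txt: $!";
while (<$fh>) {
	# Cluster header, e.g. ">Cluster 0": start a new record.
	if (/^>\D*(\d+)\D*$/) {
		$cluster = $1;
		$hash{$cluster} = {};
		next;
	}
	# Member line: count the two-digit group prefix that follows ">".
	if (/^.+>(\d+).*$/) {
		$hash{$cluster}{$1}++;
	}
	# Representative line (ends in "*"): keep the name up to the first ".".
	if (/^[^>]*(>[^.]*).*\*\s*$/) {
		$hash{$cluster}{NAME} = $1;
	}
}
close $fh;
# One tab-separated row per cluster: name, then the counts for groups 01..24.
foreach my $key (sort { $a <=> $b } keys %hash) {
	print join("\t", $hash{$key}{NAME}, map { $hash{$key}{$_} || 0 } ('01' .. '24')), "\n";
}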
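
Since the question asked for awk, the same table can also be sketched in awk. This is only an illustration under a few assumptions: every member line has the sequence name as its third whitespace-separated field, the two characters after > are the group number (01 to 24), and the representative line ends in a lone *. The names cluster_table and emit are hypothetical.

```shell
# cluster_table: hypothetical helper; pass the cluster file as argument
# (or feed it on standard input).
cluster_table() {
    awk '
    # Print one row: representative name, then 24 tab-separated counts.
    function emit(   i, row) {
        if (!seen) return
        row = name
        for (i = 1; i <= 24; i++) row = row "\t" (count[i] + 0)
        print row
        split("", count)   # reset the counts for the next cluster
        name = ""
    }
    /^>Cluster/ { emit(); seen = 1; next }   # header: flush previous cluster
    NF < 3      { next }                     # skip blank or filler lines
    {
        id = $3                      # e.g. ">01_FRYJ6ZM12HFI8R..."
        grp = substr(id, 2, 2) + 0   # two digits after ">" as a number
        count[grp]++
        if ($NF == "*") { sub(/\.+$/, "", id); name = id }
    }
    END { emit() }                   # flush the last cluster
    ' "$@"
}
```

With the file names mentioned later in the thread, it could be called as `cluster_table file.fas.clstr.clstr > file.txt`. Rows come out in file order, one per cluster.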

Thank you. I see this is a Perl script. I don't know anything about Perl; could you tell me the command line I'm supposed to use in my terminal? (The input file to be processed is file.fas.clstr.clstr, and the output file should be named file.txt.)

Do I have to save the script you gave me into the same directory as my input file?

Thank you for any information you can give me,

Diane
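
For reference, one way this could be run, as a sketch rather than the poster's actual steps: the script above opens a file named a.txt, and that name is looked up in the directory the command is run from, not in the directory where the script is saved. So the script can live anywhere, as long as a copy of the input named a.txt is in the current directory (or the file name in the open line is edited). The names run_cluster_script and count_clusters.pl are assumptions for illustration.

```shell
# run_cluster_script: hypothetical wrapper.
#   $1: path to the saved Perl script (e.g. count_clusters.pl, assumed name)
#   $2: input cluster file
#   $3: output table
# Note: this overwrites any existing a.txt in the current directory.
run_cluster_script() {
    cp "$2" a.txt &&       # the script reads "a.txt" from the current directory
        perl "$1" > "$3"   # the table is printed to stdout, so redirect it
}
```

With the file names from the question, that would be `run_cluster_script count_clusters.pl file.fas.clstr.clstr file.txt`.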

I made it work. Thank you very much; it is doing exactly what I want.

D.