Help with calculating frequency of specific word in a string

Input file:

#read_1
AWEAWQQRZZZQWQQWZ
#read_2
ZZAQWRQTWQQQWADSADZZZ
#read_3
POGZZZZZZADWRR
.
.

Desired output file:

#read_1 3
#read_1 1
#read_2 2
#read_2 3
#read_3 6
.
.

Perl script that I have tried:

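#!/usr/bin/perl
use strict;
use warnings;

# Records are separated by "#", the header marker used in the input file.
$/ = "#";

while (<>) {
	next if $. == 1;	# the first record is empty (the file starts with "#")
	chomp;			# remove the trailing "#"

	my ($header, @other) = split(/\n/, $_);
	my $sequence = join "", @other;

	# Count every "Z" in the sequence, regardless of position.
	my @letters = split "", $sequence;
	my $counter = 0;
	foreach my $base (@letters) {
		$counter++ if $base eq 'Z';
	}
	print "#$header $counter\n";
}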

Command I have tried:

[home@user]perl count.pl input_file.txt > input_file.stats
[home@user]cat input_file.stats
#read_1 4
#read_2 5
#read_3 6
.
.

My goal is to report the length of each consecutive run of "Z" in every string separately.
However, I am only able to get the total count of "Z" in each string.

Thanks for any advice.

perl -l -0043 -ne '/(.*)\n(.*)/;$h=$1;$s=$2;while($s=~/Z+/g){print "#$h " . length $&}' input_file.txt

Thanks, bartus11.
Your Perl one-liner worked perfectly.
Would you mind explaining what "-l -0043" and "/(.*)\n(.*)/;" at the beginning of it mean?
Many thanks for the advice.

From http://perldoc.perl.org/perlrun.html:

-0[octal/hexadecimal] 

specifies the input record separator ($/ ) as an octal or hexadecimal number.

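The octal value of the ASCII code for "#" is 043, so "#" now marks the record boundaries instead of the newline. The -l switch chomps that separator from each record as it is read and appends a newline to every print (because -l comes before -0043, the output separator is taken from the default $/, which is a newline). Now to your second question: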

/(.*)\n(.*)/

In each new record, whatever comes before the newline is matched by the first capture group, (.*), so the header (read_...) ends up in $1. Whatever comes after the newline is matched by the second capture group, so the line with the Zs ends up in $2.
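
If it helps, here is a rough sketch of the same logic written out as a full script, so each switch of the one-liner has a visible counterpart. It is only an illustration: the helper name run_lengths is made up for this example, and the sample data from the question is read from an in-memory string rather than from input_file.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data copied from the question; a real run would open input_file.txt instead.
my $input = <<'END';
#read_1
AWEAWQQRZZZQWQQWZ
#read_2
ZZAQWRQTWQQQWADSADZZZ
#read_3
POGZZZZZZADWRR
END

# run_lengths is a hypothetical helper name, not part of the original one-liner.
# It returns one "#header length" line per run of consecutive Zs.
sub run_lengths {
    my ($fh) = @_;
    local $/ = "#";              # same effect as -0043: records end at "#"
    my @out;
    while (my $record = <$fh>) {
        chomp $record;           # same effect as -l on input: drop the trailing "#"
        # First (.*) captures the header line, second (.*) the sequence line.
        next unless $record =~ /(.*)\n(.*)/;
        my ($header, $sequence) = ($1, $2);
        # Each /Z+/g match is one run; its length is the number to report.
        while ($sequence =~ /(Z+)/g) {
            push @out, "#$header " . length($1);
        }
    }
    return @out;
}

open my $fh, '<', \$input or die "Cannot open in-memory input: $!";
print "$_\n" for run_lengths($fh);
close $fh;
```

With the three sample reads this prints the five lines from the desired output (#read_1 3, #read_1 1, #read_2 2, #read_2 3, #read_3 6). To run it on a real file, replace the in-memory open with an ordinary open on the file name.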
