How to count the length of fasta sequences?

dineshkumarsrk · April 9, 2019, 8:44am

I could calculate the length of entire fasta sequences by following command,

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta

But, I need to calculate the length of a particular fasta sequence specified/listed in another txt file. The results to to be printed in a csv file.
Therefore, please help me to do the same.
Thanks in advance.

RavinderSingh13 · April 9, 2019, 8:49am

dineshkumarsrk:

I could calculate the length of entire fasta sequences by following command,
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' unique.fasta
But, I need to calculate the length of a particular fasta sequence specified/listed in another txt file. The results to to be printed in a csv file.
Therefore, please help me to do the same.
Thanks in advance.

Sorry mate without seeing sample of Input and expected output it is NOT possible to tweak a solution. So kindly do add sample of your Input_file(fasta file) and show us expected output file(.csv one) in CODE TAGS and let us know then.

Thanks,
R. Singh

dineshkumarsrk · April 9, 2019, 9:04am

I have a multi fasta sequences in unique.fasta as given below,

>seq1
ATGCTA
>seq2
GCTAGTT
>seq3
TAGC

I need to count the length of following header's listed in id.txt as given below,

seq1
seq2

And the results(length) to be printed in csv file as shown below,

seq1   6
seq2   7

RavinderSingh13 · April 9, 2019, 9:08am

Hello dineshkumarsrk,

Could you please try following.

awk 'BEGIN{FS="[> ]"} /^>/{val=$2;next}  {print val,length($0)}'   Input_file

Output will be as follows.

seq1 6
seq2 7
seq3 4

2nd Solution: Or let's say your Input_file may end with a line which starting with > in that case if you want to print that remaining whose length value will be NULL try following.

awk 'BEGIN{FS="[> ]"} /^>/{val=$2;next}  {print val,length($0);val=""} END{if(val!=""){print val}}'   Input_file

3rd Solution: In case your seq lines may have spaces in them in that case my previous solutions may NOT give their full value so to get their full line values(without > ) use following.

awk '/^>/{sub(/^>/,"");val=$0;next}  {print val,length($0)}'   Input_file

OR to take care of scenario where your Input_file could be ending with > and seq string may have spaces in it try:

awk '/^>/{sub(/^>/,"");val=$0;next}  {print val,length($0);val=""} END{if(val!=""){print val}}'   Input_file

Thanks,
R. Singh

dineshkumarsrk · April 9, 2019, 9:17am

Thank you singh,
Your command prints all the sequences. However, I need to print only few sequences length as listed in id.txt file. If I did not understand your commands properly, please let me know, where to include id.txt file in your command?

RavinderSingh13 · April 9, 2019, 9:24am

Oh ok, I was in impression that you want to print all seq strings length in a single Input_file, could you please try following now.

awk 'FNR==NR{a[$0];next} /^>/ && sub(/^>/,""){;found=val="";if($0 in a){val=$0;found=1};next} found{print val,length($0)} ' ids.txt  Input_file

Output will be as follows.

seq1 6
seq2 7

Thanks,
R. Singh

dineshkumarsrk · April 9, 2019, 9:28am

Thanks singh for your time and help. It works.
Sorry, earlier i did a mistake, that's why i did not get output.

Scrutinizer · April 9, 2019, 8:55pm

Another option to try:

awk 'NR>1{print $1, length($2)}' FS='\n' RS=\> OFS='\t' file

RudiC · April 10, 2019, 3:26am

If - as specified - you want the sequences in id.txt ONLY, amend scrutinizer's proposal like

awk 'FNR == NR {T[$1]; next} $1 in T {print $1, length($2)}' id.txt FS='\n' RS=\> OFS='\t' unique.fasta
seq1    6
seq2    7

dineshkumarsrk · April 10, 2019, 5:43am

is it possible to print the order as same as id.txt. The command works fine, but the order has been changed. For example, id.txt file listed the headers as follows,

seq1
seq2

Where as my unique.fasta file has the sequences in the following order,

>seq2
ATGCTTA
>seq1
GCTAGT

But, the command print seq2 length first followed by seq1 as shown below.

seq2     7
seq1      6

Therefore, is it possible to keep the order as same as id.txt.

--- Post updated at 09:43 AM ---

Dear singh,
Is it possible to print the order as same as id.txt. The command works fine, but the order has been changed. For example, id.txt file listed the headers as follows,

seq1
seq2

Whereas, my unique.fasta file has the sequences in the following order,

>seq2
ATGCTTA
>seq1
GCTAGT

But, the command print seq2 length first followed by seq1 as shown below.

seq2     7
seq1    6

Therefore, is it possible to keep the order as same as id.txt.

RudiC · April 10, 2019, 7:04am

Well, yes, try

awk '
FNR == NR       {T[$1] = length($2)
                 next
                }
                {print $1, T[$1]
                }
' FS='\n' RS=\> OFS='\t' unique.fasta RS="\n" id.txt

This has been tested for exactly your setup; for larger or more complex data there may be hickups. Give it a try and report back.

RavinderSingh13 · April 10, 2019, 7:23am

Hello dineshkumarsrk,

Could you please try following.

awk '
FNR==NR{
  if($0~/^>/){
    sub(/^>/,"")
    val=$0;
    next
  }
  a[val]=(a[val]?a[val] OFS:"")length($0)
  next
}
($0 in a){
  print $0,a[$0]
}
'  Input_file  id.txt

Thanks,
R. Singh

dineshkumarsrk · April 10, 2019, 7:52am

Dear rudic
It works perfectly. But, it prints "0" at last column.
Though it is not a issue at all, but I am reporting the same for your reference.

--- Post updated at 11:52 AM ---

Dear Singh,
It works perfectly fine.

RudiC · April 10, 2019, 8:09am

Surprising. This is what I get from your sample data:

seq1    6
seq2    7

Please show your results...

drl · April 21, 2019, 11:56am

Hi.

A few thoughts on this from a person who does not know much about FASTA-format files. My impression is that the sequence after the >ID line often spans many lines. There are some utilities that are designed to work with such files (and even the more complex formats : FASTA format - Wikipedia )

One such is fastagrep which in one part of a group of perl utilities at: GitHub - rec3141/rec-genome-tools . These appear to be somewhat old, but fastagrep worked for me as shown below.

For those wishing to do more in perl , see the search for fasta at Search for "fasta" - metacpan.org , seeing the BIO and FASTA collections.

However, one may often want to use awk , so I put together a small template that could be used for elementary processing, like getting the length of a sequence. Functions can do some of the work, are easily added, and two are included to see how they can work.

The last example illustrates the modularity of using fastagrep to filter the desired IDs which are then piped into the awk code:

!/usr/bin/env bash

# @(#) s1       Demonstrate awk template for reading FASTA files.

LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk

FILE=${1-data1}

pl " Input file $FILE:"
head -25 $FILE

pl " Results, $FG extracted lines from $FILE:"
fastagrep 'seq1|seq2' $FILE

pl " Results, awk print from $FILE, awk FASTA template:"
awk '
function fformat(id, sequence)
{
  printf ">%s\n", id
  while (length(sequence) >= 60) {
    print substr(sequence, 1, 60)
    gsub(/^.{60}/, "", sequence)
  }
  if (length(sequence) > 0) { print sequence }
}
function operate(id, sequence)
{
  print id, length(sequence)
}
BEGIN {
  getline
  id = substr($1, 2)
  sequence = ""
}
$0 ~ /^>/ {
  fformat(id, sequence)
  # operate(id, sequence)
  id = substr($1, 2)
  sequence = ""
  next
}
{
  sequence = sequence $0
}
END { fformat(id, sequence) }
# END { operate(id, sequence) }

' $FILE

pl " Results, fastagrep and print length with \"operate\" function:"
fastagrep 'seq1|seq2' $FILE |
./s0

exit 0

producing:

$ ./s1 data3

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution        : Debian 8.11 (jessie) 
bash GNU bash 4.3.30
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)

-----
 Input file data3:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG
>seq3
TAGC


-----
 Results,  extracted lines from data3:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG

-----
 Results, awk print from data3, awk FASTA template:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG
>seq3
TAGC

-----
 Results, fastagrep and print length with "operate" function:
seq1 6
seq2 100

The ./s0 above is essentially the awk-template using the operate function.

More details on fastagrep :

fastagrep       extract sequences from a multi-FASTA file by regex. (what)
Path    : ~/bin/fastagrep
Version : fastagrep version 0.3
Length  : 392 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Home    : GitHub - rec3141/rec-genome-tools (doc)
Modules : (for perl codes)
 strict 1.08
 warnings       1.23
 Getopt::Std    1.10
 IO::File       1.16
 Data::Dumper   2.151_01

Best wishes ... cheers, drl