Hi.
A few thoughts on this from a person who does not know much about FASTA-format files. My impression is that the sequence after the >ID
line often spans many lines. There are some utilities that are designed to work with such files (and even the more complex formats : FASTA format - Wikipedia )
One such is fastagrep
which in one part of a group of perl
utilities at: GitHub - rec3141/rec-genome-tools . These appear to be somewhat old, but fastagrep
worked for me as shown below.
For those wishing to do more in perl
, see the search for fasta at Search for "fasta" - metacpan.org , seeing the BIO and FASTA collections.
However, one may often want to use awk
, so I put together a small template that could be used for elementary processing, like getting the length of a sequence. Functions can do some of the work, are easily added, and two are included to see how they can work.
The last example illustrates the modularity of using fastagrep
to filter the desired IDs which are then piped into the awk
code:
!/usr/bin/env bash
# @(#) s1 Demonstrate awk template for reading FASTA files.
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk
FILE=${1-data1}
pl " Input file $FILE:"
head -25 $FILE
pl " Results, $FG extracted lines from $FILE:"
fastagrep 'seq1|seq2' $FILE
pl " Results, awk print from $FILE, awk FASTA template:"
awk '
function fformat(id, sequence)
{
printf ">%s\n", id
while (length(sequence) >= 60) {
print substr(sequence, 1, 60)
gsub(/^.{60}/, "", sequence)
}
if (length(sequence) > 0) { print sequence }
}
function operate(id, sequence)
{
print id, length(sequence)
}
BEGIN {
getline
id = substr($1, 2)
sequence = ""
}
$0 ~ /^>/ {
fformat(id, sequence)
# operate(id, sequence)
id = substr($1, 2)
sequence = ""
next
}
{
sequence = sequence $0
}
END { fformat(id, sequence) }
# END { operate(id, sequence) }
' $FILE
pl " Results, fastagrep and print length with \"operate\" function:"
fastagrep 'seq1|seq2' $FILE |
./s0
exit 0
producing:
$ ./s1 data3
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-7-amd64, x86_64
Distribution : Debian 8.11 (jessie)
bash GNU bash 4.3.30
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)
-----
Input file data3:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG
>seq3
TAGC
-----
Results, extracted lines from data3:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG
-----
Results, awk print from data3, awk FASTA template:
>seq1
ATGCTA
>seq2
CGAAACGCAAACAGCCCACGACAGTTAGCCTAGATTGGAACCTCCGGGAAATATGAGACT
GGGTTTAAGGGATGGCAGCCATAGCCACGACATCACGTAG
>seq3
TAGC
-----
Results, fastagrep and print length with "operate" function:
seq1 6
seq2 100
The ./s0 above is essentially the awk-template using the operate function.
More details on fastagrep
:
fastagrep extract sequences from a multi-FASTA file by regex. (what)
Path : ~/bin/fastagrep
Version : fastagrep version 0.3
Length : 392 lines
Type : Perl script, ASCII text executable
Shebang : #!/usr/bin/env perl
Home : GitHub - rec3141/rec-genome-tools (doc)
Modules : (for perl codes)
strict 1.08
warnings 1.23
Getopt::Std 1.10
IO::File 1.16
Data::Dumper 2.151_01
Best wishes ... cheers, drl