reformat data with a shell script

Can anyone help me with a shell script that can do the following:

I have data in FASTA format (each entry has a header line, followed by one or more lines of sequence characters).

>ALLLY
GGCCCCTCGAGCCTCGAACCGGAACCTCCAAATCCGAGACGCTCTGCTTATGAGGACCTC
GAAATATGCCGGCCAGTGAAAAAATCTTGTGGCTTTGAGGGCTTTTGGTTGGCCAGGGGC
AGTAAAAATCTCGGAGAGCTGACACCAAGTCCTCCCCTGCCACGTAGCAGTGGTAAAGTC
CGAAGCTCAAATTCCGAGAATTGAGCTCTGTTGATTCTTAGAACTGGGGTTCTTAGAAGT
>BLLLK
CTGGTCTCAGTCTGGTACTGAAGTCAGGAATGGCTTAAGGTGAAATCGTGGTCCTCTGGT
GAAGCTCAGCGAAGACCCCCTCGCCTTGTTTATGACAAGAGAACTTCTGGGGGCGGGAGG
AAGAGTCCCTGTTACGATGCTGATCATCATTGAGCTTTTGCTGAGCAGAAAACTCTTTAG
TACTCAAGGTCGAGAGTCTCTGGTGGTCTGCCTGGCACCAGGCACCTTCCTACAACCCTA
GTTTTCCAAAAGGACAAAGCCTGGGGCAGGCGACGTCCTAGCTCGCATTTGAACAGGGCC
GCGGGCCAGCAGAGATGCGCGATGCCCAACTCTTTCCAAGAGCACCTCGCGTCCCGAACC

I want to reformat the data so that the entire sequence for each entry is printed on one line, with the entry name (e.g. ALLLY) printed before the sequence and separated from it by a tab:

ALLLY GGCCCCTCGAGCCTCGAACCGGAACCTCCAAATCCGAGACGCTCTGCTTATGAGGACCTCGAAATATGCCGGCCAGTGAAAAAATCTTGTGGCTTTGAGGGCTTTTGGTTGGCCAGGGGCAGTAAAAATCTCGGAGAGCTGACACCAAGTCCTCCCCTGCCACGTAGCAGTGGTAAAGTCCGAAGCTCAAATTCCGAGAATTGAGCTCTGTTGATTCTTAGAACTGGGGTTCTTAGAAGT
BLLLK CTGGTCTCAGTCTGGTACTGAAGTCAGGAATGGCTTAAGGTGAAATCGTGGTCCTCTGGTGAAGCTCAGCGAAGACCCCCTCGCCTTGTTTATGACAAGAGAACTTCTGGGGGCGGGAGGAAGAGTCCCTGTTACGATGCTGATCATCATTGAGCTTTTGCTGAGCAGAAAACTCTTTAGTACTCAAGGTCGAGAGTCTCTGGTGGTCTGCCTGGCACCAGGCACCTTCCTACAACCCTAGTTTTCCAAAAGGACAAAGCCTGGGGCAGGCGACGTCCTAGCTCGCATTTGAACAGGGCCGCGGGCCAGCAGAGATGCGCGATGCCCAACTCTTTCCAAGAGCACCTCGCGTCCCGAACC

Any suggestions or a working script would be highly appreciated.

biobee

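Here's what I would do in Vim:

:g/^>/s/^>\(.*\)/\1\t/
:v/\t$/-1j!

The first command strips the leading > from each header and appends a tab to it. The second joins every remaining line onto the line above it without inserting spaces. (A plain :v/^>/j! only joins the sequence lines in pairs, because each join consumes the next marked line.)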

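This is a basic job for awk:

#!/bin/awk -f
# join.awk: print each FASTA entry as "name<TAB>sequence" on one line
/^>/  { if (NR > 1) printf "\n"          # end the previous entry
        printf "%s\t", substr($0, 2)     # name without the leading ">"
        next }
      { printf "%s", $0 }                # append sequence lines
END   { if (NR > 0) printf "\n" }        # end the last entry

Make it executable and run it:

chmod a+rx join.awk
./join.awk < inputfile > outputfile
cat some | ./join.awk | somecmd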
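
A quick way to check the result, whichever tool produced it, is to print each name with the length of its joined sequence. This is only a sketch: the function name seq_lengths is made up for illustration, and it assumes the output is tab-delimited with the name in the first column.

```shell
# seq_lengths (hypothetical helper): read "name<TAB>sequence" lines on stdin
# and print each name followed by the length of its sequence.
seq_lengths() {
    awk -F '\t' '{ print $1, length($2) }'
}

# Example: ./join.awk < inputfile | seq_lengths
```

If a length is much shorter than expected, the sequence lines of that entry were probably not all joined.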

Here's one way to do it with Perl:

$
$ cat data.txt
>ALLLY
GGCCCCTCGAGCCTCGAACCGGAACCTCCAAATCCGAGACGCTCTGCTTATGAGGACCTC
GAAATATGCCGGCCAGTGAAAAAATCTTGTGGCTTTGAGGGCTTTTGGTTGGCCAGGGGC
AGTAAAAATCTCGGAGAGCTGACACCAAGTCCTCCCCTGCCACGTAGCAGTGGTAAAGTC
CGAAGCTCAAATTCCGAGAATTGAGCTCTGTTGATTCTTAGAACTGGGGTTCTTAGAAGT
>BLLLK
CTGGTCTCAGTCTGGTACTGAAGTCAGGAATGGCTTAAGGTGAAATCGTGGTCCTCTGGT
GAAGCTCAGCGAAGACCCCCTCGCCTTGTTTATGACAAGAGAACTTCTGGGGGCGGGAGG
AAGAGTCCCTGTTACGATGCTGATCATCATTGAGCTTTTGCTGAGCAGAAAACTCTTTAG
TACTCAAGGTCGAGAGTCTCTGGTGGTCTGCCTGGCACCAGGCACCTTCCTACAACCCTA
GTTTTCCAAAAGGACAAAGCCTGGGGCAGGCGACGTCCTAGCTCGCATTTGAACAGGGCC
GCGGGCCAGCAGAGATGCGCGATGCCCAACTCTTTCCAAGAGCACCTCGCGTCCCGAACC
$
$
$ perl -ne 'chomp; if (/^>/) {s/^>//; print $. != 1 ? "\n":"",$_,"\t"} else {print} END {print "\n"}' data.txt
ALLLY   GGCCCCTCGAGCCTCGAACCGGAACCTCCAAATCCGAGACGCTCTGCTTATGAGGACCTCGAAATATGCCGGCCAGTGAAAAAATCTTGTGGCTTTGAGGGCTTTTGGTTGGCCAGGGGCAGTAAAAATCTCGGAGAGCTGACACCAAGTCCTCCCCTGCCACGTAGCAGTGGTAAAGTCCGAAGCTCAAATTCCGAGAATTGAGCTCTGTTGATTCTTAGAACTGGGGTTCTTAGAAGT
BLLLK   CTGGTCTCAGTCTGGTACTGAAGTCAGGAATGGCTTAAGGTGAAATCGTGGTCCTCTGGTGAAGCTCAGCGAAGACCCCCTCGCCTTGTTTATGACAAGAGAACTTCTGGGGGCGGGAGGAAGAGTCCCTGTTACGATGCTGATCATCATTGAGCTTTTGCTGAGCAGAAAACTCTTTAGTACTCAAGGTCGAGAGTCTCTGGTGGTCTGCCTGGCACCAGGCACCTTCCTACAACCCTAGTTTTCCAAAAGGACAAAGCCTGGGGCAGGCGACGTCCTAGCTCGCATTTGAACAGGGCCGCGGGCCAGCAGAGATGCGCGATGCCCAACTCTTTCCAAGAGCACCTCGCGTCCCGAACC
$
$

tyler_durden

Hi Tyler,

Thanks for the perl one liner. I ran it and it gives me an error:

perl -ne 'chomp; if (/^>/) {s/^>//; print $. != 1 ? "\n":"",$_,"\t"} else {print} END {print "\n"}' data.txt

Can't find string terminator "'" anywhere before EOF at -e line 1.

---------- Post updated at 08:55 AM ---------- Previous update was at 08:46 AM ----------

Hi Tyler,
It works on Unix, so it's fine now.

thanks

sed -n '/^>/{
1{h;}
1!{x;s/\n/	/;s/^>//;s/\n//g;p;d;}
}
/^>/!{
${H;x;s/\n/	/;s/^>//;s/\n//g;p;d;}
$!{H;}
}'
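
Since the question asked for a shell script, the same thing can also be done with nothing but the shell's own read loop. This is a minimal sketch, assuming a POSIX sh and input on stdin; the function name fasta_to_tab is made up for illustration. It is usually much slower than awk, Perl or sed on large files.

```shell
# fasta_to_tab (hypothetical name): read FASTA on stdin, write one
# "name<TAB>sequence" line per record on stdout. POSIX sh only.
fasta_to_tab() {
    first=1
    # The "|| [ -n ... ]" part also handles a last line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*)
                # Header: finish the previous record, then start a new one.
                [ "$first" -eq 1 ] || printf '\n'
                printf '%s\t' "${line#>}"
                first=0
                ;;
            *)
                # Sequence line: append it without a newline.
                printf '%s' "$line"
                ;;
        esac
    done
    # Terminate the last record, if there was one.
    [ "$first" -eq 1 ] || printf '\n'
}

# Example: fasta_to_tab < inputfile > outputfile
```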