Split a huge data file into several smaller files?!

Input file data contents:

>seq_1
MSNQSPPQSQRPGHSHSHSHSHAGLASSTSSHSNPSANASYNLNGPRTGGDQRYRASVDA
>seq_2
AGAAGRGWGRDVTAAASPNPRNGGGRPASDLLSVGNAGGQASFASPETIDRWFEDLQHYE
>seq_3
ATLEEMAAASLDANFKEELSAIEQWFRVLSEAERTAALYSLLQSSTQVQMRFFVTVLQQM
ARADPITALLSPANPGQASMEAQMDAKLAAMGLKSPASPAVRQYARQSLSGDTYLSPHSA
>seq_4
TTLPPAPVSPTTTTQAEDAAAAATLASQRAKLKASSRISAPANILLGASGADGVKSPLWS
EKERVVERRSPSPSGRNVERPKSTGSTGEPAQPNNSHAGMNLSQSTGPPSASFLRSPAPD
>seq_5
FDSQLSPIVGGNWASMVNTPLMPMFGSKGGGEGGSFGGLASPGLDGATAKLGSWATGTTT
GQAGIVLDDVRKFRRSARISGSGATGFGGGALGGMYDDQPAQASTNGQQQRRVSPSQLNS
>seq_6
AQQNAINLGLAGLQQQQQQHQQQLRSGAASPGLSSQQAAVAAQQNWRNGLGSPAVDSSDQ
YSQHGMGAFGMGSPANLSANAQLANLFALQQQMMQQQQMQQLNMAAAAGIALTPVQMMGL
QQQQQQAMLSPGGRGFGMGMNGMGMNGMMGMGMGGMGSPRRSPRQSDRSPGGKTNLPSTV
.
.
.
.

Output file 1 contents:

>seq_1
MSNQSPPQSQRPGHSHSHSHSHAGLASSTSSHSNPSANASYNLNGPRTGGDQRYRASVDA
>seq_2
AGAAGRGWGRDVTAAASPNPRNGGGRPASDLLSVGNAGGQASFASPETIDRWFEDLQHYE
>seq_3
ATLEEMAAASLDANFKEELSAIEQWFRVLSEAERTAALYSLLQSSTQVQMRFFVTVLQQM
ARADPITALLSPANPGQASMEAQMDAKLAAMGLKSPASPAVRQYARQSLSGDTYLSPHSA

Output file 2 contents:

>seq_4
TTLPPAPVSPTTTTQAEDAAAAATLASQRAKLKASSRISAPANILLGASGADGVKSPLWS
EKERVVERRSPSPSGRNVERPKSTGSTGEPAQPNNSHAGMNLSQSTGPPSASFLRSPAPD
>seq_5
FDSQLSPIVGGNWASMVNTPLMPMFGSKGGGEGGSFGGLASPGLDGATAKLGSWATGTTT
GQAGIVLDDVRKFRRSARISGSGATGFGGGALGGMYDDQPAQASTNGQQQRRVSPSQLNS
>seq_6
AQQNAINLGLAGLQQQQQQHQQQLRSGAASPGLSSQQAAVAAQQNWRNGLGSPAVDSSDQ
YSQHGMGAFGMGSPANLSANAQLANLFALQQQMMQQQQMQQLNMAAAAGIALTPVQMMGL
QQQQQQAMLSPGGRGFGMGMNGMGMNGMMGMGMGGMGSPRRSPRQSDRSPGGKTNLPSTV

If I have a long list of data inside a file, how can I divide it into different files?
I need three records in each file.
For example, my data source has 300 sequences.
I need to put 3 sequences in each file, so the desired output is 100 files containing 3 sequences each.
Does anybody have an idea how to solve this?
Thanks a lot for all your guidance.

awk -F'>' '$2{c--} c<=0{f=$2; c=3; $0=FS$2} {print>f}' infile
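
The one-liner above keeps every output file open until awk exits, which may hit the open-file limit of some awk implementations when there are many output files. A minimal sketch of a variant that closes each file before starting the next one follows. The names sample.fa and part_N.fa are made up for illustration, and the sketch assumes the first line of the input is a header:

```shell
# Build a tiny sample input (sample.fa is a hypothetical file name).
printf '>s1\nAAA\n>s2\nBBB\n>s3\nCCC\nDDD\n>s4\nEEE\n' > sample.fa

# Start a new output file at every 3rd header and close the previous one.
awk '
  /^>/ {
    if (n % 3 == 0) {                    # headers 1, 4, 7, ... open a new file
      if (out != "") close(out)          # release the previous file first
      out = "part_" (n / 3 + 1) ".fa"    # part_1.fa, part_2.fa, ...
    }
    n++                                  # count headers seen so far
  }
  { print > out }                        # every line goes to the current file
' sample.fa
```

Numbering the output files instead of naming them after the header also sidesteps any odd characters in the labels.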

Shell equivalent:

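c=2                                # start at 2 so the first label opens the first file
while IFS= read -r line; do        # IFS= and -r keep whitespace and backslashes intact
  case $line in
    \>*) c=$((c+1));;              # if a label is found, increase the counter
  esac
  if [ "$c" -eq 3 ]; then          # every third label:
    exec > "${line#>}"             # redirect output to a file named after the label
    c=0                            # reset the counter
  fi
  printf '%s\n' "$line"            # print the line to the current output file
done < infile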

Hi,

Thanks a lot for your suggestion.
But it seems to put only two sequences in each file and to use the third sequence's header as the file name?
Do you know what is going wrong?
Thanks again for your kind help and advice :)

With the input you provided, it creates two output files, seq_1 and seq_4:

$> cat seq_1
>seq_1
MSNQSPPQSQRPGHSHSHSHSHAGLASSTSSHSNPSANASYNLNGPRTGGDQRYRASVDA
>seq_2
AGAAGRGWGRDVTAAASPNPRNGGGRPASDLLSVGNAGGQASFASPETIDRWFEDLQHYE
>seq_3
ATLEEMAAASLDANFKEELSAIEQWFRVLSEAERTAALYSLLQSSTQVQMRFFVTVLQQM
ARADPITALLSPANPGQASMEAQMDAKLAAMGLKSPASPAVRQYARQSLSGDTYLSPHSA

$> cat seq_4
>seq_4
TTLPPAPVSPTTTTQAEDAAAAATLASQRAKLKASSRISAPANILLGASGADGVKSPLWS
EKERVVERRSPSPSGRNVERPKSTGSTGEPAQPNNSHAGMNLSQSTGPPSASFLRSPAPD
>seq_5
FDSQLSPIVGGNWASMVNTPLMPMFGSKGGGEGGSFGGLASPGLDGATAKLGSWATGTTT
GQAGIVLDDVRKFRRSARISGSGATGFGGGALGGMYDDQPAQASTNGQQQRRVSPSQLNS
>seq_6
AQQNAINLGLAGLQQQQQQHQQQLRSGAASPGLSSQQAAVAAQQNWRNGLGSPAVDSSDQ
YSQHGMGAFGMGSPANLSANAQLANLFALQQQMMQQQQMQQLNMAAAAGIALTPVQMMGL
QQQQQQAMLSPGGRGFGMGMNGMGMNGMMGMGMGGMGSPRRSPRQSDRSPGGKTNLPSTV

Do your real headers contain spaces?

Yup.
You are right.
Some of my other sequence headers contain spaces, and some contain ":".
So I used sed to replace all of those spaces and ":" characters with "_", like this:

sed 's/ /_/g' file_data > file_data.out 
sed 's/:/_/g' file_data.out > file_data.final.txt
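
The two sed passes above could probably be combined into one, restricted to header lines so that the sequence data is never touched. A minimal sketch, where headers.fa and headers.clean.fa are hypothetical file names and the sample headers are made up:

```shell
# Tiny sample with a space and a ":" in the headers (made-up data).
printf '>seq 1: human\nMSNQ\n>seq 2\nAGAA\n' > headers.fa

# On lines starting with ">", replace every space or ":" with "_".
sed '/^>/ s/[ :]/_/g' headers.fa > headers.clean.fa

cat headers.clean.fa
```

The bracket expression [ :] matches either character, so one substitution covers both cases.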

After that, when I use the command you suggested, every output file name ends with a "?". :(
Besides that, when I look at the contents of each file produced, it contains only two sequences instead of three.
Thanks a lot for your advice :)
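
A trailing "?" in a file name can be how ls displays a non-printable character. One possible cause, which is only an assumption and is not confirmed anywhere in this thread, is that the input file has Windows-style line endings (CRLF), so each header, and therefore each file name, ends in a carriage return. A minimal sketch for checking for and removing them, where crlf.fa and unix.fa are hypothetical file names:

```shell
# Made-up sample with CRLF line endings.
printf '>seq_1\r\nMSNQ\r\n' > crlf.fa

# Hold a literal carriage return in a variable for portability.
cr=$(printf '\r')

# Count the lines that end in a carriage return (a non-zero count means CRLF).
grep -c "${cr}\$" crlf.fa

# Write a copy with all carriage returns removed.
tr -d '\r' < crlf.fa > unix.fa
```

If the count is non-zero for the real data, running the split on the cleaned copy would be worth a try.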

Hi Scrutinizer, do you have any idea how I can get my desired output?
I tried replacing the spaces in the headers with "_" and then running your suggested code.
Unfortunately, it still doesn't work :(
Thanks a lot for your advice.

Hi patrick87,

The problem is that I put random spaces and : characters inside the labels of the input example you gave, and both scripts still work as expected. I have to assume your real-world data sets somehow do not correspond to the input format you provided. You would have to take a small part (say 7 records) of an actual, anonymized file and run my scripts on it to see if they also produce the strange results. Then post that example input file here, and also list the strange resulting file names and their content, so I can have a look.

S.
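
One way to cut such a small test file out of a large one is to stop reading after a fixed number of headers. A minimal sketch, where big.fa, small.fa and the awk variable keep are hypothetical names, and keep=2 is used only to keep the demo short (7 would match the suggestion above):

```shell
# Made-up stand-in for the large input file.
printf '>a\nAA\n>b\nBB\nBC\n>c\nCC\n' > big.fa

# Print lines until the header count exceeds "keep", then stop reading.
awk -v keep=2 '/^>/ && ++n > keep { exit } { print }' big.fa > small.fa

cat small.fa
```

Because awk exits at the first unwanted header, the rest of the large file is never read.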

Thanks a lot, Scrutinizer.
I am trying your script again now.
I will let you know if it works ^^