Divide large data files into smaller files

Hello everyone!

I have 2 types of files in the following format:

1) *.fa

>1234
...some text...
>2345
...some text...
>3456
...some text...
.
.
.
.

2) *.info

>1234
...some numbers...
>2345
...some numbers...
>3456
...some numbers...
.
.
.
.

I need to split these huge files (~300-400 MB) into smaller files (around 30-40 MB each). Also, I don't want any record to be split: every record starting with '>' should end up, together with its corresponding information, in a single smaller file.

Can someone please suggest a way to do this using unix commands?

Thanks!!!

Could you please elaborate on the criteria you want to use to split the file?

Tell us your expected output.

First, you can use paste to put each record on one line. This assumes every record is exactly two lines: the '>' header and one line of data.

$ paste - - < file.info
>1234   ...some numbers...
>2345   ...some numbers...
>3456   ...some numbers...

$ paste - - < file.fa
>1234   ...some text...
>2345   ...some text...
>3456   ...some text...
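Once each record sits on one line, a line-based split can no longer cut a record in half. Here is a minimal sketch of that idea, not a finished solution. The sample data, the part_ prefix and the 2-records-per-piece value are made up for illustration, and it assumes the data contains no tab characters and that every record is exactly two lines.

```shell
# Hypothetical sample: three two-line records in the format shown above.
workdir=$(mktemp -d)
printf '>1234\nAAAA\n>2345\nCCCC\n>3456\nGGGG\n' > "$workdir/file.fa"

# 1. Join each two-line record onto one tab-separated line.
# 2. Split on line boundaries (2 records per piece here; use a much
#    larger value for real data).
paste - - < "$workdir/file.fa" | split -l 2 - "$workdir/part_"

# 3. Turn the tab back into a newline to restore the original layout.
for piece in "$workdir"/part_*; do
    tr '\t' '\n' < "$piece" > "$piece.fa"
    rm "$piece"
done

ls "$workdir"
```

The number of records per piece has to be estimated from the average record size, so the pieces only approximate a target size in bytes.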

You can use csplit.
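To show what csplit does with this kind of file, here is a small sketch. It assumes GNU csplit (the -z option and the '{*}' repeat count are GNU extensions), and the sample data and the rec_ prefix are made up for illustration.

```shell
# Hypothetical sample: three records in the format shown above.
workdir=$(mktemp -d)
printf '>1234\nAAAA\n>2345\nCCCC\n>3456\nGGGG\n' > "$workdir/file.fa"

# -s       do not print the size of each piece
# -z       do not keep empty pieces (GNU extension)
# -f       prefix for the output files (rec_00, rec_01, ...)
# '/^>/'   start a new piece at each line beginning with ">"
# '{*}'    repeat the pattern as many times as possible (GNU extension)
csplit -s -z -f "$workdir/rec_" "$workdir/file.fa" '/^>/' '{*}'

ls "$workdir"
```

Used this way, csplit writes one piece per record, so getting pieces of a given size would still need a second step that groups several records into each file.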

Hello Everyone...
Thanks for your responses!!!

I need to divide my huge data file into smaller files (of, say, 10 MB or so). I have to feed this file (as smaller sub-files) into another program.

The number of lines in each file varies. The program cannot take the huge file all at once and needs the data in chunks!

I hope that is a little clearer now?

Thanks!

You still haven't told us clearly what you are asking for.

Please give some real data and the expected output.

If I understand your requirement correctly, and assuming the big files have the format shown above:

awk '/^>[0-9]+$/ && c >=int(n/s) { k++; c=1; close(f) }
     { print > (f=FILENAME"_"k+1); c++ }' s=10 n="$(wc -l < bigfile)" bigfile

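To get smaller files of a different size, adjust the value of the variable s: the input is cut into roughly s pieces of about n/s lines each, and a new piece only ever starts at a '>' header line.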
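If you would rather start from a size in bytes than from a number of pieces, s can be derived from the file size. This is only a sketch: the sample data and the target value are invented for illustration, and because the awk command counts lines rather than bytes, the resulting sizes are approximate.

```shell
# Hypothetical sample input: 20 two-line records.
workdir=$(mktemp -d)
bigfile="$workdir/bigfile"
i=1
while [ "$i" -le 20 ]; do
    printf '>%d\nsome text for record %d\n' "$i" "$i"
    i=$((i + 1))
done > "$bigfile"

target=200                               # wanted piece size in bytes (assumed value)
size=$(wc -c < "$bigfile")               # total size in bytes
s=$(( (size + target - 1) / target ))    # number of pieces, rounded up
n=$(wc -l < "$bigfile")                  # total number of lines

# Same awk program as above, with s computed instead of hard-coded.
awk '/^>[0-9]+$/ && c >= int(n/s) { k++; c=1; close(f) }
     { print > (f=FILENAME"_"k+1); c++ }' s="$s" n="$n" "$bigfile"

wc -c "$bigfile"_*
```

For the real files, target would be something like 31457280 (30 MB). The number of pieces can come out slightly different from s, since a piece is only closed at the next header line.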

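Another approach with awk.
In the following script (ad23.sh), bigfile is the input file to split (ad23.txt) and maxsize is the maximum size in bytes of each fragment (ad23.txt_*). A single record larger than maxsize is written to a fragment of its own, so that fragment can exceed the limit.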

bigfile=./ad23.txt
maxsize=${1:-200}    # maximum fragment size in bytes (default 200)

# Remove fragments left over from a previous run.
rm -f "${bigfile}"_*

awk -v msize="${maxsize}" '

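# Write the buffered record, starting a new fragment first if adding the
# record would push the current fragment past msize.
function print_record() {
   if ( rsize == 0 ) return;
   if ( csize+rsize > msize && csize != 0 || ifile == 0 ) {
      if ( outfile != "" ) close(outfile);   # avoid running out of open files
      outfile = FILENAME "_" ++ifile;
      csize = 0;
   }
   csize += rsize;
   print record > outfile;
}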

/^>[0-9]+$/ {
   print_record();
   record = $0;
   rsize  = length+1;
   next;
}

{
   record = (record ? record "\n" : "") $0;
   rsize  += length+1;
}

END {
   print_record();
}

' "${bigfile}"

Input file (ad23.txt, 873 bytes):

>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa

Execution with maxsize=300

$ ./ad23.sh 300
$ wc -c ad23.txt_*
229 ad23.txt_1
260 ad23.txt_2
316 ad23.txt_3
 68 ad23.txt_4
873 total
$ more -999 ad23.txt_*
::::::::::::::
ad23.txt_1
::::::::::::::
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
::::::::::::::
ad23.txt_2
::::::::::::::
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
::::::::::::::
ad23.txt_3
::::::::::::::
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
::::::::::::::
ad23.txt_4
::::::::::::::
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
$

Another execution with maxsize=500

$ ./ad23.sh 500
$ wc -c ad23.txt_*
489 ad23.txt_1
384 ad23.txt_2
873 total
$ more -999 ad23.txt_*
::::::::::::::
ad23.txt_1
::::::::::::::
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
::::::::::::::
ad23.txt_2
::::::::::::::
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
$
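A quick way to confirm that nothing was lost or reordered is to concatenate the fragments and compare them with the original. The check_split helper below is a hypothetical addition, not part of ad23.sh, and the demo files are made up. It walks the fragments in numeric order because a glob such as ad23.txt_* sorts _10 before _2 once there are more than nine fragments.

```shell
# check_split FILE: succeed if FILE_1, FILE_2, ... concatenated in numeric
# order are byte-for-byte identical to FILE.
check_split() {
    original=$1
    i=1
    while [ -f "${original}_$i" ]; do
        cat "${original}_$i"
        i=$((i + 1))
    done | cmp -s - "$original"
}

# Hypothetical demo: a two-record file and its two fragments.
workdir=$(mktemp -d)
printf '>0001\n1 aaaa\n>0002\n2 aaaa\n' > "$workdir/demo.txt"
printf '>0001\n1 aaaa\n' > "$workdir/demo.txt_1"
printf '>0002\n2 aaaa\n' > "$workdir/demo.txt_2"

if check_split "$workdir/demo.txt"; then
    echo "fragments match the original"
else
    echo "fragments differ from the original"
fi
```

After a real run it would be called as check_split ./ad23.txt.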

Jean-Pierre.