ad23
July 16, 2010, 5:35pm
1
Hello everyone!
I have 2 types of files in the following format:
1) *.fa
>1234
...some text...
>2345
...some text...
>3456
...some text...
.
.
.
.
2) *.info
>1234
...some numbers...
>2345
...some numbers...
>3456
...some numbers...
.
.
.
.
I need to split these huge files (~300-400 MB) into smaller files (around 30-40 MB each). Also, I don't want to divide the data (i.e. every record starting with '>' should have its corresponding information in one smaller file).
Can someone please suggest a way to do this using unix commands?
Thanks!!!
Could you please elaborate on the criteria you want to use to split the file?
Tell us your expected output.
First, you can use paste to put each record on one line (this assumes every record is exactly two lines: the header and one line of data).
$ paste - - < file.info
>1234 ...some numbers...
>2345 ...some numbers...
>3456 ...some numbers...
$ paste - - < file.fa
>1234 ...some text...
>2345 ...some text...
>3456 ...some text...
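Building on the paste idea, here is a minimal sketch of how the joined lines could then be cut into chunks and restored. It assumes every record is exactly two lines and that the data contains no tab characters; the sample file, the chunk size of 2 records and the sample.info_ prefix are made up for illustration.

```shell
# Hypothetical sample input: every record is exactly two lines.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>%d\nnumbers for %d\n' 1 1 2 2 3 3 4 4 5 5 > sample.info

# Join each header with its data line, then split into chunks of 2 records.
paste - - < sample.info | split -l 2 - sample.info_

# Turn the tab inserted by paste back into a newline in every chunk.
for chunk in sample.info_??; do
    tr '\t' '\n' < "$chunk" > "$chunk.tmp" && mv "$chunk.tmp" "$chunk"
done
```

For strictly two-line records, running split -l with an even line count directly on the original file should give the same result; the paste step only matters if you want to filter or sort whole records first.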
ad23
July 19, 2010, 9:42am
5
Hello everyone,
Thanks for your responses!
I need to divide my huge data file into smaller files (of, say, 10 MB or so), because I have to feed it to another program as smaller sub-files.
The number of lines in each file varies. The program cannot take the huge file all at once and needs chunks of data.
I hope that is a little clearer now.
Thanks!
You still haven't told us clearly what you are asking for.
Please give some real data and the expected output.
rubin
July 20, 2010, 1:39pm
7
ad23:
...
I need to split these huge files (~300-400 MB) into smaller files (around 30-40 MB each). Also, I don't want to divide the data (i.e. every record starting with '>' should have its corresponding information in one smaller file).
If I understand your requirement correctly, and assuming the big files have the format above:
# Start a new file at a record header once the current one holds n/s lines.
awk '/^>[0-9]+$/ && c >= int(n/s) { k++; c = 0; close(f) }
{ print > (f = FILENAME "_" (k+1)); c++ }' s=10 n="$(wc -l < bigfile)" bigfile
Here s is the approximate number of output files; adjust it to change the size of the smaller files.
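If you would rather start from a target size than from a number of pieces, s can be derived from the file size. The following is only a sketch: the sample file, the tiny 40-byte target and the variable names target and size are made up for illustration, and the awk part is a lightly reformatted variant of the one-liner above. Because that splitter counts lines rather than bytes, the resulting sizes are only approximate when line lengths vary.

```shell
# Hypothetical sample: 6 records of 3 lines each (18 lines, 102 bytes).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3 4 5 6; do
    printf '>%d\nline a\nline b\n' "$i"
done > bigfile

target=40                               # desired piece size in bytes (tiny here)
size=$(wc -c < bigfile)                 # total size of the input in bytes
s=$(( (size + target - 1) / target ))   # number of pieces, rounded up

# Same idea as above: switch files at a header once n/s lines were written.
awk '/^>[0-9]+$/ && c >= int(n/s) { k++; c = 0; close(f) }
{ print > (f = FILENAME "_" (k+1)); c++ }' s="$s" n="$(wc -l < bigfile)" bigfile
```

For the real files you would set target to something like $((30 * 1024 * 1024)) for pieces of roughly 30 MB.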
aigles
July 21, 2010, 8:30am
8
Another approach with awk.
In the following script (ad23.sh), bigfile is the input file to split (ad23.txt) and maxsize is the maximum size in bytes of the fragments (ad23.txt_*). A record that is larger than maxsize on its own is still written whole, to a fragment of its own.
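#!/bin/sh
bigfile=./ad23.txt
maxsize=${1:-200}      # maximum fragment size in bytes (default 200)
rm -f "${bigfile}"_*   # remove fragments left over from a previous run
awk -v msize="${maxsize}" '
# Write the buffered record, starting a new fragment when it would not fit.
function print_record() {
    if ( rsize == 0 ) return;
    if ( (csize+rsize > msize && csize != 0) || ifile == 0 ) {
        if ( outfile != "" ) close(outfile);
        outfile = FILENAME "_" ++ifile;
        csize = 0;
    }
    csize += rsize;
    print record > outfile;
}
# A header line ends the previous record and starts a new one.
/^>[0-9]+$/ {
    print_record();
    record = $0;
    rsize = length($0) + 1;
    next;
}
# Any other line is appended to the current record.
{
    record = (record != "" ? record "\n" : "") $0;
    rsize += length($0) + 1;
}
END {
    print_record();
}
' "${bigfile}"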
Input file (ad23.txt 873 bytes):
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
Execution with maxsize=300
$ ./ad23.sh 300
$ wc -c ad23.txt_*
229 ad23.txt_1
260 ad23.txt_2
316 ad23.txt_3
68 ad23.txt_4
873 total
$ more -999 ad23.txt_*
::::::::::::::
ad23.txt_1
::::::::::::::
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
::::::::::::::
ad23.txt_2
::::::::::::::
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
::::::::::::::
ad23.txt_3
::::::::::::::
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
::::::::::::::
ad23.txt_4
::::::::::::::
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
$
Another execution with maxsize=500
$ ./ad23.sh 500
$ wc -c ad23.txt_*
489 ad23.txt_1
384 ad23.txt_2
873 total
$ more -999 ad23.txt_*
::::::::::::::
ad23.txt_1
::::::::::::::
>0001
1 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
1 cccccccccccccccccccccccccccc
1 dddddddddddddddddddddddddddd
1 eeeeeeeeeeeeeeeeeeeeeeeeeeee
>0002
2 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
2 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
>0003
3 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
3 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
3 cccccccccccccccccccccccccccc
3 dddddddddddddddddddddddddddd
3 eeeeeeeeeeeeeeeeeeeeeeeeeeee
3 ffffffffffffffffffffffffffff
3 gggggggggggggggggggggggggggg
>0004
4 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
::::::::::::::
ad23.txt_2
::::::::::::::
>0005
5 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
5 bbbbbbbbbbbbbbbbbbbbbbbbbbbb
5 cccccccccccccccccccccccccccc
5 dddddddddddddddddddddddddddd
5 eeeeeeeeeeeeeeeeeeeeeeeeeeee
5 ffffffffffffffffffffffffffff
5 gggggggggggggggggggggggggggg
5 hhhhhhhhhhhhhhhhhhhhhhhhhhhh
5 iiiiiiiiiiiiiiiiiiiiiiiiiiii
5 jjjjjjjjjjjjjjjjjjjjjjjjjjjj
>0006
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
6 aaaaaaaaaaaaaaaaaaaaaaaaaaaa
$
Jean-Pierre.
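Whichever splitter you use, it may be worth checking the result before feeding the fragments to the other program. The sketch below is not part of the scripts above; it builds a tiny made-up ad23.txt with two hand-written fragments, then checks that every fragment starts with a record header and that the fragments, concatenated in numeric order, reproduce the original file.

```shell
# Hypothetical sample input and fragments, standing in for ad23.txt and ad23.txt_*.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>0001\n1 aaa\n1 bbb\n>0002\n2 aaa\n>0003\n3 aaa\n' > ad23.txt
printf '>0001\n1 aaa\n1 bbb\n' > ad23.txt_1
printf '>0002\n2 aaa\n>0003\n3 aaa\n' > ad23.txt_2

bigfile=ad23.txt
status=ok

# 1. Every fragment must begin with a record header.
for f in "$bigfile"_*; do
    head -n 1 "$f" | grep -q '^>[0-9][0-9]*$' || { echo "bad start: $f"; status=fail; }
done

# 2. Fragments concatenated in numeric order must equal the original.
#    (A plain glob would sort _10 before _2, so count upwards instead.)
i=1
while [ -f "${bigfile}_$i" ]; do
    cat "${bigfile}_$i"
    i=$((i + 1))
done | cmp -s - "$bigfile" || status=fail

echo "$status"
```

On real data you would drop the three printf lines and point bigfile at the file that was actually split.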