I have gone through all the threads in the forum and tested out different things. I am trying to split a 3GB file into multiple files. Some files are even larger than this.
For example:
split -l 3000000 filename.txt
This is very slow, and it splits the file into pieces of 3 million records each. I would like to pass the number of output files as a parameter and choose the output file names myself, instead of getting xaa, xab and so on.
I am also trying awk, which I expect to be fast and simple. The forum examples I found all split files on a specific pattern, and I don't need any pattern.
If disk I/O is not what makes split "too slow", then try awk. Bear in mind, though, that a long I/O request queue on that filesystem is a more likely cause of slow splitting than split itself being a bad performer.
awk version of split:
awk ' {
# lines 1-300000
if (NR <= 300000) { print $0 > "smallfile1" }
# lines 300001-600000
else if (NR <= 600000) { print $0 > "smallfile2" }
# everything after line 600000
else { print $0 > "smallfile3" }
}' bigfile
Thank you, Radoulov. When I ran your code, it reported that file1, file2 or file3 was not found. It seems the code is treating those names as input files. However, Jim's code works fine.
The whole environment is on Windows, but I am using the MKS Toolkit and invoking the bash shell to run awk. I have never worked on Windows before and it is not very pleasant.
If you want to specify the number of files/parts,
it matters whether the total number of lines in the bigfile is known in advance
(otherwise it will be slower, because we have to read the file twice: first to get
the number of lines, and then again to split it).
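For illustration, the two-pass approach described above might look roughly like this. It is only a sketch: the names bigfile.txt, part_ and the variables parts and prefix are made up for the example, and the first few lines just build a small sample input in a scratch directory so the commands can be run as they stand.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical sample input: 10 records.
awk 'BEGIN { for (i = 1; i <= 10; i++) print "record " i }' > bigfile.txt

parts=3          # number of output files wanted (assumed parameter)
prefix=part_     # user-defined output name prefix (assumed parameter)

# Pass 1: count the lines.
total=$(wc -l < bigfile.txt)

# Pass 2: send ceil(total / parts) consecutive lines to each output file.
awk -v total="$total" -v parts="$parts" -v prefix="$prefix" '
BEGIN { per = int((total + parts - 1) / parts) }
{
    n = int((NR - 1) / per) + 1          # index of the current part, starting at 1
    out = prefix n ".txt"
    if (out != prev) {                   # moved on to a new part
        if (prev != "") close(prev)      # release the previous file handle
        prev = out
    }
    print > out
}' bigfile.txt
```

With the sample input this produces part_1.txt, part_2.txt and part_3.txt. If the line count is already known, pass 1 can be dropped and total set directly.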
That is true, Radoulov. It is very, very slow. The only way I can get the number of lines is by running wc -l on the file. The records keep changing, and we have ten different incoming files from a different source. We will be using this script company-wide.
This does not appear to need absolute precision. You can read the first 100 (or a few hundred) lines, then seek to near the end, read another 100 or so until the end, then calculate the mean length of the lines you saw. Obtaining the length of the file is cheap -- Linux has a stat command, or one could cut the length out of the "ls -l" output. Divide the length in bytes by the estimated mean length and you get an estimate of the number of lines.
If necessary, one could also read a section from the middle (or elsewhere) to increase the accuracy.
The key is using a seek for positioning, and stat for the length -- both are very fast ... cheers, drl
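A rough sketch of that estimate, using head and tail for the samples and wc -c for the file length. Some assumptions to flag: bigfile.txt and the variable names are invented for the example, the file is assumed to have at least as many lines as the sample size, and whether tail and wc -c avoid reading the whole file depends on the implementation (many seek or stat on regular files, but that is not guaranteed everywhere).

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d) && cd "$workdir"

# Hypothetical sample input: 1000 records of slightly varying length.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "record number " i }' > bigfile.txt

sample=100                               # lines to sample at each end
size=$(wc -c < bigfile.txt)              # total file length in bytes

# Bytes occupied by the first and the last $sample lines together.
sample_bytes=$( { head -n "$sample" bigfile.txt; tail -n "$sample" bigfile.txt; } | wc -c )

# Estimated line count = file length / mean sampled line length.
est_lines=$(awk -v size="$size" -v bytes="$sample_bytes" -v n="$((2 * sample))" \
    'BEGIN { printf "%d\n", size / (bytes / n) }')

echo "estimated lines: $est_lines"
```

The estimate is only as good as the samples: if line lengths in the middle of the file differ a lot from those at the ends, adding a sample from the middle, as suggested above, should help.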