Split a file with no pattern -- Split, Csplit, Awk

madhunk · December 13, 2007, 6:15pm

I have gone through all the threads in the forum and tested out different things. I am trying to split a 3GB file into multiple files. Some files are even larger than this.

For example:

split -l 3000000 filename.txt

This is very slow, and it splits the file into pieces of 3 million records each. What I would like is to pass the number of output files as a parameter and to use user-defined file names instead of xaa, xab and so on.

I am also trying awk, which I expect to be fast and simple. The threads I found in the forum all split files on a specific pattern, and I don't need any pattern.

Please give me your input on this..

Smiling_Dragon · December 13, 2007, 6:18pm

I would have thought dd would be a more appropriate choice for this?
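
For what it is worth, a dd-based split might look roughly like the sketch below. It assumes that cutting at byte offsets is acceptable: dd knows nothing about lines, so a record can end up split across two parts. The function name dd_split, the 1 MiB block size and the numbered output names are made up for illustration.

```shell
# Hypothetical helper: cut FILE into pieces of CHUNK_MB mebibytes each,
# named PREFIX1, PREFIX2, ...  Pieces are cut at byte offsets, not line ends.
dd_split() {
    file=$1 chunk_mb=$2 prefix=$3
    size=$(wc -c < "$file")
    chunk=$((chunk_mb * 1024 * 1024))
    parts=$(( (size + chunk - 1) / chunk ))      # number of pieces, rounded up
    i=0
    while [ "$i" -lt "$parts" ]; do
        # skip= counts input blocks of bs bytes, so piece i starts
        # i * chunk_mb blocks into the file
        dd if="$file" of="$prefix$((i + 1))" bs=1024k \
           skip=$((i * chunk_mb)) count="$chunk_mb" 2>/dev/null
        i=$((i + 1))
    done
}

# Example: dd_split filename.txt 1024 part_    (1 GiB pieces)
```

Concatenating the pieces in order gives back the original file, but only the first piece is guaranteed to start on a line boundary.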

madhunk · December 14, 2007, 9:33am

If you can recommend a fast way like awk, that would be very much appreciated. The split is taking up a lot of time.

jim_mcnamara · December 14, 2007, 12:14pm

If disk i/o is not making split "too slow" then try awk. But you should consider that a big I/O request queue length on that filesystem is a likely candidate for slow splitting, rather than split being a bad performer.
awk version of split:

awk ' {
          if(NR<300000) { print $0 > "smallfile1"}
          if (NR>300000 && NR < 600000) { print $0 > "smallfile2" }
          if (NR>60000) {print $0 > "smallfile3" }
       }'  bigfile

radoulov · December 14, 2007, 5:59pm

Another approach - you can pass multiple arguments and control the filenames:

awk 'FNR == 1 { c = 1 }
{ close(FILENAME c-1)
	print > (FILENAME (!(FNR%30000000) ? ++c : c))
}'  file_1 file_2 ... file_n

or:

awk 'FNR == 1 { c = 1 }
	      { print > (FILENAME c) }
!(FNR%30000000) { close(FILENAME c); ++c }
' file_1 file_2 ... file_n

Use nawk or /usr/xpg4/bin/awk on Solaris.

madhunk · December 17, 2007, 10:18am

Thank you, Radoulov. When I ran your code, it said that file1, file2 or file3 was not found. It seems the code assumes that those are the input files. Jim's code, however, is working fine.

The whole environment is on Windows, but I am using MKS Toolkit and invoking the bash shell to execute awk. I have never worked on Windows before and it is not very pleasant.

radoulov · December 17, 2007, 10:25am

Sorry,
just realized I misread your question
(you don't want to pass multiple input files).

radoulov · December 17, 2007, 10:39am

Just for completeness: the first parameter (n) is the number of lines per file,
the second - the custom_name with a numeric suffix:

awk 'FNR == 1 { c = 1 }
	      { print > (f c) }
!(FNR%n) { close(f c); ++c }
' n=<number_of_lines> f=<custom_name> big_file

If you want to specify the number of files/parts instead,
it matters whether the total number of lines in the big file is known in advance
(otherwise it will be slower, because we have to read it twice: first to get
the number of lines, and then again to split it).
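
A rough sketch of that two-pass variant, assuming the line count is not known in advance and is obtained with wc -l. The function name split_into_parts and its argument order are only illustrative. The test is written as FNR % n == 0 so that operator precedence is not a concern.

```shell
# Hypothetical helper: split FILE into at most PARTS files named NAME1, NAME2, ...
split_into_parts() {
    file=$1 parts=$2 name=$3
    total=$(wc -l < "$file")                 # first pass: count the lines
    n=$(( (total + parts - 1) / parts ))     # lines per part, rounded up
    [ "$n" -gt 0 ] || n=1                    # guard against an empty file
    # second pass: write n lines per output file, closing each one when full
    awk -v n="$n" -v f="$name" '
        FNR == 1     { c = 1 }
                     { print > (f c) }
        FNR % n == 0 { close(f c); ++c }
    ' "$file"
}

# Example: split_into_parts big_file 10 custom_name
```

Because the lines per part are rounded up, the last file is usually shorter, and in some cases fewer than PARTS files are produced (9 lines in 4 parts gives 3 files of 3 lines).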

madhunk · December 17, 2007, 11:20am

That is true, Radoulov. It is very, very slow. The only way I can get the number of lines is by doing a wc -l on the file. The records keep changing and we have ten different files coming in from a different source. We will be using this script company-wide.

drl · December 17, 2007, 11:30am

Hi, jim mcnamara.

If disk i/o is not making split "too slow" then try awk. But you should consider that a big I/O request queue length on that filesystem is a likely candidate for slow splitting, rather than split being a bad performer.
awk version of split:
awk ' {
          if(NR<300000) { print $0 > "smallfile1"}
          if (NR>300000 && NR < 600000) { print $0 > "smallfile2" }
          if (NR>60000) {print $0 > "smallfile3" }
       }'  bigfile

The number in the last condition (NR>60000) appears to be missing a zero, suggesting that the last part of the file beyond 60K (not 600K) ends up on smallfile3 ... cheers, drl
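
With the missing zero restored, the quoted conditions would still skip lines 300000 and 600000 exactly, since neither < nor > matches them. One possible way to write the three ranges so that every line lands in exactly one file is sketched below. The wrapper name three_way_split and its two limit parameters are hypothetical, the output names are the ones from the quoted code.

```shell
# Hypothetical wrapper: lines 1..A of FILE go to smallfile1,
# lines A+1..B to smallfile2, and everything after B to smallfile3.
three_way_split() {
    awk -v a="$2" -v b="$3" '
        NR <= a { print > "smallfile1"; next }   # next: stop after the first match
        NR <= b { print > "smallfile2"; next }
                { print > "smallfile3" }
    ' "$1"
}

# Example: three_way_split bigfile 300000 600000
```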

drl · December 17, 2007, 11:57am

Hi.

This does not appear to need absolute precision. You can read the first 100 (or a few hundred) lines, then seek to near the end, read another 100 or so until the end, then calculate the mean length of the lines you saw. Obtaining the length of the file is cheap -- Linux has a stat command, but one could cut the length out from "ls -l". Divide the length in bytes by the estimated mean length and you get an estimate of the number of lines.

If necessary, one could also read a section from the middle (or elsewhere) to increase the accuracy.

The key is using a seek for positioning, and stat for the length -- both are very fast ... cheers, drl
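
A minimal sketch of this estimate in shell, under a few assumptions: head and tail stand in for the explicit seek (tail on a regular file normally works backwards from the end rather than reading everything), wc -c stands in for stat or "ls -l" as a portable way to get the size, and lines end in a single newline byte. The function name estimate_lines and the default sample size are made up.

```shell
# Hypothetical helper: estimate the number of lines in FILE from SAMPLE
# lines taken at the start and SAMPLE lines taken at the end.
estimate_lines() {
    file=$1 sample=${2:-100}
    size=$(wc -c < "$file")                  # file length in bytes
    { head -n "$sample" "$file"; tail -n "$sample" "$file"; } |
        LC_ALL=C awk -v size="$size" '
            { bytes += length($0) + 1 }      # +1 for the newline
            END {
                if (NR == 0) { print 0; exit }
                # file size divided by the mean sampled line length, rounded
                printf "%d\n", size / (bytes / NR) + 0.5
            }'
}

# Example: estimate_lines filename.txt 200
```

The result could then be divided by the wanted number of parts to get a lines-per-file value for split -l or for the awk versions above. It is only as good as the sample: if line lengths vary a lot across the file, the parts will be uneven.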