split file problem

Hi All,

I have a parent file with 10 million records. I want to split the parent file into child files.

Each child file should contain 5,000 records.

I am using the following command for splitting:

split -5000 parentfile.txt childfile.1

It splits the parent file into childfile.1aa, childfile.1ab, ... childfile.1zz and then stops.
That is only 676 files (676 x 5,000 = 3.38 million records).

I am not able to split the parent file into more than 676 child files in a single pass. Please provide your suggestions for splitting the whole file.

Regards
Hanuma

"man split" it says "up to a maximum of 676 files" (26x26=676), but in some unixes you can have powers of 26 more.

If you don't have the "-a suffix_length" switch to "split" (which would fix the problem), and assuming you have plenty of disc space, I guess you could do two passes:

Split into 500 files of 20,000 lines.
Split each of the 500 files into 4 parts of 5,000 lines, as sketched below.
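A minimal sketch of the two passes (the "part." and "chunk." prefixes are just illustrative):

    # Pass 1: 500 intermediate files of 20,000 lines each
    # (500 <= 676, so the default two-letter suffixes are enough)
    split -l 20000 parentfile.txt part.

    # Pass 2: split each intermediate file into 4 chunks of 5,000 lines,
    # deleting it afterwards to reclaim disc space
    for f in part.??
    do
        split -l 5000 "$f" chunk."$f".
        rm "$f"
    done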

I think splitting twice is double the work, and there can be no manual steps either: the script will run from a cron job.

What Operating System and version do you use?
Does your version of the "split" command have the "-a" switch? See "man split".

Splitting the files twice is indeed twice the work. Using a shell script for serious data processing has its merits for quick development of one-off jobs and prototypes, but for a regular job on this scale I would personally choose a high-level language which has no issues closing file descriptors mid-process.
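As one sketch of that idea without leaving the shell entirely, nawk (on Solaris; plain awk on most other systems) can do the whole job in a single pass, closing each output file before opening the next, so the 676-name limit never comes into play. The output name format here is just illustrative:

    # Start a new output file every 5,000 records, closing the previous
    # one so we never run out of file descriptors
    nawk 'NR % 5000 == 1 { if (out) close(out); out = sprintf("childfile.%06d", ++n) }
          { print > out }' parentfile.txt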

I am intrigued: why would you want to break 10,000,000 records into 5,000-record chunks? To my mind it just creates complications in volume processing.

The other integration applications do not support large files. For that purpose we split the large file into small ones and pass those along.

We are using Sun Solaris 5.10.

Please look at your "man split" and advise whether you have the "-a" switch available. This would allow you to extend the range of suffixes by further powers of 26.

Thanks for your support.
"-a" option will work successfully. But we are not able to fix the "-a suffixlength". Any dynamic updation of the suffixlength for every time.

If you use, say, "-a 4" this gives you up to 26 x 26 x 26 x 26 = 456,976 suffix combinations (aaaa-zzzz), which at 5,000 records per file comes to 2,284,880,000 maximum input records.
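For instance (the output prefix is illustrative), this is a single-pass solution for your 10 million records:

    split -l 5000 -a 4 parentfile.txt childfile.

That produces childfile.aaaa, childfile.aaab, and so on. Your job needs 10,000,000 / 5,000 = 2,000 files, so even "-a 3" (26^3 = 17,576 names) would be enough.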

Not sure I understand your question. If your "-a" value is high enough, you don't need to know the number of records in advance.
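If you really do want the suffix length sized dynamically, a small Bourne-shell wrapper can work it out from the line count before calling split. A sketch, with illustrative names:

    lines=`wc -l < parentfile.txt`
    files=`expr \( $lines + 4999 \) / 5000`   # 5,000-record chunks, rounded up
    len=2                                     # split's default suffix length
    names=676                                 # 26^2 names at that length
    while [ $names -lt $files ]
    do
        len=`expr $len + 1`
        names=`expr $names \* 26`
    done
    split -l 5000 -a $len parentfile.txt childfile.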

If you are creating large numbers of files, be careful that you have enough free inodes in the filesystem ("df -i" on most systems; "df -o i" on Solaris).