laurigo
February 23, 2016, 12:48pm
1
I have a huge file with the following input:
Case1 Specific_Info Specific_Info
Case1 Specific_Info Specific_Info
Case3 Specific_Info Specific_Info
Case4 Specific_Info Specific_Info
Case1 Specific_Info Specific_Info
Case2 Specific_Info Specific_Info
Case2 Specific_Info Specific_Info
Case1 Specific_Info Specific_Info
Case3 Specific_Info Specific_Info
...
Casen Specific_Info Specific_Info
I need to split this file into several files, where each output file has at most 1000 lines for a given "Casen". I have been doing this in separate steps: first splitting the file by Case, then splitting those files into 1000-line pieces, and then combining files back together, but this process takes too long.
Using bash, cd to the directory containing that file.
Step 1.
nn=$(cut -d' ' -f1 infile | sort -u | wc -l)   # number of distinct case names
echo $nn
If nn is less than the open-file limit (shown by ulimit -n) minus 3 (three descriptors are already taken by stdin, stdout, and stderr), use:
awk '{print $0 > $1}' infile
Otherwise nn is too big; use:
while read -r rec
do
f=${rec%% *}        # first field = the case name
echo "$rec" >> "$f"
done < infile
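For example, either variant splits a small sample into one file per case (a sketch; the output file names come straight from the first column):

```shell
# run in a scratch directory; awk writes one output file per case name
cd "$(mktemp -d)"
printf 'Case1 a\nCase2 b\nCase1 c\n' > infile
awk '{print $0 > $1}' infile
wc -l Case1 Case2   # Case1 gets 2 lines, Case2 gets 1
```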
Now you have a bunch of files with many lines that all start with the same string.
Step 2.
Use the split -l (ell) command to make smaller files with a limit of 1000 lines per file. Note: the last file for each case may have fewer than 1000 lines.
ls Case* > tmpfile
while read -r f
do
split -l 1000 "$f" "$f"
rm "$f" # remove litter
done < tmpfile
rm tmpfile
You are going to have loads of small files with names like Case343aa, where aa is the unique file-name suffix added by split (by default two lowercase letters, aa, ab, ...).
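Both steps can also be combined into a single pass. A minimal sketch, assuming an awk that can keep enough files open at once (gawk manages its own descriptor pool); the `.000`-style chunk suffix is my own naming, not what split produces, and the chunk size is lowered from 1000 to 2 here so the split is visible on a tiny sample:

```shell
cd "$(mktemp -d)"
printf 'Case1 a\nCase1 b\nCase2 c\nCase1 d\n' > infile
# one pass: route each line to a per-case chunk file of at most 2 lines
awk '
{
    if (cnt[$1] % 2 == 0) {              # time to start a new chunk
        if (cnt[$1] > 0) close(out[$1])  # release the finished chunk
        out[$1] = sprintf("%s.%03d", $1, chunk[$1]++)
    }
    print > out[$1]
    cnt[$1]++
}' infile
wc -l Case1.000 Case1.001 Case2.000   # 2, 1, and 1 lines respectively
```

With real data you would change the 2 back to 1000 and drop the sample setup.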
RudiC
February 24, 2016, 4:40am
3
How about
sort file | awk 'C[$1]++ < 1000'
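Worth noting: this one-liner prints at most the first 1000 lines of each case to stdout; it does not produce separate 1000-line files. A sketch of the effect, with the limit lowered to 2 for a small sample:

```shell
printf 'Case1 a\nCase1 b\nCase1 c\nCase2 d\n' |
sort | awk 'C[$1]++ < 2'
# prints: Case1 a, Case1 b, Case2 d  (Case1 c is dropped as line 3 of Case1)
```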