laurigo
February 23, 2016, 12:48pm
1
I have a huge file with the following input:
Case1 Specific_Info Specific_Info
Case1 Specific_Info Specific_Info
Case3 Specific_Info Specific_Info
Case4 Specific_Info Specific_Info
Case1 Specific_Info Specific_Info
Case2 Specific_Info Specific_Info
Case2 Specific_Info Specific_Info
Case1 Specific_Info Specific_Info
Case3 Specific_Info Specific_Info
...
Casen Specific_Info Specific_Info
I need to split this file into several files, where each output file has at most 1000 lines for a given "Casen". I have been doing this in separate steps: first splitting the file by Case, then splitting those files into 1000-line pieces, and then combining files back together, but this process takes too long.
Using bash, cd to the directory containing that file.
Step 1.
nn=$(cut -d' ' -f1 infile | sort -u | wc -l)   # number of distinct case names
echo $nn
If nn is less than the open-file limit (shown by ulimit -n) minus 3 (three descriptors are already taken by stdin, stdout, and stderr), use:
awk '{print $0 > $1}' infile
Otherwise nn is too big; use:
while read -r rec
do
f=${rec%% *}        # first field = the case name
echo "$rec" >> "$f"
done < infile
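For example, either variant splits a small sample into one file per case (a sketch; the output file names come straight from the first column):

```shell
# run in a scratch directory; awk writes one output file per case name
cd "$(mktemp -d)"
printf 'Case1 a\nCase2 b\nCase1 c\n' > infile
awk '{print $0 > $1}' infile
wc -l Case1 Case2   # Case1 gets 2 lines, Case2 gets 1
```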
Now you have a bunch of files with many lines that all start with the same string.
Step 2.
Use the split -l (ell) command to make smaller files with a limit of 1000 lines per file. Note: the last file for each case may have fewer than 1000 lines.
ls Case* > tmpfile
while read -r f
do
split -l 1000 "$f" "$f"
rm "$f" # remove litter
done < tmpfile
rm tmpfile
You are going to have loads of small files with names like Case343aa, where aa is the unique file-name suffix added by split (by default two lowercase letters, aa, ab, ...).
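Both steps can also be combined into a single pass. A minimal sketch, assuming an awk that can keep enough files open at once (gawk manages its own descriptor pool); the `.000`-style chunk suffix is my own naming, not what split produces, and the chunk size is lowered from 1000 to 2 here so the split is visible on a tiny sample:

```shell
cd "$(mktemp -d)"
printf 'Case1 a\nCase1 b\nCase2 c\nCase1 d\n' > infile
# one pass: route each line to a per-case chunk file of at most 2 lines
awk '
{
    if (cnt[$1] % 2 == 0) {              # time to start a new chunk
        if (cnt[$1] > 0) close(out[$1])  # release the finished chunk
        out[$1] = sprintf("%s.%03d", $1, chunk[$1]++)
    }
    print > out[$1]
    cnt[$1]++
}' infile
wc -l Case1.000 Case1.001 Case2.000   # 2, 1, and 1 lines respectively
```

With real data you would change the 2 back to 1000 and drop the sample setup.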
RudiC
February 24, 2016, 4:40am
3
How about
sort file | awk 'C[$1]++ < 1000'
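Worth noting: this one-liner prints at most the first 1000 lines of each case to stdout; it does not produce separate 1000-line files. A sketch of the effect, with the limit lowered to 2 for a small sample:

```shell
printf 'Case1 a\nCase1 b\nCase1 c\nCase2 d\n' |
sort | awk 'C[$1]++ < 2'
# prints: Case1 a, Case1 b, Case2 d  (Case1 c is dropped as line 3 of Case1)
```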