Split a file based on pattern and size

jl487 · May 3, 2012, 10:21am

Hello, I have a large file (2GB) that I would like to split based on pattern and size.

I've used the following command to split the file (the token is "HELLO"):

awk '/HELLO/{i++}{print > "file"i}' input.txt

The output is similar to the following (I included the file size in KB):

10  file1
10  file2
20  file3
18  file4
1   file5
1   file6
5   file7

I'd like to merge/cat the files so that consecutive files are combined as long as the combined size stays within a limit. With a 20 KB limit, my desired output would be:

20  file1
20  file2
20  file3
5   file4

In my desired output, files 1-2 got merged, file 3 stayed the same, files 4-6 got merged, and file 7 stayed the same because it's the remainder.

I was thinking of using my awk command first and then a for loop to merge the files. My only issue is that since there are so many files, if I sorted by file name, the order would be file1, file10, file100, file2, file20, etc., and I don't want to merge file1 and file10 together.

hfreyer · May 3, 2012, 10:56am

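You could zero-pad the file number (4 digits in this example; increase if needed). The output name is parenthesized because an unparenthesized concatenation after > is not portable across awk implementations:

awk '/HELLO/{istr=sprintf("%04d",i++)}{print > ("file" istr)}' input.txt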

jl487 · May 3, 2012, 11:48am

Nice! The naming convention modification will definitely help when sorting/merging.

Now I just need to merge files based on size.
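
One possible way to do the size-based merge is a greedy loop over the zero-padded pieces. This is only a sketch, not something tested on the 2 GB input. The function name merge_by_size, the output prefix "merged", and the 4-digit output numbering are illustrative assumptions, not part of the thread. It relies on the shell glob returning the zero-padded names in the right order, and it measures sizes in bytes with wc -c.

```shell
#!/bin/sh
# merge_by_size LIMIT_BYTES IN_PREFIX OUT_PREFIX   (hypothetical helper)
# Greedily concatenates IN_PREFIX* files, in glob order, into
# OUT_PREFIX0001, OUT_PREFIX0002, ... so that each output stays at or
# below LIMIT_BYTES. A single input larger than the limit still gets
# an output file of its own.
merge_by_size() {
    limit=$1
    in_prefix=$2
    out_prefix=$3
    n=1      # index of the current output file
    size=0   # bytes already written to the current output file
    for f in "$in_prefix"*; do
        [ -f "$f" ] || continue   # skips the literal pattern when nothing matches
        s=$(wc -c < "$f")
        # Start a new output file if this piece would exceed the limit.
        if [ "$size" -gt 0 ] && [ $((size + s)) -gt "$limit" ]; then
            n=$((n + 1))
            size=0
        fi
        # Append, so remove old outputs before re-running.
        cat "$f" >> "$(printf '%s%04d' "$out_prefix" "$n")"
        size=$((size + s))
    done
}
```

After the split, something like merge_by_size $((20 * 1024)) file merged would apply a 20 KB limit. Use an output prefix that does not start with the input prefix, so a second run does not pick up earlier results as input. With the sizes from the first post (10, 10, 20, 18, 1, 1, 5) this greedy rule gives 20, 20, 20, 5.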