I have been searching for a long time for a way to split a large gzip CSV file into many gzip files (except for the last sub-file, which is to be joined with the next big file's children). All the sub-files are to be named by the first field.
But I have only managed to split them into uncompressed sub-files:
zcat bigfile | awk '{ print $0 >> $1 }' FS=','
If I could pass $1 outside awk, and if gzip could append lines, it would save me a lot of I/O time and disk space. One of my gzip files is 124 GB.
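For what it's worth, gzip output can be appended: concatenated gzip streams form a valid file that zcat reads as one. So the one-liner above can pipe each key's records through its own `gzip >>` process. A rough sketch, assuming a POSIX awk, with `bigfile.gz` standing in for the real file:

```shell
# Sketch: split a gzipped CSV into per-key gzipped files in one pass.
# Assumes the number of distinct keys stays under the per-process limit
# on simultaneously open pipes.
zcat bigfile.gz | awk -F',' '{
    # one gzip process per distinct key, appending to <key>.gz;
    # awk reuses the same pipe as long as the command string is identical
    cmd = "gzip >> \"" $1 ".gz\""
    print $0 | cmd
}'
# awk closes all open pipes on exit, so each gzip trailer gets flushed
```

Because appended gzip members are valid, re-running this on the next big file simply extends each `<key>.gz`.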
I am new to this forum. Thanks, Kumaran. I am not sure how to do that, i.e. test for a change in $1. Also, the source data is not sorted by $1, so it is hard to tell whether I will see more of the same key later in the big file.
Sorry, it's a cut and paste from Excel, so the commas have been removed. It's not sorted, which is a problem. But I think the data should be clustered, so once a key changes it will not come back until the next big file, hence the need to know which record is the last.
I also need to sort the sub-files by date and time. Is there any shortcut for dealing with the dates?
This is not optimal. Doing pipelines or redirections in awk is awkward, because awk has to linearly search its table of open file pointers and flush them on every I/O. You should do it in Perl, or perhaps you can split to named pipes, which can then be gzipped.
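The named-pipe idea might look something like this. This is a hypothetical sketch that assumes the key values (here `a` and `b`) are known up front; each FIFO feeds one long-lived gzip process, so each compressed file is written in a single stream rather than re-opened per record:

```shell
# Hypothetical sketch: one FIFO plus one gzip writer per known key.
# "a" and "b" are placeholder key values; bigfile.gz is a stand-in name.
mkfifo a.pipe b.pipe
gzip < a.pipe > a.gz &      # background compressor for key "a"
gzip < b.pipe > b.gz &      # background compressor for key "b"
zcat bigfile.gz | awk -F',' '{ print $0 >> ($1 ".pipe") }'
wait                        # let the gzip processes drain and exit
```

The obvious limitation is that the keys must be enumerated before the pass starts, e.g. from a cheap first scan with `zcat bigfile.gz | cut -d',' -f1 | sort -u`.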