How to pass a field from awk in a pipe?

Thanks in advance : )

I have spent a long time searching for a way to split a large gzipped CSV file into many gzipped files (except for the last sub-file, which is to be joined with the next big file's children). All the subfiles are to be named after the first field.

But I have only managed to split it into uncompressed subfiles.

zcat bigfile | awk '{ print $0 >> $1 }' FS=','

If I could pass $1 outside awk, and if gzip could append lines, it would save me a lot of I/O time and disk space. One of my gzip files is 124 GB.

I.e. something like

zcat file | awk '{ print $0 }' FS=',' | gzip >> "$PassOut1".gz

If you know what I mean.

Maybe you can try something like this.

First write the uncompressed data to subfile1, and when you are about to start subfile2, run gzip on subfile1.
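A minimal sketch of this idea, assuming lines sharing a key arrive contiguously (the thread notes later that the data is only clustered, not strictly sorted, so the last group may need special handling). Three sample lines stand in for the real zcat output:

```shell
# Compress each subfile as soon as the key in column 1 changes.
printf 'A,1\nA,2\nB,3\n' |
awk -F, '
    $1 != prev {
        if (prev != "") {
            close(prev)
            system("gzip -f " prev)   # compress the subfile we just finished
        }
        prev = $1
    }
    { print >> $1 }
    END { if (prev != "") { close(prev); system("gzip -f " prev) } }
'
```

This only works if each key's lines are contiguous; if a key reappears after its subfile has been gzipped, the appended plain-text lines would corrupt the output.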


I am new to this forum. Thanks, Kumaran. I am not sure how to do that, i.e. how to test for a change in $1. Also, the source data is not sorted by $1, so it's hard to tell whether more lines with the same key will appear later in the bigfile.

We need sample data and the criteria for splitting the file (where and when to split it).

mkdir subfolder
zcat file | awk -F, '{print $0 > "subfolder/" $1}' 
gzip subfolder/*

Thanks, rdcwayx!! That's neat, but the files are not gzipped until everything has been split.

The data looks like this, with many more fields:

VTLJ.J	2-Jan-96	08:07:07.310	2	Quote
VTLJ.J	2-Jan-96	09:30:00.320	2	Quote
BKLJ.J	2-Jan-96	10:56:38.660	2	Quote
LJ.C	3-Jan-96	09:23:34.070	2	Quote


That is, I'd like to make one pass over the data and split the bigfile into:

File VTLJ.J.gz

VTLJ.J	2-Jan-96	08:07:07.310	2	Quote
VTLJ.J	2-Jan-96	09:30:00.320	2	Quote

File BKLJ.J.gz
Etc.

First, I don't see any "," in your input file.

Second, do you mean the big file has been grouped or sorted by column 1?

For example, if column 1 is VTLJ.J, then after I see BKLJ.J, there should be no more lines with VTLJ.J.

You need to confirm that first.

If it isn't, you have to wait until awk has gone through the whole file, then gzip the output files.

Thanks again, rdcwayx.

Sorry, it's a cut and paste from Excel, so the commas have been removed. It's not sorted; that's a problem. But I think it should be clustered, so once the key changes it will not come back until the next bigfile, hence the need to know which subfile is the last one.

I also need to sort each sub-file by date and time. Any shortcut to deal with the date?
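One hedged way to handle the d-Mon-yy dates (assuming all years fall in the same century, as in this 1996 sample): prepend a sortable yymmdd key with awk, sort on that key plus the time field, then cut the key back off. The sample rows below stand in for a real sub-file.

```shell
# Sample sub-file rows: symbol, date, time (tab-separated, as in the post).
printf 'X\t3-Feb-96\t09:00\nX\t2-Jan-96\t10:00\n' |
awk -F'\t' '{
    split($2, d, "-")                 # d[1]=day, d[2]=month name, d[3]=yy
    m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", d[2]) + 2) / 3
    printf "%02d%02d%02d\t%s\n", d[3], m, d[1], $0   # prepend yymmdd key
}' |
sort -k1,1 -k4,4 |                    # sort by date key, then by time
cut -f2- > sorted.txt                 # strip the key again
```

GNU sort's -M (month-name) option is another route if the date components can be isolated into separate sort fields.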

Call gzip directly from awk to avoid the extra disk I/O:

zcat file |awk -F, '{print | "gzip >" $1 ".gz"}'

This is not optimal, though. Doing pipelines or redirections in awk is awkward, because awk must linearly search its table of open file pointers and flush them on every I/O. You could do it in perl instead, or split to named pipes which can then be gzipped.


Oh yeah! Thanks, Binlib. Your name sounds descriptive!!