split large file based on field criteria

I have a file containing date/time sorted data of the form
...
2009/06/10,20:59:59.950,XAG/USD,Q,1,1115, 14.3025,100,1,1
2009/06/10,20:59:59.950,XAG/USD,Q,1,1116, 14.3026,125,1,1
2009/06/10,20:59:59.950,XAG/USD,R,0,0, , 0,0,0
2009/06/10,20:59:59.950,XAG/USD,R,1,0, 14.1910,100,1,1
2009/06/10,20:59:59.950,XAG/USD,A,0,, 14.3011,100,1
2009/06/10,21:00:00.100,CHF/JPY,Q,0,0, , 0,0,0
2009/06/10,21:00:00.100,CHF/JPY,Q,1,0, 70.26, 60,2,2
2009/06/10,21:00:00.150,CHF/JPY,D,0, 70.14, 20,XC05, ,NYD9,US,NYA1
...

I want to split this file into exactly two files based on a date/time criterion: all lines with timestamps less than or equal to 21:00:00.000 should go to 'file1', and those greater than 21:00:00.000 should go to 'file2'.

I wrote a simple script using a while loop that reads each line and matches the criterion.
The script works fine, but since these data files are huge (gigabytes), the processing takes forever.

Is there a better way (sed, awk, egrep or even split) to do this more efficiently?

Thanks.

Have you tried split or csplit? Check their man or info pages.
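
For example, since the data is time-sorted, csplit can cut the file once at the first hour-21 timestamp. A sketch (assuming the input is data.txt; note that a line stamped exactly 21:00:00.000 would land in the second piece, so adjust the pattern if that matters):

csplit -f part data.txt '/^[^,]*,21:/'
# part00: all lines before the first hour-21 timestamp
# part01: that line and everything after it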

Sorry, but I use awk for everything!

cat myfile | awk -F, '
  $2 ~ /^21/ { print > "file2"; next }
  { print > "file1" }
'

If you want two files for each date...

cat myfile | awk -F, '
  { gsub( "/", "_", $1 ) }
  $2 ~ /^21/ { print > $1 "_2"; next }
  { print > $1 "_1" }
'

then why did you use cat? :slight_smile:

I didn't say I used awk exclusively. That would just be boring :slight_smile:

But I see what you mean:

awk -F, '
  { gsub( "/", "_", $1 ) }
  $2 ~ /^21/ { print > $1 "_2"; next }
  { print > $1 "_1" }
' myfile

Force of habit!

Scottn,

Thanks for your reply. It pretty much does the job, but I noticed two issues:

1) All the "," delimiters are gone from the output!

2) I see why you need to replace "/" with "_", since "/" is not allowed in file names, but a side effect is that the "/" in the date field of the output is replaced with "_" as well.
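
Both side effects have the same cause: assigning to $1 makes awk rebuild the whole line using its default output field separator, a space. A sketch of a variant that works on a copy of $1 instead, leaving $0 untouched (untested, same filenames as above):

awk -F, '
  { f = $1; gsub( "/", "_", f ) }       # modify a copy so $0 is not rebuilt
  $2 ~ /^21/ { print > (f "_2"); next }
  { print > (f "_1") }
' myfile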

$
$ cat data.txt
2009/06/10,20:59:59.950,XAG/USD,Q,1,1115, 14.3025,100,1,1
2009/06/10,20:59:59.950,XAG/USD,Q,1,1116, 14.3026,125,1,1
2009/06/10,20:59:59.950,XAG/USD,R,0,0, , 0,0,0
2009/06/10,20:59:59.950,XAG/USD,R,1,0, 14.1910,100,1,1
2009/06/10,20:59:59.950,XAG/USD,A,0,, 14.3011,100,1
2009/06/10,21:00:00.100,CHF/JPY,Q,0,0, , 0,0,0
2009/06/10,21:00:00.100,CHF/JPY,Q,1,0, 70.26, 60,2,2
2009/06/10,21:00:00.150,CHF/JPY,D,0, 70.14, 20,XC05, ,NYD9,US,NYA1
$
$ perl -ne 'BEGIN{open(F1,">file1"); open(F2,">file2")}
>   { split/[,:]/;
>     if($_[1] <= 20){print F1 $_} else{print F2 $_}
>   }
> END {close(F1); close(F2)}' data.txt
$
$ cat file1
2009/06/10,20:59:59.950,XAG/USD,Q,1,1115, 14.3025,100,1,1
2009/06/10,20:59:59.950,XAG/USD,Q,1,1116, 14.3026,125,1,1
2009/06/10,20:59:59.950,XAG/USD,R,0,0, , 0,0,0
2009/06/10,20:59:59.950,XAG/USD,R,1,0, 14.1910,100,1,1
2009/06/10,20:59:59.950,XAG/USD,A,0,, 14.3011,100,1
$
$ cat file2
2009/06/10,21:00:00.100,CHF/JPY,Q,0,0, , 0,0,0
2009/06/10,21:00:00.100,CHF/JPY,Q,1,0, 70.26, 60,2,2
2009/06/10,21:00:00.150,CHF/JPY,D,0, 70.14, 20,XC05, ,NYD9,US,NYA1
$
$
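
One more thought: because the file is time-sorted, you don't have to test every line in awk or perl at all. Find the line number of the first record past the boundary, then let head and tail do the bulk I/O; on multi-gigabyte files that is much faster than per-line processing in a script. A sketch (assuming the sort order holds throughout, the timestamp is always the second comma-separated field, and at least one line lies past the boundary):

# fixed-width timestamps, so a string compare is also a time compare
n=$(awk -F, '$2 > "21:00:00.000" { print NR; exit }' data.txt)
head -n $((n - 1)) data.txt > file1    # lines up to and including 21:00:00.000
tail -n +"$n" data.txt > file2         # everything after the boundary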

tyler_durden