I have a working Ruby script, shown below, that reads a big binary file (more than 2 GB). The chunks of data I want to analyze
are separated by the byte sequence FF47 within the binary. So in the Ruby script I define the "line separator" as FF47 ($/="\xff\x47")
in order to read the file "line by line" and avoid loading the entire file into memory.
The program works great, and now I'm trying to port this algorithm to Java. I've seen built-in ways in Java to read binary files that are not big,
but I don't know how to set the sequence FF47 as the line separator.
How can I do this?
#!/usr/bin/env ruby -E BINARY
# -*- encoding: utf-8 -*-
# Use the two bytes FF 47 as the input record separator.
BEGIN{ $/="\xff\x47".force_encoding("BINARY") }
IO.foreach(ARGV[0]){ |l|
  # Hex string of the chunk (the trailing separator is included).
  current_line = l.unpack('H*')[0]
  ### Process each line stored in variable "current_line" as desired ###
  ### ...
  ### ...
} if File.exist?(ARGV[0])
I am no expert in Java, but I don't think this is possible directly. You probably have to do it yourself, like in good old C: open the file with fopen() and use fseek(), fread() and ftell() to find what you are searching for. Those functions belong to the C standard library; Java does not have them under those names, but its stream and file classes (for example FileInputStream or RandomAccessFile) offer equivalent read and seek operations.
AFAIK there's no way to use the stdio-based family of library calls (fopen(), etc.) and have them treat the binary sequence "FF47" as a "line" separator.
Even if you could set your locale environment variables to select a character set in which "FF47" is a 16-bit newline character (if one even exists), the fact that it's a binary file could break things: the "newline" might not always fall on a 16-bit boundary.
The only way to do what the OP asked is to read the file as a binary file and search for the "FF47" byte sequence, and hope that the file wasn't written in an endian-dependent way. That matters especially when using Java on a little-endian machine (x86, most ARM OSes), as Java tends to read/write multi-byte data in network byte order (big-endian) for portability.
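To illustrate the endianness point, here is a minimal Java sketch. It assumes the marker is stored as the two bytes FF 47 in file order; the class and method names are made up for this example. The same two bytes give different 16-bit values depending on the byte order used to decode them, which is why comparing byte by byte is the safer way to look for the marker.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    // Decodes exactly two bytes as an unsigned 16-bit value in the given byte order.
    public static int asUnsignedShort(byte[] two, ByteOrder order) {
        return ByteBuffer.wrap(two).order(order).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] marker = {(byte) 0xFF, 0x47}; // bytes in the order they appear in the file

        // Big-endian is the default for ByteBuffer and DataInputStream.
        System.out.printf("big-endian:    0x%04X%n",
                asUnsignedShort(marker, ByteOrder.BIG_ENDIAN));    // prints 0xFF47
        System.out.printf("little-endian: 0x%04X%n",
                asUnsignedShort(marker, ByteOrder.LITTLE_ENDIAN)); // prints 0x47FF
    }
}
```

If the search compares single bytes (first FF, then 47) instead of decoding 16-bit values, the byte order of the machine or of the reader class does not come into play.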
You do not treat it as a "line separator"; you simply search for the sequence and then read what comes after it. The stdio functions have no "line separators" because there is no such thing as a "line" that could be separated. Sorry for not mentioning that explicitly, I thought it was obvious.
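The "search for the sequence yourself" approach could look roughly like this in Java. This is only a sketch, not a tested port of the Ruby script: the class name DelimitedChunkReader and its methods are invented for illustration, and it assumes a two-byte delimiter. It reads through a BufferedInputStream, so only the current chunk is held in memory, and it keeps the delimiter at the end of each chunk, as Ruby's IO.foreach does.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DelimitedChunkReader {
    /**
     * Reads bytes up to and including the two-byte delimiter (d0, d1).
     * Returns the trailing bytes if the stream ends without a delimiter,
     * and null once nothing is left to read.
     */
    public static byte[] nextChunk(InputStream in, int d0, int d1) throws IOException {
        ByteArrayOutputStream chunk = new ByteArrayOutputStream();
        int prev = -1; // previous byte of this chunk, -1 if none yet
        int b;
        while ((b = in.read()) != -1) {
            chunk.write(b);
            if (prev == d0 && b == d1) {
                return chunk.toByteArray(); // delimiter found, chunk complete
            }
            prev = b;
        }
        return chunk.size() > 0 ? chunk.toByteArray() : null;
    }

    /** Lowercase hex string, comparable to Ruby's unpack('H*'). */
    public static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte x : data) {
            sb.append(String.format("%02x", x & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java DelimitedChunkReader <file>");
            return;
        }
        // The buffered stream makes the byte-at-a-time read() calls cheap.
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            byte[] chunk;
            while ((chunk = nextChunk(in, 0xFF, 0x47)) != null) {
                String currentLine = toHex(chunk);
                // Process each chunk here, as in the Ruby block.
                System.out.println(currentLine.length() / 2 + " bytes");
            }
        }
    }
}
```

Tracking only the previous byte is enough for a two-byte delimiter; a longer delimiter would need a proper substring search over a sliding window.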
Thanks for your answers. An option that reads line by line from a binary file in C using get(), fget() as you said sounds great, but the "lines" or chunks are separated by FF65, and in my original Ruby code I process the chunks very well with regular expressions. I'm afraid I cannot use C for this task since I think it doesn't support Perl-style regular expressions, though I'm not sure.
The C language doesn't have the regular expressions that Perl uses, but POSIX systems provide their own regular expression functions, with the same syntax families as sed and awk, so look up the man pages for regcomp / regexec etc.
Yes, I've tried the C regex functions, but they are not powerful enough. The regexes I'm using are fairly complex and use backreferences, greedy and non-greedy quantifiers, etc. The C regexes don't support these kinds of things, from what I know.
Basically I'd like to process one chunk at a time from the binary file using 0xFF65 as the delimiter, but it seems Java doesn't have an option to change the line separator when reading a binary file, similar to what I did with Ruby in the sample code in my first post.
I'm not sure; maybe someone knows an alternative or a third-party library that can read binary data with a custom separator.
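One possible alternative that stays within the JDK is java.util.Scanner with a custom delimiter. The sketch below rests on one assumption: the stream is decoded as ISO-8859-1, which maps every byte 0x00..0xFF to the character with the same code, so binary data survives the trip through a String. The class name ScannerChunks and the two sample patterns are hypothetical. The delimiter is FF 47 as in the first post, so change the constant if the data really uses FF 65. Unlike the Ruby version, Scanner drops the delimiter from each chunk. The chunks can then be matched with java.util.regex, which supports backreferences and non-greedy quantifiers.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScannerChunks {
    // Bytes FF 47 seen as ISO-8859-1 characters (U+00FF followed by 'G').
    public static final Pattern DELIMITER = Pattern.compile("\u00FF\u0047");

    // Hypothetical sample patterns, only to show the regex features:
    // non-greedy capture between byte 0x01 and the next byte 0x02 ...
    public static final Pattern BETWEEN_MARKERS = Pattern.compile("(?s)\\x01(.+?)\\x02");
    // ... and a backreference that finds any byte repeated twice in a row.
    public static final Pattern REPEATED_BYTE = Pattern.compile("(?s)(.)\\1");

    /** Hands each delimiter-separated chunk to the handler, one at a time. */
    public static void forEachChunk(InputStream in, Consumer<String> handler) {
        try (Scanner sc = new Scanner(in, "ISO-8859-1")) {
            sc.useDelimiter(DELIMITER);
            while (sc.hasNext()) {
                handler.accept(sc.next());
            }
        }
    }

    /** Returns group 1 of the first match, or null if the pattern does not match. */
    public static String firstGroup(Pattern p, String chunk) {
        Matcher m = p.matcher(chunk);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java ScannerChunks <file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            forEachChunk(in, chunk -> {
                String inner = firstGroup(BETWEEN_MARKERS, chunk);
                System.out.println(chunk.length() + " bytes, payload: "
                        + (inner == null ? "none" : inner.length() + " bytes"));
            });
        }
    }
}
```

To get the raw bytes of a chunk back, chunk.getBytes(StandardCharsets.ISO_8859_1) reverses the decoding. Scanner is convenient but probably slower than a hand-written byte loop on a multi-gigabyte file, and it swallows read errors (check ioException() if that matters), so treat this as a starting point.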