How to replicate Ruby's binary file reading with Java?

Hello everyone,

Maybe some expert could help me.

I have a working Ruby script, shown below, that reads a big binary file (more than 2 GB). The chunks of data I want to analyze
are separated by the sequence FF47 within the binary. So the Ruby script defines FF47 as the "line separator" ($/="\xff\x47")
in order to read the file "line by line" and avoid loading the entire file into memory.

The program works great, and now I'm trying to apply this algorithm in Java. I've seen built-in ways in Java to read small binary files,
but I don't know how to set the sequence FF47 as the line separator.

How can I do this?

#!/usr/bin/env ruby -E BINARY
# -*- encoding: utf-8 -*-
 
BEGIN{  $/="\xff\x47".force_encoding("BINARY")   }   
 
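IO.foreach(ARGV[0]) { |l|
  # A local variable is used here: assigning a constant inside the block
  # triggers an "already initialized constant" warning on every iteration.
  current_line = l.unpack('H*')[0]   # chunk (separator included) as a hex string
  ### Process each line stored in variable "current_line" as desired ###
  ### ...
  ### ...
} if File.exist?(ARGV[0])   # File.exists? is deprecated and was removed in Ruby 3.2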

Thanks for any help.

Regards

I am no expert in Java, but I don't think this is possible out of the box. You probably have to do it yourself, like in good old C: you open the file ( fopen() ) and use fseek() , fread() and ftell() to find what you are searching for. Those functions belong to the C standard library; Java does not have them, but its java.io classes (for instance RandomAccessFile with seek() and read()) offer comparable operations.

I hope this helps.

bakunin

AFAIK there's no way to use the stdio-based family of library calls (fopen(), etc.) and have them treat the binary sequence "FF47" as a "line" separator.

Even if you could set your locale environment variables to a character set that uses "FF47" as a 16-bit newline character (if one even exists), the fact that it's a binary file could break things: the "newline" character might not always fall on a 16-bit boundary.

The only way to do what the OP asked is to read the file as a binary file and search for the "FF47" bytes, and hope that the file wasn't written in an endian-dependent way. That matters especially when using Java on a little-endian machine (x86, most ARM OSes), as Java tends to read/write data in network byte order (big endian) for portability.

You do not treat them as a "line separator", but simply search for the sequence and then read what comes after it. The stdio function calls don't have "line separators" because, in a binary file, there is no such thing as a "line" which could be separated. Sorry for not mentioning that explicitly, I thought it was obvious.

bakunin

Actually, there are two stdio-based calls that process input line-by-line - gets() and fgets().

And yes, the only way to do what the OP wants in Java is to search through the data looking for the binary separator sequence.
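To make the "search through the data" idea concrete, here is a minimal sketch in Java, not a definitive implementation. It assumes the separator is the two bytes FF 47 from the first post; the class name ChunkReader and the method readChunk are made up for illustration. It reads through a BufferedInputStream one byte at a time, so only the current chunk is held in memory. Unlike Ruby's IO.foreach, which leaves the separator at the end of each "line", this version strips it.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

final class ChunkReader {

    // Returns the bytes up to (not including) the next delimiter,
    // or null when the input is exhausted.
    static byte[] readChunk(InputStream in, byte[] delim) throws IOException {
        byte[] buf = new byte[8192];
        int len = 0;
        int b;
        while ((b = in.read()) != -1) {
            if (len == buf.length) {
                buf = Arrays.copyOf(buf, len * 2);   // grow the chunk buffer
            }
            buf[len++] = (byte) b;
            if (endsWith(buf, len, delim)) {
                return Arrays.copyOf(buf, len - delim.length);
            }
        }
        // End of input: return the trailing bytes, if any.
        return len == 0 ? null : Arrays.copyOf(buf, len);
    }

    // True if the last delim.length bytes of buf[0..len) equal delim.
    private static boolean endsWith(byte[] buf, int len, byte[] delim) {
        if (len < delim.length) {
            return false;
        }
        for (int i = 0; i < delim.length; i++) {
            if (buf[len - delim.length + i] != delim[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ChunkReader.java <file>");
            return;
        }
        byte[] delim = {(byte) 0xFF, 0x47};   // assumed separator FF 47
        try (InputStream in = new BufferedInputStream(
                Files.newInputStream(Paths.get(args[0])))) {
            long count = 0;
            byte[] chunk;
            while ((chunk = readChunk(in, delim)) != null) {
                count++;   // process the chunk here
            }
            System.out.println(count + " chunks");
        }
    }
}
```

Because the comparison is done byte by byte, the machine's endianness should not matter for locating the separator itself; it would only come into play when multi-byte values inside a chunk are interpreted.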

Hello bakunin and achenle,

Thanks for your answers. An option that reads line by line from a binary file in C using gets() and fgets(), as you said, sounds great. But the "lines" or chunks are separated by FF65, and in my original Ruby code I process the chunks very well with regular expressions, so I'm afraid I cannot use C for this task, since I think it doesn't support Perl-style regular expressions. I'm not sure.

Regards

The C language doesn't have the regular expressions that Perl uses, but POSIX systems provide a regex library with the same kind of regular expressions as sed and awk, so look up the man pages for regcomp / regexec etc...

Hi shamrock,

Thanks for the answer.

Yes, I've tried the C regex functions, but they are not powerful enough. The regexes I'm using are fairly complex and use backreferences, greedy and non-greedy quantifiers, etc. The C regexes don't support these kinds of things, from what I know.

Basically I'd like to process one chunk at a time from the binary, using 0xFF65 as the delimiter, but it seems Java doesn't have an option to change the line separator when reading a binary file, the way I did with Ruby in the sample code in my first post.

I'm not sure, maybe someone knows an alternative or a 3rd party library that could read binary and set a custom separator.

Regards
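On the regex side, Java's java.util.regex does support backreferences and reluctant (non-greedy) quantifiers, so once a chunk is in hand it can be turned into a hex string (as unpack('H*') does in the Ruby script) and matched there. The sketch below is only an illustration: the class ChunkRegex, the helper names and the pattern itself are hypothetical, not the expressions from the original script. The pattern looks for one byte value, the shortest run of whole bytes after it, and then the same byte value again.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkRegex {

    // Lower-case hex dump of a byte array, two digits per byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Hypothetical pattern: group 1 is one byte, group 2 is the shortest
    // run of whole bytes (reluctant *?), and \1 is a backreference that
    // requires group 1 to appear again.
    private static final Pattern REPEATED_BYTE =
            Pattern.compile("([0-9a-f]{2})((?:[0-9a-f]{2})*?)\\1");

    // Returns the hex text between the first repeated byte pair, or null.
    static String gapBetweenRepeats(byte[] chunk) {
        Matcher m = REPEATED_BYTE.matcher(toHex(chunk));
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        byte[] chunk = {0x0a, 0x1b, 0x2c, 0x0a};
        System.out.println(toHex(chunk));
        System.out.println(gapBetweenRepeats(chunk));
    }
}
```

One caveat with matching on hex text: find() may start a match in the middle of a byte (on an odd character position), so a real pattern would probably need to guard against that.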

How about using Java's "next" method with a regex as argument to read the input up to the delimiter 0xff65?

Hi shamrock,

I found the method "java.util.Scanner.next()", but this scanner only works with text files.

Do you have a reference for the next() method you mention, so I can see examples?

Thanks again
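For what it's worth, one possible way to try the Scanner idea on binary data is sketched below. Scanner decodes its input as text, so the sketch relies on ISO-8859-1, which maps every byte value 0x00..0xFF to exactly one char and back, and sets the delimiter with useDelimiter(). The class ScannerChunks and the method forEachChunk are made-up names, and the separator is passed in as a parameter because the thread mentions both FF47 and FF65. Treat it as an untested-at-scale sketch: how well Scanner copes with very large chunks in a 2 GB file is an open question here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.function.Consumer;
import java.util.regex.Pattern;

final class ScannerChunks {

    // Feeds each delimiter-separated chunk of the stream to the handler.
    static void forEachChunk(InputStream in, byte[] delim, Consumer<byte[]> handler) {
        // ISO-8859-1 turns each byte into one char, so the byte delimiter
        // becomes a string of the same length.
        String d = new String(delim, StandardCharsets.ISO_8859_1);
        try (Scanner sc = new Scanner(in, "ISO-8859-1")) {
            sc.useDelimiter(Pattern.compile(Pattern.quote(d)));
            while (sc.hasNext()) {
                // Convert the token back to the original bytes.
                handler.accept(sc.next().getBytes(StandardCharsets.ISO_8859_1));
            }
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, (byte) 0xFF, 0x47, 3, (byte) 0xFF, 0x47, 4};
        byte[] delim = {(byte) 0xFF, 0x47};
        forEachChunk(new ByteArrayInputStream(data), delim,
                chunk -> System.out.println(chunk.length + " byte(s)"));
    }
}
```

A hand-written byte search avoids the text decoding step altogether, so this route mainly makes sense if the chunks are going to be handled as strings for regex matching anyway.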

Hi.

I think I have used this indirectly: PCRE - Perl Compatible Regular Expressions

Best wishes ... cheers, drl