Extract part of an archive to a different file

Tribe · September 6, 2015, 2:04pm

I need to save part of a file to a different file. The start and end byte offsets are held in two long counters. If the difference between them is large, how should I do this in Java without overflowing a buffer?

Don_Cragun · September 6, 2015, 2:25pm

Why use Java? This is a perfect problem for the dd utility.

Tribe · September 6, 2015, 2:27pm

Because the software that detects the start and end offsets is written entirely in Java, and I want it to be portable.

Don_Cragun · September 6, 2015, 2:52pm

OK. Show us the Java code and show us where in the code you are running into buffering problems.

Tribe · September 10, 2015, 1:01pm

I'm not sure I need to show internal code that has nothing to do with it. Basically I think the issue comes from the maximum number of items in a byte array. Say an example program already contains this code:

byte[] content = new byte[(int) entry.getSize()];

That would mean that the maximum number of elements is the maximum value an int can hold, which in Java is 2147483647, so at most that many bytes. That implies the piece to extract can be at most about 2 GB. What happens if I want to extract a piece of about 7 GB? Even in the 2 GB case, I have no idea whether the content is held in RAM, which would cause problems on low-spec computers.
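To make the limit concrete, here is a minimal sketch of what the `(int)` cast does to a size of about 7 GB. The class and method names (`ArraySizeLimit`, `narrow`, `fitsInArray`) are made up for illustration and are not part of the program discussed above.

```java
// Hypothetical helper, for illustration only.
final class ArraySizeLimit {

    // Same narrowing conversion as the cast in "new byte[(int) entry.getSize()]".
    // Java keeps only the low 32 bits, so large values wrap around silently.
    static int narrow(long size) {
        return (int) size;
    }

    // True only if the size can be expressed as an int array length at all.
    // Note: a JVM may still refuse lengths slightly below Integer.MAX_VALUE.
    static boolean fitsInArray(long size) {
        return size >= 0 && size <= Integer.MAX_VALUE;
    }
}
```

With `size = 7_000_000_000L`, `narrow` returns -1589934592, so the array allocation would fail with a NegativeArraySizeException rather than allocate 7 GB.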

Don_Cragun · September 10, 2015, 1:41pm

The code has everything to do with it.

Why do you believe that you have to copy everything into an array before you write any of your desired output?

Open your input file. Seek to the offset of the first byte you want to copy. Read data from your input file and write it to your output file until you have copied all of the bytes you want to extract. You can do this one byte at a time (no buffering issues, but relatively slow for large transfers), one block at a time (trivial buffering, relatively faster), one block at a time tuned to input and output file disk block boundaries (more complex logic, possibly hardware/filesystem dependent, faster).
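The "one block at a time" approach described above could look roughly like the following. This is only a sketch under some assumptions: the names `RangeCopy` and `copyRange` are invented here, `end` is treated as exclusive, and the 1 MiB block size is an arbitrary choice.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Hypothetical helper, for illustration only.
final class RangeCopy {

    // Copies bytes [start, end) of inputPath into outputPath.
    // Memory use stays at one fixed-size buffer however large the range is.
    static void copyRange(String inputPath, String outputPath, long start, long end)
            throws IOException {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range");
        }
        byte[] buffer = new byte[1 << 20]; // 1 MiB block, an arbitrary choice
        try (RandomAccessFile in = new RandomAccessFile(inputPath, "r");
             OutputStream out = new FileOutputStream(outputPath)) {
            in.seek(start); // position on the first byte to copy
            long remaining = end - start; // kept as a long, never cast as a whole
            while (remaining > 0) {
                int want = (int) Math.min(buffer.length, remaining);
                int got = in.read(buffer, 0, want);
                if (got < 0) {
                    break; // input ended before reaching 'end'
                }
                out.write(buffer, 0, got);
                remaining -= got;
            }
        }
    }
}
```

Only the per-block count is ever narrowed to an int, and it is bounded by the buffer length, so a 7 GB range poses no overflow problem.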

Tribe · September 10, 2015, 1:50pm

So do you think RandomAccessFile will be the simplest way to achieve that? If I write byte by byte it may cause a lot of I/O overhead, which is especially bad for SSD drives. Writing in blocks of 1 MB or so would be much better, I think. The seek method will provide the way to position the cursor for both reading and writing bytes.

Don_Cragun · September 10, 2015, 2:03pm

You don't need random access for your output file. You'll just be doing sequential writes to your output file. You'll also just be doing sequential reads from your input file (once you seek to the proper starting position).
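As an alternative to RandomAccessFile (not something proposed in this thread), the standard java.nio `FileChannel.transferTo` method can do the same job: the input is read from a given position and the output is simply appended to in sequence. The sketch below assumes an exclusive `end` offset; the names `ChannelRangeCopy` and `copyRange` are made up for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class ChannelRangeCopy {

    // Copies bytes [start, end) of input into output and returns the number of bytes copied.
    static long copyRange(Path input, Path output, long start, long end) throws IOException {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range");
        }
        try (FileChannel in = FileChannel.open(input, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(output,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = start;
            long limit = Math.min(end, in.size()); // never read past the end of the input
            while (position < limit) {
                // transferTo may move fewer bytes than requested, so loop until done.
                long moved = in.transferTo(position, limit - position, out);
                if (moved <= 0) {
                    break;
                }
                position += moved;
            }
            return position - start;
        }
    }
}
```

The output channel is only ever written at its current position, so the writes are sequential, as described above.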