Extracting non-zero records from a binary file

Dear Experts,

I have a binary file which contains multiple records of a fixed size of 31744 bytes.
I need to extract only those records which contain non-zero data.

Sample file could be:

   a6 82 (+31742 bytes)
   a6 00 12 00 (+31740 bytes)
   00 00 (00 31742 times)
   a6 00 12 34 (+31740 bytes)
   00 00 (00 31742 times)
   00 00 (00 31742 times)
   00 00 00 00 00 00
Required output:
   a6 82 (+31742 bytes)
   a6 00 12 00 (+31740 bytes)
   a6 00 12 34 (+31740 bytes)

Thanks,
Dhiraj

Follow these steps:
1) Extract one record with all zeros from the main file and create a new file (i.e. Zeros_File).
2) Run the following:

egrep -v -f Zeros_File Main_File

Hi Shell_Life,

As my file is not a standard text file, I guess this would not work.
I need some way to operate on the binary data.

Thanks,

If you pass the binary file through od you can get an integer value (0-255) for each byte of the file:

od -tu1 -An -w1 your_binary_file
 166
 130
 172
  87
  98
 228
  58
 100
 145
  40
...

It should then be pretty simple to process this output with awk, and use printf("%c", $1) within awk to convert the integer value back to a binary character.

Try using dd to convert the file, e.g....

$ printf "abc\000\000\000de\000" >file1

$ dd if=file1 of=file2 cbs=3 conv=unblock
0+1 records in
0+1 records out
12 bytes (12 B) copied, 0 s, Infinity B/s

... you will now see newlines every 3 bytes...

$ od -hc file1; od -hc file2
0000000    6261    0063    0000    6564    0000
          a   b   c  \0  \0  \0   d   e  \0
0000011
0000000    6261    0a63    0000    0a00    6564    0a00
          a   b   c  \n  \0  \0  \0  \n   d   e  \0  \n
0000014

Use awk to remove null records and dd to convert back again...

$ awk '/[^\000]/' file2 > file3

$ dd if=file3 of=file4 cbs=3 conv=block
0+1 records in
0+1 records out
6 bytes (6 B) copied, 0 s, Infinity B/s

You should see that null records are removed...

$ od -hc file3; od -hc file4
0000000    6261    0a63    6564    0a00
          a   b   c  \n   d   e  \0  \n
0000010
0000000    6261    6463    0065
          a   b   c   d   e  \0
0000006

That's just an example with a block size of 3, you would use cbs=31744


@Ygor, don't forget it's a binary file, so there could be EOF, CR or NUL characters dotted through the blocks of data. These will cause trouble for awk if the file is processed as-is.

Read in chunks with dd, test against a file of the same size full of binary zeroes, print if nonzero.

dd if=/dev/zero of=zero bs=31744 count=1

while dd count=1 bs=31744 > test 2> /dev/null && [ -s test ]
do
        cmp -s test zero || cat test
done < datain > dataout

rm -f test zero

Or if you can use a solution in C:

#include <unistd.h>
#include <string.h>

int main(void)
{
        char buf[31744], zero[31744];
        ssize_t bpos=0;

        memset(zero, 0, sizeof(zero));

        while(1)
        {      // Read in entire chunk
                bpos=0;
                while(bpos < 31744)
                {
                        ssize_t r=read(STDIN_FILENO, buf+bpos, 31744-bpos);
                        if(r <= 0) // End of file
                                return(0);

                        bpos += r;
                }

                // check if zero
                if(memcmp(zero, buf, 31744) == 0) continue;

                bpos=0;
                // write out entire chunk
                while(bpos < 31744)
                {
                        ssize_t w=write(STDOUT_FILENO, buf+bpos, 31744-bpos);
                        if(w <= 0)
                                return(1); // write error

                        bpos += w;
                }
        }
}

Use it like: ./program < infile > outfile


Here is my od-through-awk solution. I used a slightly different approach: count the number of zero bytes in each block and, if the count is less than the block size, output the block.

od -tu1 -An -w1 -v your_binary_file | awk '{
    block[i++]=$1;
    if($1==0) z++;
    if(i==31744) {
        if(z<31744) for(j=0;j<i;j++) printf("%c", block[j]);
        delete block;
        i=z=0;
    }}' > fixed_binary_file

@Corona: nice C solution! But didn't you forget the last chunk?

Here you return before you write out:

Perhaps the outer while loop could have 'r' in the condition, something like this:

ssize_t r = 1;
while(r > 0)
        {      // Read in entire chunk
                bpos=0;
                while(bpos < 31744)
                {
                        r=read(STDIN_FILENO, buf+bpos, 31744-bpos);
                        bpos += r;
                }
               // check if zero...
               // write out entire chunk...
}

Assuming the file's content is an integral number of records, Corona688's solution will encounter the EOF during the first attempt to read in a new record. At that point, bpos is 0 and there's nothing in the buffer that wasn't written out during the previous iteration of the outer while loop.

If, however, the file contains a fractional record, that last fragment would indeed be discarded.

Your proposal would loop endlessly when EOF is encountered (read would return 0, bpos would remain unchanged iteration after iteration, never breaking out of the inner while-read loop). Worse, in the face of repeated I/O errors, read would return -1 and bpos would be decremented by 1 during each iteration, until eventually it indexed a location outside the buffer (triggering a segfault sooner or later ... or worse).
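For what it's worth, one way to get both behaviours (no endless loop at EOF, and a detectable trailing fragment) is to factor the inner loop into a helper that reports how many bytes it actually gathered. The function name and shape below are my own sketch, not code from this thread:

```c
#include <unistd.h>

/* Read up to `want` bytes into `buf`, retrying short reads.  Returns
 * the number of bytes gathered: exactly `want` for a full record,
 * less than `want` if EOF or a read error arrives mid-record.  A
 * zero-byte read() (EOF) breaks out instead of spinning forever, and
 * a -1 return is never added to bpos. */
ssize_t read_record(int fd, char *buf, size_t want)
{
        size_t bpos = 0;

        while (bpos < want) {
                ssize_t r = read(fd, buf + bpos, want - bpos);
                if (r <= 0)     /* 0 = EOF, -1 = error: stop here */
                        break;
                bpos += (size_t)r;
        }
        return (ssize_t)bpos;
}
```

The caller can then treat a return of 0 as a clean EOF, a full `want` as a complete record, and anything in between as a trailing fragment to handle (or report) rather than silently discard.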

However, I will point out a minor copy-paste (I assume) mistake:

Regards,
Alister

---------- Post updated at 08:24 PM ---------- Previous update was at 07:30 PM ----------

Here's my contribution to this charming little problem:

hexdump -ve '31744/1 "%u " "\n"' bin | sed '/^[0 ]*$/d' | tr -s ' ' \\n | awk '{printf("%c", $0)}' > bin.nonull

Regards,
Alister


Thanks for the helpful suggestions ...

If you know C, you can fread each record of 31744 bytes into a buffer until EOF, compare each record against a buffer of zeros, and print out only those that have non-zero data.
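A minimal sketch of that fread-based idea; the function name, the parameterised record size, and the return convention are my own, not from the thread:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy every full `recsize`-byte record from `in` to `out`, skipping
 * records that are entirely zero bytes.  Returns the number of records
 * written, or -1 on allocation or write error.  A trailing fragment
 * shorter than recsize is ignored, as in the earlier solutions. */
long filter_nonzero(FILE *in, FILE *out, size_t recsize)
{
        unsigned char *buf  = malloc(recsize);
        unsigned char *zero = calloc(1, recsize);  /* all-zero reference */
        long written = 0;

        if (buf == NULL || zero == NULL) {
                free(buf);
                free(zero);
                return -1;
        }

        while (fread(buf, 1, recsize, in) == recsize) {
                if (memcmp(buf, zero, recsize) == 0)
                        continue;                  /* skip zero record */
                if (fwrite(buf, 1, recsize, out) != recsize) {
                        written = -1;              /* write error */
                        break;
                }
                written++;
        }

        free(buf);
        free(zero);
        return written;
}
```

Called as `filter_nonzero(stdin, stdout, 31744)` from a small main, it behaves much like the read/write program earlier in the thread.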