Extract sequences of bytes from a binary file for different blocks

Ophiuchus · August 13, 2013, 1:07am

Hello to all,

I would like to search for sequences of bytes inside a big binary file.

The bin file contains blocks of information, and each block is structured as follows:

1- Each block begins with hex 32 (1 byte) and ends with FF. After the FF of the last block comes 33.
2- The next sequence to extract is the correlative number (3 bytes) --> I mean a counter: 1, 2, 3...N
3- The next sequence to extract is the Product Series (8 bytes) --> The first 4 bytes are always "99 11 45 27"
4- The next sequence to extract is the Product Model (8 bytes) --> The first 2 bytes are always "73 49"

There are some other sequences of bytes I need to extract for each block, but first I need somebody to help me
by telling me how to begin doing this for the 4 items mentioned above.

Is it possible to do this in a shell script, combining awk etc., or what would you suggest?

PS: It would be better not to save a hexdump to a text file first, because the binary could be 2 GB. A way
to extract the sequences directly from the binary would be better.

Thanks in advance

A hexdump -C of the sample binary file is below:

31 45 4a 58 58 59 57 31 5f 44 31 32 31 31 33 30
38 30 37 31 33 34 34 06 99 11 45 27 89 34 55 ff
32 00 00 01 99 11 45 27 89 34 55 0f 73 49 45 49
23 2f ff ff 00 15 00 0a 48 00 01 5a 00 02 42 00
01 60 00 01 33 00 01 36 00 01 37 00 01 5b 00 01
7e 00 01 69 00 00 6a 00 00 79 00 00 93 00 01 22
00 00 21 00 01 09 00 01 0a 00 01 26 00 01 02 00
01 04 00 01 05 00 01 06 00 01 10 00 01 08 00 01
2b 00 00 2c 00 01 2d 00 01 2e 00 01 55 00 01 56
00 07 2a 00 00 2f 00 00 30 00 00 31 00 00 ff 34
00 80 09 32 c9 06 88 88 80 00 a0 00 80 09 35 c9
06 00 00 80 00 00 00 80 09 3c c9 06 88 88 80 00
80 00 80 09 43 c9 06 88 88 80 00 80 00 05 82 00
37 06 01 00 00 01 00 65 00 00 00 02 00 00 02 00
18 00 00 00 03 00 00 03 00 17 00 00 00 04 00 00
04 00 01 00 00 00 05 00 00 05 00 15 00 00 00 0a
00 ff ff 00 65 00 00 00 07 80 2e c9 18 05 91 73
49 52 69 53 1f ff ff ff 00 91 73 49 52 69 53 1f
ff ff 00 01 03 ca 03 08 08 fe cb 0a 00 00 00 00
00 00 00 00 00 00 cc 01 01 81 1b c9 0b 00 91 73
49 52 69 56 7f ff ff ff ca 06 00 00 00 00 00 00
cb 01 03 cc 01 01 ff 32 00 00 02 99 11 45 27 89
34 55 1f 73 49 45 54 76 8f ff ff 00 15 00 0a 48
00 01 5a 00 02 42 00 01 60 00 01 33 00 01 36 00
01 37 00 01 5b 00 01 66 00 01 65 00 01 77 00 01
78 00 01 7e 00 01 69 00 00 6a 00 00 79 00 00 93
00 01 22 00 00 21 00 01 09 00 01 0a 00 01 26 00
01 02 00 01 04 00 01 05 00 01 06 00 01 10 00 01
08 00 01 2b 00 00 2c 00 01 2d 00 01 2e 00 01 55
00 01 56 00 07 2a 00 00 2f 00 00 30 00 00 31 00
00 ff 34 00 80 09 32 c9 06 88 88 80 00 a0 00 80
09 35 c9 06 00 00 80 00 00 00 80 09 3c c9 06 88
88 80 00 80 00 80 09 43 c9 06 88 88 80 00 80 00
03 80 0f 01 02 00 00 00 30 73 49 52 69 05 ff ff
ff 00 81 0f 01 02 00 00 01 3a 73 49 52 69 55 9f
ff ff 00 83 10 01 0c 00 00 00 9f 73 49 52 69 05
ff ff ff 01 01 86 0f 01 0e 00 00 00 eb 73 49 52
69 59 6f ff ff 00 87 0f 01 01 00 06 f6 99 73 49
52 69 56 3f ff ff 00 84 0e 00 01 00 00 01 00 01
00 ff ff 00 00 01 01 85 06 00 03 79 00 01 ea 05
82 00 37 06 01 00 00 01 00 65 00 00 00 02 00 00
02 00 18 00 00 00 03 00 00 03 00 17 00 00 00 04
00 00 04 00 01 00 00 00 05 00 00 05 00 15 00 00
00 0a 00 ff ff 00 65 00 00 00 07 80 2e c9 18 00
91 73 49 52 69 53 9f ff ff ff 00 91 73 49 52 69
53 9f ff ff 00 01 03 ca 03 08 08 fe cb 0a 00 00
00 00 00 00 00 00 00 00 cc 01 01 81 1b c9 0b 00
91 73 49 52 69 56 7f ff ff ff ca 06 00 00 00 00
00 00 cb 01 03 cc 01 01 ff 33 31 33 30 38 30 37
31 33 34 34 30 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Jotne · August 13, 2013, 2:33am

Here is something that will work with the hexdump; I have no idea how to do it on a binary file.

awk '{$1=$1} {for (i=1;i<=NF;i++) {if (($i" "$(i+1)" "$(i+5)" "$(i+6)" "$(i+7)" "$(i+8)" "$(i+13)" "$(i+14))=="ff 32 99 11 45 27 73 49") {f=i+1;print "field="f,"row="int(f/16+1),"column="(f/16-int(f/16))*16} }}' RS="" hexdump
field=33 row=3 column=1
field=344 row=22 column=8

awk '
	{$1=$1} 
	{for (i=1;i<=NF;i++) 
		{if (($i" "$(i+1)" "$(i+5)" "$(i+6)" "$(i+7)" "$(i+8)" "$(i+13)" "$(i+14))=="ff 32 99 11 45 27 73 49") 
			{f=i+1;print "field="f,"row="int(f/16+1),"column="(f/16-int(f/16))*16}
		}
	}' RS="" hexdump

wisecracker · August 13, 2013, 2:33am

Depending on the size of the file see if this idea will help you:-

If you intend to attempt to put the binary values into a _string_variable_ then 0, (zero), is not possible directly under bash, all other values are possible. You will have to detect the 0's and slot in "\0" instead.

So from the 256 bytes of DEMO data in the pointer above only 255, (1 to 255), can be placed into a _variable_. It is easy to add 2 more bytes to represent a 0 as mentioned above but makes the DEMO string 257 bytes in size...

However transferring to another binary file is easy as shown in the DEMO...

Hope this will help you...

Ophiuchus · August 13, 2013, 2:02pm

Hello to all,

I think an option that extracts the sequences directly from the binary file would be better and faster,
but I'm not sure whether that is possible with bash or Perl or another option you could suggest.

Hello Jotne,

Thanks for your help. Your script detects the position of the sequences, but I would like to
extract those sequences to a new file, so that the output file has one line with the information of each block
in the binary.

Hello wisecracker,
I'll check the option you mention, but is it possible with your script to extract the complete byte sequence?

Thanks again for the help.

wisecracker · August 13, 2013, 5:17pm

Python 3.x.x can easily handle binary and is fast at manipulating huge files.
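
As a rough illustration of that point, here is a minimal sketch, assuming Python 3, that scans the file in place with the standard mmap module instead of hexdumping it first. The function name find_block_offsets and the constant names are made up for this example. The offsets +4 and +12 come from the block layout in the first post (1 start byte, 3 counter bytes, 8 Product Series bytes, then the Product Model).

```python
import mmap

SERIES_PREFIX = bytes.fromhex("99114527")  # first 4 bytes of Product Series
MODEL_PREFIX = bytes.fromhex("7349")       # first 2 bytes of Product Model


def find_block_offsets(path):
    """Return the file offsets of every 0x32 byte that starts a block header."""
    offsets = []
    # mmap lets the OS page the file in on demand, so a 2 GB file is not
    # loaded into memory at once (note: mmap refuses an empty file).
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b"\x32")
        while pos != -1:
            # a real header has the fixed prefixes at +4 and +12
            if (mm[pos + 4:pos + 8] == SERIES_PREFIX
                    and mm[pos + 12:pos + 14] == MODEL_PREFIX):
                offsets.append(pos)
            pos = mm.find(b"\x32", pos + 1)
    return offsets
```

Calling find_block_offsets("binfile.txt") on the attached sample should give one offset per block; each 20-byte header can then be sliced out starting at that offset.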

Ophiuchus · August 14, 2013, 1:46am

Thanks wisecracker.

Do you know some Python, and could you maybe show me an example of how to extract those byte sequences using this language?

Thanks in advance.

ahamed101 · August 14, 2013, 1:50am

Can you upload a reduced version of this binary with at least one block?

--ahamed

Ophiuchus · August 14, 2013, 2:05am

Hello ahamed101,

Attached is the sample file corresponding to the hexdump in my first post; it has 2 blocks.

PS: In order to be able to upload the file, I needed to add a "txt" extension.

Thanks in advance for the help.

wisecracker · August 14, 2013, 2:52am

Yes I know a little Python but you must at least have an attempt yourself.
What have you tried so far?

What scripting, shell or otherwise, have you done?

Why does your code not work?

What error reports do you get?

Which shell are you using?

What HW and OS is this running on?

I have given a pointer and here is a new one which I generated yesterday after my post 3 on this thread:-

LBNL, I am sure you have posted about this before on here some weeks ago!

EDIT:

How about dumping the binary file as a HEX string array, searching the array for a pattern, finding your wanted end point and re-converting back to a binary file again?
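
A minimal sketch of that hex-string idea, assuming Python 3.5 or later (for bytes.hex()); the helper name find_hex_pattern is invented for this example. One caveat the sketch handles: in a hex string a pattern can match across two half-bytes (for instance "ff34" inside the bytes 0f f3 40), so only matches at even string positions are real. The hex string is also twice the size of the data, so for a 2 GB file this would need to be done piece by piece.

```python
def find_hex_pattern(data, pattern_hex):
    """Return byte offsets where pattern_hex occurs in data, via a hex string."""
    hex_text = data.hex()                      # e.g. b"\xff\x34" -> "ff34"
    pattern_hex = pattern_hex.replace(" ", "").lower()
    offsets = []
    idx = hex_text.find(pattern_hex)
    while idx != -1:
        if idx % 2 == 0:                       # keep only byte-aligned matches
            offsets.append(idx // 2)           # two hex digits per byte
        idx = hex_text.find(pattern_hex, idx + 1)
    return offsets


def hex_to_binary(hex_text):
    """Convert a hex string (or a slice of one) back into bytes."""
    return bytes.fromhex(hex_text)
```

Searching the bytes object directly with data.find(b"\xff\x34") gives the same offsets without the conversion, which is probably the simpler route in Python.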

Ophiuchus · August 15, 2013, 12:52am

Hello,

I've helped several times in this forum before, and I know that if somebody wants to help, they help. There is no forum rule that says you must make attempts before posting, because sometimes people just don't know where to begin. I'm asking for help here (not a complete solution) because I don't know how to do this in Python or any other language.

My only idea is to extract the byte sequences by searching with regular expressions, because the sequences are not always in the same position, but I don't know
which language would be easier, faster or better, or how to begin.

I'm using Ubuntu or Windows, but I'm asking for help and suggestions in bash if possible, or in Perl, Python, C or any language that can handle binaries and is able
to extract the byte sequences I mentioned.

Maybe someone who knows how to do it in any language could give some examples to follow, and I could continue by myself.

I posted before, but now I'm asking with another approach in mind, still looking for a way to extract the info by reading the binary directly
without converting it to text.

Thanks in advance for any help.

ahamed101 · August 15, 2013, 3:30am

Is the number of bytes between the start (0x32) and end (0xff 0x33) constant?
Also, what's the significance of the blocks? Can the requested data also be present outside a block, in which case you are not interested in it?
Are there multiple instances of the required data within a block?

I can provide a solution in C. I am not good at Python.

--ahamed

Ophiuchus · August 15, 2013, 5:49am

Hello ahamed,

The number of bytes between the start (0x32) and end (0xff 0x33) is not constant.

The requested data is only inside each block.

I'm interested in:
1- The sequence in color immediately after the beginning of each block, I mean after each 0x32 marked in red in the attached image. These sequences always occur exactly once in each block.

2- Some of the sequences after the FF 34 in each block. These sequences are not always present, but when they are, they occur only once in each block. For example, in the sample file, the sequences after FF 34 only appear in block #2.

But maybe for now you can help with the sequences of item "1", and after that
I'll try to replicate the logic you use for the sequences of item "2", or ask again in order to complete the 2nd item.

Thanks in advance for any help.

Regards

ahamed101 · August 15, 2013, 6:16am

Well, there is nothing that marks the end of each block. 0xff 0x33 represents the end of file, is that right?

The following code extracts the data you want, but there is no check for the end of a block as I am still confused. Maybe a larger file with expected output can clarify it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define err(x) {printf("\nError: %s... Exiting...\n", x); exit(1);}

static unsigned char pat1[] = {0x99, 0x11, 0x45, 0x27};
static unsigned char pat2[] = {0x73, 0x49};

int main(int argc, char **argv)
{

        if(argc < 2)
                err("File name missing");

        unsigned char buf[32];
        unsigned char *ptr = buf;
        long pos = 0;

        FILE *fp = fopen(argv[1], "rb");
        if(!fp) err("Unable to open the file");

        /* read one byte at a time until fread() fails at end of file */
        while(1 == fread(ptr, sizeof(char), 1, fp)){
                pos = ftell(fp);
                if(buf[0] == 0x32){
                        /* candidate block start: read the 19 bytes that follow */
                        if(19 != fread(ptr+1, sizeof(char), 19, fp))
                                break; /* fewer than 20 bytes left in the file */
                        /* memcmp() returns 0 on a match; skip if either pattern differs */
                        if(memcmp(buf+4, pat1, 4) || memcmp(buf+12, pat2, 2)){
                                /* not a block header: resume at the byte after this 0x32 */
                                fseek(fp, pos, SEEK_SET);
                                continue;
                        }else{
                                int i=0;
                                for(i=0;i<=19;i++) printf("%02x ", buf[i]);
                                printf("\n");
                        }
                }
        }
        fclose(fp);
        return 0;
}

user@Imperfecto_:~$ gcc extract.c -o extract 
user@Imperfecto_:~$ ./extract binfile.txt 
32 00 00 01 99 11 45 27 89 34 55 0f 73 49 45 49 23 2f ff ff 
32 00 00 02 99 11 45 27 89 34 55 1f 73 49 45 54 76 8f ff ff 
user@Imperfecto_:~$

--ahamed

---------- Post updated at 03:16 AM ---------- Previous update was at 03:13 AM ----------

Or is it that once we encounter 0xff 0x33, we should stop?

--ahamed

Ophiuchus · August 15, 2013, 3:05pm

Hello ahamed,

Thank you for your help!! I'll certainly try your code to begin with.

And yes, FF 33 is the end of the file. After the 33 follow some bytes that represent the date and hour, which are not of interest. 0x33 is ISO coded, so in ASCII it is the digit 3.

For more details, below is the main structure I mentioned in my first post:

1- Each block begins with hex 32 (1 byte) and ends with FF. After the FF of the last block comes 33.
2- The next sequence to extract is the correlative number (3 bytes) --> I mean a counter: 1, 2, 3...N
3- The next sequence to extract is the Product Series (8 bytes) --> The first 4 bytes are always "99 11 45 27"
4- The next sequence to extract is the Product Model (8 bytes) --> The first 2 bytes are always "73 49"
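
For readers following along in Python, the four items above could be sketched roughly as below. This is only an illustration: the names Header and parse_header are invented here, and reading the 3-byte correlative as a big-endian number is an assumption based on the sample values 00 00 01 and 00 00 02.

```python
from collections import namedtuple

Header = namedtuple("Header", "correlative series model")


def parse_header(buf):
    """Split a 20-byte block header into its three fields, or return None."""
    if len(buf) < 20 or buf[0] != 0x32:
        return None
    series, model = buf[4:12], buf[12:20]       # 8 bytes each
    if not series.startswith(b"\x99\x11\x45\x27"):
        return None
    if not model.startswith(b"\x73\x49"):
        return None
    # assumption: the 3-byte counter is stored most significant byte first
    correlative = int.from_bytes(buf[1:4], "big")
    return Header(correlative, series.hex(), model.hex())
```

Printing ",".join(map(str, parse_header(buf))) would then give one comma-separated line per block.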

Thank you for your help ahamed.

---------- Post updated at 03:05 PM ---------- Previous update was at 01:07 PM ----------

Hello again ahamed,

It works nice!

Now for each block I'm trying to extract (if present) the bytes after the FF 34 that begin with 0x03 followed by 0x80 or 0x81 or 0x83 or 0x86 or 0x87, plus 16 more bytes, as shown in the image attached in my previous post.

I've added a new line as below:

static unsigned char pat3[] = {0x03, 0x8};

But how do I include it in the "if" statement and extract those bytes only when the 0x03 0x8Z (where Z could be 0,1,3,6,7) appears after the occurrence of 0xFF 0x34?

For each block I'd like to have one line in the output file.

Thanks in advance again.

wisecracker · August 15, 2013, 3:22pm

You mentioned Ubuntu and Windows so a _bash_ script is not of much use here. However Python is platform independent. The problem is that I am not sure whether Ubuntu has a Python install. A default Windows install certainly does not have a Python installation.

From a Ubuntu terminal enter "python" and see what comes up...

If you have got it, post the version that is installed on your Ubuntu setup.

You will have to install the same version onto your Windows machine.

If neither have an install then I suggest you install the latest stable Python release which is version 3.3.2, I think...

Python 3.3.2 Release

I will generate a simple starter piece of code in the next couple of days and assume that Ubuntu has at least version 3.0.x, when I can get enough free time. I am not usually anywhere near a computer during the working day...

Ophiuchus · August 15, 2013, 6:43pm

Hello wisecracker,

Thanks for the help.

I've installed Python 3.3.2 on the Windows machine, but when I was trying to test some
simple code to get familiar with it, even just to get the current path, I got a syntax error. I'm not sure why.

I've used:

import os 
os.getcwd()

ahamed101 · August 16, 2013, 2:14am

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define err(x) {printf("\nError: %s... Exiting...\n", x); exit(1);}

static unsigned char pat1[] = {0x99, 0x11, 0x45, 0x27};
static unsigned char pat2[] = {0x73, 0x49};
static unsigned char pat3[] = {0xff, 0x34};
static unsigned char intrim_pat1[][2] = { {0x03, 0x80}, {0x03, 0x81}, {0x03, 0x83}, {0x03, 0x86}, {0x03, 0x87} };
static unsigned char end[] = {0xff, 0x33};

//print bytes ptr[0]..ptr[len] (len+1 bytes) as hex on one line
void print_data(const unsigned char *ptr, int len)
{
        int i;
        for(i=0;i<=len;i++)
                printf("%02x ", ptr[i]);
        printf("\n");
        return;
}

int main(int argc, char **argv)
{
        if(argc < 2)
                err("File name missing");

        char found = 0, more = 0, matched;
        unsigned char buf[32];
        unsigned char *ptr = buf;
        long pos = 0;
        int i;
        int arr_size = (sizeof(intrim_pat1)/2);

        FILE *fp = fopen(argv[1], "rb");
        if(!fp) err("Unable to open the file");

        while(2 == fread(ptr, sizeof(char), 2, fp)){
                pos = ftell(fp);

                //check for end of file pattern
                if(found && !memcmp(buf, end, 2)){
                        found=0; //start over or stop??
                        continue;
                }

                //check for 0xff 0x34
                if(found && !(memcmp(buf, pat3, 2))){
                        more = 1;
                        continue;
                }

                //after 0xff 0x34, look for 0x03 followed by one of the 0x8Z bytes
                if(found && more){
                        matched = 0;
                        for(i=0; i< arr_size; i++){
                                if(!memcmp(buf, intrim_pat1[i], 2)){
                                        if(15 != fread(ptr+2, sizeof(char), 15, fp))
                                                err("Insufficient data");
                                        print_data(ptr, 16);
                                        more=0;
                                        matched=1;
                                        break;
                                }
                        }
                        if(matched)
                                continue;// start with the next byte
                }

                if(buf[0] == 0x32){
                        if(18 != fread(ptr+2, sizeof(char), 18, fp))
                                err("Insufficient data");
                        //memcmp() returns 0 on a match, so both must be 0 for a header
                        if(!memcmp(buf+4, pat1, 4) && !memcmp(buf+12, pat2, 2)){
                                found = 1; //found the starting of the block with data
                                more = 0;  //0xff 0x34 not seen yet in this block
                                print_data(ptr, 19);
                                continue;
                        }
                        //not a block header: fall through and retry from the next byte
                }
                pos--;
                if(fseek(fp, pos, SEEK_SET))
                        err("Error in seeking");
        }

        fclose(fp);
        return 0;
}

--ahamed

Ophiuchus · August 16, 2013, 12:49pm

Hello ahamed,

Thank you for your great help!

I've tested your new code and it prints the sequence "03 80 ...", but it is not printing the
other sequences after the 03 80, I mean the sequences that begin with 81, 83, 86, 87. I would like to extract
all the sequences of 17 bytes after 0x03 that begin with 0x80 or 0x81 or 0x83 or 0x84 or 0x86 or 0x87 if they
are present, and print all the sequences for each block on the same line, if that is not too complicated.

In summary:

The 0x03 is the byte that marks the beginning of a certain kind of data.
If 0x03 is present after 0xFF 0x34, then 0x03 could be immediately followed by any of the sequences that begin with
0x80 or 0x81 or 0x83 or 0x84 or 0x86 or 0x87, because these sequences are not always all present.

Sometimes all of the sequences that begin with 0x80 or 0x81 or 0x83 or 0x84 or 0x86 or 0x87 are present, sometimes 3,
2 or only one of them.

So, after the 0xFF 0x34 several cases could occur; some examples below:
0x03 0x80.... 0x86...
or
0x03 0x83.... 0x84... 0x87
or
0x03 0x87
or
0x03 0x81.... 0x87

Maybe you could explain the logic of your code and the functions used a little, for example "memcmp", so that
I can modify it or add new rules if I need to extract something else, change the printing order,
or print the bytes without spaces, separating the different sequences with commas.

Thanks in advance for your time and help again.

Regards

ahamed101 · August 17, 2013, 2:31pm

Well, you said 0x80, 0x81 etc. will be preceded by 0x03, and I don't see that pattern. Only 0x80 is preceded by 0x03, and hence only it is printed.

--ahamed

Ophiuchus · August 17, 2013, 5:15pm

Hello ahamed,

Sorry, maybe I didn't explain myself very well.

The sequence after FF 34, if present, is 0x03, and 0x03 could be followed by 0x80 or 0x81.... etc. That is why I wrote 0x8Z, where Z=0,1,3,4,6 or 7.

But regardless of which byte appears after 0x03, the byte 0x03 will only appear once, to mark the beginning of this sub-block of sequences.

So, if I name the sequences as follows:
Z1=80 B1 B2 ... B16
Z2=81 B1 B2 ... B16
Z3=83 B1 B2 ... B16
.
.
Z6=87 B1 B2 ... B16

Then, if 0x03 is present, the "sub-block" could contain:
0x03 Z1 Z2 Z3
or
0x03 Z2 Z3 Z6
or
0x03 Z1 Z2 Z3 Z4 Z5 Z6
or only one sequence like
0x03 Zx

But I would like to extract all the sequences in the sub-block, regardless of whether
it has all the sequences Z1 to Z6 or only 1 sequence Zx.

I hope it is not too complicated and you can help me.

Thanks in advance for all the help.
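
One possible way to walk such a sub-block is sketched below in Python 3. It rests on an assumption the thread does not confirm: in the sample dump, the byte after each tag (0f, 0f, 10, 0f, 0f, 0e for tags 80, 81, 83, 86, 87, 84) equals the number of bytes up to the next tag, so the sketch treats it as a length byte instead of hard-coding 17 bytes (the 83 sequence in the sample is 18 bytes long). The names WANTED_TAGS and extract_subblock are made up for this example.

```python
WANTED_TAGS = {0x80, 0x81, 0x83, 0x84, 0x86, 0x87}


def extract_subblock(block, wanted=WANTED_TAGS):
    """Return the sequences that follow 0x03 after FF 34, as hex strings."""
    start = block.find(b"\xff\x34")
    if start == -1:
        return []
    # find the first 0x03 that is immediately followed by a wanted tag
    i = block.find(b"\x03", start + 2)
    while i != -1 and not (i + 1 < len(block) and block[i + 1] in wanted):
        i = block.find(b"\x03", i + 1)
    if i == -1:
        return []
    i += 1                               # step over the 0x03 marker
    records = []
    while i + 1 < len(block) and block[i] in wanted:
        length = block[i + 1]            # assumption: second byte is a length
        end = i + 2 + length
        if end > len(block):
            break                        # truncated sequence: stop, do not guess
        records.append(block[i:end].hex())
        i = end                          # the next tag starts right after
    return records
```

With block holding the bytes of one block, print(",".join(extract_subblock(block))) would put all of its sequences on one line, without spaces and separated by commas. If the sequences really are always 17 bytes, replacing the length lookup with a constant 15 would match that description instead.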