Script to match strings that are sometimes split across 2 lines

Hello to all,

I have output in hexdump -C format, as below:

31 54 47 55 48 4c 52 31 5f 52 31 32 31 31 32 ff
44 00 00 0E 01 32 14 56 42 17 47 48 0f ff ff ff
44 00 00 01 32 14 56 00 23 83 95 2f 42 17 47 48
00 0f ff ff 00 15 00 0a 48 00 01 5a 00 02 17 00
00 2f 00 00 30 00 00 31 00 00 ff 34 ff 44 00 00
02 32 14 56 01 75 70 69 8f 42 17 47 48 00 1f ff
ff 00 15 00 0c 48 00 01 5a 00 0a 17 00 01 18 00
01 42 00 01 60 00 01 36 00 01 37 00 01 7e ff 44
00 00 2B 32 14 56 79 00 00 94 00 01 93 00 01 22
00 00 21 00 00 09 00 01 0a 00 01 10 00 01 08 00

I want to do a kind of grep to get the strings that follow the pattern:

ff 44 + (3 or 4 bytes) + 32 14 56 + (5 bytes)

I'm able to match that pattern in a text editor with the following regex:

FF( |\n)44(( |\n)[0-9a-f]{2}){3,4}( |\n)32( |\n)14( |\n)56(( |\n)[0-9a-f]{2}){5}

But I don't know how to use this regex in a bash script (using awk, sed, etc.) to get the output below. The issue I see with awk is that the pattern is not always on a single line: it can begin on one line and end on the next.

Desired output:

ff 44 00 00 0E 01 32 14 56 42 17 47 48 0f
ff 44 00 00 01 32 14 56 00 23 83 95 2f
ff 44 00 00 02 32 14 56 01 75 70 69 8f
ff 44 00 00 2B 32 14 56 79 00 00 94 00

Could somebody help with this?

Thanks in advance.

Hi, try:

awk -v pat="ff 44( [^ ]{2}){3,4} 32 14 56( [^ ]{2}){5}" '{b=p $0} match(b,pat) {print substr(b,RSTART,RLENGTH); sub(pat,x)} {p=$0 FS}' file
ff 44 00 00 0E 01 32 14 56 42 17 47 48 0f
ff 44 00 00 01 32 14 56 00 23 83 95 2f
ff 44 00 00 02 32 14 56 01 75 70 69 8f
ff 44 00 00 2B 32 14 56 79 00 00 94 00

Only a space (no newline) is used in the matching regex, because a space is used to glue two lines together in the variable b. This assumes the expression never spans more than two lines and that there are never two matches on a single line. The substitution (sub()) is necessary for cases where a match occurs entirely within a single line: it removes the match from $0, so that it is not found again when that line is glued to the next one.

It is a bit much for a single line, so this may be more readable:

awk -v pat="ff 44( [^ ]{2}){3,4} 32 14 56( [^ ]{2}){5}" '
  {
    b=p $0
  }
  match(b,pat) {
    print substr(b,RSTART,RLENGTH)
    sub(pat,x)
  }
  {
    p=$0 FS
  }
' file

This should work with regular awk. With gawk version 3 or older, try gawk --re-interval. It will not work with mawk, because mawk cannot do {m,n} interval expressions; in that case the regex would need to be expanded.
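For an awk without interval support, the expansion could look something like the sketch below. It is the same two-line buffer as above; only the regex is built differently. The function name match_ff44 and the variable B are just names picked for this example, and {3,4} is written as three mandatory bytes plus one optional byte.

```shell
# Sketch: same two-line buffer as above, with the {3,4} and {5}
# intervals written out by hand, so no interval support is needed.
# match_ff44 and B are illustrative names, not part of the original script.
match_ff44() {
  awk '
    BEGIN {
      B = " [^ ][^ ]"       # one byte: a space followed by two non-space characters
      # ff 44 + 3 bytes + optional 4th byte + 32 14 56 + 5 bytes
      pat = "ff 44" B B B "(" B ")?" " 32 14 56" B B B B B
    }
    { b = p $0 }            # previous line glued to the current line
    match(b, pat) {
      print substr(b, RSTART, RLENGTH)
      sub(pat, x)           # x is unset (empty): remove the match from $0
    }
    { p = $0 FS }           # remember this line, plus a space, for the next round
  ' "$@"
}
```

Call it as match_ff44 file, or pipe the dump into it. Building the pattern by string concatenation keeps the repetition counts visible, which is easier to check than one long literal.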

Hello Scrutinizer,

Thank you, it works just fine.

I'm trying to modify your code in order to group the bytes as follows:

ff 44 00000E01 321456421747480f
ff 44 000001   321456002383952f
ff 44 000002   321456017570698f
ff 44 00002B   3214567900009400	

But when I try to separate them with substr(b,1,2) substr(3,4) ... the pieces are printed in a different order, and I don't yet understand how the data is stored in b.

The objective of the grouping is to print the 3rd column as decimal, as below:

ff 44 3585 	321456421747480f
ff 44 1 	321456002383952f
ff 44 2 	321456017570698f
ff 44 43	3214567900009400

How can I do this?

Thanks again.

You would need a bit of additional substring wrestling, try something like:

awk -v pat="ff 44( [^ ]{2}){3,4} 32 14 56( [^ ]{2}){5}" '
  {
    b=p $0
  } 
  match(b,pat) {
    s=substr(b,RSTART,RLENGTH)
    gsub(FS,x,s)
    l1=length(s)-20
    f1=substr(s,5,l1)
    f2=substr(s,5+l1)
    printf "ff 44 %-8d%s\n","0x" f1,f2
    sub(pat,x)
  }
  {
    p=$0 FS
  }
' file

or remove those spaces beforehand:

awk -v pat="ff44.{6}(..)?321456.{10}" '
  {
    gsub(FS,x)
    b=p $0
  } 
  match(b,pat) {
    s=substr(b,RSTART,RLENGTH)
    l1=length(s)-20
    f1=substr(s,5,l1)
    f2=substr(s,5+l1)
    printf "ff 44 %-8d%s\n","0x" f1,f2
    sub(pat,x)
  }
  {
    p=$0
  }
' file

Hello Scrutinizer,

Thanks for your time!

I've tried it, but it prints only 0's.

ff 44 0       321456421747480f
ff 44 0       321456002383952f
ff 44 0       321456017570698f
ff 44 0       3214567900009400
  • Hi, did you prepend the first value with "0x"?
  • Otherwise, what does awk 'BEGIN{printf "%-8d\n", "0x" "00000E01"}' produce?
  • What is your OS and version?

Hello Scrutinizer,

Produces "0". But strtonum produces correct output.

$ awk 'BEGIN{printf "%8-d\n", "%0x" "00000E01"}'
0
$ awk 'BEGIN{print strtonum(0xE01)}'
3585

I'm using Cygwin, latest version on Windows.

CYGWIN_NT-6.1-WOW64 2013-08-15 11:55 i686 Cygwin

OK, you are using GNU awk. With gawk you need to use another command line option: --non-decimal-data

gawk --re-interval --non-decimal-data -v ...

It is probably easiest to use:

gawk --posix -v ...

Or use a gawk-only function like you suggested...
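If you would rather not depend on command line options or on gawk-only functions, a small conversion function is another option. This is only a sketch: hex2dec is a name made up for this example (it is not a built-in), and it handles plain non-negative hex strings with an optional 0x prefix.

```shell
# Sketch: portable hex-to-decimal conversion in plain awk.
# hex2dec is a hypothetical helper name, not an awk built-in.
hex2dec() {
  awk -v h="$1" '
    function hex2dec(s,    i, n, d) {
      s = tolower(s)
      sub(/^0x/, "", s)                  # drop an optional 0x prefix
      n = 0
      for (i = 1; i <= length(s); i++) {
        # position in the digit string, minus 1, is the digit value
        d = index("0123456789abcdef", substr(s, i, 1)) - 1
        if (d < 0) return -1             # not a hex digit
        n = n * 16 + d
      }
      return n
    }
    BEGIN { print hex2dec(h) }
  '
}

hex2dec 00000E01    # prints 3585
```

Inside the main script the same function could be applied to f1, for example printf "ff 44 %-8d%s\n", hex2dec(f1), f2.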

Hello Scrutinizer,

Thank you, I'll try the decimal conversion.

Now I'm trying to put 2 variables in the same awk script to match another pattern. It seems to work, but the match for the second pattern is printed on the next line.

awk -v pat1="ff 44( [^ ]{2}){3,4} 32 14 56( [^ ]{2}){5}" -v pat2="pattern2" '
  {  b=p $0 ;c=p $0 }
  match(b,pat1) {  print substr(b,RSTART,RLENGTH) sub(pat1,x)  }
  match(c,pat2) {  print substr(c,RSTART,RLENGTH) sub(pat2,z)  }
  {    p=$0 FS  }' file

How can I print the matches from b and c on the same line?

Thanks again.

You could try

  match(b,pat1) {s=substr(b,RSTART,RLENGTH); sub(pat1,x) }
  match(c,pat2) {s=s (s!=""?OFS:"") substr(c,RSTART,RLENGTH); sub(pat2,z)  }
  s!="" {print s; s=""}

Hello Scrutinizer,

Thanks, it works that way.

pat2 could span more than one line, so in order to use your code I changed the dump from 32 elements per line to 64 with the xxd command, and without spaces.

Your first code still works, but I don't know why it is not printing all the patterns even though they are in the file. I got this output:

$ awk -v pat="ff44.{6,18}321456.{5}" '
  {    b=p $0  }
  match(b,pat) {    print substr(b,RSTART,RLENGTH); sub(pat,x)
  }
  {    p=$0 FS  }' file2.txt
ff4400000232145601757
ff4400000332145601766
ff4400000432145601789

But the correct output should be (the patterns that have the sequence 01 and 05 are not being printed):

ff4400000132145600238
ff4400000232145601757
ff4400000332145601766
ff4400000432145601789
ff4400000532145601788

The input file is:

97444444444c52975f529744979744303730359797333003504947480fffffff
44000001321456002383952f50494748000fffff0015000a4800015a00021700
024200016000013300013600013700017e00016900006a000079000094000193
00012200002100010900010a00012600016b00016c00006d0000020001040001
0500010600011000010800012b00002c00002d00002e00005500005600072a00
002f0000300000970000ff34ff44000002321456017570698f50494748001fff
ff0015000c4800015a000a1700011800014200016000013600013700017e0001
6900006a00007900009400019300012200002100000900010a00011000010800
012b00002c00002d00002e00005500005600072a00002f0000300000970000ff
34ff44000003321456017664454f50494748003fffff0015000c4800015a0017
1700011800014200016000013600017e00016900017900009400019300012200
002100000900010a00011000010800012b00002c00002d00002e000055000056
00072a00002f0000300000970000ff34ff44000004321456017890003f504947
48004fffff0015000c4800015a000a1700011800014200016000013600017e00
016900017900009400019300012200002100000900010a00011000010800012b
00002c00002d00002e00005500005600072a00002f0000300000970000ff34ff
44000005321456017887761f50494748005fffff0015000a4800015a000b4200
016000013300013600013700016600016500017700016900006a000079000094
00019300012200002100010900010a00012600011000010800012b00002c0000
2d00002e00005500005600072a00002f0000300000970000ff3403810f010200
00000d5049526905ffffff008970010c0000000d5049526905ffffff0101860f
010c0000000d50495269559fffff00840e01020102010001ffffff0201020185

Thanks in advance for the help.

Hi, it should be p=$0 instead of p=$0 FS, since the lines no longer contain spaces.

--
Note that this is different from your original specification: given the line lengths, a pattern could now occur more than once on the same line. If this is a possibility, the code would need to be adjusted.
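One way to adjust it is sketched below. Note that this swaps the two-line buffer for a different technique: the whole input is glued into one string and match() is called in a loop, so neither the "at most two lines" nor the "one match per line" assumption is needed, at the cost of holding the file in memory. The names all_matches and rep are made up for this example, and the intervals are expanded with rep() so that it also runs on an awk without {m,n} support.

```shell
# Sketch: collect every match, however many per line and however
# many lines a match spans. all_matches and rep are illustrative names.
all_matches() {
  awk '
    # rep(s, n): s repeated n times (r starts out empty)
    function rep(s, n,    r) { while (n-- > 0) r = r s; return r }
    BEGIN {
      # equivalent to ff44.{6,18}321456.{5} without interval expressions
      pat = "ff44" rep(".", 6) rep(".?", 12) "321456" rep(".", 5)
    }
    { buf = buf $0 }                          # glue lines with no separator
    END {
      while (match(buf, pat)) {
        print substr(buf, RSTART, RLENGTH)
        buf = substr(buf, RSTART + RLENGTH)   # continue after this match
      }
    }
  ' "$@"
}
```

Usage would be all_matches file2.txt. For very large dumps the two-line buffer is lighter on memory, so this is a trade-off rather than a drop-in improvement.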

Hello Scrutinizer again.

It works as you said when the file has 128 characters per line, but I'm not sure why it doesn't print when each line has 256 characters.

Could you please show me an example of how to separate the content of each variable with a space or a comma? For instance:

  • for variable b print characters 1 to 2, 3 to 4, 5 to 11, 12 to 27, 28 to 43.
    and between each range a space.
  • for variable c print characters 1 to 5, 6 to 12

If it is possible to assign those sub-ranges to new variables, that would be even better. I would like to do this for presentation purposes, and to be able in the future to print, for example, one or more of those ranges in decimal or in another format.

Thanks for all help so far.

Regards

You would need to use substr(s,pos,1) for every character. Example:

$ awk 'BEGIN{s="abcdefg"; print substr(s,4,1), substr(s,2,1)}'
d b

I guess one could create a function:

$ awk 'function char(s,c){return substr(s,c,1)} BEGIN{s="abcdefg"; print char(s,4), char(s,2)}'
d b

Some awks have a special case where, by using an empty string as field separator, a string can be separated into fields that contain a single character:

$ awk 'BEGIN{s="abcdefg"; split(s,F,x); print F[4], F[2]}'
d b

The latter, however, is not portable.
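For whole ranges rather than single characters, substr(s,start,length) does the same job. A minimal sketch, using made-up sample strings and a hypothetical helper range() that takes "from" and "to" positions like the ones listed in the question:

```shell
# Sketch: cut a string into position ranges and keep each piece
# in its own variable. range() and ranges_demo are illustrative names.
ranges_demo() {
  awk '
    # characters from..to (1-based, inclusive)
    function range(s, from, to) { return substr(s, from, to - from + 1) }
    BEGIN {
      b = "abcdefghijklmnop"
      b1 = range(b, 1, 2); b2 = range(b, 3, 4); b3 = range(b, 5, 11)
      print b1, b2, b3                          # space-separated
      c = "abcdefghijkl"
      print range(c, 1, 5) "," range(c, 6, 12)  # comma-separated
    }
  '
}

ranges_demo
```

This prints "ab cd efghijk" and then "abcde,fghijkl". Once the pieces are in variables such as b1 or b3, each one can be formatted separately, for example as decimal.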

Hello Scrutinizer,

Many thanks for your help! I'll practice with substr() in order to manipulate the output.

How many variables does awk support, if I want to include more patterns?

Thanks in advance.

There is no limit on the number of variables...

Hello Scrutinizer,

Thanks for your help.

I have an issue. My input file is a hexdump with 128 bytes per line and no spaces, which I'm attaching (file_128.txt).

I'm trying to get the 2 patterns shown in the script. It prints the patterns, but the first pat2 is printed on the next line. I'm not sure why in some cases it correctly prints "pat1 pat2" on the same line, but sometimes prints "pat1 \n pat2".

The lines that show only one pattern are printed correctly: they really have no associated pat2 in the input file.

This is what I'm getting so far:

  
$ awk -v pat1="ff44.{6,18}321456.{8}.f(.){16}" -v pat2="038.{32,34}.+84(.){30}" '
   {  b=p $0 ;c=p $0 }
   match(b,pat1) {s=substr(b,RSTART,RLENGTH); sub(pat1,x) }
   match(c,pat2) {s=s (s!=""?OFS:"") substr(c,RSTART,RLENGTH); sub(pat2,z)  }
   s!="" {print s; s=""}
   {p=$0 }' file_128.txt
ff44000001321456022272619f81422060001fffff
03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000002321456014041612f81422060002fffff
ff44000003321456022280546f81422060003fffff 03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000004321456022939276f81422060004fffff 03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000005321456013741169f81422060354fffff 03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000006321456013741255f81422079900fffff

and the correct output should be:

ff44000001321456022272619f81422060001fffff 03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000002321456014041612f81422060002fffff
ff44000003321456022280546f81422060003fffff 03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000004321456022939276f81422060004fffff 03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000005321456013741169f81422060354fffff 03810f01020000000d8147451905ffffff008930010c0000000d8147451905ffffff0101860f010c0000000d81474519559fffff00840e01020102010001ffffff02010201
ff44000006321456013741255f81422079900fffff

Thanks in advance for any help.

You would need to skip printing the first time, where the buffer contains only one line ..

Quick fix:

$ awk -v pat1="ff44.{6,18}321456.{8}.f(.){16}" -v pat2="038.{32,34}.+84(.){30}" '
   {  b=p $0 ;c=p $0 }
   match(b,pat1) {s=substr(b,RSTART,RLENGTH); sub(pat1,x) }
   match(c,pat2) {s=s (s!=""?OFS:"") substr(c,RSTART,RLENGTH); sub(pat2,x)  }
   NR>1 && s!="" {print s; s=""}
   {p=$0 }' file_128.txt

--edit--
Actually, line 1 (NR==1) should not be skipped...

Hello Scrutinizer,

It works the first time, but when testing on a slightly bigger file (100 lines of 128 bytes each), it again appears that sometimes the pat2 belonging to a pat1 is printed on the next line.

You can see this with the new file I'm attaching (file_128_1.txt).

Thanks again.

Come to think of it, I think the output is correct as it is and line 1 should not be skipped. There is no extra newline being printed; that is just the way it is defined...

I am then not sure what you are looking for. Something like this (every pattern 1 on a new line) ?

awk -v pat1="ff44.{6,18}321456.{8}.f(.){16}" -v pat2="038.{32,34}.+84(.){30}" '
  {
    b=p $0
    c=p $0 
  }
  match(b,pat1) {
    if(s) print s
    s=substr(b,RSTART,RLENGTH)
    sub(pat1,x)
  }
  match(c,pat2) {
    s=s (s!=""?OFS:"") substr(c,RSTART,RLENGTH)
    sub(pat2,x)
  }
  {
    p=$0
  }
  END{
    if(s)print s
  }
' file_128_1.txt

Can you be more specific?