Converting parts of a string to "Hex"

HansHansen · September 12, 2012, 10:24am

Hi Guys,

While writing a small shell script, I need to convert parts of a string to "Hex". The problem is that only part of the string needs to be converted, not the full string.
I think it's best to show an example:

$astring = "xxxxxx ABC+10+##########+DEF xxxx"

This is only an example to show what the string could look like. There can be many characters at the beginning, then there is a "starting code" ABC, a value which says how long the element is (10 in this case), then some unprintable binary characters (just marked as #) and the "end mark" DEF.

As the binary characters are not displayable, they should be displayed as "\x39" for example. So the above string should result in

$astring = "xxxxxx ABC+10+\x39\x55\x12\x84\xA7\x9F\x2C\xB1\xFF\x12+DEF xxxx"

after the conversion.

The only thing I can think of is to loop through the string, find "ABC", get the count number and do the conversion.

Or is there a quicker and simpler solution you could think of?

Thanks a lot for your help !!!

Hans

vidyadhar85 · September 12, 2012, 11:13am

have you tried

od -x or od -h

command?

HansHansen · September 13, 2012, 5:55am

od converts a whole file. What I need is something that converts only parts of a string.

vidyadhar85 · September 13, 2012, 9:16am

If od does what you want, extract that part of the string using sed, awk, or any other command, pass it through od, and put the result back in place.
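
The "pass it through od" step can be sketched as a small helper. This is a minimal sketch and not code posted in the thread: the function name to_hex_escapes is hypothetical, and it assumes od prints whitespace-separated two-digit hex values for -tx1 and that the argument contains no NUL bytes.

```sh
# Hypothetical helper: print each byte of its argument as \xNN.
# Assumes the argument contains no NUL bytes.
to_hex_escapes() {
    printf '%s' "$1" |
        od -An -v -tx1 |      # two-digit hex value per byte, no offset column
        tr -d ' \t\n' |       # drop od's separators
        sed 's/../\\x&/g' |   # prefix every hex pair with \x
        tr -d '\n'            # drop the newline some sed versions append
}

to_hex_escapes '9U,'; echo
```

The -v option keeps od from collapsing repeated output lines into a single "*", which matters once the data is longer than one output line.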

alister · September 13, 2012, 9:29am

If that string can contain arbitrary binary data, you may have problems. Many UNIX utilities (sed, awk, sh, etc.) are designed to read text files. Text files are not expected to contain control characters or null bytes. Null bytes (\0) in particular can cause failures, since UNIX text tools often rely on the C library's string handling functions, which use \0 as a terminator.

Regards,
Alister
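
The null byte problem is easy to reproduce. The following is only an illustrative sketch (the variable name value is arbitrary): the same three bytes survive a pipe but not a shell variable. What exactly happens to the variable depends on the shell; bash, for instance, drops the null byte and newer versions print a warning, while other shells may cut the value short.

```sh
# Through a pipe, all three bytes (a, NUL, b) reach od.
printf 'a\0b' | od -An -v -tx1

# Stored in a shell variable, the NUL byte is typically lost.
value=$(printf 'a\0b')
printf '%s' "$value" | od -An -v -tx1
```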

RudiC · September 13, 2012, 4:06pm

OK, this was a pain in the neck, but it will solve the problem given, if perhaps not too elegantly. It assumes hexdump is installed; adapt it to od if need be. You need an awk program file:

$ cat awkfile
BEGIN{FS="+"; hd="hexdump -v -e '\"\\\\\" /1 \"X%02X\"'" }
        {for (i=1; i<=NF; i++)
                {if ($i~"ABC") {j=i+2}
                 if (i==j) {printf "%s",$i|hd;printf "%s", FS} else printf "%s%s", $i,FS}}

And then, execute that:

echo $astring|awk -f awkfile
xxxxxx ABC+10+\X39\X55\X12\X84\XA7\X9F\X2C\XB1\XFF\X12+DEF xxxx+

I know, there is one field separator too many at the end, but I'm out of patience now...

alister · September 13, 2012, 11:02pm

perl -pe '$p=qr/ ABC\+([^+]+)\+/; /$p/; s/($p)(.{$1})/$1.join("",map{"\\x$_"}unpack("H2"x$2,$3))/e' file

Regards,
Alister

---------- Post updated at 11:02 PM ---------- Previous update was at 10:54 PM ----------

Been there myself many times. Unfortunately, your solution won't work reliably. The relative timing of hexdump's output and awk's output isn't guaranteed in any way.

With a 3 line sample file and stdout a terminal, the hexdump output for all three lines shows up after all of awk's output for those three lines.

With the same sample file and stdout redirected to a file, the hexdump output for all three lines occurs in the middle of the first line of awk output.

I'm almost certain that the consistent difference is a side effect of my awk implementation increasing buffering in the absence of interactivity.

Regards,
Alister
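
One common way to pin down the ordering described above is to flush awk's own output before starting the child command, and to close() the pipe right after writing to it so that awk waits for the child to finish. The following is a minimal sketch of that idea, not RudiC's or alister's code: it swaps hexdump for od, tr and sed, the names HD and hex_field are hypothetical, and it still assumes the binary field contains no "+", newline or null bytes. fflush() without arguments is widely supported, but check your awk.

```sh
# Hypothetical converter pipeline that awk runs for the binary field.
# It is passed through the environment to avoid awk string escaping.
HD='od -An -v -tx1 | tr -d " \t\n" | sed "s/../\\\\x&/g" | tr -d "\n"'
export HD

hex_field() {
    awk 'BEGIN { FS = "+"; hd = ENVIRON["HD"] }
    {
        j = 0                              # reset for every input line
        for (i = 1; i <= NF; i++) {
            if ($i ~ /ABC$/) j = i + 2     # data field follows the count
            if (i == j) {
                fflush()                   # push out what awk has buffered
                printf "%s", $i | hd       # child writes straight to stdout
                close(hd)                  # wait until the child is done
            } else
                printf "%s", $i
            if (i < NF) printf "%s", FS    # no separator after last field
        }
        printf "%s", ORS
    }'
}

printf '%s\n' 'xxxxxx ABC+3+9U,+DEF xxxx' | hex_field
```

Because the separator is only printed between fields, this variant also avoids the extra "+" at the end of the line.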

RudiC · September 14, 2012, 3:46am

Admitted. I noticed that myself in certain circumstances. Still, I wanted to publish my "elaborate" construct, if only as a basis for discussion. The typecast feature offered by programming languages was bitterly missed in sed, awk, and bash. Obviously they are too smart in variable handling.

alister · September 14, 2012, 10:30am

Here's my attempt at a POSIX approach. It gives me the same result as the perl solution that I posted yesterday. It depends on the od command generating whitespace delimited, two-hexdigit numbers.

match='match($0, / ABC\+[0-9]+\+/)'

awk '{'"$match"'; print substr($0, RSTART+RLENGTH)}' file |
while IFS= read -r line; do
        printf '%s' "$line" | od -An -tx1 | paste -s - |
            sed 's/[[:blank:]]//g; s/../\\x&/g'
done > hextemp

awk '{'"
        $match"'
        pre=substr($0, 1, RSTART+RLENGTH-1)
        nbytes=substr($0, RSTART+5, RLENGTH-6)
        post=substr($0, RSTART+RLENGTH+nbytes)
        getline hexstr < hextemp
        hexstr=substr(hexstr, 1, 4*nbytes)
        printf "%s%s%s%s", pre, hexstr, post, ORS
}' hextemp=hextemp file

Regards,
Alister

Don_Cragun · September 15, 2012, 3:29pm

Hi Hans,
You say that the # characters are unprintable characters, but from the output you say you want from your input ( $astring = "xxxxxx ABC+10+\x39\x55\x12\x84\xA7\x9F\x2C\xB1\xFF\x12+DEF xxxx" ) we see that some of these bytes represent printable characters (assuming you're using a codeset with ASCII underpinnings). The hexadecimal escape codes \x39, \x55, and \x2C are the characters '9', 'U', and ',', respectively. This isn't necessarily bad, but none of the scripts that have been presented here so far will work correctly if one of the characters represented by a "#" is a newline character, and these scripts may fail if the "x"s or "#"s contain a sequence that matches the form "ABC+<digits>+". As has already been stated, there is nothing we can do for you in a shell script if any of the bytes represented by a "#" is a null byte ('\x00').
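
Both observations can be checked from the command line. This is only an illustrative sketch: it writes the three bytes as octal escapes, which the printf utility accepts portably, and it uses a made-up sample line to show how an embedded newline splits one string into two records for a line-oriented tool such as awk.

```sh
# \071, \125 and \054 are the octal forms of 0x39, 0x55 and 0x2C.
printf '\071\125\054\n'

# One logical string with a newline in its data field becomes two awk records.
printf 'ABC+3+a\nb+DEF\n' | awk 'END { print NR }'
```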

As long as you can guarantee that there won't be any null bytes in the string except for the terminating null byte at the end of every string, and can guarantee that exactly one substring of the form "ABC+<digits>+" will appear in each string, the following script does what you have requested:

#!/bin/ksh
### Functions:
# Usage: hexit bytes
# Convert the string ("bytes") into printable hexadecimal escape sequences
# corresponding to the values of the bytes in the string.  This function will
# not work correctly if a null byte appears in the string other than as the
# string terminator.  It will correctly handle newline characters in the bytes
# operand.
hexit() {
        printf "%s" "$1" | od -An -tx1 | while read x
        do      set -- $x
                while [ $# -gt 0 ]
                do      printf '\\x%s' "$1"
                        shift
                done
        done
}

### Main program:
# Usage: hexstring string...
# This utility will process each string operand, which must be of the form:
#       <front><hex-head><hex-bytes><tail>
# where <front> is any sequence of zero or more printable characters not
#               containing any substring that matches the format specified for
#               <hex-head>.
#       <hex-head> is composed of three parts in sequence:
#               <hex-head-start><count><hex-head-end>
#       where   <hex-head-start> is the characters "ABC+",
#               <count> is one or more characters from the current locale's
#                       digit character class that will be interpreted as a
#                       decimal digit string specifying the number of bytes
#                       included in <hex-bytes> (see below), and
#               <hex-head-end> is a "+" character.
#       <hex-bytes> is a string of <count> bytes.  These bytes can contain any
#               value except the null byte as long as no substring of these
#               bytes constitutes a string that can be interpreted as a valid
#               <hex-head> string either by itself or when combined with the
#               following <tail>.
#       <tail>  is zero or more printable characters not containing any
#               substring that matches the format specified above for
#               <hex-head>.
# When processing is complete, a string will be written to stdout containing
# <front> (unchanged), <hex-head> (unchanged), <hex-bytes> (converted to the
# four character hexadecimal escape sequence representing each byte in
# <hex-bytes>), and <tail> (unchanged).
#
# Example: (Assuming this script is invoked by a recent ksh running on a system
# with the ASCII codeset underlying the current locale):
#       hexstring $'start ABC+5+a\tb\nc+end'
# would produce the following output:
#       start ABC+5+\x61\x09\x62\x0a\x63+end
ec=0    # Exit code (0 unless an error is detected)
while [ $# -gt 0 ]
do
        # Extract the <count> field.
        count=$(expr "$1" : ".*ABC+\([0-9]\{1,\}\)+")
        if [ "$count" = "" ]
        then
                printf "%s: \"ABC+<digits>+\" not found in \"%s\"\n" \
                        "$(basename "$0")" "$1"
                shift
                ec=1
                continue
        fi
        # Calculate the offset to the start of <hex-bytes>.
        off=$(expr "$1" : ".*ABC+[0-9]\{1,\}+")
        # Print <front> and <hex-head>
        printf "%s" "${1:0:off}"
        # Print <hex-bytes> as hexadecimal escape sequences.
        hexit "${1:off:count}"
        # And, print <tail>
        printf "%s\n" "${1:off + count}"
        shift
done
exit $ec

I realize this is a long script, but it is mostly comments. Note that some features used in the above script are only available in versions of ksh newer than November 16, 1988, and some of the od utility's options used here weren't defined by the standards until 1992.
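
For shells without the ${var:offset:length} expansion, the same slicing can be approximated with expr and dd. The following is a rough sketch under stated assumptions, not a drop-in replacement for the script above: the function name hexstring_posix is hypothetical, it handles a single operand, it counts bytes rather than characters (hence LC_ALL=C for expr), it treats a count of 0 as an error, and it still cannot handle null bytes.

```sh
# Hypothetical variant using only expr, dd, od, tr and sed.
hexstring_posix() {
    s=$1
    # The leading X keeps expr from mistaking the operand for an operator.
    count=$(LC_ALL=C expr "X$s" : '.*ABC+\([0-9]\{1,\}\)+') || return 1
    off=$(LC_ALL=C expr "X$s" : '.*ABC+[0-9]\{1,\}+')
    off=$((off - 1))                       # do not count the X
    # <front> and <hex-head>, unchanged.
    printf '%s' "$s" | dd bs=1 count="$off" 2>/dev/null
    # <hex-bytes>, converted to \xNN escapes.
    printf '%s' "$s" | dd bs=1 skip="$off" count="$count" 2>/dev/null |
        od -An -v -tx1 | tr -d ' \t\n' | sed 's/../\\x&/g' | tr -d '\n'
    # <tail>, unchanged.
    printf '%s' "$s" | dd bs=1 skip=$((off + count)) 2>/dev/null
    echo
}

hexstring_posix 'start ABC+3+9U,+end'
```

Reading one byte at a time with dd is slow, but for short strings it keeps the sketch simple and avoids partial-read surprises on pipes.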

Presumably, you have a source that creates strings containing binary data so I won't worry about it here. It is easy to create strings like this with $'...' in recent versions of ksh, in a C or C++ program, and using the printf utility with hex escape sequences (but I assume if you're creating hex escape sequences to generate these strings, you don't need to convert them back to hex).
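
For testing, a sample string can be produced with the printf utility. This is a small sketch with a hypothetical helper name, make_sample; it uses octal escapes in the format string, which printf is required to support, rather than \x escapes, which not every printf accepts. The output is piped straight to od instead of being stored in a variable.

```sh
# Hypothetical helper: emit a sample string whose counted field holds
# three binary bytes (0x01, a tab, and 0xFF).
make_sample() {
    printf 'ABC+3+\001\t\377+DEF'
}

# Inspect the result byte by byte.
make_sample | od -An -v -tx1
```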