Replace hex values using sed command

mrreds · January 14, 2018, 2:23pm

File lalo.txt contains: Á

I need to replace Á with A using the sed command.

od -x lalo.txt

0000000 c10a
0000002

sed -e 's/\xc1\x0a/A/g' lalo.txt > lalo2.txt

Also tried:

sed -e 's/\xc3\x81/A/g' lalo.txt > lalo2.txt

Output file lalo2.txt still has Á

Unix version: SunOS 5.11 11.3 sun4v sparc sun4v

Any input? Thank you all!

drysdalk · January 14, 2018, 2:41pm

Hi,

I think you might be using the wrong hex value here. When I look up the hex value of Á, I get 00C1. Maybe try that instead of C10A and see if it helps?

mrreds · January 14, 2018, 3:09pm

Thank you drysdalk!

I tried the suggestion:

sed -e 's/\x00\xc1/A/g' lalo.txt > lalo3.txt
sed -e 's/\xc1\x00/A/g' lalo.txt > lalo3.txt

Still the same problem: nothing is replaced.

Scrutinizer · January 14, 2018, 3:39pm

Hi, see if this works for you:

sed 'y/Á/A/' lalo.txt > lalo2.txt

or

tr Á A < lalo.txt > lalo2.txt

mrreds · January 14, 2018, 4:02pm

Thank you Scrutinizer!

For 1st option:

# sed 'y/Á/A/' lalo.txt > lalo2.txt
sed: command garbled: y/Á/A/

For the second, no character was changed.

drysdalk · January 14, 2018, 7:07pm

Hi,

Hmm, unusual. Can you check one more thing please? What's the current value of the LANG (and possibly also LC_ALL) environment variable? If your current locale isn't actually set to a UTF-8 locale, that might explain some of these problems. Either that, or the utilities on the system you're on just can't handle UTF-8 properly at all. I'm having no issues with this on my Linux box, so it could also be something Solaris-specific. If I get the chance I'll try to test it on a SunOS-style box this evening.
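A minimal way to answer that question from the shell, assuming the POSIX `locale` utility is installed on the system in question:

```shell
# Show the locale variables that influence character handling.
echo "LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset} LC_CTYPE=${LC_CTYPE:-unset}"
# Print the codeset name of the current locale (the exact spelling varies by platform).
locale charmap
```

If the codeset reported here differs from the encoding of the data file, a character typed on the command line will not match the bytes stored in the file.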

---------- Post updated 15-01-18 at 12:07 AM ---------- Previous update was 14-01-18 at 09:42 PM ----------

Hi,

OK, think I might have a SunOS-based solution for you here. Caveat: this was tested on Tribblix, an open-source Illumos based version of Solaris, so it's SunOS-like, but not official "proper" (or paid-for, more to the point) Oracle Solaris. But it should be compatible enough for almost any purpose.

So, the answer I found was to use octal rather than hex. Try this:

root@tribblix:~/test# cat test.txt
�
root@tribblix:~/test# file test.txt
test.txt:       data
root@tribblix:~/test# cat test.txt | sed 's/\301/A/g' > test2.txt
root@tribblix:~/test# cat test2.txt
A
root@tribblix:~/test# file test2.txt
test2.txt:      ascii text
root@tribblix:~/test#

It's possible there's a little-endian-versus-big-endian byte order issue with the hex substitution, so rather than try to get my head around that I thought I'd try octal, and that seems to do the job. Let us know how you get on.
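One way to take byte order out of the picture when inspecting the file is to dump it one byte at a time instead of in 16-bit words, as od -x does. A small sketch, where sample.txt is a hypothetical file name used only for illustration:

```shell
# Create a two-byte sample: 0xC1 (octal 301) followed by a newline (octal 012).
printf '\301\n' > sample.txt
# -An suppresses the offset column; -tx1 prints every byte as two hex digits,
# in file order, regardless of the machine's endianness.
od -An -tx1 sample.txt
```

The bytes are listed as c1 followed by 0a, which also shows that the 0a in the od -x output from post #1 is just the trailing newline and not part of the character.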

RudiC · January 15, 2018, 6:36am

The line in lalo.txt in post #1 is a single char followed by a <newline> char. Hex C1 is the ISO 8859-1 ("extended" ASCII) code for "Latin capital letter A with acute", which is hex c3 81 in UTF-8.
That may be what garbled your command in post #5: two bytes for sed's y command.
You need to be very clear about what encoding you use in a) your session, and b) your data files. iconv or recode might do the job for you. Or use octal / hexadecimal representation for sed or tr as drysdalk proposed.
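As a sketch of the iconv route, the following converts a one-character ISO 8859-1 file to UTF-8 and dumps the result. The file names latin1.txt and utf8.txt are hypothetical, and the codeset names are an assumption: they are spelled differently on some platforms, and `iconv -l` lists the names a given system accepts.

```shell
# Create a sample file holding byte 0xC1 (octal 301) and a newline.
printf '\301\n' > latin1.txt
# Convert from ISO 8859-1 to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt
# Dump the converted file byte by byte.
od -An -tx1 utf8.txt
```

The single byte c1 becomes the two-byte sequence c3 81, matching the UTF-8 encoding mentioned above. Once data and session use the same encoding, the literal character can be used directly in sed or tr.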

wisecracker · January 15, 2018, 11:29am

Of course there is always 'printf' and could easily be in a 'read' loop...
OSX 10.13.2, default bash terminal.
Longhand to show it working...

Last login: Mon Jan 15 16:11:31 on ttys000
AMIGA:amiga~> bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin17)
Copyright (C) 2007 Free Software Foundation, Inc.
AMIGA:amiga~> printf "\xc1Bc Aa \xc1 BC \xc1a\xc1\n" > lalo.txt
AMIGA:amiga~> hexdump -C lalo.txt
00000000  c1 42 63 20 41 61 20 c1  20 42 43 20 c1 61 c1 0a  |.Bc Aa . BC .a..|
00000010
AMIGA:amiga~> string=$( cat lalo.txt )
AMIGA:amiga~> printf "${string//$'\xc1'/A}\n" > lalo1.txt
AMIGA:amiga~> cat lalo1.txt
ABc Aa A BC AaA
AMIGA:amiga~> _

disedorgue · January 19, 2018, 6:08pm

Hi,
With perl (>= 5.8) :

$ echo '������' | perl -MUnicode::Normalize -pe 'BEGIN{binmode STDIN, ":encoding(utf-8)"};$_ = NFD $_; y/[^x00-xFF]//cd'
ceeauA

But this does not work with characters such as '�' or '�' (perl removes them instead of translating them).

Don_Cragun · January 20, 2018, 3:32am

Although some versions of sed accept backslash escapes for octal, hexadecimal, and common character escapes like \n for <newline> and \t for <tab>, none of these are present in "standard" sed.

The sed command needed to change Latin capital letter A with acute to Latin capital letter A is simple:

sed 's/Á/A/' lalo.txt

but this works if, and only if, your current locale is using the same codeset that is used to encode the Latin capital letter A with acute that is present in lalo.txt. Based on the od output you showed us in post #1, we can say that Latin capital letter A with acute is encoded correctly for the ISO/IEC 8859-1 codeset in lalo.txt. And based on the other posts you have made in this thread, my guess would be that your current locale is using UTF-8 as its underlying codeset. Therefore, the following printf command should create a sed command that will do what you want:

printf "LC_ALL=C sed 's/\xc1/A/' lalo.txt > lalo2.txt" > lalo.ksh

If you run the above command and then run the command:

ksh lalo.ksh

you should end up with the file lalo2.txt with the output you want, as shown by the command:

$ od -bc lalo2.txt
0000000   101 012                                                        
           A  \n                                                        
0000002
$

The LC_ALL=C is probably needed because the ISO/IEC 8859-1 encoding of Á found in lalo.txt is not a valid character in the codeset used by your current locale. The LC_ALL=C will cause that sed command to be run in a locale where every single-byte value is a valid character. This should avoid the errors like:

sed: 1: "s/?/A/": RE error: illegal byte sequence

that you might get without it.
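The same idea can be sketched without a helper script and without \x escapes, which POSIX does not require printf to support (octal \ddd escapes are required). This variant is an illustration rather than the command from the post above: it recreates the two-byte lalo.txt from post #1, and lalo3.txt is a hypothetical extra output name.

```shell
# Recreate the sample file from post #1: byte 0xC1 (octal 301) plus a newline.
printf '\301\n' > lalo.txt
# Build the raw byte with an octal escape and keep it in a variable.
byte=$(printf '\301')
# Run sed in the C locale so that every single byte is a valid character.
LC_ALL=C sed "s/$byte/A/g" lalo.txt > lalo2.txt
# Alternative with tr, which interprets octal escapes itself.
LC_ALL=C tr '\301' 'A' < lalo.txt > lalo3.txt
# Inspect the result.
od -c lalo2.txt
```

Because the byte is generated by printf, the sed script contains the raw byte itself and does not depend on sed understanding backslash escapes.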