Replacing hexadecimal byte sequences with awk?

jossojjos · May 17, 2010, 12:08pm

Hi there!

I have text files with some nonsense characters in them: different text editors display different nonsense symbols and, worse, the application that should be able to read these files can't.

With xxd, the nonsense characters show as "efbfbd", while they should be "c2a7" (the 'paragraph character', §).

E.g., this is a line of xxd output containing the weird character:

0000260: 3433 3036 312c efbf bd28 2244 4941 4e54  43061,...("DIANT

Now, is there a way to change the "efbfbd" into "c2a7"? Preferably with awk, since that's what I'm already using to process these files for other purposes.

Thanks!
jos

aigles · May 17, 2010, 1:08pm

With AWK (GNU version; I'm not sure about other flavours):

$ xxd ascii.txt
0000000: 3433 3036 312c efbf bd28 2244 4941 4e54  43061,...("DIANT
$ awk '{ gsub(/\xef\xbf\xbd/, "\xc2\xa7") ; print }' ascii.txt > new.txt
$ cat new.txt
43061,§("DIANT
$ xxd new.txt
0000000: 3433 3036 312c c2a7 2822 4449 414e 540a  43061,..("DIANT.
$
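
For awk flavours that may not understand "\x" escapes, the same replacement can probably be written with octal escapes, which POSIX awk defines for both strings and regular expressions. This is only a sketch under a few assumptions: the file names sample.txt and fixed.txt are made up for illustration, LC_ALL=C is set so that awk treats the data as raw bytes rather than multibyte characters, and od stands in for xxd in case xxd is not installed. In octal, ef bf bd is 357 277 275 and c2 a7 is 302 247.

```shell
# Build a small sample file containing the bad bytes ef bf bd
# (octal 357 277 275). sample.txt is a hypothetical file name.
printf '43061,\357\277\275("DIANT\n' > sample.txt

# Replace them with c2 a7 (octal 302 247). LC_ALL=C asks awk to
# work on raw bytes instead of interpreting multibyte characters.
LC_ALL=C awk '{ gsub(/\357\277\275/, "\302\247"); print }' sample.txt > fixed.txt

# Dump the result as hex bytes to check the replacement.
od -An -tx1 fixed.txt
```

In the od output, the bytes c2 a7 should appear where ef bf bd were, and the rest of the line should be unchanged.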

Jean-Pierre.

jossojjos · May 17, 2010, 1:26pm

Thank you very much, Jean-Pierre!

My gawk guide wasn't specific enough ... it suggested "0x" instead of "\x", but I guess "0x" is only for hexadecimal numeric constants, while "\x" is the escape for bytes inside strings and regular expressions.

jos