Any tip on replacing special characters in a file?

Hi,

Please find attached a file that contains special characters. It is a copy and paste from a Micro$oft file.

I don't want to use strings, as it removes all the indentation / formatting, so I am replacing the special characters with spaces instead. I am using the sed command below:

sed "s/$(printf "\302")/ /g" special_chars.txt | sed "s/$(printf "\240")/ /g" | grep -v "^$" > 123.txt

Note that the sed command above is specific about which special characters it replaces. Is there any way to specify a range of special characters to search and replace, so we don't always have to find out which special character(s) are present and change the sed command to suit?

Reply much appreciated. Thanks in advance.

Do those characters really hurt?

We're seeing copious amounts of the 'NO-BREAK SPACE' (U+00A0) Unicode character in your file, represented in UTF-8 as the multibyte sequence 0xC2 0xA0 (\302 \240 in octal). sed is definitely NOT the right tool to cope with those in general (although it might do for single occurrences). What encoding does your Microsoft host use? And what does your *nix node use?
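
If you want to confirm which bytes are actually in the file before replacing anything, something along these lines should work on most systems (file names as in this thread):

# dump the raw bytes in octal, so multibyte sequences like \302 \240 stand out
od -c special_chars.txt | head

# count the lines containing the UTF-8 NO-BREAK SPACE sequence
grep -c "$(printf '\302\240')" special_chars.txt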

Some conversion might already be done during transfer by using the right ftp options / settings. Or use the dos2unix tool. Or iconv or recode commands.
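
For instance, a sketch with iconv (the //TRANSLIT suffix is a GNU libc extension, which typically turns NO-BREAK SPACE into a plain space; behaviour on other platforms may differ):

# transliterate UTF-8 to plain ASCII where possible (GNU iconv)
iconv -f UTF-8 -t ASCII//TRANSLIT special_chars.txt > cleaned.txt

# or simply drop every character that has no ASCII equivalent
iconv -f UTF-8 -t ASCII -c special_chars.txt > cleaned.txt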

For exactly the above problem,

sed 's/\o302\o240/ /g' /tmp/special_chars.txt

might suffice...
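
And for the more general "replace a whole range" behaviour you asked about, tr accepts octal ranges, so a sketch like this might do. Caveat: it maps each byte to its own space, so a two-byte NO-BREAK SPACE becomes two spaces:

# replace every byte in the high range \200-\377 with a space
tr '\200-\377' '[ *]' < special_chars.txt > cleaned.txt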

EDIT: I vaguely remember from a recent post of yours that you are using Solaris (you don't mention it here). Not sure whether any of the above is available there. YMMV.
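
Should the \o escapes not be supported there (they are a GNU sed extension), a more portable variant reuses the printf trick from your own command to embed the two bytes:

# portable: let the shell's printf produce the \302\240 bytes for sed
sed "s/$(printf '\302\240')/ /g" special_chars.txt > cleaned.txt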