sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g" old.txt > new.txt
I know that some special characters need to be escaped, but can ordinary characters also be escaped and still work the same way? I do not know every character that has a special meaning; for example, I believe ?, . and % have a meaning and have to be escaped to be replaced literally.
Does the sed command above work as expected and replace all of the characters in the bracket list with a space?
Hi, why don't you try it yourself? It would be preferable to use single quotes instead of double quotes, though. Inside double quotes the shell nibbles away the backslashes in front of some characters ( $ ` " \ and <newline> ), so sed never sees them. For example:
echo "\\\`\$\
\""
produces:
\`$"
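One way to check what sed would actually receive is to store the same text with each kind of quoting and print it. This is only a sketch; the variable names single and double are made up for illustration, and the pattern is a shortened version of the one in the question.

```shell
# The same pattern, once in single quotes and once in double quotes.
single='s/[\$\`]/ /g'    # single quotes: every backslash is passed through untouched
double="s/[\$\`]/ /g"    # double quotes: the shell removes the \ in front of $ and `

# Print what a program such as sed would be handed in each case.
printf '%s\n' "$single" "$double"
```

The first line printed still contains the backslashes; the second does not.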
Also, by putting \ in front of some ordinary characters, you introduce a special meaning:
$ echo ntrv | sed -n "/\n\t\r\v/p"
$ echo ntrv | sed -n "/ntrv/p"
ntrv
Since a single quote cannot be escaped with a backslash inside single quotes, and double quotes let the shell evaluate things, it looks like double quotes are the best bet in my case.
I am still investigating one point: if the sed expression is double-quoted and an ordinary character is escaped, such as the digit 1 written as \1, will sed replace the \ as well as the 1? I know that \$ evaluates to $.
---------- Post updated at 01:08 PM ---------- Previous update was at 01:02 PM ----------
\1 was a bad example, since it is a back-reference :), so let me use \A instead:
echo "abc-A \ cde fg A \ hi" | sed -e "s/\A/ /g"
abc-  \ cde fg   \ hi
Only the A is replaced; the '\' is not.
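The output above suggests sed treated \A as a plain A, so the backslashes in the input were never part of the match. To replace a literal backslash, sed itself has to see two backslashes. A minimal sketch (the variable names are illustrative only):

```shell
input='abc-A \ cde fg A \ hi'

# Single quotes: sed receives s/\\/ /g exactly as typed.
with_single=$(printf '%s\n' "$input" | sed -e 's/\\/ /g')

# Double quotes: the shell turns \\\\ into \\ before sed sees it.
with_double=$(printf '%s\n' "$input" | sed -e "s/\\\\/ /g")

printf '%s\n' "$with_single" "$with_double"
```

Both commands give the same result, with each backslash turned into a space and every A left alone.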
If it's really that simple, you could just use tr and stop worrying about regex special characters, since tr doesn't have regular expressions at all, just ranges and a few character classes.
The odd '"'"' bits are there to get a single-quote character inside a string that is otherwise single-quoted, since you can't escape anything inside single quotes. What actually happens is that the single quotes end, double quotes containing one single quote begin, and then it goes back to single quotes.
You'd need a double-backslash \\ inside that range to get a backslash, and would need to escape [ ] - characters inside the range, but that's pretty much it for specials.
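The tr command this reply refers to is not reproduced here, so the following is only a hypothetical sketch of the idea, using the punctuation from the first post. The names chars, sample and cleaned are invented for the example. It assumes a tr that pads a short second set by repeating its last character, as GNU and BSD tr do.

```shell
# '"'"' closes the single-quoted string, adds a double-quoted ', then reopens single quotes.
chars='./%'"'"'=:()`_"@;+*#$?!'

# Sample input built with the same quoting idiom: it's 50% (a_b) #1!
sample='it'"'"'s 50% (a_b) #1!'

# Every character listed in $chars becomes a space; no regex escaping is involved.
cleaned=$(printf '%s\n' "$sample" | tr "$chars" ' ')
printf '%s\n' "$cleaned"
```

Because the list contains no backslash, [ ] or -, nothing in it needs escaping for tr.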
Another thing you could do is give a range of the characters you do want, then invert it with ^, and thus avoid specifying all possible nasty characters in the universe.
# Everything but alphanumerics and spaces replaced with _
echo "..." | sed 's/[^a-zA-Z0-9 \t]/_/g'
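A variant of the same idea, sketched with POSIX character classes so that it does not depend on sed understanding \t inside a bracket expression. The sample input and the variable name result are made up for illustration.

```shell
# Keep letters, digits, spaces and tabs; turn everything else into _
# [:alnum:] and [:blank:] are POSIX classes; LC_ALL=C limits "letters" to ASCII.
result=$(printf '%s\n' 'a.b/c d%e' | LC_ALL=C sed 's/[^[:alnum:][:blank:]]/_/g')
printf '%s\n' "$result"
```

The space in the input survives, while the dot, slash and percent sign each become an underscore.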
I tried it, but written that way it did not remove all of the characters. This version did:
LANG=C tr -d '[\200-\377]' < input > output
$ printf "%s\n" 'An preost wes on leoden, Laamon was ihoten
He wes Leovenaðes sone -- liðe him be Drihten.
He wonede at Ernlee at æðelen are chirechen,
Uppen Sevarne staþe, sel þar him þuhte,
Onfest Radestone, þer he bock radde.' |
LANG=C tr -d '[\200-\377]'
An preost wes on leoden, Laamon was ihoten
He wes Leovenaes sone -- lie him be Drihten.
He wonede at Ernlee at elen are chirechen,
Uppen Sevarne stae, sel ar him uhte,
Onfest Radestone, er he bock radde.
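One caveat, offered as a sketch rather than a correction: with GNU tr (and, as far as I know, BSD tr) the square brackets in '[\200-\377]' are ordinary members of the set, so that command also deletes any literal [ and ] in the input. The variable names below are illustrative.

```shell
# With the brackets in the set, [ and ] are deleted along with the high bytes.
with_brackets=$(printf '%s\n' 'a[1]b' | LC_ALL=C tr -d '[\200-\377]')

# Without them, only bytes 0200-0377 (octal) are deleted.
without_brackets=$(printf '%s\n' 'a[1]b' | LC_ALL=C tr -d '\200-\377')

printf '%s\n' "$with_brackets" "$without_brackets"
```

If the input may contain square brackets that should be kept, the form without brackets is the safer one on those implementations.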