sed replacing specific characters and control characters by escaping

sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g" old.txt > new.txt

While I do know that some special characters need to be escaped, can normal characters also be escaped and still work the same way? Basically, I do not know all of the characters that have a special meaning; for example, ?, . and % have a meaning and have to be escaped to be replaced.

Does the above sed work as expected to replace all of the below characters with a space?

 ,
  .
  @
  /
  #
  :
  (
  )
  '
  *
  %
  $
  +
  ?
  _
  =
  "
  !
  ;

The easiest way to find out is to test it.

echo "@" | sed ...

But my concern is that if it removed other characters, or did something else I did not expect, there would be no way to test for that :)

---------- Post updated at 06:42 AM ---------- Previous update was at 06:30 AM ----------

OK, thanks. Maybe it is true that this regexp is correct and that we can escape both normal and special characters to be safe :)

echo "abc- cde fg.hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg / hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg/hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg \ hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg\hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg'hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg ' hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg % hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg = hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg _ hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg_hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg , hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg,hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg : hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg:hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg ( hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg(hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg ) hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg ? hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg @ hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg@hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg + hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg+hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg * hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg*hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg $ hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo 'abc- cde fg$hi' | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg ! hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
echo "abc- cde fg!hi" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g"
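
The one-character-at-a-time tests above can also be written as a loop. This is only a sketch: the variable names (chars, missed) are made up for illustration, the sed expression is copied unchanged from the question, and the character list is the one from the first post. Since the bracket list as posted contains no comma, a loop like this would be expected to report the comma as not replaced.

```shell
# Characters from the question's list, one per word (hypothetical variable name).
chars=', . @ / # : ( ) '\'' * % $ + ? _ = " ! ;'
missed=''

set -f    # disable globbing so * and ? are not expanded into file names
for c in $chars; do
    # Put the character between "fg" and "hi" and run the expression as posted.
    out=$(printf 'fg%shi\n' "$c" | sed -e "s/[\.\/\%'=:\(\)\`_\"\@\;\+\*\#\$\?\!]/ /g")
    # A replaced character leaves exactly "fg hi".
    [ "$out" = 'fg hi' ] || missed="$missed $c"
done
set +f

echo "characters not replaced:${missed:- none}"
```

Any character reported here would have to be added to the bracket list.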

Hi, why don't you try it yourself? It would be preferable, though, to use single quotes instead of double quotes. Inside double quotes the shell consumes the backslash in front of certain characters ( $ ` " \ and <newline> ), so sed never sees those backslashes.

echo "\\\`\$\
\""

produces:

\`$"
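
To see what the shell actually hands to sed, the same expression can be printed under both kinds of quoting. A minimal sketch; the shortened expression and the variable names dq and sq are assumptions for illustration only.

```shell
# Print the argument exactly as sed would receive it.
dq=$(printf '%s\n' "s/[\.\$]/ /g")   # double quotes: the shell consumes the \ before $
sq=$(printf '%s\n' 's/[\.\$]/ /g')   # single quotes: nothing is consumed

printf 'double-quoted: %s\n' "$dq"
printf 'single-quoted: %s\n' "$sq"
```

In the double-quoted form the backslash before $ is gone by the time sed runs, while the one before . survives.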

Also by putting \ in front of some of the normal characters, you are introducing a special meaning:

$ echo ntrv | sed -n "/\n\t\r\v/p"
$ echo ntrv | sed -n "/ntrv/p"
ntrv
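
Another consequence worth checking: inside a bracket expression a backslash is, per POSIX, an ordinary character, so a class written as [\.] matches a backslash as well as a dot. A minimal sketch, assuming a POSIX-style sed; the sample string is made up.

```shell
# Single quotes keep the shell from touching the backslashes.
# The input contains one literal backslash and one dot.
out=$(printf '%s\n' 'a\b.c' | sed 's/[\.]/ /g')
printf '%s\n' "$out"    # prints: a b c
```

So escaping every character inside [...] "to be safe" also adds the backslash itself to the set of characters being replaced.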

Single quotes seem to fail in the following cases:

echo "abc- cde fg ' hi" | sed -e 's/[\.\/\%'=:\(\)\`_\'\@\;\+\*\#\$\?\!]/ /g'
echo "abc- cde fg ' hi" | sed -e 's/[\.\/\%'=:\(\)\`_"'"\@\;\+\*\#\$\?\!]/ /g'

Since single quotes cannot be escaped with a backslash, and double quotes let the shell evaluate their contents, it looks like double quotes are the best bet in my case.
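
Two common ways around the single-quote problem, sketched under the assumption of a POSIX shell: close the single quotes, add a backslash-escaped quote and reopen them ('\''), or keep the expression in a file and pass it with -f so the shell never parses it. The shortened character list, the temporary file and the variable names are hypothetical.

```shell
# 1. '\'' = end single quotes, one escaped ', start single quotes again.
out1=$(printf '%s\n' "fg ' hi" | sed 's|[.'\''"$`]| |g')

# 2. Script in a temporary file: no shell quoting is involved at all.
script=$(mktemp)
cat > "$script" <<'EOF'
s|[.'"$`]| |g
EOF
out2=$(printf '%s\n' "fg ' hi" | sed -f "$script")
rm -f "$script"

# Both print "fg", three spaces, "hi".
printf '%s\n' "$out1" "$out2"
```

The | delimiter is used here only so that a / in the list would not need escaping.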

Still investigating: if the sed expression is double-quoted and a normal character is escaped, such as the digit 1 written as \1, will sed replace both the \ and the 1? I know that \$ evaluates to $.

---------- Post updated at 01:08 PM ---------- Previous update was at 01:02 PM ----------

\1 was a bad example, as it is a back-reference :) Let me use \A instead:
echo "abc-A \ cde fg A \ hi" | sed -e "s/\A/ /g"
abc- \ cde fg \ hi
The '\' is not replaced.

If it's really that simple, you could just use tr and stop worrying about regex special characters, since tr doesn't have regular expressions at all, just ranges and a few character classes.

The odd '"'"' bits are there to get single-quote characters inside a string that is otherwise single-quoted, since you can't escape anything inside single quotes. What actually happens is that the single quotes end, double quotes containing a single quote begin and end, and then it goes back to single quotes.

$ echo 'Hello ,.@/#:()'"'"'*%$+?_="!; Goodbye' | tr '[,.@/#:()'"'"'*%$+?_="!;]' '_'

Hello ___________________ Goodbye

$ echo 'Hello ,.@/#:()'"'"'*%$+?_="!; Goodbye' | tr -s '[,.@/#:()'"'"'*%$+?_="!;]' '_'

Hello _ Goodbye

$

You'd need a double-backslash \\ inside that range to get a backslash, and would need to escape [ ] - characters inside the range, but that's pretty much it for specials.

I wanted to keep using sed in the script, as it had already been accepted as working by my team.
I wonder how fast tr is compared to sed on gigabytes of text files?

Another thing you could do is give a range of the characters you do want, then invert it with ^, and thus avoid specifying all possible nasty characters in the universe.

# Everything but alphanumerics, spaces and tabs replaced with _
echo "..." | sed 's/[^a-zA-Z0-9[:blank:]]/_/g'
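
A quick illustration of the inverted list on concrete input. The sample string is made up, and the list is reduced to letters, digits and the space.

```shell
# Everything that is NOT a letter, digit or space becomes an underscore.
out=$(printf '%s\n' 'user@example.com: 50% (ok)' | sed 's/[^a-zA-Z0-9 ]/_/g')
printf '%s\n' "$out"    # prints: user_example_com_ 50_ _ok_
```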

Faster, certainly. tr is extremely simple, considering everything one character at a time instead of worrying about entire expressions. Far less work.

It also has no limitations on the size of lines for the same reason, while some versions of sed break on lines longer than 2000 bytes or so.
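
For anyone who wants to measure the difference rather than take it on trust, a rough sketch follows. It assumes bash (for the time keyword) and awk to generate throwaway data; the line count, the sample line and the reduced character set are arbitrary choices, and the timings will vary from system to system.

```shell
# Build a throwaway test file; raise the line count for a more realistic run.
f=$(mktemp)
awk 'BEGIN { for (i = 0; i < 200000; i++) print "abc- cde fg.hi, x@y (z) 100%" }' > "$f"

# Same job both ways: replace . , @ ( ) % with a space.
time sed 's/[.,@()%]/ /g' < "$f" > "$f.sed"
time tr '.,@()%' '      ' < "$f" > "$f.tr"

# Sanity check that both tools produced identical output.
if cmp -s "$f.sed" "$f.tr"; then same=yes; else same=no; fi
echo "identical output: $same"

rm -f "$f" "$f.sed" "$f.tr"
```

The second argument to tr is six spaces, one for each character in the first set.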

The only problem might be if there are UTF-8 characters as tr cannot translate those AFAIK.

By definition it can't since those are multi-byte sequences. It can certainly strip them out, though, since they're all values higher than ASCII 127.

tr -d '[\200-\377]' < input > output

I tried it. Like that it did not remove all of the characters, but like this it did:

LANG=C tr -d '[\200-\377]' < input > output

$ printf "%s\n" 'An preost wes on leoden, Laȝamon was ihoten
He wes Leovenaðes sone -- liðe him be Drihten.
He wonede at Ernleȝe at æðelen are chirechen,
Uppen Sevarne staþe, sel þar him þuhte,
Onfest Radestone, þer he bock radde.' |
LANG=C tr -d '[\200-\377]'
An preost wes on leoden, Laamon was ihoten
He wes Leovenaes sone -- lie him be Drihten.
He wonede at Ernlee at elen are chirechen,
Uppen Sevarne stae, sel ar him uhte,
Onfest Radestone, er he bock radde.
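
A smaller, encoding-independent version of the same check. This is a sketch with two deliberate differences from the commands above, both assumptions on my part: LC_ALL is used instead of LANG because it overrides any LC_* variable already set in the environment, and the set is written without the enclosing [ ] because a POSIX tr treats those brackets as literal characters and would delete them from the data as well.

```shell
# The letter ð is two bytes in UTF-8 (octal 303 260). It is written with
# printf escapes so the example does not depend on this file's encoding.
out=$(printf 'li\303\260e [him]\n' | LC_ALL=C tr -d '\200-\377')
printf '%s\n' "$out"    # prints: lie [him]
```

The brackets in the sample text survive, and both bytes of the multi-byte character are removed.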