Shell script remove bad character

How do I remove a character from inside a shell script? The character I want to remove is �, and it appears within a .html file.

sed 's/�//g' a.html

or
cat a.html | tr -d "�"

The tr version is what I need, but it will not work in the script I wrote:

##Fix copyright incorrect in .html files
		for htmlfile in $(find $DIRECTORY -type f -name \*.html); do
			TFILE="/tmp/$directoryname.$$"
			
			FROM='�'
			TO='\&\#169\;'
			sed "s/$FROM/$TO/g" "$htmlfile" > $TFILE && mv $TFILE "$htmlfile"
		done

I've tried:

TO='\&\#169\;'
tr -d '�'
TO='\&\#169\;'
tr -d "�"
TO='\&\#169\;'
tr '�'

For some reason, after the code runs it replaces the � but adds a � before the HTML entity. Any ideas?

Character set disagreements between your terminal and the file, probably.

Try this:

tr -d '[\200-\377]' < inputfile > outputfile

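This should get rid of every byte outside 7-bit ASCII, which includes all the bytes that make up multi-byte UTF-8 characters. One caveat: GNU tr (the one on Ubuntu) takes the square brackets literally, so with them it also deletes any [ and ] in the file. On GNU tr, write the set as '\200-\377' instead.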
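As a quick check of what that command does, here is a minimal sketch run on a made-up sample string (the sample text is an assumption for illustration, not taken from the files in question). It writes the range without square brackets, since GNU tr treats brackets in this position as literal characters to delete.

```shell
# Sample line: the bytes \302\251 are the UTF-8 encoding of the copyright sign.
# The sample text is made up for illustration.
result=$(printf 'Copyright \302\251 2010 [footer]\n' | LC_ALL=C tr -d '\200-\377')

# Every byte from octal 200 to 377 is gone; plain ASCII, including [ and ], is kept.
printf '%s\n' "$result"
```

Note that this removes the copyright sign entirely rather than replacing it, so it only suits files that are meant to end up as pure ASCII.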

can you explain to me what you are doing??

tr = Translate, squeeze, and/or delete characters
-d = same as --no-dereference --preserve=link

but Im lost here:
'[\200-\377]'

and here:
< inputfile > outputfile

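# decimal 169 = octal 251 (the copyright sign in ISO-8859-1; ASCII itself stops at 127)
tr -d '\251'

'\252-\254' # all bytes from \252 to \254 (GNU tr takes the range without brackets)
or list individual bytes
'\251\253\221'

\NNN = the byte with octal value NNN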

If you need a conversion tool, in ksh93 for example you can use the builtin typeset:

dec=169
typeset -i 8 oct
oct=$dec
echo $oct
# or binary
typeset -i 2 bin
bin=$dec
echo $bin
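For shells other than ksh93, printf can do the same decimal-to-octal conversion. This is a small sketch of that alternative, not part of the ksh93 approach above; the variable names are only illustrative.

```shell
dec=169

# %o formats the number in octal, %x in hexadecimal.
oct=$(printf '%o' "$dec")
hex=$(printf '%x' "$dec")

echo "$oct"   # 251
echo "$hex"   # a9
```

The octal form is the one tr expects after the backslash, for example '\251'.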

Sorry, I'm still pretty new to bash. I've tried running tr -d '�' but it stops in the middle of the script with no indication of what is going on.

Show the script

##Fix copyright incorrect in .html files
		for htmlfile in $(find $DIRECTORY -type f -name \*.html); do
			TFILE="/tmp/$directoryname.$$"
			
			FROM='�'
			TO='\&\#169\;'
			sed "s/$FROM/$TO/g" "$htmlfile" > $TFILE && mv $TFILE "$htmlfile"
		done

that is the original code for the for statement

No, -d means delete. See man tr.

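That set means every byte with a value from 128 to 255 (octal 200 to 377), in other words everything outside 7-bit ASCII.

The redirections mean: read from the file inputfile, write to the file outputfile.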
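To connect the redirections to the loop in the question, here is a hedged sketch of running tr over every .html file under a directory. The demo directory and sample file are created here purely for illustration; in the real script DIRECTORY would already be set. It uses find -exec instead of a for loop over $(find ...) so that file names containing spaces survive, and it deletes all non-ASCII bytes, which is only appropriate if the files should end up as plain ASCII.

```shell
# Demo setup (assumption): a throwaway directory with one sample file.
demo_dir=$(mktemp -d)
printf 'Copyright \302\251 2010 [a]\n' > "$demo_dir/page one.html"
DIRECTORY=$demo_dir

find "$DIRECTORY" -type f -name '*.html' -exec sh -c '
    for htmlfile in "$@"; do
        tmpfile=$(mktemp) || exit 1
        # < feeds the html file to tr, > sends the output of tr to the temp file
        if LC_ALL=C tr -d "\200-\377" < "$htmlfile" > "$tmpfile"; then
            # Copy the contents back so the original file keeps its permissions
            cat "$tmpfile" > "$htmlfile"
        fi
        rm -f "$tmpfile"
    done
' sh {} +

cat "$demo_dir/page one.html"
```

tr cannot read from and write to the same file in one step, which is why the output goes to a temporary file first.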

...You do have a backup, right? You're not just running random code from the internet on precious, irreplaceable input files?

Like I said, your terminal and your file may not agree on what the copyright symbol is. You'll have to find out the exact sequence of bytes that makes up the copyright symbol in the file (try hexdump -C filename), then put that sequence into sed, escaped so that sed can understand it. Your version of sed might accept hexadecimal escapes like \x..
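A minimal sketch of that approach, assuming the dump shows the copyright symbol stored as the two UTF-8 bytes c2 a9 (check your own file first; it may differ). The sample file is made up, od is used here as a widely available stand-in for hexdump, and the \x.. escapes rely on GNU sed.

```shell
# Demo setup (assumption): a sample file containing the UTF-8 copyright sign.
work_dir=$(mktemp -d)
sample="$work_dir/sample.html"
printf 'Copyright \302\251 2010\n' > "$sample"

# Show the raw bytes in hex; look for the pair "c2 a9".
od -An -tx1 "$sample"

# \xc2\xa9 matches the two bytes; \& is a literal ampersand in the replacement.
# LC_ALL=C makes sed work byte by byte regardless of the terminal's character set.
LC_ALL=C sed 's/\xc2\xa9/\&#169;/g' "$sample" > "$sample.tmp" && mv "$sample.tmp" "$sample"

cat "$sample"
```

Matching both bytes together is what avoids the stray character left in front of the entity: replacing only the a9 byte leaves the c2 byte behind.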

Yes, I do have a backup. I'm testing this script to correct a large number of .html files whose footer someone created incorrectly. A friend told me it would be a good time to learn Linux and bash scripting, so I'm trying and reading a lot. I do see a lot of confusion about what to use in Ubuntu, shell vs. bash.

bash is a shell, so yes, some confusion :)

can you please try this

tr -d '\302' < input >> output

� has the byte value 194 in decimal; tr expects octal after the backslash, and 194 decimal is 302 octal.

My apologies, I am getting lost in what to use as the input and output. If I recall, < and << print to a file?

EDIT:
and should it not be:

tr -d '\#194' < input >> output

No, it shouldn't. tr doesn't use HTML entities. Practically nothing but HTML uses HTML entities...

I think >>output should be >output, too. >> appends to the end of an existing file, while > replaces its contents.

Did you ever try my code?

yes I did. For some reason when I run it line by line I get inputfile: No such file or directory

:wall:

Unless you have a file named 'inputfile' in your directory, of course it will say that.

Substitute the names you want.

Thanks. I am a complete noob in this area. If it was a design or website I'd have gotten it.