Shell script remove bad character

graphicsman · October 10, 2012, 10:42am

I would like to know how to remove a character from within my shell script. The character I want to remove is � in a .html file.

aashish.sharma8 · October 10, 2012, 10:53am

sed 's/�//g' a.html

or
cat a.html | tr -d "�"

graphicsman · October 10, 2012, 11:10am

The tr command is what I need, but it does not work in the script I wrote:

##Fix copyright incorrect in .html files
		for htmlfile in $(find $DIRECTORY -type f -name \*.html); do
			TFILE="/tmp/$directoryname.$$"
			
			FROM='�'
			TO='\&\#169\;'
			sed "s/$FROM/$TO/g" "$htmlfile" > $TFILE && mv $TFILE "$htmlfile"
		done

I've tried:

TO='\&\#169\;'
tr -d '�'

TO='\&\#169\;'
tr -d "�"

TO='\&\#169\;'
tr '�'

For some reason, after the code runs it replaces the � but adds a � before the HTML entity. Any ideas?

Corona688 · October 10, 2012, 11:25am

Character set disagreements between your terminal and the file, probably.

Try this:

tr -d '[\200-\377]' < inputfile > outputfile

This should get rid of all UTF-8 extended characters.

graphicsman · October 10, 2012, 11:29am

corona688:

Character set disagreements between your terminal and the file, probably.

Try this:
tr -d '[\200-\377]' < inputfile > outputfile
This should get rid of all UTF-8 extended characters.

Can you explain to me what you are doing?

tr = Translate, squeeze, and/or delete characters
-d = same as --no-dereference --preserve=link

but I'm lost here:
'[\200-\377]'

and here:
< inputfile > outputfile

kshji · October 10, 2012, 11:30am

# ascii 169 = oct 251
tr -d '\251'

'[\252-\254]' # all ascii between \252 and \254
or list some chars
'\251\253\221'

\NNN = octal value

If you need a conversion tool, in ksh93 for example you can use the typeset builtin:

dec=169
typeset -i 8 oct
oct=$dec
echo $oct
# or binary
typeset -i 2 bin
bin=$dec
echo $bin
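
In shells without the ksh93 typeset feature, printf can do the same decimal-to-octal conversion. A minimal sketch, assuming a POSIX printf and tr; the variable names and the sample string are only for illustration:

```shell
dec=169
oct=$(printf '%o' "$dec")   # decimal to octal: 169 becomes 251
echo "$oct"

# Hand the octal escape to tr to delete that one byte
# (octal 251 is the copyright sign in Latin-1).
stripped=$(printf 'a\251b\n' | LC_ALL=C tr -d "\\$oct")
echo "$stripped"
```

LC_ALL=C is there so that tr works on single bytes regardless of the locale the script runs in.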

graphicsman · October 10, 2012, 11:47am

Sorry, I'm still pretty new to bash. I've tried running tr -d '�' but it stops in the middle of the script with no indication of what is going on.

kshji · October 10, 2012, 11:49am

Show the script

graphicsman · October 10, 2012, 12:05pm

                ##Fix copyright incorrect in .html files
		for htmlfile in $(find $DIRECTORY -type f -name \*.html); do
			TFILE="/tmp/$directoryname.$$"
			
			FROM='�'
			TO='\&\#169\;'
			sed "s/$FROM/$TO/g" "$htmlfile" > $TFILE && mv $TFILE "$htmlfile"
		done

That is the original code for the for loop.

Corona688 · October 10, 2012, 12:30pm

No, -d means delete. See man tr.

'[\200-\377]' means all characters with values between 128 and 255 (octal 200 to 377).

< inputfile > outputfile means: read from file inputfile, write to file outputfile.
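
To see those three pieces working together, here is a small sketch that builds a throwaway file and strips every byte above 127 from it. The directory, file names and sample text are placeholders. One difference from the command as posted: GNU and POSIX tr write a range without surrounding brackets, and with the brackets the literal [ and ] characters would be deleted as well, so the sketch leaves them out.

```shell
tmpdir=$(mktemp -d)

# Sample input: a UTF-8 copyright sign (bytes 302 251 in octal) plus
# some literal brackets that must survive.
printf 'Copyright \302\251 2012 [demo]\n' > "$tmpdir/inputfile"

# -d deletes, '\200-\377' is the byte range 128 to 255,
# < reads from inputfile, > writes to outputfile.
LC_ALL=C tr -d '\200-\377' < "$tmpdir/inputfile" > "$tmpdir/outputfile"

result=$(cat "$tmpdir/outputfile")
echo "$result"
rm -r "$tmpdir"
```

Note that tr only reads standard input. Run without a pipe or a < file redirection, it waits for keyboard input, which can look like a script that has stopped for no reason.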

...You do have a backup, right? You're not just running random code from the internet on precious, irreplaceable input files?

Corona688 · October 10, 2012, 12:33pm

Like I said, your terminal and your file may not agree on what the copyright symbol is. You'll have to find out the exact sequence of bytes that makes up the copyright symbol in the file (try hexdump -C filename), then put that into sed, escaped so it can understand it. Your version of sed might be able to use hexadecimal characters like \x..
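
A rough sketch of that approach, assuming the file is UTF-8 and so stores the copyright sign as the two bytes C2 A9 (octal 302 251). The sample file and the variable names are made up for the demonstration, and od -c stands in for hexdump -C because it is available on any POSIX system:

```shell
# Build a sample file containing the UTF-8 copyright sign.
sample=$(mktemp)
printf '<p>\302\251 2012 Example</p>\n' > "$sample"

# Inspect the raw bytes; look for 302 251 in the dump.
od -c "$sample"

# Build the byte sequence with printf so the script does not depend
# on the character set of the terminal or the editor.
from=$(printf '\302\251')

# LC_ALL=C makes sed treat the input as plain bytes.
# In the replacement, & must be escaped; # and ; need no backslash.
fixed=$(LC_ALL=C sed "s/$from/\\&#169;/g" "$sample")
echo "$fixed"
rm -f "$sample"
```

If the dump shows only a single 251 byte, the file is probably Latin-1 and from would be printf '\251' instead. Replacing only the second byte of a two-byte sequence leaves the first byte behind, which could explain a stray character showing up in front of the entity.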

graphicsman · October 10, 2012, 12:45pm

Yes, I do have a backup. I'm testing this script, trying to correct a large number of .html files whose footer someone created incorrectly. A friend told me it would be a good time to learn Linux and bash scripting, so I'm trying and reading a lot. I do see a lot of confusion about what to do in Ubuntu with shell vs. bash.

Corona688 · October 10, 2012, 12:50pm

bash is a shell, so yes, there is some confusion.

dhilipan123 · October 11, 2012, 5:38am

Can you please try this:

tr -d '\194' < input >> output

� is assigned the value 194

graphicsman · October 11, 2012, 10:40am

My apologies, I am getting lost in what to use as the input and output. If I recall, the < and << print to a file?

EDIT:
and should it not be:

tr -d '\#194' < input >> output

Corona688 · October 11, 2012, 11:29am

No, it shouldn't. tr doesn't use HTML entities. Practically nothing but HTML uses HTML entities...

I think >>output should be >output, too.
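
A small sketch of the difference, using placeholder file names in a temporary directory. It also writes the byte as \302, because tr escapes are octal and decimal 194 is octal 302 ('\194' is not a valid octal escape):

```shell
workdir=$(mktemp -d)
printf 'a\302b\n' > "$workdir/input"

# < feeds the file to tr on standard input.
# > truncates the target first, so two runs still leave one line.
LC_ALL=C tr -d '\302' < "$workdir/input" > "$workdir/output"
LC_ALL=C tr -d '\302' < "$workdir/input" > "$workdir/output"
truncated=$(wc -l < "$workdir/output")

# >> appends, so every run adds another copy of the result.
LC_ALL=C tr -d '\302' < "$workdir/input" >> "$workdir/output"
appended=$(wc -l < "$workdir/output")

echo "lines after >: $truncated, lines after >>: $appended"
rm -r "$workdir"
```

With >> the output file keeps growing each time the script is rerun, which is rarely what is wanted when cleaning a file.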

Did you ever try my code?

graphicsman · October 11, 2012, 11:37am

Yes, I did. For some reason, when I run it line by line I get: inputfile: No such file or directory

Corona688 · October 11, 2012, 11:43am

Unless you have a file named 'inputfile' in your directory, of course it will say that.

Substitute the names you want.

graphicsman · October 11, 2012, 11:52am

Thanks. I am a complete noob in this area. If it were a design or a website, I'd have gotten it.