I would like to know how to remove a character from within my shell script. The character I want to remove is � within a .html file.
sed 's/�//g' a.html
or
cat a.html | tr -d "�"
tr is what I need, but it will not work in the script I wrote:
##Fix copyright incorrect in .html files
for htmlfile in $(find $DIRECTORY -type f -name \*.html); do
TFILE="/tmp/$directoryname.$$"
FROM='�'
TO='\&\#169\;'
sed "s/$FROM/$TO/g" "$htmlfile" > $TFILE && mv $TFILE "$htmlfile"
done
I've tried:
TO='\&\#169\;'
tr -d '�'
TO='\&\#169\;'
tr -d "�"
TO='\&\#169\;'
tr '�'
For some reason, after the code runs it replaces the � but adds a � before the HTML entity. Any ideas?
Character set disagreements between your terminal and the file, probably.
Try this:
tr -d '\200-\377' < inputfile > outputfile
This should get rid of every non-ASCII byte, which includes all multi-byte UTF-8 characters.
Can you explain to me what you are doing?
tr = Translate, squeeze, and/or delete characters
-d = same as --no-dereference --preserve=link
but I'm lost here:
'\200-\377'
and here:
< inputfile > outputfile
# byte value 169 (decimal) = 251 (octal)
tr -d '\251'
'\252-\254' # all bytes from \252 through \254
or list some chars
'\251\253\221'
\NNN = octal value
If you need a conversion tool, ksh93, for example, has one built in:
dec=169
typeset -i8 oct
oct=$dec
echo $oct
# or binary
typeset -i2 bin
bin=$dec
echo $bin
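If ksh93 is not installed, the same conversion can be sketched with plain POSIX shell tools. This is only an illustration; the variable names are made up for the example, and binary output is left out because printf has no format for it.

```shell
#!/bin/sh
dec=169

# printf converts a decimal number to other bases.
oct=$(printf '%o' "$dec")   # decimal -> octal
hex=$(printf '%x' "$dec")   # decimal -> hexadecimal

# Shell arithmetic goes the other way: a leading 0 marks an octal constant.
back=$((0251))

printf '%s %s %s\n' "$oct" "$hex" "$back"
```

The octal form is the one to use in a tr set, written as \251.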
Sorry, I'm still pretty new to bash. I've tried running tr -d '�', but it stops in the middle of the script with no indication of what is going on.
Show the script
##Fix copyright incorrect in .html files
for htmlfile in $(find $DIRECTORY -type f -name \*.html); do
TFILE="/tmp/$directoryname.$$"
FROM='�'
TO='\&\#169\;'
sed "s/$FROM/$TO/g" "$htmlfile" > $TFILE && mv $TFILE "$htmlfile"
done
That is the original code for the for loop.
No, -d means delete. See man tr.
All bytes with values 128 through 255 (octal 200 through 377), that is, everything outside ASCII.
Read from file inputfile, write to file outputfile.
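Here is a small sketch of those three pieces working together. The file names and sample text are made up for the demo. The range is written without square brackets, because GNU tr treats brackets as ordinary characters and would delete them as well. Also note that tr takes no file name arguments: it only reads standard input, so without the < redirection it just sits waiting for keyboard input.

```shell
#!/bin/sh
# Hypothetical file names, created in a temporary directory for the demo.
dir=$(mktemp -d)
printf 'Copyright \302\251 2010 [a]\n' > "$dir/inputfile"

# -d deletes every byte in the set; '\200-\377' is the range 128..255.
# "<" feeds inputfile to tr's standard input,
# ">" writes the result to outputfile (emptying it first).
LC_ALL=C tr -d '\200-\377' < "$dir/inputfile" > "$dir/outputfile"

result=$(cat "$dir/outputfile")
printf '%s\n' "$result"
rm -rf "$dir"
```

Both bytes of the copyright sign are gone afterwards, which is why this is a blunt tool if the symbol itself should stay.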
...You do have a backup, right? You're not just running random code from the internet on precious, irreplaceable input files?
Like I said, your terminal and your file may not agree on what the copyright symbol is. You'll have to find out the exact sequence of bytes that makes up the copyright symbol in the file (try hexdump -C filename), then put that into sed, escaped so it can understand it. Your version of sed might be able to use hexadecimal escapes like \x..
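A rough sketch of that approach, assuming the file stores the copyright sign as the UTF-8 byte pair C2 A9 (check your own dump; yours may differ) and assuming GNU sed, since \xHH escapes are an extension. od is used here in place of hexdump only because it is part of coreutils; the sample file is made up.

```shell
#!/bin/sh
# Hypothetical demo file containing a UTF-8 copyright sign (bytes C2 A9).
demo=$(mktemp)
printf 'Copyright \302\251 2010\n' > "$demo"

# Dump the raw bytes in hex; look for "c2 a9" in the output.
od -An -tx1 "$demo"

# Replace the whole two-byte sequence with the HTML entity.
# LC_ALL=C makes sed match bytes rather than characters.
# The backslash before & keeps sed from treating it as "the matched text".
fixed=$(LC_ALL=C sed 's/\xc2\xa9/\&#169;/g' "$demo")
printf '%s\n' "$fixed"
rm -f "$demo"
```

Matching both bytes at once is the point: replacing only the second byte would leave the first one behind in front of the entity.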
Yes, I do have a backup. I'm testing this script to correct a large number of .html files whose footer someone created incorrectly. A friend told me this would be a good time to learn Linux and bash scripting, so I'm trying and reading a lot. I do see a lot of confusion about what to use in Ubuntu, shell vs. bash.
bash is a shell, so yes, some confusion
can you please try this
tr -d '\302' < input >> output
� has the decimal value 194, which tr needs written in octal as \302
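To check the numbers: tr reads \NNN as octal, so decimal 194 has to be written as \302. A minimal sketch, using a made-up sample line that contains the byte pair C2 A9:

```shell
#!/bin/sh
# Decimal 194 written in octal, the base that tr escapes use.
oct=$(printf '%o' 194)

# Delete only the C2 byte from the sample, then dump what is left as hex
# (spaces and newlines stripped from the dump for easy comparison).
left=$(printf 'a \302\251 b\n' | LC_ALL=C tr -d '\302' | od -An -tx1 | tr -d ' \n')

printf 'octal: %s\nremaining bytes: %s\n' "$oct" "$left"
```

The a9 byte survives on its own, so whether this is the right fix depends on which encoding the pages declare.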
My apologies, I am getting lost in what to use as the input and output. If I recall, < and << print to a file?
EDIT:
and should it not be:
tr -d '\#194' < input >> output
No, it shouldn't. tr doesn't use HTML entities. Practically nothing but HTML uses HTML entities...
I think >>output should be >output, too.
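The difference is easy to see with a throwaway file (the name comes from mktemp and is only for the demo): > empties the file before writing, while >> adds to whatever is already there, so rerunning a command with >> keeps piling up copies.

```shell
#!/bin/sh
out=$(mktemp)   # hypothetical output file

printf 'one\n'   >  "$out"   # ">" truncates first: file holds "one"
printf 'two\n'   >  "$out"   # truncates again: file holds only "two"
printf 'three\n' >> "$out"   # ">>" appends: file holds "two" and "three"

lines=$(wc -l < "$out" | tr -d ' ')
content=$(cat "$out")
printf '%s\n' "$content"
rm -f "$out"
```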
Did you ever try my code?
Yes, I did. For some reason, when I run it line by line I get: inputfile: No such file or directory
Unless you have a file named 'inputfile' in your directory, of course it will say that.
Substitute the names you want.
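For a whole tree of pages, the substitution of names can be left to find. What follows is only a sketch under several assumptions: the copyright sign is stored as the bytes C2 A9, sed is GNU sed, and DIRECTORY is a scratch directory built here for the demo rather than your real site. Try it on a copy first.

```shell
#!/bin/sh
# Hypothetical directory tree, built here so the sketch is self-contained.
DIRECTORY=$(mktemp -d)
printf '<p>\302\251 2010</p>\n' > "$DIRECTORY/a.html"
printf '<p>\302\251 2011</p>\n' > "$DIRECTORY/b b.html"   # name with a space

# find hands each file name to the inner shell intact, even with spaces.
find "$DIRECTORY" -type f -name '*.html' -exec sh -c '
  for htmlfile in "$@"; do
    tmp=$(mktemp) || exit 1
    # Replace the UTF-8 byte pair C2 A9 with the entity (GNU sed \xHH escapes).
    # Writing back with cat keeps the original file permissions.
    LC_ALL=C sed "s/\xc2\xa9/\&#169;/g" "$htmlfile" > "$tmp" && cat "$tmp" > "$htmlfile"
    rm -f "$tmp"
  done
' sh {} +

first=$(cat "$DIRECTORY/a.html")
second=$(cat "$DIRECTORY/b b.html")
printf '%s\n%s\n' "$first" "$second"
rm -rf "$DIRECTORY"
```

Unlike looping over $(find ...), this form does not split file names on spaces.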
Thanks. I am a complete noob in this area. If it were design or a website, I'd have gotten it.