Hi, I have downloaded a web page that I need to clean up before passing it to xmlstarlet.
Viewed with UltraEdit's hex utility, part of my download is as follows:
3C 2F 61 3E 0A 09 0A 09 09 3C 2F 61 3E
which in ASCII is (a newline, a tab, a newline and two tabs separate the two tags):
</a>
</a>
I need to locate this string and replace it with just </a>
I have tried:
awk '{ gsub(/\x0A\x09\x0A\x09\x09/, "<\/a>"); print}' in > out
sed -e 's/<\/a>\\n\\t\\n\\t\\t<\/a>/<\/a>/g' in > out
sed -e 's/<\/a>\x0A\x09\x0A\x09\x09<\/a>/<\/a>/' in > out
sed -e 's/<\/a>\o012\o011\o012\o011\o011<\/a>/<\/a>/' in > out
but it is not having any of it - each command just creates an output file identical to the input.
I also created a test file with just two newlines in it and tried "sedding" those, but with no success. It is almost as if my sed will not recognise anything other than plain text.
I am using sed GNU version 4.2.1 which, according to the documentation, happily supports such activities.
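It's not a problem with cygwin, sed or awk. It's the way the input file is handled. Both sed and awk read the input one line at a time and strip the trailing newline before applying the pattern, so the text being matched never contains a newline. That's why <\/a>\\n\\t\\n\\t\\t<\/a> was never found: it spans three lines.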
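To illustrate the line-by-line point, here is a minimal sketch, assuming GNU sed (which understands \n and \t in a pattern). The sample file contents and the variable names tmp, per_line and slurped are made up for the demonstration. The first sed call processes one line at a time and changes nothing; the second uses the ':a;N;$!ba' loop to pull the whole file into the pattern space before substituting.

```shell
# Build a small sample containing the byte sequence from the question:
# </a> LF TAB LF TAB TAB </a>   (the surrounding markup is invented)
tmp=$(mktemp)
printf '<a href="x">link</a>\n\t\n\t\t</a>\n<p>rest</p>\n' > "$tmp"

# Line by line: the pattern space never contains a newline,
# so the multi-line pattern cannot match and the text is unchanged.
per_line=$(sed -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/g' "$tmp")

# Whole file: ':a' is a label, 'N' appends the next line to the pattern
# space, and '$!ba' jumps back to the label unless this is the last line.
# Only then does the substitution run, with the newlines in view.
slurped=$(sed -e ':a' -e 'N' -e '$!ba' \
              -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/g' "$tmp")

printf '%s\n' "$slurped"
rm -f "$tmp"
```

On the real file the second command would be run with in > out instead of the temporary file. It holds the entire file in memory, which should be fine for a single web page.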
You may change your approach. Try this perl one-liner:
ygemici, I created the Hello World file as indicated in your post and ran the sed, but the output was identical to the input. I have also tried the following on my file (as I think they are the best fit for it), but each one again created an output file identical to the input:
sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\x0A\x09\x0A\x09\x09<\/a>/<\/a>/' in > out
sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/' in > out