SED on cygwin not working with Hex or Octal

Hi, I have downloaded a web page that I need to clean up before passing it to xmlstarlet.

Viewed in UltraEdit's hex mode, part of my download looks like this:

3C 2F 61 3E 0A 09 0A 09 09 3C 2F 61 3E

which in ASCII is

</a>
	
		</a> 

I need to locate this byte sequence and replace it with just a single </a>.

I have tried:

awk '{ gsub(/\x0A\x09\x0A\x09\x09/, "<\/a>"); print}' in > out 
sed -e 's/<\/a>\\n\\t\\n\\t\\t<\/a>/<\/a>/g' in > out 
 sed -e 's/<\/a>\x0A\x09\x0A\x09\x09<\/a>/<\/a>/' in > out 
sed -e  's/<\/a>\o012\o011\o012\o011\o011<\/a>/<\/a>/' in > out 

but none of them has any effect: each one just creates an identical output file.

I also created a test file containing just two newlines and tried "sedding" those, but with no success. It is almost as if my sed will not recognise anything other than plain text.

I am using sed GNU version 4.2.1 which, according to the documentation, happily supports such activities.

Any ideas folks?

What is the output of the following?

# od -c yourfile

Thanks for your reply. Sorry ygemici, I don't quite follow what you mean.

The input file is an HTML file, the product of a wget to an external web site.

The output file is created by whatever option I try and use to do the search and replace.

I ran od -c against both the input and the output file; both were identical. What should I look for in the 'od -c' output that would be helpful?

It's not a problem with Cygwin, sed or awk. It's the way the input file is handled: sed and awk read it line by line, so each pattern is tested against one line at a time, without its trailing newline. That's why <\/a>\\n\\t\\n\\t\\t<\/a> is never found in a single line.

You could change your approach. Try this Perl one-liner:
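To illustrate the line-by-line point, here is a minimal sketch, assuming GNU sed (for \t in the regex) and a hypothetical sample.html built from the bytes in the hex dump. The first sed call works one line at a time, so the pattern space never contains a newline and nothing matches. The second uses the common :a;N;$!ba loop to append every line to the pattern space before substituting, so the newlines become visible to the regex. The file names are assumptions for the example only.

```shell
# Build a sample with the same bytes as the hex dump (file name is hypothetical).
printf 'hello\n</a>\n\t\n\t\t</a>\nworld\n' > sample.html

# Line by line: the pattern space holds one line without its newline,
# so a regex containing \n cannot match and the file is copied unchanged.
sed 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/' sample.html > out_linewise.html

# Slurp: label a, append the next line (N), loop until the last line ($!ba),
# then run the substitution once over the whole file.
sed -e ':a' -e 'N' -e '$!ba' \
    -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/g' sample.html > out_slurp.html

cat out_slurp.html
```

Slurping keeps the whole file in memory, which should be fine for a single downloaded web page.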

[user@cygwin ~]$ cat inputfile
hello
</a>

                </a>
world
[user@cygwin ~]$ perl -ne 'if(/^<\/a>/ .. /^\t\t<\/a>/) { (/^\t\t<\/a>/) && print "</a>\n" } else { print }' inputfile
hello
</a>
world

balajesuri, yes of course, line-by-line. A bad example of overlooking the obvious!

The Perl provided doesn't work for me; however, that is a subject for another thread.

This question is, of course, answered. Thanks!

# cat file
hello
</a>

                </a>
world
# sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/'  file
hello
</a>
world

ygemici, I created the Hello World file as shown in your post and ran the sed command, but the output was identical to the input. I also tried the following on my own file (I think they are the right fit for it), but each one again produced an identical output file:

 sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\x0A\x09\x0A\x09\x09<\/a>/<\/a>/'  in > out 
 sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/'  in >  out 

In Cygwin it works for me:

$ cat infile
hello
</a>

                </a>
world
$ sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/' infile
hello
</a>
world

How about your input? Does it produce the same output as below?

$ sed -ne '/<\/a>/,/<\/a>/{N;N;p}' infile |od -c
0000000   <   /   a   >  \n  \t  \n  \t  \t   <   /   a   >  \n
0000016
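When comparing your own od -c output with the listing above, two things are worth checking: whether the indentation really is \t rather than spaces (tabs are easily lost when a test file is copied from a web page), and whether any \r shows up before \n. The sketch below builds three hypothetical variants of the fragment with printf and counts those bytes. The file names and the count_byte helper are made up for this example, and whether either case applies to your file is only a guess.

```shell
# Three hypothetical variants of the same fragment.
printf '</a>\n\t\n\t\t</a>\n'       > tabs.txt    # bytes as in the hex dump
printf '</a>\n\n        </a>\n'     > spaces.txt  # tabs lost in copy/paste
printf '</a>\r\n\t\r\n\t\t</a>\r\n' > crlf.txt    # DOS line endings

# Dump each variant; look for \t versus runs of spaces, and for \r before \n.
for f in tabs.txt spaces.txt crlf.txt; do
    echo "== $f"
    od -c "$f"
done

# count_byte CHAR FILE: number of occurrences of CHAR (a tr escape) in FILE.
count_byte() { tr -cd "$1" < "$2" | wc -c | tr -d ' '; }

tabs_in_tabs=$(count_byte '\t' tabs.txt)
tabs_in_spaces=$(count_byte '\t' spaces.txt)
cr_in_crlf=$(count_byte '\r' crlf.txt)
echo "tabs: $tabs_in_tabs  tabs after copy/paste: $tabs_in_spaces  CRs: $cr_in_crlf"
```

Only the first variant matches a pattern written as <\/a>\n\t\n\t\t<\/a>; for the other two the pattern would need adjusting, or the file would need converting first.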