SED on cygwin not working with Hex or Octal

dazhoop · March 26, 2012, 7:44am

Hi, I have downloaded a web page that I need to clean up before passing it to xmlstarlet.

In UltraEdit's hex view, part of my download looks like this:

3C 2F 61 3E 0A 09 0A 09 09 3C 2F 61 3E

which in ASCII is

</a>
	
		</a>

I need to locate this string and replace it with just </a>

I have tried:

awk '{ gsub(/\x0A\x09\x0A\x09\x09/, "<\/a>"); print}' in > out

sed -e 's/<\/a>\\n\\t\\n\\t\\t<\/a>/<\/a>/g' in > out

 sed -e 's/<\/a>\x0A\x09\x0A\x09\x09<\/a>/<\/a>/' in > out

sed -e  's/<\/a>\o012\o011\o012\o011\o011<\/a>/<\/a>/' in > out

but it is not having any of it: every attempt just creates an output file identical to the input.

I also created a test file with just two newlines in it and tried "sedding" those, but with no success. It is almost as if my sed will not recognise anything other than plain text.

I am using sed GNU version 4.2.1 which, according to the documentation, happily supports such activities.

Any ideas folks?
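
A quick way to test the suspicion above is to try the escapes on a single line, where no newline is involved. This is only a sketch and assumes GNU sed, since \xHH and \oNNN are GNU extensions; the variable names are made up for illustration.

```shell
# Replace a tab inside ONE line, written first as hex (\x09), then as octal (\o011).
# Assumption: GNU sed; other sed implementations may not understand these escapes.
hex_result=$(printf 'a\tb\n' | sed 's/\x09/_/')
oct_result=$(printf 'a\tb\n' | sed 's/\o011/_/')

printf '%s\n%s\n' "$hex_result" "$oct_result"
```

If both lines come out as a_b, the escapes themselves are fine and the problem lies elsewhere, for example in the newline (\x0A) part of the pattern.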

ygemici · March 26, 2012, 8:07am

What is the output of the following?

# od -c yourfile

dazhoop · March 26, 2012, 8:27am

Thanks for your reply. Sorry ygemici, I don't quite follow what you mean.

The input file is an HTML file, the product of a wget to an external web site.

The output file is created by whatever option I try and use to do the search and replace.

I ran od -c against both the input and output files, and the results were identical. What should I look for in the 'od -c' output that would be helpful?

balajesuri · March 26, 2012, 8:31am

It's not a problem with cygwin, sed or awk. It's the way the input file is handled: sed and awk process it line by line, so the pattern <\/a>\\n\\t\\n\\t\\t<\/a>, which spans several lines, never matches within a single line.

You may change your approach. Try this perl one-liner:

[user@cygwin ~]$ cat inputfile
hello
</a>

                </a>
world
[user@cygwin ~]$ perl -ne 'if(/^<\/a>/ .. /^\t\t<\/a>/) { (/^\t\t<\/a>/) && print "</a>\n" } else { print }' inputfile
hello
</a>
world
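
The line-by-line behaviour described above can also be seen, and worked around, in sed itself. The following is a minimal sketch rather than a definitive fix: sample, tab, pattern, line_by_line and slurped are illustrative names, and the sample text is built from the hex dump in the first post. The ':a;N;$!ba' loop appends every remaining line to the pattern space, so the whole file is matched as one string (fine for a web page, less so for very large files).

```shell
# Sample input matching the hex dump: </a> LF TAB LF TAB TAB </a>
sample() { printf 'hello\n</a>\n\t\n\t\t</a>\nworld\n'; }

# A literal tab, so the pattern does not rely on sed understanding \t.
tab=$(printf '\t')
pattern="</a>\n${tab}\n${tab}${tab}</a>"

# Line by line: the pattern space never contains a newline, so nothing matches.
line_by_line=$(sample | sed "s|${pattern}|</a>|")

# Slurp the whole input into the pattern space first, then substitute.
slurped=$(sample | sed -e ':a' -e 'N' -e '$!ba' -e "s|${pattern}|</a>|g")

printf '%s\n' "$slurped"
```

The first command leaves the text untouched, while the second collapses the two closing tags into one.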

dazhoop · March 26, 2012, 9:00am

balajesuri, yes of course, line-by-line. A bad example of overlooking the obvious!

The Perl provided doesn't work for me; however, that is another subject for another thread.

This question is, of course, answered. Thanks!

ygemici · March 26, 2012, 9:04am

# cat file
hello
</a>

                </a>
world

# sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/'  file
hello
</a>
world

dazhoop · March 26, 2012, 9:33am

ygemici, I created the Hello World file as shown in your post and ran the sed command, but the output was identical to the input. I have also tried the following on my own file (as I think they fit it), but each produced an output file identical to the input:

 sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\x0A\x09\x0A\x09\x09<\/a>/<\/a>/'  in > out

 sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/'  in >  out

ygemici · March 26, 2012, 10:24am

in cygwin

$ cat infile
hello
</a>

                </a>
world

$ sed -e '/<\/a>/,/<\/a>/{N;N;}' -e 's/<\/a>\n\t\n\t\t<\/a>/<\/a>/' infile
hello
</a>
world

How about your input? Does it match the following?

$ sed -ne '/<\/a>/,/<\/a>/{N;N;p}' infile |od -c
0000000   <   /   a   >  \n  \t  \n  \t  \t   <   /   a   >  \n
0000016
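
As a rough guide to reading such a dump: real tabs appear as \t, a Windows carriage return would appear as \r, and spaces appear as blank columns. The sketch below assumes, purely for illustration, a copy of the sample in which the tabs were pasted as spaces; whether that is what happened to dazhoop's file is not known. pasted, expected and tolerant are hypothetical names.

```shell
# Hypothetical copy of the sample where the tabs were pasted as spaces.
pasted()   { printf '</a>\n    \n        </a>\n'; }
# What the sed commands in this thread expect: real tab characters.
expected() { printf '</a>\n\t\n\t\t</a>\n'; }

pasted   | od -c    # spaces show up as blank columns, no \t anywhere
expected | od -c    # tabs show up as \t, newlines as \n

# A more tolerant substitution: after slurping the input, accept any run of
# whitespace (spaces, tabs, CR, LF) between the two closing tags.
tolerant=$(pasted | sed -e ':a' -e 'N' -e '$!ba' -e 's|</a>[[:space:]]*</a>|</a>|g')
printf '%s\n' "$tolerant"
```

If the dump of the real file differs from the one above in any byte between the two tags, a pattern written with exact \n and \t will not match, and the tolerant form may be the easier route.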