sed fails to apply substitute commands

Adolf1994 · August 5, 2012, 4:17pm

I've made a shell script for archiving HTML pages, i.e. making them work offline and adding some features.
Here it is:

#!/bin/sh

if [ "$1" = "" ] || [ "$(echo "$1" | egrep "https?://boards.4chan.org/[a-z0-9]+/res/[0-9]+")" = "" ]; then
echo "Usage: `basename $0` <4chan thread url> <[OPTIONAL: waiting between sessions in seconds]>"
exit 0
fi

echo "4chan downloader"

LOC=$(echo "$1" | sed 's_.\+/res/\([^#]\+\).*_\1_g')

if [ "$LOC" = "" ]; then
echo "Can't determine the thread's number"
exit 0
fi

ST="static.4chan.org"

if [ "$(echo "$2" | egrep "[0-9]*" -o)" != "" ]; then
SLP=$(echo "$2" | egrep "[0-9]*" -o)
else
SLP="10"
fi

alias echo="echo -ne"
N="\r"
R="\n"

thejob () {
if [ ! -d $LOC ]; then
mkdir $LOC
fi

if [ ! -d $LOC/misc ]; then
mkdir $LOC/misc
fi

egrep "//.\.thumbs\.4chan\.org/[a-z0-9]+/thumb/[0-9]*s\.jpg" $LOC.html -o | sed 's_^//_http://_g' > $LOC/misc/misc

egrep "//${ST}/image/spoiler-?[a-z0-9]*\.png" $LOC.html -o | sed 's_^//_http://_g' | head -n1 >> $LOC/misc/misc

egrep "//${ST}/image/favicon-?[a-z]*\.ico" $LOC.html -o | sed 's_^//_http://_g' >> $LOC/misc/misc

egrep "//${ST}/image/country/([a-z]*/)?([\w]+\....)" $LOC.html -o | sed 's_^//_http://_g' >> $LOC/misc/misc

egrep "//${ST}/css/[a-z]+\.[0-9]+\.css" $LOC.html -o | sed -e 's_\.css_\.css\n_g' -e 's_//stat_\nhttp://stat_g' | grep /css/ | head -n1 >> $LOC/misc/misc

egrep "//${ST}/image/title/[a-z]+/[0-9a-z]+\.(jpg|png|gif)" $LOC.html -o | sed 's_^//_http://_g' > $LOC/misc/logo

egrep "//images\.4chan\.org/[a-z0-9]+/src/[0-9]*\.(jpg|png|gif)" $LOC.html -o | sed 's_^//_http://_g' > $LOC/images

S1="<script>navigator.userAgent.match(/Presto|Gecko/)\&\&s(d.body,'class','i');(function(){var a=d.querySelectorAll('a.fileThumb'),i=0;for(;a.length>i;i++){s(a,'onclick','javascript:imgexp(this);return false;');s(a,'title','Click to toggle image size');}})();</script>"

S2="<script>var d=document;function s(a,b,c){a.setAttribute(b,c);}function g(a,b){if('i'===b.class){0>a.top\&\&(b.scrollTop+=a.top-42);}else{0>a.top\&\&(d.documentElement.scrollTop+=a.top-42);}}function imgexp(b){var a=b.firstElementChild,c=b.getBoundingClientRect(),db=d.body;if(null===b.getAttribute('exp')){s(a,'i-old',a.src);a.getAttribute('style')\&\&(s(a,'i-olds',a.getAttribute('style')),a.removeAttribute('style'));a.src=b.href;s(b,'exp','');}else{a.getAttribute('i-olds')\&\&(s(a,'style',a.getAttribute('i-olds')),a.removeAttribute('i-olds'));a.src=a.getAttribute('i-old');a.removeAttribute('i-old');b.removeAttribute('exp');g(c,db);}}</script><style>a[exp]>img{max-width:100%;}body.i a[exp]>img{width:100%!important;}.op:after{clear:both;content:'';display:block;}</style>"

sed -e "s_//${ST}/image/favicon\(-\?[a-z]*\)\.ico_${LOC}/misc/favicon\1.ico_" -e 's_<link rel="alternate style.\+\(<link rel="apple-touch-icon" h\)_\1_' -e "s_//${ST}/css/\([a-z0-9\.]\+\)\.css_${LOC}/misc/\1.css_" -e "s_</body>_${S1}&_" -e "s_</head>_${S2}&_" -e "s_//.\.thumbs\.4chan\.org/[\w]\+/thumb/\([0-9]\+\)s\.jpg_${LOC}/misc/\1s.jpg_g" -e "s_//images\.4chan\.org/[\w]\+/src/\([0-9]\+\)\.\(jpg\|gif\|png\)_${LOC}/\1.\2_g" -e "s_//${ST}/image/title/[a-z]\+/[\w]\+\.\(jpg\|gif\|png\)_${LOC}/misc/logo.\1_g" -e "s_//${ST}/image/spoiler\(-\?[\w]*\)\.png_${LOC}/misc/spoiler\1.png_g" -e "s_//${ST}/image/country/\(\([a-z]*/\)\?\w\+\.gif\)_${LOC}/misc/\1_g" -e "s_\(<a href=\"\)${LOC}\(#p[0-9]\+\"\)_\1\2_g" -e "s_<a href=\"#p${LOC}\" class=\"quotelink\">>>${LOC}_& (OP)_g" -e 's_\(<a href="[0-9]\+\)\(#p[0-9]\+" class="quotelink">>>[0-9]\+\)_\1.html\2 (Cross-thread)_g' -e 's_\(</div></div></div><hr>\)<div class="mobile".\+</div><hr>\(<div class="navLinks navLinksBot">\[<a href="\)\.\./\(\./"[^>]*>Return</a>\] \[<a href="\).top\(">Top</a>\]\).\+</body>_\1\2\3javascript:scroll(0,0);\4<div id="bottom"></div></body>_' -e "s_<div id=.boardNavDesktop. class=.desktop.>.*\(<div class=.boardBanner.*\)<hr class=.abovePostForm./\?>.*\(<div class=.navLinks.>.<a href=.\)\.\./\(\./.*Bottom</a>\]\).*alt=../></a>\(</div><hr><a href=.ja\)_\1\2\3\4_" $LOC.html > a

# :a;N;$!ba;

mv a $LOC.html

cd $LOC

wget --continue -q -i images

rm images

cd misc

if [ "$(ls|grep css)" != "" ]; then
rm "$(ls|grep css)"
fi

wget -q -nc -i misc

CSS=$(cat misc | tail -n1 | sed 's_.*/\([a-z]\+\.[0-9]\+\.css\)_\1_')

sed "s_.*fade\(-\?[a-z]*\)\.png.*_http://${ST}/image/fade\1.png_g" $CSS > misc

wget -q -i misc

sed 's_/image/fade\(-\?[a-z]*\)\.png_fade\1.png_g' $CSS > a

mv a $CSS

if [ "$(ls|grep logo.)" != "" ]; then
rm $(ls|grep logo.)
fi

wget -q -i logo -O "logo.$(sed "s_\._\n_g" logo|tail -n1)"

rm misc logo

touch .nomedia

cd ../..
}

echo "${N}Downloading to $LOC${N}"

echo "${N}"

echo "------------------------------${N}"

while [ "1" = "1" ]; do

trap 'EXIT=1' 1 2 3 15

if [ -s $LOC.html ]; then

wget -np -nd -nH -q -erobots=off $1 -O a

if [ $(wc -c a|cut -d" " -f1) -eq "0" ]; then

echo "Thread has 404'd or 4chan is down. Stopping script${N}"

rm a

exit 0

fi

if [ $(wc -c a|cut -d" " -f1) -gt $(wc -c $LOC.html|cut -d" " -f1) ]; then

mv a $LOC.html

thejob

else

rm a

fi

else

wget -np -nd -nH -q -erobots=off $1 -O $LOC.html

if [ $(wc -c $LOC.html|cut -d" " -f1) -eq "0" ]; then

echo "Thread doesn't exist or 4chan is down. Stopping script${N}"

rm $LOC.html

exit 0

fi

thejob

fi

trap - 1 2 3 15

if [ "$EXIT" = "1" ] || [ "$SLP" = "1" ]; then
echo "Session completed. Exiting ${N}"
exit 0
fi

echo "OK"

sleep $SLP

echo "\b\b \b\b"
done;

The parts not getting applied, even though I have checked them with RegexBuddy:

-e "s_<div id=.boardNavDesktop. class=.desktop.>.*\(<div class=.boardBanner.*\)<hr class=.abovePostForm./\?>.*\(<div class=.navLinks.>.<a href=.\)\.\./\(\./.*Bottom</a>\]\).*alt=../></a>\(</div><hr><a href=.ja\)_\1\2\3\4_"
-e "s_</body>_${S1}&_"
-e "s_</head>_${S2}&_"

I've tried everything I could, but these fail to apply to the fetched HTML.
There aren't any linebreaks in there. These three should apply to the last line of the HTML, because the </head>, <body> and </body> tags are all on one line.

I'm running this via remote shell on an older Android smartphone (one I've since replaced with a new one), and it has "BusyBox v1.19.4-cm9 bionic (2012-02-05 18:40 +0100) multi-call binary" on it. I suppose it has GNU applets.
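One way to narrow a problem like this down is to pull out the last line and run a single expression against it on its own. The following is only a minimal sketch: the two-line sample file and the `<!-- injected -->` marker are made up for illustration and stand in for the fetched HTML and the S1 script tag.

```shell
#!/bin/sh
# Hypothetical two-line stand-in for the fetched HTML page.
tmp=$(mktemp)
printf '%s\n' '<html>' '<head></head><body>post</body>' > "$tmp"

# Take only the last line, then test one substitution by itself.
last=$(sed -n '$p' "$tmp")
out=$(printf '%s\n' "$last" | sed 's_</body>_<!-- injected -->&_')
printf '%s\n' "$out"

rm -f "$tmp"
```

If a single expression works in isolation but fails inside the full command, the problem is more likely in how the pieces are combined than in the regex itself.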

Corona688 · August 5, 2012, 5:42pm

I don't think busybox sed supports -e very well. It has a reasonably decent awk I think.

Could you explain what you're actually trying to do here? Maybe there's a more direct way.

Adolf1994 · August 5, 2012, 5:53pm

I haven't noticed it having problems with -e.
As you can see, there are plenty of them, and most of them work, except for these three.
I haven't looked into anything but sed so far.

The first one's supposed to get rid of garbage that's not really useful once the targeted HTML page has been modified for offline use. However, this one sometimes works when the HTML page is really small, i.e. around or below 100 kB. I thought it might be a buffer problem, but after some research I found out that GNU sed has no limit.
The other two are supposed to inject some JavaScript and CSS to add some really handy features to the HTML.

Edit: on second thought, I have no idea why I included the third command with the S2 variable, because that one works well.

And thanks for the quick reply.

Corona688 · August 5, 2012, 6:00pm

BusyBox is profoundly not the standard Linux GNU utilities. It can do a surprising amount, but corners had to be cut to fit that much functionality into one executable. I wouldn't be surprised if its sed -e was imperfect. What surprises me is that it exists at all.

If you're doing 10,000 greps and seds on the same file, it might be time to consider a language like awk or perl. I bet you don't have perl on that thing, though.

Adolf1994 · August 5, 2012, 8:24pm

Hmmm, afaik there's a thing called super sed that doesn't have any dependencies. Maybe if I try and compile that with an Android NDK toolchain?

I'll first try dividing that big pile of sed commands.

---------- Post updated at 12:23 AM ---------- Previous update was at 12:11 AM ----------

Ok, dividing the pile didn't help.
So, how's that awk again? The available options are -v, -F and -f. I like how you can set a variable there. It's a bit like Javascript, imo

---------- Post updated at 02:24 AM ---------- Previous update was at 12:23 AM ----------

Good news. The awk in the busybox seems to be gawk, because gensub worked with it. I'll try to mess around with this until it works.

alister · August 6, 2012, 9:27am

It is not gawk. gawk is much larger than busybox's awk implementation. However, one of Busybox's goals is to emulate GNU behavior with the features they implement.

Regards,
Alister

Corona688 · August 6, 2012, 10:41am

No, no it is not. I refer you to my earlier post:

It may have gsub, but doesn't have functions.

Adolf1994 · August 6, 2012, 11:48am

okokok, I now realized I can do this with sed. The things I've got in my head:

move the last line I'm working on into a new file
merge sed commands with an alternation switch wherever possible
get rid of the .* and replace it with .\{,n\}, where n is a few hundred more than the length I expect the * to match, to keep the command flexible

Anyway, I curse GNU BRE for not having lazy search. With the *, RegexBuddy needed around 54000 steps to match the first of the three commands, even though I'd deleted the content that usually comes after the selection. So I thought I'd just replace the * with \{,n\} for fewer steps.
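POSIX basic regular expressions have no lazy quantifier, and a bounded repeat such as .\{0,n\} is still greedy within its bound. A common workaround, shown here as a minimal sketch on a made-up input line, is to replace .* with a negated bracket expression that cannot run past the next delimiter. The `<b>` tags and the `X` replacement are assumptions chosen only to make the difference visible.

```shell
#!/bin/sh
# Made-up input; the tag names are only for illustration.
line='<b>one</b> and <b>two</b>'

# Greedy: .* runs to the LAST </b> on the line.
greedy=$(printf '%s\n' "$line" | sed 's_<b>.*</b>_X_')

# Bounded but still greedy: it takes as many characters as the bound allows.
bounded=$(printf '%s\n' "$line" | sed 's_<b>.\{0,100\}</b>_X_')

# "Lazy" emulation: [^<]* cannot cross the next "<",
# so the match has to stop at the first </b>.
lazy=$(printf '%s\n' "$line" | sed 's_<b>[^<]*</b>_X_')

printf '%s\n' "greedy:  $greedy" "bounded: $bounded" "lazy:    $lazy"
```

The negated bracket only works when the text to be skipped is known not to contain the delimiter, so it is not a drop-in replacement for every .* in the script.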

Corona688 · August 7, 2012, 12:02pm

For the third time:

AFAIK only Perl can do non-greedy regex matching, and even then not by default. And it may not be terribly efficient in doing so.

I'm still not convinced you're even using the right tool for the job.

alister · August 7, 2012, 12:40pm

Momentary sidebar ...

At least Python, Ruby, and PHP also support lazy quantifiers.

Regards,
Alister

Adolf1994 · August 7, 2012, 2:29pm

Well, I've looked into the grymoire sed page and decided to use groups and such tasty stuff:

    sed -e '1 s/.\{1,31\}$//' -e '2,10 d' -e '/^<meta / {
        s_<link[^>].\{1,100\}xml"/>\(<title>[^>]\+</title>\).\+_\1_
        s_//'$ST'/image/\(favicon-\?[a-z]\{0,10\}\.ico\)_'$LOC'/misc/\1_
        s_<link rel="alternate style.\+\(<link rel="apple-touch-icon" h\)_\1_
        s_//'$ST'/css/\([a-z0-9\.]\{1,25\}\.css\)_'$LOC'/misc/\1_
        }' -e '3,16 d' -e '$ {
        s_\(<div id="\)pim\([0-9]\{1,100\}\).\{1,1000\}\(\1\(pi\|f\)\2\)_\3_g
        s_//.\.thumbs\.4chan\.org/[a-z0-9]\{1,10\}/thumb/\([0-9]\{1,25\}s\.jpg\)_'$LOC'/misc/\1_g
        s_//images\.4chan\.org/[a-z0-9]\{1,10\}/src/\([0-9]\{1,25\}\.\)\(jpg\|gif\|png\)_'$LOC'/\1\2_g
        s_//'$ST'/image/title/[a-z]\{1,10\}/[a-z0-9]\{1,100\}\.\(jpg\|gif\|png\)_'$LOC'/misc/logo.\1_g
        s_//'$ST'/image/\(spoiler-\?[a-z0-9]\{0,10\}\....\)_'$LOC'/misc/\1_g
        s_//'$ST'/image/country/\(\([a-z]\{0,25\}/\)\?[a-z0-9]\{1,25\}\....\)_'$LOC'/misc/\1_g
        s_\(<a href="\)'$LOC'\(#p\{1,100\}"\)_\1\2_g
        s_<a href="#p'$LOC'" class="quotelink">>>'$LOC'_& (OP)_g
        s_\(<a href="[0-9]\{1,100\}\)\(#p[0-9]\{1,100\}" class="quotelink">>>[0-9]\{1,100\}\)_\1.html\2 (Cross-thread)_g
        s_\(</div></div></div><hr>\)<div class="mobile".\+</div><hr>\(<div class="navLinks navLinksBot">\[<a href="\)\.\./\(\./"[^>]\{0,100\}>Return</a>\] \[<a href="\)#top\(">Top</a>\]\).\+</body>_\1</form>\2\3javascript:scroll(0,0);\4<div id=bottom></div>'$S1'</body>_
        s_^\(.\{1,39\}\)<div id="boardNavDesktop" class="desktop">.\{0,7800\}\(<div class="boardBanner"\{0,250\}\)<hr class="abovePostForm"/\?>.\{0,400\}\(<div class="navLinks">.<a href="\)\.\./\(\./.\{0,100\}\)#bottom\(">Bottom</a>]\).\{0,4000\}alt=""/></a>\(</div><hr><a href="ja\)_'$S2'\1\2\3\4javascript:scroll(0,d.documentElement.scrollHeight)\5\6_
        s_href="//_href="http://_g
        }' $LOC.html > a

However, this returns unmatched _
Every regexp here is RegexBuddy verified.

Any other idea?

Corona688 · August 7, 2012, 3:07pm

Perhaps regex buddy isn't as perfect as one would want to assume...

It depends what you're trying to do. Without reverse-engineering that enormous pile of regexes, there's little for us to work from.

But awk can be used to rip apart tags based on <, letting you handle things piecemeal.
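As a rough sketch of that idea, setting awk's record separator to "<" turns every tag into its own record, which can then be handled one at a time. The sample HTML string below is made up, and the program only prints the pieces; it is not a replacement for the sed commands in this thread.

```shell
#!/bin/sh
# Made-up HTML that sits on a single line, like the fetched page.
html='<p>Hello</p><a href="x">link</a>'

tags=$(printf '%s' "$html" | awk '
    BEGIN { RS = "<" }                  # split the input at every "<"
    NR == 1 { if (length($0) > 0) print; next }   # text before the first tag, if any
    { print "<" $0 }                    # put back the "<" that RS consumed
')
printf '%s\n' "$tags"
```

Each output line then starts with exactly one tag, so later rules can match on the tag name instead of relying on long .* spans.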

Adolf1994 · August 7, 2012, 3:22pm

This script serves the sole purpose of making threads work offline and getting rid of junk not needed there.
Anyway, is there a character that indicates the end of a command in groups?
awk is actually a language, too, right? But that again would take me a lot of time to study, and I'd like to get this script working by tomorrow. There will be a quest thread that I'd like to archive.
If everything else fails, I'll just go back to use -e for every command instead of groups.

vgersh99 · August 7, 2012, 3:38pm

If you give a sample input and a desired output, we might be able to do something. Maybe a one-step-at-a-time approach is in order. 'sed' seems over-complicated for my taste...

Adolf1994 · August 7, 2012, 3:53pm

This one thread could be an input:

(name of the script you save it as, i like to call it threaddl) http://boards.4chan.org/tg/res/20218467 1

This way it will exit once everything is done.

The desired output would be really long, whether explained or pasted in as HTML.

Corona688 · August 7, 2012, 4:00pm

If you can't actually explain what you need, we can't actually help you.

Adolf1994 · August 7, 2012, 4:30pm

I'll explain the regexes then. It'll take some time.

---------- Post updated at 10:30 PM ---------- Previous update was at 10:10 PM ----------

Reverse engineered:

# removes a script tag
-e '1 s/.\{1,31\}$//' -e '2,10 d'

# remove rss link tag and more script tags
-e '/^<meta / s_<link[^>]\{1,100\}xml"/>\(<title>[^>]\+</title>\).\+_\1_'

# make favicon source a relative link
# ST="static.4chan.org"
# LOC=the input thread's number
-e '/^<meta / s_//'$ST'/image/\(favicon-\?[a-z]\{0,10\}\.ico\)_'$LOC'/misc/\1_'

# remove alternative style sheets
-e '/^<meta / s_<link rel="alternate style.\+\(<link rel="apple-touch-icon" h\)_\1_'

# make default style sheet's source associated to the board a relative link
-e '/^<meta / s_//'$ST'/css/\([a-z0-9\.]\{1,25\}\.css\)_'$LOC'/misc/\1_'

# remove some more script tag(s)
-e '3,16 d'

# remove name, tripcode and post number nodes that are only for mobile view
-e '$ s_\(<div id="\)pim\([0-9]\{1,25\}\).\{1,1000\}\(\1\(pi\|f\)\2\)_\3_g'

# give thumbnails, full sized pictures, logo, spoiler images, country flags relative source link
-e '$ s_//.\.thumbs\.4chan\.org/[a-z0-9]\{1,10\}/thumb/\([0-9]\{1,25\}s\.jpg\)_'$LOC'/misc/\1_g' -e '$ s_//images\.4chan\.org/[a-z0-9]\{1,10\}/src/\([0-9]\{1,25\}\.\)\(jpg\|gif\|png\)_'$LOC'/\1\2_g' -e '$ s_//'$ST'/image/title/[a-z]\{1,10\}/[a-z0-9]\{1,100\}\.\(jpg\|gif\|png\)_'$LOC'/misc/logo.\1_g' -e '$ s_//'$ST'/image/\(spoiler-\?[a-z0-9]\{0,10\}\....\)_'$LOC'/misc/\1_g' -e '$ s_//'$ST'/image/country/\(\([a-z]\{0,25\}/\)\?[a-z0-9]\{1,25\}\....\)_'$LOC'/misc/\1_g'

# point quote links to relative target and mark the ones with a target of OP or a cross thread post
-e '$ s_\(<a href="\)'$LOC'\(#p\{1,100\}"\)_\1\2_g' -e '$ s_<a href="#p'$LOC'" class="quotelink">>>'$LOC'_& (OP)_g' -e '$ s_\(<a href="[0-9]\{1,100\}\)\(#p[0-9]\{1,100\}" class="quotelink">>>[0-9]\{1,100\}\)_\1.html\2 (Cross-thread)_g'

# remove board link list, report/delete form, theme selector from the bottom of the html page, correct the Return and Top links and inject script
# S1=a script tag
-e '$ s_\(</div></div></div><hr>\)<div class="mobile".\+</div><hr>\(<div class="navLinks navLinksBot">\[<a href="\)\.\./\(\./"[^>]\{0,100\}>Return</a>\] \[<a href="\)#top\(">Top</a>\]\).\+</body>_\1</form>\2\3javascript:scroll(0,0);\4<div id=bottom></div>'$S1'</body>_'

# remove board link list, settings, posting form, correct Return and Bottom links, preserve the logo image and text and announcement, inject a script and a style tag to add image expansion feature
# S2= a script and a style tag
-e '$ s_^\(.\{1,39\}\)<div id="boardNavDesktop" class="desktop">.\{0,7800\}\(<div class="boardBanner"\{0,250\}\)<hr class="abovePostForm"/\?>.\{0,400\}\(<div class="navLinks">.<a href="\)\.\./\(\./.\{0,100\}\)#bottom\(">Bottom</a>]\).\{0,4000\}alt=""/></a>\(</div><hr><a href="ja\)_'$S2'\1\2\3\4javascript:scroll(0,d.documentElement.scrollHeight)\5\6_'

# add the http protocol to links in a tags
-e '$ s_<a\(.\{1,1000\}\)href="//_<a\1href="http://_g'

bakunin · August 7, 2012, 5:46pm

Some observations (I can't completely debug the sed code in a short time, but maybe some pointers might help):

# removes a script tag
-e '2,10 d'
...
# remove some more script tag(s)
-e '3,16 d'

can't these be combined?

Another thing is this (or variations), which you use quite often. You use shell variable expansion inside a regexp:

s_'$ST'_'$LOC'_

This is OK in principle, but you have to make sure the variables' contents don't have regexp metacharacters in them, since that might lead to unintended matches. It might be safer to escape the "." to "\.", for instance:

# ST="static.4chan.org"
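A minimal sketch of that escaping follows, assuming the variable is only ever used on the pattern side. The `ST_RE` name and the sample input are made up for illustration.

```shell
#!/bin/sh
ST="static.4chan.org"

# Escape every "." so that it only matches a literal dot.
ST_RE=$(printf '%s\n' "$ST" | sed 's_\._\\._g')

# Made-up input where other characters sit where the dots would be.
input='//staticX4chanYorg/css/a.css'

# Unescaped: each "." matches any character, so this input is rewritten.
loose=$(printf '%s\n' "$input" | sed "s_//${ST}/_HIT/_")

# Escaped: only literal dots match, so this input is left alone.
strict=$(printf '%s\n' "$input" | sed "s_//${ST_RE}/_HIT/_")

printf '%s\n' "$ST_RE" "$loose" "$strict"
```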

Also, I am not sure whether interrupting the single quotes this way breaks the string. Maybe

s_'"$ST"'_'"$LOC"'_

would be a safer way to achieve what you want.
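The difference can be seen by counting the arguments the shell actually passes. This is a minimal sketch: `count_args` is a hypothetical helper, and the shortened S1 value is a stand-in for the real script tag, which also contains spaces.

```shell
#!/bin/sh
# Shortened stand-in for the real S1; what matters is that it contains spaces.
S1='Click to toggle image size'

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion between the single-quoted pieces: the shell splits
# the value at each space, so sed would receive several broken fragments.
unquoted=$(count_args 's_</body>_'$S1'&_')

# Double-quoted expansion: the whole expression stays one argument.
quoted=$(count_args 's_</body>_'"$S1"'&_')

printf '%s\n' "unquoted: $unquoted argument(s)" "quoted: $quoted argument(s)"
```

With the unquoted form, the first fragment ends in the middle of the s command, which is consistent with an error about an unmatched delimiter.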

I would put back the "*" instead of the "\{1,1000\}" constructs you used. Actually, I can't believe someone went to the effort of writing a sed port and failed to implement something as basic as the metacharacter "*" properly.

I hope this helps.

bakunin

Adolf1994 · August 7, 2012, 6:55pm

Thanks for pointing out these, bakunin.
I use the {n,m} format to reduce the number of backtracks.
I actually found problems where I have to use the S1 and S2 variables.
I'll be experimenting with what you suggested.

---------- Post updated at 12:55 AM ---------- Previous update was at 12:45 AM ----------

And now that the unmatched _ is solved, the command with the S2 variable just wouldn't work.

244an · August 7, 2012, 7:42pm

I haven't studied the whole thread in detail, but perhaps I saw why your S2 variable doesn't work. Is it used in a regexp? The S2 variable has some "[" in it, e.g. a[exp], and that will be interpreted as part of a regular expression.
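Which characters matter depends on which side of the s command the variable ends up on. On the pattern side, "[" starts a bracket expression; on the replacement side, only the backslash, "&", the delimiter and a newline are special. The sketch below assumes the value is used as replacement text with "_" as the delimiter and contains no newline; `escape_repl` is a hypothetical helper and the sample string is made up.

```shell
#!/bin/sh
# Hypothetical helper: escape "\", "&" and the "_" delimiter so the value
# can be used as literal replacement text. Assumes a single-line value.
escape_repl() {
    printf '%s\n' "$1" | sed 's/[\\&_]/\\&/g'
}

# Made-up value with a bracket and ampersands, loosely modeled on S2.
css='a[exp]>img{} x&&y'
repl=$(escape_repl "$css")

# The "[" needs no escaping here; the trailing unescaped "&" re-inserts
# the matched </head>, as in the original command.
out=$(printf '%s\n' '<head></head>' | sed "s_</head>_${repl}&_")

printf '%s\n' "$repl" "$out"
```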