I've written a shell script for archiving HTML pages, i.e. making them work offline and adding a few features.
Here it is:
#!/bin/sh
if [ "$1" = "" ] || [ "$(echo "$1" | egrep "https?://boards.4chan.org/[a-z0-9]+/res/[0-9]+")" = "" ]; then
echo "Usage: $(basename "$0") <4chan thread url> <[OPTIONAL: waiting between sessions in seconds]>"
exit 0
fi
echo "4chan downloader"
LOC=$(echo "$1" | sed 's_.\+/res/\([^#]\+\).*_\1_g')
if [ "$LOC" = "" ]; then
echo "Can't determine the thread's number"
exit 0
fi
ST="static.4chan.org"
if [ "$(echo "$2" | egrep "[0-9]+" -o)" != "" ]; then
SLP=$(echo "$2" | egrep "[0-9]+" -o)
else
SLP="10"
fi
alias echo="echo -ne"
N="\r"
R="\n"
thejob () {
if [ ! -d $LOC ]; then
mkdir $LOC
fi
if [ ! -d $LOC/misc ]; then
mkdir $LOC/misc
fi
egrep "//.\.thumbs\.4chan\.org/[a-z0-9]+/thumb/[0-9]*s\.jpg" $LOC.html -o | sed 's_^//_http://_g' > $LOC/misc/misc
egrep "//${ST}/image/spoiler-?[a-z0-9]*\.png" $LOC.html -o | sed 's_^//_http://_g' | head -n1 >> $LOC/misc/misc
egrep "//${ST}/image/favicon-?[a-z]*\.ico" $LOC.html -o | sed 's_^//_http://_g' >> $LOC/misc/misc
egrep "//${ST}/image/country/([a-z]*/)?([A-Za-z0-9_]+\....)" $LOC.html -o | sed 's_^//_http://_g' >> $LOC/misc/misc
egrep "//${ST}/css/[a-z]+\.[0-9]+\.css" $LOC.html -o | sed -e 's_\.css_\.css\n_g' -e 's_//stat_\nhttp://stat_g' | grep /css/ | head -n1 >> $LOC/misc/misc
egrep "//${ST}/image/title/[a-z]+/[0-9a-z]+\.(jpg|png|gif)" $LOC.html -o | sed 's_^//_http://_g' > $LOC/misc/logo
egrep "//images\.4chan\.org/[a-z0-9]+/src/[0-9]*\.(jpg|png|gif)" $LOC.html -o | sed 's_^//_http://_g' > $LOC/images
S1="<script>navigator.userAgent.match(/Presto|Gecko/)\&\&s(d.body,'class','i');(function(){var a=d.querySelectorAll('a.fileThumb'),i=0;for(;a.length>i;i++){s(a[i],'onclick','javascript:imgexp(this);return false;');s(a[i],'title','Click to toggle image size');}})();</script>"
S2="<script>var d=document;function s(a,b,c){a.setAttribute(b,c);}function g(a,b){if('i'===b.class){0>a.top\&\&(b.scrollTop+=a.top-42);}else{0>a.top\&\&(d.documentElement.scrollTop+=a.top-42);}}function imgexp(b){var a=b.firstElementChild,c=b.getBoundingClientRect(),db=d.body;if(null===b.getAttribute('exp')){s(a,'i-old',a.src);a.getAttribute('style')\&\&(s(a,'i-olds',a.getAttribute('style')),a.removeAttribute('style'));a.src=b.href;s(b,'exp','');}else{a.getAttribute('i-olds')\&\&(s(a,'style',a.getAttribute('i-olds')),a.removeAttribute('i-olds'));a.src=a.getAttribute('i-old');a.removeAttribute('i-old');b.removeAttribute('exp');g(c,db);}}</script><style>a[exp]>img{max-width:100%;}body.i a[exp]>img{width:100%!important;}.op:after{clear:both;content:'';display:block;}</style>"
sed -e "s_//${ST}/image/favicon\(-\?[a-z]*\)\.ico_${LOC}/misc/favicon\1.ico_" -e 's_<link rel="alternate style.\+\(<link rel="apple-touch-icon" h\)_\1_' -e "s_//${ST}/css/\([a-z0-9\.]\+\)\.css_${LOC}/misc/\1.css_" -e "s_</body>_${S1}&_" -e "s_</head>_${S2}&_" -e "s_//.\.thumbs\.4chan\.org/[\w]\+/thumb/\([0-9]\+\)s\.jpg_${LOC}/misc/\1s.jpg_g" -e "s_//images\.4chan\.org/[\w]\+/src/\([0-9]\+\)\.\(jpg\|gif\|png\)_${LOC}/\1.\2_g" -e "s_//${ST}/image/title/[a-z]\+/[\w]\+\.\(jpg\|gif\|png\)_${LOC}/misc/logo.\1_g" -e "s_//${ST}/image/spoiler\(-\?[\w]*\)\.png_${LOC}/misc/spoiler\1.png_g" -e "s_//${ST}/image/country/\(\([a-z]*/\)\?\w\+\.gif\)_${LOC}/misc/\1_g" -e "s_\(<a href=\"\)${LOC}\(#p[0-9]\+\"\)_\1\2_g" -e "s_<a href=\"#p${LOC}\" class=\"quotelink\">>>${LOC}_& (OP)_g" -e 's_\(<a href="[0-9]\+\)\(#p[0-9]\+" class="quotelink">>>[0-9]\+\)_\1.html\2 (Cross-thread)_g' -e 's_\(</div></div></div><hr>\)<div class="mobile".\+</div><hr>\(<div class="navLinks navLinksBot">\[<a href="\)\.\./\(\./"[^>]*>Return</a>\] \[<a href="\).top\(">Top</a>\]\).\+</body>_\1\2\3javascript:scroll(0,0);\4<div id="bottom"></div></body>_' -e "s_<div id=.boardNavDesktop. class=.desktop.>.*\(<div class=.boardBanner.*\)<hr class=.abovePostForm./\?>.*\(<div class=.navLinks.>.<a href=.\)\.\./\(\./.*Bottom</a>\]\).*alt=../></a>\(</div><hr><a href=.ja\)_\1\2\3\4_" $LOC.html > a
# :a;N;$!ba;
mv a $LOC.html
cd $LOC
wget --continue -q -i images
rm images
cd misc
if [ "$(ls|grep css)" != "" ]; then
rm "$(ls|grep css)"
fi
wget -q -nc -i misc
CSS=$(cat misc | tail -n1 | sed 's_.*/\([a-z]\+\.[0-9]\+\.css\)_\1_')
sed "s_.*fade\(-\?[a-z]*\)\.png.*_http://${ST}/image/fade\1.png_g" $CSS > misc
wget -q -i misc
sed 's_/image/fade\(-\?[a-z]*\)\.png_fade\1.png_g' $CSS > a
mv a $CSS
if [ "$(ls|grep logo.)" != "" ]; then
rm $(ls|grep logo.)
fi
wget -q -i logo -O "logo.$(sed "s_\._\n_g" logo|tail -n1)"
rm misc logo
touch .nomedia
cd ../..
}
echo "${N}Downloading to $LOC${N}"
echo "${N}"
echo "------------------------------${N}"
while [ "1" = "1" ]; do
trap 'EXIT=1' 1 2 3 15
if [ -s $LOC.html ]; then
wget -np -nd -nH -q -erobots=off "$1" -O a
if [ $(wc -c a|cut -d" " -f1) -eq "0" ]; then
echo "Thread has 404'd or 4chan is down. Stopping script${N}"
rm a
exit 0
fi
if [ $(wc -c a|cut -d" " -f1) -gt $(wc -c $LOC.html|cut -d" " -f1) ]; then
mv a $LOC.html
thejob
else
rm a
fi
else
wget -np -nd -nH -q -erobots=off "$1" -O $LOC.html
if [ $(wc -c $LOC.html|cut -d" " -f1) -eq "0" ]; then
echo "Thread doesn't exist or 4chan is down. Stopping script${N}"
rm $LOC.html
exit 0
fi
thejob
fi
trap - 1 2 3 15
if [ "$EXIT" = "1" ] || [ "$SLP" = "1" ]; then
echo "Session completed. Exiting ${N}"
exit 0
fi
echo "OK"
sleep $SLP
echo "\b\b \b\b"
done;
The following parts are not being applied, even though I have checked them with RegexBuddy:
-e "s_<div id=.boardNavDesktop. class=.desktop.>.*\(<div class=.boardBanner.*\)<hr class=.abovePostForm./\?>.*\(<div class=.navLinks.>.<a href=.\)\.\./\(\./.*Bottom</a>\]\).*alt=../></a>\(</div><hr><a href=.ja\)_\1\2\3\4_"
-e "s_</body>_${S1}&_"
-e "s_</head>_${S2}&_"
I've tried everything I could, but these fail to apply to the fetched HTML.
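One way to narrow this down is to run the two simpler expressions on their own against a one-line sample. This is only a sketch: `SAMPLE` is a made-up stand-in for the fetched page, and `S1`/`S2` are shortened placeholders rather than the real script blocks. If both placeholders end up in front of their tags when this runs on the device, the patterns themselves are probably fine and the cause is more likely the real replacement strings or an earlier expression in the same sed call.

```shell
#!/bin/sh
# Hypothetical one-line sample standing in for the fetched page.
SAMPLE='<html><head><title>t</title></head><body><p>x</p></body></html>'

# Shortened stand-ins for S1 and S2 (assumptions, not the real blocks).
# The backslash before each ampersand keeps it literal in the sed replacement;
# an unescaped & would be replaced by the matched text.
S1="<script>a\&\&b;</script>"
S2="<script>var d=document;</script>"

# Same two expressions as in the script, applied on their own.
RESULT=$(printf '%s\n' "$SAMPLE" | sed -e "s_</body>_${S1}&_" -e "s_</head>_${S2}&_")
printf '%s\n' "$RESULT"
```

Adding the remaining expressions back one at a time should show which one changes the outcome.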
There aren't any line breaks in that part of the page. These three expressions should apply to the last line of the HTML, because the </head>, <body> and </body> tags are all on one line.
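The assumption that all three tags share a line can be checked directly. The sketch below prints the line number on which each tag first appears. `tag_lines` is a hypothetical helper, and the temporary sample file only exists so the snippet runs without network access.

```shell
#!/bin/sh
# Hypothetical helper: print the line number of the first occurrence of each tag in file $1.
tag_lines () {
    for tag in '</head>' '<body' '</body>'; do
        printf '%s %s\n' "$tag" "$(grep -n -- "$tag" "$1" | head -n1 | cut -d: -f1)"
    done
}

# Made-up two-line sample; on the device the saved thread page would be used instead.
SAMPLE="${TMPDIR:-/tmp}/tagcheck.$$.html"
printf '%s\n' '<!DOCTYPE html>' '<html><head></head><body><p>x</p></body></html>' > "$SAMPLE"

TAGS=$(tag_lines "$SAMPLE")
printf '%s\n' "$TAGS"
rm -f "$SAMPLE"
```

Against the real page this would be called as `tag_lines "$LOC.html"` right after the download. If the three numbers differ, the tags are not on one line after all.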
I'm running this over a remote shell on an older Android smartphone that I've since replaced with a new one. It reports "BusyBox v1.19.4-cm9 bionic (2012-02-05 18:40 +0100) multi-call binary". I suppose its applets behave like the GNU tools.
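Whether the sed on the device really accepts the GNU-style syntax the script relies on can be probed rather than assumed. The sketch below feeds tiny inputs through individual constructs and reports whether the result matches what a GNU-compatible sed would produce. `probe` is a hypothetical helper, and the list of constructs is only a selection taken from the script. In a POSIX bracket expression a backslash is literal, so `[\w]` would normally match a backslash or the letter `w` rather than a word character, which is why it gets its own probe.

```shell
#!/bin/sh
# Hypothetical helper.
# $1: label, $2: sed expression, $3: input, $4: output expected if the construct is supported
probe () {
    out=$(printf '%s\n' "$3" | sed -e "$2" 2>/dev/null)
    if [ "$out" = "$4" ]; then
        printf '%s: supported\n' "$1"
    else
        printf '%s: NOT supported\n' "$1"
    fi
}

probe 'BRE \+ (one or more)'     's_a\+_X_'       'caab' 'cXb'
probe 'BRE \? (optional)'        's_ab\?c_X_'     'ac'   'X'
probe 'BRE \| (alternation)'     's_\(a\|b\)c_X_' 'bc'   'X'
probe '\w outside brackets'      's_\w\+_X_'      'abc-' 'X-'
probe '[\w] as a word class'     's_[\w]\+_X_'    'abc-' 'X-'
```

Any line reported as not supported points at expressions in the script that silently fail to match on that sed build.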