Script to extract forum posts

KidCactus · February 1, 2011, 5:15am

What I have:

thread_id=666&page=6666#666666">Post title 1</a><br><div style="padding:2px 0px 3px 0px;">Text from the post itself</div>

thread_id=666&page=6666#666666">Post title 2</a><br><div style="padding:2px 0px 3px 0px;">Text from the post itself</div>

thread_id=666&page=6666#666666">Post title 3</a><br><div style="padding:2px 0px 3px 0px;">Text from the post itself</div>

What I want as result:

Post title 1

Text from the post itself

Post title 2

Text from the post itself

Post title 3

Text from the post itself

I assume there is a fairly easy way to do this with awk?

Klashxx · February 1, 2011, 5:45am

One way:

perl -ne  'if(/>([^>]*)<.*div.*>([^>]*)</){print $1."\n\n".$2."\n\n";}' file

malcomex999 · February 1, 2011, 5:48am

With awk...

awk -F"[><]" 'NF{print $2"\n"$(NF-2)}' infile

Klashxx · February 1, 2011, 5:48am

Or:

awk '{print $2"\n\n"$8"\n"}' FS='(>)|(<)' RS= file
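
To see why $2 and $8 (which is also $(NF-2) here) pick out the title and the text, it can help to print every field awk produces when it splits on > and <. This is only a sketch run on the first sample line from this thread; show_fields is a made-up helper name, not part of any answer above.

```shell
# Sample line copied from the first post in this thread.
sample='thread_id=666&page=6666#666666">Post title 1</a><br><div style="padding:2px 0px 3px 0px;">Text from the post itself</div>'

# Print each field with its index. Adjacent tags such as </a><br>
# produce empty fields, which is why the text lands in field 8.
show_fields() {
    awk -F'[><]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

printf '%s\n' "$sample" | show_fields
```

The field numbers only hold while every post has exactly this tag layout; an extra tag inside the title or text shifts them.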

KidCactus · February 1, 2011, 5:53am

Thanks everyone.

Is there any way to apply this to a full HTML page, which contains a lot more than the "What I have" sample? That is, find each thread_id= and take everything from there up to the first </div>.

malcomex999 · February 1, 2011, 5:59am

How about showing what you have, with an example?

KidCactus · February 1, 2011, 6:07am

That would of course be a better idea.

I have this html file (attached to the post), and I want to cut out all text between:

thread_id=666&page=6666#666666">

and

</a>

And then between

<div style="padding:2px 0px 3px 0px;">

and

</div>

wherever that occurs in the html file.

Klashxx · February 1, 2011, 6:43am

Use:

 perl -ne  'if(/thread_id[^>]*>([^>]*)<\/a.*div style[^>]*>([^>]*)<\/div>/){print $1."\n\n".$2."\n\n";}' htmlFile

KidCactus · February 1, 2011, 6:47am

That doesn't seem to catch 'em all, though. It results in 4 titles and posts out of 25.

Looking at the result, I think it skips every post that contains a quote (represented as [user:quoted text]) or a link. For example, these two aren't caught:

thread_id=8083&page=3034#1299591">Sandlådan - Prataomvadsomhelstnärsomhelst-tråden</a><br><div style="padding:2px 0px 3px 0px;">[The Ultra:De är menade att skjutas upp i luften egentligen.]
Den där typen av flares är ju inte till för att skjutas upp.</div>

thread_id=48046&page=1#1287691"><b>Arcaflex...?</b></a><br><div style="padding:2px 0px 3px 0px;">www.gamer.se

Youmeet, vad är upp?</div>

Or is it maybe the <b> </b> that breaks it in the second example?

This is the URL the HTML file comes from:

Gameplayer.se - Inlägg - KidCactus

Klashxx · February 1, 2011, 8:48am

A little dirty but:

perl -pe 's/\n//g' html > html2

Then:

perl -ne  'while(/<a[^>]*thread_id[^>]*>([^>]*)<\/a>[^<]*<br>[^>]*<div style[^>]*>([^>]*)<\/div>/g){print $1."\n\n".$2."\n\n";}' html2

KidCactus · February 1, 2011, 4:38pm

Dirty or not, I piped the first one to the second one, and it works like a charm! Thank you so much.

---------- Post updated at 10:38 PM ---------- Previous update was at 02:53 PM ----------

If someone has time, I need some additional help with the following:

In an HTML file, I have this text:

KidCactus';"><div class="forum_thread_text"><span class="forum_text_quote"><strong>Andreas Berg:</strong> Det var inte många år sedan jag tog hjälp av Google för att koka ett ägg <img src="http://gameplayer.se/gfx/smilies/blush.gif" alt="[blush]" border=0 width=15 height=15> (till mitt försvar äter jag i princip aldrig ägg och har knappt gjort det alls, så det har inte riktigt funnits anledning för mig att veta hur länge ett ägg ska koka <img src="http://gameplayer.se/gfx/smilies/crazy.gif" alt="[crazy]" border=0 width=15 height=15>)</span><br/>Är du från <a class="forum_text_url" href="http://www.svd.se/nyheter/inrikes/artikel_774535.svd" target="_blank">Storbritannien</a>?<br/><br/>Jag googlar rätt ofta för att rättstava ord, eller för att kolla om vissa ord ens existerar utanför min hjärna.</div>

Anywhere in the file where this is found:

KidCactus';"><div class="forum_thread_text">

I want to cut out the text between that and:

</div>

So the result would be:

<span class="forum_text_quote"><strong>Andreas Berg:</strong> Det var inte många år sedan jag tog hjälp av Google för att koka ett ägg <img src="http://gameplayer.se/gfx/smilies/blush.gif" alt="[blush]" border=0 width=15 height=15> (till mitt försvar äter jag i princip aldrig ägg och har knappt gjort det alls, så det har inte riktigt funnits anledning för mig att veta hur länge ett ägg ska koka <img src="http://gameplayer.se/gfx/smilies/crazy.gif" alt="[crazy]" border=0 width=15 height=15>)</span><br/>Är du från <a class="forum_text_url" href="http://www.svd.se/nyheter/inrikes/artikel_774535.svd" target="_blank">Storbritannien</a>?<br/><br/>Jag googlar rätt ofta för att rättstava ord, eller för att kolla om vissa ord ens existerar utanför min hjärna.

If the <br/> could also be converted to a newline at the same time, that would be awesome. I have tried this, but I guess something is wrong, since I get no output at all:

perl -pne 's/\n//g' input.txt | perl -ne 'while(/KidCactus[^>]*forum_thread_text[^>]*>([^>]*)<\/div>/g){print $1."\n\n";}' > output.txt

Klashxx · February 2, 2011, 5:12am

Try this:

perl -ne 'while(m/KidCactus.;\"><div class=\"forum_thread_text\">(.*?)<\/div>/g){$a=$1;$a=~s/<br\/>/\n/g;print $a."\n";}' input.txt > output.txt