How to remove all text except pattern

Lukasito · January 4, 2010, 4:45pm

I have a nasty HTML file with 2000+ symbols in one row. I need to remove all of the code except title="Some title..." and store those titles in a file (the whole text is in the variable text).
I've tried something like this:

echo $text | sed 's/.*\(title=\".+\"\).*/\1/' > titles.html

But it does not work: the script runs fine, yet nothing changes. The file "titles.html" looks just like the text variable.

Can someone help me, please?

alister · January 4, 2010, 4:59pm

Try:

echo "$text" | sed 's/^.*\(title="[^"]*"\).*$/\1/' > titles.html

In sed's default basic regular expression grammar, + is not special; it is a literal plus sign.
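A minimal sketch of that point, using made-up input strings rather than the poster's data. It assumes a POSIX-style sed and compares a literal "+" with the portable interval \{1,\}:

```shell
# In a basic regular expression (sed's default), "+" matches a literal plus sign.
literal=$(printf '%s\n' 'a+b' | sed 's/a+b/MATCH/')

# ".+" therefore means "any character followed by a literal +".
# The sample line has no "+", so it passes through unchanged.
unchanged=$(printf '%s\n' 'title="abc"' | sed 's/title=".+"/MATCH/')

# The portable BRE spelling of "one or more" is the interval \{1,\}.
interval=$(printf '%s\n' 'title="abc"' | sed 's/title=".\{1,\}"/MATCH/')

printf '%s\n' "$literal" "$unchanged" "$interval"
```

This is likely why the original command left the text untouched: the pattern never matched, so sed printed every line as it was.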

Also, the quotes I added around $text preserve any runs of IFS characters (most likely spaces, if any) that may occur in the title.
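To illustrate the quoting point, here is a small sketch with a hypothetical value for text (the title string is invented for the demonstration):

```shell
# Hypothetical value containing runs of spaces inside the title.
text='title="Two  spaces   here"'

# Unquoted: the shell splits the value on IFS characters and echo
# rejoins the resulting words with single spaces.
unquoted=$(echo $text)

# Quoted: the value is passed as one argument, whitespace intact.
quoted=$(echo "$text")

printf '%s\n' "$unquoted" "$quoted"
```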

Regards,
alister

joeyg · January 4, 2010, 5:00pm

>echo "title=123abc and more ^&@%#$%&@"
"title=123abc and more ^&@%#$%&@"

>echo "title=123abc and more ^&@%#$%&@" | tr -cd '[:alpha:][:space:][:digit:]'
title123abc and more

Lukasito · January 4, 2010, 5:19pm

still everything in the output file

alister · January 4, 2010, 5:29pm

echo "$text" | sed -ne '/^.*\(title="[^"]*"\).*$/s//\1/p' > titles.html

If that doesn't work, then provide some sample data and desired output.
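For anyone wondering what the -n and p combination changes, a rough sketch on invented two-line input (not the poster's file) might look like this:

```shell
# Invented sample: one line without a title attribute, one line with it.
text='no attribute on this line
<img title="The Java language specification" border=0>'

# Without -n, sed prints every line, including lines the pattern never matched.
plain=$(printf '%s\n' "$text" | sed 's/^.*\(title="[^"]*"\).*$/\1/')

# With -n and the p flag, only lines where the substitution happened are printed.
# The empty regex in s//\1/ reuses the regex from the address.
filtered=$(printf '%s\n' "$text" | sed -n '/^.*\(title="[^"]*"\).*$/s//\1/p')

printf 'plain:\n%s\n\nfiltered:\n%s\n' "$plain" "$filtered"
```

Note that the leading ^.* is greedy, so on a line with several title attributes this form keeps only the last one.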

durden_tyler · January 4, 2010, 5:29pm

Can you post the output of the following command here?

echo $text

tyler_durden

Also, are the "2000+ symbols" -
(a) special, but printable, characters like "@", "$", "%" etc. or
(b) non-printable characters like those for ASCII 0, 1, 2, etc.

rdcwayx · January 4, 2010, 6:11pm

echo $text| grep -o "title=\".*\""

alister · January 4, 2010, 8:31pm

This will fail (by matching more text than intended) if there is a quote after the quote that terminates the title, since it will be a greedy match. [^"]* instead of .* is best (again, assuming there's a possibility of another quote later on in the line).
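A quick sketch of the difference, on an invented line with a second quoted attribute after the title. It assumes a grep that supports -o (GNU and BSD grep do; it is not in POSIX):

```shell
# Invented line: another quoted attribute follows the title.
line='<img title="First" alt="Second">'

# Greedy .* runs to the last double quote on the line.
greedy=$(printf '%s\n' "$line" | grep -o 'title=".*"')

# [^"]* cannot cross a double quote, so it stops at the title's closing quote.
bounded=$(printf '%s\n' "$line" | grep -o 'title="[^"]*"')

printf '%s\n' "$greedy" "$bounded"
```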

Regards,
alister

Scrutinizer · January 4, 2010, 9:09pm

Indeed, you would have to create a non-greedy match.

echo "$text" | grep -o 'title="[^"]*"'

---------- Post updated at 02:50 ---------- Previous update was at 02:41 ----------

This will only match one occurrence per line..

---------- Post updated at 03:09 ---------- Previous update was at 02:50 ----------

Alternatives to grep -o

echo "$text" | awk '/title=$/{getline;print "title=\""$0"\""}' RS=\"

echo "$text" | sed 's/title="[^"]*"/\n&\n/g' | sed '/^title="/!d'

Ygor · January 4, 2010, 11:43pm

Try...

perl -nle 'print for m/title=\".*?\"/g' nasty.html > titles.txt

Lukasito · January 5, 2010, 5:14am

None of those work (the output was empty). I'll give you one row from my HTML:

;_OC_timingAction('search');</script><div class="scontentarea" id="scontentarea"><table style="width:100%"><tr><td id="sidebar"style="padding:24px 8px;width:190px;display:none;vertical-align:top"></td><td id="main_content"><div style="margin-bottom:6px; margin-top: 4px"><a class="link_aux" title="" href=""></a></div><div class="result_spacer"><br/></div><div class="rsiwrapper" ><table class="rsi" cellspacing=0 cellpadding=0 border=0 ><tr><td class="coverdstd" align="center"><a href="http://books.google.com/books?id=Ww1B9O_yVGsC&printsec=frontcover&dq=java&hl=sk&ie=ISO-8859-2&cd=1" ><img alt="The Java language specification" class="coverthumb" title="The Java language specification" dir=ltr src="http://bks2.books.google.com/books?id=Ww1B9O_yVGsC&printsec=frontcover&img=1&zoom=5&edge=curl&sig=ACfU3U3EBlPtT6KTuEx6mtanykCsu93qtA" border=0 height=80><script type="text/javascript">if (window['_OC_registerHover']){_OC_registerHover({"title":"The \u003cb\u003eJava\u003c/b\u003e language specification","authors":"James Gosling, Bill Joy","bib_key":"ISBN:0201310082","pub_date":"2000","snippet":"Developers will turn to this book again and again.","subject":"Computers","info_url":"http://books.google.com/books?id=Ww1B9O_yVGsC\u0026dq=java\u0026hl=sk\u0026ie=ISO-8859-2","preview_url":"http://books.google.com/books?id=Ww1B9O_yVGsC\u0026printsec=frontcover\u0026dq=java\u0026hl=sk\u0026ie=ISO-8859-2\u0026cd=1","thumbnail_url":"http://bks2.books.google.com/books?id=Ww1B9O_yVGsC\u0026printsec=frontcover\u0026img=1\u0026zoom=5\u0026edge=curl\u0026sig=ACfU3U3EBlPtT6KTuEx6mtanykCsu93qtA","num_pages":505,"viewability":2,"preview":"partial","embeddable":true})}</script></a><div class="starrating"></div></td><td valign=top><div class=resbdy><h2 class="resbdy"><a href="http://books.google.com/books?id=Ww1B9O_yVGsC&printsec=frontcover&dq=java&hl=sk&ie=ISO-8859-2&cd=1"><span dir=ltr>The <b>Java</b> language specification</span></a></h2><font size=-1><span style="line-height: 1.2em;"><span 
class=ln2><a href="http://books.google.com/books?q=+inauthor:%22James+Gosling%22&hl=sk&ie=ISO-8859-2" class="link_aux">James Gosling</a>, <a href="http://books.google.com/books?q=+inauthor:%22Bill+Joy%22&hl=sk&ie=ISO-8859-2" class="link_aux">Bill Joy</a> - 2000 - Počet stránok 505</span><br/><div class="snippet sa" dir=ltr>Developers will turn to this book again and again.</div><div><span style="color:#99522e">Obmedzený náhľad</span> - <a class="link_aux axs_about" href="http://books.google.com/books?id=Ww1B9O_yVGsC&dq=java&hl=sk&ie=ISO-8859-2"">O tejto knihe</a> - <span class="res_ann">

From this I need title="Some text" (but not empty).

Scrutinizer, thanks, it works. Can you explain your solution to me? (I don't know exactly what [^"]* means. Thanks.)

Scrutinizer · January 5, 2010, 5:59am

[^"]* means zero or more occurrences of any character that is not a double quote.
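Because "zero or more" also matches an empty title, here is a hedged sketch of one way to require at least one character, using a shortened, invented line modelled on the sample above. It assumes a grep with -o (GNU or BSD), which prints each match on its own line:

```shell
# Shortened, invented line: one empty title and one real title.
line='<a title="" href=""></a><img title="The Java language specification">'

# [^"]* allows zero characters, so the empty title="" is reported too.
all=$(printf '%s\n' "$line" | grep -o 'title="[^"]*"')

# [^"][^"]* demands at least one non-quote character, skipping title="".
nonempty=$(printf '%s\n' "$line" | grep -o 'title="[^"][^"]*"')

printf 'all:\n%s\n\nnonempty:\n%s\n' "$all" "$nonempty"
```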

Lukasito · January 5, 2010, 6:04am

Ah, I see, thanks.

Ygor · January 5, 2010, 9:15pm

Previous perl code produces output...

title=""
title="The Java language specification"

New requirement is to exclude empty text, so try...

perl -nle 'print for m/title=\"[^\"]+\"/g' nasty.html > titles.txt