Search for a word and export the 35 characters after it using a shell script?

sachit_adhikari · August 16, 2012, 2:35am

I have a file, input.txt, which has loads of weird characters, HTML tags, and useful material. I want to write the 35 characters after the word description to a new file, output.txt, excluding weird characters like $$#$#@$#@***$# and HTML tags. Please help. Thanks in advance.

My final goal is to find the word description and print 35 characters after description which shouldn't include the html tags and weird characters. Is it possible? Like here:

 description><p><img class="float_right"
 src="http://static3.businessinsider.com/image/502ab0036bb3f7147b00000f-400-300/dnu.jpg"
 border="0" alt="dnu" width="400" height="300" /></p><p>The lawn
 was filled with <a class="hidden_link"
 href="http://www.businessinsider.com/blackboard/goldman-sachs">Goldman
 Sachs</a> Group Inc. partners dressed in pink looking out

I want to start from: The lawn was filled with (again skip those tags and continue from) Group Inc. partners (35 characters, done!) and then stop and search for the next description!

raj_saini20 · August 16, 2012, 3:41am

Please provide the desired output.

RudiC · August 16, 2012, 3:54am

To me, it looks like the &lt; (<) and &gt; (>) pairs don't match in your sample, so it's difficult to eliminate the HTML stuff consistently. Please confirm or revise your sample.

sachit_adhikari · August 16, 2012, 4:32am

Yes, the characters are not uniform. That's the sample. My sample output is:
The lawn was filled with (again skip those tags and continue from) Group Inc. partners (35 characters, done!) and then stop and search for the next description!

The script or command doesn't need to remove the HTML tags and weird characters 100% reliably, as there is variation.
Thank you! This is what I thought of:
1) Search for description word using grep.
2) Grab 35 characters after description using sed excluding weird and html characters.
3) Print the output to the output.txt file.

Is it possible? Please help me!
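
The three steps above can be sketched as a single pipeline. This is only a minimal sketch, not a tested solution for the real input.txt: it assumes each description and its text sit on one line, that tags are literal <...> pairs, and that the "weird characters" are limited to $, #, @ and *. The function name first35 is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumptions: one description per line, literal <...> tags,
# and "weird characters" limited to $ # @ *.
first35() {
    # Step 1: keep only lines that contain the word "description>".
    grep 'description>' "$1" |
        # Step 2: drop everything up to "description>", strip tags and
        # the assumed junk characters, then keep the first 35 characters.
        sed -e 's/.*description>//' \
            -e 's/<[^>]*>//g' \
            -e 's/[$#@*]//g' \
            -e 's/^\(.\{35\}\).*/\1/'
}
```

Step 3 would then be a plain redirection, for example `first35 input.txt > output.txt`.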

RudiC · August 16, 2012, 5:20am

This works on exactly your sample. It cannot resolve <'s and >'s that cross, and it depends on your sed accepting the -r option (extended regex):

 sed -r -n ':rep;N; $ !T rep; s/\n//g;s/ description>//; s/&lt;[^&]*&gt;//g; s/(.{35}).*/\1/ p'

yielding

The lawn was filled with Goldman Sa
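
The first part of that command joins all input lines into one before any tags are removed. It uses the T command, which is a GNU sed extension. A minimal sketch of the same joining step with the plain b (branch) command, which other sed implementations also accept, might look like this. The function name join_lines is hypothetical.

```shell
#!/bin/sh
# Sketch: read the whole input into the pattern space, then delete the
# embedded newlines so that later substitutions see a single line.
join_lines() {
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n//g'
}
```

For example, `printf 'ab\ncd\nef\n' | join_lines` prints abcdef. Reading the whole file into the pattern space is fine for small inputs but may be slow or hit limits on very large ones.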

sachit_adhikari · August 16, 2012, 5:39am

@RudiC: Hello, I ran the command, but I get only 15 characters after title. I have attached my source file. I want 35 characters after the word description, without HTML tags and weird characters. The word description occurs many times, so the output should be the 35 characters after every description. Should we use a loop for that? I think it's better to put the code in a script file.

RudiC · August 16, 2012, 6:07am

I knew your sample was NOT representing your input exactly! Anyway, try this:

sed -r 's/.*description>//g; s/&lt;[^&]*&gt;//g; s/(.{35}).*/\1/' input.txt

printing

The lawn was filled with Goldman Sa
The recall would cover almost all t
Fran&ccedil;ois Hollande, still
More people in the world are overwe
Bloomberg TV just hosted a debate o
 Having failed to graduate from hig
When it comes to big data, "size do
A very successful entrepreneur who 
Unfortunately, there is no World Ba
 Deutsch LA released its first Targ
What if the generation that once ro
Official Chinese economic data have
Andy Grignon is always looking for 
Author Bob Sutton has posted on his
VIENNA (Reuters) - Scientists have 
Whether he's spending time with his
The Harvard Business Review has a f
Today a court in Miami refused a bo
Hedge fund Soros Funds has filed hi
If you want to understand the bigge
Most science journals put up multip
Legendary hedge fund manager John P
There has been a lot of noise about
Barcelona is a partygoer&rsquo;
Hedge fund titan Bill Ackman, the f

from your input.txt file.

elixir_sinari · August 16, 2012, 6:19am

Try this:

sed 's/^.*<description>/<description>/
s/&lt;/</g
s/&gt;/>/g
s/&rsquo;/'"'"'/g
s/&ccedil;/c/g
s/<[^>]*>//g
s/^\(.\{35\}\).*/\1/' input.txt
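
A sketch of the same idea that prints only the lines containing a description, so the result can go straight to output.txt. It assumes the tags inside each description are stored as &lt; and &gt; entities and that each description starts on the line holding its <description> tag. The function name descriptions35 is hypothetical, and only the two extra entities already seen in this thread are handled.

```shell
#!/bin/sh
# Sketch only. Assumes entity-encoded tags (&lt;p&gt;) after <description>.
descriptions35() {
    sed -e '/<description>/!d' \
        -e 's/^.*<description>//' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e "s/&rsquo;/'/g" \
        -e 's/&ccedil;/c/g' \
        -e 's/<[^>]*>//g' \
        -e 's/^\(.\{35\}\).*/\1/' "$1"
}
```

It could be run as `descriptions35 input.txt > output.txt`. The first expression deletes every line without <description>, the next ones decode the entities so that the generic tag pattern can remove them, and the last one keeps the first 35 characters.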

sachit_adhikari · August 16, 2012, 6:26am

Thank you so much. It worked

sachit_adhikari · August 16, 2012, 11:59pm

@RudiC and @elixir_sinari: I have another text file with a few weird characters and HTML tags. I used the same command, but it didn't work. How do I customize your command to make it work for my other input.txt file? If possible, I want to remove the <link> before the description too.
Please help! I have attached the input.txt file.
Thank you!

RudiC · August 17, 2012, 2:19am

There are HTML text extractors on the net that you may want to test. Also, it will be easier to work directly on the web page's HTML text than on your half-preprocessed extraction.

sachit_adhikari · August 17, 2012, 2:24am

The HTML extractors are only for Windows. Besides, they didn't work with Wine either. I want to achieve the result via a shell script. Please help me! Thank you.