Search for a word and export the 35 characters after it using a shell script?

I have a file input.txt which has loads of weird characters, HTML tags and useful material. I want to write the 35 characters after the word description to a new file output.txt, excluding weird characters like $$#$#@$#@***$# and excluding HTML tags. Please help me. Thanks in advance.

My final goal is to find the word description and print the 35 characters after it, not counting HTML tags and weird characters. Is that possible? For example:

 description><p><img class="float_right"
 src="http://static3.businessinsider.com/image/502ab0036bb3f7147b00000f-400-300/dnu.jpg"
 border="0" alt="dnu" width="400" height="300" /></p><p>The lawn
 was filled with <a class="hidden_link"
 href="http://www.businessinsider.com/blackboard/goldman-sachs">Goldman
 Sachs</a> Group Inc. partners dressed in pink looking out

I want to start from: The lawn was filled with (again skip those tags and continue from) Group Inc. partners (35 characters, done!) and then stop and search for the next description!

Please provide the desired output.

To me, it looks like the &lt (<) and &gt (>) pairs don't match in your sample, so it's difficult to eliminate the HTML stuff consistently. Please confirm or revise your sample.

Yes, the characters are not uniform. That's the sample. My sample output is:
The lawn was filled with (again skip those tags and continue from) Group Inc. partners (35 characters, done!) and then stop and search for the next description!

The script or command doesn't have to remove 100% of the HTML tags and weird characters, as they vary.
Thank you! This is what I thought of:
1) Search for the word description using grep.
2) Grab the 35 characters after description using sed, excluding weird characters and HTML tags.
3) Print the output to the file output.txt.

Is it possible? Please help me!
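
The three steps above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a tested solution for the real file: it assumes each description sits on a single line, that the opening tag is written literally as <description>, and that the tags to skip use literal < and > characters. The function name first35 is made up for illustration.

```shell
#!/bin/sh
# Sketch only. first35 is a hypothetical helper name, not from this thread.
# Assumptions: one description per line, tags written with literal < and >.
# Step 1: grep keeps only the lines containing the opening tag.
# Step 2: sed drops everything up to and including that tag, strips any
#         remaining tags, then keeps the first 35 characters of the line.
first35() {
    grep '<description>' "$1" |
    sed -e 's/.*<description>//' \
        -e 's/<[^>]*>//g' \
        -e 's/^\(.\{35\}\).*/\1/'
}
```

Step 3 is then just a redirect, for example: first35 input.txt > output.txt. Lines shorter than 35 characters pass through unchanged, and "weird characters" other than tags are not handled here because the thread does not pin down what counts as weird.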

This will work on exactly your sample; it cannot resolve crossing <'s and >'s, and it depends on your sed accepting the -r option (extended regex):

 sed  -r -n ':rep;N; $ !T rep; s/\n//g;s/ description>//; s/&lt;[^&]*&gt;//g; s/(.{35}).*/\1/ p'

yielding

The lawn was filled with Goldman Sa

@RudiC: Hello, I ran the command but I get only 15 characters after title. I have attached my source file. I want 35 characters after the word description, without HTML tags and weird characters. There are many description words, so the output should be the 35 characters after every description. Should we use a loop for that? I think it's better to put the code in a script file.

I knew your sample was NOT representing your input exactly! Anyway, try this:

sed -r 's/.*description>//g; s/&lt;[^&]*&gt;//g; s/(.{35}).*/\1/ ' input.txt

printing

The lawn was filled with Goldman Sa
The recall would cover almost all t
Fran&ccedil;ois Hollande, still
More people in the world are overwe
Bloomberg TV just hosted a debate o
 Having failed to graduate from hig
When it comes to big data, "size do
A very successful entrepreneur who 
Unfortunately, there is no World Ba
 Deutsch LA released its first Targ
What if the generation that once ro
Official Chinese economic data have
Andy Grignon is always looking for 
Author Bob Sutton has posted on his
VIENNA (Reuters) - Scientists have 
Whether he's spending time with his
The Harvard Business Review has a f
Today a court in Miami refused a bo
Hedge fund Soros Funds has filed hi
If you want to understand the bigge
Most science journals put up multip
Legendary hedge fund manager John P
There has been a lot of noise about
Barcelona is a partygoer&rsquo;
Hedge fund titan Bill Ackman, the f

from your input.txt file.

Try this:

sed 's/^.*<description>/<description>/
s/&lt;/</g
s/&gt;/>/g
s/&rsquo;/'"'"'/g
s/&ccedil;/c/g
s/<[^>]*>//g
s/^\(.\{35\}\).*/\1/' input.txt
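
For files with entities beyond the two listed above, the same idea can be wrapped in a small function with a catch-all. This is a hedged sketch, not a drop-in replacement: clean35 is a made-up name, and it assumes the file is an RSS-style feed where tags inside the description are escaped as &lt; and &gt; and other entities may be double-escaped (for example &amp;rsquo;). Any entity it does not know is simply dropped, so specific mappings like the ones in the command above would have to go before the catch-all line.

```shell
#!/bin/sh
# Sketch only. clean35 is a hypothetical helper name, not from this thread.
# Assumes one description per line and RSS-style escaped markup.
clean35() {
    sed -n '
        # skip lines without an opening description tag
        /<description>/!b
        # drop everything up to and including the tag
        s/^.*<description>//
        # turn escaped angle brackets back into real ones, then strip tags
        s/&lt;/</g
        s/&gt;/>/g
        s/<[^>]*>//g
        # undo one level of escaping, then drop any remaining entity
        s/&amp;/\&/g
        s/&[#[:alnum:]]*;//g
        # trim leading blanks and keep the first 35 characters
        s/^[[:space:]]*//
        s/^\(.\{35\}\).*/\1/
        p
    ' "$1"
}
```

Usage would be along the lines of clean35 input.txt > output.txt. Note that dropping entities loses characters (&ccedil; disappears rather than becoming c), which may or may not be acceptable.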

Thank you so much. It worked :)

@RudiC and @elixir_sinari: I have another text file with a few weird characters and HTML tags. I used the same command but it didn't work. How do I customize your command to make it work for my other input.txt file? If possible, I want to remove <link> before the description too.
Please help! Thank you in advance. I have attached the input.txt file.
Thank you!

There exist HTML text extractors on the net that you may want to test. And - it will be easier to immediately work on the web page's HTML text than on your half preprocessed extraction.

HTML extractors are only for Windows. Besides, they didn't work with Wine either. I want to achieve the result via a shell script. Please help me! Thank you.