Extract title

I need a bit of help, please.

I have an HTML file.

The file contains a long list of text;
each line has a different title.

I want to extract all the titles with

egrep, sed, or awk, and save them to a .txt file.

Example: extract the bits in red
from this

        <h3 class="single-item__title typo typo--skylark"><strong>Follow the Money</strong></h3>
        <p class="single-item__subtitle typo typo--canary">Episode 1</p>

and save like this

Follow the Money Episode 1

thanks

What are the criteria for filtering out lines that have titles in them?

????

I already said in post 1:
I want to save them to a text file.

That's all.

Can anyone help with my question?

thanks

With that sparse info given by you, this will do EXACTLY what you requested for EXACTLY the samples in post#1:

sed -n '1h;1!H;${x;s/<[^>]*>\|\n\|^ *//gp}' file
Follow the Money        Episode 1
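
For readers new to sed, here is an annotated sketch of how that one-liner works. It assumes GNU sed (the \| alternation is a GNU extension), and sample.html is a made-up file name holding the two lines from post #1:

```shell
# Hypothetical sample file containing the two lines from post #1.
cat > sample.html <<'EOF'
        <h3 class="single-item__title typo typo--skylark"><strong>Follow the Money</strong></h3>
        <p class="single-item__subtitle typo typo--canary">Episode 1</p>
EOF

# -n   : print nothing unless a p flag asks for it
# 1h   : copy the first line into the hold space
# 1!H  : append every later line to the hold space
# ${…} : on the last line, swap the collected text into the pattern space (x),
#        then delete every tag, every newline, and the blanks at the very start
result=$(sed -n '1h;1!H;${x;s/<[^>]*>\|\n\|^ *//gp}' sample.html)
printf '%s\n' "$result"
```

Because the whole file is collected into one buffer and the newlines are deleted, everything comes out on a single line.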

Hi.

With augmented data:

#!/usr/bin/env bash

# @(#) s1       Demonstrate extraction from HTML, lynx.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C lynx sed paste

ORIG=${1-data1}
FILE=${ORIG}.html
cp $ORIG $FILE

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
lynx -dump $FILE |
tee f1 |
sed '/^[[:blank:]]*$/d' |
tee f2 |
sed 's/^[[:blank:]]*//' |
tee f3 |
paste -d" " - -

exit 0

producing:

$ ./s1 data2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.3 (jessie) 
bash GNU bash 4.3.30
Lynx Version 2.8.9dev.1 (12 Mar 2014)
sed (GNU sed) 4.2.2
paste (GNU coreutils) 8.23

-----
 Input data file data2.html:
<h3 class="single-item__title typo typo--skylark"><strong>Follow the Money</strong></h3>
<p class="single-item__subtitle typo typo--canary">Episode 1</p>
<h3 class="single-item__title typo typo--skylark"><strong>Follow Your Heart</strong></h3>
<p class="single-item__subtitle typo typo--canary">Episode 0</p>

-----
 Results:
Follow the Money Episode 1
Follow Your Heart Episode 0
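
The last stage of the pipeline, paste -d" " - -, is what puts each title and its episode on one line: it reads standard input twice per output line and joins the two pieces with a space. A small sketch, with printf standing in for the cleaned-up lynx output:

```shell
# Stand-in for the de-tagged text: a title line followed by its subtitle line.
pairs=$(printf '%s\n' \
    'Follow the Money' 'Episode 1' \
    'Follow Your Heart' 'Episode 0' |
  paste -d' ' - -)   # each "-" consumes one line of stdin per output line
printf '%s\n' "$pairs"
```

This only works when titles and subtitles strictly alternate, which is why the blank lines are deleted first.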

Best wishes ... cheers, drl

PS:
Advice for forum posts, general:

To obtain the best answers quickly for processing datasets - extracting, transforming, filtering, you should, after having searched for answers (man pages, Google, etc.):

  1. Post representative samples of your data (i.e. data that should "succeed" and data that should "fail")

  2. Post what you expect the results to be, in addition to describing them. Be clear about how the results are to be obtained, e.g. "add field 2 from file1 to field 3 from file2", "delete all lines that contain 'possum'", etc.

  3. Post what you have attempted to do so far. Post scripts, programs, etc. within CODE tags. If you have a specific question about an error, please post the shortest example of the code, script, etc. that exhibits the problem.

  4. Place the data and expected output within CODE tags, so that they are more easily readable.

  5. If you require the use of a specific shell or command, explain why that is the case: if you cannot solve a problem, it may be because you do not know about, or do not know enough about, a software tool, in which case the responders are probably better judges of a solution than you are.

If you don't show us a representative sample of your input when you start, it should not be a surprise if responder-created-input, possibly in a different format from yours, will work, but your real data won't work with the solutions we suggest.

Special cases, exceptions, etc., are very important to include in the samples.
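
As an illustration of why the selection criteria matter, here is a minimal awk sketch that picks lines by class name rather than by position, so other markup in the file is ignored. It assumes, going only by the sample in post #1, that every title line contains single-item__title, every subtitle line contains single-item__subtitle, and each element sits on one line. The file name page.html and the extra div and hr lines are made up for the example:

```shell
# Hypothetical page: the two kinds of lines from post #1 plus unrelated markup.
cat > page.html <<'EOF'
<div class="header">Some other markup</div>
<h3 class="single-item__title typo typo--skylark"><strong>Follow the Money</strong></h3>
<p class="single-item__subtitle typo typo--canary">Episode 1</p>
<hr>
<h3 class="single-item__title typo typo--skylark"><strong>Follow Your Heart</strong></h3>
<p class="single-item__subtitle typo typo--canary">Episode 0</p>
EOF

titles=$(awk '
  # Remove tags, then blanks at both ends of the string.
  function clean(s) { gsub(/<[^>]*>/, "", s); gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
  /single-item__title/    { title = clean($0) }            # remember the latest title
  /single-item__subtitle/ { print title, clean($0) }       # emit title + subtitle
' page.html)
printf '%s\n' "$titles"
```

If the real file differs from these assumptions (for example, a title split over several lines), the patterns would need adjusting.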


thank you for your time

I tried the sed code above and it works,
but it outputs no space between Money and Episode,
like this:
Follow the MoneyEpisode 1

How do I get the space,
so that it saves like this?
Follow the Money Episode 1

Also, because there are many lines,
with your code above
they are all saved on the same line, side by side.
I want it to save like this:

Follow the Money Episode 1
Follow the Money Episode 2
Follow the Money Episode 3

thank you again
sorry about being a bit thick lol
as this is all new to me

Try

sed 'N; s/<[^>]*>\|  \+//g; s/\n/ /g' file
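
A short annotated sketch of why this version gives one title per line, and how to save the result to a file. It assumes GNU sed again; episodes.html and titles.txt are placeholder names, and the input simply repeats the markup from post #1:

```shell
# Hypothetical input: two title/subtitle pairs in the format from post #1.
cat > episodes.html <<'EOF'
        <h3 class="single-item__title typo typo--skylark"><strong>Follow the Money</strong></h3>
        <p class="single-item__subtitle typo typo--canary">Episode 1</p>
        <h3 class="single-item__title typo typo--skylark"><strong>Follow the Money</strong></h3>
        <p class="single-item__subtitle typo typo--canary">Episode 2</p>
EOF

# N                   : append the next input line, so each cycle handles one pair
# s/<[^>]*>\|  \+//g  : delete every tag and every run of two or more spaces
# s/\n/ /g            : turn the newline between the pair into a single space
# > titles.txt        : save the output to a text file instead of the terminal
sed 'N; s/<[^>]*>\|  \+//g; s/\n/ /g' episodes.html > titles.txt

cat titles.txt
```

This relies on title and subtitle lines strictly alternating; any other line in the file would throw the pairing off.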

Hi.

For the data in post #5, I got this with RudiC's code:

Follow the Money Episode 1
Follow Your Heart Episode 0

Looks right to me ... cheers, drl


@RudiC
thank you that works great

much appreciated
for your knowledge and time