That is not good enough, since the "some other text" in my real situation is much more than that; I just simplified it in the example.
I want to get at least what is between <p class="margin-bottom-0"> and </p>
so that the output would be:
text1
<br>
text2
<br>
<br>
text3
I know that there are better tools, but I started out with a simple shell script that grew over time,
and I got everything that I need... this is the last remaining item that I could not parse.
Since HTML is very similar to XML, you may use an XML tool to parse your file.
Since your HTML file is not well-formed XML, the parser complains about it, and the file either has to be adapted by hand or has to be preprocessed before parsing. The <br> is the problematic element: it is fine in HTML, but an XML parser expects <br/>, with a slash within the tag.
So you can do it with an XML parser like xmlstarlet in three steps:
sed 's|<br>|<br/>|gi' data.html |
xmlstarlet sel -t -v '//body/div/p' |
sed -e '/^\s*$/d' -e 's/^\s*//'
1. Make the HTML file compliant by replacing the br tags.
2. Get the wanted HTML element with xmlstarlet.
3. Suppress unwanted empty lines and leading whitespace in the xmlstarlet output.
Oh, and I also had to get rid of the semicolons, because I got the error sed: 1: "/<p/,/<\/p/ {/<p.*\/p>/ ...": unexpected EOF (pending }'s)
and found that getting rid of the semicolons and using newlines instead fixes this error.
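For anyone hitting the same error, here is a minimal sketch of a multi-line sed block with one command per line instead of semicolons. It is not the exact script from this thread (that one is only partly visible in the error message); the function name extract_p is made up for illustration, and it assumes the opening <p class="margin-bottom-0"> and the closing </p> sit on different lines.

```sh
# Hypothetical helper: print the lines between <p class="margin-bottom-0"> and </p>.
# Assumes the opening and closing tags are on separate lines.
extract_p() {
    sed -n '
/<p class="margin-bottom-0">/,/<\/p>/ {
    s/<p class="margin-bottom-0">//
    s/<\/p>//
    s/^[[:space:]]*//
    /^$/d
    p
}'
}

# Usage: extract_p < data.html
```

Each command inside the braces is on its own line and the closing brace stands alone, which is the form that avoids the "pending }'s" complaint. [[:space:]] is used instead of \s because \s is not available in every sed.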
Thanks everyone for the help.
--- Post updated at 09:42 PM ---
stomp, I like your solution too, it looks very clean. Unfortunately xmlstarlet is very picky:
in my real-life problem it's not just the <br>s that need to be transformed to be compliant, and it would be overkill to check and transform the whole HTML page for xmlstarlet.
But I'm glad that you showed me this, I might use it somewhere else.
RudiC, I'm not dropping it, because I need to get other texts out of the HTML, but for the example's sake, yes, that would make it more optimized.
I have 5 more texts that I'm matching, and I'm writing the output into a CSV file.
The HTML that I'm parsing is built very poorly.
Since I need all of this on one line, or else the CSV file will break (just realized this), I had to get rid of the newlines: tr -d "\n\r"
I'm removing the extra whitespace at the beginning and end: awk '{$1=$1};1'
Also, for CSV proofing I'm replacing the commas with semicolons, because CSV will interpret commas as the end of a column: tr ',' ';'
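Put together, the three commands above could be chained roughly like this. This is only a sketch; the function name csv_field is made up for illustration, and the extracted text is assumed to arrive on standard input.

```sh
# Hypothetical helper: turn multi-line text into one CSV-safe field.
csv_field() {
    tr -d "\n\r" |        # join everything onto one line
    awk '{$1=$1};1' |     # rebuild the record: trims leading/trailing whitespace
    tr ',' ';'            # commas would otherwise end the CSV column
}

# Usage: printf '  text1, a\n<br>\n text2 \r\n' | csv_field
```

Note that awk '{$1=$1};1' also squeezes runs of whitespace inside the line down to single spaces, not only at the beginning and end.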
So this makes me wonder if that one sed could do all of this on its own.
But I'm happy, because this works now.
Why don't you paint the whole picture of your requirements (including but not limited to "get other texts out of the html", "get rid of the new lines", "replacing the commas with semicolon") and input data, so people could work towards a final, optimal solution? E.g. the sed, awk, and dual tr invocations could be combined into a single run of one of the tools.
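As a rough illustration of combining the clean-up steps, a single awk run could cover the newline removal, the whitespace trimming and the comma replacement. This is a sketch under assumptions, not a tested replacement for the full script: the function name csv_field_awk is hypothetical, and the sed extraction step is left out because the real input is not known.

```sh
# Hypothetical helper: one awk process instead of tr | awk | tr.
csv_field_awk() {
    awk '
        {
            gsub(/\r/, "")        # drop carriage returns
            buf = buf $0          # join lines with no separator, like tr -d "\n"
        }
        END {
            gsub(/^[ \t]+|[ \t]+$/, "", buf)   # trim both ends
            gsub(/[ \t]+/, " ", buf)           # squeeze inner whitespace
            gsub(/,/, ";", buf)                # make the field CSV-safe
            print buf
        }'
}

# Usage: printf '  text1, a\n<br>\n text2 \r\n' | csv_field_awk
```

One difference from the tr/awk/tr chain: on completely empty input this version still prints an empty line.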
Explanation: Get all p elements that sit inside a div element and output their text content.
To get rid of the empty lines, I suggest a small sed command afterwards:
pup 'div p text{}' < data.html | sed '/^\s*$/d'
# Output
text1
text2
text3
Another short demonstration of pup, which I recently used to get the number of coronavirus cases out of a complex website and into variables (for generating this graph: coronavirus statistics) with only one combined command:
read n n n n infected deceased recovered < <(wget -O- -q https://www.worldometers.info/coronavirus/ \
| pup 'div[id="maincounter-wrap"]' | pup 'h1,span text{}' | xargs echo)