Copy and merge text between two patterns

I need to:

  1. Find all the files in a folder "folder_name" and in its 4 or 5 subfolders that contain a certain pattern "pattern1".
  2. From these files, copy all the text between "pattern1" and a different pattern "pattern2" and merge it into "mergefile".
  3. Get rid of every HTML tag.
  4. If possible, write the name of the file each portion of text was taken from as the first line of that portion in the merged file.
  5. Take into account that "pattern1" and "pattern2" (both are always present) may appear more than once in the same file.

Many thanks for any help (even with just part of the task)
mjomba

Example of an input file: "xsargg777.html"
pattern1 = "Lectio altera"
pattern2 = "Ad Laudes matutinas"

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html>

#menu li {float:left; padding:4; margin:0 1px 0 0; position:relative; width:111px; height:1px; z-index:100;}
#menu li a, #menu li a:visited {text-decoration:none;}
<p class="re"><a name="_hlk121580295"></a>Lectio altera</p>
Ex Scriptis sancti Petri Canísii presbýteri <font color="red">
cuius alter apóstolus </a></p>
<font color="red">V/.</font> In corde prudéntis</p>
<p class="rebo"><a name="_hlk91924987"></a>Ad Laudes matutinas</p>
<font color="red">Ant.</font> Qui docti fúerint, fulgébunt quasi splendor firmaménti, </p>
<p class="rebo"><a name="_hlk91925031"></a>Ad Vesperas</p>
<font color="red">Ant.</font> O doctor óptime, Ecclésiæ sanctæ .</p>
�</p>
<p class="rebo">Die 23 decembris</p>
<p class="rebo">Ad Officium lectionis</p>
<p class="re"><a name="_hlk121580428"></a>Lectio altera</p>
<i>Et socíetas nostra sit cum Deo Patre et Iesu Christo Fílio eius. </i>
<font color="red">Responsorium<br>
R/.</font> Iste est Ioánnes, qui supra pectus Dómini in cena recúbuit: .<br>
<p class="re">Hymnus <a href="breviarionlineff3175.html?formato=1&archivo=zte_deum.htm">Te Deum</a>.</p>
<p class="renm">Oratio</p>
Deus, qui per beátum apóstolum Ioánnem Verbi tui nobis arcána reserásti, </p>
<p class="rebo"><a name="_hlk91927253"></a>Ad Laudes matutinas</p>
<p class="re">Hymnus</p>
<div class="indhym">
...etc.etc.

Example of output file:

FILENAME=xsargg777.html
Lectio altera
Ex Scriptis sancti Petri Canísii presbýteri 
cuius alter apóstolus 
V/. In corde prudéntis
Ad Laudes matutinas
Lectio altera
Et socíetas nostra sit cum Deo Patre et Iesu Christo Fílio eius. 
Responsorium
R/. Iste est Ioánnes, qui supra pectus Dómini in cena recúbuit: 
Hymnus Te Deum
Oratio
Deus, qui per beátum apóstolum Ioánnem Verbi tui nobis arcána reserásti, 
Ad Laudes matutinas

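To get rid of the HTML tags you can use:

sed -n '/</p' input_file | sed -e :a -e 's/<[^>]*>//g;/</N;//ba'

The first sed passes on only the lines that contain a "<"; the second removes the tags, including tags that span several lines.

The remaining part of your question is left as an exercise for you.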

regards
Ravi
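For anyone who wants to try the tag-stripping step on its own, here is a minimal sketch that wraps the second sed command from the post above in a shell function. The function name strip_tags and the sample input are made up for illustration, and the first filter (sed -n '/</p') is left out on purpose so that lines without any tag are kept as well.

```shell
#!/bin/sh
# strip_tags (hypothetical helper name): read HTML on stdin and write the
# text with tags removed to stdout.
# When a line still contains a "<" after the substitution, the tag is
# assumed to continue on the next line, so that line is appended (N) and
# the substitution is retried from label "a".
strip_tags() {
    sed -e :a -e 's/<[^>]*>//g;/</N;//ba'
}

# Small demonstration: the <a ...> tag is split across two input lines.
printf '%s\n' '<p class="re">Hymnus <a' 'href="x.htm">Te Deum</a>.</p>' 'plain line' | strip_tags
```

With this sample input the function should print "Hymnus Te Deum." followed by "plain line". A literal "<" that is not part of a tag will confuse it, so treat it as a rough filter rather than an HTML parser.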

  1. to find all the files from a Folder "folder_name" and from its 4 or 5 sub folders which contain a certain pattern "pattern1".
grep -r -l 'pattern1' folder_name

To print the text between the two patterns with the HTML tags removed:
nawk '/Lectio altera/,/Ad Laudes/ {gsub(/<[^>]*>/,"");print}' filename

Thanks
Sha
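Putting the pieces from this thread together, the whole task could be sketched roughly as below. This is only one possible arrangement, not a tested solution for the real files: the function name merge_between and its argument order are invented here, both patterns are treated as fixed strings, the output file is assumed to live outside the searched folder, tags are assumed not to span lines, and awk is assumed to be a POSIX awk (nawk on Solaris).

```shell
#!/bin/sh
# merge_between DIR START END OUT   (hypothetical helper, not from the thread)
# For every file under DIR that contains START, append to OUT a
# "FILENAME=..." line followed by every START..END block of that file,
# with HTML tags removed.
merge_between() {
    dir=$1
    start=$2
    end=$3
    out=$4          # assumption: OUT is not located inside DIR
    : > "$out"      # start with an empty merge file
    # -r recurses into subfolders, -l lists matching file names only,
    # -F treats START as a fixed string rather than a regular expression.
    grep -r -l -F -- "$start" "$dir" | sort | while IFS= read -r f; do
        printf 'FILENAME=%s\n' "$(basename "$f")" >> "$out"
        awk -v s="$start" -v e="$end" '
        {
            line = $0                          # keep the untouched line for matching
            if (index(line, s)) inside = 1     # a block opens on this line
            if (inside) { gsub(/<[^>]*>/, ""); print }
            if (index(line, e)) inside = 0     # a block closes on this line
        }' "$f" >> "$out"
    done
}

# Example call (names taken from the question):
# merge_between folder_name 'Lectio altera' 'Ad Laudes matutinas' mergefile
```

Because the inside flag is switched on and off line by line, every occurrence of the pattern pair in a file is copied, which covers point 5 of the question. File names containing newlines are not handled.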