How to replace a string with only part of the string

I would like to clean up a bunch of broken links on our site by removing them programmatically. (There are 861 of them.)

The links are embedded in strings of unpredictable and variable length, like these:

<a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00005668">Bolter # 5668</a> <br>
        
<p><em> Neil "3 Girls' Dad" Sayre<br>  <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00005790">Bolter # 5790</a></em></p>

      <li><a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00011776" target="_blank"><strong>wasat again</strong></a> <strong>says &quot;It makes finding your correct color easy.&quot; The site says &quot;Where the colors of YESTERDAY come alive TODAY.&quot; <img src="../cool.gif" width="15" height="15"></strong> </li>
  &quot;Sixty5C10&quot; <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00010596">Bolter # 10596</a><br />
  "halfpint33" <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00011941">Bolter # 11941</a><br>

      <p align="center">Ken Grimes <span class="style19">\|/ </span><a href="https://www.stovebolt.com/gallery/grimes_ken_1962.html">1962 Chevy C-10 Fleetside Shortbed</a> <span class="style19">\|/</span> &quot;crazed_angler&quot; <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00011280">Bolter # 11280</a> <span class="style19">\|/</span> 

    <p align="center">Doug Brechtelsbauer <span class="style19">\|/ </span><a href="https://www.stovebolt.com/gallery/brechtelsbauer_doug_1951.html">1951 GMC 1/2-Ton Longbox</a> <span class="style19">\|/</span> "fritzer999" <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00005564">Bolter # 5564</a> <span class="style19">\|/</span> 

  <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00009342">Bolter # 9342 </a><br />

This is an example string I want to extract from that mess.

<a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00003548">Bolter #3548</a><br />

Then, what I want to do is remove the entire link from the file except the text between the opening and closing tags (in this particular case, Bolter #3548), though I should probably enclose that text in paragraph tags.

I have no clue how to do this. A combination of sed and split? Awk? Grep? Perl?

I have a feeling that no matter what I do I'm going to end up breaking a bunch of pages, but editing 861 files individually doesn't exactly thrill me.

Couple of questions:

  1. Are these pages static or dynamic? If dynamic, you'll need to address this issue with changes to whatever generates these pages.
  2. I don't see the mentioned "string" to be removed (see above) in the original sample file.
  3. What makes the "link" invalid to be removed? u=xxxxxxxx value? A certain invalid pattern in the link construct? Something else? How to distinguish a "bad" link from the "good" one?
  4. Please provide a small representative data input and the corresponding desired output. Don't say "probably", "maybe" etc. - state exactly what you want as the result based on your SMALL and REPRESENTATIVE input.

The pages are static.
I want to edit these files to change from this:

<a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00003548">Bolter #3548</a><br />

To this:

<p>Bolter #3548</p><br />

All of these links point to an old bulletin board that we used to use. The new bulletin board links look like this:

 <a href="https://www.stovebolt.com/ubbthreads/ubbthreads.php?ubb=showprofile&User=00019243">Bolter # 19243</a><br>

However, there is not a one-to-one correspondence between the Bolter # on the old board and the Bolter # on the new board, so we can't just edit the links. The best solution I can come up with is one of two things. Either remove the link and text completely, or leave the text behind but remove the link.

So, to use the example that I posted originally, I would want to go from this:

<a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00005668">Bolter # 5668</a> <br>
        
<p><em> Neil "3 Girls' Dad" Sayre<br>  <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00005790">Bolter # 5790</a></em></p>

      <li><a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00011776" target="_blank"><strong>wasat again</strong></a> <strong>says &quot;It makes finding your correct color easy.&quot; The site says &quot;Where the colors of YESTERDAY come alive TODAY.&quot; <img src="../cool.gif" width="15" height="15"></strong> </li>
  &quot;Sixty5C10&quot; <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00010596">Bolter # 10596</a><br />
  "halfpint33" <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00011941">Bolter # 11941</a><br>

      <p align="center">Ken Grimes <span class="style19">\|/ </span><a href="https://www.stovebolt.com/gallery/grimes_ken_1962.html">1962 Chevy C-10 Fleetside Shortbed</a> <span class="style19">\|/</span> &quot;crazed_angler&quot; <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00011280">Bolter # 11280</a> <span class="style19">\|/</span> 

    <p align="center">Doug Brechtelsbauer <span class="style19">\|/ </span><a href="https://www.stovebolt.com/gallery/brechtelsbauer_doug_1951.html">1951 GMC 1/2-Ton Longbox</a> <span class="style19">\|/</span> "fritzer999" <span class="style19">\|/</span> <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00005564">Bolter # 5564</a> <span class="style19">\|/</span> 

  <a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00009342">Bolter # 9342 </a><br />

To this:

<p>Bolter # 5668</p> <br>
        
<p><em> Neil "3 Girls' Dad" Sayre<br>  Bolter # 5790</em></p>

      <li><strong>wasat again</strong> <strong>says &quot;It makes finding your correct color easy.&quot; The site says &quot;Where the colors of YESTERDAY come alive TODAY.&quot; <img src="../cool.gif" width="15" height="15"></strong> </li>
  &quot;Sixty5C10&quot; <span class="style19">\|/</span> Bolter # 10596<br />
  "halfpint33" <span class="style19">\|/</span> Bolter # 11941<br>

      <p align="center">Ken Grimes <span class="style19">\|/ </span><a href="https://www.stovebolt.com/gallery/grimes_ken_1962.html">1962 Chevy C-10 Fleetside Shortbed</a> <span class="style19">\|/</span> &quot;crazed_angler&quot; <span class="style19">\|/</span> Bolter # 11280 <span class="style19">\|/</span> 

    <p align="center">Doug Brechtelsbauer <span class="style19">\|/ </span><a href="https://www.stovebolt.com/gallery/brechtelsbauer_doug_1951.html">1951 GMC 1/2-Ton Longbox</a> <span class="style19">\|/</span> "fritzer999" <span class="style19">\|/</span> Bolter # 5564 <span class="style19">\|/</span> 

  Bolter # 9342 <br />

Now that I think about it, it might be better to remove the link AND the text, since the Bolter # is likely to be wrong anyway. And that might be easier to accomplish anyway.

I'm not sure about a 'shell' solution in bash or whatever shell you use, but I've done something similar using Perl. I had to change the title of a few hundred web pages and used the perl inline-editing feature. Might be worth while looking at.

@trudge,
"... but I've done something similar using Perl. I had to change the title of a few hundred web pages and used the perl inline-editing feature. Might be worth while looking at."

How about showing that? Saying you've "...done something similar..." without showing it is not helpful; it's just noise.

I'll take a look at perl.

Perl can be run with certain arguments on the command line to make it do different things. One of the things it can do is called 'inline file editing'. What that means is that Perl can open a file, do some editing, and close the file - all without an editor of any kind.

In the directory with all my HTML files I ran this 1-liner:

perl -pi'.old' -e 's/Tutorials/Refreshments/' *.html

You might recognize part of that code to do some string substitution.

Basically it tells Perl to walk through all '.html' files in the current directory, keep a copy of each original with a '.old' extension, and replace the first 'Tutorials' on each line with 'Refreshments' (add the g modifier to replace every occurrence on a line).

After I checked to see everything was fine (all 200-some files), I deleted all the files ending in '.old'

Something like this:

perl -pe 's#<a href=".*?">(.*?)</a>#<p>$1</p>#'

Thank you for that. I will study it and test it out on a single file. I would assume that, if I decide to just remove the entire string, I would simply make the second expression empty.

Yes, substitution with nothing is a deletion.
As was already suggested, you can modify the input files directly with the -i.old option (it backs up each original as .old; plain -i makes no backup).

After talking with the owners, they indicated that they wanted to change the old url to the new url but retain the data. So, what that would mean, in practical terms is that this line in the file:

<a href="https://www.stovebolt.com/bboard/cgi-bin//ultimatebb.cgi?ubb=get_profile;u=00009498">Bolter # 9498</a><br>

Should be changed to this:

<a href="https://www.stovebolt.com/ubbthreads/ubbthreads.php?ubb=showprofile&User=00009498">Bolter # 9498</a><br>

So, I wrote this code and ran it on one file.

 perl -i.old -pe '#<a href=”https://(.*?)/.*u=(.*?)\”#<a href=”https://$1/ubbthreads/ubbthreads.php?ubb=showprofile&User=$2#'

It didn't change a thing. It did create a .old file that diff says is identical to the original file. I created the regex using regexr, and it matched as I expected, but apparently it doesn't match the actual file?

If I put this into a perl script, would that solve the problem?

Normally, when websites change URLs, they simply use mod_rewrite (assuming the web server is apache2).

https://httpd.apache.org/docs/current/mod/mod_rewrite.html

This is how we manage and test URL changes. See for example:

I've used mod_rewrite numerous times. That's an interesting idea. I'll have to think about that.

mod_rewrite is based on a PCRE regular-expression parser, so when you use mod_rewrite you can test and validate everything.

If you later decide to change all the URLs by editing them, and you don't care about the SEO penalty this will incur, then you can use the mod_rewrite rules you developed in scripts to create all the edits on your site before cutting over and going live with the changes.

Honestly, in doing this kind of work for as long as I can remember, I have always used mod_rewrite and have rarely run any script to make changes.

I think this is how most people do it, because it is easier and preserves SEO for all links, and most people care about SEO.

s command missing? Without the leading s, perl treats everything after the # as a comment, so the code does nothing.
Also, you can add a \b "word boundary" anchor to increase precision.
.* is greedy, so .*u= finds the rightmost u=, while .*? finds the minimum (nearest) match.
The g modifier (at the end of the s command) applies the whole substitution again to the remainder of the line, so it would also find another URL that follows.

 perl -i.old -pe 's#<a href="https://(.*?)/.*?\bu=(.*?)"#<a href="https://$1/ubbthreads/ubbthreads.php?ubb=showprofile&User=$2"#g'

Note that your pattern uses the typographic quote ” where the files contain the ASCII quote "; to perl these are different characters, so a pattern written with ” cannot match.

Yeah, it never dawned on me to use it.

Now I'm trying to get it working, but I've run into a very odd problem. None of my page loads are showing up in the apache logs - none - not in the access log. Not in the error log.

Makes it really hard to see what's wrong.

If you are using mod_rewrite you can enable rewrite logging.

Reference:

https://httpd.apache.org/docs/current/mod/mod_rewrite.html

I enabled rewrite logging at the trace level, and I can see the rule being compared to pages that don't match. The problem I have now is that apache is not logging any traffic from my IP (and from a lot of other IPs), and I don't know why. But, without log entries, it's really hard to troubleshoot.

This is the contents of the .htaccess file in document root.

RewriteEngine on
RewriteRule ^/bboard.*u=(\d){8} /ubbthreads/ubbthreads.php?ubb=showprofile&amp;User=$1 [NC] 

But the rule is not working. It appears that mod_rewrite is trying to apply the rule but it's not applying it to the actual urls that have the matching string in them.

[Wed Jul 06 00:25:56.714297 2022] [rewrite:trace3] [pid 924] mod_rewrite.c(470): [client 157.55.39.49:1536] 157.55.39.49 - - [stovebolt.com/sid#5631ad188268][rid#5631ad489f50/initial] applying pattern '^/bboard.*u=(\d){8}' to uri '/ubbthreads/ubbthreads.php/topics/1457207/47-chev.html'

[Wed Jul 06 00:25:56.714445 2022] [rewrite:trace3] [pid 924] mod_rewrite.c(470): [client 157.55.39.49:1536] 157.55.39.49 - - [stovebolt.com/sid#5631ad188268][rid#5631ad4b73d0/subreq] applying pattern '^/bboard.*u=(\d){8}' to uri '/topics/1457207/47-chev.html'

This is the error I get when I click on one of those bad links.

The requested URL /bboard/cgi-bin/ultimatebb.cgi was not found on this server.

Is there a reason that everything from the question mark on is ignored in this error message? Do I need to be using QUERY_STRING instead of the regex I'm using? And if I do, how do I capture the stuff after u=?

Well, there is nothing in your rule above for what you posted: