Python BeautifulSoup Re Finding Digits Within Tags

metallica1973 · July 20, 2015, 2:13pm

I am writing a little Python script that needs to grab the version numbers from cells such as "<td>4.2.2</td>" within the tbody of the page:

[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> 
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> 
(<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> | <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)
</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)
</td></tr><tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> 
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)
</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]

Is it possible to use a one-liner to scrape only the digits between the tags:

"<td>4.2.2</td>"

so it spits out:
4.2.2
4.2.1
etc..

This is what I have done so far, but I don't understand why it creates the variable rpart as a ResultSet and not a regular string that I can scrape the data from.

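# BeautifulSoup parses markup, not URLs, so fetch the page first (urlopen is urllib2.urlopen on Python 2, urllib.request.urlopen on Python 3)
wphtml = BeautifulSoup(urlopen('http://blah.blah/release').read())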
rpart = wphtml.find_all('tbody', limit=1)
rpart[0]
[<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah.blah/-4.2.2.zip">zip</a> (<a href="https://blah/blah-4.2.2.zip.md5">md5</a> 
| <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2.tar.gz">tar.gz</a> (<a href="https://blah/blahs-4.2.2.tar.gz.md5">md5</a> 
| <a href="https://blah/blah-4.2.2.tar.gz.sha1">sha1</a>)</td><td align="center"><a href="https://blah/blah-4.2.2-IIS.zip">IIS zip</a> 
(<a href="https://blah/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2-IIS.zip.sha1">sha1</a>)</td></tr><tr><td>4.2.1</td> 
<td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a> (<a href="https://blah/blah-4.2.1.zip.md5">md5</a> 
| <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)</td><td align="center">[/tbody]
[tbody]blah blah blah blah blah
[/tbody]
whos
rpart           ResultSet        [<tbody>\n<tr style="back<...>="1"></td></tr> </tbody>]
wphtml          BeautifulSoup    <!DOCTYPE html>\n<html di<...>"></iframe></body></html>
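
On the ResultSet question: as far as I can tell, find_all() always returns a list-like ResultSet, even with limit=1, while find() returns a single Tag. Neither is a plain string. Below is a minimal sketch of the difference, written for Python 3 with the bs4 package; SAMPLE_HTML is a made-up stand-in for the real page, not the actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page
SAMPLE_HTML = "<table><tbody><tr><td>4.2.2</td></tr><tr><td>4.2.1</td></tr></tbody></table>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

result_set = soup.find_all("tbody", limit=1)  # list-like ResultSet, even for a single hit
first_tbody = soup.find("tbody")              # a single Tag (or None when nothing matches)

print(type(result_set).__name__)   # ResultSet
print(type(first_tbody).__name__)  # Tag

# A Tag is still not a str; convert explicitly when markup or plain text is wanted
as_markup = str(first_tbody)
as_text = first_tbody.get_text(" ", strip=True)
print(as_text)
```

So rpart[0] (or find() instead of find_all()) gives a Tag that can be searched further with find_all('td').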

Is there a way to do this as a one-liner?

rpart = wphtml.find_all('tbody', limit=1, td=re.compile('\<td\>\d*.\d*.\d*.\<\/td\>'))
4.2.2
4.2.1
etc..

or

for tag in wphtml.find_all('tbody', limit=1, string=re.compile("\b\<td\>\d*.\d*.\d*.\<\/td\>\b")):
    print(tag.content)
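
A note on why neither attempt matches: an unrecognised keyword such as td=... is treated by find_all() as a filter on an HTML attribute named td, and string=... is compared against the text inside a tag, not against its markup, so a pattern containing <td> never matches. A possible one-liner is sketched below, assuming Python 3 and a bs4 release that accepts the string keyword (older releases spell the same argument text). SAMPLE_HTML and version_re are illustrative names, not part of the original page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the real page
SAMPLE_HTML = """
<table>
<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a>
(<a href="https://blah/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah/blah-4.2.2.zip.sha1">sha1</a>)</td></tr>
<tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a>
(<a href="https://blah/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah/blah-4.2.1.zip.sha1">sha1</a>)</td></tr>
</tbody>
<tbody><tr><td>9.9.9</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The pattern describes the cell text only: digits separated by dots
version_re = re.compile(r"^\d+(?:\.\d+)+$")

# The one-liner: <td> cells of the first <tbody> whose text looks like a version
versions = [td.string for td in soup.find("tbody").find_all("td", string=version_re)]

print("\n".join(versions))
```

The cells that hold download links are skipped because their .string is None, and the second tbody is never searched.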

So what I am trying to do is:

1 - Search through the html page and capture only the first [tbody]....[/tbody], hence limit=1
2 - Regex through the results and only print out the digits that are inside the <td>\d*.\d*.\d*</td> tags
3 - Resulting in:

4.2.2
4.2.1
etc..
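
The three steps above can also be sketched without a regex, under the assumption that the version is always the first cell of each row (true for the sample shown, not verified against the real page). The function name first_tbody_versions and SAMPLE_HTML are made up for illustration; the code targets Python 3 with bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the real page
SAMPLE_HTML = """
<table>
<tbody>
<tr style="background: #eee;"><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a></td></tr>
<tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a></td></tr>
</tbody>
<tbody><tr><td>blah blah blah</td></tr></tbody>
</table>
"""


def first_tbody_versions(html):
    """Return the text of the first <td> of every row in the first <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")           # step 1: first <tbody> only
    if tbody is None:
        return []
    versions = []
    for row in tbody.find_all("tr"):
        cell = row.find("td")            # step 2: assumed to be the version cell
        if cell is not None:
            versions.append(cell.get_text(strip=True))
    return versions


# step 3: one version per line
for version in first_tbody_versions(SAMPLE_HTML):
    print(version)
```

This relies on column position instead of a pattern, so it would also pick up a first cell that is not a version number; the regex filter is the safer choice if the table layout varies.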

balajesuri · July 21, 2015, 11:46am

The best thing about using a language like Python is that you have ready-made parsers to make your life simpler, so you need not resort to (cheaper?) techniques like regex (leave those things to Perl :-D).

What you're trying to parse looks like an HTML file. Take a look at the HTMLParser module and see if you can cook something up using that.
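
For what it's worth, here is a rough sketch of what that could look like with the standard-library parser alone. It is written for Python 3, where the module is html.parser (on Python 2 it is HTMLParser). The class name FirstTbodyCells and SAMPLE_HTML are invented for the example, and it assumes the version cells contain nothing but text.

```python
from html.parser import HTMLParser

# Hypothetical, well-formed stand-in for the real page
SAMPLE_HTML = """
<table>
<tbody>
<tr><td>4.2.2</td> <td align="center"><a href="https://blah/blah-4.2.2.zip">zip</a></td></tr>
<tr><td>4.2.1</td> <td align="center"><a href="https://blah/blah-4.2.1.zip">zip</a></td></tr>
</tbody>
<tbody><tr><td>9.9.9</td></tr></tbody>
</table>
"""


class FirstTbodyCells(HTMLParser):
    """Collect the text of text-only <td> cells inside the first <tbody>."""

    def __init__(self):
        super().__init__()
        self.seen_tbody = False      # has any <tbody> been opened yet?
        self.in_first_tbody = False  # currently inside the first <tbody>?
        self.in_td = False           # currently inside a <td> of interest?
        self.td_has_child = False    # did the current <td> contain another tag?
        self.buffer = []
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody" and not self.seen_tbody:
            self.seen_tbody = True
            self.in_first_tbody = True
        elif tag == "td" and self.in_first_tbody:
            self.in_td = True
            self.td_has_child = False
            self.buffer = []
        elif self.in_td:
            self.td_has_child = True  # e.g. an <a> inside a download cell

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_first_tbody = False
        elif tag == "td" and self.in_td:
            text = "".join(self.buffer).strip()
            if text and not self.td_has_child:
                self.cells.append(text)
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.buffer.append(data)


parser = FirstTbodyCells()
parser.feed(SAMPLE_HTML)
print("\n".join(parser.cells))
```

It is noticeably more code than the BeautifulSoup route, since the state tracking that BeautifulSoup does for you has to be written by hand.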

metallica1973 · July 23, 2015, 5:46pm

Many thanks for the reply,

After putting a little elbow grease into this, I was able to accomplish what I needed with BeautifulSoup and re:

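# BeautifulSoup parses markup, not URLs, so download the page first
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
from bs4 import BeautifulSoup
wphtml = BeautifulSoup(urlopen('http://blah.blah/release').read(), 'html.parser')
rpart = wphtml.find('tbody')    # first <tbody> only
tds = rpart.find_all('td')      # every cell in that table body
blah = []
for r in tds:
    # no regex needed here: .string is the cell text, or None when the cell has several children
    blah.append(r.string)
blah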
u'4.2.2',
 None,
 None,
 None,
 u'4.2.1',

My next question is: how do I get rid of the None values? ;)

---------- Post updated at 05:46 PM ---------- Previous update was at 02:39 PM ----------

blah=filter(None, blah)
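
A small caveat for anyone on Python 3: there filter() returns a lazy iterator, not a list, so it needs wrapping in list(). The sketch below uses a hand-written blah list that mirrors the output above; the names via_filter and via_comprehension are only for illustration.

```python
# Hand-written copy of the kind of list produced above
blah = ["4.2.2", None, None, None, "4.2.1", None]

# Python 3: filter() is lazy, so wrap it in list() to get a list back.
# filter(None, ...) drops every falsy item (None, "", 0, ...).
via_filter = list(filter(None, blah))

# Explicit alternative: drops only None and keeps other falsy values such as ""
via_comprehension = [s for s in blah if s is not None]

print(via_filter)
print(via_comprehension)
```

Both give the same result here; they only differ if a cell can legitimately be an empty string.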