Python Newbie Question Regex

metallica1973 · March 6, 2013, 4:53pm

I started teaching myself Python and am stuck trying to understand why I am not getting the output I want. Long story short, I am using pdb for debugging, and here is the function where I am having the issue:

import re
...
...
...

def find_all_flvs(url):
    soup = BeautifulSoup(urllib2.urlopen(url))
    flvs = []
    for link in soup.findAll(onclick=re.compile("doShowCHys=1*")):
        link = str(link)
        vidnum   = re.search("\d{5,6}.*&amp", link)
        vidurl   = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % vidnum

        for hashval_url in BeautifulSoup(urllib2.urlopen(vidurl)).findAll("flv"):

            flvs.append(hashval_url.text)

    return flvs

I verified that my regex (\d{5,6}.*&amp) is correct. This input:

"/home/Player.aspx?lpk4=108148&playChapter=True\',960,540,94343);return false;"

produces:

which is what I want. But when I step through with pdb and get to:

vidnum   = re.search("\d{5,6}.*&amp", link)

and this is what I end up with as the output:

<_sre.SRE_Match object at 0xaaf8de8>

whereas I should be seeing:

so it can be simply appended to:

vidurl   = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % vidnum

producing:

(pdb)p vidurl

http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=108148

I have been through several URLs and cannot figure out what I am doing wrong:

Python Regular Expressions

??

---------- Post updated at 04:37 PM ---------- Previous update was at 04:21 PM ----------

I made progress. The things you can find out by just reading:

re.search(pattern, string, flags=0)

    Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

and 

re.findall(pattern, string, flags=0)

    Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
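To make the difference between the two quoted functions concrete, here is a minimal sketch. The sample string is an assumption: it supposes the stringified tag contains "&amp;" where the decoded URL shows a plain "&". The names matched_text and all_matches are only illustrative.

```python
import re

# Hypothetical sample, assuming str(link) contains the HTML entity "&amp;".
link = "/home/Player.aspx?lpk4=108148&amp;playChapter=True"
pattern = r"\d{5,6}.*&amp"

# re.search returns a match object (or None), not the matched text itself.
match = re.search(pattern, link)
matched_text = match.group(0) if match else None

# re.findall returns a list of strings, which is empty when nothing matches.
all_matches = re.findall(pattern, link)

print(matched_text)
print(all_matches)
```

In other words, printing the result of re.search directly shows the match object, and calling .group(0) on it gives the text.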

I was simply using the wrong function. I replaced re.search with re.findall and it worked partially.

vidnum   = re.findall("\d{5,6}.*&amp", link)
(pdb)p vidnum
['108148&amp']
(pdb)p vidurl
http://www.blahblah.com/home/GetPlay...px?lpk4=108148['108148&amp']

How do I remove the brackets and single quotes to produce only:

http://www.blahblah.com/home/GetPlay...px?lpk4=108148&amp

??

---------- Post updated at 04:53 PM ---------- Previous update was at 04:37 PM ----------

It turned out that vidnum is a list (re.findall always returns one), so I needed to pick out the element I wanted by index:

vidurl   = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % vidnum[0]
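A small sketch of why the brackets appeared and how indexing removes them. The helper name build_vidurl and the constant BASE are made up for illustration. The guard for an empty list is an addition, on the assumption that some links may not match at all.

```python
BASE = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s"

def build_vidurl(matches):
    """Return the URL for the first match, or None when the list is empty."""
    if not matches:
        return None  # matches[0] would raise IndexError here
    return BASE % matches[0]

matches = ["108148&amp"]  # shape of the value re.findall returned above

# Formatting the whole list with %s inserts the list's repr, brackets and quotes included.
whole_list_url = BASE % matches

print(whole_list_url)
print(build_vidurl(matches))
```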

Chubler_XL · March 6, 2013, 6:58pm

You could also try:

refound = re.search(r'\d{5,6}(?=&amp)', link)

if refound:
    vidurl   = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % refound.group(0)
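For what it is worth, here is a quick check of what the lookahead version returns. It uses the same hypothetical sample string as an assumption about what str(link) looks like. Because (?=&amp) only asserts that "&amp" follows without consuming it, the match is just the digits.

```python
import re

# Hypothetical sample, assuming the stringified tag contains "&amp;".
link = "/home/Player.aspx?lpk4=108148&amp;playChapter=True"

refound = re.search(r"\d{5,6}(?=&amp)", link)
video_id = refound.group(0) if refound else None

print(video_id)
```

Note that this pattern is stricter than the original one: the digits must be followed immediately by "&amp", whereas \d{5,6}.*&amp allowed other characters in between.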