I started teaching myself Python and am stuck trying to understand why I am not getting the output I want. Long story short, I am using pdb for debugging, and here is the function I am having trouble with:
import re
...
...
...
def find_all_flvs(url):
    soup = BeautifulSoup(urllib2.urlopen(url))
    flvs = []
    for link in soup.findAll(onclick=re.compile("doShowCHys=1*")):
        link = str(link)
        vidnum = re.search("\d{5,6}.*&", link)
        vidurl = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % vidnum
        for hashval_url in BeautifulSoup(urllib2.urlopen(vidurl)).findAll("flv"):
            flvs.append(hashval_url.text)
    return flvs
I verified that my regex (\d{5,6}.*&) is correct. Testing it against:
"/home/Player.aspx?lpk4=108148&playChapter=True\',960,540,94343);return false;"
produces:
108148
which is what I want. However, when I step through the code with pdb and get to:
vidnum = re.search("\d{5,6}.*&", link)
this is what I end up with as the output:
<_sre.SRE_Match object at 0xaaf8de8>
when I should be seeing:
108148
so it can be simply appended to:
vidurl = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % vidnum
producing:
(pdb)p vidurl
http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=108148
I have been through several URLs and cannot figure out what I am doing wrong. Any ideas?
---------- Post updated at 04:37 PM ---------- Previous update was at 04:21 PM ----------
I made progress. The things you can find out by just reading:
re.search(pattern, string, flags=0)
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
and
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
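To see the difference described in those two excerpts, here is a minimal sketch that runs both functions against the sample string from my first post. The variable names matched_text and found are just mine for illustration, and I wrote the patterns as raw strings, which is a habit I picked up rather than something the original code did.

```python
import re

# Sample onclick text from the first post.
link = "/home/Player.aspx?lpk4=108148&playChapter=True',960,540,94343);return false;"

# re.search returns a match object (or None when nothing matches),
# not the matched text itself.
match = re.search(r"\d{5,6}.*&", link)

# re.findall returns a plain list of the matched strings.
found = re.findall(r"\d{5,6}.*&", link)

# The matched text has to be pulled out of the match object with group().
matched_text = match.group(0) if match else None

print(matched_text)
print(found)
```

So re.search was not really failing; printing the match object in pdb just shows the object, and calling group() on it is what gives back the text.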
I was simply using the wrong function. I replaced re.search with re.findall and it partially worked:
vidnum = re.findall("\d{5,6}.*&", link)
(pdb)p vidnum
['108148&']
(pdb)p vidurl
http://www.blahblah.com/home/GetPlay...px?lpk4=108148['108148&']
How do I remove the brackets and single quotes to produce only:
http://www.blahblah.com/home/GetPlay...px?lpk4=108148&
??
---------- Post updated at 04:53 PM ---------- Previous update was at 04:37 PM ----------
It turned out that vidnum is a list, and I needed to specify the index of the element I wanted, so:
vidurl = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % vidnum[0]
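For anyone who lands here later, an alternative that might be worth trying is a capture group, so that only the digits are extracted and the trailing & is left out. This is only a sketch: the pattern lpk4=(\d{5,6}) is my own assumption that the ID always follows "lpk4=" in the onclick text, and I have only checked it against the one sample string above.

```python
import re

# Sample onclick text from the first post.
link = "/home/Player.aspx?lpk4=108148&playChapter=True',960,540,94343);return false;"

# Assumed pattern: capture the 5 or 6 digits that directly follow "lpk4=".
match = re.search(r"lpk4=(\d{5,6})", link)

if match:
    # group(1) is the text captured by the parentheses, without "lpk4=" or "&".
    vidnum = match.group(1)
    vidurl = "http://www.blahblah.com/home/GetPlayerXML.aspx?lpk4=%s" % vidnum
else:
    # No ID found in this link, so there is nothing to build a URL from.
    vidurl = None

print(vidurl)
```

The None check matters because re.search returns None when the pattern is not found, and calling group() on None raises an AttributeError.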