I have thrown in the towel and can't figure out how to do this. I have a directory of HTML files containing URLs that I need to scrape (by looping through the files) and add to a dictionary. An example of the output I would like is:
bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com
The code I have so far is:
import os
from bs4 import BeautifulSoup

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                print "HTML Files: {}\nUrls: {}\n".format(tut,urls)
This produces the correct output for the most part:
HTML Files: bigbadwolf.html
Urls: https://www.blah.com
HTML Files: bigbadwolf.html
Urls: https://www.blahblah.com
HTML Files: bigbadwolf.html
Urls: https://www.blahblahblah.com
HTML Files: maryhadalittlelamb.html
Urls: http://www.red.com
HTML Files: maryhadalittlelamb.html
Urls: https://www.redyellow.com
HTML Files: maryhadalittlelamb.html
Urls: http://www.zigzag.com
However, I want it in a dictionary with this format:
bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com
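Written out as a Python literal, I assume the target would be a dict that maps each filename to a list of URLs, roughly like the sketch below. The list-valued layout is my assumption about how "many values per key" would be stored; the loop at the end only shows how such a dict could be printed in the format above.

```python
# Assumed target structure: each filename maps to a list of URLs.
tut_links = {
    "bigbadwolf.html": [
        "https://www.blah.com",
        "http://www.blahblah.com",
        "http://www.blahblahblah.com",
    ],
    "maryhadalittlelamb.html": [
        "http://www.red.com",
        "https://www.redyellow.com",
        "http://www.zigzag.com",
    ],
    "time.html": [
        "https://www.est.com",
        "http://www.pst.com",
        "https://www.cst.com",
    ],
}

# Print one "filename: url, url, url" line per key.
for name, urls in tut_links.items():
    print("{}: {}".format(name, ", ".join(urls)))
```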
As you can see, each HTML document contains several URLs, so each key needs to hold many values (URLs). I tried many variations of the code below but can't get a single key to have more than one URL associated with it.
tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                tut_links[tut] = urls
This produces:
bigbadwolf.html: https://www.blah.com
maryhadalittlelamb.html: http://www.red.com
time.html: https://www.est.com
...
...
...
Can someone please shed some light on how to do this?
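For what it's worth, one direction that might work is to keep a list per filename and append each URL to it, since assigning to the same key replaces the previous value. Below is a minimal sketch of that idea, not a definitive implementation. It assumes Python 3 and UTF-8 files, and it swaps in the standard library's `html.parser.HTMLParser` for BeautifulSoup so that it runs without third-party packages. The names `HrefCollector` and `collect_links` are hypothetical.

```python
import os
from collections import defaultdict
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_links(root):
    """Map each .html filename under root to the list of URLs it contains."""
    tut_links = defaultdict(list)
    for subdir, dirs, files in os.walk(root):
        for tut in files:
            if tut.endswith(".html"):
                # Join with subdir so files in nested folders resolve too.
                fpath = os.path.join(subdir, tut)
                # Assumption: the files are UTF-8 encoded.
                with open(fpath, "r", encoding="utf-8") as fh:
                    parser = HrefCollector()
                    parser.feed(fh.read())
                # extend() adds to the existing list; plain assignment
                # (tut_links[tut] = url) would overwrite it each time.
                tut_links[tut].extend(parser.hrefs)
    return dict(tut_links)
```

Calling `collect_links('./html/tutorials/blah')` should return a dict whose values are lists, which can then be printed with `", ".join(urls)` for each key. With BeautifulSoup, the same pattern would presumably be `tut_links[tut].append(links.get('href'))` inside the `find_all('a')` loop.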