Python Web Page Scraping Urls Creating A Dictionary

I have thrown in the towel and can't figure out how to do this. I have a directory of HTML files that contain URLs that I need to scrape (loop through) and add to a dictionary. An example of the output I would like is:

bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com

My code that I have so far is:

import os
from bs4 import BeautifulSoup

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                print "HTML Files: {}\nUrls: {}\n".format(tut, urls)

produces the correct output for the most part:

HTML Files: bigbadwolf.html
Urls: https://www.blah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblahblah.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.red.com

HTML Files: maryhadalittlelamb.html
Urls: https://www.redyellow.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.zigzag.com

but I want it in a dictionary with this format:

bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com

As you can see, there will be several URLs inside an HTML doc, so there will be keys that contain many values (URLs). I tried many variations of the code below but can't get a single key to have many URLs associated with it.

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                tut_links[tut] = urls
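                # NOTE: this assignment overwrites the previous href on every pass,
                # so each key ends up holding only the last URL found in that file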

produces:

bigbadwolf.html: https://www.blah.com
maryhadalittlelamb.html: http://www.red.com
time.html: https://www.est.com
...
...
...

Can someone please shed some light on what I am trying to do?

We don't get many Python questions here, sorry.

PHP, Perl, and all the standard UNIX and Linux shell programming languages, as well as C and C++ questions; but not many Python questions.

I'm not a Python programmer, so perhaps someone else here is and can help you.

Is each value of the dictionary:
(a) a list (or array) of URLs? or
(b) a comma-delimited string of URLs?
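
For illustration, using the example data from the first post, the two shapes would look like this:

# (a) each value is a list of URLs
{'time.html': ['https://www.est.com', 'http://www.pst.com', 'https://www.cst.com']}

# (b) each value is a single comma-delimited string
{'time.html': 'https://www.est.com, http://www.pst.com, https://www.cst.com'}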

If you want (a), then try something like the following:

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                tut_links[tut].append(urls)
 

Disclaimer: Completely untested; I don't have the module at the moment.
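
If you want (b) instead, the same caveat applies, but you could join each list into one comma-delimited string once the loop finishes:

# Convert each list of URLs into a single comma-delimited string per file
tut_links = {tut: ", ".join(urls) for tut, urls in tut_links.items()}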


Thanks for all the replies.

It is a comma-delimited string of URLs.

---------- Post updated at 02:30 PM ---------- Previous update was at 02:05 PM ----------

Thanks durden_tyler,

I tested the list approach you suggested; I had looked at it before but didn't go down that path, and it worked. Awesome. Many thanks.

---------- Post updated at 02:51 PM ---------- Previous update was at 02:30 PM ----------

Would you happen to know how to delete duplicate entries inside of this embedded list?

bigbadwolf.html:

'https://www.blah.com',
'https://www.blah.com',
'https://www.blah.com',
'http://www.blahblah.com',
'http://www.blahblah.com'

Do not add a duplicate entry in the first place:

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                if urls not in tut_links[tut]:
                    tut_links[tut].append(urls)
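
If the order of the URLs does not matter, another (again untested) variation would be to collect them in a set per file, which ignores duplicates automatically and avoids the linear "not in" check on a list:

import os
from bs4 import BeautifulSoup

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = set()                     # a set instead of a list
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                tut_links[tut].add(links.get('href'))  # add() silently ignores repeats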

I cross-referenced the HTML files with the output URLs and it's correct. Most of the HTML files do contain multiple duplicate URLs, as in:

http://www.blah.org
http://www.blah.org
http://www.blah.org

So I would need to remove the duplicates. I have done this before using:

tut_links = list(set(tut_links))
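
Note that calling set() on the dictionary itself would only operate on its keys; to strip duplicates from the embedded lists it would need to be applied per value, something like:

for tut in tut_links:
    tut_links[tut] = list(set(tut_links[tut]))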

Let me give this a shot and see what happens. Thanks for all the help. I will let you know how it goes.

---------- Post updated 06-07-17 at 02:54 PM ---------- Previous update was 06-06-17 at 04:03 PM ----------

Thanks for all the help. Here is the finished code:

import os
from bs4 import BeautifulSoup

tut_links = {}

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a', href=True):
                urls = links.get('href')
                if urls.startswith(('http', 'https')):
                    tut_links[tut].append(urls)
            for dup in tut_links.values():  # removes duplicate urls from each dictionary value list
                dup[:] = list(set(dup))

Worked like a champ

'bigbadwolf.html': ['https://www.blah.com', 'http://www.blahblah.com', 'http://www.blahblahblah.com']
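
And if the output should end up in the "file: url1, url2, ..." format from the first post, something like this should print it that way:

for tut, urls in tut_links.items():
    print "{}: {}".format(tut, ", ".join(urls))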