Python Web Page Scraping Urls Creating A Dictionary

I have thrown in the towel and can't figure out how to do this. I have a directory of HTML files that contain URLs that I need to scrape (loop through) and add to a dictionary. An example of the output I would like is:

bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com

My code that I have so far is:

import os
from bs4 import BeautifulSoup

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                print "HTML Files: {}\nUrls: {}\n".format(tut, urls)

produces the correct output for the most part:

HTML Files: bigbadwolf.html
Urls: https://www.blah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblah.com

HTML Files: bigbadwolf.html
Urls: https://www.blahblahblah.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.red.com

HTML Files: maryhadalittlelamb.html
Urls: https://www.redyellow.com

HTML Files: maryhadalittlelamb.html
Urls: http://www.zigzag.com

but I want it in a dictionary with this format:

bigbadwolf.html: https://www.blah.com, http://www.blahblah.com, http://www.blahblahblah.com
maryhadalittlelamb.html: http://www.red.com, https://www.redyellow.com, http://www.zigzag.com
time.html: https://www.est.com, http://www.pst.com, https://www.cst.com

As you can see, there will be several URLs inside an HTML doc, so there will be keys that contain many values (URLs). I tried many variations of the code below but can't get a single key to have many URLs associated with it.

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                tut_links[tut] = urls
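                # NOTE: this assignment overwrites the previous href on every pass,
                # so each key ends up holding only the last URL found in that file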

produces:

bigbadwolf.html: https://www.blah.com
maryhadalittlelamb.html: http://www.red.com
time.html: https://www.est.com
...
...
...

Can someone please shed some light on what I am trying to do?

We don't get many Python questions here, sorry.

PHP, Perl, and all the standard UNIX and Linux shell programming languages, as well as C and C++ questions; but not many Python questions.

I'm not a Python programmer, so perhaps someone else here is and can help you.

Is each value of the dictionary:
(a) a list (or array) of URLs? or
(b) a comma-delimited string of URLs?
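
For illustration, using the example data from the first post, the two shapes would look like this:

# (a) each value is a list of URLs
{'time.html': ['https://www.est.com', 'http://www.pst.com', 'https://www.cst.com']}

# (b) each value is a single comma-delimited string
{'time.html': 'https://www.est.com, http://www.pst.com, https://www.cst.com'}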

If you want (a), then try something like the following:

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                tut_links[tut].append(urls)
 

Disclaimer: Completely untested; I don't have the module at the moment.
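
If you want (b) instead, the same caveat applies, but you could join each list into one comma-delimited string once the loop finishes:

# Convert each list of URLs into a single comma-delimited string per file
tut_links = {tut: ", ".join(urls) for tut, urls in tut_links.items()}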


Thanks for all the replies.

It is a comma-delimited string of URLs.

---------- Post updated at 02:30 PM ---------- Previous update was at 02:05 PM ----------

Thanks durden_tyler,

I tested the list approach you suggested; I had looked at it before but didn't go down that path, and it worked. Awesome. Many thanks.

---------- Post updated at 02:51 PM ---------- Previous update was at 02:30 PM ----------

Would you happen to know how to delete duplicate entries inside of this embedded list?

bigbadwolf.html:

'https://www.blah.com',
'https://www.blah.com',
'https://www.blah.com',
'http://www.blahblah.com',
'http://www.blahblah.com'

Do not add a duplicate entry in the first place:

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                urls = links.get('href')
                if urls not in tut_links[tut]:
                    tut_links[tut].append(urls)
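
If the order of the URLs does not matter, another (again untested) variation would be to collect them in a set per file, which ignores duplicates automatically and avoids the linear "not in" check on a list:

import os
from bs4 import BeautifulSoup

tut_links = {}
for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = set()                     # a set instead of a list
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a'):
                tut_links[tut].add(links.get('href'))  # add() silently ignores repeats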

I cross-referenced the HTML files with the output URLs and it's correct. Most of the HTML files do contain multiple duplicate URLs, as in:

http://www.blah.org
http://www.blah.org
http://www.blah.org

So I would need to remove the duplicates. I have done this before using:

tut_links = list(set(tut_links))
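
Note that calling set() on the dictionary itself would only operate on its keys; to strip duplicates from the embedded lists it would need to be applied per value, something like:

for tut in tut_links:
    tut_links[tut] = list(set(tut_links[tut]))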

Let me give this a shot and see what happens. Thanks for all the help. I will let you know how it goes.

---------- Post updated 06-07-17 at 02:54 PM ---------- Previous update was 06-06-17 at 04:03 PM ----------

Thanks for all the help. Here is the finished code:

import os
from bs4 import BeautifulSoup

tut_links = {}

for subdir, dirs, files in os.walk('./html/tutorials/blah'):
    for tut in files:
        if tut.endswith(".html"):
            tut_links[tut] = []
            fpath = os.path.join("./html/tutorials/blah", tut)
            content = open(fpath, "r").read()
            file = BeautifulSoup(content, 'lxml')
            for links in file.find_all('a', href=True):
                urls = links.get('href')
                if urls.startswith(('http', 'https')):
                    tut_links[tut].append(urls)
            for dup in tut_links.values():  # removes duplicate urls from each dictionary value list
                dup[:] = list(set(dup))

Worked like a champ

'bigbadwolf.html': ['https://www.blah.com', 'http://www.blahblah.com', 'http://www.blahblahblah.com']
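
And if the output should end up in the "file: url1, url2, ..." format from the first post, something like this should print it that way:

for tut, urls in tut_links.items():
    print "{}: {}".format(tut, ", ".join(urls))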