Newbie Python URL Scraper

I set up ZoneMinder and have been playing around with a couple of Wanscam PTZ IP cameras, and I have been running into roadblocks with streaming and so on. I can't find much information on the camera or the webserver that sits on it, and I wanted to get an absolute directory listing of that webserver. I tried using:

wget --spider -r 192.168.3.3:80
Spider mode enabled. Check if remote file exists.
--2013-06-04 13:00:49--  (try: 5)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 13:00:54--  (try: 6)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 13:01:00--  (try: 7)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 13:01:07--  (try: 8)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

but it doesn't find a thing. I know there is a webserver on TCP:80 because I can view the camera through it. I have been attempting to use Python's "scrapy", but I can't work out how to tell it to crawl and discover the directory structure, as opposed to just telling it where to start looking. This is what I have so far:

#!/usr/bin/env python
# encoding=utf-8

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy import log
import sys
### Kludge to set default encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')

class PTZcamera(BaseSpider):
    name = "camera"
    # allowed_domains takes bare domains, not URLs with a scheme or port
    allowed_domains = ["192.168.3.3"]
    start_urls = ["http://192.168.3.3/"]

    def parse(self, response):
        pass  # TODO: extract and follow links here

but it doesn't produce much. I would like output that lists the absolute path of every directory and file on the webserver, like:

http://192.168.3.3/cgi-bin/blah
http://192.168.3.3/cgi-bin/blah2
http://192.168.3.3/video/blah1
http://192.168.3.3/video/blah2
...
...
...

Can someone point me in the correct direction?
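
For reference, here is the direction I think I need to head in. This is a sketch only, since I haven't gotten the camera to answer yet; the XPath and settings are my guesses, not anything that has actually worked against this camera:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
import urlparse

class PTZcamera(BaseSpider):
    name = "camera"
    allowed_domains = ["192.168.3.3"]     # bare domain, no scheme or port
    start_urls = ["http://192.168.3.3/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # grab both href and src so frame/img/script targets are caught too
        for link in hxs.select("//@href | //@src").extract():
            url = urlparse.urljoin(response.url, link)
            print(url)
            # scrapy deduplicates requests, so re-yielding seen URLs is harmless
            yield Request(url, callback=self.parse)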

If it won't talk to wget, I doubt it'll talk to Python. Solve that problem first, I think...

It may be refusing to talk to wget because it doesn't like wget's user-agent, which you can set with something like -U netscape.

Also, give it --server-response so you can see exactly where the communication dies.

Thanks for the reply. It didn't make a difference.

wget --spider -r --server-response -U netscape http://192.168.3.3

Spider mode enabled. Check if remote file exists.
--2013-06-04 14:34:51--  (try:19)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Retrying.

Spider mode enabled. Check if remote file exists.
--2013-06-04 14:35:01--  (try:20)  http://192.168.3.3/
Connecting to 192.168.3.3:80... connected.
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
Giving up.

I tried just a basic HTTP GET request and got this:

nc -v 192.168.3.3 80
Connection to 192.168.3.3 80 port [tcp/http] succeeded!
GET / HTTP/1.0

HTTP/1.1 400 Bad Request
Server: Netwave IP Camera
Date: Tue, 04 Jun 2013 18:37:28 GMT
Content-Type: text/html
Content-Length: 135
Connection: close

<HTML><HEAD><TITLE>400 Bad Request</TITLE></HEAD>
<BODY BGCOLOR="#cc9999"><H4>400 Bad Request</H4>
Can't parse request.
</BODY></HTML>
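
One thing I want to rule out: typing the request into nc interactively sends bare LF line endings, and a strict embedded parser might insist on CRLF. Piping the request in with printf should test that (my guess, untested):

printf 'GET / HTTP/1.0\r\n\r\n' | nc 192.168.3.3 80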

I will dig around. Thanks

Try removing --spider and see what you get. With --spider, wget sends HEAD requests (checking whether a file exists rather than downloading it), and an embedded HTTP server may do odd things when you ask it for something it wasn't expecting.
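
If you want to confirm that theory, comparing a raw HEAD against a raw GET from Python would show exactly where it chokes. A minimal sketch, assuming the camera treats a scripted client the same way it treats nc (the truncation to 200 bytes just keeps the output readable):

import socket

def raw_request(method, host, port=80, path="/"):
    # speak HTTP by hand so we control the exact request line and line endings
    s = socket.create_connection((host, port), timeout=5)
    s.sendall("%s %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (method, path, host))
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    return "".join(chunks)

for method in ("HEAD", "GET"):
    print("--- %s ---" % method)
    try:
        print(raw_request(method, "192.168.3.3")[:200])
    except socket.error as e:
        print("socket error: %s" % e)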

There's no generic way to discover every file on a web server if no page links to them.
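
About the only alternative is to guess: probe a wordlist of likely paths and see which ones answer. The paths below are my guesses based on similar Netwave/Foscam-style cameras, not documentation for this model, and even a 401 or 404 tells you something:

import urllib2

CANDIDATES = [
    "/index1.htm", "/videostream.cgi", "/snapshot.cgi",
    "/get_params.cgi", "/get_status.cgi", "/camera_control.cgi",
]

for path in CANDIDATES:
    url = "http://192.168.3.3" + path
    try:
        print("%s -> %s" % (url, urllib2.urlopen(url, timeout=5).getcode()))
    except urllib2.HTTPError as e:
        # 401 means the path exists but wants credentials; 404 means it doesn't
        print("%s -> %s" % (url, e.code))
    except Exception as e:
        print("%s -> %s" % (url, e))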

Before resorting to guesswork, though: what page would you be accessing it from if you used an ordinary web browser?

When I log in to the camera from Firefox, it redirects me to an index1.htm page, which in turn takes me to the actual camera view and its config options. When I look at the URL at the top of the page, it says:

http://192.168.3.3/index1.htm

and whenever I click on any link, it stays the same

http://192.168.3.3/index1.htm

never changing. I will look at the page's source code (I was too tired last night) and see what is going on. I am still having trouble understanding why wget fails to spider the site and spit out the links, but I should have some feedback today. Thanks for all your input.

As I said, I suspect it's not a problem with wget itself but with --spider. Your camera has a very tiny computer brain that's probably not running a full, 100% standards-compliant web server, just a tiny stub that answers plain GET requests and very little else.
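
If that's what's going on, a crawler that only ever issues plain GETs sidesteps the problem. A rough sketch, untested against this firmware; it follows every href and src it finds, which should also catch the frame sources hiding behind that index1.htm page:

import urllib2
import urlparse
from HTMLParser import HTMLParser

class LinkParser(HTMLParser):
    """Collect every href/src attribute on a page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src"):
                self.links.append(value)

def crawl(start, limit=200):
    seen, queue = set(), [start]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib2.urlopen(url, timeout=5).read()
            parser = LinkParser()
            parser.feed(html)
        except Exception as e:
            print("%s -> %s" % (url, e))
            continue
        print(url)
        for link in parser.links:
            absolute = urlparse.urljoin(url, link)
            if absolute.startswith(start):  # stay on the camera
                queue.append(absolute)

crawl("http://192.168.3.3/")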