Using Python to grab data from a website

Hello Everyone,
I'm trying to write a Python script that will go to the following website and grab all the data on the page. The page refreshes regularly, and the number of flights changes each time.

http://www.phl.org/cgi-bin/fidsarrival.pl

What I want to do is grab all the data (except for the top three rows, which contain headers) and save it in a text file.

Any help would be greatly appreciated.

Not Python, but:

curl http://www.phl.org/cgi-bin/fidsarrival.pl -o "arrival.txt"
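
For what it's worth, Python can do the same one-shot download; a minimal sketch using urllib (Python 3 assumed, and arrival.txt just mirrors the curl example):

#!/usr/bin/python

import urllib.request

# fetch the page and save it to a file, much like the curl one-liner above
urllib.request.urlretrieve("http://www.phl.org/cgi-bin/fidsarrival.pl", "arrival.txt")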

jgt, although that's not Python code, it's still pretty cool and good to know.

This code actually stores the source text of a webpage and will work pretty nicely in this case:

#!/usr/local/bin/python

import urllib
import os

# get data
f = urllib.urlopen("http://www.phl.org/cgi-bin/fidsarrival.pl")
s = f.read()
f.close()

# write data
ff = open("output.txt", "w")
ff.write(s)
ff.close()

# run shell command to strip the markup and header lines
# (redirecting the output back into output.txt would truncate it before
# sed reads it, so -i edits the file in place instead)
command = "sed -i 's/^<.*//;s/.*DATE.*//;s/^Airline.*//;/^$/d' output.txt"
os.system(command)

Explanation of the sed command:

sed 's/^<.*//;s/.*DATE.*//;s/^Airline.*//;/^$/d'

Replaces all lines which start with "<" with an empty line
Replaces all lines which contain "DATE" with an empty line
Replaces all lines which start with "Airline" with an empty line
Deletes all empty lines
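
If you'd rather avoid the shell call altogether, the same filtering can be done in Python itself. A minimal sketch along the lines of the script above (same Python 2 urllib interface and output.txt filename, with the filtering rules mirroring the sed command):

#!/usr/local/bin/python

import urllib

# get the page text, as in the script above
f = urllib.urlopen("http://www.phl.org/cgi-bin/fidsarrival.pl")
s = f.read()
f.close()

# keep only lines that do not start with "<", do not contain "DATE",
# do not start with "Airline", and are not blank - the same rules as the sed command
kept = []
for line in s.splitlines():
    if line.startswith('<') or 'DATE' in line or line.startswith('Airline'):
        continue
    if not line.strip():
        continue
    kept.append(line)

# write the filtered data
ff = open("output.txt", "w")
ff.write('\n'.join(kept) + '\n')
ff.close()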

:cool:


My solution (I added a for loop to print the output text every time the script is run; you can remove it if you don't need it):

#!/usr/bin/python

import urllib.error, urllib.parse, urllib.request
import re

# get the file and decode the bytes into text
f = urllib.request.urlopen("http://www.phl.org/cgi-bin/fidsarrival.pl")
s = f.read().decode('utf-8', errors='replace')
f.close()

# regular expression pattern matching everything inside < > tags
pattern = r'<.*?>'

# replaces every tag with a newline, then writes the result into the file 'refined.txt'
ff = open('refined.txt', 'w')
ff.write(re.sub(pattern, '\n', s))
ff.close()

# prints the file line by line
of = open('refined.txt').readlines()
for line in of:
    print(line, end='')

This is actually built around pseudocode's solution; I just modified it to use built-in regular expressions instead of calling a shell command to edit the text.

If you're using Python 2.x, just replace import urllib.request, urllib.error, urllib.parse with urllib or urllib2, and urllib.request.urlopen becomes urllib.urlopen.
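
For example, under Python 2.x the same script would look roughly like this (just a sketch of the swap described above; read() already returns a str there, so no decoding is needed):

#!/usr/bin/python

import urllib
import re

# fetch the page - Python 2's urllib interface
f = urllib.urlopen("http://www.phl.org/cgi-bin/fidsarrival.pl")
s = f.read()
f.close()

# strip everything inside < > tags, as in the Python 3 version above
pattern = r'<.*?>'

ff = open('refined.txt', 'w')
ff.write(re.sub(pattern, '\n', s))
ff.close()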

Thanks, everyone!