Hello Everyone,
I'm trying to write a Python script that will go to the following website and grab all the data on the page. The page refreshes regularly, and the number of flights changes from one refresh to the next.
This code stores the source text of a webpage and should work pretty well in this case:
#!/usr/local/bin/python
import urllib
import os
# get data
f = urllib.urlopen("http://www.phl.org/cgi-bin/fidsarrival.pl")
s = f.read()
f.close()
# write data
ff = open("output.txt", "w")
ff.write(s)
ff.close()
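# run shell command
# (sed writes to a temporary file first; redirecting straight back into output.txt would truncate it before it is read)
command = "sed 's/^<.*//;s/.*DATE.*//;s/^Airline.*//;/^$/d' output.txt > output.tmp && mv output.tmp output.txt"
os.system(command)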
Explanation of the sed command:
sed 's/^<.*//;s/.*DATE.*//;s/^Airline.*//;/^$/d'
Replaces all lines which start with "<" with an empty line
Replaces all lines which contain "DATE" with an empty line
Replaces all lines which start with "Airline" with an empty line
Deletes all empty lines
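For anyone who would rather not shell out, the four sed rules above can be approximated in plain Python. This is only a sketch: filter_lines is a made-up helper name, and the sample input is invented for illustration rather than taken from the real page.

```python
def filter_lines(text):
    """Drop the lines that the sed command would blank out and then delete."""
    kept = []
    for line in text.splitlines():
        if line.startswith("<"):        # s/^<.*//
            continue
        if "DATE" in line:              # s/.*DATE.*//
            continue
        if line.startswith("Airline"):  # s/^Airline.*//
            continue
        if line == "":                  # /^$/d
            continue
        kept.append(line)
    return "\n".join(kept)


# Invented sample input, only to show the behaviour
sample = "<html>\nDATE: today\nAirline Flight City\n\nExampleAir 123 Boston\n"
print(filter_lines(sample))
```

Doing the filtering in Python also sidesteps the question of whether sed is installed on the machine running the script.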
My solution (I added a for loop to print the output text every time the script is run; you can remove it if you don't need it):
#!/usr/bin/python
import urllib.error, urllib.parse, urllib.request
import re
# get the page and decode the bytes to text (UTF-8 is an assumption; undecodable bytes are replaced)
f = urllib.request.urlopen("http://www.phl.org/cgi-bin/fidsarrival.pl")
s = f.read().decode('utf-8', errors='replace')
f.close()
# regular expression pattern matching everything inside < > tags
pattern = r'<.*?>'
# replace every match with a newline, then write the result to the file 'refined.txt'
ff = open('refined.txt', 'w')
ff.write(re.sub(pattern, '\n', s))
ff.close()
# print the file line by line
of = open('refined.txt').readlines()
for line in of:
    print(line, end='')
This is actually built around pseudocode's solution; I just modified it to use Python's built-in regular expressions instead of calling a shell command to edit the text.
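The regex step can be tried without touching the network by wrapping it in a small function. A rough sketch, assuming the page is simple enough that a regex will do (a regex is not a real HTML parser); strip_tags and the sample string are hypothetical, and dropping the leftover blank lines is an extra step that the script above does not perform.

```python
import re

# DOTALL lets a tag that is split across two lines still match
TAG_PATTERN = re.compile(r"<.*?>", re.DOTALL)


def strip_tags(html):
    """Replace every <...> tag with a newline, then drop the blank lines left behind."""
    text = TAG_PATTERN.sub("\n", html)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Invented sample markup, not taken from the real page
sample = "<table><tr><td>ExampleAir</td><td>123</td></tr></table>"
print(strip_tags(sample))
```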
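If you're using Python 2.x, just replace import urllib.error, urllib.parse, urllib.request with import urllib or import urllib2, and change urllib.request.urlopen to urllib.urlopen (or urllib2.urlopen). The print(line, end='') call also needs from __future__ import print_function at the top of the script.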
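If the same script has to run under both versions, one common way is to try the Python 3 import first and fall back to the Python 2 one. A minimal sketch; fetch is a hypothetical helper name, not something from the scripts above.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch(url):
    """Return the raw bytes of the page at url, closing the connection afterwards."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch("http://www.phl.org/cgi-bin/fidsarrival.pl") would then return the page as bytes on either version, which still have to be decoded before any text processing.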