Web Scraping - Output File

SMac2347 · Apr 26, 2012

Hello,

I am having some difficulty generating the output I want from web
scraping. Specifically, the script I wrote, while it runs without any
errors, is not writing to the output file correctly. It runs, and
creates the output .txt file; however, the file is blank (ideally it
should be populated with a list of names).

I took the base of a program that I had before for a different data
gathering task, which worked beautifully, and edited it for my
purposes here. Any insight as to what I might be doing wrong would be
highly appreciated. Code is included below. Thanks!

import os
import re
import urllib2

outfile = open("Skadden.txt","w")

A = 1
Z = 26

for letter in range(A,Z):
    for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):
        x = line
        if '">' in line:
            start=x.find('">"')
            end= x.find('</A></td>',start)
            name=x[start:end]
            outfile.write(name+"\n")
            print name

SMac2347 · Apr 26, 2012

SMac2347 said:
> Hello,
>
> I am having some difficulty generating the output I want from web
> scraping. Specifically, the script I wrote, while it runs without any
> errors, is not writing to the output file correctly. It runs, and
> creates the output .txt file; however, the file is blank (ideally it
> should be populated with a list of names).
>
> I took the base of a program that I had before for a different data
> gathering task, which worked beautifully, and edited it for my
> purposes here. Any insight as to what I might be doing wrong would be
> highly appreciated. Code is included below. Thanks!
>
> import os
> import re
> import urllib2
>
> outfile = open("Skadden.txt","w")
>
> A = 1
> Z = 26
>
> for letter in range(A,Z):
>     for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):

You need
alphaSearch=a
but you're using
alphaSearch=1

>         x = line
>         if '">' in line:

You should search for ' >'.

>             start=x.find('">"')

Ditto.

>             end= x.find('</A></td>',start)
>             name=x[start:end]

You should use start+5 to skip the marker text before the name.

>             outfile.write(name+"\n")
>             print name

Your code is bound to break over and over (you should do some smarter parsing), but here's a working version:

--->
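import os
import re
import urllib2

outfile = open("Skadden.txt","w")

A = ord('a')
Z = ord('z')

# range() excludes its end value, so use Z + 1 to include 'z'.
for letter in range(A, Z + 1):
    for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+chr(letter)):
        x = line
        if ' >' in line:
            start=x.find(' >')
            end= x.find('</A></td>',start)
            # start+5 skips the marker text that precedes the name.
            name=x[start+5:end]
            outfile.write(name+"\n")
            print name

# Close the file so everything written is flushed to disk.
outfile.close()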
<---
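
A side note on the loop bounds: range() stops before its end value, so a loop over letter codes has to run to ord('z') + 1 to reach 'z'. Iterating over the letters directly avoids that arithmetic. Here is a minimal sketch, assuming the URL pattern quoted in this thread. BASE_URL and letter_urls are names made up for illustration, and nothing here touches the network:

```python
import string

# URL prefix taken from the thread; the site may no longer answer at this address.
BASE_URL = "http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="


def letter_urls(base=BASE_URL):
    """Return one search URL per lowercase letter, 'a' through 'z' inclusive."""
    return [base + letter for letter in string.ascii_lowercase]


urls = letter_urls()
print(urls[0])
print(urls[-1])
```

Each URL can then be handed to urllib2.urlopen (urllib.request.urlopen in Python 3).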

Kiuhnm
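
On the "smarter parsing" remark: one option is to let an HTML parser find the links instead of slicing lines by hand. The sketch below is written for Python 3 and uses the standard library's html.parser (the thread's code is Python 2, where the module is called HTMLParser). The page layout is an assumption: it supposes each name is the text of a link inside a table cell, as the '</A></td>' marker in the code above suggests. NameParser, extract_names and SAMPLE are hypothetical names and data, not taken from the real site:

```python
from html.parser import HTMLParser


class NameParser(HTMLParser):
    """Collect the text of every <a> element found inside a <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.in_link = False
        self.names = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <a> both arrive as "a".
        if tag == "td":
            self.in_cell = True
        elif tag == "a" and self.in_cell:
            self.in_link = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            # Collapse line breaks and repeated spaces inside the link text.
            name = " ".join("".join(self._buf).split())
            if name:
                self.names.append(name)
            self.in_link = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_link:
            self._buf.append(data)


def extract_names(html):
    """Return the link texts found inside table cells of the given HTML."""
    parser = NameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Made-up markup standing in for one downloaded page.
SAMPLE = (
    '<table><tr><td>&nbsp;<A HREF="/x.cfm?id=1">Jane Doe</A></td></tr>\n'
    '<tr><td><A HREF="/x.cfm?id=2">John\n Smith</A></td></tr></table>'
    '<a href="/home">Home</a>'
)

names = extract_names(SAMPLE)
print(names)
```

Unlike the line-by-line version, this does not care whether a link is split across lines or whether the tags are upper or lower case. It would still need adjusting to whatever markup the real page uses.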

Great, thanks so much for your help!
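
For anyone who finds this thread later and wonders why the original script ran without errors yet wrote no names: str.find returns -1 when the marker is missing instead of raising an exception, and slicing with those values quietly yields an empty string. A small sketch of the effect and of a more defensive helper follows. The sample line and the extract_between helper are hypothetical, made up for illustration:

```python
def extract_between(line, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = line.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)  # step past the marker itself
    end = line.find(end_marker, start)
    if end == -1:
        return None
    return line[start:end]


# Made-up line in the shape the thread's code seems to expect.
line = '<td><A HREF="/bio.cfm?id=1">Jane Doe</A></td>'

# The original search string '">"' has a stray quote, so find() reports -1.
naive_start = line.find('">"')
naive_end = line.find('</A></td>', naive_start)
naive_name = line[naive_start:naive_end]  # empty string, no error raised

good_name = extract_between(line, '">', '</A></td>')
print(repr(naive_name), repr(good_name))
```

Checking for -1 explicitly turns a silent empty result into something the script can report or skip.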
