Web Scraping - Output File

SMac2347

Hello,

I am having some difficulty generating the output I want from web
scraping. Specifically, the script I wrote, while it runs without any
errors, is not writing to the output file correctly. It runs, and
creates the output .txt file; however, the file is blank (ideally it
should be populated with a list of names).

I took the base of a program that I had before for a different data
gathering task, which worked beautifully, and edited it for my
purposes here. Any insight as to what I might be doing wrong would be
highly appreciated. Code is included below. Thanks!

import os
import re
import urllib2

outfile = open("Skadden.txt","w")

A = 1
Z = 26

for letter in range(A,Z):

    for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):

        x = line
        if '"><B>' in line:
            start=x.find('"><B>"')
            end= x.find('</B></A></nobr></td>',start)
            name=x[start:end]
            outfile.write(name+"\n")
            print name
 
SMac2347

import os
import re
import urllib2
outfile = open("Skadden.txt","w")
A = 1
Z = 26
for letter in range(A,Z):
     for line in urllib2.urlopen("http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="+str(letter)):

You need
  alphaSearch=a
but you're using
  alphaSearch=1

          x = line
          if '"><B>' in line:

You should search for ' ><B>'.

               start=x.find('"><B>"')

Ditto.

               end= x.find('</B></A></nobr></td>',start)
               name=x[start:end]

You should use start+5 to skip ' ><B>'.

               outfile.write(name+"\n")
               print name
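
To make the two string-search remarks concrete, here is a minimal sketch run against a made-up line. The sample markup and the name in it are assumptions for illustration only; the real page may look different. It shows that find() returns -1 for a marker that never occurs, and why the slice has to start after the marker rather than at it.

```python
# Hypothetical sample line; the real page markup may differ.
line = '<td><nobr><A href="bio.cfm?id=1" ><B>Jane Doe</B></A></nobr></td>'

marker = ' ><B>'
closing = '</B></A></nobr></td>'

# A marker that never occurs makes find() return -1 instead of raising.
missing = line.find('"><B>"')

start = line.find(marker)
end = line.find(closing, start)

# Slicing from start keeps the marker itself in the result.
with_marker = line[start:end]

# Skipping len(marker) characters (5 here) leaves only the name.
name = line[start + len(marker):end]

print(missing)
print(repr(with_marker))
print(repr(name))
```

Using len(marker) instead of a literal 5 keeps the offset correct if the marker string ever changes.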

Your code is bound to break over and over (you should do some smarter parsing), but here's a working version:

--->
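import urllib2

outfile = open("Skadden.txt", "w")

A = ord('a')
Z = ord('z')

# range() excludes its upper bound, so use Z + 1 to include 'z'.
for letter in range(A, Z + 1):
    url = "http://www.skadden.com/Index.cfm?contentID=44&alphaSearch=" + chr(letter)
    for line in urllib2.urlopen(url):
        if ' ><B>' in line:
            start = line.find(' ><B>')
            end = line.find('</B></A></nobr></td>', start)
            # Skip the 5 characters of ' ><B>' so only the name remains.
            name = line[start + 5:end]
            outfile.write(name + "\n")
            print name

# Close the file so everything buffered is written to disk.
outfile.close()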
<---
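
As for "smarter parsing", one possible approach is to let an HTML parser find the tags instead of matching exact substrings. The following is only a sketch under stated assumptions: it is written for Python 3 (html.parser from the standard library) rather than the Python 2 used above, it assumes each name sits in a <B> element nested inside an <A> element, and the helper names (letter_urls, BoldLinkParser, extract_names) and the sample markup are made up for illustration. It does not touch the network; fetching the pages is left out.

```python
from html.parser import HTMLParser
import string

BASE_URL = "http://www.skadden.com/Index.cfm?contentID=44&alphaSearch="


def letter_urls():
    """Return one URL per letter, 'a' through 'z' inclusive."""
    return [BASE_URL + letter for letter in string.ascii_lowercase]


class BoldLinkParser(HTMLParser):
    """Collect the text of every <b> element nested inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.in_bold = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <B> arrive as 'a' and 'b'.
        if tag == "a":
            self.in_link = True
        elif tag == "b" and self.in_link:
            self.in_bold = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
            self.in_bold = False
        elif tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current name.
        if self.in_bold:
            self.names[-1] += data


def extract_names(html):
    """Return the stripped, non-empty names found in an HTML string."""
    parser = BoldLinkParser()
    parser.feed(html)
    parser.close()
    return [name.strip() for name in parser.names if name.strip()]


# Hypothetical markup modelled on the search strings used in this thread.
sample = (
    '<tr><td><nobr><A HREF="bio.cfm?id=1" ><B>Jane Doe</B></A></nobr></td></tr>\n'
    '<tr><td><B>Heading</B></td></tr>\n'
    '<tr><td><nobr><A HREF="bio.cfm?id=2" ><B>John Roe</B></A></nobr></td></tr>'
)

names = extract_names(sample)
print(names)
```

Because the parser works on tags rather than exact character sequences, small changes in spacing or quoting around the tags should not break it the way the substring search does.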

Kiuhnm

Great, thanks so much for your help!
 
