difference between urllib2.urlopen and firefox view 'page source'?

Discussion in 'Python' started by cjl, Mar 20, 2007.

  1. cjl

    cjl Guest

    Hi.

    I am trying to screen scrape some stock data from yahoo, so I am
    trying to use urllib2 to retrieve the html and beautiful soup for the
    parsing.

    Maybe (most likely) I am doing something wrong, but when I use
    urllib2.urlopen to fetch a page, and when I view 'page source' of the
    exact same URL in firefox, I am seeing slight differences in the raw
    html.

    Do I need to set a browser agent so yahoo thinks urllib2 is firefox?
    Is yahoo detecting that urllib2 doesn't process javascript, and
    passing different data?

    -cjl
     
    cjl, Mar 20, 2007
    #1
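One way to pin down what actually differs is to save the Firefox 'page source' to a file and diff it against what urllib2 returns. Below is a minimal sketch using difflib from the standard library, written for Python 3. The two HTML strings are made-up stand-ins for the two downloads, not real Yahoo markup.

```python
import difflib

def html_diff(fetched, browser):
    """Return unified-diff lines between two HTML strings."""
    return list(difflib.unified_diff(
        fetched.splitlines(),
        browser.splitlines(),
        fromfile="urllib",
        tofile="firefox",
        lineterm="",
    ))

# Hypothetical sample data standing in for the two page sources.
fetched = "<html>\n<body>\n<p>Price: 10</p>\n</body>\n</html>"
browser = "<html>\n<body class=\"js\">\n<p>Price: 10</p>\n</body>\n</html>"

diff_lines = html_diff(fetched, browser)
for line in diff_lines:
    print(line)
```

Lines starting with "-" appear only in the urllib copy and lines starting with "+" only in the browser copy, which makes it easier to judge whether the differences touch the data being scraped.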

  2. cjl

    zacherates Guest

    http://developer.yahoo.com/yui/articles/gbs/index.html seems to
    indicate that Yahoo! passes you different markup depending on which
    grade your browser falls into. I'm not sure I'd spoof your
    User-Agent; after all, your client is unlikely to support the features
    that they're looking for in Firefox (javascript, css, SVG).
     
    zacherates, Mar 20, 2007
    #2

  3. cjl

    Steve Holden Guest

    It's almost certainly a browser detection issue. This may not matter for
    your application.

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC/Ltd http://www.holdenweb.com
    Skype: holdenweb http://del.icio.us/steve.holden
    Recent Ramblings http://holdenweb.blogspot.com
     
    Steve Holden, Mar 20, 2007
    #3
  4. cjl

    Tina I Guest

    Unless the data you need depends on the site detecting a specific
    browser, you will probably receive 'cleaner' code that's more easily
    parsed if you don't set a user agent. Usually the browser optimization
    they do is just eye candy, bells and whistles anyway, in order to give
    you a more 'pleasing experience'. I doubt that your program will care
    about that ;)

    Tina
     
    Tina I, Mar 20, 2007
    #4
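To test whether browser detection explains the difference, the same URL can be requested twice, once with the library's default User-Agent and once with a browser-like one, and the two responses compared. This is a sketch for Python 3, where urllib2's Request and urlopen live in urllib.request. The URL and the User-Agent string are placeholders chosen for illustration, not values from this thread.

```python
from urllib.request import Request, urlopen

URL = "http://example.com/"  # placeholder URL, not from the thread
# Example browser-like string; any real browser's value could be used instead.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"

def build_request(url, user_agent=None):
    """Build a Request, optionally overriding the default User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return Request(url, headers=headers)

def fetch(url, user_agent=None):
    """Fetch url and return the raw response body as bytes (needs network)."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.read()

plain = build_request(URL)
spoofed = build_request(URL, BROWSER_UA)

# The plain request carries no explicit header; the opener adds its own
# default User-Agent when the request is actually sent.
print(plain.get_header("User-agent"))
print(spoofed.get_header("User-agent"))
```

Calling fetch(URL) and fetch(URL, BROWSER_UA) and comparing the two results would show whether the server varies its markup by User-Agent. If the fields being scraped are present in both, there is little reason to spoof.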
  5. cjl

    Guest

    You can do this fairly easily. I found a similar program in the book
    Core Python Programming. It actually sticks the stocks into an Excel
    spreadsheet. The code is below. You can easily modify it to send the
    output elsewhere.


    # Core Python Chp 23, pg 994
    # estock.pyw

    from Tkinter import Tk
    from time import sleep, ctime
    from tkMessageBox import showwarning
    from urllib import urlopen
    import win32com.client as win32

    warn = lambda app: showwarning(app, 'Exit?')
    RANGE = range(3, 8)
    TICKS = ('AMZN', 'AMD', 'EBAY', 'GOOG', 'MSFT', 'YHOO')
    COLS = ('TICKER', 'PRICE', 'CHG', '%AGE')
    URL = 'http://quote.yahoo.com/d/quotes.csv?s=%s&f=sl1c1p2'

    def excel():
        # start Excel and open a new, visible workbook
        app = 'Excel'
        xl = win32.gencache.EnsureDispatch('%s.Application' % app)
        ss = xl.Workbooks.Add()
        sh = ss.ActiveSheet
        xl.Visible = True
        sleep(1)

        sh.Cells(1, 1).Value = 'Python-to-%s Stock Quote Demo' % app
        sleep(1)
        sh.Cells(3, 1).Value = 'Prices quoted as of: %s' % ctime()
        sleep(1)
        # write the column headers in row 5, one per cell
        for i in range(4):
            sh.Cells(5, i+1).Value = COLS[i]
        sleep(1)
        sh.Range(sh.Cells(5, 1), sh.Cells(5, 4)).Font.Bold = True
        sleep(1)
        row = 6

        # fetch one CSV line per ticker and fill in one row each
        u = urlopen(URL % ','.join(TICKS))
        for data in u:
            tick, price, chg, per = data.split(',')
            sh.Cells(row, 1).Value = eval(tick)   # eval strips the quotes
            sh.Cells(row, 2).Value = ('%.2f' % round(float(price), 2))
            sh.Cells(row, 3).Value = chg
            sh.Cells(row, 4).Value = eval(per.rstrip())
            row += 1
            sleep(1)
        u.close()

        warn(app)
        ss.Close(False)
        xl.Application.Quit()


    if __name__ == '__main__':
        Tk().withdraw()
        excel()

    # Have fun - Mike
     
    , Mar 20, 2007
    #5
  6. cjl

    John Nagle Guest

    Here's a useful online tool that might help you see what's happening:

    http://www.sitetruth.com/experimental/viewer.html

    We use this to help webmasters see what our web crawler is seeing.

    This reads a page, using Python and FancyURLopener, with a
    USER-AGENT string of "SiteTruth.com site rating system."
    Then it parses the page with BeautifulSoup, removes all
    <SCRIPT>, <EMBED>, and <OBJECT> tags, makes all the links
    absolute, then writes the page back out in UTF-8 Unicode.
    The resulting cleaned-up page is displayed.

    If the page you're trying to read looks OK with our viewer,
    you should be able to read it from Python with no problems.

    John Nagle

     
    John Nagle, Mar 20, 2007
    #6
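The cleaning steps John describes (drop <SCRIPT>, <EMBED>, and <OBJECT>, make links absolute, write UTF-8) can be approximated offline. The sketch below is not SiteTruth's code. It swaps BeautifulSoup for the standard library's html.parser so that it runs on Python 3 with no extra installs, and the base URL and HTML snippet are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

DROPPED = {"script", "embed", "object"}   # elements removed from the output
URL_ATTRS = {"href", "src"}               # attributes rewritten to absolute URLs

class PageCleaner(HTMLParser):
    """Re-emit HTML without DROPPED elements and with absolute links.

    Comments and doctype declarations are not re-emitted in this sketch.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.out = []
        self.skip_depth = 0   # > 0 while inside a <script> or <object>

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            if tag != "embed":        # <embed> is a void element with no content
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:         # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        if tag in DROPPED:
            if tag != "embed" and self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

def clean_html(html, base_url):
    """Return the cleaned page as a str; encode it as UTF-8 when writing it out."""
    cleaner = PageCleaner(base_url)
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)

# Made-up input page and base URL.
page = ('<p>Quote <a href="/q?s=GOOG">GOOG</a></p>'
        '<script>var x = "<b>";</script>'
        '<object><param name="a"></object>'
        '<embed src="ad.swf"><p>done</p>')
cleaned = clean_html(page, "http://example.com/finance/")
print(cleaned.encode("utf-8"))
```

If the data of interest survives this kind of cleaning, it is in the static markup and a plain fetch plus a parser should be enough to reach it.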
