Urllib2: Only a partial page retrieved

Discussion in 'Python' started by Dragon Lord, May 22, 2010.

  1. Dragon Lord

    Dragon Lord Guest

    I am trying to download a few IEEE pages using urllib2, but for
    certain pages I get only the first part of the page. Other pages
    from the same server and URL (just a different page ID) come back
    complete. The difference between these pages seems to be the
    publication date of the paper the page describes: pages for papers
    from before 2000 end just before the date, while pages from 2000 and
    later end at </html>.

    Two example URLs:

    Does not work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048
    Does work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=854728

    I tried both urlopen and urlretrieve and tried both urllib and
    urllib2. With urlopen I tried both .read() and .read(10000) to make
    sure I got the whole page, but nothing helped.
    Sample code:

    import urllib2
    response = urllib2.urlopen("http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048")
    html = response.read()
    print html

    The cutoff is always at the same location: just after the label
    "Meeting date" and before the date itself. Could it be that something
    is being interpreted as an EOF marker or something like that?

    example of the cutoff point with a bad page:
    <br/><b>Meeting Date: </b>



    example of the cutoff point with a good page:
    <br/><b>Meeting Date: </b>

    13 jun 2000

    The bad pages do continue after this point when loaded in a web
    browser, by the way, so it does not seem to be a server problem.
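
    A quick way to tell the two cases apart in a script is to check for the
    closing tag. This is only a sketch: it assumes, as described above, that
    a complete page ends with </html>, and the helper name looks_truncated
    is made up for illustration.

```python
def looks_truncated(html):
    """Return True if the closing </html> tag is missing from the page."""
    if isinstance(html, bytes):
        # latin-1 maps every byte to a character, so decoding cannot fail
        html = html.decode("latin-1")
    return "</html>" not in html.lower()
```

    Calling looks_truncated(response.read()) on the two example URLs would
    then flag the page that stops after the "Meeting Date" label.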
     
    Dragon Lord, May 22, 2010
    #1

  2. Dragon Lord

    Dragon Lord Guest

    Dragon Lord, May 22, 2010
    #2

  3. hpsMouse

    hpsMouse Guest

    I checked the TCP packets and found that the remote HTTP server sends
    a data packet with the "PUSH" flag set, causing the client to close the
    connection. That is exactly where "Meeting Date: </b>" appears.
    This does not seem to be a bug in Python, because Qt and telnet both
    failed in my test, and so did wget.
    Most browsers use keep-alive HTTP, so the connection won't be closed.
    I think that's why a browser shows the page correctly.
     
    hpsMouse, May 23, 2010
    #3
  4. hpsMouse

    hpsMouse Guest

    I know what the problem is.

    The server checks the client's locale setting to determine how the
    date should be displayed. Python doesn't send locale information by
    default, so the server fails at that point.

    If you add the following header field to the HTTP request, the
    response will be correct:
    Accept-Language: en
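
    A minimal sketch of sending that header from Python, assuming the
    missing Accept-Language field is the only problem. The helper names
    build_request and fetch are made up for illustration; the try/except
    import lets the same code use urllib2 on Python 2 or its Python 3
    counterpart, urllib.request.

```python
try:
    from urllib2 import Request, urlopen  # Python 2, as used in this thread
except ImportError:
    from urllib.request import Request, urlopen  # Python 3 equivalent

URL = "http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048"


def build_request(url, language="en"):
    """Return a Request for url that carries an Accept-Language header."""
    request = Request(url)
    request.add_header("Accept-Language", language)
    return request


def fetch(url, language="en"):
    """Download url with the Accept-Language header and return the body."""
    response = urlopen(build_request(url, language))
    try:
        return response.read()
    finally:
        response.close()


# Usage (performs a network request):
# html = fetch(URL)
# print(html)
```

    Building the Request separately from opening it makes it easy to
    inspect the headers before anything goes over the network.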
     
    hpsMouse, May 23, 2010
    #4
  5. Dragon Lord

    Dragon Lord Guest

    Thanks, that works perfectly!

    (oh and I learnt something new too, because I tried using telnet to
    connect to the server :) )
     
    Dragon Lord, May 23, 2010
    #5