read() returns data of different sizes

jimgardener

Hi,
While trying out urllib.urlopen, I wrote this code to read a URL and
return the data length:

import datetime, time, urllib

def get_page_size(pageurlstr):
    h = urllib.urlopen(pageurlstr)
    data = h.read()
    return len(data)

while True:
    print 'reading url www.google.com at', datetime.datetime.now().isoformat(' ')
    print 'size=%d' % get_page_size('http://www.google.com')
    time.sleep(5)


I got this output

reading url www.google.com at 2010-10-02 17:22:24.691654
size=9512
reading url www.google.com at 2010-10-02 17:22:30.681236
size=9530
reading url www.google.com at 2010-10-02 17:22:36.886369
size=9530
reading url www.google.com at 2010-10-02 17:22:42.315392
size=9512
reading url www.google.com at 2010-10-02 17:22:48.763693
size=9512
reading url www.google.com at 2010-10-02 17:22:54.711666
size=9548
reading url www.google.com at 2010-10-02 17:23:00.151843
size=9530
reading url www.google.com at 2010-10-02 17:23:05.844620
size=9548


Why are the sizes different? What must I do to ensure that
the whole page is read?
Thanks,
jim
 
Chris Rebert

Why is it that the sizes are different?

Because Google does not always send back the *exact* same HTML every
time you request their homepage (note how small the variance is). You
can easily verify this using the "Save Page" function of your browser
and diff-ing the HTML for 2 different loads. What is varying is
possibly some sort of tracking ID.
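
The same check can be done from Python instead of the browser. What follows is only a sketch, with a few assumptions: it uses Python 3's urllib.request.urlopen rather than the Python 2 urllib.urlopen from the original post, the helper names fetch_text and changed_lines are made up for illustration, and the page is assumed to decode acceptably as UTF-8.

```python
import difflib
import urllib.request


def fetch_text(url):
    # Download the page body and decode it leniently.
    # Assumption: UTF-8 is close enough for a line-by-line comparison.
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def changed_lines(first, second):
    # Compare two page loads line by line and keep only the lines that
    # were removed ("-") or added ("+").
    diff = list(difflib.unified_diff(first.splitlines(),
                                     second.splitlines(),
                                     lineterm="", n=0))
    # When there is any difference, the first two entries are the
    # "---" / "+++" file headers, so skip them by position.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Calling changed_lines(fetch_text(url), fetch_text(url)) for the same url should then list whatever the server changed between the two requests; an empty list means both loads were identical.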

What must I do to ensure that the whole page is read?

Nothing. Using .read() already ensures it.
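
To illustrate why nothing more is needed: on a file-like object, read() with no argument keeps reading until end of stream, which is what a manual loop over fixed-size reads would also do. The sketch below shows that equivalence. The function name read_in_chunks is hypothetical, and an in-memory io.BytesIO stands in for the HTTP response so that the example runs without network access.

```python
import io


def read_in_chunks(stream, chunk_size=1024):
    # Call read(chunk_size) repeatedly until it returns an empty bytes
    # object, which signals end of stream.
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Stand-in for a downloaded page (assumption for illustration only).
payload = b"x" * 9512

# A single read() with no size argument returns the same bytes as the loop.
assert read_in_chunks(io.BytesIO(payload), chunk_size=1000) == io.BytesIO(payload).read()
```

The explicit loop is mainly useful when the body is too large to hold in memory at once; for a page of a few kilobytes a single read() is simpler.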

Cheers,
Chris
 
