Problem when fetching page using urllib2.urlopen

jitu

Hi,

An HTML page contains 'anchor' elements whose 'href' attribute has
a semicolon in the URL. When I fetch the page using
urllib2.urlopen, all such hrefs containing semicolons are
truncated.


For example, the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
gets truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i

The page I am talking about can be fetched from
http://travel.yahoo.com/p-travelgui...DOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--

Thanks a Lot
Regards
jitu
 
jitu

Hi

Sorry, what I wanted to ask was whether this is the
correct behaviour or a bug?


Thanks A Lot.
Regards
jitu
 
dorzey

"geturl - this returns the real URL of the page fetched. This is
useful because urlopen (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL
requested." from http://www.voidspace.org.uk/python/articles/urllib2.shtml#info-and-geturl

It might be worth checking that you are actually getting the page you
want; I seem to remember that semicolons need to be encoded, similar
to '&'.
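
The geturl check suggested above could look roughly like the sketch below. The helper names final_url_differs and check_fetch are made up for illustration and are not part of urllib2; check_fetch needs network access, so it is only defined here, not called.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2, as used in this thread


def final_url_differs(requested_url, response):
    """Return True when the response came from a URL other than the one requested."""
    return response.geturl() != requested_url


def check_fetch(requested_url):
    """Fetch the page and report (redirected?, final URL). Needs network access."""
    response = urlopen(requested_url)
    return final_url_differs(requested_url, response), response.geturl()
```

If the first value returned by check_fetch is True, the page you are reading is not the one you asked for, which is worth ruling out before blaming the parser.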

Dorzey
 
Diez B. Roggisch

dorzey said:
"geturl - this returns the real URL of the page fetched. This is
useful because urlopen (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL
requested." from
http://www.voidspace.org.uk/python/articles/urllib2.shtml#info-and-geturl

It might be worth checking that you are actually getting the page you
want; I seem to remember that semicolons need to be encoded, similar
to '&'.

You remember wrong.

http://www.faqs.org/rfcs/rfc2396.html

See Section 3.3, path-components.
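
As a small illustration of that section (not something from the RFC itself), Python's own URL parser treats the part after the semicolon as the parameters of the last path segment, so the href from the original post is a perfectly legal URL that needs no encoding. The try/except import is only there so the sketch runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

href = ('http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
        ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')

parts = urlparse(href)

# The path stops at the semicolon; what follows is kept separately as params.
print(parts.path)    # /p-travelguide-6901959-pune_restaurants-i
print(parts.params)  # _ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
```

Note that urlparse only splits parameters off the last path segment.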

Diez
 
Piet van Oostrum

jitu said:
j> Hi,
j> A html page contains 'anchor' elements with 'href' attribute having
j> a semicolon in the url , while fetching the page using
j> urllib2.urlopen, all such href's containing 'semicolons' are
j> truncated.

j> For example the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
j> get truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i

It's not python that causes this. It is the server that sends you the
URLs without these parameters (that's what they are).

To get them you have to tell the server that you are a respectable
browser. E.g.

import urllib2

# An example href taken from the page
url = 'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL'

# The page itself; this second assignment is the URL that actually gets fetched
url = 'http://travel.yahoo.com/p-travelgui...DOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--'

# Browser-like headers, so that the server answers as it would for a browser
hdrs = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13',
        'Accept': 'image/*'}

request = urllib2.Request(url=url, headers=hdrs)
page = urllib2.urlopen(request).read()
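
For readers on Python 3, where urllib2 was merged into urllib.request, the same idea might be sketched as follows. The helper names build_request and fetch are invented for this example; fetch needs network access and is only defined, not called. The block also shows how to confirm the header is set before sending anything.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

# Browser User-Agent string taken from the example above.
BROWSER_UA = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
              'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that identifies itself with a browser User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url):
    """Fetch the page body with the browser User-Agent (needs network access)."""
    return urlopen(build_request(url)).read()


request = build_request(
    'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
    ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')

# Request stores header names capitalized as 'User-agent'.
print(request.get_header('User-agent'))
# The semicolon and the parameters are sent to the server untouched.
print(request.get_full_url())
```

Because Request normalizes header names, look the header up as 'User-agent' rather than 'User-Agent' when inspecting it.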
 
jitu

Yes Piet, you were right, this works. But it does not seem to work on Google
App Engine, since it appends its own agent info, as seen below:

'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US;
rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google;
(+http://code.google.com/appengine)'

Anyway, thanks. Good to know about the User-Agent field.

Jitu
 
