problem using urllib2

bmiras

I've got a problem using urllib2 to get a web page.
I'm going through a proxy that requires user/password authentication,
and I'm trying to get a page that asks for HTTP authentication.
I'm using Python 2.3.

Here is an example of the code I use:

import urllib2

# Proxy handler
proxy_handler = urllib2.ProxyHandler(
    {"http": "http://proxyuser:proxypassword@myproxy:8050"})

# Site auth handler
site_auth_handler = urllib2.HTTPBasicAuthHandler()
site_auth_handler.add_password("This Realm", "www.mysite.com",
                               "siteuser", "sitepassword")

opener = urllib2.build_opener(site_auth_handler,
                              urllib2.HTTPRedirectHandler,
                              urllib2.HTTPHandler, proxy_handler)
urllib2.install_opener(opener)

req = urllib2.Request('http://www.mysite.com/protectedpage')
page = urllib2.urlopen(req)

I got a 401 error.

Analyzing the request using 'strace' I can see the following request
sent to the proxy:

GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
www.mysite.com\r\nUser-agent:
Python-urllib/2.0a1\r\nProxy-authorization: Basic
bWlyYXNiOm1pcjAz\n\r\nAuthorization: Basic
bWlyYXM6bWlyYXMwMDE=\n\r\n\r\n

As you can see, there is an additional \n sent to the server just after
the Proxy-authorization and the Authorization fields. I think that in
this case the web server gets only this part:
GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
www.mysite.com\r\nUser-agent:
Python-urllib/2.0a1\r\nProxy-authorization: Basic
bWlyYXNiOm1pcjAz\n\r\n

and so sends me back a 401 error, since I'm not authenticated for the
site.
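
To illustrate the hypothesis above, here is a minimal sketch of how a lenient header parser might read that request. It is written for Python 3 rather than the Python 2.3 used in this thread, and `parse_headers_leniently` is a hypothetical helper, not code from any real proxy or web server. Real servers may well treat a bare \n differently.

```python
def parse_headers_leniently(raw):
    """Hypothetical lenient parser: accepts a bare LF as a line ending
    and stops at the first empty line."""
    lines = raw.split(b"\n")
    headers = {}
    # lines[0] is the request line; header lines follow it.
    for line in lines[1:]:
        line = line.rstrip(b"\r")
        if not line:
            break  # an empty line marks the end of the header block
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    return headers


# Same shape as the strace output, with placeholder credentials.
raw_request = (
    b"GET http://www.mysite.com/protectedpage HTTP/1.0\r\n"
    b"Host: www.mysite.com\r\n"
    b"Proxy-authorization: Basic XXX\n\r\n"
    b"Authorization: Basic YYY\r\n"
    b"\r\n"
)

headers = parse_headers_leniently(raw_request)
print(sorted(headers))
```

With a parser like this one, the "\n\r\n" after the proxy credentials reads as a header line followed by an empty line, so the Authorization header is never seen.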

I had a look at urllib2.py. I think that base64.encodestring adds
a \n at the end of the string. That is the case in the method
'proxy_open':

def proxy_open(self, req, proxy, type):
    orig_type = req.get_type()
    type, r_type = splittype(proxy)
    host, XXX = splithost(r_type)
    if '@' in host:
        user_pass, host = host.split('@', 1)
        if ':' in user_pass:
            user, password = user_pass.split(':', 1)
            user_pass = base64.encodestring('%s:%s' %
                                            (unquote(user),
                                             unquote(password)))
            req.add_header('Proxy-authorization', 'Basic ' +
                           user_pass)
    host = unquote(host)
    req.set_proxy(host, type)
    ...

I think it should be:

user_pass = base64.encodestring('%s:%s' % (unquote(user),
unquote(password))).split()

Do you have any other clue?
Thank you!

Bastien
 
John J. Lee

> I've got a problem using urllib2 to get a web page.
> I'm going through a proxy that requires user/password authentication,
> and I'm trying to get a page that asks for HTTP authentication.
> I'm using Python 2.3.
>
> Here is an example of the code I use:
>
> import urllib2
>
> # Proxy handler
> proxy_handler = urllib2.ProxyHandler(
>     {"http": "http://proxyuser:proxypassword@myproxy:8050"})
>
> # Site auth handler
> site_auth_handler = urllib2.HTTPBasicAuthHandler()
> site_auth_handler.add_password("This Realm", "www.mysite.com",
>                                "siteuser", "sitepassword")
>
> opener = urllib2.build_opener(site_auth_handler,
>                               urllib2.HTTPRedirectHandler,
>                               urllib2.HTTPHandler, proxy_handler)
> urllib2.install_opener(opener)

Looks OK (but I don't use a proxy, nor basic auth very often...).

Just as a BTW: you don't need to pass HTTPHandler or
HTTPRedirectHandler in there: build_opener adds them whether you ask
for them or not.
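
For anyone reading this later on Python 3, here is a rough sketch of the same setup using `urllib.request`, the successor of urllib2. It assumes the default-handler behaviour described above carries over, and it only builds the opener without sending anything. The proxy URL and credentials are the placeholders from the post.

```python
import urllib.request

# Placeholder proxy URL and credentials taken from the post above.
proxy_handler = urllib.request.ProxyHandler(
    {"http": "http://proxyuser:proxypassword@myproxy:8050"})

site_auth_handler = urllib.request.HTTPBasicAuthHandler()
site_auth_handler.add_password("This Realm", "www.mysite.com",
                               "siteuser", "sitepassword")

# Only the non-default handlers are passed explicitly.
opener = urllib.request.build_opener(site_auth_handler, proxy_handler)

# List the handler classes the opener ended up with.
handler_types = [type(h).__name__ for h in opener.handlers]
print(handler_types)
```

The printed list should include HTTPHandler and HTTPRedirectHandler even though neither was passed in.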

> req = urllib2.Request('http://www.mysite.com/protectedpage')
> page = urllib2.urlopen(req)
>
> I got a 401 error.

So presumably your proxy is happy, but the site is not. Could you
test that theory by urlopen()ing a URL that *doesn't* require any
authentication? Just:

# ...your code up to install_opener goes here...
print urllib2.urlopen("http://www.python.org/").read()

> Analyzing the request using 'strace' I can see the following request
> sent to the proxy:
>
> GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> www.mysite.com\r\nUser-agent:
> Python-urllib/2.0a1\r\nProxy-authorization: Basic
> XXX\n\r\nAuthorization: Basic
> YYY\n\r\n\r\n

(You probably didn't want to post your usernames and passwords to a
public newsgroup. They're reversibly encoded, so anyone can decode
them. I've replaced them with XXX and YYY in the quote above.)

> As you can see, there is an additional \n sent to the server just after
> the Proxy-authorization and the Authorization fields. I think that in
> this case the web server gets only this part:
>
> GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
> www.mysite.com\r\nUser-agent:
> Python-urllib/2.0a1\r\nProxy-authorization: Basic
> XXX\n\r\n
>
> and so sends me back a 401 error, since I'm not authenticated for the
> site.

Hmm. That \n does seem likely to be wrong, but I'm not certain.

The urllib2 code appears to duplicate the code for base64 encoding for
proxy basic authorization (in ProxyBasicAuthHandler and ProxyHandler),
and the code differs between the two classes :-(. [It looks like PBAH
responds to 407, and ProxyHandler always sends Proxy-Authorization if
it's in the proxy's URL.] And in fact, only one of them does a
.strip() on the base64 encoded string (they also differ in quoting).
However, the Authorization: header appears to be generated only in one
place (AbstractBasicAuthHandler.retry_http_basic_auth), which *does*
strip, but you've got a \n there, too. So, I don't understand where
that \n is coming from. I'd try sticking some print statements in
there to find out what's going on.

> I had a look at urllib2.py. I think that base64.encodestring adds
> a \n at the end of the string. That is the case in the method
> 'proxy_open':
>
> def proxy_open(self, req, proxy, type):
>     orig_type = req.get_type()
>     type, r_type = splittype(proxy)
>     host, XXX = splithost(r_type)
>     if '@' in host:
>         user_pass, host = host.split('@', 1)
>         if ':' in user_pass:
>             user, password = user_pass.split(':', 1)
>             user_pass = base64.encodestring('%s:%s' %
>                                             (unquote(user),
>                                              unquote(password)))
>             req.add_header('Proxy-authorization', 'Basic ' +
>                            user_pass)
>     host = unquote(host)
>     req.set_proxy(host, type)
>     ...
>
> I think it should be:
>
> user_pass = base64.encodestring('%s:%s' % (unquote(user),
> unquote(password))).split()

You mean strip, not split?
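
The difference matters here. A small sketch in Python 3, where `base64.encodebytes` is the successor of the `encodestring` call discussed in this thread (the credentials are the placeholders from the post):

```python
import base64

credentials = b"proxyuser:proxypassword"  # placeholder values from the thread

encoded = base64.encodebytes(credentials)  # Python 3 successor of encodestring
print(repr(encoded))  # the encoded value ends with a newline

stripped = encoded.strip()  # bytes without the trailing newline
pieces = encoded.split()    # a list of chunks, not a bytes object

# Concatenation works with the stripped value...
header_value = b"Basic " + stripped

# ...but not with the list returned by split().
try:
    b"Basic " + pieces
    split_concatenates = True
except TypeError:
    split_concatenates = False
print(split_concatenates)
```

So split() would fail at the point where the header value is concatenated, while strip() yields a value that can go straight into the header.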

Try debugging a bit, find out what's really going on. Just copy
urllib2.py to your current directory (so it'll override the installed
standard library's copy), and stick some print statements in there.

> have you any other clue?
> [...]

You could try sniffing what Mozilla sends, too.

If you get this working, please look at the doc patch here

http://www.python.org/sf/798244


test it, and post a comment to say whether or not it's correct (and
which examples you tried -- preferably all of them ;).


John
 
Alan Kennedy

[bmiras wrote]
> Here is an example of the code I use:
>
> import urllib2
> # Proxy handler
> proxy_handler = urllib2.ProxyHandler(
>     {"http": "http://proxyuser:proxypassword@myproxy:8050"})

Might you need to change that URL? It looks like this URL indicates
that the proxy is running on port 8050 on host "myproxy".

Unless the host on which the proxy is running is named "myproxy", try
changing the proxy URL to one of the following values

http://proxyuser:proxypassword@localhost:8050
http://proxyuser:proxypassword@[address removed]:8050

HTH,
 
bmiras

> Looks OK (but I don't use a proxy, nor basic auth very often...).
>
> Just as a BTW: you don't need to pass HTTPHandler or
> HTTPRedirectHandler in there: build_opener adds them whether you ask
> for them or not.
>
> So presumably your proxy is happy, but the site is not. Could you
> test that theory by urlopen()ing a URL that *doesn't* require any
> authentication? Just:
>
> # ...your code up to install_opener goes here...
> print urllib2.urlopen("http://www.python.org/").read()

It's OK with a URL that doesn't require authentication.

>> Analyzing the request using 'strace' I can see the following request
>> sent to the proxy:
>>
>> GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
>> www.mysite.com\r\nUser-agent:
>> Python-urllib/2.0a1\r\nProxy-authorization: Basic
>> XXX\n\r\nAuthorization: Basic
>> YYY\n\r\n\r\n
>
> (You probably didn't want to post your usernames and passwords to a
> public newsgroup. They're reversibly encoded, so anyone can decode
> them. I've replaced them with XXX and YYY in the quote above.)
>
>> As you can see, there is an additional \n sent to the server just after
>> the Proxy-authorization and the Authorization fields. I think that in
>> this case the web server gets only this part:
>>
>> GET http://www.mysite.com/protectedpage HTTP/1.0\r\nHost:
>> www.mysite.com\r\nUser-agent:
>> Python-urllib/2.0a1\r\nProxy-authorization: Basic
>> XXX\n\r\n
>>
>> and so sends me back a 401 error, since I'm not authenticated for the
>> site.
>
> Hmm. That \n does seem likely to be wrong, but I'm not certain.
>
> The urllib2 code appears to duplicate the code for base64 encoding for
> proxy basic authorization (in ProxyBasicAuthHandler and ProxyHandler),
> and the code differs between the two classes :-(. [It looks like PBAH
> responds to 407, and ProxyHandler always sends Proxy-Authorization if
> it's in the proxy's URL.] And in fact, only one of them does a
> .strip() on the base64 encoded string (they also differ in quoting).
> However, the Authorization: header appears to be generated only in one
> place (AbstractBasicAuthHandler.retry_http_basic_auth), which *does*
> strip, but you've got a \n there, too. So, I don't understand where
> that \n is coming from. I'd try sticking some print statements in
> there to find out what's going on.

I made a copy/paste mistake:
there is no additional \n after the Authorization field,
but there is an additional \n after Proxy-Authorization.

I've used HTTPBasicAuthHandler, since you said the code is different,
and it worked fine!!!
I think the conclusion is that the strip call is missing from the
ProxyHandler code. Is it necessary to report it as a bug?
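
A sketch of what such a workaround could look like in Python 3's `urllib.request`. The post names HTTPBasicAuthHandler; this sketch assumes the proxy-side class, ProxyBasicAuthHandler, is the one that supplies the proxy credentials, which is a guess and not something the post confirms. Host, port and credentials are the placeholders from the thread, and nothing is sent over the network.

```python
import urllib.request

# Proxy URL without embedded credentials, so ProxyHandler itself
# has nothing to encode. Host and port are placeholders.
proxy_handler = urllib.request.ProxyHandler({"http": "http://myproxy:8050"})

# Credentials for the proxy, used when the proxy answers 407.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "myproxy:8050", "proxyuser", "proxypassword")
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler(password_mgr)

# Credentials for the site itself, as in the original post.
site_auth_handler = urllib.request.HTTPBasicAuthHandler()
site_auth_handler.add_password("This Realm", "www.mysite.com",
                               "siteuser", "sitepassword")

opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler,
                                     site_auth_handler)

handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
# opener.open("http://www.mysite.com/protectedpage") would send the request.
```

The trade-off is one extra round trip: the proxy credentials are only sent after the proxy has replied with 407, instead of on the first request.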

> You mean strip, not split?

Yes, strip, sorry.

> Try debugging a bit, find out what's really going on. Just copy
> urllib2.py to your current directory (so it'll override the installed
> standard library's copy), and stick some print statements in there.

>> have you any other clue?
>> [...]

> You could try sniffing what Mozilla sends, too.

I've done better: telnet myproxy 8050

GET http://www.mysite.com/protectedpage HTTP/1.0
Host: www.mysite.com
User-agent: Python-urllib/2.0a1
Proxy-authorization: Basic XXX
Authorization: Basic YYY

And it worked fine.
 
John J. Lee

> John J. Lee wrote: [...]
> I made a copy/paste mistake:
> there is no additional \n after the Authorization field,
> but there is an additional \n after Proxy-Authorization.
>
> I've used HTTPBasicAuthHandler, since you said the code is different,
> and it worked fine!!!
> I think the conclusion is that the strip call is missing from the
> ProxyHandler code. Is it necessary to report it as a bug?

Yes. Please report it to sourceforge, remembering to check that
nobody else already has. The correct version of the duplicated code
should be factored out into a function.
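
As a rough idea of what such a shared function could look like, here is a Python 3 sketch. The name `basic_auth_value` is hypothetical and this is not the code that exists in urllib2 or urllib.request. It uses `base64.b64encode`, which does not append a newline, so no strip() is needed.

```python
import base64


def basic_auth_value(user, password):
    """Hypothetical shared helper: build the value for an Authorization
    or Proxy-Authorization header using HTTP Basic credentials."""
    raw = ("%s:%s" % (user, password)).encode("utf-8")
    # b64encode, unlike encodestring/encodebytes, adds no trailing newline.
    return "Basic " + base64.b64encode(raw).decode("ascii")


print(basic_auth_value("user", "pass"))  # prints: Basic dXNlcjpwYXNz
```

Both the proxy and the site handlers could then call the same function, so the two code paths cannot drift apart again.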

To help future users, it would be really useful if you could do this
too:

[...]
Won't take you long, since you already have your code working.


John
 
