raise UnicodeError, "label too long"

Flavio

Hi, I am having a problem with urllib2.urlopen.

I get this error when I try to pass a unicode string to it:

raise UnicodeError, "label too long"

Is this problem avoidable? No browser, and no program such as wget, seems to
have a problem with these strings.
 
Marc 'BlackJack' Rintsch

What exactly are you doing? What does a (unicode?) string that
triggers this exception look like?

Ciao,
Marc 'BlackJack' Rintsch
 
Flavio

What I am doing is very simple:

I fetch a URL (an HTML page), parse it using BeautifulSoup, extract the
links, and try to open each of the links, repeating the cycle.
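
The link-extraction step of that cycle could be sketched roughly as below. This is not the poster's code: it assumes Python 3, swaps in the standard library's html.parser for BeautifulSoup, and the names LinkCollector and extract_links are made up for illustration. It resolves each href against the page URL with urljoin, so relative links keep the full host name of the page they came from.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in html as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Each returned link could then be passed to the fetch step, which repeats the cycle.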

BeautifulSoup converts the HTML to unicode. That's why I get this error
when I try to open the links extracted from the page.

This is bad, since some links do contain strings with non-ascii
characters.
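
One common way to make such links acceptable to urlopen is to turn them into plain ASCII first: IDNA-encode the host name and percent-encode the rest as UTF-8. A minimal sketch, assuming Python 3 (where urllib2 became urllib.request); iri_to_uri is a hypothetical helper name, the set of characters left unescaped is an assumption, and any user:password part of the URL is dropped for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when escaping (an assumption; adjust as needed).
# "%" is kept so that already-escaped sequences are not escaped twice.
SAFE = "/%:@()!$&'*+,;="


def iri_to_uri(iri):
    """Return an ASCII-only version of a URL that may contain non-ASCII text."""
    parts = urlsplit(iri)
    if not parts.hostname:
        raise ValueError("no host name in %r" % iri)
    # The built-in "idna" codec converts each host label to its ASCII form.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    # quote() encodes non-ASCII characters as UTF-8 percent escapes.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

A URL that is already pure ASCII, such as one ending in /wiki/Copper(II)_hydroxide, passes through unchanged.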

Thanks,

Flávio


Martin v. Löwis

Please try answering the exact question that Marc asked:
what is an example of a unicode string that triggers the
exception?

Regards,
Martin
 
Flavio

Something like this, for instance:
http://.wikipedia.org/wiki/Copper(II)_hydroxide

But even URLs without any non-ASCII characters, such as this one,

http://.wikipedia.org/wiki/Ammonia

also fail when passed to urlopen:
  File "/usr/lib/python2.4/encodings/idna.py", line 72, in ToASCII
    raise UnicodeError, "label too long"
UnicodeError: label too long

Very strange, because I tried other unicode URLs from the Python
console, like this:

urllib2.urlopen(u'www.google.com')

and it works normally.





 
Martin v. Löwis

It's the host name starting with a dot that makes it fail:

py> u".wikipedia.org".encode("idna")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "encodings/idna.py", line 163, in encode
  File "encodings/idna.py", line 72, in ToASCII
UnicodeError: label too long
py> u"wikipedia.org".encode("idna")
'wikipedia.org'

The exception is certainly misleading; I'll have to find out
whether there is a bug beyond that (i.e. whether host names
with empty labels should be accepted).

Regards,
martin
 
Flavio

Guys, I am sorry; I wrote these messages very late at night.

Naturally, what originally came before the dot is the two-letter language
code that is usual in Wikipedia URLs.

Something in my code is obviously gobbling that up. Thanks for pointing
that out, and my apologies again for not seeing this obvious bug.
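
A guard along the following lines, run on each extracted link before opening it, would likely have flagged the missing prefix with a clearer message than "label too long". It is only a sketch, assuming Python 3; host_problem is a made-up helper name, and the wording of its messages is arbitrary.

```python
from urllib.parse import urlsplit


def host_problem(url):
    """Describe what is wrong with the URL's host name, or return None."""
    host = urlsplit(url).hostname
    if not host:
        return "no host name"
    # A single trailing dot marks a fully qualified name; ignore it.
    name = host[:-1] if host.endswith(".") else host
    if "" in name.split("."):
        return "empty label in host name %r" % host
    try:
        # The same codec that produced the error in the traceback.
        host.encode("idna")
    except UnicodeError as exc:
        return "host name %r is not IDNA-encodable: %s" % (host, exc)
    return None
```

A link such as http://.wikipedia.org/wiki/Ammonia is reported as having an empty label, while the same link with a language code in front of the dot yields None.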
 
