getattr/setattr still ASCII-only, not Unicode - blows up SGMLlib from BeautifulSoup

John Nagle

Just noticed, again, that getattr/setattr are ASCII-only, and don't support
Unicode.

SGMLlib blows up because of this when faced with a Unicode end tag:

  File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 46: ordinal not in range(128)

Should attributes be restricted to ASCII, or is this a bug?

John Nagle
 
Terry Reedy

| Just noticed, again, that getattr/setattr are ASCII-only, and don't support
| Unicode.
|
| SGMLlib blows up because of this when faced with a Unicode end tag:
|
|   File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
|     method = getattr(self, 'end_' + tag)
| UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 46: ordinal not in range(128)
|
| Should attributes be restricted to ASCII, or is this a bug?

Except for comments and string literals preceded by an encoding
declaration, Python code is ASCII-only:
"Python uses the 7-bit ASCII character set for program text."
(Reference Manual, section 2, Lexical analysis)

This changes in 3.0.
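
As a rough illustration of that 3.0 change (a sketch, assuming Python 3, where `str.isidentifier()` is available; the sample names are made up for the example): non-ASCII letters become legal in identifiers, but a symbol such as u'\xae' (the registered-trademark sign from the traceback) still does not qualify as an identifier character.

```python
# Requires Python 3: str.isidentifier() does not exist on Python 2 strings.
# The candidate names below are illustrative only.
candidates = ["end_img", "caf\u00e9", "end_\u00ae", "2fast"]

# Map each candidate to whether Python 3 accepts it as an identifier.
results = {name: name.isidentifier() for name in candidates}

for name in candidates:
    print(repr(name), results[name])
```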
 
John Machin

| Just noticed, again, that getattr/setattr are ASCII-only, and don't support
| Unicode.
|
| SGMLlib blows up because of this when faced with a Unicode end tag:
|
|   File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
|     method = getattr(self, 'end_' + tag)
| UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 46: ordinal not in range(128)
|
| Should attributes be restricted to ASCII, or is this a bug?
|
| John Nagle

Identifiers are restricted -- see section 2.3 (Identifiers and
keywords) of the Reference Manual. The restriction is in effect that
they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
obj.nonASCIIname in your code, it makes sense for the equivalent usage
in setattr and getattr not to be available.
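
A minimal sketch of that identifier check, using the regular expression quoted above. The helper name `is_py2_identifier` is hypothetical, and the check ignores reserved keywords, which the Reference Manual also excludes.

```python
import re

# The pattern quoted in the post: a letter or underscore, then letters,
# digits, or underscores, anchored at the very end of the string by \Z.
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')


def is_py2_identifier(name):
    """Return True if name matches the ASCII-only identifier pattern."""
    return IDENTIFIER.match(name) is not None


print(is_py2_identifier('end_img'))     # plain ASCII name
print(is_py2_identifier(u'end_\xae'))   # contains a non-ASCII character
print(is_py2_identifier('2fast'))       # starts with a digit
```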

However, other than forcing unicode to str, setattr and getattr seem
not to care what you use:
.... pass
....
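
A small sketch of that behaviour (not the original session). It assumes Python 3, where every string is Unicode, so the forced conversion to str described above does not come into play; the class name `Obj` and the attribute names are made up for the example.

```python
class Obj(object):
    pass


o = Obj()

# None of these names could be written as o.<name> in source code,
# yet setattr stores them in the instance dictionary all the same.
setattr(o, 'not an identifier!', 1)
setattr(o, '42', 2)
setattr(o, u'end_\xae', 3)

# getattr retrieves them using the same strings.
values = [
    getattr(o, 'not an identifier!'),
    getattr(o, '42'),
    getattr(o, u'end_\xae'),
]
print(values)
```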
Cheers,
John
 
John Nagle

John Machin said:
| | Just noticed, again, that getattr/setattr are ASCII-only, and don't support
| | Unicode.
| |
| | SGMLlib blows up because of this when faced with a Unicode end tag:
| |
| |   File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
| |     method = getattr(self, 'end_' + tag)
| | UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 46: ordinal not in range(128)
| |
| | Should attributes be restricted to ASCII, or is this a bug?
| |
| | John Nagle
|
| Identifiers are restricted -- see section 2.3 (Identifiers and
| keywords) of the Reference Manual. The restriction is in effect that
| they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
| obj.nonASCIIname in your code, it makes sense for the equivalent usage
| in setattr and getattr not to be available.
|
| However other than forcing unicode to str, setattr and getattr seem
| not to care what you use:

OK. It's really a bug in SGMLlib, then. SGMLlib lets you provide a
subclass with a function with a name such as "end_img", to be called
at the end of an "img" tag. The mechanism which implements this blows
up on any tag name that won't convert to "str", even when there are
no "end_" functions that could be relevant.
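
The dispatch mechanism described here can be sketched roughly as follows. `TinyParser` and `ImgParser` are hypothetical stand-ins, heavily simplified, and not the real sgmllib classes; only the `getattr(self, 'end_' + tag)` lookup is taken from the traceback.

```python
class TinyParser(object):
    """Hypothetical, simplified stand-in for an sgmllib-style parser."""

    def __init__(self):
        self.events = []

    def finish_endtag(self, tag):
        # Build the handler name from the tag and look it up on the instance.
        # The lookup runs for every tag, whether or not a handler exists.
        try:
            method = getattr(self, 'end_' + tag)
        except AttributeError:
            # No tag-specific handler: fall back to the generic one.
            self.unknown_endtag(tag)
            return
        method()

    def unknown_endtag(self, tag):
        self.events.append(('unknown', tag))


class ImgParser(TinyParser):
    def end_img(self):
        self.events.append(('end_img',))


parser = ImgParser()
parser.finish_endtag('img')   # dispatched to end_img
parser.finish_endtag('span')  # no end_span, so unknown_endtag is used
print(parser.events)
```

Because the name is built and looked up before anything else happens, a tag that cannot be converted to str under Python 2 fails inside getattr itself, before the AttributeError fallback gets a chance to run.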

It's easy to fix in SGMLlib. It's just necessary to change

    except AttributeError:
to
    except (AttributeError, UnicodeEncodeError):

in four places. I suppose I'll have to submit a patch.
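
A hedged sketch of what such a change amounts to. The helper `lookup_end_handler` and the class `Handler` are hypothetical, and the explicit `encode('ascii')` call is an assumption added to imitate Python 2's implicit unicode-to-str conversion so that the example also behaves the same way on Python 3. The two exception types are given as a parenthesized tuple; in Python 2 the unparenthesized spelling `except AttributeError, UnicodeEncodeError:` would catch only AttributeError and bind the caught exception to the name UnicodeEncodeError.

```python
def lookup_end_handler(obj, tag):
    """Return the end_<tag> method of obj, or None if there is none."""
    name = 'end_' + tag
    try:
        # Assumption: imitate Python 2's implicit unicode -> str conversion,
        # which is the step that raised UnicodeEncodeError inside getattr.
        name.encode('ascii')
        return getattr(obj, name)
    except (AttributeError, UnicodeEncodeError):
        # Either no such handler exists, or the tag name is not ASCII.
        return None


class Handler(object):
    """Hypothetical object with a single end-tag handler."""

    def end_img(self):
        return 'img closed'


h = Handler()
print(lookup_end_handler(h, 'img'))      # the bound end_img method
print(lookup_end_handler(h, 'span'))     # None: no end_span defined
print(lookup_end_handler(h, u'\xae'))    # None: non-ASCII tag name
```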

John Nagle
SiteTruth
 
Carl Banks

| | Identifiers are restricted -- see section 2.3 (Identifiers and
| | keywords) of the Reference Manual. The restriction is in effect that
| | they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
| | obj.nonASCIIname in your code, it makes sense for the equivalent usage
| | in setattr and getattr not to be available.
| |
| | However other than forcing unicode to str, setattr and getattr seem
| | not to care what you use:
|
| OK. It's really a bug in SGMLlib, then. SGMLlib lets you provide a
| subclass with a function with a name such as "end_img", to be called
| at the end of an "img" tag. The mechanism which implements this blows
| up on any tag name that won't convert to "str", even when there are
| no "end_" functions that could be relevant.
|
| It's easy to fix in SGMLlib. It's just necessary to change
|
|     except AttributeError:
| to
|     except (AttributeError, UnicodeEncodeError):
|
| in four places. I suppose I'll have to submit a patch.


FWIW, the stated goal of sgmllib is to parse the subset of SGML that
HTML uses. There are no non-ASCII elements in HTML, so I'm not
certain this would be considered a bug in sgmllib.


Carl Banks
 
