getattr/setattr still ASCII-only, not Unicode - blows up SGMLlib from BeautifulSoup

Discussion in 'Python' started by John Nagle, Mar 13, 2008.

  1. John Nagle

    John Nagle Guest

    Just noticed, again, that getattr/setattr are ASCII-only, and don't support
    Unicode.

    SGMLlib blows up because of this when faced with a Unicode end tag:

    File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    method = getattr(self, 'end_' + tag)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
    in position 46: ordinal not in range(128)

    Should attributes be restricted to ASCII, or is this a bug?
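
    For readers following along, here is a minimal sketch of the failing step. It assumes that Python 2's getattr converts a unicode attribute name to str with the default ASCII codec; the explicit encode below only imitates that conversion, and the tag value is made up for illustration.

```python
# Imitates the implicit unicode-to-str conversion with an explicit encode.
# `tag` is a hypothetical value, not the one from the traceback above.
# Note: "except ... as" needs Python 2.6 or later.
tag = u'img\xae'
name = u'end_' + tag

failed = False
reason = None
position = None
try:
    name.encode('ascii')
except UnicodeEncodeError as exc:
    # Copy the details out; Python 3 unbinds exc after the except block.
    failed = True
    reason = exc.reason
    position = exc.start

print(failed, reason, position)
```

    Under that assumption, the failure comes from the name conversion, before any attribute lookup takes place.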

    John Nagle
     
    John Nagle, Mar 13, 2008
    #1

  2. John Nagle

    Terry Reedy Guest

    Re: getattr/setattr still ASCII-only, not Unicode - blows up SGMLlib from BeautifulSoup

    "John Nagle" <> wrote in message
    news:47d97288$0$36363$...
    | Just noticed, again, that getattr/setattr are ASCII-only, and don't
    | support Unicode.
    |
    | SGMLlib blows up because of this when faced with a Unicode end tag:
    |
    | File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    | method = getattr(self, 'end_' + tag)
    | UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
    | in position 46: ordinal not in range(128)
    |
    | Should attributes be restricted to ASCII, or is this a bug?

    Except for comments and string literals preceded by an encoding
    declaration, Python code is ASCII-only:
    "Python uses the 7-bit ASCII character set for program text."
    (Reference Manual, section 2, Lexical analysis)

    This changes in 3.0.
     
    Terry Reedy, Mar 13, 2008
    #2

  3. John Nagle

    John Machin Guest

    On Mar 14, 5:38 am, John Nagle <> wrote:
    > Just noticed, again, that getattr/setattr are ASCII-only, and don't support
    > Unicode.
    >
    > SGMLlib blows up because of this when faced with a Unicode end tag:
    >
    > File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    > method = getattr(self, 'end_' + tag)
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
    > in position 46: ordinal not in range(128)
    >
    > Should attributes be restricted to ASCII, or is this a bug?
    >
    > John Nagle


    Identifiers are restricted -- see section 2.3 (Identifiers and
    keywords) of the Reference Manual. The restriction is in effect that
    they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
    obj.nonASCIIname in your code, it makes sense for the equivalent usage
    in setattr and getattr not to be available.

    However other than forcing unicode to str, setattr and getattr seem
    not to care what you use:

    >>> class O(object):
    ...     pass
    ...
    >>> o = O()
    >>> setattr(o, '42', 'universe')
    >>> getattr(o, '42')
    'universe'
    >>> # doesn't even need to be ASCII
    >>> setattr(o, '\xff', 'notA-Za-z etc')
    >>> getattr(o, '\xff')
    'notA-Za-z etc'
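
    The gap between the identifier rule and what setattr/getattr accept can be sketched as follows. The helper name is_ascii_identifier is hypothetical; the pattern is the one quoted in this post.

```python
import re

# Pattern quoted in the post for Python 2 identifiers.
IDENT = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')


def is_ascii_identifier(name):
    """Return True if name could be written as obj.name in source code."""
    return IDENT.match(name) is not None


class O(object):
    pass


o = O()
# '42' is not a valid identifier, yet setattr/getattr accept it.
setattr(o, '42', 'universe')

print(is_ascii_identifier('end_img'), is_ascii_identifier('42'))
print(getattr(o, '42'))
```

    So the identifier rule constrains source text only; attribute names passed as strings are not checked against it.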


    Cheers,
    John
     
    John Machin, Mar 13, 2008
    #3
  4. John Nagle

    John Nagle Guest

    John Machin wrote:
    > On Mar 14, 5:38 am, John Nagle <> wrote:
    >> Just noticed, again, that getattr/setattr are ASCII-only, and don't support
    >> Unicode.
    >>
    >> SGMLlib blows up because of this when faced with a Unicode end tag:
    >>
    >> File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    >> method = getattr(self, 'end_' + tag)
    >> UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
    >> in position 46: ordinal not in range(128)
    >>
    >> Should attributes be restricted to ASCII, or is this a bug?
    >>
    >> John Nagle

    >
    > Identifiers are restricted -- see section 2.3 (Identifiers and
    > keywords) of the Reference Manual. The restriction is in effect that
    > they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
    > obj.nonASCIIname in your code, it makes sense for the equivalent usage
    > in setattr and getattr not to be available.
    >
    > However other than forcing unicode to str, setattr and getattr seem
    > not to care what you use:


    OK. It's really a bug in SGMLlib, then. SGMLlib lets you provide a
    subclass with a function with a name such as "end_img", to be called
    at the end of an "img" tag. The mechanism which implements this blows
    up on any tag name that won't convert to "str", even when there are
    no "end_" functions that could be relevant.

    It's easy to fix in SGMLlib. It's just necessary to change

        except AttributeError:
    to
        except (AttributeError, UnicodeEncodeError):

    in four places. I suppose I'll have to submit a patch.
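
    A rough sketch of the idea, not the actual sgmllib source: TagDispatcher and its seen list are hypothetical stand-ins for the parser's end-tag lookup. One caveat worth stating: in Python 2, "except A, B:" catches only A and binds the exception to the name B, so catching both types needs a parenthesized tuple.

```python
class TagDispatcher(object):
    """Hypothetical stand-in for the end-tag lookup, not the real parser."""

    def __init__(self):
        self.seen = []

    def end_img(self):
        self.seen.append('img')

    def unknown_endtag(self, tag):
        self.seen.append(('unknown', tag))

    def finish_endtag(self, tag):
        try:
            method = getattr(self, 'end_' + tag)
        except (AttributeError, UnicodeEncodeError):
            # The parenthesized tuple catches either exception type.
            self.unknown_endtag(tag)
            return
        method()


d = TagDispatcher()
d.finish_endtag('img')        # dispatches to end_img
d.finish_endtag(u'x\xae')     # no handler; falls back to unknown_endtag
print(d.seen)
```

    With this shape, a tag name that cannot be looked up is treated the same as a tag that has no handler.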

    John Nagle
    SiteTruth
     
    John Nagle, Mar 14, 2008
    #4
  5. John Nagle

    Carl Banks Guest

    On Mar 14, 1:53 am, John Nagle <> wrote:
    > John Machin wrote:
    > > On Mar 14, 5:38 am, John Nagle <> wrote:
    > >> Just noticed, again, that getattr/setattr are ASCII-only, and don't support
    > >> Unicode.

    >
    > >> SGMLlib blows up because of this when faced with a Unicode end tag:

    >
    > >> File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    > >> method = getattr(self, 'end_' + tag)
    > >> UnicodeEncodeError: 'ascii' codec can't encode character u'\xae'
    > >> in position 46: ordinal not in range(128)

    >
    > >> Should attributes be restricted to ASCII, or is this a bug?

    >
    > >> John Nagle

    >
    > > Identifiers are restricted -- see section 2.3 (Identifiers and
    > > keywords) of the Reference Manual. The restriction is in effect that
    > > they match r'[A-Za-z_][A-Za-z0-9_]*\Z'. Hence if you can't use
    > > obj.nonASCIIname in your code, it makes sense for the equivalent usage
    > > in setattr and getattr not to be available.

    >
    > > However other than forcing unicode to str, setattr and getattr seem
    > > not to care what you use:

    >
    > OK. It's really a bug in SGMLlib, then. SGMLlib lets you provide a
    > subclass with a function with a name such as "end_img", to be called
    > at the end of an "img" tag. The mechanism which implements this blows
    > up on any tag name that won't convert to "str", even when there are
    > no "end_" functions that could be relevant.
    >
    > It's easy to fix in SGMLlib. It's just necessary to change
    >
    >     except AttributeError:
    > to
    >     except (AttributeError, UnicodeEncodeError):
    >
    > in four places. I suppose I'll have to submit a patch.



    FWIW, the stated goal of sgmllib is to parse the subset of SGML that
    HTML uses. There are no non-ascii elements in HTML, so I'm not
    certain this would be considered a bug in sgmllib.


    Carl Banks
     
    Carl Banks, Mar 14, 2008
    #5
