unescaping xml escape codes

Discussion in 'Python' started by Daniel, Aug 10, 2003.

  1. Daniel

    Daniel Guest

    I'm working with strings that contain xml escape codes, such as '&#48;'
    and need a way in python to unescape these back to their ascii
    representation, such as '&' but can't seem to find a python method for
    this. I tried xml.sax.saxutils.unescape(s), but while it works with
    '&amp;', it doesn't work with '&#48;' and other numeric codes. Any
    suggestions on how to decode the numeric xml escape codes such as this?
    Thanks.

    --
    To reply to me directly, please remove "_NoSpam_" from my email address
     
    Daniel, Aug 10, 2003
    #1

  2. On Sun, 10 Aug 2003 10:08:46 -0700, Daniel <> wrote:

    >I'm working with strings that contain xml escape codes, such as '&#48;'
    >and need a way in python to unescape these back to their ascii
    >representation, such as '&' but can't seem to find a python method for
    >this. I tried xml.sax.saxutils.unescape(s), but while it works with
    >'&amp;', it doesn't work with '&#48;' and other numeric codes. Any
    >suggestions on how to decode the numeric xml escape codes such as this?
    >Thanks.
    >

    Maybe just a regex sub function would do it for you? Do you just need the decimal
    forms like above or also the hex? If your coded entities are &#0; to &#255; or
    &x00; to &xff; this might work. Other entities are converted to '?'.

    If you want to do this properly, I think you have to parse the html a little and see
    what the encoding is, and convert to unicode, and then do the conversions.
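
For readers on a current Python 3, the "convert to unicode, then do the conversions" route can be sketched with the standard library alone. This is a minimal sketch, not part of the original post: it assumes Python 3.4 or later, where html.unescape exists, and the sample string is made up for illustration. Note that html.unescape applies HTML 5 parsing rules rather than strict XML ones.

```python
from html import unescape as html_unescape
from xml.sax.saxutils import unescape as sax_unescape

# Illustrative input (an assumption, not taken from the thread's data).
sample = "blah [&#48;] blah [&#246;] blah [&amp;] &#x3c9;"

# saxutils.unescape only replaces &amp;, &lt; and &gt; (plus any extra
# entities passed in explicitly), so numeric references are left alone.
sax_result = sax_unescape(sample)

# html.unescape also decodes decimal and hexadecimal character references.
html_result = html_unescape(sample)

print(sax_result)
print(html_result)
```

The first line printed still contains the numeric references, which matches the behaviour described in the question; the second has them decoded to characters.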

    Very little tested!!
    ====< cvthtmlent.py >======================================
    # Python 2 syntax (print statement, file()).
    import re

    # Matches decimal (&#246;) and hex (&#xf6;) numeric character references.
    rxo = re.compile(r'&#(x[0-9a-fA-F]+|[0-9]+);')

    def ent2chr(m):
        code = m.group(1)
        if code.isdigit(): code = int(code)
        else: code = int(code[1:], 16)   # strip the leading 'x'
        if code<256: return chr(code)
        else: return '?' #XXX unichr(code).encode('utf-16le') ??

    def cvthtmlent(s): return rxo.sub(ent2chr, s)

    if __name__ == '__main__':
        import sys; args = sys.argv[1:]
        if args:
            arg = args.pop(0)
            if arg == '-test':
                print cvthtmlent(
                    'blah [&#48;] blah [&#246;] blah [123] &#x3c9;')
            else:
                if arg == '-': fi = sys.stdin
                else: fi = file(arg)
                for line in fi:
                    sys.stdout.write(cvthtmlent(line))
    ===========================================================
    If you run this in idle, you can see the umlaut, but not the omega, which becomes a '?'

    Martin can tell you the real scoop ;-)

    >>> from cvthtmlent import cvthtmlent as cvt
    >>> print cvt('blah [&#48;] blah [&#246;] blah [123] &#x3c9;')
    blah [0] blah [ö] blah [123] ?
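
The same regex-substitution idea can be sketched for Python 3, where every str is already Unicode, so code points above 255 no longer need the '?' fallback. This is a hedged port rather than the poster's code: the names CHAR_REF, ref_to_char and unescape_numeric are made up for illustration, and leaving out-of-range references untouched is an assumption about the desired behaviour.

```python
import re

# "&#" + decimal digits, or "&#x" + hex digits, terminated by ";".
CHAR_REF = re.compile(r'&#(?:x([0-9a-fA-F]+)|([0-9]+));')

def ref_to_char(match):
    """Return the character for one numeric reference."""
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Not a valid Unicode code point: keep the reference text as it was.
        return match.group(0)

def unescape_numeric(text):
    """Replace every numeric character reference in text."""
    return CHAR_REF.sub(ref_to_char, text)

print(unescape_numeric('blah [&#48;] blah [&#246;] blah [123] &#x3c9;'))
```

Named entities such as &amp; are deliberately not handled here; xml.sax.saxutils.unescape can be applied afterwards for those.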

    Regards,
    Bengt Richter
     
    Bengt Richter, Aug 11, 2003
    #2

  3. On 11 Aug 2003 00:09:42 GMT, (Bengt Richter) wrote:
    [...]
    >>

    >Maybe just a regex sub function would do it for you? Do you just need the decimal
    >forms like above or also the hex? If your coded entities are &#0; to &#255; or
    >&x00; to &xff; this might work. Other entities are converted to '?'.

    That should be &#x00; and &#xff; respectively. I did implement hex entities after all.
    Botched re-editing this commentary, however ;-P
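
To make the correction concrete, here is a small check of which spellings count as numeric character references. It is only a sketch: the pattern follows the XML form ("&#" plus decimal digits, or "&#x" plus hex digits, then ";"), and the names CHAR_REF, candidates and well_formed are illustrative.

```python
import re

# XML-style numeric character reference: decimal or lowercase-x hexadecimal.
CHAR_REF = re.compile(r'&#(?:x[0-9a-fA-F]+|[0-9]+);')

candidates = ['&#0;', '&#255;', '&#x00;', '&#xff;', '&x00;', '&xff;']

# Keep only the spellings that are complete, well-formed references.
well_formed = [c for c in candidates if CHAR_REF.fullmatch(c)]
print(well_formed)
```

The two spellings without the '#' are filtered out, which is why they would pass through a converter like the one above unchanged.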

    Regards,
    Bengt Richter
     
    Bengt Richter, Aug 11, 2003
    #3
