Convert xml symbol notation

Discussion in 'Python' started by dumbkiwi, Apr 6, 2007.

  1. dumbkiwi

    dumbkiwi Guest

    Hi,

    I'm working on a script to download and parse a web page, and it
    includes xml symbol notation, such as &#39; for the ' character. Does
    anyone know of a pre-existing python script/lib to convert the xml
    notation back to the actual symbol it represents?
     
    dumbkiwi, Apr 6, 2007
    #1

  2. dumbkiwi wrote:

    > On Apr 7, 5:23 pm, "Gabriel Genellina" wrote:


    >>Try the htmlentitydefs module.

    >
    > Is that a standard module? I can't see it anywhere - googled it.


    Sure! For quite a while, at least, since Python 1.5 (I can't go earlier
    in time...)
    http://svn.python.org/view/python/trunk/Lib/htmlentitydefs.py
    Added Wed Sep 27 16:22:08 1995 UTC (11 years, 6 months ago) by guido

    --
    Gabriel Genellina
     
    Gabriel Genellina, Apr 7, 2007
    #2

  3. dumbkiwi wrote:

    > I'm working on a script to download and parse a web page, and it
    > includes xml symbol notation, such as ' for the ' character. Does
    > anyone know of a pre-existing python script/lib to convert the xml
    > notation back to the actual symbol it represents?


    Try the htmlentitydefs module.

    --
    Gabriel Genellina
     
    Gabriel Genellina, Apr 7, 2007
    #3
  4. dumbkiwi

    dumbkiwi Guest

    On Apr 7, 5:23 pm, "Gabriel Genellina" wrote:
    > dumbkiwi wrote:
    > > I'm working on a script to download and parse a web page, and it
    > > includes xml symbol notation, such as ' for the ' character. Does
    > > anyone know of a pre-existing python script/lib to convert the xml
    > > notation back to the actual symbol it represents?

    >
    > Try the htmlentitydefs module.


    Is that a standard module? I can't see it anywhere - googled it.
     
    dumbkiwi, Apr 7, 2007
    #4
  5. >> I'm working on a script to download and parse a web page, and it
    >> includes xml symbol notation, such as ' for the ' character. Does
    >> anyone know of a pre-existing python script/lib to convert the xml
    >> notation back to the actual symbol it represents?

    >
    > Try the htmlentitydefs module.


    That won't help: this is a character reference, not an entity reference.
    htmlentitydefs only contains the definitions of entities.
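
To illustrate the distinction, here is a minimal sketch. It assumes Python 3, where htmlentitydefs was renamed html.entities; the helper name named_entity_to_char is made up for this example. The table maps entity names such as amp to code points, so a numeric character reference such as #39 is simply not in it.

```python
# Python 3 module name; in Python 2 this module was called htmlentitydefs.
from html.entities import name2codepoint


def named_entity_to_char(name):
    """Return the character for a named entity such as 'amp', or None.

    Hypothetical helper for illustration: it only consults the table of
    named entities, which is all the module provides.
    """
    codepoint = name2codepoint.get(name)
    return chr(codepoint) if codepoint is not None else None


print(named_entity_to_char("amp"))   # named entity: found in the table
print(named_entity_to_char("#39"))   # numeric reference: not in the table
```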

    Regards,
    Martin
     
    Martin v. Löwis, Apr 7, 2007
    #5
  6. > I'm working on a script to download and parse a web page, and it
    > includes xml symbol notation, such as ' for the ' character. Does
    > anyone know of a pre-existing python script/lib to convert the xml
    > notation back to the actual symbol it represents?


    If you have this given in an XML file (rather than an HTML file which
    is not well-formed XML), you could use an XML parser for the entire
    file. This would automatically unescape character references. Likewise,
    you can parse it with HTMLParser, which will invoke the handle_charref
    method for these.
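
Both routes can be sketched roughly as follows, assuming Python 3 module names (xml.etree.ElementTree and html.parser; in Python 2 the latter was the HTMLParser module). The names xml_text, html_text and CharrefCollector are invented for this example. Note that recent Python 3 versions convert character references automatically unless convert_charrefs=False is passed, so the sketch passes it explicitly to make handle_charref fire.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def xml_text(snippet):
    """Parse well-formed XML; character references are unescaped by the parser."""
    return ET.fromstring(snippet).text


class CharrefCollector(HTMLParser):
    """Collect text, converting numeric character references by hand."""

    def __init__(self):
        # convert_charrefs=False so that handle_charref is actually invoked.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, name):
        # name is "39" for a decimal reference and "x27" for a hex one.
        if name.lower().startswith("x"):
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.parts.append(chr(code))


def html_text(snippet):
    parser = CharrefCollector()
    parser.feed(snippet)
    parser.close()
    return "".join(parser.parts)
```

For example, xml_text("<p>It&#39;s</p>") and html_text("<p>It&#39;s</p>") both give the text with a plain apostrophe.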

    If you just want to unescape references, you can use the code in

    http://effbot.org/zone/re-sub.htm
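
A sketch in the spirit of that approach (re.sub with a callback), written for Python 3 and not copied from the linked page. The function name unescape and the exact regular expression are choices made for this example. It handles decimal and hexadecimal character references as well as named entities, and leaves anything it does not recognise untouched.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2


def unescape(text):
    """Replace &#NN;, &#xHH; and &name; references with their characters."""

    def fixup(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == "#x":
                return chr(int(ref[2:], 16))     # hexadecimal character reference
            if ref.startswith("#"):
                return chr(int(ref[1:]))         # decimal character reference
            return chr(name2codepoint[ref])      # named entity reference
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                # unknown or invalid: leave as-is

    return re.sub(r"&(#?\w+);", fixup, text)


print(unescape("It&#39;s &amp; it&#x27;s"))
```

On Python 3.4 and later, the standard library function html.unescape covers the same ground.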

    HTH,
    Martin
     
    Martin v. Löwis, Apr 7, 2007
    #6
  7. Martin v. Löwis wrote:

    > >> I'm working on a script to download and parse a web page, and it
    > >> includes xml symbol notation, such as ' for the ' character. Does

    > >
    > > Try the htmlentitydefs module.

    >
    > That won't help: this is a character reference, not an entity reference.
    > htmlentitydefs only contains the definitions of entities.


    Ouch! Sorry.

    --
    Gabriel Genellina
     
    Gabriel Genellina, Apr 7, 2007
    #7