unescape HTML entities

Discussion in 'Python' started by Rares Vernica, Oct 28, 2006.

  1. Hi,

    How can I unescape HTML entities like "&nbsp;"?

    I know about xml.sax.saxutils.unescape() but it only deals with "&amp;",
    "&lt;", and "&gt;".

    Also, I know about htmlentitydefs.entitydefs, but not only is this
    dictionary the opposite of what I need, it also does not have "&nbsp;".

    It has to work in Python 2.4.

    Thanks a lot,
    Ray
     
    Rares Vernica, Oct 28, 2006
    #1
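
The limitation described in this post can be reproduced with a short sketch. It is written for Python 3, which is an assumption, since the thread targets Python 2.4. The sample string is made up for illustration. The sketch also shows the optional second argument of xml.sax.saxutils.unescape(), which takes a dictionary of extra replacements; whether that is convenient for a full entity table is a separate question.

```python
from xml.sax.saxutils import unescape

# Hypothetical sample input mixing the three XML entities with "&nbsp;".
sample = "1&nbsp;&lt;&nbsp;2 &amp; 3"

# By default only "&lt;", "&gt;" and "&amp;" are replaced.
plain = unescape(sample)

# Extra entities can be supplied as a {entity_text: replacement} mapping.
extended = unescape(sample, {"&nbsp;": "\u00a0"})

print(repr(plain))
print(repr(extended))
```

Supplying every HTML entity this way would mean building the mapping by hand, which is why the replies in this thread use a regular expression instead.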

  2. Rares Vernica wrote:

    > How can I unescape HTML entities like "&nbsp;"?

    Can I ask what you mean by "unescaping"? Do you mean converting into
    numeric references? Into Unicode?

    Jim
     
    Jim, Oct 28, 2006
    #2

  3. Rares Vernica wrote:

    > How can I unescape HTML entities like "&nbsp;"?
    >
    > I know about xml.sax.saxutils.unescape() but it only deals with
    > "&amp;", "&lt;", and "&gt;".
    >
    > Also, I know about htmlentitydefs.entitydefs, but not only is this
    > dictionary the opposite of what I need, it also does not have
    > "&nbsp;".


    How about something like:

    #v+
    #!/usr/bin/env python
    '''dehtml.py'''

    import re
    import htmlentitydefs

    # Match any named entity known to htmlentitydefs, e.g. "&aelig;".
    myrx = re.compile('&(' + '|'.join(htmlentitydefs.name2codepoint.keys()) + ');')

    def dehtml(s):
        # Replace each named entity with the corresponding Unicode character.
        return re.sub(
            myrx,
            lambda m: unichr(htmlentitydefs.name2codepoint[m.group(1)]),
            s
        )
    # end def dehtml

    if __name__ == '__main__':
        import sys
        print dehtml(sys.stdin.read()).encode('utf-8')
    # end if

    #v-

    E.g.:

    #v+

    $ echo 'fr&aelig;kke fr&oslash;l&aring;r' | ./dehtml.py
    frække frølår
    $

    #v-

    --
    Klaus Alexander Seistrup
    Copenhagen, Denmark, EU
    http://klaus.seistrup.dk/
     
    Klaus Alexander Seistrup, Oct 28, 2006
    #3
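
For readers on Python 3, where htmlentitydefs became html.entities and unichr became chr, the same regular-expression idea might look roughly like the sketch below. It is an adaptation made under that assumption, not code from the post itself; the use of re.escape is an added precaution, and the function name is simply reused for continuity.

```python
import re
from html.entities import name2codepoint

# One alternation of every known entity name, e.g. "&(quot|amp|...);".
_named = re.compile("&(" + "|".join(map(re.escape, name2codepoint)) + ");")

def dehtml(s):
    """Replace named HTML entities in s with their Unicode characters."""
    return _named.sub(lambda m: chr(name2codepoint[m.group(1)]), s)

result = dehtml("fr&aelig;kke fr&oslash;l&aring;r")
print(result)
```

Because the pattern is built only from known names, anything else (an unknown name or a numeric reference) is left untouched.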
  4. Hi,

    How does your code deal with entities like "&#39;"?

    Thanks,
    Ray

     
    Rares Vernica, Nov 1, 2006
    #4
  5. Rares Vernica wrote:

    > How does your code deal with entities like "&#39;"?


    It doesn't; it deals with named entities only. But take a look
    at Fredrik's example.

    Cheers,

    --
    Klaus Alexander Seistrup
    København, Danmark, EU
    http://klaus.seistrup.dk/
     
    Klaus Alexander Seistrup, Nov 1, 2006
    #5
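
Fredrik's example is not reproduced in this thread, so the following is only a sketch of one way the regular-expression approach could be extended to numeric references. It assumes Python 3; the function name unescape_entities and the exact pattern are illustrative choices, not taken from any post.

```python
import re
from html.entities import name2codepoint

# Decimal reference, hexadecimal reference, or an entity name.
_entity = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def unescape_entities(s):
    """Replace named and numeric character references with characters."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))   # hexadecimal, e.g. &#x27;
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal, e.g. &#39;
            return chr(name2codepoint[body])    # named, e.g. &nbsp;
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave the text as is.
            return match.group(0)
    return _entity.sub(replace, s)

print(unescape_entities("It&#39;s fr&aelig;kke&nbsp;&#x2713;"))
```

On Python 3.4 and later, the standard library function html.unescape() covers the same ground and also handles edge cases this sketch ignores, so it is usually the simpler choice there.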