Re: unescape HTML entities

Discussion in 'Python' started by Rares Vernica, Nov 2, 2006.

  1. Hi,

    I downloaded 2.2 beta, just to be sure I have the same version as you
    specify. (The file names are no longer funny.) Anyway, it does not seem
    to do as you said:

    In [14]: import SE

    In [15]: SE.version
    -------> SE.version()
    Out[15]: 'SE 2.2 beta - SEL 2.2 beta'

    In [16]: HTM_Decoder = SE.SE ('HTM2ISO.se')

    In [17]: test_string = '''
    ....: &oslash;=(xf8) # 248 f8
    ....: &ugrave;=(xf9) # 249 f9
    ....: &uacute;=(xfa) # 250 fa
    ....: &ucirc;=(xfb) # 251 fb
    ....: &uuml;=(xfc) # 252 fc
    ....: &yacute;=(xfd) # 253 fd
    ....: &thorn;=(xfe) # 254 fe
    ....: &eacute;=(xe9)
    ....: &ecirc;=(xea)
    ....: &euml;=(xeb)
    ....: &igrave;=(xec)
    ....: &iacute;=(xed)
    ....: &icirc;=(xee)
    ....: &iuml;=(xef)
    ....: '''

    In [18]: print HTM_Decoder (test_string)

    &oslash;=(xf8) # 248 f8
    &ugrave;=(xf9) # 249 f9
    &uacute;=(xfa) # 250 fa
    &ucirc;=(xfb) # 251 fb
    &uuml;=(xfc) # 252 fc
    &yacute;=(xfd) # 253 fd
    &thorn;=(xfe) # 254 fe
    &eacute;=(xe9)
    &ecirc;=(xea)
    &euml;=(xeb)
    &igrave;=(xec)
    &iacute;=(xed)
    &icirc;=(xee)
    &iuml;=(xef)


    In [19]:

    Thanks,
    Ray
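
For comparison, here is a minimal standard-library sketch of the same kind of decoding. It is not how SE works; it simply inverts the entity table with a regular expression. The names `unescape_entities` and `_ENTITY_RE` are made up for illustration, and the code is written for Python 3. On Python 2.4 the table lives in `htmlentitydefs` and `unichr` would take the place of `chr`.

```python
import re
from html.entities import name2codepoint  # Python 2: from htmlentitydefs import name2codepoint

# Matches named (&eacute;), decimal (&#233;) and hexadecimal (&#xe9;) references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(text):
    """Replace HTML character references in text; leave unknown ones untouched."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2] in ("#x", "#X"):
                return chr(int(ref[2:], 16))   # hexadecimal reference
            if ref.startswith("#"):
                return chr(int(ref[1:]))       # decimal reference
            return chr(name2codepoint[ref])    # named reference
        except (KeyError, ValueError, OverflowError):
            return match.group(0)              # unknown or out-of-range reference
    return _ENTITY_RE.sub(replace, text)


sample = "&oslash;=(xf8)\n&eacute;=(xe9)\n&nbsp;|&bogus;"
print(repr(unescape_entities(sample)))
```

Feeding a few lines of the definition file through it, as above, gives a quick check that the left-hand side of each line really changes.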



    Frederic Rentsch wrote:
    > Rares Vernica wrote:
    >> Hi,
    >>
    >> How can I unescape HTML entities like "&nbsp;"?
    >>
    >> I know about xml.sax.saxutils.unescape() but it only deals with "&amp;",
    >> "&lt;", and "&gt;".
    >>
    >> Also, I know about htmlentitydefs.entitydefs, but not only this
    >> dictionary is the opposite of what I need, it does not have "&nbsp;".
    >>
    >> It has to be in python 2.4.
    >>
    >> Thanks a lot,
    >> Ray
    >>

    > One way is this:
    >
    > >>> import SE  # Download from http://cheeseshop.python.org/pypi/SE/2.2 beta
    > >>> SE.SE ('HTM2ISO.se')('input_file_name', 'output_file_name')  # HTM2ISO.se is included
    > 'output_file_name'
    >
    > For repeated translations the SE object would be assigned to a variable:
    >
    > >>> HTM_Decoder = SE.SE ('HTM2ISO.se')
    >
    > SE objects take and return strings as well as file names which is useful
    > for translating string variables, doing line-by-line translations and
    > for interactive development or verification. A simple way to check a
    > substitution set is to use its definitions as test data. The following
    > is a section of the definition file HTM2ISO.se:
    >
    > test_string = '''
    > &oslash;=(xf8) # 248 f8
    > &ugrave;=(xf9) # 249 f9
    > &uacute;=(xfa) # 250 fa
    > &ucirc;=(xfb) # 251 fb
    > &uuml;=(xfc) # 252 fc
    > &yacute;=(xfd) # 253 fd
    > &thorn;=(xfe) # 254 fe
    > &eacute;=(xe9)
    > &ecirc;=(xea)
    > &euml;=(xeb)
    > &igrave;=(xec)
    > &iacute;=(xed)
    > &icirc;=(xee)
    > &iuml;=(xef)
    > '''
    >
    > >>> print HTM_Decoder (test_string)
    >
    > ø=(xf8) # 248 f8
    > ù=(xf9) # 249 f9
    > ú=(xfa) # 250 fa
    > û=(xfb) # 251 fb
    > ü=(xfc) # 252 fc
    > ý=(xfd) # 253 fd
    > þ=(xfe) # 254 fe
    > é=(xe9)
    > ê=(xea)
    > ë=(xeb)
    > ì=(xec)
    > í=(xed)
    > î=(xee)
    > ï=(xef)
    >
    > Another feature of SE is modularity.
    >
    > >>> strip_tags = '''
    > ~<(.|\x0a)*?>~=(9) # one tag to one tab
    > ~<!--(.|\x0a)*?-->~=(9) # one comment to one tab
    > | # run
    > "~\x0a[ \x09\x0d\x0a]*~=(x0a)" # delete empty lines
    > ~\t+~=(32) # one or more tabs to one space
    > ~\x20\t+~=(32) # one space and one or more tabs to one space
    > ~\t+\x20~=(32) # one or more tab and one space to one space
    > '''
    >
    > >>> HTM_Stripper_Decoder = SE.SE (strip_tags + ' HTM2ISO.se ')  # Order doesn't matter
    >
    > If you write 'strip_tags' to a file, say 'STRIP_TAGS.se' you'd name it
    > together with HTM2ISO.se:
    >
    > >>> HTM_Stripper_Decoder = SE.SE ('STRIP_TAGS.se HTM2ISO.se')  # Order doesn't matter
    >
    > Or, if you have two SE objects, one for stripping tags and one for
    > decoding the ampersands, you can nest them like this:
    >
    > >>> test_string = "<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>"
    >
    > >>> print Tag_Stripper (HTM_Decoder (test_string))
    > René est un garçon qui paraît plus âgé.
    >
    > Nesting works with file names too, because file names are returned:
    >
    > >>> Tag_Stripper (HTM_Decoder ('input_file_name'), 'output_file_name')
    > 'output_file_name'
    >
    >
    > Frederic
    >
    >
    >
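
The nesting idea in the quoted message (decode the entities, then strip the tags) can also be sketched with plain functions instead of SE objects. This is only an illustration under some assumptions: it uses `html.unescape`, which exists in Python 3.4 and later and so does not meet the Python 2.4 requirement of the original question, and the helper names `strip_tags`, `decode_entities` and `_TAG_RE` are invented here. The tag pattern is deliberately naive and will misfire on a `>` inside an attribute value.

```python
import re
from html import unescape  # Python 3.4+; not available in Python 2.4

# Comments first, then ordinary tags; DOTALL lets a comment span several lines.
_TAG_RE = re.compile(r"<!--.*?-->|<[^>]*>", re.DOTALL)


def strip_tags(text):
    """Replace each tag or comment with a space, then collapse whitespace runs."""
    return re.sub(r"\s+", " ", _TAG_RE.sub(" ", text)).strip()


def decode_entities(text):
    """Decode named and numeric character references."""
    return unescape(text)


test_string = ("<p class=MsoNormal style='line-height:110%'><i>Ren&eacute;</i> "
               "est un gar&ccedil;on qui para&icirc;t plus &acirc;g&eacute;. </p>")

# Same nesting order as in the quoted example: decode first, strip second.
print(strip_tags(decode_entities(test_string)))
```

With plain functions the order can matter: decoding first turns an escaped `&lt;b&gt;` into something the tag pattern will then remove, so stripping before decoding may be the safer choice for text that contains escaped markup.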
     
    Rares Vernica, Nov 2, 2006
