Convert from unicode chars to HTML entities

Discussion in 'Python' started by Steven D'Aprano, Jan 29, 2007.

  1. I have a string containing Latin-1 characters:

    s = u"© and many more..."

    I want to convert it to HTML entities:

    result =>
    "&copy; and many more..."

    Decimal/hex escapes would be acceptable:
    "&#169; and many more..."
    "&#xA9; and many more..."

    I can look up tables of HTML entities on the web (they're a dime a
    dozen), turn them into a dict mapping character to entity, then convert
    the string by hand. Is there a "batteries included" solution that doesn't
    involve reinventing the wheel?


    --
    Steven D'Aprano
    Steven D'Aprano, Jan 29, 2007
    #1

  2. Steven D'Aprano wrote:
    > I have a string containing Latin-1 characters:
    >
    > s = u"© and many more..."
    >
    > I want to convert it to HTML entities:
    >
    > result =>
    > "&copy; and many more..."
    >
    > Decimal/hex escapes would be acceptable:
    > "&#169; and many more..."
    > "&#xA9; and many more..."
    >
    > I can look up tables of HTML entities on the web (they're a dime a
    > dozen), turn them into a dict mapping character to entity, then convert
    > the string by hand. Is there a "batteries included" solution that doesn't
    > involve reinventing the wheel?
    >
    >


    It's *very* ugly, but I'm pretty sure you can make it look prettier.

    import htmlentitydefs as entity

    s = u"© and many more..."
    t = u""
    for i in s:
        if ord(i) in entity.codepoint2name:
            name = entity.codepoint2name.get(ord(i))
            entityCode = entity.name2codepoint.get(name)
            # A numeric character reference needs the closing ";".
            t += u"&#" + unicode(entityCode) + u";"
        else:
            t += i
    print t

    Hope this helps.

    Adonis
    Adonis Vargas, Jan 29, 2007
    #2

  3. Adonis Vargas wrote:
    [...]
    >
    > It's *very* ugly, but I'm pretty sure you can make it look prettier.
    >
    > import htmlentitydefs as entity
    >
    > s = u"© and many more..."
    > t = u""
    > for i in s:
    >     if ord(i) in entity.codepoint2name:
    >         name = entity.codepoint2name.get(ord(i))
    >         entityCode = entity.name2codepoint.get(name)
    >         t += u"&#" + unicode(entityCode) + u";"
    >     else:
    >         t += i
    > print t
    >
    > Hope this helps.
    >
    > Adonis


    or

    import htmlentitydefs as entity

    s = u"© and many more..."
    t = u""
    for i in s:
        if ord(i) in entity.codepoint2name:
            name = entity.codepoint2name.get(ord(i))
            t += "&" + name + ";"
        else:
            t += i
    print t

    Which I think is what you were looking for.

    Adonis
    Adonis Vargas, Jan 29, 2007
    #3
  4. On Mon, 29 Jan 2007 00:05:24 -0300, Steven D'Aprano wrote:

    > I have a string containing Latin-1 characters:
    >
    > s = u"© and many more..."
    >
    > I want to convert it to HTML entities:
    >
    > result =>
    > "&copy; and many more..."
    >


    Module htmlentitydefs contains the tables you're looking for, but you need
    a few transforms:

    <code>
    # -*- coding: iso-8859-15 -*-
    from htmlentitydefs import codepoint2name

    unichr2entity = dict((unichr(code), u'&%s;' % name)
                         for code, name in codepoint2name.iteritems()
                         if code != 38)  # exclude "&"

    def htmlescape(text, d=unichr2entity):
        # Escape "&" first so the ampersands added below are not re-escaped.
        if u"&" in text:
            text = text.replace(u"&", u"&amp;")
        for key, value in d.iteritems():
            if key in text:
                text = text.replace(key, value)
        return text

    print '%r' % htmlescape(u'hello')
    print '%r' % htmlescape(u'"©® áé&ö <²³>')
    </code>

    Output:
    u'hello'
    u'&quot;&copy;&reg; &aacute;&eacute;&amp;&ouml; &lt;&sup2;&sup3;&gt;'

    The result is a unicode object, with all known entities replaced. It does
    not handle missing, unknown entities - as the docs for htmlentitydefs say,
    "the definition provided here contains all the entities defined by XHTML
    1.0 that can be handled using simple textual substitution in the Latin-1
    character set (ISO-8859-1)."

    --
    Gabriel Genellina
    Gabriel Genellina, Jan 29, 2007
    #4
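
For readers on Python 3, the same table-driven idea can be sketched with `str.translate`. This is a minimal sketch, not the poster's code: it assumes Python 3, where `htmlentitydefs` is available as `html.entities`, and the names `ENTITY_TABLE` and `html_escape_named` are made up for illustration.

```python
from html.entities import codepoint2name

# Map each code point that has a named entity to its "&name;" reference.
# ENTITY_TABLE is a hypothetical name used only in this sketch.
ENTITY_TABLE = {code: "&%s;" % name for code, name in codepoint2name.items()}

def html_escape_named(text):
    """Replace every character that has a named entity, in a single pass."""
    return text.translate(ENTITY_TABLE)

escaped = html_escape_named('"©® áé&ö <²³>')
print(escaped)
```

Because `translate` walks the string once, the "&" characters introduced by the replacements are never escaped a second time, so "&" does not need the special handling used in the replace-based loop above.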
  5. Steven D'Aprano wrote:
    > I have a string containing Latin-1 characters:
    >
    > s = u"© and many more..."
    >
    > I want to convert it to HTML entities:
    >
    > result =>
    > "&copy; and many more..."
    >
    > Decimal/hex escapes would be acceptable:
    > "&#169; and many more..."
    > "&#xA9; and many more..."


    >>> s = u"© and many more..."
    >>> s.encode('ascii', 'xmlcharrefreplace')
    '&#169; and many more...'
    Leif K-Brooks, Jan 29, 2007
    #5
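
The same error handler exists in Python 3. As a minimal sketch, assuming Python 3 (where `encode` returns `bytes`, so the result is decoded back to `str` here), with `decimal_refs` being an illustrative variable name:

```python
s = "© and many more..."

# "xmlcharrefreplace" turns every character the target codec cannot
# represent into a decimal character reference.
decimal_refs = s.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(decimal_refs)
```

Only characters outside the target encoding are replaced, so markup characters such as "&" and "<" pass through unchanged and still need separate escaping.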
  6. On Sun, 28 Jan 2007 23:41:19 -0500, Leif K-Brooks wrote:

    > >>> s = u"© and many more..."
    > >>> s.encode('ascii', 'xmlcharrefreplace')
    > '&#169; and many more...'


    Wow. That's short and to the point. I like it.

    A few issues:

    (1) It doesn't seem to be reversible:

    >>> '&#169; and many more...'.decode('latin-1')
    u'&#169; and many more...'

    What should I do instead?


    (2) Are XML entities guaranteed to be the same as HTML entities?


    (3) Is there a way to find out at runtime what encoders/decoders/error
    handlers are available, and what they do?


    Thanks,


    --
    Steven D'Aprano
    Steven D'Aprano, Jan 29, 2007
    #6
  7. Steven D'Aprano wrote:
    > A few issues:
    >
    > (1) It doesn't seem to be reversible:
    >
    > >>> '&#169; and many more...'.decode('latin-1')
    > u'&#169; and many more...'
    >
    > What should I do instead?


    Unfortunately, there's nothing in the standard library that can do that,
    as far as I know. You'll have to write your own function. Here's one
    I've used before (partially stolen from code in Python patch #912410
    which was written by Aaron Swartz):

    from htmlentitydefs import name2codepoint
    import re

    def _replace_entity(m):
        s = m.group(1)
        if s[0] == u'#':
            # Numeric character reference: decimal (&#169;) or hex (&#xA9;).
            s = s[1:]
            try:
                if s[0] in u'xX':
                    c = int(s[1:], 16)
                else:
                    c = int(s)
                return unichr(c)
            except ValueError:
                return m.group(0)
        else:
            # Named entity reference such as &copy;.
            try:
                return unichr(name2codepoint[s])
            except (ValueError, KeyError):
                return m.group(0)

    _entity_re = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

    def unescape(s):
        return _entity_re.sub(_replace_entity, s)

    > (2) Are XML entities guaranteed to be the same as HTML entities?


    XML defines one entity which doesn't exist in HTML: &apos;. But
    xmlcharrefreplace only generates numeric character references, and those
    should be the same between XML and HTML.

    > (3) Is there a way to find out at runtime what encoders/decoders/error
    > handlers are available, and what they do?


    From what I remember, that's not possible because the codec system is
    designed so that functions taking names are registered instead of the
    names themselves. But all of the standard codecs are documented at
    <http://python.org/doc/current/lib/standard-encodings.html>, and all of
    the standard error handlers are documented at
    <http://python.org/doc/current/lib/codec-base-classes.html>.
    Leif K-Brooks, Jan 29, 2007
    #7
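
The statement that the standard library has nothing for the reverse direction reflects Python as of this 2007 thread. Python 3.4 and later ship `html.unescape`, which handles decimal, hexadecimal and named references. A minimal sketch, assuming Python 3.4+ (the variable names are illustrative):

```python
import html

# Decimal, hexadecimal and named references all decode to the same character.
encoded = "&#169; &#xA9; &copy; and many more..."
decoded = html.unescape(encoded)
print(decoded)

# Round trip through the xmlcharrefreplace error handler.
original = "© and many more..."
round_tripped = html.unescape(
    original.encode("ascii", "xmlcharrefreplace").decode("ascii")
)
```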
  8. Steven D'Aprano wrote:
    > A few issues:
    >
    > (1) It doesn't seem to be reversible:
    >
    > >>> '&#169; and many more...'.decode('latin-1')
    > u'&#169; and many more...'
    >
    > What should I do instead?


    For reverse processing, you need to parse it with an
    SGML/XML parser.

    > (2) Are XML entities guaranteed to be the same as HTML entities?


    Please distinguish between the terms "entity", "entity
    reference", and "character reference".

    An (external parsed) entity is a named piece of text, such
    as the copyright character. An entity reference is a reference
    to such a thing, e.g. &copy;

    A character reference is a reference to a character, not to
    an entity. xmlcharrefreplace generates character references,
    not entity references (let alone generating entities). The
    character references in XML and HTML both reference by
    Unicode ordinal, so it is "the same".

    > (3) Is there a way to find out at runtime what encoders/decoders/error
    > handlers are available, and what they do?


    Not through Python code. In C code, you can look at the
    codec_error_registry field of the interpreter object.

    Regards,
    Martin
    "Martin v. Löwis", Jan 29, 2007
    #8
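
The registry cannot be listed from Python, but a handler can be looked up by name with `codecs.lookup_error`, and new handlers can be added with `codecs.register_error`. The sketch below assumes Python 3 and registers a handler that emits entity references where a name is known and character references otherwise. The handler name `htmlentityreplace` and the function `html_entity_replace` are invented for this example and are not part of the standard library.

```python
import codecs
from html.entities import codepoint2name

def html_entity_replace(exc):
    """Error handler: named entity reference if known, else decimal reference."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    pieces = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        pieces.append("&%s;" % name if name else "&#%d;" % ord(ch))
    # Return the replacement text and the position at which to resume encoding.
    return "".join(pieces), exc.end

# "htmlentityreplace" is a hypothetical handler name chosen for this sketch.
codecs.register_error("htmlentityreplace", html_entity_replace)

# Handlers are retrieved by name; unknown names raise LookupError.
builtin_handler = codecs.lookup_error("xmlcharrefreplace")

result = "© and many more...".encode("ascii", "htmlentityreplace")
print(result)
```

As with `xmlcharrefreplace`, the handler only sees characters the codec rejects, so ASCII markup characters such as "&" are left alone.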
  9. Steven D'Aprano wrote:
    > I have a string containing Latin-1 characters:
    >
    > s = u"© and many more..."
    >
    > I want to convert it to HTML entities:
    >
    > result =>
    > "&copy; and many more..."

    [...]
    > Is there a "batteries included" solution that doesn't involve
    > reinventing the wheel?


    recode is good for this kind of thing:

    $ recode latin1..html -d mytextfile

    It seems that there are recode bindings for Python:

    $ apt-cache search recode | grep python
    python-bibtex - Python interfaces to BibTeX and the GNU Recode library

    HTH, cheers.
    --
    Roberto Bonvallet
    Roberto Bonvallet, Feb 8, 2007
    #9