Easy way to remove HTML entities from an HTML document?

Discussion in 'Python' started by Robert Oschler, Jul 25, 2004.

  1. Is there a module/function to remove all the HTML entities from an HTML
    document (e.g. - &nbsp, &amp, &apos, etc.)?

    If not I'll just write one myself but I figured I'd save myself some time.

    Thanks,
    --
    Robert
     
    Robert Oschler, Jul 25, 2004
    #1

  2. On Sun, 25 Jul 2004, Robert Oschler wrote:

    > Is there a module/function to remove all the HTML entities from an HTML
    > document (e.g. - &nbsp, &amp, &apos, etc.)?


    htmllib has this capability, but if you're not doing any other HTML
    parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
    does nicely:

    import re
    import htmlentitydefs

    def convertentity(m):
        if m.group(1)=='#':
            try:
                return chr(int(m.group(2)))
            except ValueError:
                return '&#%s;' % m.group(2)
        try:
            return htmlentitydefs.entitydefs[m.group(2)]
        except KeyError:
            return '&%s;' % m.group(2)

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convert,s)

    converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'

    Unknown or invalid named entities are left in &xxx; format, and numeric
    entities outside chr()'s range (i.e. Unicode ones) are left in &#xxx;
    format. If you want a Unicode string to be returned (and Unicode entities
    interpreted), replace 'chr' with 'unichr', and
    'htmlentitydefs.entitydefs[m.group(2)]' with
    'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
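For readers on Python 3, where htmlentitydefs was renamed html.entities and chr() already returns Unicode characters, the Unicode variant described above might look roughly like the sketch below. The names ENTITY_RE, convert_entity and convert_html are chosen here for illustration and are not from the original post; the tighter \w+ pattern and the hexadecimal handling are also additions, so treat this as one possible adaptation rather than the code as posted.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Assumption: entity bodies consist of word characters only, which is
# stricter than the '.+?' used in the post.
ENTITY_RE = re.compile(r'&(#?)(\w+);')

def convert_entity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    is_numeric, body = m.group(1), m.group(2)
    try:
        if is_numeric:
            # Hexadecimal reference such as &#xE9;
            if body[:1] in ('x', 'X'):
                return chr(int(body[1:], 16))
            # Decimal reference such as &#233;
            return chr(int(body))
        # Named entity such as &lt;
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown or invalid entity: leave it exactly as it appeared
        return m.group(0)

def convert_html(s):
    """Replace every recognised entity in s with its character."""
    return ENTITY_RE.sub(convert_entity, s)

print(convert_html('Some &lt;html&gt; string.'))
```

As in the post, anything that cannot be decoded is passed through untouched, so the function is safe to run on text that contains stray ampersands.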

    Hope this helps.
     
    Christopher T King, Jul 25, 2004
    #2

  3. "Robert Oschler" <no_replies@fake_email_address.invalid> wrote in message news:<X9UMc.12838$>...
    > Is there a module/function to remove all the HTML entities from an HTML
    > document (e.g. - &nbsp, &amp, &apos, etc.)?
    >
    > If not I'll just write one myself but I figured I'd save myself some time.
    >
    > Thanks,



    Check out Mark Pilgrim's site: http://diveintopython.org/html_processing/index.html
     
    Michael Scarlett, Jul 26, 2004
    #3
  4. "Christopher T King" <> wrote in message
    news:p...
    >
    > htmllib has this capability, but if you're not doing any other HTML
    > parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
    > does nicely:
    >
    > import re
    > import htmlentitydefs
    >
    > def convertentity(m):
    >     if m.group(1)=='#':
    >         try:
    >             return chr(int(m.group(2)))
    >         except ValueError:
    >             return '&#%s;' % m.group(2)
    >     try:
    >         return htmlentitydefs.entitydefs[m.group(2)]
    >     except KeyError:
    >         return '&%s;' % m.group(2)
    >
    > def converthtml(s):
    >     return re.sub(r'&(#?)(.+?);',convert,s)
    >
    > converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'
    >
    > Unknown or invalid entities are left in &xxx; format, while also leaving
    > Unicode entities in &#xxx; format. If you want a Unicode string to be
    > returned (and Unicode entities interpreted), replace 'chr' with 'unichr',
    > and 'htmlentitydefs.entitydefs[m.group(2)]' with
    > 'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
    >
    > Hope this helps.
    >


    Chris,

    I believe the line that reads:

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convert,s)

    Should read:

    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convertentity,s)

    Once I made that change it worked like a charm. I'm showing the correction
    for future Usenet searchers.
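
For later readers on Python 3.4 or newer, the standard library also offers html.unescape, which decodes named and numeric character references without a hand-written regex. This is an addition for context and was not available when the thread was written; the sample string below is made up for illustration.

```python
import html

# Hypothetical sample text containing three named entities
text = 'Some &lt;html&gt; string &amp; more.'

# html.unescape returns a new string with the entities decoded
decoded = html.unescape(text)
print(decoded)
```

Note that this converts entities to characters rather than deleting them, which matches what the regex approach in this thread does.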

    So you can pass a function to re.sub() as the replacement pattern? Very
    cool, I didn't know that. I think you could spend a year just learning
    regular expressions and still miss something.


    Thanks,
    Robert.
     
    Robert Oschler, Jul 26, 2004
    #4
  5. On Mon, 26 Jul 2004, Robert Oschler wrote:

    > I believe the line that reads:
    >
    > def converthtml(s):
    >     return re.sub(r'&(#?)(.+?);',convert,s)
    >
    > Should read:
    >
    > def converthtml(s):
    >     return re.sub(r'&(#?)(.+?);',convertentity,s)


    Oops, you're right, mea culpa :)

    > So you can pass a function to re.sub() as the replacement pattern? Very
    > cool, I didn't know that. I think you could spend a year just learning
    > regular expressions and still miss something.


    That feature is only mentioned briefly in the online docs, and not at all
    in sre.sub's docstring. Surprising, since it's indeed a very useful
    feature.
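
The feature in question, passing a callable as the replacement argument of re.sub, can be illustrated with a minimal sketch unrelated to HTML. The function name double and the sample sentence are invented for this example. re.sub calls the function once per match, hands it the match object, and substitutes whatever string it returns.

```python
import re

def double(m):
    """Receive a match object for a run of digits; return the doubled number."""
    return str(int(m.group(0)) * 2)

# Each run of digits is replaced by the return value of double()
result = re.sub(r'\d+', double, '3 apples and 12 pears')
print(result)
```

This is the same mechanism convertentity relies on: the match object gives access to the captured groups, so the replacement can depend on what was matched.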
     
    Christopher T King, Jul 27, 2004
    #5
  6. "Christopher T King" <> wrote in message
    news:p...
    >
    > That feature is only mentioned briefly in the online docs, and not at all
    > in sre.sub's docstring. Surprising, since it's indeed a very useful
    > feature.
    >


    Chris,

    Speaking of learning cool things by osmosis, do you know of a well commented
    source of Python code, perhaps an Open Source project, that I could study to
    learn more interesting techniques like the regexp tip you shared? I find
    that studying other people's code is the best way to avoid getting in a
    programming rut.

    Thanks.

    --
    Robert
     
    Robert Oschler, Jul 27, 2004
    #6
  7. On Tue, 27 Jul 2004, Robert Oschler wrote:

    > Speaking of learning cool things by osmosis, do you know of a well commented
    > source of Python code, perhaps an Open Source project, that I could study to
    > learn more interesting techniques like the regexp tip you shared? I find
    > that studying other people's code is the best way to avoid getting in a
    > programming rut.


    I seem to recall reading about that re.sub trick in something linked from
    Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
    are often links there to interesting and useful code snippets from
    ActiveState's Python Cookbook and other sources; I'd say start there if
    you want to find neat tricks you can do with Python.

    I'm not sure of any particularly "well commented" Python projects though
    (I've never really looked into that), but you'll probably find some
    interesting small projects in the Vaults of Parnassus
    (http://www.vex.net/parnassus/).
     
    Christopher T King, Jul 29, 2004
    #7
  8. "Christopher T King" <> wrote in message
    news:p...
    >
    > I seem to recall reading about that re.sub trick in something linked from
    > Pythonware's Daily Python URL (http://www.pythonware.com/daily/). There
    > are often links there to interesting and useful code snippets from
    > ActiveState's Python Cookbook and other sources; I'd say start there if
    > you want to find neat tricks you can do with Python.
    >
    > I'm not sure of any particularly "well commented" Python projects though
    > (I've never really looked into that), but you'll probably find some
    > interesting small projects in the Vaults of Parnassus
    > (http://www.vex.net/parnassus/).
    >


    Thanks Chris and thanks for all your other help.

    With your Python skill you should work for Google. Too bad you don't, you'd
    be a wealthy man soon (Google IPO). Wish I did. :)

    --
    Robert
     
    Robert Oschler, Jul 30, 2004
    #8
  9. On Fri, 30 Jul 2004, Robert Oschler wrote:

    > With your Python skill you should work for Google. Too bad you don't,
    > you'd be a wealthy man soon (Google IPO). Wish I did. :)


    Thanks for the compliment. :) To work at Google is my dream job, and I'm
    sure that of many others on this list, too (makes me wonder if any Google
    employees read this list...).
     
    Christopher T King, Jul 31, 2004
    #9