BeautifulSoup vs. loose & chars

Discussion in 'Python' started by John Nagle, Dec 26, 2006.

  1. John Nagle

    John Nagle Guest

    I've been parsing existing HTML with BeautifulSoup, and occasionally
    hit content which has something like "Design & Advertising", that is,
    an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup
    to clean those up? There are various parsing options related to "&"
    handling, but none of them seem to do quite the right thing.

    If I write the BeautifulSoup parse tree back out with "prettify",
    the loose "&" is still in there. So the output is
    rejected by XML parsers. Which is why this is a problem.
    I need valid XML out, even if what went in wasn't quite valid.
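
The failure described above can be reproduced with the standard library alone. This is a minimal Python 3 sketch, assuming `xml.etree.ElementTree` stands in for whichever XML parser rejects the output; the helper name `parses_as_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(fragment):
    """Return True if the fragment is well-formed XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

loose = "<p>Design & Advertising</p>"        # bare "&", as found in the wild
escaped = "<p>Design &amp; Advertising</p>"  # what an XML parser expects

print(parses_as_xml(loose))
print(parses_as_xml(escaped))
print(ET.fromstring(escaped).text)
```

The first fragment is rejected because a bare "&" must start an entity reference in XML; the second parses, and the parser hands back the text with a plain "&".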

    John Nagle
     
    John Nagle, Dec 26, 2006
    #1

  2. placid

    placid Guest

    John Nagle wrote:
    > I've been parsing existing HTML with BeautifulSoup, and occasionally
    > hit content which has something like "Design & Advertising", that is,
    > an "&" instead of an "&amp;". Is there some way I can get BeautifulSoup
    > to clean those up? There are various parsing options related to "&"
    > handling, but none of them seem to do quite the right thing.
    >
    > If I write the BeautifulSoup parse tree back out with "prettify",
    > the loose "&" is still in there. So the output is
    > rejected by XML parsers. Which is why this is a problem.
    > I need valid XML out, even if what went in wasn't quite valid.
    >
    > John Nagle



    So do you want to remove the "&" characters or replace them with
    "&amp;"? If you want to replace them, try the following:

    import urllib, sys

    # url is assumed to hold the address of the page to fetch
    try:
        location = urllib.urlopen(url)
    except IOError, (errno, strerror):
        sys.exit("I/O error(%s): %s" % (errno, strerror))

    content = location.read()
    # Note: this escapes every "&", including those in existing entities
    content = content.replace("&", "&amp;")


    To do this with BeautifulSoup, I think you need to go through every
    Tag, get its content, see if it contains an "&" and then replace the
    Tag with the same Tag whose content contains "&amp;" instead.

    Hope this helps.
    Cheers
     
    placid, Dec 26, 2006
    #2

  3. Felipe Almeida Lessa

    On 26 Dec 2006 04:22:38 -0800, placid wrote:
    > So do you want to remove "&" or replace them with "&amp;" ? If you want
    > to replace it try the following;


    I think he wants to replace them, but just the invalid ones. I.e.,

    This & this &amp; that

    would become

    This &amp; this &amp; that


    No, i don't know how to do this efficiently. =/...
    I think some kind of regex could do it.
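
To make the distinction concrete, here is a small Python 3 sketch of what goes wrong when every "&" is replaced blindly, using the example string from this post. The variable names are illustrative only.

```python
text = "This & this &amp; that"

# Replacing every "&" also rewrites the one that already starts "&amp;".
naive = text.replace("&", "&amp;")
print(naive)
```

The already-valid entity comes out as "&amp;amp;", which a browser or XML parser would render as the literal text "&amp;". That is why only the loose ampersands should be touched.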

    --
    Felipe.
     
    Felipe Almeida Lessa, Dec 26, 2006
    #3
  4. Duncan Booth

    Duncan Booth Guest

    "Felipe Almeida Lessa" <> wrote:

    > On 26 Dec 2006 04:22:38 -0800, placid <> wrote:
    >> So do you want to remove "&" or replace them with "&amp;" ? If you
    >> want to replace it try the following;

    >
    > I think he wants to replace them, but just the invalid ones. I.e.,
    >
    > This & this &amp; that
    >
    > would become
    >
    > This &amp; this &amp; that
    >
    >
    > No, i don't know how to do this efficiently. =/...
    > I think some kind of regex could do it.
    >


    Since he's asking for valid xml as output, it isn't sufficient just to
    ignore entity definitions: HTML has a lot of named entities such as
    &nbsp; but xml only has a very limited set of predefined named entities.
    The safest technique is to convert them all to numeric escapes except
    for the very limited set also guaranteed to be available in xml.

    Try this:

    from cgi import escape
    import re
    from htmlentitydefs import name2codepoint
    name2codepoint = name2codepoint.copy()
    name2codepoint['apos'] = ord("'")

    EntityPattern = re.compile(
        r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

    def decodeEntities(s, encoding='utf-8'):
        def unescape(match):
            code = match.group(1)
            if code:
                return unichr(int(code, 10))
            else:
                code = match.group(2)
                if code:
                    return unichr(int(code, 16))
                else:
                    return unichr(name2codepoint[match.group(3)])
        return EntityPattern.sub(unescape, s)

    >>> escape(
    ...     decodeEntities("This & this &amp; that&nbsp;&eacute;")).encode(
    ...     'ascii', 'xmlcharrefreplace')
    'This &amp; this &amp; that&#160;&#233;'


    P.S. apos is handled specially as it isn't technically a
    valid html entity (and Python doesn't include it in its entity
    list), but it is an xml entity and recognised by many browsers so some
    people might use it in html.
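
For readers on Python 3, where `cgi.escape`, `htmlentitydefs` and `unichr` are no longer available, the same decode-then-escape idea can be sketched with the standard library. This is not the code from the post: it swaps in `html.unescape` for the hand-written entity decoder, and the function name `html_text_to_xml_ascii` is made up for illustration.

```python
import html
from xml.sax.saxutils import escape

def html_text_to_xml_ascii(s):
    """Decode HTML entities, then re-escape for XML using numeric references."""
    decoded = html.unescape(s)   # &amp; -> &, &nbsp; -> U+00A0, &eacute; -> é
    xml_safe = escape(decoded)   # escapes &, < and > only
    # Non-ASCII characters become numeric references, which any XML parser accepts.
    return xml_safe.encode("ascii", "xmlcharrefreplace").decode("ascii")

result = html_text_to_xml_ascii("This & this &amp; that&nbsp;&eacute;")
print(result)
```

As with the original, this is meant for text content rather than whole documents: running it over markup would also escape the tags themselves.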
     
    Duncan Booth, Dec 26, 2006
    #4
  5. Re: SPAM-LOW: Re: BeautifulSoup vs. loose & chars

    Duncan Booth wrote:
    > "Felipe Almeida Lessa" <> wrote:
    >
    >
    >> On 26 Dec 2006 04:22:38 -0800, placid <> wrote:
    >>
    >>> So do you want to remove "&" or replace them with "&amp;" ? If you
    >>> want to replace it try the following;
    >>>

    >> I think he wants to replace them, but just the invalid ones. I.e.,
    >>
    >> This & this &amp; that
    >>
    >> would become
    >>
    >> This &amp; this &amp; that
    >>
    >>
    >> No, i don't know how to do this efficiently. =/...
    >> I think some kind of regex could do it.
    >>
    >>

    >
    > Since he's asking for valid xml as output, it isn't sufficient just to
    > ignore entity definitions: HTML has a lot of named entities such as
    > &nbsp; but xml only has a very limited set of predefined named entities.
    > The safest technique is to convert them all to numeric escapes except
    > for the very limited set also guaranteed to be available in xml.
    >
    > Try this:
    >
    > from cgi import escape
    > import re
    > from htmlentitydefs import name2codepoint
    > name2codepoint = name2codepoint.copy()
    > name2codepoint['apos'] = ord("'")
    >
    > EntityPattern = re.compile(
    >     r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')
    >
    > def decodeEntities(s, encoding='utf-8'):
    >     def unescape(match):
    >         code = match.group(1)
    >         if code:
    >             return unichr(int(code, 10))
    >         else:
    >             code = match.group(2)
    >             if code:
    >                 return unichr(int(code, 16))
    >             else:
    >                 return unichr(name2codepoint[match.group(3)])
    >     return EntityPattern.sub(unescape, s)
    >
    > >>> escape(
    > ...     decodeEntities("This & this &amp; that&nbsp;&eacute;")).encode(
    > ...     'ascii', 'xmlcharrefreplace')
    > 'This &amp; this &amp; that&#160;&#233;'
    >
    >
    > P.S. apos is handled specially as it isn't technically a
    > valid html entity (and Python doesn't include it in its entity
    > list), but it is an xml entity and recognised by many browsers so some
    > people might use it in html.
    >

    Hey, I found this site:
    http://www.htmlhelp.com/reference/html40/entities/symbols.html

    I hope that's what you mean.

    /Scripter47
     
    Andreas Lysdal, Dec 26, 2006
    #5
  6. John Nagle

    John Nagle Guest

    Felipe Almeida Lessa wrote:
    > On 26 Dec 2006 04:22:38 -0800, placid <> wrote:
    >
    >> So do you want to remove "&" or replace them with "&amp;" ? If you want
    >> to replace it try the following;

    >
    >
    > I think he wants to replace them, but just the invalid ones. I.e.,
    >
    > This & this &amp; that
    >
    > would become
    >
    > This &amp; this &amp; that
    >
    >
    > No, i don't know how to do this efficiently. =/...
    > I think some kind of regex could do it.


    Yes, and the appropriate one is:

    krefindamp = re.compile(r'&(?!(\w|#)+;)')
    ...
    xmlsection = re.sub(krefindamp,'&amp;',xmlsection)

    This will replace an '&' with '&amp;' if the '&' isn't
    immediately followed by some combination of letters, numbers,
    and '#' ending with a ';'. Admittedly this would let something
    like '&xx#2;', which isn't a legal entity, through unmodified.

    There's still a potential problem with unknown entities in the output XML, but
    at least they're recognized as entities.
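
The '&xx#2;' loophole can be closed by spelling out the three legal shapes of a reference in the lookahead. The following Python 3 sketch is a stricter variant, not the pattern from this post; the names `LOOSE_AMP` and `escape_loose_ampersands` are illustrative, and it still makes no attempt to check that a named entity is one XML actually defines.

```python
import re

# An "&" is left alone only if it starts a decimal reference (&#123;),
# a hexadecimal reference (&#x7B;) or a name-shaped reference (&nbsp;).
LOOSE_AMP = re.compile(
    r"&(?!(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)"
)

def escape_loose_ampersands(text):
    """Replace every '&' that does not begin a well-formed reference."""
    return LOOSE_AMP.sub("&amp;", text)

print(escape_loose_ampersands("This & this &amp; that"))
print(escape_loose_ampersands("&xx#2; and &#123; and &#x7B; and AT&T"))
```

The first call leaves the existing "&amp;" untouched and escapes only the bare ampersand; in the second, '&xx#2;' and 'AT&T' are escaped while both numeric references pass through.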

    John Nagle
     
    John Nagle, Dec 26, 2006
    #6
  7. Duncan Booth

    Duncan Booth Guest

    Re: SPAM-LOW: Re: BeautifulSoup vs. loose & chars

    Andreas Lysdal <> wrote:

    >> P.S. apos is handled specially as it isn't technically a
    >> valid html entity (and Python doesn't include it in its entity
    >> list), but it is an xml entity and recognised by many browsers so some
    >> people might use it in html.
    >>

    > Hey i fund this site:
    > http://www.htmlhelp.com/reference/html40/entities/symbols.html
    >
    > I hope that its what you mean.


    Try
    http://www.w3.org/TR/html4/sgml/entities.html#entities
    for a more complete list.
     
    Duncan Booth, Dec 26, 2006
    #7
  8. Frederic Rentsch

    John Nagle wrote:
    > Felipe Almeida Lessa wrote:
    >
    >> On 26 Dec 2006 04:22:38 -0800, placid <> wrote:
    >>
    >>
    >>> So do you want to remove "&" or replace them with "&amp;" ? If you want
    >>> to replace it try the following;
    >>>

    >> I think he wants to replace them, but just the invalid ones. I.e.,
    >>
    >> This & this &amp; that
    >>
    >> would become
    >>
    >> This &amp; this &amp; that
    >>
    >>
    >> No, i don't know how to do this efficiently. =/...
    >> I think some kind of regex could do it.
    >>

    >
    > Yes, and the appropriate one is:
    >
    > krefindamp = re.compile(r'&(?!(\w|#)+;)')
    > ...
    > xmlsection = re.sub(krefindamp,'&amp;',xmlsection)
    >
    > This will replace an '&' with '&amp' if the '&' isn't
    > immediately followed by some combination of letters, numbers,
    > and '#' ending with a ';' Admittedly this would let something
    > like '&xx#2;', which isn't a legal entity, through unmodified.
    >
    > There's still a potential problem with unknown entities in the output XML, but
    > at least they're recognized as entities.
    >
    > John Nagle
    >
    >
    >


    Here's another idea:

    >>> s = '''<html> htm tag should not translate
    ... > & should be &amp;
    ... > &xx#2; isn't a legal entity and should translate
    ... > &#123; is a legal entity and should not translate
    ... </html>'''
    >>> import SE # http://cheeseshop.python.org/pypi/SE/2.3
    >>> HTM_Escapes = SE.SE (definitions) # See definitions below the dotted line
    >>> print HTM_Escapes (s)
    <html> htm tag should not translate
    &gt; &amp; should be &amp;
    &gt; &amp;xx#2; isn&quot;t a legal entity and should translate
    &gt; &#123; is a legal entity and should not translate
    </html>

    Regards

    Frederic


    ------------------------------------------------------------------------------


    definitions = '''

    # Do # Don't do
    # " =&nbsp;" &nbsp;== # 32 20
    (34)=&dquot; &dquot;== # 34 22
    &=&amp; &amp;== # 38 26
    '=&quot; &quot;== # 39 27
    <=&lt; &lt;== # 60 3c
    >=&gt; &gt;== # 62 3e

    ©=&copy; &copy;== # 169 a9
    ·=&middot; &middot;== # 183 b7
    »=&raquo; &raquo;== # 187 bb
    À=&Agrave; &Agrave;== # 192 c0
    Á=&Aacute; &Aacute;== # 193 c1
    Â=&Acirc; &Acirc;== # 194 c2
    Ã=&Atilde; &Atilde;== # 195 c3
    Ä=&Auml; &Auml;== # 196 c4
    Å=&Aring; &Aring;== # 197 c5
    Æ=&AElig; &AElig;== # 198 c6
    Ç=&Ccedil; &Ccedil;== # 199 c7
    È=&Egrave; &Egrave;== # 200 c8
    É=&Eacute; &Eacute;== # 201 c9
    Ê=&Ecirc; &Ecirc;== # 202 ca
    Ë=&Euml; &Euml;== # 203 cb
    Ì=&Igrave; &Igrave;== # 204 cc
    Í=&Iacute; &Iacute;== # 205 cd
    Î=&Icirc; &Icirc;== # 206 ce
    Ï=&Iuml; &Iuml;== # 207 cf
    Ð=&ETH; &ETH;== # 208 d0
    Ñ=&Ntilde; &Ntilde;== # 209 d1
    Ò=&Ograve; &Ograve;== # 210 d2
    Ó=&Oacute; &Oacute;== # 211 d3
    Ô=&Ocirc; &Ocirc;== # 212 d4
    Õ=&Otilde; &Otilde;== # 213 d5
    Ö=&Ouml; &Ouml;== # 214 d6
    Ø=&Oslash; &Oslash;== # 216 d8
    Ù=&Ugrave; &Ugrave;== # 217 d9
    Ú=&Uacute; &Uacute;== # 218 da
    Û=&Ucirc; &Ucirc;== # 219 db
    Ü=&Uuml; &Uuml;== # 220 dc
    Ý=&Yacute; &Yacute;== # 221 dd
    Þ=&THORN; &THORN;== # 222 de
    ß=&szlig; &szlig;== # 223 df
    à=&agrave; &agrave;== # 224 e0
    á=&aacute; &aacute;== # 225 e1
    â=&acirc; &acirc;== # 226 e2
    ã=&atilde; &atilde;== # 227 e3
    ä=&auml; &auml;== # 228 e4
    å=&aring; &aring;== # 229 e5
    æ=&aelig; &aelig;== # 230 e6
    ç=&ccedil; &ccedil;== # 231 e7
    è=&egrave; &egrave;== # 232 e8
    é=&eacute; &eacute;== # 233 e9
    ê=&ecirc; &ecirc;== # 234 ea
    ë=&euml; &euml;== # 235 eb
    ì=&igrave; &igrave;== # 236 ec
    í=&iacute; &iacute;== # 237 ed
    î=&icirc; &icirc;== # 238 ee
    ï=&iuml; &iuml;== # 239 ef
    ð=&eth; &eth;== # 240 f0
    ñ=&ntilde; &ntilde;== # 241 f1
    ò=&ograve; &ograve;== # 242 f2
    ó=&oacute; &oacute;== # 243 f3
    ô=&ocirc; &ocirc;== # 244 f4
    õ=&otilde; &otilde;== # 245 f5
    ö=&ouml; &ouml;== # 246 f6
    ø=&oslash; &oslash;== # 248 f8
    ù=&ugrave; &ugrave;== # 249 f9
    ú=&uacute; &uacute;== # 250 fa
    û=&ucirc; &ucirc;== # 251 fb
    ü=&uuml; &uuml;== # 252 fc
    ý=&yacute; &yacute;== # 253 fd
    þ=&thorn; &thorn;== # 254 fe
    (xff)=&#255; # 255 ff
    &#== # All numeric codes
    "~<(.|\n)*?>~==" # All HTM tags '''

    If the ampersand is all you need to handle you can erase the others
    in the first column. You need to keep the second column though, except
    the last entry, because the tags don't need protection if '<' and
    '>' in the first column are gone.
    Definitions are easily edited and can be kept in text files.
    The SE constructor accepts a file name instead of a definitions string:

    >>> HTM_Escapes = SE.SE ('definition_file_name')
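
The same "translate this, protect that" idea can be approximated without SE. The sketch below is plain Python 3 with the standard `re` module and does not use or imitate the SE API; the names `TOKEN`, `BARE` and `escape_bare_specials` are made up for illustration. It handles only &, < and >, and unlike the definitions above it leaves apostrophes alone.

```python
import re

# Alternatives are tried left to right: tags and well-formed references
# match first and are kept; any other &, < or > is escaped.
TOKEN = re.compile(
    r"(<[^<>]*>"                          # something that looks like a tag
    r"|&#[0-9]+;|&#[xX][0-9a-fA-F]+;"     # numeric character references
    r"|&[A-Za-z][A-Za-z0-9]*;)"           # named entity references
    r"|([&<>])"                           # a bare special character
)
BARE = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}

def escape_bare_specials(text):
    def fix(match):
        if match.group(1) is not None:
            return match.group(1)         # protected: emit unchanged
        return BARE[match.group(2)]       # bare character: escape it
    return TOKEN.sub(fix, text)

s = ("<html> htm tag should not translate\n"
     "> & should be &amp;\n"
     "> &xx#2; isn't a legal entity and should translate\n"
     "> &#123; is a legal entity and should not translate\n"
     "</html>")
print(escape_bare_specials(s))
```

One caveat of this sketch: anything between a "<" and the next ">" is treated as a tag, so text such as "a < b > c" would be left untouched.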



    -------------------------------------------------------------------
     
    Frederic Rentsch, Dec 26, 2006
    #8
