SAXParseException: not well-formed (invalid token)

Discussion in 'Python' started by Pablo Rey, Aug 30, 2007.

  1. Pablo Rey

    Pablo Rey Guest

    Dear Colleagues,

    I am getting the following error with an XML page:

    > File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69, in getItems
    > d = minidom.parseString(xml.read())
    > File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 967, in parseString
    > return _doparse(pulldom.parseString, args, kwargs)
    > File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 954, in _doparse
    > toktype, rootNode = events.getEvent()
    > File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
    > self.parser.feed(buf)
    > File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 208, in feed
    > self._err_handler.fatalError(exc)
    > File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    > raise exception
    > xml.sax._exceptions.SAXParseException: <unknown>:553:48: not well-formed (invalid token)



    > def getItems(page):
    >     opener = urllib.URLopener(key_file=HOSTKEY, cert_file=HOSTCERT)
    >     try:
    >         xml = opener.open(page)
    >     except:
    >         return []
    >
    >     d = minidom.parseString(xml.read())
    >     items = d.getElementsByTagName('item')
    >     data = []
    >     for i in items:
    >         data.append(getText(i.childNodes))
    >
    >     return data


    The page is
    https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
    and the line with the invalid character is (the invalid character is the
    final é of Université):

    <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
    Louvain/CN=Roberfroid</item>


    I have tried several options but I have not been able to avoid this problem.
    Any ideas?

    I am just starting to work with Python, so I am sorry if this problem is trivial.

    Thanks for your time.
    Pablo Rey
     
    Pablo Rey, Aug 30, 2007
    #1

  2. On Thu, 30 Aug 2007 13:46:47 +0200, Pablo Rey wrote:

    > The page is
    > https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
    > and the line with the invalid character is (the invalid character is the
    > final é of Université):


    The URL doesn't work for me in a browser. (Could not connect…)

    Maybe you can download that XML file and use `xmllint` to check if it is
    well formed XML!?

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Aug 30, 2007
    #2

  3. Pablo Rey wrote:
    > I am getting the following error with an XML page:
    >
    >> File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69,
    >> in getItems
    >> d = minidom.parseString(xml.read())
    >> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
    >> line 967, in parseString
    >> return _doparse(pulldom.parseString, args, kwargs)
    >> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
    >> line 954, in _doparse
    >> toktype, rootNode = events.getEvent()
    >> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py",
    >> line 265, in getEvent
    >> self.parser.feed(buf)
    >> File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py",
    >> line 208, in feed
    >> self._err_handler.fatalError(exc)
    >> File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py",
    >> line 38, in fatalError
    >> raise exception
    >> xml.sax._exceptions.SAXParseException: <unknown>:553:48: not
    >> well-formed (invalid token)

    >
    >
    >> def getItems(page):
    >> opener =urllib.URLopener(key_file=HOSTKEY,cert_file=HOSTCERT) ;
    >> try:
    >> xml = opener.open(page)
    >> except:
    >> return []
    >>
    >> d = minidom.parseString(xml.read())
    >> items = d.getElementsByTagName('item')
    >> data = []
    >> for i in items:
    >> data.append(getText(i.childNodes))
    >>
    >> return data

    >
    > The page is
    > https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
    > and the line with the invalid character is (the invalid character is the
    > final é of Université):
    >
    > <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
    > Louvain/CN=Roberfroid</item>
    >
    >
    > I have tried several options but I have not been able to avoid this
    > problem. Any ideas?


    Looks like the page is not well-formed XML (i.e. not XML at all). If it
    doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the
    input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before
    passing it to the SAX parser.

    Alternatively, tell the page authors to fix their page.

    Stefan
     
    Stefan Behnel, Aug 30, 2007
    #3
  4. Pablo Rey

    Pablo Rey Guest

    Hi Stefan,

    The xml has specified an encoding (<?xml version="1.0" encoding="UTF-8"
    ?>).

    About the possibility you mention of recoding the input, could you
    let me know how to do it? I am sorry, I am just starting with Python and I
    don't know how to do it.

    Thanks for your help.
    Pablo



    On 30/08/2007 14:37, Stefan Behnel wrote:
    > Pablo Rey wrote:
    >> I am getting the following error with an XML page:
    >>
    >>> File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69,
    >>> in getItems
    >>> d = minidom.parseString(xml.read())
    >>> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
    >>> line 967, in parseString
    >>> return _doparse(pulldom.parseString, args, kwargs)
    >>> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py",
    >>> line 954, in _doparse
    >>> toktype, rootNode = events.getEvent()
    >>> File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py",
    >>> line 265, in getEvent
    >>> self.parser.feed(buf)
    >>> File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py",
    >>> line 208, in feed
    >>> self._err_handler.fatalError(exc)
    >>> File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py",
    >>> line 38, in fatalError
    >>> raise exception
    >>> xml.sax._exceptions.SAXParseException: <unknown>:553:48: not
    >>> well-formed (invalid token)

    >>
    >>> def getItems(page):
    >>> opener =urllib.URLopener(key_file=HOSTKEY,cert_file=HOSTCERT) ;
    >>> try:
    >>> xml = opener.open(page)
    >>> except:
    >>> return []
    >>>
    >>> d = minidom.parseString(xml.read())
    >>> items = d.getElementsByTagName('item')
    >>> data = []
    >>> for i in items:
    >>> data.append(getText(i.childNodes))
    >>>
    >>> return data

    >> The page is
    >> https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
    >> and the line with the invalid character is (the invalid character is the
    >> final é of Université):
    >>
    >> <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
    >> Louvain/CN=Roberfroid</item>
    >>
    >>
    >> I have tried several options but I have not been able to avoid this
    >> problem. Any ideas?

    >
    > Looks like the page is not well-formed XML (i.e. not XML at all). If it
    > doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the
    > input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before
    > passing it to the SAX parser.
    >
    > Alternatively, tell the page authors to fix their page.
    >
    > Stefan
     
    Pablo Rey, Aug 30, 2007
    #4
  5. Pablo Rey

    Pablo Rey Guest

    On 30/08/2007 14:35, Marc 'BlackJack' Rintsch wrote:
    > On Thu, 30 Aug 2007 13:46:47 +0200, Pablo Rey wrote:
    >
    >> The page is
    >> https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
    >> and the line with the invalid character is (the invalid character is the
    >> final é of Université):

    >
    > The URL doesn't work for me in a browser. (Could not connect…)


    Hi Marc,

    To access the page you need an X.509 certificate signed by a CA
    recognised by the project.

    I have stored the XML file and you can find it attached.

    >
    > Maybe you can download that XML file and use `xmllint` to check if it is
    > well formed XML!?


    This is the output of the xmllint command:

    [prey@www3 voms2users]$ xmllint cms.xml
    cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
    <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
    Louvain/CN=Roberfroi
    ^
    cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
    <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
    Louvain/CN=Roberfroi

    Thanks for your time.
    Pablo


    >
    > Ciao,
    > Marc 'BlackJack' Rintsch
     
    Pablo Rey, Aug 30, 2007
    #5
  6. On Thu, 30 Aug 2007 15:31:58 +0200, Pablo Rey wrote:

    > On 30/08/2007 14:35, Marc 'BlackJack' Rintsch wrote:
    >
    >> Maybe you can download that XML file and use `xmllint` to check if it
    >> is well formed XML!?

    >
    > This is the output of the xmllint command:
    >
    > [prey@www3 voms2users]$ xmllint cms.xml
    > cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
    > <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
    > Louvain/CN=Roberfroi
    > ^
    > cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
    > <item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
    > Louvain/CN=Roberfroi
    >
    > […]
    >
    > <?xml version="1.0" encoding="UTF-8" ?>


    So the XML says it is encoded in UTF-8 but it contains at least one
    character that seems to be encoded in ISO-8859-1.

    Tell the authors/creators of that document that their XML is broken.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Aug 30, 2007
    #6
  7. On Thu, 2007-08-30 at 15:20 +0200, Pablo Rey wrote:
    > Hi Stefan,
    >
    > The xml has specified an encoding (<?xml version="1.0" encoding="UTF-8"
    > ?>).


    It's possible that the encoding specification is incorrect:

    >>> u = u"\N{LATIN SMALL LETTER E WITH ACUTE}"
    >>> print repr(u.encode("latin-1"))
    '\xe9'
    >>> print repr(u.encode("utf-8"))
    '\xc3\xa9'

    If your input string contains the byte 0xe9 where your accented e is,
    the file is actually latin-1 encoded. If it contains the byte sequence
    0xc3,0xa9 it is UTF-8 encoded.
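
The check described above can be automated. What follows is only a minimal sketch, written for Python 3 rather than the Python 2.2 used in the thread; the function name `first_invalid_utf8` and the sample bytes are made up for illustration. It tries to decode the raw bytes as UTF-8 and, if that fails, reports where the first offending byte sits, much like the `xmllint` output quoted earlier.

```python
def first_invalid_utf8(data: bytes):
    """Return (line, column, byte) for the first byte that is not valid UTF-8.

    Lines are counted from 1 and columns from 0, both measured in bytes.
    Returns None when the whole input decodes cleanly as UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset at which decoding failed.
        line = data.count(b"\n", 0, err.start) + 1
        line_start = data.rfind(b"\n", 0, err.start) + 1
        return line, err.start - line_start, data[err.start]
    return None


if __name__ == "__main__":
    # Hypothetical sample: a lone 0xE9 byte, i.e. a latin-1 encoded e-acute.
    sample = b"<list>\n<item>Univesit\xe9 Catholique</item>\n</list>"
    print(first_invalid_utf8(sample))
```

A result of `None` does not prove the author intended UTF-8, but a non-`None` result does show that the `encoding="UTF-8"` declaration cannot be right for that input.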

    If the string is encoded in latin-1, you can transcode it to utf-8 like
    this:

    contents = contents.decode("latin-1").encode("utf-8")
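
Put together with the parsing step from the original question, the transcoding idea might look like the sketch below. It assumes Python 3 (the thread itself used Python 2.2), and the helper name `parse_mislabelled_xml` and the sample document are hypothetical. The bytes are only transcoded when they fail to decode as UTF-8, so a correctly encoded document passes through untouched.

```python
from xml.dom import minidom


def parse_mislabelled_xml(raw: bytes):
    """Parse XML that declares UTF-8 but may really be latin-1 encoded."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is latin-1.
        # Every byte value is valid latin-1, so this decode cannot fail.
        raw = raw.decode("latin-1").encode("utf-8")
    return minidom.parseString(raw)


if __name__ == "__main__":
    # Hypothetical sample modelled on the line quoted in the thread.
    sample = (
        b'<?xml version="1.0" encoding="UTF-8" ?>\n'
        b"<list><item>/C=BE/O=BEGRID/OU=Physique/OU=Univesit\xe9 "
        b"Catholique de Louvain/CN=Roberfroid</item></list>"
    )
    doc = parse_mislabelled_xml(sample)
    for item in doc.getElementsByTagName("item"):
        print(item.firstChild.data)
```

One caveat with this approach: if a document mixes valid UTF-8 sequences with stray latin-1 bytes, decoding the whole thing as latin-1 will garble the parts that were correct, so it is a workaround rather than a substitute for getting the source fixed.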

    HTH,

    --
    Carsten Haese
    http://informixdb.sourceforge.net
     
    Carsten Haese, Aug 30, 2007
    #7
  8. On Thu, 2007-08-30 at 15:20 +0200, Pablo Rey wrote:
    > About the possibility you mention of recoding the input, could you
    > let me know how to do it? I am sorry, I am just starting with Python and I
    > don't know how to do it.


    While I answered this question in my previous reply, I wanted to add
    that you might find the following How-To helpful in demystifying
    Unicode:

    http://www.amk.ca/python/howto/unicode

    --
    Carsten Haese
    http://informixdb.sourceforge.net
     
    Carsten Haese, Aug 30, 2007
    #8
  9. Carsten Haese wrote:

    > If your input string contains the byte 0xe9 where your accented e is,
    > the file is actually latin-1 encoded. If it contains the byte sequence
    > 0xc3,0xa9 it is UTF-8 encoded.


    It is dismaying how often I come across Web pages that claim to be
    UTF-8-encoded, but are actually Latin-1 or Dimdows-1252.
     
    Lawrence D'Oliveiro, Aug 31, 2007
    #9
