Please help!! SAXParseException: not well-formed (invalid token)

jvictor118

I've been using the xml.sax.handler module to do event-driven parsing
of XML files in this Python application I'm working on. However, I
keep getting really pesky invalid token exceptions. Initially, I was
only getting them on control characters, and a little
"sed -e 's/[^[:print:]]/ /g' $1;" took care of that just fine. But
recently, I've been getting these invalid token exceptions with
n-tildes (like the ñ in España), smart/fancy/curly quotes and other
seemingly harmless characters. Specifying encoding="utf-8" in the XML
header hasn't helped matters.

Any ideas? As a last resort, I'd be willing to scrub invalid
characters.... it just seems strange that curly quotes and n-tildes
wouldn't be valid XML! Is that really the case?
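
For the scrubbing fallback, here is a minimal sketch in Python rather than sed. It assumes the text has already been decoded to a Unicode string, and it removes only the characters that the XML 1.0 "Char" production forbids, so ñ and curly quotes pass through untouched. The name scrub_xml_text is made up for illustration.

```python
import re

# Everything outside the XML 1.0 "Char" production: most C0 control
# characters, surrogates, and U+FFFE/U+FFFF. Tab, LF and CR stay legal.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub_xml_text(text, replacement=" "):
    """Replace characters XML 1.0 forbids (hypothetical helper name)."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


# NUL and vertical tab are replaced; the n-tilde and curly quotes survive.
cleaned = scrub_xml_text("Espa\u00f1a\x00\x0b \u201cok\u201d")
```

Unlike the [^[:print:]] class, this does not depend on the shell locale to decide what counts as printable.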

TIA!

Jason
 
kyosohma

Are you making sure to encode the strings you pass into the parser in
UTF-8 or UTF-16? This article was illuminating in that respect and may
be helpful in diagnosing your problem:

http://www.xml.com/pub/a/2002/11/13/py-xml.html?page=2
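
To illustrate the idea, here is a rough sketch (not taken from the article) written for current Python 3. The text is encoded to UTF-8 bytes explicitly before it is handed to xml.sax.parseString, so the bytes agree with the encoding="utf-8" declaration. TextCollector and parse_text are made-up names.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects character data so the round trip can be inspected."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(document):
    """Encode a str document to UTF-8 bytes and parse it (hypothetical helper)."""
    handler = TextCollector()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return "".join(handler.chunks)


doc = ('<?xml version="1.0" encoding="utf-8"?>'
       '<place>Espa\u00f1a \u201cquoted\u201d</place>')
text = parse_text(doc)
```

If the parser accepts this but rejects your real file, the bytes in that file are probably not what the header claims.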

Mike
 
Diez B. Roggisch

> it just seems strange that curly quotes and n-tildes
> wouldn't be valid XML! Is that really the case?

It's not the case, unless the encoding is wrong. In that case the whole
document isn't an XML document at all.

Just putting an encoding declaration in the header that doesn't match the
encoding actually used won't fix that.

Read up on what encodings are, and ensure your XML generation respects them.
Then reading these files will cause no problems.
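
A small sketch of that mismatch, assuming Python 3 and the expat-based parser behind xml.sax. The same document is serialized as UTF-8 and as Latin-1. Only the Latin-1 bytes under a utf-8 declaration are rejected, and changing the declaration to match the bytes makes them parse again. is_well_formed is a made-up helper name.

```python
import xml.sax


def is_well_formed(data):
    """Return True if xml.sax can parse the given bytes (hypothetical helper)."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


body = '<?xml version="1.0" encoding="utf-8"?><place>Espa\u00f1a</place>'

# Bytes really are UTF-8, matching the declaration.
matching = body.encode("utf-8")

# Bytes are Latin-1 while the declaration still claims utf-8.
mismatched = body.encode("latin-1")

# Bytes are Latin-1 and the declaration says so.
redeclared = body.replace("utf-8", "iso-8859-1").encode("latin-1")

results = (is_well_formed(matching),
           is_well_formed(mismatched),
           is_well_formed(redeclared))
```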

Diez
 
jvictor118

I checked the file format (of the file containing the n-tilde - ñ) and
it is indeed UTF-8! I'm baffled! Any ideas?
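
One way to double-check that claim, as a rough sketch: decode the raw bytes strictly and report where decoding first fails. The function name first_invalid_utf8_offset is made up, and the sample bytes stand in for the real file contents, which would be read in binary mode.

```python
def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# "España" as UTF-8 bytes, then as Latin-1 bytes (0xF1 is the n-tilde).
good = first_invalid_utf8_offset("Espa\u00f1a".encode("utf-8"))
bad = first_invalid_utf8_offset(b"Espa\xf1a")
```

A file that a tool reports as UTF-8 can still contain a few stray bytes in another encoding, and this would point at the first one.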

Thanks,
Jason

 
Diez B. Roggisch

> I checked the file format (of the file containing the n-tilde - ñ) and
> it is indeed UTF-8! I'm baffled! Any ideas?

Without you showing us your actual code and data - no. Because it works
for me and a lot of other people.


Diez
 
