ElementTree, XML and Unicode -- C0 Controls

  • Thread starter Sébastien Boisgérault

Sébastien Boisgérault

Hi all,

The Unicode code points in the 0000-001F range --
except tab (0009), newline (000A) and carriage return
(000D) -- are not legal XML 1.0 characters.
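
For reference, here is a minimal sketch of how such code points could be detected in Python. The helper name find_illegal_c0 is made up for illustration and is not part of ElementTree. It only covers the C0 range discussed here, not the other code points that XML 1.0 excludes (surrogates, FFFE, FFFF).

```python
import re

# C0 controls that XML 1.0 forbids: everything below U+0020 except
# tab (U+0009), newline (U+000A) and carriage return (U+000D).
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_c0(text):
    """Return the offsets of the C0 controls in text that XML 1.0 rejects.

    Hypothetical helper, for illustration only.
    """
    return [match.start() for match in _ILLEGAL_C0.finditer(text)]


if __name__ == "__main__":
    # Tab is legal, NUL and U+001F are not.
    print(find_illegal_c0("a\x00b\tc\x1f"))
```

An empty list means the string contains none of the forbidden C0 controls.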

Attempts to serialize and then deserialize such strings
with ElementTree will fail at the parsing step:
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 12
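
The round trip can be sketched as follows. This is not the code elided above, only an illustration under assumptions: it uses the standard-library xml.etree.ElementTree module, the function name round_trip_fails is hypothetical, and on current Python versions the parser error surfaces as ET.ParseError rather than the ExpatError shown in the traceback.

```python
import xml.etree.ElementTree as ET


def round_trip_fails(text):
    """Serialize an element holding text, then report whether parsing it fails.

    Hypothetical helper, for illustration only.
    """
    elem = ET.Element("root")
    elem.text = text
    data = ET.tostring(elem)  # no well-formedness check happens here
    try:
        ET.fromstring(data)
    except ET.ParseError:
        # The parser is the first component to reject the control character.
        return True
    return False


if __name__ == "__main__":
    print(round_trip_fails("a\x01b"))  # U+0001 is not a legal XML 1.0 character
    print(round_trip_fails("a\tb"))    # tab is legal
```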

Good! But I was expecting a failure *earlier*, in
the "tostring" function -- I basically assumed that
ElementTree would refuse to generate an XML
fragment that is not well-formed.

Could anyone comment on the rationale behind
the current behavior? Is it a performance issue,
the search for invalid Unicode code points being
too expensive?

Cheers,

SB
 

Fredrik Lundh

Sébastien Boisgérault said:
Could anyone comment on the rationale behind
the current behavior? Is it a performance issue,
the search for invalid Unicode code points being
too expensive?

the default serializer doesn't do any validation or well-formedness checks at all; it assumes
that you know what you're doing.
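
Since the serializer leaves the checking to the caller, one option is to validate the tree before serializing it. The following is a minimal sketch under assumptions, not ElementTree's own behaviour: checked_tostring is a hypothetical wrapper, and it only looks for the forbidden C0 controls in text, tail and attribute values.

```python
import re
import xml.etree.ElementTree as ET

# C0 controls other than tab, newline and carriage return.
_ILLEGAL_C0 = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def checked_tostring(elem):
    """Raise ValueError if the tree holds a forbidden C0 control, else serialize.

    Hypothetical wrapper around ET.tostring, for illustration only.
    """
    for node in elem.iter():
        candidates = [node.text, node.tail, *node.attrib.values()]
        for value in candidates:
            if isinstance(value, str) and _ILLEGAL_C0.search(value):
                raise ValueError(
                    "illegal XML 1.0 character in %r: %r" % (node.tag, value)
                )
    # All checked strings are clean; delegate to the stock serializer.
    return ET.tostring(elem)


if __name__ == "__main__":
    root = ET.Element("root")
    root.text = "a\x01b"
    try:
        checked_tostring(root)
    except ValueError as exc:
        print("refused:", exc)
```

With this wrapper the failure shows up at serialization time, at the cost of one extra pass over every string in the tree.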

</F>
 

Sébastien Boisgérault

Fredrik Lundh said:
the default serializer doesn't do any validation or well-formedness checks at all; it assumes
that you know what you're doing.

</F>

Fair enough!

Thanks Fredrik.

SB
 
