XML/HTML Encoding problem

Dale Strickland-Clark · May 22, 2006

A colleague has asked me this and I don't know the answer. Can anyone here
help with this? Thanks in advance.

Here is his email:

I am trying to parse an HTML document using the xml.dom.minidom parser and
then outputting a valid HTML document, all using the ISO-8859-1 charset.
For example:

My input:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Desired output:
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>

Note that it doesn't matter if the '<?xml version="1.0"
encoding="ISO-8859-1"?>' header gets stripped. What does matter is that the
input document has the 'ISO-8859-1' charset and is an ANSI encoded file.

The problem I get is that when I run, for example:

from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
â‚¬
</body>
</html>

So it encodes the entity reference to â‚¬ (Euro sign). I need it to remain as
&#8364; so that the resulting HTML can render properly in a browser. Is
there a way to make the parser not convert the entity references? Or is
there a convenient post processing function that will do the conversion?
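
A note on the symptom: the thread does not confirm where the garbled characters come from, but one plausible reading is that the euro sign was written out as UTF-8 and then viewed as a Windows "ANSI" (cp1252) file. The sketch below only illustrates that assumption; the variable names are made up for the example.

```python
# The euro sign is U+20AC, decimal 8364, which is what &#8364; refers to.
euro = "\u20ac"

# Encoded as UTF-8 it becomes three bytes: E2 82 AC.
utf8_bytes = euro.encode("utf-8")

# Reading those bytes back with the wrong codec gives three separate characters.
misread = utf8_bytes.decode("cp1252")
print(ascii(misread))

# ISO-8859-1 has no euro sign at all, so encoding it directly fails.
try:
    euro.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

If this reading is right, the character can only survive in an ISO-8859-1 document as a numeric character reference.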

Sybren Stuvel · May 22, 2006

Dale Strickland-Clark enlightened us with:

So it encodes the entity reference to â‚¬ (Euro sign). I need it to
remain as &#8364; so that the resulting HTML can render properly in
a browser.

If you want proper display, why not use UTF-8?

Sybren

Duncan Booth · May 22, 2006

Dale said:
from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
â‚¬
</body>
</html>

So it encodes the entity reference to â‚¬ (Euro sign). I need it to
remain as &#8364; so that the resulting HTML can render properly in a
browser. Is there a way to make the parser not convert the entity
references? Or is there a convenient post processing function that
will do the conversion?

First up, when I repeat what you did I don't get the same output. toxml()
without an encoding argument produces a unicode string, with no encoding
attribute in the <?xml ...?> declaration.
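
To make that concrete, here is a small sketch assuming Python 3, where the "unicode string" is a plain str and an explicit encoding argument yields bytes. The one-element document is a stand-in for the real input.

```python
from xml.dom.minidom import parseString

# A cut-down document containing the same character reference.
doc = parseString(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><body>&#8364;</body>'
)

# No encoding argument: the result is text, and the declaration
# carries no encoding attribute.
as_text = doc.toxml()

# Explicit encoding: the result is bytes, and the declaration names it.
as_bytes = doc.toxml(encoding="utf-8")

print(type(as_text).__name__, type(as_bytes).__name__)
print(ascii(as_text))
print(as_bytes)
```

In both cases the parser has already turned the reference into the actual euro character; only the serialization step differs.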

toxml() only takes a single encoding argument, so unfortunately there isn't
any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;

<?xml version="1.0" ?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
&#8364;

You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.
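
The approach described above can be sketched roughly as follows, assuming Python 3. The helper name to_ascii_xml is invented for this example, and the input is the document from the original question.

```python
from xml.dom.minidom import parseString

SOURCE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>
&#8364;
</body>
</html>"""


def to_ascii_xml(source):
    """Parse the document and serialize it as pure ASCII bytes.

    Any character that ASCII cannot represent is written as a numeric
    character reference by the 'xmlcharrefreplace' error handler.
    """
    text = parseString(source).toxml()  # text, no encoding attribute
    return text.encode("ascii", "xmlcharrefreplace")


output = to_ascii_xml(SOURCE)
print(output.decode("ascii"))
```

Because every byte of the result is ASCII, it is also valid ISO-8859-1, so it can be saved and served under the charset named in the meta tag.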

Dale Strickland-Clark · May 23, 2006

Thanks, Duncan. That did the trick.

If you're EuroPythoning, I'll buy you a drink.

Cheers.
