XML/HTML Encoding problem

  • Thread starter Dale Strickland-Clark

Dale Strickland-Clark

A colleague has asked me this and I don't know the answer. Can anyone here
help with this? Thanks in advance.

Here is his email:

I am trying to parse an HTML document using the xml.dom.minidom parser and
then outputting a valid HTML document, all using the ISO-8859-1 charset.
For example:

My input (the <body> contains a numeric entity reference for the Euro sign):
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>

</body>
</html>

Desired output (the entity reference in <body> is preserved):
<?xml version="1.0" encoding="ISO-8859-1"?>
<html>
<head>
<title></title>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type" />
</head>
<body>

</body>
</html>

Note that it doesn't matter if the '<?xml version="1.0"
encoding="ISO-8859-1"?>' header gets stripped.  What does matter is that the
input document has the 'ISO-8859-1' charset and is an ANSI encoded file.

The problem I get is that when I run, for example:

from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>

So it converts the entity reference into a literal € (Euro sign).  I need it
to remain as an entity reference so that the resulting HTML can render
properly in a browser.  Is there a way to make the parser not convert the
entity references?  Or is there a convenient post-processing function that
will do the conversion?
 

Sybren Stuvel

Dale Strickland-Clark enlightened us with:
So it converts the entity reference into a literal € (Euro sign).  I need it
to remain as an entity reference so that the resulting HTML can render
properly in a browser.

If you want proper display, why not use UTF-8?

Sybren
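A minimal sketch of the UTF-8 route suggested here, written for Python 3. The SAMPLE document and the variable names are assumptions made for illustration; only parseString() and toxml() come from the thread. Note that the document's own meta tag would still claim iso-8859-1, so it would need to be changed to match.

```python
from xml.dom.minidom import parseString

# Assumed sample document: the body holds a character reference
# for the Euro sign (U+20AC).
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<html><body>&#8364;</body></html>'
)

doc = parseString(SAMPLE)

# Passing an encoding makes toxml() return bytes in that encoding
# and name the encoding in the XML declaration.
utf8_bytes = doc.toxml(encoding="utf-8")

# In UTF-8 the Euro sign is stored as the three bytes E2 82 AC.
euro_in_utf8 = "\u20ac".encode("utf-8")
```

UTF-8 can represent the Euro sign directly, so no entity reference is needed in the output.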
 

Duncan Booth

Dale said:
from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()

The output is:

<?xml version="1.0" encoding="iso-8859-1"?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
€
</body>
</html>

So it converts the entity reference into a literal € (Euro sign).  I need it
to remain as an entity reference so that the resulting HTML can render
properly in a browser.  Is there a way to make the parser not convert the
entity references?  Or is there a convenient post-processing function that
will do the conversion?

First up, when I repeat what you did I don't get the same output: toxml()
without an encoding argument produces a unicode string, with no encoding
attribute in the <?xml ...?> declaration.
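The behaviour described above can be checked with a short sketch. This is Python 3, where the "unicode string" is simply str; SAMPLE and the variable names are assumptions for illustration, not part of the original post.

```python
from xml.dom.minidom import parseString

# Assumed sample document with a character reference for the Euro sign.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<html><body>&#8364;</body></html>'
)

doc = parseString(SAMPLE)

# The parser has already resolved the reference: the text node
# holds the single character U+20AC, not the reference itself.
body_text = doc.getElementsByTagName("body")[0].firstChild.data

# With no encoding argument, toxml() returns text, and the XML
# declaration it writes carries no encoding attribute.
as_text = doc.toxml()
```

Because the reference is resolved at parse time, nothing in the tree remembers that the input used an entity reference. Any escaping has to be reapplied when the output is encoded.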

toxml() only takes a single encoding argument, so unfortunately there isn't
any way to tell it what to do for unicode characters which are not
supported in the encoding you are using. However, if you then encode the
unicode output to ascii with entity escapes, I think you should be alright
(unless I've missed something):
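from xml.dom.minidom import parseString
output = parseString(strHTML).toxml()
# Characters outside ASCII become numeric character references.
ascii_output = output.encode('ascii', 'xmlcharrefreplace')

With the input above, ascii_output is:

<?xml version="1.0" ?>
<html>
<head>
<title/>
<meta content="text/html; charset=iso-8859-1" http-equiv="Content-Type"/>
</head>
<body>
&#8364;
</body>
</html>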

You lose the encoding at the top of the output, but since the output is
entirely ascii I don't think that matters.
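The post-processing step described above can be sketched as a small helper for Python 3. The function name serialize_with_charrefs, its charset parameter, and the sample document are hypothetical and only for illustration; the "xmlcharrefreplace" error handler is the standard codec option that writes unsupported characters as numeric character references.

```python
from xml.dom.minidom import parseString

# Assumed stand-in for strHTML: one character that ISO-8859-1 can
# represent (e-acute, U+00E9) and one it cannot (Euro sign, U+20AC).
STR_HTML = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<html><body>&#233; &#8364;</body></html>'
)


def serialize_with_charrefs(markup: bytes, charset: str = "ascii") -> bytes:
    """Parse markup, serialise it, and escape what charset cannot hold.

    Hypothetical helper: any character that cannot be encoded in
    charset is written as a decimal character reference.
    """
    text = parseString(markup).toxml()
    return text.encode(charset, "xmlcharrefreplace")


# Pure-ASCII output: both characters come back as references.
ascii_out = serialize_with_charrefs(STR_HTML)

# ISO-8859-1 output: e-acute is written as the raw byte 0xE9,
# while the Euro sign is still escaped.
latin1_out = serialize_with_charrefs(STR_HTML, "iso-8859-1")
```

Encoding to ASCII is the most conservative choice, since the result is valid in any ASCII-compatible charset. Encoding to iso-8859-1 escapes only the characters that charset lacks.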
 

Dale Strickland-Clark

Thanks, Duncan. That did the trick.

If you're EuroPythoning, I'll buy you a drink.

Cheers.
 
