Bug in htmlentitydefs.py with Python 3.0?

André

In trying to parse html files using ElementTree running under Python
3.0a1, and using htmlentitydefs.py to add "character entities" to the
parser, I found that I needed to create a customized version of
htmlentitydefs.py to make things work properly.

The change needed was to replace (at the bottom of the file)
====
for (name, codepoint) in name2codepoint.items():
    codepoint2name[codepoint] = name
    if codepoint <= 0xff:
        entitydefs[name] = chr(codepoint)
    else:
        entitydefs[name] = '&#%d;' % codepoint
====
by
----
for (name, codepoint) in name2codepoint.items():
    codepoint2name[codepoint] = name
    entitydefs[name] = chr(codepoint)
----
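
For illustration, the two loops can be compared side by side on a tiny table. This is only a sketch: the three-entry dict and the function names build_stock and build_modified are made up here and stand in for the real module contents.

```python
# Hypothetical three-entry sample standing in for the full name2codepoint table.
sample_name2codepoint = {"amp": 0x26, "eacute": 0xE9, "Alpha": 0x0391}


def build_stock(name2codepoint):
    """Mimic the stock loop: characters up to 0xff, character references above."""
    entitydefs = {}
    for name, codepoint in name2codepoint.items():
        if codepoint <= 0xFF:
            entitydefs[name] = chr(codepoint)
        else:
            entitydefs[name] = "&#%d;" % codepoint
    return entitydefs


def build_modified(name2codepoint):
    """Mimic the modified loop: always the character itself."""
    return {name: chr(codepoint) for name, codepoint in name2codepoint.items()}


stock = build_stock(sample_name2codepoint)
modified = build_modified(sample_name2codepoint)

# Show each entity name with the value produced by either loop.
for name in sample_name2codepoint:
    print(name, repr(stock[name]), repr(modified[name]))
```

The two tables agree for code points up to 0xff and differ only above that, where the stock loop stores a numeric character reference instead of the character.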

It does work for me ... but I don't know enough about unicode to be
sure that it is a proper bug, and not a quirk due to the way I wrote
my app. So, I thought it would be appropriate to post it here so that
unicode experts could determine if it was indeed a bug - and file a
bug report/write a patch. The same code is present in Python 3.0a2 -
but I have not tested it under this new version.

André
 

Martin v. Löwis

André said:
In trying to parse html files using ElementTree running under Python
3.0a1, and using htmlentitydefs.py to add "character entities" to the
parser, I found that I needed to create a customized version of
htmlentitydefs.py to make things work properly.

Can you please state what precise problem you were seeing? The original
code looks fine to me as it stands.

André said:
It does work for me ... but I don't know enough about unicode to be
sure that it is a proper bug, and not a quirk due to the way I wrote
my app.

Without knowing what the actual problem is, it is hard to tell.

Regards,
Martin
 

André

Can you please state what precise problem you were seeing? The original
code looks fine to me as it stands.

As stated above, I was using ElementTree to parse an html file and
sending the output to a browser.
Without an additional parser, I was getting the following error
message:

Traceback (most recent call last):
  File "/Users/andre/CrunchySVN/branches/andre/src/http_serve.py", line 79, in do_POST
    self.server.get_handler(realpath)(self)
  File "src/plugins/handle_default.py", line 53, in handler
    data = path_to_filedata(request.path, root_path)
  File "src/plugins/handle_default.py", line 39, in path_to_filedata
    return cp.create_vlam_page(open(npath), path).read()
  File "/Users/andre/CrunchySVN/branches/andre/src/CrunchyPlugin.py", line 98, in create_vlam_page
    return vlam.CrunchyPage(filehandle, url, remote=remote, local=local)
  File "/Users/andre/CrunchySVN/branches/andre/src/vlam.py", line 62, in __init__
    self.tree = parse(filehandle)#XmlFile(filehandle)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 823, in parse
    tree.parse(source, parser)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 561, in parse
    parser.feed(data)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1201, in feed
    self._parser.Parse(data, 0)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1157, in _default
    self._parser.ErrorColumnNumber)
xml.parsers.expat.ExpatError: undefined entity é: line 401, column 11
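
This kind of failure can be reproduced in a few lines with current Python 3. The sketch below is not the code from the traceback: SAMPLE and try_parse are hypothetical names, and the one-line document is a stand-in for the real input file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real input file. The XHTML DOCTYPE points at
# an external DTD, which the parser does not fetch, so &eacute; is unknown.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html><body><p>Andr&eacute;</p></body></html>"
)


def try_parse(text):
    """Return the parsed root element, or the parse error message as a string."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)


result = try_parse(SAMPLE)
print(result)
```

XML itself predefines only a handful of entities (such as &amp;amp; and &amp;lt;), so HTML names like eacute have to be supplied to the parser by other means.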

So, I tried to specify an additional parser via

if python_version >= 3:
    import htmlentitydefs
    class XmlFile(ElementTree.ElementTree):
        def __init__(self, file=None):
            ElementTree.ElementTree.__init__(self)
            parser = ElementTree.XMLTreeBuilder(
                target=ElementTree.TreeBuilder(ElementTree.Element))
            ent = htmlentitydefs.entitydefs
            for entity in ent:
                if entity not in parser.entity:
                    parser.entity[entity] = ent[entity]
            self.parse(source=file, parser=parser)
            return

The output was "wrong". For example, one of the tests I used was to
process a copy of the main dict of htmlentitydefs.py inside an html
page. A few of the characters came out ok, but I got things like:

'&#913;': 0x0391, # greek capital letter alpha, U+0391

When using my modified version, I got the following (which may not be
transmitted properly by email...)
'Α': 0x0391, # greek capital letter alpha, U+0391

It does look like a Greek capital letter alpha here.

Without knowing what the actual problem is, it is hard to tell.

I hope the above is of some help.

Regards,

André
 

Martin v. Löwis

Without an additional parser, I was getting the following error
message: [...]
xml.parsers.expat.ExpatError: undefined entity é: line 401, column 11

To understand that problem better, it would have been helpful to see
what line 401, column 11 of the input file actually says. AFAICT,
it must have been something like "&é;" which would be really puzzling
to have in an XML file (usually, people restrict themselves to ASCII
for entity names).

André said:
for entity in ent:
    if entity not in parser.entity:
        parser.entity[entity] = ent[entity]

This looks fine to me.

André said:
The output was "wrong". For example, one of the tests I used was to
process a copy of the main dict of htmlentitydefs.py inside an html page. A
few of the characters came out ok, but I got things like:

'&#913;': 0x0391, # greek capital letter alpha, U+0391

Why do you think this is wrong?

André said:
When using my modified version, I got the following (which may not be
transmitted properly by email...)
'Α': 0x0391, # greek capital letter alpha, U+0391

It does look like a Greek capital letter alpha here.

Sure, however, your first version ALSO has the Greek capital letter
alpha there; it is just spelled as &#913; (which *is* a valid spelling
for that letter in XML).

André said:
I hope the above is of some help.

Thanks; I now think that htmlentitydefs is just as fine as it always
was - I don't see any problem here.

Regards,
Martin
 

André

Without an additional parser, I was getting the following error
message: [...]
xml.parsers.expat.ExpatError: undefined entity é: line 401, column 11

To understand that problem better, it would have been helpful to see
what line 401, column 11 of the input file actually says. AFAICT,
it must have been something like "&é;" which would be really puzzling
to have in an XML file (usually, people restrict themselves to ASCII
for entity names).


No, that one was &eacute; (testing with my own name that appeared in
a file).

André said:
for entity in ent:
    if entity not in parser.entity:
        parser.entity[entity] = ent[entity]

Martin said:
This looks fine to me.

André said:
The output was "wrong". For example, one of the tests I used was to
process a copy of the main dict of htmlentitydefs.py inside an html page. A
few of the characters came out ok, but I got things like:
'&#913;': 0x0391, # greek capital letter alpha, U+0391

Martin said:
Why do you think this is wrong?

Sorry, that was just cut-and-pasted from the browser (not the source);
in the source of the processed html page, it is
'&amp;#913;': 0x0391, # greek capital letter alpha, U+0391

i.e. the "&" was transformed into "&amp;" in a number of places (all
places above ascii 127 I believe).


Here are a few more lines extracted from the html file that was
processed:
=============
'Â': 0x00c2, # latin capital letter A with circumflex, U+00C2 ISOlat1
'À': 0x00c0, # latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1
'&amp;#913;': 0x0391, # greek capital letter alpha, U+0391
'Ã…': 0x00c5, # latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1
'Ã': 0x00c3, # latin capital letter A with tilde, U+00C3 ISOlat1
'Ä': 0x00c4, # latin capital letter A with diaeresis, U+00C4 ISOlat1
'&amp;#914;': 0x0392, # greek capital letter beta, U+0392
'Ç': 0x00c7, # latin capital letter C with cedilla, U+00C7 ISOlat1
'&amp;#935;': 0x03a7, # greek capital letter chi, U+03A7
'&amp;#8225;': 0x2021, # double dagger, U+2021 ISOpub
'&amp;#916;': 0x0394, # greek capital letter delta, U+0394 ISOgrk3
============


Martin said:
Sure, however, your first version ALSO has the Greek capital letter
alpha there; it is just spelled as &#913; (which *is* a valid spelling
for that letter in XML).

Agreed that it would be... However that was not how it was
transformed, see above; sorry if I was not clear about what was
happening (I should not have cut-and-pasted from the browser window).

Martin said:
Thanks; I now think that htmlentitydefs is just as fine as it always
was - I don't see any problem here.

You may well be right in that the problem may lie elsewhere. But as
making the change I mentioned "fixed" the problem at my end, I figured
this was where the problem was located - and thought I should at least
report it here.

Regards,
André
 

André

Sorry for the top posting - I found out that the problem I encountered
was not something new in Python 3.0.

Here's a test program:
============
import xml.etree.ElementTree
ElementTree = xml.etree.ElementTree
import htmlentitydefs
class XmlParser(ElementTree.ElementTree):
    def __init__(self, file=None):
        ElementTree.ElementTree.__init__(self)
        parser = ElementTree.XMLTreeBuilder(
            target=ElementTree.TreeBuilder(ElementTree.Element))
        parser.entity = htmlentitydefs.entitydefs
        self.parse(source=file, parser=parser)
        return

f = open('test.html')
tree = XmlParser(f)
tree.write('test_out.html', encoding='utf-8')
======
This program should be run with the following test file (test.html):
=====
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>test</title>
</head>
<body>
<p>&Alpha;</p>
</body>
</html>
======
If run as such, it will print out the following:
------
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>test</title>
</head>
<body>
<p>&amp;#913;</p>
</body>
</html>
-------
Notice how it is &amp;#913; that appears instead of Α
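
The same effect can be seen with the current Python 3 API. This is only a sketch under some assumptions: XMLParser is used in place of XMLTreeBuilder, a one-line document replaces test.html, and a single hand-written entry replaces the whole entitydefs table.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line stand-in for test.html, keeping the XHTML DOCTYPE.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html><body><p>&Alpha;</p></body></html>"
)

parser = ET.XMLParser()
# A value in the style of the stock table for code points above 0xff.
parser.entity["Alpha"] = "&#913;"
parser.feed(SAMPLE)
root = parser.close()

# The replacement value lands in the tree as plain text, not as markup ...
text = root.find("body/p").text
# ... so its "&" is escaped again when the tree is written out.
serialized = ET.tostring(root, encoding="unicode")

print(repr(text))
print(serialized)
```

The parser inserts the replacement string as literal text, so the ampersand in it is treated like any other ampersand on output.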

This is the behaviour with both Python3.0 and 2.5. (When I was
running with Python 2.5, I was always preprocessing the files with
BeautifulSoup, which removed many problems).

If I use "my_htmlentitiesdef.py" described in a previous message, I do
get an Alpha printed out (admittedly, not the character entity).

I would prefer to find a way to process such files and get Α
instead... (or even, to process files with hard-coded characters e.g.
é instead of &eacute; and have them processed properly...).

unicode-challengedly-yrs,

André


 

Fredrik Lundh

Martin said:
Can you please state what precise problem you were seeing? The original
code looks fine to me as it stands.

from what I can tell, his problem is that htmlentitydefs.entitydefs maps
to *either* character strings or HTML character references, depending on
the character value. he needs a dictionary that maps from entity names
to characters for *all* names; something like (untested):

entity_map = htmlentitydefs.entitydefs.copy()
for name, entity in entity_map.items():
    if len(entity) != 1:
        entity_map[name] = unichr(int(entity[2:-1]))

(entitydefs is pretty unusable as it is, but it was added to Python
before Python got Unicode strings, and changing it would break ancient
code...)
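
A Python 3 rendering of the same idea, as a sketch only: chr replaces unichr, and the three-entry mixed_entitydefs dict and the to_character_map helper are hypothetical, used here so the example does not depend on the contents of the real table.

```python
# Hypothetical sample in the mixed format described above: a character for
# code points up to 0xff, a numeric character reference otherwise.
mixed_entitydefs = {"eacute": "\xe9", "Alpha": "&#913;", "Dagger": "&#8225;"}


def to_character_map(entitydefs):
    """Return a copy in which every value is a single character."""
    entity_map = dict(entitydefs)
    for name, entity in entitydefs.items():
        if len(entity) != 1:
            # Strip the leading "&#" and the trailing ";", then decode the number.
            entity_map[name] = chr(int(entity[2:-1]))
    return entity_map


entity_map = to_character_map(mixed_entitydefs)
print(entity_map)
```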

</F>
 

Martin v. Löwis

entity_map = htmlentitydefs.entitydefs.copy()
for name, entity in entity_map.items():
    if len(entity) != 1:
        entity_map[name] = unichr(int(entity[2:-1]))

(entitydefs is pretty unusable as it is, but it was added to Python
before Python got Unicode strings, and changing it would break ancient
code...)

I would not write it this way, but as

for name,codepoint in htmlentitydefs.name2codepoint:
    entity_map[name] = unichr(codepoint)

I don't find that too unusable, although having yet another dictionary
name2char might be more convenient, for use with ElementTree.

(side note: I think it would be better if ElementTree treated the
.entity mapping similar to DTD ENTITY declarations, assuming internal
entities. Then the OP's code might have worked out of the box. That
would be an incompatible change also, of course.)

Regards,
Martin
 

Martin v. Löwis

Sorry for the top posting - I found out that the problem I encountered
was not something new in Python 3.0.

See Fredrik's message. The problem is not with htmlentitydefs, but with
your usage of ElementTree - in ElementTree, the .entity dictionary
values are not parsed again, apparently causing the '&' to be taken
as a literal ampersand, and not as markup.

That said, it *would* be possible to change htmlentitydefs in Python
3.0, to always map to Unicode characters; this would not be possible for
2.x (as Fredrik points out).
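
A sketch of what mapping every name to a Unicode character looks like in use with current Python 3, where the module is named html.entities. The name2char dict and the one-line sample document are assumptions made for the example; XMLParser stands in for XMLTreeBuilder.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs

# Hypothetical one-line input document, keeping the XHTML DOCTYPE.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html><body><p>&Alpha;</p></body></html>"
)

# Map every entity name to the character itself, never to a reference.
name2char = {name: chr(codepoint) for name, codepoint in name2codepoint.items()}

parser = ET.XMLParser()
parser.entity.update(name2char)
parser.feed(SAMPLE)
root = parser.close()

text = root.find("body/p").text
# With its default ASCII output, tostring() writes non-ASCII characters
# as numeric character references.
ascii_bytes = ET.tostring(root)

print(repr(text))
print(ascii_bytes)
```

With characters in the tree, the choice between a literal character and a numeric reference is made at output time by the encoding passed to tostring() or write().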

Regards,
Martin
 

Fredrik Lundh

Fredrik said:
entity_map = htmlentitydefs.entitydefs.copy()
for name, entity in entity_map.items():
    if len(entity) != 1:
        entity_map[name] = unichr(int(entity[2:-1]))

(entitydefs is pretty unusable as it is, but it was added to Python
before Python got Unicode strings, and changing it would break ancient
code...)

Martin said:
I would not write it this way, but as

for name,codepoint in htmlentitydefs.name2codepoint:
    entity_map[name] = unichr(codepoint)

has dictionary iteration changed in 3.0? if not, your code doesn't
quite work. and even if you fix that, it doesn't work for all Python
versions that ET works for...
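
For reference, iterating a dict directly yields only its keys, in 2.x and 3.x alike. A small sketch of the difference, written for Python 3 (chr instead of unichr) with a hypothetical two-entry table in place of the real name2codepoint:

```python
# Hypothetical two-entry sample standing in for htmlentitydefs.name2codepoint.
name2codepoint = {"Alpha": 0x0391, "eacute": 0xE9}

# Iterating the dict directly yields only the keys, so unpacking each key
# (a string) into two names fails.
try:
    for name, codepoint in name2codepoint:
        pass
    error = None
except ValueError as exc:
    error = exc

# Iterating .items() yields (name, codepoint) pairs as intended.
entity_map = {}
for name, codepoint in name2codepoint.items():
    entity_map[name] = chr(codepoint)

print(type(error).__name__, error)
print(entity_map)
```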

</F>
 

Martin v. Löwis

I would not write it this way, but as
for name,codepoint in htmlentitydefs.name2codepoint:
    entity_map[name] = unichr(codepoint)

has dictionary iteration changed in 3.0? if not, your code doesn't
quite work.

Right - I forgot to add .items().

Fredrik said:
and even if you fix that, it doesn't work for all Python
versions that ET works for...

That's true. However, the OP seemed to care only about Python 3.0
initially.

Regards,
Martin
 
