sgmlop: malformed charrefs?

Magnus Lie Hetland

According to The Sgmlop Module Handbook [1], the handle_entityref()
callback is called for "malformed character entities". What does that
mean, exactly? What is a malformed character entity? I've tried
mis-spelling them (e.g., dropping the semicolon), but then they're
(quite naturally) treated as text/data, with handle_data(). I've tried
to use a number that is too great, or (equivalently, it turns out) to
use names instead of numbers, such as . In these cases, I only
get an exception, because the number is too high...

So -- how can I produce a malformed character entity? I've tried to
read the C code, but I can't say that left me any wiser on the
subject; it doesn't seem to have any special-casing for this that I
can find.

And another thing... For the case where a numeric reference is too
high (i.e. it can't be translated into a Unicode character) -- is it
possible to ignore it (or replace it, as with encode/decode)? I'm
trying to write a parser that will accept *any* input text without
complaining -- but simply trapping this exception would seem to
disrupt the parsing process...

Thanks,

- Magnus
 
Fredrik Lundh

Magnus said:
According to The Sgmlop Module Handbook [1], the handle_entityref()
callback is called for "malformed character entities". What does that
mean, exactly? What is a malformed character entity? I've tried
mis-spelling them (e.g., dropping the semicolon), but then they're
(quite naturally) treated as text/data, with handle_data(). I've tried
to use number that is too great, or (equivalently, it turns out) to
use names instead of numbers, such as . In these cases, I only
get an exception, because the number is too high...

So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

import sgmlop

class entity_handler:
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

And another thing... For the case where a numeric reference is too
high (i.e. it can't be translated into a Unicode character) -- is it
possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.

</F>
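The three fallback rules above can be modelled without sgmlop installed. The following is only a toy sketch in current Python 3 (not the Python 2 of this thread): `dispatch_refs`, `Recorder`, `EntityOnly` and `CharrefAndEntity` are hypothetical names, the regular expression is an assumption about what counts as a reference, and the normal conversion of in-range charrefs to text is left out. It is not sgmlop's actual code.

```python
import re

# '&', an optional '#', one or more characters other than '&', ';' or
# whitespace, then ';'. This pattern is an assumption, not sgmlop's scanner.
_REF = re.compile(r"&(#?[^&;\s]+);")


class Recorder:
    """Hypothetical handler base that records which hook was called."""

    def __init__(self):
        self.calls = []


class EntityOnly(Recorder):
    def handle_entityref(self, name):
        self.calls.append(("entityref", name))


class CharrefAndEntity(EntityOnly):
    def handle_charref(self, ref):
        self.calls.append(("charref", ref))


def dispatch_refs(text, handler):
    """Apply the fallback order described in the post to every reference."""
    charref_hook = getattr(handler, "handle_charref", None)
    entity_hook = getattr(handler, "handle_entityref", None)
    for match in _REF.finditer(text):
        body = match.group(1)
        if body.startswith("#") and charref_hook is not None:
            # handle_charref gets the part between '&#' and ';'
            charref_hook(body[1:])
        elif entity_hook is not None:
            # handle_entityref gets the part between '&' and ';'
            entity_hook(body)
        # no usable hook: the reference is silently dropped
    return handler


print(dispatch_refs("&-10;&/()=?;", EntityOnly()).calls)
print(dispatch_refs("&#99999999;", EntityOnly()).calls)
print(dispatch_refs("&#99999999;", CharrefAndEntity()).calls)
```

With only `handle_entityref` defined, the leading `#` stays in the string that the hook receives, which is one way to tell a rejected charref from an ordinary entity name.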
 
Magnus Lie Hetland

Magnus Lie Hetland wrote: [snip]
with sgmlop 1.1, the following script

class entity_handler:
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

OK, thanks. I guess I just wasn't creative enough in my entity naming
:)

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

I see -- it's just if the default behaviour of transforming it to text
kicks in that there is trouble? (That makes sense, of course.)

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.

Strange. It doesn't seem to work that way for me... Here is an example:

.......................................................................
from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

class Handler:

    def handle_data(self, data):
        print 'DATA', data

    def handle_entityref(self, data):
        print 'ENTITY', data

for parser in [SGMLParser(), XMLParser(), XMLUnicodeParser()]:
    parser.register(Handler())
    try:
        parser.feed('')
    except Exception, e:
        print e
.......................................................................

When I run this, I get:

character reference exceeds ASCII range
character reference exceeds ASCII range
character reference exceeds sys.maxunicode (0xffff)

If I remove the handle_data, nothing happens.
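For the "ignore it or replace it" behaviour asked about earlier, one option is to do the conversion in a handle_charref hook instead of relying on the parser's default. The sketch below is in current Python 3 and assumes, as described earlier in the thread, that the hook receives the text between &# and ;. `decode_charref` and `TolerantHandler` are hypothetical names, and using U+FFFD as the substitute is a choice made here, not something sgmlop does.

```python
import sys

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_charref(ref, replacement=REPLACEMENT):
    """Map the text between '&#' and ';' to a character without raising."""
    try:
        if ref[:1] in ("x", "X"):
            code = int(ref[1:], 16)  # hexadecimal form, e.g. 'x41'
        else:
            code = int(ref, 10)      # decimal form, e.g. '65'
    except ValueError:
        return replacement           # names, empty strings, other junk
    if not 0 < code <= sys.maxunicode:
        return replacement           # zero, negative, or too large
    if 0xD800 <= code <= 0xDFFF:
        return replacement           # lone surrogates are not characters
    return chr(code)


class TolerantHandler:
    """Hypothetical handler that collects text and never raises on charrefs."""

    def __init__(self):
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, ref):
        self.parts.append(decode_charref(ref))

    def text(self):
        return "".join(self.parts)
```

Passing `replacement=""` to `decode_charref` drops bad references instead of substituting them, which corresponds to the "ignore" variant.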
 
Fredrik Lundh

Magnus said:
Strange. It doesn't seem to work that way for me... Here is an example:

from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

</F>
 
Magnus Lie Hetland

[snip]
are the PyXML folks shipping the latest sgmlop?

I don't know. The last history entry marked fl is from 2000-07-05...

Perhaps I should just get the effbot version. (And perhaps file a bug
report about this behaviour in PyXML.)

I'm pretty sure they've forked the code (there's no UnicodeParser in
the effbot.org edition),

Does it deal with Unicode at all? I.e., can I, for example, feed it a
Unicode object?

and I have no idea how things work in the fork.

I see.
 
Martin v. Löwis

Fredrik said:
are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

As we've forked the code, the answer is a clear "yes" :) It certainly
is the latest release of the fork.

Regards,
Martin
 
Fredrik Lundh

Martin said:
As we've forked the code, the answer is a clear "yes" :) It certainly
is the latest release of the fork.

if the 2000-07-05 date is correct, there have been at least eight public releases
of the original sgmlop distribution since the fork.

</F>
 
Magnus Lie Hetland

if the 2000-07-05 date is correct, there have been at least eight
public releases of the original sgmlop distribution since the fork.

Hm. This may, of course, be just fine -- but it seems a bit
unfortunate to me... I.e. nice features added in each of the two, but
no distribution where all the features are available... Or something.
(Or at least all the bug fixes :)

Is there any chance of at least sharing fixes for things such as the
illegal charrefs becoming entity refs etc.? (Yeah, I know, I can
submit patches, but I don't know the code all that well...)

Or: What are the chances of handling Unicode with the Effbot sgmlop
(which seems to be the only feature I'm missing in that at the
moment)? Using UTF-8 or something would be completely acceptable to
me, as long as it works. (Maybe simply feeding it UTF-8 strings would
work as it is? Except for Unicode charrefs, of course... Or?)

- M
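The "just feed it UTF-8" idea could look roughly like the sketch below, in current Python 3. It rests on an assumption the thread does not confirm: that the parser passes non-ASCII bytes through to handle_data unchanged. `to_parser_bytes` and `Utf8DataCollector` are hypothetical names. An incremental decoder is used because a multi-byte character might be split across two handle_data calls. Charrefs above the 8-bit range are not covered; they would still need their own handling.

```python
import codecs


def to_parser_bytes(text):
    """Encode Unicode text as UTF-8 before handing it to a byte-oriented parser."""
    return text.encode("utf-8")


class Utf8DataCollector:
    """Hypothetical handler that reassembles UTF-8 data chunks into text."""

    def __init__(self):
        # errors="replace" keeps the collector from raising on bad bytes
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        self._parts = []

    def handle_data(self, data):
        # data is assumed to be bytes; a partial character is buffered
        # by the decoder until the rest of it arrives
        self._parts.append(self._decoder.decode(data))

    def close(self):
        # flush whatever is still buffered and return the collected text
        self._parts.append(self._decoder.decode(b"", final=True))
        return "".join(self._parts)


raw = to_parser_bytes("blåbær")
collector = Utf8DataCollector()
collector.handle_data(raw[:3])  # ends in the middle of the two-byte 'å'
collector.handle_data(raw[3:])
print(collector.close())
```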
 
