sgmlop: malformed charrefs?

Magnus Lie Hetland · Mar 17, 2005

According to The Sgmlop Module Handbook [1], the handle_entityref()
callback is called for "malformed character entities". What does that
mean, exactly? What is a malformed character entity? I've tried
mis-spelling them (e.g., dropping the semicolon), but then they're
(quite naturally) treated as text/data, with handle_data(). I've tried
to use a number that is too great, or (equivalently, it turns out) to
use names instead of numbers, such as . In these cases, I only
get an exception, because the number is too high...

So -- how can I produce a malformed character entity? I've tried to
read the C code, but I can't say that left me any wiser on the
subject; it doesn't seem to have any special-casing for this that I
can find.

And another thing... For the case where a numeric reference is too
high (i.e. it can't be translated into a Unicode character) -- is it
possible to ignore it (or replace it, as with encode/decode)? I'm
trying to write a parser that will accept *any* input text without
complaining -- but simply trapping this exception would seem to
disrupt the parsing process...

Thanks,

- Magnus

Fredrik Lundh · Mar 17, 2005

Magnus said:
According to The Sgmlop Module Handbook [1], the handle_entityref()
callback is called for "malformed character entities". What does that
mean, exactly? What is a malformed character entity? I've tried
mis-spelling them (e.g., dropping the semicolon), but then they're
(quite naturally) treated as text/data, with handle_data(). I've tried
to use a number that is too great, or (equivalently, it turns out) to
use names instead of numbers, such as . In these cases, I only
get an exception, because the number is too high...

So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

import sgmlop

class entity_handler:
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

And another thing... For the case where a numeric reference is too
high (i.e. it can't be translated into a Unicode character) -- is it
possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.
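
For reference, the three rules above can be modelled in a few lines of plain Python. This is only a sketch of the behaviour as described in this reply, not sgmlop's actual code: the function name dispatch_bad_charref and its return value are made up for illustration, and it is written for Python 3 so it can be run without sgmlop installed.

```python
def dispatch_bad_charref(handler, ref):
    """Model of the described routing for a charref that cannot be converted.

    `ref` is the raw text between '&' and ';', for example '#99999999'.
    Returns the (hypothetical) pair (callback_name, argument), or None
    when no hook is available and the reference is simply ignored.
    """
    if ref.startswith("#") and hasattr(handler, "handle_charref"):
        # Rule 2: a charref hook gets the part between '&#' and ';'.
        handler.handle_charref(ref[1:])
        return ("handle_charref", ref[1:])
    if hasattr(handler, "handle_entityref"):
        # Rule 3: without a charref hook, the entity hook gets the part
        # between '&' and ';' (including the leading '#').
        handler.handle_entityref(ref)
        return ("handle_entityref", ref)
    # Rule 1: nothing registered, so the reference is dropped.
    return None
```

The order of the two checks is the point of the sketch: a charref hook, when present, takes precedence over the entity hook.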

</F>

Magnus Lie Hetland · Mar 17, 2005

Magnus Lie Hetland wrote: [snip]
with sgmlop 1.1, the following script

class entity_handler:
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

OK, thanks. I guess I just wasn't creative enough in my entity naming.

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is passed
to that method.

I see -- so there is only trouble when the default behaviour of
transforming it to text kicks in? (That makes sense, of course.)
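
One way to get the "replace" behaviour asked about earlier would be to supply a handle_charref hook that does the conversion itself and never raises. The following is a minimal sketch under some assumptions: the class name LenientHandler and its text() helper are hypothetical, the hook is assumed to receive the text between &# and ; (as described above), a leading "x" is assumed to mean hexadecimal, and the code is Python 3 so it runs without sgmlop.

```python
import sys

class LenientHandler:
    """Hypothetical handler that swaps bad charrefs for a replacement char."""

    def __init__(self, replacement="\ufffd"):
        self.replacement = replacement
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_charref(self, ref):
        # `ref` is assumed to be the text between '&#' and ';'.
        try:
            if ref[:1] in ("x", "X"):
                code = int(ref[1:], 16)
            else:
                code = int(ref, 10)
            out_of_range = not (0 < code <= sys.maxunicode)
            surrogate = 0xD800 <= code <= 0xDFFF
            if out_of_range or surrogate:
                raise ValueError(ref)
            char = chr(code)
        except ValueError:
            # Not a number, too large, or not a valid character: replace it.
            char = self.replacement
        self.chunks.append(char)

    def text(self):
        return "".join(self.chunks)
```

Passing replacement="" would give the "ignore" variant instead. An instance would be registered with parser.register() in the same way as the handlers shown in this thread.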

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.

Strange. It doesn't seem to work that way for me... Here is an example:

.......................................................................
from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

class Handler:

    def handle_data(self, data):
        print 'DATA', data

    def handle_entityref(self, data):
        print 'ENTITY', data

for parser in [SGMLParser(), XMLParser(), XMLUnicodeParser()]:
    parser.register(Handler())
    try:
        parser.feed('')
    except Exception, e:
        print e
.......................................................................

When I run this, I get:

character reference exceeds ASCII range
character reference exceeds ASCII range
character reference exceeds sys.maxunicode (0xffff)

If I remove the handle_data, nothing happens.

Fredrik Lundh · Mar 17, 2005

Magnus said:
Strange. It doesn't seem to work that way for me... Here is an example:

from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

</F>

Magnus Lie Hetland · Mar 17, 2005

[snip]

are the PyXML folks shipping the latest sgmlop?

I don't know. The last history entry marked fl is from 2000-07-05...

Perhaps I should just get the effbot version. (And perhaps file a bug
report about this behaviour in PyXML.)

I'm pretty sure they've forked the code (there's no UnicodeParser in
the effbot.org edition),

Does it deal with Unicode at all? I.e., can I, for example, feed it a
Unicode object?

and I have no idea how things work in the fork.

I see.

Martin v. Löwis · Mar 17, 2005

Fredrik said:
are the PyXML folks shipping the latest sgmlop? I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

As we've forked the code, the answer is a clear "yes". It certainly
is the latest release of the fork.

Regards,
Martin

Fredrik Lundh · Mar 18, 2005

Martin said:
As we've forked the code, the answer is a clear "yes". It certainly
is the latest release of the fork.

if the 2000-07-05 date is correct, there have been at least eight public releases
of the original sgmlop distribution since the fork.

</F>

Magnus Lie Hetland · Mar 18, 2005

if the 2000-07-05 date is correct, there have been at least eight
public releases of the original sgmlop distribution since the fork.

Hm. This may, of course, be just fine -- but it seems a bit
unfortunate to me... I.e. nice features added in each of the two, but
no distribution where all the features are available... Or something.
(Or at least all the bug fixes.)

Is there any chance of at least sharing fixes for things such as the
illegal charrefs becoming entity refs etc.? (Yeah, I know, I can
submit patches, but I don't know the code all that well...)

Or: What are the chances of handling Unicode with the Effbot sgmlop
(which seems to be the only feature I'm missing in that at the
moment)? Using UTF-8 or something would be completely acceptable to
me, as long as it works. (Maybe simply feeding it UTF-8 strings would
work as it is? Except for Unicode charrefs, of course... Or?)
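
On the UTF-8 idea: one property that makes it plausible is that UTF-8 never uses a byte below 0x80 inside a multi-byte sequence, so markup characters such as "<", "&" and ";" keep their meaning in the encoded bytes. The sketch below only checks that property and shows how a callback could decode what it receives; both function names are made up for illustration, and whether a given sgmlop build passes such bytes through untouched is an assumption that would need testing.

```python
def ascii_bytes_preserved(text):
    """Check that UTF-8 encoding adds or removes no ASCII bytes."""
    encoded = text.encode("utf-8")
    ascii_before = [ord(ch) for ch in text if ord(ch) < 0x80]
    ascii_after = [byte for byte in encoded if byte < 0x80]
    return ascii_before == ascii_after

def decode_chunk(chunk):
    """Decode a byte chunk received by a callback without ever raising."""
    # 'replace' turns undecodable bytes into U+FFFD instead of failing.
    return chunk.decode("utf-8", "replace")
```

Numeric character references above the ASCII range would still need a hook of their own, since the parser itself would only ever see bytes. A chunk boundary that falls in the middle of a multi-byte sequence would also decode to replacement characters with this approach.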

- M
