A critique of cgi.escape

  • Thread starter Lawrence D'Oliveiro

Lawrence D'Oliveiro

The "escape" function in the "cgi" module escapes characters with special
meanings in HTML. The ones that need escaping are '<', '&' and '"'.
However, cgi.escape only escapes the quote character if you pass a second
argument of True (the default is False):

>>> cgi.escape('the "quick" & <brown> fox', True)
'the &quot;quick&quot; &amp; &lt;brown&gt; fox'

This seems to me to be dumb. The default option should be the safe one: that
is, escape _all_ the potentially troublesome characters. The only time you
can get away with NOT escaping the quote character is outside of markup,
e.g.

<TEXTAREA>
unescaped "quotes" allowed here
</TEXTAREA>

Nevertheless, even in that situation, escaped quotes are acceptable.

So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.

Can changing the default break existing scripts? I don't see how. It might
even fix a few lurking bugs out there.
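
To make the difference between the two modes concrete, here is a minimal sketch. The function escape_like_cgi is a hypothetical stand-in written for illustration, not the library's source; it only follows the behaviour described in this post, so the example does not depend on the cgi module being available. The <input> template is likewise an assumption.

```python
def escape_like_cgi(s, quote=False):
    """Hypothetical stand-in mirroring the behaviour described for cgi.escape."""
    # '&' must be replaced first, otherwise the entities produced below
    # would themselves be escaped a second time.
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        # Only done on request; this is the default being criticised.
        s = s.replace('"', "&quot;")
    return s


text = 'the "quick" & <brown> fox'
default_result = escape_like_cgi(text)        # quotes left untouched
quoted_result = escape_like_cgi(text, True)   # quotes become &quot;

# Dropping the default result into a double-quoted attribute ends the
# attribute value early, at the first unescaped quote.
broken_tag = '<input value="%s">' % default_result
safe_tag = '<input value="%s">' % quoted_result

print(default_result)
print(quoted_result)
print(broken_tag)
print(safe_tag)
```

The third line printed is the case at issue: a value escaped with the default setting and then placed inside an attribute.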
 
Fredrik Lundh

Lawrence said:
> So I think the default for the second argument to cgi.escape should be
> changed to True. Or alternatively, the second argument should be removed
> altogether, and quotes should always be escaped.

you're confused: cgi.escape(s) is designed to be used for ordinary text,
cgi.escape(s, True) is designed for attributes. if you use the code the
way it's intended to be used, it works perfectly fine.

> Can changing the default break existing scripts? I don't see how. It might
> even fix a few lurking bugs out there.

I'm not sure this "every time I don't immediately understand something,
I'll write a change proposal instead of reading the library reference"
approach is healthy, really.

</F>
 
Lawrence D'Oliveiro

> you're confused: cgi.escape(s) is designed to be used for ordinary text,
> cgi.escape(s, True) is designed for attributes.

What works for attributes also works for ordinary text.
 
Jon Ribbens

> you're confused: cgi.escape(s) is designed to be used for ordinary text,
> cgi.escape(s, True) is designed for attributes. if you use the code the
> way it's intended to be used, it works perfectly fine.

He's not confused, he's correct; the author of cgi.escape is the
confused one. The optional extra parameter is completely unnecessary
and achieves nothing except to make it easier for people to end up
with bugs in their code.

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character. This is essentially
identical to the quote character in HTML, so any code which is escaping
one should always be escaping the other.
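
The apostrophe point can be illustrated with a short sketch. The helper escape_double_only is hypothetical and merely imitates what cgi.escape(s, True) is described as doing in this thread; the <a title=...> templates are also assumptions made for the example.

```python
def escape_double_only(s):
    """Hypothetical helper: escapes &, <, > and the double quote.
    The apostrophe is deliberately left alone, as in the thread."""
    # '&' comes first so the entities added afterwards are not re-escaped.
    for char, entity in (("&", "&amp;"), ("<", "&lt;"),
                         (">", "&gt;"), ('"', "&quot;")):
        s = s.replace(char, entity)
    return s


value = "it's here"

# Double-quoted attribute: the apostrophe is harmless.
double_quoted = '<a title="%s">' % escape_double_only(value)

# Single-quoted attribute: the unescaped apostrophe closes the value early.
single_quoted = "<a title='%s'>" % escape_double_only(value)

print(double_quoted)
print(single_quoted)
```

Whether this matters in practice depends on which delimiter the surrounding template uses, which is exactly what the two sides here disagree about.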
 
Lawrence D'Oliveiro

> He's not confused, he's correct; the author of cgi.escape is the
> confused one.

Thanks for backing me up. :)

> One thing that is flat-out wrong, by the way, is that cgi.escape()
> does not encode the apostrophe (') character. This is essentially
> identical to the quote character in HTML, so any code which is escaping
> one should always be escaping the other.

I must confess I did a double-take on this. But I rechecked the HTML spec
(HTML 4.0, section 3.2.2, "Attributes"), and you're right--single quotes
ARE allowed as an alternative to double quotes. It's just I've never used
them as quotes. :)
 
Fredrik Lundh

Lawrence said:
> What works for attributes also works for ordinary text.

attributes and ordinary text are two different things in HTML and XML.
you're arguing that it's a good idea for *everyone* to bloat down
ordinary text just because you're too lazy to use a piece of code in the
intended way.

</F>
 
Fredrik Lundh

Jon said:
> Making cgi.escape always escape the '"' character would not break
> anything, and would probably fix a few bugs in existing code. Yes,
> those bugs are not cgi.escape's fault, but that's no reason not to
> be helpful. It's a minor improvement with no downside.

the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way, and will also
break unit tests.

> One thing that is flat-out wrong, by the way, is that cgi.escape()
> does not encode the apostrophe (') character.

it's intentional, of course: you're supposed to use " if you're using
cgi.escape(s, True) to escape attributes. again, punishing people who
actually read the docs and understand them is not a very good way to
maintain software.

btw, you're both missing that cgi.escape isn't good enough for general
use anyway, since it doesn't deal with encodings at all. if you want a
general purpose function that can be used for everything that can be put
in an HTML file, you need more than just a modified cgi.escape. feel
free to propose a general-purpose replacement (which should have a new
name), but make sure you think through *all* the issues before you do that.

</F>
 
Lawrence D'Oliveiro

> the "improvement with no downside" would bloat down the output for
> everyone who's using the function in the intended way, and will also
> break unit tests.

I don't understand this "bloat down" nonsense. Any tests that would break
are obviously testing the wrong thing.

> it's intentional, of course: you're supposed to use " if you're using
> cgi.escape(s, True) to escape attributes.

Attributes can be quoted with either single or double quotes. That's what
the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
cgi.escape is broken. QED.

> btw, you're both missing that cgi.escape isn't good enough for general
> use anyway, since it doesn't deal with encodings at all.

Why does it need to?
 
Fredrik Lundh

Lawrence said:
> Attributes can be quoted with either single or double quotes. That's what
> the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
> cgi.escape is broken. QED.

do you ever think before you post?

</F>
 
Georg Brandl

Lawrence said:
> I don't understand this "bloat down" nonsense. Any tests that would break
> are obviously testing the wrong thing.

&quot; is 5 characters more than ".

> Attributes can be quoted with either single or double quotes. That's what
> the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
> cgi.escape is broken. QED.

A function is broken if its implementation doesn't match the documentation.

As a courtesy, I've pasted it below.

escape(s[, quote])
Convert the characters "&", "<" and ">" in string s to HTML-safe sequences.
Use this if you need to display text that might contain such characters in HTML.
If the optional flag quote is true, the quotation mark character (""") is also
translated; this helps for inclusion in an HTML attribute value, as in <A
HREF="...">. If the value to be quoted might include single- or double-quote
characters, or both, consider using the quoteattr() function in the
xml.sax.saxutils module instead.


Now, do you still think cgi.escape is broken?


Georg
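
For readers who have not used the quoteattr() function mentioned in the documentation quoted above, here is a small sketch of how it treats values containing either kind of quote. The sample strings and the <a title=...> template are assumptions made for illustration.

```python
from xml.sax.saxutils import quoteattr

samples = ['plain', 'say "hi"', "it's", 'it\'s "both"']

# quoteattr returns the value already wrapped in quote characters and
# picks the delimiter based on which quotes occur in the value, so the
# template must not add quotes of its own.
rendered = ['<a title=%s>' % quoteattr(s) for s in samples]

for line in rendered:
    print(line)
```

Only when both kinds of quote are present does it fall back to replacing the double quote with &quot;.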
 
Fredrik Lundh

Georg said:
> A function is broken if its implementation doesn't match the documentation.

or if it doesn't match the designer's intent. cgi.escape is old enough
that we would have noticed that, by now...

</F>
 
Jon Ribbens

> A function is broken if its implementation doesn't match the documentation.

Or if the design, as described in the documentation, is flawed in some
way.

> As a courtesy, I've pasted it below.
> [...]
>
> Now, do you still think cgi.escape is broken?

Yes.
 
Jon Ribbens

> the "improvement with no downside" would bloat down the output for
> everyone who's using the function in the intended way,

By a minuscule degree. That is a very weak argument by any standard.

> and will also break unit tests.

Er, so change the unit tests at the same time?

> it's intentional, of course:

I noticed. That doesn't mean it isn't wrong.

> you're supposed to use " if you're using cgi.escape(s, True) to
> escape attributes. again, punishing people who actually read the
> docs and understand them is not a very good way to maintain
> software.

In what way is anyone being "punished"? Deliberately retaining flaws
and misfeatures that can easily be fixed without damaging
backwards-compatibility is not a very good way to maintain software
either.

> btw, you're both missing that cgi.escape isn't good enough for general
> use anyway,

I'm sorry, I didn't realise this was a general thread about any and
all inadequacies of Python's cgi module.

> since it doesn't deal with encodings at all.

Why does it need to? cgi.escape is (or should be) dealing with
character strings, not byte sequences. I must admit,
internationalisation is not my forte, so if there's something
I'm missing here I'd love to hear about it.

By the way, if you could try and put across your proposed arguments as
to why you don't favour this suggested change without the insults and
general rudeness, it would be appreciated.
 
Fredrik Lundh

Lawrence said:
> _We_ certainly have noticed it.

you're not the designer, you're just some random guy who thinks that if you
don't understand something at first, it has to be changed, even if that change
would break things for others. maybe you haven't done software long enough
to understand that software works better if you use it the way it was intended
to be used, but that's no excuse for being stupid.

</F>
 
Fredrik Lundh

Jon said:
> Or if the design, as described in the documentation, is flawed in some
> way.

it does exactly what it says, and is perfectly usable as is, if you bother to
use it the way it was intended to be used.

(still waiting for the "jon's enhanced escape" proposal, btw, but I guess it's
easier to piss on others than to actually contribute something useful).

</F>
 
Fredrik Lundh

Jon said:
> Why does it need to? cgi.escape is (or should be) dealing with
> character strings, not byte sequences. I must admit,
> internationalisation is not my forte, so if there's something
> I'm missing here I'd love to hear about it.

If you're really serious about making things easier to use, shouldn't
you look at the whole picture? HTML documents are byte streams, so
any transformation from internal character data to HTML must take both
escaping and encoding into account. If you and Lawrence have a hard
time remembering how to use the existing cgi.escape function, despite
its utter simplicity, surely it would make your life even easier if
there was an alternative API that would handle both the easy part
(escaping) and the hard part (encoding)?

> By the way, if you could try and put across your proposed arguments as
> to why you don't favour this suggested change without the insults and
> general rudeness, it would be appreciated.

I've already explained that, but since you're convinced that your use
case is more important than other use cases, and you don't care about
things like stability and respect for existing users of an API, nor
the cost for others to update their code and unit tests, I don't see
much need to repeat myself. Breaking things just because you think
you can simply isn't the Python way of doing things.

</F>
 
Jon Ribbens

> maybe you haven't done software long enough to understand that
> software works better if you use it the way it was intended to be
> used, but that's no excuse for being stupid.

So what's your excuse?
 
Jon Ribbens

> If you're really serious about making things easier to use, shouldn't
> you look at the whole picture? HTML documents are byte streams, so
> any transformation from internal character data to HTML must take both
> escaping and encoding into account.

Ever heard of modular programming? I would suggest that you do indeed
take a step back and look at the whole picture - it's the whole
picture that needs to take escaping and encoding into account. There's
nothing to say that cgi.escape should take them both into account in
the one function, and in fact as you yourself have already commented,
good reasons for it not to, in that it would make it excessively
complicated.

> If you and Lawrence have a hard time remembering how to use the
> existing cgi.escape function, despite its utter simplicity, surely
> it would make your life even easier if there was an alternative API
> that would handle both the easy part (escaping) and the hard part
> (encoding)?

You seem to be arguing that because, in an ideal world, it would be
better to throw away the 'cgi' module completely and start again, it
is not worth making minor improvements in what we already have.
I would suggest that this is, to put it mildly, not a good argument.

> I've already explained that, but since you're convinced that your use
> case is more important than other use cases, and you don't care about
> things like stability and respect for existing users of an API, nor
> the cost for others to update their code and unit tests, I don't see
> much need to repeat myself.

You are merely compounding your bad manners. All of your above
allegations are outright lies. I am not sure if you are simply not
understanding the simple points I am making, or are deliberately
trying to mislead people for some bizarre reason of your own.

> Breaking things just because you think you can simply isn't the
> Python way of doing things.

Your hyperbole is growing more extravagant. To begin with, you were
claiming that the suggested change would make things (minusculely)
less efficient, now you're claiming it will "break" unspecified
things. What precisely do you think it would "break"?
 
Duncan Booth

Jon Ribbens said:
> Er, so change the unit tests at the same time?

It is generally a principle of Python that new releases maintain backward
compatibility. An incompatible change such as the one proposed here would
probably break many tests for a large number of people.

If the change were seen as a good thing, then a backwards compatible change
(e.g. introducing a function with a different name) might be considered,
but if so it should address the whole issue: the current lack of support
for encodings is IMHO a far bigger problem than whether or not a quote mark
is escaped.

> Why does it need to? cgi.escape is (or should be) dealing with
> character strings, not byte sequences. I must admit,
> internationalisation is not my forte, so if there's something
> I'm missing here I'd love to hear about it.

If I have a unicode string such as: u'\u201d' (right double quote), then I
want that encoded in my html as '&#8221;' (or &rdquo; but the numeric form
is better). For many purposes I could just encode it in the encoding to be
used for the page, typically latin1 or utf8, but sometimes that isn't
possible e.g. if you don't know the encoding at the point when you produce
the string, or if there is no translation for the character in the desired
encoding. The character reference will work whatever encoding is used for
the page.

There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html page;
at present I need to call both cgi.escape and s.encode to get the desired
effect.
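
The two-step dance described above can be folded into one helper, roughly as sketched below. This is only an illustration under stated assumptions: to_html_bytes is a hypothetical name, and the standard library's html.escape is used here as a stand-in for cgi.escape so that the example runs on current Python 3.

```python
import html


def to_html_bytes(text, quote=True):
    """Hypothetical one-stop helper: escape markup characters, then emit
    ASCII bytes with numeric character references for everything else."""
    # Escape first, so the '&' in the references produced by the encode
    # step is not itself turned into '&amp;'.
    escaped = html.escape(text, quote=quote)
    # 'xmlcharrefreplace' replaces each character the codec cannot
    # represent with a reference such as &#8221;.
    return escaped.encode("ascii", "xmlcharrefreplace")


result = to_html_bytes("\u201d & <b>")
print(result)
```

Because the output is pure ASCII, it can be inserted into a page whose encoding is not known at the point where the string is produced, which is the situation described above.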
 
