A critique of cgi.escape

Jon Ribbens · Sep 25, 2006

(still waiting for the "jon's enhanced escape" proposal, btw, but I guess it's
easier to piss on others than to actually contribute something useful).

Well, yes, you certainly seem to be good at the "pissing on others"
part, even if you have to lie to do it. You have had the "enhanced
escape" proposal all along - it was the post which started this
thread! If you are referring to your strawman argument about
encodings, you have yet to show that it's relevant.

If it'll make you any happier, here's the code for the 'cgi.escape'
equivalent that I usually use:

import re

_html_encre = re.compile("[&<>\"'+]")
_html_encodes = { "&": "&amp;", "<": "&lt;", ">": "&gt;", "\"": "&quot;",
                  "'": "&#39;", "+": "&#43;" }

def html_encode(raw):
    return re.sub(_html_encre, lambda m: _html_encodes[m.group(0)], raw)
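
A quick usage sketch of the function above, restated with its import so that it runs on its own. The sample string is made up for illustration, and the entity values are assumed to be the usual named and numeric references for these six characters.

```python
import re

_html_encre = re.compile("[&<>\"'+]")
_html_encodes = {"&": "&amp;", "<": "&lt;", ">": "&gt;", "\"": "&quot;",
                 "'": "&#39;", "+": "&#43;"}

def html_encode(raw):
    # Each special character is replaced in a single pass over the input,
    # so an "&" that is already part of an entity is escaped exactly once.
    return _html_encre.sub(lambda m: _html_encodes[m.group(0)], raw)

# Hypothetical sample input containing every kind of character handled.
sample = html_encode("Tom's \"1+1\" <b>")
already_escaped = html_encode("&lt;")
print(sample)
print(already_escaped)
```

Because both kinds of quote are converted, the result can be dropped into a single-quoted or a double-quoted attribute value as well as into element content.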

Fredrik Lundh · Sep 25, 2006

Jon said:
There's nothing to say that cgi.escape should take them both into account
in the one function

so what exactly are you using cgi.escape for in your code ?

What precisely do you think it would "break"?

existing code, and existing tests.

</F>

Max M · Sep 25, 2006

Fredrik Lundh wrote:

Jon Ribbens wrote:

I've already explained that, but since you're convinced that your use
case is more important than other use cases, and you don't care about
things like stability and respect for existing users of an API, nor
the cost for others to update their code and unit tests, I don't see
much need to repeat myself. Breaking things just because you think
you can simply isn't the Python way of doing things.

This thread is highly entertaining but perhaps not that productive.

Lawrence is right that the escape method doesn't work the way he expects
it to.

Rewriting a library module simply because a developer is surprised is a
*very* bad idea. It would break just about every web app out there that
uses the escape module and uses testing. Which is probably most of them.
That could mean several man years of wasted time. It also makes the
escaped html harder to read for standard cases.

Fredrik is right that doing so is utterly ... well let us call it
"unproductive". Stupid is such a harsh word ;-)

Whether someone finds the bloat minuscule and thus a small enough change
to warrant the rewrite does not really matter.

Lawrence is free to write a wrapper and use that instead.

my_escape = lambda st: cgi.escape(st, 1)

So. Lawrence is happy, and the escape works as expected. Several man
years have been saved.

Max M
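
To make the wrapper concrete, here is a sketch that uses a stand-in named `escape`. The stand-in is an assumption written for this example, mirroring the behaviour discussed in this thread ("&", "<" and ">" always converted, the double quote only when the second argument is true); it is not the library function itself, and the sample text is invented.

```python
def escape(s, quote=None):
    # Stand-in for cgi.escape as described in the thread (an assumption,
    # not the library source). "&" has to be replaced first, otherwise the
    # ampersands introduced by the other replacements would be escaped again.
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

# Max M's wrapper, pointed at the stand-in.
my_escape = lambda st: escape(st, 1)

text = "it's \"hi\" & <b>"
default_out = escape(text)
wrapped_out = my_escape(text)
print(default_out)
print(wrapped_out)
```

In this sketch the wrapper converts the double quotes, while the apostrophe is left alone in both cases, which is the gap the thread is arguing about.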

Jon Ribbens · Sep 25, 2006

It is generally a principle of Python that new releases maintain backward
compatibility. An incompatible change such as the one proposed here would probably
break many tests for a large number of people.

Why is the suggested change incompatible? What code would it break?
I agree that it would be a bad idea if it did indeed break backwards
compatibility - but it doesn't.

There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html page;

I disagree. I think that doing it in one is muddled thinking and
liable to lead to bugs. Why not keep your output as unicode until it
is ready to be output to the browser, and encode it as appropriate
then? Character encoding and character escaping are separate jobs with
separate requirements that are better off handled by separate code.

Jon Ribbens · Sep 25, 2006

so what exactly are you using cgi.escape for in your code ?

To escape characters so that they will be treated as character data
and not control characters in HTML.

existing code, and existing tests.

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Fredrik Lundh · Sep 25, 2006

Max said:
It also makes the escaped html harder to read for standard cases.

and slows things down a bit.

(cgi.escape(s, True) is slower than cgi.escape(s), for reasons that are
obvious for anyone who's looked at the code).

</F>

Georg Brandl · Sep 25, 2006

Jon said:
To escape characters so that they will be treated as character data
and not control characters in HTML.

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
code that expects it not to do so would break.

Georg

Duncan Booth · Sep 25, 2006

Jon Ribbens said:
Why is the suggested change incompatible? What code would it break?
I agree that it would be a bad idea if it did indeed break backwards
compatibility - but it doesn't.

I guess you've never seen anyone write tests which retrieve some generated
html and compare it against the expected value. If the page contains any
unescaped quotes then this change would break it.
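
A minimal sketch of the kind of test described here. The `render_old` and `render_new` helpers are hypothetical, and `html.escape` from the current standard library is used as a stand-in for the two behaviours (quotes left alone versus quotes escaped); the exact entity chosen for the apostrophe is whatever `html.escape` emits, not a claim about what the proposal would use.

```python
import html

def render_old(value):
    # Hypothetical page fragment: only "&", "<" and ">" are converted.
    return "<p>%s</p>" % html.escape(value, quote=False)

def render_new(value):
    # The same fragment if both kinds of quote were converted as well.
    return "<p>%s</p>" % html.escape(value, quote=True)

value = "It's a \"test\" & more"
expected = "<p>It's a \"test\" &amp; more</p>"  # recorded from the old output

old_matches = render_old(value) == expected
new_matches = render_new(value) == expected
# Both fragments decode to the same text, so a browser shows the same thing.
same_when_decoded = html.unescape(render_old(value)) == html.unescape(render_new(value))
print(old_matches, new_matches, same_when_decoded)
```

The string comparison fails under the new escaping even though the decoded text is identical, which is the point each side of the argument is leaning on.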

I disagree. I think that doing it in one is muddled thinking and
liable to lead to bugs. Why not keep your output as unicode until it
is ready to be output to the browser, and encode it as appropriate
then? Character encoding and character escaping are separate jobs with
separate requirements that are better off handled by separate code.

Sorry, convert into something I can safely insert wasn't meant to imply
encoding: just entity escaping.

To be clear:

I'm talking about encoding certain characters as entity references. It
doesn't matter whether it's the character ampersand or right double quote,
they both want to be converted to entities. Same operation.

The resulting string might be a byte string or it might still be unicode:
the point being that the conversion I want is from unescaped to entity
escaped, not from unicode to byte encoded. Right now the only way the
Python library gives me to do the entity escaping properly has a side
effect of encoding the string. I should be able to do the escaping without
having to encode the string at the same time.
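
A short sketch of the distinction being drawn, assuming the library route referred to is the `xmlcharrefreplace` error handler (the post does not name it). It uses Python 3 string types and `html.escape` as a stand-in for markup escaping; the sample text is invented.

```python
import html

text = "a < b \u201d"  # contains "<" and a right double quotation mark

# The codec error handler turns unencodable characters into numeric
# references, but only as a side effect of encoding: the result is bytes,
# and markup characters such as "<" are left untouched.
encoded = text.encode("ascii", "xmlcharrefreplace")

# Escaping the markup characters alone keeps the value as text.
escaped = html.escape(text, quote=False)

# Doing the two jobs as separate steps: escape first, encode at output time.
both = escaped.encode("ascii", "xmlcharrefreplace")
print(encoded, escaped, both)
```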

Jon Ribbens · Sep 25, 2006

Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
code that expects it not to do so would break.

Sorry, that's still not good enough. Why would any code expect such a
thing?

Max M · Sep 25, 2006

Jon Ribbens wrote:

To escape characters so that they will be treated as character data
and not control characters in HTML.

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Some examples are:

- Possibly any code that tests for string equality in a rendered
html/xml page. Testing is a preferred development tool these days.

- Code that generates cgi.escaped() markup and (rightfully) for some
reason expects the old behaviour to be used.

- 3. party code that parses/scrapes content from cgi.escaped() markup.
(you could even break Java code this way :-s )

Any change in Python that has these consequences will rightfully be
considered a bug. So what you are suggesting is to knowingly introduce a
bug in the standard library!

You are right that the html generated by cgi.escape() would (probably)
have the same visual appearance in the browsers. But that is a *very*
narrow definition of being bug free and not breaking stuff.

If you cannot think of other examples for yourself where your change
would introduce breakage, you are certainly not an experienced enough
programmer to suggest changes in the standard lib!

Max M

Jon Ribbens · Sep 25, 2006

I guess you've never seen anyone write tests which retrieve some generated
html and compare it against the expected value. If the page contains any
unescaped quotes then this change would break it.

You're right - I've never seen anyone do such a thing. It sounds like
a highly dubious and very fragile sort of test to me, of very limited
use.

I'm talking about encoding certain characters as entity references. It
doesn't matter whether it's the character ampersand or right double quote,
they both want to be converted to entities. Same operation.

This is that muddled thinking I was talking about. They are *not* the
same operation. You want to encode "<", for example, because it must
always be encoded to prevent it being treated as an HTML control
character. This has nothing to do with character encodings.

You might sometimes want to escape "right double quote" because it may
or may not be available in the character encoding you are using to output
to the browser. Yes, this might sometimes seem a bit similar to the
"<" escaping described above, because one of the ways you could avoid
the character encoding issue would be to use numeric entities, but it
is actually a completely separate issue and is none of the business of
cgi.escape.

By your argument, cgi.escape should in fact escape *every single*
character as a numeric entity, and even that wouldn't work properly
since "&", "#", ";" and the digits might not be in their usual
positions in the output encoding.

Right now the only way the Python library gives me to do the entity
escaping properly has a side effect of encoding the string. I should
be able to do the escaping without having to encode the string at
the same time.

I'm getting lost here - the opposite of what you say above is true.
cgi.escape does the escaping properly (modulo failing to escape
quotes) without encoding.

Max M · Sep 25, 2006

Jon Ribbens wrote:

Sorry, that's still not good enough. Why would any code expect such a
thing?

Oh ... because you cannot see a use case for that *documented*
behaviour, it must certainly be wrong?

This function, which is correct by current documentation, will be broken
by your change.

def hasSomeWord(someword):
    import urllib
    # urllib.urlopen is the Python 2 call; urllib has no open() function
    f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
    content = f.read()
    f.close()
    return '"%s"' % someword in content

You might think that it is stupid code that should be changed to take
escaped quotes into account. But that is really not your business to
decide if the other behaviour is documented and correct.

I find it amazing that you cannot understand this. I will stop replying
in this thread now.

Max M

Jon Ribbens · Sep 25, 2006

Some examples are:

- Possibly any code that tests for string equality in a rendered
html/xml page. Testing is a preferred development tool these days.

Testing is good, but only if done correctly.

- Code that generates cgi.escaped() markup and (rightfully) for some
reason expects the old behaviour to be used.

That's begging the question again ("an example of code that would
break is code that would break").

- 3. party code that parses/scrapes content from cgi.escaped() markup.
(you could even break Java code this way :-s )

I'm sorry, I don't understand that one. What is "party code"? Code
that is scraping content from web sites already has to cope with
entities etc.

Your comment about Java is a little ironic given that I persuaded the
Java Struts people to make the exact same change we're talking about
here, back in 2002 (even if it did take 11 months) ;-)

If you cannot think of other examples for yourself where your change
would introduce breakage, you are certainly not an experienced enough
programmer to suggest changes in the standard lib!

I'll take my own opinion on that over yours, thanks.

and-google · Sep 25, 2006

Jon said:
I'm sorry, that's not good enough. How, precisely, would it break
"existing code"?

('owdo Mr. Ribbens!)

It's possible there could be software that relies on ' not being
escaped, for example:

# Auto-markup links to O'Reilly, everyone's favourite
# example name with an apostrophe in it
#
URI= 'http://www.oreilly.com/'
html= cgi.escape(text)
html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)

Sure this may be rare, but it's what the documentation says, and
changing it may not only fix things but also subtly break things in
ways that are hard to detect.
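
The breakage can be reproduced with a sketch along these lines. `html.escape` from the current standard library stands in for the two behaviours (apostrophe left alone versus apostrophe escaped), `add_links` is a hypothetical helper wrapping the replace call above, and the sample sentence is invented.

```python
import html

URI = 'http://www.oreilly.com/'
text = "Books from O'Reilly"

def add_links(escaped):
    # Same plain-text replace as in the example above.
    return escaped.replace("O'Reilly", '<a href="%s">O\'Reilly</a>' % URI)

# Apostrophe left alone: the search string is found and the link is added.
old_style = add_links(html.escape(text, quote=False))
# Apostrophe escaped: the search string no longer occurs, so nothing happens.
new_style = add_links(html.escape(text, quote=True))
print(old_style)
print(new_style)
```

Nothing raises an error in the second case; the link is just silently missing, which is what makes this kind of breakage hard to detect.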

A similar change to str.encode('unicode-escape') in Python 2.5 caused a
number of similar subtle problems. (In this case the old documentation
was a bit woolly so didn't prescribe the exact older behaviour.)

I'm not saying that the cgi.escape interface is *good*, just that it's
too late to change it.

I personally think the entire function should be deprecated, firstly
because it's insufficient in some corner cases (apostrophes as you
pointed out, and XHTML CDATA), and secondly because it's in the wrong
place: HTML-escaping is nothing to do with the CGI interface. A good
template library should deal with escaping more smoothly and correctly
than cgi.escape. (It may be able to deal with escape-or-not-bother and
character encoding issues automatically, for example.)

Jon Ribbens · Sep 25, 2006

Oh ... because you cannot see a use case for that *documented*
behaviour, it must certainly be wrong?

No, but if nobody else can find one either, that's a clue that maybe
it's safe to change.

Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to. Even by your own argument, therefore, code is not
entitled to rely on the output of cgi.escape being any particular
exact string.

This function, which is correct by current documentation, will be broken
by your change.

def hasSomeWord(someword):
    import urllib
    # urllib.urlopen is the Python 2 call; urllib has no open() function
    f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
    content = f.read()
    f.close()
    return '"%s"' % someword in content

That function is broken already, no change required.
I find it amazing that you cannot understand this.

Duncan Booth · Sep 25, 2006

Jon Ribbens said:
Sorry, that's still not good enough. Why would any code expect such a
thing?

It's easy enough to come up with examples which might. For example, I
have doctests which evaluate tal expressions. I don't think I currently
have any which depend on quotes, but I can easily create one (I just
did, and it passes):
<x title="It's a &quot;tal&quot; string" />

More likely I might output a field value and just happen to have used a quote
in it.

FWIW, in zope tal, the value of tal:content is escaped using the equivalent of
cgi.escape(s, False), and attribute values are escaped using
cgi.escape(s, True).

The function T I use is defined as:

def T(template, **kw):
    """Create and render a page template."""
    pt = PageTemplate()
    pt.pt_edit(template, 'text/html')
    return pt.pt_render(extra_context=kw).strip('\n')
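
The content-versus-attribute split described above can be sketched without Zope. Both `escape` (a stand-in mirroring the behaviour discussed in this thread) and `render_element` are hypothetical helpers written for this example; this is not how the page template engine is implemented.

```python
def escape(s, quote=False):
    # Stand-in for the escaping discussed in the thread (an assumption,
    # not library source): "&" first, then "<" and ">", quotes on request.
    s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

def render_element(tag, title, content):
    # Attribute values are escaped with quote=True, element content without.
    return '<%s title="%s">%s</%s>' % (
        tag, escape(title, True), escape(content, False), tag)

out = render_element("x", 'It\'s a "tal" string', 'say "hi" < now')
print(out)
```

The apostrophe in the attribute value passes through unchanged, so a doctest recording this output would start failing if apostrophes were escaped too.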

Fredrik Lundh · Sep 25, 2006

Jon said:
Sorry, that's still not good enough.

that's not up to you to decide, though.

</F>

Jon Ribbens · Sep 25, 2006

('owdo Mr. Ribbens!)

Good afternoon Mr Glover ;-)

URI= 'http://www.oreilly.com/'
html= cgi.escape(text)
html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)

Sure this may be rare, but it's what the documentation says, and
changing it may not only fix things but also subtly break things in
ways that are hard to detect.

I'm not sure about "subtly break things", but you're right that the
above code would break. I could argue that it's broken already,
(since it's doing a plain-text search on HTML data) but given
real-world considerations it's reasonable enough that I won't be that
pedantic ;-)

I personally think the entire function should be deprecated, firstly
because it's insufficient in some corner cases (apostrophes as you
pointed out, and XHTML CDATA), and secondly because it's in the wrong
place: HTML-escaping is nothing to do with the CGI interface. A good
template library should deal with escaping more smoothly and correctly
than cgi.escape. (It may be able to deal with escape-or-not-bother and
character encoding issues automatically, for example.)

I agree that in most situations you should probably be using a
template library, but sometimes a simple CGI-and-manual-HTML system
suffices, and I think (a fixed version of) cgi.escape should exist at
a low level of the web application stack.

Filip Salomonsson · Sep 25, 2006

Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to.

If the documentation isn't clear enough, that means the documentation
should be fixed.

It does _not_ mean "you are free to introduce new behavior because
nobody should trust what this function does anyway".

Jon Ribbens · Sep 25, 2006

that's not up to you to decide, though.

It's up to me to decide whether or not an argument is good enough to
convince me, thank you very much.
