Why Is Escaping Data Considered So Magical?

Lawrence D'Oliveiro · Jun 24, 2010

Just been reading this article
<http://www.theregister.co.uk/2010/06/23/xxs_sql_injection_attacks_testing_remedy/>
which says that a lot of security holes are arising these days because
everybody is concentrating on unit testing of their own particular
components, with less attention being devoted to overall integration
testing.

Fair enough. But itâ€™s disconcerting to see some of the advice being offered
in the reader comments, like â€œforce everyone to use stored proceduresâ€, or
â€œforce everyone to use prepared/parametrized statementsâ€, â€œnever construct
ad-hoc SQL queriesâ€ and the like.

I construct ad-hoc queries all the time. It really isnâ€™t that hard to do
safely. All you have to do is read the documentationâ€”for example,
<http://dev.mysql.com/doc/refman/5.0/en/string-syntax.html>â€”and then write a
routine that takes arbitrary data and turns it into a valid string literal,
like this <http://www.codecodex.com/wiki/Useful_MySQL_Routines#Quoting>.

Iâ€™ve done this sort of thing for MySQL, for HTML and JavaScript (in both
Python and JavaScript itself), and for Bash. Itâ€™s not hard to verify youâ€™ve
done it correctly. It lets you easily create table-updating code like the
following, which makes it so easy to update the code to track changes in the
database structure:

sql.cursor.execute \
(
"update items set "
+
", ".join
(
tuple
(
"%(name)s = %(value)s"
%
{
"name" : field[0],
"value" : SQLString(Params.getvalue
(
"%s[%s]" % (field[1], urllib.quote(modify_id))
))
}
for field in
(
("class_name", "modify_class"),
("make", "modify_make"),
("model", "modify_model"),
("details", "modify_details"),
("serial_nr", "modify_serial"),
("inventory_nr", "modify_invent"),
("when_purchased", "modify_when_purchased"),
... you get the idea ...
("location_name", "modify_location"),
("comment", "modify_comment"),
)
)
+
(
"last_modified = %d" % int(time.time()),
)
)
+
" where inventory_nr = %s" % SQLString(modify_id)
)

Roy Smith · Jun 24, 2010

Lawrence D'Oliveiro said:
I construct ad-hoc queries all the time. It really isnâ€™t that hard to do
safely. All you have to do is read the documentation

I get worried when people talk about how easy it is to do something
safely. Let me suggest a couple of things you might not have considered:

1) Somebody is running your application (or the database server) with
the locale set to something unexpected. This might change how numbers,
dates, currency, etc, get formatted, which could change the meaning of
your constructed SQL statement.

2) Somebody runs your application with a different PYTHONPATH, which
causes a different (i.e. malicious) urllib module to get loaded, which
makes urllib.quote() do something you didn't expect.

Iâ€™ve done this sort of thing for MySQL, for HTML and JavaScript (in both
Python and JavaScript itself), and for Bash. Itâ€™s not hard to verify youâ€™ve
done it correctly. It lets you easily create table-updating code like the
following, which makes it so easy to update the code to track changes in the
database structure:

sql.cursor.execute \
(
"update items set "
+
", ".join
(
tuple
(
"%(name)s = %(value)s"
%
{
"name" : field[0],
"value" : SQLString(Params.getvalue
(
"%s[%s]" % (field[1],
urllib.quote(modify_id))
))
}
for field in
(
("class_name", "modify_class"),
("make", "modify_make"),
("model", "modify_model"),
("details", "modify_details"),
("serial_nr", "modify_serial"),
("inventory_nr", "modify_invent"),
("when_purchased", "modify_when_purchased"),
... you get the idea ...
("location_name", "modify_location"),
("comment", "modify_comment"),
)
)
+
(
"last_modified = %d" % int(time.time()),
)
)
+
" where inventory_nr = %s" % SQLString(modify_id)
)

Owen Jacobson · Jun 24, 2010

I get worried when people talk about how easy it is to do something
safely.

First: I agree with this. While it's definitely possible to correctly
escape a given SQL dialect under controlled conditions, it's not at all
easy to get it right, and the real world is even more unfriendly than
most people expect. Furthermore there's no reason to do it that way:
Python's DB API spec effectively requires that placeholder parameters
of *some* kind exist. Even if you feel the need to construct SQL, you
can construct it with parameters almost as easily as you can construct
it with the values baked in.

With that said...

2) Somebody runs your application with a different PYTHONPATH, which
causes a different (i.e. malicious) urllib module to get loaded, which
makes urllib.quote() do something you didn't expect.

Someone who can manipulate PYTHONPATH or otherwise add code to the
runtime environment is already in a position to hose your database,
independently of escaping-related issues. It's up to the sysadmin or
user to ensure that their environment is sane, and it's on their head
if they add broken code to a program's runtime environment.

I'€™ve done this sort of thing for MySQL, for HTML and JavaScript (in both
Python and JavaScript itself), and for Bash. Itâ€™s not hard to verify youâ€™ve
done it correctly. It lets you easily create table-updating code like the
following, which makes it so easy to update the code to track changes in the
database structure:

sql.cursor.execute \
(
"update items set "
+
", ".join
(
tuple
(
"%(name)s = %(value)s"
%
{
"name" : field[0],
"value" : SQLString(Params.getvalue
(
"%s[%s]" % (field[1],
urllib.quote(modify_id))
))
}
for field in
(
("class_name", "modify_class"),
("make", "modify_make"),
("model", "modify_model"),
("details", "modify_details"),
("serial_nr", "modify_serial"),
("inventory_nr", "modify_invent"),
("when_purchased", "modify_when_purchased"),
... you get the idea ...
("location_name", "modify_location"),
("comment", "modify_comment"),
)
)
+
(
"last_modified = %d" % int(time.time()),
)
)
+
" where inventory_nr = %s" % SQLString(modify_id)
)

Why would I write this when SQLAlchemy, even without using its ORM
features, can do it for me? It even uses the placeholder-generating
strategy I mentioned above, where possible.

Finally, it's worth noting that MySQL is (almost) the only mainstream
database that uses escaping for parameterization. PostgreSQL, SQL
Server, Oracle, DB2, and most other databases support parameters
natively in their communication protocols: parameters aren't injected
into the query string, but are sent separately and processed separately
within the DBMS. This neatly avoids encoding-related and
quoting-related problems entirely, and it means the type of the
parameter can be preserved if it's useful.

-o

Lawrence D'Oliveiro · Jun 24, 2010

1) Somebody is running your application (or the database server) with
the locale set to something unexpected.

Locales are under program control, so that wonâ€™t happen.

This is why I use UTF-8 encoding for everything.

Lawrence D'Oliveiro · Jun 24, 2010

Why would I write this when SQLAlchemy, even without using its ORM
features, can do it for me?

SQLAlchemy doesnâ€™t seem very flexible. Looking at the code examples
<http://www.sqlalchemy.org/docs/examples.html>, theyâ€™re very procedural:
build object, then do a string of separate method calls to add data to it. I
prefer the functional approach, as in my table-update example.

Cameron Simpson · Jun 25, 2010

| > Why would I write this when SQLAlchemy, even without using its ORM
| > features, can do it for me?
|
| SQLAlchemy doesnâ€™t seem very flexible. Looking at the code examples
| <http://www.sqlalchemy.org/docs/examples.html>, theyâ€™re very procedural:
| build object, then do a string of separate method calls to add data to it. I
| prefer the functional approach, as in my table-update example.

He said "without using its ORM". I do what you suggest (make SQL
statements at need) using SQLalchemy all the time. It is simple and easy
and _robust_ against odd data. The number of times I've had to
fix/remove insert-values-into-SQL-text code ...
--
Cameron Simpson <[email protected]> DoD#743
http://www.cskk.ezoshosting.com/cs/

Plague, Famine, Pestilence, and C++ stalk the land. We're doomed! Doomed!
- Simon E Spero

Nobody · Jun 25, 2010

Just been reading this article
...
which says that a lot of security holes are arising these days because
everybody is concentrating on unit testing of their own particular
components, with less attention being devoted to overall integration
testing.

Fair enough. But itâ€™s disconcerting to see some of the advice being
offered in the reader comments, like â€œforce everyone to use stored
proceduresâ€, or â€œforce everyone to use prepared/parametrized
statementsâ€, â€œnever construct ad-hoc SQL queriesâ€ and the like.

I construct ad-hoc queries all the time. It really isnâ€™t that hard to
do safely.

Wrong.

Even if you get the quoting absolutely correct (which is a very big "if"),
you have to remember to perform it every time, without exception. And you
need to perform it exactly once. As the program gets more complex,
ensuring that it's done in the correct place, and only there, gets harder.

More generally, as a program gets more complex, "this will work so long as
we do X every time without fail" approaches "this won't work".

All you have to do is read the documentationâ€”for example,
<http://dev.mysql.com/doc/refman/5.0/en/string-syntax.html>â€”and then
write a routine that takes arbitrary data and turns it into a valid
string literal, like this
<http://www.codecodex.com/wiki/Useful_MySQL_Routines#Quoting>.

That's okay. Provided the documentation is accurate. And provided that you
update the escaping algorithm whenever the SQL dialect gets extended, or
you switch to a different back-end, or modify the program. IOW, it's not
even remotely okay.

"Unparsing" data so that you get the correct answer out of a subsequent
parsing step is objectively and obviously the wrong approach. The
correct approach is to skip both the unparsing and parsing steps
entirely.

Formal grammars are a useful way to represent graph-like data structures
in a human-readable and human-editable form. But for creation,
modification and use by a computer, it is invariably preferable to operate
upon the graph directly. Textual formats inherit all of the "issues" which
apply to the underlying data structure, then add a few of their own for
good measure.

I've done this sort of thing for MySQL, for HTML and JavaScript (in both
Python and JavaScript itself), and for Bash.

And, of course, you're convinced that you got it right every time. That
attitude alone should set alarm bells ringing for anyone who's worked in
this industry for more than five minutes.

Paul Rubin · Jun 25, 2010

Nobody said:
More generally, as a program gets more complex, "this will work so long as
we do X every time without fail" approaches "this won't work".

QOTW

Jorgen Grahn · Jun 25, 2010

Just been reading this article
<http://www.theregister.co.uk/2010/06/23/xxs_sql_injection_attacks_testing_remedy/>
which says that a lot of security holes are arising these days because
everybody is concentrating on unit testing of their own particular
components, with less attention being devoted to overall integration
testing.

I don't do SQL and I don't even understand the terminology properly
.... but the discussion around it bothers me.

Do those people really do this?
- accept untrusted user data
- try to sanitize the data (escaping certain characters etc)
- turn this data into executable code (SQL)
- executing it

Like the example in the article

SELECT * FROM hotels WHERE city = '<untrusted>';

If so, its isomorphic with doing os.popen('zcat -f %s' % untrusted)
in Python (at least on Unix, where 'zcat ...' is executed as a shell
script).

I thought it was well-known that the solution is *not* to try to
sanitize the input -- it's to switch to an interface which doesn't
involve generating an intermediate executable. In the Python example,
that would be something like os.popen2(['zcat', '-f', '--', untrusted]).

Am I missing something? If not, I can go back to sleep -- and keep
avoiding SQL and web programming like the plague until that community
has entered the 21st century.

/Jorgen

John Nagle · Jun 25, 2010

Yes. I was just looking at some of my own code. Out of about 100
SQL statements, I'd used manual escaping once, in code where the WHERE
clause is built up depending on what information is available for the
search. It's done properly, using "MySQLdb.escape_string(s)", which
is what's used inside "cursor.execute". Looking at the code, I
now realize that it would have been better to
add sections to the SQL string with standard escapes, and at the same
time, append the key items to a list. Then the list can be
converted to a tuple for submission to "cursor.execute".

John Nagle

Nobody · Jun 25, 2010

I don't do SQL and I don't even understand the terminology properly
... but the discussion around it bothers me.

Do those people really do this?

Yes. And then some.

Among web developers, the median level of programming knowledge amounts to
the first 3 chapters of "Learn PHP in 7 Days".

It doesn't help the the guy who wrote PHP itself wasn't much better.

- accept untrusted user data
- try to sanitize the data (escaping certain characters etc)
- turn this data into executable code (SQL)
- executing it

Like the example in the article

SELECT * FROM hotels WHERE city = '<untrusted>';

Yep. Search the BugTraq archives for "SQL injection". And most of those
are for widely-deployed middleware; the zillions of bespoke site-specific
scripts are likely to be worse.

Also: http://xkcd.com/327/

I thought it was well-known that the solution is *not* to try to
sanitize the input

Well known by anyone with a reasonable understanding of the principles of
programming, but somewhat less well known by the other 98% of web
developers.

Am I missing something?

There's a world of difference between a skilled chef and the people
flipping burgers for a minimum wage. And between a chartered civil
engineer and the people laying the asphalt. And between what you
probably consider a programmer and the people doing most web development.

If not, I can go back to sleep -- and keep
avoiding SQL and web programming like the plague until that community
has entered the 21st century.

Don't hold your breath.

Of course, there's no fundamental reason why you can't apply sound
practices to web development. Well, other than the fact that you're
competing against an infinite number of (code-) monkeys for lowest-bidder
contracts.

To be fair, it isn't actually limited to web developers. I've seen the
following in scientific code written in C (or, more likely, ported to C
from Fortran) for Unix:

sprintf(buff, "rm -f %s", filename);
system(buff);

Why bother learning the Unix API when you already know system()?

Ian Kelly · Jun 25, 2010

To be fair, it isn't actually limited to web developers. I've seen the
following in scientific code written in C (or, more likely, ported to C
from Fortran) for Unix:

sprintf(buff, "rm -f %s", filename);
system(buff);

Tsk, tsk. And it's so easy to fix, too:

#define BUFSIZE 1000000
char buff[BUFSIZE];
if (snprintf(buff, BUFSIZE, "rm -f %s", filename) >= BUFSIZE) {
printf("No buffer overflow for you!\n");
} else {
system(buff);
}

There, that's much more secure.

Lawrence D'Oliveiro · Jun 25, 2010

Wrong.

Even if you get the quoting absolutely correct (which is a very big "if"),
you have to remember to perform it every time, without exception.

More generally, as a program gets more complex, "this will work so long as
we do X every time without fail" approaches "this won't work".

Thatâ€™s a content-free claim. Why? Because it applies equally to everything.
Replace â€œquotingâ€ with something like â€œarithmeticâ€, and youâ€™ll see what I
mean:

Even if you get the arithmetic absolutely correct (which is a very big
"if"), you have to remember to perform it every time, without exception.

More generally, as a program gets more complex, "this will work so long
as we do X every time without fail" approaches "this won't work".

From which we can conclude, according to your logic, that one shouldnâ€™t be
doing arithmetic.

Next time, try to avoid fallacious arguments.

And you need to perform it exactly once. As the program gets more complex,
ensuring that it's done in the correct place, and only there, gets harder.

Nonsense. It only needs to be done at the boundary to the appropriate
component (MySQL, HTML, JavaScript, whatever). Thatâ€™s the only place which
needs to have knowledge of whatâ€™s on the other side. Everything else can
work with arbitrary data without having to worry about such things.

Go back to my example, and youâ€™ll see this: the original updates two dozen
different fields in a database table, yet it only needs two calls to
SQLString: one deals with all the fields requiring updating, while the other
one deals with the key-matching. Thatâ€™s it. Instead of two dozen different
places needing checking, you only have two.

Thatâ€™s what â€œmaintainabilityâ€ is all about.

Roy Smith · Jun 25, 2010

Ian Kelly said:
To be fair, it isn't actually limited to web developers. I've seen the
following in scientific code written in C (or, more likely, ported to C
from Fortran) for Unix:

sprintf(buff, "rm -f %s", filename);
system(buff);

Click to expand...

Tsk, tsk. And it's so easy to fix, too:

#define BUFSIZE 1000000
char buff[BUFSIZE];
if (snprintf(buff, BUFSIZE, "rm -f %s", filename) >= BUFSIZE) {
printf("No buffer overflow for you!\n");
} else {
system(buff);
}

There, that's much more secure.

I recently fixed a bug in some production code. The programmer was
careful to use snprintf() to avoid buffer overflows. The only problem
is, he wrote something along the lines of:

snprintf(buf, strlen(foo), foo);

I'm sure the code got reviewed originally, and probably looked at dozens
of times over the years. Nobody caught the problem until we ran a
static code analysis tool (Coverity) over it.

To bring this back to something remotely Python related, the point of
all this is that security is hard. A lot of the security best practices
(such as "don't compose SQL queries on the fly with externally tainted
strings") exist because they address ways that people have gotten burned
in the past. It if foolish to think that you're smarter than everybody
else and have thought of every possibility to avoid getting burned by
doing the things that have gotten other people in trouble.

Lawrence D'Oliveiro · Jun 25, 2010

Cameron said:
On 25Jun2010 15:38, Lawrence D'Oliveiro <[email protected]_zealand>
wrote:

| In message <2010062422432660794-angrybaldguy@gmailcom>, Owen Jacobson
| wrote:

| > Why would I write this when SQLAlchemy, even without using its ORM
| > features, can do it for me?
|
| SQLAlchemy doesnâ€™t seem very flexible. Looking at the code examples
| <http://www.sqlalchemy.org/docs/examples.html>, theyâ€™re very procedural:
| build object, then do a string of separate method calls to add data to
| it. I prefer the functional approach, as in my table-update example.

He said "without using its ORM".

I noticed that. So were those examples I referenced above â€œusing its ORMâ€?
Can you offer better examples â€œwithout using its ORMâ€?

Lawrence D'Oliveiro · Jun 25, 2010

Jorgen Grahn said:
I thought it was well-known that the solution is *not* to try to
sanitize the input -- it's to switch to an interface which doesn't
involve generating an intermediate executable. In the Python example,
that would be something like os.popen2(['zcat', '-f', '--', untrusted]).

Thatâ€™s what I mean. Why do people consider input sanitization so hard?

Owen Jacobson · Jun 25, 2010

Jorgen Grahn said:
Jorgen Grahn said:

I thought it was well-known that the solution is *not* to try to
sanitize the input -- it's to switch to an interface which doesn't
involve generating an intermediate executable. In the Python example,
that would be something like os.popen2(['zcat', '-f', '--', untrusted]).

Click to expand...

Thatâ€™s what I mean. Why do people consider input sanitization so hard?

It's not hard. It's just begging for a visit from the fuckup fairy.

-o

Robert Kern · Jun 25, 2010

I noticed that. So were those examples I referenced above â€œusing its ORMâ€?
Can you offer better examples â€œwithout using its ORMâ€?

http://www.sqlalchemy.org/docs/sqlexpression.html

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

Tim Chase · Jun 25, 2010

In the Python example, that would be something like
os.popen2(['zcat', '-f', '--', untrusted]).

Click to expand...

Thatâ€™s what I mean. Why do people consider input sanitization
so hard?

It's hard because it requires thinking. Sadly, many of the
people I know who call themselves programmers couldn't code their
way out of a paper bag, let alone think logically about the
security implications of their code.[1]

-tkc

[1] much of which ends up being cargo-cult programming,
cut-n-paste'd from Google search-results.

Ian Kelly · Jun 26, 2010

SQLAlchemy doesn’t seem very flexible. Looking at the code examples
<http://www.sqlalchemy.org/docs/examples.html>, they’re very procedural:
build object, then do a string of separate method calls to add data to it.. I
prefer the functional approach, as in my table-update example.

Your example from the first post of the thread rewritten using sqlalchemy:

conn.execute(
items.update()
.where(items.c.inventory_nr == modify_id)
.values(
dict(
(field[0], Params.getvalue("%s[%s]" % (field[1],
urllib.quote(modify_id))))
for field in [
(items.c.class_name, "modify_class"),
(items.c.make, "modify_make"),
(items.c.model, "modify_model"),
(items.c.details, "modify_details"),
(items.c.serial_nr, "modify_serial"),
(items.c.inventory_nr, "modify_invent"),
(items.c.when_purchased, "modify_when_purchased"),
... you get the idea ...
(items.c.location_name, "modify_location"),
(items.c.comment, "modify_comment"),
]
)
)
.values(last_modified = time.time())
)

Doesn't seem any less flexible to me, plus you don't have to worry
about calling your SQLString function at all.

Cheers,
Ian

is list comprehension necessary?	15	Oct 26, 2010
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	1	Feb 1, 2004

Why Is Escaping Data Considered So Magical?

Lawrence D'Oliveiro

Roy Smith

Owen Jacobson

Lawrence D'Oliveiro

Lawrence D'Oliveiro

Cameron Simpson

Nobody

Paul Rubin

Jorgen Grahn

John Nagle

Nobody

Ian Kelly

Lawrence D'Oliveiro

Roy Smith

Lawrence D'Oliveiro

Lawrence D'Oliveiro

Owen Jacobson

Robert Kern

Tim Chase

Ian Kelly

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads