Why Is Escaping Data Considered So Magical?

Lawrence D'Oliveiro · Jun 30, 2010

Kushal said:
Why does this work, then:

ldo@theon:hack> cat test.c
#include <stdio.h>

int main(int argc, char ** argv)
{
char buf[512];
const int a = 2, b = 3;
snprintf(&buf, sizeof buf, "%d + %d = %d\n", a, b, a + b);
fprintf(stdout, buf);
return 0;
} /*main*/
ldo@theon:hack> ./test
2 + 3 = 5


By accident.

I have yet to find an architecture or C compiler where it DOESN'T work.

Feel free to try and prove me wrong.

Lawrence D'Oliveiro · Jun 30, 2010

Jorgen Grahn said:
Jorgen Grahn said:

I thought it was well-known that the solution is *not* to try to
sanitize the input -- it's to switch to an interface which doesn't
involve generating an intermediate executable. In the Python example,
that would be something like os.popen2(['zcat', '-f', '--', untrusted]).


That's what I mean. Why do people consider input sanitization so hard?


I'm not sure you understood me correctly, because I advocate
*not* doing input sanitization. Hard or not -- I don't want to know,
because I don't want to do it.

But no-one has yet managed to come up with an alternative that involves less
work.

Carl Banks · Jun 30, 2010

MySQL version 5 finally added prepared statements and a discrete
parameter passing mechanism...

However, since there likely are many MySQL v4.x installations out
there, which only work with complete string SQL, MySQLdb still formats
full SQL statements (and it uses the Python % string interpolation to do
that, after converting/escaping parameters -- which is why %s is the
only allowed placeholder; even a numeric parameter has been converted to
a quoted string before being inserted in the SQL).

It would be nice if MySQLdb could become version aware in a future
release, and use prepared statements on v5 engines... I doubt it can
drop the existing string based queries any time soon... Consider the
arguments about how long Python 2.x will be in use (I'm still on 2.5)...
Imagine the sluggishness in having database engines converted
(especially in a shared provider environment, where the language
specific adapters also need updating -- ODBC drivers, etc.)

Thanks, your replies to this subthread have been most enlightening.

Carl Banks

Michael Torrie · Jun 30, 2010

I have yet to find an architecture or C compiler where it DOESN'T work.

Feel free to try and prove me wrong.

Okay, I will. Your code passes a char** when a char* is expected. Every
compiler I know of will give you a *warning*. Mistaking char*, char**,
and char[] is a common mistake that almost every C programmer makes in the
beginning. Now for the proof:

Consider this variation where I use a dynamically allocated buffer
instead of static:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv)
{
char *buf = malloc(512 * sizeof(char));
const int a = 2, b = 3;
snprintf(&buf, sizeof buf, "%d + %d = %d\n", a, b, a + b);
fprintf(stdout, buf);
free(buf);
return 0;
} /*main*/

On my machine, an immediate segfault (stack overrun). Your code only
works because your buf is statically allocated, which means &buf==buf.
But this equivalence does not hold for any other situation. If your
buffer was dynamically allocated on the heap, instead of passing a
pointer to the buffer (which *is* what buf itself is), you are passing a
pointer to the pointer, which is where buf is stored on the stack, but
not the buffer itself. Instant stack corruption.

Michael Torrie · Jun 30, 2010

But no-one has yet managed to come up with an alternative that involves less
work.

Your case is still not persuasive.

How is using the DB API's placeholders and parameterization more work?
It's the same amount of keystrokes, perhaps even less. You would just
be substituting the API's parameter placeholders for Python's. In fact
with Psycopg2 and the mysql python db apis, it's almost a matter of
simply removing the "%" and putting in a comma, turning python's string
substitution into a method call. And you can leave out the quotes
around where the variables go. If I have to sanitize every input, I
have to do it on each and every field on each and every form action.
With the DB API doing the work I just do it once, in one place. Is this
not easier than manually escaping everything and then embedding it in
the query string?

I've not used sqlalchemy, but it looks similarly easy.

Michael Torrie · Jun 30, 2010

#include <stdio.h>

int main(int argc, char ** argv)
{
char *buf = malloc(512 * sizeof(char));
const int a = 2, b = 3;
snprintf(&buf, sizeof buf, "%d + %d = %d\n", a, b, a + b);

^^^^^^^^^^
Make that 512*sizeof(buf)

Still segfaults though.

Michael Torrie · Jun 30, 2010

^^^^^^^^^^
Make that 512*sizeof(buf)

Sigh. Try again. How about "512 * sizeof(char)" ? Still doesn't make
a difference. The code still crashes because the &buf is incorrect.

Another reason python programming is just so much funner and easier!

This little diversion is fun though. C is pretty powerful and I enjoy
it, but it sure keeps one on one's toes. I made a similar mistake to
the &buf thing years ago when I thought I could return strings (char *)
from functions on the stack the way Pascal and BASIC could. It was only
by pure luck that my code worked as the part of the stack being accessed
was invalid and could have been overwritten.

Carl Banks · Jun 30, 2010

I don't think it was as stupid as that back when C was
designed. Every byte of memory was precious in those days,
and if you had, say, 10 bytes allocated for a string, you
wanted to be able to use all 10 of them for useful data.

So the convention was that a NUL byte was used to mark
the end of the string *if it didn't fill all the available
space*.

I can't think of any function in the standard library that observes
that convention, which inclines me to disbelieve this convention ever
really existed. If it did, there would be functions to support it.

For that matter, I'm not really inclined to believe bytes were *that*
precious in those days.

Functions such as strncpy and snprintf are designed
for use with strings that follow this convention. Proper
usage requires being cognizant of the maximum length and
using appropriate length-limited functions for all operations
on such strings.

Well, no. Being cognizant of the string's maximum length doesn't make
you able to pass it to printf, or system, or any other C function.

The obvious rationale behind strncpy's stupid behavior is that it's
not a string function at all, but a memory block function, that stops
at a NUL in case you don't care what's after the NUL in a block. But
it leads you to believe it's a string function by its name.

Carl Banks

Jorgen Grahn · Jun 30, 2010

Sigh. Try again. How about "512 * sizeof(char)" ? Still doesn't make
a difference. The code still crashes because the &buf is incorrect.

I haven't tried to understand the rest ... but never write
'sizeof(char)' unless you might change the type later. 'sizeof(char)'
is by definition 1 -- even on odd-ball architectures where a char is
e.g. 16 bits.

/Jorgen

Jorgen Grahn · Jun 30, 2010

I can't think of any function in the standard library that observes
that convention,

Me neither, except strncpy(), according to above.

which inclines me to disbelieve this convention ever
really existed. If it did, there would be functions to support it.

Maybe others existed, but got killed off early. That would make
strncpy() a living fossil, like the Coelacanth ...

For that matter, I'm not really inclined to believe bytes were *that*
precious in those days.

It's somewhat believable. If I handled thousands of student names in a
big C array char[30][], I would resent the fact that 1/30 of the
memory was wasted on NUL bytes. I'm sure plenty of people have done what
Gregory suggests ... but it's not clear that strncpy() was designed to
support those people.

I suppose it's all lost in history.

/Jorgen

Cameron Simpson · Jun 30, 2010

| > Carl Banks wrote:
| > > Indeed, strncpy does not copy that final NUL if it's at or beyond the
| > > nth element. Probably the most mind-bogglingly stupid thing about the
| > > standard C library, which has lots of mind-boggling stupidity.
| >
| > I don't think it was as stupid as that back when C was
| > designed. Every byte of memory was precious in those days,
| > and if you had, say, 10 bytes allocated for a string, you
| > wanted to be able to use all 10 of them for useful data.
| >
| > So the convention was that a NUL byte was used to mark
| > the end of the string *if it didn't fill all the available
| > space*.
|
| I can't think of any function in the standard library that observes
| that convention, which inclines me to disbelieve this convention ever
| really existed. If it did, there would be functions to support it.
|
| For that matter, I'm not really inclined to believe bytes were *that*
| precious in those days.

Jeez. PDP-11s, 16 bit addressing, tiny tiny disc drives!

The original V7 (and probably earlier) UNIX filesystem has 16 byte directory
entries: 2 bytes for an inode and 14 bytes for the name. You could use 14
bytes of that name, and strncpy makes it effective to work with that data
structure.

Shortening something already only 14 bytes (the name) _is_ a big ask,
and it is well worth the unusual convention in play.

| The obvious rationale behind strncpy's stupid behavior is that it's
| not a string function at all, but a memory block function, that stops
| at a NUL in case you don't care what's after the NUL in a block. But
| it leads you to believe it's a string function by its name.

Bah. It's for copying a _string_ into a _buffer_! Strangely, since it
starts with a string (NUL-terminated byte sequence) it begins with
"str". And it _is_ copying, but not into another string.

It is special purpose but perfectly reasonable for the problem at hand.

Nobody · Jun 30, 2010

That's silly. RE is a good tool. Like all good tools, it is the right
tool for some jobs and the wrong tool for others.

"When all you have is a hammer, everything looks like a nail"

Except, REs are more like a turbocharged angle grinder: bloody
dangerous in the hands of a novice.

[I was going to say "hole hawg", but then realised that most of my post
would be a quotation explaining it. The reference is to Neal Stephenson's
essay "In the Beginning was the Command Line".]

I've noticed over the years a significant anti-RE sentiment in the
Python community.

IMHO, the sentiment isn't so much against REs per se, but against
excessive or inappropriate use. Apart from making it easy to write
illegible code, they also make it easy to write code that "mostly sort-of
works" but somewhat harder to write code which is actually correct.

It doesn't help that questions on REs often start out by stating a problem
for which REs are inappropriate, e.g. parsing a context-free (or higher)
language, and in the same sentence indicate that the poster is already
predisposed to using REs.

Roy Smith · Jun 30, 2010

Cameron Simpson said:
Jeez. PDP-11s, 16 bit addressing, tiny tiny disc drives!

What you talking about, tiny? An RK-05 was huge! Why would anybody
ever need more than that?

The original V7 (and probably earlier) UNIX filesystem has 16 byte directory
entries

Certainly earlier. I used v6, and it was like that there. I'm
reasonably sure it pre-dated v6, however.

Michael Torrie · Jun 30, 2010

I haven't tried to understand the rest ... but never write
'sizeof(char)' unless you might change the type later. 'sizeof(char)'
is by definition 1 -- even on odd-ball architectures where a char is
e.g. 16 bits.

You're right. I normally don't use sizeof(char). This is obviously a
contrived example; I just wanted to make the example such that there's
no way the original poster could argue that the crash is caused by
something other than &buf.

Then again, it's always a bad idea in C to make assumptions about
anything. If you're on Windows and want to use the unicode versions of
everything, you'd need to do sizeof(). So using it here would remind
you that when you move to the 16-bit Microsoft unicode versions of
snprintf, you need to change the sizeof(char) lines to sizeof(wchar_t) as well.

Jorgen Grahn · Jun 30, 2010

There's nothing silly about it.

It is an exaggeration though: but it does represent a good thing to keep
in mind.

Not an exaggeration: it's an absolute. It literally says that any time
you try to solve a problem with a regex, (A) it won't solve the problem
and (B) it will in itself become a problem. And it doesn't tell you
why: you're supposed to accept or reject this without thinking.

How can that be a good thing to keep in mind?

I wouldn't normally be annoyed by the quote, but it is thrown around a
lot in various places, not just here.

Yes, re is a tool -- and a useful one at that. But it's also a tool which
/seems/ like an omnitool capable of tackling everything.

That's more like my attitude towards them.

/Jorgen

Stephen Hansen · Jun 30, 2010

Not an exaggeration: it's an absolute. It literally says that any time
you try to solve a problem with a regex, (A) it won't solve the problem
and (B) it will in itself become a problem. And it doesn't tell you
why: you're supposed to accept or reject this without thinking.

How can that be a good thing to keep in mind?

That it speaks in absolutes is what makes it an exaggeration. Yes, it
literally says something kind of like that (Your 'a' is a
mischaracterization).

It's still a very good thing to keep in mind.

It's a "saying" -- a proverb, an expression. Since when are the wise
remarks of our ancient forefathers literal? Not last I checked.

Reading into a saying as not a guide or suggestion or cautionary tale
but instead a doctrinal absolute is where we run into problems, not in
the repeating of them.

--

... Stephen Hansen
... Also: Ixokai
... Mail: me+list/python (AT) ixokai (DOT) io
... Blog: http://meh.ixokai.io/

Terry Reedy · Jun 30, 2010

IMHO, the sentiment isn't so much against REs per se, but against
excessive or inappropriate use. Apart from making it easy to write
illegible code, they also make it easy to write code that "mostly sort-of
works" but somewhat harder to write code which is actually correct.

It doesn't help that questions on REs often start out by stating a problem
for which REs are inappropriate, e.g. parsing a context-free (or higher)
language, and in the same sentence indicate that the poster is already
predisposed to using REs.

They also often start with a problem that is 'sub-relational-grammar'
and easily solved with string methods, and again the OP proposes to use
the overkill of REs. In other words, people ask "How do I do this with
an RE" rather than "What tool should I use for this, and how".

If people asked "How do I push a pin into a corkboard with a (standard)
hammer" or "How do I break up a concrete sidewalk with a (standard)
hammer", it would not be 'anti-hammer sentiment' to suggest another
tool, like pliers or a jackhammer.

Ethan Furman · Jun 30, 2010

Terry said:
They also often start with a problem that is 'sub-relational-grammar'
and easily solved with string methods, and again the OP proposes to use
the overkill of REs. In other words, people ask "How do I do this with
an RE" rather than "What tool should I use for this, and how".

If people asked "How do I push a pin into a corkboard with a (standard)
hammer" or "How do I break up a concrete sidewalk with a (standard)
hammer", it would not be 'anti-hammer sentiment' to suggest another
tool, like pliers or a jackhammer.

I took the time to learn REs about a year ago. It was well worth it,
even though I've only used REs a handful of times since, because when
you need them there is no good substitute. But when you don't, there
are plenty.

~Ethan~

Carl Banks · Jun 30, 2010

| > Carl Banks wrote:
| > > Indeed, strncpy does not copy that final NUL if it's at or beyond the
| > > nth element. Probably the most mind-bogglingly stupid thing about the
| > > standard C library, which has lots of mind-boggling stupidity.
| >
| > I don't think it was as stupid as that back when C was
| > designed. Every byte of memory was precious in those days,
| > and if you had, say, 10 bytes allocated for a string, you
| > wanted to be able to use all 10 of them for useful data.
| >
| > So the convention was that a NUL byte was used to mark
| > the end of the string *if it didn't fill all the available
| > space*.
|
| I can't think of any function in the standard library that observes
| that convention, which inclines me to disbelieve this convention ever
| really existed. If it did, there would be functions to support it.
|
| For that matter, I'm not really inclined to believe bytes were *that*
| precious in those days.

Jeez. PDP-11s, 16 bit addressing, tiny tiny disc drives!

The original V7 (and probably earlier) UNIX filesystem has 16 byte directory
entries: 2 bytes for an inode and 14 bytes for the name. You could use 14
bytes of that name, and strncpy makes it effective to work with that data
structure.

Shortening something already only 14 bytes (the name) _is_ a big ask,
and it is well worth the unusual convention in play.

You are talking about fixed-length memory records, not strings.

I'm saying that bytes were not so precious that, when you operate on
*actual strings*, you need to desperately cut off nul terminators
to save space.

| The obvious rationale behind strncpy's stupid behavior is that it's
| not a string function at all, but a memory block function, that stops
| at a NUL in case you don't care what's after the NUL in a block. But
| it leads you to believe it's a string function by its name.

Bah. It's for copying a _string_ into a _buffer_! Strangely, since it
starts with a string (NUL-terminated byte sequence) it begins with
"str". And it _is_ copying, but not into another string.

I'm going to disagree. The input of strncpy can be either a string or
a memory block, and the output can only be a memory block. In other
words, neither the source nor destination has to be a string. This is
a memory block function, not a string function. The correct name for
this function should have been memcpytonul.

Even if you disagree, then you must admit it should have been called
strcpytobuf. Nothing about the name strncpy gives the slightest
suggestion that the destination is not a string. Based on analogy
from other str functions, none of which have any sources or
destinations that are memory blocks, one would logically expect that
strncpy's destination was a string. It defies common sense.

And there should have been an actual, correctly working strncpy in the
standard library that copies and truncates actual strings.

It is special purpose but perfectly reasonable for the problem at hand.

The usefulness of strncpy's behavior for writing fixed-length memory
blocks is not in question here. The thing that's mind-bogglingly
stupid is that the function that does this is called "strncpy".

Carl Banks

Paul Rubin · Jun 30, 2010

Jorgen Grahn said:
It's somewhat believable. If I handled thousands of student names in a
big C array char[30][], I would resent the fact that 1/30 of the
memory was wasted on NUL bytes.

But you'd be wasting even more of the memory on bytes left unused when
the student's name is less than 30 chars. If memory is that scarce you
need a different representation.
