Why Is Escaping Data Considered So Magical?


Carl Banks

snprintf goes to great lengths to be safe, in fact.  You might be
thinking of strncpy.

Indeed, strncpy does not copy that final NUL if it's at or beyond the
nth element. Probably the most mind-bogglingly stupid thing about the
standard C library, which has lots of mind-boggling stupidity.

Whenever I audit someone's C code, the first thing I do is search for
strncpy and check whether they set the nth character to 0. (They
usually haven't.)
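
Here's the pitfall and the fix in miniature (buffer size picked
arbitrarily for illustration):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char dst[8];
    const char *src = "longer than eight bytes";

    /* strncpy fills all 8 bytes with data and writes no NUL,
       so dst is not a valid C string after this call. */
    strncpy(dst, src, sizeof dst);

    /* The fix: force termination yourself. */
    dst[sizeof dst - 1] = '\0';

    printf("%s\n", dst);  /* safe only because of the line above */
    return 0;
}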

Carl Banks
 

Jorgen Grahn

snprintf goes to great lengths to be safe, in fact. You might be
thinking of strncpy.

Yes, it was indeed strncpy I was thinking of. Thanks.

But actually, the snprintf(3) man page I have is not 100% clear on
this issue, so last time I used it, I added a manual NUL-termination
plus a comment saying I wasn't sure it was needed. I normally use C++
or Python, so I am a bit rusty on these things.
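
Something like this is what I ended up with; if C99 really does
guarantee termination for nonzero sizes (as I believe it does), the
manual NUL is redundant but harmless:

#include <stdio.h>

int main(void)
{
    char buf[16];

    /* C99 guarantees buf is NUL-terminated here, truncating
       the output if it doesn't fit. */
    snprintf(buf, sizeof buf, "%s", "a string that will not fit");

    /* The belt-and-braces termination I mentioned above. */
    buf[sizeof buf - 1] = '\0';

    printf("%s\n", buf);
    return 0;
}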

/Jorgen
 

Gregory Ewing

Carl said:
Indeed, strncpy does not copy that final NUL if it's at or beyond the
nth element. Probably the most mind-bogglingly stupid thing about the
standard C library, which has lots of mind-boggling stupidity.

I don't think it was as stupid as that back when C was
designed. Every byte of memory was precious in those days,
and if you had, say, 10 bytes allocated for a string, you
wanted to be able to use all 10 of them for useful data.

So the convention was that a NUL byte was used to mark
the end of the string *if it didn't fill all the available
space*. Functions such as strncpy and snprintf are designed
for use with strings that follow this convention. Proper
usage requires being cognizant of the maximum length and
using appropriate length-limited functions for all operations
on such strings.
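
For instance (field width invented for illustration):

#include <stdio.h>
#include <string.h>

#define NAME_LEN 10

int main(void)
{
    /* Fixed-width field: NUL-padded only when the data is
       shorter than the field -- exactly what strncpy implements. */
    char name[NAME_LEN];
    strncpy(name, "abcdefghij", NAME_LEN);  /* all 10 bytes used, no NUL */

    /* Every use must be length-limited; %.*s prints at most
       NAME_LEN characters and needs no terminator. */
    printf("%.*s\n", NAME_LEN, name);
    return 0;
}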
 

Paul Rubin

Gregory Ewing said:
I don't think it was as stupid as that back when C was
designed. Every byte of memory was precious in those days,
and if you had, say, 10 bytes allocated for a string, you
wanted to be able to use all 10 of them for useful data.

No I don't think so. Traditional C strings simply didn't carry length
info except for the nul byte at the end. Most string functions expected
the nul to be there. The nul byte convention (instead of having a
header word with a length) arguably saved some space both by eliminating
a multi-byte header and by allowing trailing substrings to be
represented as pointers into a larger string. In retrospect it seems
like a big error.
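
The trailing-substring trick is easy to see; the nul convention is
what makes this aliasing free:

#include <stdio.h>

int main(void)
{
    const char *s = "hello world";
    const char *suffix = s + 6;  /* "world": a valid string, no copy */

    /* With a length-header representation, the suffix would need
       its own header, hence its own allocation. */
    printf("%s\n", suffix);
    return 0;
}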
 

Dennis Lee Bieber

Also, I was asking about databases. "SQL is a text language" is not
the answer to the question "Why do RDBs use string commands instead of
binary APIs?"
Try this: Why do RDBMs use SQL?

Before SQL (and relational databases) became common, one had to
learn an interface that was specific to each database engine (and which
had quite a different look and feel if the underlying engine was
hierarchical or a DBTG network [relational was mostly a theoretical view
for manipulating databases stored under hierarchical or network
engines]). If one was lucky, there was even an interactive query
language processor.

Coding for something like a DBTG network database did not allow for
easy changes in queries... What would be a simple join in SQL meant
traversing a circular linked list in the DBTG database my college
taught. E.g.: loop: get next "master" record; loop: get next sub-record
[etc., until all the needed data is retrieved] until back to the master;
until back to the top of the database.

SQL started as an interactive query language, meant to be typed by
(knowledgeable) users at a command prompt. But since it melded with
relational databases so well, it became a de facto standard query
language not only for interactive queries but as a common, semi-portable
API for embedding into code -- no DBMS-specific procedural function
library needed, just one interface to send a query and one to retrieve
result records.

(Ever notice the cyclic history?

50s: lots of mixed flat files.

60s: hierarchical databases, in which some master record type has links
to related records. But the data is stored in a tree, so finding data
fast really needed careful database design to avoid having to traverse
too much of the tree. Imagine needing to read department information
records to access personnel records to access promotion/pay-raise
records to find the current pay rate to produce the weekly paycheck for
an employee; and if the employee changes department, you have to move
all their personnel data [promotion history, etc.] from one link to
another, or duplicate the personnel record, saving an "end date" in the
first department record and an effective start date on the new
department copy.

70s: network databases, easier to traverse as each record type could
link to any other record type -- doing payroll did not require reading
department records.

80s: relational, wherein nothing is linked via pointers but only by
logical comparisons of fields [and so easily implemented as sets of
flat files again <G>].)
 

Dennis Lee Bieber

4. MySQL AB finally get off their collective duffs and adds real
parameter separation to the MySQL wire protocol, and implements real
prepared statements to massive speed gains in scenarios that are

They did with version 5 of MySQL... Also added triggers and stored
procedures as I recall (though possibly limited functionality). But
MySQLdb is still compatible with versions 3.x and 4.x (with some
difficulties in the connection string password handling). It is MySQLdb
that is not version aware and uses the old, established "complete SQL
string" query system.

Does MySQL AB still exist? I thought Sun absorbed MySQL, and Oracle
has absorbed Sun...
 

Dennis Lee Bieber

I'm not talking about SQL, I'm talking about RDBs. But I guess it is
important for serious RDBs to support queries complex enough that a
language like SQL is really needed to express it--even if being called
from an expressive language like Python. Not everything is a simple
inner join. I defer to the community then, as my knowledge of
advanced SQL is minimal.
SQL is almost a hybrid of relational algebra and relational
calculus, though typically considered more in the latter category. (The
simplistic definition of the two is that in RA one specifies /how to/
obtain a result, whereas in RC one specifies /what/ the result should
look like and lets the engine figure out how to generate it.)

"Select field, field, ..., field from ..." is algebra "project"
operation... In RA you'd have to specify the steps...

x1 = join(t1, t2)
x2 = restrict(x1, t1.fld1 = t2.fld3)
result = select(x2, field, ..., field)

SQL:
select field, ..., field from t1, t2 where t1.fld1 = t2.fld3

(implicit join, just as the algebra is a full cross product)

The classical example of RC is IBM's QBE (query by example) -- which
drew single record tables on the screen, and one filled in a result
table with references to fields in the sources, and included (somehow)
the join criteria...

Somewhere in storage I should have a 400 page text on relational
database theory, which covers relational algebra and calculus, but
predates SQL.
 

Dennis Lee Bieber

(This is an area where parametrized queries are even more important: but
I'm not sure if MySQL does proper prepared queries and caching of
execution plans).

MySQL version 5 finally added prepared statements and a discrete
parameter passing mechanism...

However, since there likely are many MySQL v4.x installations out
there, which only work with complete string SQL, MySQLdb still formats
full SQL statements (and it uses the Python % string interpolation to do
that, after converting/escaping parameters -- which is why %s is the
only allowed placeholder; even a numeric parameter has been converted to
a quoted string before being inserted in the SQL).
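
For comparison, a rough sketch of what the discrete parameter
mechanism looks like through the MySQL C API's prepared-statement
interface (error handling omitted; the table and column names are
invented for illustration):

#include <mysql/mysql.h>
#include <string.h>

/* Assumes an already-open connection `conn` and a table
   `users` with an integer `id` column (both hypothetical). */
void fetch_user(MYSQL *conn, int user_id)
{
    MYSQL_STMT *stmt = mysql_stmt_init(conn);
    const char *query = "SELECT name FROM users WHERE id = ?";
    mysql_stmt_prepare(stmt, query, strlen(query));

    /* The value travels separately from the SQL text, so no
       quoting or escaping is ever applied to it. */
    MYSQL_BIND param;
    memset(&param, 0, sizeof param);
    param.buffer_type = MYSQL_TYPE_LONG;  /* int parameter */
    param.buffer = &user_id;
    mysql_stmt_bind_param(stmt, &param);

    mysql_stmt_execute(stmt);
    /* ... bind result buffers and loop on mysql_stmt_fetch() ... */
    mysql_stmt_close(stmt);
}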

It would be nice if MySQLdb could become version aware in a future
release, and use prepared statements on v5 engines... I doubt it can
drop the existing string based queries any time soon... Consider the
arguments about how long Python 2.x will be in use (I'm still on 2.5)...
Imagine the sluggishness in having database engines converted
(especially in a shared provider environment, where the language
specific adapters also need updating -- ODBC drivers, etc.)
 

Lawrence D'Oliveiro

Kushal said:
What am I passing, then?

Here's what gcc tells me (I declared buf as char buf[512]):
sprintf.c:8: warning: passing argument 1 of ‘snprintf’ from
incompatible pointer type
/usr/include/stdio.h:363: note: expected ‘char * __restrict__’ but
argument is of type ‘char (*)[512]’

You just need to lose the & from the macro.

Why does this work, then:

ldo@theon:hack> cat test.c
#include <stdio.h>

int main(int argc, char ** argv)
 {
   char buf[512];
   const int a = 2, b = 3;
   snprintf(&buf, sizeof buf, "%d + %d = %d\n", a, b, a + b);
   fprintf(stdout, buf);
   return
       0;
 } /*main*/
ldo@theon:hack> ./test
2 + 3 = 5
 

Lawrence D'Oliveiro

Carl said:
Seriously, almost every other kind of library uses a binary API. What
makes databases so special that they need a string-command based API?

HTML is also effectively a string-based API. And what about regular
expressions? And all the functionality available through the subprocess
module and its predecessors?

The reality is, embedding one language within another is a fact of life. I
think it’s important for programmers to be able to deal correctly with it.
 

Lawrence D'Oliveiro

For SQL, use stored procedures or prepared statements.

So feel free to rewrite my example using either stored procedures or
prepared statements, to prove how much easier it is.
 

Roy Smith

Paul Rubin said:
No I don't think so. Traditional C strings simply didn't carry length
info except for the nul byte at the end. Most string functions expected
the nul to be there. The nul byte convention (instead of having a
header word with a length) arguably saved some space both by eliminating
a multi-byte header and by allowing trailing substrings to be
represented as pointers into a larger string. In retrospect it seems
like a big error.

Null-terminated strings predate C. Various assembler languages had
ASCIIZ (or similar) directives long before that.

The nice thing about null-terminated strings is how portable they have
been over various word lengths. Life would have been truly inconvenient
if K&R had picked, say, a 16-bit length field, and then we needed to
bump that up to 32 bits in the 80's, and again to 64 bits in the 90's.
 

Steven D'Aprano

The nice thing about null-terminated strings is how portable they have
been over various word lengths. Life would have been truly inconvenient
if K&R had picked, say, a 16-bit length field, and then we needed to
bump that up to 32 bits in the 80's, and again to 64 bits in the 90's.

Or a Pascal 8 bit length field.

However the cost of null-terminated strings is that they can't store
binary data, and worse, they're slow. In fact, according to some, null-
terminated strings are the *worst* way to implement a string type.

http://www.joelonsoftware.com/articles/fog0000000319.html
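
That's the point the linked article makes with Shlemiel the painter:
since the terminator is the only length information, every strcat has
to rescan from the start, so appending in a loop goes quadratic. A
sketch of the problem and the usual workaround (sizes invented):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *words[] = { "null", "-", "terminated", NULL };

    /* Shlemiel version: each strcat walks all of `out` from the
       start just to find the end, so n appends cost O(n^2). */
    char out[64] = "";
    for (const char **w = words; *w; w++)
        strcat(out, *w);

    /* Workaround: remember where the end is and append there. */
    char out2[64];
    char *end = out2;
    for (const char **w = words; *w; w++) {
        size_t len = strlen(*w);
        memcpy(end, *w, len);
        end += len;
    }
    *end = '\0';

    printf("%s\n%s\n", out, out2);
    return 0;
}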
 

Peter H. Coffin

Coding for something like a DBTG network database did not allow for
easy changes in queries... What would be a simple join in SQL was
traversing a circular linked list in the DBTG database my college
taught. EG: loop get next "master" record; loop get next sub-record
[etc. until all needed data retrieved] until back to master; until back
to top of database.

We'll also note that most of these you'd have to map out where each
field in a record was by hand, any time you wanted to open the file.
Often several times, because there would be multiple record layouts per
file.
 

Kushal Kumaran

In message <...>, Kushal
Kumaran wrote:

On Sun, Jun 27, 2010 at 9:47 AM, Lawrence D'Oliveiro

A long while ago I came up with this macro:

#define Descr(v) &v, sizeof v

making the correct version of the above become

snprintf(Descr(buf), foo);

Not quite right.  If buf is a char array, as suggested by the use of
sizeof, then you're not passing a char* to snprintf.

What am I passing, then?

Here's what gcc tells me (I declared buf as char buf[512]):
sprintf.c:8: warning: passing argument 1 of ‘snprintf’ from
incompatible pointer type
/usr/include/stdio.h:363: note: expected ‘char * __restrict__’ but
argument is of type ‘char (*)[512]’

You just need to lose the & from the macro.

Why does this work, then:

ldo@theon:hack> cat test.c
#include <stdio.h>

int main(int argc, char ** argv)
 {
   char buf[512];
   const int a = 2, b = 3;
   snprintf(&buf, sizeof buf, "%d + %d = %d\n", a, b, a + b);
   fprintf(stdout, buf);
   return
       0;
 } /*main*/
ldo@theon:hack> ./test
2 + 3 = 5

By accident. I hope your compiler warned you about your snprintf call.

Reading these threads might help you understand how char* and char
(*)[512] are different:

http://groups.google.com/group/comp.lang.c++/browse_thread/thread/24708a9204061ce/848ceaf5ec774d81

http://groups.google.com/group/comp...read/thread/fe264c550947a2e5/32b330cdf8aba3d6
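
In short (array size matching the example above):

#include <stdio.h>

int main(void)
{
    char buf[512];

    char *p = buf;           /* array decays to pointer to first element */
    char (*pa)[512] = &buf;  /* pointer to the whole array: another type */

    /* Both hold the same address, which is why the snprintf call
       "works" -- but only p has the type snprintf expects. */
    printf("%p %p\n", (void *)p, (void *)pa);
    return 0;
}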
 

Dennis Lee Bieber

We'll also note that most of these you'd have to map out where each
field in a record was by hand, any time you wanted to open the file.
Often several times, because there would be multiple record layouts per
file.

Ah yes -- you did have to know the entire record structure
beforehand to create the "image" for processing...

And the database engine on the Xerox Sigma running CP/V really got
nasty -- you had to preallocate the expected disk space for the entire
database ahead of time. The engine used what CP/V* called a "random" file
-- you asked for a CONTIGUOUS chunk of disk space, and the OS maintained
NO information about the contents (not even an equivalent of EOF).


* CP/V had some interesting features: file types of "consecutive",
"keyed", and "random". As mentioned, "random" also implied contiguous
disk space allocation; "consecutive" and "keyed" could be disjoint disk
sectors. "Consecutive" is closest to the UNIX "stream"; start from the
beginning and just read... "Keyed" were ISAM files -- and were the most
common file type! The line editor (mid-70s here, editor was line
oriented) used "line numbers" as ISAM keys, so even source code was
being stored in an ISAM file (and the FORTRAN direct access I/O "record
number", as in
read(unit, rec=#) buffer
was not the more common record_length * (#-1) offset; it was an ISAM
key!)
 

Nobody

HTML is also effectively a string-based API.

HTML is a data format. The sane way to construct or manipulate HTML is via
the DOM, not string operations.
And what about regular expressions?

What about them? As the saying goes:

Some people, when confronted with a problem, think
"I know, I'll use regular expressions."
Now they have two problems.

They have some uses, e.g. defining tokens[1]. Using them to match more
complex constructs is error-prone and should generally be avoided unless
you're going to manually verify the result. Oh, and you should never
generate regexps dynamically; that way madness lies.

[1] Assuming that the language's tokens can be described by a regular
grammar. This isn't always the case, e.g. you can't tokenise PostScript
using regexps, as string literals can contain nested parentheses.
And all the functionality available through the subprocess
module and its predecessors?

The main reason why everyone recommends subprocess over its predecessors
is that it allows you to bypass the shell, which is one of the most
common sources of the type of error being discussed in this thread.

IOW, rather than having to construct a shell command which (hopefully)
will pass the desired arguments to the child, you just pass the desired
arguments to the child directly, without involving the shell.
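
The same idea in C, for what it's worth: system() hands one command
string to the shell, while the exec family takes an argument vector
directly, so shell metacharacters in an argument are never
interpreted. A sketch (the hostile filename is invented):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* If this were pasted into a system() command line, the
       shell would happily run the rm. */
    const char *filename = "foo; rm -rf $HOME";

    pid_t pid = fork();
    if (pid == 0) {
        /* No shell involved: the filename arrives at ls as a
           single argv element, semicolon and all. */
        execlp("ls", "ls", "-l", filename, (char *)NULL);
        perror("execlp");
        _exit(127);
    }
    waitpid(pid, NULL, 0);
    return 0;
}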
The reality is, embedding one language within another is a fact of life. I
think it’s important for programmers to be able to deal correctly with it.

That depends upon what you mean by "embedding". The correct way to use
code written in one language from code written in another is to make the
first accept parameters and make the second pass them, not to have the
second (try to) generate the former dynamically.

Sometimes dynamic code generation is inevitable (e.g. if you're writing a
compiler, you probably need to generate assembler or C code), but it's not
to be done lightly, and it's unwise to take shortcuts (e.g. ad-hoc string
substitutions).
 

Roy Smith

Nobody said:
What about them? As the saying goes:

Some people, when confronted with a problem, think
"I know, I'll use regular expressions."
Now they have two problems.

That's silly. RE is a good tool. Like all good tools, it is the right
tool for some jobs and the wrong tool for others.

I've noticed over the years a significant anti-RE sentiment in the
Python community. One reason, I suppose, is that Python gives you
some good string manipulation tools, e.g. split(), startswith(),
endswith(), and the 'in' operator, which cover many of the common RE use
cases. But there are still plenty of times when a RE is the best tool
and it's worth investing the effort to learn how to use them effectively.

One tool that Python gives you which makes RE a pleasure is raw strings.
Getting rid of all those extra backslashes really helps improve
readability.

Another great feature is VERBOSE. I've written some truly complicated
REs using that, and still been able to figure out what they meant the
next day :)
 

Stephen Hansen

That's silly. RE is a good tool. Like all good tools, it is the right
tool for some jobs and the wrong tool for others.

There's nothing silly about it.

It is an exaggeration, though; but it does represent a good thing to
keep in mind.

Yes, re is a tool -- and a useful one at that. But it's also a tool
which /seems/ like an omnitool capable of tackling everything.

Regular expressions are a complicated mini-language, well suited to
extensive use in a Unix-type environment where you want to embed certain
'what to operate on' logic into many different commands that aren't
languages at all -- and Perl embraced them as its answer to text
problems. Which is fine.

In Python, certainly it has its uses: many of them, in fact, and in many
it really is the best solution.

It's not just that it's the right tool for some jobs and the wrong tool
for others, or that -- as you also said -- Python provides a rather rich
string type which can do many common tasks natively and better; it's
that regular expressions live in the front of the mind of so many people
coming to the language that they're the first thing they think of, and
what should be simple becomes difficult.

So people quote that proverb. It's a good proverb. Like all proverbs,
it's not perfectly applicable to all situations. But it does have an
important lesson to it: you should generally not consider re to be the
solution you're looking for until you are quite sure there's nothing
else that solves the same task.

It obviously applies less to the gurus who know all about regular
expressions and their subtleties, including potential pathological
behavior.

--

... Stephen Hansen
... Also: Ixokai
... Mail: me+list/python (AT) ixokai (DOT) io
... Blog: http://meh.ixokai.io/
 

Mark Lawrence

On 29/06/2010 01:55, Roy Smith wrote:

[snips]
The nice thing about null-terminated strings is how portable they have
been over various word lengths.

The bad thing about null-terminated strings is the number of off-by-one
errors they've helped to create. I obviously have never created an
off-by-one error myself. :)

Kindest regards.

Mark Lawrence.
 
