Unrecognized escape sequences in string literals


Douglas Alan

I said:
But you're right, it's too late to change this now.

P.S. But if it weren't too late, I think that your idea to have "\s"
be the escape sequence for a backslash instead of "\\" might be a good
one.

|>ouglas
 

Douglas Alan

If you don't know what your string literals are, you don't
know what your program does. You can't expect the compiler
to save you from semantic errors. Adding escape codes into
the string literal doesn't change this basic truth.

I grow weary of these semantic debates. The bottom line is
that C++'s strategy here catches bugs early on that Python's
approach doesn't. It does so at no additional cost.

From a purely practical point of view, why would any
language not want to adopt a zero-cost approach to catching
bugs, even if they are relatively rare, as early as
possible?

(Other than the reason that adopting it *now* is sadly too
late.)

Furthermore, Python's strategy here is SPECIFICALLY
DESIGNED, according to the reference manual to catch bugs.
I.e., from the original posting on this issue:

Unlike Standard C, all unrecognized escape sequences
are left in the string unchanged, i.e., the backslash
is left in the string. (This behavior is useful when
debugging: if an escape sequence is mistyped, the
resulting output is more easily recognized as broken.)
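(Concretely, that behaviour looks like this at a Python 2.x prompt of the
era under discussion -- a minimal sketch:)

>>> s = "abc\d"      # \d is not a recognized escape
>>> s
'abc\\d'
>>> len(s)           # the backslash really is left in the string
5
>>> print s
abc\d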

If this "feature" is designed to catch bugs, why be
half-assed about it? Especially since there seems to be
little valid use case for allowing programmers to be lazy in
their typing here.
The compiler can't save you from typing 1234 instead of
11234, or 31.45 instead of 3.145, or "My darling Ho"
instead of "My darling Jo", so why do you expect it to
save you from typing "abc\d" instead of "abc\\d"?

Because in the former cases it can't catch the bug, and
in the latter case, it can.
Perhaps it can catch *some* errors of that type, but only
at the cost of extra effort required to defeat the
compiler (forcing the programmer to type \\d to prevent
the compiler complaining about \d). I don't think the
benefit is worth the cost. You and your friend do. Who is
to say you're right?

Well, Bjarne Stroustrup, for one.

All of these are value judgments, of course, but I truly
doubt that anyone would have been bothered if Python from
day one had behaved the way that C++ does. Additionally, I
expect that if Python had always behaved the way that C++
does, and then today someone came along and proposed the
behavior that Python currently implements, so that the
programmer could sometimes get away with typing a bit less,
such a person would be chided for not understanding the Zen
of Python.
Why do you care if there are "funny characters"?

Because, of course, "funny characters" often have
interesting consequences when output. Furthermore, their
consequences aren't always immediately obvious from looking
at the source code, unless you are intimately familiar with
the function of the special characters in question.

For instance, sometimes in the wrong combination, they wedge
your xterm. Etc.

I'm surprised that this needs to be spelled out.
In C++, if you see an escape you don't recognize, do you
care?

Yes, of course I do. If I need to know what the program
does.
Do you go running for the manual? If the answer is No,
then why do it in Python?

The answer is that I do in both cases.
No. \z *is* a legal escape sequence, it just happens to map to \z.
If you stop thinking of \z as an illegal escape sequence
that Python refuses to raise an error for, the problem
goes away. It's a legal escape sequence that maps to
backslash + z.

(1) I already used that argument on my friend, and he wasn't
buying it. (Personally, I find the argument technically
valid, but commonsensically invalid. It's a language-lawyer
kind of argument, rather than one that appeals to any notion
of real aesthetics.)

(2) That argument disagrees with the Python reference
manual, which explicitly states that "unrecognized escape
sequences are left in the string unchanged", and that the
purpose for doing so is because it "is useful when
debugging".
No, because it actually is an illegal escape sequence.

What makes it "illegal"? As far as I can tell, it's just
another "unrecognized escape sequence". JavaScript treats it
that way. Are you going to be the one to tell all the
JavaScript programmers that their language can't tell a
legal escape sequence from an illegal one?
(1) There is no missing \ in "foo\zbar".

(2) The problem with "\" isn't a missing backslash, but a
missing end-quote.

Says who? All of this really depends on your point of
view. The whole morass goes away completely if one adopts
C++'s approach here.
Python isn't DWIMing here. The rules are simple and straightforward,
there's no mind-reading or guessing required.

It may not be a complex form of DWIMing, but it's still
DWIMing a bit. Python is figuring that if I typed "\z", then
either I must have really meant to type "\\z", or that I
want to see the backslash when I'm debugging because I made
a mistake, or that I'm just too lazy to type "\\z".
Is it "a form of DWIMing" to consider 1.234e1 and 12.34
synonymous?

That's a very different issue, as (1) there are very
significant use cases for both kinds of numerical
representations, and (2) there's often only one obvious way
that the number should be entered, depending on the
coding situation.
What about 86 and 0x44? Is that DWIMing?

See previous comment.
I'm sure both you and your friend are excellent
programmers, but you're tossing around DWIM as a
meaningless term of opprobrium without any apparent
understanding of what DWIM actually is.

I don't know if my friend even knows the term DWIM, other
than me paraphrasing him, but I certainly understand all
about the term. It comes from InterLisp. When DWIM was
enabled, your program would run until it hit an error, and
for certain kinds of errors, it would wait a few seconds for
the user to notice the error message, and if the user didn't
tell the program to stop, it would try to figure out what
the user most likely meant, and then continue running using
the computer-generated "fix".

I.e., more or less like continuing on in the face of what
the Python Reference manual refers to as an "unrecognized
escape sequence".

|>ouglas
 

Steven D'Aprano

But you're right, it's too late to change this now.

Not really. There is a procedure for making non-backwards compatible
changes. If you care deeply enough about this, you could agitate for
Python 3.2 to raise a PendingDepreciation warning for "unexpected" escape
sequences like \z, Python 3.3 to raise a Depreciation warning, and Python
3.4 to treat it as an error.

It may even be possible to skip the PendingDepreciation warning and go
straight for Depreciation warning in 3.2.
 

Steven D'Aprano

I grow weary of these semantic debates. The bottom line is that C++'s
strategy here catches bugs early on that Python's approach doesn't. It
does so at no additional cost.

From a purely practical point of view, why would any language not want
to adopt a zero-cost approach to catching bugs, even if they are
relatively rare, as early as possible?

Because the cost isn't zero. Needing to write \\ in a string literal when
you want \ is a cost, and having to read \\ in source code and mentally
translate that to \ is also a cost. By all means argue that it's a cost
that is worth paying, but please stop pretending that it's not a cost.

Having to remember that \n is a "special" escape and \y isn't is also a
cost, but that's a cost you pay in C++ too, if you want your code to
compile.


By the way, you've stated repeatedly that \y will compile with a warning
in g++. So what precisely do you get if you ignore the warning? What do
other C++ compilers do? Apart from the lack of warning, what actually is
the difference between Python's behaviour and C++'s behaviour?


(Other than the reason that adopting it *now* is sadly too late.)

Furthermore, Python's strategy here is SPECIFICALLY DESIGNED, according
to the reference manual to catch bugs. I.e., from the original posting
on this issue:

Unlike Standard C, all unrecognized escape sequences are left in
the string unchanged, i.e., the backslash is left in the string.
(This behavior is useful when debugging: if an escape sequence is
mistyped, the resulting output is more easily recognized as
broken.)

You need to work on your reading comprehension. It doesn't say anything
about the motivation for this behaviour, let alone that it was
"SPECIFICALLY DESIGNED" to catch bugs. It says it is useful for
debugging. My shoe is useful for squashing poisonous spiders, but it
wasn't designed as a poisonous-spider squashing device.


Because in the former cases it can't catch the bug, and in the
latter case, it can.

I'm not convinced this is a bug that needs catching, but if you think it
is, then that's a reasonable argument.


Well, Bjarne Stroustrup, for one.

Then let him design his own language *wink*

All of these are value judgments, of course, but I truly doubt that
anyone would have been bothered if Python from day one had behaved the
way that C++ does.

If I'm reading this page correctly, Python does behave as C++ does. Or at
least as Larch/C++ does:

http://www.cs.ucf.edu/~leavens/larchc++manual/lcpp_47.html



Yes, of course I do. If I need to know what the program does.

Precisely the same as in Python.

The answer is that I do in both cases.

You deleted without answer my next question:

"And if the answer is Yes, then how is Python worse than C++?"

Seems to me that the answer is "It's not worse than C++, it's the same"
-- in both cases, you have to memorize the "special" escape sequences,
and in both cases, if you see an escape you don't recognize, you need to
look it up.


(1) I already used that argument on my friend, and he wasn't buying it.
(Personally, I find the argument technically valid, but commonsensically
invalid. It's a language-lawyer kind of argument, rather than one that
appeals to any notion of real aesthetics.)

I disagree with your sense of aesthetics. I think that having to write
\\y when I want \y just to satisfy a bondage-and-discipline compiler is
ugly. That's not to say that B&D isn't useful on occasion, but in this
case I believe the benefit is negligible, and so even a tiny cost is not
worth the pain.

The sweet sweet pain... oh wait, sorry, wrong newsgroup...


(2) That argument disagrees with the Python reference manual, which
explicitly states that "unrecognized escape sequences are left in the
string unchanged", and that the purpose for doing so is because it "is
useful when debugging".

How does it disagree? \y in the source code mapping to \y in the string
object is the sequence being left unchanged. And the usefulness of doing
so is hardly a disagreement over the fact that it does so.


What makes it "illegal". As far as I can tell, it's just another
"unrecognized escape sequence".

No, it's recognized, because \x is the prefix for an hexadecimal escape
code. And it's illegal, because it's missing the actual hexadecimal
digits.
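(Roughly what that distinction looks like on a Python 2 interpreter of
that vintage; Python 3 reports the invalid case as a SyntaxError, and the
exact wording varies by version:)

>>> "\x41"           # \x followed by two hex digits: recognized, gives 'A'
'A'
>>> "\x4"            # \x without two hex digits: rejected outright
ValueError: invalid \x escape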

JavaScript treats it that way. Are you
going to be the one to tell all the JavaScript programmers that their
language can't tell a legal escape sequence from an illegal one?

Well, it is Javascript...

All joking aside, syntax varies from one language to another. What counts
as a legal escape sequence in Javascript and what counts as a legal
escape sequence in Python are different. What makes you think I'm talking
about Javascript?

Says who? All of this really depends on your point of view. The whole
morass goes away completely if one adopts C++'s approach here.

But the morass only exists in the first place because you have adopted
C++'s approach instead of Python's approach -- and (possibly) not even a
standard part of the C++ approach, but a non-standard warning provided by
one compiler out of many.


Even if you disagree about (1), it's easy enough to prove that (2) is
correct:
File "<stdin>", line 1
"\"
^
SyntaxError: EOL while scanning single-quoted string


This is the exact same error you get here:

File "<stdin>", line 1
"a
^
SyntaxError: EOL while scanning single-quoted string


It may not be a complex form of DWIMing, but it's still DWIMing a bit.
Python is figuring that if I typed "\z", then either I must have really
meant to type "\\z",

Nope, not in the least. Python NEVER EVER EVER tries to guess what you
mean.

If you type "xyz", it assumes you want "xyz".

If you type "xyz\n", it assumes you want "xyz\n".

If you type "xyz\\n", it assumes you want "xyz\\n".

If you type "xyz\y", it assumes you want "xyz\y".

If you type "xyz\\y", it assumes you want "xyz\\y".

This is *exactly* like C++, except that in Python the semantics of \y and
\\y are identical. Python doesn't guess what you mean, it *imposes* a
meaning on the escape sequence. You just don't like that meaning.
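(Easy enough to verify at the prompt:)

>>> "xyz\y" == "xyz\\y"
True
>>> "xyz\y"
'xyz\\y'
>>> len("xyz\y")
5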


or that I want to see the backslash when I'm
debugging because I made a mistake, or that I'm just too lazy to type
"\\z".

Oh jeez, if you're going to define DWIM so broadly, then *everything* is
DWIM. "If I type '1+2', then the C++ compiler figures out that I must
have wanted to add 1 and 2..."

I don't know if my friend even knows the term DWIM, other than me
paraphrasing him, but I certainly understand all about the term. It
comes from InterLisp. When DWIM was enabled, your program would run
until it hit an error, and for certain kinds of errors, it would wait a
few seconds for the user to notice the error message, and if the user
didn't tell the program to stop, it would try to figure out what the
user most likely meant, and then continue running using the
computer-generated "fix".

Right. And Python isn't doing anything even remotely similar to that.


I.e., more or less like continuing on in the face of what the Python
Reference manual refers to as an "unrecognized escape sequence".

The wording could be better, I accept. It would be better to talk about
"special escapes" (e.g. \n) and "any non-special escape" (e.g. \y).
 

Ethan Furman

Steven said:
Not really. There is a procedure for making non-backwards compatible
changes. If you care deeply enough about this, you could agitate for
Python 3.2 to raise a PendingDepreciation warning for "unexpected" escape
sequences like \z, Python 3.3 to raise a Depreciation warning, and Python
3.4 to treat it as an error.

It may even be possible to skip the PendingDepreciation warning and go
straight for Depreciation warning in 3.2.

And once it's fully depreciated you have to stop writing it off on your
taxes. *wink*

~Ethan~
 

Steven D'Aprano

They call them `non-standard escape sequences' for a reason: that they
are not in standard C++.

test.cpp:
char* temp = "abc\yz";

TEMP> g++ -c test.cpp
test.cpp:1:1: warning: unknown escape sequence '\y'


Isn't that a warning, not a fatal error? So what does temp contain?
 

Douglas Alan

Isn't that a warning, not a fatal error? So what does temp contain?

My "Annotated C++ Reference Manual" is packed, and surprisingly in
Stroustrup's Third Edition, there is no mention of the issue in the
entire 1,000 pages. But Microsoft to the rescue:

If you want a backslash character to appear within a string,
you must type two backslashes (\\)

(From http://msdn.microsoft.com/en-us/library/69ze775t.aspx)

The question of what any specific C++ does if you ignore the warning
is irrelevant, as such behavior in C++ is almost *always* undefined.
Hence the warning.

|>ouglas
 

Ethan Furman

Douglas said:
My "Annotated C++ Reference Manual" is packed, and surprisingly in
Stroustrup's Third Edition, there is no mention of the issue in the
entire 1,000 pages. But Microsoft to the rescue:

If you want a backslash character to appear within a string,
you must type two backslashes (\\)

(From http://msdn.microsoft.com/en-us/library/69ze775t.aspx)

The question of what any specific C++ does if you ignore the warning
is irrelevant, as such behavior in C++ is almost *always* undefined.
Hence the warning.

|>ouglas

Almost always undefined? Whereas with Python, and some memorization or
a small table/list nearby, you can easily *know* what you will get.

Mind you, I'm not really vested in how Python *should* handle
backslashes one way or the other, but I am glad it has rules that it
follows for consistent results, and I don't have to break out a byte-code
editor to find out what's in my string literal.

~Ethan~
 

Douglas Alan

Steven said:
Because the cost isn't zero. Needing to write \\ in a string
literal when you want \ is a cost,

I need to preface this entire post with the fact that I've
already used ALL of the arguments that you've provided on my
friend before I ever even came here with the topic, and my
own arguments on why Python can be considered to be doing
the right thing on this issue didn't even convince ME, much
less him. When I can't even convince myself with an argument
I'm making, then you know there's a problem with it!

Now back to our regularly scheduled debate:

I think that the total cost of all of that extra typing for
all the Python programmers in the entire world is now
significantly less than the time it took to have this
debate. Which would have never happened if Python did things
the right way on this issue to begin with. Meaning that
we're now at LESS than zero cost for doing things right!

And we haven't even yet included all the useless heat that
is going to be generated during code reviews and in-house coding
standard debates.

That's why I stand by Python's motto:

THERE SHOULD BE ONE-- AND PREFERABLY ONLY ONE --OBVIOUS
WAY TO DO IT.
and having to read \\ in source code and mentally
translate that to \ is also a cost.

For me that has no mental cost. What does have a mental cost
is remembering whether "\b" is an "unrecognized escape
sequence" or not.
By all means argue that it's a cost that is worth paying,
but please stop pretending that it's not a cost.

I'm not "pretending". I'm pwning you with logic and common
sense!
Having to remember that \n is a "special" escape and \y
isn't is also a cost, but that's a cost you pay in C++ too,
if you want your code to compile.

Ummm, no I don't! I just always use "\\" when I want a
backslash to appear, and I only think about the more obscure
escape sequences if I actually need them, or some code that
I am reading has used them.
By the way, you've stated repeatedly that \y will compile
with a warning in g++. So what precisely do you get if you
ignore the warning?

A program with undefined behavior. That's typically what a
warning means from a C++ compiler. (Sometimes it means
use of a deprecated feature, though.)
What do other C++ compilers do?

The Microsoft compilers also consider it to be incorrect
code, as I documented in a different post.
Apart from the lack of warning, what actually is the
difference between Python's behavior and C++'s behavior?

That question makes just about as much sense as, "Apart
from the lack of a fatal error, what actually is the
difference between Python's behavior and C++'s?"

Sure, warnings aren't fatal errors, but if you ignore them,
then you are almost always doing something very
wrong. (Unless you're building legacy code.)
You need to work on your reading comprehension. It doesn't
say anything about the motivation for this behaviour, let
alone that it was "SPECIFICALLY DESIGNED" to catch bugs. It
says it is useful for debugging. My shoe is useful for
squashing poisonous spiders, but it wasn't designed as a
poisonous-spider squashing device.

As I have a BS from MIT in BS-ology, I can readily set aside
your aspersions to my intellect, and point out the gross
errors of your ways: Natural language does not work the way
you claim. It is much more practical, implicit, and
elliptical.

More specifically, if your shoe came with a reference manual
claiming that it was useful for squashing poisonous spiders,
then you may now validly assume poisonous spider squashing
was a design requirement of the shoe. (Or at least it has
become one, even if ipso facto.) Furthermore, if it turns out
that the shoe is deficient at poisonous spider squashing,
and consequently causes you to get bitten by a poisonous
spider, then you now have grounds for a lawsuit.
I'm not convinced this is a bug that needs catching, but if
you think it is, then that's a reasonable argument.

All my arguments are reasonable.
Then let him design his own language *wink*

Oh, I'm not sure that's such a good idea. He might come up
with a language as crazy as C++.
Precisely the same as in Python.

Not so at all!

In C++ I have to run for the manual only when someone
actually puts a *real* escape sequence in their code. With
Python, I have to run for the manual (or at least the REPL),
every time some lame-brained person who thinks they should be
allowed near a keyboard programs using "unrecognized escape
sequences" because they can't be bothered to hit the "\" key
twice.
Seems to me that the answer is "It's not worse than C++,
it's the same" -- in both cases, you have to memorize the
"special" escape sequences, and in both cases, if you see
an escape you don't recognize, you need to look it up.

The answer is that in this particular case, C++ causes me
far fewer woes! And if C++ is causing me fewer woes than
Language X, then you've got to know that Language X has a
problem.
I disagree with your sense of aesthetics. I think that
having to write \\y when I want \y just to satisfy a
bondage-and-discipline compiler is ugly. That's not to say
that B&D isn't useful on occasion, but in this case I
believe the benefit is negligible, and so even a tiny cost
is not worth the pain.

EXPLICIT IS BETTER THAN IMPLICIT.
How does it disagree? \y in the source code mapping to \y in
the string object is the sequence being left unchanged. And
the usefulness of doing so is hardly a disagreement over the
fact that it does so.

Because you've stated that "\y" is a legal escape sequence,
while the Python Reference Manual explicitly states that it
is an "unrecognized escape sequence", and that such
"unrecognized escape sequences" are sources of bugs.
No, it's recognized, because \x is the prefix for an
hexadecimal escape code. And it's illegal, because it's
missing the actual hexadecimal digits.

So? Why does that make it "illegal" rather than merely
"unrecognized?"

SIMPLE IS BETTER THAN COMPLEX.
All joking aside, syntax varies from one language to
another. What counts as a legal escape sequence in
Javascript and what counts as a legal escape sequence in
Python are different. What makes you think I'm talking
about Javascript?

Because anyone with common sense will agree that "\y" is an
illegal escape sequence. The only disagreement should then
be how illegal escape sequences should be handled. Python is
not currently handling them in a way that makes the most
sense.

ERRORS SHOULD NEVER PASS SILENTLY.
But the morass only exists in the first place because you
have adopted C++'s approach instead of Python's approach --
and (possibly) not even a standard part of the C++ approach,
but a non-standard warning provided by one compiler out of
many.

Them's fighting words! I rarely adopt the C++ approach to
anything! In this case, (1) C++ just coincidentally happens
to be right, and (2) as far as I can tell, g++ implements
the C++ standard correctly here.
Nope, not in the least. Python NEVER EVER EVER tries to
guess what you mean.

Neither does Perl. That doesn't mean that Perl isn't often
DWIMy.
This is *exactly* like C++, except that in Python the
semantics of \y and \\y are identical. Python doesn't
guess what you mean, it *imposes* a meaning on the escape
sequence. You just don't like that meaning.

That's because I don't like things that are ill-conceived.
The wording could be better, I accept. It would be better
to talk about "special escapes" (e.g. \n) and "any
non-special escape" (e.g. \y).

Or maybe the wording is just fine, and it's the treatment of
unrecognized escape sequences that could be better.

|>ouglas
 

Douglas Alan

Not really. There is a procedure for making non-backwards compatible
changes. If you care deeply enough about this, you could agitate for
Python 3.2 to raise a PendingDepreciation warning for "unexpected" escape
sequences like \z,

How does one do this?

Not that I necessarily think that it is important enough a nit to
break a lot of existing code.

Also, if I "agitate for change", then in the future people might
actually accurately accuse me of agitating for change, when typically
I just come here for a good argument, and I provide a connected series
of statements intended to establish a proposition, but in return I
receive merely the automatic gainsaying of any statement I make.

|>ouglas
 

Douglas Alan

Mind you, I'm not really vested in how Python *should* handle
backslashes one way or the other, but I am glad it has rules that it
follows for consitent results, and I don't have to break out a byte-code
editor to find out what's in my string literal.

I don't understand your comment. C++ generates a warning if you use an
undefined escape sequence, which indicates that your program should be
fixed. If the escape sequence isn't undefined, then C++ does the same
thing as Python.

It would be *even* better if C++ generated a fatal error in this
situation. (g++ probably has an option to make warnings fatal, but I
don't happen to know what that option is.) g++ might not generate an
error so that you can compile legacy C code with it.

In any case, my argument has consistently been that Python should have
treated undefined escape sequences consistently as fatal errors, not
as warnings.

|>ouglas
 

Steven D'Aprano

In any case, my argument has consistently been that Python should have
treated undefined escape sequences consistently as fatal errors,

A reasonable position to take. I disagree with it, but it is certainly
reasonable.

not as warnings.

I don't know what language you're talking about here, because non-special
escape sequences in Python aren't either errors or warnings:
ab\cd
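(Presumably the full session behind that output, at a Python 2 prompt, is
something like:)

>>> print "ab\cd"
ab\cd
>>> len("ab\cd")
5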

No warning is made, because it's not considered an error that requires a
warning. This matches the behaviour of other languages, including C and
bash.
 

Steven D'Aprano

My "Annotated C++ Reference Manual" is packed, and surprisingly in
Stroustrup's Third Edition, there is no mention of the issue in the
entire 1,000 pages. But Microsoft to the rescue:

If you want a backslash character to appear within a string, you
must type two backslashes (\\)

(From http://msdn.microsoft.com/en-us/library/69ze775t.aspx)

Should I assume that Microsoft's C++ compiler treats it as an error, not
a warning? Or is this *still* undefined behaviour, and the MS C++ compiler
will happily compile "ab\cd" to whatever it feels like?

The question of what any specific C++ does if you ignore the warning is
irrelevant, as such behavior in C++ is almost *always* undefined. Hence
the warning.

So a C++ compiler which follows Python's behaviour would be behaving
within the language specifications.

I note that the bash shell, which claims to follow C semantics, also does
what Python does:

$ echo $'a s\trin\g with escapes'
a s     rin\g with escapes


Explain to me again why we're treating underspecified C++ semantics,
which may or may not do *exactly* what Python does, as if it were the One
True Way of treating escape sequences?
 

Steven D'Aprano

I need to preface this entire post with the fact that I've already used
ALL of the arguments that you've provided on my friend before I ever
even came here with the topic, and my own arguments on why Python can be
considered to be doing the right thing on this issue didn't even
convince ME, much less him. When I can't even convince myself with an
argument I'm making, then you know there's a problem with it!


I hear all your arguments, and to play Devil's Advocate I repeat them,
and they don't convince me either. So by your logic, there's obviously a
problem with your arguments as well!

That problem basically boils down to a deep-seated philosophical
disagreement over which philosophy a language should follow in regard to
backslash escapes:

"Anything not explicitly permitted is forbidden"

versus

"Anything not explicitly forbidden is permitted"

Python explicitly permits all escape sequences, with well-defined
behaviour, with the only ones forbidden being those explicitly forbidden:

* hex escapes with invalid hex digits;

* oct escapes with invalid oct digits;

* Unicode named escapes with unknown names;

* 16- and 32-bit Unicode escapes with invalid hex digits.

C++ apparently forbids all escape sequences, with unspecified behaviour
if you use a forbidden sequence, except for a handful of explicitly
permitted sequences.

That's not better, it's merely different.

Actually, that's not true -- that the C++ standard forbids a thing, but
leaves the consequences of doing that thing unspecified, is clearly a Bad
Thing.



[...]
That question makes just about as much sense as, "Apart from the lack of
a fatal error, what actually is the difference between Python's behavior
and C++'s?"

This is what I get:

[steve ~]$ cat test.cc
#include <iostream>
int main(int argc, char* argv[])
{
std::cout << "x\yz" << std::endl;
return 0;
}
[steve ~]$ g++ test.cc -o test
test.cc:4:14: warning: unknown escape sequence '\y'
[steve@soy ~]$ ./test
xyz


So on at least one machine in the world, C++ simply strips out
backslashes that it doesn't recognise, leaving the suffix. Unfortunately,
we can't rely on that, because C++ is underspecified. Fortunately this is
not a problem with Python, which does completely specify the behaviour of
escape sequences so there are no surprises.



[...]
EXPLICIT IS BETTER THAN IMPLICIT.

Quoting the Zen without understanding (especially shouting) doesn't
impress anyone. There's nothing implicit about escape sequences. \y is
perfectly explicit. Look Ma, there's a backslash, and a y, it gives a
backslash and a y!

Implicit has an actual meaning. You shouldn't use it as a mere term of
opprobrium for anything you don't like.


Because you've stated that "\y" is a legal escape sequence, while the
Python Reference Manual explicitly states that it is an "unrecognized
escape sequence", and that such "unrecognized escape sequences" are
sources of bugs.

There's that reading comprehension problem again.

Unrecognised != illegal.

"Useful for debugging" != "source of bugs". If they were equal, we could
fix an awful lot of bugs by throwing away our debugging tools.

Here's the URL to the relevant page:
http://www.python.org/doc/2.5.2/ref/strings.html

It seems to me that the behaviour the Python designers were looking to
avoid was the case where the coder accidentally inserted a backslash in
the wrong place, and the language stripped the backslash out, e.g.:

Wanted "a\bcd" but accidentally typed "ab\cd" instead, and got "abcd".

(This is what Bash does by design, and at least some C/C++ compilers do,
perhaps by accident, perhaps by design.)

In that case, with no obvious backslash, the user may not even be aware
that there was a problem:

s = "ab\cd" # assume the backslash is silently discarded
assert len(s) == 4
assert s[2] == 'c'
assert '\\' not in s

All of these tests would wrongly pass, but with Python's behaviour of
leaving the backslash in, they would all fail, and the string is visually
distinctive (it has an obvious backslash in it).
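(Checking the actual values under Python's rules:)

>>> s = "ab\cd"
>>> len(s)
5
>>> s[2]
'\\'
>>> '\\' in s
True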

Now, if you consider that \c should be an error, then obviously it would
be even better if "ab\cd" would raise a SyntaxError. But why consider \c
to be an error?



[invalid hex escape sequences]
So? Why does that make it "illegal" rather than merely "unrecognized?"

Because the empty string is not a legal pair of hex digits.

In '\y', the suffix y is a legal character, but it isn't recognized as a
"special" character.

In '\x', the suffix '' is not a pair of hex digits. Since hex-escapes are
documented as requiring a pair of hex digits, this is an error.


[...]
Because anyone with common sense will agree that "\y" is an illegal
escape sequence.

"No True Scotsman would design a language that behaves like that!!!!"

Why should it be illegal? It seems like a perfectly valid escape sequence
to me, so long as the semantics are specified explicitly.



[...]
Neither does Perl. That doesn't mean that Perl isn't often DWIMy.

Fine, but we're not discussing Perl, we're discussing Python. Perl's DWIM-
iness is irrelevant.


That's because I don't like things that are ill-conceived.

And yet you like C++... go figure *wink*
 

Douglas Alan

A reasonable position to take. I disagree with it, but it is certainly
reasonable.


I don't know what language you're talking about here, because non-special
escape sequences in Python aren't either errors or warnings:


ab\cd

I was talking about C++, whose compilers tend to generate warnings for
this usage. I think that the C++ compilers I've used take the right
approach, only ideally they should be *even* more emphatic, and
elevate the problem from a warning to an error.

I assume, however, that the warning is a middle ground between doing
the completely right thing, and, I assume, maintaining backward
compatibility with common C implementations. As Python never had to
worry about backward compatibility with C, Python didn't have to walk
such a middle ground.

On the other hand, *now* it has to worry about backward compatibility
with itself.

|>ouglas
 

Douglas Alan

Should I assume that Microsoft's C++ compiler treats it as an error, not
a warning?

In my experience, C++ compilers generally generate warnings for such
situations, where they can. (Clearly, they often can't generate
warnings for running off the end of an array, which is also undefined,
though a really smart C++ compiler might be able to generate a warning
in certain such circumstances.)
Or is this *still* undefined behaviour, and the MS C++ compiler
will happily compile "ab\cd" to whatever it feels like?

If it's a decent compiler, it will generate a warning. Who can say
with Microsoft, though? It's clearly documented as illegal code,
however.
So a C++ compiler which follows Python's behaviour would be behaving
within the language specifications.

It might be, but there are also *recommendations* in the C++ standard
about what to do in such situations, and the recommendations say, I am
pretty sure, not to do that, unless the particular compiler in
question has to meet some very specific backward compatibility needs.
I note that the bash shell, which claims to follow C semantics, also does
what Python does:

$ echo $'a s\trin\g with escapes'
a s     rin\g with escapes

Really? Not on my computers. (One is a Mac, and the other is a Fedora
Core Linux box.) On my computers, bash doesn't seem to have *any*
escape sequences, other than \\, \", \$, and \`. It seems to treat
unknown escape sequences the same as Python does, but as there are
only four known escape sequences, and they are all meant merely to
guard against string interpolation, and the like, it's pretty darn
easy to keep straight.
Explain to me again why we're treating underspecified C++ semantics,
which may or may not do *exactly* what Python does, as if it were the One
True Way of treating escape sequences?

I'm not saying that C++ does it right for Python. The right thing for
Python to do is to generate an error, as Python doesn't have to deal
with all the crazy complexities that C++ has to.

|>ouglas
 

Douglas Alan

That problem basically boils down to a deep-seated
philosophical disagreement over which philosophy a
language should follow in regard to backslash escapes:

"Anything not explicitly permitted is forbidden"

versus

"Anything not explicitly forbidden is permitted"

No, it doesn't. It boils down to whether a language should:

(1) Try its best to detect errors as early as possible,
especially when the cost of doing so is low.

(2) Make code as readable as possible, in part by making
code as self-evident as possible by mere inspection and by
reducing the amount of stuff that you have to memorize. Perl
fails miserably in this regard, for instance.

(3) To quote Einstein, make everything as simple as
possible, and no simpler.

(4) Take innately ambiguous things and not force them to be
unambiguous by mere fiat.

Allowing a programmer to program using a completely
arbitrary resolution of "unrecognized escape sequences"
violates all of the above principles.

The fact that the meanings of unrecognized escape sequences
are ambiguous is proved by the fact that every language
seems to treat them somewhat differently, demonstrating that
there is no natural intuitive meaning for them.

Furthermore, allowing programmers to use "unrecognized escape
sequences" without raising an error violates:

(1) Explicit is better than implicit:

Python provides a way to explicitly specify that you want a
backslash. Every programmer should be encouraged to use
Python's explicit mechanism here.

(2) Simple is better than complex:

Python currently has two classes of ambiguously
interpretable escape sequences: "unrecognized ones", and
"illegal" ones. Making a single class (i.e. just illegal
ones) is simpler.

Also, not having to memorize escape sequences that you
rarely have need to use is simpler.

(3) Readability counts:

See above comments on readability.

(4) Errors should never pass silently:

Even the Python Reference Manual indicates that unrecognized
escape sequences are a source of bugs. (See more comments on
this below.)

(5) In the face of ambiguity, refuse the temptation to
guess.

Every language, other than C++, is taking a guess at what
the programmer would find to be most useful expansion for
unrecognized escape sequences, and each of the languages is
guessing differently. This temptation should be refused!

You can argue that once it is in the Reference Manual it is
no longer a guess, but that is patently specious, as Perl
proves. For instance, the fact that Perl will quietly convert
an array into a scalar for you, if you assign the array to a
scalar variable is certainly a "guess" of the sort that this
Python koan is referring to. Likewise for an arbitrary
interpretation of unrecognized escape sequences.

(6) There should be one-- and preferably only one --obvious
way to do it.

What is the one obvious way to express "\\y"? Is it "\\y" or
"\y"?

Python can easily make one of these ways the "one obvious
way" by making the other one raise an error.

(7) Namespaces are one honking great idea -- let's do more
of those!

Allowing "\y" to self-expand is intruding into the namespace
for special characters that require an escape sequence.
C++ apparently forbids all escape sequences, with
unspecified behaviour if you use a forbidden sequence,
except for a handful of explicitly permitted sequences.

That's not better, it's merely different.

It *is* better, as it catches errors early on at little
cost, and for all the other reasons listed above.
Actually, that's not true -- that the C++ standard forbids
a thing, but leaves the consequences of doing that thing
unspecified, is clearly a Bad Thing.

Indeed. But C++ has backward compatibility issues that make
any that Python has to deal with pale in comparison. The
recommended behavior for a C++ compiler, however, is to flag
the problem as an error or as a warning.
So on at least one machine in the world, C++ simply strips
out backslashes that it doesn't recognize, leaving the
suffix. Unfortunately, we can't rely on that, because C++
is underspecified.

No, *fortunately* you can't rely on it, forcing you to go
fix your code.
Fortunately this is not a problem with
Python, which does completely specify the behaviour of
escape sequences so there are no surprises.

It's not a surprise when the C++ compiler issues a warning to
you. If you ignore the warning, then you have no one to
blame but yourself.
Implicit has an actual meaning. You shouldn't use it as a
mere term of opprobrium for anything you don't like.

Pardon me, but I'm using "implicit" to mean "implicit", and
nothing more.

Python's behavior here is "implicit" in the very same way
that Perl implicitly converts an array into a scalar for
you. (Though that particular Perl behavior is a far bigger
wart than Python's behavior is here!)
There's that reading comprehension problem again.

Unrecognised != illegal.

This is reasoning that only a lawyer could love.

The right thing for a programming language to do, when
handed something that is syntactically "unrecognized" is to
raise an error.
It seems to me that the behaviour the Python designers
were looking to avoid was the case where the coder
accidentally inserted a backslash in the wrong place, and
the language stripped the backslash out, e.g.:

Wanted "a\bcd" but accidentally typed "ab\cd" instead, and
got "abcd".

The moral of the story is that *any* arbitrary
interpretation of unrecognized escape sequences is a
potential source of bugs. In Python, you just end up with a
converse issue, where one might understandably assume that
"foo\bar" has a backslash in it, because "foo\yar" and
*most* other similar strings do. But then it doesn't.
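(Concretely, at the prompt:)

>>> len("foo\yar")   # \y is unrecognized, so the backslash stays put
7
>>> "foo\bar"        # but \b *is* recognized (backspace)
'foo\x08ar'
>>> len("foo\bar")
6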
And yet you like C++... go figure *wink*

Now that's a bold assertion!

I think that "tolerate C++" is more like it. But C++ does
have its moments.

|>ouglas
 

Steven D'Aprano

No, it doesn't. It boils down to whether a language should:

(1) Try its best to detect errors as early as possible, especially when
the cost of doing so is low.

You are making an unjustified assumption: \y is not an error. It is only
an error if you think that anything not explicitly permitted is forbidden.

While I'm amused that you've made my own point for me, I'm less amused
that you seem to be totally incapable of seeing past your parochial
language assumptions, even when those assumptions are explicitly pointed
out to you. Am I wasting my time engaging you in discussion?

There's a lot more I could say, but time is short, so let me just
summarise:

I disagree with nearly everything you say in this post. I think that a
few points you make have some validity, but the vast majority are based
on a superficial and confused understanding of language design
principles. (I won't justify that claim now, perhaps later, time
permitting.) Nevertheless, I think that your ultimate wish -- for \y etc
to be considered an error -- is a reasonable design choice, given your
assumptions. But it's not the only reasonable design choice, and Bash has
made a different choice, and Python has made yet a third reasonable
choice, and Pascal made yet a fourth reasonable choice.

These are all reasonable choices, all have some good points and some bad
points, but ultimately the differences between them are mostly arbitrary
personal preference, like the colour of a car. Disagreements over
preferences I can live with. One party insisting that red is the only
logical colour for a car, and that anybody who prefers white or black or
blue is illogical, is unacceptable.
 

MRAB

Steven said:
You are making an unjustified assumption: \y is not an error. It is only
an error if you think that anything not explicitly permitted is forbidden.

While I'm amused that you've made my own point for me, I'm less amused
that you seem to be totally incapable of seeing past your parochial
language assumptions, even when those assumptions are explicitly pointed
out to you. Am I wasting my time engaging you in discussion?

There's a lot more I could say, but time is short, so let me just
summarise:

I disagree with nearly everything you say in this post. I think that a
few points you make have some validity, but the vast majority are based
on a superficial and confused understanding of language design
principles. (I won't justify that claim now, perhaps later, time
permitting.) Nevertheless, I think that your ultimate wish -- for \y etc
to be considered an error -- is a reasonable design choice, given your
assumptions. But it's not the only reasonable design choice, and Bash has
made a different choice, and Python has made yet a third reasonable
choice, and Pascal made yet a fourth reasonable choice.
IMHO, it would've been simpler in the long run to say that backslash
followed by one of [0-9A-Za-z] is an escape sequence, backslash followed
by newline is ignored, and backslash followed by anything else is that
something. That way there would be a way to introduce additional escape
sequences without breaking existing code.
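(One way to read that rule as code -- a hypothetical sketch, not Python's
actual behaviour; the names RESERVED, KNOWN and decode are invented here
purely for illustration, and 'body' stands for the text appearing between
the quotes of a literal:)

import string

RESERVED = string.ascii_letters + string.digits   # [0-9A-Za-z]
KNOWN = {'n': '\n', 't': '\t', 'r': '\r'}          # could grow over time

def decode(body):
    out, i = [], 0
    while i < len(body):
        c = body[i]
        if c == '\\' and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == '\n':            # backslash-newline is ignored
                pass
            elif nxt in RESERVED:      # reserved for escape sequences
                if nxt not in KNOWN:   # one reading: unknown => error
                    raise ValueError('unknown escape sequence \\' + nxt)
                out.append(KNOWN[nxt])
            else:                      # anything else is just that character
                out.append(nxt)
            i += 2
        else:
            out.append(c)
            i += 1
    return ''.join(out)

Under that rule, decode(r'a\"b') gives 'a"b', decode('a\\\\b') collapses
the doubled backslash to a single one, and decode(r'a\qb') raises an
error, leaving \q free for some future escape sequence.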
 
