When does the escape character work within raw strings?

walterbyrd

I know that

s = r'x\nx'

means 'x' followed by a literal '\' followed by an 'n' (the '\n' is
not a newline).

s = r'x\tx'

means 'x' followed by a literal '\' followed by a 't' (the '\t' is
not a tab).

But boundaries seem to work differently.

s = re.sub(r'\bxxx\b', 'yyy', s)

is *not* going to look for a literal '\' followed by 'b'.

So I am confused about when escapes work within raw strings and when
they do not.
 
MRAB

The re module receives the regular expression in the form of a string
and then interprets it in its own way.

If you give the re module '\b', that's a backspace character, which is
treated as a literal character, just like 'x'.

If you give it r'\b', that _does_ contain 2 characters, but the re
module interprets it as representing a word boundary, except in a
character set [...] where that would be meaningless, so there it's
interpreted as representing a backspace character.
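
A minimal sketch of that difference, runnable as-is. The sample text and variable names are made up for illustration and are not from the thread:

```python
import re

text = "xxx axxxb xxx"

# r'\b' is two characters (backslash, 'b'); re reads them as a word boundary.
with_boundary = re.sub(r'\bxxx\b', 'yyy', text)

# '\b' is one character (backspace); re matches it literally, like 'x'.
with_backspace = re.sub('\bxxx\b', 'yyy', text)
literal_hit = re.sub('\bxxx\b', 'yyy', 'a\bxxx\bb')

# Inside a character set, backslash-b stands for a backspace character.
in_set = re.findall(r'[\b]', 'a\bb')

print(repr(with_boundary))   # 'yyy axxxb yyy'
print(repr(with_backspace))  # 'xxx axxxb xxx' (no backspaces to find)
print(repr(literal_hit))     # 'ayyyb'
print(in_set)                # ['\x08']
```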
 
walterbyrd

I guess I am confused about when escape characters are
interpreted as escape characters, and when escape characters are not
treated as escape characters.

Sometimes escape characters in regular strings are treated as escape
characters, sometimes not. Same seems to go for raw strings. So how do
I know?

IMO: '\' characters in raw strings should not be given any special
meaning. That would also solve the common problem of r'c:\whatever\'
not working in python. But I digress.


To me this does not seem right. r'\n' should not equal '\n'
 
Rhodri James

> I guess I am confused about when escape characters are
> interpreted as escape characters, and when escape characters are not
> treated as escape characters.

No, you're confused about the number of entirely different things
that are interpreting a string, and the difference between a string
literal and a string object.

> Sometimes escape characters in regular strings are treated as escape
> characters, sometimes not. Same seems to go for raw strings. So how do
> I know?
>
> IMO: '\' characters in raw strings should not be given any special
> meaning. That would also solve the common problem of r'c:\whatever\'
> not working in python. But I digress.

Escaping the delimiting quote is the *one* time backslashes have a
special meaning in raw string literals.

> To me this does not seem right. r'\n' should not equal '\n'

And it doesn't. Let me explain. No, that would take too long,
let me summarise. :)

`s` is a string object containing the character 'x', a newline, and 'x'.

This calls re.sub with a pattern string object that contains a single
newline character. Since this character has no special meaning to the
sub function it faithfully searches `s` for newlines, and, finding one,
replaces it with an 'x' character.

This calls re.sub with a pattern string object that contains two
characters, a backslash followed by an 'n'. This combination *does*
have a special meaning to the sub function, which does its own
translation of the pattern into a single newline character. Then,
as before, it spots the newline in `s` and replaces it with an 'x'.

Note, however, that the string object created by the raw string literal
was *two* characters long. It's re.sub (in common with the rest of the
re module functions) that chooses to interpret the backslash specially.
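
The interpreter session those paragraphs describe is not shown above. A hedged reconstruction, assuming `s` was built as 'x\nx' and that the two re.sub calls differed only in the r prefix (the variable names here are illustrative):

```python
import re

s = 'x\nx'                       # 'x', a newline, 'x'

plain_pattern = '\n'             # one character: a newline
raw_pattern = r'\n'              # two characters: a backslash and 'n'

print(len(plain_pattern), len(raw_pattern))   # 1 2
print(plain_pattern == raw_pattern)           # False

# re.sub translates backslash-n itself, so both calls replace the newline.
print(re.sub(plain_pattern, 'x', s))          # xxx
print(re.sub(raw_pattern, 'x', s))            # xxx
```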
 
walterbyrd

> Escaping the delimiting quote is the *one* time backslashes have a
> special meaning in raw string literals.

If that were true, then wouldn't r'\b' be treated as two characters?

> This calls re.sub with a pattern string object that contains two
> characters, a backslash followed by an 'n'. This combination *does*
> have a special meaning to the sub function, which does its own
> translation of the pattern into a single newline character.

So when do I know when a raw string is treated as a raw string, and
when it's not?
 
Steven D'Aprano

> If that were true, then wouldn't r'\b' be treated as two characters?

It is: len(r'\b') returns 2.

> So when do I know when a raw string is treated as a raw string, and when
> it's not?

You have misunderstood. All strings are strings, but there are different
ways to build a string. Raw strings are not different from ordinary
strings, they're just a different way to *build* an ordinary string.

Here are four ways to make the same string, a backslash followed by a
lowercase b:

"\\b" # use an ordinary string, and escape the backslash
chr(92)+"b" # use the chr() function
"\x5cb" # use a hex escape
r"\b" # use a raw string, no escaping needed

The results you get from all of those (and many, many more!) are the same
string object. They're just written differently as source code.
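
A quick check of that claim, as a small sketch (the list name `ways` is just for illustration):

```python
ways = [
    "\\b",          # ordinary string, escaped backslash
    chr(92) + "b",  # chr() function
    "\x5cb",        # hex escape
    r"\b",          # raw string
]

print(all(w == ways[0] for w in ways))  # True
print([len(w) for w in ways])           # [2, 2, 2, 2]
```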

Now, in regular expressions, the RE engine expects to see special codes
inside the string that have special meanings. For example, backslash
followed by lowercase B has a special meaning. So to create a string
containing that regex, you can use any of the above (or any of the
others). The RE engine doesn't know, and can't know, how you generated
the regex. All it sees is a string containing a backslash followed by
lowercase-B.

But if you forget that Python uses backslash escapes in strings, and just
write "\b", then the compiler creates the string chr(8) (BEL), which has
no special meaning to the RE engine.
 
Steven D'Aprano

> But if you forget that Python uses backslash escapes in strings, and
> just write "\b", then the compiler creates the string chr(8) (BEL),
> which has no special meaning to the RE engine.

Correction: \b is BACKSPACE, not BELL. \a is BELL.
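
The character codes are easy to confirm at the prompt; a short sketch:

```python
print(ord("\b"), "\b" == chr(8))   # 8 True   (backspace)
print(ord("\a"), "\a" == chr(7))   # 7 True   (bell)
print(len("\b"), len(r"\b"))       # 1 2
```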
 
MRAB

Steven D'Aprano wrote:
[snip]
> But if you forget that Python uses backslash escapes in strings, and just
> write "\b", then the compiler creates the string chr(8) (BEL), which has
> no special meaning to the RE engine.

"\b" or chr(8) is BS (backspace); "\a" or chr(7) is BEL (bell).
 
Rhodri James

> If that were true, then wouldn't r'\b' be treated as two characters?

It is: len(r'\b') returns 2.

> So when do I know when a raw string is treated as a raw string, and
> when it's not?

A raw string LITERAL is always treated as a raw string LITERAL when
the Python interpreter turns it into a string OBJECT. I used the
capitalised words very deliberately and precisely, yet you seem to
have managed to conflate them again. Please don't. How the literal
is interpreted is up to the Python interpreter. How the object is
interpreted is up to the thing doing the interpretation, in this
case the re.sub() function.

How do you know how a string object is going to be treated by any
given function? Read the Fine Manual for that function.
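
One way to see the literal/object split, as a sketch with made-up sample values: the same string object is handed to two different consumers, and each applies its own rules.

```python
import re

pattern = r'\n'        # literal -> object: a backslash followed by 'n'
s = 'x\nx'             # literal -> object: 'x', a newline, 'x'

# str.replace gives backslashes no special meaning, so nothing in s matches.
via_replace = s.replace(pattern, '-')

# re.sub does interpret backslash-n, so the newline is replaced.
via_re = re.sub(pattern, '-', s)

print(repr(via_replace))  # 'x\nx'
print(repr(via_re))       # 'x-x'
```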
 
walterbyrd

On May 22, 12:22 pm, "Rhodri James"
> How do you know how a string object is going to be treated by any
> given function? Read the Fine Manual for that function.

So am I to understand that there is no consistency in string handling
throughout the standard modules/objects/methods?

Seems to make python a lot more complicated than it needs to be, but
okay.
 
Robert Kern

> So am I to understand that there is no consistency in string handling
> throughout the standard modules/objects/methods?
>
> Seems to make python a lot more complicated than it needs to be, but
> okay.

*Any* language would have such issues. Different functions do different things
to their inputs. That's why you have different functions. I certainly wouldn't
want my HTML parser to treat its inputs as if they were regular expressions.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Rhodri James

> So am I to understand that there is no consistency in string handling
> throughout the standard modules/objects/methods?

How can there be? They all have different requirements, after all.
In C, for example, you wouldn't expect "^\\s*[qwerty]+" to be remotely
useful as a printf() format, but it might be exactly what you want for
some regular expression library.
 
Steven D'Aprano

> So am I to understand that there is no consistency in string handling
> throughout the standard modules/objects/methods?

No, you have completely misunderstood.

> Seems to make python a lot more complicated than it needs to be, but
> okay.

No, you are imagining complexity that doesn't exist.

To the Python compiler, a string is a string is a string. The rules are
very simple: you write a string literal using quotation marks to tell the
compiler "the text between these delimiters is a literal string". Here
are the delimiters understood by Python:


Regular strings, must be on a single line:
' ' or " "

Regular strings, allowed to include multiple lines:
''' ''' or """ """

Raw strings, must be on a single line:
r' ' or r" "

Raw strings, allowed to include multiple lines:
r''' ''' or r""" """

Regular strings interpret backslash escapes specially: \c has special
meaning depending on what c is. For example, \t is interpreted by the
compiler as a tab, and \n is interpreted as a newline. Raw strings
*don't* interpret backslashes specially (except that you can't end the
raw string with an odd number of backslashes).
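
A small sketch of those rules. The helper `compiles` is hypothetical, written only for this example; it asks the compiler whether a piece of source text is acceptable:

```python
def compiles(source):
    """Return True if `source` is accepted by the compiler as an expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

print(len('\t'), len(r'\t'))            # 1 2

# Each argument below is the source text of a raw string literal.
print(compiles(r"r'c:\whatever\'"))     # False: the final \ escapes the quote
print(compiles(r"r'c:\whatever\\'"))    # True: even number of backslashes

# The backslash that protects the quote stays in the resulting string.
print(eval(r"r'\''") == "\\'")          # True
```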

That is how you *create* string literals. It is 100% consistent all
through Python: the rules apply in every module, in every function,
everywhere, because the compiler creates the string before the function
or module gets a chance to see the string.

Having been created, how the string is *used* depends on the application,
and Python modules and functions are no different. Inside a calculator
application, the meaning of the literal string "x/y" would be very
different than it would be inside an application dealing with file names.
Python modules are no different:

- the os module interprets many strings as file names according to the
rules for your operating system: e.g. on Linux '/' separates parts of the
pathname into sub-directories. On Windows, either forward or backslashes
are used to separate directories, and ':' is used to separate drive
letters from the path.

- the glob module interprets strings according to the rules for shell
globbing: e.g. '*' means 'match any number of any character', '?' means
'match any single character'.

- the re module interprets strings according to the rules for regular
expressions: e.g. '.*' means 'match any number of any character (except
newline by default)' and '\d' (backslash-d) means 'match a single decimal
digit'.

- the urllib and urllib2 modules interpret strings according to the rules
of dealing with URLs.
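
To make the list above concrete, here is a hedged sketch. It uses fnmatch (the matching rules the glob module relies on) so that no real files are needed, and posixpath/ntpath (the platform-specific modules behind os.path) so the output does not depend on your operating system. The file names are invented.

```python
import fnmatch
import ntpath
import posixpath
import re

name = 'notes.txt'

# Shell-style rules: '*' matches any run of characters.
print(fnmatch.fnmatchcase(name, '*.txt'))         # True

# Regex rules: the same text is not even a valid pattern,
# because '*' must follow something to repeat.
try:
    re.compile('*.txt')
    valid_regex = True
except re.error:
    valid_regex = False
print(valid_regex)                                # False
print(bool(re.fullmatch(r'.*\.txt', name)))       # True

# Path rules for Linux-style and Windows-style names.
print(posixpath.split('/usr/lib/notes.txt'))      # ('/usr/lib', 'notes.txt')
print(ntpath.splitdrive(r'C:\docs\notes.txt'))    # ('C:', '\\docs\\notes.txt')
```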


In every case, you construct the string literals using the same rules,
but the *meaning* of them differs according to the application. Because
regular expressions give special meanings to literal backslashes, it is
inconvenient to create many regexes using regular strings, because you
need to escape the backslashes. That's where raw strings are more useful.
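
For instance (a sketch with an invented sample sentence), the two literals below build the same regex, but the raw one reads the way the regex is meant:

```python
import re

regular = "\\d+\\.\\d+"   # every backslash doubled
raw = r"\d+\.\d+"         # written as the regex reads

print(regular == raw)                             # True
print(re.findall(raw, "pi is 3.14, e is 2.71"))   # ['3.14', '2.71']
```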
 
