using re module to find " but not " alone ... is this a BUG in re?

anton

Hi,

I want to replace all occurrences of " by \" in a string.

But I want to leave all occurrences of \" as they are.

The following should happen:

this I want " while I dont want this \"

should be transformed to:

this I want \" while I dont want this \"

and NOT:

this I want \" while I dont want this \\"

I even tried the (?<=...) construction, but there I get an unbalanced
parenthesis error.

It seems that re is not able to do the job due to parsing/compiling problems
for this sort of string.


Have you any idea??

Anton


Example: --------------------

import re

re.findall("[^\\]\"","this I want \" while I dont want this \\\" ")

Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python25\lib\re.py", line 175, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Python25\lib\re.py", line 241, in _compile
    raise error, v # invalid expression
error: unexpected end of regular expression
 
John Machin

> I tried even the (?<=...) construction but here I get an unbalanced parenthesis
> error.

Sounds like a deficit of backslashes causing re to regard \) as plain
text and not the magic closing parenthesis in (?<=...) -- and don't
you want (?<!...) ?

> It seems that re is not able to do the job due to parsing/compiling problems
> for this sort of strings.

Nothing is ever as it seems.

> Have you any idea??

For a start, *ALWAYS* use a raw string for an re pattern -- halves the
backslash pollution!

> re.findall("[^\\]\"","this I want \" while I dont want this \\\" ")

and if you have " in the pattern, use '...' to enclose the pattern so
that you don't have to use \"

> Traceback (most recent call last):
>   File "<interactive input>", line 1, in <module>
>   File "C:\Python25\lib\re.py", line 175, in findall
>     return _compile(pattern, flags).findall(string)
>   File "C:\Python25\lib\re.py", line 241, in _compile
>     raise error, v # invalid expression
> error: unexpected end of regular expression

As expected.
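
To see why the error is expected, it helps to print what the regex engine actually receives. A minimal sketch, written for Python 3 rather than the 2.5 used in the thread (the variable names are only illustrative): the non-raw literal collapses to the five characters [^\]" and the \] in there is an escaped bracket, so the character class never closes.

```python
import re

# The pattern from the original post, written as a normal (non-raw) literal.
broken = "[^\\]\""
print(broken)  # shows the text handed to the regex engine

# Inside the class, \] is an escaped ']' so the class is never terminated.
try:
    re.compile(broken)
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# Raw string plus single quotes: the engine now sees a class holding one backslash.
fixed = r'[^\\]"'
matches = re.findall(fixed, r'this I want " while I dont want this \" ')
print(matches)
```

Note that the fixed pattern also consumes the character in front of the quote, which is why a lookbehind is the better fit for a substitution.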

What you want is:
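
A minimal sketch of the approach hinted at above (raw string, single-quoted pattern, negative lookbehind), written for Python 3. It is an illustration and not necessarily the code John posted; `escape_quotes` is a hypothetical helper name.

```python
import re

def escape_quotes(text):
    """Put a backslash before every '"' that is not already preceded by one."""
    # (?<!\\) is a negative lookbehind: the previous character must not be a backslash.
    # In the replacement template, \\ stands for one literal backslash.
    return re.sub(r'(?<!\\)"', r'\\"', text)

sample = r'this I want " while I dont want this \"'
result = escape_quotes(sample)
print(result)
```

The lookbehind only inspects one character, so it cannot tell an escaping backslash from an escaped one.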

HTH,
John
 
Peter Otten

The problem is underspecified. Should r'\\"' become r'\\\"' or remain
unchanged? If the backslash is supposed to escape the following character,
including another backslash, that can't be done with regular expressions
alone:

# John's proposal:
no \" one \", two \\"


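One possible fix:

import re
# Split on backslash escapes; the group keeps them at the odd indices.
parts = re.compile(r"(\\.)").split(r'no " one \", two \\"')
# Escape quotes only in the plain text between the escapes (even indices).
parts[::2] = [p.replace('"', '\\"') for p in parts[::2]]
print("".join(parts))

Output:
no \" one \", two \\\"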

Peter
 
anton

> .... cut text off
> What you want is:
>
> HTH,
> John


First.. thanks John.

The whole problem is discussed in

http://docs.python.org/dev/howto/regex.html#the-backslash-plague

in the section "The Backslash Plague"

Unfortunately this is *NOT* mentioned in the standard
Python documentation of the re module.

Another thing which will always remain strange to me is that,
even though the Python doc on raw strings:

http://docs.python.org/ref/strings.html

says:
"Specifically, a raw string cannot end in a single backslash"

s=r"\\" # works fine
s=r"\" # does not work (as stated)

But both END IN A SINGLE BACKSLASH!

The main thing which is hard to understand is:

If a raw string is a string which ignores backslashes,
then it should ignore them in all circumstances,

or where could the problem be here (the Python parser somewhere??).

Bye

Anton
 
John Machin

> The whole problem is discussed in
>
> http://docs.python.org/dev/howto/regex.html#the-backslash-plague
>
> in the section "The Backslash Plague"
>
> Unfortunately this is *NOT* mentioned in the standard
> Python documentation of the re module.

Yes, and there's more to driving a car in heavy traffic than you will
find in the manufacturer's manual.

> Another thing which will always remain strange to me is that,
> even though the Python doc on raw strings:
>
> http://docs.python.org/ref/strings.html
>
> says:
> "Specifically, a raw string cannot end in a single backslash"
>
> s=r"\\" # works fine
> s=r"\" # does not work (as stated)
>
> But both END IN A SINGLE BACKSLASH!

Apply the interpretation that the first case ends in a double
backslash, and move on.

> The main thing which is hard to understand is:
>
> If a raw string is a string which ignores backslashes,
> then it should ignore them in all circumstances,

Nobody defines a raw string to be a "string that ignores backslashes",
so your premise is invalid.

> or where could the problem be here (the Python parser somewhere??).

Why r"\" is not a valid string token has been done to death IIRC at
least twice in this newsgroup ...
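
For reference, a small sketch of how backslashes behave in raw literals, assuming current CPython behaviour (which agrees with the documentation quoted above): a backslash in a raw string is kept in the value, but it still stops the quote that follows it from ending the literal. The variable names are only illustrative.

```python
# In a raw literal every backslash is kept, so r"\\" holds two characters.
double = r"\\"
print(len(double))

# A backslash still protects the following quote from closing the literal,
# and both characters end up in the value.
escaped_quote = r"\""
print(len(escaped_quote))

# A raw literal with one backslash would never be closed. For a lone
# backslash, use a normal literal or chr(92) instead.
single = "\\"
print(len(single), single == chr(92))

# compile() shows the one-backslash raw literal being rejected.
try:
    compile('s = r"\\"', "<example>", "exec")
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)
```

So the first case really does end in two backslashes, and only the second is the "single backslash" ending that the documentation rules out.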

Cheers,
John
 
Paul McGuire

A pyparsing version is not as terse as an re, and certainly not as
fast, but it is easy enough to read. Here is my first brute-force
approach to your problem:

from pyparsing import Literal, replaceWith

escQuote = Literal(r'\"')
unescQuote = Literal(r'"')
unescQuote.setParseAction(replaceWith(r'\"'))

test1 = r'this I want " while I dont want this \"'
test2 = r'frob this " avoid this \", OK?'

for test in (test1, test2):
    print((escQuote | unescQuote).transformString(test))

And it prints out the desired:

this I want \" while I dont want this \"
frob this \" avoid this \", OK?

This works by defining both of the patterns escQuote and unescQuote,
and attaching a transforming parse action only to unescQuote. By
listing escQuote first in the list of patterns to match, properly
escaped quotes are skipped over.

Then I looked at your problem slightly differently - why not find both
'\"' and '"', and replace either one with '\"'? In some cases, I'm
"replacing" '\"' with '\"', but so what? Here is the simplified
transformer:

from pyparsing import Optional, replaceWith

# '\\' is a single literal backslash; r'\\' would be two of them
quotes = Optional('\\') + '"'
quotes.setParseAction(replaceWith(r'\"'))
for test in (test1, test2):
    print(quotes.transformString(test))


Again, this prints out the desired output.

Now let's retrofit this altered logic back onto John Machin's
solution:

import re
for test in (test1, test2):
    print(re.sub(r'\\?"', r'\"', test))


Pretty short and sweet, and pretty readable for an re.

To address Peter Otten's question about what to do with an escaped
backslash, I can't compose this with an re, but I can by adjusting the
first pyparsing version to include an escaped backslash as a "match
but don't do anything with it" expression, just like we did with
escQuote:

from pyparsing import Optional, Literal, replaceWith

escQuote = Literal(r'\"')
unescQuote = Literal(r'"')
unescQuote.setParseAction(replaceWith(r'\"'))
backslash = chr(92)
escBackslash = Literal(backslash+backslash)

test3 = r'no " one \", two \\"'
for test in (test1, test2, test3):
    print((escBackslash | escQuote | unescQuote).transformString(test))

Prints:
this I want \" while I dont want this \"
frob this \" avoid this \", OK?
no \" one \", two \\\"

At first I thought the last transform was an error, but on closer
inspection, I see that the input line ends with an escaped backslash,
followed by a lone '"', which must be replaced with '\"'. So in the
transformed version we see '\\\"', the original escaped backslash,
followed by the replacement '\"' string.
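
The same "match it but leave it alone" ordering can be approximated with re by letting a replacement function decide what to do with each alternative. This is only a sketch under that assumption, written for Python 3, and not a replacement for the pyparsing version above; `TOKEN` and `escape_aware` are hypothetical names.

```python
import re

# Alternatives are tried left to right, mirroring
# escBackslash | escQuote | unescQuote in the pyparsing version.
TOKEN = re.compile(r'(\\\\|\\")|"')

def escape_aware(text):
    """Escape bare quotes; pass escaped backslashes and escaped quotes through."""
    def fix(match):
        # Group 1 is set for an already escaped pair, which is kept unchanged;
        # otherwise a bare quote was matched and gets a backslash in front.
        return match.group(1) or '\\"'
    return TOKEN.sub(fix, text)

tests = [
    r'this I want " while I dont want this \"',
    r'frob this " avoid this \", OK?',
    r'no " one \", two \\"',
]
results = [escape_aware(t) for t in tests]
for line in results:
    print(line)
```

Because the escaped backslash is consumed as a unit first, the quote after it is seen as bare, which the shorter `\\?"` pattern cannot distinguish.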

Cheers,
-- Paul
 
