Regular expression bug?

Ron Garret

I'm trying to split a CamelCase string into its constituent components.
This kind of works:
re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:
re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:
re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

I'm using Python2.5.

Thanks,
rg
 
Albert Hopkins

I'm trying to split a CamelCase string into its constituent components.
This kind of works:
re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

That's how re.split works, same as str.split...

re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:
re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']


Wow!

To tell you the truth, I can't even read that... but one wonders why
don't you just do

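    import string

    def ccsplit(s):
        """Split a camelCase string at each uppercase letter."""
        cclist = []
        current_word = ''
        for char in s:
            if char in string.ascii_uppercase:
                # An uppercase letter starts a new word; store the previous one.
                if current_word:
                    cclist.append(current_word)
                current_word = char
            else:
                current_word += char
        # Store the final word.
        if current_word:
            cclist.append(current_word)
        return cclist

    ccsplit('fooBarBaz')
    --> ['foo', 'Bar', 'Baz']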

This is arguably *much* easier to read than the re example, and it doesn't
require one to look ahead in the string.

-a
 
Kurt Smith

I'm trying to split a CamelCase string into its constituent components.
This kind of works:
re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:
re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:
re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

From what I can tell, re.split can't split on zero-length boundaries.
It needs something to split on, like str.split. Is this a bug?
Possibly. The docs for re.split say:

Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.

Note that it does not say that zero-length matches won't work.

I can work around the problem thusly:

re.sub(r'(?<=[a-z])(?=[A-Z])', '_', 'fooBarBaz').split('_')

Which is ugly. I reckon you can use re.findall with a pattern that
matches the components and not the boundaries, but you have to take
care of the beginning and end as special cases.
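
One way to avoid the placeholder character altogether is to ask the regex for the boundary positions and slice the string yourself. This is only a sketch: the names BOUNDARY and split_at_boundaries are made up for illustration, and it assumes the same lookbehind/lookahead pattern used above.

```python
import re

# Zero-width boundary between a lowercase and an uppercase letter.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')

def split_at_boundaries(text):
    """Slice text at every zero-width match of BOUNDARY (hypothetical helper)."""
    pieces = []
    start = 0
    for match in BOUNDARY.finditer(text):
        pieces.append(text[start:match.start()])
        start = match.start()
    pieces.append(text[start:])  # whatever follows the last boundary
    return pieces

words = split_at_boundaries('fooBarBaz')
with_underscore = split_at_boundaries('foo_Bar')
via_sub = re.sub(BOUNDARY, '_', 'foo_Bar').split('_')
print(words)
print(with_underscore)
print(via_sub)
```

The last two lines show why the sub-then-split workaround is fragile: if the input already contains the placeholder, it gets split there too, while the slicing version leaves it alone.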

Kurt
 
Peter Otten

Ron said:
I'm trying to split a CamelCase string into its constituent components.

How about
re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
['foo', 'Bar', 'Baz']

This kind of works:
re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:
re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:
re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

IIRC the split pattern must consume at least one character, but I can't find
the reference.

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

It's coded in C. The source is Modules/_sre.c.
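
A quick way to check this from the interpreter, assuming CPython: the split method of a compiled pattern reports itself as a built-in, which is why grepping the .py files for 'split' turns up nothing. The variable names below are just for illustration.

```python
import inspect
import re

pattern = re.compile('[a-z][A-Z]')
split_method = pattern.split

# On CPython this is a built-in (C-implemented) method, so there is no
# Python source file for it to be found in.
is_builtin = inspect.isbuiltin(split_method)
print(is_builtin)
print(type(split_method).__name__)
```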

Peter
 
MRAB

Ron said:
I'm trying to split a CamelCase string into its constituent components.
This kind of works:
re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:
re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:
re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

I'm using Python2.5.

I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
might be intentional, but changing it could break some existing code.

You could do this instead:
>>> re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')
['foo', 'Bar', 'Baz']
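
For anyone reading this on a newer interpreter: the behaviour discussed here applies to Python 2.5. From Python 3.7 onward, re.split does split on patterns that match the empty string, so the lookaround pattern works directly. A small sketch, assuming Python 3.7 or later:

```python
import re

boundary = r'(?<=[a-z])(?=[A-Z])'

# Python 3.7+ only: re.split now honours zero-width matches.
parts = re.split(boundary, 'fooBarBaz')

# With the capturing group from the original post, the (empty) captured
# text is included in the result between the pieces.
grouped = re.split('(' + boundary + ')', 'fooBarBaz')

print(parts)
print(grouped)
```

On 3.7+ the first call gives the three words; the second also returns an empty string for each captured boundary, so the group is best left out.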
 
Ron Garret

MRAB said:
Ron said:
I'm trying to split a CamelCase string into its constituent components.
This kind of works:
re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:
re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:
re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']

So the regular expression seems to be doing the Right Thing. Is this a
bug in re.split, or am I missing something?

(BTW, I tried looking at the source code for the re module, but I could
not find the relevant code. re.split calls sre_compile.compile().split,
but the string 'split' does not appear in sre_compile.py. So where does
this method come from?)

I'm using Python2.5.

I, amongst others, think it's a bug (or 'misfeature'); Guido thinks it
might be intentional, but changing it could break some existing code.

That seems unlikely. It would only break where people had code invoking
re.split on empty matches, which at the moment is essentially a no-op.
It's hard to imagine there's a lot of code like that around. What would
be the point?

You could do this instead:
re.sub('(?<=[a-z])(?=[A-Z])', '@', 'fooBarBaz').split('@')
['foo', 'Bar', 'Baz']

Blech! ;-) But thanks for the suggestion.

rg
 
Ron Garret

andrew cooke said:
i wonder what fraction of people posting with "bug?" in their titles here
actually find bugs?

IMHO it ought to be an invariant that len(r.split(s)) should always be
one more than len(r.findall(s)).
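
The proposed invariant can be checked with a few lines. This sketch assumes Python 3.7 or later (where split handles empty matches) and a pattern with no capturing groups, since groups add extra items to the split result; the helper name counts is made up.

```python
import re

def counts(pattern, text):
    """Return (number of split pieces, number of findall matches)."""
    r = re.compile(pattern)
    return len(r.split(text)), len(r.findall(text))

consuming = counts(r'[a-z][A-Z]', 'fooBarBaz')
zero_width = counts(r'(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
print(consuming)
print(zero_width)
```

On Python 2.5 the second pair would not satisfy the invariant, because split returned the whole string as a single piece while findall still reported two matches.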
anyway, how about:

re.findall('[A-Z]?[a-z]*', 'fooBarBaz')

or

re.findall('([A-Z][a-z]*|[a-z]+)', 'fooBarBaz')

That will do it. Thanks!
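
For comparison, here is what the two suggested findall patterns return. Since every part of the first pattern is optional, it can also match the empty string at the end of the input; the second pattern requires at least one character per match.

```python
import re

# Every part of this pattern is optional, so it also matches '' at the end.
optional = re.findall('[A-Z]?[a-z]*', 'fooBarBaz')

# Each alternative needs at least one character, so no empty matches.
strict = re.findall('([A-Z][a-z]*|[a-z]+)', 'fooBarBaz')

print(optional)
print(strict)
```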

rg
 
Ron Garret

Albert Hopkins said:
I'm trying to split a CamelCase string into its constituent components.
This kind of works:
re.split('[a-z][A-Z]', 'fooBarBaz')
['fo', 'a', 'az']

but it consumes the boundary characters. To fix this I tried using
lookahead and lookbehind patterns instead, but it doesn't work:

That's how re.split works, same as str.split...

I think one could make the argument that 'foo'.split('') ought to return
['f','o','o']
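
For reference, str.split does not return the whole string for an empty separator; it refuses it outright. A short sketch of what happens, with the last line assuming Python 3.7 or later:

```python
import re

try:
    'foo'.split('')
    outcome = 'no error'
except ValueError as exc:
    outcome = str(exc)  # str.split rejects an empty separator

# The usual way to get the individual characters.
chars = list('foo')

# Python 3.7+ only: an empty regex splits at every position, including
# the start and the end, hence the empty strings at both ends.
regex_chars = re.split('', 'foo')

print(outcome)
print(chars)
print(regex_chars)
```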
re.split('((?<=[a-z])(?=[A-Z]))', 'fooBarBaz')
['fooBarBaz']

However, it does seem to work with findall:
re.findall('(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
['', '']


Wow!

To tell you the truth, I can't even read that...

It's a regexp. Of course you can't read it. ;-)

rg
 
Steven D'Aprano

andrew said:
i wonder what fraction of people posting with "bug?" in their titles here
actually find bugs?

About 99.99%.

Unfortunately, 99.98% have found bugs in their code, not in Python.
 
Lie Ryan

Peter Otten said:
Ron said:
I'm trying to split a CamelCase string into its constituent
components.

How about
re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
['foo', 'Bar', 'Baz']

That's very clever. Thanks!

It's coded in C. The source is Modules/_sre.c.

Ah. Thanks!

rg

This re.split() doesn't consume characters:
re.split('([A-Z][a-z]*)', 'fooBarBaz')
['foo', 'Bar', '', 'Baz', '']

it does what the OP wants, albeit with extra blank strings.
 
umarpy

A more elegant way:
[x for x in re.split('([A-Z]+[a-z]+)', 'fooBarBaz') if x]
['foo', 'Bar', 'Baz']

R.

Ron Garret wrote:
I'm trying to split a CamelCase string into its constituent
components.
How about
re.compile("[A-Za-z][a-z]*").findall("fooBarBaz")
['foo', 'Bar', 'Baz']
That's very clever.  Thanks!
Ah.  Thanks!

This re.split() doesn't consume characters:
re.split('([A-Z][a-z]*)', 'fooBarBaz')

['foo', 'Bar', '', 'Baz', '']

it does what the OP wants, albeit with extra blank strings.
 
