regex question

proctor · Apr 27, 2007

hello,

i have a regex: rx_test = re.compile('/x([^x])*x/')

which is part of this test program:

============

import re

rx_test = re.compile('/x([^x])*x/')

s = '/xabcx/'

if rx_test.findall(s):
    print(rx_test.findall(s))

============

i expect the output to be ['abc'] however it gives me only the last
single character in the group: ['c']

C:\test>python retest.py
['c']

can anyone point out why this is occurring? i can capture the entire
group by doing this:

rx_test = re.compile('/x([^x]+)*x/')
but why isn't the 'star' grabbing the whole group? and why isn't each
letter 'a', 'b', and 'c' present, either individually, or as a group
(group is expected)?

any clarification is appreciated!

sincerely,
proctor

Josiah Carlson · Apr 27, 2007

proctor said:
i have a regex: rx_test = re.compile('/x([^x])*x/')

You probably want...

rx_test = re.compile('/x([^x]*)x/')

- Josiah

Paul McGuire · Apr 27, 2007

As Josiah already pointed out, the * needs to be inside the grouping
parens.

Since re's do lookahead/backtracking, you can also write:

rx_test = re.compile('/x(.*?)x/')

The '?' is there to make sure the .* repetition stops at the first
occurrence of x/.
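
A minimal sketch of the greedy versus non-greedy difference described above. The two-run test string is an assumption chosen for illustration; it is not from the original post.

```python
import re

# Two delimited runs on one line, using the thread's /x ... x/ stand-in.
s = '/xabcx/ /xdefx/'

# Greedy: .* runs to the end, then backtracks to the LAST 'x/'.
greedy = re.findall(r'/x(.*)x/', s)

# Non-greedy: .*? stops at the FIRST 'x/' it can reach.
lazy = re.findall(r'/x(.*?)x/', s)

print(greedy)  # ['abcx/ /xdef']
print(lazy)    # ['abc', 'def']
```

With a single delimited run, as in the original test string, both forms return the same result.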

-- Paul

proctor · Apr 27, 2007

i am working through an example from the oreilly book mastering
regular expressions (2nd edition) by jeffrey friedl. my post was a
snippet from a regex to match C comments. every 'x' in the regex
represents a 'star' in actual usage, so that backslash escaping is not
needed in the example (on page 275). it looks like this:

===========

/x([^x]|x+[^/x])*x+/

it is supposed to match '/x', the opening delimiter, then

(
either anything that is 'not x',

or,

'x' one or more times, 'not followed by a slash or an x'
) any number of times (the 'star')

followed finally by the closing delimiter.

===========
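
A short sketch of how the pattern described above behaves in Python, assuming a made-up test string with a run of 'x' in the middle. It separates what the whole pattern matched (group 0) from what the single capturing group holds afterwards.

```python
import re

# The pattern as posted, with 'x' standing in for '*'.
rx_comment = re.compile(r'/x([^x]|x+[^/x])*x+/')

# Hypothetical input: one "comment" surrounded by other text.
s = 'before /xabxxcx/ after'

m = rx_comment.search(s)
whole = m.group(0)           # everything the pattern matched
last_iteration = m.group(1)  # only the final repetition of the group
found = rx_comment.findall(s)

print(whole)           # /xabxxcx/
print(last_iteration)  # xxc
print(found)           # ['xxc']
```

The pattern matches the full delimited text; findall() reports only the group, which is why the result looks truncated.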

this does not seem to work in python the way i understand it should
from the book, and i simplified the example in my first post to
concentrate on just one part of the alternation that i felt was not
acting as expected.

so my question remains, why doesn't the star quantifier seem to grab
all the data. isn't findall() intended to return all matches? i
would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match). why does it give only
one letter, and at that, the /last/ letter in the sequence??

thanks again for replying!

sincerely,
proctor

Michael Hoffman · Apr 27, 2007

proctor said:
so my question remains, why doesn't the star quantifier seem to grab
all the data.

Because you didn't use it *inside* the group, as has been said twice.
Let's take a simpler example:

>>> import re
>>> text = "xabc"
>>> re_test1 = re.compile("x([^x])*")
>>> re_test2 = re.compile("x([^x]*)")
>>> re_test1.match(text).groups()
('c',)
>>> re_test2.match(text).groups()
('abc',)

There are three places that match ([^x]) in text. But each time the
group matches, it overwrites the previous match.
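
If the individual characters are wanted, one possible approach (a sketch, not something proposed in the thread) is to capture the whole run once and split it afterwards, since a repeated group keeps only its last match.

```python
import re

text = "xabc"

# Capture the whole run in one group, with the star inside the parens.
m = re.match(r"x([^x]*)", text)
run = m.group(1)

# Split the captured run into single characters afterwards.
chars = list(run)

print(run)    # abc
print(chars)  # ['a', 'b', 'c']
```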

proctor said:
isn't findall() intended to return all matches?

It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used
a grouping parenthesis in there, it only returns that one group from each
match.

Back to my example:
['c', 'a', 'c']

Here it finds multiple matches, but only because the x occurs multiple
times as well. In your example there is only one match.

proctor said:
i would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match).

You are essentially doing this:

group1 = "a"
group1 = "b"
group1 = "c"

After those three statements, you wouldn't expect group1 to be "abc" or
"a". You'd expect it to be "c".

Duncan Booth · Apr 27, 2007

proctor said:
so my question remains, why doesn't the star quantifier seem to grab
all the data. isn't findall() intended to return all matches? i
would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match). why does it give only
one letter, and at that, the /last/ letter in the sequence??

findall returns the matched groups. You get one group for each
parenthesised sub-expression, and (the important bit) if a single
parenthesised expression matches more than once the group only contains
the last string which matched it.

Putting a star after a subexpression means that subexpression can match
zero or more times, but each time it only matches a single character
which is why your findall only returned the last character it matched.

You need to move the * inside the parentheses used to define the group,
then the group will match only once but will include everything that it
matched.

Consider:

>>> re.findall('(.)', 'abc')
['a', 'b', 'c']
>>> re.findall('(.)*', 'abc')
['c', '']
>>> re.findall('(.*)', 'abc')
['abc', '']

The first pattern finds a single character which findall manages to
match 3 times.

The second pattern finds a group with a single character zero or more
times in the pattern, so the first time it matches each of a,b,c in turn
and returns the c, and then next time around we get an empty string when
group matched zero times.

In the third pattern we are looking for a group with any number of
characters in it. First time we get all of the string, then we get
another empty match.

Paul McGuire · Apr 27, 2007

Again, I'll repeat some earlier advice: you need to move the '*'
inside the parens - you are still leaving it outside. Also, get in
the habit of using raw literal notation (that is r"slkjdfljf" instead
of "lsjdlfkjs") when defining re strings - you don't have backslash
issues yet, but you will as soon as you start putting real '*'
characters in your expression.

However, when I test this,

import re

restr = r'/x(([^x]|x+[^/])*)x+/'
re_ = re.compile(restr)
print(re_.findall("/xabxxcx/ /x123xxx/"))

findall now starts to give a tuple for each "comment",

[('abxxc', 'xxc'), ('123xx', 'xx')]

so you have gone beyond my limited re skill, and will need help from
someone else.

But I suggest you add some tests with multiple consecutive 'x'
characters in the middle of your comment, and multiple consecutive 'x'
characters before the trailing comment. In fact, from my
recollections of trying to implement this type of comment recognizer
by hand a long time ago in a job far, far away, test with both even
and odd numbers of 'x' characters.
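
A sketch of the kind of test suggested above, written with real '*' characters and a raw literal. The pattern is the book's expression as quoted earlier in the thread with 'x' swapped back to '*'; the test strings are assumptions made up for illustration.

```python
import re

# Comment matcher with real '*' characters; the raw literal keeps the
# backslashes intact.
rx_comment = re.compile(r'/\*([^*]|\*+[^/*])*\*+/')

# Hypothetical cases: even and odd runs of '*' inside the comment and
# just before the closing delimiter.
cases = ['/**/', '/***/', '/****/', '/* a ** b */', '/* a *** b **/']
results = [rx_comment.fullmatch(c) is not None for c in cases]
print(results)  # [True, True, True, True, True]

# Two comments on one line should be found separately.
line = '/* one */ x /* two */'
found = [m.group(0) for m in rx_comment.finditer(line)]
print(found)  # ['/* one */', '/* two */']
```

Using group(0) avoids the capturing-group issue entirely, because it always holds the full text the pattern matched.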

-- Paul

proctor · Apr 27, 2007

ok, thanks michael.

so i am now assuming that either the book's example assumes perl, and
perl is different from python in this regard, or, that the book's
example is faulty. i understand all the examples given since my
question, and i know what i need to do to make it work. i am raising
the question because the book says one thing, but the example is not
working for me. i am searching for the source of the discrepancy.

i will try to research the differences between perl's and python's
regex engines.

thanks again,

sincerely,
proctor

proctor · Apr 27, 2007

thank you this is interesting. in the second example, where does the
'nothingness' match, at the end? why does the regex 'run again' when
it has already matched everything? and if it reports an empty match
along with a non-empty match, why only the two?

sincerely,
proctor

Duncan Booth · Apr 27, 2007

proctor said:
>>> re.findall('(.)*', 'abc')
['c', '']

thank you this is interesting. in the second example, where does the
'nothingness' match, at the end? why does the regex 'run again' when
it has already matched everything? and if it reports an empty match
along with a non-empty match, why only the two?

There are 4 possible starting points for a regular expression to match in a
three character string. The regular expression would match at any starting
point so in theory you could find 4 possible matches in the string. In this
case they would be 'abc', 'bc', 'c', ''.

However findall won't get any overlapping matches, so there are only two
possible matches and it returns both of them: 'abc' and '' (or rather, it
returns the matching group within the match, so you only see the 'c'
although it matched 'abc').

If you use a regex which doesn't match an empty string (e.g. '/x(.*?)x/')
then you won't get the empty match.
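
The two non-overlapping matches can be inspected directly. This is a small sketch using re.finditer(), which is not used elsewhere in the thread, to show where each match starts and ends and what the group holds.

```python
import re

# Record the span and the group for every non-overlapping match.
spans = [(m.span(), m.group(1)) for m in re.finditer(r'(.)*', 'abc')]

print(spans)  # [((0, 3), 'c'), ((3, 3), None)]
```

The second match is the empty one at the end of the string. Its group never took part in the match, so group(1) is None, and findall() reports that as ''.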

proctor · Apr 27, 2007

thanks paul,

the reason the regex now gives tuples is that there are now 2 groups,
the inner and outer parens. so group 1 matches with the star, and
group 2 matches without the star.
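
One way to get plain strings back instead of tuples, sketched here as an assumption rather than something suggested in the thread, is to make the inner group non-capturing with (?:...) so that only the outer group is reported.

```python
import re

# Same pattern as in paul's test, but the inner group is non-capturing.
rx = re.compile(r'/x((?:[^x]|x+[^/])*)x+/')

found = rx.findall("/xabxxcx/ /x123xxx/")
print(found)  # ['abxxc', '123xx']
```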

sincerely,
proctor

proctor · Apr 29, 2007

thank you all again for helping to clarify this for me. of course you
were exactly right, and the problem lay not with python or the text,
but with me. i mistakenly understood the text to be attempting to
capture the C style comment, when in fact it was merely matching it.

apologies.

sincerely,
proctor
