regex question

proctor

hello,

i have a regex: rx_test = re.compile('/x([^x])*x/')

which is part of this test program:

============

import re

rx_test = re.compile('/x([^x])*x/')
s = '/xabcx/'

if rx_test.findall(s):
    print(rx_test.findall(s))

============

i expect the output to be ['abc'] however it gives me only the last
single character in the group: ['c']

C:\test>python retest.py
['c']

can anyone point out why this is occurring? i can capture the entire
group by doing this:

rx_test = re.compile('/x([^x]+)*x/')
but why isn't the 'star' grabbing the whole group? and why isn't each
letter 'a', 'b', and 'c' present, either individually, or as a group
(group is expected)?

any clarification is appreciated!

sincerely,
proctor
 

Paul McGuire

As Josiah already pointed out, the * needs to be inside the grouping
parens.

Since re's do lookahead/backtracking, you can also write:

rx_test = re.compile('/x(.*?)x/')

The '?' is there to make sure the .* repetition stops at the first
occurrence of x/.

-- Paul
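
A minimal sketch of the difference the '?' makes, assuming a made-up input string with two delimited blocks (the string and the variable names are illustrative, not taken from the original post):

```python
import re

# Hypothetical input containing two '/x ... x/' blocks.
s = '/xabcx/ and /xdefx/'

lazy = re.compile(r'/x(.*?)x/')    # stops at the first 'x/'
greedy = re.compile(r'/x(.*)x/')   # runs on to the last 'x/'

lazy_result = lazy.findall(s)
greedy_result = greedy.findall(s)

print(lazy_result)    # ['abc', 'def']
print(greedy_result)  # ['abcx/ and /xdef']
```

With only one block in the string, as in the original test, both forms return the same thing; the difference only shows up once a second 'x/' appears later in the input.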
 

proctor

i am working through an example from the O'Reilly book Mastering
Regular Expressions (2nd edition) by Jeffrey Friedl. my post was a
snippet from a regex to match C comments. every 'x' in the regex
represents a 'star' in actual usage, so that backslash escaping is not
needed in the example (on page 275). it looks like this:

===========

/x([^x]|x+[^/x])*x+/

it is supposed to match '/x', the opening delimiter, then

(
either anything that is 'not x',

or,

'x' one or more times, followed by a character that is 'not a slash or an x'
) any number of times (the 'star')

followed finally by the closing delimiter.

===========

this does not seem to work in python the way i understand it should
from the book, and i simplified the example in my first post to
concentrate on just one part of the alternation that i felt was not
acting as expected.

so my question remains, why doesn't the star quantifier seem to grab
all the data. isn't findall() intended to return all matches? i
would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match). why does it give only
one letter, and at that, the /last/ letter in the sequence??

thanks again for replying!

sincerely,
proctor
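
For reference, the pattern quoted from the book in this post does appear to match the whole delimited text in Python; what differs from the expectation is only the content of the capturing group. A small sketch, assuming the placeholder form with 'x' standing in for the star and a made-up test string:

```python
import re

# The pattern as quoted above, with 'x' as a stand-in for '*'.
rx = re.compile(r'/x([^x]|x+[^/x])*x+/')

m = rx.search('/xabxxcx/')   # made-up test string

whole = m.group(0)   # text matched by the entire pattern
last = m.group(1)    # only the final repetition of the group

print(whole)  # /xabxxcx/
print(last)   # xxc
```

group(0) is the match itself, while group(1) is whatever the parenthesised part matched on its last pass.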
 

Michael Hoffman

proctor said:
so my question remains, why doesn't the star quantifier seem to grab
all the data.

Because you didn't use it *inside* the group, as has been said twice.
Let's take a simpler example:
>>> import re
>>> text = "xabc"
>>> re_test1 = re.compile("x([^x])*")
>>> re_test2 = re.compile("x([^x]*)")
>>> re_test1.match(text).groups()
('c',)
>>> re_test2.match(text).groups()
('abc',)

There are three places that match ([^x]) in text. But each time you find
one you overwrite the previous match.

proctor said:
isn't findall() intended to return all matches?

It returns all matches of the WHOLE pattern, /x([^x])*x/. Since you used
a grouping parenthesis in there, it only returns one group from each
match.

Back to my example:
['c', 'a', 'c']

Here it finds multiple matches, but only because the x occurs multiple
times as well. In your example there is only one match.

proctor said:
i would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match).

You are essentially doing this:

group1 = "a"
group1 = "b"
group1 = "c"

After those three statements, you wouldn't expect group1 to be "abc" or
"a". You'd expect it to be "c".
 
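
The overwriting described above can be seen directly by comparing the whole match with the group. This is a sketch in the spirit of that example; the longer input string is an assumption chosen so that the pattern matches three times, and is not necessarily the string used in the original post:

```python
import re

re_test1 = re.compile(r"x([^x])*")

m = re_test1.match("xabc")
whole = m.group(0)    # 'xabc': everything the pattern consumed
group1 = m.group(1)   # 'c': the last thing ([^x]) matched

# Assumed input: three separate runs that each start with 'x'.
many = re_test1.findall("xabcxaaaxabc")

print(whole, group1)  # xabc c
print(many)           # ['c', 'a', 'c']
```

Each entry in the findall result is the final value of the group for one match of the whole pattern.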

Duncan Booth

proctor said:
so my question remains, why doesn't the star quantifier seem to grab
all the data. isn't findall() intended to return all matches? i
would expect either 'abc' or 'a', 'b', 'c' or at least just
'a' (because that would be the first match). why does it give only
one letter, and at that, the /last/ letter in the sequence??

findall returns the matched groups. You get one group for each
parenthesised sub-expression, and (the important bit) if a single
parenthesised expression matches more than once the group only contains
the last string which matched it.

Putting a star after a subexpression means that subexpression can match
zero or more times, but each time it only matches a single character
which is why your findall only returned the last character it matched.

You need to move the * inside the parentheses used to define the group,
then the group will match only once but will include everything that it
matched.

Consider:
>>> re.findall('(.)', 'abc')
['a', 'b', 'c']
>>> re.findall('(.)*', 'abc')
['c', '']
>>> re.findall('(.*)', 'abc')
['abc', '']

The first pattern finds a single character which findall manages to
match 3 times.

The second pattern finds a group with a single character zero or more
times in the string, so the first time it matches each of a, b, c in turn
and returns the c, and then next time around we get an empty string when
the group matched zero times.

In the third pattern we are looking for a group with any number of
characters in it. First time we get all of the string, then we get
another empty match.
 

Paul McGuire


Again, I'll repeat some earlier advice: you need to move the '*'
inside the parens - you are still leaving it outside. Also, get in
the habit of using raw literal notation (that is r"slkjdfljf" instead
of "lsjdlfkjs") when defining re strings - you don't have backslash
issues yet, but you will as soon as you start putting real '*'
characters in your expression.

However, when I test this,

import re

restr = r'/x(([^x]|x+[^/])*)x+/'
re_ = re.compile(restr)
print(re_.findall("/xabxxcx/ /x123xxx/"))

findall now starts to give a tuple for each "comment",

[('abxxc', 'xxc'), ('123xx', 'xx')]

so you have gone beyond my limited re skill, and will need help from
someone else.

But I suggest you add some tests with multiple consecutive 'x'
characters in the middle of your comment, and multiple consecutive 'x'
characters before the trailing comment. In fact, from my
recollections of trying to implement this type of comment recognizer
by hand a long time ago in a job far, far away, test with both even
and odd numbers of 'x' characters.

-- Paul
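
To illustrate the raw-string advice, here is a sketch of the book's pattern with real '*' characters in place of the 'x' placeholders. The translation and the sample text are assumptions made for illustration; the raw literal keeps each backslash intact so that the re module sees the escaped stars:

```python
import re

# Raw literal: the backslashes reach the re module untouched.
c_comment = re.compile(r'/\*([^*]|\*+[^/*])*\*+/')

# Made-up sample text with two comments.
source = 'a = 1; /* one */ b = 2; /* two ** three */'

# group(0) is the whole comment, regardless of what the group captured.
comments = [m.group(0) for m in c_comment.finditer(source)]
print(comments)  # ['/* one */', '/* two ** three */']
```

Using finditer with group(0) sidesteps the question of what the capturing group holds, which matters here because the group is repeated.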
 

proctor

ok, thanks michael.

so i am now assuming that either the book's example assumes perl, and
perl is different from python in this regard, or, that the book's
example is faulty. i understand all the examples given since my
question, and i know what i need to do to make it work. i am raising
the question because the book says one thing, but the example is not
working for me. i am searching for the source of the discrepancy.

i will try to research the differences between perl's and python's
regex engines.

thanks again,

sincerely,
proctor
 

proctor

thank you this is interesting. in the second example, where does the
'nothingness' match, at the end? why does the regex 'run again' when
it has already matched everything? and if it reports an empty match
along with a non-empty match, why only the two?

sincerely,
proctor
 

Duncan Booth

proctor said:
re.findall('(.)*', 'abc')
['c', '']
thank you this is interesting. in the second example, where does the
'nothingness' match, at the end? why does the regex 'run again' when
it has already matched everything? and if it reports an empty match
along with a non-empty match, why only the two?

There are 4 possible starting points for a regular expression to match in a
three character string. The regular expression would match at any starting
point so in theory you could find 4 possible matches in the string. In this
case they would be 'abc', 'bc', 'c', ''.

However findall won't get any overlapping matches, so there are only two
possible matches and it returns both of them: 'abc' and '' (or rather it
returns the matching group within the match, so you only see the 'c'
although it matched 'abc').

If you use a regex which doesn't match an empty string (e.g. '/x(.*?)x/')
then you won't get the empty match.
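
One way to see where the two matches sit is to ask finditer for the span of each match. A short sketch; note that findall reports a group that took no part in a match as an empty string, which is why the second entry shows up as '':

```python
import re

# Collect the position, the whole match, and the group for every match.
rows = [(m.span(), m.group(0), m.group(1))
        for m in re.finditer(r'(.)*', 'abc')]

for span, whole, group1 in rows:
    print(span, repr(whole), repr(group1))
# (0, 3) 'abc' 'c'
# (3, 3) '' None

found = re.findall(r'(.)*', 'abc')
print(found)  # ['c', '']
```

The second match is the empty one at position 3, the end of the string, where the group repeats zero times.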
 

proctor

thanks paul,

the reason the regex now gives tuples is that there are now 2 groups,
the inner and outer parens. so group 1 (the outer parens) matches with
the star, and group 2 (the inner parens) matches without the star.

sincerely,
proctor
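
If only one string per comment is wanted, one option is to make the inner group non-capturing with (?:...), which leaves a single capturing group. A sketch, assuming the same pattern and test string as in the post being replied to, with only the inner parens changed:

```python
import re

# Inner parens changed to (?:...) so they group without capturing.
restr = r'/x((?:[^x]|x+[^/])*)x+/'
re_ = re.compile(restr)

bodies = re_.findall("/xabxxcx/ /x123xxx/")
print(bodies)  # ['abxxc', '123xx']
```

With exactly one capturing group, findall returns a flat list of strings rather than a list of tuples.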
 

proctor

thank you all again for helping to clarify this for me. of course you
were exactly right, and the problem lay not with python or the text,
but with me. i mistakenly understood the text to be attempting to
capture the C style comment, when in fact it was merely matching it.

apologies.

sincerely,
proctor
 
