regex walkthrough

Discussion in 'Python' started by rh, Dec 8, 2012.

  1. rh

    rh Guest

    Looking through some code, I found this and wondered what it does:
    ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$

    Here's my walk through:

    1) ^ match at start of string
    2) ?P<salsipuedes> if a match is found it will be accessible in a variable
    salsipuedes
    3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see below
    4) + one or more from the preceding char class
    5) () the grouping we want returned (see #2)
    6) $ end of the string to match against but before any newline
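A small sketch of point 6, assuming Python 3's re module (the sample string is made up for illustration). It shows that `$` matches at the very end of the string and also just before a single trailing newline, while `fullmatch()` rejects the trailing newline:

```python
import re

# The pattern from the first post, written as a raw string.
PATTERN = re.compile(r'^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$')

# "$" matches at the very end of the string ...
at_end = PATTERN.match('some/path_1.txt')

# ... and also just before a single trailing newline.
before_newline = PATTERN.match('some/path_1.txt\n')

# fullmatch() insists that the match covers the whole string,
# so the trailing newline makes it fail.
strict = PATTERN.fullmatch('some/path_1.txt\n')

print(at_end is not None, before_newline is not None, strict is not None)
# The captured group never includes the newline itself.
print(repr(before_newline.group('salsipuedes')))
```

If a trailing newline must be rejected, `fullmatch()` or `\Z` in place of `$` are the usual options.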


    more on #3
    the z-_ part looks wrong; it seems that the - should be at the start
    of the char set, otherwise we get another range z-_. Or does the a-z
    preceding the z-_ keep the z-_ from becoming a range? The "."
    might be ok inside a char set. The two slashes look wrong, but maybe
    they have some special meaning in some case? I think only one slash is
    needed.

    I've looked at pydoc re, but it's cursory.
    rh, Dec 8, 2012
    #1

  2. rh

    Hans Mulder Guest

    On 8/12/12 18:48:13, rh wrote:
    > Looking through some code, I found this and wondered what it does:
    > ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
    >
    > Here's my walk through:
    >
    > 1) ^ match at start of string
    > 2) ?P<salsipuedes> if a match is found it will be accessible in a
    > variable salsipuedes


    I wouldn't call it a variable. If m is a match-object produced
    by this regex, then m.group('salsipuedes') will return the part
    that was captured.

    I'm not sure, though, why you'd want to define a group that
    effectively spans the whole regex. If there's a match, then
    m.group(0) will return the matching substring, and
    m.group('salsipuedes') will return the substring that matched
    the parenthesized part of the pattern, and these two substrings
    will be equal, since the only bits of the pattern outside the
    parentheses are zero-width assertions.
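To make the point about the match object concrete, here is a minimal sketch (the input string is an invented example) comparing `m.group(0)` with the named group:

```python
import re

m = re.match(r'^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$', 'docs/read-me_v2.txt')

whole = m.group(0)               # the entire matched substring
named = m.group('salsipuedes')   # the named group, looked up by name
numbered = m.group(1)            # the same group, looked up by position

print(whole == named == numbered)
print(m.groupdict())
```

All three lookups return the same text here, because only `^` and `$` sit outside the parentheses and neither consumes any characters.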

    > 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see below
    > 4) + one or more from the preceding char class
    > 5) () the grouping we want returned (see #2)
    > 6) $ end of the string to match against but before any newline
    >
    > more on #3
    > the z-_ part looks wrong; it seems that the - should be at the start
    > of the char set, otherwise we get another range z-_. Or does the a-z
    > preceding the z-_ keep the z-_ from becoming a range?


    The latter: a-z is a range and blocks the z-_ from being a range.
    Consequently, the -_ bit matches only - and _.
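This can be checked directly. The sketch below assumes Python 3's re module; the probe characters are my own choice. It also shows that a real z-_ range would not even compile, since "z" sorts after "_":

```python
import re

CHAR_CLASS = re.compile(r'[0-9A-Za-z-_.//]')

# "-" and "_" are both matched as literal characters.
literals_ok = all(CHAR_CLASS.fullmatch(c) for c in '-_')

# "^" (code point 94) is neither listed in the set nor covered by a range.
caret_rejected = CHAR_CLASS.fullmatch('^') is None

# If z-_ really were parsed as a range it would be invalid, because
# "z" (122) sorts after "_" (95). Compiling such a range raises re.error.
try:
    re.compile(r'[z-_]')
    reversed_range_error = None
except re.error as exc:
    reversed_range_error = str(exc)

print(literals_ok, caret_rejected, reversed_range_error)
```

Putting the `-` first or last in the set, or escaping it as `\-`, avoids the question altogether.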

    > The "." might be ok inside a char set.


    It is. Most special characters lose their special meaning
    inside a char set.
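A quick illustration of the dot losing its special meaning, as a sketch using Python 3's re module:

```python
import re

# Outside a character set, "." matches any character except a newline.
dot_outside = re.fullmatch(r'.', 'a') is not None

# Inside a character set, "." only matches a literal dot.
dot_inside_letter = re.fullmatch(r'[.]', 'a') is not None
dot_inside_dot = re.fullmatch(r'[.]', '.') is not None

# A few characters do stay special inside a set: "]" closes it, "\" escapes,
# "-" forms a range between two characters, and a leading "^" negates the set.
negated = re.fullmatch(r'[^.]', 'a') is not None

print(dot_outside, dot_inside_letter, dot_inside_dot, negated)
```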

    > The two slashes look wrong but maybe it has some special meaning
    > in some case? I think only one slash is needed.


    You're correct: there's no special meaning and only one slash
    is needed. But then, a char set is a set and duplicates are
    simply ignored, so it does no harm.
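As a sanity check, the sketch below compares the set with and without the duplicated slash. The sample strings are invented, and the comparison covers only the ASCII range:

```python
import re

with_duplicate = re.compile(r'^[0-9A-Za-z-_.//]+$')
without_duplicate = re.compile(r'^[0-9A-Za-z-_./]+$')

# A few hand-picked samples, including a backslash, which neither set contains.
samples = ['a/b', 'a//b', 'a\\b', 'file.txt', '']
results = {
    s: (bool(with_duplicate.match(s)), bool(without_duplicate.match(s)))
    for s in samples
}

# Both patterns agree on every single ASCII character as well.
same_on_ascii = all(
    bool(with_duplicate.match(chr(i))) == bool(without_duplicate.match(chr(i)))
    for i in range(128)
)

print(results)
print(same_on_ascii)
```

Note that 'a//b' is accepted by both patterns: the `+` allows any number of characters from the set, so one `/` in the set already permits repeated slashes in the input.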

    Perhaps the person who wrote this was confusing slashes and
    backslashes.

    > I've looked at pydoc re, but it's cursory.


    That's one way of putting it.


    Hope this helps,

    -- HansM
    Hans Mulder, Dec 8, 2012
    #2

  3. rh

    rh Guest

    On Sat, 08 Dec 2012 20:33:37 +0100
    Hans Mulder wrote:

    > On 8/12/12 18:48:13, rh wrote:
    > > Looking through some code, I found this and wondered what it
    > > does: ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
    > >
    > > Here's my walk through:
    > >
    > > 1) ^ match at start of string
    > > 2) ?P<salsipuedes> if a match is found it will be accessible in a
    > > variable salsipuedes

    >
    > I wouldn't call it a variable. If m is a match-object produced
    > by this regex, then m.group('salsipuedes') will return the part
    > that was captured.
    >
    > I'm not sure, though, why you'd want to define a group that
    > effectively spans the whole regex. If there's a match, then
    > m.group(0) will return the matching substring, and
    > m.group('salsipuedes') will return the substring that matched
    > the parenthesized part of the pattern, and these two substrings
    > will be equal, since the only bits of the pattern outside the
    > parentheses are zero-width assertions.


    Good point, it's making the re engine do extra work.
    It's not my code and that's another gap in the author's proficiency.
    (I don't know who the author is....FWIW)

    >
    > > 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see
    > > below
    > > 4) + one or more from the preceding char class
    > > 5) () the grouping we want returned (see #2)
    > > 6) $ end of the string to match against but before any newline
    > >
    > > more on #3
    > > the z-_ part looks wrong; it seems that the - should be at the start
    > > of the char set, otherwise we get another range z-_. Or does the a-z
    > > preceding the z-_ keep the z-_ from becoming a range?

    >
    > The latter: a-z is a range and blocks the z-_ from being a range.
    > Consequently, the -_ bit matches only - and _.
    >
    > > The "." might be ok inside a char set.

    >
    > It is. Most special characters lose their special meaning
    > inside a char set.
    >
    > > The two slashes look wrong but maybe it has some special meaning
    > > in some case? I think only one slash is needed.

    >
    > You're correct: there's no special meaning and only one slash
    > is needed. But then, a char set is a set and duplicates are
    > simply ignored, so it does no harm.


    I wonder if it hurts performance. Probably not,
    but regexes are tricky code and can be expensive even when written
    well. For example, does this perform better than the original:
    ^(?P<salsipuedes>[-\w./]+)$

    Not sure if the \w sequence includes the - or the . or the /
    I think it does not.
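One way to explore both questions is sketched below, assuming Python 3. The test subject and repeat count are arbitrary choices of mine, and the timings depend on the machine, so no particular result is claimed. The first part checks that the two patterns accept the same ASCII characters; non-ASCII input can differ, because `\w` is broader there.

```python
import re
import timeit

original = re.compile(r'^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$')
shorter = re.compile(r'^(?P<salsipuedes>[-\w./]+)$')

# Compare the two patterns on every single ASCII character.
same_on_ascii = all(
    bool(original.match(chr(i))) == bool(shorter.match(chr(i)))
    for i in range(128)
)
print('same on ASCII:', same_on_ascii)

# Rough timing on an arbitrary subject string; treat the numbers as a hint only.
subject = 'some/long-ish_path.name/' * 50
t_original = timeit.timeit(lambda: original.match(subject), number=2000)
t_shorter = timeit.timeit(lambda: shorter.match(subject), number=2000)
print(f'original: {t_original:.4f}s  shorter: {t_shorter:.4f}s')
```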

    >
    > Perhaps the person who wrote this was confusing slashes and
    > backslashes.


    Possibly.

    >
    > > I've looked at pydoc re, but it's cursory.

    >
    > That's one way of putting it.
    >
    >
    > Hope this helps,


    Does help, thanks.

    >
    > -- HansM
    rh, Dec 8, 2012
    #3
  4. rh

    Hans Mulder Guest

    On 8/12/12 23:57:48, rh wrote:
    > Not sure if the \w sequence includes the - or the . or the /
    > I think it does not.


    You guessed right:

    >>> import re
    >>> [ c for c in 'x-./y' if re.match(r'\w', c) ]
    ['x', 'y']
    >>>


    So x and y match \w and -, . and / do not.


    Hope this helps,

    -- HansM
    Hans Mulder, Dec 8, 2012
    #4
  5. rh

    MRAB Guest

    On 2012-12-08 23:34, Hans Mulder wrote:
    > On 8/12/12 23:57:48, rh wrote:
    >> Not sure if the \w sequence includes the - or the . or the /
    >> I think it does not.

    >
    > You guessed right:
    >
    > >>> [ c for c in 'x-./y' if re.match(r'\w', c) ]
    > ['x', 'y']
    > >>>

    >
    > So x and y match \w and -, . and / do not.
    >

    This is shorter:

    >>> re.findall(r'\w', 'x-./y')
    ['x', 'y']

    But remember that r"\w" is more than just r"[A-Za-z0-9_]" (unless
    you're using ASCII).
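A small sketch of that difference, assuming Python 3 with str patterns, where `\w` is Unicode-aware by default and the `re.ASCII` flag restricts it (the sample text is made up):

```python
import re

text = 'x-./y\u00e9'   # "x-./y" followed by an e with acute accent

unicode_words = re.findall(r'\w', text)            # default for str patterns
ascii_words = re.findall(r'\w', text, re.ASCII)    # \w limited to [A-Za-z0-9_]
explicit_set = re.findall(r'[A-Za-z0-9_]', text)   # the spelled-out ASCII set

print(unicode_words)
print(ascii_words)
print(explicit_set)
```

So replacing `0-9A-Za-z_` with `\w` changes what the pattern accepts for non-ASCII input unless `re.ASCII` is given.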
    MRAB, Dec 9, 2012
    #5