Re: regex walktrough

Discussion in 'Python' started by rh, Dec 8, 2012.

  1. rh

    rh Guest

    On Sat, 08 Dec 2012 18:08:36 +0000
    MRAB <> wrote:

    > On 2012-12-08 17:48, rh wrote:
    > > Looking through some code I found this and wondered about what it
    > > does: ^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$
    > >
    > > Here's my walk through:
    > >
    > > 1) ^ match at start of string
    > > 2) ?P<salsipuedes> if a match is found it will be accessible in a
    > > variable salsipuedes
    > > 3) [0-9A-Za-z-_.//] this is the one that looks wrong to me, see
    > > below
    > > 4) + one or more from the preceding char class
    > > 5) () the grouping we want returned (see #2)
    > > 6) $ end of the string to match against but before any newline
    > >
    > >
    > > more on #3
    > > the z-_ part looks wrong and seems that the - should be at the start
    > > of the char set otherwise we get another range z-_ or does the a-z
    > > preceding the z-_ negate the z-_ from becoming a range? The "."
    > > might be ok inside a char set. The two slashes look wrong but maybe
    > > it has some special meaning in some case? I think only one slash is
    > > needed.
    > >
    > > I've looked at pydoc re, but it's cursory.
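Step 6 of the walkthrough above says `$` matches at the end of the string "but before any newline". A minimal sketch of what that means in practice, assuming Python 3 and no re.MULTILINE flag; the names `dollar` and `strict` are only illustrative and do not come from the code rh found:

```python
import re

# "$" matches at the very end of the string, or just before a trailing newline.
dollar = re.compile(r"^[0-9A-Za-z-_.//]+$")

# "\Z" matches only at the very end of the string.
strict = re.compile(r"^[0-9A-Za-z-_.//]+\Z")

candidate = "abc\n"  # valid characters followed by a trailing newline

print(bool(dollar.match(candidate)))  # the trailing newline is tolerated by "$"
print(bool(strict.match(candidate)))  # "\Z" refuses it
print(bool(strict.match("abc")))      # without the newline both forms agree
```

If the pattern is meant to validate input, the `\Z` form may be the safer choice, since `$` lets one trailing newline through.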
    > >

    > Python itself will help you:
    >
    > >>> re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$",
    > ...            flags=re.DEBUG)

    > at at_beginning
    > subpattern 1
    >   max_repeat 1 65535
    >     in
    >       range (48, 57)
    >       range (65, 90)
    >       range (97, 122)
    >       literal 45
    >       literal 95
    >       literal 46
    >       literal 47
    >       literal 47
    > at at_end
    >
    > Inside the character set: "0-9", "A-Z" and "a-z" are ranges; "-", "_",
    > "." and "/" are literals. Doubling the "/" is unnecessary (it has no
    > special meaning). "-" is a literal because it immediately follows a
    > range, so it can't be defining another range (if it immediately
    > followed a literal and wasn't immediately followed by an unescaped "]"
    > then it would, so r"[a-]" is the same as r"[a\-]").
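MRAB's point about the hyphen can be checked directly. The sketch below, assuming Python 3, compares the original pattern with a tidied spelling that puts the hyphen first and drops the doubled slash. The `tidied` pattern and the sample strings are made up for illustration and are not from the thread:

```python
import re

# Original pattern from the thread: the "-" right after the a-z range is a literal.
original = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")

# Less ambiguous spelling of the same set: hyphen first, single slash.
tidied = re.compile(r"^(?P<salsipuedes>[-0-9A-Za-z_./]+)$")

samples = ["a-b_c.d/e", "path/to/file-1.txt", "has space", "semi;colon", ""]

for s in samples:
    # Both patterns should accept or reject each sample identically.
    print(repr(s), bool(original.match(s)), bool(tidied.match(s)))
```

Putting the hyphen first (or last, or escaping it) avoids having to remember the "immediately follows a range" rule when reading the pattern later.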


    Handy tip there, thanks.

    re.compile(r"^(?P<salsipuedes>[-\w./]+)$", flags=re.DEBUG)
    at at_beginning
    subpattern 1
      max_repeat 1 65535
        in
          literal 45
          category category_word
          literal 46
          literal 47
    at at_end

    I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
    category_word. Some other re flag?

    >
    > As for "(?P<salsipuedes>...)", it won't be accessible in a variable
    > "salsipuedes", but will be accessible as a named group in the match
    > object:
    >
    > >>> m = re.match(r"(?P<foo>[a-z]+)", "xyz")
    > >>> m.group("foo")

    > 'xyz'
    >


    Ok, "named group" it is.
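For reference, a short sketch of the ways a named group can be read back from a match object, assuming Python 3. The input string is invented for the example:

```python
import re

pattern = re.compile(r"^(?P<salsipuedes>[0-9A-Za-z-_.//]+)$")

# Hypothetical input, chosen only so that the pattern matches.
m = pattern.match("docs/readme-v2.txt")

if m is not None:
    by_name = m.group("salsipuedes")  # look the group up by its name
    by_index = m.group(1)             # the same group, by position
    as_dict = m.groupdict()           # all named groups as a dict
    print(by_name, by_index, as_dict)
```

Nothing is bound to a variable called salsipuedes; the name only exists as a key on the match object.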
     
    rh, Dec 8, 2012
    #1

  2. rh

    Hans Mulder Guest

    On 8/12/12 23:19:40, rh wrote:
    > I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
    > category_word. Some other re flag?


    The category word consists of the '_' character and the
    characters for which .isalnum() returns True.

    On my system there are 102158 characters matching '\w':

    >>> import re, sys
    >>> sum(1 for i in range(sys.maxunicode+1)
    ...     if re.match(r'\w', chr(i)))
    102158
    >>>


    You wouldn't want to see the complete list.
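The description of the word category above can be spot-checked without printing the whole list. This is a rough sketch, assuming Python 3 and a str pattern without the re.ASCII flag. The helper `is_word_char` and the cut-off `LIMIT` are invented for the example, and only a slice of the code point range is checked to keep it quick:

```python
import re

word = re.compile(r"\w")


def is_word_char(ch):
    """The rule as described in the post: '_' or a character where isalnum() is True."""
    return ch == "_" or ch.isalnum()


# Arbitrary cut-off for this sketch; raise it to sys.maxunicode + 1 for a full check.
LIMIT = 0x3000

# Collect every code point where the regex and the rule disagree.
mismatches = [i for i in range(LIMIT)
              if bool(word.match(chr(i))) != is_word_char(chr(i))]

print(len(mismatches))
```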

    -- HansM
     
    Hans Mulder, Dec 8, 2012
    #2

  3. rh

    MRAB Guest

    On 2012-12-08 23:27, Hans Mulder wrote:
    > On 8/12/12 23:19:40, rh wrote:
    >> I reduced the expression too. Now I wonder why re.DEBUG doesn't unroll
    >> category_word. Some other re flag?

    >
    > The category word consists of the '_' character and the
    > characters for which .isalnum() returns True.
    >
    > On my system there are 102158 characters matching '\w':
    >

    That would be because you're using Python 3, where strings are Unicode.

    > >>> sum(1 for i in range(sys.maxunicode+1)
    > ...     if re.match(r'\w', chr(i)))
    > 102158
    > >>>

    >
    > You wouldn't want to see the complete list.
    >

    The number of such codepoints depends on which version of Unicode is
    being supported (Unicode is evolving all the time).
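If only the ASCII letters, digits and underscore are wanted, the re.ASCII flag restricts \w for str patterns in Python 3. A minimal sketch using rh's reduced character set; the sample strings are made up for illustration:

```python
import re

# Default for str patterns in Python 3: \w follows the Unicode definition.
unicode_word = re.compile(r"^[-\w./]+$")

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_word = re.compile(r"^[-\w./]+$", re.ASCII)

for s in ["file-1.txt", "caf\u00e9/menu.txt"]:
    # The second sample contains U+00E9, a non-ASCII letter.
    print(repr(s), bool(unicode_word.match(s)), bool(ascii_word.match(s)))
```

With the flag set, the reduced pattern accepts the same characters as the original explicit set, regardless of which Unicode version the interpreter supports.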
     
    MRAB, Dec 9, 2012
    #3
  4. rh

    rh Guest

    On Sun, 09 Dec 2012 00:27:30 +0100
    Hans Mulder <> wrote:

    > On 8/12/12 23:19:40, rh wrote:
    > > I reduced the expression too. Now I wonder why re.DEBUG doesn't
    > > unroll category_word. Some other re flag?

    >
    > The category word consists of the '_' character and the
    > characters for which .isalnum() returns True.
    >
    > On my system there are 102158 characters matching '\w':
    >
    > >>> sum(1 for i in range(sys.maxunicode+1)
    > ...     if re.match(r'\w', chr(i)))
    > 102158
    > >>>

    >
    > You wouldn't want to see the complete list.


    No, and I also wouldn't want to use \w unless it's really needed.
    So that answers my other question.

    >
    > -- HansM



     
    rh, Dec 9, 2012
    #4
