Distributions, RE-verb and the like

Discussion in 'Python' started by bearophileHUGS@lycos.com, Dec 29, 2005.

  1. Guest

    Psyco is finished now, and it works on the x86, for Win, the new macs,
    many linux boxes, etc, and it's quite useful, so maybe it can be added
    to the standard Python distribution.

    PyChecker (and the other similar ones that work differently) is very
    useful too, and it's pure Python, so maybe it too (or something
    similar) can be added to the standard distribution.

    --------------------

    Regular Expressions can be useful but:
    - They look really an-pythonic
    - Their syntax is difficult to remember
    - It's not easy to understand and debug REs written by other people,
    comments help a little.
    - They can have hidden bugs (because of their low readability)
    - they mix their opeators/syntax with the data (this is thier advantage
    too), this can create problems, and makes them less general (because
    you have to avoid mistaking syntax elements with the data).
    - Python has already syntax to define structures, loops, etc, so
    another syntax can be seen as a kind of duplication.

    Such things go against lot of points of the Python Zen. So I'd like a
    more pythonic syntax, easy to remember, easy to read and debug, where
    data and operators are fully separated. The "reverb" library does
    something like that already:
    http://home.earthlink.net/~jasonrandharper/reverb.py

    It compiles RE written like this:

    xdigit = set(digits, char('a').to('f'), char('A').to('F'))
    pat = RE((text('$') | text('0x') | text('0X')) + required(hexdigit, max
    = 8) - followedBy(hexdigit))

    I have already "improved" reverb a little, but I think a better and
    simpler syntax can be invented by people more expert than me in REs.
    Here are some alrernative syntax possibilities, I don't like them, most
    of them are impossible, silly, stupid, etc, but I am sure a good syntax
    can be invented.

    Standard RE:
    pat.pattern: (?:\$|0x|0X)[\da-gA-G]{1,8}(?![\da-gA-G])

    hexdigit = set(digits, chrint('a','f'), chrint('A','F'))
    pat = RE((text('$') | text('0x') | text('0X')) + repeated(hexdigit, 1,
    8) - followedBy(hexdigit))

    hexdigit = alt(digits, chrint('a','f'), chrint('A','F'))
    pat = optional("-") + alt('$', '0x', '0X') + times(hexdigit, 1, 8) -
    hexdigit

    hexdigit = VR(digits, interval('a', 'f'), interval('A', 'F'))
    pat = optional("-") + VR('$', '0x', '0X') + times(hexdigit, 1, 8)-
    hexdigit

    hexdigit = VR(digits, interval('a', 'f'), interval('A', 'F'))
    pat = VR("-", min=0) + VR('$', '0x', '0X') + VR(hexdigit, min=1, max=8)
    - hexdigit

    hexdigit = VR( VR(digits) | interval('a', 'f') | interval('A', 'F') )
    pat = VR("-", 0) + VR(VR('$') | VR('0x') | VR('0X')) + VR(hexdigit, 1,
    8) - hexdigit

    hexdigit = Alt(digits, interval('a', 'f'), interval('A', 'F'))
    pat = VR("-", 0) + Alt('$', '0x', '0X') + VR(hexdigit, 1, 8) - hexdigit

    hexdigit = Alternative(digits, Interval('a', 'f'), Interval('A', 'F'))
    pat = Optional("-") + Alternative('$', '0x', '0X') + RE(Hexdigit, 1, 8)
    - Hexdigit

    hexdigit = Alternative(digits, Interval('a', 'f'), Interval('A', 'F'))
    pat = RE("-", 0) + Alternative('$', '0x', '0X') + RE(Hexdigit, 1, 8) -
    Hexdigit

    hexdigit = RE([digits, Interval('a', 'f'), Interval('A', 'F')]) #
    flatten sul primo parametro
    pat = RE("-", 0) + RE(['$', '0x', '0X']) + RE(Hexdigit, 1, 8) -
    Hexdigit

    hexdigit = RE(digits, Interval('a', 'f'), Interval('A', 'F'))
    pat = RE("-").repeat(0) + RE('$', '0x', '0X') + Hexdigit.repeat(1, 8) -
    Hexdigit

    hexdigit = VRE(digits, Interval('a', 'f'), Interval('A', 'F'))
    pat = VRE("-").repeat(0) + VRE('$', '0x', '0X') + Hexdigit.repeat(1, 8)
    - Hexdigit

    hexdigit = Vre(digits, Interval('a', 'f'), Interval('A', 'F'))
    hexnum = Vre("-").repeat(0) + Vre('$', '0x', '0X') + Hexdigit.repeat(1,
    8) - Hexdigit

    hexdigit = Vre(digits, Interval('a', 'f'), Interval('A', 'F'))
    hexnum = Vre("-").optional() + Vre('$', '0x', '0X') +
    Hexdigit.repeat(1, 8) - Hexdigit

    hexdigit = Vre(Vre().digits, Interval('a', 'f'), Interval('A', 'F'))
    hexnum = Optional("-") + Vre('$', '0x', '0X') + Repeat(Hexdigit, 1, 8)
    - Hexdigit

    hexdigit = Vre(Vre().digits, Interval('a', 'f'), Interval('A', 'F'))
    hexnum = Vre("-").optional() + Vre('$', '0x', '0X') +
    Hexdigit.repeat(1, 8) - Hexdigit

    hexdigit = Vre(Vre().digits, Interval('a', 'f')).ignorecase()
    hexnum = Vre("-").optional() + Vre('$', '0x').ignorecase() +
    Hexdigit.repeat(1, 8) - Hexdigit

    hexdigit = Alternative(Digits, Interval('a', 'f')).ignorecase()
    hexnum = Optional("-") + Alternative('$', '0x').ignorecase() +
    Repeat(Hexdigit, 1, 8) - Hexdigit


    I think that once the best syntax is found, implementing a better
    reverb-like module isn't too much work (my modified version of reverb
    is only about 130 LOCs).

    Bye,
    bearophile
     
    , Dec 29, 2005
    #1
    1. Advertising

  2. Paul McGuire Guest

    Bearophile -

    Well, I fear this may end up being another of those "easier to
    reinvent" wheels. All of your issues with RE's are the same ones I had
    with lex/yacc (and re's too) when I wrote pyparsing.

    Any chance for convergence here?

    (BTW, there is nothing inherently wrong with "reinventing wheels". The
    metaphor is a bit flawed, since there are many different types of
    wheels in the world, and not all interchangeable - consider a tractor
    wheel vs. a bicycle wheel. Some legitimate/valid/justified endeavors
    are wrongly indicted for "reinventing the wheel" when in fact they are
    focusing on a particular niche of wheeldom, deserving of its own
    specialized invention.)

    -- Paul
     
    Paul McGuire, Dec 29, 2005
    #2
    1. Advertising

  3. Paul McGuire Guest

    Oh, the pyparsing rendition of your initial pat expression would be
    something like:

    import pyparsing as pp
    pat = pp.Combine( pp.oneOf("$ 0x 0X") + pp.Word(pp.hexnums,max=8) )

    Combine is needed to ensure that the leading $, 0x, or 0X is
    immediately followed by 1-8 (and no more than 8) hex digits.
    Otherwise, pyparsing is pretty tolerant of whitespace cropping up
    wherever.

    As for some of your other syntaxes:

    I'm not sure what "Vre" means.

    I found that "Alternative" needs to support both greedy and non-greedy
    matches, so I provided Or and MatchFirst, respectively. They are also
    definable using '^' and '|' operators, again respectively. Finally, I
    ran into Literal("this") | Literal("that") | Literal("other") so often
    that I just added a helper method oneOf that would take the string
    "this that other" and build the right expression out of it. This too
    is non-trivial, as you have to take care that some short literals may
    mask longer ones in the list, as in oneOf("< = > <= >= !="). Just
    replacing this directly with Literal("<") | Literal("=") | ... would
    prevent any matching of the ">=" or "<=" literals. You could replace
    with the Or (^) form, but this exhaustively checks all alternatives all
    the time, a regrettable run-time performance penalty. Pyparsing's
    implementation of oneOf leaves the literals in the given order, unless
    a duplicate is given, or an earlier literal masks a later one - in that
    case, the longer literal is moved ahead of the shorter.

    I implemented Optional as a wrapper-type class, as opposed to the
    ..optional() method that you have given - I'd say there are tradeoffs
    either way, just making the comparison.

    Your "repeated" or "times" seem to map roughly to pyparsing's OneOrMore
    and ZeroOrMore.

    Any thought how a recursive grammar might look?

    I don't find 'Interval' to be very easy on the eyes. In this case, I
    stole^H^H^H^H^H borrowed the re form of "[A-Za-z0-9]", providing a
    method named srange ("s" is for "string") such that srange("a-fA-F")
    would return the string "abcdefABCDEF".

    The other end of this process has to do with how the calling program
    will process the parsed results. Once a grammar gets too deeply
    nested, or has too many Optional elements, just returning a simple list
    or nested list of tokens isn't enough. Pyparsing returns ParseResults
    objects, which can be accessed as a list, dictionary, or object with
    attributes (provided individual fields have been given names at grammar
    definition time). I *have* had some complaints about ParseResults
    ("ParseResults are evil"), but the named access is a life-saver for
    complex grammars. (Simple case, the first token for your hex number is
    an optional sign - without names, you can't just access field 2, say,
    of the expression, you have to first test to see if the sign was
    provided or not, and then access field 2 or 3 accordingly. On the
    other hand, if you had given field 2 a name, your parser would be more
    robust, even you later changed your grammar to include other elements,
    such as a leading, um, currency symbol or something.)

    Just some fodder for your reverb considerations...

    -- Paul
     
    Paul McGuire, Dec 29, 2005
    #3
  4. Guest

    Paul McGuire wrote:

    > I don't find 'Interval' to be very easy on the eyes. In this case, I
    > stole^H^H^H^H^H borrowed the re form of "[A-Za-z0-9]", providing a
    > method named srange ("s" is for "string") such that srange("a-fA-F")
    > would return the string "abcdefABCDEF".


    Thank you for your answers Paul. Just a note: I have called it
    interval-something (like cinterval or Interval) instead of
    range-something because it returns a closed interval (all letters
    inclusive of both endpoints), to show its difference from the Python
    range that returns a right open interval. Calling a srange("a","z") in
    Python lets me think that it generates the ["a",...,"y"] range.

    Bye,
    bearophile
     
    , Jan 3, 2006
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. =?Utf-8?B?UmV6YSBTb2xvdWtp?=

    The web serve has been locked down and is blocking the Debug verb.

    =?Utf-8?B?UmV6YSBTb2xvdWtp?=, Nov 24, 2004, in forum: ASP .Net
    Replies:
    3
    Views:
    3,613
    Ben Strackany
    Nov 24, 2004
  2. Randy
    Replies:
    0
    Views:
    433
    Randy
    Mar 29, 2006
  3. Replies:
    0
    Views:
    315
  4. Brian

    IIS5-like verb filter in IIS6

    Brian, Oct 16, 2006, in forum: ASP .Net
    Replies:
    0
    Views:
    286
    Brian
    Oct 16, 2006
  5. Timothy McDonald
    Replies:
    1
    Views:
    68
    Ned Deily
    Apr 28, 2014
Loading...

Share This Page