Distributions, RE-verb and the like

B

bearophileHUGS

Psyco is finished now, and it works on the x86, for Win, the new macs,
many linux boxes, etc, and it's quite useful, so maybe it can be added
to the standard Python distribution.

PyChecker (and the other similar ones that work differently) is very
useful too, and it's pure Python, so maybe it too (or something
similar) can be added to the standard distribution.

--------------------

Regular Expressions can be useful but:
- They look really an-pythonic
- Their syntax is difficult to remember
- It's not easy to understand and debug REs written by other people,
comments help a little.
- They can have hidden bugs (because of their low readability)
- they mix their opeators/syntax with the data (this is thier advantage
too), this can create problems, and makes them less general (because
you have to avoid mistaking syntax elements with the data).
- Python has already syntax to define structures, loops, etc, so
another syntax can be seen as a kind of duplication.

Such things go against lot of points of the Python Zen. So I'd like a
more pythonic syntax, easy to remember, easy to read and debug, where
data and operators are fully separated. The "reverb" library does
something like that already:
http://home.earthlink.net/~jasonrandharper/reverb.py

It compiles RE written like this:

xdigit = set(digits, char('a').to('f'), char('A').to('F'))
pat = RE((text('$') | text('0x') | text('0X')) + required(hexdigit, max
= 8) - followedBy(hexdigit))

I have already "improved" reverb a little, but I think a better and
simpler syntax can be invented by people more expert than me in REs.
Here are some alrernative syntax possibilities, I don't like them, most
of them are impossible, silly, stupid, etc, but I am sure a good syntax
can be invented.

Standard RE:
pat.pattern: (?:\$|0x|0X)[\da-gA-G]{1,8}(?![\da-gA-G])

hexdigit = set(digits, chrint('a','f'), chrint('A','F'))
pat = RE((text('$') | text('0x') | text('0X')) + repeated(hexdigit, 1,
8) - followedBy(hexdigit))

hexdigit = alt(digits, chrint('a','f'), chrint('A','F'))
pat = optional("-") + alt('$', '0x', '0X') + times(hexdigit, 1, 8) -
hexdigit

hexdigit = VR(digits, interval('a', 'f'), interval('A', 'F'))
pat = optional("-") + VR('$', '0x', '0X') + times(hexdigit, 1, 8)-
hexdigit

hexdigit = VR(digits, interval('a', 'f'), interval('A', 'F'))
pat = VR("-", min=0) + VR('$', '0x', '0X') + VR(hexdigit, min=1, max=8)
- hexdigit

hexdigit = VR( VR(digits) | interval('a', 'f') | interval('A', 'F') )
pat = VR("-", 0) + VR(VR('$') | VR('0x') | VR('0X')) + VR(hexdigit, 1,
8) - hexdigit

hexdigit = Alt(digits, interval('a', 'f'), interval('A', 'F'))
pat = VR("-", 0) + Alt('$', '0x', '0X') + VR(hexdigit, 1, 8) - hexdigit

hexdigit = Alternative(digits, Interval('a', 'f'), Interval('A', 'F'))
pat = Optional("-") + Alternative('$', '0x', '0X') + RE(Hexdigit, 1, 8)
- Hexdigit

hexdigit = Alternative(digits, Interval('a', 'f'), Interval('A', 'F'))
pat = RE("-", 0) + Alternative('$', '0x', '0X') + RE(Hexdigit, 1, 8) -
Hexdigit

hexdigit = RE([digits, Interval('a', 'f'), Interval('A', 'F')]) #
flatten sul primo parametro
pat = RE("-", 0) + RE(['$', '0x', '0X']) + RE(Hexdigit, 1, 8) -
Hexdigit

hexdigit = RE(digits, Interval('a', 'f'), Interval('A', 'F'))
pat = RE("-").repeat(0) + RE('$', '0x', '0X') + Hexdigit.repeat(1, 8) -
Hexdigit

hexdigit = VRE(digits, Interval('a', 'f'), Interval('A', 'F'))
pat = VRE("-").repeat(0) + VRE('$', '0x', '0X') + Hexdigit.repeat(1, 8)
- Hexdigit

hexdigit = Vre(digits, Interval('a', 'f'), Interval('A', 'F'))
hexnum = Vre("-").repeat(0) + Vre('$', '0x', '0X') + Hexdigit.repeat(1,
8) - Hexdigit

hexdigit = Vre(digits, Interval('a', 'f'), Interval('A', 'F'))
hexnum = Vre("-").optional() + Vre('$', '0x', '0X') +
Hexdigit.repeat(1, 8) - Hexdigit

hexdigit = Vre(Vre().digits, Interval('a', 'f'), Interval('A', 'F'))
hexnum = Optional("-") + Vre('$', '0x', '0X') + Repeat(Hexdigit, 1, 8)
- Hexdigit

hexdigit = Vre(Vre().digits, Interval('a', 'f'), Interval('A', 'F'))
hexnum = Vre("-").optional() + Vre('$', '0x', '0X') +
Hexdigit.repeat(1, 8) - Hexdigit

hexdigit = Vre(Vre().digits, Interval('a', 'f')).ignorecase()
hexnum = Vre("-").optional() + Vre('$', '0x').ignorecase() +
Hexdigit.repeat(1, 8) - Hexdigit

hexdigit = Alternative(Digits, Interval('a', 'f')).ignorecase()
hexnum = Optional("-") + Alternative('$', '0x').ignorecase() +
Repeat(Hexdigit, 1, 8) - Hexdigit


I think that once the best syntax is found, implementing a better
reverb-like module isn't too much work (my modified version of reverb
is only about 130 LOCs).

Bye,
bearophile
 
P

Paul McGuire

Bearophile -

Well, I fear this may end up being another of those "easier to
reinvent" wheels. All of your issues with RE's are the same ones I had
with lex/yacc (and re's too) when I wrote pyparsing.

Any chance for convergence here?

(BTW, there is nothing inherently wrong with "reinventing wheels". The
metaphor is a bit flawed, since there are many different types of
wheels in the world, and not all interchangeable - consider a tractor
wheel vs. a bicycle wheel. Some legitimate/valid/justified endeavors
are wrongly indicted for "reinventing the wheel" when in fact they are
focusing on a particular niche of wheeldom, deserving of its own
specialized invention.)

-- Paul
 
P

Paul McGuire

Oh, the pyparsing rendition of your initial pat expression would be
something like:

import pyparsing as pp
pat = pp.Combine( pp.oneOf("$ 0x 0X") + pp.Word(pp.hexnums,max=8) )

Combine is needed to ensure that the leading $, 0x, or 0X is
immediately followed by 1-8 (and no more than 8) hex digits.
Otherwise, pyparsing is pretty tolerant of whitespace cropping up
wherever.

As for some of your other syntaxes:

I'm not sure what "Vre" means.

I found that "Alternative" needs to support both greedy and non-greedy
matches, so I provided Or and MatchFirst, respectively. They are also
definable using '^' and '|' operators, again respectively. Finally, I
ran into Literal("this") | Literal("that") | Literal("other") so often
that I just added a helper method oneOf that would take the string
"this that other" and build the right expression out of it. This too
is non-trivial, as you have to take care that some short literals may
mask longer ones in the list, as in oneOf("< = > <= >= !="). Just
replacing this directly with Literal("<") | Literal("=") | ... would
prevent any matching of the ">=" or "<=" literals. You could replace
with the Or (^) form, but this exhaustively checks all alternatives all
the time, a regrettable run-time performance penalty. Pyparsing's
implementation of oneOf leaves the literals in the given order, unless
a duplicate is given, or an earlier literal masks a later one - in that
case, the longer literal is moved ahead of the shorter.

I implemented Optional as a wrapper-type class, as opposed to the
..optional() method that you have given - I'd say there are tradeoffs
either way, just making the comparison.

Your "repeated" or "times" seem to map roughly to pyparsing's OneOrMore
and ZeroOrMore.

Any thought how a recursive grammar might look?

I don't find 'Interval' to be very easy on the eyes. In this case, I
stole^H^H^H^H^H borrowed the re form of "[A-Za-z0-9]", providing a
method named srange ("s" is for "string") such that srange("a-fA-F")
would return the string "abcdefABCDEF".

The other end of this process has to do with how the calling program
will process the parsed results. Once a grammar gets too deeply
nested, or has too many Optional elements, just returning a simple list
or nested list of tokens isn't enough. Pyparsing returns ParseResults
objects, which can be accessed as a list, dictionary, or object with
attributes (provided individual fields have been given names at grammar
definition time). I *have* had some complaints about ParseResults
("ParseResults are evil"), but the named access is a life-saver for
complex grammars. (Simple case, the first token for your hex number is
an optional sign - without names, you can't just access field 2, say,
of the expression, you have to first test to see if the sign was
provided or not, and then access field 2 or 3 accordingly. On the
other hand, if you had given field 2 a name, your parser would be more
robust, even you later changed your grammar to include other elements,
such as a leading, um, currency symbol or something.)

Just some fodder for your reverb considerations...

-- Paul
 
B

bearophileHUGS

Paul said:
I don't find 'Interval' to be very easy on the eyes. In this case, I
stole^H^H^H^H^H borrowed the re form of "[A-Za-z0-9]", providing a
method named srange ("s" is for "string") such that srange("a-fA-F")
would return the string "abcdefABCDEF".

Thank you for your answers Paul. Just a note: I have called it
interval-something (like cinterval or Interval) instead of
range-something because it returns a closed interval (all letters
inclusive of both endpoints), to show its difference from the Python
range that returns a right open interval. Calling a srange("a","z") in
Python lets me think that it generates the ["a",...,"y"] range.

Bye,
bearophile
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top