Is pyparsing really a recursive descent parser?

Discussion in 'Python' started by Just Another Victim of the Ambient Morality, Nov 2, 2007.

  1. Is pyparsing really a recursive descent parser? I ask this because
    there are grammars it can't parse that my recursive descent parser would
    parse, should I have written one. For instance:


    from pyparsing import *

    grammar = OneOrMore(Word(alphas)) + Literal('end')
    grammar.parseString('First Second Third end')


    Amazingly enough, this will fail to parse!
    Now, maybe I have no idea what I'm talking about but shouldn't a
    recursive descent parser recursively descend through all possible rule
    combinations to discover one that works? Why does pyparsing fail to parse
    this? Is there a way to get pyparsing to parse a grammar like this?
    Thank you...
     
    Just Another Victim of the Ambient Morality, Nov 2, 2007
    #1

  2. On Fri, 02 Nov 2007 06:05:13 +0000, Just Another Victim of the Ambient
    Morality wrote:

    > Is pyparsing really a recursive descent parser? I ask this because
    > there are grammars it can't parse that my recursive descent parser would
    > parse, should I have written one. For instance:
    >
    >
    > from pyparsing import *
    >
    > grammar = OneOrMore(Word(alphas)) + Literal('end')
    > grammar.parseString('First Second Third end')
    >
    >
    > Amazingly enough, this will fail to parse!
    > Now, maybe I have no idea what I'm talking about but shouldn't a
    > recursive descent parser recursively descend through all possible rule
    > combinations to discover one that works? Why does pyparsing fail to parse
    > this?



    Pyparsing is no recursive descent parser. It doesn't go back in the input
    stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when it
    can't get more, the parser moves to the ``Literal('end')`` part which
    fails because the 'end' is already gone.
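
    A minimal reproduction of that behaviour, for readers following along
    (the exact ParseException message varies between pyparsing versions):

    from pyparsing import OneOrMore, Word, alphas, Literal, ParseException

    grammar = OneOrMore(Word(alphas)) + Literal('end')
    try:
        grammar.parseString('First Second Third end')
    except ParseException as err:
        # OneOrMore has already consumed 'end' as an ordinary word, so the
        # trailing Literal('end') has nothing left to match against.
        print(err)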

    > Is there a way to get pyparsing to parse a grammar like this?


    Negative lookahead maybe:

    grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
    + Literal('end'))
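
    Run end to end, this suggestion does accept the original input (output
    shown as a comment; ParseResults formatting may differ slightly by
    version):

    from pyparsing import OneOrMore, NotAny, Word, alphas, Literal

    grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
               + Literal('end'))
    print(grammar.parseString('First Second Third end'))
    # -> ['First', 'Second', 'Third', 'end']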

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Nov 2, 2007
    #2

  3. Just Another Victim of the Ambient Morality

    Paul McGuire Guest

    On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <> wrote:
    >
    > Pyparsing is no recursive descent parser. It doesn't go back in the input
    > stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when it
    > can't get more, the parser moves to the ``Literal('end')`` part which
    > fails because the 'end' is already gone.
    >
    > > Is there a way to get pyparsing to parse a grammar like this?

    >
    > Negative lookahead maybe:
    >
    > grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
    > + Literal('end'))
    >
    > Ciao,
    > Marc 'BlackJack' Rintsch


    Well I'll be darned! All this time, I thought "recursive descent"
    described the recursive behavior of the parser, which pyparsing
    definitely has. I never knew that backing up in the face of parse
    mismatches was a required part of the picture.

    In pyparsing, each construction gets composed from more fine-grained
    constructions, and they are called recursively to match or not match
    against the input string.

    For example, taking your working parser example:

    grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
    + Literal('end'))

    This creates the following data structure:

    - And
      - OneOrMore
        - And
          - NotAny
            - Literal('end')
          - Word(alphas)
      - Literal('end')

    Every instance in this structure derives from the base class
    ParserElement, which defines a method parse(). parse() in turn calls
    preParse(), parseImpl(), and postParse(). If parseImpl succeeds, it
    returns a ParseResults object and the next location within the input
    string to continue parsing.

    The parseImpl methods are most often overridden in the subclasses (a
    few override postParse as well), such as:
    - And.parseImpl invokes parse() (note the recursion) on each of the
    expressions in its list. All must succeed or And.parseImpl throws an
    exception.
    - OneOrMore.parseImpl invokes parse() on its contained expression,
    which must succeed at least once; if not, the exception is rethrown.
    If the contained expression succeeds once, then its parse() method is
    called again and again until it fails, but no exceptions are rethrown,
    since only one match was actually required.
    - NotAny inverts the success/failure of its contained expression. If
    the expression's parse() method succeeds, NotAny.parseImpl throws an
    exception. If the contained expression's parse() method throws an
    exception, NotAny returns successfully. (NotAny does not advance the
    parse location, nor does it return any tokens.)
    - Literal and Word are terminals, in that they do not invoke any other
    expression's parse() method. Literal.parseImpl tests whether its
    string exists at the current parse location, and throws an exception
    if it doesn't. Word.parseImpl tests whether the current parse
    location contains a letter in the Word instance's set of valid initial
    characters - if so success; if not, throws an exception. It then
    advances through the input string, matching characters in the Word
    instance's set of valid body characters. The entire matched string is
    then returned, along with an updated string index at which to continue
    parsing.
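
    To make the shape of that recursion concrete, here is a toy model of the
    dispatch just described (the names ParseFail, Lit, Wd, Seq and Repeat are
    made up for illustration and are not pyparsing's real classes; error
    handling and whitespace skipping are reduced to the bare minimum):

    class ParseFail(Exception):
        pass

    def skip_ws(s, loc):
        # implicit whitespace skipping, greatly simplified
        while loc < len(s) and s[loc].isspace():
            loc += 1
        return loc

    class Lit:                               # like Literal: a terminal
        def __init__(self, text):
            self.text = text
        def parse(self, s, loc):
            loc = skip_ws(s, loc)
            if s.startswith(self.text, loc):
                return [self.text], loc + len(self.text)
            raise ParseFail('expected %r at %d' % (self.text, loc))

    class Wd:                                # like Word(alphas): a terminal
        def parse(self, s, loc):
            loc = skip_ws(s, loc)
            end = loc
            while end < len(s) and s[end].isalpha():
                end += 1
            if end == loc:
                raise ParseFail('expected a word at %d' % loc)
            return [s[loc:end]], end

    class Seq:                               # like And
        def __init__(self, *exprs):
            self.exprs = exprs
        def parse(self, s, loc):
            toks = []
            for e in self.exprs:             # recursive calls to parse()
                t, loc = e.parse(s, loc)
                toks += t
            return toks, loc

    class Repeat:                            # like OneOrMore
        def __init__(self, expr):
            self.expr = expr
        def parse(self, s, loc):
            toks, loc = self.expr.parse(s, loc)    # must match at least once
            while True:
                try:
                    t, loc = self.expr.parse(s, loc)
                    toks += t
                except ParseFail:
                    return toks, loc         # later failures end the loop

    # The same failure as the original post, reproduced by the toy model:
    g = Seq(Repeat(Wd()), Lit('end'))
    try:
        print(g.parse('First Second Third end', 0))
    except ParseFail as e:
        print(e)   # Repeat has already eaten 'end', so Lit('end') fails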

    In my concept of "recursive descent" parsing, I was under the
    impression that pyparsing's use of this data structure, and
    *recursive* calls of parse() as it *descends* through the data
    structure, constituted a recursive descent parser. What the OP
    requested was a more regular expression-ish matcher, with backup (or
    backtracking).

    Your example did not include either of the alternation classes,
    MatchFirst or Or. These classes implement a form of backtracking on
    the stack, that if we descend into an expression on the list of
    alternatives, and that expression fails, then MatchFirst or Or will
    back up to the same starting location and try the next alternative.
    (MatchFirst is an "eager" matcher, in that it will pick the first
    matching expression in its list - Or is an "exhaustive" matcher, in
    that it will evaluate all of the alternatives, and select the longest
    match.)
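
    A small illustration of the eager/exhaustive distinction (the example
    expressions here are invented for the purpose; '|' builds a MatchFirst
    and '^' builds an Or):

    from pyparsing import Word, nums, Literal

    int_part = Word(nums)
    number = Word(nums) + Literal('.') + Word(nums)

    eager = int_part | number          # MatchFirst: first match wins
    exhaustive = int_part ^ number     # Or: longest match wins

    print(eager.parseString('3.14'))        # -> ['3']
    print(exhaustive.parseString('3.14'))   # -> ['3', '.', '14']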

    The Wikipedia entry for "Recursive Descent Parser" describes this
    parser model as a "predictive parser", and later goes on to say that
    some (uncited) authors equate "predictive parser" with "recursive
    descent parsers". The article makes a special distinction for a
    "recursive parser with backup", which is what I believe the OP was
    asking for. Yes, pyparsing is definitely *not* a "recursive descent
    parser with backup." The solution, as you correctly posted, is to add
    a negative lookahead. NotAny is also represented using the '~'
    operator.

    By the way, there are a couple of other places where pyparsing
    diverges from the standard model:
    - implicit whitespace skipping
    - user-definable comment skipping
    - attachment of parse actions to expressions
    These features, while atypical, provide some added capability and ease-
    of-use in creating quick and simple parsers.

    So I will take my stance with the uncited authors who lump predictive
    parsers in with recursive descent parsers.

    -- Paul
     
    Paul McGuire, Nov 2, 2007
    #3
  4. "Paul McGuire" <> wrote in message
    news:...
    >
    > Well I'll be darned! All this time, I thought "recursive descent"
    > described the recursive behavior of the parser, which pyparsing
    > definitely has. I never knew that backing up in the face of parse
    > mismatches was a required part of the picture.


    It has recursion in it but that's not sufficient to call it a recursive
    descent parser any more than having a recursive implementation of the
    factorial function. The important part is what it recurses through...


    > In pyparsing, each construction gets composed from more fine-grained
    > constructions, and they are called recursively to match or not match
    > against the input string.
    >
    > For example, taking your working parser example:
    >
    > grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
    > + Literal('end'))
    >
    > This creates the following data structure:
    >
    > - And
    >   - OneOrMore
    >     - And
    >       - NotAny
    >         - Literal('end')
    >       - Word(alphas)
    >   - Literal('end')
    >
    > Every instance in this structure derives from the base class
    > ParserElement, which defines a method parse(). parse() in turn calls
    > preParse(), parseImpl(), and postParse(). If parseImpl succeeds, it
    > returns a ParseResults object and the next location within the input
    > string to continue parsing.
    >
    > The parseImpl methods are most often overridden in the subclasses (a
    > few override postParse as well), such as:
    > - And.parseImpl invokes parse() (note the recursion) on each of the
    > expressions in its list. All must succeed or And.parseImpl throws an
    > exception.
    > - OneOrMore.parseImpl invokes parse() on its contained expression,
    > which must succeed at least once; if not, the exception is rethrown.
    > If the contained expression succeeds once, then its parse() method is
    > called again and again until it fails, but no exceptions are rethrown,
    > since only one match was actually required.
    > - NotAny inverts the success/failure of its contained expression. If
    > the expression's parse() method succeeds, NotAny.parseImpl throws an
    > exception. If the contained expression's parse() method throws an
    > exception, NotAny returns successfully. (NotAny does not advance the
    > parse location, nor does it return any tokens.)
    > - Literal and Word are terminals, in that they do not invoke any other
    > expression's parse() method. Literal.parseImpl tests whether its
    > string exists at the current parse location, and throws an exception
    > if it doesn't. Word.parseImpl tests whether the current parse
    > location contains a letter in the Word instance's set of valid initial
    > characters - if so success; if not, throws an exception. It then
    > advances through the input string, matching characters in the Word
    > instance's set of valid body characters. The entire matched string is
    > then returned, along with an updated string index at which to continue
    > parsing.


    Thank you for the detailed description of pyparsing's implementation.


    > In my concept of "recursive descent" parsing, I was under the
    > impression that pyparsing's use of this data structure, and
    > *recursive* calls of parse() as it *descends* through the data
    > structure, constituted a recursive descent parser. What the OP
    > requested was a more regular expression-ish matcher, with backup (or
    > backtracking).


    In my concept of "recursive descent" parsing, I was under the impression
    that one should recurse through all rule combinations to ensure that the
    grammar is fully applied. As I have mentioned before, merely having
    recursion in your algorithm is insufficient. What you recurse through is
    key. pyparsing recurses through rules but what's important is to recurse
    through rule combinations. Otherwise, the parser won't be correct. That is
    to say, there will be strings that are grammatically correct that the
    parser will fail to parse. Honestly, what is the point of a recursive
    descent parser if it doesn't parse correct strings? If you're willing to
    have correct strings go unparsed, you might as well go for LALR parsing, or
    some such...
    In my opinion, the rule set I mentioned in my original post:


    grammar = OneOrMore(Word(alphas)) + Literal('end')


    ...should be translated (internally) to something like this:


    word = Word(alphas)
    grammar = Forward()
    grammar << ((word + grammar) | (word + Literal('end')))


    This allows a recursive search to find the string correct without any
    need for "backtracking," if I understand what you mean by that. Couldn't
    pyparsing have done something like this?
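
    For what it's worth, the proposed translation does run under pyparsing as
    written (Paul confirms as much later in the thread):

    from pyparsing import Word, alphas, Literal, Forward

    word = Word(alphas)
    grammar = Forward()
    grammar << ((word + grammar) | (word + Literal('end')))
    print(grammar.parseString('First Second Third end'))
    # -> ['First', 'Second', 'Third', 'end']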


    > Your example did not include either of the alternation classes,
    > MatchFirst or Or. These classes implement a form of backtracking on
    > the stack, that if we descend into an expression on the list of
    > alternatives, and that expression fails, then MatchFirst or Or will
    > back up to the same starting location and try the next alternative.
    > (MatchFirst is an "eager" matcher, in that it will pick the first
    > matching expression in its list - Or is an "exhaustive" matcher, in
    > that it will evaluate all of the alternatives, and select the longest
    > match.)


    I guess I was hoping that I could simply specify, mathematically, a
    grammar and have the software Do The Right Thing(tm)...


    > The Wikipedia entry for "Recursive Descent Parser" describes this
    > parser model as a "predictive parser", and later goes on to say that
    > some (uncited) authors equate "predictive parser" with "recursive
    > descent parsers". The article makes a special distinction for a
    > "recursive parser with backup", which is what I believe the OP was
    > asking for. Yes, pyparsing is definitely *not* a "recursive descent
    > parser with backup." The solution, as you correctly posted, is to add
    > a negative lookahead. NotAny is also represented using the '~'
    > operator.
    >
    > So I will take my stance with the uncited authors who lump predictive
    > parsers in with recursive descent parsers.


    Well, the good thing about Wikipedia is that, if you don't like the
    definition on the page, you're welcome to change it! Although, I'd
    recommend discussing it on the talk page before you do, just to make sure
    there isn't a good reason for the page to be as it is...
    Thank you...
     
    Just Another Victim of the Ambient Morality, Nov 3, 2007
    #4
  5. Just Another Victim of the Ambient Morality

    Kay Schluehr Guest

    On Nov 3, 6:33 am, "Just Another Victim of the Ambient Morality"
    <> wrote:
    > "Paul McGuire" <> wrote in message
    >
    > news:...
    >
    >
    >
    > > On Nov 2, 5:47 am, Marc 'BlackJack' Rintsch <> wrote:

    >
    > >> Pyparsing is no recursive descent parser. It doesn't go back in the
    > >> input
    > >> stream. The ``OneOrMore(Word(alphas))`` part "eats" the 'end' and when
    > >> it
    > >> can't get more, the parser moves to the ``Literal('end')`` part which
    > >> fails because the 'end' is already gone.

    >
    > >> > Is there a way to get pyparsing to parse a grammar like this?

    >
    > >> Negative lookahead maybe:

    >
    > >> grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
    > >> + Literal('end'))

    >
    > >> Ciao,
    > >> Marc 'BlackJack' Rintsch- Hide quoted text -

    >
    > >> - Show quoted text -

    >
    > > Well I'll be darned! All this time, I thought "recursive descent"
    > > described the recursive behavior of the parser, which pyparsing
    > > definitely has. I never knew that backing up in the face of parse
    > > mismatches was a required part of the picture.

    >
    > It has recursion in it but that's not sufficient to call it a recursive
    > descent parser any more than having a recursive implementation of the
    > factorial function. The important part is what it recurses through...
    >
    >
    >
    > > In pyparsing, each construction gets composed from more fine-grained
    > > constructions, and they are called recursively to match or not match
    > > against the input string.

    >
    > > For example, taking your working parser example:

    >
    > > grammar = (OneOrMore(NotAny(Literal('end')) + Word(alphas))
    > > + Literal('end'))

    >
    > > This creates the following data structure:

    >
    > > - And
    > > - OneOrMore
    > > - And
    > > - NotAny
    > > - Literal('end')
    > > - Word(alphas)
    > > - Literal('end')

    >
    > > Every instance in this structure derives from the base class
    > > ParserElement, which defines a method parse(). parse() in turn calls
    > > preParse(), parseImpl(), and postParse(). If parseImpl succeeds, it
    > > returns a ParseResults object and the next location within the input
    > > string to continue parsing.

    >
    > > The parseImpl methods are most often overridden in the subclasses (a
    > > few override postParse as well), such as:
    > > - And.parseImpl invokes parse() (note the recursion) on each of the
    > > expressions in its list. All must succeed or And.parseImpl throws an
    > > exception.
    > > - OneOrMore.parseImpl invokes parse() on its contained expression,
    > > which must succeed at least once; if not, the exception is rethrown.
    > > If the contained expression succeeds once, then its parse() method is
    > > called again and again until it fails, but no exceptions are rethrown,
    > > since only one match was actually required.
    > > - NotAny inverts the success/failure of its contained expression. If
    > > the expression's parse() method succeeds, NotAny.parseImpl throws an
    > > exception. If the contained expression's parse() method throws an
    > > exception, NotAny returns successfully. (NotAny does not advance the
    > > parse location, nor does it return any tokens.)
    > > - Literal and Word are terminals, in that they do not invoke any other
    > > expression's parse() method. Literal.parseImpl tests whether its
    > > string exists at the current parse location, and throws an exception
    > > if it doesn't. Word.parseImpl tests whether the current parse
    > > location contains a letter in the Word instance's set of valid initial
    > > characters - if so success; if not, throws an exception. It then
    > > advances through the input string, matching characters in the Word
    > > instance's set of valid body characters. The entire matched string is
    > > then returned, along with an updated string index at which to continue
    > > parsing.

    >
    > Thank you for the detailed description of pyparsing's implementation.
    >
    > > In my concept of "recursive descent" parsing, I was under the
    > > impression that pyparsing's use of this data structure, and
    > > *recursive* calls of parse() as it *descends* through the data
    > > structure, constituted a recursive descent parser. What the OP
    > > requested was a more regular expression-ish matcher, with backup (or
    > > backtracking).

    >
    > In my concept of "recursive descent" parsing, I was under the impression
    > that one should recurse through all rule combinations to ensure that the
    > grammar is fully applied. As I have mentioned before, merely having
    > recursion in your algorithm is insufficient. What you recurse through is
    > key. pyparsing recurses through rules but what's important is to recurse
    > through rule combinations.


    I think the confusing aspect of pyparsing for someone like me, coming
    from an (E)BNF and formal-language background, is that scanning and
    parsing are merged into one. pyparsing is scannerless: where a
    tokenizer such as Python's performs longest-match tokenization and the
    parser then interprets CFGs correctly (within the bounds of the
    parser's power), in pyparsing one has to disambiguate the grammar by
    hand where one would expect a CFG simply to be established.

    Note that I don't like pyparsing's approach, and it is clearly not for
    everyone. My own experience is rather that token definitions can be
    reused across different languages. I count 65 tokens for Python, and I
    added 10 tokens for an experimental C++ parser and deactivated some
    others. The distance between the two languages at the lexical level is
    not that huge.

    What pyparsing has established in the Python community is something
    different, IMO: a readable, verbose pattern-matching syntax that is not
    regexp line noise. Around 8 years ago, when I started using Python, I
    found a little module called reverb.py. It was actually nothing but a
    front end for regular expressions. It never made much of a career, but I
    think it is the only module of its kind that, in my own case, survived
    the anagrammatical release hopping from 1.5.2 to 2.5.1:

    >>> from reverb import *
    >>> p = required(required(alphanum)+whitespace)+text("end")
    >>> p
    <reverb.raw instance at 0x00C0D260>
    >>> RE(p).pattern
    '(?:\\w+\\s)+end'
     
    Kay Schluehr, Nov 3, 2007
    #5
  6. Just Another Victim of the Ambient Morality

    Paul McGuire Guest

    On Nov 3, 12:33 am, "Just Another Victim of the Ambient Morality"
    <> wrote:
    > It has recursion in it but that's not sufficient to call it a recursive
    > descent parser any more than having a recursive implementation of the
    > factorial function. The important part is what it recurses through...


    <snip>

    > In my opinion, the rule set I mentioned in my original post:
    >
    > grammar = OneOrMore(Word(alphas)) + Literal('end')
    >
    > ...should be translated (internally) to something like this:
    >
    > word = Word(alphas)
    > grammar = Forward()
    > grammar << ((word + grammar) | (word + Literal('end')))
    >
    > This allows a recursive search to find the string correct without any
    > need for "backtracking," if I understand what you mean by that. Couldn't
    > pyparsing have done something like this?
    >


    Dear JAVotAM -

    This was a great post! (Although I'm not sure the comment in the
    first paragraph was entirely fair - I know the difference between
    recursive string parsing and recursive multiplication.)

    You really got me thinking more about how this recursion actually
    behaves, especially with respect to elements such as OneOrMore. Your
    original question is really quite common, and up until now, my stock
    answer has been to use negative lookahead. The expression you posted
    is the first time I've seen this solution, and it *does* work.

    I was all set to write a cocky response on why your expression
    wouldn't work. I've seen it many times before, where people (usually
    coming from EBNF experience) implement listOfXs = OneOrMore(x) as:

    listOfXs = Forward()
    listOfXs << ( x + listOfXs | x )

    Actually, what they usually write is:

    listOfXs << ( listOfXs + x )

    but this sends pyparsing into a recursive tailspin.
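
    An illustration of that tailspin (modern Pythons raise RecursionError,
    which is a subclass of RuntimeError, so the broader class is caught
    here):

    from pyparsing import Forward, Word, alphas

    x = Word(alphas)
    listOfXs = Forward()
    listOfXs << (listOfXs + x)    # left-recursive: refers to itself before
                                  # consuming any input
    try:
        listOfXs.parseString('a b c')
    except RuntimeError as err:
        print('recursion blew up:', err)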

    So I fired up SciTE and copy/pasted your code into the editor and ran
    it, and it worked just fine - this was a shock! I studied this for a
    few minutes, and then realized what was happening. First of all, I
    misread what you posted. You posted this:

    grammar << ((word + grammar) | (word + Literal('end')))

    which works. I *thought* you posted this:

    grammar << ((word + grammar) | word) + Literal('end')

    which doesn't work. In fact this behaves the same as your original
    post, except it iterates over the input string's words recursively,
    vs. repetition in a for-loop, as is done by OneOrMore.

    So going back to your working version, I had to see why it works.
    Initially, the first term in the MatchFirst (the '|' operator creates
    MatchFirst instances) is just the same, and by grammar referencing
    itself, it just goes word by word through the input trying to find a
    match. I'll try to visualize the steps:

    level "First Second Third end"
    1 word grammar
    2 word grammar
    3 word grammar
    4 word grammar <- fails!
    4 word end <- fails!
    (therefore level 3 grammar fails!)
    3 word end <-- success!!!

    grammar has 2 options: to match a word followed by a grammar, or to
    match a word followed by 'end'. At 4 levels deep into the Forward's
    recursion, the first option fails, because after parsing "end" as the
    initial word, there is no more text to try to match against grammar.
    Level 4's Forward then also tries to match a word followed by 'end',
    but this fails for the same reason. So at this point, the 4th level
    Forward fails to match either of its options, so it throws its
    exception back up to level 3, indicating that the first alternative,
    word followed by grammar, failed. Level 3 then moves on to see if
    word followed by the literal 'end' matches, and it does - success!

    Here's where I am stuck now. In the original grammar that you posted,
    which you want to render into this behavior, grammar is defined as:

    grammar = OneOrMore(Word(alphas)) + Literal('end')

    Unfortunately, when the OneOrMore gets constructed, it does not have
    any visibility beyond knowing what is to be repeated. Again, here is
    the data structure that is being built:

    - And
      - OneOrMore
        - Word(alphas)
      - Literal('end')

    Only at the level of the And is there any awareness that the OneOrMore
    is followed by anything, let alone by something which could be
    misinterpreted as something matching the OneOrMore's repetition
    expression.

    Can you suggest a way I could generalize this, so that OneOrMore stops
    matching before it gets to 'end'?
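
    (An aside from much later: pyparsing eventually grew essentially this
    generalization, a stopOn argument on OneOrMore and ZeroOrMore that adds
    the negative lookahead internally; it was not available in the 2007-era
    versions discussed here.)

    from pyparsing import OneOrMore, Word, alphas, Literal

    # Requires a pyparsing release that supports stopOn:
    grammar = OneOrMore(Word(alphas), stopOn=Literal('end')) + Literal('end')
    print(grammar.parseString('First Second Third end'))
    # -> ['First', 'Second', 'Third', 'end']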

    -- Paul
     
    Paul McGuire, Nov 3, 2007
    #6
  7. Just Another Victim of the Ambient Morality

    Neil Cerutti Guest

    On 2007-11-03, Paul McGuire <> wrote:
    >
    > Here's where I am stuck now. In the original grammar that you posted,
    > which you want to render into this behavior, grammar is defined as:
    >
    > grammar = OneOrMore(Word(alphas)) + Literal('end')


    Is there not an ambiguity in the grammar?

    In EBNF:

    goal --> WORD { WORD } END

    WORD is '[a-zA-Z]+'
    END is 'end'

    I think it is fine that PyParsing can't guess what the composer
    of that grammar meant.

    --
    Neil Cerutti
     
    Neil Cerutti, Nov 3, 2007
    #7
  8. "Paul McGuire" <> wrote in message
    news:...
    >
    > Dear JAVotAM -
    >
    > This was a great post! (Although I'm not sure the comment in the
    > first paragraph was entirely fair - I know the difference between
    > recursive string parsing and recursive multiplication.)


    I often use hyperbole to emphasize a point. I honestly mean no offense.
    That comment wasn't even meant to be fair, I was hoping to be provocative...


    > You really got me thinking more about how this recursion actually
    > behaves, especially with respect to elements such as OneOrMore. Your
    > original question is really quite common, and up until now, my stock
    > answer has been to use negative lookahead. The expression you posted
    > is the first time I've seen this solution, and it *does* work.


    I'm glad I got you thinking! I'd hate to be another newbie with another
    of a thousand questions answered in the FAQ....
    Hey, are you the author of pyparsing?


    > I was all set to write a cocky response on why your expression
    > wouldn't work. I've seen it many times before, where people (usually
    > coming from EBNF experience) implement listOfXs = OneOrMore(x) as:
    >
    > listOfXs = Forward()
    > listOfXs << ( x + listOfXs | x )
    >
    > Actually, what they usually write is:
    >
    > listOfXs << ( listOfXs + x )
    >
    > but this sends pyparsing into a recursive tailspin.
    >
    > So I fired up SciTE and copy/pasted your code into the editor and ran
    > it, and it worked just fine - this was a shock! I studied this for a
    > few minutes, and then realized what was happening. First of all, I
    > misread what you posted. You posted this:
    >
    > grammar << ((word + grammar) | (word + Literal('end')))
    >
    > which works. I *thought* you posted this:
    >
    > grammar << ((word + grammar) | word) + Literal('end')
    >
    > which doesn't work. In fact this behaves the same as your original
    > post, except it iterates over the input string's words recursively,
    > vs. repetition in a for-loop, as is done by OneOrMore.


    I'm grateful that you actually tested my code before posting your cocky
    response!


    > So going back to your working version, I had to see why it works.
    > Initially, the first term in the MatchFirst (the '|' operator creates
    > MatchFirst instances) is just the same, and by grammar referencing
    > itself, it just goes word by word through the input trying to find a
    > match. I'll try to visualize the steps:
    >
    > level "First Second Third end"
    > 1 word grammar
    > 2 word grammar
    > 3 word grammar
    > 4 word grammar <- fails!
    > 4 word end <- fails!
    > (therefore level 3 grammar fails!)
    > 3 word end <-- success!!!
    >
    > grammar has 2 options: to match a word followed by a grammar, or to
    > match a word followed by 'end'. At 4 levels deep into the Forward's
    > recursion, the first option fails, because after parsing "end" as the
    > initial word, there is no more text to try to match against grammar.
    > Level 4's Forward then also tries to match a word followed by 'end',
    > but this fails for the same reason. So at this point, the 4th level
    > Forward fails to match either of its options, so it throws its
    > exception back up to level 3, indicating that the first alternative,
    > word followed by grammar, failed. Level 3 then moves on to see if
    > word followed by the literal 'end' matches, and it does - success!


    This is, literally, what it's doing. I'm not exactly a programming whiz
    so I think of it a little more abstractly. In pyparsing's implementation,
    it recurses through the first rule, OneOrMore, then iteratively moves to the
    next rule, only to fail. So, obviously that rule must be part of the
    recursion if we're not to use backtracking or lookaheads. If it's part of
    the recursion, it has to be the terminal case. We know that the OneOrMore
    rule is necessarily followed by the literal, so we can safely conclude that
    the terminal case will necessarily be followed by the literal, hence the
    production mentioned...


    > Here's where I am stuck now. In the original grammar that you posted,
    > which you want to render into this behavior, grammar is defined as:
    >
    > grammar = OneOrMore(Word(alphas)) + Literal('end')
    >
    > Unfortunately, when the OneOrMore gets constructed, it does not have
    > any visibility beyond knowing what is to be repeated. Again, here is
    > the data structure that is being built:
    >
    > - And
    >   - OneOrMore
    >     - Word(alphas)
    >   - Literal('end')
    >
    > Only at the level of the And is there any awareness that the OneOrMore
    > is followed by anything, let alone by something which could be
    > misinterpreted as something matching the OneOrMore's repetition
    > expression.
    >
    > Can you suggest a way I could generalize this, so that OneOrMore stops
    > matching before it gets to 'end'?


    Again, I'm not an elite hacker so it's hard to give a good suggestion.
    I might have implemented the parser by having its input be a string
    specifying the grammar but, then again, I'd need a parser to parse that
    string! Oh, the irony...
    Probably, I would have constructed the rule tree differently. I like
    how pyparsing uses the Python interpreter to parse its input, the Python
    code itself being pyparsing's input. That's worth preserving...
    So, instead of organizing the tree by rule categories, as pyparsing
    does:


    - And
      - OneOrMore
        - Word(alphas)
      - Literal('end')


    ...I would have organized it by what the grammar actually expects. It's
    hard for me to describe what I'm thinking as a simple hierarchy. I could
    use a directed graph but that's hard to type, so I'll just use "labels":


    start:
        call OneOrMore:

    OneOrMore:
        Or(
            And(
                Word(alphas),
                call OneOrMore:),
            And(
                Word(alphas),
                Literal('end')))


    ...hopefully, that indentation makes some sort of sense. I would
    imagine that it would be constructed like this:


    # This translates to temp << ((word + temp) | word)
    temp = OneOrMore(word)

    # Instead of wrapping this in an And structure,
    # tell temp to And this with its terminal case...
    grammar = temp + literal
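
    A sketch of that construction with ordinary pyparsing pieces (the helper
    name repeat_then is made up for illustration):

    from pyparsing import Forward, Word, alphas, Literal

    def repeat_then(expr, follower):
        # Build the repetition as a right-recursive Forward whose terminal
        # case already carries the expression that follows the repetition.
        g = Forward()
        g << ((expr + g) | (expr + follower))
        return g

    grammar = repeat_then(Word(alphas), Literal('end'))
    print(grammar.parseString('First Second Third end'))
    # -> ['First', 'Second', 'Third', 'end']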


    In my parser, everything would be recursive, hence the term "recursive
    descent" parser. So, a list of Ands like:


    grammar = Literal('a') + Literal('b') + Literal('c')


    ...would look like this:


    start:
        call And_a:

    And_a:
        Literal('a') + call And_b:

    And_b:
        Literal('b') + call And_c:

    And_c:
        Literal('c')


    ...notice the utter lack of iteration.
    Now, this example isn't, strictly speaking, recursive, but it descends
    through a set of function calls that are, potentially, recursive...
    Now, obviously things like Or will have to iterate through their cases but
    that's unavoidable and should be okay. The other constructions will need
    some thought, too. However, hopefully, this technique will render some
    constructions unnecessary...
    So, in my parser, the grammar is represented by a directed graph (the
    temporary variable flowing through operations like a + b + c + d) and all
    operations modify this graph. Obviously, some research needs to be done on
    how feasible this implementation is, especially with the rich set of
    operations available in pyparsing, but I'm hopeful...
    Hopefully this all makes sense. Thanks for reading...
     
    Just Another Victim of the Ambient Morality, Nov 3, 2007
    #8
  9. "Neil Cerutti" <> wrote in message
    news:oz3Xi.39812$...
    >
    > Is there not an ambiguity in the grammar?
    >
    > In EBNF:
    >
    > goal --> WORD { WORD } END
    >
    > WORD is '[a-zA-Z]+'
    > END is 'end'
    >
    > I think it is fine that PyParsing can't guess what the composer
    > of that grammar meant.


    First, I don't know if that constitutes an ambiguity in the grammar.
    'end' is a word but it's unambiguous that this grammar must end in a literal
    'end'. You could interpret the input as just a sequence of words or you
    could interpret it as a sequence of words terminated by the word 'end'. One
    interpretation conforms to the grammar while the other doesn't. You would
    assume that the interpretation that agrees with the grammar would be the
    preferable choice and so should the program...
    Secondly, even if it is an ambiguity... so what? pyparsing's current
    behaviour is to return a parse error, pretending that the string can't be
    parsed. Ideally, perhaps it should alert you to the ambiguity but, surely,
    it's better to return _a_ valid parsing than to pretend that the string
    can't be parsed at all...
     
    Just Another Victim of the Ambient Morality, Nov 4, 2007
    #9
  10. Just Another Victim of the Ambient Morality

    Neil Cerutti Guest

    On 2007-11-04, Just Another Victim of the Ambient Morality
    <> wrote:
    >>
    >> Is there not an ambiguity in the grammar?
    >>
    >> In EBNF:
    >>
    >> goal --> WORD { WORD } END
    >>
    >> WORD is '[a-zA-Z]+'
    >> END is 'end'
    >>
    >> I think it is fine that PyParsing can't guess what the composer
    >> of that grammar meant.

    >
    > First, I don't know if that constitutes an ambiguity in the
    > grammar. 'end' is a word but it's unambiguous that this grammar
    > must end in a literal 'end'. You could interpret the input as
    > just a sequence of words or you could interpret it as a
    > sequence of words terminated by the word 'end'.


    Yeah. If it were a regex, it would be: '[ab]+b'. That is a fine
    regex, because a regex is generally just a recognizer; the
    ambiguity doesn't have to do with recognizing the language. But
    PyParsing is a parser. The ambiguity is in deciding what's a
    Word(alphas) and what's an 'end'.

    > One interpretation conforms to the grammar while the other
    > doesn't. You would assume that the interpretation that agrees
    > with the grammar would be the preferable choice and so should
    > the program. Secondly, even if it is an ambiguity... so what?
    > pyparsing's current behaviour is to return a parse error,
    > pretending that the string can't be parsed. Ideally, perhaps
    > it should alert you to the ambiguity but, surely, it's better
    > to return _a_ valid parsing than to pretend that the string
    > can't be parsed at all...


    I wouldn't characterize it as pretending. How would you parse:

    hello end hello end

    "WORD END WORD END" and "WORD WORD WORD END" are both valid
    interpretations, according to the grammar.

    As soon as you remove the ambiguity from the grammar, PyParsing
    starts to work correctly.

    Consider writing a recursive descent parser by hand to parse the
    language '[ab]+b'.

    goal --> ab_list 'b'
    ab_list --> 'a' list_tail
    ab_list --> 'b' list_tail
    list_tail --> 'a' list_tail
    list_tail --> 'b' list_tail
    list_tail --> null


    The above has the exact same bug (and probably some others--I'm
    sorry, I'm unable to test it just now) as the PyParsing solution.

    The error is in the grammar. It might be fixed by specifying that
    'b' must be followed by EOF, and then it could be coded by using
    more than one character of lookahead.

    class ParseError(Exception):
        pass

    class Parser:
        def __init__(self, s):
            self.s = s
            self.i = 0
            self.c = self.s[self.i]

        def match(self, c):
            if self.c != c:
                raise ParseError('expected %s got %s' % (c, self.c))
            self.i += 1
            if self.i == len(self.s):
                raise ParseError('unexpected EOF')
            self.c = self.s[self.i]

        def goal(self):
            self.ab_list()
            self.match('b')

        def ab_list(self):
            if self.c in 'ab':
                self.match(self.c)
                self.list_tail()
            else:
                raise ParseError('expected list of a or b, got %s' % self.c)

        def list_tail(self):
            if self.c in 'ab':
                self.match(self.c)
                self.list_tail()
            else:
                raise ParseError('expected a list of a or b, got %s' % self.c)

    >>> Parser('aaab').goal()
    Traceback (most recent call last):
      File "py.py", line 37, in ?
        Parser('aaab').goal()
      File "py.py", line 20, in goal
        self.ab_list()
      File "py.py", line 26, in ab_list
        self.list_tail()
      File "py.py", line 33, in list_tail
        self.list_tail()
      File "py.py", line 33, in list_tail
        self.list_tail()
      File "py.py", line 32, in list_tail
        self.match(self.c)
      File "py.py", line 16, in match
        raise ParseError('unexpected EOF')
    __main__.ParseError: unexpected EOF

    It's not a coincidence that it has the same bug as the PyParsing
    solution. You can remove the ambiguity in several ways, perhaps
    by specifying where EOF should be (you seem to be assuming this
    in your interpretation of the grammar, but PyParsing does not).

    goal --> ab_list 'b' EOF
    ab_list --> 'a' list_tail
    ab_list --> 'b' list_tail
    list_tail --> 'a' list_tail
    list_tail --> 'b' list_tail
    list_tail --> null

    I need to change goal, match and list_tail for this new grammar,
    and add an EOF object and a peek method.

    ....
    EOF = object()
    ....
        def match(self, c):
            if self.c != c:
                raise ParseError('expected %s got %s' % (c, self.c))
            if self.i >= len(self.s)-1:
                self.c = EOF
                self.i = len(self.s)
            else:
                self.i += 1
                self.c = self.s[self.i]

        def peek(self, lookahead):
            if self.i+lookahead >= len(self.s):
                return EOF
            else:
                return self.s[self.i + lookahead]
    ...
        def list_tail(self):
            if self.c == 'a':
                self.match('a')
                self.list_tail()
            elif self.c == 'b':
                if self.peek(1) != EOF:
                    self.match('b')
                    self.list_tail()
            else:
                raise ParseError('expected list of a or b, got %s' % self.c)

    The grammar now has the problem that it's LL(2) rather than
    LL(1). PyParsing can handle that type of grammar with negative
    lookahead assertions, I think.
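
    A quick check of the revised parser (assuming the amended match, peek and
    list_tail methods above have been substituted into the Parser class
    defined earlier):

    Parser('aaab').goal()          # accepted: list 'aaa' plus the final 'b'
    try:
        Parser('aaa').goal()       # rejected: no final 'b' is left over
    except ParseError as err:
        print('rejected as expected:', err)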

    --
    Neil Cerutti
     
    Neil Cerutti, Nov 4, 2007
    #10
  11. Just Another Victim of the Ambient Morality

    Kay Schluehr Guest

    On 4 Nov., 03:07, Neil Cerutti <> wrote:
    >
    > I wouldn't characterize it as pretending. How would you parse:
    >
    > hello end hello end
    >
    > "WORD END WORD END" and "WORD WORD WORD END" are both valid
    > interpretations, according to the grammar.
    >
    > As soon as you remove the ambiguity from the grammar, PyParsing
    > starts to work correctly.


    I think you are correct about this. But I'm not sure how much it
    matters. Just take a look at Python's Grammar

    http://svn.python.org/view/python/trunk/Grammar/Grammar?rev=55446&view=markup

    Without special keyword treatment this grammar would be ambiguous and
    couldn't be parsed using an LL(1) parser. The "grammar compiler" which
    builds the parser tables creates a special label for each keyword.
    This label is filtered when a NAME token is fed into the parser.
    With the label that belongs to e.g. 'if' or 'while', the correct
    statement can be selected in constant time. The same happens when I use
    the parser generator with your EBNF grammar. With a little more
    adaptation, NUMBER tokens could also be filtered. But this would be
    overdesign.

    Theoretical beauty is compromised here using reasonable default
    assumptions for keeping the grammar simple ( "convention over
    configuration" to borrow a Rails slogan ).

    Tokenization is another issue in Python. It is indeed somewhat special
    due to significant whitespace and line continuation but tokenization
    is conceptually much simpler and one would likely throw all kinds of
    options and special case handling in the lexical analysis phase.
     
    Kay Schluehr, Nov 4, 2007
    #11
  12. Just Another Victim of the Ambient Morality

    Neil Cerutti Guest

    On 2007-11-04, Kay Schluehr <> wrote:
    > On 4 Nov., 03:07, Neil Cerutti <> wrote:
    >> I wouldn't characterize it as pretending. How would you parse:
    >>
    >> hello end hello end
    >>
    >> "WORD END WORD END" and "WORD WORD WORD END" are both valid
    >> interpretations, according to the grammar.
    >>
    >> As soon as you remove the ambiguity from the grammar, PyParsing
    >> starts to work correctly.

    >
    > I think you are correct about this. But I'm not sure how much
    > it matters. Just take a look at Python's Grammar
    >
    > http://svn.python.org/view/python/trunk/Grammar/Grammar?rev=55446&view=markup
    >
    > Without special keyword treatment this grammar would be
    > ambiguous and couldn't be parsed using an LL(1) parser.


    I agree. I don't know how easy it is to create keywords using
    PyParsing, or if the grammar in question would still be
    considered correct by the author.

    > The "grammar compiler" which builds the parser tables creates a
    > special label for each keyword. This label is filtered when a
    > NAME token is feeded into the parser. With the label that
    > belongs to e.g. 'if' or 'while' the correct statement can be
    > selected in constant time. Same happens when I use the parser
    > generator with your EBNF grammar. With a little more adaption
    > also NUMBER token could be filtered. But this would be
    > overdesign.
    >
    > Theoretical beauty is compromised here using reasonable default
    > assumptions for keeping the grammar simple ( "convention over
    > configuration" to borrow a Rails slogan ).


    Keywords are practically ubiquitous. I never thought of them as
    unbeautiful before.

    > Tokenization is another issue in Python. It is indeed somewhat
    > special due to significant whitespace and line continuation but
    > tokenization is conceptually much simpler and one would likely
    > throw all kinds of options and special case handling in the
    > lexical analysis phase.


    It might be a quick fix in PyParsing, which includes a Keyword
    type, but without the semantics that are needed in this case. You
    have to use (as suggested earlier) negative lookahead in either a
    Regex or with the NotAny type.

    >>> goal = OneOrMore(NotAny(Literal('end')) + Word(alphas)) + Literal('end')
    >>> goal.parseString('hello hello hello end')
    (['hello', 'hello', 'hello', 'end'], {})
    >>> goal.parseString('hello end hello end')
    (['hello', 'end'], {})

    No scanner/tokenizer needed! ;)
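
    For completeness, the Regex-based variant of the same lookahead might
    look something like this (an untested sketch; the pattern simply refuses
    to treat the bare word 'end' as one of the repeated words):

    from pyparsing import Literal, OneOrMore, Regex

    # A word that is anything alphabetic *except* the keyword 'end',
    # using a regex negative lookahead instead of NotAny.
    word = Regex(r'(?!end\b)[A-Za-z]+')
    goal = OneOrMore(word) + Literal('end')

    print(goal.parseString('hello hello hello end'))
    # expected: ['hello', 'hello', 'hello', 'end']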

    --
    Neil Cerutti
     
    Neil Cerutti, Nov 4, 2007
    #12
  13. "Neil Cerutti" <> wrote in message
    news:wP9Xi.39822$...
    > On 2007-11-04, Just Another Victim of the Ambient Morality
    > <> wrote:
    >>
    >> "Neil Cerutti" <> wrote in message
    >> news:eek:z3Xi.39812$...
    >>>
    >>> Is there not an ambiguity in the grammar?
    >>>
    >>> In EBNF:
    >>>
    >>> goal --> WORD { WORD } END
    >>>
    >>> WORD is '[a-zA-Z]+'
    >>> END is 'end'
    >>>
    >>> I think it is fine that PyParsing can't guess what the composer
    >>> of that grammar meant.

    >>
    >> One interpretation conforms to the grammar while the other
    >> doesn't. You would assume that the interpretation that agrees
    >> with the grammar would be the preferable choice and so should
    >> the program. Secondly, even if it is an ambiguity... so what?
    >> pyparsing's current behaviour is to return a parse error,
    >> pretending that the string can't be parsed. Ideally, perhaps
    >> it should alert you to the ambiguity but, surely, it's better
    >> to return _a_ valid parsing than to pretend that the string
    >> can't be parsed at all...

    >
    > I wouldn't characterize it as pretending. How would you parse:
    >
    > hello end hello end
    >
    > "WORD END WORD END" and "WORD WORD WORD END" are both valid
    > interpretations, according to the grammar.


    ...and it would be nice if the parser were to parse one of them since
    they are both right. Having more than one right answer is not the same as
    having no answer, which is what pyparsing claims...


    > As soon as you remove the ambiguity from the grammar, PyParsing
    > starts to work correctly.


    This is simply not true. Try this:


    grammar = OneOrMore(Word(alphas)) + Literal('end') + Literal('.')
    grammar.parseString('First Second Third end.')


    ...again, this will fail to parse. Where's the ambiguity?
    Besides, parsing ambiguous grammars is a useful feature. Not all
    grammars being parsed are designed by those doing the parsing...


    > Consider writing a recursive descent parser by hand to parse the
    > language '[ab]+b'.
    >
    > goal --> ab_list 'b'
    > ab_list --> 'a' list_tail
    > ab_list --> 'b' list_tail
    > list_tail --> 'a' list_tail
    > list_tail --> 'b' list_tail
    > list_tail --> null
    >
    >
    > The above has the exact same bug (and probably some others--I'm
    > sorry unable to test it just now) as the PyParsing solution.
    >
    > The error is in the grammar. It might be fixed by specifying that
    > 'b' must be followed by EOF, and then it could be coded by using
    > more than one character of lookahead.


    I don't exactly understand the syntax you used to describe the
    productions of your recursive descent parser, so not only did I not
    follow it, but I couldn't make out the rest of your post either. Could
    you explain in a little more detail? The last part that points to
    'null' is especially confusing...
    As demonstrated earlier, it's not just the grammar. There are
    unambiguous situations that pyparsing simply can't parse, and
    there's no reason for it.
    Besides, ambiguous grammars are a fact of life and some of us need to
    parse them. It's usually okay, too. Consider a previous example:


    grammar = OneOrMore(Word(alphas)) + Literal('end')


    While you may consider this inherently ambiguous, it's usually not.
    That is to say, as long as it is rare for 'end' to appear anywhere but
    at the end of the string, this will parse just fine, and yet pyparsing
    will consistently fail to parse it...
     
    Just Another Victim of the Ambient Morality, Nov 4, 2007
    #13
  14. Just Another Victim of the Ambient Morality

    Neil Cerutti Guest

    On 2007-11-04, Just Another Victim of the Ambient Morality
    <> wrote:
    >> Consider writing a recursive descent parser by hand to parse
    >> the language '[ab]+b'.
    >>
    >> goal --> ab_list 'b'
    >> ab_list --> 'a' list_tail
    >> ab_list --> 'b' list_tail
    >> list_tail --> 'a' list_tail
    >> list_tail --> 'b' list_tail
    >> list_tail --> null
    >>
    >>
    >> The above has the exact same bug (and probably some others--I'm
    >> sorry unable to test it just now) as the PyParsing solution.
    >>
    >> The error is in the grammar. It might be fixed by specifying that
    >> 'b' must be followed by EOF, and then it could be coded by using
    >> more than one character of lookahead.

    >
    > I don't exactly understand the syntax you used to describe the
    > productions of your recursive descent parser so not only did I not follow it
    > but I couldn't make out the rest of your post. Could you explain in a
    > little more detail? The last part that points to 'null' is especially
    > confusing...


    It's the BNF spelling of

    goal --> ab_list 'b'
    ab_list --> ab { ab }
    ab --> 'a' | 'b'

    The null is to say that list_tail can match nothing, i.e., an
    empty string.

    Then, in the Parser class, every method (except for match, which
    is used as a central place to consume characters) corresponds to
    one of the productions in the BNF. Breaking things down into
    BNF-based productions often makes implementation, debugging and
    code generation easier.

    PyParsing saves me that step, since I can often directly
    implement the EBNF using PyParsing.
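
    For concreteness, here is one minimal hand-written version of that
    grammar, written only as a sketch, with one method per production and a
    central match(), and with no backtracking; it runs into exactly the
    failure being discussed:

    class Parser:
        # Non-backtracking recursive descent recognizer for
        #   goal      --> ab_list 'b'
        #   ab_list   --> ('a' | 'b') list_tail
        #   list_tail --> ('a' | 'b') list_tail | null
        def __init__(self, text):
            self.text = text
            self.pos = 0

        def match(self, ch):              # central place to consume characters
            if self.pos < len(self.text) and self.text[self.pos] == ch:
                self.pos += 1
                return True
            return False

        def list_tail(self):              # greedily eats 'a'/'b' until it can't
            if self.match('a') or self.match('b'):
                return self.list_tail()
            return True                   # the 'null' production

        def ab_list(self):
            if self.match('a') or self.match('b'):
                return self.list_tail()
            return False

        def goal(self):
            return self.ab_list() and self.match('b') and self.pos == len(self.text)

    print(Parser('aab').goal())   # False -- list_tail already consumed the final 'b'
    print(Parser('ab').goal())    # False, for the same reason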

    > As demonstrated earlier, it's not just the grammar. There are
    > unambiguous situations that pyparsing simply can't parse, and
    > there's no reason for it.


    Yes, many parser generators have many more limitations than just
    the requirement of an unambiguous grammar.

    > Besides, ambiguous grammars are a fact of life and some of us
    > need to parse them. It's usually okay, too. Consider a
    > previous example:
    >
    > grammar = OneOrMore(Word(alphas)) + Literal('end')
    >
    > While you may consider this inherently ambiguous, it's usually
    > not. That is to say, as long as it is rare for 'end' to appear
    > anywhere but at the end of the string, this will parse just fine,
    > and yet pyparsing will consistently fail to parse it...


    I believe there's no cure for the confusion you're having except
    for implementing a parser for your proposed grammar.
    Alternatively, try implementing your grammar in one of your other
    favorite parser generators.

    --
    Neil Cerutti
     
    Neil Cerutti, Nov 4, 2007
    #14
  15. "Neil Cerutti" <> wrote in message
    news:nPnXi.39845$...
    > On 2007-11-04, Just Another Victim of the Ambient Morality
    > <> wrote:
    >>> Consider writing a recursive descent parser by hand to parse
    >>> the language '[ab]+b'.
    >>>
    >>> goal --> ab_list 'b'
    >>> ab_list --> 'a' list_tail
    >>> ab_list --> 'b' list_tail
    >>> list_tail --> 'a' list_tail
    >>> list_tail --> 'b' list_tail
    >>> list_tail --> null
    >>>
    >>>
    >>> The above has the exact same bug (and probably some others--I'm
    >>> sorry unable to test it just now) as the PyParsing solution.
    >>>
    >>> The error is in the grammar. It might be fixed by specifying that
    >>> 'b' must be followed by EOF, and then it could be coded by using
    >>> more than one character of lookahead.

    >>
    >> I don't exactly understand the syntax you used to describe the
    >> productions of your recursive descent parser so not only did I not follow
    >> it
    >> but I couldn't make out the rest of your post. Could you explain in a
    >> little more detail? The last part that points to 'null' is especially
    >> confusing...

    >
    > It's the BNF spelling of
    >
    > goal --> ab_list 'b'
    > ab_list --> ab { ab }
    > ab --> 'a' | 'b'
    >
    > The null is to say that list_tail can match nothing, i.e., an
    > empty string.
    >
    > Then, in the Parser class, every method (except for match, which
    > is used as a central place to consume characters) corresponds to
    > one of the productions in the BNF. Breaking things down into
    > BNF-based productions often makes implementation, debugging and
    > code generation easier.
    >
    > PyParsing saves me that step, since I can often directly
    > implement the EBNF using PyParsing.


    Okay, I see that now, thank you.
    Your statement from the previous post:


    >> Consider writing a recursive descent parser by hand to parse
    >> the language '[ab]+b'.
    >>
    >> goal --> ab_list 'b'
    >> ab_list --> 'a' list_tail
    >> ab_list --> 'b' list_tail
    >> list_tail --> 'a' list_tail
    >> list_tail --> 'b' list_tail
    >> list_tail --> null
    >>
    >>
    >> The above has the exact same bug (and probably some others--I'm
    >> sorry unable to test it just now) as the PyParsing solution.



    ...merely demonstrates that this grammar is similarly ambiguous. There
    are many ways to parse this correctly and pyparsing chooses none of these!
    Instead, it returns the same error it does when the string has no
    solutions...


    >> As demonstrated earlier, it's not just the grammar. There are
    >> unambiguous situations that pyparsing simply can't parse, and
    >> there's no reason for it.

    >
    > Yes, many parser generators have many more limitations than just
    > the requirement of an unambiguous grammar.


    Yes, but from a recursive descent parser? I expect such limitations from
    LALR and others, but I expect a recursive descent parser not only to
    parse grammars correctly but even to parse ambiguous ones, in that it is
    the only technique prepared to find more than one solution...


    >> Besides, ambiguous grammars are a fact of life and some of us
    >> need to parse them. It's usually okay, too. Consider a
    >> previous example:
    >>
    >> grammar = OneOrMore(Word(alphas)) + Literal('end')
    >>
    >> While you may consider this inherently ambiguous, it's usually
    >> not. That is to say, as long as it is rare for 'end' to appear
    >> anywhere but at the end of the string, this will parse just fine,
    >> and yet pyparsing will consistently fail to parse it...

    >
    > I believe there's no cure for the confusion you're having except
    > for implementing a parser for your proposed grammar.
    > Alternatively, try implementing your grammar in one of your other
    > favorite parser generators.


    I believe there is a cure and it's called recursive descent parsing.
    It's slow, obviously, but it's correct and, sometimes (arguably, often),
    that's more important than execution speed.

    I spent this morning whipping up a proof of concept parser whose
    interface greatly resembles pyparsing but, barring unknown bugs, works and
    works as I'd expect a recursive descent parser to work. I don't know Python
    very well so the parser is pretty simple. It only lexes single characters
    as tokens. It only supports And, Or, Optional, OneOrMore and ZeroOrMore
    rules but I already think this is a rich set of rules. I'm sure others can
    be added. Finally, I'm not sure it's safely copying all its parameter input
    the same way pyparsing does but surely those bugs can be worked out. It's
    merely a proof of concept to demonstrate a point.
    Everyone, please look it over and tell me what you think.
    Unfortunately, my news client is kind of poor, so I can't simply cut and
    paste the code into here. All the tabs get turned into single spacing, so I
    will post this link, instead:


    http://theorem.ca/~dlkong/new_pyparsing.zip


    I hope you can all deal with .zip files. Let me know if this is a
    problem.
    Thank you...
     
    Just Another Victim of the Ambient Morality, Nov 4, 2007
    #15
  16. Just Another Victim of the Ambient Morality

    Neil Cerutti Guest

    On 2007-11-04, Just Another Victim of the Ambient Morality
    <> wrote:
    > "Neil Cerutti" <> wrote in message
    > news:nPnXi.39845$...
    >> I believe there's no cure for the confusion you're having except
    >> for implementing a parser for your proposed grammar.
    >> Alternatively, try implementing your grammar in one of your other
    >> favorite parser generators.

    >
    > I believe there is a cure and it's called recursive descent
    > parsing. It's slow, obviously, but it's correct and, sometimes
    > (arguably, often), that's more important than execution speed.
    >
    > I spent this morning whipping up a proof of concept parser
    > whose interface greatly resembles pyparsing but, barring unknown
    > bugs, works and works as I'd expect a recursive descent parser
    > to work. I don't know Python very well so the parser is pretty
    > simple. It only lexes single characters as tokens. It only
    > supports And, Or, Optional, OneOrMore and ZeroOrMore rules but
    > I already think this is a rich set of rules. I'm sure others
    > can be added. Finally, I'm not sure it's safely copying all
    > its parameter input the same way pyparsing does but surely
    > those bugs can be worked out. It's merely a proof of concept
    > to demonstrate a point.
    > Everyone, please look it over and tell me what you think.
    > Unfortunately, my news client is kind of poor, so I can't
    > simply cut and paste the code into here. All the tabs get
    > turned into single spacing, so I will post this link, instead:
    >
    > http://theorem.ca/~dlkong/new_pyparsing.zip


    Your program doesn't necessarily address the ambiguity in the
    grammar in question, since right now it is only a recognizer.
    Will it be hard to get it to return a parse tree?

    The grammar in your implementation is:

    >>> goal = OneOrMore(RuleAnd('a') | RuleAnd('b')) + RuleAnd('b')
    >>> goal.parse(0, 'ab')
    True
    >>> goal.parse(0, 'ba')
    False
    >>> goal.parse(0, 'b')
    False
    >>> goal.parse(0, 'aaab')
    True
    >>> goal.parse(0, 'abc')
    True

    So far so good. :)

    --
    Neil Cerutti
     
    Neil Cerutti, Nov 4, 2007
    #16
  17. "Neil Cerutti" <> wrote in message
    news:qzsXi.39852$...
    > On 2007-11-04, Just Another Victim of the Ambient Morality
    > <> wrote:
    >> "Neil Cerutti" <> wrote in message
    >> news:nPnXi.39845$...
    >>> I believe there's no cure for the confusion you're having except
    >>> for implementing a parser for your proposed grammar.
    >>> Alternatively, try implementing your grammar in one of your other
    >>> favorite parser generators.

    >>
    >> I believe there is a cure and it's called recursive descent
    >> parsing. It's slow, obviously, but it's correct and, sometimes
    >> (arguably, often), that's more important than execution speed.
    >>
    >> I spent this morning whipping up a proof of concept parser
    >> whose interface greatly resembles pyparsing but, barring unknown
    >> bugs, works and works as I'd expect a recursive descent parser
    >> to work. I don't know Python very well so the parser is pretty
    >> simple. It only lexes single characters as tokens. It only
    >> supports And, Or, Optional, OneOrMore and ZeroOrMore rules but
    >> I already think this is a rich set of rules. I'm sure others
    >> can be added. Finally, I'm not sure it's safely copying all
    >> its parameter input the same way pyparsing does but surely
    >> those bugs can be worked out. It's merely a proof of concept
    >> to demonstrate a point.
    >> Everyone, please look it over and tell me what you think.
    >> Unfortunately, my news client is kind of poor, so I can't
    >> simply cut and paste the code into here. All the tabs get
    >> turned into single spacing, so I will post this link, instead:
    >>
    >> http://theorem.ca/~dlkong/new_pyparsing.zip

    >
    > Your program doesn't necessarily address the ambiguity in the
    > grammar in question, since right now it is only a recognizer.
    > Will it be hard to get it to return a parse tree?


    Hey, it's only a proof of concept. If you can parse the tree, surely
    you can record what you parsed, right?
    Did you notice that the parse() functions have the rather serious bug of
    not returning how much of the string they could parse? It just so happens
    that the constructions that I made only ever had to increment the matches
    by one, so they just happen to work. That's an easy bug to fix but a pretty
    major one to have overlooked. Hence, my enthusiasm for input...
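
    One way that bug might be fixed (a sketch only, not the code in the zip;
    the class names are modelled loosely on the ones quoted earlier in the
    thread) is to have every parse() return the index it reached, or None on
    failure, so each combinator advances by exactly what its children
    actually consumed:

    class RuleAnd:
        def __init__(self, ch):
            self.ch = ch
        def parse(self, pos, text):
            # Return the position after the match, not just True/False.
            if pos < len(text) and text[pos] == self.ch:
                return pos + 1
            return None

    class Sequence:
        def __init__(self, *rules):
            self.rules = rules
        def parse(self, pos, text):
            for rule in self.rules:
                pos = rule.parse(pos, text)
                if pos is None:
                    return None
            return pos

    seq = Sequence(RuleAnd('a'), RuleAnd('b'))
    print(seq.parse(0, 'ab'))   # 2
    print(seq.parse(0, 'ba'))   # None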


    > The grammar in your implementation is:
    >
    >>>> goal = OneOrMore(RuleAnd('a') | RuleAnd('b')) + RuleAnd('b')
    >>>> goal.parse(0, 'ab')
    > True
    >>>> goal.parse(0, 'ba')
    > False
    >>>> goal.parse(0, 'b')
    > False
    >>>> goal.parse(0, 'aaab')
    > True
    >>>> goal.parse(0, 'abc')
    > True
    >
    > So far so good. :)


    Good! Keep hammering at it!
    More importantly, study it to understand the idea I'm trying to convey.
    This is what I thought a recursive descent parser would do...
     
    Just Another Victim of the Ambient Morality, Nov 5, 2007
    #17
  18. Just Another Victim of the Ambient Morality

    Kay Schluehr Guest

    On Nov 4, 10:44 pm, "Just Another Victim of the Ambient Morality"
    <>

    > I believe there is a cure and it's called recursive descent parsing.
    > It's slow, obviously, but it's correct and, sometimes (arguably, often),
    > that's more important than execution speed.


    Recursive descent parsing is not necessarily slow but from your
    remarks above I infer you want a general RD parser with backtracking:
    when one rule doesn't match, try another one to derive the current
    symbol in the input stream.
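
    A toy sketch of that backtracking behaviour, written with generators
    purely for illustration (the helper names are made up and this is not
    how pyparsing works): each rule yields every position at which it could
    stop, so when one alternative fails further downstream, the next one is
    tried.

    def lit(ch):
        def rule(text, pos):
            if pos < len(text) and text[pos] == ch:
                yield pos + 1
        return rule

    def alt(*rules):                     # try each alternative in turn
        def rule(text, pos):
            for r in rules:
                yield from r(text, pos)
        return rule

    def seq(*rules):                     # thread positions through the children
        def rule(text, pos, i=0):
            if i == len(rules):
                yield pos
            else:
                for p in rules[i](text, pos):
                    yield from rule(text, p, i + 1)
        return rule

    # ('a' 'a' | 'a') 'b' on 'ab': the 'aa' alternative fails, 'a' is
    # retried, and then 'b' matches, so the input is derived after all.
    goal = seq(alt(seq(lit('a'), lit('a')), lit('a')), lit('b'))
    print(list(goal('ab', 0)))   # [2]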

    I'm not sure one needs to start again with a naive approach just to
    avoid any parser theory. For a user of a parser it is quite important
    whether she has to wait 50 seconds for a parse to run or 50
    milliseconds. I don't like to compromise speed for implementation
    simplicity here.
     
    Kay Schluehr, Nov 5, 2007
    #18
  19. "Kay Schluehr" <> wrote in message
    news:...
    > On Nov 4, 10:44 pm, "Just Another Victim of the Ambient Morality"
    > <>
    >
    >> I believe there is a cure and it's called recursive descent parsing.
    >> It's slow, obviously, but it's correct and, sometimes (arguably, often),
    >> that's more important than execution speed.

    >
    > Recursive descent parsing is not necessarily slow but from your
    > remarks above I infer you want a general RD parser with backtracking:
    > when one rule doesn't match, try another one to derive the current
    > symbol in the input stream.


    I think I've just discovered a major hurdle in my understanding of the
    problem.
    You keep saying "with backtracking." Why? Isn't "backtracking"
    inherent in recursion? So, why can't these alleged "recursive descent
    parsers" find valid parsings? How are they not already backtracking? What
    was the point of being recursive if not to take advantage of the inherent
    backtracking in it?
    Obviously, these parsers aren't recursing through what I think they
    should be recursing. The question is "why not?"

    Correct me if I'm wrong but I'm beginning to think that pyparsing
    doesn't typically use recursion at all. It only employs it if you
    explicitly create a recursive rule, using the Forward class. Otherwise,
    it does everything iteratively, hence the lack of "backtracking."
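
    That intuition about repetition can be illustrated with two toy rules in
    plain Python (purely illustrative; neither is pyparsing's actual code):
    a greedy one that consumes as much as it can and never gives anything
    back, and a backtracking one that yields every possible stopping point
    so that a following Literal('end') still gets a chance to match.

    import re

    def greedy_words(text, pos):
        # Greedy repetition as described in this thread: consume as many
        # words as possible and commit to that, never reconsidering.
        end = pos
        while True:
            m = re.match(r'[A-Za-z]+\s*', text[end:])
            if not m:
                break
            end += m.end()
        return end if end > pos else None

    def backtracking_words(text, pos):
        # Yield every possible stopping point, longest first, so whatever
        # follows can be tried at each of them.
        ends, end = [], pos
        while True:
            m = re.match(r'[A-Za-z]+\s*', text[end:])
            if not m:
                break
            end += m.end()
            ends.append(end)
        for e in reversed(ends):
            yield e

    s = 'First Second Third end'
    print(greedy_words(s, 0))                # 22 -- 'end' has already been eaten
    print(list(backtracking_words(s, 0)))    # [22, 19, 13, 6]; at 19 'end' can still match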


    > I'm not sure one needs to start again with a naive approach just to
    > avoid any parser theory. For a user of a parser it is quite important
    > whether she has to wait 50 seconds for a parse to run or 50
    > milliseconds. I don't like to compromise speed for implementation
    > simplicity here.


    This attitude is all too prevalent among computer professionals... Of
    course it's a useful thing to shield users from the intricacies of parser
    theory! Just as much as it is useful to shield drivers from needing
    automotive engineering, or software users from programming. How many
    people have come to this newsgroup asking about anomalous pyparsing
    behaviour, despite their grammars being mathematically correct?
    Think of it this way. You can force all the clients of pyparsing to
    duplicate work on figuring out how to massage pyparsing to their grammars,
    or you can do the work of getting pyparsing to solve people's problems,
    once. That's what a library is supposed to do...
    Finally, I can't believe you complain about potential speed problems.
    First, depending on the size of the string, it's likely to be the difference
    between 2ms and 200ms. Secondly, if speed were an issue, you wouldn't go
    with a recursive descent parser. You'd go with LALR or the many other
    parsing techniques available. Recursive descent parsing is for those
    situations where you need correctness, regardless of execution time. These
    situations happen...
    I've said this before, albeit for a different language, but it applies
    to Python just as well. I don't use Python to write fast code, I use it to
    write code fast.
    If _you_ "don't like to compromise speed for implementation simplicity"
    then you have a plethora of choices available to you. What about the guy who
    needs to parse correctly and is unconcerned about speed?
     
    Just Another Victim of the Ambient Morality, Nov 5, 2007
    #19
  20. Just Another Victim of the Ambient Morality

    Neil Cerutti Guest

    On 2007-11-05, Just Another Victim of the Ambient Morality <> wrote:
    >
    > "Neil Cerutti" <> wrote in message
    > news:qzsXi.39852$...
    >> On 2007-11-04, Just Another Victim of the Ambient Morality
    >> <> wrote:
    >>> "Neil Cerutti" <> wrote in message
    >>> news:nPnXi.39845$...
    >>>> I believe there's no cure for the confusion you're having except
    >>>> for implementing a parser for your proposed grammar.
    >>>> Alternatively, try implementing your grammar in one of your other
    >>>> favorite parser generators.
    >>>
    >>> I believe there is a cure and it's called recursive descent
    >>> parsing. It's slow, obviously, but it's correct and, sometimes
    >>> (arguably, often), that's more important than execution speed.
    >>>
    >>> I spent this morning whipping up a proof of concept parser
    >>> whose interface greatly resembles pyparsing but, barring unknown
    >>> bugs, works and works as I'd expect a recursive descent parser
    >>> to work. I don't know Python very well so the parser is pretty
    >>> simple. It only lexes single characters as tokens. It only
    >>> supports And, Or, Optional, OneOrMore and ZeroOrMore rules but
    >>> I already think this is a rich set of rules. I'm sure others
    >>> can be added. Finally, I'm not sure it's safely copying all
    >>> its parameter input the same way pyparsing does but surely
    >>> those bugs can be worked out. It's merely a proof of concept
    >>> to demonstrate a point.
    >>> Everyone, please look it over and tell me what you think.
    >>> Unfortunately, my news client is kind of poor, so I can't
    >>> simply cut and paste the code into here. All the tabs get
    >>> turned into single spacing, so I will post this link, instead:
    >>>
    >>> http://theorem.ca/~dlkong/new_pyparsing.zip

    >>
    >> Your program doesn't necessarily address the ambiguity in the
    >> grammar in question, since right now it is only a recognizer.
    >> Will it be hard to get it to return a parse tree?

    >
    > Hey, it's only a proof of concept. If you can parse the tree, surely
    > you can record what you parsed, right?
    > Did you notice that the parse() functions have the rather serious bug of
    > not returning how much of the string they could parse?


    Unfortunately I haven't had much time to play with it today; just
    barely enough to put it through a very few paces.

    > It just so happens that the constructions that I made only
    > ever had to increment the matches by one, so they just happen
    > to work. That's an easy bug to fix but a pretty major one to
    > have overlooked. Hence, my enthusiasm for input...
    >
    >> The grammar in your implementation is:
    >>
    >>>>> goal = OneOrMore(RuleAnd('a') | RuleAnd('b')) + RuleAnd('b')
    >>>>> goal.parse(0, 'ab')
    >> True
    >>>>> goal.parse(0, 'ba')
    >> False
    >>>>> goal.parse(0, 'b')
    >> False
    >>>>> goal.parse(0, 'aaab')
    >> True
    >>>>> goal.parse(0, 'abc')
    >> True
    >>
    >> So far so good. :)

    >
    > Good! Keep hammering at it!
    > More importantly, study it to understand the idea I'm
    > trying to convey. This is what I thought a recursive descent
    > parser would do...


    Kay has pointed out how it works. Strangely enough, I've never
    studied a backtracking RDP before (trying to teach yourself a
    subject like parsing can be tricky--I've had to somehow avoid all
    the texts that overuse Greek letters--those incomprehensible
    symbols confuse the hell out of me). It does simplify the job of
    the grammar designer, but Kay's message makes it sound like it
    won't scale very well.

    It might, perhaps, be an interesting feature for PyParsing to
    entertain by setting a 'backtracking' option, for when you're
    writing a quick script and don't want to fuss too much with a
    non-conformant grammar.

    I'll have more time to look at it tomorrow.

    --
    Neil Cerutti
     
    Neil Cerutti, Nov 5, 2007
    #20