PyParsing and Headaches

Discussion in 'Python' started by Bytter, Nov 22, 2006.

  1. Bytter

    Bytter Guest

    Hi,

    I'm trying to construct a parser, but I'm stuck with some basic
    stuff... For example, I want to match the following:

    letter = "A"..."Z" | "a"..."z"
    literal = letter+
    include_bool = "+" | "-"
    term = [include_bool] literal

    So I defined this as:

    literal = Word(alphas)
    include_bool = Optional(oneOf("+ -"))
    term = include_bool + literal

    The problem is that:

    term.parseString("+a") -> (['+', 'a'], {}) # OK
    term.parseString("+ a") -> (['+', 'a'], {}) # Not OK. It shouldn't
    recognize any token, since I didn't say that a SPACE was allowed between
    include_bool and literal.

    Can anyone give me a hand here?

    Cheers!

    Hugo Ferreira

    BTW, the following is the complete grammar I'm trying to implement with
    pyparsing:

    ## L ::= expr | expr L
    ## expr ::= term | binary_expr
    ## binary_expr ::= term " " binary_op " " term
    ## binary_op ::= "*" | "OR" | "AND"
    ## include_bool ::= "+" | "-"
    ## term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~" literal)
    ## modifier ::= (letter | "_")+
    ## literal ::= word | quoted_words
    ## quoted_words ::= '"' word (" " word)* '"'
    ## word ::= (letter | digit | "_")+
    ## number ::= digit+
    ## range ::= number (".." | "...") number
    ## letter ::= "A"..."Z" | "a"..."z"
    ## digit ::= "0"..."9"

    And this is where I got so far:

    from pyparsing import (Literal, OneOrMore, Optional, Word, alphas, nums,
                           oneOf, quotedString)

    word = Word(nums + alphas + "_")
    binary_op = oneOf("* and or", caseless=True).setResultsName("operator")
    include_bool = oneOf("+ -")
    literal = (word | quotedString).setResultsName("literal")
    modifier = Word(alphas + "_")
    # "..." is listed first: | takes the first alternative that matches
    rng = Word(nums) + (Literal("...") | Literal("..")) + Word(nums)
    term = ((Optional(include_bool) + Optional(modifier + ":") + (literal |
    rng)) | ("~" + literal)).setResultsName("Term")
    binary_expr = (term + binary_op + term).setResultsName("binary")
    expr = (binary_expr | term).setResultsName("Expr")
    L = OneOrMore(expr)


    --
    GPG Fingerprint: B0D7 1249 447D F5BB 22C5 5B9B 078C 2615 504B 7B85
    Bytter, Nov 22, 2006
    #1

  2. On Wed, Nov 22, 2006 at 11:17:52AM -0800, Bytter wrote:
    > Hi,
    >
    > I'm trying to construct a parser, but I'm stuck with some basic
    > stuff... For example, I want to match the following:
    >
    > letter = "A"..."Z" | "a"..."z"
    > literal = letter+
    > include_bool := "+" | "-"
    > term = [include_bool] literal
    >
    > So I defined this as:
    >
    > literal = Word(alphas)
    > include_bool = Optional(oneOf("+ -"))
    > term = include_bool + literal

    The + operator skips whitespace between expressions by default, so a space
    is allowed there. You need to override this explicitly.
    Try:

    term = Combine(include_bool + literal)
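
A minimal sketch of the Combine suggestion above, reusing the names from the question. It assumes a pyparsing release that still provides the camelCase API used in this thread (parseString, oneOf); the helper accepts() is hypothetical and exists only for illustration.

```python
from pyparsing import Combine, Optional, ParseException, Word, alphas, oneOf

literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))

# Combine requires its sub-expressions to be adjacent and joins
# the matched pieces into a single token.
term = Combine(include_bool + literal)


def accepts(text):
    """Hypothetical helper: True if `term` parses the start of `text`."""
    try:
        term.parseString(text)
        return True
    except ParseException:
        return False


print(term.parseString("+a").asList())  # sign and word come back as one token
print(accepts("+ a"))                   # the space is no longer skipped over
```

Note the side effect: the sign and the word are merged into one string, which matters if they are needed as separate tokens later.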

    Chris Lambacher, Nov 22, 2006
    #2

  3. Paul McGuire

    Paul McGuire Guest

    "Bytter" <> wrote in message
    news:...
    > Hi,
    >
    > I'm trying to construct a parser, but I'm stuck with some basic
    > stuff... For example, I want to match the following:
    >
    > letter = "A"..."Z" | "a"..."z"
    > literal = letter+
    > include_bool := "+" | "-"
    > term = [include_bool] literal
    >
    > So I defined this as:
    >
    > literal = Word(alphas)
    > include_bool = Optional(oneOf("+ -"))
    > term = include_bool + literal
    >
    > The problem is that:
    >
    > term.parseString("+a") -> (['+', 'a'], {}) # OK
    > term.parseString("+ a") -> (['+', 'a'], {}) # KO. It shouldn't
    > recognize any token since I didn't said the SPACE was allowed between
    > include_bool and literal.
    >


    As Chris pointed out in his post, the most direct way to fix this is to use
    Combine. Note that Combine does two things: it requires the expressions to
    be adjacent, and it combines the results into a single token. For instance,
    when defining the expression for a real number, something like:

    realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

    Pyparsing would parse "3.14159" into the separate tokens ['3', '.',
    '14159']. For this grammar, pyparsing would also accept "2. 23" as ['2',
    '.', '23'], even though there is a space between the decimal point and
    "23". But by wrapping it inside Combine, as in:

    realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

    we accomplish two things: pyparsing only matches if all the elements are
    adjacent, with no whitespace or comments; and the matched token is returned
    as ['3.14159']. (Yes, I left off scientific notation, but it is an
    extension of the same issue.)
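
The two realnum definitions can be compared side by side. This is a sketch only: the names loose_realnum, strict_realnum and strict_accepts are made up for the comparison, and it assumes a pyparsing release that still offers the camelCase names used in this thread.

```python
from pyparsing import Combine, Optional, ParseException, Word, nums, oneOf

# Without Combine: whitespace is skipped between the pieces,
# and each piece is returned as its own token.
loose_realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

# With Combine: the pieces must be adjacent and come back as one token.
strict_realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))


def strict_accepts(text):
    """Illustrative helper: True if strict_realnum parses the start of text."""
    try:
        strict_realnum.parseString(text)
        return True
    except ParseException:
        return False


print(loose_realnum.parseString("3.14159").asList())   # separate tokens
print(loose_realnum.parseString("2. 23").asList())     # space is skipped
print(strict_realnum.parseString("3.14159").asList())  # single token
print(strict_accepts("2. 23"))                         # rejected
```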

    Pyparsing in general does implicit whitespace skipping; it is part of the
    zen of pyparsing, and distinguishes it from conventional regexps (although I
    think there is a new '?' switch for re's that puts '\s*'s between re terms
    for you). This is to simplify the grammar definition, so that it doesn't
    need to be littered with "optional whitespace or comments could go here"
    expressions; instead, whitespace and comments (or "ignorables" in pyparsing
    terminology) are parsed over before every grammar expression. I instituted
    this out of recoil from a previous project, in which a co-developer
    implemented a boolean parser by first tokenizing by whitespace, then parsing
    out the tokens. Unfortunately, this meant that "color=='blue' &&
    size=='medium'" would not parse successfully, instead requiring "color ==
    'blue' && size == 'medium'". It doesn't seem like much, but our support
    guys got many calls asking why the boolean clauses weren't matching. I
    decided that when I wrote a parser, "y=m*x+b" would be just as parseable as
    "y = m * x + b". For that matter, you'd be surprised where whitespace and
    comments sneak into people's source code: spaces after left parentheses and
    comments after semicolons, for example, are easily forgotten when spec'ing
    out the syntax for a C "for" statement; whitespace inside HTML tags is
    another unanticipated surprise.
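
The implicit skipping described above can be seen with a toy grammar. This is a rough sketch: the equation grammar is invented for illustration, and cStyleComment is used here only as one example of an ignorable.

```python
from pyparsing import Word, alphas, cStyleComment

ident = Word(alphas)

# Toy grammar for "y = m * x + b"; no whitespace handling is spelled out.
equation = ident + "=" + ident + "*" + ident + "+" + ident

# Register C-style comments as ignorable, so they are skipped
# wherever whitespace would be skipped.
equation.ignore(cStyleComment)

compact = equation.parseString("y=m*x+b").asList()
spaced = equation.parseString("y = m * x + b").asList()
commented = equation.parseString("y = m /* slope */ * x + b").asList()

print(compact)
print(compact == spaced == commented)
```

All three inputs go through the same grammar; nothing in the grammar itself mentions whitespace or comments.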

    So looking at your grammar, you say you don't want to have this be a
    successful parse:
    term.parseString("+ a") -> (['+', 'a'], {})

    because, "It shouldn't recognize any token since I didn't said the SPACE was
    allowed between include_bool and literal." In fact, pyparsing allows spaces
    by default, that's why the given parse succeeds. I would turn this question
    around, and ask you in terms of your grammar - what SHOULD be allowed
    between include_bool and literal? If spaces are not a problem, then your
    grammar as-is is sufficient. If spaces are absolutely verboten, then there
    are 2 or 3 different techniques in pyparsing to disable the
    whitespace-skipping behavior, depending on whether you want all whitespace
    skipping disabled, just for literals of a certain type, or just for literals
    when following a leading include_bool sign.

    Thanks for giving pyparsing a try; if you want further help, you can post
    here, or on the pyparsing wiki - the discussion threads on the Home page are
    a pretty good support and message log.

    -- Paul
    Paul McGuire, Nov 22, 2006
    #3
  4. Bytter

    Bytter Guest

    (This message has already been sent to the mailing list, but I'm not
    sure it is arriving, since it doesn't show up on Usenet,
    so I'm posting it through here now.)

    Chris,

    Thanks for your quick answer. That changes a lot of stuff, and now I'm
    able to do my parsing as I intended to.

    Still, there's a remaining problem. By using Combine(), everything is
    interpreted as a single token. What I need is for 'include_bool' and
    'literal' to be parsed as separate tokens, but with no space allowed
    between them...

    Paul,

    Thanks for your detailed explanation. One of the things I think is
    missing from the documentation (or that I couldn't find easily) is the
    kind of explanation you give about 'The Way of PyParsing'. For example,
    it took me a while to understand that I could easily implement simple
    recursions using OneOrMore(Group()). Or maybe things were out there and
    I didn't search enough...

    Still, fwiw, congratulations on the library. PyParsing allowed me to
    do in just a couple of hours, including learning its API (minus
    this little inconvenience), what would have taken me a couple of days
    with, for example, ANTLR (in fact, I've already put aside ANTLR more
    than once in the past for a built-from-scratch parser).

    Cheers,

    Hugo Ferreira

    Bytter, Nov 23, 2006
    #4
  5. Bytter

    Bytter Guest

    Heya there,

    Ok, found the solution. I just needed to use leaveWhitespace() in the
    places where I want pyparsing to take the spaces into account.
    Thanks for the help.

    Cheers!

    Hugo Ferreira
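
The post does not show where leaveWhitespace() was applied, so the following is only a guess at one possible arrangement: whitespace skipping is turned off for the literal alone, which keeps the sign and the word as separate tokens while rejecting a space between them. The helper accepts() is hypothetical.

```python
from pyparsing import Optional, ParseException, Word, alphas, oneOf

include_bool = oneOf("+ -")

# leaveWhitespace() disables whitespace skipping before this element,
# so the word must start exactly where the previous match ended.
literal = Word(alphas).leaveWhitespace()

term = Optional(include_bool) + literal


def accepts(text):
    """Hypothetical helper: True if `term` parses the start of `text`."""
    try:
        term.parseString(text)
        return True
    except ParseException:
        return False


print(term.parseString("+a").asList())  # two tokens, unlike with Combine
print(accepts("+ a"))                   # space between sign and word is rejected
```

A consequence of this arrangement is that the literal never skips leading whitespace, so it may need adjusting when terms are separated by spaces in a larger grammar.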

    Bytter, Nov 23, 2006
    #5