PyParsing and Headaches

Bytter · Nov 22, 2006

Hi,

I'm trying to construct a parser, but I'm stuck with some basic
stuff... For example, I want to match the following:

letter = "A"..."Z" | "a"..."z"
literal = letter+
include_bool = "+" | "-"
term = [include_bool] literal

So I defined this as:

literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal

The problem is that:

term.parseString("+a") -> (['+', 'a'], {}) # OK
term.parseString("+ a") -> (['+', 'a'], {}) # KO. It shouldn't
recognize any token, since I didn't say that a SPACE was allowed between
include_bool and literal.

Can anyone give me a hand here?

Cheers!

Hugo Ferreira

BTW, the following is the complete grammar I'm trying to implement with
pyparsing:

## L ::= expr | expr L
## expr ::= term | binary_expr
## binary_expr ::= term " " binary_op " " term
## binary_op ::= "*" | "OR" | "AND"
## include_bool ::= "+" | "-"
## term ::= ([include_bool] [modifier ":"] (literal | range)) | ("~" literal)
## modifier ::= (letter | "_")+
## literal ::= word | quoted_words
## quoted_words ::= '"' word (" " word)* '"'
## word ::= (letter | digit | "_")+
## number ::= digit+
## range ::= number (".." | "...") number
## letter ::= "A"..."Z" | "a"..."z"
## digit ::= "0"..."9"

And this is where I got so far:

from pyparsing import (Literal, OneOrMore, Optional, Word, alphas, nums,
                       oneOf, quotedString)

word = Word(nums + alphas + "_")
binary_op = oneOf("* and or", caseless=True).setResultsName("operator")
include_bool = oneOf("+ -")
literal = (word | quotedString).setResultsName("literal")
modifier = Word(alphas + "_")
# "..." must be tried before "..", otherwise ".." always matches first
rng = Word(nums) + (Literal("...") | Literal("..")) + Word(nums)
# rng is tried before literal, since word would otherwise consume the digits
term = ((Optional(include_bool) + Optional(modifier + ":") + (rng | literal))
        | ("~" + literal)).setResultsName("Term")
binary_expr = (term + binary_op + term).setResultsName("binary")
expr = (binary_expr | term).setResultsName("Expr")
L = OneOrMore(expr)

Chris Lambacher · Nov 22, 2006

Bytter said:
So I defined this as:

literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))
term = include_bool + literal

The + operator here allows whitespace between the two expressions. You need
to override this explicitly. Try:

term = Combine(include_bool + literal)
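
A minimal sketch of this suggestion, assuming a pyparsing release that still accepts the camelCase names used in this thread (parseString, oneOf). The accepts helper is hypothetical and exists only for illustration.

```python
from pyparsing import Combine, Optional, ParseException, Word, alphas, oneOf

literal = Word(alphas)
include_bool = Optional(oneOf("+ -"))

# Combine requires its sub-expressions to be adjacent in the input
# and merges whatever they match into one token.
term = Combine(include_bool + literal)


def accepts(text):
    """Hypothetical helper: True if term matches at the start of text."""
    try:
        term.parseString(text)
        return True
    except ParseException:
        return False


print(term.parseString("+a").asList())  # ['+a'] (a single merged token)
print(accepts("+ a"))                   # False: the space is now rejected
```

The space is rejected as intended, but the sign and the word come back glued together as one token.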


Paul McGuire · Nov 22, 2006

Bytter said:
The problem is that:

term.parseString("+a") -> (['+', 'a'], {}) # OK
term.parseString("+ a") -> (['+', 'a'], {}) # KO. It shouldn't
recognize any token, since I didn't say that a SPACE was allowed between
include_bool and literal.

As Chris pointed out in his post, the most direct way to fix this is to use
Combine. Note that Combine does two things: it requires the expressions to
be adjacent, and it combines the results into a single token. For instance,
suppose the expression for a real number is defined as:

realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

Pyparsing would parse "3.14159" into the separate tokens ['3', '.',
'14159'] (an unmatched Optional contributes no token). For this grammar,
pyparsing would also accept "2. 23" as ['2', '.', '23'], even though there
is a space between the decimal point and "23". But by wrapping it inside
Combine, as in:

realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

we accomplish two things: pyparsing only matches if all the elements are
adjacent, with no whitespace or comments; and the matched token is returned
as ['3.14159']. (Yes, I left off scientific notation, but it is an
extension of the same issue.)
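
The two realnum definitions can be compared side by side. This is a sketch only: the names loose_realnum and strict_realnum are made up for the comparison, and the outputs shown assume a current pyparsing release, where an unmatched Optional adds no token.

```python
from pyparsing import Combine, Optional, ParseException, Word, nums, oneOf

# Without Combine: whitespace is skipped between the parts, tokens stay separate.
loose_realnum = Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums)

# With Combine: parts must be adjacent and are merged into one token.
strict_realnum = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Word(nums))

print(loose_realnum.parseString("3.14159").asList())   # ['3', '.', '14159']
print(loose_realnum.parseString("2. 23").asList())     # ['2', '.', '23']
print(strict_realnum.parseString("3.14159").asList())  # ['3.14159']

try:
    strict_realnum.parseString("2. 23")
except ParseException:
    print("'2. 23' rejected by the Combine version")
```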

Pyparsing in general does implicit whitespace skipping; it is part of the
zen of pyparsing, and distinguishes it from conventional regexps (although I
think there is a new '?' switch for re's that puts '\s*'s between re terms
for you). This is to simplify the grammar definition, so that it doesn't
need to be littered with "optional whitespace or comments could go here"
expressions; instead, whitespace and comments (or "ignorables" in pyparsing
terminology) are parsed over before every grammar expression. I instituted
this out of recoil from a previous project, in which a co-developer
implemented a boolean parser by first tokenizing by whitespace, then parsing
out the tokens. Unfortunately, this meant that "color=='blue' &&
size=='medium'" would not parse successfully, instead requiring "color ==
'blue' && size == 'medium'". It doesn't seem like much, but our support
guys got many calls asking why the boolean clauses weren't matching. I
decided that when I wrote a parser, "y=m*x+b" would be just as parseable as
"y = m * x + b". For that matter, you'd be surprised where whitespace and
comments sneak into people's source code: spaces after left parentheses and
comments after semicolons, for example, are easily forgotten when spec'ing
out the syntax for a C "for" statement; whitespace inside HTML tags is
another unanticipated surprise.
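
As a small illustration of the implicit whitespace skipping described above, here is a sketch with a made-up equation expression (not part of the grammar under discussion) that parses both spellings of the line equation to the same tokens.

```python
from pyparsing import Word, alphas

identifier = Word(alphas)

# Hypothetical grammar for "y = m * x + b"; no whitespace handling is written.
equation = identifier + "=" + identifier + "*" + identifier + "+" + identifier

compact = equation.parseString("y=m*x+b").asList()
spaced = equation.parseString("y = m * x + b").asList()

print(compact)            # ['y', '=', 'm', '*', 'x', '+', 'b']
print(compact == spaced)  # True
```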

So looking at your grammar, you say you don't want to have this be a
successful parse:
term.parseString("+ a") -> (['+', 'a'], {})

because "It shouldn't recognize any token since I didn't say the SPACE was
allowed between include_bool and literal." In fact, pyparsing allows spaces
by default; that's why the given parse succeeds. I would turn this question
around, and ask you in terms of your grammar - what SHOULD be allowed
between include_bool and literal? If spaces are not a problem, then your
grammar as-is is sufficient. If spaces are absolutely verboten, then there
are 2 or 3 different techniques in pyparsing to disable the
whitespace-skipping behavior, depending on whether you want all whitespace
skipping disabled, just for literals of a certain type, or just for literals
when following a leading include_bool sign.
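
The post does not name the techniques, so the following is only one plausible reading: a sketch that turns off whitespace skipping for a single use of literal via leaveWhitespace(). The copy() call is an assumption of this sketch, used so that other uses of literal keep the default behavior.

```python
from pyparsing import Optional, ParseException, Word, alphas, oneOf

include_bool = Optional(oneOf("+ -"))
literal = Word(alphas)

# leaveWhitespace() stops this copy of literal from skipping leading
# whitespace, so it must start right after whatever precedes it.
term = include_bool + literal.copy().leaveWhitespace()

print(term.parseString("+a").asList())  # ['+', 'a'] (still two tokens)

try:
    term.parseString("+ a")
except ParseException as err:
    print("rejected:", err)
```

Unlike Combine, this keeps the sign and the word as separate tokens.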

Thanks for giving pyparsing a try; if you want further help, you can post
here, or on the pyparsing wiki - the discussion threads on the Home page are
a pretty good support and message log.

-- Paul

Bytter · Nov 23, 2006

(This message has already been sent to the mailing list, but I'm not
sure it is arriving, since it doesn't show up on Usenet,
so I'm posting it through here now.)

Chris,

Thanks for your quick answer. That changes a lot of stuff, and now I'm
able to do my parsing as I intended to.

Still, there's a remaining problem. By using Combine(), everything is
returned as a single token. What I need is for 'include_bool' and
'literal' to be parsed as separate tokens, but without a space in
between...

Paul,

Thanks for your detailed explanation. One of the things I think is
missing from the documentation (or that I couldn't find easily) is the
kind of explanation you give about 'The Way of PyParsing'. For example,
it took me a while to understand that I could easily implement simple
recursion using OneOrMore(Group()). Or maybe this was out there and
I didn't search enough...

Still, fwiw, congratulations on the library. PyParsing allowed me to
do in just a couple of hours, including learning its API (minus
this little inconvenience), what would have taken me a couple of days
with, for example, ANTLR (in fact, I've already put ANTLR aside more
than once in the past in favor of a built-from-scratch parser).

Cheers,

Hugo Ferreira


Bytter · Nov 23, 2006

Heya there,

OK, found the solution. I just needed to use leaveWhitespace() in the
places where I want pyparsing to take the spaces into account.
Thanks for the help.

Cheers!

Hugo Ferreira
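
The post does not show the final code, so the following is only a guess at how leaveWhitespace() might be applied to the larger grammar. The names signed and query are invented for the sketch, and only the sign and word parts of term are modeled. The idea is that a word must directly follow a sign, while terms themselves may still be separated by spaces.

```python
from pyparsing import Group, OneOrMore, ParseException, Word, alphas, nums, oneOf

word = Word(alphas + nums + "_")
include_bool = oneOf("+ -")

# After a sign, the word may not be preceded by whitespace.
signed = include_bool + word.copy().leaveWhitespace()

# A bare word (no sign) still skips leading whitespace as usual.
term = Group(signed | word)
query = OneOrMore(term)

print(query.parseString("+a -b c").asList())  # [['+', 'a'], ['-', 'b'], ['c']]

try:
    query.parseString("+ a")
except ParseException:
    print("'+ a' rejected: sign and word must be adjacent")
```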

