[pyparsing] make sure entire string was parsed

Steven Bethard

How do I make sure that my entire string was parsed when I call a
pyparsing element's parseString method? Here's a dramatically
simplified version of my problem:

py> import pyparsing as pp
py> match = pp.Word(pp.nums)
py> def parse_num(s, loc, toks):
....     n, = toks
....     return int(n) + 10
....
py> match.setParseAction(parse_num)
W:(0123...)
py> match.parseString('121abc')
([131], {})

I want to know (somehow) that when I called match.parseString(), there
was some of the string left over (in this case, 'abc') after the parse
was complete. How can I do this? (I don't think I can do character
counting; all my internal setParseAction() functions return non-strings).

STeVe

P.S. FWIW, I've included the real code below. I need to throw an
exception when I call the parseString method of cls._root_node or
cls._root_nodes and the entire string is not consumed.

----------------------------------------------------------------------
# some character classes
printables_trans = _pp.printables.translate
word_chars = printables_trans(_id_trans, '()')
syn_tag_chars = printables_trans(_id_trans, '()-=')
func_tag_chars = printables_trans(_id_trans, '()-=0123456789')

# basic tag components
sep = _pp.Literal('-').leaveWhitespace()
alt_sep = _pp.Literal('=').leaveWhitespace()
special_word = _pp.Combine(sep + _pp.Word(syn_tag_chars) + sep)
supp_sep = (alt_sep | sep).suppress()
syn_word = _pp.Word(syn_tag_chars).leaveWhitespace()
func_word = _pp.Word(func_tag_chars).leaveWhitespace()
id_word = _pp.Word(_pp.nums).leaveWhitespace()

# the different tag types
special_tag = special_word.setResultsName('tag')
syn_tag = syn_word.setResultsName('tag')
func_tags = _pp.ZeroOrMore(supp_sep + func_word)
func_tags = func_tags.setResultsName('funcs')
id_tag = _pp.Optional(supp_sep + id_word).setResultsName('id')
tags = special_tag | (syn_tag + func_tags + id_tag)
def get_tag(orig_string, tokens_start, tokens):
    tokens = dict(tokens)
    tag = tokens.pop('tag')
    if tag == '-NONE-':
        tag = None
    functions = list(tokens.pop('funcs', []))
    id = tokens.pop('id', None)
    return [dict(tag=tag, functions=functions, id=id)]
tags.setParseAction(get_tag)

# node parentheses
start = _pp.Literal('(').suppress()
end = _pp.Literal(')').suppress()

# words
word = _pp.Word(word_chars).setResultsName('word')

# leaf nodes
leaf_node = tags + _pp.Optional(word)
def get_leaf_node(orig_string, tokens_start, tokens):
    try:
        tag_dict, word = tokens
        word = cls._unescape(word)
    except ValueError:
        tag_dict, = tokens
        word = None
    return cls(word=word, **tag_dict)
leaf_node.setParseAction(get_leaf_node)

# node, recursive
node = _pp.Forward()

# branch nodes
branch_node = tags + _pp.OneOrMore(node)
def get_branch_node(orig_string, tokens_start, tokens):
    return cls(children=tokens[1:], **tokens[0])
branch_node.setParseAction(get_branch_node)

# node, recursive
node << start + (branch_node | leaf_node) + end

# root node may have additional parentheses
cls._root_node = node | start + node + end
cls._root_nodes = _pp.OneOrMore(cls._root_node)
 
Paul McGuire

Steven -

Thanks for giving pyparsing a try! To see whether your input text
consumes the whole string, add a StringEnd() element to the end of your
BNF. Then if there is more text after the parsed text, parseString
will throw a ParseException.
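
A minimal sketch of this suggestion, reusing the simplified grammar from the original question. The helper name consumes_everything is hypothetical and exists only for illustration. The parseAll=True variant at the end is not mentioned in the thread; it is an alternative offered by later pyparsing releases.

```python
import pyparsing as pp

number = pp.Word(pp.nums)
number.setParseAction(lambda s, loc, toks: int(toks[0]) + 10)

# Appending StringEnd() makes the match fail unless the input is exhausted.
strict_number = number + pp.StringEnd()


def consumes_everything(parser, text):
    """Hypothetical helper: True if `parser` matches all of `text`."""
    try:
        parser.parseString(text)
    except pp.ParseException:
        return False
    return True


print(number.parseString("121abc"))                  # leading digits only; 'abc' is ignored
print(consumes_everything(strict_number, "121abc"))  # False
print(consumes_everything(strict_number, "121"))     # True

# Alternative in later pyparsing releases: ask parseString to check for leftovers.
try:
    number.parseString("121abc", parseAll=True)
except pp.ParseException as exc:
    print("leftover input:", exc)
```

Because the check happens inside the grammar, it does not matter that the parse actions return non-string values.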

I notice you call leaveWhitespace on several of your parse elements, so
you may have to rstrip() the input text before calling parseString. I
am curious whether leaveWhitespace is really necessary for your
grammar. If it is, you can usually just call leaveWhitespace on the
root element, and this will propagate to all the sub elements.
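
A small sketch of both points, assuming a toy grammar (a word followed by a number) rather than the grammar from the thread. The helper name matches is hypothetical.

```python
import pyparsing as pp

# leaveWhitespace() on the outer expression also applies to its sub-expressions.
strict = (pp.Word(pp.alphas) + pp.Word(pp.nums) + pp.StringEnd()).leaveWhitespace()


def matches(parser, text):
    """Hypothetical helper: True if `parser` accepts `text`."""
    try:
        parser.parseString(text)
    except pp.ParseException:
        return False
    return True


print(matches(strict, "abc123"))             # True
print(matches(strict, "abc 123"))            # False: the inner Word no longer skips the space
print(matches(strict, "abc123\n"))           # False: StringEnd() sees the trailing newline
print(matches(strict, "abc123\n".rstrip()))  # True
```

The third call is the case the rstrip() advice guards against: with whitespace skipping turned off, a trailing newline counts as unparsed input.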

Lastly, you may get caught up with operator precedence, I think your
node assignment statement may need to change from
node << start + (branch_node | leaf_node) + end
to
node << (start + (branch_node | leaf_node) + end)

HTH,
-- Paul
 
Steven Bethard

Paul said:
Thanks for giving pyparsing a try! To see whether your input text
consumes the whole string, add a StringEnd() element to the end of your
BNF. Then if there is more text after the parsed text, parseString
will throw a ParseException.

Thanks, that's exactly what I was looking for.

Paul said:
I notice you call leaveWhitespace on several of your parse elements, so
you may have to rstrip() the input text before calling parseString. I
am curious whether leaveWhitespace is really necessary for your
grammar. If it is, you can usually just call leaveWhitespace on the
root element, and this will propagate to all the sub elements.

Yeah, sorry, I was still messing around with that part of the code. My
problem is that I have to differentiate between:

(NP -x-y)

and:

(NP-x -y)

I'm doing this now using Combine. Does that seem right?
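
For readers following along, here is a minimal sketch of what Combine changes, assuming a toy tag of the form word-dash-word rather than the full grammar. The names loose_tag and tight_tag are made up for this illustration.

```python
import pyparsing as pp

dash = pp.Literal("-")
name = pp.Word(pp.alphas)

loose_tag = name + dash + name              # whitespace may separate the pieces
tight_tag = pp.Combine(name + dash + name)  # pieces must be adjacent

print(loose_tag.parseString("NP -x"))  # matches, three separate tokens
print(tight_tag.parseString("NP-x"))   # matches, one joined token

try:
    tight_tag.parseString("NP -x")
except pp.ParseException:
    print("Combine rejects 'NP -x' because of the space")
```

Combine also joins the matched pieces into a single string, so any separators that should not appear in the result need to be suppressed inside it.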

Paul said:
Lastly, you may get caught up with operator precedence, I think your
node assignment statement may need to change from
node << start + (branch_node | leaf_node) + end
to
node << (start + (branch_node | leaf_node) + end)

I think I'm okay:

py> 2 << 1 + 2
16
py> (2 << 1) + 2
6
py> 2 << (1 + 2)
16

Thanks for the help!

STeVe
 
Paul McGuire

Steve -

If your word char set is just alphanums+"-", then this will work
without doing anything unnatural with leaveWhitespace:

from pyparsing import *

thing = Word(alphanums+"-")
LPAREN = Literal("(").suppress()
RPAREN = Literal(")").suppress()
node = LPAREN + OneOrMore(thing) + RPAREN

print node.parseString("(NP -x-y)")
print node.parseString("(NP-x -y)")

will print:

['NP', '-x-y']
['NP-x', '-y']


Your examples helped me to see what my operator precedence concern was.
Fortunately, your usage was an And, composed using '+' operators. If
your construct was a MatchFirst, composed using '|' operators, things
aren't so pretty:

print 2 << 1 | 3
print 2 << (1 | 3)

7
16

So I've just gotten into the habit of parenthesizing anything I load
into a Forward using '<<'.
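
The grouping can be checked without pyparsing at all. The sketch below uses the standard library's ast module to report which operator ends up at the top of each expression; the helper name top_level_operator and the placeholder identifiers are made up for illustration.

```python
import ast


def top_level_operator(expression):
    """Return the name of the operator Python applies last in `expression`."""
    tree = ast.parse(expression, mode="eval")
    return type(tree.body.op).__name__


# '+' binds tighter than '<<', so the whole sum is loaded into node.
print(top_level_operator("node << start + body + end"))  # LShift

# '|' binds looser than '<<', so only `branch` is loaded into node.
print(top_level_operator("node << branch | leaf"))       # BitOr

# Parentheses restore the intended grouping.
print(top_level_operator("node << (branch | leaf)"))     # LShift
```

Later pyparsing releases also accept node <<= expr for loading a Forward, which avoids the problem because augmented assignment evaluates the entire right-hand side first.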

-- Paul
 
Steven Bethard

Paul said:
If your word char set is just alphanums+"-", then this will work
without doing anything unnatural with leaveWhitespace:

from pyparsing import *

thing = Word(alphanums+"-")
LPAREN = Literal("(").suppress()
RPAREN = Literal(")").suppress()
node = LPAREN + OneOrMore(thing) + RPAREN

print node.parseString("(NP -x-y)")
print node.parseString("(NP-x -y)")

will print:

['NP', '-x-y']
['NP-x', '-y']

I actually need to break these into:

['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

I know the dict syntax afterwards isn't quite what pyparsing would
output, but hopefully my intent is clear. I need to use the dict-style
results from setResultsName() calls because in the full grammar, I have
a lot of optional elements. For example:

(NP-1 -a)
--> {'tag':'NP', 'id':'1', 'word':'-a'}
(NP-x-2 -B)
--> {'tag':'NP', 'functions':['x'], 'id':'2', 'word':'-B'}
(NP-x-y=2-3 -4)
--> {'tag':'NP', 'functions':['x', 'y'], 'coord':'2', 'id':'3',
'word':'-4'}
(-NONE- x)
--> {'tag':None, 'word':'x'}



STeVe

P.S. In case you're curious, here's my current draft of the code:

# some character classes
printables_trans = _pp.printables.translate
word_chars = printables_trans(_id_trans, '()')
word_elem = _pp.Word(word_chars)
syn_chars = printables_trans(_id_trans, '()-=')
syn_word = _pp.Word(syn_chars)
func_chars = printables_trans(_id_trans, '()-=0123456789')
func_word = _pp.Word(func_chars)
num_word = _pp.Word(_pp.nums)

# tag separators
dash = _pp.Literal('-')
tag_sep = dash.suppress()
coord_sep = _pp.Literal('=').suppress()

# tag types (use Combine to guarantee no spaces)
special_tag = _pp.Combine(dash + syn_word + dash)
syn_tag = syn_word
func_tags = _pp.ZeroOrMore(_pp.Combine(tag_sep + func_word))
coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
id_tag = _pp.Optional(_pp.Combine(tag_sep + num_word))

# give tag types result names
special_tag = special_tag.setResultsName('tag')
syn_tag = syn_tag.setResultsName('tag')
func_tags = func_tags.setResultsName('funcs')
coord_tag = coord_tag.setResultsName('coord')
id_tag = id_tag.setResultsName('id')

# combine tag types into a tags element
normal_tags = syn_tag + func_tags + coord_tag + id_tag
tags = special_tag | _pp.Combine(normal_tags)
def get_tag(orig_string, tokens_start, tokens):
    tokens = dict(tokens)
    tag = tokens.pop('tag')
    if tag == '-NONE-':
        tag = None
    functions = list(tokens.pop('funcs', []))
    coord = tokens.pop('coord', None)
    id = tokens.pop('id', None)
    return [dict(tag=tag, functions=functions,
                 coord=coord, id=id)]
tags.setParseAction(get_tag)

# node parentheses
start = _pp.Literal('(').suppress()
end = _pp.Literal(')').suppress()

# words
word = word_elem.setResultsName('word')

# leaf nodes
leaf_node = tags + _pp.Optional(word)
def get_leaf_node(orig_string, tokens_start, tokens):
    try:
        tag_dict, word = tokens
        word = cls._unescape(word)
    except ValueError:
        tag_dict, = tokens
        word = None
    return cls(word=word, **tag_dict)
leaf_node.setParseAction(get_leaf_node)

# node, recursive
node = _pp.Forward()

# branch nodes
branch_node = tags + _pp.OneOrMore(node)
def get_branch_node(orig_string, tokens_start, tokens):
    return cls(children=tokens[1:], **tokens[0])
branch_node.setParseAction(get_branch_node)

# node, recursive
node << start + (branch_node | leaf_node) + end

# root node may have additional parentheses
root_node = node | start + node + end
root_nodes = _pp.OneOrMore(root_node)

# make sure nodes start and end string
str_start = _pp.StringStart()
str_end = _pp.StringEnd()
cls._root_node = str_start + root_node + str_end
cls._root_nodes = str_start + root_nodes + str_end
 
Steven Bethard

Steven said:
I have to differentiate between:
(NP -x-y)
and:
(NP-x -y)
I'm doing this now using Combine. Does that seem right?

Paul said:
If your word char set is just alphanums+"-", then this will work
without doing anything unnatural with leaveWhitespace:

from pyparsing import *

thing = Word(alphanums+"-")
LPAREN = Literal("(").suppress()
RPAREN = Literal(")").suppress()
node = LPAREN + OneOrMore(thing) + RPAREN

print node.parseString("(NP -x-y)")
print node.parseString("(NP-x -y)")

will print:

['NP', '-x-y']
['NP-x', '-y']


Steven said:
I actually need to break these into:

['NP', '-x-y'] {'tag':'NP', 'word':'-x-y'}
['NP', 'x', 'y'] {'tag':'NP', 'functions':['x'], 'word':'y'}

Oops, sorry, the last line should have been:

['NP', 'x', '-y'] {'tag':'NP', 'functions':['x'], 'word':'-y'}

Sorry to introduce confusion into an already confusing parsing problem. ;)

STeVe
 
Paul McGuire

Steve -

Wow, this is a pretty dense pyparsing program. You are really pushing
the envelope in your use of ParseResults, dicts, etc., but pretty much
everything seems to be working.

I still don't know the BNF you are working from, but here are some
other "shots in the dark":

1. I'm surprised func_word does not permit numbers anywhere in the
body. Is this just a feature you have not implemented yet? As long as
func_word does not start with a digit, you can still define one
unambiguously to allow numbers after the first character if you define
func_word as

func_word = _pp.Word(func_chars,func_chars+_pp.nums)

Perhaps similar for syn_word as well.

2. Is coord an optional sub-element of a func? If so, you might want
to group them so that they stay together, something like:

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word))
func_tags = _pp.ZeroOrMore(_pp.Group(tag_sep + func_word+coord_tag))

You might also add a default value for coord_tag if none is supplied,
to simplify your parse action?

coord_tag = _pp.Optional(_pp.Combine(coord_sep + num_word),None)

Now the coords and funcs will be kept together.

3. Of course, you are correct in using Combine to ensure that you only
accept adjacent characters. But you only need to use it at the
outermost level.

4. You can use several dict-like functions directly on a ParseResults
object, such as keys(), items(), values(), in, etc. Also, the []
notation and the .attribute notation are nearly identical, except that
[] refs on a missing element will raise a KeyError, .attribute will
always return something. For instance, in your example, the get_tag()
parse action uses dict.pop() to extract the 'coord' field. If coord is
present, you could retrieve it using "tokens['coord']" or
"tokens.coord". If coord is missing, "tokens['coord']" will raise a
KeyError, but tokens.coord will return an empty string. If you need to
"listify" a ParseResults, try calling asList().

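Suggestions 1, 2 and 4 above can be tried in isolation. This is a sketch on toy expressions, not the grammar from the thread: the character sets, the default value "0" and the variable names are assumptions chosen for illustration.

```python
import pyparsing as pp

# Suggestion 1: Word(init_chars, body_chars) restricts only the first character.
func_word = pp.Word(pp.alphas, pp.alphanums)
print(func_word.parseString("x2")[0])

# Suggestion 2: the second argument to Optional is emitted when nothing matches.
coord = pp.Optional(pp.Word(pp.nums), "0")
print((pp.Word(pp.alphas) + coord).parseString("NP").asList())

# Suggestion 4: dict-style versus attribute-style access on ParseResults.
tag = pp.Word(pp.alphas).setResultsName("tag")
node_id = pp.Optional(pp.Word(pp.nums)).setResultsName("id")
result = (tag + node_id).parseString("NP")

print(result["tag"])    # NP
print("id" in result)   # False: the optional element did not match
print(repr(result.id))  # '' rather than an exception
try:
    result["id"]
except KeyError:
    print("result['id'] raises KeyError")
print(result.asList())  # plain Python list of the tokens
```

The attribute form is convenient inside a parse action with many optional parts, as long as an empty string is an acceptable stand-in for "missing".
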

It's not clear to me what if any further help you are looking for, now
that your initial question (about StringEnd()) has been answered. But
please let us know how things work out.

-- Paul
 
Steven Bethard

Paul said:
I still don't know the BNF you are working from

Just to satisfy any curiosity you might have, it's the Penn TreeBank
format: http://www.cis.upenn.edu/~treebank/
(Except that the actual Penn Treebank data unfortunately differs from
the format spec in a few ways.)

Paul said:
1. I'm surprised func_word does not permit numbers anywhere in the
body. Is this just a feature you have not implemented yet? As long as
func_word does not start with a digit, you can still define one
unambiguously to allow numbers after the first character if you define
func_word as

func_word = _pp.Word(func_chars,func_chars+_pp.nums)

Ahh, very nice. The spec's vague, but this is probably what I want to do.

Paul said:
2. Is coord an optional sub-element of a func?

No, functions, coord and id are optional sub-elements of the tags string.

Paul said:
You might also add a default value for coord_tag if none is supplied,
to simplify your parse action?

Oh, that's nice. I missed that functionality.

Paul said:
It's not clear to me what if any further help you are looking for, now
that your initial question (about StringEnd()) has been answered.

Yes, thanks, you definitely answered the initial question. And your
followup commentary was also very helpful. Thanks again!

STeVe
 
