J
Jon Ericson
I've been playing with Parse::RecDescent lately and I've run into an
annoyance. I'd appreciate some advice about how to deal with it.
Take, for an example, this abbreviated grammar for reading a baseball
score-sheet:
events: play(s) /\z/
play: home_run | caught_stealing | hit_by_pitch | catchers_interference
| <error>
home_run: 'H' | 'HR'
caught_stealing: 'CS'
hit_by_pitch: 'HP'
catchers_interference: 'C'
Obviously, there are bugs in the grammar. For instance, it fails when
given 'HP' or 'HR', since the first production in the home_run rule
greedily gobbles up the initial 'H'. It isn't that hard to fix:
events: play(s) /\z/
play: caught_stealing | catchers_interference | hit_by_pitch | home_run
| <error>
caught_stealing: 'CS'
catchers_interference: 'C'
hit_by_pitch: 'HP'
home_run: 'HR' | 'H'
This works and it's easy to maintain, but it's massively inefficient.
Home runs are substantially more common than hit batsmen, but because
the code for hit by pitch is longer than one of the home run codes,
you have to try it first. That means a lot of expensive backtracking,
which is criminal in this sort of data. It's trivial to order
baseball plays by frequency.
Notice that Huffman coding hurts us here. The most common plays in
baseball have the shortest codes, in general. Cather's interference
and caught stealing are the exceptions that prove the rule.
Fortunately, we can do negative lookahead and preserve the original
ordering:
home_run: 'HR' |'H' ...!'P'
{$item[1]}
The action explicitly returns the first item, otherwise we get
whatever was returned by the negative lookahead when 'H' matches. I'm
not sure if this is a bug, since it doesn't seem to be documented.
Maybe I should use regexp lookahead instead?
But this is harder to maintain. The next time you add a play, you
have to go back and make sure it doesn't interfere with one of the
other productions. Or if you reorder the productions to squeeze a bit
more performance out of the grammar, you have to audit the rule.
How have other people dealt with this?
Thanks,
Jon
annoyance. I'd appreciate some advice about how to deal with it.
Take, for an example, this abbreviated grammar for reading a baseball
score-sheet:
events: play(s) /\z/
play: home_run | caught_stealing | hit_by_pitch | catchers_interference
| <error>
home_run: 'H' | 'HR'
caught_stealing: 'CS'
hit_by_pitch: 'HP'
catchers_interference: 'C'
Obviously, there are bugs in the grammar. For instance, it fails when
given 'HP' or 'HR', since the first production in the home_run rule
greedily gobbles up the initial 'H'. It isn't that hard to fix:
events: play(s) /\z/
play: caught_stealing | catchers_interference | hit_by_pitch | home_run
| <error>
caught_stealing: 'CS'
catchers_interference: 'C'
hit_by_pitch: 'HP'
home_run: 'HR' | 'H'
This works and it's easy to maintain, but it's massively inefficient.
Home runs are substantially more common than hit batsmen, but because
the code for hit by pitch is longer than one of the home run codes,
you have to try it first. That means a lot of expensive backtracking,
which is criminal in this sort of data. It's trivial to order
baseball plays by frequency.
Notice that Huffman coding hurts us here. The most common plays in
baseball have the shortest codes, in general. Cather's interference
and caught stealing are the exceptions that prove the rule.
Fortunately, we can do negative lookahead and preserve the original
ordering:
home_run: 'HR' |'H' ...!'P'
{$item[1]}
The action explicitly returns the first item, otherwise we get
whatever was returned by the negative lookahead when 'H' matches. I'm
not sure if this is a bug, since it doesn't seem to be documented.
Maybe I should use regexp lookahead instead?
But this is harder to maintain. The next time you add a play, you
have to go back and make sure it doesn't interfere with one of the
other productions. Or if you reorder the productions to squeeze a bit
more performance out of the grammar, you have to audit the rule.
How have other people dealt with this?
Thanks,
Jon