Huffman coding and Parse::RecDescent

Jon Ericson · Apr 22, 2004

I've been playing with Parse::RecDescent lately and I've run into an
annoyance. I'd appreciate some advice about how to deal with it.

Take, for an example, this abbreviated grammar for reading a baseball
score-sheet:

events: play(s) /\z/

play: home_run | caught_stealing | hit_by_pitch | catchers_interference
| <error>

home_run: 'H' | 'HR'
caught_stealing: 'CS'
hit_by_pitch: 'HP'
catchers_interference: 'C'

Obviously, there are bugs in the grammar. For instance, it fails when
given 'HP' or 'HR', since the first production in the home_run rule
greedily gobbles up the initial 'H'. It isn't that hard to fix:

events: play(s) /\z/

play: caught_stealing | catchers_interference | hit_by_pitch | home_run
| <error>

caught_stealing: 'CS'
catchers_interference: 'C'
hit_by_pitch: 'HP'
home_run: 'HR' | 'H'

This works and it's easy to maintain, but it's massively inefficient.
Home runs are substantially more common than hit batsmen, but because
the code for hit by pitch is longer than one of the home run codes,
you have to try it first. That means a lot of expensive backtracking,
which is criminal in this sort of data. It's trivial to order
baseball plays by frequency.

Notice that Huffman coding hurts us here. The most common plays in
baseball have the shortest codes, in general. Cather's interference
and caught stealing are the exceptions that prove the rule.
Fortunately, we can do negative lookahead and preserve the original
ordering:

home_run: 'HR' |'H' ...!'P'
{$item[1]}

The action explicitly returns the first item, otherwise we get
whatever was returned by the negative lookahead when 'H' matches. I'm
not sure if this is a bug, since it doesn't seem to be documented.
Maybe I should use regexp lookahead instead?

But this is harder to maintain. The next time you add a play, you
have to go back and make sure it doesn't interfere with one of the
other productions. Or if you reorder the productions to squeeze a bit
more performance out of the grammar, you have to audit the rule.

How have other people dealt with this?

Thanks,
Jon

Uri Guttman · Apr 22, 2004

JE> events: play(s) /\z/

JE> play: home_run | caught_stealing | hit_by_pitch | catchers_interference
JE> | <error>

JE> home_run: 'H' | 'HR'
JE> caught_stealing: 'CS'
JE> hit_by_pitch: 'HP'
JE> catchers_interference: 'C'

JE> Obviously, there are bugs in the grammar. For instance, it fails when
JE> given 'HP' or 'HR', since the first production in the home_run rule
JE> greedily gobbles up the initial 'H'. It isn't that hard to fix:

two thoughts (i have used P:RD but i am no expert in it). first, you can
use regexes instead of strings and then use anchors/assertions of some
sort:

home_run: /\bH\b/ | /\bHR\b/

also i was under the impression that basic P:RD will split input on
white space so the home_run production should only match H when the
token is H.

all this is conjecture and hopefully the damian will see this and jump
in to correct me. will someone please turn on the damian signal lamp?

uri

Jon Ericson · Apr 22, 2004

Uri Guttman said:
two thoughts (i have used P:RD but i am no expert in it). first, you can
use regexes instead of strings and then use anchors/assertions of some
sort:

home_run: /\bH\b/ | /\bHR\b/

also i was under the impression that basic P:RD will split input on
white space so the home_run production should only match H when the
token is H.

Right. The problem is this data is *not* split into tokens by white
space. For instance `H8' is the code for an inside the park home run
fielded by the center fielder. Sorry I didn't make that clear.

Someone wrote an email with the suggestion:

home_run: /H(?![A-Z])/ | 'HR'

That does the trick, I think.

I'm not sure Parse::RecDescent is the right tool for the job, but it
sure is fun to play around with.

Thanks,
Jon

Uri Guttman · Apr 22, 2004

JE> Right. The problem is this data is *not* split into tokens by white
JE> space. For instance `H8' is the code for an inside the park home run
JE> fielded by the center fielder. Sorry I didn't make that clear.

well, i can't help much with poor specs!

JE> Someone wrote an email with the suggestion:

JE> home_run: /H(?![A-Z])/ | 'HR'

JE> That does the trick, I think.

looks ok. if you gave a complete spec then we could help more.

JE> I'm not sure Parse::RecDescent is the right tool for the job, but it
JE> sure is fun to play around with.

sounds like overkill for this. is it all just one line? P:RD is meant
for real grammars and such and what you have seems like a simple string
with some tricky parsing. a decent set of regexes could do that but a
proper and full spec is needed.

uri

Jon Ericson · Apr 23, 2004

Uri Guttman said:
JE> Someone wrote an email with the suggestion:

JE> home_run: /H(?![A-Z])/ | 'HR'

JE> That does the trick, I think.

looks ok. if you gave a complete spec then we could help more.

Ok. http://www.retrosheet.org/datause.txt is a close to a spec as
exists. There are quite a few things that are not documented. I'm
looking at item 6. a., but that is fairly incomplete. I've been
writing the grammar by trial and error. (Emphasis on error.

Mostly, however, this is an exercise in learning. I suppose this
example is an aberration, but it really bothered me that adding an
uncommon case broke the frequent case.

Jon

Parse::RecDescent and unordered rules	1	Aug 19, 2005
huffman encoder	16	Dec 20, 2005
[SUMMARY] Cows and Bulls (#32)	1	May 19, 2005
urgent opening for QA Engineer in San Francisco, CA- 1 year contract	0	Jan 9, 2009
Treetop Email Parser	13	Jan 28, 2008
Possible bug with Range#include?	6	Jun 8, 2007
ANN: 'rex', a module for easy creation and use of regular expressions	0	Jun 10, 2004
comp.lang.c Answers (Abridged) to Frequently Asked Questions (FAQ)	0	Mar 1, 2008

Huffman coding and Parse::RecDescent

Jon Ericson

Uri Guttman

Jon Ericson

Uri Guttman

Jon Ericson

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads