Huffman coding and Parse::RecDescent

J

Jon Ericson

I've been playing with Parse::RecDescent lately and I've run into an
annoyance. I'd appreciate some advice about how to deal with it.

Take, for an example, this abbreviated grammar for reading a baseball
score-sheet:

events: play(s) /\z/

play: home_run | caught_stealing | hit_by_pitch | catchers_interference
| <error>

home_run: 'H' | 'HR'
caught_stealing: 'CS'
hit_by_pitch: 'HP'
catchers_interference: 'C'

Obviously, there are bugs in the grammar. For instance, it fails when
given 'HP' or 'HR', since the first production in the home_run rule
greedily gobbles up the initial 'H'. It isn't that hard to fix:

events: play(s) /\z/

play: caught_stealing | catchers_interference | hit_by_pitch | home_run
| <error>

caught_stealing: 'CS'
catchers_interference: 'C'
hit_by_pitch: 'HP'
home_run: 'HR' | 'H'

This works and it's easy to maintain, but it's massively inefficient.
Home runs are substantially more common than hit batsmen, but because
the code for hit by pitch is longer than one of the home run codes,
you have to try it first. That means a lot of expensive backtracking,
which is criminal in this sort of data. It's trivial to order
baseball plays by frequency.

Notice that Huffman coding hurts us here. The most common plays in
baseball have the shortest codes, in general. Cather's interference
and caught stealing are the exceptions that prove the rule.
Fortunately, we can do negative lookahead and preserve the original
ordering:

home_run: 'HR' |'H' ...!'P'
{$item[1]}

The action explicitly returns the first item, otherwise we get
whatever was returned by the negative lookahead when 'H' matches. I'm
not sure if this is a bug, since it doesn't seem to be documented.
Maybe I should use regexp lookahead instead?

But this is harder to maintain. The next time you add a play, you
have to go back and make sure it doesn't interfere with one of the
other productions. Or if you reorder the productions to squeeze a bit
more performance out of the grammar, you have to audit the rule.

How have other people dealt with this?

Thanks,
Jon
 
U

Uri Guttman

JE> events: play(s) /\z/

JE> play: home_run | caught_stealing | hit_by_pitch | catchers_interference
JE> | <error>

JE> home_run: 'H' | 'HR'
JE> caught_stealing: 'CS'
JE> hit_by_pitch: 'HP'
JE> catchers_interference: 'C'

JE> Obviously, there are bugs in the grammar. For instance, it fails when
JE> given 'HP' or 'HR', since the first production in the home_run rule
JE> greedily gobbles up the initial 'H'. It isn't that hard to fix:

two thoughts (i have used P:RD but i am no expert in it). first, you can
use regexes instead of strings and then use anchors/assertions of some
sort:

home_run: /\bH\b/ | /\bHR\b/

also i was under the impression that basic P:RD will split input on
white space so the home_run production should only match H when the
token is H.

all this is conjecture and hopefully the damian will see this and jump
in to correct me. will someone please turn on the damian signal lamp? :)

uri
 
J

Jon Ericson

Uri Guttman said:
two thoughts (i have used P:RD but i am no expert in it). first, you can
use regexes instead of strings and then use anchors/assertions of some
sort:

home_run: /\bH\b/ | /\bHR\b/

also i was under the impression that basic P:RD will split input on
white space so the home_run production should only match H when the
token is H.

Right. The problem is this data is *not* split into tokens by white
space. For instance `H8' is the code for an inside the park home run
fielded by the center fielder. Sorry I didn't make that clear.

Someone wrote an email with the suggestion:

home_run: /H(?![A-Z])/ | 'HR'

That does the trick, I think.

I'm not sure Parse::RecDescent is the right tool for the job, but it
sure is fun to play around with. :)

Thanks,
Jon
 
U

Uri Guttman

JE> Right. The problem is this data is *not* split into tokens by white
JE> space. For instance `H8' is the code for an inside the park home run
JE> fielded by the center fielder. Sorry I didn't make that clear.

well, i can't help much with poor specs! :)

JE> Someone wrote an email with the suggestion:

JE> home_run: /H(?![A-Z])/ | 'HR'

JE> That does the trick, I think.

looks ok. if you gave a complete spec then we could help more.

JE> I'm not sure Parse::RecDescent is the right tool for the job, but it
JE> sure is fun to play around with. :)

sounds like overkill for this. is it all just one line? P:RD is meant
for real grammars and such and what you have seems like a simple string
with some tricky parsing. a decent set of regexes could do that but a
proper and full spec is needed.

uri
 
J

Jon Ericson

Uri Guttman said:
JE> Someone wrote an email with the suggestion:

JE> home_run: /H(?![A-Z])/ | 'HR'

JE> That does the trick, I think.

looks ok. if you gave a complete spec then we could help more.

Ok. http://www.retrosheet.org/datause.txt is a close to a spec as
exists. There are quite a few things that are not documented. I'm
looking at item 6. a., but that is fairly incomplete. I've been
writing the grammar by trial and error. (Emphasis on error. :)

Mostly, however, this is an exercise in learning. I suppose this
example is an aberration, but it really bothered me that adding an
uncommon case broke the frequent case.

Jon
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,901
Latest member
Noble71S45

Latest Threads

Top