Huffman coding and Parse::RecDescent

Discussion in 'Perl Misc' started by Jon Ericson, Apr 22, 2004.

  1. Jon Ericson

    Jon Ericson Guest

    I've been playing with Parse::RecDescent lately and I've run into an
    annoyance. I'd appreciate some advice about how to deal with it.

    Take, for an example, this abbreviated grammar for reading a baseball
    score-sheet:

    events: play(s) /\z/

    play: home_run | caught_stealing | hit_by_pitch | catchers_interference
    | <error>

    home_run: 'H' | 'HR'
    caught_stealing: 'CS'
    hit_by_pitch: 'HP'
    catchers_interference: 'C'

    Obviously, there are bugs in the grammar. For instance, it fails when
    given 'HP' or 'HR', since the first production in the home_run rule
    greedily gobbles up the initial 'H'. It isn't that hard to fix:

    events: play(s) /\z/

    play: caught_stealing | catchers_interference | hit_by_pitch | home_run
    | <error>

    caught_stealing: 'CS'
    catchers_interference: 'C'
    hit_by_pitch: 'HP'
    home_run: 'HR' | 'H'

    This works and it's easy to maintain, but it's massively inefficient.
    Home runs are substantially more common than hit batsmen, but because
    the code for hit by pitch is longer than one of the home run codes,
    you have to try it first. That means a lot of expensive backtracking,
    which is criminal in this sort of data. It's trivial to order
    baseball plays by frequency.

    Notice that Huffman coding hurts us here. The most common plays in
    baseball have the shortest codes, in general. Cather's interference
    and caught stealing are the exceptions that prove the rule.
    Fortunately, we can do negative lookahead and preserve the original
    ordering:

    home_run: 'HR' |'H' ...!'P'
    {$item[1]}

    The action explicitly returns the first item, otherwise we get
    whatever was returned by the negative lookahead when 'H' matches. I'm
    not sure if this is a bug, since it doesn't seem to be documented.
    Maybe I should use regexp lookahead instead?

    But this is harder to maintain. The next time you add a play, you
    have to go back and make sure it doesn't interfere with one of the
    other productions. Or if you reorder the productions to squeeze a bit
    more performance out of the grammar, you have to audit the rule.

    How have other people dealt with this?

    Thanks,
    Jon
    Jon Ericson, Apr 22, 2004
    #1
    1. Advertising

  2. Jon Ericson

    Uri Guttman Guest

    >>>>> "JE" == Jon Ericson <> writes:

    JE> events: play(s) /\z/

    JE> play: home_run | caught_stealing | hit_by_pitch | catchers_interference
    JE> | <error>

    JE> home_run: 'H' | 'HR'
    JE> caught_stealing: 'CS'
    JE> hit_by_pitch: 'HP'
    JE> catchers_interference: 'C'

    JE> Obviously, there are bugs in the grammar. For instance, it fails when
    JE> given 'HP' or 'HR', since the first production in the home_run rule
    JE> greedily gobbles up the initial 'H'. It isn't that hard to fix:

    two thoughts (i have used P:RD but i am no expert in it). first, you can
    use regexes instead of strings and then use anchors/assertions of some
    sort:

    home_run: /\bH\b/ | /\bHR\b/

    also i was under the impression that basic P:RD will split input on
    white space so the home_run production should only match H when the
    token is H.

    all this is conjecture and hopefully the damian will see this and jump
    in to correct me. will someone please turn on the damian signal lamp? :)

    uri
    Uri Guttman, Apr 22, 2004
    #2
    1. Advertising

  3. Jon Ericson

    Jon Ericson Guest

    Uri Guttman <> writes:

    > two thoughts (i have used P:RD but i am no expert in it). first, you can
    > use regexes instead of strings and then use anchors/assertions of some
    > sort:
    >
    > home_run: /\bH\b/ | /\bHR\b/
    >
    > also i was under the impression that basic P:RD will split input on
    > white space so the home_run production should only match H when the
    > token is H.


    Right. The problem is this data is *not* split into tokens by white
    space. For instance `H8' is the code for an inside the park home run
    fielded by the center fielder. Sorry I didn't make that clear.

    Someone wrote an email with the suggestion:

    home_run: /H(?![A-Z])/ | 'HR'

    That does the trick, I think.

    I'm not sure Parse::RecDescent is the right tool for the job, but it
    sure is fun to play around with. :)

    Thanks,
    Jon
    Jon Ericson, Apr 23, 2004
    #3
  4. Jon Ericson

    Uri Guttman Guest

    >>>>> "JE" == Jon Ericson <> writes:

    JE> Uri Guttman <> writes:
    >>
    >> home_run: /\bH\b/ | /\bHR\b/
    >>
    >> also i was under the impression that basic P:RD will split input on
    >> white space so the home_run production should only match H when the
    >> token is H.


    JE> Right. The problem is this data is *not* split into tokens by white
    JE> space. For instance `H8' is the code for an inside the park home run
    JE> fielded by the center fielder. Sorry I didn't make that clear.

    well, i can't help much with poor specs! :)

    JE> Someone wrote an email with the suggestion:

    JE> home_run: /H(?![A-Z])/ | 'HR'

    JE> That does the trick, I think.

    looks ok. if you gave a complete spec then we could help more.

    JE> I'm not sure Parse::RecDescent is the right tool for the job, but it
    JE> sure is fun to play around with. :)

    sounds like overkill for this. is it all just one line? P:RD is meant
    for real grammars and such and what you have seems like a simple string
    with some tricky parsing. a decent set of regexes could do that but a
    proper and full spec is needed.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Apr 23, 2004
    #4
  5. Jon Ericson

    Jon Ericson Guest

    Uri Guttman <> writes:

    >>>>>> "JE" == Jon Ericson <> writes:

    >
    > JE> Someone wrote an email with the suggestion:
    >
    > JE> home_run: /H(?![A-Z])/ | 'HR'
    >
    > JE> That does the trick, I think.
    >
    > looks ok. if you gave a complete spec then we could help more.


    Ok. http://www.retrosheet.org/datause.txt is a close to a spec as
    exists. There are quite a few things that are not documented. I'm
    looking at item 6. a., but that is fairly incomplete. I've been
    writing the grammar by trial and error. (Emphasis on error. :)

    Mostly, however, this is an exercise in learning. I suppose this
    example is an aberration, but it really bothered me that adding an
    uncommon case broke the frequent case.

    Jon
    Jon Ericson, Apr 23, 2004
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. jean-gert nesselbosch

    XPath1.0-grammar compatible to Parse::RecDescent

    jean-gert nesselbosch, May 12, 2006, in forum: XML
    Replies:
    0
    Views:
    403
    jean-gert nesselbosch
    May 12, 2006
  2. ccm news
    Replies:
    0
    Views:
    3,055
    ccm news
    Jan 15, 2009
  3. Lex Williams

    Alternative to Parse::RecDescent

    Lex Williams, Aug 26, 2008, in forum: Ruby
    Replies:
    4
    Views:
    190
    James Gray
    Aug 27, 2008
  4. 88888 Dihedral
    Replies:
    4
    Views:
    569
    88888 Dihedral
    Feb 28, 2012
  5. Trond Michelsen

    Parse::RecDescent and unordered rules

    Trond Michelsen, Aug 19, 2005, in forum: Perl Misc
    Replies:
    1
    Views:
    91
    Dave Weaver
    Aug 22, 2005
Loading...

Share This Page