parser performance comparisons

Discussion in 'Ruby' started by Eric Mahurin, Nov 4, 2005.

  1. Eric Mahurin


    Jim Freeze was curious how Grammar (an LL parser) compares to RACC (an LR
    parser). I was a little curious too... As a test case, I compared against the
    racc calc.y parser (a simple expression calculator) on a huge 2.6 MB
    expression (randomly generated by a script). Since ruby can evaluate this
    expression directly (I used load), I threw it into the mix as well. Here are
    the results:

    parser   lexer    user   vmsize    notes
    -------  -------  -----  --------  -----
    ruby     ruby      0.8s   30.6MB   -
    racc     Regexp   33.4s  278.2MB   w/o racc C extension
    racc     Regexp   15.2s  142.1MB   w/ racc C extension
    Grammar  Grammar  22.1s   11.4MB   multi-threaded (300 token buf)
    Grammar  Grammar  20.9s   11.4MB   multi-threaded, inlined
    Grammar  Grammar  21.1s   11.3MB   one token at a time
    Grammar  Grammar  20.1s   11.4MB   one token at a time, inlined
    Grammar  Regexp   15.6s   76.0MB   -
    Grammar  Regexp   13.8s   74.4MB   inlined

    I think the best apples-to-apples comparison is racc with its Regexp lexer
    and no C extension (mine is pure ruby) vs. my Grammar parser with a similar
    Regexp lexer. You can get a little more speed by inlining your actions
    with Grammar; unfortunately, that makes the code less readable because you
    have to put your actions in strings instead of simple blocks. Under those
    circumstances my parser is more than twice as fast. racc's C extension gets
    it back on par with my pure-ruby solution. I think the memory usage is so
    high for the racc solutions because with racc you typically generate all
    the tokens up front (in memory). Every racc example I found did this; I'm
    not sure why.

    I only showed the Regexp lexer solution because that is what the racc
    examples use. For a Grammar parser, I recommend using a Grammar lexer. A
    Regexp lexer isn't as readable or flexible, and most importantly, a Regexp
    is meant to work with a String, not an IO. Because Regexps work with
    Strings, all of the racc examples I found read the IO into a String first.
    That is why the memory usage is so much higher for the Regexp lexers above.
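    That String-only constraint can be seen in a small sketch (a hypothetical
    calculator lexer using StringScanner, not the actual racc example code):
    the entire IO has to be slurped into memory before any Regexp matching can
    start.

```ruby
require 'strscan'
require 'stringio'

# Illustrative only: the whole input must be read into one String before
# Regexp/StringScanner can touch it - this is what drives memory usage up
# for large inputs.
def tokenize(io)
  source = io.read                 # slurp the entire IO into memory
  s = StringScanner.new(source)
  tokens = []
  until s.eos?
    if s.skip(/\s+/)               # skip whitespace
      next
    elsif (num = s.scan(/\d+/))
      tokens << [:NUMBER, num.to_i]
    elsif (op = s.scan(%r{[-+*/()]}))
      tokens << [op, op]
    else
      raise "unexpected character: #{s.rest[0]}"
    end
  end
  tokens << [false, false]         # racc's conventional end-of-input pair
  tokens
end

tokens = tokenize(StringIO.new("1 + (2 * 3)"))
```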

    BTW, I did this on my experimental code. Hopefully this will become version
    0.6 in a few weeks.

    Eric Mahurin, Nov 4, 2005
    #1

  2. Phil Tomson

    In article <>,
    Eric Mahurin <> wrote:
    > [post #1 quoted in full - snipped]


    It's been a while since I looked into the LL(k) vs. LALR issues (LL(k) in
    the case of ANTLR and LALR in the case of YACC), but isn't the calc.y
    grammar pretty simple, and thus maybe not a good way to compare the two?
    Specifically, did you need any lookahead?

    A better comparison would be to create a Ruby grammar...
    You're working on that, right? ;-)

    Phil
    Phil Tomson, Nov 4, 2005
    #2

  3. Jim Freeze

    On 11/4/05, Phil Tomson <> wrote:
    > A better comparison would be to create a Ruby grammar...


    Other comparisons may be needed, but this one is very important.
    It shows that Grammar can compete with RACC in terms of speed,
    which is important if you are the new kid on the block.

    What I would like to see is Grammar speed up a bit so that the
    one-token-at-a-time mode is as fast as Racc with the C extension. Then it
    would almost be a no-brainer to use Grammar instead of Racc.

    Good work Eric.



    --
    Jim Freeze
    Jim Freeze, Nov 5, 2005
    #3
  4. Eric Mahurin


    On 11/5/05, Jim Freeze <> wrote:
    >
    > On 11/4/05, Phil Tomson <> wrote:
    > > A better comparison would be to create a Ruby grammar...



    And compare to what? We have no ruby parsers in ruby.

    I would like to compare against something else too. If anybody has other
    suggestions, here's what I would need for a comparison:

    - a racc (or even a working rockit) grammar
    - a large test file that takes at least several seconds to parse
    - hopefully something simple enough that it won't take much time to convert
    to Grammar (I don't want it to distract from my ruby parser much)

    > Other comparisons may be needed, but this one is very important.
    > It says that Grammar can play equally with RACC in terms of speed,
    > which is important if you are the new kid on the block.
    >
    > What I would like to see is grammar speed up a bit so that the
    > one-token-at-a-time
    > is as fast as Racc with the C extension. Then it would almost be a no
    > brainer to use Grammar instead of Racc.



    Why is the one-token-at-a-time Grammar lexer important? If you already have
    a Regexp lexer (for RACC or whatever), it is very easy to make it a lexer
    for a Grammar parser - just make it an object that has a read1next method
    (like a Cursor), similar to next_token with yacc. Until I start generating
    C code, it will be hard to compete against a Regexp lexer, since Regexp
    does its work in C.
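    As a rough sketch of that adapter idea (the read1next name comes from the
    Cursor analogy above; the class and token shapes here are made up), an
    existing Regexp lexer can simply be wrapped in an object that lexes and
    returns one token per call:

```ruby
require 'strscan'

# Hypothetical adapter: wraps a Regexp lexer as an object with a
# read1next-style method, so each call produces just the next token
# instead of building the whole token stream up front.
class TokenCursor
  def initialize(source)
    @s = StringScanner.new(source)
  end

  # Return the next token, or nil at end of input.
  def read1next
    @s.skip(/\s+/)                 # skip whitespace between tokens
    return nil if @s.eos?
    if (num = @s.scan(/\d+/))
      [:NUMBER, num.to_i]
    elsif (op = @s.scan(%r{[-+*/()]}))
      [op, op]
    else
      raise "unexpected character: #{@s.rest[0]}"
    end
  end
end

cursor = TokenCursor.new("2 * 3")
```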

    If you are starting from scratch, though, the recommended way to make a
    lexer for a Grammar parser is the multi-threaded approach. This gives the
    most flexibility and better readability for complex lexers. With this
    approach, the Grammar of the lexer parses the whole file and generates all
    the tokens (instead of parsing just one token per call). Done this way,
    most lexers don't need to hold internal state. For example, if a string
    literal corresponds to multiple tokens, a token-at-a-time lexer would have
    to set a lexer state when it entered the string so that the next time it
    is called it would return a string-type token (and reset that state when
    it found the end of the string). With the lexer parsing all tokens, when
    it finds a string it just generates all the tokens for that string right
    there - no need for a lexer state.
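    The string example might look like this as a toy sketch (token names and
    structure are invented, not Grammar's actual API): because the lexer walks
    the whole input itself, it can emit every token of a string literal in one
    pass, with no "inside a string" flag to save between calls.

```ruby
require 'strscan'

# Toy whole-input lexer: yields tokens as it goes. When it sees a string
# literal it emits STRING_BEGIN/STRING_BODY/STRING_END right away - there
# is no per-call lexer state to save and restore.
def lex(source)
  s = StringScanner.new(source)
  until s.eos?
    if s.skip(/\s+/)
      next
    elsif s.scan(/"/)
      yield [:STRING_BEGIN, '"']
      yield [:STRING_BODY, s.scan(/[^"]*/)]
      s.scan(/"/) or raise "unterminated string"
      yield [:STRING_END, '"']
    elsif (num = s.scan(/\d+/))
      yield [:NUMBER, num.to_i]
    else
      raise "unexpected character: #{s.rest[0]}"
    end
  end
end

tokens = []
lex('12 "hi" 34') { |t| tokens << t }
```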

    But Grammar will get faster over time. Here are some of my ideas:

    - use ruby2c to convert my generated ruby code to C.
    - if that doesn't pan out, make another set of Grammar classes/methods
    (same API) that generate C code and inline it.
    - optimize the methods of the various Cursor classes that Grammar might use.

    Also keep in mind that with my Grammar stuff, parser generation is done at
    run-time (and is included in the numbers I gave). This should typically
    take a fraction of a second for most parsers. If it becomes an issue I
    will add the ability to dump the generated code to a file. For now I'd
    like to leave it as is, because not having to worry about a separate
    parser-generation stage, or how to integrate the generated file, makes
    the usability much higher: Grammar can be treated as just another library.

    > Good work Eric.


    Thanks, Jim.

    Eric Mahurin, Nov 5, 2005
    #4
  5. Phil Tomson

    In article <>,
    Eric Mahurin <> wrote:
    >On 11/5/05, Jim Freeze <> wrote:
    >>
    >> On 11/4/05, Phil Tomson <> wrote:
    >> > A better comparison would be to create a Ruby grammar...

    >
    >
    >And compare to what? We have no ruby parsers in ruby.


    Well, the suggestion wasn't entirely serious...



    Phil
    Phil Tomson, Nov 5, 2005
    #5
  6. Jim Freeze

    On 11/5/05, Eric Mahurin <> wrote:
    > Other comparisons may be needed, but this one is very important.
    > > It says that Grammar can play equally with RACC in terms of speed,
    > > which is important if you are the new kid on the block.
    > >
    > > What I would like to see is grammar speed up a bit so that the
    > > one-token-at-a-time
    > > is as fast as Racc with the C extension. Then it would almost be a no
    > > brainer
    > > to use Grammar instead of Racc.

    >
    > Why is the one-token-at-a-time Grammar lexer important? If you already have
    > a Regexp lexer (for RACC or whatever), it is very easy to make it a lexer
    > for a Grammar parser - just make it an object that has a read1next method
    > (like a Cursor) - similar to next_token with yacc. Until I start generating
    > C code, it will be hard to compete against a Regexp lexer since it does
    > stuff in C.


    Memory consumption.
    From the data that you showed, it looks like I should avoid
    Regexp if I am concerned about memory usage.

    > Good work Eric.
    >
    > Thanks, Jim.


    Don't mention it. :)

    --
    Jim Freeze
    Jim Freeze, Nov 5, 2005
    #6
  7. Eric Mahurin


    On 11/5/05, Jim Freeze <> wrote:
    >
    > On 11/5/05, Eric Mahurin <> wrote:
    > > [quote snipped]

    >
    > Memory consumption.
    > From the data that you showed, it looks like I should avoid
    > Regexp if I am concerned about memory usage.
    >


    Yes, but you have the same problem in racc (and more, since the typical
    usage is to generate the entire token stream in memory first). The Regexp
    lexers could be brought down to more reasonable memory levels if you read
    in a block (or line) at a time to match against. In this particular example
    a line at a time wouldn't help, since I put the entire 2.6 MB expression on
    one line (not realistic, of course). Reading a block (a fixed number of
    characters) or a line at a time, you'd have to deal with tokens that can
    span multiple blocks or lines. You'd lose some of the Regexp advantage, but
    probably not much. The racc memory usage could also be helped by not
    generating the entire token stream in memory first. I'm not sure why there
    are no racc examples with low memory usage (ones that don't read the entire
    file into a string and don't generate the entire token stream up front).
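    A block-at-a-time Regexp lexer along those lines could be sketched as
    follows (all names and the refill strategy here are assumptions for
    illustration): the scanner matches against a small buffer that is refilled
    from the IO, and a number that ends exactly at the buffer edge is pushed
    back and re-scanned after a refill, so tokens spanning block boundaries
    still come out whole.

```ruby
require 'strscan'
require 'stringio'

BLOCK = 16  # deliberately tiny block size so refills actually happen

def each_token(io)
  s = StringScanner.new("")
  loop do
    # top up the buffer from the IO whenever it runs low
    if s.rest_size < BLOCK && (chunk = io.read(BLOCK))
      s.string = s.rest + chunk
    end
    break if s.eos?
    if s.skip(/\s+/)
      next
    elsif (num = s.scan(/\d+/))
      # a number ending exactly at the buffer edge might continue in the
      # next block: push it back with more input and re-scan it
      if s.eos? && (extra = io.read(BLOCK))
        s.string = num + extra
        next
      end
      yield [:NUMBER, num.to_i]
    else
      raise "unexpected character: #{s.rest[0].inspect}"
    end
  end
end

tokens = []
each_token(StringIO.new("111 2222 " * 5)) { |t| tokens << t }
```

    Memory stays bounded by the buffer size rather than the input size, at the
    cost of the extra bookkeeping at block boundaries.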

    Eric Mahurin, Nov 5, 2005
    #7
  8. Steven Jenkins

    Eric Mahurin wrote:
    > The racc memory usage could also be helped if you don't
    > generate the entire token stream in memory first. I'm not sure why there are
    > no racc examples with low memory usage (don't read the entire file into a
    > string and don't generate the entire token stream in memory up front).


    Probably because for most applications it doesn't matter. But the
    example I sent you doesn't read the entire file into a string. It does
    load the entire token stream into an array, but that could easily be
    changed by running the lexer in a separate thread and using a SizedQueue
    for the token stream. Maybe a five-line change.
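    The change Steve describes might look roughly like this (a stand-in lexer,
    not his actual example): the lexer runs in its own thread and feeds a
    SizedQueue, which blocks the producer once the buffer is full, so the full
    token stream never sits in memory at once.

```ruby
require 'strscan'

# Stand-in lexer: pushes tokens into a queue instead of returning an Array.
def run_lexer(source, queue)
  s = StringScanner.new(source)
  until s.eos?
    s.skip(/\s+/) and next
    num = s.scan(/\d+/) or raise "unexpected: #{s.rest[0]}"
    queue << [:NUMBER, num.to_i]
  end
  queue << [false, false]          # racc-style end-of-input marker
end

queue = SizedQueue.new(300)        # bounded token buffer
lexer = Thread.new { run_lexer("10 20 30", queue) }

# the "parser" side: pull tokens as needed, never holding them all at once
seen = []
until (tok = queue.pop) == [false, false]
  seen << tok
end
lexer.join
```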

    For this application, I really don't care about memory consumption. The
    largest file I'll ever parse is 100 times smaller than the physical
    memory on the machine, much less the available virtual memory.

    Steve
    Steven Jenkins, Nov 5, 2005
    #8
  9. Eric Mahurin


    On 11/5/05, Steven Jenkins <> wrote:
    >
    > Eric Mahurin wrote:
    > > The racc memory usage could also be helped if you don't
    > > generate the entire token stream in memory first. I'm not sure why
    > > there are no racc examples with low memory usage (don't read the entire
    > > file into a string and don't generate the entire token stream in memory
    > > up front).
    >
    > Probably because for most applications it doesn't matter. But the
    > example I sent you doesn't read the entire file into a string. It does
    > load the entire token stream into an array, but that could easily be
    > changed by running the lexer in a separate thread and using a SizedQueue
    > for the token stream. Maybe a five-line change.
    >
    > For this appliation, I really don't care about memory consumption. The
    > largest file I'll ever parse is 100 times smaller than the physical
    > memory on the machine, much less the available virtual memory.
    >
    > Steve

    Do you even need a separate thread? Why can't you just have #next_token
    parse the next token and return it? From what I saw of your lexer, it
    looked like that would be possible, but then again I have almost no racc
    experience (I've only learned enough to try to benchmark against it). When
    you go multi-threaded, you do have to make sure you're doing enough work
    in each thread, because thread switching is relatively expensive. For my
    multi-threaded test, increasing my token buffer from no buffering (lexer
    and parser synchronized) to a 300-token buffer made a couple orders of
    magnitude difference in performance.

    To me, if a parser reads the whole file into memory and/or generates all
    the tokens of the file in memory up front, it doesn't seem like a "real"
    parser. Regexp fits in this camp (it parses from a string, in memory).
    With some of the features being added to regexes (look at some of perl's
    advanced/beta features, where you can embed code and do recursion), you
    could call a regex a parser, but it doesn't seem like one to me because it
    needs the whole thing it is parsing in memory. It just seems strange to me
    that common racc usage does things this way. You could actually optimize a
    parser quite a bit if you knew everything it was parsing was in memory
    (regexes make this assumption for backtracking, which is similar to
    lookahead in a standard parser).

    Eric Mahurin, Nov 6, 2005
    #9
  10. Steven Jenkins

    Eric Mahurin wrote:
    > Do you even need a separate thread? Why can't you just have #next_token just
    > parse the next token and return it? From what I saw of your lexer, it looked
    > like that would be possible, but then again I have almost no racc experience
    > (I've only learned enough to try to benchmark against it). When you go
    > multi-threaded, you do have to make sure you're doing enough work in each
    > thread because thread switching is relatively expensive. For my
    > multi-threaded test, increasing my token buffer from no buffering (lexer and
    > parser synchronized) to a 300 token buffer made a couple orders of magnitude
    > difference in performance.


    If you want to limit memory consumption, a separate thread and
    SizedQueue are sufficient. That might not give the highest performance,
    but it's the cleanest code. If performance is an issue, then yes, you
    could have next_token read another line from the input file if the
    current buffer is empty.

    > To me, if a parser is reading in a whole file into memory and/or generating
    > all the tokens of the file in memory up front, it doesn't seem like a "real"
    > parser. Regexp fits in this camp (parses from a string - in memory). With
    > some of the features that are being added to regexes (look at some of perl's
    > advanced/beta features where you can embed code and do recursion), you could
    > call a regex a parser, but it doesn't seem like one to me because it needs
    > the whole thing it is parsing in memory. It just seems strange to me that
    > the common racc usage does things this way. You could actually optimize a
    > parser quite a bit if you knew everything it was parsing was in memory
    > (regexes make this assumption for doing backtracking - similar to lookahead
    > for a standard parser).


    Again, only if performance and memory usage are issues. I'd guess that
    for most applications using racc, they aren't. And if they really are,
    Ruby might not be the best choice anyway. Ruby/racc will never seriously
    compete with C/yacc in speed. For my particular application, the largest
    file I'll ever encounter can be parsed in less than one second, so I put
    exactly zero effort into making it faster.

    Steve
    Steven Jenkins, Nov 6, 2005
    #10
