parser performance comparisons


Eric Mahurin

Jim Freeze was curious as to how Grammar (LL parser) compares to RACC (LR
parser). I was a little bit curious too... As a test case, I compared
against the racc calc.y parser (a simple expression calculator) with a huge
2.6 MB expression (randomly generated by a script). Since ruby could
evaluate this expression directly (I used load), I also threw it into the
mix. Here are the results:

parser   lexer    user   vmsize   notes
-------  -------  -----  -------  ------------------------------
ruby     ruby      0.8s   30.6MB  -
racc     Regexp   33.4s  278.2MB  w/o racc C extension
racc     Regexp   15.2s  142.1MB  w/ racc C extension
Grammar  Grammar  22.1s   11.4MB  multi-threaded (300 token buf)
Grammar  Grammar  20.9s   11.4MB  multi-threaded, inlined
Grammar  Grammar  21.1s   11.3MB  one token at a time
Grammar  Grammar  20.1s   11.4MB  one token at a time, inlined
Grammar  Regexp   15.6s   76.0MB  -
Grammar  Regexp   13.8s   74.4MB  inlined
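
Something along these lines would produce such an input (a sketch, not the
actual generator script; the filename is hypothetical):

  ops = %w[+ - * /]
  File.open("big_expr.rb", "w") do |f|
    f.print rand(100)
    200_000.times { f.print " #{ops[rand(ops.size)]} #{rand(100) + 1}" }
    f.puts
  end
  # ruby can then evaluate the whole expression directly: load "big_expr.rb"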

I think the best apples-to-apples comparison is racc w/ its Regexp lexer
and no C extension (mine is pure ruby) vs. my Grammar parser with a similar
Regexp lexer. You can even get a little more speed by inlining your actions
with Grammar. Unfortunately this makes them less readable because you have
to put your code in strings instead of a simple block. Under those
circumstances, my parser is more than twice as fast. racc's C extension
gets it back on par with my pure ruby solution. I think the memory usage is
so high for the racc solutions because with racc the tokens are typically
all generated up front (in memory). Every racc example I found did this.
I'm not sure why.
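
The mechanical difference is roughly this (a generic illustration, not
Grammar's actual API):

  # A block action is opaque: the generated parser must invoke it via #call.
  block_action = lambda { |a, b| a + b }

  # A string action can be spliced directly into the source the generator
  # emits, so no per-action #call is left in the generated code:
  string_action = "a + b"
  inlined = eval("lambda { |a, b| #{string_action} }")

  block_action.call(1, 2)  #=> 3
  inlined.call(1, 2)       #=> 3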

I only showed the Regexp lexer solution because that is what the racc
examples use. For a Grammar parser, I recommend using a Grammar lexer. A
Regexp lexer isn't as readable or flexible, and most importantly a Regexp
is meant to work with a String, not an IO. Because Regexp works on Strings,
all of the racc examples I found read the IO into a String first. That is
why the memory usage is so much higher for the Regexp lexers above.
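
For reference, the whole-file style looks roughly like this (a sketch with
an assumed calc token set, not code from the racc examples):

  require 'strscan'

  src = File.read("big_expr.rb")        # the entire IO pulled into a String
  s = StringScanner.new(src)
  tokens = []
  until s.eos?
    if s.scan(/\s+/)
      # skip whitespace
    elsif (t = s.scan(/\d+/))
      tokens << [:NUMBER, t.to_i]
    elsif (t = s.scan(/[-+*\/()]/))
      tokens << [t, t]
    else
      raise "lexer error at #{s.pos}"
    end
  end
  tokens << [false, false]              # racc's end-of-input convention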

BTW, I did this on my experimental code. Hopefully this will become version
0.6 in a few weeks.

 

Phil Tomson

Eric said:
As a test case, I compared against the racc calc.y parser (simple
expression calculator) with a huge 2.6 MB expression. [snip]

It's been a while since I looked into the LL(k) vs. LALR issues (LL(k) in
the case of ANTLR and LALR in the case of YACC), but isn't the calc.y
grammar pretty simple and thus maybe not a good way to compare the two?
Specifically, did you need to use any lookahead?

A better comparison would be to create a Ruby grammar...
You're working on that, right? ;-)

Phil
 

Jim Freeze

Phil said:
A better comparison would be to create a Ruby grammar...

Other comparisons may be needed, but this one is very important.
It says that Grammar can play equally with RACC in terms of speed,
which is important if you are the new kid on the block.

What I would like to see is Grammar speed up a bit so that the
one-token-at-a-time mode is as fast as Racc with the C extension. Then it
would almost be a no-brainer to use Grammar instead of Racc.

Good work Eric.
 

Eric Mahurin

Phil said:
A better comparison would be to create a Ruby grammar...

And compare to what? We have no ruby parsers in ruby.

I would like to compare to something else too. If anybody else has other
suggestions, here's what I would need for a comparison:

- a racc (or even working rockit) grammar
- a large test file that takes at least several seconds to parse
- hopefully something simple enough that it won't take much time to convert
to Grammar (I don't want to distract from my ruby parser much)

Jim said:
Other comparisons may be needed, but this one is very important.
It says that Grammar can play equally with RACC in terms of speed,
which is important if you are the new kid on the block.

What I would like to see is Grammar speed up a bit so that the
one-token-at-a-time mode is as fast as Racc with the C extension. Then it
would almost be a no-brainer to use Grammar instead of Racc.


Why is the one-token-at-a-time Grammar lexer important? If you already have
a Regexp lexer (for RACC or whatever), it is very easy to make it a lexer
for a Grammar parser - just make it an object that has a read1next method
(like a Cursor) - similar to next_token with yacc. Until I start generating
C code, it will be hard to compete against a Regexp lexer, since Regexp
does its work in C.
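
For example, wrapping an existing next_token-style lexer might look like
this (a hypothetical adapter; only the read1next name comes from Grammar):

  class TokenCursor
    def initialize(lexer)
      @lexer = lexer              # any object with a racc-style next_token
    end

    def read1next
      type, value = @lexer.next_token
      type ? [type, value] : nil  # assumes nil signals end of input
    end
  end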

If you are starting from scratch, though, the recommended way to make a
lexer for a Grammar parser is the multi-threaded approach. This gives the
most flexibility and better readability for complex lexers. With this
approach, the Grammar of the lexer parses the whole file and generates all
tokens (instead of parsing just one token). By doing it this way, most
lexers don't need to hold state internal to the lexer. For example, if a
string literal corresponds to multiple tokens, a token-at-a-time lexer
would have to set a lexer state when it entered the string so that the
next time it is called it would return a string-type token (and reset that
state when it found the end of the string). With the lexer parsing all
tokens, when it finds a string it just generates all the tokens for that
string right there - no need for a lexer state.
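
In outline, the arrangement looks like this (a sketch using the stdlib
Thread and SizedQueue, not the actual Grammar classes):

  require 'thread'

  queue = SizedQueue.new(300)   # the 300 token buffer from the table above

  lexer = Thread.new do
    # stand-in for a Grammar lexer walking the whole input; a real one
    # would pull from an IO rather than a String
    File.read("big_expr.rb").scan(/\d+|\S/) { |t| queue.push(t) }
    queue.push(nil)             # end-of-stream marker
  end

  while (token = queue.pop)
    # the parser consumes tokens here, blocking when the buffer is empty
  end
  lexer.join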

But, Grammar will get faster over time. Here are some of my ideas:

- use ruby2c to convert my generated ruby code to C.
- if that doesn't pan out, make another set of Grammar classes/methods
(same API) that generate C code and inline it.
- optimize the methods from various Cursor classes that Grammar might use.

Also keep in mind that with my Grammar stuff, parser generation is done at
run-time (included in the numbers I gave). This should typically take a
fraction of a second for most parsers. If it becomes an issue I will add
the ability to dump the generated code to a file. But for now I'd like to
leave it, because the programmer not having to worry about the parser
generation stage and how to integrate the generated file makes the
usability much higher. Grammar can be treated as just another library.

Jim said:
Good work Eric.

Thanks, Jim.

 

Phil Tomson

Eric said:
And compare to what? We have no ruby parsers in ruby.

Well, the suggestion wasn't entirely serious...



Phil
 

Jim Freeze

Jim said:
Other comparisons may be needed, but this one is very important.

Eric said:
Why is the one-token-at-a-time Grammar lexer important? If you already have
a Regexp lexer (for RACC or whatever), it is very easy to make it a lexer
for a Grammar parser - just make it an object that has a read1next method
(like a Cursor) - similar to next_token with yacc. Until I start generating
C code, it will be hard to compete against a Regexp lexer since it does
stuff in C.

Memory consumption.
From the data that you showed, it looks like I should avoid
Regexp if I am concerned about memory usage.
Jim said:
Good work Eric.

Eric said:
Thanks, Jim.

Don't mention it. :)
 

Eric Mahurin


Jim said:
Memory consumption.
From the data that you showed, it looks like I should avoid
Regexp if I am concerned about memory usage.

Yes, but you have the same problem in racc (and more, since typical usage
is to generate the entire token stream in memory first). The Regexp lexers
could be brought down to more reasonable memory levels if you read in a
block (or line) at a time to match against. In this particular example a
line at a time wouldn't help, since I put the entire 2.6 MB expression on
one line (not realistic of course). With reading in a block (fixed number
of characters) or line at a time, you'd have to deal with tokens that can
span multiple blocks or lines. You'd lose some of the Regexp advantage,
but probably not much. The racc memory usage could also be helped if you
don't generate the entire token stream in memory first. I'm not sure why
there are no racc examples with low memory usage (i.e. that don't read the
entire file into a string and don't generate the entire token stream in
memory up front).
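
A block-at-a-time Regexp lexer might look roughly like this (a sketch; it
assumes no token is longer than one block):

  require 'strscan'

  class BufferedLexer
    BLOCK = 4096

    def initialize(io)
      @io = io
      @s  = StringScanner.new("")
    end

    # Top up the buffer whenever the unscanned tail gets short, so a token
    # can never be cut in half at a block boundary.
    def fill
      if @s.rest_size < BLOCK && (chunk = @io.read(BLOCK))
        @s.string = @s.rest + chunk   # keep only the unscanned text
      end
    end

    def next_token
      fill
      return [false, false] if @s.eos?      # racc-style end of input
      return next_token if @s.skip(/\s+/)   # whitespace may span blocks
      if (t = @s.scan(/\d+/))
        [:NUMBER, t.to_i]
      elsif (t = @s.scan(/[-+*\/()]/))
        [t, t]
      else
        raise "lexer error"
      end
    end
  end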

 

Steven Jenkins

Eric said:
The racc memory usage could also be helped if you don't
generate the entire token stream in memory first. I'm not sure why there are
no racc examples with low memory usage (don't read the entire file into a
string and don't generate the entire token stream in memory up front).

Probably because for most applications it doesn't matter. But the
example I sent you doesn't read the entire file into a string. It does
load the entire token stream into an array, but that could easily be
changed by running the lexer in a separate thread and using a SizedQueue
for the token stream. Maybe a five-line change.
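
Roughly that change (a sketch, not the actual code; each_token stands in
for iterating the existing lexer):

  require 'thread'

  @tokens = SizedQueue.new(64)     # replaces the unbounded token Array
  Thread.new do
    @lexer.each_token { |t| @tokens.push(t) }
    @tokens.push([false, false])   # racc's end-of-input marker
  end

  def next_token
    @tokens.pop                    # blocks until the lexer thread catches up
  end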

For this application, I really don't care about memory consumption. The
largest file I'll ever parse is 100 times smaller than the physical
memory on the machine, much less the available virtual memory.

Steve
 

Eric Mahurin


Steven said:
Probably because for most applications it doesn't matter. But the
example I sent you doesn't read the entire file into a string. It does
load the entire token stream into an array, but that could easily be
changed by running the lexer in a separate thread and using a SizedQueue
for the token stream. Maybe a five-line change.

For this application, I really don't care about memory consumption. The
largest file I'll ever parse is 100 times smaller than the physical
memory on the machine, much less the available virtual memory.

Do you even need a separate thread? Why can't you just have #next_token
just parse the next token and return it? From what I saw of your lexer, it
looked like that would be possible, but then again I have almost no racc
experience (I've only learned enough to try to benchmark against it). When
you go multi-threaded, you do have to make sure you're doing enough work in
each thread because thread switching is relatively expensive. For my
multi-threaded test, increasing my token buffer from no buffering (lexer
and parser synchronized) to a 300 token buffer made a couple orders of
magnitude difference in performance.

To me, if a parser is reading a whole file into memory and/or generating
all the tokens of the file in memory up front, it doesn't seem like a
"real" parser. Regexp fits in this camp (it parses from a string - in
memory). With some of the features that are being added to regexes (look
at some of perl's advanced/beta features where you can embed code and do
recursion), you could call a regex a parser, but it doesn't seem like one
to me because it needs the whole thing it is parsing in memory. It just
seems strange to me that the common racc usage does things this way. You
could actually optimize a parser quite a bit if you knew everything it was
parsing was in memory (regexes make this assumption for doing backtracking
- similar to lookahead for a standard parser).

 

Steven Jenkins

Eric said:
Do you even need a separate thread? Why can't you just have #next_token just
parse the next token and return it? From what I saw of your lexer, it looked
like that would be possible, but then again I have almost no racc experience
(I've only learned enough to try to benchmark against it). When you go
multi-threaded, you do have to make sure you're doing enough work in each
thread because thread switching is relatively expensive. For my
multi-threaded test, increasing my token buffer from no buffering (lexer and
parser synchronized) to a 300 token buffer made a couple orders of magnitude
difference in performance.

If you want to limit memory consumption, a separate thread and
SizedQueue are sufficient. That might not give the highest performance,
but it's the cleanest code. If performance is an issue, then yes, you
could have next_token read another line from the input file if the
current buffer is empty.

Eric said:
To me, if a parser is reading in a whole file into memory and/or generating
all the tokens of the file in memory up front, it doesn't seem like a "real"
parser. Regexp fits in this camp (parses from a string - in memory). With
some of the features that are being added to regexes (look at some of perl's
advanced/beta features where you can embed code and do recursion), you could
call a regex a parser, but it doesn't seem like one to me because it needs
the whole thing it is parsing in memory. It just seems strange to me that
the common racc usage does things this way. You could actually optimize a
parser quite a bit if you knew everything it was parsing was in memory
(regexes make this assumption for doing backtracking - similar to lookahead
for a standard parser).

Again, only if performance and memory usage are issues. I'd guess that
for most applications using racc, they aren't. And if they really are,
Ruby might not be the best choice anyway. Ruby/racc will never seriously
compete with C/yacc in speed. For my particular application, the largest
file I'll ever encounter can be parsed in less than one second, so I put
exactly zero effort into making it faster.

Steve
 
