Writing fast(er) parsers in Perl

Clint Olsen

I've been working on a parser project to translate from a proprietary
hardware description language into something more industry standard, and
I've found some peculiar performance characteristics in Perl. Hopefully
someone can shed a bit more light on the subject. I've been using Perl
5.8.2, so my findings should be pretty relevant to current users.

Since this project is fairly complex, I'm using Parse::Yapp to handle the
parsing functionality. Perl's natural pattern matching ability makes it
pretty simple to code up a lexical analyzer by hand.
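
For context, the hand-written lexer feeds a parser class generated by yapp,
and hooking the two together follows the usual Parse::Yapp calling
convention, roughly like this (MyParser and next_token are placeholder
names, not the actual code):

use MyParser;    # parser class generated by running yapp on the grammar

my $parser = MyParser->new;

# Parse::Yapp expects yylex to return a (TOKEN, value) pair for each token
# and ('', undef) at end of input; yyerror is called on a syntax error.
my $result = $parser->YYParse(
    yylex   => \&next_token,
    yyerror => sub { die "syntax error\n" },
);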

After running the profiler on the tool, I found that over 70% of the
runtime was being taken up in the scanner (the function actually scanning
tokens from the file). So, I decided to start there.

The scanner is pretty straightforward. I precompiled patterns using qr//o
with all the reserved keywords and special patterns for tokens in the
language. I reused these patterns in m// like so:

#
# Precompile patterns.
#
my $INTEGER = qr/(-?\d+)/o;

....
....
....

elsif (/\G$INTEGER/ogc) {
    return [ 'INTEGER', $1 ];
}

There are many such patterns, some of them quite a bit more complicated
than integers. I used /gc and the \G assertion in all the patterns, since
this is what the Perl documentation mentions for writing lex-like
scanners.

For file I/O, I snarfed the entire contents into a scalar by adjusting $/:

sub read_file {
    if (open FILE, '<', $_[0]) {
        local $/ = undef;          # slurp mode: read the whole file at once
        my $data = <FILE>;
        close FILE;
        \$data;                    # return a reference to avoid copying the data
    }
}

This approach seems popular among Perl users and hides the file I/O
details from the programmer.

Unfortunately, the result of this combination was pretty sluggish
performance. Parsing a nominal file, which involves considerable file
inclusion, takes slightly over a minute of runtime. There is also quite a
bit of system time registered by the time(1) command.

The first thing I tried was to eliminate the reliance on the zero-width
assertion \G and /gc on the search. I changed all the calls from m// to
s// and simply stripped off the leading text as I matched it. That alone
saved around 35% of the runtime.
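
For illustration, the change amounted to rewriting each test roughly like
this (a sketch using the $INTEGER pattern from above; the other patterns
were handled the same way):

# Before: anchored match that advances pos() through the buffer
elsif (/\G$INTEGER/ogc) {
    return [ 'INTEGER', $1 ];
}

# After: substitution that strips the matched text off the front
elsif (s/^$INTEGER//o) {
    return [ 'INTEGER', $1 ];
}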

Encouraged by that result, I decided to take it one step further and
experiment with more traditional lexer file I/O techniques. I had assumed
Perl just uses mmap() to read entire files into a scalar, but I'm not
entirely sure what happens behind the scenes without diving into the
source.

I tried just opening the file and using Perl's read() call to snarf the
data in 8K chunks, using the following fill function:

sub fill {
    my ($buf) = @_;
    my $bufin;

    # A short read is taken to mean end-of-file; record that in the flag.
    $$buf[2] = (read($$buf[0], $bufin, 8192) != 8192);

    $$buf[1] .= $bufin;
}

The buf object is an array triplet holding a filehandle, a scalar
containing the currently buffered file data, and an EOF indicator.
fill() isn't called until the length of the current buffer has fallen
below 256 characters, so in essence this just assumes that no token is
longer than 256 characters.
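
A minimal sketch of how such a buffer triplet might be set up and topped up
(the open_buffer() helper and the file name are illustrative, not the
actual code):

# Hypothetical helper: returns [ filehandle, buffered data, eof flag ].
sub open_buffer {
    open my $fh, '<', $_[0] or die "cannot open $_[0]: $!";
    return [ $fh, '', 0 ];
}

my $buf = open_buffer('design.src');

# Inside the scanner loop: refill before the buffer runs dry, so a token of
# up to 256 characters is never cut off at a buffer boundary.
fill($buf) if !$$buf[2] and length($$buf[1]) < 256;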

This change produced dramatically better results: the runtime dropped to
no more than 15 seconds! Just to get a sense of how much system call
overhead there was, I ran strace on each version:

-rw-r--r-- 1 clint users 8994621 Apr 16 17:00 /tmp/parse1.trc
-rw-r--r-- 1 clint users 3625534 Apr 16 17:00 /tmp/parse2.trc
-rw-r--r-- 1 clint users 976800 Apr 17 11:13 /tmp/parse3.trc

The first trace is from the original program, reading the entire file(s)
in as scalars and using the zero-width assertion (\G); the second is after
substituting s// for m//gc; and the last is after the change to manual
buffering. I'm pretty shocked at the speed differences between these
changes. I wouldn't call my manual buffering very sophisticated, either,
but it is certainly worth it, since it yields over a 75% speed improvement
on average.

-Clint
 
Walter Roberson

:Since this project is fairly complex, I'm using Parse::Yapp to handle the
:parsing functionality. Perl's natural pattern matching ability makes it
:pretty simple to code up a lexical analyzer by hand.

:After running the profiler on the tool, I found that over 70% of the
:runtime was being taken up in the scanner (the function actually scanning
:tokens from the file). So, I decided to start there.

In my parsing programs, I fairly consistently find that 70% or more of
the runtime is being spent simply split()'ing the lines into fields
(using the default whitespace splitting).

I haven't tried Parse::Yapp, but I did find that performance improved
a fair bit when I "guess" that a key field will start at the same
string offset as it did on the previous line and use substr() to probe
for it there, only doing the more general split() if that probe fails
[or if the context of the keyword is such that I need the full split to
understand the line anyhow].
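
In code, that probe-then-fall-back approach might look something like this
(the field position, the 'DENY' keyword, and the record layout are invented
for the example):

my $last_off = -1;    # offset of the key field on the previous line

while (my $line = <STDIN>) {
    my $key;

    # Cheap probe: assume the key field starts where it did last time,
    # and only pay for the general split() when that guess misses.
    if ($last_off >= 0 and substr($line, $last_off, 4) eq 'DENY') {
        $key = 'DENY';
    }
    else {
        my @fields = split ' ', $line;              # default whitespace split
        $key       = $fields[3];                    # key field by position
        $last_off  = index($line, $key) if defined $key;
    }

    next unless defined $key and $key eq 'DENY';
    # ... handle the interesting line ...
}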
 
Clint Olsen

WR> In my parsing programs, I fairly consistently find that 70% or more of
WR> the runtime is being spent simply split()'ing the lines into fields
WR> (using the default whitespace splitting).
WR>
WR> I haven't tried Parse::Yapp, but I did find that performance improved a
WR> fair bit when I "guess" that a key field will start at the same string
WR> offset as it did on the previous line and use substr() to probe for it
WR> there, only doing the more general split() if that probe fails [or if
WR> the context of the keyword is such that I need the full split to
WR> understand the line anyhow].

It's actually more efficient to treat the file as a stream of tokens and
handle newlines just like any other token. This is more effective when
dealing with languages like C or Verilog, which are free-form and where
whitespace isn't necessarily required to separate tokens. It's also
helpful when doing bookkeeping like line/column tracking for tokens
within the file.

-Clint
 
Walter Roberson

:It's actually more efficient to treat the file as a stream of tokens and
:handle newlines just like any other token.

Not if your grammar happens to be line-oriented (as has been the case
for the parsing I've been doing.) In your approach, you have the overhead
of tokenizing -everything-, but in line-oriented grammars there may
be portions of the line that can be ignored (at least until the
context calls upon them.)

For example, a lot of my parsing these days is on firewall logs.
Each line starts with a fixed-width datestamp (put in by the logging host),
then the name of the logging device, then a second fixed-width datestamp
(put in by the firewall), followed by a fixed-width message-type code.
For my purposes, well over half of the message-type codes are
indicative of lines that can be ignored, so by just extracting one
substring and doing a hash lookup, half the time I can determine that
I don't need to tokenize anything else on the line.
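
In code, that early rejection costs one substr() and one hash lookup per
line, something like this (the column offset, field width, and message-type
codes are invented for the example):

# Message-type codes whose lines can be discarded without further parsing.
my %ignore = map { $_ => 1 } qw( 106023 302013 302014 );

while (my $line = <STDIN>) {
    # Fixed-width layout: two datestamps and the device name precede a
    # fixed-width message-type code at a known column.
    my $code = substr($line, 45, 6);

    next if $ignore{$code};    # well over half the lines stop right here

    # ... only the remaining lines get fully tokenized ...
}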
 
Clint Olsen

WR> Not if your grammar happens to be line-oriented (as has been the case for
WR> the parsing I've been doing.) In your approach, you have the overhead of
WR> tokenizing -everything-, but in line-oriented grammars there may be
WR> portions of the line that can be ignored (at least until the context
WR> calls upon them.)

Yes, I did mention that I was parsing a free-form language, not
line-oriented data. In your case you are combining (or blurring the lines
between) lexing and parsing - and you're doing mostly lexing, since you're
just separating fields. Since you are guaranteed a complete command/phrase
on a single line, your approach makes sense.

However, I don't have that luxury, which is why I was inquiring as to why
s///o is so much faster than m/\Gstuff/ogc, and why using read() is
considerably faster than: local $/ = undef; $data = <FILE>.

Now that I've made my changes, the lexing phase is still the largest
chunk of time, at around 26%. This is expected, but it's nowhere near the
70% it was before.

-Clint
 
Uri Guttman

CO> However, I don't have that luxury, which is why I was inquiring as
CO> to why s///o is so much faster than m/\Gstuff/ogc, and why using
CO> read() is considerably faster than: local $/ = undef; $data =
CO> <FILE>.

read is faster since it doesn't have to check for EOF in a loop like the
slurp does. and sysread is even faster than read. and you can get that
speed with cleaner code by using File::Slurp which does fast sysread
under the hood and has many useful options.
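
as a sketch of what that looks like (the file name here is made up;
scalar_ref is one of File::Slurp's documented options and returns a
reference so the data isn't copied around):

use File::Slurp;

# slurp the whole file in one call and get back a reference to the buffer
my $text_ref = read_file('design.src', scalar_ref => 1);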

s/// is faster than m//g because it stays inside perl guts (c code) all
the time while m// has to be wrapped in a slower perl level loop.

also the /o is almost never needed anymore. perl will only recompile a
regex when it sees that an interpolated variable has changed since the
last time it was compiled.
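
as a quick illustration of that (a sketch; pattern and input are arbitrary):
when the pattern is nothing but a qr// object, perl reuses the compiled
regex, so a trailing /o buys nothing:

my $INTEGER = qr/^(-?\d+)/;    # compiled once, right here

my $text = '-42 rest';
# the pattern below is just the qr// object, so it isn't recompiled on
# each use; adding /o would make no difference
if ($text =~ s/$INTEGER//) {
    print "integer: $1\n";     # prints: integer: -42
}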


CO> Now that I've made my changes, the lexing phase is still the largest
CO> chunk of time, at around 26%. This is expected, but it's nowhere near
CO> the 70% it was before.

try the file::slurp module and you should get even more speedup.

uri
 
Clint Olsen

UG> read is faster since it doesn't have to check for EOF in a loop like the
UG> slurp does. and sysread is even faster than read. and you can get that
UG> speed with cleaner code by using File::Slurp which does fast sysread
UG> under the hood and has many useful options.
UG>
UG> s/// is faster than m//g because it stays inside perl guts (c code) all
UG> the time while m// has to be wrapped in a slower perl level loop.
UG>
UG> also the /o is almost never needed anymore. perl will only recompile a
UG> regex when it sees that an interpolated variable has changed since the
UG> last time it was compiled.
UG>
UG> try the file::slurp module and you should get even more speedup.

I tried slurp, and while it was faster reading the entire file than
reassigning $/ to undef and doing $data = <FILE>, it's still far slower
than my buffered reads using read(). I'm still twice as fast as slurp. I
assume you wanted me to slurp the entire file into memory as I had done
originally.

It's possible that this performance behavior is a peculiarity on Linux
only, but historically doing reads into a finite buffer has been the most
time/space efficient way to read a file.

Thanks for your explanation on substitution vs. match. That answers quite
a bit.

-Clint
 
Uri Guttman

CO> I tried slurp, and while it was faster reading the entire file
CO> than reassigning $/ to undef and doing $data = <FILE>, it's still
CO> far slower than my buffered reads using read(). I'm still twice
CO> as fast as slurp. I assume you wanted me to slurp the entire file
CO> into memory as I had done originally.

can you show your benchmarks? and i didn't see (i skipped much of the
thread) your code that does a read() loop. that isn't true slurping and
it could be faster as it doesn't have to allocate (or reallocate) a very
large buffer. file slurping is not good for very large files (for some
definitions of large - today slurping in megabytes is easy, which wasn't
the case not so long ago).

CO> It's possible that this performance behavior is a peculiarity on
CO> Linux only, but historically doing reads into a finite buffer has
CO> been the most time/space efficient way to read a file.

that is a different beast. when you need to process the entire file, a
true slurp is faster. sequential reads into a fixed buffer is totally
different.

uri
 
Clint Olsen

UG> can you show your benchmarks? and i didn't see (i skipped much of the
UG> thread) your code that does a read() loop. that isn't true slurping and
UG> it could be faster as it doesn't have to allocate (or reallocate) a
UG> very large buffer. file slurping is not good for very large files (for
UG> some definitions of large - today slurping in megabytes is easy, which
UG> wasn't the case not so long ago).

I can't show you the entire benchmark, since this is a parser for a
proprietary language, but I can show you excerpts:

The fill() function looks like:

#
# $file is an array reference holding the filehandle, a scalar buffer
# for the data, and an EOF flag indicator, respectively.
#
sub fill {
    my ($file) = @_;
    my $buf;

    # A short read is taken to mean end-of-file.
    $$file[2] = (read($$file[0], $buf, 8192) != 8192);

    $$file[1] .= $buf;
}

The loop of the scanner looks like:

sub scan {
    ....
    ....
    ....
    my $tok;
    my @src;    # source position (line is $src[1], column is $src[2])

    for ($$buf[1]) { {
        # Top up the buffer before it runs dry.
        if (!$$buf[2] and (!$$buf[1] or length($$buf[1]) < 128)) {
            fill($buf);
        }

        #
        # Newlines
        #
        if (s/^( *\n)//o) {
            ++$src[1];
            $src[2] = 1;
            redo;
        }
        #
        # Whitespace (not newline)
        #
        if (s/^([ \t]+)//o) {
            $src[2] += length($1);
            redo;
        }

        if (s/^$KEYWORD//o) {
            $src[2] += length($1);
            $tok = [ uc($1), [ $1, \@src ] ];
        }
        elsif (s/^$SYMBOL//o) {
            $src[2] += length($1);
            $tok = [ $1, [ $1, \@src ] ];
        }
        elsif (s/^$STRING//o) {
            $src[2] += length($1) + 2;
            $tok = [ 'STRINGCNST', [ $1, \@src ] ];
        }
        elsif (s/^$INTEGER//o) {
            $src[2] += length($1);
            $tok = [ 'INTEGERCNST', [ $1, \@src ] ];
        }
        elsif (s/^$REAL//o) {
            $src[2] += length($1);
            $tok = [ 'REALCNST', [ $1, \@src ] ];
        }
        elsif (s/^$ID//o) {
            $src[2] += length($1);
            $tok = [ 'ID', [ $1, \@src ] ];
        }
        elsif (s/^(.)//so) {
            perror(\@src, "illegal character '$1'\n");
            ++$src[2];
            redo;
        }
    } }
    ....
}

Where $KEYWORD, $SYMBOL, $STRING, $INTEGER, $REAL, and $ID are all
precompiled regular expressions using qr//.

UG> that is a different beast. when you need to process the entire file, a
UG> true slurp is faster. sequential reads into a fixed buffer is totally
UG> different.

Well, you have to consider the Perl overhead of resizing a large buffer
periodically as a result of the substitutions that occur. As you strip off
the leading text, doesn't perl have to move this data around?

-Clint
 
