[ann] regexp-engine 0.11


Simon Strandgaard

Phew.. the regexp engine now supports Unicode ;-)

download as TGZ:
http://rubyforge.org/frs/download.php/706/regexp-engine-0.11.tar.gz

download as ZIP:
http://rubyforge.org/frs/download.php/707/regexp-engine-0.11.zip

demo site:
http://neoneye.dk/regexp.rbx

changelog:
http://rubyforge.org/cgi-bin/viewcv...=aeditor&content-type=text/vnd.viewcvs-markup


Overview
========

The regexp engine is written entirely in Ruby. It is highly compatible
with Ruby's builtin regexp engine and carefully tested (2000+ tests).


Features
========

There are currently 3 parsers: perl5, perl6, xml.

Encodings supported: ASCII, UTF8.


Not yet supported stuff
=======================

Send me a mail if there is something you want, or, if you are
a developer yourself, send me some patches.

* subcaptures inside negative-lookahead/behind.
* grammars.
* UTF16 and other encodings.
* inline-code.
* named captures.
* possessive quantifiers.
* recursive expressions.


Perl5 syntax
============

a|b|c           alternation
[...]  [^...]   character class.. and inverse charclass
[[:alpha:]]     posix character class
[[:^alpha:]]    inverse posix character class
.               dot matches anything except newline, same as [^\n]
\1 .. \9        backreference . . . . . . . . . . . . . see [3]
*  *?           loop 0 or more times greedy/lazy
+  +?           loop 1 or more times greedy/lazy
{n,}  {n,}?     loop n or more times greedy/lazy
?  ??           loop 0..1 times greedy/lazy
{n,m}  {n,m}?   loop n..m times greedy/lazy
{n}  {n}?       loop n times greedy/lazy
( ... )         capturing group
(?: ... )       non-capturing group
(?> ... )       atomic grouping
(?= ... )       positive-lookahead
(?! ... )       negative-lookahead  . . . . . . . . . . see [2]
(?<= ... )      positive-lookbehind . . . . . . . . . . see [1]
(?<! ... )      negative-lookbehind . . . . . . . . . . see [1], [2]
(?# ... )       posix-comment
(?i)  (?-i)     ignorecase on/off
(?m)  (?-m)     multiline on/off
(?x)  (?-x)     extended on/off
^  \A           begin of line, begin of string
$  \z  \Z       end of line, end of string (excl newline)
\b  \B          word boundary, nonword boundary
\d  \D          [[:digit:]] and the inverse [^[:digit:]]
\s  \S          [[:space:]] and the inverse [^[:space:]]
\w  \W          [[:word:]] and the inverse [^[:word:]]
\x20            hex . . . . . . . . . . . . . . . . . . see [4]
\040            octal . . . . . . . . . . . . . . . . . see [3], [4]
\x{deadbeef}    widechar codepoint specified as hex
\n              newline
\a              bell
\               escape next char
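
A few of the constructs listed above, sketched with Ruby's builtin Regexp. This is an assumption made for illustration: the post does not show the announced engine's own API, so the builtin engine (which the overview says it is very compatible with) stands in for it here. The sample strings are made up.

```ruby
# Illustration only: uses Ruby's builtin Regexp, not the regexp-engine API.
text = "<b>bold</b> and <i>italic</i>"

# Greedy * takes as much as possible, lazy *? as little as possible.
greedy = text[/<.*>/]    # spans from the first "<" to the last ">"
lazy   = text[/<.*?>/]   # stops at the first ">"

# Lookahead and lookbehind check the context without consuming it.
price = "cost: 42 USD"[/\d+(?= USD)/]
after = "key=value"[/(?<=key=)\w+/]

# \1 refers back to whatever the first capturing group matched.
doubled = "hello"[/(.)\1/]

# Atomic grouping (?> ... ) forbids backtracking into the group,
# so a+ cannot give back an "a" for the literal "ab" that follows.
plain  = "aaab" =~ /a+ab/
atomic = "aaab" =~ /(?>a+)ab/

p greedy, lazy, price, after, doubled, plain, atomic
```

The last two lines are the interesting pair: the plain pattern matches because `a+` backtracks by one character, while the atomic version finds no match at all.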

precedence between operators (highest first):
()              pattern memory
+ * ? {}        number of occurrences
^ $ \b \B       pattern anchors
|               alternatives
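
To make the precedence list concrete, here is a small sketch, again assuming Ruby's builtin Regexp rather than the announced engine's API. The patterns and strings are invented for the example.

```ruby
# Illustration only: uses Ruby's builtin Regexp, not the regexp-engine API.

# Alternation binds loosest, so each anchor belongs to one branch only:
# /^cat|dog$/ reads as (^cat) or (dog$), not ^(cat or dog)$.
loose = /^cat|dog$/
tight = /^(?:cat|dog)$/

loose_hit = loose.match?("catalog")   # the "^cat" branch matches
tight_hit = tight.match?("catalog")   # whole string must be cat or dog

# Quantifiers bind tighter than sequence: in ab+ only "b" is repeated.
# Grouping binds tightest, so (ab)+ repeats the pair.
single = "ababab"[/ab+/]
paired = "ababab"[/(ab)+/]

p loose_hit, tight_hit, single, paired
```

Wrapping the alternation in a non-capturing group is the usual way to make the anchors apply to every branch.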


1. Variable-width lookbehind is fairly well supported by this engine.
For instance, (?<=(a.*)g) is a valid expression.
Beware that the leftmost-longest rule is inverted inside lookbehind,
and that backreferences are not possible (yet).

2. Subcaptures inside negative-lookahead/behind are empty
at the moment.

3. If one tries to backreference a nonexistent capture, it
is interpreted as an octal escape instead.

4. When the encoding is ASCII, you can specify hex/octal values in
the range 0-255. When the encoding is UTF8, however, only the
range 0-127 is valid; the range 128-255 is undefined.
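
Note [4] follows from how UTF-8 works: a lone byte in the range 128-255 is not a complete UTF-8 sequence. The sketch below checks this with Ruby's String encoding methods. It assumes the "ASCII" mode of the post behaves like a plain 8-bit single-byte encoding (modelled here with Encoding::BINARY); `valid_as?` is a hypothetical helper, not part of the engine.

```ruby
# Hypothetical helper: is a single byte a valid string in the given encoding?
def valid_as?(byte, encoding)
  [byte].pack("C").force_encoding(encoding).valid_encoding?
end

# Assumption: the post's "ASCII" mode is an 8-bit single-byte encoding.
high_in_binary = valid_as?(0xE9, Encoding::BINARY)
high_in_utf8   = valid_as?(0xE9, Encoding::UTF_8)
low_in_utf8    = valid_as?(0x7F, Encoding::UTF_8)

# The codepoint 0xE9 itself needs two bytes in UTF-8.
codepoint_bytes = [0xE9].pack("U").bytes

p high_in_binary, high_in_utf8, low_in_utf8, codepoint_bytes
```

This is why a byte escape such as \xE9 has no well-defined meaning in a UTF8 pattern, while the codepoint form \x{...} from the syntax table does.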




Call For Help
=============

Please get in touch if you are interested in perl6 regexps.
Please get in touch if you know about Asian text encodings.
 

Gavin Kistner

> Encodings supported: ASCII, UTF8.
> (?: ... )   non-capturing group
> (?> ... )   atomic grouping
> (?= ... )   positive-lookahead
> (?! ... )   negative-lookahead  . . . see [2]
> (?<= ... )  positive-lookbehind . . . see [1]
> (?<! ... )  negative-lookbehind . . . see [1], [2]

w00t! Thanks so much, Simon!

How is the performance of yours vs. built-in? (On features which they
both support.)
 

Simon Strandgaard

>> Encodings supported: ASCII, UTF8.
>> (?: ... )   non-capturing group
>> (?> ... )   atomic grouping
>> (?= ... )   positive-lookahead
>> (?! ... )   negative-lookahead  . . . see [2]
>> (?<= ... )  positive-lookbehind . . . see [1]
>> (?<! ... )  negative-lookbehind . . . see [1], [2]
>
> w00t! Thanks so much, Simon!

I am happy you like it.. yesterday I added support for
UTF-16BE and UTF-16LE. Now I'm working on perl6 syntax.

> How is the performance of yours vs. built-in? (On features which they
> both support.)

Performance hasn't really been benchmarked yet.
However, we can compare the test-suite timings against those of
Ruby's builtin (GNU) engine..

First, engine 0.11:
'test_blackbox_p5.rb' takes 16.86 seconds for ~400 tests.
'test_blackbox_rubicon.rb' takes 15.93 seconds for ~1520 tests.
In total ~33 seconds for about 1900 regexps.
On average we can execute about 58 regexps per second.

Then the builtin GNU engine:
'test_engine_builtin.rb' takes 2.96 seconds for 2000 tests.
The builtin can do 675 per second.


Let's calculate how many times faster GNU is:
675 / 58 = 11.6
So GNU can do between eleven and twelve times as many operations
per second as mine.
This surprises me a little.. I thought my engine was way slower ;-)
I am thinking about reimplementing only the scanner in C++, in order
to get better performance. But first I must implement some of the
most common regexp optimizations: fastmaps and single-repeat.
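
For anyone who wants to redo the arithmetic, here is a minimal sketch that recomputes the throughput from the raw timings quoted above. The test counts are the approximate ones from the post, and the variable names are made up for the example.

```ruby
# Timings (seconds) and approximate test counts as quoted in the post.
engine_runs = {
  "test_blackbox_p5.rb"      => [16.86, 400],
  "test_blackbox_rubicon.rb" => [15.93, 1520],
}
builtin_seconds = 2.96
builtin_tests   = 2000

# Add up both engine test files, then divide tests by seconds.
engine_seconds = engine_runs.values.sum { |seconds, _count| seconds }
engine_tests   = engine_runs.values.sum { |_seconds, count| count }

engine_rate  = engine_tests / engine_seconds    # regexps per second
builtin_rate = builtin_tests / builtin_seconds  # regexps per second
ratio        = builtin_rate / engine_rate

puts format("engine:  %.1f regexps/s", engine_rate)
puts format("builtin: %.1f regexps/s", builtin_rate)
puts format("builtin is %.1fx faster", ratio)
```

With these inputs the builtin engine comes out somewhere between eleven and twelve times faster. Note that the three test files do not run the same set of regexps, so this is a rough comparison rather than a benchmark.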

Does anyone have experience with how much speed can be gained
by reimplementing a Ruby algorithm in C/C++?

My environment is:
bash-2.05b$ cat /proc/cpuinfo | grep MH
cpu MHz : 726.631
bash-2.05b$ uname -a
Linux server 2.4.25-gentoo-r1 #1 Sun Jun 6 18:09:28 CEST 2004 i686 AMD
Duron(TM)Processor AuthenticAMD GNU/Linux
bash-2.05b$ ruby18 -v
ruby 1.8.1 (2004-04-24) [i386-linux-gnu]
bash-2.05b$
 
