regular expression module?

W

Walter Roberson

I remember reading on this newsgroup, maybe about a year ago, of
a module that did a Finite State Machine implementation of "standard"
regular expressions -- i.e., just what can be built out of
concatenation, alternation, and repetition, (and perhaps
some syntactic sugar such as charactre-classes), without the perl
extensions that lift perl's regex's into a different parser class
entirely.

Unfortunately, when I tried to find the posting again, I could
not, and I could not find anything like that in CPAN.

The matter came to my mind again today when someone mentioned
the Yapp module and parsing speeds -- one of the points that
can really slow down perl's regexes is the backtracking that it
does. If one happens to be working with input whose grammar is
not overly complex, then one might be perfectly happy with
faster but less flexible RE's and a perl equivilent of
yacc and lex.

Would anyone happen to recall the name and location of that
finite-state RE module? Something like that could potentially speed
up a number of my programs.
 
A

Anno Siegel

Walter Roberson said:
I remember reading on this newsgroup, maybe about a year ago, of
a module that did a Finite State Machine implementation of "standard"
regular expressions -- i.e., just what can be built out of
concatenation, alternation, and repetition, (and perhaps
some syntactic sugar such as charactre-classes), without the perl
extensions that lift perl's regex's into a different parser class
entirely.

Unfortunately, when I tried to find the posting again, I could
not, and I could not find anything like that in CPAN.

The matter came to my mind again today when someone mentioned
the Yapp module and parsing speeds -- one of the points that
can really slow down perl's regexes is the backtracking that it
does.

Backtracking is certainly not a Perl-specific addition. Any regular
expression implementation must backtrack.
If one happens to be working with input whose grammar is
not overly complex, then one might be perfectly happy with
faster but less flexible RE's and a perl equivilent of
yacc and lex.

Would anyone happen to recall the name and location of that
finite-state RE module? Something like that could potentially speed
up a number of my programs.

I don't know the module you have in mind, but if it's a re-implementation
of regular expressions in Perl, any speedup is unlikely.

Anno
 
B

Brian McCauley

Backtracking is certainly not a Perl-specific addition. Any regular
expression implementation must backtrack.

No, the OP is right - "standard" regular expressions can be done with
a Finite State Machine.

I don't, however, recall the thread.

--
\\ ( )
. _\\__[oo
.__/ \\ /\@
. l___\\
# ll l\\
###LL LL\\
 
A

Anno Siegel

Brian McCauley said:
No, the OP is right - "standard" regular expressions can be done with
a Finite State Machine.

I may be confused, but I don't see the contradiction. The finite-state-
machine *does* the backtracking, no?

Anno
 
W

Walter Roberson

:> No, the OP is right - "standard" regular expressions can be done with
:> a Finite State Machine.

:I may be confused, but I don't see the contradiction. The finite-state-
:machine *does* the backtracking, no?

No, no backtracking is needed.

There are two major representations of finite state machines, one "DFA"
(Deterministic Finite Automata) and the other one "NDFA"
(Non-Deterministic Finite Automata). NDFA are smaller to diagram out,
and are possibly more popular. It is common for beginners to believe
that NDFA can "do more" than DFA can, but they both have the same expressive
power, and there is a mechanical procedure to convert any NDFA to an
exactly equivilent DFA. Some implimentations of NDFA use
backtracking in order to avoid the (non-trivial) overhead of doing
the NDFA -> DFA conversion; those backtracking implimentations are
often faster on simple expressions, but when one has a more complex
expression, then it becomes better to do the conversion and run the
equivilent-DFA.

DFA's never need to backtrack. Once you -have- the DFA, it runs in
time linear to the length of the input: you read the next input
symbol, index that symbol in the transition table of the current
state, and that tells you the new state number. Repeat and by the
end of processing each symbol in the string exactly -once-,
you have either "accepted" the input (in which case the pattern matched)
or else you run out of input at a non-accepting state, in which case
the pattern match failed. The difficulty, the reason this isn't
used much more often, is that if you are starting with an NDFA, the
conversion into a DFA takes exponential time. Going the
regex -> NDFA -> DFA route is thus best suited for cases in which you
expect to "compile once, run often". Too expensive of a process if
you are farting around doing on-the-fly typing in of expressions to
be processed against short files, but well worth doing if you have a
big input file.... especially if you can separate out the compiled
DFA and "save" it so that you don't have to recompile it from run
to run or when other parts of the program change.

'yacc' is a well known program that does all the work of building
up compiled DFA, complete with some kind of state-table compression
so that the result is both space and time efficient. But yacc produces
C code. I seem to recall hearing of a yacc implimentation that
had an option for spitting out perl. Or was that a 'lex' program?
I do not remember clearly.
 
U

Uri Guttman

WR> 'yacc' is a well known program that does all the work of building
WR> up compiled DFA, complete with some kind of state-table compression
WR> so that the result is both space and time efficient. But yacc produces
WR> C code. I seem to recall hearing of a yacc implimentation that
WR> had an option for spitting out perl. Or was that a 'lex' program?
WR> I do not remember clearly.

search google for perl-byacc

uri
 
P

pkent

'yacc' is a well known program that does all the work of building
up compiled DFA, complete with some kind of state-table compression
so that the result is both space and time efficient. But yacc produces
C code. I seem to recall hearing of a yacc implimentation that
had an option for spitting out perl. Or was that a 'lex' program?
I do not remember clearly.

Would you be thinking of yapp? As in:

http://search.cpan.org/~fdesar/Parse-Yapp-1.05/lib/Parse/Yapp.pm

specifically it says "The script yapp is a front-end to the Parse::Yapp
module and let you easily create a Perl OO parser from an input grammar
file." and "Parse::Yapp should compile a clean yacc grammar without any
modification..."

P
 
W

Walter Roberson

:In article <[email protected]>,
: (e-mail address removed)-cnrc.gc.ca (Walter Roberson) wrote:

:<snip interesting stuff about regexes>

:> 'yacc' is a well known program that does all the work of building
:> up compiled DFA, complete with some kind of state-table compression
:> so that the result is both space and time efficient. But yacc produces
:> C code. I seem to recall hearing of a yacc implimentation that
:> had an option for spitting out perl. Or was that a 'lex' program?
:> I do not remember clearly.

:Would you be thinking of yapp? As in:

:http://search.cpan.org/~fdesar/Parse-Yapp-1.05/lib/Parse/Yapp.pm

I was not, but thank you for the reference. I seem to recall seeing
[years ago] a version of yacc with an explicit perl option. I might
be able to find it again if I looked around.

But getting back to my original question: although Yapp does appear to
have a lot of functionality useful for my purposes, the module that
I am remembering as having seen mentioned, was, as best I recall,
just an RE (as opposed to regex) compiler and FSM driver, without the
full LALR parsing facilities of yacc/yapp. Useful if, for example,
one had a large number of simple RE's (or just plain strings) to match
against. join('|', @wordlist) is not very efficient to match against
in perl, even if one qr's it, as perl backtracks because it assumes
later portions of the expression might require the full regexp power.

Maybe I should be looking more closely at the regexp optimizer... but
boy does it produce ugly expressions! ;-)
 
J

Jim Cochrane

:In article <[email protected]>,
: (e-mail address removed)-cnrc.gc.ca (Walter Roberson) wrote:

:<snip interesting stuff about regexes>

:> 'yacc' is a well known program that does all the work of building
:> up compiled DFA, complete with some kind of state-table compression
:> so that the result is both space and time efficient. But yacc produces
:> C code. I seem to recall hearing of a yacc implimentation that
:> had an option for spitting out perl. Or was that a 'lex' program?
:> I do not remember clearly.

:Would you be thinking of yapp? As in:

:http://search.cpan.org/~fdesar/Parse-Yapp-1.05/lib/Parse/Yapp.pm

I was not, but thank you for the reference. I seem to recall seeing
[years ago] a version of yacc with an explicit perl option. I might
be able to find it again if I looked around.

But getting back to my original question: although Yapp does appear to
have a lot of functionality useful for my purposes, the module that
I am remembering as having seen mentioned, was, as best I recall,
just an RE (as opposed to regex) compiler and FSM driver, without the
full LALR parsing facilities of yacc/yapp. Useful if, for example,
one had a large number of simple RE's (or just plain strings) to match
against. join('|', @wordlist) is not very efficient to match against
in perl, even if one qr's it, as perl backtracks because it assumes
later portions of the expression might require the full regexp power.

Maybe I should be looking more closely at the regexp optimizer... but
boy does it produce ugly expressions! ;-)

Sorry to ask the obvious, but did you try some carefully-worded google
searches?
 
W

Walter Roberson

:> But getting back to my original question: although Yapp does appear to
:> have a lot of functionality useful for my purposes, the module that
:> I am remembering as having seen mentioned, was, as best I recall,
:> just an RE (as opposed to regex) compiler and FSM driver, without the
:> full LALR parsing facilities of yacc/yapp.

:Sorry to ask the obvious, but did you try some carefully-worded google
:searches?

Yes, I spent a couple of hours looking for what I thought I remembered
existing, including examining the thread topic of everything in the
time range that I thought I had seen the particular message in.

It is possible that I did not happen upon the right keywords when I
did the search, but I did try a variety of approaches, and my ability
to elicit useful information from google is usually quite good.
 
W

Walter Roberson

:search google for perl-byacc

Thanks for the reference.

When I look around, I see a debian RPM at about the 2.0.5 rev
level, and I see (in a small number of places) the (1993) source for the
1.8.2 rev level. When I look in cpan, I see a distribution which
is named perl5-byacc-patches, but not perl-byacc itself.

Would you happen to have bookmarked for a 2.0.x source that isn't
an RPM?
 
U

Uri Guttman

WR> In article <[email protected]>,
WR> :search google for perl-byacc

WR> Thanks for the reference.

WR> Would you happen to have bookmarked for a 2.0.x source that isn't
WR> an RPM?

i have never used it. i just remembered the name byacc and that it had a
perl output option and googled.

uri
 
W

Walter Roberson

:i have never used it. i just remembered the name byacc and that it had a
:perl output option and googled.

Thanks, I was able to find enough pieces to put it together.
I have not compared it to Parse::Yapp as yet, though.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,901
Latest member
Noble71S45

Latest Threads

Top