Alternate Regular Expressions?

Ari Brown · Aug 7, 2007

Just randomly curious -

Is there an alternate RegExp "language" to the current one in Ruby
and Perl?

Don't get me wrong, I love the current RegExp in Ruby, but I'm
allowed to be curious...

Also, is Ruby going to jump on the Perl 6 RegExp ship?

^^^^^^^ That's a big one to some people I know.

Thanks,
~ Ari
English is like a pseudo-random number generator - there are a
bajillion rules to it, but nobody cares.

Phlip · Aug 7, 2007

Ari said:
Just randomly curious -

Is there an alternate RegExp "language" to the current one in Ruby and
Perl?

I don't know. So here's a dissertation on where to start.

The good news is a RegExp is only two things at heart...

- a Domain-Specific Language to program
- a state machine.

The bad news is, back in the day, people used to invent DSLs as long strings
of easily parsed characters. For example, a language called LSYSTEM might
describe turtle graphics like this:

s=[::cc!!!!&&[FFcccZ]^^^^FFcccZ] # upper spikes

The really bad news is RegExp is one of these string-oriented DSLs that
stuck. It will always be useful, so programmers forget how much room it has
for improvement.

The good news is Ruby excels at generating light DSLs. The equivalent
expression for a modern implementation of LSYSTEM might look like this:

upper_spikes = push.twist(2).thinner(2).increase_angle(4)....

etc. Because Ruby gives your programming interfaces extreme notational
flexibility, you can declare the interfaces most convenient for your domain.

So start writing, and research other DSLs as you go. For example, here's a
DSL written with C++ metaprogramming:

http://boost-sandbox.sourceforge.net/libs/xpressive/doc/html/index.html

Whenever you like, that language slips back to raw RegExp. Your effort
should have a similar shunt.

English is like a pseudo-random number generator - there are a bajillion
rules to it, but nobody cares.

Of all the world's languages, English is both the ugliest and the
beautifulest.

Ari Brown · Aug 7, 2007

So start writing! and research other DSLs as you go.

Ugh. If I must (which I must). What would you suggest as syntax?

Also, should I completely try to reinvent the wheel, or create a
wrapper for current RegExp?

Man. I need a mentor on this :-|

aRi
--------------------------------------------|
IMO, Arabic has THE most beautiful script.
Poetically, English is extremely beautiful. It's like a language of
RegExp - except there are no rules!
Spoken, the most beautiful language is either French (sorry) or
Esperanto.

Tim Hunter · Aug 7, 2007

Ari said:
Ugh. If I must (which I must). What would you suggest as syntax?

Also, should I completely try to reinvent the wheel, or create a
wrapper for current RegExp?

Man. I need a mentor on this :-|

This might give you a place to start:
http://en.wikipedia.org/wiki/Parsing_expression_grammar

Kenneth McDonald · Aug 7, 2007

Ari,

How serious are you about this? Several years ago I wrote a Python library
that treats Python regular expressions as semantic, not syntactic, objects,
and that has been incredibly useful to me. I've started to port it to Ruby,
but simply don't have the time. If you do (you're probably looking at a couple
of weeks of full-time-equivalent hours to do a good job, including decent
documentation), I'm happy to pass on the Python code, the Ruby code, and give
advice and so on.

To help you evaluate this, and also as a potential source of ideas in case you
do something else, I've appended my (probably out of date) intro text to the
library at the bottom of this reply.

Cheers,
Ken

Text from the _Python_ library (in retrospect, I would do quite a bit
differently):

Overview
========

'rex' provides regular expression and parsing facilities. It uses (and is
intended to functionally replace) the Python 're' module.

Regular expression functionality is provided through the '_Rexp' and
'MatchResult' classes, and the CHAR, REP0, REP1, OPT, PAT, and ALT constructs.
These constructs can be used as or provide functions to create rexps, and also
define attributes for commonly used rexps. (For example, PAT.float provides a
rexp which matches basic floating-point (no exponent) numbers.)

Pattern-Matching Example
----------------------

If you are familiar with regular expressions, the following will probably make
at least some sense. If you are not, skip this example for now. In either
case, come back to it once you have read the formal definitions of functions
and constructs provided by rex.

COMPLEX = PAT.float['re'] + \
          REP0.whitespace + \
          ALT("+", "-")['op'] + \
          REP0.whitespace + \
          PAT.float['im'] + \
          'i'

The above example defines a pattern which will match complex numbers of the
form "-2.718 + 3.14i". It uses the predefined match expressions PAT.float and
REP0.whitespace to ease the definition. Applied to the example complex number
string, the result will contain three named substrings: 're' will map to
"-2.718", 'op' will map to "+", and 'im' will map to "3.14".

SEQ is an alternative form of joining rexps; the above is equivalent to:

COMPLEX = SEQ(
    PAT.float['re'],
    REP0.whitespace,
    ALT("+", "-")['op'],
    REP0.whitespace,
    PAT.float['im'],
    'i'
)
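
For comparison with the rest of this thread, here is a rough sketch of the same match written as a plain Ruby regex with named groups. The `float_re` pattern is only an assumed stand-in for PAT.float (the library's exact definition isn't shown here), and both variable names are made up for the illustration.

```ruby
# Assumed stand-in for PAT.float: optional sign, digits, optional fraction,
# no exponent. The real PAT.float may differ.
float_re = /-?\d+(?:\.\d+)?/

# Named groups play the role of ['re'], ['op'] and ['im'].
complex_re = /(?<re>#{float_re})\s*(?<op>[+-])\s*(?<im>#{float_re})i/

m = complex_re.match("-2.718 + 3.14i")
puts m[:re]   # real part
puts m[:op]   # operator
puts m[:im]   # imaginary part
```

The named groups do the same job as the rex names, but the pieces are still glued together inside one string-like literal, which is the limitation the library is trying to get away from.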

Regular Expressions
---------------

This is an introduction to using the pattern-matching
(regular-expression-related) part of rex. See documentation associated
with a specific method/function/name for details on that entity.

In the following, we use the abbreviation RE to refer to standard regular
expressions defined as strings, and the word 'rexp' to refer to rex objects
which denote regular expressions.

The starting point for building a rexp is either rex.PAT, which we'll just
call PAT, rex.CHAR, which we'll just call CHAR, or rex.LIT.
CHAR provides rexps defining a set of characters, which will match a
single-character string if that character is in the given set. In addition to
providing attributes for prebuilt character sets, the CHAR function may be
used to define your own character sets.

LIT builds rexps which match strings of varying lengths.

REP0 and REP1 denote zero or more and one or more repetitions, respectively.

Also:

- PAT._someattribute_ returns (for defined attributes) a corresponding rexp.
For example, PAT.stringstart returns a rexp matching at the start of a
string.

- CHAR(a1, a2, . . .) returns a rexp matching a single character from a set
of characters defined by its arguments. For example,
CHAR("-", ["0","9"], ".")
matches the characters necessary to build basic floating point numbers.
See CHAR docs for details.

- CHAR._someattribute_ returns (for defined attributes) a corresponding rexp
defining a set of characters.
For example, CHAR.digit returns a rexp matching a single digit.

Now assume that X, Y, Z, ... are rexps. The following Python expressions
(_not_ strings) may be used to build more complex rexps:

- X | Y | Z . . . : returns a rexp which matches a string if any of the
operands match that string. Similar to "X|Y|Z" in normal REs, except of
course you can't use Python code to define a normal RE.

- X + Y + Z ...: returns a rexp which matches a string if all of X, Y, Z
match consecutive substrings of the string in succession. Like "XYZ" in
normal REs.

- X*n : returns a rexp which matches a number of times as defined by n.
This replaces '?', '+', and '*' as used in normal REs. See docs for details.
'rex' defines constants which allow you to say X*REP0, X*REP1, or X*MAYBE,
indicating (0 or more matches), (1 or more matches), or (0 or 1 matches),
respectively.

- X**n : Like X*n, but does nongreedy matching.

- +X : positive lookahead assertion: matches if X matches, but doesn't
consume any of the input.

- ~+X : negative lookahead assertion: matches if X _doesn't_ match,
but doesn't consume any of the input.

- -X, ~-X : positive and negative lookback assertions. Like lookahead
assertions, but in the other direction.

- X[name] : name must be a string: anything matched by X can be referred
to by the given name in the match result object. (This is the equivalent
of named groups in the re module).

- X.group() : X will be in an unnamed group, referable by number.

In addition, a few other operations may be performed:

- Some of the attributes defined in PAT have "natural inverses"; for such
attributes, the inverse may be taken. For example, ~PAT.digit is
a pattern matching any character except a digit.

- Character classes may be inverted: ~CHAR("aeiouAEIOU") returns a pattern
matching any character except a vowel.

- 'ALT' gives a different way to denote alternation: ALT(X, Y, Z,...) does
the same thing as X | Y | Z | . . ., except that none of the arguments
to ALT need be rexps; any which are normal strings will be converted
to a rexp using PAT.

- 'SEQ' can take multiple arguments: PAT(X, Y, Z,...), which gives the same
result as PAT(X) + PAT(Y) + PAT(Z) + . . . .

Finally, a very convenient shortcut is that only the first object in a
sequence of operator/method calls needs to be a rexp; all others will be
automatically converted as if LIT(...) had been called on them. For example,
the sequence X | "hello" is the same as X | LIT("hello").
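
Since the question is about Ruby, here is a small sketch of how a few of these operators (sequence, repetition, naming, and lookahead) might carry over. It is not the rex API: the `Rexp` class, its method names, and the `:rep0`/`:rep1`/`:maybe` symbols are all hypothetical, chosen only to mirror the descriptions above.

```ruby
# Hypothetical Rexp class: holds a regex source string and returns new
# Rexp objects from each operator, so fragments can be combined freely.
class Rexp
  REPEAT = { rep0: "*", rep1: "+", maybe: "?" }.freeze

  attr_reader :source

  def initialize(source)
    @source = source
  end

  # Literal text, escaped so metacharacters lose their special meaning.
  def self.lit(text)
    new(Regexp.escape(text))
  end

  # Plain strings are promoted to literals, as in the LIT shortcut.
  def self.coerce(obj)
    obj.is_a?(Rexp) ? obj : lit(obj.to_s)
  end

  # X + Y : sequence. Each side is wrapped in a non-capturing group.
  def +(other)
    Rexp.new("(?:#{source})(?:#{Rexp.coerce(other).source})")
  end

  # X * :rep0 / :rep1 / :maybe : repetition.
  def *(kind)
    Rexp.new("(?:#{source})#{REPEAT.fetch(kind)}")
  end

  # X[name] : named group.
  def [](name)
    Rexp.new("(?<#{name}>#{source})")
  end

  # +X : positive lookahead.
  def +@
    Rexp.new("(?=#{source})")
  end

  def to_regexp
    Regexp.new(source)
  end
end

digit     = Rexp.new("[0-9]")
dollars   = (digit * :rep1)["dollars"]
dot_ahead = +(Rexp.lit("."))            # a dot must follow, but is not consumed
price     = Rexp.lit("$") + dollars + dot_ahead

m = price.to_regexp.match("cost: $42.50")
puts m[:dollars]
```

Ruby allows `+`, `*`, `[]`, and unary `+@` to be defined as ordinary methods, so most of the notation above has a direct counterpart.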

Phlip · Aug 7, 2007

Ari said:
Ugh. If I must (which I must).

You missed where I said I didn't know the actual answer.

What would you suggest as syntax?

Ruby itself, as a DSL; that was the point.

rx = match('foo') | match('bar') # like /(foo|bar)/
assert_equal [['foo', 'bar']], rx('a foo b bar')

Make match() return an object that overloads the | operator, and away you
go!
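
A minimal sketch of that idea follows. It uses `|` because Ruby's `or` is a keyword rather than a method and cannot be overloaded. The `Pattern` class and its `scan` method are hypothetical names made up for the illustration.

```ruby
# Hypothetical Pattern class: wraps a regex source and overloads |.
class Pattern
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # Alternation: combine two patterns into one.
  def |(other)
    Pattern.new("#{source}|#{other.source}")
  end

  # Return every substring of text that the pattern matches.
  def scan(text)
    text.scan(Regexp.new(source))
  end
end

# Build a pattern that matches the given text literally.
def match(literal)
  Pattern.new(Regexp.escape(literal))
end

rx = match('foo') | match('bar')   # like /foo|bar/
p rx.scan('a foo b bar')
```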

Ari Brown · Aug 7, 2007

I'm moderately serious. This is going to be one of those projects
that won't see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

matching email addresses (sample):
a = LeetExp.new(
  :letters => [[a-z], :insensitive],
  :string => "@",
  :letters => [[a-z], :insensitive],
  :string => ".",
  :string => ["com", "net", "org", "edu"]
)

case line
when a
# ...

My idea is to make it logical and human readable. Ruby is a language
for humans and UberBeings, and I think this should reflect Ruby's ideas.

Also, was your library a wrapper for the underlying Perl RegExp, or was it
the whole RegExp engine?

Thanks,
Ari

--------------------------------------------|
If you're not living on the edge,
then you're just wasting space.

Kenneth McDonald · Aug 7, 2007

Ari said:
I'm moderately serious. This is going to be one of those projects that
won't see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

matching email addresses (sample):
a = LeetExp.new(:letters => [[a-z], :insensitive],
:string => "@",
:letters => [[a-z], :insensitive],
:string => ".",
:string => ["com", "net", "org", "edu"]
)

case line
when a
# ...

My idea is to make it logical and human readable. Ruby is a language
for humans and UberBeings, and I think this should reflect Ruby's ideas.

Reflecting on my own experience, I'd suggest a less verbose notation,
and one that uses Ruby idioms more. For example:

letters = CharClass.new('a'..'z').case_insensitive
a = letters + "@" + letters + "." +
    (Literal.new("com") | "net" | "org" | "edu")

It's not at all difficult to do this with Ruby. Strings can be used for
literals and character classes, and ranges are perfect for use as char ranges
in character classes.

Also, the ability to safely combine regular expressions (as shown above,
where "letters" is used in "a") is _paramount_ in making this sort of wrapper
really useful.
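
One way this notation could be backed by real code is sketched below. The class names follow the example above, but their behavior is guessed. The `Rx` base class and the `one_or_more` call are additions for the sketch (a character class on its own matches a single character, so something has to repeat it).

```ruby
# Hypothetical base class: every operation returns a new Rx, and every
# fragment is wrapped in a group so it can be combined safely.
class Rx
  attr_reader :source

  def initialize(source)
    @source = source
  end

  # Plain strings are promoted to escaped literals.
  def self.wrap(obj)
    obj.is_a?(Rx) ? obj : Literal.new(obj)
  end

  # Sequence.
  def +(other)
    Rx.new("(?:#{source})(?:#{Rx.wrap(other).source})")
  end

  # Alternation.
  def |(other)
    Rx.new("#{source}|#{Rx.wrap(other).source}")
  end

  # Case-insensitivity applies to this fragment only, via (?i:...).
  def case_insensitive
    Rx.new("(?i:#{source})")
  end

  def one_or_more
    Rx.new("(?:#{source})+")
  end

  def to_regexp
    Regexp.new(source)
  end
end

class Literal < Rx
  def initialize(text)
    super(Regexp.escape(text))
  end
end

class CharClass < Rx
  # Builds a single-character class from a Range such as 'a'..'z'.
  def initialize(range)
    super("[#{Regexp.escape(range.first)}-#{Regexp.escape(range.last)}]")
  end
end

letters = CharClass.new('a'..'z').case_insensitive.one_or_more
email   = letters + "@" + letters + "." +
          (Literal.new("com") | "net" | "org" | "edu")

p "Write to Bob@Example.org today"[email.to_regexp]
```

Because the `(?i:...)` group is local to `letters`, the top-level domains stay case-sensitive even though the rest of the pattern is not.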

Also, was your library a wrapper for the underlying Perl RegExp, or was it
the whole RegExp engine?

It was in Python; instances of my 'rex' class simply construct and use
Python patterns, and their associated functions, internally and invisibly to
the user.

Ken

Robert Klemme · Aug 7, 2007

I'm moderately serious. This is going to be one of those projects that
won't see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

matching email addresses (sample):
a = LeetExp.new(:letters => [[a-z], :insensitive],
:string => "@",
:letters => [[a-z], :insensitive],
:string => ".",
:string => ["com", "net", "org", "edu"]
)

You cannot do this because Hashes are unordered, so you lose the
original order. Also, [a-z] is only valid if you define local variables
a and z.
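
The ordering point applies to the Ruby of this thread's era (1.8). Since Ruby 1.9 a Hash does preserve insertion order, but a second problem remains either way: keys are unique, so the repeated :letters and :string entries overwrite each other. A quick sketch, using simplified placeholder values rather than the ones from the proposal:

```ruby
# An array of [key, value] pairs keeps both the order and the duplicates.
pairs = [
  [:letters, "a-z"],
  [:string,  "@"],
  [:letters, "a-z"],
  [:string,  "."],
  [:string,  "com"],
]

# Converting to a Hash keeps only the last value for each key.
as_hash = pairs.to_h

p pairs.length
p as_hash.length
p as_hash[:string]
```

So a wrapper along these lines would need to take an array (or a sequence of method calls) rather than a single hash.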

Personally I find regular expressions pretty readable - at least if they
are crafted properly.

See also below.

case line
when a
# ...

My idea is to make it logical and human readable. Ruby is a language for
humans and UberBeings, and I think this should reflect Ruby's ideas.

Do you know the /x modifier? That can go a long way to make a regular
expression readable. For example:

input = <<TEXT
adjasdkajda dadkajd (e-mail address removed) adklskkdaldjskj
(e-mail address removed) adkjasdjk
blah@org akjsd askdl asd (e-mail address removed) hello
asdj
TEXT

input.scan %r{
  \b                   # word boundary
  (?i:[a-z]+)          # user name
  @                    # the famous "at" sign
  (?i:[a-z]+)          # host name
  \.                   # a literal dot
  (?:com|net|org|edu)  # only some of the TLDs
  \b                   # word boundary
}x do |match|
  puts "Found email address #{match}"
end

Kind regards

robert

dblack · Aug 7, 2007

Hi --

I'm moderately serious. This is going to be one of those projects that won't
see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

matching email addresses (sample):
a = LeetExp.new(:letters => [[a-z], :insensitive],
:string => "@",
:letters => [[a-z], :insensitive],
:string => ".",
:string => ["com", "net", "org", "edu"]
)

case line
when a
# ...

My idea is to make it logical and human readable. Ruby is a language for
humans and UberBeings, and I think this should reflect Ruby's ideas.

Regular expressions are nothing if not logical. And readability,
as always, is largely in the eye of the beholder. I think the quest
for an alternative notation is fine, but there's nothing inherently
un-Ruby-like about what's there already. Then again, I'm in a small
minority who find /x with a lot of extra whitespace a serious
impediment to understanding a pattern.

Anyway -- somewhere out there, though I haven't been able to find it,
is a library called Regexp::English by Florian Gross, which provides a
kind of English-language wrapper for regexes. I don't know whether
it's still in development and/or at a point of usability.

David

--
* Books:
RAILS ROUTING (new! http://www.awprofessional.com/title/0321509242)
RUBY FOR RAILS (http://www.manning.com/black)
* Ruby/Rails training
& consulting: Ruby Power and Light, LLC (http://www.rubypal.com)

Wolfgang Nádasi-Donner · Aug 7, 2007

Ari said:
Is there an alternate RegExp "language" to the current one in Ruby
and Perl?

Snobol4 patterns are now available as a Python library. It should be
possible to port it to Ruby. I don't think that the implementation is
complete, because I didn't see the possibility of recursive pattern
definitions, which give Snobol4 its extreme power.

Info:

http://permalink.gmane.org/gmane.comp.python.announce/7217 (Snobol4 in
Python)

http://en.wikipedia.org/wiki/SNOBOL (has some links)

Wolfgang Nádasi-Donner

Yossef Mendelssohn · Aug 7, 2007

I'm moderately serious. This is going to be one of those projects
that won't see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

matching email addresses (sample):
a = LeetExp.new(:letters => [[a-z], :insensitive],
:string => "@",
:letters => [[a-z], :insensitive],
:string => ".",
:string => ["com", "net", "org", "edu"]
)

case line
when a
# ...

My idea is to make it logical and human readable. Ruby is a language
for humans and UberBeings, and I think this should reflect Ruby's ideas.

Ari,

There have been other responses to this already, but I thought I'd
give you something else to look at:

http://bofh.org.uk/articles/2007/03/15/thats-not-fluent

I second (or third, or whatever) the contention that regular
expressions are pretty readable on their own (given some knowledge of
the syntax and good formatting). The thing to keep in mind is that
they're a language of their own. Once you learn the language, you
find you can use it in many a programming language (though there are
some dialectal problems here and there).

Simon Strandgaard · Aug 7, 2007

dblack said:
Anyway -- somewhere out there, though I haven't been able to find it,
is a library called Regexp::English by Florian Gross, which provides a
kind of English-language wrapper for regexes. I don't know whether
it's still in development and/or at a point of usability.

A long time ago I wrote a regexp engine that was 96% compatible with Ruby's
at that point in time. Maybe it's useful to somebody?
http://raa.ruby-lang.org/project/regexp/

John Joyce · Aug 7, 2007

Ari,

Do it!
Excellent project, even if it fails in the long run, or if you pass
it off to somebody else.

I like the Rails-like hash-looking idea. Of course you would need
some ordering, so it would need to be some kind of array or struct,
but it is an idea worth toying with.

Ari Brown · Aug 7, 2007

I was actually 2 unread emails away from writing the list, thanking
everyone for their help, and that I would only write the wrapper if
someone really wanted me to.

Looks like I'm writing it.

-Ari

Ari
-------------------------------------------|
Nietzsche is my copilot

Kenneth McDonald · Aug 7, 2007

Speaking as someone who has actually written and used (in Python) a more
abstract regex library, the biggest problem with regular expressions in most
languages isn't the syntax, but rather the inability to easily compose small
REs into larger REs. Which is why so many programs end up with huge,
unreadable REs. As a small example, it's really nice (and obvious) to be able
to say

re3 = re1 + re2

instead of

re3 = "(?:#{re1})(?:#{re2})"

And the advantages go well beyond the convenience illustrated in the
above example...

Also, I think that people who are accustomed to regular expressions (or any
other DSL) tend to forget about the problems with that DSL: the need for
newcomers to learn another syntax, the inability to use standard language
tools with the DSL, and so on.

So, though I've used REs for years, I certainly don't agree with the
contention that "REs are actually pretty good". RE syntax in RE languages is
optimized for quickly entering one-time REs on the command line, not for
building robust REs that can be easily maintained by other programmers. It's
the difference between weird, Perl-style variables and meaningful variable
names. A good abstract wrapper in Ruby would be very useful.

Ken

Ari Brown · Aug 7, 2007

Thanks! I was actually thinking about this myself.

Please, people, send an email if you want to see something in a Ruby
RegExp wrapper. Don't be shy. If I can get drafted into making this,
then you can tell me what you want to see.

Thanks,
Ari

--------------------------------------------|
If you're not living on the edge,
then you're just wasting space.

Daniel DeLorme · Aug 8, 2007

Kenneth said:
Speaking as someone who has actually written and used (in Python) a more
abstract regex library, the biggest problem with regular expressions in most
languages isn't the syntax, but rather the inability to easily compose small
REs into larger REs. Which is why so many programs end up with huge,
unreadable REs. As a small example, it's really nice (and obvious) to be able
to say

re3 = re1 + re2

I agree with this and that's why I have the following add-on in my
standard lib:

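class Regexp
  # Concatenate this pattern with another Regexp or with literal text.
  def +(other)
    if other.is_a?(Regexp)
      if self.options == other.options
        # Same flags: join the raw sources.
        Regexp.new(source + other.source, options)
      else
        # Different flags: to_s embeds the other pattern's flags inline.
        Regexp.new(source + other.to_s, options)
      end
    else
      # Anything else is treated as literal text and escaped.
      Regexp.new(source + Regexp.escape(other.to_s), options)
    end
  end
end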

It could easily be improved so that, for example, a range would get
appended as a character class, etc.
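
A rough sketch of that improvement, written as a standalone helper rather than as a change to the add-on above. The name `append_pattern` is made up, and unlike the add-on it always wraps the left-hand pattern in a group, which is a design choice of this sketch.

```ruby
# Hypothetical helper: append a Regexp, a Range, or literal text to a pattern.
def append_pattern(re, other)
  piece =
    case other
    when Regexp
      other.to_s   # to_s embeds the other pattern's own flags
    when Range
      # A range becomes a character class, e.g. 'a'..'f' gives [a-f].
      "[#{Regexp.escape(other.begin.to_s)}-#{Regexp.escape(other.end.to_s)}]"
    else
      Regexp.escape(other.to_s)
    end
  # Group the left side so any alternation in it stays contained.
  Regexp.new("(?:#{re.source})#{piece}", re.options)
end

hex_id = append_pattern(append_pattern(/id-/, 'a'..'f'), 0..9)
p hex_id.source
p hex_id.match?("id-c7")
p hex_id.match?("id-z7")
```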

Daniel

Michael W. Ryder · Aug 8, 2007

Ari said:
I'm moderately serious. This is going to be one of those projects that
won't see the light of day for maybe 6 months to a year.
This looks largely like what I was hoping to make, although in Ruby I had
envisioned this:

matching email addresses (sample):
a = LeetExp.new(:letters => [[a-z], :insensitive],
:string => "@",
:letters => [[a-z], :insensitive],
:string => ".",
:string => ["com", "net", "org", "edu"]
)

Another way to do something like this is to use a "compound" regular
expression where each part continues where the last one ended.
Something like \@\.\* where $1 would be everything up to the @ sign,
i.e. the name. $2 would be everything between the @ and the ., i.e. the
ISP name. And $3 would be the remainder of the address. The only time
something like this would fail would be an address like mine, where the
ISP name is worldnet.att rather than just worldnet.
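
The pattern in the post looks like it lost some characters in transit, so the following is only one possible reading of it, with three capture groups. The anchors, the character classes, and the sample address are all assumptions made for the illustration.

```ruby
# Assumed reconstruction: name, then "@", then ISP name up to the first dot,
# then the remainder of the address.
address = /\A([^@]+)@([^.]+)\.(.+)\z/

m = address.match("someone@worldnet.att.net")
puts m[1]   # everything up to the @ sign
puts m[2]   # everything between the @ and the first dot
puts m[3]   # the remainder of the address
```

With a dotted ISP name, the second group stops at the first dot and the rest spills into the third group, which is the failure case described above.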

Stefan Rusterholz · Aug 8, 2007

Kenneth said:
the biggest problem with regular expressions in most languages isn't the
syntax, but rather the inability to easily compose small REs into larger REs.
Which is why so many programs end up with huge, unreadable REs.

You can do that in Ruby rather simply:
# example taken from an example earlier in this thread
name = /[a-z]+/i
host = /[a-z]+/i
tld = /com|net|org|edu/
input.scan(%r{\b#{name}@#{host}\.#{tld}\b}) do |match|
  puts "Found email address #{match}"
end
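
Interpolation composes safely here because Regexp#to_s wraps each pattern in a non-capturing group that carries its own flags. A short sketch of what gets spliced in (the sample string is made up):

```ruby
name = /[a-z]+/i
tld  = /com|net|org|edu/

puts name.to_s   # (?i-mx:[a-z]+)
puts tld.to_s    # (?-mix:com|net|org|edu)

# The /i flag stays attached to name, and the alternation in tld stays
# grouped, even though the outer pattern has no flags of its own.
combined = %r{\b#{name}@#{name}\.#{tld}\b}
p "Mail BOB@Example.org now".scan(combined)
```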

Regards
Stefan
