How would you design regexps in the integer domain?

  • Thread starter Andreas Launila
  • Start date
A

Andreas Launila

I'm trying to come up with a clean way to specify regexps in the integer
domain. I.e. instead of describing a pattern of characters (as in normal
regexps) one describes patterns of integers ("17 followed by 3 or 15"
rather than "'a' followed by 'b' or 'c'").

My main objective is to make the syntax intuitive to Ruby users. I have
been toying around with a few different approaches, but I'm not sure if
any meet the goal. How would you design the syntax for regexps in the
integer domain?

The syntax does not need to support especially many operators:
* Kleene star ('*' in character regexps)
* "At least once" ('+' in character regexps)
* Alternation ('|' in character regexps, i.e. "a|b" being "'a' or 'b'")

Below are some of the approaches one could take.


== String representation

Many Ruby users have used character regexps defined as strings, so it
would seem like a good idea to make integer regexps as similar as
possible. A delimiter would be needed though, since "17" could otherwise
mean "1 followed by 7" or just "17". One delimiter could for instance be
a blank space. The following is how the pattern "Any number of (17 or, 1
followed by 5) followed by 4711" could look.

IntRegexp.new('(17|1 5)*4711')


== Method combination

In this approach the regexp is gradually built up by invoking methods.
One major question is what the methods should be named.

The following is how the above example could look.

first_part = IntRegexp.new(17).or(IntRegexp.new(1).followed_by 5)
first_part.any_number_of_times.followed_by 4711

Or with some different method names

first_part = IntRegexp.new(17) | (IntRegexp.new(1) + 5)
first_part.any_times + 4711

I'm unsure what sort of method names would strike a good balance between
verbosity and readability.


== Blocks

TextualRegexp[1] provides a block-based interface for specifying normal
regexps. Perhaps something similar could be done for integer regexps?
The following is the syntax of TextualRegexp but with integers.

regexp do
at_least_zero do
any :eek:f do
integer 17
group{ integer 1; integer 5 }
end
end
integer 4711
end

There's a fair amount of redundancy since there will never be any thing
other than "integer" specified.


[1] http://rubyforge.org/projects/texrex/
 
R

Robert Klemme

2008/5/5 Andreas Launila said:
I'm trying to come up with a clean way to specify regexps in the integer
domain. I.e. instead of describing a pattern of characters (as in normal
regexps) one describes patterns of integers ("17 followed by 3 or 15"
rather than "'a' followed by 'b' or 'c'").

Why do you need this, i.e. what advantages do you expect for a
particular "integer regexp" over classic regular expressions?
My main objective is to make the syntax intuitive to Ruby users. I have
been toying around with a few different approaches, but I'm not sure if
any meet the goal. How would you design the syntax for regexps in the
integer domain?

The syntax does not need to support especially many operators:
* Kleene star ('*' in character regexps)
* "At least once" ('+' in character regexps)
* Alternation ('|' in character regexps, i.e. "a|b" being "'a' or 'b'")

Below are some of the approaches one could take.


== String representation

Many Ruby users have used character regexps defined as strings, so it
would seem like a good idea to make integer regexps as similar as
possible. A delimiter would be needed though, since "17" could otherwise
mean "1 followed by 7" or just "17". One delimiter could for instance be
a blank space. The following is how the pattern "Any number of (17 or, 1
followed by 5) followed by 4711" could look.

IntRegexp.new('(17|1 5)*4711')

You can as well do /(17|1 5)*4711/, can't you?
== Method combination

In this approach the regexp is gradually built up by invoking methods.
One major question is what the methods should be named.

The following is how the above example could look.

first_part = IntRegexp.new(17).or(IntRegexp.new(1).followed_by 5)
first_part.any_number_of_times.followed_by 4711

This looks ugly.
Or with some different method names

first_part = IntRegexp.new(17) | (IntRegexp.new(1) + 5)
first_part.any_times + 4711

I'm unsure what sort of method names would strike a good balance between
verbosity and readability.


== Blocks

TextualRegexp[1] provides a block-based interface for specifying normal
regexps. Perhaps something similar could be done for integer regexps?

The change would be easy, i.e. you just need to allow anything where
so far only raw strings were allowed and internally convert it via
#to_s.
The following is the syntax of TextualRegexp but with integers.

regexp do
at_least_zero do
any :eek:f do
integer 17
group{ integer 1; integer 5 }
end
end
integer 4711
end

There's a fair amount of redundancy since there will never be any thing
other than "integer" specified.


[1] http://rubyforge.org/projects/texrex/

Apparently the project has gone and I could not prevent it since Ari
did not give me access to it. However, I posted an early version
here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/263785

Kind regards

robert
 
A

Andreas Launila

Robert said:
Why do you need this, i.e. what advantages do you expect for a
particular "integer regexp" over classic regular expressions?

The two types of regexps are not really comparable since they work in
different domains. I.e. classic regular expressions describe patterns in
arrays of characters rather than in arrays of integers. One could for
instance use an integer regexp to decide whether an array begins with 17
and ends with 8.

Specifically this is for specifying deterministic finite automatons,
with an integer alphabet, for an upcoming constraint in Gecode/R[1].
You can as well do /(17|1 5)*4711/, can't you?

True, assuming that one then convert it back to its source to reparse.
It would probably make it even more familiar.


[1] http://gecoder.rubyforge.org/
 
E

Eivind Eklund

Robert said:
Why do you need this, i.e. what advantages do you expect for a
particular "integer regexp" over classic regular expressions?

The two types of regexps are not really comparable since they work in
different domains. I.e. classic regular expressions describe patterns in
arrays of characters rather than in arrays of integers. One could for
instance use an integer regexp to decide whether an array begins with 17
and ends with 8.

Specifically this is for specifying deterministic finite automatons,
with an integer alphabet, for an upcoming constraint in Gecode/R[1].

I'd implement this using a generic repeat operator instead of
hardcoding * and +, and using a generic comparator instead of
restricting to Integers.

The declaration for this would be something like

# Use repeat(0, nil, ...) for *
# Use repeat(1, nil, ...) for +
# Use repeat(0, 1, ...) for ?
def repeat(minimum, maximum, repeated_expression)
def or(*expressions)

So you'd convert your example "(17|1 5)*" to repeat(0, nil, or(17, [1, 5]))

Actually, as I thought of after I had written the above, I have
written code that does part of this already, and is generic - it's
available in types.rb (http://people.freebsd.org/~eivind/ruby/types/).
If you want it, I think I can extend that to support repeats tonight.
Feel free to use code from it under any license that absolves me of
legal blame (ie, the only reason I don't put it in the public domain
is because that expose me to legal liability.)

Eivind.
 
R

Robert Klemme

2008/5/5 Andreas Launila said:
The two types of regexps are not really comparable since they work in
different domains. I.e. classic regular expressions describe patterns in
arrays of characters rather than in arrays of integers. One could for
instance use an integer regexp to decide whether an array begins with 17
and ends with 8.

Ah! That bit of information was missing from the original posting.
Specifically this is for specifying deterministic finite automatons,
with an integer alphabet, for an upcoming constraint in Gecode/R[1].

I see. Then of course it's a different story. Unless you translate
the array of integers into a string and apply a textual regexp. But
you can still do that internally, i.e. shield that away as an
implementation detail.

I'd probably choose the approach we had taken with texrex because it
is easy to implement and has good usability (albeit it's a bit
verbose). I would base the decision ultimately on the users of such a
package, i.e. if they are familiar with regular expressions then the
string based approach might be the better choice (short and concise)
while for others the wordy approach might be better.

Kind regards

robert
 
A

Andreas Launila

Eivind said:
I'd implement this using a generic repeat operator instead of
hardcoding * and +, and using a generic comparator instead of
restricting to Integers.

The declaration for this would be something like

# Use repeat(0, nil, ...) for *
# Use repeat(1, nil, ...) for +
# Use repeat(0, 1, ...) for ?
def repeat(minimum, maximum, repeated_expression)
def or(*expressions)

So you'd convert your example "(17|1 5)*" to repeat(0, nil, or(17, [1, 5]))

I like that idea, especially using the arrays to group integers. It
seems like a rather economic way to do it. I think "any" would fit
better than "or" though (since it can take more than two arguments). The
example in its entirety would then be

[repeat(0, nil, any(17, [1, 5])), 4711]

I would also consider swapping the position of repeated_expression to
the first argument of repeat, since it seems like a more natural order
to me ("repeat expression between 0 and 4 times"). I'm a bit split on
using nil for infinity, but that is minor.

I will need to explore the idea, but it feels like a good fit to me.
Thank you.
Actually, as I thought of after I had written the above, I have
written code that does part of this already, and is generic - it's
available in types.rb (http://people.freebsd.org/~eivind/ruby/types/).
If you want it, I think I can extend that to support repeats tonight.
Feel free to use code from it under any license that absolves me of
legal blame (ie, the only reason I don't put it in the public domain
is because that expose me to legal liability.)

Thanks for the offer, but there's no need to add additional support. The
important part is the idea, which you have provided.
 
P

Phrogz

I'm trying to come up with a clean way to specify regexps in the integer
domain. I.e. instead of describing a pattern of characters (as in normal
regexps) one describes patterns of integers ("17 followed by 3 or 15"
rather than "'a' followed by 'b' or 'c'").

You could specify it as a PEG[1], possibly using Treetop[2] in some
way.

The benefit of the PEG (re-usable sub-expressions) may not outweigh
the multi-line nature. However, you could do something like LPEG[3]
does. Instead of using standard PEG syntax come up with a custom
syntax via operators that allows you to describe an entire grammar in
a single Ruby expression. I'm personally not a fan of the syntax LPEG
uses; you could come up with a better one, perhaps using terse method
names instead of operators in some cases. (Perhaps
"pattern.atleast(5)" instead of LPEG's "pattern^5".)

[1] http://en.wikipedia.org/wiki/Parsing_expression_grammar
[2] http://treetop.rubyforge.org/
[3] http://www.inf.puc-rio.br/~roberto/lpeg.html
 
P

Phrogz

The benefit of the PEG (re-usable sub-expressions) may not outweigh
the multi-line nature.

To expand on this a little bit more, and use an LPEG-like custom
grammar entirely in Ruby, using PEG instead of regexp would allow you
to do something like:

header = IPEG[17,34,15,12,89,127]
break = IPEG[13,10]
close = IPEG[0]

body = (IPEG[12,43,13,break,57,75] | IPEG[18,13,31])
message_style_1 = header + body + close
message_style_2 = header + IPEG[108,103] + break + body + close
message = message_style_1 | message_style_2

result = message.match( my_array_of_numbers )

As you can see, this allows you to break up complex expressions into
named parts. This makes it both easier to understand, and DRYs up the
expression.

In the above made up syntax:
* "IPEG" stands for "Integer Parsing Expression Grammar"
* The IPEG.[] notation might represent a literal sequence of
integers
* The + operator would combine sequences
* The | operator provides alternation
 
A

ara.t.howard

I'm trying to come up with a clean way to specify regexps in the
integer
domain. I.e. instead of describing a pattern of characters (as in
normal
regexps) one describes patterns of integers ("17 followed by 3 or 15"
rather than "'a' followed by 'b' or 'c'").

My main objective is to make the syntax intuitive to Ruby users. I
have
been toying around with a few different approaches, but I'm not sure
if
any meet the goal. How would you design the syntax for regexps in the
integer domain?

The syntax does not need to support especially many operators:
* Kleene star ('*' in character regexps)
* "At least once" ('+' in character regexps)
* Alternation ('|' in character regexps, i.e. "a|b" being "'a' or
'b'")

Below are some of the approaches one could take.


i personally would not bother making a new syntax, rather i'd simply
use the exiting one and add a new atom builder:



cfp:~ > cat a.rb
N = 0.chr

def N n
"#{ n }#{ N }"
end

module Enumerable
def nmatch pattern
match = pattern.match( select{|i| Fixnum === i}.join(N) )
if match
returned = []
captured = match.to_a.map{|capture| capture.split N}
returned << captured.first.map{|s| Float(s).to_i}
captured[1..-1].each do |list|
returned << Float(list.first).to_i
end
returned
else
nil
end
end
end

pattern = %r/#{ N 42 }#{ N 7 }(#{ N 17 }){2}(#{ N 1 }){1,}/

p [42, 7, 17, 17, 1, 1, 1].nmatch( pattern ).to_a



cfp:~ > ruby a.rb
[[42, 7, 17, 17, 1, 1], 17, 1]




a @ http://codeforpeople.com/
 
U

Une Bévue

ara.t.howard said:
we can deny everything, except that we have the possibility of being
better. simply reflect on that.

which suppose their is a norm somewhere, ie. something able to measure
what is worst what is best...
 
A

ara.t.howard

which suppose their is a norm somewhere, ie. something able to measure
what is worst what is best...

i don't think the implication is an absolute one. whether relative or =20=

absolute i think it would be rare the to find a being that did not =20
know this about themselves at some, possibly deeply hidden, level.

a @ http://codeforpeople.com/
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top