Regular Expressions -- Backtracking?

V

Vincent

I'm not sure if there is an easy way to do this short of resetting the
matcher index, but I thought I would ask. Is it possible to construct
a regular expression so that it will return all consecutive, 2-letter
pairs? So, for example, if I had "hello world", successive calls to
matcher.find() would return:

he
el
ll
lo

wo
or
rl
ld

Thanks for any and all help.

Vincent
 
A

Arved Sandstrom

Vincent said:
I'm not sure if there is an easy way to do this short of resetting the
matcher index, but I thought I would ask. Is it possible to construct
a regular expression so that it will return all consecutive, 2-letter
pairs? So, for example, if I had "hello world", successive calls to
matcher.find() would return:

he
el
ll
lo

wo
or
rl
ld

Thanks for any and all help.

Vincent

Matcher matcher = Pattern.compile("(?=(..))").matcher(args[0]);
while (matcher.find()) {
System.out.println(matcher.group(1));
}

AHS
 
E

Eric Sosman

Vincent said:
I'm not sure if there is an easy way to do this short of resetting the
matcher index, but I thought I would ask. Is it possible to construct
a regular expression so that it will return all consecutive, 2-letter
pairs? So, for example, if I had "hello world", successive calls to
matcher.find() would return:

he
el
ll
lo

wo
or
rl
ld

Thanks for any and all help.

Vincent

Matcher matcher = Pattern.compile("(?=(..))").matcher(args[0]);
while (matcher.find()) {
System.out.println(matcher.group(1));
}

Matches non-letters, too. "(?=(\\w\\w))" comes closer, but
still doesn't get the blank (empty?) line between "lo" and "wo"
that the O.P. wants.

Personally, I think Peter Duniho is on the right track: This
is a job for a simple-minded loop, not for ponderous regexes.
 
A

Arved Sandstrom

Eric said:
Vincent said:
I'm not sure if there is an easy way to do this short of resetting
the matcher index, but I thought I would ask. Is it possible to
construct a regular expression so that it will return all
consecutive, 2-letter pairs? So, for example, if I had "hello
world", successive calls to matcher.find() would return:

he
el
ll
lo

wo
or
rl
ld

Thanks for any and all help.

Vincent

Matcher matcher = Pattern.compile("(?=(..))").matcher(args[0]);
while (matcher.find()) {
System.out.println(matcher.group(1));
}

Matches non-letters, too. "(?=(\\w\\w))" comes closer, but
still doesn't get the blank (empty?) line between "lo" and "wo"
that the O.P. wants.

Ahhh, I never twigged to that stipulation...even given his example. Still,
for a single "word", where you're not worried about WS, the provided regex
does the trick.
Personally, I think Peter Duniho is on the right track: This
is a job for a simple-minded loop, not for ponderous regexes.

Depends on the spirit in which the question was asked - does the OP really
want to use a regex to solve that problem, or is he just interested in
regular expressions? The RE I provided isn't what I'd call ponderous
(leaving aside the WS issue which I wouldn't solve with regular expressions
anyway; I'd just split into words first, then process each word), but the
real reason for preferring a "simple-minded loop" in production code, to
solve this problem, is that it would take less time to *write and test* the
loop than it would take to explain the regular expression to 99% of coders.

I was mainly motivated to throw this out there because of Peter's
speculation. Given that I didn't account for the word boundaries (yet :)),
an RE *can* be constructed that does produce overlapping consecutive
pairs...albeit without any matching at all.

AHS
 
V

Vincent

Eric said:
Vincent wrote:
I'm not sure if there is an easy way to do this short of resetting
the matcher index, but I thought I would ask.  Is it possible to
construct a regular expression so that it will return all
consecutive, 2-letter pairs?  So, for example, if I had "hello
world", successive calls to matcher.find() would return:
he
el
ll
lo
wo
or
rl
ld
Thanks for any and all help.
Vincent
Matcher matcher = Pattern.compile("(?=(..))").matcher(args[0]);
while (matcher.find()) {
     System.out.println(matcher.group(1));
}
    Matches non-letters, too.  "(?=(\\w\\w))" comes closer, but
still doesn't get the blank (empty?) line between "lo" and "wo"
that the O.P. wants.

Ahhh, I never twigged to that stipulation...even given his example. Still,
for a single "word", where you're not worried about WS, the provided regex
does the trick.
    Personally, I think Peter Duniho is on the right track: This
is a job for a simple-minded loop, not for ponderous regexes.

Depends on the spirit in which the question was asked - does the OP really
want to use a regex to solve that problem, or is he just interested in
regular expressions? The RE I provided isn't what I'd call ponderous
(leaving aside the WS issue which I wouldn't solve with regular expressions
anyway; I'd just split into words first, then process each word), but the
real reason for preferring a "simple-minded loop" in production code, to
solve this problem, is that it would take less time to *write and test* the
loop than it would take to explain the regular expression to 99% of coders.

I was mainly motivated to throw this out there because of Peter's
speculation. Given that I didn't account for the word boundaries (yet :)),
an RE *can* be constructed that does produce overlapping consecutive
pairs...albeit without any matching at all.

AHS

Well, thank you to everyone who replied. Yes, you are right that a
loop would solve this problem, but, for whatever reason, it didn't
seem like an elegant approach when Java provides the ability to use
regular expressions.

Also, I didn't meant to include the WS in my example. I am strictly
looking for adjacent two character pairs. After looking at the
suggestions, I came up with the following expression that seems to be
working properly:

(?=([a-zA-Z]{2}))

Again, thank you for your assistance.

Vincent
 
L

Lew

Don't quote sigs.
Well, thank you to everyone who replied. Yes, you are right that a
loop would solve this problem, but, for whatever reason, it didn't
seem like an elegant approach when Java provides the ability to use
regular expressions.

I am having a very hard time wrapping my mind around associating "elegant" and
"regular expressions".
 
E

Eric Sosman

[...]
Well, thank you to everyone who replied. Yes, you are right that a
loop would solve this problem, but, for whatever reason, it didn't
seem like an elegant approach when Java provides the ability to use
regular expressions.

Java also provides the ability to establish an encrypted network
connection to a String-splitting server, get back all pairs of
consecutive characters, send them in turn over the network to a
letters-only filtering server, get the filtered pairs back, load
them all into a relational database, and retrieve them with SQL.

... which isn't elegance, just needless complexity. Canaricide
by cannon. Like regexes in this case, IMHO -- but what the heck,
it's your code and you're the boss; do as you please.
 
A

Arved Sandstrom

Eric said:
[...]
Well, thank you to everyone who replied. Yes, you are right that a
loop would solve this problem, but, for whatever reason, it didn't
seem like an elegant approach when Java provides the ability to use
regular expressions.

Java also provides the ability to establish an encrypted network
connection to a String-splitting server, get back all pairs of
consecutive characters, send them in turn over the network to a
letters-only filtering server, get the filtered pairs back, load
them all into a relational database, and retrieve them with SQL.

Please God, don't crack wise like that around some of the managers and
architects I've worked with. :) Phrased just a bit differently this is
exactly the kind of system they'd go for.
[ SNIP ]

AHS
 
R

Robert Klemme

Vincent wrote:

I am having a very hard time wrapping my mind around associating
"elegant" and "regular expressions".

:) If they are first class data types in a language like in many
dynamic languages they can make for awful elegant solutions.

Just a silly Ruby example:

irb(main):011:0> "hello world".scan(/\w(?=\w)/) {|m| p(m+$'[0])}
"he"
"el"
"ll"
"lo"
"wo"
"or"
"rl"
"ld"
=> "hello world"

Well, "elegance" of course lies in the eye of the beholder and I would
not claim that this is the most elegant code you can get with regexps... :)

Cheers

robert
 
M

Mike Schilling

Arved Sandstrom said:
Eric said:
[...]
Well, thank you to everyone who replied. Yes, you are right that a
loop would solve this problem, but, for whatever reason, it didn't
seem like an elegant approach when Java provides the ability to use
regular expressions.

Java also provides the ability to establish an encrypted network
connection to a String-splitting server, get back all pairs of
consecutive characters, send them in turn over the network to a
letters-only filtering server, get the filtered pairs back, load
them all into a relational database, and retrieve them with SQL.

Please God, don't crack wise like that around some of the managers and
architects I've worked with. :) Phrased just a bit differently this is
exactly the kind of system they'd go for.
[ SNIP ]

And you could pull in some developers I've worked with by putting
string-splitting and letters-only-filtering in different classes (to reduce
coupling) and wiring the two together with a Spring XML configuration
 
A

Arved Sandstrom

Lew said:
Arved Sandstrom wrote
Please God, don't crack wise like that around some of the managers
and architects I've worked with. :) Phrased just a bit differently
this is exactly the kind of system they'd go for.
[ SNIP ]

Mike said:
And you could pull in some developers I've worked with by putting
string-splitting and letters-only-filtering in different classes (to
reduce coupling) and wiring the two together with a Spring XML
configuration

Half of me wants to burn you all at the stake, half of me wants
shower you with awards, and half of me wants to implement that system
and foist it onto one of those managers or architects.

It'll be RESTful web services, of course, based on that Ruby code
that Robert Klemme posted a while back.

And it goes without saying that it should really be NoSQL in the cloud.

For the benefit of the marketing types I would simply describe all this as a
"best-of-breed content enrichment service".

AHS
 
A

Arne Vajhøj

I am having a very hard time wrapping my mind around associating
"elegant" and "regular expressions".

In some cases regex allows for short, easily readable (for those
that know regex) and easily modifiable code.

Like XPath for getting info out of XML.

Arne
 
A

Arne Vajhøj

Vincent wrote:

I am having a very hard time wrapping my mind around associating
"elegant" and "regular expressions".

:) If they are first class data types in a language like in many
dynamic languages they can make for awful elegant solutions.

Just a silly Ruby example:

irb(main):011:0> "hello world".scan(/\w(?=\w)/) {|m| p(m+$'[0])}
"he"
"el"
"ll"
"lo"
"wo"
"or"
"rl"
"ld"
=> "hello world"

Well, "elegance" of course lies in the eye of the beholder and I would
not claim that this is the most elegant code you can get with regexps...
:)

I would consider it pretty bad, because it is very difficult
to read for those not familiar with Ruby.

In contrast with Arveds Java code, which should be understandable
by anyone:
- knowing regex
- knowing any mainstream programming language

Arne
 
R

Robert Klemme

Vincent wrote:
Well, thank you to everyone who replied. Yes, you are right that a
loop would solve this problem, but, for whatever reason, it didn't
seem like an elegant approach when Java provides the ability to use
regular expressions.

I am having a very hard time wrapping my mind around associating
"elegant" and "regular expressions".

:) If they are first class data types in a language like in many
dynamic languages they can make for awful elegant solutions.

Just a silly Ruby example:

irb(main):011:0> "hello world".scan(/\w(?=\w)/) {|m| p(m+$'[0])}
"he"
"el"
"ll"
"lo"
"wo"
"or"
"rl"
"ld"
=> "hello world"

Well, "elegance" of course lies in the eye of the beholder and I would
not claim that this is the most elegant code you can get with regexps...
:)

I would consider it pretty bad, because it is very difficult
to read for those not familiar with Ruby.

Hmmm... I don't think this is an important criterion. After all, if
someone needs to read code then it is usually because he needs to modify
it. For that knowledge of the programming language at hand is mandatory
anyway. And I don't think the code is so esoteric for someone with at
least a working knowledge of Ruby.

If the code would be hard to read for the general Ruby programmer that
would qualify as a valid argument IMHO. I'm probably the wrong person
to judge that since I brought the example up in the first place. :)

Cheers

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,050
Latest member
AngelS122

Latest Threads

Top