regex capability

Roedy Green · Apr 4, 2011

Consider a string like this:

Support DDR2 1066/800/667/533/400 DDR2 SDRAM

Is it possible to compose a regex that will peel out those numbers for
you each in its own field, or do you have to extract the string
"1066/800/667/533/400" and use split?

The various things I have tried just grab the last number.

Roedy Green · Apr 4, 2011

Easiest is to just use split. You can always do a regex of the type
"(\\d+)/((\\d+)/)?((\\d+)/)?((\\d+)/)?" but that's just pointlessly
complicated. There's no reason why you should use a regex when "normal"
string parsing is simpler and easier to read.
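For comparison, here is a minimal sketch of the split approach. It assumes the numbers always form the third space-separated token of the line; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitDemo {
    // Hypothetical helper: assumes the speeds are the third space-separated
    // token, as in "Support DDR2 1066/800/667/533/400 DDR2 SDRAM".
    static List<Integer> speeds(String line) {
        String field = line.split(" ")[2];          // "1066/800/667/533/400"
        List<Integer> result = new ArrayList<>();
        for (String part : field.split("/")) {
            result.add(Integer.parseInt(part));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667, 533, 400]
        System.out.println(speeds("Support DDR2 1066/800/667/533/400 DDR2 SDRAM"));
    }
}
```

This does no validation of the surrounding text, which is the trade-off for its simplicity.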

(xxx|yyy)+ seems to generate only one group item, no matter how many
repetitions there are. That strikes me as a bug, but likely someone
can explain why it is a feature or inevitability.

Eric Sosman · Apr 4, 2011

(xxx|yyy)+ seems to generate only one group item, no matter how many
repetitions there are. That strikes me as a bug, but likely someone
can explain why it is a feature or inevitability.

A (section of a) regex matches a (section of a) string, and the
Matcher machinery can tell you what substring was matched. The
machinery has no provision for doing further processing on that
matched substring, like saying "Oh, your regex didn't match a
string this time, but an array of strings."

You could, perhaps, cook up substitutes for Pattern and Matcher
to do such a thing. But I'm not sure you'd want to, because it
could make the API rather complicated. For example, consider a
fanex (for "fancy expression," like "regular expression" only
more so) along the lines of "(pat1)(pat2)" where "pat1" and "pat2"
can match and return arrays of substrings. The FancyMatcher says
"I matched five substrings." So you call group(3) to get the
third of them -- was it matched by "pat1" or by "pat2"? Yes, you
could invent an API to deal with this -- maybe FancyMatcher returns
a tree of nodes that point to other nodes and/or to substrings --
but I'm not confident this would be an unqualified improvement.

Robert Klemme · Apr 4, 2011

I think normal practice (in Perl, and Java) would be repeated
use of a fairly simple regexp.

In Java, I use

while(matcher.find()) {
...
}

The key is that Matcher is stateful.
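A complete version of that loop might look like the sketch below. The \b anchors are my assumption, added so that the "2" in "DDR2" is not picked up (there is no word boundary between "R" and "2"); the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // Whole-word digit runs only, so the digit glued to "DDR" is skipped.
    private static final Pattern NUMBER = Pattern.compile("\\b\\d+\\b");

    static List<Integer> numbers(String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        // Each find() resumes where the previous match ended.
        while (m.find()) {
            result.add(Integer.parseInt(m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667, 533, 400]
        System.out.println(numbers("Support DDR2 1066/800/667/533/400 DDR2 SDRAM"));
    }
}
```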

And for added security a two level approach could be taken:

// untested
Pattern whole = Pattern.compile("Support DDR2 (\\d+(?:/\\d+)*) DDR2 SDRAM");

Pattern number = Pattern.compile("\\d+");

Matcher m = whole.matcher(input);

if ( m.matches() ) {
    // re-use m to scan only the extracted "1066/800/..." part
    for (m = number.matcher(m.group(1)); m.find(); ) {
        int x = Integer.parseInt(m.group());
    }
}
else {
    // error?
}

Kind regards

robert

David Lamb · Apr 4, 2011

For example, consider a
fanex (for "fancy expression," like "regular expression" only
more so) along the lines of "(pat1)(pat2)" where "pat1" and "pat2"
can match and return arrays of substrings. The FancyMatcher says
"I matched five substrings." So you call group(3) to get the
third of them -- was it matched by "pat1" or by "pat2"? Yes, you
could invent an API to deal with this -- maybe FancyMatcher returns
a tree of nodes that point to other nodes and/or to substrings --
but I'm not confident this would be an unqualified improvement.

Matching a pattern to generate a tree sounds a lot like a full-blown
context-free parser.

Jim Gibson · Apr 5, 2011

Roedy Green said:
(xxx|yyy)+ seems to generate only one group item, no matter how many
repetitions there are. That strikes me as a bug, but likely someone
can explain why it is a feature or inevitability.

The "feature" is that the number of capture groups is equal to the
number of capturing parenthesis pairs. If the group in the above regular
expression matches several times, each repetition is captured and stored
into the single capture buffer, overwriting the previous one. After the
match is finished, only the last captured substring remains in that buffer.
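A small sketch that shows this behaviour with java.util.regex. The pattern and the class name are my own choices for illustration, not something from the posts above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // Returns { groupCount, group(1), group(2) } for a "n/n/n..." string.
    static String[] captures(String input) {
        // Two capturing pairs, so two groups, however often the second repeats.
        Matcher m = Pattern.compile("(\\d+)(?:/(\\d+))*").matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return new String[] { String.valueOf(m.groupCount()), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // prints 2 1066 400
        System.out.println(String.join(" ", captures("1066/800/667/533/400")));
    }
}
```

The second group matched four times, but only the final "400" is still there once the match has finished.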

markspace · Apr 5, 2011

if ( m.matches() ) {
    for (m = number.matcher(m.group(1)); m.find(); ) {
        int x = Integer.parseInt(m.group());
    }
}

Why re-invent the wheel?

import java.io.StringReader;
import java.util.Scanner;

public class ScannerTest {
    public static void main(String[] args) {
        StringReader in = new StringReader(
                "Support DDR2 100/200/300/400 DDR2 SDRAM");

        Scanner scanner = new Scanner(in);
        // everything that is not a digit separates tokens
        scanner.useDelimiter( "[^0-9]+" );
        while( scanner.hasNextInt() ) {
            System.out.println( scanner.nextInt() );
        }
    }
}

(Lightly tested.)

Paul Cager · Apr 5, 2011

if ( m.matches() ) {
    for (m = number.matcher(m.group(1)); m.find(); ) {
        int x = Integer.parseInt(m.group());
    }
}

Why re-invent the wheel?

import java.io.StringReader;
import java.util.Scanner;

public class ScannerTest {
    public static void main(String[] args) {
        StringReader in = new StringReader(
                "Support DDR2 100/200/300/400 DDR2 SDRAM");

        Scanner scanner = new Scanner(in);
        scanner.useDelimiter( "[^0-9]+" );
        while( scanner.hasNextInt() ) {
            System.out.println( scanner.nextInt() );
        }
    }
}

(Lightly tested.)

$ java ScannerTest
2
100
200
300
400
2

Robert Klemme · Apr 5, 2011

In this case I just wanted to demonstrate the strategy of first checking
the overall validity of the input and extracting the interesting part,
and then ripping that interesting part apart. Whether a Scanner or
another Matcher is used for the second step wasn't that important to
me. Also, the thread is called "regex capability".

But, of course, your approach using the Scanner is perfectly
compatible with the two-step strategy, as Patricia also pointed
out.

import java.io.StringReader;
import java.util.Scanner;

public class ScannerTest {
    public static void main(String[] args) {
        StringReader in = new StringReader(
                "Support DDR2 100/200/300/400 DDR2 SDRAM");
        Scanner scanner = new Scanner(in);
        scanner.useDelimiter( "[^0-9]+" );
        while( scanner.hasNextInt() ) {
            System.out.println( scanner.nextInt() );
        }
    }
}
(Lightly tested.)

$ java ScannerTest
2
100
200
300
400
2

This is a nice illustration of the case for a strategy I often use in
this sort of situation: combining tools, using each to do the job it
does best.

For example, a regular expression match could pull out the
"100/200/300/400" substring, and a Scanner could extract the integers
from that. More generally, it could be split and then each of the split
results processed some other way.
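That combination could be sketched roughly as follows. The whole-line pattern is borrowed from the earlier two-level example; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexThenScanner {
    // Step 1: validate the whole line and capture only the "n/n/n" part.
    private static final Pattern WHOLE =
            Pattern.compile("Support DDR2 (\\d+(?:/\\d+)*) DDR2 SDRAM");

    static List<Integer> speeds(String line) {
        Matcher m = WHOLE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected format: " + line);
        }
        // Step 2: let a Scanner pull the integers out of the captured part.
        List<Integer> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(m.group(1))) {
            scanner.useDelimiter("/");
            while (scanner.hasNextInt()) {
                result.add(scanner.nextInt());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [100, 200, 300, 400]
        System.out.println(speeds("Support DDR2 100/200/300/400 DDR2 SDRAM"));
    }
}
```

Because the Scanner only ever sees the captured substring, the stray "2" from "DDR2" cannot show up in the output.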

I generally prefer scanning over splitting in those cases. The
difference might be negligible for this case, but suppose the
original pattern changes (e.g. because we want to allow "@" as a
separator instead of, or in addition to, "/"). With the split
approach two patterns need to be changed, while with scanning for
integers (pattern \d+) only the master pattern needs to change. Also,
with scanning it is clear what I want (positively defining the matched
portion), while with splitting it is not so clear (negatively defining
what I do not want, the separator), which leaves a lot of room for
what is returned from _between_ separators.
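To make the point about "@" concrete, here is a sketch in which only the master pattern knows about separators. The "@" variant is the hypothetical change mentioned above, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorChangeDemo {
    // The only place that mentions separators: "/" or the hypothetical "@".
    private static final Pattern WHOLE =
            Pattern.compile("Support DDR2 (\\d+(?:[/@]\\d+)*) DDR2 SDRAM");
    // The scanning pattern stays the same whatever the separators are.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static List<Integer> speeds(String line) {
        Matcher whole = WHOLE.matcher(line);
        if (!whole.matches()) {
            throw new IllegalArgumentException("unexpected format: " + line);
        }
        List<Integer> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(whole.group(1));
        while (m.find()) {
            result.add(Integer.parseInt(m.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667]
        System.out.println(speeds("Support DDR2 1066@800/667 DDR2 SDRAM"));
    }
}
```

A split-based second step would instead need its own separator pattern updated to "[/@]" as well.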

Kind regards

robert

markspace · Apr 5, 2011

In this case I just wanted to demonstrate the strategy of first checking
the overall validity of the input and extracting the interesting part,
and then ripping that interesting part apart. Whether a Scanner or
another Matcher is used for the second step wasn't that important to
me. Also, the thread is called "regex capability".

Fair enough.

But, of course, your approach using the Scanner is perfectly
compatible with the two step strategy as Patricia also pointed
out.

Don't forget too that Scanner can do other things besides use
delimiters. It has methods like skip() and findInLine() that ignore
delimiters and could be used to build a simple parser. You can also
change the delimiters on the fly to extract different sections of text.

A simple change to my example above:

import java.io.StringReader;
import java.util.Scanner;

public class ScannerTest {
    public static void main(String[] args) {
        StringReader in = new StringReader(
                "Support DDR2 100/200/300/400 DDR2 SDRAM");

        Scanner scanner = new Scanner(in);
        // skip past the fixed prefix, ignoring delimiters
        scanner.findInLine( "Support DDR2" );
        scanner.useDelimiter( "[ /]+" );
        // stops at "DDR2", which is not an int
        while( scanner.hasNextInt() ) {
            System.out.println( scanner.nextInt() );
        }
    }
}
