regex capability

Discussion in 'Java' started by Roedy Green, Apr 4, 2011.

  1. Roedy Green

    Roedy Green Guest

    Consider a string like this:

    Support DDR2 1066/800/667/533/400 DDR2 SDRAM

    Is it possible to compose a regex that will peel out those numbers for
    you each in its own field, or do you have to extract the string
    "1066/800/667/533/400" and use split?

    The various things I have tried just grab the last number.
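
For reference, the extract-and-split route mentioned in the question might look roughly like the sketch below. The class name `SplitSpeeds` and the method `speeds` are made up for illustration, and the pattern assumes the numbers always appear as a slash-separated run of at least two values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: find the slash-separated run, then split it.
class SplitSpeeds {
    // A run of two or more numbers joined by '/', e.g. "1066/800/667".
    private static final Pattern RUN = Pattern.compile("\\d+(?:/\\d+)+");

    static List<Integer> speeds(String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        if (m.find()) {
            // The regex only locates the run; split() does the per-number work.
            for (String part : m.group().split("/")) {
                result.add(Integer.parseInt(part));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667, 533, 400]
        System.out.println(speeds("Support DDR2 1066/800/667/533/400 DDR2 SDRAM"));
    }
}
```

The lone "2" in "DDR2" is skipped here because it is not followed by a slash.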
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    Doing what the user expects with respect to navigation is absurdly important for user satisfaction.
    ~ anonymous Google Android developer
     
    Roedy Green, Apr 4, 2011
    #1

  2. Roedy Green

    Roedy Green Guest

    On Mon, 04 Apr 2011 02:34:30 -0500, Leif Roar Moldskred wrote:

    >
    >Easiest is to just use split. You can always do a regex of the type
    >"(\\d+)/((\\d+)/)?((\\d+)/)?((\\d+)/)?" but that's just pointlessly
    >complicated. There's no reason why you should use a regex when "normal"
    >string parsing is simpler and easier to read.


    (xxx|yyy)+ seems to generate only one group item, no matter how many
    repetitions there are. That strikes me as a bug, but likely someone
    can explain why it is a feature or inevitability.
     
    Roedy Green, Apr 4, 2011
    #2

  3. Eric Sosman

    Eric Sosman Guest

    On 4/4/2011 3:50 AM, Roedy Green wrote:
    > On Mon, 04 Apr 2011 02:34:30 -0500, Leif Roar Moldskred
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >
    >>
    >> Easiest is to just use split. You can always do a regex of the type
    >> "(\\d+)/((\\d+)/)?((\\d+)/)?((\\d+)/)?" but that's just pointlessly
    >> complicated. There's no reason why you should use a regex when "normal"
    >> string parsing is simpler and easier to read.

    >
    > (xxx|yyy)+ seems to generate only one group item, no matter how many
    > repetitions there are. That strikes me as a bug, but likely someone
    > can explain why it is a feature or inevitability.


    A (section of a) regex matches a (section of a) string, and the
    Matcher machinery can tell you what substring was matched. The
    machinery has no provision for doing further processing on that
    matched substring, like saying "Oh, your regex didn't match a
    string this time, but an array of strings."

    You could, perhaps, cook up substitutes for Pattern and Matcher
    to do such a thing. But I'm not sure you'd want to, because it
    could make the API rather complicated. For example, consider a
    fanex (for "fancy expression," like "regular expression" only
    more so) along the lines of "(pat1)(pat2)" where "pat1" and "pat2"
    can match and return arrays of substrings. The FancyMatcher says
    "I matched five substrings." So you call group(3) to get the
    third of them -- was it matched by "pat1" or by "pat2"? Yes, you
    could invent an API to deal with this -- maybe FancyMatcher returns
    a tree of nodes that point to other nodes and/or to substrings --
    but I'm not confident this would be an unqualified improvement.

    --
    Eric Sosman
     
    Eric Sosman, Apr 4, 2011
    #3
  4. Robert Klemme

    On 04.04.2011 10:26, bugbear wrote:
    > Roedy Green wrote:
    >> Consider a string like this:
    >>
    >> Support DDR2 1066/800/667/533/400 DDR2 SDRAM
    >>
    >> Is it possible to compose a regex that will peel out those numbers for
    >> you each in its own field, or do you have to extract the string
    >> "1066/800/667/533/400" and use split?
    >>
    >> The various things I have tried just grab the last number.

    >
    > I think normal practice (in Perl, and Java) would be repeated
    > use of a fairly simple regexp.
    >
    > In Java, I use
    >
    > while(matcher.find()) {
    > ...
    > }
    >
    > The key is that Matcher is stateful.


    For added safety, a two-level approach could be taken:

    // untested
    Pattern whole = Pattern.compile("Support DDR2 (\\d+(?:/\\d+)*) DDR2 SDRAM");
    Pattern number = Pattern.compile("\\d+");

    Matcher m = whole.matcher(input);

    if (m.matches()) {
        // Second level: pull each number out of the validated run.
        for (m = number.matcher(m.group(1)); m.find();) {
            int x = Integer.parseInt(m.group());
        }
    } else {
        // error?
    }

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Apr 4, 2011
    #4
  5. David Lamb

    David Lamb Guest

    On 04/04/2011 8:03 AM, Eric Sosman wrote:
    > For example, consider a
    > fanex (for "fancy expression," like "regular expression" only
    > more so) along the lines of "(pat1)(pat2)" where "pat1" and "pat2"
    > can match and return arrays of substrings. The FancyMatcher says
    > "I matched five substrings." So you call group(3) to get the
    > third of them -- was it matched by "pat1" or by "pat2"? Yes, you
    > could invent an API to deal with this -- maybe FancyMatcher returns
    > a tree of nodes that point to other nodes and/or to substrings --
    > but I'm not confident this would be an unqualified improvement.


    Matching a pattern to generate a tree sounds a lot like a full-blown
    context-free parser.
     
    David Lamb, Apr 4, 2011
    #5
  6. Jim Gibson

    Jim Gibson Guest

    Roedy Green wrote:

    > On Mon, 04 Apr 2011 02:34:30 -0500, Leif Roar Moldskred
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >
    > >
    > >Easiest is to just use split. You can always do a regex of the type
    > >"(\\d+)/((\\d+)/)?((\\d+)/)?((\\d+)/)?" but that's just pointlessly
    > >complicated. There's no reason why you should use a regex when "normal"
    > >string parsing is simpler and easier to read.

    >
    > (xxx|yyy)+ seems to generate only one group item, no matter how many
    > repetitions there are. That strikes me as a bug, but likely someone
    > can explain why it is a feature or inevitability.


    The "feature" is that the number of capture groups is equal to the
    number of capturing parenthesis pairs in the pattern, not to the number
    of times a pair is matched. If the group in the above regular expression
    matches repeatedly, each repetition is captured and stored into the
    single capture buffer, overwriting the previous one. After the match is
    finished, only the last captured substring remains in that buffer.
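
A small sketch of that behaviour with java.util.regex follows. The class name `LastCapture` and its two helper methods are invented for this illustration; only `Pattern`, `Matcher`, `groupCount()` and `group(int)` are standard API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows that a repeated group keeps only its last iteration.
class LastCapture {
    private static final Pattern REPEATED = Pattern.compile("(xxx|yyy)+");

    // Number of capture groups is fixed by the pattern text: one pair of parentheses.
    static int groupCount() {
        return REPEATED.matcher("").groupCount();
    }

    // What group 1 holds after the whole input matched, or null if it did not match.
    static String lastIteration(String input) {
        Matcher m = REPEATED.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount());                   // prints 1
        System.out.println(lastIteration("xxxyyyxxxyyy"));  // prints yyy
    }
}
```

Four repetitions are matched, yet there is still one group and it reports only the final "yyy".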

    --
    Jim Gibson
     
    Jim Gibson, Apr 5, 2011
    #6
  7. markspace

    markspace Guest

    On 4/4/2011 1:13 PM, Robert Klemme wrote:

    > if ( m.matches() ) {
    > for (m = number.matcher(m.group(1)); m.find();) {
    > int x = Integer.parse(m.group());
    > }



    Why re-invent the wheel?


    import java.io.StringReader;
    import java.util.Scanner;

    public class ScannerTest {
        public static void main(String[] args) {
            StringReader in = new StringReader(
                    "Support DDR2 100/200/300/400 DDR2 SDRAM");

            Scanner scanner = new Scanner(in);
            // Treat every run of non-digit characters as a delimiter.
            scanner.useDelimiter("[^0-9]+");
            while (scanner.hasNextInt()) {
                System.out.println(scanner.nextInt());
            }
        }
    }


    (Lightly tested.)
     
    markspace, Apr 5, 2011
    #7
  8. Paul Cager

    Paul Cager Guest

    On Apr 5, 2:35 am, markspace <-@.> wrote:
    > On 4/4/2011 1:13 PM, Robert Klemme wrote:
    >
    > > if ( m.matches() ) {
    > > for (m = number.matcher(m.group(1)); m.find();) {
    > > int x = Integer.parse(m.group());
    > > }

    >
    > Why re-invent the wheel?
    >
    > public class ScannerTest {
    >      public static void main(String[] args) {
    >          StringReader in = new StringReader(
    >                  "Support DDR2 100/200/300/400 DDR2 SDRAM");
    >
    >          Scanner scanner = new Scanner(in);
    >          scanner.useDelimiter( "[^0-9]+" );
    >          while( scanner.hasNextInt() ) {
    >              System.out.println( scanner.nextInt() );
    >          }
    >      }
    >
    > }
    >
    > (Lightly tested.)


    $ java ScannerTest
    2
    100
    200
    300
    400
    2
     
    Paul Cager, Apr 5, 2011
    #8
  9. Robert Klemme

    On 5 Apr., 14:28, Patricia Shanahan wrote:
    > On 4/5/2011 2:10 AM, Paul Cager wrote:
    >
    > > On Apr 5, 2:35 am, markspace<-@.>  wrote:
    > >> On 4/4/2011 1:13 PM, Robert Klemme wrote:

    >
    > >>> if ( m.matches() ) {
    > >>> for (m = number.matcher(m.group(1)); m.find();) {
    > >>> int x = Integer.parse(m.group());
    > >>> }

    >
    > >> Why re-invent the wheel?


    In this case I just wanted to demonstrate the strategy of first checking
    the overall validity of the input and extracting the interesting part,
    and then ripping that interesting part apart. Whether a Scanner or
    another Matcher is used for the second step wasn't that important to
    me. Also, the thread is called "regex capability". :)

    But, of course, your approach using the Scanner is perfectly
    compatible with the two step strategy as Patricia also pointed
    out. :)

    > >> public class ScannerTest {
    > >>       public static void main(String[] args) {
    > >>           StringReader in = new StringReader(
    > >>                   "Support DDR2 100/200/300/400 DDR2SDRAM");

    >
    > >>           Scanner scanner = new Scanner(in);
    > >>           scanner.useDelimiter( "[^0-9]+" );
    > >>           while( scanner.hasNextInt() ) {
    > >>               System.out.println( scanner.nextInt() );
    > >>           }
    > >>       }

    >
    > >> }

    >
    > >> (Lightly tested.)

    >
    > > $ java ScannerTest
    > > 2
    > > 100
    > > 200
    > > 300
    > > 400
    > > 2

    >
    > This is a nice illustration of the case for a strategy I often use in
    > this sort of situation, combining tools using each to do the jobs it
    > does best.
    >
    > For example, a regular expression match could pull out the
    > "100/200/300/400" substring, and a Scanner could extract the integers
    > from that. More generally, it could be split and then each of the split
    > results processed some other way.


    I generally prefer scanning over splitting in those cases. The
    difference might be negligible here, but suppose the original pattern
    changes (e.g. because we want to allow "@" as a separator instead of,
    or in addition to, "/"). With the split approach two patterns need to
    be changed, while with scanning for integers (pattern \d+) only the
    master pattern needs to change. Also, with scanning it is clear what I
    want (positively defining the matched portion), while with splitting it
    is not so clear (negatively defining what I do not want, the
    separator), and that leaves a lot of room for what is returned from
    _between_ separators.
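
A sketch of that scanning variant, under the assumption that "@" has been allowed as a second separator. The class `ScanNotSplit` and the method `numbers` are illustrative names; the master pattern is adapted from the earlier two-level example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: validate with a master pattern, then scan for \d+ inside the match.
class ScanNotSplit {
    // The master pattern is the only place that knows which separators are legal.
    private static final Pattern WHOLE =
            Pattern.compile("Support DDR2 (\\d+(?:[/@]\\d+)*) DDR2 SDRAM");
    // The inner pattern says positively what is wanted and never mentions separators.
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static List<Integer> numbers(String input) {
        Matcher whole = WHOLE.matcher(input);
        if (!whole.matches()) {
            throw new IllegalArgumentException("Unexpected input: " + input);
        }
        List<Integer> result = new ArrayList<>();
        Matcher number = NUMBER.matcher(whole.group(1));
        while (number.find()) {
            result.add(Integer.parseInt(number.group()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1066, 800, 667]
        System.out.println(numbers("Support DDR2 1066@800/667 DDR2 SDRAM"));
    }
}
```

Adding another separator later would mean touching only the character class in `WHOLE`.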

    Kind regards

    robert
     
    Robert Klemme, Apr 5, 2011
    #9
  10. markspace

    markspace Guest

    On 4/5/2011 6:33 AM, Robert Klemme wrote:

    >>> On Apr 5, 2:35 am, markspace<-@.> wrote:
    >>>> Why re-invent the wheel?


    >
    > In this case I just wanted to demonstrate the strategy to first check
    > overall validity of the input and extract the interesting part and
    > then ripping that interesting part apart. Whether a Scanner or
    > another Matcher is used for the second step wasn't that important to
    > me. Also, the thread is called "regex capability". :)


    Fair enough. :)


    >
    > But, of course, your approach using the Scanner is perfectly
    > compatible with the two step strategy as Patricia also pointed
    > out. :)



    Don't forget too that Scanner can do other things besides use
    delimiters. It has methods like skip() and findInLine() that ignore
    delimiters and could be used to build a simple parser. You can also
    change the delimiters on the fly to extract different sections of text.

    A simple change to my example above:

    import java.io.StringReader;
    import java.util.Scanner;

    public class ScannerTest {
        public static void main(String[] args) {
            StringReader in = new StringReader(
                    "Support DDR2 100/200/300/400 DDR2 SDRAM");

            Scanner scanner = new Scanner(in);
            // Skip past the fixed prefix, ignoring delimiters.
            scanner.findInLine("Support DDR2");
            // From here on, tokens are separated by spaces or slashes.
            scanner.useDelimiter("[ /]+");
            while (scanner.hasNextInt()) {
                System.out.println(scanner.nextInt());
            }
        }
    }
     
    markspace, Apr 5, 2011
    #10