Extracting substring with regexp

Discussion in 'Java' started by Alex, Jan 31, 2008.

  1. Alex

    Alex Guest

    How do I extract a substring with a regexp when I know the start and
    end delimiters of the substring?
    For example, I want to find what is between "abc" and "xyz" in the
    String.

    Pattern p = Pattern.compile("abc*xyz");
    Matcher m = p.matcher("aaaaaaaaaabc123xyzzzzzzzzzzzzz");

    but m.matches() returns false and can't find my pattern.

    How can I get the "123" in this example?

    Alex Kizub.
     
    Alex, Jan 31, 2008
    #1

  2. Alex

    Lord Zoltar Guest

    You could try grouping:

    (a+bc)(.+)(xyz+)

    group 2 is the one that would have what you want.
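
    As a rough sketch of the grouping suggestion above, the following
    applies that pattern to the string from the first post and reads
    group 2. The class and method names (GroupingDemo, middle) are made
    up for illustration, and find() is used rather than matches() as an
    assumption about how the pattern would be applied.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {
    // Returns the text captured by group 2, or null if the pattern is not found.
    static String middle(String input) {
        Pattern p = Pattern.compile("(a+bc)(.+)(xyz+)");
        Matcher m = p.matcher(input);
        // find() searches anywhere in the input; groups are numbered from 1.
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // prints 123
        System.out.println(middle("aaaaaaaaaabc123xyzzzzzzzzzzzzz"));
    }
}
```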
     
    Lord Zoltar, Jan 31, 2008
    #2

  3. Alex

    Eric Sosman Guest

    First, correct your regexp: As written, it looks for
    an a, a b, any number of c's (including zero), then x,
    y, and z. You probably want "abc.*xyz" instead.

    Second, realize that matches() tries to match the
    entire input sequence. So it will fail, because the "aa"
    at the start does not match either the original or the
    corrected regexp. You probably want the find() method
    instead.

    Third, since what you are interested in is the stuff
    between the abc and the xyz, you should indicate your
    interest by making that part into a "group." Change the
    regexp yet again, this time to "abc(.*)xyz". When find()
    returns true, you can then use m.group(1) to retrieve the
    part between the parentheses.

    Finally, it would be a Really Good Idea for you to read
    the Javadoc on the Pattern and Matcher classes, where all
    this and more is described, with reasonably comprehensible
    examples, too.
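
    The three steps above can be sketched roughly as follows. The class
    and method names (FindVersusMatches, matchesWhole, between) are
    hypothetical and exist only for this illustration; the regexp and
    the input string are the ones discussed in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVersusMatches {
    // Corrected regexp with a capturing group around the middle part.
    static final Pattern P = Pattern.compile("abc(.*)xyz");

    // matches() succeeds only if the ENTIRE input fits the pattern.
    static boolean matchesWhole(String input) {
        return P.matcher(input).matches();
    }

    // find() looks for the pattern anywhere in the input;
    // group(1) is the part between the parentheses.
    static String between(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "aaaaaaaaaabc123xyzzzzzzzzzzzzz";
        System.out.println(matchesWhole(input)); // prints false
        System.out.println(between(input));      // prints 123
    }
}
```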
     
    Eric Sosman, Jan 31, 2008
    #3
  4. Alex

    Alex Guest

    Eric:
    Thanks a lot. You are right.
    But here is a gap in my knowledge again:
    I assume that this


    Pattern p = Pattern.compile("abc(.*)xyz");
    Matcher m = p.matcher("xxxxxabc123xyz789xyzxxxxx");
    if (m.find()) System.out.println(m.group(1));

    should print "123" but, instead, it prints "123xyz789".
    How can I force the regexp to find the first match?

    I agree that I should read the Sun documentation. And, trust me, I
    did...

    Thanks in advance.
    Alex Kizub.
     
    Alex, Jan 31, 2008
    #4
  5. Alex

    Lord Zoltar Guest

    If you're having trouble with regular expressions, it's a lot easier
    to test them before you add them to your code and compile and run. I
    like to use a program called Expresso to test regular expressions.
     
    Lord Zoltar, Jan 31, 2008
    #5
  6. Short answer: By default, quantifiers are greedy, so the group takes
    the longest possible match. Use "abc(.*?)xyz" instead.

    Long answer: The *, +, and ? quantifiers (unqualified) are greedy:
    they first match as much as possible and then backtrack until the
    rest of the pattern matches. Appending `?' to a quantifier makes it
    reluctant: it first tries to match without applying the quantifier
    and only then applies it. Appending `+' makes it possessive: it
    matches as much as possible and prohibits backtracking.
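
    The difference between the plain, `?'-qualified and `+'-qualified
    quantifiers can be checked with a small sketch like the one below.
    The helper name capture is invented for this example, and find() is
    assumed as the matching method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns group 1 of the first match of regex in input, or null if none.
    static String capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "xxxxxabc123xyz789xyzxxxxx";
        // Plain .* takes as much as it can, so it runs to the last "xyz".
        System.out.println(capture("abc(.*)xyz", s));  // prints 123xyz789
        // .*? takes as little as it can, so it stops at the first "xyz".
        System.out.println(capture("abc(.*?)xyz", s)); // prints 123
        // a*+ consumes every 'a' and never gives one back, so the
        // trailing 'a' in the pattern has nothing left to match.
        System.out.println(capture("(a*+)a", "aaa"));  // prints null
    }
}
```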

    Using find() with "(a*"+operator+")a" on the string "aaa", group 1
    matches:
    "": aa
    "?": (the empty string)
    "+": <failure>
     
    Joshua Cranmer, Jan 31, 2008
    #6
  7. Alex

    Roedy Green Guest

    There are two ways I can think of:

    1. find each piece then use an ordinary substring to extract the
    middle.

    2. use groups i.e. (...) around the middle piece you want.

    see http://mindprod.com/jgloss/regex.html
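
    Option 1 might look something like the sketch below. It assumes the
    two pieces are fixed strings that can be located with
    String.indexOf rather than with a regexp; the class and method
    names are made up for the example.

```java
class SubstringBetween {
    // Returns the text between the first occurrence of start and the
    // next occurrence of end after it, or null if either is missing.
    static String between(String input, String start, String end) {
        int s = input.indexOf(start);
        if (s < 0) {
            return null;
        }
        int from = s + start.length();
        int e = input.indexOf(end, from);
        if (e < 0) {
            return null;
        }
        return input.substring(from, e);
    }

    public static void main(String[] args) {
        // prints 123
        System.out.println(between("xxxxxabc123xyz789xyzxxxxx", "abc", "xyz"));
    }
}
```

    Because indexOf returns the first occurrence, this stops at the
    first "xyz" after "abc" without needing a reluctant quantifier.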
     
    Roedy Green, Feb 1, 2008
    #7