Help simplify complex regexp needing positive lookahead and reluctant quantifiers

Discussion in 'Java' started by david.karr@wamu.net, Mar 21, 2005.

  1. Guest

    I'm trying to build a regexp to handle somewhat complex data.

    My sample data is the following (abstracted from real data):
    --------------
    *XXXlkjsflkw34lkjsfd
    2XXXlkjsdfojsfjoimf344
    3XXXabcdef9999999
    4XXX9f9f9f9f9f9f9f9f
    5XXXg8g8g8g8g8g8g8g
    6XXXe6e6e6e6e6e6e6e6e
     YYY=D/23333333
     -xxxxxxxxxxxx
     -yyyyyyyyyyyy
     ZZZ=gggggggggggg
     AAA=hhhhhhhhhh
     -jjjjjjjjjjj
     -kkkkkkkkkkk
    /XXX 2
    --------------

    The important elements are "XXX", "YYY", "ZZZ", and "AAA". Each of
    "YYY", "ZZZ", and "AAA" could be in any order, and some could be
    missing, or others like it could be added. What I'd like to build is a
    regexp that can group each of "YYY", "ZZZ", and "AAA" along with their
    "associated data", up to either the next "[A-Z]{3}=", or the ending
    "/XXX". If I can get the "associated data" into group values, I can
    use other regexps for the detail in those group values.

    The regexp that I've built so far comes close to solving this, but not
    quite. This is what I have so far:

    --------------
    "(?sm)\\*.{3}.*\n" +
    "2.{3}.*\n" +
    "3.{3}.*\n" +
    "4.{3}.*\n" +
    "5.{3}.*\n" +
    "6.{3}.*\n" +
    " ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
    " ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
    " ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" +
    "/[A-Z]{3}.*";
    --------------

    You can ignore for now the fact that I'm not verifying that all the
    places that require "XXX" are all "XXX". The problem area is the
    "[A-Z]{3}=" groups. This regexp works for my sample data, but I wasn't
    able to simplify those three repeated lines into a single expression,
    which would handle any number of those. I tried the following, to
    replace those three lines:

    "( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*"

    but that didn't seem to work, and I'm not sure why.

    The following is the output from my Java program, using the working
    regexp, where it iterated through the found groups. I provide this
    just as another view of what I'm trying to capture:

    --------------
    group[YYY=]
    group[D/23333333
     -xxxxxxxxxxxx
     -yyyyyyyyyyyy
    ]
    group[ZZZ=]
    group[gggggggggggg
    ]
    group[AAA=]
    group[hhhhhhhhhh
     -jjjjjjjjjjj
     -kkkkkkkkkkk
    ]
    --------------
     
    , Mar 21, 2005
    #1

  2. Lisa Guest



    Did you consider using a simpler expression and going over the data
    in several passes, the way Unix folks like to do?

    grep "pat1" filename | grep "pat2" | grep "pat3"
     
    Lisa, Mar 21, 2005
    #2

  3. Alan Moore Guest

    On 20 Mar 2005 18:54:39 -0800, wrote:

    >I tried the following, to replace those three lines:
    >
    >"( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*"
    >
    >but that didn't seem to work, and I'm not sure why.


    The "(?sm)" at the beginning puts the whole regex in DOTALL and
    MULTILINE mode. The 'm' is having no effect, since you aren't using
    any line anchors; the 's' is what's causing your problem. Each ".*"
    initially gobbles up the whole rest of the input, then backs off as
    far as necessary to permit the next part of the regex to match. That
    works as intended until the line starting with '6' is reached. After
    the dot-star there wolfs everything down, it starts regurgitating as
    usual. When it reaches the '/' at the beginning of the last line, the
    rest of the regex is able to match, because your combined
    subexpression is optional. The dot-star in the '6' line ends up
    keeping all the text the subexpression was supposed to match.
    Changing the "*" that controls the subexpression to a "+" won't
    help--it will only force the subexpression to match once, letting the
    dot-star keep anything else.
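
    To see this happen in isolation, here is a minimal sketch. It assumes
    a shortened, made-up record (one header line, two subrecords, the
    closing line) rather than the real data, and the class and method
    names are only illustrative. The match succeeds, but the starred
    group never gets to capture anything:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDotAllDemo {
    // Shortened, hypothetical stand-in for the real record.
    static final String INPUT =
          "6XXXe6e6e6\n"
        + " YYY=D/2333\n"
        + " ZZZ=gggg\n"
        + "/XXX 2";

    // Same shape as the failing pattern: DOTALL applies to the whole regex
    // and the subrecord group is optional because of the trailing '*'.
    static final String REGEX =
          "(?s)6.{3}.*\n"
        + "( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*"
        + "/[A-Z]{3}.*";

    // Returns what the starred group captured, or null if it matched zero times.
    static String subrecordCapture() {
        Matcher m = Pattern.compile(REGEX).matcher(INPUT);
        if (!m.find()) {
            throw new IllegalStateException("pattern did not match at all");
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        // The greedy ".*" on the '6' line backs off only as far as the last
        // newline, so the optional group matches zero times.
        System.out.println("group(1) = " + subrecordCapture());
    }
}
```

    Here group(1) comes back as null even though find() returned true,
    which is why the failure is easy to miss.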

    You could fix that by making all the dot-stars reluctant, but a better
    way (more efficient, less error-prone) would be to remove the "(?sm)"
    and add "(?s)" to the subexpression, since that's the only place you
    actually need DOTALL mode:

    --------------
    "\\*.{3}.*\n" +
    "2.{3}.*\n" +
    "3.{3}.*\n" +
    "4.{3}.*\n" +
    "5.{3}.*\n" +
    "6.{3}.*\n" +
    "((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)" +
    "/[A-Z]{3}.*";
    --------------

    Note that I also changed the subexpression's enclosing group to
    non-capturing, and put the capturing group around it and its
    quantifier. That way, all the YYY|ZZZ|AAA entries with their
    associated data are captured in group(1). The way you had it, only
    the last entry would have been retained.
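
    For comparison, a small sketch of the two fixes side by side, again
    on a shortened, made-up record (the class and constant names are
    assumptions for illustration only). One variant keeps DOTALL on for
    the whole regex but makes the header dot-star reluctant; the other
    scopes DOTALL to the subrecord group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantVersusScopedDemo {
    // Shortened, hypothetical stand-in for the real record.
    static final String INPUT = "6XXXe6e6e6\n YYY=D/2333\n ZZZ=gggg\n/XXX 2";

    // Variant 1: DOTALL everywhere, header dot-star made reluctant.
    static final String RELUCTANT =
          "(?s)6.{3}.*?\n"
        + "((?: [A-Z]{3}=.*?(?= [A-Z]{3}=|/[A-Z]{3}))*)"
        + "/[A-Z]{3}.*";

    // Variant 2: DOTALL only inside the subrecord group.
    static final String SCOPED =
          "6.{3}.*\n"
        + "((?s: [A-Z]{3}=.*?(?= [A-Z]{3}=|/[A-Z]{3}))*)"
        + "/[A-Z]{3}.*";

    // Returns the text captured by group 1, or null if there is no match.
    static String capture(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.print(capture(RELUCTANT));
        System.out.print(capture(SCOPED));
    }
}
```

    Both variants capture the same subrecord text on this input. The
    scoped version has the advantage that the header dot-stars cannot
    run past the end of their own line.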
     
    Alan Moore, Mar 21, 2005
    #3
  4. Guest

    Ok, this looks very promising, but it doesn't quite work yet. I'll
    provide both the regexp I'm using and a sample string, so you can
    validate what I see, if you can.

    I'm also wondering whether you meant to enter "?s:", or "(?s)" instead.
    I tried both variations, with the same result.

    The regexp I'm now using is this:
    ---------------
    "\\*.{3}.*\n" +
    "2.{3}.*\n" +
    "3.{3}.*\n" +
    "4.{3}.*\n" +
    "5.{3}.*\n" +
    "6.{3}.*\n" +
    "((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)" +
    "/[A-Z]{3}.*";
    ---------------

    My sample data is this:
    ---------------
    *XXXlkjsflkw34lkjsfd
    2XXXlkjsdfojsfjoimf344
    3XXXabcdef9999999
    4XXX9f9f9f9f9f9f9f9f
    5XXXg8g8g8g8g8g8g8g
    6XXXe6e6e6e6e6e6e6e6e
     YYY=D/23333333
     -xxxxxxxxxxxx
     -yyyyyyyyyyyy
     ZZZ=gggggggggggg
     AAA=hhhhhhhhhh
     -jjjjjjjjjjj
     -kkkkkkkkkkk
    /XXX 2
    ---------------

    My code is roughly this:
    ---------------
    Pattern pattern = Pattern.compile(patternMask);
    Matcher matcher = pattern.matcher(readSample);
    System.out.println("groupCount[" + matcher.groupCount() + "]");
    boolean found = matcher.find();
    System.out.println("found[" + found + "]");
    ---------------

    Where "patternMask" and "readSample" correspond to my regexp and the
    sample data.

    With this regexp and sample data, the "groupCount" prints out as "3",
    and "found" is false.
     
    , Mar 23, 2005
    #4
  5. Alan Moore Guest

    On 23 Mar 2005 12:25:04 -0800, wrote:

    >Ok, this looks very promising, but it doesn't quite work yet. I'll
    >provide both the regexp I'm using and a sample string, so you can
    >validate what I see, if you can.


    That looks like what I'm doing; here's my test code:

    //==== code ========================================================

    import java.util.regex.*;

    public class Test
    {
        public static void main(String[] args)
        {
            String regex = "\\*.{3}.*\n"
                + "2.{3}.*\n"
                + "3.{3}.*\n"
                + "4.{3}.*\n"
                + "5.{3}.*\n"
                + "6.{3}.*\n"
                + "((?s: ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*)"
                + "/[A-Z]{3}.*";

            String input = "*XXXlkjsflkw34lkjsfd\n"
                + "2XXXlkjsdfojsfjoimf344\n"
                + "3XXXabcdef9999999\n"
                + "4XXX9f9f9f9f9f9f9f9f\n"
                + "5XXXg8g8g8g8g8g8g8g\n"
                + "6XXXe6e6e6e6e6e6e6e6e\n"
                + " YYY=D/23333333\n"
                + " -xxxxxxxxxxxx\n"
                + " -yyyyyyyyyyyy\n"
                + " ZZZ=gggggggggggg\n"
                + " AAA=hhhhhhhhhh\n"
                + " -jjjjjjjjjjj\n"
                + " -kkkkkkkkkkk\n"
                + "/XXX 2";

            Pattern p = Pattern.compile(regex);
            Matcher m = p.matcher(input);
            if (m.find())
            {
                System.out.println(m.group(1));
            }
        }
    }

    //==================================================================

    This prints:

     YYY=D/23333333
     -xxxxxxxxxxxx
     -yyyyyyyyyyyy
     ZZZ=gggggggggggg
     AAA=hhhhhhhhhh
     -jjjjjjjjjjj
     -kkkkkkkkkkk

    >
    >I'm also wondering whether you meant to enter "?s:", or "(?s)" instead.
    > I tried both variations, with the same result.


    "(?s)" sets the DOTALL flag for the rest of the regex or
    until you cancel it with "(?-s)". "(?s:<expr>)" both creates a
    non-capturing group and sets the flag, but the flag is in effect only
    within that group.
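
    A quick way to check the scoping rules is a sketch like the
    following. The inputs are toy strings chosen only for illustration,
    and the class name is arbitrary:

```java
import java.util.regex.Pattern;

class DotAllScopeDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoNewlines = "a\nb\nc";

        // Flag stays on to the end: both dots may match a newline.
        System.out.println(matches("(?s)a.b.c", twoNewlines));      // true

        // Flag cancelled with (?-s): the second dot no longer matches '\n'.
        System.out.println(matches("(?s)a.b(?-s).c", twoNewlines)); // false

        // Flag limited to the group: same effect as cancelling it afterwards.
        System.out.println(matches("(?s:a.b).c", twoNewlines));     // false
        System.out.println(matches("(?s:a.b).c", "a\nbxc"));        // true
    }
}
```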
     
    Alan Moore, Mar 24, 2005
    #5
  6. Guest

    Ok, the difference between our two was that my sample has "\r\n" for
    eols. Once I changed my pattern to check for that explicitly, I get
    similar output. I tried some variations with "$" and "(?m)", but it
    only got past this if I specifically used "\r\n".
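
    For anyone hitting the same thing, a minimal sketch of the line-ending
    problem, using two made-up header lines instead of the full record.
    Without DOTALL the dot stops at "\r" as well as "\n", so a pattern
    that expects a bare "\n" cannot get across a "\r\n" line ending,
    while "\r?\n" accepts both:

```java
import java.util.regex.Pattern;

class LineEndingDemo {
    // Expects a bare "\n" between the two lines.
    static final String STRICT = "5.{3}.*\n6.{3}";
    // Accepts either "\n" or "\r\n".
    static final String TOLERANT = "5.{3}.*\r?\n6.{3}";

    // True if the regex matches somewhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String unix = "5XXXg8g8\n6XXXe6e6";
        String dos = "5XXXg8g8\r\n6XXXe6e6";

        System.out.println(finds(STRICT, unix));   // true
        System.out.println(finds(STRICT, dos));    // false
        System.out.println(finds(TOLERANT, unix)); // true
        System.out.println(finds(TOLERANT, dos));  // true
    }
}
```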

    However, now I have to go deeper into this, and the current expression
    doesn't quite do what I need.

    What I really need to capture in individual groups would be the
    following (each group surrounded by brackets):

    [YYY=]
    [D/23333333
    xxxxxxxxxxxx
    yyyyyyyyyyyy]
    [ZZZ=]
    [gggggggggggg]
    [AAA=]
    [hhhhhhhhhh
    jjjjjjjjjjj
    kkkkkkkkkkk]

    Note that I've removed the initial spaces and dashes. That's my end
    state, but I can work to that step by step.

    When my code steps through all the groups it found, it finds this:

    ---------------
    group[ YYY=D/23333333
     -xxxxxxxxxxxx
     -yyyyyyyyyyyy
     ZZZ=gggggggggggg
     AAA=hhhhhhhhhh
     -jjjjjjjjjjj
     -kkkkkkkkkkk
    ]
    group[AAA=]
    group[hhhhhhhhhh
    -jjjjjjjjjjj
    -kkkkkkkkkkk
    ]
    ---------------

    I don't care about the first group, because that surrounds all of the
    subrecords. I would have hoped that the next group would be "YYY=",
    followed by the group with its associated data, and so on.
     
    , Mar 24, 2005
    #6
  7. Alan Moore Guest

    On 24 Mar 2005 09:10:47 -0800, wrote:

    >I don't care about the first group, because that surrounds all of the
    >subrecords. I would have hoped that the next group would be "YYY=",
    >followed by the group with its associated data, and so on.


    When you have a capturing group that's controlled by a quantifier, the
    only thing you can retrieve after a successful match is the *last*
    thing that was matched by that group. Remember that the groupCount()
    method only tells you how many capturing groups there are in the
    Matcher's parent Pattern; it doesn't say anything about what was
    actually matched.
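
    A small sketch of that behaviour, using a simplified made-up
    "KEY=digits;" format rather than the real subrecords (the class,
    constants, and helper are illustrative assumptions). The quantified
    pattern matches all three entries but only the last one is left in
    the groups; looping with find() over the unquantified unit visits
    each entry in turn:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureDemo {
    // One entry, repeated by a quantifier on the enclosing group.
    static final Pattern QUANTIFIED = Pattern.compile("(?:([A-Z]{3})=(\\d+);)+");
    // The same entry without the quantifier, meant to be used with find().
    static final Pattern SINGLE = Pattern.compile("([A-Z]{3})=(\\d+);");

    // Collects the key of every entry by calling find() repeatedly.
    static List<String> keys(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = SINGLE.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "YYY=1;ZZZ=22;AAA=333;";
        Matcher m = QUANTIFIED.matcher(input);

        // groupCount() is a property of the pattern, known before any match.
        System.out.println(m.groupCount());                    // 2

        if (m.matches()) {
            // Only the final iteration's captures survive.
            System.out.println(m.group(1) + "=" + m.group(2)); // AAA=333
        }

        System.out.println(keys(input));                       // [YYY, ZZZ, AAA]
    }
}
```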

    You initially changed your regex to match all the subrecords with a
    quantified subexpression because you didn't know how many subrecords
    there would be. When you did that, you gave up the ability to break
    out the individual subrecords in a single pass. What you have to do
    now is take the substring containing the subrecords and process it
    separately to break them out. In the following code, I went ahead and
    added a third layer of processing to get rid of those initial spaces
    and dashes as well.

    //==== code ========================================================

    import java.util.regex.*;

    public class Test
    {
        public static void main(String[] args)
        {
            String regex1 = "\\*.{3}.*\r?\n"
                + "2.{3}.*\r?\n"
                + "3.{3}.*\r?\n"
                + "4.{3}.*\r?\n"
                + "5.{3}.*\r?\n"
                + "6.{3}.*\r?\n"
                + "((?s: [A-Z]{3}=.*?(?=[ /][A-Z]{3}))*)"
                + "/[A-Z]{3}.*";
            Pattern p1 = Pattern.compile(regex1);

            String regex2 = "(?s) ([A-Z]{3}=)(.*?)(?=\r?\n [A-Z]{3}|$)";
            Pattern p2 = Pattern.compile(regex2);

            String regex3 = "(?: -)?(.+)";
            Pattern p3 = Pattern.compile(regex3);

            String input = "*XXXlkjsflkw34lkjsfd\n"
                + "2XXXlkjsdfojsfjoimf344\n"
                + "3XXXabcdef9999999\n"
                + "4XXX9f9f9f9f9f9f9f9f\n"
                + "5XXXg8g8g8g8g8g8g8g\n"
                + "6XXXe6e6e6e6e6e6e6e6e\n"
                + " YYY=D/23333333\n"
                + " -xxxxxxxxxxxx\n"
                + " -yyyyyyyyyyyy\n"
                + " ZZZ=gggggggggggg\n"
                + " AAA=hhhhhhhhhh\n"
                + " -jjjjjjjjjjj\n"
                + " -kkkkkkkkkkk\n"
                + "/XXX 2";

            Matcher m1 = p1.matcher(input);
            if (m1.find())
            {
                String sub = m1.group(1);
                Matcher m2 = p2.matcher(sub);
                while (m2.find())
                {
                    System.out.println("[" + m2.group(1) + "]");
                    String subsub = m2.group(2);
                    System.out.print("[");
                    Matcher m3 = p3.matcher(subsub);
                    while (m3.find())
                    {
                        System.out.println(m3.group(1));
                    }
                    System.out.println("]");
                }
            }
        }
    }

    //==================================================================

    result:

    [YYY=]
    [D/23333333
    xxxxxxxxxxxx
    yyyyyyyyyyyy
    ]
    [ZZZ=]
    [gggggggggggg
    ]
    [AAA=]
    [hhhhhhhhhh
    jjjjjjjjjjj
    kkkkkkkkkkk
    ]
     
    Alan Moore, Mar 24, 2005
    #7
  8. Guest

    Excellent. Thanks for the thorough detail. This could have been a
    whole chapter in "Regular Expression Recipes" :) .
     
    , Mar 25, 2005
    #8
