Problem with boost::regex_replace

Discussion in 'C++' started by derek.google@grog.net, Oct 5, 2005.

  1. Guest

    I hope a Boost question is not too off-topic here. It seems that
    upgrading to Boost 1.33 broke some old regex code that used to work. I
    have reduced the problem to this simple example:

    cout << boost::regex_replace(string("foo"),
                                 boost::regex(".*"),
                                 string("bar")) << endl;

    The above code prints "barbar" where I expect "bar". Can anyone shed
    some light on this? It used to work with 1.30 (though regex_replace
    was called regex_merge). It seems to me that ".*" should match all of
    "foo" and replace it with "bar", so where does the second "bar" in
    "barbar" come from?

    Through trial and error I discovered that adding boost::match_not_null
    to the normal boost::match_default flag restores the desired behavior,
    but I'm not sure if this is significant or just coincidence.

    Derek
    , Oct 5, 2005
    #1

  2. Greg Guest

    wrote:
    > I hope a Boost question is not too off-topic here. It seems that
    > upgrading to Boost 1.33 broke some old regex code that used to work. I
    > have reduced the problem to this simple example:
    >
    > cout << boost::regex_replace(string("foo"),
    > boost::regex(".*"),
    > string("bar")) << endl;
    >
    > The above code prints "barbar" where I expect "bar". Can anyone shed
    > some light on this? It used to work with 1.30 (though regex_replace
    > was called regex_merge). It seems to me that ".*" should match all of
    > "foo" and replace it with "bar", so where does the second "bar" in
    > "barbar" come from?
    >
    > Through trial and error I discovered that adding boost::match_not_null
    > to the normal boost::match_default flag restores the desired behavior,
    > but I'm not sure if this is significant or just coincidence.



    The ".*" expression matches zero or more characters. The "zero" means
    it doesn't need any characters to make a match, so when there are zero
    characters left to search, the pattern finds another match.

    Because it can match the empty string, a .* search pattern is rarely
    the best choice. It's more likely that the regular expression .+ is
    the intended search pattern.
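
    The difference between the two patterns can be sketched as follows.
    This is a minimal illustration that uses std::regex from the C++11
    standard library as a stand-in for boost::regex (an assumption, made
    so the snippet has no external dependency; Boost 1.33 may differ in
    details). The helper name replace_with_bar is hypothetical.

```cpp
#include <regex>
#include <string>

// Replace every match of `pattern` in `input` with "bar".
// Assumption: std::regex stands in for boost::regex here.
std::string replace_with_bar(const std::string& input,
                             const std::string& pattern) {
    return std::regex_replace(input, std::regex(pattern), std::string("bar"));
}
```

    With this helper, replace_with_bar("foo", ".*") should give "barbar",
    while replace_with_bar("foo", ".+") should give "bar", because ".+"
    cannot match the empty tail left after "foo" has been consumed.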

    Greg
    Greg, Oct 6, 2005
    #2

  3. Guest

    Greg wrote:
    > > I hope a Boost question is not too off-topic here. It
    > > seems that upgrading to Boost 1.33 broke some old regex
    > > code that used to work. I have reduced the problem to
    > > this simple example:
    > >
    > > cout << boost::regex_replace(string("foo"),
    > > boost::regex(".*"),
    > > string("bar")) << endl;
    > >
    > > The above code prints "barbar" where I expect "bar".
    > > Can anyone shed some light on this? It used to work
    > > with 1.30 (though regex_replace was called regex_merge).
    > > It seems to me that ".*" should match all of "foo" and
    > > replace it with "bar", so where does the second "bar" in
    > > "barbar" come from?
    > >
    > > Through trial and error I discovered that adding
    > > boost::match_not_null to the normal boost::match_default
    > > flag restores the desired behavior, but I'm not sure if
    > > this is significant or just coincidence.

    >
    > The ".*" expression matches zero or more characters. The
    > "zero" means it doesn't need any characters to make a
    > match, so when there are zero characters left to search,
    > the pattern finds another match.
    >
    > Because it can match the empty string, a .* search pattern
    > is rarely the best choice. It's more likely that the regular
    > expression .+ is the intended search pattern.


    Thanks, Greg. Your explanation is precisely why I thought adding the
    boost::match_not_null flag fixed the problem -- because it "specifies
    that the expression can not be matched against an empty sequence."
    Unfortunately that explanation seems contradicted by this example,
    which is my original example with the empty sequence "" as input
    instead of "foo":

    cout << boost::regex_replace(string(""), // empty sequence
                                 boost::regex(".*"),
                                 string("bar")) << endl;

    By your reasoning the ".*" should match the empty input sequence "" and
    the output should be "bar". However, the output I get is the empty
    sequence "", not "bar".

    I am also bothered that in other languages I routinely use -- and in
    previous versions of Boost -- the expression ".*" does not match the
    empty sequence. It all seems very inconsistent.
    , Oct 6, 2005
    #3
  4. Greg Guest

    wrote:
    > Greg wrote:
    > > > I hope a Boost question is not too off-topic here. It
    > > > seems that upgrading to Boost 1.33 broke some old regex
    > > > code that used to work. I have reduced the problem to
    > > > this simple example:
    > > >
    > > > cout << boost::regex_replace(string("foo"),
    > > > boost::regex(".*"),
    > > > string("bar")) << endl;
    > > >
    > > > The above code prints "barbar" where I expect "bar".
    > > > Can anyone shed some light on this? It used to work
    > > > with 1.30 (though regex_replace was called regex_merge).
    > > > It seems to me that ".*" should match all of "foo" and
    > > > replace it with "bar", so where does the second "bar" in
    > > > "barbar" come from?
    > > >
    > > > Through trial and error I discovered that adding
    > > > boost::match_not_null to the normal boost::match_default
    > > > flag restores the desired behavior, but I'm not sure if
    > > > this is significant or just coincidence.

    > >
    > > The ".*" expression matches zero or more characters. The
    > > "zero" means it doesn't need any characters to make a
    > > match, so when there are zero characters left to search,
    > > the pattern finds another match.
    > >
    > > Because it can match the empty string, a .* search pattern
    > > is rarely the best choice. It's more likely that the regular
    > > expression .+ is the intended search pattern.

    >
    > Thanks, Greg. Your explanation is precisely why I thought adding the
    > boost::match_not_null flag fixed the problem -- because it "specifies
    > that the expression can not be matched against an empty sequence."
    > Unfortunately that explanation seems contradicted by this example,
    > which is my original example with the empty sequence "" as input
    > instead of "foo":
    >
    > cout << boost::regex_replace(string(""), // empty sequence
    > boost::regex(".*"),
    > string("bar")) << endl;
    >
    > By your reasoning the ".*" should match the empty input sequence "" and
    > the output should be "bar". However, the output I get is the empty
    > sequence "", not "bar".


    I am not able to reproduce this behavior with the default boost
    configuration. The above line of code does output "bar" with boost 1.33
    on my machine. Note that there is an unspecified default argument which
    should be boost::match_default. In other words, the above line should
    be equivalent to this statement:

    cout << boost::regex_replace(string(""), // empty sequence
                                 boost::regex(".*"),
                                 string("bar"),
                                 boost::match_default) << endl;

    Replacing "boost::match_default" with "boost::match_not_null" or with
    "boost::match_not_dot_null" changes the output to a zero-length
    string, as expected.
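
    The effect of the flag can be sketched with an explicit flags
    argument. As an assumption, this uses std::regex and
    std::regex_constants::match_not_null from the C++11 standard library
    in place of the Boost names, so it shows the general behavior rather
    than what any particular Boost build does. The helper name
    replace_dot_star is hypothetical.

```cpp
#include <regex>
#include <string>

// Replace matches of ".*" in `input` with "bar", using the given match flags.
// Assumption: std::regex stands in for boost::regex here.
std::string replace_dot_star(
        const std::string& input,
        std::regex_constants::match_flag_type flags =
            std::regex_constants::match_default) {
    static const std::regex re(".*");
    return std::regex_replace(input, re, std::string("bar"), flags);
}
```

    Called with the default flags, replace_dot_star("") should give "bar"
    and replace_dot_star("foo") should give "barbar". With
    std::regex_constants::match_not_null, the empty match is rejected, so
    the results should be "" and "bar" respectively.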

    You may wish to run the boost::regex test suite that is part of the
    distribution to ensure that your build was properly compiled.

    Greg
    Greg, Oct 6, 2005
    #4
  5. Pete Becker Guest

    Greg wrote:
    >
    > The ".*" expression matches zero or more characters. The "zero" means
    > it doesn't need any characters to make a match, so when there are zero
    > characters left to search, the pattern finds another match.
    >
    > Because it can match the empty string, a .* search pattern is rarely
    > the best choice. It's more likely that the regular expression .+ is
    > the intended search pattern.
    >


    However, under the maximum munch rule, in an otherwise unconstrained
    search like the one at issue it should match the entire target sequence.

    --

    Pete Becker
    Dinkumware, Ltd. (http://www.dinkumware.com)
    Pete Becker, Oct 6, 2005
    #5
  6. Guest

    Greg wrote:
    > I am not able to reproduce this behavior with the default boost
    > configuration. The above line of code does output "bar" with boost 1.33
    > on my machine. Note that there is an unspecified default argument which
    > should be boost::match_default. In other words, the above line should
    > be equivalent to this statement:
    >
    > cout << boost::regex_replace(string(""), // empty sequence
    > boost::regex(".*"),
    > string("bar"),
    > boost::match_default) << endl;
    >
    > Replacing "boost::match_default" with "boost::match_not_null" or with
    > "boost::match_not_dot_null" changes the output to a zero-length
    > string, as expected.
    >
    > You may wish to run the boost::regex test suite that is part of the
    > distribution to ensure that your build was properly compiled.
    >
    > Greg


    You are quite right; I'm not sure why it didn't work before, but I
    re-compiled my example and sure enough ".*" does indeed match the empty
    string. That is, the following code outputs "bar":

    cout << regex_replace(string(""), // empty sequence
                          regex(".*"),
                          string("bar"),
                          match_default) << endl;

    However, going back to the original example, I'm still confused about
    one thing. Recall that the following code outputs "barbar":

    cout << regex_replace(string("foo"),
                          regex(".*"),
                          string("bar"),
                          match_default) << endl;

    So if I understand everything you've said, ".*" matches "foo" and
    replaces it with "bar", and then it matches the remaining empty
    sequence and outputs "bar" again (hence "barbar"). However, as Mr.
    Becker points out, the "maximum munch" rule suggests that the ".*"
    should consume "foo" *and* the empty sequence that implicitly follows,
    right? It seems to me that the "maximum munch" rule suggests the
    output should be "bar", not "barbar".
    , Oct 6, 2005
    #6
  7. Pete Becker Guest

    wrote:
    >
    > So if I understand everything you've said, ".*" matches "foo" and
    > replaces it with "bar", and then it matches the remaining empty
    > sequence and outputs "bar" again (hence "barbar"). However, as Mr.
    > Becker points out, the "maximum munch" rule suggests that the ".*"
    > should consume "foo" *and* the empty sequence that implicitly follows,
    > right? It seems to me that the "maximum munch" rule suggests the
    > output should be "bar", not "barbar".
    >


    No, the maximum munch rule consumes the three characters, leaving an
    empty target string. The next attempt to match ".*" succeeds, so there
    is a second match. Having just matched an empty string, the search
    algorithm now requires a non-null match, which fails, and the search
    terminates. So you should get "barbar", because you got two matches. At
    least, that's what I currently think, but I'm at the C++ Standards
    Committee meeting, and half listening to a discussion of concept
    checking, so I don't promise that I've analyzed it correctly. However,
    our implementation hit an infinite loop on your example because it
    didn't force non-null for the search after an empty match, and once I
    fixed that, I got the same result as you're seeing. <g>
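
    The search loop described above can be sketched roughly like this.
    It is not the Boost or Dinkumware implementation, only an
    illustration of the rule "after an empty match, require a non-empty
    match at the same position". It assumes std::regex from the C++11
    standard library, and all_matches is a hypothetical helper name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every match a repeated search for `pattern` finds in `input`.
// After an empty match, the next attempt at the same position must be
// non-empty; if that fails, step forward one character, or stop when
// the end of the input has been reached.
std::vector<std::string> all_matches(const std::string& input,
                                     const std::string& pattern) {
    namespace rc = std::regex_constants;
    const std::regex re(pattern);
    std::vector<std::string> found;
    std::smatch m;
    auto pos = input.cbegin();
    const auto end = input.cend();
    bool last_was_empty = false;

    while (true) {
        if (last_was_empty) {
            // Retry at the same position, but forbid another empty match.
            if (!std::regex_search(pos, end, m, re,
                                   rc::match_not_null | rc::match_continuous)) {
                if (pos == end) break;  // nothing left: the search terminates
                ++pos;                  // otherwise skip one character
                last_was_empty = false;
                continue;
            }
        } else if (!std::regex_search(pos, end, m, re)) {
            break;                      // no further matches
        }
        found.push_back(m.str());
        pos = m[0].second;              // resume where this match ended
        last_was_empty = (m.length() == 0);
    }
    return found;
}
```

    For the input "foo" and the pattern ".*", this loop should report
    two matches, "foo" and then the empty string at the end of the
    input, which is why the replacement text appears twice.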

    --

    Pete Becker
    Dinkumware, Ltd. (http://www.dinkumware.com)
    Pete Becker, Oct 6, 2005
    #7
  8. Guest

    Pete Becker wrote:
    > > So if I understand everything you've said, ".*" matches "foo" and
    > > replaces it with "bar", and then it matches the remaining empty
    > > sequence and outputs "bar" again (hence "barbar"). However, as Mr.
    > > Becker points out, the "maximum munch" rule suggests that the ".*"
    > > should consume "foo" *and* the empty sequence that implicitly follows,
    > > right? It seems to me that the "maximum munch" rule suggests the
    > > output should be "bar", not "barbar".
    > >

    >
    > No, the maximum munch rule consumes the three characters, leaving an
    > empty target string. The next attempt to match ".*" succeeds, so there
    > is a second match. Having just matched an empty string, the search
    > algorithm now requires a non-null match, which fails, and the search
    > terminates. So you should get "barbar", because you got two matches. At
    > least, that's what I currently think, but I'm at the C++ Standards
    > Committee meeting, and half listening to a discussion of concept
    > checking, so I don't promise that I've analyzed it correctly. However,
    > our implementation hit an infinite loop on your example because it
    > didn't force non-null for the search after an empty match, and once I
    > fixed that, I got the same result as you're seeing.


    Thanks for the explanation -- to you and Greg both. My world makes a
    little more sense now.
    , Oct 6, 2005
    #8