regex_replace()

Discussion in 'C++' started by Friedel Jantzen, May 10, 2011.

  1. Hi!
    Using MSVS 2008, STL TR1 <regex>.

    Is there a way to get to know the number of replacements done by
    regex_replace(), or at least, whether there was a replacement at all?
    Ok, I can compare if(output != input) after regex_replace(), but this is
    wasted performance, if there is a better way.

    If a regex_error is thrown, how can I get the position of the error in the
    regular expression string?

    I found that icase (ignore case) only works with A-Za-z, but e.g. not with
    German Umlaute (Ää etc.), though I set my German user locale using
    regex::imbue(). Am I missing something?

    TIA,
    Friedel
     
    Friedel Jantzen, May 10, 2011
    #1

  2. On 10 May, 09:23, Friedel Jantzen wrote:
    > Hi!
    > Using MSVS 2008, STL TR1 <regex>.
    >
    > Is there a way to get to know the number of replacements done by
    > regex_replace(), or at least, whether there was a replacement at all?
    > Ok, I can compare if(output != input) after regex_replace(), but this is
    > wasted performance, if there is a better way.


    You can roll your own: regex_replace basically instantiates a
    regex_iterator from the parameters and performs the replacement. It
    shouldn't be too hard.
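
A minimal sketch of that suggestion, assuming the C++11 `<regex>` header rather than the TR1 one used in the thread (with TR1 the names live in `std::tr1`). The helper name `replace_and_count` is hypothetical and not part of any library; it walks the matches with `std::sregex_iterator`, formats each one, and counts them along the way:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper (not part of <regex>): produces the same text as the
// simple std::regex_replace(input, rx, fmt) call, but also reports how many
// matches were replaced.
std::pair<std::string, std::size_t>
replace_and_count(const std::string& input, const std::regex& rx,
                  const std::string& fmt)
{
    std::string out;
    std::size_t count = 0;
    auto tail = input.begin();  // start of the text not yet copied

    for (std::sregex_iterator it(input.begin(), input.end(), rx), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(tail, m[0].first);            // text between matches
        m.format(std::back_inserter(out), fmt);  // expands $1, $& and so on
        tail = m[0].second;
        ++count;
    }
    out.append(tail, input.end());               // text after the last match
    return {std::move(out), count};
}
```

A count of zero answers the "was anything replaced at all?" question without comparing the output against the input.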

    > If a regex_error is thrown, how can I get the position of the error in the
    > regular expression string?


    AFAIS you cannot; and POSIX regcomp doesn't give more information
    either.
    You will need a regex format validator.
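
To illustrate what is available, here is a small sketch assuming C++11 `<regex>` (the thread uses the TR1 equivalent). The helper name `regex_compile_error` is made up for this example. `std::regex_error` offers `code()` and `what()`, and neither carries an offset into the pattern:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: tries to compile `pattern` and returns an empty string
// on success, or a description of the failure otherwise.
std::string regex_compile_error(const std::string& pattern)
{
    try {
        std::regex rx(pattern);
        (void)rx;  // only the side effect of compiling matters here
        return std::string();
    } catch (const std::regex_error& e) {
        std::string msg = e.what();  // wording is implementation-defined
        if (msg.empty()) msg = "regex_error";
        // e.code() is a std::regex_constants::error_type such as error_paren
        // or error_brack: a category of error, not a position.
        return msg + " (code " + std::to_string(static_cast<int>(e.code())) + ")";
    }
}
```

The numeric value printed for the code differs between standard library implementations, so it is best compared against the named constants rather than hard-coded.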

    > I found that icase (ignore case) only works with A-Za-z, but e.g. not with
    > German Umlaute (Ää etc.), though I set my German user locale using
    > regex::imbue(). Am I missing something?


    This may not be implemented or handled correctly by the compiler.

    --
    Michael
     
    Michael Doubez, May 10, 2011
    #2

  3. Michael Doubez <> wrote:
    > You will need a regex format validator.


    You can use a regexp to validate a regexp string. There would be a
    marvelous conceptual recursion there... :)
     
    Juha Nieminen, May 10, 2011
    #3
  4. On 10 May, 16:47, Juha Nieminen wrote:
    > Michael Doubez <> wrote:
    > > You will need a regex format validator.

    >
    >   You can use a regexp to validate a regexp string. There would be a
    > marvelous conceptual recursion there... :)


    I hesitated to make the joke, but IMO regex grammar is not powerful
    enough to validate a regular expression.

    I actually tried to find one available on the net in C or C++ but,
    strangely, none seem readily available (I didn't look too hard, just
    googled a bit).

    --
    Michael
     
    Michael Doubez, May 10, 2011
    #4
  5. Michael Doubez <> writes:

    > On 10 May, 16:47, Juha Nieminen wrote:
    >> Michael Doubez <> wrote:
    >> > You will need a regex format validator.

    >>
    >>   You can use a regexp to validate a regexp string. There would be a
    >> marvelous conceptual recursion there... :)

    >
    > I hesitated to make the joke but IMO regex grammar is not powerful
    > enough to validate regex expression.


    You're right, regular expression languages are not regular.

    -- Alain.
     
    Alain Ketterlin, May 10, 2011
    #5
  6. > You can use a regexp to validate a regexp string. There would be a
    > marvelous conceptual recursion there... :)


    :)
    A homespun validator might not report the same position at which the
    engine detected the error.

    Friedel
     
    Friedel Jantzen, May 11, 2011
    #6
  7. Thank you for your reply!
    ....
    > You can roll your own: regex_replace basically instantiates a
    > regex_iterator from the parameters and performs the replacement. It
    > shouldn't be too hard.


    I wrote test code to do this, but as STL regex is new to me, I thought I
    might have missed something and be reinventing the wheel.
    ....
    > AFAIS you cannot; and POSIX regcomp doesn't give more information
    > either.
    > You will need a regex format validator.


    :)

    >> I found that icase (ignore case) only works with A-Za-z, but e.g. not with
    >> German Umlaute (Ää etc.), though I set my German user locale using
    >> regex::imbue(). Am I missing something?

    >
    > This may not be implemented or handled correctly by the compiler.


    Yes, it looks somehow "premature" to me.

    Thank you,
    Friedel
     
    Friedel Jantzen, May 11, 2011
    #7
  8. On 11 May, 08:02, Friedel Jantzen wrote:
    > >> I found that icase (ignore case) only works with A-Za-z, but e.g. not with
    > >> German Umlaute (Ää etc.), though I set my German user locale using
    > >> regex::imbue(). Am I missing something?

    >
    > > This may not be implemented or handled correctly by the compiler.

    >
    > Yes, it looks somehow "premature" to me.


    You could try toupper/tolower with your locale and see if it works on the
    umlaut (and the eszett :) ).

    --
    Michael
     
    Michael Doubez, May 11, 2011
    #8
  9. Michael Doubez wrote:

    > You could try toupper/tolower with your locale and see if it works on the
    > umlaut (and the eszett :) ).


    I was about to tell you that there is no uppercase "ß". But then I
    noticed the smiley which made me think that you knew. So I will refrain
    from telling you.
     
    Ralf Goertz, May 11, 2011
    #9
  10. Juha Nieminen <> writes:

    > Michael Doubez <> wrote:
    >> You will need a regex format validator.

    >
    > You can use a regexp to validate a regexp string. There would be a
    > marvelous conceptual recursion there... :)


    Nope, at least not by itself. The language of regexps is not itself regular.
    I don't know the exact details of TR1 regexps, but I doubt they can
    check for matched parentheses.
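
Checking nesting needs a counter or a stack, which is easy in ordinary code. The following is only a sketch of a homespun check, not a full validator: the function name `first_unbalanced_paren` is hypothetical, it assumes ECMAScript-style syntax (a backslash escapes the next character, parentheses inside `[...]` are literal), and the offset it reports need not be the one a real engine would complain about:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the offset of the first unbalanced parenthesis in `pattern`,
// or std::string::npos if all parentheses pair up.
std::size_t first_unbalanced_paren(const std::string& pattern)
{
    std::vector<std::size_t> open;  // offsets of '(' still waiting for ')'
    bool in_class = false;          // inside a [...] character class

    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') {
            ++i;                    // skip the escaped character
        } else if (in_class) {
            if (c == ']') in_class = false;
        } else if (c == '[') {
            in_class = true;
        } else if (c == '(') {
            open.push_back(i);
        } else if (c == ')') {
            if (open.empty()) return i;  // ')' with no matching '('
            open.pop_back();
        }
    }
    // Any '(' left over was never closed; report the outermost one.
    return open.empty() ? std::string::npos : open.front();
}
```

For example, `first_unbalanced_paren("a)b")` yields 1 and `first_unbalanced_paren("(a(b)")` yields 0.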

    /L
    --
    Lasse Reichstein Holst Nielsen
    DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
    'Faith without judgement merely degrades the spirit divine.'
     
    Lasse Reichstein Nielsen, May 11, 2011
    #10
  11. On Wed, 11 May 2011 02:09:34 -0700 (PDT), Michael Doubez wrote:
    ....
    > You could try toupper/tolower with your locale and see if it works on the
    > umlaut (and the eszett :) ).


    Thank you for this hint.

    cout << "User locale: " << locale("").name() << endl;//German_Germany.1252
    setlocale(LC_ALL, "");

    toupper() result:
    toupper('ö', locale("")) == 'Ö'
    (but toupper('ö') != 'Ö')

    Replacing:
    regex::flag_type rxFlags = regex::icase | regex::ECMAScript;
    string rxStr = "ö";
    string replStr = "oe";
    string input("Schönes Österreich");
    regex rx;
    rx.imbue(locale(""));
    rx.assign(rxStr, rxFlags);
    string output = regex_replace(input, rx, replStr);
    // output == "Schoenes Österreich" --> capital Ö NOT replaced

    I wonder if it works on e.g. a French system with sth. like é and É ?

    Regards,
    Friedel
     
    Friedel Jantzen, May 12, 2011
    #11
  12. Friedel Jantzen wrote:

    > On Wed, 11 May 2011 02:09:34 -0700 (PDT), Michael Doubez wrote:
    > ...
    >> You could try toupper/tolower with your locale and see if it works on the
    >> umlaut (and the eszett :) ).

    >
    > Thank you for this hint.
    >
    > cout << "User locale: " << locale("").name() << endl;//German_Germany.1252
    > setlocale(LC_ALL, "");
    >
    > toupper() result:
    > toupper('ö', locale("")) == 'Ö'
    > (but toupper('ö') != 'Ö')
    >
    > Replacing:
    > regex::flag_type rxFlags = regex::icase | regex::ECMAScript;
    > string rxStr = "ö";
    > string replStr = "oe";
    > string input("Schönes Österreich");
    > regex rx;
    > rx.imbue(locale(""));
    > rx.assign(rxStr, rxFlags);
    > string output = regex_replace(input, rx, replStr);
    > // output == "Schoenes Österreich" --> capital Ö NOT replaced
    >
    > I wonder if it works on e.g. a French system with sth. like é and É ?


    If you use wstrings it should work (except for the toupper without a
    locale argument). Here I used Boost under Linux:


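    #include <clocale>   // setlocale
    #include <iostream>
    #include <locale>    // std::locale and the two-argument toupper
    #include <string>
    #include <boost/regex.hpp>

    using namespace std;
    using namespace boost;

    int main() {
        ios::sync_with_stdio(false);
        cout << "User locale: " << locale("").name() << endl;
        setlocale(LC_ALL, "");
        wcout.imbue(locale(""));

        // The unescaped "" inside the next literal splits it into two adjacent
        // literals that the compiler concatenates, so it prints "locale()".
        wcout << L"toupper('ö', locale("")) == 'Ö': " << boolalpha
              << (toupper(L'ö', locale("")) == L'Ö') << endl;
        // The one-argument call resolves to the C library's toupper(int).
        wcout << L"toupper('ö')==Ö: " << boolalpha << (toupper(L'ö') == L'Ö') << endl;

        regex::flag_type rxFlags = regex::icase | regex::ECMAScript;
        wstring rxStr = L"ö";
        wstring replStr = L"oe";
        wstring input(L"Schönes Österreich");
        wregex rx;
        rx.imbue(locale(""));       // set the locale before compiling the pattern
        rx.assign(rxStr, rxFlags);
        wstring output = regex_replace(input, rx, replStr);
        wcout << input << L" -> " << output << endl;
    }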

    output:

    User locale: de_DE.UTF-8
    toupper('ö', locale()) == 'Ö': true
    toupper('ö')==Ö: false
    Schönes Österreich -> Schoenes oesterreich
     
    Ralf Goertz, May 12, 2011
    #12
  13. On 12 May, 07:39, Friedel Jantzen wrote:
    > On Wed, 11 May 2011 02:09:34 -0700 (PDT), Michael Doubez wrote:
    > ...
    >
    > > You could try toupper/tolower with your locale and see if it works on the
    > > umlaut (and the eszett :) ).

    >
    > Thank you for this hint.
    >
    > cout << "User locale: " << locale("").name() << endl;//German_Germany.1252
    > setlocale(LC_ALL, "");
    >
    > toupper() result:
    > toupper('ö', locale("")) == 'Ö'
    > (but toupper('ö') != 'Ö')
    >
    > Replacing:
    > regex::flag_type rxFlags = regex::icase | regex::ECMAScript;
    > string rxStr = "ö";
    > string replStr = "oe";
    > string input("Schönes Österreich");
    > regex rx;
    > rx.imbue(locale(""));
    > rx.assign(rxStr, rxFlags);
    > string output = regex_replace(input, rx, replStr);
    > // output == "Schoenes Österreich" --> capital Ö NOT replaced
    >
    > I wonder if it works on e.g. a French system with sth. like é and É ?


    It works well enough on gcc version 4.3.3:

    std::locale loc("");
    std::cout << "User locale: " << loc.name() << std::endl;
    char const str[] = "àäâéèêëïîöôüû";
    std::cout << str << std::endl;
    for (char const* it = str; *it; ++it)
    {
        std::cout << toupper(*it, loc);
    }
    std::cout << std::endl;

    Output:
    User locale: fr_FR
    àäâéèêëïîöôüû
    ÀÄÂÉÈÊËÏÎÖÔÜÛ

    A German locale is not installed on my system, so I couldn't try it.

    --
    Michael
     
    Michael Doubez, May 12, 2011
    #13
  14. On 12 May, 11:53, Michael Doubez wrote:
    > On 12 May, 07:39, Friedel Jantzen wrote:
    >
    > > On Wed, 11 May 2011 02:09:34 -0700 (PDT), Michael Doubez wrote:
    > > ...

    >
    > > > You could try toupper/tolower with your locale and see if it works on the
    > > > umlaut (and the eszett :) ).

    >
    > > Thank you for this hint.

    >
    > > cout << "User locale: " << locale("").name() << endl;//German_Germany.1252
    > > setlocale(LC_ALL, "");

    >
    > > toupper() result:
    > > toupper('ö', locale("")) == 'Ö'
    > > (but toupper('ö') != 'Ö')

    >
    > > Replacing:
    > > regex::flag_type rxFlags = regex::icase | regex::ECMAScript;
    > > string rxStr = "ö";
    > > string replStr = "oe";
    > > string input("Schönes Österreich");
    > > regex rx;
    > > rx.imbue(locale(""));
    > > rx.assign(rxStr, rxFlags);
    > > string output = regex_replace(input, rx, replStr);
    > > // output == "Schoenes Österreich" --> capital Ö NOT replaced

    >
    > > I wonder if it works on e.g. a French system with sth. like é and É?

    >
    > It works well enough on gcc version 4.3.3:

    [snip]

    Oops, you were talking about regex. Well, I don't have a recent
    compiler on this machine (and no admin rights), so I cannot test it
    right now.

    --
    Michael
     
    Michael Doubez, May 12, 2011
    #14
  15. Thank you for testing.

    On Thu, 12 May 2011 10:12:20 +0200, Ralf Goertz wrote:
    > ...
    > If you use wstrings it should work (except for the toupper without
    > locale specification). Here I used boost under linux:
    > ...
    >
    > output:
    >
    > User locale: de_DE.UTF-8
    > toupper('ö', locale()) == 'Ö': true
    > toupper('ö')==Ö: false
    > Schönes Österreich -> Schoenes oesterreich


    Compiled with MS VS2008, on Windows Vista, the output is:

    User locale: German_Germany.1252
    toupper('ö', locale("")) == 'Ö': true
    toupper('ö')==Ö: true
    Schönes Österreich -> Schoenes Österreich

    It looks like with this regex implementation (AFAIK MS licensed it from
    Dinkumware), icase does not work with wstring either.
    Interestingly, toupper('ö')==Ö is true here.

    Regards,
    Friedel
     
    Friedel Jantzen, May 13, 2011
    #15
  16. On Tue, 2011-05-10, Michael Doubez wrote:
    > On 10 May, 09:23, Friedel Jantzen wrote:

    ....
    >> If a regex_error is thrown, how can I get the position of the error in the
    >> regular expression string?

    >
    > AFAIS you cannot; and POSIX regcomp doesn't give more information
    > either.
    > You will need a regex format validator.


    POSIX gives you *something* using regerror(3); I assume it's more than
    "your regexp is broken" but less than "the problem is the backslash in
    position 42".
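
For comparison, a sketch of what the POSIX interface reports. It assumes a POSIX system providing `<regex.h>` (so it does not apply to MSVC), and the helper name `posix_regex_error` is made up for this example. `regerror` turns the error code from `regcomp` into a message, but no offset into the pattern is part of the interface:

```cpp
#include <regex.h>   // POSIX header, not the C++ <regex> header
#include <string>

// Hypothetical helper: returns an empty string if `pattern` compiles as a
// POSIX extended regular expression, otherwise the text regerror() produces.
std::string posix_regex_error(const std::string& pattern)
{
    regex_t re;
    const int rc = regcomp(&re, pattern.c_str(), REG_EXTENDED | REG_NOSUB);
    if (rc == 0) {
        regfree(&re);  // release the compiled pattern
        return std::string();
    }
    char buf[256];
    regerror(rc, &re, buf, sizeof buf);  // message only; no offset is reported
    return buf;
}
```

The wording of the message is implementation-defined, so it is suitable for showing to a user but not for parsing.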

    Regexps are best used hard-coded anyway, rather than generated on the
    fly or (worse) generated from user input. So this is usually not a
    major problem.

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, May 13, 2011
    #16
  17. On May 12, 6:39 am, Friedel Jantzen wrote:
    > On Wed, 11 May 2011 02:09:34 -0700 (PDT), Michael Doubez wrote:
    > ...
    >
    > > You could try toupper/tolower with your locale and see if it works on the
    > > umlaut (and the eszett :) ).


    > Thank you for this hint.


    > cout << "User locale: " << locale("").name() << endl;//German_Germany.1252
    > setlocale(LC_ALL, "");


    > toupper() result:
    > toupper('ö', locale("")) == 'Ö'
    > (but toupper('ö') != 'Ö')


    > Replacing:
    > regex::flag_type rxFlags = regex::icase | regex::ECMAScript;
    > string rxStr = "ö";
    > string replStr = "oe";
    > string input("Schönes Österreich");
    > regex rx;
    > rx.imbue(locale(""));
    > rx.assign(rxStr, rxFlags);
    > string output = regex_replace(input, rx, replStr);
    > // output == "Schoenes Österreich" --> capital Ö NOT replaced


    This one's tricky. It's why Unicode introduced title case: if
    you ever really wanted to do this, what you'd want to get would
    be "Schoenes Oesterreich". Not sure what that might mean in
    the context of regular expressions, however; you'd probably want
    a flag stating whether substitution should use a) the case of
    the original, b) title case if the original was upper case, or
    c) context-sensitive title case.
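
One way to get that result without relying on icase or on the imbued locale is to list both cases in the pattern and choose the replacement per match. This is only a sketch under stated assumptions: it uses C++11 `std::wregex`, the function name `transliterate_o_umlaut` is hypothetical, and it handles just this one letter pair (option b above, title case for an upper-case original):

```cpp
#include <regex>
#include <string>

// Replaces ö with "oe" and Ö with "Oe" (title case rather than "OE").
// Both cases are listed explicitly in the character class, so the result does
// not depend on icase or on the locale imbued into the regex.
std::wstring transliterate_o_umlaut(const std::wstring& input)
{
    static const std::wregex rx(L"[\u00f6\u00d6]");  // ö and Ö
    std::wstring out;
    auto tail = input.begin();  // start of the text not yet copied

    for (std::wsregex_iterator it(input.begin(), input.end(), rx), end;
         it != end; ++it) {
        const std::wsmatch& m = *it;
        out.append(tail, m[0].first);
        // Pick the replacement from the case of the matched character.
        out += (*m[0].first == L'\u00d6') ? L"Oe" : L"oe";
        tail = m[0].second;
    }
    out.append(tail, input.end());
    return out;
}
```

The universal character names keep the source independent of the file encoding; a general solution would need a table of letter pairs or a Unicode library rather than one hard-coded class.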

    > I wonder if it works on e.g. a French system with sth. like é and É ?
     
    James Kanze, May 15, 2011
    #17
