Regular expressions (multiple match problem)

Discussion in 'C Programming' started by mikko.n, Apr 2, 2008.

  1. mikko.n

    I have recently been experimenting with the GNU C library regular
    expression functions and noticed a problem with pattern matching. It
    seems to recognize only the first match and ignore the rest.
    An example:

    mikko.c:
    -----

    #include <stdio.h>
    #include <regex.h>
    #include <sys/types.h>

    int main(int argc, char *argv[]) {
        regex_t p;
        regmatch_t pm[2];
        regcomp(&p,"k",0);
        regexec(&p,"mikko",2,pm,0);
        printf("start=%d end=%d\n",pm[0].rm_so,pm[0].rm_eo);
        printf("start=%d end=%d\n",pm[1].rm_so,pm[1].rm_eo);
        regfree(&p);
        return 0;
    }

    -----

    This is intended to match the regular expression 'k' against the
    string 'mikko' and return the start and end of the first two matches
    in the array pm of regmatch_t elements. The output is, however:

    $ ./mikko
    start=2 end=3
    start=-1 end=-1

    instead of the expected

    start=2 end=3
    start=3 end=4

    Is this a bug in the GNU library, or have I overlooked something? I
    have not found any examples on the Internet of multiple subexpression
    matching with <regex.h> either.
    With more complicated regular expressions it usually seems to return
    only the first match, as here, and with wildcards the longest match,
    but in every case only one of them.

    Thanks,

    Mikko Nummelin
    mikko.n, Apr 2, 2008
    #1

  2. mikko.n wrote:
    >I have recently been experimenting with GNU C library regular
    >expression functions and noticed a problem with pattern matching.


    Then you should ask in a GNU newsgroup. Regular expressions are
    not part of the C standard, so the proper usage of
    any particular regular expression library should be discussed
    in the appropriate forum for that library.
    --
    "They called it golf because all the other four letter words
    were taken." -- Walter Hagen
    Walter Roberson, Apr 2, 2008
    #2

  3. On 2 Apr 2008 at 6:20, mikko.n wrote:
    > I have recently been experimenting with the GNU C library regular
    > expression functions and noticed a problem with pattern matching. It
    > seems to recognize only the first match and ignore the rest.

    <snip>


    The problem is that you misunderstand what a match is.

    If the regex matches, then pm[0] contains the offsets of the (first)
    match for the whole regex. But pm[1], pm[2], ... don't contain the
    offsets of subsequent matches of the whole regex; they contain the
    offsets of any parenthesized subexpressions that matched (within the
    match recorded in pm[0]). Your pattern has no subexpressions, so pm[1]
    is filled with -1.

    For example, try:

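    #include <stdio.h>
    #include <sys/types.h>
    #include <regex.h>

    int main(void)
    {
        regex_t p;
        regmatch_t pm[2];

        /* Basic regex syntax: \( \) marks subexpression 1. */
        regcomp(&p, "k\\(.\\)", 0);
        if (regexec(&p, "mikko", 2, pm, 0) == 0) {
            /* pm[0] is the whole match, pm[1] is subexpression 1. */
            printf("start=%d end=%d\n", (int)pm[0].rm_so, (int)pm[0].rm_eo);
            printf("start=%d end=%d\n", (int)pm[1].rm_so, (int)pm[1].rm_eo);
        }
        regfree(&p);
        return 0;
    }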


    $ ./a
    start=2 end=4
    start=3 end=4
    Antoninus Twink, Apr 2, 2008
    #3
  4. mikko.n

    On 2 Apr, 11:01, Antoninus Twink wrote:
    <snip>


    Is there then a simple alternative which would work so that it returns
    all the matches of the original regexp in the text?

    Mikko Nummelin
    mikko.n, Apr 2, 2008
    #4
  5. Flash Gordon

    mikko.n wrote, On 02/04/08 09:37:
    > On 2 Apr, 11:01, Antoninus Twink wrote:
    >> On 2 Apr 2008 at 6:20, mikko.n wrote:


    <snip>

    > Is there then a simple alternative which would work so that it returns
    > all the matches of the original regexp in the text?


    As Walter suggested, ask in a GNU group or mailing list where your
    question would be topical (there is one specifically for regexp) instead
    of comp.lang.c where it is not.

    I note that this time you have added a cross post to
    comp.unix.programmer where your question might be topical, but why
    continue posting where it is not?
    --
    Flash Gordon
    Flash Gordon, Apr 2, 2008
    #5
  6. On 2 Apr 2008 at 8:37, mikko.n wrote:
    > Is there then a simple alternative which would work so that it returns
    > all the matches of the original regexp in the text?


    Just use a loop, like this:


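    #include <stdio.h>
    #include <sys/types.h>
    #include <regex.h>

    int main(void)
    {
        regex_t p;
        regmatch_t pm;
        const char *s = "mikko mikko";
        regoff_t last_match = 0;   /* offset in s where the next search starts */

        regcomp(&p, "k", 0);
        /* Each call searches the rest of the string; pm holds offsets
           relative to s + last_match, so add last_match when printing. */
        while (regexec(&p, s + last_match, 1, &pm, 0) == 0) {
            printf("start=%d end=%d\n",
                   (int)(pm.rm_so + last_match), (int)(pm.rm_eo + last_match));
            /* Resume one character past the start of this match. */
            last_match += pm.rm_so + 1;
        }
        regfree(&p);
        return 0;
    }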
    Antoninus Twink, Apr 2, 2008
    #6
