Strange problem with PCRE

S

szr

I encountered this strange problem while making a graphical application
under Windows and I was able to make a simple test program that
demonstrates the problem that appears to be occuring (if indeed it is a
problem and I'm not just missing something.)

It runs under both windows (using either Embarcadero's or Borland's C++
Builders, though I was unable to test under any version of Visual Studio
VC++ since it appears to lack any PCRE libraries) or g++ under linux.


Begin Demonstration Code
----------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h> // strncpy.


#ifdef __GNUG__
#include <regex.h> // regcomp, regexec.
#else
#include <pcreposix.h> // regcomp, regexec.

// Since Embaradero and Borland don't seem to define this.
#define REG_EXTENDED 1
#endif


void error(const char* title, regex_t &re, int status) {
char err_msg[128];
regerror(status, &re, err_msg, 128);
printf( "[%s] Error: %s\n", title, err_msg );
}

void clean_up( regex_t &re, regmatch_t* pmatch = NULL ) {
if ( pmatch ) delete[] pmatch;
regfree( &re );
}

int main() { // int argc, char* argv[]
char str[128] = "$$$ ABC123def ###";
char pat[128] = "([A-Za-z]+)([0-9]+)([A-Za-z]+)";

regex_t re;
int status;

if ((status = regcomp(&re, pat, REG_EXTENDED)) != 0) {
error("regcomp", re, status);
clean_up( re );
exit( -1 );
}

// !!! Changing "re.re_nsub + 3" to just "re.re_nsub" in the next
// !!! two lines seems to exibit the problem. I don't understand
// !!! why it only works right with 3 additional slots allocated.
// ( See OUTPUT section below for comparisons. )
regmatch_t *pmatch = new regmatch_t[ re.re_nsub + 3 ];

if ((status = regexec(&re, str, re.re_nsub + 3, pmatch, 0)) != 0) {
error("regexec", re, status);
clean_up( re, pmatch );
exit( -1 );
}

for (unsigned int i=0; i<=re.re_nsub; i++) {
char cap[128] = ""; // Capture Group.
strncpy(
cap,
str + pmatch.rm_so,
pmatch.rm_eo - pmatch.rm_so
);

printf(
"[%2d] so[%2d] eo[%2d] m[%s]\n",
i,
pmatch.rm_so,
pmatch.rm_eo,
cap
);
}

// Clean up.
delete[] pmatch;
regfree( &re );
getchar(); // Pause (so cmd.exe doesn't close under Windows.)
return 0;
}

/*
OUTPUT COMPARISON:

// Using: re.re_nsub
// Under linux, sometimes I get
// *** glibc detected *** ./test: malloc():
// memory corruption: 0x0804a960 ***
// instead of the below.
[ 0] so[ 0] eo[ 9] m[ABC123def]
[ 1] so[ 0] eo[ 3] m[ABC]
[ 2] so[-1] eo[ 0] m[]
[ 3] so[ 0] eo[ 0] m[]

// Using; re.re_nsub + 1
// Sometimes I get the same crash as noted above.
[ 0] so[ 0] eo[ 9] m[ABC123def]
[ 1] so[ 0] eo[ 3] m[ABC]
[ 2] so[-1] eo[ 0] m[]
[ 3] so[-1] eo[-1] m[]

// Using: re.re_nsub + 2
// Sometimes I get the same crash as noted above.
[ 0] so[ 0] eo[ 9] m[ABC123def]
[ 1] so[ 0] eo[ 3] m[ABC]
[ 2] so[ 3] eo[ 6] m[123]
[ 3] so[-1] eo[ 3] m[]

// Using: re.re_nsub + 3
// Only with 3 extra indices allocated does it work right. But why?
[ 0] so[ 0] eo[ 9] m[ABC123def]
[ 1] so[ 0] eo[ 3] m[ABC]
[ 2] so[ 3] eo[ 6] m[123]
[ 3] so[ 6] eo[ 9] m[def]
*/
----------------------------------------------------------------------
End Demonstration Code



As can be seen from the output snapshots above, if I change

regmatch_t *pmatch = new regmatch_t[ re.re_nsub + 3 ];

if ((status = regexec(&re, str, re.re_nsub + 3, pmatch, 0)) != 0) {

to

regmatch_t *pmatch = new regmatch_t[ re.re_nsub ];

if ((status = regexec(&re, str, re.re_nsub, pmatch, 0)) != 0) {

Then it doesn't display the matches (capture groups) correctly and even
sometimes crashes outright with a
*** glibc detected *** ./test: malloc(): memory corruption: ...
error from glibc, under Linux. Otherwise I get the first output
comparison above.

Thanks for any insight on this. I admit I'm not used to working with
regex in C or C++, so maybe I'm missing something, but I can't seem to
find anything in any documentation about a need for (at least) 3 extra
slots to be allocated.
 
J

Jorgen Grahn

I encountered this strange problem while making a graphical application
under Windows and I was able to make a simple test program that
demonstrates the problem that appears to be occuring (if indeed it is a
problem and I'm not just missing something.)

It runs under both windows (using either Embarcadero's or Borland's C++
Builders, though I was unable to test under any version of Visual Studio
VC++ since it appears to lack any PCRE libraries) or g++ under linux.
....
#ifdef __GNUG__
#include <regex.h> // regcomp, regexec.
#else
#include <pcreposix.h> // regcomp, regexec.

// Since Embaradero and Borland don't seem to define this.
#define REG_EXTENDED 1
#endif

What you're using is not PCRE (Perl Compatible Regular Expressions)
but the plain Posix regular expressions. "pcreposix.h" might mean
"Posix-regcomp/regexec emulation on top of PCRE".

Or not. http://www.pcre.org/readme.txt says

In addition, there is a set of C wrapper functions (again, just
for the 8-bit library) that are based on the POSIX regular
expression API (see the pcreposix man page). These end up in the
library called libpcreposix. Note that this just provides a POSIX
calling interface to PCRE; the regular expressions themselves
still follow Perl syntax and semantics. [...]

This makes your #ifdef above a problem. Both include files provide
regcomp() and regexec(), but they do different things. There are
plenty of differences between Posix and Perl REs.

That's not the cause of the crashes, though.

/Jorgen
 
J

Jorgen Grahn

I encountered this strange problem while making a graphical application
under Windows and I was able to make a simple test program that
demonstrates the problem that appears to be occuring (if indeed it is a
problem and I'm not just missing something.)

It runs under both windows (using either Embarcadero's or Borland's C++
Builders, though I was unable to test under any version of Visual Studio
VC++ since it appears to lack any PCRE libraries) or g++ under linux.


Begin Demonstration Code
----------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h> // strncpy.


#ifdef __GNUG__
#include <regex.h> // regcomp, regexec.
#else
#include <pcreposix.h> // regcomp, regexec.

// Since Embaradero and Borland don't seem to define this.
#define REG_EXTENDED 1
#endif


void error(const char* title, regex_t &re, int status) {
char err_msg[128];
regerror(status, &re, err_msg, 128);
printf( "[%s] Error: %s\n", title, err_msg );
}

void clean_up( regex_t &re, regmatch_t* pmatch = NULL ) {
if ( pmatch ) delete[] pmatch;

You never have to check for NULL before deleting. (Or before calling
delete() in C for that matter.)
regfree( &re );
}

int main() { // int argc, char* argv[]
char str[128] = "$$$ ABC123def ###";
char pat[128] = "([A-Za-z]+)([0-9]+)([A-Za-z]+)";

Use this instead for clarity and safety:

const char str[] = "$$$ ABC123def ###";
const char pat[] = "([A-Za-z]+)([0-9]+)([A-Za-z]+)";
regex_t re;
int status;

if ((status = regcomp(&re, pat, REG_EXTENDED)) != 0) {

Using a REG_EXTENDED which you just made up yourself is a problem.
libpcreposix won't magically understand what you mean.
In the regex.h case it's of course correct.
error("regcomp", re, status);
clean_up( re );
exit( -1 );
}

// !!! Changing "re.re_nsub + 3" to just "re.re_nsub" in the next
// !!! two lines seems to exibit the problem. I don't understand
// !!! why it only works right with 3 additional slots allocated.
// ( See OUTPUT section below for comparisons. )
regmatch_t *pmatch = new regmatch_t[ re.re_nsub + 3 ];

if ((status = regexec(&re, str, re.re_nsub + 3, pmatch, 0)) != 0) {
error("regexec", re, status);
clean_up( re, pmatch );
exit( -1 );
}

for (unsigned int i=0; i<=re.re_nsub; i++) {
char cap[128] = ""; // Capture Group.
strncpy(
cap,
str + pmatch.rm_so,
pmatch.rm_eo - pmatch.rm_so
);


valgrind: "Invalid read of size 4", reported two times.

You were told there were re.re_nsub==3 sub-expressions, but then you
loop through 0, 1, 2, 3. That's four, i.e. the fencepost error.

Lesson: "for (int i=0; i<=N; i++)" is a rare thing in C and C++.
It's almost always "for (int i=0; i<N; i++)".
printf(
"[%2d] so[%2d] eo[%2d] m[%s]\n",
i,
pmatch.rm_so,
pmatch.rm_eo,
cap
);


valgrind: "Invalid read of size 4".
}

// Clean up.
delete[] pmatch;
regfree( &re );
getchar(); // Pause (so cmd.exe doesn't close under Windows.)

Those fancy IDEs *still* don't do that for you?
return 0;
}

/*
OUTPUT COMPARISON:

// Using: re.re_nsub
// Under linux, sometimes I get
// *** glibc detected *** ./test: malloc():
// memory corruption: 0x0804a960 ***

Since you have access to Linux, do yourself a favor and learn how to
use valgrind. Remember to compile with -g so it will report line
numbers.

....
Thanks for any insight on this. I admit I'm not used to working with
regex in C or C++, so maybe I'm missing something, but I can't seem to
find anything in any documentation about a need for (at least) 3 extra
slots to be allocated.

Hope that helped,
/Jorgen
 
S

szr

]
int main() { // int argc, char* argv[]
char str[128] = "$$$ ABC123def ###";
char pat[128] = "([A-Za-z]+)([0-9]+)([A-Za-z]+)";

Use this instead for clarity and safety:

const char str[] = "$$$ ABC123def ###";
const char pat[] = "([A-Za-z]+)([0-9]+)([A-Za-z]+)";
Thanks.

[...]
if ((status = regcomp(&re, pat, REG_EXTENDED)) != 0) {

Using a REG_EXTENDED which you just made up yourself is a problem.
libpcreposix won't magically understand what you mean.
In the regex.h case it's of course correct.

Well I only defined it manually if it's not gcc/g++.

$ grep REG_EXTENDED /usr/include/pcreposix.h # from pcre-8.12
#define REG_EXTENDED 0

$ grep REG_EXTENDED /usr/include/regex.h # from glibc-2.14
#define REG_EXTENDED 1
#define REG_ICASE (REG_EXTENDED << 1)

I hadn't realize they were different, and I'll of course take that into
consideration.

[...]
if ((status = regexec(&re, str, re.re_nsub + 3, pmatch, 0)) != 0) {
error("regexec", re, status);
clean_up( re, pmatch );
exit( -1 );
}

for (unsigned int i=0; i<=re.re_nsub; i++) {
char cap[128] = ""; // Capture Group.
strncpy(
cap,
str + pmatch.rm_so,
pmatch.rm_eo - pmatch.rm_so
);


valgrind: "Invalid read of size 4", reported two times.

You were told there were re.re_nsub==3 sub-expressions, but then you
loop through 0, 1, 2, 3. That's four, i.e. the fencepost error.

Lesson: "for (int i=0; i<=N; i++)" is a rare thing in C and C++.
It's almost always "for (int i=0; i<N; i++)".


I agree but it's my understanding that with pmatch, [0] is the whole
match, and after that you have each individual sub match (group), hence
i<=re.re_nsub instead of i<re.re_nsub there.

[...]
Those fancy IDEs *still* don't do that for you?

Actually no. If there is nothing stopping the execution, it'll just run
to the end, though I realize I could have just put a stop point.

[...]
Since you have access to Linux, do yourself a favor and learn how to
use valgrind. Remember to compile with -g so it will report line
numbers.

Thanks, I will that up.
 
S

szr

On 6/15/2012 23:16 PM, Jorgen Grahn wrote:
[...]
// !!! Changing "re.re_nsub + 3" to just "re.re_nsub" in the next
// !!! two lines seems to exibit the problem. I don't understand
// !!! why it only works right with 3 additional slots allocated.
// ( See OUTPUT section below for comparisons. )
regmatch_t *pmatch = new regmatch_t[ re.re_nsub + 3 ];

if ((status = regexec(&re, str, re.re_nsub + 3, pmatch, 0)) != 0) {
error("regexec", re, status);
clean_up( re, pmatch );
exit( -1 );
}

for (unsigned int i=0; i<=re.re_nsub; i++) {
char cap[128] = ""; // Capture Group.
strncpy(
cap,
str + pmatch.rm_so,
pmatch.rm_eo - pmatch.rm_so
);


valgrind: "Invalid read of size 4", reported two times.

You were told there were re.re_nsub==3 sub-expressions, but then you
loop through 0, 1, 2, 3. That's four, i.e. the fencepost error.

Lesson: "for (int i=0; i<=N; i++)" is a rare thing in C and C++.
It's almost always "for (int i=0; i<N; i++)".


Something I forgot to mention in my previous reply; yes, re.re_nsub
comes back as 3 in my test case, and as far I know, pmatch array, at the
minimum, needs to be re.re_nsub + 1 big (index 0 is the whole match, and
what follows, if anything, are the sub expressions (capture groups),
which there should only be re.re_nsub of.

So why then does it crash (and I've reproducd this under both Linux/g++
and windows/borland/embarcadero C++ builder) if it the pmatch array
isn't re.re_nsub plus 3 big? That is, when re.re_nsub is 3, pmatch
appears to only use the 4 indices, as I would expect. But why does does
it crash if pmatch is only have 4 indices allocated?

That is what is mysterious to me.
 
J

Jorgen Grahn

On 6/15/2012 23:16 PM, Jorgen Grahn wrote:
[...]
// !!! Changing "re.re_nsub + 3" to just "re.re_nsub" in the next
// !!! two lines seems to exibit the problem. I don't understand
// !!! why it only works right with 3 additional slots allocated.
// ( See OUTPUT section below for comparisons. )
regmatch_t *pmatch = new regmatch_t[ re.re_nsub + 3 ];

if ((status = regexec(&re, str, re.re_nsub + 3, pmatch, 0)) != 0) {
error("regexec", re, status);
clean_up( re, pmatch );
exit( -1 );
}

for (unsigned int i=0; i<=re.re_nsub; i++) {
char cap[128] = ""; // Capture Group.
strncpy(
cap,
str + pmatch.rm_so,
pmatch.rm_eo - pmatch.rm_so
);


valgrind: "Invalid read of size 4", reported two times.

You were told there were re.re_nsub==3 sub-expressions, but then you
loop through 0, 1, 2, 3. That's four, i.e. the fencepost error.

Lesson: "for (int i=0; i<=N; i++)" is a rare thing in C and C++.
It's almost always "for (int i=0; i<N; i++)".


Something I forgot to mention in my previous reply; yes, re.re_nsub
comes back as 3 in my test case, and as far I know, pmatch array, at the
minimum, needs to be re.re_nsub + 1 big (index 0 is the whole match, and
what follows, if anything, are the sub expressions (capture groups),
which there should only be re.re_nsub of.


OK, well, then you have to consult the documentation more I guess. If
that is indeed how it works, this aspect of the Poxix regex library is
dangerous -- you normally try to provide lengths which can be
immediately used as array sizes and for loop end conditions, without a
need to add 1 or whatever.

My regcomp(3) man page doesn't document re_nsub at all, and doesn't
say that there's a special subexpression #0 which is the whole match.
I haven't used this part of the library.

Also note that there's standard regular expression support in C++
nowadays. If you don't have TR1 or C++11 or whatever support, there's
the Boost library which all that is based on.

[...]

/Jorgen
 
B

Ben Bacarisse

szr said:
On 6/15/2012 23:16 PM, Jorgen Grahn wrote:
[...]
// !!! Changing "re.re_nsub + 3" to just "re.re_nsub" in the next
// !!! two lines seems to exibit the problem. I don't understand
// !!! why it only works right with 3 additional slots allocated.
// ( See OUTPUT section below for comparisons. )
regmatch_t *pmatch = new regmatch_t[ re.re_nsub + 3 ];

if ((status = regexec(&re, str, re.re_nsub + 3, pmatch, 0)) != 0) {
error("regexec", re, status);
clean_up( re, pmatch );
exit( -1 );
}

for (unsigned int i=0; i<=re.re_nsub; i++) {
char cap[128] = ""; // Capture Group.
strncpy(
cap,
str + pmatch.rm_so,
pmatch.rm_eo - pmatch.rm_so
);


valgrind: "Invalid read of size 4", reported two times.

You were told there were re.re_nsub==3 sub-expressions, but then you
loop through 0, 1, 2, 3. That's four, i.e. the fencepost error.

Lesson: "for (int i=0; i<=N; i++)" is a rare thing in C and C++.
It's almost always "for (int i=0; i<N; i++)".


Something I forgot to mention in my previous reply; yes, re.re_nsub
comes back as 3 in my test case, and as far I know, pmatch array, at
the minimum, needs to be re.re_nsub + 1 big (index 0 is the whole
match, and what follows, if anything, are the sub expressions (capture
groups), which there should only be re.re_nsub of.


Yes, but did you change the argument you pass to regexec as well? If
not, it will try to "clear" (set to -1) pmatch elements that don't
exist.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,051
Latest member
CarleyMcCr

Latest Threads

Top