Trying to parse/match a C string literal

jl_post · Sep 24, 2009

Dear Perl community,

I'm trying to write Perl code that scans through a C/C++ and
matches string literals. I want to use a regular expression for this,
so that if given these inputs, it will extract these outputs:

input1: before "12 34 56" after
output1: 12 34 56

input2: before "12 34" 56" after
output2: 12 34

input3: before "12 34\" 56" after
output3: 12 34\" 56

input4: before "12 34\\" 56" after
output4: 12 34\\

input5: before "12 34\\\" 56" after
output5: 12 34\\\" 56

input6: before "12 34\\\\" 56" after
output6: 12 34\\\\

Note that inputs 3 through 6 account for the backslash escape
character in that if a double-quote is directly preceded by a non-
escaped backslash, then that double-quote should not be interpreted as
the C string terminator.

At first, I came up with this simple regular expression:

m/" (.*) "/x

this puts everything between the first and the last quote into $1.
This works fine for input1, but reads too much with input2.

Then I changed it to be non-greedy, like this:

m/" (.*?) "/x

which works great for inputs 1 and 2, but now fails with input3, as it
doesn't account for escaped-out quotes.

So then I added a negative look-behind to ensure that the last
character matched by the parentheses is not a backslash (I could use [^
\\] instead of the negative look-behind, but then we won't match empty
strings):

m/" (.*? (?<!\\) ) "/x

This works with inputs 1 through 3, but fails at input4, since the
quote after the double-backslash should be the terminator, but isn't
treated as such (due to the fact that it is preceded by a backslash).

So then I reasoned that, after the last non-backslash matched, an
even number of backslashes can be matched (as each pair of backslashes
represents one literal backslash), so I changed the expression to
this:

m/" (.*? (?<!\\) (\\{2})* ) "/x

Now it works for all the inputs I gave. I then added "?:" to the last
set of parentheses (so it wouldn't offset $2, $3, etc. if I decide to
add more later):

m/" (.*? (?<!\\) (?:\\{2})* ) "/x

I tested this out with the following code:

m/" (.*? (?<!\\) (?:\\{2})*) "/x and print "$1\n" while <>;
before "12 34 56" after # input 1
12 34 56
before "12 34" 56" after # input 2
12 34
before "12 34\" 56" after # input 3
12 34\" 56
before "12 34\\" 56" after #input 4
12 34\\
before "12 34\\\" 56" after # input 5
12 34\\\" 56
before "12 34\\\\" 56" after # input 6
12 34\\\\

So it looks like it works. My question is, even though I came up
with a way of parsing a C string literal, is there a better or simpler
way of doing this?

(Now, I know of the quotewords() function in the Text:

arseWords
module, but I don't think it's what I'm looking for. I prefer a
regular expression that extracts the string literal (not individual
tokens), and I can embed it into other regular expressions that look
for other pieces of code.)

I tried "perldoc -q string", but the best advice I could find was
to use Text:

arseWords, which I stated before is probably not what I
need.

Thanks!

-- Jean-Luc

Uri Guttman · Sep 24, 2009

jpc> I'm trying to write Perl code that scans through a C/C++ and
jpc> matches string literals. I want to use a regular expression for this,
jpc> so that if given these inputs, it will extract these outputs:

that can't be done easily with a single regex so don't even try. look at
text::balanced on cpan which is designed to match c strings and similar things.

uri

Randal L. Schwartz · Sep 24, 2009

jl> I'm trying to write Perl code that scans through a C/C++ and
jl> matches string literals. I want to use a regular expression for this,
jl> so that if given these inputs, it will extract these outputs:

jl> input1: before "12 34 56" after
jl> output1: 12 34 56

jl> input2: before "12 34" 56" after
jl> output2: 12 34

jl> input3: before "12 34\" 56" after
jl> output3: 12 34\" 56

jl> input4: before "12 34\\" 56" after
jl> output4: 12 34\\

jl> input5: before "12 34\\\" 56" after
jl> output5: 12 34\\\" 56

jl> input6: before "12 34\\\\" 56" after
jl> output6: 12 34\\\\

[...]

jl> m/" (.*? (?<!\\) (?:\\{2})* ) "/x

I would just make a regex that does what you want, and ignore all that
fancy newfangled lookbehind/ahead/aside:

m/
" # quote
(
[^\"]+ # any non-special characters are cool
| # ... or ...
\. # a backslash escaping the following character
) * # repeated zero or more times
" # quote
/sx

If you need to remove the quotes from your match, just add an inner set of
parens around the juicy bits.

print "Just another Perl hacker,"; # the original

sln · Sep 24, 2009

jl> I'm trying to write Perl code that scans through a C/C++ and
jl> matches string literals. I want to use a regular expression for this,
jl> so that if given these inputs, it will extract these outputs:

I would just make a regex that does what you want, and ignore all that
fancy newfangled lookbehind/ahead/aside:

m/
" # quote
(
[^\"]+ # any non-special characters are cool
| # ... or ...
\. # a backslash escaping the following character
) * # repeated zero or more times
" # quote
/sx

If you need to remove the quotes from your match, just add an inner set of
parens around the juicy bits.

print "Just another Perl hacker,"; # the original

If it were done this way, it would probably be better with
something like below.

-sln
---------------
use strict;
use warnings;

my $string = <DATA>;
print "\n",$string,"\n";

my $rx = qr/
" # quote
(
(?:
[^\\"]*? # any non-special characters are cool
| # ... or ...
\\. # a backslash escaping the following character
)* # repeated zero or more times
)
" # quote
/x;

while ($string =~ /$rx/sg)
{ print "<$1>\n"; }

__DATA__
" \"\"\"\"\" " 1 "this is one" 2 "this is tw\o \" isin't it?" ""

sln · Sep 24, 2009

jl> I'm trying to write Perl code that scans through a C/C++ and
jl> matches string literals. I want to use a regular expression for this,
jl> so that if given these inputs, it will extract these outputs:

Click to expand...

I would just make a regex that does what you want, and ignore all that
fancy newfangled lookbehind/ahead/aside:

m/
" # quote
(
[^\"]+ # any non-special characters are cool
| # ... or ...
\. # a backslash escaping the following character
) * # repeated zero or more times
" # quote
/sx

If you need to remove the quotes from your match, just add an inner set of
parens around the juicy bits.

Click to expand...

This works good too.
m/
" # quote
(
(?:
[^\\"]+ # any non-special characters are cool
| # ... or ...
\\. # a backslash escaping the following character
)* # repeated zero or more times
)
" # quote
/sx

-sln

jl_post · Sep 24, 2009

I'm trying to write Perl code that scans through a C/C++ and
matches string literals. I want to use a regular expression for
this,

Thanks for your responses! I now have two regular expressions that
work well. The one I came up with:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x

and one from Randal Schwartz, with some modification by poster sln and
myself:

m/" ( (?: [^\\"] | \\. )* ) "/x

When I tested them in my Perl script, I found that it read in and
processed 6827 C/C++ files in about 13 seconds, no matter which of the
above two regular expressions I used.

(Actually, they initially clocked in around 45-55 seconds, but
after repeatedly running them, they "slimmed down" to a consistent 13
seconds. I'm sure caching of some sort is involved somehow.)

However, the second one you see above I modified a bit. Randal's
suggestion was to use the '+' modifier after [^\\"] while sln
suggested using '*?'.

So I experimented with these three variants:

m/" ( (?: [^\\"] | \\. )* ) "/x
m/" ( (?: [^\\"]+ | \\. )* ) "/x
m/" ( (?: [^\\"]* | \\. )* ) "/x

What I found out was that the version without any modifier took
about 13 seconds (when operating on 6827 files), the version with the
'+' modifier took about 24 seconds, and the version with the '*'
modifier took about 32 seconds. (I made sure to run them over and
over to make sure caching had taken effect.)

(I discovered that converting '*' and '+' into their non-greedy
versions '*?' and '+?' didn't seem to have a measurable effect.)

So oddly enough, inclusion of the modifiers had an 11 to 19 second
penalty, with '*' being worse than '+'. I'm not sure why this is so,
but it's interesting to point out:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x # 13 seconds
m/" ( (?: [^\\"] | \\. )* ) "/x # 13 seconds
m/" ( (?: [^\\"]+ | \\. )* ) "/x # 24 seconds
m/" ( (?: [^\\"]* | \\. )* ) "/x # 32 seconds

(As an aside, converting \\. to (?:\\.)+ and (?:\\.)* didn't seem to
have an effect, probably because escaping a character was relatively
rare.)

Therefore, if you want to match a C/C++ string literal, I'd
recommend using one of the following two regular expressions:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x
m/" ( (?: [^\\"] | \\. )* ) "/x

They both seem to run about as fast.

Thanks for all your help!

-- Jean-Luc Romano

Peter J. Holzer · Sep 25, 2009

jpc> I'm trying to write Perl code that scans through a C/C++ and
jpc> matches string literals. I want to use a regular expression for this,
jpc> so that if given these inputs, it will extract these outputs:

that can't be done easily with a single regex so don't even try. look at
text::balanced on cpan which is designed to match c strings and similar things.

At the translation stage where string literals are recognised by a C
compiler there are no nesting constructs, so I don't see why you would
want to use Text::Balanced.

(For a real solution you would need to take comments into account, but
they don't nest either)

hp

Uri Guttman · Sep 25, 2009

jpc> I'm trying to write Perl code that scans through a C/C++ and
jpc> matches string literals. I want to use a regular expression for this,
jpc> so that if given these inputs, it will extract these outputs:
PJH> At the translation stage where string literals are recognised by
PJH> a C compiler there are no nesting constructs, so I don't see why
PJH> you would want to use Text::Balanced.

PJH> (For a real solution you would need to take comments into account, but
PJH> they don't nest either)

but you can have a string literal inside a comment and it needs to be
skipped. there are other cases i bet. ask damian why it is better.

uri

sln · Sep 26, 2009

I'm trying to write Perl code that scans through a C/C++ and
matches string literals. I want to use a regular expression for
this,

Click to expand...

Thanks for your responses! I now have two regular expressions that
work well. The one I came up with:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x

and one from Randal Schwartz, with some modification by poster sln and
myself:

m/" ( (?: [^\\"] | \\. )* ) "/x

When I tested them in my Perl script, I found that it read in and
processed 6827 C/C++ files in about 13 seconds, no matter which of the
above two regular expressions I used.

(Actually, they initially clocked in around 45-55 seconds, but
after repeatedly running them, they "slimmed down" to a consistent 13
seconds. I'm sure caching of some sort is involved somehow.)

However, the second one you see above I modified a bit. Randal's
suggestion was to use the '+' modifier after [^\\"] while sln
suggested using '*?'.

So I experimented with these three variants:

m/" ( (?: [^\\"] | \\. )* ) "/x
m/" ( (?: [^\\"]+ | \\. )* ) "/x
m/" ( (?: [^\\"]* | \\. )* ) "/x

What I found out was that the version without any modifier took
about 13 seconds (when operating on 6827 files), the version with the
'+' modifier took about 24 seconds, and the version with the '*'
modifier took about 32 seconds. (I made sure to run them over and
over to make sure caching had taken effect.)

(I discovered that converting '*' and '+' into their non-greedy
versions '*?' and '+?' didn't seem to have a measurable effect.)

So oddly enough, inclusion of the modifiers had an 11 to 19 second
penalty, with '*' being worse than '+'. I'm not sure why this is so,
but it's interesting to point out:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x # 13 seconds
m/" ( (?: [^\\"] | \\. )* ) "/x # 13 seconds
m/" ( (?: [^\\"]+ | \\. )* ) "/x # 24 seconds
m/" ( (?: [^\\"]* | \\. )* ) "/x # 32 seconds

(As an aside, converting \\. to (?:\\.)+ and (?:\\.)* didn't seem to
have an effect, probably because escaping a character was relatively
rare.)

Therefore, if you want to match a C/C++ string literal, I'd
recommend using one of the following two regular expressions:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x
m/" ( (?: [^\\"] | \\. )* ) "/x

They both seem to run about as fast.

Thanks for all your help!

-- Jean-Luc Romano

If I had to worry about time I would probably use this
m/ " ( (?: [^"\\]+ | (?:\\.)+ )* ) " /x

Your results may vary.
-sln

--------------------
Output:
"\"\"\"\"\\\\\\\\\\\\\\\\\" " 1 "this is one" 2 "this is tw\o \" isin't it?" ""

(?x-ism:" ( (?: \\?. )*? ) ")
<\"\"\"\"\\\\\\\\\\\\\\\\\" >
<this is one>
<this is tw\o \" isin't it?>
<>
the code took:2.84375 wallclock secs ( 2.84 usr + 0.00 sys = 2.84 CPU)

(?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
<\"\"\"\"\\\\\\\\\\\\\\\\\" >
<this is one>
<this is tw\o \" isin't it?>
<>
the code took:2.62468 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)

(?x-ism:" ( (?: [^\\"] | \\. )* ) ")
<\"\"\"\"\\\\\\\\\\\\\\\\\" >
<this is one>
<this is tw\o \" isin't it?>
<>
the code took:2.14033 wallclock secs ( 2.14 usr + 0.00 sys = 2.14 CPU)

(?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
<\"\"\"\"\\\\\\\\\\\\\\\\\" >
<this is one>
<this is tw\o \" isin't it?>
<>
the code took:1.74956 wallclock secs ( 1.75 usr + 0.00 sys = 1.75 CPU)

-----------------------

use strict;
use warnings;
use Benchmark ':hireswallclock';
my ($t0,$t1,$rx);

my $string = <DATA>;
print "\n",$string,"\n";

{
$rx = qr/" ( (?: \\?. )*? ) "/x;
print "\n$rx\n";
$t0 = new Benchmark;
for (1..100_000) {
1 while ($string =~ /$rx/sg);
}
$t1 = new Benchmark;
while ($string =~ /$rx/sg) { print "<$1> \n"; }
print "the code took:",timestr(timediff($t1, $t0)),"\n";

$rx = qr/" (.*? (?<!\\) (?:\\{2})* ) "/x;
print "\n$rx\n";
$t0 = new Benchmark;
for (1..100_000) {
1 while ($string =~ /$rx/sg);
}
$t1 = new Benchmark;
while ($string =~ /$rx/sg) { print "<$1> \n"; }
print "the code took:",timestr(timediff($t1, $t0)),"\n";

$rx = qr/" ( (?: [^\\"] | \\. )* ) "/x;
print "\n$rx\n";
$t0 = new Benchmark;
for (1..100_000) {
1 while ($string =~ /$rx/sg);
}
$t1 = new Benchmark;
while ($string =~ /$rx/sg) { print "<$1> \n"; }
print "the code took:",timestr(timediff($t1, $t0)),"\n";

$rx = qr/ " ( (?: [^"\\]+ | (?:\\.)+ )* ) " /x;
print "\n$rx\n";
$t0 = new Benchmark;
for (1..100_000) {
1 while ($string =~ /$rx/sg);
}
$t1 = new Benchmark;
while ($string =~ /$rx/sg) { print "<$1> \n"; }
print "the code took:",timestr(timediff($t1, $t0)),"\n";
}

__DATA__
"\"\"\"\"\\\\\\\\\\\\\\\\\" " 1 "this is one" 2 "this is tw\o \" isin't it?" ""

Charlton Wilbur · Sep 26, 2009

uri> but you can have a string literal inside a comment and it needs
uri> to be skipped. there are other cases i bet. ask damian why it
uri> is better.

If you're actually parsing C, comments turn into whitespace during the
tokenizing in translation phase 3 -- "Each comment is replaced by one
space character." There are no string literals inside comments.

Further, the C90 standard, paragraph 3.1.9 says:

Except within a character constant, a string literal, or a comment,
the characters /* introduce a comment. The contents of a comment are
examined only to identify multibyte characters and to find the
characters */ that terminate it.[21]

[21] Thus, comments do not nest.

At least in C parsing, this really *isn't* a case of balanced text,
because all you are looking for is the closing */, and it can
successfully be handled by regular expressions.

Charlton

Peter J. Holzer · Sep 27, 2009

jpc> I'm trying to write Perl code that scans through a C/C++ and
jpc> matches string literals. I want to use a regular expression for this,
jpc> so that if given these inputs, it will extract these outputs:

PJH> At the translation stage where string literals are recognised by
PJH> a C compiler there are no nesting constructs, so I don't see why
PJH> you would want to use Text::Balanced.

PJH> (For a real solution you would need to take comments into account, but
PJH> they don't nest either)

but you can have a string literal inside a comment

No, you can't. You can have something that looks like a string literal,
but it's just a series of characters which happens to have two quote
characters in it:

/* aa "bb */ " cc

This is a comment with the content « aa "bb », followed by a string
literal which starts with « cc», not the beginning of a comment with the
content « aa "bb */ " cc».

Here is my take on the problem with regexps:

#!/usr/bin/perl
use warnings;
use strict;

use File::Slurp;

my $_ = read_file($ARGV[0]);

while (
m{
\G
(?:
(?: /\* .*? \*/ )
|
[^'"]
|
' (?: [^'\\] | \\ (?: [0-7]{0,3} | x [0-9A-Fa-f]{2} | . ) ) '
)*
(" (?: [^"\\] | \\ (?: [0-7]{0,3} | x [0-9A-Fa-f]{2} | . ) )* " )
}sxg
) {
print "$1\n";
}
__END__

and here my attempt at using Text::Balanced:

#!/usr/bin/perl
use warnings;
use strict;

use File::Slurp;
use Text::Balanced qw(extract_multiple extract_delimited);

my $_ = read_file($ARGV[0]);

while (defined (
my $section
= extract_multiple(
$_,
[
qr{/\* .*? \*/}sx,
qr{[^'"]}sx,
sub { extract_delimited($_[0], q{'"}) },
]
)
)
) {
print "$section\n" if $section =~ /^"/;
}
__END__

I'm not sure whether the second one works correctly: The description of
extract_delimited doesn't quite match the definition of C string and
character literals but I think the difference doesn't matter in this
case (Except for multi-line strings, but I don't handle them correctly
in the regexp version either).

I don't think the Text::Balanced version is much more readable or
elegant or easier to maintain.

hp

jl_post · Oct 16, 2009

(?x-ism:" ( (?: \\?. )*? ) ")
the code took:2.84375 wallclock secs ( 2.84 usr + 0.00 sys = 2.84 CPU)

(?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
the code took:2.62468 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)

(?x-ism:" ( (?: [^\\"] | \\. )* ) ")
the code took:2.14033 wallclock secs ( 2.14 usr + 0.00 sys = 2.14 CPU)

(?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
the code took:1.74956 wallclock secs ( 1.75 usr + 0.00 sys = 1.75 CPU)

Thanks for the benchmark code, sln. I ran it myself and pretty
much got the same results (in that their rankings in speed were the
same). Which puzzles me, because when I ran my code against real-
world data, it showed that:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x

and:

m/" ( (?: [^\\"] | \\. )* ) "/x

were clearly the fastest. (That is, in contrast to your benchmark
code, which shows that:

(?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )

is the fastest.)

I did discover, however, that if I took your sample input (the part
after the __DATA__ statement) and removed all embedded double-quotes
and re-ran your benchmark program, then the "slower" regular
expressions started catching up to the faster ones.

The only thing I can think of is that the efficiency of the regular
expressions can change depending on how many escaped quotes (and other
escaped characters) there are in the string that's examined. And
since I was parsing through log messages (meant for the end user, not
the programmer), escaped quotes were fairly uncommon. (They did
exist, but rarely as more than one pair at a time.)

So in the end, it really depends on the data itself. And the best
way to correctly "simulate" processing the input data is to process on
the input data itself (if that makes any sense).

At any rare, thanks for your hard work in investigating this for
me, sln.

-- Jean-Luc

sln · Oct 18, 2009

(?x-ism:" ( (?: \\?. )*? ) ")
the code took:2.84375 wallclock secs ( 2.84 usr + 0.00 sys = 2.84 CPU)

(?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
the code took:2.62468 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)

(?x-ism:" ( (?: [^\\"] | \\. )* ) ")
the code took:2.14033 wallclock secs ( 2.14 usr + 0.00 sys = 2.14 CPU)

(?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
the code took:1.74956 wallclock secs ( 1.75 usr + 0.00 sys = 1.75 CPU)

Click to expand...

Thanks for the benchmark code, sln. I ran it myself and pretty
much got the same results (in that their rankings in speed were the
same). Which puzzles me, because when I ran my code against real-
world data, it showed that:

m/" (.*? (?<!\\) (?:\\{2})* ) "/x

and:

m/" ( (?: [^\\"] | \\. )* ) "/x

were clearly the fastest. (That is, in contrast to your benchmark
code, which shows that:

(?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )

is the fastest.)

I did discover, however, that if I took your sample input (the part
after the __DATA__ statement) and removed all embedded double-quotes
and re-ran your benchmark program, then the "slower" regular
expressions started catching up to the faster ones.

The only thing I can think of is that the efficiency of the regular
expressions can change depending on how many escaped quotes (and other
escaped characters) there are in the string that's examined. And
since I was parsing through log messages (meant for the end user, not
the programmer), escaped quotes were fairly uncommon. (They did
exist, but rarely as more than one pair at a time.)

So in the end, it really depends on the data itself. And the best
way to correctly "simulate" processing the input data is to process on
the input data itself (if that makes any sense).

I would take acception to the 'real world' scenario.
How are you using this? Is this quote regex just a sub-expresion in a
larger regex?

I re-ran the tests taking out any escaped characters (I actually did before too)
and matched just one time instead of global.

I also ran the benches on a single large '.cpp' file with about 100 medium-large
strings with some scattered '\'ed characters. There is about a %25 performance increase
with the (?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " ) still, compared to the next
fastest.

I think the percentage difference is linear and related to the number of sub-matches
within the " " anchors. If there are no " in the sample, the numbers are
identical (as you would expect).

Below is a simple few line program that does regex's on the two fastests and
includes regular expression debug information output. It simply reads the first
line of DATA and does regex on it.

Below that line is my educated guess as to why these results are why they are.
Below that is the output of the simple program.

-sln
---------------------------

use strict;
use warnings;

use re "debug";

my $string = <DATA>; # just get 1 line

$string =~ /"((?:[^"\\]+|(?:\\.)+)*)"/;

$string =~ /"((?:[^\\"]|\\.)*)"/;

__DATA__
1 "this one"

Analysis
-------------
Intuitively, (?: | )* is a complex grouping.
It says do the contents 0 or more times in a *loop*.

After each loop itteration, the next anchor (if there is one)
past the group is checked.

Within the group, if you are only checking for one simple character
at a time, the total time is:

(1 char check + 1 loop check per character) X the number of characters that pass
example: "abbbcccd" =~ /a(?:[bc])*d/;

However, if you check for multiple characters within the loop at a time,
the total time is:

(1 char check X the number of characters that pass) + 1 or 0 loop check
example: "abbbcccd" =~ /a(?:[bc]+)*d/;

Since the sum of all character checks only include 1 or 0 loop check (without backtracking),
the total time is reduced.

The regex engine initially processes anchors, it finds out where they are
then allocates a span (as a limit) of characters between them, from
which to process variable data. This is called an optimization.

In this particular case /a(?:[bc]+)*d/, the anchors are 'a' and 'd'.
The engine finds these characters in the sample, measures the distance
between them and allows the simple character class [bc] to match up to
that span amount (because of the '+' and can be less, but no more) before
it does a single loop check.

It is always possible that the regex engine will do more complex optimizations
depending on the surrounding sub-expressions. When in doubt run a simple test
using the pragma: use re 'debug';
Usually, the more steps (lines of output) it takes, the longer it takes.

See the Docs -
C:\Perl\html\lib\pods\perldebguts.html
search for: "Debugging regular expressions"
----------------------------------------------

Output:
-----------------
Compiling REx "%"((?:[^%"\\]+|(?:\\.)+)*)%""
Final program:
1: EXACT <"> (3)
3: OPEN1 (5)
5: CURLYX[0] {0,32767} (30)
7: BRANCH (20)
8: PLUS (29)
9: ANYOF[\0-!#-[\]-\377{unicode_all}] (0)
20: BRANCH (FAIL)
21: CURLYM[0] {1,32767} (29)
23: EXACT <\\> (25)
25: REG_ANY (26)
26: SUCCEED (0)
27: NOTHING (29)
28: TAIL (29)
29: WHILEM[1/2] (0)
30: NOTHING (31)
31: CLOSE1 (33)
33: EXACT <"> (35)
35: END (0)
anchored "%"" at 0 floating "%"" at 1..2147483647 (checking floating) minlen 2
Compiling REx "%"((?:[^\\%"]|\\.)*)%""
Final program:
1: EXACT <"> (3)
3: OPEN1 (5)
5: CURLYX[0] {0,32767} (25)
7: BRANCH (19)
8: ANYOF[\0-!#-[\]-\377{unicode_all}] (24)
19: BRANCH (FAIL)
20: EXACT <\\> (22)
22: REG_ANY (24)
23: TAIL (24)
24: WHILEM[1/1] (0)
25: NOTHING (26)
26: CLOSE1 (28)
28: EXACT <"> (30)
30: END (0)
anchored "%"" at 0 floating "%"" at 1..2147483647 (checking floating) minlen 2
Guessing start of match in sv for REx "%"((?:[^%"\\]+|(?:\\.)+)*)%"" against "1
%"this one%" %n"
Found floating substr "%"" at offset 2...
Contradicts anchored substr "%"", trying floating at offset 3...
Found floating substr "%"" at offset 11...
Found anchored substr "%"" at offset 2...
Starting position does not contradict /^/m...
Guessed: match at offset 2
Matching REx "%"((?:[^%"\\]+|(?:\\.)+)*)%"" against "%"this one%" %n"
2 <1 > <"this one"> | 1:EXACT <">(3)
3 <1 "> <this one" > | 3:OPEN1(5)
3 <1 "> <this one" > | 5:CURLYX[0] {0,32767}(30)
3 <1 "> <this one" > | 29: WHILEM[1/2](0)
whilem: matched 0 out of 0..32767
3 <1 "> <this one" > | 7: BRANCH(20)
3 <1 "> <this one" > | 8: PLUS(29)
ANYOF[\0-!#-[\]-\377{unicode_all}] can m
atch 8 times out of 2147483647...
11 <"this one> <" %n> | 29: WHILEM[1/2](0)
whilem: matched 1 out of 0..32767
11 <"this one> <" %n> | 7: BRANCH(20)
11 <"this one> <" %n> | 8: PLUS(29)
ANYOF[\0-!#-[\]-\377{unicode_all}]
can match 0 times out of 2147483647...
failed...
11 <"this one> <" %n> | 20: BRANCH(28)
11 <"this one> <" %n> | 21: CURLYM[0] {1,32767}(29)
11 <"this one> <" %n> | 23: EXACT <\\>(25)
failed...
failed...
BRANCH failed...
whilem: failed, trying continuation...

11 <"this one> <" %n> | 30: NOTHING(31)
11 <"this one> <" %n> | 31: CLOSE1(33)
11 <"this one> <" %n> | 33: EXACT <">(35)
12 <"this one"> < %n> | 35: END(0)
Match successful!
Guessing start of match in sv for REx "%"((?:[^\\%"]|\\.)*)%"" against "1 %"this
one%" %n"
Found floating substr "%"" at offset 2...
Contradicts anchored substr "%"", trying floating at offset 3...
Found floating substr "%"" at offset 11...
Found anchored substr "%"" at offset 2...
Starting position does not contradict /^/m...
Guessed: match at offset 2
Matching REx "%"((?:[^\\%"]|\\.)*)%"" against "%"this one%" %n"
2 <1 > <"this one"> | 1:EXACT <">(3)
3 <1 "> <this one" > | 3:OPEN1(5)
3 <1 "> <this one" > | 5:CURLYX[0] {0,32767}(25)
3 <1 "> <this one" > | 24: WHILEM[1/1](0)
whilem: matched 0 out of 0..32767
3 <1 "> <this one" > | 7: BRANCH(19)
3 <1 "> <this one" > | 8: ANYOF[\0-!#-[\]-\377{unicode_all}](24)
4 <1 "t> <his one" > | 24: WHILEM[1/1](0)
whilem: matched 1 out of 0..32767
4 <1 "t> <his one" > | 7: BRANCH(19)
4 <1 "t> <his one" > | 8: ANYOF[\0-!#-[\]-\377{unicode_all}](2
4)
5 <1 "th> <is one" %n> | 24: WHILEM[1/1](0)
whilem: matched 2 out of 0..32767
5 <1 "th> <is one" %n> | 7: BRANCH(19)
5 <1 "th> <is one" %n> | 8: ANYOF[\0-!#-[\]-\377{unicode_all
}](24)
6 < "thi> <s one" %n> | 24: WHILEM[1/1](0)
whilem: matched 3 out of 0..3276
7
6 < "thi> <s one" %n> | 7: BRANCH(19)
6 < "thi> <s one" %n> | 8: ANYOF[\0-!#-[\]-\377{unicode
_all}](24)
7 <"this> < one" %n> | 24: WHILEM[1/1](0)
whilem: matched 4 out of 0..
32767
7 <"this> < one" %n> | 7: BRANCH(19)
7 <"this> < one" %n> | 8: ANYOF[\0-!#-[\]-\377{uni
code_all}](24)
8 <"this > <one" %n> | 24: WHILEM[1/1](0)
whilem: matched 5 out of
0..32767
8 <"this > <one" %n> | 7: BRANCH(19)
8 <"this > <one" %n> | 8: ANYOF[\0-!#-[\]-\377
{unicode_all}](24)
9 <"this o> <ne" %n> | 24: WHILEM[1/1](0)
whilem: matched 6 ou
t of 0..32767
9 <"this o> <ne" %n> | 7: BRANCH(19)
9 <"this o> <ne" %n> | 8: ANYOF[\0-!#-[\]-
\377{unicode_all}](24)
10 <"this on> <e" %n> | 24: WHILEM[1/1](0)
whilem: matched
7 out of 0..32767
10 <"this on> <e" %n> | 7: BRANCH(19)
10 <"this on> <e" %n> | 8: ANYOF[\0-!#-
[\]-\377{unicode_all}](24)
11 <"this one> <" %n> | 24: WHILEM[1/1](
0)
whilem: matc
hed 8 out of 0..32767
11 <"this one> <" %n> | 7: BRANCH(19)

11 <"this one> <" %n> | 8: ANYOF[\0
-!#-[\]-\377{unicode_all}](24)
failed..
..
11 <"this one> <" %n> | 19: BRANCH(23)

11 <"this one> <" %n> | 20: EXACT <\
\>(22)
failed..
..
BRANCH fai
led...
whilem: fail
ed, trying continuation...
11 <"this one> <" %n> | 25: NOTHING(26
)
11 <"this one> <" %n> | 26: CLOSE1(28)

11 <"this one> <" %n> | 28: EXACT <">(
30)
12 <"this one"> < %n> | 30: END(0)
Match successful!
Freeing REx: "%"((?:[^%"\\]+|(?:\\.)+)*)%""
Freeing REx: "%"((?:[^\\%"]|\\.)*)%""

Trying to build a SARIMAX model to forecast the S&P500 trend	0	Nov 5, 2023
Secure Keyboard v2.0 Modern C++ Virtual Keyboard for Windows (Glassmorphism UI, Clipboard Auto-Clear)	0	Mar 26, 2026
Database Manager: A C++ Console Application	14	May 12, 2025
Mini Web Server in C++ (Part One)	4	Oct 2, 2025
Unable to read input from keyboard, in below C code, for a BST.	0	Jul 20, 2025
AES-128 Clipboard Protector: Auto-Encrypt Ctrl+C, Smart-Decrypt Ctrl+V (C++ Windows Hook)	7	Mar 24, 2026
Regex to match a numerical IP range	7	Dec 11, 2010
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023

Trying to parse/match a C string literal

jl_post

Uri Guttman

Randal L. Schwartz

sln

sln

jl_post

Peter J. Holzer

Uri Guttman

sln

Charlton Wilbur

Peter J. Holzer

jl_post

sln

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads