Trying to parse/match a C string literal

Discussion in 'Perl Misc' started by jl_post@hotmail.com, Sep 24, 2009.

  1. Guest

    Dear Perl community,

    I'm trying to write Perl code that scans through a C/C++ and
    matches string literals. I want to use a regular expression for this,
    so that if given these inputs, it will extract these outputs:

    input1: before "12 34 56" after
    output1: 12 34 56

    input2: before "12 34" 56" after
    output2: 12 34

    input3: before "12 34\" 56" after
    output3: 12 34\" 56

    input4: before "12 34\\" 56" after
    output4: 12 34\\

    input5: before "12 34\\\" 56" after
    output5: 12 34\\\" 56

    input6: before "12 34\\\\" 56" after
    output6: 12 34\\\\

    Note that inputs 3 through 6 account for the backslash escape
    character in that if a double-quote is directly preceded by a non-
    escaped backslash, then that double-quote should not be interpreted as
    the C string terminator.

    At first, I came up with this simple regular expression:

    m/" (.*) "/x

    this puts everything between the first and the last quote into $1.
    This works fine for input1, but reads too much with input2.

    Then I changed it to be non-greedy, like this:

    m/" (.*?) "/x

    which works great for inputs 1 and 2, but now fails with input3, as it
    doesn't account for escaped-out quotes.

    So then I added a negative look-behind to ensure that the last
    character matched by the parentheses is not a backslash (I could use [^
    \\] instead of the negative look-behind, but then we won't match empty
    strings):

    m/" (.*? (?<!\\) ) "/x

    This works with inputs 1 through 3, but fails at input4, since the
    quote after the double-backslash should be the terminator, but isn't
    treated as such (due to the fact that it is preceded by a backslash).

    So then I reasoned that, after the last non-backslash matched, an
    even number of backslashes can be matched (as each pair of backslashes
    represents one literal backslash), so I changed the expression to
    this:

    m/" (.*? (?<!\\) (\\{2})* ) "/x

    Now it works for all the inputs I gave. I then added "?:" to the last
    set of parentheses (so it wouldn't offset $2, $3, etc. if I decide to
    add more later):

    m/" (.*? (?<!\\) (?:\\{2})* ) "/x

    I tested this out with the following code:

    m/" (.*? (?<!\\) (?:\\{2})*) "/x and print "$1\n" while <>;
    before "12 34 56" after # input 1
    12 34 56
    before "12 34" 56" after # input 2
    12 34
    before "12 34\" 56" after # input 3
    12 34\" 56
    before "12 34\\" 56" after #input 4
    12 34\\
    before "12 34\\\" 56" after # input 5
    12 34\\\" 56
    before "12 34\\\\" 56" after # input 6
    12 34\\\\

    So it looks like it works. My question is, even though I came up
    with a way of parsing a C string literal, is there a better or simpler
    way of doing this?

    (Now, I know of the quotewords() function in the Text::parseWords
    module, but I don't think it's what I'm looking for. I prefer a
    regular expression that extracts the string literal (not individual
    tokens), and I can embed it into other regular expressions that look
    for other pieces of code.)

    I tried "perldoc -q string", but the best advice I could find was
    to use Text::parseWords, which I stated before is probably not what I
    need.

    Thanks!

    -- Jean-Luc
     
    , Sep 24, 2009
    #1
    1. Advertising

  2. Uri Guttman Guest

    >>>>> "jpc" == jl post@hotmail com <> writes:

    jpc> I'm trying to write Perl code that scans through a C/C++ and
    jpc> matches string literals. I want to use a regular expression for this,
    jpc> so that if given these inputs, it will extract these outputs:

    that can't be done easily with a single regex so don't even try. look at
    text::balanced on cpan which is designed to match c strings and similar things.

    uri
     
    Uri Guttman, Sep 24, 2009
    #2
    1. Advertising

  3. >>>>> "jl" == jl post@hotmail com <> writes:

    jl> I'm trying to write Perl code that scans through a C/C++ and
    jl> matches string literals. I want to use a regular expression for this,
    jl> so that if given these inputs, it will extract these outputs:

    jl> input1: before "12 34 56" after
    jl> output1: 12 34 56

    jl> input2: before "12 34" 56" after
    jl> output2: 12 34

    jl> input3: before "12 34\" 56" after
    jl> output3: 12 34\" 56

    jl> input4: before "12 34\\" 56" after
    jl> output4: 12 34\\

    jl> input5: before "12 34\\\" 56" after
    jl> output5: 12 34\\\" 56

    jl> input6: before "12 34\\\\" 56" after
    jl> output6: 12 34\\\\

    [...]

    jl> m/" (.*? (?<!\\) (?:\\{2})* ) "/x

    I would just make a regex that does what you want, and ignore all that
    fancy newfangled lookbehind/ahead/aside:

    m/
    " # quote
    (
    [^\"]+ # any non-special characters are cool
    | # ... or ...
    \. # a backslash escaping the following character
    ) * # repeated zero or more times
    " # quote
    /sx

    If you need to remove the quotes from your match, just add an inner set of
    parens around the juicy bits.

    print "Just another Perl hacker,"; # the original

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
    See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
     
    Randal L. Schwartz, Sep 24, 2009
    #3
  4. Guest

    On Thu, 24 Sep 2009 12:11:28 -0700, (Randal L. Schwartz) wrote:

    >>>>>> "jl" == jl post@hotmail com <> writes:

    >
    >jl> I'm trying to write Perl code that scans through a C/C++ and
    >jl> matches string literals. I want to use a regular expression for this,
    >jl> so that if given these inputs, it will extract these outputs:
    >

    <snip>
    >I would just make a regex that does what you want, and ignore all that
    >fancy newfangled lookbehind/ahead/aside:
    >
    >m/
    > " # quote
    > (
    > [^\"]+ # any non-special characters are cool
    > | # ... or ...
    > \. # a backslash escaping the following character
    > ) * # repeated zero or more times
    > " # quote
    >/sx
    >
    >If you need to remove the quotes from your match, just add an inner set of
    >parens around the juicy bits.
    >
    >print "Just another Perl hacker,"; # the original


    If it were done this way, it would probably be better with
    something like below.

    -sln
    ---------------
    use strict;
    use warnings;

    my $string = <DATA>;
    print "\n",$string,"\n";

    my $rx = qr/
    " # quote
    (
    (?:
    [^\\"]*? # any non-special characters are cool
    | # ... or ...
    \\. # a backslash escaping the following character
    )* # repeated zero or more times
    )
    " # quote
    /x;

    while ($string =~ /$rx/sg)
    { print "<$1>\n"; }

    __DATA__
    " \"\"\"\"\" " 1 "this is one" 2 "this is tw\o \" isin't it?" ""
     
    , Sep 24, 2009
    #4
  5. Guest

    On Thu, 24 Sep 2009 13:51:28 -0700, wrote:

    >On Thu, 24 Sep 2009 12:11:28 -0700, (Randal L. Schwartz) wrote:
    >
    >>>>>>> "jl" == jl post@hotmail com <> writes:

    >>
    >>jl> I'm trying to write Perl code that scans through a C/C++ and
    >>jl> matches string literals. I want to use a regular expression for this,
    >>jl> so that if given these inputs, it will extract these outputs:
    >>

    ><snip>
    >>I would just make a regex that does what you want, and ignore all that
    >>fancy newfangled lookbehind/ahead/aside:
    >>
    >>m/
    >> " # quote
    >> (
    >> [^\"]+ # any non-special characters are cool
    >> | # ... or ...
    >> \. # a backslash escaping the following character
    >> ) * # repeated zero or more times
    >> " # quote
    >>/sx
    >>
    >>If you need to remove the quotes from your match, just add an inner set of
    >>parens around the juicy bits.
    >>

    This works good too.
    m/
    " # quote
    (
    (?:
    [^\\"]+ # any non-special characters are cool
    | # ... or ...
    \\. # a backslash escaping the following character
    )* # repeated zero or more times
    )
    " # quote
    /sx

    -sln
     
    , Sep 24, 2009
    #5
  6. Guest

    On Sep 24, 12:43 pm, "" <>
    wrote:
    >
    > I'm trying to write Perl code that scans through a C/C++ and
    > matches string literals. I want to use a regular expression for
    > this,


    Thanks for your responses! I now have two regular expressions that
    work well. The one I came up with:

    m/" (.*? (?<!\\) (?:\\{2})* ) "/x

    and one from Randal Schwartz, with some modification by poster sln and
    myself:

    m/" ( (?: [^\\"] | \\. )* ) "/x

    When I tested them in my Perl script, I found that it read in and
    processed 6827 C/C++ files in about 13 seconds, no matter which of the
    above two regular expressions I used.

    (Actually, they initially clocked in around 45-55 seconds, but
    after repeatedly running them, they "slimmed down" to a consistent 13
    seconds. I'm sure caching of some sort is involved somehow.)

    However, the second one you see above I modified a bit. Randal's
    suggestion was to use the '+' modifier after [^\\"] while sln
    suggested using '*?'.

    So I experimented with these three variants:

    m/" ( (?: [^\\"] | \\. )* ) "/x
    m/" ( (?: [^\\"]+ | \\. )* ) "/x
    m/" ( (?: [^\\"]* | \\. )* ) "/x

    What I found out was that the version without any modifier took
    about 13 seconds (when operating on 6827 files), the version with the
    '+' modifier took about 24 seconds, and the version with the '*'
    modifier took about 32 seconds. (I made sure to run them over and
    over to make sure caching had taken effect.)

    (I discovered that converting '*' and '+' into their non-greedy
    versions '*?' and '+?' didn't seem to have a measurable effect.)

    So oddly enough, inclusion of the modifiers had an 11 to 19 second
    penalty, with '*' being worse than '+'. I'm not sure why this is so,
    but it's interesting to point out:

    m/" (.*? (?<!\\) (?:\\{2})* ) "/x # 13 seconds
    m/" ( (?: [^\\"] | \\. )* ) "/x # 13 seconds
    m/" ( (?: [^\\"]+ | \\. )* ) "/x # 24 seconds
    m/" ( (?: [^\\"]* | \\. )* ) "/x # 32 seconds

    (As an aside, converting \\. to (?:\\.)+ and (?:\\.)* didn't seem to
    have an effect, probably because escaping a character was relatively
    rare.)

    Therefore, if you want to match a C/C++ string literal, I'd
    recommend using one of the following two regular expressions:

    m/" (.*? (?<!\\) (?:\\{2})* ) "/x
    m/" ( (?: [^\\"] | \\. )* ) "/x

    They both seem to run about as fast.

    Thanks for all your help!

    -- Jean-Luc Romano
     
    , Sep 25, 2009
    #6
  7. On 2009-09-24 18:56, Uri Guttman <> wrote:
    >>>>>> "jpc" == jl post@hotmail com <> writes:

    >
    > jpc> I'm trying to write Perl code that scans through a C/C++ and
    > jpc> matches string literals. I want to use a regular expression for this,
    > jpc> so that if given these inputs, it will extract these outputs:
    >
    > that can't be done easily with a single regex so don't even try. look at
    > text::balanced on cpan which is designed to match c strings and similar things.


    At the translation stage where string literals are recognised by a C
    compiler there are no nesting constructs, so I don't see why you would
    want to use Text::Balanced.

    (For a real solution you would need to take comments into account, but
    they don't nest either)

    hp
     
    Peter J. Holzer, Sep 25, 2009
    #7
  8. Uri Guttman Guest

    >>>>> "PJH" == Peter J Holzer <> writes:

    PJH> On 2009-09-24 18:56, Uri Guttman <> wrote:
    >>>>>>> "jpc" == jl post@hotmail com <> writes:

    >>

    jpc> I'm trying to write Perl code that scans through a C/C++ and
    jpc> matches string literals. I want to use a regular expression for this,
    jpc> so that if given these inputs, it will extract these outputs:
    >>
    >> that can't be done easily with a single regex so don't even try. look at
    >> text::balanced on cpan which is designed to match c strings and similar things.


    PJH> At the translation stage where string literals are recognised by
    PJH> a C compiler there are no nesting constructs, so I don't see why
    PJH> you would want to use Text::Balanced.

    PJH> (For a real solution you would need to take comments into account, but
    PJH> they don't nest either)

    but you can have a string literal inside a comment and it needs to be
    skipped. there are other cases i bet. ask damian why it is better. :)

    uri
     
    Uri Guttman, Sep 25, 2009
    #8
  9. Guest

    On Thu, 24 Sep 2009 17:13:25 -0700 (PDT), wrote:

    >On Sep 24, 12:43 pm, "" <>
    >wrote:
    >>
    >> I'm trying to write Perl code that scans through a C/C++ and
    >> matches string literals. I want to use a regular expression for
    >> this,

    >
    > Thanks for your responses! I now have two regular expressions that
    >work well. The one I came up with:
    >
    > m/" (.*? (?<!\\) (?:\\{2})* ) "/x
    >
    >and one from Randal Schwartz, with some modification by poster sln and
    >myself:
    >
    > m/" ( (?: [^\\"] | \\. )* ) "/x
    >
    > When I tested them in my Perl script, I found that it read in and
    >processed 6827 C/C++ files in about 13 seconds, no matter which of the
    >above two regular expressions I used.
    >
    > (Actually, they initially clocked in around 45-55 seconds, but
    >after repeatedly running them, they "slimmed down" to a consistent 13
    >seconds. I'm sure caching of some sort is involved somehow.)
    >
    > However, the second one you see above I modified a bit. Randal's
    >suggestion was to use the '+' modifier after [^\\"] while sln
    >suggested using '*?'.
    >
    > So I experimented with these three variants:
    >
    > m/" ( (?: [^\\"] | \\. )* ) "/x
    > m/" ( (?: [^\\"]+ | \\. )* ) "/x
    > m/" ( (?: [^\\"]* | \\. )* ) "/x
    >
    > What I found out was that the version without any modifier took
    >about 13 seconds (when operating on 6827 files), the version with the
    >'+' modifier took about 24 seconds, and the version with the '*'
    >modifier took about 32 seconds. (I made sure to run them over and
    >over to make sure caching had taken effect.)
    >
    > (I discovered that converting '*' and '+' into their non-greedy
    >versions '*?' and '+?' didn't seem to have a measurable effect.)
    >
    > So oddly enough, inclusion of the modifiers had an 11 to 19 second
    >penalty, with '*' being worse than '+'. I'm not sure why this is so,
    >but it's interesting to point out:
    >
    > m/" (.*? (?<!\\) (?:\\{2})* ) "/x # 13 seconds
    > m/" ( (?: [^\\"] | \\. )* ) "/x # 13 seconds
    > m/" ( (?: [^\\"]+ | \\. )* ) "/x # 24 seconds
    > m/" ( (?: [^\\"]* | \\. )* ) "/x # 32 seconds
    >
    >(As an aside, converting \\. to (?:\\.)+ and (?:\\.)* didn't seem to
    >have an effect, probably because escaping a character was relatively
    >rare.)
    >
    > Therefore, if you want to match a C/C++ string literal, I'd
    >recommend using one of the following two regular expressions:
    >
    > m/" (.*? (?<!\\) (?:\\{2})* ) "/x
    > m/" ( (?: [^\\"] | \\. )* ) "/x
    >
    >They both seem to run about as fast.
    >
    > Thanks for all your help!
    >
    > -- Jean-Luc Romano


    If I had to worry about time I would probably use this
    m/ " ( (?: [^"\\]+ | (?:\\.)+ )* ) " /x

    Your results may vary.
    -sln

    --------------------
    Output:
    "\"\"\"\"\\\\\\\\\\\\\\\\\" " 1 "this is one" 2 "this is tw\o \" isin't it?" ""

    (?x-ism:" ( (?: \\?. )*? ) ")
    <\"\"\"\"\\\\\\\\\\\\\\\\\" >
    <this is one>
    <this is tw\o \" isin't it?>
    <>
    the code took:2.84375 wallclock secs ( 2.84 usr + 0.00 sys = 2.84 CPU)

    (?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
    <\"\"\"\"\\\\\\\\\\\\\\\\\" >
    <this is one>
    <this is tw\o \" isin't it?>
    <>
    the code took:2.62468 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)

    (?x-ism:" ( (?: [^\\"] | \\. )* ) ")
    <\"\"\"\"\\\\\\\\\\\\\\\\\" >
    <this is one>
    <this is tw\o \" isin't it?>
    <>
    the code took:2.14033 wallclock secs ( 2.14 usr + 0.00 sys = 2.14 CPU)

    (?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
    <\"\"\"\"\\\\\\\\\\\\\\\\\" >
    <this is one>
    <this is tw\o \" isin't it?>
    <>
    the code took:1.74956 wallclock secs ( 1.75 usr + 0.00 sys = 1.75 CPU)

    -----------------------

    use strict;
    use warnings;
    use Benchmark ':hireswallclock';
    my ($t0,$t1,$rx);

    my $string = <DATA>;
    print "\n",$string,"\n";

    {
    $rx = qr/" ( (?: \\?. )*? ) "/x;
    print "\n$rx\n";
    $t0 = new Benchmark;
    for (1..100_000) {
    1 while ($string =~ /$rx/sg);
    }
    $t1 = new Benchmark;
    while ($string =~ /$rx/sg) { print "<$1> \n"; }
    print "the code took:",timestr(timediff($t1, $t0)),"\n";

    $rx = qr/" (.*? (?<!\\) (?:\\{2})* ) "/x;
    print "\n$rx\n";
    $t0 = new Benchmark;
    for (1..100_000) {
    1 while ($string =~ /$rx/sg);
    }
    $t1 = new Benchmark;
    while ($string =~ /$rx/sg) { print "<$1> \n"; }
    print "the code took:",timestr(timediff($t1, $t0)),"\n";

    $rx = qr/" ( (?: [^\\"] | \\. )* ) "/x;
    print "\n$rx\n";
    $t0 = new Benchmark;
    for (1..100_000) {
    1 while ($string =~ /$rx/sg);
    }
    $t1 = new Benchmark;
    while ($string =~ /$rx/sg) { print "<$1> \n"; }
    print "the code took:",timestr(timediff($t1, $t0)),"\n";

    $rx = qr/ " ( (?: [^"\\]+ | (?:\\.)+ )* ) " /x;
    print "\n$rx\n";
    $t0 = new Benchmark;
    for (1..100_000) {
    1 while ($string =~ /$rx/sg);
    }
    $t1 = new Benchmark;
    while ($string =~ /$rx/sg) { print "<$1> \n"; }
    print "the code took:",timestr(timediff($t1, $t0)),"\n";
    }

    __DATA__
    "\"\"\"\"\\\\\\\\\\\\\\\\\" " 1 "this is one" 2 "this is tw\o \" isin't it?" ""
     
    , Sep 26, 2009
    #9
  10. >>>>> "uri" == Uri Guttman <> writes:

    uri> but you can have a string literal inside a comment and it needs
    uri> to be skipped. there are other cases i bet. ask damian why it
    uri> is better. :)

    If you're actually parsing C, comments turn into whitespace during the
    tokenizing in translation phase 3 -- "Each comment is replaced by one
    space character." There are no string literals inside comments.

    Further, the C90 standard, paragraph 3.1.9 says:

    Except within a character constant, a string literal, or a comment,
    the characters /* introduce a comment. The contents of a comment are
    examined only to identify multibyte characters and to find the
    characters */ that terminate it.[21]

    [21] Thus, comments do not nest.

    At least in C parsing, this really *isn't* a case of balanced text,
    because all you are looking for is the closing */, and it can
    successfully be handled by regular expressions.

    Charlton



    --
    Charlton Wilbur
     
    Charlton Wilbur, Sep 26, 2009
    #10
  11. On 2009-09-25 16:38, Uri Guttman <> wrote:
    >>>>>> "PJH" == Peter J Holzer <> writes:

    > PJH> On 2009-09-24 18:56, Uri Guttman <> wrote:
    > >>>>>>> "jpc" == jl post@hotmail com <> writes:
    > >>

    > jpc> I'm trying to write Perl code that scans through a C/C++ and
    > jpc> matches string literals. I want to use a regular expression for this,
    > jpc> so that if given these inputs, it will extract these outputs:
    > >>
    > >> that can't be done easily with a single regex so don't even try. look at
    > >> text::balanced on cpan which is designed to match c strings and similar things.

    >
    > PJH> At the translation stage where string literals are recognised by
    > PJH> a C compiler there are no nesting constructs, so I don't see why
    > PJH> you would want to use Text::Balanced.
    >
    > PJH> (For a real solution you would need to take comments into account, but
    > PJH> they don't nest either)
    >
    > but you can have a string literal inside a comment


    No, you can't. You can have something that looks like a string literal,
    but it's just a series of characters which happens to have two quote
    characters in it:

    /* aa "bb */ " cc

    This is a comment with the content « aa "bb », followed by a string
    literal which starts with « cc», not the beginning of a comment with the
    content « aa "bb */ " cc».


    Here is my take on the problem with regexps:

    #!/usr/bin/perl
    use warnings;
    use strict;

    use File::Slurp;

    my $_ = read_file($ARGV[0]);

    while (
    m{
    \G
    (?:
    (?: /\* .*? \*/ )
    |
    [^'"]
    |
    ' (?: [^'\\] | \\ (?: [0-7]{0,3} | x [0-9A-Fa-f]{2} | . ) ) '
    )*
    (" (?: [^"\\] | \\ (?: [0-7]{0,3} | x [0-9A-Fa-f]{2} | . ) )* " )
    }sxg
    ) {
    print "$1\n";
    }
    __END__


    and here my attempt at using Text::Balanced:

    #!/usr/bin/perl
    use warnings;
    use strict;

    use File::Slurp;
    use Text::Balanced qw(extract_multiple extract_delimited);

    my $_ = read_file($ARGV[0]);

    while (defined (
    my $section
    = extract_multiple(
    $_,
    [
    qr{/\* .*? \*/}sx,
    qr{[^'"]}sx,
    sub { extract_delimited($_[0], q{'"}) },
    ]
    )
    )
    ) {
    print "$section\n" if $section =~ /^"/;
    }
    __END__


    I'm not sure whether the second one works correctly: The description of
    extract_delimited doesn't quite match the definition of C string and
    character literals but I think the difference doesn't matter in this
    case (Except for multi-line strings, but I don't handle them correctly
    in the regexp version either).

    I don't think the Text::Balanced version is much more readable or
    elegant or easier to maintain.

    hp
     
    Peter J. Holzer, Sep 27, 2009
    #11
  12. Guest

    On Sep 25, 10:02 pm, wrote:
    > (?x-ism:" ( (?: \\?. )*? ) ")
    > the code took:2.84375 wallclock secs ( 2.84 usr + 0.00 sys = 2.84 CPU)
    >
    > (?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
    > the code took:2.62468 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)
    >
    > (?x-ism:" ( (?: [^\\"] | \\. )* ) ")
    > the code took:2.14033 wallclock secs ( 2.14 usr + 0.00 sys = 2.14 CPU)
    >
    > (?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
    > the code took:1.74956 wallclock secs ( 1.75 usr + 0.00 sys = 1.75 CPU)



    Thanks for the benchmark code, sln. I ran it myself and pretty
    much got the same results (in that their rankings in speed were the
    same). Which puzzles me, because when I ran my code against real-
    world data, it showed that:

    m/" (.*? (?<!\\) (?:\\{2})* ) "/x

    and:

    m/" ( (?: [^\\"] | \\. )* ) "/x

    were clearly the fastest. (That is, in contrast to your benchmark
    code, which shows that:

    (?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )

    is the fastest.)

    I did discover, however, that if I took your sample input (the part
    after the __DATA__ statement) and removed all embedded double-quotes
    and re-ran your benchmark program, then the "slower" regular
    expressions started catching up to the faster ones.

    The only thing I can think of is that the efficiency of the regular
    expressions can change depending on how many escaped quotes (and other
    escaped characters) there are in the string that's examined. And
    since I was parsing through log messages (meant for the end user, not
    the programmer), escaped quotes were fairly uncommon. (They did
    exist, but rarely as more than one pair at a time.)

    So in the end, it really depends on the data itself. And the best
    way to correctly "simulate" processing the input data is to process on
    the input data itself (if that makes any sense).

    At any rare, thanks for your hard work in investigating this for
    me, sln.

    -- Jean-Luc
     
    , Oct 16, 2009
    #12
  13. Guest

    On Fri, 16 Oct 2009 07:24:05 -0700 (PDT), "" <> wrote:

    >On Sep 25, 10:02 pm, wrote:
    >> (?x-ism:" ( (?: \\?. )*? ) ")
    >> the code took:2.84375 wallclock secs ( 2.84 usr + 0.00 sys = 2.84 CPU)
    >>
    >> (?x-ism:" (.*? (?<!\\) (?:\\{2})* ) ")
    >> the code took:2.62468 wallclock secs ( 2.62 usr + 0.00 sys = 2.62 CPU)
    >>
    >> (?x-ism:" ( (?: [^\\"] | \\. )* ) ")
    >> the code took:2.14033 wallclock secs ( 2.14 usr + 0.00 sys = 2.14 CPU)
    >>
    >> (?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
    >> the code took:1.74956 wallclock secs ( 1.75 usr + 0.00 sys = 1.75 CPU)

    >
    >
    > Thanks for the benchmark code, sln. I ran it myself and pretty
    >much got the same results (in that their rankings in speed were the
    >same). Which puzzles me, because when I ran my code against real-
    >world data, it showed that:
    >
    > m/" (.*? (?<!\\) (?:\\{2})* ) "/x
    >
    >and:
    >
    > m/" ( (?: [^\\"] | \\. )* ) "/x
    >
    >were clearly the fastest. (That is, in contrast to your benchmark
    >code, which shows that:
    >
    > (?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " )
    >
    >is the fastest.)
    >
    > I did discover, however, that if I took your sample input (the part
    >after the __DATA__ statement) and removed all embedded double-quotes
    >and re-ran your benchmark program, then the "slower" regular
    >expressions started catching up to the faster ones.
    >
    > The only thing I can think of is that the efficiency of the regular
    >expressions can change depending on how many escaped quotes (and other
    >escaped characters) there are in the string that's examined. And
    >since I was parsing through log messages (meant for the end user, not
    >the programmer), escaped quotes were fairly uncommon. (They did
    >exist, but rarely as more than one pair at a time.)
    >
    > So in the end, it really depends on the data itself. And the best
    >way to correctly "simulate" processing the input data is to process on
    >the input data itself (if that makes any sense).


    I would take acception to the 'real world' scenario.
    How are you using this? Is this quote regex just a sub-expresion in a
    larger regex?

    I re-ran the tests taking out any escaped characters (I actually did before too)
    and matched just one time instead of global.

    I also ran the benches on a single large '.cpp' file with about 100 medium-large
    strings with some scattered '\'ed characters. There is about a %25 performance increase
    with the (?x-ism: " ( (?: [^"\\]+ | (?:\\.)+ )* ) " ) still, compared to the next
    fastest.

    I think the percentage difference is linear and related to the number of sub-matches
    within the " " anchors. If there are no " in the sample, the numbers are
    identical (as you would expect).

    Below is a simple few line program that does regex's on the two fastests and
    includes regular expression debug information output. It simply reads the first
    line of DATA and does regex on it.

    Below that line is my educated guess as to why these results are why they are.
    Below that is the output of the simple program.

    -sln
    ---------------------------

    use strict;
    use warnings;

    use re "debug";

    my $string = <DATA>; # just get 1 line

    $string =~ /"((?:[^"\\]+|(?:\\.)+)*)"/;

    $string =~ /"((?:[^\\"]|\\.)*)"/;

    __DATA__
    1 "this one"


    Analysis
    -------------
    Intuitively, (?: | )* is a complex grouping.
    It says do the contents 0 or more times in a *loop*.

    After each loop itteration, the next anchor (if there is one)
    past the group is checked.

    Within the group, if you are only checking for one simple character
    at a time, the total time is:

    (1 char check + 1 loop check per character) X the number of characters that pass
    example: "abbbcccd" =~ /a(?:[bc])*d/;

    However, if you check for multiple characters within the loop at a time,
    the total time is:

    (1 char check X the number of characters that pass) + 1 or 0 loop check
    example: "abbbcccd" =~ /a(?:[bc]+)*d/;

    Since the sum of all character checks only include 1 or 0 loop check (without backtracking),
    the total time is reduced.

    The regex engine initially processes anchors, it finds out where they are
    then allocates a span (as a limit) of characters between them, from
    which to process variable data. This is called an optimization.

    In this particular case /a(?:[bc]+)*d/, the anchors are 'a' and 'd'.
    The engine finds these characters in the sample, measures the distance
    between them and allows the simple character class [bc] to match up to
    that span amount (because of the '+' and can be less, but no more) before
    it does a single loop check.

    It is always possible that the regex engine will do more complex optimizations
    depending on the surrounding sub-expressions. When in doubt run a simple test
    using the pragma: use re 'debug';
    Usually, the more steps (lines of output) it takes, the longer it takes.

    See the Docs -
    C:\Perl\html\lib\pods\perldebguts.html
    search for: "Debugging regular expressions"
    ----------------------------------------------

    Output:
    -----------------
    Compiling REx "%"((?:[^%"\\]+|(?:\\.)+)*)%""
    Final program:
    1: EXACT <"> (3)
    3: OPEN1 (5)
    5: CURLYX[0] {0,32767} (30)
    7: BRANCH (20)
    8: PLUS (29)
    9: ANYOF[\0-!#-[\]-\377{unicode_all}] (0)
    20: BRANCH (FAIL)
    21: CURLYM[0] {1,32767} (29)
    23: EXACT <\\> (25)
    25: REG_ANY (26)
    26: SUCCEED (0)
    27: NOTHING (29)
    28: TAIL (29)
    29: WHILEM[1/2] (0)
    30: NOTHING (31)
    31: CLOSE1 (33)
    33: EXACT <"> (35)
    35: END (0)
    anchored "%"" at 0 floating "%"" at 1..2147483647 (checking floating) minlen 2
    Compiling REx "%"((?:[^\\%"]|\\.)*)%""
    Final program:
    1: EXACT <"> (3)
    3: OPEN1 (5)
    5: CURLYX[0] {0,32767} (25)
    7: BRANCH (19)
    8: ANYOF[\0-!#-[\]-\377{unicode_all}] (24)
    19: BRANCH (FAIL)
    20: EXACT <\\> (22)
    22: REG_ANY (24)
    23: TAIL (24)
    24: WHILEM[1/1] (0)
    25: NOTHING (26)
    26: CLOSE1 (28)
    28: EXACT <"> (30)
    30: END (0)
    anchored "%"" at 0 floating "%"" at 1..2147483647 (checking floating) minlen 2
    Guessing start of match in sv for REx "%"((?:[^%"\\]+|(?:\\.)+)*)%"" against "1
    %"this one%" %n"
    Found floating substr "%"" at offset 2...
    Contradicts anchored substr "%"", trying floating at offset 3...
    Found floating substr "%"" at offset 11...
    Found anchored substr "%"" at offset 2...
    Starting position does not contradict /^/m...
    Guessed: match at offset 2
    Matching REx "%"((?:[^%"\\]+|(?:\\.)+)*)%"" against "%"this one%" %n"
    2 <1 > <"this one"> | 1:EXACT <">(3)
    3 <1 "> <this one" > | 3:OPEN1(5)
    3 <1 "> <this one" > | 5:CURLYX[0] {0,32767}(30)
    3 <1 "> <this one" > | 29: WHILEM[1/2](0)
    whilem: matched 0 out of 0..32767
    3 <1 "> <this one" > | 7: BRANCH(20)
    3 <1 "> <this one" > | 8: PLUS(29)
    ANYOF[\0-!#-[\]-\377{unicode_all}] can m
    atch 8 times out of 2147483647...
    11 <"this one> <" %n> | 29: WHILEM[1/2](0)
    whilem: matched 1 out of 0..32767
    11 <"this one> <" %n> | 7: BRANCH(20)
    11 <"this one> <" %n> | 8: PLUS(29)
    ANYOF[\0-!#-[\]-\377{unicode_all}]
    can match 0 times out of 2147483647...
    failed...
    11 <"this one> <" %n> | 20: BRANCH(28)
    11 <"this one> <" %n> | 21: CURLYM[0] {1,32767}(29)
    11 <"this one> <" %n> | 23: EXACT <\\>(25)
    failed...
    failed...
    BRANCH failed...
    whilem: failed, trying continuation...

    11 <"this one> <" %n> | 30: NOTHING(31)
    11 <"this one> <" %n> | 31: CLOSE1(33)
    11 <"this one> <" %n> | 33: EXACT <">(35)
    12 <"this one"> < %n> | 35: END(0)
    Match successful!
    Guessing start of match in sv for REx "%"((?:[^\\%"]|\\.)*)%"" against "1 %"this
    one%" %n"
    Found floating substr "%"" at offset 2...
    Contradicts anchored substr "%"", trying floating at offset 3...
    Found floating substr "%"" at offset 11...
    Found anchored substr "%"" at offset 2...
    Starting position does not contradict /^/m...
    Guessed: match at offset 2
    Matching REx "%"((?:[^\\%"]|\\.)*)%"" against "%"this one%" %n"
    2 <1 > <"this one"> | 1:EXACT <">(3)
    3 <1 "> <this one" > | 3:OPEN1(5)
    3 <1 "> <this one" > | 5:CURLYX[0] {0,32767}(25)
    3 <1 "> <this one" > | 24: WHILEM[1/1](0)
    whilem: matched 0 out of 0..32767
    3 <1 "> <this one" > | 7: BRANCH(19)
    3 <1 "> <this one" > | 8: ANYOF[\0-!#-[\]-\377{unicode_all}](24)
    4 <1 "t> <his one" > | 24: WHILEM[1/1](0)
    whilem: matched 1 out of 0..32767
    4 <1 "t> <his one" > | 7: BRANCH(19)
    4 <1 "t> <his one" > | 8: ANYOF[\0-!#-[\]-\377{unicode_all}](2
    4)
    5 <1 "th> <is one" %n> | 24: WHILEM[1/1](0)
    whilem: matched 2 out of 0..32767
    5 <1 "th> <is one" %n> | 7: BRANCH(19)
    5 <1 "th> <is one" %n> | 8: ANYOF[\0-!#-[\]-\377{unicode_all
    }](24)
    6 < "thi> <s one" %n> | 24: WHILEM[1/1](0)
    whilem: matched 3 out of 0..3276
    7
    6 < "thi> <s one" %n> | 7: BRANCH(19)
    6 < "thi> <s one" %n> | 8: ANYOF[\0-!#-[\]-\377{unicode
    _all}](24)
    7 <"this> < one" %n> | 24: WHILEM[1/1](0)
    whilem: matched 4 out of 0..
    32767
    7 <"this> < one" %n> | 7: BRANCH(19)
    7 <"this> < one" %n> | 8: ANYOF[\0-!#-[\]-\377{uni
    code_all}](24)
    8 <"this > <one" %n> | 24: WHILEM[1/1](0)
    whilem: matched 5 out of
    0..32767
    8 <"this > <one" %n> | 7: BRANCH(19)
    8 <"this > <one" %n> | 8: ANYOF[\0-!#-[\]-\377
    {unicode_all}](24)
    9 <"this o> <ne" %n> | 24: WHILEM[1/1](0)
    whilem: matched 6 ou
    t of 0..32767
    9 <"this o> <ne" %n> | 7: BRANCH(19)
    9 <"this o> <ne" %n> | 8: ANYOF[\0-!#-[\]-
    \377{unicode_all}](24)
    10 <"this on> <e" %n> | 24: WHILEM[1/1](0)
    whilem: matched
    7 out of 0..32767
    10 <"this on> <e" %n> | 7: BRANCH(19)
    10 <"this on> <e" %n> | 8: ANYOF[\0-!#-
    [\]-\377{unicode_all}](24)
    11 <"this one> <" %n> | 24: WHILEM[1/1](
    0)
    whilem: matc
    hed 8 out of 0..32767
    11 <"this one> <" %n> | 7: BRANCH(19)

    11 <"this one> <" %n> | 8: ANYOF[\0
    -!#-[\]-\377{unicode_all}](24)
    failed..
    ..
    11 <"this one> <" %n> | 19: BRANCH(23)

    11 <"this one> <" %n> | 20: EXACT <\
    \>(22)
    failed..
    ..
    BRANCH fai
    led...
    whilem: fail
    ed, trying continuation...
    11 <"this one> <" %n> | 25: NOTHING(26
    )
    11 <"this one> <" %n> | 26: CLOSE1(28)

    11 <"this one> <" %n> | 28: EXACT <">(
    30)
    12 <"this one"> < %n> | 30: END(0)
    Match successful!
    Freeing REx: "%"((?:[^%"\\]+|(?:\\.)+)*)%""
    Freeing REx: "%"((?:[^\\%"]|\\.)*)%""
     
    , Oct 18, 2009
    #13
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Old Wolf
    Replies:
    0
    Views:
    563
    Old Wolf
    Mar 14, 2005
  2. Replies:
    2
    Views:
    424
    George Sakkis
    Jul 13, 2005
  3. Replies:
    19
    Views:
    1,160
    Daniel Vallstrom
    Mar 15, 2005
  4. Anonieko Ramos

    What's wrong with rpc-literal? Why use doc-literal?

    Anonieko Ramos, Sep 27, 2004, in forum: ASP .Net Web Services
    Replies:
    0
    Views:
    392
    Anonieko Ramos
    Sep 27, 2004
  5. Old Echo
    Replies:
    1
    Views:
    190
    Adam Shelly
    Sep 4, 2008
Loading...

Share This Page