know-how(-not) about regular expressions

Discussion in 'Perl Misc' started by Helmut Richter, Feb 12, 2010.

  1. For a seemingly simple problem with regular expressions I tried out several
    solutions. One of them seems to be working now, but I would like to learn why
    the solutions behave differently. Perl is 5.8.8 on Linux.

    The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;,
    but not within markup. The following code reads and consumes a variable
    $inbuf0 and builds up a variable $inbuf with the result.

    Solution 1:

    while ($inbuf0) {
        $inbuf0 =~ /^(?:                                            # skip initial sequences of
            [^<\&#\$\\]+                                            # harmless characters
          | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>        # start tags
          | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>     # end tags
          | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));  # entity or character references
          | <!--(?:.|\n)*?-->                                       # comments
          | <[?](?:.|\n)*?[?]>                                      # processing instructions, etc.
        )*/x;
        $inbuf .= $&;
        $inbuf0 = $';
        if ($inbuf0) {
            $inbuf .= '&#' . ord($inbuf0) . ';';
            substr ($inbuf0, 0, 1) = '';
            $replaced = 1;
        };
    };

    Here the regexp eats up the maximal initial string that need not be processed
    (note the * at the end of the regexp) and then processes the first character
    of the remainder.

    In this version, it sometimes works and sometimes blows up with a segmentation
    fault.

    Another version has * instead of + at the "harmless characters". That one does
    not try all the alternatives, because the first one always matches (possibly
    an empty string); in effect the * at the end of the regexp is never exercised.

    Yet another version has nothing instead of + at the "harmless characters",
    thus eating at most one character per iteration of the final *. This should
    have the same net effect, but it always blows up with a segmentation fault.


    Solution 2:

    while ($inbuf0) {
        if ($inbuf0 =~ /^                                               # skip initial
            [^<\&#\$\\]+                                                # harmless characters
          | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>            # start tags
          | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>         # end tags
          | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+));  # entity or character references
          | <!--(?:.|\n)*?-->                                           # comments
          | <[?](?:.|\n)*?[?]>                                          # processing instructions, etc.
        /x) {
            $inbuf .= $&;
            $inbuf0 = $';
        } else {
            $inbuf .= '&#' . ord($inbuf0) . ';';
            substr ($inbuf0, 0, 1) = '';
            $replaced = 1;
        };
    };

    Here the regexp eats up an initial string that need not be processed, typically
    not a maximal one (note the absence of the * at the end of the regexp), and, if
    nothing has been found, processes the first character of the input.

    This version runs considerably slower, by a factor of three, but has so far
    not yielded segmentation faults. I am using it now.

    I am sure there are lots of other ways to do it. What knowledge would have
    saved me the numerous trial-and-error cycles and let me get it right from
    the beginning?

    --
    Helmut Richter
     
    Helmut Richter, Feb 12, 2010
    #1

  2. Helmut Richter <> writes:

    > For a seemingly simple problem with regular expressions I tried out several
    > solutions. One of them seems to be working now, but I would like to learn why
    > the solutions behave differently. Perl is 5.8.8 on Linux.


    The regexp engine in perl 5.8.8 is implemented by recursion. This is
    known to cause segmentation faults on some occasions. See
    http://www.nntp.perl.org/group/perl.perl5.porters/2006/05/msg113036.html

    Upgrading to perl 5.10 solves this issue by making the regexp engine
    iterative instead.
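
    A minimal sketch of that failure mode (illustrative only, not from the
    original post; the exact input size at which 5.8 gives up is platform and
    stack dependent):

    use strict;
    use warnings;

    # One character per iteration of the outer "*", as in the variant with
    # nothing instead of "+" at the harmless characters.  On the recursive
    # 5.8 engine a long enough input can hit the recursion limit, fail, or
    # crash outright; the iterative 5.10 engine handles it quietly.
    my $harmless = 'a' x 100_000;
    if ($harmless =~ /^(?: [^<&#\$\\] | <[^>]*> )*$/x) {
        print "matched ", length($harmless), " characters\n";
    }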

    > The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;,
    > but not within markup. The following code reads and consumes a variable
    > $inbuf0 and builds up a variable $inbuf with the result.


    Trying to handle XML or HTML by parsing it with regular expressions isn't
    recommended at all. I would use some XML parser, walk through the DOM, and
    apply the trivial substitution to the content of each text node.
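
    For illustration, a minimal sketch of that approach (assuming XML::LibXML,
    whose load_html with the recover option copes with tag-soup input):

    use strict;
    use warnings;
    use XML::LibXML;

    my $html = do { local $/; <DATA> };
    my $doc  = XML::LibXML->load_html(string => $html, recover => 1);

    for my $text ($doc->findnodes('//text()')) {
        my $data = $text->data;
        $data =~ s/([#\$\\])/'&#' . ord($1) . ';'/ge;   # the trivial substitution
        $text->setData($data);
    }

    # Caveat: the serializer escapes '&' in text nodes, so the references
    # inserted above come out as '&amp;#35;' etc.  If the literal '&#35;'
    # form is required downstream, do the replacement while writing the
    # text nodes out instead.
    print $doc->toStringHTML;

    __DATA__
    <p>price: $5, item #1, path C:\tmp &amp; more</p>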

    //Makholm
     
    Peter Makholm, Feb 12, 2010
    #2

  3. Helmut Richter <> wrote:
    >For a seemingly simple problem with regular expressions I tried out several
    >solutions. One of them seems to be working now, but I would like to learn why
    >the solutions behave differently. Perl is 5.8.8 on Linux.
    >
    >The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;,
    >but not within markup.

    [...]

    You may want to read up on the Chomsky hierarchy. HTML is not a regular
    language but a context-free language. Therefore it cannot be parsed by a
    regular engine.

    Granted, Perl's Regular Expressions have extensions that make them
    significantly more powerful than a formal regular engine, but they are
    still the wrong tool for the job. Use any standard HTML parser to
    dissect your file into its components and then apply your substitution
    to those components where you want them applied.
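
    For what it's worth, a minimal sketch of that with HTML::Parser (an
    assumption about the wiring, not tested against your data): markup events
    are copied through verbatim, and the substitution is applied only to text
    events, with existing references matched first so they pass untouched.

    use strict;
    use warnings;
    use HTML::Parser;

    my $html = do { local $/; <DATA> };
    my $out  = '';

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,   # don't hand us text chunks split mid-reference
        # anything that is not plain text (tags, comments, declarations,
        # processing instructions) is copied out unchanged
        default_h => [ sub { $out .= $_[0] }, 'text' ],
        # plain text between tags: keep references, replace the three characters
        text_h    => [ sub {
            my ($text) = @_;
            $text =~ s{ (&(?:[A-Za-z][A-Za-z0-9]*|\#[0-9]+|\#x[0-9A-Fa-f]+);)  # existing reference
                      | ([\#\$\\])                                             # character to replace
                      }{ defined $2 ? '&#' . ord($2) . ';' : $1 }gex;
            $out .= $text;
        }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    print $out;

    __DATA__
    <p title="a#b">replace #, $ and \ here; keep <b>&#160;</b> and &amp;</p>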

    jue
     
    Jürgen Exner, Feb 12, 2010
    #3
  4. On Fri, 12 Feb 2010, wrote:

    > You may want to read up on Chomsky hierarchy. HTML is a not a a regular
    > language but a context-free language. Therefore it cannot be parsed by a
    > regular engine.


    But the distinction between markup and non-markup is regular. The only
    parenthesis-like structure I have found so far is the nesting of brackets in
    <![CDATA[ ... ]]>, but this is also regular, as ]]> cannot occur inside.

    *If* I were interested in the semantics of the tags, I would probably
    follow the advice given here to use an XML analyser, provided I keep
    control of what to do when the input is not well-formed XML. Just being
    told "your data is not okay, so I cannot do anything with it" would not
    suffice: even in an environment where the end user has full control of
    everything, it is not always the best idea to have him fix every error
    before proceeding; sometimes it is better to leave errors in the input and
    fix them at a later step.

    --
    Helmut Richter
     
    Helmut Richter, Feb 12, 2010
    #4
  5. Helmut Richter wrote:

    > [again parsing the wrong way]


    Is there a newsgroup or mailing list that we can refer "them" to?
    I am sure that we are well past our monthly share already.

    --
    Ruud
     
    Dr.Ruud, Feb 12, 2010
    #5
  6. On Fri, 12 Feb 2010 12:40:14 +0100, Helmut Richter <> wrote:

    >For a seemingly simple problem with regular expressions I tried out several
    >solutions. One of them seems to be working now, but I would like to learn why
    >the solutions behave differently. Perl is 5.8.8 on Linux.
    >
    >The task is to replace the characters # $ \ by their HTML entity, e.g. &#35;,
    >but not within markup.

    ^^^
    I find that odd, but I guess you would have to parse out attribute values
    to do that.

    > The following code reads and consumes a variable
    >$inbuf0 and builds up a variable $inbuf with the result.
    >
    >Solution 1:
    >
    >while ($inbuf0) {
    > $inbuf0 =~ /^(?: # skip initial sequences of
    > [^<\&#\$\\]+ # harmless characters
    > | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*> # start tags
    > | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*> # end tags
    > | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+)); # entity or character references
    > | <!--(?:.|\n)*?--> # comments
    > | <[?](?:.|\n)*?[?]> # processing instructions, etc.
    > )*/x;

    ^^^
    This is good: you're trying to filter out not only all the markup,
    but the references as well.
    However, some forms are omitted and the ones that are there are partially in error.
    This is no big deal, but the stream has to be partitioned precisely to
    extract segments with 100% certainty. That calls for a slightly more robust
    structure that allows the stream to be realigned in the case of bad markup.
    This is because validation is missing; you don't care about that, you
    just don't want to stop in that case.

    But your expression will get you close.

    > $inbuf .= $&;
    > $inbuf0 = $';
    > if ($inbuf0) {
    > $inbuf .= '&#' . ord($inbuf0) . ';';
    > substr ($inbuf0, 0, 1) = '';
    > $replaced = 1;
    > };
    >};
    >
    >Here the regexp eats up the maximal initial string (note the * at the end of
    >the regexp) that needs not be processed and then processes the first character
    >of the remainder.
    >
    >In this version, it sometimes works and sometimes blows up with segmentation
    >fault.


    The code above is wrong: you don't check for a successful match, so *substr* is
    going to GPF on you! (mmm, substr(), GPF paradise)

    >
    >Another version has * instead of + at the "harmless characters". That one does
    >not try all alternatives as the first one matches always, that is, the * at
    >the end of the regexp is not used in this case.
    >
    >Yet another version has nothing instead of + at the "harmless characters";
    >thus eating zero or one character per iteration of the final *. This should
    >have the same net effect, but it always blows up with segmentation fault.
    >
    >
    >Solution 2:
    >
    >while ($inbuf0) {
    > if ($inbuf0 =~ /^# skip initial
    > [^<\&#\$\\]+ # harmless characters
    > | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*> # start tags
    > | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*> # end tags
    > | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+)); # entity or character references
    > | <!--(?:.|\n)*?--> # comments
    > | <[?](?:.|\n)*?[?]> # processing instructions, etc.
    > /x) {
    > $inbuf .= $&;
    > $inbuf0 = $';
    > } else {
    > $inbuf .= '&#' . ord($inbuf0) . ';';
    > substr ($inbuf0, 0, 1) = '';
    > $replaced = 1;
    > };
    >};
    >


    Yeah, this is better. It's slow, though; maybe try to reduce the copying with a
    while(/.../g) type of thing. I'm not sure, but it looks like the regex is being
    initialized every time through the loop.
    Also, a while (length $inbuf0) might be more readable.
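
    For instance, a single-pass version (a sketch only; it reuses the
    sub-patterns from the original post and, like the original loop, also
    converts a stray '<' or '&' that does not start valid markup or a
    reference):

    use strict;
    use warnings;

    my $inbuf0 = do { local $/; <DATA> };

    # One pass over the string: markup and references are matched first and
    # passed through unchanged, otherwise a single offending character is
    # replaced.  No $& / $' copies, no per-character loop overhead.
    (my $inbuf = $inbuf0) =~ s{
          ( <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>         # start tags
          | </[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>       # end tags
          | &(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*
               |\#(?:[0-9]+|x[0-9A-Fa-f]+));                         # entity/character references
          | <!--.*?-->                                               # comments
          | <\?.*?\?>                                                # processing instructions, etc.
          )
        | ([<&\#\$\\])                                               # a character to replace
    }{ defined $2 ? '&#' . ord($2) . ';' : $1 }gsex;

    print $inbuf;

    __DATA__
    <p class="a#b">a # b $ c \ &amp; <br /></p>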

    >Here the regexp eats up an initial string, typically not maximal (note the
    >absence of * at the end of the regexp), that needs not be processed and, if
    >nothing has been found, processes the first character of the input.
    >
    >This version runs considerably slower, by a factor of three, but has so far
    >not yielded segmentation faults. I am using it now.
    >
    >I am sure there are lots of other ways to do it. With which knowledge
    >could I have saved the time of the numerous trial-and-error cycles and
    >done it alright from the beginning?


    I've pieced together some code that may help you with this.
    The ordering of the markup forms in the alternation is very important;
    take note of it. In particular, CDATA comes before comments, and
    *content* must always, always be last.

    The ordering shown below is absolutely crucial to
    partitioning the markup correctly!
    The biggest mistake people make is trying to parse out a single sub-form.
    It just can't be done: the entire set of forms (and in that order)
    is necessary to get even one little piece of encapsulated data.

    I didn't bench the code; it's probably fairly quick.
    One thing I can say is that it will work on any markup made of
    the included forms. Remember, it's not validating and quitting on
    error, but it does re-align the stream and continue.

    Let me know how it works in your case (errors, inconsistencies, etc.).
    Good luck!

    -sln

    # add_refs_to_content.pl
    # - sln, 2/2010
    # $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    # - Utility to create general references
    #   from a character class. Does content only.
    #   Can do attribute values with a little more
    #   cut and paste .. not needed.
    # -------------------------------------------
    use strict;
    use warnings;

    my (
        $Name,
        $Rxmarkup,
        $Rxent
    );
    Initregex();

    ##
    my $html = join '', <DATA>;

    my $newhtml = ParseAndMakeEnt(\$html);
    if ($$newhtml ne $html) {
        print "\nFixed markup:\n", $$newhtml, "\n";
    }
    else {
        print "\nNothing to fix!\n";
    }
    exit (0);


    ##
    sub ParseAndMakeEnt
    {
        my ($markup) = @_;
        my (
            $MarkupNew,
            $content,
            $lcbpos,
            $last_content_pos,
            $begin_pos
        ) = ('', '', 0, 0, 0);

        ## parse loop
        while ($$markup =~ /$Rxmarkup/g)
        {
            ## handle content buffer
            if (defined $1) {
                ## speed it up
                $content .= $1;
                if (length $2)
                {
                    if ($lcbpos == pos($$markup)) {
                        $content .= $2;
                    } else {
                        $lcbpos = pos($$markup);
                        pos($$markup) = $lcbpos - 1;
                    }
                }
                $last_content_pos = pos($$markup);
                next;
            }
            ## content here ... take it off
            if (length $content)
            {
                $begin_pos = $last_content_pos;
                ## check '<'
                if ($content =~ /</) {
                    ## markup in content
                    print "Markup '<' in content, da stuff is crap!\n";
                }
                $MarkupNew .= ${_entconv(\$content)};
                $content = '';
            }
            ## markup here ... take it off
            $MarkupNew .= substr ($$markup, $begin_pos, pos($$markup) - $begin_pos);
            $begin_pos = pos($$markup);

        } ## end parse loop

        ## check for leftover content
        if (length $content)
        {
            ## check '<'
            if ($content =~ /</) {
                ## markup in content
                print "Markup '<' in left over content, da stuff is crap!\n";
            }
            $MarkupNew .= ${_entconv(\$content)};
        }
        return \$MarkupNew;
    }

    sub _entconv
    {
        my ($strref) = @_;
        my ($buf, $lbufpos) = ('', 0);

        while ($$strref =~ /$Rxent/g) {
            if (defined $3) {
                $buf .= $3;
                if (length $4) {
                    if ($lbufpos == pos($$strref)) {
                        $buf .= $4;
                    } else {
                        $lbufpos = pos($$strref);
                        pos($$strref) = $lbufpos - 1;
                    }
                }
                next;
            }
            if (defined $2) {
                $buf .= '&#' . ord($2) . ';';
            }
            if (defined $1) {
                $buf .= $1;
            }
        }
        return \$buf;
    }

    sub Initregex
    {
        my @UC_Nstart = (
            "\\x{C0}-\\x{D6}",
            "\\x{D8}-\\x{F6}",
            "\\x{F8}-\\x{2FF}",
            "\\x{370}-\\x{37D}",
            "\\x{37F}-\\x{1FFF}",
            "\\x{200C}-\\x{200D}",
            "\\x{2070}-\\x{218F}",
            "\\x{2C00}-\\x{2FEF}",
            "\\x{3001}-\\x{D7FF}",
            "\\x{F900}-\\x{FDCF}",
            "\\x{FDF0}-\\x{FFFD}",
            "\\x{10000}-\\x{EFFFF}",
        );
        my @UC_Nchar = (
            "\\x{B7}",
            "\\x{0300}-\\x{036F}",
            "\\x{203F}-\\x{2040}",
        );
        my $Nstrt = "[A-Za-z_:" . join ('', @UC_Nstart) . "]";
        my $Nchar = "[\\w:." . join ('', @UC_Nchar) . join ('', @UC_Nstart) . "-]";
        $Name = "(?:$Nstrt$Nchar*)";

        $Rxmarkup = qr/
            (?:
                <
                (?:
                     (?: \/* $Name \s* \/*)
                    |(?: $Name (?:\s+(?:".*?"|'.*?'|[^>]*?)+) \s* \/?)
                    |(?: \?.*?\?)
                    |(?:
                        !
                        (?:
                             (?: DOCTYPE.*?)
                            |(?: \[CDATA\[.*?\]\])
                            |(?: --.*?--)
                            |(?: \[[A-Z][A-Z\ ]*\[.*?\]\])   # who knows?
                            |(?: ATTLIST.*?)
                            |(?: ENTITY.*?)
                            |(?: ELEMENT.*?)
                            # add more if necessary
                        )
                    )
                )
                >

            ) | ([^<]*)(<?)/xs;

        my $Refchars = quotemeta('#$\\');   # These are the chars to make references from
        $Rxent = qr/
             ([&%](?:$Name|\#(?:[0-9]+|x[0-9a-fA-F]+));)
            |([$Refchars])
            |([^&%$Refchars]*)([&%]?)
        /x;
    }

    __DATA__

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 # $ \ Transitional//EN">
    <HTML><HEAD>
    <META http-equiv=3DContent-Type content=3D"text/html; =
    charset=3Diso-8859-1">
    <META content=3D "MSHTML 6.00.2900.3395" name=3DGENERATOR>

    <STYLE></STYLE>
    <test name = " thi<s # $ \ is a " test>
    </HEAD>
    <BODY bgColor=3D#ffffff>

    should fix these: # $ \
    but not these:  ¯
    fix some here: &&%#$ &as; &&#a0

    <IMG SRC = "foo.gif" ALT = "A > B">
    <IMG SRC = "foo.gif"
    ALT = "A > # $ \ B">
    <!-- <A comment # $ \ > -->
    <NN & a # $ \>
    <AA & # $ \>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>\\ # $ \ >>>>>>> ]]>

    <!-- This section commented out.
    <B>You can't # $ \ see me!</B>
    -->

    at root # $ \ > # $ \ level
     
    , Feb 12, 2010
    #6
  7. On Fri, 12 Feb 2010, Dr.Ruud wrote:

    > Date: Fri, 12 Feb 2010 19:41:58 +0100
    > From: Dr.Ruud <>
    > Newsgroups: comp.lang.perl.misc
    > Subject: Re: know-how(-not) about regular expressions
    >
    > Helmut Richter wrote:
    >
    > > [again parsing the wrong way]

    >
    > Is there a newsgroup or mailing list that we can refer "them" to?
    > I am sure that we are well past our monthly share already.
    >
    > --
    > Ruud
    >


    --
    Helmut Richter
     
    Helmut Richter, Feb 13, 2010
    #7
  8. On Fri, 12 Feb 2010 15:22:20 -0800, wrote:

    >On Fri, 12 Feb 2010 12:40:14 +0100, Helmut Richter <> wrote:
    >

    [snip]

    >Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
    >of thing.


    [snip]

    >>This version runs considerably slower, by a factor of three


    [snip]

    >I didn't bench the code, its probably fairly quick.


    [snip]

    I did bench the code, on a 7 MB file, 'mscore.xml'.
    What really makes it slow on large files is the constant
    appending to a variable. It's roughly 2+ times slower doing
    it this way.

    The fastest way to do it is to write to disk as you
    get the data: pass in a filehandle, or use some other method.

    Perl would otherwise spend all its time in realloc() because
    of all the appending.

    -sln

    >
    ># add_refs_to_content.pl
    ># - sln, 2/2010
    ># $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    ># - Util to create general reference's
    ># from a character class. Does content only.
    ># Can do attribute values with a little more
    ># cut and paste .. not needed.
    ># -------------------------------------------
    >use strict;
    >use warnings;
    >
    >my (
    > $Name,
    > $Rxmarkup,
    > $Rxent
    >);
    >Initregex();
    >


    ....
    >sub ParseAndMakeEnt
    >{

    ...
    > while ($$markup =~ /$Rxmarkup/g)
    > {
    > ## handle content buffer
    > if (defined $1) {
    > ## speed it up
    > $content .= $1;
    > if (length $2)
    > {
    > if ($lcbpos == pos($$markup)) {
    > $content .= $2;
    > } else {
    > $lcbpos = pos($$markup);
    > pos($$markup) = $lcbpos - 1;
    > }
    > }
    > $last_content_pos = pos($$markup);
    > next;
    > }
    > ## content here ... take it off
    > if (length $content)
    > {
    > $begin_pos = $last_content_pos;
    > ## check '<'
    > if ($content =~ /</) {
    > ## markup in content
    > print "Markup '<' in content, da stuff is crap!\n";
    > }
    > $MarkupNew .= ${_entconv(\$content)};

    ^^^^^^^^^^^^^^^^^^^^^^^^
    ->> do this instead: print $fh ${_entconv(\$content)};

    > $content = '';
    > }
    > ## markup here ... take it off
    > $MarkupNew .= substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);

    ^^^^^^^^^^^^^^^^^^^^^^^^
    ->> print $fh substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);

    > $begin_pos = pos($$markup);
    >
    > } ## end parse loop
    >
    > ## check for leftover content
    > if (length $content)
    > {
    > ## check '<'
    > if ($content =~ /</) {
    > ## markup in content
    > print "Markup '<' in left over content, da stuff is crap!\n";
    > }
    > $MarkupNew .= ${_entconv(\$content)};

    ^^^^^^^^^^^^^^^^^^^^^^^^
    ->> print $fh ${_entconv(\$content)};
     
    , Feb 13, 2010
    #8
  9. On 2010-02-13 17:15, <> wrote:
    > On Fri, 12 Feb 2010 15:22:20 -0800, wrote:
    >
    >>On Fri, 12 Feb 2010 12:40:14 +0100, Helmut Richter <> wrote:
    >>

    > [snip]
    >
    >>Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
    >>of thing.

    >
    > [snip]
    >
    >>>This version runs considerably slower, by a factor of three

    >
    > [snip]
    >
    >>I didn't bench the code, its probably fairly quick.

    >
    > [snip]
    >
    > I did bench the code on a 7 mbyte file 'mscore.xml'.
    > What really makes it slow on large files is the constant
    > "appending" to a variable. Its roughly 2 times + slower doing
    > it this way.
    >
    > The fastest way to do it, is to write it to the disk as you
    > get it. Pass in a filehandle, or some other method.
    >
    > Perl would have to spend all its time on realloc() because
    > of all the appending.


    That's a surprising result. Perl doubles the size of a string every time
    it needs to expand it, so it shouldn't have to realloc much
    (only O(log(length($MarkupNew))) times).

    As it is, I cannot reproduce your result. Trying it on a 22 MB file I
    get these times:

    append 9.031 9.041 9.150
    tempfile 9.285 9.370 9.479

    As you can see, appending is consistently faster than writing to a
    temporary file and reading it back.

    According to Devel::NYTProf nearly all of the time is spent in these
    lines:


    while ($$markup =~ /$Rxmarkup/g)

    $begin_pos = pos($$markup);

    while ($$strref =~ /$Rxent/g) {

    where the second is the end of the loop started in the first, so I
    suspect that the time attributed to the second line is really spent in
    the match, not the pos call.

    hp

    PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/
     
    Peter J. Holzer, Feb 14, 2010
    #9
  10. On Sun, 14 Feb 2010 13:11:13 +0100, "Peter J. Holzer" <> wrote:

    >On 2010-02-13 17:15, <> wrote:
    >> On Fri, 12 Feb 2010 15:22:20 -0800, wrote:
    >>
    >>>On Fri, 12 Feb 2010 12:40:14 +0100, Helmut Richter <> wrote:
    >>>

    >> [snip]
    >>
    >>>Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
    >>>of thing.

    >>
    >> [snip]
    >>
    >>>>This version runs considerably slower, by a factor of three

    >>
    >> [snip]
    >>
    >>>I didn't bench the code, its probably fairly quick.

    >>
    >> [snip]
    >>
    >> I did bench the code on a 7 mbyte file 'mscore.xml'.
    >> What really makes it slow on large files is the constant
    >> "appending" to a variable. Its roughly 2 times + slower doing
    >> it this way.
    >>
    >> The fastest way to do it, is to write it to the disk as you
    >> get it. Pass in a filehandle, or some other method.
    >>
    >> Perl would have to spend all its time on realloc() because
    >> of all the appending.

    >
    >That's a surprising result. Perl doubles the size of a string every time
    >it needs to expand it, so it shouldn't have to realloc much
    >(only O(log(length($MarkupNew))) times).
    >
    >As it is, I cannot reproduce your result. Trying it on a 22 MB file I
    >get these times:
    >
    >append 9.031 9.041 9.150
    >tempfile 9.285 9.370 9.479
    >
    >As you can see, appending is consistently faster than writing to a
    >temporary file and reading it back.
    >
    >According to Devel::NYTProf nearly all of the time is spent in these
    >lines:
    >
    >
    > while ($$markup =~ /$Rxmarkup/g)
    >
    > $begin_pos = pos($$markup);
    >
    > while ($$strref =~ /$Rxent/g) {
    >
    >where the second is the end of the loop started in the first, so I
    >suspect that the time attributed to the second line is really spent in
    >the match, not the pos call.
    >
    > hp
    >
    >PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/


    I looked at that profiling result. Impressive utility. Is it free?

    To isolate what I am seeing, I am posting a benchmark that simulates
    what I found with the other code. It shows huge performance degradation.
    I don't know if it's the Perl build 5.10.0 (from ActiveState) or what.

    Run this and compare the relative numbers with your build.
    I'd feel better knowing Perl is not like this and that there is a grave error
    on my part and/or in the build.

    Thanks.

    -sln

    -----------------------
    ## bench.pl
    ## ----------
    use strict;
    use warnings;
    use Benchmark ':hireswallclock';

    my ($t0, $t1);
    my @limit = (
        0,          # 0
        1_000_000,  # 1 MB
        2_000_000,  # 2 MB
        3_000_000,  # 3 MB
        4_000_000,  # 4 MB
        5_000_000   # 5 MB
    );
    my @buf = ('') x scalar(@limit);
    my $append = '<RXZWQ>sdfgg<oo/>';

    print "Starting ...\n";

    for (1 .. 2)
    {
        print "\n", '-' x 30, "\n>> Pass $_:\n";
        for my $ndx (0 .. $#limit)
        {
            my ($t0, $t1);

            $buf[$ndx] = 'P' x $limit[$ndx];   # pre-allocate buffer from limit array
            $buf[$ndx] = '';                   # clear buffer

            $t0 = new Benchmark;
            for ( 1 .. 235_000 ) {             # simulate 235,000 segment appends
                $buf[$ndx] .= $append;         # from 'mscorlib.xml'
            }
            $t1 = new Benchmark;

            printf STDERR "\nBuf[%d]", $ndx;
            printf STDERR ", start size = %.0fmb", $limit[$ndx]/1_000_000;
            printf STDERR ", current size = %d bytes\n", length $buf[$ndx];
            print STDERR "code metrics: ", timestr( timediff($t1, $t0) ), "\n";
        }
    }

    print "\n", '-' x 30, "\n";
    system ('perl -V');

    __END__

    Output =

    Starting ...

    ------------------------------
    >> Pass 1:


    Buf[0], start size = 0mb, current size = 3995000 bytes
    code metrics: 2.32798 wallclock secs ( 1.52 usr + 0.81 sys = 2.33 CPU)

    Buf[1], start size = 1mb, current size = 3995000 bytes
    code metrics: 2.23181 wallclock secs ( 1.47 usr + 0.77 sys = 2.23 CPU)

    Buf[2], start size = 2mb, current size = 3995000 bytes
    code metrics: 1.7917 wallclock secs ( 1.34 usr + 0.45 sys = 1.80 CPU)

    Buf[3], start size = 3mb, current size = 3995000 bytes
    code metrics: 1.0548 wallclock secs ( 0.78 usr + 0.28 sys = 1.06 CPU)

    Buf[4], start size = 4mb, current size = 3995000 bytes
    code metrics: 0.0685248 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

    Buf[5], start size = 5mb, current size = 3995000 bytes
    code metrics: 0.0682061 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    ------------------------------
    >> Pass 2:


    Buf[0], start size = 0mb, current size = 3995000 bytes
    code metrics: 0.0659492 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[1], start size = 1mb, current size = 3995000 bytes
    code metrics: 0.0691559 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

    Buf[2], start size = 2mb, current size = 3995000 bytes
    code metrics: 0.069617 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[3], start size = 3mb, current size = 3995000 bytes
    code metrics: 0.0686679 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[4], start size = 4mb, current size = 3995000 bytes
    code metrics: 0.0811398 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

    Buf[5], start size = 5mb, current size = 3995000 bytes
    code metrics: 0.068722 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

    ------------------------------
    Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
    Platform:
    osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
    uname=''
    config_args='undef'
    hint=recommended, useposix=true, d_sigaction=undef
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
    Compiler:
    cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX',
    optimize='-MD -Zi -DNDEBUG -O1',
    cppflags='-DWIN32'
    ccversion='12.0.8804', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
    alignbytes=8, prototype=define
    Linker and Libraries:
    ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\Perl\lib\CORE" -machine:x86'
    libpth=\lib
    libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
    perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
    libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib
    gnulibc_version=''
    Dynamic Linking:
    dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\Perl\lib\CORE" -machine:x86'


    Characteristics of this binary (from libperl):
    Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
    PERL_IMPLICIT_CONTEXT PERL_IMPLICIT_SYS
    PERL_MALLOC_WRAP PL_OP_SLAB_ALLOC USE_ITHREADS
    USE_LARGE_FILES USE_PERLIO USE_SITECUSTOMIZE
    Locally applied patches:
    ActivePerl Build 1004 [287188]
    33741 avoids segfaults invoking S_raise_signal() (on Linux)
    33763 Win32 process ids can have more than 16 bits
    32809 Load 'loadable object' with non-default file extension
    32728 64-bit fix for Time::Local
    Built under MSWin32
    Compiled at Sep 3 2008 13:16:37
    @INC:
    C:/Perl/site/lib
    C:/Perl/lib
     
    , Feb 14, 2010
    #10
  11. On 2010-02-14 22:22, <> wrote:
    > On Sun, 14 Feb 2010 13:11:13 +0100, "Peter J. Holzer" <> wrote:
    >
    >>On 2010-02-13 17:15, <> wrote:
    >>> I did bench the code on a 7 mbyte file 'mscore.xml'.
    >>> What really makes it slow on large files is the constant
    >>> "appending" to a variable. Its roughly 2 times + slower doing
    >>> it this way.
    >>>
    >>> The fastest way to do it, is to write it to the disk as you
    >>> get it. Pass in a filehandle, or some other method.
    >>>
    >>> Perl would have to spend all its time on realloc() because
    >>> of all the appending.

    >>
    >>That's a surprising result. Perl doubles the size of a string every time
    >>it needs to expand it, so it shouldn't have to realloc much
    >>(only O(log(length($MarkupNew))) times).
    >>
    >>As it is, I cannot reproduce your result. Trying it on a 22 MB file I
    >>get these times:
    >>
    >>append 9.031 9.041 9.150
    >>tempfile 9.285 9.370 9.479
    >>
    >>As you can see, appending is consistently faster than writing to a
    >>temporary file and reading it back.
    >>
    >>According to Devel::NYTProf nearly all of the time is spent in these
    >>lines:

    [...]
    >>PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/

    >
    > I looked at that profiling result. Impressive utility. Is it free?


    Yes. Available from CPAN.

    Devel::NYTProf is really nice. However, it adds a rather large overhead
    (smaller than most other Perl profilers, but still large), so it is
    impractical for programs which run for a long time, and sometimes the
    overhead hides the real performance bottleneck.
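
    For the record, running it on something like the script in this thread is
    just (assuming the module is installed from CPAN):

    perl -d:NYTProf add_refs_to_content.pl    # writes ./nytprof.out
    nytprofhtml                               # turns that into the HTML report

    which is the usual way reports like the one linked above get produced.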


    > To isolate what I am seeing, I am posting a benchmark that simulates
    > what I found on the other code. It shows huge performance degredation.
    > I don't know if its the Perl build 5.10.0 (from ActiveState) or what.

    [...]
    > ------------------------------
    >>> Pass 1:

    >
    > Buf[0], start size = 0mb, current size = 3995000 bytes
    > code metrics: 2.32798 wallclock secs ( 1.52 usr + 0.81 sys = 2.33 CPU)
    >
    > Buf[1], start size = 1mb, current size = 3995000 bytes
    > code metrics: 2.23181 wallclock secs ( 1.47 usr + 0.77 sys = 2.23 CPU)
    >
    > Buf[2], start size = 2mb, current size = 3995000 bytes
    > code metrics: 1.7917 wallclock secs ( 1.34 usr + 0.45 sys = 1.80 CPU)
    >
    > Buf[3], start size = 3mb, current size = 3995000 bytes
    > code metrics: 1.0548 wallclock secs ( 0.78 usr + 0.28 sys = 1.06 CPU)
    >
    > Buf[4], start size = 4mb, current size = 3995000 bytes
    > code metrics: 0.0685248 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)
    >
    > Buf[5], start size = 5mb, current size = 3995000 bytes
    > code metrics: 0.0682061 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)
    >
    > ------------------------------
    >>> Pass 2:

    >
    > Buf[0], start size = 0mb, current size = 3995000 bytes
    > code metrics: 0.0659492 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)
    >
    > Buf[1], start size = 1mb, current size = 3995000 bytes
    > code metrics: 0.0691559 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)
    >
    > Buf[2], start size = 2mb, current size = 3995000 bytes
    > code metrics: 0.069617 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)
    >
    > Buf[3], start size = 3mb, current size = 3995000 bytes
    > code metrics: 0.0686679 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)
    >
    > Buf[4], start size = 4mb, current size = 3995000 bytes
    > code metrics: 0.0811398 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)
    >
    > Buf[5], start size = 5mb, current size = 3995000 bytes
    > code metrics: 0.068722 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)
    >
    > ------------------------------


    Ouch. That's really a ludicrous slowdown.

    Here are the results from my home system:

    ------------------------------
    >> Pass 1:


    Buf[0], start size = 0mb, current size = 3995000 bytes
    code metrics: 0.093436 wallclock secs ( 0.08 usr + 0.01 sys = 0.09 CPU)

    Buf[1], start size = 1mb, current size = 3995000 bytes
    code metrics: 0.105453 wallclock secs ( 0.10 usr + 0.01 sys = 0.11 CPU)

    Buf[2], start size = 2mb, current size = 3995000 bytes
    code metrics: 0.10132 wallclock secs ( 0.07 usr + 0.03 sys = 0.10 CPU)

    Buf[3], start size = 3mb, current size = 3995000 bytes
    code metrics: 0.10031 wallclock secs ( 0.05 usr + 0.04 sys = 0.09 CPU)

    Buf[4], start size = 4mb, current size = 3995000 bytes
    code metrics: 0.0609372 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[5], start size = 5mb, current size = 3995000 bytes
    code metrics: 0.060972 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    ------------------------------
    >> Pass 2:


    Buf[0], start size = 0mb, current size = 3995000 bytes
    code metrics: 0.058821 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[1], start size = 1mb, current size = 3995000 bytes
    code metrics: 0.0602 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[2], start size = 2mb, current size = 3995000 bytes
    code metrics: 0.060935 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[3], start size = 3mb, current size = 3995000 bytes
    code metrics: 0.0601468 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[4], start size = 4mb, current size = 3995000 bytes
    code metrics: 0.0608931 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    Buf[5], start size = 5mb, current size = 3995000 bytes
    code metrics: 0.0607629 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

    ------------------------------

    The base time (0.06 seconds) is about the same as yours, so I assume we use
    processors of roughly the same speed (an Intel Core2 6300 @ 1.86GHz in my
    case). But I see a slowdown of less than 2 (0.10/0.06), while you see a
    slowdown of almost 35 (2.33/0.068).

    I don't have a plausible explanation for that. It seems most likely that
    ActiveState perl extends strings linearly instead of exponentially, but
    why it would do such a stupid thing is beyond me.

    hp
     
    Peter J. Holzer, Feb 14, 2010
    #11
  12. On Mon, 15 Feb 2010 00:10:56 +0100, "Peter J. Holzer" <> wrote:

    >On 2010-02-14 22:22, <> wrote:
    >> On Sun, 14 Feb 2010 13:11:13 +0100, "Peter J. Holzer" <> wrote:
    >>
    >>>On 2010-02-13 17:15, <> wrote:
    >>>> I did bench the code on a 7 mbyte file 'mscore.xml'.
    >>>> What really makes it slow on large files is the constant
    >>>> "appending" to a variable. Its roughly 2 times + slower doing
    >>>> it this way.
    >>>>
    >>>> The fastest way to do it, is to write it to the disk as you
    >>>> get it. Pass in a filehandle, or some other method.
    >>>>
    >>>> Perl would have to spend all its time on realloc() because
    >>>> of all the appending.
    >>>
    >>>That's a surprising result. Perl doubles the size of a string every time
    >>>it needs to expand it, so it shouldn't have to realloc much
    >>>(only O(log(length($MarkupNew))) times).
    >>>
    >>>As it is, I cannot reproduce your result. Trying it on a 22 MB file I
    >>>get these times:
    >>>
    >>>append 9.031 9.041 9.150
    >>>tempfile 9.285 9.370 9.479
    >>>
    >>>As you can see, appending is consistently faster than writing to a
    >>>temporary file and reading it back.
    >>>
    >>>According to Devel::NYTProf nearly all of the time is spent in these
    >>>lines:

    >[...]
    >>>PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/

    >>
    >> I looked at that profiling result. Impressive utility. Is it free?

    >
    >Yes. Available from CPAN.
    >
    >Devel::NYTProf is really nice. However, it adds a rather large overhead
    >(smaller than most other Perl profilers, but still large), so it is
    >impractical for programs which run for a long time and sometimes the
    >overhead hides the real performance bottleneck.
    >
    >
    >> To isolate what I am seeing, I am posting a benchmark that simulates
    >> what I found on the other code. It shows huge performance degredation.
    >> I don't know if its the Perl build 5.10.0 (from ActiveState) or what.

    >[...]
    >> ------------------------------
    >>>> Pass 1:

    [snip]
    >> ------------------------------

    >
    >Ouch. That's really a ludicrous slowdown.
    >
    >Here are the results from my home system:
    >------------------------------

    [snip]
    >The base time (0.06 seconds) is about the same as for you so I assume
    >that we use a processor of roughly the same speed (Intel Core2 6300 @
    >1.86GHz in my case). But I have only a slowdown of less than 2
    >(0.10/0.06), and you have a slowdown of almost 35 (2.33/0.068).
    >
    >I don't have a plausible explanation for that. It seems most likely that
    >Activestate perl extends strings linearly instead of exponentially but
    >why it would do such a stupid thing is beyond me.
    >
    > hp


    Yep, I have a 2.35 GHz Opteron 170 (overclocked) dual core,
    2 GB RAM, on Windows XP.

    My ActiveState perl is using gcc and built against the MS CRT, so it's using realloc() from Win32.

    Apparently the Win32 CRT realloc() and its flavors are
    crap. Using a custom malloc (for example with gcc) helps; to quote from the
    link below:
    "Compiling perl 5.10.1 without USE_IMP_SYS
    and with USE_PERL_MALLOC makes a huge difference."
    Note that this disables threading.

    Ha ha. M$hit strikes again.

    The gory details are to be found here (@ 11/09):
    (btw, some guy used an example like mine)
    --------------------
    Subject:
    "Why is Windows 100 times slower than Linux when growing a large scalar?"

    http://www.perlmonks.org/?node_id=810276

    Subquote:
    "The problem seems to lie with the CRT realloc() which grows
    the heap in iddy-biddy chunks each time"
    ----------------------

    There are not many Windows programs that use realloc() (I know
    I never use it); most just use malloc and free.
    But in a dynamic, typeless language built on primitive C,
    $var .= "..." dictates the simplest approach, i.e. realloc.
    In C++, operator overloading can append using a private growing
    scheme without realloc, which would be helpful on Win32 anyway.

    In circumstances such as these, if the final size is roughly known,
    preallocating with $var = 'a' x $size; or $var = 'a' x ($size * 2);
    should mitigate this dreadful situation.
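
    A minimal sketch of that workaround (sizes made up; the point is the single
    big allocation up front, which the later assignment of '' does not give back):

    use strict;
    use warnings;

    my $expected = 4_000_000;            # rough guess at the final size
    my $buf = 'x' x $expected;           # force one large allocation up front
    $buf = '';                           # empty the string; perl keeps the buffer

    # the 235,000 appends from the benchmark above now grow into
    # already-allocated space instead of hitting realloc() over and over
    $buf .= '<RXZWQ>sdfgg<oo/>' for 1 .. 235_000;
    print length($buf), " bytes\n";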

    I'm actually mortified by this situation.

    -sln
     
    , Feb 15, 2010
    #12
