know-how(-not) about regular expressions


Helmut Richter

For a seemingly simple problem with regular expressions I tried out several
solutions. One of them seems to be working now, but I would like to learn why
the solutions behave differently. Perl is 5.8.8 on Linux.

The task is to replace the characters # $ \ by their HTML entity, e.g. &#35; for #,
but not within markup. The following code reads and consumes a variable
$inbuf0 and builds up a variable $inbuf with the result.

Solution 1:

while ($inbuf0) {
    $inbuf0 =~ /^(?: # skip initial sequences of
        [^<\&#\$\\]+ # harmless characters
        | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*> # start tags
        | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*> # end tags
        | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+)); # entity or character references
        | <!--(?:.|\n)*?--> # comments
        | <[?](?:.|\n)*?[?]> # processing instructions, etc.
    )*/x;
    $inbuf .= $&;
    $inbuf0 = $';
    if ($inbuf0) {
        $inbuf .= '&#' . ord($inbuf0) . ';';
        substr ($inbuf0, 0, 1) = '';
        $replaced = 1;
    };
};

Here the regexp eats up the maximal initial string (note the * at the end of
the regexp) that need not be processed, and then processes the first character
of the remainder.

In this version, it sometimes works and sometimes blows up with a segmentation
fault.

Another version has * instead of + at the "harmless characters". That one does
not try all alternatives, because the first one always matches (possibly the
empty string); that is, the * at the end of the regexp is never used in this case.

Yet another version has nothing instead of + at the "harmless characters",
thus eating at most one harmless character per iteration of the final *. This
should have the same net effect, but it always blows up with a segmentation fault.
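
A tiny self-contained illustration of the three variants (with a simplified
character class instead of the original regex):

#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for the real pattern: "harmless characters" vs. "tags".
my $s = 'abc<x>def';

$s =~ /^(?:[^<]+|<[^>]*>)*/;    # '+' variant
print "plus:   '$&'\n";         # prints 'abc<x>def'; the tag alternative is
                                # tried when [^<]+ fails at '<'

$s =~ /^(?:[^<]*|<[^>]*>)*/;    # '*' variant
print "star:   '$&'\n";         # prints 'abc'; once [^<]* matches the empty
                                # string at '<', the outer * stops, so the tag
                                # alternative is never reached

$s =~ /^(?:[^<]|<[^>]*>)*/;     # unquantified variant
print "single: '$&'\n";         # prints 'abc<x>def', but the outer * now
                                # iterates once per harmless character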


Solution 2:

while ($inbuf0) {
    if ($inbuf0 =~ /^# skip initial
        [^<\&#\$\\]+ # harmless characters
        | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*> # start tags
        | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*> # end tags
        | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+)); # entity or character references
        | <!--(?:.|\n)*?--> # comments
        | <[?](?:.|\n)*?[?]> # processing instructions, etc.
    /x) {
        $inbuf .= $&;
        $inbuf0 = $';
    } else {
        $inbuf .= '&#' . ord($inbuf0) . ';';
        substr ($inbuf0, 0, 1) = '';
        $replaced = 1;
    };
};

Here the regexp eats up an initial string, typically not a maximal one (note the
absence of * at the end of the regexp), that need not be processed and, if
nothing has been matched, processes the first character of the input.

This version runs considerably slower, by a factor of three, but has so far
not yielded segmentation faults. I am using it now.

I am sure there are lots of other ways to do it. With which knowledge
could I have saved the time of the numerous trial-and-error cycles and
done it right from the beginning?
 

Peter Makholm

Helmut Richter said:
For a seemingly simple problem with regular expressions I tried out several
solutions. One of them seems to be working now, but I would like to learn why
the solutions behave differently. Perl is 5.8.8 on Linux.

The regexp engine in perl 5.8.8 is implemented by recursion. This is
known to cause segmentation faults on some occasions. See
http://www.nntp.perl.org/group/perl.perl5.porters/2006/05/msg113036.html

Upgrading to perl 5.10 solves this issue by making the regexp engine
iterative instead.
The task is to replace the characters # $ \ by their HTML entity, e.g. &#35; for #,
but not within markup. The following code reads and consumes a variable
$inbuf0 and builds up a variable $inbuf with the result.

Trying to handle XML and HTML correctly by parsing it with regular
expressions isn't recommended at all. I would use some XML parser,
walk through the DOM, and change the content of the text nodes by
applying the trivial substitution to each text node.
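
A minimal sketch of that kind of approach, here using HTML::Parser's event
interface rather than a full DOM walk (untested; only text events get the
substitution, everything else passes through verbatim, which also sidesteps
re-escaping on serialization):

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my $out = '';
my $p = HTML::Parser->new(
    api_version => 3,
    # text events: apply the substitution to the raw text chunk
    text_h      => [ sub {
                         my ($text) = @_;
                         $text =~ s/([#\$\\])/'&#' . ord($1) . ';'/ge;
                         $out .= $text;
                     }, 'text' ],
    # everything else (tags, comments, declarations, PIs): copy unchanged
    default_h   => [ sub { $out .= shift }, 'text' ],
);
$p->parse(join '', <STDIN>);
$p->eof;
print $out;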

//Makholm
 

Jürgen Exner

Helmut Richter said:
For a seemingly simple problem with regular expressions I tried out several
solutions. One of them seems to be working now, but I would like to learn why
the solutions behave differently. Perl is 5.8.8 on Linux.

The task is to replace the characters # $ \ by their HTML entity, e.g. &#35; for #,
but not within markup.
[...]

You may want to read up on the Chomsky hierarchy. HTML is not a regular
language but a context-free language. Therefore it cannot be parsed by a
regular engine.

Granted, Perl's Regular Expressions have extensions that make them
significantly more powerful than a formal regular engine, but they are
still the wrong tool for the job. Use any standard HTML parser to
dissect your file into its components and then apply your substitution
to those components where you want them applied.

jue
 

Helmut Richter

You may want to read up on the Chomsky hierarchy. HTML is not a regular
language but a context-free language. Therefore it cannot be parsed by a
regular engine.

But the distinction between markup and non-markup is regular. The only
parenthesis-like structure I have found so far is the nesting of brackets in
<![CDATA[ ... ]]>, but this is also regular, as ]]> cannot occur inside.

*If* I were interested in the semantics of the tags, I would probably
follow the advice given here to use an XML analyser, provided I keep
control of what happens when the input is not well-formed XML. Just being
told "your data is not okay, so nothing can be done with it" would not
suffice: even in an environment where the end-user has full control of
everything, it is not always the best idea to have him fix every error
before proceeding; sometimes it is better to leave errors in the input and
fix them at a later step.
 

Dr.Ruud

Helmut said:
[again parsing the wrong way]

Is there a newsgroup or mailing list that we can refer "them" to?
I am sure that we are well past our monthly share already.
 

sln

For a seemingly simple problem with regular expressions I tried out several
solutions. One of them seems to be working now, but I would like to learn why
the solutions behave differently. Perl is 5.8.8 on Linux.

The task is to replace the characters # $ \ by their HTML entity, e.g. &#35; for #,
but not within markup.
^^^
I find that odd, but I guess you would have to parse out attribute values to do
that.
The following code reads and consumes a variable
$inbuf0 and builds up a variable $inbuf with the result.

Solution 1:

while ($inbuf0) {
$inbuf0 =~ /^(?: # skip initial sequences of
[^<\&#\$\\]+ # harmless characters
| <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*> # start tags
| <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*> # end tags
| \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+)); # entity or character references
| <!--(?:.|\n)*?--> # comments
| <[?](?:.|\n)*?[?]> # processing instructions, etc.
)*/x;
^^^
This is good, you're trying to filter out not only all the markup
but references as well.
However, some forms are omitted and the ones that are there are partially in error.
This is no big deal, but the stream has to be partitioned precisely to
extract segments with 100% certainty. This means a slightly more robust
structure to allow stream realignment in the case of bad markup.
This is because validation is missing, but you don't care about that, you
just don't want to stop in that case.

But your expression will get you close.
$inbuf .= $&;
$inbuf0 = $';
if ($inbuf0) {
$inbuf .= '&#' . ord($inbuf0) . ';';
substr ($inbuf0, 0, 1) = '';
$replaced = 1;
};
};

Here the regexp eats up the maximal initial string (note the * at the end of
the regexp) that need not be processed, and then processes the first character
of the remainder.

In this version, it sometimes works and sometimes blows up with a segmentation
fault.

The code above is wrong: you don't check for a successful match, so *substr* is
going to GPF on your ass! (mmm, substr(), GPF paradise)
Another version has * instead of + at the "harmless characters". That one does
not try all alternatives, because the first one always matches (possibly the
empty string); that is, the * at the end of the regexp is never used in this case.

Yet another version has nothing instead of + at the "harmless characters",
thus eating at most one harmless character per iteration of the final *. This
should have the same net effect, but it always blows up with a segmentation fault.


Solution 2:

while ($inbuf0) {
if ($inbuf0 =~ /^# skip initial
[^<\&#\$\\]+ # harmless characters
| <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*> # start tags
| <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*> # end tags
| \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+)); # entity or character references
| <!--(?:.|\n)*?--> # comments
| <[?](?:.|\n)*?[?]> # processing instructions, etc.
/x) {
$inbuf .= $&;
$inbuf0 = $';
} else {
$inbuf .= '&#' . ord($inbuf0) . ';';
substr ($inbuf0, 0, 1) = '';
$replaced = 1;
};
};

Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
of thing. I'm not sure but it looks like the regex is being initialized every time
through the loop.
Also, a while(length $inbuf0) might be more readable.
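
For instance, the whole loop could be driven by \G and /gc instead of chopping
$' off on every iteration, something like this (an untested sketch; the
alternation body is the one you already have, only the loop bookkeeping differs):

use strict;
use warnings;

my $inbuf0 = do { local $/; <STDIN> };
my $inbuf  = '';
my $replaced = 0;

my $skip = qr/
      [^<\&#\$\\]+                                         # harmless characters
    | <[A-Za-z:_\200-\377](?:[^>"']|"[^"]*"|'[^']*')*>     # start tags
    | <\/[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*\s*>  # end tags
    | \&(?:[A-Za-z:_\200-\377][-.0-9A-Za-z:_\200-\377]*|\#(?:[0-9]+|x[0-9A-Fa-f]+)); # references
    | <!--(?:.|\n)*?-->                                     # comments
    | <[?](?:.|\n)*?[?]>                                    # processing instructions, etc.
/x;

pos($inbuf0) = 0;
while (pos($inbuf0) < length $inbuf0) {
    if ($inbuf0 =~ /\G($skip)/gc) {     # advance over one skippable chunk
        $inbuf .= $1;
    } else {                            # no match here: encode one character
        $inbuf .= '&#' . ord(substr($inbuf0, pos($inbuf0), 1)) . ';';
        pos($inbuf0) = pos($inbuf0) + 1;
        $replaced = 1;
    }
}
print $inbuf;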
Here the regexp eats up an initial string, typically not a maximal one (note the
absence of * at the end of the regexp), that need not be processed and, if
nothing has been matched, processes the first character of the input.

This version runs considerably slower, by a factor of three, but has so far
not yielded segmentation faults. I am using it now.

I am sure there are lots of other ways to do it. With which knowledge
could I have saved the time of the numerous trial-and-error cycles and
done it alright from the beginning?

I've pieced some code together that may help you on this.
The ordering (alternation) of the markup forms is very important;
take note of it.
In particular: CDATA before comments, and finally *content* must always,
always be last.

The ordering shown below is absolutely crucial to
correctly partition markup!
The biggest mistake people make is trying to parse out a sub-form.
It just can't be done. The entire set of forms (and in order)
is necessary to get even one little piece of encapsulated data.

I didn't bench the code; it's probably fairly quick.
One thing I can say is that it will work on any markup given
the included forms. Remember, it's not validating and quitting on
error, but it does re-align the stream and continue.

Let me know how it works in your case (errors, inconsistencies, etc.).
Good luck!

-sln

# add_refs_to_content.pl
# - sln, 2/2010
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# - Util to create general references
# from a character class. Does content only.
# Can do attribute values with a little more
# cut and paste .. not needed.
# -------------------------------------------
use strict;
use warnings;

my (
    $Name,
    $Rxmarkup,
    $Rxent
);
Initregex();

##
my $html = join '', <DATA>;

my $newhtml = ParseAndMakeEnt(\$html);
if ($$newhtml ne $html) {
    print "\nFixed markup:\n",$$newhtml,"\n";
}
else {
    print "\nNothing to fix!\n";
}
exit (0);


##
sub ParseAndMakeEnt
{
    my ($markup) = @_;
    my (
        $MarkupNew,
        $content,
        $lcbpos,
        $last_content_pos,
        $begin_pos
    ) = ('','',0,0,0);

    ## parse loop
    while ($$markup =~ /$Rxmarkup/g)
    {
        ## handle content buffer
        if (defined $1) {
            ## speed it up
            $content .= $1;
            if (length $2)
            {
                if ($lcbpos == pos($$markup)) {
                    $content .= $2;
                } else {
                    $lcbpos = pos($$markup);
                    pos($$markup) = $lcbpos - 1;
                }
            }
            $last_content_pos = pos($$markup);
            next;
        }
        ## content here ... take it off
        if (length $content)
        {
            $begin_pos = $last_content_pos;
            ## check '<'
            if ($content =~ /</) {
                ## markup in content
                print "Markup '<' in content, da stuff is crap!\n";
            }
            $MarkupNew .= ${_entconv(\$content)};
            $content = '';
        }
        ## markup here ... take it off
        $MarkupNew .= substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);
        $begin_pos = pos($$markup);

    } ## end parse loop

    ## check for leftover content
    if (length $content)
    {
        ## check '<'
        if ($content =~ /</) {
            ## markup in content
            print "Markup '<' in left over content, da stuff is crap!\n";
        }
        $MarkupNew .= ${_entconv(\$content)};
    }
    return \$MarkupNew;
}

sub _entconv
{
    my ($strref) = @_;
    my ($buf,$lbufpos) = ('',0);

    while ($$strref =~ /$Rxent/g) {
        if (defined $3) {
            $buf .= $3;
            if (length $4) {
                if ($lbufpos == pos($$strref)) {
                    $buf .= $4;
                } else {
                    $lbufpos = pos($$strref);
                    pos($$strref) = $lbufpos - 1;
                }
            }
            next;
        }
        if (defined $2) {
            $buf .= '&#'.ord($2).';';
        }
        if (defined $1) {
            $buf .= $1;
        }
    }
    return \$buf;
}

sub Initregex
{
    my @UC_Nstart = (
        "\\x{C0}-\\x{D6}",
        "\\x{D8}-\\x{F6}",
        "\\x{F8}-\\x{2FF}",
        "\\x{370}-\\x{37D}",
        "\\x{37F}-\\x{1FFF}",
        "\\x{200C}-\\x{200D}",
        "\\x{2070}-\\x{218F}",
        "\\x{2C00}-\\x{2FEF}",
        "\\x{3001}-\\x{D7FF}",
        "\\x{F900}-\\x{FDCF}",
        "\\x{FDF0}-\\x{FFFD}",
        "\\x{10000}-\\x{EFFFF}",
    );
    my @UC_Nchar = (
        "\\x{B7}",
        "\\x{0300}-\\x{036F}",
        "\\x{203F}-\\x{2040}",
    );
    my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
    my $Nchar = "[\\w:.".join ('',@UC_Nchar).join ('',@UC_Nstart)."-]";
    $Name = "(?:$Nstrt$Nchar*)";

    $Rxmarkup = qr/
        (?:
            <
            (?:
                 (?: \/* $Name \s* \/*)
                |(?: $Name (?:\s+(?:".*?"|'.*?'|[^>]*?)+) \s* \/?)
                |(?: \?.*?\?)
                |(?:
                    !
                    (?:
                         (?: DOCTYPE.*?)
                        |(?: \[CDATA\[.*?\]\])
                        |(?: --.*?--)
                        |(?: \[[A-Z][A-Z\ ]*\[.*?\]\]) # who knows?
                        |(?: ATTLIST.*?)
                        |(?: ENTITY.*?)
                        |(?: ELEMENT.*?)
                        # add more if necessary
                    )
                )
        )) | ([^<]*)(<?)/xs;

    my $Refchars = quotemeta('#$\\'); # These are the char's to make references from
    $Rxent = qr/
         ([&%](?:$Name|\#(?:[0-9]+|x[0-9a-fA-F]+));)
        |([$Refchars])
        |([^&%$Refchars]*)([&%]?)
    /x;
}

__DATA__

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 # $ \ Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D "MSHTML 6.00.2900.3395" name=3DGENERATOR>

<STYLE></STYLE>
<test name = " thi<s # $ \ is a " test>
</HEAD>
<BODY bgColor=3D#ffffff>

should fix these: # $ \
but not these:  ¯
fix some here: &&%#$ &as; &&#a0

<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
ALT = "A > # $ \ B">
<!-- <A comment # $ \ > -->
<NN & a # $ \>
<AA & # $ \>

<# Just data #>

<![INCLUDE CDATA [ >>>>>\\ # $ \ >>>>>>> ]]>

<!-- This section commented out.
<B>You can't # $ \ see me!</B>
-->

at root # $ \ > # $ \ level
 

Helmut Richter

Date: Fri, 12 Feb 2010 19:41:58 +0100
From: Dr.Ruud <[email protected]>
Newsgroups: comp.lang.perl.misc
Subject: Re: know-how(-not) about regular expressions

Helmut said:
[again parsing the wrong way]

Is there a newsgroup or mailing list that we can refer "them" to?
I am sure that we are well past our monthly share already.
 

sln

[snip]

Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
of thing.
[snip]
This version runs considerably slower, by a factor of three
[snip]

I didn't bench the code; it's probably fairly quick.

[snip]

I did bench the code on a 7 mbyte file 'mscore.xml'.
What really makes it slow on large files is the constant
"appending" to a variable. Its roughly 2 times + slower doing
it this way.

The fastest way to do it, is to write it to the disk as you
get it. Pass in a filehandle, or some other method.

Perl would have to spend all its time on realloc() because
of all the appending.

-sln
# add_refs_to_content.pl
# - sln, 2/2010
# $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
# - Util to create general references
# from a character class. Does content only.
# Can do attribute values with a little more
# cut and paste .. not needed.
# -------------------------------------------
use strict;
use warnings;

my (
$Name,
$Rxmarkup,
$Rxent
);
Initregex();

....
sub ParseAndMakeEnt
{ ...
while ($$markup =~ /$Rxmarkup/g)
{
## handle content buffer
if (defined $1) {
## speed it up
$content .= $1;
if (length $2)
{
if ($lcbpos == pos($$markup)) {
$content .= $2;
} else {
$lcbpos = pos($$markup);
pos($$markup) = $lcbpos - 1;
}
}
$last_content_pos = pos($$markup);
next;
}
## content here ... take it off
if (length $content)
{
$begin_pos = $last_content_pos;
## check '<'
if ($content =~ /</) {
## markup in content
print "Markup '<' in content, da stuff is crap!\n";
}
$MarkupNew .= ${_entconv(\$content)};
^^^^^^^^^^^^^^^^^^^^^^^^
->> do this instead: print $fh ${_entconv(\$content)};
$content = '';
}
## markup here ... take it off
$MarkupNew .= substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);
^^^^^^^^^^^^^^^^^^^^^^^^
->> print $fh substr ($$markup, $begin_pos, pos($$markup)-$begin_pos);
$begin_pos = pos($$markup);

} ## end parse loop

## check for leftover content
if (length $content)
{
## check '<'
if ($content =~ /</) {
## markup in content
print "Markup '<' in left over content, da stuff is crap!\n";
}
$MarkupNew .= ${_entconv(\$content)};
^^^^^^^^^^^^^^^^^^^^^^^^
->> print $fh ${_entconv(\$content)};
 

Peter J. Holzer

[snip]

Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
of thing.
[snip]
This version runs considerably slower, by a factor of three
[snip]

I didn't bench the code; it's probably fairly quick.

[snip]

I did bench the code on a 7 mbyte file 'mscore.xml'.
What really makes it slow on large files is the constant
"appending" to a variable. Its roughly 2 times + slower doing
it this way.

The fastest way to do it, is to write it to the disk as you
get it. Pass in a filehandle, or some other method.

Perl would have to spend all its time on realloc() because
of all the appending.

That's a surprising result. Perl doubles the size of a string every time
it needs to expand it, so it shouldn't have to realloc much
(only O(log(length($MarkupNew))) times).
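
One way to watch the buffer growth directly, assuming Devel::Peek is available
(a rough sketch; in the Dump output CUR is the string length and LEN the
allocated buffer size, so you can see how LEN jumps):

use strict;
use warnings;
use Devel::Peek qw(Dump);

# Append in fixed chunks and dump the scalar after each append.
my $s = '';
for my $i (1 .. 6) {
    $s .= 'x' x 100_000;
    print STDERR "after append $i:\n";
    Dump($s);        # prints SV details, including CUR and LEN, to STDERR
}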

As it is, I cannot reproduce your result. Trying it on a 22 MB file I
get these times:

append 9.031 9.041 9.150
tempfile 9.285 9.370 9.479

As you can see, appending is consistently faster than writing to a
temporary file and reading it back.

According to Devel::NYTProf nearly all of the time is spent in these
lines:


while ($$markup =~ /$Rxmarkup/g)

$begin_pos = pos($$markup);

while ($$strref =~ /$Rxent/g) {

where the second is the end of the loop started in the first, so I
suspect that the time attributed to the second line is really spent in
the match, not the pos call.

hp

PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/
 

sln

[snip]

Yea, this is better. Slow but maybe try to reduce copying with a while(/.../g) type
of thing.
[snip]

This version runs considerably slower, by a factor of three
[snip]

I didn't bench the code; it's probably fairly quick.

[snip]

I did bench the code on a 7 mbyte file 'mscore.xml'.
What really makes it slow on large files is the constant
"appending" to a variable. Its roughly 2 times + slower doing
it this way.

The fastest way to do it, is to write it to the disk as you
get it. Pass in a filehandle, or some other method.

Perl would have to spend all its time on realloc() because
of all the appending.

That's a surprising result. Perl doubles the size of a string every time
it needs to expand it, so it shouldn't have to realloc much
(only O(log(length($MarkupNew))) times).

As it is, I cannot reproduce your result. Trying it on a 22 MB file I
get these times:

append 9.031 9.041 9.150
tempfile 9.285 9.370 9.479

As you can see, appending is consistently faster than writing to a
temporary file and reading it back.

According to Devel::NYTProf nearly all of the time is spent in these
lines:


while ($$markup =~ /$Rxmarkup/g)

$begin_pos = pos($$markup);

while ($$strref =~ /$Rxent/g) {

where the second is the end of the loop started in the first, so I
suspect that the time attributed to the second line is really spent in
the match, not the pos call.

hp

PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/

I looked at that profiling result. Impressive utility. Is it free?

To isolate what I am seeing, I am posting a benchmark that simulates
what I found in the other code. It shows huge performance degradation.
I don't know if it's the Perl build 5.10.0 (from ActiveState) or what.

Run this and compare the relative numbers with your build.
I'd feel better knowing Perl is not like this and there is a grave error
on my part and/or build.

Thanks.

-sln

-----------------------
## bench.pl
## ----------
use strict;
use warnings;
use Benchmark ':hireswallclock';

my ($t0,$t1);
my @limit = (
    0,          # 0
    1_000_000,  # 1 MB
    2_000_000,  # 2 MB
    3_000_000,  # 3 MB
    4_000_000,  # 4 MB
    5_000_000   # 5 MB
);
my @buf = ('') x scalar(@limit);
my $append = '<RXZWQ>sdfgg<oo/>';

print "Starting ...\n";

for (1 .. 2)
{
    print "\n",'-' x 30,"\n>> Pass $_:\n";
    for my $ndx (0 .. $#limit)
    {
        my ($t0,$t1);

        $buf[$ndx] = 'P' x $limit[$ndx];  # pre-allocate buffer from limit array
        $buf[$ndx] = '';                  # clear buffer

        $t0 = new Benchmark;
        for ( 1 .. 235_000 ) {            # simulate 235,000 segment appends
            $buf[$ndx] .= $append;        # from 'mscorlib.xml'
        }
        $t1 = new Benchmark;

        printf STDERR "\nBuf[%d]", $ndx;
        printf STDERR ", start size = %.0fmb", $limit[$ndx]/1_000_000;
        printf STDERR ", current size = %d bytes\n", length $buf[$ndx];
        print STDERR "code metrics: ",timestr( timediff($t1, $t0) ),"\n";
    }
}

print "\n", '-' x 30, "\n";
system ('perl -V');

__END__

Output =

Starting ...

------------------------------
Buf[0], start size = 0mb, current size = 3995000 bytes
code metrics: 2.32798 wallclock secs ( 1.52 usr + 0.81 sys = 2.33 CPU)

Buf[1], start size = 1mb, current size = 3995000 bytes
code metrics: 2.23181 wallclock secs ( 1.47 usr + 0.77 sys = 2.23 CPU)

Buf[2], start size = 2mb, current size = 3995000 bytes
code metrics: 1.7917 wallclock secs ( 1.34 usr + 0.45 sys = 1.80 CPU)

Buf[3], start size = 3mb, current size = 3995000 bytes
code metrics: 1.0548 wallclock secs ( 0.78 usr + 0.28 sys = 1.06 CPU)

Buf[4], start size = 4mb, current size = 3995000 bytes
code metrics: 0.0685248 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

Buf[5], start size = 5mb, current size = 3995000 bytes
code metrics: 0.0682061 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

------------------------------
Buf[0], start size = 0mb, current size = 3995000 bytes
code metrics: 0.0659492 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[1], start size = 1mb, current size = 3995000 bytes
code metrics: 0.0691559 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

Buf[2], start size = 2mb, current size = 3995000 bytes
code metrics: 0.069617 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[3], start size = 3mb, current size = 3995000 bytes
code metrics: 0.0686679 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[4], start size = 4mb, current size = 3995000 bytes
code metrics: 0.0811398 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

Buf[5], start size = 5mb, current size = 3995000 bytes
code metrics: 0.068722 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

------------------------------
Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
Platform:
osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread
uname=''
config_args='undef'
hint=recommended, useposix=true, d_sigaction=undef
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX',
optimize='-MD -Zi -DNDEBUG -O1',
cppflags='-DWIN32'
ccversion='12.0.8804', gccversion='', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\Perl\lib\CORE" -machine:x86'
libpth=\lib
libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib
libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\Perl\lib\CORE" -machine:x86'


Characteristics of this binary (from libperl):
Compile-time options: MULTIPLICITY PERL_DONT_CREATE_GVSV
PERL_IMPLICIT_CONTEXT PERL_IMPLICIT_SYS
PERL_MALLOC_WRAP PL_OP_SLAB_ALLOC USE_ITHREADS
USE_LARGE_FILES USE_PERLIO USE_SITECUSTOMIZE
Locally applied patches:
ActivePerl Build 1004 [287188]
33741 avoids segfaults invoking S_raise_signal() (on Linux)
33763 Win32 process ids can have more than 16 bits
32809 Load 'loadable object' with non-default file extension
32728 64-bit fix for Time::Local
Built under MSWin32
Compiled at Sep 3 2008 13:16:37
@INC:
C:/Perl/site/lib
C:/Perl/lib
 

Peter J. Holzer

I did bench the code on a 7 mbyte file 'mscore.xml'.
What really makes it slow on large files is the constant
"appending" to a variable. Its roughly 2 times + slower doing
it this way.

The fastest way to do it, is to write it to the disk as you
get it. Pass in a filehandle, or some other method.

Perl would have to spend all its time on realloc() because
of all the appending.

That's a surprising result. Perl doubles the size of a string every time
it needs to expand it, so it shouldn't have to realloc much
(only O(log(length($MarkupNew))) times).

As it is, I cannot reproduce your result. Trying it on a 22 MB file I
get these times:

append 9.031 9.041 9.150
tempfile 9.285 9.370 9.479

As you can see, appending is consistently faster than writing to a
temporary file and reading it back.

According to Devel::NYTProf nearly all of the time is spent in these
lines: [...]
PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/

I looked at that profiling result. Impressive utility. Is it free?

Yes. Available from CPAN.

Devel::NYTProf is really nice. However, it adds a rather large overhead
(smaller than most other Perl profilers, but still large), so it is
impractical for programs which run for a long time and sometimes the
overhead hides the real performance bottleneck.

To isolate what I am seeing, I am posting a benchmark that simulates
what I found in the other code. It shows huge performance degradation.
I don't know if it's the Perl build 5.10.0 (from ActiveState) or what. [...]
------------------------------

Buf[0], start size = 0mb, current size = 3995000 bytes
code metrics: 2.32798 wallclock secs ( 1.52 usr + 0.81 sys = 2.33 CPU)

Buf[1], start size = 1mb, current size = 3995000 bytes
code metrics: 2.23181 wallclock secs ( 1.47 usr + 0.77 sys = 2.23 CPU)

Buf[2], start size = 2mb, current size = 3995000 bytes
code metrics: 1.7917 wallclock secs ( 1.34 usr + 0.45 sys = 1.80 CPU)

Buf[3], start size = 3mb, current size = 3995000 bytes
code metrics: 1.0548 wallclock secs ( 0.78 usr + 0.28 sys = 1.06 CPU)

Buf[4], start size = 4mb, current size = 3995000 bytes
code metrics: 0.0685248 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

Buf[5], start size = 5mb, current size = 3995000 bytes
code metrics: 0.0682061 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

------------------------------
Buf[0], start size = 0mb, current size = 3995000 bytes
code metrics: 0.0659492 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[1], start size = 1mb, current size = 3995000 bytes
code metrics: 0.0691559 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

Buf[2], start size = 2mb, current size = 3995000 bytes
code metrics: 0.069617 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[3], start size = 3mb, current size = 3995000 bytes
code metrics: 0.0686679 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[4], start size = 4mb, current size = 3995000 bytes
code metrics: 0.0811398 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

Buf[5], start size = 5mb, current size = 3995000 bytes
code metrics: 0.068722 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU)

------------------------------

Ouch. That's really a ludicrous slowdown.

Here are the results from my home system:

------------------------------
Buf[0], start size = 0mb, current size = 3995000 bytes
code metrics: 0.093436 wallclock secs ( 0.08 usr + 0.01 sys = 0.09 CPU)

Buf[1], start size = 1mb, current size = 3995000 bytes
code metrics: 0.105453 wallclock secs ( 0.10 usr + 0.01 sys = 0.11 CPU)

Buf[2], start size = 2mb, current size = 3995000 bytes
code metrics: 0.10132 wallclock secs ( 0.07 usr + 0.03 sys = 0.10 CPU)

Buf[3], start size = 3mb, current size = 3995000 bytes
code metrics: 0.10031 wallclock secs ( 0.05 usr + 0.04 sys = 0.09 CPU)

Buf[4], start size = 4mb, current size = 3995000 bytes
code metrics: 0.0609372 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[5], start size = 5mb, current size = 3995000 bytes
code metrics: 0.060972 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

------------------------------
Buf[0], start size = 0mb, current size = 3995000 bytes
code metrics: 0.058821 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[1], start size = 1mb, current size = 3995000 bytes
code metrics: 0.0602 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[2], start size = 2mb, current size = 3995000 bytes
code metrics: 0.060935 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[3], start size = 3mb, current size = 3995000 bytes
code metrics: 0.0601468 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[4], start size = 4mb, current size = 3995000 bytes
code metrics: 0.0608931 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

Buf[5], start size = 5mb, current size = 3995000 bytes
code metrics: 0.0607629 wallclock secs ( 0.06 usr + 0.00 sys = 0.06 CPU)

------------------------------

The base time (0.06 seconds) is about the same as for you so I assume
that we use a processor of roughly the same speed (Intel Core2 6300 @
1.86GHz in my case). But I have only a slowdown of less than 2
(0.10/0.06), and you have a slowdown of almost 35 (2.33/0.068).

I don't have a plausible explanation for that. It seems most likely that
Activestate perl extends strings linearly instead of exponentially but
why it would do such a stupid thing is beyond me.

hp
 

sln

I did bench the code on a 7 mbyte file 'mscore.xml'.
What really makes it slow on large files is the constant
"appending" to a variable. Its roughly 2 times + slower doing
it this way.

The fastest way to do it, is to write it to the disk as you
get it. Pass in a filehandle, or some other method.

Perl would have to spend all its time on realloc() because
of all the appending.

That's a surprising result. Perl doubles the size of a string every time
it needs to expand it, so it shouldn't have to realloc much
(only O(log(length($MarkupNew))) times).

As it is, I cannot reproduce your result. Trying it on a 22 MB file I
get these times:

append 9.031 9.041 9.150
tempfile 9.285 9.370 9.479

As you can see, appending is consistently faster than writing to a
temporary file and reading it back.

According to Devel::NYTProf nearly all of the time is spent in these
lines: [...]
PS: The nytprofhtml output is at http://www.hjp.at/junk/nytprof/

I looked at that profiling result. Impressive utility. Is it free?

Yes. Available from CPAN.

Devel::NYTProf is really nice. However, it adds a rather large overhead
(smaller than most other Perl profilers, but still large), so it is
impractical for programs which run for a long time and sometimes the
overhead hides the real performance bottleneck.

To isolate what I am seeing, I am posting a benchmark that simulates
what I found in the other code. It shows huge performance degradation.
I don't know if it's the Perl build 5.10.0 (from ActiveState) or what. [...] [snip]
------------------------------

Ouch. That's really a ludicrous slowdown.

Here are the results from my home system:
------------------------------ [snip]
The base time (0.06 seconds) is about the same as for you so I assume
that we use a processor of roughly the same speed (Intel Core2 6300 @
1.86GHz in my case). But I have only a slowdown of less than 2
(0.10/0.06), and you have a slowdown of almost 35 (2.33/0.068).

I don't have a plausible explanation for that. It seems most likely that
Activestate perl extends strings linearly instead of exponentially but
why it would do such a stupid thing is beyond me.

hp

Yep, I have a 2.35 GHz Opteron 170 (overclocked) dual core,
2 GB RAM, on Windows XP.

My ActiveState perl is using gcc and built against the MS CRT, so it's using realloc from Win32.

Apparently the Win32 CRT realloc() and its flavors are
crap. You can use a custom malloc instead, for example with gcc;
quote from the link below:
"Compiling perl 5.10.1 without USE_IMP_SYS
and with USE_PERL_MALLOC makes a huge difference."
Note that this disables threading, though.

Ha ha. M$hit strikes again.

The gory details are to be found here (@ 11/09):
(btw, some guy used an example like mine)
--------------------
Subject:
"Why is Windows 100 times slower than Linux when growing a large scalar?"

http://www.perlmonks.org/?node_id=810276

Subquote:
"The problem seems to lie with the CRT realloc() which grows
the heap in iddy-biddy chunks each time"
----------------------

There are not many Windows programs that use realloc() (I know
I never use it); they just use malloc and free instead.
But in a dynamic, typeless language built on primitive C,
$var .= "..." dictates the simplest approach, i.e. realloc.
In C++, operator overloading can append using a private growing
scheme without using realloc. Helpful if using Win32 anyway.

In circumstances such as these, if the final size is nearly known,
preallocating with $var = 'a' x $size; or $var = 'a' x ($size * 2);
should mitigate this dreadful circumstance.
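
Roughly like this (a sketch; the size here is a made-up figure, use whatever
estimate you have for the final length):

# Pre-extend the scalar once, then empty it; perl keeps the allocation,
# so later .= appends never have to grow the buffer again.
my $expected = 8_000_000;      # made-up estimate of the final size
my $out = 'x' x $expected;     # force one big allocation up front
$out = '';                     # length back to 0, buffer retained
# ... then append to $out as before.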

I'm actually mortified by this situation.

-sln
 
