Speed of reading some MB of data using qx(...)

I

Ilya Zakharevich

If you timed perl's own realloc, you would (I believe) find it does much
better than this. AFAICS from the code, it has a fixed set of block
sizes it actually allocates. Enlarging a block such that it doesn't go
over the block size actually allocated is *free*, not even linear in the
size of the block, since all it does is adjust the end marker. This is
the logic you are expecting sv_grow to implement, but perl has decided
that this is the allocator's responsibility.

Wrong analysis. *In this particular situation* one would find Perl's
realloc() as slow as M$'s one - but it would be called MUCH more
rarely. With my malloc(), I implemented an extra call:
malloced_size(). So on the 1st iteration, things would go similarly
to "their-malloc", and realloc() would be called - and your analysis
would apply.

But on the 2nd iteration of appending, Perl would discover the new
*really* allocated size for the string, and would update the buffer
size stored in SV. After this, for many iteractions the realloc() is
out-of-the-loop altogether.

BTW, IIRC, my malloc() does not store the "end marker" at all (unless
a debugging version is used). (The bucket size of short strings
depends only on which page of memory they belong to - well, actually,
half-page.)

Hope this helps,
Ilya
 
I

Ilya Zakharevich

I'm pretty sure that GNU malloc doesn't round up to powers of two or
something like that. However, the performance difference between GNU
malloc and Perl malloc is rather small:

Yeah, right. Not more than 3 times. ;-)
perl 5.12.1, default config, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 3.9 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 4.4 ms

1E7 chars + 1E5 x 1E1 chars: 28.4 ms
1E7 chars + 1E4 x 1E2 chars: 4.5 ms
1E7 chars + 1E3 x 1E3 chars: 2.6 ms
1E7 chars + 1E2 x 1E4 chars: 2.0 ms
1E7 chars + 1E1 x 1E5 chars: 1.9 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 2.0 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 4.4 ms

perl 5.12.1, usemymalloc=y, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 2.6 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms

1E7 chars + 1E5 x 1E1 chars: 18.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms
1E7 chars + 1E3 x 1E3 chars: 0.9 ms
1E7 chars + 1E2 x 1E4 chars: 0.9 ms
1E7 chars + 1E1 x 1E5 chars: 1.1 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.9 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 3.4 ms

Yours,
Ilya
 
P

Peter J. Holzer

Yeah, right. Not more than 3 times. ;-)

Compared to the factor of about 1000 that Wolfram and Ben observed this
is small. More importantly, growing a string to length n in constant
increments appears to be an O(n) operation with both your malloc and GNU
malloc, but O(n²) with Win32 malloc and BSD malloc.

Your malloc is certainly faster than GNU malloc. OTOH, AFAICS from
quickly scanning the comments in malloc.c, it always pads allocation to
the next power of two (minus 4) and it doesn't return unused memory back
to the OS. So it might use quite a bit more memory. Which of these
effects is more important depends on the program and is hard to
determine analytically. Which means that I should benchmark some of my
more performance-critical scripts with both allocators ;-).

hp
 
W

Wolfram Humann

Good l*rd!

The current algorithm is optimized to work in tandem with "my"
malloc(), which would round up to a certain geometric progression
anyway.  So if one use as different malloc()s, one should better use

  newlen += (newlen >> 4) + 10; /* avoid copy each time */

I finally managed to compile my own win32 perl. (Actually it was quite
easy once I refrained from doing mistakes so stupid I do not dare to
talk about them...)
Now I could modify Perl_sv_grow() and insert debugging prints and I
found good and bad news.

The bad Looks like I was *overly optimistic* (LOL!) concerning
the efficiency of the current string memory allocation on win32. The
"newlen += 10 * (newlen - SvCUR(sv))" line is only executed if
SvOOK(sv) -- i.e. in most cases it is *not* executed. Therefore win32
system realloc is not called every tenth string-append operation but
*every* time something gets appended to a string.

The good A single additional line of code makes win32 perl
100...1000 times faster!
(for code that appends to strings very frequently)

I went with Ilya's proposal but inserted the line a little further
down, just after
if (newlen > SvLEN(sv)) { /* need more room? */

So now we have:
if (newlen > SvLEN(sv)) { /* need more room? */
newlen += (newlen >> 2) + 10;
#ifndef Perl_safesysmalloc_size
newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif
if (SvLEN(sv) && s) {
s = (char*)saferealloc(s, newlen);
}

The remaining question is by what ratio a string's memory should grow.
I tried several values from (newlen >> 0) to (newlen >> 6) for the
best compromise between execution time and memory usage and my
personal favorite is (newlen >> 2). What do others here think? At the
end of this post I will attach the results for my benchmark script
starting with Cygwin Perl followed by several versions of (newlen >>
x) and finally the unpatched Strawberry Perl. These reports now also
include memory footprint info (courtesy of pslist from the
Sysinternals suite). I also went back to my original task of reading a
12 MB postscript file using qx(cat ...) and in some cases I also
report times for that -- here Cygwin (70 ms) still beats my modified
perl (210 ms), but that's still waaaaay better than the original 18000
ms :)

I will also report to p5p.

Wolfram

###########################################################

c:\cygwin\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 1.5 ms
1E6 chars + 1E4 x 1E2 chars: 2.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 12.2 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.6 ms
1E7 chars + 1E2 x 1E4 chars: 0.6 ms
1E7 chars + 1E1 x 1E5 chars: 0.8 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.9 ms

Private MB: 326.5
Peak Private MB: 326.5

--------------

qx(cat postscriptfile.ps): 68.7 ms
Private MB: 38.5
Peak Private MB: 38.5

###########################################################

newlen += (newlen >> 0) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 2.2 ms
1E6 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms

1E7 chars + 1E5 x 1E1 chars: 10.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.6 ms
1E7 chars + 1E2 x 1E4 chars: 0.6 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.0 ms

Private MB: 378.3
Peak Private MB: 418.0

--------------

qx(cat postscriptfile.ps): 181.2 ms
Private MB: 25.1
Peak Private MB: 40.3

###########################################################

newlen += (newlen >> 1) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 2.5 ms
1E6 chars + 1E4 x 1E2 chars: 2.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.3 ms

1E7 chars + 1E5 x 1E1 chars: 9.6 ms
1E7 chars + 1E4 x 1E2 chars: 1.3 ms
1E7 chars + 1E3 x 1E3 chars: 0.7 ms
1E7 chars + 1E2 x 1E4 chars: 0.7 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.4 ms

Private MB: 290.2
Peak Private MB: 319.5

###########################################################

newlen += (newlen >> 2) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 9.2 ms
1E6 chars + 1E4 x 1E2 chars: 5.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 9.9 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.5 ms
1E7 chars + 1E2 x 1E4 chars: 0.5 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.4 ms

Private MB: 244.9
Peak Private MB: 270.1

--------------

qx(cat postscriptfile.ps): 209.8 ms
Private MB: 16.2
Peak Private MB: 29.0

###########################################################

newlen += (newlen >> 3) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 12.1 ms
1E6 chars + 1E4 x 1E2 chars: 6.9 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms

1E7 chars + 1E5 x 1E1 chars: 10.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.5 ms
1E7 chars + 1E2 x 1E4 chars: 0.5 ms
1E7 chars + 1E1 x 1E5 chars: 0.5 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.6 ms

Private MB: 221.9
Peak Private MB: 244.3

###########################################################

newlen += (newlen >> 4) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 17.0 ms
1E6 chars + 1E4 x 1E2 chars: 13.8 ms
1E7 chars + 1E4 x 1E2 chars: 11.2 ms

1E7 chars + 1E5 x 1E1 chars: 19.4 ms
1E7 chars + 1E4 x 1E2 chars: 10.1 ms
1E7 chars + 1E3 x 1E3 chars: 10.9 ms
1E7 chars + 1E2 x 1E4 chars: 11.1 ms
1E7 chars + 1E1 x 1E5 chars: 11.0 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.3 ms

Private MB: 219.4
Peak Private MB: 233.8

--------------

qx(cat postscriptfile.ps): 312.0 ms
Private MB: 14.0
Peak Private MB: 25.8

###########################################################

newlen += (newlen >> 6) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 57.7 ms
1E6 chars + 1E4 x 1E2 chars: 59.8 ms
1E7 chars + 1E4 x 1E2 chars: 67.9 ms

1E7 chars + 1E5 x 1E1 chars: 69.4 ms
1E7 chars + 1E4 x 1E2 chars: 71.6 ms
1E7 chars + 1E3 x 1E3 chars: 69.6 ms
1E7 chars + 1E2 x 1E4 chars: 64.8 ms
1E7 chars + 1E1 x 1E5 chars: 53.8 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.7 ms

Private MB: 219.8
Peak Private MB: 230.0

###########################################################

unpatched Strawberry Perl

c:\strawberry\perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 96.2 ms
1E6 chars + 1E4 x 1E2 chars: 325.7 ms
1E7 chars + 1E4 x 1E2 chars: 2655.9 ms

1E7 chars + 1E5 x 1E1 chars: 2687.3 ms
1E7 chars + 1E4 x 1E2 chars: 2687.4 ms
1E7 chars + 1E3 x 1E3 chars: 2656.1 ms
1E7 chars + 1E2 x 1E4 chars: 1093.6 ms
1E7 chars + 1E1 x 1E5 chars: 108.3 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.1 ms

Private MB: 200.4
Peak Private MB: 210.2
 
P

Peter J. Holzer

I went with Ilya's proposal but inserted the line a little further
down, just after
if (newlen > SvLEN(sv)) { /* need more room? */

So now we have:
if (newlen > SvLEN(sv)) { /* need more room? */
newlen += (newlen >> 2) + 10;
#ifndef Perl_safesysmalloc_size
newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif
if (SvLEN(sv) && s) {
s = (char*)saferealloc(s, newlen);
}

The remaining question is by what ratio a string's memory should grow.
I tried several values from (newlen >> 0) to (newlen >> 6) for the
best compromise between execution time and memory usage and my
personal favorite is (newlen >> 2). What do others here think?

That sounds about right. I've used factors between 1.2 and 1.5 in the
past for similar problems. I suggest you base the growth on the old
size, though, something like:

if (newlen > SvLEN(sv)) { /* need more room? */
size_t min = SvLEN(sv) * 5/4 + 10;
if (newlen < min) newlen = min;
...

This gives you the same growth pattern if the increments are small, but
it doesn't allocate extra memory if you append a large chunk.

hp
 
I

Ilya Zakharevich

So now we have:
if (newlen > SvLEN(sv)) { /* need more room? */
newlen += (newlen >> 2) + 10;
#ifndef Perl_safesysmalloc_size
newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif
if (SvLEN(sv) && s) {
s = (char*)saferealloc(s, newlen);
}

I think you make your life too simple. What you must do is find the
chain of events which sets
Perl_safesysmalloc_size/PERL_STRLEN_ROUNDUP, and modify this chain. :-(
I tried several values from (newlen >> 0) to (newlen >> 6) for the
best compromise between execution time and memory usage and my
personal favorite is (newlen >> 2). What do others here think?

My approach is never to take responsibility for such decisions.
Make a default value, and shift responsibility to the user. ;-)

#ifndef PERL_STRLEN_ROUNDUP_SHIFT
# define PERL_STRLEN_ROUNDUP_SHIFT 2
#endif

The suggestion to use the OLD length is also very viable...
1E5 chars + 1E4 x 1E2 chars: 1.5 ms

I hope your `ms' are actually seconds. It does not make sense to
measure performance on runs shorter than a second (maybe more on Win,
which is doing more unknown stuff in background)...

Yours,
Ilya
 
W

Wolfram Humann

I think you make your life too simple.  What you must do is find the
chain of events which sets
Perl_safesysmalloc_size/PERL_STRLEN_ROUNDUP, and modify this chain.  :-(

I'm traveling territory that's fairly unknown to me already, so if
this is too simple I should leave it to someone more knowledgeable to
do it right. What I *think* is, that PERL_STRLEN_ROUNDUP is just
concerned with memory boundary alignment, e.g. roundup to the next
multiple of 4. If this is the case, it should be independent of any
string memory expansion strategy.
My approach is never to take responsibility for such decisions.
Make a default value, and shift responsibility to the user.  ;-)

#ifndef PERL_STRLEN_ROUNDUP_SHIFT
#  define PERL_STRLEN_ROUNDUP_SHIFT 2
#endif

Given that nobody stumbled across the devastating current state, I
don't envision hoards of users trying to optimize this. But I agree
that it serves well as a reminder for someone reading the code that
this is not *the* correct value but a trade-off and could be decided
differently. However, given that IMO it's a different concept from the
existing PERL_STRLEN_ROUNDUP, I would prefer to give it a different
name. How about PERL_STRLEN_EXPAND_SHIFT?
The suggestion to use the OLD length is also very viable...
Agreed.


I hope your `ms' are actually seconds.  It does not make sense to
measure performance on runs shorter than a second (maybe more on Win,
which is doing more unknown stuff in background)...

No, these are milliseconds. Yes, this makes it a lousy benchmark. Yes,
these numbers do vary easily by +-30% and sometimes more from run to
run. I tried to post "average" runs, but that's even more subjective,
of course. However the time differences encountered between different
cases are often a factor of 10 or even much more (in the unmodified
case, many of these do run for several seconds ;-)), so I think this
is a valid base for comparison.

So my current proposal reads like this:

#ifndef PERL_STRLEN_EXPAND_SHIFT
# define PERL_STRLEN_EXPAND_SHIFT 2
#endif

if (newlen > SvLEN(sv)) { /* need more room? */
size_t minlen = SvCUR(sv);
minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
if (newlen < minlen) newlen = minlen;
#ifndef Perl_safesysmalloc_size
newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif
if (SvLEN(sv) && s) {
s = (char*)saferealloc(s, newlen);
}

My benchmark script does run slower now. The previous version did
expand allocated memory beyond minimum requirements during the initial
assignment, so that the reallocation count during append was 0 in many
cases. The new version does not do that so that at least 1 realloc is
required when one starts appending to the string.

Wolfram
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,566
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top