Speed of reading some MB of data using qx(...)


Wolfram Humann

I have a program that processes PDF files by converting them to
Postscript, reading the ps and doing something with it. I use pdftops
(from xpdf) for the pdf->ps conversion and retrieve the result like this:

$ps_text = qx( pdftops $infile - );

On win32 using strawberry perl (tried 5.10 and 5.12) this takes much
more time than I expected, so I did a test: I first converted the PDF
to Postscript, then read the Postscript (about 12 MB) like this (cat
on win32 provided by cygwin):

perl -E" $t = qx(cat psfile.ps); say length $t "

This takes about 16 seconds on win32 but less than 1 second on Linux.
I was afraid that this might be a 'binmode' problem, so I also tried
this:

perl -E" open $in,'cat psfile.ps |'; binmode $in; local $/; $t=<$in>;
say length $t "

But the effect is the same: fast on linux, slow on win32. Besides
bashing win32 :), any ideas for the reason and (possibly) a cure?

Wolfram
 

Uri Guttman

WH> perl -E" $t = qx(cat psfile.ps); say length $t "

WH> This takes about 16 seconds on win32 but less than 1 second on Linux. I
WH> was afraid that this might be a 'binmode' problem so I also tried
WH> this:

WH> perl -E" open $in,'cat psfile.ps |'; binmode $in; local $/; $t=<$in>;
WH> say length $t "

WH> But the effect is the same: fast on linux, slow on win32. Besides
WH> bashing win32 :), any ideas for the reason and (possibly) a cure?

you also bashed File::Slurp in a bug report you sent me. obviously it is
a winblows issue. one possibility is that winblows does poor process
context switching and piping between processes causes lots of
that. there may be other reasons but i stay away from redmond.

uri
 

Ilya Zakharevich

WH> But the effect is the same: fast on linux, slow on win32. Besides
WH> bashing win32 :), any ideas for the reason and (possibly) a cure?

you also bashed File::Slurp in a bug report you sent me. obviously it is
a winblows issue. one possibility is that winblows does poor process
context switching and piping between processes causes lots of
that. there may be other reasons but i stay away from redmond.

Disable your network adapters, and switch off virus checking. Recheck
speed.

Hope this helps,
Ilya
 

Wolfram Humann

  WH> perl -E" $t = qx(cat psfile.ps); say length $t "

  WH> This takes about 16 seconds on win32 but less than 1 second on Linux. I
  WH> was afraid that this might be a 'binmode' problem so I also tried
  WH> this:

  WH> perl -E" open $in,'cat psfile.ps |'; binmode $in; local $/; $t=<$in>;
  WH> say length $t "

  WH> But the effect is the same: fast on linux, slow on win32. Besides
  WH> bashing win32 :), any ideas for the reason and (possibly) a cure?

you also bashed File::Slurp in a bug report you sent me. obviously it is
a winblows issue. one possibility is that winblows does poor process
context switching and piping between processes causes lots of
that. there may be other reasons but i stay away from redmond.

uri

Uri,
it was certainly not my intention to bash File::Slurp in my bug report
-- and when I re-read what I wrote, I don't think I did. If I did
bash, I am sorry for it (and would be grateful for an explanation from
your side). I'm not a native speaker; maybe things came out differently
from what I intended.
Besides that, I think the two issues (my bug report and my post here)
are unrelated. It looks like File::Slurp needs to do substitutions on
win32. I was not aware of that as I do not seem to need them when I
read line-by-line on win32. Those substitutions naturally suffer from
the use of $& elsewhere in the program. I think it would be o.k. to
mention this briefly in the docs for File::Slurp. This is a special
case and you might disagree that it should be in the docs. In that
case, just reject and close my bug report.

Regards,
Wolfram
 

Uri Guttman

WH> it was certainly not my intention to bash File::Slurp in my bug report
WH> -- and when I re-read what I wrote, I don't think I did. If I did
WH> bash, I am sorry for it (and would be grateful for an explanation from
WH> your side). I'm not a native speaker; maybe things came out differently
WH> from what I intended.
WH> Besides that, I think the two issues (my bug report and my post here)
WH> are unrelated. It looks like File::Slurp needs to do substitutions on
WH> win32. I was not aware of that as I do not seem to need them when I
WH> read line-by-line on win32. Those substitutions naturally suffer from
WH> the use of $& elsewhere in the program. I think it would be o.k. to
WH> mention this briefly in the docs for File::Slurp. This is a special
WH> case and you might disagree that it should be in the docs. In that
WH> case, just reject and close my bug report.

i don't think i should worry about someone else using $& in their
code. most modules use s/// with grabbing and will suffer for the same
reason. this is a silly idea to warn about something i have no control
over.

but you showed with your slurp code test that it was slower on winblows
without my module. so telling me to warn about using $& is wrong as it
isn't the reason for the slowdown. winblows doesn't do forks/pipe well
since it doesn't like process context switching. they push threads which
suck for other reasons. on most unix flavors process and thread context
switching are about equally fast.

uri
 

Wolfram Humann

Disable your network adapters, and switch off virus checking.  Recheck
speed.

Hope this helps,
Ilya

Ilya,
I tried that and it makes no difference. Also, when enabled, I don't
see any CPU usage in the virus check process -- it all goes to
perl.exe.
I also tried things using cygwin perl on the same win32 pc: cygwin
perl also runs in less than 1 second ...

Thanks for the suggestion,
Wolfram
 

Wolfram Humann

Win32's pipes are *really* slow. Write it to a temporary and then read
the file normally in perl.

After further experiments I am now convinced that pipes are not the
bottleneck (even on win32, piping 10 MB can't take 20 seconds...). The
problem seems to be Strawberry's memory management for long strings. I
need to do some more benchmarking and will report this issue
separately.
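
For example, one way to separate the two effects (a sketch, not part of
my original test; psfile.ps is the same 12 MB file as above) is to read
the pipe in fixed-size chunks and join them once at the end, so the
result string is allocated in one go instead of being grown append by
append:

use strict;
use warnings;

# Read the pipe in 64 KB chunks; build the result string only once.
open my $in, 'cat psfile.ps |' or die "open: $!";
binmode $in;
my @chunks;
while (read($in, my $buf, 65536)) {
    push @chunks, $buf;
}
close $in;
my $t = join '', @chunks;
print length($t), "\n";

If this runs fast while the qx() version stays slow, the pipe itself is
off the hook.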

Does anybody know of the appropriate place to report Strawberry Perl
specific bugs?

Wolfram
 

Wolfram Humann

I seriously doubt the issue is with Strawberry specifically; almost
certainly any issue applies to Win32 perl in general and should be reported
to p5p with perlbug. If you can confirm it is specific to Strawberry
(so, e.g., a self-compiled mingw perl *doesn't* have the problem) then I
think the correct place is the Perl::Dist::Strawberry queue on
rt.cpan.org (mail (e-mail address removed)).

(IIRC malloc on Win32 is *also* known to be deadly slow, and also IIRC
it's impossible to use perl's malloc without breaking things...)

Ben

Thanks for the pointers. My comparison is Strawberry Perl (5.10 and
5.12) against Cygwin Perl on the same machine. The latter (as well as
Perl on Linux) doesn't have the issues I see. Is that a sufficient
"proof" for the issues being Strawberry specific?

Wolfram
 

Peter J. Holzer

Thanks for the pointers. My comparison is Strawberry Perl (5.10 and
5.12) against Cygwin Perl on the same machine. The latter (as well as
Perl on Linux) doesn't have the issues I see. Is that a sufficient
"proof" for the issues being Strawberry specific?

I vaguely remember that ActiveState Perl has similar issues. I think Ben
is correct that this is a problem with Win32 malloc and will affect any
Perl build which uses the Win32 malloc implementation (Cygwin probably
provides its own malloc implementation). A fix which works on both
ActiveState and Strawberry would certainly be preferable to a
Strawberry-specific one.

hp
 

Peter J. Holzer

If the issue simply turns out to be 'Microsoft don't know how to write a
decent malloc', there is very little p5p can do about it, of course. On
most platforms perl can, and often does, use its own malloc
implementation which is optimised for perl's use (lots of tiny
allocations and deallocations all the time). This isn't possible on
Win32 unless you make a custom build of perl that doesn't support the
fork emulation.

Since the fork emulation works with Win32 malloc, I think it should be
possible to write a custom malloc based on Win32 malloc (or the
underlying API calls) which still works with the fork emulation but is
faster. But it's probably not easy or somebody would have done it
already (I don't pretend to understand either memory allocation or the
fork emulation on windows).

hp
 

Wolfram Humann

No. As far as Perl is concerned, Cygwin is a separate OS. A fair
comparison would be with ActiveState or (as I said) with a Win32 perl
you've compiled yourself.

Oh dear, you're right: ActiveState Perl is just as bad as Strawberry.
Here's my test case (I always append a number of chunks with a total
size of 1E6 chars to an existing string, but the start size of the
existing string and the chunk size vary):


use strict;
use warnings;
use Time::HiRes qw(time);

my $c1E1 = '#' x 1E1;
my $c1E2 = '#' x 1E2;
my $c1E3 = '#' x 1E3;
my $c1E4 = '#' x 1E4;
my $c1E5 = '#' x 1E5;

my $str1 = '#' x 1E5;
my $str2 = '#' x 1E6;
my $str3 = '#' x 1E7;

my $str4 = '#' x 1E7;
my $str5 = '#' x 1E7;
my $str6 = '#' x 1E7;
my $str7 = '#' x 1E7;
my $str8 = '#' x 1E7;

my $str9 = '#' x 2E7;
$str9 = '#' x 1E7;

my @ar1 = map { $c1E2 } 1..1E5;

my @c = (
    '1E5 chars + 1E4 x 1E2 chars' => sub { $str1 .= $c1E2 for 1..1E4 },
    '1E6 chars + 1E4 x 1E2 chars' => sub { $str2 .= $c1E2 for 1..1E4 },
    '1E7 chars + 1E4 x 1E2 chars' => sub { $str3 .= $c1E2 for 1..1E4 },
    '',
    '1E7 chars + 1E5 x 1E1 chars' => sub { $str4 .= $c1E1 for 1..1E5 },
    '1E7 chars + 1E4 x 1E2 chars' => sub { $str5 .= $c1E2 for 1..1E4 },
    '1E7 chars + 1E3 x 1E3 chars' => sub { $str6 .= $c1E3 for 1..1E3 },
    '1E7 chars + 1E2 x 1E4 chars' => sub { $str7 .= $c1E4 for 1..1E2 },
    '1E7 chars + 1E1 x 1E5 chars' => sub { $str8 .= $c1E5 for 1..1E1 },
    '',
    '1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars' => sub { $str9 .= $c1E2 for 1..1E4 },
    '1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars '  => sub { push @ar1, $c1E2 for 1..1E4 },
);

while (@c) {
    my $name = shift @c;
    print("\n"), next unless $name;
    my $code = shift @c;
    my $t1 = time; &$code; my $t2 = time;
    printf "%s: %6.1f ms\n", $name, 1000 * ($t2 - $t1);
}

##########################################################
And these are the results:

c:\cygwin\bin\perl LongStrings.pl
1E5 chars + 1E4 x 1E2 chars: 1.6 ms
1E6 chars + 1E4 x 1E2 chars: 2.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 11.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms
1E7 chars + 1E3 x 1E3 chars: 0.9 ms
1E7 chars + 1E2 x 1E4 chars: 1.0 ms
1E7 chars + 1E1 x 1E5 chars: 0.9 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.5 ms

##########################################################
c:\strawberry\perl\bin\perl LongStrings.pl
1E5 chars + 1E4 x 1E2 chars: 94.4 ms
1E6 chars + 1E4 x 1E2 chars: 319.9 ms
1E7 chars + 1E4 x 1E2 chars: 2710.4 ms

1E7 chars + 1E5 x 1E1 chars: 2656.0 ms
1E7 chars + 1E4 x 1E2 chars: 2656.1 ms
1E7 chars + 1E3 x 1E3 chars: 2609.1 ms
1E7 chars + 1E2 x 1E4 chars: 1109.1 ms
1E7 chars + 1E1 x 1E5 chars: 118.3 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.5 ms

I compared Strawberry and ActiveState on another machine: the times
are close to each other but even longer than the ones above due to the
older hardware.

Wolfram
 

Ilya Zakharevich

Since the fork emulation works with Win32 malloc, I think it should be
possible to write a custom malloc based on Win32 malloc (or the
underlying API calls) which still works with the fork emulation but is
faster. But it's probably not easy or somebody would have done it
already (I don't pretend to understand either memory allocation or the
fork emulation on windows).

"My" malloc() (one shipped with Perl) needs only 1 syscall implemented
on a particular architecture: get some pages from the system. So the
macro should be defined on the command line of $(CC) -c malloc.c.

Hope this helps,
Ilya
 

Wolfram Humann

I do understand how an inefficient malloc would explain the chunk-size
dependency in the code. What I do not understand is the dependency on
the initial size of the string that I append to. Does Perl re-allocate
the complete string (at least from time to time) when I keep appending
to it? Or could there be other effects involved besides malloc speed?
(Ideally I would profile Strawberry Perl on the C level but I don't
know how to do that...)

Wolfram
 

Uri Guttman

WH> I do understand how an inefficient malloc would explain the chunk-size
WH> dependency in the code. What I do not understand is the dependency on
WH> the initial size of the string that I append to. Does Perl re-allocate
WH> the complete string (at least from time to time) when I keep appending
WH> to it? Or could there be other effects involved besides malloc speed?
WH> (Ideally I would profile Strawberry Perl on the C level but I don't
WH> know how to do that...)

perl does a realloc (or its equiv) on strings which grow too big. i
believe it does a doubling each time (same as for hash buckets). if so,
a large initial size will affect things as it will mean fewer malloc
calls as it grows. if malloc is the bottleneck, this will show up.
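
one way to watch this from the outside (an untested sketch; Devel::Peek
is a core module, and the LEN field in its Dump output is the allocated
buffer size) is:

perl -MDevel::Peek -e" $s = ''; for (1..5) { $s .= '#' x 100; Dump($s) } "

compare how LEN jumps between appends on linux and on win32.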

uri
 

Wolfram Humann

  WH> I do understand how an inefficient malloc would explain the chunk-size
  WH> dependency in the code. What I do not understand is the dependency on
  WH> the initial size of the string that I append to. Does Perl re-allocate
  WH> the complete string (at least from time to time) when I keep appending
  WH> to it? Or could there be other effects involved besides malloc speed?
  WH> (Ideally I would profile Strawberry Perl on the C level but I don't
  WH> know how to do that...)

perl does a realloc (or its equiv) on strings which grow too big. i
believe it does a doubling each time (same as for hash buckets). if so,
a large initial size will affect things as it will mean fewer malloc
calls as it grows. if malloc is the bottleneck, this will show up.

Unfortunately, that makes things even more mysterious (at least for my
understanding):
1E5 chars + 1E4 x 1E2 chars: 94.4 ms
1E6 chars + 1E4 x 1E2 chars: 319.9 ms
1E7 chars + 1E4 x 1E2 chars: 2710.4 ms
While appending 1E6 chars (= 1E4 chunks of 1E2 chars each) to an
initial string size of 1E5 might require 3-4 reallocs, appending 1E6
chars to an initial string size of 1E7 chars involves at most 1
realloc. And (even though not measured in my sample code above) the
from-scratch en-bloc allocation of a 2E7-char string is reasonably
fast in Strawberry Perl.

Wolfram
 

Wolfram Humann

Alright, I did some profiling and code-reading, and what I found is
something that I would consider a bug, or at least fairly poor coding
practice, in the core.
Opinions very welcome!

In Strawberry almost the entire time is spent in the following call
sequence:

Perl_sv_catpvn_flags -> Perl_sv_grow -> Perl_safesysrealloc -> realloc (msvcrt)

Perl_sv_catpvn_flags (in sv.c) is documented as "Concatenates the
string onto the end of the string which is in the SV". That's what my
code does all the time. So far so good.
Perl_sv_catpvn_flags *always* calls SvGROW -> Perl_sv_grow (also in
sv.c). Perl_sv_grow then needs to decide if the string's memory is
already sufficient or really needs to grow. In the latter case,
safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
interesting point is: how much memory does it request? The answer is:
newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */

I.e. on top of what the current append operation needs, it requests
only 10 times the size of that append. So when I loop 10000 times and
each time append 100 chars to an initial string size of 10 million,
the memory grows from 10.000e6 to 10.001e6 to 10.002e6 and so on, a
thousand times, till it ends at 11.000e6. I can sort of confirm this
by looking at the memory graph in Process Explorer: it grows smoothly
(no discernible steps), becoming incrementally slower towards the end
(because the amount of memory that needs to be copied for each realloc
increases).
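
To put numbers on that, here is a toy simulation of the policy above
(just my reading of sv.c, so treat the output as an estimate):

use strict;
use warnings;

# Simulate appending 1E4 chunks of 1E2 chars to a 1E7-char string and
# count how often realloc fires and how much it may have to copy.
my ($len, $alloc) = (1e7, 1e7);
my ($reallocs, $copied) = (0, 0);

for (1 .. 1e4) {
    my $newlen = $len + 1e2;
    if ($newlen > $alloc) {
        $alloc = $newlen + 10 * ($newlen - $len);  # the line from sv.c
        $reallocs++;
        $copied += $len;  # worst case: realloc moves the whole string
    }
    $len = $newlen;
}
printf "%d reallocs, ~%.1f GB copied\n", $reallocs, $copied / 1e9;

With these numbers that is roughly 900 reallocs and 9 GB of copying --
quite enough to explain the 2.7 s measured above.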

Growing memory in such tiny increments is what I consider bad
practice.

By the way: I estimate the time required for each realloc to be around
3 ms for 10e6 chars, growing linearly with the amount of data (about a
thousand such reallocs account for the 2.7 s measured above) -- I
consider that a fair speed and no reason to blame win32.

What happens in Cygwin? The stack-sampling profiler is of little help
because it easily misses infrequent events. I would expect that
Perl_sv_grow is called just as often as in Strawberry Perl. The
difference is that safesysrealloc does not call Perl_safesysrealloc ->
realloc; it calls Perl_realloc. And Perl_realloc (in malloc.c) seems
to have its own logic (something with '<<' and 'LOG' and 'pow' which
I did not try to fully understand) to determine what amount of memory
it finally allocates. When I add some sleep() to the string-append
process, I can see how memory grows in Process Explorer: there are 5
steps (probably corresponding to 5 calls to Perl_realloc) of growing
size when I start with 0.1e6 chars and then grow to 1.1e6 chars. When
I start with 10e6 chars and grow to 11e6 chars, there is just 1 step
in memory size. This looks like a sensible memory growth strategy to
me. It explains why Cygwin is several hundred times faster than
Strawberry Perl. It also explains why I observed during my experiments
that Cygwin Perl consistently needs more memory than Strawberry Perl --
but that's a small price to pay for such a dramatic speedup.

Wolfram
 

Wolfram Humann

Would you be able to repeat these tests with 5.12.0: that is, build
(under Cygwin, if you don't have access to a Unix system) matched perls
configured with and without -Dusemymalloc, and run the test on both?
I'll try and do the same here, but I can't promise I'll have time. If
the slowdown still exists in 5.12, I think you have a good case for a
bug report. I'm not sure how possible it would be to fix, but it would
clearly be a big win under some circumstances to be able to build Win32
perl with perl's malloc.

I do have a linux machine and I did compile my own perl there, so I
think I could redo that (possibly easier than recompiling Perl for
Cygwin). The strange thing is that perl -V on Cygwin says
'usemymalloc=y' while the one on Linux says 'usemymalloc=n'. And on
Linux my benchmark runs everything under 12 ms. Are you certain
changing usemymalloc would have much effect there?


What I would much more *like* to try is to recompile a perl (e.g.
strawberry perl) on win32 and replace

newlen += 10 * (newlen - SvCUR(sv));

with something like

newlen += 10 * (newlen - SvCUR(sv)) + 0.5 * SvCUR(sv);

(with the factor reasonably somewhere between 0.2 and 1),
but a quick attempt to follow http://perldoc.perl.org/perlwin32.html
was not successful :(
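
In the meantime, a pure-Perl workaround seems possible without
rebuilding anything: pre-extend the string once before the append loop,
the way the $str9 case in my benchmark does. A minimal sketch (@chunks
is a placeholder; it assumes perl keeps the allocation when a shorter
value is assigned, which the 1.2 ms result for $str9 above suggests):

my $text = '#' x 12e6;      # force a buffer big enough for the result
$text = '';                 # empty the string; the buffer is (hopefully) kept
$text .= $_ for @chunks;    # appends now fit without any realloc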

Wolfram
 

jl_post

I have a program that processes PDF files by converting them to
Postscript, reading the ps and doing something with it. I use pdftops
(from xpdf) for the pdf->ps conversion and retrieve the result like this:

$ps_text = qx( pdftops $infile - );

On win32 using strawberry perl (tried 5.10 and 5.12) this takes much
more time than I expected, so I did a test: I first converted the PDF
to Postscript, then read the Postscript (about 12 MB) like this (cat
on win32 provided by cygwin):

perl -E" $t = qx(cat psfile.ps); say length $t "

This takes about 16 seconds on win32 but less than 1 second on Linux.


Dear Wolfram,

I've encountered a similar problem on Strawberry Perl before.

I'm curious: Could you try "pre-allocating" the needed space to
$ps_text (or $t) before you set it? For example, try this:

perl -E "$t = ' ' x (-s 'psfile.ps'); $t = qx(cat psfile.ps); say length $t"

See if that helps. I've found that setting my variable to the
target length BEFORE it's set to the proper string can reduce time
significantly (when it is eventually set to its target value). I'm
not sure why this is so, but I'd guess it's because it avoids the
time-consuming process of "growing" the string a little at a time.

I hope this helps,

-- Jean-Luc
 

Peter J. Holzer

No. It's possible that glibc's malloc already behaves the way perl is
expecting it to, so using perl's malloc doesn't change the performance
much.

I'm pretty sure that GNU malloc doesn't round up to powers of two or
something like that. However, the performance difference between GNU
malloc and Perl malloc is rather small:

perl 5.12.1, default config, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 3.9 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 4.4 ms

1E7 chars + 1E5 x 1E1 chars: 28.4 ms
1E7 chars + 1E4 x 1E2 chars: 4.5 ms
1E7 chars + 1E3 x 1E3 chars: 2.6 ms
1E7 chars + 1E2 x 1E4 chars: 2.0 ms
1E7 chars + 1E1 x 1E5 chars: 1.9 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 2.0 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 4.4 ms

perl 5.12.1, usemymalloc=y, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 2.6 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms

1E7 chars + 1E5 x 1E1 chars: 18.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms
1E7 chars + 1E3 x 1E3 chars: 0.9 ms
1E7 chars + 1E2 x 1E4 chars: 0.9 ms
1E7 chars + 1E1 x 1E5 chars: 1.1 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.9 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 3.4 ms

That may be accidental, though: strace output (with GNU malloc) shows
that very few reallocations actually result in a different address --
mostly the allocated area can be grown because there is nothing after
it. This is even true if two strings grow in parallel: each time one of
the strings moves, it leaves a hole which the other string can grow into,
so in practice this works like a binary backoff. I guess there are
allocation patterns which spoil this effect (e.g. if you allocate lots
of small objects while growing large strings) but I haven't tried to
find them.

hp
 

Ilya Zakharevich

sv.c). Perl_sv_grow then needs to decide if the string's memory is
already sufficient or really needs to grow. In the latter case,
safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
interesting point is: how much memory does it request? The answer is:

newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */

I.e. it requests 10 times as much memory as is required for the
current append operation. So when I loop 10000 times and each time
append 100 chars to an initial string size of 10 million, the memory
grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
it ends at 11.000e6.

Good l*rd!

The current algorithm is optimized to work in tandem with "my"
malloc() (the one shipped with Perl), which would round up to a certain
geometric progression anyway. So if one uses a different malloc(), one
had better use

newlen += (newlen >> 4) + 10; /* avoid copy each time */

Thanks for the analysis (and report this on p5p)!

Yours,
Ilya

P.S. And the current algorithm is a disaster if you try to append a
string of length 200e6 to a short string...
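
For what it's worth, plugging Ilya's suggested line into the toy
simulation from earlier in the thread (same caveats: it is only a model
of the sv.c policy) shows the difference:

use strict;
use warnings;

# Same workload as before: append 1E4 chunks of 1E2 chars to a
# 1E7-char string, but grow geometrically, as suggested above.
my ($len, $alloc, $reallocs) = (1e7, 1e7, 0);
for (1 .. 1e4) {
    my $newlen = $len + 1e2;
    if ($newlen > $alloc) {
        $alloc = $newlen + (int($newlen) >> 4) + 10;  # suggested policy
        $reallocs++;
    }
    $len = $newlen;
}
print "$reallocs reallocs\n";  # prints 2, instead of roughly 900 before

Two reallocs instead of roughly nine hundred, at the cost of
over-allocating by at most about 6 percent.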
 
