Re: Problem with splitting data

Discussion in 'Perl Misc' started by Peter J. Holzer, Mar 25, 2012.

  1. On 2012-03-25 05:02, Uri Guttman <> wrote:
    >>>>>> "PJH" == Peter J Holzer <> writes:

    >
    > PJH> On 2012-03-21 16:33, Uri Guttman <> wrote:
    >
    > >> my $text = do { local( @ARGV, $/ ) = $filename ; <> } ;
    > >>
    > >> that is the (im)proper idiom for slurping in a file. no open needed as
    > >> it is done by the <> on the values in @ARGV. slow as hell too!

    >
    > PJH> Have you actually benchmarked this in the last 10 years?
    >
    > PJH> On my systems
    > PJH> my $text = do { local( @ARGV, $/ ) = $filename ; <> } ;
    > PJH> and
    > PJH> my $text = read_file($filename);
    > PJH> are almost exactly the same speed for largish files (for very small
    > PJH> files the former is even a bit faster).
    >
    > PJH> However,
    > PJH> read_file($filename, buf_ref => \$text);
    > PJH> is a lot (factor 3-4) faster, since it avoids the extra copy.
    >
    > yes. and that is mentioned in the docs as the fastest style of slurp.


    It is not, however, mentioned in the synopsis.

    I bet most users just use
    my $text = read_file($filename);

    OTOH, performance probably isn't an issue for most users.
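    The styles compared above differ mainly in how often the file contents get copied. Here is a minimal sketch using only core modules. `slurp_into` is a hypothetical helper written for illustration; it is not File::Slurp's `read_file` and only imitates the idea behind the `buf_ref` option:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file so the example is self-contained.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "line one\nline two\n";
close $fh or die "close: $!";

# 1. The @ARGV idiom from the thread: localizing @ARGV and $/ makes <>
#    open $filename itself and return the whole file as one record.
my $text = do { local( @ARGV, $/ ) = $filename ; <> };

# 2. Hypothetical helper (an assumption, not File::Slurp code): fill a
#    caller-supplied buffer through a reference, so the contents are not
#    copied again when the sub returns.
sub slurp_into {
    my ($path, $buf_ref) = @_;
    open my $in, '<', $path or die "open $path: $!";
    local $/;                 # undef record separator: read everything
    $$buf_ref = <$in>;
    close $in;
    return;
}

my $buffer;
slurp_into($filename, \$buffer);

print $text eq $buffer ? "same contents\n" : "different contents\n";
```

    Both variants end up with the same string; the difference being argued about in this thread is only how many intermediate copies are made on the way.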

    > and the benchmark script shows that as well. given that i rewrote the
    > benchmark script last year (to improve the structure, options and
    > such), you know i benchmarked all the slurps recently.


    Your benchmark script doesn't include the case
    $text = do { local( @ARGV, $/ ) = $filename ; <> } ;

    It includes a case
    my $text = orig_slurp_scalar( $file_name )

    where orig_slurp_scalar then calls orig_slurp, which does the above. So
    that adds two function calls and at least one, more likely several extra
    copies (I don't know how scalar returns are implemented in perl).

    I have added this to the end of bench_scalar_slurp and rerun the script:

    direct_slurp_scalar =>
    sub { my $text = do { local( @ARGV, $/ ) = $file_name ; <> } },

    The result is surprising. I would have expected that to be about as fast
    as FS::read_file (because that's what I've seen in my own benchmarks),
    but it's a lot faster, even faster than FS::read_file_buf_ref2:

                               Rate orig_slurp FS::read_file FS::read_file_buf_ref2 direct_slurp_scalar
    file_contents             169/s       -76%          -81%                   -90%                -92%
    file_contents_no_OO       170/s       -75%          -81%                   -90%                -92%
    orig_read_file            560/s       -19%          -39%                   -67%                -73%
    orig_slurp                694/s         --          -24%                   -59%                -66%
    FS12::read_file           907/s        31%           -0%                   -46%                -56%
    FS::read_file             910/s        31%            --                   -46%                -55%
    old_sysread_file          919/s        32%            1%                   -45%                -55%
    FS::read_file_scalar_ref 1047/s        51%           15%                   -37%                -49%
    FS::read_file_buf_ref    1051/s        52%           15%                   -37%                -49%
    old_read_file            1232/s        78%           35%                   -26%                -40%
    FS::read_file_buf_ref2   1673/s       141%           84%                     --                -18%
    direct_slurp_scalar      2043/s       195%          124%                    22%                  --

    (irrelevant columns omitted)

    I wonder if there is a systematic error here ...
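    For anyone who wants to check this locally, here is a rough sketch of the comparison using the core Benchmark module. The names `wrapped_slurp` and `wrapped_slurp_scalar` are hypothetical stand-ins for the wrapper chain described above; the real `orig_slurp` and `orig_slurp_scalar` in the benchmark script may look different:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Scratch file of roughly 1 MB (13108 lines of 80 bytes each).
my ($fh, $file_name) = tempfile(UNLINK => 1);
print {$fh} 'x' x 79, "\n" for 1 .. 13_108;
close $fh or die "close: $!";

# Hypothetical wrappers (assumptions for illustration): the same idiom,
# but reached through two sub calls, each of which returns the string.
sub wrapped_slurp        { local( @ARGV, $/ ) = $_[0] ; <> }
sub wrapped_slurp_scalar { my $text = wrapped_slurp( $_[0] ); return $text }

# Run each variant a fixed number of times and print a comparison table.
my $results = cmpthese( 500, {
    wrapped => sub { my $text = wrapped_slurp_scalar( $file_name ) },
    direct  => sub { my $text = do { local( @ARGV, $/ ) = $file_name ; <> } },
} );
```

    The iteration count of 500 is only there to keep the run short; Benchmark may warn that this is too few for a reliable figure, in which case a larger count or a negative value (minimum CPU seconds) is the usual remedy. The numbers will depend on the perl version and the machine.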

    > PJH> All tests were made with files which were already cached in memory -
    > PJH> when the files have to be read from disk, all differences will probably
    > PJH> be negligible.
    >
    > the benchmark script uses Benchmark.pm and so it runs on the same files
    > many times. if you run the script twice in a row it will almost for sure
    > have the files cached in ram.


    Yes, I know. I just wanted to mention that in real life the files you
    have to read are not always already in memory, but often on disk, which
    is a lot slower. So my benchmarks (like yours) exaggerate the
    differences: if you have to wait for 20 disk seeks, it doesn't matter
    whether you save 1 millisecond or not.

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Mar 25, 2012
    #1

  2. Peter J. Holzer

    Dr.Ruud Guest

    On 2012-03-25 13:25, Peter J. Holzer wrote:

    > direct_slurp_scalar =>
    > sub { my $text = do { local( @ARGV, $/ ) = $file_name ;<> } },


    What is the role of the "my $text = do {...}" wrapper?

    I would expect just:

    direct_slurp_scalar =>
    sub { local( @ARGV, $/ ) = $file_name; <> },

    --
    Ruud
    Dr.Ruud, Mar 25, 2012
    #2

  3. "Dr.Ruud" <> writes:
    > On 2012-03-25 13:25, Peter J. Holzer wrote:
    >
    >> direct_slurp_scalar =>
    >> sub { my $text = do { local( @ARGV, $/ ) = $file_name ;<> } },

    >
    > What is the role of the "my $text = do {...}" wrapper?


    Make the code appear more complicated than it actually is.
    Rainer Weikusat, Mar 25, 2012
    #3
  4. On 2012-03-25 13:13, Dr.Ruud <> wrote:
    > On 2012-03-25 13:25, Peter J. Holzer wrote:
    >> direct_slurp_scalar =>
    >> sub { my $text = do { local( @ARGV, $/ ) = $file_name ;<> } },

    >
    > What is the role of the "my $text = do {...}" wrapper?
    >
    > I would expect just:
    >
    > direct_slurp_scalar =>
    > sub { local( @ARGV, $/ ) = $file_name; <> },


    All the other benchmarks assign the result to a variable. So I
    have to do that here, too, to make the results comparable.

    There are various ways in which the assignments can happen, so it makes
    sense to benchmark the effect of those ways. Just throwing away the
    result doesn't make much sense, however.

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Mar 25, 2012
    #4
  5. Peter J. Holzer

    Uri Guttman Guest

    >>>>> "PJH" == Peter J Holzer <> writes:

    PJH> Your benchmark script doesn't include the case
    PJH> $text = do { local( @ARGV, $/ ) = $filename ; <> } ;

    PJH> It includes a case
    PJH> my $text = orig_slurp_scalar( $file_name )

    PJH> where orig_slurp_scalar then calls orig_slurp, which does the above. So
    PJH> that adds two function calls and at least one, more likely several extra
    PJH> copies (I don't know how scalar returns are implemented in perl).

    true. i didn't account for the overhead in the extra sub calls.

    PJH> I have added this to the end of bench_scalar_slurp and rerun the script:

    PJH> direct_slurp_scalar =>
    PJH> sub { my $text = do { local( @ARGV, $/ ) = $file_name ; <> } },

    PJH> The result is surprising. I would have expected that to be about as fast
    PJH> as FS::read_file (because that's what I've seen in my own benchmarks),
    PJH> but it's a lot faster, even faster than FS::read_file_buf_ref2:

    what size file are you testing? the script has the option of selecting
    multiple file sizes. slurp's speed wins more for larger files as it has
    less overhead (much of that is in arg processing and error checking).

    PJH>                            Rate orig_slurp FS::read_file FS::read_file_buf_ref2 direct_slurp_scalar
    PJH> file_contents             169/s       -76%          -81%                   -90%                -92%
    PJH> file_contents_no_OO       170/s       -75%          -81%                   -90%                -92%
    PJH> orig_read_file            560/s       -19%          -39%                   -67%                -73%
    PJH> orig_slurp                694/s         --          -24%                   -59%                -66%
    PJH> FS12::read_file           907/s        31%           -0%                   -46%                -56%
    PJH> FS::read_file             910/s        31%            --                   -46%                -55%
    PJH> old_sysread_file          919/s        32%            1%                   -45%                -55%
    PJH> FS::read_file_scalar_ref 1047/s        51%           15%                   -37%                -49%
    PJH> FS::read_file_buf_ref    1051/s        52%           15%                   -37%                -49%
    PJH> old_read_file            1232/s        78%           35%                   -26%                -40%
    PJH> FS::read_file_buf_ref2   1673/s       141%           84%                     --                -18%
    PJH> direct_slurp_scalar      2043/s       195%          124%                    22%                  --

    i wouldn't call that much faster. also as i said, file sizes matter
    too. and perl could have improved the guts of <> since i first wrote
    that (it needed it badly). even so, it is such a fugly idiom that i
    would never teach it.

    PJH> I wonder if there is a systematic error here ...

    PJH> All tests were made with files which were already cached in memory -
    PJH> when the files have to be read from disk, all differences will probably
    PJH> be negligible.

    not exactly as requesting larger reads is still faster than what stdio
    would do. but sure, disk is much slower than ram as we all know.

    when i get to the next version (maybe in a couple of weeks) i will add
    your entry to the benchmark. i have a couple of other minor fixes to
    make.

    uri
    Uri Guttman, Mar 26, 2012
    #5
  6. On 2012-03-26 00:21, Uri Guttman <> wrote:
    >>>>>> "PJH" == Peter J Holzer <> writes:

    >
    > PJH> Your benchmark script doesn't include the case
    > PJH> $text = do { local( @ARGV, $/ ) = $filename ; <> } ;
    >
    > PJH> It includes a case
    > PJH> my $text = orig_slurp_scalar( $file_name )
    >
    > PJH> where orig_slurp_scalar then calls orig_slurp, which does the above. So
    > PJH> that adds two function calls and at least one, more likely several extra
    > PJH> copies (I don't know how scalar returns are implemented in perl).
    >
    > true. i didn't account for the overhead in the extra sub calls.
    >
    > PJH> I have added this to the end of bench_scalar_slurp and rerun the script:
    >
    > PJH> direct_slurp_scalar =>
    > PJH> sub { my $text = do { local( @ARGV, $/ ) = $file_name ; <> } },
    >
    > PJH> The result is surprising. I would have expected that to be about as fast
    > PJH> as FS::read_file (because that's what I've seen in my own benchmarks),
    > PJH> but it's a lot faster, even faster than FS::read_file_buf_ref2:
    >
    > what size file are you testing?


    Sorry, I accidentally deleted that line. These times are from the 1MB
    scalar read test case (on a 3GHz Core2).

    For the smaller sizes (512B, 10kB) orig_slurp is *faster* than
    FS::read_file, and direct_slurp_scalar is faster still, but
    old_sysread_file beats them all ;-).

    > PJH>                            Rate orig_slurp FS::read_file FS::read_file_buf_ref2 direct_slurp_scalar
    > PJH> file_contents             169/s       -76%          -81%                   -90%                -92%
    > PJH> file_contents_no_OO       170/s       -75%          -81%                   -90%                -92%
    > PJH> orig_read_file            560/s       -19%          -39%                   -67%                -73%
    > PJH> orig_slurp                694/s         --          -24%                   -59%                -66%
    > PJH> FS12::read_file           907/s        31%           -0%                   -46%                -56%
    > PJH> FS::read_file             910/s        31%            --                   -46%                -55%
    > PJH> old_sysread_file          919/s        32%            1%                   -45%                -55%
    > PJH> FS::read_file_scalar_ref 1047/s        51%           15%                   -37%                -49%
    > PJH> FS::read_file_buf_ref    1051/s        52%           15%                   -37%                -49%
    > PJH> old_read_file            1232/s        78%           35%                   -26%                -40%
    > PJH> FS::read_file_buf_ref2   1673/s       141%           84%                     --                -18%
    > PJH> direct_slurp_scalar      2043/s       195%          124%                    22%                  --
    >
    > i wouldn't call that much faster.


    Well, you called orig_slurp "slow as hell", but FS::read_file is only
    31% faster, while direct_slurp_scalar is 124% faster than FS::read_file.
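    Those percentages follow directly from the Rate column. As a small worked check, with the three rates copied from the 1MB table above (the helper name `percent_faster` is made up for this sketch):

```perl
use strict;
use warnings;

# Rates (runs per second) copied from the 1 MB results table in this thread.
my %rate = (
    'orig_slurp'          => 694,
    'FS::read_file'       => 910,
    'direct_slurp_scalar' => 2043,
);

# Hypothetical helper: by how many percent does $fast exceed $slow?
sub percent_faster {
    my ($fast, $slow) = @_;
    return ( $rate{$fast} / $rate{$slow} - 1 ) * 100;
}

printf "FS::read_file vs orig_slurp:          %.1f%%\n",
    percent_faster( 'FS::read_file', 'orig_slurp' );
printf "direct_slurp_scalar vs FS::read_file: %.1f%%\n",
    percent_faster( 'direct_slurp_scalar', 'FS::read_file' );
```

    The results can differ from the table in the last digit, because the table is computed from the unrounded rates while this sketch starts from the rounded ones.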


    > also as i said, file sizes matter too.


    Yes, of course.


    > and perl could have improved the guts of <> since i first wrote that
    > (it needed it badly).


    That's why I asked whether you had repeated your benchmarks in the last
    ten years. Perl I/O has been significantly revamped for 5.8.x and it
    hasn't used stdio by default for a long time (it's still available as a
    compile time option I think). Oh and the last time we had this
    discussion (about 2 years ago) you quoted benchmark results from a 300
    MHz SPARC (IIRC), which wasn't exactly bleeding edge at the time.


    > even so, it is such a fugly idiom that i would never teach it.


    That I agree with.


    > PJH> I wonder if there is a systematic error here ...
    >
    > PJH> All tests were made with files which were already cached in memory -
    > PJH> when the files have to be read from disk, all differences will probably
    > PJH> be negligible.
    >
    > not exactly as requesting larger reads is still faster than what stdio
    > would do.


    Even stdio is much faster than disk and has been for a long time (at
    least on Linux). A CPU can burn an awful lot of cycles while waiting for
    the next block. And perl doesn't use stdio anyway.

    hp


    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
    Peter J. Holzer, Mar 26, 2012
    #6
