is it possible to efficiently read a large file?


Mark Seger

I'm trying to read a 3GB file efficiently. If I do it with a benchmarking
tool, I use about 6-8% of the cpu and can read it in about 44 seconds -
obviously the time is very closely tied to the type of disk, but I'm
including that for reference.

When I do the same thing in perl using:

$reclen=1024*128;
while ($bytes=sysread(FILE, $buffer, $reclen))
{
$total+=$bytes;
}

it takes just under 60 seconds and uses 25-30% of the cpu. I'm sure
there is a lot of data movement between buffers and am wondering if
there is some way to avoid it. I'm guessing that perl may be creating a
new instance of $buffer on every pass through the loop, and if so that
would involve a malloc() and free() on every pass, which I'd like to
avoid if possible.

-mark
 

John Bokma

Mark Seger said:
I'm trying to read a 3GB file efficiently. If I do it with a benchmarking
tool, I use about 6-8% of the cpu and can read it in about 44 seconds -
obviously the time is very closely tied to the type of disk, but I'm
including that for reference.

When I do the same thing in perl using:

$reclen=1024*128;
while ($bytes=sysread(FILE, $buffer, $reclen))
{
$total+=$bytes;
}

it takes just under 60 seconds and uses 25-30% of the cpu. I'm sure
there is a lot of data movement between buffers and am wondering if
there is some way to avoid it. I'm guessing that perl may be creating a
new instance of $buffer on every pass through the loop, and if so that
would involve a malloc() and free() on every pass, which I'd like to
avoid if possible.

you might want to increase reclen, since 128 kbytes sounds like an
extremely small buffer to me.
 

John W. Krahn

Mark said:
I'm trying to read a 3GB file efficiently. If I do it with a benchmarking
tool, I use about 6-8% of the cpu and can read it in about 44 seconds -
obviously the time is very closely tied to the type of disk, but I'm
including that for reference.

When I do the same thing in perl using:

$reclen=1024*128;
while ($bytes=sysread(FILE, $buffer, $reclen))
{
$total+=$bytes;
}

it takes just under 60 seconds and uses 25-30% of the cpu. I'm sure
there is a lot of data movement between buffers and am wondering if
there is some way to avoid it. I'm guessing that perl may be creating a
new instance of $buffer on every pass through the loop, and if so that
would involve a malloc() and free() on every pass, which I'd like to
avoid if possible.

Are you using open() or sysopen() to open the file? sysread() "bypasses
buffered IO" but your $reclen may be too large (or too small) for efficient
IO. Your example appears to use $main::buffer, which means that the same
variable is used for each read; however, I don't know whether Perl reallocates
memory for each read. You could use something like strace(1) to determine
exactly what system calls the program is making.
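
For example, something along these lines (assuming a Linux box with strace
installed; "readfile.pl" is just a stand-in name for the script above) would
show every read() the program issues, how long each one took, and a
per-syscall summary:

$ strace -T -e trace=open,read perl readfile.pl
$ strace -c perl readfile.pl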



John
 

xhoster

Mark Seger said:
I'm trying to read a 3GB file efficiently. If I do it with a benchmarking
tool, I use about 6-8% of the cpu and can read it in about 44 seconds -
obviously the time is very closely tied to the type of disk, but I'm
including that for reference.

What kind of benchmarking tool is it? For benchmarking raw disks, or
the OS FS, or what? It may be using methods that are simply unavailable to
a general purpose language like perl.
When I do the same thing in perl using:

$reclen=1024*128;
while ($bytes=sysread(FILE, $buffer, $reclen))
{
$total+=$bytes;
}

it takes just under 60 seconds

For me, it takes 7 seconds to read 4e9 bytes with your code and
buffer size. If I make it read from /dev/zero rather than a real
file, then it takes less than 0.5 seconds. Since Perl doesn't know
that it is reading from /dev/zero, I would have to assume that at least
6.5 of those 7 seconds for the real files are taken up by things outside
perl's control.
and uses 25-30% of the cpu.

I don't think CPU reporting in these cases is very meaningful. Every
monitoring tool seems to use a different method for how to attribute time
between user, system, idle, IO-wait, etc.

I'm sure there is a lot of data movement between buffers and am wondering if
there is some way to avoid it. I'm guessing that perl may be creating a new
instance of $buffer on every pass through the loop, and if so that would
involve a malloc() and free() on every pass, which I'd like to avoid if
possible.

Since I can't replicate your poor performance, I can't really investigate
it. But I doubt any of this stuff is worth worrying about. I'd look at the
systems level, rather than at perl.

Xho
 

Mark Seger

John said:
you might want to increase reclen, since 128 kbytes sounds like an
extremely small buffer to me.
a 128KB buffer is more than enough - remember, this can be done directly
from C very efficiently. In any event I did try with a 1MB buffer and saw no
difference.
-mark
 

Mark Seger

John said:
Are you using open() or sysopen() to open the file? sysread() "bypasses
buffered IO" but your $reclen may be too large (or too small) for efficient
IO. Your example appears to use $main::buffer, which means that the same
variable is used for each read; however, I don't know whether Perl reallocates
memory for each read. You could use something like strace(1) to determine
exactly what system calls the program is making.
I'm using open(), but I'll give sysopen() a whirl in the morning. I
also like the idea about strace. My fear is that the data is being read into
one buffer, storage is getting allocated for $buffer on each call, and the
data is then copied into it. The challenge is: is there a way to read directly
into $buffer? Maybe strace will provide some clues...
-mark
 

Ben Morrow

Quoth Mark Seger:
I'm using open(), but I'll give sysopen() a whirl in the morning. I
also like the idea about strace. My fear is that the data is being read into
one buffer, storage is getting allocated for $buffer on each call, and the
data is then copied into it. The challenge is: is there a way to read directly
into $buffer? Maybe strace will provide some clues...

open/sysopen should make no difference. To preallocate a buffer, create
a long string and overwrite bits of it with substr or directly with
sysread. You have to do your own buffer manglement as in C, of course,
but that's how you get efficiency.
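
A minimal sketch of that idea (variable names are placeholders; the file path
is the one from the earlier post): preallocate the scalar once and let sysread
overwrite it in place on every pass:

my $reclen = 1024 * 128;
my $buffer = "\0" x $reclen;   # preallocated once, reused for every read

open my $fh, '<', '/mnt/scratch/test' or die "can't read file: $!";
binmode $fh;

my $total = 0;
# sysread writes into the existing scalar; it only reallocates if it's too short
while (my $bytes = sysread $fh, $buffer, $reclen) {
    $total += $bytes;
}
print "read $total bytes\n";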

Ben
 

Mark Seger

What kind of benchmarking tool is it? For benchmarking raw disks, or
the OS FS, or what? It may be using methods that are simply unavailable to
a general purpose language like perl.

I'm using a tool called dt, see:
http://home.comcast.net/~SCSIguy/SCSI_FAQ/RMiller_Tools/dt.html, which
has been around for years and is very efficient. All I'm doing is basic
sequential reads. Nothing fancy.
For me, it takes 7 seconds to read 4e9 bytes with your code and
buffer size. If I make it read from /dev/zero rather than a real
file, then it takes less than 0.5 seconds. Since Perl doesn't know
that it is reading from /dev/zero, I would have to assume that at least
6.5 of those 7 seconds for the real files are taken up by things outside
perl's control.

I find that number impossible to believe unless you're doing something
very fancy OR you have more than 4GB RAM and are reading it out of cache
(that's why I'm reading a 3GB file - I have 3GB of RAM). A very common
mistake in benchmarking is to write a file that is < the size of your
RAM. After the write, the whole file is sitting in memory and none of the
reads will provide accurate numbers. To deliver the data at the rate
you're suggesting, that disk would have to be delivering >500MB/sec and
I haven't seen any capable of even delivering 100.
I don't think CPU reporting in these cases is very meaningful. Every
monitoring tool seems to use a different method for how to attribute time
between user, system, idle, IO-wait, etc.

But it is! I'd claim that if the tools report times that differ by that
much, they're not doing the same thing, and the whole point of such a tool is
to be efficient enough to show one the maximum values. The fact that I can do
basic I/O in a simple C program that uses <10% of the cpu (and I should
point out that I had forgotten this is an average over my 2 cpus, so one of
them is really using almost 20%) demonstrates that one can and
should be able to read without massive cpu consumption in a tool.
Since I can't replicate your poor performance, I can't really investigate
it. But I doubt any of this stuff is worth worrying about. I'd look at the
systems level, rather than at perl.

Sorry, but if you can't replicate my numbers you must be doing something
wrong. Are you SURE your cache is empty - it could be as simple as that.
See how much memory you have and then write a file bigger than that
amount. Now when you read it back you're guaranteed it's not in cache
and I promise it'll take much longer to read it back.

-mark
 

Mark Seger

Ben said:
open/sysopen should make no difference. To preallocate a buffer, create
a long string and overwrite bits of it with substr or directly with
sysread. You have to do your own buffer manglement as in C, of course,
but that's how you get efficiency.

I didn't think it would make a difference but I'm desperate and willing
to try anything. 8-(

I had created a long string and passed it to sysread but it didn't seem
to make any difference, and besides, on subsequent reads it would
already be allocated to the proper length by the previous reads. Or am
I missing something?

Just to back up a step or two, I wrote a short C program that mallocs a
buffer and calls fread with the address of the buffer and it runs as
efficiently and at the same speed as my benchmark tool (which I've
included a pointer to in a previous response - a very cool tool if you
haven't tried it yet).

What I don't see is how to pass the address of my buffer to sysread, as
it wants a scalar - so won't that always force it to be created/malloc'd?

Here's exactly what I'm doing, noting that I'm counting the bytes read
just to make sure I'm reading no more than I should be. Assuming I
understand what you were saying, I believe I am preallocating the buffer.

#!/usr/bin/perl -w

$reclen=1024*128;
$buffer=' 'x$reclen;
$filename='/mnt/scratch/test';
open FILE, "<$filename" or die;

$total=0;
$start=time;
while ($bytes=sysread(FILE, $buffer, $reclen))
{
$total+=$bytes;
}

$duration=time-$start;
printf "Filesize: %5dM Recsize:%5dK %5.1fSecs %6dKB/sec\n",
$total/(1024*1024), $reclen/1024, $duration, $total/$duration/1024;

-mark
 

Mark Seger

I realized I didn't answer all your questions. Sorry about that. See
below:

What kind of benchmarking tool is it? For benchmarking raw disks, or
the OS FS, or what? It may be using methods that are simply unavailable to
a general purpose language like perl.

I do believe DT (which I mentioned in an earlier note) uses very basic,
sequential reads (there are certainly switches to do async and other
things as well, but I just use the basics). If you do care to
download a copy of it (and I highly recommend it), try the command:

dt of=/tmp/file limit=10g bs=1m disable=compare,verify dispose=keep

to create a 10G file. As I said before, pick a size at least as large
as your current RAM to ensure it doesn't get cached. Now read it back
with the identical command, replacing the 'of' with 'if' (output to
input). Here's an example on my machine, remember I only have 3 GB RAM:

[root@cag-dl380-01 mjs]# ./dt of=/mnt/scratch/test limit=3g bs=1m
disable=compare,verify dispose=keep

Total Statistics:
Output device/file name: /mnt/scratch/test (device type=regular)
Type of I/O's performed: sequential (forward)
Data pattern written: 0x39c39c39 (read verify disabled)
Total records processed: 3072 @ 1048576 bytes/record (1024.000 Kbytes)
Total bytes transferred: 3221225472 (3145728.000 Kbytes, 3072.000
Mbytes)
Average transfer rates: 50135805 bytes/sec, 48960.747 Kbytes/sec
Number I/O's per second: 47.813
Total passes completed: 1/1
Total errors detected: 0/1
Total elapsed time: 01m04.25s
Total system time: 00m08.46s
Total user time: 00m00.00s
Starting time: Sun Aug 13 08:57:30 2006
Ending time: Sun Aug 13 08:58:35 2006

[root@cag-dl380-01 mjs]# ./dt if=/mnt/scratch/test limit=3g bs=1m
disable=compare,verify dispose=keep

Total Statistics:
Input device/file name: /mnt/scratch/test (device type=regular)
Type of I/O's performed: sequential (forward)
Data pattern read: 0x39c39c39 (data compare disabled)
Total records processed: 3072 @ 1048576 bytes/record (1024.000 Kbytes)
Total bytes transferred: 3221225472 (3145728.000 Kbytes, 3072.000
Mbytes)
Average transfer rates: 88252753 bytes/sec, 86184.329 Kbytes/sec
Number I/O's per second: 84.164
Total passes completed: 1/1
Total errors detected: 0/1
Total elapsed time: 00m36.50s
Total system time: 00m06.85s
Total user time: 00m00.02s
Starting time: Sun Aug 13 09:00:00 2006
Ending time: Sun Aug 13 09:00:36 2006

The 2 main things to note are the Average Transfer Rates and the Total
Elapsed time. Now here's another run using a file size of 1GB (I left
out the intermediate stats for the sake of brevity). As you can see the
reads are MUCH faster (almost 250MB/sec) since it's reading out of
memory rather than disk:

[root@cag-dl380-01 mjs]# ./dt of=/mnt/scratch/test2 limit=1g bs=1m
disable=compare,verify dispose=keep

Total Statistics:
Average transfer rates: 50815988 bytes/sec, 49624.988 Kbytes/sec
Total elapsed time: 00m21.13s

[root@cag-dl380-01 mjs]# ./dt if=/mnt/scratch/test2 limit=1g bs=1m
disable=compare,verify dispose=keep

Total Statistics:
Average transfer rates: 249128033 bytes/sec, 243289.095 Kbytes/sec
Total elapsed time: 00m04.31s

-mark
 

Peter J. Holzer

Mark Seger said:
I'm trying to read a 3GB file efficiently. If I do it with a benchmarking
tool, I use about 6-8% of the cpu and can read it in about 44 seconds -
obviously the time is very closely tied to the type of disk, but I'm
including that for reference. [...]
When I do the same thing in perl using:

$reclen=1024*128;
while ($bytes=sysread(FILE, $buffer, $reclen))
{
$total+=$bytes;
}

it takes just under 60 seconds

For me, it takes 7 seconds to read 4e9 bytes with your code and
buffer size.

That would be about 570 MB/s. Just for reference, the Seagate Cheetah
300GB Fiberchannel drive (which should be among the fastest you can buy
today) delivers at most 150 MB/s. So either you have 4 of them in a
well-tuned RAID-0 configuration or you have more than 4 GB of RAM and
are reading from the buffer cache. Either way, that doesn't seem like an
average system you are using. The performance Mark is seeing (50-70 MB/s)
is much closer to my experience.
If I make it read from /dev/zero rather than a real file, then it
takes less than 0.5 seconds.

Right. 0.35 Seconds on a 2.4 GHz P4 Xeon.

Also, the difference in CPU usage between dd bs=128k and the perl script
is about 0.2 seconds. Both take about 12 seconds CPU time when reading
from disk, but almost all of that is system time - time the OS needs to
shuffle buffers around, talk to the fiber channel adapters, etc. That
could probably be reduced quite a bit using mmap instead of read.
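
As a rough illustration only (this assumes the CPAN module File::Map is
installed - it is not part of core perl, and other mmap wrappers such as
Sys::Mmap work similarly), mapping the file lets you touch its contents
without a read() into a perl buffer for every record:

use File::Map 'map_file';

my $filename = '/mnt/scratch/test';
map_file my $map, $filename, '<';   # map the whole file read-only

# walk the mapping in 128KB chunks; no read()/sysread() calls are made
my $reclen = 1024 * 128;
my $total  = 0;
for (my $off = 0; $off < length $map; $off += $reclen) {
    $total += length substr $map, $off, $reclen;
}
print "touched $total bytes\n";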

I don't think CPU reporting in these cases is very meaningful. Every
monitoring tool seems to use a different method for how to attribute time
between user, system, idle, IO-wait, etc.

I don't know what platform Mark is using, but on Unixish systems these
statistics are collected by the OS. Not much a monitoring tool (the
shell, usually) can do about it.

So a difference between 6% and 25% CPU usage coupled with an increase in
wall-clock time from 44 to 60 seconds on the same platform is IMHO very
significant and shows that perl does something a lot less efficiently
than Mark's unnamed "benchmark tool".

First I'd look at the hardware: Is this an old or low-power cpu, which
would explain the difference between my 0.2 seconds and Mark's 16
seconds?

Next I'd check what the perl script is really doing. For me strace
prints for Mark's script a steady stream of

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072

Finally, I'd look at the script. Mark didn't show how he opened the
file. If he didn't open it in binary mode, an input layer may add a lot
of overhead.

Now that I think of it, I'd reverse the order :).

Perl has almost certainly a lot more overhead than C. That's the price
you pay for using a higher-level language. Usually, that overhead
doesn't matter because your script spends most of its time elsewhere,
though.

hp
 

Mark Seger

So a difference between 6% and 25% CPU usage coupled with an increase in
wall-clock time from 44 to 60 seconds on the same platform is IMHO very
significant and shows that perl does something a lot less efficiently
than Mark's unnamed "benchmark tool".
I thought I named it in a previous email. It's Robin Miller's dt, which
he wrote when he was at DEC. Very popular/flexible. Do yourself a
favor and download a copy. You won't be sorry:

http://home.comcast.net/~SCSIguy/SCSI_FAQ/RMiller_Tools/dt.html
First I'd look at the hardware: Is this an old or low-power cpu, which
would explain the difference between my 0.2 seconds and Mark's 16
seconds?

Next I'd check what the perl script is really doing. For me strace
prints for Mark's script a steady stream of

read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072

Finally, I'd look at the script. Mark didn't show how he opened the
file. If he didn't open it in binary mode, an input layer may add a lot
of overhead.
Plain old open for input, nothing fancy. If there IS something fancy I
can/should do via fcntl, etc., I'm all ears, as I do believe the default
behavior is not going to do what I want.
Now that I think of it, I'd reverse the order :).




Perl has almost certainly a lot more overhead than C. That's the price
you pay for using a higher-level language. Usually, that overhead
doesn't matter because your script spends most of its time elsewhere,
though.

No argument on that one. However I have also found that perl can come
very close to C on some operations when there's not much between it and
system calls. For example, when writing data perl has no problem getting the
same numbers as dt. I'm guessing in that case both need to move data
from user space to the buffer cache and there's no extra movement in
between. I've also found perl does a great job on socket I/O. But I
also agree, there are certainly many cases where you just can't beat raw
C power and the trick is to know when perl is sufficient and when it's
not. That's what I'm trying to figure out right now. :cool:
-mark
 

Ben Morrow

Quoth Mark Seger:
I didn't think it would make a difference but I'm desperate and willing
to try anything. 8-(

I had created a long string and passed it to sysread but it didn't seem
to make any difference, and besides, on subsequent reads it would
already be allocated to the proper length by the previous reads. Or am
I missing something?

read and sysread only ever make strings longer. By using the fourth
parameter OFFSET, and keeping track of where you are in the string, you
can write into a precreated string just as you would a buffer in C.
What I don't see is how to pass the address of my buffer to sysread, as
it wants a scalar - so won't that always force it to be created/malloc'd?

sysread effectively 'takes the address' of the scalar you pass in: that
is, it writes directly into the scalar as given, and only allocates
memory if it isn't long enough.
Here's exactly what I'm doing, noting that I'm counting the bytes read
just to make sure I'm reading no more than I should be. Assuming I
understand what you were saying, I believe I am preallocating the buffer.

#!/usr/bin/perl -w

Instead of -w you want

use warnings;

Also you want

use strict;

and you need to declare your variables with my.
$reclen=1024*128;

This is still a very small record size. The overhead of Perl ops is much
higher than C ops, so each spin round the loop will cost you much more
in Perl than in C. This makes it more important to minimize the number
of times you need to loop by reading as much as you reasonably can at a
time.
$buffer=' 'x$reclen;
$filename='/mnt/scratch/test';
open FILE, "<$filename" or die;

Use lexical filehandles.
Use 3-arg open.
Give a meaningful error message, even for tiny programs.

open my $FILE, '<', $filename or die "can't read '$filename': $!";
$total=0;
$start=time;

Note that it is often easier to use the Benchmark module for timing
things.
while ($bytes=sysread(FILE, $buffer, $reclen))

This will overwrite the contents of $buffer, starting from the beginning
each time. You want something more like

while ($bytes = sysread($FILE, $buffer, $reclen, $total)) {

, although you want to bear in mind what I said above about reading as
much as you can at a time.
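
Putting those pieces together, one way the full loop could look (a sketch
only; note that with OFFSET set to $total the buffer grows to the full file
size, so for a pure throughput test you may prefer to keep the offset at 0 and
simply reuse the buffer):

open my $FILE, '<', $filename or die "can't read '$filename': $!";

my $reclen = 1024 * 1024;   # read as much as is reasonable per call
my $buffer = '';
my $total  = 0;

# each read appends at offset $total, so $buffer accumulates the file contents
while (my $bytes = sysread $FILE, $buffer, $reclen, $total) {
    $total += $bytes;
}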

Ben
 

xhoster

Mark Seger said:
I realized I didn't answer all your questions. Sorry about that. See
below:



I do believe DT (which I mentioned in an earlier note) uses very basic,
sequential reads (there are certainly switches to do async and other
switches as well, but I just use the basics). ....
to create a 10G file. As I said before, pick a size at least as large
as your current RAM to ensure it doesn't get cached.

But I thought the point was to investigate Perl, not my hard drive.
What difference does it make to Perl if it is cached or not? (My previous
run was using a sparse file, most of the pages were all nulls but that
shouldn't make any difference to the buffering--the data is still real data
as far as that goes.)
Now read it back
with the identical command, replacing the 'of' with 'if' (output to
input). Here's an example on my machine, remember I only have 3 GB RAM:

[root@cag-dl380-01 mjs]# ./dt of=/mnt/scratch/test limit=3g bs=1m
disable=compare,verify dispose=keep

OK, I used dt to make a 4G file (I have 2 G ram) and then used it again to
read it like above, it took 1:24 to read 4 Gigs. (I straced it, and it
seemed to use only ordinary read commands, same as Perl does.)

I ran your Perl sysread code, it took 84 seconds, or 1:24, to read the same
4G file.

This is perl, v5.8.8 built for i686-linux-thread-multi

Xho
 

John W. Krahn

Mark said:
I didn't think it would make a difference but I'm desperate and willing
to try anything. 8-(

I had created a long string and passed it to sysread but it didn't seem
to make any difference, and besides, on subsequent reads it would
already be allocated to the proper length by the previous reads. Or am
I missing something?

Just to back up a step or two, I wrote a short C program that mallocs a
buffer and calls fread with the address of the buffer and it runs as
efficiently and at the same speed as my benchmark tool (which I've
included a pointer to in a previous response - a very cool tool if you
haven't tried it yet).

What I don't see is how to pass the address of my buffer to sysread, as
it wants a scalar - so won't that always force it to be created/malloc'd?

Here's exactly what I'm doing, noting that I'm counting the bytes read
just to make sure I'm reading no more than I should be. Assuming I
understand what you were saying, I believe I am preallocating the buffer.

#!/usr/bin/perl -w

$reclen=1024*128;
$buffer=' 'x$reclen;
$filename='/mnt/scratch/test';
open FILE, "<$filename" or die;

$total=0;
$start=time;
while ($bytes=sysread(FILE, $buffer, $reclen))
{
$total+=$bytes;
}

$duration=time-$start;
printf "Filesize: %5dM Recsize:%5dK %5.1fSecs %6dKB/sec\n",
$total/(1024*1024), $reclen/1024, $duration, $total/$duration/1024;

I tried writing a C program and a Perl program that did (basically) the same
thing and I got these results:

$ gcc -o seger-test seger-test.c
$ time ./seger-test
Filesize: 1043M Recsize: 128K 45Secs 23749KB/sec

real 0m45.024s
user 0m0.016s
sys 0m7.408s
$ time ./seger-test.pl
Filesize: 1043M Recsize: 128K 45Secs 23749KB/sec

real 0m45.606s
user 0m0.171s
sys 0m7.462s


And it doesn't look like C and Perl differ very much in performance.

Memory size:

$ free -b
total used free shared buffers cached
Mem: 462196736 456695808 5500928 0 2555904 334778368
-/+ buffers/cache: 119361536 342835200
Swap: 0 0 0

File size:

$ ls -l SUSE-10.0-LiveDVD.iso
-rw-r--r-- 1 john users 1094363136 2005-10-25 05:08
download/SUSE-10.0-LiveDVD.iso

The C program:

$ cat ./seger-test.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <error.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>


int main ( void )
{
    ssize_t reclen = 1024 * 128;
    char filename[] = "SUSE-10.0-LiveDVD.iso";
    int fd;

    if ( ( fd = open( filename, O_RDONLY ) ) == -1 )
    {
        perror( "Cannot open file" );
        return EXIT_FAILURE;
    }

    time_t start = time( NULL );
    char *buffer = malloc( reclen );
    ssize_t total = 0;
    ssize_t bytes;
    while ( bytes = read( fd, buffer, reclen ) )
    {
        if ( bytes == -1 )
        {
            perror( "Cannot read from file" );
            return EXIT_FAILURE;
        }
        total += bytes;
    }

    free( buffer );
    time_t duration = time( NULL ) - start;

    printf( "Filesize: %5uM Recsize:%5uK %5uSecs %6uKB/sec\n",
            total / ( 1024 * 1024 ), reclen / 1024, duration,
            total / duration / 1024 );

    return EXIT_SUCCESS;
}


The Perl program:

$ cat ./seger-test.pl
#!/usr/bin/perl
use warnings;
use strict;
use bytes;
use integer;
use Fcntl;


my $reclen = 1024 * 128;
my $filename = 'SUSE-10.0-LiveDVD.iso';


sysopen my $fd, $filename, O_RDONLY or die "Cannot open '$filename' $!";


my $start = time;
my $buffer;
my $total = 0;

while ( my $bytes = sysread $fd, $buffer, $reclen ) {

defined $bytes or die "Cannot read from '$filename' $!";

$total += $bytes;
}

my $duration = time() - $start;

printf( "Filesize: %5uM Recsize:%5uK %5uSecs %6uKB/sec\n",
$total / ( 1024 * 1024 ), $reclen / 1024, $duration, $total /
$duration / 1024 );

__END__



John
 

Dr.Ruud

Mark Seger wrote:
I've also found perl does a great job on socket
I/O. But I also agree, there are certainly many cases where you just
can't beat raw C power and the trick is to know when perl is
sufficient and when it's not.

It is also feasible to mix Perl and C, see perlxstut and perlxs.
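
As a rough sketch of that idea (this uses the CPAN module Inline::C, which
must be installed separately, as a lower-ceremony alternative to hand-written
XS):

use Inline C => <<'END_C';
/* a trivial C function that becomes callable from Perl */
long sum_to(long n) {
    long i, total = 0;
    for (i = 1; i <= n; i++)
        total += i;
    return total;
}
END_C

print sum_to(10), "\n";   # prints 55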
 

Mark Seger

OK, I used dt to make a 4G file (I have 2 G ram) and then used it again to
read it like above, it took 1:24 to read 4 Gigs. (I straced it, and it
seemed to use only ordinary read commands, same as Perl does.)

I ran your Perl sysread code, it took 84 seconds, or 1:24, to read the same
4G file.

This is perl, v5.8.8 built for i686-linux-thread-multi

Bingo, and thank you! I was using perl 5.8.0 (I know, shame on me) and
foolishly thought something as basic as the interface between perl and
system calls was very efficient. While I'm sure it is, apparently
there's enough of a difference between perl versions to matter. Then
again, this 'test' system was also RedHat 9 - I know, also ancient.

my latest tests were on 5.8.5 on a dual-socket/dual-core opteron running
RHEL4/Update4.

btw - I hope you like dt... :cool:

-mark
 

Mark Seger

sysread effectively 'takes the address' of the scalar you pass in: that
is, it writes directly into the scalar as given, and only allocates
memory if it isn't long enough.
great - that's what I was hoping it would do. As you may have seen from
my response to Xho, it now appears my problems were in an older version
of perl!
Instread of -w you want

use warnings;

Also you want

use strict;

and you need to declare your variables with my.
re: -w, I guess I still don't get it vs 'use warnings'
re: strict, you are right and I am lazy, but will try harder... :cool:
This is still a very small record size. The overhead of Perl ops is much
higher than C ops, so each spin round the loop will cost you much more
in Perl than in C. This makes it more important to minimize the number
of times you need to loop by reading as much as you reasonably can at a
time.
I'm not sure I buy that. The limiting factor here is the disk speed.
When you write a bunch of 128KB blocks, they first get written to the
buffer cache and then the i/o scheduler merges them into bigger requests
(see the merges stat for I/O) so it really doesn't make all that much of
a difference. On a 3GHz processor we're not talking a lot of extra time.
Use lexical filehandles.
Use 3-arg open.
Give a meaningful error message, even for tiny programs.

open my $FILE, '<', $filename or die "can't read '$filename': $!";

I'd be curious to know what the benefit is of a 3-arg open - in fact I
didn't even realize you could do this.
re: error messages - I agree completely with you and virtually ALWAYS
provide something more meaningful. In this quick throwaway, I was again
being lazy.
Note that it is often easier to use the Benchmark module for timing
things.

This is true, but working with performance numbers all the time, I feel
a lot more comfortable looking at disk stats vs times. For example, I
ran my tests on my opteron, which has 8GB of RAM, using an 8GB file and
found inconsistent disk numbers. That led me to the belief that the
aging algorithm for the buffer cache is not truly oldest-first. It
wasn't until I did a umount/mount that my numbers were what they should
have been. I don't believe I could have found that with 'Benchmark'.
This will overwrite the contents of $buffer, starting from the beginning
each time. You want something more like

while ($bytes = sysread($FILE, $buffer, $reclen, $total)) {

, although you want to bear in mind what I said above about reading as
much as you can at a time.
super! and thanks for the coding tips... :cool:
-mark
 

Ben Morrow

[could you put a blank line between a quote and your reply? It makes
things much clearer...]

Quoth Mark Seger:
great - that's what I was hoping it would do. As you may have seen from
my response to Xho, it now appears my problems were in an older version
of perl!

Ah, right. 5.8.0 is known to be very buggy.
re: -w, I guess I still don't get it vs 'use warnings'

use warnings lets you easily turn on or off warnings for a given block,
and lets you control warnings by category.
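
For example:

use warnings;

my $maybe_undef;
{
    # disable just the 'uninitialized' category inside this block
    no warnings 'uninitialized';
    print "value: $maybe_undef\n";    # no warning here
}
print "value: $maybe_undef\n";        # warns: Use of uninitialized value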
re: strict, you are right and I am lazy, but will try harder... :cool:
:)


I'm not sure I buy that. The limiting factor here is the disk speed.
When you write a bunch of 128KB blocks, they first get written to the
buffer cache and then the i/o scheduler merges them into bigger requests
(see the merges stat for I/O) so it really doesn't make all that much of
a difference. On a 3GHz processor we're not talking a lot of extra time.

We're talking about read, not write. I suspect in 5.8.0 your code ended
up spending most of its time in the perl runloop, rather than waiting on
IO.
I'd be curious to know what the benefit is of a 3-arg open - in fact I
didn't even realize you could do this.

There are two benefits: it cleanly deals with files with any characters
in their names, and it lets you use IO layers (see PerlIO).

Consider what would happen if (for example) $filename began with a '+'.
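
A small illustration of the difference (the filename is made up purely to
make the point):

my $filename = ' odd name.dat ';   # note the surrounding spaces

# 2-arg open: the mode is parsed out of the same string as the name, so
# leading/trailing whitespace and other magic characters can be misread
# open FILE, "<$filename" or die ...;

# 3-arg open: mode and name never mix, so any filename is safe
open my $fh, '<', $filename or die "can't read '$filename': $!";
# and IO layers can be requested explicitly, e.g. raw/binary:
# open my $fh, '<:raw', $filename or die "can't read '$filename': $!";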
This is true, but working with performance numbers all the time, I feel
a lot more comfortable looking at disk stats vs times.

Maybe... I don't know about that. The point is, using Benchmark is
easier and more reliable than just calling time.
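
A sketch of what that could look like (read_with_reclen here is a
hypothetical wrapper around the sysread loop, not code from this thread):

use Benchmark qw(timethese);

# run each variant 5 times and report user/system/wall-clock time
timethese(5, {
    'sysread 128K' => sub { read_with_reclen(128  * 1024) },
    'sysread 1M'   => sub { read_with_reclen(1024 * 1024) },
});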

Ben
 
