reading large file in chunks: optimal chunk size?

bwv549

I'm reading a large file (too big to fit in memory) and doing a
hexdigest on it. What are some optimal size chunks to read the file
in and why? (speed is probably most important here as long as enough
memory is available for most machines)

Thanks.
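For reference, a minimal sketch of the loop in question (Digest::SHA1 and
the 64 KB chunk size are just placeholders; the chunk size is exactly the
value being asked about):

  require 'digest'

  CHUNK_SIZE = 64 * 1024   # placeholder; this is the value under discussion

  def file_hexdigest(path, chunk_size = CHUNK_SIZE)
    digest = Digest::SHA1.new
    File.open(path, 'rb') do |f|
      while (chunk = f.read(chunk_size))
        digest.update(chunk)
      end
    end
    digest.hexdigest
  end

  puts file_hexdigest('big_file.bin')   # hypothetical filename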
 
Bill Kelly

From: "bwv549 said:
I'm reading a large file (too big to fit in memory) and doing a
hexdigest on it. What are some optimal size chunks to read the file
in and why? (speed is probably most important here as long as enough
memory is available for most machines)

I'd recommend trying some benchmarks using various
chunk sizes, and also trying lower level unbuffered
read (sysread).

Last time I benchmarked this, a few years ago (although
using C, not Ruby) on Win32 (NTFS) and OS X (HFS+, I think)
I was surprised to find the optimal read chunk size was 4K,
which happened to be the partition allocation unit size,
and also the VM page size.

I tried all sorts of chunk sizes. The result was counter-
intuitive to me. I figured, if I allocated a large buffer,
and made a single read() call, that should be faster, if
not _at least as fast_ as making a whole lot of separate
4K reads.

But no, 4K was always the fastest in my tests.

But, maybe it will be different for you, so if it's
important, just benchmark it. :)


Regards,

Bill
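
A rough benchmark sketch along the lines Bill suggests, comparing buffered
read against sysread at a few chunk sizes (Benchmark and Digest are in the
standard library; the file path and sizes are placeholders, and after the
first pass the OS cache will serve most of the data from RAM):

  require 'benchmark'
  require 'digest'

  PATH  = 'big_file.bin'                      # hypothetical test file
  SIZES = [4 * 1024, 64 * 1024, 1024 * 1024]  # chunk sizes to compare

  Benchmark.bm(16) do |bm|
    SIZES.each do |size|
      bm.report("read    #{size / 1024}K") do
        digest = Digest::SHA1.new
        File.open(PATH, 'rb') do |f|
          while (chunk = f.read(size))
            digest.update(chunk)
          end
        end
      end

      bm.report("sysread #{size / 1024}K") do
        digest = Digest::SHA1.new
        buf = String.new
        File.open(PATH, 'rb') do |f|
          begin
            while true
              digest.update(f.sysread(size, buf))
            end
          rescue EOFError
            # end of file reached
          end
        end
      end
    end
  end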
 
Pascal J. Bourguignon

bwv549 said:
I'm reading a large file (too big to fit in memory) and doing a
hexdigest on it. What are some optimal size chunks to read the file
in and why? (speed is probably most important here as long as enough
memory is available for most machines)

Theoretically, the fastest I/O should occur when the buffer is
page-aligned, and when its size is a multiple of the system block
size.


If the file were stored contiguously on the hard disk, the fastest I/O
throughput would be attained by reading cylinder by cylinder. (There
might not even be any rotational latency then, since when you read a whole
track you don't need to start at the beginning; you can start in the
middle and wrap around.) Unfortunately, nowadays it is nearly impossible
to do that, since the hard disk firmware hides the physical layout of the
sectors, and may use replacement sectors, requiring seeks even while
reading a single track. The OS file system may also spread the
file blocks all over the disk (or at least, all over a cylinder
group), to avoid having to do long seeks over big files when
accessing small files.

So instead of considering physical tracks, you may try to take into
account the hard disk's buffer size. Most hard disks have a buffer of
8 MB, some 16 MB. So reading the file in chunks of up to 1 MB or 2 MB
should let the hard disk firmware optimize the throughput without
stalling its buffer.

But then, if there are a lot of layers above, e.g. a RAID, all bets are
off.

Really, the best you can do is to benchmark your particular
circumstances.
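
If you want to start from the filesystem's reported block size rather than
guessing, File::Stat exposes it; a small sketch, with the filename and the
multiplier as placeholders:

  # Ask the filesystem for its preferred I/O block size and build a
  # chunk size from it (File::Stat#blksize may be nil on some platforms).
  stat  = File.stat('big_file.bin')   # hypothetical filename
  block = stat.blksize || 4096        # fall back to a common block size
  chunk = block * 256                 # arbitrary multiple of the block size
  puts "block size: #{block} bytes, chunk size: #{chunk} bytes"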
 
Eleanor McHugh

Bill Kelly said:

I'd recommend trying some benchmarks using various
chunk sizes, and also trying lower level unbuffered
read (sysread).

Last time I benchmarked this, a few years ago (although
using C, not Ruby) on Win32 (NTFS) and OS X (HFS+, I think)
I was surprised to find the optimal read chunk size was 4K,
which happened to be the partition allocation unit size,
and also the VM page size.

I tried all sorts of chunk sizes. The result was counter-
intuitive to me. I figured, if I allocated a large buffer,
and made a single read() call, that should be faster, if not _at
least as fast_ as making a whole lot of separate
4K reads.

But no, 4K was always the fastest in my tests.

Which makes perfect sense when you consider that modern operating
systems maintain file caches in virtual memory, so once you start
accessing a file it's going to be mapped into VM and subsequent reads/
writes will generate page faults in the kernel and cause one or more
page-sized chunks to be physically loaded into RAM. That's the point
at which drive geometry is going to matter, so a larger read might
necessitate several disk accesses to load physically discontinuous but
logically adjacent blocks.

Once the pages are loaded into the cache all subsequent reads will be
at RAM speeds rather than HDD speeds and reproducible benchmarks can
be difficult to harvest at that point: neither process scheduling nor
in-kernel page loading is deterministic.
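
As a side note, the 4K figure above can be checked against the VM page size
from Ruby itself; a sketch assuming Ruby 2.1+ and a platform that defines
SC_PAGESIZE:

  require 'etc'

  # Report the VM page size (Etc.sysconf needs Ruby 2.1+ and a platform
  # that defines SC_PAGESIZE; otherwise fall back gracefully).
  if Etc.respond_to?(:sysconf) && defined?(Etc::SC_PAGESIZE)
    puts "VM page size: #{Etc.sysconf(Etc::SC_PAGESIZE)} bytes"
  else
    puts "page size not available via Etc on this platform"
  end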


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
 
