reading large file in chunks: optimal chunk size?

bwv549

I'm reading a large file (too big to fit in memory) and doing a
hexdigest on it. What are some optimal size chunks to read the file
in and why? (speed is probably most important here as long as enough
memory is available for most machines)

Thanks.
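For reference, a minimal sketch of the loop in question (Digest::SHA1 and
the 64 KB chunk size are just placeholders; the chunk size is exactly the
value being asked about):

  require 'digest'

  CHUNK_SIZE = 64 * 1024   # placeholder; this is the value under discussion

  def file_hexdigest(path, chunk_size = CHUNK_SIZE)
    digest = Digest::SHA1.new
    File.open(path, 'rb') do |f|
      while (chunk = f.read(chunk_size))
        digest.update(chunk)
      end
    end
    digest.hexdigest
  end

  puts file_hexdigest('big_file.bin')   # hypothetical filename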
 
Bill Kelly

From: "bwv549 said:
I'm reading a large file (too big to fit in memory) and doing a
hexdigest on it. What are some optimal size chunks to read the file
in and why? (speed is probably most important here as long as enough
memory is available for most machines)

I'd recommend trying some benchmarks using various
chunk sizes, and also trying lower level unbuffered
read (sysread).

Last time I benchmarked this, a few years ago (although
using C, not Ruby) on Win32 (NTFS) and OS X (HFS+, I think)
I was surprised to find the optimal read chunk size was 4K,
which happened to be the partition allocation unit size,
and also the VM page size.

I tried all sorts of chunk sizes. The result was counter-
intuitive to me. I figured, if I allocated a large buffer,
and made a single read() call, that should be faster, if
not _at least as fast_ as making a whole lot of separate
4K reads.

But no, 4K was always the fastest in my tests.

But, maybe it will be different for you, so if it's
important, just benchmark it. :)


Regards,

Bill
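
A rough benchmark sketch along the lines Bill suggests, comparing buffered
read against sysread at a few chunk sizes (Benchmark and Digest are in the
standard library; the file path and sizes are placeholders, and after the
first pass the OS cache will serve most of the data from RAM):

  require 'benchmark'
  require 'digest'

  PATH  = 'big_file.bin'                      # hypothetical test file
  SIZES = [4 * 1024, 64 * 1024, 1024 * 1024]  # chunk sizes to compare

  Benchmark.bm(16) do |bm|
    SIZES.each do |size|
      bm.report("read    #{size / 1024}K") do
        digest = Digest::SHA1.new
        File.open(PATH, 'rb') do |f|
          while (chunk = f.read(size))
            digest.update(chunk)
          end
        end
      end

      bm.report("sysread #{size / 1024}K") do
        digest = Digest::SHA1.new
        buf = String.new
        File.open(PATH, 'rb') do |f|
          begin
            while true
              digest.update(f.sysread(size, buf))
            end
          rescue EOFError
            # end of file reached
          end
        end
      end
    end
  end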
 
Pascal J. Bourguignon

bwv549 said:
I'm reading a large file (too big to fit in memory) and doing a
hexdigest on it. What are some optimal size chunks to read the file
in and why? (speed is probably most important here as long as enough
memory is available for most machines)

Theoretically, the fastest I/O should occur when the buffer is
page-aligned, and when its size is a multiple of the system block
size.


If the file were stored contiguously on the hard disk, the fastest I/O
throughput would be attained by reading cylinder by cylinder. (There
might not even be any rotational latency then, since when you read a whole
track you don't need to start at the beginning; you can start in the
middle and wrap around.) Unfortunately, nowadays it is nearly impossible
to do that, since the hard disk firmware hides the physical layout of the
sectors, and may use replacement sectors, requiring seeks even while
reading a single track. The OS file system may also spread the
file blocks all over the disk (or at least, all over a cylinder
group), to avoid having to do long seeks over big files when
accessing small files.

So instead of considering physical tracks, you may try to take into
account the hard disk's buffer size. Most hard disks have a buffer of
8 MB, some 16 MB. So reading the file in chunks of up to 1 MB or 2 MB
should let the hard disk firmware optimize the throughput without
stalling its buffer.

But then, if there are a lot of layers above, e.g. a RAID, all bets are
off.

Really, the best you can do is to benchmark your particular
circumstances.
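
If you want to start from the filesystem's reported block size rather than
guessing, File::Stat exposes it; a small sketch, with the filename and the
multiplier as placeholders:

  # Ask the filesystem for its preferred I/O block size and build a
  # chunk size from it (File::Stat#blksize may be nil on some platforms).
  stat  = File.stat('big_file.bin')   # hypothetical filename
  block = stat.blksize || 4096        # fall back to a common block size
  chunk = block * 256                 # arbitrary multiple of the block size
  puts "block size: #{block} bytes, chunk size: #{chunk} bytes"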
 
Eleanor McHugh

Bill Kelly said:

I'd recommend trying some benchmarks using various
chunk sizes, and also trying lower level unbuffered
read (sysread).

Last time I benchmarked this, a few years ago (although
using C, not Ruby) on Win32 (NTFS) and OS X (HFS+, I think)
I was surprised to find the optimal read chunk size was 4K,
which happened to be the partition allocation unit size,
and also the VM page size.

I tried all sorts of chunk sizes. The result was counter-
intuitive to me. I figured, if I allocated a large buffer,
and made a single read() call, that should be faster, if not _at
least as fast_ as making a whole lot of separate
4K reads.

But no, 4K was always the fastest in my tests.

Which makes perfect sense when you consider that modern operating
systems maintain file caches in virtual memory, so once you start
accessing a file it's going to be mapped into VM and subsequent reads/
writes will generate page faults in the kernel and cause one or more
page-sized chunks to be physically loaded into RAM. That's the point
at which drive geometry is going to matter, so a larger read might
necessitate several disk accesses to load physically discontinuous but
logically adjacent blocks.

Once the pages are loaded into the cache all subsequent reads will be
at RAM speeds rather than HDD speeds and reproducible benchmarks can
be difficult to harvest at that point: neither process scheduling nor
in-kernel page loading is deterministic.
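
As a side note, the 4K figure above can be checked against the VM page size
from Ruby itself; a sketch assuming Ruby 2.1+ and a platform that defines
SC_PAGESIZE:

  require 'etc'

  # Report the VM page size (Etc.sysconf needs Ruby 2.1+ and a platform
  # that defines SC_PAGESIZE; otherwise fall back gracefully).
  if Etc.respond_to?(:sysconf) && defined?(Etc::SC_PAGESIZE)
    puts "VM page size: #{Etc.sysconf(Etc::SC_PAGESIZE)} bytes"
  else
    puts "page size not available via Etc on this platform"
  end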


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
 
