File Read Cache - How to purge?

Signal

This is to become part of a larger script that will read through all
files on a given drive; I was playing around with reading files and
wanted to see if there was an optimum read size on my system.

What I noticed is that the file being read is "cached" for subsequent
reads. Based on some testing it looks like the caching is done by the
underlying OS (Windows in this case), but I have a few questions.

Here's a code sample:
-------------------------------------------------------------------------
import os, time

# Set the following two variables to
# different large files on your system.
# Suggest files in the range of 500MB to 1Gig
testfile1 = "d:\\test1\\junk1.file"
testfile2 = "d:\\test1\\junk2.file"

def readfile(filename):
    size = os.path.getsize(filename)
    bufsize = 4096
    print filename, size, "Bytes"

    while bufsize < 132000:
        start = time.clock()

        f = open(filename, "rb")
        buf = f.read(bufsize)
        while buf:
            buf = f.read(bufsize)
        f.flush()  # note: put here as a test and
                   # it doesn't make a difference
        f.close()

        end = time.clock()
        print bufsize, round(end - start, 3)
        bufsize = bufsize * 2

    print " "

# Comment the second and third readfile and run
# the program twice to see a similar result for testfile1
readfile(testfile1)
readfile(testfile1)
readfile(testfile2)
-----------------------------------------------------------------


Sample output for the first readfile(testfile1) call:
d:\test1\junk1.file 759167228 Bytes
4096 20.366
8192 0.923
16384 0.783
32768 0.737
65536 0.74
131072 0.82

After the first read test at 4096, subsequent read tests appear to be
served from cache, even though the file is closed before each new read
test.

Sample output for the second readfile(testfile1) call:
d:\test1\junk1.file 759167228 Bytes
4096 1.258
8192 0.944
16384 0.795
32768 0.743
65536 0.725
131072 0.826

OK, I didn't expect much difference here given the earlier run, but
note how the 4096 test now takes only 1.2 seconds.

Sample output for testfile2:
d:\test1\junk2.file 1142511616 Bytes
4096 31.514
8192 1.417
16384 1.202
32768 1.11
65536 1.089
131072 1.245

Same situation as our first sample for testfile1. 4096 is not cached,
but subsequent reads are.

Now some things to note:

So it seems the file is being cached; however, on my system only ~2MB
of additional memory is used when the program is run. This 2MB of
memory is released when the script exits.

If you comment out the second and third readfile lines (as noted in the
code):

a. Run the program twice; you will see that even though the program
exits, this cache is not cleared.

b. If you open another command prompt and run the code, it's still cached.

c. If you close both command prompts, open a new one and run the code,
it's still cached.

It isn't "cleared" until another large file is read.

My questions are:

1. I don't quite understand how after one full read of a file, another
full read of the same file is "cached" so significantly while
consuming so little memory. What exactly is being cached to improve
the reading of the file a second time?

2. Is there any way to somehow take advantage of this "caching" by
initializing it without reading through the entire file first?

3. If the answer to #2 is No, then is there a way to purge this
"cache" in order to get a more accurate result in my routine? That is
without having to read another large file first?
 
Marc 'BlackJack' Rintsch

1. I don't quite understand how after one full read of a file, another
full read of the same file is "cached" so significantly while
consuming so little memory. What exactly is being cached to improve
the reading of the file a second time?

What do you mean by so little memory? It (the whole file) is cached by the
operating system totally independent of your program, so the memory used
does of course not show up in the memory stats of your program. Just
think about this: some file `a.dat` is cached by the OS and you start a
program that might eventually read that file. The memory is used already
*before* the program starts, and the OS does not know in advance which
files will be read by the program. So how, why, and when should the memory
used for the cache be added to the program's memory stats?
2. Is there any way to somehow take advantage of this "caching" by
initializing it without reading through the entire file first?

You mean reading the file without actually reading it!? :)
3. If the answer to #2 is No, then is there a way to purge this
"cache" in order to get a more accurate result in my routine? That is
without having to read another large file first?

AFAIK no.

Ciao,
Marc 'BlackJack' Rintsch
 
Signal

What do you mean by so little memory? It (the whole file) is cached by the
operating system totally independent of your program.

Please note I already stated it was more than likely done by the OS, and
described the tests that confirm that.
It (the whole file) is cached by the operating system totally independent
of your program, so the memory used does of course not show up in the memory
stats of your program... <snip>

In this case the OS is Windows and I'm monitoring the memory usage in Task
Manager, not through the script. The entire 759MB file is not being
cached in memory, and only 2MB of memory is used when the script runs.

You can see in the example script that I'm not storing the file in
memory (buf is "overwritten" at each read(size)) and no memory stats
are being kept there. Not sure where I might have implied otherwise,
but hope this clears that up.
You mean reading the file without actually reading it!? :)

Think you misunderstood.

What the "tests" are alluding to is:

a. The whole file itself is NOT being cached in memory.
b. If there is a mechanism by which it is "caching" something (which
obviously isn't the whole file itself), why not possibly take
advantage of it?

And sometimes there can be "tricks" to "initializing" before actually
reading/writing a file to help improve some performance (and not
necessarily via a cache).
 
Marc 'BlackJack' Rintsch

In this case the OS is Windows and I'm monitoring the memory usage in Task
Manager, not through the script. The entire 759MB file is not being
cached in memory, and only 2MB of memory is used when the script runs.

If you read from a file the operating system usually caches all read data
until there is no cache memory left, then some old cached data is
replaced. So if you read the whole 759 MB file, even in small blocks, and
have enough RAM, chances are that the whole file is in the cache. And of
course the memory consumption of the process is just the memory for the
interpreter, program and data. 2 MB sounds reasonable.
Think you misunderstood.

What the "tests" are alluding to is:

a. The whole file itself is NOT being cached in memory.

Everything read is cached as long as there's enough space in the cache.
b. If there is a mechanism by which it is "caching" something (which
obviously isn't the whole file itself), why not possibly take
advantage of it?

How? Your speedup comes from data in caches, but the time putting it
there was spent in the previous run. So you only gain something on
subsequent reads of the file.

Ciao,
Marc 'BlackJack' Rintsch
 
Neil Hodgson

Signal:
So it seems the file is being cached; however, on my system only ~2MB
of additional memory is used when the program is run. This 2MB of
memory is released when the script exits.

You are not measuring the memory used by the cache. This may help:
http://www.microsoft.com/technet/archive/ntwrkstn/reskit/07cache.mspx?mfr=true
2. Is there anyway to somehow to take advantage of this "caching" by
initializing it without reading through the entire file first?

The Win32 API provides FILE_FLAG_SEQUENTIAL_SCAN (I don't know how
effective this will be for your application), although it's probably
simpler to use a read-ahead thread or overlapped I/O.
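
For example, opening the file with that flag might look roughly like this
(a sketch only, assuming the third-party pywin32 package for win32file and
win32con; read_sequential is just an illustrative name):

import win32file, win32con   # third-party pywin32 package

def read_sequential(filename, bufsize=65536):
    # FILE_FLAG_SEQUENTIAL_SCAN hints that the file will be read once,
    # front to back, so the cache manager favours aggressive read-ahead.
    handle = win32file.CreateFile(
        filename,
        win32con.GENERIC_READ,
        win32con.FILE_SHARE_READ,
        None,                                # default security attributes
        win32con.OPEN_EXISTING,
        win32con.FILE_FLAG_SEQUENTIAL_SCAN,
        None)
    try:
        while True:
            hr, data = win32file.ReadFile(handle, bufsize)
            if not data:
                break
    finally:
        handle.Close()
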
3. If the answer to #2 is No, then is there a way to purge this
"cache" in order to get a more accurate result in my routine? That is
without having to read another large file first?

http://www.microsoft.com/technet/sysinternals/FileAndDisk/CacheSet.mspx

Neil
 
Hrvoje Niksic

Signal said:
2. Is there any way to somehow take advantage of this "caching" by
initializing it without reading through the entire file first?

3. If the answer to #2 is No, then is there a way to purge this
"cache" in order to get a more accurate result in my routine? That
is without having to read another large file first?

On a Unix system the standard way to purge the cache is to unmount the
file system and remount it. If you can't do that on Windows, you can
get the same effect by placing the test files on an external (USB)
hard drive; unplugging the drive and plugging it back in again will
almost certainly force the OS to flush any associated caches. Having
to do that is annoying, even as a last resort, but still better than
nothing.
 
Wolfgang Draxinger

Marc said:
You mean reading the file without actually reading it!? :)

Linux provides a specific syscall, 'readahead', that does exactly
this.

Wolfgang Draxinger
 
Nick Craig-Wood

Hrvoje Niksic said:
On a Unix system the standard way to purge the cache is to unmount the
file system and remount it.

If you are running linux > 2.6.18 then you can use
/proc/sys/vm/drop_caches for exactly that purpose.

http://www.linuxinsight.com/proc_sys_vm_drop_caches.html

E.g.

# free
             total       used       free     shared    buffers     cached
Mem:       1036396     954404      81992          0      33536     347384
# echo 1 > /proc/sys/vm/drop_caches
# free
             total       used       free     shared    buffers     cached
Mem:       1036396     658604     377792          0        348      91240
# echo 2 > /proc/sys/vm/drop_caches
# free
             total       used       free     shared    buffers     cached
Mem:       1036396     587296     449100          0        392      91284
# echo 3 > /proc/sys/vm/drop_caches
# free
             total       used       free     shared    buffers     cached
Mem:       1036396     588228     448168          0        692      91808
 
Nick Craig-Wood

Wolfgang Draxinger said:
Linux provides a specific syscall, 'readahead', that does exactly
this.

It isn't in the posix module, but you can use it with ctypes like this
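(a minimal sketch, assuming Linux with glibc loadable as "libc.so.6";
cache_file is just an illustrative name):

import os
from ctypes import CDLL, c_longlong, c_size_t

libc = CDLL("libc.so.6")

def cache_file(path):
    # readahead(fd, offset, count) asks the kernel to pull that byte
    # range into the page cache without copying it into this process.
    fd = os.open(path, os.O_RDONLY)
    try:
        libc.readahead(fd, c_longlong(0), c_size_t(os.fstat(fd).st_size))
    finally:
        os.close(fd)

cache_file("junk1.file")   # prime the cache before timing the real reads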

(That example could do with more ctypes magic to set the types and the
return type of readahead...)
 
Steve Holden

Hrvoje said:
That URL claims that you need to run "sync" before dropping the cache,
and so do other resources. I wonder if that means that dropping the
cache is unsafe on a running system.

Good grief. Just let the operating system do its job, for Pete's sake,
and go find something else to obsess about.

regards
Steve
 
Marc 'BlackJack' Rintsch

That URL claims that you need to run "sync" before dropping the cache,
and so do other resources. I wonder if that means that dropping the
cache is unsafe on a running system.

Of course not. It means that dirty pages are not dropped, so if you
really want to invalidate as much cache memory as possible you have to
``sync`` before.
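
For the original benchmarking use case that boils down to something like
this (a minimal sketch; Linux only, must be run as root, and
purge_page_cache is just an illustrative name):

import os

def purge_page_cache():
    # Flush dirty pages to disk first; drop_caches only discards clean pages.
    os.system("sync")
    # 3 = drop the page cache plus dentries and inodes.
    f = open("/proc/sys/vm/drop_caches", "w")
    try:
        f.write("3\n")
    finally:
        f.close()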

Ciao,
Marc 'BlackJack' Rintsch
 
Hrvoje Niksic

Steve Holden said:
Good grief. Just let the operating system do its job, for Pete's
sake, and go find something else to obsess about.

Purging the page cache for the purposes of benchmarking (such as
measuring cold start time of large applications) is an FAQ, not an
"obsession". No one is arguing that the OS shouldn't do its job in
the general case.
 
Nick Craig-Wood

Hrvoje Niksic said:
That URL claims that you need to run "sync" before dropping the cache,
and so do other resources. I wonder if that means that dropping the
cache is unsafe on a running system.

It isn't unsafe; the OS just can't drop pages which haven't been
synced to disk, so you won't get all the pages dropped unless you sync
first.
 
Hrvoje Niksic

Nick Craig-Wood said:
It isn't unsafe, the OS just can't drop pages which haven't been
synced to disk so you won't get all the pages dropped unless you
sync first.

Thanks for the clarification.
 
