Memory consumption when writing to files

Bart Van der Donck

Hello,

I am doing a bit of research about Perl's I/O memory consumption.

>" seems">
Using ">>" seems a very memory-efficient strategy for logging. As far
as I can see, the log file would not need to be opened. Web servers
use such techniques for HTTP logs - I've seen very busy web servers
that have no trouble writing many log lines simultaneously into
already huge files. My best guess would be that the new data is stored
separately in some memory blocks (in C?), and that the new blocks are
then somehow added to the original file. I could be wrong, but this
could maybe explain the low memory use of ">>". By this logic, I could
write at the beginning of a file with exactly the same amount of
memory.

Using Perl's ">" would require more memory, as the occupied memory
would first need to be freed, and then new memory is allocated to
store the new data.

A traditional "Edit File" action would stand apart from this. It
requires much more memory: namely (1) open, (2) edit, and (3) save.
Like you would for example modify a text-document on your computer.

I'm not very confident with low-level memory handling; maybe some of
my thoughts are wrong.

Are these mechanisms done by the OS? Or are they done by the Perl
executable in C (in a broader sense, by the application that processes
the data)? Or is this a general principle with identical results, e.g.
if I did the same from PHP/ASP?

Thanks,
 
Wolf Behrenhoff

Hello,

I am doing a bit of research about Perl's I/O memory consumption.

>" seems">
Using ">>" seems a very memory-efficient strategy for logging. As far
as I can see, the log file would not need to be opened.

What exactly are you referring to when writing ">>"? The only place that
makes sense to me is indeed "open". You will always need to open a file
even when only appending to it.
Web servers
use such techniques for HTTP logs - I've seen very busy web servers
that have no trouble writing many log lines simultaneously into
already huge files.

Simultaneously? Are you really sure? They don't use some sort of locking?
My best guess would be that the new data is stored
separately in some memory blocks (in C?), and that the new blocks are
then somehow added to the original file. I could be wrong, but this
could maybe explain the low memory use of ">>". By this logic, I could
write at the beginning of a file with exactly the same amount of
memory.

At any place in the file. One does not need additional memory.
Using Perl's ">" would require more memory, as the occupied memory
would first need to be freed, and then new memory is allocated to
store the new data.

I am a bit lost. What are you talking about? Opening a file with ">"
(without +) truncates it at the very beginning, that does not need to
allocate any memory at all.
A traditional "Edit File" action would stand apart from this. It
requires much more memory: namely (1) open, (2) edit, and (3) save.
Like you would for example modify a text-document on your computer.

Open requires almost no memory (just the file handle). "Edit" and "save"
- that's what a file editor offers you. On disk, you can only read, write,
seek and truncate. You cannot, for example, insert something in the middle.
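
Overwriting in place, for instance, is just a seek followed by a print.
A rough sketch (the file name here is only an example):

#!/usr/bin/perl
# Sketch: overwrite bytes in place at an arbitrary offset.
# This replaces existing bytes; it cannot insert new ones.
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

open my $fh, '+<', 'example.txt' or die "open: $!";  # read/write, no truncation
seek $fh, 10, SEEK_SET or die "seek: $!";            # jump to byte offset 10
print {$fh} 'XXXX';                                  # overwrites 4 bytes in place
close $fh or die "close: $!";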
I'm not very confident with low-level memory handling; maybe some of
my thoughts are wrong.

Are these mechanisms done by the OS? Or are they done by the Perl
executable in C (in a broader sense, by the application that processes
the data)? Or is this a general principle with identical results, e.g.
if I did the same from PHP/ASP?

I'd say it's a general principle - but I am still not sure I understood
you correctly, especially because you are talking about memory and
writing about editing a file which are two separate things. Maybe you
want to know something about memory consumption vs. disk space usage?

Wolf
 
Peter Makholm

Wolf Behrenhoff said:
Simultaneously? Are you really sure? They don't use some sort of locking?

I think it is common to just open log-files in append mode and expect
writes to be atomic. This is not true for large write requests though.
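
The usual pattern is roughly the following (only a sketch; the file name
and the flock are my own additions for illustration):

#!/usr/bin/perl
# Sketch: append-mode logging, one complete log entry per print.
# Small writes to an append handle are normally atomic on local
# file systems; flock is optional extra safety for long lines.
use strict;
use warnings;
use IO::Handle;
use Fcntl qw(:flock);

open my $log, '>>', 'access.log' or die "open: $!";
$log->autoflush(1);                      # don't let buffering split entries
flock $log, LOCK_EX or die "flock: $!";  # optional: serialize concurrent writers
print {$log} "one complete log line\n";
flock $log, LOCK_UN;
close $log or die "close: $!";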

//Makholm
 
Bart Van der Donck

Wolf said:
On 24.11.2010 10:26, Bart Van der Donck wrote:

What exactly are you referring to when writing ">>"? The only place that
makes sense to me is indeed "open". You will always need to open a file
even when only appending to it.


Simultaneously? Are you really sure? They don't use some sort of locking?

I suppose they do. I let the OS/C handle the queues/locks (if any),
since I believe it's not directly related to the situation.
At any place in the file. One does not need additional memory.

Sounds logical, but see the benchmarks further down.
I am a bit lost. What are you talking about? Opening a file with ">"
(without +) truncates it at the very beginning, that does not need to
allocate any memory at all.

Yes, '>' would free up memory, not allocate new memory.
Open requires almost no memory (just the file handle). "Edit" and "save"
- that's a file editor offers you. On disk, you can only read, write,
seek and truncate. You cannot, for example, add something in the middle.


I'd say it's a general principle - but I am still not sure I understood
you correctly, especially because you are talking about memory and
writing about editing a file which are two separate things. Maybe you
want to know something about memory consumption vs. disk space usage?

I'm interested in memory consumption (CPU), not disk space. Here are
some benchmarks... Roughly compared, (1) would be 10 times faster than
(2), and 100 times faster than (3).

-----------------------------------
(1) Using >
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
print time."\n";
open FILE, '>>', 'bigfile' or die $!;
print FILE 'newdata';
close FILE;
print time."\n";

%perl t1.pl
1290675301.52260
1290675301.52286
%perl t1.pl
1290675302.84315
1290675302.84342
%perl t1.pl
1290675304.69552
1290675304.69578
%perl t1.pl
1290675307.52355
1290675307.52380
%

Average: (26 + 27 + 26 + 25) / 4 = 26
I am surprised that this operation runs so incredibly fast. My
'bigfile' is 3MB!

-----------------------------------
(2) Using >>
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
print time."\n";
open FILE, '>', c or die $!;
print FILE 'newdata';
close FILE;
print time."\n";

%perl t2.pl
1290675371.92332
1290675371.92533
%perl t2.pl
1290675406.37242
1290675406.37437
%perl t2.pl
1290675410.75578
1290675410.75824
%perl t2.pl
1290675415.55636
1290675415.55867
%

Average: (201 + 195 + 246 + 231) / 4 = 218,25
Suppose that (1) allocates new memory blocks for 'newdata', and
afterwards joins these blocks to the existing blocks of 'bigfile':
this could maybe explain the extremely low memory consumption.
Suppose that (2) frees the blocks of 'newdata' (truncate action), and
then assigns new blocks for 'newdata': this could maybe explain why it
is 8.4 times slower.

-----------------------------------
(3) Edit a file
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
my $newdata = '1234567';
open READ, 'bigfile' or die $!;
while (<READ>) { $newdata.=$_; }
close READ;
print time."\n";
open FILE, '>', 'bigfile' or die $!;
print FILE $newdata;
close FILE;
print time."\n";

%perl t3.pl
1290676373.76196
1290676373.78672
%perl t3.pl
1290676395.75144
1290676395.77698
%perl t3.pl
1290676399.74201
1290676399.76632
%perl t3.pl
1290676403.67342
1290676403.70075
%

Average: (2476 + 2554 + 2431 + 2733) / 4 = 2548,50
The memory of $newdata would likely play a role here; but even then,
only the time of the write action is recorded.
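
As a side note, the same comparison could be repeated with the core
Benchmark module, which averages over many runs (just a sketch; note
that the first truncate shrinks 'bigfile', so the file would have to be
restored between runs for a fair test):

#!/usr/bin/perl
# Sketch: compare append vs. truncate with the core Benchmark module.
# Caveat: the truncate case empties 'bigfile' on its first run.
use strict;
use warnings;
use Benchmark qw(timethese);

timethese(1000, {
    'append (>>)'  => sub {
        open my $fh, '>>', 'bigfile' or die $!;
        print {$fh} 'newdata';
        close $fh;
    },
    'truncate (>)' => sub {
        open my $fh, '>', 'bigfile' or die $!;
        print {$fh} 'newdata';
        close $fh;
    },
});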
 
Wolf Behrenhoff

I'm interested in memory consumption (CPU), not disk space. Here are
some benchmarks... Roughly compared, (1) would be 10 times faster than
(2), and 100 times faster than (3).

-----------------------------------
(1) Using >
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
print time."\n";
open FILE, '>>', 'bigfile' or die $!;
print FILE 'newdata';
close FILE;
print time."\n";

%perl t1.pl
1290675301.52260
1290675301.52286

Average: (26 + 27 + 26 + 25) / 4 = 26
I am surprised that this operation runs so incredibly fast. My
'bigfile' is 3MB!

This very much depends on the file system. When writing only a few bytes
of data, it might even be dominated by the call to "open". On some file
systems, open can take a very long time, e.g. if the data is on tape and
needs to be staged to disk first, or if it is a network file system,
or ...

But how do you measure memory consumption using execution time?!
-----------------------------------
(2) Using >>
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
print time."\n";
open FILE, '>', c or die $!;

Bareword "c" not allowed while "strict subs" in use at - line 6.
print FILE 'newdata';
close FILE;
print time."\n";

Average: (201 + 195 + 246 + 231) / 4 = 218,25
Suppose that (1) allocates new memory blocks for 'newdata', and
afterwards joins these blocks to the existing blocks of 'bigfile':
this could maybe explain the extremely low memory consumption.

???
The existing blocks of bigfile are not touched by Perl. It depends on
the file system how this is done. Perl just forwards the request to
append "newdata" to the OS/file system driver.
Suppose that (2) frees the blocks of 'newdata' (truncate action), and
then assigns new blocks for 'newdata': this could maybe explain why it
is 8.4 times slower.

No. Perl does not need to free anything. It just tells the OS to open
the file in write mode. Suppose bigfile is on a different server
attached via NFS. Perl never sees the old content of the file and does
not allocate and/or free any memory for that.
-----------------------------------
(3) Edit a file
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
my $newdata = '1234567';
open READ, 'bigfile' or die $!;
while (<READ>) { $newdata.=$_; }
close READ;
print time."\n";
open FILE, '>', 'bigfile' or die $!;
print FILE $newdata;
close FILE;
print time."\n";

This is rewriting the file completely. (I thought "editing" in this case
would mean overwriting some data somewhere in the file.)
Average: (2476 + 2554 + 2431 + 2733) / 4 = 2548,50
The memory of $newdata would likely play a role here; but even then,
only the time of the write action is recorded.

Now, in this algorithm you allocate a lot of memory. Because you are
reading line by line and making $newdata larger and larger, Perl will
need to allocate (or realloc) more memory for $newdata again and again
(depending on the file size). You could also solve this by creating a
new file, printing $newdata to it and then appending the contents of
the old file in blocks of some fixed size. At the end, unlink the old
file and rename the new one. That will need more disk space but less
memory.
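
A sketch of that approach (the file names are only examples):

#!/usr/bin/perl
# Sketch: prepend $newdata without holding the whole file in memory.
use strict;
use warnings;

my $newdata = '1234567';

open my $in,  '<', 'bigfile'     or die "read: $!";
open my $out, '>', 'bigfile.new' or die "write: $!";

print {$out} $newdata;

my $buf;
while (read($in, $buf, 64 * 1024)) {   # copy in 64 kB blocks
    print {$out} $buf;
}

close $in;
close $out or die "close: $!";

# on POSIX systems rename replaces the old file in one step
rename 'bigfile.new', 'bigfile' or die "rename: $!";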

But still, timing IO operations and then drawing conclusions about memory
consumption will not result in anything useful. IO operations only need
the file handle. OK, if you're reading, you also need memory for the
amount of data you want to read, but that's it.

Wolf
 
Bart Van der Donck

Wolf said:
Bareword "c" not allowed while "strict subs" in use at - line 6.

Typing mistake - should be

open FILE, '>', 'bigfile' or die $!;
The existing blocks of bigfile are not touched by Perl. It depends on
the file system how this is done. Perl just forwards the request to
append "newdata" to the OS/file system driver.

Thanks for that insight. I was in doubt all the time between Perl, C
and the OS.
No. Perl does not need to free anything.

Well, I'm not saying it is Perl that frees the memory. But the fact
that it *is* freed is beyond all doubt.
It just tells the OS to open the file in write mode.

So then the OS is the reason why (1) runs 10 times faster than (2)?
Now, in this algorithm you allocate a lot of memory. Because you are
reading line by line and making $newdata larger and larger again, Perl
will need to allocate (or realloc) more memory for $newdata again and
again (depending on the file size).

Yes, but that would not show up in my benchmarks; 'time' is only
taken after $newdata is complete. Anyway, IMHO (3) cannot be
compared to (1) and (2) because of the huge $newdata. Still, I have no
idea why (2) is so much heavier than (1). Probably nothing to do with
Perl, but some low-level OS reason.
But still, timing IO operations and then drawing conclusions on memory
consumption will not result in anything useful. IO operations only need
the file handle, that's it. Ok, if you're reading, you need memory for
the amount of data you want to read, but that's it.

I know of two ways: one is watching the RAM usage evolve, as in 'top';
the other is the execution time. Agreed that the latter is absolutely
not exact, but in my experience it is a good indicator on a quiet
machine if you don't do anything special. My benchmarks showed similar
values over repeated runs; I would say they give a quick general idea.
 
sln

I'm interested in memory consumption (CPU), not disk space. Here are
some benchmarks... Roughly compared, (1) would be 10 times faster than
(2), and 100 times faster than (3).

-----------------------------------
(1)
-----------------------------------

open FILE, '>>', 'bigfile' or die $!;
print FILE 'newdata';
close FILE;

Average: (26 + 27 + 26 + 25) / 4 = 26
I am surprised that this operation runs so incredibly fast. My
'bigfile' is 3MB!

-----------------------------------
(2)
-----------------------------------

open FILE, '>', c or die $!;
print FILE 'newdata';
close FILE;
Average: (201 + 195 + 246 + 231) / 4 = 218,25
Suppose that (1) allocates new memory blocks for 'newdata', and
afterwards joins these blocks to the existing blocks of 'bigfile':
this could maybe explain the extremely low memory consumption.
Suppose that (2) frees the blocks of 'newdata' (truncate action), and
then assigns new blocks for 'newdata': this could maybe explain why it
is 8.4 times slower.

I used a 3.5 MB file. To mitigate the file cache, I used system(copy)
every time. I don't get the factor of 10 you do going from append to
truncate, more like a factor of 5 (but this could be because of the
RAID array). This is using NTFS. Not sure what *nix uses (maybe hpfs).

Tests were append/close and truncate/close; then a write was thrown in
for the last tests using the same modes. The write (its size doesn't
really matter) costs an amount proportional to the data written and
independent of append or truncate, so it factors out.

Apparently, for 'truncate', the file system returns the sectors to the
free pool in the allocation tables. I believe it attempts local combining
around the clusters it returns. So, if the file is fragmented, the process
could take a varying amount of time.

This does not happen with 'append', which accounts for the extra time
truncation takes, depending on the fragmentation.

Writing data, on the other hand, can't be mitigated, since the free pool
is altered depending on whether the file was truncated or appended to;
both see a different view of the free pool. But that is a separate test,
probably not measurable this way.

So the write must be taken out of the test, leaving only the open for
truncate or append.

The debugging process is now over.

I would conclude:
Truncation returns sector allocation blocks to the free pool, which takes
a little extra time, whereas appending does not.

-sln

----------------
use strict;
use warnings;
use Time::HiRes qw(time);

print "Open Append/Close\n";
for (1..5) {
system ("copy bigfile_orig bigfile > ttt.txt");
my $t1 = time;
open FILE, '>>', 'bigfile' or die $!;
close FILE;
my $t2 = time;
printf " %.2f\n", (($t2 - $t1)*100000);
}
print "Open Truncate/Close\n";
for (1..5) {
system ("copy bigfile_orig bigfile > ttt.txt");
my $t1 = time;
open FILE, '>', 'bigfile' or die $!;
close FILE;
my $t2 = time;
printf " %.2f\n", (($t2 - $t1)*100000);
}

print "\nOpen Append/Write/Close\n";
for (1..5) {
system ("copy bigfile_orig bigfile > ttt.txt");
my $t1 = time;
open FILE, '>>', 'bigfile' or die $!;
print FILE 'newdata';
close FILE;
my $t2 = time;
printf " %.2f\n", (($t2 - $t1)*100000);
}
print "Open Truncate/Write/Close\n";
for (1..5) {
system ("copy bigfile_orig bigfile > ttt.txt");
my $t1 = time;
open FILE, '>', 'bigfile' or die $!;
print FILE 'newdata';
close FILE;
my $t2 = time;
printf " %.2f\n", (($t2 - $t1)*100000);
}

__END__

output:

Open Append/Close
11.59
8.89
12.11
8.99
8.99
Open Truncate/Close
61.61
59.99
48.49
57.98
50.71

Open Append/Write/Close
18.81
17.00
17.60
17.40
19.60
Open Truncate/Write/Close
59.58
66.11
78.11
76.01
67.69
 
Peter J. Holzer

I'm interested in memory consumption (CPU), not disk space.

Then why do you measure CPU time and not memory consumption? If you are
interested in RAM usage, measure RAM usage!
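
On Linux, for example, you could read the process's own VmRSS from /proc
around the operation of interest (only a sketch; on other systems
something like "ps -o rss= -p $$" would do a similar job):

#!/usr/bin/perl
# Sketch: report the resident set size before and after an operation
# (Linux-specific: parses /proc/self/status).
use strict;
use warnings;

sub rss_kb {
    open my $fh, '<', '/proc/self/status' or die "proc: $!";
    while (<$fh>) {
        return $1 if /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

my $before = rss_kb();
open my $out, '>>', 'bigfile' or die $!;
print {$out} 'newdata';
close $out;
my $after = rss_kb();

printf "RSS before: %d kB, after: %d kB\n", $before, $after;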
Here are
some benchmarks... Roughly compared, (1) would be 10 times faster than
(2), and 100 times faster than (3).

-----------------------------------
(1) Using >
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
print time."\n";
open FILE, '>>', 'bigfile' or die $!;

Note: Your headline doesn't match the code. I assume that the ">" in
the headline is a typo and that the ">>" in the code is correct.
print FILE 'newdata';
close FILE;
print time."\n"; [...]
Average: (26 + 27 + 26 + 25) / 4 = 26
I am surprised that this operation runs so incredibly fast. My
'bigfile' is 3MB!

3 MB isn't big. And it (almost) doesn't matter how big the file is,
since you are only appending a small, fixed amount of data at the end.

This program probably causes only one write to a data block and one
write to an inode, and these writes are probably also asynchronous.

-----------------------------------
(2) Using >>
-----------------------------------

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);
print time."\n";
open FILE, '>', c or die $!;

Again, your code doesn't match your headline ('>' vs. '>>') and I'm
assuming that your code is correct (although the 'c' is suspicious).
print FILE 'newdata';
close FILE;
print time."\n"; [...]
Average: (201 + 195 + 246 + 231) / 4 = 218,25
Suppose that (1) allocates new memory blocks for 'newdata', and
afterwards joins these blocks to the existing blocks of 'bigfile':
this could maybe explain the extremely low memory consumption.
Suppose that (2) frees the blocks of 'newdata' (truncate action), and
then assigns new blocks for 'newdata': this could maybe explain why it
is 8.4 times slower.

Yes. If your file is 3 MB at the beginning and your file system uses a
block size of 4 kB, then the open causes about 750 blocks to be freed.
After that, a single new block is allocated and written.

Again, all this probably happens asynchronously (the times you reported
are in 1E-5 seconds, so "218,25" is really 2.1825 ms, which isn't enough
to write even a single block to a rotating disk), but the OS has to work
a lot more to truncate a 3 MB file than to append a few bytes to it.

hp
 
Bart Van der Donck

Peter said:
Note: Your headline doesn't match the code. I assume that the ">"
in the headline is a typo and that the ">>" in the code is correct.

Yes, I accidentally swapped them.
      print FILE 'newdata';
      close FILE;
      print time."\n"; [...]
Average: (26 + 27 + 26 + 25) / 4 = 26
I am surprised that this operation runs so incredibly fast. My
'bigfile' is 3MB!

3 MB isn't big. And it (almost) doesn't matter how big the file is,
since you are only appending a small, fixed amount of data at the end.
This program probably causes only one write to a data block and one
write to an inode, and these writes are probably also asynchronous.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Time::HiRes qw(time);
      print time."\n";
      open FILE, '>', c or die $!;

Again, your code doesn't match your headline ('>' vs. '>>') and I'm
assuming that your code is correct (although the 'c' is suspicious).

Yes, >> should be > and vice versa. 'c' should be 'bigfile'.
      print FILE 'newdata';
      close FILE;
      print time."\n"; [...]
Average: (201 + 195 + 246 + 231) / 4 = 218,25
Suppose that (1) allocates new memory blocks for 'newdata', and
afterwards joins these blocks to the existing blocks of 'bigfile':
this could maybe explain the extremely low memory consumption.
Suppose that (2) frees the blocks of 'newdata' (truncate action), and
then assigns new blocks for 'newdata': this could maybe explain why it
is 8.4 times slower.

Yes. If your file is 3 MB at the beginning and your file system uses a
block size of 4 kB, then the open causes about 750 blocks to be freed.
After that, a single new block is allocated and written.

Again, all this probably happens asynchronously (the times you reported
are in 1E-5 seconds, so "218,25" is really 2.1825 ms, which isn't enough
to write even a single block to a rotating disk), but the OS has to work
a lot more to truncate a 3 MB file than to append a few bytes to it.

'>>' = take the necessary (few) bytes from the free pool + assign the
content to them ('newdata') + write to an inode so that it's clear
that both ('bigfile' and 'newdata') belong to the same file. Still
pleasantly surprised that the last action doesn't directly deal with
'bigfile'; that must be the reason why e.g. Apache's log processing is
so efficient.

'>' = truncate 'bigfile' (return these bytes to the free pool) + take
the necessary (few) bytes from the free pool + assign the content
('newdata'). The first action has to deal with a lot of blocks, thus
indeed explaining the longer execution time.

Actually, this is what I suspected in my first article (see "using >"
vs. "using >>"); good that it's now cleared up.
 
Wolf Behrenhoff

'>>' = take the necessary (few) bytes from the free pool + assign the
content to them ('newdate') + write to an inode so that it's clear
that both ('bigfile' and 'newdata') belong to the same file. Still
pleasantly surprised that the last action doesn't directly deal with
'bigfile'; that must be the reason why eg. Apache's log processing is
so efficient.

Again, this is highly file system dependent!

If blocks have a size of 4 kB, adding a few bytes to a file that is 1 byte
in size will write into the very same block. Maybe the storage medium can
only write chunks of a certain size (which might be larger than "newdata");
in that case, a read needs to be done first. Some other file systems don't
even support appending to a file. And then there is the cache of the OS,
which you didn't take into account. You are measuring the performance of
the caching mechanism plus maybe some file system and disk performance.
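
If you wanted to take the OS write cache at least partly out of such
timings, you could fsync the handle before stopping the clock (only a
sketch; even that says nothing about the disk's own cache):

#!/usr/bin/perl
# Sketch: include the flush to disk in the measured time.
use strict;
use warnings;
use IO::Handle;
use Time::HiRes qw(time);

my $t1 = time;
open my $fh, '>>', 'bigfile' or die $!;
print {$fh} 'newdata';
$fh->flush or die "flush: $!";   # empty Perl's own buffer
$fh->sync  or die "sync: $!";    # ask the OS to write to the device (fsync)
close $fh;
my $t2 = time;

printf "%.6f seconds\n", $t2 - $t1;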

I still don't get what this has got to do with Perl and/or memory
consumption.

Cheers, Wolf
 
