software design question

ccc31807

I have an application (the same one that my last few posts have
concerned) that is growing rather large. It's not OO mostly for the
reason that I've never done OO in Perl and didn't want to learn on
this one.

I have a couple of modules that are responsible for collecting user
data, a couple of modules that are responsible for the control logic,
a couple of modules that are responsible for calculating and preparing
the content, and an SQL module that handles the database stuff. I seem
to have a choice in the way the content is passed as the output of the
program: (1) either return the values to a main function that prints
the output, or (2) print the output directly in the function that
calculates and prepares the output. IOW, I can do something like the
following:

FIRST OPTION:
my $content = calculate_content( \@vars);
print $content;
....
sub calculate_content
{
    my $varref = shift;
    my $output;
    # do stuff like
    $output .= output_from_other_functions();
    return $output;
}

SECOND OPTION:
calculate_content( \@vars);
....
sub calculate_content
{
    my $varref = shift;
    my $output;
    # do stuff like
    $output .= output_from_other_functions();
    print $output;
}

In outputting the data, I can do it in the main function or do it in
helper functions. Which option is better? Why?

My preference is for (1) because I like to assign my variables in the
main function rather than hide the assignments in subroutines, but it
seems rather verbose and unnecessary here, so I'm tending toward (2),
but I'm not comfortable with it.

Thanks, CC.
 
Tad J McClellan

ccc31807 said:
I seem
to have a choice in the way the content is passed as the output as the
program: (1) either return the values to a main function that prints
the output, or (2) printing the output directly in the function that
calculates and prepares the output.

In out-putting the data, I can do it in the main function or do it in
helper functions. Which option is better? Why?


I recommend following Uri's rule:

print rarely, print late

http://groups.google.com/group/comp.lang.perl.misc/msg/10f151df27e050b2
My preference is for (1) because I like to assign my variables in the
main function rather than hide the assignments in subroutines, but it
seems rather verbose and unnecessary here, so I'm tending toward (2),
but I'm not comfortable with it.


Good. You should be uncomfortable with (2). It is less flexible.

What if you later need to print somewhere other than STDOUT?

Easy with (1), less easy with (2).

What if you later need to further process the output? (eg. wrap it in HTML).

Easy with (1), less easy with (2).


Coding for ease of maintenance is a Really Good Idea, therefore
returning strings (rather than printing strings) is also a Good Idea.
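
A minimal sketch of why option (1) stays flexible. The body of calculate_content() and the sample data are made up for illustration; the point is only that a function which returns a string lets the caller pick the destination or post-process the text.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the thread's calculate_content(): it builds
# and returns a string and never prints.
sub calculate_content {
    my ($vars) = @_;
    my $output = '';
    $output .= "$_\n" for @$vars;
    return $output;
}

my $content = calculate_content( [ 'alpha', 'beta' ] );

# The caller decides where the text goes: STDOUT ...
print $content;

# ... or another handle (an in-memory one here, purely for illustration) ...
my $log = '';
open( my $fh, '>', \$log ) or die "cannot open in-memory handle: $!";
print {$fh} $content;
close $fh;

# ... or further processing before printing, e.g. a naive HTML wrapper
# (no escaping is done, so this is not production HTML generation).
my $html = "<pre>\n$content</pre>\n";
print $html;
```

With option (2) each of these variations would mean editing calculate_content() itself or redirecting its output from the outside.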
 
Peter J. Holzer

I recommend following Uri's rule:

print rarely, print late

http://groups.google.com/group/comp.lang.perl.misc/msg/10f151df27e050b2

In general, I agree with that.

Good. You should be uncomfortable with (2). It is less flexible.

It is also harder to test.
What if you later need to print somewhere other than STDOUT?

perldoc -f select

Or you could redirect STDOUT to a string (this is what I usually do if I
need to test a function which prints to STDOUT).

So it is possible, but it may be awkward.
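
A sketch of the select-plus-string approach, assuming a hypothetical print_content() that prints directly. The helper name capture_output() is also invented here. It only catches prints to the default handle; a function that says "print STDOUT ..." explicitly would bypass it, and the handle is not restored if the code dies.

```perl
use strict;
use warnings;

# Hypothetical function under test that prints directly (option 2).
sub print_content {
    my ($vars) = @_;
    print "$_\n" for @$vars;
    return;
}

# Run a code ref with the default output handle temporarily pointed at
# an in-memory file, and return whatever it printed.
sub capture_output {
    my ($code) = @_;
    my $buffer = '';
    open( my $fh, '>', \$buffer ) or die "cannot open in-memory handle: $!";
    my $old = select($fh);    # print without a handle now goes to $fh
    $code->();
    select($old);             # restore the previous default handle
    close $fh;
    return $buffer;
}

my $captured = capture_output( sub { print_content( [ 'one', 'two' ] ) } );
```

The extra plumbing is the awkwardness: a function that returns its string needs none of it to be tested.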
returning strings (rather than printing strings) is also a Good Idea.

Except when your strings are really huge. I have an application where
results in the tens of megabytes are normal and gigabytes possible.

I didn't really expect that (lack of foresight on my part), so I
construct the whole result (an XML file, usually) as a string in memory
before printing it. This needs a lot of memory and it is slower than
necessary (printing only starts when the computation is finished), so
I'd like to change that but the structure of the application doesn't
make that easy.

So while "print rarely, print late" is fine most of the time, you should
still think whether it is appropriate for your specific situation.
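
One way to keep the decision about where output goes in the caller without building the whole result in memory is to pass in a callback. This is only a sketch under that assumption; generate_records() and its XML-like output are invented for illustration and are not the application described above.

```perl
use strict;
use warnings;

# Hypothetical generator: instead of returning one huge string, it hands
# each chunk to a caller-supplied callback as soon as it is ready.
sub generate_records {
    my ( $count, $emit ) = @_;
    $emit->("<records>\n");
    for my $i ( 1 .. $count ) {
        $emit->("  <record id=\"$i\"/>\n");
    }
    $emit->("</records>\n");
    return;
}

# Stream straight to STDOUT: nothing accumulates in the program.
generate_records( 3, sub { print $_[0] } );

# Or collect into a string when the result is known to be small.
my $small = '';
generate_records( 2, sub { $small .= $_[0] } );
```

The generator still never prints by itself, so it stays testable, but printing can start before the computation finishes.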

hp
 
ccc31807

Except when your strings are really huge. I have an application where
results in the tens of megabytes are normal and gigabytes possible.

I don't expect my strings to be really huge, but they could possibly be
on the large side. If it comes to it, I'll return a reference to a
scalar to the main function.
I didn't really expect that (lack of foresight on my part), so I
construct the whole result (an XML file, usually) as a string in memory
before printing it. This needs a lot of memory and it is slower than
necessary (printing only starts when the computation is finished), so
I'd like to change that but the structure of the application doesn't
make that easy.

I sometimes have to output a lot of data. I can't think of a single
time I've not written to a text file first, and then sent the file to
a printer, usually non-programmatically but on occasion in the script.
My users expect a file in some particular format, like Word or Excel
or PDF, so it's easier for me to do it this way.
So while "print rarely, print late" is fine most of the time, you should
still think whether it is appropriate for your specific situation.

I wasn't familiar with this rule in this context, but it's a good
codification of a best practice. An exception might be where your app
is computationally expensive and your IO is not, maybe where you are
picking a few values out of a large database. In general, my scripts
don't print until the very last.

CC
 
Uri Guttman

PJH> Except when your strings are really huge. I have an application where
PJH> results in the tens of megabytes are normal and gigabytes possible.

so return scalar refs to those strings. that is what i do in many cases
anyway to save on copying.

PJH> I didn't really expect that (lack of foresight on my part), so I
PJH> construct the whole result (an XML file, usually) as a string in
PJH> memory before printing it. This needs a lot of memory and it is
PJH> slower than necessary (printing only starts when the computation
PJH> is finished), so I'd like to change that but the structure of the
PJH> application doesn't make that easy.

and there are modules which ONLY print for you or to a single handle and
cause many problems if you want flexibility. hence my rule. :)

PJH> So while "print rarely, print late" is fine most of the time, you should
PJH> still think whether it is appropriate for your specific situation.

always think about whether something is appropriate! :) but scalar refs
solve the problem of passing around big strings. you still need at least
one large buffer for it though.
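
a small sketch of the scalar ref idea. build_report() and its contents are hypothetical; the point is that the sub returns a reference, so the caller works on the same buffer instead of receiving a copy.

```perl
use strict;
use warnings;

# Hypothetical builder that returns a reference to its result, so the
# (possibly large) string is not copied on return.
sub build_report {
    my ($lines) = @_;
    my $text = '';
    $text .= "line $_\n" for 1 .. $lines;
    return \$text;
}

my $report_ref = build_report(3);

# Work on the string through the reference ...
my $length_before = length ${$report_ref};
${$report_ref} =~ s/^line 1\n//;    # edits the one buffer in place

# ... and print it late, once.
print ${$report_ref};
```

the one large buffer is still there, as noted above; the reference only avoids making extra copies of it.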

uri
 
brian d foy

Glenn Jackman said:
Name your subroutines to describe what they're doing:

FIRST OPTION: print generate_content( \@vars );
SECOND: emit_results( \@vars );

Personally, I'd favour the second. It would also lend itself to acting
as a dispatcher should you choose to output results in different
formats (html, csv, etc)

I like to have both of those. One always returns the content as a
string, and one prints to a filehandle:

emit_results( $fh, \@vars ); # $fh is any sort of handle

sub generate_results { ... }

sub emit_results {
    my( $fh, $vars ) = @_;

    print $fh generate_results( $vars );
}
 
Uri Guttman

bdf> I like to have both of those. One always returns the content as a
bdf> string, and one prints to a filehandle:

bdf> emit_results( $fh, \@vars ); # $fh is any sort of handle

bdf> sub generate_results { ... }

bdf> sub emit_results {
bdf> my( $fh, $vars ) = @_;

bdf> print $fh generate_results( $vars );
bdf> }

i don't see the major win having a separate emit routine. it is just the
one line print statement so you have more code for little benefit. if
you needed to print it in many places it could be a minor savings.

uri
 
Dr.Ruud

Uri said:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.

Or store the strings in an array (and pass around its ref) because

print LIST
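
A short sketch of that idea, with a hypothetical collect_chunks() and made-up data. It assumes the output separators $, and $\ are left at their defaults, so printing the list gives the same bytes as printing the joined string.

```perl
use strict;
use warnings;

# Hypothetical producer that collects output chunks in an array instead
# of concatenating them.
sub collect_chunks {
    my ($vars) = @_;
    my @chunks;
    push @chunks, "item: $_\n" for @$vars;
    return \@chunks;
}

my $chunks = collect_chunks( [ 'a', 'b', 'c' ] );

# print takes a LIST, so the chunks can be printed without joining them.
print @$chunks;
```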
 
Uri Guttman

R> Uri Guttman said:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.

R> Or store the strings in an array (and pass around its ref) because

R> print LIST

arrays take up more storage than a large scalar. and i like to process
whole files with regexes vs looping over an array of lines. it is faster
and usually simpler too. line loops usually require some form of state
(in a paragraph or not, in pod or not, etc.) but you can do the same by
grabbing a whole section in a regex and looping with while. last week i
was working with my intern group and someone wrote a basic pod extractor
with looping over lines. it was longish (20 or so lines) and
stateful. we replaced it with a single regex in a while loop working on
the whole file in a string.
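
a rough sketch of the two styles, not the code from that session. it uses a deliberately simplified idea of pod (a block starts at a line beginning with "=" plus a letter and runs through the next "=cut" line or end of input), and both sub names are made up.

```perl
use strict;
use warnings;

# Line loop: needs a state flag to remember whether we are inside pod.
sub extract_pod_by_lines {
    my ($text) = @_;
    my $in_pod = 0;
    my $pod    = '';
    for my $line ( split /^/m, $text ) {
        if ( $line =~ /^=cut\b/ ) {
            $pod .= $line if $in_pod;
            $in_pod = 0;
        }
        elsif ( $line =~ /^=[a-zA-Z]/ ) {
            $in_pod = 1;
            $pod .= $line;
        }
        elsif ($in_pod) {
            $pod .= $line;
        }
    }
    return $pod;
}

# Whole string: one regex grabs each pod section, no state variable.
sub extract_pod_by_regex {
    my ($text) = @_;
    my $pod = '';
    while ( $text =~ /^(=(?!cut\b)[a-zA-Z].*?(?:^=cut\b[^\n]*\n?|\z))/msg ) {
        $pod .= $1;
    }
    return $pod;
}

my $source = <<'END';
print "hello\n";

=head1 NAME

demo

=cut

print "bye\n";

=head2 More

text

=cut
END

my $by_lines = extract_pod_by_lines($source);
my $by_regex = extract_pod_by_regex($source);
print $by_regex;
```

the state that the line loop carries in $in_pod is carried by the regex itself in the second version.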

uri
 
Uri Guttman

bdf> sub emit_results {
bdf> my( $fh, $vars ) = @_;

bdf> print $fh generate_results( $vars );
bdf> }

GJ> I'm operating on the assumption that the generate and/or emit routines
GJ> are large. I prefer my "main" sub to be as concise as possible.

a generate sub can be big or small and that isn't important. but an emit
sub should be short as its guts will be a call to the generate sub. and
it will be slower as it needs a nested sub call and its overhead. just
print directly where you are making the decision is what i do.

uri
 
brian d foy

Uri Guttman said:
i don't see the major win having a separate emit routine. it is just the
one line print statement so you have more code for little benefit.

You can override either without disturbing the other.
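
One reading of "override" here is subclassing, which goes beyond the plain functions shown earlier in the thread. The sketch below assumes that reading; the Report and Report::Tabbed packages and their output formats are invented for illustration.

```perl
use strict;
use warnings;

package Report;

sub new { my ($class) = @_; return bless {}, $class }

# Returns the content as a string.
sub generate_results {
    my ( $self, $vars ) = @_;
    return join( ',', @$vars ) . "\n";
}

# Prints whatever generate_results() returns to the given handle.
sub emit_results {
    my ( $self, $fh, $vars ) = @_;
    print {$fh} $self->generate_results($vars);
    return;
}

package Report::Tabbed;
our @ISA = ('Report');

# Override only the generation; emit_results() is inherited untouched.
sub generate_results {
    my ( $self, $vars ) = @_;
    return join( "\t", @$vars ) . "\n";
}

package main;

my $out = '';
open( my $fh, '>', \$out ) or die "cannot open in-memory handle: $!";
Report::Tabbed->new->emit_results( $fh, [ 'a', 'b' ] );
close $fh;
print $out;
```

A subclass could equally override emit_results() (say, to add logging) and leave generate_results() alone.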
 
David Combs

R> Uri Guttman said:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.

R> Or store the strings in an array (and pass around its ref) because

R> print LIST

arrays take up more storage than a large scalar. and i like to process
whole files with regexes vs looping over an array of lines. it is faster
and usually simpler too. line loops usually require some form of state
(in a paragraph or not, in pod or not, etc.) but you can do the same by
grabbing a whole section in a regex and looping with while. last week i
was working with my intern group and someone wrote a basic pod extractor
with looping over lines. it was longish (20 or so lines) and
stateful. we replaced it with a single regex in a while loop working on
the whole file in a string.

uri


Uri, could you show a quick example of each, even using pseudocode --
just enough to make this important principle a bit more concrete?

Thanks!

David
 
sln

"R" == Ruud <[email protected]> writes:

R> Uri Guttman said:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.

R> Or store the strings in an array (and pass around its ref) because

R> print LIST

arrays take up more storage than a large scalar. and i like to process
whole files with regexes vs looping over an array of lines. it is faster
and usually simpler too. line loops usually require some form of state
(in a paragraph or not, in pod or not, etc.) but you can do the same by
grabbing a whole section in a regex and looping with while. last week i
was working with my intern group and someone wrote a basic pod extractor
with looping over lines. it was longish (20 or so lines) and
stateful. we replaced it with a single regex in a while loop working on
the whole file in a string.

uri


Uri, could you show a quick example of each, even using pseudocode --
just enough to make this important principle a bit more concrete?

Thanks!

David

OMG another important principle to massage ego..
gimme a break man.

-sln
 
Dr.Ruud

Uri said:
Ruud:
Uri:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.

Or store the strings in an array (and pass around its ref) because

print LIST

arrays take up more storage than a large scalar.

I don't see that as an important factor here.

and i like to process
whole files with regexes vs looping over an array of lines.

I like to build parsers from combining small regexps. Often you need a
state machine (can lead to a while loop) because the file contains data
and meta data. I also like (nested) state machines a lot.
 
Peter J. Holzer

Uri said:
Ruud:
Uri:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.

Or store the strings in an array (and pass around its ref) because

print LIST

arrays take up more storage than a large scalar.

I don't see that as an important factor here.

Remember that the problem referred to above was memory requirements. So
it may not be an important factor *elsewhere*, but it is *the* important
factor *here*.

I wrote a small test program (see below) and used it to read the
contents of a 6.8 MB e-mail message into memory and measure the VM
usage:


mode=scalarref, filesize=6.7997MB, before = 5.08203MB, after = 11.8828MB, diff= 6.80078MB
mode=scalar, filesize=6.7997MB, before = 5.08203MB, after = 18.6836MB, diff=13.6016MB
mode=scalarcopy, filesize=6.7997MB, before = 5.08203MB, after = 18.6836MB, diff=13.6016MB
mode=arrayloop, filesize=6.7997MB, before = 5.08203MB, after = 15.4023MB, diff=10.3203MB
mode=array, filesize=6.7997MB, before = 5.08203MB, after = 26.1133MB, diff=21.0312MB

so in this example using an array with the common idiom "@lines = <$fh>"
has an overhead of about 200%. No problem if you are dealing with a few
MB, but if your files are a few hundred MB, that may just make the
difference between feasible and infeasible. If your file has very short
lines, it gets worse. If you split the lines into fields (say, you want
to represent a comma-separated file as an AoA), it gets much worse.


#!/usr/bin/perl
use warnings;
use strict;

use constant M => 1024*1024;

# Arguments: FILENAME MODE
# MODE is one of: scalar, scalarcopy, scalarref, array, arrayloop
my $size0 = getvmsize();
my $contents = readfile(@ARGV);
my $size1 = getvmsize();

printf "before = %gMB, after = %gMB, diff=%gMB\n",
    $size0/M, $size1/M, ($size1-$size0)/M;

exit(0);

sub readfile {
    my ($filename, $mode) = @_;
    open (my $fh, '<', $filename) or die "cannot open $filename: $!";
    if ($mode eq 'scalar') {
        # slurp and return the string directly
        local $/;
        return <$fh>;
    } elsif ($mode eq 'scalarcopy') {
        # slurp into a lexical, then return (a copy of) it
        local $/;
        my $contents = <$fh>;
        return $contents;
    } elsif ($mode eq 'scalarref') {
        # slurp into a lexical and return a reference to it
        local $/;
        my $contents = <$fh>;
        return \$contents;
    } elsif ($mode eq 'array') {
        # read all lines at once in list context
        my @lines = <$fh>;
        return \@lines;
    } elsif ($mode eq 'arrayloop') {
        # read line by line and push onto the array
        my @lines;
        while (<$fh>) {
            push @lines, $_;
        }
        return \@lines;
    }
}


sub getvmsize {
    # from linux/Documentation/filesystems/proc.txt
    #  size      total program size
    #  resident  size of memory portions
    #  shared    number of pages that are shared
    #  trs       number of pages that are 'code'
    #  drs       number of pages of data/stack
    #  lrs       number of pages of library
    #  dt        number of dirty pages
    open (my $fh, '<', "/proc/$$/statm")
        or die "cannot open /proc/$$/statm: $!";
    my $line = <$fh>;
    my ($size, $resident, $shared, $trs, $drs, $lrs, $dt) = split(/\s+/, $line);
    return $size * 4096; # XXX hard-coded 4096-byte page size
}
__END__

hp
 
Dr.Ruud

Peter said:
Dr.Ruud:
Uri Guttman:
Ruud:
Uri:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.
Or store the strings in an array (and pass around its ref) because

print LIST
arrays take up more storage than a large scalar.
I don't see that as an important factor here.

Remember that the problem referred to above was memory requirements. So
it may not be an important factor *elsewhere*, but it is *the* important
factor *here*.

That is a different context than I assumed.

I was only talking about storing the strings *to print later* in an
array, instead of concatenating them.
 
Peter J. Holzer

Peter said:
Dr.Ruud:
Uri Guttman:
Ruud:
Uri:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.
Or store the strings in an array (and pass around its ref) because

print LIST
arrays take up more storage than a large scalar.
I don't see that as an important factor here.

Remember that the problem referred to above was memory requirements. So
it may not be an important factor *elsewhere*, but it is *the* important
factor *here*.

That is a different context than I assumed.

I was only talking about storing the strings *to print later* in an
array, in stead of concatenating them.

Same thing. An array has a quite substantial overhead per element. In
the example I just posted it's about 40 bytes (perl 5.10.0 on Linux/i386),
and I think that's typical (about 100 bytes for hash elements, IIRC). So
if you produce output in small chunks and put them into an array you
waste an additional 40 bytes of memory for each chunk compared to
concatenating them to a string.

hp
 
Dr.Ruud

Peter said:
Ruud:
Peter:
Dr.Ruud:
Uri:
Ruud:
Uri:
[printing late] but scalar refs
solves the passing around big strings. you still need at least one large
buffer for it though.

Or store the strings in an array (and pass around its ref) because
print LIST

arrays take up more storage than a large scalar.

I don't see that as an important factor here.

Remember that the problem referred to above was memory requirements. So
it may not be an important factor *elsewhere*, but it is *the* important
factor *here*.

That is a different context than I assumed.

I was only talking about storing the strings *to print later* in an
array, in stead of concatenating them.

Same thing. An array has a quite substantial overhead per element. In
the example I just posted it's about 40 bytes (perl 5.10.0 on Linux/i386),
and I think that's typical (about 100 bytes for hash elements, IIRC). So
if you produce output in small chunks and put them into an array you
waste an additional 40 bytes of memory for each chunk compared to
concatening them to a string.

Concatenation needs resources too, so as always just use what is
appropriate for your situation. Once your print array, or your print
buffer, is full enough, you can print and reset it. Etc. Etc.
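
A sketch of the "print and reset" idea, using a string buffer (an array of chunks would work the same way). make_buffered_writer() and the tiny 16-byte limit are made up for the demonstration; a real limit would be far larger.

```perl
use strict;
use warnings;

# Hypothetical buffered writer: chunks are appended to a string buffer,
# which is written out and emptied whenever it reaches the limit.
sub make_buffered_writer {
    my ( $fh, $limit ) = @_;
    my $buffer = '';
    my $flush  = sub {
        print {$fh} $buffer;
        $buffer = '';
    };
    my $write = sub {
        $buffer .= $_[0];
        $flush->() if length($buffer) >= $limit;
    };
    return ( $write, $flush );
}

my $out = '';
open( my $fh, '>', \$out ) or die "cannot open in-memory handle: $!";
my ( $write, $flush ) = make_buffered_writer( $fh, 16 );    # tiny limit for the demo

$write->("chunk $_\n") for 1 .. 5;
$flush->();    # write whatever is left in the buffer
close $fh;
print $out;
```

Memory use is then bounded by the limit plus one chunk, whichever container holds the pending output.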
 
Ilya Zakharevich

mode=scalarref, filesize=6.7997MB, before = 5.08203MB, after = 11.8828MB, diff= 6.80078MB
mode=array, filesize=6.7997MB, before = 5.08203MB, after = 26.1133MB, diff=21.0312MB

so in this example using an array with the common idiom "@lines = <$fh>"
has an overhead of about 200%.

On my system:

scalar: 8.6M
arrayref: 28.1M 5.8.8
arrayref: 20.8M 5.6.1

So there is an additional bug in 5.8.8...

Mail.old>env PERL_DEBUG_MSTATS=1 perl -we "sub f{my @a=<>; print scalar @a; \@a} $a=f" eprints
Name "main::a" used only once: possible typo at -e line 1.
Memory allocation statistics after execution: (buckets 4(4)..1052668(1048576)
9947924 free: 1085 567 8680 20926 55898 15 2 4 0 0 0 0 0 0 0 0 0 0
17903 2631 11329 19275 908
17705532 used: 1210 703 8708 21048 5174 17 6 2 1 1655 3 0 0 0 0 0 0 3
17867 2639 11876 19407 51642
Total sbrk(): 28143616/975:1124. Odd ends: pad+heads+chain+tail: 0+479920+0+10240.
138811

And arrayref is about 9x slower... I suspect that list-<> operator is
not marking its results as TEMP, and/or array-copy operator is not
granting TEMP status of SVs...

Hope this helps,
Ilya
 
