efficiency question about 'split'

~greg · Jan 24, 2007

I have a very large Archive::Zip object, $Dictionary,
and I am doing a loop over its contents, like this:

foreach my $line (split "\n", $Dictionary->contents())
{
....
}

So.
I was just wondering if perl is smart enough
not to literally actually make a temporary big array first
out of the splitting of the big string $Dictionary->contents() ...

( because i hope it is)

~Greg.

xhoster · Jan 24, 2007

~greg said:
I have a very large Archive::Zip object, $Dictionary,
and I am doing a loop over its contents, like this:

foreach my $line (split "\n", $Dictionary->contents())
{
....
}

So.
I was just wondering if perl is smart enough
not to literally actually make a temporary big array first
out of the splitting of the big string $Dictionary->contents()

I'm pretty sure that first it will read the entire contents into memory as
a very big scalar, and then it will split the entire contents, in memory,
into an even bigger array.

You'll probably want to use the fh() method to get a file handle, and then
read from that handle line by line.

Xho

xhoster · Jan 24, 2007

xhoster said:
I'm pretty sure that first it will read the entire contents into memory
as a very big scalar, and then it will split the entire contents, in
memory, into an even bigger array.

You'll probably want to use the fh() method to get a file handle, and
then read from that handle line by line.

Actually, that gives you a handle onto the compressed data stream, not the
uncompressed one. I'm sure that there is a way to do this with the tools
listed under "Low-level member data reading", but at this point I'd just
punt and extract it to a temp file, then re-read it from that.

~greg · Jan 24, 2007

xhoster said:
Actually, that gives you a handle onto the compressed data stream, not the
uncompressed one. I'm sure that there is a way to do this with the tools
listed under "Low-level member data reading", but at this point I'd just
punt and extract it to a temp file, then re-read it from that.

Thank you.

However a temp file would defeat the purpose.

This is what I'm doing with the $ZippedFile:

use Archive::Zip;
my $Zip = Archive::Zip->new("$Folder/$ZippedFile");
my ($Dictionary) = $Zip->members();

--and the result is exactly the same
as if I'd done this with the $UnzippedFile:

open F, "$Folder/$UnzippedFile";
my $Dictionary = join "\n", (<F>);
close F;

EXCEPT that the loading is about 4 times faster
from the $ZippedFile, --even though it has to be
expanded in-memory, -- than from the unzipped file.

But what I don't know is

1) does every call to $Dictionary->contents()
go through the inflation process again?

And 2) - which is a very general Perl question:
Since I will be running the loop many many times,
can I just leave it like this?

foreach my $line (split "\n", $String)

or should I do this?

my @String = split "\n", $String;
foreach my $line (@String)

The reason I'd want to leave it the first way
is that $String is very large, and @String
would be even larger.

I would undef $String, but still, for a moment,
anyway the 2nd way just might be using twice
as much memory as is really necessary.

But I don't know, because IF the first way
does essentially the same thing that the 2nd
way does, but with a temporary @String,
--and IF it does it EVERY single time it's called,
then obviously that would be the wrong way
to go!

I hope that's clearer

~greg

~greg · Jan 24, 2007

I just tried this:

while($Dictionary->contents() =~ /^(.*)$/gm)
{
my $line = $1;
...
}

I wasn't sure that $Dictionary->contents()
would be a full-fledged string, with the book-keeping
necessary to keep place during the loop,

But it does seem to be. (Or become)
Because it worked.

Except that it's about 100 times slower!
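One thing worth checking here: the condition of a while loop is re-evaluated on every pass, so the method call in the match target runs once per line rather than once per loop. The following is only a minimal sketch, assuming the text can be copied into a plain lexical first; the sample string stands in for whatever a single call to $Dictionary->contents() returns. It is not a measurement of the slowdown reported above.

```perl
use strict;
use warnings;

# Stand-in for the inflated text; in the thread this would come from
# one call to $Dictionary->contents(), made before the loop starts.
my $text = "alpha\nbeta\ngamma";

my @seen;
# The match target is an ordinary lexical, so nothing is recomputed per
# pass; /g in scalar context resumes from pos($text) each iteration.
while ($text =~ /^(.*)$/gm) {
    my $line = $1;
    push @seen, $line;
}

print scalar(@seen), " lines matched\n";
```

Whether this recovers all of the lost speed depends on how expensive contents() really is, which the thread does not establish.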

Paul Marquess · Jan 24, 2007

~greg said:
I have a very large Archive::Zip object, $Dictionary,
and I am doing a loop over its contents, like this:

foreach my $line (split "\n", $Dictionary->contents())
{
....
}

So.
I was just wondering if perl is smart enough
not to literally actually make a temporary big array first
out of the splitting of the big string $Dictionary->contents() ...

Nope, it will uncompress the whole lot in memory, then split that into an
array.

A more efficient way to do what you want is with Archive::Zip::MemberRead,
which is part of Archive::Zip. This is from its synopsis:

use Archive::Zip;
use Archive::Zip::MemberRead;
$zip = new Archive::Zip("file.zip");
$fh = new Archive::Zip::MemberRead($zip, "subdir/abc.txt");
while (defined($line = $fh->getline()))
{
print $fh->input_line_number . "#: $line\n";
}

Alternatively you can use IO::Uncompress::Unzip to do the same thing:

use IO::Uncompress::Unzip qw(:all);

my $file = "$Folder/$ZippedFile";
my $uz = IO::Uncompress::Unzip->new($file)
or die "Cannot open $file: $UnzipError\n";

while (<$uz>)
{
}

Paul
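The line-by-line reading described above can be tried without any archive on disk. This is a rough sketch under stated assumptions: it builds a small zip in memory with IO::Compress::Zip purely so the example is self-contained, and the member name dict.txt and the sample words are made up. With a real file, the filename would be passed to new() as in the snippet above.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw($UnzipError);

# Build a tiny archive in memory; "dict.txt" is a hypothetical member name.
my $plain = "apple\nbanana\ncherry\n";
my $zipped;
zip \$plain => \$zipped, Name => 'dict.txt'
    or die "zip failed: $ZipError\n";

# Open the first member and inflate it incrementally, one line per read,
# instead of materialising every line in a list up front.
my $uz = IO::Uncompress::Unzip->new(\$zipped)
    or die "Cannot open archive: $UnzipError\n";

my $count = 0;
my $last  = '';
while (defined(my $line = $uz->getline())) {
    chomp $line;
    $count++;
    $last = $line;    # only the current line is kept around
}
$uz->close();

print "read $count lines, last was '$last'\n";
```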

xhoster · Jan 24, 2007

~greg said:
Thank you.

However a temp file would defeat the purpose.
....

But what I don't know is

1) does every call to $Dictionary->contents()
go through the inflation process again?

I suspect so. However, $Dictionary->contents() is only called
once for the entire foreach loop, not once for each iteration
of that loop. So if your foreach loop is nested in another loop,
$Dictionary->contents() is called once per outer loop, not once per
outer loop * inner loop.
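The point about the list expression being evaluated once per foreach can be checked with a counter. This is just a small sketch: the contents() sub below is a hypothetical stand-in for $Dictionary->contents(), not the Archive::Zip method itself.

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical stand-in for $Dictionary->contents(); counts invocations.
sub contents {
    $calls++;
    return "one\ntwo\nthree";
}

my $lines = 0;
# The list in parentheses is built before the first iteration begins.
foreach my $line (split "\n", contents()) {
    $lines++;
}

print "contents() called $calls time(s) for $lines lines\n";
```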

And 2) - which is a very general Perl question:
Since I will be running the loop many many times,
can I just leave it like this?

foreach my $line (split "\n", $String)

or should I do this?

my @String = split "\n", $String;
foreach my $line (@String)

I think those are effectively identical.

The reason I'd want to leave it the first way
is that $String is very large, and @String
would be even larger.

I suspect this makes no difference.

I would undef $String, but still, for a moment,
anyway the 2nd way just might be using twice
as much memory as is really necessary.

This is Perl; you are almost certainly using more than twice as much
memory as is really necessary anyway.

Xho
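If the full list of lines is the concern, one alternative to both forms of the split loop is to read the string through an in-memory filehandle. This is a different technique from anything proposed in the thread, offered only as a sketch: it still keeps $String itself in memory, but it hands back one line at a time instead of building every line first. The sample data is invented.

```perl
use strict;
use warnings;

my $String = "first\nsecond\nthird\n";

# Open a read-only handle on the scalar itself (core Perl, no temp file).
open my $fh, '<', \$String
    or die "Cannot open in-memory handle: $!";

my $count = 0;
while (my $line = <$fh>) {
    chomp $line;    # split "\n" drops the newline; readline keeps it
    $count++;
}
close $fh;

print "looped over $count lines\n";
```

Because the loop is run many times in the original question, the handle would need to be reopened (or rewound with seek) before each pass.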

xhoster · Jan 24, 2007

Paul Marquess said:
Nope, it will uncompress the whole lot in memory, then split that into an
array.

A more efficient way to do what you want is with
Archive::Zip::MemberRead, which is part of Archive::Zip. This is from its
synopsis

Very nice. I wish Archive::Zip::MemberRead was mentioned somewhere in
the Archive::Zip perldoc.

Xho

~greg · Jan 26, 2007

thank you all!

although i am still not quite sure what the answer is, then again,
I am not quite sure what my question was either.
And I know I made factual mistakes in asking it.

Pretty silly in any case.
I've got a 2.8 ghz processor, 2gig of ram,
and about a terabyte of external storage.

It's just that, -late at night, -when I asked the question
about efficiency, -my heart had somehow gone back
to the commodore-64, -- 0.9875 mhz 6502 chip,
64k ram (expanded to 128)
360k floppies (-but unlimited memory on cassette.)
The only decent way to program back then
was in Forth and assembly.
And efficiency was everything.

so, I am sorry.
i was just having a senior moment.

~greg.
