efficiency question about 'split'

~greg

I have a very large Archive::Zip object, $Dictionary,
and I am doing a loop over its contents, like this:

foreach my $line (split "\n", $Dictionary->contents())
{
....
}



So.
I was just wondering if perl is smart enough
not to literally actually make a temporary big array first
out of the splitting of the big string $Dictionary->contents() ...

( because i hope it is)

~Greg.
 
xhoster

~greg said:
I have a very large Archive::Zip object, $Dictionary,
and I am doing a loop over its contents, like this:

foreach my $line (split "\n", $Dictionary->contents())
{
....
}

So.
I was just wondering if perl is smart enough
not to literally actually make a temporary big array first
out of the splitting of the big string $Dictionary->contents()

I'm pretty sure that first it will read the entire contents into memory as
a very big scalar, and then it will split the entire contents, in memory,
into an even bigger array.

You'll probably want to use the fh() method to get a file handle, and then
read from that handle line by line.

Xho
 
xhoster

I'm pretty sure that first it will read the entire contents into memory
as a very big scalar, and then it will split the entire contents, in
memory, into an even bigger array.

You'll probably want to use the fh() method to get a file handle, and
then read from that handle line by line.

Actually, that gives you a handle onto the compressed data stream, not the
uncompressed one. I'm sure that there is a way to do this with the tools
listed under "Low-level member data reading", but at this point I'd just
punt and extract it to a temp file then re-read it from that.
 
~greg

Actually, that gives you a handle onto the compressed data stream, not the
uncompressed one. I'm sure that there is a way to do this with the tools
listed under "Low-level member data reading", but at this point I'd just
punt and extract it to a temp file then re-read it from that.


Thank you.

However a temp file would defeat the purpose.

This is what I'm doing with the $ZippedFile:

use Archive::Zip;
my $Zip = Archive::Zip->new("$Folder/$ZippedFile");
my ($Dictionary) = $Zip->members();

--and the result is exactly the same
as if I'd done this with the $UnzippedFile:

open F, "$Folder/$UnzippedFile";
my $Dictionary = join "\n", (<F>);
close F;

EXCEPT that the loading is about 4 times faster
from the $ZippedFile, --even though it has to be
expanded in-memory, -- than from the unzipped file.

But what I don't know is

1) does every call to $Dictionary->contents()
go through the inflation process again?

And 2) - which is a very general perl question:
Since I will be running the loop many many times,
can I just leave it like this?

foreach my $line (split "\n", $String)

or should I do this?

my @String = split "\n", $String;
foreach my $line (@String)


The reason I'd want to leave it the first way
is that $String is very large, and @String
would be even larger.

I would undef $String, but still, for a moment,
anyway the 2nd way just might be using twice
as much memory as is really necessary.

But I don't know, because IF the first way
does essentially the same thing that the 2nd
way does, but with a temporary @String,
--and IF it does it EVERY single time it's called,
then obviously that would be the wrong way
to go!

I hope that's clearer :)

~greg
 
~greg

I just tried this:

while($Dictionary->contents() =~ /^(.*)$/gm)
{
my $line = $1;
...
}


I wasn't sure that $Dictionary->contents()
would be a full-fledged string, with the book-keeping
necessary to keep place during the loop,

But it does seem to be. (Or become)
Because it worked.

Except that it's about 100 times slower!
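
One thing that may be worth checking here (a guess, not a measurement): the while condition is re-evaluated on every pass, so contents() is called again for each iteration, and I would not rely on the match position surviving between those calls. A minimal sketch that fetches the string once and then walks it with the same regex; $text is only a stand-in for whatever $Dictionary->contents() returns:

```perl
use strict;
use warnings;

# Stand-in for the string returned by $Dictionary->contents().
# It is fetched once, before the loop, so the /g match position
# is tracked on one and the same scalar throughout.
my $text = "alpha\nbeta\ngamma";

my @seen;
while ($text =~ /^(.*)$/gm) {
    my $line = $1;
    push @seen, $line;
}

print scalar(@seen), " lines\n";
```

This still holds the whole uncompressed string in memory, but it does not build a second, line-by-line copy of it the way split does.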
 
Paul Marquess

~greg said:
I have a very large Archive::Zip object, $Dictionary,
and I am doing a loop over its contents, like this:

foreach my $line (split "\n", $Dictionary->contents())
{
....
}



So.
I was just wondering if perl is smart enough
not to literally actually make a temporary big array first
out of the splitting of the big string $Dictionary->contents() ...

Nope, it will uncompress the whole lot in memory, then split that into an
array.

A more efficient way to do what you want is with Archive::Zip::MemberRead,
which is part of Archive::Zip. This is from its synopsis

use Archive::Zip;
use Archive::Zip::MemberRead;
$zip = new Archive::Zip("file.zip");
$fh = new Archive::Zip::MemberRead($zip, "subdir/abc.txt");
while (defined($line = $fh->getline()))
{
print $fh->input_line_number . "#: $line\n";
}


Alternatively you can use IO::Uncompress::Unzip to do the same thing

use IO::Uncompress::Unzip qw(:all);

my $file = "$Folder/$ZippedFile";
my $uz = IO::Uncompress::Unzip->new($file)
or die "Cannot open $file: $UnzipError\n";

while (<$uz>)
{
}
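
For anyone who wants to try the second approach without a zip file on disk, here is a self-contained sketch. It assumes the core IO::Compress::Zip module is available to build a small archive in memory first; the member name words.txt and the sample text are made up for the illustration:

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw($UnzipError);

# Build a tiny zip archive in memory (hypothetical sample data).
my $plain = "alpha\nbeta\ngamma\n";
my $zipped;
zip \$plain => \$zipped, Name => 'words.txt'
    or die "zip failed: $ZipError\n";

# Read the first member back one line at a time, so the lines are
# never all held in a list at once.
my $uz = IO::Uncompress::Unzip->new(\$zipped)
    or die "Cannot open zip data: $UnzipError\n";

my @lines;
while (defined(my $line = $uz->getline())) {
    chomp $line;
    push @lines, $line;   # collected here only so the result can be inspected
}
$uz->close();

print scalar(@lines), " lines read\n";
```

In real use the body of the loop would process each line and discard it, which is where the memory saving over split comes from.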


Paul
 
xhoster

~greg said:
Thank you.

However a temp file would defeat the purpose.
....

But what I don't know is

1) does every call to $Dictionary->contents()
go through the inflation process again?

I suspect so. However, $Dictionary->contents() is only called
once for the entire foreach loop, not once for each iteration
of that loop. So if your foreach loop is nested in another loop,
$Dictionary->contents() is called once per outer loop, not once per
outer loop * inner loop.
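
A small sketch to illustrate that point, using a made-up contents() sub that counts its own calls in place of the real $Dictionary->contents():

```perl
use strict;
use warnings;

my $calls = 0;

# Hypothetical stand-in for $Dictionary->contents(); counts its own calls.
sub contents {
    $calls++;
    return "alpha\nbeta\ngamma";
}

my $lines_seen = 0;
for my $pass (1 .. 2) {                          # outer loop
    # The list for foreach is built once per pass, before iteration starts.
    foreach my $line (split "\n", contents()) {
        $lines_seen++;
    }
}

print "contents() called $calls times for $lines_seen lines\n";
```

With two passes over three lines, the counter ends at 2, not 6.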

And 2) - which is a very general perl question:
Since I will be running the loop many many times,
can I just leave it like this?

foreach my $line (split "\n", $String)

or should I do this?

my @String = split "\n", $String;
foreach my $line (@String)

I think those are effectively identical.

The reason I'd want to leave it the first way
is that $String is very large, and @String
would be even larger.

I suspect this makes no difference.

I would undef $String, but still, for a moment,
anyway the 2nd way just might be using twice
as much memory as is really necessary.

This is Perl, you are almost certainly using more than twice as much
memory as is really necessary anyway :)

Xho
 
xhoster

Paul Marquess said:
Nope, it will uncompress the whole lot in memory, then split that into an
array.

A more efficient way to do what you want is with
Archive::Zip::MemberRead, which is part of Archive::Zip. This is from its
synopsis

Very nice. I wish Archive::Zip::MemberRead was mentioned somewhere in
the Archive::Zip perldoc.

Xho
 
~greg

thank you all!

although i am still not quite sure what the answer is, then again,
I am not quite sure what my question was either.
And I know I made factual mistakes in asking it.

Pretty silly in any case.
I've got a 2.8 ghz processor, 2gig of ram,
and about a terabyte of external storage.

It's just that, -late at night, -when I asked the question
about efficiency, -my heart had somehow gone back
to the commodore-64, -- 0.9875 mhz 6502 chip,
64k ram (expanded to 128)
360k floppies (-but unlimited memory on cassette.)
The only decent way to program back then
was in Forth and assembly.
And efficiency was everything.

so, I am sorry.
i was just having a senior moment.

~greg.
 
