IPC::Shareable

Snorik

Hello everyone,

I am trying to speed up a few perl scripts by forking them.
Unfortunately, I need to pass the results back to the parent.
I am using named pipes at other places, but this time, I wanted to use
shared memory (this being on *nix).

The first case is basically traversing a HUGE directory tree, looking
for certain files and returning them.

The idea is to fork one find per subdirectory from a specific point of
the directory tree, gather all the files found by each child process in
an array, and then store a reference to that array as a value in the hash.


For this, I have done:

    sub get_fbas_for_rg
    {
        my $rg = shift;
        my @children;
        use IPC::Shareable;
        use Data::Dumper;
        use constant MYGLUE => 'Test';
        my %fba_hash;
        my $handle = tie (%fba_hash, IPC::Shareable, MYGLUE, {create => 1, mode => 0666})
            or die "cannot tie to shared memory: $! \n";

        my @ggs = qx (ls /default/main/www/$rg | grep -v STAGING | grep -v WORKAREA | grep -v EDITION);

        foreach my $gg (@ggs)
        {
            chomp $gg;
            my $gg_fba = $gg."_FBAs";

            my $pid = fork();

            if ($pid)
            {
                push(@children, $pid);
            }
            elsif ($pid == 0)
            {
                my @fbas = (qx (/usr/bin/find /default/main/www/$rg/$gg/WORKAREA/workarea/$gg_fba -type f ));
                $handle->shlock();
                push (@{$fba_hash{$gg}}, @fbas);
                $handle->shunlock();
                exit (0);
            }
            else
            {
                print STDERR "\nERROR: fork failed: $!\n";
            }
        }

        foreach (@children)
        {
            waitpid($_, 0);
        }
        return %fba_hash;
    }

Now, if I call this function, it seems to work fine, except that the hash
values contain only the scalar value of each array (its size); at least
that is what Data::Dumper tells me:

$VAR1 = {
          'dir1' => 3858,
          'dir2' => 2394,
          'dir3' => 2075
        };


This is what I do in the script:

    my %fbas = TestPackage::get_fbas_for_rg("test");

    print Dumper \%fbas;

    foreach my $gg (keys %fbas)
    {
        print $gg."\n";
        foreach my $fba (sort @{$fbas{$gg}})
        {
            print $fba."\n";
        }
    }


The foreach loop does not return anything (understandable since the
hash value only contains the scalar of the array).
Again, my question is: how do I manage to receive the actual array in
the calling script instead of just a hash containing my designated
keys and the sizes of the arrays as values?

I would be very grateful for some help.
 
Ben Morrow

Quoth Snorik:
> sub get_fbas_for_rg
> {
>     my $rg = shift;
>     my @children;
>     use IPC::Shareable;
>     use Data::Dumper;

It's best not to 'use' modules inside a sub (except for lowercase
pragmata which have lexical effect). It gives the false impression that
the exporter subs are only available in that sub.
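
A small sketch of that point, using the core module List::Util purely as a stand-in (the sub names are made up for illustration): the 'use' is processed at compile time and imports into the whole package, so the second sub can call sum() even though it contains no 'use' of its own.

```perl
use strict;
use warnings;

sub total_inside {
    # 'use' runs at compile time and imports sum() into the enclosing
    # package, not just into this sub.
    use List::Util qw(sum);
    return sum(@_);
}

sub total_outside {
    # No 'use' here, yet sum() is callable all the same.
    return sum(@_);
}

print total_inside(1, 2, 3), "\n";
print total_outside(4, 5, 6), "\n";
```
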

> use constant MYGLUE => 'Test';
> my %fba_hash;
> my $handle = tie (%fba_hash, IPC::Shareable, MYGLUE, {create => 1, mode => 0666})
>     or die "cannot tie to shared memory: $! \n";
>
> my @ggs = qx (ls /default/main/www/$rg | grep -v STAGING | grep -v WORKAREA | grep -v EDITION);

    use File::Slurp qw/read_dir/;

    my @ggs =
        grep !/EDITION/,
        grep !/WORKAREA/,
        grep !/STAGING/,
        read_dir "/default/main/www/$rg";

or

        grep !/EDITION|WORKAREA|STAGING/,

of course.
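
For anyone without File::Slurp installed, the same filter can be sketched with the core opendir/readdir calls. This is a substitute for read_dir, not what was suggested above; list_groups is a hypothetical helper name and the temporary directory exists only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: list the entries of $dir, skipping '.' and '..'
# and anything whose name matches EDITION, WORKAREA or STAGING.
sub list_groups {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "cannot open $dir: $!";
    my @names = grep { !/^\.\.?$/ && !/EDITION|WORKAREA|STAGING/ } readdir $dh;
    closedir $dh;
    my @sorted = sort @names;
    return @sorted;
}

# Demo on a throwaway directory instead of /default/main/www/$rg.
my $tmp = tempdir(CLEANUP => 1);
for my $name (qw(alpha beta STAGING WORKAREA EDITION)) {
    mkdir "$tmp/$name" or die "mkdir $name: $!";
}

my @ggs = list_groups($tmp);
print "@ggs\n";
```

Unlike qx(ls ...), the names come back without trailing newlines, so no chomp is needed.
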

> my @fbas = (qx (/usr/bin/find /default/main/www/$rg/$gg/WORKAREA/workarea/$gg_fba -type f ));

I would use File::Find::Rule for this.

    my @fbas = File::Find::Rule->file
        ->in("/default/main/www/$rg/$gg/WORKAREA/workarea/$gg_fba");

> $handle->shlock();
> push (@{$fba_hash{$gg}}, @fbas);

You cannot assign a ref to an IPC::Shareable tied hash. The other
process has no way of following that ref: it refers to data structures
that aren't in shared memory. I would suggest using Storable:

    use Storable qw/freeze/;

    $fba_hash{$gg} = freeze \@fbas;

and then retrieve it with

    use Storable qw/thaw/;

    my @fbas = @{ thaw $fba_hash{$gg} };
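
A minimal round trip of the two snippets above, as a sketch only: %store is an ordinary hash standing in for the tied one, and the paths are made-up sample data. The point is that the stored value is a single string, which thaw turns back into an array reference.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my @fbas = ("a/one.fba\n", "a/two.fba\n");   # made-up sample paths
my %store;                                   # plain hash standing in for the tied one

# Writer side: the array becomes one scalar string.
$store{dir1} = freeze(\@fbas);

# Reader side: rebuild an array from that string.
my @copy = @{ thaw($store{dir1}) };

print scalar(@copy), " entries recovered\n";
```
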

Ben
 
Ted Zlatanov

S> use IPC::Shareable;

Try IPC::ShareLite or even Tie::ShareLite (easiest, hash interface).
They work better for me.

S> Now, If I call this function, it seems to work fine, only the hash
S> values contain only the scalars of the array, at least that is what
S> Data Dumper tells me:

S> $VAR1 = {
S> 'dir1' => 3858,
S> 'dir2' => 2394,
S> 'dir3' => 2075
S> };

You're assigning @array to the hash value; the value can only be a
scalar so you get the size of the array instead of its contents.
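
The scalar-context behaviour described here can be seen with an ordinary, untied hash. This sketch uses made-up sample data and only illustrates the general Perl rule, not what IPC::Shareable does internally.

```perl
use strict;
use warnings;

my @fbas = ('one.fba', 'two.fba', 'three.fba');   # made-up sample data
my %plain;                                        # ordinary, untied hash

$plain{count} = @fbas;     # array in scalar context: stores its size
$plain{list}  = \@fbas;    # array reference: one scalar that points at the list

print "count: $plain{count}\n";
print "list:  @{ $plain{list} }\n";
```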

See the Tie::ShareLite docs, especially section 'REFERENCES,' for a
better solution.

Ted
 
Snorik

> It's best not to 'use' modules inside a sub (except for lowercase
> pragmata which have lexical effect). It gives the false impression that
> the exporter subs are only available in that sub.

Ok, that is a useful remark, I will keep that in mind.

>     use File::Slurp qw/read_dir/;

*snip usage of File::Slurp*

Thanks for pointing that out, that module really helps a lot!
I never knew it existed.

> I would use File::Find::Rule for this.
>
>     my @fbas = File::Find::Rule->file
>         ->in("/default/main/www/$rg/$gg/WORKAREA/workarea/$gg_fba");

Again, thanks for pointing that out - This is so much more elegant
than the normal File::Find way.

> You cannot assign a ref to an IPC::Shareable tied hash. The other
> process has no way of following that ref: it refers to data structures
> that aren't in shared memory. I would suggest using Storable:

OK, so if I may rephrase in order to check whether I have actually
understood:
All that can be tied in that hash is the scalar of the array (its
size), I cannot use it to follow a ref to the actual array.

>     use Storable qw/freeze/;
>
>     $fba_hash{$gg} = freeze \@fbas;
>
> and then retrieve it with
>
>     use Storable qw/thaw/;
>
>     my @fbas = @{ thaw $fba_hash{$gg} };

I have a question concerning this (I just had a look at the Storable
documentation, but this does not really clear things up):

So Storable persists (and of course serializes) any data structure; that
means I can store the hash to disk (or memory, hopefully memory?).
How can I retrieve this in the calling script, as this sub is going to
live in a module itself? I must admit, this is my first attempt at IPC
myself.

I would be very grateful for an answer.

Snorik
 
Ben Morrow

Quoth Snorik:
> OK, so if I may rephrase in order to check whether I have actually
> understood:
> All that can be tied in that hash is the scalar of the array (its
> size), I cannot use it to follow a ref to the actual array.

Yes. I don't entirely understand why the value stored was
scalar(@array): I would have expected it to be the stringification of
the ref. I guess it's to do with how IPC::Shareable interprets its
arguments.

> I have a question concerning this (I just had a look at the Storable
> documentation, but this does not really clear things up):
>
> So Storable persists (and of course serializes) any data structure; that
> means I can store the hash to disk (or memory, hopefully memory?).

Yes. You use store/retrieve to save to and load from disk; you use
freeze/thaw to save to and load from memory.
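
Side by side, the two pairs look roughly like this. It is a sketch with made-up sample data; the temporary file from File::Temp is only there so the example cleans up after itself.

```perl
use strict;
use warnings;
use Storable qw(store retrieve freeze thaw);
use File::Temp qw(tempfile);

my %fbas = (dir1 => ['x', 'y'], dir2 => ['z']);   # made-up sample data

# Disk: store() writes the structure to a file, retrieve() reads it back.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;
store(\%fbas, $file);
my $from_disk = retrieve($file);

# Memory: freeze() returns a string, thaw() rebuilds the structure from it.
my $frozen      = freeze(\%fbas);
my $from_memory = thaw($frozen);

print join(',', @{ $from_disk->{dir1} }), "\n";
print join(',', @{ $from_memory->{dir2} }), "\n";
```
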

> How can I retrieve this in the calling script, as this sub is going to
> live in a module itself? I must admit, this is my first attempt at IPC
> myself.

If you store it with 'freeze', you get it out again with 'thaw'.

Ben
 
Snorik

> Yes. You use store/retrieve to save to and load from disk; you use
> freeze/thaw to save to and load from memory.

Ok, thanks for that, I will read the documentation and actually try to
understand it.

> If you store it with 'freeze', you get it out again with 'thaw'.

Yes, I have understood that, but if I freeze a hash in one script, how
can I thaw it in the other script? I do not have the reference?
I tried to use a tied variable for that, figuring that this should
work this time, but this failed unfortunately.
 
xhoster

Snorik said:
> Yes, I have understood that, but if I freeze a hash in one script, how
> can I thaw it in the other script?

When you freeze, you get serialized data, which is just a string. You
pass that string to the other script using shared memory (or pipes).

> I do not have the reference?

That is what thaw does. It makes a reference again out of the serialized
data. Obviously it isn't the same reference, but a deep copy of the
referenced data.
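
As a rough sketch of the pipe variant, assuming a single child and made-up file names: the child freezes its list and writes the string to a pipe, and the parent reads the string and thaws it into its own copy.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my @files = ('one.txt', 'two.txt');    # stand-in for the child's find results

pipe(my $reader, my $writer) or die "pipe failed: $!";
binmode $_ for $reader, $writer;       # frozen data is binary

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: serialize the list and write the string to the pipe.
    close $reader;
    print {$writer} freeze(\@files);
    close $writer;
    exit 0;
}

# Parent: read everything the child wrote, then rebuild the structure.
close $writer;
my $frozen = do { local $/; <$reader> };
close $reader;
waitpid($pid, 0);

my $copy = thaw($frozen);
print "got ", scalar(@$copy), " files from the child\n";
```

With several children, each would need its own pipe (or some framing on a shared one), which this sketch leaves out.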

Xho

 
Snorik

S>           use IPC::Shareable;

> Try IPC::ShareLite or even Tie::ShareLite (easiest, hash interface).
> They work better for me.

S> Now, If I call this function, it seems to work fine, only the hash
S> values contain only the scalars of the array, at least that is what
S> Data Dumper tells me:

S> $VAR1 = {
S>           'dir1' => 3858,
S>           'dir2' => 2394,
S>           'dir3' => 2075
S>         };

> You're assigning @array to the hash value; the value can only be a
> scalar so you get the size of the array instead of its contents.
>
> See the Tie::ShareLite docs, especially section 'REFERENCES,' for a
> better solution.

Hello,

ok, this works - however, it is very slow if I use a normal hash (the
hash has about 10000 entries).
I have yet to get it to work with hash references.

Thanks for your help!
 
Ted Zlatanov

S> ok, this works - however, it is very slow if I use a normal hash (the
S> hash has about 10000 entries).
S> I have yet to get it to work with hash references.

I've only used it with hashes of up to 1000 entries, but I'm surprised
it's very slow. Can you show your code so we can see if the problem is
in the module or in your code?

Ted
 
Snorik

S> ok, this works - however, it is very slow if I use a normal hash (the
S> hash has about 10000 entries).
S> I have yet to get it to work with hash references.

> I've only used it with hashes of up to 1000 entries, but I'm surprised
> it's very slow.  Can you show your code so we can see if the problem is
> in the module or in your code?

Hello,

okay, it appears even a Solaris system needs a reboot some time - now
it is pretty fast: 19 seconds for 16k entries (and that includes
Data::Dumper printing out the hash and forking about 30 child
processes).

================================
I do the following now:

        if ($pid)
        {
            push(@children, $pid);
        }
        elsif ($pid == 0)
        {
            use File::Find::Rule;
            my @fbas = File::Find::Rule->file->in("/default/main/www/$rg/$gg/WORKAREA/workarea/$gg_fba");
            $ipc->lock(LOCK_EX);
            $shared{$gg} = \@fbas;
            $ipc->unlock();
            exit (0);
        }
        else
        {
            print STDERR "\nERROR: fork failed: $!\n";
        }
    }

    foreach (@children)
    {
        waitpid($_, 0);
    }
    return %shared;

And in the calling script:

my %fba_ref = Package::get_fbas_for_rg("dir1");
print Dumper \%fba_ref;

=================================
I am wondering: Do I even need the locks for the hash reference, does
this lock the entire hash, or solely the key in question? Does this
still go faster?
 
Snorik

Ok, when testing with another directory tree, I saw the following:

IPC::Sha(in cleanup) IPC::ShareLite store() error: No space left on
device at /opt/iw-home/iw-perl/site/lib/Tie/ShareLite.pm line 366


Then, on retry, it worked again - therefore my question: Do I have to
clean up anything after I used it?

I tried using "destroy => 'yes'", but that yields "IPC::ShareLite
fetch() error: Invalid argument at /opt/iw-home/iw-perl/site/lib/Tie/
ShareLite.pm line 342", telling me (not sure) that the object is
destroyed too early.

Some explanation what to do for this would be great :)
 
Ted Zlatanov

S> Ok, when testing with another directory tree, I saw the following:
S> IPC::Sha(in cleanup) IPC::ShareLite store() error: No space left on
S> device at /opt/iw-home/iw-perl/site/lib/Tie/ShareLite.pm line 366


S> Then, on retry, it worked again - therefore my question: Do I have to
S> clean up anything after I used it?

S> I tried using "destroy => 'yes'", but that yields "IPC::ShareLite
S> fetch() error: Invalid argument at /opt/iw-home/iw-perl/site/lib/Tie/
S> ShareLite.pm line 342", telling me (not sure) that the object is
S> destroyed too early.

S> Some explanation what to do for this would be great :)

You need to increase your shared memory. This is different for every
OS. For Solaris, see
e.g. http://publib.boulder.ibm.com/tividd/td/ITAME/SC32-1351-00/en_US/HTML/am51_perftune34.htm

Give it at least 10 times the default, unless your system is short on memory.

Ted
 
Ted Zlatanov

S> if ($pid)
S> {
S> push(@children, $pid);
S> }
S> elsif ($pid == 0)
S> {
S> use File::Find::Rule;
S> my @fbas = File::Find::Rule->file->in("/default/main/www/$rg/$gg/
S> WORKAREA/workarea/$gg_fba");
S> $ipc->lock(LOCK_EX);
S> $shared{$gg} = \@fbas;
S> $ipc->unlock();
S> exit (0);
S> }
S> else
S> {
S> print STDERR "\nERROR: fork failed: $!\n";
S> }
S> }

S> foreach (@children)
S> {
S> waitpid($_, 0);
S> }
S> return %shared;

S> And in the calling script:

S> my %fba_ref = Package::get_fbas_for_rg("dir1");
S> print Dumper \%fba_ref;

S> =================================

S> I am wondering: Do I even need the locks for the hash reference,

Yes.

S> does this lock the entire hash, or solely the key in question?

The whole hash.

S> Does this still go faster?

Than what? I'm not sure what we're measuring.

Your code looks OK. You should probably destroy the hash in the main
process after the children are done. This is not showing all your code;
your initialization should have an ID for the shared memory segment
you're using. If you use the same ID, you will reuse the same space, so
you don't have to destroy it if you'll be reusing it regularly.
Sometimes people use the process ID $$ as the shared memory ID, and in
that case you'll keep allocating more memory every time.

Ted
 
Ted Zlatanov

S> I do the following now:

S> if ($pid)
S> {
S> push(@children, $pid);
S> }
S> elsif ($pid == 0)
S> {
S> use File::Find::Rule;
S> my @fbas = File::Find::Rule->file->in("/default/main/www/$rg/$gg/
S> WORKAREA/workarea/$gg_fba");
S> $ipc->lock(LOCK_EX);
S> $shared{$gg} = \@fbas;
S> $ipc->unlock();
S> exit (0);
S> }
S> else
S> {
S> print STDERR "\nERROR: fork failed: $!\n";
S> }
S> }

By the way, I forgot to mention: you can just put the file list in a
file, and have the file name in the shared memory. That way you get
locking (only the process with the shared memory lock can open or write
to files), your shared memory usage is low, and you don't have to worry
about serializing large amounts of data. This is what I've used in the
past for database loaders. It works well.
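
A simplified sketch of the file-based handoff, with one deliberate difference from the description above: the parent chooses each file name before forking and keeps it in an ordinary hash, so no shared memory is involved at all. The keys, file names and results are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my %work = (dir1 => ['a', 'b'], dir2 => ['c']);   # made-up results per key
my (%file_for, @children);

for my $key (sort keys %work) {
    # The parent picks the file name up front, so the child need not report it.
    $file_for{$key} = "$dir/$key.list";

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: write one entry per line, then exit.
        open(my $out, '>', $file_for{$key}) or die "open: $!";
        print {$out} "$_\n" for @{ $work{$key} };
        close $out or die "close: $!";
        exit 0;
    }
    push @children, $pid;
}
waitpid($_, 0) for @children;

# Parent: read each child's file back into a hash of array refs.
my %result;
for my $key (keys %file_for) {
    open(my $in, '<', $file_for{$key}) or die "open: $!";
    chomp(my @lines = <$in>);
    close $in;
    $result{$key} = \@lines;
}
print "$_: @{ $result{$_} }\n" for sort keys %result;
```

Because every child writes only to its own file and the parent reads only after waitpid, this version needs no locking.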

Ted
 
Snorik

S> I do the following now:

S> if ($pid)
S> {
S> push(@children, $pid);
S> }
S> elsif ($pid == 0)
S> {
S> use File::Find::Rule;
S> my @fbas = File::Find::Rule->file->in("/default/main/www/$rg/$gg/
S> WORKAREA/workarea/$gg_fba");
S> $ipc->lock(LOCK_EX);
S> $shared{$gg} = \@fbas;
S> $ipc->unlock();
S> exit (0);
S> }
S> else
S> {
S> print STDERR "\nERROR: fork failed: $!\n";
S> }
S> }

> By the way, I forgot to mention: you can just put the file list in a
> file, and have the file name in the shared memory.  That way you get
> locking (only the process with the shared memory lock can open or write
> to files), your shared memory usage is low, and you don't have to worry
> about serializing large amounts of data.  This is what I've used in the
> past for database loaders.  It works well.

That is a good idea and what I am going to do!
 
Snorik

S> if ($pid)
S> {
S> push(@children, $pid);
S> }
S> elsif ($pid == 0)
S> {
S> use File::Find::Rule;
S> my @fbas = File::Find::Rule->file->in("/default/main/www/$rg/$gg/
S> WORKAREA/workarea/$gg_fba");
S> $ipc->lock(LOCK_EX);
S> $shared{$gg} = \@fbas;
S> $ipc->unlock();
S> exit (0);
S> }
S> else
S> {
S> print STDERR "\nERROR: fork failed: $!\n";
S> }
S> }

S> foreach (@children)
S> {
S> waitpid($_, 0);
S> }
S> return %shared;

S> And in the calling script:

S> my %fba_ref = Package::get_fbas_for_rg("dir1");
S> print Dumper \%fba_ref;

S> =================================

S> I am wondering: Do I even need the locks for the hash reference,

> Yes.

S> does this lock the entire hash, or solely the key in question?

> The whole hash.

S> Does this still go faster?

> Than what?  I'm not sure what we're measuring.
>
> Your code looks OK.  You should probably destroy the hash in the main
> process after the children are done.

Ok, but how do I return it to the calling script otherwise?
The idea of all of this is to encapsulate the action completely and
just return something from the module without anyone having to bother
what happens inside there.

> This is not showing all your code;
> your initialization should have an ID for the shared memory segment
> you're using.  If you use the same ID, you will reuse the same space, so
> you don't have to destroy it if you'll be reusing it regularly.

I just noticed that reusing it means that for a second run with other
parameters, I also receive part of the result from the first one. So,
I better not keep it :)

For initialization, I just use the following:

    my $ipc = tie %shared, 'Tie::ShareLite', -key=>1971, -mode=>0600,
        -create=>'yes', -destroy=>'no'
        or die("Could not tie to shared memory: $!");

> Sometimes people use the process ID $$ as the shared memory ID, and in
> that case you'll keep allocating more memory every time.

I do not want to keep it, how can I clean it up properly (again,
hidden inside the module if possible) and still return a valid
reference from the routine in the module?
 
Snorik

To illustrate, I want to do this (this is most of the calling script):

my %fba_ref = Package::get_fbas_for_rg("directory");
print Dumper \%fba_ref;

Trying to clear the memory like this (hardcoded id, I know, but if I
clear it every time I use it, it should not be a problem - right?):

Package::cleanMemory(1971);

returns a "permission denied" error.

cleanMemory just looks like this:

    {
        my $handle = shift;
        use IPC::ShareLite;
        my $share = IPC::ShareLite->new( -key => $handle, -create => 'yes', -destroy => 'no')
            or die "caution: ".$!;
        $share->destroy();
    }

I figured, I can just reuse the shared memory segment and just destroy
it - but I seem to be too dense to get that.
 
Ted Zlatanov

S> I just noticed that reusing it means that for a second run with other
S> parameters, I also receive part of the result from the first one. So,
S> I better not keep it :)

S> For initialization, I just use the following:

S> my $ipc = tie %shared, 'Tie::ShareLite', -key=>1971, -mode=>0600, -
S> create=>'yes', -destroy =>'no' or die("Could not tie to shared memory:
S> $!")

Do this in the parent (the one which forks all the others) with
-destroy => 'yes'

If you want to keep running children after the parent script is gone,
you'll have to have a cleanup function or script, which is called before
the next run happens. It's exactly like leaving files in /tmp after
you're done processing them.

Ted
 
Snorik

S> I just noticed that reusing it means that for a second run with other
S> parameters, I also receive part of the result from the first one. So,
S> I better not keep it :)

S> For initialization, I just use the following:

S>  my $ipc = tie %shared, 'Tie::ShareLite', -key=>1971, -mode=>0600, -
S> create=>'yes', -destroy =>'no' or die("Could not tie to shared memory:
S> $!")

Ted, sorry it took me some time to respond...

> Do this in the parent (the one which forks all the others) with
> -destroy => 'yes'

I think I am doing this in the parent, that is, before the
fork() occurs (which should be the parent):

    my $ipc = tie %shared, 'Tie::ShareLite', -key=>1971, -mode=>0600,
        -create=>'yes', -destroy=>'yes'
        or die("Could not tie to shared memory: $!");

    use File::Slurp qw/read_dir/;
    my @ggs = [... removed to shorten - it does return an array]
    foreach my $gg (@ggs)
    {
        chomp $gg;
        my $pid = fork();
        if ($pid)
        {
            push(@children, $pid);
        }
        elsif ($pid == 0)
        {

            use File::Find::Rule;
            [here stuff happens...]


I receive this for each element of the array
IPC::ShareLite fetch() error: Invalid argument at [..]Tie/ShareLite.pm
line 342

To me that says that the object is already destroyed?

*snip useful advice about child processes*

Thank you for that gem of useful information!
 
Snorik

> Rather than explaining how to fix your example, here's a working program
> that shows how to use Tie::ShareLite. Each child will lock, write
> 'hello' to the key of its PID, then unlock. You should lock and unlock
> around every access to the tied hash. The parent waits for the children
> to finish and then prints the summary (also locking and unlocking, to
> protect from other processes that might be accessing that shared memory).

OK.

> The get_ipc() function is just for convenience. Key 1971 is just an
> example from the Tie::ShareLite docs, you can use any value.

I know, I know.

> Note I clear %shared every time I start.

Ok, I was not aware I could just do that. I thought I needed to
actually remove the contents of the shared memory segment somehow.

> It's not destroyed at the
> program's end. Setting destroy to yes in the parent doesn't work for
> me, and I didn't debug it (no time :)

I have the nagging feeling that this does not work.
But I like your solution to that, it is very simple yet powerful.

Anyhow, stuff works like a charm now, thank you so much for your time,
I owe you a beer :)
 
