speaking of forking -- parallel database fetches?

DJ Stunks

Hey all,

I have a question about performing two long-running selects from a
database in parallel. I've only written a few scripts which do things
in parallel so I'm not an expert by any means.

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so?

I was thinking something along these lines (pseudocode below) but I was
hoping there would be some way to give the parent access to the
statement handle itself so it could pull the rows once the queries were
complete rather than pulling all the rows in the child, and serializing
and passing to the parent.

Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

TIA,
-jp

#!/usr/bin/perl <pseudocode>

use strict;
use warnings;

my @kids = (
{ query => 'select * from big_table' },
{ query => 'select * from another_big_table' },
);

my $pid;
for my $kid (@kids) {
$pid = open( my $fh, '-|');
die "Can't fork: $!\n" if not defined $pid;

@{ $kid }{ 'pid','handle' } = ($pid,$fh);
}

if ( $pid == 0) { # I'm one of the children

# connect to the db
# prepare query
# execute query
# wait for results
# foreach @row = $sth->fetchrow_array
# print join( $;, @row ), "\n"

# exit; (exit or waitpid? do I know the parent read everything?)
}
else { # I'm the parent

# while ( my $line = < $kids[0]{handle} >) {
# @row = split $;, $line;
# do whatever with @row
#
# while ( my $line = < $kids[1]{handle} >) {
# etc.
}

__END__
 
xhoster

DJ Stunks said:
Hey all,

I have a question about performing two long-running selects from a
database in parallel. I've only written a few scripts which do things
in parallel so I'm not an expert by any means.

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so?

How many rows are we talking?
I was thinking something along these lines (pseudocode below) but I was
hoping there would be some way to give the parent access to the
statement handle itself so it could pull the rows once the queries were
complete

I'm pretty sure that that ain't gonna happen.
rather than pulling all the rows in the child, and serializing
and passing to the parent.

Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

Parallel::Jobs might help, although it isn't very easy to use and wants to
run commands rather than Perl code (so you would have to fire up a new Perl
interpreter from scratch, and it wouldn't inherit variables, etc, from the
existing one). I've posted here a while ago a simple mod of
Parallel::ForkManager to allow back talk from the child to the parent, but
it sends the data as a slug of serialized data, not as a stream. Whether
this is deadly or not depends on how many rows.

http://groups.google.com/group/comp.lang.perl.modules/browse_frm/thread/9c8ef79472740156/
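
This is not the ForkManager mod Xho posted (that's in the linked thread), just a minimal sketch of the "slug of serialized data" idea it describes: the child freezes all its rows with Storable and prints them in one gulp, and the parent slurps and thaws after the child is done. The row values here are made up for illustration.

```perl
#!/usr/bin/perl
# Sketch: child sends results back as one serialized slug, not a stream.
use strict;
use warnings;
use Storable qw(freeze thaw);

my $pid = open( my $fh, '-|' );          # '-|': parent reads child's STDOUT
die "Can't fork: $!\n" unless defined $pid;

if ( $pid == 0 ) {                       # child
    binmode STDOUT;                      # frozen data is binary
    my @rows = ( [ 1, 'foo' ], [ 2, 'bar' ] );   # stand-in for fetchall
    print freeze( \@rows );              # the whole result set at once
    exit 0;
}

binmode $fh;                             # parent
my $slug = do { local $/; <$fh> };       # slurp until the child's EOF
close $fh or warn "child exited badly: $! $?\n";
my $rows = thaw($slug);                  # back to an array of row arrayrefs
```

The obvious trade-off, as Xho notes, is memory: the child holds every row before anything reaches the parent, so whether this is deadly depends on how many rows there are.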

Also, there is Parallel::Simple, which I never used but looks promising.
my $pid;
for my $kid (@kids) {
$pid = open( my $fh, '-|');
die "Can't fork: $!\n" if not defined $pid;

@{ $kid }{ 'pid','handle' } = ($pid,$fh);
}

if ( $pid == 0) { # I'm one of the children

How does the child know which child it is? It would have to loop over
@kids comparing its pid to the stored pid (which it can't do, because
the stored pid is stored only in the parent, not the child). Better to
move this block into the inside of the for my $kid loop; that way
the kid automatically knows who it is by looking in $kid.

....
# wait for results
# foreach @row = $sth->fetchrow_array

Technically, you can't wait for results as a separate action from
calling fetchrow_array; fetchrow_array is inherently waiting for results.
# print join( $;, @row ), "\n"

# exit; (exit or waitpid? do I know the parent read everything?)

exit. You almost certainly don't want to waitpid (for what? The parent?).
At this point, you've stuffed everything you have into the buffer up to the
parent. That is all you can do. If the parent doesn't read it all, what
is the child to do about it?
}
else { # I'm the parent

# while ( my $line = < $kids[0]{handle} >) {
# @row = split $;, $line;
# do whatever with @row

You will want to "close $kids[0]{handle} or [die|warn] $! $?;" to make sure
the kid ended things on a good note. The close implicitly waits for
the child to exit.
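
Putting those suggestions together, a runnable sketch of the restructuring: the fork happens inside the loop, so each child already holds its own $kid. The DBI calls are stubbed out (each child just prints fake rows derived from its query string) so the pipe plumbing can run standalone; in real code the commented lines would do the fetching.

```perl
#!/usr/bin/perl
# Sketch: fork one reader per query, with the if/else inside the loop.
use strict;
use warnings;

my @kids = (
    { query => 'select * from big_table' },
    { query => 'select * from another_big_table' },
);

for my $kid (@kids) {
    my $pid = open( my $fh, '-|' );      # '-|': parent reads child's STDOUT
    die "Can't fork: $!\n" unless defined $pid;

    if ( $pid == 0 ) {                   # child -- knows itself via $kid
        # my $dbh = DBI->connect( $dsn, $user, $pass, { RaiseError => 1 } );
        # my $sth = $dbh->prepare( $kid->{query} );
        # $sth->execute;
        # while ( my @row = $sth->fetchrow_array ) { ... }
        print join( $;, $kid->{query}, $_ ), "\n" for 1 .. 3;   # fake rows
        exit 0;
    }

    @{$kid}{ 'pid', 'handle' } = ( $pid, $fh );   # parent-side bookkeeping
}

# Parent drains each pipe in turn; close implicitly waits for that child.
my @all_rows;
for my $kid (@kids) {
    my $fh = $kid->{handle};
    while ( my $line = <$fh> ) {
        chomp $line;
        push @all_rows, [ split /\Q$;\E/, $line ];
    }
    close $fh or warn "child $kid->{pid} ended badly: $! $?\n";
}
print scalar(@all_rows), " rows total\n";
```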

Xho
 
Rocco Caputo

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so? [...]
Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

There are four ways to do this with POE. Whether they're appropriate
is left as an exercise. :)

http://search.cpan.org/search?query=POE+DBI&mode=dist
 
DJ Stunks

Rocco said:
I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so? [...]
Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

There are four ways to do this with POE. Whether they're appropriate
is left as an exercise. :)

http://search.cpan.org/search?query=POE+DBI&mode=dist

this looks great, time to start reading :)

you know POE had occurred to me, my search terms must have been bad
choices.

Thanks to both who responded,
-jp
 
Peter J. Holzer

I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so?

I was thinking something along these lines (pseudocode below) but I was
hoping there would be some way to give the parent access to the
statement handle itself so it could pull the rows once the queries were
complete

Theoretically, that's possible. In practice, database client libraries
don't support that.

[...]
if ( $pid == 0) { # I'm one of the children

# connect to the db
# prepare query
# execute query
# wait for results
# foreach @row = $sth->fetchrow_array
# print join( $;, @row ), "\n"

# exit; (exit or waitpid? do I know the parent read everything?)
}
else { # I'm the parent

# while ( my $line = < $kids[0]{handle} >) {
# @row = split $;, $line;
# do whatever with @row
#
# while ( my $line = < $kids[1]{handle} >) {
# etc.
}

You are reading all the lines from kid 0 before the first line from kid 1
here. This means that kid 1 will block as soon as it has written enough
records to fill the pipe (typically a few kB). Not a problem if your
queries only return a few rows, but if they are long-running because
they return lots of rows, they won't really run in parallel.

You may want to look at select (or IO::Select), but be warned that select
and <> don't mix well - you need to use sysread instead.
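
A minimal sketch of that approach, assuming the same fork-per-query setup discussed earlier in the thread; the children print numbered fake rows so it can run standalone. The parent uses IO::Select plus sysread and hand-rolls the line splitting, since buffered <> and select don't mix:

```perl
#!/usr/bin/perl
# Sketch: interleaved reads with IO::Select + sysread, so neither child
# blocks on a full pipe while the parent is busy with the other one.
use strict;
use warnings;
use IO::Select;

my @kids = ( { query => 'q1' }, { query => 'q2' } );

my $sel = IO::Select->new;
my %buf;                                  # per-handle partial-line buffer

for my $kid (@kids) {
    my $pid = open( my $fh, '-|' );
    die "Can't fork: $!\n" unless defined $pid;
    if ( $pid == 0 ) {                    # child: stand-in for the query
        print "$kid->{query}:$_\n" for 1 .. 3;
        exit 0;
    }
    $sel->add($fh);
    $buf{ fileno $fh } = '';
}

my @lines;
while ( $sel->count ) {
    for my $fh ( $sel->can_read ) {       # whichever pipes have data now
        my $n = sysread( $fh, my $chunk, 4096 );
        if ( !$n ) {                      # EOF (or error): this child is done
            $sel->remove($fh);
            close $fh or warn "a child ended badly: $! $?\n";
            next;
        }
        my $key = fileno $fh;
        $buf{$key} .= $chunk;
        while ( $buf{$key} =~ s/^(.*?)\n// ) {    # peel off complete lines
            push @lines, $1;              # a whole row, ready to split
        }
    }
}
print scalar(@lines), " lines read\n";
```

Rows from the two children may arrive interleaved in any order, which is exactly the point: whichever child has data ready gets drained first.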

hp
 
DJ Stunks

DJ said:
Rocco said:
I have two straightforward SELECT statements, but both take about two
minutes to complete. I'd like to run them in parallel, but I obviously
need access to all the rows - what's the best way to do so? [...]
Any ideas or maybe modules which could be handy? (I looked at both
Acme::Spork and Parallel::ForkManager but neither are appropriate)

There are four ways to do this with POE. Whether they're appropriate
is left as an exercise. :)

http://search.cpan.org/search?query=POE+DBI&mode=dist

this looks great, time to start reading :)

Well, I went with EasyDBI which looks ideal, however, my queries are
both coming back with the error "Died". The queries work through
straight DBI, and both of them take the right amount of time through
EasyDBI, but no rows return, just the error.

I'm not sure if anyone is familiar with this module. It doesn't look
like I can get any more detailed debugging as to what "Died". I'll try
one of the other 4 suggestions tomorrow and see how it goes.

If it matters, the queries use the DBD::ODBC driver and I'm running on
Win32 (ActiveState Perl 5.8.7 [813], DBD::ODBC 1.13, POE 0.9500,
EasyDBI 1.14)

Thanks for the suggestion though, I'll keep at it.
-jp
 
