speaking of forking -- parallel database fetches?

Discussion in 'Perl Misc' started by DJ Stunks, Nov 24, 2006.

  1. DJ Stunks

    DJ Stunks Guest

    Hey all,

    I have a question about performing two long-running selects from a
    database in parallel. I've only written a few scripts which do things
    in parallel so I'm not an expert by any means.

    I have two straightforward SELECT statements, but both take about two
    minutes to complete. I'd like to run them in parallel, but I obviously
    need access to all the rows - what's the best way to do so?

    I was thinking something along these lines (pseudocode below) but I was
    hoping there would be some way to give the parent access to the
    statement handle itself so it could pull the rows once the queries were
    complete rather than pulling all the rows in the child, and serializing
    and passing to the parent.

    Any ideas or maybe modules which could be handy? (I looked at both
    Acme::Spork and Parallel::ForkManager but neither are appropriate)

    TIA,
    -jp

    #!/usr/bin/perl <pseudocode>

    use strict;
    use warnings;

    my @kids = (
    { query => 'select * from big_table' },
    { query => 'select * from another_big_table' },
    );

    my $pid;
    for my $kid (@kids) {
    $pid = open( my $fh, '|-');
    die "Can't fork: $!\n" if not defined $pid;

    @{ $kid }{ 'pid','handle' } = ($pid,$fh);
    }

    if ( $pid == 0) { # I'm one of the children

    # connect to the db
    # prepare query
    # execute query
    # wait for results
    # foreach @row = $sth->fetchrow_array
    # print join( $;, @row ), "\n"

    # exit; (exit or waitpid? do I know the parent read everything?)
    }
    else { # I'm the parent

    # while ( my $line = < $kids[0]{handle} >) {
    # @row = split $;, $line;
    # do whatever with @row
    #
    # while ( my $line = < $kids[1]{handle} >) {
    # etc.
    }

    __END__
    DJ Stunks, Nov 24, 2006
    #1
    1. Advertising

  2. DJ Stunks

    Guest

    "DJ Stunks" <> wrote:
    > Hey all,
    >
    > I have a question about performing two long-running selects from a
    > database in parallel. I've only written a few scripts which do things
    > in parallel so I'm not an expert by any means.
    >
    > I have two straightforward SELECT statements, but both take about two
    > minutes to complete. I'd like to run them in parallel, but I obviously
    > need access to all the rows - what's the best way to do so?


    How many rows are we talking?

    > I was thinking something along these lines (pseudocode below) but I was
    > hoping there would be some way to give the parent access to the
    > statement handle itself so it could pull the rows once the queries were
    > complete


    I'm pretty sure that that ain't gonna happen.

    > rather than pulling all the rows in the child, and serializing
    > and passing to the parent.
    >
    > Any ideas or maybe modules which could be handy? (I looked at both
    > Acme::Spork and Parallel::ForkManager but neither are appropriate)


    Parallel::Jobs might help, although it isn't very easy to use and wants to
    run commands rather than Perl code (so you would have to fire up a new Perl
    interpreter from scratch, and it wouldn't inherit variables, etc, from the
    existing one). I've posted here a while ago a simple mod of
    Parallel::ForkManager to allow back tack from the child to the parent, but
    it send the data as a slug of serialized data, not as a stream. Whether
    this is deadly or not depends on how many rows.

    http://groups.google.com/group/comp.lang.perl.modules/browse_frm/thread/9c8
    ef79472740156/

    Also, there is Parallel::Simple, which I never used but looks promising.

    >
    > my $pid;
    > for my $kid (@kids) {
    > $pid = open( my $fh, '|-');
    > die "Can't fork: $!\n" if not defined $pid;
    >
    > @{ $kid }{ 'pid','handle' } = ($pid,$fh);
    > }
    >
    > if ( $pid == 0) { # I'm one of the children


    How does the child know which child it is? It would have to loop over
    @kids comparing it's pid to the stored pid (which it can't do, because
    the stored pid is stored only in the parent, not the child). Better to
    move this else block into the inside of the the for my $kid loop, that way
    the kid automatically know who it is by looking in $kid.

    ....

    > # wait for results
    > # foreach @row = $sth->fetchrow_array


    Technically, you can't waid for results as a separate action from
    calling fetchrow_array. fetchrow_array is inherently waiting for results.

    > # print join( $;, @row ), "\n"
    >
    > # exit; (exit or waitpid? do I know the parent read everything?)


    exit. You almost certainly don't want to waitpid (for what? The parent?).
    At this point, you stuffed everything you have into the buffer up to the
    parent. That is all you an do. If the parent doesn't read it all, what
    is the child to do about it?

    > }
    > else { # I'm the parent
    >
    > # while ( my $line = < $kids[0]{handle} >) {
    > # @row = split $;, $line;
    > # do whatever with @row


    You will want to "close $kids[0]{handle} or [die|warn] $! $?;" to make sure
    the kid ended things on a good note. The close implicilty waits for the
    the child to exit.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , Nov 24, 2006
    #2
    1. Advertising

  3. DJ Stunks

    Rocco Caputo Guest

    On 24 Nov 2006 11:23:27 -0800, DJ Stunks wrote:
    >
    > I have two straightforward SELECT statements, but both take about two
    > minutes to complete. I'd like to run them in parallel, but I obviously
    > need access to all the rows - what's the best way to do so?

    [...]
    > Any ideas or maybe modules which could be handy? (I looked at both
    > Acme::Spork and Parallel::ForkManager but neither are appropriate)


    There are four ways to do this with POE. Whether they're appropriate
    is left as an exercise. :)

    http://search.cpan.org/search?query=POE DBI&mode=dist

    --
    Rocco Caputo - http://poe.perl.org/
    Rocco Caputo, Nov 25, 2006
    #3
  4. DJ Stunks

    DJ Stunks Guest

    Rocco Caputo wrote:
    > On 24 Nov 2006 11:23:27 -0800, DJ Stunks wrote:
    > >
    > > I have two straightforward SELECT statements, but both take about two
    > > minutes to complete. I'd like to run them in parallel, but I obviously
    > > need access to all the rows - what's the best way to do so?

    > [...]
    > > Any ideas or maybe modules which could be handy? (I looked at both
    > > Acme::Spork and Parallel::ForkManager but neither are appropriate)

    >
    > There are four ways to do this with POE. Whether they're appropriate
    > is left as an exercise. :)
    >
    > http://search.cpan.org/search?query=POE DBI&mode=dist


    this looks great, time to start reading :)

    you know POE had occurred to me, my search terms must have been bad
    choices.

    Thanks to both who responded,
    -jp
    DJ Stunks, Nov 25, 2006
    #4
  5. On 2006-11-24 19:23, DJ Stunks <> wrote:
    > I have two straightforward SELECT statements, but both take about two
    > minutes to complete. I'd like to run them in parallel, but I obviously
    > need access to all the rows - what's the best way to do so?
    >
    > I was thinking something along these lines (pseudocode below) but I was
    > hoping there would be some way to give the parent access to the
    > statement handle itself so it could pull the rows once the queries were
    > complete


    Theoretically, that's possible. In practice, database client libaries
    don't support that.

    [...]
    > if ( $pid == 0) { # I'm one of the children
    >
    > # connect to the db
    > # prepare query
    > # execute query
    > # wait for results
    > # foreach @row = $sth->fetchrow_array
    > # print join( $;, @row ), "\n"
    >
    > # exit; (exit or waitpid? do I know the parent read everything?)
    > }
    > else { # I'm the parent
    >
    > # while ( my $line = < $kids[0]{handle} >) {
    > # @row = split $;, $line;
    > # do whatever with @row
    > #
    > # while ( my $line = < $kids[1]{handle} >) {
    > # etc.
    > }


    You are reading all the lines from kid 0 before the first line from kid 1
    here. This means that kid 1 will block as soon as it has written enough
    records to fill the pipe (typically a few kB). Not a problem if your
    queried only return a few rows, but if they are long-running because
    they return lots of rows, they won't really run in parallel.

    You may want to look at select (or IO::Select), but be warned that select
    and <> don't mix well - you need to use sysread instead.

    hp

    --
    _ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
    |_|_) | Sysadmin WSR | > ist?
    | | | | Was sonst wäre der Sinn des Erfindens?
    __/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd
    Peter J. Holzer, Nov 26, 2006
    #5
  6. DJ Stunks

    DJ Stunks Guest

    DJ Stunks wrote:
    > Rocco Caputo wrote:
    > > On 24 Nov 2006 11:23:27 -0800, DJ Stunks wrote:
    > > >
    > > > I have two straightforward SELECT statements, but both take about two
    > > > minutes to complete. I'd like to run them in parallel, but I obviously
    > > > need access to all the rows - what's the best way to do so?

    > > [...]
    > > > Any ideas or maybe modules which could be handy? (I looked at both
    > > > Acme::Spork and Parallel::ForkManager but neither are appropriate)

    > >
    > > There are four ways to do this with POE. Whether they're appropriate
    > > is left as an exercise. :)
    > >
    > > http://search.cpan.org/search?query=POE DBI&mode=dist

    >
    > this looks great, time to start reading :)


    Well, I went with EasyDBI which looks ideal, however, my queries are
    both coming back with the error "Died". The queries work through
    straight DBI, and both of them take the right amount of time through
    EasyDBI, but no rows return, just the error.

    I'm not sure if anyone is familiar with this module. It doesn't look
    like I can get any more detailed debugging as to what "Died". I'll try
    one of the other 4 suggestions tomorrow and see how it goes.

    If it matters, the queries use the DBD::ODBC driver and I'm running on
    Win32 (ActiveState Perl 5.8.7 [813], DBD::ODBC 1.13, POE 0.9500,
    EasyDBI 1.14)

    Thanks for the suggestion though, I'll keep at it.
    -jp
    DJ Stunks, Nov 27, 2006
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Soren
    Replies:
    4
    Views:
    1,241
    c d saunter
    Feb 14, 2008
  2. Vivek Menon
    Replies:
    5
    Views:
    3,321
    Paul Uiterlinden
    Jun 8, 2011
  3. Vivek Menon
    Replies:
    0
    Views:
    1,752
    Vivek Menon
    Jun 10, 2011
  4. Daniel Choi
    Replies:
    4
    Views:
    111
    Robert Klemme
    Sep 24, 2008
  5. Prabh

    Parallel/ Multithreading /Forking?

    Prabh, Oct 27, 2004, in forum: Perl Misc
    Replies:
    4
    Views:
    129
Loading...

Share This Page