Need suggestions to improve code

Discussion in 'Perl Misc' started by SSS Develop, Aug 24, 2013.

  1. SSS Develop

    SSS Develop Guest

    Hello,

    I am working on some data crunching/migration. The data is in a PostgreSQL database.

    The table I am reading from has more than 20 million rows.
    Processing this data in the normal (linear) way takes more than 2 days :(

    I thought of using the Parallel::ForkManager Perl module. Please help me optimize/improve the following program.

    -------------
    use strict;
    use warnings;
    use DBI;
    use Parallel::ForkManager;


    my $MAX = 100;
    my $pm = Parallel::ForkManager->new($MAX);


    my $dbh = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;

    my $query = $dbh->prepare("select id from secrete_records");

    $query->execute();


    while ( my $record = $query->fetchrow_hashref() ) {

    my $pid = $pm->start and next;

    # creating new Database connection
    my $dbh_internal = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;
    my $query_internal = $dbh_internal("select id, record_one, record_two from secrete_records where id=?")
    $query_internal->execute($record);

    # do some activity (includes insert, update in secrete_records_archives
    # and other relevant tables in the same database)


    $dbh_internal->disconnect();
    $pm->finish();


    }

    $pm->wait_all_children;    # wait for the forked children before exiting
    $dbh->disconnect();

    ------------

    Note:

    I am creating multiple database connections ($dbh, $dbh_internal). Not sure if that's required. I tried to use one connection but it throws an error related to SSL :( even though I am not using the database connection in SSL mode.



    Are there any other ways to handle such a large amount of data in Perl?


    thank you,

    ---SSS
    SSS Develop, Aug 24, 2013
    #1

  2. SSS Develop <> writes:

    [...]

    > The table I am reading from has more than 20 million rows.
    > Processing this data in the normal (linear) way takes more than 2 days :(
    >
    > I thought of using the Parallel::ForkManager Perl module. Please help
    > me optimize/improve the following program.


    The speed improvement you can achieve in this way depends on what the
    computer (or computers) you are using can actually do in
    parallel. For compute-bound parts of the application (according to
    the code below, there aren't any), it can roughly do as many things in
    parallel as there are available CPUs/cores, and for I/O-bound parts,
    as many as there are available, independent 'I/O processing thingies',
    i.e. disks or SSDs. Once you get past either number, more threads of
    execution running the code will actually slow things down, as they
    block each other while competing for resources and the overhead of
    switching among them increases.
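
    As a rough starting point, something like this sizes the worker pool
    from the CPU count instead of hard-coding 100 (a minimal sketch:
    counting via /proc/cpuinfo is Linux-specific, and the factor of 2 for
    I/O-bound work is only a guess to be tuned by measurement):

    # Count the CPUs the kernel reports.
    open my $fh, '<', '/proc/cpuinfo' or die "cannot read /proc/cpuinfo: $!";
    my $cpus = grep { /^processor\s*:/ } <$fh>;   # scalar context: number of matches
    close $fh;
    $cpus ||= 1;

    # For I/O-bound work a small multiple of the CPU count is a common
    # starting point; measure before settling on a number.
    my $pm = Parallel::ForkManager->new( 2 * $cpus );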

    > -------------
    > use strict;
    > use warnings;
    > use DBI;
    > use Parallel::ForkManager;
    >
    >
    > my $MAX = 100;
    > my $pm = Parallel::ForkManager->new($MAX);


    Which means that, except in relatively rare situations, 100 is
    certainly far too many.

    > my $dbh = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;


    Is the database running on another host? If so, you should run your
    code there. Otherwise, using AF_UNIX sockets to connect to the server
    should be somewhat faster than using a (mock) TCP connection.
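
    With DBD::Pg that usually just means pointing 'host' at the socket
    directory, or omitting it so libpq falls back to its default; the
    directory below is a common Linux location and may differ on your
    system:

    my $dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=/var/run/postgresql",
        "postgres", "postgres" ) or die DBI->errstr;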

    > my $query = $dbh->prepare("select id from secrete_records");
    >
    > $query->execute();
    >
    >
    > while ( my $record = $query->fetchrow_hashref() ) {
    >
    > my $pid = $pm->start and next;
    >
    > # creating new Database connection
    > my $dbh_internal = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;
    > my $query_internal = $dbh_internal("select id, record_one, record_two from secrete_records where id=?")
    > $query_internal->execute($record);
    >
    > # do some activity (includes insert, update in secrete_records_archives
    > # and other relevant tables in the same database)


    Another obvious idea would be: move the processing into the database
    in order to avoid copying the data back and forth between your
    program and the database server(s). Postgres can be programmed in a
    variety of languages, Perl among them.
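
    For instance, if the per-row work is essentially "copy to the archive
    table", a single set-based statement does the whole job inside the
    server (the column list below is only a guess based on the snippets
    in your post):

    # One statement instead of millions of round trips.
    $dbh->do(q{
        insert into secrete_records_archives (id, record_one, record_two)
        select id, record_one, record_two
        from secrete_records
    });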

    You should also consider tuning the database configuration, in
    particular the 'fsync'-related parts and the various buffer settings
    whose defaults were presumably selected for some VAX running 4.3BSD in
    1989 or so (meaning they're ridiculously low for any even remotely
    modern computer).
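
    For example, in postgresql.conf (the values here are placeholders
    meant to show which knobs exist, not recommendations; measure on your
    own hardware):

    shared_buffers = 1GB          # historical default was a few MB
    checkpoint_segments = 32      # fewer, larger checkpoints during bulk work
    synchronous_commit = off      # trade durability of the last commits for speed
    work_mem = 64MB               # per-sort/per-hash working memory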
    Rainer Weikusat, Aug 24, 2013
    #2

  3. On 08/24/13 01:01, SSS Develop wrote:
    > Hello,
    >
    > I am working on some data crunching/migration. The data is in a
    > PostgreSQL database.
    >
    > The table I am reading from has more than 20 million rows.
    > Processing this data in the normal (linear) way takes more than
    > 2 days :(


    That is roughly 115 records per second, which seems astonishingly
    slow. Why is it so slow? Figure that out. Then either fix it, or use
    that knowledge to parallelize in an appropriate way.
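
    DBI's built-in profiling is one cheap way to see where the time goes;
    for example

    DBI_PROFILE=2 perl your_script.pl

    prints per-statement timings when the program exits (your_script.pl
    standing in for whatever your file is actually called).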

    >
    > I thought of using the Parallel::ForkManager Perl module. Please help
    > me optimize/improve the following program.
    >
    > -------------
    > use strict;
    > use warnings;
    > use DBI;
    > use Parallel::ForkManager;
    >
    >
    > my $MAX = 100;


    $MAX of 100 seems awfully high, unless you have remarkable hardware.
    "The beatings will continue until morale improves."

    > my $pm = Parallel::ForkManager->new($MAX);
    >
    >
    > my $dbh = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" .
    > '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;
    >
    > my $query = $dbh->prepare("select id from secrete_records");
    >
    > $query->execute();
    >
    >
    > while ( my $record = $query->fetchrow_hashref() ) {
    >
    > my $pid = $pm->start and next;
    >
    > # creating new Database connection



    > my $dbh_internal = DBI->connect(
    > "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres",
    > "postgres" ) or die DBI->errstr;


    You are creating a new connection for every record. While connecting
    shouldn't take anywhere near 100 ms, and so may not be what limits
    your rate, it is still going to be pretty inefficient. Alas, forking
    once per record is the pattern Parallel::ForkManager makes easy, so
    maybe ForkManager isn't the right thing to use.
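
    One common workaround is to fork once per batch instead of once per
    record, so each child pays the connection setup once and reuses a
    single prepared statement. A rough, untested sketch reusing your $dbh
    and $pm from above ($child_dbh is just my name for the per-child
    handle; the batch size is a number to tune, and with 20 million ids
    the in-memory list is itself sizeable, so selecting id ranges instead
    would avoid that):

    my $ids = $dbh->selectcol_arrayref("select id from secrete_records");

    my $batch_size = 10_000;
    while ( my @batch = splice @$ids, 0, $batch_size ) {
        $pm->start and next;

        # Child: don't let the inherited parent handle be torn down here.
        $dbh->{InactiveDestroy} = 1;

        # One connection and one prepared statement per child.
        my $child_dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=192.168.1.1",
            "postgres", "postgres" ) or die DBI->errstr;
        my $sth = $child_dbh->prepare(
            "select id, record_one, record_two from secrete_records where id=?");

        for my $id (@batch) {
            $sth->execute($id);
            # ... per-row archiving work goes here ...
        }

        $child_dbh->disconnect();
        $pm->finish;
    }
    $pm->wait_all_children;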

    > my $query_internal = $dbh_internal("select id, record_one, record_two
    > from secrete_records where id=?")


    Syntax error: a database handle isn't a code ref. You presumably meant
    $dbh_internal->prepare(...), and the statement is also missing its
    trailing semicolon.

    > $query_internal->execute($record);


    execute does not accept a hash reference; fetchrow_hashref hands you
    one, so this would need to be the actual id, i.e. $record->{id}.
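
    Together, those two lines would presumably need to become something
    like:

    my $query_internal = $dbh_internal->prepare(
        "select id, record_one, record_two from secrete_records where id=?");
    $query_internal->execute( $record->{id} );    # the id column, not the hashref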


    >
    > Note:
    >
    > I am creating multiple database connections ($dbh, $dbh_internal).
    > Not sure if that's required. I tried to use one connection but it
    > throws an error related to SSL :( even though I am not using the
    > database connection in SSL mode.
    >
    >
    >
    > Are there any other ways to handle such a large amount of data in Perl?


    20 million is a fairly small number, not a large number. Figure out
    *why* it is so slow, then speed it up.

    Xho
    Xho Jingleheimerschmidt, Aug 27, 2013
    #3
