Need suggestions to improve code

Discussion in 'Perl Misc' started by SSS Develop, Aug 24, 2013.

  1. SSS Develop

    SSS Develop Guest

    Hello,

    I am working on some data crunching/migration. The data is in a PostgreSQL database.

    The table I am reading from has more than 20 million rows.
    Processing this data in the normal (linear) way takes more than 2 days :(

    I thought of using the Parallel::ForkManager Perl module. Please help me optimize/improve the following program.

    -------------
    use strict;
    use warnings;
    use DBI;
    use Parallel::ForkManager;


    my $MAX = 100;
    my $pm = Parallel::ForkManager->new($MAX);


    my $dbh = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;

    my $query = $dbh->prepare("select id from secrete_records");

    $query->execute();


    while ( my $record = $query->fetchrow_hashref() ) {

    my $pid = $pm->start and next;

    # creating new Database connection
    my $dbh_internal = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;
    my $query_internal = $dbh_internal("select id, record_one, record_two from secrete_records where id=?")
    $query_internal->execute($record);

    # do some activity (includes insert, update in secrete_records_archives
    # and other relevant tables in the same database)


    $dbh_internal->disconnect();
    $pm->finish();


    }

    $pm->wait_all_children;    # wait for the forked children before exiting
    $dbh->disconnect();

    ------------

    Note:

    I am creating multiple database connections ($dbh, $dbh_internal). Not sure if that's required. I tried to use one connection but it throws an error related to SSL :( even though I am not using the database connection in SSL mode.



    Are there any other ways to handle such a large amount of data in Perl?


    thank you,

    ---SSS
    SSS Develop, Aug 24, 2013
    #1

  2. SSS Develop <> writes:

    [...]

    > The table I am reading from has more than 20 million rows.
    > Processing this data in the normal (linear) way takes more than 2 days :(
    >
    > I thought of using the Parallel::ForkManager Perl module. Please help
    > me optimize/improve the following program.


    The speed improvement you can achieve in this way depends on what the
    computer (or computers) you are using can actually do in
    parallel. For compute-bound parts of the application (according to
    the code below, there aren't any), it can roughly do as many things in
    parallel as there are available CPUs/cores, and for I/O-bound parts,
    as many as there are available, independent 'I/O processing thingies',
    i.e. disks or SSDs. Once you get past either number, more threads of
    execution running the code will actually slow things down, as they
    block each other while competing for resources and the overhead of
    switching among them increases.
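
    As a rough starting point, something like this sizes the worker pool
    from the CPU count instead of hard-coding 100 (a minimal sketch:
    counting via /proc/cpuinfo is Linux-specific, and the factor of 2 for
    I/O-bound work is only a guess to be tuned by measurement):

    # Count the CPUs the kernel reports.
    open my $fh, '<', '/proc/cpuinfo' or die "cannot read /proc/cpuinfo: $!";
    my $cpus = grep { /^processor\s*:/ } <$fh>;   # scalar context: number of matches
    close $fh;
    $cpus ||= 1;

    # For I/O-bound work a small multiple of the CPU count is a common
    # starting point; measure before settling on a number.
    my $pm = Parallel::ForkManager->new( 2 * $cpus );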

    > -------------
    > use strict;
    > use warnings;
    > use DBI;
    > use Parallel::ForkManager;
    >
    >
    > my $MAX = 100;
    > my $pm = Parallel::ForkManager->new($MAX);


    Which means that, except in relatively rare situations, 100 is
    certainly far too many.

    > my $dbh = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;


    Is the database running on another host? If so, you should run your
    code there. Otherwise, using AF_UNIX sockets to connect to the server
    should be somewhat faster than using a (mock) TCP connection.
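
    With DBD::Pg that usually just means pointing 'host' at the socket
    directory, or omitting it so libpq falls back to its default; the
    directory below is a common Linux location and may differ on your
    system:

    my $dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=/var/run/postgresql",
        "postgres", "postgres" ) or die DBI->errstr;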

    > my $query = $dbh->prepare("select id from secrete_records");
    >
    > $query->execute();
    >
    >
    > while ( my $record = $query->fetchrow_hashref() ) {
    >
    > my $pid = $pm->start and next;
    >
    > # creating new Database connection
    > my $dbh_internal = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;
    > my $query_internal = $dbh_internal("select id, record_one, record_two from secrete_records where id=?")
    > $query_internal->execute($record);
    >
    > # do some activity (includes insert, update in secrete_records_archives
    > # and other relevant tables in the same database)


    Another obvious idea would be: move the processing into the database
    in order to avoid copying the data back and forth between your
    program and the database server(s). Postgres can be programmed in a
    variety of languages, Perl among them.
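
    For instance, if the per-row work is essentially "copy to the archive
    table", a single set-based statement does the whole job inside the
    server (the column list below is only a guess based on the snippets
    in your post):

    # One statement instead of millions of round trips.
    $dbh->do(q{
        insert into secrete_records_archives (id, record_one, record_two)
        select id, record_one, record_two
        from secrete_records
    });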

    You should also consider tuning the database configuration, in
    particular the 'fsync'-related parts and the various buffer settings
    whose defaults were presumably selected for some VAX running 4.3BSD in
    1989 or so (meaning they're ridiculously low for any even remotely
    modern computer).
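
    For example, in postgresql.conf (the values here are placeholders
    meant to show which knobs exist, not recommendations; measure on your
    own hardware):

    shared_buffers = 1GB          # historical default was a few MB
    checkpoint_segments = 32      # fewer, larger checkpoints during bulk work
    synchronous_commit = off      # trade durability of the last commits for speed
    work_mem = 64MB               # per-sort/per-hash working memory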
    Rainer Weikusat, Aug 24, 2013
    #2

  3. On 08/24/13 01:01, SSS Develop wrote:
    > Hello,
    >
    > I am working on some data crunching/migration. The data is in a
    > PostgreSQL database.
    >
    > The table I am reading from has more than 20 million rows.
    > Processing this data in the normal (linear) way takes more than
    > 2 days :(


    That is roughly 115 records per second, which seems astonishingly
    slow. Why is it so slow? Figure that out. Then either fix it, or use
    that knowledge to parallelize in an appropriate way.
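
    DBI's built-in profiling is one cheap way to see where the time goes;
    for example

    DBI_PROFILE=2 perl your_script.pl

    prints per-statement timings when the program exits (your_script.pl
    standing in for whatever your file is actually called).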

    >
    > I thought of using the Parallel::ForkManager Perl module. Please help
    > me optimize/improve the following program.
    >
    > -------------
    > use strict;
    > use warnings;
    > use DBI;
    > use Parallel::ForkManager;
    >
    >
    > my $MAX = 100;


    $MAX of 100 seems awfully high, unless you have remarkable hardware.
    "The beatings will continue until morale improves."

    > my $pm = Parallel::ForkManager->new($MAX);
    >
    >
    > my $dbh = DBI->connect( "DBI:Pg:dbname=" . 'postgres' . ";host=" .
    > '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;
    >
    > my $query = $dbh->prepare("select id from secrete_records");
    >
    > $query->execute();
    >
    >
    > while ( my $record = $query->fetchrow_hashref() ) {
    >
    > my $pid = $pm->start and next;
    >
    > # creating new Database connection



    > my $dbh_internal = DBI->connect(
    > "DBI:Pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres",
    > "postgres" ) or die DBI->errstr;


    You are creating a new connection for every record. While connecting
    shouldn't take anywhere near 100 ms, and so may not be what limits
    your rate, it is still going to be pretty inefficient. Alas, forking
    once per record is the pattern Parallel::ForkManager makes easy, so
    maybe ForkManager isn't the right thing to use.
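
    One common workaround is to fork once per batch instead of once per
    record, so each child pays the connection setup once and reuses a
    single prepared statement. A rough, untested sketch reusing your $dbh
    and $pm from above ($child_dbh is just my name for the per-child
    handle; the batch size is a number to tune, and with 20 million ids
    the in-memory list is itself sizeable, so selecting id ranges instead
    would avoid that):

    my $ids = $dbh->selectcol_arrayref("select id from secrete_records");

    my $batch_size = 10_000;
    while ( my @batch = splice @$ids, 0, $batch_size ) {
        $pm->start and next;

        # Child: don't let the inherited parent handle be torn down here.
        $dbh->{InactiveDestroy} = 1;

        # One connection and one prepared statement per child.
        my $child_dbh = DBI->connect( "DBI:Pg:dbname=postgres;host=192.168.1.1",
            "postgres", "postgres" ) or die DBI->errstr;
        my $sth = $child_dbh->prepare(
            "select id, record_one, record_two from secrete_records where id=?");

        for my $id (@batch) {
            $sth->execute($id);
            # ... per-row archiving work goes here ...
        }

        $child_dbh->disconnect();
        $pm->finish;
    }
    $pm->wait_all_children;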

    > my $query_internal = $dbh_internal("select id, record_one, record_two
    > from secrete_records where id=?")


    Syntax error: a database handle isn't a code ref. You presumably meant
    $dbh_internal->prepare(...), and the statement is also missing its
    trailing semicolon.

    > $query_internal->execute($record);


    execute does not accept a hash reference; fetchrow_hashref hands you
    one, so this would need to be the actual id, i.e. $record->{id}.
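
    Together, those two lines would presumably need to become something
    like:

    my $query_internal = $dbh_internal->prepare(
        "select id, record_one, record_two from secrete_records where id=?");
    $query_internal->execute( $record->{id} );    # the id column, not the hashref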


    >
    > Note:
    >
    > I am creating multiple database connections ($dbh, $dbh_internal).
    > Not sure if that's required. I tried to use one connection but it
    > throws an error related to SSL :( even though I am not using the
    > database connection in SSL mode.
    >
    >
    >
    > Are there any other ways to handle such a large amount of data in Perl?


    20 million is a fairly small number, not a large number. Figure out
    *why* it is so slow, then speed it up.

    Xho
    Xho Jingleheimerschmidt, Aug 27, 2013
    #3
