Need suggestions to improve code


SSS Develop

Hello,

I am working on some data crunching/migration. The data is in a PostgreSQL database.

The table I am reading from has more than 20 million rows. Processing this data in the normal (linear) way takes more than 2 days :(

I thought of using the Parallel::ForkManager Perl module. Please help me optimize/improve the following program.

-------------
use strict;
use warnings;
use DBI;
use Parallel::ForkManager;


my $MAX = 100;
my $pm = Parallel::ForkManager->new($MAX);


my $dbh = DBI->connect( "dbi:Pg:dbname=postgres;host=192.168.1.1", "postgres", "postgres" ) or die DBI->errstr;

my $query = $dbh->prepare("select id from secrete_records");

$query->execute();


while ( my $record = $query->fetchrow_hashref() ) {

my $pid = $pm->start and next;

# creating new Database connection
my $dbh_internal = DBI->connect( "dbi:Pg:dbname=postgres;host=192.168.1.1", "postgres", "postgres" ) or die DBI->errstr;
my $query_internal = $dbh_internal("select id, record_one, record_two from secrete_records where id=?")
$query_internal->execute($record);

# do some activity (includes insert/update in secrete_records_archives
# and other relevant tables in the same database)


$dbh_internal->disconnect();
$pm->finish();


}

$pm->wait_all_children;
$dbh->disconnect();

------------

Note:

I am creating multiple connections to the database ($dbh, $dbh_internal). Not sure if that's required. I tried to use one connection but it throws an error related to SSL :( even though I am not using the database connection in SSL mode.



Any other ways to handle such a large amount of data in Perl?


thank you,

---SSS
 

Rainer Weikusat

[...]
The table I am reading from has more than 20 million rows. Processing
this data in the normal (linear) way takes more than 2 days :(

I thought of using the Parallel::ForkManager Perl module. Please help
me optimize/improve the following program.

The speed improvement you can achieve in this way depends on what the
computer (or computers) you are using can actually do in
parallel. For compute-bound parts of the application (according to
the code below, there aren't any), it can roughly do as many things in
parallel as there are available CPUs/cores, and for I/O-bound parts,
as many as there are available, independent 'I/O processing thingies',
i.e. disks or SSDs. Once you get past either number, more threads of
execution running the code will actually slow things down, as they
block each other while competing for resources and the overhead of
switching among them increases.
-------------
use strict;
use warnings
use DBI;
use Parallel::ForkManager;


my $MAX = 100;
my $pm = Parallel::ForkManager->new($MAX);

Which means that, except in relatively rare situations, 100 is
certainly way too much.
my $dbh = DBI->connect( "DBI:pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;

Is the database running on another host? If so, you should run your
code there. Otherwise, using AF_UNIX sockets to connect to the server
should be somewhat faster than using a (mock) TCP connection.
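With DBD::Pg, a `host` value beginning with `/` is treated as the directory containing the server's UNIX-domain socket, so switching transports is a one-line change. A minimal sketch; the socket directory `/var/run/postgresql` is an assumption and varies by distribution (see `unix_socket_directory` in postgresql.conf):

```perl
use strict;
use warnings;
use DBI;

# Connect over a local UNIX-domain socket instead of TCP. With DBD::Pg,
# a 'host' starting with '/' names the socket directory, not a hostname.
# /var/run/postgresql is an assumption -- check your installation.
my $dbh = DBI->connect(
    "dbi:Pg:dbname=postgres;host=/var/run/postgresql",
    "postgres", "postgres",
    { RaiseError => 1, AutoCommit => 1 },
) or die DBI->errstr;
```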
my $query = $dbh->prepare("select id from secrete_records");

$query->execute();


while ( my $record = $query->fetchrow_hashref() ) {

my $pid = $pm->start and next;

# creating new Database connection
my $dbh_internal = DBI->connect( "DBI:pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;
my $query_internal = $dbh_internal("select id, record_one, record_two from secrete_records where id=?")
$query_internal->execute($record);

//
// do some activity (includes insert, update in secrete_records_archives and other relavent tables from same database

Another obvious idea would be: move the processing into the database
in order to avoid copying the data back and forth between your
program and the database server(s). Postgres can be programmed in a
variety of languages, Perl being among them.
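For example, if the per-row work can be expressed in SQL, the entire archive step may collapse into one set-based statement executed server-side, instead of 20 million client round trips. A hedged sketch; the column list and the layout of secrete_records_archives are assumptions based on the code above:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=postgres;host=192.168.1.1",
                       "postgres", "postgres", { RaiseError => 1 })
    or die DBI->errstr;

# One set-based statement archives every row inside the server; no row
# data ever crosses the wire. Column names are assumptions taken from
# the original SELECT.
$dbh->do(q{
    INSERT INTO secrete_records_archives (id, record_one, record_two)
    SELECT id, record_one, record_two
    FROM secrete_records
});

$dbh->disconnect();
```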

You should also consider tuning the database configuration, in
particular, the 'fsync'-related parts and the various buffer settings
whose defaults were presumably selected for some VAX running 4.3BSD in
1989 or so (this means they're ridiculously low for any even remotely
modern computer).
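Tuning is workload-specific, but for a one-off bulk migration, settings along these lines are a common starting point. The values below are illustrative assumptions, not recommendations, and parameter names vary between Postgres versions:

```
# postgresql.conf -- illustrative values for a bulk migration only
shared_buffers = 512MB          # historic defaults were a few MB
work_mem = 64MB                 # per-sort/per-hash memory
maintenance_work_mem = 256MB    # index builds, VACUUM
checkpoint_segments = 32        # pre-9.5; newer versions use max_wal_size
synchronous_commit = off        # don't wait for a WAL flush on every commit
# fsync = off                   # fastest but UNSAFE: only if the whole
                                # migration can simply be re-run after a crash
```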
 

Xho Jingleheimerschmidt

Hello,

I am working on some data crunching/migration. The data is in a
PostgreSQL database.

The table I am reading from has more than 20 million rows. Processing
this data in the normal (linear) way takes more than 2 days :(

That is roughly 115 records per second, which seems astonishingly slow.
Why is it so slow? Figure that out. Then either fix it, or use that
knowledge to parallelize in an appropriate way.
I thought of using the Parallel::ForkManager Perl module. Please help
me optimize/improve the following program.

-------------
use strict;
use warnings
use DBI;
use Parallel::ForkManager;


my $MAX = 100;

$MAX of 100 seems awfully high, unless you have remarkable hardware.
"The beatings will continue until morale improves."
my $pm = Parallel::ForkManager->new($MAX);


my $dbh = DBI->connect( "DBI:pg:dbname=" . 'postgres' . ";host=" .
'192.168.1.1', "postgres", "postgres" ) or die DBI->errstr;

my $query = $dbh->prepare("select id from secrete_records");

$query->execute();


while ( my $record = $query->fetchrow_hashref() ) {

my $pid = $pm->start and next;

# creating new Database connection

my $dbh_internal = DBI->connect(
"DBI:pg:dbname=" . 'postgres' . ";host=" . '192.168.1.1', "postgres",
"postgres" ) or die DBI->errstr;

You are creating a new connection for every record. While that
shouldn't take nearly 100 ms per connection and so it might not be rate
limiting in your case, it is still going to be pretty inefficient.
Alas, Parallel::ForkManager doesn't facilitate anything else, so maybe
ForkManager isn't the right thing to use.
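One workaround that keeps Parallel::ForkManager is to fork per batch of IDs rather than per row, so each child pays for one connection per few thousand rows instead of one per row. A hedged sketch; the batch size, worker count, and the per-row processing step are all assumptions:

```perl
use strict;
use warnings;
use DBI;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(4);   # a few workers, not 100

# Fetch all IDs up front, then close the parent's handle BEFORE forking;
# this also avoids sharing one connection across fork (the SSL error).
my $dbh = DBI->connect("dbi:Pg:dbname=postgres;host=192.168.1.1",
                       "postgres", "postgres", { RaiseError => 1 })
    or die DBI->errstr;
my $ids = $dbh->selectcol_arrayref("SELECT id FROM secrete_records");
$dbh->disconnect();

my $batch_size = 5000;                    # assumption: tune to taste
while (my @batch = splice @$ids, 0, $batch_size) {
    $pm->start and next;                  # child handles one whole batch

    # One connection and one prepared statement per batch, not per row.
    my $child_dbh = DBI->connect("dbi:Pg:dbname=postgres;host=192.168.1.1",
                                 "postgres", "postgres", { RaiseError => 1 })
        or die DBI->errstr;
    my $sth = $child_dbh->prepare(
        "SELECT id, record_one, record_two FROM secrete_records WHERE id = ?");
    for my $id (@batch) {
        $sth->execute($id);
        my $row = $sth->fetchrow_hashref;
        # ... archive/update using $row here ...
    }
    $child_dbh->disconnect();
    $pm->finish;
}
$pm->wait_all_children;
```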
my $query_internal = $dbh_internal("select id, record_one, record_two
from secrete_records where id=?")

Syntax error.
$query_internal->execute($record);

execute does not accept a hash reference.

Note:

I am creating multiple connections with Database ($dbh,
$dbh_internal). Not sure if that's required. I tried to use one
connection but it trows error. Its related to SSL :( though I am not
using database connection in SSL mode.



Any other ways to handle such a large amount of data in Perl?

20 million is a fairly small number, not a large number. Figure out
*why* it is so slow, then speed it up.
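DBI ships with DBI::Profile, which can show where the time actually goes before reaching for parallelism. A minimal sketch (connection details copied from the original post):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=postgres;host=192.168.1.1",
                       "postgres", "postgres", { RaiseError => 1 })
    or die DBI->errstr;

# Aggregate timings per SQL statement; the report is printed when the
# handle goes away. Equivalent to setting DBI_PROFILE=!Statement in
# the environment, with no code change at all.
$dbh->{Profile} = "!Statement";

# ... run the normal, linear migration here ...

$dbh->disconnect();
```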

Xho
 
