Read from database, write to another database, simultaneously


Sean Davis

I am working on a simple script to read from one database (oracle) and
write to another (postgresql). I retrieve the data from oracle in
chunks and drop the data to postgresql continuously. The author of one
of the python database clients mentioned that using one thread to
retrieve the data from the oracle database and another to insert the
data into postgresql with something like a pipe between the two threads
might make sense, keeping both IO streams busy. Any hints on how to
get started?

Thanks,
Sean
 

Bjoern Schliessmann

Sean said:
The author of one of the python database clients mentioned that
using one thread to retrieve the data from the oracle database and
another to insert the data into postgresql with something like a
pipe between the two threads might make sense, keeping both IO
streams busy.

IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not least because there's almost nothing to compute here,
only IO traffic. CMIIW.

Using multiplexing, you'll get good results with simple code without
the danger of deadlocks. Have a look at asyncore (standard library)
or the Twisted framework -- personally, I prefer the latter.

Regards,


Björn
 

johnf

Bjoern said:
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not least because there's almost nothing to compute here,
only IO traffic. CMIIW.

Using multiplexing, you'll get good results with simple code without
the danger of deadlocks. Have a look at asyncore (standard library)
or the Twisted framework -- personally, I prefer the latter.

Regards,


Björn
Sean, you can't win - everyone has a different idea! You need to explain
that oracle has millions of records and that it's possible to keep a pipe
open to feed the Postgres side.

One thing I didn't get - is this a one-time transfer or something that is
going to happen often?

If it's a one-time transfer, just live with the time issue.

Johnf
 

Dennis Lee Bieber

I am working on a simple script to read from one database (oracle) and
write to another (postgresql). I retrieve the data from oracle in
chunks and drop the data to postgresql continuously. The author of one
of the python database clients mentioned that using one thread to
retrieve the data from the oracle database and another to insert the
data into postgresql with something like a pipe between the two threads
might make sense, keeping both IO streams busy. Any hints on how to
get started?

If this is a one-time process -- ie, you are porting all data from
Oracle to PostgreSQL once, and then dropping use of Oracle -- optimizing
the script is probably not worth the effort. Heck, is a script even
needed -- I don't know about Oracle, but MySQL has a command line tool
to dump databases as text SQL. Some work with find&replace to change
MySQL specific syntax to PostgreSQL syntax, followed by feeding the SQL
file to a PostgreSQL command line tool would be sufficient for a
one-time process.

If you really want to do a threaded job... read the library
reference manual sections for the threading module, and the queue
module.

-=-=-=-=-=- PSEUDOCODE
import threading
import Queue  # I seem to recall this module is capitalized

TRANSFER_END = "Some Unique Object"

transferQueue = Queue.Queue()  # maybe specify a maximum size

def oracleWork():
    # assorted oracle specific code
    for rec in oracle_cursor:
        transferQueue.put(rec)
    transferQueue.put(TRANSFER_END)

def postgresqlWork():
    # assorted postgresql specific code
    while True:
        rec = transferQueue.get()
        if rec is TRANSFER_END: break
        # insert record
        # commit?
    # or commit here

ot = threading.Thread(target=oracleWork)
pt = threading.Thread(target=postgresqlWork)

ot.start()
pt.start()

ot.join()
pt.join()
-=-=-=-=-=-
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 

Laurent Pointal

Bjoern Schliessmann wrote:
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not least because there's almost nothing to compute here,
only IO traffic. CMIIW.


Not so sure, there is low CPU in the Python script, but there may be
CPU+disk activity on the database sides [with cache management and other
optimizations on disk access].
So, with a reader thread and a writer thread, he can have a select on a
database performed in parallel with an insert on the other database.
After that, he must know whether the two databases use the same disks,
the same controller, the same host... or not.

But, if it's only a do-once job, maybe the optimization is not really
necessary.
 

Sean Davis

Sean, you can't win - everyone has a different idea! You need to explain
that oracle has millions of records and that it's possible to keep a pipe
open to feed the Postgres side.

One thing I didn't get - is this a one-time transfer or something that is
going to happen often?

Yes, some detail about the problem is definitely in order!

We have a collaborator that is going to maintain a large genome
database that is a component of a postgresql database that we currently
maintain. There are going to be consumers of the oracle data using
both mysql and postgresql. The oracle database is LARGE with around
100,000,000 rows spread over some 50-70 tables in multiple schemas.
The idea is that as publicly available data (from various datasources
on the web) become available, the oracle team will update the oracle
database, doing all the parsing and necessary data cleanup of the raw
data. We then want to be able to update postgres with these oracle
data. So the process may be done only once per month on some tables,
but as often as once a day on others.

As for the specifics, Oracle data is going to be coming in as a DB-API
2 cursor in manageable chunks (and at a relatively slow pace). On the
postgres loading side, I wanted to use the psycopg2 copy_from
function, which expects an open file-like object (that has read and
readline functionality) and is quite fast for loading data. Note the
disconnect here--Oracle data is coming in discrete chunks, while
postgresql is looking for a file object. I solved this problem by
creating a temporary file as an intermediary, but why wait for Oracle
to finish dumping data when I can potentially be loading into postgres
at the same time that the data is coming in? So, I am actually looking
for a solution to this problem that doesn't require an intermediate
file and allows simultaneous reading and writing, with the caveat that
the data cannot all be read into memory simultaneously, so will need to
be buffered.
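
One way I could imagine bridging that disconnect (a rough, untested sketch
assuming Python 2 and psycopg2; the class and names are made up, not from
any library) is a small file-like wrapper around a Queue.Queue, so that
copy_from() can read lines while the Oracle thread is still producing them:

import Queue

END_OF_DATA = object()  # sentinel the Oracle thread puts last

class QueueFile(object):
    # Just enough of the file protocol (read/readline) for copy_from();
    # each queue item is assumed to be one tab-delimited text line
    # ending in "\n".
    def __init__(self, queue):
        self.queue = queue
        self.buffer = ""
        self.finished = False

    def _more(self):
        # Block until the Oracle thread supplies another line or
        # signals the end of the data.
        if self.finished:
            return False
        item = self.queue.get()
        if item is END_OF_DATA:
            self.finished = True
            return False
        self.buffer += item
        return True

    def read(self, size=-1):
        while (size < 0 or len(self.buffer) < size) and self._more():
            pass
        if size < 0:
            data, self.buffer = self.buffer, ""
        else:
            data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data

    def readline(self):
        while "\n" not in self.buffer and self._more():
            pass
        idx = self.buffer.find("\n")
        if idx < 0:
            data, self.buffer = self.buffer, ""
        else:
            data, self.buffer = self.buffer[:idx + 1], self.buffer[idx + 1:]
        return data

The postgres-side thread would then call something like
pg_cursor.copy_from(QueueFile(q), "the_table"), while the Oracle thread
put()s formatted lines on q and finishes with END_OF_DATA.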

I hope that clarifies things.

Thanks,
Sean
 

Sean Davis

Bjoern Schliessmann wrote:
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not least because there's almost nothing to compute here,
only IO traffic. CMIIW.

Laurent Pointal replied:
Not so sure, there is low CPU in the Python script, but there may be
CPU+disk activity on the database sides [with cache management and other
optimizations on disk access].
So, with a reader thread and a writer thread, he can have a select on a
database performed in parallel with an insert on the other database.
After that, he must know whether the two databases use the same disks,
the same controller, the same host... or not.

Some more detail:

The machine running the script is distinct from the Oracle machine
which is distinct from the Postgresql machine. So, CPU usage is low
and because of the independent machines for the database end, it is
definitely possible to read from one database while writing to the
other. That is the solution that I am looking for, and Dennis's post
seems pretty close to what I need. I will have to use some kind of
buffer. A Queue isn't quite right as it stands, as the data is coming
in as records, but for postgres loading, a file-like stream is what I
need, so there will need to be a wrapper around the Queue on the
get() side. Or is there a better way to go about this detail? What
seems to make sense to me is to stringify the incoming oracle data into
some kind of buffer and then read on the postgresql side.
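
For the stringify step, something along these lines might do (a sketch; the
escaping follows the default COPY text format as I understand it, with
tab-separated columns and \N for NULLs):

def row_to_copy_line(row):
    # Turn one DB-API row tuple into one line of COPY text format:
    # tab-separated columns, backslash escapes, \N for NULL.
    fields = []
    for value in row:
        if value is None:
            fields.append(r"\N")
        else:
            text = str(value)
            text = text.replace("\\", "\\\\")
            text = text.replace("\t", "\\t")
            text = text.replace("\n", "\\n").replace("\r", "\\r")
            fields.append(text)
    return "\t".join(fields) + "\n"

The Oracle-side thread would then put row_to_copy_line(rec) on the queue
instead of the raw record tuple.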

Thanks,
Sean
 

Istvan Albert

Sean said:
at the same time that the data is coming in? So, I am actually looking
for a solution to this problem that doesn't require an intermediate
file and allows simultaneous reading and writing, with the caveat that
the data cannot all be read into memory simultaneously, so will need to
be buffered.

IMO the problem that you need to solve is not well suited for the
python DBAPI, as this API is meant to support programming and
interacting with the database, not streaming large quantities of data
from one database into another.

I agree with another opinion in this thread.

All you need is a simple way to pipe the output from Oracle into
Postgresql. Just run the oracle client and start dumping to the
standard output. Pipe it through sed (or a python program) to reformat
the output for whatever minor fixes (you might not even need this step)
then continue piping it right into psql.
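
The "python program" in the middle could be as small as a stdin-to-stdout
filter; a hypothetical sketch (the SYSDATE substitution is just an example
of the kind of minor fix meant here):

import sys

for line in sys.stdin:
    # apply whatever minor syntax fixes the dump needs, then pass it on
    sys.stdout.write(line.replace("SYSDATE", "now()"))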

i.
 

Paul Boddie

Sean said:
As for the specifics, Oracle data is going to be coming in as a DB-API
2 cursor in manageable chunks (and at a relatively slow pace). On the
postgres loading side, I wanted to use the psycopg2 copy_from
function, which expects an open file-like object (that has read and
readline functionality) and is quite fast for loading data.

And which seems to use the COPY FROM command in PostgreSQL...
Note the disconnect here--Oracle data is coming in discrete chunks, while
postgresql is looking for a file object. I solved this problem by
creating a temporary file as an intermediary, but why wait for Oracle
to finish dumping data when I can potentially be loading into postgres
at the same time that the data is coming in?

My experience along with the PostgreSQL documentation tells me that you
shouldn't worry about this problem too much: using COPY FROM is *far*
faster than performing many inserts or updates. The only thing you'd
need to worry about is data being copied into the database that
duplicates existing data, causing constraint violations, but since
you're already using this method I imagine that this is not a likely
problem.
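
For reference, the psycopg2 call in question looks roughly like this (the
connection, file object, and table name are placeholders, not Sean's actual
code):

cur = pg_conn.cursor()
# "datafile" is any object with read()/readline(), e.g. an open file
cur.copy_from(datafile, "the_table")
pg_conn.commit()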

Paul
 

Bjoern Schliessmann

Laurent said:
Not so sure, there is low CPU in the Python script,
Yes.

but there may be CPU+disk activity on the database sides [with
cache management and other optimizations on disk access].

That's it. So data queues up on the database side and you won't get
much value from faked concurrency with CPU cycles.
So, with a reader thread and a writer thread, he can have a select
on a database performed in parallel with an insert on the other
database.

Explain. Remember, threads aren't really working concurrently. Even
on a multiprocessor machine you have constraints for IO traffic.
(And the GIL exists too)
But, if it's only a do-once job, maybe the optimization is not really
necessary.

I still don't understand how threads would help optimize a task
that largely depends on IO and will probably be executed on one
CPU.

Regards,


Björn
 

Bjoern Schliessmann

Sean said:
I solved this problem by creating a temporary file as an
intermediary, but why wait for Oracle to finish dumping data when
I can potentially be loading into postgres at the same time that
the data is coming in? So, I am actually
looking for a solution to this problem that doesn't require an
intermediate file and allows simultaneous reading and writing,
with the caveat that the data cannot all be read into memory
simultaneously, so will need to be buffered.

The functions you use don't seem very suited for such a
streaming-like task.

Regards,


Björn

--
BOFH excuse #282:

High altitude condensation from U.S.A.F prototype aircraft has
contaminated the primary subnet mask. Turn off your computer for 9
days to avoid damaging it.
 

Dennis Lee Bieber

The machine running the script is distinct from the Oracle machine
which is distinct from the Postgresql machine. So, CPU usage is low

Which somewhat cancels the idea of just using pipes to pass data
directly... Unless the command-line clients can be run "locally" and
connect to the remote servers.
buffer. A Queue isn't quite right as it stands, as the data is coming
in as records, but for postgres loading, a file-like stream is what I
need, so there will need to be a wrapper around the Queue on the

"need"? No... the "copy ... from ..." statement "needs" such, but do
you really "need" to use "copy ... from ..." to load the data -- or is
this just a concept that caught your fancy?

If you can get Oracle to dump the data in the form of SQL insert
statements, maybe with some clean-up done by a script, and feed them to
PostgreSQL the entire job starts to look something like:

odump theDatabase | python cleanup.py | psql theNewDatabase

{note: command line tool names are made up for the example}

"copy ... from ..." appears to be designed to work with formatted
text files... Something that might have been produced as a tabular
report by almost any tool. As such, you are basically bypassing most of
the DB-API capability. If Oracle has a "select" variant that can specify
a start-offset and length (or a server-side cursor in the DB-API so the
full data is not transferred in one "chunk"), and the PostgreSQL DB-API
supports an "executemany" operation, the threading pseudocode I showed
could still be used. Instead of queueing single records, one could queue
a multiple record "chunk" (a "fetchmany" specifying 100, 500, or
whatever, records); the data would be a list of tuples, so no concerns
about parsing fields from a text record. Then an "executemany" would
submit the chunk at one time.
get() side. Or is there a better way to go about this detail? What
seems to make sense to me is to stringify the incoming oracle data into
some kind of buffer and then read on the postgresql side.

My view: avoid using any commands that result in formatting the data
as text reports... Use DB-API interfaces to keep the data as fields.

ocrsr.execute("select * from table")
while True:
batch = ocrsr.fetchmany(100)
if not batch: break
pcrsr.executemany("insert into table", batch)

{this is a sequential sample -- but using threads just means the ocrsr
operations are in one thread, batch is the data relayed via a queue, and
pcrsr operations are in the other thread}
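
Putting that together with the earlier queue pseudocode might look roughly
like this (untested; cursor setup, the actual SQL placeholders, and error
handling are omitted):

import threading
import Queue

CHUNKS = Queue.Queue(maxsize=10)  # bounded, so Oracle can't run far ahead
DONE = object()

def reader(ocrsr):
    ocrsr.execute("select * from table")
    while True:
        batch = ocrsr.fetchmany(100)
        if not batch:
            break
        CHUNKS.put(batch)
    CHUNKS.put(DONE)

def writer(pcrsr):
    while True:
        batch = CHUNKS.get()
        if batch is DONE:
            break
        pcrsr.executemany("insert into table values (...)", batch)

rt = threading.Thread(target=reader, args=(ocrsr,))
wt = threading.Thread(target=writer, args=(pcrsr,))
rt.start()
wt.start()
rt.join()
wt.join()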
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 

Chris Mellon

Which somewhat cancels the idea of just using pipes to pass data
directly... Unless the command-line clients can be run "locally" and
connect to the remote servers.


"need"? No... the "copy ... from ..." statement "needs" such, but do
you really "need" to use "copy ... from ..." to load the data -- or is
this just a concept that caught your fancy?

If you can get Oracle to dump the data in the form of SQL insert
statements, maybe with some clean-up done by a script, and feed them to
PostgreSQL the entire job starts to look something like:

odump theDatabase | python cleanup.py | psql theNewDatabase

{note: command line tool names are made up for the example}

"copy ... from ..." appears to be designed to work with formatted
text files... Something that might have been produced as a tabular
report by almost any tool. As such, you are basically bypassing most of
the DB-API capability. If Oracle has a "select" variant that can specify
a start-offset and length (or a server-side cursor in the DB-API so the
full data is not transferred in one "chunk"), and the PostgreSQL DB-API
supports an "executemany" operation, the threading pseudocode I showed
could still be used. Instead of queueing single records, one could queue
a multiple record "chunk" (a "fetchmany" specifying 100, 500, or
whatever, records); the data would be a list of tuples, so no concerns
about parsing fields from a text record. Then an "executemany" would
submit the chunk at one time.


My view: avoid using any commands that result in formatting the data
as text reports... Use DB-API interfaces to keep the data as fields.

ocrsr.execute("select * from table")
while True:
batch = ocrsr.fetchmany(100)
if not batch: break
pcrsr.executemany("insert into table", batch)

{this is a sequential sample -- but using threads just means the ocrsr
operations are in one thread, batch is the data relayed via a queue, and
pcrsr operations are in the other thread}
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/


Using db-api for this is the wrong approach. COPY FROM, which is the
postgresql equivalent of oracle's SQL*Loader, is orders of magnitude
faster than sequential inserts, and it's really what you want for bulk
data loads. What you're trying to do is read & write from the file at
the same time, but across a network interface; that's just asking for
trouble. In particular, I doubt that postgres expects to have
something writing to the end of the file it is loading and will
probably error out if it hits an unexpected EOF, which is what will
happen if the writer lags behind.

Just dump the files from oracle, copy them, and load them. It'll
probably be faster in the long run when you count all the time you
lose re-starting the processing and troubleshooting your
"optimization".
 

Paul Boddie

Dennis said:
"need"? No... the "copy ... from ..." statement "needs" such, but do
you really "need" to use "copy ... from ..." to load the data -- or is
this just a concept that caught your fancy?

In chapter 13 of the PostgreSQL 8.1 documentation ("Performance Tips"),
descending into section 13.4 ("Populating a Database"), whose
introduction may not admittedly coincide precisely with the inquirer's
use case ("first populating a database" doesn't strictly apply, but it
may yet describe a situation that is close enough given the kind of
data involved), and within section 13.4.2 ("Use COPY"), the text reads
as follows:

"Use COPY to load all the rows in one command, instead of using a
series of INSERT commands."

http://www.postgresql.org/docs/8.1/static/populate.html#POPULATE-COPY-FROM
If you can get Oracle to dump the data in the form of SQL insert
statements, maybe with some clean-up done by a script, and feed them to
PostgreSQL the entire job starts to look something like:

odump theDatabase | python cleanup.py | psql theNewDatabase

The problem with this approach (as the manual notes) is that COPY is a
lot faster than lots of INSERT statements:

"Note that loading a large number of rows using COPY is almost always
faster than using INSERT, even if PREPARE is used and multiple
insertions are batched into a single transaction."

My experience with inserting large numbers of records into PostgreSQL
suggests that this advice should not be readily ignored.

Paul
 

Dennis Lee Bieber

My experience with inserting large numbers of records into PostgreSQL
suggests that this advice should not be readily ignored.

The problem, though, is that the original poster claimed the Oracle
data was being emitted in multiple chunks, not as a single output -- and
they want to avoid collecting the data in a temporary file...
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 

Paul Boddie

Dennis said:
The problem though, is that the original poster claimed the Oracle
data was being emitted in multiple chunks, not as a single output -- and
they want to avoid collecting the data in a temporary file...

I think he wanted to know whether concurrent reads from Oracle and
writes to PostgreSQL would be faster or more efficient. I'd argue that
the temporary file approach is likely to be faster and more efficient
(disk space usage for temporary files disregarded). Sure, if you can
have two concurrent processes with minimal contention, one reading from
Oracle, one writing to PostgreSQL, where the PostgreSQL one never lags
behind and thus completes shortly after the Oracle data has been
completely downloaded, then that might be the ideal solution, but I'm
inclined to believe that unless the stream of data from Oracle were
arriving really slowly, it wouldn't work out that way.

Anyway, that's what my experiences with PostgreSQL suggest so far.
Further commentary on the topic is, as noted, available in the manual.

Paul
 

Dennis Lee Bieber

I think he wanted to know whether concurrent reads from Oracle and
writes to PostgreSQL would be faster or more efficient. I'd argue that

Digging up the original post shows:

op> I retrieve the data from oracle in
op> chunks and drop the data to postgresql continuously.

and then goes on to mention that a coworker had suggested using parallel
threads...

I focused on that "oracle in chunks" -- which, with the desire to
minimize disk/memory usage, (to me) precludes using that "copy ... from
...." statement.

The ability to have oracle dump the data in a format postgresql
can batch load would, of course, obviate the need for any Python
script in the middle...
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
