Parallel insert to PostgreSQL with threads

Discussion in 'Python' started by Abandoned, Oct 25, 2007.

  1. Abandoned

    Abandoned Guest

    Hi..
I use the threading module for fast operation, but I have some problems.
    This is my code sample:
=================
import psycopg2
from threading import Thread

conn = psycopg2.connect(user='postgres', password='postgres',
                        database='postgres')
cursor = conn.cursor()

class paralel(Thread):
    def __init__(self, veriler, sayii):
        Thread.__init__(self)
    def run(self):
        save(a, b, c)

def save(a, b, c):
    cursor.execute("INSERT INTO keywords (keyword) VALUES ('%s')" % a)
    conn.commit()
    cursor.execute("SELECT CURRVAL('keywords_keyword_id_seq')")
    idd = cursor.fetchall()
    return idd[0][0]

def start(hiz):
    datas = [........]
    for a in datas:
        current = paralel(a, sayii)
        current.start()
==================
When I try parallel inserts, it gives me different errors. My queries
work in normal operation, but in parallel they don't.
How can I insert data into PostgreSQL at the same moment?
Errors:
no results to fetch
cursor already closed
     
    Abandoned, Oct 25, 2007
    #1

  2. Abandoned wrote:

> Hi..
> I use the threading module for fast operation, but I have some problems.
> This is my code sample:
> ...
>     def save(a, b, c):
>         cursor.execute("INSERT INTO keywords (keyword) VALUES ('%s')" % a)
>         conn.commit()
>         cursor.execute("SELECT CURRVAL('keywords_keyword_id_seq')")
> ...
> When I try parallel inserts, it gives me different errors. My queries
> work in normal operation, but in parallel they don't.
> How can I insert data into PostgreSQL at the same moment?
> Errors:
> no results to fetch
> cursor already closed


DB modules aren't necessarily thread-safe. Most of the time, a connection
(and of course its cursors) can't be shared between threads.

    So open a connection for each thread.
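
    For example, a minimal sketch of the one-connection-per-thread
    approach (assuming psycopg2 and the keywords table from the original
    post; the Inserter class name is made up, and a parameterized query
    replaces the % interpolation):
    =================
    import psycopg2
    from threading import Thread

    class Inserter(Thread):
        # Hypothetical helper for illustration; each thread owns its
        # own connection and cursor instead of sharing module globals.
        def __init__(self, keyword):
            Thread.__init__(self)
            self.keyword = keyword

        def run(self):
            conn = psycopg2.connect(user='postgres', password='postgres',
                                    database='postgres')
            cursor = conn.cursor()
            # Let the driver quote the value instead of % formatting.
            cursor.execute("INSERT INTO keywords (keyword) VALUES (%s)",
                           (self.keyword,))
            conn.commit()
            cursor.close()
            conn.close()

    for keyword in ('spam', 'eggs', 'ham'):
        Inserter(keyword).start()
    =================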

    Diez
     
    Diez B. Roggisch, Oct 25, 2007
    #2

  3. Diez B. Roggisch wrote:
    > Abandoned wrote:
    >
>> Hi..
>> I use the threading module for fast operation, but ....

    [in each thread]
    >> def save(a, b, c):
    >>     cursor.execute("INSERT INTO ...
    >>     conn.commit()
    >>     cursor.execute(...)
    >> How can I insert data into PostgreSQL at the same moment?...

    >
    > DB modules aren't necessarily thread-safe. Most of the time, a connection
    > (and of course its cursors) can't be shared between threads.
    >
    > So open a connection for each thread.


    Note that your DB server will have to "serialize" your inserts, so
    unless there is some other reason for the threads, a single thread
    through a single connection to the DB is the way to go. Of course
    it may be clever enough to behave "as if" they are serialized, but
    most of your work parallelizing at your end simply creates new
    work at the DB server end.

    -Scott David Daniels
     
    Scott David Daniels, Oct 25, 2007
    #3
  4. Erik Jones

    Erik Jones Guest

    On Oct 25, 2007, at 7:28 AM, Scott David Daniels wrote:

    > Diez B. Roggisch wrote:
    >> Abandoned wrote:
    >>
    >>> Hi..
    >>> I use the threading module for fast operation, but ....

    > [in each thread]
    >>> def save(a, b, c):
    >>>     cursor.execute("INSERT INTO ...
    >>>     conn.commit()
    >>>     cursor.execute(...)
    >>> How can I insert data into PostgreSQL at the same moment?...

    >>
    >> DB modules aren't necessarily thread-safe. Most of the time, a
    >> connection (and of course its cursors) can't be shared between threads.
    >>
    >> So open a connection for each thread.

    >
    > Note that your DB server will have to "serialize" your inserts, so
    > unless there is some other reason for the threads, a single thread
    > through a single connection to the DB is the way to go. Of course
    > it may be clever enough to behave "as if" they are serialized, but
    > most of your work parallelizing at your end simply creates new
    > work at the DB server end.


    Fortunately, in his case, that's not necessarily true. If they all
    do their work with the same connection then, yes, but there are
    other problems with that, as mentioned above with regard to thread
    safety and psycopg2. If he goes the recommended route with a
    separate connection for each thread, then Postgres will not
    serialize multiple inserts coming from separate connections unless
    there is something like an ALTER TABLE or REINDEX concurrently
    happening on the table. The whole serialized-inserts thing is
    strictly something popularized by MySQL and is by no means
    necessary or standard (as with a lot of MySQL).

    Erik Jones

    Software Developer | Emma®

    800.595.4401 or 615.292.5888
    615.292.0777 (fax)

    Emma helps organizations everywhere communicate & market in style.
    Visit us online at http://www.myemma.com
     
    Erik Jones, Oct 25, 2007
    #4
  5. Erik Jones wrote:
    >
    > On Oct 25, 2007, at 7:28 AM, Scott David Daniels wrote:
    >> Diez B. Roggisch wrote:
    >>> Abandoned wrote:
    >>>> Hi..
    >>>> I use the threading module for fast operation, but ....

    >> [in each thread]
    >>>> def save(a, b, c):
    >>>>     cursor.execute("INSERT INTO ...
    >>>>     conn.commit()
    >>>>     cursor.execute(...)
    >>>> How can I insert data into PostgreSQL at the same moment?...
    >>>
    >>> DB modules aren't necessarily thread-safe. Most of the time, a
    >>> connection (and ... cursors) can't be shared between threads.
    >>> So open a connection for each thread.

    >>
    >> Note that your DB server will have to "serialize" your inserts, so
    >> ... a single thread through a single connection to the DB is the way
    >> to go. Of course it (the DB server) may be clever enough to behave
    >> "as if" they are serialized, but most of your work parallelizing at
    >> your end simply creates new work at the DB server end.

    >
    > Fortunately, in his case, that's not necessarily true.... If he
    > goes the recommended route with a separate connection for each thread,
    > then Postgres will not serialize multiple inserts coming from separate
    > connections unless there is something like an ALTER TABLE or REINDEX
    > concurrently happening on the table.
    > The whole serialized inserts thing is strictly something popularized
    > by MySQL and is by no means necessary or standard (as with a lot of
    > MySQL).


    But he commits after every insert, which _does_ force serialization (if
    only to provide safe transaction boundaries). I understand you can get
    clever at how to do it, _but_ preserving ACID properties is exactly what
    I mean by "serialize," and while I like to bash MySQL as well as the
    next person, I most certainly am not under the evil sway of the vile
    MySQL cabal.

    The server will have to be able to abort each transaction
    _independently_ of the others, and so must serialize any index
    updates that share a page by, for example, landing in the same node
    of a B-Tree.

    -Scott David Daniels
     
    Scott David Daniels, Oct 26, 2007
    #5
  6. Erik Jones

    Erik Jones Guest

    OT Re: Parallel insert to PostgreSQL with threads

    If you're not Scott Daniels, be aware that this conversation has gone
    horribly off topic and, unless you have an interest in PostgreSQL, you
    may not want to bother reading on...

    On Oct 25, 2007, at 9:46 PM, Scott David Daniels wrote:

    > Erik Jones wrote:
    >>
    >> On Oct 25, 2007, at 7:28 AM, Scott David Daniels wrote:
    >>> Diez B. Roggisch wrote:
    >>>> Abandoned wrote:
    >>>>> Hi..
    >>>>> I use the threading module for fast operation, but ....
    >>> [in each thread]
    >>>>> def save(a, b, c):
    >>>>>     cursor.execute("INSERT INTO ...
    >>>>>     conn.commit()
    >>>>>     cursor.execute(...)
    >>>>> How can I insert data into PostgreSQL at the same moment?...
    >>>>
    >>>> DB modules aren't necessarily thread-safe. Most of the time, a
    >>>> connection (and ... cursors) can't be shared between threads.
    >>>> So open a connection for each thread.
    >>>
    >>> Note that your DB server will have to "serialize" your inserts, so
    >>> ... a single thread through a single connection to the DB is the way
    >>> to go. Of course it (the DB server) may be clever enough to behave
    >>> "as if" they are serialized, but most of your work parallelizing at
    >>> your end simply creates new work at the DB server end.

    >>
    >> Fortunately, in his case, that's not necessarily true.... If he
    >> goes the recommended route with a separate connection for each thread,
    >> then Postgres will not serialize multiple inserts coming from separate
    >> connections unless there is something like an ALTER TABLE or REINDEX
    >> concurrently happening on the table.
    >> The whole serialized-inserts thing is strictly something popularized
    >> by MySQL and is by no means necessary or standard (as with a lot of
    >> MySQL).

    >
    > But he commits after every insert, which _does_ force serialization
    > (if only to provide safe transaction boundaries). I understand you
    > can get clever at how to do it, _but_ preserving ACID properties is
    > exactly what I mean by "serialize,"


    First, it's a bad idea to work with your own definition of a very
    domain-specific, standardized term, especially when Postgres's
    Multi-Version Concurrency Control mechanisms are designed
    specifically to preserve ACID compliance without forcing serialized
    transactions on the user.

    Second, unless he specifically sets his transaction isolation level
    to serializable, he will be working in read-committed mode. What
    this specifically means is that two (or more) transactions writing
    to the same table will not block one another. Let's say the user
    has two concurrent inserts to run on the same table that, for
    whatever reason, take a while to run (for example, they insert the
    results of some horribly complex or inefficient select). If either
    is run in serializable mode, then whichever one starts a fraction
    of a second sooner will run to completion before the second is even
    allowed to begin. In (the default) read-committed mode they will
    both begin executing as soon as they are called and will write
    their data regardless of conflicts.

    Commit time (which may be some time later when transactions with
    multiple statements are used) is when conflicts are resolved. So,
    if between the two example transactions there does turn out to be a
    conflict between their results, whichever commits second will roll
    back, and since the data written by the rolled-back transaction
    will not be marked as committed, it will never be visible to any
    other transactions and the space will remain available for future
    transactions.

    Here's the relevant portion of the Postgres docs on all of this:
    http://www.postgresql.org/docs/8.2/interactive/mvcc.html
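
    As a small illustration (a sketch, not from the thread): psycopg2
    lets you pick the isolation level per connection, read committed
    being the default:
    =================
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect(user='postgres', password='postgres',
                            database='postgres')
    # The default is READ COMMITTED; opt in to SERIALIZABLE explicitly.
    conn.set_isolation_level(
        psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE)
    =================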

    > and while I like to bash MySQL as well as the
    > next person, I most certainly am not under the evil sway of the vile
    > MySQL cabal.


    Good to hear ;)

    >
    > The server will have to be able to abort each transaction
    > _independently_ of the others, and so must serialize any index
    > updates that share a page by, for example, landing in the same node
    > of a B-Tree.


    There is nothing inherent in B-trees that prevents identical datums
    from being written in them. If there were, the only thing they'd be
    good for would be unique indexes. Even if you do use a unique
    index, as noted above, constraints and conflicts are only enforced
    at commit time.

    Erik Jones

    Software Developer | Emma®

    800.595.4401 or 615.292.5888
    615.292.0777 (fax)

    Emma helps organizations everywhere communicate & market in style.
    Visit us online at http://www.myemma.com
     
    Erik Jones, Oct 26, 2007
    #6
  7. On Thu, 25 Oct 2007 13:27:40 +0200, Diez B. Roggisch wrote:

    > DB modules aren't necessarily thread-safe. Most of the time, a
    > connection (and of course its cursors) can't be shared between threads.
    >
    > So open a connection for each thread.
    >
    > Diez


    DB modules following DBAPI2 must define the following attribute:

    """
    threadsafety

        Integer constant stating the level of thread safety the
        interface supports. Possible values are:

            0  Threads may not share the module.
            1  Threads may share the module, but not connections.
            2  Threads may share the module and connections.
            3  Threads may share the module, connections and cursors.
    """

    http://www.python.org/dev/peps/pep-0249/
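
    So a quick way to check what your driver promises (a sketch;
    psycopg2 reports level 2, i.e. connections may be shared between
    threads but cursors may not):
    =================
    import psycopg2
    # 2 => threads may share the module and connections, but not cursors.
    print(psycopg2.threadsafety)
    =================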



    --
    Laurent POINTAL -
     
    Laurent Pointal, Oct 26, 2007
    #7
