Parallel insert to PostgreSQL with threads

Discussion in 'Python' started by Abandoned, Oct 25, 2007.

  1. Abandoned

    Abandoned Guest

    Hi..
I use the threading module for fast operation, but I have some problems.
    This is my code sample:
=================
import psycopg2
from threading import Thread

conn = psycopg2.connect(user='postgres', password='postgres',
                        database='postgres')
cursor = conn.cursor()

class paralel(Thread):
    def __init__(self, veriler, sayii):
        Thread.__init__(self)
    def run(self):
        save(a, b, c)

def save(a, b, c):
    cursor.execute("INSERT INTO keywords (keyword) VALUES ('%s')" % a)
    conn.commit()
    cursor.execute("SELECT CURRVAL('keywords_keyword_id_seq')")
    idd = cursor.fetchall()
    return idd[0][0]

def start(hiz):
    datas = [........]
    for a in datas:
        current = paralel(a, sayii)
        current.start()
==================
When I try parallel inserts, it gives me different errors. My queries
work in normal operation, but in parallel they don't.
How can I insert data into PostgreSQL at the same moment?
Errors:
no results to fetch
cursor already closed
     
    Abandoned, Oct 25, 2007
    #1

  2. Abandoned wrote:

> Hi..
> I use the threading module for fast operation, but I have some problems.
> This is my code sample:
> ...
>     def save(a, b, c):
>         cursor.execute("INSERT INTO keywords (keyword) VALUES ('%s')" % a)
>         conn.commit()
>         cursor.execute("SELECT CURRVAL('keywords_keyword_id_seq')")
> ...
> When I try parallel inserts, it gives me different errors. My queries
> work in normal operation, but in parallel they don't.
> How can I insert data into PostgreSQL at the same moment?
> Errors:
> no results to fetch
> cursor already closed


DB modules aren't necessarily thread-safe. Most of the time, a connection
(and of course its cursors) can't be shared between threads.

    So open a connection for each thread.
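
    For example, a minimal sketch of the one-connection-per-thread
    approach (assuming psycopg2 and the keywords table from the original
    post; the Inserter class name is made up, and a parameterized query
    replaces the % interpolation):
    =================
    import psycopg2
    from threading import Thread

    class Inserter(Thread):
        # Hypothetical helper for illustration; each thread owns its
        # own connection and cursor instead of sharing module globals.
        def __init__(self, keyword):
            Thread.__init__(self)
            self.keyword = keyword

        def run(self):
            conn = psycopg2.connect(user='postgres', password='postgres',
                                    database='postgres')
            cursor = conn.cursor()
            # Let the driver quote the value instead of % formatting.
            cursor.execute("INSERT INTO keywords (keyword) VALUES (%s)",
                           (self.keyword,))
            conn.commit()
            cursor.close()
            conn.close()

    for keyword in ('spam', 'eggs', 'ham'):
        Inserter(keyword).start()
    =================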

    Diez
     
    Diez B. Roggisch, Oct 25, 2007
    #2

  3. Diez B. Roggisch wrote:
    > Abandoned wrote:
    >
>> Hi..
>> I use the threading module for fast operation, but ....

    [in each thread]
    >> def save(a, b, c):
    >>     cursor.execute("INSERT INTO ...
    >>     conn.commit()
    >>     cursor.execute(...)
    >> How can I insert data into PostgreSQL at the same moment?...

    >
    > DB modules aren't necessarily thread-safe. Most of the time, a connection
    > (and of course its cursors) can't be shared between threads.
    >
    > So open a connection for each thread.


    Note that your DB server will have to "serialize" your inserts, so
    unless there is some other reason for the threads, a single thread
    through a single connection to the DB is the way to go. Of course
    it may be clever enough to behave "as if" they are serialized, but
    most of your work parallelizing at your end simply creates new
    work at the DB server end.

    -Scott David Daniels
     
    Scott David Daniels, Oct 25, 2007
    #3
  4. Erik Jones

    Erik Jones Guest

    On Oct 25, 2007, at 7:28 AM, Scott David Daniels wrote:

    > Diez B. Roggisch wrote:
    >> Abandoned wrote:
    >>
    >>> Hi..
    >>> I use the threading module for fast operation, but ....

    > [in each thread]
    >>> def save(a, b, c):
    >>>     cursor.execute("INSERT INTO ...
    >>>     conn.commit()
    >>>     cursor.execute(...)
    >>> How can I insert data into PostgreSQL at the same moment?...

    >>
    >> DB modules aren't necessarily thread-safe. Most of the time, a
    >> connection (and of course its cursors) can't be shared between threads.
    >>
    >> So open a connection for each thread.

    >
    > Note that your DB server will have to "serialize" your inserts, so
    > unless there is some other reason for the threads, a single thread
    > through a single connection to the DB is the way to go. Of course
    > it may be clever enough to behave "as if" they are serialized, but
    > most of your work parallelizing at your end simply creates new
    > work at the DB server end.


    Fortunately, in his case, that's not necessarily true. If they all
    do their work with the same connection then, yes, but there are
    other problems with that, as mentioned above with regard to thread
    safety and psycopg2. If he goes the recommended route with a
    separate connection for each thread, then Postgres will not
    serialize multiple inserts coming from separate connections unless
    there is something like an ALTER TABLE or REINDEX concurrently
    happening on the table. The whole serialized-inserts thing is
    strictly something popularized by MySQL and is by no means
    necessary or standard (as with a lot of MySQL).

    Erik Jones

    Software Developer | Emma®

    800.595.4401 or 615.292.5888
    615.292.0777 (fax)

    Emma helps organizations everywhere communicate & market in style.
    Visit us online at http://www.myemma.com
     
    Erik Jones, Oct 25, 2007
    #4
  5. Erik Jones wrote:
    >
    > On Oct 25, 2007, at 7:28 AM, Scott David Daniels wrote:
    >> Diez B. Roggisch wrote:
    >>> Abandoned wrote:
    >>>> Hi..
    >>>> I use the threading module for fast operation, but ....

    >> [in each thread]
    >>>> def save(a, b, c):
    >>>>     cursor.execute("INSERT INTO ...
    >>>>     conn.commit()
    >>>>     cursor.execute(...)
    >>>> How can I insert data into PostgreSQL at the same moment?...
    >>>
    >>> DB modules aren't necessarily thread-safe. Most of the time, a
    >>> connection (and ... cursors) can't be shared between threads.
    >>> So open a connection for each thread.

    >>
    >> Note that your DB server will have to "serialize" your inserts, so
    >> ... a single thread through a single connection to the DB is the way
    >> to go. Of course it (the DB server) may be clever enough to behave
    >> "as if" they are serialized, but most of your work parallelizing at
    >> your end simply creates new work at the DB server end.

    >
    > Fortunately, in his case, that's not necessarily true.... If he
    > goes the recommended route with a separate connection for each thread,
    > then Postgres will not serialize multiple inserts coming from separate
    > connections unless there is something like an ALTER TABLE or REINDEX
    > concurrently happening on the table.
    > The whole serialized inserts thing is strictly something popularized
    > by MySQL and is by no means necessary or standard (as with a lot of
    > MySQL).


    But he commits after every insert, which _does_ force serialization (if
    only to provide safe transaction boundaries). I understand you can get
    clever at how to do it, _but_ preserving ACID properties is exactly what
    I mean by "serialize," and while I like to bash MySQL as well as the
    next person, I most certainly am not under the evil sway of the vile
    MySQL cabal.

    The server will have to be able to abort each transaction
    _independently_ of the others, and so must serialize any index
    updates that share a page by, for example, landing in the same node
    of a B-Tree.

    -Scott David Daniels
     
    Scott David Daniels, Oct 26, 2007
    #5
  6. Erik Jones

    Erik Jones Guest

    OT Re: Parallel insert to PostgreSQL with threads

    If you're not Scott Daniels, be aware that this conversation has gone
    horribly off topic and, unless you have an interest in PostgreSQL, you
    may not want to bother reading on...

    On Oct 25, 2007, at 9:46 PM, Scott David Daniels wrote:

    > Erik Jones wrote:
    >>
    >> On Oct 25, 2007, at 7:28 AM, Scott David Daniels wrote:
    >>> Diez B. Roggisch wrote:
    >>>> Abandoned wrote:
    >>>>> Hi..
    >>>>> I use the threading module for fast operation, but ....
    >>> [in each thread]
    >>>>> def save(a, b, c):
    >>>>>     cursor.execute("INSERT INTO ...
    >>>>>     conn.commit()
    >>>>>     cursor.execute(...)
    >>>>> How can I insert data into PostgreSQL at the same moment?...
    >>>>
    >>>> DB modules aren't necessarily thread-safe. Most of the time, a
    >>>> connection (and ... cursors) can't be shared between threads.
    >>>> So open a connection for each thread.
    >>>
    >>> Note that your DB server will have to "serialize" your inserts, so
    >>> ... a single thread through a single connection to the DB is the way
    >>> to go. Of course it (the DB server) may be clever enough to behave
    >>> "as if" they are serialized, but most of your work parallelizing at
    >>> your end simply creates new work at the DB server end.

    >>
    >> Fortunately, in his case, that's not necessarily true.... If he
    >> goes the recommended route with a separate connection for each thread,
    >> then Postgres will not serialize multiple inserts coming from separate
    >> connections unless there is something like an ALTER TABLE or REINDEX
    >> concurrently happening on the table.
    >> The whole serialized-inserts thing is strictly something popularized
    >> by MySQL and is by no means necessary or standard (as with a lot of
    >> MySQL).

    >
    > But he commits after every insert, which _does_ force serialization
    > (if only to provide safe transaction boundaries). I understand you
    > can get clever at how to do it, _but_ preserving ACID properties is
    > exactly what I mean by "serialize,"


    First, it's a bad idea to work with your own definition of a very
    domain-specific, standardized term, especially when Postgres's
    Multi-Version Concurrency Control mechanisms are designed
    specifically to preserve ACID compliance without forcing serialized
    transactions on the user.

    Second, unless he specifically sets his transaction isolation level
    to serializable, he will be working in read-committed mode. What
    this specifically means is that two (or more) transactions writing
    to the same table will not block one another. Let's say the user
    has two concurrent inserts to run on the same table that, for
    whatever reason, take a while to run (for example, they insert the
    results of some horribly complex or inefficient select). If either
    is run in serializable mode, then whichever one starts a fraction
    of a second sooner will run to completion before the second is even
    allowed to begin. In (the default) read-committed mode they will
    both begin executing as soon as they are called and will write
    their data regardless of conflicts.

    Commit time (which may be some time later when transactions with
    multiple statements are used) is when conflicts are resolved. So,
    if between the two example transactions there does turn out to be a
    conflict between their results, whichever commits second will roll
    back, and since the data written by the rolled-back transaction
    will not be marked as committed, it will never be visible to any
    other transactions and the space will remain available for future
    transactions.

    Here's the relevant portion of the Postgres docs on all of this:
    http://www.postgresql.org/docs/8.2/interactive/mvcc.html
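
    As a small illustration (a sketch, not from the thread): psycopg2
    lets you pick the isolation level per connection, read committed
    being the default:
    =================
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect(user='postgres', password='postgres',
                            database='postgres')
    # The default is READ COMMITTED; opt in to SERIALIZABLE explicitly.
    conn.set_isolation_level(
        psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE)
    =================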

    > and while I like to bash MySQL as well as the
    > next person, I most certainly am not under the evil sway of the vile
    > MySQL cabal.


    Good to hear ;)

    >
    > The server will have to be able to abort each transaction
    > _independently_ of the others, and so must serialize any index
    > updates that share a page by, for example, landing in the same node
    > of a B-Tree.


    There is nothing inherent in B-trees that prevents identical datums
    from being written in them. If there were, the only thing they'd be
    good for would be unique indexes. Even if you do use a unique
    index, as noted above, constraints and conflicts are only enforced
    at commit time.

    Erik Jones

    Software Developer | Emma®

    800.595.4401 or 615.292.5888
    615.292.0777 (fax)

    Emma helps organizations everywhere communicate & market in style.
    Visit us online at http://www.myemma.com
     
    Erik Jones, Oct 26, 2007
    #6
  7. On Thu, 25 Oct 2007 13:27:40 +0200, Diez B. Roggisch wrote:

    > DB modules aren't necessarily thread-safe. Most of the time, a
    > connection (and of course its cursors) can't be shared between threads.
    >
    > So open a connection for each thread.
    >
    > Diez


    DB modules following DBAPI2 must define the following attribute:

    """
    threadsafety

        Integer constant stating the level of thread safety the
        interface supports. Possible values are:

            0  Threads may not share the module.
            1  Threads may share the module, but not connections.
            2  Threads may share the module and connections.
            3  Threads may share the module, connections and cursors.
    """

    http://www.python.org/dev/peps/pep-0249/
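
    So a quick way to check what your driver promises (a sketch;
    psycopg2 reports level 2, i.e. connections may be shared between
    threads but cursors may not):
    =================
    import psycopg2
    # 2 => threads may share the module and connections, but not cursors.
    print(psycopg2.threadsafety)
    =================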



    --
    Laurent POINTAL -
     
    Laurent Pointal, Oct 26, 2007
    #7
