psycopg2: connect copy_from and copy_to

Discussion in 'Python' started by Thomas Guettler, Feb 19, 2008.

  1. Hi,

    I want to copy data from a production database to
    a testing database. Up to now I used psycopg2 and
    copy_from and copy_to with two cursors and a temporary file.

    This works, but it would be better if the data were piped
    from one cursor to the other without a temporary
    file.

    The psycopg2 methods look like this:
    copy_to(fd, table) # read data
    copy_from(fd, table) # write data
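
    Roughly, my current code looks like this (a sketch; src_cursor and
    dst_cursor are cursors on the two connections, and 'the_table' is a
    placeholder):

    import tempfile

    # stage the COPY output in a temporary file
    buf = tempfile.TemporaryFile()
    src_cursor.copy_to(buf, 'the_table')    # read from production
    buf.seek(0)                             # rewind before reading back
    dst_cursor.copy_from(buf, 'the_table')  # write to testing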

    Using select() (waiting until the fd has data) does not work, since
    I pass the fd in. If copy_to() returned a file descriptor,
    I could use it to feed the data to copy_from().

    I found no way to connect them. Maybe it could be done
    with threading, but I am trying to avoid threads.

    Since the script will run only on Unix, I could use pipes. But
    this would fail too: copy_to() would block before I can
    call copy_from(), because the pipe's buffer fills up and nobody
    is reading the data out of it.
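
    For illustration, the threaded pipe variant I am trying to avoid
    would look something like this (untested sketch, same placeholder
    names as above):

    import os
    import threading

    read_fd, write_fd = os.pipe()
    reader = os.fdopen(read_fd, 'r')
    writer = os.fdopen(write_fd, 'w')

    # the writer must run concurrently, or the pipe buffer fills up
    def produce():
        src_cursor.copy_to(writer, 'the_table')
        writer.close()  # signals EOF to the reading side

    t = threading.Thread(target=produce)
    t.start()
    dst_cursor.copy_from(reader, 'the_table')  # drains the pipe until EOF
    t.join()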

    Any suggestions?

    Thomas
     
    Thomas Guettler, Feb 19, 2008
    #1

  2. Chris

    Guest

    On Feb 19, 5:06 pm, Thomas Guettler <> wrote:
    > I want to copy data from a production database to
    > a testing database. Up to now I used psycopg2 and
    > copy_from and copy_to with two cursors and a temporary file.
    > [snip]
    > Any suggestions?


    Doesn't Postgres come with Export/Import apps? That would be easiest
    (and faster).

    Else, something like:

    prod_cursor.execute('SELECT data FROM production')
    for each_record in prod_cursor.fetchall():
        dev_cursor.execute('INSERT INTO testing VALUES (%s)', each_record)

    that is one way of doing it.
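
    If you do it row by row, cursor.executemany() at least tidies the
    loop, though psycopg2 still issues one INSERT per row (this assumes
    the SELECT returns a single column, as above):

    rows = prod_cursor.fetchall()
    dev_cursor.executemany('INSERT INTO testing VALUES (%s)', rows)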
     
    Chris, Feb 19, 2008
    #2

  3. Guest

    > Doesn't Postgres come with Export/Import apps? That would be easiest
    > (and faster).


    Yes, PostgreSQL core has import/export apps, but they tend to target
    general administration rather than transactional loading/moving of
    data, i.e., dumping and restoring a database or schema. There is a
    pgFoundry project called pgloader that appears to target the more
    general scenario of, well, loading data, but I imagine most people
    end up writing custom ETL for data flow.

    > Else, something like:
    >
    > prod_cursor.execute('SELECT data FROM production')
    > for each_record in prod_cursor.fetchall():
    >     dev_cursor.execute('INSERT INTO testing VALUES (%s)', each_record)
    >
    > that is one way of doing it.


    In any high-volume case you don't want to do that. The current best
    practice for loading data into an existing PostgreSQL database is:
    create a temp table, load into it using COPY, then merge from the
    temp table into the destination (the last part is actually the
    tricky one ;).
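
    A rough psycopg2 sketch of that pattern (table, file, and key names
    are placeholders; the merge predicate depends on your schema):

    import psycopg2

    conn = psycopg2.connect(database='testdb')
    cur = conn.cursor()
    # 1. create a temp table shaped like the destination
    cur.execute("CREATE TEMP TABLE staging (LIKE destination)")
    # 2. bulk-load into the temp table using COPY
    f = open('data.copy')
    cur.copy_from(f, 'staging')
    f.close()
    # 3. merge: insert only the rows not already in the destination
    cur.execute("""
        INSERT INTO destination
        SELECT s.* FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM destination d WHERE d.id = s.id)
    """)
    conn.commit()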
     
    , Feb 19, 2008
    #3
  4. > Doesn't Postgres come with Export/Import apps? That would be easiest
    > (and faster).


    Yes, you can use "pg_dump production ... | psql testdb", but
    this can lead to deadlocks if you call it from a Python script
    that is in the middle of a transaction: the Python script holds
    a lock on a table, so psql can't write to it.

    I don't think calling pg_dump and psql/pg_restore is faster.

    > prod_cursor.execute('SELECT data FROM production')
    > for each_record in prod_cursor.fetchall():
    >     dev_cursor.execute('INSERT INTO testing VALUES (%s)', each_record)


    I know, but COPY is much faster.

    Thomas
     
    Thomas Guettler, Feb 19, 2008
    #4
  5. Guest

    On Feb 19, 8:06 am, Thomas Guettler <> wrote:
    > Any suggestions?


    If you don't mind trying out some beta-quality software, you can try
    my pg_proboscis driver. It has a DBAPI2 interface, but for you to use
    COPY, you'll need to use the GreenTrunk interface:

    import postgresql.interface.proboscis.dbapi2 as db
    # Yeah, it doesn't use libpq, so you'll need to "spell" everything
    # out. And no DSN either, just keywords.
    src = db.connect(user='pgsql', host='localhost', port=5432,
                     database='src')
    dst = db.connect(user='pgsql', host='localhost', port=5432,
                     database='dst')

    fromq = src.greentrunk.Query("COPY tabl TO STDOUT")
    toq = dst.greentrunk.Query("COPY tabl FROM STDIN")

    toq(fromq())


    It's mostly pure Python, so if you don't have any indexes on the
    target table, you'll probably only get about 100,000 - 150,000 records
    per second (of course, it depends on how beefy your CPU is). With
    indexes on a large destination table, I don't imagine the pure-Python
    COPY being the bottleneck.

    $ easy_install pg_proboscis

    Notably, it currently (version 0.9) uses the qmark paramstyle, but I
    plan to make 1.0 much more psyco. =)
    [python.projects.postgresql.org, some of the docs are outdated atm due
    to a recent flurry of development =]
     
    , Feb 19, 2008
    #5
  6. Chris

    Guest

    On Feb 19, 6:23 pm, Thomas Guettler <> wrote:
    > > Doesn't Postgres come with Export/Import apps? That would be easiest
    > > (and faster).
    >
    > Yes, you can use "pg_dump production ... | psql testdb", but
    > this can lead to deadlocks if you call it from a Python script
    > that is in the middle of a transaction.
    > [snip]
    > I know, but COPY is much faster.


    I'm used to Oracle, which doesn't exhibit this problem... ;)
     
    Chris, Feb 19, 2008
    #6
  7. Guest

    On Feb 19, 9:23 am, Thomas Guettler <> wrote:
    > Yes, you can use "pg_dump production ... | psql testdb", but
    > this can lead to deadlocks if you call it from a Python script
    > that is in the middle of a transaction: the Python script holds
    > a lock on a table, so psql can't write to it.


    Hrm? Deadlocks where? Have you considered a cooperative user lock?
    Are you just copying data, i.e., no DDL or indexes?
    What is the script doing? Updating a table with unique indexes?

    > I don't think calling pg_dump and psql/pg_restore is faster.


    Normally it will be. I've heard people cite cases of COPY running at
    about a million records per second into "nicely" configured systems.
    However, if psycopg2's COPY is implemented in C, I'd imagine it could
    achieve similar speeds. psql and psycopg2, both being libpq-based,
    are bound to have similar capabilities, assuming interpreted Python
    code is avoided when feeding the data to libpq.

    > I know, but COPY is much faster.


    yessir.
     
    , Feb 19, 2008
    #7
  8. wrote:
    > On Feb 19, 8:06 am, Thomas Guettler <> wrote:
    >> Any suggestions?

    >
    > If you don't mind trying out some beta-quality software, you can try
    > my pg_proboscis driver. It has a DBAPI2 interface, but for you to use
    > COPY, you'll need to use the GreenTrunk interface:
    >


    So far I have been happy with psycopg2. Why are you developing pg_proboscis?

    Thomas
     
    Thomas Guettler, Feb 20, 2008
    #8
  9. Thomas Guettler

    Guest

    On Feb 20, 9:27 am, Thomas Guettler <> wrote:
    > So far I have been happy with psycopg2.


    Yeah. psyco is good.

    > Why are you developing pg_proboscis?


    [Good or bad as they may be:]

    1. An alternate interface ("greentrunk").
    2. A non-libpq implementation yields better control over the wire,
    which allows:
        a. Custom communication channels (not limited to libpq's
    worldview).
        b. Leveraging protocol features that libpq's API does not fully
    accommodate (think bulk INSERTs using prepared statements with lower
    round-trip costs).
        c. Custom "sub-protocols" (I use this to implement a remote
    Python command/console [pdb'ing stored Python procedures,
    zero-network-latency scripts]).
    3. Makes use of binary types to reduce bandwidth usage. (I started
    developing this before libpq had the ability to describe statements
    to derive statement parameter types and cursor column types(?), so
    using the binary format was painful at best.)
    4. Has potential for being used in event-driven applications without
    threads.
    5. Better control/understanding of running queries allows for
    automatic operation interrupts in exception cases.
    [The last two may be possible using libpq's async interfaces, but
    I'm not entirely sure.]
    6. Arguably greater (well, *easier* is likely a better word)
    portability. While I have yet to get it to work with other Pythons
    (PyPy, IronPython, Jython), the potential to work with these
    alternate implementations is there. The real impediment is
    missing/inconsistent features in those implementations (setuptools
    support; the missing os module in IronPython (I know, I know, and I
    don't care; last time I checked, it's missing from the default
    install, and that's broken :p); Jython is still at 2.2, IIRC).
    7. Bit of a license zealot: psycopg2 is [L?]GPL, while pg_proboscis
    is BSD [or MIT if you like], like PostgreSQL. (Yes, Darcy's
    interface is BSD-licensed, IIRC, but it too is libpq-based.)

    In sum, it yields greater control over the connection, which I
    believe will lead to a more flexible and higher-quality interface
    than a libpq solution.
    [The primary pain point so far has been implementing all the
    authentication mechanisms supported by PG.]
     
    , Feb 21, 2008
    #9
