Re: How to find bad row with db api executemany()?

Discussion in 'Python' started by Dennis Lee Bieber, Mar 29, 2013.

  1. On Fri, 29 Mar 2013 14:53:30 -0400, Dave Angel <>
    declaimed the following in gmane.comp.python.general:

    > On 03/29/2013 10:48 AM, Roy Smith wrote:
    > > I'm inserting a gazillion rows into a MySQL database using MySQLdb and cursor.executemany() for efficiency. Every once in a while, I get a row which violates some kind of database constraint and raises Error.
    > >
    > > I can catch the exception, but don't see any way to tell which row caused the problem. Is this information obtainable, short of retrying each row one by one?
    > >

    >
    > I don't know the direct answer, or even if there is one (way to get
    > MySQL to tell you which one failed), but ...
    >
    > Assuming that executeMany is much cheaper than a million calls to
    > executeOne (or whatever).


    If using MySQLdb, there isn't all that much difference... MySQLdb is
    still compatible with MySQL v4 (and maybe even v3), and since those
    versions don't have "prepared statements", .executemany() essentially
    turns into something that creates a newline delimited "list" of
    "identical" (but for argument substitution) statements and submits that
    to MySQL.

    Just look at the source code (cursors.py in the MySQLdb package).

    Hmmm... That gives a rally point. If it IS submitting one massive
    string containing all the data copies, could the failure be coming from
    submitting something too big for the client/server communication
    channel?
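
    For a quick sanity check along those lines, something like the
    following sketch could compare a rough estimate of the batch size
    against the server's max_allowed_packet. The table "t", the dummy rows,
    and the connection parameters are all hypothetical; this is not MySQLdb
    internals, just an estimate from the outside:

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pw", db="test")
    cur = conn.cursor()

    cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
    max_packet = int(cur.fetchone()[1])

    sql = "INSERT INTO t (a, b) values (%s, %s)"
    rows = [(i, "x" * 100) for i in range(100000)]

    # Very rough estimate of the statement executemany() would build.
    approx_bytes = len(sql) + sum(len(repr(r)) for r in rows)
    print("estimated batch: %d bytes, max_allowed_packet: %d"
          % (approx_bytes, max_packet))
    if approx_bytes >= max_packet:
        print("consider splitting the batch before calling executemany()")
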
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Mar 29, 2013
    #1

  2. Roy Smith

    In article <>,
    Dennis Lee Bieber <> wrote:

    > If using MySQLdb, there isn't all that much difference... MySQLdb is
    > still compatible with MySQL v4 (and maybe even v3), and since those
    > versions don't have "prepared statements", .executemany() essentially
    > turns into something that creates a newline delimited "list" of
    > "identical" (but for argument substitution) statements and submits that
    > to MySQL.


    Shockingly, that does appear to be the case. I had thought during my
    initial testing that I was seeing far greater throughput, but as I got
    more into the project and started doing some side-by-side comparisons,
    the differences went away.

    We're sucking in a pretty huge amount of data. The source document is a
    7 GB gzipped XML file. I'm not sure how big it is uncompressed (we use
    gzip.GzipFile to uncompress on the fly) but I'm guessing something like
    a 30x compression ratio so 200 GB? The last time we imported the whole
    set, it ran for 21 days!

    It turns out, the problems we were seeing were all inserts into a new
    table we added. Apparently, the default charset is latin-1 and we
    didn't notice that when we altered the schema! Once I noticed that all
    the other tables were utf-8 and changed this one to match, the
    problems went away.

    Sadly, I ended up resorting to a truly ugly hack to diagnose the
    problem. I catch the exception and parse the text message. Yuck.

    try:
        self.executemany(self.sql_statement, self.sql_params)
    except MySQLdb.Error as ex:
        code, message = ex.args
        m = re.search(r".* at row (\d+)$", message)
        if m:
            i = int(m.group(1)) - 1  # Python is 0-indexed, SQL is 1-indexed
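
    A less ugly alternative, sketched here with a hypothetical helper and no
    message parsing: once a chunk has failed, bisect the parameter list to
    isolate the offending row. This assumes the failure is intrinsic to a
    single row (bad encoding, NULL in a NOT NULL column, ...) rather than an
    interaction between rows:

    import MySQLdb

    def find_bad_row(cursor, sql, params):
        """Return the index of the row that fails on its own.

        Call only after executemany(sql, params) has already raised. The
        probes insert rows as a side effect, so roll the transaction back
        afterwards.
        """
        if len(params) == 1:
            return 0
        mid = len(params) // 2
        try:
            cursor.executemany(sql, params[:mid])
        except MySQLdb.Error:
            return find_bad_row(cursor, sql, params[:mid])
        return mid + find_bad_row(cursor, sql, params[mid:])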


    The other truly horrible part of the project was when I decided it was
    bad for my import script to have too much schema knowledge hard-wired
    in. So, I decided to use SQLAlchemy to introspect the database and
    discover the column names, types, and defaults. It turns out, if an
    integer column has a default (say, 0), the introspected data comes back
    with the default as the string, '0'. WTF???
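
    For what it's worth, that string default appears to be reflection
    working as designed: the reflected server default comes back as SQL
    text regardless of the column type. A minimal sketch of coercing it
    through the column type, with a hypothetical connection URL and table
    name (some types don't implement python_type, hence the try/except):

    from sqlalchemy import create_engine, inspect

    engine = create_engine("mysql+mysqldb://user:pw@localhost/test")
    insp = inspect(engine)

    for col in insp.get_columns("my_table"):
        default = col["default"]          # e.g. the string "0" or "'0'"
        if default is not None:
            try:
                default = col["type"].python_type(default.strip("'"))
            except (NotImplementedError, TypeError, ValueError):
                pass                      # leave it as text if coercion fails
        print(col["name"], col["type"], default)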

    Does Postgres's Python adapter handle executemany() in a sane way?
    We're not wedded to MySQL in any way. We use it for exactly this one
    process. We get these XML dumps from a supplier's SQL-Server database.
    We stage the data in MySQL, then export what we need into MongoDB. We
    could easily swap out the MySQL staging for Postgres if that worked
    better.

    Hmmm, we do take advantage of REPLACE INTO, which I think is a
    non-standard MySQL addition. Not sure if Postgres supports that.
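
    For the record, psycopg2's plain executemany() is also effectively a
    per-row loop. The pieces that make this comfortable on the Postgres
    side arrived after this thread: psycopg2.extras.execute_values() for
    batched multi-row INSERTs, and INSERT ... ON CONFLICT (PostgreSQL 9.5+)
    as the rough analogue of REPLACE INTO, though it updates in place
    rather than delete-and-insert. A minimal sketch with hypothetical table
    and column names:

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname=test user=user")
    cur = conn.cursor()

    rows = [(1, "a"), (2, "b"), (3, "c")]
    execute_values(
        cur,
        "INSERT INTO staging (id, payload) VALUES %s "
        "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
        rows,
        page_size=1000,
    )
    conn.commit()
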
     
    Roy Smith, Mar 30, 2013
    #2

  3. On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <> wrote:
    > In article <>,
    > Dennis Lee Bieber <> wrote:
    >
    >> If using MySQLdb, there isn't all that much difference... MySQLdb is
    >> still compatible with MySQL v4 (and maybe even v3), and since those
    >> versions don't have "prepared statements", .executemany() essentially
    >> turns into something that creates a newline delimited "list" of
    >> "identical" (but for argument substitution) statements and submits that
    >> to MySQL.

    >
    > Shockingly, that does appear to be the case. I had thought during my
    > initial testing that I was seeing far greater throughput, but as I got
    > more into the project and started doing some side-by-side comparisons,
    > the differences went away.


    How much are you doing per transaction? The two extremes (everything
    in one transaction, or each line in its own transaction) are probably
    the worst for performance. See what happens if you pepper the code
    with 'begin' and 'commit' statements (maybe every thousand or ten
    thousand rows) to see if performance improves.
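
    A minimal sketch of that batching, assuming a MySQLdb connection `conn`,
    an iterable `rows`, and a hypothetical staging table; MySQLdb runs with
    autocommit off, so each batch implicitly opens its own transaction once
    the previous commit() has gone through:

    BATCH = 10000

    cur = conn.cursor()
    pending = 0
    for row in rows:
        cur.execute("INSERT INTO staging (a, b) values (%s, %s)", row)
        pending += 1
        if pending == BATCH:
            conn.commit()          # one commit per BATCH rows, not per row
            pending = 0
    if pending:
        conn.commit()              # flush the final partial batch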

    ChrisA
     
    Chris Angelico, Mar 30, 2013
    #3
  4. Roy Smith

    In article <>,
    Chris Angelico <> wrote:

    > On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <> wrote:
    > > In article <>,
    > > Dennis Lee Bieber <> wrote:
    > >
    > >> If using MySQLdb, there isn't all that much difference... MySQLdb is
    > >> still compatible with MySQL v4 (and maybe even v3), and since those
    > >> versions don't have "prepared statements", .executemany() essentially
    > >> turns into something that creates a newline delimited "list" of
    > >> "identical" (but for argument substitution) statements and submits that
    > >> to MySQL.

    > >
    > > Shockingly, that does appear to be the case. I had thought during my
    > > initial testing that I was seeing far greater throughput, but as I got
    > > more into the project and started doing some side-by-side comparisons,
    > > the differences went away.

    >
    > How much are you doing per transaction? The two extremes (everything
    > in one transaction, or each line in its own transaction) are probably
    > the worst for performance. See what happens if you pepper the code
    > with 'begin' and 'commit' statements (maybe every thousand or ten
    > thousand rows) to see if performance improves.
    >
    > ChrisA


    We're doing it all in one transaction, on purpose. We start with an
    initial dump, then get updates about once a day. We want to make sure
    that the updates either complete without errors, or back out cleanly.
    If we ever had a partial daily update, the result would be a mess.

    Hmmm, on the other hand, I could probably try doing the initial dump the
    way you describe. If it fails, we can just delete the whole thing and
    start again.
     
    Roy Smith, Mar 30, 2013
    #4
  5. On Sat, Mar 30, 2013 at 12:19 PM, Roy Smith <> wrote:
    > In article <>,
    > Chris Angelico <> wrote:
    >
    >> On Sat, Mar 30, 2013 at 11:41 AM, Roy Smith <> wrote:
    >> > In article <>,
    >> > Dennis Lee Bieber <> wrote:
    >> >
    >> >> If using MySQLdb, there isn't all that much difference... MySQLdb is
    >> >> still compatible with MySQL v4 (and maybe even v3), and since those
    >> >> versions don't have "prepared statements", .executemany() essentially
    >> >> turns into something that creates a newline delimited "list" of
    >> >> "identical" (but for argument substitution) statements and submits that
    >> >> to MySQL.
    >> >
    >> > Shockingly, that does appear to be the case. I had thought during my
    >> > initial testing that I was seeing far greater throughput, but as I got
    >> > more into the project and started doing some side-by-side comparisons,
    >> > the differences went away.

    >>
    >> How much are you doing per transaction? The two extremes (everything
    >> in one transaction, or each line in its own transaction) are probably
    >> the worst for performance. See what happens if you pepper the code
    >> with 'begin' and 'commit' statements (maybe every thousand or ten
    >> thousand rows) to see if performance improves.
    >>
    >> ChrisA

    >
    > We're doing it all in one transaction, on purpose. We start with an
    > initial dump, then get updates about once a day. We want to make sure
    > that the updates either complete without errors, or back out cleanly.
    > If we ever had a partial daily update, the result would be a mess.
    >
    > Hmmm, on the other hand, I could probably try doing the initial dump the
    > way you describe. If it fails, we can just delete the whole thing and
    > start again.


    One transaction for the lot isn't nearly as bad as one transaction per
    row, but it can consume a lot of memory on the server - or at least,
    that's what I found last time I worked with MySQL. (PostgreSQL works
    completely differently, and I'd strongly recommend doing it all as one
    transaction if you switch.) It's not guaranteed to help, but if it
    won't hurt to try, there's a chance you'll gain some performance.

    ChrisA
     
    Chris Angelico, Mar 30, 2013
    #5
  6. Roy Smith

    In article <>,
    Roy Smith <> wrote:

    > In article <>,
    > Dennis Lee Bieber <> wrote:
    >
    > > If using MySQLdb, there isn't all that much difference... MySQLdb is
    > > still compatible with MySQL v4 (and maybe even v3), and since those
    > > versions don't have "prepared statements", .executemany() essentially
    > > turns into something that creates a newline delimited "list" of
    > > "identical" (but for argument substitution) statements and submits that
    > > to MySQL.

    >
    > Shockingly, that does appear to be the case. I had thought during my
    > initial testing that I was seeing far greater throughput, but as I got
    > more into the project and started doing some side-by-side comparisons,
    > > the differences went away.


    OMG, this is amazing.

    http://stackoverflow.com/questions/3945642/

    It turns out, the MySQLdb executemany() runs a regex over your SQL and
    picks one of two algorithms depending on whether it matches or not.

    restr = (r"\svalues\s*"
    r"(\(((?<!\\)'[^\)]*?\)[^\)]*(?<!\\)?'"
    r"|[^\(\)]|"
    r"(?:\([^\)]*\))"
    r")+\))")

    Leaving aside the obvious line-noise aspects, the operative problem here
    is that it only looks for "values" (in lower case).

    I've lost my initial test script which convinced me that executemany()
    would be a win; I'm assuming I used lower case for that. Our production
    code uses "VALUES".

    The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
    seconds. When I switch to "values", I'm getting more like 1000 rows in
    100 ms!
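
    The workaround, then, is literally a case change in the SQL string. A
    minimal sketch, with a hypothetical table and throwaway connection
    parameters:

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pw", db="test")
    cursor = conn.cursor()
    rows = [(1, "a"), (2, "b"), (3, "c")]

    # Lower-case "values" matches MySQLdb's regex, so executemany() builds
    # one multi-row INSERT; "VALUES" silently falls back to row-by-row.
    cursor.executemany("INSERT INTO track (id, title) values (%s, %s)", rows)
    conn.commit()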

    A truly breathtaking bug.
     
    Roy Smith, Mar 30, 2013
    #6
  7. On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <> wrote:
    > The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
    > seconds. When I switch to "values", I'm getting more like 1000 rows in
    > 100 ms!
    >
    > A truly breathtaking bug.


    *facepalm*

    Doubly facepalm because a regex could easily have tested for mixed case.

    Especially facepalm because there's some way to do this that's faster
    than straight INSERT statements, and it's not clearly documented as
    "hey, guys, if you want to dump loads of data in, use COPY instead"
    (it might be that, I don't know, but usually COPY isn't directly
    transliterable with INSERT).

    I agree. Breathtaking.

    ChrisA
     
    Chris Angelico, Mar 30, 2013
    #7
  8. Roy Smith

    In article <>,
    Chris Angelico <> wrote:

    > Especially facepalm because there's some way to do this that's faster
    > than straight INSERT statements, and it's not clearly documented as
    > "hey, guys, if you want to dump loads of data in, use COPY instead"
    > (it might be that, I don't know, but usually COPY isn't directly
    > transliterable with INSERT).


    We're actually using REPLACE INTO. For the initial data load, we could
    just as well do INSERT, but we need the REPLACE functionality as we roll
    in the daily incremental updates.

    This also explains why, even after provisioning our RDS instance for
    2000 IOPS (that's AWS-speak for "we paid extra to get more disk
    bandwidth"), we didn't see any performance improvement!
     
    Roy Smith, Mar 30, 2013
    #8
  9. rusi

    On Mar 30, 7:49 am, Chris Angelico <> wrote:
    > On Sat, Mar 30, 2013 at 1:44 PM, Roy Smith <> wrote:
    > > The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
    > > seconds.  When I switch to "values", I'm getting more like 1000 rows in
    > > 100 ms!

    >
    > > A truly breathtaking bug.

    >
    > *facepalm*
    >
    > Doubly facepalm because a regex could easily have tested for mixed case.
    >
    > Especially facepalm because there's some way to do this that's faster
    > than straight INSERT statements, and it's not clearly documented as
    > "hey, guys, if you want to dump loads of data in, use COPY instead"
    > (it might be that, I don't know, but usually COPY isn't directly
    > transliterable with INSERT).
    >
    > I agree. Breathtaking.
    >
    > ChrisA


    I recently heard this:
    A phone company needed to send out bulk-smses to its customers. It was
    of the order of millions.
    A (noob?) python programmer was assigned the task and used django with
    whatever is the django orm.
    It took of the order of weeks to send out the smses.
    A python expert was called in. He threw out the python and redid it
    in SQL.
    It was done in minutes.
     
    rusi, Mar 30, 2013
    #9
  10. On Sat, Mar 30, 2013 at 2:09 PM, Roy Smith <> wrote:
    > In article <>,
    > Chris Angelico <> wrote:
    >
    >> Especially facepalm because there's some way to do this that's faster
    >> than straight INSERT statements, and it's not clearly documented as
    >> "hey, guys, if you want to dump loads of data in, use COPY instead"
    >> (it might be that, I don't know, but usually COPY isn't directly
    >> transliterable with INSERT).

    >
    > We're actually using REPLACE INTO. For the initial data load, we could
    > just as well do INSERT, but we need the REPLACE functionality as we roll
    > in the daily incremental updates.
    >
    > This also explains why, even after provisioning our RDS instance for
    > 2000 IOPS (that's AWS-speak for "we paid extra to get more disk
    > bandwidth"), we didn't see any performance improvement!


    Hmm. I heard around the forums that Amazon weren't that great at disk
    bandwidth anyway, and that provisioning IO was often a waste of money.
    But we never did all that much research on Amazon I/O
    performance; shortly after doing some basic benchmarking, we decided
    that the cloud was a poor fit for our system model, and went looking
    at dedicated servers with their own RAID storage right there on the
    bus.

    ChrisA
     
    Chris Angelico, Mar 30, 2013
    #10
  11. rusi

    On Mar 30, 8:13 am, rusi <> wrote:

    > It took of the order of weeks to send out the smses.


    'Week' I think is more accurate.
     
    rusi, Mar 30, 2013
    #11
  12. Tim Chase

    On 2013-03-29 21:19, Roy Smith wrote:
    > We're doing it all in one transaction, on purpose. We start with
    > an initial dump, then get updates about once a day. We want to
    > make sure that the updates either complete without errors, or back
    > out cleanly. If we ever had a partial daily update, the result
    > would be a mess.


    Having had to do some similarly-sized bulk data loads (in my case,
    MSSqlServer at $JOB), a couple of other ideas occur to me:

    1) I believe MySQL has a side-loading function (I'd have to go
    digging for it; a quick google suggests a "LOAD DATA INFILE"
    statement[1]) that allows you to load data from an external file such
    as an XML or CSV file (a sketch follows the references below).

    2) Load into a temp table in testable batches, then do some sort of
    batch insert into your main table. Again, a quick google suggests the
    "INSERT ... SELECT" syntax[2].

    -tkc

    [1]
    http://dev.mysql.com/doc/refman/5.1/en/load-data.html

    [2]
    http://dev.mysql.com/doc/refman/5.0/en/insert-select.html
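
    A sketch of option (1), as promised above: bulk-loading a delimited file
    with LOAD DATA LOCAL INFILE. File, table, and column names are
    hypothetical, and it assumes both the server and the driver permit
    LOCAL INFILE (hence the local_infile=1 connect flag):

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pw",
                           db="test", local_infile=1)
    cur = conn.cursor()
    cur.execute(
        "LOAD DATA LOCAL INFILE '/tmp/staging.csv' "
        "INTO TABLE staging "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "(id, title, artist)"
    )
    conn.commit()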
     
    Tim Chase, Mar 30, 2013
    #12
  13. Roy Smith

    In article <>,
    Chris Angelico <> wrote:

    > Hmm. I heard around the forums that Amazon weren't that great at disk
    > bandwidth anyway, and that provisioning IO was often a waste of money.


    Au contraire. I guess it all depends on what you're doing. If you're
    CPU bound, increasing your I/O bandwidth won't help. But, at least on
    our database (MongoDB) servers, we saw a huge performance boost when we
    started going for provisioned IO.

    > But we never did all that much research on Amazon I/O
    > performance; shortly after doing some basic benchmarking, we decided
    > that the cloud was a poor fit for our system model, and went looking
    > at dedicated servers with their own RAID storage right there on the
    > bus.


    As far as I can tell, from a raw price/performance basis, they're pretty
    expensive. But, from a convenience standpoint, it's hard to beat.

    Case in point: We've been thinking about SSD as our next performance
    step-up. One day, we just spun up some big honking machine, configured
    it with 2 TB of SSD, and played around for a while. Wicked fast. Then
    we shut it down. That experiment probably cost us $10 or so, and we
    were able to run it on the spur of the moment.

    Another example was last summer when we had a huge traffic spike because
    of a new product release. Caught us by surprise how much new traffic it
    would generate. Our site was in total meltdown. We were able to spin
    up 10 new servers in an afternoon. If we had to go out and buy
    hardware, have it shipped to us, figure out where we had rack space,
    power, network capacity, cooling, etc, we'd have been out of business
    before we got back on the air.

    Yet another example. We just (as in, while I've been typing this) had
    one of our servers go down. Looks like the underlying hardware the VM
    was running on croaked, because when the instance came back up, it had a
    new IP address. The whole event was over in a couple of minutes, with
    only minor disruption to the service. And, presumably, there's some
    piece of hardware somewhere in Virginia that needs repairing, but that's
    not our problem.

    The really big boys (Google, Facebook) run their own data centers. But,
    some surprisingly large operations run out of AWS. Netflix, for
    example. The convenience and flexibility is worth a lot.
     
    Roy Smith, Mar 30, 2013
    #13
  14. Tim Chase

    Re: How to find bad row with db api executemany()? (PS)

    On 2013-03-29 22:17, Tim Chase wrote:
    > 2) Load into a temp table in testable batches, then do some sort of
    > batch insert into your main table. Again, a quick google suggest
    > the "INSERT ... SELECT" syntax[2]


    It looks like there's a corresponding "REPLACE INTO ... SELECT"
    syntax[1], since you mention doing a REPLACE INTO rather than a
    straight INSERT.

    -tkc

    [1]
    http://dev.mysql.com/doc/refman/5.0/en/replace.html
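
    Sketching that staged variant: load a batch into a temporary table, then
    fold it into the main table with REPLACE INTO ... SELECT. Table and
    column names are hypothetical, `batch` is an assumed list of parameter
    tuples, and `conn` an open MySQLdb connection:

    cur = conn.cursor()
    cur.execute("CREATE TEMPORARY TABLE staging LIKE track")
    cur.executemany(
        "INSERT INTO staging (id, title) values (%s, %s)",  # lower-case "values"
        batch,
    )
    cur.execute("REPLACE INTO track (id, title) SELECT id, title FROM staging")
    cur.execute("DROP TEMPORARY TABLE staging")
    conn.commit()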
     
    Tim Chase, Mar 30, 2013
    #14
  15. On Fri, 29 Mar 2013 21:19:22 -0400, Roy Smith <> declaimed
    the following in gmane.comp.python.general:

    > In article <>,
    > Chris Angelico <> wrote:
    >
    > >
    > > How much are you doing per transaction? The two extremes (everything
    > > in one transaction, or each line in its own transaction) are probably
    > > the worst for performance. See what happens if you pepper the code
    > > with 'begin' and 'commit' statements (maybe every thousand or ten
    > > thousand rows) to see if performance improves.
    > >
    > > ChrisA

    >
    > We're doing it all in one transaction, on purpose. We start with an
    > initial dump, then get updates about once a day. We want to make sure
    > that the updates either complete without errors, or back out cleanly.
    > If we ever had a partial daily update, the result would be a mess.
    >

    As I recall, a DB-API-compliant adapter starts a transaction when
    you submit a statement, so an explicit "begin" should not be required.

    My suggestion would be to first try just breaking the .executemany()
    into chunks rather than everything at once. Submit maybe 500-1000
    records at a time via .executemany(), checking for errors -- if a block
    errors you can still rollback() the entire transaction AND you have a
    smaller set of data to examine. Then at the end, if no errors, do the
    commit().
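
    A sketch of that chunked scheme: one transaction overall, .executemany()
    per chunk, roll the whole load back on any error but report which chunk
    broke. Here `conn`, `sql`, and `all_rows` are assumed names:

    import MySQLdb

    CHUNK = 1000

    cur = conn.cursor()
    try:
        for start in range(0, len(all_rows), CHUNK):
            chunk = all_rows[start:start + CHUNK]
            try:
                cur.executemany(sql, chunk)
            except MySQLdb.Error:
                print("failure somewhere in rows %d-%d"
                      % (start, start + len(chunk) - 1))
                raise
    except MySQLdb.Error:
        conn.rollback()        # back the whole load out cleanly
    else:
        conn.commit()          # single commit at the end
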
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Mar 30, 2013
    #15
  16. Roy Smith

    In article
    <>,
    rusi <> wrote:

    > I recently heard this:
    > A phone company needed to send out bulk-smses to its customers. It was
    > of the order of millions.
    > A (noob?) python programmer was assigned the task and used django with
    > whatever is the django orm.
    > It took of the order of weeks to send out the smses.
    > A python expert was called in. He threw out the python and redid it
    > in SQL.
    > It was done in minutes.


    I'm not surprised. It almost certainly wasn't the python that was the
    problem. More than likely, he was doing some horribly inefficient
    database operations.

    Certainly, in our case, performance is all about the database. We
    mostly can't even measure the time we spend running Python code.
     
    Roy Smith, Mar 30, 2013
    #16
  17. On Fri, 29 Mar 2013 22:44:53 -0400, Roy Smith <> declaimed
    the following in gmane.comp.python.general:

    >
    > OMG, this is amazing.
    >
    > http://stackoverflow.com/questions/3945642/
    >
    > It turns out, the MySQLdb executemany() runs a regex over your SQL and
    > picks one of two algorithms depending on whether it matches or not.
    >

    Hmm, I never tracked deeper than the cursors.executemany() block...
    Then again, I don't do regexes (for my uses I've often been able to code
    a simple parser using .split(), "string" in list-or-string, etc. faster
    than reading the help system for regex syntax).

    > restr = (r"\svalues\s*"
    > r"(\(((?<!\\)'[^\)]*?\)[^\)]*(?<!\\)?'"
    > r"|[^\(\)]|"
    > r"(?:\([^\)]*\))"
    > r")+\))")
    >
    > Leaving aside the obvious line-noise aspects, the operative problem here
    > is that it only looks for "values" (in lower case).
    >

    Well, I suppose if one's application documentation is complete
    enough to mention it, one could edit that part of MySQLdb to be
    case insensitive (or apply .lower() to the query string only where
    this check is performed), and document it so that the next time one
    updates the adapter there is a reminder to adjust that functionality.

    > I've lost my initial test script which convinced me that executemany()
    > would be a win; I'm assuming I used lower case for that. Our production
    > code uses "VALUES".
    >

    I'd not have been affected -- I don't write SQL with uppercase
    keywords in practice; maybe only when discussing in a forum will I take
    the time to uppercase them...

    > The slow way (i.e. "VALUES"), I'm inserting 1000 rows about every 2.4
    > seconds. When I switch to "values", I'm getting more like 1000 rows in
    > 100 ms!
    >

    The slow algorithm literally is "one .execute() per record".
    Looking at my MySQL references, MySQL does support a form of INSERT
    that takes a "list" of value tuples in one statement (not a "prepared
    statement" with placeholders, which is what I suspect PostgreSQL's
    adapters use).
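
    Illustrating the two shapes with a hypothetical table and values (real
    drivers escape parameters properly; the string formatting here is only
    for display). The slow path issues one INSERT per row, the fast path
    folds everything into MySQL's multi-row INSERT extension:

    rows = [(1, "x"), (2, "y"), (3, "z")]

    slow = ["INSERT INTO t (a, b) VALUES (%d, '%s')" % r for r in rows]
    fast = ("INSERT INTO t (a, b) values "
            + ",".join("(%d, '%s')" % r for r in rows))

    print("\n".join(slow))   # three separate statements
    print(fast)              # ... values (1, 'x'),(2, 'y'),(3, 'z')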

    > A truly breathtaking bug.


    Especially given how long MySQLdb has existed (the multi-record
    INSERT goes back to MySQL 3.something).
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Mar 30, 2013
    #17
  18. On Sat, Mar 30, 2013 at 2:36 PM, Roy Smith <> wrote:
    > In article <>,
    > Chris Angelico <> wrote:
    >
    >> Hmm. I heard around the forums that Amazon weren't that great at disk
    >> bandwidth anyway, and that provisioning IO was often a waste of money.

    >
    > Au contraire. I guess it all depends on what you're doing. If you're
    > CPU bound, increasing your I/O bandwidth won't help. But, at least on
    > our database (MongoDB) servers, we saw a huge performance boost when we
    > started going for provisioned IO.
    >
    > As far as I can tell, from a raw price/performance basis, they're pretty
    > expensive. But, from a convenience standpoint, it's hard to beat.


    Yeah, I'm not saying you won't see performance go up - just that it's
    going to be expensive compared to what you could do with a dedicated
    server. The flexibility costs.

    > Case in point: We've been thinking about SSD as our next performance
    > step-up. One day, we just spun up some big honking machine, configured
    > it with 2 TB of SSD, and played around for a while. Wicked fast. Then
    > we shut it down. That experiment probably cost us $10 or so, and we
    > were able to run it on the spur of the moment.


    That is one thing the cloud is *awesome* for. "Hmm, I wonder......"
    *half an hour later* "Now I know." *bill comes in* "That was cheap."

    > Another example was last summer when we had a huge traffic spike because
    > of a new product release. Caught us by surprise how much new traffic it
    > would generate. Our site was in total meltdown. We were able to spin
    > up 10 new servers in an afternoon. If we had to go out and buy
    > hardware, have it shipped to us, figure out where we had rack space,
    > power, network capacity, cooling, etc, we'd have been out of business
    > before we got back on the air.


    Yep. We're looking rather at server rental, from big honking data
    centers (by the way, I never managed to figure this out - at what size
    do things begin to honk? Larger than a bread box?), where hopefully
    they would be able to deploy us more servers within a matter of hours.
    Not as good as cloud (where you can have more servers in minutes), but
    you described "in an afternoon" as your success story, so I'm guessing
    most of the delay was an administrative one - the decision to actually
    go ahead and spin those servers up. Unless you're automating the whole
    thing (which is possible with the Amazon cloud, but not every client
    will do it), that's always going to cost you.

    > Yet another example. We just (as in, while I've been typing this) had
    > one of our servers go down. Looks like the underlying hardware the VM
    > was running on croaked, because when the instance came back up, it had a
    > new IP address. The whole event was over in a couple of minutes, with
    > only minor disruption to the service. And, presumably, there's some
    > piece of hardware somewhere in Virginia that needs repairing, but that's
    > not our problem.


    Yeah, that's also a concern. And that works beautifully as long as
    you're okay with that. It'll also happen more often with AWS than with
    dedicated hardware, because there are more components to fail. That's
    part of what wouldn't have fitted our server layout; we have a layer
    on top of all that to manage monitoring (and, incidentally, we set up
    a dozen home-grade laptops as a "server cluster" to test all those
    systems, and knew within minutes of one of the laptops deciding that
    it had crunched its last bits), and we really want stable IP addresses
    for DNS and such. Cloud isn't bad, it was just a bad fit for us.

    Side point: You mentioned SSDs. Are you aware of the fundamental risks
    associated with them? Only a handful of SSD models are actually
    trustworthy for databasing. Sure, they're fast, but can you afford
    data corruption in the event of a power outage? Most SSDs will
    cheerfully lie about fsync() and thus violate transactional integrity
    (ACID compliance etc).

    ChrisA
     
    Chris Angelico, Mar 30, 2013
    #18
  19. Roy Smith

    In article <>,
    Chris Angelico <> wrote:

    > Side point: You mentioned SSDs. Are you aware of the fundamental risks
    > associated with them? Only a handful of SSD models are actually
    > trustworthy for databasing.


    We haven't decided if we're going that route yet, but if we do, we will
    probably use RAID SSD for added reliability. We also have all our
    database servers in failover clusters, so we get extra protection that
    way too.

    But, we have some runway left with more conventional technologies, so we
    don't need to decide for a while. Ultimately, however, as reliability
    goes up and cost comes down, it's hard to imagine the SSD isn't going to
    be a big part of our lives at some point.
     
    Roy Smith, Mar 30, 2013
    #19
  20. On Sat, 30 Mar 2013 13:49:56 +1100, Chris Angelico <>
    declaimed the following in gmane.comp.python.general:


    > Especially facepalm because there's some way to do this that's faster
    > than straight INSERT statements, and it's not clearly documented as
    > "hey, guys, if you want to dump loads of data in, use COPY instead"
    > (it might be that, I don't know, but usually COPY isn't directly
    > transliterable with INSERT).
    >


    While COPY is found in my ancient PostgreSQL book, I don't find it
    in MySQL. I suspect the equivalent is LOAD DATA INFILE. Firebird does it
    completely differently: one does a CREATE TABLE exttable EXTERNAL FILE
    "fixed-length.txt" (table definition), then creates a matching internal
    table, and finally performs an INSERT INTO internal SELECT fields FROM
    exttable.

    I think MySQL is the only common DBMS with an extension on INSERT of
    allowing multiple records (I've not checked my Access 2010 docs, and my
    MSDE/SQL-Server books are in storage -- but SQLite3, Firebird, and
    PostgreSQL all seem to be "one INSERT = one record").



    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Mar 30, 2013
    #20
