Extracting data from dump file

Discussion in 'Python' started by TYR, Nov 23, 2007.

  1. TYR

    TYR Guest

    I have a large dump file that originated in a MySQL db; I need to get
    it into an SQLite file.

    Various options are suggested around the web; none of them seem to
    work (most failing to import the thing in the first place). So I
    removed the assorted taggery from each end, leaving just a big text
    file taking the following format:

    ('value', 'value', 'value', 'value'),
    ('value','value','value','value')...

    I planned to find some way of splitting the thing at the commas
    outside the bracketed groups, thus giving me a list of tuples; then I
    could of course CREATE TABLE foo VALUES '1', '2', '3', '4' and then
    iterate through the list INSERTing INTO.

    Then my problems began; I tried using the Python csv module, replacing
    the string ),( with \t and then using \t as the delimiter. First
    problem; there's a size limit coded into the module. No problem, use
    csv.field_size_limit() to alter it. Problem; it doesn't actually parse
    at all, just sends the whole thing as a string and the SQL INSERT
    fails with a "not enough args" error.

    Tried using string.split() and re.split(data, r'\t'); first gave the
    same error, second failed with a "too many named groups" error. Tried
    using ; as a delimiter and going back to csv; this fails to match
    the ; for some reason. Any ideas?
    TYR, Nov 23, 2007
    #1

  2. John Machin

    John Machin Guest

    On Nov 24, 5:46 am, TYR <> wrote:
    > I have a large dump file that originated in a MySQL db; I need to get
    > it into an SQLite file.
    >
    > Various options are suggested around the web; none of them seem to
    > work (most failing to import the thing in the first place). So I
    > removed the assorted taggery from each end, leaving just a big text
    > file taking the following format:
    >
    > ('value', 'value', 'value', 'value'),
    > ('value','value','value','value')...


    Consider the possibility that individual data rows are separated by
    "),\n(" or maybe "), \n(" or something else, not "),(".

    What does
    print repr(open("my_dump.txt", "rb").read(1000))
    tell you? Are there any "\n" in there at all? BTW, what platform are
    you running on?

    >
    > I planned to find some way of splitting the thing at the commas
    > outside the bracketed groups, thus giving me a list of tuples; then I
    > could of course CREATE TABLE foo VALUES '1', '2', '3', '4' and then


    CREATE TABLE foo VALUES '1', .... are you sure?


    > iterate through the list INSERTing INTO.
    >
    > Then my problems began; I tried using the Python csv module, replacing
    > the string ),( with \t and then using \t as the delimiter. First
    > problem; there's a size limit coded into the module. No problem, use
    > csv.field_size_limit() to alter it.


    >>> import csv
    >>> csv.field_size_limit()

    131072
    >>>


    And do you really expect that any one of your data rows will take more
    than 128KB in the dump file?? How large did you have to set the limit
    to make it "work"???

    > Problem; it doesn't actually parse
    > at all,


    Sounds like it did parse, with the result being one csv row containing
    one very long field.

    > just sends the whole thing as a string and the SQL INSERT
    > fails with a "not enough args" error.
    >
    > Tried using string.split() and re.split(data, r'\t'); first gave the
    > same error, second failed with a "too many named groups" error.


    All of these are indications that you don't have the correct
    delimiter.

    > Tried
    > using ; as a delimiter and going back to csv; this fails to match
    > the ; for some reason. Any ideas?


    Examine your dump carefully. Are there any newlines other than at the
    logical end of data rows (like e.g. it's folded at column 78)? If so,
    you'll have to replace those by spaces or delete them. Work out
    exactly what is separating data rows. Let's call it "sep". If you can
    use file_contents.replace(sep, "\t"), then you can use
    file_contents.split(sep) directly; there's no need for a two-step
    approach. If sep is variable, then use re.split.
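
    For instance, a minimal sketch (the separator "),\n(" and the name
    my_dump.txt are guesses -- verify them against the repr() probe
    above first):

    data = open("my_dump.txt", "rb").read()
    data = data.replace("\r\n", "\n")  # normalize Windows line endings
    rows = data.split("),\n(")         # one step, no \t detour needed

    # or, if the separator varies:
    import re
    rows = re.split(r"\)\s*,\s*\(", data)  # pattern first, then the data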

    Try this on a small subset of your dump file. Before you go anywhere
    near the database, check the results of your split: did you get the
    expected number of rows [currently you are getting 1 row], does each
    row look OK? Unsubtle hint: the first row may start with "junk(" and
    the last may end with ")junk" if you haven't already allowed for that.

    Then try inserting into the database. Print a row number and row
    contents before each insert (or wrap try/except around each insert) so
    that you know which row any error relates to.
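
    A minimal sketch of that loop (the table name, placeholder count and
    the naive field split are placeholders to adapt):

    import sqlite3
    con = sqlite3.connect("target.db")
    cur = con.cursor()
    for rownum, row in enumerate(rows):
        fields = row.split(",")  # substitute your real field splitter
        try:
            cur.execute("INSERT INTO foo VALUES (?, ?, ?, ?)", fields)
        except Exception, e:
            print "row %d failed: %r (%s)" % (rownum, row[:60], e)
    con.commit()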

    If you still have a problem, ask again, but do show exactly what your
    dump file looks like and what your code is -- my crystal ball needs a
    long recharge after all of the above!

    HTH,
    John
    John Machin, Nov 23, 2007
    #2

  3. Dennis Lee Bieber

    Dennis Lee Bieber Guest

    On Fri, 23 Nov 2007 10:46:12 -0800 (PST), TYR <>
    declaimed the following in comp.lang.python:

    > I have a large dump file that originated in a MySQL db; I need to get
    > it into an SQLite file.
    >
    > Various options are suggested around the web; none of them seem to
    > work (most failing to import the thing in the first place). So I
    > removed the assorted taggery from each end, leaving just a big text
    > file taking the following format:
    >

    I'd think the simplest would be to tweak the contents of the dump by
    removing the MySQL-specific terms, or replacing some with SQLite syntax,
    then batching the whole file into the command-line utility.
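
    From Python, that batching step is just a few lines -- a minimal
    sketch, assuming the cleaned-up dump is pure SQLite-compatible SQL
    (the file names are hypothetical):

    -=-=-=-=-=-=-=-
    import sqlite3

    con = sqlite3.connect("converted.db")
    con.executescript(open("cleaned_dump.sql").read())  # run the batch
    con.commit()
    con.close()
    -=-=-=-=-=-=-=-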

    I presume your file is similar to:

    -=-=-=-=-=-=-=-
    -- MySQL Administrator dump 1.4
    --
    -- ------------------------------------------------------
    -- Server version 5.0.24a-community-max-nt-log


    /*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
    /*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
    /*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
    /*!40101 SET NAMES utf8 */;

    /*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
    /*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS,
    FOREIGN_KEY_CHECKS=0 */;
    /*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO'
    */;


    --
    -- Create schema bestiaria
    --

    CREATE DATABASE /*!32312 IF NOT EXISTS*/ bestiaria;
    USE bestiaria;

    --
    -- Table structure for table `bestiaria`.`artlinks`
    --

    DROP TABLE IF EXISTS `artlinks`;
    CREATE TABLE `artlinks` (
    `ID` int(11) NOT NULL auto_increment,
    `category` enum('archive','artist','zine','reference','news') NOT NULL
    default 'archive',
    `name` varchar(75) NOT NULL default '',
    `URL` varchar(75) NOT NULL default '',
    `banner` varchar(75) default NULL,
    `width` int(11) default NULL,
    `height` int(11) default NULL,
    `description` varchar(250) default NULL,
    PRIMARY KEY (`ID`)
    ) ENGINE=MyISAM AUTO_INCREMENT=49 DEFAULT CHARSET=latin1;

    --
    -- Dumping data for table `bestiaria`.`artlinks`
    --

    /*!40000 ALTER TABLE `artlinks` DISABLE KEYS */;
    INSERT INTO `artlinks`
    (`ID`,`category`,`name`,`URL`,`banner`,`width`,`height`,`description`)
    VALUES

    (1,'archive','Epilogue','http://www.epilogue.net/',NULL,NULL,NULL,NULL),

    (2,'archive','FurNation','http://www.furnation.com/','images/furnation.gif',404,44,'&quot;The
    Artists\' Image Archive&quot;'),

    (3,'archive','Elfwood','http://www.elfwood.com/',NULL,NULL,NULL,'Reorganized,
    subsuming the former <i>Lothlorien</i> and <i>Zone47</i> subsites'),
    (5,'archive','The Orlando Furry
    Archives','http://fanarchive.net/',NULL,NULL,NULL,NULL),
    (6,'archive','Side 7','http://www.side7.com/',NULL,NULL,NULL,'&quot;the
    Artist\'s Archive&quot;'),

    (7,'archive','VCL','http://us.vclart.net/vcl/Artists/',NULL,NULL,NULL,'Furry
    Image Archives'),
    (37,'archive','Deviant
    Art','http://www.deviantart.com/',NULL,NULL,NULL,'An online artist
    collective and marketplace. Various genres.'),
    (8,'archive','YERF','http://www.yerf.com/',NULL,NULL,NULL,'Formerly
    known as <i>The Squeeky Clean Furry Archives</I>'),
    (9,'zine','Herd
    Mentality','http://www.furry.org.au/aatheus/herdmentality/',NULL,NULL,NULL,'&quot;The
    Centaur Fanzine!&quot; (although the definition of &quot;centaur&quot;
    is rather loose)'),

    (10,'zine','Huzzah','http://www.huzzah.com/',NULL,NULL,NULL,'&quot;Anthropomorphics
    for Artists&quot;');
    INSERT INTO `artlinks`
    (`ID`,`category`,`name`,`URL`,`banner`,`width`,`height`,`description`)
    VALUES
    ....
    -=-=-=-=-=-=-=-

    > ('value', 'value', 'value', 'value'),
    > ('value','value','value','value')...
    >
    > I planned to find some way of splitting the thing at the commas
    > outside the bracketed groups, thus giving me a list of tuples; then I
    > could of course CREATE TABLE foo VALUES '1', '2', '3', '4' and then


    That sure doesn't look like any "create table" syntax I've ever
    seen...

    > iterate through the list INSERTing INTO.
    >
    > Then my problems began; I tried using the Python csv module, replacing
    > the string ),( with \t and then using \t as the delimiter. First


    Don't know about your file, but mine does NOT have
    """
    ....),(...
    """
    instead I have
    """
    ....),
    (...
    """

    or in one line, encoded: "...),\n(..."

    I wouldn't expect the CSV module to have much luck with most of the
    dump -- one would have to remove all the comments, CREATE stuff, and the
    INSERT lines, leaving just the "(...)" rows. THEN remove any "(" that
    starts a line, and any ")," that ends a line, while avoiding any
    embedded... Then, maybe, "unescape" any MySQL escapes in text columns
    before using the SQLite parameter substitution to escape them for /it/.
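
    Roughly, for the row-extraction part -- a sketch only, tuned to the
    MySQL Administrator layout above (dump.sql is a hypothetical name,
    and "()," embedded inside string values is deliberately ignored):

    -=-=-=-=-=-=-=-
    values_rows = []
    in_create = False
    for ln in open("dump.sql"):
        ln = ln.strip()
        if in_create:                   # skip the body of a CREATE block
            in_create = not ln.endswith(";")
            continue
        if not ln or ln.startswith("--") or ln.startswith("/*"):
            continue                    # comments and blank lines
        first = ln.split()[0].upper().rstrip(";")
        if first == "CREATE":
            in_create = not ln.endswith(";")
            continue
        if first in ("DROP", "USE", "INSERT", "VALUES", "LOCK", "UNLOCK"):
            continue                    # statement headers, not data
        if ln.startswith("(`"):
            continue                    # an INSERT's column list, not data
        if ln.startswith("("):          # data row: strip "(" and ")," / ");"
            values_rows.append(ln[1:-2])
    -=-=-=-=-=-=-=-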

    > problem; there's a size limit coded into the module. No problem, use
    > csv.field_size_limit() to alter it. Problem; it doesn't actually parse
    > at all, just sends the whole thing as a string and the SQL INSERT
    > fails with a "not enough args" error.
    >
    > Tried using string.split() and re.split(data, r'\t'); first gave the
    > same error, second failed with a "too many named groups" error. Tried
    > using ; as a delimiter and going back to csv; this fails to match
    > the ; for some reason. Any ideas?


    Write an SQL parser of sorts (I have no experience with the Python
    parsing stuff, so would brute-force one...)... Something like:

    -=-=-=-=-=-=-=-
    """
    Brute Force template for MySQL Dump Parser to SQLite
    """

    import sys

    fout = sys.stdout #for info only at this point

    DUMP_FILE = "test.sql"

    def parse_insert(fin, ln):
        insert = ln.replace("`", "'")
        for ln in fin:  # another nesting: consume the VALUES rows
            ln = ln.strip()
            # emit one complete INSERT statement per data row
            fout.write("%s\n\t%s;\n\n" % (insert, ln[:-1]))
            if ln.endswith(";"):
                break


    def parse_table(fin):
        uniq = []
        fields = []
        pkey = ""
        for ln in fin:  # yes, nested read functionality
            ln = ln.strip()
            if ln.startswith(")") and ln.endswith(";"):
                break
            if ln.startswith("PRIMARY KEY"):
                pkey = ("\tprimary key %s" %
                        ln[ln.index("("):ln.index(")") + 1].replace("`", "'"))
                continue
            if ln.startswith("UNIQUE"):
                uniq.append("\tUNIQUE %s" %
                            ln[ln.index("("):ln.index(")") + 1].replace("`", "'"))
                continue
            flds = ln.split()
            flds[-1] = flds[-1].rstrip(",")
            field_name = flds[0].replace("`", "'")
            ftype = flds[1].lower()
            # map MySQL column types onto SQLite's affinities
            if ("int" in ftype
                    or "bit" in ftype
                    or "bool" in ftype):
                field_type = "integer"
            elif ("dec" in ftype
                  or "float" in ftype
                  or "double" in ftype
                  or "real" in ftype):
                field_type = "real"
            elif ("text" in ftype
                  or "char" in ftype
                  or "enum" in ftype
                  or "set" in ftype):
                field_type = "text"
            elif "blob" in ftype:
                field_type = "blob"
            elif ("date" in ftype
                  or "time" in ftype
                  or "year" in ftype):
                field_type = "text"
            else:
                fout.write("Ignoring MySQL field type: %s\n" % ftype)
                field_type = " "
            clause = ""
            if "auto_increment" in ln:
                clause += " primary key autoincrement "
            if "UNIQUE" in ln:
                clause += " unique "
            if "NOT NULL" in ln:
                clause += " not null "
            if "default" in ln:
                clause += ln[ln.index("default"):]
                if clause.endswith(","):
                    clause = clause[:-1]
            fields.append("\t%s %s %s" % (field_name, field_type, clause))
        if pkey:
            fields.append(pkey)
        if uniq:
            fields.extend(uniq)
        fout.write("%s\n" % ",\n".join(fields))
        fout.write(");\n")


    def parse_create(fin, ln):
        if ln.startswith("CREATE DATABASE "):
            while not ln.endswith(";"):
                # use the iterator, not readline(), since the caller
                # is already iterating over fin
                ln = fin.next().strip()
            dbname = ln.split()[-1][:-1]
            fout.write("SQLite: con = .connect('%s.db')\n" % dbname)
        elif ln.startswith("CREATE TABLE "):
            table_name = ln.split()[2].replace("`", "'")
            fout.write("SQLite: create table %s\n(\n" % table_name)
            parse_table(fin)
        else:
            fout.write("Ignoring MySQL statement: %s\n" % ln)


    def parse_file(fin):
        for ln in fin:
            ln = ln.strip()
            if (not ln
                    or ln.startswith("--")
                    or ln.startswith("/*")):
                continue  # ignore comments and blank lines
            elif ln.startswith("CREATE "):
                parse_create(fin, ln)
            elif ln.startswith("INSERT "):
                parse_insert(fin, ln)
            else:
                fout.write("Ignoring MySQL statement: %s\n" % ln)


    if __name__ == "__main__":
        fin = open(DUMP_FILE, "r")
        parse_file(fin)
        fin.close()
    #    fout.close()
    -=-=-=-=-=-=-=-

    This produces something like the output shown below when fed the
    start of my dump file. I just created diagnostic output, but it
    could either be formatted as an SQLite command-line batch file, or
    converted into DB-API compatible calls to SQLite from within the
    program (as implied by that connect line).
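
    In DB-API form that might look like this -- a sketch only, using
    the artlinks example above trimmed to essentials:

    -=-=-=-=-=-=-=-
    import sqlite3

    con = sqlite3.connect("bestiaria.db")
    cur = con.cursor()
    cur.execute("""create table artlinks (
        ID integer primary key autoincrement not null,
        category text not null default 'archive',
        name text not null default '',
        URL text not null default '',
        banner text, width integer, height integer, description text)""")
    cur.execute("insert into artlinks values (?,?,?,?,?,?,?,?)",
                (1, 'archive', 'Epilogue', 'http://www.epilogue.net/',
                 None, None, None, None))
    con.commit()
    -=-=-=-=-=-=-=-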

    I did not attempt any tests for escaped fields in the inserts...
    That section of code should really be expanded to strip the "(...)"
    wrapper, do a CSV-style split, unescape each field, re-escape as
    needed, and rejoin...
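
    Something along these lines -- a sketch, untested against real
    MySQL escapes, leaning on the csv module's quotechar/escapechar
    support:

    -=-=-=-=-=-=-=-
    import csv

    def split_values_row(row):
        # row is one data row with the outer "(" ")" already stripped,
        # e.g.  1,'archive','Epilogue','http://www.epilogue.net/',NULL
        rdr = csv.reader([row], quotechar="'", escapechar="\\",
                         doublequote=False, skipinitialspace=True)
        fields = rdr.next()
        # map bare NULL to None for parameter substitution (note: this
        # can't distinguish NULL from the string 'NULL' after parsing)
        return [None if f == "NULL" else f for f in fields]
    -=-=-=-=-=-=-=-

    and then hand each list straight to cursor.execute() with "?"
    placeholders, letting SQLite do the re-escaping for /it/.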


    -=-=-=-=-=-=-=-
    SQLite: con = .connect('bestiaria.db')
    Ignoring MySQL statement: USE bestiaria;
    Ignoring MySQL statement: DROP TABLE IF EXISTS `artlinks`;
    SQLite: create table 'artlinks'
    (
    'ID' integer primary key autoincrement not null ,
    'category' text not null default 'archive',
    'name' text not null default '',
    'URL' text not null default '',
    'banner' text default NULL,
    'width' integer default NULL,
    'height' integer default NULL,
    'description' text default NULL,
    primary key ('ID')
    );
    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES

    (1,'archive','Epilogue','http://www.epilogue.net/',NULL,NULL,NULL,NULL);

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES

    (2,'archive','FurNation','http://www.furnation.com/','images/furnation.gif',404,44,'&quot;The
    Artists\' Image Archive&quot;');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES

    (3,'archive','Elfwood','http://www.elfwood.com/',NULL,NULL,NULL,'Reorganized,
    subsuming the former <i>Lothlorien</i> and <i>Zone47</i> subsites');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (5,'archive','The Orlando Furry
    Archives','http://fanarchive.net/',NULL,NULL,NULL,NULL);

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (6,'archive','Side
    7','http://www.side7.com/',NULL,NULL,NULL,'&quot;the Artist\'s
    Archive&quot;');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES

    (7,'archive','VCL','http://us.vclart.net/vcl/Artists/',NULL,NULL,NULL,'Furry
    Image Archives');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (37,'archive','Deviant
    Art','http://www.deviantart.com/',NULL,NULL,NULL,'An online artist
    collective and marketplace. Various genres.');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (8,'archive','YERF','http://www.yerf.com/',NULL,NULL,NULL,'Formerly
    known as <i>The Squeeky Clean Furry Archives</I>');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (9,'zine','Herd
    Mentality','http://www.furry.org.au/aatheus/herdmentality/',NULL,NULL,NULL,'&quot;The
    Centaur Fanzine!&quot; (although the definition of &quot;centaur&quot;
    is rather loose)');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES

    (10,'zine','Huzzah','http://www.huzzah.com/',NULL,NULL,NULL,'&quot;Anthropomorphics
    for Artists&quot;');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (11,'zine','Sofawolf
    Press','http://www.sofawolf.com/','images/swplogo.gif',158,62,'Sofawolf
    Press showcases quality furry fiction in book and magazine format.');

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (12,'artist','Theresa Larsson: Aspects of
    Art','http://www.theresalarsson.se/','images/aspects_of_art.jpg',468,60,NULL);

    INSERT INTO 'artlinks'
    ('ID','category','name','URL','banner','width','height','description')
    VALUES
    (13,'artist','Megan
    Giles','http://www.furnation.com/Megan_giles/','images/mgiles.jpg',468,80,NULL);
    -=-=-=-=-=-=-=-
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
    Dennis Lee Bieber, Nov 23, 2007
    #3
