Transforming ascii file (pseudo database) into proper database

Discussion in 'Python' started by p., Jan 21, 2008.

  1. p.

    p. Guest

    I need to take a series of ascii files and transform the data
    contained therein so that it can be inserted into an existing
    database. The ascii files are just a series of lines, each line
    containing fields separated by '|' character. Relations amongst the
    data in the various files are denoted through an integer identifier, a
    pseudo key if you will. Unfortunately, the relations in the ascii file
    do not match up with those in the database in which I need to insert
    the data, i.e., I need to transform the data from the files before
    inserting into the database. Now, this would all be relatively simple
    if not for the following fact: The ascii files are each around 800MB,
    so pulling everything into memory and matching up the relations before
    inserting the data into the database is impossible.

    My questions are:
    1. Has anyone done anything like this before, and if so, do you have
    any advice?
    2. In the abstract, can anyone think of a way of amassing all the
    related data for a specific identifier from all the individual files
    without pulling all of the files into memory and without having to
    repeatedly open, search, and close the files over and over again?
     
    p., Jan 21, 2008
    #1

  2. Tim Chase

    Tim Chase Guest

    > I need to take a series of ascii files and transform the data
    > contained therein so that it can be inserted into an existing
    > database.

    [snip]
    > I need to transform the data from the files before inserting
    > into the database. Now, this would all be relatively simple if
    > not for the following fact: The ascii files are each around
    > 800MB,

    [snip]
    > My questions are:
    > 1. Has anyone done anything like this before, and if so, do
    > you have any advice?


    Yes, I regularly do ETL on files from cellular providers to
    transform hundreds of megs worth (some approach a gig) of data
    into our internal system.

    > 2. In the abstract, can anyone think of a way of amassing all
    > the related data for a specific identifier from all the
    > individual files without pulling all of the files into memory
    > and without having to repeatedly open, search, and close the
    > files over and over again?


    If the file is sorted by something you can use, you can iterate
    over it and just deal with one grouping at a time. In my case,
    iterating over gobs of call-detail, the file happens to be sorted
    by the phone-number on the account. So I iterate over the file
    maintaining a list of calls for the given phonenumber, and when
    the phonenumber changes, I deal with the previous cache of data,
    then re-initialize with the new phone's data.

    Other ideas:

    1) create a temp DB (such as sqlite), skim through the file
    inserting all your data into a table in this DB, then use DB
    functionality on it

    2) in a light-weight way, assuming there's lots of data per row,
    and that you have multiple rows associated with a given ID (in my
    case, such as a phonenumber), you can create a dictionary of an
    ID to a list of file-offsets in which that ID is used. You can
    then skim through the file once gathering all the offsets with
    calls to tell() and then when you want to process an item, you
    can seek to that particular offset and read in the line. Not
    greatly efficient, but hackable.

    But mostly, it helps if you have a sorted field that's useful to
    you :)

    -tkc
     
    Tim Chase, Jan 21, 2008
    #2

  3. John Machin

    John Machin Guest

    On Jan 22, 7:51 am, "p." <> wrote:
    > I need to take a series of ascii files and transform the data
    > contained therein so that it can be inserted into an existing
    > database. The ascii files are just a series of lines, each line
    > containing fields separated by '|' character. Relations amongst the
    > data in the various files are denoted through an integer identifier, a
    > pseudo key if you will. Unfortunately, the relations in the ascii file
    > do not match up with those in the database in which i need to insert
    > the data, i.e., I need to transform the data from the files before
    > inserting into the database. Now, this would all be relatively simple


    I'm glad you think so ... we don't have a clear understanding of the
    problem.

    > if not for the following fact: The ascii files are each around 800MB,
    > so pulling everything into memory and matching up the relations before
    > inserting the data into the database is impossible.
    >
    > My questions are:
    > 1. Has anyone done anything like this before, and if so, do you have
    > any advice?
    > 2. In the abstract, can anyone think of a way of amassing all the
    > related data for a specific identifier from all the individual files
    > without pulling all of the files into memory and without having to
    > repeatedly open, search, and close the files over and over again?


    It would help enormously if you clarified whether the "identifier"
    identifies entities or relationships between entities. Is there a
    different series of "identifiers" for each (say) type of relationship?

    1. Sort each of the N ascii files into "identifier" order. Do an N-way
    merge to get all the data for each "identifier" in the same place at
    the same time, rearrange it, and insert it into the final database.

    OR

    2. Load each of the N ascii files into a table in a project-temporary
    database. Create a suitably-structured index on the "identifier"
    column in each table. Do an N-way merge using SQL inner/outer
    joins ...


    How big is N anyway? How many unique "identifiers"?
     
    John Machin, Jan 21, 2008
    #3
  4. Paul Rubin

    Paul Rubin Guest

    "p." <> writes:
    > 1. Has anyone done anything like this before, and if so, do you have
    > any advice?


    Sort all the files with an external sort utility (e.g. unix sort), so
    that records with the same key are all brought together. Then you can
    process the files sequentially.
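    As a small illustration with a made-up '|'-delimited file, sorting on the second field (-t sets the field separator, -k selects the key field):

```shell
# Create a tiny sample where the ID is the 2nd '|'-delimited field.
printf 'a|3|x\nb|1|y\nc|1|z\n' > sample.txt
# Sort on field 2 only; records sharing an ID end up adjacent.
sort -t'|' -k2,2 sample.txt > sample.sorted
cat sample.sorted
```

    sort handles files larger than memory by spilling to temporary run files on disk, so file size is not a concern.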
     
    Paul Rubin, Jan 21, 2008
    #4
  5. p. wrote:
    > I need to take a series of ascii files and transform the data
    > contained therein so that it can be inserted into an existing
    > database. The ascii files are just a series of lines, each line
    > containing fields separated by '|' character. Relations amongst the
    > data in the various files are denoted through an integer identifier, a
    > pseudo key if you will. Unfortunately, the relations in the ascii file
    > do not match up with those in the database in which i need to insert
    > the data, i.e., I need to transform the data from the files before
    > inserting into the database. Now, this would all be relatively simple
    > if not for the following fact: The ascii files are each around 800MB,
    > so pulling everything into memory and matching up the relations before
    > inserting the data into the database is impossible.
    >
    > My questions are:
    > 1. Has anyone done anything like this before,


    More than once, yes.

    > and if so, do you have
    > any advice?


    1/ use the csv module to parse your text files

    2/ use a temporary database (whose schema will mimic the one in the flat
    files), so you can work with the appropriate tools - ie: the RDBMS will
    take care of disk/memory management, and you'll have a specialized,
    high-level language (namely, SQL) to reassemble your data the right way.


    > 2. In the abstract, can anyone think of a way of amassing all the
    > related data for a specific identifier from all the individual files
    > without pulling all of the files into memory and without having to
    > repeatedly open, search, and close the files over and over again?


    Answer above.
     
    Bruno Desthuilliers, Jan 21, 2008
    #5
  6. On Jan 21, 4:45 pm, Paul Rubin <http://> wrote:
    > "p." <> writes:
    > > 1. Has anyone done anything like this before, and if so, do you have
    > > any advice?

    >
    > Sort all the files with an external sort utility (e.g. unix sort), so
    > that records with the same key are all brought together. Then you can
    > process the files sequentially.


    Seconded. Unix sort can do external sorting [1] so your program will
    work even if the files don't fit in memory. Once they are sorted,
    itertools (especially groupby) is your friend.

    George


    [1] http://en.wikipedia.org/wiki/External_sort
     
    George Sakkis, Jan 21, 2008
    #6
  7. p.

    p. Guest

    So in answer to some of the questions:
    - There are about 15 files, each roughly representing a table.
    - Within the files, each line represents a record.
    - The formatting for the lines is like so:

    File1:
    someval1|ID|someval2|someval3|etc.

    File2:
    ID|someval1|someval2|someval3|etc.

    Where ID is the one and only value linking "records" from one file to
    "records" in another file - moreover, as far as I can tell, the
    relationships are all 1:1 (or 1:0) (I don't have the full dataset yet,
    just a sampling, so I'm flying a bit in the dark).
    - I believe that individual "records" within each of the files are
    unique with respect to the identifier (again, not certain because I'm
    only working with sample data).
    - As the example shows, the position of the ID is not the same for all
    files.
    - I don't know how big N is since I only have a sample to work with,
    and probably won't get the full dataset anytime soon. (Let's just take
    it as a given that I won't get that information until AFTER a first
    implementation...politics.)
    - I don't know how many identifiers either, although it has to be at
    least as large as the number of lines in the largest file (again, I
    don't have the actual data yet).

    So as an exercise, let's assume an 800MB file, each line of data taking
    up roughly 150B (guesstimate - based on examination of sample data)...
    so roughly 5.3 million unique IDs.

    With that size, I'll have to load them into temp db. I just can't see
    holding that much data in memory...
     
    p., Jan 21, 2008
    #7
  8. Paul Rubin

    Paul Rubin Guest

    "p." <> writes:
    > So as an exercise, lets assume 800MB file, each line of data taking up
    > roughly 150B (guesstimate - based on examination of sample data)...so
    > roughly 5.3 million unique IDs.


    I still don't understand what the problem is. Are you familiar with
    the concept of external sorting? What OS are you using? If you're
    using a Un*x-like system, the built-in sort command should do what you
    need. "Internal" sorting means reading a file into memory and sorting
    it in memory with something like the .sort() function. External
    sorting is what you do when the file won't fit in memory. Basically
    you read sequential chunks of the file where each chunk fits in
    memory, sort each chunk internally and write it to a temporary disk
    file, then merge all the disk files. You can sort inputs of basically
    unlimited size this way. The unix sort command knows how to do this.
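    That chunk-sort-then-merge procedure might be sketched like so (a toy version; chunk_lines and the run-file handling are illustrative, and it assumes every input line is newline-terminated):

```python
import heapq
import os
import tempfile

def external_sort(path, out_path, chunk_lines=100_000):
    """Sort a file larger than memory: sort fixed-size chunks into
    temporary run files, then merge the runs."""
    runs = []
    with open(path) as f:
        while True:
            # Read up to chunk_lines lines; this is the in-memory chunk.
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort()  # "internal" sort of one chunk
            tmp = tempfile.NamedTemporaryFile('w', delete=False, suffix='.run')
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)
    # Merge all run files; heapq.merge streams, so memory use stays small.
    files = [open(r) for r in runs]
    with open(out_path, 'w') as out:
        out.writelines(heapq.merge(*files))
    for fh in files:
        fh.close()
    for r in runs:
        os.remove(r)
```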

    It's often a good exercise with this type of problem, to ask yourself
    how an old-time mainframe programmer would have done it. A "big"
    computer of the 1960's might have had 128 kbytes of memory and a few
    MB of disk, but a bunch of magtape drives that held a few dozen MB
    each. With computers like that, they managed to process the phone
    bills for millions of people. The methods that they used are still
    relevant with today's much bigger and faster computers.

    If you watch old movies that tried to get a high tech look by showing
    computer machine rooms with pulsating tape drives, external sorting is
    what those computers spent most of their time doing.

    Finally, 800MB isn't all that big a file by today's standards. Memory
    for desktop computers costs around 25 dollars per gigabyte so having
    8GB of ram on your desk to crunch those 800MB files with is not at all
    unreasonable.
     
    Paul Rubin, Jan 21, 2008
    #8
  9. p.

    p. Guest

    Thanks to all for the ideas. I am familiar with external sorting.
    Hadn't considered it though. Will definitely be giving that a go, and
    then merging. Again, thanks all.
     
    p., Jan 22, 2008
    #9
  10. On Mon, 21 Jan 2008 12:51:59 -0800 (PST), "p." <>
    declaimed the following in comp.lang.python:

    > I need to take a series of ascii files and transform the data
    > contained therein so that it can be inserted into an existing
    > database. The ascii files are just a series of lines, each line
    > containing fields separated by '|' character. Relations amongst the
    > data in the various files are denoted through an integer identifier, a
    > pseudo key if you will. Unfortunately, the relations in the ascii file
    > do not match up with those in the database in which i need to insert
    > the data, i.e., I need to transform the data from the files before
    > inserting into the database. Now, this would all be relatively simple
    > if not for the following fact: The ascii files are each around 800MB,
    > so pulling everything into memory and matching up the relations before
    > inserting the data into the database is impossible.
    >

    Unclear: Do you mean you currently have multiple files and need to
    combine records from each to create a single record for insertion into a
    single table? Or do you perhaps mean the ID field cannot be used as-is,
    due to conflicts with data already existing in the database?

    For the former, a classical sort-merge algorithm would apply {let me
    guess, in this world of Java and Web applications, colleges don't teach
    how to handle sequential processing of a transaction file against a
    master file}

    For the latter, I'd probably do a prepass building a structure that
    just contained record IDs to be updated during regular processing.
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Jan 22, 2008
    #10
  11. In article <>,
    p. <> wrote:
    >I need to take a series of ascii files and transform the data
    >contained therein so that it can be inserted into an existing
    >database. The ascii files are just a series of lines, each line
    >containing fields separated by '|' character. Relations amongst the
    >data in the various files are denoted through an integer identifier, a
    >pseudo key if you will. Unfortunately, the relations in the ascii file
    >do not match up with those in the database in which i need to insert
    >the data, i.e., I need to transform the data from the files before
    >inserting into the database. Now, this would all be relatively simple
    >if not for the following fact: The ascii files are each around 800MB,
    >so pulling everything into memory and matching up the relations before
    >inserting the data into the database is impossible.


    In this case good old-fashioned batch processing (line by line)
    may be appropriate.
    Read up on tools like sort and join.

    These tools are present on all Unix-like systems, and on windows
    in open-source toolkits.

    >
    >My questions are:
    >1. Has anyone done anything like this before, and if so, do you have
    >any advice?


    Puzzling question. Computers weren't invented for GUIs. They were
    invented for precisely this kind of thing. So, yes, it is a sure bet.

    >2. In the abstract, can anyone think of a way of amassing all the
    >related data for a specific identifier from all the individual files
    >without pulling all of the files into memory and without having to
    >repeatedly open, search, and close the files over and over again?


    As long as you don't use Excel, that is; it is not up to it ;-)

    Greetings, Albert

    --
    Albert van der Horst, UTRECHT,THE NETHERLANDS
    Economic growth -- like all pyramid schemes -- ultimately falters.
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
     
    Albert van der Horst, Jan 28, 2008
    #11
