Fastest way to store ints and floats on disk

Discussion in 'Python' started by Laszlo Nagy, Aug 7, 2008.

  1. Laszlo Nagy

    Laszlo Nagy Guest

    Hi,

    I'm working on a pivot table, and I would like to write it in Python.
    I know I should be doing that in C, but I would like to create a
    cross-platform version that can deal with smaller databases (not more
    than a million facts).

    The data is first imported from a csv file: the user selects which
    columns contain dimension and measure data (and which columns to
    ignore). In the next step I would like to build up a database that is
    efficient enough to be used for making pivot tables. Here is my idea for
    the database:

    Original CSV file with column header and values:

    "Color","Year","Make","Price","VMax"
    Yellow,2000,Ferrari,100000,254
    Blue,2003,Volvo,50000,210

    Using the GUI, it is converted to this:

    dimensions = [
        { 'name':'Color', 'colindex':0, 'values':[ 'Red', 'Blue', 'Green',
          'Yellow' ], },
        { 'name':'Year', 'colindex':1, 'values':[
          1995,1999,2000,2001,2002,2003,2007 ], },
        { 'name':'Make', 'colindex':2, 'values':[ 'Ferrari', 'Volvo', 'Ford',
          'Lamborghini' ], },
    ]
    measures = [
        { 'name':'Price', 'colindex':3 },
        { 'name':'VMax', 'colindex':4 },
    ]
    facts = [
        # ( dimension_value_indexes, measure_values )
        ( (3,2,0),(100000.0,254.0) ),
        ( (1,5,1),(50000.0,210.0) ),
        # .... some million rows or less
    ]
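
    A minimal sketch of that conversion step, assuming the user's column
    choices arrive as plain lists of column indexes. The `encode_csv` helper
    and its parameters are hypothetical names used for illustration only. In
    this sketch, indexes are assigned in order of first appearance, so they
    differ from the hand-written example above, and dimension values are kept
    as strings.

```python
import csv
import io

# Sample input taken from the post; a real import would read a file instead.
CSV_TEXT = '''"Color","Year","Make","Price","VMax"
Yellow,2000,Ferrari,100000,254
Blue,2003,Volvo,50000,210
'''

def encode_csv(stream, dim_cols, measure_cols):
    """Return (dimensions, facts) in the layout sketched in the post.

    Hypothetical helper: dim_cols and measure_cols are lists of column indexes.
    """
    reader = csv.reader(stream)
    header = next(reader)
    dimensions = [{'name': header[c], 'colindex': c, 'values': []}
                  for c in dim_cols]
    lookups = [{} for _ in dim_cols]  # value -> index, one dict per dimension
    facts = []
    for row in reader:
        indexes = []
        for dim, lookup in zip(dimensions, lookups):
            value = row[dim['colindex']]
            if value not in lookup:
                # First time this value is seen: give it the next free index.
                lookup[value] = len(dim['values'])
                dim['values'].append(value)
            indexes.append(lookup[value])
        measures = tuple(float(row[c]) for c in measure_cols)
        facts.append((tuple(indexes), measures))
    return dimensions, facts

dimensions, facts = encode_csv(io.StringIO(CSV_TEXT),
                               dim_cols=[0, 1, 2], measure_cols=[3, 4])
```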


    The core of the idea is that, when using a relatively small number of
    possible values for each dimension, the facts table becomes
    significantly smaller and easier to process. (Processing the facts would
    be: iterate over facts, filter out some of them, create statistical
    values of the measures, grouped by dimensions.)
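
    The processing loop described here could look roughly like the sketch
    below. The `pivot` function, its `group_by` and `keep` parameters, and
    the third sample row are made up for illustration; the statistics are
    limited to a count and per-measure sums.

```python
# Facts in the ( dimension_value_indexes, measure_values ) layout.
facts = [
    ((3, 2, 0), (100000.0, 254.0)),
    ((1, 5, 1), (50000.0, 210.0)),
    ((1, 2, 1), (30000.0, 190.0)),  # made-up extra row for the example
]

def pivot(facts, group_by, keep=None):
    """Group facts by the given dimension positions.

    Returns {group_key: [count, [sum_of_each_measure, ...]]}.
    keep is an optional predicate on the dimension index tuple.
    """
    result = {}
    for dims, measures in facts:
        if keep is not None and not keep(dims):
            continue  # filtered out
        key = tuple(dims[i] for i in group_by)
        entry = result.get(key)
        if entry is None:
            result[key] = [1, list(measures)]
        else:
            entry[0] += 1
            for j, value in enumerate(measures):
                entry[1][j] += value
    return result

# Group by the first dimension (Color).
by_color = pivot(facts, group_by=[0])

# Same grouping, restricted to facts whose second dimension index is 2.
by_color_filtered = pivot(facts, group_by=[0], keep=lambda dims: dims[1] == 2)
```

    Because `facts` is only iterated once, the same loop works unchanged if
    it is replaced by a generator that reads records from disk.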

    The facts table cannot be kept in memory because it is too big. I need
    to store it on disk, read it incrementally, and compute statistics.
    In most cases, the "statistic" will be a simple sum of the measures and
    a count of the facts affected. To be efficient, reading the facts from
    disk should not involve complex conversions. For this reason, storing
    them in CSV, XML, or any other textual format would be bad. I'm thinking
    about a binary format, but how can I interface that with Python?

    I already looked at:

    - xdrlib, which raises a DeprecationWarning when I store some integers
    - struct, which takes a format string for each read operation; I'm
    concerned about its speed
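
    On the struct concern: the module also offers a precompiled
    `struct.Struct` object, so the format string is parsed once rather than
    on every read. The sketch below assumes a fixed record layout of three
    unsigned 16-bit dimension indexes followed by two doubles; that layout
    and the `write_facts` / `read_facts` helpers are assumptions for
    illustration. Whether this is fast enough would need measuring on real
    data.

```python
import os
import struct
import tempfile

# Assumed layout: little-endian, no padding, 3 x uint16 + 2 x float64
# = 3*2 + 2*8 = 22 bytes per fact.
RECORD = struct.Struct('<3H2d')

def write_facts(path, facts):
    """Write (dims, measures) tuples as fixed-size binary records."""
    with open(path, 'wb') as f:
        for dims, measures in facts:
            f.write(RECORD.pack(*dims, *measures))

def read_facts(path, records_per_chunk=4096):
    """Yield (dims, measures) tuples, reading the file chunk by chunk."""
    chunk_size = RECORD.size * records_per_chunk
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # iter_unpack decodes every record in the chunk in one call.
            for rec in RECORD.iter_unpack(chunk):
                yield rec[:3], rec[3:]

facts = [((3, 2, 0), (100000.0, 254.0)), ((1, 5, 1), (50000.0, 210.0))]
path = os.path.join(tempfile.mkdtemp(), 'facts.bin')
write_facts(path, facts)
loaded = list(read_facts(path))
```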

    What else can I use?

    Thanks,

    Laszlo
    Laszlo Nagy, Aug 7, 2008
    #1

  2. castironpi

    castironpi Guest


    Take a look at the mmap module. You get direct memory access, backed
    by the file system. struct + mmap, if you keep your strings small?
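
    A rough sketch of the struct + mmap combination, assuming a file of
    fixed-size records with three unsigned 16-bit dimension indexes and two
    doubles each. The record layout and the `sum_measure` helper are
    assumptions for illustration, not something specified in the thread.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct('<3H2d')  # assumed layout: 3 x uint16 + 2 x float64

# Build a small sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'facts.bin')
with open(path, 'wb') as f:
    f.write(RECORD.pack(3, 2, 0, 100000.0, 254.0))
    f.write(RECORD.pack(1, 5, 1, 50000.0, 210.0))

def sum_measure(path, measure_index):
    """Sum one measure over all records and count them, via a read-only mmap."""
    total, count = 0.0, 0
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; this fails on an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), RECORD.size):
                # unpack_from reads straight out of the mapping, no slicing.
                rec = RECORD.unpack_from(mm, offset)
                total += rec[3 + measure_index]
                count += 1
    return total, count

total_price, n = sum_measure(path, 0)
```

    The operating system pages the file in on demand, so the whole facts
    table never has to be loaded at once.
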
    castironpi, Aug 7, 2008
    #2

  3. Matthew Woodcraft

    Laszlo Nagy writes:

    > The facts table cannot be kept in memory because it is too big. I need to
    > store it on disk, be able to read incrementally, and make statistics. In most
    > cases, the "statistic" will be simple sum of the measures, and counting the
    > number of facts affected. To be effective, reading the facts from disk should
    > not involve complex conversions. For this reason, storing in CSV or XML or any
    > textual format would be bad. I'm thinking about a binary format, but how can I
    > interface that with Python?
    >
    > I already looked at:
    >
    > - xdrlib, which throws me DeprecationWarning when I store some integers
    > - struct which uses format string for each read operation, I'm concerned about
    > its speed
    >
    > What else can I use?


    PyTables (<http://www.pytables.org/>) looks like the right kind of
    thing.

    -M-
    Matthew Woodcraft, Aug 9, 2008
    #3
