Re: Fastest way to store ints and floats on disk

Discussion in 'Python' started by M.-A. Lemburg, Aug 7, 2008.

  1. On 2008-08-07 20:41, Laszlo Nagy wrote:
    >
    > Hi,
    >
    > I'm working on a pivot table. I would like to write it in Python. I
    > know I should be doing that in C, but I would like to create a cross-
    > platform version which can deal with smaller databases (not more than
    > a million facts).
    >
    > The data is first imported from a csv file: the user selects which
    > columns contain dimension and measure data (and which columns to
    > ignore). In the next step I would like to build up a database that is
    > efficient enough to be used for making pivot tables. Here is my idea for
    > the database:
    >
    > Original CSV file with column header and values:
    >
    > "Color","Year","Make","Price","VMax"
    > Yellow,2000,Ferrari,100000,254
    > Blue,2003,Volvo,50000,210
    >
    > Using the GUI, it is converted to this:
    >
    > dimensions = [
    >     { 'name': 'Color', 'colindex': 0,
    >       'values': [ 'Red', 'Blue', 'Green', 'Yellow' ] },
    >     { 'name': 'Year', 'colindex': 1,
    >       'values': [ 1995, 1999, 2000, 2001, 2002, 2003, 2007 ] },
    >     { 'name': 'Make', 'colindex': 2,
    >       'values': [ 'Ferrari', 'Volvo', 'Ford', 'Lamborghini' ] },
    > ]
    > measures = [
    >     { 'name': 'Price', 'colindex': 3 },
    >     { 'name': 'Vmax', 'colindex': 4 },
    > ]
    > facts = [
    >     ( (3,2,0), (100000.0,254.0) ),  # (dimension_value_indexes, measure_values)
    >     ( (1,5,1), (50000.0,210.0) ),
    >     ...  # some million rows or less
    > ]
    >
    >
    > The core of the idea is that, when using a relatively small number of
    > possible values for each dimension, the facts table becomes
    > significantly smaller and easier to process. (Processing the facts would
    > be: iterate over facts, filter out some of them, create statistical
    > values of the measures, grouped by dimensions.)
    >
    > The facts table cannot be kept in memory because it is too big. I need
    > to store it on disk, be able to read incrementally, and make statistics.
    > In most cases, the "statistic" will be a simple sum of the measures,
    > plus a count of the facts affected. To be efficient, reading the
    > facts from disk should not involve complex conversions. For this reason,
    > storing in CSV or XML or any textual format would be bad. I'm thinking
    > about a binary format, but how can I interface that with Python?
    >
    > I already looked at:
    >
    > - xdrlib, which throws a DeprecationWarning when I store some integers
    > - struct, which uses a format string for each read operation; I'm
    > concerned about its speed
    >
    > What else can I use?


    >>> import marshal
    >>> marshal.dump(1, open('test.db', 'wb'))
    >>> marshal.load(open('test.db', 'rb'))
    1

    It is also very fast at dumping/loading lists, tuples, dictionaries,
    floats, etc.
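
    For example, dumping each fact as its own record lets the file be
    read back incrementally instead of loading it all at once. A minimal
    sketch, assuming the facts structure from the question (the file name
    and layout are illustrative, not part of the original exchange):

        import marshal

        facts = [
            ((3, 2, 0), (100000.0, 254.0)),
            ((1, 5, 1), (50000.0, 210.0)),
        ]

        # One marshal.dump call per fact; the records are simply
        # concatenated in the file.
        with open('facts.db', 'wb') as f:
            for fact in facts:
                marshal.dump(fact, f)

        # Read back one record at a time; marshal.load raises
        # EOFError at the end of the file.
        with open('facts.db', 'rb') as f:
            while True:
                try:
                    dims, values = marshal.load(f)
                except EOFError:
                    break
                # ...filter on dims, accumulate sums/counts of values...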

    --
    Marc-Andre Lemburg
    eGenix.com

    M.-A. Lemburg, Aug 7, 2008
    #1

  2. castironpi

    On Aug 7, 2:27 pm, "M.-A. Lemburg" <> wrote:
    > On 2008-08-07 20:41, Laszlo Nagy wrote:
    > > [snip: original question]
    >
    > >>> import marshal
    > >>> marshal.dump(1, open('test.db', 'wb'))
    > >>> marshal.load(open('test.db', 'rb'))
    > 1
    >
    > It is also very fast at dumping/loading lists, tuples, dictionaries,
    > floats, etc.

    Depending on how hard-core you want to be, store the int, float,
    string, and long C structures directly to disk, at a given offset.
    Either use fixed-length strings, or implement (or find) a memory
    manager. Anyone have a good alloc-realloc-free library, C or Python?
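
    A minimal sketch of the fixed-record variant using the stdlib struct
    module (the field widths are assumptions; precompiling the format
    with struct.Struct avoids re-parsing the format string on every
    call, which addresses the earlier speed concern about struct):

        import struct

        # Three dimension value indexes as unsigned shorts plus two
        # measures as doubles; '<' fixes the byte order so files stay
        # portable across platforms.
        FACT = struct.Struct('<3H2d')

        def write_facts(path, facts):
            with open(path, 'wb') as f:
                for dims, values in facts:
                    f.write(FACT.pack(*(dims + values)))

        def read_facts(path):
            with open(path, 'rb') as f:
                while True:
                    rec = f.read(FACT.size)
                    if len(rec) < FACT.size:
                        break
                    d = FACT.unpack(rec)
                    yield d[:3], d[3:]

    Fixed-size records also give O(1) random access: fact number i
    starts at byte offset i * FACT.size.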
    castironpi, Aug 9, 2008
    #2

  3. John Machin

    On Aug 10, 4:58 am, castironpi <> wrote:
    > On Aug 7, 2:27 pm, "M.-A. Lemburg" <> wrote:
    > > [snip: marshal suggestion]
    >
    > Depending on how hard-core you want to be, store the int, float,
    > string, and long C structures directly to disk, at a given offset.
    > Either use fixed-length strings, or implement (or find) a memory
    > manager. Anyone have a good alloc-realloc-free library, C or Python?

    A long time ago, when I last needed to bother about such things (to
    override the memory allocator in the DJGPP RTL), Doug Lea's malloc did
    the trick.

    A memory allocator written in Python? That's a novel concept.
    John Machin, Aug 9, 2008
    #3
  4. castironpi

    On Aug 9, 4:43 pm, John Machin <> wrote:
    > On Aug 10, 4:58 am, castironpi <> wrote:
    > > [snip: suggestion to store C structures directly to disk at given offsets]
    >
    > A long time ago, when I last needed to bother about such things (to
    > override the memory allocator in the DJGPP RTL), Doug Lea's malloc did
    > the trick.
    >
    > A memory allocator written in Python? That's a novel concept.

    For strings and longs, you need variable-length records. Wrap Lea's
    malloc in Python calls, and design a Python class where Year, Price,
    and VMax are stored as ints at given offsets from the start of the
    file, and Color and Make are stored as strings at given offsets.
    Don't bother to cache, just seek and read. Part 1 of the file, or
    File 1, looks like:

    Car 1 color_offset year_offset make_offset price_offset vmax_offset
    Car 2 color_offset year_offset make_offset price_offset vmax_offset

    Store them directly as bytes, not string reps. of numbers.

    1024 1050 1054 1084 1088
    1092 1112 1116 1130 1134

    Part 2 looks like

    1024 Yell
    1028 ow
    1050 2000
    1054 Ferr
    1058 ari
    1084 100000
    1088 254
    1092 Blue
    1112 2003
    1116 Volv

    and so on.
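
    A rough, illustrative sketch of reading such a layout (the 4-byte
    offsets, the 2-byte length prefix for strings, and the class name
    are all assumptions, not a finished design):

        import struct

        OFFSETS = struct.Struct('<5I')  # one Part-1 row: five 4-byte offsets

        class CarFile:
            # Part 1 is a table with one row of offsets per car;
            # Part 2 is a heap where each field lives at its offset.
            # No caching: just seek and read, as suggested above.
            def __init__(self, path, table_start=0):
                self.f = open(path, 'rb')
                self.table_start = table_start

            def _int(self, offset):
                self.f.seek(offset)
                return struct.unpack('<i', self.f.read(4))[0]

            def _str(self, offset):
                # Strings stored with a 2-byte length prefix; reusing
                # the space of deleted strings is the memory manager's
                # job.
                self.f.seek(offset)
                (n,) = struct.unpack('<H', self.f.read(2))
                return self.f.read(n)

            def car(self, i):
                self.f.seek(self.table_start + i * OFFSETS.size)
                color_o, year_o, make_o, price_o, vmax_o = \
                    OFFSETS.unpack(self.f.read(OFFSETS.size))
                return (self._str(color_o), self._int(year_o),
                        self._str(make_o), self._int(price_o),
                        self._int(vmax_o))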
    castironpi, Aug 10, 2008
    #4