Need design advice. What's my best approach for storing this data?

Discussion in 'Python' started by Mudcat, Mar 17, 2006.

  1. Mudcat

    Mudcat Guest

    Hi,

    I am trying to build a tool that analyzes stock data. Therefore I am
    going to download and store quite a vast amount of it. Just for a
    general number - assuming there are about 7000 listed stocks on the two
    major markets plus some extras, 255 trading days a year for 20 years,
    that is about 36 million entries.

    Obviously a database is a logical choice for that. However I've never
    used one, nor do I know what benefits I would get from using one. I am
    worried about speed, memory usage, and disk space.

    My initial thought was to put the data in large dictionaries and shelve
    them (and possibly zipping them to save storage space until the data is
    needed). However, these are huge files. Based on ones that I have
    already done, I estimated at least 5 gigs for storage this way. My
    structure for these files was a 3-layered dictionary:
    [Market][Stock][Date] -> (Data List). That allows me to easily access any
    data for any date or stock in a particular market. Therefore I wasn't
    really concerned about the organizational aspects of a db since this
    would serve me fine.
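
    Roughly the shape I have in mind -- a quick sketch (the ticker, date,
    and high/low/close fields are made up just to show the layout):

        import shelve

        # [Market][Stock][Date] -> data list; writeback=True caches
        # mutated objects so nested updates persist on close
        db = shelve.open('stocks.db', writeback=True)
        nyse = db.setdefault('NYSE', {})
        nyse.setdefault('IBM', {})['2006-03-17'] = [83.05, 81.90, 82.55]
        quote = db['NYSE']['IBM']['2006-03-17']   # -> [83.05, 81.90, 82.55]
        db.close()

    Note that pulling any single quote back out unpickles the entire
    'NYSE' entry in one go.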

    But before I put this all together I wanted to ask around to see if
    this is a good approach. Will it be faster to use a database over a
    structured dictionary? And will I get a lot of overhead if I go with a
    database? I'm hoping people who have dealt with such large data before
    can give me a little advice.

    Thanks ahead of time,
    Marc
    Mudcat, Mar 17, 2006
    #1

  2. J Correia

    J Correia Guest

    "Mudcat" <> wrote in message
    news:...
    > Hi,
    >
    > I am trying to build a tool that analyzes stock data. Therefore I am
    > going to download and store quite a vast amount of it. Just for a
    > general number - assuming there are about 7000 listed stocks on the two
    > major markets plus some extras, 255 trading days a year for 20 years,
    > that is about 36 million entries.
    >


    On a different tack, to avoid thinking about any db issues, consider
    subscribing to TC2000 (tc2000.com)... they already have all that data,
    in a database which takes about 900MB when fully installed.
    They also have an API which allows you full access to the database
    (including from Python via COM). The API is pretty robust and allows
    you to do pre-filtering (e.g. give me the last 20 years of all stocks
    over $50 with avg daily vol > 100k) at the db level, meaning you can
    focus on using Python for analysis. The database is also updated daily.

    If you don't need daily updates, then subscribe (first 30 days free) and
    cancel, and you've got a snapshot db of all the data you need.

    They also used to send out an evaluation CD which had all
    the history data barring the last 3 months or so which is certainly
    good enough for analysis and testing. Not sure if they still do that.

    HTH.
    J Correia, Mar 17, 2006
    #2

  3. On 17 Mar 2006 09:08:03 -0800, "Mudcat" <> declaimed
    the following in comp.lang.python:

    > structure for these files was a 3-layered dictionary:
    > [Market][Stock][Date] -> (Data List). That allows me to easily access any
    > data for any date or stock in a particular market. Therefore I wasn't


    Of course, you'll have to know which market the stock is in first,
    or do a test on each. And what happens if a stock changes markets (it
    happens)...

    >
    > But before I put this all together I wanted to ask around to see if
    > this is a good approach. Will it be faster to use a database over a
    > structured dictionary? And will I get a lot of overhead if I go with a
    > database? I'm hoping people who have dealt with such large data before
    > can give me a little advice.
    >

    Well, an RDBM won't require you to load the whole dataset just to
    add a record <G>. And you are looking at something that has small
    updates, but possibly large retrievals. Do you really want the
    application to load "all" that data just to add one record and then
    save it all back out? (What happens if the power fails halfway through
    the save; are you writing to a different file, then delete/rename?) Do
    you have enough memory to support the data set as one chunk? (Did you
    mention 5GB?)
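
    (If you stay with flat files, the usual safeguard is the
    write-to-a-temp-file-then-rename dance -- a sketch, assuming a
    Python with os.replace:

        import os, pickle, tempfile

        def atomic_pickle(obj, path):
            # write to a temp file in the same directory, then swap it
            # in; os.replace is atomic on both POSIX and Windows
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
            with os.fdopen(fd, 'wb') as f:
                pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
            os.replace(tmp, path)

    A database engine does that bookkeeping for you on every commit.)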


    I don't know what your data list is (high, low, close?) but there
    are a number of choices available...

    From a simple flat table with indices on Market, Stock, and Date:
    [Market, Stock, Date -> high, low, close]

    (The notation is [unique/composite key -> dependent data].)

    Or dual tables:
    [Market, Stock] [Stock, Date -> high, low, close]

    (Avoids duplicating the Market field for every record, though loses any
    history of when a stock changes markets; note the first table doesn't
    have any dependent data)

    Or, if using an RDBM where each table is a separate file [Visual FoxPro
    <yuck>, MySQL MyISAM], you could even do:
    [Market, Stock] [Date -> high, low, close]stock1
    [Date -> high, low, close]stock2
    ...
    [Date -> high, low, close]stockN

    where each [Date -> high, low, close] is a separate table named after
    the stock (each such table is identical in layout). This means more
    complexity when working with multiple stocks (the worst would be a
    report of stock names and spreads sorted by size of spread for a single
    day -- you'd have to create a temporary table of
    [Stock -> high, low, close] to perform the report).
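
    For instance, a rough sqlite3 sketch of the flat-table variant (the
    column names are invented for illustration; any DB-API adapter looks
    much the same):

        import sqlite3

        conn = sqlite3.connect('quotes.db')
        conn.execute("""CREATE TABLE IF NOT EXISTS quotes (
                            market TEXT NOT NULL,
                            symbol TEXT NOT NULL,
                            day    TEXT NOT NULL,   -- ISO date string
                            high REAL, low REAL, close REAL,
                            PRIMARY KEY (market, symbol, day))""")
        # secondary index so per-day scans across all stocks stay fast
        conn.execute("CREATE INDEX IF NOT EXISTS quotes_day ON quotes (day)")
        conn.execute("INSERT OR REPLACE INTO quotes VALUES (?,?,?,?,?,?)",
                     ('NYSE', 'IBM', '2006-03-17', 83.05, 81.90, 82.55))
        conn.commit()
        # fetch one stock's history without loading anything else
        history = conn.execute(
            "SELECT day, close FROM quotes WHERE market=? AND symbol=? "
            "ORDER BY day", ('NYSE', 'IBM')).fetchall()

    Adding a record touches one row and the indices -- no loading the
    whole dataset, no rewrite-the-file dance.
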
    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
    Dennis Lee Bieber, Mar 17, 2006
    #3
  4. Rene Pijlman

    Rene Pijlman Guest

    Mudcat:
    >My initial thought was to put the data in large dictionaries and shelve
    >them (and possibly zipping them to save storage space until the data is
    >needed). However, these are huge files.


    ZODB solves that problem for you.
    http://www.zope.org/Wikis/ZODB/FrontPage

    In particular, see "5.3 BTrees Package":
    http://www.zope.org/Wikis/ZODB/guide/node6.html#SECTION000630000000000000000

    But I've only used ZODB for small databases compared to yours. It's
    supposed to scale very well, but I can't speak from experience.
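
    Roughly, the idea looks like this (a minimal sketch, untested at your
    scale; the key scheme and the tuple of fields are just illustration):

        import ZODB, ZODB.FileStorage
        import transaction
        from BTrees.OOBTree import OOBTree

        storage = ZODB.FileStorage.FileStorage('quotes.fs')
        db = ZODB.DB(storage)
        conn = db.open()
        root = conn.root()

        # one BTree per market/stock, keyed by date; BTrees load only
        # the buckets you touch instead of unpickling a whole dict
        if 'NYSE:IBM' not in root:
            root['NYSE:IBM'] = OOBTree()
        root['NYSE:IBM']['2006-03-17'] = (83.05, 81.90, 82.55)
        transaction.commit()
        db.close()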

    --
    René Pijlman
    Rene Pijlman, Mar 17, 2006
    #4
  5. Mudcat

    Mudcat Guest

    >On a different tack, to avoid thinking about any db issues, consider
    >subscribing to TC2000 (tc2000.com)... they already have all that data,
    >in a database which takes about 900MB when fully installed.


    That is an interesting option also. I had actually looked for
    ready-made databases and didn't come across this one, although I don't
    understand how they can fit all that info into 900MB.

    I like this option, but I guess if I decide to keep using this database
    then I need to keep up my subscription. The thing I liked about
    downloading everything from Yahoo was that I didn't have to pay anyone
    for the data.

    Does anyone know the best way to compress this data? Or do any of these
    databases handle compression automatically? 5 gigs will be hard for any
    computer to deal with, even in a database.
    Mudcat, Mar 17, 2006
    #5
  6. Mudcat

    Mudcat Guest

    In doing a little research I ran across PyTables, which according to
    the documentation does this: "PyTables is a hierarchical database
    package designed to efficiently manage very large amounts of data." It
    also deals with compression and various other handy things. ZODB also
    seems to be designed to handle large amounts of data with compression
    in mind.

    Does anyone know which of these two apps would better fit my purpose? I
    don't know if either of these has limitations that might not work out
    well for what I'm trying to do. I really need to try and compress the
    data as much as possible without making the access times really slow.

    Thanks
    Mudcat, Mar 19, 2006
    #6
  7. Robert Kern

    Robert Kern Guest

    Mudcat wrote:
    > In doing a little research I ran across PyTables, which according to
    > the documentation does this: "PyTables is a hierarchical database
    > package designed to efficiently manage very large amounts of data." It
    > also deals with compression and various other handy things. ZODB also
    > seems to be designed to handle large amounts of data with compression
    > in mind.
    >
    > Does anyone know which of these two apps would better fit my purpose? I
    > don't know if either of these has limitations that might not work out
    > well for what I'm trying to do. I really need to try and compress the
    > data as much as possible without making the access times really slow.


    PyTables is exactly suited to storing large amounts of numerical data
    arranged in tables and arrays. The ZODB is not.
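
    A minimal sketch of what that looks like (written against a current
    PyTables; the file name, fields and compression settings here are
    arbitrary):

        import tables

        class Quote(tables.IsDescription):
            symbol = tables.StringCol(8)
            day    = tables.StringCol(10)       # ISO date
            high   = tables.Float64Col()
            low    = tables.Float64Col()
            close  = tables.Float64Col()

        h5 = tables.open_file('quotes.h5', mode='w')
        table = h5.create_table('/', 'quotes', Quote,
                                filters=tables.Filters(complevel=5,
                                                       complib='zlib'))
        row = table.row
        row['symbol'], row['day'] = 'IBM', '2006-03-17'
        row['high'], row['low'], row['close'] = 83.05, 81.90, 82.55
        row.append()
        table.flush()

        # in-kernel query: the filtering runs over compressed chunks
        # in C, not row by row in Python
        closes = [r['close'] for r in table.where('symbol == b"IBM"')]
        h5.close()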

    --
    Robert Kern


    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
    Robert Kern, Mar 19, 2006
    #7
  8. Magnus Lycka

    Magnus Lycka Guest

    Re: Need design advice. What's my best approach for storing this data?

    Mudcat wrote:
    > I am trying to build a tool that analyzes stock data. Therefore I am
    > going to download and store quite a vast amount of it. Just for a
    > general number - assuming there are about 7000 listed stocks on the two
    > major markets plus some extras, 255 trading days a year for 20 years,
    > that is about 36 million entries.
    >
    > Obviously a database is a logical choice for that. However I've never
    > used one, nor do I know what benefits I would get from using one. I am
    > worried about speed, memory usage, and disk space.


    This is a typical use case for relational database systems.
    With something like DB2 or Oracle here, you can take advantage
    of more than 20 years of work by lots of developers trying to
    solve the kind of problems you will run into.

    You haven't really stated all the facts needed to decide what product
    to choose, though. Will this be a multi-user application?
    Do you foresee a client/server application? What operating
    system(s) do you need to support?

    With relational databases, it's plausible to move some of
    the hard work in the data analysis into the server. Using
    this well means that you need to learn a bit about how
    relational databases work, but I think it's worth the trouble.
    It could mean that much less data ever needs to reach your
    Python program for processing, and that will mean a lot for
    your performance. Relational databases are very good at
    searching, sorting and simple aggregations of data. SQL is
    a declarative language, and in principle, your SQL code
    will just declare the correct queries and manipulations that
    you want to achieve, and tuning will be a separate activity,
    which doesn't need to involve program changes. In reality,
    there are certainly cases where changes in SQL code will
    influence performance, but to a very large extent, you can
    achieve good performance through building indices and by
    letting the database gather statistics and analyze the
    queries your programs contain. As a bonus, you also have
    advanced systems for security, transactional safety, on-
    line backup, replication etc.

    You don't get these advantages with any other data storage
    systems.
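
    To make that concrete, here is the kind of query I mean, sketched
    with the sqlite3 module against an invented
    quotes(market, symbol, day, high, low, close) table; a DB2 or Oracle
    server is queried the same way through its DB-API adapter:

        import sqlite3

        conn = sqlite3.connect('quotes.db')
        # the server searches, groups and aggregates the 36 million
        # rows; only ~7000 summary rows ever reach the Python program
        summary = conn.execute("""
            SELECT symbol, AVG(close), MIN(low), MAX(high)
            FROM quotes
            WHERE day BETWEEN '2005-01-01' AND '2005-12-31'
            GROUP BY symbol
            HAVING AVG(close) > 50
        """).fetchall()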

    I'd get Chris Fehily's "SQL Visual Quickstart Guide", which
    is as good as his Python book. As for the database, it depends a bit
    on the platform you work with. I'd avoid MySQL. Some friends
    of mine have used it for needs similar to yours, and they are
    now running into its severe shortcomings. (I did warn them.)

    For Windows, I think the single user version of SQL Server
    (MSDE?) is gratis. For both Windows and Linux/Unix, there are
    (I think) gratis versions of both Oracle 10g, IBM DB2 UDB and
    Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
    I think DB2 is somewhere in between. PostgreSQL is also a good
    option.

    Either way, it certainly seems natural to learn relational
    databases and SQL if you want to work with financial software.
    Magnus Lycka, Mar 20, 2006
    #8
  9. On Mon, 20 Mar 2006 11:00:21 +0100, Magnus Lycka <>
    declaimed the following in comp.lang.python:

    > For Windows, I think the single user version of SQL Server
    > (MSDE?) is gratis. For both Windows and Linux/Unix, there are


    MSDE's availability seems to vary from month to month <G>

    The version I had (it came with VB6 Pro) was restricted to a maximum
    database of 2GB (~the same size limit as a JET MDB, but using the SQL
    Server core), and was capped at something like five simultaneous
    queries.

    > (I think) gratis versions of both Oracle 10g, IBM DB2 UDB and
    > Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
    > I think DB2 is somewhere in between. PostgreSQL is also a good
    > option.
    >

    Depending upon requirements, there is also Firebird (a spawn of
    Interbase), MaxDB (MySQL's release of the SAP DB), and (while some abhor
    it) even MySQL might be worthy... (at least, until Oracle decides MySQL
    can no longer license either the Sleepycat BDB or the Inno Oy InnoDB
    backends).

    Ingres may also be viable.
    http://www.ingres.com/products/Prod_Download_Portal.html
    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
    Dennis Lee Bieber, Mar 20, 2006
    #9
  10. On Mon, 20 Mar 2006 17:52:19 GMT, Dennis Lee Bieber
    <> declaimed the following in comp.lang.python:


    > Ingres may also be viable.
    > http://www.ingres.com/products/Prod_Download_Portal.html


    Looking deeper, looks like Linux only...
    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
    Dennis Lee Bieber, Mar 21, 2006
    #10
