Need design advice. What's my best approach for storing this data?


Mudcat

Hi,

I am trying to build a tool that analyzes stock data. Therefore I am
going to download and store quite a vast amount of it. Just for a
general number - assuming there are about 7000 listed stocks on the two
major markets plus some extras, 255 trading days a year for 20 years,
that is about 36 million entries.

Obviously a database is a logical choice for that. However I've never
used one, nor do I know what benefits I would get from using one. I am
worried about speed, memory usage, and disk space.

My initial thought was to put the data in large dictionaries and shelve
them (and possibly zipping them to save storage space until the data is
needed). However, these are huge files. Based on ones that I have
already done, I estimated at least 5 gigs of storage this way. My
structure for these files was a 3-layered dictionary:
[Market][Stock][Date] -> (data list). That allows me to easily access any
data for any date or stock in a particular market, so I wasn't
really concerned about the organizational aspects of a db since this
would serve me fine.
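
Roughly, this is the kind of thing I had in mind -- just a sketch, and the
data fields (open/high/low/close/volume) are placeholders for whatever I end
up downloading:

import shelve

# Sketch of the 3-layered structure: [Market][Stock][Date] -> data list.
db = shelve.open('stocks.shelve')

nyse = db.get('NYSE', {})
nyse.setdefault('IBM', {})['2005-01-03'] = [98.97, 99.10, 98.50, 98.90, 5300000]
db['NYSE'] = nyse   # shelve only notices assignments to top-level keys

# Pull one day's data for one stock:
print(db['NYSE']['IBM']['2005-01-03'])
db.close()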

But before I put this all together I wanted to ask around to see if
this is a good approach. Will it be faster to use a database over a
structured dictionary? And will I get a lot of overhead if I go with a
database? I'm hoping people who have dealt with such large data before
can give me a little advice.

Thanks ahead of time,
Marc
 

J Correia

Mudcat said:
Hi,

I am trying to build a tool that analyzes stock data. Therefore I am
going to download and store quite a vast amount of it. Just for a
general number - assuming there are about 7000 listed stocks on the two
major markets plus some extras, 255 trading days a year for 20 years,
that is about 36 million entries.

On a different tack, to avoid thinking about any db issues, consider
subscribing to TC2000 (tc2000.com)... they already have all that data,
in a database which takes about 900MB when fully installed.
They also have an API which allows you full access to the database
(including from Python via COM). The API is pretty robust and allows
you to do pre-filtering (e.g. give me the last 20 years of all stocks over $50
with average daily volume > 100k) at the db level, meaning you can focus on using
Python for analysis. The database is also updated daily.

If you don't need daily updates, then subscribe (first 30 days free) and
cancel, and you've got a snapshot db of all the data you need.

They also used to send out an evaluation CD which had all
the history data barring the last 3 months or so, which is certainly
good enough for analysis and testing. Not sure if they still do that.

HTH.
 

Dennis Lee Bieber

structure for these files was a 3-layered dictionary:
[Market][Stock][Date] -> (data list). That allows me to easily access any
data for any date or stock in a particular market, so I wasn't

Of course, you'll have to know which market the stock is in first,
or do a test on each. And what happens if a stock changes markets (it
happens)...
But before I put this all together I wanted to ask around to see if
this is a good approach. Will it be faster to use a database over a
structured dictionary? And will I get a lot of overhead if I go with a
database? I'm hoping people who have dealt with such large data before
can give me a little advice.
Well, an RDBM won't require you to load the whole dataset just to
add a record <G>. And you are looking at something that has small
updates, but possibly large retrievals. Do you really want the
application to load "all" that data just to add one record and then save
it all back out? (What happens if the power fails halfway through the
save; are you writing to a different file, then doing a delete/rename?) Do you
have enough memory to support the data set as one chunk? (Did you
mention 5GB?)


I don't know what your data list is (high, low, close?) but there
are a number of choices available...

From a simple flat table with indices on Market, Stock, and Date
[Market, Stock, Date -> high, low, close]

(The notation is [unique/composite key -> dependent data].)

Or dual tables:
[Market, Stock] [Stock, Date -> high, low, close]

(Avoids duplicating the Market field for every record, though loses any
history of when a stock changes markets; note the first table doesn't
have any dependent data)

Or, if using an RDBM where each table is a separate file [Visual FoxPro
<yuck>, MySQL MyISAM], you could even do:
[Market, Stock] [Date -> high, low, close]stock1
[Date -> high, low, close]stock2
...
[Date -> high, low, close]stockN

where each [Date -> high, low, close] is a separate table named after
the stock (each such table is identical in layout). This adds complexity when
working with multiple stocks (the worst case would be a report of stock
names and spreads, sorted by size of spread, for a single day -- you'd have
to create a temporary table of [Stock -> high, low, close] to produce
the report).
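
For the simple flat-table option, a minimal sqlite3 sketch (column names are
just a guess at what your data list holds) would be something like:

import sqlite3

# Flat table [Market, Stock, Date -> high, low, close] with extra indices.
con = sqlite3.connect('stocks.db')
con.execute("""CREATE TABLE IF NOT EXISTS quotes (
                   market TEXT NOT NULL,
                   stock  TEXT NOT NULL,
                   date   TEXT NOT NULL,
                   high REAL, low REAL, close REAL,
                   PRIMARY KEY (market, stock, date))""")
con.execute("CREATE INDEX IF NOT EXISTS idx_stock_date ON quotes (stock, date)")
con.execute("CREATE INDEX IF NOT EXISTS idx_date ON quotes (date)")

# Adding one record doesn't touch the rest of the data set.
con.execute("INSERT OR REPLACE INTO quotes VALUES (?, ?, ?, ?, ?, ?)",
            ('NYSE', 'IBM', '2005-01-03', 99.10, 98.50, 98.90))
con.commit()

# One stock's history comes back without loading anything else into memory.
for row in con.execute("SELECT date, close FROM quotes WHERE stock = ? ORDER BY date",
                       ('IBM',)):
    print(row)
con.close()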
--
 

Rene Pijlman

Mudcat:
My initial thought was to put the data in large dictionaries and shelve
them (and possibly zipping them to save storage space until the data is
needed). However, these are huge files.

ZODB solves that problem for you.
http://www.zope.org/Wikis/ZODB/FrontPage

See in particular "5.3 BTrees Package":
http://www.zope.org/Wikis/ZODB/guide/node6.html#SECTION000630000000000000000

But I've only used ZODB for small databases compared to yours. It's
supposed to scale very well, but I can't speak from experience.
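
Just to give an idea, a minimal sketch (the key layout and the data tuple are
only placeholders):

from ZODB.FileStorage import FileStorage
from ZODB import DB
from BTrees.OOBTree import OOBTree
import transaction

storage = FileStorage('stocks.fs')
db = DB(storage)
conn = db.open()
root = conn.root()

if 'quotes' not in root:
    root['quotes'] = OOBTree()   # persistent BTree, loaded bucket by bucket

# Keying on (market, stock, date) keeps per-stock range scans cheap.
root['quotes'][('NYSE', 'IBM', '2005-01-03')] = (99.10, 98.50, 98.90)
transaction.commit()             # only the touched buckets are written to disk

print(root['quotes'][('NYSE', 'IBM', '2005-01-03')])
conn.close()
db.close()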
 

Mudcat

On a different tack, to avoid thinking about any db issues, consider
subscribing to TC2000 (tc2000.com)... they already have all that data,
in a database which takes about 900MB when fully installed.

That is an interesting option also. I had actually looked for ready-made
databases and didn't come across this one, although I don't
understand how they can fit all that info into 900MB.

I like this option, but I guess if I decide to keep using this database
then I need to keep up my subscription. The thing I liked about
downloading everything from Yahoo was that I didn't have to pay anyone
for the data.

Does anyone know the best way to compress this data? Or do any of these
databases handle compression automatically? 5 gigs will be hard for any
computer to deal with, even in a database.
 

Mudcat

In doing a little research I ran across PyTables, which according to
the documentation does this: "PyTables is a hierarchical database
package designed to efficiently manage very large amounts of data." It
also deals with compression and various other handy things. ZODB also
seems to be designed to handle large amounts of data with compression
in mind.

Does anyone know which of these two apps would better fit my purpose? I
don't know if either of these has limitations that might not work out
well for what I'm trying to do. I really need to try to compress the
data as much as possible without making the access times really slow.

Thanks
 

Robert Kern

Mudcat said:
In doing a little research I ran across PyTables, which according to
the documentation does this: "PyTables is a hierarchical database
package designed to efficiently manage very large amounts of data." It
also deals with compression and various other handy things. ZODB also
seems to be designed to handle large amounts of data with compression
in mind.

Does anyone know which of these two apps would better fit my purpose? I
don't know if either of these has limitations that might not work out
well for what I'm trying to do. I really need to try to compress the
data as much as possible without making the access times really slow.

PyTables is exactly suited to storing large amounts of numerical data arranged in
tables and arrays. The ZODB is not.
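
For instance, something along these lines (the table layout and compression
settings are just assumptions about your data):

import tables

class Quote(tables.IsDescription):
    market = tables.StringCol(8)
    stock  = tables.StringCol(8)
    date   = tables.StringCol(10)
    high   = tables.Float64Col()
    low    = tables.Float64Col()
    close  = tables.Float64Col()

h5 = tables.open_file('stocks.h5', mode='w')
filters = tables.Filters(complevel=5, complib='zlib')   # transparent compression
table = h5.create_table('/', 'quotes', Quote, filters=filters)

row = table.row
row['market'], row['stock'], row['date'] = 'NYSE', 'IBM', '2005-01-03'
row['high'], row['low'], row['close'] = 99.10, 98.50, 98.90
row.append()
table.flush()

# In-kernel query: only matching rows are handed back to Python.
for r in table.where("(close > 90.0) & (high < 100.0)"):
    print(r['date'], r['close'])
h5.close()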

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Magnus Lycka

Mudcat said:
I am trying to build a tool that analyzes stock data. Therefore I am
going to download and store quite a vast amount of it. Just for a
general number - assuming there are about 7000 listed stocks on the two
major markets plus some extras, 255 trading days a year for 20 years,
that is about 36 million entries.

Obviously a database is a logical choice for that. However I've never
used one, nor do I know what benefits I would get from using one. I am
worried about speed, memory usage, and disk space.

This is a typical use case for relational database systems.
With something like DB2 or Oracle here, you can take advantage
of more than 20 years of work by lots of developers trying to
solve the kind of problems you will run into.

You haven't really stated all the facts needed to decide what product
to choose, though. Will this be a multi-user application?
Do you foresee a client/server application? What operating
system(s) do you need to support?

With relational databases, it's plausible to move some of
the hard work in the data analysis into the server. Using
this well means that you need to learn a bit about how
relational databases work, but I think it's worth the trouble.
It could mean that much less data ever needs to reach your
Python program for processing, and that will mean a lot for
your performance. Relational databases are very good at
searching, sorting and simple aggregations of data. SQL is
a declarative language, and in principle, your SQL code
will just declare the correct queries and manipulations that
you want to achieve, and tuning will be a separate activity,
which doesn't need to involve program changes. In reality,
there are certainly cases where changes in SQL code will
influence performance, but to a very large extent, you can
achieve good performance through building indices and by
letting the database gather statistics and analyze the
queries your programs contain. As a bonus, you also have
advanced systems for security, transactional safety, on-
line backup, replication etc.

You don't get these advantages with any other data storage
systems.
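
As an illustration of pushing the work into the server: a query like "average
close and yearly high per stock for 2004, for stocks averaging above $50"
returns only summary rows to Python, however many price records are stored.
(sqlite3 below is just a stand-in for whichever DB-API module you end up
using, and the table layout is assumed.)

import sqlite3

# The filtering and aggregation run inside the database engine;
# Python only ever sees the summary rows.
con = sqlite3.connect('stocks.db')
query = """
    SELECT stock, AVG(close) AS avg_close, MAX(high) AS year_high
    FROM quotes
    WHERE date BETWEEN '2004-01-01' AND '2004-12-31'
    GROUP BY stock
    HAVING AVG(close) > 50
    ORDER BY avg_close DESC
"""
for stock, avg_close, year_high in con.execute(query):
    print(stock, round(avg_close, 2), year_high)
con.close()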

I'd get Chris Fehily's "SQL Visual Quickstart Guide", which
is as good as his Python book. As for the database, it depends a bit
on the platform you work with. I'd avoid MySQL. Some friends
of mine have used it for needs similar to yours, and they are
now running into its severe shortcomings. (I did warn them.)

For Windows, I think the single user version of SQL Server
(MSDE?) is gratis. For both Windows and Linux/Unix, there are
(I think) gratis versions of both Oracle 10g, IBM DB2 UDB and
Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
I think DB2 is somewhere in between. PostgreSQL is also a good
option.

Either way, it certainly seems natural to learn relational
databases and SQL if you want to work with financial software.
 

Dennis Lee Bieber

For Windows, I think the single user version of SQL Server
(MSDE?) is gratis. For both Windows and Linux/Unix, there are

MSDE's availability seems to vary from month to month <G>

The version I had (it came with VB6 Pro) was restricted to a maximum
database size of 2GB (about the same limit as a JET MDB, but using the SQL
Server core), and had a throttle of something like five simultaneous
queries.
(I think) gratis versions of both Oracle 10g, IBM DB2 UDB and
Mimer SQL. Mimer SQL is easy to install, Oracle is a pain, and
I think DB2 is somewhere in between. PostgreSQL is also a good
option.
Depending upon requirements, there is also Firebird (a spawn of
Interbase), MaxDB (MySQL's release of the SAP DB), and (while some abhor
it) even MySQL might be worthy... (at least, until Oracle decides MySQL
can no longer license either the Sleepycat BDB or the Inno Oy InnoDB
backends).

Ingres may also be viable.
http://www.ingres.com/products/Prod_Download_Portal.html
--
 
