custom data warehouse in python vs. out-of-the-box ETL tool

S

snfctech

Does anyone have experience building a data warehouse in python? Any
thoughts on custom vs using an out-of-the-box product like Talend or
Informatica?

I have an integrated system Dashboard project that I was going to
build using cross-vendor joins on existing DBs, but I keep hearing
that a data warehouse is the way to go. e.g. I want to create orders
and order_items with relations to members (MS Access DB), products
(flat file) and employees (MySQL).

Thanks in advance for any tips.
 
S

Sean DiZazzo

Does anyone have experience building a data warehouse in python?  Any
thoughts on custom vs using an out-of-the-box product like Talend or
Informatica?

I have an integrated system Dashboard project that I was going to
build using cross-vendor joins on existing DBs, but I keep hearing
that a data warehouse is the way to go.  e.g. I want to create orders
and order_items with relations to members (MS Access DB), products
(flat file) and employees (MySQL).

Thanks in advance for any tips.

I have done some small/medium sized stuff using SQLAlchemy,
Turbogears, and Flex. I have never used a commercial product, but I
imagine getting it set up to work with your data is the hardest part
of the job anyway, and the solution you end up with will most likely
limit you to applying their api to your data. If you build it
yourself, you have complete control, and know exactly where to go when
you have a problem, or to add a feature.

I'm no expert, but I think I would try to find a way to consolidate
the data into one data source. We handle the giant amount of data we
are collecting by preprocessing it into another DB anyway, so I
imagine you could do both things at the same time.

This could very probably be handled in a different way if you are a
DBA. I'm just a MySQL hack. :)

~Sean
 
S

snfctech

Thanks for your replies, Sean and Martin.

I agree that the ETL tools are complex in themselves, and I may as
well spend that learning curve on a lower-level tool-set that has the
added value of greater flexibility.

Can you suggest a good book or tutorial to help me build a data
warehouse in python? Bill Inmon's "Building the Data Warehouse" is 17
years old, and I've been cautioned against Kimball.

Thanks.
 
N

nn

Does anyone have experience building a data warehouse in python?  Any
thoughts on custom vs using an out-of-the-box product like Talend or
Informatica?

I have an integrated system Dashboard project that I was going to
build using cross-vendor joins on existing DBs, but I keep hearing
that a data warehouse is the way to go.  e.g. I want to create orders
and order_items with relations to members (MS Access DB), products
(flat file) and employees (MySQL).

Thanks in advance for any tips.

I use both Python and a Data-warehouse tool (Datastage) from IBM that
is similar to Informatica. The main difference with Python is
throughput. The tool has good sort and join routines of multithreaded
C code that handles data bigger than what fits in RAM. It also has
good native drivers for the DB2 database. For data conversions and
other transformations every row gets processed on a different CPU. You
can really put a 16 core machine to good use with this thing.

In your case you probably won't have enough data to justify the cost
of buying a tool. They are quite expensive.
 
S

snfctech

@Martin: I originally thought that there was nothing "magical" about
building a data warehouse, but then I did a little research and
received all sorts of feedback about how data warehouse projects have
notorious failure rates, that data warehouse design IS different than
normal RDBMS - and then there's the whole thing about data marts vs.
warehouses, Kimball vs. Inmon, star schemas, EAV tables, and so on.
So I started to think that maybe I needed to get a little better read
on the subject.
 
S

snfctech

@Lemburg: Thanks for the suggestion. I'm sure you make a fine
product, but my development platform is Linux, and I don't want any
additional Windows servers to deal with (than the ones I'm already
stuck with.)
 
T

Tony Schmidt

@Martin: Thanks for your great feedback.

So do you think it would be very beneficial for me to start with an
Inman or Kimball book? Or do you think it would be just leisure
reading and not very practical at best - fill my head with needless
jargon and inflexible dogmas, at worst?

I took a database class in college, understand the basic principals of
normalisation, and have built a few complicated RDBMS schemas from the
ground up.
 
T

Tony Schmidt

Hi, Marc-Andre - well, so far you seem to be the only one suggesting
that cross-database joins is the way to go - everyone else has been
telling me to build a warehouse. I initially was trying to avoid the
warehouse idea to "avoid going through the external temporary
resource", as you say. But then the situation came up where the data
warehouse would give me another benefit by enabling power users with
tools like OOBase to do their own reporting. So I started to re-
consider it, and since then, no one has advised the direct approach
with connections/joins on multiple databases.

Have you had much luck circumventing the need for a data warehouse in
this way?

Thanks.
 
J

Jim

On this post, you are talking about Talend and Informatica like both
are
comparable... indeed both tools are comparable in power and
functionality.
But keep in mind Informatica is a proprietary solution, with no access
to the code. You have a license fee + other costs associated to the
use
of Infa.
Talend is an open source ETL able to perform data migration and
synchronization. It's code is open source. You can download the free
version of Talend, Talend Open Studio on the website.
Being an open source tool, you can adapt the software to your
organization and therefore spend less time and money on the
development
of your companies' own software.
Download it here: http://www.talend.com/download.php

Just my 2 cents about Talend and Informatica...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,540
Members
45,024
Latest member
ARDU_PROgrammER

Latest Threads

Top