ccc31807
My apologies to those of you who think this post is OT.
Here is an extended quotation from a Microsoft white paper on data
transformation services, or Integration Services in MS SQL 2008. This
really hit me hard -- I'm a database guy who uses Perl extensively to
do my job. SQL Server is a commercial product, which MS sells to
generate revenue. Ordinarily, people buy stuff that they perceive has
value over and above the purchase price. Microsoft obviously wants to
persuade people that SSIS adds value, and they may be right. I'm not
going to prejudge SSIS, particularly since I don't know anything about
it. I'm wondering how others may respond, hence this post. I will
follow up this post with my response. (I don't want to do it now lest
I poison the well.)
In case you are wondering, this post is directly related to certain
events that have transpired in my workplace in the past couple of
days. I'll leave it to you to guess what those events are.
Here is the article, and yes, I know it violates the terms of use,
according to which you need written permission from Microsoft
Corporation to even discover that the document exists, much less to
read it!
<quote>
Challenges of Data Integration
At one level, the problem of data integration in our real-world
scenario is extraordinarily simple. Get data from multiple sources,
cleanse and transform the data, and load the data into appropriate
data stores for analysis and reporting. Unfortunately, in a typical
data warehouse or business intelligence project, enterprises spend
60–80% of the available resources in the data integration stage. Why is
it so difficult?
Technology Challenges
Technology challenges start with source systems. We are moving from
collecting data on transactions (where customers commit to getting,
buying, or otherwise acquiring something) to collecting data on
pre-transactions (where mechanisms such as Web clicks or RFID tags track
customer intentions). Data is now not only acquired via traditional
sources and formats, such as databases and text files, but is
increasingly available in a variety of different formats (ranging from
proprietary files to Microsoft Office documents to XML-based files)
and from Internet-based sources such as Web services and RSS (Really
Simple Syndication) streams.
The most pertinent challenges are:
• Multiple sources with different formats.
• Structured, semi-structured, and unstructured data.
• Data feeds from source systems arriving at different times.
• Huge data volumes.
In an ideal world, even if you somehow manage to get all the data we
need in one place, new challenges start to surface, including:
• Data quality.
• Making sense of different data formats.
• Transforming the data into a format that is meaningful to business
analysts.
Suppose that you can magically get all of the data that you need and
that you can cleanse, transform, and map the data into a useful
format. There is still another shift away from traditional data
movement and integration. That is the shift from fixed long
batch-oriented processes to fluid and shorter on-demand processes. Most
organizations perform batch-oriented processes during “downtimes” when
users do not place heavy demands on the system. This is usually at
night during a predefined batch window of 6-8 hours, when no one is
supposed to be in the office. With the increasing globalization of
businesses of every size and type, this is no longer true. There is
very little (if any) downtime and someone is always in the office
somewhere in the world.
As a result you have:
• Increasing pressure to load the data as quickly as possible.
• The need to load multiple destinations at the same time.
• Diverse destinations.
Not only do you need to achieve all of these results, but also you
need to achieve them as fast as possible. In extreme cases, such as
online businesses, you must integrate data on a continuous basis.
There are no real batch windows and latencies cannot exceed minutes.
In many of these scenarios, the decision-making process is automated
with continuously running software.
Scalability and performance become more and more important as you face
business needs that cannot tolerate any downtime.
Without the right technology, systems require staging at almost every
step of the warehousing and integration process. As different
(especially nonstandard) data sources need to be included in the
Extract, Transform, and Load (ETL) process and as more complex
operations (such as data and text mining) need to be performed on the
data, the need to stage the data increases. As illustrated in Figure
1, with increased staging the time taken to “close the loop,” (i.e.,
to analyze, and take action on new data) increases as well. These
traditional ELT architectures (as opposed to value-added ETL processes
that occur prior to loading) impose severe restrictions on the ability
of systems to respond to emerging business needs.
Finally, the question of how data integration ties into the overall
integration architecture of the organization is becoming more
important when you need both the real-time transactional technology of
application integration and the batch-oriented high-volume world of
data integration technology to solve the business problems of the
enterprise.
</quote>
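For anyone who wants something concrete to argue about, the "extraordinarily simple" version of the problem in the first quoted paragraph (get data from multiple sources, cleanse and transform it, load it into a data store) might look roughly like the sketch below. This is only an illustration under stated assumptions, not anything taken from the white paper or from Integration Services: the sample data, the `sales` table, and the function names `extract`, `transform`, `load`, and `run_etl` are all hypothetical, and an in-memory SQLite database stands in for the real destination.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample inputs standing in for two differently formatted sources.
CSV_SOURCE = "id,name,amount\n1, Alice ,10.50\n2,Bob,\n"
JSON_SOURCE = '[{"id": 3, "name": "carol", "amount": "7"}]'


def extract():
    """Yield raw rows from both sources as dictionaries."""
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))
    yield from json.loads(JSON_SOURCE)


def transform(row):
    """Cleanse one row: trim and title-case the name, treat a missing amount as 0."""
    amount = str(row.get("amount") or "0").strip()
    return int(row["id"]), str(row["name"]).strip().title(), float(amount)


def load(rows, conn):
    """Load cleansed rows into a single destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run extract -> transform -> load and return the loaded rows."""
    conn = sqlite3.connect(":memory:")  # assumption: in-memory stand-in for a warehouse
    load((transform(r) for r in extract()), conn)
    return conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()


if __name__ == "__main__":
    for record in run_etl():
        print(record)
```

The sketch transforms rows before loading and does no staging at all, which only works because the inputs are tiny. It says nothing about the harder points the paper raises, such as feeds arriving at different times, huge volumes, or loading several destinations at once.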