XML <=> Text conversion platform requiring high performance

  • Thread starter: Benjamin Bécar

Benjamin Bécar

Hello everyone.

I have to find a suitable architecture for an XML <=> text
conversion platform. The platform (based on Windows Server 2003)
will have to deal with 21 million XML files and 16 million text
files a day. The average file size is 1.1 KB, but the files arrive
at the platform in big archives (7,000 files per archive,
approx. 7.7 MB each).

After some research on the Internet, I have decided (95% sure) to
use SAX as the API for handling my files, and not to use XSLT as
the main converter, because it would be too slow.
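
Roughly, what I have in mind is something like the sketch below.
It is only a sketch: the element names and the "name=value" output
format are invented for the example, since I cannot show the real
schema here.

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Minimal SAX handler that flattens XML leaf elements into
    // "name=value" text lines. The real schema would differ.
    public class XmlToText extends DefaultHandler {
        private final Writer out;
        private final StringBuilder text = new StringBuilder();

        public XmlToText(Writer out) { this.out = out; }

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String local, String qName)
                throws SAXException {
            String value = text.toString().trim();
            text.setLength(0);
            if (value.length() > 0) {
                try {
                    // One output line per non-empty leaf element.
                    out.write(qName + "=" + value + "\n");
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            Writer out = new BufferedWriter(new FileWriter(args[1]));
            parser.parse(new File(args[0]), new XmlToText(out));
            out.close();
        }
    }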

However, I must say that I am a complete "newbie" where XML is
concerned, and those decisions were made after much reading and
after discussions with others who are supposedly a bit better at
XML than I am. Hence my two questions:

* Does this architecture look good to you?
Windows Server 2003, WebSphere 5.1.1, Java and SAX

* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Sun as the processing
platform, especially given that, in all, around 40 GB of data will
have to be processed each day?

Thanks for your help.
 

Jürgen Kahrs

Benjamin said:
* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Sun as the processing
platform, especially given that, in all, around 40 GB of data will
have to be processed each day?

This depends on how you process the files. If you choose to start
a new process for each file, then you might run into problems with
the amount of available RAM in your host. If 1000 files are being
processed at any given instant, and 1000 processes handle them,
then the per-process overhead also adds up 1000-fold.
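
One long-running process that reuses a single parser instance
avoids most of that overhead. A minimal sketch in Java (the
DefaultHandler here is only a stand-in for the real converter):

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // One process, one parser: the same SAXParser instance handles
    // every file, so process-startup and parser-creation costs are
    // paid once, not once per file.
    public class BatchConvert {
        public static void main(String[] args) throws Exception {
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler(); // stand-in
            for (String name : args) {
                parser.reset();   // clear state before reusing
                parser.parse(new File(name), handler);
            }
        }
    }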
 

Benjamin Bécar

Thanks for your answer,

I intend to process the files one after the other, and not to use
threads. Why? Because threads would be harder to develop with, and
thus slower to deliver, which I do not want.
Do you think the application will be able to deal with those 40 GB
of data if they are processed sequentially?
 

Joseph Kesselman

If you're talking about 40 GB XML files, I'd strongly suggest
considering a serious XML database (which, these days, includes
IBM's DB2), or keeping the main database in non-XML form and using
XML only for the extracted subsets that you're going to expose to
the outside world.
 

Benjamin Bécar

The problem is that this platform is only supposed to be a gateway
for text files (around 17 GB a day) and XML files (around 23 GB a
day). Moreover, the XML files have special features that "prevent"
easy storage in an XML database: they are zipped and signed. That
means I have no choice but to deal with those big archives one
after the other, as they arrive at the gateway.
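
My current idea is therefore to stream each entry of an archive
straight into the SAX parser, without unpacking anything to disk.
A rough sketch of what I mean, assuming standard ZIP archives;
signature verification is deliberately left out, and the handler
is only a placeholder:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.FilterInputStream;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // Feed every XML entry of a ZIP archive directly into a SAX
    // parser. Signature checking is left out of this sketch.
    public class ZipGateway {
        public static void main(String[] args) throws Exception {
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler(); // placeholder
            ZipInputStream zip = new ZipInputStream(
                new BufferedInputStream(new FileInputStream(args[0])));
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().endsWith(".xml")) {
                    parser.reset();
                    // Shield the ZIP stream so the parser cannot
                    // close it after the first entry.
                    InputStream one = new FilterInputStream(zip) {
                        public void close() { /* keep zip open */ }
                    };
                    parser.parse(one, handler);
                }
                zip.closeEntry();
            }
            zip.close();
        }
    }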
 

Jürgen Kahrs

Benjamin said:
I intend to process the files one after the other, and not to use
threads.

This is a conservative design decision. In production environments
it is generally a good idea to choose proven designs instead of
futuristic ones.

Benjamin said:
Why? Because threads would be harder to develop with, and thus
slower to deliver, which I do not want.

Indeed.

Benjamin said:
Do you think the application will be able to deal with those 40 GB
of data if they are processed sequentially?

The rule of thumb is that a good SAX parser can parse about
10 MB/s. 40 GB of data should therefore take about 4000 seconds.
This is about one hour of CPU time for the traffic of a day. If
there is little other overhead in your software system, then
you've got some headroom left with just one server (with just one
CPU).
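
Of course, 10 MB/s is only a rule of thumb; the honest answer is
to measure on the target machine with a representative sample
file. A quick-and-dirty sketch:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // Rough SAX throughput check: parse one sample file many
    // times and report MB/s.
    public class SaxBench {
        public static void main(String[] args) throws Exception {
            File sample = new File(args[0]);
            int runs = 1000;
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler();
            long start = System.currentTimeMillis();
            for (int i = 0; i < runs; i++) {
                parser.reset();
                parser.parse(sample, handler);
            }
            long millis = System.currentTimeMillis() - start;
            double mb = runs * sample.length() / (1024.0 * 1024.0);
            System.out.println(mb / (millis / 1000.0) + " MB/s");
        }
    }

With your many small files, per-file setup costs will matter at
least as much as raw parse speed, so benchmark with a small sample
file, not only with a big one.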
 

Jürgen Kahrs

Benjamin said:
Moreover, the XML files have special features that "prevent" easy
storage in an XML database: they are zipped and signed. That means
I have no choice but to deal with those big archives one after the
other, as they arrive at the gateway.

Now I understand why you opted for a sequential approach.
For the prototyping phase, the following tool may be useful in
building a first implementation of the processing pipeline:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

I know of at least one developer (Andrew Schorr) who
has built a production environment with this tool.
But his constraints differed a bit from yours.
Andrew uses a Solaris server. He has XML files as
large as 1 GB and has to import them into the
PostgreSQL database.
 

Benjamin Bécar

Thank you very much for this information; I'm going to look into
it right now. And please excuse my previous mail, which may have
sounded a bit "stubborn", as I did not explain my reasons.
 

Benjamin Bécar

Jürgen Kahrs wrote:
This is about one hour of CPU time for the traffic of a day. If
there is little other overhead in your software system, then
you've got some headroom left with just one server (with just one
CPU).

That doesn't seem like much time to me, which is a good thing. But
could you tell me what kind of hardware you assumed for this
estimate?

Thanks.
 
