XML <=> Text conversion platform requiring high performance

  • Thread starter: Benjamin Bécar

Benjamin Bécar

Hello everyone.

I have to find a suitable architecture for an XML <=> text
conversion platform. The platform (based on Windows Server 2003)
will have to deal with 21 million XML files and 16 million text
files a day. The average file size is 1.1 KB, but the files arrive
at the platform in big archives (7,000 files per archive,
approx. 7.7 MB each).

After some research on the Internet, I have decided (95% sure) to
use SAX as the API for handling my files, and not to use XSLT as
the main converter, because it would be too slow.
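
Roughly, what I have in mind is something like the sketch below.
It is only a sketch: the element names and the "name=value" output
format are invented for the example, since I cannot show the real
schema here.

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Minimal SAX handler that flattens XML leaf elements into
    // "name=value" text lines. The real schema would differ.
    public class XmlToText extends DefaultHandler {
        private final Writer out;
        private final StringBuilder text = new StringBuilder();

        public XmlToText(Writer out) { this.out = out; }

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        public void endElement(String uri, String local, String qName)
                throws SAXException {
            String value = text.toString().trim();
            text.setLength(0);
            if (value.length() > 0) {
                try {
                    // One output line per non-empty leaf element.
                    out.write(qName + "=" + value + "\n");
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            Writer out = new BufferedWriter(new FileWriter(args[1]));
            parser.parse(new File(args[0]), new XmlToText(out));
            out.close();
        }
    }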

However, I must say that I am a complete "newbie" where XML is
concerned, and those decisions were made after much reading and
after discussions with others who are supposedly a bit better at
XML than I am. Hence my two questions:

* Does this architecture look good to you?
Windows Server 2003, WebSphere 5.1.1, Java and SAX

* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Sun as the processing
platform, especially given that, in all, around 40 GB of data will
have to be processed each day?

Thanks for your help.
 

Jürgen Kahrs

Benjamin said:
* Do you have any idea of the performance of this architecture?
Wouldn't it be better to choose Unix or Sun as the processing
platform, especially given that, in all, around 40 GB of data will
have to be processed each day?

This depends on how you process the files. If you choose to start
a new process for each file, then you might run into problems with
the amount of available RAM in your host. If 1000 files are being
processed at any given instant, and 1000 processes handle them,
then the per-process overhead also adds up 1000-fold.
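
One long-running process that reuses a single parser instance
avoids most of that overhead. A minimal sketch in Java (the
DefaultHandler here is only a stand-in for the real converter):

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // One process, one parser: the same SAXParser instance handles
    // every file, so process-startup and parser-creation costs are
    // paid once, not once per file.
    public class BatchConvert {
        public static void main(String[] args) throws Exception {
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler(); // stand-in
            for (String name : args) {
                parser.reset();   // clear state before reusing
                parser.parse(new File(name), handler);
            }
        }
    }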
 

Benjamin Bécar

Thanks for your answer,

I intend to process the files one after the other, and not to use
threads. Why? Because threads would be harder to develop with, and
thus slower to deliver, which I do not want.
Do you think the application will be able to deal with those 40 GB
of data if they are processed sequentially?
 

Joseph Kesselman

If you're talking about 40 GB XML files, I'd strongly suggest
considering a serious XML database (which, these days, includes
IBM's DB2), or keeping the main database in non-XML form and using
XML only for the extracted subsets that you're going to expose to
the outside world.
 

Benjamin Bécar

The problem is that this platform is only supposed to be a gateway
for text files (around 17 GB a day) and XML files (around 23 GB a
day). Moreover, the XML files have special features that "prevent"
easy storage in an XML database: they are zipped and signed. That
means I have no choice but to deal with those big archives one
after the other, as they arrive at the gateway.
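
My current idea is therefore to stream each entry of an archive
straight into the SAX parser, without unpacking anything to disk.
A rough sketch of what I mean, assuming standard ZIP archives;
signature verification is deliberately left out, and the handler
is only a placeholder:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.FilterInputStream;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // Feed every XML entry of a ZIP archive directly into a SAX
    // parser. Signature checking is left out of this sketch.
    public class ZipGateway {
        public static void main(String[] args) throws Exception {
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler(); // placeholder
            ZipInputStream zip = new ZipInputStream(
                new BufferedInputStream(new FileInputStream(args[0])));
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().endsWith(".xml")) {
                    parser.reset();
                    // Shield the ZIP stream so the parser cannot
                    // close it after the first entry.
                    InputStream one = new FilterInputStream(zip) {
                        public void close() { /* keep zip open */ }
                    };
                    parser.parse(one, handler);
                }
                zip.closeEntry();
            }
            zip.close();
        }
    }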
 

Jürgen Kahrs

Benjamin said:
I intend to process the files one after the other, and not to use
threads.

This is a conservative design decision. In production environments
it is generally a good idea to choose proven designs instead of
futuristic ones.

Benjamin said:
Why? Because threads would be harder to develop with, and thus
slower to deliver, which I do not want.

Indeed.

Benjamin said:
Do you think the application will be able to deal with those 40 GB
of data if they are processed sequentially?

The rule of thumb is that a good SAX parser can parse about
10 MB/s. 40 GB of data should therefore take about 4000 seconds.
This is about one hour of CPU time for the traffic of a day. If
there is little other overhead in your software system, then
you've got some headroom left with just one server (with just one
CPU).
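
Of course, 10 MB/s is only a rule of thumb; the honest answer is
to measure on the target machine with a representative sample
file. A quick-and-dirty sketch:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // Rough SAX throughput check: parse one sample file many
    // times and report MB/s.
    public class SaxBench {
        public static void main(String[] args) throws Exception {
            File sample = new File(args[0]);
            int runs = 1000;
            SAXParser parser =
                SAXParserFactory.newInstance().newSAXParser();
            DefaultHandler handler = new DefaultHandler();
            long start = System.currentTimeMillis();
            for (int i = 0; i < runs; i++) {
                parser.reset();
                parser.parse(sample, handler);
            }
            long millis = System.currentTimeMillis() - start;
            double mb = runs * sample.length() / (1024.0 * 1024.0);
            System.out.println(mb / (millis / 1000.0) + " MB/s");
        }
    }

With your many small files, per-file setup costs will matter at
least as much as raw parse speed, so benchmark with a small sample
file, not only with a big one.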
 

Jürgen Kahrs

Benjamin said:
Moreover, the XML files have special features that "prevent" easy
storage in an XML database: they are zipped and signed. That means
I have no choice but to deal with those big archives one after the
other, as they arrive at the gateway.

Now I understand why you opted for a sequential approach.
For the prototyping phase, the following tool may be useful in
building a first implementation of the processing pipeline:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

I know of at least one developer (Andrew Schorr) who
has built a production environment with this tool.
But his constraints differed a bit from yours.
Andrew uses a Solaris server. He has XML files as
large as 1 GB and has to import them into the
PostgreSQL database.
 

Benjamin Bécar

Thank you very much for this information; I'm going to look into
it right now. And please excuse my previous mail, which may have
sounded a bit "stubborn", as I did not explain my reasons.
 

Benjamin Bécar

Jürgen Kahrs wrote:
This is about one hour of CPU time for the traffic of a day. If
there is little other overhead in your software system, then
you've got some headroom left with just one server (with just one
CPU).

That doesn't seem like much time to me, which is a good thing. But
could you tell me what kind of hardware you assumed for this
estimate?

Thanks.
 
