Large XML files

Discussion in 'XML' started by jdev8080, Dec 20, 2005.

  1. jdev8080

    jdev8080 Guest

    We are looking at creating large XML files containing binary data
    (encoded as base64) and passing them to transformers that will parse
    and transform the data into different formats.

    Basically, we have images that have associated metadata and we are
    trying to develop a unified delivery mechanism. Our XML documents may
    be as large as 1GB and contain up to 100,000 images.
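
    For concreteness, a rough sketch of how we might produce such a document
    with a streaming writer (Java StAX here; the <images>/<image>/<data>
    element names are only placeholders, not our real schema):

    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Base64;
    import java.util.List;

    public class ImageBundleWriter {
        public static void main(String[] args) throws Exception {
            // placeholder input files; in reality this would be ~100,000 images
            List<Path> images = List.of(Path.of("img001.jpg"), Path.of("img002.jpg"));
            try (var out = new BufferedOutputStream(new FileOutputStream("bundle.xml"))) {
                XMLStreamWriter xml = XMLOutputFactory.newInstance()
                        .createXMLStreamWriter(out, "UTF-8");
                xml.writeStartDocument("UTF-8", "1.0");
                xml.writeStartElement("images");
                for (Path p : images) {
                    xml.writeStartElement("image");        // placeholder element name
                    xml.writeAttribute("name", p.getFileName().toString());
                    xml.writeStartElement("data");
                    // base64-encode the binary payload; large images could be
                    // chunked instead of being read fully into memory
                    xml.writeCharacters(Base64.getEncoder()
                            .encodeToString(Files.readAllBytes(p)));
                    xml.writeEndElement();                 // </data>
                    xml.writeEndElement();                 // </image>
                }
                xml.writeEndElement();                     // </images>
                xml.writeEndDocument();
                xml.close();
            }
        }
    }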

    My question is, has anyone done anything like this before?

    What are the performance considerations?

    Do the current parsers support this size of XML file?

    Has anyone used fast infoset for this type of problem?

    Is there a better way to deliver large sets of binary files (e.g. zip
    files or something like that)?

    Any input would be great. If there is a better board to post this on,
    please let me know.

    Thx,

    Bret
     
    jdev8080, Dec 20, 2005
    #1

  2. jdev8080 wrote:

    > Basically, we have images that have associated metadata and we are
    > trying to develop a unified delivery mechanism. Our XML documents may
    > be as large as 1GB and contain up to 100,000 images.
    >
    > My question is, has anyone done anything like this before?


    Yes, Andrew Schorr told me that he processes files
    of this size. After some experiments with Pyxie, he
    now uses xgawk with the XML extension of GNU Awk.

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/

    > What are the performance considerations?


    Andrew stores each item in a separate XML file and
    then concatenates all the XML files into one large file,
    often larger than 1 GB. My own performance measurements
    tell me that a modern PC should parse about 10 MB/s.

    > Do the current parsers support this size of XML file?


    Yes, but probably only SAX-like parsers.
    DOM-like parsers have to store the complete file
    in memory and are therefore limited by the amount
    of available memory. In practice, no DOM parser to date
    is able to read XML files larger than about 500 MB. If I am
    wrong about this, I bet that someone will correct me.
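
    For example, a minimal SAX sketch in Java (the <image> element name is
    just an assumption about the document structure); the handler sees each
    element as it streams past and keeps nothing else in memory:

    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.File;

    public class ImageCounter extends DefaultHandler {
        long count = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            if ("image".equals(qName)) {   // assumed element name
                count++;
            }
        }

        public static void main(String[] args) throws Exception {
            ImageCounter handler = new ImageCounter();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new File("bundle.xml"), handler);
            System.out.println("images seen: " + handler.count);
        }
    }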

    > Is there a better way to deliver large sets of binary files (i.e. zip
    > files or something like that)?


    I store such files in .gz format. When reading them, it
    is a good idea _not_ to unzip them to disk. Use gzip to produce
    a stream of data which will be immediately processed by
    the SAX parser:

    gzip -dc large_file.xml.gz | parser ...

    The advantage of this approach is that at any instant
    only part of the file occupies space in memory. This is
    extremely fast, and your server can run a hundred such
    processes on each CPU in parallel.
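
    If the parser runs inside a JVM, the same idea works without the shell
    pipe: decompress on the fly with GZIPInputStream and hand the stream
    straight to the SAX parser (a sketch, assuming Java; the file name and
    handler are placeholders):

    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.util.zip.GZIPInputStream;

    public class GzipSaxPipe {
        public static void main(String[] args) throws Exception {
            // only a small window of the decompressed document is ever in memory
            try (var in = new GZIPInputStream(new BufferedInputStream(
                    new FileInputStream("large_file.xml.gz")))) {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(in), new DefaultHandler() {
                            // plug real element handling in here
                        });
            }
        }
    }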
     
    Jürgen Kahrs, Dec 20, 2005
    #2

  3. Jimmy Zhang

    Jimmy Zhang Guest

    You can also try VTD-XML (http://vtd-xml.sf.net), which uses about
    1.3~1.5x the size of the XML file in memory. Currently it only supports
    file sizes up to 1GB, so if you have 2GB of physical memory you can load
    everything into memory and perform random access on it like DOM (with
    DOM, of course, you would get an OutOfMemory exception). Support for
    larger files is on the way.
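
    A rough sketch of the usage pattern in Java (the XPath and element names
    are assumptions about your document; check the VTD-XML docs for the
    exact API details):

    import com.ximpleware.AutoPilot;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;

    public class VtdSketch {
        public static void main(String[] args) throws Exception {
            VTDGen vg = new VTDGen();
            // parseFile builds the VTD index in memory (roughly 1.3~1.5x the
            // document size); the second argument enables namespace awareness
            if (vg.parseFile("bundle.xml", true)) {
                VTDNav vn = vg.getNav();
                AutoPilot ap = new AutoPilot(vn);
                ap.selectXPath("/images/image");   // assumed structure
                int count = 0;
                while (ap.evalXPath() != -1) {
                    count++;                       // random access via vn is possible here
                }
                System.out.println("images: " + count);
            }
        }
    }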
     
    Jimmy Zhang, Jan 8, 2006
    #3
