seeking in large xml file using sax 2

  • Thread starter Thomas Jäger

Thomas Jäger

Hi,

I have a large XML file (~120MB) and do not want to read the whole file at
once. Is it possible to seek in the file and start parsing from that point
using SAX 2?

Thanks
Thomas
 
Christophe Vanfleteren

Roedy said:
SEE I told you this would happen! XML needs to be put on a diet and
made faster to parse.

see http://mindprod.com/jgloss/xml.html
from that page:

There is no mechanism to describe the types of the data. To XML, everything
is a string. There is no way to specify a field must be numeric, that it
needs two decimal places, that it must represent a date in some range, that
it must not have accented letters, that it be restricted to certain
punctuation, or be one of a certain set of legal values. There are scores
of tack-ons trying to fix this and other shortcomings turning the simple
XML into a tower of Babel.

-> That's what XML Schema mostly solves. You have datatypes, constraints,
regexes, ...
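
To make the XML Schema point concrete, here is a minimal sketch using the standard javax.xml.validation API. The schema, the <price> element and the PriceSchemaCheck class are made up for illustration; they show one way to express the "numeric, with two decimal places" constraint from the quoted page, not the only one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class PriceSchemaCheck {
    // Hypothetical schema: <price> must be a non-negative decimal
    // with at most two digits after the decimal point.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='price'>"
      + "<xs:simpleType>"
      + "<xs:restriction base='xs:decimal'>"
      + "<xs:fractionDigits value='2'/>"
      + "<xs:minInclusive value='0'/>"
      + "</xs:restriction>"
      + "</xs:simpleType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Returns true if the document is well-formed and satisfies the schema.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this schema, PriceSchemaCheck.isValid("<price>12.50</price>") passes, while "<price>12.505</price>" and "<price>abc</price>" are rejected.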

XML uses an ugly syntax with gratuitous punctuation. #IMPLIED really means
optional. #PCDATA means string. <!ATTLIST means attributes.

-> That is a problem with DTDs, not XML itself: DTDs are ugly, not
powerful, and not written in XML themselves. Once again, XML Schema solves
these problems.

One possible candidate for the XML replacement job is the Java serialised
object format. It can handle just about any data structure imaginable. It
is platform independent. It has a simple DTD -- Java source code for the
corresponding class. Some claim it is Java-only. Not so. It is no more
difficult for C++ to parse than any other similar newly concocted protocol.
It is not tied to any hardware or OS. It is just that Java has a head start
implementing it. Java can implement it with no extra overhead.

-> Funny when you consider that, for long-term persistence, Sun recommends
that you save your beans using XMLEncoder. That is the problem with
Java serialisation: binary compatibility has issues, and XML solves those.
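
For reference, a minimal sketch of a round trip through java.beans.XMLEncoder and XMLDecoder. The XmlPersistenceSketch class and its method names are made up for illustration. The usage below encodes a standard collection rather than a custom bean, only to keep the example self-contained.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

final class XmlPersistenceSketch {
    // Writes the object as an XML document describing how to rebuild it.
    static byte[] toXml(Object value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(value);
        } // close() flushes and writes the closing tag of the document
        return out.toByteArray();
    }

    // Reads the first object back from an XMLEncoder document.
    static Object fromXml(byte[] xml) {
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(xml))) {
            return decoder.readObject();
        }
    }
}
```

Usage: passing new ArrayList<>(Arrays.asList("alpha", "beta")) to toXml and the result to fromXml should give back an equal list. The intermediate bytes are readable XML that names the class and the public calls needed to rebuild the object, rather than its private field layout.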


You're right on the performance issues, of course :)
 
Jacob

Roedy said:
SEE I told you this would happen! XML needs to be put on a diet and
made faster to parse.

see http://mindprod.com/jgloss/xml.html

I like your description. This is in line with
my own XML experiences, and I use XML with great
care.

Where I used to pass a pointer to a memory
location (4 bytes) for data transfer I now see
myself wrapping my objects into human-readable
ASCII format, packing dense numeric values into
their decimal string representations, adding
lots of unnecessary tags, then serializing the
whole lot, passing it on to the network, deserializing,
and reconstructing the objects at the other end
(which is often the same machine and process
you started from)...
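
A rough sketch of the size overhead described above, comparing one double in its raw 8-byte form with the same value written as decimal text inside a tag. The EncodingSizeSketch class and the <value> tag name are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class EncodingSizeSketch {
    // Size of the raw IEEE 754 representation of a double.
    static int binarySize(double value) {
        return ByteBuffer.allocate(Double.BYTES).putDouble(value).array().length;
    }

    // Size of the same value as decimal text wrapped in <tag>...</tag>.
    static int xmlSize(String tag, double value) {
        String xml = "<" + tag + ">" + value + "</" + tag + ">";
        return xml.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For Math.PI this gives 8 bytes in binary against 32 bytes for <value>3.141592653589793</value>, before counting any parsing cost on the receiving side.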

BTW: Why does it need to be <tag>...</tag>.
Wouldn't <tag>...</> be sufficient? Tags
are nested after all.

And why doesn't XML support binary numeric
data? (or does it?)
 
Roedy Green

BTW: Why does it need to be <tag>...</tag>.
Wouldn't <tag>...</> be sufficient? Tags
are nested after all.

If the format were verified, all the end tags could be replaced by a
single end marker. They are there purely for the convenience of
humans.

Humans only look at these things during debugging. Why have such a
fluffy format that requires so much labour to parse in production?

Why not have a fluffy debugging format that is interconvertible with a
streamlined production format?

It would be more compact to send, and faster to parse. Your library
could produce/eat either. In production the debugging versions would
never even be loaded.

If a production run failed, you could still dump the message in fluffy
format.

What sort of production format am I thinking of?
http://mindprod.com/projhtmlcompactor.html
 
