Java and huge XML file to be parsed

Dimitri Maziuk · Jun 24, 2004

Roedy Green sez:

Think of what fraction of the
planet's XML or HTML documents would pass a complete W3C validation
suite, perhaps under 1%. Using a binary format solves that problem in
one fell swoop with the additional benefits of:

1. more compact, faster download.
2. faster processing.
3. tighter specification.
4. fewer people have to understand it.
5. simpler classes needed to process it, important in handhelds.

That is assuming

0. software that translated source into binary works correctly:
we know it doesn't. And when it doesn't we get to the interesting
part: failure modes. HTML browser can fail to "View source" and
user will still see the content. Binary browser?

1. binary representation is not necessarily more compact. E.g.
using double-byte characters vs. single-byte + charset header.

2. nobody cares about processing speed. The bottleneck is network
I/O, not CPU speed. What we do care about is byte ordering, original
word sizes, and all other fun stuff you need to deal with when
getting raw bytes over the wire.

3. non-issue as there's no reason why text markup format specs
must necessarily be less tight than binary format specs. What
happened in Real Life when Nutscrape, Microshaft, and whathaveyou
added feechoorz to their software and then shoved them up HTML
specs would happen with any format, binary, shminary.

4. non-issue. Your own estimate is that 1% of HTML is good, ergo
only 1% of webshite designers and authors of HTML editing software
understand HTML. Ergo, they don't _have_ to understand it already,
obscuring the format further won't change anything.

(Obviously, the assumption that people will make better $foo if
they don't understand $foo is in itself rather amusing. E.g.
people would make better cars if they didn't understand how cars
work.)

5. who said anything about classes? You can process HTML with sed:
s/<[^>]*>//g will give you nice plain text output, and you can add
bells and whistles as appropriate for your hardware.
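
For readers following along in Java, the same crude tag-stripping idea can be sketched as below. The class and method names are made up for illustration. The pattern uses a character class so that each tag is matched on its own rather than everything between the first `<` and the last `>`. Like the sed one-liner, it is not a real HTML parser: comments, scripts, entities, and a `>` inside an attribute value will all trip it up.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches one tag at a time: '<', then any run of non-'>' characters, then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Remove everything that looks like a tag and keep the text in between.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>")); // Hello, world!
    }
}
```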

Furrfu
Dima

Roedy Green · Jun 25, 2004

0. software that translated source into binary works correctly:
we know it doesn't.

The odds of it working are extremely high. A bug will soon be noticed
and fixed because there are so many other programs cross checking it.

The odds of a human doing it manually perfectly are extremely low. It
is the sort of mind-numbing task computers excel at.

Roedy Green · Jun 25, 2004

1. binary representation is not necessarily more compact. E.g.
using double-byte characters vs. single-byte + charset header.

Then use a binary representation with UTF-8 (single-byte for ASCII)
or some other even more compact encoding such as Huffman coding.
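
The size argument on both sides is easy to check. The sketch below (class and method names are illustrative only) counts the bytes a string takes in UTF-8 and in UTF-16. ASCII-heavy markup is half the size in UTF-8, while some non-Latin text is actually smaller in UTF-16, so which one is "more compact" depends on the data.

```java
import java.nio.charset.StandardCharsets;

class EncodingSize {
    // Number of bytes needed to store s as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store s as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "<item>hello</item>";
        String cjk = "\u65e5\u672c\u8a9e"; // three CJK characters
        System.out.println(utf8Length(ascii) + " vs " + utf16Length(ascii)); // 18 vs 36
        System.out.println(utf8Length(cjk) + " vs " + utf16Length(cjk));     // 9 vs 6
    }
}
```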

The point is when you don't worry about making the format convenient
for humans you can make it optimally convenient for computers, i.e.
some optimal combination of:

fast to process,
compact to transport,
processible with small amounts of RAM.

Grant Wagner · Jun 25, 2004

Jezuch said:
Roedy Green wrote:

This one is *the* problem. People are lazy. Imagine what would happen if you
developed something like this and said to them "it's all fine, but you have
to use THIS tool". I presume that no one would bother to get it...

Actually, the problem isn't with laziness, it's with being told what to do, and
how to do it.

If I wrote perfectly acceptable HTML in Notepad, and then was told "that's fine,
but you have to use *this* tool to do it all again because you can't import what
you've done", I'd either tell them to go stuff it, or I'd find a way to upload
my hand-coded Notepad version, even if it meant writing a "compiler" to turn my
perfectly acceptable HTML into whatever tokenized mish-mash-mess was "required".

HTTP is _all_ text/byte-stream, it's what allows me to do:

print "Content-Type: text/plain\n\n";
print "this\n";
print "is\n";
print "a\n";
print "new\n";
print "line";

in Perl and have it come out on the browser correctly.

And thank <insert your chosen deity here> they did it that way is all I have to
say.

Roedy Green · Jun 25, 2004

If I wrote perfectly acceptable HTML in Notepad, and then was told "that's fine,
but you have to use *this* tool to do it all again because you can't import what
you've done",

Nobody is stopping you from using it, there is just one more step. It
is no different than being told you must put your letter in an
envelope before posting it.

Roedy Green · Jun 25, 2004

Nobody is stopping you from using it, there is just one more step. It
is no different than being told you must put your letter in an
envelope before posting it.

For communication there has to be SOME standard. Think of it this way.
You are infringing on MY rights by insisting I use a fluffy, badly
specified error-prone format. You are further deliberately trying to
drive me mad by putting malformed HTML on your website that crashes my
browser.

I should, in the American tradition, SUE you for damages, pain and
suffering.

Dale King · Apr 15, 2006

Hello, Roedy Green !

You said:
Why use an ancient tool like that? It is like doing data entry with
NOTEPAD.

And what specifically is wrong with allowing someone to edit it
with the simplest of tools? That isn't even an option with a
binary format.

For heaven sake. Surely we could create an editor that
created, edited and searched a compact XML-like representation that
made it IMPOSSIBLE to create syntax errors and almost correct data.

Sure we could create an editor for each and every format out
there, but that would sure be a lot of work. Each format would
have its own editor. And we also end up duplicating that work
when we want to transform from one format to another.

Or we can use XML where the parser or editor only has to be
created once. The parser already exists and there are XML editors
that do just what you describe. And transformation from one
format to another is easy as well.

So if you want to keep reinventing the wheel feel free. The rest
of the world has got better things to do.

It is not as though we failed to notice what a MESS HTML became from
lack of such a representation. The idiots took the worst features of
HTML.

No they didn't. The primary problems with HTML are that it is
about presentation, is not well-formed, and is not validated.
None of these is true of XML.

It is amazing that such an IDIOTIC format caught on.

Quite understandable why it caught on.

Dale King · Apr 15, 2006

Hello, Roedy Green !

You said:
have to parse.
ARRGH. That file is probably 20 times the size it would be if stored
in some sensible format.

There seems to be an underlying false assumption by the OP and
probably by Roedy. The fact that it is 215 MB on disk does not
mean that the in-memory version of it will be anywhere near that
large. When using a DOM style tool you have control over the
objects created.
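
One way to keep the in-memory footprint small in Java, regardless of file size, is to stream the document instead of building a full tree. The sketch below uses StAX (`javax.xml.stream`), which is a substitute technique and not necessarily the tool the poster had in mind. The element name `row` and the class name are invented for the example. Only the current parse event is held in memory, so the application decides which objects, if any, to create.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StreamingCount {
    // Count elements with the given local name without building a document tree.
    static int countElements(Reader in, String name) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances to the next parse event and returns its type.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(name)) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<rows><row id='1'/><row id='2'/><row id='3'/></rows>";
        System.out.println(countElements(new StringReader(xml), "row")); // 3
    }
}
```

For a real 215 MB file, the `StringReader` would be replaced by a buffered file reader or input stream; the loop itself stays the same.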

It will take 100 times as long to parse as
some sensible binary format.

I find that to be a gross exaggeration, but neither of us has
hard data. I would also say that the development time for coding
the parser and editor for a binary format is 100 times that of
using XML. Although that development time can be lessened by
using XML to describe and edit the data then transforming that
into the binary format.
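
A minimal sketch of that XML-to-binary transformation, under invented assumptions: the `<point x="..." y="..."/>` element, the record layout (two 4-byte integers per point), and the class name are all hypothetical and only illustrate the idea. `DataOutputStream` always writes big-endian, which fixes the byte order of the output.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmlToBinary {
    // Hypothetical layout: each <point x=".." y=".."/> becomes two big-endian 4-byte ints.
    static byte[] convert(String xml) throws XMLStreamException, IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("point")) {
                    out.writeInt(Integer.parseInt(reader.getAttributeValue(null, "x")));
                    out.writeInt(Integer.parseInt(reader.getAttributeValue(null, "y")));
                }
            }
        } finally {
            reader.close();
        }
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<points><point x=\"1\" y=\"2\"/><point x=\"300\" y=\"-4\"/></points>";
        System.out.println(convert(xml).length); // 16
    }
}
```

The XML stays the editable, human-readable source, and the binary form is a build artifact that can be regenerated whenever the XML changes.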

PHOOEY ON XML! I knew this insanity would happen.

What insanity? That it would actually be put to good use despite
your objections?
