Vanilla XML parser

Malcolm McLean

As part of the binary image processing library work I had to load some XML
files. There doesn't seem to be a lightweight XML parser available on the
web. Plenty of bloated ones that require full-fledged installs, but nothing
you can just grab and compile.

So I decided to write a vanilla one myself. It did the job and loaded my
data files, and it weighs in at only a single average-length source file.
That's partly because it only does ASCII, doesn't handle defined entities
or special tags, and so on.

But is there the potential for this to be developed into a lightweight,
single-file parser? There's also a question for Jacob here. The structure is
simply a tree. How would the container library map onto XML?
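In memory it's just a node type along these lines (illustrative names, not
my actual declarations):

    /* hypothetical node layout: tag, flat attribute list, any text
       content, plus links to first child and next sibling */
    typedef struct xmlnode {
        char *tag;              /* element name */
        char **attributes;      /* name/value pairs, NULL-terminated */
        char *text;             /* character data, if a leaf */
        struct xmlnode *child;  /* first child element */
        struct xmlnode *next;   /* next sibling */
    } XMLNODE;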
 
Les Cargill

Malcolm said:
As part of the binary image processing library work I had to load some XML
files. There doesn't seem to be a lightweight XML parser available on the
web. Plenty of bloated ones that require full-fledged installs, but nothing
you can just grab and compile.

If expat doesn't cut it, try ezxml.

http://ezxml.sourceforge.net/
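From memory, using it is only a few lines (the file and element names here
are made up; check the ezxml docs for the exact API):

    #include <stdio.h>
    #include "ezxml.h"

    int main(void)
    {
        /* parse a file into a tree, walk it, free it */
        ezxml_t doc, img;
        doc = ezxml_parse_file("images.xml");
        if (!doc)
            return 1;
        for (img = ezxml_child(doc, "image"); img; img = img->next) {
            const char *w = ezxml_attr(img, "width");
            printf("width=%s\n", w ? w : "(none)");
        }
        ezxml_free(doc);
        return 0;
    }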
 
BGB

Malcolm said:
As part of the binary image processing library work I had to load some XML
files. There doesn't seem to be a lightweight XML parser available on the
web. Plenty of bloated ones that require full-fledged installs, but nothing
you can just grab and compile.

So I decided to write a vanilla one myself. It did the job and loaded my
data files, and it weighs in at only a single average-length source file.
That's partly because it only does ASCII, doesn't handle defined entities
or special tags, and so on.

But is there the potential for this to be developed into a lightweight,
single-file parser? There's also a question for Jacob here. The structure is
simply a tree. How would the container library map onto XML?

I did something similar: I wrote a simple lightweight parser/printer and
basic tree-manipulation code (partly similar to DOM).


IIRC, I initially wrote it to support XML-RPC. As such, it uses a similar
subset to that used by both XML-RPC and XMPP (although it does support
namespaces).


Later it was used as the AST format for my first BGBScript VM interpreter
(later versions used S-expression ASTs). Actually, the first interpreter
directly walked/interpreted these ASTs, but it was soon changed to
"word-code"; later interpreters switched to bytecode with variable-length
coding for many values, and more recent ones use threaded code rather than
directly interpreting the bytecode.

It was later used as the core of my C compiler project, where XML trees
basically served as the main AST structure, and the API was tweaked a bit
to better suit compiler-related tasks. (Of course, the C compiler wasn't
very good and subsequently "decayed" mostly into a code-processing and
metadata-mining tool.) Sadly, I have been unable to really justify the
effort that would be required to "revive" it as a full C compiler (probably
using bytecode which would run in a VM, most likely executed as threaded
code).


or such...
 
Rui Maciel

Malcolm said:
So I decided to write a vanilla one myself. It did the job and loaded my
data files, and it weighs in at only a single average-length source file.
That's partly because it only does ASCII, doesn't handle defined entities
or special tags, and so on.

If the parser fails to parse valid XML then it isn't exactly an XML parser.
This isn't necessarily good or bad, much less a problem. Nevertheless, there
is a reason why XML parsers tend not to be tiny.

But is there the potential for this to be developed into a lightweight,
single-file parser? There's also a question for Jacob here.

I suspect that the question you need to answer first is the following: do
you really need XML to begin with? In other words, is there no other data
format that fits your needs, is easier to parse, and that you are able to
adopt? JSON springs to mind, for example.

Following that, do you really need a parser that supports an entire generic
format in its full glory, or do you only need to parse a language which is a
subset of that format? In your post you mentioned that you developed your
parser as part of an image processing library. This leads me to suspect that
you might not really need to support every single feature of XML, or of any
other generic data format. That being the case, your job is made a bit
simpler: you would only need to specify your data format and write a parser
for it. As a consequence, your parser will be significantly lighter and more
efficient.


Rui Maciel
 
Malcolm McLean

On Sunday, 26 August 2012 at 10:55:10 UTC+1, Rui Maciel wrote:
If the parser fails to parse valid XML then it isn't exactly an XML parser.
This isn't necessarily good or bad, much less a problem. Nevertheless, there
is a reason why XML parsers tend not to be tiny.
The data has to be in XML format, to interchange with other programs. But
it's very simple: a few optional text fields, a few compulsory text fields,
width and height, and an M x N variable list of cells. Then you can have a
list of any number of images in the file.
A generic parser seemed the way to go, rather than hardcoding the fields in
the low-level code. But I didn't want to throw a 5 MB executable at it. It
seems to me that the majority of XML files are like this: you've got tags,
attributes, and text in your leaf tags. Recursively defined "entities",
CDATA elements, and all the other niggles are rare.
 
BGB

Malcolm said:
The data has to be in XML format, to interchange with other programs. But
it's very simple: a few optional text fields, a few compulsory text fields,
width and height, and an M x N variable list of cells. Then you can have a
list of any number of images in the file.
A generic parser seemed the way to go, rather than hardcoding the fields in
the low-level code. But I didn't want to throw a 5 MB executable at it. It
seems to me that the majority of XML files are like this: you've got tags,
attributes, and text in your leaf tags. Recursively defined "entities",
CDATA elements, and all the other niggles are rare.

Yeah.

If the parser can parse the basic tag syntax (and maybe namespace syntax,
and maybe CDATA), plus the "?xml" and "!DOCTYPE" tags, then that is pretty
much the entirety of XML that most programs need to support for most
documents.

Given that "?xml" and "!DOCTYPE" are mostly just formalities anyway, many
documents omit them (either not identifying the document type at all, or
identifying it via a namespace).
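Skipping them when they do appear is only a few lines anyway. A rough
sketch (not my actual code; note it naively stops at the first '>', so a
DOCTYPE with an internal subset would need more care):

    /* skip "<?...?>" and "<!...>" prologue tags, return a pointer
       to the first real element (or end of input) */
    char *skip_prolog(char *s)
    {
        for (;;) {
            while (*s == ' ' || *s == '\t' || *s == '\r' || *s == '\n')
                s++;
            if (s[0] == '<' && (s[1] == '?' || s[1] == '!')) {
                while (*s && *s != '>')
                    s++;
                if (*s == '>')
                    s++;
            } else {
                return s;
            }
        }
    }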


So, a lot depends...
 
Malcolm McLean

On Tuesday, 28 August 2012 at 04:55:13 UTC+1, BGB wrote:
If the parser can parse the basic tag syntax (and maybe namespace syntax,
and maybe CDATA), plus the "?xml" and "!DOCTYPE" tags, then that is pretty
much the entirety of XML that most programs need to support for most
documents.
That was my thinking. Allowing recursive definition of "entities"
complicates things considerably. Maybe it should have a patch to support
CDATA.
Given that "?xml" and "!DOCTYPE" are mostly just formalities anyway, many
documents omit them (either not identifying the document type at all, or
identifying it via a namespace).
It's always an issue, what to do with badly formatted input. The idea
behind the XML spec is that you can open the file in binary, then work out
whether it is ASCII, big-endian Unicode, or little-endian Unicode by
examining the first few bytes. But I'm not currently supporting Unicode,
and the second file I had to parse didn't have the ?xml tag.
 
BGB

Malcolm said:
That was my thinking. Allowing recursive definition of "entities"
complicates things considerably. Maybe it should have a patch to support
CDATA.

My parser ignores user-defined entities (all the others are hard-coded),
and basically hard-codes CDATA.
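The hard-coded part amounts to a small table. A rough sketch (not my actual
code):

    #include <string.h>

    /* decode one of the five predefined entities at *ps; returns the
       replacement character and advances *ps past the entity, or
       returns 0 if unrecognized */
    int decode_entity(const char **ps)
    {
        static const struct { const char *name; int ch; } tab[] = {
            { "&amp;",  '&'  }, { "&lt;",   '<'  }, { "&gt;", '>' },
            { "&quot;", '\"' }, { "&apos;", '\'' },
        };
        size_t i, n;
        for (i = 0; i < sizeof tab / sizeof tab[0]; i++) {
            n = strlen(tab[i].name);
            if (strncmp(*ps, tab[i].name, n) == 0) {
                *ps += n;
                return tab[i].ch;
            }
        }
        return 0;
    }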

It's always an issue, what to do with badly formatted input. The idea
behind the XML spec is that you can open the file in binary, then work out
whether it is ASCII, big-endian Unicode, or little-endian Unicode by
examining the first few bytes. But I'm not currently supporting Unicode,
and the second file I had to parse didn't have the ?xml tag.

Well, as noted: many files omit them.


My code generally assumes UTF-8 unless stated otherwise.

It is possible to detect the BOM in the case of Unicode, and this much may
be required for UTF-16 files.

So text loading could look like:
BOM detected? Read as UTF-16 or UTF-32 (maybe just repack as UTF-8);
looks like valid UTF-8? Parse as UTF-8;
otherwise? Guess (probably ASCII + codepages).
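In code form the sniffing step is tiny. A rough sketch (not my actual code;
buf holds the first bytes of the file):

    #include <stddef.h>

    typedef enum {
        ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UNKNOWN
    } encoding;

    /* identify the encoding from a leading BOM, if any; a fuller
       version would also check FF FE 00 00 for UTF-32LE first */
    encoding sniff_bom(const unsigned char *buf, size_t len)
    {
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
            return ENC_UTF8;      /* EF BB BF: UTF-8 BOM */
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
            return ENC_UTF16LE;   /* FF FE: little-endian */
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
            return ENC_UTF16BE;   /* FE FF: big-endian */
        return ENC_UNKNOWN;       /* no BOM: try UTF-8, else guess */
    }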

My code largely ignores the existence of codepages, and even if I did use
them it is not clear I would go much beyond "Extended ASCII" / CP437 and/or
CP1252 anyway. (I was once tempted by CP437 for the sake of more readily
addressable box-drawing characters, but ended up opting for plain ASCII
characters instead.) These would just follow the CP -> UTF-8 route anyway.

Although the BOM is not strictly required for UTF-16 or UTF-32, it is
usually present (text editors tend to emit it and often depend on its
presence).


In the situations I use my stuff for, it would be fairly unlikely to
encounter anything outside the ASCII range, and even then, anything not
UTF-8 encoded.

The text editors I have also only really give a few options for saving:
ASCII, UTF-8, and UTF-16 (LE or BE).

Another supports saving using codepages, but not readily (it involves a
sub-menu and going through a dialog box to enable these options for "Save
As"), with ASCII, UTF-8, and UTF-16 as the only readily available options.

Yeah, I think there is a pattern here...
 
jennywilkinson96

Malcolm said:
As part of the binary image processing library work I had to load some XML
files. There doesn't seem to be a lightweight XML parser available on the
web. Plenty of bloated ones that require full-fledged installs, but nothing
you can just grab and compile.

So I decided to write a vanilla one myself. It did the job and loaded my
data files, and it weighs in at only a single average-length source file.
That's partly because it only does ASCII, doesn't handle defined entities
or special tags, and so on.

But is there the potential for this to be developed into a lightweight,
single-file parser? There's also a question for Jacob here. The structure is
simply a tree. How would the container library map onto XML?

--
Vanilla XML Parser
http://www.malcolmmclean.site11.com/www

I thought Notepad++ was pretty bland and basic. I have used Liquid Studio
in comparison, and that is deliberately not vanilla:
http://www.liquid-technologies.com/xml-editor.aspx
 
John Bode

Malcolm said:
As part of the binary image processing library work I had to load some XML
files. There doesn't seem to be a lightweight XML parser available on the
web. Plenty of bloated ones that require full-fledged installs, but nothing
you can just grab and compile.

So I decided to write a vanilla one myself. It did the job and loaded my
data files, and it weighs in at only a single average-length source file.
That's partly because it only does ASCII, doesn't handle defined entities
or special tags, and so on.

But is there the potential for this to be developed into a lightweight,
single-file parser? There's also a question for Jacob here. The structure is
simply a tree. How would the container library map onto XML?

I wrote my own XML parser for a project some years ago. It even
worked...mostly...after a couple of iterations.

If I had it to do over again I'd just go with expat and be done with it.
I'll take a little code bloat if it saves me some headaches in the end.
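The whole dance is only about a page, too. Something like this, from memory
(double-check against the current expat docs):

    #include <stdio.h>
    #include <expat.h>

    static void XMLCALL on_start(void *ud, const XML_Char *name,
                                 const XML_Char **atts)
    {
        (void)ud; (void)atts;
        printf("element: %s\n", name);
    }

    static void XMLCALL on_end(void *ud, const XML_Char *name)
    {
        (void)ud; (void)name;
    }

    int main(void)
    {
        char buf[4096];
        int done = 0;
        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetElementHandler(p, on_start, on_end);
        while (!done) {
            size_t n = fread(buf, 1, sizeof buf, stdin);
            done = (n < sizeof buf);  /* EOF (or read error) */
            if (XML_Parse(p, buf, (int)n, done) == XML_STATUS_ERROR) {
                fprintf(stderr, "parse error at line %lu\n",
                        (unsigned long)XML_GetCurrentLineNumber(p));
                break;
            }
        }
        XML_ParserFree(p);
        return 0;
    }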
 
