Vanilla XML parser

Discussion in 'C Programming' started by Malcolm McLean, Aug 23, 2012.

  1. As part of the binary image processing library work I had to load some XML
    files. There doesn't seem to be a lightweight XML parser available on the web.
    Plenty of bloated ones that require full-fledged installs. But nothing you
    can just grab and compile.

    So I decided to write a vanilla one myself. It did the job, and loaded my
    data files. But it only weighs in as a single average-length source file.
    That's partly because it only does ascii, doesn't handle defined entities
    or special tags, and so on.

    But is there the potential for this to be developed into a lightweight, single
    file parser? There's also a question for Jacob here. The structure is simply
    a tree. How would the container library map on to XML?

    --
    Vanilla XML Parser
    http://www.malcolmmclean.site11.com/www
     
    Malcolm McLean, Aug 23, 2012
    #1

  2. Malcolm McLean

    Les Cargill Guest

    Malcolm McLean wrote:
    > As part of the binary image processing library work I had to load some XML
    > files. There doesn't seem to be a lightweight XML parser available on the web.
    > Plenty of bloated ones that require full-fledged installs. But nothing you
    > can just grab and compile.
    >


    If expat doesn't cut it, try ezxml.

    http://ezxml.sourceforge.net/

    > So I decided to write a vanilla one myself. It did the job, and loaded my
    > data files. But it only weighs in as a single average-length source file.
    > That's partly because it only does ascii, doesn't handle defined entities
    > or special tags, and so on.
    >
    > But is there the potential for this to be developed into a lightweight, single
    > file parser? There's also a question for Jacob here. The structure is simply
    > a tree. How would the container library map on to XML?
    >



    --
    Les Cargill
     
    Les Cargill, Aug 24, 2012
    #2

  3. Malcolm McLean

    BGB Guest

    On 8/23/2012 3:45 PM, Malcolm McLean wrote:
    > As part of the binary image processing library work I had to load some XML
    > files. There doesn't seem to be a lightweight XML parser available on the web.
    > Plenty of bloated ones that require full-fledged installs. But nothing you
    > can just grab and compile.
    >
    > So I decided to write a vanilla one myself. It did the job, and loaded my
    > data files. But it only weighs in as a single average-length source file.
    > That's partly because it only does ascii, doesn't handle defined entities
    > or special tags, and so on.
    >
    > But is there the potential for this to be developed into a lightweight, single
    > file parser? There's also a question for Jacob here. The structure is simply
    > a tree. How would the container library map on to XML?
    >


    I did similar as well.

    wrote a simple lightweight parser/printer and basic tree-manipulation
    code (partly similar to DOM).
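    A DOM-like tree for this kind of lightweight parser can be little more
    than a linked node structure. A minimal sketch (names and layout are
    illustrative, not the actual code being described):

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Minimal DOM-like node: element name, text payload, and
       first-child / next-sibling links. Attributes omitted for brevity. */
    typedef struct xmlnode {
        char *name;             /* element tag name */
        char *text;             /* text content, if a leaf */
        struct xmlnode *child;  /* first child */
        struct xmlnode *next;   /* next sibling */
    } xmlnode;

    static char *xstrdup(const char *s)
    {
        char *p = malloc(strlen(s) + 1);
        if (p)
            strcpy(p, s);
        return p;
    }

    static xmlnode *node_new(const char *name)
    {
        xmlnode *n = calloc(1, sizeof *n);
        if (n && name)
            n->name = xstrdup(name);
        return n;
    }

    /* Append a child to the end of parent's child list. */
    static void node_addchild(xmlnode *parent, xmlnode *child)
    {
        xmlnode **p = &parent->child;
        while (*p)
            p = &(*p)->next;
        *p = child;
    }
    ```

    The first-child/next-sibling shape keeps every node the same size
    regardless of how many children it has, which suits a small parser.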


    IIRC, I initially wrote it to support XML-RPC.
    as-such, it uses a similar subset to that used by both XML-RPC and XMPP
    (although it does support namespaces).


    later it was used as the AST format for my first BGBScript VM
    interpreter (later versions used S-Expression ASTs). (actually, the
    first interpreter directly walked/interpreted these ASTs, but was soon
    changed to "word-code", and later interpreters switched to bytecode with
    a variable-length coding for many values, and more recently use threaded
    code rather than directly interpreting the bytecode).

    it was later utilized as the core of my C compiler project, where
    basically XML trees were used as the main AST structure, and the API was
    tweaked some to be better suited to compiler-related tasks.
    (of course, the C compiler wasn't very good and subsequently "decayed"
    mostly into a code-processing and metadata mining tool). sadly, I have
    been unable to really justify the effort that would be required to "revive"
    it as a full C compiler (probably using bytecode which would run in a
    VM, and most likely executed as threaded-code).


    or such...
     
    BGB, Aug 24, 2012
    #3
  4. Malcolm McLean

    Rui Maciel Guest

    Malcolm McLean wrote:

    > So I decided to write a vanilla one myself. It did the job, and loaded my
    > data files. But it only weighs in as a single average-length source file.
    > That's partly because it only does ascii, doesn't handle defined entities
    > or special tags, and so on.


    If the parser fails to parse valid XML then it isn't exactly an XML parser.
    This isn't necessarily good or bad, much less a problem. Nevertheless,
    there is a reason why XML parsers tend not to be tiny.


    > But is there the potential for this to be developed into a lightweight,
    > single file parser? There's also a question for Jacob here.


    I suspect that the question you need to answer first is the following: do
    you really need XML to begin with? In other words, isn't there any other
    data format that fits your needs, is easier to parse and you are able to
    adopt? JSON springs to mind, for example.

    Following that, do you really need a parser that supports an entire generic
    format in its full glory, or do you only need to parse a language which is a
    subset of that format? In your post you mentioned that you developed your
    parser as part of an image processing library. This leads me to suspect that
    you might not really need to support every single feature of XML, or any
    other generic data format. That being the case then your job is made a bit
    simpler: you would only need to specify your data format and write a parser
    for it. As a consequence, your parser will be significantly lighter and
    more efficient.


    Rui Maciel
     
    Rui Maciel, Aug 26, 2012
    #4
  5. On Sunday, August 26, 2012 10:55:10 UTC+1, Rui Maciel wrote:
    > Malcolm McLean wrote:
    >
    >
    > > So I decided to write a vanilla one myself. It did the job, and loaded my
    > > data files. But it only weighs in as a single average-length source file.
    > > That's partly because it only does ascii, doesn't handle defined entities
    > > or special tags, and so on.

    >
    > If the parser fails to parse valid XML then it isn't exactly an XML parser.
    > This isn't necessarily good or bad, much less a problem. Nevertheless,
    > there is a reason why XML parsers tend not to be tiny.
    >

    The data has to be in XML format, to interchange with other programs.
    But it's very simple - a few optional text fields, a few compulsory text
    fields, width and height, and an M x N variable list of cells. Then you
    can have a list of any number of images in the file.
    It seemed a generic parser was the way to go, rather than hardcoding the
    fields in the low-level code. But I didn't want to throw a 5 MB executable
    at it. It seems to me that the majority of XML files are like this - you've
    got tags, attributes, and text in your leaf tags. Recursively defined
    "entities", CDATA sections, and all the other niggles are rare.
     
    Malcolm McLean, Aug 26, 2012
    #5
  6. Malcolm McLean

    BGB Guest

    On 8/26/2012 12:49 PM, Malcolm McLean wrote:
    > On Sunday, August 26, 2012 10:55:10 UTC+1, Rui Maciel wrote:
    >> Malcolm McLean wrote:
    >>
    >>
    >>> So I decided to write a vanilla one myself. It did the job, and loaded my
    >>> data files. But it only weighs in as a single average-length source file.
    >>> That's partly because it only does ascii, doesn't handle defined entities
    >>> or special tags, and so on.

    >>
    >> If the parser fails to parse valid XML then it isn't exactly an XML parser.
    >> This isn't necessarily good or bad, much less a problem. Nevertheless,
    >> there is a reason why XML parsers tend not to be tiny.
    >>

    > The data has to be in XML format, to interchange with other programs.
    > But it's very simple - a few optional text fields, a few compulsory text
    > fields, width and height and an M x N variable list of cells. Then you
    > can have a list of any number of images in the file.
    > But it seemed a generic parser was the way to go, not to hardcode the fields
    > in the low level code. But I didn't want to throw a 5 MB executable at it.
    > But it seems to me that the majority of XML files are like this - you've
    > got tags, attributes, and text in your leaf tags. Recursively defined
    > "entities" and CDATA elements and all the other niggles are rare.


    yeah.

    if the parser can parse the basic tag syntax (and, maybe, namespace
    syntax, and maybe CDATA), and the "?xml" and "!DOCTYPE" tags, then this
    is pretty much the entirety of XML that most programs need to support
    for most documents.

    in many cases, given "?xml" and "!DOCTYPE" are mostly just formalities
    anyways, many documents omit them (either not identifying the document
    type at all, or identifying it via a namespace).
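    Treating "?xml" and "!DOCTYPE" as skippable formalities takes only a few
    lines up front. A rough sketch (error handling omitted; it assumes no '>'
    inside a doctype's internal subset, and does not handle comments):

    ```c
    #include <string.h>

    /* Advance past leading whitespace, any "<?xml ...?>" declaration, and
       any "<!DOCTYPE ...>" tag, returning a pointer to the first real
       element. A sketch, not a full prolog parser. */
    static const char *skip_prolog(const char *p)
    {
        for (;;) {
            while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
                p++;
            if (strncmp(p, "<?", 2) == 0) {          /* <?xml ... ?> */
                const char *q = strstr(p, "?>");
                if (!q) return p;
                p = q + 2;
            } else if (strncmp(p, "<!", 2) == 0) {   /* <!DOCTYPE ... > */
                const char *q = strchr(p, '>');
                if (!q) return p;
                p = q + 1;
            } else {
                return p;
            }
        }
    }
    ```

    Since many documents omit the prolog entirely, the function simply falls
    through to the first element when there is nothing to skip.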


    so, a lot depends...
     
    BGB, Aug 28, 2012
    #6
  7. On Tuesday, August 28, 2012 04:55:13 UTC+1, BGB wrote:
    > On 8/26/2012 12:49 PM, Malcolm McLean wrote:
    >
    > if the parser can parse the basic tag syntax (and, maybe, namespace
    > syntax, and maybe CDATA), and the "?xml" and "!DOCTYPE" tags, then this
    > is pretty much the entirety of XML that most programs need to support
    > for most documents.
    >

    That was my thinking. Allowing recursive definition of "entities" complicates
    things considerably. Maybe it should have a patch to support CDATA.
    > in many cases, given "?xml" and "!DOCTYPE" are mostly just formalities
    > anyways, many documents omit them (either not identifying the document
    > type at all, or identifying it via a namespace).
    >

    It's always an issue, what to do with badly formatted input. The idea behind
    the XML spec is that you can open the file in binary, then work out whether it
    is ascii, big-endian unicode or little-endian unicode, by examining the first
    few bytes. But I'm not currently supporting unicode, and the second file I
    had to parse didn't have the ?xml tag.

    --
    Check out the vanilla XML parser
    http://www.malcolmmclean.site11.com/www
     
    Malcolm McLean, Aug 28, 2012
    #7
  8. Malcolm McLean

    BGB Guest

    On 8/28/2012 6:13 AM, Malcolm McLean wrote:
    > On Tuesday, August 28, 2012 04:55:13 UTC+1, BGB wrote:
    >> On 8/26/2012 12:49 PM, Malcolm McLean wrote:
    >>
    >> if the parser can parse the basic tag syntax (and, maybe, namespace
    >> syntax, and maybe CDATA), and the "?xml" and "!DOCTYPE" tags, then this
    >> is pretty much the entirety of XML that most programs need to support
    >> for most documents.
    >>

    > That was my thinking. Allowing recursive definition of "entities" complicates
    > things considerably. Maybe it should have a patch to support CDATA.


    my parser ignores user-defined entities (all others are hard-coded), and
    basically hard-codes CDATA.
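    Hard-coding the five predefined entities is only a handful of
    comparisons. A minimal sketch along those lines (function name is
    illustrative, not the actual code being described):

    ```c
    #include <string.h>

    /* Decode one predefined XML entity starting at s (which points at '&').
       On success, writes the decoded character to *out and returns the
       number of input bytes consumed; returns 0 if the entity is not one
       of the five built-ins. User-defined entities and numeric character
       references are deliberately ignored here. */
    static size_t decode_entity(const char *s, char *out)
    {
        static const struct { const char *name; char ch; } tab[] = {
            { "&amp;",  '&'  }, { "&lt;",   '<'  }, { "&gt;", '>' },
            { "&quot;", '\"' }, { "&apos;", '\'' },
        };
        size_t i;
        for (i = 0; i < sizeof tab / sizeof tab[0]; i++) {
            size_t n = strlen(tab[i].name);
            if (strncmp(s, tab[i].name, n) == 0) {
                *out = tab[i].ch;
                return n;
            }
        }
        return 0;
    }
    ```

    The character-data loop calls this whenever it sees '&'; a return of 0
    can either be treated as an error or passed through verbatim.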


    >>
    >> in many cases, given "?xml" and "!DOCTYPE" are mostly just formalities
    >> anyways, many documents omit them (either not identifying the document
    >> type at all, or identifying it via a namespace).
    >>

    > It's always an issue, what to do with badly formatted input. The idea behind
    > the XML spec is that you can open the file in binary, then work out whether it
    > is ascii, big-endian unicode or little-endian unicode, by examining the first
    > few bytes. But I'm not currently supporting unicode, and the second file I
    > had to parse didn't have the ?xml tag.
    >


    well, as noted: many files omit them.


    my code generally assumes UTF-8 unless stated otherwise.

    it is possible to detect the BOM in the case of Unicode, and this much
    may be required for UTF-16 files.

    so, text loading could look like:
    BOM detected? read as UTF-16 or UTF-32 (maybe just repack as UTF-8);
    looks like valid UTF-8? parse as UTF-8;
    otherwise? guess (probably ASCII + codepages).
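    The byte-sniffing described above is straightforward. A sketch of the
    BOM-detection step (the UTF-32 checks must come before the UTF-16 ones,
    since the UTF-32 LE BOM begins with the UTF-16 LE one):

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Identify an encoding from a leading byte-order mark.
       Returns a static string naming the encoding, or "unknown",
       in which case the caller would try UTF-8 validation and then
       fall back to ASCII/codepage guessing. */
    static const char *sniff_bom(const unsigned char *buf, size_t len)
    {
        if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
        if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
        if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
        if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
        if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
        return "unknown";
    }
    ```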

    my code largely ignores the existence of codepages, and even if I did
    use them it is not clear I would go much beyond "Extended ASCII" / CP437
    and/or CP1252 anyways (I was once tempted by CP437 for sake of
    more-readily-addressable box-drawing characters, but ended up opting
    with plain ASCII characters instead). these would just follow the CP ->
    UTF-8 route anyways.

    although the BOM is not strictly required for UTF-16 or 32, it is
    usually present (text editors tend to emit it and often depend on its
    presence).


    in the situations I use my stuff for, it would be fairly unlikely to
    encounter anything outside of ASCII range, and even then, something not
    UTF-8 encoded.

    the text editors I have also only really give a few options for saving:
    ASCII, UTF-8, and UTF-16 (LE or BE).

    another supports saving using codepages, but not readily (it involves a
    sub-menu and going through a dialog box to enable these options for
    "Save As"), with ASCII, UTF-8, and UTF-16 as the only "readily
    available" options.

    yeah, I think there is a pattern here...
     
    BGB, Aug 28, 2012
    #8
  9. Malcolm McLean

    Guest

    On Thursday, August 23, 2012 9:45:16 PM UTC+1, Malcolm McLean wrote:
    > As part of the binary image processing library work I had to load some XML
    > files. There doesn't seem to be a lightweight XML parser available on the web.
    > Plenty of bloated ones that require full-fledged installs. But nothing you
    > can just grab and compile.
    >
    > So I decided to write a vanilla one myself. It did the job, and loaded my
    > data files. But it only weighs in as a single average-length source file.
    > That's partly because it only does ascii, doesn't handle defined entities
    > or special tags, and so on.
    >
    > But is there the potential for this to be developed into a lightweight, single
    > file parser? There's also a question for Jacob here. The structure is simply
    > a tree. How would the container library map on to XML?
    >
    > --
    > Vanilla XML Parser
    > http://www.malcolmmclean.site11.com/www

    I thought Notepad++ was pretty bland and basic. I have used Liquid Studio in comparison, and that is deliberately not vanilla: http://www.liquid-technologies.com/xml-editor.aspx
     
    , Sep 17, 2012
    #9
  10. Malcolm McLean

    John Bode Guest

    On Thursday, August 23, 2012 3:45:16 PM UTC-5, Malcolm McLean wrote:
    > As part of the binary image processing library work I had to load some XML
    > files. There doesn't seem to be a lightweight XML parser available on the web.
    > Plenty of bloated ones that require full-fledged installs. But nothing you
    > can just grab and compile.
    >
    > So I decided to write a vanilla one myself. It did the job, and loaded my
    > data files. But it only weighs in as a single average-length source file.
    > That's partly because it only does ascii, doesn't handle defined entities
    > or special tags, and so on.
    >
    > But is there the potential for this to be developed into a lightweight, single
    > file parser? Ther's also a question for Jacob here. The structure is simply
    > a tree. How would the container library map on to XML?
    >


    I wrote my own XML parser for a project some years ago. It even
    worked...mostly...after a couple of iterations.

    If I had it to do over again I'd just go with expat and be done with it.
    I'll take a little code bloat if it saves me some headaches in the end.
     
    John Bode, Sep 19, 2012
    #10
  11. Malcolm McLean

    Bill Davy Guest

    "John Bode" <> wrote in message
    news:...
    > On Thursday, August 23, 2012 3:45:16 PM UTC-5, Malcolm McLean wrote:
    >> As part of the binary image processing library work I had to load some
    >> XML
    >> files. There doesn't seem to be a lightweight XML parser available on the
    >> web.
    >> Plenty of bloated ones that require full-fledged installs. But nothing
    >> you
    >> can just grab and compile.
    >>
    >> So I decided to write a vanilla one myself. It did the job, and loaded my
    >> data files. But it only weighs in as a single average-length source file.
    >> That's partly because it only does ascii, doesn't handle defined entities
    >> or special tags, and so on.
    >>
    >> But is there the potential for this to be developed into a lightweight,
    >> single
    >> file parser? Ther's also a question for Jacob here. The structure is
    >> simply
    >> a tree. How would the container library map on to XML?
    >>

    >
    > I wrote my own XML parser for a project some years ago. It even
    > worked...mostly...after a couple of iterations.
    >
    > If I had it to do over again I'd just go with expat and be done with it.
    > I'll take a little code bloat if it saves me some headaches in the end.



    I found TinyXML (http://sourceforge.net/projects/tinyxml/) worked for me.
     
    Bill Davy, Sep 20, 2012
    #11
