Discussion in 'C Programming' started by Lynn McGuire, Sep 27, 2013.

  1. Yes, but massive great libraries which half the time dwarf the tasks
    they're called upon to deal with.

    In Malcolm's case, he is parsing XML which he himself has generated. So the
    program can be minimal (but retaining the advantage of being in a standard
    format).

    I recently was thinking about using XML (to store a program structure), but
    have had unhappy dealings with XML before. I instead created my own format
    (the file doesn't need to go anywhere) to do the same thing. The parser is
    about 150 lines of code.
    BartC, Sep 30, 2013

  2. I was directed to ezXML at one point, and it's a nice, simple XML
    parser. It only gets you to "leaf nodes" and then you have to interpret
    it into a coherent whole by traversing child/sibling pointers.

    Just in case you are interested in/tired of maintaining your own.

    Les Cargill, Sep 30, 2013

  3. I've just looked at the source.

    #include <unistd.h>
    #include <sys/types.h>

    So unfortunately it won't compile on every C compiler.
    It's probably perfectly good if you can guarantee a Unix-like system, but it can't replace the vanilla xml parser.
    Malcolm McLean, Sep 30, 2013
  4. That depends on what you mean by "massive"; there shouldn't be any
    problem loading typical files of reasonable size on 32-bit systems or
    even ridiculously large files on 64-bit systems. Let the OS's virtual
    memory system deal with mapping data in and out as needed.
    So you load it, extract the data you need, and then unload it.
    A dependency on a common XML library shouldn't be a big deal; most
    programs I deal with depend on a dozen or more libraries, of which I
    usually have most or all of them installed already anyway. If you're
    really worried about it, just include the library's code within your
    distribution (either source or binary). That's still better than having
    to write and maintain your own XML parser.

    Stephen Sprunk, Sep 30, 2013
  5. Once you depend on half a dozen libraries, adding an extra dependency isn't
    a big change. But going from self-contained code to having an external dependency
    is a big step in the wrong direction. It means that your code is unlikely
    to remain useful for very long, because sooner or later one of the externals
    is going to break, become unavailable, require a proprietary compiler, or
    otherwise cause the program to fail. An ffmpeg build broke on Microsoft, for
    example. I suspect it was done deliberately.

    Have you actually read the vanilla xml parser? It took about a day to write.
    In my view, that's time well invested. It won't handle arbitrarily complex
    xml that depends on all the difficult areas of the standard. But any time I
    need a config file or a small database, I can simply include this one file.
    If something goes wrong, the source is simple enough for any competent C
    programmer to understand it in an hour or so.

    Once code passes a certain level of complexity, you need to maintain it:
    it uses various constructs, or depends on externals, which break. But code
    written with simple, standard C functions generally doesn't need that.
    Malcolm McLean, Sep 30, 2013
  6. As I said, if you're worried about that, just include the library's code
    within your distribution so there's no external build-time or run-time
    dependency. Many projects do that, especially on Windows.
    I don't doubt that; I'm just not a fan of reinventing the wheel.

    Stephen Sprunk, Oct 1, 2013
  7. How long is "very long"? I've had code that depends on curses run pretty much
    consistently from the late 80s through today.

    Seebs, Oct 1, 2013
  8. In my case, we had a machine, a single 386 that served 20 of us, that ran
    curses. But I bought a DOS machine for myself, and I had a whole 386
    processor just for me. Fortunately the instructor had told us to write
    a little abstraction layer over curses, so I rewrote it for DOS, and I could
    shuttle code between my home machine and the class.
    Then I got a Unix machine for my first job, so curses would have been
    useful again. Except that it also ran X. Most of the users were artists;
    they didn't like curses-type interfaces. Next job was Windows and games
    console based.
    Then I used a Linux machine for my PhD. So I could have dusted off my
    old curses code. But it was long forgotten by then. However, I've still got
    a file that loads in a bitmap, which I wrote when I had the first 386. If you look at the Baby X
    resource compiler (http://www.maclolmmclean/site11.com/www/BabyX/BabyX.html )
    you can see it. It's still going strong.
    Malcolm McLean, Oct 1, 2013
  9. Hello group! :)

    This URL is definitely not transporting me to BabyX. I tried
    `http://www.malcolmmclean.site11.com/www' that you mentioned in one of
    the previous posts, but that was also a dud. It was only when I went to
    `http://www.malcolmmclean.site11.com/' and clicked on the 'www/' that I
    was able to get somewhere.

    I'm using Midori.
    Aleksandar Kuktin, Oct 1, 2013
  10. And most of them would be half-baked parsers for whichever unspecified
    subset of XML the program uses for its output (i.e. not actually XML at
    all, just something similar enough to confuse people).

    If you're going to support XML, you need to accept any file which matches
    the schema, not just those which the application generated itself.
    Otherwise, you can't edit the files with standard tools, can't create
    files with standard libraries, etc. IOW, you may as well just fwrite() the
    in-memory representation to disc.
    Nobody, Oct 2, 2013
  11. The output might be in a standard format (or it might not, if it's being
    coded to match an implementation rather than a specification). But does
    that really help if you can't do anything with the file beside load it
    straight back in? If the parser won't read the result of editing the file
    with e.g. xsltproc, there isn't a great deal of point in having it in
    XML in the first place.
    Nobody, Oct 2, 2013
  12. You edit the file with a text editor. Or the program writes it to disk in
    xml format, and you can open and examine it with a text editor.
    For the vast majority of programs, it's possible to produce a file which
    is valid xml but which defeats the loader in some way. If you declare a
    clue as 2 across when 2 is at the head of a down-only word, for example,
    a crossword program is going to have to reject your file. So if it also
    rejects it if you set up some complex scheme involving namespaces and
    entities, which it can't understand because it doesn't support those features,
    it's not a qualitative change.

    The alternative to using xml is to declare a specific syntax, as used by
    the Microsoft resource compiler, which pre-dates xml. So with the MS resource
    compiler you declare a bitmap

    disk1 BITMAP "disk.bmp"

    there's nothing too bad about this. But the user's going to be saying "does
    the id need to be in quotes, or just the path? How do I comment out a line?
    Are lines terminated by semi-colons? How do I continue a line if I've got a
    long path?" If you use xml, it's easier, because people know the conventions.

    It is a potential problem that some third party may automatically generate
    a Baby X resource compiler script in valid xml which looks to the reader
    like a well-formed script file, but which in fact the Baby X compiler can't
    parse. But it's unlikely to affect many people, and it's unlikely to be
    hard to overcome. Ultimately if there's a demand that it accepts fully-featured
    xml, then of course I'd consider replacing the vanilla xml parser with a
    bigger module.
    Malcolm McLean, Oct 2, 2013
  13. Or use a simple, well known and supported format such as JSON. My
    "full" JSON parser is about 200 lines of code.
    Ian Collins, Oct 2, 2013
  14. From xmlsoft.org:

    The latest versions of libxslt can be found on the xmlsoft.org server. (NOTE that you need the libxml2, libxml2-devel, libxslt and libxslt-devel packages installed to compile applications using libxslt.) Igor Zlatkovic is now the maintainer of the Windows port; he provides binaries. CSW provides Solaris binaries, and Steve Ball provides Mac OS X binaries.

    It's not that I don't appreciate what these people are doing. But this is
    totally inappropriate for reading in a 1K or so list of maybe 20 images
    and fonts. You only use that library if you have a need for heavy-duty
    processing, when I'm sure it's good and often a sensible option.
    Malcolm McLean, Oct 3, 2013
  15. I have seen some spectacular examples of that genre.

    Seebs, Oct 3, 2013
  16. Robert Wessel wrote:

    People sometimes forget the "X" in XML stands for "eXtensible", I don't
    think there's an equivalent for "subsetable" :)
    Ian Collins, Oct 3, 2013
  17. XML looks deceptively simple at first sight. Maybe it actually is simple, as
    far as syntax goes. So why the need for all these complicated libraries? And
    what could xsltproc do to an XML file that would render it unreadable to a
    simple parser?

    You've got start-tags, end-tags, and attributes; what else is there?
    Unsupported character escapes or minor things like that? It might be easier
    to add support for that than struggle with someone else's over-the-top
    library.
    And what would you want to do to the file anyway? The data will obviously
    only make sense to this specific application; if there's a problem with the
    content, that is going to be a problem whatever library is used to read it.
    BartC, Oct 3, 2013
  18. You've got the problems of character encodings, which is inherent in a world
    that's moving from ASCII to unicode as a default for information exchange.
    The vanilla xml parser supports only 8-bit chars, though if I extend it,
    unicode support is first on the list. However if you use unicode,
    you can't then embed string literals in calls to the parser, the wide
    character libraries may be unavailable, you can't read the output (I've no
    way of knowing whether a string of characters that is pure gibberish to me
    has been processed correctly or not, often even if it's displayed in the
    right font). It's a source of endless problems.

    Then they complicated the system with things like namespaces and entities.
    There's a famous exploit called "billion laughs" which defines an entity
    lol, then defines another entity as two of those, another as two of those
    (so four), and so on, until you break an average parser with a file
    containing a couple of hundred characters. The format would have been
    better, in my view, if those things had never been added to it, and in fact
    they're not needed for most applications. But a fully-featured parser has
    to support it.
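    For scale, a cut-down sketch of that entity-expansion trick (three doubling
    levels here; the real attack nests around ten entities, each repeated ten
    times, to reach about a billion strings):

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol  "lol">
  <!ENTITY lol2 "&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>
```

    Each level doubles the previous one, so the expanded size is exponential
    in the length of the file.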

    Then if the file is too big to fit in memory, or so big that the processor
    takes non-trivial time to run through it, you need a complex system for
    parsing it. There's a whole terminology for families of parsers that allow
    different types of access. Some legitimate uses of xml can be quite big.
    But of course, if you're just getting together a list of files and strings,
    as with the Baby X resource compiler, then it's most unlikely that someone's
    going to want to create a script with millions of items. So it's acceptable
    just to load the whole thing into memory at once then find elements by
    O(N) access functions.
    Malcolm McLean, Oct 3, 2013
  19. To an actual XML parser: nothing.

    The problem comes when people try to parse XML using a bunch of regexps
    which were obtained through trial and error (i.e. testing them on some
    sample XML files and tweaking them until they work on those test cases).

    That approach often leads to something which e.g. can't even handle
    whitespace in any context where it didn't occur in the sample files.
    Just getting those right is apparently too hard for some people. E.g.
    attributes could be in any order (many XML parsers store attributes in an
    associative array, so order is unlikely to be preserved), if whitespace is
    allowed it can be any combination of whitespace characters, etc.
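    For instance, these two elements mean the same thing to a conforming
    parser, but a regexp tuned to the first spelling will usually miss the
    second (a made-up fragment):

```xml
<image src="disk.bmp" width="32"/>
<image
    width = "32"
    src   = "disk.bmp" ></image>
```

    Attribute order, whitespace around "=", and self-closing versus explicit
    end-tags are all free choices the producer can make.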
    A good example is performing "bulk" processing, e.g. a simple search and
    replace in many files (where the original application requires a dozen
    mouse clicks to load and save each file plus another half a dozen for each
    individual change).

    If the data is in XML, you just need to cook up an XSL transformation
    (or similar) then you can process all of the files with one command. Well,
    unless the application's "XML" parser can't actually read anything other
    than its own output, as you probably aren't going to find off-the-shelf
    XML tools which offer the option of restricting their output to that
    which can be read by John Doe's pseudo-XML subset.

    On the plus side, most of the real XML parsers were written by people who
    still have the scars from trying to deal with what either Netscape or
    Microsoft thought "HTML" meant. Consequently, they don't attempt to be
    fault-tolerant (this may seem like a good idea in theory, but in practice
    it means that every bug in a popular implementation ends up redefining the
    de-facto standard until it's so complex that writing a parser which can
    handle more than 50% of "HTML as deployed" is more work than the Apollo
    program).

    So at least we don't normally have to worry about the output being a
    superset of the standard (if it doesn't conform, hardly anything will
    parse it). We just have to worry about the hordes of strcmp-and-regexp
    parsers turning the de-facto standard into an ever-shrinking subset of the
    real thing.
    Nobody, Oct 3, 2013
  20. This isn't exactly true. When a generic parser is adopted, the need to write a custom parser
    doesn't go away. Instead, the only thing that is accomplished is that the job of writing a
    single parser is replaced with two jobs: implement and maintain a third-party component, and
    write a custom parser for the output of that generic parser. Whether it's through a schema
    definition and/or through a set of routines, you're going to write that second parser.

    This can only be true if you assume you don't have to parse the output of the generic parser,
    and even then it's still highly debatable. For example, a custom recursive descent parser for
    an INI-type document format can be written in less than 500 LoC, including a couple hundred LoCs
    for hand-written state tables. This is your complete parser, which performs all data
    validations you might wish for and handles any error with the document structure which you can
    come up with. If you use a parser generator then you'll be able to implement that parser with a
    fraction of those LoC.

    Rui Maciel
    Rui Maciel, Oct 4, 2013
