RFC: thoughts for a "streamlined" XML syntax variant...

Discussion in 'XML' started by BGB, May 11, 2012.

  1. BGB

    BGB Guest

    one issue with XML when it is used for structured data is its relative
    verbosity, especially in cases where it is entered by hand or read by a
    human (say, for debugging reasons, ...).

    so, the thought here would be to allow a "modest" syntax extension
    (probably would be limited to particular implementations which support
    the extension).


    more specifically, I was considering it as a possible extension feature
    to my own implementation, but have some doubts given that, yes, this
    would be a non-standard extension. note that there would probably be a
    switch to manually "enable" it, so as to avoid breaking compatibility
    by default. in my case, the current primary use is for things like
    compiler ASTs, where it competes some with the use of S-Expressions for
    ASTs (Lisp style, not the "Rivest" variant / name-hijack). note that
    these ASTs normally never leave the application which created them, so
    the impact of using a non-standard syntax when serializing them is
    likely fairly small.


    example, say that a person has an expression like:
    <if>
    <cond>
    <binary op="&lt;">
    <ref name="x"/>
    <number value="3"/>
    </binary>
    </cond>
    <then>
    <funcall name="foo">
    <args/>
    </funcall>
    </then>
    </if>

    representing, say, the AST of the statement "if(x<3)foo();".

    the parser and printer could use a more compact encoding, say:
    <if
    <cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>
    <then <funcall name="foo" <args/>>>
    >


    which would be regarded as functionally-equivalent to the prior
    expression (and would generate equivalent DOM trees when read back in).


    with the following rules:
    <tag>...</tag> and <tag/> are the same as before.

    while:
    <tag <...> ...>
    would use an alternate parsing strategy, where ">" is significant (since
    the prior tag didn't actually end), and indicates the end of the
    expression (the magic here would be seeing another "<" within a tag).

    similarly, maybe "<[[" could also be parsed as a shorthand for
    "<![CDATA[" as well (and would also match nicer with the closing bracket
    "]]>").


    note that it would be possible to mix them, as in:
    <foo> <bar <baz/>> </foo>
    and:
    <foo <bar> <baz/> </bar>>
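
    as a rough illustration, the rules above could be sketched in Python as
    a small recursive-descent reader. this is only a sketch under
    assumptions: the names parse_compact and shape are made up here, only
    elements and double-quoted attributes are handled (no text nodes,
    comments, CDATA, or namespaces), and it builds xml.etree.ElementTree
    elements as a stand-in for DOM nodes:

```python
import re
import xml.etree.ElementTree as ET
from html import unescape

NAME = re.compile(r'[A-Za-z_][\w.\-]*')
ATTR = re.compile(r'\s*([A-Za-z_][\w.\-]*)\s*=\s*"([^"]*)"')


def parse_compact(src):
    """Parse one element written in classic, compact, or mixed form."""
    elem, pos = _element(src, _skip_ws(src, 0))
    if _skip_ws(src, pos) != len(src):
        raise ValueError("trailing input at offset %d" % pos)
    return elem


def _skip_ws(src, pos):
    while pos < len(src) and src[pos].isspace():
        pos += 1
    return pos


def _element(src, pos):
    if src[pos:pos + 1] != '<':
        raise ValueError("expected '<' at offset %d" % pos)
    match = NAME.match(src, pos + 1)
    if not match:
        raise ValueError("expected a tag name at offset %d" % (pos + 1))
    elem = ET.Element(match.group())
    pos = match.end()
    while True:                      # zero or more name="value" pairs
        match = ATTR.match(src, pos)
        if not match:
            break
        elem.set(match.group(1), unescape(match.group(2)))
        pos = match.end()
    pos = _skip_ws(src, pos)
    if src.startswith('/>', pos):    # <tag/>: empty element, as before
        return elem, pos + 2
    if src.startswith('>', pos):     # <tag>...</tag>: classic form
        pos = _children(src, pos + 1, elem)
        close = '</' + elem.tag + '>'
        if not src.startswith(close, pos):
            raise ValueError("expected %s at offset %d" % (close, pos))
        return elem, pos + len(close)
    if src.startswith('<', pos):     # '<' inside a tag: compact form
        pos = _children(src, pos, elem)
        if not src.startswith('>', pos):
            raise ValueError("expected '>' at offset %d" % pos)
        return elem, pos + 1         # a bare '>' ends the expression
    raise ValueError("unexpected input at offset %d" % pos)


def _children(src, pos, parent):
    """Append child elements until something other than a start tag."""
    while True:
        pos = _skip_ws(src, pos)
        if src[pos:pos + 1] != '<' or src.startswith('</', pos):
            return pos
        child, pos = _element(src, pos)
        parent.append(child)


def shape(elem):
    """Reduce a tree to nested tuples so two trees can be compared."""
    return (elem.tag, sorted(elem.attrib.items()), [shape(c) for c in elem])
```

    with this sketch, the two mixed forms above read back as the same tree:
    shape(parse_compact('<foo> <bar <baz/>> </foo>')) is equal to
    shape(parse_compact('<foo <bar> <baz/> </bar>>')).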

    maybe a different "name", like "XEML" or similar, would also be a good
    idea, so as to reduce possible confusion.


    any thoughts or relevant information to look at?...
    BGB, May 11, 2012
    #1

  2. Peter Flynn

    Peter Flynn Guest

    On 11/05/12 18:40, BGB wrote:
    > one issue partly in the case of XML for its use in structured data
    > is its relative verbosity, especially in cases where it is entered by
    > hand or being read by a human (say, for debugging reasons, ...).


    I think this was expected to be a very rare case, which is why the spec
    says that terseness in XML markup is of minimal importance.

    > so, the thought here would be to allow a "modest" syntax extension
    > (probably would be limited to particular implementations which
    > support the extension).
    >
    > more specifically, I was considering it as a possible extension
    > feature to my own implementation, but have some doubts given that,
    > yes, this would be non-standard extension. note that there probably
    > would be a feature to manually "enable" it, such as to avoid
    > necessarily breaking compatibility.


    Switchable is good.

    > in my case, the current primary use is for things like compiler ASTs,
    > where it competes some with the use of S-Expressions for ASTs (Lisp
    > style, not the "Rivest" variant / name-hijack). note that these ASTs
    > normally never leave the application which created them, so the
    > impact of using a non-standard syntax when serializing them is likely
    > fairly small.
    >
    > example, say that a person has an expression like:
    > <if>
    > <cond>
    > <binary op="&lt;">
    > <ref name="x"/>
    > <number value="3"/>
    > </binary>
    > </cond>
    > <then>
    > <funcall name="foo">
    > <args/>
    > </funcall>
    > </then>
    > </if>
    >
    > representing, say, the AST of the statement "if(x>3)foo();".
    >
    > the parser and printer could use a more compact encoding, say:
    > <if
    > <cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
    > <then <funcall name="foo" <args/>>>


    This syntax (or very nearly) is already available in SGML:

    <!doctype if [
    <!element if - - (cond,then)>
    <!element cond - - (binary)>
    <!element binary - - (ref,number)>
    <!element number - - empty>
    <!element then - - (funcall)>
    <!element funcall - - (args)>
    <!element (args,ref) - - empty>
    <!attlist binary op cdata #required>
    <!attlist (ref,funcall) name cdata #required>
    <!attlist number value cdata #required>
    <!entity lt sdata "<">
    ]>
    <if<cond<binary op="&lt;"<ref name=x<number value="3"></></>
    <then<funcall name=foo<args></></></>

    > which would be regarded as functionally-equivalent to the prior
    > expression (and would generate equivalent DOM trees when read back in).
    >
    > with the following rules:
    > <tag>...</tag> and <tag/> are the same as before.
    >
    > while:
    > <tag <...> ...>
    > would use an alternate parsing strategy, where ">" is significant (since
    > the prior tag didn't actually end), and indicates the end of the
    > expression (the magic here would be seeing another "<" within a tag).
    >
    > similarly, maybe "<[[" could also be parsed as a shorthand for
    > "<![CDATA[" as well (and would also match nicer with the closing bracket
    > "]]>").
    >
    > note that it would be possible to mix them, as in:
    > <foo> <bar <baz/>> </foo>
    > and:
    > <foo <bar> <baz/> </bar>>
    >
    > maybe also a different "name" would be a good idea, like "XEML" or
    > similar would make sense, such as to reduce possible confusion.
    >
    > any thoughts or relevant information to look at?...


    I think you'd need a special editor: if the objective is to abbreviate
    the syntax, there is a delicate breakpoint between the denseness of the
    reduced syntax and the ability of the creator/user to understand it.

    What about writing up the method as a paper for the Balisage (markup)
    conference? That's really the place to discuss new syntaxes.

    ///Peter
    Peter Flynn, May 11, 2012
    #2

  3. There have been multiple suggestions for terser syntax, over the years
    since XML was released to the world. In general, they have failed
    contact with the real world -- they're harder to work with, and/or they
    aren't actually significantly more compact (especially when you remember
    that XML compresses wonderfully), and/or what you'd really want is a
    custom representation within your application (perhaps straight data
    structures) and XML only as the export/import interface to the rest of
    the world.
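
    The compression point is easy to check for a given dump. The sketch
    below uses Python's zlib on the two forms of the example from this
    thread; the helper name `sizes` and the 200-fold repetition (a crude
    stand-in for a large dump full of similar nodes) are assumptions for
    illustration, and real data will compress less neatly:

```python
import zlib

CLASSIC = (
    '<if><cond><binary op="&lt;"><ref name="x"/><number value="3"/>'
    '</binary></cond><then><funcall name="foo"><args/></funcall></then></if>\n'
)
COMPACT = (
    '<if <cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>'
    ' <then <funcall name="foo" <args/>>>>\n'
)


def sizes(text, repeat=200):
    """Return (raw_bytes, deflated_bytes) for `text` repeated `repeat` times."""
    raw = (text * repeat).encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


if __name__ == "__main__":
    for label, text in (("classic", CLASSIC), ("compact", COMPACT)):
        raw, packed = sizes(text)
        print(f"{label}: {raw} bytes raw, {packed} bytes deflated")
```

    Comparing the deflated sizes of the two forms, rather than the raw
    ones, shows how much of the terser syntax's advantage survives
    compression.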

    The W3C has looked at a number of "binary XML" representations. I
    believe there was a working group that was investigating trying to come
    up with something official. I'm not sure what its status is now; the
    idea still strikes me as an awkward compromise that is going to face too
    many conflicting goals.

    Finally: XML's greatest value is that there are lots of tools already in
    place that support it. This won't be true of any new syntax.

    Sorry, but I think there really isn't sufficient value here to make the
    idea worth pursuing.
    Joe Kesselman, May 11, 2012
    #3
  4. BGB

    BGB Guest

    On 5/11/2012 1:44 PM, Peter Flynn wrote:
    > On 11/05/12 18:40, BGB wrote:
    >> one issue partly in the case of XML for its use in structured data
    >> is its relative verbosity, especially in cases where it is entered by
    >> hand or being read by a human (say, for debugging reasons, ...).

    >
    > I think this was expected to be a very rare case, which is why the spec
    > says that terseness in XML markup is of minimal importance.
    >


    fair enough.

    I mostly use it for things like compiler ASTs, network protocols, and
    file-formats (generally structured-data).


    the forms of XML I currently use are:
    raw/plaintext XML;
    deflated plaintext XML;
    a binary format (similar to an "improved" version of WBXML, with a few
    more features and density improvements, both being byte-based).

    I have another format I could use, but going into it would likely be
    off-topic (it is a Huffman-compressed binary serialization format,
    currently used for sending messages over a TCP socket in a 3D game
    engine, but this doesn't have much in particular to do with XML, as the
    message format it is currently used with is S-Expression based, rather
    than XML based).

    but, yeah, I guess originally XML was intended for markup of mostly
    textual documents (like in HTML or similar), rather than for
    representing structured data (or being used for humans viewing said
    structured data as debugging output).

    I wonder if anyone ever really considered "scene-graph delta-update
    messages in a 3D FPS game" as a possible use-case for XML either?
    somehow I doubt it (I had intended to do this originally, despite
    eventually opting for a different representation for said deltas).

    even as such, I did end up aggressively compressing them (via a
    specialized encoding scheme), as otherwise the bandwidth usage would
    have been a bit steep for a typical end-user internet connection.


    >> so, the thought here would be to allow a "modest" syntax extension
    >> (probably would be limited to particular implementations which
    >> support the extension).
    >>
    >> more specifically, I was considering it as a possible extension
    >> feature to my own implementation, but have some doubts given that,
    >> yes, this would be non-standard extension. note that there probably
    >> would be a feature to manually "enable" it, such as to avoid
    >> necessarily breaking compatibility.

    >
    > Switchable is good.
    >


    yeah.


    >> in my case, the current primary use is for things like compiler ASTs,
    >> where it competes some with the use of S-Expressions for ASTs (Lisp
    >> style, not the "Rivest" variant / name-hijack). note that these ASTs
    >> normally never leave the application which created them, so the
    >> impact of using a non-standard syntax when serializing them is likely
    >> fairly small.
    >>
    >> example, say that a person has an expression like:
    >> <if>
    >> <cond>
    >> <binary op="&lt;">
    >> <ref name="x"/>
    >> <number value="3"/>
    >> </binary>
    >> </cond>
    >> <then>
    >> <funcall name="foo">
    >> <args/>
    >> </funcall>
    >> </then>
    >> </if>
    >>
    >> representing, say, the AST of the statement "if(x>3)foo();".
    >>
    >> the parser and printer could use a more compact encoding, say:
    >> <if
    >> <cond<binary op="&lt;"<ref name="x"/> <number value="3"/>>>>
    >> <then<funcall name="foo"<args/>>>

    >
    > This syntax (or very nearly) is already available in SGML:
    >
    > <!doctype if [
    > <!element if - - (cond,then)>
    > <!element cond - - (binary)>
    > <!element binary - - (ref,number)>
    > <!element number - - empty>
    > <!element then - - (funcall)>
    > <!element funcall - - (args)>
    > <!element (args,ref) - - empty>
    > <!attlist binary op cdata #required>
    > <!attlist (ref,funcall) name cdata #required>
    > <!attlist number value cdata #required>
    > <!entity lt sdata "<">
    > ]>
    > <if<cond<binary op="&lt;"<ref name=x<number value="3"></></>
    > <then<funcall name=foo<args></></></>
    >


    fair enough.


    >> which would be regarded as functionally-equivalent to the prior
    >> expression (and would generate equivalent DOM trees when read back in).
    >>
    >> with the following rules:
    >> <tag>...</tag> and<tag/> are the same as before.
    >>
    >> while:
    >> <tag<...> ...>
    >> would use an alternate parsing strategy, where ">" is significant (since
    >> the prior tag didn't actually end), and indicates the end of the
    >> expression (the magic here would be seeing another "<" within a tag).
    >>
    >> similarly, maybe "<[[" could also be parsed as a shorthand for
    >> "<![CDATA[" as well (and would also match nicer with the closing bracket
    >> "]]>").
    >>
    >> note that it would be possible to mix them, as in:
    >> <foo> <bar<baz/>> </foo>
    >> and:
    >> <foo<bar> <baz/> </bar>>
    >>
    >> maybe also a different "name" would be a good idea, like "XEML" or
    >> similar would make sense, such as to reduce possible confusion.
    >>
    >> any thoughts or relevant information to look at?...

    >
    > I think you'd need a special editor: if the objective is to abbreviate
    > the syntax, there is a delicate breakpoint between the denseness of the
    > reduced syntax and the ability of the creator/user to understand it.
    >


    I hadn't considered this case.
    if the code is being viewed/edited in a generic text editor (such as
    Notepad), it shouldn't make too much of a difference, but granted a
    specialized XML editor could very well get confused.

    in this case, though, I doubt that such a change would render the syntax
    unreadable (to humans), and it could reduce verbosity and sprawl
    somewhat (in intermediate data files spit out by the application), which
    is currently the main problem area (finding things in multi-MB files is
    hard enough as-is, even more so when the AST for a single function in a
    C-like syntax can span a fairly large number of pages).

    but, I don't think it would be too much of a different issue from that
    of a person trying to read S-Expressions, if using a more compact format.

    this is partly because a C-style (programming language) syntax is fairly
    information-dense, but when parsed into ASTs and then dumped as XML,
    there is a significant amount of expansion.


    > What about writing up the method as a paper for the Balisage (markup)
    > conference? That's really the place to discuss new syntaxes.
    >


    I don't know much about them, I hadn't heard of this before.


    > ///Peter
    >
    BGB, May 12, 2012
    #4
  5. BGB

    BGB Guest

    On 5/11/2012 3:08 PM, Joe Kesselman wrote:
    > There have been multiple suggestions for terser syntax, over the years
    > since XML was released to the world. In general, they have failed
    > contact with the real world -- they're harder to work with, and/or they
    > aren't actually significantly more compact (especially when you remember
    > that XML compresses wonderfully), and/or what you'd really want is a
    > custom representation within your application (perhaps straight data
    > structures) and XML only as the export/import interface to the rest of
    > the world.
    >


    in the case of the compiler ASTs, a DOM-like system was used internally,
    rather than raw structures.

    in this case, I have basically been doing it this way (at least in one
    branch of my stuff), since about 2004 (originally, the system was much
    closer to DOM, but has diverged somewhat over the years, mostly to
    improve usability and performance for these use cases).


    thus far, the external syntax (generally for debugging dumps) has been
    in traditional XML syntax.


    > The W3C has looked at a number of "binary XML" representations. I
    > believe there was a working group that was investigating trying to come
    > up with something official. I'm not sure what its status is now; the
    > idea still strikes me as an awkward compromise that is going to face too
    > many conflicting goals.
    >


    yes, but note the original stated purpose:
    mostly for humans looking over debugging dumps.


    a binary-XML format was not the goal in this case, since a human can't
    read binary XML. rather, it is to "optimize" how much page-up/page-down
    action is needed in Notepad and similar...

    trying to find and look at stuff in giant sprawling text files is kind
    of a pain.


    FWIW, I also use binary XML formats, but I consider this to be a
    different use-case.

    granted, in such a use-case, I guess it wouldn't actually do much harm
    if it were output-only, serving solely as a debug-dump format, rather
    than something which can be parsed back in.


    > Finally: XML's greatest value is that there are lots of tools already in
    > place that support it. This won't be true of any new syntax.
    >


    doesn't particularly matter in this case:
    I control nearly all of the code which would likely be used for dealing
    with it directly.

    so, the syntax would not likely be used for interchange between
    applications, and thus whether or not anyone else supports it is of much
    less importance.


    > Sorry, but I think there really isn't sufficient value here to make the
    > idea worth pursuing.



    as a standardized feature, maybe...

    I didn't mean for it to become a standard (or even necessarily that the
    W3C would notice, or care).

    I figured I would state it here to see what anyone thought, but don't
    actually expect any sort of widespread adoption.

    IOW: this was not intended as a "feature request"...
    BGB, May 12, 2012
    #5
  6. On 5/11/2012 8:29 PM, BGB wrote:
    > in the case of the compiler ASTs, a DOM-like system was used internally,
    > rather than raw structures.


    Personally I would do a custom datastructure and give it an XML
    serializer, or some other adapter layer that lets you view it in terms
    of an XML infoset -- because trying to shove things into DOM form is
    going to be much less memory-efficient and slower to access than a more
    dedicated representation would be.
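
    A minimal sketch of that arrangement in Python, assuming a tiny AST for
    the example from the start of the thread. The class names and the
    `to_xml` method are invented for illustration; the point is only that
    the tree lives in dedicated structures and XML appears at the boundary:

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Ref:
    name: str

    def to_xml(self):
        return ET.Element("ref", name=self.name)


@dataclass
class Number:
    value: float

    def to_xml(self):
        # The number stays numeric in memory; it becomes text only here.
        return ET.Element("number", value="%g" % self.value)


@dataclass
class Binary:
    op: str
    left: object
    right: object

    def to_xml(self):
        elem = ET.Element("binary", op=self.op)
        elem.extend([self.left.to_xml(), self.right.to_xml()])
        return elem


@dataclass
class FunCall:
    name: str
    args: List[object] = field(default_factory=list)

    def to_xml(self):
        elem = ET.Element("funcall", name=self.name)
        ET.SubElement(elem, "args").extend([a.to_xml() for a in self.args])
        return elem


@dataclass
class If:
    cond: object
    then: object

    def to_xml(self):
        elem = ET.Element("if")
        ET.SubElement(elem, "cond").append(self.cond.to_xml())
        ET.SubElement(elem, "then").append(self.then.to_xml())
        return elem


# Build the AST with plain objects, then serialize on demand.
tree = If(Binary("<", Ref("x"), Number(3)), FunCall("foo"))
xml_text = ET.tostring(tree.to_xml(), encoding="unicode")
```

    Field access is then ordinary attribute access (`tree.cond.op`), and the
    serializer is the only code that knows about element and attribute
    names.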

    > yes, but note the original stated purpose:
    > mostly for humans looking over debugging dumps.


    If it's for the humans, they will want to be able to use their preferred
    existing XML tools to process those dumps -- otherwise there's no
    advantage to using XML at all, and you might as well use whatever
    nonportable custom representation you prefer... which will probably be
    more readable than raw XML syntax since you can tune it for the needs of
    that specific task.

    Or, as a compromise, output XML and then provide a tool which translates
    it into your compact human-readable representation. Then folks who want
    to use a text editor to view your version can use that tool, while others
    who prefer an editor which manipulates the XML tree -- or who want to
    use a stylesheet to render the data into another representation entirely
    -- will have that option.
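
    One way such a translation tool might look, sketched in Python. It
    assumes the dump is ordinary XML read with xml.etree.ElementTree and
    prints the compact bracket form proposed at the top of the thread; the
    function name `to_compact` and the `width` parameter are assumptions,
    and text content is ignored:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def to_compact(elem, indent=0, width=72):
    """Render an element tree in the compact '<tag <child/>>' form."""
    pad = " " * indent
    head = "<" + elem.tag + "".join(
        " %s=%s" % (key, quoteattr(value))
        for key, value in elem.attrib.items()
    )
    children = list(elem)
    if not children:
        return pad + head + "/>"
    # First try to fit the whole subtree on a single line.
    flat = head + " " + " ".join(
        to_compact(child, 0, 10**9) for child in children
    ) + ">"
    if len(pad) + len(flat) <= width:
        return pad + flat
    # Otherwise put each child on its own line and close with a bare '>'.
    lines = [pad + head]
    lines += [to_compact(child, indent + 2, width) for child in children]
    lines.append(pad + ">")
    return "\n".join(lines)


if __name__ == "__main__":
    root = ET.fromstring('<foo><bar><baz/></bar><qux id="1"/></foo>')
    print(to_compact(root))
```

    Because the input is standard XML, anyone who prefers a tree editor or
    a stylesheet can ignore this tool and work on the original file.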

    >> Finally: XML's greatest value is that there are lots of tools already in
    >> place that support it. This won't be true of any new syntax.
    >>

    >
    > doesn't particularly matter in this case:


    XML is just another tool, and no tool is right for all purposes.
    Screwdrivers make poor hammers. Hammers make worse screwdrivers. If
    interoperability and toolability aren't your goals, XML may not be
    relevant for you; do what makes sense for your task.

    I have no opinion on the suggested syntax as a representation for
    non-XML trees; I tend to either use raw data or indentation and/or
    delimiters (Lisp/Scheme parens, Algol-family braces, whatever). How well
    your proposal works is going to depend heavily on what kinds of data
    you're presenting and what people are trying to extract from it.


    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, May 12, 2012
    #6
  7. On 5/11/2012 9:27 PM, Joe Kesselman wrote:
    > you're presenting and what people are trying to extract from it.


    ... and, of course, on what tools you assume they'll want to use to do so.


    Joe Kesselman, May 12, 2012
    #7
  8. BGB

    BGB Guest

    On 5/11/2012 6:27 PM, Joe Kesselman wrote:
    > On 5/11/2012 8:29 PM, BGB wrote:
    >> in the case of the compiler ASTs, a DOM-like system was used internally,
    >> rather than raw structures.

    >
    > Personally I would do a custom datastructure and give it an XML
    > serializer, or some other adapter layer that lets you view it in terms
    > of an XML infoset -- because trying to shove things into DOM form is
    > going to be much less memory-efficient and slower to access than a more
    > dedicated representation would be.
    >


    actually, at one point there was an interpreter of mine itself based on
    directly interpreting said ASTs in DOM form, and yes, it was slow...

    I don't actually know just how slow it was, but I realize now that an
    earlier Scheme interpreter of mine, which was running "fast" in
    comparison (of the naive "directly execute source expressions" variety),
    was in fact running 10,000x slower than native. I suspect this thing was
    very possibly around 100k or 1M times slower than native... (then again,
    at the time, it was also using a memory manager where every type-check
    involved a linear search over the entire heap, ...).

    the thing was basically a hack where I had written a parser which parsed
    a JavaScript-like syntax into DOM nodes, and fed it into a hacked-up
    XML-RPC implementation.

    this incredible slowness led me to later switch over to "wordcode" (like
    bytecode except an array of 16-bit shorts), and later over to bytecode.

    (later on I also switched from using bytecode to internally using
    threaded-code, but bytecode remains as the "canonical" representation).


    both then and now, a fair amount of type-checking is done using strings
    and "strcmp()", as most types are identified by name (this strategy won
    out due to being most convenient, and not actually all that expensive), ...

    now the interpreter is much faster, so performance is no longer a major
    issue.



    as-is (in the present), yes, those ASTs can chew through memory
    (especially for the C compiler front-end), but the present
    implementation has a fair amount of optimizing, and so performance
    doesn't actually seem to be all that bad in this case (the XML-related
    code is not a significant time-waster in the profiler, including for my
    C-compiler frontend, which is the main place the XML-based ASTs are
    still used).

    granted, yes, there is some internal trickery, like the attributes can
    encode numbers directly (as doubles), rather than representing them as
    strings, ...


    luckily, RAM use isn't really a huge issue on modern systems.


    I also don't really feel that raw structs would offer all that much
    advantage in this case, since although it is a little easier to access
    fields, the drawback is that different nodes would likely need different
    structs, and would create additional issues related to serialization.

    in terms of tradeoffs, there is not much of a usability advantage to
    'node->value' over 'dyxGeti(node, "value")', so it may well be a
    reasonable tradeoff...
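
    as a loose illustration of that tradeoff in Python: the Node class and
    its get_int / get_str accessors below are hypothetical (get_int merely
    stands in for a call like dyxGeti, whose actual behavior isn't described
    here). attribute values are stored as numbers where possible and only
    turned into strings on request:

```python
class Node:
    """Generic tagged node: one type serves every kind of AST node."""

    def __init__(self, tag, **attrs):
        self.tag = tag
        self.attrs = dict(attrs)   # values may be str, int or float
        self.children = []

    def get_int(self, name, default=0):
        """Fetch an attribute as an integer, converting if needed."""
        return int(self.attrs.get(name, default))

    def get_str(self, name, default=""):
        """Fetch an attribute as text, e.g. for serialization."""
        value = self.attrs.get(name, default)
        if isinstance(value, float) and value.is_integer():
            return str(int(value))   # print 3.0 as "3"
        return str(value)


node = Node("number", value=3.0)
count = node.get_int("value")      # generic lookup by name
text = node.get_str("value")       # same attribute, rendered for output
```

    the lookup by name costs a little more typing than a struct field, but
    every node kind shares one type and one serializer.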


    >> yes, but note the original stated purpose:
    >> mostly for humans looking over debugging dumps.

    >
    > If it's for the humans, they will want to be able to use their preferred
    > existing XML tools to process those dumps -- otherwise there's no
    > advantage to using XML at all, and you might as well use whatever
    > nonportable custom representation you prefer... which will probably be
    > more readable than raw XML syntax since you can tune it for the needs of
    > that specific task.
    >
    > Or, as a compromise, output XML and then provide a tool which translates
    > it into your compact human-readable representation. Then folks who want
    > to use a text editor to view your version can use that tool, while others
    > who prefer an editor which manipulates the XML tree -- or who want to
    > use a stylesheet to render the data into another representation entirely
    > -- will have that option.
    >


    it is possible, but as noted the feature would probably have been
    option-enabled anyways, meaning that even if it were supported, some
    action would be needed to enable it (and it could be turned back off
    again, probably by an option put into a config file or similar).


    >>> Finally: XML's greatest value is that there are lots of tools already in
    >>> place that support it. This won't be true of any new syntax.
    >>>

    >>
    >> doesn't particularly matter in this case:

    >
    > XML is just another tool, and no tool is right for all purposes.
    > Screwdrivers make poor hammers. Hammers make worse screwdrivers. If
    > interoperability and toolability isn't your goal, XML may not be
    > relevant for you; do what makes sense for your task.
    >


    fair enough, it is just used for this part of the system.


    > I have no opinion on the suggested syntax as a representation for
    > non-XML trees; I tend to either use raw data or indentation and/or
    > delimiters (Lisp/Scheme parens, Algol-family braces, whatever). How well
    > your proposal works is going to depend heavily on what kinds of data
    > you're presenting and what people are trying to extract from it.
    >


    as noted before, it would be used for printing the internal DOM-like nodes.

    given I am already using a system which is internally XML-based,
    sticking with an XML-like syntax would make sense (or, at least,
    something composed of tags and attributes). switching out to something
    radically different would be a fairly major alteration.


    many other parts of the system use a Lisp-like form, but they also use a
    different representation internally as well (lists composed of cons-cells).

    sadly, at present, parts of my VM which use S-Expressions for ASTs and
    parts which use XML based ASTs are largely incompatible.

    it would be nice sometimes if it were one or the other, but neither is
    "clearly better" (S-Expressions are faster, but not very flexible, and
    XML is more flexible, but also a little slower and more awkward to work
    with). similarly, there is no known good way to merge them without
    creating a horrible mess.

    ironically, when S-Expressions are organized into a tagged structure
    (similar to XML), they actually seem to use more memory than the
    equivalent in XML/DOM-style nodes...


    so, no ideal solutions here...
    BGB, May 12, 2012
    #8
  9. Peter Flynn

    Peter Flynn Guest

    On 12/05/12 01:01, BGB wrote:
    [...]
    > but, yeah, I guess originally XML was intended for markup of mostly
    > textual documents (like in HTML or similar), rather than for
    > representing structured data (or being used for humans viewing said
    > structured data as debugging output).


    Yes. The use of XML-Data was first proposed by Microsoft, I seem to
    remember, about half-way through the development phase of XML.

    > I wonder if anyone ever really considered "scene-graph delta-update
    > messages in a 3D FPS game" as a possible use-case for XML either?
    > somehow I doubt it


    XML has been used for many applications far beyond what we expected.

    [...]
    > I hadn't considered this case.
    > if the code is being viewed/edited in a generic text editor (such as
    > Notepad), it shouldn't make too much of a difference, but granted a
    > specialized XML editor could very well get confused.


    I can't imagine anyone actually wanting to code *any* structured syntax
    in something like Notepad. But all you would need to do is modify one of
    the FLOSS XML editors (Emacs would be the obvious start-point) to use
    your syntax.

    >> What about writing up the method as a paper for the Balisage (markup)
    >> conference? That's really the place to discuss new syntaxes.

    >
    > I don't know much about them, I hadn't heard of this before.


    http://www.balisage.net
    Every August in Montreal. This is the hard-core conference for markup.

    ///Peter
    Peter Flynn, May 12, 2012
    #9
  10. BGB

    BGB Guest

    On 5/12/2012 9:36 AM, Peter Flynn wrote:
    > On 12/05/12 01:01, BGB wrote:
    > [...]
    >> but, yeah, I guess originally XML was intended for markup of mostly
    >> textual documents (like in HTML or similar), rather than for
    >> representing structured data (or being used for humans viewing said
    >> structured data as debugging output).

    >
    > Yes. The use of XML-Data was first proposed by Microsoft, I seem to
    > remember, about half-way through the development phase of XML.
    >


    fair enough.


    >> I wonder if anyone ever really considered "scene-graph delta-update
    >> messages in a 3D FPS game" as a possible use-case for XML either?
    >> somehow I doubt it

    >
    > XML has been used for many applications far beyond what we expected.
    >


    yep.

    probably because it makes a fairly versatile format for tree-structured
    data.


    strangely enough, I don't currently use it for data-binding, which I
    guess is what many people use it for; rather, most of my use has been
    working with the trees directly (with no intermediate structures or
    objects).


    > [...]
    >> I hadn't considered this case.
    >> if the code is being viewed/edited in a generic text editor (such as
    >> Notepad), it shouldn't make too much of a difference, but granted a
    >> specialized XML editor could very well get confused.

    >
    > I can't imagine anyone actually wanting to code *any* structured syntax
    > in something like Notepad. But all you would need to do is modify one of
    > the FLOSS XML editors (Emacs would be the obvious start-point) to use
    > your syntax.
    >


    Emacs, blarg...

    I guess it could be a mystery how much effort it would be to add support
    to something like Notepad++ or SciTE or similar (or how an unmodified
    Notepad++ would respond to such a syntax). I think most likely it would
    confuse the syntax highlighting and text-folding features.

    but, I hadn't considered it much, since I had just assumed using Notepad
    or Notepad2 or similar, since these editors are fairly simple and seem
    to hold up fairly well with large text files (logs, ...).


    >>> What about writing up the method as a paper for the Balisage (markup)
    >>> conference? That's really the place to discuss new syntaxes.

    >>
    >> I don't know much about them, I hadn't heard of this before.

    >
    > http://www.balisage.net
    > Every August in Montreal. This is the hard-core conference for markup.
    >


    fair enough.


    > ///Peter
    BGB, May 12, 2012
    #10
  11. El 11/05/2012 19:40, BGB escribió:
    >...
    > example, say that a person has an expression like:
    > <if>
    > <cond>
    > <binary op="&lt;">
    > <ref name="x"/>
    > <number value="3"/>
    > </binary>
    > </cond>
    > <then>
    > <funcall name="foo">
    > <args/>
    > </funcall>
    > </then>
    > </if>
    >
    > representing, say, the AST of the statement "if(x>3)foo();".
    >
    > the parser and printer could use a more compact encoding, say:
    > <if
    > <cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
    > <then <funcall name="foo" <args/>>>
    > >


    In that case the slashes in the "/>" endmarks are probably superfluous.

    --
    Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
    Manuel Collado, May 14, 2012
    #11
  12. On 5/14/2012 7:49 AM, Manuel Collado wrote:
    > In that case the slashes in the "/>" endmarks are probably superfluous.


    And you might want to use (), {} or [] instead of <>, to emphasize that
    this is *not* XML.


    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, May 15, 2012
    #12
  13. BGB

    BGB Guest

    On 5/14/2012 4:49 AM, Manuel Collado wrote:
    > El 11/05/2012 19:40, BGB escribió:
    >> ...
    >> example, say that a person has an expression like:
    >> <if>
    >> <cond>
    >> <binary op="&lt;">
    >> <ref name="x"/>
    >> <number value="3"/>
    >> </binary>
    >> </cond>
    >> <then>
    >> <funcall name="foo">
    >> <args/>
    >> </funcall>
    >> </then>
    >> </if>
    >>
    >> representing, say, the AST of the statement "if(x>3)foo();".
    >>
    >> the parser and printer could use a more compact encoding, say:
    >> <if
    >> <cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
    >> <then <funcall name="foo" <args/>>>
    >> >

    >
    > In that case the slashes in the "/>" endmarks are probably superfluous.
    >


    it depends on how the parser works.

    given the intention that the new syntax be a direct extension of the
    existing syntax, rather than entirely replacing it, the '/' is still
    needed in order to avoid the syntax becoming ambiguous (how do you
    otherwise distinguish between an empty tag and the start of a list which
    is terminated by a closing tag?...).

    there are potentially more complex ways to deal with it, such as
    determining the type of the next matching closing tag, but this is
    problematic and potentially costly.

    example:
    <a><b><c>...</c></a>
    a begins, scans forwards, sees matched closing a.
    b begins, scans forwards, sees that next closing tag is a, concludes it
    does not contain c, ...

    the problem here is that in the naive case, this could cause the parser
    to require around O(n^2) time, rather than O(n) time, so it is better to
    avoid ambiguity.
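
    to illustrate the cost, here is a rough Python sketch of the naive
    lookahead. the function name and the tag regex are made up for this
    example and not taken from any real parser; it only handles bare <name>
    and </name> tags and ignores same-name nesting. for each slash-less tag
    it scans ahead for a closing tag of the same name and counts the
    comparisons made:

```python
import re

# Hypothetical token pattern: bare <name> or </name>, no attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][A-Za-z0-9_]*)>")


def classify_naive(text):
    """Classify each slash-less tag as 'empty' or 'container'.

    Returns (kinds, steps), where kinds maps the token index of every
    slash-less tag to its classification and steps counts how many later
    tokens were inspected in total.
    """
    tokens = [(m.group(1) == "/", m.group(2)) for m in TAG.finditer(text)]
    kinds = {}
    steps = 0
    for i, (is_close, name) in enumerate(tokens):
        if is_close:
            continue
        kinds[i] = "empty"  # assume empty until a matching close is seen
        for j in range(i + 1, len(tokens)):
            steps += 1
            if tokens[j] == (True, name):
                kinds[i] = "container"
                break
    return kinds, steps


if __name__ == "__main__":
    print(classify_naive("<a><b><c></c></a>"))
    flat = "".join(f"<t{i}>" for i in range(100))
    print(classify_naive(flat)[1])
```

    with n distinct empty tags and no closing tags at all, every scan runs
    to the end of the input, so the count comes out at n*(n-1)/2.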


    or such...
    BGB, May 15, 2012
    #13
  14. BGB

    BGB Guest

    On 5/14/2012 9:12 PM, Joe Kesselman wrote:
    > On 5/14/2012 7:49 AM, Manuel Collado wrote:
    >> In that case the slashes in the "/>" endmarks are probably superfluous.

    >
    > And you might want to use (), {} or [] instead of <>, to emphasize that
    > this is *not* XML.
    >


    I disagree here on both counts, in that I don't believe the '/' would be
    superfluous (as I see it, such a change would cause syntactic ambiguity
    unless the old syntax were removed entirely, which I doubt would be
    beneficial), nor that the use of different characters would be
    particularly beneficial (more likely, people would see the different
    tagging structure and conclude that it is something different).

    I don't think it would actually make much difference, because by the
    time they are confusing syntax based on the brace characters used, they
    would also be confusing it with other syntax based on the characters used.

    using "()" would by similar reasoning make it too likely to be confused
    with S-Expressions...

    or, by similar logic, using "{}" people might confuse it for JSON.


    I suspect that the differences are sufficient to where they will be
    clearly different and unlikely to be confused regardless of the
    characters used.

    if people see something like:
    <if<...>>
    it will probably be fairly obvious that this is not XML.


    I was considering giving it a different name, but have not yet decided
    on anything (nor done much with the idea yet, as I have been busy with
    other stuff).

    one possible option is "ZEML" (say, "Z-Expression Markup Language").
    partly because Z is like a flipped S, and is almost as hard-core as X.

    so, most likely choices are either sticking with <...> or using [...].

    may decide on other possible alterations.
    BGB, May 15, 2012
    #14
  15. El 15/05/2012 6:24, BGB escribió:
    > On 5/14/2012 4:49 AM, Manuel Collado wrote:
    >> El 11/05/2012 19:40, BGB escribió:
    >>> ...
    >>> example, say that a person has an expression like:
    >>> <if>
    >>> <cond>
    >>> <binary op="&lt;">
    >>> <ref name="x"/>
    >>> <number value="3"/>
    >>> </binary>
    >>> </cond>
    >>> <then>
    >>> <funcall name="foo">
    >>> <args/>
    >>> </funcall>
    >>> </then>
    >>> </if>
    >>>
    >>> representing, say, the AST of the statement "if(x>3)foo();".
    >>>
    >>> the parser and printer could use a more compact encoding, say:
    >>> <if
    >>> <cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
    >>> <then <funcall name="foo" <args/>>>
    >>> >

    >>
    >> In that case the slashes in the "/>" endmarks are probably superfluous.
    >>

    >
    > it depends on how the parser works.
    >
    > given the intention that the new syntax be a direct extension of the
    > existing syntax, rather than entirely replacing it, the '/' is still
    > needed in order to avoid the syntax becoming ambiguous (how do you
    > otherwise distinguish between an empty tag and the start of a list which
    > is terminated by a closing tag?...).


    Sorry, don't understand your explanation. Example:

    <list>
    <item/>
    <item/>
    <item/>
    </list>

    just becomes

    <list<item><item><item>>

    What's the problem?

    >
    > there are potentially more complex ways to deal with it, such as
    > determining the type of the next matching closing tag, but this is
    > problematic and potentially costly.


    All the closing tags are just ">". Each one matches exactly the last
    previous unclosed tag.

    >
    > example:
    > <a><b><c>...</c></a>
    > a begins, scans forwards, sees matched closing a.
    > b begins, scans forwards, sees that next closing tag is a, concludes it
    > does not contain c, ...


    Well, this is the standard notation, not the new streamlined one.

    >
    > the problem here is that in the naive case, this could cause the parser
    > to require around O(n^2) time, rather than O(n) time, so it is better to
    > avoid ambiguity.


    Matching opening and ending tags (or parentheses, or brackets, ...) can
    be done in O(n) on a single pass with the help of a stack. Or,
    equivalently, with just a recursive parser.
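
    A minimal sketch of the stack approach in Python, assuming the
    slash-less form where every tag is closed by a bare ">". The function
    name and the (name, children) tuple layout are only illustrative, and
    attributes and text are left out to keep it short:

```python
def parse_streamlined(text):
    """Parse e.g. '<list<item><item>>' into nested (name, children) tuples.

    Hypothetical helper: handles tag names and nesting only.
    """
    root = ("#root", [])
    stack = [root]  # the innermost unclosed tag is always on top
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "<":
            # Read the tag name that follows '<'.
            j = i + 1
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            if j == i + 1:
                raise ValueError(f"missing tag name at offset {i}")
            node = (text[i + 1:j], [])
            stack[-1][1].append(node)  # attach to the current parent
            stack.append(node)
            i = j
        elif ch == ">":
            # A bare '>' closes the last tag that is still open.
            if len(stack) == 1:
                raise ValueError(f"unmatched '>' at offset {i}")
            stack.pop()
            i += 1
        elif ch.isspace():
            i += 1
        else:
            raise ValueError(f"unexpected {ch!r} at offset {i}")
    if len(stack) != 1:
        raise ValueError(f"unclosed tag <{stack[-1][0]}")
    return root[1]


if __name__ == "__main__":
    print(parse_streamlined("<list<item><item><item>>"))
```

    Every character is looked at once and each tag is pushed and popped
    once, so the whole pass is linear in the input length.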

    >
    > or such...


    --
    Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
    Manuel Collado, May 15, 2012
    #15
  16. BGB

    BGB Guest

    On 5/15/2012 4:11 PM, Manuel Collado wrote:
    > El 15/05/2012 6:24, BGB escribió:
    >> On 5/14/2012 4:49 AM, Manuel Collado wrote:
    >>> El 11/05/2012 19:40, BGB escribió:
    >>>> ...
    >>>> example, say that a person has an expression like:
    >>>> <if>
    >>>> <cond>
    >>>> <binary op="&lt;">
    >>>> <ref name="x"/>
    >>>> <number value="3"/>
    >>>> </binary>
    >>>> </cond>
    >>>> <then>
    >>>> <funcall name="foo">
    >>>> <args/>
    >>>> </funcall>
    >>>> </then>
    >>>> </if>
    >>>>
    >>>> representing, say, the AST of the statement "if(x>3)foo();".
    >>>>
    >>>> the parser and printer could use a more compact encoding, say:
    >>>> <if
    >>>> <cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
    >>>> <then <funcall name="foo" <args/>>>
    >>>> >
    >>>
    >>> In that case the slashes in the "/>" endmarks are probably superfluous.
    >>>

    >>
    >> it depends on how the parser works.
    >>
    >> given the intention that the new syntax be a direct extension of the
    >> existing syntax, rather than entirely replacing it, the '/' is still
    >> needed in order to avoid the syntax becoming ambiguous (how do you
    >> otherwise distinguish between an empty tag and the start of a list which
    >> is terminated by a closing tag?...).

    >
    > Sorry, don't understand your explanation. Example:
    >
    > <list>
    > <item/>
    > <item/>
    > <item/>
    > </list>
    >
    > just becomes
    >
    > <list<item><item><item>>
    >
    > What's the problem?
    >


    it was because the parser would accept both forms of syntax at the same
    time (the original plan was for it to be a backwards-compatible
    extension syntax, but I have since changed my mind on this point).


    dropping the '/' means only one form of the syntax is supported.


    >>
    >> there are potentially more complex ways to deal with it, such as
    >> determining the type of the next matching closing tag, but this is
    >> problematic and potentially costly.

    >
    > All the closing tags are just ">". Each one matches exactly the last
    > previous unclosed tag.
    >


    this is not exactly how I imagined it working though.
    what this would do is essentially create something more akin to an
    S-Expression parser, which wasn't the original intention.


    >>
    >> example:
    >> <a><b><c>...</c></a>
    >> a begins, scans forwards, sees matched closing a.
    >> b begins, scans forwards, sees that next closing tag is a, concludes it
    >> does not contain c, ...

    >
    > Well, this is the standard notation, not the new streamlined one.
    >


    yes, but again, if both coexist at the same time in the same parser, ...
    then one either needs '/' or faces an ambiguity.


    >>
    >> the problem here is that in the naive case, this could cause the parser
    >> to require around O(n^2) time, rather than O(n) time, so it is better to
    >> avoid ambiguity.

    >
    > Matching opening and ending tags (or parentheses, or brackets, ...) can
    > be done in O(n) on a single pass with the help of a stack. Or,
    > equivalently, with just a recursive parser.
    >


    see above.


    but, yeah, I have thought more about it, and I may just branch off the
    syntax entirely and drop backwards compatibility.

    the reason was that a lot of these "extensions" I was considering would
    be significant enough to make it not worthwhile to keep it as a
    backwards-compatible extended syntax.


    so, likely newer "new" design:
    tag = '<' name (key '=' value)* node* '>'
    node = tag | text
    text = quoted_string | block_string
    block_string = '<[[' ... ']]>'
    value = name | number | quoted_string

    names and numbers would basically use C-like rules:
    nameinitchar = a-z | A-Z | _ | various_unicode_ranges
    namechar = nameinitchar | 0-9
    name = nameinitchar namechar*
    digits = (0-9)+
    number = digits ['.' digits] [('e'|'E') ['+'|'-'] digits]

    ...


    so, for example:
    <foo a=9 <bar <[[here is some text]]>>>

    will probably also consider switching to a C-like character escape notation.
    <text "this string\ncontains\nnewlines!">
    though likely including as extensions the ability to directly embed
    newlines and do \ line-continuations.

    <text "this string
    contains
    newlines!">

    <text "this string \
    contains \
    spaces!">

    plain quoted-strings would basically map to text nodes, and
    block-strings would map to CDATA.


    note that '/' would no longer be used, and traditional tag syntax would
    no longer work. it would probably also drop DTDs and similar as well.


    yes, this is no longer XML, but it could be usable...
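
    a rough recursive-descent sketch of how this could be parsed, in Python.
    this is only one reading of the design above, not an existing
    implementation: it treats the attribute list as zero or more key=value
    pairs and a node as either a tag or a text item, sticks to ASCII names,
    keeps attribute values as strings, and supports only a handful of
    escapes. the class, function and tuple names are all made up here:

```python
import re

NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # ASCII-only simplification
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")
# Backslash-newline is a line continuation and contributes nothing.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"', "\n": ""}


class Parser:
    def __init__(self, src):
        self.src, self.pos = src, 0

    def peek(self):
        return self.src[self.pos:self.pos + 1]  # "" at end of input

    def skip_ws(self):
        while self.peek().isspace():
            self.pos += 1

    def match(self, pattern):
        m = pattern.match(self.src, self.pos)
        if m is None:
            raise ValueError(f"syntax error at offset {self.pos}")
        self.pos = m.end()
        return m.group()

    def quoted(self):
        self.pos += 1  # skip the opening quote
        out = []
        while True:
            ch = self.peek()
            if ch == "":
                raise ValueError("unterminated string")
            self.pos += 1
            if ch == '"':
                return "".join(out)
            if ch == "\\":
                esc = self.peek()
                if esc not in ESCAPES:
                    raise ValueError(f"bad escape at offset {self.pos}")
                ch = ESCAPES[esc]
                self.pos += 1
            out.append(ch)  # raw newlines inside the quotes are kept

    def value(self):
        ch = self.peek()
        if ch == '"':
            return self.quoted()
        return self.match(NUMBER if ch.isdigit() else NAME)

    def node(self):
        if self.src.startswith("<[[", self.pos):
            end = self.src.index("]]>", self.pos)  # ValueError if missing
            text = self.src[self.pos + 3:end]
            self.pos = end + 3
            return ("cdata", text)
        if self.peek() == '"':
            return ("text", self.quoted())
        return self.tag()

    def tag(self):
        if self.peek() != "<":
            raise ValueError(f"expected '<' at offset {self.pos}")
        self.pos += 1
        name = self.match(NAME)
        attrs, children = {}, []
        while True:
            self.skip_ws()
            ch = self.peek()
            if ch == "":
                raise ValueError(f"unclosed tag <{name}")
            if ch == ">":
                self.pos += 1
                return ("tag", name, attrs, children)
            if ch in '<"':
                children.append(self.node())
                continue
            if children:
                raise ValueError("attributes must precede child nodes")
            key = self.match(NAME)
            self.skip_ws()
            if self.peek() != "=":
                raise ValueError(f"expected '=' at offset {self.pos}")
            self.pos += 1
            self.skip_ws()
            attrs[key] = self.value()


def parse(src):
    p = Parser(src)
    p.skip_ws()
    tree = p.tag()
    p.skip_ws()
    if p.pos != len(src):
        raise ValueError("trailing input after the root tag")
    return tree


if __name__ == "__main__":
    print(parse('<foo a=9 <bar <[[here is some text]]>>>'))
```

    quoted strings come back as ("text", ...) items and block strings as
    ("cdata", ...) items, following the text-node / CDATA mapping above.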
    BGB, May 17, 2012
    #16