RFC: thoughts for a "streamlined" XML syntax variant...


BGB

one issue with XML, at least for its use in structured data, is its
relative verbosity, especially in cases where it is entered by hand
or read by a human (say, for debugging reasons, ...).

so, the thought here would be to allow a "modest" syntax extension
(probably would be limited to particular implementations which support
the extension).


more specifically, I was considering it as a possible extension feature
to my own implementation, but have some doubts given that, yes, this
would be a non-standard extension. note that there probably would be a
feature to manually "enable" it, such as to avoid necessarily breaking
compatibility. in my case, the current primary use is for things like
compiler ASTs, where it competes some with the use of S-Expressions for
ASTs (Lisp style, not the "Rivest" variant / name-hijack). note that
these ASTs normally never leave the application which created them, so
the impact of using a non-standard syntax when serializing them is
likely fairly small.


example, say that a person has an expression like:
<if>
<cond>
<binary op="&lt;">
<ref name="x"/>
<number value="3"/>
</binary>
</cond>
<then>
<funcall name="foo">
<args/>
</funcall>
</then>
</if>

representing, say, the AST of the statement "if(x<3)foo();".

the parser and printer could use a more compact encoding, say:
<if
<cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
<then <funcall name="foo" <args/>>>

which would be regarded as functionally-equivalent to the prior
expression (and would generate equivalent DOM trees when read back in).


with the following rules:
<tag>...</tag> and <tag/> are the same as before.

while:
<tag <...> ...>
would use an alternate parsing strategy, where ">" is significant (since
the prior tag didn't actually end), and indicates the end of the
expression (the magic here would be seeing another "<" within a tag).

similarly, maybe "<[[" could also be parsed as a shorthand for
"<![CDATA[" as well (and would also match nicer with the closing bracket
"]]>").


note that it would be possible to mix them, as in:
<foo> <bar <baz/>> </foo>
and:
<foo <bar> <baz/> </bar>>

maybe also a different "name", like "XEML" or similar, would make
sense, such as to reduce possible confusion.


any thoughts or relevant information to look at?...
 

Peter Flynn

one issue with XML, at least for its use in structured data, is its
relative verbosity, especially in cases where it is entered by
hand or read by a human (say, for debugging reasons, ...).

I think this was expected to be a very rare case, which is why the spec
says that terseness in XML markup is of minimal importance.
so, the thought here would be to allow a "modest" syntax extension
(probably would be limited to particular implementations which
support the extension).

more specifically, I was considering it as a possible extension
feature to my own implementation, but have some doubts given that,
yes, this would be a non-standard extension. note that there probably
would be a feature to manually "enable" it, such as to avoid
necessarily breaking compatibility.

Switchable is good.
in my case, the current primary use is for things like compiler ASTs,
where it competes some with the use of S-Expressions for ASTs (Lisp
style, not the "Rivest" variant / name-hijack). note that these ASTs
normally never leave the application which created them, so the
impact of using a non-standard syntax when serializing them is likely
fairly small.

example, say that a person has an expression like:
<if>
<cond>
<binary op="&lt;">
<ref name="x"/>
<number value="3"/>
</binary>
</cond>
<then>
<funcall name="foo">
<args/>
</funcall>
</then>
</if>

representing, say, the AST of the statement "if(x<3)foo();".

the parser and printer could use a more compact encoding, say:
<if
<cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
<then <funcall name="foo" <args/>>>

This syntax (or very nearly) is already available in SGML:

<!doctype if [
<!element if - - (cond,then)>
<!element cond - - (binary)>
<!element binary - - (ref,number)>
<!element number - - empty>
<!element then - - (funcall)>
<!element funcall - - (args)>
<!element (args,ref) - - empty>
<!attlist binary op cdata #required>
<!attlist (ref,funcall) name cdata #required>
<!attlist number value cdata #required>
<!entity lt sdata "<">
]>
<if<cond<binary op="&lt;"<ref name=x<number value="3"></></>
<then<funcall name=foo<args></></></>

which would be regarded as functionally-equivalent to the prior
expression (and would generate equivalent DOM trees when read back in).

with the following rules:
<tag>...</tag> and <tag/> are the same as before.

while:
<tag <...> ...>
would use an alternate parsing strategy, where ">" is significant (since
the prior tag didn't actually end), and indicates the end of the
expression (the magic here would be seeing another "<" within a tag).

similarly, maybe "<[[" could also be parsed as a shorthand for
"<![CDATA[" as well (and would also match nicer with the closing bracket
"]]>").

note that it would be possible to mix them, as in:
<foo> <bar <baz/>> </foo>
and:
<foo <bar> <baz/> </bar>>

maybe also a different "name", like "XEML" or similar, would make
sense, such as to reduce possible confusion.

any thoughts or relevant information to look at?...

I think you'd need a special editor: if the objective is to abbreviate
the syntax, there is a delicate breakpoint between the denseness of the
reduced syntax and the ability of the creator/user to understand it.

What about writing up the method as a paper for the Balisage (markup)
conference? That's really the place to discuss new syntaxes.

///Peter
 

Joe Kesselman

There have been multiple suggestions for terser syntax, over the years
since XML was released to the world. In general, they have failed
contact with the real world -- they're harder to work with, and/or they
aren't actually significantly more compact (especially when you remember
that XML compresses wonderfully), and/or what you'd really want is a
custom representation within your application (perhaps straight data
structures) and XML only as the export/import interface to the rest of
the world.

The W3C has looked at a number of "binary XML" representations. I
believe there was a working group that was investigating trying to come
up with something official. I'm not sure what its status is now; the
idea still strikes me as an awkward compromise that is going to face too
many conflicting goals.

Finally: XML's greatest value is that there are lots of tools already in
place that support it. This won't be true of any new syntax.

Sorry, but I think there really isn't sufficient value here to make the
idea worth pursuing.
 

BGB

I think this was expected to be a very rare case, which is why the spec
says that terseness in XML markup is of minimal importance.

fair enough.

I mostly use it for things like compiler ASTs, network protocols, and
file-formats (generally structured-data).


currently used forms of XML are:
raw/plaintext XML;
as deflated plaintext XML;
as an in-use binary format (similar to an "improved" version of WBXML
with a few more features and density-improvements, with both being
byte-based).

I have another format I could use, but going into it likely pushes
topicality (it is a Huffman-compressed binary serialization format,
currently used for sending messages over a TCP socket in a 3D game
engine, but this doesn't have much in particular to do with XML, as the
message format it is currently used with is S-Expression based, rather
than XML based).

but, yeah, I guess originally XML was intended for markup of mostly
textual documents (like in HTML or similar), rather than for
representing structured data (or being used for humans viewing said
structured data as debugging output).

I wonder if anyone ever really considered "scene-graph delta-update
messages in a 3D FPS game" as a possible use-case for XML either?
somehow I doubt it (I had intended to do this originally, despite
eventually opting for a different representation for said deltas).

even as such, I did end up aggressively compressing them (via a
specialized encoding scheme), as otherwise the bandwidth usage would
have been a bit steep for a typical end-user internet connection.

so, the thought here would be to allow a "modest" syntax extension
(probably would be limited to particular implementations which
support the extension).

more specifically, I was considering it as a possible extension
feature to my own implementation, but have some doubts given that,
yes, this would be a non-standard extension. note that there probably
would be a feature to manually "enable" it, such as to avoid
necessarily breaking compatibility.

Switchable is good.

yeah.

in my case, the current primary use is for things like compiler ASTs,
where it competes some with the use of S-Expressions for ASTs (Lisp
style, not the "Rivest" variant / name-hijack). note that these ASTs
normally never leave the application which created them, so the
impact of using a non-standard syntax when serializing them is likely
fairly small.

example, say that a person has an expression like:
<if>
<cond>
<binary op="&lt;">
<ref name="x"/>
<number value="3"/>
</binary>
</cond>
<then>
<funcall name="foo">
<args/>
</funcall>
</then>
</if>

representing, say, the AST of the statement "if(x<3)foo();".

the parser and printer could use a more compact encoding, say:
<if
<cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
<then <funcall name="foo" <args/>>>

This syntax (or very nearly) is already available in SGML:

<!doctype if [
<!element if - - (cond,then)>
<!element cond - - (binary)>
<!element binary - - (ref,number)>
<!element number - - empty>
<!element then - - (funcall)>
<!element funcall - - (args)>
<!element (args,ref) - - empty>
<!attlist binary op cdata #required>
<!attlist (ref,funcall) name cdata #required>
<!attlist number value cdata #required>
<!entity lt sdata "<">
]>
<if<cond<binary op="&lt;"<ref name=x<number value="3"></></>
<then<funcall name=foo<args></></></>

fair enough.

which would be regarded as functionally-equivalent to the prior
expression (and would generate equivalent DOM trees when read back in).

with the following rules:
<tag>...</tag> and <tag/> are the same as before.

while:
<tag <...> ...>
would use an alternate parsing strategy, where ">" is significant (since
the prior tag didn't actually end), and indicates the end of the
expression (the magic here would be seeing another "<" within a tag).

similarly, maybe "<[[" could also be parsed as a shorthand for
"<![CDATA[" as well (and would also match nicer with the closing bracket
"]]>").

note that it would be possible to mix them, as in:
<foo> <bar <baz/>> </foo>
and:
<foo <bar> <baz/> </bar>>

maybe also a different "name", like "XEML" or similar, would make
sense, such as to reduce possible confusion.

any thoughts or relevant information to look at?...

I think you'd need a special editor: if the objective is to abbreviate
the syntax, there is a delicate breakpoint between the denseness of the
reduced syntax and the ability of the creator/user to understand it.

I hadn't considered this case.
if the code is being viewed/edited in a generic text editor (such as
Notepad), it shouldn't make too much of a difference, but granted a
specialized XML editor could very well get confused.

but, in this case, I doubt that such a change would render the syntax
unreadable (to humans), and it could reduce verbosity and sprawl
somewhat (in intermediate data files spit out by the application), which
is currently the main problem area (finding things in multi-MB files is
hard enough as-is, much less when the AST for a single function in a
C-like syntax can span a fairly large number of pages).

but, I don't think it would be too much of a different issue from that
of a person trying to read S-Expressions, if using a more compact format.

this is partly because a C-style (programming language) syntax is fairly
information-dense, but when parsed into ASTs and then dumped as XML,
there is a significant amount of expansion.

What about writing up the method as a paper for the Balisage (markup)
conference? That's really the place to discuss new syntaxes.

I don't know much about them, I hadn't heard of this before.
 

BGB

There have been multiple suggestions for terser syntax, over the years
since XML was released to the world. In general, they have failed
contact with the real world -- they're harder to work with, and/or they
aren't actually significantly more compact (especially when you remember
that XML compresses wonderfully), and/or what you'd really want is a
custom representation within your application (perhaps straight data
structures) and XML only as the export/import interface to the rest of
the world.

in the case of the compiler ASTs, a DOM-like system was used internally,
rather than raw structures.

in this case, I have basically been doing it this way (at least in one
branch of my stuff), since about 2004 (originally, the system was much
closer to DOM, but has diverged somewhat over the years, mostly to
improve usability and performance for these use cases).


thus far, the external syntax (generally for debugging dumps) has been
in traditional XML syntax.

The W3C has looked at a number of "binary XML" representations. I
believe there was a working group that was investigating trying to come
up with something official. I'm not sure what its status is now; the
idea still strikes me as an awkward compromise that is going to face too
many conflicting goals.

yes, but note the original stated purpose:
mostly for humans looking over debugging dumps.


a binary-XML format was not the goal in this case, since a human can't
read binary XML. rather, it is to "optimize" how much page-up/page-down
action is needed in Notepad and similar...

trying to find and look at stuff in giant sprawling text files is kind
of a pain.


FWIW, I also use binary XML formats, but I consider this to be a
different use-case.

granted, in such a use-case, I guess it wouldn't actually do much harm
if it were output-only, serving solely as a debug-dump format, rather
than something which can be parsed back in.

Finally: XML's greatest value is that there are lots of tools already in
place that support it. This won't be true of any new syntax.

doesn't particularly matter in this case:
I control nearly all of the code which would likely be used for dealing
with it directly.

so, the syntax would not likely be used for interchange between
applications, and thus whether or not anyone else supports it is of much
less importance.

Sorry, but I think there really isn't sufficient value here to make the
idea worth pursuing.


as a standardized feature, maybe...

I didn't mean making it be standard (or even necessarily that the W3C
would notice, or care).

I figured I would state it here to see what anyone thought, but don't
actually expect any sort of widespread adoption.

IOW: this was not intended as a "feature request"...
 

Joe Kesselman

in the case of the compiler ASTs, a DOM-like system was used internally,
rather than raw structures.

Personally I would do a custom datastructure and give it an XML
serializer, or some other adapter layer that lets you view it in terms
of an XML infoset -- because trying to shove things into DOM form is
going to be much less memory-efficient and slower to access than a more
dedicated representation would be.
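
A minimal sketch in C of what that separation might look like (the
struct layout and names here are made up purely for illustration, and
attribute escaping is omitted): the AST lives in plain structs, and XML
only appears at dump time.

/* dedicated AST structs plus a small serializer that emits XML on
   demand; attribute values are given pre-escaped for brevity */
#include <stdio.h>

typedef struct AstNode AstNode;
struct AstNode {
    const char  *tag;          /* element name, e.g. "binary" */
    const char  *akey, *aval;  /* single optional attribute (NULL if none) */
    AstNode    **kids;         /* child nodes */
    int          nkids;
};

static void ast_dump_xml(FILE *out, const AstNode *n)
{
    fprintf(out, "<%s", n->tag);
    if (n->akey)
        fprintf(out, " %s=\"%s\"", n->akey, n->aval);
    if (n->nkids == 0) { fputs("/>", out); return; }
    fputc('>', out);
    for (int i = 0; i < n->nkids; i++)
        ast_dump_xml(out, n->kids[i]);
    fprintf(out, "</%s>", n->tag);
}

int main(void)
{
    AstNode  x    = { "ref",    "name",  "x",    NULL, 0 };
    AstNode  n3   = { "number", "value", "3",    NULL, 0 };
    AstNode *bk[] = { &x, &n3 };
    AstNode  bin  = { "binary", "op",    "&lt;", bk,   2 };
    /* prints: <binary op="&lt;"><ref name="x"/><number value="3"/></binary> */
    ast_dump_xml(stdout, &bin);
    fputc('\n', stdout);
    return 0;
}
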
yes, but note the original stated purpose:
mostly for humans looking over debugging dumps.

If it's for the humans, they will want to be able to use their preferred
existing XML tools to process those dumps -- otherwise there's no
advantage to using XML at all, and you might as well use whatever
nonportable custom representation you prefer... which will probably be
more readable than raw XML syntax since you can tune it for the needs of
that specific task.

Or, as a compromise, output XML and then provide a tool which translates
it into your compact human-readable representation. Then folks who want
to use a text editor to view your version can use that tool, while others
who prefer an editor which manipulates the XML tree -- or who want to
use a stylesheet to render the data into another representation entirely
-- will have that option.
doesn't particularly matter in this case:

XML is just another tool, and no tool is right for all purposes.
Screwdrivers make poor hammers. Hammers make worse screwdrivers. If
interoperability and toolability isn't your goal, XML may not be
relevant for you; do what makes sense for your task.

I have no opinion on the suggested syntax as a representation for
non-XML trees; I tend to either use raw data or indentation and/or
delimiters (Lisp/Scheme parens, Algol-family braces, whatever). How well
your proposal works is going to depend heavily on what kinds of data
you're presenting and what people are trying to extract from it.


--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 

BGB

Personally I would do a custom datastructure and give it an XML
serializer, or some other adapter layer that lets you view it in terms
of an XML infoset -- because trying to shove things into DOM form is
going to be much less memory-efficient and slower to access than a more
dedicated representation would be.

actually, at one point there was an interpreter of mine based on
directly interpreting said ASTs in DOM form, and yes, it was slow...

I don't actually know just how slow it was, but I realize now that an
earlier Scheme interpreter of mine which was running "fast" in
comparison (of the naive "directly execute source expressions" variety)
was in fact running 10,000x slower than native, so I suspect this thing
was very possibly around 100k or 1M times slower than native... (then
again, at the time, it was also using a memory manager where every
type-check also involved a linear search over the entire heap, ...).

the thing was basically a hack where I had written a parser which
parsed a JavaScript-like syntax into DOM nodes and fed it into a
hacked-up XML-RPC implementation.

this incredible slowness led me to later switch over to "wordcode" (like
bytecode except an array of 16-bit shorts), and later over to bytecode.

(later on I also switched from using bytecode to internally using
threaded-code, but bytecode remains as the "canonical" representation).


both then and now, a fair amount of type-checking is done using strings
and "strcmp()", as most types are identified by name (this strategy won
out due to being most convenient, and not actually all that expensive), ...
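
just as an illustration of the general idea (the struct and helper
names here are made up, not the actual VM internals):

/* name-based type identification, checked with strcmp() */
#include <stdio.h>
#include <string.h>

typedef struct { const char *type_name; /* ... payload ... */ } Value;

static int value_is(const Value *v, const char *name)
{
    return strcmp(v->type_name, name) == 0;
}

int main(void)
{
    Value v = { "string" };
    printf("%d\n", value_is(&v, "string"));  /* 1 */
    return 0;
}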

now the interpreter is much faster, so performance is no longer a major
issue.



as-is (in the present), yes, those ASTs can chew through memory
(especially for the C compiler front-end), but the present
implementation has a fair amount of optimizing, and so performance
doesn't actually seem to be all that bad in this case (the XML-related
code is not a significant time-waster in the profiler, including for my
C-compiler frontend, which is the main place the XML-based ASTs are
still used).

granted, yes, there is some internal trickery, like the attributes can
encode numbers directly (as doubles), rather than representing them as
strings, ...


luckily, RAM use isn't really a huge issue on modern systems.


I also don't really feel that raw structs would offer all that much
advantage in this case, since although it is a little easier to access
fields, the drawback is that different nodes would likely need different
structs, which would create additional issues related to serialization.

in terms of tradeoffs, there is not that much of a usability advantage
of 'node->value' over 'dyxGeti(node, "value")', so it may well be a
reasonable tradeoff...

If it's for the humans, they will want to be able to use their preferred
existing XML tools to process those dumps -- otherwise there's no
advantage to using XML at all, and you might as well use whatever
nonportable custom representation you prefer... which will probably be
more readable than raw XML syntax since you can tune it for the needs of
that specific task.

Or, as a compromise, output XML and then provide a tool which translates
it into your compact human-readable representation. Then folks who want
to use a text editor to view your version can use that tool, while others
who prefer an editor which manipulates the XML tree -- or who want to
use a stylesheet to render the data into another representation entirely
-- will have that option.

it is possible, but as noted it probably would have been option-enabled
anyways, meaning that even if supported, some action would probably be
needed to enable it (and it could also be turned back off again,
probably via an option which could be put into a config file or similar).

XML is just another tool, and no tool is right for all purposes.
Screwdrivers make poor hammers. Hammers make worse screwdrivers. If
interoperability and toolability isn't your goal, XML may not be
relevant for you; do what makes sense for your task.

fair enough, it is just used for this part of the system.

I have no opinion on the suggested syntax as a representation for
non-XML trees; I tend to either use raw data or indentation and/or
delimiters (Lisp/Scheme parens, Algol-family braces, whatever). How well
your proposal works is going to depend heavily on what kinds of data
you're presenting and what people are trying to extract from it.

as noted before, it would be used for printing the internal DOM-like nodes.

given I am already using a system which is internally XML-based,
sticking with an XML-like syntax would make sense (or, at least,
something composed of tags and attributes). switching out to something
radically different would be a fairly major alteration.


many other parts of the system use a Lisp-like form, but they also use a
different representation internally as well (lists composed of cons-cells).

sadly, at present, parts of my VM which use S-Expressions for ASTs and
parts which use XML based ASTs are largely incompatible.

it would be nice sometimes if it were one or the other, but neither is
"clearly better" (S-Expressions are faster, but not very flexible, and
XML is more flexible, but also a little slower and more awkward to work
with). similarly, there is no known good way to merge them without
creating a horrible mess.

ironically, when S-Expressions are organized into a tagged structure
(similar to XML), they actually seem to use more memory than the
equivalent in XML/DOM-style nodes...


so, no ideal solutions here...
 

Peter Flynn

On 12/05/12 01:01, BGB wrote:
[...]
but, yeah, I guess originally XML was intended for markup of mostly
textual documents (like in HTML or similar), rather than for
representing structured data (or being used for humans viewing said
structured data as debugging output).

Yes. The use of XML-Data was first proposed by Microsoft, I seem to
remember, about half-way through the development phase of XML.
I wonder if anyone ever really considered "scene-graph delta-update
messages in a 3D FPS game" as a possible use-case for XML either?
somehow I doubt it

XML has been used for many applications far beyond what we expected.

[...]
I hadn't considered this case.
if the code is being viewed/edited in a generic text editor (such as
Notepad), it shouldn't make too much of a difference, but granted a
specialized XML editor could very well get confused.

I can't imagine anyone actually wanting to code *any* structured syntax
in something like Notepad. But all you would need to do is modify one of
the FLOSS XML editors (Emacs would be the obvious start-point) to use
your syntax.
I don't know much about them, I hadn't heard of this before.

http://www.balisage.net
Every August in Montreal. This is the hard-core conference for markup.

///Peter
 

BGB

On 12/05/12 01:01, BGB wrote:
[...]
but, yeah, I guess originally XML was intended for markup of mostly
textual documents (like in HTML or similar), rather than for
representing structured data (or being used for humans viewing said
structured data as debugging output).

Yes. The use of XML-Data was first proposed by Microsoft, I seem to
remember, about half-way through the development phase of XML.

fair enough.

XML has been used for many applications far beyond what we expected.

yep.

probably because it makes a fairly versatile format for tree-structured
data.


strangely enough, I don't currently use it for data-binding, which I
guess is what many people use it for, rather most use has been in terms
of using the trees directly (with no intermediate structures or objects).

[...]
I hadn't considered this case.
if the code is being viewed/edited in a generic text editor (such as
Notepad), it shouldn't make too much of a difference, but granted a
specialized XML editor could very well get confused.

I can't imagine anyone actually wanting to code *any* structured syntax
in something like Notepad. But all you would need to do is modify one of
the FLOSS XML editors (Emacs would be the obvious start-point) to use
your syntax.

Emacs, blarg...

I guess it is an open question how much effort it would be to add
support to something like Notepad++ or SciTE or similar (or how an
unmodified Notepad++ would respond to such a syntax). I think most
likely it would confuse the syntax highlighting and text-folding features.

but, I hadn't considered it much, since I had just assumed using Notepad
or Notepad2 or similar, since these editors are fairly simple and seem
to hold up fairly well with large text files (logs, ...).

http://www.balisage.net
Every August in Montreal. This is the hard-core conference for markup.

fair enough.
 

Manuel Collado

On 11/05/2012 19:40, BGB wrote:
...
example, say that a person has an expression like:
<if>
<cond>
<binary op="&lt;">
<ref name="x"/>
<number value="3"/>
</binary>
</cond>
<then>
<funcall name="foo">
<args/>
</funcall>
</then>
</if>

representing, say, the AST of the statement "if(x<3)foo();".

the parser and printer could use a more compact encoding, say:
<if
<cond <binary op="&lt;" <ref name="x"/> <number value="3"/>>>>
<then <funcall name="foo" <args/>>>

In that case the slashes in the "/>" endmarks are probably superfluous.
 

BGB

On 11/05/2012 19:40, BGB wrote:

In that case the slashes in the "/>" endmarks are probably superfluous.

it depends on how the parser works.

given the intention that the new syntax be a direct extension of the
existing syntax, rather than entirely replacing it, the '/' is still
needed in order to avoid the syntax becoming ambiguous (how do you
otherwise distinguish between an empty tag and the start of a list which
is terminated by a closing tag?...).
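
for example, if the '/' were dropped while the old syntax were still
accepted, then something like:

<list <item> <blah/>>

could be read two ways: with <item> as an empty element in the new
syntax (a sibling of <blah/>), or with <item> as a classic opening tag
whose content continues until a matching </item> turns up somewhere
later. with the '/', <item/> is unambiguously empty, and <item>
unambiguously opens a classic element.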

there are potentially more complex ways to deal with it, such as
determining the type of the next matching closing tag, but this is
problematic and potentially costly.

example:
<a><b><c>...</c></a>
a begins, scans forwards, sees matched closing a.
b begins, scans forwards, sees that next closing tag is a, concludes it
does not contain c, ...

the problem here is that in the naive case, this could cause the parser
to require around O(n^2) time, rather than O(n) time, so it is better to
avoid ambiguity.


or such...
 

BGB

In that case the slashes in the "/>" endmarks are probably superfluous.

And you might want to use (), {} or [] instead of <>, to emphasize that
this is *not* XML.

I disagree here on both counts, in that I don't believe the '/' would be
superfluous (as I see it, such a change would cause syntactic ambiguity
unless the old syntax were removed entirely, which I doubt would be
beneficial), nor that the use of different characters would be
particularly beneficial (more likely, people would see the different
tagging structure and conclude that it is something different).

I don't think it would actually make much difference: if people are
going to confuse the syntax with something else based on the bracket
characters used, then switching characters would just have it confused
with some other syntax instead.

using "()" would by similar reasoning make it too likely to be confused
with S-Expressions...

or, by similar logic, using "{}" people might confuse it for JSON.


I suspect that the differences are sufficient that it will be clearly
different and unlikely to be confused, regardless of the characters used.

if people see something like:
<if<...>>
it will probably be fairly obvious that this is not XML.


I had already been considering giving it a different name, but have not
yet decided on anything (nor done much with the idea yet, as I have
been busy with other stuff).

one possible option is "ZEML" (say, "Z-Expression Markup Language").
partly because Z is like a flipped S, and is almost as hard-core as X.

so, most likely choices are either sticking with <...> or using [...].

may decide on other possible alterations.
 

Manuel Collado

On 15/05/2012 6:24, BGB wrote:
it depends on how the parser works.

given the intention that the new syntax be a direct extension of the
existing syntax, rather than entirely replacing it, the '/' is still
needed in order to avoid the syntax becoming ambiguous (how do you
otherwise distinguish between an empty tag and the start of a list which
is terminated by a closing tag?...).

Sorry, don't understand your explanation. Example:

<list>
<item/>
<item/>
<item/>
</list>

just becomes

<list<item><item><item>>

What's the problem?
there are potentially more complex ways to deal with it, such as
determining the type of the next matching closing tag, but this is
problematic and potentially costly.

All the closing tags are just ">". Each one matches exactly the last
previous unclosed tag.
example:
<a><b><c>...</c></a>
a begins, scans forwards, sees matched closing a.
b begins, scans forwards, sees that next closing tag is a, concludes it
does not contain c, ...

Well, this is the standard notation, not the new streamlined one.
the problem here is that in the naive case, this could cause the parser
to require around O(n^2) time, rather than O(n) time, so it is better to
avoid ambiguity.

Matching opening and ending tags (or parentheses, or brackets, ...) can
be done in O(n) on a single pass with the help of a stack. Or,
equivalently, with just a recursive parser.
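
A rough sketch in C of the single-pass, stack-based check (simplified:
it ignores quoted attribute values containing '>', comments, CDATA, and
so on), just to illustrate the O(n) point:

/* every </name> must match the most recently opened <name ...>;
   "/>" marks an empty element and needs no matching close. */
#include <stdio.h>
#include <string.h>

#define MAXDEPTH 256
#define MAXNAME   64

static int check_nesting(const char *s)
{
    char stack[MAXDEPTH][MAXNAME];
    int depth = 0;

    for (const char *p = s; *p; p++) {
        if (*p != '<')
            continue;
        int closing = (p[1] == '/');
        const char *q = p + (closing ? 2 : 1);

        /* read the tag name */
        char name[MAXNAME];
        int n = 0;
        while (*q && *q != '>' && *q != '/' && *q != ' ' && n < MAXNAME - 1)
            name[n++] = *q++;
        name[n] = 0;

        /* skip to the end of the tag */
        while (*q && *q != '>')
            q++;
        if (!*q)
            return 0;                      /* unterminated tag */
        int empty = (q[-1] == '/');        /* saw "/>" */

        if (closing) {
            if (depth == 0 || strcmp(stack[depth - 1], name) != 0)
                return 0;                  /* mismatched closing tag */
            depth--;
        } else if (!empty) {
            if (depth >= MAXDEPTH)
                return 0;                  /* too deeply nested */
            strcpy(stack[depth++], name);
        }
        p = q;                             /* resume scanning after '>' */
    }
    return depth == 0;                     /* everything was closed */
}

int main(void)
{
    printf("%d\n", check_nesting("<a><b><c/></b></a>"));    /* 1: balanced */
    printf("%d\n", check_nesting("<a><b><c>...</c></a>"));  /* 0: <b> never closed */
    return 0;
}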
 

BGB

On 15/05/2012 6:24, BGB wrote:

Sorry, don't understand your explanation. Example:

<list>
<item/>
<item/>
<item/>
</list>

just becomes

<list<item><item><item>>

What's the problem?

it was because the parser would accept both forms of syntax at the same
time (the original plan was for it to be a backwards-compatible
extension syntax, but I have since changed my mind on this point).


dropping the '/' means only one form of the syntax is supported.

All the closing tags are just ">". Each one matches exactly the last
previous unclosed tag.

this is not exactly how I imagined it working though.
what this would do is essentially create something more akin to an
S-Expression parser, which wasn't the original intention.

Well, this is the standard notation, not the new streamlined one.

yes, but again, if both coexist at the same time in the same parser, ...
then one either needs '/' or faces an ambiguity.

Matching opening and ending tags (or parentheses, or brackets, ...) can
be done in O(n) on a single pass with the help of a stack. Or,
equivalently, with just a recursive parser.

see above.


but, yeah, I have thought more about it, and I may just branch off the
syntax entirely and drop backwards compatibility.

the reason was that a lot of these "extensions" I was considering would
be significant enough to make it not worthwhile to keep it as a
backwards-compatible extended syntax.


so, likely newer "new" design:
tag = '<' name [key '=' value]+ node* '>'
text = quoted_string | block_string
block_string = '<[[' ... ']]>'
value = name | number | quoted_string

names and numbers would basically use C-like rules:
nameinitchar = a-z | A-Z | _ | various_unicode_ranges
namechar = nameinitchar | 0-9
name = nameinitchar namechar*
digits = (0-9)+
number = digits [ '.' digits] [e|E ['+'|'-'] digits]

....


so, for example:
<foo a=9 <bar <[[here is some text]]>>>

will probably also consider switching to a C-like character escape notation.
<text "this string\ncontains\nnewlines!">
though likely including as extensions the ability to directly embed
newlines and do \ line-continuations.

<text "this string
contains
newlines!">

<text "this string \
contains \
spaces!">

plain quoted-strings would basically map to text nodes, and
block-strings would map to CDATA.


note that '/' would no longer be used, and traditional tag syntax would
no longer work. it would probably also drop DTDs and similar as well.
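
for illustration, a guess at how the earlier if/cond example might look
under this grammar (assuming '<' is not treated as special inside a
quoted string):

<if
  <cond <binary op="<" <ref name=x> <number value=3>>>
  <then <funcall name=foo <args>>>>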


yes, this is no longer XML, but it could be usable...
 
