Interplatform (interprocess, interlanguage) communication


Arne Vajhøj

Let's say you used a simple RandomAccessFile. How could you implement
a busy-lock field in the file to indicate the file was busy being
updated, or busy being read? In RAM you have test-and-set locks to
check a value and set the value in one atomic operation. How could
you simulate that without test-and-set hardware on the SSD?

java.nio.channels.FileLock with the caveats about what the OS
supports.
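
For illustration, a minimal sketch of that approach (the file name
"shared.dat" is made up; note that on Windows the lock is enforced,
while on most Unixes it is advisory):

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class FileLockDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("shared.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            // blocks until an exclusive lock on the whole file is acquired
            try (FileLock lock = ch.lock()) {
                // update the file here; other cooperating processes that
                // also use FileLock will wait their turn
            } // lock released on close
        }
    }
}
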
You can't very well share a RAM lock between separate jobs.

You can in most OS. It is just not well supported in Java.

Arne
 

BGB

»X« below is another language than Java, for example,
VBA, C#, or C.

I am mostly a C developer, so I am writing more from my perspective...

When an X process and a Java process have to exchange
information on the same computer, what possibilities are
there? The Java process should act as a client, sending
commands to the X process and also wants to read answers
from the X process. So, the X process is a kind of server.

My criteria are: reliability and it should not be extremely
slow (say exchanging a string should not take more than
about 10 ms). The main criterion is reliability.

»Reliability« means little risk of creating problems, little
risk of failure at run-time. (It might help when the client
[=Java process] can reset the communication to a known and
sane start state in case of problems detected at run-time.)

The host OS is Windows, but a portable solution won't hurt.

A list of possibilities I am aware of now:

Pipes

I have no experience with this. I heard one can establish
a new process »proc« with »exec« and then use

BufferedWriter out = new BufferedWriter(
    new OutputStreamWriter(proc.getOutputStream()));
BufferedReader in = new BufferedReader(
    new InputStreamReader(proc.getInputStream()));

no real comment, as I don't have much experience using pipes on Windows.
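
for reference, a minimal runnable sketch of that idea (the child
command "server.exe" is hypothetical, assumed to read one command line
and write one answer line):

import java.io.*;

public class PipeClient {
    public static void main(String[] args) throws IOException {
        Process proc = new ProcessBuilder("server.exe").start();
        try (BufferedWriter out = new BufferedWriter(
                 new OutputStreamWriter(proc.getOutputStream()));
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(proc.getInputStream()))) {
            out.write("PING");
            out.newLine();
            out.flush(); // flush, or the child may never see the command
            System.out.println(in.readLine());
        }
    }
}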

Files

One process writes to the end of a file, the other reads
from the end of the file? I never tried this, and don't know
whether it is guaranteed that one process can detect and
read what the other has just appended to a file.

What if the processes run very long and the files get too
large? But OTOH this is very transparent, which makes it easy
to debug, since one can open the files and directly inspect
them, or even append commands manually with »copy con file«.

IME, I have often seen synchronization issues in these cases. sometimes
the OS will refuse to let multiple programs access the same file at the
same time, but sometimes it does work (I think depending on how the file
is opened and which flags are given and similar).

if just naively using "fopen()" or similar (in C), IME/IIRC, the OS will
typically only allow a single version of the file to be open at once
(not necessarily as limiting as it may seem).


in scenarios where it has worked (multiple versions can be opened), it
often seems like the OS is "lazy": one process will see an out-of-date
version of the file data (the data will often be out-of-date until the
writer closes the file or similar).

I never really felt all that inclined to look into the how/why/when
aspects of all this.

a partial exception is when using shared-memory, which tends to stay
up-to-date.


these issues don't seem to really pop up so much if one passes data in
an "open file, write, close file" or "open file, read, close file"
strategy (then the file is always seen up-to-date, and typically the
chance of clash remains fairly small).

this strategy is arguably not very efficient, but it is fairly simple
and tends to work "well enough" for many use cases (particularly passing
"globs of data once in a great while", or when operating at
"user-interaction" time-frames, such as the file is reloaded, say,
because the user just saved to it).

if done well, this can be used to implement things like a "magic
notepad", whereby data edited/saved in Notepad is automatically
reflected in the running app (say, by polling+"stat()", then processing
the file if it has changed).

conceptually, the latency should only really be limited by polling rate
(although granted polling isn't free, and a process bogging down the
system by polling a file in a tight loop isn't necessarily desirable
either).
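
a minimal sketch of that polling idea (the watched file name is made
up; Files.getLastModifiedTime plays the role of "stat()"):

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.FileTime;

public class FilePoller {
    public static void main(String[] args)
            throws IOException, InterruptedException {
        Path path = Paths.get("watched.txt");
        FileTime last = Files.getLastModifiedTime(path);
        while (true) {
            Thread.sleep(250); // polling isn't free; don't spin in a tight loop
            FileTime now = Files.getLastModifiedTime(path);
            if (!now.equals(last)) {
                last = now;
                // "open file, read, close file": always sees up-to-date data
                byte[] data = Files.readAllBytes(path);
                System.out.println("reloaded " + data.length + " bytes");
            }
        }
    }
}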


another advantage of files is that they are more amenable to "makeshift
options" than some of the other strategies (one doesn't really need to
care what apps are thrown in the mix, so long as they can read/write the
files in question).

Sockets

This is slightly less transparent than files, but has the
advantage that it becomes very easy to have the two
processes running on different computers later, if this
should ever be required. Debugging should be possible
by a man-in-the-middle proxy that prints all information
it sees or by connecting to the server with a terminal.

I have used sockets for IPC before, fairly successfully.

a minor issue with TCP for IPC though is that sometimes the buffering
does something very annoying:
no matter how long one waits, TCP will not send the data until a certain
amount has been written to the socket (this is Nagle's algorithm; one
can disable it via TCP_NODELAY, but unbuffered sockets can be evil on a
network if used naively, such as writing an individual byte or datum at
a time rather than sending the entire message in a single write, since
an unbuffered socket may attempt to send a segment for *every* write to
the socket).

TCP works fairly well for transmitting lots of small messages (and apart
from the potential buffering issue has very little latency).
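
a minimal sketch of both points (the port number is made up): write the
whole message in one go, and disable Nagle's algorithm so small
messages are not held back:

import java.io.OutputStream;
import java.net.Socket;

public class SmallMessageSender {
    public static void main(String[] args) throws Exception {
        try (Socket sock = new Socket("localhost", 5000)) {
            sock.setTcpNoDelay(true); // disable Nagle: send small writes now
            OutputStream out = sock.getOutputStream();
            byte[] msg = "PING\n".getBytes("US-ASCII");
            out.write(msg); // one write per message, not one write per byte
            out.flush();
        }
    }
}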

UDP also has some merit, but brings the annoying hassle of having to
pack one's messages into UDP datagrams. UDP is much more resistant to
stalls, which can easily become an issue for TCP sockets going over the
wider internet; however, UDP is unreliable and unordered, which also
needs to be taken into account.

JNI

JNI might be used to access code written in C or
ABI-compatible languages. This should be fast, but I heard
that JNI code is error-prone to write and needs some
learning (making the code less maintainable)?

JNI can work, but is also annoying in some ways.

if one simply wants to call functions or pass data or messages to/from C
code, it works fairly well. JNI is, however, not readily capable of IPC
AFAIK. it also may result in some level of "physical coupling" between
code in the languages in question (may or may not be desirable, probably
depends on the task, often it is preferable IME to avoid coupling where
possible, even within code in the same language).

it is also not necessarily all that much more convenient than options
such as sockets (likely depends a lot on the task though, for many, it
may just be easier to write a message parser/dispatcher for whatever
comes over the socket).
 

BGB

I recommend using sockets.

in general, I agree (sockets generally make the most sense), although
there are cases where file-based communications can make sense, though
probably not in the form described in the OP.


another issue (besides how to pass messages) is what sort of form to
pass messages in.

usually, in my case, if storing data in files, I tend to prefer
ASCII-based formats.

usually, for passing messages over sockets, I have used "compact"
specialized binary formats, typically serialized data from some other
form (such as XML nodes or S-Expressions). although "magic byte value"
based message formats are initially simpler, they tend to be harder to
expand later (whereas encoding/decoding some more generic form, though
initially more effort, can turn out to be easier to maintain and extend
later).

note: this does not mean SOAP or CORBA or some other "standardized"
messaging system, rather just that one initially builds and processes
the messages in some form that is more high-level than spitting out
bytes, and processing everything via a loop and a big "switch()" or
similar (although this can be an initially fairly simple option, so has
some merit due to ease of implementation).
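
as a concrete illustration of that "loop and a big switch()" option
(the opcodes here are made up):

import java.io.*;

public class MessageDispatcher {
    static void dispatch(DataInputStream in) throws IOException {
        while (true) {
            int opcode = in.readUnsignedByte();
            switch (opcode) {
            case 0x01: // PING: no payload
                System.out.println("ping");
                break;
            case 0x02: // TEXT: length-prefixed UTF-8 string
                byte[] buf = new byte[in.readUnsignedShort()];
                in.readFully(buf);
                System.out.println(new String(buf, "UTF-8"));
                break;
            default:
                throw new IOException("unknown opcode " + opcode);
            }
        }
    }
}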


the main reason for picking a binary message-serialization format (for
something like S-Expressions or XML nodes), would be mostly if there is
a chance that the serialized data will go over the internet, and a
textual format can be a bit bulkier (and thus slower to transmit over a
slower connection), as well as typically being slower to decode (a
sanely designed message format can be much more quickly unpacked than a
textual format can be parsed).

sending text over sockets may have merits as well, and is generally
preferable for "open" protocols.


or such...
 

Arved Sandstrom

On 2/7/2012 11:11 AM, jebblue wrote: [ SNIP ]
I recommend using sockets.

[ SNIP ]

I've done a fair bit with sockets myself, including recently, in fact
including on a current gig. Some of the message formats have been
designed by others, some by me. A few of them are specialized industry
standards, some are very custom and bespoke.

A few of the formats have been binary: fixed-length blocks of data with
fields at various offsets. Works well enough if it suits the data.

A bunch of others have been text and line-oriented: a fixed number of
lines of data in known order, so that line 10 is always the data for a
particular field.

Other things to consider: JAXB, JSON etc. Minimum coding fuss at the
endpoints if that's what's appropriate for constructing message payloads.

For some simple situations I like text-based protocols that behave
like SMTP or POP. But it obviously depends on what you expect your
client and server to do; it's just another approach to be aware of.

One of the big things in designing one's own messaging is error
handling. People generally do just fine with the happy path, but ignore
comprehensive error handling, or get wrapped around the axle trying to
do it.

A lot of situations admit of more than one approach.

AHS
 

Arne Vajhøj

[ SNIP ]
usually, for passing messages over sockets, I have used "compact"
specialized binary formats, [ SNIP ]

If you want compact and text, go for JSON.

Arne
 

Martin Gregorie

in general, I agree (sockets generally make the most sense), although
there are cases where file-based communications can make sense, though
probably not in the form described in the OP.

Yes, for small amounts of data or message passing between processes I
tend to like sockets - as others have said, the fact that they are
agnostic about the location of the communicating processes is often very
useful.
usually, for passing messages over sockets, I have used "compact"
specialized binary formats,

Yep. ASN.1 has to be about the most compact way of encoding structured,
multi-field messages with XML occupying the other end of the scale.

That said, for short list-of-fields messages I often use a CSV string
preceded by an unsigned binary byte containing the string length:
this type of message is both easy to transfer, even if the connection
wants to fragment it during transmission, and, by having a printable
text payload, it's also convenient for troubleshooting.
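
A minimal sketch of that scheme (readFully copes with the connection
fragmenting a message during transmission):

import java.io.*;

public class CsvMessage {
    static void write(OutputStream out, String csv) throws IOException {
        byte[] payload = csv.getBytes("US-ASCII");
        if (payload.length > 255)
            throw new IllegalArgumentException("payload too long");
        out.write(payload.length); // one unsigned length byte
        out.write(payload);
        out.flush();
    }

    static String read(InputStream in) throws IOException {
        int len = in.read(); // unsigned length byte, or -1 at end of stream
        if (len < 0)
            throw new EOFException();
        byte[] payload = new byte[len];
        new DataInputStream(in).readFully(payload);
        return new String(payload, "US-ASCII");
    }
}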
 

BGB

Yes, for small amounts of data or message passing between processes I
tend to like sockets - as others have said, the fact that they are
agnostic about the location of the communicating processes is often very
useful.

yep.


Yep. ASN.1 has to be about the most compact way of encoding structured,
multi-field messages with XML occupying the other end of the scale.

I disagree partly WRT ASN.1:
a disadvantage of ASN.1 is that a lot of times it tends to use
fixed-width integer encodings (and often sends structures in a
"reasonably raw" form), whereas one can shave more bytes using a
variable-length-integer scheme (why encode an integer in 4 bytes if you
only need 1 byte in a given case?). it is also possible to shave more
bytes if one makes the format use an adaptive/context-sensitive encoding
scheme and maybe a variant of Huffman coding or similar (and possibly
encode integer values using a similar scheme to that used in Deflate).
it is in fact not particularly difficult to outperform ASN.1 in these
regards.


granted, yes, custom Huffman-based data encodings are probably not "the
norm" for network protocols (though some programs, such as the Quake 3
engine, have used Huffman-compressed network protocols).

there is also "arithmetic coding" and "range coding", but with these it
is a lot harder to make the codec be acceptably fast (whereas there are
some tricks to allow optimizing Huffman codecs).


in cases where I have used XML, I have typically used a custom binary
XML variant, which can greatly reduce the overhead vs textual XML. in
terms of saving bytes, my encoding can be more compact than WBXML or
XML+Deflate, but is arguably more "esoteric", and as-is doesn't make use
of schemas (it is instead a basic adaptive coding, and is vaguely
similar to an LZ-Markov coding, attempting to exploit repeating patterns
in tag-structure and similar via prediction, but like most adaptive
codings initially transmits the data in a less dense form as it needs to
build up a new context for each message). the coding in question doesn't
use Huffman coding (for sake of simplicity, and because I don't always
particularly need "maximum compactness"), but a Huffman-based variant
could be created if needed.

there is also EXI, but I don't know how my encoding compares (EXI
probably does better though, given that IIRC it uses binary universal
codes and schemas).


for something else of mine I am using S-Expression based messages
(currently between components within the same process), and had
considered using a vaguely similar binary coding if/when I get around to it.

That said, for short list-of-fields messages I often use a CSV string
preceded by an unsigned binary byte containing the string length:
this type of message is both easy to transfer, even if the connection
wants to fragment it during transmission, and, by having a printable
text payload, it's also convenient for troubleshooting.

yes, this is possible.

also possible would be a TLV encoding (say, doing something
similar to the Matroska MKV file-format).


say, the integer values are encoded something like (range, encoding):

    0-127            0xxxxxxx
    128-16383        10xxxxxx xxxxxxxx
    16384-2097151    110xxxxx xxxxxxxx xxxxxxxx
    2097152-...      ...

likewise, one can get a signed variant by folding the sign into the LSB,
forming a pattern like: 0, -1, 1, -2, 2, ...

then, one defines tags as:

    {
        VLI tag;
        VLI length;
        byte data[length];
    }

where tags can hold either data or messages (and the smallest tag
needs 2 bytes, or 3 bytes if one has 1 byte of payload for the tag).
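
a minimal sketch of the scheme above (only the first three ranges,
plus the zig-zag sign folding from earlier):

import java.io.*;

public class Vli {
    static void write(OutputStream out, int v) throws IOException {
        if (v < 0x80) {                // 0xxxxxxx
            out.write(v);
        } else if (v < 0x4000) {       // 10xxxxxx xxxxxxxx
            out.write(0x80 | (v >> 8));
            out.write(v & 0xFF);
        } else if (v < 0x200000) {     // 110xxxxx xxxxxxxx xxxxxxxx
            out.write(0xC0 | (v >> 16));
            out.write((v >> 8) & 0xFF);
            out.write(v & 0xFF);
        } else {
            throw new IllegalArgumentException("out of range for this sketch");
        }
    }

    static int read(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0x80) return b;
        if (b < 0xC0) return ((b & 0x3F) << 8) | in.read();
        return ((b & 0x1F) << 16) | (in.read() << 8) | in.read();
    }

    // fold the sign into the LSB: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    static int fold(int v)   { return (v << 1) ^ (v >> 31); }
    static int unfold(int u) { return (u >>> 1) ^ -(u & 1); }
}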


if the length is optional (presence depends on tag), one can reduce the
typical tag size to 1 byte. likewise, tags can be combined with an
MTF/MRU scheme such that any recently used tags have a small value (and
can thus be encoded in a single byte). (many of my formats define tags
inline, rather than relying on some large hard-coded tag-list).

more bytes can be saved if more of the message structure is known, say
that not only does the tag encode a particular tag-type, but also may
carry information about what follows after it (various combinations of
attributes, and if it contains sub-tags and what they might be, ...).

if a new tag is defined, it is added to the MRU, but if not used
frequently may move "backwards" (towards higher index numbers) or
eventually be forgotten (falls off the end of the list).

note that some hard-coded tag-numbers will be needed for basic control
purposes (encoding new/unfamiliar tags, ...).
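
a minimal sketch of the MTF/MRU idea for tags (recently used tag names
get small indices and can be encoded in one byte; new names are sent
inline once):

import java.util.*;

public class TagMru {
    private final List<String> mru = new ArrayList<String>();

    // index to encode, or -1 meaning "new tag, send the name inline"
    int encode(String tag) {
        int idx = mru.indexOf(tag);
        if (idx >= 0) mru.remove(idx); // move to front
        mru.add(0, tag);
        if (mru.size() > 64) mru.remove(mru.size() - 1); // forget old tags
        return idx;
    }

    // the decoder mirrors the encoder's MRU updates exactly
    String decode(int idx, String inlineName) {
        String tag = (idx >= 0) ? mru.remove(idx) : inlineName;
        mru.add(0, tag);
        if (mru.size() > 64) mru.remove(mru.size() - 1);
        return tag;
    }
}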


a Huffman-based variant could be similar, though one may encode integers
differently. an example scheme is to use a prefix value (Huffman coded)
and a suffix bit pattern (similar to Deflate). a simpler (but less
compact) scheme was used in JPEG, and IIRC I had previously "compromised"
between them by having the Huffman table be stored using Rice codes.


example (prefix range, value range, suffix bits):

    0-15     0-15         0
    16-23    16-31        1
    24-31    32-63        2
    32-39    64-127       3
    40-47    128-255      4
    48-55    256-511      5
    56-63    512-1023     6
    64-71    1024-2047    7
    72-79    2048-4095    8
    80-87    4096-8191    9
    ....

also note that a nifty thing (also used in Deflate) is to compress the
Huffman table itself using Huffman coding.


likewise, one can save a few bytes if the encoder is smart enough to
recognize when tags encode numeric data (mostly specific to XML, with
S-Expressions or similar one knows when they are dealing with numeric data).

likewise, one can encode floats as a pair of integer values (although
floats present a few of their own complexities). one can also devise
special encodings for things like numeric vectors, quaternions, ... if
needed as well.


likewise, either an LZ77 or LZ-Markov scheme can be used for encoding
strings (an example would be to use a fixed-size rotating window like
in Deflate, essentially using the same basic encoding for strings,
albeit likely with the use of an "End-Of-String" marker).

say (range, meaning):
0-255: literal byte values
258: End Of String
259-321: LZ77 Run (encodes length, followed by window offset).

String encoding would be used, say, for encoding both literal text, and
also for escaping things like tag and attribute names.

....


the main variability is mostly in terms of the type of payload being
transmitted:
be it XML-based, S-Expression based, or potentially object-based
(similar to either JSON, or a sort of "heap pickling" style system).


for most structured data, it shouldn't be needed to change the
"fundamentals" too much. the main difference is between tree-structured
and heap-like / graph-structured data, as graph-structured data is often
better sent as a flat list of objects with a certain entry being a "root
node" than as a tree (this can be accomplished either by building a
list, or using an algorithm to detect and break-up cycles when needed).


granted, for most use-cases something like this is likely to be overkill.


or such...
 

BGB

On 2/7/2012 11:11 AM, jebblue wrote: [ SNIP ]

For some simple situations I like text-based protocols that behave
like SMTP or POP. But it obviously depends on what you expect your
client and server to do; it's just another approach to be aware of.

well, text need not be all that limiting.
if one has XML or free-form S-Expressions (in their true sense, like in
Lisp or Scheme, not the mutilated/watered-down Rivest ones), then one
can do a fair amount with text.

IME, there are many tradeoffs (regarding ease of use, ...) between XML
and S-Exps, and neither seems "clearly better" (as far as
representations go, I find S-Exps easier to work with, but namespaces
and attributes in XML can make it more flexible, as one can more easily
throw new tags or attributes at the problem with less chance of breaking
existing code).

an example is this:
<foo> <bar value="3"/> </foo>
and:
(foo (bar 3))

now, consider one wants to add a new field to 'foo' (say 'ln').
<foo ln="15"> <bar value="3"/> </foo>
and:
(foo 15 (bar 3))

a difference here is that existing code will probably not even notice
the new XML attribute, whereas the positional nature of most
S-Expressions makes the latter far more likely to break something (and
there is no good way to "annotate" an S-Exp, whereas with XML it is
fairly solidly defined that one can simply add new attributes).
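
to illustrate, a minimal sketch using the standard DOM API (code that
reads only the attributes it knows about simply ignores ones added later):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class AttrDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<foo ln=\"15\"> <bar value=\"3\"/> </foo>";
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Element foo = doc.getDocumentElement();
        Element bar = (Element) foo.getElementsByTagName("bar").item(0);
        System.out.println(bar.getAttribute("value")); // prints 3; 'ln' ignored
    }
}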


note: my main way of working with XML is typically via DOM-style
interfaces (if I am using it, it is typically because I am directly
working with the data structure, and not as the result of some dumb-ass
"data binding" crud...).


typically, the "internal representation" and "concrete serialization"
are different:
I may use a textual XML serialization, or just as easily, I could use a
binary format;
likewise for S-Exps (actually, I probably far more often represent
S-Exps as a binary format of one form or another than I use them in a
form externally serialized as text).

all hail the mighty DOM-node or CONS-cell...

One of the big things in designing one's own messaging is error
handling. People generally do just fine with the happy path, but ignore
comprehensive error handling, or get wrapped around the axle trying to
do it.

yeah, but this applies to programming in general, so message-passing is
likely nothing special here. one issue maybe special to sockets though
is the matter of whether or not the whole message has been received,
often resulting in some annoying code to basically read messages from
the socket and not decode them until the entire message has been received.

A lot of situations admit of more than one approach.

agreed.

it is like me and file-formats.
often I just use ASCII text (simple, easy, editable in Notepad or
similar, ...).

I make plenty of use of simple line-oriented text formats as well.

other times, I might use more advanced binary formats, or maybe even
employ "data compression" techniques (such as Huffman
coding), so a lot depends.
 

Arved Sandstrom

On 2/7/2012 11:11 AM, jebblue wrote: [ SNIP ]


well, text need not be all that limiting.

You may have misunderstood something I said if you got that impression
from me, that text is all that limiting. :)

[ SNIP ]
note: my main way of working with XML is typically via DOM-style
interfaces (if I am using it, it is typically because I am directly
working with the data structure, and not as the result of some dumb-ass
"data binding" crud...).

I haven't been able to completely avoid using the DOM, but I loathe the
API. If I'm using XML at all, and JAXB suits, I'll use JAXB. More
generally I'll use SAX or StAX.

I almost never encounter a situation where DOM is called for, simply
because no random access to the document is called for. When I send XML
back and forth as a payload, the entire thing is meant to be used, and
it makes sense to do the immediate and complete conversion into real
information rather than storing it into an opaque and kludgy DOM
representation.

For a lot of situations, not just message passing between endpoints, I
have backed away from XML anyway. For configuration files I have gotten
newly enthused by .properties files, because so often they fit the bill
much better than XML configuration files. And I mentioned JSON
previously, I prefer that to XML in many situations now.

[ SNIP ]
yeah, but this applies to programming in general, so message-passing is
likely nothing special here.

That's true, but it's maybe a bit more of an art form with messages.
Your message producer may be Java and produce beautiful exceptions in
your carefully designed exception hierarchy, but your clients may very
well not be Java at all, in which case you may end up with an error
message sub-protocol that borrows ideas from HTTP status codes.

A lot of Java programmers these days maybe have never really dealt with
return codes, because we sort of tell them not to use them in Java, but
in the case of implementation-neutral status codes (including ones for
errors) that's really the design mindset that you need to be in: status
codes.

one issue maybe special to sockets though
is the matter of whether or not the whole message has been received,
often resulting in some annoying code to basically read messages from
the socket and not decode them until the entire message has been received.

There is that. Although I find that once you've worked through one or
two socket implementations, you tend to devise some pretty re-usable
code for handling the incomplete-message situations.
[ SNIP ]

AHS
 

BGB

On 12-02-07 07:38 PM, BGB wrote:
On 2/7/2012 11:11 AM, jebblue wrote:
[ SNIP ]


well, text need not be all that limiting.

You may have misunderstood something I said if you got that impression
from me, that text is all that limiting. :)

[ SNIP ]

ok.

it came across as though you were implying that text only really works
well for simple protocols, like SMTP, POP, HTTP, ...

I haven't been able to completely avoid using the DOM, but I loathe the
API. If I'm using XML at all, and JAXB suits, I'll use JAXB. More
generally I'll use SAX or StAX.

I have rarely done things for which SAX has made sense...
usually in cases where SAX would make sense, I end up using
line-oriented text formats instead (because there is often little
obvious reason for why XML syntax would make much sense).

I almost never encounter a situation where DOM is called for, simply
because no random access to the document is called for. When I send XML
back and forth as a payload, the entire thing is meant to be used, and
it makes sense to do the immediate and complete conversion into real
information rather than storing it into an opaque and kludgy DOM
representation.

often, I use it for things like compiler ASTs, where it competes some
against S-Expressions (they are produced by the main parser, worked on,
and then later converted into bytecode or similar).

typically, one works by walking the tree, and potentially
rebuilding/rewriting a new tree in the process, or maybe adding
annotations to the existing tree.


a recent case where I did consider using XML as a message-passing
protocol, I ended up opting for S-Expressions (or, more properly,
Lisp-style lists) instead, mostly because they are a lot easier to build
and process, and much less painful than working with a DOM-style API
(and also because S-Expressions tend to perform better and use less
memory in my case as well...).

typically, the messages are tree-structured data of some sort (in the
recent example, it was being used for scene-graph delta messages, which
basically update the status of various objects in the scene, as well as
passing other events for things "going on", like sound-effects being
heard, updates to camera location and status, ...).


it is also desirable to keep the serialized representation small, since
a lot may be going on (in real time), and it would be annoying (say, to
players) if the connection got needlessly bogged down sending lots of
overly verbose update messages (more so if one has stuff like
network-synchronized rag-dolls or similar, where a ragdoll may send
position updates for nearly every bone for every frame).

say:
(bonedelta 499 (bone 0 (org ...) (rot ...)) (bone 1 (org ...) (rot ...))
....)
(bonedelta 515 ...)
....


hence, it may make a little sense to employ a compressed binary format.
I also personally dislike schemas or similar concepts, as they tend to
make things brittle (both the transmitter and receiver need a correct
and up-to-date schema, creating a higher risk of version issues), and
typically don't really compress all that much better (and are
potentially worse) than what a decent adaptive coding can do.

("on the wire", S-Exps and XML are not all that drastically different,
the main practical differences are more in terms of how one may work
with them in-program).


granted, yes, text+deflate also works OK if one is feeling lazy (since
IME Deflate will typically reduce textual XML or S-Exps to around
10%-25% of their original size, vs say the 5%-10% one might get with a
specialized binary format).

there is also the tradeoff of designing a binary format to be standalone
(say, including its own Huffman compressor), or to be used in
combination with deflate (at which point one tries to design the format
to instead produce data which deflate can utilize efficiently).

in the latter option, there is the secondary concern of external deflate
(assuming that the data will probably be sent in a compressed channel or
stored in a ZIP file or similar), or using deflate internally (like in
PNG or similar).

there are many tradeoffs...

For a lot of situations, not just message passing between endpoints, I
have backed away from XML anyway. For configuration files I have gotten
newly enthused by .properties files, because so often they fit the bill
much better than XML configuration files. And I mentioned JSON
previously, I prefer that to XML in many situations now.

I typically use line-oriented text formats for most of these purposes...

never really did understand why someone would use XML for things like
configuration files (it neither makes them easier to process, nor does
it make them any easier for users to edit).


as-is, my configuration format consists of "console commands", which may
in turn set "cvars" or issue key-binding commands, ...

for another (more serious) system, I am using a format which is
partially a hybrid of INI and REG files (it is for a registry-like
hierarchical database). I have on/off considered switching to a binary
database format, but never got around to it.

some amount of other data is stored in formats similar to the Quake map
format, or other special-purpose text formats.

[ SNIP ]
yeah, but this applies to programming in general, so message-passing is
likely nothing special here.

That's true, but it's maybe a bit more of an art form with messages.
Your message producer may be Java and produce beautiful exceptions in
your carefully designed exception hierarchy, but your clients may very
well not be Java at all, in which case you may end up with an error
message sub-protocol that borrows ideas from from HTTP status codes.

A lot of Java programmers these days maybe have never really dealt with
return codes, because we sort of tell them not to use them in Java, but
in the case of implementation-neutral status codes (including ones for
errors) that's really the design mindset that you need to be in: status
codes.

granted, I am actually primarily a C and C++ programmer, but
message-passing isn't particularly language-specific. that said, the
lack of "standard" exceptions is an annoyance in C, where typically one
either needs to not use exceptions, or ends up using non-portable
exception mechanisms, and there is no particularly good way to "build
one's own", although some people have done some fairly "creative"
things with macros...

one issue maybe special to sockets though
is the matter of whether or not the whole message has been received,
often resulting in some annoying code to basically read messages from
the socket and not decode them until the entire message has been received.

There is that. Although I find that once you've worked through one or
two socket implementations that you tend to devise some pretty re-usable
code for handling the incomplete message situations.
[ SNIP ]

yep.

one can always tag messages and give them a length:

    { tag, length, data[length] }

a message is then not processed until the entire data region has been
received. typically, this is plenty sufficient.
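
a minimal sketch of the read side (the 1-byte tag and 4-byte length are
assumptions; readFully blocks until the whole payload has arrived):

import java.io.*;

public class FrameReader {
    static void readLoop(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        while (true) {
            int tag = din.readUnsignedByte();
            int length = din.readInt(); // big-endian message length
            byte[] data = new byte[length];
            din.readFully(data); // decode only once the message is complete
            handleMessage(tag, data);
        }
    }

    static void handleMessage(int tag, byte[] data) {
        System.out.println("message tag=" + tag + ", " + data.length + " bytes");
    }
}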


likewise, a PPP/HDLC style system (message start/end codes) could also
be used.


depending on other factors, one can also do things like in JPEG or MPEG,
and use a special escape-code for messages and control-codes.

this can allow a top-level message format like:
{ escape-code, tag [ length, data[length] ... ] }

typically, in such cases I have seen, there have been ways to escape
the escape-code, usually for cases where the escape code appeared
by chance in the data. this in turn adds the annoyance of typically
having to escape any escape-codes in the payload data.

some others have partly worked around the above by making the escape
code fairly long (32 or 48 bits or more) and very unlikely to appear by
chance, and likely involving "sanity checks" to try to rule out false
positives.

say: { escape-magic, tag, length, data[length], checksum }
with the assumption that chance is very unlikely to lead to all of:
an escape magic, a valid tag value, a sane length, and a valid checksum.

depending, the escape-magic and tag can be the same value.

for example:
the byte 0x7E is magic;
7E,00 escapes 7E (or maybe 7E,7E)
7E,01 Start Of Message (followed by message data)
7E,02 End Of Message (maybe, followed by checksum)
others: reserved for link-control messages.


then one can pass encoded messages over the link.

typically, I have not tried parsing incomplete messages, as trying to
make a message decoder deal gracefully with truncated data is a bit more
of a hassle.

depending on other factors (say, if one is using Huffman), then one can
also use special markers to transmit the Huffman tables and other things.

say:
7E,03: Stream Reset (possibly followed by a stream/protocol ID magic)
7E,04-07: Huffman Tables 0-3
7E,08: End Of Huffman Table
....
 

Lew

BGB said:
...
an example is this:
<foo> <bar value="3"/> </foo>
and:
(foo (bar 3))

now, consider one wants to add a new field to 'foo' (say 'ln').
<foo ln="15"> <bar value="3"/> </foo>
and:
(foo 15 (bar 3))

a difference here is that existing code will probably not even notice
the new XML attribute, whereas the positional nature of most

Ahem. You mean other than failing schema validation?

S-Expressions makes the latter far more likely to break something (and

More likely than failing schema validation was for that well-designed
XML-based application?

there is no good way to "annotate" an S-Exp, whereas with XML it is
fairly solidly defined that one can simply add new attributes).

Attributes in XML are not annotation (with or without quotes). That role
is filled by the actual 'annotation' element:
http://www.w3schools.com/schema/el_annotation.asp

note: my main way of working with XML is typically via DOM-style
interfaces (if I am using it, it is typically because I am directly
working with the data structure, and not as the result of some dumb-ass
"data binding" crud...).

Sorry, "dumb-ass 'data-binding' crud"?

Why the extreme pejoratives? I would not say that there's anything wrong
with XML data-binding /per se/, although as with document-oriented
approaches it can be done very badly.
typically, the "internal representation" and "concrete serialization"
are different:

I don't understand what you mean here. You cite these terms in quotes as though
they are a standard terminology for some specific things, but use them in their
ordinary meaning. The internal representation of what? The serialization
("concrete" or otherwise) of what? I don't mean to be obtuse here, but I am not
grokking the referents.
I may use a textual XML serialization, or just as easily, I could use a
binary format;
likewise for S-Exps (actually, I probably far more often represent
S-Exps as a binary format of one form or another than I use them in a
form externally serialized as text).

all hail the mighty DOM-node or CONS-cell...

WTF?
 

BGB

Ahem. You mean other than failing schema validation?

many of us don't use schemas with our XML.

I think the issue is that one particular technology, XML, is used in
significantly different ways by different people and for different reasons.

many people use XML for data-binding, and many other people who use it
couldn't care less about data-binding.


some people may use XML for similar purposes to how people using Lisp
would use lists (never mind if this is kind of awkward; it does work).

like, doing Lisp-type stuff in Java using DOM-nodes in place of
cons-based lists... +1 now that Java also (sort of) has closures.

More likely than failing schema validation was for that well-designed XML-based
application?

as noted, many people neither use schemas nor any sort of schema
validation. in many use-cases, schemas are overly constraining on the
ability of XML to represent free-form data, or using them
otherwise would offer little particular advantage.

say, if one is using XML for compiler ASTs or similar (say, the XML is
used to represent a just-parsed glob of source-code), does one really
need any sort of schema?

http://en.wikipedia.org/wiki/Abstract_syntax_tree

Attributes in XML are not annotation (with or without quotes). That role is filled by the actual 'annotation' element
http://www.w3schools.com/schema/el_annotation.asp

they can be used for annotating the nodes in many sane use cases...

a lot depends on how one is using the XML in a given context.

Sorry, "dumb-ass 'data-binding' crud"?

Why the extreme pejoratives? I would not say that there's anything wrong with
XML data-binding /per se/, although as with document-oriented approaches it
can be done very badly.

yeah, this may have been stated overly strongly.

personally, IMO, data-binding is probably one of the worse and
technically more pointless ways of using XML (it leads to similarly
ill-designed technologies such as SOAP...).

not that data-binding is necessarily itself pointless, but doing
it via overly verbose namespace-ridden XML is probably one of the worse
ways of doing it (vs either specialized file-formats, or the use of
binary data-binding formats, which IMO should also not be used for data
interchange).


admittedly, I also partly dislike traditional ways of using data-binding,
as it often couples things which are theoretically internal to the app,
namely the structural data representation (via classes and similar),
with things which should be isolated from the internal data
representation: file formats.

or, IOW: a file-format (or protocol/...) should express the data in
itself, and not express how it is physically represented within the
application.

likewise, data going into or coming out of a piece of code should
ideally be documented and defined in a form separate from the component
in question.

otherwise, data-binding is not that much different than a more modern
variant of writing raw structures and arrays to files.

I don't understand what you mean here. You cite these terms in quotes as though
they are a standard terminology for some specific things, but use them in their
ordinary meaning. The internal representation of what? The serialization
("concrete" or otherwise) of what? I don't mean to be obtuse here, but I am not
grokking the referents.

the internal representation of the data within the application code.

if one knows which objects or classes exist, what sorts of members they
contain, ... then one is essentially exposing data which should not be
visible, relied upon for data interchange, or for that matter
particularly relevant.

ideally, any data represented externally should be defined in terms of
its semantics: something will be present if it is relevant to the
meaning of the data. the serialization will then be defined in terms of
expressing the structure and semantics of the data, which may bear very
little resemblance to the actual classes/arrays/whatever that make up
how the data is represented internally to the application.

similarly, file formats should be as much abstracted from the
application code as is reasonably possible, with a "concrete"
specification for the file-format or data-representation being written
instead.


both XML and S-Expressions can be used as structured ways of
representing semantics, rather than as ways of representing the contents
of a given data-object.


DOM nodes can be very powerful (and are probably a much better way of
using XML than using it as some sort of data-binding thing).


cons-cells are pairs of dynamically-typed values, typically called "car"
and "cdr" and used to implement lists and similar (and are the main
building block of "everything" in languages like Lisp and Scheme, well,
along with "symbols" and "fixnums" and similar).

http://en.wikipedia.org/wiki/Cons_cell

they can also be implemented in C, C++, and Java without too much
trouble, and can be a fairly useful way of building various sorts of
data structures (although, sadly, they aren't nearly as efficient in
Java as they could be, but OTOH it is also sort of a pain to build a
dynamic type-system in C, so it probably evens out...).

then one can proceed to build logic based mostly on building and
processing lists.

or, conceptually, they can be regarded as a type of linked-list based
container, however the ways they are traditionally used are
significantly different from traditional ways of using containers (they
are typically used as ways of building tree-structures, rather than
as ways of storing a collection of items).
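
a minimal sketch of a cons cell in Java, per the above (a pair of
dynamically-typed values used to build lists and trees):

public class Cons {
    public Object car, cdr;

    public Cons(Object car, Object cdr) {
        this.car = car;
        this.cdr = cdr;
    }

    // (foo (bar 3)) built from cons cells, with nil represented as null
    public static Cons example() {
        return new Cons("foo",
            new Cons(new Cons("bar", new Cons(3, null)), null));
    }

    // walk a proper list, printing each element
    public static void printList(Cons lst) {
        for (Cons c = lst; c != null; c = (Cons) c.cdr) {
            System.out.println(c.car);
        }
    }
}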


it may be worthwhile to look-up information regarding Lisp and Scheme
and similar, not that there is necessarily much reason to actually use
the languages, but there are some ideas and ways of doing things which
can be mapped fairly nicely onto other, more common, languages.
 

Arne Vajhøj

as noted, many people neither use schemas nor any sort of schema
validation. in many use-cases, schemas are overly constraining on the
ability of XML to represent free-form data, or using them
otherwise would offer little particular advantage.

xsd:any does provide some flexibility in schemas.

say, if one is using XML for compiler ASTs or similar (say, the XML is
used to represent a just-parsed glob of source-code), does one really
need any sort of schema?

I would expect syntax trees to follow certain rules and not be free
form.

Arne
 

Arne Vajhøj

I have rarely done things for which SAX has made sense...
usually in cases where SAX would make sense, I end up using
line-oriented text formats instead (because there is often little
obvious reason for why XML syntax would make much sense).

Non-flat structure and validation come to mind.

Arne
 

BGB

Non-flat structure and validation come to mind.

fair enough.

often, one can implement non-flat structures with line-oriented formats,
for example:

    ....
    groupDef {
        ....
        groupDef {
            itemDef {
                ....
            }
            ....
        }
        ....
    }

a lot of the time this may be combined with cosmetic indentation, but
this does not change whether it is a line-oriented format; for example,
writing:

    groupDef
    {
        ....
    }

could very well break the parser.


typically, I have not used validation:
if there is anything to validate, this logic will be placed in
the code that parses the text.

a lot of times, code operates under the assumption that nearly anything
which can be reasonably done is valid de-facto (the code is written,
however, to ideally not do anything compromising).


granted, typically I don't deal a whole lot with anything "security
critical" or where there is much need to worry about "trust" or
"authorization" or similar (or if privacy or money or similar was
involved...). maybe if security were more of a concern, then added
layers of pedantry and validation would make a lot more sense.

in my typical use-cases, the theoretical worst case would probably be if
a 3rd party could somehow break the app and get control of the users' OS
or similar and cause damage, but again, modern Windows is itself partly
designed to try to defend against this (running applications by default
with constrained privileges, ...).
 

Lew

xsd:any does provide some flexibility in schemas.


I would expect syntax trees to follow certain rules and not be free
form.

In one breath we're singing the praises of binary formats, in the next we
complain that XML isn't sufficiently flexible.

"Do they really need any sort of schema?" with XML is usually a "yes".

But only if you're interested in clear, unambiguous, readily-parsable and
maintainable XML document formats.

People often excoriate the supposed verbosity of XML as though it were the only
criterion to measure utility.

There is no inherent advantage of a LISP/list-like format over any other,
nor vice versa; it's all accordin'. If the convention is agreeable to all
parties, it will work. If all projects were one-off and isolated from the
larger world, we'd never need to adhere to a standard. If we don't mind
inventing our own tools for anything, we'd never have to adopt a standard
with extensive tools support.

Where are the *real* costs of a software system?
 

BGB

xsd:any does provide some flexibility in schemas.

yep, but one can wonder what the gain of using a schema is if one is
just going to use "xsd:any"?...

it is also a mystery how well EXI behaves in this case (admittedly, I
have not personally looked into EXI in-depth, as I only briefly skimmed
over the spec a long time ago).

I would expect syntax trees to follow certain rules and not be free
form.

well, there are some rules, but the question is more whether a schema or
the use of validation would offer enough advantage to make using it
worth the bother?...

the other possibility would be to make the next compiler stage, upon
seeing invalid data, give an error message essentially like "what the
hell is this?..." and halt compilation (typically this is what happens
if the compiler logic encounters a node type it doesn't know how to do
anything with in a situation where a known node-type is expected, or if
some required node is absent or similar).


so, one can have a schema to validate, say, that one's "if" node looks
like:

    <if>
        <cond> expr </cond>
        <then> statement </then>
        <else> statement </else>
    </if>

but, OTOH, if upon getting back a null node when looking for "cond" or
"then", it causes an internal-error message to get displayed, it is the
same effect. even if it just ungracefully tries to use the null and
causes the program to crash, it is probably still not a huge loss (apart
from the annoyance that is a crash-prone compiler...).



I think the original point was more about XML vs S-Expressions in
this case:
XML allows more easily just stuffing-in new tags or contents for
existing tag-types, if this makes sense (it doesn't necessarily break
existing code or structures, and actually, protocols like XMPP make use
of this property fairly directly). for S-Exps, which are often
essentially positional, this is much less nice, and will often mean
needing more node-types to deal with the presence or absence of certain
features (whereas with XML one can use different logic based on whether
certain attributes or tags are present or absent).

granted, it does still leave the possibility that one could structure
things more loosely (with S-Exps), say, rather than:

    ( if /cond/ /then/ /else/ )

one has:

    ( if (cond /cond/ ) (then /then/ ) (else /else/ ) )

so, gaining a little more flexibility at the cost of a little more
verbosity, which is possibly a reasonable point one could argue (my
client/server frame-delta protocol works more like this, typically using
marker tags before everything in place of lots of fixed argument lists,
although fixed-lists are used in many places as well).


trivia: the frame-delta protocol was originally intended to be
XML-based, but I switched out to S-Expressions at the last minute (just
prior to actually implementing it) mostly on the ground that S-Exps
would have been less effort (and I didn't feel like jerking off with the
added awkwardness using XML would bring at the moment).

a funny irony would be if someone were to devise some sort of schema
system and use it to try to validate their S-Expressions.

it is still an open question which is ultimately "better", as each has
strengths; mostly it seems to boil down to a tradeoff between
flexibility and ease-of-use.
 
B

BGB

In one breath we're singing the praises of binary formats, in the next we
complain that XML isn't sufficiently flexible.

it is not like one can't have both:
have a format which is at the same time a compressed binary format,
and can also retain the full flexibility of representing free-form XML
semantics, ideally without a major drop in compactness (this drop
happens with WBXML, and IIRC should also happen with EXI about as soon
as one starts encoding nodes which lie outside the schema).

this is partly why I was advocating a sort of pattern-building adaptive
format: it can build the functional analogue of a schema as it encodes
the data, and likewise does not depend on a schema to properly decode
the document. it is mostly a matter of having the format predict when it
doesn't need to specify tag and attribute names (it is otherwise similar
to a traditional data-compressor).

this is functionally similar to the sliding-window as used in deflate
and LZMA (7zip) and similar (in contrast to codebook-based data
compressors). functionally, it would have a little more in common with
LZW+MTF than with LZ77 though.
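
a minimal sketch of the move-to-front idea as applied to tag names (in
Java; the string-based output is just for illustration, a real encoder
would emit variable-length binary codes):

import java.util.ArrayList;
import java.util.List;

// tag names live in a move-to-front list: a recently seen name is coded
// as a small index, and only genuinely new names get spelled out in full.
public class MtfTagCoder {
    private final List<String> recent = new ArrayList<>();

    String encodeTag(String name) {
        int i = recent.indexOf(name);
        if (i >= 0) {
            recent.remove(i);
            recent.add(0, name);    // move to front: hot names keep small indices
            return "IDX:" + i;
        }
        recent.add(0, name);        // first sighting: emit the name itself
        return "LIT:" + name;
    }

    public static void main(String[] args) {
        MtfTagCoder c = new MtfTagCoder();
        System.out.println(c.encodeTag("if"));    // LIT:if
        System.out.println(c.encodeTag("cond"));  // LIT:cond
        System.out.println(c.encodeTag("if"));    // IDX:1
    }
}

the decoder can maintain the same list as it reads, and so never needs a
schema to recover the names, which is essentially the property being
argued for above.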

granted, potentially a binary format could incorporate both support for
schemas and the use of adaptive compression.


is XML really the text, or is it actually the structure?
I had operated under the premise that it was the data-structure (tags,
attributes, namespaces, ...), which allows for pretty much anything
which can faithfully encode the structure (without imposing too many
arbitrary restrictions).

"Do they really need any sort of schema?" with XML is usually a "yes".

But only if you're interested in clear, unambiguous, readily-parsable and
maintainable XML document formats.

fair enough. I have mostly been using it "internally", and as noted,
for some of my file-formats I had used a custom binary-coded variant
which I had called SBXE (roughly similar to WBXML, but generally more
compact and supporting more features, such as namespaces).
it didn't make use of schemas, and worked by simply encoding the tag
structure into the file and using basic contextual modeling strategies.

it also compared favorably with XML+GZ in my tests (and IIRC was also
generally smaller than WBXML). XML+BZip2 or XML+LZMA would also be
possibilities.


I had considered the possibility of a more "advanced" format (with more
advanced predictive modeling), but didn't bother (I couldn't see much
point in trying to shave off more bytes at the time, as it was already
working fairly well).

People often excoriate the supposed verbosity of XML as though it were the only
criterion to measure utility.

well, a lot depends...

for disk files, really, who cares?...
for a link where a several kB message might only take maybe 250-500ms
and is at typical "user-interaction" speeds (say, part of a generic "web
app"), likewise, who cares?...


it may matter a little more in a 3D interactive world where everything
going on in the visible scene has to get through at a 10Hz or 24Hz
clock-tick, and if the connection bogs down the user will be rather
annoyed (as their game world has essentially stalled).

one may have to make do with about 16-24kB/s (or maybe less) to better
ensure a good user experience (at 16kB/s and a 10Hz tick, that is only
about 1.6kB per update), and there is no guarantee that the user has a
perfect internet connection either.

so, some sort of compression may be needed in this case.
(yes, XML+GZ would probably be sufficient).
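
for example (in Java, with a made-up message), the stock java.util.zip
classes are enough for the XML+GZ route:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class XmlGz {
    // gzip one XML message before putting it on the wire
    static byte[] gzip(String xml) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String msg = "<entity><field var=\"classname\" value=\"func_door\"/></entity>";
        // note: for very small messages the ~20-byte gzip header can dominate,
        // so batching several updates per compressed frame tends to pay off.
        System.out.println(msg.length() + " chars -> " + gzip(msg).length + " bytes gzipped");
    }
}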

if it were dial-up, probably no one would even consider using XML for
the network protocol in a 3D game.

There is no inherent advantage of a LISP/list-like format over any other, nor vice versa; it's all accordin'. If the convention is agreeable to all parties,
it will work. If all projects were one-off and isolated from the larger world,
we'd never need to adhere to a standard. If we don't mind inventing our own
tools for anything, we'd never have to adopt a standard with extensive tools
support.

it is possible, it all depends.

a swaying factor in my last choice was the effort tradeoff of writing
the code (because working with DOM is kind of a pain...). IIRC, I may
have also been worrying about performance (mostly passing around lots of
numeric data as ASCII strings, ...).

but, I may eventually need to throw together a basic encoding scheme for
this case (a binary encoder for list-based data), that or just reuse an
existing data serializer of mine (mostly intended for generic data
serialization, which supports lists). it lacks any sort of prediction or
context modeling though, and is used in my stuff mostly as a container
format for bytecode for my VM and similar.
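
a minimal sketch of what such a binary encoder for list-based data could
look like (in Java; the tag values are made up, and a real format would
need a matching decoder and more types):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class ListEncoder {
    static final int T_INT = 1, T_STR = 2, T_LIST = 3;

    // write one value as a tag byte followed by its payload,
    // recursing for nested lists
    static void encode(Object v, DataOutputStream out) throws IOException {
        if (v instanceof Integer) {
            out.writeByte(T_INT);
            out.writeInt((Integer) v);
        } else if (v instanceof String) {
            out.writeByte(T_STR);
            out.writeUTF((String) v);
        } else if (v instanceof List) {
            List<?> l = (List<?>) v;
            out.writeByte(T_LIST);
            out.writeInt(l.size());
            for (Object o : l)
                encode(o, out);
        } else {
            throw new IOException("unsupported type: " + v.getClass());
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        encode(Arrays.asList("if", Arrays.asList("cond", 1), 42),
               new DataOutputStream(bos));
        System.out.println(bos.size() + " bytes");
    }
}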

Where are the *real* costs of a software system?

who knows?...

probably delivering the best reasonable user experience?...

for a game:
reasonably good graphics;
reasonably good performance (ideally, consistently over 30fps);
hopefully good gameplay, plot, story, ...

well, that and "getting everything done" (this is the hard one).
 
A

Arved Sandstrom

fair enough.

often, one can implement non-flat structures with line-oriented formats,
for example:
...
groupDef {
    ...
    groupDef {
        itemDef {
            ...
        }
        ...
    }
    ...
}
[ SNIP ]

No need for the braces, if you're going to use those all you gain over
the XML is terseness.

Consider line-oriented files/messages like .properties files: these can
describe hierarchical structures perfectly well if you've got an
understood key=value syntax, specifically with a hierarchy-supporting
syntax for the keys. Easy to read and edit, easy to parse.
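
For instance (key names invented for illustration), a nested structure
can ride along entirely in dotted keys:

# the dots carry the hierarchy that the braces carried in the example above
group.subgroup.item.name=foo
group.subgroup.item.size=42
group.other.enabled=true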

As an example take a look at log4j .properties and XML configuration
files. All you gain with the XML is the ability to validate against a
log4j DTD.

a lot of times, code operates under the assumption that nearly anything
which can be reasonably done is valid de-facto (the code is written,
however, to ideally not do anything compromising).

granted, typically I don't deal a whole lot with anything "security
critical" or where there is much need to worry about "trust" or
"authorization" or similar (or if privacy or money or similar was
involved...). maybe if security were more of a concern, then added
layers of pedantry and validation would make a lot more sense.

in my typical use-cases, the theoretical worst case would probably be if
a 3rd party could somehow break the app and get control of the users' OS
or similar and cause damage, but again, modern Windows is itself partly
designed to try to defend against this (running applications by default
with constrained privileges, ...).
This is a narrow view of application security. Unless you're writing toy
apps, one would expect that your apps are doing *something*, and that
something includes access to databases or files or other resources.
Furthermore, if your app is used by anyone other than yourself, another
asset is in play, and that's your personal, team's or business's
reputation.

Privacy-sensitive data, or financial data, doesn't have to be involved,
and you don't need the actions of a malicious third party, in order to
have an application security problem. If your code is such that it
corrupts any persistent data, say, or is seriously under-performant
under load, or intermittently breaks and the app has to be re-started,
you've managed to trample all over the Integrity [1] and Availability
security attributes of CAI (Confidentiality, Availability,
Integrity)...all without the help of any malicious external threats.

Do you think your users care who or what mangled part of the
organizational data, or who or what is responsible for 20 percent
downtime? Some of your stakeholders will, sure, when culprits are being
sought, but most of your users will just care about proper function.

All application security starts with good coding. That's why so much of
standards like the Java Secure Coding Guidelines, or the OWASP
Development/Code Review/Testing guides, has to do with good coding. And
I don't believe you can really relax your standards with some apps and
keep high standards in others.

AHS

1. Strictly speaking not an integrity violation if you can detect the
unintended data corruption, ideally know what caused it, and even better
repair it, but in practice once the damage is done you often
*effectively* can't easily recover; the effort of detecting and fixing
is itself punitive.
 
B

BGB

fair enough.

often, one can implement non-flat structures with line-oriented formats,
for example:
...
groupDef {
    ...
    groupDef {
        itemDef {
            ...
        }
        ...
    }
    ...
}
[ SNIP ]

No need for the braces, if you're going to use those all you gain over
the XML is terseness.

well, if the format is still line-oriented, one can still parse the
files using a loop, getting and splitting strings, and checking the
first token of each line.
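
for example, a minimal sketch of that loop in Java (file name and token
names made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineParser {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("scene.def"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] tok = line.trim().split("\\s+");
                if (tok[0].isEmpty())
                    continue;                 // skip blank lines
                switch (tok[0]) {             // dispatch on the first token
                    case "groupDef": /* open a group */  break;
                    case "itemDef":  /* open an item */  break;
                    case "}":        /* close a block */ break;
                    default:         /* field or data */ break;
                }
            }
        }
    }
}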

parsing XML is a little more involved, since:
items may be split across lines, or multiple items may exist on the same
line;
one can no longer use whitespace or commas as the primary delimiter;
....

granted, yes, one can use SAX or similar, but alas...

one can wonder though, what really would be the gain of using XML syntax
in many such cases, vs the typical "relative niceness" of a line
oriented format.

like, say I have a format which looks like:
{
    "classname" "func_door"
    "angle" "-1"
    ....
    {
        [ 1 0 0 16 ] brick/mybrick [ 0 1 0 0 ] [ 0 0 1 0 ]
        [ -1 0 0 16 ] brick/mybrick [ 0 1 0 0 ] [ 0 0 1 0 ]
        [ 0 1 0 16 ] brick/mybrick [ 1 0 0 0 ] [ 0 0 1 0 ]
        [ 0 -1 0 16 ] brick/mybrick [ 1 0 0 0 ] [ 0 0 1 0 ]
        [ 0 0 1 16 ] brick/mybrick [ 1 0 0 0 ] [ 0 1 0 0 ]
        [ 0 0 -1 16 ] brick/mybrick [ 1 0 0 0 ] [ 0 1 0 0 ]
    }
}

would it really look much better as:
<entity>
    <field var="classname" value="func_door"/>
    <field var="angle" value="-1"/>
    ....
    <brush>
        <face plane="1 0 0 16" texture="brick/mybrick" sdir="0 1 0 0" tdir="0 0 1 0"/>
        ....
    </brush>
</entity>

even though the parser is more generic, and everything is better
labeled, is it really an improvement WRT, say, readability?...

Consider line-oriented files/messages like .properties files: these can
describe hierarchical structures perfectly well if you've got an
understood key=value syntax, specifically with a hierarchy-supporting
syntax for the keys. Easy to read and edit, easy to parse.

yes, but this defeats your own prior point, namely indirectly asserting
that line-oriented == flat-structure.

point is, one can have hierarchical line-oriented files.

As an example take a look at log4j .properties and XML configuration
files. All you gain with the XML is the ability to validate against a
log4j DTD.

This is a narrow view of application security. Unless you're writing toy
apps, one would expect that your apps are doing *something*, and that
something includes access to databases or files or other resources.
Furthermore, if your app is used by anyone other than yourself, another
asset is in play, and that's your personal, team's or business's
reputation.

"someone steals' the user's save-games!", that would be scary, or not
really...

most of the files in a game are generic resource data, but stealing them
is of little concern, and damaging them is more likely to be an
annoyance than an actual threat "oh crap, I might have to reinstall...".

Privacy-sensitive data, or financial data, doesn't have to be involved,
and you don't need the actions of a malicious third party, in order to
have an application security problem. If your code is such that it
corrupts any persistent data, say, or is seriously under-performant
under load, or intermittently breaks and the app has to be re-started,
you've managed to trample all over the Integrity [1] and Availability
security attributes of CAI (Confidentiality, Availability,
Integrity)...all without the help of any malicious external threats.

typically, crashes are more an annoyance than a major threat.

consider Skyrim: the damn thing can't usually keep going for more than 1
or 2 hours before crashing-to-desktop or similar.

of course, not everyone aspires towards Bethesda levels of stability.

Do you think your users care who or what mangled part of the
organizational data, or who or what is responsible for 20 percent
downtime? Some of your stakeholders will, sure, when culprits are being
sought, but most of your users will just care about proper function.

this likely only matters if it is some sort of server-based or business-type app.

ok, a game-server crashing could be a bit annoying if one were making
something like an MMORPG or something (like WoW...).


in my case, I am not:
the online play would likely be more for things like user-run deathmatch
servers and similar.

All application security starts with good coding. That's why so much of
standards like the Java Secure Coding Guidelines, or OWASP
Development/Code Review/Testing guides, have to do with good coding. And
I don't believe you can really relax your standards with some apps and
have high standards in another.

it is more a matter of productivity:
focus on security, code-quality, ... in places where it is important;
otherwise, whatever one can mash together which basically works is
arguably good enough.

granted, it is not like there aren't some things I care about, like I
prefer clean and nice code over a tangled mess, but ultimately this may
be secondary to the greater concern, "get it done" (as, what good is
good code if the product can never get out the door and on the market?).


it is like with art:
some people can be perfectionist, and worry about tiny details which
hardly anyone would ever notice;
other people can try to make something "good enough" and hope users
don't notice or care about any little graphical imperfections.

AHS

1. Strictly speaking not an integrity violation if you can detect the
unintended data corruption, ideally know what caused it, and even better
repair it, but in practice once the damage is done you often
*effectively* can't easily recover; the effort of detecting and fixing
is itself punitive.

potentially, but it depends on the relative costs.

if the worst case is forcing a reinstall, this is much less of an issue
than, say, if it breaks their savegames, which is much less of an issue
than if any "actually important" data is involved (compromises users'
privacy or security, causes damage to their computer, ...).

say, one doesn't want to have their app be a vector for virus delivery,
as this can give a bad reputation.


but, alas...
 
