[Q] Text vs Binary Files

Discussion in 'XML' started by Eric, May 27, 2004.

  1. Eric

    Eric Guest

    Assume that disk space is not an issue
    (the files will be small < 5k in general for the purpose of storing
    preferences)

    Assume that transportation to another OS may never occur.


    Are there any solid reasons to prefer text files over binary
    files?

    Some of the reasons I can think of are:

    -- should transportation to another OS become useful or needed,
    the text files would be far easier to work with

    -- tolerant of basic data type size changes (enumerated types have been
    known to change in size from one version of a compiler to the next)

    -- if a file becomes corrupted, it would be easier to find and repair
    the problem potentially avoiding the annoying case of just
    throwing it out

    I would like to begin using XML for the storage of application
    preferences, but I need to convince others who are convinced that binary
    files are the superior method that text files really are the way to go.

    Thoughts? Comments?
    Eric, May 27, 2004
    #1

  2. On Thu, 27 May 2004, Eric wrote:
    >
    > Assume that disk space is not an issue [...]
    > Assume that transportation to another OS may never occur.
    > Are there any solid reasons to prefer text files over binary files?
    >
    > Some of the reasons I can think of are:
    >
    > -- should transportation to another OS become useful or needed,
    > the text files would be far easier to work with


    I would guess this is wrong, in general. Think of the difference
    between a DOS/Win32 text file, a MacOS text file, and a *nix text
    file (hint: linefeeds and carriage returns). Now think of the
    difference between the same systems' binary files (hint: nothing).
    There do exist many free tools to deal with line-ending troubles,
    though, so this isn't really a disadvantage; just a counter to your
    claim.

    > -- tolerant of basic data type size changes (enumerated types have been
    > known to change in size from one version of a compiler to the next)


    It's about five minutes' work to write portable binary I/O functions
    in most languages, if you're worried about the size of 'int' on your
    next computer or something. Check out any file-format standard for
    ideas, and Google "network byte order." If you're coming from a C
    background, then you'll understand when I tell you that 'fwrite' should
    never, ever be applied to anything but buffers of 'unsigned char'! :)
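A minimal sketch of that five-minute routine (the helper name `write_u32_be` is made up for illustration): write a 32-bit value in network (big-endian) byte order through an `unsigned char` buffer, so the on-disk bytes come out identical regardless of the host's endianness or the size of `int`.

```c
#include <stdint.h>
#include <stdio.h>

/* Write v as exactly four bytes, most-significant first (network order).
   The bytes written are identical on any host, whatever its native
   endianness or word size. Returns 0 on success, -1 on error. */
static int write_u32_be(FILE *fp, uint32_t v)
{
    unsigned char buf[4];
    buf[0] = (unsigned char)((v >> 24) & 0xFF);
    buf[1] = (unsigned char)((v >> 16) & 0xFF);
    buf[2] = (unsigned char)((v >> 8)  & 0xFF);
    buf[3] = (unsigned char)( v        & 0xFF);
    return fwrite(buf, 1, 4, fp) == 4 ? 0 : -1;
}
```

Note the fwrite is applied to a buffer of `unsigned char`, never to the `uint32_t` itself.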

    > -- if a file becomes corrupted, it would be easier to find and repair
    > the problem potentially avoiding the annoying case of just
    > throwing it out


    Yes, definitely. Also, it's much easier to tell if text has been
    corrupted in transmission --- it won't look like text anymore!
    Binary always looks like binary; you need explicit checksums and
    guards against corruption there. (Again, see file-format standards,
    especially my favorite, the PNG image standard.)
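For instance, PNG guards every chunk with a CRC-32. A bitwise sketch of that same checksum (the table-free form, so it's short rather than fast; same parameters as PNG, zlib, and gzip):

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-32 as used by PNG and gzip: reflected polynomial 0xEDB88320,
   initial value all-ones, final complement. Bit-at-a-time variant. */
static uint32_t crc32_bytes(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

The standard check value: running it over the ASCII bytes "123456789" yields 0xCBF43926.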

    > I would like to begin using XML for the storage of application
    > preferences, but I need to convince others who are convinced that binary
    > files are the superior method that text files really are the way to go.


    One major advantage of plain text is that it can be sent over HTTP
    and other Web protocols without "armoring." You can put plain text
    in the body of a POST request, for example, where I doubt arbitrary
    bytes would be accepted. (I dunno, though.)
    Along the same lines, you can email your data files back and forth
    in the body of an email message, rather than mucking about with
    attachments.

    The disadvantage is size; but you don't seem worried about that.
    Another possible disadvantage would be that text is easily read and
    reverse-engineered, if you're worried about that (e.g., proprietary
    config files or savefiles for a game) --- but then you can always
    encrypt whatever you don't want read immediately. [Whatever you
    don't want read *ever*, you simply don't give to your users, because
    they'll crack anything given enough time.]

    HTH,
    -Arthur
    Arthur J. O'Dwyer, May 27, 2004
    #2

  3. Eric

    Eric Guest

    Arthur J. O'Dwyer <> wrote:

    > > -- should transportation to another OS become useful or needed,
    > > the text files would be far easier to work with

    >
    > I would guess this is wrong, in general. Think of the difference
    > between a DOS/Win32 text file, a MacOS text file, and a *nix text
    > file (hint: linefeeds and carriage returns).


    Which is why I mentioned at the end using a solid XML parser to deal
    with such issues transparently. I likely wouldn't consider using a text
    file if something like XML and solid parsers weren't available and free.

    > Now think of the
    > difference between the same systems' binary files (hint: nothing).


    Well, you say 'same systems'...so, yes, in general, reading & writing a
    binary file that will never be moved to another OS shouldn't present any
    serious issues. (or am I wrong here?)

    However, the point was that it could be moved, in which case dealing
    with big/little endian issues would become important.

    > > -- tolerant of basic data type size changes (enumerated types have
    > > been known to change in size from one version of a compiler to
    > > the next)

    >
    > It's about five minutes' work to write portable binary I/O functions
    > in most languages


    Ah, but it's five minutes I don't want to spend, especially since the
    time would need to be spent every time something changed. I believe in
    fixing a problem once.

    Plus, the potential for spending time attempting to figure out why the
    @#$%@$ isn't being read properly isn't accounted for here.

    > Another possible disadvantage would be that text is easily read and
    > reverse-engineered


    In my case, this is a benefit.

    --
    == Eric Gorr ========= http://www.ericgorr.net ========= ICQ:9293199 ===
    "Therefore the considerations of the intelligent always include both
    benefit and harm." - Sun Tzu
    == Insults, like violence, are the last refuge of the incompetent... ===
    Eric, May 27, 2004
    #3
  4. Eric

    gswork Guest

    (Eric) wrote in message news:<1geew2n.10d70ck1mpdeeN%>...
    > Assume that disk space is not an issue
    > (the files will be small < 5k in general for the purpose of storing
    > preferences)
    >
    > Assume that transportation to another OS may never occur.
    >
    >
    > Are there any solid reasons to prefer text files over binary
    > files?
    >
    > Some of the reasons I can think of are:
    >
    > -- should transportation to another OS become useful or needed,
    > the text files would be far easier to work with
    >
    > -- tolerant of basic data type size changes (enumerated types have been
    > known to change in size from one version of a compiler to the next)
    >
    > -- if a file becomes corrupted, it would be easier to find and repair
    > the problem potentially avoiding the annoying case of just
    > throwing it out


    All good reasons...

    > I would like to begin using XML for the storage of application
    > preferences, but I need to convince others who are convinced that binary
    > files are the superior method that text files really are the way to go.
    >
    > Thoughts? Comments?


    For your application I think you have it right. Preferences in an XML
    text file are more flexible for the user/admin (they can be edited by
    hand as a last resort) and also for you as a developer: a text file can
    have entries listed 'out of order', and with the right tags and parsing
    it won't really matter. For the same reasons they can also be easier to
    change and add to over time.

    The main reasons for using binary files to store preferences are:

    -security (but they're crackable, and text files can be encrypted
    anyway)
    -programming ease: it can be easier to just have a preference
    structure than to attempt a robust parsing of a given set of text
    items; the text could be messed with, after all
    -size: relevant if they need to be shuttled around a network a lot or
    will take up lots of disk space

    It sounds like they don't apply in your case.
    gswork, May 27, 2004
    #4
  5. On Thu, 27 May 2004, Eric wrote:
    >
    > Arthur J. O'Dwyer <> wrote:
    > [Eric wrote]
    > > > -- should transportation to another OS become useful or needed,
    > > > the text files would be far easier to work with

    > >
    > > I would guess this is wrong, in general. Think of the difference
    > > between a DOS/Win32 text file, a MacOS text file, and a *nix text
    > > file (hint: linefeeds and carriage returns).

    >
    > Which is why I mentioned at the end using a solid XML parser to deal
    > with such issues transparently. I likely wouldn't consider using a text
    > file if something like XML and solid parsers weren't available and free.


    Ah, but what do you do when the XML standard changes? :) Seriously,
    this is something you really need to consider IMHO. (Of course, this
    is cross-posted to an XML group, and I don't know much about XML, so
    don't take my word about anything...) There are XML Version Foo parsers
    available now, but when XML Version Bar comes out, there'll be lag time.
    Think of the messes with HTML 4.0 [about which I know little] and C'99
    [about which I know much].
    Free parsers *are* nice, though, no dispute there. :)

    > > Now think of the
    > > difference between the same systems' binary files (hint: nothing).

    >
    > Well, you say 'same systems'...so, yes, in general, reading & writing a
    > binary file that will never be moved to another OS shouldn't present any
    > serious issues. (or am I wrong here?)


    Misunderstood. By "the same systems," I meant the systems I just
    mentioned: DOS/Win32, Unix, and MacOS. Their binary data formats are
    identical.

    > > > -- tolerant of basic data type size changes (enumerated types have
    > > > been known to change in size from one version of a compiler to
    > > > the next)

    > >
    > > It's about five minutes' work to write portable binary I/O functions
    > > in most languages

    >
    > Ah, but it's five minutes I don't want to spend,


    Versus five minutes trying to make your free XML parser compile?
    I'd take five minutes with binary files any day. ;-)

    > especially since the
    > time would need to be spent every time something changed. I believe in
    > fixing a problem once.


    So do I. That's why you spend the five minutes writing your portable
    binary I/O functions. Then you never need to write them again. For
    a not-so-hot-but-portable-across-aforementioned-systems example, see
    http://www.contrib.andrew.cmu.edu/~ajo/free-software/ImageFmtc.c,
    functions 'fread_endian' and 'bwrite_endian'. Write once, use many
    times.
    The number of bits in a 32-bit integer is *never* going to change.
    The number of bits in a machine word is *definitely* going to change.
    This is why all existing file-format standards explicitly state that
    they are dealing with 32-bit integers, not machine words: so the
    file-format code never has to change, no matter where it runs.
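The read half of such a routine, sketched under the same write-once assumption (`read_u32_be` is a hypothetical name, not from the file linked above): reassemble the four bytes arithmetically instead of fread-ing into a native `int`, and host word size and endianness drop out of the picture entirely.

```c
#include <stdint.h>
#include <stdio.h>

/* Read exactly four bytes, most-significant first, into a uint32_t.
   No fread into a native int, so host endianness never matters.
   Returns 0 on success, -1 on short read. */
static int read_u32_be(FILE *fp, uint32_t *out)
{
    unsigned char buf[4];
    if (fread(buf, 1, 4, fp) != 4)
        return -1;
    *out = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
         | ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    return 0;
}
```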

    > Plus, the potential for spending time attempting to figure out why the
    > @#$%@$ isn't being read properly isn't accounted for here.


    Of course not. I/O is trivial. It's your *algorithms* that are
    going to be broken; and they'd be broken no matter what output format
    you used.

    > > Another possible disadvantage would be that text is easily read and
    > > reverse-engineered

    >
    > In my case, this is a benefit.


    Good. :)

    -Arthur
    Arthur J. O'Dwyer, May 27, 2004
    #5
  6. Eric

    Eric Guest

    Arthur J. O'Dwyer <> wrote:

    > On Thu, 27 May 2004, Eric wrote:
    > >
    > > Arthur J. O'Dwyer <> wrote:
    > > [Eric wrote]
    > > > > -- should transportation to another OS become useful or needed,
    > > > > the text files would be far easier to work with
    > > >
    > > > I would guess this is wrong, in general. Think of the difference
    > > > between a DOS/Win32 text file, a MacOS text file, and a *nix text
    > > > file (hint: linefeeds and carriage returns).

    > >
    > > Which is why I mentioned at the end using a solid XML parser to deal
    > > with such issues transparently. I likely wouldn't consider using a text
    > > file if something like XML and solid parsers weren't available and free.

    >
    > Ah, but what do you do when the XML standard changes? :)


    Please correct me if I am wrong, but the design of XML already takes
    this into account. In other words, the idea that it can and will change
    is a part of the design - this is one reason why XML is such a nifty
    technology.

    > Misunderstood. By "the same systems," I meant the systems I just
    > mentioned: DOS/Win32, Unix, and MacOS. Their binary data formats are
    > identical.


    What do you mean by 'their binary data formats are identical'?...this
    would seem to imply that big/little endian issues are a thing of the
    past...?

    > > > > -- tolerant of basic data type size changes (enumerated types have
    > > > > been known to change in size from one version of a compiler to
    > > > > the next)
    > > >
    > > > It's about five minutes' work to write portable binary I/O functions
    > > > in most languages

    > >
    > > Ah, but it's five minutes I don't want to spend,

    >
    > Versus five minutes trying to make your free XML parser compile?


    Binaries of the better parsers are available, so this is a non-issue.
    :)

    > > Plus, the potential for spending time attempting to figure out why the
    > > @#$%@$ isn't being read properly isn't accounted for here.

    >
    > Of course not. I/O is trivial.


    Once you track down the problem...however, it would not be uncommon to
    think the problem lies elsewhere first and spend hours before finding
    the trivial fix.

    > It's your *algorithms* that are
    > going to be broken; and they'd be broken no matter what output format
    > you used.


    With XML, the risk of this, if it really exists at all, is far less,
    as long as you're not changing the tag names or what they mean.
    Eric, May 27, 2004
    #6
  7. On Thu, 27 May 2004, Eric wrote:
    >
    > Arthur J. O'Dwyer <> wrote:
    > >
    > > Ah, but what do you do when the XML standard changes? :)

    >
    > Please correct me if I am wrong, but the design of XML already takes
    > this into account. In other words, the idea that it can and will change
    > is a part of the design - this is one reason why XML is such a nifty
    > technology.


    Probably true. I don't know much about XML's namespacing rules
    (by which I mean the rules that say that <foo> is an okay tag for
    a user to create, but <bar> could be given special meaning by
    future standards). [If anyone wants to give me a lecture, that's
    fine; otherwise, I'll just look it up when I need to know. ;) ]

    > > Misunderstood. By "the same systems," I meant the systems I just
    > > mentioned: DOS/Win32, Unix, and MacOS. Their binary data formats are
    > > identical.

    >
    > What do you mean by 'their binary data formats are identical'?...this
    > would seem to imply that big/little endian issues are a thing of the
    > past...?


    Yup. The vast majority of computers these days use eight-bit
    byte-oriented transmission and storage protocols. Whatever bit-ordering
    problems there are have moved "downstream" to those people involved in
    the construction of hardware that has to choose whether to transmit
    bit 0 or bit 7 first (and I'm sure they have their own relevant
    standards in those fields, too).
    Again, I refer you to standards like RFCs 1950, 1951, and 1952
    (Google "RFC 1950"). Note the utter lack of concern with the vagaries
    of the machine. We have indeed moved past big/little-endian wars;
    now, whoever's[1] writing the relevant standard simply says, "All eggs
    distributed according to the Fred protocol must be broken at the
    big end," and that's the end of *that!*


    > > > Plus, the potential for spending time attempting to figure out why the
    > > > @#$%@$ isn't being read properly isn't accounted for here.

    > >
    > > Of course not. I/O is trivial.

    >
    > Once you track down the problem...however, it would not be uncommon to
    > think the problem lies elsewhere first and spend hours before finding
    > the trivial fix.


    You misunderstand me. I/O is trivial; thus, after the first five
    minutes spent making sure the trivial code is correct (which is trivial
    to prove), you never need to touch it or look at it again. If you
    never touch it, you can't possibly introduce bugs into it. And if it
    starts out bugfree (trivially proven), and never has any bugs introduced
    into it (because it's never modified), then it will remain bugfree
    forever. (And thus you never need to fix it, trivially or not.)

    I'm completely serious and not using hyperbole at all when I say
    I/O is trivial. It really is.

    -Arthur

    [1] - In speech I'd say "who'sever writing...," but that looks
    awful no matter how I spell it. Whosever? Whos'ever? Who's-ever?
    Yuck. :(
    Arthur J. O'Dwyer, May 27, 2004
    #7
  8. On Thu, 27 May 2004, Eric wrote:

    > Assume that disk space is not an issue
    > (the files will be small < 5k in general for the purpose of storing
    > preferences)
    >
    > Assume that transportation to another OS may never occur.
    >
    >
    > Are there any solid reasons to prefer text files over binary
    > files?
    >
    > Some of the reasons I can think of are:
    >
    > -- should transportation to another OS become useful or needed,
    > the text files would be far easier to work with
    >
    > -- tolerant of basic data type size changes (enumerated types have been
    > known to change in size from one version of a compiler to the next)
    >
    > -- if a file becomes corrupted, it would be easier to find and repair
    > the problem potentially avoiding the annoying case of just
    > throwing it out
    >
    > I would like to begin using XML for the storage of application
    > preferences, but I need to convince others who are convinced that binary
    > files are the superior method that text files really are the way to go.
    >
    > Thoughts? Comments?


    In favour of binary: if a customer has access to it, they will be more
    likely to muck with a text file than a binary file.

    In favour of text, will you ever need to diff the files (old version
    against new version)? Will you need to source control and/or merge the
    files? Easier to do as text.

    --
    Send e-mail to: darrell at cs dot toronto dot edu
    Don't send e-mail to
    Darrell Grainger, May 27, 2004
    #8
  9. Eric

    Ben Measures Guest

    Arthur J. O'Dwyer wrote:
    > On Thu, 27 May 2004, Eric wrote:
    >>Which is why I mentioned at the end using a solid XML parser to deal
    >>with such issues transparently. I likely wouldn't consider using a text
    >>file if something like XML and solid parsers weren't available and free.

    >
    > Ah, but what do you do when the XML standard changes? :) Seriously,
    > this is something you really need to consider IMHO. (Of course, this
    > is cross-posted to an XML group, and I don't know much about XML, so
    > don't take my word about anything...) There are XML Version Foo parsers
    > available now, but when XML Version Bar comes out, there'll be lag time.
    > Think of the messes with HTML 4.0 [about which I know little] and C'99
    > [about which I know much].
    > Free parsers *are* nice, though, no dispute there. :)


    XML was created to solve the problem of the HTML version mess. The
    specification itself is very flexible (yet precise) with the result that
    the language can be extended without needing a change to the
    specification (or parsers based on the specification).

    It's so good it's almost magical.

    > The number of bits in a 32-bit integer is *never* going to change.
    > The number of bits in a machine word is *definitely* going to change.
    > This is why all existing file-format standards explicitly state that
    > they are dealing with 32-bit integers, not machine words: so the
    > file-format code never has to change, no matter where it runs.


    IIRC in C++ (and I'm sure C) there is no such guarantee of a "32-bit
    integer" - the int type can be more than 32 bits.

    >>Plus, the potential for spending time attempting to figure out why the
    >>@#$%@$ isn't being read properly isn't accounted for here.

    >
    > Of course not. I/O is trivial. It's your *algorithms* that are
    > going to be broken; and they'd be broken no matter what output format
    > you used.


    Unless you're using somebody else's parser, which may not be broken.
    Such as libxml2 which is *very* unlikely to be broken.

    --
    Ben M.
    Ben Measures, May 27, 2004
    #9
  10. On Thu, 27 May 2004, Ben Measures wrote:
    >
    > XML was created to solve the problem of the HTML version mess. The
    > specification itself is very flexible (yet precise) with the result that
    > the language can be extended without needing a change to the
    > specification (or parsers based on the specification).
    >
    > It's so good it's almost magical.


    Okay, I'm convinced, then. :)


    > > The number of bits in a 32-bit integer is *never* going to change.
    > > The number of bits in a machine word is *definitely* going to change.
    > > This is why all existing file-format standards explicitly state that
    > > they are dealing with 32-bit integers, not machine words: so the
    > > file-format code never has to change, no matter where it runs.

    >
    > IIRC in C++ (and I'm sure C) there is no such guarantee of a "32-bit
    > integer" - the int type can be more than 32 bits.


    More is better. A 33-bit integer can hold all the values that a
    32-bit integer can, and then some. If the particular algorithms in
    question are defined not to use the "and then some" part of the integer,
    that's fine. (The at-least-32-bit type in C and C++ is 'long int'.
    When I use the word 'integer', I'm using it in the same sense as the
    C standard: to mean "any integral type," not to mean "'int' type."
    Just in case that was confusing you.)

    *Again* I urge the consultation of the RFCs defining any standard
    binary file format, and the notice of the complete lack of regard
    for big-endian/little-endian/19-bit-int/37-bit-int issues. At the
    byte level, these things simply never come up.


    > >>Plus, the potential for spending time attempting to figure out why the
    > >>@#$%@$ isn't being read properly isn't accounted for here.

    > >
    > > Of course not. I/O is trivial. It's your *algorithms* that are
    > > going to be broken; and they'd be broken no matter what output format
    > > you used.

    >
    > Unless you're using somebody else's parser, which may not be broken.
    > Such as libxml2 which is *very* unlikely to be broken.


    I don't see the connection between my statement and your reply.
    What is the antecedent of your "Unless"? (Literally, you're saying
    that if you use libxml2 for I/O, then your non-I/O-related algorithms
    will have no bugs. This is what used to be called "spooky action at a
    distance," and I don't think it applies to code. :)

    -Arthur
    Arthur J. O'Dwyer, May 28, 2004
    #10
  11. Eric writes:
    Arthur J. O'Dwyer writes:

    E> ...the files will be [...] for the purpose of storing preferences)
    E>
    E> Assume that transportation to another OS may never occur.
    E> [...]
    E> -- should transportation to another OS become useful or needed,
    E> the text files would be far easier to work with

    A> I would guess this is wrong, in general. Think of the difference
    A> between a DOS/Win32 text file, a MacOS text file, and a *nix text
    A> file (hint: linefeeds and carriage returns). Now think of the
    A> difference between the same systems' binary files (hint: nothing)

    Sizes are different. Endian-ness is different. Formats may be
    different (think: floating point and other more exotic formats).

    Consider finding the file in five years and not having any of the
    previous tools that used it. Which is likely to be easier to get
    the data out of: text or binary?

    How often have we had people come here to ask help in deciphering
    a binary file?

    A> The vast majority of computers these days use eight-bit
    A> byte-oriented transmission and storage protocols. Whatever
    A> bit-ordering problems there are have moved "downstream" to
    A> those people involved in the construction of hardware that
    A> has to choose whether to transmit bit 0 or bit 7 first...

    So what happens when I transmit a binary floating point number to
    a machine with a different format?

    I agree these issues are quite solvable, but I think they are
    more *easily* solvable with text as an intermediate format.

    A> It's about five minutes' work to write portable binary I/O
    A> functions in most languages, if you're worried about the
    A> size of 'int' on your next computer or something.

    Might be a little more than five minutes, but I agree it's not hard.

    But what IS five minutes' work is a CR/CRLF/LF converter! (-:

    I know this 'cause I've done it several times over the years.
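A sketch of that five-minute converter: a character filter that maps CR+LF (DOS/Win32), bare CR (old MacOS), and bare LF (*nix) all to a single LF.

```c
#include <stdio.h>

/* Normalize line endings: CR+LF (DOS/Win32) and bare CR (old MacOS)
   both become a single LF (Unix). Reads from in, writes to out;
   both streams should be opened in binary mode. */
static void normalize_newlines(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == '\r') {
            int next = fgetc(in);
            if (next != '\n' && next != EOF)
                ungetc(next, in);   /* bare CR: push the byte back */
            fputc('\n', out);
        } else {
            fputc(c, out);
        }
    }
}
```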



    FOOD FOR THOUGHT:
    =================
    Consider: The Rosetta Stone.

    Now consider the bestest, most *useful* binary format you can name.
    Think it stands any chance AT ALL of surviving that long?

    If you want the broadest, most robust, most portable format
    possible, there is only one answer: TEXT!

    Accept no substitutes! (-:

    --
    |_ CJSonnack <> _____________| How's my programming? |
    |_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL |
    |_____________________________________________|_______________________|
    Programmer Dude, May 28, 2004
    #11
  12. On Fri, 28 May 2004, Programmer Dude wrote:
    >
    > Arthur J. O'Dwyer writes:
    > > I would guess this is wrong, in general. Think of the difference
    > > between a DOS/Win32 text file, a MacOS text file, and a *nix text
    > > file (hint: linefeeds and carriage returns). Now think of the
    > > difference between the same systems' binary files (hint: nothing)

    >
    > Sizes are different. Endian-ness is different. Formats may be
    > different (think: floating point and other more exotic formats).


    [For --hopefully-- the last time: I wasn't talking about sizes,
    or endianness, or floating-point formats. I was talking about the
    format in which a binary file is stored. Binary means bytes. On
    the vast majority of modern computers, that's eight bits per byte.
    I refer you to the file format standard for ANYTHING EVER, but
    especially PNG, because it's very cool and quite possibly *more*
    modular than XML. :) ]

    > Consider finding the file in five years and not having any of the
    > previous tools that used it. Which is likely to be easier to get
    > the data out of: text or binary?


    Without any of the computers that used it? Pretty close to zero,
    even with the help of an electron microscope. Assuming you have
    no hex editor, but you do have a computer and a text editor, then
    obviously text will be easier to display. Contrariwise, if you
    have no text editor but do have a hex editor, binary will be easier
    to display. Neither will necessarily be easier to interpret unless
    you have a copy of the relevant file format standard, and then the
    point is pretty much moot anyway.

    > How often have we had people come here to ask help in deciphering
    > a binary file?


    How often have people come here to ask help in writing "Hello
    world!" programs? How often have people come to sci.crypt to
    ask help in "deciphering" cryptograms? If you're saying that a
    lot of people are stupid, I'm inclined to agree with you.

    > A> The vast majority of computers these days use eight-bit
    > A> byte-oriented transmission and storage protocols. Whatever
    > A> bit-ordering problems there are have moved "downstream" to
    > A> those people involved in the construction of hardware that
    > A> has to choose whether to transmit bit 0 or bit 7 first...
    >
    > So what happens when I transmit a binary floating point number to
    > a machine with a different format?


    Ick, floating point! ;) Seriously, I don't have much experience
    with floating point, but I would expect you'd either use a fixed-point
    representation (common in the domains in which I work), or you'd
    convert to some IEEE format (about which I know little, and your
    point about relevant standards' becoming extinct may well apply).

    > I agree these issues are quite solvable, but I think they are
    > more *easily* solvable with text as an intermediate format.


    How do you save a floating-point number to a text file?
    Losslessly? How many lines of <your PLOC here> code is that? :)
    Once I've seen a compelling answer to that, I may start thinking
    in earnest about how to save floating-point numbers losslessly in
    binary. And we'll see who comes out on top. ;)
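(For the record, one lossless answer does exist in C99: the "%a" conversion prints a double in hexadecimal floating point, which strtod reads back bit-for-bit. A sketch, with made-up wrapper names:)

```c
#include <stdio.h>
#include <stdlib.h>

/* Print a double as C99 hexadecimal floating point ("%a"): an exact
   text representation that strtod() round-trips losslessly for any
   finite value, with no decimal rounding involved. */
static void double_to_text(double d, char *buf, size_t n)
{
    snprintf(buf, n, "%a", d);
}

static double text_to_double(const char *s)
{
    return strtod(s, NULL);
}
```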


    > FOOD FOR THOUGHT:
    > =================
    > Consider: The Rosetta Stone.
    >
    > Now consider the bestest, most *useful* binary format you can name.
    > Think it stands any chance AT ALL of surviving that long?
    >
    > If you want the broadest, most robust, most portable format
    > possible, there is only one answer: TEXT!


    Written on STONE TABLETS! And then BURIED IN THE DESERT!

    > Accept no substitutes! (-:


    Absolutely 100% agreed! (-:

    -Arthur
    Arthur J. O'Dwyer, May 28, 2004
    #12
  13. Eric

    Piet Blok Guest

    Without taking a stand pro or con binary or text in this discussion, I'd
    like to point out that XML files ARE stored in binary format, conformant
    to the encoding attribute in the XML declaration. Now, not all encodings
    are ASCII-like; think of the various EBCDIC character sets. If you must
    view an EBCDIC-encoded XML file on your PC at home, you need code
    conversion (implemented in XML parsers). A simple text editor like
    NotePad will not be very helpful.

    When XML data is transmitted over networks, it should be done in binary,
    not in text mode, because in text mode the data may be translated to some
    other encoding scheme. But the encoding attribute, being part of the
    data, will not be adjusted. The result is no XML anymore.

    Piet
    Piet Blok, May 29, 2004
    #13
  14. Eric

    Ben Measures Guest

    Piet Blok wrote:
    > Without taking a stand pro or con binary or text in this discussion, I'd
    > like to point out that XML files ARE stored in binary format, conformant
    > to the encoding attribute in the XML declaration. Now, not all encodings
    > are ASCII-like; think of the various EBCDIC character sets. If you must
    > view an EBCDIC-encoded XML file on your PC at home, you need code
    > conversion (implemented in XML parsers). A simple text editor like
    > NotePad will not be very helpful.
    >
    > When XML data is transmitted over networks, it should be done in binary,
    > not in text mode, because in text mode the data may be translated to some
    > other encoding scheme. But the encoding attribute, being part of the
    > data, will not be adjusted. The result is no XML anymore.
    >
    > Piet


    Good point not yet considered.

    --
    Ben M.
    Ben Measures, May 30, 2004
    #14
  15. Eric

    Sammy Tough Guest

    hi,

    I agree, you can make the same errors in coding information using either
    plain text or regulated plain text like XML. But you have more tools in
    hand if you don't invent your own format. There are more abstraction layers
    used in XML. If you think it over, you often have to invent such abstraction
    layers in your proprietary format too, with the difference that they have to
    be reinvented by every new programmer, every time he has to code a new type
    of information. If you write down a rule for how such coding should work
    (e.g. every new tuple is finished by a <cr> sign), you are following a path
    similar to the one the XML developers took.
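
    To make this concrete, here is a sketch (Python; the preference names are
    made up for illustration) of the same data in a just-invented line format
    versus XML: the ad-hoc rule must be documented and re-implemented by every
    reader, while XML's framing rules (escaping, nesting, encoding) come
    pre-written, with parsers already available.

```python
import xml.etree.ElementTree as ET

prefs = {"volume": "75", "theme": "dark"}

# Ad-hoc rule we just invented: "name=value, one tuple per line".
# Every consumer must learn this rule from us.
adhoc = "\n".join(f"{k}={v}" for k, v in prefs.items())

# XML: the rules already exist, and any XML parser can read it back.
root = ET.Element("prefs")
for name, value in prefs.items():
    ET.SubElement(root, "pref", name=name, value=value)
xml_text = ET.tostring(root, encoding="unicode")

parsed = {p.get("name"): p.get("value")
          for p in ET.fromstring(xml_text).findall("pref")}
```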

    greetings

    Sammy
    Sammy Tough, Jun 1, 2004
    #15
  16. Arthur J. O'Dwyer writes:

    >> Sizes are different. Endian-ness is different. Formats may be
    >> different (think: floating point and other more exotic formats).

    >
    > [For --hopefully-- the last time: I wasn't talking about sizes,
    > or endianness, or floating-point formats. I was talking about the
    > format in which a binary file is stored. Binary means bytes.


    Usually. Only usually. (-:

    > On the vast majority of modern computers, that's eight bits per byte.


    Usually. Only usually. (-: (-:

    > I refer you to the file format standard for ANYTHING EVER,..


    And folks who write code that deals with these formats need to be
    fully up to speed on the format, don't they. And in the case of
    evolving formats, need to consider upgrading so they can continue
    to read newer formats. This thread has touched on many of the
    *tools* (e.g. network transport layers) available to deal with these
    binary formats, AND THAT'S THE POINT: you need all this *stuff* and
    knowledge.

    Text is simple. You stop even *thinking* about a lot of stuff.
    And it has the advantage of easy human readability, a "nice to have"
    for debugging and maintenance purposes.

    Binary, in comparison, is a headache. (-:

    >> Consider finding the file in five years and not having any of the
    >> previous tools that used it. Which is likely to be easier to get
    >> the data out of: text or binary?

    >
    > Without any of the computers that used it? Pretty close to zero,
    > even with the help of an electron microscope.


    No, it would be silly of me to mean that.

    > Assuming you have no hex editor,...


    Hey, I'll even grant you the hex editor!

    > ...but you do have a computer and a text editor, then
    > obviously text will be easier to display.


    Even if you can examine the hex, do you see the hassle required to
    analyse what all those bits *mean*? Compare that to a text file
    that very likely *tags* (labels) the data! I mean, come on, how
    can you beat named, trivial-to-view data?

    And ya see that? Even given equal ability to examine the raw file
    (that is, sans intelligent interpreter), text is a monster winner.

    > Contrariwise, if you have no text editor but do have a hex editor,
    > binary will be easier to display.


    Ummmm, you're winging it here. (-: First, really, a hex viewer, but
    no text viewer? I think that'd be a first in computing history, but
    stranger things have happened. :)

    Second, doncha think viewing the text in the hex viewer would still
    be a lot more obvious (given those labels) than the raw bin bits?

    Even when you tilt the playing field insanely, text still wins! (-:

    > Neither will necessarily be easier to interpret unless you have a
    > copy of the relevant file format standard, and then the point is
    > pretty much moot anyway.


    Well, right, we're assuming the file format is lost or unavailable.
    And even if we somehow lost the "format" to text/plain, the pattern
    of text lines with repeating delimiters is a dead giveaway. Consider too
    that at this extreme--where we've forgotten ASCII--how much harder
    would it be to figure out binary storage formats (remember there's
    likely no clue where object boundaries are)?

    >> How often have we had people come here to ask help in decyphering
    >> a binary file?

    >
    > How often have people come here to ask help in writing "Hello
    > world!" programs? How often have people come to sci.crypt to
    > ask help in "deciphering" cryptograms? If you're saying that a
    > lot of people are stupid, I'm inclined to agree with you.


    No (well, actually, yes that's true, but not my point right now :).

    I'm pointing out--comparing like with like--no one stumbling on a
    text file containing important data comes begging interpretation.
    Cryptograms are play, and I doubt the urgent, often work-related
    situation happens in s.crypt.

    >> So what happens when I transmit a binary floating point number to
    >> a machine with a different format?

    >
    > Ick, floating point! ;)


    [bwg] Exactly my point! Which would you rather deal with:

    "99.1206" 0x42c63dbf


    > Seriously, I don't have much experience with floating point, but I
    > would expect you'd either use a fixed-point representation (common
    > in the domains in which I work),...


    Let me guess. CAD/CAM or NC or something involving physical coords?
    Fixed point isn't uncommon in environments where you know the range
    of values to expect. When you don't, and need the largest range possible
    (or when you DO, and the range is huge), you need floating point.

    >> I agree these issues are quite solveable, but I think they are
    >> more *Easily* solveable with text as an intermediate format.

    >
    > How do you save a floating-point number to a text file?


    As you'd expect. printf("%g") ... strtod()

    > Losslessly?


    Within certain parameters, close enough. Once you're dealing with FP,
    you sorta have to give up the concept of lossless. Experts in FP know
    how to deal with it to make the pain as low as possible, but FP is all
    about approximation.

    If you need absolute precision, you could always save the bytes as a
    hex string. Fast and easy in and out.

    > How many lines of <your PLOC here> code is that? :)


    Only a few surrounding strtod() if you don't mind a little edge loss.
    (IIRC, within precision limits, text<=>FP *is* fully deterministic?)
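
    The round-trip really is deterministic within precision limits. A sketch
    (Python rather than C, but the same idea as printf/strtod): a decimal
    string with enough digits recovers the bit-identical double, and a hex
    string--the suggestion above--is exact by construction.

```python
import struct

x = 99.1206

# Decimal text: repr() emits the shortest string that parses back to the
# exact same IEEE double (17 significant digits always suffice).
s = repr(x)
assert float(s) == x
assert struct.pack("<d", float(s)) == struct.pack("<d", x)  # bit-identical

# Hex-float text (the C99 printf("%a") / strtod analogue): exact by design.
h = x.hex()
assert float.fromhex(h) == x
```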


    --
    |_ CJSonnack <> _____________| How's my programming? |
    |_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL |
    |_____________________________________________|_______________________|
    Programmer Dude, Jun 8, 2004
    #16
  17. Eric

    Rolf Magnus Guest

    Arthur J. O'Dwyer wrote:

    >
    > On Thu, 27 May 2004, Eric wrote:
    >>
    >> Assume that disk space is not an issue [...]
    >> Assume that transportation to another OS may never occur.
    >> Are there any solid reasons to prefer text files over binary files?
    >>
    >> Some of the reasons I can think of are:
    >>
    >> -- should transportation to another OS become useful or needed,
    >> the text files would be far easier to work with

    >
    > I would guess this is wrong, in general. Think of the difference
    > between a DOS/Win32 text file, a MacOS text file, and a *nix text
    > file (hint: linefeeds and carriage returns).


    Linefeeds and carriage returns don't matter in XML. The other
    differences are ruled out by specifying the encoding. Any XML parser
    should understand utf-8.

    > Now think of the difference between the same systems' binary files
    > (hint: nothing).


    That's wrong. Under most (but not all) DOS compilers, int is 16-bit;
    under Windows, it's 32-bit. Under Linux on x86, long double is 80-bit;
    under Windows, it's 64-bit. And the OS is not the only thing that matters.
    On the Motorola CPUs, data is stored in big-endian order, on x86 in
    little-endian order. A 64-bit CPU might use a 64-bit type for long (or it
    might not), while on most 32-bit CPUs, long is 32-bit. Some systems have
    special alignment requirements, others don't. And there are a lot of other
    potential problems with binary data. Those problems can all be worked
    around, but it's a lot easier with text, especially XML.
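
    The size-and-endianness problem is easy to demonstrate; a sketch in
    Python, whose struct module exposes both native and explicit layouts:

```python
import struct

# Native format: size and byte order depend on the machine and compiler ABI.
native = struct.pack("@l", 1)      # a C long: 4 or 8 bytes, either endianness

# Explicit formats: the same bytes on every platform.
big = struct.pack(">i", 1)         # big-endian 32-bit
little = struct.pack("<i", 1)      # little-endian 32-bit
assert big == b"\x00\x00\x00\x01"
assert little == b"\x01\x00\x00\x00"
```

    A binary file written with the native layout is only guaranteed readable
    on the platform that wrote it; portable binary formats must pin down size
    and byte order explicitly, which is exactly the extra work text avoids.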
    Rolf Magnus, Jun 9, 2004
    #17
  18. Eric

    Jeff Brooks Guest

    Rolf Magnus wrote:
    > Arthur J. O'Dwyer wrote:
    >
    >>On Thu, 27 May 2004, Eric wrote:
    >>
    >>>Assume that disk space is not an issue [...]
    >>>Assume that transportation to another OS may never occur.
    >>>Are there any solid reasons to prefer text files over binary files?
    >>>
    >>>Some of the reasons I can think of are:
    >>>
    >>>-- should transportation to another OS become useful or needed,
    >>> the text files would be far easier to work with

    >>
    >> I would guess this is wrong, in general. Think of the difference
    >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
    >>file (hint: linefeeds and carriage returns).

    >
    > Linefeeds and carriage returns don't matter in XML. The other
    > differences are ruled out by specifying the encoding. Any XML parser
    > should understand utf-8.


    Actually, to be an XML parser it must support UTF-8 and UTF-16. UTF-16
    has byte-ordering issues. Writing a UTF-16 file on different CPUs can
    result in text files that are different. This can be resolved because of
    the encoding that the UTF standards use, but it means that any true XML
    parser must deal with big-endian/little-endian issues.

    Most people consider having to write code that translates the format to
    your specific CPU as the measure of data not being portable. XML does
    have this issue, so if that's your definition of portable then XML isn't
    portable.

    "All XML processors MUST accept the UTF-8 and UTF-16 encodings of
    Unicode 3.1"
    - http://www.w3.org/TR/REC-xml/#charsets

    "The primary feature of Unicode 3.1 is the addition of 44,946 new
    encoded characters. These characters cover several historic scripts,
    several sets of symbols, and a very large collection of additional CJK
    ideographs.

    For the first time, characters are encoded beyond the original 16-bit
    codespace or Basic Multilingual Plane (BMP or Plane 0). These new
    characters, encoded at code positions of U+10000 or higher, are
    synchronized with the forthcoming standard ISO/IEC 10646-2."
    - http://www.unicode.org/reports/tr27/

    The majority of XML parsers only use 16-bit characters. This means that
    the majority of XML parsers can't actually read XML.
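
    That last point can be illustrated with a sketch (Python): a character
    beyond the BMP needs a surrogate pair in UTF-16, so a parser that assumes
    one 16-bit unit per character mishandles it.

```python
ch = "\U00010000"                # first code point beyond the BMP (Plane 1)
utf16 = ch.encode("utf-16-be")

# One character, but TWO 16-bit units: a surrogate pair.
assert len(utf16) == 4
assert utf16[:2] == b"\xd8\x00"  # high surrogate U+D800
assert utf16[2:] == b"\xdc\x00"  # low surrogate U+DC00
```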

    Jeff Brooks
    Jeff Brooks, Jun 9, 2004
    #18
  19. Jeff Brooks wrote:

    > Rolf Magnus wrote:

    <snip>
    >>
    >> Linefeeds and carriage returns don't matter in XML. The other
    >> differences are ruled out by specifying the encoding. Any XML parser
    >> should understand utf-8.

    >
    > Actually, to be an XML parser it must support UTF-8 and UTF-16. UTF-16
    > has byte-ordering issues. Writing a UTF-16 file on different CPUs can
    > result in text files that are different. This can be resolved because of
    > the encoding that the UTF standards use, but it means that any true XML
    > parser must deal with big-endian/little-endian issues.


    Don't want to be seen to be supporting XML here, but doesn't the UTF-16
    standard define byte ordering? I was under the impression (without
    having done any work with it) that a UTF-16 multi-byte sequence could be
    parsed as a byte stream.

    --
    Corey Murtagh
    The Electric Monk
    "Quidquid latine dictum sit, altum viditur!"
    Corey Murtagh, Jun 9, 2004
    #19
  20. Jeff Brooks () wrote:
    : Rolf Magnus wrote:
    : > Arthur J. O'Dwyer wrote:
    : >
    : >>On Thu, 27 May 2004, Eric wrote:
    : >>
    : >>>Assume that disk space is not an issue [...]
    : >>>Assume that transportation to another OS may never occur.
    : >>>Are there any solid reasons to prefer text files over binary files?
    : >>>
    : >>>Some of the reasons I can think of are:
    : >>>
    : >>>-- should transportation to another OS become useful or needed,
    : >>> the text files would be far easier to work with
    : >>
    : >> I would guess this is wrong, in general. Think of the difference
    : >>between a DOS/Win32 text file, a MacOS text file, and a *nix text
    : >>file (hint: linefeeds and carriage returns).
    : >
    : > Linefeeds and carriage returns don't matter in XML. The other
    : > differences are ruled out by specifying the encoding. Any XML parser
    : > should understand utf-8.

    : Actually, to be an XML parser it must support UTF-8, and UTF-16. UTF-16
    : has byte ordering issues.

    You can only have byte order issues when you store the UTF-16 as 8 bit
    bytes. But a stream of 8 bit bytes is _not_ UTF-16, which by definition
    is a stream of 16 bit entities, so it is not the UTF-16 that has byte
    order issues.

    However, even the storage issue should have been trivial to solve - and
    would simply have consisted of requiring 8 bit streams encoding 16 bit
    unicode values to use network byte order, as is required in similar
    situations within internet protocols (which are used with no
    interoperability issues between all sorts of endians). The lack of
    specifying and requiring this, and instead using the zero-width no-break
    space to help the reader "guess" which byte ordering was used in the
    translation from 16 bit information units into 8 bit storage units, is by
    far one of the biggest kludges ever.
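
    The BOM mechanism being described can be sketched (Python): a U+FEFF
    written first lets a reader guess the byte order - exactly the
    guess-from-the-data scheme objected to above.

```python
text = "<a/>"

be = text.encode("utf-16-be")        # same characters...
le = text.encode("utf-16-le")
assert be != le                      # ...different bytes per endianness

# The "utf-16" codec prepends a BOM (U+FEFF) in the machine's native order.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (b"\xfe\xff", b"\xff\xfe")

# A decoder inspects the BOM to pick the right byte order:
assert with_bom.decode("utf-16") == text
```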
    Malcolm Dew-Jones, Jun 9, 2004
    #20
