Binary or text file

Discussion in 'C++' started by list@ubootsuxx.de, May 10, 2007.

  1. Guest

    Hi folks,

    I am new to Googlegroups. Until now I have asked my questions in other
    forums.

    I have an important question: I have to check whether files are
    binary (.bmp, .avi, .jpg) or text (.txt, .cpp, .h, .php, .html). How can I
    check a file and find out whether it is binary or text?

    Thanks for your help.
     
    , May 10, 2007
    #1

  2. osmium Guest

    <> wrote:

    > I am new to Googlegroups. I asked my questions at other forums, since
    > now.
    >
    > I have an important question: I have to check files if they are
    > binary(.bmp, .avi, .jpg) or text(.txt, .cpp, .h, .php, .html). How to
    > check a file an find out if the file is binary or text?


    You can't. You can only determine with high probability what the file is.
    Assuming ASCII code, only a very few of the control characters ever appear in
    a text file. It's like looking for a white crow: if you had looked at just one
    more crow, it might have been a white one. But if you have the file
    extension (as above), you can look it up at Wotsit and get an answer.

    http://www.wotsit.org/
     
    osmium, May 10, 2007
    #2

  3. > You can't. You can only determine with high probability what the file is.
    > Assuming ASCII code only a very few of the control characters ever appear in
    > a text file.


    That's pretty much the way to do it. The Unix command `file' works
    pretty much like this. It'll generally take the first 512 bytes of the
    file and from that it can determine the type of file. Binary files tend
    to have a lot of padding with bytes zeroed out, while ASCII files will
    consist almost entirely of bytes with a value of 32 or more, plus a few
    control characters such as tab, line feed and carriage return.
     
    Keith Halligan, May 10, 2007
    #3
  4. On May 10, 3:16 pm, Keith Halligan <> wrote:
    > > You can't. You can only determine with high probability what the file is.
    > > Assuming ASCII code only a very few of the control characters ever appear in
    > > a text file.

    >
    > Thats pretty much the way to do it. If you take the unix command
    > `file' it does it pretty much like this. It'll generally take the
    > first 512 bytes of the file and from that it can determine the type of
    > file. Binary files tend to have a lot of padding with bytes zeroed
    > out, while ascii files will have every byte having a value > 30.


    there is a 'file' command utility in unix that does the job
    borrow source code from it :)
     
    Diego Martins, May 11, 2007
    #4
  5. James Kanze Guest

    On May 10, 8:16 pm, Keith Halligan <> wrote:
    > > You can't. You can only determine with high probability what the file is.
    > > Assuming ASCII code only a very few of the control characters ever appear in
    > > a text file.


    > Thats pretty much the way to do it. If you take the unix command
    > `file' it does it pretty much like this. It'll generally take the
    > first 512 bytes of the file and from that it can determine the type of
    > file. Binary files tend to have a lot of padding with bytes zeroed
    > out, while ascii files will have every byte having a value > 30.


    Note, however, that the file utility has a very high error rate.
    And it knows a fair amount about the formats of different types
    of binary files, and can recognize those because of various
    embedded magic numbers---if the file matches a known format,
    then it isn't plain text.
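
    As an illustration of the magic-number idea, here is a small C++ sketch
    that checks a handful of well-known signatures: "BM" for BMP, FF D8 FF for
    JPEG, "RIFF" plus "AVI " at offset 8 for AVI, and "MZ" for DOS/Windows
    executables. The function name sniffMagic and the returned labels are
    made up for this example; the real file utility uses a far larger
    signature database.

```cpp
#include <cstddef>
#include <string>

// Returns a short label for a few well-known signatures, or "" if none match.
// `head` is assumed to hold the first bytes of the file (12 are enough here).
std::string sniffMagic(const std::string& head)
{
    auto startsWith = [&head](const char* sig, std::size_t n) {
        return head.size() >= n && head.compare(0, n, sig, n) == 0;
    };
    if (startsWith("BM", 2)) {
        return "bmp";
    }
    if (startsWith("\xFF\xD8\xFF", 3)) {
        return "jpeg";
    }
    if (startsWith("RIFF", 4) && head.size() >= 12 &&
        head.compare(8, 4, "AVI ", 4) == 0) {
        return "avi";
    }
    if (startsWith("MZ", 2)) {
        return "exe";
    }
    return "";                          // unknown: may still be text or binary
}
```

    Two-byte signatures such as "BM" and "MZ" are weak evidence: a plain text
    file can start with the same letters, which is one source of the error
    rate mentioned above.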

    In practice, today, ASCII is pretty much inexistant; most text
    is in some other encoding. A file in UTF-32LE, for example,
    with English text, will have close to 3/4 of the bytes 0. You
    can still try some heuristics: if you have a file with 1 byte
    non-0, then three 0's, and that pattern repeats, with few
    exceptions, there's a very good chance that it is UTF-32LE. But
    it's more complicated (and globally, less reliable) than back in
    the days when everything was ASCII.
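
    The UTF-32LE pattern described here (one non-zero byte, then three zero
    bytes, repeating with few exceptions) can be sketched in C++ as follows.
    The function name, the 90% threshold and the minimum of four groups are
    assumptions chosen for the example:

```cpp
#include <cstddef>
#include <string>

// Returns true when at least 90% of the complete 4-byte groups in `data`
// look like "one non-zero byte followed by three zero bytes".
bool looksLikeUtf32LeAscii(const std::string& data)
{
    const std::size_t groups = data.size() / 4;
    if (groups < 4) {
        return false;                   // too little data to judge
    }
    std::size_t matching = 0;
    for (std::size_t i = 0; i < groups; ++i) {
        const std::size_t p = i * 4;
        if (data[p] != 0 && data[p + 1] == 0 &&
            data[p + 2] == 0 && data[p + 3] == 0) {
            ++matching;
        }
    }
    return matching * 10 >= groups * 9; // "few exceptions" taken as 10% here
}
```

    This only recognises text made up mostly of characters below U+0100; text
    in other scripts would need a looser test on the two high-order bytes.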

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, May 11, 2007
    #5
  6. On May 12, 6:14 am, James Kanze <> wrote:
    ....
    >
    > In practice, today, ASCII is pretty much inexistant; most text
    > is in some other encoding.


    Really ? Most text files I see don't have any characters beyond the
    ASCII set which would make them ASCII.

    > .... A file in UTF-32LE, for example,
    > with English text, will have close to 3/4 of the bytes 0. You
    > can still try some heuristics: if you have a file with 1 byte
    > non-0, then three 0's, and that pattern repeats, with few
    > exceptions, there's a very good chance that it is UTF-32LE. But
    > it's more complicated (and globally, less reliable) that back in
    > the days when everything was ASCII.


    I have yet to see a UTF-32LE file in the wild. Even the UTF-16 files
    I've seen are far and few between. I'd like to believe that utf-8
    will become the default text format and there are a few tests to
    determine the likelihood of a file being utf-8 (and no, it's probably
    not a BOM at the beginning of the file).
     
    Gianni Mariani, May 12, 2007
    #6
  7. James Kanze Guest

    On May 12, 1:32 am, Gianni Mariani <> wrote:
    > On May 12, 6:14 am, James Kanze <> wrote:
    > ...


    > > In practice, today, ASCII is pretty much inexistant; most text
    > > is in some other encoding.


    > Really ? Most text files I see don't have any characters beyond the
    > ASCII set which would make them ASCII.


    Really. You must live a very parochial life. I find accented
    characters pretty regularly in my files (including in C++ source
    files). And ASCII doesn't have any accented characters.

    You're reading this thread; there are non-ASCII characters in
    the messages in it. (Check out my signature, for example.)
    Practically, if you're connected to the network, you can forget
    about ASCII; you have to be able to handle a large number of
    different character encodings.

    > > .... A file in UTF-32LE, for example,
    > > with English text, will have close to 3/4 of the bytes 0. You
    > > can still try some heuristics: if you have a file with 1 byte
    > > non-0, then three 0's, and that pattern repeats, with few
    > > exceptions, there's a very good chance that it is UTF-32LE. But
    > > it's more complicated (and globally, less reliable) that back in
    > > the days when everything was ASCII.


    > I have yet to see a UTF-32LE file in the wild.


    I haven't either, but I know that they exist. I've also created
    a few for test purposes.

    > Even the UTF-16 files I've seen are far and few between.


    Curious. From what I understand, UTF-16 is the standard
    encoding under Windows. And machines running Windows aren't
    exactly "few and far between".

    > I'd like to believe that utf-8
    > will become the default text format


    I would too, but given the legacy that has to be taken into
    account, I don't realistically expect it to happen any time
    soon.

    > and there are a few tests to
    > determine the likliness of a file being utf-8 (and no, it's probably
    > not a BOM at the beginning of the file).


    Actually, UTF-8 isn't that difficult. If the first 500 some
    bytes don't contain an illegal UTF-8 sequence, there's only a
    very small probability that the file isn't UTF-8.
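
    A sketch of such a test in C++, assuming the caller passes in a prefix of
    the file (for example the first 512 bytes). The function name is made up,
    and accepting a sequence that is cut off at the end of the prefix is a
    design choice of this example:

```cpp
#include <cstddef>
#include <string>

// Returns true if `data` contains no ill-formed UTF-8 sequence. A multi-byte
// sequence cut off at the very end of the buffer is tolerated, since `data`
// is assumed to be a fixed-size prefix of a longer file.
bool isPlausiblyUtf8(const std::string& data)
{
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;          // continuation bytes expected
        unsigned long minValue = 0;     // smallest code point for this length
        unsigned long cp = 0;
        if (lead < 0x80) {
            ++i;                        // plain ASCII byte
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; minValue = 0x80;    cp = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; minValue = 0x800;   cp = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; minValue = 0x10000; cp = lead & 0x07;
        } else {
            return false;               // stray continuation or invalid lead byte
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            if (i + k >= n) {
                return true;            // truncated by the prefix cut, accept
            }
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;           // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;               // overlong form, out of range, or surrogate
        }
        i += extra + 1;
    }
    return true;
}
```

    Pure ASCII input passes as well, since ASCII is a subset of UTF-8. The
    test is only informative when the prefix contains at least a few bytes
    with the high bit set.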

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, May 12, 2007
    #7
  8. On May 12, 7:18 pm, James Kanze <> wrote:
    > On May 12, 1:32 am, Gianni Mariani <> wrote:
    >
    > > On May 12, 6:14 am, James Kanze <> wrote:
    > > ...
    > > > In practice, today, ASCII is pretty much inexistant; most text
    > > > is in some other encoding.

    > > Really ? Most text files I see don't have any characters beyond the
    > > ASCII set which would make them ASCII.

    >
    > Really. You must live a very parochial life.


    What is with you French ? Nuking the pacific is not enough ?

    > ... I find accented
    > characters pretty regularly in my files (including in C++ source
    > files). And ASCII doesn't have any accented characters.


    I think my claim is valid, most, i.e. 50% or more of text files I use
    are ASCII. If it wasn't for your .sig having a few 8859-1 characters
    in it, your posts would be ASCII as well.

    ....
    > > Even the UTF-16 files I've seen are far and few between.

    >
    > Curious. From what I understand, UTF-16 is the standard
    > encoding under Windows. And machines running Windows aren't
    > exactly "few and far between".


    Still, even on Windows, most text files are created as 8 bit. The
    only tool I use regularly that produces utf-16 files is regedit,
    although it will read utf-8 files correctly.

    I suspect very few applications will read utf-16 in a conforming way.
    I don't know if ISO-10646 has been updated, but a while back, utf-16 was a
    stateful encoding (it still is for all intents and purposes). Any
    time you read a reversed BOM you need to swap endianness. I have met
    very few programmers that know what a surrogate pair is.
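
    For readers who have not met them: a surrogate pair is two 16-bit UTF-16
    code units (a high one in D800..DBFF followed by a low one in DC00..DFFF)
    that together encode one code point above U+FFFF. A small C++ sketch,
    with function names invented for this example:

```cpp
#include <cstdint>

// True if `unit` is the first (high) half of a surrogate pair.
bool isHighSurrogate(std::uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

// True if `unit` is the second (low) half of a surrogate pair.
bool isLowSurrogate(std::uint16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combines a high/low pair into a code point in U+10000..U+10FFFF.
// The caller is assumed to have checked both halves first.
std::uint32_t combineSurrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
                    + (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// A byte-swapped BOM (U+FEFF read as 0xFFFE) indicates that the 16-bit units
// were written in the other byte order, so each unit has to be swapped.
std::uint16_t swapBytes(std::uint16_t unit)
{
    return static_cast<std::uint16_t>((unit << 8) | (unit >> 8));
}
```

    For example, the units D83D DE00 combine to U+1F600, and swapBytes turns
    a reversed BOM 0xFFFE back into 0xFEFF.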

    >
    > > I'd like to believe that utf-8
    > > will become the default text format

    >
    > I would too, but given the passive that has to be taken into
    > account, I don't realistically expect it to happen any time
    > soon.


    Well, there are a lot of websites that claim to push utf-8 and most
    browsers support utf-8 well - even bidi selection works like it should,
    which is quite cool.

    >
    > > and there are a few tests to
    > > determine the likliness of a file being utf-8 (and no, it's probably
    > > not a BOM at the beginning of the file).

    >
    > Actually, UTF-8 isn't that difficult. If the first 500 some
    > bytes don't contain an illegal UTF-8 sequence, there's only a
    > very small probability that the file isn't UTF-8.


    Yes. That's right. You need to have a lib that is robust enough to
    tell you.
     
    Gianni Mariani, May 12, 2007
    #8
  9. ajk Guest

    On 10 May 2007 09:58:41 -0700, wrote:

    >Hi folks,
    >
    >I am new to Googlegroups. I asked my questions at other forums, since
    >now.
    >
    >I have an important question: I have to check files if they are
    >binary(.bmp, .avi, .jpg) or text(.txt, .cpp, .h, .php, .html). How to
    >check a file an find out if the file is binary or text?
    >
    >Thanks for your help.


    Depends a bit what you mean with "binary"

    If you are under Windows you can determine if a file is an .exe-file
    by reading the first few bytes in the file. Strictly speaking all
    files are stored in binary format and it is a matter of interpreting
    the contents.
    /ajk
     
    ajk, May 12, 2007
    #9
  10. osmium Guest

    "ajk" writes:

    >>I am new to Googlegroups. I asked my questions at other forums, since
    >>now.
    >>
    >>I have an important question: I have to check files if they are
    >>binary(.bmp, .avi, .jpg) or text(.txt, .cpp, .h, .php, .html). How to
    >>check a file an find out if the file is binary or text?
    >>
    >>Thanks for your help.

    >
    > Depends a bit what you mean with "binary"
    >
    > If you are under Windows you can determine if a file is an .exe-file
    > by reading the first few bytes in the file. Strictly speaking all
    > files are stored in binary format and it is a matter of interpreting
    > the contents.


    Since he posted the question to a.l.c++ we assume he wants an answer that
    is appropriate within the context of that language. I think you should
    think more deeply about the difference between "highly likely" and *is*.
     
    osmium, May 12, 2007
    #10
  11. James Kanze Guest

    On May 12, 2:51 pm, Gianni Mariani <> wrote:
    > On May 12, 7:18 pm, James Kanze <> wrote:


    > > On May 12, 1:32 am, Gianni Mariani <> wrote:


    > > > On May 12, 6:14 am, James Kanze <> wrote:
    > > > ...
    > > > > In practice, today, ASCII is pretty much inexistant; most text
    > > > > is in some other encoding.
    > > > Really ? Most text files I see don't have any characters beyond the
    > > > ASCII set which would make them ASCII.


    > > Really. You must live a very parochial life.


    > What is with you French ? Nuking the pacific is not enough ?


    Racist, on top of it. I've worked in both France and Germany,
    and it is a fact of life that both languages have characters
    which aren't present in ASCII, but which are more or less
    necessary if the text is to be understood, or at least appear
    normal. From what I've seen of other languages, this seems to
    be the usual case. Long before Unicode, different regions
    developed different encodings to handle non-US ASCII characters,
    because a definite need for it was felt.

    > > ... I find accented
    > > characters pretty regularly in my files (including in C++ source
    > > files). And ASCII doesn't have any accented characters.


    > I think my claim is valid, most, i.e. 50% or more of text files I use
    > are ASCII. If it wasn't for your .sig having a few 8859-1 characters
    > in it, your posts would be ASCII as well.


    Not all my posts. I frequently post to fr.comp.lang.c++ and
    de.comp.lang.iso-c++ as well, and my posts there contain
    characters which are not ASCII.

    Formally, of course, the issue is far from simple. If you're
    dealing with text data over the network, you have to be ready to
    handle different code sets. In practice, most protocols will
    insist on either one of the Unicode encodings or an encoding
    which shares the first 128 characters with ASCII for the start
    of the headers, until you've transmitted the information as to
    which encoding you are actually using. And if you know that it
    is text, and that it starts with a header, picking between
    UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE and a byte encoding is
    trivial, and that allows you to get through until you've read
    the real encoding.
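
    One way to make that choice, sketched in C++: if the stream is known to
    start with non-NUL ASCII characters and has no BOM in front of them, the
    positions of the zero bytes among the first four bytes identify the
    encoding family. The enum and function names are invented for this
    example.

```cpp
#include <string>

enum class Family { Utf32Be, Utf32Le, Utf16Be, Utf16Le, ByteOriented, Unknown };

// Guesses the encoding family from the first four bytes of a stream,
// assuming the text starts with non-NUL ASCII characters (for example the
// start of a header) and that no BOM precedes them.
Family guessFamily(const std::string& head)
{
    if (head.size() < 4) {
        return Family::Unknown;
    }
    const bool z0 = head[0] == 0;
    const bool z1 = head[1] == 0;
    const bool z2 = head[2] == 0;
    const bool z3 = head[3] == 0;
    if ( z0 &&  z1 &&  z2 && !z3) return Family::Utf32Be;      // 00 00 00 xx
    if (!z0 &&  z1 &&  z2 &&  z3) return Family::Utf32Le;      // xx 00 00 00
    if ( z0 && !z1 &&  z2 && !z3) return Family::Utf16Be;      // 00 xx 00 xx
    if (!z0 &&  z1 && !z2 &&  z3) return Family::Utf16Le;      // xx 00 xx 00
    if (!z0 && !z1 && !z2 && !z3) return Family::ByteOriented; // xx xx xx xx
    return Family::Unknown;
}
```

    ByteOriented covers UTF-8 and the ISO 8859 family alike; telling those
    apart still requires the declaration in the header itself.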

    And of course, most of the newer protocols just say: it has to
    be UTF-8.

    > ...


    > > > Even the UTF-16 files I've seen are far and few between.


    > > Curious. From what I understand, UTF-16 is the standard
    > > encoding under Windows. And machines running Windows aren't
    > > exactly "few and far between".


    > Still, even on Windows, most text files are created as 8 bit. The
    > only tool I use regularly that produces utf-16 files in regedit
    > although it will read utf-8 files correctly.


    > I suspect very few applications will read utf-16 in a conforming way.
    > I don't if ISO-10646 has been updated, but a while back, utf-16 was a
    > stateful encoding (it still is for all intents and purposes). Any
    > time you read a reversed BOM you need to swap endianness. I have met
    > very few programmers that know what a surrogate pair is.


    I have met very few programmers who even know that there exist
    character sets which aren't encoded using single, 8 bit
    characters. I'm not saying that ignorance isn't widespread,
    but I will try to fight it, whenever I can.

    > > > I'd like to believe that utf-8
    > > > will become the default text format


    > > I would too, but given the passive that has to be taken into
    > > account, I don't realistically expect it to happen any time
    > > soon.


    > Well. there are alot of websites that claim to push utf-8 and most
    > browsers support utf-8 well - even bidi selection works like it should
    > which is quite cool


    It's making headway. But a lot of code and text is old code and
    text. And it's not going to go away anytime soon.

    > > > and there are a few tests to
    > > > determine the likliness of a file being utf-8 (and no, it's probably
    > > > not a BOM at the beginning of the file).


    > > Actually, UTF-8 isn't that difficult. If the first 500 some
    > > bytes don't contain an illegal UTF-8 sequence, there's only a
    > > very small probability that the file isn't UTF-8.


    > Yes. That's right. You need to have a lib that is robust enough to
    > tell you.


    Or write one yourself:).

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, May 12, 2007
    #11
  12. On May 13, 7:39 am, James Kanze <> wrote:
    > On May 12, 2:51 pm, Gianni Mariani <> wrote:
    >
    > > On May 12, 7:18 pm, James Kanze <> wrote:
    > > > On May 12, 1:32 am, Gianni Mariani <> wrote:
    > > > > On May 12, 6:14 am, James Kanze <> wrote:
    > > > > ...
    > > > > > In practice, today, ASCII is pretty much inexistant; most text
    > > > > > is in some other encoding.
    > > > > Really ? Most text files I see don't have any characters beyond the
    > > > > ASCII set which would make them ASCII.
    > > > Really. You must live a very parochial life.

    > > What is with you French ? Nuking the pacific is not enough ?

    >
    > Racist, on top of it. I've worked in both France and Germany,
    > and it is a fact of life that both languages have characters
    > which aren't present in ASCII, but which are more or less
    > necessary if the text is to be understood, or at least appear
    > normal. ...


    OK, the French didn't nuke the Pacific now ... and by claiming they
    did one is now racist ?

    Because someone does not use accented characters one is now
    "parochial".

    And because someone does not agree with you one is "inexperienced".

    Yup. Sounds French to me. If you can't use facts, use personal
    attacks.

    > ... From what I've seen of other languages, this seems to
    > be the usual case. Long before Unicode, different regions
    > developed different encodings to handle non-US ASCII characters,
    > because a definite need for it was felt.


    ISO-8859-1, -2 ... -15, JIS, ShiftJIS, EUC-*, ISO-2022, Big5, KOI-8
    are ones I have personally worked with. It was a mess. That's why I
    pushed for Unicode (utf-8) adoption as much as I could. Many file
    formats became utf-8 because I suggested and explained to developers
    what they needed to do otherwise and believe me, it was not easy to
    convince people to use utf-8.

    One of the nicest but underused features of Unicode text is language
    tagging. A unicode text string is able to tell you what language it
    is (meaning that all unicode text is stateful) but very few people
    implement it.

    >
    > > > ... I find accented
    > > > characters pretty regularly in my files (including in C++ source
    > > > files). And ASCII doesn't have any accented characters.

    > > I think my claim is valid, most, i.e. 50% or more of text files I use
    > > are ASCII. If it wasn't for your .sig having a few 8859-1 characters
    > > in it, your posts would be ASCII as well.

    >
    > Not all my posts. I frequently post to fr.comp.lang.c++ and
    > de.comp.lang.iso-c++ as well, and my posts there contain
    > characters which are not ASCII.


    That's nice. You have such a colorful world with that accented É, and
    that eszett character 'ß' is the pivot of the spice of life.
    ....
    >
    > And of course, most of the newer protocols just say: it has to
    > be UTF-8.


    That's the conclusion I came to very early. I remember when I posted
    that suggestion and I was told I was being bigoted.

    >
    > > ...
    > > > > Even the UTF-16 files I've seen are far and few between.
    > > > Curious. From what I understand, UTF-16 is the standard
    > > > encoding under Windows. And machines running Windows aren't
    > > > exactly "few and far between".

    >> ... I have met
    > > very few programmers that know what a surrogate pair is.

    >
    > I have met very few programmers who even know that there exist
    > character sets which aren't encoded using single, 8 bit
    > characters. I'm not saying that ignorance isn't wide spread,
    > but I will try to fight it, whenever I can.


    Life with Unicode is much easier. Surprisingly little code really needs
    to care that it is parsing utf-8. Some code will break because it
    splits characters or it compares un-normalized strings, but these
    problems are far easier to deal with than the mish-mash of encodings
    in the past.

    >
    > > > > I'd like to believe that utf-8
    > > > > will become the default text format
    > > > I would too, but given the passive that has to be taken into
    > > > account, I don't realistically expect it to happen any time
    > > > soon.

    > > Well. there are alot of websites that claim to push utf-8 and most
    > > browsers support utf-8 well - even bidi selection works like it should
    > > which is quite cool

    >
    > It's making headway. But a lot of code and text is old code and
    > text. And it's not going to go away anytime soon.


    Do you normalize your unicode strings ? Do you apply state from
    unicode language tags across all strings you extract from a stream of
    unicode characters ?

    >
    > > > > and there are a few tests to
    > > > > determine the likliness of a file being utf-8 (and no, it's probably
    > > > > not a BOM at the beginning of the file).
    > > > Actually, UTF-8 isn't that difficult. If the first 500 some
    > > > bytes don't contain an illegal UTF-8 sequence, there's only a
    > > > very small probability that the file isn't UTF-8.

    > > Yes. That's right. You need to have a lib that is robust enough to
    > > tell you.

    >
    > Or write one yourself:).


    You have probably used one I wrote. Do you know where the "-l" in
    iconv came from ?
     
    Gianni Mariani, May 12, 2007
    #12
  13. Ian Collins Guest

    Gianni Mariani wrote:
    > On May 13, 7:39 am, James Kanze <> wrote:
    >> On May 12, 2:51 pm, Gianni Mariani <> wrote:
    >>
    >>> On May 12, 7:18 pm, James Kanze <> wrote:
    >>>
    >>>> Really. You must live a very parochial life.
    >>> What is with you French ? Nuking the pacific is not enough ?

    >> Racist, on top of it. I've worked in both France and Germany,
    >> and it is a fact of life that both languages have characters
    >> which aren't present in ASCII, but which are more or less
    >> necessary if the text is to be understood, or at least appear
    >> normal. ...

    >
    > OK, the French didn't nuke the Pacific now ... and by claiming they
    > did one is now racist ?
    >
    > Because someone does not use accented characters one is now
    > "parochial".
    >

    Why all the crap? Just because you and I don't see many text files with
    extended character sets, doesn't mean they aren't in widespread use.

    If you want to pick a fight, find a rough bar.

    --
    Ian Collins.
     
    Ian Collins, May 12, 2007
    #13
  14. On May 13, 8:56 am, Ian Collins <> wrote:
    > Gianni Mariani wrote:
    > > On May 13, 7:39 am, James Kanze <> wrote:
    > >> On May 12, 2:51 pm, Gianni Mariani <> wrote:

    >
    > >>> On May 12, 7:18 pm, James Kanze <> wrote:

    >
    > >>>> Really. You must live a very parochial life.
    > >>> What is with you French ? Nuking the pacific is not enough ?
    > >> Racist, on top of it. I've worked in both France and Germany,
    > >> and it is a fact of life that both languages have characters
    > >> which aren't present in ASCII, but which are more or less
    > >> necessary if the text is to be understood, or at least appear
    > >> normal. ...

    >
    > > OK, the French didn't nuke the Pacific now ... and by claiming they
    > > did one is now racist ?

    >
    > > Because someone does not use accented characters one is now
    > > "parochial".

    >
    > Why all the crap?


    Is that a technical term ?

    > ... Just because you and I don't see many text files with
    > extended character sets, doesn't mean that aren't in widespread use.


    The claim by James was that today "ASCII is pretty much
    inexistant(sic)". Which is blatantly wrong. Having pointed that out
    to him, he shoots back using "parochial" or "inexperienced" to justify
    himself.

    James, being of German and French background, I could hope for a more
    Swiss-neutral attitude but it appears that we have a classic Parisian
    arrogance with a German bureaucratic mind-set. I haven't met too many
    of these guys around.

    >
    > If you want to pick a fight, find a rough bar.


    You're right, I should have known better.

    So, we should all proclaim that all ASCII files are now officially
    utf-8 and all other text formats are deprecated and should be deleted.
     
    Gianni Mariani, May 13, 2007
    #14
  15. On May 12, 11:42 pm, ajk <> wrote:
    > On 10 May 2007 09:58:41 -0700, wrote:


    > ...Strictly speaking all
    > files are stored in binary format and it is a matter of interpreting
    > the contents.


    Strictly speaking, that is not true depending on who you're talking
    about doing the interpretation. Some systems (VMS) didn't allow you
    to read the binary stream of all files and would have a "record
    management services" (RMS) get in the way. Those days are more or less
    gone (thank Unix).
     
    Gianni Mariani, May 13, 2007
    #15
  16. Gianni Mariani, May 13, 2007
    #16
  17. On Sat, 12 May 2007 15:46:12 -0700, Gianni Mariani wrote:
    > On May 13, 7:39 am, James Kanze <> wrote:
    >> On May 12, 2:51 pm, Gianni Mariani <> wrote:
    >>
    >> > On May 12, 7:18 pm, James Kanze <> wrote:
    >> > > On May 12, 1:32 am, Gianni Mariani <> wrote:
    >> > > > On May 12, 6:14 am, James Kanze <> wrote:
    >> > > > ...
    >> > > > > In practice, today, ASCII is pretty much inexistant; most text
    >> > > > > is in some other encoding.
    >> > > > Really ? Most text files I see don't have any characters beyond
    >> > > > the ASCII set which would make them ASCII.
    >> > > Really. You must live a very parochial life.
    >> > What is with you French ? Nuking the pacific is not enough ?

    >>
    >> Racist, on top of it. I've worked in both France and Germany, and it
    >> is a fact of life that both languages have characters which aren't
    >> present in ASCII, but which are more or less necessary if the text is
    >> to be understood, or at least appear normal. ...

    >
    > OK, the French didn't nuke the Pacific now ... and by claiming they did
    > one is now racist ?
    >
    > Because someone does not use accented characters one is now "parochial".
    >
    > And because someone does not agree with you one is "inexperienced".
    >
    > Yup. Sounds French to me. If you can't use facts, use personal
    > attacks.


    Funny how nationalism rears its ugly head in the most unlikely places.

    Welcome to my kill file.

    --
    Markus Schoder
     
    Markus Schoder, May 13, 2007
    #17
  18. On May 13, 1:45 pm, Markus Schoder <> wrote:
    ....
    > Funny how nationalism rears its ugly head in the most unlikely places.
    >
    > Welcome to my kill file.


    That's usually written *PLONK*.

    I welcome our new kill file overloads.
     
    Gianni Mariani, May 13, 2007
    #18
  19. James Kanze Guest

    On May 13, 12:46 am, Gianni Mariani <> wrote:
    > On May 13, 7:39 am, James Kanze <> wrote:
    > > On May 12, 2:51 pm, Gianni Mariani <> wrote:


    > > > On May 12, 7:18 pm, James Kanze <> wrote:
    > > > > On May 12, 1:32 am, Gianni Mariani <> wrote:
    > > > > > On May 12, 6:14 am, James Kanze <> wrote:
    > > > > > ...
    > > > > > > In practice, today, ASCII is pretty much inexistant; most text
    > > > > > > is in some other encoding.
    > > > > > Really ? Most text files I see don't have any characters beyond the
    > > > > > ASCII set which would make them ASCII.
    > > > > Really. You must live a very parochial life.
    > > > What is with you French ? Nuking the pacific is not enough ?


    > > Racist, on top of it. I've worked in both France and Germany,
    > > and it is a fact of life that both languages have characters
    > > which aren't present in ASCII, but which are more or less
    > > necessary if the text is to be understood, or at least appear
    > > normal. ...


    > OK, the French didn't nuke the Pacific now ... and by claiming they
    > did one is now racist ?


    What does nuking the Pacific have to do with anything? It's
    racist to condemn all French because some idiotic government
    officials do something stupid. If you're going to judge
    everyone by their government, what would one say about the
    Americans today?

    > Because someone does not use accented characters one is now
    > "parochial".


    Because one doesn't take into account that they exist, one is
    very parochial.

    [...]
    > > And of course, most of the newer protocols just say: it has to
    > > be UTF-8.


    > That's the conclusion I came to very early. I remember when I posted
    > that suggestion and I was told I was being bigoted.


    By who? I think that there is a consensus that UTF-8 is the way
    to go. The problem is that reality isn't following that
    consensus very quickly, and that as soon as a computer is
    connected to the network, it has to deal with all sorts of weird
    encodings. It's a lot of extra work, for everyone involved, but
    that's life.

    [...]
    > Life with Unicode is much easier. Surprising little code really needs
    > to care that it is parsing utf-8.


    Are you kidding? What about code which uses e.g. "isalpha()".

    > Some code will break because it splits characters or it
    > compares un-normalized strings, but these problems are far
    > easier to deal with than the mish-mash of encodings in the
    > past.


    Easier, yes, but not all of the tools are necessarily in place.
    Things like "isalpha()" are an obvious problem.
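
    To illustrate the isalpha() problem: std::isalpha classifies one byte at
    a time, so it cannot see a multi-byte UTF-8 character as a unit. The
    sketch below counts the bytes it accepts; the function name is invented,
    and the stated results assume the default "C" locale that a program
    starts in.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Counts the bytes that std::isalpha accepts in the current C locale.
// Each byte is passed as unsigned char, as the function requires.
std::size_t countAlphaBytes(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if (std::isalpha(c)) {
            ++count;
        }
    }
    return count;
}
```

    In the "C" locale, "cafe" yields 4, but "café" encoded as UTF-8 (where é
    is the two bytes C3 A9) yields only 3: neither byte of the é is
    alphabetic on its own, although the character clearly is a letter.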

    > > > > Actually, UTF-8 isn't that difficult. If the first 500 some
    > > > > bytes don't contain an illegal UTF-8 sequence, there's only a
    > > > > very small probability that the file isn't UTF-8.
    > > > Yes. That's right. You need to have a lib that is robust enough to
    > > > tell you.


    > > Or write one yourself:).


    > You have probably used one I wrote. Do you know where the "-l" in
    > iconv came from ?


    What's iconv?

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, May 13, 2007
    #19
  20. James Kanze Guest

    On May 13, 1:51 am, Gianni Mariani <> wrote:
    > On May 13, 8:56 am, Ian Collins <> wrote:
    > > Gianni Mariani wrote:


    > > ... Just because you and I don't see many text files with
    > > extended character sets, doesn't mean that aren't in widespread use.


    More significantly, the software which generated what you are
    processing as "pure ASCII" probably was actually using some
    extended code set. There is no support for "pure ASCII" under
    Linux, as far as I can see, for example. The reality is that if
    your software doesn't correctly handle characters with a bit 7
    set, it is broken, because even in America, most of the tools
    can easily generate such files.

    I know that I have a couple of files which contain a 'ÿ' (y with
    a diaeresis) in ISO 8859-1, for test purposes. It's amazing how
    many programs treat it as an end of file. Would you (or Gianni,
    for that matter) consider this "correct", even if the program
    didn't have to deal with accented characters per se? Would you
    (or Gianni) consider it OK to not test this (limit) case,
    knowing that it is a frequent error?
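
    The usual cause of this error is storing the result of getc() or get() in
    a plain char before comparing it with EOF. 'ÿ' is the byte 0xFF in ISO
    8859-1; where char is signed, that byte becomes -1, which equals EOF on
    typical implementations. A C++ sketch of a loop that avoids the problem
    by keeping the result in the stream's int_type (function names are
    invented for this example):

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Counts the bytes in a stream. The result of get() is kept in int_type, so
// the byte 0xFF (as the value 255) stays distinct from the end-of-file value.
std::size_t countBytes(std::istream& in)
{
    using Traits = std::istream::traits_type;
    std::size_t count = 0;
    for (Traits::int_type c = in.get();
         !Traits::eq_int_type(c, Traits::eof());
         c = in.get()) {
        ++count;
    }
    return count;
}

// Convenience wrapper for trying the loop on an in-memory buffer.
std::size_t countBytesInString(const std::string& data)
{
    std::istringstream in(data);
    return countBytes(in);
}
```

    With this loop, a three-byte input consisting of 'a', 0xFF and 'b' is
    counted as three bytes instead of stopping after the first.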

    > The claim by James was that today "ASCII is pretty much
    > inexistant(sic)". Which is blatantly wrong.


    Statistics? ASCII isn't used by Windows. It's not available in
    the standard Linux distributions I use. All of the Internet
    protocols I know *now* require more. (The now is important.
    When I first implemented code around SMTP and NNTP, ASCII was
    the standard encoding, and in fact, the only one supported.)

    > Having pointed that out to him, he shoots back using
    > "parochial" or "inexperienced" to justify himself.


    > James, being of German and French background,


    James, being born and raised in the United States, and still
    holding an American passport...

    > I could hope for a more
    > Swiss-neutral attitude but it appears that we have a classic Parisian
    > arrogance with a German bureaucratic mind-set.


    More racism. I've not encountered any arrogance in Paris, and
    I've not found Germany to be any more bureaucratic than anywhere
    else.

    People with that sort of attitude are parochial. They've not
    gone out and actually considered other people for what they are.

    [...]
    > So, we should all proclaim that all ASCII files are now officially
    > utf-8 and all other text formats are deprecated and should be deleted.


    Of course, if you'd have actually read what you're responding
    to, I said that we have to deal with a lot of different code
    sets. And that that is a real problem.

    --
    James Kanze (Gabi Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, May 13, 2007
    #20
