Binary or text file

Gianni Mariani

What does nuking the Pacific have to do with anything.

It's arrogant, just like you are being.
... It's
racist to condemn all French because some idiotic government
officials do something stupid. If you're going to judge
everyone by their government, what would one say about the
Americans today?

The majority of US citizens voted in the scariest U.S. government in
my living history.
- the "Patriot Act" - we started seeing the worst of McCarthyism coming
back.
- The VP getting the biggest defence contracts to the company he used
to run
- Letting Microsoft off the hook for criminal activity

I can go on and on. The USA is run by the corps.
Because one doesn't take into account that they exist, one is
very parochial.

Re-read what you wrote and tell me you honestly believe that ASCII
files are nonexistent. I have applications that still generate
them. I still generate them. I see them every day. Accusing me of
being parochial is very arrogant and disingenuous.
[...]
That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.

By who?

The discussion I refer to is archived by google.
... I think that there is a consensus that UTF-8 is the way
to go.

It was not a consensus in 1996.

....
Are you kidding? What about code which uses e.g. "isalpha()".

Ok, you need to think a little harder at what you're trying to do.
Easier, yes, but not all of the tools are necessarily in place.
Things like "isalpha()" are an obvious problem.
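
To make the isalpha() problem concrete, here is a minimal C++ sketch (not code from anyone in this thread). It assumes the input is a run of raw bytes that may be UTF-8 or ISO 8859-1, and the helper name count_ascii_letters is made up for illustration. Passing a plain char with bit 7 set straight to std::isalpha() is undefined behaviour where char is signed, so each byte is converted to unsigned char first, and non-ASCII bytes are skipped rather than classified.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts the bytes that are ASCII letters.
// Bytes >= 0x80 (for example the pieces of a UTF-8 sequence) are never
// handed to std::isalpha(), because a single byte cannot tell us whether
// the character it belongs to is alphabetic.
std::size_t count_ascii_letters(const std::string& bytes) {
    std::size_t n = 0;
    for (char ch : bytes) {
        // Convert first: a negative char passed to isalpha() is undefined.
        const unsigned char u = static_cast<unsigned char>(ch);
        if (u < 0x80 && std::isalpha(u)) {
            ++n;
        }
    }
    return n;
}
```

This only sidesteps the crash or misclassification; deciding whether 'ï' counts as a letter still needs the encoding and a Unicode-aware library.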

There is a need to standardize on something that handles all these
things - ICU is the only thing I have seen that gets close.
What's iconv?

man iconv
 
James Kanze

Perhaps bigoted is the wrong word....

At least I can say I called this prophetically.

So what are we arguing about? You obviously know that other
code sets exist, and that we have to deal with them. And we
seem to agree with regards to the ideal solution.

The problem is that most vendors don't see any value in
supporting *both* ASCII and a superset of ASCII (e.g. ISO
8859-1, UTF-8, etc.), and because the superset is in practice
necessary, only provide it. You can pretend that you only have
to deal with ASCII, but in practice, you can't prevent
characters in the superset from appearing in your text files,
and a correct program has to deal with them correctly. The
default code set for Windows 8 bits is, I believe, ISO 8859-1.
(I'm not sure that this is by design. It may just be a case of
using UCS-2, and stripping off the top byte.) Which means that
you may have characters in the text which the user thought were
legitimate characters, because they displayed as such on his
screen. And that, even though they are native English speakers:
the name on the binding of the encyclopedia I used when at
school (in northern, rural Illinois---you can't get much more
parochial) started with an Æ, for example, and at least one of my
English teachers spelled 'naïve' with the diaeresis. If the
characters are available (and they are), they will be used. And
a correct program will handle them correctly. In practice, text
encoded in "ASCII" simply doesn't exist today, and programs
which assume that their text input is pure ASCII are simply
broken.

And from the link above, it's obvious that you know this as well
as I do, if not better. So what's the argument about?
 
Gianni Mariani

More significantly, the software which generated what you are
processing as "pure ASCII" probably was actually using some
extended code set. There is no support for "pure ASCII" under
Linux, as far as I can see, for example. The reality is that if
your software doesn't correctly handle characters with a bit 7
set, it is broken, because even in America, most of the tools
can easily generate such files.

I know that I have a couple of files which contain a 'ÿ' (y with
a diaeresis) in ISO 8859-1, for test purposes. It's amazing how
many programs treat it as an end of file. Would you (or Gianni,
for that matter) consider this "correct", even if the program
didn't have to deal with accented characters per se? Would you
(or Gianni) consider it OK to not test this (limit) case,
knowing that it is a frequent error?

Chill for a sec. You said ASCII files are "inexistant(sic)" which I
suspect means nonexistent. I call "bollocks" on that one and the
result is I'm being accused of being "parochial".
Statistics? ASCII isn't used by Windows. It's not available in
the standard Linux distributions I use. All of the Internet
protocols I know *now* require more. (The now is important.
When I first implemented code around SMTP and NNTP, ASCII was
the standard encoding, and in fact, the only one supported.)

And this has a bearing on (non)existence of ASCII files how ?
James, being born and raised in the United States, and still
holding an American passport...

Ah, even better. A bumbling American with an arrogant Parisian
attitude driven by a German bent for precision. Life sucks sometimes.
More racism. I've not encountered any arrogance in Paris, and

Really ? Where have you been in Paris ? Even the French I know
consider Parisians to be generally more arrogant than the rest of
France. Ah, there you go. You look at yourself in the mirror while
in Paris.
I've not found Germany to be any more bureaucratic than anywhere
else.

That's not what Germans say about themselves.
People with that sort of attitude are parochial. They've not
gone out and actually considered other people for what they are.

OK. Again, a very arrogant thing to say.
[...]
So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.

Of course, if you'd have actually read what you're responding
to, I said that we have to deal with a lot of different code
sets. And that that is a real problem.

You have a choice to refuse to deal with anything but utf8. Until you
do, you will whine.
 
Gianni Mariani

So what are we arguing about?

The (non)existence of ASCII files. You say they don't exist and I say
they do. We're talking about files, code that consumes files.

Pretty stupid thing to be arguing about.

But I suppose you're too busy accusing me of being parochial and I'm
too busy trying to explain why.
 
James Kanze

Chill for a sec. You said ASCII files are "inexistant(sic)" which I
suspect means nonexistent. I call "bollocks" on that one and the
result is I'm being accused of being "parochial".

If you refuse to recognize that most of the files you actually
have to deal with were not written in ASCII, but in some
superset of ASCII, you're living in some isolated, very
backwards community. Neither Linux nor Windows even supports
"ASCII" nowadays.

If you consider that ASCII is all we'll ever need, you're being
very parochial, not looking beyond a very, very limited
community of users.

Those are the facts. Whether you like them or not.
And this has a bearing on (non)existence of ASCII files how ?

Well, if ASCII isn't supported by Windows, and it isn't
supported by Linux, it obviously cannot be the general case for
most files.
Ah, even better. A bumbling American with an arrogant
Parisian attitude driven by a German bent for precision. Life
sucks sometimes.

Ah, even more blatant racism.

[Lots more racism cut...]
OK. Again, a very arrogant thing to say.

What's arrogant about calling a spade a spade?
[...]
So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.
Of course, if you'd have actually read what you're responding
to, I said that we have to deal with a lot of different code
sets. And that that is a real problem.
You have a choice to refuse to deal with anything but utf8. Until you
do, you will whine.

You may have a choice, but I live in the real world. The files
are there, and I have to deal with them. Whether I like it or
not.
 
James Kanze


[...]
The majority of US citizens voted in the scariest U.S. government in
my living history.
- the "Patriot Act" - we started seeing the worst of McCarthyism coming
back.
- The VP getting the biggest defence contracts to the company he used
to run
- Letting Microsoft off the hook for criminal activity
I can go on and on. The USA is run by the corps.

OK, you aren't racist. You just hate everybody :).

Seriously, do you really believe that you can judge people by
their government? Even in so-called democracies, like France
and the USA. I've lived in three different countries, and I've
very close contacts with a fourth (my wife is Italian). I've
found people to be pretty much the same everywhere, and in the
vast majority, I've found them to be pretty decent.
Re-read what you wrote and tell me you honestly believe that ASCII
files are nonexistent.

I haven't seen one for ages. Neither Linux nor Windows even
supports them.
I have applications that still generate them. I still
generate them. I see them every day. Accusing me of being
parochial is very arrogant and disingenuous.

So what machine are you using? Posix requires 8 bit characters,
and it doesn't have a function "isascii" anymore---it requires
full support for an eight bit character set. And of course,
correct code will not fail because some file happens to contain
an accented character. You can pretend that your files are
ASCII, but that's just pretending.
[...]
And of course, most of the newer protocols just say: it has to
be UTF-8.
That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.
By who?
The discussion I refer to is archived by google.
It was not a consensus in 1996.

That's a long time ago. (In our profession, at least.)
Ok, you need to think a little harder at what you're trying to do.

In general. Once you can no longer count on just ASCII, you do
have problems. Regardless of the encoding. On the whole, I
think UTF-8 is the only viable solution for communications, and
it is also the preferred solution for internal coding for a lot
of applications. Other applications will prefer UTF-32. And a
number of applications will still make do with some pure 8 bit
encoding, ISO 8859-1, or such.
There is a need to standardize on something that handles all these
things - ICU is the only thing I have seen that gets close.

They seem to have done the most work in this direction to date.
On the other hand, they use UTF-16, which doesn't seem a
judicious choice today: UTF-32 or UTF-8 would seem preferable,
depending on what the program is doing.
 
Gianni Mariani

....
A tad off topic. I suppose we digressed off topic a few posts ago.
OK, you aren't racist. You just hate everybody :).

I have an opinion. The older I get, the less critical I am of the
opinions I hold but the stronger the opinions are.

There is another theory I have about governments. You get the
government you deserve.
Seriously, do you really believe that you can judge people by
their government? Even in so-called democracies, like France
and the USA. I've lived in three different countries, and I've
very close contacts with a fourth (my wife is Italian). I've
found people to be pretty much the same everywhere, and in the
vast majority, I've found them to be pretty decent.

Same here. Almost exactly but they are all European/Roman in
heritage. Try spending some serious time in India, Thailand, PRC or
Hong Kong though, and then some more time in the Saudi, or UAE or
Nigeria even. The cultural skew takes some time to come to grips
with.

Just because a Parisian is arrogant to me doesn't mean I think the
worst of them. I talk about one particular incident in a Paris
restaurant all the time. It's very funny and if the listeners take
the time to think about it, it shows a very positive pride the French
have about themselves. I sometimes wish I had that a little more as
well. Nonetheless, it's arrogant and the arrogance comes out in other
ways. I have a similar incident in Switzerland (Neuchatel) and I can
tell you that I was annoyed, mostly with myself because I could not
fault the Swiss shop assistant for bending over backwards to help.

If I tried to pull off what the Parisians (allegedly) do where I live
now, I would probably come off as a total dick.
I haven't seen one for ages. Neither Linux nor Windows even
supports them.

Never mind. It matters not. The point I make is not so deep and
meaningful.
So what machine are you using? Posix requires 8 bit characters,
and it doesn't have a function "isascii" anymore---it requires
full support for an eight bit character set. And of course,
correct code will not fail because some file happens to contain
an accented character. You can pretend that your files are
ASCII, but that's just pretending.

You talk about processing technology, I talk about actual files. See,
not so deep and meaningful. It all works back to the context of the
original statement. Way too much energy spent here.
[...]
And of course, most of the newer protocols just say: it has to
be UTF-8.
That's the conclusion I came to very early. I remember when I posted
that suggestion and I was told I was being bigoted.
By who?
The discussion I refer to is archived by google.
... I think that there is a consensus that UTF-8 is the way
to go.
It was not a consensus in 1996.

That's a long time ago. (In our profession, at least.)
...
Ok, you need to think a little harder at what you're trying to do.

In general. Once you can no longer count on just ASCII, you do
have problems. Regardless of the encoding. On the whole, I
think UTF-8 is the only viable solution for communications, and
it is also the preferred solution for internal coding for a lot
of applications. Other applications will prefer UTF-32. And a
number of applications will still make do with some pure 8 bit
encoding, ISO 8859-1, or such.

At one point I was asked to give a recommendation on
internationalizing an application. It was a web browser. My default
answer was "wide chars", etc. etc. I examined the code and realized I'd
given the project a death sentence because there was no way the
project would recover so I went back to the team and said - JUST
KIDDING. What you need is utf8 with one of these special string
classes that converts a string transparently between utf-8 and utf-16
whenever it needs to and slowly move more of the application over to
wide char code. The code was migrated when it needed to and much of
the application didn't need touching.

The main point of this was that the codebase never broke uncontainably
and its i18n support improved incrementally until it was adequate
without needing to interrupt development of other parts of the
product.
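
The conversion at the heart of such a string class can be sketched in a few lines of C++. This is only an illustration of the UTF-8 to UTF-16 step, not the class that was actually used on that project; the function name utf8_to_utf16 is an assumption, and the validation is deliberately minimal (it does not reject overlong forms or encoded surrogates).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Throws std::invalid_argument on a bad lead byte, a bad continuation
// byte, a truncated sequence, or a value beyond U+10FFFF.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(in[i + k]);
            if ((cont & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else if (cp <= 0x10FFFF) {
            // Outside the BMP: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            throw std::invalid_argument("code point beyond U+10FFFF");
        }
    }
    return out;
}
```

A wrapper class could call something like this lazily, only when a wide-character API is actually reached, which is what lets the rest of the code stay on UTF-8.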
They seem to have done the most work in this direction to date.
On the other hand, they use UTF-16, which doesn't seem a
judicious choice today: UTF-32 or UTF-8 would seem preferable,
depending on what the program is doing.

Yeah, I recall having the same thought now. You should find this one
amusing:

http://mail-archives.apache.org/mod_mbox/xerces-c-dev/200007.mbox/<[email protected]>

Time for a new ICU.
 
Gianni Mariani

Those are the facts. Whether you like them or not.

aha. Fact is, - most of *MY* text files are ASCII.
Well, if ASCII isn't supported by Windows, and it isn't
supported by Linux, it obviously cannot be the general case for
most files.

aha. So. Fact is, - most of *MY* text files are ASCII.

....
Ah, even more blatant racism.

aha. I'm not sure if that's the PC American or the arrogant Parisian
talking.
[Lots more racism cut...]
OK. Again, a very arrogant thing to say.

What's arrogant about calling a spade a spade?

.... "Have you stopped beating your wife ?" ....

That's an example of what you do by what you say is "calling a spade a
spade".

I may, or may not be racist, but I am convinced that accusing someone
of racism without knowing (or wanting to know) who they are or what
they do is a display of arrogance.
[...]
So, we should all proclaim that all ASCII files are now officially
utf-8 and all other text formats are deprecated and should be deleted.
Of course, if you'd have actually read what you're responding
to, I said that we have to deal with a lot of different code
sets. And that that is a real problem.
You have a choice to refuse to deal with anything but utf8. Until you
do, you will whine.

You may have a choice, but I live in the real world. The files
are there, and I have to deal with them. Whether I like it or
not.

It seems you have made a choice.

I don't know what product you work on or even have decision making
power on, but, there are many ways to slice the problem.

You know, we really have to stop meeting this way, people might get
the wrong idea. I need to move on. Feel free to make whatever damage
you like, at this point I'm going to cut-n-run.

It's been fun.
 
James Kanze

On May 14, 5:24 pm, James Kanze wrote:
> On May 13, 1:40 pm, Gianni Mariani wrote:

[...]
Same here. Almost exactly but they are all European/Roman in
heritage. Try spending some serious time in India, Thailand, PRC or
Hong Kong though, and then some more time in the Saudi, or UAE or
Nigeria even. The cultural skew takes some time to come to grips
with.

Yes, but it's really still very superficial. Human nature is
human nature. It does make it more difficult to recognize the
similarities, however.

[...]
You talk about processing technology, I talk about actual files. See,
not so deep and meaningful. It all works back to the context of the
original statement. Way too much energy spent here.

They're related, but my real point was different. Perhaps if I
stated it something along the lines "a correct program cannot
assume that any file it reads contains only characters in the
ASCII character set."

It's a conceptual point of view. When I first started working
on Unix, we pretty much considered that all text files were
ASCII. In some ways, it was false even then; the OS never made
the slightest guarantee, and characters with the 8th bit set did
creep into text files from time to time. But we had a function,
isascii(), which we used to test for such characters, and if
they were present, we rejected the file as being corrupt.
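
For readers who never met isascii(): the old check amounts to testing that bit 7 is clear in every byte. A minimal C++ sketch of that test, with the hypothetical name is_pure_ascii, assuming the file contents have already been read into a std::string:

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true when every byte has bit 7 clear, i.e. the
// data lies in the subset common to ASCII, ISO 8859-x and UTF-8.
bool is_pure_ascii(const std::string& bytes) {
    return std::all_of(bytes.begin(), bytes.end(), [](char ch) {
        return static_cast<unsigned char>(ch) < 0x80;
    });
}
```

A false result says only that the data is not ASCII; it says nothing about which encoding it is, and so is no longer grounds for calling the file corrupt.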

Today, of course, we no longer have that function, and every
editor, on every system, is capable of generating accented
characters. So the files aren't really ASCII, but whatever
encoding the editor was generating (ISO 8859-1 seems very
common). And of course, a correct program will handle them
correctly.

Now, you may say that all, or almost all of the files you have
to deal with actually only contain characters in the subset
common to ASCII, the ISO 8859 encodings and UTF-8. That may be
(although it's not the case where I work, and hasn't been for
well over 10 years). But I insist that that is not an
appropriate way of thinking about it. Those files were created
by an editor, or some other program, which is perfectly capable
of creating characters which are not in ASCII. And considering
them "pure" ASCII will lead to carelessness in programming, and
an increased risk of errors.

In that sense, ASCII files simply do not exist. There is no way
you can open a file, and say, this file is pure ASCII, and
cannot possibly contain anything else. I also suspect that it
is exceedingly rare that you can open a text file saying: this
file should be pure ASCII, and anything else means it is
corrupt. There are doubtlessly exceptions to this, particularly
with regards to machine generated data. But most of the
exceptions I know go even further: if the file contains, say, a
list of floating point values, then it is corrupt if it contains
any alpha character, not just if it contains an accented
character.
[...]
...
Life with Unicode is much easier. Surprisingly little code really needs
to care that it is parsing utf-8.
Are you kidding? What about code which uses e.g. "isalpha()".
Ok, you need to think a little harder at what you're trying to do.
In general. Once you can no longer count on just ASCII, you do
have problems. Regardless of the encoding. On the whole, I
think UTF-8 is the only viable solution for communications, and
it is also the preferred solution for internal coding for a lot
of applications. Other applications will prefer UTF-32. And a
number of applications will still make do with some pure 8 bit
encoding, ISO 8859-1, or such.
At one point I was asked to give a recommendation on
internationalizing an application. It was a web browser. My default
answer was "wide chars", etc. etc. I examined the code and realized I'd
given the project a death sentence because there was no way the
project would recover so I went back to the team and said - JUST
KIDDING. What you need is utf8 with one of these special string
classes that converts a string transparently between utf-8 and utf-16
whenever it needs to and slowly move more of the application over to
wide char code. The code was migrated when it needed to and much of
the application didn't need touching.
The main point of this was that the codebase never broke uncontainably
and its i18n support improved incrementally until it was adequate
without needing to interrupt development of other parts of the
product.

I presume you're talking about internal representation here. A
Web browser certainly has to deal with a large number of
different external encodings. If I control the entire chain,
there's no doubt that everything would be Unicode, UTF-8
externally, and either UTF-8 or UTF-32 internally, depending on
what I was doing. But I never do control the entire chain: here
at work, the powers that be haven't installed any Unicode fonts
on the machines, so I'm stuck with ISO 8859-1:-(.
Yeah, I recall having the same thought now. You should find this one
amusing:

Time for a new ICU.

:). To be fair to them: when they defined their spec, Unicode
was only 16 bits. Also, any program really treating text
seriously will have to deal with various composite characters
anyway, and handling the surrogates isn't that much more work.

On the other hand, the more I work with such characters, the
more I realize how much you can do directly in UTF-8. Multibyte
characters have a reputation for causing all sorts of problems,
but UTF-8 has addressed some of the issues (and of course, a lot
of the problems are just because the code isn't prepared for
multibyte characters). Once you're handling surrogates and
composite characters, is UTF-8 really any more difficult than
UTF-32?
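
As one small illustration of working directly in UTF-8, here is a hedged C++ sketch that counts code points without decoding anything. It assumes well-formed input, and count_code_points is a made-up name. It relies on the fact that every continuation byte has the bit pattern 10xxxxxx, so any other byte starts a new code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-8 by
// counting the bytes that are not continuation bytes (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (char ch : utf8) {
        if ((static_cast<unsigned char>(ch) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Note that this counts code points, not what a user sees as characters: a base letter followed by a combining accent is still two code points, in UTF-8 and UTF-32 alike.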
 
