Is this String class properly implemented?


Alf P. Steinbach

* Jerry Coffin:
Have you looked at both specifications to find out? Have you even looked
at one of them?


It would be nice to know exactly what convinces you that it's
disinformation, and particularly whether you have any authoritative
source for the claim. Wikipedia certainly doesn't qualify, and as much
respect as I have to James, I don't think he does either. It would
appear to me that the only authoritative sources on the subject are the
standards themselves -- and your statement leads me to doubt that you've
consulted them in this case.

You're reversing the burden of evidence.

You made an outrageous claim, which if it were true would make ISO 8859-1 a very
impractical standard; now please submit your evidence that you think is in favor
of that claim.


Cheers & hth.,

- Alf
 

Jerry Coffin

[email protected] says...
You're reversing the burden of evidence.

You made an outrageous claim, which if it were true would make ISO 8859-1 a very
impractical standard; now please submit your evidence that you think is in favor
of that claim.

I thought I'd made it clear, but the evidence is the standards
themselves. If, by "submit" you mean posting them here, I obviously
can't do that -- they're all copyrighted, as I'm sure you're already
well aware.

As for rendering anything impractical, I don't think it does anything of
the sort. Quite the contrary, there's not likely to be any practical
effect at all -- what you get is pretty much the same regardless of what
name the standard chooses to give it.

Ultimately, this isn't particularly different from the '.' character --
we use it both as a period (full stop/end of sentence marker) and a
decimal point. Whether some particular document calls it a "decimal
point" or "period" or "full stop" makes little real difference to how
people actually put it to use. A standard that chose one name over the
other might reflect the cultural background of its designers, but
wouldn't be particularly likely to render that standard any more or less
practical.
 

Alf P. Steinbach

* Jerry Coffin:
I thought I'd made it clear, but the evidence is the standards
themselves. If, by "submit" you mean posting them here, I obviously
can't do that -- they're all copyrighted, as I'm sure you're already
well aware.

I'm sorry on your behalf, but quoting a limited part of a standard is fair use,
so there's nothing stopping you from that.

In passing, note that the error in your reasoning started with the "obviously";
that little code-word often signals an error of reasoning.

And in case you doubt that quoting is fair use, note that in this group we often
quote from the C++ standard -- perhaps you have done so yourself, earlier?

As for rendering anything impractical, I don't think it does anything of
the sort. Quite the contrary, there's not likely to be any practical
effect at all -- what you get is pretty much the same regardless of what
name the standard chooses to give it.

Ultimately, this isn't particularly different from the '.' character --
we use it both as a period (full stop/end of sentence marker) and a
decimal point. Whether some particular document calls it a "decimal
point" or "period" or "full stop" makes little real difference to how
people actually put it to use. A standard that chose one name over the
other might reflect the cultural background of its designers, but
wouldn't be particularly likely to render that standard any more or less
practical.

Assuming for the sake of argument that the two standards use different terms to
describe character 96, since it seems you're reluctant to offer any evidence,
almost as if the implication that you have these standards wasn't true.

Is your point that the two standards use different terms for the same thing?

In that case either your argument earlier in the thread was misleading, or your
current argument is misleading.

Or is your point that the two standards use different terms with the intention
to denote two different things?

In that case you have misunderstood the standards.


Cheers & hth.,

- Alf
 

Jerry Coffin

[email protected] says...
I'm sorry on your behalf, but quoting a limited part of a standard is fair use,
so there's nothing stopping you from that.

I've already quoted the relevant parts. Each has a table of numbers and
the character associated with each number. In the ASCII table, it's
listed as a backward quote. In the ISO 8859 table, it's listed as a
grave accent.
In passing, note that the error in your reasoning started with the "obviously";
that little code-word often signals an error of reasoning.

There was no error in reasoning.
And in case you doubt that quoting is fair use, note that in this group we often
quote from the C++ standard -- perhaps you have done so yourself, earlier?

I have no problem with fair use, or quoting relevant portions. In this
case, there's no other explanatory text, so I've already quoted
everything I can find that's relevant.

[ ... ]
Assuming for the sake of argument that the two standards use different terms to
describe character 96, since it seems you're reluctant to offer any evidence,
almost as if the implication that you have these standards wasn't true.

I'm not sure what further evidence would be relevant -- I've already
quoted what each says on the subject. Neither appears to have anything
beyond the single-word description of that particular character.
Is your point that the two standards use different terms for the same thing?

In that case either your argument earlier in the thread was misleading, or your
current argument is misleading.

Or is your point that the two standards use different terms with the intention
to denote two different things?

In that case you have misunderstood the standards.

"I'm sorry, but as far as I know, that's BS."

It seems quite incredible for you to claim certainty about the intent of
the standard, especially one that you've apparently never even seen.

I don't claim clairvoyance, so I can only go by what's in the standards
themselves. The text is different, and not in a way I can reasonably
attribute to a typo or anything like that. This seems to support the
belief that there was a real intent to change the meaning to at least
some degree.

If you're honestly interested in the question of what constitutes a
difference between characters at the level of abstraction used in an
encoding standard, I'd advise googling for Han Unification. Early on,
Unicode used Han Unification to reduce the number of code points
necessary for the Chinese, Japanese and Korean scripts. Considerable
controversy resulted, all based around the question of where to draw the
line between characters that were the same or different.
 

Alf P. Steinbach

* Jerry Coffin:
I've already quoted the relevant parts.

You have as yet not quoted anything, at least not to me.

I don't care what you quoted some years ago to someone else in another venue.

The original final ASCII standard from 1967 is no longer available so I'm
surprised you claim to have it.

As an alternative you might take a look at

Each has a table of numbers and
the character associated with each number. In the ASCII table, it's
listed as a backward quote. In the ISO 8859 table, it's listed as a
grave accent.

According to the source referenced above, in original ASCII it's an "apostrophe,
or close quotation" when used as punctuation, and an "acute accent" when used as
a diacritical mark. Original ASCII represented diacritical marks by backspacing,
i.e. the visual effect of char + BS + mark on a printer. This convention did not
survive, however, and in later usage as well as in Latin-1 it's punctuation.
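(A minimal sketch of that old convention, purely my own illustration -- on an overstriking printer the letter, backspace and mark all land on one print position; a modern terminal will just show them awkwardly:)

    #include <cstdio>

    int main()
    {
        // Old-style ASCII diacritics: letter, backspace, mark.  An overstriking
        // printer renders this as an accented e; most modern terminals simply
        // back the cursor up and overwrite, so you may only see the backquote.
        std::printf("e\b`\n");
        return 0;
    }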

You might also wish to consult the Unicode standard's reference document on the
ASCII subset of Unicode.

And in that connection note that Unicode contains Latin-1 (ISO 8859-1) as a
subset, overlapping with ASCII, that is, the same code points...

There was no error in reasoning.

There certainly was, and still is.

I have no problem with fair use, or quoting relevant portions. In this
case, there's no other explanatory text, so I've already quoted
everything I can find that's relevant.

The ASCII standard has/had explanatory text.

It sounds to me like you're referring to just some code chart that someone
labeled "ASCII".

[ ... ]
Assuming for the sake of argument that the two standards use different terms to
describe character 96, since it seems you're reluctant to offer any evidence,
almost as if the implication that you have these standards wasn't true.

I'm not sure what further evidence would be relevant -- I've already
quoted what each says on the subject. Neither appears to have anything
beyond the single-word description of that particular character.

It seems you don't have the standards.

They have much more than code charts.

"I'm sorry, but as far as I know, that's BS."

It seems quite incredible for you to claim certainty about the intent of
the standard, especially one that you've apparently never even seen.

Don't hurl inane accusations on top of obstinate wrong-headedness and a
ridiculous claim.

You have misunderstood whatever material you have, and you haven't understood
that Latin-1 is a direct extension of ASCII (sans control characters), and that
Unicode is a direct extension of Latin-1 -- which is what you need to grasp.


Cheers & hth.,

- Alf
 

Tony

Richard said:
Tony said:
Jerry Coffin wrote:
[...]
It's a bit hard to say much about ASCII per se -- the standard has
been obsolete for a long time. Even the organization that formed it
doesn't exist any more.

Oh? Is that why such care was taken with the Unicode spec to make
sure that it mapped nicely onto ASCII?

Or ISO-8859?

[...]
Fine, upper and lower case then. But no umlauts or accent marks!

How naïve. My _English_ dictionary includes déjà vu, gâteau and many
other words with diacritics.

And how many variable names do you create with those foreign glyphs? Hmm?
Then one still needs some diacritics. The ISO-8859 family has them;
ASCII doesn't.

The issue here is not Webster's Dictionary.
 

Tony

James said:
There is a huge volume of programs that can and do use no text.
However, I don't know of any program today that uses text in
ASCII;

You must be thinking of shrink-wrap-type user-interactive programs rather
than in-house development tools, for example.
text is used to communicate with human beings, and ASCII
isn't sufficient for that.

Millions of posts on USENET seem to contradict that statement.
Except that the examples are false. C/C++/Java and Ada require
Unicode.

To be general they do. One could easily eliminate that requirement and still
get much work done. I'm "arguing" not against Unicode, but that the ASCII
subset, in and of itself, is useful.
Practically everything on the network is UTF-8.
Basically, except for some historical tools, ASCII is dead.

Nah, it's alive and well, even if you choose to call it a subset of
something else. Parse all of the non-binary group posts and see how many
non-ASCII characters come up (besides your tagline!).
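(A quick-and-dirty sketch of that check -- pipe an article body through something like this; it's only an illustration, not a real news client:)

    #include <iostream>

    int main()
    {
        // Count input bytes that fall outside the 7-bit range.
        unsigned long total = 0, non_ascii = 0;
        char c;
        while (std::cin.get(c)) {
            ++total;
            if (static_cast<unsigned char>(c) > 0x7F)
                ++non_ascii;
        }
        std::cout << non_ascii << " of " << total << " bytes are non-ASCII\n";
        return 0;
    }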
As long as you're the only person using your code, you can do
what you want.

person, or company, or group, or alliance all work. Standards were meant to
be... ignored (there's always a better way)! ;)
I understand the rationale.



First, there is no such thing as an ASCII text file.

Then what is a file that contains only ASCII printable characters (throw in
LF and HT for good measure)?
For that
matter, under Unix, there is no such thing as a text file. A
file is a sequence of bytes.

And if the file is opened in text mode?
How those bytes are interpreted
depends on the application.

So the distinction between text and binary mode is .... ?
Internally, the program is still working with ASCII strings,
assuming English is the language (PURE English that recognizes
only 26 letters, that is).

Pure English has [...]

_I_ was giving the definition of "Pure English" in the context (like a
glossary). How many letters are there in the English alphabet? How many?
Surely I wasn't taught umlauts in gradeschool. You are arguing semantics and
I'm arguing practicality: if I can make a simplifying assumption, I'm gonna
do it (and eval that assumption given the task at hand)!
accented characters in some words (at least
according to Merriam Webster, for American English). Pure
English distinguishes between opening and closing quotes, both
single and double. Real English distinguishes between a hyphen,
an en dash and an em dash.

But that's all irrelevant, because in the end, you're writing
bytes, and you have to establish some sort of agreement between
what you mean by them, and what the programs reading the data
mean. (*If* we could get by with only the characters in
traditional ASCII, it would be nice, because for historical
reasons, most of the other encodings encountered encode those
characters identically. Realistically, however, any program
dealing with text has to support more, or nobody will use it.)



Where did you get that bullshit?

This week's trade rags (it's still around here, so if you want the exact
reference, just ask me). It makes sense too: Apple moved off of PowerPC also
probably to avoid doom. I'm a Wintel developer exclusively right now also,
so it makes double sense to me.
Sun does sell x86 processors
(using the AMD chip). And IBM and HP are quite successful with
their lines of non-x86 processors. (IMHO, where Sun went wrong
was in abandoning its traditional hardware market, and moving
into software adventures like Java.)

Topic for another thread for sure (those kinds of threads are fun, but don't
result in anything useful). What you said parenthetically above, I kinda
agree with: Open Solaris looked like a winner to me until they made it
subservient to Java (a platform to push Java). Dumb Sun move #2. (But I only
track these things lightly on the surface).
I'm not referencing any application domain in particular.

Apparently you referenced OSes a few times.
Practically all of the Unix applications I know take the
encoding from the environment; those that don't use UTF-8 (the
more recent ones, anyway). All of the Windows applications I
know use UTF-16LE.

Do you think anyone would use MS Office or Open Office if they
only supported ASCII?

I was talking about a simpler class of programs and libraries even: say, a
program's options file and the ini-file parser (designated subset of 7-bit
ASCII).

Apparently there is a semantic gap in our "debate". I'm not sure where it
is, but I think it may be in that you are talking about what goes on behind
the scenes in an OS, for example, and I'm just using the simple ini-file
parser using some concoction called ASCIIString as the workhorse.
Yes. That's where I live and work. In the real world. I
produce programs that other people use. (In practice, my
programs don't usually deal with text, except maybe to pass it
through, so I'm not confronted with the problem that often. But
often enough to be aware of it.)

You opportunistically took that out of context. I was alluding toward the
difference between the problem domain (the real world) and the solution
domain (technology).
Not really.

Well you snipped off the context so I don't know how I meant that.
Programs assign semantics to those ones and zeros.
Even at the hardware level---a float and an int may contain the
same number of bits, but the code uses different instructions
with them. Programs interpret the data.

Which brings us back to my point above---you don't generally
control how other programs are going to interpret the data you
write.

If you say so. But if I specify that ini-files for my program may
contain only the designated subset of 7-bit ASCII, and someone puts an
invalid character in there, expect a nasty error box popping up.
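(Something along these lines -- just a sketch; the helper name and the exact accepted subset are made up for illustration:)

    #include <string>

    // Accept only printable 7-bit characters plus tab; anything else is what
    // triggers the "nasty error box".  The exact subset is whatever the
    // ini-file format designates.
    bool is_designated_ascii(const std::string& line)
    {
        for (std::string::size_type i = 0; i < line.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(line[i]);
            if (c > 0x7E || (c < 0x20 && c != '\t'))
                return false;
        }
        return true;
    }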
Sorry, I don't know what you're talking about.

Nevermind. It just seemed like you were arguing both sides of the point in
the two threads combined.
I've already had to deal with C with the symbols in Kanji.

So use it once and then jettison all simpler things? The C/C++ APIs are
overly-general (IMO) that's why I don't use them unless the situation
warrants it. Generality makes complexity. Every developer should know how to
implement a linked list, for example. Every developer should have a number
of linked lists he uses, as having only one design paradigm ensures every
program/project is a compromise. IMO. YMMV.
That
would have been toward the end of the 1980s. And I haven't seen
a program in the last ten years which didn't use symbols and
have comments in either French or German.

But you're in/from France right? Us pesky "americans" huh. ;)
Fine. If you write a compiler, and you're the only person to
use it, you can do whatever you want. But there's no sense in
talking about it here, since it has no relevance in the real
world.

You're posting in extremism to promote generalism? Good engineering includes
exploiting simplifying assumptions (and avoiding the hype, on the flip
side). (You'd really put non-ASCII characters in source code comments?
Bizarre.)

Most programs don't need to be international. Data and development tools are
not the same.
No it's not.

Well it would be for me! So yes it is!
(Actually, the most difficult language to program
in is English,

Not for me! Context matters! (I was the context, along with many other
developers here).
It's one of my primarly languages as well. Not the only one,
obviously, but one of them.

"primarly" (hehe ;) ). "A set of primary languages?". One primary or none
probably. (None is as good as one, I'm not dissing... I only know two
languages and a third ever so lightly for "I took it in HS").
That has nothing to do with the operating system. Read the
language standards.

Ah ha! The golden calf. I had a feeling there was a god amongst us. :/

I'm not "big" on "standards". (Separate thread!).
No. Do you know any of the languages in question? All of them
clearly require support for at least the first BMP of Unicode in
the compiler. You may not use that possibility---a lot of
people don't---but it's a fundamental part of the language.

THAT _IS_ the point (!): if a program (or other) doesn't require it, then it
is just CHAFF. This ever-espoused over-generality and
general-is-good-and-always-better gets very annoying in these NGs. Save the
committee stuff for c.l.c++.moderated or the std group. The chaff is probably
holding back practicality for those who can't distinguish politics.
 

Jerry Coffin

[email protected] says...
You have as yet not quoted anything, at least not to me.

Yes, I did. When I said the ISO standard describes the character as a
grave accent, that was a direct quote from the standard -- it's also
_all_ the standard says about that character.

[ ... ]
The original final ASCII standard from 1967 is no longer available so I'm
surprised you claim to have it.

As it happens, we needed a copy at work a few years ago, so we had a
couple of people working for a week or so to find it. As I recall, the
copy we found was at a university in Australia, from which we got a
Xeroxed copy.

BTW, you seem to have rather a problem with the date there as well --
the _original_ final ASCII standard was in 1963. The 1967 version was a
revision. There was also a 1968 revision, and as I understand it, a 1986
version as well (though I've never seen a copy of the latter). The
changes from the 1967 to '68 standards were quite minimal though.

The ASCII standard only gave extremely minimal descriptions of the
control characters as well. The ISO did publish a separate document
(roughly what would now be called a TR) giving (somewhat) more detailed
description of the control characters -- but not of the printable
characters.
As an alternative you might take a look at



According to the source referenced above, in original ASCII it's an "apostrophe,
or close quotation" when used as punctuation, and an "acute accent" when used as
a diacritical mark.

You're not even looking at the right character. The character under
discussion is a couple of lines down, the 6/0 rather than 2/7. In any
case, this seems to be from somebody else's understanding of ASCII, not
from the standard itself.
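(For anyone not used to the old chart notation: it is column/row, so the code is 16 * column + row. A quick sketch of the arithmetic:)

    #include <cstdio>

    int main()
    {
        // Column/row notation: code = 16 * column + row.
        std::printf("6/0 = %d (0x%02X) '%c'\n", 16 * 6 + 0, 16 * 6 + 0, 16 * 6 + 0);  // backquote / grave accent
        std::printf("2/7 = %d (0x%02X) '%c'\n", 16 * 2 + 7, 16 * 2 + 7, 16 * 2 + 7);  // apostrophe
        return 0;
    }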

[ ... ]
There certainly was, and still is.

You haven't shown any mistake in reasoning yet. In fact, you haven't
even figured out which character is being discussed yet, and you've
shown nothing to indicate that you've looked at the original source
either.

[ ... ]
The ASCII standard has/had explanatory text.

Yes, some -- but not for the character in question.

OTOH (working to get back to something topical), it does contain
explanatory text showing that the use of "new line" in C and C++ really
does come directly from ASCII:

In the definition of LF (page 8):
Where appropriate, this character may have the meaning
"New Line" (NL), a format effector which controls the
movement of the printing point to the first printing
position on the next printing line. Use of this
convention requires agreement between the sender and
recipient of data.

and in Appendix A, section A7.6:

The function "New Line" (NL) was associated with the LF
(rather than with CR or with a separate character) to
provide the most useful combination of functions through
the use of only two character positions, and to allow the
use of a common end-of-line format for both printers
having separate CR-LF functions and those having a
combined (i.e., NL) function. This sequence would be
CR-LF, producing the same result on both classes, and
would be useful during conversion of a system from one
method of operation to the other.

I believe this interpretation of LF was new in the 1968 version of the
standard.
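(Which is also why, on any implementation whose execution character set is ASCII-derived, '\n' is simply the LF code -- a one-liner to check, though the C and C++ standards themselves don't guarantee it, since EBCDIC systems differ:)

    #include <cassert>

    int main()
    {
        // Holds on ASCII-derived execution character sets; an EBCDIC machine differs.
        assert('\n' == 0x0A);
        return 0;
    }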

If you're interested in the history of ASCII and its standardization,
you might want to look at Bob Bemer's web site, at:

http://www.trailing-edge.com/~bobbemer/index.htm
 

Jerry Coffin

Doing a bit of looking, I found a web site that has a bit of interesting
history about some of the characters in ASCII, ISO 8859, Unicode, ISO
10646, and so on.

http://www.cs.tut.fi/~jkorpela/latin1/ascii-hist.html

I certainly can't vouch for everything he says being absolutely
accurate, but everything I've seen in it looks pretty reasonable and he
gives references for nearly everything. Unless essentially _everything_
he says is wrong, he demonstrates the point I was originally trying to
make quite well -- while the newer standards largely _attempt_ to act as
proper supersets of ASCII, there are enough variations between early
character sets (e.g. between US-ASCII and ISO 646) that this isn't
always entirely possible.

Interestingly, he has a rather lengthy piece about the character I
mentioned (opening single quote mark / grave accent). This seems to show
exactly HOW things got the way they are. It was originally proposed in
the ISO committee as a grave accent. The US committee then overloaded it
to be an opening single quote. My guess is that ISO 8859 was written
primarily (if not exclusively) as a superset of ISO 646, so they simply
ignored the American aberration of calling it an opening single quote.

A number of other characters (including the right single
quote/apostrophe to which Alf referred) have slightly differing
definitions between different standards as well. The first (1963) ASCII
standard referred to it purely as an apostrophe. Later versions added
the notations of Closing Single Quotation mark and Acute Accent. The
current versions of ISO 646, 8859 and 10646 have gone back to the
original and refer to it only as an apostrophe.

The available evidence suggests that Alf's accusations were and are
unfounded -- the definitions associated with a number of code points
have varied between standards, but this has neither led to any
significant incompatibility nor rendered any of the standards
particularly impractical. At the same time, even in the simplest of
plain text, using only 7-bit characters, there are variations in the
interpretations of a few code points.
 

Alf P. Steinbach

* Jerry Coffin:
Yes, I did. When I said the ISO standard describes the character as a
grave accent, that was a direct quote from the standard

A quote is indicated by quoting.

Descriptions about something are not quotes.

You did not quote and you said you quoted.

-- it's also
_all_ the standard says about that character.

I doubt it.

Anyways, you're wrong and really don't know what you're talking about.

ASCII (sans control chars) is a proper subset of Latin-1, with the same code
points. There's no difference. You snipped my suggestion that you look up the
Unicode standard's separate document on its ASCII subset, but what the heck, I
just suggest it again.


[ ... ]
The original final ASCII standard from 1967 is no longer available so I'm
surprised you claim to have it.

As it happens, we needed a copy at work a few years ago, so we had a
couple of people working for a week or so to find it. As I recall, the
copy we found was at a university in Australia, from which we got a
Xeroxed copy.

:)

Who do you think you're kidding?

BTW, you seem to have rather a problem with the date there as well --
the _original_ final ASCII standard was in 1963. The 1967 version was a
revision. There was also a 1968 revision, and as I understand it, a 1986
version as well (though I've never seen a copy of the latter). The
changes from the 1967 to '68 standards were quite minimal though.

If you're referring to the 1963 standard, it /lacked lowercase letters/.

Are you *really* suggesting that was the final standard?

ROTFL. :) :) :)


Bye, for this topic at least, :)

- ALf
 

James Kanze

* Jerry Coffin:
[ ... ]
I'm sorry but as far as I know that's BS. :)
Have you looked at both specifications to find out? Have you
even looked at one of them?
Would be nice to know where you picked up that piece of
disinformation, though.
It would be nice to know exactly what convinces you that
it's disinformation, and particularly whether you have any
authoritative source for the claim. Wikipedia certainly
doesn't qualify, and as much respect as I have to James, I
don't think he does either. It would appear to me that the
only authoritative sources on the subject are the standards
themselves -- and your statement leads me to doubt that
you've consulted them in this case.

Obviously. I'm not the author of the standard, and I don't
actually have access to the text of any of them except Unicode.

From experience, I can say that all of the implementations of
the Unix shells I know (from Bourne on through ksh and bash)
treat the character encoded 96 in the same way, regardless of
the encoding used. (The original Bourne shell used ASCII---and
added internal information on the eighth bit. The others use
the encoding specified by the LC_CTYPE environment variable,
which may be any of the ISO 8859 encodings, or UTF-8.)
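(In C++ the same mechanism is reachable through the "" locale, which is built from the user's environment -- a minimal sketch, and note the constructor can throw if the environment names an unsupported locale:)

    #include <iostream>
    #include <locale>

    int main()
    {
        // std::locale("") is built from the user's environment (LANG, LC_CTYPE
        // and friends on Unix), which is where the encoding choice comes from.
        std::locale user("");
        std::cout << "environment locale: " << user.name() << '\n';
        return 0;
    }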

Perhaps a more accurate specification of my claim is that source
files written in ASCII could still be read by programs using one
of the ISO 8859 encodings or UTF-8. At least under Unix. As
for the "goals" of the various standards, I do think that they
were more along the lines of interoperability, rather than
exact identity.
You're reversing the burden of evidence.
You made an outrageous claim, which if it were true would make
ISO 8859-1 a very impractical standard; now please submit your
evidence that you think is in favor of that claim.

I think that his evidence is clear: the official standards of
each encoding. (At least, that seems to me to be what he is
implying.) A look at the on-line version of ISO 8859-1
confirms what he has said about that. I can't find the ASCII
standard on line, but I've spoken with Jerry personally, and
from what I know of his work, it seems reasonable to assume that
he actually does have access to the standard (which I don't), so
I'll take him on his word for it (unless someone else can post
an actual quote from the standard, contradicting what he's
said).
 

James Kanze

You must be thinking of shrink-wrap-type user-interactive
programs rather than in-house development tools, for example.

No. None of the in house programs I've seen use ASCII, either.
Millions of posts on USENET seem to contradict that statement.

In what way? The USENET doesn't require, or even encourage
ASCII. My postings are in either ISO 8859-1 or UTF-8, depending
on the machine I'm posting from. I couldn't post them in ASCII,
because they always contain accented characters.
To be general they do. One could easily eliminate that
requirement and still get much work done. I'm "arguing" not
against Unicode, but that the ASCII subset, in and of itself,
is useful.

It's certainly useful, in certain limited contexts. Until
you've seen a BOM or an encoding specification, for example, in
XML. (Although technically, it's not ASCII, but the common
subset of UTF-8 and the ISO 8859 encodings.)
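(Checking for the UTF-8 BOM is about this simple -- a sketch only, not a real XML prolog parser:)

    #include <fstream>
    #include <string>

    // True if the file starts with the UTF-8 byte-order mark EF BB BF.
    bool has_utf8_bom(const std::string& path)
    {
        std::ifstream in(path.c_str(), std::ios::binary);
        unsigned char bom[3] = { 0, 0, 0 };
        in.read(reinterpret_cast<char*>(bom), 3);
        return in.gcount() == 3
            && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
    }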
Nah, it's alive and well, even if you choose to call it a
subset of something else. Parse all of the non-binary group
posts and see how many non-ASCII characters come up (besides
your tagline!).

Just about every posting, in some groups I participate in.
Then what is a file that contains only ASCII printable
characters (throw in LF and HT for good measure)?

A file that doesn't exist on any of the machines I have access
to.

At the lowest level, a file is just a sequence of bytes (under
Unix or Windows, at least). At that level, text files don't
exist. It's up to the programs reading or writing the file to
interpret those bytes. And none of the programs I use interpret
them as ASCII.
And if the file is opened in text mode?

It depends on the imbued locale. (Text mode or not.)
So the distinction between text and binary mode is .... ?

Arbitrary. It depends on the system. Under Unix, there isn't
any. Under Windows, it's just the representation of '\n' in the
file. Under other OS's, it's usually a different file type in
the OS (and a file written in text mode can't be opened in
binary, and vice versa).
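(Concretely -- a sketch; on Windows the only observable difference is the CR inserted before each LF in the first file:)

    #include <fstream>

    int main()
    {
        std::ofstream text("demo_text.txt");                       // text mode
        text << "one line\n";    // Windows stores 0D 0A, Unix stores 0A

        std::ofstream binary("demo_binary.txt", std::ios::binary); // binary mode
        binary << "one line\n";  // 0A everywhere, no translation
        return 0;
    }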
Internally, the program is still working with ASCII strings,
assuming English is the language (PURE English that recognizes
only 26 letters, that is).
Pure English has [...]
_I_ was giving the definition of "Pure English" in the context
(like a glossary). How many letters are there in the English
alphabet? How many?

The same as in French, German or Italian: 26. However, in all
four of these languages, you have cases where you need accents,
which are made by adding something to the representation of the
letter (and require a distinct encoding for the computer)---in
German, there is even a special case of ß, which can't be made
by just adding an accent (but which still isn't a letter).
Surely I wasn't taught umlauts in gradeschool.

I was taught to spell naïve correctly (although I don't know if
it was grade school or high school).
You are arguing semantics and I'm arguing practicality: if I
can make a simplifying assumption, I'm gonna do it (and eval
that assumption given the task at hand)!

[...]
This week's trade rags (it's still around here, so if you want
the exact reference, just ask me). It makes sense too: Apple
moved off of PowerPC also probably to avoid doom. I'm a Wintel
developer exclusively right now also, so it makes double sense
to me.

Whatever? The fact remains that 1) Sun does produce processors
with Intel architecture---the choice is up to the customer, and
2) Sun and Apple address entirely different markets, so a
comparison isn't relevant. (The ability to run MS Office on a
desktop machine can be a killer criterion. The ability to run
it on a server is totally irrelevant.)

[...]
I was talking about a simpler class of programs and libraries
even: say, a program's options file and the ini-file parser
(designated subset of 7-bit ASCII).
Apparently there is a semantic gap in our "debate". I'm not
sure where it is, but I think it may be in that you are
talking about what goes on behind the scenes in an OS, for
example, and I'm just using the simple ini-file parser using
some concoction called ASCIIString as the workhorse.

All of the ini-files I've seen do allow accented characters.
If you say so. But if I specify that ini-files for my
program may contain only the designated subset of 7-bit ASCII,
and someone puts an invalid character in there, expect a nasty
error box popping up.

As long as you're the only user of your programs, that's fine.
Once you have other users, you have to take their desires into
consideration.
But you're in/from France right? Us pesky "americans" huh. ;)

Sort of:). My mother was American, and I was born and raised in
the United States. My father was German, my wife's Italian, and
I currently live in France (but I've also lived a lot in
Germany). And yes, I do use four languages on an almost daily
basis, so I'm somewhat sensitized to the issue. But I find
that even when working in an English language context, I need
more than just ASCII. And I find that regardless of what I
need, the machines I use don't even offer ASCII as a choice.
You're posting in extremism to promote generalism? Good
engineering includes exploiting simplifying assumptions (and
avoiding the hype, on the flip side). (You'd really put
non-ASCII characters in source code comments? Bizarre.)

I have to, because my comments where I work now have to be in
French, and French without accents is incomprehensible. The
need is less frequent in English, but it does occur.
 

Richard Herring

Tony said:
Richard said:
Tony said:
Jerry Coffin wrote:
[...]

It's a bit hard to say much about ASCII per se -- the standard has
been obsolete for a long time. Even the organization that formed it
doesn't exist any more.

Oh? Is that why such care was taken with the Unicode spec to make
sure that it mapped nicely onto ASCII?

Or ISO-8859?

[...]
The English alphabet has 26 characters. No more, no less.

Unfortunately statements like this weaken your point. By any
reasonable measure, the English alphabet contains at least 26
characters (upper and lower case).

Fine, upper and lower case then. But no umlauts or accent marks!

How naïve. My _English_ dictionary includes déjà vu, gâteau and many
other words with diacritics.

And how many variable names do you create with those foreign glyphs? Hmm?

Who cares? I'm merely providing a counterexample to your sweeping claim
that the English alphabet has exactly 26 characters. Or even 52.
 

Tony

James said:
Whatever? The fact remains that 1) Sun does produce processors
with Intel architecture---the choice is up to the customer,

Not enough emphasis on x86, and too late (2003) to the party, or so it is being
said. What Sun was doing at the time of the Oracle buyout is irrelevant.
What is relevant is the history of the company and the strategic decisions
that were (or weren't!) made, for they are what led to the company's
instability.
and
2) Sun and Apple address entirely different markets, so a
comparison isn't relevant.

No one was comparing Sun and Apple: I was "hinting" at the fact that x86 is
still growing in its ubiquitousness. It is suggested by analysts, as I
originally noted, that Sun's decision making regarding x86 vs. its own
Sparc was a major strategic mistake.

I'm just regurgitating what the industry analysts are saying; I find it
interesting to read/study product and company lifecycles and strategies.
 

Tony

Richard said:
Tony said:
Richard said:
In message <[email protected]>, Tony writes:
Jerry Coffin wrote:

[...]

It's a bit hard to say much about ASCII per se -- the standard has
been obsolete for a long time. Even the organization that formed
it doesn't exist any more.

Oh? Is that why such care was taken with the Unicode spec to make
sure that it mapped nicely onto ASCII?

Or ISO-8859?

[...]


The English alphabet has 26 characters. No more, no less.

Unfortunately statements like this weaken your point. By any
reasonable measure, the English alphabet contains at least 26
characters (upper and lower case).

Fine, upper and lower case then. But no umlauts or accent marks!

How naïve. My _English_ dictionary includes déjà vu, gâteau and many
other words with diacritics.

And how many variable names do you create with those foreign glyphs?
Hmm?

Who cares? I'm merely providing a counterexample to your sweeping
claim that the English alphabet has exactly 26 characters. Or even 52.

I meant letters, not characters. It should be obvious from the CONTEXT ("eye
on the ball" people!) that was what I meant. Perhaps you are trying
opportunistically to imply something different.
 

Tony

Alf said:
ASCII (sans control chars) is a proper subset of Latin-1,

Since ASCII preceded "Latin-1" (ISO 8859-1), it would be more correct to say
that "Latin-1" is a superset of ASCII. ASCII is the basis of modern
character encodings.
 

Tony

Jerry said:
BTW, you seem to have rather a problem with the date there as well --
the _original_ final ASCII standard was in 1963. The 1967 version was
a revision. There was also a 1968 revision, and as I understand it, a
1986 version as well (though I've never seen a copy of the latter).
The changes from the 1967 to '68 standards were quite minimal though.

1986? Really? What happened in 1986? I thought the ASCII timeline stopped at
1983.
 

Tony

James said:
No. None of the in house programs I've seen use ASCII, either.



In what way? The USENET doesn't require, or even encourage
ASCII.

But the underlying protocol is NNTP, and while I don't know for sure, I have
an inkling that it is still a 7-bit protocol (?). But that wasn't my point.
I was suggesting that most USENET posts in threaded discussion groups are
ASCII (by nature of the characters in use by the posts).
My postings are in either ISO 8859-1 or UTF-8, depending
on the machine I'm posting from.

You can call it what you want, but if it contains only ASCII characters,
then I consider it an ASCII post.
I couldn't post them in ASCII,
because they always contain accented characters.

And that's your prerogative. It's not English though and it introduces
complexity where it is not necessary. Claiming that unnaturalized words are
rationale for "Unicode everywhere" is ludicrous (for lack of a better word
that escapes my mind right now).
It's certainly useful, in certain limited contexts.

"limited" is contextual. If a product has "only" 1% market share but has
billions of dollars in sales, is it irrelevant?
Until
you've seen a BOM or an encoding specification, for example, in
XML. (Although technically, it's not ASCII, but the common
subset of UTF-8 and the ISO 8859 encodings.)

Use the appropriate tool for the job. No more, no less. (That concept seems
to escape language library committees).
Just about every posting, in some groups I participate in.

You mean the header encoding or transformation encoding field? Parse just
the message, not the header designations. One could understand "some groups"
in your context: you work for a "foreign" (English is the second language)
company or something right? Well duh, then.
A file that doesn't exist on any of the machines I have access
to.

Bah. Enough of your banter/babbling on this. It's a waste of my time.
At the lowest level, a file is just a sequence of bytes (under
Unix or Windows, at least).
So?

At that level, text files don't
exist.
So?

It's up to the programs reading or writing the file to
interpret those bytes.

Yes (So?).
And none of the programs I use interpret
them as ASCII.

So?

(Is there a point you have in all that?? Oh, that though the files may
contain only 7-bit ASCII characters, there is some relevance in the
supersetting UTF-16/UTF-8 being used by the OS? That's NO point! I can use a
Caterpillar belt-driven tractor with a 3406 diesel in it to work my 10 acre
farm (or buy one to do so), but surely I'd be labeled "eccentric" or worse).
It depends on the imbued locale. (Text mode or not.)

My point was made just above. No need to drag locales into the discussion.
(My "locale" speaks English as the only language (which has only 26 letters,
BTW)).
Arbitrary. It depends on the system. Under Unix, there isn't
any. Under Windows, it's just the representation of '\n' in the
file. Under other OS's, it's usually a different file type in
the OS (and a file written in text mode can't be opened in
binary, and vice versa).

It doesn't matter. "text file" is a valid concept.
Internally, the program is still working with ASCII strings,
assuming English is the language (PURE English that recognizes
only 26 letters, that is).
Pure English has [...]
_I_ was giving the definition of "Pure English" in the context
(like a glossary). How many letters are there in the English
alphabet? How many?

The same as in French, German or Italian: 26.
TY.

However, in all
four of these languages, you have cases where you need accents,

Accented words are either still being evaluated for inclusion into English
or are there for disambiguation. I used "Pure English" to mean that which is
made up of only the 26 letters of the English alphabet.
I was taught to spell naïve correctly (although I don't know if
it was grade school or high school).

'naive' has been naturalized into the English language and does not
have/does not require (unless one feels romantic?) an accent. You were
taught French, not English.
All of the ini-files I've seen do allow accented characters.

Again, so? You are suggesting that because you are bilingual or something
that all quest for simple elegance be thrown out the window? What is your
point?! (Certainly it is not engineering practicality).
As long as you're the only user of your programs, that's fine.
Once you have other users, you have to take their desires into
consideration.

Don't get into politics, cuz you suck at it. Life is too short to get bogged
down in Unicode just because a trivial few feel that English should be
bastardized with unnaturalized ideas like 'naive' with a diacritic. That's
just naive (actually, just crappy engineering, IMO, but I couldn't resist
the "punch line").

Don't even go there: I'm NON-political and here for engineering pursuit (for
the most part).
My mother was American, and I was born and raised in
the United States. My father was German, my wife's Italian, and
I currently live in France (but I've also lived a lot in
Germany).

And this is relevant why???
And yes, I do use four languages on an almost daily
basis, so I'm somewhat sensitized to the issue.

There is no issue: I am not developing international programs (or at least
not targeting any user other than those who can use English). Most programs
do not need internationalization. Overkill is overkill. "Cry me a f'n
river".
But I find
that even when working in an English language context, I need
more than just ASCII.

Sometimes. Program option "inifiles" though? Apparently I've just suggested
to you a simplifying assumption that may indeed simplify your projects and
help you escape the narrowness of technology to some degree.
And I find that regardless of what I
need, the machines I use don't even offer ASCII as a choice.

I don't know what you mean. I think they all do.
I have to, because my comments where I work now have to be in
French, and French without accents is incomprehensible. The
need is less frequent in English, but it does occur.

Simplify your life: use English (for SW dev at least)!
 

James Kanze

But the underlying protocol is NNTP, and while I don't know
for sure, I have an inkling that it is still a 7-bit protocol
(?). But that wasn't my point. I was suggesting that most
USENET posts in threaded discussion groups are ASCII (by
nature of the characters in use by the posts).

And I'm simply pointing out that that is false. Even in this
group, I sometimes have problems with postings, because the
installed fonts on my machines at work only support ISO 8859-1.
(At home, I use UTF-8, and everything works.) Which doesn't
have things like opening and closing quotes.
You can call it what you want, but if it contains only ASCII
characters, then I consider it an ASCII post.

But that's never the case for mine. And I see quite a few
others as well where it's not the case. Even in English
language groups like this one.
And that's your prerogative. It's not English though and it
introduces complexity where it is not necessary.

I'm not sure what you mean by "it's not English". "Naïve" is a
perfectly good English word. And English uses quotes and dashes
(which aren't available even in ISO 8859-1) and other various
symbols like § not available in ASCII in its punctuation. Not
to mention that a lot of groups handle mathematical topics, and
mathematics uses a lot of special symbols.

And of course, not all groups use (only) English.
Claiming that unnaturalized words are rationale for "Unicode
everywhere" is ludicrous (for lack of a better word that
escapes my mind right now).

It has nothing to do with unnaturalized words (and I don't see
where "naïve" is unnaturalized). It has to do with recognizing
reality.
My point was made just above. No need to drag locales into the
discussion. (My "locale" speaks English as the only language
(which has only 26 letters, BTW)).

And what does the number of letters have to do with it? French
also has only 26 letters. You still put accents on some of
them, and you still use punctuation.

[...]
'naive' has been naturalized into the English language and
does not have/does not require (unless one feels romantic?) an
accent. You were taught French, not English.

Merriam-Webster disagrees with you.
Again, so? You are suggesting that because you are bilingual
or something that all quest for simple elegance be thrown out
the window? What is your point?! (Certainly it is not
engineering practicality).

My point is that software should be usable. And adapt to the
people using it, not vice versa. And that even in English, you
need more than simple ASCII. (At least, if you want to use
English correctly.)

[---]
Don't get into politics, cuz you suck at it. Life is too short
to get bogged down in Unicode just because a trivial few feel
that English should be bastardized with unnaturalized ideas
like 'naive' with a diacritic.

Or quotes. Or dashes. Or any number of other things. And that
"trivial few" includes the authors of all of the major
dictionaries I have access to.

If you don't know English well, that's your problem.

[...]
Simplify your life: use English (for SW dev at least)!

If you've ever tried to understand English written by a
non-native speaker, you'll realize that it's much simpler to let
them use French (or German, when I worked there). Communication
is an important part of software engineering, and communication
is vastly improved if people can use their native language.
 
