POSIX enhancements to printf


BartC

Keith Thompson said:
BartC said:
(Why would one particular machine have so many locales in it? Does the OS
switch between 132 different languages too? It seems remarkably
wasteful if so.)
[...]

Wasteful of what? I wouldn't expect each locale to take up much space.

It's not so much the space. It's the approach, taking all this data
describing how different parts of the world do things, and encoding all that
into every single machine on the planet. (And I understand that each locale
would need a different OS setup anyway.)

I mean, you wouldn't install a printer driver for every conceivable printer
there has ever been or that you are likely to come across.

And in the context of C, it seems odd it would go out of its way to get the
decimal point right in two hundred different configurations, but only likes
to acknowledge one kind of newline character per implementation!

(And I still don't get why a low-level language like C needs to get involved
in such local considerations as the kind of thousands separator anyway.
Doing this stuff properly is difficult, and switching on this feature in C
will interfere with the processing of generic text files: you don't want to
start worrying about the kind of decimal point used!)
 

James Kuyper

Keith Thompson said:
[...] ....
(Why would one particular machine have so many locales in it? Does the OS
switch between 132 different languages too? It seems remarkably
wasteful if so.)
[...]

Wasteful of what? I wouldn't expect each locale to take up much space.

It's not so much the space. It's the approach, taking all this data
describing how different parts of the world do things, and encoding all that
into every single machine on the planet. ...

If the space required to do that doesn't matter, why do you care? As
we've seen, different systems have different numbers of locales
installed, they aren't "... all ... installed on every machine on the
planet."

On my desktop system, locale -a gives a list of 735 available locales.
/usr/share/locale/ has 590 subdirectories - I'm not sure of the
significance of the difference between those two numbers. Those
directories take up a total of 443MB, 428 of which are for the message
files. I've no idea why the SAs decided to install that many locales -
it's a single user system, and there's nothing work-related that I'm
ever likely to do that requires anything other than the "C" locale.
... (And I understand that each locale
would need a different OS setup anyway.)

What do you mean by "different OS setup"?
 

Stephen Sprunk

It's not so much the space. It's the approach, taking all this data
describing how different parts of the world do things, and encoding
all that into every single machine on the planet. (And I understand
that each locale would need a different OS setup anyway.)

That's because those machines could be used by users located (or merely
from) anywhere on the planet, and the machine is expected to adapt to
the user. Plus, it's a lot simpler to just include everything than to
make users/admins select which locale(s) to install; it's not like the
tables take up much space.
I mean, you wouldn't install a printer driver for every conceivable
printer there has ever been or are likely to come across.

Doing that would make things a lot simpler for users/admins.
And in the context of C, it seems odd it would go out of its way to
get the decimal point right in two hundred different configurations,
but only likes to acknowledge one kind of newline character per
implementation!

That's because localization is about users, whereas line termination is
about the implementation--and it's not always an embedded character!
(And I still don't get why a low-level language like C needs to get
involved in such local considerations as the kind of thousands
separator anyway.

It's a common thing for programs to need to do, so it makes sense to put
it in the Standard Library.
Doing this stuff properly is difficult,

Not really.
and switching on this feature in C will interfere with the processing
of generic text files: you don't want to start worrying about the kind
of decimal point used!)

Your misunderstanding is due to thinking there is such a thing as a
"generic text file". Wrong. Text files are encoded in a variety of
ways, and _your_ way is no more "generic" than anyone else's--and
programs need to be able to adapt to all of them.

S
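
To make the decimal-point question above concrete, here is a minimal C sketch of how the standard library ties number formatting to the LC_NUMERIC locale. The helper name format_in_locale is hypothetical, and restoring the "C" locale afterwards (rather than whatever was set before) is a simplification. Locale names other than "C" are system dependent and may not be installed.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format `value` with two decimals under the numeric
   locale `name`. Returns 0 on success, -1 if the locale is unavailable. */
int format_in_locale(const char *name, double value, char *out, size_t outsize)
{
    assert(name != NULL && out != NULL && outsize > 0);

    /* setlocale returns NULL when the requested locale cannot be selected. */
    if (setlocale(LC_NUMERIC, name) == NULL)
        return -1;

    /* printf-family functions use the locale's decimal_point character. */
    snprintf(out, outsize, "%.2f", value);

    /* Simplification: go back to "C" instead of the previous setting. */
    setlocale(LC_NUMERIC, "C");
    return 0;
}
```

Calling format_in_locale("C", 12.34, buf, sizeof buf) should leave "12.34" in buf. On a system where a locale such as "de_DE.UTF-8" is installed, the same call with that name would be expected to use a comma instead, and the function returns -1 where it is not installed.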
 

Stephen Sprunk

hi_IN.ISCII-DEV uses two rather than three digits in a group,
sometimes.

The data for 'grouping' is a bit messy but 127; alone means no
grouping. 3;3; indicates thousands grouping (why two threes?)
...
hi_IN.ISCII-DEV: decimal_point '.' thousands_sep ',' grouping 2;3;

This case is probably why there are two numbers; "2;3;" results in
groupings like "12,34,56,789", which apparently reflects how numbers are
spoken in Hindi.

S
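
The "12,34,56,789" grouping can be sketched in C. This assumes the convention of the grouping member of struct lconv, where each byte is a group size counted from the right, the last size repeats, and CHAR_MAX means no further grouping. Under that convention the Hindi-style grouping is written "\3\2". The function name group_digits is hypothetical, and the sketch does not try to reproduce the ordering of the "2;3;" printout quoted above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: insert `sep` into the digit string `digits` using
   lconv-style grouping sizes. Returns 0 on success, -1 if a buffer is too
   small. */
int group_digits(const char *digits, const char *grouping, char sep,
                 char *out, size_t outsize)
{
    assert(digits != NULL && grouping != NULL && out != NULL);

    char tmp[128];                 /* result is built right to left */
    size_t t = sizeof tmp;
    size_t in_group = 0;
    size_t size = (size_t)(unsigned char)*grouping;

    for (size_t i = strlen(digits); i > 0; i--) {
        /* A size of 0 or CHAR_MAX means: add no (more) separators. */
        if (size != 0 && size != CHAR_MAX && in_group == size) {
            if (t == 0) return -1;
            tmp[--t] = sep;
            in_group = 0;
            if (grouping[1] != '\0') {      /* otherwise the last size repeats */
                grouping++;
                size = (size_t)(unsigned char)*grouping;
            }
        }
        if (t == 0) return -1;
        tmp[--t] = digits[i - 1];
        in_group++;
    }

    size_t len = sizeof tmp - t;
    if (len + 1 > outsize) return -1;
    memcpy(out, tmp + t, len);
    out[len] = '\0';
    return 0;
}
```

With digits "123456789", grouping "\3\2" and separator ',', the rightmost group has three digits and every later group has two.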
 

BartC

James Kuyper said:
On 02/28/2014 12:34 PM, BartC wrote:

What do you mean by "different OS setup"?

Everything about the OS and environment that needs to be different for a
different locale.

(I remember when MSDN used to send me, quarterly iirc, dozens of different
Windows 95 installation CDs, for different regions. I'm not saying that's
the way to go, but it suggests the differences were significant.)
 

jacob navia

Le 28/02/2014 19:45, Stephen Sprunk a écrit :
Your misunderstanding is due to thinking there is such a thing as a
"generic text file". Wrong. Text files are encoded in a variety of
ways, and _your_ way is no more "generic" than anyone else's--and
programs need to be able to adapt to all of them.

On the Macintosh it is impossible to write plain text files. Even "vi" is
Unicode aware and will accept any Unicode character!

This means that if a SINGLE Unicode character slips in, the whole file
will be saved in UTF-8.
 

James Kuyper

Everything about the OS and environment that needs to be different for a
different locale.

I believe that's all in the locale files; there's nothing additional
that's needed (well, I just did a quick search and found that some of
the relevant files are stored in /usr/share/i18n/locales, but that was
only an additional 6MB). A key part of what you're talking about is
stored in the LC_MESSAGES subdirectories of each locale directory,
which is why they average nearly a megabyte apiece.

I could be mistaken, since the only locales I've ever had to worry about
were the en_US and zh_TW locales.
(I remember when MSDN used to send me, quarterly iirc, dozens of different
Windows 95 installation CDs, for different regions. I'm not saying that's
the way to go, but it suggests the differences were significant.)

I'm mainly familiar with POSIX systems; it wouldn't surprise me in the
least if Windows took a more complicated approach to the same issue.
 

glen herrmannsfeldt

Kenny McCormack said:
BTW, how does this weird way of doing things affect function calling?
(Note: This was alluded to earlier by another poster).
I.e., if I live in one of those countries, am I not entitled to write:
(snip)

and expect foo to receive the value of twelve and 34 one hundredths?

I have at various times wondered why all programming languages use
English keywords. (Does anyone know any that use other languages,
not counting APL?)

Seems to me that one could write a compiler for another language,
specific to that country, and that would accept compile-time
constants in 12,34 form. That isn't C.

-- glen
 

glen herrmannsfeldt

(snip)
If we allowed full Unicode character sets for source code (identifiers,
comments, string constants), the various ways of denoting numeric constants,
and several styles of punctuation, so that strings can be quoted as:

then reading source code is going to get very interesting (and confusing,
what with the glyph for 'A' for example being represented by a dozen code
points, all distinct to a compiler.)

You mean like Java?

Java allows all Unicode letters to start variable names, and Unicode
letters and digits to continue them. And yes, it allows you to name
a variable capital Alpha, for example.

As well as I remember, only \u0022 for quoting strings.

I did once find a Unicode editor, write a program with lower case
pi as a variable, and give it the appropriate value. The editor would
write them out with \u escapes, which Java processes very early.
(Early enough that you can terminate lines with \u000a.)

-- glen
 

Kaz Kylheku

Le 28/02/2014 19:45, Stephen Sprunk a écrit :


On the Macintosh it is impossible to write plain text files. Even "vi" is
Unicode aware and will accept any Unicode character!

This means that if a SINGLE Unicode character slips in, the whole file
will be saved in UTF-8.

Or more like, the whole file still looks like ASCII, except for that one
character is a multi-byte encoding.

UTF-8 is backward-compatible with USASCII!
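
A short C sketch of a UTF-8 encoder shows why this holds: code points below 0x80 are emitted as one byte with the same value as in ASCII, and only higher code points become multi-byte sequences. The function name utf8_encode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point `cp` as UTF-8 into `out`.
   Returns the number of bytes written, or 0 for a surrogate or a value
   above 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                       /* ASCII range: byte is unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding 'A' (0x41) gives the single byte 0x41, while U+00E9 gives the two bytes 0xC3 0xA9, so a file with one such character is still byte-for-byte ASCII everywhere else.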
 

BartC

Stephen Sprunk said:
On 28-Feb-14 11:34, BartC wrote:


Your misunderstanding is due to thinking there is such a thing as a
"generic text file". Wrong. Text files are encoded in a variety of
ways, and _your_ way is no more "generic" than anyone else's--and
programs need to be able to adapt to all of them.

The hundreds of text files I've downloaded over the years have all been very
generically similar - except some have CRLF line endings, some LF, and a few
CR. I've never come across commas for decimal points though. It would be
chaotic if that became common.
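
For the three line-ending styles mentioned here, a minimal C sketch that rewrites CRLF and lone CR as LF in place. The function name normalize_newlines is hypothetical, and the sketch assumes the whole text is held in one NUL-terminated buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert CRLF and lone CR to LF in place.
   Returns the new length of the string. */
size_t normalize_newlines(char *s)
{
    assert(s != NULL);

    size_t r = 0;   /* read position */
    size_t w = 0;   /* write position, never ahead of r */

    while (s[r] != '\0') {
        if (s[r] == '\r') {
            s[w++] = '\n';
            r++;
            if (s[r] == '\n')   /* CRLF counts as a single line ending */
                r++;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
    return w;
}
```

Applied to "a\r\nb\rc\n", this leaves "a\nb\nc\n" in the buffer.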
 

James Kuyper

On 02/28/2014 03:55 PM, BartC wrote:
....
CR. I've never come across commas for decimal points though. It would be
chaotic if that became common.

It is already common in some parts of the world, and has been for a long
time. If that's sufficient to create chaos, then chaos has in fact
already been created - which isn't exactly news.
 

Keith Thompson

Melzzzzz said:
Just tried, same on Linux (Ubuntu 13.10)

Presumably "unicode character" means "non-ASCII character", since all
representable characters (within reason) are Unicode.

A text file containing nothing but ASCII characters (in the range 0 to
127) is already UTF-8 encoded. Adding a non-ASCII character doesn't do
anything special; the system just happens to use UTF-8 as its default
encoding.

These days, I'd almost argue that UTF-8 *is* "plain text" (though
Windows still tends to use UTF-16, or UCS-2 pretending to be UTF-16).
 

James Kuyper

Actually not. locale.h was an invention of the C89 committee.

In my attempt to put a birth date on locales, I searched for comp.lang.c
articles containing the word "locale". The oldest I could find was dated
1986-11-09, during the transition from net.lang.c to comp.lang.c (there
were no such messages posted to net.lang.c). It only slightly predates
C89, and appears to be part of the discussions leading up to the
adoption of C89, which would be consistent with what you've said.
 

James Kuyper

On 02/28/2014 04:31 PM, Keith Thompson wrote:
....
These days, I'd almost argue that UTF-8 *is* "plain text" (though
Windows still tends to use UTF-16, or UCS-2 pretending to be UTF-16).

How does it manage to differ from UCS-2 without actually qualifying as
UTF-16? There's unfortunately nothing implausible about a claim that
Microsoft would fail to correctly implement either encoding - but I'm
curious about the details.
 

Keith Thompson

James Kuyper said:
On 02/28/2014 04:31 PM, Keith Thompson wrote:
...

How does it manage to differ from UCS-2 without actually qualifying as
UTF-16? There's unfortunately nothing implausible about a claim that
Microsoft would fail to correctly implement either encoding - but I'm
curious about the details.

[I replied to your e-mail before realizing you had also posted a
followup.]

Since any text file that includes only characters in the 16-bit BMP
has the same representation in UCS-2 and in UTF-16, it's easy for
software to get away with ignoring the difference by assuming that
text consists entirely of 16-bit characters.

In the early days, Unicode didn't extend beyond the BMP; UTF-16 wasn't
necessary until characters past 0xffff were added.

I suspect there's still a fair amount of old Windows software that
still doesn't handle characters outside the BMP. I don't know how
many programs have this problem.
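
The difference can be sketched in C. For a BMP code point the UTF-16 encoding is a single 16-bit unit, identical to UCS-2. Only code points above 0xFFFF need a surrogate pair, which software that assumes 16-bit characters would treat as two characters. The function name utf16_encode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point `cp` as UTF-16 into `out`.
   Returns the number of 16-bit units written, or 0 for a surrogate value
   or a value above 0x10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* reserved for surrogate pairs */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: same representation as UCS-2 */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

U+20AC encodes as the single unit 0x20AC, while U+1F600 encodes as the pair 0xD83D 0xDE00.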
 

glen herrmannsfeldt

(snip, someone wrote)
Or more like, the whole file still looks like ASCII, except for
that one character is a multi-byte encoding.
UTF-8 is backward-compatible with USASCII!

At least some MS software puts special flag bytes at the beginning
of a UTF-8 or UTF-16 file. As I understand it, to indicate the
endianness and also that it is UTF-8 or UTF-16.

(And to confuse programs not expecting them.)

-- glen
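
Those flag bytes are the byte order mark (BOM). A program that wants to cope with them can check the first few bytes of the file. Here is a minimal C sketch, with hypothetical names bom_kind and detect_bom. It ignores UTF-32, whose little-endian BOM also starts with FF FE.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Hypothetical helper: classify the byte order mark, if any, at the
   start of `buf`. UTF-32 is deliberately not handled. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;        /* U+FEFF encoded as UTF-8 */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;    /* U+FEFF, low byte first */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;    /* U+FEFF, high byte first */
    return BOM_NONE;
}
```

A reader that gets BOM_UTF8 can skip the first three bytes and treat the rest as ordinary UTF-8 text.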
 

glen herrmannsfeldt

(snip, I wrote)
There have been more than a few, although none have achieved major
prominence (at least outside their home areas). A partial list:

And the idea goes back a long time. There were non-English versions
of Algol, and I've heard of a Russian Cobol. More recently there was
an Arabic-based programming language in the news, and there are
Chinese versions of C++ and Basic. MS's VBA is (partially) localized,
so your Excel macro may not run on a machine set up for a different
language.

When I wrote that, I was thinking about languages that aren't just
a replacement of English keywords with those of another language,
but designed, as close as is reasonable, from the beginning.

Most new languages borrow at least a little from previous ones, but
I meant more than one could do with the C preprocessor and
some #define directives.

-- glen
 

Geoff

The hundreds of text files I've downloaded over the years have all been very
generically similar - except some have CRLF line endings, some LF, and a few
CR. I've never come across commas for decimal points though. It would be
chaotic if that became common.

Then you've never had to read documents translated from German to
English. Try going to Siemens' web site and reading their technical
documentation about motor drives or motor starters. They don't usually
manage to translate every ., to ,. when it comes to ratings and
dimensions.
 

James Kuyper

James Kuyper said:
On 02/28/2014 04:31 PM, Keith Thompson wrote:
...

How does it manage to differ from UCS-2 without actually qualifying as
UTF-16? There's unfortunately nothing implausible about a claim that
Microsoft would fail to correctly implement either encoding - but I'm
curious about the details.

[I replied to your e-mail before realizing you had also posted a
followup.]

Sorry - I hit the wrong button - again.
Since any text file that includes only characters in the 16-bit BMP
has the same representation in UCS-2 and in UTF-16, it's easy for
software to get away with ignoring the difference by assuming that
text consists entirely of 16-bit characters.

In the early days, Unicode didn't extend beyond the BMP; UTF-16 wasn't
necessary until characters past 0xffff were added.

I suspect there's still a fair amount of old Windows software that
still doesn't handle characters outside the BMP. I don't know how
many programs have this problem.

I'd describe that as incomplete conversion from UCS-2 to UTF-16, rather
than as "UCS-2 pretending to be UTF-16".
 
