POSIX enhancements to printf

Keith Thompson

glen herrmannsfeldt said:
At least some MS software puts special flag bytes at the beginning
of a UTF-8 or UTF-16 file. As I understand it, to indicate the
endianness and also that it is UTF-8 or UTF-16.

(And to confuse programs not expecting them.)

The special flag is a "byte order mark", represented as the 16-bit
character '\uFEFF' (ZERO WIDTH NO-BREAK SPACE). In UTF-16, you can tell
whether a file is big-endian or little-endian by checking whether the
first two bytes are (FE FF) or (FF FE). In UTF-8, it's represented by
the 3-byte sequence (EF BB BF). Without such a marker, it can be
difficult (and in principle sometimes impossible) to distinguish between
valid big-endian and little-endian UTF-16 files.
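
As a rough illustration of the byte sequences described above, here is a minimal sketch that checks the start of a buffer for a byte order mark. The names `bom_kind` and `detect_bom` are made up for this example, and it only covers the UTF-8 and UTF-16 cases mentioned in this post.

```c
#include <stddef.h>

/* Encodings that a leading byte order mark can identify (illustrative). */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

/* Hypothetical helper: report which BOM, if any, the buffer starts with. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    /* UTF-8 encoding of U+FEFF: EF BB BF */
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    /* UTF-16, big-endian: FE FF */
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    /* UTF-16, little-endian: FF FE.
       (FF FE 00 00 would be a UTF-32 BOM; this sketch ignores UTF-32.) */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

A caller would typically read the first few bytes of the file, pass them to `detect_bom`, and skip the mark before decoding the rest.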

Outside the Windows world, UTF-8 is by far the most common encoding for
Unicode text, and byte order marks are rare. There are *some*
UTF-8-with-BOM files floating around (likely converted from Windows
UTF-16-with-BOM).
 
BartC

Geoff said:
Then you've never had to read documents translated from German to
English. Try going to Siemens' web site and reading their technical
documentation about motor drives or motor starters. They don't usually
manage to translate every ., to ,. when it comes to ratings and
dimensions.

I was thinking more of technical documents that would need interpretation of
numeric values by program, than natural language which just needs character
processing (or which are not 'read' by a program at all, only humans).

So, if a C programmer in Germany is all geared up to read national TXT files
containing commas for decimal points, how does he manage to read, say,
standard DXF files? Or comma-delimited ones?! /He/ might be ultra-aware of
these issues whenever such a file has to be read (or written), but it could
be the end-user who faces the problem. An end-user or recipient who could be
in any region.
 
Geoff

I was thinking more of technical documents that would need interpretation of
numeric values by program, than natural language which just needs character
processing (or which are not 'read' by a program at all, only humans).

So, if a C programmer in Germany is all geared up to read national TXT files
containing commas for decimal points, how does he manage to read, say,
standard DXF files?

I don't know. If they're standard DXF files, what does that standard
have to say about localization of numerical data? Are DXF
"standardized" to always use commas per US usage? Are the tools that
deal with DXF files using an internal representation and only
presenting localizations on output?

Or comma-delimited ones?!

A comma-delimited file is a comma-delimited file is it not?
 
Jorgen Grahn

Then you've never had to read documents translated from German to
English. Try going to Siemens' web site and reading their technical
documentation about motor drives or motor starters. They don't usually
manage to translate every ., to ,. when it comes to ratings and
dimensions.

Things vary depending on country, business and the purpose of the
data. If it's supposed to be machine-readable /and/ human-readable
like the output from a Unix command, IME it's almost universally plain
%f.

Here in .se I rarely see the native 1 234,5678 form. Especially not
in engineering. I associate it with money and invoices ...

/Jorgen
 
BartC

Geoff said:
I don't know. If they're standard DXF files, what does that standard
have to say about localization of numerical data? Are DXF
"standardized" to always use commas per US usage? Are the tools that
deal with DXF files using an internal representation and only
presenting localizations on output?

This is exactly the kind of thing that can get 'chaotic'. (And I've
exchanged plenty of DXF files with countries that use "," for decimal
points, but I've never seen such a format used.)

A comma-delimited file is a comma-delimited file is it not?

Yes, but if you're reading a line such as:

123,456.789

which normally represents the two numbers (123, 456.789), and read it German
style, how is it going to interpret that first comma, as a separator or
decimal point? And if the former, how do you represent floating point
numbers in such a file, without using an 'English' decimal point?
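
The ambiguity can be made concrete with a small sketch. It assumes a hypothetical helper, `parse_number`, that takes the decimal-point character as an argument instead of consulting the locale, so both readings can be tried on the same line. It handles only an optional sign, digits, and a fraction, which is enough for this example.

```c
/* Hypothetical helper: parse one number from s, treating decimal_point
   as the radix character regardless of the current locale.
   *end is set to the first character that was not consumed. */
double parse_number(const char *s, char decimal_point, const char **end)
{
    double mantissa = 0.0;   /* all digits seen so far, as one integer */
    double divisor = 1.0;    /* 10 raised to the number of fraction digits */
    int negative = 0;

    if (*s == '+' || *s == '-')
        negative = (*s++ == '-');
    while (*s >= '0' && *s <= '9')
        mantissa = mantissa * 10.0 + (*s++ - '0');
    if (*s == decimal_point) {
        s++;
        while (*s >= '0' && *s <= '9') {
            mantissa = mantissa * 10.0 + (*s++ - '0');
            divisor *= 10.0;
        }
    }
    *end = s;
    return (negative ? -mantissa : mantissa) / divisor;
}
```

Called on "123,456.789" with '.' as the decimal point, the first call stops at the comma with 123, and a second call after the comma gives 456.789. Called with ',' as the decimal point, the first call swallows the comma, yields 123.456, and stops at ".789", which is no longer a valid field.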
 
Geoff

This is exactly the kind of thing that can get 'chaotic'. (And I've
exchanged plenty of DXF files with countries that use "," for decimal
points, but I've never seen such a format used.)

Which would seem to imply that the standard for DXF file data format
does not allow for that style. So what is your point?

Yes, but if you're reading a line such as:

Yes. No but.

A comma-delimited file is a comma-delimited file and it must be read
as such. There is no provision for locale-specific data format.

123,456.789

which normally represents the two numbers (123, 456.789), and read it German
style, how is it going to interpret that first comma, as a separator or
decimal point? And if the former, how do you represent floating point
numbers in such a file, without using an 'English' decimal point?

Have you checked?

Could this be the reason the limited work that was done on the
implementation of the POSIX locale generally doesn't delimit numbers
that way in those locales?

Have you looked at the function strfmon?

Apart from the usage in printf and strfmon, what other functions take
the locale environment information into account?

How would a program, writing data out in one locale, guarantee
portability of that data to another locale?

What do the POSIX and C standards have to say about it?

Wouldn't that responsibility belong to the programmer designing the
output?

How does anyone digest a comma-delimited file now?

Just as the name implies, comma-delimited is comma-delimited. The
"standard" for comma-delimited style data files would seem to exclude
the use of any locale that reverses the role of comma and decimal -
unless one provides for the use of the same locale functionality in
the functions used to read that data as were used to write it.
 
Stefan Ram

Geoff said:
A comma-delimited file is a comma-delimited file and it must be read
as such. There is no provision for locale-specific data format.

Yes. And I would add that - of course - just »comma-delimited«
is too vague. To write a parser, I need a grammar that gives
every detail of the syntax.

How does anyone digest a comma-delimited file now?

I cannot parse a file unless I know its syntax. So I would
kindly ask the customer for a syntax specification.

If he cannot supply one, then I would need to work out one
on my own first, which would cost an additional charge. Then
the customer would have to sign it, and then I'd start to
write the parser.
 
BartC

Geoff said:
Which would seem to imply that the standard for DXF file data format
does not allow for that style. So what is your point?

I suppose my point is that a decimal point rather than comma is probably the
de facto standard for machine-readable fields in text, even with countries
that use the comma, unless the text is purely for internal use within the
region.
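
One way a program might stick to that de facto standard, whatever LC_NUMERIC happens to be, is sketched below. The helper `format_portable` is hypothetical; it formats with snprintf and then swaps the radix string reported by localeconv() for a plain '.'. It assumes "%.6f" is an acceptable precision and that no grouping flag is used.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write value into buf with '.' as the decimal
   point, regardless of the current LC_NUMERIC setting.
   Returns 0 on success, -1 if buf is too small or formatting fails. */
int format_portable(char *buf, size_t size, double value)
{
    int n = snprintf(buf, size, "%.6f", value);
    if (n < 0 || (size_t)n >= size)
        return -1;

    /* decimal_point is never an empty string, but it may be longer
       than one byte in some locales. */
    const char *dp = localeconv()->decimal_point;
    size_t dplen = strlen(dp);
    char *pos = strstr(buf, dp);

    if (pos != NULL && strcmp(dp, ".") != 0) {
        /* Overwrite the locale's radix string with a single '.' and
           close the gap if the radix string was longer than one byte. */
        *pos = '.';
        memmove(pos + 1, pos + dplen, strlen(pos + dplen) + 1);
    }
    return 0;
}
```

In the default "C" locale the replacement step does nothing and 1234.5 comes out as "1234.500000"; the point of the sketch is that the same call should give the same text after setlocale() has selected a comma-using locale.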

The business with the thousands separator is less important - that is only
for display, and not an essential part of the data like the decimal point.
It's unusual anyway to have to deal with thousands separators in input (I
only do that for source code, and then the separator is not a comma).

And really, when I output numbers with such separators, it is /only to make
long numbers more readable/; I care very little about local customs or
whether it is exactly right for the region! Anyone in their own region can
do the same. The unfortunate few who have to create software that does the
'right thing' in every single region, then they're welcome to all those
headaches. But I suspect there is a lot more to it than just setlocale() and
"%f".
 
Stephen Sprunk

Everything about the OS and environment that needs to be different
for a different locale.

(I remember when MSDN used to send me, quarterly iirc, dozens of
different Windows 95 installation CDs, for different regions. I'm not
saying that's the way to go, but it suggests the differences were
significant.)

That's because Win95 was only nominally Unicode-aware; each localized
version internally used a different "code page", and if a program called
a "Wide" API function, the arguments were translated to that "code page"
and passed to the "ANSI" API. That "code page" was hard-coded into the
system as part of the build/packaging process.

In contrast, in WinNT arguments to the "ANSI" APIs were translated to
UCS-2 and passed to the "Wide" API; this is the architecture that formed
the basis for Win2k, WinXP, etc., and now the "localization" merely sets
the default "code page" for said translation; all of the various locales
(except certain parts of Asia, which are a free download) are included
with every installation, switchable per user from the Control Panel.

Sadly, Windows _still_ doesn't allow selecting UTF-8 by default; if it
did, the entire code page mess (and the need for a "Wide" API) would
disappear within a few years. But that's not too surprising since
Microsoft seems to think that "Unicode" always means UCS-2 (or UTF-16,
for the few people there that know about surrogates).

S
 
Keith Thompson

Geoff said:
Which would seem to imply that the standard for DXF file data format
does not allow for that style. So what is your point?

Comma-delimited files are quite capable of representing fields that
include comma characters. Typically such a field is enclosed in
quotation marks. I understand there's no universal convention, which
is certainly a problem, but storing data with commas in a CSV file is
straightforward.
 
glen herrmannsfeldt

(snip)
Comma-delimited files are quite capable of representing fields that
include comma characters. Typically such a field is enclosed in
quotation marks. I understand there's no universal convention, which
is certainly a problem, but storing data with commas in a CSV file is
straightforward.

Yes, but then you also need a convention for quoting quotes, and
the result is much harder to parse. More specifically, there is
no regular expression for the delimiter that you can use, for example,
as the FS for awk.

(Not that it is that hard to parse, but much harder than just
searching for a delimiter character.)
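
To give an idea of the extra work, here is a minimal sketch of reading one field at a time. It assumes one common convention, the one RFC 4180 describes, where a quoted field may contain commas and an embedded quote is written as two quotes. The name `csv_next_field` is made up for this example, and the sketch is lenient about stray text after a closing quote.

```c
#include <stddef.h>

/* Hypothetical helper: copy the next field of a CSV record from *cursor
   into out (capacity cap, including the terminating '\0').
   Inside a quoted field, "" stands for one literal quote character.
   Advances *cursor past the field and its trailing comma.
   Returns the field length, or -1 if out is too small or a quoted
   field is never closed. */
int csv_next_field(const char **cursor, char *out, size_t cap)
{
    const char *p = *cursor;
    size_t n = 0;

    if (cap == 0)
        return -1;

    if (*p == '"') {
        p++;                              /* skip the opening quote */
        for (;;) {
            if (*p == '\0')
                return -1;                /* unterminated quoted field */
            if (*p == '"') {
                if (p[1] != '"') {        /* closing quote */
                    p++;
                    break;
                }
                p++;                      /* "" becomes one quote */
            }
            if (n + 1 >= cap)
                return -1;
            out[n++] = *p++;
        }
    } else {
        /* Unquoted field: everything up to the next comma. */
        while (*p != '\0' && *p != ',') {
            if (n + 1 >= cap)
                return -1;
            out[n++] = *p++;
        }
    }
    out[n] = '\0';
    if (*p == ',')
        p++;                              /* step over the delimiter */
    *cursor = p;
    return (int)n;
}
```

Calling it repeatedly on the record `"123,456.789",42,"say ""hi"""` yields the three fields `123,456.789`, `42` and `say "hi"`. The state needed to tell a delimiter comma from a quoted one is what a single-character field separator cannot express.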

-- glen
 
James Kuyper

On 03/02/2014 12:50 PM, Geoff wrote:
....
Just as the name implies, comma-delimited is comma-delimited. The
"standard" for comma-delimited style data files would seem to exclude
the use of any locale that reverses the role of comma and decimal -

Unless I'm missing something, it would seem to preclude any locale that
USES the comma, regardless of role.

unless one provides for the use of the same locale functionality in
the functions used to read that data as were used to write it.

How would that help? How would such a function decide whether to parse
123,456.55 as 123456.55, or as 123 followed by 456.55?
 
James Kuyper

On 03/02/2014 04:48 PM, Keith Thompson wrote:
....
Comma-delimited files are quite capable of representing fields that
include comma characters. Typically such a field is enclosed in
quotation marks.

My understanding is that, when such a convention applies, the
corresponding field is usually interpreted as a character string, not as
numeric data.
 
Ian Collins

Robert said:
The US is one of the few markets where a significant number of
developers have the option of ignoring I18N.

As is most of the English speaking world!
 
Geoff

On 03/02/2014 12:50 PM, Geoff wrote:
...

Unless I'm missing something, it would seem to preclude any locale that
USES the comma, regardless of role.

You are absolutely right. I had my blinders on.

How would that help? How would such a function decide whether to parse
123,456.55 as 123456.55, or as 123 followed by 456.55?

It wouldn't.
 
Ian Collins

Keith said:
Comma-delimited files are quite capable of representing fields that
include comma characters. Typically such a field is enclosed in
quotation marks. I understand there's no universal convention, which
is certainly a problem, but storing data with commas in a CSV file is
straightforward.

http://tools.ietf.org/html/rfc4180
 
