Rough draft: Proposed format specifier for a thousands separator

Wolfgang Rohdewald

[Paul Rubin]
What if you want to change the separator? Europeans usually
use periods instead of commas: one thousand = 1.000.

That is supported also.

Do you support just a fixed set of separators, or anything?

How about this (Switzerland):

12'000.99

or spacing:

12 000.99
 
pruebauno

As far as I can see you're proposing an amendment to *encourage*
writing code that is not locale aware, with the amendment itself being
locale specific, which surely has to be a regressive move in the 21st
century. Frankly, I'd sooner see it made /harder/ to write code that
is not locale aware (warnings, like FxCop gives on .NET code?) than
/easier/. Perhaps that's because I'm British, not American and I'm
sick of having date fields get the date wrong because the programmer
thinks the USA is the world. It makes me sympathetic to the problems
caused to others by programmers who think the English-speaking world
is the world.

By the way, to others who think that 123,456.7 and 123.456,7 are the
only conventions in common use in the West, no they're not. 123 456.7
is in common use in engineering, at least in Europe, precisely to
reduce (though not eliminate) problems caused by dot and comma
confusion.

I lived in three different countries and in school used blank for
thousand separator to avoid confusion with the multiply operator. I
think this proposal is more for debugging big numbers and meant mostly
for programmers' eyes. We are already using the dot instead of the comma
as the decimal separator in our programming languages, so one more
Americanism won't kill us.

I am leaning towards proposal 1 now just to avoid the thousand
variations that will be requested because of this, making the
implementation unnecessarily complex. I can always use the 3
replacement hack (conveniently documented in the PEP).
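
For readers who have not seen it, the three-replacement hack can be sketched roughly as below. This assumes a Python version where the "," format option is available (2.7 / 3.1 and later); the helper name swap_separators and the placeholder character "X" are made up for illustration.

```python
def swap_separators(n, places=2):
    """Format n US-style, then swap ',' and '.' (hypothetical helper)."""
    s = format(n, ',.%df' % places)  # e.g. '1,234,567.89'
    # Three replacements, with 'X' as a temporary placeholder so the
    # two separators do not collide while being swapped.
    return s.replace(',', 'X').replace('.', ',').replace('X', '.')


print(swap_separators(1234567.891))  # 1.234.567,89
```

The placeholder only has to be a character that never appears in the formatted number.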

+1 for Nick's proposal
 
Lie Ryan

Raymond said:
[andrew cooke]
would it break anything to also allow

Yes, that's allowed too! The separators can be any one of COMMA,
SPACE, DOT, UNDERSCORE, or NON-BREAKING-SPACE.

What if I want other separators?

How about this idea: give the format a "long" form, which is a bit
more verbose, flexible, and unambiguous, and keep the current proposal as a
"short" form, which is more concise.

The "long" format would look like this (it is much, much more featureful
than the current proposal; I think I might have crossed far beyond
the Mayan line):

[n|sign <signnegative>[[, <signzero>], <signpositive>] | ]
[w|min <minwidth>[, <align>[, <alignfill>]]]
[x|max <maxwidth>[, <overflowsign>[, <overflowalign>]]]
[s|sep [[...]<sep><sepwidth>]<sep><sepwidth> | ]
[dp|decpoint <decpoint> | ]
[ds|decsep <width><sep>[, <width><sep>[...]] | ]
[b|base <base-n>[, <charset>]]
[p|prec <prec> | ]
t|type <type>

The feel of the "long" format:
fmt_string: 'type f'

number: 876543213456.98765445
result: 876543213456.98765445

fmt_string: 'decpoint ^ | type f'

number: 876543213456.98765445
result: 876543213456^98765445

fmt_string: 'sep 2>1:3.4 | decpoint , | prec 3 | type f'

number: 876543213456.98765445
result: 87>65>4:321.3456,988

General rules:
- every field except type is optional
- fields are separated by | (this may change); escape a literal | with ||
- every field starts with an identifier followed by mandatory whitespace
- subfields are separated by commas
- each identifier has a long and a short form
- processing precedence is: type, base, prec, sep/decsep, decpoint, sign,
min, max

Specific rules:
- min and max determine the width. min sets the rule for when the
resulting string is shorter than minwidth; max sets the rule for when the
resulting string is longer than maxwidth (basically trimming). alignfill
is the character or sequence of characters used to pad the resulting
string to minwidth; overflowsign is the character added when
maxwidth is exceeded and trimming occurs.
- sep is basically a list of separators, each paired with a group width.
The regular Latin number system would be represented as sep 3.3; the
leftmost width and separator are repeated.
- decsep works similarly to sep
- base is the number base; charset is the mapping of digits used to
represent the output number in that base.
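
The sep rule can be sketched in Python as follows. This is only one reading of the draft grammar: the function name group_digits and the parsing regex are made up for illustration, the spec is assumed to hold at least two widths (each at least 1), and only the integer digits are handled.

```python
import re


def group_digits(digits, spec):
    """Group a digit string per a spec like '2>1:3.4' (hypothetical helper)."""
    # '2>1:3.4' -> [('2', '>'), ('1', ':'), ('3', '.'), ('4', '')]
    pairs = re.findall(r'(\d+)(\D?)', spec)
    widths = [int(w) for w, _ in pairs]
    seps = [s for _, s in pairs[:-1]]  # seps[i] sits between group i and i+1
    out = []
    i = len(widths) - 1
    rest = digits
    while rest:
        # Work from the right; once the spec runs out, keep reusing the
        # leftmost width and separator.
        width = widths[max(i, 0)]
        chunk, rest = rest[-width:], rest[:-width]
        out.append(chunk)
        if rest:
            out.append(seps[max(i - 1, 0)])
        i -= 1
    return ''.join(reversed(out))


print(group_digits('876543213456', '2>1:3.4'))  # 87>65>4:321.3456
print(group_digits('1234567', '3,3'))           # 1,234,567
```

The first call reproduces the integer part of the 'sep 2>1:3.4' example shown earlier in this post.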

PS: It is not designed to be written by hand, but it is meant to be fairly readable
PPS: It is fairly modular too
 
Raymond Hettinger

[Lie Ryan]
What if I want other separators?

format(n, ',d').replace(",", yoursep)
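
Wrapped in a small function, that one-liner covers the Swiss and space-separated styles asked about earlier in the thread. A minimal sketch, assuming the "," option is available (Python 2.7 / 3.1 and later); the name with_separator is made up for illustration.

```python
def with_separator(n, sep):
    """Group an integer in thousands using an arbitrary separator."""
    return format(n, ',d').replace(',', sep)


print(with_separator(12000, "'"))    # 12'000
print(with_separator(1234567, ' '))  # 1 234 567
```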

How about this idea: give the format a "long" form, which is a bit
more verbose, flexible, and unambiguous, and keep the current proposal as a
"short" form, which is more concise.

The "long" format would look like this (it is much, much more featureful
than the current proposal; I think I might have crossed far beyond
the Mayan line):

I concur ;-)


Raymond
 
John Nagle

Yes. In COBOL, one writes

PICTURE $999,999,999.99

which is way ahead of most of the later approaches.

John Nagle
 
Raymond Hettinger

Today's updates to: http://www.python.org/dev/peps/pep-0378/

* Detail issues with the locale module.
* Summarize commentary to date.
-- Opposition to formatting strings in general
(preferring a convenience function or PICTURE clause)
-- Opposition to any non-locale aware approach
* Add APOSTROPHE and non-breaking SPACE to the list of separators.
* Add more links to external references
(Babel, Excel, ADA, CommonLisp, COBOL, C-Sharp).
* Clarify how proposal II is parsed.


Raymond
 
MRAB

Raymond said:
Today's updates to: http://www.python.org/dev/peps/pep-0378/

* Detail issues with the locale module.
* Summarize commentary to date.
-- Opposition to formatting strings in general
(preferring a convenience function or PICTURE clause)
-- Opposition to any non-locale aware approach
* Add APOSTROPHE and non-breaking SPACE to the list of separators.
* Add more links to external references
(Babel, Excel, ADA, CommonLisp, COBOL, C-Sharp).
* Clarify how proposal II is parsed.

I'd just like to make the point that the string methods, e.g.
unicode.upper, aren't locale-sensitive, so 'format' shouldn't be either.

The string methods could perhaps retain their current behaviour as the
default and accept a parameter to make them locale-sensitive. The same
could be the case for 'format' so the format string has "." to represent
the decimal point and "," to represent the digit separator, and those
would be the default, but it could accept a flag ("L"?) to make it
locale-sensitive.
 
Paul Rubin

Tim Rowe said:
And if it's mostly for programmers' eyes, why does the motivation
state that "Adding thousands separators is one of the simplest ways to
improve the professional appearance and readability of output exposed
to end users"?

It occurs to me, at least for quantities of data, one of the most
useful aids to readability is scaling down the quantity and suffixing
it with K (kilo), M (mega), G (giga), etc. This is sometimes done
with K=1000 and sometimes with K=1024 (fancy pronunciation "kibi"
rather than kilo, officially abbreviated Ki). Possible formatting:

'%.3K' % 1234567 = 1.235M # K = 1000
'%.:3Ki' % 1234567 = 1.177Mi # K = 1024

The colon (two dots) signifies base two. The "i" is not part of the
format spec, it's just a literal character, to make the standard
abbreviation for kibi.
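
No such format code exists today, but the scaling itself is easy to sketch as a plain function. The name scaled, its parameters, and the suffix list below are assumptions for illustration, not part of any proposal.

```python
def scaled(n, digits=3, binary=False):
    """Scale n down and append K/M/G/T (or Ki/Mi/... when binary=True)."""
    base = 1024 if binary else 1000
    suffixes = ['', 'K', 'M', 'G', 'T']
    value = float(n)
    idx = 0
    # Divide until the value drops below one unit of the next suffix.
    while abs(value) >= base and idx < len(suffixes) - 1:
        value /= base
        idx += 1
    suffix = suffixes[idx]
    if binary and suffix:
        suffix += 'i'
    return '%.*f%s' % (digits, value, suffix)


print(scaled(1234567))               # 1.235M
print(scaled(1234567, binary=True))  # 1.177Mi
```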
 
Lie Ryan

Raymond said:
Motivation:

Provide a simple, non-locale aware way to format a number
with a thousands separator.

Adding thousands separators is one of the simplest ways to
improve the professional appearance and readability of
output exposed to end users.

In the finance world, output with commas is the norm. Finance
users and non-professional programmers find the locale approach to be
frustrating, arcane and non-obvious.

It is not the goal to replace locale or to accommodate every
possible convention. The goal is to make a common task easier
for many users.


Raymond, I think there are several problems with the Motivation:
> The goal is to make a common task easier
> for many users.

Common task, for most people, means formatting numbers to the locale. We
should make converting numbers to locale easier to use, as easy as
calling a magic function that can convert the current object to the
locale representation or as simple as defining locale ID in the mini
language. This proposal, I believe, is for the _less_ common task of
formatting a number to a custom format not generally used anywhere else
in the world (like formatting a number to form an ipv6 address or
formatting a number to html/TeX code[1]).

[1] I know one mathematics textbook that uses a superscript minus for
negative numbers to distinguish it from the minus sign.

> In the finance world, output with commas is the norm.

I can't cite any source, but I am skeptical of that. And what about the
non-finance world? The scientific world? The pure math world?

> Provide a simple, non-locale aware way to format a number
> with a thousands separator.

As many have pointed out, locale is hard to use. This approach is easier,
but it is a pity that it is not locale aware. If we want to provide
non-locale aware formatting, we must make it flexible enough to be the
"Ultimate Formatter". Otherwise it will just be redundant with locale.

> Adding thousands separators is one of the simplest ways to
> improve the professional appearance and readability of
> output exposed to end users.

There are infinitely many approaches to writing numbers. One Singaporean
textbook uses a half-width space as the thousands separator. One Australian
textbook uses a superscript minus for negative numbers (which I believe would
require more than Unicode to represent, TeX or PDF perhaps). The
accounting world sometimes uses colors and parentheses to denote
negative numbers (this requires emitting codes for the layout program:
HTML, TeX, PDF).

Anything less powerful than my proposed "Crossing Mayan line" is just a
harder alternative for locale module.
 
Raymond Hettinger

[Lie Ryan]
> In the finance world, output with commas is the norm.

I can't cite any source, but I am skeptical with that.

No doubt that you're skeptical of anything you didn't
already know ;-) I'm a CPA, was a 15 year division controller
for a Fortune 500 company, and an auditor for an international
accounting firm. Believe me when I say it is the norm in finance.
Besides, it seems like you're arguing that thousands separators
aren't needed anywhere at all and have doubts about their
utility. Pick-up your pocket calculator and take a look.
Look at your paycheck or your bank statement. Check-out a
publishing style guide. They are somewhat basic. There's
a reason MS Excel and Lotus offered them from day one.

Python's format() style was taken directly from C-Sharp,
which offers both an "n" format that is locale sensitive
and a non-locale-sensitive variant that specifies a comma.
I'm suggesting that we also do both.

Random, made-up statistic: 99% of Python scripts are
not internationalized, have no need to be internationalized,
and have output intended to be used in the script writer's
immediate environment.

Another issue I have with locale is that you have to find
one that matches every specific need. Quick, which one gives
you non-breaking spaces for a thousands separator? If you
do find such a locale and it happens to be spelled the same
way on every platform, is it self-evident in your program
that it will in fact print with spaces, or has that become
an implicit, behind-the-scenes operation? If later you need
to print another number with a different separator, do you
have a way to make that happen without breaking the first piece
of code you wrote?

The locale module has plenty of issues for a programmer to
think about:
http://docs.python.org/library/locale.html#background-details-hints-tips-and-caveats

Besides, lots of people use Python who are not professional
programmers. We should not require them to enter the complicated
world of locale just to do a basic formatting task. When
I teach Python to pre-college students, there is no way I'm
adding locale to the list of things they need to learn to
become functional with the language.

Sorry for the long post, but I feel like you keep inventing
heavy solutions that don't fit well with what we already have.
This should be a simple problem -- when writing a number
format, how do I specify that I want character X as a thousands
separator? The answer to that question should be nothing
harder than, "add character X to the format string."

You're a very creative person, but I don't see Guido accepting
any idea that rejects what he has already chosen as the way
to format strings. He is no fan of the locale module's API,
but it is tightly bound to existing programs and POSIX
standards. That greatly limits the options for changing it.

I'm sure you can come up with 500 ways of meeting this need
(almost none of which meld with Guido's choice to accept
PEP 3101 for both 2.6 and 3.0). I'm offering a simple
extension to the existing framework that makes the above
tasks easy. C-Sharp makes essentially the same choice in its
design. There's no reason for you to have to use it
if you hate it.

Cheers,


Raymond
 
Lie Ryan

Raymond said:
If anyone here is interested, here is a proposal I posted on the
python-ideas list.

The idea is to make number formatting a little easier with the new
format() builtin in Py2.6 and Py3.0: http://docs.python.org/library/string.html#formatspec


-------------------------------------------------------------


Motivation:

Provide a simple, non-locale aware way to format
a number with a thousands separator.

Adding thousands separators is one of the simplest ways to improve the professional appearance and readability of output.

In the finance world, output with commas is the norm. Finance users and non-professional programmers find the locale approach to be frustrating, arcane and non-obvious.

The locale module presents two other challenges. First, it is a global setting and not suitable for multi-threaded apps that need to serve-up requests in multiple locales. Second, the name of a relevant locale (such as "de_DE") can vary from platform to platform or may not be defined at all. The docs for the locale module describe these and many other challenges [1] in detail.

It is not the goal to replace the locale module or to accommodate every possible convention. Such tasks are better suited to robust tools like Babel [2] . Instead, our goal is to make a common, everyday task easier for many users.





Comments and suggestions are welcome but I draw the line at supporting
Mayan numbering conventions ;-)


Raymond
 
Lie Ryan

Raymond said:
[Lie Ryan]
I can't cite any source, but I am skeptical with that.

No doubt that you're skeptical of anything you didn't
already know ;-) I'm a CPA, was a 15 year division controller
for a Fortune 500 company, and an auditor for an international
accounting firm. Believe me when I say it is the norm in finance.
Besides, it seems like you're arguing that thousands separators
aren't needed anywhere at all and have doubts about their
utility. Pick-up your pocket calculator and take a look.
Look at your paycheck or your bank statement. Check-out a
publishing style guide. They are somewhat basic. There's
a reason the MS Excel and Lotus offered them from day one.

I have no reason to doubt that output with separators is nice, but I am
skeptical that every financial institution in the world (not just in the US)
uses commas as separators.

> Python's format() style was taken directly from C-Sharp,
> which offers both an "n" format that is locale sensitive
> and a non-locale-sensitive variant that specifies a comma.
> I'm suggesting that we also do both.

I'm fine with that. But no commas; instead, user-definable separators.

> Random, made-up statistic: 99% of Python scripts are
> not internationalized, have no need to be internationalized,
> and have output intended to be used in the script writer's
> immediate environment.

Random, made-up statistic: 95% of those are scripts written for
personal/internal use.

> If you
> do find such a locale and it happens to be spelled the same
> way on every platform, is it self-evident in your program
> that it will in fact print with spaces or has that become
> an implicit, behind the scenes operation. If later you need
> to print another number with a different separator, do you
> have a way make that happen without breaking the first piece
> of code you wrote?

Yeah, all data in transmission should be in a locale-independent format;
it should only be converted to a locale-aware format just before being
shown to the user. That way nothing will break.

Since you're an accountant, I am sure you know about Quicken files,
which store data in locale format, which IMHO is a very BAD design.



> Another issue I have with locale is that you have to find
> one that matches every specific need. Quick, which one gives
> you non-breaking spaces for a thousands separator?

That wasn't the issue. Most programs would either "use the environment's
locale and give user configuration to override the locale" or "I don't
care, the output is for personal/internal consumption" or "The data only
makes sense with certain formatting". I don't see a use case where the
programmer would really want to hardcode a locale AND want the output to
be exactly like what he sees in the user machine.

The first case ("use the environment's locale and give user
configuration to override the locale") is for internationalized
applications, and is served by locale. The locale module is currently
difficult to work with, so I believe we should provide a more accessible
way.

The second case ("I don't care, the output is for personal/internal
consumption"), is well served by python's default view.

The third case ("The data only makes sense with certain formatting") is
the one that will benefit the most from non-locale aware formatting. But
it would require a very powerful formatter. Such use cases include
formatting IP addresses, telephone numbers, ID card numbers, etc.

My proposition is: make the format specifier a simpler API for locale-aware
formatting, or make it capable of serving the third case. I would rather
prioritize the former.
 
Raymond Hettinger

[Lie Ryan]
My proposition is: make the format specifier a simpler API to locale
aware

You do know that we already have one, right?
That's what the existing "n" specifier does.
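
A minimal sketch of what using "n" involves in practice. The helper name format_n is hypothetical. Locale names other than 'C' and '' (the user's default) vary by platform, and LC_NUMERIC is a process-wide setting, so this pattern is not safe to use from multiple threads.

```python
import locale


def format_n(n, locale_name=''):
    """Format n with the 'n' specifier under a temporarily set locale."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return format(n, 'n')
    finally:
        # Restore the global state that was changed above.
        locale.setlocale(locale.LC_NUMERIC, saved)


print(format_n(1234567, 'C'))  # 1234567 (the C locale defines no grouping)
```

Calling format_n(1234567) with the default '' picks up whatever the environment specifies, so its output differs from machine to machine.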


Raymond
 
Hendrik van Rooyen

"Tim Rowe" <dig....il.com> wrote:

8< -----------------------------------------------------------------
......... If "Finance users and non-professional
programmers find the locale approach to be frustrating, arcane and
non-obvious" then by all means propose a way of making it simpler and
clearer, but not a bodge that will increase the amount of bad software
in the world.

I do not follow the reasoning behind this.

It seems to be based on an assumption that the locale approach
is some sort of holy grail that solves these problems, and that
anybody who does not like or use it is automatically guilty of
writing crap code.

No account seems to be taken of the fact that the locale approach
is a global one that forces uniformity on everything done on a PC
or by a user.

So when you want to make a report in a format that would suit
what your foreign visitors are used to, do you have to change
your server's locale, and change it back again afterwards, or what ?

The locale approach has all the disadvantages of global variables.

To make software usable by, or expandable to, different languages
and cultures is a tricky design problem - you have to, at the
minimum, do things like storing all your text, both for prompts and
errors, in some kind of database and refer to it by its key, everywhere.
You cannot simply assume, that because a number represents
a monetary value, that it is Yen, or Australian Dollar, or whatever -
you may have to convert it first, from its currency, to the currency
that you want to display it as, and only then can you worry about
the format that you want to display it in.

In all of this, as I see it, the locale approach addresses only a small
part, and solves very little.

Why is it still being defended and touted as if it were 42? *

- Hendrik

* the answer to life, the universe, and everything.
( - Douglas Adams' Hitchhiker books)
 
Hendrik van Rooyen

John Nagle said:
Yes. In COBOL, one writes

PICTURE $999,999,999.99

which is way ahead of most of the later approaches.

That was fixed width. For zero suppression:

PIC $$$$,$$$,$99.99

This will format 1000 as $1,000.00

For fixed width zero suppression:

PIC $ZZZ,ZZZ,Z99.99

gives a fixed width field - $ 1,000.00
With a fixed width font, this will line the column up,
so that the decimals are under each other.
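
For comparison, a rough Python approximation of the two PICTURE clauses, assuming the "," format option is available (Python 2.7 / 3.1 and later). The function names and default widths are made up for illustration, and the match is not exact: COBOL's Z99.99 always prints at least two integer digits, which this sketch does not.

```python
def pic_fixed(amount, width=14):
    # Roughly PIC $ZZZ,ZZZ,Z99.99: '$' in the first column, then the
    # number right-aligned in a fixed-width, space-padded field.
    return '$' + format(amount, '>%d,.2f' % width)


def pic_floating(amount, width=15):
    # Roughly PIC $$$$,$$$,$99.99: the '$' floats next to the first digit,
    # and the whole thing is right-aligned in the field.
    return format('$' + format(amount, ',.2f'), '>%d' % width)


print(repr(pic_fixed(1000)))
print(repr(pic_floating(1000)))
```

Both results are 15 characters wide, so a column of them lines up on the decimal point in a fixed-width font.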

- Hendrik
 
JanC

Raymond said:
No doubt that you're skeptical of anything you didn't
already know ;-) I'm a CPA, was a 15 year division controller
for a Fortune 500 company, and an auditor for an international
accounting firm. Believe me when I say it is the norm in finance.
Besides, it seems like you're arguing that thousands separators
aren't needed anywhere at all and have doubts about their
utility. Pick-up your pocket calculator and take a look.
Look at your paycheck or your bank statement.

My current bank and my previous bank use 2 ways to write numbers:

1. a decimal comma, and a space (or half-space or any other appropriate
small whitespace) as a thousands separator
2. written full out in words (including the currency names)

Invoices (not from these banks) often use a point as the thousands
separator (although that's "wrong" according to some national
standards, it's probably okay according to accounting standards...).

The second formatting (full words) is a legal requirement on certain
financial & legal documents here (and I can imagine in other countries
too?). Anybody working on a PEP about implementing a 'w' (for "wordy"?)
formatting type? ;-)
 
Tim Rowe

2009/3/14 Hendrik van Rooyen said:
No account seems to be taken of the fact that the locale approach
is a global one that forces uniformity on everything done on a PC
or by a user.

Not so. Under .NET, for instance, the global settings will give you a
default CultureInfo class, but you can create your own CultureInfo
classes for other cultures in your program and use them in place of
the default.

> So when you want to make a report in a format that would suit
> what your foreign visitors are used to, do you have to change
> your server's locale, and change it back again afterwards, or what ?

No, you create a local locale and use that.

There are essentially three possible levels I can see for this:

- programs that will only ever be used in one locale, known in
advance. They can have the locale hard-wired into the program. No
special support is needed for this. It's pretty easy to write a
function to format a number to a hard-wired locale. I've done it in
Pascal and FORTH and it was easy-peasy, so I can't imagine it's going
to be a big deal in Python. If it's such a big deal for accountants to
write this code, if they ask in this forum how to do it somebody will
almost certainly supply a function that takes a float and returns a
formatted string within a few minutes. It might even be you or me.

- Programs that may be used in any unchanging locale. The existing
locale support is built for this case.

- Programs that need to operate across locales. This can either be
managed by switching global locales (which you rightly deprecate) or
by managing alternate locales within the program.

> The locale approach has all the disadvantages of global variables.

No, it has all the advantages of global constants used as overridable
defaults for local variables.

> To make software usable by, or expandable to, different languages
> and cultures is a tricky design problem - you have to, at the
> minimum, do things like storing all your text, both for prompts and
> errors, in some kind of database and refer to it by its key, everywhere.
> You cannot simply assume, that because a number represents
> a monetary value, that it is Yen, or Australian Dollar, or whatever -
> you may have to convert it first, from its currency, to the currency
> that you want to display it as, and only then can you worry about
> the format that you want to display it in.

Nothing in the proposal being considered addresses any of that.
 
