Significant digits in a float?

Roy Smith

I'm using Python 2.7

I have a bunch of floating point values. For example, here's a few (printed as reprs):

38.0
41.2586
40.75280000000001
49.25
33.795199999999994
36.837199999999996
34.1489
45.5

Fundamentally, these numbers have between 0 and 4 decimal digits of precision, and I want to be able to intuit how many each has, ignoring the obvious floating point roundoff problems. Thus, I want to map:

38.0 ==> 0
41.2586 ==> 4
40.75280000000001 ==> 4
49.25 ==> 2
33.795199999999994 ==> 4
36.837199999999996 ==> 4
34.1489 ==> 4
45.5 ==> 1

Is there any clean way to do that? The best I've come up with so far is to str() them and parse the remaining string to see how many digits it put after the decimal point.

The numbers are given to me as Python floats; I have no control over that. I'm willing to accept the fact that I won't be able to differentiate between float("38.0") and float("38.0000"). Both of those map to 1, which is OK for my purposes.
 
Steven D'Aprano

On Mon, 28 Apr 2014 12:00:23 -0400, Roy Smith wrote:

[...]
Fundamentally, these numbers have between 0 and 4 decimal digits of
precision,

I'm surprised that you have a source of data with variable precision,
especially one that varies by a factor of TEN THOUSAND. The difference
between 0 and 4 decimal digits is equivalent to measuring some lengths to
the nearest metre, some to the nearest centimetre, and some to the
nearest 0.1 of a millimetre. That's very unusual and I don't know what
justification you have for combining such a mix of data sources.

One possible interpretation of your post is that you have a source of
floats, where all the numbers are actually measured to the same
precision, and you've simply misinterpreted the fact that some of them
look like they have less precision. Since you indicate that 4 decimal
digits is the maximum, I'm going with 4 decimal digits. So if your data
includes the float 23.5, that's 23.5 measured to a precision of four
decimal places (that is, it's 23.5000, not 23.5001 or 23.4999).

On the other hand, if you're getting your values as *strings*, that's
another story. If you can trust the strings, they'll tell you how many
decimal places: "23.5" is only one decimal place, "23.5000" is four.

But then what to make of your later example?
40.75280000000001 ==> 4

Python floats (C doubles) are quite capable of distinguishing between
40.7528 and 40.75280000000001. They are distinct numbers:

py> 40.75280000000001 - 40.7528
7.105427357601002e-15

so if a number is recorded as 40.75280000000001 presumably it is because
it was measured as 40.75280000000001. (How that precision can be
justified, I don't know! Does it come from the Large Hadron Collider?) If
it were intended to be 40.7528, I expect it would have been recorded as
40.7528. What reason do you have to think that something recorded to 14
decimal places was only intended to have been recorded to 4?

Without knowing more about how your data is generated, I can't advise you
much, but the whole scenario as you have described it makes me think that
*somebody* is doing something wrong. Perhaps you need to explain why
you're doing this, as it seems numerically broken.

Is there any clean way to do that? The best I've come up with so far is
to str() them and parse the remaining string to see how many digits it
put after the decimal point.

I really think you need to go back to the source. Trying to infer the
precision of the measurements from the accident of the string formatting
seems pretty dubious to me.

But I suppose if you wanted to infer the number of digits after the
decimal place, excluding trailing zeroes (why, I do not understand), up
to a maximum of four digits, then you could do:

s = "%.4f" % number # rounds to four decimal places
s = s.rstrip("0") # ignore trailing zeroes, whether significant or not
count = len(s.split(".")[1])


Assuming all the numbers fit in the range where they are shown in non-
exponential format. If you have to handle numbers like 1.23e19 as well,
you'll have to parse the string more carefully. (Keep in mind that most
floats above a certain size are all integer-valued.)
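
A minimal sketch that wraps the three lines above in a function and checks it against the mapping from the original question. The name decimal_places and the max_places parameter are illustrative, not from the thread, and it uses partition() rather than split() so that max_places=0 does not fail. It assumes values in a range where fixed-point formatting is sensible, such as latitude/longitude.

```python
def decimal_places(number, max_places=4):
    """Count digits after the decimal point, ignoring trailing zeroes.

    The value is first rounded to max_places decimal places, so roundoff
    noise beyond that point (e.g. 40.75280000000001) is discarded.
    """
    s = "%.*f" % (max_places, number)   # fixed-point, never exponential
    fraction = s.partition(".")[2]      # text after the point, "" if none
    return len(fraction.rstrip("0"))    # "0000" -> "", "2500" -> "25"


samples = [38.0, 41.2586, 40.75280000000001, 49.25,
           33.795199999999994, 36.837199999999996, 34.1489, 45.5]
counts = [decimal_places(x) for x in samples]
print(counts)  # [0, 4, 4, 2, 4, 4, 4, 1]
```

As noted elsewhere in the thread, the result is only a lower bound: a value measured as 45.5000 is reported as having one decimal place.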

The numbers are given to me as Python floats; I have no control over
that.

If that's the case, what makes you think that two floats from the same
data set were measured to different precision? Given that you don't see
strings, only floats, I would say that your problem is unsolvable.
Whether I measure something to one decimal place and get 23.5, or four
decimal places and get 23.5000, the float you see will be the same.

Perhaps you ought to be using Decimal rather than float. Floats have a
fixed precision, while Decimals can be configured. Then the right way to
answer your question is to inspect the number:

py> from decimal import Decimal as D
py> x = D("23.5000")
py> x.as_tuple()
DecimalTuple(sign=0, digits=(2, 3, 5, 0, 0, 0), exponent=-4)

The number of decimal digits precision is -exponent.
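
A small sketch of that idea, assuming the original text of each number is still available (which Roy says is not the case for his data). The helper name places_from_text is made up for illustration, and it assumes finite inputs; NaN and infinity have non-numeric exponents.

```python
from decimal import Decimal


def places_from_text(text):
    """Decimal places as written in a numeric string, trailing zeroes included."""
    exponent = Decimal(text).as_tuple().exponent
    # The exponent is negative when there are digits after the point;
    # clamp at zero for inputs such as "1E+3" whose exponent is positive.
    return max(0, -exponent)


print(places_from_text("23.5000"))  # 4
print(places_from_text("23.5"))     # 1
print(places_from_text("38"))       # 0

# Once the text has been converted to a float, the distinction is gone.
print(float("23.5000") == float("23.5"))  # True
```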

I'm willing to accept the fact that I won't be able to differentiate
between float("38.0") and float("38.0000"). Both of those map to 1,
which is OK for my purposes.

That seems... well, "bizarre and wrong" are the only words that come to
mind. If I were recording data as "38.0000" and you told me I had
measured it to only one decimal place accuracy, I wouldn't be too
pleased. Maybe if I understood the context better?

How about 38.12 and 38.1200?

By the way, you contradict yourself here. Earlier, you described 38.0 as
having zero decimal places (which is wrong). Here you describe it as
having one, which is correct, and then in a later post you describe it as
having zero decimal places again.
 
Steven D'Aprano

I get the impression that this is at the core of the misunderstanding.
Having a number's representation ending in “….0” does not mean zero
decimal places; it has exactly one. The value's representation contains
the digit “0” after the decimal point, but that digit is significant to
the precision of the representation.

If the problem could be stated such that “38.0” and “38” and “38.000”
are consistently described with the correct number of decimal digits of
precision (in those examples: one, zero, and three), maybe the
discussion would make more sense.


It's actually trickier than that. Digits of precision can refer to
measurement error, or to the underlying storage type. Python floats are C
doubles, so they are 64 bits wide with 53 bits of precision (roughly 15
to 17 significant decimal digits) regardless of the precision of the
measurement. The OP (Roy) is, I think, trying to guess the measurement
precision after the fact, given a float. If the measurement error really
does differ from value to value, I don't think he'll have much luck:
given a float like 23.0, all we can say is that it has *at least* zero
significant decimal places. 23.1 has at least one, 23.1111 has at least
four.
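
The storage precision can be checked directly. This is a sketch assuming a platform where Python floats are IEEE 754 doubles, in which the significand accounts for 53 of the 64 bits.

```python
import sys

# Storage precision of a Python float (a C double on common platforms).
print(sys.float_info.mant_dig)  # significand bits; 53 for IEEE 754 doubles
print(sys.float_info.dig)       # decimal digits that always survive a
                                # text -> float -> text round trip

# 17 significant digits are enough to tell any two doubles apart, so the
# two neighbouring values from the thread print differently here.
a = "%.17g" % 40.7528
b = "%.17g" % 40.75280000000001
print(a)
print(b)
```

None of this says anything about the measurement precision, which is the quantity Roy is actually trying to recover.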

If you can put an upper bound on the precision, as Roy indicates he can,
then perhaps a reasonable approach is to convert to a string rounded to
four decimal places, then strip trailing zeroes:

py> x = 1234.1 # actual internal is closer to 1234.099999999999909
py> ("%.4f" % x).rstrip('0')
'1234.1'

then count the number of digits after the dot. (This assumes that the
string formatting routines are correctly rounded, which they should be on
*most* platforms.) But again, this only gives a lower bound to the number
of significant digits -- it's at least one, but might be more.
 
Roy Smith

Steven D'Aprano said:
On Mon, 28 Apr 2014 12:00:23 -0400, Roy Smith wrote:

[...]
Fundamentally, these numbers have between 0 and 4 decimal digits of
precision,

I'm surprised that you have a source of data with variable precision,
especially one that varies by a factor of TEN THOUSAND.

OK, you're surprised.
I don't know what justification you have for combining such a
mix of data sources.

Because that's the data that was given to me. Real life data is messy.
One possible interpretation of your post is that you have a source of
floats, where all the numbers are actually measured to the same
precision, and you've simply misinterpreted the fact that some of them
look like they have less precision.

Another possibility is that they're latitude/longitude coordinates, some
of which are given to the whole degree, some of which are given to
greater precision, all the way down to the ten-thousandth of a degree.
What reason do you have to think that something recorded to 14
decimal places was only intended to have been recorded to 4?

Because I understand the physical measurement these numbers represent.
Sometimes, Steve, you have to assume that when somebody asks a question,
they have actually asked the question they intended to ask.
Perhaps you need to explain why you're doing this, as it seems
numerically broken.

These are latitude and longitude coordinates of locations. Some
locations are known to a specific street address. Some are known to a
city. Some are only known to the country. So, for example, the 38.0
value represents the latitude, to the nearest whole degree, of the
geographic center of the contiguous United States.
I really think you need to go back to the source. Trying to infer the
precision of the measurements from the accident of the string formatting
seems pretty dubious to me.

Sure it is. But, like I said, real-life data is messy. You can wring
your hands and say, "this data sucks, I can't use it", or you can figure
out some way to deal with it. Which is the whole point of my post. The
best I've come up with is inferring something from the string formatting
and I'm hoping there might be something better I might do.
But I suppose if you wanted to infer the number of digits after the
decimal place, excluding trailing zeroes (why, I do not understand), up
to a maximum of four digits, then you could do:

s = "%.4f" % number # rounds to four decimal places
s = s.rstrip("0") # ignore trailing zeroes, whether significant or not
count = len(s.split(".")[1])

This at least seems a little more robust than just calling str(). Thank
you :)
Assuming all the numbers fit in the range where they are shown in non-
exponential format.

They're latitude/longitude, so they all fall into [-180, 180].
Perhaps you ought to be using Decimal rather than float.

Like I said, "The numbers are given to me as Python floats; I have no
control over that".
That seems... well, "bizarre and wrong" are the only words that come to
mind.

I'm trying to intuit, from the values I've been given, which coordinates
are likely to be accurate to within a few miles. I'm willing to accept
a few false negatives. If the number is float("38"), I'm willing to
accept that it might actually be float("38.0000"), and I might be
throwing out a good data point that I don't need to.

For the purpose I'm using the data for, excluding the occasional good
data point won't hurt me. Including the occasional bad one, will.
By the way, you contradict yourself here. Earlier, you described 38.0 as
having zero decimal places (which is wrong). Here you describe it as
having one, which is correct, and then in a later post you describe it as
having zero decimal places again.

I was sloppy there. I was copy-pasting data from my program output.
Observe:
38.0

In standard engineering parlance, the string "38" represents a number
with a precision of +/- 1 unit. Unfortunately, Python's default str()
representation turns this into "38.0", which implies +/- 0.1 unit.

Floats represented as strings (at least in some disciplines, such as
engineering) include more information than just the value. By the
number of trailing zeros, they also include information about the
precision of the measurement. That information is lost when the string
is converted to an IEEE float. I'm trying to intuit that information
back, and as I mentioned earlier, am willing to accept that the
intuiting process will be imperfect. There is real-life value in
imperfect processes.
 
Chris Angelico

I'm trying to intuit, from the values I've been given, which coordinates
are likely to be accurate to within a few miles. I'm willing to accept
a few false negatives. If the number is float("38"), I'm willing to
accept that it might actually be float("38.0000"), and I might be
throwing out a good data point that I don't need to.

You have one chance in ten, repeatably, of losing a digit. That is,
roughly 10% of your four-decimal figures will appear to be
three-decimal, and 1% of them will appear to be two-decimal, and so
on. Is that "a few" false negatives? It feels like a lot IMO. But
then, there's no alternative - the information's already gone.

ChrisA
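
The percentages above can be checked by brute force. This sketch assumes every four-digit fraction from .0000 to .9999 is equally likely, which real coordinate data may not satisfy, and works on digit strings so that no float roundoff is involved.

```python
from collections import Counter

# Tally how many decimals each four-decimal fraction appears to have
# once its trailing zeroes are stripped.
apparent = Counter()
for k in range(10000):
    digits = "%04d" % k                      # "0000" .. "9999"
    apparent[len(digits.rstrip("0"))] += 1

for places in sorted(apparent, reverse=True):
    print("%d decimals: %d of 10000" % (places, apparent[places]))
```

Under that assumption 9000 values keep all four decimals, 900 appear to have three, 90 two, 9 one and 1 none, so 10% of the values lose at least one digit.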
 
Ned Batchelder

You have one chance in ten, repeatably, of losing a digit. That is,
roughly 10% of your four-decimal figures will appear to be
three-decimal, and 1% of them will appear to be two-decimal, and so
on. Is that "a few" false negatives? It feels like a lot IMO. But
then, there's no alternative - the information's already gone.

Reminds me of the story that the first survey of Mt. Everest resulted in
a height of exactly 29,000 feet, but to avoid the appearance of an
estimate, they reported it as 29,002: http://www.jstor.org/stable/2684102
 
Adam Funk

Another possibility is that they're latitude/longitude coordinates, some
of which are given to the whole degree, some of which are given to
greater precision, all the way down to the ten-thousandth of a degree.

That makes sense. 1° of longitude is about 111 km at the equator,
78 km at 45°N or S, & 0 km at the poles.
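
Those figures follow from scaling the equatorial value by the cosine of the latitude. A rough sketch, assuming a spherical Earth with a mean radius of 6371 km (an approximation; the function name is illustrative):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical-Earth assumption


def km_per_degree_longitude(latitude_deg):
    """East-west length of one degree of longitude at the given latitude."""
    km_per_degree_at_equator = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree_at_equator * math.cos(math.radians(latitude_deg))


for lat in (0, 45, 90):
    print("%2d deg: %.1f km" % (lat, km_per_degree_longitude(lat)))

# Ground distance covered by the last decimal digit of a latitude value,
# taking one degree of latitude as roughly 111 km.
for places in range(5):
    print("%d decimals: about %.3f km" % (places, 111.0 / 10 ** places))
```

On this estimate, a coordinate given to the whole degree is good to about 111 km, and one given to four decimals to about 11 m.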


"A man pitches his tent, walks 1 km south, walks 1 km east, kills a
bear, & walks 1 km north, where he's back at his tent. What color is
the bear?" ;-)
 
Mark H Harris

"A man pitches his tent, walks 1 km south, walks 1 km east, kills a
bear, & walks 1 km north, where he's back at his tent. What color is
the bear?" ;-)

Who manufactured the tent?


marcus
 
Ryan Hiebert

"A man pitches his tent, walks 1 km south, walks 1 km east, kills a
bear, & walks 1 km north, where he's back at his tent. What color is
the bear?" ;-)


Skin or Fur?
 
emile

"A man pitches his tent, walks 1 km south, walks 1 km east, kills a
bear, & walks 1 km north, where he's back at his tent. What color is
the bear?" ;-)

From how many locations on Earth can someone walk one mile south, one
mile east, and one mile north and end up at their starting point?

Emile
 
Mark Lawrence

From how many locations on Earth can someone walk one mile south, one
mile east, and one mile north and end up at their starting point?

Emile

Haven't you heard of The Triangular Earth Society?
 
Roy Smith

Chris Angelico said:
You have one chance in ten, repeatably, of losing a digit. That is,
roughly 10% of your four-decimal figures will appear to be
three-decimal, and 1% of them will appear to be two-decimal, and so
on. Is that "a few" false negatives?

You're looking at it the wrong way. It's not that the glass is 10%
empty, it's that it's 90% full, and 90% is a lot of good data :)
 
Chris Angelico

You're looking at it the wrong way. It's not that the glass is 10%
empty, it's that it's 90% full, and 90% is a lot of good data :)

Hah! That's one way of looking at it.

At least you don't have to worry about junk digits getting in. The
greatest precision you're working with is three digits before the
decimal and four after, and a Python float can handle that easily.
(Which is what I was concerned about when I first queried your
terminology - four digits to the right of the decimal and, say, 10-12
to the left, and you're starting to see problems.)

ChrisA
 
Chris Angelico

The problem is you won't know *which* 90% is accurate, and which 10% is
inaccurate. This is very different from the glass, where it's evident
which part is good.

So, I can't see that you have any choice but to say that *any* of the
precision predictions should expect, on average, to be (10 + 1 + …)
percent inaccurate. And you can't know which ones. Is that an acceptable
error rate?

But they're all going to be *at least* as accurate as the algorithm
says. A figure of 31.4 will be treated as 1 decimal, even though it
might really have been accurate to 4; but a figure of 27.1828 won't be
incorrectly reported as having only 2 decimals.

ChrisA
 
Dennis Lee Bieber

does differ from value to value, I don't think he'll have much luck:
given a float like 23.0, all we can say is that it has *at least* zero
significant decimal places. 23.1 has at least one, 23.1111 has at least
four.
I wouldn't even give it that... Since internally they (ignore binary
conversion) translate into

2.30E1, 2.31E1, and 2.31111E1

I'd claim 3-significant digits, 3-significant digits, and 6-significant
digits. (Heck, as I recall classical FORTRAN, they would be 0.230E2...)
If you can put an upper bound on the precision, as Roy indicates he can,
then perhaps a reasonable approach is to convert to a string rounded to
four decimal places, then strip trailing zeroes:
That I'd agree with... once the data has been converted to binary
float, all knowledge of the source significant digits has been lost.

Then confuse matters with the fact that in a math class

1.1 * 2.2 => 2.42

but in a physics or chemistry class the recommended result is

1.1 * 2.2 => 2.4

(one reason slide-rules were acceptable for so long -- and even my high
school trig course only required slide-rule significance even though half
the class had scientific calculators [costing >$100, when a Sterling
slide-rule could still be had for <$10]) <G>
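
The two conventions can be reproduced by rounding to a number of significant figures rather than decimal places. A small sketch; round_sig is a made-up helper name, and it relies on the %g format rounding to the requested number of significant digits.

```python
def round_sig(value, sig_figs):
    """Round value to the given number of significant figures."""
    return float("%.*g" % (sig_figs, value))


product = 1.1 * 2.2           # a float slightly above 2.42
print(round_sig(product, 3))  # 2.42, the math-class answer
print(round_sig(product, 2))  # 2.4, limited by the two-figure inputs
```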
 
Dennis Lee Bieber

Any point where the mile east takes you an exact number of times
around the globe. So, anywhere exactly one mile north of that, which
is a number of circles not far from the south pole.
Yeah, but he'd have had to bring his own bear...

Bears and Penguins don't mix. Seals, OTOH, are food to the bears, and
eat the penguins.
 
Roy Smith

Dennis Lee Bieber said:
in a physics or chemistry class the recommended result is

1.1 * 2.2 => 2.4

More than recommended. In my physics class, if you put down more
significant digits than the input data justified, you got the problem
marked wrong.
(one reason slide-rules were acceptable for so long -- and even my high
school trig course only required slide-rule significance even though half
the class had scientific calculators [costing >$100, when a Sterling
slide-rule could still be had for <$10]) <G>

Sterling? Snort. K&E was the way to go.
 
Roy Smith

Adam Funk said:
That makes sense. 1° of longitude is about 111 km at the equator,
78 km at 45°N or S, & 0 km at the poles.


"A man pitches his tent, walks 1 km south, walks 1 km east, kills a
bear, & walks 1 km north, where he's back at his tent. What color is
the bear?" ;-)

Assuming he shot the bear, red.
 
