Significant digits in a float?

Discussion in 'Python' started by Roy Smith, Apr 28, 2014.

  1. Roy Smith

    Roy Smith Guest

    I'm using Python 2.7

    I have a bunch of floating point values. For example, here's a few
    (printed as reprs):

    38.0
    41.2586
    40.75280000000001
    49.25
    33.795199999999994
    36.837199999999996
    34.1489
    45.5

    Fundamentally, these numbers have between 0 and 4 decimal digits of
    precision, and I want to be able to intuit how many each has, ignoring
    the obvious floating point roundoff problems. Thus, I want to map:

    38.0 ==> 0
    41.2586 ==> 4
    40.75280000000001 ==> 4
    49.25 ==> 2
    33.795199999999994 ==> 4
    36.837199999999996 ==> 4
    34.1489 ==> 4
    45.5 ==> 1

    Is there any clean way to do that? The best I've come up with so far is to str() them and parse the remaining string to see how many digits it put after the decimal point.

    The numbers are given to me as Python floats; I have no control over that. I'm willing to accept that fact that I won't be able to differentiate between float("38.0") and float("38.0000"). Both of those map to 1, which is OK for my purposes.
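    The string-parsing idea can be wrapped up as a small helper; this is a
    minimal sketch (the function name is mine, not from the thread), assuming
    the precision never exceeds four decimal digits:

```python
def inferred_decimal_digits(x, max_digits=4):
    """Guess how many decimal digits a float was recorded with.

    Formatting to max_digits first hides binary roundoff noise such as
    33.795199999999994; trailing zeroes are then stripped, so 38.0000
    and 38.0 are indistinguishable (a known, accepted limitation).
    """
    s = "%.*f" % (max_digits, x)    # e.g. 38.0 -> '38.0000'
    s = s.rstrip("0")               # -> '38.'
    return len(s.partition(".")[2]) # digits left after the point

# 38.0 maps to 0, 49.25 to 2, 45.5 to 1, 40.75280000000001 to 4.
```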
    Roy Smith, Apr 28, 2014

  2. On Mon, 28 Apr 2014 12:00:23 -0400, Roy Smith wrote:

    I'm surprised that you have a source of data with variable precision,
    especially one that varies by a factor of TEN THOUSAND. The difference
    between 0 and 4 decimal digits is equivalent to measuring some lengths to
    the nearest metre, some to the nearest centimetre, and some to the
    nearest 0.1 of a millimetre. That's very unusual and I don't know what
    justification you have for combining such a mix of data sources.

    One possible interpretation of your post is that you have a source of
    floats, where all the numbers are actually measured to the same
    precision, and you've simply misinterpreted the fact that some of them
    look like they have less precision. Since you indicate that 4 decimal
    digits is the maximum, I'm going with 4 decimal digits. So if your data
    includes the float 23.5, that's 23.5 measured to a precision of four
    decimal places (that is, it's 23.5000, not 23.5001 or 23.4999).

    On the other hand, if you're getting your values as *strings*, that's
    another story. If you can trust the strings, they'll tell you how many
    decimal places: "23.5" is only one decimal place, "23.5000" is four.

    But then what to make of your later example?
    Python floats (C doubles) are quite capable of distinguishing between
    40.7528 and 40.75280000000001. They are distinct numbers:

    py> 40.75280000000001 - 40.7528
    7.105427357601002e-15
    so if a number is recorded as 40.75280000000001 presumably it is because
    it was measured as 40.75280000000001. (How that precision can be
    justified, I don't know! Does it come from the Large Hadron Collider?) If
    it were intended to be 40.7528, I expect it would have be recorded as
    40.7528. What reason do you have to think that something recorded to 14
    decimal places was only intended to have been recorded to 4?

    Without knowing more about how your data is generated, I can't advise you
    much, but the whole scenario as you have described it makes me think that
    *somebody* is doing something wrong. Perhaps you need to explain why
    you're doing this, as it seems numerically broken.

    I really think you need to go back to the source. Trying to infer the
    precision of the measurements from the accident of the string formatting
    seems pretty dubious to me.

    But I suppose if you wanted to infer the number of digits after the
    decimal place, excluding trailing zeroes (why, I do not understand), up
    to a maximum of four digits, then you could do:

    s = "%.4f" % number # rounds to four decimal places
    s = s.rstrip("0") # ignore trailing zeroes, whether significant or not
    count = len(s.split(".")[1])

    Assuming all the numbers fit in the range where they are shown in non-
    exponential format. If you have to handle numbers like 1.23e19 as well,
    you'll have to parse the string more carefully. (Keep in mind that most
    floats above a certain size are all integer-valued.)

    If that's the case, what makes you think that two floats from the same
    data set were measured to different precision? Given that you don't see
    strings, only floats, I would say that your problem is unsolvable.
    Whether I measure something to one decimal place and get 23.5, or four
    decimal places and get 23.5000, the float you see will be the same.

    Perhaps you ought to be using Decimal rather than float. Floats have a
    fixed precision, while Decimals can be configured. Then the right way to
    answer your question is to inspect the number:

    py> from decimal import Decimal as D
    py> x = D("23.5000")
    py> x.as_tuple()
    DecimalTuple(sign=0, digits=(2, 3, 5, 0, 0, 0), exponent=-4)

    The number of decimal digits precision is -exponent.
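    For instance (a quick sketch extending the example above), -exponent
    tracks the recorded precision for each value, as long as the values
    arrive as strings:

```python
from decimal import Decimal

# Precision survives parsing with Decimal, unlike conversion to float:
# "38.0" and "38.0000" become distinct Decimal objects.
for s in ("38", "38.0", "38.0000", "41.2586"):
    print(s, "->", -Decimal(s).as_tuple().exponent)
# 38 -> 0, 38.0 -> 1, 38.0000 -> 4, 41.2586 -> 4
```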

    That seems... well, "bizarre and wrong" are the only words that come to
    mind. If I were recording data as "38.0000" and you told me I had
    measured it to only one decimal place accuracy, I wouldn't be too
    pleased. Maybe if I understood the context better?

    How about 38.12 and 38.1200?

    By the way, you contradict yourself here. Earlier, you described 38.0 as
    having zero decimal places (which is wrong). Here you describe it as
    having one, which is correct, and then in a later post you describe it as
    having zero decimal places again.
    Steven D'Aprano, Apr 29, 2014

  3. It's actually trickier than that. Digits of precision can refer to
    measurement error, or to the underlying storage type. Python floats are C
    doubles, so they have 53 bits of significand precision (roughly 15 to 17
    significant decimal digits) regardless of the precision of the
    measurement. The OP (Roy) is, I think, trying to guess the measurement
    precision after the fact, given a float. If the measurement error really
    does differ from value to value, I don't think he'll have much luck:
    given a float like 23.0, all we can say is that it has *at least* zero
    significant decimal places. 23.1 has at least one, 23.1111 has at least
    four.

    If you can put an upper bound on the precision, as Roy indicates he can,
    then perhaps a reasonable approach is to convert to a string rounded to
    four decimal places, then strip trailing zeroes:

    py> x = 1234.1 # actual internal value is closer to 1234.099999999999909
    py> ("%.4f" % x).rstrip('0')
    '1234.1'
    then count the number of digits after the dot. (This assumes that the
    string formatting routines are correctly rounded, which they should be on
    *most* platforms.) But again, this only gives a lower bound to the number
    of significant digits -- it's at least one, but might be more.
    Steven D'Aprano, Apr 29, 2014
  4. Roy Smith

    Roy Smith Guest

    OK, you're surprised.

    Because that's the data that was given to me. Real-life data is messy.
    Another possibility is that they're latitude/longitude coordinates, some
    of which are given to the whole degree, some of which are given to
    greater precision, all the way down to the ten-thousandth of a degree.

    Because I understand the physical measurement these numbers represent.
    Sometimes, Steve, you have to assume that when somebody asks a question,
    they actually asked the question they intended to ask.

    These are latitude and longitude coordinates of locations. Some
    locations are known to a specific street address. Some are known to a
    city. Some are only known to the country. So, for example, the 38.0
    value represents the latitude, to the nearest whole degree, of the
    geographic center of the contiguous United States.

    Sure it is. But, like I said, real-life data is messy. You can wring
    your hands and say, "this data sucks, I can't use it", or you can figure
    out some way to deal with it. Which is the whole point of my post. The
    best I've come up with is inferring something from the string formatting,
    and I'm hoping there might be something better I can do.

    This at least seems a little more robust than just calling str(). Thank
    you :)

    They're latitude/longitude, so they all fall into [-180, 180].

    Like I said, "The numbers are given to me as Python floats; I have no
    control over that".
    I'm trying to intuit, from the values I've been given, which coordinates
    are likely to be accurate to within a few miles. I'm willing to accept
    a few false negatives. If the number is float("38"), I'm willing to
    accept that it might actually be float("38.0000"), and I might be
    throwing out a good data point that I don't need to.

    For the purpose I'm using the data for, excluding the occasional good
    data point won't hurt me. Including the occasional bad one, will.
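    As a rough sketch of what those digit counts mean on the ground (my own
    back-of-envelope numbers, assuming ~111 km per degree, which holds for
    latitude but overstates longitude away from the equator):

```python
# Worst-case position error for a coordinate recorded to d decimal
# digits, assuming roughly 111 km per degree.
KM_PER_DEGREE = 111.0

def worst_case_error_km(d):
    # A value recorded to d decimals is within half a unit in the
    # last recorded place.
    return KM_PER_DEGREE * 0.5 * 10 ** -d

for d in range(5):
    print(d, "digits ->", worst_case_error_km(d), "km")
# 0 digits is ~55 km off at worst; 4 digits is ~5.5 m.
```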
    I was sloppy there. I was copy-pasting data from my program output.

    In standard engineering parlance, the string "38" represents a number
    with a precision of +/- 1 unit. Unfortunately, Python's default str()
    representation turns this into "38.0", which implies +/- 0.1 unit.

    Floats represented as strings (at least in some disciplines, such as
    engineering) include more information than just the value. By the
    number of trailing zeros, they also include information about the
    precision of the measurement. That information is lost when the string
    is converted to an IEEE float. I'm trying to intuit that information
    back, and as I mentioned earlier, am willing to accept that the
    intuiting process will be imperfect. There is real-life value in
    imperfect processes.
    Roy Smith, Apr 29, 2014
  5. You have one chance in ten, repeatably, of losing a digit. That is,
    roughly 10% of your four-decimal figures will appear to be
    three-decimal, and 1% of them will appear to be two-decimal, and so
    on. Is that "a few" false negatives? It feels like a lot IMO. But
    then, there's no alternative - the information's already gone.
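    Those percentages can be sanity-checked by enumerating every possible
    four-digit fractional ending (a quick sketch, not from the thread):

```python
# Of the 10000 possible 4-digit fractional endings, those ending in
# exactly one zero look like 3-decimal values; endings in two or more
# zeroes look like 2 decimals or fewer.
looks_three = sum(1 for e in range(10000) if e % 10 == 0 and e % 100 != 0)
looks_two_or_fewer = sum(1 for e in range(10000) if e % 100 == 0)
print(looks_three / 10000.0)         # 0.09, roughly the 10% above
print(looks_two_or_fewer / 10000.0)  # 0.01
```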

    Chris Angelico, Apr 29, 2014
  6. Reminds me of the story that the first survey of Mt. Everest resulted in
    a height of exactly 29,000 feet, but to avoid the appearance of an
    estimate, they reported it as 29,002.
    Ned Batchelder, Apr 29, 2014
  7. Adam Funk

    Adam Funk Guest

    That makes sense. 1° of longitude is about 111 km at the equator,
    78 km at 45°N or S, & 0 km at the poles.

    "A man pitches his tent, walks 1 km south, walks 1 km east, kills a
    bear, & walks 1 km north, where he's back at his tent. What color is
    the bear?" ;-)
    Adam Funk, Apr 29, 2014
  8. Who manufactured the tent?

    Mark H Harris, Apr 29, 2014
  9. Ryan Hiebert

    Ryan Hiebert Guest

    Skin or Fur?
    Ryan Hiebert, Apr 29, 2014
  10. A man pitches his tent 1 km south and kills a bear with it. Clearly
    that wasn't a tent, it was a cricket ball.

    Chris Angelico, Apr 29, 2014
  11. They could have said it was 29.000 kilofeet.
    Gregory Ewing, Apr 29, 2014
  12. emile

    emile Guest

    From how many locations on Earth can someone walk one mile south, one
    mile east, and one mile north and end up at their starting point?

    emile, Apr 29, 2014
  13. Haven't you heard of The Triangular Earth Society?
    Mark Lawrence, Apr 30, 2014
  14. Roy Smith

    Roy Smith Guest

    You're looking at it the wrong way. It's not that the glass is 10%
    empty, it's that it's 90% full, and 90% is a lot of good data :)
    Roy Smith, Apr 30, 2014
  15. Hah! That's one way of looking at it.

    At least you don't have to worry about junk digits getting in. The
    greatest precision you're working with is three digits before the
    decimal and four after, and a Python float can handle that easily.
    (Which is what I was concerned about when I first queried your
    terminology - four digits to the right of the decimal and, say, 10-12
    to the left, and you're starting to see problems.)

    Chris Angelico, Apr 30, 2014
  16. But they're all going to be *at least* as accurate as the algorithm
    says. A figure of 31.4 will be treated as 1 decimal, even though it
    might really have been accurate to 4; but a figure of 27.1828 won't be
    incorrectly reported as having only 2 decimals.

    Chris Angelico, Apr 30, 2014
  17. I wouldn't even give it that... Since internally they (ignoring binary
    conversion) translate into

    2.30E1, 2.31E1, and 2.31111E1

    I'd claim 3-significant digits, 3-significant digits, and 6-significant
    digits. (Heck, as I recall classical FORTRAN, they would be 0.230E2...)
    That I'd agree with... once the data has been converted to binary
    float, all knowledge of the source significant digits has been lost.

    Then confuse matters with the facet that in a math class

    1.1 * 2.2 => 2.42

    but in a physics or chemistry class the recommended result is

    1.1 * 2.2 => 2.4

    (one reason slide-rules were acceptable for so long -- and even my high
    school trig course only required slide-rule significance even though half
    the class had scientific calculators [costing >$100, when a Sterling
    slide-rule could still be had for <$10]) <G>
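    The physics-class convention can be sketched as rounding a result to the
    number of significant digits its inputs justify (the helper and its name
    are mine, not from the thread):

```python
from math import floor, log10

def round_sigfigs(x, n):
    # Round x to n significant digits.
    if x == 0:
        return 0.0
    return round(x, n - 1 - int(floor(log10(abs(x)))))

print(round_sigfigs(1.1 * 2.2, 2))  # 2.4200000000000004 -> 2.4
```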
    Dennis Lee Bieber, Apr 30, 2014
  18. Yeah, but he'd have had to bring his own bear...

    Bears and Penguins don't mix. Seals, OTOH, are food to the bears, and
    eat the penguins.
    Dennis Lee Bieber, Apr 30, 2014
  19. Roy Smith

    Roy Smith Guest

    More than recommended. In my physics class, if you put down more
    significant digits than the input data justified, you got the problem
    marked wrong.
    Sterling? Snort. K&E was the way to go.
    Roy Smith, Apr 30, 2014
  20. Roy Smith

    Roy Smith Guest

    Assuming he shot the bear, red.
    Roy Smith, Apr 30, 2014
