Lists: Converting Double to Single

rshepard · Feb 26, 2007

I start with a list of tuples retrieved from a database table. These
tuples are extracted and put into individual lists. So I have lists that
look like this: [1, 2, 3, 4, 5]. When I concatenate lists, I end up with a
list of lists that looks like this: [[1, 2, 3. 4, 5]. [6, 7. 8, 9. 10]].
Then, I average the column values so I end up with a single list, but with
two brackets on each end, for example, [[3.5, 4.5, 5.5, 6.5, 7.5]].

Unfortunately, when I try to use that last list in a NumPy function, I'm
told that it cannot be broadcast to the correct shape. So, what I want to do
is strip the extra brackes from each end to leave just [3.5, 4.5, 5.5, 6.5,
7.5].

How do I do this, please?

Rich

Leif K-Brooks · Feb 26, 2007

So I have lists that look like this: [1, 2, 3, 4, 5]. When I
concatenate lists, I end up with a list of lists that looks like
this: [[1, 2, 3. 4, 5]. [6, 7. 8, 9. 10]].
Really?

>>> [1, 2, 3, 4, 5] + [6, 7, 8, 9, 10]

Click to expand...

Click to expand...

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Then, I average the column values so I end up with a single list, but
with two brackets on each end, for example, [[3.5, 4.5, 5.5, 6.5,
7.5]].

Unfortunately, when I try to use that last list in a NumPy function,
I'm told that it cannot be broadcast to the correct shape. So, what I
want to do is strip the extra brackes from each end to leave just
[3.5, 4.5, 5.5, 6.5, 7.5].

l = l[0]

Bjoern Schliessmann · Feb 26, 2007

I end up with a single list, but with two brackets on each end,
for example, [[3.5, 4.5, 5.5, 6.5, 7.5]].

Unfortunately, when I try to use that last list in a NumPy
function, I'm
told that it cannot be broadcast to the correct shape. So, what I
want to do is strip the extra brackes from each end to leave just
[3.5, 4.5, 5.5, 6.5, 7.5].

How do I do this, please?

Easy

I'll show you by example:

ham = [[1,2,3],[4,5,6]]
ham [[1, 2, 3], [4, 5, 6]]
ham[0] [1, 2, 3]
ham.remove([4,5,6])
ham[0] [1, 2, 3]
ham [[1, 2, 3]]

Click to expand...

Click to expand...

HTH&Regards,

Björn

rshepard · Feb 26, 2007

l = l[0]

Of course! If I had let it work in my mind overnight I would almost
certainly have seen this.

Thank you both for your patient responses,

Rich

Scott David Daniels · Feb 27, 2007

Leif said:
...
Unfortunately, when I try to use that last list in a NumPy function,
I'm told that it cannot be broadcast to the correct shape. So, what I
want to do is strip the extra brackes from each end to leave just
[3.5, 4.5, 5.5, 6.5, 7.5].

Click to expand...

l = l[0]

Or, if you want to simultaneously assert there is only one list,
use unpacking like:
[l] = l

Although I'd prefer to use a better name than l; something like:
lst = [[3.5, 4.5, 5.5, 6.5, 7.5]]
[inner] = lst
print inner, lst

Dennis Lee Bieber · Feb 27, 2007

I start with a list of tuples retrieved from a database table. These
tuples are extracted and put into individual lists. So I have lists that
look like this: [1, 2, 3, 4, 5]. When I concatenate lists, I end up with a
list of lists that looks like this: [[1, 2, 3. 4, 5]. [6, 7. 8, 9. 10]].
Then, I average the column values so I end up with a single list, but with
two brackets on each end, for example, [[3.5, 4.5, 5.5, 6.5, 7.5]].

Why create a list of lists in the first place? Initialize a
summation list (of the length matching the data records retrieved), then
just add each tuple positionwise to the summation (keep count of tuples
for the division).

Something like (pseudo-code):

cnt = 0
for rw in cursor():
if cnt:
for i,v in enumerate(rw):
sum += v #accumulate next row
else:
sum = rw #initialize to first row
cnt += 1
avg = [ float(v) / cnt for v in sum]

Don't know if this is faster than numpy operations, but it
definitely reduces the amount of memory consumed by that list of
lists...
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/

Duncan Booth · Feb 27, 2007

Dennis Lee Bieber said:
Something like (pseudo-code):

cnt = 0
for rw in cursor():
if cnt:
for i,v in enumerate(rw):
sum += v #accumulate next row
else:
sum = rw #initialize to first row
cnt += 1
avg = [ float(v) / cnt for v in sum]

Don't know if this is faster than numpy operations, but it
definitely reduces the amount of memory consumed by that list of
lists...

A lot of course depends on the input data, and the example given wasn't
even syntactically valid so I can't tell if he started with integers or
floats, but if the lists are long and the OP is at all worried about
accuracy, this code may be a bad idea. Adding up a long list of values
and then dividing by the number of values is the classic computer
science example of how to get an inaccurate answer from a floating point
calculation.

I wouldn't have mentioned it, but since the OP said numpy it might just
be that he cares about the result. What I really don't understand though
is why, if he wants the final result in NumPy, he doesn't just use it
for the calculation?

from numpy import array
lst = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
array(lst).mean(axis=0)

Click to expand...

Click to expand...

Click to expand...

array([ 3.5, 4.5, 5.5, 6.5, 7.5])

What I haven't tried to figure out is whether, having said all that,
NumPy's mean is any more accurate than the naive sum and divide
calculation? So far as I can tell, it isn't any more accurate although
it does give different results.

Steven D'Aprano · Feb 27, 2007

Adding up a long list of values
and then dividing by the number of values is the classic computer
science example of how to get an inaccurate answer from a floating point
calculation.

I'm not entirely ignorant when it comes to computational mathematics, but
I must admit this is a new one to me.

If the values vary greatly in magnitude, you probably want to add them
from smallest to biggest; other than that, how else can you calculate the
mean?

The only alternative I thought of was to divide each value by the count
before summing, but that would be horribly inaccurate.

Or would it?
.... return sum(args)/len(args)
........ n = len(args)
.... return sum([x/n for x in args])
....

L = range(25, 597) # 572 values
L = [x/3.3 for x in L]

mean1(*L) 94.090909090909108
mean2(*L)

Click to expand...

Click to expand...

94.090909090909108

The first calculation has 571 additions and one division; the second
calculation has 571 additions and 572 divisions, but they both give the
same result to 15 decimal places.

Neil Cerutti · Feb 27, 2007

Adding up a long list of values and then dividing by the
number of values is the classic computer science example of
how to get an inaccurate answer from a floating point
calculation.

Click to expand...

I'm not entirely ignorant when it comes to computational
mathematics, but I must admit this is a new one to me.

If the values vary greatly in magnitude, you probably want to add them
from smallest to biggest; other than that, how else can you calculate the
mean?

The only alternative I thought of was to divide each value by the count
before summing, but that would be horribly inaccurate.

Or would it?
... return sum(args)/len(args)
...... n = len(args)
... return sum([x/n for x in args])
...

L = range(25, 597) # 572 values
L = [x/3.3 for x in L]

mean1(*L) 94.090909090909108
mean2(*L)

Click to expand...

Click to expand...

94.090909090909108

The first calculation has 571 additions and one division; the
second calculation has 571 additions and 572 divisions, but
they both give the same result to 15 decimal places.

How about this?

L = range(5, 10)
L = [pow(2, -x) for x in L]
"%40.40f" % mean1(*L) '0.0121093750000000000000000000000000000000'
"%40.40f" % mean2(*L)

Click to expand...

Click to expand...

'0.0121093750000000020000000000000000000000'

Offhand, I think the first is "righter". Weird!

Dennis Lee Bieber · Feb 27, 2007

I'm not entirely ignorant when it comes to computational mathematics, but
I must admit this is a new one to me.

If the values vary greatly in magnitude, you probably want to add them
from smallest to biggest; other than that, how else can you calculate the
mean?

I focused primarily upon the "database" result set as the source of
the data. To me, that implied there could be a LARGE set of data whose
length is unknown at the start and, rather than have to allocate Python
lists for all that data before computing an average, the incremental
summation would minimize storage...

Of course, if I'd been fully aware at the time, I'd have just
suggested changing the database query to use aggregate functions, and
let the DBMS compute the averages <G>.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/

Duncan Booth · Feb 27, 2007

Steven D'Aprano said:
If the values vary greatly in magnitude, you probably want to add them
from smallest to biggest; other than that, how else can you calculate the
mean?

It doesn't have to be smallest to largest, but the important thing is not
to be adding the millionth element to the sum of the preceding 999999
values. So if I remember correctly one option is to add pairs of numbers,
then sum the pairs and so on.

I think my point really is that nobody is going to bother doing this when
they just want to sum a few numbers, and I'm not really surprised that the
builtin sum doesn't do it either, but I was slightly surprised that numpy
doesn't try to minimise loss of precision.

Here's a quick attempt at implementing it which demonstrates how the
builtin sum loses precision somewhat faster than a pairwise sum (and is
rather dependant on the input order). sumpair is of course a lot slower
than the builtin sum, but it might be interesting to see how a C coded
version compared. I can't see that it would necessarily be much slower: it
just adds some counters, bit twiddling and a short temporary array to the
calculation.

C:\temp>sumpairs.py
builtin sum 500000.00000083662
pairwise sum 500000.00000499998
sorted
builtin sum 500000.00000499998
pairwise sum 500000.00000499998
reverse sorted
builtin sum 500000.0
pairwise sum 500000.00000500004
all the same
builtin sum 1000000.0000016732
pairwise sum 1000000.00001

------------ sumpairs.py ------------------------------------------
# Sum implemented so that intermediate results sum a similar number
# of the input values.
def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n > 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

tmp.reverse()
return sum(tmp)

v = [1000000, 0.00001] * 1000000
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))

print "sorted"
v.sort()
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))

print "reverse sorted"
v.reverse()
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))

print "all the same"
v = [1000000.00001] * 1000000
print "builtin sum ", repr(sum(v)/len(v))
print "pairwise sum", repr(sumpairs(v)/len(v))-----------------------------
----------------------------------

In case the code is a bit hard to follow this version makes it clearer what
is happening:

------------- sumpair2.py ---------------
def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
print "add ", tmp
i = i + 1
n = i & -i
while n > 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
print "reduce", tmp
else:
tmp.append(v)
print "append", tmp

tmp.reverse()
return sum(tmp)

print sumpairs([1]*40)
-----------------------------------------
which gives the output:

append [1]
add [2]
append [2, 1]
add [2, 2]
reduce [4]
append [4, 1]
add [4, 2]
append [4, 2, 1]
add [4, 2, 2]
reduce [4, 4]
reduce [8]
append [8, 1]
add [8, 2]
append [8, 2, 1]
add [8, 2, 2]
reduce [8, 4]
append [8, 4, 1]
add [8, 4, 2]
append [8, 4, 2, 1]
add [8, 4, 2, 2]
reduce [8, 4, 4]
reduce [8, 8]
reduce [16]
append [16, 1]
add [16, 2]
append [16, 2, 1]
add [16, 2, 2]
reduce [16, 4]
append [16, 4, 1]
add [16, 4, 2]
append [16, 4, 2, 1]
add [16, 4, 2, 2]
reduce [16, 4, 4]
reduce [16, 8]
append [16, 8, 1]
add [16, 8, 2]
append [16, 8, 2, 1]
add [16, 8, 2, 2]
reduce [16, 8, 4]
append [16, 8, 4, 1]
add [16, 8, 4, 2]
append [16, 8, 4, 2, 1]
add [16, 8, 4, 2, 2]
reduce [16, 8, 4, 4]
reduce [16, 8, 8]
reduce [16, 16]
reduce [32]
append [32, 1]
add [32, 2]
append [32, 2, 1]
add [32, 2, 2]
reduce [32, 4]
append [32, 4, 1]
add [32, 4, 2]
append [32, 4, 2, 1]
add [32, 4, 2, 2]
reduce [32, 4, 4]
reduce [32, 8]
40

Dennis Lee Bieber · Feb 28, 2007

It doesn't have to be smallest to largest, but the important thing is not
to be adding the millionth element to the sum of the preceding 999999
values. So if I remember correctly one option is to add pairs of numbers,
then sum the pairs and so on.

For floating point, smallest magnitude to largest IS the most
precise.

Pretend you only have 2 significant digits of precision...

10 + 0.4 + 0.4 + 0.4 => 10

0.4 + 0.4 + 0.4 + 10 => 11
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/

Duncan Booth · Feb 28, 2007

Dennis Lee Bieber said:
For floating point, smallest magnitude to largest IS the most
precise.

Pretend you only have 2 significant digits of precision...

10 + 0.4 + 0.4 + 0.4 => 10

0.4 + 0.4 + 0.4 + 10 => 11

and if you try the way I suggested then initial input order doesn't
matter:

(10 + 0.4) = 10, (0.4 + 0.4) = 0.8, (10 + 0.8) = 11
(0.4 + 0.4) = 0.8, (10 + 0.4) = 10, (0.8 + 10) = 11

Pretend you ran the example code I posted. Specifically the bit where
the output is:

all the same
builtin sum 1000000.0000016732
pairwise sum 1000000.00001

It isn't so much the order of the initial values that matters
(especially if the values are all similar). What *really* matters is
keeping the magnitude of the intermediate results as small as possible.

Summing in pairs ensures that you keep the precision as long as
possible. The order of the initial values is less important than this
(although it does still have a minor effect). For a solution which works
with a sequence of unknown length and doesn't require lookahead I don't
think you can do much better than that.

BTW, in case someone thinks my example numbers are a bit too weird you
can show the same problem averaging a list of 10 digit floating point
numbers with exact representations:

v = [9999999999.] * 10000000
print "builtin sum ", (sum(v)/len(v))
print "pairwise sum", (sumpairs(v)/len(v))

gives:
builtin sum 9999999999.91
pairwise sum 9999999999.0

In both cases the total is large enough to go beyond the range of
integers that can be expressed exactly in floating point and as soon as
that happens the builtin sum loses precision on every subsequent
addition. pairwise summation only loses precision on a very few of the
additions.

P.S. Apologies for the sloppy coding in the sumpairs function. It should
of course do the final addition manually otherwise it is going to give
odd results summing lists or strings or anything else where the order
matters. Revised version:

def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n > 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

while len(tmp) > 1:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
return tmp[0]

Jussi Salmela · Mar 1, 2007

Duncan Booth kirjoitti:

Dennis Lee Bieber said:
Dennis Lee Bieber said:

For floating point, smallest magnitude to largest IS the most
precise.

Pretend you only have 2 significant digits of precision...

10 + 0.4 + 0.4 + 0.4 => 10

0.4 + 0.4 + 0.4 + 10 => 11

Click to expand...

and if you try the way I suggested then initial input order doesn't
matter:

(10 + 0.4) = 10, (0.4 + 0.4) = 0.8, (10 + 0.8) = 11
(0.4 + 0.4) = 0.8, (10 + 0.4) = 10, (0.8 + 10) = 11

Pretend you ran the example code I posted. Specifically the bit where
the output is:

all the same
builtin sum 1000000.0000016732
pairwise sum 1000000.00001

It isn't so much the order of the initial values that matters
(especially if the values are all similar). What *really* matters is
keeping the magnitude of the intermediate results as small as possible.

Summing in pairs ensures that you keep the precision as long as
possible. The order of the initial values is less important than this
(although it does still have a minor effect). For a solution which works
with a sequence of unknown length and doesn't require lookahead I don't
think you can do much better than that.

BTW, in case someone thinks my example numbers are a bit too weird you
can show the same problem averaging a list of 10 digit floating point
numbers with exact representations:

v = [9999999999.] * 10000000
print "builtin sum ", (sum(v)/len(v))
print "pairwise sum", (sumpairs(v)/len(v))

gives:
builtin sum 9999999999.91
pairwise sum 9999999999.0

In both cases the total is large enough to go beyond the range of
integers that can be expressed exactly in floating point and as soon as
that happens the builtin sum loses precision on every subsequent
addition. pairwise summation only loses precision on a very few of the
additions.

P.S. Apologies for the sloppy coding in the sumpairs function. It should
of course do the final addition manually otherwise it is going to give
odd results summing lists or strings or anything else where the order
matters. Revised version:

def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n > 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

while len(tmp) > 1:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
return tmp[0]

Warning: I'm no floating point guru! Awfully long post following!

I've run a couple of tests and it seems to me that Dennis Lee Bieber is
on the trail of the truth when he claims that smallest magnitude to
the largest is the way to do the summation. Actually it isn't THE way
although it diminishes the error. I was sort of astonished because I
also had somewhere along the years formed the same notion as Dennis.

"Your" pairwise method beats the former method by a large margin
although I don't understand why. To tell you the truth: I started out to
show you were wrong because intuitively (for me at least) the former
method should work better than yours.

I also found a method called "Kahan summation algorithm" which probably
is the best way to do this.

What have I done then? I used your examples and code and added a
straight summation and the Kahan method using the pseudocode presented
in the Wikipedia article. I also made an experiment using randomized
floats merged wildly into a crazy list.

The results (as I interpret them):
1. The straight summation is the same method that the builtin sum uses
because their results are equal(ly poor).
2. Sorting to increasing order helps these two methods to have better
accuracy.
3. The pairwise method is almost as good as the Kahan method.

Here's the code and the results:

=========================================================================
CODE:
=========================================================================
import random

def sumpairs(seq):
tmp = []
for i,v in enumerate(seq):
if i&1:
tmp[-1] += v
i = i + 1
n = i & -i
while n > 2:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
n >>= 1
else:
tmp.append(v)

while len(tmp) > 1:
t = tmp.pop(-1)
tmp[-1] = tmp[-1] + t
return tmp[0]

def straightSum(seq):
result = 0.0
for elem in seq: result += elem
return result

def kahanSum(input):
sum = input[0]
n = len(input)
c = 0.0 # A running compensation for lost low-order
bits.
for i in range(1, n):
y = input - c # So far, so good: c is zero.
t = sum + y # Alas, sum is big, y small,
# so low-order digits of y are lost.
c = (t - sum) - y # (t - sum) recovers the high-order part of y;
# subtracting y recovers -(low part of y)
sum = t # Algebraically, c should always be zero.
# Beware eagerly optimising compilers!
continue # Next time around, the lost low part will be
# added to y in a fresh attempt.
return sum

print '\n======================> v = [1000000, 0.00001] * 1000000'
v = [1000000, 0.00001] * 1000000
print "CORRECT ", '500000.000005 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "sorted"
v.sort()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "reverse sorted"
v.reverse()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================> v = [1000000]*1000000 + [0.00001]*1000000'
v = [1000000]*1000000 + [0.00001]*1000000
print "CORRECT ", '500000.000005 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================> v = [0.00001]*1000000 + [1000000]*1000000'
v = [0.00001]*1000000 + [1000000]*1000000
print "CORRECT ", '500000.000005 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================> v = [1000000.00001] * 1000000'
v = [1000000.00001] * 1000000
print "CORRECT ", '1000000.00001 <========'
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print '\n======================> Randomized lists'
print '========> lst1 = [random.random() for i in range(1000000)]'
print '========> lst2 = [1000000*random.random() for i in range(1000000)]'
N = 100000
lst1 = [random.random() for i in range(N)]
lst2 = [1000000.0 + random.random() for i in range(N)]
v = []
for i in range(N):
v.append(lst1)
v.append(lst2)
v.append(lst2)
v.append(lst2)
v.append(lst1)
v.append(lst2)
v.append(lst2)
v.append(lst2)
v.append(lst2)
print '========> v = "Crazy merge of lst1 and lst2'
print 'min, max'
print 'lst1:', min(lst1), max(lst1)
print 'lst2:', min(lst2), max(lst2)
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "sorted"
v.sort()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))

print "reverse sorted"
v.reverse()
print "builtin ", repr(sum(v)/len(v))
print "straight", repr(straightSum(v)/len(v))
print "pairwise", repr(sumpairs(v)/len(v))
print "Kahan ", repr(kahanSum(v)/len(v))
=========================================================================
RESULTS:
=========================================================================

======================> v = [1000000, 0.00001] * 1000000
CORRECT 500000.000005 <========
builtin 500000.00000083662
straight 500000.00000083662
pairwise 500000.00000499998
Kahan 500000.00000499998
sorted
builtin 500000.00000499998
straight 500000.00000499998
pairwise 500000.00000499998
Kahan 500000.00000499998
reverse sorted
builtin 500000.0
straight 500000.0
pairwise 500000.00000500004
Kahan 500000.00000499998

======================> v = [1000000]*1000000 + [0.00001]*1000000
CORRECT 500000.000005 <========
builtin 500000.0
straight 500000.0
pairwise 500000.00000500004
Kahan 500000.00000499998

======================> v = [0.00001]*1000000 + [1000000]*1000000
CORRECT 500000.000005 <========
builtin 500000.00000499998
straight 500000.00000499998
pairwise 500000.00000499998
Kahan 500000.00000499998

======================> v = [1000000.00001] * 1000000
CORRECT 1000000.00001 <========
builtin 1000000.0000016732
straight 1000000.0000016732
pairwise 1000000.00001
Kahan 1000000.00001

======================> Randomized lists
========> lst1 = [random.random() for i in range(1000000)]
========> lst2 = [1000000*random.random() for i in range(1000000)]
========> v = "Crazy merge of lst1 and lst2
min, max
lst1: 1.59494420963e-005 0.999990993581
lst2: 1000000.00001 1000001.0
builtin 777778.27774196747
straight 777778.27774196747
pairwise 777778.27774199261
Kahan 777778.27774199261
sorted
builtin 777778.2777420698
straight 777778.2777420698
pairwise 777778.27774199261
Kahan 777778.27774199261
reverse sorted
builtin 777778.27774202416
straight 777778.27774202416
pairwise 777778.27774199261
Kahan 777778.27774199261
=========================================================================

Cheers,
Jussi

Dennis Lee Bieber · Mar 2, 2007

I've run a couple of tests and it seems to me that Dennis Lee Bieber is
on the trail of the truth when he claims that smallest magnitude to
the largest is the way to do the summation. Actually it isn't THE way
although it diminishes the error. I was sort of astonished because I
also had somewhere along the years formed the same notion as Dennis.

Unfortunately, I've misplaced my copy of:

http://www.amazon.com/Real-Computin...6086442?ie=UTF8&s=books&qid=1172813997&sr=1-4

So I can't give any citations... ("Deadly Doses" and "Poisons" I
have no problem finding... But an analytical look at how floating point
operations differ from real world... ?)

Chapter two is titled "Preserving Significant Digits"
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/

Duncan Booth · Mar 2, 2007

Jussi Salmela said:
I've run a couple of tests and it seems to me that Dennis Lee Bieber is
on the trail of the truth when he claims that smallest magnitude to
the largest is the way to do the summation. Actually it isn't THE way
although it diminishes the error. I was sort of astonished because I
also had somewhere along the years formed the same notion as Dennis.

"Your" pairwise method beats the former method by a large margin
although I don't understand why. To tell you the truth: I started out to
show you were wrong because intuitively (for me at least) the former
method should work better than yours.

I think Dennis is spot on when he says that smallest to largest is the
best way. What he has missed though is that if you have a list of
similar sized values, then after you have added the first two together
the next addition still needs to follow the smallest to largest rule so
it shouldn't include the result of the first addition. One option, if
you have access to the full list when you start, would be to reinsert
each partial sum back into the sorted list as you calculate it: a heap
queue might be an appropriate implementation for that.

The pairwise sum gives you most, although not quite all, of the benefit
of sorting the data: obviously if the input values are identical it is
almost exactly equivalent to adding the smallest values and then
inserting the partial result back into the list. Even in the 1000000,
0.00001 case it avoids most of the precision-losing additions such as
500000000005.0 + 0.00001 which the simple sum hits.

Also it has the further advantages of still working with other summable
objects (e.g. lists) so long as they are associative and not requiring
you to know all the values or even the length of the sequence in
advance.

For floating point you can of course use methods to preserve the lost
bits such as the kahan sum, or even just do all imtermediate
calculations to a higher internal precision.

By the way, I found another bug in my sumpairs: += is a very bad idea
here if the input can contain mutables, so the relevant line should be:

tmp[-1] = tmp[-1] + v

Once I fixed that one, so it no longer attempts to modify lists in-
place, it appears that the more balanced intermediate results give a
benefit when summing lists as well. sumpairs on a list of lists seems to
beat the builtin sum as soon as the length of the list exceeds about 210
items. At 1000 items it is about 3 times faster and 30 times faster at
10,000 items. If I get the time I think I really should code up a C
version of sumpairs and see how that compares against Pythons builtin
sum for numerical sums.

I guess either everyone else missed that computer science lecture I
thought I remembered, or maybe I was the one who was asleep and I
dreamed it all.

Converting a list of lists to a single list	8	Jul 23, 2013
fetchall to python lists	2	Sep 30, 2013
Converting folders of jpegs to single pdf per folder	0	Jan 16, 2014
Struggling for inspiration with lists	1	Dec 17, 2013
How to return list of lists	9	Apr 25, 2013
tuple of ids of integers or lists	3	Mar 17, 2013
Count each unique element in list of lists	1	Nov 8, 2013
Boolean function on variable-length lists	9	Sep 12, 2012

Lists: Converting Double to Single

rshepard

Leif K-Brooks

Bjoern Schliessmann

rshepard

Scott David Daniels

Dennis Lee Bieber

Duncan Booth

Steven D'Aprano

Neil Cerutti

Dennis Lee Bieber

Duncan Booth

Dennis Lee Bieber

Duncan Booth

Jussi Salmela

Dennis Lee Bieber

Duncan Booth

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads