sum and strings


Georg Brandl

Paul said:
Just because it's common doesn't mean it's obvious. In my opinion
it's as ugly as sin, and the fact that it's an idiom shows a
shortcoming in Python. The obvious reason for summing strings is that
it's a long-established feature of Python that a+b concatenates two
strings, so summing a,b,c,d,e should result in a+b+c+d+e.

Which is exactly how I would concatenate five strings.

For concatenating longer sequences of strings, however, this approach is not
sensible if it needs to be fast.
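
The usual alternative is the ''.join idiom, which builds the result in
one pass instead of re-copying the partial result on every +. Roughly:

parts = ['a', 'b', 'c', 'd', 'e']
result = ''.join(parts)    # one pass, linear in the total length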

Georg
 

Bill Pursell

Georg said:
Why would you try to sum up strings? Besides, the ''.join idiom is quite
common in Python.

One could extend this argument to disallow the following:
 

Sybren Stuvel

Paul Rubin enlightened us with:
Huh? Just because some obscure weirdness like this is in the
manual, doesn't make it obvious or natural.

I think it's quite natural. The "fact" that it's some obscure
weirdness is all in your head.

Sybren
 

Rhamphoryncus

Bill said:
One could extend this argument to disallow the following:

It's worthwhile to note that the use of + as the concatenation operator
is arbitrary. It could just as well have been | or &, and has no
relationship with mathematical addition. Were history different
perhaps Guido would have gone with | or & instead, and we wouldn't be
having this conversation.

It's also worth stressing (not in response to your post, but others)
that sum([[1],[2],[3]], []) is just as bad as attempting to sum
strings, both conceptually (it's not mathematical addition), and
performance-wise. Don't do it. :)

I believe the preferred method to flatten a list of lists is this:

shallow = []
for i in deep:
    shallow.extend(i)

Yes, it's three lines. It's also very easy to read. reduce() and
sum() are not.
 

Carl Banks

Rhamphoryncus said:
It's also worth stressing (not in response to your post, but others)
that sum([[1],[2],[3]], []) is just as bad as attempting to sum
strings, both conceptually (it's not mathematical addition), and
performance-wise. Don't do it. :)

Interesting. I would have guessed that, internally, sum was
implemented with in-place operations (except for the first operation,
which would have to be non-in-place so that it wouldn't modify a
mutable initial value). But looking at the source, I see that I would
have guessed wrongly: it does not use in-place operations. So,
although lists are optimized for growing, sum would end up creating a
new list each iteration anyways. Thus it does have the same penalty
that strings would have. Good to know.
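
In rough Python terms (the real sum() is C code, and these function names
are just for illustration), the two strategies look like this:

def sum_like(seq, start=0):
    # what the builtin effectively does: every + builds a new object
    total = start
    for item in seq:
        total = total + item
    return total

def inplace_sum(seq, start):
    # the in-place strategy described above: copy once up front so the
    # caller's start value is never mutated, then let += grow the result
    if not seq:
        return start
    total = start + seq[0]
    for item in seq[1:]:
        total += item
    return total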

Obviously, sum can't check for every possible type that would be
inefficient, but at least it could check for common builtins (str,
list, tuple). That it only checks for strings suggests to me that this
really is just a case of protecting the unwary. Concatenating strings
is common enough, and the drawbacks of using sum bad enough, that a
special case was considered justified. Lists and tuples, though
theoretically they have the same issues as strings, probably don't
justify a special case because they're not as common.
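
For instance, in a CPython 2.x session (the error wording may differ
between versions, and the traceback is abbreviated here):

>>> sum([[1], [2], [3]], [])     # lists are accepted, though also O(n**2)
[1, 2, 3]
>>> sum([(1,), (2,)], ())        # so are tuples
(1, 2)
>>> sum(['ab', 'cd'], '')        # only a string start value is rejected
Traceback (most recent call last):
  ...
TypeError: sum() can't sum strings [use ''.join(seq) instead]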

I believe the preferred method to flatten a list of lists is this:

shallow = []
for i in deep:
    shallow.extend(i)

Yes, it's three lines. It's also very easy to read. reduce() and
sum() are not.

sum() is really for numerical uses; people ought to just stick to using
it that way.


Carl Banks
 

Paddy

Rhamphoryncus said:
It's worthwhile to note that the use of + as the concatenation operator
is arbitrary. It could just as well have been | or &, and has no
relationship with mathematical addition.

The effect of the string concatenation operator is only secondary.
Secondary to the use of the word sum, and to what could 'reasonably' be
conceived as sum's effect on non-numeric types.
It's also worth stressing (not in response to your post, but others)
that sum([[1],[2],[3]], []) is just as bad as attempting to sum
strings, both conceptually (it's not mathematical addition)

Unfortunately, the words sum and summation are linked to other words
that are commonly used to describe what is happening to the numbers,
the lists, and the strings.
Words such as accumulate, concatenate, aggregate, collect, assemble, as
well as add.
and performance-wise. Don't do it. :)

Amen to that.
I believe the preferred method to flatten a list of lists is this:

shallow = []
for i in deep:
    shallow.extend(i)

Yes, it's three lines. It's also very easy to read. reduce() and
sum() are not.

I'd like to squeeze in the listcomp version, not because it is one line
shorter, but because I, and maybe others, prefer short listcomps such as
the following:

shallow = []
[shallow.extend(i) for i in deep]

-Pad.
 

Rhamphoryncus

Paddy said:
Rhamphoryncus said:
It's worthwhile to note that the use of + as the concatenation operator
is arbitrary. It could just as well have been | or &, and has no
relationship with mathematical addition.

The effect of the string concatenation operator is only secondary.
Secondary to the use of the word sum, and to what could 'reasonably' be
conceived as sum's effect on non-numeric types.
It's also worth stressing (not in response to your post, but others)
that sum([[1],[2],[3]], []) is just as bad as attempting to sum
strings, both conceptually (it's not mathematical addition)

Unfortunately, the words sum and summation are linked to other words
that are commonly used to describe what is happening to the numbers,
the lists, and the strings.
Words such as accumulate, concatenate, aggregate, collect, assemble, as
well as add.

String concatenation and numeric addition only group together under the
vaguest of English terms, "putting things together". String
concatenation violates many mathematical properties of
addition/summation, and simply isn't related.

It is unfortunate that many languages (including Python) promote the
confusion by using + as a "put things together" operator, i.e. for both
mathematical addition and string concatenation.

I believe the preferred method to flatten a list of lists is this:

shallow = []
for i in deep:
    shallow.extend(i)

Yes, it's three lines. It's also very easy to read. reduce() and
sum() are not.

I'd like to squeeze in the listcomp version, not because it is one line
shorter, but because I, and maybe others, prefer short listcomps such as
the following:

shallow = []
[shallow.extend(i) for i in deep]

I'm sure this has been mentioned before, but listcomps are for when you
want to store the list and use it for further things, not for when you
want a side effect. TOOWTDI.

And of course, if saving a line was the reason:

shallow = []
for i in deep: shallow.extend(i)
 

Gerhard Fiedler

shallow = []
[shallow.extend(i) for i in deep]

I'm sure this has been mentioned before, but listcomps are for when you
want to store the list and use it for further things, not for when you
want a side effect. TOOWTDI.

Can you please explain what you mean with this, and maybe why?

Thanks,
Gerhard
 

Marc 'BlackJack' Rintsch

Gerhard Fiedler said:
shallow = []
[shallow.extend(i) for i in deep]

I'm sure this has been mentioned before, but listcomps are for when you
want to store the list and use it for further things, not for when you
want a side effect. TOOWTDI.

Can you please explain what you mean with this, and maybe why?

You should not abuse list comps just to have a one-liner. Only use them
if you really want to build a list and not just for side effects. The
above one-liner builds a list of ``None``s of length ``len(deep)`` for no
reason, just to throw them away.
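
For example:

>>> deep = [[1], [2], [3]]
>>> shallow = []
>>> [shallow.extend(i) for i in deep]    # extend() returns None each time
[None, None, None]
>>> shallow
[1, 2, 3]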

Ciao,
Marc 'BlackJack' Rintsch
 

Alex Martelli

Rhamphoryncus said:
I believe the preferred method to flatten a list of lists is this:

shallow = []
for i in deep:
    shallow.extend(i)

Yes, it's three lines. It's also very easy to read. reduce() and
sum() are not.

I'd like to squeeze in the listcomp version, not because it is one line
shorter, but because I, and maybe others, prefer short listcomps such as
the following:

shallow = []
[shallow.extend(i) for i in deep]

I'm sure this has been mentioned before, but listcomps are for when you
want to store the list and use it for further things, not for when you
want a side effect. TOOWTDI.

And of course, if saving a line was the reason:

shallow = []
for i in deep: shallow.extend(i)

Another alternative equivalent to

shallow = sum(deep, [])

but based on the "proper" usage of list comprehensions is

shallow = [ item for sublist in deep for item in sublist ]


In terms of performance, however, the simple loop that you (rhamph)
posted is generally best -- e.g., with Python 2.5c1 on a MacbookPro:

brain:~/downloads alex$ python -mtimeit -s'deep=[range(9)]*9'
's=sum(deep,[])'
100000 loops, best of 3: 11.2 usec per loop

brain:~/downloads alex$ python -mtimeit -s'deep=[range(9)]*9' 's=[]
for sublist in deep: s.extend(sublist)'
100000 loops, best of 3: 6.92 usec per loop

brain:~/downloads alex$ python -mtimeit -s'deep=[range(9)]*9' 's=[]
[s.extend(sublist) for sublist in deep]'
100000 loops, best of 3: 8.48 usec per loop

brain:~/downloads alex$ python -mtimeit -s'deep=[range(9)]*9' 's=[item
for sublist in deep for item in sublist]'
100000 loops, best of 3: 17.1 usec per loop


Alex
 

Steven D'Aprano

If the args are strings, the above is a quadratic time algorithm and
it's better to throw an error than create such a trap for an unwary user.

That's a nonsense argument. There is no shortage of slow algorithms, and
some of them are built into Python. When did O(n**2) become an error
condition? And overhead matters: if I'm only doing a few concatenations,
it is significantly faster to add the strings using + than to use join()
(at least in Python 2.3 -- YMMV):
s = timeit.Timer("''.join(L)", "L=['a'*50, 'b'*50, 'c'*50]")
s.timeit()
1.3470098972320557
t = timeit.Timer("a+b+c", "a,b,c = 'a'*50, 'b'*50, 'c'*50")
t.timeit()
1.0698421001434326

There's a word for optimizations that actually result in code running 25%
slower.

I applaud that Python's language developers care about efficiency. I
applaud that the documentation warns people about traps and potential
areas of inefficiency. But I think it is a shame that sum() does a
special type-check to avoid something which is only sometimes slow. It
doesn't protect against O(n**2) performance; it merely protects against
just one of an infinite number of possible "traps for the unwary".

I would have thought it would be better for sum() to raise a warning, not
an exception. If we take seriously the argument that sum implies
addition, and that string concatenation isn't really addition, then sum()
should also refuse to operate on lists and tuples and any other
non-numeric class. Either would be better than sum() "protecting" the user
from himself, except when it doesn't.
 

Georg Brandl

Steven said:
If the args are strings, the above is a quadratic time algorithm and
it's better to throw an error than create such a trap for an unwary user.

That's a nonsense argument. There is no shortage of slow algorithms, and
some of them are built into Python. When did O(n**2) become an error
condition? And overhead matters: if I'm only doing a few concatenations,
it is significantly faster to add the strings using + than to use join()
(at least in Python 2.3 -- YMMV):
s = timeit.Timer("''.join(L)", "L=['a'*50, 'b'*50, 'c'*50]")
s.timeit()
1.3470098972320557
t = timeit.Timer("a+b+c", "a,b,c = 'a'*50, 'b'*50, 'c'*50")
t.timeit()
1.0698421001434326

There's a word for optimizations that actually result in code running 25%
slower.

I applaud that Python's language developers care about efficiency. I
applaud that the documentation warns people about traps and potential
areas of inefficiency. But I think it is a shame that sum() does a
special type-check to avoid something which is only sometimes slow. It
doesn't protect against O(n**2) performance; it merely protects against
just one of an infinite number of possible "traps for the unwary".

I would have thought it would be better for sum() to raise a warning, not
an exception. If we take seriously the argument that sum implies
addition, and that string concatenation isn't really addition, then sum()
should also refuse to operate on lists and tuples and any other
non-numeric class. Either would be better than sum() "protecting" the user
from himself, except when it doesn't.

Well, present that on python-dev.

Georg
 

Fredrik Lundh

Alex said:
In terms of performance, however, the simple loop that you (rhamph)
posted is generally best -- e.g., with Python 2.5c1 on a MacbookPro:

brain:~/downloads alex$ python -mtimeit -s'deep=[range(9)]*9'
's=sum(deep,[])'
100000 loops, best of 3: 11.2 usec per loop

brain:~/downloads alex$ python -mtimeit -s'deep=[range(9)]*9' 's=[]
for sublist in deep: s.extend(sublist)'
100000 loops, best of 3: 6.92 usec per loop

at least on this machine, map(s.extend) is slightly faster than the loop:

timeit -s"deep=[range(9)]*9" "s=[]" "for sublist in deep:
s.extend(sublist)"
100000 loops, best of 3: 5.59 usec per loop

timeit -s"deep=[range(9)]*9" "s=[]" "map(s.extend, deep)"
100000 loops, best of 3: 5.26 usec per loop

</F>
 

Fredrik Lundh

Steven said:
That's a nonsense argument. There is no shortage of slow algorithms, and
some of them are built into Python. When did O(n**2) become an error
condition? And overhead matters: if I'm only doing a few concatenations,
it is significantly faster to add the strings using + than to use join()
(at least in Python 2.3 -- YMMV):
s = timeit.Timer("''.join(L)", "L=['a'*50, 'b'*50, 'c'*50]")
s.timeit()
1.3470098972320557
t = timeit.Timer("a+b+c", "a,b,c = 'a'*50, 'b'*50, 'c'*50")
t.timeit()
1.0698421001434326

and what exactly does the fact that Python can do operator-based
dispatch much faster than it can do method-based dispatch have to
do with sum's inability to add strings ?

did you have some kind of "zero overhead for some function calls"
optimization in mind ?

</F>
 

Carl Banks

Steven said:
But I think it is a shame that sum() does a
special type-check to avoid something which is only sometimes slow. It
doesn't protect against O(n**2) performance; it merely protects against
just one of an infinite number of possible "traps for the unwary".

I'm not so sure special casing was a good idea myself, but...

If you allow for a small number of special cases in the language to
protect against the most harmful and error-prone traps, this case would
be one of the top ones, IMHO.


Carl Banks


and, you know, if you really want to concatenate strings with sum, you
can

class noopadd(object):
    def __add__(self, other):
        return other

sum(["abc","def","ghi"],noopadd())
 

Paddy

Carl said:
and, you know, if you really want to concatenate strings with sum, you
can

class noopadd(object):
    def __add__(self, other):
        return other

sum(["abc","def","ghi"],noopadd())

;-)

- Pad.
 

Steven D'Aprano

Steven said:
That's a nonsense argument. There is no shortage of slow algorithms, and
some of them are built into Python. When did O(n**2) become an error
condition? And overhead matters: if I'm only doing a few concatenations,
it is significantly faster to add the strings using + than to use join()
(at least in Python 2.3 -- YMMV):
s = timeit.Timer("''.join(L)", "L=['a'*50, 'b'*50, 'c'*50]")
s.timeit()
1.3470098972320557
t = timeit.Timer("a+b+c", "a,b,c = 'a'*50, 'b'*50, 'c'*50")
t.timeit()
1.0698421001434326

and what exactly does the fact that Python can do operator-based
dispatch much faster than it can do method-based dispatch have to
do with sum's inability to add strings ?

Sorry, I don't get you here. Are you saying that string addition doesn't
call the operands' __add__ methods?

Obviously I would have preferred to have done a direct test of sum()
versus join(), but I can't since sum() goes out of its way to make that
difficult.

Or maybe I can...
t5 = timeit.Timer("sum(L, M)", """
class Magic:
    def __add__(self, other): return other
L = ['a', 'b']
M = Magic()
""")
t5.timeit()
12.319401025772095

That's bizarrely slow for concatenating two characters. It's an order of
magnitude slower than doing a direct addition of two characters, and about
one third the speed of a pure Python implementation:
setup1 = """
def mysum(strlist):
    t = ""
    for s in strlist: t = t+s
    return t
a, b, c, d = 'a', 'b', 'c', 'd'
"""
t6 = timeit.Timer("mysum([a,b,c,d])", setup1)
t6.timeit()
4.3943448066711426

What's going on here? What have I missed? I know adding strings is
O(n**2), but these results don't make any sense to me.
 

Fredrik Lundh

Steven said:
Sorry, I don't get you here. Are you saying that string addition doesn't
call the operands' __add__ methods?

the "+" instruction (BINARY_ADD) special-cases integers and 8-bit
strings, and uses a single C call via a type slot for other types.

in contrast, "join" and "sum" need to do a full name lookup and a
full Python-level function call. compared to the effort needed to
copy 100 bytes, that's a rather expensive set of operations.
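
this is easy to see with the dis module (the opcode sequence below is a
summary; exact output differs between versions):

import dis
# "a + b + c" compiles to LOAD_NAME a, LOAD_NAME b, BINARY_ADD,
# LOAD_NAME c, BINARY_ADD, RETURN_VALUE -- each "+" is a single opcode
# that the eval loop can special-case for ints and 8-bit strings.
dis.dis(compile("a + b + c", "<example>", "eval"))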

</F>
 

neophyte

Fredrik said:
do you often write programs that "sums" various kinds of data types, and
for which performance issues are irrelevant?

or are you just stuck in a "I have this idea" loop?

Maybe he's just insisting on the principle of least surprise?
Personally, I'd rather expect sum() to work for strings and Python to
issue a warning instead of raising a type error. That warning might
also be appropriate when using sum() on other builtin sequence types.
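
A sketch of what that could look like, written as a plain Python wrapper
(tolerant_sum is a made-up name, not a real builtin):

import warnings

def tolerant_sum(seq, start=0):
    # Hypothetical sum() variant: warn (once) about string concatenation
    # instead of raising TypeError.
    warned = False
    total = start
    for item in seq:
        if isinstance(item, str) and not warned:
            warnings.warn("summing strings is O(n**2); ''.join() is "
                          "usually faster", RuntimeWarning, stacklevel=2)
            warned = True
        total = total + item
    return total

Used as, for example:

>>> tolerant_sum(['a', 'b', 'c'], '')    # warns once, then returns the string
'abc'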
 
