RE: Why doesn't join() call str() on its arguments?

Discussion in 'Python' started by Delaney, Timothy C (Timothy), Feb 16, 2005.

  1. John Roth wrote:

    > result = "".join([str(x) for x in list])


    As of 2.4, you should use a generator expression here instead (unless
    you require backwards-compatibility with 2.3).

    result = ''.join(str(x) for x in iterable)

    Easier to read, more memory-efficient, potentially faster (depending on
    performance characteristics of building large lists).
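
    For reference, a minimal sketch of the two spellings side by side. It
    is written for a current Python 3 interpreter rather than 2.4, and the
    sample values in `items` are made up for illustration. It also shows
    the behaviour the thread title asks about: join() does not call str()
    for you.

```python
# Hypothetical sample data; any mix of non-string objects would do.
items = [1, 2.5, None, "x"]

# List comprehension: builds the whole list first, then joins it.
via_list = "".join([str(x) for x in items])

# Generator expression: same result, no explicit intermediate list.
via_gen = "".join(str(x) for x in items)

# join() itself refuses non-string items instead of converting them.
try:
    "".join(items)
except TypeError as exc:
    error = exc

print(via_list, via_gen, type(error).__name__)
```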

    Tim Delaney
    Delaney, Timothy C (Timothy), Feb 16, 2005
    #1

  2. "Delaney, Timothy C (Timothy)" writes:

    > John Roth wrote:
    >
    > > result = "".join([str(x) for x in list])

    >
    > As of 2.4, you should use a generator expression here instead (unless
    > you require backwards-compatibility with 2.3).
    >
    > result = ''.join(str(x) for x in iterable)
    >
    > Easier to read, more memory-efficient, potentially faster (depending on
    > performance characteristics of building large lists).


    Stop me if I sound too whiny, but in my original post that
    started this thread just a couple of hours ago, I did in fact get
    this right, so I'm not entirely sure who the two of you are
    actually talking to...

    --
    Leo Breebaart
    Leo Breebaart, Feb 16, 2005
    #2

  3. Delaney, Timothy C (Timothy) wrote:

    > John Roth wrote:
    >
    >> result = "".join([str(x) for x in list])

    >
    > As of 2.4, you should use a generator expression here instead (unless
    > you require backwards-compatibility with 2.3).
    >
    > result = ''.join(str(x) for x in iterable)
    >
    > Easier to read, more memory-efficient, potentially faster (depending on
    > performance characteristics of building large lists).
    >
    > Tim Delaney


    Correct me if I'm wrong, but with a real generator expression, join can't
    know in advance the final size of the resulting string. So, for big
    strings, knowing in advance the size of the list and of the resulting
    string could lead to much better performance, even if the memory usage
    is twice that of the generator expression.
    Christophe Cavalaria, Feb 17, 2005
    #3
  4. Andy Dustman wrote:

    I did some timings of ''.join( <list comprehension> ) vs. ''.join(
    <generator expression> ) and found that generator expressions were
    slightly slower, so I looked at the source code to find out why. It
    turns out that the very first thing string_join(self, orig) does is:

    seq = PySequence_Fast(orig, "");

    thus iterating over your generator expression and creating a list,
    making it less efficient than passing a list in the first place via a
    list comprehension.

    The reason it does this is exactly what you said: it iterates over the
    sequence and gets the sum of the lengths, adds the length of n-1
    separators, and then allocates a string of this size. Then it iterates
    over the list again to build up the string.
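
    The two-pass strategy described above can be modelled in pure Python.
    This is only a sketch, not the C implementation: `two_pass_join` is a
    hypothetical name, it works on Python 3 bytes, and `list(iterable)`
    stands in for PySequence_Fast (which, unlike this sketch, reuses an
    argument that is already a list or tuple).

```python
def two_pass_join(sep, iterable):
    """Pure-Python model of a two-pass join over bytes objects."""
    # Materialize the argument so it can be walked twice.
    seq = list(iterable)
    if not seq:
        return b""

    # Pass 1: final size = sum of item lengths + (n - 1) separators.
    total = sum(len(item) for item in seq) + len(sep) * (len(seq) - 1)
    buf = bytearray(total)  # one allocation of exactly the final size

    # Pass 2: copy separators and items into their final positions.
    pos = 0
    for index, item in enumerate(seq):
        if index:
            buf[pos:pos + len(sep)] = sep
            pos += len(sep)
        buf[pos:pos + len(item)] = item
        pos += len(item)
    return bytes(buf)


parts = (b"a", b"bc", b"d")
print(two_pass_join(b", ", parts))
```

    Passing a generator works too, but only because the first line turns
    it into a list, which is the extra cost measured in the timings.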

    For generators, you'd have to make a trial allocation and start
    appending stuff as you go, periodically resizing. This *might* end up
    being more efficient in the case of generators, but the only way to
    know for sure is to write the code and benchmark it.
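
    A rough sketch of that alternative, again in pure Python 3 over bytes.
    The name `one_pass_join`, the `initial_capacity` default and the
    doubling growth policy are all assumptions made for illustration; the
    post does not specify any of them.

```python
def one_pass_join(sep, iterable, initial_capacity=64):
    """Single-pass join: trial allocation, append, resize as needed."""
    buf = bytearray(initial_capacity)  # trial allocation
    used = 0
    first = True
    for item in iterable:
        chunk = item if first else sep + item
        first = False
        needed = used + len(chunk)
        if needed > len(buf):
            # Grow geometrically so the number of resizes stays small.
            new_capacity = max(needed, 2 * len(buf))
            buf.extend(bytes(new_capacity - len(buf)))
        buf[used:needed] = chunk
        used = needed
    # Trim the unused tail of the buffer.
    return bytes(buf[:used])


print(one_pass_join(b", ", (word for word in (b"a", b"bc", b"d"))))
```

    The argument is consumed exactly once and never stored as a list, at
    the price of occasional buffer copies when the estimate is too small.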

    I will be at PyCon 2005 during the sprint days, so maybe I'll write it
    then if someone doesn't beat me to it. I don't think it'll be all that
    hard. It might be best done as an iterjoin() method, analogous to
    iteritems(), or maybe xjoin() (like xrange(), xreadlines()).

    Incidentally, I was inspired to do the testing in the first place from
    this:

    http://www.skymind.com/~ocrow/python_string/

    Those tests were done with Python 2.3. With 2.4, naive appending (i.e.
    doing s1 += s2 in a loop) is about 13-15% slower than joining a list
    comprehension, but uses much less memory (for large loops); joining a
    generator expression is about 7% slower than the list comprehension and
    uses slightly *more* memory.
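
    A comparison along these lines can be set up with timeit. The sketch
    below targets Python 3, and the function names, `n` and the repeat
    count are arbitrary choices, so the numbers it prints will not match
    the 2.4 percentages quoted above. It measures time only, not memory.

```python
import timeit


def join_listcomp(n):
    return "".join([str(x) for x in range(n)])


def join_genexp(n):
    return "".join(str(x) for x in range(n))


def naive_append(n):
    s = ""
    for x in range(n):
        s += str(x)
    return s


if __name__ == "__main__":
    n = 10000
    for fn in (join_listcomp, join_genexp, naive_append):
        seconds = timeit.timeit(lambda: fn(n), number=50)
        print("%-14s %.4f s" % (fn.__name__, seconds))
```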
    Andy Dustman, Feb 18, 2005
    #4
  5. Dima Dorfman wrote:

    On 2005-02-18, Andy Dustman wrote:
    > The reason it does this is exactly why you said: It iterates over the
    > sequence and gets the sum of the lengths, adds the length of n-1
    > separators, and then allocates a string this size. Then it iterates
    > over the list again to build up the string.


    The other (and, I suspect, the real) reason for materializing the
    argument is to be able to call unicode.join if it finds Unicode
    elements in the sequence. If it finds such an element, unicode.join
    has to be called on the entire sequence; the part already accumulated
    can't be used because unicode.join wants to call PyUnicode_FromObject
    on all the elements. Since it can't know whether the original argument
    is reiterable, it has to keep around the materialized sequence.
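
    The restart problem can be illustrated with a small model. This is not
    the CPython code: it runs on Python 3, where bytes and str stand in
    for the 2.x str and unicode types, and `model_join` is a hypothetical
    helper that assumes ASCII data.

```python
def model_join(sep, iterable):
    """Join bytes items; switch to a text join if any str item appears."""
    seq = list(iterable)  # keep every element so a restart is possible
    for item in seq:      # sizing pass
        if isinstance(item, str):
            # Start over on the *materialized* sequence, converting all
            # elements, including those already seen before this one.
            return sep.decode("ascii").join(
                x if isinstance(x, str) else x.decode("ascii") for x in seq
            )
    return sep.join(seq)


# A generator cannot be rewound: elements consumed before the switch
# would be lost if only the original argument were kept.
gen = (x for x in [b"a", b"b", "c"])
consumed = next(gen)
remaining = list(gen)

mixed = model_join(b",", (x for x in [b"a", b"b", "c"]))
plain = model_join(b",", [b"a", b"b"])
print(consumed, remaining, mixed, plain)
```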

    > For generators, you'd have to make a trial allocation and start
    > appending stuff as you go, periodically resizing. This *might* end up
    > being more efficient in the case of generators, but the only way to
    > know for sure is to write the code and benchmark it.


    Even if it's not faster, it should use about half as much memory for
    non-sequence arguments. That can be a big win if elements are being
    generated on the fly (e.g., it's a generator that does something other
    than just iterate over an existing sequence).
    Dima Dorfman, Feb 18, 2005
    #5
  6. Andy Dustman wrote:

    Looking at the code, it seems that if it finds a unicode object on the
    first pass (the sizing pass), it punts and returns PyUnicode_Join(self,
    seq), which starts over from the beginning. Here seq is the sequence
    from above, not necessarily the original object (orig). In the
    worst-case scenario, you have a long sequence of strings with one
    unicode string at the end...

    Actually, I guess I'm a little surprised that str.join(arg) doesn't
    require arg to be an iterable that yields str instances.
    unicode.join(arg) can afford to be a little more flexible. I wonder how
    common it is to pass a mixture of str and unicode to str.join().
    Andy Dustman, Feb 18, 2005
    #6