StringIO proposal: add __iadd__

Paul Rubin

I've always found the string-building idiom

temp_list = []
for x in various_pieces_of_output():
    v = go_figure_out_some_string()
    temp_list.append(v)
final_string = ''.join(temp_list)

completely repulsive. As an alternative I suggest

temp_buf = StringIO()
for x in various_pieces_of_output():
    v = go_figure_out_some_string()
    temp_buf += v
final_string = temp_buf.getvalue()

Here, "temp_buf += v" is supposed to be the same as "temp_buf.write(v)".
So the suggestion is to add a __iadd__ method to StringIO and cStringIO.

Any thoughts?
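
(For concreteness, a minimal sketch of the proposed behavior, written as
a subclass of today's pure-Python StringIO -- the IAddStringIO name is
hypothetical, not part of any library:)

from StringIO import StringIO

class IAddStringIO(StringIO):
    def __iadd__(self, text):
        self.write(text)   # += is just sugar for write()
        return self        # keep the name bound to the same buffer

buf = IAddStringIO()
buf += 'spam '
buf += 'eggs'
assert buf.getvalue() == 'spam eggs'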

Also, I wonder if it's now ok to eliminate the existing StringIO
module (make it an alias for cStringIO) now that new-style classes
permit extending cStringIO.StringIO.
 
Alex Martelli

Paul Rubin said:
temp_buf = StringIO()
for x in various_pieces_of_output():
    v = go_figure_out_some_string()
    temp_buf += v
final_string = temp_buf.getvalue()

Here, "temp_buf += v" is supposed to be the same as "temp_buf.write(v)".
So the suggestion is to add a __iadd__ method to StringIO and cStringIO.

What's the added value of spelling x.write(v) as x += v? Is it worth
the utter strangeness of having a class which allows += and not + (the
only one in the std library, I think it would be)...?
Any thoughts?

I think that the piece of code you like (and I just quoted) is just
fine, simply by changing the += to a write.
Also, I wonder if it's now ok to eliminate the existing StringIO
module (make it an alias for cStringIO) now that new-style classes
permit extending cStringIO.StringIO.

I love having a pure-Python version of any C-coded standard library
module (indeed, I wish I had more!-) for all sorts of reasons, including
easing the burden of porting Python to weird platforms. In StringIO's
case, it's nice to be able to use the above idiom to concatenate Unicode
strings just as easily as plain ones, for example -- cStringIO (like
file objects) wants plain bytestrings.
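
(A quick illustration of that difference, assuming Python 2 with the
default ASCII codec:)

from StringIO import StringIO
import cStringIO

s = StringIO()
s.write(u'caf\xe9')        # pure-Python StringIO keeps the unicode as-is
print repr(s.getvalue())   # u'caf\xe9'

c = cStringIO.StringIO()
c.write(u'caf\xe9')        # cStringIO encodes via the default codec, so a
                           # non-ASCII unicode string raises UnicodeError here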

It would be nice (in Py3k, when backwards compatibility can be broken)
to make the plain-named, "default" modules those coded in C, since
they're used more often, and find another convention to indicate pure
Python equivalents -- e.g., pickle/pypickle and StringIO/pyStringIO
rather than the current cPickle/pickle and cStringIO/StringIO. But I
hope the pure-python "reference" modules stay around (and, indeed, I'd
love for them to _proliferate_, maybe by adopting some of the work of
the pypy guys at some point;).


Alex
 
Paul Rubin

What's the added value of spelling x.write(v) as x += v? Is it worth
the utter strangeness of having a class which allows += and not + (the
only one in the std library, I think it would be)...?

Sure, + can also be supported. Adding two StringIO's, or a StringIO to a
string, results in a StringIO with the obvious contents.
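
(A sketch of what that might look like -- a hypothetical class, not a
concrete proposal for the actual implementation:)

from StringIO import StringIO

class AddableStringIO(StringIO):
    def __add__(self, other):
        result = AddableStringIO()
        result.write(self.getvalue())
        if hasattr(other, 'getvalue'):       # another StringIO-like buffer
            result.write(other.getvalue())
        else:
            result.write(other)              # a plain string
        return result
    def __iadd__(self, text):
        self.write(text)
        return self
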
In StringIO's case, it's nice to be able to use the above idiom to
concatenate Unicode strings just as easily as plain ones, for
example -- cStringIO (like file objects) wants plain bytestrings.

I wasn't aware of that limitation--maybe cStringIO could be extended
to take Unicode. You'd use an encode or decode method to get a
bytestring out. Or there could be a mutable-string class separate
from cStringIO, to be used for this purpose (of getting rid of the
list.append kludge).
But I hope the pure-python "reference" modules stay around (and,
indeed, I'd love for them to _proliferate_, maybe by adopting some
of the work of the pypy guys at some point;).

Maybe the standard versions of some of these things can be written in
RPython under PyPy, so they'll compile to fast machine code, and then
the C versions won't be needed. But with CPython I think we need the
C versions.
 
Alex Martelli

Paul Rubin said:
I wasn't aware of that limitation--maybe cStringIO could be extended
to take Unicode. You'd use an encode or decode method to get a
bytestring out.

But why can't I have perfectly polymorphic "append a bunch of strings
together", just like I can now (with ''.join of a list of strings, or
StringIO), without caring whether the strings are Unicode or
bytestrings?
Or there could be a mutable-string class separate
from cStringIO, to be used for this purpose (of getting rid of the
list.append kludge).

StringIO works just fine. Developing (and having to document, learn,
teach, ...) a separate interface just in order to remove StringIO does
not seem worth it. As for extending cStringIO.write, I guess that's
possible, but not without breaking compatibility (code that now uses
that write with unicode strings, assuming they'll get encoded into
bytestrings by the default encoding, and similarly assumes that getvalue
always returns a bytestring when called on a cStringIO instance); you'd
need instead to add another couple of methods, or wait for Py3k.

Maybe the standard versions of some of these things can be written in
RPython under PyPy, so they'll compile to fast machine code, and then
the C versions won't be needed. But with CPython I think we need the
C versions.

By all means, the C versions are welcome, I just don't want to lose the
Python versions either (and making them less readable by recoding them
in RPython would interfere with didactical use).


Alex
 
Paul Rubin

But why can't I have perfectly polymorphic "append a bunch of strings
together", just like I can now (with ''.join of a list of strings, or
StringIO), without caring whether the strings are Unicode or
bytestrings?

I see that 'a' + u'b' = u'ab', which makes sense. I don't use Unicode
much so haven't paid much attention to such things. Is there some
sound reason cStringIO acts differently from StringIO? I'd expect
them to both do the same thing.
As for extending cStringIO.write I guess that's
possible, but not without breaking compatibility ... you'd
need instead to add another couple of methods, or wait for Py3k.

We're already discussing adding another method, namely __iadd__.
Maybe that's the place to put it.
 
Erik Max Francis

Paul said:
I've always found the string-building idiom

temp_list = []
for x in various_pieces_of_output():
    v = go_figure_out_some_string()
    temp_list.append(v)
final_string = ''.join(temp_list)

completely repulsive. As an alternative I suggest

temp_buf = StringIO()
for x in various_pieces_of_output():
    v = go_figure_out_some_string()
    temp_buf += v
final_string = temp_buf.getvalue()

Here, "temp_buf += v" is supposed to be the same as "temp_buf.write(v)".
So the suggestion is to add a __iadd__ method to StringIO and cStringIO.

Any thoughts?

Why? StringIO/cStringIO have file-like interfaces, not sequences.
 
Scott David Daniels

Alex said:
It would be nice (in Py3k, when backwards compatibility can be broken)
to make the plain-named, "default" modules those coded in C, since
they're used more often, and find another convention to indicate pure
Python equivalents -- e.g., pickle/pypickle and StringIO/pyStringIO

How about something like a package py for all such python-coded modules
so you use py.StringIO (which I hope gets renamed to stringio in the
Py3K shift).
 
Alex Martelli

Paul Rubin said:
I see that 'a' + u'b' = u'ab', which makes sense. I don't use Unicode
much so haven't paid much attention to such things. Is there some
sound reason cStringIO acts differently from StringIO? I'd expect
them to both do the same thing.

I believe that cStringIO tries to optimize, while StringIO doesn't and
is thereby more general.

We're already discussing adding another method, namely __iadd__.
Maybe that's the place to put it.

Still need another method to 'getvalue' which can return a Unicode
string (currently, cStringIO.getvalue returns plain strings only, and it
might break something if that guarantee was removed).

That being said, if the only way to use a StringIO was to call += or
__iadd__ on it, I would switch my recommendation away from it and
towards "just join the sequence of strings". Taking your example:

temp_buf = StringIO()
for x in various_pieces_of_output():
    v = go_figure_out_some_string()
    temp_buf += v
final_string = temp_buf.getvalue()

it's just more readable to me to express it

final_string = ''.join(go_figure_out_some_string()
                       for x in various_pieces_of_output())

Being able to use temp_buf.write(v) [like today, but with StringIO, not
cStringIO] would still have me recommending it to newbies, but having to
explain that extra += just tips the didactical balance. It's already
hard enough to jump ahead to a standard library module in the middle of
an explanation of strings, just to explain how to concatenate a bunch...

Yes, I do understand your performance issues:

Nimue:~/pynut alex$ python2.4 -mtimeit -s'from StringIO import StringIO'
's=StringIO(); s.writelines(str(i) for i in range(33)); x=s.getvalue()'
1000 loops, best of 3: 337 usec per loop

Nimue:~/pynut alex$ python2.4 -mtimeit -s'from cStringIO import
StringIO' 's=StringIO(); s.writelines(str(i) for i in range(33));
x=s.getvalue()'
10000 loops, best of 3: 98.1 usec per loop

Nimue:~/pynut alex$ python2.4 -mtimeit 's=list(); s.extend(str(i) for i
in range(33)); x="".join(s)'
10000 loops, best of 3: 99 usec per loop

but using += instead of writelines [[actually, how WOULD you express the
writelines equivalent???]] or abrogating plain-Python StringIO would not
speed up the cStringIO use (which is already just as fast as the ''.join
use).
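
(For what it's worth, one plausible spelling of that writelines call
with the proposed += -- one __iadd__ call per piece, so presumably no
faster:)

from StringIO import StringIO
s = StringIO()
for piece in (str(i) for i in range(33)):
    s += piece   # assumes the proposed __iadd__ exists
x = s.getvalue()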


Alex
 
Alex Martelli

Scott David Daniels said:
How about something like a package py for all such python-coded modules
so you use py.StringIO (which I hope gets renamed to stringio in the
Py3K shift).

Sounds good to me, indeed better than 'name mangling'!-)


Alex
 
Paul Rubin

I believe that cStringIO tries to optimize, while StringIO doesn't and
is thereby more general.

I'm not sure what optimizations make sense. I'd thought the most
important difference was the ability to subclass StringIO, before
new-style classes arrived. It's really ugly that .getvalue does
different things for StringIO and cStringIO, something that I didn't
realize and which amazes me. I'd go as far as to say maybe .getvalue
should be deprecated in both modules, and replaced by .getstring
(returns regular or unicode string depending on contents) and
.getbytes (always returns a byte string).
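
(A sketch of those two hypothetical accessors on top of the pure-Python
StringIO; the utf-8 default here is my assumption, not part of the
proposal:)

from StringIO import StringIO

class ExplicitStringIO(StringIO):
    def getstring(self):
        # str or unicode, depending on what was written
        return self.getvalue()
    def getbytes(self, encoding='utf-8'):
        # always a byte string
        value = self.getvalue()
        if isinstance(value, unicode):
            return value.encode(encoding)
        return value
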
Still need another method to 'getvalue' which can return a Unicode
string (currently, cStringIO.getvalue returns plain strings only,
and it might break something if that guarantee was removed).

Yeah, replacing getvalue with explicit methods is preferable. "Explicit
is better than implicit."
That being said, if the only way to use a StringIO was to call += or
__iadd__ on it, I would switch my recommendation away from it and
towards "just join the sequence of strings".

Fixing getvalue takes care of it. The ''.join idiom is IMO a total
monstrosity and should die, die, die, die, die.
it's just more readable to me to express it
final_string = ''.join(go_figure_out_some_string()
                       for x in various_pieces_of_output())

OK for that example, maybe not for a more complex one. Anyway I like
sum(...) even better (where sum promises to be O(n) in the number of
bytes), but clpy had THAT discussion a few days ago.
Being able to use temp_buf.write(v) [like today, but with StringIO, not
cStringIO] would still have me recommending it to newbies, but having to
explain that extra += just tips the didactical balance.

I just can't for the life of me see += as harder to explain than the
''.join horror. But yeah, the real problem is the incompatible
definitions of .getvalue between the two classes, so that should be
fixed, and .write would do the right thing.
but using += instead of writelines [[actually, how WOULD you express the
writelines equivalent???]] or abrogating plain-Python StringIO would not
speed up the cStringIO use (which is already just as fast as the ''.join
use).

''.join with a list (rather than a generator) arg may be plain worse
than python StringIO. Imagine building up a megabyte string one
character at a time, which means making a million-element list and a
million temporary one-character strings before joining them.
 
Alex Martelli

Paul Rubin said:
''.join with a list (rather than a generator) arg may be plain worse
than python StringIO. Imagine building up a megabyte string one
character at a time, which means making a million-element list and a
million temporary one-character strings before joining them.

Absolutely wrong: ''.join takes less for a million items than StringIO
takes for 100,000. It's _so_ easy to measure...!

Nimue:~/pynut alex$ python2.4 -mtimeit 's=["x" for i in xrange(999999)];
x="".join(s)'
10 loops, best of 3: 422 msec per loop

Nimue:~/pynut alex$ python2.4 -mtimeit -s'from StringIO import StringIO'
's=StringIO()' 'for i in xrange(99999): s.write("x")' 'x=s.getvalue()'
10 loops, best of 3: 688 msec per loop


After all, how do you think StringIO is implemented internally? A list
of strings and a ''.join at the end are the best way that comes to mind,
and of course there's going to be overhead (although I'm surprised to
see that the overhead is quite as bad as this). BTW, cStringIO isn't
very good here either:

Nimue:~/pynut alex$ python2.4 -mtimeit -s'from cStringIO import
StringIO' 's=StringIO()' 'for i in xrange(999999): s.write("x")'
'x=s.getvalue()'
10 loops, best of 3: 1.28 sec per loop

three times as slow as the ''.join you hate so much -- if it's to take
its place, it clearly needs a lot of work.

As for sum, you'll recall I was its original proponent, and my first
implementation did specialcase strings (delegating right to ''.join).
But that left O(N**2) behavior in many other cases (lists, tuples) and
eventually was whittled down to "summing *numbers*", at least as far as
the intention goes. Perhaps there's space for a "sumsequences" that's
something like itertools.chain but specialcases crucial cases such as
strings (plain and Unicode) and lists? Good luck getting it approved on
python-dev -- I'll gladly implement it, if you can get it past that
hurdle (chatting about it here is entertaining, but unless you can get
BDFL blessing it's in the end futile, and that requires python-dev...).
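
(A rough sketch of what such a sumsequences might look like -- the name
and the exact set of special cases are speculative:)

from itertools import chain

def _homogeneous(seqs, t):
    for s in seqs:
        if not isinstance(s, t):
            return False
    return True

def sumsequences(*seqs):
    # special-case the crucial homogeneous cases to stay O(N)...
    if _homogeneous(seqs, str):
        return ''.join(seqs)
    if _homogeneous(seqs, unicode):
        return u''.join(seqs)
    if _homogeneous(seqs, list):
        result = []
        for s in seqs:
            result.extend(s)
        return result
    # ...and fall back to lazy chaining for everything else
    return chain(*seqs)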


Alex
 
Paul Rubin

Absolutely wrong: ''.join takes less for a million items than StringIO
takes for 100,000.

That depends on how much RAM you have. You could try a billion items.
It's _so_ easy to measure...!

Yes but the result depends on your specific hardware and may be
different for someone else.
After all, how do you think StringIO is implemented internally? A list
of strings and a ''.join at the end are the best way that comes to mind,

I'd have used the array module.
As for sum, you'll recall I was its original proponent, and my first
implementation did specialcase strings (delegating right to ''.join).

You could imagine a really dumb implementation of ''.join that used
a quadratic algorithm, and in fact

http://docs.python.org/lib/string-methods.html

doesn't guarantee that join is linear. Therefore, the whole ''.join
idiom revolves around the programmer knowing some undocumented
behavior of the implementation (i.e. that ''.join is optimized). This
reliance on undocumented behavior seems totally bogus to me, but if
it's ok to optimize join, I'd think it's ok to also optimize sum, and
document both.
But that left O(N**2) behavior in many other cases (lists, tuples) and
eventually was whittled down to "summing *numbers*", at least as far as
the intention goes. Perhaps there's space for a "sumsequences" that's
something like itertools.chain but specialcases crucial cases such as
strings (plain and Unicode) and lists?

How about making [].join(bunch_of_lists) analogous to ''.join, with a
documented guarantee that both are linear?
 
Alex Martelli

Paul Rubin said:
That depends on how much RAM you have. You could try a billion items.

Let's see you try it -- I have better things to do than to thrash
around checking assertions which I believe are false and that you're
too lazy to check yourself.
I'd have used the array module.

...and would that support plain byte strings and Unicode smoothly and
polymorphically? You may recall expressing wonder, a few posts ago, at
what optimizations cStringIO might have that stop it from doing just
this...
You could imagine a realy dumb implementation of ''.join that used
a quadratic algorithm, and in fact

http://docs.python.org/lib/string-methods.html

doesn't guarantee that join is linear. Therefore, the whole ''.join
idiom revolves around the programmer knowing some undocumented
behavior of the implementation (i.e. that ''.join is optimized). This

No more than StringIO.write "revolves around" the programmer knowing
exactly the same thing about the optimizations in StringIO: semantics
are guaranteed, performance characteristics are not.
reliance on undocumented behavior seems totally bogus to me, but if

So I assume you won't be using StringIO.write any more, nor ANY other
way to join sequences of strings? Because the performance of ALL of
them depend on such "undocumented behavior".

Personally, I don't consider depending on "undocumented behavior" *for
speed* to be bogus at all, particularly when there are no approaches
whose performance characteristics ARE documented and guaranteed.
Besides C++'s standard library, very few languages like to pin
themselves down by ensuring any performance guarantee;-).

How about making [].join(bunch_of_lists) analogous to ''.join, with a
documented guarantee that both are linear?

I personally have no objection to adding a join method to lists or other
sequences, but of course the semantics should be similar to:

def join(self, *others):
    result = list()
    for other in others[:-1]:
        result.extend(other)
        result.extend(self)
    result.extend(others[-1])
    return self.__class__(result)
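
For example, under that sketch:

[0].join([1, 2], [3, 4], [5])   # -> [1, 2, 0, 3, 4, 0, 5]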

As for performance guarantees, I don't think we have them now even for
list.append, list.extend, dict.__getitem__, and other similarly
fundamental methods. I assume any such guarantees would have to
weaselword regarding costs of memory allocation (including, possibly,
garbage collection), since such allocation may of course be needed and
its performance can easily be out of Python's control; and similarly,
costs of iterating on the items of 'others', cost of indexing it, and so
on (e.g.: for list.sort, cost of comparisons; for dict.__getitem__, cost
of hash on the key; and so on, and so forth).

I don't think it's worth my time doing weaselwording for this purpose,
but if any sealawyers otherwise idle want to volunteer (starting with
the existing methods of existing built-in types, I assume, rather than
by adding others), the offer might be welcome on python-dev (I assume
that large effort will have to be devoted to examining the actual
performance characteristics of at least the reference implementation, in
order to prove that the purported guarantees are indeed met).


Alex
 
Paul Rubin

Paul Rubin said:
I'd have used the array module.

I just checked the implementation and it uses ''.join combined with
some bogo-optimizations to cache the result of the join when you do a
seek or write. That is, .seek can take linear time instead of
constant time, a pretty bogus situation if you ask me, though maybe
the amortized time isn't so bad over multiple calls. I didn't check
how cStringIO does it.
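
(The pattern in question looks roughly like this -- a paraphrase from
memory of StringIO.py's internals, not the actual source:)

class _Sketch:
    def __init__(self):
        self.buf = ''          # everything joined so far (the cache)
        self.buflist = []      # pieces written since the last join
    def write(self, s):
        self.buflist.append(s)
    def getvalue(self):
        # seek does this same consolidation, which is why it can
        # take time linear in the amount written
        if self.buflist:
            self.buf += ''.join(self.buflist)
            self.buflist = []
        return self.buf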
 
Paul Rubin

Let's see you try it

If you want me to try timing it with a billion items on your computer,
you'd have to set up a suitable account and open a network connection,
etc., probably not worth the trouble. Based on examining StringIO.py,
on my current computer (512 MB of RAM), with 100 million items, it looks
like using a bunch of writes interspersed with seeks will be much
faster than just using writes. I wouldn't have guessed THAT. With a
billion items it will thrash no matter what.
...and would that support plain byte strings and Unicode smoothly and

Actually, I see that getvalue is supposed to raise an error if you
mix unicode with 8-bit ASCII:

The StringIO object can accept either Unicode or 8-bit strings,
but mixing the two may take some care. If both are used, 8-bit
strings that cannot be interpreted as 7-bit ASCII (that use the
8th bit) will cause a UnicodeError to be raised when getvalue()
is called.

This is another surprise; I'd have thought it could just convert to
unicode as soon as it saw a unicode string. I think I understand the
idea. The result is that with StringIO (Python 2.4.1),

s = StringIO() # ok
s.write('\xc3') # ok
s.write(u'a') # ok
s.seek(0,2) # raises UnicodeDecodeError

Raising the error at the second s.write instead doesn't seem like it
would be a big problem. The StringIO doc currently doesn't mention that
seek can raise a Unicode exception, so it needs to be fixed either way.
No more than StringIO.write "revolves around" the programmer knowing
exactly the same thing about the optimizations in StringIO: semantics
are guaranteed, performance characteristics are not.

I think having either one use quadratic time is bogus (something like
n log n might be ok).
So I assume you won't be using StringIO.write any more, nor ANY other
way to join sequences of strings? Because the performance of ALL of
them depend on such "undocumented behavior".

I'll keep using them but it means that the program's complexity
(i.e. that it's O(n) and not O(n**2)) depends on the interpreter
implementation, which is bogus. Do you really want a language
designed by computer science geniuses to be so underspecified that
there's no way to tell whether a straightforward program's running
time is linear or quadratic?
Besides C++'s standard library, very few languages like to pin
themselves down by ensuring any performance guarantee;-).

I seem to remember Scheme guarantees tail recursion optimization.
This is along the same lines. It's one thing for the docs to not want
to promise that .sort() uses at most 3.827*n*(lg(n)+14.7) comparisons
or something like that. That's what I'd consider to be pinning down a
performance guarantee. Promising that .sort() is O(n log n) just says
that the implementation is reasonable. The C library's "qsort" doc
even specifies the precise algorithm, or used to. Python's heapq
module doc similarly specifies heapq's algorithm.

Even that gets far afield. The real objection here (about ''.join) is
that every Python user is expected to learn a weird, pervasive idiom,
but the reason for the idiom cannot be deduced from the language
reference. That is just bizarre.
As for performance guarantees, I don't think we have them now even for
list.append, list.extend, dict.__getitem__, and other similarly
fundamental methods. I assume any such guarantees would have to
weaselword regarding costs of memory allocation...

I think it's enough to state the amortized complexity of these
operations to within a factor of O(log(N)). That should be easy
enough to do with the standard implementations (dicts using hashing,
etc) while still leaving the implementation pretty flexible. That
should allow determining the running speed of a user's program to
within a factor of O(log(N)), a huge improvement over not being able
to prove anything about it. Even without such guarantees it's enough
to say that these operations work in the obvious ways ("dicts use
hashing..."), and even without saying that, relying on the behavior
isn't so terrible, because the code that you write is the obvious code
for using operations that work in the obvious ways. That's not the
case for ''.join, the use of which is not obvious at all.

I do see docs for the built-in hash function and __hash__ method

http://docs.python.org/lib/built-in-funcs.html#l2h-34
http://docs.python.org/ref/customization.html#l2h-195

indicating that dictionary lookup uses hashing.
 
Paul Rubin

Paul Rubin said:
etc., probably not worth the trouble. Based on examining StringIO.py,
on my current computer (512 MB of RAM), with 100 million items, it looks

Better make that 200 million.
 
Raymond Hettinger

[Paul Rubin]
Here, "temp_buf += v" is supposed to be the same as "temp_buf.write(v)".
So the suggestion is to add a __iadd__ method to StringIO and cStringIO.

Any thoughts?

The StringIO API needs to closely mirror the file object API.
Do you want to change everything that is filelike to have +=
as a synonym for write()?

In for a penny; in for a pound.


Raymond
 
Paul Rubin

Raymond Hettinger said:
The StringIO API needs to closely mirror the file object API.
Do you want to change everything that is filelike to have +=
as a synonym for write()?

Why would they need that? StringIO objects have getvalue() but other
file-like objects don't. What's wrong with __iadd__ being another
StringIO-specific operation?

And is making += a synonym for write() on other file objects really
that bad an idea? It would be like C++'s use of << for file objects
and could make some code nicer if you like that kind of thing.

What I was really aiming for was something like
java.lang.StringBuffer, if that wasn't obvious, but using an
already-existing class (StringIO). java.lang.StringBuffer supports a
bunch of other operations too, so maybe there's something to be said
for adding something like it to Python and using that instead of
StringIO for this purpose.

I also now notice that the StringBuffer doc describes how the Java
compiler is supposed to handle adding multiple String objects by using
a temporary StringBuffer:

http://java.sun.com/j2se/1.4.2/docs/api/java/lang/StringBuffer.html

That could be seen as a performance specification.
 
Alex Martelli

Paul Rubin said:
And is making += a synonym for write() on other file objects really
that bad an idea? It would be like C++'s use of << for file objects
and could make some code nicer if you like that kind of thing.

Not really: <<'s point is to allow chaining, f<<a<<b<<c. += would have
no such "advantage" (or disadvantage, as the case may be).
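
(For illustration, chaining falls out of each << returning the stream --
a hypothetical Python spelling:)

from StringIO import StringIO

class ChainIO(StringIO):
    def __lshift__(self, text):
        self.write(text)
        return self    # returning self is what lets f << a << b << c chain

f = ChainIO()
f << 'a' << 'b' << 'c'
assert f.getvalue() == 'abc'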


Alex
 
Paul Rubin

Not really: <<'s point is to allow chaining, f<<a<<b<<c. += would have
no such "advantage" (or disadvantage, as the case may be).

Hmm, ok. I've always found << repulsive in that context though, so
won't suggest it for Python.
 
