"".join(string_generator()) fails to be magic


Matt Mackal

I have an application that is occasionally called upon to process
strings that are a substantial portion of the size of memory. For
various reasons, the resultant strings must fit completely in RAM.
Occasionally, I need to join some large strings to build some even
larger strings.

Unfortunately, there's no good way of doing this without using 2x the
amount of memory as the result. You can get most of the way there with
things like cStringIO or mmap objects, but when you want to actually
get the result as a Python string, you run into the copy again.

Thus, it would be nice if there was a way to join the output of a
string generator so that I didn't need to keep the partial strings in
memory. "".join(string_generator()), as in the subject line, would be
the obvious way to do this, but of course it converts the generator
output to a list first.
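
For readers who want to see the 2x effect described above, here is a rough sketch. It assumes Python 3 on CPython, so bytes stands in for the Python 2 str discussed in this thread. The names chunk_generator, CHUNK_SIZE and CHUNK_COUNT are made up for the illustration, and the exact peak will differ between interpreter versions.

```python
import tracemalloc

CHUNK_SIZE = 256 * 1024   # hypothetical chunk size, kept small for the demo
CHUNK_COUNT = 8           # hypothetical number of chunks


def chunk_generator():
    """Yield CHUNK_COUNT freshly allocated chunks, one at a time."""
    for _ in range(CHUNK_COUNT):
        yield bytes(CHUNK_SIZE)


def peak_bytes_during_join():
    """Return (result size, peak traced memory) for a join over a generator."""
    tracemalloc.start()
    try:
        result = b"".join(chunk_generator())
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return len(result), peak


result_size, peak = peak_bytes_during_join()
print("result:", result_size, "bytes; peak traced:", peak, "bytes")
```

On CPython the peak tends to land near twice the result size, because join() collects every chunk into a sequence before it allocates the result.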
 

Marc 'BlackJack' Rintsch

Even if `str.join()` did not convert the generator into a list first,
you would still have overallocation. You don't know the final string size
beforehand, so intermediate strings must get moved around in memory while
concatenating. Worst case: all but the last string are already
concatenated and the last one does not fit into the allocated memory
anymore, so a new block of memory is allocated that can hold both strings,
which means double the amount of memory is needed.

Ciao,
Marc 'BlackJack' Rintsch
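
The worst case described above can be put into numbers with a toy model. This is only a sketch under a pessimistic assumption: every append allocates a new block and copies the old contents, so the old and new blocks exist at the same time. The function name peak_with_copying_growth is invented for the example, and the memory held by the incoming chunks themselves is not counted.

```python
def peak_with_copying_growth(chunk_sizes):
    """Model a buffer that is reallocated and copied on every append.

    Returns (final buffer size, peak bytes held by old + new block).
    """
    current = 0
    peak = 0
    for size in chunk_sizes:
        new_size = current + size
        # During the copy, the old block and the new block both exist.
        peak = max(peak, current + new_size)
        current = new_size
    return current, peak


final_size, peak = peak_with_copying_growth([100, 100, 100, 1])
print(final_size, peak)
```

With three 100-byte strings followed by a single extra byte, the last append needs the old 300-byte block and the new 301-byte block at once, which is close to twice the final size.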
 

thebjorn

Perhaps realloc() could be used to avoid this? I'm guessing that's
what cStringIO does, although I'm too lazy to check (I don't have
source on this box). Perhaps a cStringIO.getvalue() implementation
that doesn't copy memory would solve the problem?

-- bjorn
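
A note for readers on Python 3, which postdates this thread: io.BytesIO has a getbuffer() method that exposes the accumulated bytes as a memoryview without copying them. It does not produce a string object, and it does nothing about copies made while the buffer grows, so it only answers part of the question. The helper name build_buffer is hypothetical.

```python
import io


def build_buffer(chunks):
    """Write every chunk into a single in-memory buffer."""
    buf = io.BytesIO()
    for chunk in chunks:
        buf.write(chunk)
    return buf


buf = build_buffer(b"spam" * 2 for _ in range(3))

view = buf.getbuffer()     # memoryview over the internal buffer, no copy
copied = buf.getvalue()    # a separate bytes object with the same contents

same_contents = bytes(view) == copied
total = view.nbytes

# The buffer cannot be resized or closed while a view is outstanding.
view.release()
buf.close()
```

Code that accepts any bytes-like object can work on the view directly; code that insists on a real string still needs getvalue().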
 

Diez B. Roggisch

Matt said:
I have an application that is occasionally called upon to process
strings that are a substantial portion of the size of memory. For
various reasons, the resultant strings must fit completely in RAM.
Occasionally, I need to join some large strings to build some even
larger strings.

Unfortunately, there's no good way of doing this without using 2x the
amount of memory as the result. You can get most of the way there with
things like cStringIO or mmap objects, but when you want to actually
get the result as a Python string, you run into the copy again.

Thus, it would be nice if there was a way to join the output of a
string generator so that I didn't need to keep the partial strings in
memory. "".join(string_generator()), as in the subject line, would be
the obvious way to do this, but of course it converts the generator
output to a list first.

You can't build a contiguous string of bytes without copying them.

The question is: what do you need the resulting strings for? Depending
on the use-case, it might be that you could spare yourself the actual
concatenation, but instead use a generator like this:


def charit(strings):
    for s in strings:
        for c in s:
            yield c

Diez
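
A short usage sketch for the generator above, assuming Python 3. The sample list parts and the counting step are invented for the illustration; itertools.chain.from_iterable from the standard library behaves the same way for this purpose.

```python
from itertools import chain


def charit(strings):
    """Yield the characters of each string in turn, without joining them."""
    for s in strings:
        for c in s:
            yield c


parts = ["ab", "", "cde"]   # hypothetical input strings

# Consume the characters one by one; no joined string is ever built.
vowel_count = sum(1 for c in charit(parts) if c in "aeiou")

# The standard library offers the same flattening.
flattened = list(chain.from_iterable(parts))
```

Whether this helps depends on the consumer: it only works if the downstream code can take characters from an iterator instead of a single string.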
 

Marc 'BlackJack' Rintsch

thebjorn said:
Perhaps realloc() could be used to avoid this? I'm guessing that's
what cStringIO does, although I'm too lazy to check (I don't have
source on this box). Perhaps a cStringIO.getvalue() implementation
that doesn't copy memory would solve the problem?

How could `realloc()` solve that problem? Doesn't `realloc()` copy the
memory too if the current memory block can't hold the new size!?

And `StringIO` has the very same problem: if the `getvalue()`
method doesn't copy, you have to make copies while writing to the
`StringIO` object whenever the buffer is not large enough.

Ciao,
Marc 'BlackJack' Rintsch
 

Sion Arrowsmith

Matt Mackal said:
I have an application that is occasionally called upon to process
strings that are a substantial portion of the size of memory. For
various reasons, the resultant strings must fit completely in RAM.

Do you mean physical RAM, or addressable memory? If the former,
there's an obvious solution....

Occasionally, I need to join some large strings to build some even
larger strings.

Unfortunately, there's no good way of doing this without using 2x the
amount of memory as the result.

I think you can get better than 2x if you've got a reasonable number
of (ideally similarly sized) large strings with something along the
lines of:

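result_string = ""
# assumes len(list_of_strings) is a multiple of the step
for i in range(0, len(list_of_strings), 3): #tune step
    result_string += (list_of_strings[i] +
                      list_of_strings[i+1] +
                      list_of_strings[i+2])
    # drop the references so the pieces can be freed as we go
    list_of_strings[i] = ""
    list_of_strings[i+1] = ""
    list_of_strings[i+2] = ""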

remembering the recent string concatenation optimisations. Beyond
that, your most reliable solution may be the (c)StringIO approach
but with a real file (see the tempfile module, if you didn't know
about it).
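
The real-file variant mentioned above might look roughly like this. It is a sketch for Python 3 using bytes; join_via_tempfile is a made-up name, and whether the final read() stays close to 1x the result size depends on the interpreter and platform, so treat the memory behaviour as something to measure and not as a guarantee.

```python
import tempfile


def join_via_tempfile(chunks):
    """Spool chunks to a temporary file, then read the whole result back."""
    with tempfile.TemporaryFile() as spool:
        for chunk in chunks:
            # Each chunk can be garbage collected once it has been written.
            spool.write(chunk)
        spool.seek(0)
        return spool.read()


joined = join_via_tempfile(b"ab" for _ in range(3))
```

Only one chunk and the final result need to be alive at a time, at the cost of a round trip through the file system.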
 

Carl Banks

Matt Mackal said:
I have an application that is occasionally called upon to process
strings that are a substantial portion of the size of memory. For
various reasons, the resultant strings must fit completely in RAM.
Occasionally, I need to join some large strings to build some even
larger strings.

Do you really need a Python string? Some functions work just fine on
mmap or array objects, for example regular expressions:
array('c', 'llo')
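
The interactive session that produced the output above did not survive in this copy of the thread, so what follows is not the original code. It is a small sketch of the same idea for Python 3, where the 'c' typecode no longer exists and the pattern has to be a bytes pattern. The helper find_span and the sample data are hypothetical.

```python
import mmap
import re
from array import array


def find_span(pattern, buffer_like):
    """Return (start, end) of the first match in a bytes-like object, or None."""
    match = re.search(pattern, buffer_like)
    return match.span() if match else None


# An array of unsigned bytes, searched in place.
data = array("B", b"hello")
array_span = find_span(rb"llo", data)

# An anonymous memory map, also searched in place.
mapped = mmap.mmap(-1, 16)
try:
    mapped.write(b"say hello")
    mmap_span = find_span(rb"llo", mapped)
finally:
    mapped.close()
```

Working with match positions keeps the example independent of what type group() returns for a given container.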

I would look to see if there's a way to use an array or mmap instead.
If you have an upper bound for the total size, then you can reserve
the needed number of bytes.
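
Reserving the space up front could be sketched as follows, assuming Python 3 and a known upper bound. The name fill_preallocated is invented for the example, and a bytearray stands in for the array or mmap object suggested above.

```python
def fill_preallocated(chunks, upper_bound):
    """Copy chunks into a buffer that is allocated once, at full size.

    Returns (buffer, number of bytes actually used).
    """
    buf = bytearray(upper_bound)        # single allocation, zero-filled
    pos = 0
    for chunk in chunks:
        end = pos + len(chunk)
        if end > upper_bound:
            raise ValueError("chunks exceed the reserved upper bound")
        # Same-length slice assignment overwrites in place, no resize.
        buf[pos:end] = chunk
        pos = end
    return buf, pos


buf, used = fill_preallocated([b"ab", b"cde"], upper_bound=8)
payload = memoryview(buf)[:used]        # view of the used part, no copy
```

The buffer never grows, so there is no reallocation and no second copy of the data, but the result is a mutable buffer and not a string.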

If you really need a Python string, you might have to resort to a C
solution.


Carl Banks
 
