chunking a long string?

Roy Smith · Nov 8, 2013

I have a long string (several Mbytes). I want to iterate over it in manageable chunks (say, 1 kbyte each). For (a small) example, if I started with "this is a very long string", and I wanted 10 character chunks, I should get:

"this is a "
"very long "
"string"

This seems like something itertools would do, but I don't see anything. Is there something, or do I just need to loop and slice (and worry about getting all the edge conditions right) myself?

wxjmfauth · Nov 8, 2013

"(say, 1 kbyte each)": one "kilo" of characters or bytes?

Glad to read some users are still living in an ascii world,
at the "Unicode time" where an encoded code point size may vary
between 1-4 bytes.

Oops, sorry, I'm wrong, it can be much more.

jmf

Chris Angelico · Nov 8, 2013

Oops, sorry, I'm wrong, it can be much more.

I know, overhead sucks doesn't it. Python is really abysmal at that;
look how big a single bit is:
14

On the flip side, Python gets really awesome at some other things.
Your operating system probably takes an entire CD to distribute, maybe
even a DVD, so that's either 700MB or 4.7GB, give or take. Look how
efficiently Python can represent it:
36

Wow!

ChrisA

Chris Angelico · Nov 8, 2013

Someone has been hanging out too much over on that thread about
compressing random data ;-)

Hey, that's a bit unfair! Operating systems aren't full of random data!

At least... well, I can't speak for Windows here...

*dives for cover*

ChrisA

Tim Chase · Nov 8, 2013

On the flip side, Python gets really awesome at some other things.
Your operating system probably takes an entire CD to distribute,
maybe even a DVD, so that's either 700MB or 4.7GB, give or take.
Look how efficiently Python can represent it:

36

Someone has been hanging out too much over on that thread about
compressing random data ;-)

-tkc

Chris Angelico · Nov 8, 2013

Those figures look really good but I actually want figures that do things my
way, even if the figures aren't as good or even suck completely. Can you
help me with this even if I've already asked 42 times before but have always
been given the same figures in response?

Yep! I can even offer you an hourglass figure. Here, watch this figure
of an hourglass while you wait for a different answer...

ChrisA

Steven D'Aprano · Nov 9, 2013

"(say, 1 kbyte each)": one "kilo" of characters or bytes?

Glad to read some users are still living in an ascii world, at the
"Unicode time" where an encoded code point size may vary between 1-4
bytes.

Oops, sorry, I'm wrong,

That part is true.

it can be much more.

That part is false. You're measuring the overhead of the object
structure, not the per-character storage. This has been the case going
back since at least Python 2.2: strings are objects, and have overhead.

27

27 bytes for two characters! Except it isn't, it's actually 25 bytes for
the object header and two bytes for the two characters.

48

And here you have four bytes each for the two characters and a 40 byte
header. Observe:

py> c = '\U0001d11e'
py> len(c)
1
py> sys.getsizeof(2*c) - sys.getsizeof(c)
4
py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)
4

How big is the object overhead on a (say) thousand character string? Just
one percent:

py> (sys.getsizeof(1000*c) - 4000)/4000
0.01

Steven D'Aprano · Nov 9, 2013

I have a long string (several Mbytes). I want to iterate over it in
manageable chunks (say, 1 kbyte each). For (a small) example, if I
started with "this is a very long string", and I wanted 10 character
chunks, I should get:

"this is a "
"very long "
"string"

This seems like something itertools would do, but I don't see anything.
Is there something, or do I just need to loop and slice (and worry about
getting all the edge conditions right) myself?

What edge conditions? Should be trivially easy to loop and slice:

def grouper(string, size):
i = 0
while i <= len(string):
yield string[i:i+size]
i += size

But if you prefer, there is a recipe in the itertools documentation to
solve this problem for you:

http://docs.python.org/2/library/itertools.html#recipes

It's short enough to reproduce here.

from itertools import izip_longest
def grouper(iterable, n, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)

grouper(your_string, 10, '')

ought to give you the results you want.

I expect (but haven't tested) that for strings, the slice version will be
faster.

wxjmfauth · Nov 9, 2013

Le samedi 9 novembre 2013 01:46:32 UTC+1, Steven D'Aprano a écrit :

That part is true.

That part is false. You're measuring the overhead of the object

structure, not the per-character storage. This has been the case going

back since at least Python 2.2: strings are objects, and have overhead.

27 bytes for two characters! Except it isn't, it's actually 25 bytes for

the object header and two bytes for the two characters.

And here you have four bytes each for the two characters and a 40 byte

header. Observe:

py> c = '\U0001d11e'

py> len(c)

1

py> sys.getsizeof(2*c) - sys.getsizeof(c)

4

py> sys.getsizeof(1000*c) - sys.getsizeof(999*c)

4

How big is the object overhead on a (say) thousand character string? Just

one percent:

py> (sys.getsizeof(1000*c) - 4000)/4000

0.01

--------

Sure, the new phone "xyz" does not cost 600$, it only cost
only 100$ more than the "abc" 500$ phone model.

If you wish to count the the frequency of chars in a text
and store the results in a dict, {char: number_of_that_char, ...},
do not forget to save the key in utf-XXX, it saves memory.

After all, it is much more funny to waste its time in coding
and in attempting to handle unicode properly and to observe
this poor Python wasting its time in conversions.
25

Hint: If you attempt to do the same exercise with
words in a "latin" text, never forget the length average
of a word is approximatively 1000 chars.

jmf

Chris Angelico · Nov 9, 2013

If you wish to count the the frequency of chars in a text
and store the results in a dict, {char: number_of_that_char, ...},
do not forget to save the key in utf-XXX, it saves memory.

Oh, if you're that concerned about memory usage of individual
characters, try storing them as integers:
14

I really don't see that UTF-32 is much advantage here. UTF-8 happens
to be, because I used an ASCII character, but the integer beats them
all, even for larger numbers:16

And there's even less difference on my Linux box, but of course, you
never compare against Linux because Python 3.2 wide builds don't suit
your numbers.

For longer strings, there's an even more efficient way to store them.
Just store the memory address - that's going to be 4 bytes or 8,
depending on whether it's a 32-bit or 64-bit build of Python. There's
a name for this method of comparing strings: interning. Some languages
do it automatically for all strings, others (like Python) only when
you ask for it. Suddenly it doesn't matter at all what the storage
format is - if the two strings are the same, their addresses are the
same, and conversely. That's how to make it cheap.

Hint: If you attempt to do the same exercise with
words in a "latin" text, never forget the length average
of a word is approximatively 1000 chars.

I think you're confusing length of word with value of picture.

ChrisA

Mark Lawrence · Nov 9, 2013

On 09/11/2013 08:14, (e-mail address removed) wrote:

I'll ask again, please don't send us double spaced google crap.

Roy Smith · Nov 9, 2013

Chris Angelico said:
Some languages [intern] automatically for all strings, others
(like Python) only when you ask for it.

What does "only when you ask for it" mean?

Chris Angelico · Nov 9, 2013

Chris Angelico said:
Chris Angelico said:

Some languages [intern] automatically for all strings, others
(like Python) only when you ask for it.

Click to expand...

What does "only when you ask for it" mean?

You can explicitly intern a Python string with the sys.intern()
function, which returns either the string itself or an
indistinguishable "interned" string. Two equal strings, when interned,
will return the same object:
False

Note that the Python interpreter is free to answer True there, but
there's no mandate for it.
True

Now it's mandated. The two strings must be the same object. Interning
in this way makes string equality come down to an 'is' check, which is
potentially a lot faster than actual string equality.

Some languages (eg Pike) do this automatically with all strings - the
construction of any string includes checking to see if it's a
duplicate of any other string. This adds cost to string manipulation
and speeds up string comparisons; since the engine knows that all
strings are interned, it can do the equivalent of an 'is' check for
any string equality.

So what I meant, in terms of storage/representation efficiency, is
that you can store duplicate strings very efficiently if you simply
increment the reference counts of the same few objects. Python won't
necessarily do that for you; check memory usage of something like
this:

strings = [open("some_big_file").read() for _ in range(10000)]

And compare against this:

strings = [sys.intern(open("some_big_file").read()) for _ in range(10000)]

In a language that guarantees string interning, the syntax of the
former would have the memory consumption of the latter. Whether that
memory saving and improved equality comparison is worth the effort of
dictionarification is one of those eternally-debatable points.

ChrisA

Roy Smith · Nov 9, 2013

Chris Angelico said:
Chris Angelico said:

Some languages [intern] automatically for all strings, others
(like Python) only when you ask for it.

Click to expand...

What does "only when you ask for it" mean?

Click to expand...

You can explicitly intern a Python string with the sys.intern()
function
[long, and good, explanation of interning]

But, you missed the point of my question. You said that Python does
this "only when you ask for it". That implies it never interns strings
if you don't ask for it, which is clearly not true:

$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[...]True

I think what you're trying to say is that there are several possible
interning policies:

1) Strings are never interned

2) Strings are always interned

3) Strings are optionally interned, at the discretion of the
implementation

4) The user may force a specific string to be interned by explicitly
requesting it.

and that Pike implements #1, while Python implements #3 and #4.

Chris Angelico · Nov 9, 2013

But, you missed the point of my question. You said that Python does
this "only when you ask for it". That implies it never interns strings
if you don't ask for it, which is clearly not true:

$ python
Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[...]True

Ah! Yes, that's true; literals are interned - I forgot that. But
anything from an external source won't be, hence my example with
reading in the contents of a file.

I think what you're trying to say is that there are several possible
interning policies:

1) Strings are never interned

2) Strings are always interned

3) Strings are optionally interned, at the discretion of the
implementation

4) The user may force a specific string to be interned by explicitly
requesting it.

and that Pike implements #1, while Python implements #3 and #4.

Pike implements #2, I presume that was a typo. And yes, the interning
of literals falls under #3, while sys.intern() gives #4. Use of #1
would be restricted to languages with mutable strings, I would expect,
for the same reason that Python tuples might be shared but lists won't
be.

ChrisA

Roy Smith · Nov 9, 2013

Chris Angelico said:
Pike implements #2, I presume that was a typo.

Duh. Yes.

Steven D'Aprano · Nov 9, 2013

Chris Angelico said:
Chris Angelico said:

Some languages [intern] automatically for all strings, others (like
Python) only when you ask for it.

Click to expand...

What does "only when you ask for it" mean?

In Python 2:

help(intern)

In Python 3:

import sys
help(sys.intern)

for more info. I think that Chris is wrong about Python "only" interning
strings if you explicitly ask for it. I recall that Python will (may?)
automatically intern strings which look like identifiers (e.g. "spam" but
not "Hello World" or "123abc"). Let's see now:

# using Python 3.1 on Linux

py> s = "spam"
py> t = "spam"
py> s is t
True

but:

py> z = ''.join(["sp", "am"])
py> z is s
False

However:

py> u = "123abc"
py> v = "123abc"
py> u is v
True

Hmmm, obviously the rules are a tad more complicated than I thought... in
any case, you shouldn't rely on automatic interning since it is an
implementation dependent optimization and will probably change without
notice.

Chris Angelico · Nov 9, 2013

I think that Chris is wrong about Python "only" interning
strings if you explicitly ask for it. I recall that Python will (may?)
automatically intern strings which look like identifiers (e.g. "spam" but
not "Hello World" or "123abc").

I'm pretty sure it's simply that literals are interned, or at least
shared across a module (and the interactive interpreter "counts" as a
module). And it might still only be ones which look like identifiers,
because:
False

My "only" was false because of the sharing/interning of (some)
literals, which I'd forgotten about; however, there's still the
distinction that I was trying to draw, that in Python _some strings_
are interned (a feature you can explicitly request), rather than _all
strings_ being interned. And as is typical of python-list, it's this
extremely minor point that became the new course of the thread - my
main point was not whether all, some, or no strings get interned, but
that string interning makes the storage space of duplicate strings
immaterial

ChrisA

Steven D'Aprano · Nov 10, 2013

And
as is typical of python-list, it's this extremely minor point that
became the new course of the thread -

You say that as if it were a bad thing

my main point was not whether all,
some, or no strings get interned, but that string interning makes the
storage space of duplicate strings immaterial

True. It's not just a memory saver[1], but a time saver too. Using Python
3.3:

py> from timeit import Timer
py> t1 = Timer('s == t', setup='s = "a b"*10000; t = "a b"*10000')
py> t2 = Timer('s == t',
.... setup='from sys import intern; s = intern("a b"*10000); '
.... 't = intern("a b"*10000)')
py> min(t1.repeat(number=100000))
7.651959054172039
py> min(t2.repeat(number=100000))
0.00881262868642807

String equality does a short-cut of checking for identity; if the strings
are interned, they will be identical.

[1] Assuming that you actually do have duplicate strings. If every string
is unique, interning them potentially wastes memory.

Chris Angelico · Nov 10, 2013

You say that as if it were a bad thing

More a curiosity than a bad thing.

ChrisA

Measuring a string of text	1	Sep 15, 2022
Converting an Array to a String in JavaScript	7	Sep 22, 2023
Trouble accessing a value within a JSON string.	1	Jun 16, 2023
Copy string from 2D array to a 1D array in C	1	Nov 1, 2023
String and list error while running a Markov Chain	1	Aug 26, 2020
[LONG] java.net.URI encoding weirdness	18	May 5, 2014
convert a long string in binary	3	Aug 20, 2006
Performance of int/long in Python 3	187	Mar 25, 2013

chunking a long string?

Roy Smith

wxjmfauth

Chris Angelico

Chris Angelico

Tim Chase

Chris Angelico

Steven D'Aprano

Steven D'Aprano

wxjmfauth

Chris Angelico

Mark Lawrence

Roy Smith

Chris Angelico

Roy Smith

Chris Angelico

Roy Smith

Steven D'Aprano

Chris Angelico

Steven D'Aprano

Chris Angelico

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads