String concatenation benchmarking weirdness

Rotwang · Jan 11, 2013

Hi all,

the other day I 2to3'ed some code and found it ran much slower in 3.3.0
than 2.7.2. I fixed the problem but in the process of trying to diagnose
it I've stumbled upon something weird that I hope someone here can
explain to me. In what follows I'm using Python 2.7.2 on 64-bit Windows
7. Suppose I do this:

from timeit import timeit

# find out how the time taken to append a character to the end of a byte
# string depends on the size of the string

results = []
for size in range(0, 10000001, 100000):
    results.append(timeit("y = x + 'a'",
                          setup = "x = 'a' * %i" % size, number = 1))

If I plot results against size, what I see is that the time taken
increases approximately linearly with the size of the string, with the
string of length 10000000 taking about 4 milliseconds. On the other
hand, if I replace the statement to be timed with "x = x + 'a'" instead
of "y = x + 'a'", the time taken seems to be pretty much independent of
size, apart from a few spikes; the string of length 10000000 takes about
4 microseconds.

I get similar results with strings (but not bytes) in 3.3.0. My guess is
that this is some kind of optimisation that treats strings as mutable
when carrying out operations that result in the original string being
discarded. If so it's jolly clever, since it knows when there are other
references to the same string:

timeit("x = x + 'a'", setup = "x = y = 'a' * %i" % size, number = 1)
# grows linearly with size

timeit("x = x + 'a'", setup = "x, y = 'a' * %i, 'a' * %i"
       % (size, size), number = 1)
# stays approximately constant

It also can see through some attempts to fool it:

timeit("x = ('' + x) + 'a'", setup = "x = 'a' * %i" % size, number = 1)
# stays approximately constant

timeit("x = x*1 + 'a'", setup = "x = 'a' * %i" % size, number = 1)
# stays approximately constant

Is my guess correct? If not, what is going on? If so, is it possible to
explain to a programming noob how the interpreter does this? And is
there a reason why it doesn't work with bytes in 3.3?

Rotwang · Jan 11, 2013

Hi all,

the other day I 2to3'ed some code and found it ran much slower in 3.3.0 than
2.7.2. I fixed the problem but in the process of trying to diagnose it I've
stumbled upon something weird that I hope someone here can explain to me.

[stuff about timings]

Is my guess correct? If not, what is going on? If so, is it possible to
explain to a programming noob how the interpreter does this?

Basically, yes. You can find the discussion behind that optimization at:

http://bugs.python.org/issue980695

It knows when there are other references to the string because all
objects in CPython are reference-counted. It also works despite your
attempts to "fool" it because after evaluating the first operation
(which is easily optimized to return the string itself in both cases),
the remaining part of the expression is essentially "x = TOS + 'a'",
where x and the top of the stack are the same string object, which is
the same state the original code reaches after evaluating just the x.
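To make the reference-counting point concrete, here is a minimal sketch, assuming CPython (the helper name count_extra_references is made up for illustration). It only shows that binding a second name to a string raises its reference count by one, which is the kind of information the optimization can consult; it does not reproduce the optimization itself.

```python
import sys

def count_extra_references(make_alias, size=1000):
    """Return how much the refcount of a fresh string grows after aliasing."""
    x = 'a' * size                      # size is a parameter, so this is built at run time
    before = sys.getrefcount(x)
    y = x if make_alias else None       # optionally bind a second name to the same object
    after = sys.getrefcount(x)
    return after - before

print(count_extra_references(False))    # no second name bound
print(count_extra_references(True))     # x and y name the same string
```

With "x = y = ..." in the setup, the string has an extra reference like the second call here, which would explain why that timing grows with size.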

Nice, thanks.

The stated use case for this optimization is to make repeated
concatenation more efficient, but note that it is still generally
preferable to use the ''.join() construct, because the optimization is
specific to CPython and may not exist for other Python
implementations.
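As a rough sketch of the two styles being compared (the function names are hypothetical, not from any library), both of these produce the same string; only the first avoids depending on the CPython-specific behaviour described above:

```python
def build_with_join(pieces):
    """Collect the pieces and concatenate them in one call."""
    return ''.join(pieces)

def build_with_plus(pieces):
    """Repeated concatenation; its speed depends on the implementation."""
    result = ''
    for piece in pieces:
        result += piece
    return result

words = ['spam', 'eggs', 'ham'] * 3
print(build_with_join(words) == build_with_plus(words))
```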

The slowdown in my code was caused by a method that built up a string of
bytes by repeatedly using +=, before writing the result to a WAV file.
My fix was to replace the bytes string with a bytearray, which seems
about as fast as the rewrite I just tried with b''.join. Do you know
whether the bytearray method will still be fast on other implementations?
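For reference, the two approaches mentioned here look roughly like this in Python 3. This is only a sketch with made-up function names and dummy data, not the actual WAV-writing code, and it says nothing about how either performs on other implementations:

```python
def build_with_bytearray(chunks):
    """Accumulate chunks in a mutable buffer, then freeze it to bytes."""
    buf = bytearray()
    for chunk in chunks:
        buf += chunk            # extends the bytearray in place
    return bytes(buf)

def build_with_bytes_join(chunks):
    """Concatenate all chunks with a single join call."""
    return b''.join(chunks)

frames = [b'\x00\x01', b'\x02\x03', b'\x04\x05']   # stand-in sample data
print(build_with_bytearray(frames) == build_with_bytes_join(frames))
```

Since a bytearray is mutable by design, appending to it does not rely on the reference-count trick discussed above.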

wxjmfauth · Jan 12, 2013

from timeit import timeit, repeat

size = 1000

r = repeat("y = x + 'a'", setup = "x = 'a' * %i" % size)
print('1:', r)
r = repeat("y = x + 'é'", setup = "x = 'a' * %i" % size)
print('2:', r)
r = repeat("y = x + 'œ'", setup = "x = 'a' * %i" % size)
print('3:', r)
r = repeat("y = x + '€'", setup = "x = 'a' * %i" % size)
print('4:', r)
r = repeat("y = x + '€'", setup = "x = '€' * %i" % size)
print('5:', r)
r = repeat("y = x + 'œ'", setup = "x = 'œ' * %i" % size)
print('6:', r)
r = repeat("y = é + 'œ'", setup = "é = 'œ' * %i" % size)
print('7:', r)
r = repeat("y = é + 'œ'", setup = "é = '€' * %i" % size)
print('8:', r)

c:\python32\pythonw -u "vitesse3.py"

1: [0.3603178435286996, 0.42901157137281515, 0.35459694357592086]
2: [0.3576409223543202, 0.4272010951864649, 0.3590055732104662]
3: [0.3552022735516487, 0.4256544908828328, 0.35824546465278573]
4: [0.35488168890607774, 0.4271707696118834, 0.36109528098614074]
5: [0.3560675370237849, 0.4261538782668417, 0.36138160167082134]
6: [0.3570182634788317, 0.4270155971913008, 0.35770629956705324]
7: [0.3556977225493485, 0.4264969117143753, 0.3645634239700426]
8: [0.35511247834379844, 0.4259628665308437, 0.3580737510097034]

Exit code: 0
c:\Python33\pythonw -u "vitesse3.py"

1: [0.3053600256152646, 0.3306491917840535, 0.3044963374976518]
2: [0.36252767208680514, 0.36937298133086727, 0.3685573415262271]
3: [0.7666293438924097, 0.7653473991487574, 0.7630926729867262]
4: [0.7636680712265038, 0.7647586103955284, 0.7631395397838059]
5: [0.44721085450773934, 0.3863234021671369, 0.45664368355696094]
6: [0.44699700013114807, 0.3873974001136613, 0.45167383387335036]
7: [0.4465200615491014, 0.387050034441188, 0.45459690419205856]
8: [0.44760587465455437, 0.3875261853459726, 0.45421212384964704]

Exit code: 0

The difference between a correct (coherent) unicode handling and ...

jmf

Terry Reedy · Jan 12, 2013

The difference between a correct (coherent) unicode handling and ...

By 'correct' Jim means 'speedy', for a subset of string operations*,
rather than 'accurate'. In 3.2 and before, CPython does not handle
extended plane characters correctly on Windows and other narrow builds.
This is, by the way, true of many other languages. For instance, Tcl 8.5
and before (not sure about the new 8.6) does not handle them at all. The
same is true of Microsoft command windows.
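A small sketch of what 'accurate' refers to, assuming Python 3.3 or later (the helper name describe is hypothetical). It compares the length Python reports for a string with the number of UTF-16 code units needed to store it; on a narrow 3.2 build, len() of an extended plane character would report the code-unit count instead of 1.

```python
def describe(text):
    """Return (number of code points, number of UTF-16 code units)."""
    utf16_units = len(text.encode('utf-16-le')) // 2
    return len(text), utf16_units

print(describe('a'))            # a BMP character
print(describe('\U0001F40D'))   # a character outside the BMP
```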

* Let's try another comparison:

from timeit import timeit
print(timeit("a.encode()", "a = 'a'*10000"))

3.2: 12.1 seconds
3.3: 0.7 seconds

3.3 is 15 times faster!!! (The factor increases with the length of a.)

A fairer comparison is the approximately 120 micro benchmarks in
Tools/stringbench.py. Here they are, uncensored, for 3.3.0 and 3.2.3.
Stringbench is in the Tools directory of some distributions but not all
(it is not included on Windows). It can be downloaded from
http://hg.python.org/cpython/file/6fe28afa6611/Tools/stringbench

In Firefox, right-click on the stringbench.py link and 'Save link as...'
to somewhere you can run it from.

stringbench v2.0
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)]
2013-01-12 06:17:51.685781
bytes unicode
(in ms) (in ms) % comment
========== case conversion -- dense
0.41 0.43 95.2 ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.42 0.43 95.8 ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.41 0.43 95.8 ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.42 0.43 96.3 ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.83 1.95 94.1 s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.10 0.10 98.7 "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.46 2.44 100.9 dna.count("AACT") (*10)
========== count newlines
0.77 0.75 103.6 ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.30 0.27 110.5 ("A"*1000).find("A") (*1000)
0.45 0.06 750.5 "A" in "A"*1000 (*1000)
0.30 0.27 110.4 ("A"*1000).index("A") (*1000)
0.24 0.22 107.2 ("A"*1000).partition("A") (*1000)
0.33 0.29 116.6 ("A"*1000).rfind("A") (*1000)
0.32 0.29 107.9 ("A"*1000).rindex("A") (*1000)
0.20 0.21 94.1 ("A"*1000).rpartition("A") (*1000)
0.42 0.45 93.4 ("A"*1000).rsplit("A", 1) (*1000)
0.39 0.41 95.9 ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.32 0.27 121.1 ("AB"*1000).find("AB") (*1000)
0.45 0.06 729.5 "AB" in "AB"*1000 (*1000)
0.30 0.27 111.2 ("AB"*1000).index("AB") (*1000)
0.23 0.28 85.0 ("AB"*1000).partition("AB") (*1000)
0.33 0.30 110.6 ("AB"*1000).rfind("AB") (*1000)
0.33 0.30 110.5 ("AB"*1000).rindex("AB") (*1000)
0.22 0.27 83.1 ("AB"*1000).rpartition("AB") (*1000)
0.46 0.47 96.7 ("AB"*1000).rsplit("AB", 1) (*1000)
0.44 0.48 90.9 ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.24 0.29 84.0 "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.26 0.28 92.9 "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.25 0.28 90.0 "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A 0.67 0.0 "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A 0.06 0.0 "A".join("") (*100)
========== join empty string, with 5 character sep
N/A 0.06 0.0 "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
0.87 1.27 68.8 "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.14 1.54 74.0 "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.27 0.37 72.0 "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.32 0.43 75.7 "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A 1.30 0.0 "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A 1.37 0.0 "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
3.25 3.23 100.5 s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
2.79 2.78 100.4 s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
1.98 1.94 102.3 s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
3.24 3.23 100.3 s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
4.26 3.62 117.7 s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
3.23 3.23 100.1 s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.32 2.32 100.1 s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
3.23 3.21 100.8 s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
3.58 3.57 100.4 s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
3.60 3.60 100.0 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
3.60 3.56 101.2 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
0.62 0.58 106.3 ("AB"*300+"C").find("BC") (*1000)
0.92 0.82 111.8 ("AB"*300+"CA").find("CA") (*1000)
0.73 0.33 218.8 "BC" in ("AB"*300+"C") (*1000)
0.61 0.60 101.0 ("AB"*300+"C").index("BC") (*1000)
0.54 0.82 66.4 ("AB"*300+"C").partition("BC") (*1000)
0.66 0.63 104.6 ("C"+"AB"*300).rfind("CA") (*1000)
0.91 0.88 102.3 ("BC"+"AB"*300).rfind("BC") (*1000)
0.65 0.62 105.1 ("C"+"AB"*300).rindex("CA") (*1000)
0.53 0.56 94.5 ("C"+"AB"*300).rpartition("CA") (*1000)
0.75 0.77 96.6 ("C"+"AB"*300).rsplit("CA", 1) (*1000)
0.65 0.67 97.0 ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.89 0.87 102.3 ("A"*1000).find("B") (*1000)
1.03 0.64 159.1 "B" in "A"*1000 (*1000)
0.67 0.68 98.7 ("A"*1000).partition("B") (*1000)
0.87 0.85 102.8 ("A"*1000).rfind("B") (*1000)
0.67 0.68 98.5 ("A"*1000).rpartition("B") (*1000)
0.87 0.87 99.2 ("A"*1000).rsplit("B", 1) (*1000)
0.86 0.85 101.5 ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
1.22 1.16 104.9 ("AB"*1000).find("BC") (*1000)
1.93 2.02 95.2 ("AB"*1000).find("CA") (*1000)
1.37 0.94 145.3 "BC" in "AB"*1000 (*1000)
1.39 2.14 65.1 ("AB"*1000).partition("BC") (*1000)
2.32 2.31 100.7 ("AB"*1000).rfind("BC") (*1000)
1.47 1.44 102.1 ("AB"*1000).rfind("CA") (*1000)
2.26 2.27 99.7 ("AB"*1000).rpartition("BC") (*1000)
2.46 2.45 100.2 ("AB"*1000).rsplit("BC", 1) (*1000)
1.15 1.16 99.1 ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.13 0.12 105.0 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.12 0.12 105.2 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.08 0.10 80.6 "A"*10 (*1000)
========== repeat 1 character 1000 times
0.16 0.18 93.1 "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.11 0.13 84.4 "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.39 0.41 94.8 "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
2.02 2.36 85.6 "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
3.12 3.23 96.6 dna.replace("ATC", "ATT") (*10)
========== replace single character
0.33 0.40 82.4 "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.75 0.86 87.4 "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.41 0.48 86.1 "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.14 0.18 79.3 ("Here are some words. "*2).partition(" ") (*1000)
0.11 0.14 75.1 ("Here are some words. "*2).rpartition(" ") (*1000)
0.35 0.39 90.3 ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.32 0.38 83.9 ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.74 2.02 86.3 "...text...".rsplit("\n") (*10)
1.69 1.97 85.5 "...text...".split("\n") (*10)
1.89 2.55 74.0 "...text...".splitlines() (*10)
========== split newlines
0.35 0.39 88.9 "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.34 0.40 86.4 "this\nis\na\ntest\n".split("\n") (*1000)
0.32 0.40 80.7 "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.28 2.30 99.1 dna.rsplit("ACTAT") (*10)
2.63 2.66 98.9 dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.55 0.69 79.0 "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.58 0.70 82.9 "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.51 2.12 71.4 human_text.rsplit() (*10)
1.51 2.05 73.6 human_text.split() (*10)
========== split whitespace (small)
0.48 0.68 70.1 ("Here are some words. "*2).rsplit() (*1000)
0.48 0.64 74.9 ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.24 0.25 95.9 "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.24 0.25 95.7 "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.23 0.25 95.4 "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.09 0.21 44.1 s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.09 0.12 74.0 "\nHello!".rstrip() (*1000)
0.09 0.12 74.0 "Hello!\n".rstrip() (*1000)
0.09 0.12 71.6 "\nHello!\n".strip() (*1000)
0.09 0.12 73.2 "\nHello!".strip() (*1000)
0.09 0.12 72.9 "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.09 0.13 69.6 "\t \tHello".rstrip() (*1000)
0.09 0.13 72.3 "Hello\t \t".rstrip() (*1000)
0.07 0.08 86.8 "Hello\t \t".strip() (*1000)
========== tab split
0.59 0.65 90.9 GFF3_example.rsplit("\t", 8) (*1000)
0.55 0.59 94.2 GFF3_example.rsplit("\t") (*1000)
0.52 0.57 90.7 GFF3_example.split("\t", 8) (*1000)
0.52 0.57 90.1 GFF3_example.split("\t") (*1000)
108.87 116.31 93.6 TOTAL

stringbench v2.0
3.2.3 (default, Apr 11 2012, 07:12:16) [MSC v.1500 64 bit (AMD64)]
2013-01-12 06:23:05.994000
bytes unicode
(in ms) (in ms) % comment
========== case conversion -- dense
0.63 3.01 21.0 ("WHERE IN THE WORLD IS CARMEN SAN DEIGO?"*10).lower() (*1000)
0.63 2.90 21.5 ("where in the world is carmen san deigo?"*10).upper() (*1000)
========== case conversion -- rare
0.84 2.83 29.8 ("Where in the world is Carmen San Deigo?"*10).lower() (*1000)
0.50 3.47 14.3 ("wHERE IN THE WORLD IS cARMEN sAN dEIGO?"*10).upper() (*1000)
========== concat 20 strings of words length 4 to 15
1.82 1.75 103.9 s1+s2+s3+s4+...+s20 (*1000)
========== concat two strings
0.09 0.08 115.5 "Andrew"+"Dalke" (*1000)
========== count AACT substrings in DNA example
2.40 2.64 91.1 dna.count("AACT") (*10)
========== count newlines
0.77 0.75 101.6 ...text.with.2000.newlines.count("\n") (*10)
========== early match, single character
0.19 0.18 101.9 ("A"*1000).find("A") (*1000)
0.39 0.05 824.7 "A" in "A"*1000 (*1000)
0.19 0.19 96.3 ("A"*1000).index("A") (*1000)
0.20 0.22 87.5 ("A"*1000).partition("A") (*1000)
0.20 0.20 101.8 ("A"*1000).rfind("A") (*1000)
0.20 0.20 101.2 ("A"*1000).rindex("A") (*1000)
0.18 0.22 82.5 ("A"*1000).rpartition("A") (*1000)
0.41 0.45 91.7 ("A"*1000).rsplit("A", 1) (*1000)
0.42 0.43 99.0 ("A"*1000).split("A", 1) (*1000)
========== early match, two characters
0.19 0.19 102.3 ("AB"*1000).find("AB") (*1000)
0.39 0.05 781.6 "AB" in "AB"*1000 (*1000)
0.19 0.20 97.9 ("AB"*1000).index("AB") (*1000)
0.23 0.33 71.1 ("AB"*1000).partition("AB") (*1000)
0.20 0.20 101.6 ("AB"*1000).rfind("AB") (*1000)
0.20 0.20 100.1 ("AB"*1000).rindex("AB") (*1000)
0.22 0.31 70.4 ("AB"*1000).rpartition("AB") (*1000)
0.47 0.53 90.0 ("AB"*1000).rsplit("AB", 1) (*1000)
0.45 0.52 85.0 ("AB"*1000).split("AB", 1) (*1000)
========== endswith multiple characters
0.18 0.18 97.6 "Andrew".endswith("Andrew") (*1000)
========== endswith multiple characters - not!
0.18 0.18 100.4 "Andrew".endswith("Anders") (*1000)
========== endswith single character
0.18 0.18 97.1 "Andrew".endswith("w") (*1000)
========== formatting a string type with a dict
N/A 0.53 0.0 "The %(k1)s is %(k2)s the %(k3)s."%{"k1":"x","k2":"y","k3":"z",} (*1000)
========== join empty string, with 1 character sep
N/A 0.05 0.0 "A".join("") (*100)
========== join empty string, with 5 character sep
N/A 0.05 0.0 "ABCDE".join("") (*100)
========== join list of 100 words, with 1 character sep
1.02 1.02 99.6 "A".join(["Bob"]*100)) (*1000)
========== join list of 100 words, with 5 character sep
1.25 1.48 84.4 "ABCDE".join(["Bob"]*100)) (*1000)
========== join list of 26 characters, with 1 character sep
0.31 0.25 122.9 "A".join(list("ABC..Z")) (*1000)
========== join list of 26 characters, with 5 character sep
0.36 0.41 88.4 "ABCDE".join(list("ABC..Z")) (*1000)
========== join string with 26 characters, with 1 character sep
N/A 1.06 0.0 "A".join("ABC..Z") (*1000)
========== join string with 26 characters, with 5 character sep
N/A 1.22 0.0 "ABCDE".join("ABC..Z") (*1000)
========== late match, 100 characters
2.52 2.68 94.0 s="ABC"*33; ((s+"D")*500+s+"E").find(s+"E") (*100)
2.35 3.06 76.9 s="ABC"*33; ((s+"D")*500+"E"+s).find("E"+s) (*100)
1.55 1.61 96.2 s="ABC"*33; (s+"E") in ((s+"D")*300+s+"E") (*100)
2.51 2.68 94.0 s="ABC"*33; ((s+"D")*500+s+"E").index(s+"E") (*100)
3.57 4.66 76.7 s="ABC"*33; ((s+"D")*500+s+"E").partition(s+"E") (*100)
3.23 3.24 99.8 s="ABC"*33; ("E"+s+("D"+s)*500).rfind("E"+s) (*100)
2.35 2.56 91.7 s="ABC"*33; (s+"E"+("D"+s)*500).rfind(s+"E") (*100)
3.23 3.24 99.8 s="ABC"*33; ("E"+s+("D"+s)*500).rindex("E"+s) (*100)
3.58 3.92 91.4 s="ABC"*33; ("E"+s+("D"+s)*500).rpartition("E"+s) (*100)
3.62 3.96 91.4 s="ABC"*33; ("E"+s+("D"+s)*500).rsplit("E"+s, 1) (*100)
2.89 3.38 85.4 s="ABC"*33; ((s+"D")*500+s+"E").split(s+"E", 1) (*100)
========== late match, two characters
0.52 0.52 99.5 ("AB"*300+"C").find("BC") (*1000)
0.69 0.90 76.5 ("AB"*300+"CA").find("CA") (*1000)
0.67 0.37 179.2 "BC" in ("AB"*300+"C") (*1000)
0.51 0.53 96.8 ("AB"*300+"C").index("BC") (*1000)
0.48 0.81 59.3 ("AB"*300+"C").partition("BC") (*1000)
0.55 0.55 101.5 ("C"+"AB"*300).rfind("CA") (*1000)
0.85 0.85 100.0 ("BC"+"AB"*300).rfind("BC") (*1000)
0.55 0.55 100.3 ("C"+"AB"*300).rindex("CA") (*1000)
0.52 0.60 87.1 ("C"+"AB"*300).rpartition("CA") (*1000)
0.78 0.82 95.4 ("C"+"AB"*300).rsplit("CA", 1) (*1000)
0.65 0.72 91.2 ("AB"*300+"C").split("BC", 1) (*1000)
========== no match, single character
0.77 0.77 100.6 ("A"*1000).find("B") (*1000)
0.98 0.63 155.1 "B" in "A"*1000 (*1000)
0.66 0.66 99.7 ("A"*1000).partition("B") (*1000)
0.77 0.77 100.4 ("A"*1000).rfind("B") (*1000)
0.66 0.66 99.7 ("A"*1000).rpartition("B") (*1000)
0.88 0.88 100.4 ("A"*1000).rsplit("B", 1) (*1000)
0.88 0.87 101.2 ("A"*1000).split("B", 1) (*1000)
========== no match, two characters
1.19 1.21 98.1 ("AB"*1000).find("BC") (*1000)
1.79 2.51 71.2 ("AB"*1000).find("CA") (*1000)
1.28 1.08 119.1 "BC" in "AB"*1000 (*1000)
1.10 2.11 52.1 ("AB"*1000).partition("BC") (*1000)
2.37 2.37 100.0 ("AB"*1000).rfind("BC") (*1000)
1.36 1.36 100.5 ("AB"*1000).rfind("CA") (*1000)
2.25 2.26 99.9 ("AB"*1000).rpartition("BC") (*1000)
2.38 2.62 90.7 ("AB"*1000).rsplit("BC", 1) (*1000)
1.18 1.30 90.1 ("AB"*1000).split("BC", 1) (*1000)
========== quick replace multiple character match
0.12 0.32 37.1 ("A" + ("Z"*128*1024)).replace("AZZ", "BBZZ", 1) (*10)
========== quick replace single character match
0.12 0.30 37.9 ("A" + ("Z"*128*1024)).replace("A", "BB", 1) (*10)
========== repeat 1 character 10 times
0.08 0.09 90.3 "A"*10 (*1000)
========== repeat 1 character 1000 times
0.16 0.19 82.2 "A"*1000 (*1000)
========== repeat 5 characters 10 times
0.11 0.12 98.3 "ABCDE"*10 (*1000)
========== repeat 5 characters 1000 times
0.40 0.58 67.9 "ABCDE"*1000 (*1000)
========== replace and expand multiple characters, big string
1.95 2.13 91.7 "...text.with.2000.newlines...replace("\n", "\r\n") (*10)
========== replace multiple characters, dna
2.93 3.25 90.3 dna.replace("ATC", "ATT") (*10)
========== replace single character
0.25 0.26 96.6 "This is a test".replace(" ", "\t") (*1000)
========== replace single character, big string
0.73 1.01 72.0 "...text.with.2000.lines...replace("\n", " ") (*10)
========== replace/remove multiple characters
0.30 0.34 89.0 "When shall we three meet again?".replace("ee", "") (*1000)
========== split 1 whitespace
0.12 0.13 93.3 ("Here are some words. "*2).partition(" ") (*1000)
0.11 0.11 98.8 ("Here are some words. "*2).rpartition(" ") (*1000)
0.32 0.37 86.5 ("Here are some words. "*2).rsplit(None, 1) (*1000)
0.32 0.33 96.9 ("Here are some words. "*2).split(None, 1) (*1000)
========== split 2000 newlines
1.76 2.19 80.5 "...text...".rsplit("\n") (*10)
1.72 2.10 81.9 "...text...".split("\n") (*10)
1.87 2.58 72.4 "...text...".splitlines() (*10)
========== split newlines
0.36 0.34 103.9 "this\nis\na\ntest\n".rsplit("\n") (*1000)
0.35 0.33 105.9 "this\nis\na\ntest\n".split("\n") (*1000)
0.31 0.34 89.7 "this\nis\na\ntest\n".splitlines() (*1000)
========== split on multicharacter separator (dna)
2.18 2.34 93.4 dna.rsplit("ACTAT") (*10)
2.50 2.64 94.5 dna.split("ACTAT") (*10)
========== split on multicharacter separator (small)
0.59 0.62 95.3 "this--is--a--test--of--the--emergency--broadcast--system".rsplit("--") (*1000)
0.55 0.59 93.1 "this--is--a--test--of--the--emergency--broadcast--system".split("--") (*1000)
========== split whitespace (huge)
1.54 2.34 65.5 human_text.rsplit() (*10)
1.51 2.22 68.3 human_text.split() (*10)
========== split whitespace (small)
0.46 0.60 76.5 ("Here are some words. "*2).rsplit() (*1000)
0.45 0.51 87.6 ("Here are some words. "*2).split() (*1000)
========== startswith multiple characters
0.18 0.18 97.3 "Andrew".startswith("Andrew") (*1000)
========== startswith multiple characters - not!
0.18 0.18 100.1 "Andrew".startswith("Anders") (*1000)
========== startswith single character
0.17 0.18 96.8 "Andrew".startswith("A") (*1000)
========== strip terminal newline
0.11 0.21 52.0 s="Hello!\n"; s[:-1] if s[-1]=="\n" else s (*1000)
0.06 0.07 92.1 "\nHello!".rstrip() (*1000)
0.06 0.07 92.2 "Hello!\n".rstrip() (*1000)
0.06 0.07 91.2 "\nHello!\n".strip() (*1000)
0.06 0.07 91.1 "\nHello!".strip() (*1000)
0.06 0.07 91.1 "Hello!\n".strip() (*1000)
========== strip terminal spaces and tabs
0.07 0.07 89.4 "\t \tHello".rstrip() (*1000)
0.07 0.07 91.4 "Hello\t \t".rstrip() (*1000)
0.04 0.05 88.7 "Hello\t \t".strip() (*1000)
========== tab split
0.57 0.56 100.8 GFF3_example.rsplit("\t", 8) (*1000)
0.53 0.53 100.7 GFF3_example.rsplit("\t") (*1000)
0.49 0.49 101.2 GFF3_example.split("\t", 8) (*1000)
0.51 0.49 103.5 GFF3_example.split("\t") (*1000)
102.13 125.57 81.3 TOTAL

Ian Kelly · Jan 12, 2013

The difference between a correct (coherent) unicode handling and ...

This thread was about byte string concatenation, not unicode, so your
rant is not even on-topic here.
