Interned Strings

D

Dave

Hello All,

I'm trying to clarify how Python avoids byte by byte
string comparisons most of the time. As I understand,
dictionaries keep strings, their keys (hash values),
and caches of their keys. Caching keys helps to avoid
recalculation of a string's hash value. So, when two
strings need to be compared, only their cached keys
are compared, which improves performance as there is
no need for byte by byte comparison.

Also, there is a global interning dictionary that
keeps interned strings. What I don't understand is why
strings are interned. How does it help with string
comparisons?

Thank you.

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
 
?

=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=

Dave said:
I'm trying to clarify how Python avoids byte by byte
string comparisons most of the time. As I understand,
dictionaries keep strings, their keys (hash values),
and caches of their keys.

If you are talking about dictionaries with string keys,
you also have the dictionary values, of course.
Caching keys helps to avoid
recalculation of a string's hash value.

Correct (s/keys/string hash values/).
So, when two
strings need to be compared, only their cached keys
are compared, which improves performance as there is
no need for byte by byte comparison.

No. When comparing to strings s1 and s2, the following
operations occur:
1. is s1 and s2 the very same string? If yes, they
are equal.
2. else, do they have the same size, the same first
byte (which might be a null byte), and do they
compare equal, byte-by-byte?
If yes, they are equal, if not, they are not equal.
3. Is it perhaps some other compare operation (<,>

<=, >, >=) that we want to perform? Do the
slow algorithm.

As you can see, the string hash is never consulted when
comparing string objects. It is only consulted to
find the potential dictionary slot in the first place.
Also, there is a global interning dictionary that
keeps interned strings. What I don't understand is why
strings are interned. How does it help with string
comparisons?

Why you look up a dictionary entry, this happens:
1. compute the key hash.
2. find the corresponding dictionary slot
If the slot is empty, KeyError.
3. compare the slot key with the search key.
If they are equal: value found.
If they are different: collision, go to the
next key.

Interned strings speed up step 1 and step 3. If
you only have interned strings throughout, you always
also have the hash value. Of course, you had to
compute the hash value when adding the string
to the interning dictionary.

The real speedup is in step 3, and based on
the assumption that collisions won't happen:
if you lookup the key (e.g. to find the value
of a global variable), you find the slot using
the computed hash. Then:
1. if the slot is empty, it's a KeyError.
2. if the slot is not empty, you first compare
for pointer equality. As collisions are
supposedly unlikely, this will be an equal
string most of the time. Then, if you have
interning, it even will be the *same* string,
so you only need to compare pointers to find
out they are the same string.

So assuming all strings are interned, this is how
you do the dictionary lookup.
1. fetch the hash value from the key (no need to
compute it - it's already cached)
2. go to the slot in the dict.
3. if the slot is empty (==NULL): KeyError
4. otherwise: success.
As you can see, in that case, there is no need to
compare the string values. If there are collisions,
and if not all strings are interned, the algorithm
gets more complicated, but four items above are
assumed to be the common case.

Regards,
Martin
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,007
Latest member
OrderFitnessKetoCapsules

Latest Threads

Top