How can I create customized classes that have similar properties to 'str'?


Hrvoje Niksic

Steven D'Aprano said:
Not really. intern() works very differently, because it can tie itself to
the Python internals.

The exact implementation mechanism is subtly different, but
functionally intern is equivalent to the "atom" function.
In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary
problem. According to the docs, intern()'s purpose is to speed up
dictionary lookups, not to save memory. I suspect that if it does
save memory, it will be by accident.

It's not by accident; it follows from what interning does. Interning
speeds up comparisons by returning the same string object for the same
string contents. If the strings you're working with tend to repeat,
interning will save some memory simply by preventing storage of
multiple copies of the same string. Whether the savings would make
any difference for the OP is another question.
From the docs:
http://docs.python.org/lib/non-essential-built-in-funcs.html

intern( string)
Enter string in the table of ``interned'' strings and return the interned
string - which is string itself or a copy. [...]

Note the words "which is string itself or a copy". It would be ironic if
the OP uses intern to avoid having copies of strings, and ends up with
even more copies than if he didn't bother.

That's a frequently misunderstood sentence. It doesn't mean that
intern will make copies; it simply means that the string you get back
from intern can be either the string you passed it or another
(previously interned) string object that is guaranteed to have the
same contents as your string (which makes it technically a "copy" of
the string you passed to intern).
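
To make that concrete, here is a minimal interactive sketch of the guarantee (CPython 2.x, where intern is a builtin; whether two runtime-built strings start out as distinct objects is an implementation detail):

a = ''.join(['tri', 'gram'])    # built at run time, so typically not auto-interned
b = ''.join(['tri', 'gram'])
print a == b, a is b            # True False -- equal contents, distinct objects

ia = intern(a)                  # first call: a itself goes into the intern table
ib = intern(b)                  # second call: returns the already-interned object
print ia is a, ib is a          # True True -- b's duplicate can now be collected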
 

George Sakkis

On Nov 24, 7:42 pm, Steven D'Aprano wrote:
This has nothing, absolutely NOTHING, to do with memoization. Memoization
trades off memory for time, allowing slow functions to return results
faster at the cost of using more memory. The OP wants to save memory, not
use more of it.

If you bothered to click on that link you would learn that memoization
can be used to save space too and matches the OP's case exactly; even the
identity tests work. Self-importance is bad enough by itself, even
without the ignorance, but you seem to do great in both.

George
 

Peter Otten

Steven said:
store = {}
def atom(str):
    global store
    if str not in store:
        store[str] = str
    return store[str]

Oh lordy, that's really made my day! That's the funniest piece of code
I've seen for a long time! Worthy of being submitted to the DailyWTF.

Here's a script to show atom()'s effect on memory footprint:

$ cat atom.py
import sys
data = [1]*1000
items = []
cache = {}
if "-a" in sys.argv:
def atom(obj):
try:
return cache[obj]
except KeyError:
cache[obj] = obj
return obj
else:
def atom(obj):
return obj
try:
while 1:
items.append(atom(tuple(data)))
except MemoryError:
print len(items)
$ ulimit -v 5000
$ python atom.py
226
$ python atom.py -a
185742

So if you are going to submit Sam's function make sure to bundle it with
this little demo...

Peter
 

Licheng Fang

Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????

Tens of megabytes is not a lot of data.

If the average word size is ten characters, then one million words takes
ten million bytes, or a little shy of ten megabytes. Even if you are
using four-byte characters, you've got 40 MB, still a moderate amount of
data on a modern system.

I mentioned trigram counting as an illustrative case. In fact, you'll
often need to define patterns more complex than that, and tens of
megabytes of text may generate millions of them; I've watched them
quickly eat up the 8G memory of a workstation in a few minutes.
Manipulating these patterns can be tricky; you can easily waste a lot
of memory without taking extra care. I just thought that if I defined my
pattern class with this 'atom' property, coding would be easier
later.
In my natural language processing tasks, I almost always
need to define patterns, identify their occurrences in a huge amount of data, and
count them. Say, I have a big text file, consisting of millions of
words, and I want to count the frequency of trigrams:
trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
I can save the counts in a dict D1. Later, I may want to recount the
trigrams, with some minor modifications, say, doing it on every other
line of the input file, and the counts are saved in dict D2. Problem is,
D1 and D2 have almost the same set of keys (trigrams of the text), yet
the keys in D2 are new instances, even though these keys probably have
already been inserted into D1. So I end up with unnecessary duplicates
of keys. And this can be a great waste of memory with huge input data.
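
(For concreteness, a minimal sketch of the trigrams function used in the example above, with tuples as dict keys; the implementation is only illustrative.)

def trigrams(seq):
    # every window of three consecutive items, as a tuple
    return [tuple(seq[i:i+3]) for i in range(len(seq) - 2)]

D1 = {}
for key in trigrams([1, 2, 3, 4, 5]):   # [(1,2,3), (2,3,4), (3,4,5)]
    D1[key] = D1.get(key, 0) + 1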

All these keys will almost certainly add up to only a few hundred
megabytes, which is a reasonable size of data but not excessive. This
really sounds to me like a case of premature optimization. I think you
are wasting your time solving a non-problem.

[snip]
Wow, I didn't know this. But exactly how does Python manage these strings? My
interpreter gave me these results:
False

It's an implementation detail. You shouldn't use identity testing unless
you actually care that two names refer to the same object, not because
you want to save a few bytes. That's poor design: it's fragile,
complicated, and defeats the purpose of using a high-level language like
Python.
 

Steven D'Aprano

So if you are going to submit Sam's function make sure to bundle it with
this little demo...

Well Peter, I was going to reply with a comment about not changing the
problem domain (tuples of ints to trigrams from a text file for natural
language processing, that is, three character alphanumeric strings), and
that if you re-did your test with strings (as I did) you would see
absolutely no difference. What I was going to say was "Tuples aren't
interned. Short strings that look like identifiers are. Jumping through
hoops to cache things which are already cached is not productive
programming."

But then I dug a little deeper, and disassembled the code I was running,
and discovered that I was being fooled by the Python compiler's constant-
folding, and if I took steps to defeat the optimizer, the effect I was
seeing disappeared, and I got the same results as you.
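
(A sketch of the kind of compile-time folding being described, not the actual test code; whether the run-time-built string is shared is a CPython implementation detail, and the folding assumes a CPython with the peephole optimizer.)

import dis

def folded():
    return 'tri' + 'gram'             # folded to a single constant at compile time

def built():
    return ''.join(['tri', 'gram'])   # built at run time, defeating the optimizer

print folded() is folded()            # True  -- both calls return the same constant
print built() is built()              # False -- two distinct (equal) string objects

dis.dis(folded)                       # shows a single LOAD_CONST 'trigram'
dis.dis(built)                        # shows the two pieces being joined at run time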

Well. So I've learned something new: Python doesn't intern strings in the
way I thought it did. I don't quite know *how* it decides which strings
to intern and which ones not to, but at least I've learnt that what I
thought was true is not true.

So I offer my apology to Samwyse, the caching code isn't as redundant and
silly as it appears, and humbly tuck into this nice steaming plate of
crow.

Somebody pass the salt please.
 

MonkeeSage

This has nothing, absolutely NOTHING, to do with memoization. Memoization
trades off memory for time, allowing slow functions to return results
faster at the cost of using more memory. The OP wants to save memory, not
use more of it.

Not to beat a dead horse, but memoization can significantly minimize
memory usage, given a large data set with redundant elements (as the
OP seems to suggest [e.g., calculating the deltas of trigrams in a
natural language]). For example, if 1/3 of the elements in the data set
are redundant, then the un-memoized version requires roughly 1/3 more
memory, because it stores a separate copy of every redundant element,
whereas the memoized version stores each unique element only once.
Every instance of an element which is already in the cache requires
only the cache lookup (speed), rather than the creation of a new
object (memory). So the trade-off is actually speed for memory rather
than the other way around. Of course, it all depends on the nature of
the data; but for large, redundant data sets, memoization is
definitely a win when it comes to memory optimization.

Regards,
Jordan
 

Colin J. Williams

Steven said:
Not really. intern() works very differently, because it can tie itself to
the Python internals. Samwyse's atom() function doesn't, and so has no
purpose.


In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary problem. According
to the docs, intern()'s purpose is to speed up dictionary lookups, not to
save memory. I suspect that if it does save memory, it will be by
accident.

http://docs.python.org/lib/non-essential-built-in-funcs.html

Yes, but it seems that buffer remains
with Python 3000.

Colin W.
 

Steven D'Aprano

I mentioned trigram counting as an illustrative case. In fact, you'll
often need to define patterns more complex than that, and tens of
megabytes of text may generate millions of them; I've watched them quickly
eat up the 8G memory of a workstation in a few minutes.
Manipulating these patterns can be tricky; you can easily waste a lot of
memory without taking extra care. I just thought that if I defined my pattern
class with this 'atom' property, coding would be easier later.

I'm just not getting the same results as you when I try this. I'm finding
that with no extra effort at all, it just works.

The size of your corpus is not important. Neither is the complexity of
how you generate the patterns. What's important is the number of patterns
you produce, and "millions" isn't that huge a number, even without
interning the strings.

Obviously I'm not running your code, but I can build a dict with millions
of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
memory and not run into any serious problems.

I've just processed roughly 240MB of random emails, generating n-grams up
to length 5. The emails include binary attachments and HTML etc., so I'm
getting lots of patterns that don't normally exist in natural languages
(e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
only 1GB, and that's being shared with about a dozen other apps (including
such memory hogs as Firefox).

Results? 64939962 patterns found, of which 17031467 are unique. There's
paging, yes, and my PC runs a little slow when I try to switch from one
running application to another, but nothing unusable. Opening a dozen
YouTube videos at once impacts performance worse.

I can't think what you're doing to use up 8GB of RAM for merely
"millions" of strings, even if you are keeping two, three, ten redundant
copies. Assuming an average length of twenty bytes per pattern (you said
trigrams, but I'd rather over-estimate than under), and even assuming
that only half the 8GB are available to Python, you should be able to
store something of the order of one hundred million patterns easily:

4 bytes for a pointer plus 20 bytes for the string = 24 bytes

4*1024**3 / 24 = 178,956,970

(This is a ballpark figure. Real Python strings will have additional
overhead.)

If you're running into problems with merely millions of patterns, then
you're doing worse by probably two orders of magnitude.

I don't think that the problem lies where you think it does. If you have
a dict with millions of keys, it doesn't matter how many times each
pattern exists in the corpus because the key only exists *once* in the
dict. Duplicate the dict, or generate it from scratch even, and at worst
you double the memory used by the keys. And probably not even that.

The only thing I can think of that might explain why you're using so much
memory is if you are generating *all* the patterns up front, say in a
list, before adding them to the dict:

# Generate one massive list of patterns containing many duplicates
patterns = make_patterns(text)
# returns a massive list like ['fre', 'req', 'equ', 'que' ...]
d = {}
for pattern in patterns:
    d[pattern] = d.get(pattern, 0) + 1


Notice that the real killer in the above algorithm is that you need
enough storage, not just for the unique patterns, but for EVERY separate
occurrence of each pattern. Nothing to do with how dicts operate, and
everything to do with the algorithm you (hypothetically) are using.

If that's what you're doing, then no wonder you're running out of memory.
With 200MB of text, you have 209715198 trigrams in your list. The
pointers alone will take almost a gigabyte, assuming 32-bit pointers.

If this is your approach, interning the strings won't save you. You
almost certainly should change to a lazy approach, and use a generator to
make each pattern as needed, then throw it away:

def make_patterns(s, n=3):
    """Yield n-grams."""
    if len(s) <= n:
        yield s
    else:
        for i in range(len(s)-n+1):
            yield s[i:i+n]

d = {}
fp = open('corpus', 'r')
for line in fp:
    for word in line.split():
        for pattern in make_patterns(word.strip()):
            d[pattern] = d.get(pattern, 0) + 1



Obviously I haven't seen your code and don't actually know what you are
doing. If my 1GB machine can deal with a dict of 17 million unique keys
from a corpus of 240MB with no special memory management, your 8GB
machine should too.
 

Licheng Fang

I'm just not getting the same results as you when I try this. I'm finding
that with no extra effort at all, it just works.

The size of your corpus is not important. Neither is the complexity of
how you generate the patterns. What's important is the number of patterns
you produce, and "millions" isn't that huge a number, even without
interning the strings.

Obviously I'm not running your code, but I can build a dict with millions
of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
memory and not run into any serious problems.

I've just processed roughly 240MB of random emails, generating n-grams up
to length 5. The emails include binary attachments and HTML etc., so I'm
getting lots of patterns that don't normally exist in natural languages
(e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
only 1GB, and that's being shared with about a dozen other apps (including
such memory hogs as Firefox).

Results? 64939962 patterns found, of which 17031467 are unique. There's
paging, yes, and my PC runs a little slow when I try to switch from one
running application to another, but nothing unusable. Opening a dozen
YouTube videos at once impacts performance worse.

I can't think what you're doing to use up 8GB of RAM for merely
"millions" of strings, even if you are keeping two, three, ten redundant
copies. Assuming an average length of twenty bytes per pattern (you said
trigrams, but I'd rather over-estimate than under), and even assuming
that only half the 8GB are available to Python, you should be able to
store something of the order of one hundred million patterns easily:

My task is identifying sequences of tokens (phrases) that are possible
translations of each other in a bilingual corpus. I need to check all
the subsequences of a sentence to get the possible phrase pairs. This
makes the problem different from n-gram counting in that the number of
possible phrases doesn't grow linearly with n, but approximately with
n^2 (n being the sentence length). My first version of the program
consumed almost twice as much memory as the current one because I
discovered that, in computing different statistics, I was regenerating
the patterns, and the counting dictionaries ended up with duplicate
pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
could restrict the pattern class so that equal patterns are always the
same instance? Then I could avoid such subtle but significant bugs.
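
(To illustrate the quadratic growth: a rough sketch of enumerating all contiguous subsequences of a sentence; the function name and tuple keys are just illustrative.)

def subphrases(tokens, max_len=None):
    # all contiguous subsequences of a token list: about n*(n+1)/2 of them
    n = len(tokens)
    limit = max_len if max_len is not None else n
    return [tuple(tokens[i:j])
            for i in range(n)
            for j in range(i + 1, min(i + limit, n) + 1)]

print len(subphrases(['a', 'b', 'c', 'd']))   # 10 phrases from a 4-token sentence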
4 bytes for a pointer plus 20 bytes for the string = 24 bytes

4*1024**3 / 24 = 178,956,970

(This is a ballpark figure. Real Python strings will have additional
overhead.)



If you're running into problems with merely millions of patterns, then
you're doing worse by probably two orders of magnitude.

I don't think that the problem lies where you think it does. If you have
a dict with millions of keys, it doesn't matter how many times each
pattern exists in the corpus because the key only exists *once* in the
dict. Duplicate the dict, or generate it from scratch even, and at worst
you double the memory used by the keys. And probably not even that.

The only thing I can think of that might explain why you're using so much
memory is if you are generating *all* the patterns up front, say in a
list, before adding them to the dict:

# Generate one massive list of patterns containing many duplicates
patterns = make_patterns(text)
# returns a massive list like ['fre', 'req', 'equ', 'que' ...]
d = {}
for pattern in patterns:
    d[pattern] = d.get(pattern, 0) + 1

No, I wasn't doing that.
BTW, do you think the pattern counting code can avoid hashing the
pattern twice? Is there a way to do that when the dictionary values
are of a primitive type?
Notice that the real killer in the above algorithm is that you need
enough storage, not just for the unique patterns, but for EVERY separate
occurrence of each pattern. Nothing to do with how dicts operate, and
everything to do with the algorithm you (hypothetically) are using.

If that's what you're doing, then no wonder you're running out of memory.
With 200MB of text, you have 209715198 trigrams in your list. The
pointers alone will take almost a gigabyte, assuming 32-bit pointers.

If this is your approach, interning the strings won't save you. You
almost certainly should change to a lazy approach, and use a generator to
make each pattern as needed, then throw it away:

def make_patterns(s, n=3):
    """Yield n-grams."""
    if len(s) <= n:
        yield s
    else:
        for i in range(len(s)-n+1):
            yield s[i:i+n]

d = {}
fp = open('corpus', 'r')
for line in fp:
    for word in line.split():
        for pattern in make_patterns(word.strip()):
            d[pattern] = d.get(pattern, 0) + 1

Obviously I haven't seen your code and don't actually know what you are
doing. If my 1GB machine can deal with a dict of 17 million unique keys
from a corpus of 240MB with no special memory management, your 8GB
machine should too.
 

Chris Mellon

My task is identifying sequences of tokens (phrases) that are possible
translations of each other in a bilingual corpus. I need to check all
the subsequences of a sentence to get the possible phrase pairs. This
makes the problem different from n-gram counting in that the number of
possible phrases doesn't grow linearly with n, but approximately with
n^2 (n being the sentence length). My first version of the program
consumed almost twice as much memory as the current one because I
discovered that, in computing different statistics, I was regenerating
the patterns, and the counting dictionaries ended up with duplicate
pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
could restrict the pattern class so that equal patterns are always the
same instance? Then I could avoid such subtle but significant bugs.

Implement __hash__ and __eq__ on your pattern class. If the same
pattern compares equal and hashes the same then it will be a "matching
key" as far as the dict is concerned and will only be stored once.
This is probably cheaper than explicit interning anyway (you don't
need to search an intern table).
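
(A minimal sketch of the kind of pattern class Chris describes; the class name and the tuple-of-tokens field are invented for illustration.)

class Pattern(object):
    def __init__(self, tokens):
        self.tokens = tuple(tokens)
    def __eq__(self, other):
        return isinstance(other, Pattern) and self.tokens == other.tokens
    def __ne__(self, other):            # keep != consistent with == in Python 2
        return not self.__eq__(other)
    def __hash__(self):
        return hash(self.tokens)

counts = {}
for p in (Pattern([1, 2, 3]), Pattern([1, 2, 3])):   # equal but distinct objects
    counts[p] = counts.get(p, 0) + 1
print len(counts)                       # 1 -- the dict keeps only one matching key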

The only thing I can think of that might explain why you're using so much
memory is if you are generating *all* the patterns up front, say in a
list, before adding them to the dict:

# Generate one massive list of patterns containing many duplicates
patterns = make_patterns(text)
# returns a massive list like ['fre', 'req', 'equ', 'que' ...]
d = {}
for pattern in patterns:
    d[pattern] = d.get(pattern, 0) + 1

No, I wasn't doing that.
BTW, do you think the pattern counting code can avoid hashing the
pattern twice? Is there a way to do that when the dictionary values
are of a primitive type?

Hashing isn't really an expensive operation. On strings it's even
cached on the object. If you implement your own __hash__ method you
can do the same, but I wouldn't bother unless you benchmark it and
discover that hashing is a bottleneck.
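
(And if profiling ever did show hashing as a cost, the string-style caching Chris mentions could be sketched like this; purely illustrative.)

class Pattern(object):
    def __init__(self, tokens):
        self.tokens = tuple(tokens)
        self._hash = None
    def __hash__(self):
        if self._hash is None:          # compute once, reuse on every later lookup
            self._hash = hash(self.tokens)
        return self._hash
    def __eq__(self, other):
        return isinstance(other, Pattern) and self.tokens == other.tokens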
 

zooko

Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge amount of
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:

I have some experience with this, helping my wife do computational
linguistics.

(I also have quite a lot of experience with similar things in my day
job, which is a decentralized storage grid written in Python.)

Unfortunately, Python is not a perfect tool for the job because, as
you've learned, Python isn't overly concerned about conserving
memory. Each object has substantial overhead associated with it
(including each integer, each string, each tuple, ...), and dicts add
overhead due to being sparsely filled. You should do measurements
yourself to get results for your local CPU and OS, but I found, for
example, that storing 20-byte keys and 8-byte values as a Python dict
of Python strings took about 100 bytes per entry.

Try "tokenizing" your trigrams by defining a dict from three unigrams
to a sequentially allocated integer "trigram id" (also called a
"trigram token"), and a reverse dict which goes from a trigram id to
the three unigrams. Whenever you create a new set of three Python
objects representing unigrams, you can pass them through the first
mapping to get the trigram id and then free up the original three
Python objects. If you do this multiple times, you get multiple
references to the same integer object for the trigram id.
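
(A rough sketch of the two-dictionary tokenizing scheme described above; all names are invented for illustration.)

trigram_to_id = {}     # (u1, u2, u3) -> sequentially allocated integer id
id_to_trigram = []     # id -> the three unigrams, for the reverse lookup

def trigram_id(u1, u2, u3):
    key = (u1, u2, u3)
    tid = trigram_to_id.get(key)
    if tid is None:
        tid = len(id_to_trigram)        # next unused id
        trigram_to_id[key] = tid
        id_to_trigram.append(key)
    return tid

counts = {}
for tri in [('a', 'b', 'c'), ('a', 'b', 'c'), ('b', 'c', 'd')]:
    tid = trigram_id(*tri)                # duplicate tuples can now be dropped
    counts[tid] = counts.get(tid, 0) + 1  # keys are small shared ints, not tuples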

My wife and I tried this, but it still wasn't compact enough to
process her datasets in a mere 4 GiB of RAM.

One tool that might help is PyJudy:

http://www.dalkescientific.com/Python/PyJudy.html

Judy is a delightfully memory-efficient, fast, and flexible data
structure. In the specific example of trigram counting (which is also
what my wife was doing), you can, for example, assign each
unigram an integer, and assuming that you have less than two million
unigrams you can pack three unigrams into a 64-bit integer... Hm,
actually at this point my wife and I stopped using Python and rewrote
it in C using JudyTrees. (At the time, PyJudy didn't exist.)
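
(A sketch of the packing idea: with fewer than two million unigrams, each id fits in 21 bits, so three of them fit in a single 64-bit key; the helpers are only illustrative.)

def pack_trigram(id1, id2, id3):
    # 3 * 21 bits = 63 bits, so the packed key fits in a 64-bit word
    assert max(id1, id2, id3) < (1 << 21)
    return (id1 << 42) | (id2 << 21) | id3

def unpack_trigram(key):
    mask = (1 << 21) - 1
    return (key >> 42) & mask, (key >> 21) & mask, key & mask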

If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn, and
perhaps she'll share the resulting C code with you.

Regards,

Zooko Wilcox-O'Hearn
 

samwyse

The "atom" function you describe already exists under the name
"intern".

D'oh! That's what I get for not memorizing "Non-essential Built-in
Functions".

In my defense, however, my function will work with anything that can
be used as a dictionary key (strings, tuples, frozen sets, etc), not
just character strings; thus we return to the original:
Traceback (most recent call last):
  File "<pyshell#33>", line 1, in ?
    intern(a) is intern(b)
TypeError: intern() argument 1 must be string, not tuple
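
(For illustration, the dict-based approach applied to tuples, where intern() raises the TypeError above; the cache name and values are made up for the example.)

_cache = {}
def atom(obj):
    # works for any hashable key: strings, tuples, frozensets, ...
    return _cache.setdefault(obj, obj)

a = tuple([1, 2, 3])            # built at run time, so a and b are distinct objects
b = tuple([1, 2, 3])
print a == b, a is b            # True False -- equal but distinct tuples
print atom(a) is atom(b)        # True -- both map to the single cached tuple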
 
