How can I create customized classes that have similar properties as'str'?

L

Licheng Fang

I mean, all the class instances that equal to each other should be
reduced into only one instance, which means for instances of this
class there's no difference between a is b and a==b.

Thank you.
 
L

Licheng Fang

I find myself frequently in need of classes like this for two reasons.
First, it's efficient in memory. Second, when two instances are
compared for equality only their pointers are compared. (I think
that's how Python compares 'str's.
 
B

Bjoern Schliessmann

Licheng said:
I mean, all the class instances that equal to each other should be
reduced into only one instance, which means for instances of this
class there's no difference between a is b and a==b.

If you only want that if "a == b" is True also "a is b" is True,
overload the is_ attribute of your class. Personally, I don't see
any advantage in this.

Regards,


Björn
 
B

Bjoern Schliessmann

Licheng said:
I find myself frequently in need of classes like this for two
reasons. First, it's efficient in memory.

Are you using millions of objects, or MB size objects? Otherwise,
this is no argument.

BTW, what happens if you, by some operation, make a == b, and
afterwards change b so another object instance must be created?
This instance management is quite a runtime overhead.
Second, when two instances are compared for equality only their
pointers are compared.

I state that the object management will often eat more performance
than equality testing. Except you have a huge number of equal
objects. If the latter was the case you should rethink your program
design.
(I think that's how Python compares 'str's.

Generally not. In CPython, just very short strings are created only
once.
False

Regards,


Björn
 
L

Licheng Fang

Are you using millions of objects, or MB size objects? Otherwise,
this is no argument.

Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

I can save the counts in a dict D1. Later, I may want to recount the
trigrams, with some minor modifications, say, doing it on every other
line of the input file, and the counts are saved in dict D2. Problem
is, D1 and D2 have almost the same set of keys (trigrams of the text),
yet the keys in D2 are new instances, even though these keys probably
have already been inserted into D1. So I end up with unnecessary
duplicates of keys. And this can be a great waste of memory with huge
input data.
BTW, what happens if you, by some operation, make a == b, and
afterwards change b so another object instance must be created?
This instance management is quite a runtime overhead.

I probably need this class to be immutable.
I state that the object management will often eat more performance
than equality testing. Except you have a huge number of equal
objects. If the latter was the case you should rethink your program
design.

Yeah, counting is all about equal or not.
Generally not. In CPython, just very short strings are created only
once.

Wow, I didn't know this. But exactly how Python manage these strings?
My interpretator gave me such results:
 
B

Bjoern Schliessmann

Licheng said:
On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-

I probably need this class to be immutable.

IMHO you don't need a change of Python, but simply a special
implementation (probably using metaclasses/singletons).
Wow, I didn't know this. But exactly how Python manage these
strings?

I don't know (use the source, Luke). :) Or perhaps there is a Python
Elder here that knows?

Regards,


Björn
 
S

samwyse

Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge
data, and count them. [...] So I end up with unnecessary
duplicates of keys. And this can be a great waste of memory with huge
input data.

create a hash that maps your keys to themselves, then use the values
of that hash as your keys.
global store
if str not in store:
store[str] = str
return store[str]
True
 
M

Marc 'BlackJack' Rintsch

I don't know (use the source, Luke). :) Or perhaps there is a Python
Elder here that knows?

AFAIK strings of length 1 and strings that would be valid Python
identifiers are treated this way.

Ciao,
Marc 'BlackJack' Rintsch
 
L

Licheng Fang

AFAIK strings of length 1 and strings that would be valid Python
identifiers are treated this way.

Ciao,
Marc 'BlackJack' Rintsch

Thanks. Then, is there a way to make python treat all strings this
way, or any other kind of immutable objects?
 
B

Bruno Desthuilliers

Licheng Fang a écrit :
I mean, all the class instances that equal to each other should be
reduced into only one instance, which means for instances of this
class there's no difference between a is b and a==b.

Here's a Q&D attempt - without any garantee, and to be taylored to your
needs.

_values = {} #id(instance) => value mapping
_instances = {} #hash(value) => instance mapping

class Value(object):
def __new__(cls, value):
try:
return _instances[hash(value)]
except KeyError:
instance = object.__new__(cls)
_values[id(instance)] = value
_instances[hash(value)] = instance
return instance

@apply
def value():
def fget(self):
return _values[id(self)]
def fset(self, ignore):
raise AttributeError("%s.value is read only" % type(self))
def fdel(self):
raise AttributeError("%s.value is read only" % type(self))
return property(**locals())


HTH
 
S

samwyse

Thanks. Then, is there a way to make python treat all strings this
way, or any other kind of immutable objects?

The word generally used is 'atom' when referring to strings that are
set up such that 'a == b' implies 'a is b'. This is usually an
expensive process, since you don't want to do it to strings that are,
e.g., read from a file. Yes, it could be done only for string
constants, and some languages (starting with LISP) do this, but that
isn't what you (or most people) want. Whether you realize it or not,
you want control over the process; in your example, you don't want to
do it for the lines read from your file, just the trigrams.

The example that I gave does exactly that. It adds a fixed amount of
storage for each string that you 'intern' (the usual name given to the
process of generating such a string. Let's look at my code again:
global store
if str not in store:
store[str] = str
return store[str]

Each string passed to 'atom' already exists. We look to see if copy
already exists; if so we can discard the latest instance and use that
copy henceforth. If a copy does not exist, we save the string inside
'store'. Since the string already exists, we're just increasing its
reference count by two (so it won't be reference counted) and
increasing the size of 'store' by (an amortized) pair of pointers to
that same string.
 
S

samwyse

Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in a huge
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

BTW, if the components of your trigrams are never larger than a byte,
then encode the tuples as integers and don't worry about pointer
comparisons.
return (ord(s[0])*256+ord(s[1]))*256+ord(s[2])
return [ encode(s[i:i+3]) for i in range(0, len(s)-2)]
[6382179, 6447972, 6513765]
 
S

Steven D'Aprano

Yes, millions.


Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????

Tens of megabytes is not a lot of data.

If the average word size is ten characters, then one million words takes
ten million bytes, or a little shy of ten megabytes. Even if you are
using four-byte characters, you've got 40 MB, still a moderate amount of
data on a modern system.

In my natural language processing tasks, I almost always
need to define patterns, identify their occurrences in a huge data, and
count them. Say, I have a big text file, consisting of millions of
words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

I can save the counts in a dict D1. Later, I may want to recount the
trigrams, with some minor modifications, say, doing it on every other
line of the input file, and the counts are saved in dict D2. Problem is,
D1 and D2 have almost the same set of keys (trigrams of the text), yet
the keys in D2 are new instances, even though these keys probably have
already been inserted into D1. So I end up with unnecessary duplicates
of keys. And this can be a great waste of memory with huge input data.

All these keys will almost certainly add up to only a few hundred
megabytes, which is a reasonable size of data but not excessive. This
really sounds to me like a case of premature optimization. I think you
are wasting your time solving a non-problem.



[snip]
Wow, I didn't know this. But exactly how Python manage these strings? My
interpretator gave me such results:

False


It's an implementation detail. You shouldn't use identity testing unless
you actually care that two names refer to the same object, not because
you want to save a few bytes. That's poor design: it's fragile,
complicated, and defeats the purpose of using a high-level language like
Python.
 
S

Steven D'Aprano

create a hash that maps your keys to themselves, then use the values of
that hash as your keys.
global store
if str not in store:
store[str] = str
return store[str]

Oh lordy, that's really made my day! That's the funniest piece of code
I've seen for a long time! Worthy of being submitted to the DailyWTF.

Samwyse, while I applaud your willingness to help, I think you actually
need to get some programming skills before doing so. Here's a hint to get
you started: can you think of a way to optimize that function so it does
less work?
 
S

Steven D'Aprano

If you only want that if "a == b" is True also "a is b" is True,
overload the is_ attribute of your class. Personally, I don't see any
advantage in this.

No advantage? That's for sure. There is no is_ attribute of generic
classes, and even if there was, it would have no special meaning.

Identity testing can't be overloaded. If it could, it would no longer be
identity testing.
 
G

George Sakkis

Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????

Comments like these make one wonder if your real life experience with
massive data matches even the one tenth of your self-importance and
need to be snarky in most of your posts.

To the OP: yes, your use case is quite valid; the keyword you are
looking for is "memoize". You can find around a dozen of recipes in
the Cookbook and posted in this list; here's one starting point:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.

HTH,
George
 
B

Bjoern Schliessmann

Steven said:
No advantage? That's for sure. There is no is_ attribute of
generic classes, and even if there was, it would have no special
meaning.

Argl, I confused the operator module's attributes with objects ;)

Regards,


Björn
 
H

Hrvoje Niksic

samwyse said:
create a hash that maps your keys to themselves, then use the values
of that hash as your keys.

The "atom" function you describe already exists under the name
"intern".
 
S

Steven D'Aprano

Comments like these make one wonder if your real life experience with
massive data matches even the one tenth of your self-importance and need
to be snarky in most of your posts.

I cheerfully admit to never needing to deal with "massive data".

However, I have often needed to deal with tens and hundreds of megabytes
of data, which IS NOT MASSIVE amounts of data to deal with on modern
systems. Which was my point.

To the OP: yes, your use case is quite valid; the keyword you are
looking for is "memoize". You can find around a dozen of recipes in the
Cookbook and posted in this list; here's one starting point:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.

This has nothing, absolutely NOTHING, to do with memoization. Memoization
trades off memory for time, allowing slow functions to return results
faster at the cost of using more memory. The OP wants to save memory, not
use more of it.
 
S

Steven D'Aprano

The "atom" function you describe already exists under the name "intern".

Not really. intern() works very differently, because it can tie itself to
the Python internals. Samwyse's atom() function doesn't, and so has no
purpose.


In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary problem. According
to the docs, intern()'s purpose is to speed up dictionary lookups, not to
save memory. I suspect that if it does save memory, it will be by
accident.

From the docs:
http://docs.python.org/lib/non-essential-built-in-funcs.html

intern( string)
Enter string in the table of ``interned'' strings and return the interned
string - which is string itself or a copy. Interning strings is useful to
gain a little performance on dictionary lookup - if the keys in a
dictionary are interned, and the lookup key is interned, the key
comparisons (after hashing) can be done by a pointer compare instead of a
string compare. Normally, the names used in Python programs are
automatically interned, and the dictionaries used to hold module, class
or instance attributes have interned keys. Changed in version 2.3:
Interned strings are not immortal (like they used to be in Python 2.2 and
before); you must keep a reference to the return value of intern() around
to benefit from it.


Note the words "which is string itself or a copy". It would be ironic if
the OP uses intern to avoid having copies of strings, and ends up with
even more copies than if he didn't bother.

I guess he'll actually need to measure his memory consumption and see
whether he actually has a memory problem or not, right?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,053
Latest member
BrodieSola

Latest Threads

Top