Large Data Sets: Use base variables or classes? And some binding questions

Patrick Sullivan

Hello.

I will be using some large data sets ("points" from 2 to 12 variables)
and would like to use one class for each point rather than a list or
dictionary. I imagine this is terribly inefficient, but how much?

What is the cost of creating a new class?

What is the cost of referencing a class variable?

What is the cost of calling a class method to just return a variable?

Key point: The point objects, once created, are essentially
non-mutable. Static. Is there a way to "bind" a variable to an object
method in a way that is more efficient than the function calling
self.variable_name?

I'll run some profile tests later today, but if anyone has any figures
on the cost/efficiency of object creation in Python, or on any other
idioms related to variable creation, I'd greatly appreciate some links.

Thanks!

Patrick
 
malkarouri

Hello.

I will be using some large data sets ("points" from 2 to 12 variables)
and would like to use one class for each point rather than a list or
dictionary. I imagine this is terribly inefficient, but how much?

I can't really get into details here, but I would suggest that you go
ahead and try first. As you know, premature optimization is the root
of all evil.

General points I would suggest:

- Use Numpy/Scipy (http://www.scipy.org). You will get more
efficiency, more easily, than if you use plain Python lists. And it
is much easier to optimize later.
- Your questions of referencing classes and variables tell me that
perhaps you are starting from a C background, or Java maybe? Anyway,
as far as I know, it is not standard practice to write a class method
(you meant a normal bound method, right?) just to access a variable.
Use a normal Python variable and if you need to make it a method later
turn it into a property.
- Is the efficiency you are looking for in terms of time or memory?
That difference sometimes leads to different optimization tricks.
- By using Numpy there is probably another advantage to you: some
efficiency in the data representation, as the NumPy array stores data,
say integers, without memory overhead per member (point). Just an
array of integers. Of course there is additional constant memory per
array which is independent of the number of elements (points) you are
storing.
- Generally try to think in terms of arrays of data rather than single
points. If it helps, think in terms of matrices. That is more or less
the design of Matlab, and Numpy is more or less similar.
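
For illustration, here is a minimal sketch of the "plain attribute first, property later" advice in the list above. The class name Point and its fields are made up for the example and do not come from the original post:

```python
class Point(object):
    """Hypothetical point class used only to illustrate the idiom."""

    def __init__(self, x, y):
        self.x = x   # plain attribute: callers read p.x directly
        self._y = y  # stored privately; exposed through a property below

    @property
    def y(self):
        # Later turned into a method without changing callers:
        # they still write p.y, with no parentheses.
        return self._y


p = Point(1.0, 2.0)
print(p.x, p.y)
```

Code that reads p.y does not need to change when the attribute becomes a property, which is why a getter method written up front is usually unnecessary.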


Now if you specify your problem further I am sure that you will get
better advice from the community here. Don't focus on the details;
the bigger picture will probably help. Working in graphics? Image
processing? Machine Learning/Statistics/Data Mining/ etc..?
 
Terry Reedy

Patrick said:
Hello.

I will be using some large data sets ("points" from 2 to 12 variables)
and would like to use one class for each point rather than a list or
dictionary. I imagine this is terribly inefficient, but how much?

I strongly suspect that you should use one class and a class instance
for each 'point'. You can make instances 'fixed' after initialization
by customizing appropriate methods, but I would not bother for private code.
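
One possible way to make instances "fixed" after initialization, sketched under the assumption that overriding __setattr__ and __delattr__ is what is meant by "customizing appropriate methods". The name FrozenPoint is hypothetical:

```python
class FrozenPoint(object):
    """Hypothetical point whose attributes cannot be changed after __init__."""

    def __init__(self, x, y):
        # Bypass our own __setattr__ while the object is being built.
        object.__setattr__(self, "x", x)
        object.__setattr__(self, "y", y)

    def __setattr__(self, name, value):
        raise AttributeError("FrozenPoint instances are read-only")

    def __delattr__(self, name):
        raise AttributeError("FrozenPoint instances are read-only")


p = FrozenPoint(3, 4)
try:
    p.x = 10
except AttributeError as exc:
    print("rejected:", exc)
```

This only guards against accidental assignment; a determined caller can still use object.__setattr__ directly.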
 
Carl Banks


Hi, I have a couple suggestions.

I will be using some large data sets ("points" from 2 to 12 variables)
and would like to use one class for each point rather than a list or
dictionary.

Ok, point of terminology. It's not really a nit-pick, either, since
it affects some of your questions below. When you say you want to use
one class for each point, you apparently mean you would like to use
one class instance, or one object, for each point.

One class for each point would be terribly inefficient; one instance,
perhaps not.

I imagine this is terribly inefficient, but how much?

You say large data sets, which suggests that the __slots__ mechanism
could be useful to you.

class A(object):
    __slots__ = ['var1','var2','var3']

Normally, each class instance has an associated dict which stores the
attributes, but if you define __slots__ then the variables will be
stored in fixed memory locations and no dict will be created.

However, it seems from the rest of your comments that speed is your
main concern. Last time someone reported __slots__ didn't make a big
difference in access time, but it probably would speed up creating
objects a bit. Of course, you should profile it to make sure.
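
A small sketch of the difference described above, comparing a class with and without __slots__. The class names PlainPoint and SlotPoint are invented for the example:

```python
class PlainPoint(object):
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


class SlotPoint(object):
    # Attributes live in fixed slots; no per-instance dict is created.
    __slots__ = ("x", "y", "z")

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


plain = PlainPoint(1, 2, 3)
slotted = SlotPoint(1, 2, 3)

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False

try:
    slotted.w = 4  # not listed in __slots__
except AttributeError:
    print("no slot named 'w'")
```

A side effect worth knowing about: a slotted instance refuses attributes that are not listed in __slots__.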

What is the cost of creating a new class?

I'm assuming you want to know the cost of creating a class instance.
Generally speaking, the main cost of this is that you'd be executing
Python code (whereas list and dict are written in C).

What is the cost of referencing a class variable?

I assume you mean an instance variable.

What is the cost of calling a class method to just return a variable?

Significant penalty.

This is because even if the method call is faster (and I doubt very
highly that it is), the method still has to access the variable, which
is going to take the same amount of time as accessing the variable
directly. I.e., you're getting the overhead of a method call to do
the same thing you could have done directly.

I highly recommend against doing this, not only because it's less
efficient, but also because it's considered bad style in Python.
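
To put a number on the penalty for your own machine, something like the following timeit sketch could be used. The class Point and the getter get_x are hypothetical, and the timings will vary between interpreters and hardware, so no figures are claimed here:

```python
import timeit


class Point(object):
    def __init__(self, x):
        self.x = x

    def get_x(self):
        # Getter that does nothing except return the attribute.
        return self.x


p = Point(1.5)

# Time the same lookup done directly and through the getter.
direct = timeit.timeit("p.x", globals={"p": p}, number=200000)
via_method = timeit.timeit("p.get_x()", globals={"p": p}, number=200000)

print("direct attribute: %.4f s" % direct)
print("getter method:    %.4f s" % via_method)
```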

Key point: The point objects, once created, are essentially
non-mutable. Static. Is there a way to "bind" a variable to an object
method in a way that is more efficient than the function calling
self.variable_name?

Python 2.6 has a new object type called namedtuple in the collections
module. (Actually it's a type factory that creates a subclass of
tuple with attribute names mapped to the indices.) This might be a
perfect fit for your needs. You have to upgrade to 2.6, though, which
won't be released for a few days.
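
A minimal sketch of how namedtuple might be used for a three-variable point. The type name Point3 and the field names are assumptions made for the example:

```python
from collections import namedtuple

# The factory returns a new tuple subclass with named fields.
Point3 = namedtuple("Point3", ["x", "y", "z"])

p = Point3(1.0, 2.0, 3.0)
print(p.x, p[1])   # access by name or by index
x, y, z = p        # unpacks like any tuple

try:
    p.x = 9.0      # instances are immutable, like tuples
except AttributeError:
    print("read-only")
```

Because the instances are ordinary tuples underneath, they are created by C code and cannot be modified after creation, which matches the "essentially non-mutable" requirement.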


Carl Banks
 
Steven D'Aprano

However, it seems from the rest of your comments that speed is your main
concern. Last time someone reported __slots__ didn't make a big
difference in access time, but it probably would speed up creating
objects a bit.

Carl probably knows this already, but for the benefit of the Original
Poster:

__slots__ is intended as a memory optimization, not speed optimization.
If it speeds up creation, that's a serendipitous side-effect of using
less memory.

Of course, you should profile it to make sure.

Absolutely.

Can I ask the OP how large is "large" in the Large Data Sets? What seems
large to people is often not large at all to a modern computer.
 
Carl Banks

Carl probably knows this already, but for the benefit of the Original
Poster:

__slots__ is intended as a memory optimization, not speed optimization.
If it speeds up creation, that's a serendipitous side-effect of using
less memory.

No, it'd be a serendipitous side-effect of not having to take the time
to create a dict object, which is quite a bit more of a direct cause.

It might still end up being slower (creating slot descriptors might
take more time for all I know) but it's more than just an effect of
less memory.

Carl Banks
 
Carl Banks

It might still end up being slower (creating slot descriptors might
take more time for all I know) but it's more than just an effect of
less memory.

Actually scratch that. Descriptors are only created when the type
object is created. I can't think of anything that would need to be
done in an instance only if no dict is present, so using slots
almost certainly makes object creation faster. Still, the
last word is the profiler.
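
In that spirit, a rough sketch of how creation time could be measured with timeit. The classes and the helper time_creation are hypothetical, and no particular result is implied:

```python
import timeit


class PlainPoint(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y


class SlotPoint(object):
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def time_creation(cls, number=200000):
    """Return the seconds taken to create `number` instances of cls."""
    return timeit.timeit(lambda: cls(1.0, 2.0), number=number)


for cls in (PlainPoint, SlotPoint):
    print("%-10s %.4f s" % (cls.__name__, time_creation(cls)))
```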


Carl Banks
 
