in place list modification necessary? What's a better idiom?


MooMaster

A similar discussion has already occurred, over 4 years ago:
http://groups.google.com/group/comp...9928?lnk=gst&q=list+in+place#5dff55826a199928

Nevertheless, I have a use-case where such a discussion comes up. For
my data mining class I'm writing an implementation of the bisecting
KMeans clustering algorithm (if you're not familiar with clustering
and are interested, this gives a decent example-based overview:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt). Given a
CSV dataset of n records, we are to cluster them accordingly.

The dataset is generalizable enough to have any kind of data-type
(strings, floats, booleans, etc.) for each of the record's columnar
values; for example, here are a couple of records from the famous iris
dataset:

5.1,3.5,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor

Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (e.g. Iris-setosa
= 1, Iris-versicolor = 2).
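
In other words, the scheme amounts to a little lookup table. Here's a
toy illustration of the idea (just to show the mapping I have in mind,
not my real code):

<code>
#toy illustration of the enumeration scheme
labels = ["Iris-setosa", "Iris-versicolor", "Iris-setosa"]
mapping = {}
for label in labels:
    if label not in mapping:
        mapping[label] = len(mapping) + 1
print mapping["Iris-setosa"], mapping["Iris-versicolor"]   #1 2
print [mapping[label] for label in labels]                 #[1, 2, 1]
</code>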

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:

<code>
from sets import Set

def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.rstrip('\n'))
    #clean up the data
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, points, point.split(",").index(i))
                      for i in point.split(",")]
        print dimensions
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if(isNumeric(stringToQuantize)):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #make it a set to remove duplicates
    domain = list(Set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True
</code>

It works, but it feels a little ugly, and not exactly Pythonic. I need
three lists: the original points list to read in the data, a dimensions
list to hold each processed point, and a fixedPoints list to make
objects out of the processed data. If my dataset is on the order of
millions, this'll nuke the memory. I tried something like:

for point in points:
    point = Point([quantize(i, points, point.split(",").index(i))
                   for i in point.split(",")])

but when I print out the points afterward, it doesn't keep the
changes.
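
(I gather the loop variable is just a name that gets rebound each pass,
so assigning to it never touches the list itself; assigning back through
the index, as in the sketch below, does keep the changes, though I'm not
sure it's the right idiom either:)

<code>
#rebinding 'point' has no effect on 'points'; assigning through
#the index replaces the element in place
for i, point in enumerate(points):
    points[i] = Point([quantize(s, points, point.split(",").index(s))
                       for s in point.split(",")])
</code>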

What's a more efficient way of doing this?
 

Carl Banks

MooMaster said:
So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:

<code>
from sets import Set

def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.rstrip('\n'))
    #clean up the data
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, points, point.split(",").index(i))
                      for i in point.split(",")]
        print dimensions
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if(isNumeric(stringToQuantize)):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #make it a set to remove duplicates
    domain = list(Set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True
</code>

Big problem with this. I'm guessing you never ran it on a really big
file yet. Your quantize function does a lot of unnecessary work: it
rebuilds the list of indices for every single line, every single non-
numeric entry, in fact. (Tech-speak: that is N^2 behavior in both the
number of lines and number of non-numeric columns. Not good.) This
will work ok for a small file, but will take forever on a large file
(perhaps literally).

So, forgetting about column indexing for a moment, we can improve this
vastly simply by generating the list of indices once. Do that in a
separate function, have it return the list, and then pass that list to
the quantize function. So replace the midportion of your
createInitialCluster function with something like this:

    ....
    column_indices = []
    for i in xrange(len(points[0].split(","))): # number of columns
        column_indices.append(quantize_column(points, i))
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, column_indices[i], s)
                      for (i, s) in enumerate(point.split(","))]
                      # need index as well as entry here
        print dimensions
        fixedPoints.append(Point(dimensions))
    ....

And the two functions would be something like this:

def quantize_column(point_list, column_index):
    # this assumes that if the first entry in a column is numeric,
    # the whole column is
    if isNumeric(point_list[0].split(",")[column_index]):
        return None # don't quantize this column
    #first we need to discover all the possible values of this column
    domain = []
    for point in point_list:
        domain.append(point.split(",")[column_index])
    #make it a set to remove duplicates
    return list(set(domain))

def quantize(i, domain, s):
    if domain is None:
        return float(s)
    return float(domain.index(s))

This (once debugged :) will run much, much faster on a large dataset.


Now back to your question.
It works, but it feels a little ugly, and not exactly Pythonic. I need
three lists: the original points list to read in the data, a dimensions
list to hold each processed point, and a fixedPoints list to make
objects out of the processed data. If my dataset is on the order of
millions, this'll nuke the memory. I tried something like:

for point in points:
    point = Point([quantize(i, points, point.split(",").index(i))
                   for i in point.split(",")])
but when I print out the points afterward, it doesn't keep the
changes.

It's because the order of items in a set is undefined. The order of
the items in list(set(["a","b","c","d"])) might be very different from
the order in list(set(["a","b","c","d","e"])). You were passing
quantize an incomplete list of points, so as the list of points
grew, the items in the set changed, and it messed up the order. In
fact, you should never rely on the order being the same, even if you
created the set with the very same arguments.
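
A quick interpreter session makes the hazard concrete (the exact orders
below are arbitrary and may differ on your machine; the point is that
the two runs need not agree with each other):

>>> list(set(["a", "b", "c", "d"]))
['a', 'c', 'b', 'd']
>>> list(set(["a", "b", "c", "d", "e"]))
['a', 'c', 'b', 'e', 'd']

Index-based codes assigned from the first list are meaningless against
the second.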

What you are trying to do should be done with dictionaries: create a
dict that maps a value to a number.

Now, based on your quantize function, it would seem that the number
associated with the value is arbitrary (doesn't matter what it is, as
long as it's distinct), so it isn't necessary to read the whole csv
file in before assigning numbers; just build the dict as you go.

I suggest collections.defaultdict for this task. It's like a regular
dict, but it creates a new value any time you access a key that
doesn't exist, a perfect solution to your task. We'll pass it a
function that generates a different index each time it's called.
(This is probably too advanced for you, but oh well, it's such a cool
trick.)


import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing
    # numbers to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(","):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)
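
To see the trick in isolation (Python 2 here, since count objects have
a .next method; in Python 3 you'd pass itertools.count().__next__
instead):

>>> import collections, itertools
>>> quantization = collections.defaultdict(itertools.count().next)
>>> quantization["Iris-setosa"]
0
>>> quantization["Iris-versicolor"]
1
>>> quantization["Iris-setosa"]     # same key, same number
0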


A couple general pointers:

* Don't ever use i to represent a string. Programmers expect i to be
an integer. i,j,k,l,m, and n should be integers, in fact. u,v,w,x,y,
and z should be floats. You should stick to this convention whenever
possible, but definitely never use i for anything but an integer.

* set is built into Python now; unless you're using an older version
(2.3 I think) you should use set instead of Set.

* The Python style guide (q.g.) recommends that variables use names
such as column_index rather than columnIndex. The world won't end if
you don't follow it but if you want to be Pythonic that's how.



Carl Banks
 

Peter Otten

MooMaster said:
Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (i.e. Iris-setosa
= 1, iris-versicolor=2).

I'd calculate the distance as

def string_dist(x, y, weight=1):
    return weight * (x == y)

You don't get a high resolution in that dimension, but you don't introduce
an element of randomness, either.

Peter
 

Carl Banks

Peter Otten said:
I'd calculate the distance as

def string_dist(x, y, weight=1):
    return weight * (x == y)

You don't get a high resolution in that dimension, but you don't introduce
an element of randomness, either.

Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called "bisecting
Kmeans" suggests to me that it does, which means this wouldn't work.

However, the OP had better be sure to set the scales for the quantized
dimensions high enough so that no clusters form containing points with
different discrete values.

That, in turn, suggests he might as well not even bother sending the
discrete values to the clustering algorithm, but instead call it
for each unique set of discretes. (However, I could imagine the
marginal cost of more dimensions is less than that of multiple runs;
I've been dealing with such a case at work.)

I'll leave it to the OP to decide.


Carl Banks
 

Peter Otten

Carl said:
Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called "bisecting
Kmeans" suggests to me that it does, which means this wouldn't work.

I've read about K-Means in Segaran's "Collective Intelligence" which
describes it:

"K-Means clustering begins with k randomly placed centroids (points in space
that represent the center of the cluster), and assigns every item to the
nearest one. After the assignment, the centroids are moved to the average
location of all the nodes assigned to them, and the assignments are redone.
This process repeats until the assignments stop changing."

The book doesn't go into the theory, and "any distance would do" was my
assumption which may be wrong.
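
For concreteness, here is a bare-bones sketch of the loop as the book
describes it; dist and mean are left as parameters, which is exactly
where my "any distance would do" assumption may break down:

import random

def kmeans(items, k, dist, mean):
    # start from k randomly chosen items as the initial centroids
    # (one common variant of "randomly placed")
    centroids = random.sample(items, k)
    assignment = None
    while True:
        # assign every item to the nearest centroid
        new_assignment = [min(range(k),
                              key=lambda j: dist(item, centroids[j]))
                          for item in items]
        # stop when the assignments no longer change
        if new_assignment == assignment:
            return assignment
        assignment = new_assignment
        # move each centroid to the average of its members
        for j in range(k):
            members = [item for item, a in zip(items, assignment)
                       if a == j]
            if members:
                centroids[j] = mean(members)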

Peter
 

MRAB

Carl said:
MooMaster said:
So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:
[snip]
#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True
[snip]

import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing
    # numbers to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(","):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)

I'd also remove the isNumeric() function. You're just converting to a
float if you can successfully convert to a float!

try:
    dimensions.append(float(s))
except ValueError:
    dimensions.append(float(quantization[s]))
 

R. David Murray

andrew cooke said:
Carl said:
[snip]


nice reply (i didn't know defaultdict worked like that - very neat).

two small things i noticed:

1 - do you need a separate quantization for each column? the code above
might give, for example, non-contiguous ranges of integers for a
particular column if a string occurs ("by accident" perhaps) in more than
one.

2 - don't bother with isNumeric. just return the cast value or catch the
exception:

[...]
try:
    dimensions.append(float(s))
except:
    dimensions.append(float(quantization[s]))


No, no, no; never use a bare except! :)

Do it MRAB's way and catch ValueError.
 

MRAB

andrew said:
R. David Murray said:
[...]
try:
    dimensions.append(float(s))
except:
    dimensions.append(float(quantization[s]))

No, no, no; never use a bare except! :)


can you explain why? i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.

A bare except will catch _any_ exception, even those you didn't
expect (technical term: "bug" :)); for example, if you had written:

try:
    dimension.append(float(s))
except:
    dimensions.append(float(quantization[s]))

it would silently catch "NameError: name 'dimension' is not defined" and
perform the fallback action.
 

R. David Murray

andrew cooke said:
R. David Murray said:
[...]
try:
    dimensions.append(float(s))
except:
    dimensions.append(float(quantization[s]))


No, no, no; never use a bare except! :)


can you explain why? i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.


What if the user pressed Ctrl-C right when the float was being converted
and appended?

Never use a bare except unless you have a specific reason to do so,
and there are very few of those. (Yes, I should have said it that way
to start with, my apologies for going hyperbolic.) Using one because it
doesn't _look_ like it will cause issues is just asking for
hard-to-track-down bugs :).
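
If you want to see that in action, run this toy loop and try to stop it
with Ctrl-C (a deliberately bad sketch; don't write this for real):

import time

while True:
    try:
        time.sleep(1)
    except:        # bare except swallows KeyboardInterrupt too...
        pass       # ...so Ctrl-C can't break out of the loop

With except ValueError: instead, the KeyboardInterrupt propagates and
the loop dies as you'd expect.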
 

MooMaster

Carl Banks said:
[snip]

Nice explanation! Very detailed and thorough, thanks! I haven't used
defaultdict before, so that was a very useful and efficient trick!
 

MooMaster

Peter Otten said:
oops, this must of course be (x != y).

The randomness doesn't matter too much: all K-means cares about is the
distance between two points in a coordinate space, and as long as that
space is invariant we're fine (i.e. we don't want (1,1) becoming (3,1)
on the next iteration, or the value for a quantized column changing).
With that in mind, I was hoping to be lazy and just go with an
enumeration approach...

Nevertheless, it does introduce a subtle ordering for nominal data: if
Iris-Setosa = 1, Iris-Versicolor = 2, and Iris-Virginica = 3, then on
that scale Iris-Versicolor is intuitively "closer" to Virginica than
Setosa is, when in fact such distances don't mean anything on a
nominal scale. I hadn't thought about a function like that, but it
makes a lot of sense. Thanks!
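
Quick sanity check of the corrected function, just to convince myself
that every pair of distinct labels really is equally far apart:

def string_dist(x, y, weight=1):
    return weight * (x != y)

print string_dist("Iris-setosa", "Iris-versicolor")     #1
print string_dist("Iris-versicolor", "Iris-virginica")  #1
print string_dist("Iris-setosa", "Iris-setosa")         #0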
 
