in place list modification necessary? What's a better idiom?


MooMaster

A similar discussion has already occurred, over 4 years ago:
http://groups.google.com/group/comp...9928?lnk=gst&q=list+in+place#5dff55826a199928

Nevertheless, I have a use-case where such a discussion comes up. For
my data mining class I'm writing an implementation of the bisecting
KMeans clustering algorithm (if you're not familiar with clustering
and are interested, this gives a decent example-based overview:
http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt). Given a
CSV dataset of n records, we are to cluster them accordingly.

The dataset is generalizable enough to have any kind of data-type
(strings, floats, booleans, etc.) for each of the record's columnar
values; for example, here are a couple of records from the famous iris
dataset:

5.1,3.5,1.4,0.2,Iris-setosa
6.4,3.2,4.5,1.5,Iris-versicolor

Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (e.g. Iris-setosa
= 1, Iris-versicolor = 2).
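
In other words, the scheme amounts to a little lookup table. Here's a
toy illustration of the idea (just to show the mapping I have in mind,
not my real code):

<code>
#toy illustration of the enumeration scheme
labels = ["Iris-setosa", "Iris-versicolor", "Iris-setosa"]
mapping = {}
for label in labels:
    if label not in mapping:
        mapping[label] = len(mapping) + 1
print mapping["Iris-setosa"], mapping["Iris-versicolor"]   #1 2
print [mapping[label] for label in labels]                 #[1, 2, 1]
</code>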

So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:

<code>
from sets import Set

def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.rstrip('\n'))
    #clean up the data
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, points, point.split(",").index(i))
                      for i in point.split(",")]
        print dimensions
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if(isNumeric(stringToQuantize)):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #make it a set to remove duplicates
    domain = list(Set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True
</code>

It works, but it feels a little ugly, and not exactly Pythonic. I need
three lists: the original points list to read in the data, a dimensions
list to hold each processed point, and a fixedPoints list to make
objects out of the processed data. If my dataset is on the order of
millions, this'll nuke the memory. I tried something like:

for point in points:
    point = Point([quantize(i, points, point.split(",").index(i))
                   for i in point.split(",")])

but when I print out the points afterward, it doesn't keep the
changes.
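
(I gather the loop variable is just a name that gets rebound each pass,
so assigning to it never touches the list itself; assigning back through
the index, as in the sketch below, does keep the changes, though I'm not
sure it's the right idiom either:)

<code>
#rebinding 'point' has no effect on 'points'; assigning through
#the index replaces the element in place
for i, point in enumerate(points):
    points[i] = Point([quantize(s, points, point.split(",").index(s))
                       for s in point.split(",")])
</code>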

What's a more efficient way of doing this?
 

Carl Banks

MooMaster said:
So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:

<code>
from sets import Set

def createInitialCluster(fileName):
    #get the data from the file
    points = []
    with open(fileName, 'r') as f:
        for line in f:
            points.append(line.rstrip('\n'))
    #clean up the data
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, points, point.split(",").index(i))
                      for i in point.split(",")]
        print dimensions
        fixedPoints.append(Point(dimensions))
    #return an initial cluster of all the points
    return Cluster(fixedPoints)

def quantize(stringToQuantize, pointList, columnIndex):
    #if it's numeric, no need to quantize
    if(isNumeric(stringToQuantize)):
        return float(stringToQuantize)
    #first we need to discover all the possible values of this column
    domain = []
    for point in pointList:
        domain.append(point.split(",")[columnIndex])
    #make it a set to remove duplicates
    domain = list(Set(domain))
    #use the index into the domain as the number representing this value
    return float(domain.index(stringToQuantize))

#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True
</code>

Big problem with this. I'm guessing you never ran it on a really big
file yet. Your quantize function does a lot of unnecessary work: it
rebuilds the list of indices for every single line, every single non-
numeric entry, in fact. (Tech-speak: that is N^2 behavior in both the
number of lines and number of non-numeric columns. Not good.) This
will work ok for a small file, but will take forever on a large file
(perhaps literally).

So, forgetting about column indexing for a moment, we can improve this
vastly simply by generating the list of indices once. Do that in a
separate function, have it return the list, and then pass that list to
the quantize function. So replace the midportion of your
createInitialCluster function with something like this:

    ....
    column_indices = []
    for i in xrange(len(points[0].split(","))): # number of columns
        column_indices.append(quantize_column(points, i))
    fixedPoints = []
    for point in points:
        dimensions = [quantize(i, column_indices[i], s)
                      for (i, s) in enumerate(point.split(","))]
                      # need index as well as entry here
        print dimensions
        fixedPoints.append(Point(dimensions))
    ....

And the two functions would be something like this:

def quantize_column(point_list, column_index):
    # this assumes that if the first entry in a column is numeric,
    # the whole column is
    if isNumeric(point_list[0].split(",")[column_index]):
        return None # don't quantize this column
    #first we need to discover all the possible values of this column
    domain = []
    for point in point_list:
        domain.append(point.split(",")[column_index])
    #make it a set to remove duplicates
    return list(set(domain))

def quantize(i, domain, s):
    if domain is None:
        return float(s)
    return float(domain.index(s))

This (once debugged :) will run much, much faster on a large dataset.


Now back to your question.
It works, but it feels a little ugly, and not exactly Pythonic. I need
three lists: the original points list to read in the data, a dimensions
list to hold each processed point, and a fixedPoints list to make
objects out of the processed data. If my dataset is on the order of
millions, this'll nuke the memory. I tried something like:

for point in points:
    point = Point([quantize(i, points, point.split(",").index(i))
                   for i in point.split(",")])
but when I print out the points afterward, it doesn't keep the
changes.

It's because the order of items in a set is undefined. The order of
the items in list(set(["a","b","c","d"])) might be very different from
the order in list(set(["a","b","c","d","e"])). You were passing
quantize an incomplete list of points, so as the list of points
grew, the items in the set changed, and it messed up the order. In
fact, you should never rely on the order being the same, even if you
created the set with the very same arguments.
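
A quick interpreter session makes the hazard concrete (the exact orders
below are arbitrary and may differ on your machine; the point is that
the two runs need not agree with each other):

>>> list(set(["a", "b", "c", "d"]))
['a', 'c', 'b', 'd']
>>> list(set(["a", "b", "c", "d", "e"]))
['a', 'c', 'b', 'e', 'd']

Index-based codes assigned from the first list are meaningless against
the second.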

What you are trying to do should be done with dictionaries: create a
dict that maps a value to a number.

Now, based on your quantize function, it would seem that the number
associated with the value is arbitrary (doesn't matter what it is, as
long as it's distinct), so it isn't necessary to read the whole csv
file in before assigning numbers; just build the dict as you go.

I suggest collections.defaultdict for this task. It's like a regular
dict, but it creates a new value any time you access a key that
doesn't exist, a perfect solution to your task. We'll pass it a
function that generates a different index each time it's called.
(This is probably too advanced for you, but oh well, it's such a cool
trick.)


import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing
    # numbers to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(","):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)
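
To see the trick in isolation (Python 2 here, since count objects have
a .next method; in Python 3 you'd pass itertools.count().__next__
instead):

>>> import collections, itertools
>>> quantization = collections.defaultdict(itertools.count().next)
>>> quantization["Iris-setosa"]
0
>>> quantization["Iris-versicolor"]
1
>>> quantization["Iris-setosa"]     # same key, same number
0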


A couple general pointers:

* Don't ever use i to represent a string. Programmers expect i to be
an integer. i,j,k,l,m, and n should be integers, in fact. u,v,w,x,y,
and z should be floats. You should stick to this convention whenever
possible, but definitely never use i for anything but an integer.

* set is built into Python now; unless you're using an older version
(2.3 I think) you should use set instead of Set.

* The Python style guide (q.g.) recommends that variables use names
such as column_index rather than columnIndex. The world won't end if
you don't follow it but if you want to be Pythonic that's how.



Carl Banks
 

Peter Otten

MooMaster said:
Now we can't calculate a meaningful Euclidean distance for something
like "Iris-setosa" and "Iris-versicolor" unless we use string-edit
distance or something overly complicated, so instead we'll use a
simple quantization scheme of enumerating the set of values within the
column domain and replacing the strings with numbers (i.e. Iris-setosa
= 1, iris-versicolor=2).

I'd calculate the distance as

def string_dist(x, y, weight=1):
    return weight * (x == y)

You don't get a high resolution in that dimension, but you don't introduce
an element of randomness, either.

Peter
 

Carl Banks

Peter Otten said:
I'd calculate the distance as

def string_dist(x, y, weight=1):
    return weight * (x == y)

You don't get a high resolution in that dimension, but you don't introduce
an element of randomness, either.

Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called "bisecting
Kmeans" suggests to me that it does, which means this wouldn't work.

However, the OP had better be sure to set the scales for the quantized
dimensions high enough so that no clusters form containing points with
different discrete values.

That, in turn, suggests he might as well not even bother sending the
discrete values to the clustering algorithm, but instead call it
for each unique set of discretes. (However, I could imagine the
marginal cost of more dimensions is less than that of multiple runs;
I've been dealing with such a case at work.)

I'll leave it to the OP to decide.


Carl Banks
 

Peter Otten

Carl said:
Does the algorithm require well-ordered data along the dimensions?
Though I've never heard of it, the fact that it's called "bisecting
Kmeans" suggests to me that it does, which means this wouldn't work.

I've read about K-Means in Segaran's "Collective Intelligence" which
describes it:

"K-Means clustering begins with k randomly placed centroids (points in space
that represent the center of the cluster), and assigns every item to the
nearest one. After the assignment, the centroids are moved to the average
location of all the nodes assigned to them, and the assignments are redone.
This process repeats until the assignments stop changing."

The book doesn't go into the theory, and "any distance would do" was my
assumption which may be wrong.
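
For concreteness, here is a bare-bones sketch of the loop as the book
describes it; dist and mean are left as parameters, which is exactly
where my "any distance would do" assumption may break down:

import random

def kmeans(items, k, dist, mean):
    # start from k randomly chosen items as the initial centroids
    # (one common variant of "randomly placed")
    centroids = random.sample(items, k)
    assignment = None
    while True:
        # assign every item to the nearest centroid
        new_assignment = [min(range(k),
                              key=lambda j: dist(item, centroids[j]))
                          for item in items]
        # stop when the assignments no longer change
        if new_assignment == assignment:
            return assignment
        assignment = new_assignment
        # move each centroid to the average of its members
        for j in range(k):
            members = [item for item, a in zip(items, assignment)
                       if a == j]
            if members:
                centroids[j] = mean(members)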

Peter
 

MRAB

Carl said:
MooMaster said:
So I'm reading in values from a file, and for each column I need to
dynamically discover the range of possible values it can take and
quantize if necessary. This is the solution I've come up with:
[snip]
#harvested from http://www.rosettacode.org/wiki/IsNumeric#Python
def isNumeric(string):
    try:
        i = float(string)
    except ValueError:
        return False
    return True
[snip]

import collections
import itertools

def createInitialCluster(fileName):
    fixedPoints = []
    # quantization is a dict that assigns sequentially-increasing
    # numbers to values when reading keys that don't yet exist
    quantization = collections.defaultdict(itertools.count().next)
    with open(fileName, 'r') as f:
        for line in f:
            dimensions = []
            for s in line.rstrip('\n').split(","):
                if isNumeric(s):
                    dimensions.append(float(s))
                else:
                    dimensions.append(float(quantization[s]))
            fixedPoints.append(Point(dimensions))
    return Cluster(fixedPoints)

I'd also remove the isNumeric() function. You're just converting to a
float if you can successfully convert to a float!

try:
    dimensions.append(float(s))
except ValueError:
    dimensions.append(float(quantization[s]))
 

R. David Murray

andrew cooke said:
Carl said:
[snip]


nice reply (i didn't know defaultdict worked like that - very neat).

two small things i noticed:

1 - do you need a separate quantization for each column? the code above
might give, for example, non-contiguous ranges of integers for a
particular column if a string occurs ("by accident" perhaps) in more than
one.

2 - don't bother with isNumeric. just return the cast value or catch the
exception:

[...]
try:
    dimensions.append(float(s))
except:
    dimensions.append(float(quantization[s]))


No, no, no; never use a bare except! :)

Do it MRAB's way and catch ValueError.
 

MRAB

andrew said:
R. David Murray said:
[...]
try:
    dimensions.append(float(s))
except:
    dimensions.append(float(quantization[s]))

No, no, no; never use a bare except! :)


can you explain why? i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.

A bare except will catch _any_ exception, even those you didn't
expect (technical term: "bug" :)); for example, if you had written:

try:
    dimension.append(float(s))
except:
    dimensions.append(float(quantization[s]))

it would silently catch "NameError: name 'dimension' is not defined" and
perform the fallback action.
 

R. David Murray

andrew cooke said:
R. David Murray said:
[...]
try:
    dimensions.append(float(s))
except:
    dimensions.append(float(quantization[s]))


No, no, no; never use a bare except! :)


can you explain why? i can't think of any reason why the code would be
better catching a specific exception.

as a general rule, maybe, but in this particular case i can't see a reason
- so i'm not sure if you're just pedantically following rules or if i've
missed something i should know.


What if the user pressed Ctrl-C right when the float was being converted
and appended?

Never use a bare except unless you have a specific reason to do so,
and there are very few of those. (Yes, I should have said it that way
to start with, my apologies for going hyperbolic.) Using one because it
doesn't _look_ like it will cause issues is just asking for
hard-to-track-down bugs :).
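
If you want to see that in action, run this toy loop and try to stop it
with Ctrl-C (a deliberately bad sketch; don't write this for real):

import time

while True:
    try:
        time.sleep(1)
    except:        # bare except swallows KeyboardInterrupt too...
        pass       # ...so Ctrl-C can't break out of the loop

With except ValueError: instead, the KeyboardInterrupt propagates and
the loop dies as you'd expect.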
 

MooMaster

Carl Banks said:
[snip]

Nice explanation! Very detailed and thorough, thanks! I haven't used
defaultdict before, so that was a very useful and efficient trick!
 

MooMaster

Peter Otten said:
oops, this must of course be (x != y).

The randomness doesn't matter too much: all K-means cares about is the
distance between two points in a coordinate space, and as long as that
space is invariant we're fine (i.e. we don't want (1,1) becoming (3,1)
on the next iteration, or the value for a quantized column changing).
With that in mind, I was hoping to be lazy and just go with an
enumeration approach...

Nevertheless, it does introduce a subtle ordering for nominal data: if
Iris-Setosa = 1, Iris-Versicolor = 2, and Iris-Virginica = 3, then on
that scale Iris-Versicolor is intuitively "closer" to Virginica than
Setosa is, when in fact such distances don't mean anything on a
nominal scale. I hadn't thought about a function like that, but it
makes a lot of sense. Thanks!
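
Quick sanity check of the corrected function, just to convince myself
that every pair of distinct labels really is equally far apart:

def string_dist(x, y, weight=1):
    return weight * (x != y)

print string_dist("Iris-setosa", "Iris-versicolor")     #1
print string_dist("Iris-versicolor", "Iris-virginica")  #1
print string_dist("Iris-setosa", "Iris-setosa")         #0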
 
