Mar/11

13

# K-Means calculations in Python

So, having used K-Means in PHP in the past, I expected that it would be similarly straightforward in Python. Simply install numpy, scipy, Pycluster, and *yikes* that didn’t work quite like I hoped. Quite a bit more complicated than what I needed. So I ported the trusty PHP implementation to Python. A great big thank you goes to Jose Fonseca for providing the original implementation.

Both kmeans.py and it’s dependency, ordereddict.py are available.

```from math import ceil

from ordereddict import OrderedDict

"""
This code was originally created in PHP by Jose Fonseca
(josefonseca@blip.pt), and ported to Python by Jay Taylor

Please feel free to use it in either commercial or non-comercial applications.
"""

def kmeans(data, k):
"""
This def takes a array of integers and the number of clusters to create.:
It returns a multidimensional array containing the original data organized
in clusters.

@param array data
@param int k

@return array
"""
cPositions = assign_initial_positions(data, k)
clusters = OrderedDict()
while True:
changes = kmeans_clustering(data, cPositions, clusters)
if not changes:
return kmeans_get_cluster_values(data, clusters)
cPositions = kmeans_recalculate_cpositions(data, cPositions, clusters)

def kmeans_clustering(data, cPositions, clusters):
"""
"""
nChanges = 0
for dataKey, value in enumerate(data):#.items():
minDistance = None
cluster = None
for k, position in cPositions.items():
dist = distance(value, position)
if None is minDistance or minDistance > dist:
minDistance = dist
cluster = k
if not clusters.has_key(dataKey) or clusters[dataKey] != cluster:
nChanges += 1
clusters[dataKey] = cluster
return nChanges

def kmeans_recalculate_cpositions(data, cPositions, clusters):
kValues = kmeans_get_cluster_values(data, clusters)
for k, position in cPositions.items():
if not kValues.has_key(k):
cPositions[k] = 0
else:
cPositions[k] = kmeans_avg(kValues[k])
#cPositions[k] = empty(kValues[k]) ? 0 : kmeans_avg(kValues[k])
return cPositions

def kmeans_get_cluster_values(data, clusters):
values = OrderedDict()
for dataKey, cluster in clusters.items():
if not values.has_key(cluster):
values[cluster] = []
values[cluster].append(data[dataKey])
return values

def kmeans_avg(values):
n = len(values)
total = sum(values)
if n == 0:
return 0
else:

def distance(v1, v2):
"""
Calculates the distance (or similarity) between two values. The closer
the return value is to ZERO, the more similar the two values are.

@param int v1
@param int v2

@return int
"""
return abs(v1-v2)

def assign_initial_positions(data, k):
"""
Creates the initial positions for the given
number of clusters and data.
@param array data
@param int k

@return array
"""
small = min(data)
big = max(data)
num = ceil((abs(big - small) * 1.0) / k)
cPositions = OrderedDict()
while k > 0:
k -= 1
cPositions[k] = small + num * k
return cPositions

if __name__ == '__main__':
print kmeans([1, 3, 2, 5, 6, 2, 3, 1, 30, 36, 45, 3, 15, 17], 3)
```
```>python kmeans.py
OrderedDict({0: [1, 3, 2, 5, 6, 2, 3, 1, 3], 2: [30, 36, 45], 1: [15, 17]})
```

A simple port, after fixing a few spacing typos it worked right out of the gate. If you see any problems pretty please let me know!

Now if someone would just make kmeans++ into a Python module..that would be cool! Hmm..

No tags Hide

To top