Jay Taylor | programmer notes

Archive for March 2011



K-Means calculations in Python

So, having used K-Means in PHP in the past, I expected that it would be similarly straightforward in Python. Simply install numpy, scipy, Pycluster, and *yikes* that didn’t work quite like I hoped. Quite a bit more complicated than what I needed. So I ported the trusty PHP implementation to Python. A great big thank you goes to Jose Fonseca for providing the original implementation.

Both kmeans.py and it’s dependency, ordereddict.py are available.

from math import ceil

from ordereddict import OrderedDict

This code was originally created in PHP by Jose Fonseca
(josefonseca@blip.pt), and ported to Python by Jay Taylor
(jaytaylor.com/@jtaylor on Twitter).

Please feel free to use it in either commercial or non-comercial applications.

def kmeans(data, k):
    This def takes a array of integers and the number of clusters to create.:
    It returns a multidimensional array containing the original data organized
    in clusters.

    @param array data
    @param int k

    @return array
    cPositions = assign_initial_positions(data, k)
    clusters = OrderedDict()
    while True:
        changes = kmeans_clustering(data, cPositions, clusters)
        if not changes:
            return kmeans_get_cluster_values(data, clusters)
        cPositions = kmeans_recalculate_cpositions(data, cPositions, clusters)

def kmeans_clustering(data, cPositions, clusters):
    nChanges = 0
    for dataKey, value in enumerate(data):#.items():
        minDistance = None
        cluster = None
        for k, position in cPositions.items():
            dist = distance(value, position)
            if None is minDistance or minDistance > dist:
                minDistance = dist
                cluster = k
        if not clusters.has_key(dataKey) or clusters[dataKey] != cluster:
            nChanges += 1
        clusters[dataKey] = cluster
    return nChanges

def kmeans_recalculate_cpositions(data, cPositions, clusters):
    kValues = kmeans_get_cluster_values(data, clusters)
    for k, position in cPositions.items():
        if not kValues.has_key(k):
            cPositions[k] = 0
            cPositions[k] = kmeans_avg(kValues[k])
        #cPositions[k] = empty(kValues[k]) ? 0 : kmeans_avg(kValues[k])
    return cPositions

def kmeans_get_cluster_values(data, clusters):
    values = OrderedDict()
    for dataKey, cluster in clusters.items():
        if not values.has_key(cluster):
            values[cluster] = []
    return values

def kmeans_avg(values):
    n = len(values)
    total = sum(values)
    if n == 0:
        return 0
        return total / (n * 1.0)

def distance(v1, v2):
    Calculates the distance (or similarity) between two values. The closer
    the return value is to ZERO, the more similar the two values are.

    @param int v1
    @param int v2

    @return int
    return abs(v1-v2)

def assign_initial_positions(data, k):
    Creates the initial positions for the given
    number of clusters and data.
    @param array data
    @param int k

    @return array
    small = min(data)
    big = max(data)
    num = ceil((abs(big - small) * 1.0) / k)
    cPositions = OrderedDict()
    while k > 0:
        k -= 1
        cPositions[k] = small + num * k
    return cPositions

if __name__ == '__main__':
    print kmeans([1, 3, 2, 5, 6, 2, 3, 1, 30, 36, 45, 3, 15, 17], 3)
>python kmeans.py
OrderedDict({0: [1, 3, 2, 5, 6, 2, 3, 1, 3], 2: [30, 36, 45], 1: [15, 17]})

A simple port, after fixing a few spacing typos it worked right out of the gate. If you see any problems pretty please let me know!

Now if someone would just make kmeans++ into a Python module..that would be cool! Hmm..

No tags Hide

Find it!