## Toolbox: K-means algorithm

K-means is a basic unsupervised learning technique. It is an excellent choice for spatial data clustering, for initial data analysis, and for segmenting a customer base. Because the algorithm is unsupervised, we cannot easily say in advance how many groups there should be. So the experiment is usually repeated for several numbers of groups, and we choose the final number based on the model inertia and the silhouette score. The class implemented in this article does all of those tasks. It:

• performs multiple experiments over different numbers of groups,
• stores inertia and silhouette scores,
• shows inertia and silhouette score per number of clusters.
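
Before the class itself, it may help to see what the two metrics measure. The sketch below is an illustration and not part of the class: it assumes scikit-learn is installed, uses synthetic data from `make_blobs`, and fixes `n_init` and `random_state` only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for illustration): 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Inertia: sum of squared distances from each point to its assigned centroid.
assigned_centroids = kmeans.cluster_centers_[labels]
manual_inertia = float(np.sum((X - assigned_centroids) ** 2))

# Silhouette score: mean over all samples of (b - a) / max(a, b), where a is
# the mean distance to points in the same cluster and b is the mean distance
# to points in the nearest other cluster. It always lies in [-1, 1].
score = silhouette_score(X, labels)

print("manual inertia: ", manual_inertia)
print("kmeans.inertia_:", kmeans.inertia_)
print("silhouette:     ", score)
```

The hand-computed value should agree with `kmeans.inertia_` up to floating-point error. Lower inertia means tighter clusters, but it tends to keep falling as clusters are added, which is why the silhouette score is tracked as well.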

Here is the implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


class ClusteredData:
    """Fit K-means for several cluster counts and keep the metrics of each model."""

    def __init__(self, dataset: pd.DataFrame):
        self.ds = dataset
        self.no_of_ranges = None
        self.models = []
        self.predicted_labels = {}
        self.s_scores = []
        self.inertia_scores = []

    def build_models(self, no_of_clusters_range: list, update_input_labels=True):
        self.no_of_ranges = no_of_clusters_range
        for n_clust in no_of_clusters_range:
            kmeans = KMeans(n_clusters=n_clust)
            y_pred = kmeans.fit_predict(self.ds)

            # Append model
            self.models.append(kmeans)

            # Calculate metrics
            self.s_scores.append(self._calc_s_score(y_pred))
            self.inertia_scores.append(kmeans.inertia_)

            # Append output (classified)
            if update_input_labels:
                self.predicted_labels[n_clust] = y_pred

    def _calc_s_score(self, labels):
        # Score a random sample of at most 1000 points to limit the cost
        s_score = silhouette_score(self.ds, labels, sample_size=1000)
        return s_score

    def show_inertia(self):
        plt.figure(figsize=(10, 10))
        plt.title('Inertia of the models')
        plt.plot(self.no_of_ranges, self.inertia_scores)
        plt.show()

    def show_silhouette_scores(self):
        plt.figure(figsize=(10, 10))
        plt.title('Silhouette scores')
        plt.plot(self.no_of_ranges, self.s_scores)
        plt.show()

And here is how to invoke it:

models = ClusteredData(scaled_data)  # Remember to scale and normalize data
ranges = np.arange(4, 21)
models.build_models(ranges)
models.show_inertia()
models.show_silhouette_scores()
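
The invocation assumes that `scaled_data` already exists. One possible way to prepare it is sketched below, assuming scikit-learn's `StandardScaler` is an acceptable choice for your data. The `raw_data` frame and its column names are hypothetical and serve only as an example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data: two features on very different scales.
rng = np.random.default_rng(0)
raw_data = pd.DataFrame({
    "annual_spend": rng.normal(5000.0, 1500.0, size=200),
    "visits_per_month": rng.normal(4.0, 1.5, size=200),
})

# Standardize each column to zero mean and unit variance so that no single
# feature dominates the Euclidean distances that K-means relies on.
scaler = StandardScaler()
scaled_data = pd.DataFrame(
    scaler.fit_transform(raw_data),
    columns=raw_data.columns,
    index=raw_data.index,
)

print(scaled_data.describe().loc[["mean", "std"]])
```

Wrapping the result back into a `DataFrame` keeps the column names and matches the type hint of `ClusteredData.__init__`. Other scalers may suit skewed or bounded features better.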

Note: what do these graphs tell us about the clusters? Looking at the inertia alone, it is hard to say at which point the curve breaks. Thus, we check the silhouette score. It is a normalized index between -1 and +1. A value close to +1 indicates that the clusters are dense and well separated, so the data is easily clustered. A value close to 0 signifies overlapping groups, and a value close to -1 means that many points have likely been assigned to the wrong cluster. Our task is to select a number of clusters with a reasonably high silhouette score. In this case, that is 8-10 clusters. We are ready for further analysis!

##### Szymon