{"id":683,"date":"2022-03-03T14:14:08","date_gmt":"2022-03-03T14:14:08","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=683"},"modified":"2022-03-03T14:14:09","modified_gmt":"2022-03-03T14:14:09","slug":"toolbox-k-means-algorithm","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2022\/03\/03\/toolbox-k-means-algorithm\/","title":{"rendered":"Toolbox: K-means algorithm"},"content":{"rendered":"\n<p>K-means is a basic unsupervised learning technique. This algorithm is an excellent choice for <strong><a href=\"https:\/\/ml-gis-service.com\/index.php\/2020\/10\/14\/data-science-unsupervised-classification-of-satellite-images-with-k-means-algorithm\/\" data-type=\"post\" data-id=\"28\">spatial data<\/a> clustering<\/strong> or <strong>the initial analysis and categorization of a customer base<\/strong>. Because K-means is unsupervised, we cannot easily say in advance how many groups there should be. So the experiment is usually repeated for several numbers of groups, and we choose the final number based on the model inertia and the silhouette score. The class implemented in this article does all of those tasks. 
It:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>performs multiple experiments, each with a different number of groups,<\/li><li>stores the inertia and silhouette scores,<\/li><li>plots the inertia and silhouette score per number of clusters.<\/li><\/ul>\n\n\n\n<p>Here is the implementation:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom sklearn.cluster import KMeans\nfrom sklearn.metrics import silhouette_score\n\n\nclass ClusteredData:\n\n    def __init__(self, dataset: pd.DataFrame):\n        self.ds = dataset\n        self.no_of_ranges = None\n        self.models = []\n        self.predicted_labels = {}\n        self.s_scores = []\n        self.inertia_scores = []\n\n    def build_models(self, no_of_clusters_range: list, update_input_labels=True):\n        self.no_of_ranges = no_of_clusters_range\n        for n_clust in no_of_clusters_range:\n            # Fit one K-means model per tested number of clusters\n            kmeans = KMeans(n_clusters=n_clust)\n            y_pred = kmeans.fit_predict(self.ds)\n\n            # Append model\n            self.models.append(kmeans)\n\n            # Calculate metrics\n            self.s_scores.append(self._calc_s_score(y_pred))\n            self.inertia_scores.append(kmeans.inertia_)\n\n            # Append output (classified)\n            if update_input_labels:\n                self.predicted_labels[n_clust] = y_pred\n\n    def _calc_s_score(self, labels):\n        # The score is estimated on a random sample of at most 1000 points\n        s_score = silhouette_score(self.ds, labels, sample_size=1000)\n        return s_score\n\n    def show_inertia(self):\n        plt.figure(figsize=(10, 10))\n        plt.title('Inertia of the models')\n        plt.plot(self.no_of_ranges, self.inertia_scores)\n        plt.show()\n\n    def show_silhouette_scores(self):\n        plt.figure(figsize=(10, 10))\n        
plt.title('Silhouette scores')\n        plt.plot(self.no_of_ranges, self.s_scores)\n        plt.show()<\/pre>\n\n\n\n<p>And here is how to invoke it:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">models = ClusteredData(scaled_data)  # Remember to scale and normalize the data first\nranges = np.arange(4, 21)  # Test from 4 to 20 clusters\nmodels.build_models(ranges)\nmodels.show_inertia()\nmodels.show_silhouette_scores()<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"617\" height=\"590\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/03\/Unknown.png\" alt=\"\" class=\"wp-image-685\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/03\/Unknown.png 617w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/03\/Unknown-300x287.png 300w\" sizes=\"auto, (max-width: 617px) 100vw, 617px\" \/><figcaption>Inertia of the models<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"601\" height=\"590\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/03\/Unknown-2.png\" alt=\"\" class=\"wp-image-686\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/03\/Unknown-2.png 601w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/03\/Unknown-2-300x295.png 300w\" sizes=\"auto, (max-width: 601px) 100vw, 601px\" \/><figcaption>Silhouette scores of the models<\/figcaption><\/figure><\/div>\n\n\n\n<p>Note: what do those graphs tell us about the clusters? Looking at the inertia, it is hard to say at which point the curve breaks. Thus, we check the silhouette score. It is a normalized index between +1 and -1. 
A value close to +1 indicates that our clusters do not overlap and the data is easily clustered. A value close to 0 indicates overlapping clusters, and a value close to -1 indicates that samples have been assigned to the wrong clusters. Our task is to select a number of clusters with a reasonably high silhouette score. In this case, it is 8-10 clusters. We are ready for further analysis!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>K-means clustering class for local experiments<\/p>\n","protected":false},"author":1,"featured_media":688,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[18,19,3,17],"tags":[159,158,161,157,154,21,155,7,160,28,156,62],"class_list":["post-683","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-machine-learning","category-python","category-scripts","tag-clustering","tag-clusters","tag-e-commerce","tag-inertia","tag-k-means","tag-k-means-clustering","tag-kmeans","tag-python","tag-scikit-learn-2","tag-scikit-learn","tag-silhouette-score","tag-spatial"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/683","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=683"}],"version-history":[{"count":2,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/683\/revisions"}],"predecessor-version":[{"id":689,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/683\/revisions\/689"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/68
8"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=683"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=683"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=683"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}