{"id":28,"date":"2020-10-14T07:10:12","date_gmt":"2020-10-14T07:10:12","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=28"},"modified":"2021-11-30T22:17:08","modified_gmt":"2021-11-30T22:17:08","slug":"data-science-unsupervised-classification-of-satellite-images-with-k-means-algorithm","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2020\/10\/14\/data-science-unsupervised-classification-of-satellite-images-with-k-means-algorithm\/","title":{"rendered":"Data Science: Unsupervised Classification of satellite images with K-means algorithm"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Updates<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li>2021-11-30: Added colorbar, correced typos and new function to get a list of spectral indices related to the specific labels from the classification output.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Article Prerequisites:<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>K-means clustering<\/strong> technique (<a href=\"https:\/\/en.wikipedia.org\/wiki\/K-means_clustering\">https:\/\/en.wikipedia.org\/wiki\/K-means_clustering<\/a>).<\/li><li><strong>Python 3.8<\/strong>, <strong>numpy<\/strong> (<a href=\"https:\/\/numpy.org\/\">https:\/\/numpy.org<\/a>) and <strong>rasterio<\/strong> (<a href=\"https:\/\/rasterio.readthedocs.io\/en\/stable\/\">https:\/\/rasterio.readthedocs.io\/en\/stable\/<\/a>), <strong>scikit-learn<\/strong>, <strong>matplotlib<\/strong> and optionally <strong>scikit-image<\/strong>.<\/li><li><strong>Jupyter Notebook<\/strong> (https:\/\/jupyter.org) and conda package and environment management system (https:\/\/docs.conda.io\/en\/latest\/)<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Article Outcomes:<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li>You will prepare Landsat 8 data as the input for k-means clustering algorithm.<\/li><li>You perform clustering tasks for different number of clusters and select optimal number of clusters for your 
analysis.<\/li><li>You use k-means model inertia and silhouette metrics.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Why do we perform unsupervised classification of the satellite images?<\/h2>\n\n\n\n<p>Unsupervised categorization of remotely sensed data seems to be impractical. In truth, supervised classification gives you better results and\u00a0<strong>you know what you\u2019re classifying<\/strong>. Unsupervised algorithms don\u2019t give you that opportunity. Still, there are areas, where unsupervised classification rocks:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Data compression<\/strong>&nbsp;(instead of thousands of different pixel values you get dozens of them, file size may be much smaller).<\/li><li><strong>Cloud detection<\/strong>. Scenario is simple, you have scene without clouds, train on it unsupervised classification algorithm and get multiple labels, then you do the same for future scenes and if labels\u2019 borders are not aligned then probably you get a clouds.<\/li><li><strong>Anomaly detection<\/strong>. Imagine scene without many details (like sea surface). Clusters will be quite large but there are smaller dots \u2013 ships \u2013 which can be clipped from the image based on the simple statistics.<\/li><li><strong>Semi-supervised learning<\/strong>. You may have data labeled only partially, then use of k-means on the unlabeled data is the fastest method to label unknown values.<\/li><\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Environment Setup<\/h2>\n\n\n\n<p>We use the conda package manager to create a working environment. You should install those packages (from the\u00a0<em>conda-forge<\/em>\u00a0channel):<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>numpy<\/li><li>rasterio<\/li><li>scikit-learn<\/li><li>matplotlib<\/li><li>(optional) scikit-image<\/li><\/ul>\n\n\n\n<p>To perform this task you need a band set of atmospherically corrected Landsat 8 images. 
I\u2019ve used scene\u00a0LC08_L1TP_018030_20190630_20190706_01_T1,\u00a0which covers the Great Lakes region (Lake Erie, Lake Ontario, Lake Huron), from which I clipped the Toronto (CA) area. You may use any other scene, especially images of the area where you live!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data Inspection<\/h2>\n\n\n\n<p>It is a good idea to look into a scene before the analysis, and that\u2019s why instead of jumping in the deep end we start with data visualization. For it we only need band 2, band 3 and band 4 of the Landsat 8 scene. We will, however, pass more data into the k-means algorithm and visualize scenes in false color, so the input dataset consists of bands from 1 to 7. All of them have a resolution of 30 meters per pixel, so we do not need to upsample or downsample anything before the analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 1: Import Python libraries<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import os\n\nimport numpy as np\nimport rasterio as rio\n\nfrom sklearn.cluster import KMeans\nfrom sklearn.metrics import silhouette_score\nfrom skimage.exposure import equalize_adapthist\n\nimport matplotlib.pyplot as plt<\/pre>\n\n\n\n<p>We import\u00a0<code>numpy<\/code>\u00a0for array handling,\u00a0<code>rasterio<\/code>\u00a0for i\/o operations related to the remotely sensed data, scikit-learn\u2019s\u00a0KMeans\u00a0package for unsupervised classification of pixels, scikit-learn\u2019s\u00a0silhouette_score\u00a0function for model evaluation, and matplotlib\u2019s\u00a0pyplot\u00a0for data visualization. Python\u2019s\u00a0os\u00a0library is used for the preparation of the paths to Landsat 8 bands. Scikit-image\u2019s\u00a0equalize_adapthist\u00a0function is optional; it is used for visualization of the RGB image. 
It equalizes the image histogram. I always use it during my analysis because Landsat 8 scenes are dark and it is hard to distinguish details in the RGB image. On the other hand, if you just want to run the analysis, you can skip this function. Equalized histograms are not needed by the K-means clustering algorithm.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 2: Load data<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">path = 'data\/'\ncomplete_dataset = os.listdir(path)\ncomplete_dataset = [path + x for x in complete_dataset]\nprint(complete_dataset)<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Output of Listing 2:<\/h4>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">['data\/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B1.tif',\n 'data\/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B2.tif',\n 'data\/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B3.tif',\n 'data\/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B4.tif',\n 'data\/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B5.tif',\n 'data\/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B6.tif',\n 'data\/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B7.tif']<\/pre>\n\n\n\n<p>In my case bands are stored in the\u00a0<code>'data\/'<\/code>\u00a0folder inside the script directory, so you should change the path name if your scene is elsewhere. Note that the path is relative; if your data is not in your script folder then you should provide the absolute path. 
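<\/p>

<p>A side note on ordering: <code>os.listdir<\/code> does not guarantee any particular order, while the stacking code later in this article relies on the bands coming in B1&#8211;B7 order. A defensive sketch (the filenames below are invented for illustration):<\/p>

```python
# A shuffled listing such as os.listdir() might return (hypothetical names)
listing = ['clip_B3.tif', 'clip_B1.tif', 'clip_B7.tif', 'clip_B2.tif']

path = 'data/'
# Sort before building the paths so the stack is always ordered B1..B7
# (note: lexicographic order matches band order only up to B9)
complete_dataset = [path + name for name in sorted(listing)]
print(complete_dataset)
```

<p>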
With a simple list comprehension, we create the full paths to each band stored for processing.<\/p>\n\n\n\n<p>You could visualize data directly in your notebook\u2019s cell but I prefer to write functions and classes even for those tedious tasks. You\u2019ll see that the function used for visualization of the natural color images may be used for visualization of false-color images too.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 3: Function for visualization of image composites.<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def show_rgb(bands_list, red=4, green=3, blue=2):\n    stack = []\n    \n    colors = [red, green, blue]\n    colors = ['B' + str(x) for x in colors]\n    \n    for color in colors:\n        for band in bands_list:\n            if color in band:\n                with rio.open(band) as src:\n                    array = src.read(1)\n                    stack.append(array)\n                break\n                \n    # Equalize into a separate float array so integer rasters are not truncated\n    # (equalize_adapthist returns floats in the [0, 1] range)\n    stack = np.dstack(stack)\n    equalized = np.zeros(stack.shape, dtype=np.float64)\n    for i in range(0, 3):\n        equalized[:, :, i] = equalize_adapthist(stack[:, :, i], clip_limit=0.025)\n        \n    plt.figure(figsize=(15,15))\n    plt.axis('off')\n    plt.imshow(equalized)\n    plt.show()\n\nshow_rgb(complete_dataset)<\/pre>\n\n\n\n<p>Function\u00a0<code>show_rgb<\/code>\u00a0takes a list of all bands and the numbers of the bands that are related to the red, green and blue layers in our image. 
For Landsat 8 those are band 4 = red, band 3 = green and band 2 = blue.<\/p>\n\n\n\n<p>Line:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">colors = ['B' + str(x) for x in colors]<\/pre>\n\n\n\n<p>makes this function useful\u00a0<strong>only for Landsat 8 images<\/strong>\u00a0because they have the substring\u00a0<code>BXX<\/code>\u00a0in the filename, where\u00a0<code>XX<\/code>\u00a0is a number from 1 to 11, for example\u00a0<code>LC08_L1TP_018030_20190630_20190706_01_T1_<strong>B1<\/strong>.tif<\/code>. If you use other data sources then this function should be rewritten. (Also be aware that the naive substring check lets\u00a0<code>B1<\/code>\u00a0match B10 and B11 as well, so be careful if you include those bands.)<\/p>\n\n\n\n<p>We search for each band in a for loop and create a stack from them with numpy\u2019s\u00a0<code>dstack<\/code> method. Then each layer of the stack is equalized with the\u00a0<code>equalize_adapthist<\/code> function. The image may be slightly overexposed; we can control this with the\u00a0clip_limit parameter. A smaller value of this parameter makes the image darker and some differences between objects will be less visible.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"369\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/rgb-toronto-500.jpg\" alt=\"\" class=\"wp-image-29\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/rgb-toronto-500.jpg 500w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/rgb-toronto-500-300x221.jpg 300w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><figcaption>Figure 1: RGB image of Toronto area. Bands 4-3-2 of Landsat 8 satellite.<\/figcaption><\/figure>\n\n\n\n<p>The Toronto area has many buildings and man-made structures, and it is hard to distinguish vegetation or other fine details in this brownish image. 
We can use our function to visualize the data with another band combination: 7-6-4. It is designed for man-made objects and enhances them a lot. We pass new\u00a0<code>red<\/code>,\u00a0<code>green<\/code>\u00a0and\u00a0<code>blue<\/code>\u00a0argument values to our\u00a0<code>show_rgb<\/code>\u00a0function.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">show_rgb(complete_dataset, red=7, green=6, blue=4)<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Output image:<\/h4>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"369\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/false_color_toronto-500.jpg\" alt=\"\" class=\"wp-image-30\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/false_color_toronto-500.jpg 500w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/false_color_toronto-500-300x221.jpg 300w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><figcaption>Figure 2: Toronto area in false colors (bands 7-6-4).<\/figcaption><\/figure>\n\n\n\n<p>You may notice that the main features in this scene are buildings, green areas, water, clouds and cloud shadows (approx. 5 classes of objects). How well does the\u00a0<em>K-means<\/em>\u00a0algorithm distinguish between different terrain types? We will build a special class to answer that question.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Models Development<\/h2>\n\n\n\n<p>To perform\u00a0<em>K-means<\/em>\u00a0classification we need to provide all the layers for analysis and the number of clusters that should be detected. 
Only two lines of code are needed to get a classified output.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">kmeans = KMeans(n_clusters=number_of_clusters)\ny_pred = kmeans.fit_predict(data_stack)<\/pre>\n\n\n\n<p>But we cannot use it just like that. The problem is hidden in the number of clusters. It is an unsupervised algorithm and (probably)\u00a0we don\u2019t know how many classes of terrain type are in our scene. To find out, we create a class that will test a range of cluster counts and store the classified images, the trained models and their evaluation metrics.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 4: ClusteredBands class &#8211; initial structure<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class ClusteredBands:\n    \n    def __init__(self, rasters_list):\n        self.rasters = rasters_list\n        \n    def set_raster_stack(self):\n        pass\n        \n    def build_models(self, no_of_clusters_range):\n        pass\n        \n    def show_clustered(self):\n        pass\n            \n    def show_inertia(self):\n        pass\n        \n    def show_silhouette_scores(self):\n        pass<\/pre>\n\n\n\n<p>Class\u00a0<code>ClusteredBands<\/code>\u00a0is initialized with one parameter: the paths to the Landsat 8 tiff files. It could be one band or multiple bands \u2013 it is up to you. 
I chose to pass all 30-meter bands, but if you pass fewer layers (for example bands 4, 3 and 2) then your analysis will be faster.<\/p>\n\n\n\n<p>Each empty class method has a purpose, but the most important are\u00a0<code>set_raster_stack<\/code>\u00a0and\u00a0<code>build_models<\/code>.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><code>set_raster_stack<\/code>&nbsp;: builds the raster stack from all bands passed during class initialization,<\/li><li><code>build_models<\/code>&nbsp;: builds multiple models, one for each cluster count passed in the form of a list. This method stores the model, the prediction output in the form of a classified raster image, the inertia value and the silhouette score for each cluster count,<\/li><li><code>show_clustered<\/code>&nbsp;: creates a figure for each output and shows it in our notebook,<\/li><li><code>show_inertia<\/code>&nbsp;: plots the inertia change vs the number of clusters,<\/li><li><code>show_silhouette_scores<\/code>&nbsp;: plots the silhouette score vs the number of clusters.<\/li><\/ul>\n\n\n\n<p>Our class gives us the opportunity to perform the full analysis within one object: data is prepared only once with the\u00a0<code>set_raster_stack<\/code>\u00a0method and we can evaluate multiple cluster counts in the\u00a0build_models\u00a0method. The class can read GeoTIFF images, then create a layered stack from them. Next, it builds and evaluates models for a given range of clusters. 
Data processing and model development steps end with output visualization methods.<\/p>\n\n\n\n<p>Data processing is the first step in our analysis: we need to write a method that reads\u00a0numpy arrays from the bands&#8217; list passed as the initial argument.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 5: ClusteredBands class &#8211; set bands stack method<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class ClusteredBands:\n    \n    def __init__(self, rasters_list):\n        self.rasters = rasters_list\n        self.model_input = None\n        self.width = 0\n        self.height = 0\n        self.depth = 0\n        \n    def set_raster_stack(self):\n        band_list = []\n        for image in self.rasters:\n            with rio.open(image, 'r') as src:\n                band = src.read(1)\n                band = np.nan_to_num(band)\n                band_list.append(band)\n        bands_stack = np.dstack(band_list)\n        \n        # Prepare model input from bands stack\n        self.width, self.height, self.depth = bands_stack.shape\n        self.model_input = bands_stack.reshape(self.width * self.height, self.depth)\n<\/pre>\n\n\n\n<p>Method\u00a0<code>set_raster_stack<\/code>\u00a0opens each image and gets a numpy array of band values. We must be sure that\u00a0<code>nan<\/code>\u00a0values are converted into numbers, for example 0s, so we use the\u00a0<code>nan_to_num<\/code>\u00a0function. The layers for analysis are now prepared. We store the width, height and depth of the band stack; those parameters will be useful later, when we create the classified output images. The\u00a0<em>K-means<\/em>\u00a0algorithm takes a two-dimensional array of samples, one row per pixel \u2013 we create it with the\u00a0<code>reshape<\/code> method. 
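<\/p>

<p>The reshape step is easy to see in isolation on a tiny made-up stack: the (rows, columns, bands) cube becomes a (pixels, bands) table that KMeans accepts, and a flat vector of labels reshapes back into an image:<\/p>

```python
import numpy as np

# Tiny fake band stack: 2 x 3 pixels, 7 bands (values are arbitrary)
bands_stack = np.arange(2 * 3 * 7).reshape(2, 3, 7)

width, height, depth = bands_stack.shape
model_input = bands_stack.reshape(width * height, depth)
print(model_input.shape)  # one row per pixel, one column per band

# Fake per-pixel labels, shaped as fit_predict would return them
y_pred = np.zeros(width * height, dtype=int)
quantized_raster = np.reshape(y_pred, (width, height))
print(quantized_raster.shape)  # back to the image shape
```

<p>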
Then we store the input dataset in the\u00a0<code>self.model_input<\/code>\u00a0parameter.<\/p>\n\n\n\n<p>With the data prepared, we can move to the machine learning task! Our method performs\u00a0K-means\u00a0clustering for different numbers of clusters, then stores the trained models, their evaluation metrics and the classification outputs as model parameters. This will allow us to compare results for different numbers of clusters.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 6: ClusteredBands class &#8211; models training<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class ClusteredBands:\n    \n    def __init__(self, rasters_list):\n        self.rasters = rasters_list\n        self.model_input = None\n        self.width = 0\n        self.height = 0\n        self.depth = 0\n        self.no_of_ranges = None\n        self.models = None\n        self.predicted_rasters = None\n        self.s_scores = []\n        self.inertia_scores = []\n        \n    def build_models(self, no_of_clusters_range):\n        self.no_of_ranges = no_of_clusters_range\n        models = []\n        predicted = []\n        inertia_vals = []\n        s_scores = []\n        for n_clust in no_of_clusters_range:\n            kmeans = KMeans(n_clusters=n_clust)\n            y_pred = kmeans.fit_predict(self.model_input)\n            \n            # Append model\n            models.append(kmeans)\n            \n            # Calculate metrics\n            s_scores.append(self._calc_s_score(y_pred))\n            inertia_vals.append(kmeans.inertia_)\n            \n            # Append output image (classified)\n            quantized_raster = np.reshape(y_pred, (self.width, self.height))\n            predicted.append(quantized_raster)\n            \n        # Update class parameters\n        self.models = models\n        self.predicted_rasters = predicted\n        self.s_scores = s_scores\n        self.inertia_scores = inertia_vals\n        \n    def _calc_s_score(self, labels):\n        s_score = silhouette_score(self.model_input, labels, sample_size=1000)\n        return s_score<\/pre>\n\n\n\n<p>The argument\u00a0<code>no_of_clusters_range<\/code>\u00a0passed into the\u00a0<code>build_models<\/code>\u00a0method must be a list of natural numbers. Our method performs the calculations for each cluster count. We automatically get a list of trained models, which may be important if you\u2019d like to inspect model hyperparameters. More important, I think, are the classified rasters, because we can inspect what the algorithm has produced. Manual inspection is prone to error, so we enhance it with the\u00a0<strong>model inertia<\/strong>\u00a0and\u00a0<strong>silhouette score<\/strong>\u00a0calculated for each cluster count. We write data visualization methods to show the output images, plot the inertia and plot the silhouette score. 
The full code for the\u00a0<code>ClusteredBands<\/code>\u00a0class is relatively long, but half of it is occupied by data visualization methods, so don\u2019t panic!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 7: ClusteredBands class<\/h3>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">class ClusteredBands:\n    \n    def __init__(self, rasters_list):\n        self.rasters = rasters_list\n        self.model_input = None\n        self.width = 0\n        self.height = 0\n        self.depth = 0\n        self.no_of_ranges = None\n        self.models = None\n        self.predicted_rasters = None\n        self.s_scores = []\n        self.inertia_scores = []\n        \n    def set_raster_stack(self):\n        band_list = []\n        for image in self.rasters:\n            with rio.open(image, 'r') as src:\n                band = src.read(1)\n                band = np.nan_to_num(band)\n                band_list.append(band)\n        bands_stack = np.dstack(band_list)\n        \n        # Prepare model input from bands stack\n        self.width, self.height, self.depth = bands_stack.shape\n        self.model_input = bands_stack.reshape(self.width * self.height, self.depth)\n        \n    def build_models(self, no_of_clusters_range):\n        self.no_of_ranges = no_of_clusters_range\n        models = []\n        predicted = []\n        inertia_vals = []\n        s_scores = []\n        for n_clust in no_of_clusters_range:\n            kmeans = KMeans(n_clusters=n_clust)\n            y_pred = kmeans.fit_predict(self.model_input)\n            \n            # Append model\n            models.append(kmeans)\n            \n            # Calculate metrics\n            s_scores.append(self._calc_s_score(y_pred))\n            inertia_vals.append(kmeans.inertia_)\n            \n        
            # Append output image (classified)\n            quantized_raster = np.reshape(y_pred, (self.width, self.height))\n            predicted.append(quantized_raster)\n            \n        # Update class parameters\n        self.models = models\n        self.predicted_rasters = predicted\n        self.s_scores = s_scores\n        self.inertia_scores = inertia_vals\n        \n    def _calc_s_score(self, labels):\n        s_score = silhouette_score(self.model_input, labels, sample_size=1000)\n        return s_score\n        \n    def show_clustered(self):\n        for idx, no_of_clust in enumerate(self.no_of_ranges):\n            title = 'Number of clusters: ' + str(no_of_clust)\n            image = self.predicted_rasters[idx]\n            plt.figure(figsize = (15,15))\n            plt.axis('off')\n            plt.title(title)\n            plt.imshow(image, cmap='Accent')\n            plt.colorbar()\n            plt.show()\n            \n    def show_inertia(self):\n        plt.figure(figsize = (10,10))\n        plt.title('Inertia of the models')\n        plt.plot(self.no_of_ranges, self.inertia_scores)\n        plt.show()\n        \n    def show_silhouette_scores(self):\n        plt.figure(figsize = (10,10))\n        plt.title('Silhouette scores')\n        plt.plot(self.no_of_ranges, self.s_scores)\n        plt.show()<\/pre>\n\n\n\n<p>Let\u2019s test it and discuss the results. First, initialize the object with the list of bands and then create the band stack.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">clustered_models = ClusteredBands(complete_dataset)\nclustered_models.set_raster_stack()<\/pre>\n\n\n\n<p>Set the range of cluster counts and build the models. 
This may take a while.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">ranges = np.arange(3, 11, 1)\nclustered_models.build_models(ranges)\nclustered_models.show_clustered()\nclustered_models.show_inertia()\nclustered_models.show_silhouette_scores()<\/pre>\n\n\n\n<p>And we are ready to go! Look into the outputs for 3, 5, 6 and 10 categories. Which one would you like to use as the foundation of a more sophisticated analysis? Which number of clusters is optimal in your opinion?<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"378\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_3-500.jpg\" alt=\"\" class=\"wp-image-31\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_3-500.jpg 500w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_3-500-300x227.jpg 300w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><figcaption>Figure 3: Unsupervised classification of Toronto area with 3 categories.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"378\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_5-500.jpg\" alt=\"\" class=\"wp-image-32\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_5-500.jpg 500w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_5-500-300x227.jpg 300w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><figcaption>Figure 4: Unsupervised classification of Toronto area with 5 categories.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" 
decoding=\"async\" width=\"500\" height=\"378\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_6-500.jpg\" alt=\"\" class=\"wp-image-33\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_6-500.jpg 500w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_6-500-300x227.jpg 300w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><figcaption>Figure 5: Unsupervised classification of Toronto area with 6 categories.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"378\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_10-500.jpg\" alt=\"\" class=\"wp-image-34\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_10-500.jpg 500w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/clustered_toronto_10-500-300x227.jpg 300w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><figcaption>Figure 6: Unsupervised classification of Toronto area with 10 categories.<\/figcaption><\/figure>\n\n\n\n<p>I think the optimal results are those for five (5) or six (6) categories. When you compare those images to the RGB scene, it is visible that the main terrain types are present in the figures. Three (3) categories are not enough to catch the spatial variability; the large area of 0\u2019s (NaN\u2019s) does not make it better, as it \u201csteals\u201d one label from the useful classes. Ten (10) categories create a kind of noise, and it is much harder to find out what they represent when we compare the output image to the RGB scene. So five or six classes give me what I was looking for. 
It is interesting that for those two cluster counts,\u00a0the <strong>algorithm distinguished clouds and cloud shadows from the scene<\/strong>, which is wonderful.<\/p>\n\n\n\n<p>We can add metrics to our visual inspection.\u00a0<strong>Inertia<\/strong>\u00a0is one of them. It tells us how far each sample is from its category centroid (in our case, how similar a pixel, across all available wavelengths, is to the class to which it was assigned). Inertia is lowest when each position (pixel) of our scene represents its own class, but our goal is not to get back the same image we passed into the algorithm; we want to find a relatively low number of categories which still describe the content of the scene. The trick is to plot the change in inertia as a function of the number of categories and find the \u201celbow\u201d in the plotted curve.\u00a0The elbow represents the <em>optimal number of clusters<\/em>. We can see that the elbow is placed at five or six clusters, close to our initial guess from visual inspection.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"604\" height=\"590\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/model_inertia.png\" alt=\"\" class=\"wp-image-35\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/model_inertia.png 604w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/model_inertia-300x293.png 300w\" sizes=\"auto, (max-width: 604px) 100vw, 604px\" \/><figcaption>Figure 7: Inertia of models in relation to the number of classes.<\/figcaption><\/figure>\n\n\n\n<p>Finding an elbow is one evaluation method, but it is not ideal. The other is\u00a0the <strong>silhouette score<\/strong>. Scikit-learn\u2019s silhouette_score\u00a0metric evaluates a clustering based on the distances between the predicted categories. 
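<\/p>

<p>To get a feeling for the metric, here is a toy call on synthetic, well-separated data (the two \u201cpixel clouds\u201d are invented for illustration; a real band stack behaves far less cleanly):<\/p>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic clusters of 7-band "pixels"
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, (500, 7)),
                  rng.normal(5, 0.3, (500, 7))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# sample_size keeps the pairwise-distance cost manageable on big inputs
score = silhouette_score(data, labels, sample_size=500, random_state=0)
print(score)  # close to +1 for such a clean separation
```

<p>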
The score varies between +1 and -1; a coefficient close to +1 is good and means that our pixel is close to its category centroid, while a coefficient close to -1 indicates that our pixel may be in the wrong cluster. This is a computationally expensive metric; that\u2019s why we have set the\u00a0<code>sample_size<\/code>\u00a0parameter of the\u00a0<code>silhouette_score<\/code>\u00a0function to 1000. If you pass all samples then the computation will be extremely slow, so beware!<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"601\" height=\"590\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/silhouette_score.png\" alt=\"\" class=\"wp-image-36\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/silhouette_score.png 601w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/silhouette_score-300x295.png 300w\" sizes=\"auto, (max-width: 601px) 100vw, 601px\" \/><figcaption>Figure 8: Silhouette score in relation to the number of predicted classes.<\/figcaption><\/figure>\n\n\n\n<p>It\u2019s interesting: the silhouette score tells us that the optimal models have a low number of categories. We know that three or four classes are not enough to catch the variability in the image, so we pick the next lowest value.\u00a0<strong>Five categories should be optimal; visual inspection, inertia and the silhouette coefficient all tell us that this may be a good choice.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How to get the specific spectral indices for each label?<\/h2>\n\n\n\n<p>The solution for this concrete problem is rather simple. Our labeled output image is a numpy array where the specific labels (integers) are located in specific positions. We can check the positions of those labels in the array and then read the spectral indices from the processed bands: we simply take the addresses (positions) of label X and use the same addresses in the Landsat band. Then we obtain the pixel values at those locations. 
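<\/p>

<p>This \u201ccopy the address\u201d trick is plain numpy boolean masking; here is a miniature version with made-up arrays:<\/p>

```python
import numpy as np

# Toy classification output and one toy band of the same shape (made-up values)
labeled_image = np.array([[0, 0, 1],
                          [1, 2, 2]])
band = np.array([[10., 11., 20.],
                 [21., 30., 30.]])

# For each label, take the unique band values at the positions holding that label
values_per_label = {int(lbl): np.unique(band[labeled_image == lbl])
                    for lbl in np.unique(labeled_image)}
print(values_per_label)
```

<p>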
We are not going to rebuild our <code>ClusteredBands<\/code> class. Instead, we use an external function that takes two arguments: the labeled image and a list of band files. The function returns a dictionary with two levels: the first-level key is a raster filename, the second-level key is a label number, and its value is an array of all unique spectral values covered by that label:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>{raster file: {label number: &#91;unique spectral indices]}}<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Listing 8: Get spectral indices of classification labels from the input bands.<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code>def get_spectral_indices(labeled_image: np.ndarray, rasters: list) -> dict:\n    \"\"\"\n    Create a list of the unique spectral indices for each label from an unsupervised classified image.\n    \n    :param labeled_image: (numpy array) classification output,\n    :param rasters: (list) list of files with Landsat 8 bands.\n    \n    :returns: (dict) {raster: {label: &#91;indices]}}\n    \"\"\"\n    spectral_indices = {}\n    \n    for image in rasters:\n        with rio.open(image, 'r') as src:\n            \n            band = src.read(1)\n            band = np.nan_to_num(band)  # replace NaN (nodata) values with zeros\n            spectral_indices&#91;image] = {}\n            \n            for lbl in np.unique(labeled_image):\n                # Boolean mask: pick band values at every position carrying this label\n                indices = band&#91;labeled_image == lbl]\n                unique_indices = np.unique(indices)\n                spectral_indices&#91;image]&#91;lbl] = unique_indices\n    return spectral_indices\n\nfour_class_spectral_indices = get_spectral_indices(clustered_models.predicted_rasters&#91;1], complete_dataset)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>I hope this article gives you an idea of how to perform unsupervised clustering of remotely sensed data. 
If you have any comments, or if you are stuck somewhere, leave a comment and I\u2019ll answer your questions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to perform unsupervised classification of satellite images with Python.<\/p>\n","protected":false},"author":1,"featured_media":37,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[18,19,3,4,20],"tags":[21,22,25,7,24,27,28,23,26],"class_list":["post-28","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-machine-learning","category-python","category-raster","category-remote-sensing","tag-k-means-clustering","tag-landsat-8","tag-machine-learning","tag-python","tag-remote-sensing","tag-satellite-images","tag-scikit-learn","tag-toronto","tag-unsupervised-classification"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/28","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=28"}],"version-history":[{"count":5,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/28\/revisions"}],"predecessor-version":[{"id":596,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/28\/revisions\/596"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/37"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=28"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-
json\/wp\/v2\/categories?post=28"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=28"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}