Data Science: Unsupervised Classification of satellite images with K-means algorithm
- K-means clustering technique (https://en.wikipedia.org/wiki/K-means_clustering).
- Python 3.8, numpy (https://numpy.org) and rasterio (https://rasterio.readthedocs.io/en/stable/), scikit-learn, matplotlib and optionally scikit-image.
- Jupyter Notebook (https://jupyter.org) and conda package and environment management system (https://docs.conda.io/en/latest/)
- You will prepare Landsat 8 data as the input for k-means clustering algorithm.
- You perform clustering tasks for different number of clusters and select optimal number of clusters for your analysis.
- You use k-means model inertia and silhouette metrics.
Why do we perform unsupervised classification of the satellite images?
Unsupervised categorization of remote sensed data seems to be impractical. In truth, supervised classification gives you better results and you know what you’re classifying. Unsupervised algorithms don’t give you that opportunity. Still, there are areas, where unsupervised classification rocks:
- Data compression (instead of thousands of different pixel values you get dozens of them, file size may be much smaller).
- Cloud detection. Scenario is simple, you have scene without clouds, train on it unsupervised classification algorithm and get multiple labels, then you do the same for future scenes and if labels’ borders are not aligned then probably you get a clouds.
- Anomaly detection. Imagine scene without many details (like sea surface). Clusters will be quite large but there are smaller dots – ships – which can be clipped from the image based on the simple statistics.
- Semi-supervised learning. You may have data labeled only partially, then use of k-means on the unlabeled data is the fastest method to label unknown values.
We use conda package manager to create working environment. You should install those packages (from conda-forge channel):
- (optional) scikit-image
To perform this task you need band set of atmospherically corrected Landsat 8 images. I’ve used scene LC08_L1TP_018030_20190630_20190706_01_T1 which covers Great Lakes region (Lake Erie, Lake Ontario, Lake Huron) from which I clipped Toronto (CA) area. You may use any other scene, especially images from your area of living!
It is a good idea to look into a scene before the analysis, and that’s why instead of jumping in the deep end we start with data visualization. We need for it only band 2, band 3 and band 4 of Landsat 8 scene. We will pass more data into k-means algorithm and visualize scene in false colors so input dataset consists bands from 1 to 7. All of them have 30 meters per pixel resolution so we do not need to perform upsampling or downsampling to perform analysis.
Listing 1: Import Python libraries
import os import numpy as np import rasterio as rio from sklearn.cluster import KMeans from sklearn.metrics import silhouette_score from skimage.exposure import equalize_adapthist import matplotlib.pyplot as plt
numpy for arrays handling,
rasterio for i/o operations related to the remote sensed data, scikit-learn’s
KMeans package for unsupervised classification of pixels, scikit-learn’s
silhouette_score function for models evaluation, matplotlib’s
pyplot for data visualization. Python’s
os library is used for the preparation of the paths to Landsat 8 bands. Scikit-image’s
equalize_adapthist function is optional, it is used for visualization of the RGB image. It equalizes image histogram. I always use it during my analysis because Landsat 8 scenes are dark and it is hard to distinguish details in the RGB image. On the other hand if you want to start analysis, skip this function. Equalized histograms are not needed by K-means clustering algorithm.
Listing 2: Load data
path = 'data/' complete_dataset = os.listdir(path) complete_dataset = [path + x for x in complete_dataset] print(complete_dataset)
Output of Listing 2:
['data/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B1.tif', 'data/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B2.tif', 'data/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B3.tif', 'data/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B4.tif', 'data/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B5.tif', 'data/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B6.tif', 'data/clip_RT_LC08_L1TP_018030_20190630_20190706_01_T1_B7.tif']
In my case bands are stored in the
'data/' folder inside the script directory, so you should change path name if your scene is elsewhere. Pay attention to the fact that path is relative and if your data is not in your script folder then you should provide absolute path. With simple list comprehension we create full paths to each band stored for processing.
You could visualize data directly in your notebook’s cell but I prefer to write function and classes even for those tedious tasks. You’ll see, that the function used for visualization of natural color image may be used for visualization of false color images too.
Listing 3: Function for visualization of image composites.
def show_rgb(bands_list, red=4, green=3, blue=2): stack =  colors = [red, green, blue] colors = ['B' + str(x) for x in colors] for color in colors: for band in bands_list: if color in band: with rio.open(band) as src: array = src.read(1) stack.append(array) break stack = np.dstack(stack) for i in range(0, 3): stack[:, :, i] = equalize_adapthist(stack[:, :, i], clip_limit=0.025) plt.figure(figsize=(15,15)) plt.axis('off') plt.imshow(stack) plt.show() show_rgb(complete_dataset)
show_rgb takes list of all bands and numbers of bands which are related to the red, green and blue layer in our image. For Landsat 8 those are band 4 = red, band 3 = green and band 2 = blue.
colors = ['B' + str(x) for x in colors]
makes this function useful only for Landsat 8 images because they have substring
BXX in the filename, where
XX is the number from 1 do 11, as example
LC08_L1TP_018030_20190630_20190706_01_T1_B1.tif. If you use other data sources then this function should be rewritten.
We search for each band in a for loop and create stack from them with numpy’s
dstack method. Then each layer of the stack is equalized with
equalize_adapthist function. Image may be slightly overexposed, we can control it with the
clip_limit parameter. Smaller value of this parameter makes image darker and some differences between objects will be less visible.
Toronto area has multiple buildings and man-made places and it is hard to distinguish vegetation or other fine details in this brownish image. We can use our function to visualize data with other band combination: 7-6-4. Id is designed for man-made objects and enhances them a lot. We pass new
blue argument values to our
show_rgb(complete_dataset, red=7, green=6, blue=4)
You may notice that the main features in this scene are buildings, green areas, water, clouds and cloud shadows (approx. 5 classes of objects). How well K-means algorithm distinguish between different terrain types? We will build special class to answer that question.
To perform K-means classification we need to provide all layers for analysis and number of clusters which should be detected. There are two lines of code to get classified output.
kmeans = KMeans(n_clusters=number of clusters) y_pred = kmeans.fit_predict(data stack)
But we cannot use it just like that. The problem is hidden in the number of clusters. It is unsupervised algorithm and (probably) we don’t know how many classes of terrain type are in our scene. To find out we create class which will test range of clusters, store classified images and trained models and their evaluation metrics.
Listing 4: ClusteredBands class – initial structure
class ClusteredBands: def __init__(self, rasters_list): self.rasters = rasters_list def set_raster_stack(self): pass def build_models(self, no_of_clusters_range): pass def show_clustered(self): pass def show_inertia(self): pass def show_silhouette_scores(self): pass
ClusteredBands has one parameter with which it is initialized. We pass to it paths to the Landsat 8 tiff files. It could be one band or multiple bands – it is up to you. I choose to pass all 30 meters bands but if you pass less layers (as example bands 4, 3, 2) then your analysis will be faster.
Each empty class method has purpose but the most important are
set_raster_stack: for development of the raster stack from all bands passed during class initialization,
build_models: for building multiple models, one for each range passed in the form of a list. This method will update model, prediction output in the form of classified raster image, inertia value and silhouette score for each range,
show_clustered: function which create figure for each output and show it in our notebook,
show_inertia: plot of inertia change vs number of ranges,
show_silhouette_scores: plot of silhouette score vs number of ranges.
Our class gives us opportunity to perform full analysis within one object so data is prepared only once with
set_raster_stack method and we can evaluate multiple ranges in the
build_models method. Class can read geotiff images then create layered stack from them. Next it builds and evaluates models for a given range of clusters. Data processing and model development steps end with output visualization methods.
Data processing is a first step in our analysis, we need to write method for reading
numpy arrays from the bands list passed as the initial argument.
Listing 5: ClusteredBands class – set bands stack method
class ClusteredBands: def __init__(self, rasters_list): self.rasters = rasters_list self.model_input = None self.width = 0 self.height = 0 self.depth = 0 def set_raster_stack(self): band_list =  for image in self.rasters: with rio.open(image, 'r') as src: band = src.read(1) band = np.nan_to_num(band) band_list.append(band) bands_stack = np.dstack(band_list) # Prepare model input from bands stack self.width, self.height, self.depth = bands_stack.shape self.model_input = bands_stack.reshape(self.width * self.height, self.depth)
set_raster_stack opens each image and gets numpy array of band values. We must be sure that
nan values are converted into numbers, as example 0s, so we use
nan_to_num method. Layers for analysis are prepared. We store width, height and depth of band stack, those parameters will be useful later, when we create classified output images. K-means algorithm takes a stack of arrays – we create them with
reshapemethod. Then we store input dataset in
With prepared data we can move to the machine learning task! Our method performs K-means clustering for different number of ranges, then store trained models, their evaluation metrics and classification outputs as model parameters. It will allow us to compare results for different number of clusters.
Listing 6: ClusteredBands class – models training
class ClusteredBands: def __init__(self, rasters_list): self.rasters = rasters_list self.model_input = None self.width = 0 self.height = 0 self.depth = 0 self.no_of_ranges = None self.models = None self.predicted_rasters = None self.s_scores =  self.inertia_scores =  def build_models(self, no_of_clusters_range): self.no_of_ranges = no_of_clusters_range models =  predicted =  inertia_vals =  s_scores =  for n_clust in no_of_clusters_range: kmeans = KMeans(n_clusters=n_clust) y_pred = kmeans.fit_predict(self.model_input) # Append model models.append(kmeans) # Calculate metrics s_scores.append(self._calc_s_score(y_pred)) inertia_vals.append(kmeans.inertia_) # Append output image (classified) quantized_raster = np.reshape(y_pred, (self.width, self.height)) predicted.append(quantized_raster) # Update class parameters self.models = models self.predicted_rasters = predicted self.s_scores = s.scores self.inertia_scores = inertia_vals def _calc_s_score(self, labels): s_score = silhouette_score(self.model_input, labels, sample_size=1000) return s_score
number_of_clastures_range passed into
build_models method must be a list of natural numbers. Our function perform calculations for each range (number of clusters). We automatically get a list of trained models which may be important if you’d like to inspect model hyperparameters. I think that more important are classified rasters because we can inspect what algorithm has produced. Manual inspection is prone to error and we enhance it with model inertia and silhouette score calculated for each range. We write data visualization methods to show output images, plot of inertia and plot of silhouette score. Full code for the
ClusteredBands class is relatively long, but half of it is occupied by data visualization methods, so don’t panic!
Listing 7: ClusteredBands class
class ClusteredBands: def __init__(self, rasters_list): self.rasters = rasters_list self.model_input = None self.width = 0 self.height = 0 self.depth = 0 self.no_of_ranges = None self.models = None self.predicted_rasters = None self.s_scores =  self.inertia_scores =  def set_raster_stack(self): band_list =  for image in self.rasters: with rio.open(image, 'r') as src: band = src.read(1) band = np.nan_to_num(band) band_list.append(band) bands_stack = np.dstack(band_list) # Prepare model input from bands stack self.width, self.height, self.depth = bands_stack.shape self.model_input = bands_stack.reshape(self.width * self.height, self.depth) def build_models(self, no_of_clusters_range): self.no_of_ranges = no_of_clusters_range models =  predicted =  inertia_vals =  s_scores =  for n_clust in no_of_clusters_range: kmeans = KMeans(n_clusters=n_clust) y_pred = kmeans.fit_predict(self.model_input) # Append model models.append(kmeans) # Calculate metrics s_scores.append(self._calc_s_score(y_pred)) inertia_vals.append(kmeans.inertia_) # Append output image (classified) quantized_raster = np.reshape(y_pred, (self.width, self.height)) predicted.append(quantized_raster) # Update class parameters self.models = models self.predicted_rasters = predicted self.s_scores = s.scores self.inertia_scores = inertia_vals def _calc_s_score(self, labels): s_score = silhouette_score(self.model_input, labels, sample_size=1000) return s_score def show_clustered(self): for idx, no_of_clust in enumerate(self.no_of_ranges): title = 'Number of clusters: ' + str(no_of_clust) image = self.predicted_rasters[idx] plt.figure(figsize = (15,15)) plt.axis('off') plt.title(title) plt.imshow(image, cmap='Accent') plt.show() def show_inertia(self): plt.figure(figsize = (10,10)) plt.title('Inertia of the models') plt.plot(self.no_of_ranges, self.inertia_scores) plt.show() def show_silhouette_scores(self): plt.figure(figsize = (10,10)) plt.title('Silhouette scores') plt.plot(self.no_of_ranges, self.s_scores) plt.show()
We test it and discuss results. First, initialize object with list of bands and then create band stack.
clustered_models = ClusteredBands(complete_dataset) clustered_models.set_raster_stack()
Set number of ranges and create models. This may take a while.
ranges = np.arange(3, 11, 1) clustered_models.build_models(ranges) clustered_models.show_clustered() clustered_models.show_inertia() clustered_models.show_silhouette_scores()
And we are ready to go! Look into outputs for 3, 5, 6 and 10 categories. Which one would you like to use as the foundation of more sophisticated analysis? Which number of ranges is optimal in your opinion?
I think that optimal results are for five (5) or six (6) categories. When you compare those images to the RGB scene then it is visible that the main terrain types are presented in the figures. Three (3) categories are not enough to catch spatial variability, large area of 0’s (NaN’s) is not making it better, it “steals” one label from useful classes. Ten (10) categories create some kind of noise and it is much harder to find out what they are representing when we compare output image to the RGB scene. So six or five classes give me what I was looking for. It is interesting that for those two number of categories, algorithm distinguished clouds and cloud shadows from the scene, which is wonderful.
We can add metrics to our visual inspection. Inertia is one of them. It tells us how far is our sample from its category centroid (in our case it informs how similar is one pixel in all available wavelengths to the class for it was classified). The lowest inertia will be when each position (pixel) of our scene represent its own class but it is not our goal to get the same image as we passed into algorithm and we want to find relatively low number of categories which are describing content of the scene. The trick is to plot change in inertia as a function of number of categories and find “elbow” in the plotted curve. Elbow represents optimal range. We can see that elbow is placed at five or six number of clusters, close to the initial guess from visual inspection.
Finding and elbow is one and not so ideal evaluation methods. The other is silhouette score. Scikit-learn offers
silhouette_score metric for evaluation of algorithms based on the distance calculation. Score varies between +1 and -1, coefficient close to the +1 is good, it means that our pixel is close to its category centroid, coefficient -1 indices that our pixel may be in a wrong cluster. This is computationally expensive method, that’s why we have set
sample_size parameter of
silhouette_score method to 1000. If you pass all samples then your algorithm will work extremely slow so beware!
It’s interesting. Silhouette score tells us that optimal models have low number of categories. We know that three or four classes are not enough to catch variability in the image, so we pick the next lowest value. Five categories should be optimal, visual inspection, inertia and silhouette coefficient tell us that this may be a good choice.
I hope that this article gives you idea how to perform unsupervised clustering of remote sensed data. If you have any comments or if you stuck somewhere then write a comment, I’ll answer all your questions.