{"id":6,"date":"2020-10-01T22:08:04","date_gmt":"2020-10-01T22:08:04","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=6"},"modified":"2020-10-01T22:08:04","modified_gmt":"2020-10-01T22:08:04","slug":"data-engineering-crop-big-raster-files-with-python-and-rasterio","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2020\/10\/01\/data-engineering-crop-big-raster-files-with-python-and-rasterio\/","title":{"rendered":"Data Engineering: Crop big raster files with Python and rasterio"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Prototyping of EO data processing and analysis scripts could be painful due to the size of images.<\/p><\/blockquote>\n\n\n\n<p>As a remote sensing specialist and a big fan of MacBook. I usually work in a coffe-house with my laptop and I must find a way to process data with limited RAM access. My laptop has 8 GB of memory. Typical Earth Observation product (image, classification map etc.) has a size of 1 GB. Someone could say that 8 GB is more than enough to&nbsp;<em>read<\/em>&nbsp;this file but&nbsp;<em>processing it<\/em>&nbsp;is a different piece of cake. It is possible to work with a big rasters when specialized software is used (<strong>QGIS<\/strong>). But when I\u2019m working in <em>Jupyter Notebook<\/em> and I\u2019m using <em>numpy<\/em> processing algorithms then RAM demand is close to my MacBook limit or exceeds it.<\/p>\n\n\n\n<p>Fortunately there are two methods to overcome this problem. The first method is to read only <a href=\"https:\/\/rasterio.readthedocs.io\/en\/stable\/topics\/windowed-rw.html\">part of an image<\/a>.<\/p>\n\n\n\n<p>The second method is to <strong>crop big image into smaller parts<\/strong> and store them for further processing. I\u2019ve described it in this article.&nbsp;<\/p>\n\n\n\n<p>Image cropping is very important for reproducibility of the processing piepline where, theoretically, we may validate every processing step. The major drawback of this method is a problem with a data storage. It is an exchange between RAM and the hard drive memory. MacBook\u2019s hard drive memory volume is very small so I must use external drive to store remote sensed images but at the same time.<\/p>\n\n\n\n<p>Cropping of georeferenced image may be performed with&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/rasterio.readthedocs.io\/en\/stable\/index.html\" target=\"_blank\">rasterio<\/a>&nbsp;library. It is the&nbsp;<em>must have<\/em>&nbsp;Python tool of a remote sensing specialist. When you prototype machine learning systems where satellite imagery is analyzed then rasterio is a better choice than GDAL. (Ok, in practice&nbsp;<em>it depends<\/em>, some data formats are not well suited for rasterio\u2026 or rasterio is not well suited for them). We need two more libraries for our purpose: <em>numpy<\/em> (linear algebra and matrix operations) and <em>matplotlib<\/em> (data visualization).<\/p>\n\n\n\n<p>Custom Python class to perform a clipping operation is presented in the listing below. Link to full implementation with docstrings is given at the end of article.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\nimport rasterio as rio\nimport affine\n\nclass ImageForClipModel:\n\n    # Get band, band shape, dtype, crs and transform values\n    def __init__(self, image_address):\n        with rio.open(image_address, 'r') as f:\n            self.band = f.read(1)\n            self.crs = f.crs\n            self.b_trans = f.transform\n        self.band_shape = self.band.shape\n        self.band_dtype = self.band.dtype\n        self.clipped_images = []\n        self.clipped_addresses = []\n\n    # Function for clipping band\n    def clip_raster(self, height, width, buffer=0,\n                    save_mode=False, prefix='clipped_band_', \n                    pass_empty=False):\n        r_pos = 0\n        while r_pos &lt; self.band_shape[0]:\n            col_pos = 0\n            while col_pos &lt; self.band_shape[1]:\n                clipped_image = self.band[r_pos:r_pos+\\  \n                                height,\n                                col_pos:col_pos + width]\n\n                # Check if frame is empty\n                if pass_empty:\n                    if np.mean(clipped_image) == 0:\n                        print('Empty frame, not saved')\n                        break\n\n                # Positioning\n                tcol, trow = self.b_trans * \\\n                             (col_pos, r_pos)\n                new_transform = affine.Affine(self.b_trans[0], \n                                              self.b_trans[1],\n                                              tcol,\n                                              self.b_trans[3], \n                                              self.b_trans[4], \n                                              trow)\n                image = [clipped_image, self.crs, new_transform,\n                         clipped_image.shape[0], \n                         clipped_image.shape[1],\n                         self.band_dtype]\n\n                # Save or append into a set\n                if save_mode:\n                    filename = prefix + 'x_' + \\\n                               str(col_pos) + '_y_' + \\ \n                               str(r_pos) + '.tif'\n\n                    with rio.open(filename, 'w',\n                                  driver='GTiff',\n                                  height=image[3], width=image[4],\n                                  count=1, dtype=image[5],\n                                  crs=image[1],\n                                  transform=image[2]) as dst:\n                        dst.write(image[0], 1)\n                    self.clipped_addresses.append(filename)\n                else:\n                    self.clipped_images.append(clipped_image)\n\n                # Update column position\n                col_pos = col_pos + width - buffer\n\n            # Update row position\n            r_pos = r_pos + height - buffer\n\n        if save_mode:\n            print('Tiles saved successfully')\n            return self.clipped_addresses\n        else:\n            print('Tiles prepared successfully')\n            return self.clipped_images<\/pre>\n\n\n\n<p>With this class it is possible to divide image into georeferenced quadrants. Example of the code usage (from<em> Jupyter Notebook<\/em>):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">%matplotlib inline\nimport matplotlib.pyplot as plt\n\ntest_image = \/\n'raster_clip\/LC08_L1TP_190022_20180413_20180417_01_T1_sr_band5.tif'\n\n# Show test image\n\nwith rio.open(test_image, 'r') as src:\n    img = src.read(1)\n    img = img.astype(np.float32)\n    img = img \/ np.max(img)\n    img[img &lt;= 0] = np.nan\n    \nplt.figure(figsize=(12, 12))\nplt.imshow(img, cmap='gray')\nplt.show()<\/pre>\n\n\n\n<p><strong>>> Output:<\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"495\" height=\"500\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/raster_clipping_1.png\" alt=\"\" class=\"wp-image-12\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/raster_clipping_1.png 495w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/raster_clipping_1-297x300.png 297w\" sizes=\"auto, (max-width: 495px) 100vw, 495px\" \/><figcaption>Figure 1: Clipped raster. Scene from the Landsat 8 band 5.<\/figcaption><\/figure><\/div>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">clipper = ImageForClipModel(image_address=test_image)\n\nclipper.clip_raster(height=500,\n    width=500,\n    buffer=0,\n    save_mode=True,\n    prefix='raster_clip\/clipped_band_',\n    pass_empty=False)<\/pre>\n\n\n\n<p><strong>>> Output:<\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"500\" height=\"500\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/raster_clipping_2_small.png\" alt=\"\" class=\"wp-image-13\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/raster_clipping_2_small.png 500w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/raster_clipping_2_small-300x300.png 300w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2020\/10\/raster_clipping_2_small-150x150.png 150w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><figcaption>Figure 2: Clipped raster.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p>Script:&nbsp;<a href=\"https:\/\/github.com\/szymon-datalions\/geoprocessing\/blob\/master\/rasters\/clip_into_quadrants.py\">https:\/\/github.com\/szymon-datalions\/geoprocessing\/blob\/master\/rasters\/clip_into_quadrants.py<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Prototyping of EO data processing and analysis scripts could be painful due to the size of images. You can process smaller tiles &#8211; but first you must clip bigger scenes.<\/p>\n","protected":false},"author":1,"featured_media":14,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,3,4],"tags":[5,8,7,6],"class_list":["post-6","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","category-python","category-raster","tag-crop-raster","tag-how-to-crop-raster-in-python","tag-python","tag-rasterio"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/6","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=6"}],"version-history":[{"count":3,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/6\/revisions"}],"predecessor-version":[{"id":15,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/6\/revisions\/15"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/14"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=6"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=6"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=6"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}