{"id":538,"date":"2021-10-31T20:19:20","date_gmt":"2021-10-31T20:19:20","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=538"},"modified":"2022-09-11T08:43:42","modified_gmt":"2022-09-11T08:43:42","slug":"data-science-interpolate-air-quality-measurements-with-python","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2021\/10\/31\/data-science-interpolate-air-quality-measurements-with-python\/","title":{"rendered":"Data Science: Interpolate Air Quality Measurements with Python"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Update<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li>2022-09-11 Tutorial has been updated to the new version of <code>pyinterpolate<\/code>. New figures and text corrections.<\/li><\/ul>\n\n\n\n<p>How do we measure air quality? The most popular are stationary sensors that detect the amount of toxic compounds in the air. Each sensor&#8217;s readings may be described by  [<code>timestamp<\/code>, <code>latitude<\/code>, <code>longitude<\/code>, <code>value<\/code>]. Sensors do not cover a big area. We can&#8217;t expect observations to be taken at a regular grid, which is easy to interpolate. 
The final image from the raw data source may look like this one:<\/p>\n\n\n<div class=\"wp-block-image is-style-default\">\n<figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"915\" height=\"958\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030.png\" alt=\"The map of PM 2.5 daily concentrations in Poland.\" class=\"wp-image-958\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030.png 915w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030-287x300.png 287w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030-768x804.png 768w\" sizes=\"auto, (max-width: 915px) 100vw, 915px\" \/><figcaption>Figure 1: Particulate Matter 2.5 um pollution in Poland.<\/figcaption><\/figure><\/div>\n\n\n<p>This type of point map may be helpful for experts, but the general public will be bored, or much worse &#8211; confused &#8211; by this kind of visualization. There are so many white areas! Fortunately, we can interpolate air pollution in every hex within this map. We are going to learn how to do it step-by-step. We will use the Ordinary Kriging method implemented in the <code>pyinterpolate<\/code> package.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Data<\/h2>\n\n\n\n<p>We are going to use two datasets. One represents point measurements of PM2.5 concentrations, and the other is a hex-grid over which we will interpolate PM2.5 concentrations. You should be able to follow each step in this article with your own data. 
Just keep in mind these constraints:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>both datasets should have the same projection,<\/li><li>points should be within the hexgrid (or other blocks \/ MultiPolygon) area.<\/li><\/ul>\n\n\n\n<p>The PM 2.5 measurements are available <a href=\"https:\/\/github.com\/SimonMolinsky\/articles\/blob\/main\/2021-10\/interpolation-air-pollution\/pm25_year_2021_daily.csv\">here<\/a>, and you may download the hexgrid <a href=\"https:\/\/github.com\/SimonMolinsky\/articles\/blob\/main\/2021-10\/interpolation-air-pollution\/hexgrid.shp\" data-type=\"URL\" data-id=\"https:\/\/github.com\/SimonMolinsky\/articles\/blob\/main\/2021-10\/interpolation-air-pollution\/hexgrid.shp\">here<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Setup<\/h2>\n\n\n\n<p>The safest method of setup is to create a conda environment and install the package within it. <code>Pyinterpolate<\/code> works with Python 3.7 up to Python 3.10 and on UNIX systems, so if you are a Windows user, you should consider using <em>Windows Subsystem for Linux<\/em> (WSL). 
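<\/p>\n\n\n\n<p>Once the packages from the next steps are installed, the two constraints above can be verified in a few lines of <code>geopandas<\/code>. Below is a minimal, hedged sketch with toy geometries (for the real datasets, substitute the hexgrid and the measurement points loaded later in this tutorial):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import geopandas as gpd
from shapely.geometry import Point, Polygon

# Toy stand-ins: one square 'block' and two measurement points,
# one inside the block and one far outside it
canvas = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])], crs='EPSG:2180')
gdf = gpd.GeoDataFrame(geometry=[Point(5, 5), Point(50, 5)], crs='EPSG:2180')

# Constraint 1: both datasets share the same projection
assert canvas.crs == gdf.crs

# Constraint 2: count the points that fall within the blocks
inside = gpd.sjoin(gdf, canvas, predicate='within')
print(len(inside))  # only the first point is covered by the grid<\/pre>\n\n\n\n<p>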
<\/p>\n\n\n\n<p>Let&#8217;s start with <code>conda<\/code> environment preparation:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda create -n [NAME OF YOUR ENV]<\/pre>\n\n\n\n<p>We name the environment <code>spatial-interpolate<\/code>, but you may choose your own name:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda create -n spatial-interpolate<\/pre>\n\n\n\n<p>Next, activate the environment:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda activate spatial-interpolate<\/pre>\n\n\n\n<p>Then install the core packages. I assume you use Python 3.7+ as your base version (Python 3.7, 3.8, 3.9, or 3.10):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">(spatial-interpolate) conda install -c conda-forge pip notebook<\/pre>\n\n\n\n<p>And install <code>pyinterpolate<\/code> with <code>pip<\/code>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">(spatial-interpolate) pip install pyinterpolate<\/pre>\n\n\n\n<p>Sometimes you may get an error related to a missing library 
named <code>libspatialindex_c.so<\/code>. To install it on Linux, write in the terminal:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">sudo apt install libspatialindex-dev<\/pre>\n\n\n\n<p>And to install it on macOS, use this command in the terminal:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">brew install spatialindex<\/pre>\n\n\n\n<p>At this point, we should be able to move forward. But, if you encounter any problems, don&#8217;t hesitate to ask for help <a href=\"https:\/\/github.com\/DataverseLabs\/pyinterpolate\/issues\">here<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Read Data<\/h2>\n\n\n\n<p>Our task is to transform point observations into a country-wide hexgrid. Observations are stored in a <code>csv<\/code> file. Each column is a timestamp, and rows represent stations. Two additional columns <code>x<\/code> and <code>y<\/code> are the <code>longitude<\/code> and <code>latitude<\/code> of a station. The hexgrid is a <code>shapefile<\/code>, which consists of six different files. We need to read only one of those &#8211; the one with the <code>.shp<\/code> extension &#8211; but all the others must be placed in the same directory and share the same base name. The <code>csv<\/code> file can be read with <code>pandas<\/code>. 
The <code>shapefile<\/code> must be loaded with <code>geopandas<\/code>.<\/p>\n\n\n\n<p>We will import all functions and packages in the first cell of a notebook, then we will read the files and check how they look inside.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import pandas as pd\nimport geopandas as gpd\nimport numpy as np\nfrom pyproj import CRS\n\nfrom pyinterpolate import build_experimental_variogram, TheoreticalVariogram, kriging<\/pre>\n\n\n\n<p>We have imported a lot of stuff here! Let&#8217;s check what those packages and functions are responsible for:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><code>pandas<\/code>: reads the <code>csv<\/code> file,<\/li><li><code>geopandas<\/code>: reads the <code>shapefile<\/code> and transforms geometries,<\/li><li><code>numpy<\/code>: algebra and matrix operations,<\/li><li><code>pyproj.CRS<\/code>: sets the Coordinate Reference System (the projection of points),<\/li><li><code>build_experimental_variogram()<\/code>: calculates the dissimilarity between observations as a function of distance,<\/li><li><code>TheoreticalVariogram<\/code>: creates a model of dissimilarity between samples as a function of distance,<\/li><li><code>kriging()<\/code>: interpolates missing values from a set of points and a semivariogram.<\/li><\/ul>\n\n\n\n<p>Don&#8217;t worry if you don&#8217;t understand all the terms above. 
We will see what they <strong>do<\/strong> in practice in the following steps.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">readings = pd.read_csv('pm25_year_2021_daily.csv', index_col='name')\ncanvas = gpd.read_file('hexgrid.shp')\ncanvas['points'] = canvas.centroid<\/pre>\n\n\n\n<p>The variable <code>readings<\/code> is a <code>DataFrame<\/code> with 153 columns, of which two are spatial coordinates. The rest are daily averages of PM 2.5 concentrations. It is a time series, but we are going to select one date for further analysis. The variable <code>canvas<\/code> represents a set of geometries and their ids. Geometries are stored as hexagonal blocks. We need to create a point grid from those blocks. It is straightforward in <code>GeoPandas<\/code>. We can use the <code>.centroid<\/code> attribute to get the centroid of each geometry.<\/p>\n\n\n\n<p>We will limit the number of columns in the <code>DataFrame<\/code> to <code>['01.01.2021', 'x', 'y']<\/code> and rename the first column to <code>pm2.5<\/code>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># .copy() prevents a SettingWithCopyWarning when we add a geometry column later\ndf = readings[['01.01.2021', 'x', 'y']].copy()\ndf.columns = ['pm2.5', 'x', 'y']<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Transform <code>DataFrame<\/code> into <code>GeoDataFrame<\/code> (Optional)<\/h2>\n\n\n\n<p>We can transform our <code>DataFrame<\/code> into a <code>GeoDataFrame<\/code> to be able to show the data on a map. There are two basic steps to do it correctly. First, we should prepare the Coordinate Reference System (CRS). It is not simple to define a concrete CRS from scratch because it has multiple properties. 
Let&#8217;s take a look at the CRS used to work with geographical data over Poland:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;Projected CRS: EPSG:2180>\nName: ETRF2000-PL \/ CS92\nAxis Info [cartesian]:\n- x[north]: Northing (metre)\n- y[east]: Easting (metre)\nArea of Use:\n- name: Poland - onshore and offshore.\n- bounds: (14.14, 49.0, 24.15, 55.93)\nCoordinate Operation:\n- name: Poland CS92\n- method: Transverse Mercator\nDatum: ETRF2000 Poland\n- Ellipsoid: GRS 1980\n- Prime Meridian: Greenwich<\/pre>\n\n\n\n<p>We are not going to write such long descriptions. Fortunately for us, the <em>European Petroleum Survey Group<\/em> (EPSG) created a database with numbered indexes that point directly to the most common CRS. We use <em>EPSG:2180<\/em> (look at the first row of the CRS description, and you&#8217;ll see it). Python&#8217;s <code>pyproj<\/code> package has a built-in database of EPSG-CRS pairs. We can provide an EPSG number to get a specific CRS:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">DATA_CRS = CRS.from_epsg('2180')<\/pre>\n\n\n\n<p>Bear in mind that we already know the CRS\/EPSG of our data source. Sometimes it is not so simple with data retrieved from CSV files. You will be forced to dig deeper or try a few projections blindly to get a valid geometry (my first try is usually EPSG 4326 &#8211; from GPS coordinates). The next step is to transform our floating-point coordinates into <code>Point()<\/code> geometry. After this transformation and setting a CRS, we get a <code>GeoDataFrame<\/code>. 
<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">df['geometry'] = gpd.points_from_xy(df['x'], df['y'])\ngdf = gpd.GeoDataFrame(df, geometry='geometry', crs=DATA_CRS)<\/pre>\n\n\n\n<p>We can show our data on a map! First, we remove <code>NaN<\/code> rows, and then we will visualize available points over a canvas used for interpolation.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Remove NaNs\n\ngdf = gdf.dropna()<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">base = canvas.plot(color='white', edgecolor='black', alpha=0.1, figsize=(12, 12))\ngdf.plot(ax=base, column='pm2.5', legend=True, vmin=0, vmax=50, cmap='coolwarm')\nbase.set_title('2021-01-01 Daily PM 2.5 concentrations in Poland');<\/pre>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030.png\" alt=\"\" class=\"wp-image-958\" width=\"686\" height=\"719\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030.png 915w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030-287x300.png 287w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/initial_set_air_poll_v030-768x804.png 768w\" sizes=\"auto, (max-width: 686px) 100vw, 686px\" \/><\/figure><\/div>\n\n\n<h2 
class=\"wp-block-heading\">Build a variogram model<\/h2>\n\n\n\n<p>Before interpolation, we must create the (semi)variogram model. What is it? It is a function that measures dissimilarity between points. Its foundation is an assumption that close points tend to be similar and distant points are usually different. This simple technique is a potent tool for analysis. To make it work, we must set a few constants:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><code>STEP_SIZE<\/code>: the distance within points are grouped (binned) together. We start from a distance zero, then <code>0 + STEP_SIZE<\/code> from a point and so on. We test the average dissimilarity of a point and its neighbors within the <strong>step size i<\/strong> and <strong>step size i+1<\/strong>.<\/li><li><code>MAX_RANGE<\/code>: maximum range of analysis. At some range there is no correlation between points and it is a waste of resources to analyze such distant radius from the initial point. <code>MAX_RANGE<\/code> should be smaller than the maximum distance between points in dataset. Usually it is a half of the maximum distance.<\/li><\/ul>\n\n\n\n<p>The critical thing to notice is to set <code>STEP_SIZE<\/code> and <code>MAX_RANGE<\/code> in the units of data. Sometimes you get coordinates in degrees and sometimes in meters. You should always be aware of this fact. Our projection unit is a meter. We can check that the maximum distance between points could be close to 500 km. Thus, we set our maximum range close to the upper limit, to 400 kilometers. <\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">km_c = 10**3\nSTEP_SIZE = km_c * 50\nMAX_RANGE = km_c * 400<\/pre>\n\n\n\n<p><code>Pyinterpolate<\/code> operates on <code>numpy<\/code> arrays. 
That&#8217;s why we must provide [<code>x<\/code>, <code>y<\/code>, <code>pm2.5<\/code>] as an array of values instead of a <code>GeoDataFrame<\/code>. We use this array along with <code>STEP_SIZE<\/code> and <code>MAX_RANGE<\/code> to calculate the spatial dissimilarity between points with the <code>build_experimental_variogram()<\/code> function:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">arr = gdf[['x', 'y', 'pm2.5']].values\n\nsemivar = build_experimental_variogram(input_array=arr, step_size=STEP_SIZE, max_range=MAX_RANGE)<\/pre>\n\n\n\n<p>It is an intermediate step. If you print <code>semivar<\/code>, you will see an array with three columns:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">array([[0.00000000e+00, 5.05416667e+00, 6.00000000e+01],\n       [3.00000000e+04, 5.76951786e+01, 5.60000000e+01],\n       [6.00000000e+04, 9.47646939e+01, 9.80000000e+01],\n       [9.00000000e+04, 7.46993333e+01, 1.50000000e+02],\n       [1.20000000e+05, 7.45332632e+01, 1.90000000e+02],\n       [1.50000000e+05, 1.31586140e+02, 2.28000000e+02],\n       [1.80000000e+05, 1.24392830e+02, 2.12000000e+02],\n       [2.10000000e+05, 1.40270976e+02, 2.46000000e+02],\n       [2.40000000e+05, 1.41650229e+02, 2.18000000e+02],\n       [2.70000000e+05, 1.42644732e+02, 2.24000000e+02]])<\/pre>\n\n\n\n<p>What is the meaning of each column?<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Column 1 is the <strong>lag<\/strong>. We test all points within <code>(lag - 0.5 * step size, lag + 0.5 * step size]<\/code>. 
(It is done under the hood with the function <code><a href=\"https:\/\/github.com\/DataverseLabs\/pyinterpolate\/blob\/main\/pyinterpolate\/transform\/select_values_in_range.py\">select_values_in_range()<\/a><\/code>). To make it easier to understand, let&#8217;s look into the figure below that shows lags, the step size, and the max range from the perspective of a single point.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-large is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange-1024x1024.jpg\" alt=\"\" class=\"wp-image-553\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange-1024x1024.jpg 1024w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange-300x300.jpg 300w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange-150x150.jpg 150w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange-768x768.jpg 768w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange-550x550.jpg 550w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange-1125x1125.jpg 1125w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/10\/stepsizevslagvsmaxrange.jpg 1400w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\"><li>Column 2 is the semivariance. We can understand this parameter as a measurement of dissimilarity between points within the bins given by a lag and a step size. It is the value averaged over all points and their neighbors per lag.<\/li><li>Column 3 is the number of points within a specific distance bin.<\/li><\/ul>\n\n\n\n<p>What we have created is the <strong>experimental semivariogram<\/strong>. With it, we can create a theoretical model of semivariance. 
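<\/p>\n\n\n\n<p>Before moving on, let&#8217;s demystify the semivariance column by computing it by hand for a tiny one-dimensional series. The snippet below is only an illustration, not <code>pyinterpolate<\/code> internals: for each lag, it takes all pairs of values separated by that lag, averages the squared differences, and halves the result.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np

# Toy 1-D series measured at positions 0, 1, 2, 3, 4
values = np.array([1.0, 3.0, 2.0, 4.0, 6.0])

def semivariance(z, lag):
    # gamma(lag) = 0.5 * mean of squared differences between
    # values separated exactly by the lag
    diffs = z[lag:] - z[:-lag]
    return 0.5 * np.mean(diffs ** 2)

for lag in (1, 2, 3):
    print(lag, semivariance(values, lag))  # prints 1.625, 3.0, 4.5<\/pre>\n\n\n\n<p>Note how the toy semivariance grows with the lag, just like in our experimental semivariogram. 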
Pyinterpolate has a particular class for semivariogram modeling named <code>TheoreticalVariogram<\/code>. We initialize it without arguments and then fit a model to the experimental semivariogram. We can model a semivariogram manually, but the <code>TheoreticalVariogram<\/code> class has a method to do it automatically. It is named <code>.autofit()<\/code>. It returns the best model parameters and an error level (RMSE, Forecast Bias, Mean Absolute Error, or Symmetric Mean Absolute Percentage Error):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">ts = TheoreticalVariogram()\nts.autofit(experimental_variogram=semivar, verbose=False)<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"json\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{'model_type': 'exponential',\n 'nugget': 0,\n 'sill': 148.20096938775512,\n 'rang': 115532.31719150333,\n 'fitted_model': array([[5.00000000e+04, 5.20624642e+01],\n        [1.00000000e+05, 8.58355732e+01],\n        [1.50000000e+05, 1.07744311e+02],\n        [2.00000000e+05, 1.21956589e+02],\n        [2.50000000e+05, 1.31176145e+02],\n        [3.00000000e+05, 1.37156904e+02],\n        [3.50000000e+05, 1.41036644e+02]]),\n 'rmse': 5.590562271875477,\n 'bias': nan,\n 'mae': nan,\n 'smape': nan}<\/pre>\n\n\n\n<p>Another helpful method is <code>.plot()<\/code>. 
It draws the experimental points and the theoretical curve, so we can check whether the semivariogram is valid and compare how the model behaves at close distances versus long distances.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1005\" height=\"525\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/theo_semi_air_pollution_v030.png\" alt=\"The theoretical semivariogram of the data.\" class=\"wp-image-961\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/theo_semi_air_pollution_v030.png 1005w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/theo_semi_air_pollution_v030-300x157.png 300w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/theo_semi_air_pollution_v030-768x401.png 768w\" sizes=\"auto, (max-width: 1005px) 100vw, 1005px\" \/><figcaption>The comparison between the modeled curve and the experimental values.<\/figcaption><\/figure>\n\n\n\n<p>As you can see, the dissimilarity increases with distance. Distant points tend to have different PM 2.5 concentrations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Interpolate Missing Values with Kriging<\/h2>\n\n\n\n<p>The last step of our tutorial is interpolation. We use the <code>kriging<\/code> method. It is a function that predicts values at unknown locations based on the closest neighbors and the theoretical semivariance. In practice, it multiplies the concentrations of the n closest neighbors by weights derived from the distances between the predicted location and its neighbors. <code>Pyinterpolate<\/code> makes this operation extremely simple. 
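<\/p>\n\n\n\n<p>To build intuition for where those weights come from, here is a hedged toy sketch of the Ordinary Kriging linear system (an illustration only, not <code>pyinterpolate<\/code> internals). We assume a simple linear semivariogram and solve for the weights and the Lagrange multiplier with <code>numpy<\/code>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np

# Toy 1-D Ordinary Kriging: three known points, one target location
x = np.array([0.0, 1.0, 3.0])   # known locations
z = np.array([2.0, 4.0, 5.0])   # known values
target = 2.0                    # location to interpolate

def gamma(h):
    # assumed linear semivariogram model: gamma(h) = |h|
    return np.abs(h)

n = len(x)
# Left-hand side: semivariances between known points,
# extended by a row and a column of ones for the Lagrange multiplier
a = np.ones((n + 1, n + 1))
a[:n, :n] = gamma(x[:, None] - x[None, :])
a[n, n] = 0.0
# Right-hand side: semivariances from the known points to the target
b = np.ones(n + 1)
b[:n] = gamma(x - target)

weights = np.linalg.solve(a, b)[:n]
estimate = weights @ z
print(weights.sum(), estimate)  # weights sum to 1; the estimate is 4.5 here<\/pre>\n\n\n\n<p>The weights always sum to one, and closer points get larger weights; an analogous system is solved for every interpolated location. 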
We use the <code>kriging()<\/code> function and pass five parameters into it:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><code>observations<\/code>: an array with triplets <code>[x, y, value]<\/code>; those are the known measurements.<\/li><li><code>theoretical_model<\/code>: a fitted <code>TheoreticalVariogram<\/code> model.<\/li><li><code>points<\/code>: a list of points to interpolate values at.<\/li><li><code>neighbors_range<\/code>: a distance within which we include neighbors to interpolate a missing value.<\/li><li><code>no_neighbors<\/code>: the number of neighbors to include in the interpolation (if there are more neighbors in the <code>neighbors_range<\/code> than <code>no_neighbors<\/code>, then the algorithm takes only the first <code>no_neighbors<\/code> points based on the sorted distances).<\/li><\/ol>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Predict\n\npts = list(zip(canvas['points'].x, canvas['points'].y))  # Here we get points from canvas\n\npredicted = kriging(observations=arr,\n                    theoretical_model=ts,\n                    points=pts,\n                    neighbors_range=MAX_RANGE \/ 2,\n                    no_neighbors=8)<\/pre>\n\n\n\n<p>Do you remember when we created <strong>centroids for our hexagonal canvas<\/strong>? We stored them in the column named <code>points<\/code>. 
Now we use every row of this column and interpolate the PM 2.5 concentration for it!<\/p>\n\n\n\n<p>After prediction, we will transform the output array into a <code>GeoDataFrame<\/code> to show the results and store them for future work:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">cdf = gpd.GeoDataFrame(data=predicted, columns=['yhat', 'err', 'x', 'y'])\ncdf['geometry'] = gpd.points_from_xy(cdf['x'], cdf['y'])\ncdf.set_crs(crs=DATA_CRS, inplace=True)<\/pre>\n\n\n\n<p>The <code>kriging()<\/code> function returns four values for each interpolated point:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Estimated value.<\/li><li>Error.<\/li><li>Longitude.<\/li><li>Latitude.<\/li><\/ol>\n\n\n\n<p>We create a <code>GeoDataFrame<\/code> from those, build the <code>geometry<\/code> column from <code>x<\/code> and <code>y<\/code>, and set the CRS. 
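<\/p>\n\n\n\n<p>Since we want to keep the results for future work, we can attach the predictions back to the hexagon polygons and save them to disk. Below is a minimal, hedged sketch with toy data (the file name is illustrative; for the real datasets, the assumption that the prediction order matches the canvas rows should be verified against the returned <code>x<\/code> and <code>y<\/code> columns):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import geopandas as gpd
from shapely.geometry import Polygon

# Two toy blocks standing in for the hexgrid
blocks = gpd.GeoDataFrame(
    geometry=[Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
              Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])],
    crs='EPSG:2180')

# Interpolated values in the same row order as the blocks
blocks['yhat'] = [14.2, 11.7]

# GeoPackage keeps geometry, attributes, and the CRS in a single file
blocks.to_file('interpolated_toy.gpkg', driver='GPKG')
print(list(gpd.read_file('interpolated_toy.gpkg')['yhat']))<\/pre>\n\n\n\n<p>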
Finally, we can show results:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">base = canvas.plot(color='white', edgecolor='black', alpha=0.1, figsize=(12, 12))\ncdf.plot(ax=base, column='yhat', legend=True, cmap='coolwarm')\nbase.set_title('Interpolated 2021-01-01 Daily PM 2.5 concentrations in Poland');<\/pre>\n\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"915\" height=\"944\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/interpol_air_poll_v030.png\" alt=\"The interpolated air pollution measurements map\" class=\"wp-image-964\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/interpol_air_poll_v030.png 915w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/interpol_air_poll_v030-291x300.png 291w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2022\/09\/interpol_air_poll_v030-768x792.png 768w\" sizes=\"auto, (max-width: 915px) 100vw, 915px\" \/><\/figure>\n\n\n\n<p>Isn&#8217;t it much more interesting than a scatter plot?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Additional Reading<\/h2>\n\n\n\n<ol class=\"wp-block-list\"><li><a href=\"https:\/\/github.com\/DataverseLabs\/pyinterpolate\/blob\/main\/tutorials\/Semivariogram%20Estimation%20(Basic).ipynb\">More about Semivariogram Modeling within <code>pyinterpolate<\/code><\/a><\/li><li><a href=\"https:\/\/github.com\/DataverseLabs\/pyinterpolate\/blob\/main\/tutorials\/Ordinary%20and%20Simple%20Kriging%20(Basic).ipynb\">More about Kriging within <code>pyinterpolate<\/code><\/a><\/li><li><a href=\"https:\/\/ml-gis-service.com\/index.php\/spatial-interpolation-101\/\">Learn about spatial interpolation!<\/a><\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Exercise<\/h2>\n\n\n\n<ol 
class=\"wp-block-list\"><li>Try to reproduce this tutorial with other measurements. For example, temperature readings, height above sea level, humidity&#8230;<\/li><\/ol>\n","protected":false},"excerpt":{"rendered":"<p>How to get dense and continuous map from point observations in Python<\/p>\n","protected":false},"author":1,"featured_media":560,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[18,66,3,30,31],"tags":[119,117,47,123,118,122,121,120,7,62],"class_list":["post-538","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-pyinterpolate","category-python","category-spatial-statistics","category-tutorials","tag-air-quality","tag-analytics","tag-data-science","tag-distance","tag-interpolation","tag-measurements","tag-pm-10","tag-pm-2-5","tag-python","tag-spatial"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/538","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=538"}],"version-history":[{"count":36,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/538\/revisions"}],"predecessor-version":[{"id":967,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/538\/revisions\/967"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/560"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=538"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"htt
ps:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=538"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=538"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}