{"id":107,"date":"2021-03-19T21:16:46","date_gmt":"2021-03-19T21:16:46","guid":{"rendered":"https:\/\/ml-gis-service.com\/?p=107"},"modified":"2021-03-20T10:42:42","modified_gmt":"2021-03-20T10:42:42","slug":"data-science-feature-engineering-with-spatial-flavour","status":"publish","type":"post","link":"https:\/\/ml-gis-service.com\/index.php\/2021\/03\/19\/data-science-feature-engineering-with-spatial-flavour\/","title":{"rendered":"Data Science: Feature Engineering with Spatial Flavour"},"content":{"rendered":"\n<p>Imagine that we are working for a <strong>real estate agency and our role is to estimate apartment rental prices in different parts of New York City<\/strong>. We have obtained multiple <em>features<\/em>:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>host type,<\/li><li>host name,<\/li><li>neighborhood (district name),<\/li><li>latitude,<\/li><li>longitude,<\/li><li>room type,<\/li><li>minimum nights,<\/li><li>number of reviews,<\/li><li>last review date,<\/li><li>reviews per month,<\/li><li>calculated host listings count,<\/li><li>availability during the year (days out of 365).<\/li><\/ul>\n\n\n\n<p>In the classic machine learning approach we work through those variables and build a model to predict the price. If the results are not satisfactory, we can perform <strong>feature engineering<\/strong> to create more inputs for the model:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>multiply features together, take logarithms of values, or square them,<\/li><li>perform label or one-hot encoding of categorical variables,<\/li><li>find patterns in reviews (sentiment analysis),<\/li><li>use geographical data and create spatial features.<\/li><\/ul>\n\n\n\n<p>We will consider the last option and learn how to retrieve spatial information using the <strong>GeoPandas<\/strong> package and publicly available geographical datasets. 
We will learn how to:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Transform spatial data into new variables for ML models.<\/li><li>Use the<strong> GeoPandas<\/strong> package along with <strong>Pandas<\/strong> to derive spatial features from simple tabular records.<\/li><li>Retrieve and process freely available spatial data.<\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Data and environment<\/h2>\n\n\n\n<p>Data for this article is shared on <em>Kaggle<\/em>. It&#8217;s an <em>AirBnB dataset of apartment rental prices in New York City<\/em>. You may download it from here: <a href=\"https:\/\/www.kaggle.com\/dgomonov\/new-york-city-airbnb-open-data\">https:\/\/www.kaggle.com\/dgomonov\/new-york-city-airbnb-open-data<\/a>. Other datasets are shared by the city of New York here: <a href=\"https:\/\/www1.nyc.gov\/site\/planning\/data-maps\/open-data\/districts-download-metadata.page\">https:\/\/www1.nyc.gov\/site\/planning\/data-maps\/open-data\/districts-download-metadata.page<\/a>. Data is also available in the <strong><a href=\"https:\/\/github.com\/szymon-datalions\/articles\/tree\/main\/2021-03\/airbnb-data-augmentation\">blogpost repository<\/a><\/strong>. We will be working in a <strong>Jupyter Notebook<\/strong>, which requires a prior <code>conda<\/code> installation. 
Main Python packages used in this tutorial are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong><a href=\"https:\/\/pandas.pydata.org\">pandas<\/a><\/strong> for data wrangling and feature engineering,<\/li><li><strong><a href=\"https:\/\/geopandas.org\">geopandas<\/a><\/strong> for spatial data processing, along with <strong><a href=\"https:\/\/pypi.org\/project\/Shapely\/\">shapely<\/a><\/strong>, which works under the hood and is installed as a GeoPandas dependency,<\/li><li><strong><a href=\"https:\/\/numpy.org\/\">numpy<\/a><\/strong> for linear algebra,<\/li><li><strong><a href=\"https:\/\/matplotlib.org\/\">matplotlib<\/a><\/strong> and <strong><a href=\"https:\/\/seaborn.pydata.org\/\">seaborn<\/a><\/strong> for data visualization.<\/li><\/ul>\n\n\n\n<p>First we create a new <code>conda<\/code> environment with the desired packages. All steps are presented below. You may copy the code into your terminal and everything should go well. Remember that our environment name is <strong>airbnb<\/strong>. <strong>You may change it to your own<\/strong> in the first step, after the <code>-n<\/code> flag.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda create -n airbnb -c conda-forge pandas geopandas numpy matplotlib seaborn notebook<\/pre>\n\n\n\n<p>Then activate the environment:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">conda activate airbnb<\/pre>\n\n\n\n<p>That&#8217;s all! 
Run a new notebook with the command:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"raw\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">jupyter notebook<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Data Exploration<\/h2>\n\n\n\n<p>Before we do any processing, we take a quick look at our datasets. The core dataset is provided by <em>AirBnB<\/em>. We import all packages for further processing at once:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import numpy as np\nimport geopandas as gpd\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nfrom scipy.stats import mstats\nfrom shapely.geometry import Point<\/pre>\n\n\n\n<p>Then we set project constants. In our case those are paths to the files and names of special columns:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Set project constants\n\nbaseline_dataset = 'data\/AB_NYC_2019.csv'\nbaseline_id_column = 'id'\n\nboroughs = 'data\/nybbwi_20d\/nybbwi.shp'\nfire_divs = 'data\/nyfd_20d\/nyfd.shp'\nhealth_cns = 'data\/nyhc_20d\/nyhc.shp'\npolice_prs = 'data\/nypp_20d\/nypp.shp'\nschool_dis = 'data\/nysd_20d\/nysd.shp'<\/pre>\n\n\n\n<p>The baseline of our work is a dataset without any <strong>spatial features<\/strong>. Sure, <code>baseline_dataset<\/code> contains <code>latitude<\/code> and <code>longitude<\/code> columns, but they are not treated as a single object of type <code>Point<\/code>. Instead, they&#8217;re two <code>float<\/code>s. 
This dataset contains <em>hidden<\/em> spatial information. Columns <code>neighbourhood_group<\/code> and <code>neighbourhood<\/code> describe spatial features &#8211; administrative units &#8211; by their names. The problem is that we don&#8217;t know which areas are close to each other, how big they are, whether they are close to specific objects like rivers or the sea, and so on&#8230;<\/p>\n\n\n\n<p>The <code>latitude<\/code> \/ <code>longitude<\/code> description is a data engineering problem and we solve it with a simple transformation. Why do we bother? Because spatial computations and spatial joins are much easier with spatial packages, where data is stored as a special object named <code>Point<\/code>. It will be much easier to calculate the distance from each apartment to a borough center, or to discover within which district an apartment is located.<\/p>\n\n\n\n<p>The initial step of data exploration is simply to look at the data. We read it with the pandas <code>read_csv()<\/code> method and display a <code>sample()<\/code> of rows:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Read NYC airbnb data and take a look into it\n\ndf = pd.read_csv(baseline_dataset,\n                 index_col=baseline_id_column)\nprint(df.sample(3))<\/pre>\n\n\n\n<p>The code takes a random sample of three rows and the notebook displays it for us. 
Our sample may look like this:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong><em>id<\/em><\/strong><\/td><td><strong>neighbourhood<\/strong><\/td><td><strong>&#8230;<\/strong><\/td><td><strong>latitude<\/strong><\/td><td><strong>longitude<\/strong><\/td><td><strong>price<\/strong><\/td><\/tr><tr><td><em>34228220<\/em><\/td><td>Hell&#8217;s Kitchen<\/td><td>&#8230;<\/td><td>40.75548<\/td><td>-73.99513<\/td><td>225<\/td><\/tr><\/tbody><\/table><figcaption>Table 1: Sample from baseline DataFrame.<\/figcaption><\/figure>\n\n\n\n<p>We can use the <code>DataFrame.info()<\/code> method to get better insight into what&#8217;s going on in the data, or <code>DataFrame.describe()<\/code> to access basic statistical properties of the numerical variables. We skip those steps here, but the <a href=\"https:\/\/github.com\/szymon-datalions\/articles\/tree\/main\/2021-03\/airbnb-data-augmentation\">article notebook<\/a> has cells with those methods for the sake of completeness. We then load all spatial datasets, check their properties and perform a visual inspection. <code>Shapefiles<\/code> are read by <strong>GeoPandas<\/strong> with the <code>geopandas.read_file()<\/code> method:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Spatial data sources\n\ngdf_boroughs = gpd.read_file(boroughs)\ngdf_fire_divs = gpd.read_file(fire_divs)\ngdf_police_prs = gpd.read_file(police_prs)\ngdf_school_dis = gpd.read_file(school_dis)\ngdf_health_cns = gpd.read_file(health_cns)<\/pre>\n\n\n\n<p>When working with spatial data, it is always a good idea to plot it and check its <em>Coordinate Reference System<\/em> (CRS). 
As an example, for the Health Center districts (note that the loaded columns, such as <code>HCentDist<\/code>, describe health centers, so we title the plot accordingly):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(gdf_health_cns.info())\nprint(gdf_health_cns.crs)\nbase = gdf_boroughs.plot(color='white', edgecolor='black', figsize=(12, 12))\ngdf_health_cns.plot(ax=base, color='lightgray', edgecolor='darkgray');\nplt.title('Health Center Districts within main boroughs in New York City.');<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;class 'geopandas.geodataframe.GeoDataFrame'>\nRangeIndex: 30 entries, 0 to 29\nData columns (total 6 columns):\n #   Column      Non-Null Count  Dtype   \n---  ------      --------------  -----   \n 0   BoroCode    30 non-null     int64   \n 1   BoroName    30 non-null     object  \n 2   HCentDist   30 non-null     int64   \n 3   Shape_Leng  30 non-null     float64 \n 4   Shape_Area  30 non-null     float64 \n 5   geometry    30 non-null     geometry\ndtypes: float64(2), geometry(1), int64(2), object(1)\nmemory usage: 1.5+ KB\nNone\nepsg:2263<\/pre>\n\n\n\n<p>The most important facts about this dataset are that it has: <\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>30 entries (30 health center districts),<\/li><li>one column of type <code>geometry<\/code>, encoded in the <code>EPSG:2263<\/code> projection, which is designed for the New York State Plane Long Island Zone.<\/li><\/ul>\n\n\n\n<p>We can plot the dataset as polygons representing administrative boundaries (Figure 1) with the <strong>GeoPandas<\/strong> <code>.plot()<\/code> method.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"703\" height=\"712\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_1.png\" alt=\"\" class=\"wp-image-135\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_1.png 703w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_1-296x300.png 296w\" sizes=\"auto, (max-width: 703px) 100vw, 703px\" \/><figcaption>Figure 1: Spatial data with health center districts in New York City.<\/figcaption><\/figure><\/div>\n\n\n\n<p>At this moment everything seems to be okay. The thing to remember is that our <strong>baseline dataset is not spatial at all<\/strong>, and its CRS is different from the CRS of the loaded shapefiles. Datasets with latitude \/ longitude angles are described by the <code>EPSG:4326<\/code> projection in most cases. This is the projection used by GPS, which is the reason why it is so popular.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Transformation from a simple table into a spatial dataset<\/h2>\n\n\n\n<p>Our dataset has missing values, but before we check duplicates and <code>NaN<\/code>s we will transform the geometry from two <code>floats<\/code> into one <code>Point<\/code> object. From the engineering point of view, we take a <code>DataFrame<\/code> and transform it into a <code>GeoDataFrame<\/code>. In practice we must create a column with the <code>geometry<\/code> attribute and set its projection. <strong>Most datasets with point measurements are probably described by the <code>EPSG:4326<\/code> projection. But be careful with this assumption and always find and check the metadata, because you may get strange results if you assign the wrong projection to your data.<\/strong><\/p>\n\n\n\n<p>Before we drop unwanted columns and\/or rows with missing values and duplicates, we are going to create <code>Points<\/code> from the <code>latitude<\/code> and <code>longitude<\/code> columns. 
<code>Point<\/code> is a basic data type from the <strong>shapely<\/strong> package, which is used under the hood of <strong>GeoPandas<\/strong>. Transforming a pair of values into a <code>Point<\/code> is simple: we pass the two values to the <code>Point<\/code> constructor. For example:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from shapely.geometry import Point\n\nlat = 51.5074\nlon = 0.1278\n\npair = [lon, lat]  # first lon (x), then lat (y)\npoint = Point(pair)<\/pre>\n\n\n\n<p>Then we have to perform this transformation at scale. We use a <code>lambda<\/code> expression and apply it to the lat\/lon columns to transform them into points. The function below is one solution to our problem:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Transform geometry\n\ndef lat_lon_to_point(dataframe, lon_col='longitude', lat_col='latitude'):\n    \"\"\"Function transforms longitude and latitude coordinates into a GeoSeries.\n    \n    INPUT:\n    \n    :param dataframe: DataFrame to be transformed,\n    :param lon_col: (str) longitude column name, default is 'longitude',\n    :param lat_col: (str) latitude column name, default is 'latitude'.\n    \n    OUTPUT:\n    \n    :return: (GeoPandas GeoSeries object)\n    \"\"\"\n\n    geometry = dataframe.apply(lambda x: Point([x[lon_col], x[lat_col]]), axis=1)\n    geoseries = gpd.GeoSeries(geometry)\n    geoseries.name = 'geometry'\n    \n    return geoseries<\/pre>\n\n\n\n<p>The function returns a <code>GeoSeries<\/code> object. 
It is similar to the <code>Series<\/code> known from <strong>Pandas<\/strong>, but it has special properties designed for spatial datasets, with its own attributes and methods. In our case it is just a list of Points. At this moment we can easily append it to our baseline <code>DataFrame<\/code>.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">geometry = lat_lon_to_point(df)\ngdf = df.join(geometry)<\/pre>\n\n\n\n<p>If we check the info of our new dataframe <code>gdf<\/code>, we see that the <code>geometry<\/code> column has a special <code>Dtype<\/code>, even though our <code>gdf<\/code> is still a classic <code>DataFrame<\/code> from <strong>Pandas<\/strong>!<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(gdf.info())<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">&lt;class 'pandas.core.frame.DataFrame'>\nInt64Index: 48895 entries, 2539 to 36487245\nData columns (total 16 columns):\n #   Column                          Non-Null Count  Dtype   \n---  ------                          --------------  -----   \n 0   name                            48879 non-null  object  \n 1   host_id                         48895 non-null  int64   \n 2   host_name                       48874 non-null  object  \n 3   neighbourhood_group             48895 non-null  object  \n 4   neighbourhood                   48895 non-null  object  \n 5   latitude                        48895 non-null  float64 \n 6   longitude                       48895 non-null  float64 \n 7   room_type                       48895 non-null  object  \n 8   price                           48895 non-null  int64   \n 9   minimum_nights                  48895 non-null  int64   \n 10  number_of_reviews               48895 non-null  int64   \n 11  last_review                     38843 non-null  object  \n 12  reviews_per_month               38843 non-null  float64 \n 13  calculated_host_listings_count  48895 non-null  int64   \n 14  availability_365                48895 non-null  int64   \n 15  geometry                        48895 non-null  geometry\ndtypes: float64(3), geometry(1), int64(6), object(6)\nmemory usage: 7.3+ MB\nNone<\/pre>\n\n\n\n<p>Because it is a <code>DataFrame<\/code>, we are not able to plot a map of those points with the command <code>gdf['geometry'].plot()<\/code>. We get a <code>TypeError<\/code> when we try&#8230; In comparison, we can plot the generated <code>GeoSeries<\/code> without any problems; the <strong>GeoPandas<\/strong> engine allows us to do it and the command <code>geometry.plot()<\/code> works fine. 
We need to transform the <code>DataFrame<\/code> into a <code>GeoDataFrame<\/code>.<\/p>\n\n\n\n<p>Before that, we clean the data and remove:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Unwanted columns,<\/li><li>Rows with missing values,<\/li><li>Duplicated rows.<\/li><\/ol>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Leave only wanted columns\n\ngdf = gdf[['neighbourhood_group', 'neighbourhood',\n           'room_type', 'price', 'minimum_nights',\n           'number_of_reviews', 'reviews_per_month',\n           'calculated_host_listings_count', 'geometry']]\n\n# Drop NaNs\ngdf = gdf.dropna(axis=0)\n\n# Drop duplicates\ngdf = gdf.drop_duplicates()<\/pre>\n\n\n\n<p>The baseline dataset is nearly done! The last thing is to change the <code>DataFrame<\/code> into a <code>GeoDataFrame<\/code> and set the projection of the geometry column; then we will be ready for feature engineering. Changing a <code>DataFrame<\/code> into a <code>GeoDataFrame<\/code> is very simple: we pass the <strong>Pandas<\/strong> object to the <strong>GeoPandas<\/strong> <code>GeoDataFrame<\/code> constructor, which takes three main arguments: the input dataset, the geometry column name (or a geometry array) and the projection:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">gpd.GeoDataFrame(*args, geometry=None, crs=None, **kwargs)<\/pre>\n\n\n\n<p> After this operation we are finally able to use spatial processing methods. 
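<\/p>\n\n\n\n<p>As a side note, here is a minimal, self-contained sketch with made-up coordinates (not the AirBnB data): <strong>GeoPandas<\/strong> also ships a vectorized helper, <code>points_from_xy()<\/code>, which builds the whole geometry array from two <code>float<\/code> columns in one call:<\/p>

```python
# A toy frame with two made-up listings (illustration only)
import pandas as pd
import geopandas as gpd

toy = pd.DataFrame({
    'price': [120, 85],
    'latitude': [40.7554, 40.6782],
    'longitude': [-73.9951, -73.9442],
})

# Build the geometry array and the GeoDataFrame in one step
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy['longitude'], toy['latitude']),
    crs='EPSG:4326',
)

print(type(toy_gdf).__name__)  # GeoDataFrame
```

<p>Both routes produce the same kind of object; the <code>apply<\/code>-based function shown earlier is just more explicit about what happens to each row.<\/p>\n\n\n\n<p>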
For example, we are able to plot the geometry with the <strong>GeoPandas<\/strong> engine.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Transform dataframe into geodataframe\n\ngdf = gpd.GeoDataFrame(gdf, geometry='geometry', crs='EPSG:4326')\n# gdf.plot()  # Uncomment if you want to plot data<\/pre>\n\n\n\n<p>An extremely important step is the transformation of all projections into a single representation. We use the projection from the spatial data of the New York districts, and we are going to re-project the baseline points with the apartment coordinates. The <code>GeoDataFrame<\/code> object allows us to use the <code>.to_crs()<\/code> method for a fast and readable re-projection of the data. Before processing multiple datasets, it is very important to check whether all projections are the same; if the check returns <code>True<\/code>, we can move forward!<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Transform CRS of the baseline GeoDataFrame\n\ngdf.to_crs(crs=gdf_boroughs.crs, inplace=True)\n\n# Check if all crs are the same\n\nprint(gdf.crs == gdf_boroughs.crs == gdf_fire_divs.crs == \\\n     gdf_health_cns.crs == gdf_police_prs.crs == gdf_school_dis.crs)<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">True<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Statistical analysis of the baseline dataset<\/h2>\n\n\n\n<p>Maybe you are asking yourself why we bother. 
We have many features in the initial dataset, and perhaps they are good enough to train a model. We can check this before training by observing the basic statistical properties of the variables in relation to <code>price<\/code>, which we would like to predict. We have two simple options:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>observe the <strong>distribution of price in relation to the categorical variables<\/strong> to check whether the distributions overlap,<\/li><li>check the <strong>correlation between the numerical features and price<\/strong>.<\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Categorical features<\/h3>\n\n\n\n<p>The categorical columns are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><code>neighbourhood_group<\/code>,<\/li><li><code>neighbourhood<\/code>,<\/li><li>and <code>room_type<\/code>.<\/li><\/ul>\n\n\n\n<p>We can check the overlap between categories by analyzing boxplots &#8211; but this method is practical only for a small number of categories. An example is <code>neighbourhood_group<\/code> with five categories:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def print_nunique(series):\n    text = f'Data contains {series.nunique()} unique categories'\n    print(text)\n\n# neighbourhood_group\n\nprint_nunique(gdf['neighbourhood_group'])<\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\">Data contains 5 unique categories\n<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">plt.figure(figsize=(6, 10))\nsns.boxplot(y='neighbourhood_group', x='price',\n            data=gdf, orient='h', showfliers=False)\nplt.title('Distribution of price in relation to categorical variables: Boroughs.')<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"468\" height=\"604\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_2.png\" alt=\"\" class=\"wp-image-144\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_2.png 468w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_2-232x300.png 232w\" sizes=\"auto, (max-width: 468px) 100vw, 468px\" \/><figcaption>Figure 2: Distribution of price in relation to neighbourhood_group categories.<\/figcaption><\/figure><\/div>\n\n\n\n<p>We see that the values overlap. They are probably not good predictors of the rental price &#8211; with the exception of the Manhattan area, which has very high variance. Visual inspection may be misleading, and we should perform additional statistical tests (for example the <em>Kruskal-Wallis<\/em> one-way analysis of variance) or train multiple decision trees with different random seeds to see how each category behaves. That style of work is preferable when we compare hundreds of categories&#8230; like the <code>neighbourhood<\/code> column. 
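<\/p>\n\n\n\n<p>The Kruskal-Wallis test mentioned above can be sketched with <code>scipy.stats.kruskal()<\/code> &#8211; a toy example on synthetic prices (not the real listings); a small p-value suggests that at least one group&#8217;s price distribution differs from the others:<\/p>

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Synthetic nightly prices for three fake boroughs (illustration only)
manhattan = rng.normal(loc=200, scale=40, size=500)
bronx = rng.normal(loc=90, scale=30, size=500)
queens = rng.normal(loc=95, scale=30, size=500)

# H-test: do the samples come from the same distribution?
stat, p_value = kruskal(manhattan, bronx, queens)
print(p_value < 0.05)  # True - at least one group clearly differs
```

<p>On the real data, the groups would be the price series for each <code>neighbourhood<\/code> value.<\/p>\n\n\n\n<p>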
In our case we create a function which calculates basic statistics for each unique category, and we compare those results (try making a boxplot of the <code>neighbourhood<\/code> column&#8230;)<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># neighbourhoods\n\nprint_nunique(gdf['neighbourhood'])<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">Data contains 218 unique categories<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># There are too many values to plot them all...\n\ndef categorical_distributions(dataframe, categorical_column, numerical_column, dropna=True):\n    \"\"\"\n    Function groups data by unique categories and checks basic information about\n    the data distribution per category.\n\n    INPUT:\n\n    :param dataframe: (DataFrame or GeoDataFrame),\n    :param categorical_column: (str),\n    :param numerical_column: (str),\n    :param dropna: (bool) default=True, drops rows with NaN's as index.\n\n    OUTPUT:\n\n    :return: (DataFrame) DataFrame where the index represents a unique category\n        and the columns represent: count, mean, median, variance, 1st quantile,\n        3rd quantile, skewness and kurtosis.\n    \"\"\"\n\n    cat_df = dataframe[[categorical_column, numerical_column]].copy()\n\n    output_df = pd.DataFrame(index=cat_df[categorical_column].unique())\n\n    # Count\n    output_df['count'] = cat_df.groupby(categorical_column).count()\n    \n    # Mean\n    output_df['mean'] = cat_df.groupby(categorical_column).mean()\n\n    # Variance\n    output_df['var'] = cat_df.groupby(categorical_column).var()\n\n    # Median\n    output_df['median'] = cat_df.groupby(categorical_column).median()\n\n    # 1st quantile\n    output_df['1st_quantile'] = cat_df.groupby(categorical_column).quantile(0.25)\n\n    # 3rd quantile\n    output_df['3rd_quantile'] = cat_df.groupby(categorical_column).quantile(0.75)\n\n    # skewness\n    output_df['skew'] = cat_df.groupby(categorical_column).skew()\n\n    # kurtosis\n    try:\n        output_df['kurt'] = cat_df.groupby(categorical_column).apply(pd.Series.kurt)\n    except ValueError:\n        output_df['kurt'] = cat_df.groupby(categorical_column).apply(pd.Series.kurt)[numerical_column]\n        \n    # Dropna\n    \n    if dropna:\n        output_df = output_df[~output_df.index.isna()]\n\n    return output_df<\/pre>\n\n\n\n<p>Based on the grouped statistics we may look at individual statistical parameters, or at all of them, and check whether there is a high or low level of variability between the categories. Take the histogram of variances as an example &#8211; if it is wide and uniform, then we may assume that the categorical variable is a good predictor. 
If there are tall, grouped peaks, then it may be hard for the algorithm to make predictions based only on the categorical data.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">neigh_dists = categorical_distributions(gdf, 'neighbourhood', 'price')\n\n# Check histogram of variances\n\nplt.figure(figsize=(10, 6))\nsns.histplot(neigh_dists['var'], bins=20);\nplt.title('Histogram of price variance per category');<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"387\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_3-1.png\" alt=\"\" class=\"wp-image-161\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_3-1.png 606w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_3-1-300x192.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\" \/><figcaption>Figure 3: Histogram of price variances per neighbourhood.<\/figcaption><\/figure><\/div>\n\n\n\n<p>In our case we got something closer to the worst-case scenario. The variance of price per neighbourhood (Figure 3) has a steep peak in the low range. It has a long tail with very small counts &#8211; some categories could be used for predictions based on their dispersion of values &#8211; but there are only a few of them. At this stage we can assume that it will be hard to train a very good model. The idea of enhancing the feature space with new variables has gained a better foundation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Numerical features<\/h3>\n\n\n\n<p>How good are our numerical features? One measure of <em>goodness<\/em> may be the correlation between each numerical variable and the predicted price. A high absolute correlation means that the variable is a strong predictor. 
A correlation close to 0 tells us that there is no relationship between a feature and the predicted variable. A method to calculate the correlation between the columns of a <code>DataFrame<\/code> is provided by <strong>Pandas<\/strong>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">print(gdf.corrwith(gdf['price']))<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">price                             1.000000\nminimum_nights                    0.025506\nnumber_of_reviews                -0.035938\nreviews_per_month                -0.030608\ncalculated_host_listings_count    0.052903\ndtype: float64<\/pre>\n\n\n\n<p>Unfortunately, the <strong>correlation between price and the numerical variables is very weak<\/strong>, so these predictors are weak too. This is not what we would expect from a modeling dataset! Can we make it better?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Spatial Feature Engineering<\/h2>\n\n\n\n<p>At this stage we have a dataset with <em>AirBnB<\/em> apartment rental prices in New York City and five other datasets with spatial features. We do two things in the data engineering pipeline:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>We create <strong>new numerical features &#8211; the distance from the apartment to each borough centroid<\/strong>. We assume that these variables capture the fact that location matters in the case of rental prices.<\/li><li>We create <strong>new categorical features, one for each of: school district, fire division, police precinct and health center district<\/strong>. 
We perform <strong>label encoding<\/strong> by taking the unique id of each area as a numerical label. Then we determine within which specific district each apartment lies. The reasoning is the same as for point 1) &#8211; we use our domain knowledge and feed it into the algorithm. There is a high chance that the hidden variability in how schools, health centers, police and fire services operate is correlated with apartment prices, and the model will exploit these unseen but expected relations.<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">New numerical features &#8211; the distance from each apartment to the borough centroids<\/h3>\n\n\n\n<p>We have five boroughs in our dataset: Bronx, Brooklyn, Manhattan, Queens and Staten Island. The idea is to find the distance from the centroids of those boroughs to an apartment location. This can easily be done with <strong>GeoPandas<\/strong> methods. One is <code>.centroid<\/code>, which calculates the centroid of each geometry in a <strong>GeoDataFrame<\/strong>. The other is <code>GeoDataFrame.distance(other)<\/code>. We divide these operations into two cells in the Jupyter Notebook. 
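<\/p>\n\n\n\n<p>For intuition about the first method: the centroid of a simple polygon can be computed with a shoelace-based formula. A minimal pure-Python sketch (for illustration only &#8211; GeoPandas delegates the real computation to the GEOS library):<\/p>

```python
def polygon_centroid(vertices):
    # Shoelace-based centroid of a simple (non-self-intersecting) polygon
    a = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        a += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    a *= 0.5
    return cx / (6 * a), cy / (6 * a)

# Sanity check on a 4x4 square: the centroid is its middle
print(polygon_centroid([(0, 0), (4, 0), (4, 4), (0, 4)]))  # (2.0, 2.0)
```

\n\n\n\n<p>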
We calculate the centroids:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Get borough centroids\n\ngdf_boroughs['centroids'] = gdf_boroughs.centroid<\/pre>\n\n\n\n<p>And we estimate the distance from each apartment to each centroid:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Create n numerical columns\n# n = number of boroughs\n\nfor bname in gdf_boroughs['BoroName'].unique():\n    district_name = 'cdist_' + bname\n    cent = gdf_boroughs[gdf_boroughs['BoroName'] == bname]['centroids'].values[0]\n    gdf[district_name] = gdf['geometry'].distance(cent)<\/pre>\n\n\n\n<p>In the loop above we create a column name from the <code>cdist_<\/code> prefix and the name of a borough. Then we calculate the distance (in the units of the data&#8217;s coordinate reference system) from the borough center to each apartment location. 
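<\/p>\n\n\n\n<p><code>GeoDataFrame.distance()<\/code> computes a planar Euclidean distance, which is why the result is expressed in the units of the coordinate reference system. A minimal sketch of the underlying calculation (the coordinates below are hypothetical, not taken from the dataset):<\/p>

```python
import math

def planar_distance(p1, p2):
    # Straight-line distance in the units of the layer's coordinate system
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

# Hypothetical projected coordinates of an apartment and a borough centroid
apartment = (987000.0, 211000.0)
borough_centroid = (991000.0, 214000.0)
print(planar_distance(apartment, borough_centroid))  # 5000.0
```

\n\n\n\n<p>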
Table 2 shows a sample row of the expanded set (only the new columns and the index are visible):<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong><em>id<\/em><\/strong><\/td><td><strong>cdist_Manhattan<\/strong><\/td><td><strong>cdist_Bronx<\/strong><\/td><td><strong>cdist_Brooklyn<\/strong><\/td><td><strong>cdist_Queens<\/strong><\/td><td><strong>cdist_Staten Island<\/strong><\/td><\/tr><tr><td><em>26424667<\/em><\/td><td>4718.20<\/td><td>43721.71<\/td><td>46721.83<\/td><td>45085.04<\/td><td>86889.10<\/td><\/tr><\/tbody><\/table><figcaption>Table 2: Sample distances from one apartment to each district center.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">New categorical features &#8211; school district, fire division, police precinct and health center labels<\/h3>\n\n\n\n<p>We can go further with spatial feature engineering and use all available resources. There are many school districts and police precincts, and calculating the distance from an apartment to each of those areas would not be appropriate, because the feature space would grow with the number of unique districts. Fortunately, we can tackle this problem differently. We are going to create new categorical features in which a unique area id is assigned to each apartment. The algorithm is simple: for each district type we perform a spatial join of our core set and a new spatial set. 
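<\/p>\n\n\n\n<p>The geometric test behind such a join is point-in-polygon. A common approach is ray casting: count how many polygon edges are crossed by a horizontal ray starting at the point &#8211; an odd count means the point lies inside. A simplified pure-Python sketch (real implementations, such as the GEOS routines used by GeoPandas, also handle edge cases and use spatial indexes):<\/p>

```python
def point_in_polygon(point, polygon):
    # Ray casting: toggle on every edge crossed by a horizontal ray from the point
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y-level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))  # True
print(point_in_polygon((5, 5), square))  # False
```

\n\n\n\n<p>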
This can be done with the <code>geopandas.sjoin()<\/code> method.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Make a copy of dataframe before any join\n\next_gdf = gdf.copy()\n\n# Join Fire Divisions\n\next_gdf = gpd.sjoin(ext_gdf, gdf_fire_divs[['geometry', 'FireDiv']], how='left', op='within')\next_gdf.drop('index_right', axis=1, inplace=True)\n\n# Join Health Centers\n\next_gdf = gpd.sjoin(ext_gdf, gdf_health_cns[['geometry', 'HCentDist']], how='left', op='within')\next_gdf.drop('index_right', axis=1, inplace=True)\n\n# Join Police Precincts\n\next_gdf = gpd.sjoin(ext_gdf, gdf_police_prs[['geometry', 'Precinct']], how='left', op='within')\next_gdf.drop('index_right', axis=1, inplace=True)\n\n# Join School Districts\n\next_gdf = gpd.sjoin(ext_gdf, gdf_school_dis[['geometry', 'SchoolDist']], how='left', op='within')\next_gdf.drop('index_right', axis=1, inplace=True)\n\n# Cast the joined columns to the category dtype so they are treated as categorical\n\ncategorical_columns = ['FireDiv', 'HCentDist', 'Precinct', 'SchoolDist']\next_gdf[categorical_columns] = ext_gdf[categorical_columns].astype('category')<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Results<\/h2>\n\n\n\n<p><strong>The dataset is expanded by nine new spatial features<\/strong> &#8211; five borough-centroid distances and four area labels. Are they worse, better or (approximately) identical to the baseline?<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Distribution of categorical features<\/h3>\n\n\n\n<p>First we check the distribution of price variance per category. 
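<\/p>\n\n\n\n<p>The <code>categorical_distributions()<\/code> helper was defined earlier in the notebook. If you are following along without it, a minimal <strong>Pandas<\/strong> sketch is given below; the exact signature and returned columns are assumed here, apart from the <code>var<\/code> column that the plots rely on:<\/p>

```python
import pandas as pd

def categorical_distributions(df, cat_col, val_col):
    # Per-category dispersion statistics of the value column
    grouped = df.groupby(cat_col)[val_col]
    return pd.DataFrame({
        'count': grouped.count(),
        'mean': grouped.mean(),
        'var': grouped.var(),
    })
```

\n\n\n\n<p>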
We calculate the variances with our <code>categorical_distributions()<\/code> function and then plot them with <strong>Seaborn&#8217;s<\/strong> <code>histplot()<\/code>:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">categorical_columns = ['FireDiv', 'HCentDist', 'Precinct', 'SchoolDist']\n\nfor col in categorical_columns:\n    distribs = categorical_distributions(ext_gdf, col, 'price')\n    dist_var = distribs['var']\n    plt.figure(figsize=(10, 6))\n    sns.histplot(dist_var, bins=20);\n    plt.title(f'Histogram of price variances per category {col}');<\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_4-2.png\" alt=\"\" class=\"wp-image-163\" width=\"609\" height=\"387\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_4-2.png 609w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_4-2-300x191.png 300w\" sizes=\"auto, (max-width: 609px) 100vw, 609px\" \/><figcaption>Figure 4: Variance value counts per category &#8211; Fire Division units.<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"599\" height=\"387\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_5-1.png\" alt=\"\" class=\"wp-image-164\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_5-1.png 599w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_5-1-300x194.png 300w\" sizes=\"auto, (max-width: 599px) 100vw, 599px\" \/><figcaption>Figure 5: Variance value counts per category &#8211; Health Center units.<\/figcaption><\/figure><\/div>\n\n\n\n<div 
class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"608\" height=\"387\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_6-1.png\" alt=\"\" class=\"wp-image-165\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_6-1.png 608w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_6-1-300x191.png 300w\" sizes=\"auto, (max-width: 608px) 100vw, 608px\" \/><figcaption>Figure 6: Variance value counts per category &#8211; Police Precinct units.<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"606\" height=\"387\" src=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_7-1.png\" alt=\"\" class=\"wp-image-166\" srcset=\"https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_7-1.png 606w, https:\/\/ml-gis-service.com\/wp-content\/uploads\/2021\/03\/fig_7-1-300x192.png 300w\" sizes=\"auto, (max-width: 606px) 100vw, 606px\" \/><figcaption>Figure 7: Variance value counts per category &#8211; School District units.<\/figcaption><\/figure><\/div>\n\n\n\n<p>As we see distribution of price variance is very promising, especially for Fire Division units! In each case counts are nearly uniform and we can assume that those predictors will be better than the categories from the baseline set.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Correlation of numerical variables and price<\/h3>\n\n\n\n<p>Last step of this tutorial is to check a correlation between numerical variables and price. 
We run the <code>.corrwith()<\/code> method and check whether we get better features:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Check correlation with the numerical columns\n\next_gdf.corrwith(ext_gdf['price']).sort_values()<\/pre>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">cdist_Manhattan                  -0.146215\ncdist_Staten Island              -0.058386\nnumber_of_reviews                -0.035938\nreviews_per_month                -0.030608\ncdist_Bronx                       0.011228\ncdist_Brooklyn                    0.011878\nminimum_nights                    0.025506\ncalculated_host_listings_count    0.052903\ncdist_Queens                      0.096206\nprice                             1.000000<\/pre>\n\n\n\n<p>What we get here is really interesting: <strong>a larger distance from Manhattan is correlated with lower prices, and a larger distance from Queens is correlated with higher prices<\/strong>. This is reasonable. The distances to Manhattan, Staten Island and Queens are better predictors than the other numerical variables, but all predictors are still very weak, or merely weak in the case of the Manhattan centroid.<\/p>\n\n\n\n<p>Will the model be better with those features? Probably yes. We have used publicly available datasets and enhanced the baseline dataset with them at nearly zero cost. The next step is to use this data in real modeling and check how our models behave now!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Exercises:<\/h2>\n\n\n\n<ol class=\"wp-block-list\"><li>Train a decision tree regressor on the enhanced dataset. 
Check which features are the most important for the algorithm.<\/li><li>Prepare the spatial features as in this article and store both the newly created dataset AND the dataset without those spatial features (the baseline), where latitude and longitude are floats. Divide the indexes into train and test sets for both datasets (they must have the same index!). Train a Decision Tree regressor on both sets and compare the results.<\/li><li>As in exercise 2, but this time generate 10 realizations of the train \/ test split with a new random seed per realization, and train 10 decision trees per dataset with a new random seed per iteration. Aggregate the evaluation output &#8211; for example the Root Mean Squared Error &#8211; and compare the training statistics. (You will generate 10 random train\/test splits x 10 random decision tree models x 2 datasets =&gt; 100 models per dataset.)<\/li><\/ol>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Enhance your training set with spatial features<\/p>\n","protected":false},"author":1,"featured_media":169,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,18,3,30,31],"tags":[39,37,25,38,40,41,34],"class_list":["post-107","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-engineering","category-data-science","category-python","category-spatial-statistics","category-tutorials","tag-airbnb","tag-feature-engineering","tag-machine-learning","tag-new-york","tag-prediction","tag-price","tag-spatial-statistics"],"_links":{"self":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/107","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddab
le":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/comments?post=107"}],"version-history":[{"count":53,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/107\/revisions"}],"predecessor-version":[{"id":201,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/posts\/107\/revisions\/201"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media\/169"}],"wp:attachment":[{"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/media?parent=107"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/categories?post=107"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ml-gis-service.com\/index.php\/wp-json\/wp\/v2\/tags?post=107"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}