Data Science: Feature Engineering with Spatial Flavour

New York at Night from Space. Credits: NASA

Data Science: Feature Engineering with Spatial Flavour

March 19, 2021

Data Engineering, Data Science, Python, Spatial Statistics, Tutorials

Imagine that we are working for a real estate agency and our role is to estimate price of apartment renting in a different parts of the New York City. We have obtained multiple features:

host type,
host name,
neighborhood (district name),
latitude,
longitude,
room type,
minimum nights,
number of reviews,
last review date,
reviews per month,
calculated host listings count,
availability 365 days per year (days).

In classic machine learning approach we work through those variables and build model for prediction of a price. If our results are not satisfying we could perform feature engineering tasks to create more inputs for our model:

multiply features by themself, get logarithms of values or square them,
perform label or one-hot encoding of categorical variables,
find some patterns in reviews (sentiment analysis),
use geographical data and create spatial features.

We will consider the last example and learn how to retrieve spatial information using GeoPandas package and publicly available geographical datasets. We will learn how to:

Transform spatial data into new variables for ML models.
Use GeoPandas package along with Pandas to get spatial features from simple tabular readings.
Retrieve and process freely available spatial data.

Data and environment

Data for this article is shared in Kaggle. It’s AirBnB dataset of apartments’ renting prices in the New York city. You may download it from here: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data. Other datasets are shared by the New Your city here: https://www1.nyc.gov/site/planning/data-maps/open-data/districts-download-metadata.page. Data is also available in the blogpost repository. We will be working in Jupyter Notebook and it requires prior conda installation. Main Python packages used in this tutorial are:

pandas for data wrangling and feature engineering,
geopandas which works under the hood and allows to process spatial data along with shapely which is installed along with GeoPandas,
numpy for linear algebra,
matplotlib and seaborn for data visualization.

First we create new conda environment with desired packages. All steps are presented below. You may copy code to your terminal and everything should go well. Remember that our environment name is airbnb. You may change it in the first step to your own after -n flag.

conda create -n airbnb -c conda-forge pandas geopandas numpy matplotlib seaborn notebook

Then activate environment:

conda activate airbnb

That’s all! Run new notebook with command:

jupyter notebook

Data Exploration

Before we do any processing we take a quick look into our datasets. Core dataset is provided by airbnb. We import all packages for further processing at once:

import numpy as np
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import mstats
from shapely.geometry import Point

Then we set project constants. In our case those are paths to the files and names of special columns:

# Set project constants

baseline_dataset = 'data/AB_NYC_2019.csv'
baseline_id_column = 'id'

boroughs = 'data/nybbwi_20d/nybbwi.shp'
fire_divs = 'data/nyfd_20d/nyfd.shp'
health_cns = 'data/nyhc_20d/nyhc.shp'
police_prs = 'data/nypp_20d/nypp.shp'
school_dis = 'data/nysd_20d/nysd.shp'

The baseline of our work is dataset without any spatial features. Sure, baseline_dataset contains latitude and longitude columns but they are not treated as a single object of a type Point. Instead they’re two floats. This dataset contains hidden spatial information. Columns neighbourhood_group and neighbourhood describe spatial features – administrative units – by their names. The problem is that we don’t know which areas are close to each other, how big they are, if they are close to the specific objects like rivers or sea and so on…

The latitude / longitude description is a data engineering problem and we solve it with simple transformation. Why do we bother? Because spatial computations and spatial joins are much easier with spatial packages where data is stored as a special object named Point. It will be much easier to calculate distance from each apartment to a borough center or to discover within which district is an apartment placed.

Initial step of data exploration is to look into it. We read data with pandas read_csv() method and show sample() of rows from it:

# Read NYC airbnb data and take a look into it

df = pd.read_csv(baseline_dataset,
                 index_col=baseline_id_column)
print(df.sample(3))

Algorithm takes random sample of three rows and notebook shows it for us. Our sample may looks like this:

id	neighbourhood	…	latitude	longitude	price
34228220	Hell’s Kitchen	…	40.75548	-73.99513	225

Table 1: Sample from baseline DataFrame.

We can use DataFrame.info() method to get better insights what’s going on with data, or DataFrame.describe() to access basic statistical properties of numerical variables. We skip those steps here but article notebook has cells with those methods for the sake of completeness. We then load all spatial datasets and check their properties and perform visual inspection. Shapefiles are read by GeoPandas by geopandas.read_file() method:

# Spatial data sources

gdf_boroughs = gpd.read_file(boroughs)
gdf_fire_divs = gpd.read_file(fire_divs)
gdf_police_prs = gpd.read_file(police_prs)
gdf_school_dis = gpd.read_file(school_dis)
gdf_health_cns = gpd.read_file(health_cns)

When we are working with spatial data it is always a good idea to look into plotted data and check its Coordinate Reference System (CRS). As example for Health Centers districts:

print(gdf_health_cns.info())
print(gdf_health_cns.crs)
base = gdf_boroughs.plot(color='white', edgecolor='black', figsize=(12, 12))
gdf_health_cns.plot(ax=base, color='lightgray', edgecolor='darkgray');
plt.title('School Districts within main boroughs in the New York City.');

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   BoroCode    30 non-null     int64   
 1   BoroName    30 non-null     object  
 2   HCentDist   30 non-null     int64   
 3   Shape_Leng  30 non-null     float64 
 4   Shape_Area  30 non-null     float64 
 5   geometry    30 non-null     geometry
dtypes: float64(2), geometry(1), int64(2), object(1)
memory usage: 1.5+ KB
None
epsg:2263

Most important facts about our dataset are that it has:

30 entries (30 school districts),
one column of type geometry which is encoded in the EPSG:2263 projection which is designed for the New York State Plane Long Island Zone.

We can plot dataset as a polygon representing administrative boundaries (Figure 1) with GeoPandas geometry.plot() method.

Figure 1: Spatial data with school districts in the New York City.

At this moment everything seems to be ok. Thing to remember is that our baseline dataset is not spatial at all and it’s CRS is different than CRS of the read shapefiles. Datasets with latitude / longitude angles are described by EPSG:4326 projection in the most cases. This is projection used by GPS satellites and this is the reason why it is so popular.

Transformation from simple table into spatial dataset

Our dataset has missing values but before we check duplicates and NaNs we will transform geometry from two floats into one Point object. From the engineering point of view we take DataFrame and transform it to the GeoDataFrame. In practice we must create column with geometry attribute and set its projection. Probably most of datasets with point measurements are described by EPSG:4326 projection. But be careful with this hypothesis and always to find and check metadata because you may get strange results if you assign wrong projection to your data.

Before we drop unwanted columns and/or rows with missing values and duplicates we are going to create Points from latitude and longitude columns. Point is a basic data type from shapely package which is used under the hood of Geopandas. Transformation of pair of values into Point is simple. We pass two values to a constructor of Point. As example:

from shapely.geometry import Point

lat = 51.5074
lon = 0.1278

pair = [lon, lat]  # first lon (x) then lat (y)
point = Point(pair)

Then we have to perform this transformation at a scale. We use lambda expression for it and pass into it lat/lon columns to transform them into point. Function below is one solution for our problem:

# Transform geometry

def lat_lon_to_point(dataframe, lon_col='longitude', lat_col='latitude'):
    """Function transform longitude and latitude coordinates into GeoSeries.
    
    INPUT:
    
    :param dataframe: DataFrame to be transformed,
    :param lon_col: (str) longitude column name, default is 'longitude',
    :param lat_col: (str) latitude column name, default is 'latitude'.
    
    OUTPUT:
    
    :return: (GeoPandas GeoSeries object)
    """

    geometry = dataframe.apply(lambda x: Point([x[lon_col], x[lat_col]]), axis=1)
    geoseries = gpd.GeoSeries(geometry)
    geoseries.name = 'geometry'
    
    return geoseries

Function returns GeoSeries object. It is similar to Series known from Pandas but it has special properties designed for spatial datasets. GeoSeries has own attributes and methods and it can be compared to Time Series from Pandas. In our case it is just a list of Points. At this moment we can easily append it to our baseline DataFrame.

geometry = lat_lon_to_point(df)
gdf = df.join(geometry)

If we check info about our new dataframe gdf we see that geometry column has special Dtype, even if our gdf is still a classic DataFrame from Pandas!

print(gdf.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48895 entries, 2539 to 36487245
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype   
---  ------                          --------------  -----   
 0   name                            48879 non-null  object  
 1   host_id                         48895 non-null  int64   
 2   host_name                       48874 non-null  object  
 3   neighbourhood_group             48895 non-null  object  
 4   neighbourhood                   48895 non-null  object  
 5   latitude                        48895 non-null  float64 
 6   longitude                       48895 non-null  float64 
 7   room_type                       48895 non-null  object  
 8   price                           48895 non-null  int64   
 9   minimum_nights                  48895 non-null  int64   
 10  number_of_reviews               48895 non-null  int64   
 11  last_review                     38843 non-null  object  
 12  reviews_per_month               38843 non-null  float64 
 13  calculated_host_listings_count  48895 non-null  int64   
 14  availability_365                48895 non-null  int64   
 15  geometry                        48895 non-null  geometry
dtypes: float64(3), geometry(1), int64(6), object(6)
memory usage: 7.3+ MB
None

Due to the fact that it is DataFrame we are not able to plot map of those points with command gdf['geometry'].plot(). We get TypeError when we try this… In comparison we can plot generated GeoSeries without any problems, GeoPandas engine allows us to do it and command geometry.plot() works fine. We need to transform DataFrame into GeoDataFrame.

Before that we clean data and remove:

Unwanted columns,
Rows with missing values,
Rows with duplicates.

# Leave only wanted columns

gdf = gdf[['neighbourhood_group', 'neighbourhood',
           'room_type', 'price', 'minimum_nights',
           'number_of_reviews','reviews_per_month',
           'calculated_host_listings_count', 'geometry']]

# Drop NaNs
gdf = gdf.dropna(axis=0)

# Drop duplicates
gdf = gdf.drop_duplicates()

Baseline dataset is nearly done! One last thing is to change DataFrame into GeoDataFrame and set projection of the geometry column and we will be ready for the feature engineering. Change DataFrame into GeoDataFrame is very simple. We assign Pandas object into GeoPandas GeoDataFrame. Constructor takes three main arguments: input dataset, geometry column name or geometry array and projection:

gpd.GeoDataFrame(*args, geometry=None, crs=None, **kwargs)

After this operation we are finally able to use spatial processing methods. As example we will be able to plot geometry with GeoPandas engine.

# Transform dataframe into geodataframe

gdf = gpd.GeoDataFrame(gdf, geometry='geometry', crs='EPSG:4326')
# gdf.plot()  # Uncomment if you want to plot data

Extremely important thing to do is a transformation of all projections to a single representation. We use projection from spatial data of New York districts and we are going to transform baseline points with apartments coordinates. GeoDataFrame object allows us to use .to_crs() method to make fast and readable re-projection of data. Very important step before processing of multiple datasets is to check if all projections are the same and if it is True then we can move forward!

# Transform CRS of the baseline GeoDataFrame

gdf.to_crs(crs=gdf_boroughs.crs, inplace=True)

# Check if all crs are the same

print(gdf.crs == gdf_boroughs.crs == gdf_fire_divs.crs == \
     gdf_health_cns.crs == gdf_police_prs.crs == gdf_school_dis.crs)

True

Statistical analysis of the baseline dataset

Maybe you are asking yourself why do we bother? We have many features provided in the initial dataset and probably they are good enough to train a model. We can check it before training by observation of basic statistical properties of variables in relation to price which we would like to predict. We have two simple options:

observe distribution of price in relation to categorical variables to check if it overlaps frequently,
check correlation value between numerical features and price.

Categorical features

Categorical columns are:

neighbourhood_group,
neighbourhood,
and room_type.

We can check overlapping between features by analysis of boxplots – but this method is valid only for small datasets. Example is neighbourhood_group with five categories:

def print_nunique(series):
    text = f'Data contains {series.nunique()} unique categories'
    print(text)

# neighbourhood_group

print_nunique(gdf['neighbourhood_group'])

Data contains 5 unique categories

plt.figure(figsize=(6, 10))
sns.boxplot(y='neighbourhood_group', x='price',
            data=gdf, orient='h', showfliers=False)
plt.title('Distribution of price in relation to categorical variables: Boroughs.')

Figure 2: Distribution of price in relation to neighbourhood_group categories.

We see that values overlap. They are probably not a good predictors of a renting price – with exclusion of Manhattan area with very high variance. Visual inspection may be misleading and we should perform additional statistical tests (as example Kruskal-Wallis one-way analysis of variance) or train our data on multiple decision trees with different random seeds to see how each category behaves. It is more desired style of work when we compare hundreds of categories… like neighbourhood column. In our case we create function which calculates basic statistics for each unique category and we compare those results (try to make a boxplot of neighbourhood column…)

# neighbourhoods

print_nunique(gdf['neighbourhood'])

Data contains 218 unique categories

# There are too many values to plot them all...

def categorical_distributions(dataframe, categorical_column, numerical_column, dropna=True):
    """
    Function groups data by unique categories and check basic information about
    data distribution per category.

    INPUT:

    :param dataframe: (DataFrame or GeoDataFrame),
    :param categorical_column: (str),
    :param numerical_column: (str),
    :param dropna: (bool) default=True, drops rows with NaN's as index.

    OUTPUT:

    :return: (DataFrame) DataFrame where index represents unique category
        and columns represents: count, mean, median, variance, 1st quantile,
        3rd quantile, skewness and kurtosis.
    """

    cat_df = dataframe[[categorical_column, numerical_column]].copy()

    output_df = pd.DataFrame(index=cat_df[categorical_column].unique())

    # Count
    output_df['count'] = cat_df.groupby(categorical_column).count()
    
    # Mean
    output_df['mean'] = cat_df.groupby(categorical_column).mean()

    # Variance
    output_df['var'] = cat_df.groupby(categorical_column).std()

    # Median
    output_df['median'] = cat_df.groupby(categorical_column).median()

    # 1st quantile
    output_df['1st_quantile'] = cat_df.groupby(categorical_column).quantile(0.25)

    # 3rd quantile
    output_df['3rd_quantile'] = cat_df.groupby(categorical_column).quantile(0.75)

    # skewness
    output_df['skew'] = cat_df.groupby(categorical_column).skew()

    # kurtosis
    try:
        output_df['kurt'] = cat_df.groupby(categorical_column).apply(pd.Series.kurt)
    except ValueError:
        output_df['kurt'] = cat_df.groupby(categorical_column).apply(pd.Series.kurt)[numerical_column]
        
    # Dropna
    
    if dropna:
        output_df = output_df[~output_df.index.isna()]

    return output_df

Based on the grouped statistic we may look into different statistical parameters alone or at all of them and check if there is a high or low level of variability between each category. As example histogram of variance – if it is wide and uniform then we may assume that categorical variables are good predictors. If there are tall and grouped peaks then it could be hard for algorithm to make predictions based only on categorical data.

neigh_dists = categorical_distributions(gdf, 'neighbourhood', 'price')

# Check histogram of variances

plt.figure(figsize=(10, 6))
sns.histplot(neigh_dists['var'], bins=20);
plt.title('Histogram of price variance per category');

Figure 3: Histogram of price variances per borough.

In our case we got something closer to the worst case scenario. Variance of price per neighborhood (Figure 3) has steep peak around low range counts. It has long tail with very small number of counts – some categories could be used for predictions based on their dispersion of values – but there are only few of them. At this stage we can assume that it’ll be hard to train a very good model. Idea of enhancing feature space with new variables has gained better foundation.

Numerical features

How good are our numerical features? One measure of goodness may be correlation between each numerical variable and predicted price. High absolute correlation means that variable is a strong predictor. Correlation close to 0 informs us that there is not relationship between feature and a predicted variable. Method to calculate correlation between columns of DataFrame is provided with Pandas:

print(gdf.corrwith(gdf['price']))

price                             1.000000
minimum_nights                    0.025506
number_of_reviews                -0.035938
reviews_per_month                -0.030608
calculated_host_listings_count    0.052903
dtype: float64

Unfortunately predictors are weak and correlation between price and numerical variables is very weak. This is not something what we would expect from the modeling dataset! Can we make it better?

Spatial Feature Engineering

At this stage we have dataset with AirBnB apartment renting prices in the New York City and five other datasets with spatial features. We do two things in a data engineering pipeline:

We create a new numerical feature – distance to each borough centroid from the apartment. We assume that this variable covers the fact that location matters in the case of renting prices.
We create a new categorical features, each for each school district, fire division, police precinct and health center. We perform label encoding with numerical values by taking unique id of each area. Then we calculate within which specific district is our point. The reasoning is the same as for point 1) – we use our knowledge and feed it into algorithm. There is a high chance that hidden variability within schools, health centers, police and fire fighters work is correlated with prices of apartments and model will use this unseen but expected relations.

New numerical features – a distance to borough centroid from each apartment

We have five districts in our dataset: Bronx, Brooklyn, Manhattan, Queens, Staten Island. The idea is to find distance from the centroids of those boroughs to an apartment location. This could be very easily done with GeoPandas methods. One is .centroid which calculates centroid of each geometry in a GeoDataFrame. Other is GeoDataFrame.distance(other). We divide this operations into two cells in Jupyter Notebook. We calculate centroids:

# Get borough centroids

gdf_boroughs['centroids'] = gdf_boroughs.centroid

And we estimate distance to each centroid from each apartment:

# Create n numerical columns
# n = number of boroughs

for bname in gdf_boroughs['BoroName'].unique():
    district_name = 'cdist_' + bname
    cent = gdf_boroughs[gdf_boroughs['BoroName'] == bname]['centroids'].values[0]
    gdf[district_name] = gdf['geometry'].distance(cent)

In the loop above we create a column name with cdist_ prefix and name of a borough. Then we calculate distance (in data units) from borough center to an apartment location. Table 2 is a sample row of the expanded set (only new columns and index are visible):

id	cdist_Manhattan	cdist_Bronx	cdist_Brooklyn	cdist_Queens	cdist_Staten Island
26424667	4718.20	43721.71	46721.83	45085.04	86889.10

Table 2: Sample distances from one apartment to each district center.

New categorical features – school districts, fire divisions, police precincts and health centers labels

We should go further with spatial feature engineering and we are going to use all available resources. There are multiple school districts or police precincts and calculations of distance from apartment to each of those areas would be not appriopriate because feature space will grow with number of unique districts. Fortunately we can tackle this problem differently. We are going to create new categorical features where we assign unique area id to each apartment. Algorithm is simple. For each district type we perform spatial join of our core set and a new spatial set. It can be done with geopandas.sjoin() method.

# Make a copy of dataframe before any join

ext_gdf = gdf.copy()

# Join Fire Divisions

ext_gdf = gpd.sjoin(ext_gdf, gdf_fire_divs[['geometry', 'FireDiv']], how='left', op='within')
ext_gdf.drop('index_right', axis=1, inplace=True)

# Join Health Centers

ext_gdf = gpd.sjoin(ext_gdf, gdf_health_cns[['geometry', 'HCentDist']], how='left', op='within')
ext_gdf.drop('index_right', axis=1, inplace=True)

# Join Police Precincts

ext_gdf = gpd.sjoin(ext_gdf, gdf_police_prs[['geometry', 'Precinct']], how='left', op='within')
ext_gdf.drop('index_right', axis=1, inplace=True)

# Join School Districts

ext_gdf = gpd.sjoin(ext_gdf, gdf_school_dis[['geometry', 'SchoolDist']], how='left', op='within')
ext_gdf.drop('index_right', axis=1, inplace=True)

# We have to transform categorical columns to be sure that we don't get strange results

categorical_columns = ['FireDiv', 'HCentDist', 'Precinct', 'SchoolDist']
ext_gdf[categorical_columns] = ext_gdf[categorical_columns].astype('category')

Results

Dataset is expanded by nine (9.) new spatial features. Are they worse, better or (approximately) identical to the baseline?

Distribution of categorical features

First we check distribution of price variance per category. We must calculate this with our categorical_distributions() function and then plot them with Seaborn’s histplot().

categorical_columns = ['FireDiv', 'HCentDist', 'Precinct', 'SchoolDist']

for col in categorical_columns:
    distribs = categorical_distributions(ext_gdf, col, 'price')
    dist_var = distribs['var']
    plt.figure(figsize=(10, 6))
    sns.histplot(dist_var, bins=20);
    plt.title(f'Histogram of price variances per category {col}');

Figure 4: Variance value counts per category – Fire Division units.

Figure 5: Variance value counts per category – Health Center units.

Figure 6: Variance value counts per category – Police Precinct units.

Figure 7: Variance value counts per category – School District units.

As we see distribution of price variance is very promising, especially for Fire Division units! In each case counts are nearly uniform and we can assume that those predictors will be better than the categories from the baseline set.

Correlation of numerical variables and price

Last step of this tutorial is to check a correlation between numerical variables and price. We run .corrwith() method and look if we get better features:

# Check corr with numerical

ext_gdf.corrwith(ext_gdf['price']).sort_values()

cdist_Manhattan                  -0.146215
cdist_Staten Island              -0.058386
number_of_reviews                -0.035938
reviews_per_month                -0.030608
cdist_Bronx                       0.011228
cdist_Brooklyn                    0.011878
minimum_nights                    0.025506
calculated_host_listings_count    0.052903
cdist_Queens                      0.096206
price                             1.000000

What we get here is really interesting: larger distance from Manhattan is correlated with lower prices and larger distance to Queens is correlated with rising prices. It is reasonable. Distances to Manhattan, Staten Island and Queens are better predictors than for the other numeric values but all predictors are very weak or weak (Manhattan centroid).

Does model be better with those features? Probably yes. We have used publicly available datasets and enhanced baseline dataset with them at nearly zero cost. The next step is to use this data in a real modeling and check how our models behave now!

Exercises:

Train decision tree regressor on the enhanced dataset. Check which features are the most important for the algorithm.
Prepare spatial features as in this article and store newly created dataset AND dataset without those spatial features (baseline) where latitude and longitude are floats. Divide indexes into train and test set for both of datasets (they must have the same index!). Train Decision Tree regressor on both sets and compare results.
As the number 2. but this time generate 10 realizations of train / test set with the new random seed per each realization and train 10 decision trees per dataset with the new random seed per iteration. Aggregate evaluation output – as example Root Mean Squared Error and compare training statistics. (You will generate 10 random train/test splits x 10 random decision tree models x 2 datasets => 100 models per dataset).