How should you evaluate session-based recommendations?

October 22, 2023

Data Science, e-commerce, Machine Learning, Python, Recommendation Engine, WSKNN

Session-based recommendation engine in Python
part 2

Previous parts

Which movie should you recommend next?

Introduction

Measuring model performance can be tricky even if you work with a single output, but what should you do if the model creates a sequence of items? A session-based recommender may return a single recommendation, but it will likely be irrelevant to a user. That is why almost all recommendations are generated as a sequence. The reason is simple: among, say, five recommendations, at least one product should be relevant.

Recommendation systems metrics

The basic metrics for recommender systems are Precision, Recall, and Mean Reciprocal Rank (MRR). Precision and Recall are known from classifiers. MRR is a metric designed explicitly for ranked, sequential outputs. If you want to underline that those metrics are used for recommendations, append the @k suffix, where k is the number of items returned. Precision@5 is the precision calculated over the top five recommendations, and MRR@20 means that the top twenty recommendations are evaluated.

Example

A recommender created these outputs:

r1: [banana, cherry, tomato, avocado, strawberry]
r2: [mango, banana, apple, blueberry, lemon]
r3: [apple, watermelon, orange, pear, cherry]

And the real fruits bought later by customers are:

c1: [avocado, tomato, lemon, orange]
c2: [mango, grapes, watermelon, coconut, papaya, pineapple]
c3: [apple, orange, cherry, banana, kiwi, grapefruit, banana, coconut, blueberry]

With this information it is possible to calculate the session-based metrics.

Precision

How many relevant items are present in the top-k recommendations? In other words, precision is the ratio of relevant recommended items to all recommended items:

$Precision@k = RelevantRecommendations / AllRecommendations$

Based on the examples, Precision@5 is:

Prec1: 0.4
Prec2: 0.2
Prec3: 0.6

Usually, you are not interested in a single reading but in the average, so the final Precision@5 is equal to 0.4.

Important! Precision does not depend on the position of relevant items in a sequence. It only tells you what share of the recommended sequence is relevant. Think about it this way: what will Precision@inf be when the recommender returns all products or all movies from a database?
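
The calculation above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name precision_at_k is hypothetical, and it is not the implementation used inside WSKNN. It divides by the number of items actually returned, which for full lists is the same as dividing by k.

```python
from typing import Sequence


def precision_at_k(recommended: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of the top-k recommendations that the user later picked."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for item in top_k if item in relevant_set)
    return hits / len(top_k)


# Recommendations and purchases copied from the fruit example
recommendations = [
    ["banana", "cherry", "tomato", "avocado", "strawberry"],
    ["mango", "banana", "apple", "blueberry", "lemon"],
    ["apple", "watermelon", "orange", "pear", "cherry"],
]
purchases = [
    ["avocado", "tomato", "lemon", "orange"],
    ["mango", "grapes", "watermelon", "coconut", "papaya", "pineapple"],
    ["apple", "orange", "cherry", "banana", "kiwi", "grapefruit", "banana", "coconut", "blueberry"],
]

precisions = [precision_at_k(r, c, k=5) for r, c in zip(recommendations, purchases)]
mean_precision = sum(precisions) / len(precisions)
print(precisions, round(mean_precision, 2))  # [0.4, 0.2, 0.6] 0.4
```

Shuffling the items inside any recommendation list leaves the result unchanged, which is exactly the position-independence described above.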

Recall

How many of all the items relevant to a user are returned? Recall is the ratio of relevant recommended items to ALL relevant items:

$Recall@k = RelevantRecommendations / AllRelevantItems$

Based on examples:

Rec1: 0.5
Rec2: 0.17
Rec3: 0.33 (3/9, counting the repeated banana in c3 as two purchases)

The average Recall@5 is equal to 0.33.

Important! Recall does not depend on the position of relevant items in a sequence. Because a recommendation has a limited number of items, Recall may never get close to 1. Why? Consider a scenario: you set your system to return five recommendations, but the average customer buys ten or more products. Even if all five items are relevant, there are still at least five products that the recommender did not show. Thus, the maximum recall will be 0.5 at most. This metric may not be useful with a large product space.
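
A minimal sketch of the same calculation in Python, together with the upper bound discussed in the scenario above. The names recall_at_k and max_recall_at_k are hypothetical helpers, not WSKNN functions. The sketch assumes that every purchase counts toward the denominator, including repeated ones; deduplicating the purchases first would change the third result to 3/8.

```python
from typing import Sequence


def recall_at_k(recommended: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Share of the user's relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for item in list(recommended)[:k] if item in relevant_set)
    # Assumption: repeated purchases are counted separately in the denominator
    return hits / len(relevant)


def max_recall_at_k(k: int, n_relevant: int) -> float:
    """Best recall reachable when only k items can be shown."""
    return min(k, n_relevant) / n_relevant


recommendations = [
    ["banana", "cherry", "tomato", "avocado", "strawberry"],
    ["mango", "banana", "apple", "blueberry", "lemon"],
    ["apple", "watermelon", "orange", "pear", "cherry"],
]
purchases = [
    ["avocado", "tomato", "lemon", "orange"],
    ["mango", "grapes", "watermelon", "coconut", "papaya", "pineapple"],
    ["apple", "orange", "cherry", "banana", "kiwi", "grapefruit", "banana", "coconut", "blueberry"],
]

recalls = [recall_at_k(r, c, k=5) for r, c in zip(recommendations, purchases)]
mean_recall = sum(recalls) / len(recalls)
print([round(r, 2) for r in recalls], round(mean_recall, 2))  # [0.5, 0.17, 0.33] 0.33

# Five recommendations shown to a customer who buys ten products
print(max_recall_at_k(k=5, n_relevant=10))  # 0.5
```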

Mean Reciprocal Rank

This metric is positional. It tells how early the first relevant product occurs in a recommended sequence. Reciprocal Rank is calculated as the inverse of the position of the first relevant item in a sequence, and the mean is taken over multiple tested sequences from user actions.
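
In the notation of the formulas above, the definition can be written as follows. The symbols are introduced here only for illustration: $N$ is the number of tested sequences and $rank_i$ is the position of the first relevant item within the top k recommendations of the i-th sequence.

$MRR@k = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}$

A sequence with no relevant item among its top k recommendations is usually assigned a reciprocal rank of 0.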

Based on examples:

RR1: 0.33 (1/3, the first relevant item is at position 3)
RR2: 1 (1/1)
RR3: 1 (1/1)

The mean of these three reciprocal ranks gives MRR@5 of about 0.78.

Important! Only the first occurrence is counted. MRR doesn’t count all relevant items from the recommendation. A researcher might consider using Precision@k with MRR@k to cover more system properties.
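
The sketch below applies the standard definition (one divided by the position of the first relevant item) to the fruit example. The function name reciprocal_rank is hypothetical, and returning 0.0 when nothing relevant is found is a common convention assumed here rather than something taken from WSKNN.

```python
from typing import Sequence


def reciprocal_rank(recommended: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """1 / position of the first relevant item in the top-k list, or 0.0 if there is none."""
    relevant_set = set(relevant)
    for position, item in enumerate(list(recommended)[:k], start=1):
        if item in relevant_set:
            # Only the first hit matters; later relevant items are ignored
            return 1.0 / position
    return 0.0


recommendations = [
    ["banana", "cherry", "tomato", "avocado", "strawberry"],
    ["mango", "banana", "apple", "blueberry", "lemon"],
    ["apple", "watermelon", "orange", "pear", "cherry"],
]
purchases = [
    ["avocado", "tomato", "lemon", "orange"],
    ["mango", "grapes", "watermelon", "coconut", "papaya", "pineapple"],
    ["apple", "orange", "cherry", "banana", "kiwi", "grapefruit", "banana", "coconut", "blueberry"],
]

reciprocal_ranks = [reciprocal_rank(r, c, k=5) for r, c in zip(recommendations, purchases)]
mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
print([round(rr, 2) for rr in reciprocal_ranks], round(mrr, 2))  # [0.33, 1.0, 1.0] 0.78
```

The third list contains three relevant fruits, yet it scores the same as the second list with only one, because both start with a relevant item.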

Scoring recommendations

Scoring recommendations is fairly easy when you use the built-in functions of session-based recommendation packages. The WSKNN package has the score_model() function, which calculates MRR, Precision, and Recall.

As you may recall from the previous blog post, a fitted model takes multiple parameters. The number of recommendations is usually fixed because it is imposed by business logic. Most of the time, we control:

number of the closest neighbors
the possible neighbors sampling strategy from ['common_items', 'recent', 'random']
the possible neighbors sample size
session weighting strategy from ['linear', 'log', 'quadratic']
session’s items ranking strategy from ['inv', 'linear', 'log', 'quadratic']

Thus, there are five parameters: two are numbers, and the other three take fixed values that relate to the sampling and ranking logic. It is not easy to pick the best set of parameters for a model without evaluation. In this guide, you will learn how to do it.

Step-by-step Coding Guide

If you have not done it before, set up your environment by following steps 1 to 3. If you have, continue from step 4.

Step 1

Download the MovieLens dataset (MovieLens 100k). You can get data from the tutorial’s repository here: [1].

Step 2

Create a mamba environment or a virtual environment.

mamba create -n movie-recommender python="3.10"

Step 3

Activate the environment, install pip and notebook from mamba, and then install wsknn from pip.

mamba activate movie-recommender
(movie-recommender) mamba install pip notebook
(movie-recommender) pip install wsknn

Step 4

Open Jupyter Notebook and create a new Python3 notebook.

Step 5

In the first cell, import the required packages and functions.

from typing import Dict, List, Union

import numpy as np
import pandas as pd
from tqdm import tqdm

from wsknn import fit
from wsknn.evaluate import score_model
from wsknn.preprocessing.parse_static import parse_flat_file

The typing package’s objects are used for type hinting in custom functions. NumPy and pandas are packages for data transformations, and tqdm shows a progress bar when multiple models are tested and scored.

The WSKNN functions fit() and parse_flat_file() were covered in the last article. The core function here is score_model(). It takes the trained model and a validation dataset and calculates MRR, Precision, and Recall. But first, you need to prepare the data.

Step 6

Read and prepare training and validation datasets.

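def train_validate_samples(set_of_sessions):
    """Randomly move 10% of sessions into a validation list; the rest stay as training data."""
    sessions_keys = list(set_of_sessions.keys())
    n_sessions = int(0.1 * len(sessions_keys))

    # replace=False prevents the same session from being drawn twice
    key_sample = np.random.choice(sessions_keys, n_sessions, replace=False)
    # tolist() restores plain Python keys; the set makes membership checks fast
    validation_keys = set(key_sample.tolist())

    training_set = {_key: set_of_sessions[_key] for _key in sessions_keys if _key not in validation_keys}
    validation_set = [set_of_sessions[_key] for _key in validation_keys]

    return training_set, validation_set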


fpath = 'ml-100k/u.data'
ds = parse_flat_file(fpath, sep='\t', session_index=0, product_index=1, time_index=3, time_to_numeric=True)

training_ds, validation_ds = train_validate_samples(ds[1].session_items_actions_map)
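
Because the split is random, every run of the notebook scores the models on a different validation set. If you want comparable results between runs, one option is a seeded split, sketched below. The helper split_sessions, its parameters, and the toy sessions are hypothetical and serve only as an illustration; the nested-list layout of a session is an assumption and not necessarily the exact structure produced by parse_flat_file().

```python
import numpy as np


def split_sessions(sessions: dict, validation_fraction: float = 0.1, seed: int = 42):
    """Reproducible split: a training dict and a validation list without shared sessions."""
    rng = np.random.default_rng(seed)
    keys = list(sessions.keys())
    n_validation = int(validation_fraction * len(keys))

    # Sample positions (without replacement) so the original key types are preserved
    picked = rng.choice(len(keys), size=n_validation, replace=False)
    validation_keys = [keys[i] for i in picked]
    validation_lookup = set(validation_keys)

    training = {key: value for key, value in sessions.items() if key not in validation_lookup}
    validation = [sessions[key] for key in validation_keys]
    return training, validation


# Toy data: 50 sessions, each with a list of items and a list of timestamps (assumed layout)
toy_sessions = {f"user_{i}": [[i, i + 1], [1000 + i, 1001 + i]] for i in range(50)}

train, valid = split_sessions(toy_sessions)
print(len(train), len(valid))  # 45 5
```

With a fixed seed, two calls return the same validation sessions, so score differences between parameter sets come from the parameters and not from the sampling.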

Step 7

The most convenient method of performing experiments and storing results is within a Python class. A basic implementation can be:

class TestModels:

    def __init__(self, training_set: Dict, test_set: List, psets: List):
        self.training_set = training_set
        self.test_set = test_set
        self.psets = psets
        self.scoring_results = self.get_scoring()

    def get_scoring(self):
        """
        Method scores multiple different models
        """
        scorings = []
        for params in tqdm(self.psets):
            model = fit(sessions=self.training_set, **params)
            scores = score_model(sessions=self.test_set, trained_model=model, k=5)
            scores.update(params)
            scorings.append(scores)

        scoring_results = pd.DataFrame(scorings)
        return scoring_results

    def scores(self):
        return self.scoring_results

Next, you can test the class implementation with a small set of models. Define a list of two or three parameter sets and pass it to the TestModels class along with the training and test datasets. Here are dictionaries with fixed parameters:

# Neighbors are most recent sessions
# Items are weighted and ranked by log function - newest items in the session are most important

parameter_set_recent_log_log = {
    'number_of_recommendations': 5,
    'number_of_neighbors': 10,
    'sampling_strategy': 'recent',
    'sample_size': 50,
    'weighting_func': 'log',
    'ranking_strategy': 'log',
    'return_events_from_session': False,
    'recommend_any': False
}

# Neighbors are sampled based on the common items
# Items are weighted and ranked by linear function

parameter_set_common_lin_lin = {
    'number_of_recommendations': 5,
    'number_of_neighbors': 10,
    'sampling_strategy': 'common_items',
    'sample_size': 50,
    'weighting_func': 'linear',
    'ranking_strategy': 'linear',
    'return_events_from_session': False,
    'recommend_any': False
}

# Neighbors are sampled randomly
# Items are weighted by log function and then ranked by their inverted position in a sequence (1/i)

parameter_set_random_log_inv = {
    'number_of_recommendations': 5,
    'number_of_neighbors': 10,
    'sampling_strategy': 'random',
    'sample_size': 50,
    'weighting_func': 'log',
    'ranking_strategy': 'inv',
    'return_events_from_session': False,
    'recommend_any': False
}

To get the scores, pass those dictionaries into a TestModels instance:

scorer = TestModels(training_ds,
                   validation_ds,
                   [
                       parameter_set_recent_log_log,
                       parameter_set_common_lin_lin,
                       parameter_set_random_log_inv
                   ])

df = scorer.scores()

print(df.head())

   MRR       Precision  Recall    sampling_strategy  weighting_func  ranking_strategy
0  0.796099  0.610638   0.048660  recent             log             log
1  0.714184  0.497872   0.040337  common_items       linear          linear
2  0.751241  0.623404   0.052864  random             log             inv

Results of the initial class check.

Columns number_of_recommendations, number_of_neighbors, sample_size, return_events_from_session, and recommend_any are hidden in the table above because those parameters are fixed. Scoring differences are mostly noticeable when you control the sampling_strategy, weighting_func, and ranking_strategy parameters. The class works as expected, so a bigger parameter space can be analyzed.
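
One way to hide fixed columns automatically is to keep only the columns that take more than one distinct value. The sketch below rebuilds the three scored rows by hand, with the fixed parameters taken from the dictionaries above, so that it runs without fitting any model. The variable names results and varying are illustrative.

```python
import pandas as pd

# Scores and parameters of the three configurations from the initial class check
results = pd.DataFrame({
    "MRR": [0.796099, 0.714184, 0.751241],
    "Precision": [0.610638, 0.497872, 0.623404],
    "Recall": [0.048660, 0.040337, 0.052864],
    "number_of_recommendations": [5, 5, 5],
    "number_of_neighbors": [10, 10, 10],
    "sampling_strategy": ["recent", "common_items", "random"],
    "sample_size": [50, 50, 50],
    "weighting_func": ["log", "linear", "log"],
    "ranking_strategy": ["log", "linear", "inv"],
    "return_events_from_session": [False, False, False],
    "recommend_any": [False, False, False],
})

# nunique() counts distinct values per column; constant columns have exactly one
varying = results.loc[:, results.nunique() > 1]
print(list(varying.columns))
# ['MRR', 'Precision', 'Recall', 'sampling_strategy', 'weighting_func', 'ranking_strategy']
```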

Step 8

Writing each possible dictionary manually would be a tedious task. You can define a function that will create a number of configurations to try.

def generate_parameter_sets(number_of_recommendations: Union[List, int] = 5,
                            number_of_neighbors: Union[List, int] = 10,
                            sample_size: Union[List, int] = 100,
                            return_events_from_session: bool = False,
                            required_sampling_event = None,
                            required_sampling_event_index: int = None,
                            sampling_str_event_weights_index: int = None,
                            recommend_any: bool = False):
    """
    Function generates multiple parameter sets.
    """
    if isinstance(number_of_recommendations, int):
        number_of_recommendations = [number_of_recommendations]

    if isinstance(number_of_neighbors, int):
        number_of_neighbors = [number_of_neighbors]

    if isinstance(sample_size, int):
        sample_size = [sample_size]

    sampling_strategies = ['common_items', 'recent', 'random']
    weighting_funcs = ['linear', 'log', 'quadratic']
    ranking_strategies = ['inv', 'linear', 'log', 'quadratic']

    parameters_sets = []

    for n_recs in number_of_recommendations:
        for n_neighb in number_of_neighbors:
            for s_size in sample_size:
                for s_strategy in sampling_strategies:
                    for weight_f in weighting_funcs:
                        for rank_s in ranking_strategies:
                            d = {
                                'number_of_recommendations': n_recs,
                                'number_of_neighbors': n_neighb,
                                'sampling_strategy': s_strategy,
                                'sample_size': s_size,
                                'weighting_func': weight_f,
                                'ranking_strategy': rank_s,
                                'return_events_from_session': return_events_from_session,
                                'recommend_any': recommend_any,
                                'required_sampling_event': required_sampling_event,
                                'required_sampling_event_index': required_sampling_event_index,
                                'sampling_str_event_weights_index': sampling_str_event_weights_index
                            }
                            parameters_sets.append(d)
    return parameters_sets

pgrid = generate_parameter_sets(number_of_neighbors=[10, 20, 50], sample_size=[100, 200, 500])

print(len(pgrid))

>> 324

There are 324 model configurations that mix the number of closest neighbors, possible neighbors sample sizes, sampling strategies, weighting functions, and ranking strategies. Checking every configuration with the TestModels class will take some time. The class shows a progress bar, so you will know how long it takes to get the results.

scorer = TestModels(training_ds,
                   validation_ds,
                   pgrid)

df = scorer.scores()
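
As a side note, the six nested loops used to build the grid in this step can be expressed more compactly with itertools.product from the standard library. This is an alternative sketch, not the article's function: the names grid_axes and compact_grid are hypothetical, and the fixed keys such as return_events_from_session or recommend_any are left out, so they would have to be added to each dictionary before passing it to fit().

```python
from itertools import product

# One list of candidate values per tuned parameter
grid_axes = {
    "number_of_recommendations": [5],
    "number_of_neighbors": [10, 20, 50],
    "sample_size": [100, 200, 500],
    "sampling_strategy": ["common_items", "recent", "random"],
    "weighting_func": ["linear", "log", "quadratic"],
    "ranking_strategy": ["inv", "linear", "log", "quadratic"],
}

# product() yields every combination; zip() pairs each value with its parameter name
compact_grid = [dict(zip(grid_axes, combo)) for combo in product(*grid_axes.values())]

print(len(compact_grid))  # 324 = 1 * 3 * 3 * 3 * 3 * 4
print(compact_grid[0])
```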

Step 9

The last step is to check scores and which configurations are the best for specific metrics. It is unlikely that all three metrics will be the highest possible for a single configuration.

Optimal configuration – MRR

Which configuration returns the relevant items in the best positions in a recommended sequence?

df.sort_values('MRR', ascending=False).head(1)

MRR: 0.855674
Precision: 0.691489
Recall: 0.059915
The number of closest neighbors (`number_of_neighbors`): 50
Sampling strategy: recent
Possible neighbors sample size: 500
Weighting function: log
Ranking Strategy: inv

In plain language, the configuration reads as follows: the best MRR is achieved when you sample 500 possible neighbors based on the recency of their actions and keep the 50 closest ones. Neighbor similarity is calculated on the assumption that the newest elements in a sequence have the highest weights. Similarly, the final ranking of recommendations takes into account the position of an item in a sequence (the first movie has the highest score).

Optimal configuration – Precision

Which configuration returns the highest ratio of the relevant items to the sequence items?

df.sort_values('Precision', ascending=False).head(1)

MRR: 0.825532
Precision: 0.702128
Recall: 0.061024
The number of closest neighbors (`number_of_neighbors`): 50
Sampling strategy: random
Possible neighbors sample size: 500
Weighting function: log
Ranking Strategy: quadratic

The best Precision is achieved when you take a random sample of 500 possible neighbors. Neighbor similarity is calculated on the assumption that the newest elements in a sequence have much higher weights than older elements. Similarly, the final ranking of recommendations uses the quadratic function, which assigns large weights to the first elements in a sequence and very small weights to the last elements.

Optimal configuration – Recall

Which configuration returns the highest ratio of relevant recommended items to all relevant items for a user?

df.sort_values('Recall', ascending=False).head(1)

MRR: 0.825532
Precision: 0.702128
Recall: 0.061024
The number of closest neighbors (`number_of_neighbors`): 50
Sampling strategy: random
Possible neighbors sample size: 500
Weighting function: log
Ranking Strategy: quadratic

It is the same configuration as for Precision, and the reason is simple: a fixed-length window of recommendations is used to test relevant items. More interesting is the fact that Recall is very low. It means that the number of recommendations (5) is tiny compared to the number of all relevant items; users usually watch many more than 5 movies.
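
The three look-ups in this step can also be done in one pass with pandas idxmax(). The sketch below uses the three rows from the initial class check in Step 7 instead of the full grid, so it runs without fitting any model; the names scores and best_per_metric are illustrative.

```python
import pandas as pd

# Rows copied from the initial class check (Step 7)
scores = pd.DataFrame({
    "MRR": [0.796099, 0.714184, 0.751241],
    "Precision": [0.610638, 0.497872, 0.623404],
    "Recall": [0.048660, 0.040337, 0.052864],
    "sampling_strategy": ["recent", "common_items", "random"],
    "weighting_func": ["log", "linear", "log"],
    "ranking_strategy": ["log", "linear", "inv"],
})

metrics = ["MRR", "Precision", "Recall"]

# idxmax() returns, for every metric column, the row label of its highest value
best_index = scores[metrics].idxmax()

# Select the winning rows (a row may repeat) and label each with the metric it wins
best_per_metric = scores.loc[best_index.values].copy()
best_per_metric.insert(0, "best_for", metrics)

print(best_per_metric[["best_for", "sampling_strategy", "weighting_func", "ranking_strategy"]])
```

Even in this small sample, Precision and Recall point to the same row while MRR prefers a different one.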

Summary

In this article, you have learned how to score session-based recommendations with theoretical metrics. Be warned that those metrics should not be the only reason to choose one model configuration over another. Sometimes, you must put business logic first and decide to push different parameters into production. And in the reality of recommendation systems, the most valuable metrics are those from business analytics: monetization or click-through rates.

In the next chapter, you will tweak a system to force it to follow business logic. The session events (watched movies) will take custom weights.

Bibliography

[1] https://github.com/SimonMolinsky/blog-post-wsknn-movielens
