Which movie should you recommend next? Session-based recommendation engine in Python, part 1

September 23, 2023

Data Science, e-commerce, Machine Learning, Recommendation Engine, WSKNN

Introduction

Imagine the busiest city you know. You stand in the city center at noon and take one photo. Then you study the picture and try to guess where every person will go. There are multiple pubs, restaurants, shops, and cultural venues. You assign one (or more) places to each person based on the direction they are facing. Do you see what could go wrong here? What about the family where the parents and their kids are looking in opposite directions? What about the person sitting on a bench? What if someone changes direction three seconds later? A model based on a single snapshot doesn’t perform well. But if we take ten more photos, the model’s accuracy gradually increases. And if we come back day after day, we may even see patterns:

People who go to a restaurant move next to a souvenir shop.
People who go to an old church go to a park next.
People who come from a park enter a restaurant.

What has happened at a higher level of abstraction? The setting changed from a single fixed point in time to a sequential model, where every action follows another action.

The comparison of a single snapshot which is frozen in time vs a sequence of snapshots that unfolds complex patterns
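Patterns like the ones listed above can be found by counting which place directly follows which. The sketch below is a minimal illustration with made-up visit sequences; the place names and the helper count_transitions are hypothetical.

```python
from collections import Counter, defaultdict


def count_transitions(sequences):
    """Count how often each place is directly followed by another place."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        # Pair every element with its successor: (a, b), (b, c), ...
        for current_place, next_place in zip(seq, seq[1:]):
            transitions[current_place][next_place] += 1
    return transitions


# Made-up visit sequences, one list per observed person
visits = [
    ["church", "park", "restaurant", "souvenir shop"],
    ["park", "restaurant", "souvenir shop"],
    ["church", "park", "pub"],
]

transitions = count_transitions(visits)
most_likely_after_park = transitions["park"].most_common(1)[0][0]
print(most_likely_after_park)
```

A single snapshot cannot produce such counts, because every count needs at least two consecutive observations of the same person.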

We can do the same thing with recommender systems. Let’s jump from city sightseeing to the movies. You probably know the MovieLens dataset [1], which is used as a benchmark for recommenders. The dataset has a file with four columns:

user id – unique ID of a user,
item id – unique ID of a movie,
rating,
timestamp – when the user rated a movie.

Usually, recommender systems use user ratings for recommendations. However, ratings are not always available. Thus, in this tutorial, we skip the ratings and try to answer a time-related question: which movie will the user watch next, given the sequence of movies they watched before? (Missing ratings are not just a theoretical case. Think about e-commerce shops: you have a stream of products viewed by a user, but no ratings. You want to recommend the next item in the sequence, and you can use only the user’s action history. Recommending the next movie in a sequence is an analogous scenario.)

The sequence of movie characters which ends with a question mark. There is text in the image: "We know what users watched, but we don't have ratings. What should we recommend next?"
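To make the idea concrete, the sketch below turns rating rows into time-ordered sequences and ignores the rating column. The sample rows are made up and only mimic the four-column layout; build_sequences is a hypothetical helper, not part of any package.

```python
from collections import defaultdict

# Made-up rows in the layout: user id, item id, rating, timestamp
rows = [
    ("u1", "m10", 4, 300),
    ("u2", "m11", 5, 100),
    ("u1", "m12", 2, 100),
    ("u1", "m11", 3, 200),
]


def build_sequences(rows):
    """Group rows by user and order each user's movies by timestamp."""
    events = defaultdict(list)
    for user_id, item_id, _rating, timestamp in rows:
        # The rating is deliberately dropped: only the order matters here
        events[user_id].append((timestamp, item_id))
    return {
        user_id: [item_id for _, item_id in sorted(user_events)]
        for user_id, user_events in events.items()
    }


sequences = build_sequences(rows)
print(sequences)
```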

Step-by-step Guide

WSKNN is a lightweight Python package [2]. It is a recommender engine built around the k-Nearest Neighbors (k-NN) algorithm. The package can be installed from PyPI and works with Python 3.8 or newer. We will go through the recommendation process step by step and explain how the package works in practice.

Step 1

Download the MovieLens dataset (MovieLens 100k). You can get data from the tutorial’s repository here: [3].

Step 2

Create a mamba environment or a virtual environment.

mamba create -n movie-recommender python="3.10"

Step 3

Activate the environment, install pip and notebook from mamba, and then install wsknn from pip.

mamba activate movie-recommender
(movie-recommender) mamba install pip notebook
(movie-recommender) pip install wsknn
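To confirm that the installation succeeded, you can query the installed version from Python. This is a small sketch that uses only the standard library; installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


print("wsknn:", installed_version("wsknn"))
```

If the printed value is None, the package is not visible in the active environment, which usually means the environment was not activated before installing.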

Step 4

Open Jupyter Notebook and create a new Python3 notebook.

Step 5

In the first cell, import the required packages and functions.

import numpy as np
from wsknn import fit
from wsknn.preprocessing.parse_static import parse_flat_file

numpy is used to randomly divide the input dataset into training and evaluation sets,
the wsknn.fit() function trains the model,
the parse_flat_file() function prepares data from the MovieLens dataset.

Step 6

In the second cell, preprocess the input dataset.

fpath = 'ml-100k/u.data'
ds = parse_flat_file(fpath, sep='\t', session_index=0, product_index=1, time_index=3, time_to_numeric=True)

Step 7

The variable ds is a tuple with two objects: Items and Sessions. In this tutorial, you will use only the Sessions object. It is a special data structure that stores sequences of users’ actions and the mappings between sessions and the items within them. You can print the object.

print(ds[1])

Sessions object statistics:
*Number of unique sessions: 943
*The longest event stream size per session: 737
*Period start: 1997-09-20T05:05:10.000000Z
*Period end: 1998-04-23T01:10:38.000000Z
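The same kind of statistics can be derived by hand from a dictionary of sessions, which may help when interpreting the printout. The sketch below uses a toy dictionary and assumes that timestamps are Unix seconds; describe_sessions is a hypothetical helper and not the package's implementation.

```python
from datetime import datetime, timezone


def describe_sessions(session_map):
    """Summarize a {session: [[items], [timestamps]]} dictionary."""
    all_timestamps = [ts for _, stamps in session_map.values() for ts in stamps]
    return {
        "unique_sessions": len(session_map),
        "longest_stream": max(len(items) for items, _ in session_map.values()),
        "period_start": datetime.fromtimestamp(min(all_timestamps), tz=timezone.utc),
        "period_end": datetime.fromtimestamp(max(all_timestamps), tz=timezone.utc),
    }


# Toy data: two sessions of different lengths
toy = {
    "u1": [["m1", "m2", "m3"], [0, 60, 120]],
    "u2": [["m2"], [86400]],
}

stats = describe_sessions(toy)
for name, value in stats.items():
    print(name, value)
```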

Step 8

You should see the same number of unique sessions and the same longest session size, which is the largest number of movies watched by a single user. The object also stores the dates of the first and the last event, which is desired behavior because you can add new sequences in the future. You don’t need the whole object right now, only a single attribute named .session_items_actions_map. In its basic form, it is a dictionary where keys are user IDs, and each user has a pair of equal-length lists. One list holds movies, and the other holds the timestamps when each movie was rated. Both lists are ordered by timestamp.

{
  "user id":
    [
      ["movie 1 id", "movie 2 id", "movie n id"],
      ["movie 1 rating timestamp", "movie 2 rating timestamp", "movie n rating timestamp"]
    ]
}

The sequences may vary in length. Some users may watch dozens of movies, while others only watch a few. It makes things more interesting, and this is the reason why the basic data structure of the WSKNN recommender is not flat.
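Because the structure is nested, a quick consistency check can catch malformed sessions early. The sketch below assumes the {user: [[movies], [timestamps]]} layout described above; is_valid_session is a hypothetical helper and the toy sessions are made up.

```python
def is_valid_session(session):
    """Check that a session has equally long, time-ordered item and timestamp lists."""
    if len(session) != 2:
        return False
    items, timestamps = session
    same_length = len(items) == len(timestamps)
    # Timestamps must be non-decreasing for the sequence to be ordered
    ordered = all(a <= b for a, b in zip(timestamps, timestamps[1:]))
    return same_length and ordered


toy_sessions = {
    "short user": [["m1"], [10]],
    "long user": [["m1", "m2", "m3"], [10, 20, 30]],
    "broken user": [["m1", "m2"], [10]],
}

report = {user: is_valid_session(s) for user, s in toy_sessions.items()}
print(report)
```

Sessions of different lengths pass the check; only a mismatch between the two lists of one session, or unordered timestamps, is flagged.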

Step 9

You could fit the whole preprocessed dataset into a model. However, it is not a good idea, because the model should be evaluated on records it has not seen before. First, set aside 10% of the user sessions for later tests. To do so, take .session_items_actions_map from the Sessions object parsed in Step 6 and divide it into two parts: a dictionary for training and a list of sessions for evaluation.

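def train_validate_samples(set_of_sessions):
    """Split sessions into a training dictionary and a list of validation sessions."""
    sessions_keys = list(set_of_sessions.keys())
    n_sessions = int(0.1 * len(sessions_keys))
    # replace=False prevents drawing the same session more than once;
    # tolist() converts numpy scalars back to plain Python keys
    key_sample = set(np.random.choice(sessions_keys, n_sessions, replace=False).tolist())

    training_set = {_key: set_of_sessions[_key] for _key in sessions_keys if _key not in key_sample}
    validation_set = [set_of_sessions[_key] for _key in key_sample]

    return training_set, validation_set


training_ds, validation_ds = train_validate_samples(ds[1].session_items_actions_map)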
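It can be worth checking that the two sets do not share users and that the split has the expected size. The sketch below runs such a check on a toy dictionary. It uses a hypothetical split_sessions helper that samples without replacement through Python's random module instead of numpy, and it returns the validation part as a dictionary; it is an illustration, not the tutorial's function.

```python
import random


def split_sessions(sessions, validation_fraction=0.1, seed=42):
    """Split a session dictionary into disjoint training and validation dictionaries."""
    keys = sorted(sessions)
    n_validation = int(validation_fraction * len(keys))
    # random.sample draws without replacement, so no key is picked twice
    validation_keys = set(random.Random(seed).sample(keys, n_validation))
    training = {k: v for k, v in sessions.items() if k not in validation_keys}
    validation = {k: sessions[k] for k in validation_keys}
    return training, validation


# Toy data: 50 users with one movie each
toy = {f"user {i}": [[f"movie {i}"], [i]] for i in range(50)}
train, valid = split_sessions(toy)

print(len(train), len(valid))
print("overlap:", set(train) & set(valid))
```

A fixed seed makes the split reproducible, which helps when comparing models across runs.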

Step 10

Train the model.

model = fit(sessions=training_ds,
            number_of_recommendations=5,
            number_of_neighbors=10,
            sampling_strategy='recent',
            sample_size=50,
            weighting_func='log',
            ranking_strategy='log',
            return_events_from_session=False,
            recommend_any=False)

The model takes multiple parameters:

The sessions parameter is a dictionary {user: ([movies], [timestamps])} or a Sessions object.
The number_of_recommendations parameter is self-explanatory: how many recommended movies should the model return?
The number_of_neighbors parameter sets how many of the closest neighbors the model compares to the current movie sequence.
The sampling_strategy set to recent tells the recommendation engine to pick neighbors with recent activity rather than people who rated movies further in the past.
The sample_size parameter is the size of the initial set of candidate neighbors.
The weighting_func and ranking_strategy parameters control how the sequence itself and each movie within it are weighted.
The return_events_from_session parameter matters for e-commerce data streams, where recommending products that a user has already seen can make sense. In the movie context, this is undesired behavior, so you should disable it.
The recommend_any parameter is important in production. Sometimes a user has a short activity history and has watched only the newest movies, so the model doesn’t find any recommendations for this user. You can force the model to recommend random items, or you can skip the recommendation procedure.
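To give an intuition for how these parameters interact, here is a deliberately simplified session-based k-NN sketch in plain Python. It is not WSKNN's implementation: the Jaccard similarity, the similarity-weighted vote, and the helper name recommend_knn are assumptions made for illustration, and the package's weighting and ranking functions are left out.

```python
from collections import Counter


def jaccard(a, b):
    """Share of common items between two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_knn(sessions, current_items, n_recs=5, n_neighbors=10, sample_size=50):
    """Recommend items seen by the sessions most similar to the current one."""
    # 1. Candidate sample: sessions with the most recent last event
    recent = sorted(sessions.values(), key=lambda s: s[1][-1], reverse=True)[:sample_size]
    # 2. Neighbors: candidates that share the most items with the current session
    current = set(current_items)
    scored = [(jaccard(current, set(items)), items) for items, _ in recent]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n_neighbors]
    # 3. Ranking: sum neighbor similarities per item, skipping items already seen
    votes = Counter()
    for similarity, items in neighbors:
        for item in set(items) - current:
            votes[item] += similarity
    return [(item, score) for item, score in votes.most_common(n_recs) if score > 0]


# Toy sessions in the {user: [[movies], [timestamps]]} layout
toy = {
    "a": [["m1", "m2", "m3"], [1, 2, 3]],
    "b": [["m1", "m2", "m4"], [2, 3, 4]],
    "c": [["m5", "m6"], [5, 6]],
}

recs = recommend_knn(toy, ["m1", "m2"], n_recs=2)
print(recs)
```

In this toy run, session c shares no movies with the current sequence, so its items get a zero score and are filtered out. This mirrors the situation in which a model finds no recommendations for a user.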

Step 11

You can evaluate the recommendations manually. The only problem is that movies are encoded as IDs. The movie names are stored in the u.item file. The function to get the movie name from its ID is defined as:

def get_movie_name(movie_id: str):
    with open('ml-100k/u.item', 'r', encoding = "ISO-8859-1") as fin:
        for line in fin:
            splitted = line.split('|')
            if movie_id == splitted[0]:
                return splitted[1]
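The function above reopens and scans the file on every call. If you look up many titles, one alternative is to load the file once into a dictionary. The sketch below assumes the same pipe-separated layout, with the ID in the first field and the title in the second; load_movie_names is a hypothetical helper, and the sample lines are made-up stand-ins for u.item.

```python
import io


def load_movie_names(lines):
    """Build an {id: title} dictionary from pipe-separated lines."""
    names = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 2:
            names[fields[0]] = fields[1]
    return names


# Made-up stand-in for the file; with real data, pass the handle from
# open('ml-100k/u.item', 'r', encoding='ISO-8859-1') instead
sample_file = io.StringIO(
    "1|First Movie (1995)|01-Jan-1995\n"
    "2|Second Movie (1996)|01-Jan-1996\n"
)
movie_names = load_movie_names(sample_file)

print(movie_names.get("2"))
print(movie_names.get("999"))  # unknown IDs give None
```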

With this function, we can run three sample recommendations:

for ts in validation_ds[:3]:
    print('User watched')
    print(str([get_movie_name(x) for x in ts[0]]))
    print('Recommendations')
    recs = model.recommend(ts)
    for rec in recs:
        print('Item:', get_movie_name(rec[0]), '| weight:', rec[1])
    print('---')
    print('')

User watched
['Men in Black (1997)', 'Truth About Cats & Dogs, The (1996)', "My Best Friend's Wedding (1997)", 'North by Northwest (1959)', 'If Lucy Fell (1996)', 'French Twist (Gazon maudit) (1995)', 'Twelve Monkeys (1995)', 'Godfather, The (1972)', 'Spitfire Grill, The (1996)', 'Jaws (1975)', 'Willy Wonka and the Chocolate Factory (1971)', 'Vertigo (1958)', 'Mars Attacks! (1996)', 'Hoop Dreams (1994)', "Devil's Own, The (1997)", 'Lawrence of Arabia (1962)', 'Star Trek: First Contact (1996)', 'Crumb (1994)', 'Terminator 2: Judgment Day (1991)', 'Birdcage, The (1996)', 'Primal Fear (1996)', "She's the One (1996)", 'My Fellow Americans (1996)', 'Lone Star (1996)', 'GoodFellas (1990)', 'My Life as a Dog (Mitt liv som hund) (1985)', 'City Hall (1996)', 'Star Wars (1977)', 'Adventures of Pinocchio, The (1996)', 'Heat (1995)', 'English Patient, The (1996)', 'Boot, Das (1981)', 'Richard III (1995)', 'First Wives Club, The (1996)', "Jackie Chan's First Strike (1996)", 'Contact (1997)', 'Six Degrees of Separation (1993)', 'Clerks (1994)', 'Trainspotting (1996)', '2001: A Space Odyssey (1968)', 'Courage Under Fire (1996)', 'Usual Suspects, The (1995)', "Antonia's Line (1995)", 'Back to the Future (1985)', 'Mighty Aphrodite (1995)', 'Jerry Maguire (1996)', 'Toy Story (1995)', 'Leaving Las Vegas (1995)', 'Sleepers (1996)', 'Dead Man Walking (1995)', 'Rock, The (1996)', 'Casablanca (1942)', 'Magnificent Seven, The (1954)', 'Cold Comfort Farm (1995)', 'Saint, The (1997)', 'Aladdin (1992)', 'Star Trek: The Wrath of Khan (1982)', 'Sense and Sensibility (1995)', 'Better Off Dead... (1985)', 'Monty Python and the Holy Grail (1974)', 'Big Night (1996)', 'Twister (1996)', "Mr. Holland's Opus (1995)", 'Close Shave, A (1995)', 'Fargo (1996)', 'Return of the Jedi (1983)', 'Before and After (1996)', 'Kingpin (1996)']
Recommendations
Item: Raiders of the Lost Ark (1981) | weight: 1.3488079691609707
Item: True Lies (1994) | weight: 1.3488079691609707
Item: Terminator, The (1984) | weight: 1.3488079691609707
Item: Aliens (1986) | weight: 1.269264685932151
Item: Silence of the Lambs, The (1991) | weight: 1.269264685932151

And what do you think about the recommender output?

Summary

In this article, you have learned about time-dependent and varying-length sessions used for recommendations. You used the WSKNN package for the first time. This post is the first of five posts about session-based recommendations. The next part will be about the scoring metrics for session-based recommendation engines.

Bibliography

[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

[2] WSKNN on PyPI: https://pypi.org/project/wsknn/

[3] Tutorial repository: https://github.com/SimonMolinsky/blog-post-wsknn-movielens