Which movie should you recommend next? Session-based recommendation engine in Python, part 1
Introduction
Imagine the busiest city you know. You stand in the city center at noon and take one photo of the place. Then you study the picture and try to guess where every person will go. There are multiple pubs, restaurants, shops, and cultural venues. You assign one (or more) places to each person based on the direction they are facing. Do you see what could go wrong here? What about a family where the parents and their kids are looking in opposite directions? What about the person sitting on a bench? What if someone changes direction three seconds later? A model based on a single snapshot doesn’t perform well. But if we take ten more photos, we gradually increase the model’s accuracy. And if we come back here day after day, we may even see patterns:
- People who go to a restaurant move next to a souvenir shop.
- People who go to an old church go to a park next.
- People who come from a park enter a restaurant.
What has happened at a higher level of abstraction? The setting changed from a single fixed point in time to a sequence, where every action follows another action.
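The patterns listed above can be read as counts of "place A followed by place B". Here is a minimal sketch of that idea; the visit data and the names `visits` and `count_transitions` are made up for illustration only.

```python
from collections import Counter

# Hypothetical observations: the ordered places each person visited.
visits = {
    "person_1": ["old church", "park", "restaurant", "souvenir shop"],
    "person_2": ["park", "restaurant", "souvenir shop"],
    "person_3": ["old church", "park"],
}


def count_transitions(sequences):
    """Count how often one place is directly followed by another."""
    transitions = Counter()
    for places in sequences.values():
        # zip pairs every place with the one visited right after it.
        transitions.update(zip(places, places[1:]))
    return transitions


transitions = count_transitions(visits)
for (origin, destination), count in transitions.most_common():
    print(f"{origin} -> {destination}: {count}")
```

A single snapshot gives no such pairs at all; they only appear once the observations are ordered in time.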
We can do the same thing with recommender systems. Let’s jump from city sightseeing to the movies. You probably know the MovieLens dataset [1], used as a benchmark for recommenders. The dataset has a file with four columns:
- user id – unique ID of a user,
- item id – unique ID of a movie,
- rating,
- timestamp – when the user rated a movie.
Usually, recommender systems use user ratings for recommendations. However, ratings are not always present. Thus, in this tutorial, we skip the ratings and try to answer a time-related question: which movie will the user see next, based on the sequence of previously watched movies? (Missing ratings are not just a theoretical case. Think about e-commerce shops: you have a stream of products viewed by a user, but there are no ratings. You want to recommend the next item in the sequence, and you may use only the user’s action history. Recommending the next movie in a sequence is an analogous scenario.)
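To make the "sequence of previous movies" concrete, here is a minimal sketch that turns rows with the four columns above into time-ordered sequences per user. The rows and the helper name `build_sequences` are invented for illustration; they are not taken from the real dataset.

```python
from collections import defaultdict

# Toy rows in the column order (user id, item id, rating, timestamp).
# The values are made up for illustration.
rows = [
    ("1", "10", 5, 300),
    ("1", "20", 3, 100),
    ("2", "10", 4, 250),
    ("1", "30", 4, 200),
    ("2", "40", 2, 150),
]


def build_sequences(records):
    """Group records by user and order each user's items by timestamp."""
    per_user = defaultdict(list)
    for user_id, item_id, _rating, timestamp in records:
        # The rating is ignored on purpose: only the order of events matters.
        per_user[user_id].append((timestamp, item_id))
    return {
        user_id: [item_id for _, item_id in sorted(events)]
        for user_id, events in per_user.items()
    }


sequences = build_sequences(rows)
print(sequences)
```

For user "1" the result starts with item "20", because that row has the earliest timestamp.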
Step-by-step Guide
Python offers a lightweight package, WSKNN. It is a recommender engine, and its core is the k-Nearest Neighbors (k-NN) algorithm [2]. The package can be installed from PyPI. It works with Python versions >=3.8. We will go through the recommendation process step by step and explain how the package works in practice.
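Before using the package, it may help to see the general idea of a session-based k-NN recommender in plain Python. The sketch below is an assumption-laden illustration, not WSKNN's implementation: it measures similarity with the Jaccard index, and the names `jaccard`, `recommend_knn`, and the toy sessions are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Share of common items between two sessions (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_knn(current, past_sessions, k=2, top_n=2):
    """Score unseen items by the similarity of the k closest past sessions."""
    scored = sorted(
        ((jaccard(current, items), items) for items in past_sessions.values()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Keep the k most similar sessions that share at least one item.
    neighbours = [pair for pair in scored[:k] if pair[0] > 0]
    votes = Counter()
    for similarity, items in neighbours:
        for item in items:
            if item not in current:
                votes[item] += similarity
    return votes.most_common(top_n)


# Toy sessions, made up for illustration.
past = {"u1": ["A", "B", "C"], "u2": ["A", "B", "D"], "u3": ["X", "Y"]}
recs = recommend_knn(["A", "B"], past)
print(recs)
```

Session "u3" shares nothing with the current session, so only items from "u1" and "u2" are recommended.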
Step 1
Download the MovieLens dataset (MovieLens 100k). You can get data from the tutorial’s repository here: [3].
Step 2
Create a mamba environment or a virtual environment.
mamba create -n movie-recommender python="3.10"
Step 3
Activate the environment, install pip and notebook with mamba, and then install wsknn with pip.
mamba activate movie-recommender
(movie-recommender) mamba install pip notebook
(movie-recommender) pip install wsknn
Step 4
Open Jupyter Notebook and create a new Python3 notebook.
Step 5
In the first cell, import the required packages and functions.
import numpy as np
from wsknn import fit
from wsknn.preprocessing.parse_static import parse_flat_file
- numpy is used to randomly divide the input dataset into training and evaluation sets,
- The wsknn.fit() function trains the model,
- The parse_flat_file() function prepares data from the MovieLens dataset.
Step 6
In the second cell, preprocess the input dataset.
fpath = 'ml-100k/u.data'
ds = parse_flat_file(fpath,
                     sep='\t',
                     session_index=0,
                     product_index=1,
                     time_index=3,
                     time_to_numeric=True)
Step 7
Variable ds is a tuple with two objects: Items and Sessions. You will use only the Sessions object in this tutorial. It is a special data structure that stores information about sequences of users’ actions and the mappings between sessions and the items within those sessions. You can print the object.
print(ds[1])
Sessions object statistics:
*Number of unique sessions: 943
*The longest event stream size per session: 737
*Period start: 1997-09-20T05:05:10.000000Z
*Period end: 1998-04-23T01:10:38.000000Z
Step 8
You should see the same number of unique sessions and the same longest session size, which is the number of movies watched by a single user. The object also stores the dates of the first and the last event, which is the desired behavior because you can add new sequences in the future. You don’t need the whole object right now, only a single attribute named .session_items_actions_map. In its basic form, it is a dictionary where the keys are user IDs, and each user has a pair of lists of equal length. One list holds the movies, and the other holds the timestamps of when each movie was rated. Both lists are ordered by timestamp.
{
    "user id": [
        ["movie 1 id", "movie 2 id", "movie n id"],
        ["movie 1 rating timestamp", "movie 2 rating timestamp", "movie n rating timestamp"]
    ]
}
The sequences may vary in length. Some users may watch dozens of movies, while others watch only a few. This makes things more interesting, and it is the reason why the basic data structure of the WSKNN recommender is not flat.
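A quick way to get a feel for this nested structure is to measure how long each sequence is. The sketch below works on a toy dictionary in the shape described above; the IDs, timestamps, and the helper name `session_lengths` are invented for illustration.

```python
# Toy mapping in the shape described above; IDs and timestamps are made up.
toy_map = {
    "1": [["10", "20", "30"], [100, 200, 300]],
    "2": [["40"], [150]],
}


def session_lengths(session_map):
    """Return the number of movies per user and check list alignment."""
    lengths = {}
    for user_id, record in session_map.items():
        items, timestamps = record[0], record[1]
        if len(items) != len(timestamps):
            raise ValueError(f"Misaligned lists for user {user_id}")
        lengths[user_id] = len(items)
    return lengths


lengths = session_lengths(toy_map)
longest_user = max(lengths, key=lengths.get)
print(lengths, longest_user)
```

Under the assumption that the real attribute has the same shape, passing ds[1].session_items_actions_map to this function should report a longest sequence equal to the value printed in the object statistics.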
Step 9
You could fit the whole preprocessed dataset into a model. However, that is not a good idea, because you want to evaluate the model on records it has not seen before. First, set aside 10% of the user sessions for later tests. To do so, take .session_items_actions_map from the Sessions object parsed in Step 6. Then, divide it into two parts: a dictionary for training and a list of sessions for evaluation.
def train_validate_samples(set_of_sessions):
    sessions_keys = list(set_of_sessions.keys())
    n_sessions = int(0.1 * len(sessions_keys))
    # replace=False: every validation session is drawn only once
    key_sample = set(
        np.random.choice(sessions_keys, n_sessions, replace=False).tolist()
    )
    training_set = {_key: set_of_sessions[_key] for _key in sessions_keys if _key not in key_sample}
    validation_set = [set_of_sessions[_key] for _key in key_sample]
    return training_set, validation_set

training_ds, validation_ds = train_validate_samples(ds[1].session_items_actions_map)
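If you want the same split on every run, one option is to seed the random generator and shuffle the keys. This is a sketch of an alternative, not part of the original tutorial code; the function name `split_session_keys`, the seed value, and the toy keys are assumptions.

```python
import numpy as np


def split_session_keys(keys, validation_fraction=0.1, seed=42):
    """Shuffle keys with a fixed seed and cut off a validation part."""
    keys = list(keys)
    order = np.random.default_rng(seed).permutation(len(keys))
    n_validation = int(validation_fraction * len(keys))
    validation_keys = [keys[i] for i in order[:n_validation]]
    training_keys = [keys[i] for i in order[n_validation:]]
    return training_keys, validation_keys


# Toy user IDs, made up for illustration.
toy_keys = [str(i) for i in range(1, 21)]
train_keys, valid_keys = split_session_keys(toy_keys)

# The two groups never share a key, and together they cover all keys.
print(len(train_keys), len(valid_keys))
print(set(train_keys).isdisjoint(valid_keys))
```

Because the permutation touches each index exactly once, the training and validation groups cannot overlap.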
Step 10
Train the model.
model = fit(sessions=training_ds,
            number_of_recommendations=5,
            number_of_neighbors=10,
            sampling_strategy='recent',
            sample_size=50,
            weighting_func='log',
            ranking_strategy='log',
            return_events_from_session=False,
            recommend_any=False)
The model takes multiple parameters:
- The sessions parameter is a dictionary {user: ([movies], [timestamps])} or a Sessions object,
- The number_of_recommendations parameter is self-explanatory: how many recommended movies should the model return?
- The number_of_neighbors parameter: how many of the closest neighbors does the model compare to the actual movie sequence?
- The sampling_strategy parameter set to recent tells the recommendation engine to pick neighbors with recent activity instead of people who rated movies further in the past,
- The sample_size parameter is the size of the initial set of neighbors,
- The weighting_func and ranking_strategy parameters weight the sequence itself and each movie in the sequence,
- The return_events_from_session parameter matters for e-commerce data streams, where recommending products that a user has already seen may be acceptable. In the movie context, this is undesired behavior, so you should block it,
- The recommend_any parameter is important in production. Sometimes, a user has a short activity history and has watched only the newest movies, so the model doesn’t find any recommendations for this user. You can force the model to recommend random items, or you can skip the recommendation procedure.
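To illustrate what sampling by recency means in the list above, here is a rough sketch that keeps only the sessions with the most recent last event. It is an illustration of the idea under simplifying assumptions, not the library's code; the helper name `sample_recent_sessions` and the toy data are hypothetical.

```python
def sample_recent_sessions(session_map, sample_size):
    """Return the IDs of the sessions with the most recent last event."""

    def last_event(session_id):
        # The second list of every session holds its timestamps.
        return max(session_map[session_id][1])

    ranked = sorted(session_map, key=last_event, reverse=True)
    return ranked[:sample_size]


# Toy sessions: {user: [[movies], [timestamps]]}, values made up.
toy_sessions = {
    "1": [["10", "20"], [100, 200]],
    "2": [["30"], [900]],
    "3": [["10", "40"], [400, 500]],
}

recent = sample_recent_sessions(toy_sessions, sample_size=2)
print(recent)
```

With a sample size of 2, the session of user "1" is dropped because its last event is the oldest.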
Step 11
You can evaluate the recommendations manually. The only problem is that movies are encoded as indexes. The movie names are in the u.item file. The function to get a movie name from its index is defined as:
def get_movie_name(movie_id: str):
    with open('ml-100k/u.item', 'r', encoding="ISO-8859-1") as fin:
        for line in fin:
            splitted = line.split('|')
            if movie_id == splitted[0]:
                return splitted[1]
With this function, we can run three sample recommendations:
for ts in validation_ds[:3]:
    print('User watched')
    print(str([get_movie_name(x) for x in ts[0]]))
    print('Recommendations')
    recs = model.recommend(ts)
    for rec in recs:
        print('Item:', get_movie_name(rec[0]), '| weight:', rec[1])
    print('---')
    print('')
User watched
['Men in Black (1997)', 'Truth About Cats & Dogs, The (1996)', "My Best Friend's Wedding (1997)", 'North by Northwest (1959)', 'If Lucy Fell (1996)', 'French Twist (Gazon maudit) (1995)', 'Twelve Monkeys (1995)', 'Godfather, The (1972)', 'Spitfire Grill, The (1996)', 'Jaws (1975)', 'Willy Wonka and the Chocolate Factory (1971)', 'Vertigo (1958)', 'Mars Attacks! (1996)', 'Hoop Dreams (1994)', "Devil's Own, The (1997)", 'Lawrence of Arabia (1962)', 'Star Trek: First Contact (1996)', 'Crumb (1994)', 'Terminator 2: Judgment Day (1991)', 'Birdcage, The (1996)', 'Primal Fear (1996)', "She's the One (1996)", 'My Fellow Americans (1996)', 'Lone Star (1996)', 'GoodFellas (1990)', 'My Life as a Dog (Mitt liv som hund) (1985)', 'City Hall (1996)', 'Star Wars (1977)', 'Adventures of Pinocchio, The (1996)', 'Heat (1995)', 'English Patient, The (1996)', 'Boot, Das (1981)', 'Richard III (1995)', 'First Wives Club, The (1996)', "Jackie Chan's First Strike (1996)", 'Contact (1997)', 'Six Degrees of Separation (1993)', 'Clerks (1994)', 'Trainspotting (1996)', '2001: A Space Odyssey (1968)', 'Courage Under Fire (1996)', 'Usual Suspects, The (1995)', "Antonia's Line (1995)", 'Back to the Future (1985)', 'Mighty Aphrodite (1995)', 'Jerry Maguire (1996)', 'Toy Story (1995)', 'Leaving Las Vegas (1995)', 'Sleepers (1996)', 'Dead Man Walking (1995)', 'Rock, The (1996)', 'Casablanca (1942)', 'Magnificent Seven, The (1954)', 'Cold Comfort Farm (1995)', 'Saint, The (1997)', 'Aladdin (1992)', 'Star Trek: The Wrath of Khan (1982)', 'Sense and Sensibility (1995)', 'Better Off Dead... (1985)', 'Monty Python and the Holy Grail (1974)', 'Big Night (1996)', 'Twister (1996)', "Mr. Holland's Opus (1995)", 'Close Shave, A (1995)', 'Fargo (1996)', 'Return of the Jedi (1983)', 'Before and After (1996)', 'Kingpin (1996)']
Recommendations
Item: Raiders of the Lost Ark (1981) | weight: 1.3488079691609707
Item: True Lies (1994) | weight: 1.3488079691609707
Item: Terminator, The (1984) | weight: 1.3488079691609707
Item: Aliens (1986) | weight: 1.269264685932151
Item: Silence of the Lambs, The (1991) | weight: 1.269264685932151
And what do you think about the recommender output?
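One practical note on the lookup used above: get_movie_name() re-reads the u.item file for every single title. A possible alternative is to load all titles into a dictionary once. The sketch below assumes the same pipe-separated layout (ID first, title second); the function name `load_movie_titles` is hypothetical, and the sample lines are shortened stand-ins for real rows.

```python
def load_movie_titles(lines):
    """Build a {movie id: title} dictionary from pipe-separated lines."""
    titles = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 2:
            titles[fields[0]] = fields[1]
    return titles


# Shortened sample lines in the u.item layout (ID|title|release date|...).
sample_lines = [
    "1|Toy Story (1995)|01-Jan-1995\n",
    "2|GoldenEye (1995)|01-Jan-1995\n",
]

titles = load_movie_titles(sample_lines)
print(titles.get("1"))
print(titles.get("999"))  # unknown IDs give None instead of raising

# With the real file, the call could look like this:
# with open('ml-100k/u.item', 'r', encoding='ISO-8859-1') as fin:
#     titles = load_movie_titles(fin)
```

The file is then read once, and every later lookup is a dictionary access.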
Summary
In this article, you have learned about time-dependent and varying-length sessions used for recommendations. You used the WSKNN package for the first time. This post is the first of five posts about session-based recommendations. The next part will be about the scoring metrics for session-based recommendation engines.
Bibliography
[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872
[2] WSKNN on PyPI: https://pypi.org/project/wsknn/
[3] https://github.com/SimonMolinsky/blog-post-wsknn-movielens