Of course! Using Singular Value Decomposition (SVD) for recommendations is a classic and powerful technique, a form of latent factor collaborative filtering (also called matrix factorization). The MovieLens dataset is the perfect dataset to learn it on.
Here’s a complete, step-by-step guide on how to build a movie recommendation system using SVD in Python.
The Core Idea of SVD for Recommendations
The goal of a recommendation system is to predict a user's rating for a movie they haven't seen yet. We can represent all user ratings as a large matrix, where:
- Rows are users.
- Columns are movies.
- Values are the ratings (e.g., 1-5).
This matrix is very sparse because most users have only rated a tiny fraction of the available movies.
SVD helps by "decomposing" this large, sparse matrix into three smaller, dense matrices:
R ≈ U * Σ * Vᵀ
- R: The original user-movie rating matrix.
- U: The User Features Matrix. It shows how much each user is associated with each latent feature (e.g., "Action Lover," "Drama Enthusiast").
- Σ (Sigma): The Singular Values Matrix. It's a diagonal matrix containing the "strength" or importance of each latent feature. We often use this to reduce the number of features (a process called Truncated SVD).
- Vᵀ: The Movie Features Matrix. It shows how much each movie is associated with each latent feature.
By multiplying the truncated U, Σ, and Vᵀ back together, we get a predicted rating matrix. This new matrix is dense, meaning it has a predicted rating for every user-movie pair, even the ones the user didn't originally rate.
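To make the decomposition concrete, here is a tiny, self-contained sketch on a made-up 4-user × 5-movie matrix (the numbers are invented purely for illustration), using the same SciPy svds routine we will apply to the real data below:
import numpy as np
from scipy.sparse.linalg import svds
# A made-up 4x5 rating matrix; 0 stands for "not rated yet"
R_toy = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 4],
], dtype=float)
# Keep only k=2 latent features (truncated SVD)
U, sigma, Vt = svds(R_toy, k=2)
# Multiply the pieces back together to get a dense approximation of R_toy
R_approx = U @ np.diag(sigma) @ Vt
print(R_approx.round(2))
The cells that were 0 (unrated) now contain nonzero scores: those are the model's guesses, and ranking them is exactly how we will generate recommendations in Step 6.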
Step-by-Step Python Implementation
We'll use SciPy's svds function for the truncated SVD, pandas for data manipulation, and scikit-learn for the train/test split and evaluation.
Step 1: Setup and Installation
First, make sure you have the necessary libraries installed.
pip install pandas numpy scipy scikit-learn
Step 2: Load and Prepare the MovieLens Data
For this example, we'll use the small MovieLens 100k dataset (100,000 ratings). Download it from https://grouplens.org/datasets/movielens/100k/ and unzip it; the ratings live in the tab-separated file u.data, with one row per (user, movie, rating, timestamp).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy.sparse.linalg import svds
import numpy as np
# Load the MovieLens 100k ratings file (tab-separated, no header row).
# Adjust the path to wherever you unzipped the ml-100k folder.
df = pd.read_csv(
    'ml-100k/u.data',
    sep='\t',
    names=['user_id', 'item_id', 'rating', 'timestamp']
)
# The columns are 'user_id', 'item_id', 'rating', 'timestamp'
print("Original DataFrame Head:")
print(df.head())
print("\nOriginal DataFrame Shape:", df.shape)
Step 3: Create the User-Movie Rating Matrix
We need to pivot the DataFrame to create the user-movie matrix. The index will be user_id, the columns will be item_id, and the values will be the rating. Missing ratings will be filled with NaN.
# Create the user-movie rating matrix
R_df = df.pivot(index='user_id', columns='item_id', values='rating')
print("\nUser-Movie Rating Matrix (Head):")
print(R_df.head())
# Fill NaN with 0 so SVD can run on a plain numeric matrix
# (we keep R_df with its NaNs so we can still tell rated from unrated later)
R = R_df.fillna(0).values
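As a quick sanity check on the sparsity claim from earlier, you can measure what fraction of the user-movie cells actually contain a rating (for ML-100k it is only around 6%):
# What fraction of the matrix is actually observed?
filled = R_df.notna().sum().sum()
print(f"Fill rate: {filled / R_df.size:.1%} (the rest is missing)")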
Step 4: Perform Truncated SVD
We will use scipy.sparse.linalg.svds for this. We need to choose the number of latent factors (k). This is a hyperparameter that must be smaller than both the number of users and the number of movies; a good starting point is around 50.
# Number of latent factors
k = 50
# Perform SVD
# We only need the first k components (Truncated SVD)
U, sigma, Vt = svds(R, k=k)
# sigma is returned as a 1D array, so we convert it to a diagonal matrix
sigma = np.diag(sigma)
print("\nShape of U:", U.shape) # (num_users, k)
print("Shape of sigma:", sigma.shape) # (k, k)
print("Shape of Vt:", Vt.shape) # (k, num_movies)
Step 5: Generate Predictions and Evaluate
Now, we reconstruct the rating matrix using the decomposed matrices to get our predictions.
# Reconstruct the predicted rating matrix
predicted_ratings = np.dot(np.dot(U, sigma), Vt)
# Convert the predicted ratings back to a DataFrame
predicted_ratings_df = pd.DataFrame(predicted_ratings,
index=R_df.index,
columns=R_df.columns)
print("\nPredicted Ratings Matrix (Head):")
print(predicted_ratings_df.head())
To see how well our model did, let's calculate the Root Mean Squared Error (RMSE) on a test set. We'll split the original data first.
# --- Evaluation ---
# Split the original data into training and testing sets
train_data, test_data = train_test_split(df, test_size=0.25, random_state=42)
# Build the training matrix, reindexed to the full set of users and movies so
# that its rows and columns line up with R_df (users or movies missing from
# the training split simply become all-zero rows/columns)
R_train = (train_data.pivot(index='user_id', columns='item_id', values='rating')
           .reindex(index=R_df.index, columns=R_df.columns)
           .fillna(0)
           .values)
# Perform SVD on the training data
U_train, sigma_train, Vt_train = svds(R_train, k=k)
sigma_train = np.diag(sigma_train)
# Predict ratings for the training data
predicted_ratings_train = np.dot(np.dot(U_train, sigma_train), Vt_train)
# Now, we need to compare these predictions with the *actual* test ratings.
# We can only evaluate on the entries that exist in the test set.
test_user_indices = R_df.index.get_indexer(test_data['user_id'])     # row positions
test_movie_indices = R_df.columns.get_indexer(test_data['item_id'])  # column positions
actual_ratings = test_data['rating'].values
# Get the predicted ratings for the same user-movie pairs in the test set
predicted_ratings_test = predicted_ratings_train[test_user_indices, test_movie_indices]
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_ratings_test))
print(f"\nRMSE on Test Set: {rmse:.4f}")
Step 6: Make Recommendations for a Specific User
The most exciting part! Let's pick a user and recommend movies they haven't seen yet.
def recommend_movies(user_id, num_recommendations=5):
"""
Recommends movies for a given user based on SVD predictions.
"""
    # Look up the user's actual ratings and predicted ratings by user_id label
    user_ratings = R_df.loc[user_id]
    user_predictions = predicted_ratings_df.loc[user_id]
# Create a DataFrame of movies the user hasn't rated yet
unrated_movies = user_ratings[user_ratings.isnull()]
# Get the predicted ratings for these unrated movies
recommendations = pd.DataFrame({
'predicted_rating': user_predictions[unrated_movies.index]
})
# Sort by the highest predicted rating
recommendations = recommendations.sort_values(by='predicted_rating', ascending=False)
# Get the top N recommendations
top_recommendations = recommendations.head(num_recommendations)
# Get movie titles (assuming you have a movies file)
# For simplicity, we'll just return the item_ids.
# In a real project, you'd load 'u.item' to map item_id to title.
return top_recommendations.index.tolist()
# --- Example: Get recommendations for user 42 ---
user_to_recommend = 42
recommended_movie_ids = recommend_movies(user_to_recommend)
print(f"\nTop 5 movie recommendations for User {user_to_recommend}:")
print(f"Recommended Movie IDs: {recommended_movie_ids}")
# To get actual movie titles, you would load the movie titles file
# e.g., movie_titles = pd.read_csv('u.item', sep='|', header=None, usecols=[0, 1], encoding='latin-1')
# and then map the IDs to titles.
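If you want human-readable output, here is a minimal sketch of that last step using the u.item file that ships in the same ml-100k folder (pipe-separated, Latin-1 encoded, movie ID in the first column and title in the second):
# Map item_id -> movie title using the u.item metadata file
movie_titles = pd.read_csv('ml-100k/u.item', sep='|', header=None,
                           encoding='latin-1', usecols=[0, 1])
movie_titles.columns = ['item_id', 'title']
id_to_title = dict(zip(movie_titles['item_id'], movie_titles['title']))
print(f"Top 5 recommendations for User {user_to_recommend}:")
for item_id in recommended_movie_ids:
    print(f"  {item_id}: {id_to_title.get(item_id, 'unknown title')}")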
Full Code (Condensed)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from scipy.sparse.linalg import svds
# 1. Load Data
print("Loading data...")
df = pd.read_csv('ml-100k/u.data', sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])
# 2. Create Rating Matrix
print("Creating rating matrix...")
R_df = df.pivot(index='user_id', columns='item_id', values='rating')