Building a Recommendation Model Using K-Nearest Neighbors (KNN)
Recommendation systems are everywhere these days, from e-commerce sites to recipe apps. One of the simplest and most intuitive ways to build a recommendation engine is the K-Nearest Neighbors (KNN) algorithm. Here we walk through how to build one, using a pipeline for preprocessing and scaling, with code snippets for each step.
K-Nearest Neighbors (KNN)
KNN is a non-parametric, instance-based learning algorithm. It works by finding the k nearest data points (neighbors) to a given input and then providing recommendations based on the similarity between these points. In our recommendation engine, we will be using cosine similarity as the metric to determine how similar data points (recipes or food items) are to each other.
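To make the metric concrete, here is a minimal sketch of cosine similarity and cosine distance written by hand. The three vectors are made-up (calories, fat, protein) values used only for illustration; they are not taken from any dataset.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance is 1 minus cosine similarity, so it lies in [0, 2]."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical (calories, fat, protein) vectors, for illustration only.
small_portion = [100.0, 5.0, 10.0]
large_portion = [200.0, 10.0, 20.0]   # same proportions, twice the size
different_item = [100.0, 20.0, 1.0]   # different nutrient balance

print(cosine_distance(small_portion, large_portion))   # effectively zero
print(cosine_distance(small_portion, different_item))  # clearly larger
```

The first pair differs only in magnitude, so the distance is essentially zero. This is the property that makes cosine distance compare the shape of a nutrition profile rather than its size.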
The steps to build the recommendation model are:
- Extract the relevant columns for nutrition content.
- Scale the data so all features are comparable.
- Use KNN to find the nearest neighbors based on cosine distance.
- Build a pipeline.
- Filter the data based on user preferences (e.g. include or exclude specific ingredients).
- Make recommendations and calculate accuracy.
Now, let's get into the code.
Step 1: Extract Nutrition Columns
First, we extract the relevant nutrition columns, such as calories, fat, and protein. These features are used to measure the similarity between different food items.
def extract_nutrition_columns(dataframe):
    columns = ['Calories', 'FatContent', 'SaturatedFatContent', 'CholesterolContent',
               'SodiumContent', 'CarbohydrateContent', 'FiberContent', 'SugarContent', 'ProteinContent']
    return dataframe[columns]
Step 2: Scale the Data
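We scale the data so that no single feature dominates just because of its units (e.g. calories, which run into the hundreds, versus fiber, which is a few grams).
Here we use StandardScaler, which standardizes each feature by removing the mean and scaling to unit variance. Distance-based methods such as KNN generally behave better when features are on comparable scales.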
from sklearn.preprocessing import StandardScaler

def scaling(dataframe):
    scaler = StandardScaler()
    prep_data = scaler.fit_transform(dataframe.to_numpy())
    # Return the fitted scaler too, so new inputs can be scaled the same way.
    return prep_data, scaler
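As a quick illustration of what the scaler does, the sketch below standardizes a tiny made-up matrix (two columns standing in for calories and fiber) and then reuses the fitted scaler on a new row. The numbers are assumptions chosen so the result is easy to check by eye.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 plays the role of calories, column 1 of fiber.
raw = np.array([[250.0, 2.0],
                [500.0, 4.0],
                [750.0, 9.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

# After scaling, every column has mean 0 and standard deviation 1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))

# A new item must be transformed with the SAME fitted scaler, not a new one.
# This row sits exactly at the column means (500 and 5), so it maps to zeros.
new_item = scaler.transform(np.array([[500.0, 5.0]]))
print(new_item)
```

This is why the scaling function returns the fitted scaler alongside the scaled data: the user's input has to pass through the same transformation as the training rows.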
Step 3: KNN Predictor
Next, we use scikit-learn's NearestNeighbors to build a model that finds the nearest neighbors based on cosine distance.
We use the cosine metric because it measures the angle between two vectors, so two food items count as similar when their scaled nutrition profiles point in the same direction, regardless of magnitude.
from sklearn.neighbors import NearestNeighbors

def nn_predictor(prep_data):
    neigh = NearestNeighbors(metric='cosine', algorithm='brute')
    neigh.fit(prep_data)
    return neigh
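The fitted model is queried with kneighbors, which returns the distances and the row positions of the closest items. The sketch below uses four made-up 2-D points so the expected neighbors are easy to reason about.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical, already-scaled items (one row per item).
items = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [-1.0, 0.0]])

neigh = NearestNeighbors(metric='cosine', algorithm='brute')
neigh.fit(items)

# The query points in the same direction as row 0, only with a larger magnitude.
query = np.array([[2.0, 0.0]])
distances, indices = neigh.kneighbors(query, n_neighbors=2, return_distance=True)

print(indices)    # row positions of the two closest items
print(distances)  # their cosine distances, smallest first
```

Both outputs have shape (number of queries, n_neighbors), and the indices are positions in the matrix passed to fit.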
Step 4: Pipeline
We then build a pipeline to chain the scaling and neighbor prediction steps together so we can apply the model to new inputs.
The pipeline makes the code cleaner and more modular. It ensures the data is scaled before it goes into the KNN model.
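from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def build_pipeline(neigh, scaler, params):
    # Both `scaler` and `neigh` are expected to be fitted already.
    # Wrapping neigh.kneighbors lets the neighbor lookup act as the last "transform" step.
    transformer = FunctionTransformer(neigh.kneighbors, kw_args=params)
    pipeline = Pipeline([('std_scaler', scaler), ('NN', transformer)])
    return pipeline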
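Conceptually, the pipeline chains two calls: scale the query with the fitted scaler, then look up its neighbors. The sketch below spells out those two calls by hand on a small made-up matrix, without using Pipeline, to show what happens to a new input. The data values are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with two features.
raw = np.array([[100.0, 1.0],
                [200.0, 8.0],
                [300.0, 3.0],
                [400.0, 6.0]])

# Fit both steps on the training data.
scaler = StandardScaler()
prep_data = scaler.fit_transform(raw)
neigh = NearestNeighbors(metric='cosine', algorithm='brute').fit(prep_data)

# Step 1 of the chain: scale the query with the fitted scaler.
query = np.array([[200.0, 8.0]])   # identical to row 1 of `raw`
scaled_query = scaler.transform(query)

# Step 2 of the chain: neighbor lookup on the scaled query.
distances, indices = neigh.kneighbors(scaled_query, n_neighbors=2, return_distance=True)
print(indices[0], distances[0])
```

Because the query equals row 1, that row comes back first with a distance of essentially zero. Skipping the scaling step would compare a raw vector against scaled ones and give misleading neighbors.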
Step 5: Filter Data by Tags
Sometimes users want to include or exclude specific ingredients from the recommendations. We implement a function to filter data by the given tags.
def extract_ingredient_filtered_data(dataframe, include_tags=None, exclude_tags=None):
    extracted_data = dataframe.copy()

    def filter_row(tags_string):
        # Rows with a missing or empty Tags value are treated as having no tags.
        tags = []
        if isinstance(tags_string, str) and tags_string:
            tags = [tag.strip().lower() for tag in tags_string.split(',')]
        # Every include tag must be present.
        if include_tags:
            for tag in include_tags:
                if tag.lower() not in tags:
                    return False
        # No exclude tag may be present.
        if exclude_tags:
            for tag in exclude_tags:
                if tag.lower() in tags:
                    return False
        return True

    extracted_data = extracted_data[extracted_data['Tags'].apply(filter_row)]
    return extracted_data
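The include/exclude rule can be tried on a toy table. The sketch below is a compact, standalone variant of the same idea; the helper name matches_tags and the four recipes are hypothetical and exist only for this example.

```python
import pandas as pd

def matches_tags(tags_string, include_tags=None, exclude_tags=None):
    """Hypothetical helper: True if a comma-separated tag string passes both filters."""
    tags = set()
    if isinstance(tags_string, str):
        tags = {t.strip().lower() for t in tags_string.split(',')}
    if include_tags and any(t.lower() not in tags for t in include_tags):
        return False
    if exclude_tags and any(t.lower() in tags for t in exclude_tags):
        return False
    return True

# Made-up recipes; the last one has no tags at all.
recipes = pd.DataFrame({
    'Name': ['Chicken salad', 'Peanut noodles', 'Fruit bowl', 'Mystery dish'],
    'Tags': ['chicken, lettuce', 'noodles, Peanut', 'apple, banana', None],
})

# Exclude anything tagged with peanut (matching is case-insensitive).
no_peanut = recipes[recipes['Tags'].apply(lambda s: matches_tags(s, exclude_tags=['peanut']))]

# Keep only recipes that contain chicken.
with_chicken = recipes[recipes['Tags'].apply(lambda s: matches_tags(s, include_tags=['chicken']))]

print(no_peanut['Name'].tolist())
print(with_chicken['Name'].tolist())
```

Note the design choice for untagged rows: they survive an exclude filter (nothing to exclude) but fail an include filter (the required tag is absent).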
Step 6: Make Recommendations
Use the pipeline and KNN model to make recommendations based on the user's input.
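import numpy as np

def apply_pipeline(pipeline, _input, extracted_data):
    # Reshape the input into a single-row 2-D array, as scikit-learn expects.
    _input = np.array(_input).reshape(1, -1)
    # The last pipeline step wraps kneighbors, so the result is (distances, indices).
    data = pipeline.transform(_input)
    # Return the recommended rows and their distances to the input.
    return extracted_data.iloc[data[1][0]], data[0][0]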
Step 7: Calculating Accuracy
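We calculate an "accuracy" score based on how close the recommended items are to the input data.
This simple function converts the cosine distances into percentage-style scores, so a distance of 0 becomes 100. Strictly speaking, this is a similarity score rather than accuracy in the classification sense, and because cosine distance ranges from 0 to 2, the score can fall below 0 for items that point in opposite directions.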
def get_accuracy(distances):
    accuracy = [100 - (i * 100) for i in distances]
    return accuracy
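A few sample distances show how the conversion behaves at the edges. The function is repeated here so the snippet runs on its own; the clamped variant, get_clamped_score, is a hypothetical alternative and not part of the original code.

```python
def get_accuracy(distances):
    """Convert cosine distances to percentage-style scores (same formula as above)."""
    return [100 - (d * 100) for d in distances]

def get_clamped_score(distances):
    """Hypothetical variant that floors the score at 0 for distances above 1."""
    return [max(0.0, 100 - (d * 100)) for d in distances]

# Cosine distance can be anywhere from 0 (same direction) to 2 (opposite direction).
sample_distances = [0.0, 0.25, 1.0, 1.5]

print(get_accuracy(sample_distances))       # [100.0, 75.0, 0.0, -50.0]
print(get_clamped_score(sample_distances))  # [100.0, 75.0, 0.0, 0.0]
```

Whether to clamp is a presentation choice: the unclamped score keeps more information, while the clamped one avoids showing negative percentages to users.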
Step 8: Combine All the Steps
The recommend function combines all the steps to make recommendations. It first filters the data based on tags, extracts the relevant nutrition columns, scales the data, and applies the KNN model to provide the nearest neighbors.
def recommend(dataframe, _input, include_tags=None, exclude_tags=None, n_neighbors=5):
    extracted_data = dataframe.copy()
    params = {'n_neighbors': n_neighbors, 'return_distance': True}
    if include_tags or exclude_tags:
        extracted_data = extract_ingredient_filtered_data(dataframe, include_tags, exclude_tags)
    # Never ask for more neighbors than there are candidate rows.
    params['n_neighbors'] = min(params['n_neighbors'], extracted_data.shape[0])
    if params['n_neighbors'] == 0:
        return None, None
    # Fit on the filtered rows so the neighbor indices line up with extracted_data.
    extracted_cols = extract_nutrition_columns(extracted_data)
    prep_data, scaler = scaling(extracted_cols)
    neigh = nn_predictor(prep_data)
    pipeline = build_pipeline(neigh, scaler, params)
    data, distances = apply_pipeline(pipeline, _input, extracted_data)
    return data, get_accuracy(distances)
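One detail worth seeing end to end is that the neighbor indices are positions in the matrix the model was fitted on, so the model should be fitted on the same filtered rows that are later indexed with iloc. The condensed sketch below runs the whole flow on a made-up table. It uses only three of the nine nutrition columns and a simple substring filter to stay short; both are simplifications, and every value is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Made-up recipes, for illustration only.
recipes = pd.DataFrame({
    'Name': ['Oatmeal', 'Peanut bar', 'Grilled chicken', 'Lentil soup', 'Cheesecake'],
    'Tags': ['oats, milk', 'peanut, sugar', 'chicken', 'lentils, carrot', 'cheese, sugar'],
    'Calories': [150.0, 450.0, 300.0, 220.0, 520.0],
    'FatContent': [3.0, 25.0, 8.0, 4.0, 35.0],
    'ProteinContent': [5.0, 10.0, 40.0, 14.0, 7.0],
})
nutrition_cols = ['Calories', 'FatContent', 'ProteinContent']

# 1. Filter first: drop anything tagged with peanut.
candidates = recipes[~recipes['Tags'].str.contains('peanut', case=False)]

# 2. Scale and fit on the filtered rows only.
scaler = StandardScaler()
prep = scaler.fit_transform(candidates[nutrition_cols].to_numpy())
neigh = NearestNeighbors(metric='cosine', algorithm='brute').fit(prep)

# 3. Scale the query with the same scaler and look up its neighbors.
query = np.array([[310.0, 9.0, 38.0]])   # a high-protein, moderate-calorie target
distances, indices = neigh.kneighbors(scaler.transform(query), n_neighbors=2)

# 4. The indices are positions within `candidates`, so iloc is applied to it.
top = candidates.iloc[indices[0]]
print(top[['Name']].assign(score=100 - distances[0] * 100))
```

Fitting on the full table while indexing into the filtered one would return the wrong rows, and could raise an IndexError when an index exceeds the filtered table's length.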
Conclusion
Using KNN for recommendation systems is a straightforward yet powerful approach. By combining data preprocessing, filtering, and the KNN algorithm, we've built a model that can make personalized recommendations based on nutritional content. The modular nature of this implementation makes it adaptable to various applications, from food recommendations to product suggestions.
You can easily extend this model by incorporating more features, experimenting with different distance metrics, or enhancing the filtering mechanism based on user preferences.