K-Nearest Neighbor (KNN) Algorithm: Principles and Applications

Categories: Math, Technology

The K-Nearest Neighbor (KNN) algorithm is a simple classification algorithm that leverages the similarity between data points to classify new data instances. It is based on the idea that if an object is similar to its neighboring data points, it is likely to belong to the same class as those neighbors. This report explores the KNN algorithm, discusses how to determine the value of 'k,' and provides guidelines for selecting an appropriate 'k' value.

Introduction

In the KNN algorithm, each data point is assigned to a class based on the majority class among its k-nearest neighbors.

For example, if a volleyball is more similar to a football, basketball, or tennis ball than to a mobile phone, tablet, or laptop, it is classified as a sports-related item. KNN is commonly used in applications where similarity plays a crucial role, such as recommendation systems and image classification.

The Role of 'k' in KNN Algorithm

The 'k' in KNN stands for the number of nearest neighbors used to classify or predict a test sample.

In the context of KNN, 'k' determines the number of neighboring data points considered when assigning a class label to a new data point. In a binary classification scenario, such as classifying objects into Class i and Class ii, 'k' represents the number of nearest neighbors to be considered.

Notation

In this report, we use the following notation:

  • Class i (represented by orange circles)
  • Class ii (represented by blue circles)

Determining the Value of 'k'

The choice of 'k' in KNN is crucial, as it significantly impacts the algorithm's performance.

The following steps explain how to determine an appropriate 'k' value:

  1. Split the dataset into a training dataset and a testing dataset.
  2. Apply the KNN algorithm to the training set and evaluate its predictions on the test set.
  3. Let 'n' represent the total number of data points in the dataset. The value of 'k' should range between 1 and the square root of 'n'.
  4. Iterate through different values of 'k' and calculate the accuracy for each 'k' value.

The following Python code snippet demonstrates this process:


import numpy as np
from sklearn import neighbors, preprocessing, metrics

# Assume xtrain, xtest, ytrain, ytest have already been defined
# (for example, by splitting the dataset as in step 1)
for i in range(1, 19):
    knn = neighbors.KNeighborsClassifier(n_neighbors=i)
    modelKnn = knn.fit(preprocessing.scale(xtrain), ytrain)
    predKnn = modelKnn.predict(preprocessing.scale(xtest))
    accuracy = metrics.accuracy_score(ytest, predKnn)
    print("k =", i)
    print("Accuracy =", accuracy)

In this code, 'i' represents the 'k' value, and 'accuracy' measures the performance of the KNN algorithm for each 'k' value. The 'k' value that yields the highest accuracy can be selected as the optimal 'k' value for your dataset.

Additional Considerations for 'k'

When choosing the 'k' value, keep the following points in mind:

  • 'k' should be an odd number to avoid ties when determining the class label.
  • 'k' should not be a multiple of the number of classes, which also helps prevent ties.
  • 'k' should be neither too small (the prediction becomes sensitive to noise) nor too large (the neighborhood smooths over local structure); see the sketch after this list for one way to enumerate candidate values.
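To make these guidelines concrete, here is a minimal sketch that enumerates candidate 'k' values satisfying the rules above together with the square-root bound from the previous section. The function name candidate_k_values is an illustrative assumption, not a library routine; the candidates it produces would then be scored with an accuracy loop like the one shown earlier.

import math

def candidate_k_values(n_samples, n_classes):
    # Candidate values: odd, between 1 and sqrt(n), and not a multiple of the number of classes
    upper = max(1, int(math.sqrt(n_samples)))
    return [k for k in range(1, upper + 1)
            if k % 2 == 1 and k % n_classes != 0]

# Example: 150 samples (the size of the Iris dataset) and 3 classes
print(candidate_k_values(150, 3))  # [1, 5, 7, 11]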

Finding the Nearest Neighbor

Once we have determined the value of 'k,' the next step is to find the distance between the new data point and the existing data points. The distance calculation can be performed using methods such as:

  1. Euclidean Distance
  2. Manhattan Distance

Euclidean Distance

Euclidean distance is the direct or the shortest distance between two points in Euclidean space. It is commonly used in KNN to measure the similarity between data points. The formula for calculating Euclidean distance between two data points with coordinates (x1, y1) and (x2, y2) is:

√((x1 - x2)² + (y1 - y2)²)

For example, let's calculate the Euclidean distance between two points (4, 5) and (1, 1):

√((4 - 1)² + (5 - 1)²) = √(9 + 16) = √25 = 5

Manhattan Distance

Manhattan distance, also known as taxicab or L1 distance, measures the distance between two points along the axes at right angles. It is calculated as the sum of the absolute differences of their coordinates. The formula for Manhattan distance between two points with coordinates (x1, y1) and (x2, y2) is:

|x1 - x2| + |y1 - y2|

For example, let's calculate the Manhattan distance between two points (4, 5) and (1, 1):

|4 - 1| + |5 - 1| = 3 + 4 = 7
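As a quick check of both formulas, the following minimal sketch implements the two distance measures in plain Python and reproduces the worked examples above; the function names are illustrative.

def euclidean_distance(p, q):
    # Straight-line (L2) distance between two 2D points
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def manhattan_distance(p, q):
    # Sum of absolute coordinate differences (L1 distance)
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

print(euclidean_distance((4, 5), (1, 1)))  # 5.0
print(manhattan_distance((4, 5), (1, 1)))  # 7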

Example: Shirt Size Prediction

Let's illustrate the use of Euclidean distance and Manhattan distance with an example. Consider a dataset from a clothing store where a man's shirt size is determined by his height (in cm) and weight (in kg). We have a customer with a height of 161 cm and a weight of 61 kg, and we need to determine the best-fitting shirt size for him.

Height (cm)   Weight (kg)   Shirt Size   Euclidean Distance   Manhattan Distance
158           58            M            4.24                 6
158           59            M            3.61                 5
158           63            M            3.61                 5
160           59            M            2.24                 3
160           60            M            1.41                 2
163           60            M            2.24                 3
163           61            M            2.00                 2
160           64            L            3.16                 4
163           64            L            3.61                 5
165           61            L            4.00                 4
165           62            L            4.12                 5
165           65            L            5.66                 8
168           62            L            7.07                 8
168           63            L            7.28                 9
168           66            L            8.60                 12
170           63            L            9.22                 11
170           64            L            9.49                 12
170           68            L            11.40                16

Application of KNN with 'k' = 5

Let's illustrate the application of KNN with 'k' = 5 using the Euclidean distance metric. In this scenario, we have calculated the Euclidean distance from every row in the table to the new data point (height 161 cm, weight 61 kg). By finding the 5 nearest neighbors, we can determine the class to which the new data point belongs.

Out of the 5 nearest neighbors, 4 belong to the 'M' class, and 1 belongs to the 'L' class. Based on this majority vote, we classify the man as belonging to the 'M' class and recommend 'M' size shirts for him.
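The following sketch reproduces this majority vote in plain Python using the height, weight, and shirt-size rows from the table above; the variable and function names are illustrative assumptions rather than a fixed API.

from collections import Counter

# (height in cm, weight in kg, shirt size) rows from the table above
data = [
    (158, 58, 'M'), (158, 59, 'M'), (158, 63, 'M'), (160, 59, 'M'),
    (160, 60, 'M'), (163, 60, 'M'), (163, 61, 'M'), (160, 64, 'L'),
    (163, 64, 'L'), (165, 61, 'L'), (165, 62, 'L'), (165, 65, 'L'),
    (168, 62, 'L'), (168, 63, 'L'), (168, 66, 'L'), (170, 63, 'L'),
    (170, 64, 'L'), (170, 68, 'L'),
]

new_point = (161, 61)  # the new customer
k = 5

def distance(row):
    # Euclidean distance from the new point to a table row
    return ((row[0] - new_point[0]) ** 2 + (row[1] - new_point[1]) ** 2) ** 0.5

# Take the k closest rows and vote on their shirt sizes
nearest = sorted(data, key=distance)[:k]
votes = Counter(size for _, _, size in nearest)
print(votes)                        # Counter({'M': 4, 'L': 1})
print(votes.most_common(1)[0][0])   # 'M'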

KNN Algorithm as a "Lazy Learner"

The KNN algorithm is often referred to as a "Lazy Learner" because it does not learn a discriminative function from the training data. Instead, it memorizes the training data and performs classification or prediction only when requested. There is no explicit training phase in the model; all computation occurs at prediction time.

Industrial Use of KNN Algorithm

The KNN algorithm finds extensive applications in various industrial domains:

  1. Recommendation System: Companies like Amazon utilize KNN in their recommendation systems. When customers shop online, they are presented with product recommendations related to their interests and previous purchases. These recommendations are generated using KNN, making personalized suggestions based on the behavior of similar users. Amazon derives a significant portion of its revenues from such recommendation systems.
  2. Concept Search: KNN is employed in semantic document searching and classification. With the vast amount of data available on the internet, each document may contain multiple concepts. KNN helps identify and classify documents that share similar topics or concepts. This is particularly useful when dealing with massive datasets where finding relevant concepts efficiently is essential.

For instance, when a user enters a keyword like "car," KNN can retrieve documents or concepts that are most similar to the keyword, providing relevant search results. This concept search functionality is powered by the KNN algorithm.

Additional industrial use cases of KNN algorithm include:

  • Handwriting detection, such as in Optical Character Recognition (OCR) systems.
  • Image recognition, where KNN helps identify and categorize images based on their visual features.
  • Video recognition, which involves classifying and annotating videos based on their content.

Step-by-Step Procedure to Perform a KNN Algorithm

Below is a step-by-step procedure to perform a KNN algorithm using Python. We will use the Iris Flower Dataset as an example.

Python Code


import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import neighbors, datasets
from sklearn.model_selection import cross_val_score, train_test_split

# Ensure the plots are displayed inline (Jupyter notebook only)
%matplotlib inline

Iris Flower Dataset

The Iris Flower Dataset is a well-known dataset that consists of samples from three different species of Iris flowers:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

These flowers are distinguished based on various attributes, including:

  • Sepal length (cm)
  • Sepal width (cm)
  • Petal length (cm)
  • Petal width (cm)

These attributes are used as features for classification purposes.


# Load the Iris dataset
iris = datasets.load_iris()

# Extract features and target
X = iris.data
Y = iris.target

# List of flower names
flowers = iris.target_names
print(flowers)

# List of feature names
features = iris.feature_names
print(features)

Output:


['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Plot and Feature Visualization

Since there are four features for each data point, visualizing all four dimensions simultaneously is challenging. Therefore, we will visualize two features at a time for simplicity.


# Scatter plot of sepal length vs. sepal width, colored by species
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris data 2D')

Output: a 2D scatter plot of the Iris samples in the sepal length/sepal width plane, with points colored by species.


# Create a DataFrame for feature visualization (line plot of all four features)
df = pd.DataFrame(X, columns=features)
df.plot(figsize=(11, 11))

K Nearest Neighbors (KNN)

K Nearest Neighbors (KNN) is a straightforward classification algorithm that makes predictions based on the labels of its nearest neighbors.

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (where k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.

Using KNN on the Whole Dataset

We can demonstrate the KNN algorithm by using it on the entire dataset.


# Import necessary libraries
from sklearn import neighbors
import numpy as np

# Create a KNN classifier with k=5
model = neighbors.KNeighborsClassifier(n_neighbors=5)

# Fit the model to the whole dataset
model.fit(X, Y)

# Predict the class label for a new data point
y_predicted = model.predict([[3, 5, 4, 2]])

print('Predicted Y :: {}'.format(y_predicted))
print('Target flower :: {}'.format(iris.target_names[y_predicted]))
print('Probability for each flower :: {}'.format(model.predict_proba([[3, 5, 4, 2]])))

Output:


Predicted Y :: [1]
Target flower :: ['versicolor']
Probability for each flower :: [[0. 0.8 0.2]]

Hold Out Method

The Hold Out Method is a common approach for evaluating a model's performance. It involves splitting the original training dataset into two parts:

  • Train: Used to fit the model.
  • Test: Used to evaluate the model's performance.

This method is useful for assessing how well the model generalizes to unseen data that it hasn't been trained on.


# Split the whole dataset into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3)

# Create a KNN classifier with k=3
model = neighbors.KNeighborsClassifier(n_neighbors=3)

# Fit the model to the training dataset
model.fit(X_train, Y_train)

# Predict the labels for the test dataset
Y_pred = model.predict(X_test)

# Calculate the accuracy of the model on the test set
accuracy = np.sum(Y_pred == Y_test) / len(Y_test)
accuracy_percentage = accuracy * 100
print('Test accuracy: {:.2f}%'.format(accuracy_percentage))

The printed value is the accuracy of the KNN model on the held-out test dataset; because the split is random, it will vary slightly from run to run.

k-Fold Cross Validation

k-Fold Cross Validation extends the holdout method: instead of a single train-test split, it repeats the split k times, with each fold serving once as the test set, and averages the results to evaluate the overall performance of the model.


# Import necessary libraries
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt

# Create an odd list of k values for KNN
klist = list(range(1, 50, 2))

# Create an empty list to store cross-validation scores
cv_scores = []

# Perform 10-fold cross-validation for each k value
for k in klist:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, Y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# Convert cross-validation scores to misclassification errors
errors = 1 - np.array(cv_scores)

# Determine the best k (lowest misclassification error)
optimal_k = klist[np.argmin(errors)]
print('The optimal number of neighbors is {}'.format(optimal_k))

# Plot misclassification error vs k
plt.plot(klist, errors)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

Output:


The optimal number of neighbors is 13

In this process, we perform 10-fold cross-validation for various values of k to find the optimal number of neighbors for our KNN model. The optimal number of neighbors is determined to be 13 based on the lowest misclassification error.

Conclusion

In conclusion, the K-Nearest Neighbor (KNN) algorithm is a versatile and intuitive classification algorithm that can be applied to a wide range of problems. Throughout this report, we have explored the various aspects of KNN, from its fundamental concepts to practical applications and evaluation techniques.

We began by understanding the core principle of KNN, which involves classifying data points based on the majority class among their nearest neighbors. This simple yet effective approach allows KNN to adapt to different datasets and make accurate predictions, particularly when the value of 'k' is chosen carefully.

We discussed the process of determining the optimal 'k' value, which is crucial for the performance of the KNN model. Techniques such as holdout validation, k-fold cross-validation, and error analysis were explained to guide the selection of 'k' and evaluate the model's accuracy.

Furthermore, we explored real-world applications of KNN, including recommendation systems, concept search, OCR, image recognition, and video recognition. These examples highlight the algorithm's relevance and effectiveness in solving practical problems across various industries.

As demonstrated in the report, KNN is not only a "lazy learner" but also a powerful tool for classification tasks. Its ability to adapt to different datasets, ease of implementation, and interpretability make it a valuable addition to the machine learning toolkit.

In summary, K-Nearest Neighbor (KNN) is a valuable algorithm that offers flexibility and accuracy in classification tasks. By understanding its principles, choosing the right parameters, and leveraging cross-validation techniques, practitioners can harness the full potential of KNN for a wide range of applications.
