The K-Nearest Neighbor (KNN) algorithm is a simple classification algorithm that leverages the similarity between data points to classify new data instances. It is based on the idea that if an object is similar to its neighboring data points, it is likely to belong to the same class as those neighbors. This report explores the KNN algorithm, discusses how to determine the value of 'k,' and provides guidelines for selecting an appropriate 'k' value.
In the KNN algorithm, each data point is assigned to a class based on the majority class among its k-nearest neighbors.
For example, if a volleyball is more similar to a football, basketball, or tennis ball than to a mobile phone, tablet, or laptop, it is classified as a sports-related item. KNN is commonly used in applications where similarity plays a crucial role, such as recommendation systems and image classification.
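To make the idea concrete, here is a minimal from-scratch sketch of the majority-vote rule; the toy data points, labels, and the knn_predict helper are made up for illustration (the scikit-learn implementation is used later in this report):

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy, made-up data: two features per item, two classes
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array(['sports', 'sports', 'electronics', 'electronics'])
print(knn_predict(X_train, y_train, np.array([2.0, 1.5]), k=3))  # -> sports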
The 'k' in KNN stands for the number of nearest neighbors used to classify or predict a test sample.
In the context of KNN, 'k' determines the number of neighboring data points considered when assigning a class label to a new data point. In a binary classification scenario, such as classifying objects into Class i and Class ii, 'k' represents the number of nearest neighbors to be considered.
In this report, we use the following notation:
The choice of 'k' in KNN is crucial, as it significantly impacts the algorithm's performance.
An appropriate 'k' value is usually determined empirically by evaluating the model's accuracy over a range of candidate values. The following Python code snippet demonstrates this process:
import numpy as np
from sklearn import neighbors, preprocessing, metrics

# Define your datasets: xtrain, xtest, ytrain, ytest

# Evaluate KNN for k = 1 to 18 on standardized features
for i in range(1, 19):
    knn = neighbors.KNeighborsClassifier(n_neighbors=i)
    modelKnn = knn.fit(preprocessing.scale(xtrain), ytrain)
    predKnn = modelKnn.predict(preprocessing.scale(xtest))
    accuracy = metrics.accuracy_score(ytest, predKnn)
    print("k =", i)
    print("Accuracy =", accuracy)
In this code, 'i' represents the 'k' value, and 'accuracy' measures the performance of the KNN algorithm for each 'k' value. The 'k' value that yields the highest accuracy can be selected as the optimal 'k' value for your dataset.
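If the accuracies are collected in a list, the best 'k' can also be picked programmatically; the sketch below repeats the loop with that small change (it assumes xtrain, xtest, ytrain, and ytest are already defined, as above):

import numpy as np
from sklearn import neighbors, preprocessing, metrics

# Collect the accuracy for each candidate k
accuracies = []
for i in range(1, 19):
    knn = neighbors.KNeighborsClassifier(n_neighbors=i)
    knn.fit(preprocessing.scale(xtrain), ytrain)
    accuracies.append(metrics.accuracy_score(ytest, knn.predict(preprocessing.scale(xtest))))

# k values start at 1, so the best k is the index of the highest accuracy plus one
best_k = int(np.argmax(accuracies)) + 1
print("Best k =", best_k)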
When choosing the 'k' value, keep the following points in mind:
- Prefer an odd 'k' in binary classification so that majority votes cannot end in a tie.
- A very small 'k' makes the model sensitive to noise and outliers, while a very large 'k' smooths over local structure and can blur class boundaries.
- Evaluate several candidate values (as in the loop above) and choose the one that gives the best validation accuracy.
Once we have determined the value of 'k,' the next step is to find the distance between the new data point and the existing data points. The distance can be calculated using metrics such as Euclidean distance and Manhattan distance.
Euclidean distance is the straight-line (shortest) distance between two points in Euclidean space. It is commonly used in KNN to measure the similarity between data points. The formula for calculating the Euclidean distance between two data points with coordinates (x1, y1) and (x2, y2) is:
√((x1 - x2)² + (y1 - y2)²)
For example, let's calculate the Euclidean distance between two points (4, 5) and (1, 1):
√((4 - 1)² + (5 - 1)²) = √(9 + 16) = √25 = 5
Manhattan distance, also known as taxicab or L1 distance, measures the distance between two points along the axes at right angles. It is calculated as the sum of the absolute differences of their coordinates. The formula for Manhattan distance between two points with coordinates (x1, y1) and (x2, y2) is:
|x1 - x2| + |y1 - y2|
For example, let's calculate the Manhattan distance between two points (4, 5) and (1, 1):
|4 - 1| + |5 - 1| = 3 + 4 = 7
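As a quick check of the two formulas, the following sketch computes both distances for the points (4, 5) and (1, 1) with plain NumPy:

import numpy as np

p = np.array([4, 5])
q = np.array([1, 1])

# Euclidean (L2) distance: sqrt((4 - 1)^2 + (5 - 1)^2) = 5.0
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan (L1) distance: |4 - 1| + |5 - 1| = 7
manhattan = np.sum(np.abs(p - q))

print(euclidean, manhattan)  # 5.0 7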
Let's illustrate the use of Euclidean distance and Manhattan distance with an example. Consider a dataset from a clothing store where a man's shirt size is determined by his height (in cm) and weight (in kg). We have a customer with a height of 161 cm and a weight of 61 kg, and we need to determine the best-fitting shirt size for him.
| Height (cm) | Weight (kg) | Shirt Size | Euclidean Distance | Manhattan Distance |
|---|---|---|---|---|
| 158 | 58 | M | 4.24 | 6 |
| 158 | 59 | M | 3.61 | 5 |
| 158 | 63 | M | 3.61 | 5 |
| 160 | 59 | M | 2.24 | 3 |
| 160 | 60 | M | 1.41 | 2 |
| 163 | 60 | M | 2.24 | 3 |
| 163 | 61 | M | 2.00 | 2 |
| 160 | 64 | L | 3.16 | 4 |
| 163 | 64 | L | 3.61 | 5 |
| 165 | 61 | L | 4.00 | 4 |
| 165 | 62 | L | 4.12 | 5 |
| 165 | 65 | L | 5.66 | 8 |
| 168 | 62 | L | 7.07 | 8 |
| 168 | 63 | L | 7.28 | 9 |
| 168 | 66 | L | 8.60 | 12 |
| 170 | 63 | L | 9.22 | 11 |
| 170 | 64 | L | 9.49 | 12 |
| 170 | 68 | L | 11.40 | 16 |
Let's illustrate the application of KNN with 'k' = 5 using the Euclidean distance metric. The table above lists the Euclidean distance from each existing data point to the new customer (161 cm, 61 kg). By taking the 5 nearest neighbors, we can determine the class to which the new data point belongs.
Out of the 5 nearest neighbors, 4 belong to the 'M' class, and 1 belongs to the 'L' class. Based on this majority vote, we classify the man as belonging to the 'M' class and recommend 'M' size shirts for him.
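The same result can be reproduced with a short sketch; the arrays below simply restate the table above, and the variable names are chosen here for illustration:

import numpy as np
from collections import Counter

# (height_cm, weight_kg) pairs and shirt sizes from the table above
data = np.array([
    [158, 58], [158, 59], [158, 63], [160, 59], [160, 60], [163, 60],
    [163, 61], [160, 64], [163, 64], [165, 61], [165, 62], [165, 65],
    [168, 62], [168, 63], [168, 66], [170, 63], [170, 64], [170, 68],
])
sizes = np.array(['M', 'M', 'M', 'M', 'M', 'M', 'M',
                  'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L'])

customer = np.array([161, 61])

# Euclidean distance to every row, then take the 5 nearest neighbors
distances = np.linalg.norm(data - customer, axis=1)
nearest = np.argsort(distances)[:5]

# Majority vote: 4 of the 5 nearest neighbors are 'M'
print(Counter(sizes[nearest]).most_common(1)[0][0])  # -> M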
KNN Algorithm as a "Lazy Learner"
The KNN algorithm is often referred to as a "lazy learner" because it does not build a discriminative function from the training data. Instead, it memorizes the training data and performs classification or prediction only when requested. There is no explicit learning phase; all computation occurs at prediction time.
The KNN algorithm finds extensive applications in various industrial domains, a prominent one being concept (similarity-based) search.
For instance, when a user enters a keyword like "car," KNN can retrieve documents or concepts that are most similar to the keyword, providing relevant search results. This concept search functionality is powered by the KNN algorithm.
Additional industrial use cases of the KNN algorithm include recommendation systems, optical character recognition (OCR), image recognition, and video recognition.
Below is a step-by-step procedure for applying the KNN algorithm in Python. We will use the Iris Flower Dataset as an example.
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import neighbors, datasets
from sklearn.model_selection import cross_val_score, train_test_split
# Ensure the plots are displayed inline
%matplotlib inline
The Iris Flower Dataset is a well-known dataset that consists of samples from three species of Iris flowers: Iris setosa, Iris versicolor, and Iris virginica.
These species are distinguished based on four attributes: sepal length, sepal width, petal length, and petal width (all measured in cm).
These attributes are used as features for classification purposes.
# Load the Iris dataset
iris = datasets.load_iris()

# Extract features and target
X = iris.data
Y = iris.target

# List of flower names
flowers = iris.target_names
print(flowers)

# List of feature names
features = iris.feature_names
print(features)
Output:
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Since there are four features for each data point, visualizing all four dimensions simultaneously is challenging. Therefore, we will visualize two features at a time for simplicity.
# Scatter plot of Sepal length vs. Sepal width
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Iris data 2D')
Output: a 2D scatter plot of sepal length vs. sepal width, with points colored by species.
# Create a DataFrame for feature visualization
df = pd.DataFrame(X, columns=features)
df.plot(figsize=(11, 11))
K Nearest Neighbors (KNN) is a straightforward classification algorithm that makes predictions based on the labels of its nearest neighbors.
In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (where k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.
We can demonstrate the KNN algorithm by using it on the entire dataset.
# Import necessary libraries
from sklearn import neighbors
import numpy as np

# Create a KNN classifier with k=5
model = neighbors.KNeighborsClassifier(n_neighbors=5)

# Fit the model to the dataset
model.fit(X, Y)

# Predict the class label for a new data point
y_predicted = model.predict([[3, 5, 4, 2]])
print('Predicted Y :: {}'.format(y_predicted))
print('Target flower :: {}'.format(iris.target_names[y_predicted]))
print('Probability for each flower :: {}'.format(model.predict_proba([[3, 5, 4, 2]])))
Output:
Predicted Y :: [1]
Target flower :: ['versicolor']
Probability for each flower :: [[0. 0.8 0.2]]
The Hold Out Method is a common approach for evaluating a model's performance. It involves splitting the original dataset into two parts: a training set used to fit the model and a test set used to evaluate it.
This method is useful for assessing how well the model generalizes to unseen data that it hasn't been trained on.
# Create a fake new data point
new_observation = [[4.9, 2.7]]

# Split the whole dataset into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3)

# Create a KNN classifier with k=3
model = neighbors.KNeighborsClassifier(n_neighbors=3)

# Fit the model to the training dataset
model.fit(X_train, Y_train)

# Predict the labels for the test dataset
Y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = np.sum(Y_pred == Y_test) / len(Y_test)
accuracy_percentage = accuracy * 100
print('The accuracy of the KNN model on the test dataset is {:.2f}%'.format(accuracy_percentage))
k-Fold Cross Validation extends the holdout method: instead of a single train-test split, the data is split into k folds, each fold serves once as the test set, and the results are averaged to evaluate the overall performance of the model.
# Import necessary libraries
from sklearn import neighbors
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt

# Create an odd list of k values for KNN
klist = list(range(1, 50, 2))

# Create an empty list to store cross-validation scores
cv_scores = []

# Perform 10-fold cross-validation for each k value
for k in klist:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, Y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# Convert cross-validation scores to misclassification errors
errors = 1 - np.array(cv_scores)

# Determine the best k
optimal_k = klist[np.argmin(errors)]
print('The optimal number of neighbors is {}'.format(optimal_k))

# Plot misclassification error vs k
plt.plot(klist, errors)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
Output:
The optimal number of neighbors is 13
In this process, we perform 10-fold cross-validation for various values of k to find the optimal number of neighbors for our KNN model. The optimal number of neighbors is determined to be 13 based on the lowest misclassification error.
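As a final step, the model can be refit with the selected 'k'; the sketch below assumes X, Y, and optimal_k from the code above and uses a fresh hold-out split as a sanity check:

from sklearn import neighbors, metrics
from sklearn.model_selection import train_test_split

# Refit KNN with the k chosen by cross-validation and evaluate on a hold-out split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)
final_model = neighbors.KNeighborsClassifier(n_neighbors=optimal_k)
final_model.fit(X_train, Y_train)
accuracy = metrics.accuracy_score(Y_test, final_model.predict(X_test))
print('Hold-out accuracy with k = {}: {:.2f}'.format(optimal_k, accuracy))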
In conclusion, the K-Nearest Neighbor (KNN) algorithm is a versatile and intuitive classification algorithm that can be applied to a wide range of problems. Throughout this report, we have explored the various aspects of KNN, from its fundamental concepts to practical applications and evaluation techniques.
We began by understanding the core principle of KNN, which involves classifying data points based on the majority class among their nearest neighbors. This simple yet effective approach allows KNN to adapt to different datasets and make accurate predictions, particularly when the value of 'k' is chosen carefully.
We discussed the process of determining the optimal 'k' value, which is crucial for the performance of the KNN model. Techniques such as holdout validation, k-fold cross-validation, and error analysis were explained to guide the selection of 'k' and evaluate the model's accuracy.
Furthermore, we explored real-world applications of KNN, including recommendation systems, concept search, OCR, image recognition, and video recognition. These examples highlight the algorithm's relevance and effectiveness in solving practical problems across various industries.
As demonstrated in the report, KNN is not only a "lazy learner" but also a powerful tool for classification tasks. Its ability to adapt to different datasets, ease of implementation, and interpretability make it a valuable addition to the machine learning toolkit.
In summary, K-Nearest Neighbor (KNN) is a valuable algorithm that offers flexibility and accuracy in classification tasks. By understanding its principles, choosing the right parameters, and leveraging cross-validation techniques, practitioners can harness the full potential of KNN for a wide range of applications.