k-Nearest Neighbors is a supervised machine learning algorithm for object classification that is widely used in data science and business analytics.
In this post, I will show how to use R’s knn() function, which implements the k-Nearest Neighbors (kNN) algorithm, in a simple scenario that you can extend to cover your more complex and practical scenarios. R is free, and kNN has not been patented by some evil patent trolls (“patent assertion entities”), so there are no legal or other restrictions to stop us from going ahead with the demonstration.
kNN has many useful characteristics, one of which is its relative insensitivity to outliers, which makes it resilient to errors in the classification data (the supervised learning phase). As a downside, the algorithm is noted for its CPU and memory greediness. The R interpreter may also fail to take advantage of the multi-core CPU capabilities of modern computers. If and when such constraints hamper your data analytics work, the kNN algorithm can be implemented in a language that supports the creation of multi-threaded applications (e.g. Java or C#) that would evenly load all the CPU cores of your computer and improve its computational throughput.
To simplify the demonstration of kNN and make it easy to follow, we will have only two classes used in object classification, which we label A and B.
Objects in classes A and B have two numeric attributes/properties that we map to X and Y Cartesian coordinates so that we can plot class instances (cases) as points on a 2-D chart. In other words, our cases are represented as points with X and Y coordinates (p(X,Y)).
Our simple classes A and B will have 3 object instances (cases) each.
Class A will include points with coordinates (0,0), (1,1), and (2,2).
Class B will include points with coordinates (6,6), (5.5, 7), and (6.5, 5).
In R, we can write down the above arrangement as follows:
# Class A training object instances (cases)
A1=c(0,0)
A2=c(1,1)
A3=c(2,2)
# Class B training objects instances (cases)
B1=c(6,6)
B2=c(5.5,7)
B3=c(6.5,5)
Here is how the classification training objects for class A and class B are arranged on the chart.
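If you would like to reproduce that chart yourself, here is a minimal plotting sketch using base R graphics (the colors, plot symbols, and axis limits are my own choices, not part of the algorithm):

# Plot class A points as blue circles and class B points as red triangles
plot(rbind(A1, A2, A3), col = "blue", pch = 1,
     xlim = c(0, 8), ylim = c(0, 8), xlab = "X", ylab = "Y")
points(rbind(B1, B2, B3), col = "red", pch = 2)
legend("topleft", legend = c("Class A", "Class B"),
       col = c("blue", "red"), pch = c(1, 2))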
The knn() function is housed in the class package and is invoked as follows:
knn(train, test, cl, k)
where
train is a matrix or a data frame of training cases
test is a matrix or a data frame of test case(s) (one or more rows)
cl is a vector of classification labels (with one element for each row/case of the training data set)
k is the number of nearest neighbors to include in the vote (the k in the k-Nearest Neighbors algorithm); normally, it is a small odd integer
For a full description of the knn() function, read its help page, which you can see by typing ?knn in the R console.
The points (cases) from both classification classes A and B must be packed into a single training matrix for the classification exercise.
In R you do this with a one-liner:
train=rbind(A1,A2,A3, B1,B2,B3)
This command will build the following matrix (the row names are taken from the variable names):

   [,1] [,2]
A1  0.0    0
A2  1.0    1
A3  2.0    2
B1  6.0    6
B2  5.5    7
B3  6.5    5
Now, when we classify a test object (with properties expressed as X and Y coordinates), the kNN algorithm computes the Euclidean distance from the test point to every row (case) in the training matrix, and then picks the single closest case for k = 1, or takes a majority vote among the k closest cases for k > 1 (which is why k is normally chosen to be an odd number).
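For intuition, here is a quick sketch of that distance computation done by hand for the test point (4, 4) that we will use shortly (knn() performs the equivalent calculation for you):

# Euclidean distance from the test point to every training row
test = c(4, 4)
round(sqrt(rowSums(sweep(train, 2, test)^2)), 2)
#   A1   A2   A3   B1   B2   B3
# 5.66 4.24 2.83 2.83 3.35 2.69

The smallest distance belongs to B3, which is why this test point will end up classified as B for k = 1.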
We also need to construct the cl parameter (the vector of classification labels). We can do this with this command:
cl=factor(c(rep("A",3),rep("B",3)))
This command uses the factor() function to create a factor vector (a vector of discrete, enumerated values) whose values serve as class labels. In our setup, it has two levels: A and B.
The rep() function replicates its first argument the number of times given by its second argument.
The resulting factor vector has the following content: A A A B B B
(3 As followed by 3 Bs to match the layout of our train cases – we have 3 cases of A followed by 3 cases of B).
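You can confirm the content and the levels by printing cl in the console:

cl
# [1] A A A B B B
# Levels: A B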
To run the knn() function, we need to supply the test case(s). In our runs, we will start with a single test object that we create as follows:
test=c(4,4)
which corresponds to a point sitting roughly midway between the class A and class B clusters.
At this point, we have everything in place to run knn(). Let’s do it for k = 1 (classification by its proximity to a single neighbor).
For more informative reports of test object classification results, we are going to use the summary() function, which acts polymorphically on its input parameter depending on its type. In our case, the input to summary() is the output of the knn() function.
Here is how the nested call looks:
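If you are curious what summary() does with a factor, try it on a small hand-made one first; it simply counts how many times each level occurs:

summary(factor(c("A", "B", "B")))
# A B
# 1 2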
summary(knn(train, test, cl, k = 1))
Here is the complete code listing that you can copy and paste into the R console.
# Class A cases
A1=c(0,0)
A2=c(1,1)
A3=c(2,2)
# Class B cases
B1=c(6,6)
B2=c(5.5,7)
B3=c(6.5,5)
# Build the classification matrix
train=rbind(A1,A2,A3, B1,B2,B3)
# Class labels vector (attached to each class instance)
cl=factor(c(rep("A",3),rep("B",3)))
# The object to be classified
test=c(4, 4)
# Load the class package that holds the knn() function
library(class)
# call knn() and get its summary
summary(knn(train, test, cl, k = 1))
# End of listing
After pasting the above code in the R console, press ENTER to submit it to the R interpreter for execution.
You should see the following output in your console:
A B
0 1
This result indicates that the test case has been classified as belonging to class B.
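As a side note, knn() also accepts a prob = TRUE argument that attaches the proportion of votes won by the winning class to the result; with k = 1 this proportion is always 1, but it becomes informative for larger k:

knn(train, test, cl, k = 1, prob = TRUE)
# [1] B
# attr(,"prob")
# [1] 1
# Levels: A B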
Type the following command in the console:
test=c(3.5, 3.5)
Visually, this test point looks closer to the cluster of class A cases (points). Let’s verify our assumption.
Repeat the same knn summary command as we did a moment ago:
summary(knn(train, test, cl, k = 1))
The result comes back as we expected:
A B
1 0
The point has been classified as belonging to class A.
Let’s increase the number of closest neighbors that are involved in voting during the classification step.
Type in the following command:
summary(knn(train, test, cl, k = 3))
Now more neighbors join the vote, and the outcome can change. For the point (3.5, 3.5), the two nearest neighbors are A3 and B3, while A2 and B1 sit at exactly the same distance in third place. knn() includes both tied candidates in the vote (its default is use.all = TRUE), which produces a 2-2 tie that is broken at random, so this particular test point can come back as either class. The run below happened to return B:
A B
0 1
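You can watch this tie-breaking at work by repeating the classification a number of times and tabulating the outcomes (a quick sketch; your exact counts will differ from run to run):

table(replicate(20, as.character(knn(train, test, cl, k = 3))))
#  A  B
#  9 11   (example output only)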
If you wish, you can further experiment with the k number (this is one of the exercises data scientists perform when tuning a classification model for a better fit), as in the sketch below.
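For instance, here is a small sketch that classifies the same test point for several values of k in one loop:

# Classify the (3.5, 3.5) test point with k = 1, 3, and 5
for (k in c(1, 3, 5)) {
  cat("k =", k, "->", as.character(knn(train, test, cl, k = k)), "\n")
}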
Now, let’s build a matrix of test cases containing four points:
(4,4) - should be classified as B
(3,3) - should be classified as A
(5,6) - should be classified as B
(7,7) - should be classified as B
As a result, we expect our test batch to have three points classified as belonging to class B and one to class A.
Type in the following code:
test = matrix(c(4,4, 3,3, 5,6, 7,7), ncol=2, byrow=TRUE)
This command packs our points into a two-column matrix, one test point per row, with the columns holding the X and Y coordinates.
Now run again the previous knn summary command:
summary(knn(train, test, cl, k = 3))
R should print the following summary of its classification job:
A B
1 3
This output matches our expectations.
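The summary gives only the per-class counts; if you also want to see which label each individual test point received, here is a quick sketch:

# Pair every test point with its predicted class label
pred = knn(train, test, cl, k = 3)
data.frame(x = test[, 1], y = test[, 2], class = pred)
#   x y class
# 1 4 4     B
# 2 3 3     A
# 3 5 6     B
# 4 7 7     B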
I hope you are now well equipped to start applying R’s knn() function in your problem domain.