# Machine learning k nearest neighbor


So far in this series we have been using a dataset that contains each person’s age, salary, and whether that person bought the product. Using this dataset, let’s build a classification model. The classifier we use here is KNN, short for “k-nearest neighbors”.

First, let’s understand KNN in technical terms: the outcome for a new data point is predicted from the characteristics of its nearest neighbors, that is, the existing data points most similar to it.

To make this concrete, here is a simple example. Suppose that salary alone lets us predict whether a person will buy the product. Now imagine that a new employee record is added. If the new record has a salary of, let’s say, more than 5000, and the people with similar salaries mostly bought the product, we predict that the new employee will buy it too. You earn more, you buy more; as simple as that. K nearest neighbor performs the same kind of reasoning.

xSalary = datasetSalary.iloc[:, :-1].values

ySalary = datasetSalary.iloc[:, -1].values

from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(xSalary, ySalary, test_size=0.2, random_state=0)

from sklearn.preprocessing import StandardScaler

sc=StandardScaler()

xTrain=sc.fit_transform(xTrain)

xTest=sc.transform(xTest)

We have already covered “fit_transform” and “transform” in an earlier post: the scaler is fitted on the training set only, and the same scaling is then applied to the test set.
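As a reminder of what the two calls do, here is a minimal sketch in plain Python. The salary values are made up for illustration and are not from our dataset; the point is only that the mean and standard deviation come from the training data and are reused on the test data.

```python
from statistics import mean, pstdev

# Hypothetical salary values, not taken from the post's dataset.
train_salaries = [2000.0, 4000.0, 6000.0, 8000.0]
test_salaries = [5000.0, 9000.0]

# "fit": learn the mean and standard deviation from the training data only.
train_mean = mean(train_salaries)
train_std = pstdev(train_salaries)  # population std, as StandardScaler uses

# "transform": apply the learned statistics to the training data...
train_scaled = [(value - train_mean) / train_std for value in train_salaries]

# ...and reuse the SAME statistics on the test data (no second fit).
test_scaled = [(value - train_mean) / train_std for value in test_salaries]

print(train_scaled)
print(test_scaled)
```

Scaling matters for KNN in particular, because a feature with large values such as salary would otherwise dominate the distance over a feature with small values such as age.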

We are using the train_test_split function from sklearn, which splits the dataset into two parts: a training set and a test set. With test_size=0.2, 20% of the rows go to the test set and the rest are used for training.
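To get a feel for what such a split does, here is a rough sketch in plain Python. The helper simple_split is hypothetical and written only for this illustration; sklearn shuffles differently, so the exact rows it picks will not match.

```python
import random

def simple_split(rows, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle the rows, then cut off a test portion."""
    shuffled = list(rows)
    # A fixed seed plays the role of random_state: same split on every run.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for ten dataset rows
train_rows, test_rows = simple_split(rows, test_size=0.2, seed=0)

print(len(train_rows), len(test_rows))
```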

Now for the k-nearest neighbor classification model. Picture a plot that contains two groups: one group is in red and the second is in blue. Now let’s say a new person or item is added, shown in green.

So now we have to decide which group this person or item belongs to. This is where KNN comes in; the K is the number of neighbors, set with “n_neighbors=5”.

First, choose the number K of neighbors. By default, it is five.

Second, take the K nearest neighbors of the new data point, measured by Euclidean distance. Euclidean distance is the most common choice, but you can use other distance metrics.

Euclidean distance: the classic distance, equal to the square root of the sum of squared differences between coordinates.

Third, among these K neighbors, count the number of data points in each category.

Fourth, assign the new data point to the category where you counted the most neighbors.

With that, we can see that our new member belongs to the red group.
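The four steps can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function knn_predict and the red, blue, and green points are made up, and sklearn’s real implementation is far more optimized.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Hypothetical helper: classify new_point by majority vote of k neighbors."""
    # Step 2: Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count how many of those neighbors fall in each category.
    votes = Counter(label for _, label in nearest)
    # Step 4: assign the category with the most neighbors.
    return votes.most_common(1)[0][0]

# Made-up points on comparable scales (so no feature scaling is needed here).
red_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (2.5, 2.5)]
blue_points = [(6.0, 6.0), (6.5, 7.0), (7.0, 6.5), (7.5, 7.5)]
train_points = red_points + blue_points
train_labels = ["red"] * len(red_points) + ["blue"] * len(blue_points)

# Step 1: choose K (five here), then classify the new "green" point.
green_point = (3.0, 3.0)
predicted_group = knn_predict(train_points, train_labels, green_point, k=5)
print(predicted_group)
```

With these points, four of the five nearest neighbors of the green point are red, so it is placed in the red group.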

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)

classifier.fit(xTrain,yTrain)

yPred=classifier.predict(xTest)

We have already discussed the first parameter, n_neighbors.

Second: weights is the weight function used in prediction. By default it is “uniform”, which means all the points in each neighborhood are weighted equally.

Third: algorithm is the method used to compute the nearest neighbors. “auto” is the default and usually the best option, because it tries to choose the most appropriate algorithm based on the values passed to the fit method.

Fourth: leaf_size, which is used by the tree-based algorithms; keep the default value of 30.

Fifth: p is the power parameter for the Minkowski metric. With p=2 it is the Euclidean distance; with p=1 it is the Manhattan distance.

metric is the distance used to compare an observation with its neighbors. To use the Euclidean distance, we choose ‘minkowski’ with p=2.
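To see why ‘minkowski’ with p=2 gives the Euclidean distance, here is a small sketch. The function minkowski_distance is my own illustration of the formula, not sklearn’s code, and the two points are made up.

```python
import math

def minkowski_distance(a, b, p=2):
    """Illustrative formula: (sum of |a_i - b_i|^p) raised to the power 1/p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

# p=2: square root of the sum of squared differences (Euclidean).
euclidean = minkowski_distance(point_a, point_b, p=2)

# p=1: sum of absolute differences (Manhattan).
manhattan = minkowski_distance(point_a, point_b, p=1)

print(euclidean, manhattan)
```

For these two points the p=2 result matches the familiar 3-4-5 triangle, while p=1 simply adds the horizontal and vertical differences.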

Now let’s train our model.

classifier.fit(xTrain,yTrain)

Now let’s check the accuracy of our classifier by predicting on the test data. The classifier has a predict method, which takes a NumPy array as input and returns another NumPy array as output.

We can also check the predicted probability for all the test data. It is helpful when we need to sort or rank records by prediction, for example by how likely each customer is to buy.

yProb=classifier.predict_proba(xTest)[:,1]
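Here is a small sketch of where these probabilities come from, using a made-up one-feature dataset (the names x_toy, y_toy and toy_classifier are only for this illustration). With uniform weights, the probability for a class is the share of the K neighbors that belong to it, and column 1 holds the probability of the second class, here “will buy”.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up data with one feature: 0 = will not buy, 1 = will buy.
x_toy = [[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_classifier = KNeighborsClassifier(n_neighbors=3)
toy_classifier.fit(x_toy, y_toy)

# For 4.2 the three nearest points are 3.0 (class 0), 6.0 (class 1)
# and 2.0 (class 0), so one neighbor out of three votes "will buy".
new_customers = [[4.2], [7.5], [1.5]]
buy_probability = toy_classifier.predict_proba(new_customers)[:, 1]

# Rank the customers from most to least likely to buy.
ranking = sorted(
    range(len(new_customers)),
    key=lambda index: buy_probability[index],
    reverse=True,
)
print(buy_probability, ranking)
```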

To check the accuracy of the model, we use a confusion matrix. It is a statistical technique for summarizing the performance of a classification model: it counts the true positives, true negatives, false positives, and false negatives.

Once we know all four counts, we can easily determine the accuracy:

(True Positive + True Negative) / (True Positive + False Negative + False Positive + True Negative)
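As a worked example of this formula, here is a short sketch in plain Python. The actual and predicted labels are invented for illustration and are not results from our model.

```python
# Made-up labels: 1 = bought the product, 0 = did not buy.
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

pairs = list(zip(actual_labels, predicted_labels))

# Count each of the four outcome types.
true_positive = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
true_negative = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
false_positive = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
false_negative = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

# Accuracy: correct predictions divided by all predictions.
accuracy = (true_positive + true_negative) / (
    true_positive + false_negative + false_positive + true_negative
)
print(true_positive, true_negative, false_positive, false_negative, accuracy)
```

In this example 8 of the 10 predictions are correct, so the accuracy is 0.8.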

scikit-learn provides a function that generates the confusion matrix from the actual and predicted data.

from sklearn.metrics import confusion_matrix

conMat=confusion_matrix(yTest,yPred)

To calculate the accuracy of the model:

from sklearn.metrics import accuracy_score

print(accuracy_score(yTest,yPred))