How to split data into training and testing in Python

When you are learning machine learning, a question you face sooner or later is: are we supposed to perform feature scaling before or after splitting the dataset, and why? Here is the answer.

First of all, before diving deeper, we need to understand the basic definition of FEATURE SCALING.

Feature scaling is a technique performed during the data pre-processing step. It normalizes the range of your data: in short, it rescales all of your variables, or features, so that they all take values on the same scale. Common methods, such as standardization, use the mean and standard deviation of each feature to do this.
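For example, standardization, one common scaling method, subtracts each feature's mean and divides by its standard deviation. A minimal sketch with a made-up age column (the numbers are invented for illustration):

```python
import numpy as np

# Standardization rescales a feature to zero mean and unit variance:
#   x_scaled = (x - mean) / std
ages = np.array([20.0, 25.0, 24.0, 29.0])
scaled = (ages - ages.mean()) / ages.std()

# After scaling, the feature has mean 0 and standard deviation 1,
# so it sits on the same scale as any other standardized feature.
print(round(scaled.mean(), 10))  # 0.0
print(round(scaled.std(), 10))   # 1.0
```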

Generally speaking, a machine learning model may not work well when the input numerical values are on very different scales. Why? Because a feature with a large range can dominate the others, and the dominated features would then effectively be neglected by the model.

So, the answer to the question is: apply feature scaling after splitting the dataset. We scale after the split precisely to prevent information leakage from the test set, which you are not supposed to touch until training is completely done.
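A minimal sketch of the right order, using scikit-learn's StandardScaler on a made-up dataset (all numbers invented for illustration): fit the scaler on the training set only, then reuse those training statistics on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (age, salary) and labels, purely illustrative.
X = np.array([[20, 5000], [25, 8000], [24, 5000], [29, 6000], [22, 3000]], dtype=float)
y = np.array([0, 1, 0, 1, 0])

# Split FIRST: 20% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale SECOND: the scaler learns mean/std from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and the test set is transformed with those same training statistics,
# so no information from the test set leaks into preprocessing.
X_test_scaled = scaler.transform(X_test)
```

Calling `fit_transform` on the test set instead would recompute the mean and standard deviation from data the model is never supposed to see during training, which is exactly the leakage the split is meant to prevent.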

Splitting divides the dataset into two parts: 1) the training dataset and 2) the test dataset.

Why do we need two parts? And how do we split the dataset into train and test?

In the training dataset, you will train your model on existing observations.

On the test dataset, you evaluate the performance of your model on new observations. These new observations stand in for the future data that your model will receive once you deploy it.

Now how can we create the test set?

Creating the test set is very easy: you just pick some records at random, typically 20% of the dataset, and set them aside. To perform this operation we will use the sklearn library.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

Note the order of the returned values: train_test_split returns the training and test portions of the feature matrix first, then the training and test portions of the dependent variable.


[[0.0 0.0 1.0 20.0 nan]
[1.0 0.0 0.0 25.0 8000.0]
[0.0 0.0 1.0 24.0 5000.0]
[1.0 0.0 0.0 20.0 4000.0]
[0.0 0.0 1.0 29.0 6000.0]
[1.0 0.0 0.0 nan 3000.0]
[0.0 1.0 0.0 22.0 6000.0]
[0.0 0.0 1.0 21.0 3000.0]]

How to use sklearn in Python

sklearn is a very powerful library for machine learning in Python. It comes pre-installed with Anaconda's Spyder; if you are using a different Python environment, you have to install it with pip (pip install scikit-learn). pip is the command used to install any Python library. Anaconda's Spyder comes pre-packaged with sklearn, pandas, and many other libraries required for scientific computing and machine learning. Here we have used sklearn's train_test_split function to break the dataset into two parts.

After performing the split, we will have 4 sets: 2 for the matrix of features and 2 for the dependent variable, which will serve as the input for our model.
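To make the four outputs concrete, here is a small sketch with an invented 10-row dataset; with test_size=0.2, eight rows go to training and two to testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 rows, 3 features each, with matching labels.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 80% of the rows end up in the training sets, 20% in the test sets.
print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```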

In the above example I have printed only one of the sets, but you can try printing the others on your own. Also remember, we have already discussed X and Y in our previous blog post. If you run into any problem, try reading that first.
