How to handle missing data in machine learning

In the previous blog post on machine learning, we saw how to import a library into Google Colab and how to run your first machine learning program using the data in a CSV file. If you haven’t read that post, I urge you to read it first. CLICK HERE

In today’s post, I will explain how to work with your CSV file when some of its rows have missing data, that is, when some cells are blank.

A computer cannot guess what belongs in a blank cell the way a person might. So, when the model finds a missing value in your dataset, it does not know how to work with it. We have to give it some instructions.

As you can see in the above image, two rows in the dataset contain blank values. It is best not to have this type of blank space / missing data in your dataset.

To get various statistics about the data frame, such as how many rows it has and the mean and standard deviation of each numeric column, you can use the pandas method dataset.describe().

You can also get additional information, such as the data type of each column, using dataset.info().

We have already run dataset.head() above; it gives you the first five rows.
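The inspection steps above can be sketched as follows. The small data frame here is a hypothetical stand-in for the CSV file (its column names and values are assumptions, not the real dataset), and the final `isna().sum()` call is one common pandas way to count blanks per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV used in this post (values are made up).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France"],
    "Age": [44.0, 27.0, 30.0, np.nan, 35.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0, 58000.0],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

print(dataset.head())      # first five rows
dataset.info()             # column data types and non-null counts
print(dataset.describe())  # count, mean, std, min, quartiles, max of numeric columns

# Count the missing values in each column.
missing_per_column = dataset.isna().sum()
print(missing_per_column)
```

If a column reports a non-zero count here, that column needs one of the treatments discussed in this post.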

If only up to about 1% of the rows have blank values, it is fine to remove them; your model will not be noticeably affected.

Yes, there are pros and cons. The advantage is that deleting rows or columns that carry no useful information is the simplest fix.

The disadvantage is that when we remove a row or column, we also lose the other, valid data in it. Apply this technique only when you have a large dataset.
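A minimal sketch of the deletion approach, using pandas `dropna()`. The data frame is again a hypothetical toy example, so its share of incomplete rows is far above the 1% guideline; it is only meant to show the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
dataset = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, np.nan, 35.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0, 58000.0],
})

# Share of rows that contain at least one missing value.
missing_row_share = dataset.isna().any(axis=1).mean()
print(f"{missing_row_share:.0%} of rows have a missing value")

# Drop every row that has a missing value in any column.
cleaned = dataset.dropna()
print(cleaned)
```

Note that the row with a missing Age also had a valid Salary, and that value is lost too. This is the disadvantage described above.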

Now, if more than 1% of the rows in your dataset have missing values, we will use a library (scikit-learn).

This library will help us compute the average value of a particular column, and we will put that average into the empty cells of that column.

You may now be asking whether this is the only way to work with missing data. The answer is NO. There are other options, which I discuss further down, but before that let me give you an overview of scikit-learn.

scikit-learn: –

This library is useful for many operations related to data pre-processing.

It is written mainly in Python and built on NumPy and SciPy.

It is open-source and commercially usable – BSD license.

To handle the missing data, I am going to use this library’s “impute” module, which provides the “SimpleImputer” class. Check the image below.

The next step is to create an object of this class. As the first argument, we specify which value counts as missing (missing_values=np.nan). Since we want to replace the missing values with the MEAN, the second argument specifies that (strategy='mean'). Watch the above image very carefully.
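In code, creating the object might look like this short sketch. The variable name `imputer` is my own choice; `SimpleImputer`, `missing_values`, and `strategy` are the scikit-learn names mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Treat np.nan as "missing" and plan to replace it with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imputer)
```

At this point nothing has been calculated yet; the object only stores the settings.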

All is set; now we have to apply this object to our matrix of features (Any doubt, click here). Make sure that you pass only the NUMERICAL columns, not the string ones.

The fit method connects the object with the matrix of features (here it is X). It looks at the numeric columns and calculates the average Age and Salary from the values that are present. IN SHORT, IT PERFORMS THE CALCULATION.

We have calculated the values but not yet placed them in the empty cells, so we need another method named “transform” to fill in the missing data.

Be very careful when you specify the column range: the upper bound of a range in Python is EXCLUDED, so X[:, 1:3] selects columns 1 and 2 only.

The “transform” method actually performs the replacement and returns the filled-in columns.
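Putting fit and transform together, here is a minimal sketch. The matrix X below is a hypothetical example with a text column at index 0 and the numeric Age and Salary columns at indexes 1 and 2; the values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical matrix of features: Country, Age, Salary.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", np.nan, 61000.0],
    ["France", 35.0, 58000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit calculates the mean of each selected column, ignoring missing entries.
# 1:3 selects columns 1 and 2 only (the upper bound is excluded).
imputer.fit(X[:, 1:3])
print(imputer.statistics_)  # the calculated column means

# transform returns the columns with missing entries replaced by those means;
# we assign the result back into X.
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
```

The text column is left out of the slice on purpose, because a mean cannot be calculated for strings.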

Other ways to handle missing data

1) We have already covered the 1st option, deleting the record, and the 2nd, replacing with the mean.

2) Other options are as below

  • Use a unique category
  • Predict the missing value
  • Create a separate model. This takes time, but for practice and for your own understanding you can try this method too.
  • Various algorithms are available; use any of them.
  • Median statistical method: – Sort the column in ascending order; the central element is placed in the missing cell.
  • Mode statistical method: – The mode is the value with the highest frequency. For example, if 30 occurs most often in your column (say 5 times), then 30 is placed in the missing cell.
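Three of the options in this list can be sketched as follows, assuming a made-up Age column and a made-up Country column. The median and mode use the `SimpleImputer` strategies "median" and "most_frequent"; the unique category is shown with pandas `fillna`, and the label "Unknown" is only an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical Age column with one missing entry (values are made up).
ages = np.array([[30.0], [30.0], [45.0], [np.nan], [52.0]])

# Median: the middle of the sorted present values 30, 30, 45, 52.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
ages_median = median_imputer.fit_transform(ages)

# Mode: the most frequent present value.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
ages_mode = mode_imputer.fit_transform(ages)

# Unique category: mark missing text entries with their own label.
countries = pd.Series(["France", None, "Spain", "France"])
countries_filled = countries.fillna("Unknown")

print(ages_median.ravel())
print(ages_mode.ravel())
print(countries_filled.tolist())
```

The median is less sensitive to extreme values than the mean, and the mode and unique category also work for columns that are not numeric.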
