How to use OneHotEncoder in sklearn
Encoding categorical data is another very important topic for your machine learning model. But before we dive into the programming, let us understand it through an everyday example.
In each of my posts I assume the reader is a novice. So before teaching a topic, I compare it to everyday life.
The example given here is simply meant to help you understand.
- Suppose your body is accustomed to eating only vegetarian food, and one day you suddenly eat non-vegetarian food. The digestive organs in your body will have difficulty functioning, because your body is not accustomed to it. So your body wants to be given food it can handle, so that it can do its job well.
- For better digestion of your food, you can adopt different methods, such as:
1) Walk a little.
2) Exercise or do some yoga.
- Now in technical terms: your machine learning model needs input in a form it can calculate with, so that it can give an accurate prediction, such as whether a user will buy a product or not.
- Similarly, in machine learning there are different methods for encoding your data. We will discuss them here.
One hot Encoding
So we have to convert (encode) our categorical data into numeric form. See the image below.
Here the states, such as Maharashtra, Gujarat and JandK, are categorical (string) data.
To encode your data, if you are using Google Colab you can start right away and import the module and class directly.
Let’s go through the code step by step, line by line. We have already discussed how our table works for our model. If you haven’t read that post yet, please click here to read it in detail.
In today’s blog post we will discuss the “One hot encoding” method. We use this technique when the categories of a feature have no order (there is no relationship between categories). If we replaced the state names with numbers such as 0, 1, 2, 3, 4, our model would assume that there is a numerical order between these records. Instead, we use 0 and 1: each category is mapped to its own binary variable containing either 0 or 1.
One hot encoding turns our first column into 3 columns, one for each state. If your column contains more than 3 categories (state names), it will generate correspondingly more columns: 4 columns for 4 categories, 5 columns for 5 categories, and so on.
In short, it generates a binary vector for each of our states.
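To make the binary vectors concrete, here is a minimal sketch of one hot encoding a single column of states. The four sample rows are made up for illustration (only the state names come from the example above), and the result is converted with toarray() because OneHotEncoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample column; only the state names come from the example above.
states = np.array([["Maharashtra"], ["Gujarat"], ["JandK"], ["Gujarat"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for display.
one_hot = encoder.fit_transform(states).toarray()

# The learned categories are sorted, and each one gets its own 0/1 column.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, in the column that belongs to its state, so no state looks "bigger" or "smaller" than another.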
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
1) The ColumnTransformer class from the compose module of the sklearn library.
2) The OneHotEncoder class from the preprocessing module of the sklearn library.
The ColumnTransformer constructor takes several arguments, but we are interested in only two.
The first argument, transformers, says what kind of transformation we want to do and on which column. The second argument, remainder, says what to do with the columns you don’t want to change.
The first argument answers three questions:
1) What kind of transformation do you want to perform? Simple encoding. This is just a name you choose (for example "encoder"); it makes setting parameters and looking up the transformer easy later on.
2) What kind of encoding do you want to do? One hot encoding, so we pass a OneHotEncoder object.
3) On which column do you want to perform one hot encoding? The 1st column, which has index 0.
In the second argument, “remainder”, you say what should happen to the rest of the columns of your data set. By default, only the columns which are transformed will be returned by the transformer and all other columns will be dropped, so if you want to keep them you have to say so here (remainder="passthrough").
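The effect of the remainder argument is easiest to see by comparing output shapes. This is only a sketch: the three-row table, its numeric values and the name "encoder" are assumptions made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is the state, columns 1 and 2 are numeric.
X = np.array([
    ["Maharashtra", 44, 72000],
    ["Gujarat", 27, 48000],
    ["JandK", 30, 54000],
], dtype=object)

# Default remainder ("drop"): only the encoded columns come back.
ct_drop = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [0])])
dropped = ct_drop.fit_transform(X)

# remainder="passthrough": the untouched columns are kept after the encoded ones.
ct_keep = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
kept = ct_keep.fit_transform(X)

print(dropped.shape)  # 3 rows, 3 one hot columns
print(kept.shape)     # 3 rows, 3 one hot columns + 2 original columns
```

Note that the encoded columns come first in the output and the passed-through columns are placed after them.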
The ColumnTransformer class has the fit concept built in, so there is no need to worry about all the steps we performed in a previous blog post: its fit_transform method performs fit and then transform together in one go. Your X needs to be a NumPy array, and the fit_transform method of ColumnTransformer does not always return a plain NumPy array (depending on the data, it can return a sparse matrix), so you need to convert the result using “np” (np.array for a dense result; a sparse result needs .toarray() instead).
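The fit-and-transform step and the conversion to a NumPy array can be sketched as follows. The sample table and the name "encoder" are assumptions for illustration, and the sparse check is a defensive addition that the post itself does not mention.

```python
import numpy as np
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is the state, columns 1 and 2 are numeric.
X = np.array([
    ["Maharashtra", 44, 72000],
    ["Gujarat", 27, 48000],
    ["JandK", 30, 54000],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)

# fit and transform happen together in one call.
result = ct.fit_transform(X)

# Depending on the data, the result can be a sparse matrix or a dense array,
# so handle both cases before treating it as a NumPy array.
if sparse.issparse(result):
    X_encoded = result.toarray()
else:
    X_encoded = np.array(result)

print(type(X_encoded), X_encoded.shape)
print(X_encoded[0])  # first row: one hot columns, then the original numbers
```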
Dummy Encoding: It is much the same as one hot encoding, with a small improvement. For N categories in a variable, one hot encoding uses N binary variables, while dummy encoding uses N-1 features to represent N labels/categories. The category that gets no column of its own is represented by all zeros. See the image.
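One way to get dummy encoding in sklearn is the drop="first" option of OneHotEncoder. That choice is an assumption of this sketch (the post does not say which tool produces the dummy columns), and the sample column is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample column with 3 categories.
states = np.array([["Maharashtra"], ["Gujarat"], ["JandK"]])

# drop="first" removes the column of the first (sorted) category,
# so 3 categories are represented by 3 - 1 = 2 columns.
dummy_encoder = OneHotEncoder(drop="first")
dummy = dummy_encoder.fit_transform(states).toarray()

print(dummy_encoder.categories_[0])  # sorted categories; the first one is dropped
print(dummy)                         # the dropped category shows up as all zeros
```

Here Gujarat is first in sorted order, so it is the category encoded as a row of zeros.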
Both one hot encoding and dummy encoding are useful, but there is a drawback too: if a column has N distinct values, we need N new variables/vectors (N-1 for dummy encoding) to encode the data. For example, a column with 7 different values will require 7 new variables under one hot encoding.
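The growth in columns can be checked with a quick sketch. The seven day names are a made-up example of a column with 7 different values, and drop="first" is used here as one assumed way of producing the dummy encoded version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 7 distinct values.
days = np.array([["Mon"], ["Tue"], ["Wed"], ["Thu"], ["Fri"], ["Sat"], ["Sun"]])

one_hot = OneHotEncoder().fit_transform(days).toarray()
dummy = OneHotEncoder(drop="first").fit_transform(days).toarray()

print(one_hot.shape)  # (7, 7): one new column per distinct value
print(dummy.shape)    # (7, 6): one column fewer with dummy encoding
```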