January 24, 2023 06:15 am GMT

Transforming Categorical Data: A Practical Guide to Handling Non-Numerical Variables for Machine Learning Algorithms.

There are several ways to deal with categorical data, also known as label data, in data science:

  1. One-hot encoding

  2. Label encoding

  3. Dummy encoding

  4. Binning

  5. Count Encoding

  6. Frequency Encoding

  7. Target Encoding

The appropriate technique depends on the specific data and the goals of the analysis. It's important to note that tree-based algorithms such as decision trees and random forests can in principle handle categorical variables directly, so encoding may not be necessary. This depends on the implementation, however: scikit-learn's versions, for example, expect numeric input.

We will now go through each of these techniques with a sample dataset and learn how to make our data trainable.

Let's Start

1. One-hot encoding

One-hot encoding is a technique used to convert categorical variables into numerical values by creating a binary column for each category. It is useful for handling categorical variables with multiple levels.

For example, let's say we have a dataset of hand bags with a column called "color" that contains the following values: "red", "green", and "blue".

color   price   units
red     500     2
green   800     3
blue    300     1
red     400     1
green   600     1

One-hot encoding would create three new binary columns, one for each unique category, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not. The resulting data might look like this:

price   units   color_blue   color_green   color_red
500     2       0            0             1
800     3       0            1             0
300     1       1            0             0
400     1       0            0             1
600     1       0            1             0

As you can see, the original "color" column has been replaced by three new binary columns, one for each unique category. Each row now has a value of 1 in exactly one of these new columns, indicating the presence of that category.

You may now be wondering how to do this in Python, so let's look at that next.

In Python, you can use the get_dummies() function from the pandas library to apply one-hot encoding to the "color" column of your dataframe. Here is an example of how to do it:

    import pandas as pd

    # Create example dataframe
    df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                       'price': [500, 800, 300, 400, 600],
                       'units': [2, 3, 1, 1, 1]})

    # Apply one-hot encoding to the "color" column
    # (dtype=int gives 0/1 instead of the True/False that recent pandas returns)
    df_encoded = pd.get_dummies(df, columns=['color'], dtype=int)
    print(df_encoded)

Alternatively, you can use the OneHotEncoder class from the sklearn.preprocessing module to apply one-hot encoding.

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Create example dataframe
    df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                       'price': [500, 800, 300, 400, 600],
                       'units': [2, 3, 1, 1, 1]})

    # Create an instance of the encoder
    # (sparse_output replaces the older sparse argument in scikit-learn 1.2+)
    encoder = OneHotEncoder(sparse_output=False)

    # Fit and transform the "color" column
    color_encoded = encoder.fit_transform(df[['color']])

    # Create new dataframe with the encoded values
    # (get_feature_names_out replaces the removed get_feature_names)
    df_encoded = pd.concat(
        [df.drop(columns=['color']),
         pd.DataFrame(color_encoded, columns=encoder.get_feature_names_out(['color']))],
        axis=1)
    print(df_encoded)

The resulting dataframe has the same columns as the previous one (price, units, color_blue, color_green, color_red), except that OneHotEncoder returns the encoded values as floats (1.0 and 0.0) by default.

2. Label encoding

Label encoding is a technique used to convert categorical variables into numerical values by assigning a unique integer value to each category. It is useful for handling ordinal variables, where the order of the categories matters.

For example, let's say we have a dataset with a column called "size" that contains the following values: "small", "medium", "large". Label encoding would replace each category with an integer, such as: "small" = 0, "medium" = 1, "large" = 2. The resulting data might look like this:

size     encoded_size
small    0
medium   1
large    2
small    0
medium   1

As you can see, each value in the original "size" column now has a corresponding integer in the "encoded_size" column, with each category mapped to its own integer.

You can use the LabelEncoder class from the sklearn.preprocessing module to apply label encoding to your data. Here is an example of how to do it:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Create example dataframe
    df = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'medium'],
                       'price': [500, 800, 300, 400, 600],
                       'units': [2, 3, 1, 1, 1]})

    # Create an instance of the encoder
    encoder = LabelEncoder()

    # Fit and transform the "size" column
    df['encoded_size'] = encoder.fit_transform(df['size'])
    print(df)

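The resulting dataframe, df, will have a new column "encoded_size" representing the encoded values of the size column. Because LabelEncoder assigns integers in alphabetical order of the labels (large = 0, medium = 1, small = 2), the resulting dataframe will look like this:

size     price   units   encoded_size
small    500     2       2
medium   800     3       1
large    300     1       0
small    400     1       2
medium   600     1       1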

It's important to note that label encoding imposes a numeric relationship on the categories. LabelEncoder assigns integers according to the sorted (alphabetical) order of the labels rather than their meaning, so a natural order such as small < medium < large is only preserved if you define the mapping yourself. Even with a mapping like "small" = 0, "medium" = 1, "large" = 2, the numbers only rank the categories: they do not mean that large is twice the size of medium.
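When the order of the categories matters, one option is to state that order explicitly. The sketch below uses a pandas ordered categorical for this; the list name size_order and the assumed order small < medium < large are illustrative choices, not something the encoder can infer on its own.

```python
import pandas as pd

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'medium']})

# State the intended order explicitly instead of relying on alphabetical order
size_order = ['small', 'medium', 'large']
df['encoded_size'] = pd.Categorical(
    df['size'], categories=size_order, ordered=True
).codes
print(df)
```

Each code is the position of the category in size_order, so small maps to 0, medium to 1, and large to 2.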

3. Dummy Encoding

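Dummy encoding, also known as indicator encoding, is a technique used to convert categorical variables into numerical values by creating binary columns for each category, much like one-hot encoding. In this article, the difference is that the original column is kept alongside the new binary columns. (Many references instead use "dummy encoding" for the variant that drops one of the binary columns, leaving k - 1 columns for k categories.) Like one-hot encoding, it adds one column per level, so it is best suited to variables with a modest number of levels.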

For example, let's say we have a dataset with a column called "color" that contains the following values: "red", "green", "blue". Dummy encoding would create three new binary columns, one for each unique category, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not. The resulting data might look like this:

color   red   green   blue
red     1     0       0
green   0     1       0
blue    0     0       1
red     1     0       0
green   0     1       0

As you can see, the original "color" column is still present in the table, but three new binary columns, one for each unique category, have been added. Each row now has a value of 1 in exactly one of these new columns, indicating the presence of that category.

You can combine the get_dummies() and concat() functions from the pandas library to apply dummy encoding to the "color" column of your dataframe. Here is an example of how to do it:

    import pandas as pd

    # Create example dataframe
    df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                       'price': [500, 800, 300, 400, 600],
                       'units': [2, 3, 1, 1, 1]})

    # Apply dummy encoding to the "color" column and keep the original column
    # (dtype=int gives 0/1 instead of the True/False that recent pandas returns)
    df_encoded = pd.concat([df, pd.get_dummies(df['color'], dtype=int)], axis=1)
    print(df_encoded)

The resulting dataframe, df_encoded, will have three new binary columns, one for each unique category in the "color" column, with a value of 1 indicating that the category is present and a value of 0 indicating that it is not. The original "color" column is still present in the table. The resulting dataframe will look like this:

color   price   units   blue   green   red
red     500     2       0      0       1
green   800     3       0      1       0
blue    300     1       1      0       0
red     400     1       0      0       1
green   600     1       0      1       0
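Many references use the term "dummy encoding" for a variant that drops one of the indicator columns, because the dropped category is already implied when all the others are 0. A minimal sketch of that variant, assuming the same example data and using the drop_first argument of get_dummies():

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
                   'price': [500, 800, 300, 400, 600]})

# drop_first=True drops the first category (alphabetically "blue"),
# which becomes the baseline encoded as all zeros
df_dummy = pd.get_dummies(df, columns=['color'], drop_first=True, dtype=int)
print(df_dummy)
```

This leaves two indicator columns for three colors, which avoids perfectly correlated columns in linear models.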

4. Binning

Binning is a technique used to group numerical values into bins or ranges. It is used to handle numerical variables with a large number of unique values. Binning can be useful for creating categorical variables from numerical ones and for handling outliers in the data.

For example, let's say we have a dataset with a column called "age" that contains the following values: 18, 20, 25, 30, 35, 40, 45. To apply binning, we can divide the range of values into a pre-defined number of intervals or bins. For example, we can divide the range of ages into four bins: [18, 25), [25, 35), [35, 45), [45, 50), where each bin includes its lower edge and excludes its upper edge. This would group the ages into four categories: "young", "middle-aged", "old", and "very old". The resulting data might look like this:

age   age_bin
18    young
20    young
25    middle-aged
30    middle-aged
35    old
40    old
45    very old

As you can see, the original "age" column is still present in the table, but a new column "age_bin" has been added, which contains the binned values for each age. The rows in the "age_bin" column now contain categorical values representing the age group.

You can use the cut() function from the pandas library to apply binning to the "age" column of your dataframe. Here is an example of how to do it:

    import pandas as pd

    # Create example dataframe
    df = pd.DataFrame({'age': [18, 20, 25, 30, 35, 40, 45],
                       'price': [500, 800, 300, 400, 600, 700, 800],
                       'units': [2, 3, 1, 1, 1, 2, 3]})

    # Apply binning to the "age" column
    # (right=False makes each bin include its left edge: [18, 25), [25, 35), ...)
    df['age_bin'] = pd.cut(df['age'], bins=[18, 25, 35, 45, 50],
                           labels=['young', 'middle-aged', 'old', 'very old'],
                           right=False)
    print(df)

The resulting dataframe, df, will have a new column "age_bin" representing the binned values of the age column. The resulting dataframe will look like this:

age   price   units   age_bin
18    500     2       young
20    800     3       young
25    300     1       middle-aged
30    400     1       middle-aged
35    600     1       old
40    700     2       old
45    800     3       very old
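Which bin a boundary value lands in depends on the right argument of cut(). By default the bins are closed on the right, so with edges starting at 18 the age 18 itself falls outside every bin. The sketch below compares both settings on the same ages; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([18, 20, 25, 30, 35, 40, 45])
edges = [18, 25, 35, 45, 50]
labels = ['young', 'middle-aged', 'old', 'very old']

# Default (right=True): bins are (18, 25], (25, 35], ... so 18 falls outside
right_closed = pd.cut(ages, bins=edges, labels=labels)

# right=False: bins are [18, 25), [25, 35), ... so 18 is "young", 25 "middle-aged"
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({'age': ages,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

With the default setting, 18 becomes NaN and 25 is labelled "young", so it is worth checking the boundary values whenever you choose bin edges by hand.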

5. Count Encoding

Count encoding is a technique used to convert categorical variables into numerical values by counting the number of occurrences of each category in the dataset. It is used to handle categorical variables with many levels.

For example, let's say we have a dataset with a column called "product" that contains the following values: "apple", "orange", "banana", "apple", "orange", "apple", "banana". Count encoding would replace each category with the number of times it appears in the dataset. The resulting data might look like this:

product   count_encoded
apple     3
orange    2
banana    2
apple     3
orange    2
apple     3
banana    2

As you can see, the original "product" column is still present in the table, but a new column "count_encoded" has been added, which contains the count encoded values for each product. The rows in the "count_encoded" column now contain integer values representing the number of times each product appears in the dataset. Note that these values are not necessarily unique: orange and banana both appear twice, so they receive the same encoding.

You can use the value_counts() function from the pandas library, together with map(), to apply count encoding to the "product" column of your dataframe. Here is an example of how to do it:

    import pandas as pd

    # Create example dataframe
    df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple', 'orange', 'apple', 'banana'],
                       'price': [500, 800, 300, 400, 600, 700, 800],
                       'units': [2, 3, 1, 1, 1, 2, 3]})

    # Apply count encoding to the "product" column:
    # value_counts() builds a product -> count lookup, map() applies it row by row
    df['count_encoded'] = df['product'].map(df['product'].value_counts())
    print(df)

The resulting dataframe, df, will have a new column "count_encoded" representing the count encoded values of the product column. The resulting dataframe will look like this:

product   price   units   count_encoded
apple     500     2       3
orange    800     3       2
banana    300     1       2
apple     400     1       3
orange    600     1       2
apple     700     2       3
banana    800     3       2
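When the encoding has to be applied to data that was not part of the original dataset, the counts can be learned once and reused. This is a minimal sketch under assumptions: the train and new dataframes are made up, "kiwi" is a category the counts have never seen, and filling unseen categories with 0 is just one possible convention.

```python
import pandas as pd

train = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple',
                                  'orange', 'apple', 'banana']})
new = pd.DataFrame({'product': ['apple', 'kiwi']})  # "kiwi" is unseen

# Learn the counts from the training data only
counts = train['product'].value_counts()

# Unseen categories map to NaN; fill them with 0 (one possible convention)
new['count_encoded'] = new['product'].map(counts).fillna(0).astype(int)
print(new)
```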

6. Frequency Encoding

Frequency encoding is a technique used to convert categorical variables into numerical values by representing each category as the proportion of occurrences of that category in the dataset. It is similar to count encoding, but it normalizes the count by dividing it by the total number of occurrences of all categories in the dataset. It is used to handle categorical variables with many levels.

For example, let's say we have a dataset with a column called "product" that contains the following values: "apple", "orange", "banana", "apple", "orange", "apple", "banana". Frequency encoding would replace each category with the proportion of times it appears in the dataset. The resulting data might look like this:

product   frequency_encoded
apple     0.429
orange    0.286
banana    0.286
apple     0.429
orange    0.286
apple     0.429
banana    0.286

As you can see, the original "product" column is still present in the table, but a new column "frequency_encoded" has been added, which contains the frequency encoded values for each product. The rows in the "frequency_encoded" column now contain decimal values between 0 and 1 representing the proportion of times each product appears in the dataset.

You can use the value_counts() function from the pandas library with normalize=True, together with map(), to apply frequency encoding to the "product" column of your dataframe. Here is an example of how to do it:

    import pandas as pd

    # Create example dataframe
    df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple', 'orange', 'apple', 'banana'],
                       'price': [500, 800, 300, 400, 600, 700, 800],
                       'units': [2, 3, 1, 1, 1, 2, 3]})

    # Apply frequency encoding to the "product" column:
    # normalize=True turns the counts into proportions of the total row count
    df['frequency_encoded'] = df['product'].map(df['product'].value_counts(normalize=True))
    print(df)

The resulting dataframe, df, will have a new column "frequency_encoded" representing the frequency encoded values of the product column. The resulting dataframe will look like this:

product   price   units   frequency_encoded
apple     500     2       0.428571
orange    800     3       0.285714
banana    300     1       0.285714
apple     400     1       0.428571
orange    600     1       0.285714
apple     700     2       0.428571
banana    800     3       0.285714
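One limitation visible in this example is that categories with the same frequency receive the same encoded value, so the model can no longer tell them apart. A small sketch of how one might check for such collisions before relying on the encoding (the variable names are illustrative):

```python
import pandas as pd

products = pd.Series(['apple', 'orange', 'banana', 'apple',
                      'orange', 'apple', 'banana'], name='product')

# Proportion of rows belonging to each category
freq = products.value_counts(normalize=True)

# Categories whose frequency is shared with at least one other category
shared = freq[freq.duplicated(keep=False)]
print(sorted(shared.index))
```

Here orange and banana collide because each makes up 2 of the 7 rows.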

7. Target Encoding

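Target encoding is a technique used to convert categorical variables into numerical values by representing each category as the mean of the target variable for that category. This technique is used when the categorical variable has a large number of levels, and it can also be useful when the data is highly imbalanced. Because the encoding is derived from the target itself, the means should be computed on the training data only; otherwise information about the target leaks into the features.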

For example, let's say we have a dataset with a column called "product" and a target variable called "sales" that contains the following values:

product   sales
apple     100
orange    200
banana    50
apple     150
orange    300
apple     50
banana    20

Target encoding would replace each category in the "product" column with the mean of the "sales" column for that category. The resulting data might look like this:

product   sales   target_encoded
apple     100     100.0
orange    200     250.0
banana    50      35.0
apple     150     100.0
orange    300     250.0
apple     50      100.0
banana    20      35.0

As you can see, the original "product" column is still present in the table, but a new column "target_encoded" has been added, which contains the target encoded values for each product. The rows in the "target_encoded" column now contain decimal values representing the mean of the "sales" column for each product.

You can use the groupby() function from the pandas library, together with transform('mean'), to apply target encoding to the "product" column of your dataframe. Here is an example of how to do it:

    import pandas as pd

    # Create example dataframe
    df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple', 'orange', 'apple', 'banana'],
                       'sales': [100, 200, 50, 150, 300, 50, 20]})

    # Apply target encoding to the "product" column:
    # transform('mean') returns the per-product mean of sales, aligned to each row
    df['target_encoded'] = df.groupby('product')['sales'].transform('mean')
    print(df)

The resulting dataframe, df, will have a new column "target_encoded" representing the mean of the sales column for each product. The resulting dataframe will look like this:

product   sales   target_encoded
apple     100     100.0
orange    200     250.0
banana    50      35.0
apple     150     100.0
orange    300     250.0
apple     50      100.0
banana    20      35.0
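With only a few rows per category, the plain category mean can be noisy. A common refinement, shown here as a sketch rather than as part of the method above, is to blend each category mean with the global mean, weighting by how many rows the category has. The smoothing weight m and the column name target_smoothed are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'product': ['apple', 'orange', 'banana', 'apple',
                               'orange', 'apple', 'banana'],
                   'sales': [100, 200, 50, 150, 300, 50, 20]})

m = 2  # hypothetical smoothing weight; larger values pull harder toward the global mean
global_mean = df['sales'].mean()

# Per-category mean and row count
stats = df.groupby('product')['sales'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['target_smoothed'] = df['product'].map(smoothed)
print(df)
```

Categories with many rows stay close to their own mean, while rare categories are pulled toward the overall average.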

This blog is a part of a #100daysdatascience series. If you want to follow the whole series then go to the below links:

GitHub link: Complete-Data-Science-Bootcamp

Main Post: Complete-Data-Science-Bootcamp


Original Link: https://dev.to/anurag629/transforming-categorical-data-a-practical-guide-to-handling-non-numerical-variables-for-machine-learning-algorithms-cld
