
How to Clean Data Using Pandas

What is data cleaning?

Data cleaning is the process of removing, adding or modifying data to prepare it for analysis and other machine learning tasks. When data cleaning is necessary, it is always done before any kind of analysis or machine learning task.

Clive Humby said, "Data is the new oil." But, like oil, data still needs to be refined.

Why is data cleaning necessary?

Data is considered one of a company's major assets. Misleading or inaccurate data is risky and can contribute to a company's downfall.

The data available to us is not always useful as it is; we often have to perform many operations to make it useful. So, it is a good idea to remove unnecessary data and to format and modify the important data so that we can use it. In some scenarios, it is also necessary to add information by processing the available data, for example, adding a language column based on data that already exists, or generating a column of averages from the values of other columns.

Introduction

There are many steps involved in the data cleaning process, and not all of them are necessary for every dataset. To perform the data cleaning, we will use the Python programming language with the pandas library.

I have used Python because of its expressiveness and because it is easy to learn and understand. More importantly, Python is the choice of many experts for machine learning tasks because even a person without a computer science background can learn it easily. Beyond Python's benefits, pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool, and it is one of the most popular data analysis and processing tools out there.

Knowing your data before you start the cleaning process is very important, because which cleaning steps to perform depends entirely on the kind of data you have and its nature.

Step-by-step process for cleaning your data

Before cleaning the data, it is important to load it properly. In this tutorial, I will show basic methods for loading data from a CSV file; more options can be found in the pandas read_csv documentation.

import pandas as pd

# 1. Read data from csv the default way
df = pd.read_csv('my_file.csv')

# 2. Read data from csv using comma as delimiter
df = pd.read_csv('my_file.csv', delimiter=',')

# 3. Read data from csv using comma as delimiter and no headers
df = pd.read_csv('my_file.csv', delimiter=',', header=None)

# 4. Read data from csv using comma as delimiter and custom headers
my_headers = ['Id', 'Name', 'Type', 'Price']
df = pd.read_csv('my_file.csv', delimiter=',', header=0, names=my_headers)
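
Once the data is loaded, it is worth taking a quick look at it before deciding which cleaning steps to apply, as discussed above. Here is a minimal sketch, assuming my_file.csv exists and contains the Id, Name, Type and Price columns used throughout this tutorial:

print(df.head())        # first five rows
df.info()               # column names, non-null counts and dtypes
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # count of missing values per column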

Remove duplicate data

Certain steps need to be followed in every data cleaning process, and one of them is the removal of duplicate data. Whether the data is textual or numeric, removing duplicates is very important, because a dataset with too many duplicates takes longer to process.

# 1. Remove duplicates and return a copy of the dataframe
df = df.drop_duplicates()

# 2. Remove duplicates in place (returns None, so do not reassign)
df.drop_duplicates(inplace=True)

# 3. Drop duplicates, keeping the last occurrence
df.drop_duplicates(inplace=True, keep='last')

# 4. Consider only certain columns for identifying duplicates
df.drop_duplicates(subset=['Id', 'Price'], inplace=True, keep='last')
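
As a quick sanity check, you can count duplicates before removing them. A small illustrative sketch with made-up rows (the Id and Price values are hypothetical):

import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2], 'Price': [9.99, 9.99, 4.50]})
print(df.duplicated().sum())  # 1, the second row repeats the first
df = df.drop_duplicates()
print(len(df))                # 2 rows remain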

Remove emojis

There are many cases where we do not want emojis in our textual dataset, and we can remove them using a single line of code. The snippet shown below removes emojis from a pandas dataframe, column by column. The code snippet can be found on Stack Overflow.

df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

This code snippet encodes all the data as ASCII (American Standard Code for Information Interchange) and ignores anything that cannot be encoded. It then decodes the result back into strings; because the emojis were dropped during encoding, the decoded data no longer contains them.
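
To see the effect, here is a small hypothetical example. Note that this approach strips every non-ASCII character, not just emojis, so accented letters such as 'é' are removed too:

import pandas as pd

df = pd.DataFrame({'Type': ['frozen 🍕', 'green café']})
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
print(df['Type'].tolist())  # ['frozen ', 'green caf']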

Change data to lowercase

It is very likely that you will need to change the case of your data at some point. Here is a code snippet that changes data to lowercase; more examples can be found in the pandas documentation.

df['Type'] = df['Type'].str.lower()
df['Name'] = df['Name'].str.lower()
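
If you would rather lowercase every text column at once instead of listing them one by one, a possible sketch (assuming all object-dtype columns hold strings):

# Lowercase every object (string) column in one pass
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.lower()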

Remove multiple whitespaces, tabs and newlines

Datasets often contain unnecessary whitespace, tabs and newlines. The problem is that tabs and newlines are easy to see, while extra whitespace is not, and it can affect us later when we train our models.

df['Type'] = df['Type'].str.replace('\n', '')
df['Type'] = df['Type'].str.replace('\t', ' ')
df['Type'] = df['Type'].str.replace(' {2,}', ' ', regex=True)
df['Type'] = df['Type'].str.strip()

The first two lines replace newlines and tabs, respectively. The third line uses a regular expression (regex) to find runs of two or more spaces and replace them with a single space. Finally, the last line strips whitespace from both ends of the data.
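
An alternative is to collapse every run of whitespace (spaces, tabs and newlines) with a single regex and then strip; note that, unlike the snippet above, this turns newlines into spaces instead of deleting them:

df['Type'] = df['Type'].str.replace(r'\s+', ' ', regex=True).str.strip()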

Remove URLs (Uniform Resource Locators)

Many people use surveys to collect data. Respondents tend to fill in random details, and sometimes this data has URLs in it. I have used the regex pattern shown in the code snippet to remove URLs, though you can use any regex pattern that matches URLs. Here, matching URL patterns are replaced with empty strings.

df['Type'] = df['Type'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
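
For example, applied to a couple of made-up survey answers (the values are hypothetical):

import pandas as pd

df = pd.DataFrame({'Type': ['frozen pizza http://example.com', 'visit www.example.com now']})
df['Type'] = df['Type'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
print(df['Type'].tolist())  # ['frozen pizza ', 'visit  now']

Notice the leftover spaces where the URLs used to be; stripping and collapsing whitespace again afterwards takes care of them.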

Drop rows with empty data

The cleaning steps above can leave empty values behind in some columns. We have to get rid of those empty rows, otherwise they introduce uncertainty into the trained model. To make sure we drop all rows with empty data, we use the two methods shown below.

# 1. Drop rows containing np.nan, pd.NaT or None
df = df.dropna()

# 2. Drop rows where 'Type' is an empty string
df = df[df['Type'].astype(bool)]

The first method removes all rows that contain np.nan, pd.NaT or None, whereas the second removes rows whose Type column holds an empty string. The second method is fast, but it will not work if the value contains even a single whitespace character, which is another reason why we stripped our data earlier.
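
If you cannot guarantee the column was stripped beforehand, a slightly more defensive variant strips on the fly:

# Drop rows where 'Type' is empty or contains only whitespace
df = df[df['Type'].str.strip().astype(bool)]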

More data processing

Sometimes you need to drop some columns, create a new column from existing data, or remove rows that do not contain specific data.

import numpy as np

df = df.drop(['Id', 'Name'], axis=1)
df = df[df['Type'].str.contains('frozen') | df['Type'].str.contains('green')]


def detect_price(row):
    if row['Price'] > 15.50:
        return 'High'
    elif row['Price'] > 5.50 and row['Price'] <= 15.50:
        return 'Medium'
    elif row['Price'] > 0.0 and row['Price'] <= 5.50:
        return 'Low'
    else:
        return np.nan

df['Range'] = df.apply(lambda row: detect_price(row), axis=1)

Here, line number three drops the two columns named Id and Name and returns a copy of the new dataframe. Line number four keeps only the rows whose Type column contains the string frozen or green. Lines number 7 to 17 create a new column named Range based on the value of the Price column: using the lambda function, each row is passed to the detect_price function, which returns a label based on the price, and that value is assigned to the new column for that row. The reason for using np.nan here is that we can remove those rows afterwards using df.dropna().
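
For purely numeric bucketing like this, pandas' built-in pd.cut offers a vectorized alternative to the row-wise apply; a sketch using the same price boundaries:

import numpy as np
import pandas as pd

# (0, 5.5] -> Low, (5.5, 15.5] -> Medium, (15.5, inf) -> High
df['Range'] = pd.cut(df['Price'], bins=[0.0, 5.50, 15.50, np.inf], labels=['Low', 'Medium', 'High'])

Prices of zero or below fall outside the bins and become NaN, matching the behaviour of detect_price.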

Conclusion

The data cleaning process is one of the many processes involved in data science, and having clean and accurate data is a blessing. Every project needs its data cleaned and processed differently. I have mentioned some of the most frequently used cleaning methods; you can create your own set of methods as well, or use any of the existing ones. The whole code is available through the original article linked below.


Original Link: https://dev.to/sahilfruitwala/how-to-clean-data-using-pandas-2on
