April 28, 2021 07:48 am GMT

Most Common Issues with Real-Life Data: How to Check for Them, and How to Fix Them

1. Missing Data

How to Check?

import pandas as pd
df = pd.read_csv('name_of_csv_file.csv')
df.info()


The range index shows the total number of rows, and beside each column you'll find its non-null count. If a column's count is lower than the total, that column has missing data.
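As a complementary check, the missing entries can also be counted directly per column. The sketch below uses a small made-up DataFrame (the column names "user_id" and "duration" and all values are assumptions for illustration, not taken from a real file):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; 'duration' has two missing entries.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "duration": [12.5, np.nan, 30.0, np.nan],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Any column with a count above zero has missing data, which matches what the non-null counts in df.info() tell you.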

How to Deal with It?

This varies according to the situation at hand. For example, it depends on why the data is missing and on whether the missing entries seem to occur at random.

One way to go about this issue is to fill in the missing values with the mean of the column.

For example, suppose you have missing values for the duration that a user viewed a product on your website, and "duration" is the name of the variable.

mean = df['duration'].mean()
df['duration'] = df['duration'].fillna(mean)


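In older pandas versions, the second line was often written as:

df['duration'].fillna(mean, inplace=True)


Both forms are meant to apply the change (filling in the value you just calculated) to the original set. In recent pandas versions, however, calling an inplace method on a selected column like this triggers a chained-assignment warning and may leave df unchanged, so prefer the assignment form above or df.fillna({'duration': mean}, inplace=True).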
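To see the mean-based fill on concrete numbers, here is a minimal sketch with made-up values (the data is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical viewing durations with one missing entry.
df = pd.DataFrame({"duration": [10.0, np.nan, 20.0, 30.0]})

# mean() skips missing values by default: (10 + 20 + 30) / 3 = 20.0
mean = df["duration"].mean()

# Replace the missing entry with the mean and assign the result back.
df["duration"] = df["duration"].fillna(mean)

print(df["duration"].tolist())  # [10.0, 20.0, 20.0, 30.0]
```

Note that the mean is computed from the three known values only, so filling does not change the column's mean.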

2. Duplicates

How to Check?

df.duplicated()


This displays "False" next to every row that isn't a duplicate, and "True" next to every row that duplicates an earlier one.

That is, the first instance is marked "False", while the second instance (the duplicate) is marked "True".

You can also check with:

df.duplicated().sum()


This is more practical for bigger data sets, since it shows exactly how many duplicate rows you have.

How to Deal with It?

df.drop_duplicates(inplace=True)


Again, inplace=True is used to apply the change to the original data set.
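The check and the fix can be tried end to end on a tiny example. The DataFrame below is made up for illustration (the column names "user_id" and "product" are assumptions):

```python
import pandas as pd

# Hypothetical data: the third row repeats the second one exactly.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "product": ["book", "pen", "pen", "lamp"],
})

# Flag rows that repeat an earlier row: False, False, True, False
flags = df.duplicated()

# Count the flagged rows (True counts as 1).
n_duplicates = flags.sum()

# Drop the repeated row, keeping the first occurrence.
df.drop_duplicates(inplace=True)

print(n_duplicates, len(df))  # 1 3
```

By default, a row counts as a duplicate only when every column matches an earlier row.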

3. Incorrect Data Types

How to Check?

df = pd.read_csv('name_of_csv_file.csv')
df.info()


For example, if you find "object" beside the variable "timestamp", your data set is treating the timestamp as a string (str), which is not ideal. The proper representation is a datetime object (datetime64 in pandas).

In this case, we'll use:

df['timestamp'] = pd.to_datetime(df['timestamp'])


Note: Data type corrections live only in the DataFrame in memory; they aren't stored in the CSV file. So, the next time you read the file, make sure to convert the columns again.
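One way to avoid repeating the conversion by hand is to ask pandas to parse the column while reading the file. This is a sketch rather than the original author's approach; the CSV text below is made up and read from memory so the example is self-contained, and in practice you would pass your file name instead:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = (
    "timestamp,duration\n"
    "2021-04-28 07:48:00,12.5\n"
    "2021-04-29 09:15:00,30.0\n"
)

# parse_dates converts the listed columns to datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

print(df.dtypes)
```

With this, "timestamp" shows up as a datetime column in df.info() right after loading.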

Git_It


Original Link: https://dev.to/gharamelhendy/most-common-issues-with-real-life-data-2bh
