November 15, 2022 08:04 pm GMT

Data analysis made simple: Python Pandas tutorial

This article was published on the Educative Blog by Amanda Fawcett on June 29, 2020, and updated by the Educative team on November 15, 2022.

Data is an important part of our world. In fact, 90% of the world's data was created in just the last 7 years. Many tech giants have started hiring data scientists to analyze data and extract useful insights for business decisions.

Currently, Python is the most important language for data analysis, and many of the industry-standard tools are written in Python. Python Pandas is one of the most essential, in-demand tools that any aspiring data analyst needs to learn. Today, we'll introduce you to the essentials of Pandas.

Today we'll go over:

  • Introducing Pandas for Python
  • Pandas Data Types and Structures
  • Series: the most important operations
  • DataFrame: the most important operations
  • How to read and import Pandas data
  • Data Wrangling with Pandas
  • Next steps

Introducing Pandas for Python

The Pandas library is one of the most important and popular tools for Python data scientists and analysts, as it is the backbone of many data projects. Pandas is an open-source Python package for data cleaning and data manipulation. It provides extended, flexible data structures to hold different types of labeled and relational data. On top of that, it is actually quite easy to install and use.

Pandas is often used in conjunction with other data science Python libraries. In fact, Pandas is built on the NumPy package, so a lot of the structure between them is similar. Pandas is also used with SciPy for statistical analysis or with Matplotlib for plotting functions. Pandas can be used on its own with a text editor or with Jupyter Notebooks, the ideal environment for more complex data modeling. Pandas is available for most versions of Python, including Python 3.

Think of Pandas as the home for your data where you can clean, analyze, and transform your data all in one place. Pandas is essentially a more powerful replacement for Excel. Using Pandas, you can do things like:

  • Easily calculate statistics about data, such as finding the average, distribution, and median of columns
  • Use data visualization tools, such as Matplotlib, to easily create bar plots, histograms, and more
  • Clean your data by filtering columns by particular criteria or easily removing values
  • Manipulate your data flexibly using operations like merging, joining, reshaping, and more
  • Read, write, and store your clean data as a database, txt file, or CSV file
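As a quick illustration of the first item in the list above, here is a minimal sketch that computes summary statistics on a small DataFrame. The column names and values are made up for illustration and are not from the article.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
sales = pd.DataFrame({
    "units": [3, 5, 2, 8, 7],
    "price": [1.5, 2.0, 1.5, 3.0, 2.5],
})

mean_units = sales["units"].mean()      # arithmetic average of one column
median_units = sales["units"].median()  # middle value of one column
summary = sales.describe()              # count, mean, std, min, quartiles, max per column

print(mean_units, median_units)
print(summary)
```

describe() is a convenient first look at numeric columns because it collects several of these statistics in one table.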

Popularity of Pandas

As we learned, Python is the most popular programming language for data analytics, and many of the popular machine learning and visualization libraries are written in Python, including Pandas, NumPy, TensorFlow, Matplotlib, Scikit-learn, and more. In fact, Python ranked 4th in the 2022 Stack Overflow Developer Survey for the most popular programming languages among professional developers. It is beloved for its simplicity, easy learning curve, and extensive library support.

Pandas is an important part of data analytics. It consistently ranks highly among the most popular and loved libraries. It also consistently ranks highly for most wanted programming tools, a sure sign that Pandas is a sought-after tool for developers around the world. Learning Pandas is an important step to becoming a data analyst.

First Step: Installing Pandas

You can install Pandas with pip, the package installer that ships with Python, by running the following command.

$ pip install pandas

Pandas Data Structures and Data Types

A data type is like an internal construct that determines how Python will manipulate, use, or store your data. When doing data analysis, it's important to use the correct data types to avoid errors. Pandas will often correctly infer data types, but sometimes, we need to explicitly convert data. Let's go over the data types available to us in Pandas, also called dtypes.

  • object: text or mixed numeric or non-numeric values
  • int64: integer numbers
  • bool: true/false values
  • float64: floating point numbers
  • category: finite list of text values
  • datetime64[ns]: date and time values
  • timedelta64[ns]: differences between two datetimes
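To make explicit conversion concrete, the sketch below starts with columns that were read as text and converts them with astype() and pd.to_datetime(). The column names and values are hypothetical, and the exact dtype labels printed can vary between Pandas versions.

```python
import pandas as pd

# Hypothetical data in which every column starts out as text
df = pd.DataFrame({
    "quantity": ["1", "2", "3"],
    "price": ["1.5", "2.25", "3.0"],
    "sold_on": ["2022-01-01", "2022-01-02", "2022-01-03"],
})
print(df.dtypes)  # dtypes as inferred by Pandas

# Explicitly convert each column to a more suitable dtype
df["quantity"] = df["quantity"].astype("int64")
df["price"] = df["price"].astype("float64")
df["sold_on"] = pd.to_datetime(df["sold_on"])
print(df.dtypes)  # dtypes after conversion
```

Once the columns have numeric and datetime dtypes, arithmetic and date operations behave as expected instead of treating the values as strings.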

A data structure is a particular way of organizing our data. Pandas has two data structures, and all operations are based on those two objects:

  • Series
  • DataFrame

Think of this as a chart for easy storage and organization, where Series are the columns, and the DataFrame is a table composed of a collection of series. Series can be best described as the single column of a 2-D array that can store data of any type. DataFrame is like a table that stores data similar to a spreadsheet using multiple columns and rows. Each value in a DataFrame object is associated with a row index and a column index.
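The relationship between the two structures can be sketched in a few lines: selecting one column of a DataFrame gives back a Series, and a single value is addressed by a row index plus a column index. The data here is a made-up example.

```python
import pandas as pd

# Hypothetical table with two columns
df = pd.DataFrame({"peppers": [3, 2, 0], "carrots": [0, 3, 7]})

column = df["peppers"]          # selecting one column yields a Series
print(type(df).__name__)        # DataFrame
print(type(column).__name__)    # Series

# A single value is addressed by a row index and a column index
value = df.loc[1, "carrots"]
print(value)
```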

Series: the most important operations

We can get started with Pandas by creating a series. We create a series by invoking the pd.Series() constructor and passing it a list of values.

We print that series using the print() function. Pandas will, by default, count the index from 0. Later, we will define those index values explicitly.

import pandas as pd

series1 = pd.Series([1,2,3,4])
print(series1)

Let's look at a more complex example:

# importing pandas in our program
import pandas as pd

# Defining a series object
srs = pd.Series([1,2,3,4,5])

# printing series values
print("The Series values are:")
print(srs.values)

# printing series indexes
print("\nThe Index values are:")
print(srs.index.values)

-->
The Series values are:
[1 2 3 4 5]

The Index values are:
[0 1 2 3 4]

How does this work? Well, the srs.values attribute on line 9 returns the values stored in the Series object, and the srs.index.values attribute on line 13 returns the index values.

Assign names to our values

Pandas will automatically generate numeric indexes, but we can define our own. Each index corresponds to its value in the Series object. Let's look at an example where we assign a country name to population growth rates.

#importing pandas in our program
import pandas as pd

# Defining a series object
srs = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2], index = ['China', 'India', 'USA', 'Brazil', 'Pakistan'])

# Set Series name
srs.name = "Growth Rate"

# Set index name
srs.index.name = "Country"

# printing series values
print("The Indexed Series values are:")
print(srs)

-->
The Indexed Series values are:
Country
China       11.9
India       36.0
USA         16.6
Brazil      21.8
Pakistan    34.2
Name: Growth Rate, dtype: float64

How does this work? Two attributes of the Series object are used on line 8 and line 11. The attribute srs.name sets the name of our series object. The attribute srs.index.name then sets the name for the indexes.

Select entries from a Series

To select entries from a Series, we select elements based on the index name or index number.

import numpy as np
import pandas as pd

srs = pd.Series(np.arange(0, 6, 1), index = ['ind0', 'ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
srs.index.name = "Index"
print("The original Series:\n", srs)

print("\nSeries element at index ind3:")
print(srs['ind3']) # Fetch element at index named ind3

print("\nSeries element at index 3:")
print(srs.iloc[3]) # Fetch element at position 3

print("\nSeries elements at multiple indexes:\n")
print(srs[['ind1', 'ind4']]) # Fetch elements at multiple indexes

-->
The original Series:
 Index
ind0    0
ind1    1
ind2    2
ind3    3
ind4    4
ind5    5
dtype: int64

Series element at index ind3:
3

Series element at index 3:
3

Series elements at multiple indexes:

Index
ind1    1
ind4    4
dtype: int64

How does that work? Well, the elements from the Series are selected in 3 ways.

  • With srs['ind3'], the element is selected based on the index name.
  • With srs.iloc[3], the element is selected based on the index number. Keep in mind that index numbers start from 0. Older code often writes srs[3] for this, but recent Pandas versions deprecate positional lookups with plain brackets on a labeled Series.
  • With srs[['ind1', 'ind4']], multiple elements are selected from the Series by listing multiple index names inside the [].

Drop entries from a Series

Dropping an unwanted index is a common operation in Pandas. If the drop(index_name) function is called with a given index on a Series object, the entry with that index name is removed from the returned Series.

import numpy as np
import pandas as pd

srs = pd.Series(np.arange(0, 6, 1), index = ['ind0', 'ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
srs.index.name = "Index"
print("The original Series:\n", srs)

srs = srs.drop('ind2') # drop index named ind2
print("The New Series:\n", srs)

-->
The original Series:
 Index
ind0    0
ind1    1
ind2    2
ind3    3
ind4    4
ind5    5
dtype: int64
The New Series:
 Index
ind0    0
ind1    1
ind3    3
ind4    4
ind5    5
dtype: int64

Here, the output shows that the ind2 index is dropped. Also, an index can only be dropped by specifying the index name and not the number. So, srs.drop(srs[2]) does not work.
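If only the position of the unwanted entry is known, one possible workaround (a sketch, not something the original article covers) is to look up the label stored at that position through srs.index and pass that label to drop(). The Series below is a shortened stand-in for the one above.

```python
import pandas as pd

srs = pd.Series([0, 1, 2, 3], index=["ind0", "ind1", "ind2", "ind3"])

label = srs.index[2]        # look up the label stored at position 2
shorter = srs.drop(label)   # drop() returns a new Series; srs itself is unchanged

print(label)
print(shorter)
```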

There are many other functions, conditions, and logical operators we can apply to our series object to make productive use of indexes.

Some of those functions and options are:

  • The condition srs[srs == 1.0] will return a series object containing indexes with values equal to 1.0.
  • The constructor parameter name (str, optional) gives a name to the Series.
  • The constructor parameter copy (bool, default False) allows us to copy the data we input.
  • The notnull() function will return a series object with indexes assigned to False (for NaN or null values), and the remaining indexes are assigned True.
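The condition and notnull() items from the list above can be tried on a small Series. This is a minimal sketch with made-up values, including one NaN so that notnull() has something to flag.

```python
import numpy as np
import pandas as pd

# Hypothetical Series containing one missing value
srs = pd.Series([1.0, 2.0, np.nan, 1.0], index=["a", "b", "c", "d"], name="demo")

equal_to_one = srs[srs == 1.0]   # keeps only entries whose value equals 1.0
not_missing = srs.notnull()      # False where the value is NaN, True elsewhere

print(equal_to_one)
print(not_missing)
```

The boolean Series returned by notnull() can itself be used as a filter, for example srs[srs.notnull()], to keep only the entries that have a value.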

DataFrame: the most important operations

There are several ways to make a DataFrame in Pandas. The easiest way to create one from scratch is to pass a dictionary of columns to the pd.DataFrame constructor and print the result.

import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 4, 8, 7, 9],
    "Column2": ['a', 'column', 'with', 'a', 'string'],
    "Column3": [1.23, 23.5, 45.6, 32.1234, 89.453],
    "Column4": [True, False, True, False, True]
})
print(df)

-->
   Column1 Column2  Column3  Column4
0        1       a   1.2300     True
1        4  column  23.5000    False
2        8    with  45.6000     True
3        7       a  32.1234    False
4        9  string  89.4530     True

We can also create a dict and pass our dictionary data to the DataFrame constructor. Say we have some data on vegetable sales and want to organize it by type of vegetable and quantity. Our data would look like this:

data = {
    'peppers': [3, 2, 0, 1],
    'carrots': [0, 3, 7, 2]
}

And now we pass it to the constructor using a simple command.

quantity = pd.DataFrame(data)
quantity

How did that work? Well, each key in our data becomes a column name in the DataFrame we created, and the list stored under that key becomes the column's values, just like a chart. The index for this DataFrame is listed as numbers, but we can specify it further depending on our needs. Say we wanted to know the quantity per month. That would be our new index. We do that using the following command.

quantity = pd.DataFrame(data, index=['June', 'July', 'August', 'September'])
quantity

Get info about your data

One of the first commands you run after loading your data is .info(), which provides all the essential information about a dataset.

import pandas as pd

df = pd.DataFrame({
    "Column1": [1, 4, 8, 7, 9],
    "Column2": ['a', 'column', 'with', 'a', 'string'],
    "Column3": [1.23, 23.5, 45.6, 32.1234, 89.453],
    "Column4": [True, False, True, False, True]
})
df.info()

-->
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
Column1    5 non-null int64
Column2    5 non-null object
Column3    5 non-null float64
Column4    5 non-null bool
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 197.0+ bytes

From that, you can access more information with other operations, like .shape, which outputs a tuple of (rows, columns). This is super useful for telling us the size of our data, especially after we've cleaned it. That way, we can know what was removed.

We can also print a dataset's column names to find typos or formatting inconsistencies. We use the .columns attribute to do so. On top of that, the .rename() method allows us to rename columns, similar to the Search and Replace function of a Word doc.
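Here is a short sketch of .shape, .columns, and .rename() working together. The DataFrame and its deliberately messy column name are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with an inconsistently named column (trailing space)
df = pd.DataFrame({"Name ": ["Ana", "Ben"], "salary": [100, 200]})

print(df.shape)           # (rows, columns)
print(list(df.columns))   # inspect the column names for typos

# rename() returns a new DataFrame; the original is left unchanged
clean = df.rename(columns={"Name ": "name"})
print(list(clean.columns))
```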

Searching and selecting in our DataFrame

We also need to know how to manipulate or access the data in our DataFrame, such as selecting, searching, or deleting data values. You can do this either by column or by row. Let's see how it's done. The easiest way to select a column of data is by using brackets [ ]. We can also use brackets to select multiple columns. To select a row instead, we use .loc with the row's index name. Say we only wanted to look at June's vegetable quantity.

quantity.loc['June']

Note: loc and iloc are used for locating data.

  • .iloc locates by numerical position. This is similar to list indexing in Python.
  • .loc locates by the index name.
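The difference between the two can be sketched with the vegetable data from earlier in the article: the same row is reachable by its label through .loc or by its position through .iloc.

```python
import pandas as pd

data = {"peppers": [3, 2, 0, 1], "carrots": [0, 3, 7, 2]}
quantity = pd.DataFrame(data, index=["June", "July", "August", "September"])

by_label = quantity.loc["July"]               # row selected by its index name
by_position = quantity.iloc[1]                # the same row selected by position (0-based)
one_cell = quantity.loc["August", "carrots"]  # single value by row and column name

print(by_label)
print(by_position)
print(one_cell)
```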

Pandas DataFrame object also provides methods to select specific columns. The following example shows how it can be done.

import pandas as pd

df = pd.read_csv('cancer_stats.csv')

print(df.columns) # print columns of DataFrame

print("\nThe First Column")
print(df['Sex'].head()) # Fetch the Sex column from DataFrame
print("\nThe type of this column is: " + str(type(df['Sex'])) + "\n")

print("\nThe Second Column")
print(df['Under 1'].head()) # Fetch the Under 1 column from DataFrame
print("\nThe type of this column is: " + str(type(df['Under 1'])) + "\n")

print("\nThe Last Column")
print(df['40-44'].head()) # Fetch the 40-44 column from DataFrame
print("\nThe type of this column is: " + str(type(df['40-44'])) + "\n")

-->
Index(['Sex', 'Under 1', '1-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44'], dtype='object')

The First Column
0      Males
1    Females
2      Males
3    Females
4      Males
Name: Sex, dtype: object

The type of this column is: <class 'pandas.core.series.Series'>

The Second Column
0    82
1    73
2     0
3     0
4     0
Name: Under 1, dtype: int64

The type of this column is: <class 'pandas.core.series.Series'>

The Last Column
0    2045
1    4457
2     139
3      81
4       2
Name: 40-44, dtype: int64

The type of this column is: <class 'pandas.core.series.Series'>

The df.columns attribute displays the names of all columns present. We access a column by its name: df['column_name'] is used to get the 1st, 2nd, and last columns, and each selected column is a Series.

Visit the original article on the Educative Blog to get some hands-on coding practice with DataFrames and the cancer_stats.csv dataset.

Create a new DataFrame from pre-existing columns

We can also grab multiple columns and create a new DataFrame object from them.

import pandas as pd

df = pd.read_csv('test.csv')
print(df.columns)

print("\nThe original DataFrame:")
print(df.head())

print("\nThe new DataFrame with selected columns is:\n")
new_df = pd.DataFrame(df, columns=['Sex', 'Under 1', '40-44'])
print(new_df.head())

-->
Index(['Sex', 'Under 1', '1-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44'], dtype='object')

The original DataFrame:
       Sex  Under 1  1-4  5-9  10-14  15-19  20-24  25-29  30-34  35-39  40-44
0    Males       82  305  199    197    322    537    910   1239   1610   2045
1  Females       73  249  171    183    297    554   1341   2219   3085   4457
2    Males        0    0    1      4     10      7     17     30     67    139
3  Females        0    1    0      5      8      8     12     29     53     81
4    Males        0    0    0      0      0      0      1      1      1      2

The new DataFrame with selected columns is:

       Sex  Under 1  40-44
0    Males       82   2045
1  Females       73   4457
2    Males        0    139
3  Females        0     81
4    Males        0      2

Reindex data in a DataFrame

We can also reindex the data either by the indexes themselves or the columns. Reindexing with reindex() allows us to make changes without messing up the initial setting of the objects.

Note: The rules for reindexing are the same for Series and DataFrame objects.

#importing pandas in our program
import pandas as pd

# Defining a series object
srs1 = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2], index = ['China', 'India', 'USA', 'Brazil', 'Pakistan'])

# Set Series name
srs1.name = "Growth Rate"

# Set index name
srs1.index.name = "Country"

srs2 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'])
print("The series with new indexes is:\n",srs2)

srs3 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'], fill_value=0)
print("\nThe series with new indexes is:\n",srs3)

-->
The series with new indexes is:
 Country
China       11.9
India       36.0
Malaysia     NaN
USA         16.6
Brazil      21.8
Pakistan    34.2
England      NaN
Name: Growth Rate, dtype: float64

The series with new indexes is:
 Country
China       11.9
India       36.0
Malaysia     0.0
USA         16.6
Brazil      21.8
Pakistan    34.2
England      0.0
Name: Growth Rate, dtype: float64

How did that work? Well, the first reindex() call changes the indexes. The new index names Malaysia and England have no data in the original Series, so NaN values are assigned to them by default. The second call passes fill_value=0, so those entries are filled with 0 instead. To reindex the columns of a DataFrame, the columns keyword should be specifically used. The rules are the same as for the indexes.
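For DataFrames, the same idea can be sketched with both the row index and the columns keyword. The small table below is a hypothetical example, and "onions" is a made-up column that does not exist in the original data.

```python
import pandas as pd

df = pd.DataFrame({"peppers": [3, 2], "carrots": [0, 3]}, index=["June", "July"])

# Reindex rows: the new label "August" has no data, so it is filled with NaN
rows = df.reindex(["June", "July", "August"])

# Reindex columns: the new column "onions" is filled with 0 via fill_value
cols = df.reindex(columns=["carrots", "peppers", "onions"], fill_value=0)

print(rows)
print(cols)
```

In both calls reindex() returns a new object, so df keeps its original rows and column order.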

How to read or import Pandas data

It is quite easy to read or import data from other files using the Pandas library. In fact, we can use various sources, such as CSV, JSON, or Excel to load our data and access it. Let's take a look at each.

Reading and importing data from CSV files

We can import data from a CSV file, which is common practice for Pandas users. We simply create or open our CSV file (for example, by pasting the data into a text editor such as Notepad) and save it in the same directory that houses our Python scripts. We then use a bit of code to read the data using the read_csv function built into Pandas.

import pandas as pd

data = pd.read_csv('vegetables.csv')
print(data)

By default, read_csv generates a new numeric index. If the first column of the file should be the index column instead, we need to change this. We can do this by passing the parameter index_col to tell Pandas which column to index.

data = pd.read_csv("data.csv", index_col=0)
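The effect of index_col can be checked without creating a file. This sketch feeds read_csv an in-memory string through io.StringIO, which stands in for a CSV file; the contents are hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (hypothetical contents)
csv_text = "month,peppers,carrots\nJune,3,0\nJuly,2,3\n"

default_index = pd.read_csv(io.StringIO(csv_text))             # new 0..n-1 index
month_index = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column as index

print(default_index)
print(month_index)
```

With index_col=0, the month names become row names, so a lookup such as month_index.loc["July"] works directly.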

Once we've used Pandas to sort and clean data, we can then save it back to a CSV file with a simple command. You only have to input the filename and extension. How simple!

df.to_csv('new_vegetables.csv')

Reading and importing data from JSON

Say you have a JSON file. A JSON file is basically like a stored Python dict, so Pandas can easily access and read it using the read_json function. Let's look at an example.

df = pd.read_json('purchases.json')

Just like with CSV files, once we've used Pandas to sort and clean data, we can then save it back to a JSON file with a simple command. You only have to input the filename and extension.

df.to_json('new_purchases.json')
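A round trip can be sketched without touching the disk: when to_json() is called with no path it returns the JSON as a string, and read_json() can parse that string when it is wrapped in io.StringIO. The data below is hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({"item": ["pen", "book"], "price": [1.5, 12.0]})

json_text = df.to_json()                          # no path given, so a JSON string is returned
restored = pd.read_json(io.StringIO(json_text))   # parse the string back into a DataFrame

print(json_text)
print(restored)
```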

Reading and importing data from Excel files

Say you have an Excel file. You can similarly use the read_excel function to access and read that data.

import pandas as pd

data = pd.read_excel('workers.xlsx')
print(data)

When we call the read_excel function, we pass the name of the Excel file as our argument, so read_excel will open the file's data. We can then use print() to display the data. If we want to go one step further, we can add the .loc indexer from earlier, allowing us to read specific rows and columns of our file.

import pandas as pd

data = pd.read_excel('workers.xlsx')
print(data.loc[[1,4,7],['Name','Salary']])

Data Wrangling with Pandas

Once we have our data, we can use the data wrangling processes to manipulate and prepare data for analysis. The most common data wrangling processes are merging, concatenation, and grouping. Let's get down to the basics of each of those.

Merging with Pandas

Merging is used when we want to combine data that shares a key variable but is located in different DataFrames. To merge DataFrames, we use the merge() function. Say we have df1 and df2.

import pandas as pd

d = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'student_name': ['Mark', 'Khalid', 'Deborah', 'Trevon', 'Raven']
}
df1 = pd.DataFrame(d, columns=['subject_id', 'student_name'])
print(df1)

import pandas as pd

data = {
    'subject_id': ['4', '5', '6', '7', '8'],
    'student_name': ['Eric', 'Imani', 'Cece', 'Darius', 'Andre']
}
df2 = pd.DataFrame(data, columns=['subject_id', 'student_name'])
print(df2)

So, how do we merge them? It's simple: with the merge() function!

pd.merge(df1, df2, on='subject_id')
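To see what this call returns, the sketch below rebuilds df1 and df2 and compares the default inner merge with an outer merge. The how and suffixes arguments are standard merge() parameters, but using them here is an addition to the article's example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "subject_id": ["1", "2", "3", "4", "5"],
    "student_name": ["Mark", "Khalid", "Deborah", "Trevon", "Raven"],
})
df2 = pd.DataFrame({
    "subject_id": ["4", "5", "6", "7", "8"],
    "student_name": ["Eric", "Imani", "Cece", "Darius", "Andre"],
})

# Default (inner) merge: keeps only subject_ids present in both DataFrames
inner = pd.merge(df1, df2, on="subject_id")

# Outer merge: keeps every subject_id and fills the gaps with NaN
outer = pd.merge(df1, df2, on="subject_id", how="outer", suffixes=("_left", "_right"))

print(inner)
print(outer)
```

Because both DataFrames have a student_name column, Pandas adds a suffix to each copy (_x and _y by default) so that the two stay distinguishable in the result.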

Grouping with Pandas

Grouping is how we categorize our data. If a value occurs in multiple rows of a single column, the data related to that value in other columns can be grouped together. Just like with merging, it's simpler than it sounds. We use the groupby function.

Take a look at this example:

# import pandas library
import pandas as pd

raw = {
    'Name': ['Darell', 'Darell', 'Lilith', 'Lilith', 'Tran', 'Tran', 'Tran',
        'Tran', 'John', 'Darell', 'Darell', 'Darell'],
    'Position': [2, 1, 1, 4, 2, 4, 3, 1, 3, 2, 4, 3],
    'Year': [2009, 2010, 2009, 2010, 2010, 2010, 2011, 2012, 2011, 2013, 2013, 2012],
    'Marks':[408, 398, 422, 376, 401, 380, 396, 388, 356, 402, 368, 378]
}
df = pd.DataFrame(raw)

group = df.groupby('Year')
print(group.get_group(2011))

-->
   Name  Position  Year  Marks
6  Tran         3  2011    396
8  John         3  2011    356
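Grouping is usually followed by an aggregation. As a minimal sketch on a shortened, made-up version of the data above, the code below computes the average mark per year and counts the rows per name.

```python
import pandas as pd

# Shortened, hypothetical version of the marks data
raw = {
    "Name": ["Darell", "Lilith", "Tran", "Tran", "John", "Darell"],
    "Year": [2009, 2009, 2010, 2011, 2011, 2012],
    "Marks": [408, 422, 401, 396, 356, 378],
}
df = pd.DataFrame(raw)

mean_by_year = df.groupby("Year")["Marks"].mean()  # one average per year
count_by_name = df.groupby("Name").size()          # number of rows per name

print(mean_by_year)
print(count_by_name)
```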

Concatenation

Concatenation is a long word that means adding one set of data to another. We use the concat() function to do so. To clarify the difference between merge and concatenation, merge() combines data on shared columns, while concat() stacks DataFrames along rows or columns.

print(pd.concat([df1, df2]))
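The two directions of concat() can be sketched as follows, using shortened versions of df1 and df2. The ignore_index and axis arguments are standard concat() parameters that the article's one-line example does not use.

```python
import pandas as pd

df1 = pd.DataFrame({"subject_id": ["1", "2"], "student_name": ["Mark", "Khalid"]})
df2 = pd.DataFrame({"subject_id": ["4", "5"], "student_name": ["Eric", "Imani"]})

# Stack rows on top of each other and renumber the index 0..n-1
stacked = pd.concat([df1, df2], ignore_index=True)

# Place the DataFrames side by side, aligning rows on the index
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```

Without ignore_index=True, the stacked result keeps the original row numbers, so the index would read 0, 1, 0, 1.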

Some other common data wrangling processes that you should know are:

  • Mapping data and finding duplicates
  • Finding outliers in data
  • Data Aggregation
  • Reshaping data
  • Replace & rename
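A few of these processes can be sketched briefly. The example below finds and removes duplicate rows, replaces a placeholder value, and maps grades to numbers; the data and the grade-to-points mapping are hypothetical.

```python
import pandas as pd

# Hypothetical data with one repeated row and one placeholder value
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cy"],
    "grade": ["a", "b", "a", "n/a"],
})

is_duplicate = df.duplicated()      # True for rows that repeat an earlier row
deduped = df.drop_duplicates()      # keep the first occurrence of each row

# Replace the placeholder in the grade column only
cleaned = deduped.replace({"grade": {"n/a": "unknown"}})

# Map grades to points; values missing from the mapping become NaN
points = cleaned["grade"].map({"a": 4, "b": 3})

print(is_duplicate)
print(cleaned)
print(points)
```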

Wrapping up and next steps

Now that you have a good sense of Python Pandas and the countless benefits it offers, it's important to know what to learn next. Once you get the basics down, like how to import, read, and wrangle your data, it's time to tackle the next stage of data analysis for Python:

  • Statistics
  • NumPy
  • Advanced data wrangling
  • Visualizations for data
  • Data scraping
  • Real-world projects

Educative's course Predictive Data Analysis for Python covers all these concepts and more with hands-on practice and industry-standard examples. You can master Pandas through quizzes, interactive examples, and real behavior analysis. By the end, you'll be a confident data analyst!

Start a discussion

What is your favorite use of data analysis in the real world? Was this article helpful? Let us know in the comments below!


Original Link: https://dev.to/educative/data-analysis-made-simple-python-pandas-tutorial-4hn5
