September 10, 2020 10:07 pm GMT

Basic Pandas: How to add a column to a DataFrame

Pandas is one of my favorite Python libraries, and I use it every day. A very common action is to add a column to a DataFrame. This is a pretty basic task, but I'm going to look at a few examples to better show what is happening when we add a column, and how we need to think about the index of our data when we add it.

Let's start with a very simple DataFrame. This DataFrame has 4 columns of random floating point values. Its index is the default, a RangeIndex of the size of the DataFrame. I'll assume this Python code is run in either a Jupyter notebook or an IPython session with pandas installed. I used version 1.1.0 when I wrote this.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.rand(6,4), columns=['a', 'b', 'c', 'd'])
    display(df)

              a         b         c         d
    0  0.028948  0.613221  0.122755  0.754660
    1  0.880772  0.581651  0.968752  0.551583
    2  0.107115  0.511918  0.574167  0.871300
    3  0.830062  0.622413  0.118231  0.444581
    4  0.264822  0.370572  0.001680  0.394488
    5  0.749247  0.412359  0.092063  0.350451

Let's start with the simplest way to add a column: assigning a single value. It will be applied to all rows in the DataFrame.

    df['e'] = .5
    display(df['e'])

    0    0.5
    1    0.5
    2    0.5
    3    0.5
    4    0.5
    5    0.5
    Name: e, dtype: float64

Under the hood, pandas is making life easier for you by taking your scalar value (the 0.5), turning it into an array, and using it to build a Series with the index (in this case a RangeIndex) of your DataFrame.

This is sort of the equivalent:

    df['e_prime'] = pd.Series(.5, index=pd.RangeIndex(6))
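To check that the scalar assignment and the explicit Series really give the same column, here is a small self-contained sketch. It rebuilds the DataFrame so it can run on its own, and it uses `df.index` instead of a hard-coded `RangeIndex(6)`, which is my own variation on the line above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# Scalar assignment: pandas repeats the value for every row.
df['e'] = .5

# Roughly the same thing spelled out: a Series built on the DataFrame's own index.
df['e_prime'] = pd.Series(.5, index=df.index)

# Compare the two columns element by element.
same = (df['e'] == df['e_prime']).all()
print(same)
```

Using `df.index` keeps the sketch working even if the DataFrame has a different number of rows or a non-default index.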

You can also pass in an array yourself without an index, but its length must match the number of rows in your DataFrame.

    df['f'] = np.random.rand(6,1)

If you try to do this with a non-matching length, it won't work: pandas raises a ValueError because the DataFrame won't know where to put the values. You can try it and see the exception for yourself.
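A minimal sketch of both cases, assuming a fresh six-row DataFrame like the one above. The column names `ok` and `bad` are made up for illustration, and the exact wording of the error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# Six values for six rows: assigned by position, no index involved.
df['ok'] = np.random.rand(6)

error = None
try:
    # Five values for six rows: pandas cannot decide where they belong.
    df['bad'] = np.random.rand(5)
except ValueError as exc:
    error = exc
    print(f"ValueError: {exc}")
```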

Now what happens when the data you want to add doesn't match your current DataFrame, but it does have an index? Specifically, what if the index is different on the right-hand side?

    df['g'] = pd.Series(np.random.rand(50), index=pd.RangeIndex(2,52))
    display(df[['e', 'e_prime', 'f', 'g']])

         e  e_prime         f         g
    0  0.5      0.5  0.777879       NaN
    1  0.5      0.5  0.621390       NaN
    2  0.5      0.5  0.294869  0.283777
    3  0.5      0.5  0.024411  0.695215
    4  0.5      0.5  0.173954  0.585524
    5  0.5      0.5  0.276633  0.751469

So what happened here? Our column g only has values at rows 2 through 5, even though we assigned a Series with 50 values. These were the only rows whose labels matched our index. For the rows that didn't have matching values, a NaN was inserted, and the Series values with labels 6 through 51 were dropped. You can try doing this where none of the data matches on the index and see what happens: you'll end up with a full column of NaNs.

Another way to think of this is that we could use the loc indexer to select the rows we wanted to update, but unless we set an index on the right-hand side, the values still need to match the shape of the selection. Note that loc slices by label and includes both endpoints, so 2:5 selects four rows.

    df.loc[2:5, 'g_prime'] = np.random.rand(4)
    display(df['g_prime'])

    0         NaN
    1         NaN
    2    0.130246
    3    0.419122
    4    0.312587
    5    0.101704
    Name: g_prime, dtype: float64
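The no-overlap case mentioned earlier, and the idea that column assignment lines the Series up with the DataFrame's index, can be sketched as follows. The names `no_overlap`, `partial`, `aligned`, and the column `h` are hypothetical, and comparing against `reindex` is my own way of illustrating the alignment, not a description of pandas internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# A Series whose labels (100..105) share nothing with the DataFrame's 0..5.
no_overlap = pd.Series(np.random.rand(6), index=pd.RangeIndex(100, 106))
df['h'] = no_overlap
print(df['h'].isna().all())  # no label matched, so every row is NaN

# Partial overlap: labels 2..51, of which only 2..5 exist in df.
partial = pd.Series(np.random.rand(50), index=pd.RangeIndex(2, 52))
df['g'] = partial

# Reindexing the Series to df.index gives the same values as the new column.
aligned = partial.reindex(df.index)
print(df['g'].equals(aligned))
```

Thinking of the assignment as "reindex, then store" makes the NaNs easier to predict: any label in the DataFrame that is missing from the Series becomes NaN, and any extra label in the Series is ignored.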

The main lesson here is that assigning a column to a DataFrame can lead to some surprising results if you don't know whether what you are assigning has a matching index.
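One way to avoid the surprise, sketched under the assumption that you want the values placed in row order regardless of labels: check whether the indexes agree, and strip the index with `to_numpy()` when they don't. The names `other`, `by_label`, and `by_position` are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# A Series of the right length but with unrelated labels (10..15).
other = pd.Series(np.arange(6, dtype=float), index=pd.RangeIndex(10, 16))

# A quick check before assigning: do the two indexes agree?
print(df.index.equals(other.index))

# Assigning the Series aligns on labels: nothing matches, so all NaN.
df['by_label'] = other

# Assigning the underlying array ignores labels and fills rows in order.
df['by_position'] = other.to_numpy()

print(df[['by_label', 'by_position']])
```

Which behavior is right depends on the data: label alignment is usually what you want when both sides share a meaningful index, while positional assignment only makes sense when the row order is known to match.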


Original Link: https://dev.to/wrighter/basic-pandas-how-to-add-a-column-to-a-dataframe-36n0
