Pandas and NumPy are fantastic libraries that enable you to take advantage of vectorization to write extremely efficient Python code. However, what happens when the calculation you wish to run changes based on the value in another column of your dataset?
For example, take a look at the dataset in the table below (along with the code to generate it):
| Group | Value |
|-------|-------|
| A     | 1     |
| A     | 1     |
| B     | 1     |
| C     | 1     |
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'C'],
    'Value': [1, 1, 1, 1]
})
Imagine I wish to create a third column (‘Result’) based on the following logic:
- Multiply Value by 2 if Group == ‘A’
- Multiply Value by 3 if Group == ‘B’
- Multiply Value by 4 if Group == ‘C’
- Fill with a missing value (nan) if none of the above is true
In the past, I thought the most efficient solution was to use Pandas’ apply:
def run_calc(row):
    if row['Group'] == 'A':
        return row['Value'] * 2
    elif row['Group'] == 'B':
        return row['Value'] * 3
    elif row['Group'] == 'C':
        return row['Value'] * 4
    else:
        return np.nan

df['Result'] = df.apply(run_calc, axis=1)
This works, but once your dataset grows to millions of rows, it becomes extremely slow, because apply with axis=1 calls the function once per row in Python rather than running a vectorized operation. It was only recently that I discovered there was a far better solution – NumPy’s select!
df['Result'] = np.select(
    condlist=[
        df['Group'] == 'A', df['Group'] == 'B', df['Group'] == 'C'
    ],
    choicelist=[
        df['Value'] * 2, df['Value'] * 3, df['Value'] * 4
    ],
    default=np.nan
)
Here, you pass a list of boolean vectors (condlist) and a second list of vectorized calculations (choicelist) to run wherever the corresponding condition is True. Finally, you can tell it what default value to use if none of the conditions are met (the equivalent of ‘else’ in an explicit conditional). It is worth flagging that if multiple conditions evaluate to True for the same row, the one that occurs earlier in the list wins (just like a typical if/elif chain).
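To make the first-match-wins behaviour concrete, here is a small sketch (using the same toy DataFrame as above) with deliberately overlapping conditions: the first condition matches every row, so the second never fires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'C'],
                   'Value': [1, 1, 1, 1]})

# Overlapping conditions: the first one is True for every row,
# so the 'B' row takes the first branch, not the second.
result = np.select(
    condlist=[df['Value'] == 1, df['Group'] == 'B'],
    choicelist=[df['Value'] * 10, df['Value'] * 3],
    default=np.nan,
)
# result is [10., 10., 10., 10.] - the earlier condition wins everywhere
```

This mirrors an if/elif chain: order your conditions from most to least specific.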
Finally, if you only care about two conditions (the same as just using if/else), then you can use NumPy’s where:
df['Result'] = np.where(df['Group']=='A', df['Value']*2, np.nan)
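You can nest np.where calls to cover three or more branches, but readability degrades quickly compared with select. A sketch of the same three-branch logic, written as nested where calls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'C'],
                   'Value': [1, 1, 1, 1]})

# Nested np.where emulates np.select, but the branching is harder to follow
df['Result'] = np.where(
    df['Group'] == 'A', df['Value'] * 2,
    np.where(df['Group'] == 'B', df['Value'] * 3,
             np.where(df['Group'] == 'C', df['Value'] * 4, np.nan))
)
# df['Result'] is [2., 2., 3., 4.]
```

For anything beyond two branches, select keeps the condition/choice pairing visible at a glance.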
While I intend to continue using apply for small-ish datasets (since I think it’s a little easier for others to read and understand), for larger datasets I’ve switched over to using select and recommend trying it out for yourself!
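If you want to see the gap on your own machine, a quick benchmark sketch is below. The row count (100,000 here) and the extra 'D' group are arbitrary choices for illustration; actual timings depend on your hardware, so no numbers are claimed.

```python
import time
import numpy as np
import pandas as pd

# Arbitrary synthetic data for timing; bump n higher to widen the gap
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Group': rng.choice(['A', 'B', 'C', 'D'], size=n),
    'Value': np.ones(n),
})

def run_calc(row):
    if row['Group'] == 'A':
        return row['Value'] * 2
    elif row['Group'] == 'B':
        return row['Value'] * 3
    elif row['Group'] == 'C':
        return row['Value'] * 4
    return np.nan

# Row-by-row apply: one Python function call per row
start = time.perf_counter()
apply_result = df.apply(run_calc, axis=1)
print(f"apply:  {time.perf_counter() - start:.3f}s")

# Vectorized np.select: whole-column operations
start = time.perf_counter()
select_result = np.select(
    [df['Group'] == 'A', df['Group'] == 'B', df['Group'] == 'C'],
    [df['Value'] * 2, df['Value'] * 3, df['Value'] * 4],
    default=np.nan,
)
print(f"select: {time.perf_counter() - start:.3f}s")
```

Both approaches produce identical results; only the time taken differs.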