How to improve the performance of Pandas with eval and query

by Jack Simpson April 5, 2021

written by Jack Simpson April 5, 2021

I should mention up-front that the techniques described in this post are really only worthwhile once you have a dataset in the millions of rows or above. Once your data hits this size, it is worth paying the initial optimisation overhead as it will save you memory and be faster overall.

Pandas’ eval and query is built on Python’s Numexpr library, and provides an optimised way to run a calculation or filter on a Pandas dataframe. For example, the code below shows you the traditional way of doing these things in Pandas:

start = '2020-02-10 08:20:00'
end = '2020-02-10 08:30:00'
duids = ['LYA4', 'BW02']

# traditional vectorized calculation
map_gen_df['DIST'] = np.sqrt(lya1_df['SEC_DIFF'].pow(2) + lya1_df['VALUE_DIFF'].pow(2))

# traditional filter
event_duid_df = map_gen_df[(map_gen_df['MMSNAME'].isin(duids))&(map_gen_df['TIMESTAMP_MIN']>=start)&(map_gen_df['TIMESTAMP_MIN']<=end)]

Now, here is the equivalent eval and query version:

# eval calculation
map_gen_df['DIST'] = map_gen_df.eval('sqrt(SEC_DIFF**2 + VALUE_DIFF**2)', engine='numexpr')

# query filter
event_duid_df = map_gen_df.query('(MMSNAME in @duids)&(TIMESTAMP_MIN >= @start)&(TIMESTAMP_MIN <= @end)', engine='numexpr')

As you can see, rather than operating on the columns directly, both eval and query are passed a string containing the names of the columns and are able to access Python variables with the ‘@’ notation.

So why bother with this at all? The reason is that when you use the traditional techniques, Pandas is explicitly evaluating the boolean state of each of the subexpressions which requires additional memory be allocated. Under the hood, the operation really looks something like this:

temp_col_1 = map_gen_df['MMSNAME'].isin(duids)
temp_col_2 = map_gen_df['TIMESTAMP_MIN']>=start
temp_col_3 = map_gen_df['TIMESTAMP_MIN']<=end

event_duid_df = map_gen_df[temp_col_1 & temp_col_2 & temp_col_3]

Numexpr allows you to avoid these intermediate steps – in some of the tests I’ve run, I’ve been able to save the allocation of gigabytes of RAM on my laptop simply by switching over to using eval and query.

How to improve the performance of Pandas with eval and query

You may also like

Leave a Comment Cancel Reply