Programming and software development
I should mention up-front that the techniques described in this post are really only worthwhile once you have a dataset in the millions of rows or above. Once your data hits this size, it is worth paying the initial optimisation overhead as it will save you memory and be faster overall.
Pandas’ eval and query is built on Python’s Numexpr library, and provides an optimised way to run a calculation or filter on a Pandas dataframe. For example, the code below shows you the traditional way of doing these things in Pandas:
start = '2020-02-10 08:20:00' end = '2020-02-10 08:30:00' duids = ['LYA4', 'BW02'] # traditional vectorized calculation map_gen_df['DIST'] = np.sqrt(lya1_df['SEC_DIFF'].pow(2) + lya1_df['VALUE_DIFF'].pow(2)) # traditional filter event_duid_df = map_gen_df[(map_gen_df['MMSNAME'].isin(duids))&(map_gen_df['TIMESTAMP_MIN']>=start)&(map_gen_df['TIMESTAMP_MIN']<=end)]