Programming and software development
Every 5 minutes, AEMO dispatches generators across the National Electricity Market (NEM) to meet demand. To do this, AEMO needs to predict what demand will look like 5 minutes into the future.
Sometimes when you work on large projects, you can end up with code that looks like this:
if num_values < 15:
    do_thing()
Now, what does the 15 mean? Well, I’m sure you would have known when you first wrote the code, but what happens in 3 months when you try to modify it or pass it on to someone else in your team? Also, what happens if you determine that the number should actually be 16? Now you have to go through all your files and swap numbers in all the right places (and if you miss any you can introduce subtle bugs).
So what’s the solution? Well, what you really need is a single source of truth for your clearly defined constants: one place where they can be looked up, imported and changed.
Python makes this really easy – all you have to do is create a “constants.py” file in your project directory (I suppose you could call it whatever you like, but that’s beside the point). From here, you can import your variables as if you were importing a library:
from constants import NOM_FREQ_HZ, VALUES_IN_MIN
A screenshot of this file from a recent project looks a little like this:
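As a rough sketch (using the two names from the import above; the values here are illustrative rather than the real ones from that project):

# constants.py: the single source of truth for the project's constants
NOM_FREQ_HZ = 50      # nominal grid frequency in Hz
VALUES_IN_MIN = 12    # number of readings per minute (illustrative value)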
This approach helped me write much more readable code when working with an extremely complicated dataset where numbers were used to represent different value categories.
Obviously for smaller scripts it might not be worth setting up a dedicated file, but for this project my constants file ended up being a couple of hundred lines long, and it is far easier to maintain than if those values were defined in multiple places across all the files in the project.
If you’ve ever wanted to see the impact that machine learning is having in the energy sector, then I recommend watching this seminar released by the National Renewable Energy Laboratory (NREL).
Each talk describes an application of machine learning in the industry at a different level, from the big (weather and climate modelling) through to the small (optimising the aerodynamics of turbine blades).
Most people know that I’m a huge fan of the Python programming language – while that isn’t going to change, a recent encounter with some researchers at CSIRO has convinced me that I should pick up Julia for some of the energy modelling and optimisation work that I do.
I’ve known for a while that Julia was a language with a lot of benefits (as fast as a lower-level language but with the productivity benefits of a higher-level language). However, if you understand how to write efficient vectorised code in Python (using NumPy and Pandas), then except for some use-cases, you don’t really get that much of a boost out of switching to Julia.
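To make that concrete, here is a small sketch (with a made-up array) of the kind of vectorised code I mean, where an explicit Python loop is replaced with a single NumPy operation:

import numpy as np

values = np.random.rand(1_000_000)

# slow: an explicit Python loop over every element
total = 0.0
for v in values:
    total += v ** 2

# fast: the same calculation vectorised with NumPy
total = np.sum(values ** 2)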
So what changed my mind? Well, for the past few years, the National Renewable Energy Laboratory (NREL) have been working on a number of amazing open-source energy modelling packages for the Julia programming language. I’ve now updated my electricity modelling resources page with the links to some material about these packages.
One of my favourite data science resources is the mini-episode series of the Data Skeptic podcast. These short episodes would feature the host explaining a data science concept to a non-expert in plain English.
I wanted to share a few of these with some colleagues from work and thought I’d catalogue them here.
I should mention up-front that the techniques described in this post are really only worthwhile once you have a dataset in the millions of rows or above. Once your data hits this size, it is worth paying the initial optimisation overhead as it will save you memory and be faster overall.
Pandas’ eval and query are built on the numexpr library, and provide an optimised way to run a calculation or filter on a Pandas dataframe. For example, the code below shows the traditional way of doing these things in Pandas:
import numpy as np

start = '2020-02-10 08:20:00'
end = '2020-02-10 08:30:00'
duids = ['LYA4', 'BW02']

# traditional vectorised calculation
map_gen_df['DIST'] = np.sqrt(map_gen_df['SEC_DIFF'].pow(2) + map_gen_df['VALUE_DIFF'].pow(2))

# traditional filter
event_duid_df = map_gen_df[
    (map_gen_df['MMSNAME'].isin(duids))
    & (map_gen_df['TIMESTAMP_MIN'] >= start)
    & (map_gen_df['TIMESTAMP_MIN'] <= end)
]
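For comparison, here is a sketch of the same two operations written with eval and query (assuming the same dataframe and column names as above; eval hands the arithmetic off to numexpr, and query lets you reference local variables with @):

# calculation with eval
map_gen_df.eval('DIST = sqrt(SEC_DIFF**2 + VALUE_DIFF**2)', inplace=True)

# filter with query, referencing the local variables via @
event_duid_df = map_gen_df.query(
    'MMSNAME in @duids and TIMESTAMP_MIN >= @start and TIMESTAMP_MIN <= @end'
)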
Pandas and NumPy are fantastic libraries that enable you to take advantage of vectorization to write extremely efficient Python code. However, what happens when the calculation you wish to run changes based on the value in another column of your dataset?
For example, take a look at the dataset in the table below (along with the code to generate it):
Group | Value
A     | 1
A     | 1
B     | 1
C     | 1
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'C'],
    'Value': [1, 1, 1, 1],
})
Imagine I wish to create a third column (‘Result’) based on the following logic:
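The original rules aren't reproduced here, but to illustrate the kind of column-dependent logic I mean, suppose (hypothetically) that group A rows should be multiplied by 10, group B rows by 100, and everything else left unchanged. np.select expresses that in a single vectorised step:

# hypothetical rules: conditions are checked in order and the matching
# choice is used; rows matching nothing fall back to the default
conditions = [
    df['Group'] == 'A',
    df['Group'] == 'B',
]
choices = [
    df['Value'] * 10,
    df['Value'] * 100,
]
df['Result'] = np.select(conditions, choices, default=df['Value'])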
A couple of years ago I started my PhD at the Australian National University working to quantify honeybee behaviour. We wanted to build a system that could automatically track and compare different groups of bees within the hive.
I took the project as I had a background in biology, beekeeping and programming, and I wanted to work in a lab where I could learn from a supervisor who was incredibly knowledgeable about both biology and software development.
Google have released an open source project on GitHub called Grumpy that converts Python to Go, and then compiles it down to native code.
It’s an interesting development, but since they won’t be supporting C extension modules (which basically rules out all the scientific and machine learning libraries I use), it means I probably won’t end up using this new tool too much.
I was reading a paper by Pedro Domingos this evening which had some tips and advice for people using machine learning. I’ve written down some bullet points for my own reference and I hope someone else finds them useful. I know I’ve made some of the mistakes he warns against.