Every 5 minutes, the Australian Energy Market Operator (AEMO) dispatches generators across the National Electricity Market (NEM) to meet demand. To do this, AEMO needs to predict what demand will look like 5 minutes into the future.
Machine Learning
Excellent seminar on the applications of machine learning in the energy sector
If you’ve ever wanted to see the impact that machine learning is having in the energy sector, then I recommend watching this seminar released by the National Renewable Energy Laboratory (NREL).
Each talk describes an application of machine learning in the industry at different levels, from the big (weather and climate modelling) through to the small (optimising the aerodynamics of turbine blades).
Some of the topics discussed include:
- How researchers at NREL are using generative adversarial networks (GANs) to assist them with weather and climate modelling
- How you can represent a wind farm as a graph neural network (GNN) with directed edges (this is brilliant!)
- How hard it is to acquire enough data to train models for wind farms (this is why they mention having success with ensemble-based modelling approaches)
- How they’ve been creating simulations to augment their wind farm datasets
- A few key points which I agree with from personal experience
  - Features matter more than models: having enough input data, processed in the right way, often matters more than the specific machine learning algorithm you’re using
  - Training models is expensive and time-consuming, but once that stage is done, you can run them cheaply and quickly in production
One of my favourite data science resources is the mini-episode series of the Data Skeptic podcast. These short episodes would feature the host explaining a data science concept to a non-expert in plain English.
I wanted to share a few of these with some colleagues from work and thought I’d catalogue them here.
A couple of years ago I started my PhD at the Australian National University working to quantify honeybee behaviour. We wanted to build a system that could automatically track and compare different groups of bees within the hive.
I took the project as I had a background in biology, beekeeping and programming, and I wanted to work in a lab where I could learn from a supervisor who was incredibly knowledgeable about both biology and software development.
Notes from ‘A Few Useful Things to Know about Machine Learning’
I was reading a paper by Pedro Domingos this evening which had some tips and advice for people using machine learning. I’ve written down some bullet points for my own reference and I hope someone else finds them useful. I know I’ve made some of the mistakes he warns against.
Deep learning is a type of machine learning based on neural networks, which were inspired by neurons in the brain. The difference between a deep neural network and a normal neural network is the number of ‘hidden layers’ between the input and output layers.
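As a rough illustration of the ‘hidden layers’ idea, here is a minimal sketch in plain Python. The helper names (`make_network`, `forward`, `count_hidden_layers`), the random weights and the ReLU activation are all assumptions made for this example. Nothing is trained and there are no bias terms; the only point is that the shallow and deep networks differ in how many layers sit between input and output.

```python
import random

def make_network(layer_sizes, seed=0):
    """Build random weights for a fully connected network (illustrative only).

    layer_sizes lists the input size, each hidden layer size, then the output size.
    Each layer is a list of neurons, and each neuron is a list of input weights.
    """
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]

def forward(network, inputs):
    """Pass inputs through every layer; hidden layers use ReLU, the output is linear."""
    activations = inputs
    for i, layer in enumerate(network):
        sums = [sum(w * a for w, a in zip(neuron, activations)) for neuron in layer]
        is_output_layer = i == len(network) - 1
        activations = sums if is_output_layer else [max(0.0, s) for s in sums]
    return activations

def count_hidden_layers(network):
    """Every weight layer except the last one produces a hidden layer."""
    return len(network) - 1

shallow = make_network([3, 4, 1])        # one hidden layer of 4 units
deep = make_network([3, 4, 4, 4, 1])     # three hidden layers of 4 units

print(count_hidden_layers(shallow), count_hidden_layers(deep))
print(forward(deep, [0.1, 0.2, 0.3]))
```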
I recently watched an excellent presentation on Deep Learning by Roelof Pieters titled ‘Python for image and text understanding; One model to rule them all!’ I can recommend watching it, and I’ve written this post to jot down a few of my own bullet points from the talk for future reference.
I recently ran a fresh install on my Mac and thought I’d take the opportunity to document the libraries and programs I find incredibly useful.
The Python libraries I’ll frequently pip3 install include:
I wrote a few quick bullet points down from the article “8 Proven Ways for improving the “Accuracy” of a Machine Learning Model” for future reference.
Improving Accuracy
- Add more data
- Fix missing values
  - Continuous: impute with median/mean/mode
  - Categorical: treat as separate class
  - Predict missing values with k-nearest neighbours
- Outliers
  - Delete
  - Bin
  - Impute
  - Treat separately from the other values
- Feature engineering
  - Transform and normalise: scale between 0 and 1
  - Eliminate skewness (e.g. with a log transform) for algorithms that require normally distributed data
  - Create features: the date of a transaction might not be useful, but the day of the week may be
- Feature selection
  - Best features to use: identify via visualisation or through domain knowledge
  - Significance: use p-values and other metrics to identify the right features. Can also use dimensionality reduction while preserving relationships in the data
- Test multiple machine learning algorithms and tune their parameters
- Ensemble methods: combine multiple weak predictors (bagging and boosting)
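A few of the tactics above are easy to sketch with the Python standard library. This is only an illustration, not code from the article: the function names and the sample numbers are made up for the example. It covers median imputation for a continuous column, scaling to the 0-1 range, a log transform for skewed values, and turning a transaction date into a day-of-week feature.

```python
import math
from datetime import date
from statistics import median

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def impute_median(values):
    """Replace missing (None) entries in a continuous column with the median of the rest."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Reduce right skew with log(1 + x); assumes every value is non-negative."""
    return [math.log1p(v) for v in values]

def day_of_week(d):
    """Turn a date into a day-name feature (date.weekday() returns 0 for Monday)."""
    return DAY_NAMES[d.weekday()]

# Hypothetical transaction amounts with one missing value.
amounts = [12.0, None, 30.0, 18.0]
filled = impute_median(amounts)
print(filled)
print(min_max_scale(filled))
print(log_transform(filled))
print(day_of_week(date(2024, 1, 1)))
```

In a real project these steps would normally come from a library such as pandas or scikit-learn; the plain-Python versions are here just to make each transformation explicit.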
I wrote a few quick bullet points down from the article “8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset” for future reference.
Tactics
- Imbalanced datasets occur when you have a class that occurs much less frequently than the others.
  - If a model ignores that class, it can still achieve a high classification accuracy, but that’s not the result we want
  - Use a confusion matrix to check that you’re getting acceptable accuracy for all your classes
- One potential solution to this problem is to collect more data
- Resampling your dataset (can be random or non-random, i.e. stratified):
  - Add copies of underrepresented classes (oversampling/sampling with replacement). Useful if you don’t have much data: tens of thousands of instances or fewer.
  - Delete instances of classes that occur frequently (undersampling). Handy if you have a lot of data: tens to hundreds of thousands of instances.
- Try different algorithms: decision trees can perform well on imbalanced datasets
- Penalised models: extra penalties for misclassifying the minority class. Examples of these algorithms include penalized-SVM and penalized-LDA.
- There are areas of research dedicated to imbalanced datasets: look into anomaly detection and change detection.
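The two resampling tactics can be sketched in a few lines of standard-library Python. This is a simple random version under my own assumptions: `oversample` and `undersample` are names invented for this example, and the "common"/"rare" labels are made-up data. Oversampling draws extra minority rows with replacement until the classes match the largest one, and undersampling keeps a random subset of each class the size of the smallest one.

```python
import random
from collections import Counter

def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups

def oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows (sampling with replacement) until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

def undersample(rows, labels, seed=0):
    """Keep a random subset of each class so that every class has as
    many rows as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical dataset: 8 rows of a common class and 2 of a rare one.
rows = list(range(10))
labels = ["common"] * 8 + ["rare"] * 2

_, over_labels = oversample(rows, labels)
_, under_labels = undersample(rows, labels)
print(Counter(over_labels))   # both classes now have 8 rows
print(Counter(under_labels))  # both classes now have 2 rows
```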
I wrote a few quick bullet points down from the article “How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python”.
Metrics
- Classification accuracy
  - Tests how well a model’s predictions do overall
  - accuracy = correct predictions / total predictions
- Confusion matrix
  - Use to identify how well your predictions did with different classes
  - Very useful if you have an imbalanced dataset
  - I wrote an extremely hacked-together confusion matrix for my tag identification software. I had 4 classes (U, C, R, Q) and the confusion matrix shows you what your model predicted against what the real category was.

|   | U | C | R | Q |
|---|---|---|---|---|
| U | 175 | 17 | 67 | 1 |
| C | 11 | 335 | 14 | 0 |
| R | 26 | 8 | 298 | 0 |
| Q | 6 | 0 | 3 | 93 |

- Mean absolute error for regression
  - Never negative: the average of the absolute differences between your predicted values and the real values
- Root mean squared error for regression
  - Square root of the mean of the squared differences between the actual and predicted values
  - Squaring the differences gives you positive numbers, and taking the root puts the error back in the original units.
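Here is a minimal from-scratch sketch of the four metrics in plain Python. It is my own illustration rather than the article’s code, and the function names are assumptions. In `confusion_matrix` I have chosen rows to be the real class and columns to be the predicted class; that is a convention for this sketch, not a statement about how the tag table above is laid out.

```python
import math

def accuracy(actual, predicted):
    """accuracy = correct predictions / total predictions"""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def confusion_matrix(actual, predicted, classes):
    """Count (real class, predicted class) pairs.
    Rows are the real class, columns the predicted class (a choice made for this sketch)."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Overall accuracy is the diagonal (correct predictions) over the total count."""
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return diagonal / total

def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between real and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference, in the original units."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Counts copied from the U/C/R/Q tag table in this post.
tag_matrix = [
    [175, 17, 67, 1],
    [11, 335, 14, 0],
    [26, 8, 298, 0],
    [6, 0, 3, 93],
]
print(round(accuracy_from_matrix(tag_matrix), 3))
print(mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
print(root_mean_squared_error([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
```

For the tag table, the diagonal sums to 901 out of 1054 predictions, so the overall accuracy is roughly 0.855. That figure is the same whichever axis holds the real category, since it only uses the diagonal and the total.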