The aim of our research was to look for irregularities in the behavior of electricity consumers over a period of a year and a half. Anomaly detection, one of the most common areas of cyber security research, makes it possible to detect threats based on unusual behavior.
Deviations from the normal state can be looked for in virtually any data. A sound analysis starts with an initial statistical analysis that maps the structure and state of the data being examined. For visual evaluation and easier orientation in the data, it helps to split consumers into clusters based on their observed electricity consumption. This can be done with one of the commonly used dimensionality reduction methods, such as PCA or t-SNE. Anomalous values that do not fit into any cluster can already be found at this stage. However, dimensionality reduction methods alone weren't sufficient in this case.
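As a rough illustration of this step, the sketch below projects per-consumer load profiles into two dimensions with PCA and then groups them. Everything in it is an assumption made for illustration: the synthetic hourly profiles, the variable names, and the use of DBSCAN as the clustering step (the article does not say which clustering algorithm was used). DBSCAN is chosen here only because it labels points that belong to no cluster as noise, which mirrors the idea of values that "do not fit into any cluster".

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for real data: one row per consumer, 24 columns with an
# average hourly load profile. All values are made up for illustration.
hours = np.arange(24)
day_profile = 1.0 + np.exp(-((hours - 13) ** 2) / 18.0)     # midday peak
evening_profile = 1.0 + np.exp(-((hours - 20) ** 2) / 8.0)  # evening peak
profiles = np.vstack([
    day_profile + rng.normal(0, 0.05, size=(40, 24)),      # 40 "daytime" consumers
    evening_profile + rng.normal(0, 0.05, size=(40, 24)),  # 40 "evening" consumers
    rng.uniform(0, 6, size=(1, 24)),                        # one erratic consumer
])

# Reduce the 24-dimensional profiles to 2 dimensions for plotting and clustering.
embedding = PCA(n_components=2).fit_transform(profiles)

# Density-based clustering; points that fit no cluster receive the label -1.
# eps and min_samples are hand-picked for this toy data set.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(embedding)

outliers = np.flatnonzero(labels == -1)
print("cluster labels found:", sorted(set(labels.tolist())))
print("consumers outside every cluster:", outliers.tolist())
```

On real data the clusters are rarely this clean, which is consistent with the observation that dimensionality reduction alone was not enough.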
Figure 1: The behavior of four anomalous users. Yellow shows the predicted electricity consumption, while green shows the real consumption.
External factors also play a role
The electricity consumption of each user is strongly influenced by external factors, so enriching the data with additional information was also part of the research. These factors include the temperature on a particular day and the number of sunshine hours. It may also matter, for instance, whether or not the day was a working day.
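A minimal sketch of such enrichment with Pandas might look as follows. The table layouts, column names, values, and the hard-coded list of public holidays are all hypothetical; in a real pipeline the holiday list would come from a calendar source rather than being typed in by hand.

```python
import pandas as pd

# Hypothetical daily consumption of one user (values are made up).
consumption = pd.DataFrame({
    "date": pd.date_range("2018-12-22", periods=6, freq="D"),
    "kwh": [11.2, 12.9, 13.4, 14.8, 12.1, 10.7],
})

# Hypothetical weather observations keyed by the same dates.
weather = pd.DataFrame({
    "date": pd.date_range("2018-12-22", periods=6, freq="D"),
    "temperature_c": [1.5, 0.3, -2.1, -3.0, -1.2, 2.4],
    "sunshine_hours": [0.5, 2.0, 3.5, 1.0, 0.0, 4.2],
})

# Placeholder holiday calendar, assumed for illustration only.
public_holidays = pd.to_datetime(["2018-12-24", "2018-12-25", "2018-12-26"])

# Attach the weather columns to every consumption record.
enriched = consumption.merge(weather, on="date", how="left")

# A working day is neither a weekend day (Saturday=5, Sunday=6) nor a holiday.
is_weekend = enriched["date"].dt.dayofweek >= 5
is_holiday = enriched["date"].isin(public_holidays)
enriched["working_day"] = ~(is_weekend | is_holiday)

print(enriched)
```

The left join keeps every consumption record even if a weather observation is missing for that day, so gaps show up as missing values instead of silently dropping rows.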
Prediction of electricity consumption
The enriched data was used to train several regressors, which then predicted the power consumption of each user. The GBRT (Gradient Boosted Regression Trees) regressor, which uses decision trees for prediction, proved best suited. Users with the worst prediction results were identified as individuals with anomalous behavior. For inspiration on what information to enrich your data with, check the Open data website, where you'll find a large amount of digitized data.
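The prediction-and-ranking idea can be sketched with scikit-learn's GradientBoostingRegressor. This is an illustration under stated assumptions, not the setup used in the research: the data is synthetic, the user names are invented, and the chronological train/test split and the mean absolute error metric are choices made here because the article does not specify them.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
n_days = 400

# Synthetic external factors (all values are made up).
day = np.arange(n_days)
temperature = 10 + 12 * np.sin(2 * np.pi * day / 365) + rng.normal(0, 2, n_days)
sunshine = rng.uniform(0, 12, n_days)
working_day = (day % 7 < 5).astype(float)
X = np.column_stack([temperature, sunshine, working_day])


def simulate_user(erratic: bool) -> np.ndarray:
    """Daily consumption driven by the external factors plus noise."""
    load = (20 - 0.4 * temperature - 0.2 * sunshine + 3 * working_day
            + rng.normal(0, 0.5, n_days))
    if erratic:
        # In the last 100 days consumption stops following the external factors.
        load[-100:] = rng.uniform(0, 40, 100)
    return load


users = {
    "user_a": simulate_user(False),
    "user_b": simulate_user(False),
    "user_c": simulate_user(False),
    "user_x": simulate_user(True),
}

# Train one model per user on the first 300 days, evaluate on the last 100.
split = n_days - 100
errors = {}
for name, y in users.items():
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[:split], y[:split])
    errors[name] = mean_absolute_error(y[split:], model.predict(X[split:]))

# Users with the largest prediction error come first.
ranking = sorted(errors, key=errors.get, reverse=True)
for name in ranking:
    print(f"{name}: MAE = {errors[name]:.2f}")
```

A user whose consumption cannot be explained by the external factors ends up at the top of the ranking; where to draw the line between "hard to predict" and "anomalous" remains a judgment call.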
To detect anomalies in electricity consumption, we first performed a statistical analysis, which included splitting the data into clusters. We then took into account the external factors that affect electricity consumption, such as temperature or hours of sunshine. We used this enriched data to train the GBRT regressor, which uses decision trees for prediction. Electricity consumers with the worst prediction results were labelled as anomalous consumers.
- Jupyter Notebook: Jupyter Notebook was used to carry out the research. The tool simplifies the work and, if the defined principles are followed, makes it possible to write clear code.
- Pandas: The Pandas library was used for the basic statistics and analysis, on which the whole research was based.
- Matplotlib and Plotly: The Matplotlib and Plotly libraries were used to visualize the individual steps and intermediate results.
- Scikit-learn: The scikit-learn library was one of the most important ones in the research. It provided the methods for finding clusters of similar users as well as the regressors that predicted future electricity consumption.
- Holidays: This library helped enrich the data with information on whether a day was a working day.
- Astral: Astral, another library needed for data enrichment, was used to determine the position of the Sun and the Moon, the day length, the night length and other data relevant for the analysis of electricity consumption.
- Grafana: Grafana, the open-source platform for time series analysis, was used to visualize the predictions and the detected anomalous electricity consumers.