Piotr A. Kowalski, PhD, D.Sc., prof. AGH – machine learning expert at Airly
"Good morning! Ladies and Gentlemen, the topic of today's lecture will be the prediction of the state of air quality." This is how I would start a lecture addressed to my students. Today, however, I am pleased to tell you about a major event at Airly: we are putting into operation our latest product, an intelligent system for air quality prediction.
But we have to start from the beginning and explain what a prediction is. As far as we can recall, even from childhood, prediction has always been associated with anticipating future events. In many books for children (but also for teenagers and adults) we can find examples of various fairies or magicians who, using strange artifacts, were able to say what would happen tomorrow, in a week or two, or who would win a skirmish. Because the task itself is so unusual, in this fantasy world it is almost always associated with magic and a kind of mysticism. Nowadays, prediction, or the ability to foresee certain behaviors or phenomena, is implemented quite differently. To solve such an unusual and difficult problem, we reach for mathematical tools, which, because of the large amounts of data involved, are often supported by very complex computational methods.
The reason for wanting to know what will happen in the future is very practical. Consider, for example, the whole range of issues in the field of economics. Thanks to prediction, we are able to notice a coming economic crisis and prepare for its effects or counteract its consequences. Interestingly, if the prediction is made early enough, we can even attempt to initiate appropriate economic measures to counteract undesirable phenomena such as a crisis or a stock market crash.
From the beginning of its existence, Airly has focused on offering a comprehensive solution that includes air quality predictions. The forecast covers the next 24 hours in hour-by-hour mode, rather than, as is often seen elsewhere, a single averaged value for the next day. When I started to lead the team responsible for intelligent air quality forecasting, I found a classic algorithm based on linear regression in place. At first, we tried to improve this algorithm by adding a number of interesting procedures, but something was still missing. This largely led us to apply neural networks, which belong to the domain of artificial intelligence, to this task.
When developing the prediction procedure, we first had to examine and decide which of the input data would be useful and which would only introduce unnecessary noise. After an in-depth analysis, it turned out that our data set would be based on two types of data. The first is information on the state of air quality (PM10 and PM2.5) from the last several dozen measurements coming from our dense sensor network, while the second is selected weather data.
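To make the shape of such a data set more concrete, here is a minimal sketch of how one training example could be assembled from hourly series. Everything specific in it is an assumption made for illustration and not a description of Airly's actual pipeline: the function name `make_example`, the window of 48 past measurements (the text says only "several dozen"), and the choice of temperature and wind speed as the "selected weather data".

```python
from typing import List, Tuple

HISTORY = 48   # assumed number of past hourly measurements used as input
HORIZON = 24   # forecast length in hours, as described in the article


def make_example(
    pm10: List[float],
    pm25: List[float],
    weather: List[Tuple[float, float]],
    t: int,
) -> Tuple[List[float], List[Tuple[float, float]]]:
    """Build one (inputs, targets) pair for a forecast issued at hour index t.

    inputs  : PM10, PM2.5 and weather values (hypothetical: temperature,
              wind speed) for the HISTORY hours before t, flattened
    targets : the (PM10, PM2.5) pair for each of the next HORIZON hours
    """
    if not (len(pm10) == len(pm25) == len(weather)):
        raise ValueError("all series must have the same length")
    if t < HISTORY or t + HORIZON > len(pm10):
        raise ValueError("not enough data around index t")

    inputs: List[float] = []
    for i in range(t - HISTORY, t):
        temperature, wind_speed = weather[i]
        inputs.extend([pm10[i], pm25[i], temperature, wind_speed])

    targets = [(pm10[i], pm25[i]) for i in range(t, t + HORIZON)]
    return inputs, targets
```

Sliding `t` over a long measurement history yields many such pairs, which is the form in which a regression model or a neural network would typically consume the data.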
As I mentioned earlier, prediction is not a simple task; it can even be said to be a real challenge, especially if we take into account the hour-by-hour resolution of the 24-hour forecast. In the search for a prediction model, a lot depends on the creativity of the research team, because it is quite difficult to even count the algorithms that could take on this challenge and generate a better or worse result. When solving a problem as complicated and multifactorial as the air quality forecast, it is good to use methods belonging to the domain of computational intelligence. Artificial neural networks are practically predestined for tasks where all the knowledge is condensed in the data. And what are these data? They are collections of examples, often numerical, that constitute a natural knowledge base which, during the process called learning, is transferred into the structure of the neural network.
The search for an appropriate neural network structure lasted several months. During our research, we tested existing classic neural networks, such as multilayer perceptron structures, and also reached for the latest scientific achievements known as 'deep learning'. Based on the results obtained, we proposed many proprietary solutions for the topology of the neural network. Each simulation slowly brought us closer to a better solution. Now, at the summary stage, we can say that several types of neural structures have been tested and a huge number of tests have been carried out within each of them. In total, thanks to simulations run in the cloud, the operation of over 1,500 different neural network instances was checked. Data from more than 2,000 Airly stations measuring suspended particulate pollution were used to train and validate the neural structures. Such a large volume of data allows the knowledge base to accumulate many interesting behaviors related to air pollution. Interestingly, we will never learn about them directly, because they exist only as practically endless sequences of data whose "interpretation" is carried out by the learning algorithms of artificial neural networks.
After many verifications and tests, we are giving you a ready-to-use algorithm based on the latest solutions in artificial neural networks. In addition, the prediction procedure has been optimized for speed, so that a new forecast for the next 24 hours is generated every hour. Interestingly, the basic type of neural network we used did not yet exist when Airly took its first steps in air quality forecasting. What is more, I have to point out that this type of neural network has so far been used in completely different, conceptually distant tasks rather than in prediction or regression problems.
While working on the air quality forecasting algorithm, we presented our research at several renowned scientific conferences, where it met with a very warm welcome. It may sound immodest, but at one of them our paper was judged the best of all those submitted. Clearly, apart from praise from the scientific community, numerical evaluation is important. That is why I now want to tell and show you, in a few words, how our forecasting model is verified. The model itself calculates PM10 and PM2.5 values, but presents the CAQI index value as its output. This index expresses the state of air quality as a number on a scale from 0 to 100 (and beyond for very polluted air), where a low value means good air quality and a high value indicates poor air quality. The scale is divided into five ranges: the first four are intervals with a width of 25 units each, and the last one covers values above 100.
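The division of the index into ranges can be written down in a few lines. The sketch below is only an illustration: the function name `caqi_range` is hypothetical, and the treatment of values that fall exactly on a boundary (an upper edge is counted in the lower range) is an assumption, since the text does not specify it.

```python
import math


def caqi_range(caqi: float) -> int:
    """Return the number (1 to 5) of the range that a CAQI value falls into."""
    if caqi < 0:
        raise ValueError("CAQI cannot be negative")
    if caqi > 100:
        return 5  # the open-ended range above 100
    # Four ranges of width 25; assumption: an upper edge belongs to the lower range.
    return max(1, math.ceil(caqi / 25))
```

For instance, under these assumptions a value of 60 lies in the third range, while any value above 100 lies in the fifth.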
The figure shows the accuracy of the air quality forecast in two variants. The first, shown in red, assumes that we are interested in an absolute error (i.e. the absolute value of the difference between the predicted value and the actual value) no greater than 25 CAQI index units, while the second, marked in yellow, assumes an error no greater than 12.5 units of the index. These two thresholds correspond to the full width and half the width of one range of the scale used for the verbal description of air quality. From the graph, we can see that if we accept the entire width of the range, i.e. 25 CAQI index units, as a satisfactory maximum error, then we achieve a verifiability of almost 99% for the first hour and 95% for the 24th hour. The average verifiability for this variant is thus practically 96%. In the second, more restrictive case, the verifiability is 92% for the first hour, then falls and reaches 73.5% for the 24th hour. Overall, this variant has a verifiability of 77.5%.
On our air quality maps, we will present the current forecast verifiability using this methodology. As the measure, we will use the share of forecasts whose error does not exceed the full width of one range, calculated over a period of two weeks. Thus, a verifiability of 95% means that over the last 14 days, 95 out of 100 generated forecasts had an error no larger than one CAQI index range.
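The measure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Airly's production code: the function name `verifiability` and the sample numbers are hypothetical, and the inputs are assumed to be matching lists of predicted and measured CAQI values collected over the evaluation period.

```python
from typing import Sequence


def verifiability(
    predicted: Sequence[float],
    actual: Sequence[float],
    max_error: float = 25.0,
) -> float:
    """Percentage of forecasts whose absolute CAQI error is at most max_error."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one forecast is required")
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= max_error)
    return 100.0 * hits / len(predicted)


# Made-up sample values, for illustration only.
predicted = [30.0, 55.0, 80.0, 40.0]
actual = [35.0, 70.0, 50.0, 41.0]
full_range = verifiability(predicted, actual, max_error=25.0)   # full range width
half_range = verifiability(predicted, actual, max_error=12.5)   # half range width
```

Calling the function separately for each forecast hour (1 to 24) would give the kind of per-hour curve discussed for the figure, while calling it on all forecasts from the last 14 days would give the single value shown on the maps.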
To conclude this article, I would like to thank the entire data analysis team and all my colleagues for the quality of their input and for the numerous discussions without which we would not have created such an excellent algorithm. In particular, I would like to thank Kasper, Olek, Denis and Michał for the wonderful cooperation and research inspiration.
About the author:
Piotr A. Kowalski is a scientist working as a professor at AGH at the Faculty of Physics and Applied Computer Science, as well as at the Systems Research Institute of the Polish Academy of Sciences. In 2003, he obtained a master's degree in "Teleinformatics" and "Control Science" (both with distinction) at the Cracow University of Technology. In 2009, he defended his doctorate (Ph.D.) in intelligent data analysis, and in 2018 he obtained the habilitation degree (D.Sc.) in computer science at the Polish Academy of Sciences. Since 2018, he has been associated with Airly, where he serves as an expert in machine learning and as the coordinator of the data analysis department. He is responsible for scientific research and for the implementation of air quality prediction procedures. His research interests lie in the field of information technology and focus on intelligent methods (neural networks, fuzzy systems and algorithms inspired by nature) applied to complex systems and to algorithms for knowledge discovery.