eureka

# "EUREKA!"

## How to build accurate predictors for real-valued outputs from simple methods

Authors: Luis Moreira-Matias, Ricardo Bessa and Josif Grabocka

### 1. Abstract

The world relies on numerical values. Today, it is possible to translate almost every activity into (one or multiple) series of numbers. The increasing number of sensor networks increased dramatically the availability of this data. Such availability pushed the Industry to invest more and more money on pulling out valuable information from such streams of data in order to infer their future values. However, such investment arises an important question: is this task that hard and/or expensive? In this tutorial, we will cover some of the simplest State-of-The-Art methods on performing short and mid-term predictions for real-valued outputs. Our goal is to provide some introductory background knowledge, as well as some hints and tricks which can ease the application of these methods to industrial problems. The target audience of this tutorial includes (but is not limited to) students that are beginning their track on studying Machine Learning/Data Mining methods, as well as industrial practitioners interested on know more about this type of numerical prediction on an easy going manner, independently of each individual’s background.

### 2. Theoretical Contents

Nowadays, three of the most popular regression methods are the Random Forests (RF), Artificial Neural Networks (ANN) and Support Vector Regression (SVR). They present a large range of applications, from transportation [chien02, jeong05, bin06, li2011, mendes-moreira12] to energy [bessa09, juban08, zamo14], among other industries.

Notwithstanding the capacity that these methods present on explaining the complex relationships between the explanatory variables and the target one, they can be roughly difficult to apply to every scenario without having a certain amount of expertise on each one of them.

The simplest one to apply out of these three is the RF - however it is highly dependable on a successful selection of a metric able to select the feature(s) to be included on each node's split criteria. Moreover, these methods only provide deterministic (or point) forecasts, which are insufficient for decision-making problems under risk - which comprise a common real world problem on different domains. This characteristic is intrinsic to one of their major drawbacks of these powerful methods: the so-called magic formula that they follow to predict real-valued outputs, as they do not provide almost no interpretability options about the target function (i.e. they are black-box methods). This issue mislead both beginners and industrial practitioners to think that they are the best solution to simply solve any and every prediction problem they are facing. However, we do know that there is not such a thing as free lunch.

Consequently, a question arises: do we need to be Machine Learning (ML) experts to build accurate prediction methods? In this tutorial, we aim to demonstrate that we do not. We will do it so by revisiting the basics behind some simple prediction methods which can cover an wide range of applications, independently of their domain and/or industry. These contents are briefly described in the following subsections.

#### M1. Linear Regression and Time Series Analysis

Linear Regression [draper2014] is one of the oldest and simplest prediction methods for real-valued outputs.

We will start this tutorial by revisiting its basics (i.e. target, learning and loss function) for univariate analysis with illustrative applications, along with some appointments for Bayesian Linear Regression [box2011] and Kernel Regression [nadaraya64].

Then, we will introduce time series analysis as different type of approach for predicting real-valued outputs, highlighting their value to deal with short-term horizons (per opposition to the Linear Regression methods, which aim to approximate a given target function). ARIMA [Box1976] and Exponential Smoothing [Holt2004] methods will be briefly introduced with some comparisons with the Linear Regression methods previously described. Finally, some basics on multivariate linear regression are provided by showing how simple matrix operations can be used to learn from data.

#### M2. Prediction through Matrix and Tensor Decompositions

Sometimes, we do make the problems more complex than they can be by using a large set of features without any selection criteria. Dimensionality reduction is a century old statistical procedure that aims at reconstructing a set of data observations from a number of latent prototypes. Pioneering approaches such as the Principal Component Analysis [jolliffe2002] and spin-off variations have a long tradition as feature extraction. The derived low-rank features do not contain noise and therefore boost classification in a range of applications domains, for instance in time-series classification [grabocka14]. Yet, a modern use of dimensionality reduction has recently arisen in terms of direct prediction. The factorization of the “buyers and products” matrix has revolutionized the field of recommender systems [koren09]. In the scenario of product prediction, the existing ratings (non-zero cells of the matrix) serve as the training set and the learned decomposition is used to predict the missing values of the matrix, i.e. the rating of items for which a buyer has provided no feedback. In addition, high-order interactions of the latent dimensions are modeled using Factorization Machines [rendle10]. Furthermore, the decomposition of tensors (multi-dimensional arrays) is currently used in a myriad of applications, ranging from the prediction of relations in Social Networks [rettinger12], up to the estimation of travel times for the vehicle trips of a city [wang14].

#### M3. Extreme Learning Machines

Multilayer perceptron (MLP) feedforward neural network is a powerful non-parametric ML technique used in many real-world applications. The number of optimization algorithms and heuristics used to estimate their weights and bias are vast. One interesting approach, which will be presented in this tutorial, is the Extreme Learning Machine (ELM) [Huang06a] that basically consists in a single layer feedforward neural network with the weights/bias of the first layer randomly generated (i.e., are not tuned) and the weights of the output layer are analytically determined by a least squares solution.

The main advantages of this approach are its simplicity and computational time, maintaining the universal approximation capability of the traditional MLP.

#### M4. Probabilistic forecasting with a basic linear quantile model

The prediction from most ML models (e.g., ANN) provide point (or deterministic) forecasts that are an estimate of the response variable conditional mean, which only measures the “center” of the conditional distribution. A more complete characterization of the conditional distribution is provided by probabilistic forecasts, which consists of estimating the future uncertainty of the response variable that can be expressed as a probability measure. There are several models for estimation of probabilistic predictions, and studies concentrate both on the characterization of the sources of uncertainty, and on the development of methods for on-line uncertainty estimation. In this tutorial, the use of a basic linear quantile regression to generate probabilistic forecasts (i.e., set of quantile forecasts) will be addressed. It is explained how the linear quantile regression model works and how can be modified to better handle non-linear relations between response variable and predictors in a local regression framework [bremnes04].

#### M5. Some notes on Ensemble Learning

Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms. In this module, we will provide some hints on how to combine fair predictors to build powerful predictive machines for real-time learning. We will revisit the basics for Bagging [*breiman96*] to then perform dynamic model selection through linear combination of diverse model's outputs based on their recent performance [*wang2003mining*]. Finally, we provide some hints on how we could play with some reactiveness parameters and change detection algorithms (e.g., CUSUM [*page1954*]) the boost our predictive performance by handling drift adequately.

### 3. Real-World Applications

In this second part, we complete the theoretical subjects by presenting three successful real-world ML applications. Those applications can be divided into three folds:

• Real-Time Prediction of Taxi-Passenger Demand for a short-term horizon using an Online Ensemble Learning model (M5) the over multiple incremental time series analysis methods (M1) [moreira-matias2013, moreira-matias2013a];
• Renewable Energy Forecasting for a mid-term horizon based on ELM (M3) and on quantile-based probabilistic forecasting models (M4) [bessa12, bessa14];
• Link Prediction in Social Networks focuses on predicting relationships (links) among users. Matrix factorization (M2) has been successfully adopted to tackle the problem [Nickel2011]. Moreover, multi-relational factorizations can accurately accommodate auxiliary (for instance, shared hobbies) user information [Krohn-Grimberghe2012, Drumond2014].

### 4. Potential Audience

By including all these fundamental (and, somehow, introductory) concepts, we expect to attract all the Master's and PhD students working on related topics. The idea is to fill a gap that most of this students have on their awareness - as they soon focus on one or two type of algorithms/applications to conduct their research. On other hand, by providing some simple tricks and concrete aplicational cases, we also expect to attract industrial practitioners which may face this as an opportunity to collect a novel idea on how they can push forward their business and/or products leveraging on their own data (or on the one provided by their customers).

### 5. Presenter's Bio

#### Luis Moreira-Matias

Luis Moreira-Matias received a M.Sc. degree in Informatics Engineering and Computation from the University of Porto, Portugal, in 2009. He won an European Data Mining competition held during a Research Summer School held at TU Dortmund, in 2012. Currently, he is a soon-to-graduate Ph.D. candidate in Machine Learning at the same university and he works as Research Scientist at NEC Laboratories Europe (Heidelberg, Germany), integrated in the Intelligent Transportation Systems division. His research interests include Supervised Learning, Time Series Analysis, Online Learning and Intelligent Public Transports. He also authored more than 15 publications on world-leading journals and venues on these topics, such as the Information Sciences or the IEEE Transactions on Intelligent Transportation Systems.

#### Ricardo Bessa

Ricardo Bessa received his \textit{Licenciado} (five-year) degree from the Faculty of Engineering of the University of Porto, Portugal (FEUP) in 2006 in Electrical and Computer Engineering. In 2008, he received the M.Sc. degree in Data Analysis and Decision Support Systems on the Faculty of Economics of the University of Porto (FEP). He obtained his Ph.D. degree in the Doctoral Program in Sustainable Energy Systems at FEUP in 2013. Currently, he is a Senior Researcher and Area Manager at INESC TEC in its Center for Power and Energy Systems.

He worked in several international projects such as the European Projects ANEMOS.plus (FP6), SuSTAINABLE (FP7), EvolvDSO (FP7), and an international collaboration with Argonne National Laboratory for the U.S. Department of Energy. At the national level, he participated in the development of a wind and solar power forecasting system for the Portuguese wind farm owners.

His research interests include renewable energy forecasting, electric vehicles, data mining and decision-making under risk. He also authored 22 publications on Science Citation Index journals on these topics and it is a co-founder of a spin-off company called Prewind that sells renewable energy forecasting services.

#### Josif Grabocka

Josif Grabocka is a soon-to-graduate doctoral candidate affiliated with the Information Systems and Machine Learning Lab, Institute of Computer Science, University of Hildesheim, Germany. He obtained his Diploma in Computer Engineering at the Middle East Technical University of Ankara, Turkey in 2007 and received a Master degree in Applied Artificial Intelligence from the University of Dalarna, Sweden in 2010. His primary research interest is Machine Learning and specifically classification of time series. He has authored papers in selective conferences on the field, such as AAAI, ECML and ACM SIGKDD and journals such as “Data Mining and Knowledge Discovery” or IEEE Transactions on Knowledge and Data Engineering.

### Confirmed Presenting Venues

• ECML/PKDD 2015 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

### 6. Acknowledgments

This work was funded (i) by the SusCity project (“SusCity – MITP-TB/CS/ 0026/2013”) financed by national funds through “Fundacao para a Ciencia e a Tecnologia (FCT), Portugal”, (ii) by “Deutsche Forschungsgemeinschaft” through project HyLAP, by (iii) the EU FP7 project iTalk2Learn (\#318051) and by (iv) the EU FP7 project TEAM (\#318621).

### References

[1] S. Chien, Y. Ding, and C. Wei. Dynamic bus arrival time prediction with arti cial neural networks. Journal of Transportation Engineering, 128(5): 429{438, 2002.

[2] R. Jeong and L. Rilett. Prediction model of bus arrival time for real-time applications. Transportation Research Record: Journal of the Transportation Research Board, 1927(1):195{204, 2005.

[3] Y. Bin, Y. Zhongzhen, and Y. Baozhen. Bus arrival time prediction using support vector machines. Journal of Intelligent Transportation Systems, 10 (4):151{158, 2006.

[4] B. Li, D. Zhang, L. Sun, C. Chen, S. Li, G. Qi, and Q. Yang. Hunting or waiting? discovering passenger- nding strategies from a large-scale real-world taxi dataset. In 2011 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), pages 63{68, 2011.

[5] J. Mendes-Moreira, A. Jorge, J. de Sousa, and C. Soares. Comparing state-of-the-art regression methods for long term travel time prediction. Intelli-gent Data Analysis, 16(3):427{449, 2012.

[6] R. Bessa, V. Miranda, and J. Gama. Entropy and correntropy against minimum square error in o ine and online three-day ahead wind power forecasting. Power Systems, IEEE Transactions on, 24(4):1657{1666, 2009.

[7] J. Juban, L. Fugon, and G. Kariniotakis. Uncertainty estimation of wind power forecasts: Comparison of probabilistic modelling approaches. In Eu-ropean Wind Energy Conference & Exhibition EWEC 2008, pages 10{pages. EWEC, 2008.

[8] M Zamo, O Mestre, P Arbogast, and O Pannekoucke. A benchmark of sta-tistical regression methods for short-term forecasting of photovoltaic elec-tricity production, part i: Deterministic forecast of hourly production. Solar Energy, 105:792{803, 2014.

[9] N. Draper and H. Smith. Applied regression analysis. John Wiley & Sons, 2014.

[10] G. Box and G. Tiao. Bayesian inference in statistical analysis, volume 40. John Wiley & Sons, 2011.

[11] E. Nadaraya. On estimating regression. Theory of Probability & Its Appli-cations, 9(1):141{142, 1964.

[12] G. Box, G. Jenkins, and G. Reinsel. Time series analysis. Holden-day San Francisco, 1976.

[13] C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 20(1):5{10, 2004.

[14] I. Jolli e. Principal component analysis. Wiley Online Library, 2002.

[15] J. Grabocka and L. Schmidt-Thieme. Invariant time-series factorization.

Data Mining and Knowledge Discovery, 28(5-6):1455{1479, 2014.

[16] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30{37, 2009.

[17] S. Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995{1000. IEEE, 2010.

[18] A. Rettinger, H. Wermser, Y. Huang, and V. Tresp. Context-aware tensor decomposition for relation prediction in social networks. Social Network Analysis and Mining, 2(4):373{385, 2012.

[19] Y. Wang, Y. Zheng, and Y. Xue. Travel time estimation of a path using sparse trajectories. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 25{34. ACM, 2014.

[20] G. Huang, Q. Zhu, and C. Siew. Real-time learning capability of neural networks. IEEE Transactions on Neural Networks, 17(4):863 { 878, july 2006.

[21] J. Bremnes. Probabilistic wind power forecasts using local quantile regres-sion. Wind Energy, 7(1):47{54, 2004.

[22] L. Breiman. Bagging predictors. Machine learning, 24(2):123{140, 1996.

[23] H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams using ensemble classi ers. In Proceedings of the ninth ACM SIGKDD inter-national conference on Knowledge discovery and data mining, pages 226{ 235. ACM, 2003.

[24] E. Page. Continuous inspection schemes. Biometrika, 41(1/2):100{115, 1954.

[25] L. Moreira-Matias, J. Gama, M. Ferreira, J. Mendes-Moreira, and L. Damas. Predicting taxi-passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems, 14(3):1393{1402, 2013.

[26] L. Moreira-Matias, J. Gama, M. Ferreira, J. Mendes-Moreira, and L. Damas. On predicting the taxi-passenger demand: A real-time approach. In Progress in Arti cial Intelligence, volume 8154 of LNCS, pages 54{65. Springer, 2013.

[27] R. Bessa, V. Miranda, A. Botterud, J. Wang, and E. Constantinescu. Time adaptive conditional kernel density estimation for wind power forecasting. Sustainable Energy, IEEE Transactions on, 3(4):660{669, 2012.

[28] R. Bessa, A. Trindade, and V. Miranda. Spatial-temporal solar power fore-casting for smart grids. Industrial Informatics, IEEE Transactions on, 11 (1):232{241, Feb 2015.

[29] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Lise Getoor and Tobias Sche er, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML '11, pages 809{816, New York, NY, USA, June 2011. ACM. ISBN 978-1-4503-0619-5.

[30] A. Krohn-Grimberghe, L. Drumond, C. Freudenthaler, and L. Schmidt-Thieme. Multi-relational matrix factorization using bayesian personalized ranking for social network data. In Proceedings of the Fifth ACM Inter-national Conference on Web Search and Data Mining, WSDM '12, pages 173{182, New York, NY, USA, 2012. ACM.

[31] L. Drumond, E. Diaz-Aviles, L. Schmidt-Thieme, and W.. Nejdl. Optimiz-ing multi-relational factorization models for multiple target relations. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM '14, pages 191{200, New York, NY, USA, 2014. ACM.