data markets V: online learning
Hello! Today we are looking into Online Machine Learning. Classically, machine learning works by using a set of training data to train a predictor, and this predictor is then used on new data as it comes in. In online machine learning, the predictor is instead trained continuously as new data becomes available. This is useful for several reasons, e.g. when the underlying distribution changes over time. An example task could be predicting a house price based on features like its square footage and the number of rooms it contains. A non-online method would gather a lot of data on this problem and train a predictor once, while an online method would do that too, but would additionally keep training on new house sales as they appear.
I will implement an online linear regression market mechanism with a forgetting mechanism known as exponential forgetting. A forgetting mechanism makes the algorithm forget old observations over time, and is useful if we expect the distribution of our data to change over time. This is just one of many different online machine learning mechanisms.
online learning & markets
Online learning methods can be integrated into a regression market structure like the ones we have discussed so far. Pinson et al. (2022)1 give an introduction to how these methods work, as well as use cases within the wind energy market. I will adapt their method for constructing a market around an online regression method. They present several ways of pricing features, but the most natural is the Shapley value, as I discussed in my last post. I will operate here on observational Shapley values, which are Shapley fair, but not robust to replication, as we learned in my last post.
utilitarian online learning
With online regression methods in a market setting, you might have a situation where one of your data suppliers does not always provide data. This presents an interesting problem for the online regression method: how do we deal with missing data, both in the training and the prediction step? Utilitarian online learning is a branch of online learning that deals with varying feature spaces, i.e. where the same features aren’t available at all points in time. These methods can be used for online regression with data sources that aren’t always available. You could e.g. be using input from an environmental microphone on a street somewhere in a city, but the microphone might be down one day and thus not supplying information.
There is a recent survey article by He et al. (2023)2 that goes through several common utilitarian online learners, which you may find interesting if you think online regression could be useful for you. I won’t dive further into this at this time, but in the future I will dive into the OCDS algorithm by He et al. (2019)3 specifically, which is a linear classifier that reconstructs features based on feature correlations.
linear regression
If you don’t know the mathematics of linear regression, the idea is basically to find a set of parameters \(\hat{\mathbf{\beta}}\) defining a linear function of your observed features that, on your observed datapoints \(\mathbf{X}\), is as close as possible to the observed target values \(\mathbf{y}\). Here’s a great blog post about linear regression if you’d like to learn more about its theoretical underpinnings. Either way, we’re going to start with a quick brush-up on linear regression.
So, let’s say we have our datapoints in the matrix \(\mathbf{X}\). Every point \(\mathbf{x_t}\) comes as a vector of values for each feature \(i\), so \(\mathbf{x_t} = [ x_{t,1}, x_{t,2}, ..., x_{t,i}, ..., x_{t,m}]\), where \(m\) is the number of features. Each point also has an associated \(y_t\), which is the value we want to approximate with our linear function. We’re minimizing the square of the difference between our function value and the true value \(y_t\), \(|f(\mathbf{x_t}) - y_t|^2\), summed over all \(t\), i.e. the sum of squared differences. The difference \(f(\mathbf{x_t})-y_t\) is what is known as the residual, so squaring it gives the squared residual; hence another name for linear regression, the line minimizing the residual sum of squares. The minimization problem is given in matrix notation as
\[\underset{f}{min}|f(\mathbf{X}) - \mathbf{y}|^2\]Since \(f\) is linear, we can describe it just based on a vector of coefficients \(\mathbf{\beta}\). The expression can be instead described as
\[\underset{\mathbf{\beta}}{min}|\mathbf{X}\mathbf{\beta} - \mathbf{y}|^2\]It turns out that there is a very simple solution to this problem. In fact, using some basic optimization theory and looking at the derivatives of the equation, one would find that at the optimum value for \(\mathbf{\beta}\)
\[\mathbf{X}^T\mathbf{X} \mathbf{\beta} = \mathbf{X}^T \mathbf{y}\]We give a new name to the matrix, \(\mathbf{M}=\mathbf{X}^T\mathbf{X}\). Then, the solution to the line-of-best-fit when minimizing sum square error is defined by the coefficients \(\mathbf{\beta}\),
\[\mathbf{\beta} = \mathbf{M}^{-1} \mathbf{X}^T \mathbf{y}\]And that was a crash course in linear regression.
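As a quick sanity check, the closed-form solution above can be computed directly with NumPy. This is a minimal sketch on synthetic data; all names here are mine, and in practice you would solve the linear system rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 features
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

M = X.T @ X                            # M = X^T X
beta = np.linalg.solve(M, X.T @ y)     # solves M beta = X^T y
```

With this little noise, `beta` lands very close to `true_beta`.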
online linear regression
Linear regression can be done fairly easily in an online fashion. Our regression is essentially defined by just the two quantities \(\mathbf{M}\) and \(\mathbf{\beta}\), so we can simply update these as we move along in time.
We first choose some subset of our data that we already have as base data for initialization, where we perform a direct linear regression. To perform this linear regression, we construct a matrix \(\mathbf{M}_{init}= \mathbf{X}_{init}^T \mathbf{X}_{init}\). We then calculate our initial parameters \(\mathbf{\beta}_{init}=\mathbf{M}_{init}^{-1}\mathbf{X}_{init}^T\mathbf{y}_{init}\).
Now, to get into the online setting, we’ll update our matrix \(\mathbf{M}\) and our parameters \(\mathbf{\beta}\) every time step. First, let’s think about the \(\mathbf{M}\) matrix. We can define an \(\mathbf{M}\) matrix for just a single observation, \(\mathbf{m_t} = \mathbf{x_t}^T \mathbf{x_t}\). Then we can keep a running matrix \(\mathbf{M_t}\) that gets updated each iteration, e.g. as \(\mathbf{M_t}= \mathbf{M_{t-1}} + \mathbf{m_t}\), with \(\mathbf{M_0}=\mathbf{M}_{init}\). We’re not going to update it exactly like that, though, since that would retain all former information at every update. We actually want to forget some information, so that our regression can adapt to a changing underlying distribution. So we’ll introduce a memory factor \(\lambda\) (note that this memory factor is, somewhat confusingly, known as the forgetting rate in the Pinson paper). Then, we’ll perform the update as
\[\mathbf{M_t} = \lambda \mathbf{M_{t-1}} + \mathbf{m_t}\]For the full update, we’ll actually use something known as a Newton-Raphson step, since there is no similarly simple way to update the parameters \(\mathbf{\beta}\). This also allows us to minimize loss functions other than the sum of squared residuals (see the Pinson paper if this interests you). The full update goes as follows:
First, you calculate the residual
\[\mathbf{\epsilon_{t}} = y_t - \mathbf{\beta}^T_{t-1}{\mathbf{x}}_{t}\]Then, you update the matrix \(\mathbf{M}\)
\[\mathbf{M}_{t} = \lambda \mathbf{M}_{t-1} + {\mathbf{x}}_{t}^T {\mathbf{x}}_{t}\]Then, you update the parameters \(\mathbf{\beta}\) with the new \(\mathbf{M}\) matrix. This is where Newton-Raphson enters the picture.
\[\mathbf{\beta}_{t} = \mathbf{\beta}_{t-1} + \mathbf{M}_{t}^{-1}{\mathbf{x}}_{t} \mathbf{\epsilon_{t}}\]Now, we have our update step.
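The three update steps above can be sketched as a small function. This is a hedged sketch (function and variable names are mine, and the data stream is synthetic and noiseless for illustration):

```python
import numpy as np

def online_update(M, beta, x_t, y_t, lam=0.99):
    """One step: residual, M update with forgetting, then the Newton-Raphson step."""
    eps_t = y_t - beta @ x_t                       # residual epsilon_t
    M = lam * M + np.outer(x_t, x_t)               # M_t = lambda M_{t-1} + x_t^T x_t
    beta = beta + np.linalg.solve(M, x_t) * eps_t  # beta_t = beta_{t-1} + M_t^{-1} x_t eps_t
    return M, beta

# Initialize on a small batch, as described above.
rng = np.random.default_rng(1)
beta_true = np.array([1.0, 3.0])
X_init = rng.normal(size=(50, 2))
y_init = X_init @ beta_true
M = X_init.T @ X_init
beta = np.linalg.solve(M, X_init.T @ y_init)

# Then stream in new observations one at a time.
for _ in range(200):
    x_t = rng.normal(size=2)
    M, beta = online_update(M, beta, x_t, x_t @ beta_true)
```

Since the stream here is noiseless and stationary, the coefficients simply stay at the true values; with a drifting distribution, the forgetting factor is what lets them track the change.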
turning it into a market
So, now we can perform an online linear regression, and we can go and create a market from it. As mentioned, we’ll follow the recipe provided by the Pinson paper. We’ll create a single-buyer multiple-seller market, where the single buyer is denoted the “Central agent”, and the set of all features is known as the “Grand Coalition”, or \(\Omega\).
The paper allocates payment based on an estimator of the loss. For a certain group of features \(\omega\), the loss estimate is denoted \(L_{\omega}\). The difference between loss estimates can be used as an estimate for how much adding a feature reduces the loss of the method. The loss estimators are calculated as:
\[L_{\omega,t}^* = \lambda L_{\omega,t-1}^* + \frac{1}{n_\lambda} l_{\omega,t}(\beta_\omega)\]where \(l_{\omega,t}(\beta_\omega)\) is the loss function for the group of features \(\omega\) at the given time step with coefficients \(\beta_\omega\), and \(n_\lambda = \frac{1}{1-\lambda}\). \(n_\lambda\) comes from the equation
\[\sum_{k=0}^\infty C \lambda^k = \frac{1}{1-\lambda} C\]This says something about what happens when you repeatedly update a value like we do with \(\mathbf{M}\) above. If we had a scalar value \(b_t\), and at every step updated it as \(b_t = \lambda b_{t-1} + C\), then \(b_t\) would converge to \(\frac{1}{1-\lambda} C=n_\lambda C\) as \(t\) goes to infinity. This means that if we scale a term down by \(\frac{1}{n_\lambda}\) before adding it, the “scale” of our running value will converge to that of the term itself.
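A quick numeric check of this scaling argument (the numbers are arbitrary):

```python
lam = 0.95
C = 2.0
n_lam = 1.0 / (1.0 - lam)   # = 20

# Repeatedly updating b <- lambda * b + C converges to n_lambda * C = 40.
b = 0.0
for _ in range(1000):
    b = lam * b + C

# Scaling the added term by 1/n_lambda instead, as in the loss estimator
# update, makes the running value converge to the scale of the term itself.
L_star = 0.0
for _ in range(1000):
    L_star = lam * L_star + (1.0 / n_lam) * 3.0   # a constant "loss" of 3
# L_star is now approximately 3
```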
So, loss estimates are a form of running average of the loss of the regression when using a certain group of features. I only calculate the loss estimates out-of-sample, since I find it sensible for a data user to only be willing to pay for out-of-sample performance. The paper also touches on payment for in-sample performance, though.
We now want to determine how much to pay each feature. We can use the loss estimates as the payoff of a group of features seen as a coalition from cooperative game theory (see my last blog post), and thus determine the Shapley value of each feature. Payment is then based on the Shapley values: \(\psi_{k,t}\) denotes the resulting payment allocation fraction for feature \(k\) at time step \(t\).
\[\pi_{k,t} = (L_{\omega_c}^* - L_{\Omega}^*) \, \phi \, \psi_{k,t}\]\(\pi_{k,t}\) is the payment to feature \(k\) at time \(t\). The \(L^*\)s are loss estimators, \(L^*_{\omega_c}\) for the central agent’s own features and \(L_{\Omega}^*\) for the grand coalition. \(\psi_{k,t}\) is the Shapley-based payment allocation fraction for feature \(k\), and \(\phi\) is the willingness of the central agent to pay for improving the value of the loss function \(l\).
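A minimal sketch of the Shapley split over loss improvements for two features. The coalition losses here are made-up numbers and all names are mine; in the market, these values come from the loss estimators discussed above:

```python
from itertools import combinations
from math import factorial

loss = {                 # L* for each coalition of the features 'a' and 'b'
    (): 10.0,            # central agent alone, no market features
    ('a',): 6.0,
    ('b',): 7.0,
    ('a', 'b'): 4.0,     # the grand coalition
}
features = ['a', 'b']
m = len(features)

def v(S):
    """Value of coalition S: its loss improvement over using no market features."""
    return loss[()] - loss[tuple(sorted(S))]

def shapley(k):
    """Shapley value of feature k: weighted average marginal contribution."""
    others = [f for f in features if f != k]
    total = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            w = factorial(len(S)) * factorial(m - len(S) - 1) / factorial(m)
            total += w * (v(set(S) | {k}) - v(S))
    return total

phi_shap = {k: shapley(k) for k in features}
# By efficiency, the Shapley values sum to the grand coalition's improvement,
# so paying willingness * phi_shap[k] is the same as splitting the total
# payment by the allocation fractions psi_k = phi_shap[k] / sum(phi_shap).
willingness = 1.0        # EUR per unit of loss improvement
payments = {k: willingness * phi_shap[k] for k in features}
```

Here feature `a` gets the larger share, since it contributes more both alone and on top of `b`.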
an online market
Usually, my demos focus mostly on the gain for the person using the market. This time, I want to show off some of the nitty-gritty parts of the algorithm. Let’s imagine that you’re a rental bike provider. (The dataset is completely artificial this time, and not very true-to-life.) You want to predict the demand for rental bikes tomorrow, and for every bike you’re off from the true demand, it costs you 5€ in missed rides, oversupply, or something similar. You don’t have any data of your own to predict bike demand, so you turn to the market, and you want to predict demand in an online fashion, since you suspect the effect of different features changes over time.
You will pay with a willingness of 1€ for every unit reduction in the RMSE. This should let you make a good profit on forecasting improvements from market features.
The demo this time also shows the inner workings of the algorithm. The central parts of the online linear regression are the \(M\) matrix and the vector of coefficients \(\beta\). The \(M\) matrix is updated every iteration based on the new observation, and \(\beta\) is then updated based on the new \(M\) matrix and the current residual. To calculate the observational Shapley values, there are additional regressions running in parallel to the main regression, used only to calculate the loss estimators discussed earlier. The market doesn’t guarantee that you’ll save money at every time step, since I only look at out-of-sample predictions, but it does save you money over the long run.
Press the “Next day” button to advance the simulation and fill out the tables below!
| M Matrix | Constant | Precipitation | Wind Speed | Temperature |
| --- | --- | --- | --- | --- |
| Observed values | 1 | | | |
| Coefficients | | | | |
| Shapley Values | | | | |
| Payments | | | | |
| | Value | Error |
| --- | --- | --- |
| True value of y | | |
| Prediction with no market features | | |
| Prediction with market features | | |
| Money spent on market data | | |
| Money spent on bike management with market | | |
| Money spent on bike management without market | | |
| Net savings from using market | | |
| Avg. net savings from using market | | |
Day #0
You’ll find that since the underlying distribution changes over time, so too do the coefficients, which lets the regression stay up to date with the changing distribution. The market saves you money. You can also investigate how the \(M\) matrix and the coefficients update every iteration. Online regression can make a lot of sense when distributions change over time, as opposed to batch training and prediction based on a fixed batch. Many real-world phenomena do change over time, like this situation with consumer demand, and this is where online regression comes in very useful.
conclusion
So, we can construct a market not just with batch learning, but also using online machine learning. These methods allow us to update our regression method as new data becomes available to us, which is a very nice property. For my next post, I’ll tackle the OCDS3 algorithm and missing data in online machine learning. Thanks for reading!
references
- Pinson, P., Han, L. & Kazempour, J. Regression markets and application to energy forecasting. TOP 30, 533–573 (2022). https://doi.org/10.1007/s11750-022-00631-7
- He, Y., Schreckenberger, C., Stuckenschmidt, H. & Wu, X. Towards Utilitarian Online Learning – A Review of Online Algorithms in Open Feature Space. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, 6647–6655 (2023). https://doi.org/10.24963/ijcai.2023/745
- He, Y., Wu, B., Wu, D., Beyazit, E., Chen, S. & Wu, X. Online Learning from Capricious Data Streams: A Generative Approach. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, 2491–2497 (2019). https://doi.org/10.24963/ijcai.2019/346