# Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation

Masashi Sugiyama
Motoaki Kawanabe
Pages: 280
https://www.jstor.org/stable/j.ctt5hhbtm

1. Front Matter
(pp. i-iv)
2. Table of Contents
(pp. v-x)
3. Foreword
(pp. xi-xii)
Klaus-Robert Müller

Modern machine learning faces a number of grand challenges. The ever-growing World Wide Web, high-throughput methods in genomics, and modern imaging methods in brain science, to name just a few, pose ever larger problems where learning methods need to scale and increase their efficiency, and algorithms need to become able to deal with million-dimensional inputs at terabytes of data. At the same time it becomes more and more important to efficiently and robustly model highly complex problems that are structured (e.g., a grammar underlies the data) and exhibit nonlinear behavior. In addition, data from the real world are...

4. Preface
(pp. xiii-xiv)
5. I INTRODUCTION
• 1 Introduction and Problem Formulation
(pp. 3-18)

In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

Machine learning is an interdisciplinary field of science and engineering that studies mathematical foundations and practical applications of systems that learn. Depending on the type of learning, machine learning paradigms can be categorized into three types:

Supervised learning: The goal of supervised learning is to infer an underlying input–output relation based on input–output samples. Once the underlying relation has been successfully learned, output values for unseen input points can be predicted. Thus, the learning machine can generalize to...

6. II LEARNING UNDER COVARIATE SHIFT
• 2 Function Approximation
(pp. 21-46)

In this chapter, we introduce learning methods that can cope with covariate shift.

We employ a parameterized function $\hat{f}(x;\theta)$ for approximating a target function $f(x)$ from training samples $\{(x_{i}^{\text{tr}},y_{i}^{\text{tr}})\}_{i=1}^{n_{\text{tr}}}$ (see section 1.3.5).

A standard method to learn the parameter $\theta$ would be empirical risk minimization (ERM) (e.g., [193, 141]): $${\hat{\theta}}_{\text{ERM}} := \underset{\theta}{\arg\min}\left[\frac{1}{n_{\text{tr}}}\sum_{i=1}^{n_{\text{tr}}}\text{loss}(x_{i}^{\text{tr}}, y_{i}^{\text{tr}}, \hat{f}(x_{i}^{\text{tr}};\theta))\right],$$ where $\text{loss}(x,y,\hat{y})$ is a loss function (see section 1.3.2). If $P_{\text{tr}}(x) = P_{\text{te}}(x)$, ${\hat{\theta}}_{\text{ERM}}$ is known to be consistent¹ [145]. Under covariate shift, where $P_{\text{tr}}(x) \ne P_{\text{te}}(x)$, however, the situation differs: ERM still gives a consistent estimator if the model is correctly specified, but it is no longer consistent if the model...
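To make the contrast concrete, here is a minimal sketch of plain ERM versus importance-weighted ERM for a linear model under squared loss. The 1-D densities, the sinc target, and the model are hypothetical illustrations (not taken from the book), and the true importance weights $w(x) = p_{\text{te}}(x)/p_{\text{tr}}(x)$ are assumed known here for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression setup: training and test input densities differ.
n_tr = 100
x_tr = rng.normal(1.0, 0.5, n_tr)                 # x ~ p_tr(x)
y_tr = np.sinc(x_tr) + rng.normal(0, 0.1, n_tr)   # noisy target values

def gauss(x, mu, sigma):
    """Gaussian density, used to form the true importance weights."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# True importance weights w(x) = p_te(x) / p_tr(x) (assumed known here).
w = gauss(x_tr, 2.0, 0.3) / gauss(x_tr, 1.0, 0.5)

# Linear model f(x; theta) = theta_0 + theta_1 * x, squared loss.
X = np.column_stack([np.ones(n_tr), x_tr])

# Plain ERM: ordinary least squares on the training sample.
theta_erm = np.linalg.lstsq(X, y_tr, rcond=None)[0]

# Importance-weighted ERM: weighted least squares with weights w(x_i),
# equivalent to scaling rows by sqrt(w).
sw = np.sqrt(w)
theta_iwerm = np.linalg.lstsq(sw[:, None] * X, sw * y_tr, rcond=None)[0]
```

Because the weights up-weight training points that fall in the test-dense region, the two estimates differ; under misspecification only the weighted one remains consistent for the test-domain risk.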

• 3 Model Selection
(pp. 47-72)

As shown in chapter 2, adaptive importance-weighted learning methods are promising in covariate shift scenarios, given that the flattening parameter γ is chosen appropriately. Although γ = 0.5 worked well for both the regression and the classification scenarios in the numerical examples in section 2.4, γ = 0.5 is not always the best choice; a good value of γ may depend on the learning target function, the models, the noise level in the training samples, and so on. Therefore, model selection needs to be carried out appropriately to enhance the generalization capability under covariate shift.
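The role of the flattening parameter can be sketched in a few lines: γ interpolates between plain ERM (γ = 0) and fully importance-weighted ERM (γ = 1) by raising the weights to the power γ. The weights and losses below are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical importance weights and per-sample losses.
w = np.array([0.2, 1.0, 5.0])
losses = np.array([0.5, 0.1, 0.9])

def aiw_risk(losses, w, gamma):
    """Adaptive importance-weighted empirical risk with flattening parameter gamma:
    gamma = 0 recovers plain ERM, gamma = 1 gives full importance weighting."""
    return np.mean(w**gamma * losses)

# Risk of the same sample under three flattening choices.
risks = {g: aiw_risk(losses, w, g) for g in (0.0, 0.5, 1.0)}
```

Intermediate γ trades the lower bias of full weighting against the lower variance of unweighted ERM, which is exactly why γ needs to be tuned by model selection.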

The goal of model...

• 4 Importance Estimation
(pp. 73-102)

In chapters 2 and 3, we have seen that the importance weight$w(x)=\frac{{{p}_{\text{te}}}(x)}{{{p}_{\text{tr}}}(x)}$can be used for asymptotically canceling the bias caused by covariate shift. However, the importance weight is unknown in practice, and needs to be estimated from data. In this chapter, we give a comprehensive overview of importance estimation methods.

The setup in this chapter is that in addition to the i.i.d. training input samples$\{x_{i}^{\text{tr}}\}_{i=1}^{{{n}_{\text{tr}}}}\overset{i.i.d.}{\mathop{\sim }}\,{{P}_{\text{tr}}}(x),$we are given i.i.d. test input samples$\{x_{j}^{\text{te}}\}_{j=1}^{{{n}_{\text{te}}}}\overset{i.i.d.}{\mathop{\sim }}\,{{P}_{\text{te}}}(x).$

Although this setup is similar to semisupervised learning [30], our attention is directed to covariate shift adaptation.
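As a point of reference for the estimation methods surveyed in this chapter, here is the naive two-step baseline: estimate each density separately, then take their ratio. The direct methods of this chapter are designed to avoid this two-step procedure, but the baseline makes the target quantity $w(x) = p_{\text{te}}(x)/p_{\text{tr}}(x)$ concrete. The 1-D Gaussian samples and the parametric plug-in fit are illustrative assumptions, not the book's method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical i.i.d. input samples from shifted Gaussians.
x_tr = rng.normal(1.0, 0.5, 200)   # training inputs ~ p_tr
x_te = rng.normal(1.5, 0.4, 200)   # test inputs     ~ p_te

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Step 1: fit a parametric density to each sample separately.
mu_tr, s_tr = x_tr.mean(), x_tr.std(ddof=1)
mu_te, s_te = x_te.mean(), x_te.std(ddof=1)

# Step 2: plug-in estimate of the importance weight p_te(x) / p_tr(x).
def w_hat(x):
    return gauss_pdf(x, mu_te, s_te) / gauss_pdf(x, mu_tr, s_tr)

# Points in the test-dense region get up-weighted (w_hat > 1),
# points rare under p_te get down-weighted (w_hat < 1).
```

The weakness of this baseline, which motivates direct density-ratio estimation, is that small errors in the estimated denominator density can blow up the ratio, especially in high dimensions.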

The goal of importance estimation is to...

• 5 Direct Density-Ratio Estimation with Dimensionality Reduction
(pp. 103-124)

As shown in chapter 4, various methods have been developed for directly estimating the density ratio without going through density estimation. However, even these methods can perform rather poorly when the dimensionality of the data domain is high. In this chapter, a dimensionality reduction scheme for density-ratio estimation, called direct density-ratio estimation with dimensionality reduction (D³; pronounced "D-cube") [158], is introduced.

The basic assumption behind D³ is that the densities $p_{\text{tr}}(x)$ and $p_{\text{te}}(x)$ differ not in the entire space, but only in some subspace. This assumption can be formulated mathematically with the following linear mixing model.

Let$\{u_{i}^{\text{tr}}\}_{i=1}^{{{n}_{\text{tr}}}}$...

• 6 Relation to Sample Selection Bias
(pp. 125-136)

One of the most famous works on learning under a changing environment is Heckman's method for coping with sample selection bias [76, 77]. Sample selection bias has been proposed and extensively studied in econometrics and sociology, and Heckman received the Nobel Memorial Prize in Economic Sciences in 2000.

Sample selection bias refers to the situation where the training data set consists of nonrandomly selected (i.e., biased) samples. Data samples collected through Internet surveys typically suffer from sample selection bias: samples corresponding to those who do not have access to the Internet are completely missing. Since the number of conservative people, such as the elderly,...

• 7 Applications of Covariate Shift Adaptation
(pp. 137-180)

In this chapter, we show applications of covariate shift adaptation techniques to real-world problems: the brain–computer interface in section 7.1, speaker identification in section 7.2, natural language processing in section 7.3, face-based age prediction in section 7.4, and human activity recognition from accelerometric data in section 7.5. In section 7.6, covariate shift adaptation techniques are employed for efficient sample reuse in the framework of reinforcement learning.

In this section, importance-weighting methods are applied to brain–computer interfaces (BCIs), which have attracted a great deal of attention in biomedical engineering and machine learning [160, 102].

A BCI system allows direct...

7. III LEARNING CAUSING COVARIATE SHIFT
• 8 Active Learning
(pp. 183-214)

Active learning [107, 33, 55], also referred to as experimental design in statistics [93, 48, 127], is the problem of determining the locations of training input points so that the generalization error is minimized (see figure 8.1). Active learning is particularly useful when sampling the output value $y$ is very expensive. In such cases, we want to find the best input points at which to observe output values within a fixed budget (which corresponds to the number $n_{\text{tr}}$ of training samples).
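A tiny experimental-design example illustrates the idea of choosing input locations before seeing any outputs. For a linear model with i.i.d. noise, the OLS estimator's covariance is proportional to $(X^{\top}X)^{-1}$, and the classical A-optimality criterion prefers the design minimizing its trace. The two candidate designs below are made up for illustration:

```python
import numpy as np

def a_criterion(xs):
    """A-optimality criterion trace((X^T X)^{-1}) for the model
    y = theta_0 + theta_1 * x; smaller means lower estimator variance."""
    X = np.column_stack([np.ones(len(xs)), xs])
    return np.trace(np.linalg.inv(X.T @ X))

clustered = np.array([0.9, 1.0, 1.1, 1.0])   # inputs bunched together
spread = np.array([-1.0, -0.3, 0.3, 1.0])    # inputs spread over the domain

# The spread design achieves a much smaller criterion value, so an
# active learner with this budget of four queries would prefer it.
```

Since the learner itself chooses the input distribution, the deployed training inputs generally do not follow the test input distribution, which is precisely how active learning causes covariate shift.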

Since training input points are generated following a user-defined distribution, covariate shift naturally occurs in the active learning scenario. Thus, covariate...

• 9 Active Learning with Model Selection
(pp. 215-224)

In chapters 3 and 8, we addressed the problems of model selection¹ and active learning. When discussing model selection strategies, we assumed that the training input points have been fixed. On the other hand, when discussing active learning strategies, we assumed that the model had been fixed.

Although the problems of active learning and model selection share the common goal of minimizing the generalization error, they have been studied as two independent problems so far. If active learning and model selection are performed at the same time, the generalization performance will be further improved. We call the problem of simultaneously...

• 10 Applications of Active Learning
(pp. 225-238)

In this chapter, we describe real-world applications of active learning techniques: sampling policy design in reinforcement learning (section 10.1) and wafer alignment in semiconductor exposure apparatus (section 10.2).

As shown in section 7.6,reinforcement learning[174] is a useful framework to let a robot agent learn optimal behavior in an unknown environment.

The accuracy of estimated value functions depends on the training samples collected following the sampling policy $\tilde{\pi}(a|s)$. In this section, we apply the population-based active learning method described in section 8.2.4 to designing good sampling policies [4]. The contents of this section are based on the framework of...

8. IV CONCLUSIONS
• 11 Conclusions and Future Prospects
(pp. 241-242)

In this book, we provided a comprehensive overview of theory, algorithms, and applications of machine learning under covariate shift.

Part II of the book covered topics on learning under covariate shift. In chapters 2 and 3, importance sampling techniques were shown to form the theoretical basis of covariate shift adaptation in function learning and model selection. In practice, importance weights needed in importance sampling are unknown. Thus, estimating the importance weights is a key component in covariate shift adaptation, which was covered in chapter 4. In chapter 5, a novel idea for estimating the importance weights in high-dimensional problems was...

9. Appendix: List of Symbols and Abbreviations
(pp. 243-246)
10. Bibliography
(pp. 247-258)
11. Index
(pp. 259-262)
12. Back Matter
(pp. 263-264)