
Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation

Masashi Sugiyama
Motoaki Kawanabe
Copyright Date: 2012
Published by: MIT Press
Pages: 280
    Book Description:

    As the power of computing has grown over the past few decades, the field of machine learning has advanced rapidly in both theory and practice. Machine learning methods are usually based on the assumption that the data generation mechanism does not change over time. Yet real-world applications of machine learning, including image recognition, natural language processing, speech recognition, robot control, and bioinformatics, often violate this common assumption. Dealing with non-stationarity is one of modern machine learning's greatest challenges. This book focuses on a specific non-stationary environment known as covariate shift, in which the distributions of inputs (queries) change but the conditional distribution of outputs (answers) is unchanged, and presents machine learning theory, algorithms, and applications to overcome this form of non-stationarity. After reviewing the state-of-the-art research in the field, the authors discuss topics that include learning under covariate shift, model selection, importance estimation, and active learning. They describe such real-world applications of covariate shift adaptation as brain-computer interfaces, speaker identification, and age prediction from facial images. With this book, they aim to encourage future research in machine learning, statistics, and engineering that strives to create truly autonomous learning machines able to learn under non-stationarity.

    eISBN: 978-0-262-30122-0
    Subjects: Technology

Table of Contents

  1. Front Matter
    (pp. i-iv)
  2. Table of Contents
    (pp. v-x)
  3. Foreword
    (pp. xi-xii)
    Klaus-Robert Müller

    Modern machine learning faces a number of grand challenges. The ever-growing World Wide Web, high-throughput methods in genomics, and modern imaging methods in brain science, to name just a few, pose ever larger problems in which learning methods need to scale and become more efficient, and algorithms must be able to deal with million-dimensional inputs and terabytes of data. At the same time, it becomes more and more important to efficiently and robustly model highly complex problems that are structured (e.g., a grammar underlies the data) and exhibit nonlinear behavior. In addition, data from the real world are...

  4. Preface
    (pp. xiii-xiv)
    • 1 Introduction and Problem Formulation
      (pp. 3-18)

      In this chapter, we provide an introduction to covariate shift adaptation toward machine learning in a non-stationary environment.

      Machine learning is an interdisciplinary field of science and engineering that studies the mathematical foundations and practical applications of systems that learn. Depending on the type of learning, the paradigms of machine learning can be categorized into three types:

      Supervised learning: The goal of supervised learning is to infer an underlying input–output relation based on input–output samples. Once the underlying relation can be successfully learned, output values for unseen input points can be predicted. Thus, the learning machine can generalize to...

    • 2 Function Approximation
      (pp. 21-46)

      In this chapter, we introduce learning methods that can cope with covariate shift.

      We employ a parameterized function $\hat{f}(x;\theta)$ for approximating a target function $f(x)$ from training samples $\{(x_{i}^{\text{tr}},y_{i}^{\text{tr}})\}_{i=1}^{n_{\text{tr}}}$ (see section 1.3.5).

      A standard method to learn the parameter $\theta$ would be empirical risk minimization (ERM) (e.g., [193, 141]): \[\hat{\theta}_{\text{ERM}} := \mathop{\arg\min}_{\theta}\left[\frac{1}{n_{\text{tr}}}\sum_{i=1}^{n_{\text{tr}}}\mathrm{loss}(x_{i}^{\text{tr}}, y_{i}^{\text{tr}}, \hat{f}(x_{i}^{\text{tr}};\theta))\right],\] where $\mathrm{loss}(x,y,\hat{y})$ is a loss function (see section 1.3.2). If $P_{\text{tr}}(x) = P_{\text{te}}(x)$, $\hat{\theta}_{\text{ERM}}$ is known to be consistent¹ [145]. Under covariate shift, where $P_{\text{tr}}(x) \ne P_{\text{te}}(x)$, however, the situation differs: ERM still gives a consistent estimator if the model is correctly specified, but it is no longer consistent if the model...
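      The remedy discussed in this chapter is to weight each training loss by the density ratio $w(x) = p_{\text{te}}(x)/p_{\text{tr}}(x)$ (importance-weighted ERM). The following is a minimal sketch, not the book's code: the Gaussian input distributions, the noisy sinc target, the misspecified linear model, and the use of the true (rather than estimated) ratio are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariate shift: training inputs ~ N(1, 0.5^2), test inputs ~ N(2, 0.25^2),
# while the conditional p(y|x) (a noisy sinc curve) stays fixed.
x_tr = rng.normal(1.0, 0.5, 200)
y_tr = np.sinc(x_tr) + rng.normal(0.0, 0.1, 200)

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

# True importance w(x) = p_te(x) / p_tr(x); known here only for illustration.
w = gauss(x_tr, 2.0, 0.25) / gauss(x_tr, 1.0, 0.5)

# Misspecified linear model f(x; theta) = theta_0 + theta_1 * x, squared loss.
X = np.column_stack([np.ones_like(x_tr), x_tr])

# Ordinary ERM: unweighted least squares.
theta_erm = np.linalg.lstsq(X, y_tr, rcond=None)[0]

# Importance-weighted ERM: each loss term is multiplied by w(x_i), which
# asymptotically cancels the bias caused by covariate shift.
sw = np.sqrt(w)
theta_iwerm = np.linalg.lstsq(sw[:, None] * X, sw * y_tr, rcond=None)[0]
```

      Under this setup the weighted fit concentrates on the region where test inputs are likely, which is the intended effect of importance weighting for a misspecified model.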

    • 3 Model Selection
      (pp. 47-72)

      As shown in chapter 2, adaptive importance-weighted learning methods are promising in covariate shift scenarios, given that the flattening parameter γ is chosen appropriately. Although γ = 0.5 worked well for both the regression and the classification scenarios in the numerical examples in section 2.4, γ = 0.5 is not always the best choice; a good value of γ may depend on the learning target function, the models, the noise level in the training samples, and so on. Therefore, model selection needs to be appropriately carried out to enhance the generalization capability under covariate shift.

      The goal of model...
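      As a rough sketch of how the flattening parameter γ might be chosen in practice, the toy example below scores candidate values by importance-weighted cross-validation (IWCV): each fold is trained with the flattened weights $w^{\gamma}$, while held-out losses are scored with the full weights $w$. The data, the linear model, and the use of known true weights are illustrative assumptions, not the book's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy covariate-shift regression; the true importance w(x) = p_te(x)/p_tr(x)
# is assumed known here (in practice it must be estimated first).
x_tr = rng.normal(1.0, 0.5, 150)
y_tr = np.sinc(x_tr) + rng.normal(0.0, 0.1, 150)

def gauss(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

w = gauss(x_tr, 2.0, 0.25) / gauss(x_tr, 1.0, 0.5)

def fit(x, y, weights):
    # Weighted least squares for the linear model f(x; theta) = theta_0 + theta_1 x.
    X = np.column_stack([np.ones_like(x), x])
    sw = np.sqrt(weights)
    return np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]

def iwcv_score(gamma, k=5):
    # Importance-weighted cross-validation: train each fold with the flattened
    # weights w**gamma, but score held-out losses with the full weights w.
    idx = rng.permutation(len(x_tr))
    score = 0.0
    for fold in np.array_split(idx, k):
        tr = np.setdiff1d(idx, fold)
        theta = fit(x_tr[tr], y_tr[tr], w[tr] ** gamma)
        pred = theta[0] + theta[1] * x_tr[fold]
        score += np.mean(w[fold] * (y_tr[fold] - pred) ** 2)
    return score / k

gammas = [0.0, 0.25, 0.5, 0.75, 1.0]
best_gamma = min(gammas, key=iwcv_score)
```

      The full-weight held-out score is what makes the criterion (almost) unbiased under covariate shift, so it can compare different γ values, models, or regularizers on an equal footing.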

    • 4 Importance Estimation
      (pp. 73-102)

      In chapters 2 and 3, we saw that the importance weight \[w(x)=\frac{p_{\text{te}}(x)}{p_{\text{tr}}(x)}\] can be used for asymptotically canceling the bias caused by covariate shift. However, the importance weight is unknown in practice and needs to be estimated from data. In this chapter, we give a comprehensive overview of importance estimation methods.

      The setup in this chapter is that, in addition to the i.i.d. training input samples \[\{x_{i}^{\text{tr}}\}_{i=1}^{n_{\text{tr}}} \overset{\text{i.i.d.}}{\sim} P_{\text{tr}}(x),\] we are given i.i.d. test input samples \[\{x_{j}^{\text{te}}\}_{j=1}^{n_{\text{te}}} \overset{\text{i.i.d.}}{\sim} P_{\text{te}}(x).\]

      Although this setup is similar to semisupervised learning [30], our attention is directed to covariate shift adaptation.

      The goal of importance estimation is to...
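      One family of direct importance estimators models the ratio as $\hat{w}(x) = \sum_{l} \alpha_{l} K(x, c_{l})$ and fits the coefficients by least squares on the ratio itself, in the spirit of least-squares importance fitting. The sketch below uses Gaussian kernels centered at test inputs; the toy data and the hand-picked kernel width and regularization constant are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Training and test inputs drawn from different densities (covariate shift).
x_tr = rng.normal(1.0, 0.5, (300, 1))
x_te = rng.normal(1.5, 0.4, (200, 1))

def gauss_kernel(x, c, sigma):
    # Pairwise Gaussian kernel matrix between points x and centers c.
    d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Model the ratio as w_hat(x) = sum_l alpha_l K(x, c_l), with centers taken
# from the test inputs (common in direct density-ratio estimation).
centers = x_te[:50]
sigma, lam = 0.3, 0.1
Phi_tr = gauss_kernel(x_tr, centers, sigma)   # n_tr x b
Phi_te = gauss_kernel(x_te, centers, sigma)   # n_te x b

# Least-squares fit of the ratio: minimizing the squared error of w_hat
# under p_tr leads to the linear system (H + lam I) alpha = h, where H is
# estimated from training inputs and h from test inputs.
H = Phi_tr.T @ Phi_tr / len(x_tr)
h = Phi_te.mean(axis=0)
alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)

# Estimated importance at the training points, clipped to be nonnegative.
w_hat = np.clip(Phi_tr @ alpha, 0.0, None)
```

      Because the quadratic objective has an analytic solution, estimators of this type avoid separate density estimation entirely and reduce importance estimation to one linear solve.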

    • 5 Direct Density-Ratio Estimation with Dimensionality Reduction
      (pp. 103-124)

      As shown in chapter 4, various methods have been developed for directly estimating the density ratio without going through density estimation. However, even these methods can perform rather poorly when the dimensionality of the data domain is high. In this chapter, a dimensionality reduction scheme for density-ratio estimation, called direct density-ratio estimation with dimensionality reduction (D³; pronounced as “D-cube”) [158], is introduced.

      The basic assumption behind D³ is that the densities $p_{\text{tr}}(x)$ and $p_{\text{te}}(x)$ are different not in the entire space, but only in some subspace. This assumption can be mathematically formulated with the following linear mixing model.
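      A hypothetical toy example (not the D³ method itself) of this subspace assumption: below, the three-dimensional training and test densities differ only in the first coordinate, so the shared components cancel in the ratio and $w(x)$ is a function of a one-dimensional subspace alone, which is what makes low-dimensional ratio estimation possible.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data in R^3: p_tr and p_te share the law of the last two
# coordinates and differ only in the first one (N(0,1) vs. N(0.5,1)).
n = 1000
x_tr = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, (n, 2))])
x_te = np.column_stack([rng.normal(0.5, 1.0, n), rng.normal(0.0, 1.0, (n, 2))])

def true_ratio(x):
    # The shared components cancel in p_te(x)/p_tr(x): only the first
    # coordinate enters, so the ratio lives in a 1-D subspace and can be
    # estimated there instead of in the full 3-D space.
    x1 = x[:, 0]
    return np.exp(-(x1 - 0.5) ** 2 / 2.0) / np.exp(-x1 ** 2 / 2.0)

w = true_ratio(x_tr)
```

      Identifying that subspace from samples, rather than assuming it known as here, is exactly what the D³ scheme addresses.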


    • 6 Relation to Sample Selection Bias
      (pp. 125-136)

      One of the most famous works on learning under a changing environment is Heckman’s method for coping with sample selection bias [76, 77]. Sample selection bias has been extensively studied in econometrics and sociology, and Heckman received the Nobel Prize in economics in 2000.

      Sample selection bias refers to the situation in which the training data set consists of nonrandomly selected (i.e., biased) samples. Data samples collected through Internet surveys typically suffer from sample selection bias—samples corresponding to those who do not have access to the Internet are completely missing. Since the number of conservative people, such as the elderly,...

    • 7 Applications of Covariate Shift Adaptation
      (pp. 137-180)

      In this chapter, we show applications of covariate shift adaptation techniques to real-world problems: the brain–computer interface in section 7.1, speaker identification in section 7.2, natural language processing in section 7.3, face-based age prediction in section 7.4, and human activity recognition from accelerometric data in section 7.5. In section 7.6, covariate shift adaptation techniques are employed for efficient sample reuse in the framework of reinforcement learning.

      In this section, importance-weighting methods are applied to brain–computer interfaces (BCIs), which have attracted a great deal of attention in biomedical engineering and machine learning [160, 102].

      A BCI system allows direct...

    • 8 Active Learning
      (pp. 183-214)

      Active learning [107, 33, 55]—also referred to as experimental design in statistics [93, 48, 127]—is the problem of determining the locations of training input points so that the generalization error is minimized (see figure 8.1). Active learning is particularly useful when sampling an output value $y$ is very expensive. In such cases, we want to find the best input points at which to observe output values within a fixed budget (which corresponds to the number $n_{\text{tr}}$ of training samples).

      Since training input points are generated following a user-defined distribution, covariate shift naturally occurs in the active learning scenario. Thus, covariate...
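      As a generic illustration of the experimental-design flavor of active learning (a greedy A-optimality heuristic, not the book's particular method), the sketch below selects points from a candidate pool for a quadratic regression model so as to shrink the trace of the regularized inverse information matrix; the pool, model, and budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Candidate input locations (the "pool") and a quadratic regression model.
pool = rng.uniform(-3.0, 3.0, 500)
Phi = np.column_stack([np.ones_like(pool), pool, pool ** 2])

budget, ridge = 20, 1e-3
chosen = []
A = ridge * np.eye(Phi.shape[1])  # regularized information matrix X^T X

for _ in range(budget):
    # Greedy A-optimality: pick the unchosen point that most reduces
    # trace((A + phi phi^T)^{-1}), a proxy for parameter-estimation variance.
    scores = np.full(len(pool), np.inf)
    for i in range(len(pool)):
        if i in chosen:
            continue
        Ai = A + np.outer(Phi[i], Phi[i])
        scores[i] = np.trace(np.linalg.inv(Ai))
    best = int(np.argmin(scores))
    chosen.append(best)
    A += np.outer(Phi[best], Phi[best])
```

      Because the learner chooses its own training inputs rather than drawing them from the test distribution, covariate shift arises by construction here, which is why the book treats active learning within the same importance-weighting framework.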

    • 9 Active Learning with Model Selection
      (pp. 215-224)

      In chapters 3 and 8, we addressed the problems of model selection¹ and active learning. When discussing model selection strategies, we assumed that the training input points have been fixed. On the other hand, when discussing active learning strategies, we assumed that the model had been fixed.

      Although the problems of active learning and model selection share the common goal of minimizing the generalization error, they have been studied as two independent problems so far. If active learning and model selection are performed at the same time, the generalization performance will be further improved. We call the problem of simultaneously...

    • 10 Applications of Active Learning
      (pp. 225-238)

      In this chapter, we describe real-world applications of active learning techniques: sampling policy design in reinforcement learning (section 10.1) and wafer alignment in semiconductor exposure apparatus (section 10.2).

      As shown in section 7.6,reinforcement learning[174] is a useful framework to let a robot agent learn optimal behavior in an unknown environment.

      The accuracy of estimated value functions depends on the training samples collected following the sampling policy $\tilde{\pi}(a|s)$. In this section, we apply the population-based active learning method described in section 8.2.4 to designing good sampling policies [4]. The contents of this section are based on the framework of...

    • 11 Conclusions and Future Prospects
      (pp. 241-242)

      In this book, we provided a comprehensive overview of theory, algorithms, and applications of machine learning under covariate shift.

      Part II of the book covered topics on learning under covariate shift. In chapters 2 and 3, importance sampling techniques were shown to form the theoretical basis of covariate shift adaptation in function learning and model selection. In practice, importance weights needed in importance sampling are unknown. Thus, estimating the importance weights is a key component in covariate shift adaptation, which was covered in chapter 4. In chapter 5, a novel idea for estimating the importance weights in high-dimensional problems was...

  9. Appendix: List of Symbols and Abbreviations
    (pp. 243-246)
  10. Bibliography
    (pp. 247-258)
  11. Index
    (pp. 259-262)
  12. Back Matter
    (pp. 263-264)