# Foundations of Machine Learning

Mehryar Mohri
Afshin Rostamizadeh
Ameet Talwalkar
Pages: 432
https://www.jstor.org/stable/j.ctt5hhcw1

1. Front Matter
(pp. i-iv)
2. Table of Contents
(pp. v-x)
3. Preface
(pp. xi-xii)
4. 1 Introduction
(pp. 1-10)

Machine learning can be broadly defined as computational methods using experience to improve performance or to make accurate predictions. Here, *experience* refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.

Machine learning consists of designing efficient and accurate prediction *algorithms*. As in other areas...

5. 2 The PAC Learning Framework
(pp. 11-32)

Several fundamental questions arise when designing and analyzing algorithms that learn from examples: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the *Probably Approximately Correct* (PAC) learning framework. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution, the *sample complexity*, and the time and space complexity of the learning algorithm, which depends on the cost...
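To make the notion of sample complexity concrete, one standard guarantee of this kind (for a finite hypothesis set $H$ and a learner returning a hypothesis consistent with the sample) states that, with probability at least $1 - \delta$, any consistent hypothesis has generalization error at most $\epsilon$ whenever the sample size $m$ satisfies

$$ m \;\ge\; \frac{1}{\epsilon}\left( \log |H| + \log \frac{1}{\delta} \right). $$

The two parameters capture the "probably" ($\delta$) and the "approximately correct" ($\epsilon$) in the framework's name.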

6. 3 Rademacher Complexity and VC-Dimension
(pp. 33-62)

The hypothesis sets typically used in machine learning are infinite. But the sample complexity bounds of the previous chapter are uninformative when dealing with infinite hypothesis sets. One could ask whether efficient learning from a finite sample is even possible when the hypothesis set $H$ is infinite. Our analysis of the family of axis-aligned rectangles (Example 2.1) indicates that this is indeed possible at least in some cases, since we proved that this infinite concept class was PAC-learnable. Our goal in this chapter will be to generalize that result and derive general learning guarantees for infinite hypothesis sets.
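A representative guarantee of the kind this chapter derives: for a family $H$ of functions taking values in $[0, 1]$ and an i.i.d. sample $S$ of size $m$, with probability at least $1 - \delta$, for all $h \in H$,

$$ R(h) \;\le\; \widehat{R}_S(h) + 2\,\mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}, $$

where $R(h)$ is the generalization error, $\widehat{R}_S(h)$ the empirical error, and $\mathfrak{R}_m(H)$ the Rademacher complexity of $H$. The exact constants vary across variants of the bound; this is the standard two-sided-free form for complexity measured in expectation over samples.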

A general...

7. 4 Support Vector Machines
(pp. 63-88)

This chapter presents one of the most theoretically well-motivated and practically effective classification algorithms in modern machine learning: Support Vector Machines (SVMs). We first introduce the algorithm for separable datasets, then present its general version designed for non-separable datasets, and finally provide a theoretical foundation for SVMs based on the notion of margin. We start with the description of the problem of linear classification.

Consider an input space $\mathcal{X}$ that is a subset of $\mathbb{R}^N$ with $N \geq 1$, and the output or target space $\mathcal{Y} = \{-1, +1\}$, and let $f \colon \mathcal{X} \rightarrow \mathcal{Y}$ be the target function. Given a hypothesis set...
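The linear classification setup above, together with the geometric margin central to the SVM analysis, can be sketched in a few lines. The weight vector, offset, and sample points below are hypothetical, not learned:

```python
import math

# A linear hypothesis h(x) = sign(w . x + b) over X ⊆ R^2, Y = {-1, +1}.
# The weight vector and offset are illustrative, not the output of training.
w = [1.0, 1.0]
b = -3.0

def h(x):
    """Linear classifier: sign of the affine score w . x + b."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def geometric_margin(sample):
    """Smallest signed distance y * (w . x + b) / ||w|| over a labeled sample;
    positive iff the hyperplane separates the sample."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in sample)

sample = [([0.0, 1.0], -1), ([1.0, 1.0], -1),
          ([2.0, 3.0], 1), ([3.0, 2.0], 1)]
```

SVMs choose, among all separating hyperplanes, the one maximizing this geometric margin; here the margin is simply measured for a fixed hyperplane.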

8. 5 Kernel Methods
(pp. 89-120)

*Kernel methods* are widely used in machine learning. They are flexible techniques that can be used to extend algorithms such as SVMs to define non-linear decision boundaries. Other algorithms that only depend on inner products between sample points can be extended similarly, many of which will be studied in future chapters.

The main idea behind these methods is based on so-called *kernels* or *kernel functions*, which, under some technical conditions of symmetry and *positive-definiteness*, implicitly define an inner product in a high-dimensional space. Replacing the original inner product in the input space with positive definite kernels immediately extends algorithms such...
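The "implicit inner product" claim can be checked numerically for the degree-2 polynomial kernel on $\mathbb{R}^2$, whose explicit feature map is known in closed form (the test vectors below are arbitrary):

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2 on R^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit map phi into R^3 satisfying K(x, z) = <phi(x), phi(z)>."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
```

The kernel evaluates the inner product in the 3-dimensional feature space at the cost of a single 2-dimensional inner product; for higher degrees and Gaussian kernels the saving is dramatic, since the feature space can be very high- or infinite-dimensional.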

9. 6 Boosting
(pp. 121-146)

*Ensemble methods* are general techniques in machine learning for combining several predictors to create a more accurate one. This chapter studies an important family of ensemble methods known as *boosting*, and more specifically the *AdaBoost* algorithm. This algorithm has been shown to be very effective in practice in some scenarios and is based on a rich theoretical analysis. We first introduce AdaBoost, show how it can rapidly reduce the empirical error as a function of the number of rounds of boosting, and point out its relationship with some known algorithms. Then we present a theoretical analysis of its generalization properties...
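A minimal sketch of AdaBoost follows, using exhaustive one-dimensional threshold stumps as the weak learner (an illustrative choice; the sample data is made up). It shows the two moves the excerpt alludes to: selecting a weak hypothesis against the current distribution, and re-weighting the sample so that misclassified points gain weight:

```python
import math

def stump_predict(threshold, polarity, x):
    """Decision stump on the real line: polarity if x >= threshold."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, D):
    """Weak learner: exhaustively pick the stump minimizing weighted error."""
    best = None
    for t in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(d for x, y, d in zip(xs, ys, D)
                      if stump_predict(t, pol, x) != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(xs, ys, rounds=5):
    m = len(xs)
    D = [1.0 / m] * m          # uniform initial distribution over the sample
    ensemble = []
    for _ in range(rounds):
        err, t, pol = best_stump(xs, ys, D)
        err = max(err, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # Re-weight: misclassified points gain weight; normalize by Z_t.
        D = [d * math.exp(-alpha * y * stump_predict(t, pol, x))
             for x, y, d in zip(xs, ys, D)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    """Weighted-majority vote of the selected stumps."""
    s = sum(a * stump_predict(t, pol, x) for a, t, pol in ensemble)
    return 1 if s >= 0 else -1

# A 1-D toy sample that no single stump classifies perfectly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, 1]
ensemble = adaboost(xs, ys, rounds=5)
```

On this sample the best single stump still misclassifies one point, while the boosted combination reaches zero training error within a few rounds, illustrating the rapid decrease of empirical error the excerpt mentions.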

10. 7 On-Line Learning
(pp. 147-182)

This chapter presents an introduction to on-line learning, an important area with a rich literature and multiple connections with game theory and optimization that is increasingly influencing the theoretical and algorithmic advances in machine learning. In addition to the intriguing novel learning-theory questions they raise, on-line learning algorithms are particularly attractive in modern applications since they offer a practical solution for large-scale problems.

These algorithms process one sample at a time and can thus be significantly more efficient both in time and space and more practical than batch algorithms, when processing modern data sets of several million or...
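The one-sample-at-a-time processing can be illustrated with the classical Perceptron algorithm, one of the on-line algorithms typically studied in this setting (the toy stream below is hypothetical). It stores only the current weight vector, updating it only when a mistake is made:

```python
def perceptron(stream, dim, epochs=10):
    """On-line Perceptron: visit examples one at a time, update only on
    mistakes, and keep constant O(dim) memory."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in stream:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # mistake (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# A linearly separable toy stream.
stream = [([2.0, 2.0], 1), ([3.0, 3.0], 1),
          ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w, b = perceptron(stream, dim=2)
```

On separable data the Perceptron convergence theorem bounds the total number of mistakes; a batch algorithm would instead need the whole dataset in memory at once.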

11. 8 Multi-Class Classification
(pp. 183-208)

The classification problems we examined in the previous chapters were all binary. However, in most real-world classification problems the number of classes is greater than two. The problem may consist of assigning a topic to a text document, a category to a speech utterance or a function to a biological sequence. In all of these tasks, the number of classes may be on the order of several hundred or more.

In this chapter, we analyze the problem of multi-class classification. We first introduce the multi-class classification learning problem and discuss its multiple settings, and then derive generalization bounds for it...
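One common way to handle more than two classes, among the settings the chapter discusses, is the one-vs-all reduction to binary classification. The sketch below uses a Perceptron-style binary learner as the base classifier purely for illustration (any binary learner fits the reduction), on made-up data:

```python
def train_binary(sample, dim, epochs=50):
    """Perceptron-style binary learner used here as the base classifier;
    any binary learning algorithm can play this role in the reduction."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in sample:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def train_one_vs_all(sample, dim, classes):
    """One binary problem per class k: examples of class k vs. all the rest."""
    models = {}
    for k in classes:
        relabeled = [(x, 1 if y == k else -1) for x, y in sample]
        models[k] = train_binary(relabeled, dim)
    return models

def predict_one_vs_all(models, x):
    """Predict the class whose binary scorer reports the highest score."""
    def score(k):
        w, b = models[k]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)

# Three well-separated classes in the plane (hypothetical data).
sample = [([0.0, 5.0], 0), ([1.0, 4.0], 0),
          ([5.0, 0.0], 1), ([4.0, 1.0], 1),
          ([-5.0, -5.0], 2), ([-4.0, -5.0], 2)]
models = train_one_vs_all(sample, dim=2, classes=[0, 1, 2])
```

Ties and mis-calibrated scores across the independently trained binary classifiers are the classical weakness of this reduction, which motivates the single-formulation multi-class algorithms analyzed in the chapter.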

12. 9 Ranking
(pp. 209-236)

The learning problem of ranking arises in many modern applications, including the design of search engines, information extraction platforms, and movie recommendation systems. In these applications, the ordering of the documents or movies returned is a critical aspect of the system. The main motivation for ranking over classification in the binary case is the limitation of resources: for very large data sets, it may be impractical or even impossible to display or process all items labeled as relevant by a classifier. A standard user of a search engine is not willing to consult all the documents returned in response to...
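In the bipartite (binary-relevance) setting sketched above, the quality of a scoring function is commonly measured by the fraction of relevant/irrelevant pairs it orders incorrectly. A minimal sketch, with a made-up label convention of 1 for relevant and 0 for irrelevant:

```python
def pairwise_misranking(scores, labels):
    """Fraction of (relevant, irrelevant) pairs ordered incorrectly by the
    scores; a tie counts as half an error. labels: 1 relevant, 0 irrelevant."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    errors = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                errors += 1.0
            elif p == n:
                errors += 0.5
    return errors / (len(pos) * len(neg))

# Four hypothetical documents with scores and relevance labels.
loss = pairwise_misranking([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0])
```

This pairwise misranking loss equals one minus the area under the ROC curve, which is one reason pairwise criteria are natural objectives for learning-to-rank algorithms.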

13. 10 Regression
(pp. 237-266)

This chapter discusses in depth the learning problem of *regression*, which consists of using data to predict, as closely as possible, the correct real-valued labels of the points or items considered. Regression is a common task in machine learning with a variety of applications, which justifies the specific chapter we devote to its analysis.

The learning guarantees presented in the previous chapters focused largely on classification problems. Here we present generalization bounds for regression, for both finite and infinite hypothesis sets. Several of these learning bounds are based on the familiar notion of Rademacher complexity, which is useful for characterizing...
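The simplest instance of the regression problem is one-dimensional least squares, which admits a closed-form solution; the sketch below fits a line to made-up data by minimizing the empirical squared loss:

```python
def least_squares_fit(xs, ys):
    """Closed-form simple linear regression: the slope a and intercept b
    minimizing sum_i (a * x_i + b - y_i)^2."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var              # assumes the xs are not all identical
    b = mean_y - a * mean_x
    return a, b

# Noise-free data generated from y = 2x + 1 for illustration.
a, b = least_squares_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The generalization bounds of the chapter quantify how far such an empirical-loss minimizer can be from the best predictor in the class, as a function of the sample size and the complexity of the hypothesis set.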

14. 11 Algorithmic Stability
(pp. 267-280)

In chapters 2–4 and several subsequent chapters, we presented a variety of generalization bounds based on different measures of the complexity of the hypothesis set $H$ used for learning, including the Rademacher complexity, the growth function, and the VC-dimension. These bounds ignore the specific algorithm used, that is, they hold for any algorithm using $H$ as a hypothesis set.

One may ask if an analysis of the properties of a specific algorithm could lead to finer guarantees. Such an algorithm-dependent analysis could have the benefit of a more informative guarantee. On the other hand, it could be inapplicable to...

15. 12 Dimensionality Reduction
(pp. 281-292)

In settings where the data has a large number of features, it is often desirable to reduce its dimension, or to find a lower-dimensional representation preserving some of its properties. The key arguments for dimensionality reduction (or manifold learning) techniques are:

- Computational: to compress the initial data as a preprocessing step to speed up subsequent operations on the data.

- Visualization: to visualize the data for exploratory analysis by mapping the input data into two- or three-dimensional spaces.

- Feature extraction: to hopefully generate a smaller and more effective or useful set of features.
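A standard technique serving all three goals is principal component analysis. The sketch below finds the leading principal component of two-dimensional points via power iteration on the sample covariance matrix; the data, starting vector, and iteration count are illustrative choices:

```python
import math

def top_principal_component(points, iters=200):
    """Leading eigenvector of the 2x2 sample covariance via power iteration.
    Assumes the start vector is not orthogonal to the leading eigenvector."""
    m = len(points)
    mean = [sum(p[i] for p in points) / m for i in range(2)]
    centered = [[p[0] - mean[0], p[1] - mean[1]] for p in points]
    # 2x2 sample covariance matrix.
    c = [[sum(x[i] * x[j] for x in centered) / m for j in range(2)]
         for i in range(2)]
    v = [1.0, 0.0]  # arbitrary non-zero start vector
    for _ in range(iters):
        v = [c[0][0] * v[0] + c[0][1] * v[1],
             c[1][0] * v[0] + c[1][1] * v[1]]
        norm = math.hypot(v[0], v[1])
        v = [v[0] / norm, v[1] / norm]
    return v

def project(points, v):
    """One-dimensional representation: signed projection onto v."""
    return [p[0] * v[0] + p[1] * v[1] for p in points]

# Points lying on the diagonal, so the top component should align with it.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
v = top_principal_component(points)
```

Projecting onto the top components yields the lower-dimensional representation that preserves the most variance, the "compression" and "feature extraction" arguments above in their simplest form.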

The benefits of dimensionality reduction are often illustrated...

16. 13 Learning Automata and Languages
(pp. 293-312)

This chapter presents an introduction to the problem of learning languages. This is a classical problem explored since the early days of formal language theory and computer science, and there is a very large body of literature dealing with related mathematical questions. In this chapter, we present a brief introduction to this problem and concentrate specifically on the question of learning finite automata, which, by itself, has been a topic investigated in multiple forms by thousands of technical papers. We will examine two broad frameworks for learning automata, and for each, we will present an algorithm. In particular, we describe...

17. 14 Reinforcement Learning
(pp. 313-338)

This chapter presents an introduction to reinforcement learning, a rich area of machine learning with connections to control theory, optimization, and cognitive sciences. Reinforcement learning is the study of planning and learning in a scenario where a learner actively interacts with the environment to achieve a certain goal. This active interaction justifies the terminology of *agent* used to refer to the learner. The achievement of the agent's goal is typically measured by the reward it receives from the environment, which it seeks to maximize.

We first introduce the general scenario of reinforcement learning and then introduce the model of...
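The standard model here is the Markov decision process, and a basic planning algorithm for it is value iteration. The sketch below runs it on a tiny two-state MDP whose states, actions, transitions, and rewards are entirely made up:

```python
# Value iteration on a tiny illustrative MDP: two states, two actions,
# deterministic transitions, discount factor gamma.
gamma = 0.9

# transitions[s][a] = list of (probability, next_state, reward) triples.
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}

def value_iteration(iters=200):
    """Iterate the Bellman optimality operator until (numerical) convergence."""
    V = {s: 0.0 for s in transitions}
    for _ in range(iters):
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in transitions[s].values())
             for s in transitions}
    return V

V = value_iteration()
```

Since the Bellman operator is a gamma-contraction, the iterates converge geometrically to the optimal value function: here staying in `s1` forever yields $2/(1-0.9) = 20$, and the best plan from `s0` is to move there first, for a value of $1 + 0.9 \cdot 20 = 19$.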

18. Conclusion
(pp. 339-340)

We described a large variety of machine learning algorithms and techniques and discussed their theoretical foundations as well as their use and applications. While this is not a fully comprehensive presentation, it should nevertheless offer the reader some idea of the breadth of the field and its multiple connections with a variety of other domains, including statistics, information theory, optimization, game theory, and automata and formal language theory.

The fundamental concepts, algorithms, and proof techniques we presented should supply the reader with the necessary tools for analyzing other learning algorithms, including variants of the algorithms analyzed in this book. They...

19. Appendix A Linear Algebra Review
(pp. 341-348)
20. Appendix B Convex Optimization
(pp. 349-358)
21. Appendix C Probability Review
(pp. 359-368)
22. Appendix D Concentration Inequalities
(pp. 369-378)
23. Appendix E Notation
(pp. 379-380)
24. References
(pp. 381-396)
25. Index
(pp. 397-412)
26. Back Matter
(pp. 413-414)