# Optimization for Machine Learning

Suvrit Sra
Sebastian Nowozin
Stephen J. Wright
Pages: 512
https://www.jstor.org/stable/j.ctt5hhgpg

1. Front Matter
(pp. i-iv)
2. Table of Contents
(pp. v-x)
3. Series Foreword
(pp. xi-xii)
Michael I. Jordan and Thomas G. Dietterich

The yearly Neural Information Processing Systems (NIPS) workshops bring together scientists with broadly varying backgrounds in statistics, mathematics, computer science, physics, electrical engineering, neuroscience, and cognitive science, unified by a common desire to develop novel computational and statistical strategies for information processing and to understand the mechanisms for information processing in the brain. In contrast to conferences, these workshops maintain a flexible format that both allows and encourages the presentation and discussion of work in progress. They thus serve as an incubator for the development of important new ideas in this rapidly evolving field. The series editors, in consultation with...

4. Preface
(pp. xiii-xiv)
Suvrit Sra, Sebastian Nowozin and Stephen J. Wright
5. 1 Introduction: Optimization and Machine Learning
(pp. 1-18)
Suvrit Sra, Sebastian Nowozin and Stephen J. Wright

Since its earliest days as a discipline, machine learning has made use of optimization formulations and algorithms. Likewise, machine learning has contributed to optimization, driving the development of new optimization approaches that address the significant challenges presented by machine learning applications. This cross-fertilization continues to deepen, producing a growing literature at the intersection of the two fields while attracting leading researchers to the effort.

Optimization approaches have enjoyed prominence in machine learning because of their wide applicability and attractive theoretical properties. While techniques proposed twenty years and more ago continue to be refined, the increased complexity, size, and variety of...

6. 2 Convex Optimization with Sparsity-Inducing Norms
(pp. 19-54)
Francis Bach, Rodolphe Jenatton, Julien Mairal and Guillaume Obozinski

The principle of parsimony is central to many areas of science: the simplest explanation of a given phenomenon should be preferred over more complicated ones. In the context of machine learning, it takes the form of variable or feature selection, and it is commonly used in two situations. First, to make the model or the prediction more interpretable or computationally cheaper to use, that is, even if the underlying problem is not sparse, one looks for the best sparse approximation. Second, sparsity can also be used given prior knowledge that the model should be sparse.

For variable selection in linear...
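The basic computational tool behind sparsity-inducing penalties such as the ℓ1 norm is its proximal operator, elementwise soft-thresholding. As a hedged sketch (the data, the penalty weight, and the iteration count below are made up for illustration and are not taken from the chapter), a few proximal-gradient (ISTA) steps on a small lasso problem drive most coordinates exactly to zero:

```python
import numpy as np

# Soft-thresholding: the proximal operator of t * ||.||_1, the building block
# of proximal methods for sparsity-inducing norms (e.g. ISTA for the lasso).
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Proximal-gradient (ISTA) steps on 0.5*||Ax - b||^2 + lam*||x||_1
rng = np.random.default_rng(4)
A = rng.standard_normal((50, 10))
b = A @ np.array([3.0, 0, 0, 0, 0, 0, 0, 0, 0, -2.0])   # sparse ground truth
lam, step = 0.5, 1.0 / np.linalg.norm(A, 2) ** 2        # step = 1/L

x = np.zeros(10)
for _ in range(500):
    x = soft_threshold(x - step * A.T @ (A @ x - b), step * lam)

print(np.round(x, 2))  # most coordinates are driven exactly to zero
```

Note that soft-thresholding produces exact zeros, not merely small values; this is why ℓ1-regularized estimates are sparse rather than just shrunk.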

7. 3 Interior-Point Methods for Large-Scale Cone Programming
(pp. 55-84)
Martin Andersen, Joachim Dahl, Zhang Liu and Lieven Vandenberghe

The cone programming formulation has been popular in the recent literature on convex optimization. In this chapter we define a *cone linear program* (cone LP or conic LP) as an optimization problem of the form

minimize $c^T x$

subject to $Gx \preceq_C h$ (3.1)

$Ax = b$

with optimization variable $x$. The inequality $Gx \preceq_C h$ is a *generalized inequality*, which means that $h - Gx \in C$, where $C$ is a closed, pointed, convex cone with nonempty interior. We will also encounter *cone quadratic programs* (cone QPs),

minimize $(1/2)x^T P x + c^T x$ (3.2)

subject to $Gx \preceq_C h$

$Ax = b$,

with $P$ positive semidefinite.

If $C = R_+^p$ (the nonnegative orthant in $R^p$), the generalized inequality is a componentwise...
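When $C = R_+^p$, problem (3.1) is therefore an ordinary linear program. A minimal sketch of this special case, with made-up data and a generic LP solver (`scipy.optimize.linprog`, not the chapter's interior-point codes):

```python
import numpy as np
from scipy.optimize import linprog

# Cone LP (3.1) with C = R_+^p: minimize c^T x s.t. Gx <= h componentwise, Ax = b.
c = np.array([1.0, 2.0])
G = np.array([[-1.0,  0.0],
              [ 0.0, -1.0]])     # Gx <= h encodes x >= 0
h = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

res = linprog(c, A_ub=G, b_ub=h, A_eq=A, b_eq=b, bounds=(None, None))
print(res.x)  # all mass on the cheaper coordinate: [1, 0]
```

For other cones (second-order, semidefinite), the same data $(c, G, h, A, b)$ define the problem, but the solver must handle the generalized inequality directly.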

8. 4 Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey
(pp. 85-120)
Dimitri P. Bertsekas

We consider optimization problems with a cost function consisting of a large number of component functions, such as

minimize $\sum_{i=1}^{m} f_i(x)$ subject to $x \in X$, (4.1)

where $f_i : R^n \mapsto R$, $i = 1, \ldots, m$ are real-valued functions, and $X$ is a closed convex set.¹ We focus on the case where the number of components $m$ is very large, and there is an incentive to use incremental methods that operate on a single component $f_i$ at each iteration, rather than on the entire cost function. If each incremental iteration tends to make reasonable progress in some “average” sense, then, depending...
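A minimal sketch of the incremental idea, assuming a least-squares sum with made-up data (cyclic component order and a constant step size are chosen purely for illustration):

```python
import numpy as np

# Minimize sum_i f_i(x) with f_i(x) = 0.5 * (a_i^T x - b_i)^2,
# visiting one component per iteration (incremental gradient).
rng = np.random.default_rng(0)
m, n = 200, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true                       # consistent system: all f_i share a minimizer

x = np.zeros(n)
step = 0.01
for k in range(50 * m):              # 50 passes over the components
    i = k % m                        # cyclic order; randomized order is also common
    grad_i = (A[i] @ x - b[i]) * A[i]    # gradient of the single component f_i
    x -= step * grad_i

print(np.linalg.norm(x - x_true))    # residual shrinks toward zero
```

Each iteration touches one row of the data, so the per-iteration cost is independent of $m$; this is the appeal when $m$ is very large.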

9. 5 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, I: General Purpose Methods
(pp. 121-148)
Anatoli Juditsky and Arkadi Nemirovski

At present, almost all of convex programming is within the grasp of polynomial time interior-point methods (IPMs) capable of solving convex programs to high accuracy at a low iteration count. However, the iteration cost of all known polynomial methods grows nonlinearly with a problem’s design dimension $n$ (number of decision variables), something like $n^3$. As a result, as the design dimension grows, polynomial time methods eventually become impractical; roughly speaking, a single iteration lasts forever. What “eventually” means in fact depends on a problem’s structure. For instance, typical linear programs of decision-making origin have extremely sparse constraint matrices, and...

10. 6 First-Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem’s Structure
(pp. 149-184)
Anatoli Juditsky and Arkadi Nemirovski

The major drawback of the first-order methods (FOMs) considered in Chapter 5 is their slow convergence: as the number of steps $t$ grows, the inaccuracy decreases as slowly as $O(1/\sqrt{t})$. As explained in Chapter 5, Section 5.1, this rate of convergence is unimprovable in the *unstructured* large-scale case; however, convex problems usually have a lot of structure (otherwise, how could we know that the problem is convex?), and “good” algorithms should utilize this structure rather than be completely black-box-oriented. For example, by utilizing a problem’s structure, we usually can represent it as a linear/conic quadratic/semidefinite program (which usually is easy),...

11. 7 Cutting-Plane Methods in Machine Learning
(pp. 185-218)
Vojtěch Franc, Sören Sonnenburg and Tomáš Werner

Many problems in machine learning are elegantly translated into convex optimization problems, which, however, are sometimes difficult to solve efficiently with off-the-shelf solvers. This difficulty can stem from complexity of either the feasible set or the objective function. Often, these can be accessed only indirectly via an oracle. To access a feasible set, the oracle either asserts that a given query point lies in the set or finds a hyperplane that separates the point from the set. To access an objective function, the oracle returns the value and a subgradient of the function at the query point. Cutting-plane methods solve...
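The subgradient-oracle view can be sketched with Kelley's classical cutting-plane method, which the chapter's algorithms refine; the function, interval, and iteration count below are illustrative only:

```python
import numpy as np
from scipy.optimize import linprog

# Kelley's cutting-plane method on f(x) = |x - 0.3| over [-1, 1].
# The oracle returns f(x) and a subgradient g at x; each call adds a cut
# t >= f(x_k) + g_k * (x - x_k) to a piecewise-linear lower model of f.
f = lambda x: abs(x - 0.3)
subgrad = lambda x: 1.0 if x >= 0.3 else -1.0

cuts = []                # list of (x_k, f_k, g_k)
x = -1.0
for _ in range(10):
    cuts.append((x, f(x), subgrad(x)))
    # minimize t over (x, t) subject to all cuts and x in [-1, 1]
    A_ub = [[g, -1.0] for (xk, fk, g) in cuts]       # g*x - t <= g*xk - fk
    b_ub = [g * xk - fk for (xk, fk, g) in cuts]
    res = linprog([0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-1.0, 1.0), (None, None)])
    x = res.x[0]

print(x)  # converges to the minimizer 0.3
```

The method never sees $f$ directly, only oracle answers; the lower model tightens with every cut, which is the pattern the chapter exploits at scale.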

12. 8 Introduction to Dual Decomposition for Inference
(pp. 219-254)
David Sontag, Amir Globerson and Tommi Jaakkola

Many problems in engineering and the sciences require solutions to challenging combinatorial optimization problems. These include traditional problems such as scheduling, planning, fault diagnosis, or searching for molecular conformations. In addition, a wealth of combinatorial problems arise directly from probabilistic modeling (graphical models). Graphical models (Koller and Friedman, 2009) have been widely adopted in areas such as computational biology, machine vision, and natural language processing, and are increasingly being used as frameworks for expressing combinatorial problems.

Consider, for example, a protein side-chain placement problem where the goal is to find the minimum energy conformation of amino acid sidechains along a fixed...

13. 9 Augmented Lagrangian Methods for Learning, Selecting, and Combining Features
(pp. 255-286)
Ryota Tomioka, Taiji Suzuki and Masashi Sugiyama

Sparse estimation has recently been attracting attention from both the theoretical side (Candès et al., 2006; Bach, 2008; Ng, 2004) and the practical side, for example, magnetic resonance imaging (Weaver et al., 1991; Lustig et al., 2007), natural language processing (Gao et al., 2007), and bioinformatics (Shevade and Keerthi, 2003).

Sparse estimation is commonly formulated in two ways: the regularized estimation (or MAP estimation) framework (Tibshirani, 1996), and the empirical Bayesian estimation (also known as the automatic relevance determination) (Neal, 1996; Tipping, 2001). Both approaches are based on optimizing some objective functions, though the former is usually formulated as a...

14. 10 The Convex Optimization Approach to Regret Minimization
(pp. 287-304)
Elad Hazan

In the online decision making scenario, a player has to choose from a pool of available decisions and then incurs a loss corresponding to the quality of the decision made. The regret minimization paradigm suggests the goal of incurring an average loss which approaches that of the best fixed decision in hindsight. Recently, tools from convex optimization have given rise to algorithms that are more general, unifying previous results and often yielding new and improved regret bounds.
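The canonical algorithm arising from this viewpoint is online (projected) gradient descent. In the toy sketch below, the loss sequence, step sizes, and the grid used to find the best fixed decision are all made up; it tracks average regret against the best fixed decision in hindsight:

```python
import numpy as np

# Online gradient descent on quadratic losses f_t(x) = (x - z_t)^2 over [0, 1].
# Regret_T = sum_t f_t(x_t) - min_x sum_t f_t(x); it should grow sublinearly in T.
rng = np.random.default_rng(1)
T = 10_000
z = rng.uniform(0.2, 0.8, size=T)        # the adversary's target on each round

x, losses = 0.0, []
for t in range(T):
    losses.append((x - z[t]) ** 2)                    # incur this round's loss
    x -= (1.0 / np.sqrt(t + 1)) * 2.0 * (x - z[t])    # gradient step, eta_t = 1/sqrt(t+1)
    x = min(max(x, 0.0), 1.0)                         # project back onto [0, 1]

# best fixed decision in hindsight, approximated on a fine grid
best_fixed = np.min([np.sum((c - z) ** 2) for c in np.linspace(0, 1, 1001)])
regret = np.sum(losses) - best_fixed
print(regret / T)  # average regret per round tends to 0 as T grows
```

The player commits to $x_t$ before seeing $z_t$, which is what distinguishes regret minimization from ordinary optimization of a fixed objective.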

In this chapter we survey some of the recent developments in this exciting merger of optimization and learning. We start by describing...

15. 11 Projected Newton-type Methods in Machine Learning
(pp. 305-330)
Mark Schmidt, Dongmin Kim and Suvrit Sra

We study Newton-type methods for solving the optimization problem

$\min_x f(x) + r(x)$, subject to $x \in \Omega$, (11.1)

where $f : R^n \to R$ is twice continuously differentiable and convex; $r : R^n \to R$ is continuous and convex, but not necessarily differentiable everywhere; and $\Omega$ is a simple convex constraint set. This formulation is general and captures numerous problems in machine learning, especially where $f$ corresponds to a loss, and $r$ to a regularizer. Let us, however, defer concrete examples of (11.1) until we have developed some theoretical background.

We propose to solve (11.1) via Newton-type methods, a certain class of second-order methods that are known to often work well for...
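As a hedged illustration of how a Newton step interacts with a constraint set (this is not the chapter's algorithm), consider a separable instance of (11.1) with a diagonal Hessian, $r = 0$, and $\Omega$ the nonnegative orthant. With a diagonal Hessian the scaled step composes safely with the projection, and a single step lands on the constrained minimizer; handling general Hessians correctly is precisely where the chapter's machinery is needed:

```python
import numpy as np

# Projected Newton step for a separable instance of (11.1):
# f(x) = 0.5 * x^T Q x - b^T x with diagonal Q, r = 0, Omega = {x >= 0}.
Q = np.diag([2.0, 1.0, 4.0])
b = np.array([2.0, -3.0, 8.0])

x = np.zeros(3)
grad = Q @ x - b
x = np.maximum(x - np.linalg.solve(Q, grad), 0.0)   # Newton step, then project

print(x)  # [1. 0. 2.]  (coordinatewise clip of Q^{-1} b at zero)
```

For a non-diagonal Hessian, naively projecting the Newton step can move away from the solution, which motivates two-metric projection and related schemes.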

16. 12 Interior-Point Methods in Machine Learning
(pp. 331-350)
Jacek Gondzio

Soon after Karmarkar (1984) had published his seminal paper, interior-point methods (IPMs) were claimed to have unequalled efficiency when applied to large-scale problems. Karmarkar’s first worst-case complexity proof was based on the use of projective geometry and a cleverly chosen potential function, but was rather complicated. It generated huge interest in the optimization community and soon led to improvements and clarifications of the theory. A major step in this direction was made by Gill et al. (1986), who drew the community’s attention to a close relation between Karmarkar’s projective method and the projected Newton barrier method. The impressive effort of Lustig,...

17. 13 The Tradeoffs of Large-Scale Learning
(pp. 351-368)
Léon Bottou and Olivier Bousquet

The computational complexity of learning algorithms has seldom been taken into account by learning theory. Valiant (1984) states that a problem is “learnable” when there exists a “probably approximately correct” learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., Vapnik, 1982; Boucheron et al., 2005; Bartlett and Mendelson, 2006), very little has been said about the complexity side of this proposal (e.g., Judd, 1988).

Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind:

Data mining exists because competitive advantages can be...

18. 14 Robust Optimization in Machine Learning
(pp. 369-402)
Constantine Caramanis, Shie Mannor and Huan Xu

Learning, optimization, and decision making from data must cope with uncertainty introduced both implicitly and explicitly. Uncertainty can be explicitly introduced when the data collection process is noisy, or when some data are corrupted. It may be introduced when the model specification is wrong, assumptions are missing, or factors are overlooked. Uncertainty is also implicitly present in pristine data, insofar as a finite sample empirical distribution, or function thereof, cannot exactly describe the true distribution in most cases. In the optimization community, it has long been known that the effect of even small uncertainty can be devastating in terms of...

19. 15 Improving First and Second-Order Methods by Modeling Uncertainty
(pp. 403-430)
Nicolas Le Roux, Yoshua Bengio and Andrew Fitzgibbon

Machine learning often looks like optimization: write down the likelihood of some training data under some model and find the model parameters which maximize that likelihood, or which minimize some divergence between the model and the data. In this context, conventional wisdom is that one should find in the optimization literature the state-of-the-art optimizer for one’s problem, and use it.

However, this should not hide the fundamental difference between these two concepts: while optimization is about minimizing some error on the training data, it is the performance on the test data we care about in machine learning. Of course, by...

20. 16 Bandit View on Noisy Optimization
(pp. 431-454)
Jean-Yves Audibert, Sébastien Bubeck and Rémi Munos

In this chapter, we investigate the problem of function optimization with a finite number of noisy evaluations. While at first one may think that simple repeated sampling can overcome the difficulty introduced by noisy evaluations, it is far from being an optimal strategy. Indeed, to make the best use of the evaluations, one may want to estimate the seemingly best options more precisely, while for bad options a rough estimate might be enough. This reasoning leads to non-trivial algorithms, which depend on the objective criterion that we set and on how we define the budget constraint on the number of...
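A minimal sketch of this reasoning is successive halving (with made-up option means and noise level; the chapter's algorithms are more refined): the evaluation budget is spent unevenly, so that surviving contenders are estimated ever more precisely while clearly bad options get only rough estimates:

```python
import numpy as np

# Noisy optimization over a finite set: find the option with the highest mean
# from noisy evaluations. Each round splits a fixed budget over the surviving
# options, then discards the worse-scoring half.
rng = np.random.default_rng(2)
means = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])     # unknown to the player
noisy = lambda i, n: rng.normal(means[i], 0.1, size=n).mean()  # n noisy evaluations

alive = list(range(len(means)))
budget_per_round = 400
while len(alive) > 1:
    n = budget_per_round // len(alive)          # fewer survivors -> more samples each
    scores = {i: noisy(i, n) for i in alive}
    alive.sort(key=lambda i: scores[i])
    alive = alive[len(alive) // 2:]             # keep the better-scoring half

print(alive[0])  # 7, the index of the true best option
```

Uniformly spreading the same 1200 evaluations over all eight options would estimate every option, good or bad, with the same precision; the adaptive split concentrates precision where it matters.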

21. 17 Optimization Methods for Sparse Inverse Covariance Selection
(pp. 455-478)
Katya Scheinberg and Shiqian Ma

In many practical applications of statistical learning the objective is not simply to construct an accurate predictive model, but rather to discover meaningful interactions among the variables. For example, in applications such as reverse engineering of gene networks, discovery of functional brain connectivity patterns from brain-imaging data, and analysis of social interactions, the main focus is on reconstructing the network structure representing dependencies among multiple variables, such as genes, brain areas, and individuals. Probabilistic graphical models, such as Markov networks (or Markov random fields), provide a statistical tool for multivariate data analysis that allows the capture of interactions such as...

22. 18 A Pathwise Algorithm for Covariance Selection
(pp. 479-494)
Vijay Krishnamurthy, Selin Damla Ahipaşaoğlu and Alexandre d’Aspremont

We consider the problem of estimating a covariance matrix from sample multivariate data by maximizing its likelihood while penalizing the inverse covariance so that its graph is *sparse*. This problem is known as covariance selection and can be traced back at least to Dempster (1972). The coefficients of the inverse covariance matrix define the representation of a particular Gaussian distribution as a member of the exponential family; hence sparse maximum likelihood estimates of the inverse covariance yield sparse representations of the model in this class. Furthermore, in a Gaussian model, zeros in the inverse covariance matrix correspond to *conditionally* independent...
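The penalized-likelihood objective described here can be written down directly. A small numeric sketch (the data and penalty weight are made up, and only objective evaluation is performed, with no optimization):

```python
import numpy as np

# Penalized log-likelihood for covariance selection: given a sample covariance S,
# estimate Theta = Sigma^{-1} by maximizing
#     log det(Theta) - trace(S @ Theta) - rho * sum_ij |Theta_ij|
# over positive definite Theta. The l1 penalty pushes entries of the inverse
# covariance to zero; zeros encode conditional independence in the Gaussian model.
def penalized_loglik(theta, S, rho):
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "theta must be positive definite"
    return logdet - np.trace(S @ theta) - rho * np.abs(theta).sum()

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 3))          # three independent variables
S = np.cov(X, rowvar=False)

# On independent data, a diagonal precision matrix should outscore a densely
# coupled candidate under the penalized objective.
dense = np.eye(3) + 0.3 * (np.ones((3, 3)) - np.eye(3))
print(penalized_loglik(np.eye(3), S, rho=0.1) > penalized_loglik(dense, S, rho=0.1))
```

Maximizing this objective over the positive definite cone is the optimization problem the chapter's pathwise algorithm addresses, tracing solutions as the penalty weight varies.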