Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data

Željko Ivezić
Andrew J. Connolly
Jacob T. VanderPlas
Alexander Gray
Copyright Date: 2014
Edition: STU - Student edition
Pages: 544
    Book Description:

    As telescopes, detectors, and computers grow ever more powerful, the volume of data at the disposal of astronomers and astrophysicists will enter the petabyte domain, providing accurate measurements for billions of celestial objects. This book provides a comprehensive and accessible introduction to the cutting-edge statistical methods needed to efficiently analyze complex data sets from astronomical surveys such as the Panoramic Survey Telescope and Rapid Response System, the Dark Energy Survey, and the upcoming Large Synoptic Survey Telescope. It serves as a practical handbook for graduate students and advanced undergraduates in physics and astronomy, and as an indispensable reference for researchers.

    Statistics, Data Mining, and Machine Learning in Astronomy presents a wealth of practical analysis problems, evaluates techniques for solving them, and explains how to use various approaches for different types and sizes of data sets. For all applications described in the book, Python code and example data sets are provided. The supporting data sets have been carefully selected from contemporary astronomical surveys (for example, the Sloan Digital Sky Survey) and are easy to download and use. The accompanying Python code is publicly available, well documented, and follows uniform coding standards. Together, the data sets and code enable readers to reproduce all the figures and examples, evaluate the methods, and adapt them to their own fields of interest.

    • Describes the most useful statistical and data-mining methods for extracting knowledge from huge and complex astronomical data sets
    • Features real-world data sets from contemporary astronomical surveys
    • Uses a freely available Python codebase throughout
    • Ideal for students and working astronomers

    eISBN: 978-1-4008-4891-1
    Subjects: Astronomy, Technology, Physics, Statistics

Table of Contents

  1. Front Matter
    (pp. i-iv)
  2. Table of Contents
    (pp. v-viii)
  3. Preface
    (pp. ix-x)
  4. I Introduction
    • 1 About the Book and Supporting Material
      (pp. 3-42)

      This chapter introduces terminology and nomenclature, reviews a few relevant contemporary books, briefly describes the Python programming language and the Git code management tool, and provides details about the data sets used in examples throughout the book.

      Data mining, machine learning, and knowledge discovery refer to research areas which can all be thought of as outgrowths of multivariate statistics. Their common themes are analysis and interpretation of data, often involving large quantities of data, and even more often resorting to numerical methods. The rapid development of these fields over the last few decades was led by computer scientists, often in...

    • 2 Fast Computation on Massive Data Sets
      (pp. 43-66)

      This chapter describes basic concepts and tools for tractably performing the computations described in the rest of this book. The need for fast algorithms for such analysis subroutines is becoming increasingly important as modern data sets are approaching billions of objects. With such data sets, even analysis operations whose computational cost is linearly proportional to the size of the data set present challenges, particularly since statistical analyses are inherently interactive processes, requiring that computations complete within some reasonable human attention span. For more sophisticated machine learning algorithms, the often worse-than-linear runtimes of straightforward implementations become quickly unbearable. In this chapter...

  5. II Statistical Frameworks and Exploratory Data Analysis
    • 3 Probability and Statistical Distributions
      (pp. 69-122)

      The main purpose of this chapter is to review notation and basic concepts in probability and statistics. The coverage of various topics cannot be complete, and it is aimed at concepts needed to understand material covered in the book. For an in-depth discussion of probability and statistics, please refer to numerous readily available textbooks, such as Bar89, Lup93, WJ03, Wass10, mentioned in §1.3.

      The chapter starts with a brief overview of probability and random variables, then it reviews the most common univariate and multivariate distribution functions, and correlation coefficients. We also summarize the central limit theorem and discuss how to...

    • 4 Classical Statistical Inference
      (pp. 123-174)

      This chapter introduces the main concepts of statistical inference, or drawing conclusions from data. There are three main types of inference:

      Point estimation: What is the best estimate for a model parameter θ, based on the available data?

      Confidence estimation: How confident should we be in our point estimate?

      Hypothesis testing: Are data at hand consistent with a given hypothesis or model?

      There are two major statistical paradigms which address the statistical inference questions: the classical, or frequentist paradigm, and the Bayesian paradigm (despite the often-used adjective “classical,” historically the frequentist paradigm was developed after the Bayesian paradigm). While most...

    • 5 Bayesian Statistical Inference
      (pp. 175-246)

      We have already addressed the main philosophical differences between classical and Bayesian statistical inferences in §4.1. In this chapter, we introduce the most important aspects of Bayesian statistical inference and techniques for performing such calculations in practice. We first review the basic steps in Bayesian inference in §5.1–5.4, and then illustrate them with several examples in §5.6–5.7. Numerical techniques for solving complex problems are discussed in §5.8, and the last section provides a summary of pros and cons for classical and Bayesian methods.

      Let us briefly note a few historical facts. The Reverend Thomas Bayes (1702–1761) was...

  6. III Data Mining and Machine Learning
    • 6 Searching for Structure in Point Data
      (pp. 249-288)

      We begin the third part of this book by addressing methods for exploring and quantifying structure in a multivariate distribution of points. One name for this kind of activity is exploratory data analysis (EDA). Given a sample of N points in D-dimensional space, there are three classes of problems that are frequently encountered in practice: density estimation, cluster finding, and statistical description of the observed structure. The space populated by points in the sample can be real physical space, or a space spanned by the measured quantities (attributes). For example, we can consider the distribution of sources in a multidimensional...

    • [Illustrations]
    • 7 Dimensionality and Its Reduction
      (pp. 289-320)

      With the dramatic increase in data available from a new generation of astronomical telescopes and instruments, many analyses must address the question of the complexity as well as size of the data set. For example, with the SDSS imaging data we could measure arbitrary numbers of properties or features for any source detected on an image (e.g., we could measure a series of progressively higher moments of the distribution of fluxes in the pixels that make up the source). From the perspective of efficiency we would clearly rather measure only those properties that are directly correlated with the science we...

    • 8 Regression and Model Fitting
      (pp. 321-364)

      Regression is a special case of the general model fitting and selection procedures discussed in chapters 4 and 5. It can be defined as the relation between a dependent variable, y, and a set of independent variables, x, that describes the expectation value of y given x: E[y|x]. The purpose of obtaining a “best-fit” model ranges from scientific interest in the values of model parameters (e.g., the properties of dark energy, or of a newly discovered planet) to the predictive power of the resulting model (e.g., predicting solar activity). The usage of the word regression for this relationship dates...

    • 9 Classification
      (pp. 365-402)

      In chapter 6 we described techniques for estimating joint probability distributions from multivariate data sets and for identifying the inherent clustering within the properties of sources. We can think of this approach as the unsupervised classification of data. If, however, we have labels for some of these data points (e.g., an object is tall, short, red, or blue) we can utilize this information to develop a relationship between the label and the properties of a source. We refer to this as supervised classification.

      The motivation for supervised classification comes from the long history of classification in astronomy. Possibly the most...

    • 10 Time Series Analysis
      (pp. 403-468)

      This chapter summarizes the fundamental concepts and tools for analyzing time series data. Time series analysis is a branch of applied mathematics developed mostly in the fields of signal processing and statistics. Contributions to this field, from an astronomical perspective, have predominantly focused on unevenly sampled data, low signal-to-noise data, and heteroscedastic errors. There are more books written about time series analysis than pages in this book and, by necessity, we can only address a few common use cases from contemporary astronomy. Even when limited to astronomical data sets, the diversity of potential applications is enormous. The most common applications...

  7. IV Appendices
    • A. An Introduction to Scientific Computing with Python
      (pp. 471-510)
    • B. AstroML: Machine Learning for Astronomy
      (pp. 511-514)
    • C. Astronomical Flux Measurements and Magnitudes
      (pp. 515-518)
    • D. SQL Query for Downloading SDSS Data
      (pp. 519-520)
    • E. Approximating the Fourier Transform with the FFT
      (pp. 521-526)
  8. Visual Figure Index
    (pp. 527-532)
  9. Index
    (pp. 533-540)