Modeling with Data

Modeling with Data: Tools and Techniques for Scientific Computing

Ben Klemens
Copyright Date: 2009
Pages: 470
    Book Description:

    Modeling with Data fully explains how to execute computationally intensive analyses on very large data sets, showing readers how to determine the best methods for solving a variety of different problems, how to create and debug statistical models, and how to run an analysis and evaluate the results.

    Ben Klemens introduces a set of open and unlimited tools, and uses them to demonstrate data management, analysis, and simulation techniques essential for dealing with large data sets and computationally intensive procedures. He then demonstrates how to easily apply these tools to the many threads of statistical technique, including classical, Bayesian, maximum likelihood, and Monte Carlo methods. Klemens's accessible survey describes these models in a unified and nontraditional manner, providing alternative ways of looking at statistical concepts that often befuddle students. The book includes nearly one hundred sample programs of all kinds. Links to these programs will be available on this page at a later date.

    Modeling with Data will interest anyone looking for a comprehensive guide to these powerful statistical tools, including researchers and graduate students in the social sciences, biology, engineering, economics, and applied mathematics.

    eISBN: 978-1-4008-2874-6
    Subjects: Statistics, Technology

Table of Contents

  1. Front Matter
    (pp. i-vi)
  2. Table of Contents
    (pp. vii-x)
  3. Preface
    (pp. xi-xvi)
  4. 1 Statistics in the modern day
    (pp. 1-14)

    Statistical analysis has two goals, which directly conflict. The first is to find patterns in static: given the infinite number of variables that one could observe, how can one discover the relations and patterns that make human sense? The second goal is a fight against apophenia, the human tendency to invent patterns in random static. Given that someone has found a pattern regarding a handful of variables, how can one verify that it is not just the product of a lucky draw or an overactive imagination?

    Or, consider the complementary dichotomy of objective versus subjective. The objective side is often...

    • 2 C
      (pp. 17-73)

      This chapter introduces C and some of the general concepts behind good programming that script writers often overlook. The function-based approach, stacks of frames, debugging, test functions, and overall good style are immediately applicable to virtually every programming language in use today. Thus, this chapter on C may help you to become a better programmer with any programming language.

      As for the syntax of C, this chapter will cover only a subset. C has 32 keywords and this book will only use 18 of them. Some of the other keywords are basically archaic, designed for the days when compilers needed...

    • 3 Databases
      (pp. 74-112)

      Structured Query Language (SQL) is a specialized language that deals only with the flow of information. Some things, like joining together multiple data sets, are a pain using traditional techniques of matrix manipulation, but are an easy query in a database language. Meanwhile, operations like matrix multiplication or inversion just cannot be done via SQL queries. With both database tables and C-side matrices, your data analysis technique will be unstoppable.

      As a broad rule, try to do data manipulation, like pulling subsets from the data or merging together multiple data tables, using SQL. Then, as a last step, pull...

    • 4 Matrices and models
      (pp. 113-156)

      Recall that the C language provides only the most basic of basics, such as addition and division, and everything else is provided by a library. So before you can do data-oriented mathematics, you will need a library to handle matrices and vectors.

      There are many available; this book uses the GNU Scientific Library (GSL). The GSL is recommended because it is actively supported and will work on about as many platforms as C itself. Beyond functions useful for statistics, it also includes a few hundred functions useful in engineering and physics, which this book will not mention. The full reference...

    • 5 Graphics
      (pp. 157-188)

      Graphics is one of the places where the computing world has not yet agreed on a standard, and so instead there are a dozen standards, including JPG, PNG, PDF, GIF, and many other TLAs. You may find yourself in front of a computer that readily handles everything, including quick displays to the screen, or you may find yourself logging in remotely to a command-line at the university’s servers, that support only SVG graphics. Some journals insist on all graphics being in EPS format, and others require JPGs and GIFs.

      The solution to the graphics portability problem is to use a...

    • 6 More coding tools
      (pp. 189-216)

      If you have a good handle on Chapter 2, then you already have what you need to write some very advanced programs. But C is a world unto itself, with hundreds of utilities to facilitate better coding and many features for the programmer who wishes to delve further.

      This chapter covers some additional programming topics, and some details of C and its environment. As with earlier chapters, the syntax here is C-specific, but it is the norm for programming languages to have the sort of features and structures discussed here, so much of this chapter will be useful regardless of...

    • 7 Distributions for description
      (pp. 219-263)

      This chapter covers some methods of describing a data set, via a number of strategies of increasing complexity. The first approach, in Section 7.1, consists of simply looking at summary statistics for a series of observations about a single variable, like its mean and variance. It imposes no structure on the data of any sort. The next level of structure is to assume that the data is drawn from a distribution; instead of finding the mean or variance, we would instead use the data to estimate the parameters that describe the distribution. The simple statistics and distributions in this chapter...
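
      The first level of description mentioned above, a mean and a variance with no structure imposed on the data, can be sketched in a few lines of plain C. This is only an illustration and not code from the book: the names `summary_stats` and `summarize` are hypothetical, and the sketch assumes the unbiased sample variance (divisor n - 1) computed with Welford's one-pass update.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Summary of a single variable; no distribution is assumed.
         (Hypothetical type, introduced only for this sketch.) */
      typedef struct {
          double mean;
          double variance;   /* unbiased sample variance, divisor n - 1 */
      } summary_stats;

      /* One pass over the data using Welford's update, which avoids
         the cancellation problems of the sum-of-squares formula. */
      summary_stats summarize(const double *data, size_t n) {
          assert(data != NULL && n >= 2);   /* variance needs two points */
          double mean = 0.0, m2 = 0.0;
          for (size_t i = 0; i < n; i++) {
              double delta = data[i] - mean;
              mean += delta / (double)(i + 1);
              m2 += delta * (data[i] - mean);   /* uses the updated mean */
          }
          summary_stats out = { mean, m2 / (double)(n - 1) };
          return out;
      }
      ```

      Calling `summarize` on an array of observations returns both statistics at once; estimating the parameters of an assumed distribution, the next level of structure, would build on the same quantities.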

    • 8 Linear projections
      (pp. 264-294)

      This chapter covers models that make sense of data with more dimensions than we humans can visualize. The first approach, taken in Section 8.1 and known as principal component analysis (PCA), is to find a two- or three-dimensional subspace that best describes the fifty-dimensional data, and flatten the data down to those few dimensions.

      The second approach, in Section 8.2, provides still more structure. The model labels one variable as the dependent variable, and claims that it is a linear combination of the other, independent, variables. This is the ordinary least squares (OLS) model, which has endless variants. The remainder of...

    • 9 Hypothesis testing with the CLT
      (pp. 295-324)

      The purpose of descriptive statistics is to say something about the data you have. The purpose of hypothesis testing is to say something about the data you don’t have.

      Say that you took a few samples from a population, maybe the height of several individuals, and the mean of your sample measurements is $\widehat\mu = 175$ cm. If you did your sums right, then this is an indisputable, certain fact. But what is the mean height of the population from which you drew your data set? To guess at the answer to this question, you need to make some assumptions about how...

    • 10 Maximum likelihood estimation
      (pp. 325-355)

      Whether by divine design or human preference, problems involving the search for optima are everywhere. To this point, most models have had closed-form solutions for the optimal parameters, but if there is not a nice computational shortcut to finding them, you will have to hunt for them directly. There are a variety of routines to find optima, and Apophenia provides a consistent front-end to many of them via its apop_maximum_likelihood function.

      Given a distribution $p(\cdot)$, the value at one input, $p(x)$, is local information: we need to evaluate the function at only one point to write down the value. However,...

    • 11 Monte Carlo
      (pp. 356-380)

      Monte Carlo (Italian and Spanish for Mount Carl) is a city in Monaco famous for its casinos, and has more glamorous associations with its name than Reno or Atlantic City.

      Monte Carlo methods are thus about randomization: taking existing data and making random transformations to learn more about it. But although the process involves randomness, its outcome is not just the mere whim of the fates. At the roulette table, a single player may come out ahead, but with millions of suckers testing their luck, casinos find that even a 49–51 bet in their favor is a reliable method...
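
      The roulette remark can be checked with a small simulation. The sketch below is an illustration and not one of the book's sample programs. It assumes that "49–51" means the house wins each even-money unit bet with probability 0.51, so its expected profit per bet is 2(0.51) - 1 = 0.02. The function names are hypothetical, and the SplitMix64 generator is used only so that the example is reproducible without an external library.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* SplitMix64 pseudo-random generator; advances *state on each call. */
      static uint64_t splitmix64(uint64_t *state) {
          uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
          z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
          z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
          return z ^ (z >> 31);
      }

      /* Uniform draw in [0, 1) built from the top 53 bits. */
      static double uniform01(uint64_t *state) {
          return (double)(splitmix64(state) >> 11) * (1.0 / 9007199254740992.0);
      }

      /* Average house profit per unit bet over n_bets even-money bets,
         where the house wins each bet with probability p_house. */
      double house_profit_per_bet(long n_bets, double p_house, uint64_t *state) {
          assert(n_bets > 0 && p_house >= 0.0 && p_house <= 1.0);
          long net = 0;
          for (long i = 0; i < n_bets; i++)
              net += (uniform01(state) < p_house) ? 1 : -1;
          return (double)net / (double)n_bets;
      }
      ```

      With a handful of bets the result swings widely in either direction, which is the single lucky player; with a million bets the standard deviation of the average is roughly 0.001, so the house's take settles close to 0.02 per unit wagered.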

  7. Appendix A: Environments and makefiles
    (pp. 381-391)
  8. Appendix B: Text processing
    (pp. 392-418)
  9. Appendix C: Glossary
    (pp. 419-434)
  10. Bibliography
    (pp. 435-442)
  11. Index
    (pp. 443-454)