The Practice of Data Analysis

The Practice of Data Analysis: Essays in Honor of John W. Tukey

D. R. Brillinger
L. T. Fernholz
S. Morgenthaler
Copyright Date: 1997
Pages: 352
https://www.jstor.org/stable/j.ctt7zthdd
    Book Description:

    This collection of essays brings together many of the world's most distinguished statisticians to discuss a wide array of the most important recent developments in data analysis. The book honors John W. Tukey, one of the most influential statisticians of the twentieth century, on the occasion of his eightieth birthday. Contributors, some of them Tukey's former students, use his general theoretical work and his specific contributions to Exploratory Data Analysis as the point of departure for their papers. They cover topics from "pure" data analysis, such as gaussianizing transformations and regression estimates, and from "applied" subjects, such as the best way to rank the abilities of chess players or to estimate the abundance of birds in a particular area.

    Tukey may be best known for coining the common computer term "bit," for binary digit, but his broader work has revolutionized the way statisticians think about and analyze sets of data. In a personal interview that opens the book, he reviews these extraordinary contributions and his life with characteristic modesty, humor, and intelligence. The book will be valuable both to researchers and students interested in current theoretical and practical data analysis and as a testament to Tukey's lasting influence.

    The essays are by Dhammika Amaratunga, David Andrews, David Brillinger, Christopher Field, Leo Goodman, Frank Hampel, John Hartigan, Peter Huber, Mia Hubert, Clifford Hurvich, Karen Kafadar, Colin Mallows, Stephan Morgenthaler, Frederick Mosteller, Ha Nguyen, Elvezio Ronchetti, Peter Rousseeuw, Allan Seheult, Paul Velleman, Maria-Pia Victoria-Feser, and Alessandro Villa.

    Originally published in 1997.

    The Princeton Legacy Library uses the latest print-on-demand technology to again make available previously out-of-print books from the distinguished backlist of Princeton University Press. These paperback editions preserve the original texts of these important books while presenting them in durable paperback editions. The goal of the Princeton Legacy Library is to vastly increase access to the rich scholarly heritage found in the thousands of books published by Princeton University Press since its founding in 1905.

    eISBN: 978-1-4008-5160-7
    Subjects: Statistics

Table of Contents

  1. Front Matter
    (pp. i-iv)
  2. Table of Contents
    (pp. v-vi)
  3. Preface
    (pp. vii-viii)
    David R. Brillinger, Luisa T. Fernholz and Stephan Morgenthaler
  4. Opening Material
    • Introductory Remarks by the Editors
      (pp. 3-4)

      The opening material of this volume contains a short biography and a curriculum vitae which exhibit many facets of John W. Tukey’s extraordinary life and personality.

      Statistics as a field has gained enormously in prestige thanks to John Tukey and one measure of his prolific energies and his profound influence is the long list of Ph.D. students and grand-students which follows the CV. Anyone who has had the privilege to hear lectures by John Tukey or to discuss statistical issues with him, knows what a deep source of insights, often delivered along with an anecdote, he is. The transcribed conversation...

    • Biographical Information
      (pp. 5-8)
    • Curriculum Vitae of John Wilder Tukey
      (pp. 9-15)
    • Ph.D. Theses Directed by John W. Tukey Princeton University, 1940–1990
      (pp. 16-18)
    • Partial List of John W. Tukey’s Grandstudents
      (pp. 19-25)
    • A Conversation with John W. Tukey
      (pp. 26-45)
      Luisa T. Fernholz, Stephan Morgenthaler, others, and John W. Tukey

      Q: I am going to start with a somewhat personal question. We heard yesterday that you did not have a formal education, but were educated at home. Could you tell us a little bit about that?

      A: Okay, well, by the time I was five, my parents had settled in New Bedford. My father was head of the Latin Department in the High School. In those unregenerate days a married woman couldn’t be a teacher in Massachusetts. So, my mother wasn’t a teacher, but she was a substitute. And I have heard it claimed, that between the two of them,...

    • Elizabeth Tukey’s Speech
      (pp. 46-47)
      Elizabeth Tukey

      I have a confession to make. I asked if I might introduce one of John’s oldest and best friends tonight – mainly because I want all our friends to understand what an influence he has been in John’s life.

      John just met Bill Baker at the Princeton Graduate College in the academic year 1937–38. John was still a chemist, and Bill, also a chemist, had entered Princeton a year earlier. Bill got his degree in Chemistry in 1938, and John got his in Mathematics in 1939. The war intervened but by February of 1945 John had taken a full time...

    • Program of the Conference in Honor of John W. Tukey on His 80th Birthday
      (p. 48)
    • List of Participants
      (pp. 49-54)
    • [Illustrations]
  5. Scientific Papers
    • Errors-in-Variables Regression Estimators That Have High Breakdown and High Gaussian Efficiency
      (pp. 57-66)
      Dhammika Amaratunga

      For linear regression, it has been argued that a one-step M-estimator (Beaton and Tukey, 1974), whose single M step is taken from a high breakdown estimator, inherits the high breakdown property, while producing high asymptotic efficiency at the nominal Gaussian situation (Rousseeuw, 1984, 1994, Rousseeuw and Leroy, 1987, Jureckova and Portnoy, 1989), although whether the high efficiency carries over to finite samples, particularly when the data contain leverage points, has been debated (Morgenthaler, 1989, 1991, Stefanski, 1991, Coakley et al., 1994, Rousseeuw, 1994). The M step can be accomplished either as a weighted least squares (WLS) estimator (in which case,...

    • The Analytic Jackknife
      (pp. 67-76)
      David F. Andrews

      In this paper we develop expressions for jackknife estimates of variance in terms of operators. These operators lead to the analytic expressions for the evaluation of properties of estimates.

      The methods introduced here have been implemented as algorithms for symbolic calculation. The expressions given here were produced by these algorithms.

      We consider parameters to be properties of distributions. We consider the case where a sample of n independent, though not necessarily identically distributed, random variables are available for the estimation of these parameters. Although, in some very special cases, a family of distributions may be indexed by a parameter, we...

    • Assessing Connections in Networks of Biological Neurons
      (pp. 77-92)
      David R. Brillinger and Alessandro E. P. Villa

      The sequence of spikes of a neuron, referred to as a “spike train”, may carry important information processed by the brain and thus may underlie cognitive functions and sensory perception (Abeles, 1991). The data studied are recorded stretches of point processes corresponding to the firing times of neurons measured in the cat’s auditory thalamus (Webster et al., 1992). This set of nuclei is often viewed as the penultimate in an ascending hierarchy of processing stages of the auditory sensation that begins at the level of the inner ear. The thalamic nuclei belonging to the cat auditory pathway are the medial...

    • Estimating Abundances for a Breeding Bird Atlas
      (pp. 93-100)
      Christopher A. Field

      During the five year period from 1986 to 1990, data were collected for a breeding bird atlas for the three Maritime Provinces of Canada, New Brunswick, Nova Scotia and Prince Edward Island. The purpose of the Atlas (Erskine, 1992) is to determine which birds breed in which parts of the region and to give a rough estimate of the number of breeding pairs for each species. To collect the data, the region was divided into 1682 squares each 10 km² and it was decided to collect data on as many squares as possible. One quarter of the squares were designated as...

    • Statistical Methods, Graphical Displays, and Tukey’s Ladder of Re-Expression in the Analysis of Nonindependence in Contingency Tables: Correspondence Analysis, Association Analysis, and the Midway View of Nonindependence
      (pp. 101-132)
      Leo A. Goodman

      This article is concerned with methods for the analysis of cross-classified data pertaining to, say, two classifications, a row classification consisting of I rows and a column classification consisting of J columns (with I ≥ 2 and J ≥ 2). We shall be concerned with the analysis of nonindependence between the row classification and the column classification in the two-way I × J cross-classification table. We shall first consider the analysis of nonindependence in the 2 × 2 table, and then we shall consider the more general I × J table.

      For the 2 x 2 table, we shall apply...

    • Some Additional Notes on the “Princeton Robustness Year”
      (pp. 133-154)
      Frank Hampel

      For the academic year 1970/71, G.S. Watson, then chairman of the Princeton statistics department, invited P.J. Bickel (University of California, Berkeley), P.J. Huber (ETH Zurich) and the author (University of Zurich) to join J.W. Tukey (Princeton University and Bell Laboratories) for a cooperative effort for progress in robust statistics. The best-known, and virtually the only known outcome of this intensive research year is the book by Andrews et al. (1972) on the extensive Monte Carlo study of “robust estimates of location under symmetric longtailed distributions.” On the one hand, the book contains a wealth of new information, methods, results, and...

    • Tracking Chess Players’ Abilities
      (pp. 155-174)
      John A. Hartigan

      Players in the United States Chess Federation (USCF) are rated using the Elo (1961) rating system:

      For a player who has played less than 20 rated matches, there is an initial or provisional rating equal to the average rating of opponents plus 400 times the average number of wins minus losses. For a player with more than 20 matches, the player’s rating is updated by the Elo updating formula: each match increases the winner’s rating and decreases the loser’s rating by an amount proportional to the difference between the score (1 for a win, 1/2 for a draw, and 0...
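      The two rules in the excerpt above can be sketched in a few lines. This is only an illustration, not the chapter's own model: the provisional formula follows the wording of the excerpt, while the logistic expected-score curve, the 400-point scale inside it, and the constant `k = 32` are assumptions taken from the commonly used form of Elo's formula, since the excerpt is cut off before it states them. All function names are hypothetical.

```python
def provisional_rating(opponent_ratings, wins, losses):
    """Provisional rating as worded in the excerpt: the average opponent
    rating plus 400 times the average of (wins minus losses) per match."""
    n_matches = len(opponent_ratings)
    if n_matches == 0:
        raise ValueError("at least one rated match is required")
    average_opponent = sum(opponent_ratings) / n_matches
    return average_opponent + 400.0 * (wins - losses) / n_matches


def expected_score(rating_a, rating_b):
    """Expected score of player A against player B.
    Assumption: the usual logistic curve with a 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Move both ratings by k times (actual score minus expected score).
    score_a is 1 for a win, 0.5 for a draw, 0 for a loss.
    Assumption: k = 32 is an illustrative proportionality constant."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

      In this sketch a win between two equally rated players moves each rating by k/2 points in opposite directions, so the sum of the two ratings is unchanged.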

    • Speculations on the Path of Statistics
      (pp. 175-192)
      Peter J. Huber

      During the past third of a century, there have been several papers and conferences attempting to assess the current state of statistics, its woes and its future, including some veiled attempts to influence the latter. The present paper is a subjective and partial review of those efforts. From a series of quotes there emerges an interesting, sometimes disturbing, but fairly consistent view of the paths along which statistics has been developing. I shall also make an effort of my own to assess our present time, to extrapolate the past developments and to make some predictions for the immediate future.

      I...

    • A Regression Analysis with Categorical Covariables, Two-way Heteroscedasticity, and Hidden Outliers
      (pp. 193-202)
      Mia Hubert and Peter J. Rousseeuw

      In this paper we will analyze a real data set with both continuous and categorical regressors, which contains heteroscedastic errors and some hidden outliers. It is self-evident nowadays that analyzing data involves using tools and ideas introduced by John Tukey, in this case boxplots, median polish, and insights from residual plots.

      The education expenditure data of Chatterjee and Price (1991, pp. 119-121) consist of the per capita expenditure on education (EDUC) in the 50 states of the US, from 1965 until 1975. The states are grouped into four regions: North East (NE), North Central (NC), South (S), and West (W)....

    • Mean Square over Degrees of Freedom: New Perspectives on a Model Selection Treasure
      (pp. 203-216)
      Clifford M. Hurvich

      Tukey’s (Anscombe’s) mean square over residual degrees of freedom, or MS/v, is arguably the first formal model selection criterion ever proposed. The criterion was introduced and motivated in a linear regression context by Tukey in his discussion of a paper by Anscombe (1967). The criterion was subsequently described in Mosteller and Tukey (1977, Chapt. 15). Tukey’s original motivation drew on several key ideas, including an approximation to ratios of elements of the hat matrix given by Anscombe. Although MS/v predates such widely-used criteria as FPE (Akaike, 1969), AIC (Akaike, 1973), and PRESS (Allen, 1971), Tukey’s criterion has received remarkably little...
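      As a rough illustration of the criterion named above, the sketch below divides a residual mean square by the residual degrees of freedom and picks the candidate regression with the smallest value. It rests on assumptions not stated in the excerpt: that the mean square is the residual sum of squares divided by v, that v equals the number of observations minus the number of fitted coefficients, and that smaller values are preferred. The function names and the example figures are hypothetical.

```python
def ms_over_df(rss, n_obs, n_params):
    """Residual mean square divided by residual degrees of freedom.
    Assumption: v = n_obs - n_params and MS = rss / v, so MS/v = rss / v**2."""
    v = n_obs - n_params
    if v <= 0:
        raise ValueError("residual degrees of freedom must be positive")
    mean_square = rss / v
    return mean_square / v


def select_model(candidates, n_obs):
    """Return the name of the candidate with the smallest MS/v.
    candidates maps a model name to (residual sum of squares, n_params)."""
    scores = {
        name: ms_over_df(rss, n_obs, n_params)
        for name, (rss, n_params) in candidates.items()
    }
    return min(scores, key=scores.get)


# Made-up figures for two nested regressions fitted to 20 observations.
example = {"small": (100.0, 2), "large": (80.0, 5)}
chosen = select_model(example, n_obs=20)
```

      With these made-up figures the larger model lowers the residual sum of squares, but not by enough to offset the three degrees of freedom it spends, so the smaller model is chosen.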

    • Geographical Trends in Cancer Mortality: Spatial Smoothers and Adjustment
      (pp. 217-234)
      Karen Kafadar

      Health-related data of various sorts are available nationally and are of wide interest, and maps are valuable tools for communicating the quantitative information that they contain. Through maps, we can: with as much precision as the map will allow. Three of the goals of such maps are:

      1. Summarize the data;

      2. Provide insights into interesting or unusual features in the data;

      3. Present the information in the data as clearly and as accurately as possible.

      These three goals are mutually supportive: the summary should indicate interesting features of the data that might not be obvious without the display, and...

    • Covering Designs in Random Environments
      (pp. 235-246)
      Colin L. Mallows

      The problem of designing a batch of tests for a large software product is superficially similar to that of designing an experiment for estimating main effects and interactions. In fact classical designs have been proposed for this purpose (Tatsumi, Watanabe, Takeuchi, and Shimokawa, 1987, Zeitler, 1991, Brownlie, Prowse and Phadke, 1992). However in some versions of the software testing problem a different approach is needed, and a different class of designs becomes attractive. For example, we will see that there is a useful design for 126 two-level factors using only 10 runs. The designs we shall consider have some features...

    • Gaussianizing Transformations and Estimation
      (pp. 247-260)
      Stephan Morgenthaler

      To transform data in order to make its appearance more symmetrical and more comparable across “batches” and in order to simplify relationships with explanatory variables are ideas that have been studied in depth and have been popularised by John Tukey (see, for example, Tukey, 1957 and Tukey, 1977, Chapters 3, 4 and 6). Under a wide range of circumstances it is possible to use comparatively simple statistical tools after the data have been transformed. As Anscombe & Tukey (1954) put it:

      When classical methods of analysis of variance and regression do not apply directly and it is possible to apply...

    • The Tennessee Study of Class Size in the Early School Grades
      (pp. 261-278)
      Frederick Mosteller

      If we want to improve school systems, we need to consider what changes may be practical and effective. Because we have all gone to school, we have ideas about how to improve the system. For example, James Garfield once said that a pine log with a student on one end and Mark Hopkins, a beloved president of Williams College, on the other would be an ideal university. Setting aside the discomfort of outdoor logs during New England winters, would this design have used President Hopkins’ time effectively? Aristotle, even when tutoring the young Alexander before he was called “the Great,”...

    • On the Distribution of Order Statistics from a p-wild Distribution
      (pp. 279-286)
      Ha H. Nguyen

      The robustness of a statistic is sometimes judged by studying its behavior in the presence of outliers. One distribution that is often found in the literature to describe the situation with outliers is the p-wild distribution where a sample of size n consists of n − p observations from one distribution f(x) and p observations from another distribution g(x).

      David and Shu (1978) derived the relationship between the rth order statistic from a sample of size n from a one-wild distribution and the rth order statistic from a sample of size n − 1 from the uncontaminated distribution. This paper will derive...
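      The sampling scheme described above is easy to simulate. The sketch below draws n − p values from one distribution and p from another, then sorts them to obtain the order statistics. It only illustrates the definition and does not reproduce the chapter's derivation; the function name and the choice of a standard Gaussian for f and a Gaussian with ten times the scale for g are assumptions made for the example.

```python
import random


def p_wild_order_statistics(n, p, draw_f, draw_g, rng):
    """Order statistics of a p-wild sample: n - p draws from f and
    p draws from g, returned in ascending order."""
    if not 0 <= p <= n:
        raise ValueError("p must lie between 0 and n")
    sample = [draw_f(rng) for _ in range(n - p)]
    sample += [draw_g(rng) for _ in range(p)]
    return sorted(sample)


# Assumed example: f is N(0, 1) and g is N(0, 10^2); neither is from the source.
rng = random.Random(0)
one_wild = p_wild_order_statistics(
    n=10,
    p=1,
    draw_f=lambda r: r.gauss(0.0, 1.0),
    draw_g=lambda r: r.gauss(0.0, 10.0),
    rng=rng,
)
```

      Here `one_wild[r - 1]` is the rth order statistic of a one-wild sample of size 10; repeating the draw many times gives an empirical picture of its distribution.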

    • Resistant Modeling of Income Distributions and Inequality Measures
      (pp. 287-298)
      Elvezio Ronchetti and Maria-Pia Victoria-Feser

      Robustness has been a recurrent element of John W. Tukey’s multiple, diverse, and influential contributions to statistics. His 1960 paper showed the damaging effects of small deviations from the model and was one of the building blocks of modern robust statistics. Robustness ideas played a crucial role in the development of his exploratory data analysis (Tukey, 1977) and appeared clearly in his influential 1962 paper and even earlier toward the end of World War II in the work of the Princeton University Fire Control Research Group.

      It is therefore fitting on this occasion to discuss some recent developments of robust...

    • Bonus Decompositions for Robust Analysis of 2^n Factorial Experiments
      (pp. 299-316)
      Allan H. Seheult

      John Tukey (1971, Ch. 36, p. 37) introduced the idea of a “bonus” as

      ... a term that only operates — only contributes something not zero — when certain versions of the two factors combine.

      This notion of an interaction which appears to operate at just one combination of the versions of two factors in a 2^n factorial experiment has also been recognised by Daniel (1976, p. 21)

      There is one contrast, not three, and one sentence that describe the situation completely. The contrast is [(1) + b + ab − 3a]; the sentence is: “the condition a is adverse (or advantageous) and the...

    • The Philosophical Past and the Digital Future of Data Analysis: 375 Years of Philosophical Guidance for Software Design on the Occasion of John W. Tukey’s 80th Birthday
      (pp. 317-337)
      Paul F. Velleman

      As a student of John Tukey, I learned to seek philosophical underpinnings for my work in data analysis and statistical computing. More recently, I have begun to marvel and to worry at how little we teach our students about why we do what we do with data, and to consider whether there is a way forward in statistical computing without a sound philosophical basis.

      In my search, I have studied work by a revolutionary thinker. One who challenges his readers to construct understanding of the world step-by-step from dispassionate consideration of evidence, moving from data to model and back to...