Data Mining for the Social Sciences: An Introduction

Paul Attewell
David B. Monaghan
with Darren Kwong
Copyright Date: 2015
Edition: 1
Pages: 264
    Book Description:

    We live in a world of big data: the amount of information collected on human behavior each day is staggering, and exponentially greater than at any time in the past. Additionally, powerful algorithms are capable of churning through seas of data to uncover patterns. Providing a simple and accessible introduction to data mining, Paul Attewell and David B. Monaghan discuss how data mining substantially differs from conventional statistical modeling familiar to most social scientists. The authors also empower social scientists to tap into these new resources and incorporate data mining methodologies in their analytical toolkits. Data Mining for the Social Sciences demystifies the process by describing the diverse set of techniques available, discussing the strengths and weaknesses of various approaches, and giving practical demonstrations of how to carry out analyses using tools in various statistical software packages.

    eISBN: 978-0-520-96059-6
    Subjects: Population Studies, Political Science

Table of Contents

  1. Front Matter
    (pp. i-vi)
  2. Table of Contents
    (pp. vii-x)
    (pp. xi-xii)

      (pp. 3-12)

      Data mining (DM) is the name given to a variety of computer-intensive techniques for discovering structure and for analyzing patterns in data. Using those patterns, DM can create predictive models, or classify things, or identify different groups or clusters of cases within data. Data mining and its close cousins machine learning and predictive analytics are already widely used in business and are starting to spread into social science and other areas of research.

      A partial list of current data mining methods includes:

      association rules

      recursive partitioning or decision trees, including CART (classification and regression trees) and CHAID (chi-squared automatic interaction...

      (pp. 13-29)

      Data mining (DM) offers an approach to data analysis that differs in important ways from the conventional statistical methods that have dominated over the last several decades. In this section we highlight some key contrasts between the emerging DM paradigm and the older statistical approach to data analysis, before detailing in later chapters the individual methods or tools that constitute DM. In illustrating these contrasts, we will use multiple regression to stand for the conventional approach, since this statistical method—along with its many extensions and offshoots, including logistic regression, event-history analysis, multilevel models, log-linear models, and structural equation modeling...

      (pp. 30-52)

      Data dredging—searching through data until one finds statistically significant relations—is deplored by conventional methods textbooks, which instruct students to generate hypotheses before beginning their statistical analyses. The DM approach raises data dredging to new heights—but to its credit DM does not follow the conventional paradigm’s bad example regarding significance testing when there are multiple predictors. It focuses instead on an alternative way of avoiding false-positive results or type I error: it emphasizes replication rather than significance testing, through a procedure known as cross-validation.

      Prior to beginning an analysis involving cross-validation, DM software separates the cases within a...
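      That holdout step—setting some cases aside before any model is fit—can be sketched in a few lines. This is a minimal illustration, not the book's own code; the function name and 70/30 split are assumptions, and real DM software often stratifies the split as well.

```python
import random

def holdout_split(cases, test_fraction=0.3, seed=42):
    """Randomly partition cases into (training, test) sets.

    A minimal sketch of the holdout step described above; DM packages
    do this internally, often with stratification on the outcome.
    """
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(100))
# 70 cases remain for model fitting, 30 are held out for validation
```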

      (pp. 53-60)

      Having introduced the paradox of too little big data and noted the challenges caused by high-dimensional data, we can now discuss how a DM analysis typically proceeds. There are six conceptually separate steps: (1) deciding whether and how to sample data before analyzing it; (2) building a rich array of features or variables; (3) feature selection and feature extraction; (4) constructing or fitting a model using that smaller list of features on the training data; (5) verifying or validating that model on test data; and (6) trying out alternative DM methods and perhaps combining several (ensemble methods) in order to...


      (pp. 63-71)

      Earlier, we discussed how cross-validation (CV) works as a sort of quality control mechanism in data mining, and pointed out how CV methods contrast in an interesting manner with conventional tests for statistical significance. We will now discuss explicitly the logic of CV, and then provide a guide for how one carries out this technique in practice using a number of statistical packages.

      Many data mining texts deal with the logic of CV in a rather cursory fashion. The focus is on its practical application: how CV presents a solution to one or another problem that tends to be encountered...
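      The logic of k-fold cross-validation—every case serves once as test data and k−1 times as training data—can be made concrete with a short sketch. The function below is illustrative only (names are my own); statistical packages typically shuffle and stratify the cases before folding.

```python
def k_fold_indices(n_cases, k=5):
    """Split case indices into k (train, test) pairs for cross-validation.

    A sketch of the folding logic only: each case lands in exactly
    one test fold and in the training set of every other fold.
    """
    indices = list(range(n_cases))
    folds = []
    start = 0
    for i in range(k):
        # spread any remainder cases across the first folds
        size = n_cases // k + (1 if i < n_cases % k else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(10, k=5)
# each of the 10 cases appears in exactly one of the 5 test folds
```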

      (pp. 72-92)

      When analyzing big data, we are faced with an overwhelming welter of information. We have too many cases, or too much information about each case, for effective use of standard statistical methodologies. We have already seen how having too many cases can cause programs to crash or run unduly slowly, and how this can sometimes be addressed simply by sampling our data. A more complicated situation emerges when we have too much information gathered about each case—in other words, when we have more variables than we know what to do with.

      Data miners use the letter N to refer...

    • 7 Creating New Variables Using Binning and Trees
      (pp. 93-115)

      Experienced data miners repeatedly tell newcomers that what takes the most time and requires the greatest care in data mining is typically not running the analysis (the modeling stage) but the stage prior to data analysis when the researcher creates the variables or features that are going to be entered into models. This is partly because researchers use their knowledge of the subject matter to ensure that important variables are not left out. Researchers also construct ratios that seem conceptually important (cost per square foot, shootings per 100,000 population, etc.) and which may prove empirically powerful predictors. Beyond this, though,...
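      Constructing ratio variables of the kind mentioned (cost per square foot, shootings per 100,000) is straightforward in code. The sketch below is illustrative; the field names are hypothetical, not variables from the book's examples.

```python
def add_ratio_features(records):
    """Add ratio variables of the kind described above.

    Field names (cost, sqft, shootings, population) are hypothetical
    illustrations, not taken from the book.
    """
    enriched = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are left unchanged
        rec["cost_per_sqft"] = rec["cost"] / rec["sqft"]
        rec["shootings_per_100k"] = 100_000 * rec["shootings"] / rec["population"]
        enriched.append(rec)
    return enriched

rows = [{"cost": 500_000, "sqft": 2_000, "shootings": 30, "population": 600_000}]
features = add_ratio_features(rows)
```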

      (pp. 116-132)

      When we have data of high dimension, that is, data that are very wide (lots of attributes or predictors), we sometimes want to find ways of reducing their dimensionality. We have already discussed feature selection methods like stepwise regression, LASSO, and VIF regression. These methods are certainly options when we want to decrease the dimensionality of predictor variables in terms of their relation with an outcome. Feature selection tools are all “supervised” methods, in that one particular dimension of the data (the outcome, target, or dependent variable) is privileged, and we select variables that are interesting for how they relate...
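      The "supervised" character of feature selection—ranking predictors by how they relate to a privileged outcome—can be illustrated with a toy correlation filter. This is a stand-in of my own, far simpler than the model-based methods named above (stepwise regression, LASSO, VIF regression), but it shows the selection logic.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def select_features(rows, y, keep):
    """Supervised filter: keep the columns most correlated with outcome y.

    A toy illustration of supervised selection, not one of the book's
    named methods, which are model-based rather than correlation filters.
    """
    columns = list(zip(*rows))  # transpose cases-by-variables into columns
    scores = sorted(
        ((abs(pearson_r(col, y)), j) for j, col in enumerate(columns)),
        reverse=True,
    )
    return sorted(j for _, j in scores[:keep])

X = [[1, 5], [2, 3], [3, 8], [4, 1]]  # column 0 tracks y exactly, column 1 is noise
kept = select_features(X, [1, 2, 3, 4], keep=1)
```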

      (pp. 133-161)

      In data mining, classifiers are programs that predict which category or class of a dependent variable individual observations fall into. For example, we previously classified individuals according to whether or not they have health insurance, making use of several demographic characteristics. In some types of classification algorithms, classification involves developing a predictive statistical model, using a set of independent variables or attributes to predict each individual’s value on an outcome or dependent variable or target. That prediction, in the form of a probability that a given case will fall into a certain category or class, is then used to classify...
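      The final step—turning a predicted probability into a class—is just a cutoff rule. The sketch below echoes the health-insurance example, but the labels and the 0.5 default cutoff are illustrative assumptions; the cutoff is itself a tunable choice.

```python
def classify(probability, threshold=0.5):
    """Turn a predicted probability into a predicted class.

    Labels echo the health-insurance example but are illustrative;
    0.5 is a common default cutoff, not a requirement.
    """
    return "insured" if probability >= threshold else "uninsured"

predictions = [classify(p) for p in (0.9, 0.2, 0.5)]
```

      Lowering the threshold trades false negatives for false positives: `classify(0.2, threshold=0.1)` would label that same case "insured".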

      (pp. 162-184)

      Developed by Breiman and colleagues (1984), the classification tree (also known as CART, CHAID, decision tree, or partition tree) is in some ways the paradigmatic data mining tool: simple, powerful, computation-intensive, nonparametric, and utterly data-driven. It is first and foremost a classifier, using input characteristics to create a model which sorts cases into categories with different values on an outcome of interest. And it doesn’t matter whether the outcome variable or the input variables are dichotomous, categorical, or continuous; partition trees can handle all of them, and deal with them in more or less the same manner. However, partition trees...
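      The core move in tree construction is a single partition: exhaustively trying cut points on an input and keeping the one that best separates the outcome classes. The sketch below shows that one step using Gini impurity; actual CART recurses this on each resulting branch and searches over every input variable, not just one.

```python
def gini(labels):
    """Gini impurity of a set of 0/1 class labels (0 = perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Search thresholds on one input, minimizing weighted child impurity.

    A sketch of a single partition step, not a full tree builder.
    """
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must send cases to both branches
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

threshold, impurity = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
# the gap between 3 and 10 separates the classes perfectly
```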

      (pp. 185-195)

      Artificial neural networks (ANNs; neural networks or neural nets for short) are machine-learning tools which are inspired, as the name suggests, by the operation of biological neurons. To get a very general, abstract notion of how ANNs operate, consider the basic functioning of a neuron. Neurons have dendrites which gather input information from other neurons. This information is combined, and when some threshold is reached the neuron “fires.” In this fashion, the neuron channels information to other neurons. Additionally, networks of neurons are capable of “learning” based on previous errors.

      Artificial neural networks work in similar fashion. They gather information...
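      A single artificial neuron—combine weighted inputs, fire past a threshold—fits in a few lines. The weights below are hand-picked for illustration (they make the unit compute logical AND); real networks layer many such units and learn their weights from errors, typically with smooth activations rather than a hard step.

```python
def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, 'fire' if above zero.

    A step-function unit, the simplest ANN building block; the weights
    here are hand-picked assumptions, not learned.
    """
    activation = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if activation > 0 else 0

# with these weights and bias the unit fires only when both inputs are on
fires = [neuron(x, weights=[1.0, 1.0], bias=-1.5)
         for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
```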

      (pp. 196-215)

      Cluster analysis is designed to address a very common situation in research. You may think that cases in your data—cities, students, children, or labor unions—do not represent a simple random smattering of individual observations, but are better described as groups of observations. What we want to do is to separate our cases into categories or clusters of cases—to do what is in some sense the simplest, most natural kind of social modeling, the kind everyone does constantly on an ad hoc basis in regular social life. But we want to do it with more precision, theoretical sophistication,...
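      One widely used clustering algorithm, k-means, alternates between assigning each case to its nearest cluster center and moving each center to the mean of its cases. The sketch below works on one-dimensional data with k ≥ 2 for simplicity; it is a bare-bones illustration, and the chapter's methods generalize to many dimensions and other distance measures.

```python
def kmeans_1d(points, k=2, iterations=20):
    """Lloyd's k-means on one-dimensional data (assumes k >= 2).

    A bare-bones sketch: assign cases to nearest center, then move
    each center to its cluster mean, and repeat.
    """
    pts = sorted(points)
    # initialize centers at evenly spaced positions in the sorted data
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12])
# the two natural groups are recovered, with centers at their means
```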

      (pp. 216-226)

      First prominently used in the social sciences by Lazarsfeld and Henry (1968), latent class analysis (LCA) is another statistical technique in the broader family of latent variable models which includes principal component analysis, factor analysis, and clustering. It can be thought of as a model in which only one latent variable is estimated, and in which this latent variable has a categorical distribution. This assumption about the number and distribution of latent variables sets it apart from principal component analysis, which presumes that there are multiple latent variables and that these have a normal distribution. LCA is in some ways...

      (pp. 227-234)

      Association rule mining is one of the most widely used data mining techniques. In its classical form, as first developed by Agrawal, Imieliński, and Swami (1993), it was used to examine market-basket data in commercial settings. This practical application was designed to be of use to retailers who may be interested in patterns of purchasing engaged in by customers. Stores have a given set of items for sale at a point in time, and customers purchase sets of these items when they come to the store. A retailer may wish to know what else customers tend to buy when they...
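      Two standard measures sit behind such market-basket rules: support (the share of all baskets containing both item sets) and confidence (the share of baskets with the antecedent that also hold the consequent). The sketch below computes both for a hypothetical rule; the algorithms in this literature also search efficiently for high-support rules rather than scoring one rule at a time.

```python
def rule_stats(baskets, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    Baskets and item sets are Python sets; the grocery items below
    are hypothetical examples.
    """
    n = len(baskets)
    both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    support = both / n                        # share of all baskets with both
    confidence = both / ante if ante else 0.0 # share of antecedent baskets
    return support, confidence

baskets = [{"bread", "butter"}, {"bread", "butter", "milk"},
           {"bread"}, {"milk"}]
support, confidence = rule_stats(baskets, {"bread"}, {"butter"})
# bread appears in 3 of 4 baskets; bread and butter together in 2
```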

  6. CONCLUSION Where Next?
    (pp. 235-238)

    It is over half a century since computerization started spreading through society, and by now its impacts are evident in many aspects of our lives. When businesses began installing computers in the 1960s and 1970s, they had very limited and practical goals in mind: automating various kinds of transactional records, reducing the costs of preparing bills and invoices, helping with accounts and balance sheets. Few realized that an important by-product would be a flood of business data that enabled managers to access the details of sales or cash flow at that very moment, instead of waiting for accounts to be...

    (pp. 239-244)
  8. NOTES
    (pp. 245-246)
  9. INDEX
    (pp. 247-252)