ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks

Dan Gusfield
Charles H. Langley
Yun S. Song
Yufeng Wu
Copyright Date: 2014
Published by: MIT Press
Pages: 600
    Book Description:

    In this book, Dan Gusfield examines combinatorial algorithms to construct genealogical and exact phylogenetic networks, particularly ancestral recombination graphs (ARGs). The algorithms produce networks (or information about networks) that serve as hypotheses about the true genealogical history of observed biological sequences and can be applied to practical biological problems.

    Phylogenetic trees have been the traditional means to represent evolutionary history, but there is a growing realization that networks rather than trees are often needed, most notably for recent human history. This has led to the development of ARGs in population genetics and, more broadly, to phylogenetic networks. ReCombinatorics offers an in-depth, rigorous examination of current research on the combinatorial, graph-theoretic structure of ARGs and explicit phylogenetic networks, and algorithms to reconstruct or deduce information about those networks.

    ReCombinatorics, a groundbreaking contribution to the emerging field of phylogenetic networks, connects and unifies topics in population genetics and phylogenetics that have traditionally been discussed separately and considered to be unrelated. It covers the necessary combinatorial and algorithmic background material; the various biological phenomena; the mathematical, population genetic, and phylogenetic models that capture the essential elements of these phenomena; the combinatorial and algorithmic problems that derive from these models; the theoretical results that have been obtained; related software that has been developed; and some empirical testing of the software on simulated and real biological data.

    eISBN: 978-0-262-32447-2
    Subjects: Mathematics

Table of Contents

  1. Front Matter
    (pp. i-vi)
  2. Table of Contents
    (pp. vii-x)
  3. Preface
    (pp. xi-xviii)
  4. Acknowledgments
    (pp. xix-xx)
  5. 1 Introduction
    (pp. 1-34)

    Now that high-throughput genomic technologies are available for sequencing, resequencing, finding genomic variations of different types, finding conserved features, and screening for traits, the dream of comparing sequence variations at the population level is a reality. Moreover, population-scale sequence variations have numerous practical applications (as in association mapping) and can be used to address evolutionary and historical questions (such as the migration of populations), and to address basic questions concerning molecular genetic processes (as in the mechanics of mutation, recombination, and repair).

    Nature and history, through point mutation, insertion and deletion, recombination, gene-conversion, genome rearrangement, lateral gene transfer, creation of...

  6. 2 Trees First
    (pp. 35-60)

    Our main interest is in genealogical and phylogenetic networks, which by definition are not trees. However, many of the network models derive from tree models, and many of the tools that address networks rely critically on tools for trees. Further, if we are to understand when and why recombination is needed, we need to understand when and why recombination is not needed. Therefore we must first understand some models of treelike evolution and some combinatorial and algorithmic results about evolutionary trees. The main tree-based model of evolution that we use is called the (rooted, binary character) perfect-phylogeny model.

    Definition Let...

  7. 3 A Deeper Introduction to Recombination and Networks
    (pp. 61-96)

    In the last chapter we discussed necessary and sufficient conditions for binary sequences to be representable by a perfect phylogeny. When there is a perfect phylogeny (i.e., when all sites are pairwise compatible), it serves as a hypothesis for the actual evolutionary history of the sequences. However, a perfect phylogeny does not exist for most sets of binary sequences encountered in populations, because some pairs of sites are incompatible. The principal biological reason, in the context of populations (over a relatively short historical time period), is that meiotic recombination creates new mosaic sequences in each generation. Changes due to recombination...
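
    The pairwise-compatibility condition mentioned above can be illustrated with a short sketch. It assumes the standard four-gamete test for binary sites when the ancestral sequence is not specified: two sites are incompatible exactly when all four combinations 00, 01, 10, and 11 occur in some rows. The function names and the example matrices are hypothetical, chosen only for illustration, and are not taken from the book.

```python
from itertools import combinations


def sites_incompatible(M, i, j):
    """Four-gamete test: sites i and j are incompatible when all four
    value pairs 00, 01, 10, 11 appear among the rows of M."""
    gametes = {(row[i], row[j]) for row in M}
    return len(gametes) == 4


def all_pairwise_compatible(M):
    """Return True when no pair of sites (columns) fails the four-gamete test."""
    n_sites = len(M[0]) if M else 0
    return not any(
        sites_incompatible(M, i, j)
        for i, j in combinations(range(n_sites), 2)
    )


# Hypothetical inputs: each string is one binary sequence (row), each column a site.
M_tree = ["000", "100", "110", "001"]    # every pair of sites shows at most 3 gametes
M_conflict = ["00", "01", "10", "11"]    # the two sites show all 4 gametes

print(all_pairwise_compatible(M_tree))
print(all_pairwise_compatible(M_conflict))
```

    Under the rooted model with an all-zero ancestral sequence, the analogous check looks only for the three pairs 01, 10, and 11; the sketch above does not cover that variant.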

  8. 4 Exploiting Recombination
    (pp. 97-126)

    To further motivate the importance of understanding patterns of recombination in populations, we discuss three high-value practical problems whose solutions exploit properties of meiotic recombination. These three illustrations are highly simplified, with the intent of showing the role of recombination in the logic of the solutions, particularly for readers who may not have had any prior exposure to these problems or solutions.

    The first illustration, genetic mapping by linkage analysis, is the oldest one, devised more than one hundred years ago, well before any molecular understanding of genes or DNA. Building linkage maps, following the basic outline of the first...

  9. 5 First Bounds
    (pp. 127-176)

    In chapter 3 we introduced Rmin(M), the minimum number of recombination nodes used in any ARG to derive the set of sequences M, and we noted that the problem of computing Rmin(M) is known to be NP-hard. Hence no provably correct, worst-case polynomial-time algorithm is known for computing Rmin(M), and we do not expect there will be one. However, several worst-case polynomial-time algorithms have been developed that compute empirically good lower bounds on Rmin(M), and there are several other lower-bound methods whose worst-case time is not polynomially bounded, but which are fast in practice. Some of the lower bounds apply...

  10. 6 Fundamental Combinatorial Structure and Tools
    (pp. 177-196)

    This chapter is primarily technical, defining fundamental combinatorial objects (incompatibility graphs and conflict graphs) and developing powerful structural theorems about their nontrivial connected components, which are used as essential tools to derive deep results about recombination and ARGs. In a similar vein, we present in this chapter a surprising algorithmic result that will allow rapid computation of the most important feature of incompatibility and conflict graphs, that is, the number of nontrivial connected components, and how the nodes of the graphs partition into connected components.

    Recall the definitions in chapter 2 (page 58) of what it means for two sites...
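
    As a rough illustration of the objects named in this chapter, the sketch below builds an incompatibility graph (one node per site, one edge per incompatible pair of sites) and partitions its nodes into connected components. It assumes the four-gamete test as the incompatibility criterion and treats a component as nontrivial when it has more than one node. It is a naive baseline that tests every pair of sites, not the faster algorithm the chapter refers to, and all names and data are hypothetical.

```python
from itertools import combinations


def incompatibility_graph(M):
    """Adjacency sets: sites i and j are joined when they show all four gametes."""
    n_sites = len(M[0]) if M else 0
    adj = {s: set() for s in range(n_sites)}
    for i, j in combinations(range(n_sites), 2):
        if len({(row[i], row[j]) for row in M}) == 4:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def connected_components(adj):
    """Partition the nodes of the graph into connected components (iterative DFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components


def nontrivial_components(M):
    """Components with at least two sites, i.e. containing at least one edge."""
    return [c for c in connected_components(incompatibility_graph(M)) if len(c) > 1]


# Hypothetical input: sites 0 and 1 are incompatible; sites 2 and 3 conflict with nothing.
M = ["0000", "0100", "1000", "1100", "0010"]
print(connected_components(incompatibility_graph(M)))
print(nontrivial_components(M))
```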

  11. 7 First Uses of Fundamental Structure
    (pp. 197-234)

    In this chapter we discuss two uses of the fundamental combinatorial results developed in chapter 6. The first use establishes a graph-theoretic lower bound on Rmin(M), called the connected-component lower bound. As with other lower bounds, the connected-component lower bound is useful when used to obtain local lower bounds in the composite bound method, but it is also very useful as a mathematical tool, particularly in the discussion of galled trees in chapter 8. The second use of fundamental structure is to establish a key structural result about ARGs, called the full-decomposition theorem. That theorem will be central in...

  12. 8 Galled Trees
    (pp. 235-284)

    In the previous chapters, we examined algorithms to compute good lower bounds on Rmin(M), and we showed polynomial-time constructive methods (via theorem 3.2.1 and corollary 7.3.2) to build an ARG for any input M, but those methods were not guaranteed to build a MinARG and generally use many more recombination nodes than may be needed. We now begin the discussion of algorithms that construct ARGs with the goal of limiting the number of recombination nodes used. The number of recombination nodes used in a constructed ARG gives an upper bound on Rmin(M).

    Ideally, we would like to construct MinARGs or...

  13. 9 General ARG Construction Methods
    (pp. 285-360)

    In the previous chapter, we considered the problem of constructing a MinARG, but only for the special case that there is a galled tree for the input M. In this chapter we consider the problem of constructing good ARGs and MinARGs for arbitrary input.

    The problem of constructing a MinARG for a set of sequences M, or even of computing Rmin(M), is NP-hard. Thus, we do not have, nor do we expect to have, a worst-case polynomial-time algorithm for those problems. Instead, we have heuristic algorithms that empirically run fast (and sometimes can be made to run in worst-case polynomial...

  14. 10 The History and Forest Lower Bounds
    (pp. 361-380)

    In chapters 5, 7, 8, and 9, we examined several lower bounds on Rmin(M). Those bounds differed in the times needed for their computation and in their level of accuracy (how close they are to Rmin(M) in practice). We stated earlier that two additional lower bounds would be discussed after methods for general ARG construction were presented. Here we discuss the history lower bound on Rmin(M) developed by Myers and Griffiths [299, 302], and the related forest lower bound developed in [449]. These two bounds, usually called the history bound and the forest bound, were deferred until after the discussion...

  15. 11 Conditions to Guarantee a Fully Decomposed MinARG
    (pp. 381-388)

    As discussed in chapter 7, the task of constructing ARGs is simplified (both computationally and conceptually) if we restrict attention to fully decomposed ARGs. We showed in chapter 7 that there is always a fully decomposed ARG for any input M, but for some M there is no MinARG for M that is fully decomposed. So, although it is attractive to restrict attention to fully decomposed ARGs, if we do so we may sometimes fail to construct a MinARG. It is therefore desirable to identify conditions that guarantee the existence of a fully decomposed MinARG. We have already developed one...

  16. 12 Tree and ARG-Based Haplotyping
    (pp. 389-432)

    Recall that a haplotype is a sequence obtained from individuals in a diploid population, and that for each individual, a haplotype is obtained from only one of the two homologs of some chromosome. Recall also that a genotype is a mixture of the data from the two haplotypes, and that generally, a genotype does not unambiguously determine the two originating haplotypes. The concept of a haplotype was extensively discussed in section 4.1 (page 98); the reader should review that section. The concept of a genotype was also introduced in section 4.1, but will be more deeply discussed in this chapter....
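
    The ambiguity described above can be made concrete with a small sketch. It assumes a common 0/1/2 encoding for genotypes of binary sites, in which 0 and 1 mark homozygous sites and 2 marks a heterozygous site; that encoding and the function name are assumptions made for illustration, not details confirmed by this description. Under that assumption, a genotype with k >= 1 heterozygous sites is explained by 2^(k-1) distinct unordered pairs of haplotypes.

```python
from itertools import product


def explaining_pairs(genotype):
    """Enumerate the unordered haplotype pairs that could produce a genotype.

    genotype: string over '0', '1', '2', where '2' marks a heterozygous
    site (assumed encoding). At a heterozygous site the two haplotypes
    carry opposite values; elsewhere both carry the genotype's value.
    """
    het_sites = [k for k, g in enumerate(genotype) if g == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for k, b in zip(het_sites, bits):
            h1[k] = b
            h2[k] = "1" if b == "0" else "0"
        # Sort within the pair so (h1, h2) and (h2, h1) count once.
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


# Hypothetical genotype with two heterozygous sites (positions 1 and 3).
for pair in explaining_pairs("0212"):
    print(pair)
```

    With two heterozygous sites the example prints two candidate pairs, which is why additional assumptions (such as a tree or ARG model for the haplotypes) are needed to choose among them.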

  17. 13 Tree and ARG-Based Association Mapping
    (pp. 433-468)

    Association mapping is a widely used, population-based approach to try to efficiently locate genes and mutations influencing genetic traits of interest (diseases or important commercial traits). Already there have been thousands of association studies, and it is expected that the utility of association studies will increase as the cost of genomic sequencing continues to decline, allowing much larger sample sizes. The central roles of recombination and of ARGs in the logic of association mapping were discussed in section 4.4, illustrated by the simplest case of locating a mutation causing a simple-Mendelian trait.

    Association mapping was first developed for applications where...

  18. 14 Extensions and Connections
    (pp. 469-522)

    In this chapter, we discuss several topics that extend and connect to material in earlier chapters. In contrast to those earlier chapters, the material presented here is more introductory, and the citations are meant to be representative, orienting the reader to the field.

    We start by returning to perfect phylogeny, introducing the perfect-phylogeny problem when more than two states are allowed for a character. Then we discuss a recombination model, the mosaic model, that is appropriate for shorter time frames than was assumed so far. Next, at the other time extreme, we discuss phylogenetic network problems that arise in studying...

  19. Appendix A A Short Introduction to Integer Linear Programming
    (pp. 523-532)
  20. Bibliography
    (pp. 533-564)
  21. Index
    (pp. 565-580)