Mining the Biomedical Literature

Hagit Shatkay
Mark Craven
Copyright Date: 2012
Published by: MIT Press
Pages: 150
https://www.jstor.org/stable/j.ctt5vjr0m
    Book Description:

    The introduction of high-throughput methods has transformed biology into a data-rich science. Knowledge about biological entities and processes has traditionally been acquired by thousands of scientists through decades of experimentation and analysis. The current abundance of biomedical data is accompanied by the creation and quick dissemination of new information. Much of this information and knowledge, however, is represented only in text form--in the biomedical literature, lab notebooks, Web pages, and other sources. Researchers' need to find relevant information in the vast amounts of text has created a surge of interest in automated text-analysis.

    In this book, Hagit Shatkay and Mark Craven offer a concise and accessible introduction to key ideas in biomedical text mining. The chapters cover such topics as the relevant sources of biomedical text; text-analysis methods in natural language processing; the tasks of information extraction, information retrieval, and text categorization; and methods for empirically assessing text-mining systems. Finally, the authors describe several applications that recognize entities in text and link them to other entities and data resources, support the curation of structured databases, and make use of text to enable further prediction and discovery.

    eISBN: 978-0-262-30516-7
    Subjects: Biological Sciences, Technology

Table of Contents

  1. Front Matter
    (pp. i-vi)
  2. Table of Contents
    (pp. vii-viii)
  3. Acknowledgments
    (pp. ix-xii)
  4. 1 Introduction
    (pp. 1-8)

    The current millennium started with the sequencing of the human genome. There are now thousands of sequenced genomes available, covering a wide range of organisms and a broad collection of individuals within the human population. Additionally, there is a multitude of datasets characterizing dynamic aspects of cells such as molecular abundances, interactions, and localizations. The hope is that in knowing and analyzing the sequences of such genomes and associated data, scientists are opening the “book of life” and will be able to understand the intricate processes governing life, death, and disease at the most basic molecular level.

    However, the enterprise...

  5. 2 Fundamental Concepts in Biomedical Text Analysis
    (pp. 9-32)

    The development of the Internet has made it easy for biologists to create databases and online portals representing various aspects of biological knowledge and to make these resources publicly available. Although there are hundreds of such online resources representing biological knowledge in a structured format, much of the scientific community’s knowledge is represented only as unstructured text.

    A structured format is one in which information is organized and represented in a formal and predefined manner. For example, a relational database consists of multiple tables corresponding to predefined relations. Each table is defined by a fixed set of fields, each of...

  6. 3 Information Retrieval
    (pp. 33-52)

    In its most basic form, information retrieval is the task of finding a set of relevant documents in a large text collection. Naturally, the relevance of a document depends on our particular information need at a given moment. Most of us perform information retrieval on a daily basis, using search engines such as Google or, for searches specific to the biomedical domain, PubMed. The typical retrieval task performed using such search engines is known as ad hoc retrieval. Under this retrieval scenario, a user specifies a query, which is most often a Boolean combination of terms or words, and hopefully...

  7. 4 Information Extraction
    (pp. 53-76)

    Chapter 3 focused on tasks that involve identifying which documents or passages in a large corpus are relevant to a given query. In some situations, however, one may want automated systems to perform a more fine-grained, in-depth analysis of the text. One type of analysis that may be useful is the identification of entities of interest, and of relations among entities. This is commonly referred to as the information-extraction (IE) task [32, 46].

    Figure 4.1 provides an illustration of the information-extraction task. This example assumes that we are interested in recognizing entities of the types protein and location (where the...

  8. 5 Evaluation
    (pp. 77-98)

    From the discussion in previous chapters, it is clear that automated text mining and effective information retrieval can help realize a wide range of biological and medical goals. These goals vary in scope and domain; some examples of these goals, ordered in ascending level of difficulty, may include the following:

    • Supporting curation of gene and protein information in organism-specific databases through focused, accurate retrieval;
    • Providing easy access to information about bio-entities within displayed text by highlighting and hyperlinking such entities;
    • Automatically reconstructing models of molecular networks from the published literature (which is an ambitious and not always well-defined task).

    Generalizing...

  9. 6 Putting It All Together: Current Applications and Future Directions
    (pp. 99-114)

    Throughout the previous chapters we have covered a variety of text-mining methods applicable to the broad range of tasks that are involved in obtaining information from text. In the beginning of chapter 1 we listed several goals within the biomedical domain that can be realized through the use of text. In this chapter we provide examples of systems and tools that have been developed to support such specific biomedical goals, and discuss in more detail the text-based methods that they employ.

    The identification of bioentities such as genes, proteins, small molecules, drugs, and diseases, as described in chapter 4, can...

  10. References
    (pp. 115-130)
  11. Index
    (pp. 131-138)