Systems Biology: Volume I: Genomics (Series in Systems Biology)



Pages 337 Page size 252 x 379.08 pts Year 2006


Series in Systems Biology
Edited by Dennis Shasha, New York University

EDITORIAL BOARD
Michael Ashburner, University of Cambridge
Amos Bairoch, Swiss Institute of Bioinformatics
Charles Cantor, Sequenom, Inc.
Leroy Hood, Institute for Systems Biology
Minoru Kanehisa, Kyoto University
Raju Kucherlapati, Harvard Medical School

Systems Biology describes the discipline that seeks to understand biological phenomena on a large scale: the association of gene with function, the detailed modeling of the interaction among proteins and metabolites, and the function of cells. Systems Biology has wide-ranging application, as it is informed by several underlying disciplines, including biology, computer science, mathematics, physics, chemistry, and the social sciences. The goal of the series is to help practitioners and researchers understand the ideas and technologies underlying Systems Biology. The series volumes will combine biological insight with principles and methods of computational data analysis.

Cellular Computing, edited by Martyn Amos
Systems Biology, Volume I: Genomics, edited by Isidore Rigoutsos and Gregory Stephanopoulos
Systems Biology, Volume II: Networks, Models, and Applications, edited by Isidore Rigoutsos and Gregory Stephanopoulos

Systems Biology Volume I: Genomics

Edited by

Isidore Rigoutsos & Gregory Stephanopoulos

2007

Oxford University Press, Inc., publishes works that further Oxford University’s objective of excellence in research, scholarship, and education.

Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto

With offices in
Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam

Copyright © 2007 by Oxford University Press, Inc. Published by Oxford University Press, Inc. 198 Madison Avenue, New York, New York 10016 Oxford is a registered trademark of Oxford University Press. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of Oxford University Press. Library of Congress Cataloging-in-Publication Data Systems biology/edited by Isidore Rigoutsos and Gregory Stephanopoulos. v. ; cm.—(Series in systems biology) Includes bibliographical references and indexes. Contents: 1. Genomics—2. Networks, models, and applications. ISBN-13: 978-0-19-530081-9 (v. 1) ISBN 0-19-530081-5 (v. 1) ISBN-13: 978-0-19-530080-2 (v. 2) ISBN 0-19-530080-7 (v. 2) 1. Computational biology. 2. Genomics. 3. Bioinformatics. I. Rigoutsos, Isidore. II. Stephanopoulos, G. III. Series. [DNLM: 1. Genomics. 2. Computational Biology. 3. Models, Genetic. 4. Systems Biology. QU58.5 S995 2006] QH324.2.S97 2006 570—dc22 2005031826

9 8 7 6 5 4 3 2 1 Printed in the United States of America on acid-free paper

To our mothers



First and foremost, we wish to thank all the authors who contributed the chapters of these two books. In addition to the professionalism with which they handled all aspects of production, they also applied the highest standards in authoring pieces of work of the highest quality. For their willingness to share their unique expertise on the many facets of systems biology and the energy they devoted to the preparation of their chapters, we are profoundly grateful. Next, we wish to thank Dennis Shasha, the series editor, and Peter Prescott, Senior Editor for Life Sciences, Oxford University Press, for embracing the project from the very first day that we presented the idea to them. Peter deserves special mention for it was his continuous efforts that helped remove a great number of obstacles along the way. We also wish to thank Adrian Fay, who coordinated several aspects of the review process and provided input that improved the flow of several chapters, as well as our many reviewers, Alice McHardy, Aristotelis Tsirigos, Christos Ouzounis, Costas Maranas, Daniel Beard, Daniel Platt, Jeremy Rice, Joel Moxley, Kevin Miranda, Lily Tong, Masaru Nonaka, Michael MacCoss, Michael Pitman, Nikos Kyrpides, Rich Jorgensen, Roderic Guigo, Rosaria De Santis, Ruhong Zhou, Serafim Batzoglou, Steven Gygi, Takis Benos, Tetsuo Shibuya, and Yannis Kaznessis, for providing helpful and detailed feedback on the early versions of the chapters; without their help the books would not have been possible. We are also indebted to Kaity Cheng for helping with all of the administrative aspects of this project. And, finally, our thanks go to our spouses whose understanding and patience throughout the duration of the project cannot be overstated.





Systems Biology: A Perspective

1 Prebiotic Chemistry on the Primitive Earth
Stanley L. Miller & H. James Cleaves

2 Prebiotic Evolution and the Origin of Life: Is a System-Level Understanding Feasible?
Antonio Lazcano

3 Shotgun Fragment Assembly
Granger Sutton & Ian Dew

4 Gene Finding
John Besemer & Mark Borodovsky

5 Local Sequence Similarities
Temple F. Smith

6 Complete Prokaryotic Genomes: Reading and Comprehension
Michael Y. Galperin & Eugene V. Koonin

7 Protein Structure Prediction
Jeffrey Skolnick & Yang Zhang

8 DNA–Protein Interactions
Gary D. Stormo

9 Some Computational Problems Associated with Horizontal Gene Transfer
Michael Syvanen

10 Noncoding RNA and RNA Regulatory Networks in the Systems Biology of Animals
John S. Mattick

Index





Department of Biology, Georgia Institute of Technology, Atlanta, Georgia

Faculty of Science, Universidad Nacional Autónoma de México, Mexico City, Mexico

Department of Biology, Georgia Institute of Technology, Atlanta, Georgia

Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia

The Scripps Institution of Oceanography, University of California, San Diego, La Jolla, California

Scripps Institution of Oceanography, University of California, San Diego, La Jolla, California

IAN DEW
Steck Consulting, LLC, Washington, DC

National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland

New York State Center of Excellence in Bioinformatics and Life Sciences, University at Buffalo, The State University of New York, Buffalo, New York

National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland

BioMolecular Engineering Resource Center, Boston University, Boston, Massachusetts

Department of Genetics, Washington University in St. Louis, St. Louis, Missouri

Department of Medical Microbiology and Immunology, University of California Davis School of Medicine, Sacramento, California

GRANGER SUTTON
J. Craig Venter Institute, Rockville, Maryland

YANG ZHANG
Center for Bioinformatics, University of Kansas, Lawrence, Kansas

Systems Biology: A Perspective

As recently as a decade ago, the core paradigm of biological research followed an established path: beginning with the generation of a specific hypothesis, a concise experiment would be designed that typically focused on studying a small number of genes. Such experiments generally measured a few macromolecules, and, perhaps, small metabolites of the target system. The advent of genome sequencing and associated technologies greatly improved scientists’ ability to measure important classes of biological molecules and their interactions. This, in turn, expanded our view of cells with a bevy of previously unavailable data and made possible genome-wide and cell-wide analyses. These newly found lenses revealed that hundreds (sometimes thousands) of molecules and interactions, which were outside the focus of the original study, varied significantly in the course of the experiment.

The term systems biology was coined to describe the field of scientific inquiry which takes a global approach to the understanding of cells and the elucidation of biological processes and mechanisms. In many respects, this is also what physiology (from the Greek physis = nature and logos = word-knowledge) focused on for most of the twentieth century. Indeed, physiology’s goal has been the study of function and characteristics of living organisms and their parts and of the underlying physicochemical phenomena. Unlike physiology, systems biology attempts to interpret and contextualize the large and diverse sets of biological measurements that have become visible through our genomic-scale window on cellular processes by taking a holistic approach and bringing to bear theoretical, computational, and experimental advances in several fields. Indeed, there is considerable excitement that, through this integrative perspective, systems biology will succeed in elucidating the mechanisms that underlie complex phenomena and which would have otherwise remained undiscovered.
For the purposes of our discussion, we will be making use of the following definition: “Systems biology is an integrated approach that brings together and leverages theoretical, experimental, and computational approaches in order to establish connections among important molecules or groups of molecules in order to aid the eventual mechanistic explanation of cellular processes and systems.” More specifically, we view systems biology as a field that aims to uncover concrete molecular relationships for targeted analysis through the interpretation
of cellular phenotype in terms of integrated biomolecular networks. The fidelity and breadth of our network and state characterization are intimately related to the degree of our understanding of the system under study. As the readers will find, this view permeates the treatises that are found in these two books. Cells have always been viewed as elegant systems of immense complexity that are, nevertheless, well coordinated and optimized for a particular purpose. This apparent complexity led scientists to take a reductionist approach to research which, in turn, contributed to a rigorous understanding of low-level processes in a piecemeal fashion. Nowadays, completed genomic sequences and systems-level probing hold the potential to accelerate the discovery of unknown molecular mechanisms and to organize the existing knowledge in a broader context of high-level cellular understanding. Arguably, this is a formidable task. In order to improve the chances of success, we believe that one must anchor systems biology analyses to specific questions and build upon the existing core infrastructure that the earlier, targeted research studies have allowed us to generate. The diversity of molecules and reactions participating in the various cellular functions can be viewed as an impediment to the pursuit of a more complete understanding of cellular function. However, it actually represents a great opportunity as it provides countless possibilities for modifying the cellular machinery and commandeering it toward a specific goal. In this context, we distinguish two broad categories of questions that can guide the direction of systems biology research. The first category encompasses topics of medical importance and is typically characterized by forward-engineering approaches that focus on preventing or combating disease. 
The second category includes problems of industrial interest, such as the genetic engineering of microbes so as to maximize product formation, the creation of robust-production strains, and so on. The applications of the second category comprise an important reverse-engineering component whereby microbes with attractive properties are scrutinized for the purpose of transferring any insights learned from their functions to the further improvement and optimization of production strains.

PRIOR WORK

As already mentioned, and although the term systems biology did not enter the popular lexicon until recently, some of the activities it encompasses have been practiced for several decades. As we cannot possibly be exhaustive, we present a few illustrative examples of approaches that have been developed in recent years and successfully applied to relatively small systems. These examples can serve as useful guides in our attempt to tackle increasingly larger challenges.

Metabolic Control Analysis (MCA)

Metabolic pathways and, in general, networks of reactions are characterized by substantial stoichiometric and (mostly) kinetic complexity in their own right. The commonly applied assumption of a single rate-limiting step leads to great simplification of the reaction network and often yields analytical expressions for the conversion rates. However, this assumption is not justified for most biological systems where kinetic control is not concentrated in a single step but rather is distributed among several enzymatic steps. Consequently, kinetics and flux control of a bioreaction network represent properties of the entire system and can be determined from the characteristics of individual reactions in a bottom-up approach or from the response of the overall system in a top-down approach. The concepts of MCA and distribution of kinetic control in a reaction pathway have had a profound impact on the identification of target enzymes whose genetic modification permitted the amplification of the product flux through a pathway.

Signaling Pathways

Signal transduction is the process by which cells communicate with each other and their environment and involves a multitude of proteins that can be in active or inactive states. In their active (phosphorylated) state they act as catalysts for the activation of subsequent steps in the signaling cascade. The end result is the activation of a transcription factor which, in turn, initiates a gene transcription event. Until recently, and even though several of the known proteins participate in more than one signaling cascade, such systems were being studied in isolation from one another. A natural outcome of this approach was of course the ability to link a single gene with a single ligand in a causal relationship whereby the ligand activates the gene. However, such findings are not representative in light of the fact that signaling pathways branch and interact with one another, creating a rather intricate and complex signaling network. Consequently, more tools, computational as well as experimental, are required if we are to improve our understanding of signal transduction. Developing such tools is among the goals of the recently formed Alliance for Cellular Signaling, an NIH-funded project involving several laboratories and research centers.

Reconstruction of Flux Maps

Metabolic pathway fluxes are defined as the actual rates of metabolite interconversion in a metabolic network and represent the most informative measures of the actual physiological state of cells and organisms. Their dependence on enzymatic activities and metabolite concentrations makes them an accurate representation of carbon and energy flows through the various pathway branches. Additionally, they are very
important in identifying critical reaction steps that impact flux control for the entire pathway. Thus, flux determination is an essential component of strain evaluation and metabolic engineering. Intracellular flux determination requires the enumeration and satisfaction of all intracellular metabolite balances along with the use of sufficient measurements typically derived from the introduction of isotopic tracers and metabolite and mass isotopomer measurement by gas chromatography–mass spectrometry. It is essentially a problem of constrained parameter estimation in overdetermined systems with overdetermination providing the requisite redundancy for reliable flux estimation. These approaches are basically methods of network reconstruction whereas the obtained fluxes represent properties of the entire system. As such, the fluxes accurately reflect changes introduced through genetic or environmental modifications and, thus, can be used to assess the impact of such modifications on cell physiology and product formation, and to guide the next round of cell modifications.

Metabolic Engineering

Metabolic engineering is the field of study whose goal is the improvement of microbial strains with the help of modern genetic tools. The strains are modified by introducing specific transport, conversion, or deregulation changes that lead to flux redistribution and the improvement of product yield. Such modifications rely to a significant extent on modern methods from molecular biology. Consequently, the following central question arises: “What is the real difference between genetic engineering and metabolic engineering?” We submit that the main difference is that metabolic engineering is concerned with the entire metabolic system whereas genetic engineering specifically focuses on a particular gene or a small collection of genes. It should be noted that over- or underexpression of a single gene or a few genes may have little or no impact on the attempt to alter cell physiology. On the other hand, by examining the properties of the metabolic network as a whole, metabolic engineering attempts to identify targets for amplification as well as rationally assess the effect that such changes will incur on the properties of the overall network. As such, metabolic engineering can be viewed as a precursor to functional genomics and systems biology in the sense that it represents the first organized effort to reconstruct and modify pathways using genomic tools while being guided by the information of postgenomic developments.

WORDS OF CAUTION

In light of the many exciting possibilities, there are high expectations for the field of systems biology. However, as we move forward, we should not lose sight of the fact that the field is trying to tackle a
problem of considerable magnitude. Consequently, any expectations of immediate returns on the scientific investment should be appropriately tempered. As we set out to forecast future developments in this field, it is important to keep in mind several points.

Despite the wealth of available genomic data, there are still many regions in the genomes of interest that are functional and which have not been identified as such. In order to practice systems biology, lists of “parts” and “relationships” that are as complete as possible are needed. In the absence of such complete lists, one can hope to derive, at best, an approximate description of the actual system’s behavior. A prevalent misconception among scientists states that nearly complete lists of parts are already in place. Unfortunately, this is not the case: the currently available parts lists are incomplete, as evidenced by the fact that genomic maps are continuously updated through the addition or removal of (occasionally substantial numbers of) genes, by the discovery of more regions that code for RNA genes, and so on.

Despite the wealth of available genomic data, knowledge about existing optimal solutions to important problems continues to elude us. The current efforts in systems biology are largely shaped by the available knowledge. Consequently, optimal solutions that are implemented by metabolic pathways that are unknown or not yet understood are beyond our reach. A characteristic case in point is the recent discovery, in sludge microbial communities, of a Rhodocyclus-like polyphosphate-accumulating organism that exhibits enhanced biological phosphorus removal abilities. Clearly, this microbe is a great candidate to be part of a biological treatment solution to the problem of phosphorus removal from wastewater. Alas, this is not yet an option as virtually nothing is known about the metabolic pathways that confer phosphorus removal ability to this organism.

Despite the wealth of available genomic data, there are still many important molecular interactions of whose existence we are unaware. Continuing on our parts and relationships comment from above, it is worth noting another prevalent misconception among scientists: it states that nearly complete lists of relationships are already in place. For many years, pathway analysis and modeling have been characterized by protein-centric views that comprised concrete collections of proteins participating in well-understood interactions. Even for well-studied pathways, new important protein interactions are continuously discovered. Moreover, accumulating experimental evidence shows that numerous important interactions are in fact effected by the action of RNA molecules on DNA molecules and by extension on proteins. Arguably, posttranscriptional gene silencing and RNA interference represent one area of research activity with the potential to substantially revise our current
understanding of cellular processes. In fact, the already accumulated knowledge suggests that the traditional protein-centric views of the systems of interest are likely incomplete and need to be augmented appropriately. This in turn has direct consequences on the modeling and simulation efforts and on our understanding of the cell from an integrated perspective.

Constructing biomolecular networks for new systems will require significant resources and expertise. Biomolecular networks incorporate a multitude of relationships that involve numerous components. For example, constructing gene interaction maps requires large experimental investments and computational analysis. As for global protein–protein interaction maps, these exist for only a handful of model species. But even reconstructing well-studied and well-documented networks such as metabolic pathways in a genomic context can prove a daunting task. The magnitude of such activities has already grown beyond the capabilities of a single investigator or a single laboratory.

Even when one works with a biomolecular network database, the system picture may be incomplete or only partially accurate. In the postgenomic era, the effort to uncover the structure and function of genetic regulatory networks has led to the creation of many databases of biological knowledge. Each of these databases attempts to distill the most salient features from incomplete, and at times flawed, knowledge. As an example, several databases exist that enumerate protein interactions for the yeast genome and have been compiled using the yeast two-hybrid screen. These databases currently document in excess of 80,000 putative protein–protein interactions; however, the knowledge content of these databases has only a small overlap, suggesting a strong dependence of the results on the procedures used and great variability in the criteria that were applied before an interaction could be entered in the corresponding knowledge repository.
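The small overlap among interaction databases noted above can be made concrete with a simple set comparison. The following is only a minimal sketch under stated assumptions: the database names (screen_A, screen_B, screen_C) and the protein identifiers (P1 through P9) are hypothetical toy values, not data from any real repository, and the Jaccard index is one of several reasonable overlap measures rather than the one used by any particular study.

```python
from itertools import combinations


def normalize(pairs):
    """Store each interaction as an unordered pair, so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def jaccard(a, b):
    """Shared interactions divided by all interactions found in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(dbs):
    """Jaccard overlap for every pair of databases, keyed by sorted name pair."""
    return {
        (x, y): jaccard(dbs[x], dbs[y])
        for x, y in combinations(sorted(dbs), 2)
    }


# Hypothetical toy data: three interaction lists from three imagined screens.
databases = {
    "screen_A": normalize([("P1", "P2"), ("P2", "P3"), ("P4", "P5")]),
    "screen_B": normalize([("P2", "P1"), ("P3", "P6"), ("P5", "P7")]),
    "screen_C": normalize([("P1", "P2"), ("P6", "P3"), ("P8", "P9")]),
}

overlaps = pairwise_overlap(databases)
for (x, y), score in overlaps.items():
    print(f"{x} vs {y}: {score:.2f}")
```

Treating interactions as unordered pairs matters because repositories may list the same interaction in either orientation. A low score for every pair of real databases would be consistent with the strong dependence on procedure and curation criteria described in the text.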
As one might have expected, the situation is less encouraging for those organisms with lower levels of direct interaction experimentation and scrutiny (e.g., Escherichia coli) or which possess larger protein interaction spaces (e.g., mouse and human); in such cases, the available databases capture only a minuscule fraction of the knowledge spectrum.

Carrying out the necessary measurements requires significant resources and expertise. Presently, the only broadly available tool for measuring gene expression is the DNA chip (in its various incarnations). Conducting a large-scale transcriptional experiment will incur unavoidable significant costs and require that the involved scientists be trained appropriately. Going a step further, in order to measure protein levels, protein states, regulatory elements, and metabolites, one needs access to complex and specialized equipment.

Practicing systems biology will
necessitate the creation of partnerships and the collaboration of faculty members across disciplines. Biologists, engineers, chemists, physicists, mathematicians, and computer scientists will need to learn to speak one another’s language and to work together.

It is unlikely that a single/complex microarray experiment will shed light on the interactions that a practitioner seeks to understand. Even leaving aside the large amounts of available data and the presence of noise, many of the relevant interactions will simply not incur any large or direct transcriptional changes. And, of course, one should remain mindful of the fact that transcript levels do not necessarily correlate with protein levels, and that protein levels do not correlate well with activity level. The situation is accentuated further if one considers that both transcript and protein levels are under the control of agents such as microRNAs that were discovered only recently; the action of such agents may also vary temporally, contributing to variations across repetitions of the same experiment.

Patience, patience, and patience: the hypotheses that are derived from systems-based approaches are more complex than before and disproportionately harder to validate. For a small system, it is possible to design experiments that will test a particular hypothesis. However, it is not obvious how this can be done when the system under consideration encompasses numerous molecular players. Typically, the experiments that have been designed to date strove to keep most parameters constant while studying the effect of a small number of changes introduced to the system in a controlled manner. This conventional approach will need to be reevaluated since now the number of involved parameters is dramatically higher and the demands on system controls may exceed the limits of present experimentation.

ABOUT THIS BOOK

From the above, it should be clear that the systems biology field comprises multifaceted research work across several disciplines. It is also hierarchical in nature with single molecules at one end of the hierarchy and complete, interacting organisms at the other. At each level of the hierarchy, one can distinguish “parts” or active agents with concrete static characteristics and dynamic behavior. The active agents form “relationships” by interacting among themselves within each level, but can also be involved in inter-level interactions (e.g., a transcription factor, which is an agent associated with the proteomic level, interacts at specific sites with the DNA molecule, an agent associated with the genomic level of the hierarchy). Clearly, intimate knowledge and understanding of the specifics at each level will greatly facilitate the undertaking of systems
biology activities. Experts are needed at all levels of the hierarchy who will continue to generate results with an eye toward the longer-term goal of the eventual mechanistic explanation of cellular processes and systems. The two books that we have edited try to reflect the hierarchical nature of the problem as well as this need for experts. Each chapter is contributed by authors who have been active in the respective domains for many years and who have gone to great lengths to ensure that their presentations serve the following two purposes: first, they provide a very extensive overview of the domain’s main activities by describing their own and their colleagues’ research efforts; and second, they enumerate currently open questions that interested scientists should consider tackling. The chapters are organized into a “Genomics” and a “Networks, Models, and Applications” volume, and are presented in an order that corresponds roughly to a “bottom-up” traversal of the systems biology hierarchy. The “Genomics” volume begins with a chapter on prebiotic chemistry on the primitive Earth. Written by Stanley Miller and James Cleaves, it explores and discusses several geochemically reasonable mechanisms that may have led to chemical self-organization and the origin of life. The second chapter is contributed by Antonio Lazcano and examines possible events that may have led to the appearance of encapsulated replicative systems, the evolution of the genetic code, and protein synthesis. In the third chapter, Granger Sutton and Ian Dew present and discuss algorithmic techniques for the problem of fragment assembly which, combined with the shotgun approach to DNA sequencing, allowed for significant advances in the field of genomics. John Besemer and Mark Borodovsky review, in chapter 4, all of the major approaches in the development of gene-finding algorithms. 
In the fifth chapter, Temple Smith, through a personal account, covers approximately twenty years of work in biological sequence alignment algorithms that culminated in the development of the Smith–Waterman algorithm. In chapter 6, Michael Galperin and Eugene Koonin discuss the state of the art in the field of functional annotation of complete genomes and review the challenges that proteins of unknown function pose for systems biology. The state of the art of protein structure prediction is discussed by Jeffrey Skolnick and Yang Zhang in chapter 7, with an emphasis on knowledge-based comparative modeling and threading approaches. In chapter 8, Gary Stormo presents and discusses experimental and computational approaches that allow the determination of the specificity of a transcription factor and the discovery of regulatory sites in DNA. Michael Syvanen presents and discusses the phenomenon of horizontal gene transfer in chapter 9 and also presents computational questions that relate to the phenomenon. The first volume concludes with a chapter
by John Mattick on non-protein-coding RNA and its involvement in regulatory networks that are responsible for the various developmental stages of multicellular organisms. The “Networks, Models, and Applications” volume continues our ascent of the systems biology hierarchy. The first chapter, which is written by Cristian Ruse and John Yates III, introduces mass spectrometry and discusses its numerous uses as an analytical tool for the analysis of biological molecules. In chapter 2, Chris Floudas and Ho Ki Fung review mathematical modeling and optimization methods for the de novo design of peptides and proteins. Chapter 3, written by William Swope, Jed Pitera, and Robert Germain, describes molecular modeling and simulation techniques and their use in modeling and studying biological systems. In chapter 4, Glen Held, Gustavo Stolovitzky, and Yuhai Tu discuss methods that can be used to estimate the statistical significance of changes in the expression levels that are measured with the help of global expression assays. The state of the art in high-throughput technologies for interrogating cellular signaling networks is discussed in chapter 5 by Jason Papin, Erwin Gianchandani, and Shankar Subramaniam, who also examine schemes by which one can generate genotype–phenotype relationships given the available data. In chapter 6, Dimitrios Mastellos and John Lambris use the complement system as a platform to describe systems approaches that can help elucidate gene regulatory networks and innate immune pathway associations, and eventually develop effective therapeutics. Chapter 7, written by Sang Yup Lee, Dong-Yup Lee, Tae Yong Kim, Byung Hun Kim, and Sang Jun Lee, discusses how computational and “-omics” approaches can be combined in order to appropriately engineer “improved” versions of microbes for industrial applications. 
In chapter 8, Markus Herrgård and Bernhard Palsson discuss the design of metabolic and regulatory network models for complete genomes and their use in exploring the operational principles of biochemical networks. Raimond Winslow, Joseph Greenstein, and Patrick Helm review and discuss the current state of the art in the integrative modeling of the cardiovascular system in chapter 9. The volume concludes with a chapter on embryonic stem cells and their uses in testing and validating systems biology approaches, written by Andrew Thomson, Paul Robson, Huck Hui Ng, Hasan Otu, and Bing Lim.

The companion website for Systems Biology Volumes I and II provides color versions of several figures reproduced in black and white in print. Please refer to the website to view these figures in color: Volume I: Figures 7.5 and 7.6; Volume II: Figures 3.10, 5.1, 7.4, and 9.8.


1 Prebiotic Chemistry on the Primitive Earth Stanley L. Miller & H. James Cleaves

The origin of life remains one of humankind’s last great unanswered questions, as well as one of the most experimentally challenging research areas. It also raises fundamental cultural issues that fuel at times divisive debate. Modern scientific thinking on the topic traces its history across millennia of controversy, although current models are perhaps no older than 150 years. Much has been written about pre-nineteenth-century thought on the origin of life. Early views were wide-ranging and often surprisingly prescient; however, since this chapter deals primarily with modern thinking and experimentation regarding the synthesis of organic compounds on the primitive Earth, the interested reader is referred to several excellent resources [1–3]. Despite recent progress in the field, a single definitive description of the events leading up to the origin of life on Earth some 3.5 billion years ago remains elusive. The vast majority of theories regarding the origin of life on Earth speculate that life began with some mixture of organic compounds that somehow became organized into a self-replicating chemical entity. Although the idea of panspermia (which postulates that life was transported preformed from space to the early sterile Earth) cannot be completely dismissed, it seems largely unsupported by the available evidence, and in any event would simply push the problem to some other location. Panspermia notwithstanding, any discussion of the origin of life is of necessity a discussion of organic chemistry. Not surprisingly, ideas regarding the origin of life have developed to a large degree concurrently with discoveries in organic chemistry and biochemistry. This chapter will attempt to summarize key historical and recent findings regarding the origin of organic building blocks thought to be important for the origin of life on Earth.
In addition to the background readings regarding historical perspectives suggested above, the interested reader is referred to several additional excellent texts which remain scientifically relevant [4,5; see also the journal Origins of Life and the Evolution of the Biosphere].




There are two fundamental complementary approaches to the study of the origin of life. One, the top-down approach, considers the origin of the constituents of modern biochemistry and their present organization. The other, the bottom-up approach, considers the compounds thought to be plausibly produced under primitive planetary conditions and how they may have come to be assembled. The crux of the study of the origin of life is the overlap between these two regimes. The top-down approach is biased by the general uniformity of modern biochemistry across the three major extant domains of life (Archaea, Bacteria, and Eukarya). These clearly originated from a common ancestor, as shown by the universality of the genetic code they use to form proteins and the homogeneity of their metabolic processes. Investigations have assumed that whatever the first living being was, it must have been composed of biochemicals similar to those one would recover from a modern organism (lipids, nucleic acids, proteins, cofactors, etc.) that were somehow organized into a self-propagating assemblage. This bias seems to be legitimized by the presence of biochemical compounds in extraterrestrial material and the relative success of laboratory syntheses of these compounds under simulated prebiotic conditions. It would be a simpler explanatory model if the components of modern biochemistry and the components of the first living things were essentially similar, although this need not necessarily be the case. The bottom-up approach is similarly biased by present-day biochemistry; however, some more exotic chemical schemes are possible within this framework. All living things are composed of but a few atomic elements (CHNOPS in addition to other trace components), which do not necessarily reflect their cosmic or terrestrial abundances. This raises the question of why these elements were selected for life.
This may be due to some intrinsic aspect of their chemistry, or some of the components may have been selected based on the metabolism of more complicated already living systems, or there may have been selection based on prebiotic availability, or some mixture of the three.


The historical evolution of thinking on the origin of life is intimately tied to developments of other fields, including chemistry, biology, geology, and astronomy. Importantly, the concept of biological evolution proposed by Darwin led to the early logical conclusion that there must have been a first organism, and a distinct origin of life. In part of a letter that Darwin sent in 1871 to Joseph Dalton Hooker, Darwin summarized his rarely expressed ideas on the emergence of



life, as well as his views on the molecular nature of basic biological processes: It is often said that all the conditions for the first production of a living being are now present, which could ever have been present. But if (and oh what a big if) we could conceive in some warm little pond with all sorts of ammonia and phosphoric salts, –light, heat, electricity &c present, that a protein compound was chemically formed, ready to undergo still more complex changes, at the present such matter wd be instantly devoured, or absorbed, which would not have been the case before living creatures were formed.... By the time Darwin wrote to Hooker DNA had already been discovered, although its role in genetic processes would remain unknown for almost eighty years. In contrast, the role that proteins play in biological processes had already been firmly established, and major advances had been made in the chemical characterization of many of the building blocks of life. By the time Darwin wrote this letter the chemical gap separating organisms from the nonliving world had been bridged in part by laboratory syntheses of organic molecules. In 1827 Berzelius, probably the most influential chemist of his day, had written, “art cannot combine the elements of inorganic matter in the manner of living nature.” Only one year later his former student Friedrich Wöhler demonstrated that urea could be formed in high yield by heating ammonium cyanate “without the need of an animal kidney” [6]. Wöhler’s work represented the first synthesis of an organic compound from inorganic starting materials. Although it was not immediately recognized as such, a new era in chemical research had begun. In 1850 Adolph Strecker achieved the laboratory synthesis of alanine from a mixture of acetaldehyde, ammonia, and hydrogen cyanide. This was followed by the experiments of Butlerov showing that the treatment of formaldehyde with alkaline catalysts leads to the synthesis of sugars. 
By the end of the nineteenth century a large amount of research on organic synthesis had been performed, leading to the abiotic formation of fatty acids and sugars using electric discharges with various gas mixtures [7]. This work was continued into the twentieth century by Löb, Baudisch, and others on the synthesis of amino acids by exposing wet formamide (HCONH2) to a silent electrical discharge [8] and to UV light [9]. However, since it was generally assumed that the first living beings had been autotrophic organisms, the abiotic synthesis of organic compounds did not appear to be a necessary prerequisite for the emergence of life. These organic syntheses were not conceived as laboratory simulations of Darwin’s warm little pond,



but rather as attempts to understand the autotrophic mechanisms of nitrogen assimilation and CO2 fixation in green plants. It is generally believed that after Pasteur disproved spontaneous generation using his famous swan-necked flask experiments, the discussion of life’s beginnings was banished to the realm of useless speculation. However, scientific literature of the first part of the twentieth century shows many attempts by scientists to solve this problem. The list covers a wide range of explanations from the ideas of Pflüger on the role of hydrogen cyanide in the origin of life, to those of Arrhenius on panspermia. It also includes Troland’s hypothesis of a primordial enzyme formed by chance in the primitive ocean, Herrera’s sulphocyanic theory on the origin of cells, Harvey’s 1924 suggestion of a heterotrophic origin in a high-temperature environment, and the provocative 1926 paper that Hermann J. Muller wrote on the abrupt, random formation of a single, mutable gene endowed with catalytic and replicative properties [10]. Most of these explanations went unnoticed, in part because they were incomplete, speculative schemes largely devoid of direct evidence and not subject to experimentation. Although some of these hypotheses considered life an emergent feature of nature and attempted to understand its origin by introducing principles of evolutionary development, the dominant view was that the first forms of life were photosynthetic microbes endowed with the ability to fix atmospheric CO2 and use it with water to synthesize organic compounds. Oparin’s proposal stood in sharp contrast with the then prevalent idea of an autotrophic origin of life.
Trained as both a biochemist and an evolutionary biologist, Oparin found it was impossible to reconcile his Darwinian beliefs in a gradual evolution of complexity with the commonly held suggestion that life had emerged already endowed with an autotrophic metabolism, which included chlorophyll, enzymes, and the ability to synthesize organic compounds from CO2. Oparin reasoned that since heterotrophic anaerobes are metabolically simpler than autotrophs, the former would necessarily have evolved first. Thus, based on the simplicity and ubiquity of fermentative metabolism, Oparin [11] suggested in a small booklet that the first organisms must have been heterotrophic bacteria that could not make their own food but obtained organic material present in the primitive milieu. Careful reading of Oparin’s 1924 pamphlet shows that, in contrast to common belief, at first he did not assume an anoxic primitive atmosphere. In his original scenario he argued that while some carbides, that is, carbon–metal compounds, extruded from the young Earth’s interior would react with water vapor leading to hydrocarbons, others would be oxidized to form aldehydes, alcohols, and ketones.



These molecules would then react among themselves and with NH3 originating from the hydrolysis of nitrides:

$$\mathrm{Fe}_m\mathrm{C}_n + 4m\,\mathrm{H_2O} \rightarrow m\,\mathrm{Fe_3O_4} + \mathrm{C}_{3n}\mathrm{H}_{8m}$$

$$\mathrm{FeN} + 3\,\mathrm{H_2O} \rightarrow \mathrm{Fe(OH)_3} + \mathrm{NH_3}$$

to form “very complicated compounds,” as Oparin wrote, from which proteins and carbohydrates would form. Oparin’s ideas were further elaborated in a more extensive book published with the same title in Russian in 1936. In this new book his original proposal was revised, leading to the assumption of a highly reducing milieu in which iron carbides of geological origin would react with steam to form hydrocarbons. Their oxidation would yield alcohols, ketones, aldehydes, and so on, which would then react with ammonia to form amines, amides, and ammonium salts. The resulting protein-like compounds and other molecules would form a hot, dilute soup, which would aggregate to form colloidal systems, that is, coacervates, from which the first heterotrophic microbes evolved. Like Darwin, Oparin did not address in his 1938 book the origin of nucleic acids, because their role in genetic processes was not yet suspected. At around the same time, J.B.S. Haldane [12] published a similar proposal, and thus the theory is often credited to both scientists. For Oparin [13], highly reducing atmospheres corresponded to mixtures of CH4, NH3, and H2O with or without added H2. The atmosphere of Jupiter contains these chemical species, with H2 in large excess over CH4. Oparin’s proposal of a primordial reducing atmosphere was a brilliant inference from the then fledgling knowledge of solar atomic abundances and planetary atmospheres, as well as from Vernadsky’s idea that since O2 is produced by plants, the early Earth would be anoxic in the absence of life.
The benchmark contributions of Oparin’s 1938 book include the hypothesis that heterotrophs and anaerobic fermentation were primordial, the proposal of a reducing atmosphere for the prebiotic synthesis and accumulation of organic compounds, the postulated transition from heterotrophy to autotrophy, and the considerable detail in which these concepts are addressed. The last major theoretical contribution to the modern experimental study of the origin of life came from Harold Clayton Urey. An avid experimentalist with a wide range of scientific interests, Urey offered explanations for the composition of the early atmosphere based on then popular ideas of solar system formation, which were in turn based on astronomical observations of the atmospheres of the giant planets and star composition. In 1952 Urey published The Planets, Their Origin and Development [14], which delineated his ideas of the formation



of the solar system, a formative framework into which most origin of life theories are now firmly fixed, albeit in slightly modified fashions. In contrast, shortly thereafter, Rubey [15] proposed an outgassing model based on an early core differentiation and assumed that the early atmosphere would have been reminiscent of modern volcanic gases. Rubey estimated that a CH4 atmosphere could not have persisted for much more than $10^5$ to $10^8$ years due to photolysis. The Urey/Oparin atmosphere models (CH4, NH3, H2O) are based on astrophysical and cosmochemical models, while Rubey’s CO2, N2, H2O model is based on extrapolation of the geological record. Although this early theoretical work has had a great influence on subsequent research, modern thinking on the origin and evolution of the chemical elements, the solar system, the Earth, and the atmosphere and oceans has not been shaped largely with the origin of life as a driving force. On the contrary, current origin of life theories have been modified to fit contemporary models in geo- and cosmochemistry.

Life, Prebiotic Chemistry, Carbon, and Water

A brief justification is necessary for the discussion that will follow. One might ask why the field of prebiotic chemistry has limited itself to aqueous reactions that produce reduced carbon compounds. First, the necessary bias introduced by the nature of terrestrial organisms must be considered. There is only one example of a functioning biology, our own, and it is entirely governed by the reactions of reduced carbon compounds in aqueous media. The question naturally arises whether there might be other types of chemistry that might support a functioning biology. Hydrogen is the most abundant atom in the universe, tracing its formation to the time shortly after the Big Bang. Besides helium and small amounts of lithium, the synthesis of the heavier elements had to await later cycles of star formation and supernova explosions. Due to the high proportion of oxygen and hydrogen in the early history of the solar system, most other atomic nuclei ended up as either their oxides or hydrides. Water can be considered as the hydride of oxygen or the oxide of hydrogen. Water is one of the most abundant compounds in the universe. Life in the solid state would be difficult, as diffusion of metabolites would occur at an appallingly slow pace. Conversely, it is improbable that life in the gas phase would be able to support the stable associations required for the propagation of genetic information, and large molecules are generally nonvolatile. Thus it would appear that life would need to exist in a liquid medium. The question then becomes what solvent molecules are prevalent and able to exist in the liquid phase over the range of temperatures where reasonable reaction rates might proceed while at the same time preserving the integrity of the solute compounds. The high temperature limit is set by the



decomposition of chemical compounds, while the low temperature limit is determined by the reactivity of the solutes. Water has the largest liquid stability range of any known common molecular compound at atmospheric pressure, and the dielectric constant of water and the high heat capacity are uniquely suited to many geochemical processes. There are no other elements besides carbon that appear to be able to furnish the immense variety of chemical compounds that allow for a diverse biochemistry. Carbon is able to bond with a large variety of other elements to generate stable heteroatomic bonds, as well as with itself to give a huge inventory of carbon-based molecules. In addition, carbon has the exceptional ability to form stable double-bonded structures with itself, which are necessary for generating fixed molecular shapes and planar molecules necessary for molecular recognition. Most of the fundamental processes of life at the molecular level are based on molecular recognition, which depends on the ability of molecules to possess functional groups that allow for weak interactions such as hydrogen bonding and π-stacking. Carbon appears unique in the capacity to form stable alcohols, amines, ketones, and so on. While silicon is immediately below carbon in the periodic table, its polymers are generally unstable, especially in water, and silicon is unable to form stable double bonds with itself. Organisms presently use energy, principally sunlight, to transform environmental precursors such as CO2, H2O, and N2 into their constituents. While silicon is more prevalent in the Earth’s crust than carbon, and both are generated copiously in stars, silicon is unable to support the same degree of molecular complexity as carbon. Silicon is much less soluble in water than are carbon species, and does not have an appreciably abundant gas phase form such as CH4 or CO2, making the metabolism of silicon more difficult for a nascent biology.
THE PRIMITIVE EARTH AND SOURCES OF BIOMOLECULES

The origin of life can be constrained into a relatively short period of the Earth’s history. On the upper end, the age of the solar system has been determined to be approximately 4.65 billion years from isotopic data from terrestrial minerals, lunar samples, and meteorites, and the Earth–moon system is estimated to be approximately 4.5 billion years old. The early age limit for the origin of life on Earth is also constrained by the lunar cratering record, which suggests that the flux of large asteroids impacting the early Earth’s surface until ~3.9 billion years ago was sufficient to boil the terrestrial oceans and sterilize the planet. On the more recent end, there is putative isotopic evidence for biological activity from ~3.8 billion years ago (living systems tend to incorporate the lighter isotope of carbon, 12C, preferentially over 13C during carbon fixation due to metabolic kinetic isotope effects).



There is more definitive fossil evidence from ~3.5 billion years ago in the form of small organic inclusions in cherts morphologically similar to cyanobacteria, as well as stromatolitic assemblages (layered mats reminiscent of the layered deposits created by modern microorganismal communities). Thus the time window for the origin of the predecessor of all extant life appears to be between ~3.9 billion and 3.8 billion years ago. The accumulation and organization of organic material leading to the origin of life must have occurred during the same period. While some authors have attempted to define a reasonable time frame for biological organization based on the short time available [16], it has been pointed out that the actual time required could be considerably longer or shorter [17]. It should be borne in mind that there is some uncertainty in many of the ages mentioned above. In any event, life would have had to originate in a relatively short period, and the synthesis and accumulation of the organic compounds for this event must have preceded it in an even shorter time period. The synthesis and survival of organic biomonomers on the primitive Earth would have depended on the prevailing environmental conditions. Unfortunately, to a large degree these conditions are poorly defined by geological evidence.

Solar System Formation and the Origin of the Earth

If the origin of life depends on the synthesis of organic compounds, then the source and nature of these is the crucial factor in the consideration of any subsequent discussions of molecular organization. The origin of terrestrial prebiotic organic compounds depends on the primordial atmospheric composition. This in turn is determined by the oxidation state of the early mantle, which depends on the manner in which the Earth formed. Discussions of each of these processes are unfortunately compromised by the paucity of direct geological evidence remaining from the time period under discussion, and are therefore somewhat speculative. While a complete discussion of each of these processes is outside the scope of this chapter, they are crucial for understanding the uncertainty surrounding modern thinking regarding the origin of the prebiotic compounds necessary for the origin of life. According to the current model, the solar system is thought to have formed by the coalescence of a nebular cloud which accreted into small bodies called planetesimals which eventually aggregated to form the planets [18]. In brief, the sequence of events is thought to have commenced when a gas cloud enriched in heavier elements produced in supernovae began to collapse gravitationally on itself. This cool diffuse cloud gradually became concentrated into denser regions where more complex chemistry became possible, and in so doing began to heat up. As this occurred, the complex components of the gas cloud began to differentiate in what may be thought of as a large distillation process. The cloud condensed, became more disk-like, and



began to rotate to conserve angular momentum. Once the center of the disk achieved temperatures and pressures high enough to begin hydrogen fusion, the sun was born. The intense radiative power of the nascent sun drove the lower boiling point elements outward toward the edge of the solar system where they condensed and froze out. Farther out in the disk, dust-sized grains were also in the process of coalescing due to gravitational attraction. These small grains slowly agglomerated to form larger and larger particles that eventually formed planetesimals and finally planets. It is noteworthy that the moon is thought to have formed from the collision of a Mars-sized body with the primitive Earth. The kinetic energy of such a large collision must have been very great, so great in fact that it would have provided enough energy to entirely melt the newly formed Earth and probably strip away its original atmosphere. Discussions of planetary formation and atmospheric composition are likely to be relevant to various other planets in our solar system and beyond, thus the following discussion may be generalizable.

The Early Atmosphere

The temperature at which the planets accreted is important for understanding the early Earth’s atmosphere, which is essential for understanding the possibility of terrestrial prebiotic organic synthesis. This depends on the rate of accretion. If the planet accreted slowly, more of the primitive gases derived from planetesimals, likely reminiscent of the reducing chemistry of the early solar nebula, could have been retained. If it accreted rapidly, the model favored presently, the original atmosphere would have been lost and the primitive atmosphere would have been the result of outgassing of retained mineral-associated volatiles and subsequent extraterrestrial delivery of volatiles. CH4, CO2, CO, NH3, H2O, and H2 are the most abundant molecular gas species in the solar system, and this was likely true on the early Earth as well, although it is the relative proportions of these that is of interest. It remains contentious whether the Earth’s water was released via volcanic exhalation of water associated with mineral hydrates accreted during planetary formation or whether it was accreted from comets and other extraterrestrial bodies during planet formation. It seems unlikely that the Earth kept much of its earliest atmosphere during early accretion, thus the primordial atmosphere would have been derived from outgassing of the planet’s interior, which is thought to have occurred at temperatures between 300 and 1500 °C. Modern volcanoes emit a wide range of gas mixtures. Most modern volcanic emissions are CO2 and SO2, rather than CH4 and H2S (table 1.1). It seems likely that most of the gases released today are from the reactions of reworked crustal material and water, and do not represent components of the Earth’s deep interior. Thus modern volcanic gases may tell us little about the early Earth’s atmosphere.



Table 1.1 Gases detected in modern volcanic emissions (adapted from Miller and Orgel [4])

Locations: White Island, New Zealand; Nyerogongo Lava Lake, Congo; Mount Hekla, Iceland; Lipari Island, Italy; Larderello, Italy; Zavaritskii crater, Kamchatka; same crater, B1; Unimak Island, Alaska

Values for gases (except water) are given in volume percent. The value for water is its percentage of the total gases.

The oxidation state of the early mantle likely governed the distribution of reducing gases released during outgassing. Holland [19] proposed a multistage model based on the Earth being formed through cold homogeneous accretion in which the Earth’s atmosphere went through two stages, an early reduced stage before complete differentiation of the mantle, and a later neutral/oxidized stage after differentiation. During the first stage, the redox state of the mantle was governed by the Fe°/Fe2+ redox pair, or iron–wüstite buffer. The atmosphere in this stage would be composed of H2O, H2, CO, and N2, with approximately $0.27$–$2.7 \times 10^{-5}$ atm of H2. Once Fe° had segregated into the core, the redox state of magmas would have been controlled by the Fe2+/Fe3+ pair, or fayalite–magnetite–quartz buffer. In reconstructing the early atmosphere with regard to organic synthesis, there is particular interest in determining the redox balance of the crust–mantle–ocean–atmosphere system. Endogenous organic synthesis seems to depend, based on laboratory simulations, on the early atmosphere being generally reducing, which necessitates low O2 levels in the primitive atmosphere. Little is agreed upon about the composition of the early atmosphere, other than that it almost certainly contained very little free O2. O2 can be produced by the photodissociation of water:

$$2\,\mathrm{H_2O} \rightarrow \mathrm{O_2} + 2\,\mathrm{H_2}$$



Today this occurs at the rate of ~$10^{-8}$ g cm$^{-2}$ yr$^{-1}$, which is rather slow, and it seems likely that the steady state would have been kept low early in the Earth’s history due to reaction with reduced metals in the crust and oceans such as Fe2+. Evidence in favor of high early O2 levels comes from morphological evidence that fossil bacteria appear to have been photosynthetic, although this is somewhat speculative. On the other hand, uraninite (UO2) and galena (PbS) deposits from 2–2.8 bya testify to low atmospheric O2 levels until relatively recently, since both of these species are easily oxidized to UO3 and PbSO4, respectively. More evidence that O2 is the result of buildup from oxygenic photosynthesis and a relatively recent addition to the atmosphere comes from the banded iron formations (BIFs). These are distributed around the world from ~1.8–2.2 bya and contain extensive deposits of magnetite (Fe3O4), which may be considered a mixture of FeO and hematite (Fe2O3), interspersed with bands of hematite. Hematite requires a higher pO2 to form. On the modern Earth, high O2 levels allow for the photochemical formation of a significant amount of ozone. Significantly, O3 serves as the major shield against highly energetic UV light reaching the Earth’s surface today. Even low O2 levels may have created an effective ozone shield on the early Earth [20]. The oceans could also have served as an important UV shield protecting the nascent organic chemicals [21]. It is important to note that while UV can be a significant source of energy for synthesizing organics, it is also a means of destroying them. While this suggests that the early atmosphere was probably not oxidizing, it does not prove or offer evidence that it was very reducing.
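To put the quoted photodissociation rate in perspective, the following is a minimal sketch of how long it would take to build up the modern atmospheric O2 inventory at that rate if there were no sinks at all. Only the rate of ~10^-8 g cm^-2 yr^-1 comes from the text; the surface pressure, gravity, O2 mass fraction, and age of the Earth are modern reference values assumed here for illustration, and the rate is assumed to refer to mass of O2. The function and variable names are hypothetical.

```python
# Rough order-of-magnitude check. Every constant except RATE_G_CM2_YR is an
# assumption introduced here for illustration (modern-Earth reference values).
SURFACE_PRESSURE_PA = 101_325.0   # mean sea-level pressure, Pa (assumed)
SURFACE_GRAVITY = 9.80665         # standard gravity, m s^-2 (assumed)
O2_MASS_FRACTION = 0.2314         # approximate mass fraction of O2 in dry air (assumed)
RATE_G_CM2_YR = 1e-8              # photodissociation rate quoted in the text
EARTH_AGE_YR = 4.5e9              # approximate age of the Earth-moon system


def o2_column_g_per_cm2() -> float:
    """Mass of O2 above one square centimetre of the modern surface."""
    air_kg_per_m2 = SURFACE_PRESSURE_PA / SURFACE_GRAVITY
    air_g_per_cm2 = air_kg_per_m2 * 1000.0 / 1.0e4  # kg m^-2 -> g cm^-2
    return air_g_per_cm2 * O2_MASS_FRACTION


def years_to_accumulate(column_g_per_cm2: float,
                        rate: float = RATE_G_CM2_YR) -> float:
    """Time to build the column at a constant rate, ignoring all O2 sinks."""
    return column_g_per_cm2 / rate


column = o2_column_g_per_cm2()
years = years_to_accumulate(column)
print(f"Modern O2 column: {column:.0f} g cm^-2")
print(f"Accumulation time at the quoted rate: {years:.1e} yr")
print(f"Ratio to the age of the Earth: {years / EARTH_AGE_YR:.1f}")
```

Under these assumptions the accumulation time comes out on the order of 10^10 years, longer than the age of the Earth, which is consistent with the description of this source as rather slow even before any reaction with reduced species such as Fe2+ is considered.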
Although it is widely accepted that free oxygen was generally absent from the early Archean Earth’s atmosphere, there is no agreement on the composition of the primitive atmosphere; opinions vary from strongly reducing (CH4 + N2, NH3 + H2O, or CO2 + H2 + N2) to neutral (CO2 + N2 + H2O). The modern atmosphere is not thermodynamically stable, nor is it in equilibrium with respect to the biota, the oceans, or the continents. It is unlikely that it ever was. In the presence of large amounts of H2, the thermodynamically stable form of carbon is CH4:

$$\mathrm{CO_2} + 4\,\mathrm{H_2} \rightarrow \mathrm{CH_4} + 2\,\mathrm{H_2O} \qquad K_{25} = 8.1 \times 10^{22}$$

$$\mathrm{CO} + 3\,\mathrm{H_2} \rightarrow \mathrm{CH_4} + \mathrm{H_2O} \qquad K_{25} = 2.5 \times 10^{26}$$

$$\mathrm{C} + 2\,\mathrm{H_2} \rightarrow \mathrm{CH_4} \qquad K_{25} = 7.9 \times 10^{8}$$

In the absence of large amounts of H2, intermediate forms of carbon, such as formate and methanol, are unstable with respect to CO2 and CH4, and thus these are the stable forms at equilibrium. Even large sources of CO would have equilibrated with respect to these in short geological time spans. In the presence of large amounts of H2, NH3 is the stable nitrogen species, although not to the extreme extent of methane:

$$\tfrac{1}{2}\,\mathrm{N_2} + \tfrac{3}{2}\,\mathrm{H_2} \rightarrow \mathrm{NH_3} \qquad K_{25} = 8.2 \times 10^{2}$$
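The equilibrium constants quoted above can be put on a common energy scale through the standard relation ΔG° = −RT ln K. The sketch below applies it at 25 °C to the four K25 values given in the text. The relation itself is standard thermodynamics, but the dictionary keys, function name, and printed layout are illustrative choices and not part of the original treatment.

```python
import math

R = 8.314462618   # gas constant, J mol^-1 K^-1
T_25C = 298.15    # 25 degrees C expressed in kelvin

# Equilibrium constants at 25 C as quoted in the text.
K25 = {
    "CO2 + 4 H2 -> CH4 + 2 H2O": 8.1e22,
    "CO + 3 H2 -> CH4 + H2O": 2.5e26,
    "C + 2 H2 -> CH4": 7.9e8,
    "1/2 N2 + 3/2 H2 -> NH3": 8.2e2,
}


def delta_g_kj_per_mol(k: float, temperature: float = T_25C) -> float:
    """Standard Gibbs energy change implied by K, via dG = -R T ln K."""
    return -R * temperature * math.log(k) / 1000.0


# One dG value per reaction, in kJ per mole of reaction as written.
delta_g = {reaction: delta_g_kj_per_mol(k) for reaction, k in K25.items()}

for reaction, dg in delta_g.items():
    print(f"{reaction:28s} K25 = {K25[reaction]:.1e}   dG = {dg:7.1f} kJ/mol")
```

All four values come out negative, with the methane-forming reactions far more strongly favored than ammonia formation, which is the sense in which NH3 is stable "although not to the extreme extent of methane."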

If a reducing atmosphere was required for terrestrial prebiotic organic synthesis, the crucial question becomes the source of H2. Miller and Orgel [4] have estimated the pH2 as $10^{-4}$ to $10^{-2}$ atm. Molecular hydrogen could have been supplied to the primitive atmosphere from various sources, for example, by extensive weathering of Fe2+-bearing rocks which had not been equilibrated with the mantle, followed by photooxidation in water [22]:

$$2\,\mathrm{Fe^{2+}} + 3\,\mathrm{H_2O} \xrightarrow{h\nu} 3\,\mathrm{H_2} + \mathrm{Fe_2O_3}$$

although this reaction may also have been equilibrated during volcanic outgassing. The major sink for H2 is Jeans escape, whereby gas molecules escape the Earth’s gravitational field. This equation is important for all molecular gas species, and thus we will include it here:

$$L = N \left( \frac{RT}{2\pi m} \right)^{1/2} (1 + x)\, e^{-x}, \qquad x = \frac{G M m}{R T a_e}$$

where
$L$ = rate of escape (in atoms cm$^{-2}$ s$^{-1}$)
$N$ = density of the gas in the escape layer
$R$ = gas constant
$m$ = atomic weight of the gas
$G$ = gravitational constant
$M$ = mass of the Earth
$T$ = absolute temperature in the escape layer
$a_e$ = radius at the escape layer

The escape layer on the Earth begins ~600 km above the Earth’s surface. Molecules must diffuse to this altitude prior to escape. The major conduits of H to the escape layer are CH4, H2, and H2O, since H2O and CH4 can be photodissociated at this layer. Water is, however, frozen out at lower altitudes, and thus does not contribute significantly to this process. The importance of the oxidation state of the atmosphere may be linked to the production of HCN, which is essential for the synthesis of amino acids and purine nucleobases, as well as cyanoacetylene

Prebiotic Chemistry on the Primitive Earth


for pyrimidine nucleobase synthesis. In CH4/N2 atmospheres HCN is produced abundantly [23,24], but in CO2/N2 atmospheres most of the N atoms produced by splitting N2 recombine with O atoms to form NO. The Early Oceans

The pH of the modern ocean is governed by the complex interplay of dissolved salts, atmospheric CO2 levels, and clay mineral ion exchange processes. The pH and concentrations of Ca2+, Mg2+, Na+ and K+ are maintained by equilibria with clays rather than by the bicarbonate buffer system. The bicarbonate concentration is determined by the pH and Ca2+ concentrations. Once these equilibria are established, the pCO2 follows from equilibrium considerations with clay species, which gives a pCO2 of 1.3 × 10−4 atm to 3 × 10−4 atm [4]. For comparison, presently CO2 is ~0.03 volume %. This buffering mechanism and pCO2 would leave the primitive oceans at ~pH 8, which is coincidentally a favorable pH for many prebiotic reactions. The cytosol of most modern cells is also maintained via a series of complicated cellular processes near pH 8, suggesting that early cells may have evolved in an environment close to this value.

Our star appears to be a typical G2 class star, and is expected to have followed typical stellar evolution for its mass and spectral type. Consequently the solar luminosity would have been ~30% less during the time period we are concerned with, and the UV flux would have been much higher [20]. A possible consequence of this is that the prebiotic Earth may have frozen completely to a depth of ~300 m [25]. There is now good evidence for various completely frozen “Snowball Earth” periods later during the Earth’s history [26]. There is some evidence that liquid water was available on the Archean Earth between 4.3 and 4.4 bya [27,28], thus the jury is still out as to whether the early Earth was hot or cold, or perhaps had a variety of environments. The presence of liquid surface water would have necessitated that the early Earth maintained a heat balance that offset the postulated 30% lesser solar flux from the faint young sun.

Presently the Earth’s temperature appears to be regulated by the feedback described in the so-called BLAG model [29]. The model suggests that modern atmospheric CO2 levels are maintained at a level that ensures moderate temperatures by controlling global weathering rates and thus the flux of silicates and carbonates through the crust–ocean interface. When CO2 levels are high, the Earth is warmed by the greenhouse effect and weathering rates are increased, allowing a high inflow of Ca2+ and silicates to the oceans, which precipitates CO2 as CaCO3, and lowers the temperature. As the atmosphere cools, weathering slows, and the buildup of volcanically outgassed CO2 again raises the temperature. On the early Earth, however, before extensive recycling of the crust became common, large amounts of CO2 may have been sequestered as CaCO3 in sediments, and the environment may have been considerably colder.
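To give a feel for the size of the heat-balance problem posed by a ~30% fainter sun, the following sketch computes the radiative equilibrium temperature of a planet with no greenhouse effect. The modern solar constant (1361 W m−2) and a planetary albedo of 0.30 are assumptions introduced here for illustration; the early Earth's albedo is not known, so the numbers indicate only the scale of the effect.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def effective_temperature(solar_constant, albedo):
    """Radiative equilibrium temperature (K), ignoring any greenhouse warming."""
    absorbed = solar_constant * (1.0 - albedo) / 4.0  # W m^-2 averaged over the sphere
    return (absorbed / SIGMA) ** 0.25

S_MODERN = 1361.0  # W m^-2; assumed modern solar constant
ALBEDO = 0.30      # assumed equal to the modern planetary albedo

t_now = effective_temperature(S_MODERN, ALBEDO)
t_early = effective_temperature(0.70 * S_MODERN, ALBEDO)  # 30% lower luminosity

print(f"modern sun: {t_now:.0f} K")
print(f"faint sun:  {t_early:.0f} K")
print(f"difference: {t_now - t_early:.0f} K")
```

Since the equilibrium temperature scales as the fourth root of the absorbed flux, a 30% drop in luminosity lowers it by roughly 9%, a deficit that greenhouse gases or other heat sources would have had to make up for liquid water to persist.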



Energy Sources on the Early Earth

Provided the early Earth had a sufficiently reducing atmosphere, energy would have been needed to dissociate its constituent gases into radicals which could recombine to form reactive intermediates capable of undergoing further reaction to form biomolecules. The most abundant energy sources on Earth today are shown in table 1.2. Energy fluxes from these sources may have been slightly different in the primitive environment. As mentioned earlier, the dim young sun would have provided a much higher flux of UV radiation than the modern sun. It is also likely that volcanic activity was higher than it is today, and radioactive decay would have been more intense, especially from 40K [30], which is the probable source of the high concentrations of Ar in the modern atmosphere. Shock waves from extraterrestrial impactors and thunder were also probably more common during the tail of the planetary accretion process. Presently huge amounts of energy are discharged atmospherically in the form of lightning, but it is difficult to estimate the flux early in the Earth’s history. Also significant is the energy flux associated with the van Allen belts and static electricity discharges. Some energy sources may have been more important than others for particular synthetic reactions. For example, electric discharges are very effective at producing HCN from CH4 and NH3 or N2, but UV radiation is not. Electric discharge reactions also occur near the Earth’s surface whereas UV reactions occur higher in the atmosphere. Any molecules created would have to be transported to the ocean, and might be destroyed on the way. Thus transport rates must also be taken into account

Table 1.2 Energy sources on the modern Earth (adapted from Miller and Orgel [4])

Source                        Energy (cal cm−2 yr−1)    Energy (J cm−2 yr−1)
Total radiation from sun      260,000                   1,090,000
Ultraviolet light < 300 nm    3400                      14,000
Ultraviolet light < 250 nm    563                       2360
Ultraviolet light < 200 nm    41                        170
Ultraviolet light < 150 nm    1.7                       7
Electric discharges           4.0a                      17
Cosmic rays                   0.0015                    0.006
Radioactivity (to 1.0 km)     0.8                       3.0
Volcanoes                     0.13                      0.5
Shock waves                   1.1b                      4.6

a 3 cal cm−2 yr−1 of corona discharge and 1 cal cm−2 yr−1 of lightning.
b 1 cal cm−2 yr−1 of this is the shock wave of lightning bolts and is also included under electric discharges.




when considering the relative importance of various energy sources in prebiotic synthesis.

Atmospheric Syntheses

Urey’s early assumptions as to the constitution of the primordial atmosphere led to the landmark Miller–Urey experiment, which succeeded in producing copious amounts of biochemicals, including a large percentage of those important in modern biochemistry. Yields of intermediates as a function of the oxidation state of the gases involved have been investigated and it has been shown that reduced gas mixtures are generally much more conducive to organic synthesis than oxidizing or neutral gas mixtures. This appears to be because of the likelihood of reaction-terminating O radical collisions where the partial pressure of O-containing species is high. Even mildly reducing gas mixtures produce copious amounts of organic compounds. The yields may have been limited by carbon yield or energy yield. It seems likely that energy was not the limiting factor [24].

Small Reactive Intermediates

Small reactive intermediates are the backbone of prebiotic organic synthesis. They include HCHO, HCN, ethylene, cyanoacetylene, and acetylene, which can be recombined to form larger and more complex intermediates that ultimately form stable biochemicals. Most of these reactive intermediates would have been produced at relatively slow rates, resulting in low concentrations in the primitive oceans, where many of the reactions of interest would occur. Subsequent reactions which would have produced more complicated molecules would have depended on the balance between atmospheric production rates and rain-out rates of small reactive intermediates, as well as the degradation rates, which would have depended on the temperature and pH of the early oceans. It is difficult to estimate the concentrations of the compounds that could have been achieved without knowing the source production rates or loss rates in the environment. Nevertheless, under the assumptions above, low temperatures would have been more conducive to prebiotic organic synthesis. For example, steady-state concentrations of HCN would have depended on production rates, and thus on the energy flux and the reducing nature of the early atmosphere. The sinks for HCN would have been photolysis and hydrolysis [31], with loss rates set by the pH and temperature of the early oceans and by the rate of circulation of the oceans through hydrothermal vents. Ultraviolet irradiation of reduced metal sulfides has been shown to be able to reduce CO2 to various low molecular weight compounds including methanol, HCHO, HCOOH, and short fatty acids [32]. This may have been an important source of biomolecules on the early Earth.
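The balance between production and loss described above can be sketched with a one-box model in which an intermediate is supplied at a constant rate and removed by a first-order process, dC/dt = S − kC, so that the steady-state concentration is S/k. This is a simplification introduced here, not a model taken from the cited work, and the input rate and half-life below are hypothetical placeholders rather than estimates for the early ocean.

```python
import math

def loss_rate_constant(half_life_yr):
    """First-order rate constant (yr^-1) for a given half-life in years."""
    return math.log(2) / half_life_yr

def steady_state(input_rate, k_loss):
    """Steady-state concentration for dC/dt = input_rate - k_loss * C."""
    return input_rate / k_loss

def concentration_at(t_yr, input_rate, k_loss, c0=0.0):
    """Concentration after t_yr years, starting from c0."""
    c_ss = steady_state(input_rate, k_loss)
    return c_ss + (c0 - c_ss) * math.exp(-k_loss * t_yr)

INPUT_RATE = 1e-8  # mol L^-1 yr^-1; hypothetical rain-out rate
HALF_LIFE = 100.0  # yr; hypothetical lifetime against hydrolysis

k = loss_rate_constant(HALF_LIFE)
print(f"steady state: {steady_state(INPUT_RATE, k):.2e} M")
print(f"after 100 yr: {concentration_at(100.0, INPUT_RATE, k):.2e} M")
```

In this picture a faster sink (for example, hydrolysis at higher temperature) lowers the steady-state concentration in direct proportion, which is one way to read the preference for cold conditions.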



Concentration Mechanisms

When aqueous solutions are frozen, as the ice lattice forms solutes are excluded and extremely concentrated brines may be formed. In the case of HCN, the final eutectic mixture contains 75% HCN. In principle any degree of concentration up to this point is possible. Salt water, however, cannot be concentrated to the same degree as fresh water in a eutectic; for example, from 0.5 M NaCl, similar to the concentration in the modern ocean, the eutectic of the dissolved salt is the limit, which is only a concentration factor of ~10. Eutectic freezing has been shown to be an excellent mechanism for producing biomolecules such as amino acids and adenine from HCN [33]. This would of course require that at least some regions of the early Earth were cold enough to freeze, which would require that atmospheric greenhouse warming due to CO2, CH4, and NH3 or organic aerosols was not so great as to prohibit at least localized freezing. Concentration by evaporation is also possible for nonvolatile compounds, as long as they are stable to the drying process [34]. Some prebiotic organic syntheses may have depended on the availability of dry land areas. Although continental crust had almost certainly not yet formed, the geological record contains some evidence of sedimentary rocks that must have been deposited in shallow environments on the primitive Earth. It is not unreasonable to assume that some dry land was available on the primitive Earth in environments such as island arcs. There is the possibility that hydrophobic compounds could have been concentrated in lipid phases if such phases were available. Calculations and some experiments suggest that an early reducing atmosphere might have been polymerized by solar ultraviolet radiation in geologically short periods of time. An oil slick 1–10 m thick could have been produced in this way and could have been important in the concentration of hydrophobic molecules [35]. 
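The concentration factor of ~10 quoted for salt water can be checked with a short calculation. The sketch assumes the NaCl–water eutectic lies at about 23.3 wt% NaCl, and treats 0.5 M as approximately 0.5 mol per kg of water; both are simplifying assumptions introduced here.

```python
NACL_MOLAR_MASS = 58.44       # g/mol
EUTECTIC_WT_FRACTION = 0.233  # assumed NaCl mass fraction at the eutectic

def molality_from_weight_fraction(w, molar_mass):
    """Moles of solute per kg of water for a solute mass fraction w."""
    return (w / molar_mass) / (1.0 - w) * 1000.0

eutectic_molality = molality_from_weight_fraction(EUTECTIC_WT_FRACTION, NACL_MOLAR_MASS)
starting_molality = 0.5  # mol/kg; 0.5 M seawater-like brine, approximated as molality

factor = eutectic_molality / starting_molality
print(f"eutectic brine: {eutectic_molality:.1f} mol/kg")
print(f"concentration factor: {factor:.0f}x")
```

Any solute that stays dissolved in the residual brine is concentrated by the same factor, which is why fresh water allows a far greater enrichment than sea water.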
Clays are complex mineral assemblages formed from dissolved aluminosilicates. Such minerals form templates for the formation of subsequent layers of mineral, leading to speculation that the first organisms may have been mineral-based [36]. Clays are also capable of binding organic material via ionic and van der Waals forces, and may have been locations for early prebiotic synthesis. Early ion exchange processes would also have concentrated 40K+, which would have exposed early prebiotic organics to high fluxes of ionizing radiation [30].

SYNTHESIS OF THE MAJOR CLASSES OF BIOCHEMICALS

The top-down approach to origin of life research operates on the premise that the earliest organisms were composed of the same, or similar, biochemicals as modern organisms. The following sections will consider biomolecules and experimental results demonstrating how these may have come to be synthesized on the primitive Earth via plausible geochemical processes.

Amino Acids

Experimental evidence in support of Oparin’s hypothesis of chemical evolution came first from Urey’s laboratory, which had been involved with the study of the origin of the solar system and the chemical events associated with this process. Urey considered the origin of life in the context of his proposal of a highly reducing terrestrial atmosphere [37]. The first successful prebiotic amino acid synthesis was carried out with an electric discharge (figure 1.1) and a strongly reducing model atmosphere of CH4, NH3, H2O, and H2 [38]. The result of this experiment was a large yield of racemic amino acids, together with hydroxy acids, short aliphatic acids, and urea (table 1.3). One of the surprising results of this experiment was that the products were not a large random mixture of organic compounds, but rather a relatively small number of compounds were produced in substantial yield. Moreover, with a few exceptions, the compounds were of biochemical significance. The synthetic routes to prebiotic bioorganic compounds and the geochemical

Figure 1.1 The apparatus used in the first electric discharge synthesis of amino acids and other organic compounds in a reducing atmosphere. It was made entirely of glass, except for the tungsten electrodes [38].



Table 1.3 Yields of small organic molecules from sparking a mixture of methane, hydrogen, ammonia, and water (yields given based on input carbon in the form of methane [59 mmoles (710 mg)])

Compound                      Yield (µmoles)    Yield (%)
Glycine                       630               2.1
Glycolic acid                 560               1.9
Sarcosine                     50                0.25
Alanine                       340               1.7
Lactic acid                   310               1.6
N-Methylalanine               10                0.07
α-Amino-n-butyric acid        50                0.34
α-Aminoisobutyric acid        1                 0.007
α-Hydroxybutyric acid         50                0.34
β-Alanine                     150               0.76
Succinic acid                 40                0.27
Aspartic acid                 4                 0.024
Glutamic acid                 6                 0.051
Iminodiacetic acid            55                0.37
Iminoacetic-propionic acid    15                0.13
Formic acid                   2330              4.0
Acetic acid                   150               0.51
Propionic acid                130               0.66
Urea                          20                0.034
N-Methyl urea                 15                0.051

plausibility of these became experimentally tractable as a result of this experimental demonstration. The mechanism of synthesis of the amino and hydroxy acids formed in the spark discharge experiment was investigated [39]. The presence of large quantities of hydrogen cyanide, aldehydes, and ketones in the water flask (figure 1.2), which were clearly derived from the methane, ammonia, and hydrogen originally included in the apparatus, showed that the amino acids were not formed directly in the electric discharge, but were the outcome of a Strecker-like synthesis that involved aqueous phase reactions of reactive intermediates. The mechanism is shown in figure 1.3. Detailed studies of the equilibrium and rate constants of these reactions have been performed [40]. The results demonstrate that both amino and hydroxy acids could have been synthesized at high dilutions of HCN and aldehydes in the primitive oceans. The reaction rates depend on temperature, pH, HCN, NH3, and aldehyde concentrations, and are rapid on a geological time scale. The half-lives for the hydrolysis of the intermediate products in the reactions, amino and hydroxy nitriles, can be less than a thousand years at 0 °C [41].
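The percentage yields in table 1.3 are carbon-based: each is the fraction of the 59 mmoles of input methane carbon that ends up in a given product. A minimal sketch of that arithmetic follows; the function name is illustrative, and the carbon counts per molecule are standard values for these compounds rather than figures given in the table.

```python
INPUT_CARBON_UMOL = 59_000  # 59 mmoles of carbon supplied as CH4 (table 1.3)

def carbon_yield_percent(product_umol, carbons_per_molecule):
    """Percentage of the input carbon recovered in one product."""
    return 100.0 * product_umol * carbons_per_molecule / INPUT_CARBON_UMOL

# (yield in µmoles from table 1.3, number of carbon atoms per molecule)
examples = {
    "glycine": (630, 2),
    "alanine": (340, 3),
    "formic acid": (2330, 1),
    "urea": (20, 1),
}

for name, (umol, n_carbon) in examples.items():
    print(f"{name}: {carbon_yield_percent(umol, n_carbon):.2g}%")
```

Weighting by carbon count explains why a product formed in fewer micromoles can still account for a comparable share of the carbon if its molecules are larger.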



Figure 1.2 The concentrations of ammonia (NH3), hydrogen cyanide (HCN), and aldehydes (CHO-containing compounds) present in the lowermost U-tube of the apparatus shown in figure 1.1. The concentrations of the amino acids present in the lower flask are also shown. These amino acids were produced from the sparking of a gaseous mixture of methane (CH4), ammonia (NH3), water vapor (H2O), and hydrogen in the upper flask. The concentrations of NH3, HCN, and aldehydes decrease over time as they are converted to amino acids.
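Since the Strecker route depends on free ammonia, it can be useful to see how dissolved ammonia divides between NH3 and NH4+ at a given pH. The sketch below applies the Henderson–Hasselbalch relation with an assumed pKa of ~9.2 for NH4+ at 25 °C; the function name is illustrative.

```python
def fraction_as_nh3(ph, pka=9.2):
    """Fraction of total dissolved ammonia present as neutral NH3."""
    ratio = 10.0 ** (ph - pka)  # [NH3] / [NH4+]
    return ratio / (1.0 + ratio)

for ph in (7.0, 8.0, 9.2):
    print(f"pH {ph}: {100.0 * fraction_as_nh3(ph):.1f}% as NH3")
```

At pH 8 only about 6% of the dissolved ammonia is present as NH3, with the rest as NH4+, so an ocean buffered near that pH would hold most of its ammonia in the ionic form.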

The slow step in amino acid synthesis is the hydrolysis of the amino nitrile, which could take 10,000 years at pH 8 and 25 °C. An additional example of a rapid prebiotic synthesis is that of amino acids on the Murchison meteorite (which will be discussed later), which apparently took place in less than 10^5 years [42]. These results suggest that if the prebiotic environment was reducing, then the synthesis of the building blocks of life was efficient and did not constitute the limiting step in the emergence of life. The Strecker synthesis of amino acids requires the presence of ammonia (NH3) in the prebiotic environment. As discussed earlier,



Figure 1.3 The Strecker and cyanohydrin mechanisms for the formation of amino and hydroxy acids from ammonia, aldehydes and ketones, and cyanide.

gaseous ammonia is rapidly decomposed by ultraviolet light [43], and during early Archean times the absence of a significant ozone layer would have imposed an upper limit to its atmospheric concentration. Since ammonia is very soluble in water, if the buffer capacity of the primitive oceans and sediments was sufficient to maintain the pH at ~8, then dissolved NH4+ (the pKa of NH4+ is ~9.2) in equilibrium with dissolved NH3 would have been available. Since NH4+ is similar in size to K+ and thus easily enters the same exchange sites on clays, NH4+ concentrations were probably no higher than 0.01 M. The ratio of hydroxy acids to amino acids is governed by the ammonia (NH3) concentration, which would have to be ~0.01 M at 25 °C to make a 50/50 mix; equal amounts of the cyanohydrin and aldehyde are generated at CN− concentrations of 10−2 to 10−4 M. A more realistic atmosphere for the primitive Earth may be a mixture of CH4 with N2 with traces of NH3. There is experimental evidence that this mixture of gases is quite effective with electric discharges in producing amino acids [41]. Such an atmosphere, however, would still be strongly reducing. Alternatively, amino acids can be synthesized from the reaction of urea, HCN, and an aldehyde or a ketone (the Bucherer–Bergs synthesis, figure 1.4). This reaction pathway may have been significant if little free ammonia were available. A wide variety of direct sources of energy must have been available on the primitive Earth (table 1.2). It is likely that in the prebiotic environment solar radiation, and not atmospheric electricity, was the major



Figure 1.4 The Bucherer–Bergs mechanism of synthesis of amino acids, which uses urea instead of ammonia as the source of the amino group.

source of energy reaching the Earth’s surface. However, it is unlikely that any single one of the energy sources listed in table 1.2 can account for all organic compound syntheses. The importance of a given energy source in prebiotic evolution is determined by the product of the energy available and its efficiency in generating organic compounds. Given our current ignorance of the prebiotic environment, it is impossible to make absolute assessments of the relative significance of these different energy sources. For instance, neither the pyrolysis (800 to 1200 °C) of a CH4/NH3 mixture nor the action of ultraviolet light on a strongly reducing atmosphere gives good yields of amino acids. However, the pyrolysis of methane, ethane, and other hydrocarbons gives good yields of phenylacetylene, which upon hydration yields phenylacetaldehyde. The latter could then participate in a Strecker synthesis and act as a precursor to the amino acids phenylalanine and tyrosine in the prebiotic ocean. The available evidence suggests that electric discharges were the most important source of hydrogen cyanide, which is recognized as an important intermediate in prebiotic synthesis. However, the hot H atom mechanism suggested by Zahnle could also have been significant [44]. In addition to the central role of HCN in the formation of amino nitriles during the Strecker synthesis, HCN polymers have been shown to be a source of amino acids. Ferris et al. [45] have demonstrated that, in addition to urea, guanidine, and oxalic acid, hydrolysis of HCN polymers



produces glycine, alanine, aspartic acid, and aminoisobutyric acid, although the yields are not particularly high except for glycine (~1%). Modern organisms construct their proteins from ~20 universal amino acids which are almost exclusively of the L enantiomer. The amino acids produced by prebiotic syntheses would have been racemic. It is unlikely that all of the modern amino acids were present in the primitive environment, and it is unknown which, if any, would have been important for the origin of life. Acrolein would have been produced in fairly high yield from the reaction of acetaldehyde with HCHO [46], which has several very robust atmospheric syntheses. Acrolein can be converted into several of the biological amino acids via reaction with various plausible prebiotic compounds [47] (figure 1.5). There has been less experimental work with gas mixtures containing CO and CO2 as carbon sources instead of CH4, although CO-dominated atmospheres could not have existed except transiently. Spark discharge experiments using CH4, CO, or CO2 as a carbon source with various

Figure 1.5 Acrolein may serve as an important precursor in the prebiotic synthesis of several amino acids.



Figure 1.6 Amino acid yields based on initial carbon. In all experiments reported here, the partial pressure of N2, CO, or CO2 was 100 torr. The flask contained 100 ml of water with or without 0.05 M NH4Cl brought to pH 8.7. The experiments were conducted at room temperature, and the spark generator was operated continuously for two days.

amounts of H2 have shown that methane is the best source of amino acids, but CO and CO2 are almost as good if a high H2/C ratio is used (figure 1.6). Without added hydrogen, however, the amino acid yields are very low, especially when CO2 is the sole carbon source. The amino acid diversity produced in CH4 experiments is similar to that reported by Miller [38]. With CO and CO2, however, glycine was the predominant amino acid, with little else besides alanine produced [41]. The implication of these results is that CH4 is the best carbon source for abiotic synthesis. Although glycine was essentially the only amino acid produced in spark discharge experiments with CO and CO2, as the primitive ocean matured the reaction between glycine, H2CO, and HCN could have led to the formation of other amino acids such as alanine, aspartic acid, and serine. Such simple mixtures may have lacked the chemical diversity required for prebiotic evolution and the origin of the first life forms. However, since it is not known which amino acids were required for the emergence of life, we can say only that CO and CO2 are less favorable than CH4 for prebiotic amino acid synthesis, but that amino acids produced from CO and CO2 may have been adequate. The spark discharge yields of amino acids, HCN, and aldehydes are about the same using CH4, H2/CO >1, or H2/CO2 >2. However, it is not clear how such high molecular hydrogen-to-carbon ratios for the last



two reaction mixtures could have been maintained in the prebiotic atmosphere.

Synthesis of Nucleic Acid Bases

Nucleic acids are the central repository of the information that organisms use to construct enzymes via the process of protein synthesis. In all living organisms genetic information is stored in DNA, which is composed of repeating units of deoxyribonucleotides (figure 1.7) and is transcribed into linear polymers of RNA, which are composed of repeating units of ribonucleotides. The differences between the two are the use of deoxyribose in DNA and ribose in RNA, and of thymine in DNA and uracil in RNA. It is generally agreed that one of the principal characteristics of life is the ability to transfer information from one generation to the next. Nucleic acids seem uniquely structured for this function, and thus a considerable amount of attention has been dedicated to elucidating their prebiotic synthesis.

PURINES

The first evidence that the components of nucleic acids may have been synthesized abiotically was provided in 1960 [48]. Juan Oró, who was at the time studying the synthesis of amino acids from aqueous solutions of HCN and NH3, reported the abiotic formation of adenine, which may be considered a pentamer of HCN (C5H5N5) from these same mixtures. Oró found that concentrated solutions of ammonium cyanide which were refluxed for a few days produced adenine in up to 0.5% yield along with 4-aminoimidazole-5-carboxamide and an intractable polymer [48,49]. The polymer also yields amino acids, urea, guanidine, cyanamide, and cyanogen. It is surprising that a synthesis requiring at least five steps should produce such high yields of adenine. The mechanism of synthesis has since been studied in some detail. The initial step is the dimerization of HCN followed by further reaction to give HCN trimer and HCN tetramer, diaminomaleonitrile (DAMN) (figure 1.8). As demonstrated by Ferris and Orgel [50], a two-photon photochemical rearrangement of diaminomaleonitrile proceeds readily with high yield in sunlight to amino imidazole carbonitrile (AICN) (figure 1.9). Further reaction of AICN with small molecules generated in polymerizing HCN produces the purines (figure 1.10). The limits of the synthesis as delineated by the kinetics of the reactions and the necessity of forming the dimer, trimer, and tetramer of HCN have been investigated, and this has been used to delineate the limits of geochemically plausible synthesis. The steady-state concentrations of HCN would have depended on the pH and temperature of the early oceans and the input rate of HCN from atmospheric synthesis.
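The description of adenine as a pentamer of HCN, and of DAMN as the tetramer, is a statement about elemental composition that is easy to verify. The short check below uses standard atomic masses; the helper names are illustrative.

```python
from collections import Counter

ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007}  # g/mol, standard values

HCN = Counter({"H": 1, "C": 1, "N": 1})
ADENINE = Counter({"C": 5, "H": 5, "N": 5})  # C5H5N5
DAMN = Counter({"C": 4, "H": 4, "N": 4})     # diaminomaleonitrile, C4H4N4

def oligomer(monomer, n):
    """Elemental composition of n monomer units with no atoms lost."""
    return Counter({element: count * n for element, count in monomer.items()})

def molar_mass(formula):
    """Molar mass in g/mol of a composition given as element counts."""
    return sum(ATOMIC_MASS[element] * count for element, count in formula.items())

print(oligomer(HCN, 5) == ADENINE)  # adenine has the composition of (HCN)5
print(oligomer(HCN, 4) == DAMN)     # DAMN has the composition of (HCN)4
print(f"adenine: {molar_mass(ADENINE):.2f} g/mol")
```

Because no atoms are gained or lost, the synthesis needs no reagent other than HCN itself, which is part of what makes the route attractive from a prebiotic standpoint.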

Figure 1.7 The structures of DNA and RNA (A = adenine, G = guanine, C = cytosine, T = thymine, U = uracil).



Figure 1.8 The mechanism of formation of DAMN from HCN.

Assuming favorable production rates, Miyakawa et al. [31] estimated steady-state concentrations of HCN of 2 × 10−6 M at pH 8 and 0 °C in the primitive oceans. At 100 °C and pH 8 the steady-state concentration would have been 7 × 10−13 M. HCN hydrolyzes to formamide, which then hydrolyzes to formic acid and ammonia. It has been estimated that oligomerization and hydrolysis compete at approximately 10−2 M concentrations of HCN at pH 9 [51], although it has been shown that adenine is still produced from solutions as dilute as 10−3 M [52]. If the concentration of HCN were as low as estimated, it is possible that HCN tetramer formation may have occurred on the primitive Earth in eutectic solutions of HCN–H2O, which may have existed in the polar regions of an Earth of the present average temperature. High yields of the HCN tetramer have been reported by cooling dilute cyanide solutions to temperatures between −10 and −30 °C for a few months [51]. Production of adenine by HCN polymerization is accelerated by the presence of formaldehyde and other aldehydes, which could have also been available in the prebiotic environment [53]. The prebiotic synthesis of guanine was first studied under conditions that required unrealistically high concentrations of a number of precursors, including ammonia [54]. Purines, including guanine,

Figure 1.9 The synthesis of AICN via photoisomerization of DAMN.



Figure 1.10 Prebiotic synthesis of purines from AICN.

hypoxanthine, xanthine, and diaminopurine, could have been produced in the primitive environment by variations of the adenine synthesis using aminoimidazole carbonitrile and aminoimidazole carboxamide [55] with other small molecule intermediates generated from HCN. Reexamination of the polymerization of concentrated NH4CN solutions has shown that, in addition to adenine, guanine is also produced at both −80 and −20 °C [56]. It is probable that most of the guanine obtained from the polymerization of NH4CN is the product of the hydrolysis of 2,6-diaminopurine, which reacts readily with water to give guanine and isoguanine. The yields of guanine in this reaction are 10 to 40 times less than those of adenine. Adenine, guanine, and a simple set of amino acids dominated by glycine have also been detected in dilute solutions of NH4CN which were kept frozen for 25 years at −20 and −78 °C, as well as in the aqueous products of spark discharge experiments from a reducing experiment frozen for 5 years at −20 °C [33]. The mechanisms described above are likely an oversimplification. In dilute aqueous solutions adenine synthesis may also involve the formation and rearrangement of other precursors such as 2-cyano and 8-cyano adenine [53].

PYRIMIDINES

The prebiotic synthesis of pyrimidines has also been investigated extensively. The first synthesis investigated was that of uracil from



malic acid and urea [57]. The abiotic synthesis of cytosine in an aqueous phase from cyanoacetylene (HCCCN) and cyanate (NCO−) was later described [58,59]. Cyanoacetylene is abundantly produced by the action of a spark discharge on a mixture of methane and nitrogen, and cyanate is produced from cyanogen (NCCN) or from the decomposition of urea (H2NCONH2). Cyanoacetylene is apparently also a Strecker synthesis precursor to aspartic acid. However, the high concentrations of cyanate (> 0.1 M) required in this reaction are unrealistic, since cyanate is rapidly hydrolyzed to CO2 and NH3. Urea itself is fairly stable, depending on the concentrations of NCO− and NH3. Later, it was found that cyanoacetaldehyde (the hydration product of cyanoacetylene) and urea react to form cytosine and uracil. This was extended to a high yield synthesis that postulated drying lagoon conditions. The reaction of uracil with formaldehyde and formate gives thymine in good yield [60]. Thymine may also be synthesized from the UV-catalyzed dehydrogenation of dihydrothymine, which is produced from the reaction of β-aminoisobutyric acid with urea [61]. In dilute solution, the reaction of cyanoacetaldehyde (which is produced in high yields from the hydrolysis of cyanoacetylene) with urea produces no detectable levels of cytosine [62]. However, when the same nonvolatile compounds are concentrated in laboratory models of “evaporating pond” conditions simulating primitive lagoons or pools on drying beaches on the early Earth, surprisingly high amounts of cytosine (>50%) are observed [63]. These results suggest a facile mechanism for the accumulation of pyrimidines in the prebiotic environment (figure 1.11).

Figure 1.11 Two possible mechanisms for the prebiotic synthesis of the biological pyrimidines.



Figure 1.12 One possible mechanism for the formation of N6 modified purines.

A related synthesis under evaporating conditions uses cyanoacetaldehyde with guanidine to produce diaminopyrimidine [62] in very high yield [64]; this then hydrolyzes to uracil and small amounts of cytosine. Uracil (albeit in low yields), as well as its biosynthetic precursor orotic acid, has also been identified among the hydrolytic products of hydrogen cyanide polymer [45,65]. A wide variety of other modified nucleic acid bases may also have been available on the early Earth. The list includes isoguanine, which is a hydrolytic product of diaminopurine [56], and several modified purines which may have resulted from side reactions of both adenine and guanine with a number of alkylamines under the concentrated conditions of a drying pond [66], including N6-methyladenine, 1-methyladenine, N6,N6-dimethyladenine, 1-methylhypoxanthine, 1-methylguanine, and N2-methylguanine (figure 1.12). Modified pyrimidines may also have been present on the primitive Earth. These include dihydrouridine, which is formed from NCO− and β-alanine [67], and others like diaminopyrimidine, thiocytosine [64], and 5-substituted uracils, formed via reaction of uracil with formaldehyde, whose functional side groups may have played an important role in the early evolution of catalysis prior to the origin of proteins, and which are efficiently formed under plausible prebiotic conditions [68] (figure 1.13).

Carbohydrates

Most biological sugars have the empirical formula (CH2O)n, a point that was underscored by Butlerow’s 1861 discovery of the formose reaction [69], which showed that a complex mixture of the sugars of biological relevance could be formed by the reaction of HCHO under basic conditions. The Butlerow synthesis is complex and incompletely understood. It depends on the presence of suitable inorganic catalysts, with calcium hydroxide (Ca(OH)2) or calcium carbonate (CaCO3) being the most completely investigated. In the absence of basic catalysts, little or no sugar is obtained. At 100 °C, clays such as kaolin serve to catalyze the formation of sugars, including ribose,



Figure 1.13 The reaction of uracil with formaldehyde to produce 5-hydroxymethyl uracil, and functional groups attached to 5-substituted uracil. Incorporation of these amino acid analogs into polyribonucleotides during the “RNA world” stage may have led to a substantial widening of the catalytic properties of ribozymes.

in small yields from dilute (0.01 M) solutions of formaldehyde [70–72]. This reaction has been extensively investigated with regard to catalysis and several interesting phenomena have been observed. For instance, the reaction is catalyzed by glycolaldehyde, acetaldehyde, and various organic catalysts [73]. Ribose was among the last of life’s building blocks characterized by chemists. Suggestions for the existence of an “RNA world,” a period during early biological evolution when biological systems used RNA both as a catalyst and an informational macromolecule, make it possible that ribose may have been among the oldest carbohydrates to be employed by living beings. Together with the other sugars that are produced by the condensation of formaldehyde under alkaline conditions [69],

Prebiotic Chemistry on the Primitive Earth


it is also one of the organic compounds to be synthesized in the laboratory under conditions that are relevant from a prebiotic perspective. The Butlerow synthesis is autocatalytic and proceeds through glycolaldehyde, glyceraldehyde, and dihydroxyacetone, four-carbon sugars, and five-carbon sugars to give finally hexoses, including biologically important carbohydrates such as glucose and fructose. The detailed reaction sequence may proceed as shown in figure 1.14. The reaction produces a complex mixture of sugars, including various 3-, 4-, 5-, 6-, and 7-carbon carbohydrates and all of their isomers (for the addition of each CH2O unit, both isomers are produced) (figure 1.11), and generally is not particularly selective, although methods of overcoming this have been investigated. Inclusion of acetaldehyde in the reaction may lead to the formation of deoxyribose [74] (figure 1.15). The reaction tends to stop when the formaldehyde has been consumed and ends with the production of higher C4–C7 sugars that can form cyclic acetals and ketals. The reaction produces all of the epimers and isomers of the small C2–C6 sugars, some of the C7 ones, and various dendroaldoses and dendroketoses, as well as small molecules such as glycerol and pentaerythritol. Schwartz and De Graaf [72] have discovered an interesting photochemical formose reaction that generates pentaerythritol almost exclusively. Both L- and D-ribose occur in this complex mixture, but are not particularly abundant. Since all carbohydrates have somewhat similar chemical properties, it is difficult to envision simple mechanisms that could lead to the enrichment of ribose from this mixture, or how the relative yield of ribose required for the formation of RNA could be enhanced.
However, the recognition that the biosynthesis of sugars leads not to the formation of free carbohydrates but of sugar phosphates led Albert Eschenmoser and his associates to show that, under slightly basic conditions, the condensation of glycolaldehyde-2-phosphate in the presence of formaldehyde results in the considerably selective synthesis of ribose-2,4-diphosphate [75]. This reaction has also been shown to take place under neutral conditions and low concentrations in the presence of minerals [76], and is particularly attractive given the properties of pyranosyl-RNA (p-RNA), a 2′,4′-linked nucleic acid analog whose backbone includes the six-membered pyranose form of ribose-2,4-diphosphate [77]. The major problem with this work is that a reasonable source of the starting material, oxirane carbonitrile (which hydrolyzes to glycolaldehyde-2-phosphate), has not been identified. There are three major obstacles to the relevance of the formose reaction as a source of sugars on the primitive Earth. The first problem is that the Butlerow synthesis gives a wide variety of straight-chain and branched sugars. Indeed, more than 40 different sugars have


Figure 1.14 A simplified scheme of the formose reaction.



Figure 1.15 Possible prebiotic synthesis of deoxyribose from glyceraldehyde and acetaldehyde.

been identified in one reaction mixture [78] (figure 1.16). The second problem is that the conditions of synthesis are also conducive to the degradation of sugars [71]. Sugars undergo various diagenetic reactions on short geological time scales that are seemingly prohibitive to the accumulation of significant amounts on the primitive Earth. At pH 7, the half-life for decomposition of ribose is 73 minutes at 100 °C, and 44 years at 0 °C [79]. The same is true of most other sugars, including ribose-2,4-diphosphate. The third problem is that the concentrations of HCHO required appear to be prebiotically implausible, although the limits of the synthesis have not been determined.
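The two ribose half-lives quoted above allow a rough back-of-the-envelope extrapolation to other temperatures. The sketch below assumes simple first-order decay and a single Arrhenius activation energy between 0 °C and 100 °C; both are simplifying assumptions introduced here rather than claims taken from [79], and the function names are purely illustrative.

```python
import math

R = 8.314  # gas constant, J/(mol K)

def rate_constant(half_life_s):
    """First-order rate constant (1/s) from a half-life in seconds."""
    return math.log(2) / half_life_s

def activation_energy(t_half_hot, T_hot, t_half_cold, T_cold):
    """Apparent Arrhenius activation energy (J/mol) from two half-lives."""
    k_hot = rate_constant(t_half_hot)
    k_cold = rate_constant(t_half_cold)
    return R * math.log(k_hot / k_cold) / (1.0 / T_cold - 1.0 / T_hot)

def half_life_at(T, Ea, t_half_ref, T_ref):
    """Extrapolate a half-life (s) to temperature T (K), assuming constant Ea."""
    return t_half_ref * math.exp(Ea / R * (1.0 / T - 1.0 / T_ref))

MIN = 60.0
YEAR = 365.25 * 24 * 3600.0

# Half-lives quoted in the text for ribose at pH 7
t_100C = 73 * MIN   # 73 minutes at 100 °C
t_0C = 44 * YEAR    # 44 years at 0 °C

Ea = activation_energy(t_100C, 373.15, t_0C, 273.15)
t_25C = half_life_at(298.15, Ea, t_100C, 373.15)

print(f"apparent Ea = {Ea / 1000:.0f} kJ/mol")
print(f"extrapolated half-life at 25 °C = {t_25C / YEAR:.2f} years")
```

Any intermediate temperature can be passed to `half_life_at`; the result is only as good as the assumption that one activation energy holds over the whole range.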

Figure 1.16 Gas chromatogram of derivatives of the sugars formed by the formose reaction. The arrows point to the two ribose isomers (adapted from Decker et al. [78]).



There are a number of possible ways to stabilize sugars; the most interesting one is to attach the sugar to a purine or pyrimidine, that is, to convert the carbohydrate to a glycoside, but the synthesis of nucleosides is notoriously difficult under plausible prebiotic conditions. It therefore has become apparent that ribonucleotides could not have been the first components of prebiotic informational macromolecules [80]. This has led to propositions of a number of possible substitutes for ribose in nucleic acid analogs, in what has been dubbed the “pre-RNA world” [81].

A Paradox?

When aqueous solutions of HCN and HCHO are mixed, the predominant product is glycolonitrile [82], which seems to preclude the formation of sugars and nucleic acid bases in the same location [83]. Nevertheless, both sugars and nucleic acid bases have been found in the Murchison meteorite [84,85], and it seems likely that the compounds found in the Murchison meteorite were produced by reactions such as the Strecker synthesis. This suggests either that the conditions for the synthesis of sugars, amino acids, and purines from HCHO and HCN exist only within very limited ranges of NH3, HCN, and HCHO concentration and pH, or that the two precursors were produced under different regimes in different locations.

Lipids

Amphiphilic molecules are especially interesting due to their propensity to spontaneously assemble into micelles and vesicles. Cell membranes are almost universally composed of phosphate esters of fatty acid glycerides. Fatty acids are biosynthesized today by multifunctional enzymes or enzyme complexes. Nevertheless, as all life we know of is composed of cells, these compounds seem crucial. Eukaryotic and bacterial cell membranes are composed of largely straight-chain fatty acid acyl glycerols while those of the Archaea are often composed of polyisoprenoid glycerol ethers. Either type may have been the primordial lipid component of primitive cells. Most prebiotic simulations fail to generate large amounts of fatty acids, with the exception of simulated hydrothermal vent reactions, which arguably use unreasonably high concentrations of reactants [86]. Heating glycerol with fatty acids and urea has been shown to produce acylglycerols [87]. A prebiotic synthesis of long-chain isoprenoids has been suggested by Ourisson based on the Prins reaction of formaldehyde with isobutene [88]. The Murchison meteorite contains small amounts of higher straight-chain fatty acids, some of which may be contamination [89]. Amphiphilic components have been positively identified in the Murchison meteorite [90], although the yields of these molecules are poor in typical spark discharge experiments [91].




It might be assumed that most of the inorganic cofactors (Mo, Fe, Mn, etc.) were present as soluble ions in the prebiotic seas to some degree. Many of the organic cofactors, however, are either clearly byproducts of an extant metabolism or have syntheses so complex that their presence on the early Earth cannot reasonably be postulated. Most enzyme-catalyzed reactions use a cofactor, and these are often members of a small set of small biochemicals known collectively as vitamins. The most widely used is nicotinamide, and several prebiotic syntheses of this compound have been devised [92,93]. Other interesting vitamins that have prebiotic syntheses include components of coenzyme A and coenzyme M [94–96] and analogs of pyridoxal [97]. There have been reports of flavin-like compounds generated from dry-heated amino acids, but these have not been well characterized [98]. It may be that many compounds that do not have prebiotic syntheses were generated later once a functioning biochemistry was in place [99]. Interestingly, many of these are able to carry out their catalyses, albeit to a lesser degree, in the absence of the enzyme. Nonenzymatic reactions that occur in the presence of vitamin cofactors include thiamin-mediated formose reactions [100] and transamination with pyridoxal [101]. These may have some relevance to prebiotic chemistry, or perhaps to the early development of metabolism. It is unclear whether porphyrins were necessary for the origin of life, although they are now a part of every terrestrial organism’s biochemistry as electron carriers and photopigments. They can be formed rather simply from the reaction of pyrroles and HCHO [102,103] (figure 1.17).

Small Molecules Remaining to be Synthesized

There are numerous biochemicals that do not appear to be prebiotically accessible, despite some interesting prebiotic syntheses that have been developed. Tryptophan, phenylalanine, tyrosine, histidine, arginine, lysine, and the cofactors pyridoxal, thiamin, riboflavin, and folate are notable examples. These may not be necessary for the origin of life and may be instead byproducts of a more evolutionarily sophisticated metabolism.

Figure 1.17 Prebiotic synthesis of porphyrins from pyrroles and formaldehyde.




One popular theory for the origin of life posits the existence of an RNA world, a time when RNA molecules played the roles of both catalyst and genetic molecule [104]. A great deal of research has been carried out on the prebiotic synthesis of nucleosides and nucleotides. Although few researchers still consider this idea plausible for the origin of life, it is possible that an RNA world existed as an intermediary stage in the development of life once a simpler self-replicating system had evolved. Perhaps the most promising nucleoside syntheses start from purines and pure D-ribose in drying reactions, which simulate conditions that might occur in an evaporating basin. Using hypoxanthine and a mixture of salts reminiscent of those found in seawater, up to 8% of β-D-inosine is formed, along with the α-isomer. Adenine and guanine gave lower yields, and in both cases a mixture of α- and β-isomers was obtained [105]. Pyrimidine nucleosides have proven to be much more difficult to synthesize. Direct heating of ribose and uracil or cytosine has thus far failed to produce uridine or cytidine. Pyrimidine nucleoside syntheses have been demonstrated which start from ribose, cyanamide, and cyanoacetylene; however, α-D-cytidine is the major product [106]. This can be photoanomerized to β-D-cytidine in low yield; however, the converse reaction also occurs. Sutherland and coworkers [107] demonstrated a more inventive approach, showing that cytidine-3′-phosphate can be prepared from arabinose-3-phosphate, cyanamide, and cyanoacetylene in a one-pot reaction. The conditions may be somewhat forced, and the source of arabinose-3-phosphate is unclear; nevertheless, the possibility remains that more creative methods of preparing the pyrimidine nucleosides may be found. Alternatively, the difficulties with prebiotic ribose synthesis and nucleoside formation have led some to speculate that perhaps a simpler genetic molecule with a more robust prebiotic synthesis preceded RNA.
A number of alternatives have been investigated. Some propose substituting other sugars for ribose. When formed into sugar phosphate polymers, these also often form stable base-paired structures with both RNA/DNA and themselves [77,108–110], opening the possibility of genetic takeover from a precursor polymer to RNA/DNA. These molecules would likely suffer from the same drawbacks as RNA, such as the difficulty of selective sugar synthesis, sugar instability, and the difficulty of nucleoside formation. Recently it has been demonstrated, based on the speculations of Joyce et al. [81] and the chemistry proposed by Nelsestuen [111] and Tohidi and Orgel [112], that backbones based on acyclic nucleoside analogs may be more easily obtained under reasonable prebiotic conditions, for example by the reaction of nucleobases with acrolein during mixed



formose reactions [113]. This remains a largely unexplored area of research. More exotic alternatives to nucleoside formation have been proposed based on the peptide nucleic acid (PNA) analogs of Nielsen and coworkers [114]. Miller and coworkers [115] were able to demonstrate the facile synthesis of all of the components of PNA in very dilute solution or under the same chemical conditions required for the synthesis of the purines or pyrimidines. The assembly of the molecules into oligomers has not yet been demonstrated and may be unlikely due to the instability of PNA to hydrolysis and cyclization [116]. Nevertheless, there may be alternative structures which have not yet been investigated that may sidestep some of the problems with the original PNA backbone.

Nucleotides

Condensed phosphates are the universal biological energy currency; however, abiological dehydration reactions are extremely difficult in aqueous solution due to the high water activity. Phosphate concentrations in the modern oceans are extremely low, partially due to rapid scavenging of phosphates by organisms, but also because of the extreme insolubility of calcium phosphates. Indeed, almost all of the phosphate present on the Earth today is present as calcium phosphate deposits such as apatite. There is some evidence, however, that condensed phosphates are emitted in volcanic fumaroles [117]. An extensive review of the hydrolysis and formation rates of condensed phosphates has not been conducted; however, it has been suggested that condensed phosphates are not likely to be prebiotically available materials [118]. Heating orthophosphate at relatively low temperatures in the presence of ammonia results in a high yield of condensed phosphates [119]. Additionally, trimetaphosphate (TMP) has been shown to be an active phosphorylating agent for various prebiological molecules including amino acids and nucleosides [120,121]. Early attempts to produce nucleotides used organic condensing reagents such as cyanamide, cyanate, or dicyanamide [122]. Such reactions were generally inefficient due to the competition of the alcohol groups of the nucleosides with water in an aqueous environment. Nucleosides can be phosphorylated with acidic phosphates such as NaH2PO4 when dry heated [123]. The reactions are catalyzed by urea and other amides, particularly if ammonium is included in the reaction. Heating ammonium phosphate with urea also gives a mixture of high molecular weight polyphosphates [119]. Nucleosides can be phosphorylated in high yield by heating ammonium phosphate with urea at moderate temperatures, as might occur in a drying basin [124]. For example, by heating uridine with urea and ammonium phosphate yields of phosphorylated nucleosides as high as



70% have been achieved. In the case of purine nucleosides, however, there is also considerable glycosidic cleavage due to the acidic microenvironment created. This is another problem with the RNA world: the synthesis of purine nucleosides is somewhat robust, but nucleotide formation may be difficult, while nucleotide formation from pyrimidine nucleosides is robust, but nucleoside formation appears to be difficult. Hydroxyapatite itself is a reasonable phosphorylating reagent. Yields as high as 20% of nucleotides were achieved by heating nucleosides with hydroxyapatite, urea, and ammonium phosphate [124]. Heating ammonium phosphates with urea leads to a mixture of high molecular weight polyphosphates [119]. Although these are not especially good phosphorylating reagents under prebiotic conditions, they tend to degrade, especially in the presence of divalent cations at high temperatures, to cyclic phosphates such as trimetaphosphate, which have been shown to be promising phosphorylating reagents [121]. cis-Glycols react readily with trimetaphosphate under alkaline conditions to yield cyclic phosphates, and the reaction proceeds somewhat reasonably under more neutral conditions in the presence of magnesium cation [125].

HYDROTHERMAL VENTS AND THE ORIGIN OF LIFE

The discovery of hydrothermal vents at the oceanic ridge crests and the appreciation of their significance in the element balance of the hydrosphere represents a major development in oceanography [126]. Since the process of hydrothermal circulation probably began early in the Earth’s history, it is likely that vents were present in the Archean oceans. Large amounts of ocean water now pass through the vents, with the whole ocean going through them every 10 million years [127]. This flow was probably greater during the early history of the Earth, since the heat flow from the planet’s interior was greater during that period. The topic has received a great deal of attention, partly because of doubts regarding the oxidation state of the early atmosphere. Following the first report of the vents’ existence, a detailed hypothesis suggesting a hydrothermal emergence of life was published [128], in which it was suggested that amino acids and other organic compounds are produced during passage through the temperature gradient of the 350 °C vent waters to the 0 °C ocean waters. Polymerization of the organic compounds thus formed, followed by their self-organization, was also proposed to take place in this environment, leading to the first forms of life. At first glance, submarine hydrothermal springs would appear to be ideally suited for creating life, given the geological plausibility of a hot early Earth. More than a hundred vents are known to exist along



the active tectonic areas of the Earth, and at least in some of them catalytic clays and minerals interact with an aqueous reducing environment rich in H2, H2S, CO, CO2, and perhaps HCN, CH4, and NH3. Unfortunately it is difficult to corroborate these speculations with the findings of the effluents of modern vents, as a great deal of the organic material released from modern sources is diagenized biological material, and it is difficult to separate the biotic from the abiotic components of these reactions. Much of the organic component of hydrothermal fluids may be formed from diagenetically altered microbial matter. So far, the most articulate autotrophic hypothesis stems from the work of Wächtershäuser [129,130], who has argued that life began with the appearance of an autocatalytic, two-dimensional chemolithotrophic metabolic system based on the formation of the highly insoluble mineral pyrite (FeS2). The reaction FeS + H2S → FeS2 + H2 is very favorable. It is irreversible and highly exergonic with a standard free energy change ∆G° = −9.23 kcal/mol, which corresponds to a reduction potential E° = −620 mV. Thus, the FeS/H2S combination is a strong reducing agent, and has been shown to provide an efficient source of electrons for the reduction of organic compounds under mild conditions. The scenario proposed by Wächtershäuser [129,130] fits well with the environmental conditions found at deep-sea hydrothermal vents, where H2S, CO2, and CO are quite abundant. The FeS/H2S system does not reduce CO2 to amino acids, purines, or pyrimidines, although there is more than adequate free energy to do so [131]. However, pyrite formation can produce molecular hydrogen, and reduce nitrate to ammonia, and acetylene to ethylene [132]. More recent experiments have shown that the activation of amino acids with carbon monoxide and (Ni,Fe)S can lead to peptide-bond formation [133].
In these experiments, however, the reactions take place in an aqueous environment to which powdered pyrite has been added; they do not form a dense monolayer of ionically bound molecules or take place on the surface of pyrite. None of the experiments using the FeS/H2S system reported so far suggests that enzymes and nucleic acids are the evolutionary outcome of surface-bound metabolism. The results are also compatible with a more general model of the primitive soup in which pyrite formation is an important source of electrons for the reduction of organic compounds. It is possible that under certain geological conditions the FeS/H2S combination could have reduced not only CO but also CO2 released from molten magma in deep-sea vents, leading to biochemical monomers [134]. Peptide synthesis could have taken place in an iron and nickel sulfide system [133] involving amino acids formed by electric discharges via a Strecker-type synthesis, although this scenario requires the transportation of compounds formed at the surface to the deep-sea vents [135]. It seems likely that concentrations of reactants
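The free energy and reduction potential quoted above for pyrite formation can be related through ΔG = −nFΔE. The following sketch reproduces that arithmetic under two assumptions that are introduced here and are not spelled out in the text: a two-electron transfer, and the H+/H2 couple at pH 7 and 25 °C (about −0.414 V) as the electron acceptor. The function and variable names are illustrative.

```python
F = 96485.0          # Faraday constant, C/mol
KCAL_TO_J = 4184.0   # joules per kilocalorie

def cell_potential(delta_g_kcal, n_electrons):
    """Potential difference (V) for a reaction with free energy delta_g_kcal (kcal/mol)."""
    return -(delta_g_kcal * KCAL_TO_J) / (n_electrons * F)

# Assumptions (not from the text): n = 2 electrons, and the H+/H2 couple
# at pH 7 and 25 °C, E = -0.0592 V * 7, roughly -0.414 V.
N_ELECTRONS = 2
E_HYDROGEN_PH7 = -0.0592 * 7

# Driving force of FeS + H2S -> FeS2 + H2, using the quoted free energy
delta_e = cell_potential(-9.23, N_ELECTRONS)

# Potential of the electron-donating FeS2/(FeS + H2S) couple:
# delta_e = E(acceptor) - E(donor), so E(donor) = E(acceptor) - delta_e
e_pyrite_couple = E_HYDROGEN_PH7 - delta_e

print(f"cell potential: {delta_e * 1000:.0f} mV")
print(f"FeS2/(FeS + H2S) couple: {e_pyrite_couple * 1000:.0f} mV")
```

Under these assumptions the computed donor potential falls within about 10 mV of the −620 mV quoted above, which suggests the two figures are mutually consistent.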



would be prohibitively low based on second-order reaction kinetics. If the compounds synthesized by this process did not remain bound to the pyrite surface, but drifted away into the surrounding aqueous environment, then they would become part of the prebiotic soup, not of a two-dimensional organism. In general, organic compounds are decomposed rather than created at hydrothermal vent temperatures, although of course temperature gradients exist. As has been shown by Sowerby and coworkers [136], concentration on mineral surfaces would tend to concentrate any organics created at hydrothermal vents in cooler zones, where other reaction schemes would need to be appealed to. The presence of reduced metals and the high temperatures of hydrothermal vents have also led to suggestions that reactions similar to those in Fischer–Tropsch-type (FTT) syntheses may be common under such regimes. It is unclear to what extent this is valid, as typical FTT catalysts are easily poisoned by water and sulfide. It has been argued that some of the likely environmental catalysts such as magnetite may be immune to such poisoning [137].

Stability of Biomolecules at High Temperatures

A thermophilic origin of life is not a new idea. It was first suggested by Harvey [138], who argued that the first life forms were heterotrophic thermophiles that had originated in hot springs such as those found in Yellowstone Park. As underlined by Harvey, the one advantage of high temperatures is that the chemical reactions could go faster and the primitive enzymes could have been less efficient. However, high temperatures are destructive to organic compounds. Hence, the price paid is loss of biochemical compounds to decomposition. Although some progress has been made in synthesizing small molecules under hydrothermal vent type conditions, the larger trend for biomolecules at high-temperature conditions is decomposition. As has been demonstrated by various authors, most biological molecules have half-lives to hydrolysis on the order of minutes to seconds at the high temperatures associated with hydrothermal vents. As noted above, ribose and other sugars are very thermolabile compounds [79]. The stability of ribose and other sugars is problematic, but pyrimidines and purines, and many amino acids, are nearly as labile. At 100 °C the half-life for deamination of cytosine is 21 days, and 204 days for adenine [139,140]. Some amino acids are stable, for example, alanine with a half-life for decarboxylation of approximately 19,000 years at 100 °C, but serine decarboxylates to ethanolamine with a half-life of 320 days [141]. White [142] measured the decomposition of various compounds at 250 °C and pH 7 and found half-lives of amino acids from 7.5 s to 278 min, half-lives for peptide bonds from

>2c and l and 2c are constants while r grows with the length of the target sequence, the computation time is dominated by the (r ∗ l) term which is significantly more efficient than the previous (r ∗ l)² time. Unfortunately most long target DNA sequences of interest do not satisfy the k-mer uniqueness assumption for practical values of k.
In fact, a sizable portion of many target sequences constitutes ubiquitous repeats where k-mers not only occur more than once but occur many

Figure 3.4 Representation of overlap phase involving 2 ∗ n ∗ n edit graphs (r–j represents the reverse complement of rj).



times, and the number of occurrences grows with target sequence length. In the extreme case, where a single k-mer occurs in every fragment, the problem reverts to computing an overlap between every pair of fragments and taking (r ∗ l)² time. In practice some k-mers occur too frequently to be used in an efficient overlap computation, so most shotgun fragment assemblers of recent vintage impose some maximum threshold on the number of fragments in which a k-mer can occur before it is no longer used to seed potential overlaps. Not investigating these potential overlaps may increase the number of false negative overlaps, but since most of the potential overlaps in these highly abundant k-mers must come from different copies of repeats, the number of false positive overlaps should be reduced even more. For all but the most similar repeats, less frequently occurring k-mers will exist due to the differences in the repeat copies, and most of the true positive overlaps based on these will be detected. A novel approach to determining fragment overlaps [28] was inspired by a method called sequencing by hybridization (SBH) which attempts to reconstruct sequences from the set of their constituent k-mers [29,30]. A k-mer graph represents the fragments and implicitly their overlaps by depicting each k-mer in any fragment as a directed edge between its (k−1)-mer prefix and (k−1)-mer suffix. A fragment is simply a path in the graph starting with the edge representing the first k-mer in the fragment, proceeding in order through the edges representing the rest of the k-mers in the fragment, and ending with the edge representing the last k-mer in the fragment. The fragment overlaps are implicit in the intersections of the fragment paths. No explicit criterion is given to specify the quality of intersection that would constitute an overlap. Rather, the fragment layout problem is solved directly using the k-mer graph and fragment paths.
Generating the k-mer graph and fragment paths is more efficient than computing overlaps but requires a much greater amount of computer memory. The amount of memory is currently prohibitive for long target DNA sequences but computer memory continues to become denser and cheaper, so this constraint may soon be surmounted.

LAYOUT PHASE
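The two ideas in this paragraph, seeding candidate overlaps from shared k-mers subject to a frequency cap, and representing fragments as paths in a k-mer graph, can be sketched in a few lines. This is a minimal illustration rather than the method of any particular assembler: the names `seed_pairs`, `kmer_graph`, `fragment_path`, and the `max_occ` parameter are hypothetical, only the forward strand is considered, and sequencing errors are ignored.

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k):
    """Yield every k-mer of seq in order."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def seed_pairs(fragments, k, max_occ):
    """Pairs of fragment indices sharing a k-mer found in at most max_occ fragments."""
    index = defaultdict(set)            # k-mer -> indices of fragments containing it
    for idx, frag in enumerate(fragments):
        for kmer in kmers(frag, k):
            index[kmer].add(idx)
    pairs = set()
    for hits in index.values():
        if len(hits) <= max_occ:        # skip k-mers that look like ubiquitous repeats
            pairs.update(combinations(sorted(hits), 2))
    return pairs

def kmer_graph(fragments, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it is joined to."""
    edges = defaultdict(set)
    for frag in fragments:
        for kmer in kmers(frag, k):
            edges[kmer[:-1]].add(kmer[1:])
    return edges

def fragment_path(frag, k):
    """The fragment as an ordered list of k-mer edges (prefix, suffix)."""
    return [(kmer[:-1], kmer[1:]) for kmer in kmers(frag, k)]

frags = ["ACGTAC", "GTACGG", "TTTACG"]
print(seed_pairs(frags, k=4, max_occ=2))
print(dict(kmer_graph(frags, k=4)))
print(fragment_path(frags[0], k=4))
```

Lowering `max_occ` discards more candidate pairs, which mirrors the trade-off described above between missed true overlaps and avoided repeat-induced ones.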

The overlap graph is a standard representation in fragment assembly algorithms in which the vertices represent fragments and the edges represent overlaps. An innovative, nonstandard approach represents fragments as a pair of vertices, one for each end of the fragment, and a fragment edge joining the vertices. To understand the reason that this is desirable, one must understand the nature of DNA sequence fragments. DNA is a double-stranded helix in which nucleotides (made of a molecule of sugar, a molecule of phosphoric acid, and a molecule called a base) on one strand are joined together along the sugar-phosphate backbone.

Shotgun Fragment Assembly


A second strand of nucleotides runs antiparallel to the first with the directionality of the sugar-phosphate backbone reversed and each base (one of the chemicals adenine, thymine, guanine, and cytosine, represented by the letters A, T, G, and C, respectively) bonded to a complementary base at the same position on the other strand (A bonds to T and G bonds to C). Each strand has a phosphate at one end called the 5’ (five prime) end and a sugar at the other end called the 3’ (three prime) end. Since the strands are antiparallel, the 5’ end of one strand is paired with the 3’ end of the other strand. When given the DNA sequence of one strand (a string of letters from the alphabet {A,C,G,T}), the sequence of the other strand starting from the 5’ end can be generated by starting from the 3’ end of the given strand and writing the complement of each letter until the 5’ end is reached (see figure 3.5).
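The strand conversion just described, reading the given strand from its 3′ end and writing the complement of each letter, can be sketched as a short function. The name `reverse_complement` is chosen here for illustration, and only the four unambiguous letters are handled.

```python
# Watson-Crick pairing: A bonds to T, G bonds to C
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the opposite strand of seq, written from its 5' end to its 3' end."""
    # Walk the given strand from its 3' end back to its 5' end,
    # emitting the complementary base at each position.
    return "".join(COMPLEMENT[base] for base in reversed(seq))

strand = "AACGT"
print(reverse_complement(strand))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.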

Figure 3.5 DNA double helix (A) and diagram (B) distinguishing the sugar-phosphate backbone from nitrogenous bases and showing the 5′ and 3′ ends.



This process is called reverse complementing a sequence, and the sequence generated is called its reverse complement. In shotgun sequencing, multiple copies of a large target DNA molecule are sheared into double-stranded clones that are then inserted into a sequencing vector with a random orientation. Each end of the clone can then be sequenced as a fragment. Current sequencing technology determines the sequence of a fragment from the 5’ to the 3’ end starting from a known position (sequencing primer) in the sequencing vector near the insertion site of the clone (this can be done for just one strand or for both using a second sequencing primer site on the opposite side and strand of the clone insertion site). Each fragment sequence thus has an implied 5’ and 3’ end (see figures 3.1 and 3.5). Two fragments from different copies of the target DNA that share some of the same region are said to overlap, but due to the random strand orientation of the clone in the sequencing vector, the two overlapping fragments’ sequences may both be from the same strand of the target DNA, or one from each of the two strands. If the two sequences are from the same strand, the 3’ end of one sequence will overlap the 5’ end of the other. If the two sequences are from opposite strands, then either the two 5’ ends will overlap or the two 3’ ends will overlap (with one or the other fragment reverse complemented). In order to represent this in an overlap graph where one vertex corresponds to one fragment, the edges must be “super”-directed. In a directed graph, an arrow on one end of the edge represents that the edge goes from vertex a to vertex b or vice versa: a super-directed graph imparts additional information on the edge, in this case which end of vertex a (5’ or 3’) goes to which end of vertex b. This can be drawn as a bidirected line with an arrowhead at each vertex with the arrowheads oriented independently of each other toward or away from a vertex (see figure 3.6A and B, and [4]). 
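To make the end-to-end cases concrete, the sketch below tests two fragment sequences for exact dovetail overlaps in each orientation and reports which fragment ends are involved. It is a simplified illustration under stated assumptions: overlaps must match exactly (no sequencing errors or indels), and the names `suffix_prefix`, `overlap_ends`, and `min_len` are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def suffix_prefix(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_ends(a, b, min_len=3):
    """List (end of a, end of b, length) for each exact dovetail overlap found."""
    rc_a, rc_b = reverse_complement(a), reverse_complement(b)
    candidates = [
        ("3'", "5'", suffix_prefix(a, b, min_len)),     # same strand, a precedes b
        ("5'", "3'", suffix_prefix(b, a, min_len)),     # same strand, b precedes a
        ("3'", "3'", suffix_prefix(a, rc_b, min_len)),  # opposite strands, 3' ends meet
        ("5'", "5'", suffix_prefix(rc_a, b, min_len)),  # opposite strands, 5' ends meet
    ]
    return [c for c in candidates if c[2] > 0]

a = "AAACCGT"
b = "CGTTTGG"
print(overlap_ends(a, b))                       # fragments read from the same strand
print(overlap_ends(a, reverse_complement(b)))   # second fragment read from the other strand
```

Each returned tuple carries exactly the information a super-directed edge needs: the two fragment ends and the overlap length.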
In the bidirected overlap graph the directions of the arrows at a vertex effectively divide the set of edges touching it into overlaps involving the 5’ end of the fragment and overlaps involving the 3’ end of the fragment. A dovetail path in the bidirected overlap graph is constrained, by rules on the arrowheads, to require that consecutive edges involve opposite ends of the fragment they have in common [4]. A representation using two vertices per fragment (see figure 3.6C), one vertex for each fragment end and a connecting fragment edge, explicitly represents the fragment ends, allows all edges (overlap and fragment) to be undirected, and defines a dovetail path as a path in the overlap graph that traverses a fragment edge, then zero or more ordered pairs of (overlap edge, fragment edge). The information inherent in an overlap is simply the two fragment ends involved and the length of the overlap. When the subsequences of the two fragment ends that overlap are constrained to be identical, the length of the overlap is simply the length of the shared subsequence; if



Figure 3.6 Super-directed graph of reads from figure 3.2 (A), reduced super-directed graph with unitigs circled (B), and alternative, undirected representation of reduced graph (C).

some variation in the subsequences is allowed, particularly insertions and deletions (indels), then the length of the overlap must be represented by a 2-tuple of the lengths of both subsequences aligned in the overlap to be complete. An equivalent representation of overlap length used in [4] is the lengths of the subsequences in each fragment not aligned in the overlap (which is just the length of the fragment minus the length of the subsequence aligned in the overlap). These lengths are called overhangs, or hangs for short. Additional information can be retained for each edge, such as the edit distance to convert the sequence



of one fragment end to the other fragment end, or some likelihood/probability estimate that the overlap is true. The layout problem is to find a maximal set of edges (overlaps) that are consistent with each other. At the heart of a consistent set of overlaps is a pairwise alignment of two fragments. If a multiple sequence alignment of fragments includes all of the pairwise fragment alignments, then the overlaps are consistent. A more formal approach is to consider the target sequence laid along a line with integer coordinates from 1 to n. Each fragment (subsequence of the target) is viewed as an interval on this line (see figures 3.2, 3.7, and 3.8). A true overlap between a pair of fragments occurs if and only if the fragments’ intervals intersect. This implies that an overlap graph that contains all of the true overlaps (edges) and no false overlaps must be an interval graph (an interval graph is just a term for a graph which meets the intersection of intervals definition above where vertices are intervals and edges are intersections [31,32]). The layout problem can then be viewed as finding a maximal subset of edges (or subgraph) of the overlap graph that forms an interval graph. This criterion was established based on the observation that overlaps must meet the triangle condition of interval graphs [14]. The triangle condition states that if the intersections of intervals i and j and of intervals j and k are known, then the intersection of intervals i and k is completely determined (see figure 3.7). Most assemblers do not require the layout solution to be an interval graph, but rather (first setting aside fragment intervals that are contained in other fragment intervals) that the layout graph must be a subgraph of an interval graph in which the maximal intersections for each fragment end are retained (see figure 3.6).
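The triangle condition lends itself to a small worked check. The sketch below assumes three fragments lying in left-to-right order i, j, k on the same strand, with exact dovetail overlaps and no indels, so that each overlap length fixes the offset between fragment starts. These restrictions and the function names are assumptions made for illustration.

```python
def overlap_length(offset, len_a, len_b):
    """Overlap of two intervals when b starts `offset` positions after a (offset >= 0)."""
    return max(0, min(len_a, offset + len_b) - offset)

def implied_overlap(len_i, len_j, len_k, ov_ij, ov_jk):
    """Overlap of i and k implied by the dovetail overlaps i-j and j-k."""
    offset_ij = len_i - ov_ij   # start of j relative to start of i
    offset_jk = len_j - ov_jk   # start of k relative to start of j
    return overlap_length(offset_ij + offset_jk, len_i, len_k)

def is_consistent(len_i, len_j, len_k, ov_ij, ov_jk, ov_ik, tolerance=0):
    """Triangle check: does a reported i-k overlap agree with the implied one?"""
    expected = implied_overlap(len_i, len_j, len_k, ov_ij, ov_jk)
    return abs(expected - ov_ik) <= tolerance

# Three fragments of length 100 starting at positions 0, 40, and 70
print(implied_overlap(100, 100, 100, ov_ij=60, ov_jk=70))
print(is_consistent(100, 100, 100, ov_ij=60, ov_jk=70, ov_ik=30))
print(is_consistent(100, 100, 100, ov_ij=60, ov_jk=70, ov_ik=55))
```

A reported overlap that fails this check is a candidate false positive, for instance one induced by a repeat; the `tolerance` argument leaves room for small discrepancies when alignments are inexact.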
For SCS, it is clear that fragments that are substrings of other fragments (called contained fragments) need not be considered because a superstring of the noncontained fragments necessarily includes the contained fragments as substrings, and the superstring cannot be made shorter by adding additional constraints. Most assemblers follow this

Figure 3.7 Triangle condition of intervals showing that i ∩ j and j ∩ k implies i ∩ k.

Shotgun Fragment Assembly


approach of setting aside contained fragments and then placing them at the end of the layout phase. We will follow [21] in distinguishing between containment overlaps, in which the overlap completely contains one of the two fragments, and dovetail overlaps, in which only one end of each fragment is included in the overlap. Once contained fragments and their containment overlaps are set aside, the overlap graph contains only dovetail overlaps. Consequently, we will generally refer to dovetail overlaps simply as overlaps. Given the set of these remaining overlaps, the interval of each fragment has at most two maximal intersections, each involving one of its fragment ends, with the intervals of two other fragments. The noncontained fragment intervals can be ordered from left to right along the target sequence for fragments $f_i$, $i = 1$ to $r_{nc}$ ($r$ is the total number of fragments, $r_{nc}$ is the number of noncontained fragments), such that $b_1 < b_2 < \dots < b_i < \dots < b_{r_{nc}}$ and $e_1 < e_2 < \dots < e_i < \dots < e_{r_{nc}}$ (recall that $b_i$ is the beginning position of fragment $f_i$ and $e_i$ the end). The length of the union of the first two fragment intervals is $l_1 + l_2 - o_{1,2}$, where $l_i$ is the length of $f_i$ and $o_{i,j}$ is the length of the overlap between $f_i$ and $f_j$. Building on this, the length of the target sequence, $n$, can be written as $n = \sum_{i=1}^{r_{nc}} l_i - \sum_{i=1}^{r_{nc}-1} o_{i,i+1}$.
A desired solution in the overlap graph would be (again following [4] and [21]) a dovetail chain/path that traverses a fragment edge, followed by zero or more ordered pairs of (overlap edge, fragment edge). Fragment edges would represent the fragments $f_1$ to $f_{r_{nc}}$ and overlap edges would represent the overlaps $o_{1,2}$ to $o_{r_{nc}-1,\,r_{nc}}$. With the sum of the fragment lengths a constant, maximizing the total length of the overlaps between adjacent fragments is equivalent to minimizing the length of the reconstructed string. Recall that finding an optimal solution to SCS is NP-hard, which is why a simple greedy algorithm is often used. The standard approach is to sort all of the overlaps by length and seed the solution with the longest overlap. The rest of the overlaps are then visited in decreasing order of length; each is added to the solution if neither of its fragment ends is already in an overlap that is part of the solution, and discarded otherwise. The constraint that each fragment end can be involved in only one overlap guarantees that the solution will be a set of dovetail paths which must be a subgraph of an interval graph. A similar greedy approach was used in [14], but after the solution was generated, all discarded edges were checked for consistency with the dovetail paths solution using the triangle condition, and discrepancies were reported. Recall that an overlap graph with all true overlaps (no false positive or false negative overlaps) is an interval graph and any dovetail path in it will produce a correct solution. If some short true overlaps are not detected but longer overlaps cover the same intervals, the result is still a subgraph of an interval graph, and any dovetail path through it is a



correct solution. If, however, a short true overlap is missed and no other overlap covers the interval, this will create an apparent gap in the solution. The same effect is produced by a lack of fragment coverage anywhere along the target sequence, and this is just intrinsic to the random sampling of the fragments [33]. The simple greedy solution fails when repeated regions of the target sequence generate false positive overlaps. If the copies of a repeat along the sequence (line) are indistinguishable, the associated graph is no longer an interval graph. In effect, the different copies end up being “glued together” [34], creating loops in the line and cycles in the graph. Within such a repeat, any given maximal overlap may not be a true overlap. Because the fragment intervals in the repeat are glued together there will be interleaving of fragment intervals from different copies of the repeat. This guarantees that some maximal overlaps will be false. The SCS solution fails to take repeats into account, and meeting the shortest superstring criterion actually compresses exact repeat copies into a single occurrence in the superstring (see figure 3.8). The first approach to achieving a correct layout solution in the presence of repeats is to reduce the number of false positive overlaps as much as possible. If the repeats are exact copies, then nothing can be done for

Figure 3.8 Example showing that the SCS solution can be overcompressed, misordered, and disconnected.



overlaps within these regions, but most repeats have some level of discrepancy between copies. As discussed in the overlap phase section, if the level of repeat discrepancy is significantly greater than the differences due to sequencing error (or corrected sequencing error, if using error correction), then false overlaps can be distinguished and kept out of the overlap graph. The Phrap assembler applies this technique aggressively and with good results [35]. Phrap uses quality values (estimates of error probability at each base call) to differentiate sequencing error from repeat copy differences. In and of itself, this provides a large advantage over assemblers not using any form of sequence error correction. The really aggressive aspect of Phrap’s approach is to use a maximum likelihood test to choose between two models: that the two fragments are from the same interval of the target sequence with differences due to sequencing error, or that the two fragments are from different copies of a repeat. The test includes a tunable parameter for the expected number of differences per length, typically 1 to 5 per 100 base pairs. This approach rejects many more false positive overlaps than would a test that an overlap is due to random chance. It also results in more false negative overlaps, but the tradeoff often provides very good results. Perhaps the furthest this repeat separation solution can be pushed is to use correlated differences gleaned from a multiple sequence alignment of fragments [36–39]. The key concept is that differences between copies of a repeat (called distinguishing base sites or defined nucleotide positions) will be the same for fragments from different copies (correlated), whereas sequencing errors will occur at random positions (uncorrelated). This method starts with a multiple sequence alignment of fragments and finds columns with discrepancies (called separating columns) (see figure 3.9). 
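A minimal sketch of locating separating columns follows. It assumes the multiple sequence alignment is supplied as equal-length strings, with '.' marking positions a fragment does not cover, and it requires each competing base to be seen in at least `min_support` fragments so that an isolated sequencing error does not create a column. That threshold and the function name are assumptions made for illustration; they are not taken from the methods cited in [36–39].

```python
from collections import Counter

def separating_columns(rows, min_support=2):
    """Return indices of alignment columns with at least two well-supported bases.

    rows: equal-length aligned fragment strings; '.' means "not covered".
    min_support: minimum number of fragments that must carry a base for it
                 to count (an assumed heuristic to ignore lone errors).
    """
    columns = []
    for c in range(len(rows[0])):
        counts = Counter(row[c] for row in rows if row[c] != ".")
        supported = [base for base, n in counts.items() if n >= min_support]
        if len(supported) >= 2:
            columns.append(c)
    return columns

# Toy alignment in the spirit of figure 3.9: fragments 1-5 and 6-9 differ
# at columns 1 and 4; fragment 4 carries a lone error at column 3.
example = ["ACCGTTA"] * 3 + ["ACCTTTA", "ACCGTTA"] + ["AACGGTA"] * 4
print(separating_columns(example))
```

In the toy alignment the lone error in the fourth fragment is ignored because only one fragment supports it, while the two correlated columns are reported.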
Each separating column partitions the set of fragments spanning it into two or more classes. When a set of fragments spans multiple separating columns the partitions can be tested for consistency. If the partitioning

Figure 3.9 Correlated differences (C and T in sequences 1–5, A and G in sequences 6–9 in the same column pairs) supporting repeat separation and an uncorrelated sequencing error in sequence 4.



is consistent, then the columns are correlated and the differences are much more likely a result of repeat differences than sequencing error. The correlated partitioning test can be either heuristic or based on a statistical model. The test can be applied either immediately after the overlap phase as another filter to remove false positive overlaps, or after an initial portion of the layout phase has produced a contig (short for contiguous sequence) containing fragments from different repeat copies that need to be separated. Ultimately, however, large and highly complex genomes have repeat copies that are identical, or at least sufficiently similar, such that any approach to repeat separation based on repeat copy differences must fail. The second approach to repeat resolution in the overlap graph is to first recognize what portions of the overlap graph or which sets of fragments are likely to be from intervals of the target sequence that are repeats. One distinguishing feature of repeat fragments that has been widely recognized [40–43] is that they will have, on average, proportionately (to the repeat copy number) more overlaps than fragments from unique regions of the target sequence. The number of overlaps for a fragment or fragment end is a binomially distributed random variable parameterized by the number and length of the fragments and the length of the target sequence (if the fragments are randomly uniformly sampled from the target sequence) [44]. This binomial distribution is usually well approximated by the Poisson distribution [33]. Unfortunately the intersection between the distributions of repeat and unique fragments is large for repeats with a small number of copies (e.g., 2 or 3) when the coverage ratio of total length of fragments to target sequence length is 5 to 15, which is standard for shotgun assembly projects. 
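The size of this intersection can be explored numerically. The sketch below assumes that the number of overlaps at a fragment end is Poisson distributed with mean `lam_unique` for unique sequence and `copies * lam_unique` for a repeat, which is a simplification of the model in [33,44]; the function names and the idea of a single count threshold are illustrative only.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_cdf(k, lam):
    """Probability of at most k events."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

def error_rates(threshold, lam_unique, copies):
    """Error rates when a fragment end is called 'repeat' if count > threshold.

    Returns (fraction of unique ends called repeat,
             fraction of repeat ends called unique).
    """
    false_repeat = 1.0 - poisson_cdf(threshold, lam_unique)
    missed_repeat = poisson_cdf(threshold, copies * lam_unique)
    return false_repeat, missed_repeat

# Scan thresholds for a two-copy repeat with a mean of 5 overlaps per unique end.
for t in range(4, 11):
    print(t, error_rates(t, 5.0, 2))
```

For a two-copy repeat at this depth, scanning the thresholds shows that lowering one error rate raises the other, and neither becomes negligible at the same threshold.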
Thus, setting a threshold in terms of a number of overlaps to differentiate between repeat and unique fragments will lead to a high number of false positives, false negatives, or both. A different property of repeat regions has also been widely recognized and has already been mentioned above: inconsistent triples of fragments and their overlaps that do not meet the triangle condition [14]. The triangle condition is violated at the boundaries of repeat regions (see figure 3.10). A particularly elegant approach to distinguishing these repeat boundaries in the overlap graph is to remove overlaps (edges) in those portions of the graph that have interval graph properties. This reduces these intervals to single dovetail chains [4]. The overlaps that are removed can be reconstructed from a pair of other overlaps (see figure 3.6A and B). This is also known as chordal graph reduction in interval graphs. The size and complexity of the overlap graph is greatly reduced using this method. The only branching in the reduced graph occurs where fragments cross repeat boundaries. The dovetail chains in the reduced overlap graph have no conflicting choice of layout positions, so they can be represented as contigs called chunks [4]



Figure 3.10 Reads from figure 3.2 that do not meet the triangle condition (A) and associated unique/repeat boundary detection via sequence alignment shown at the fragment level (B) and at the sequence level (C).

or unitigs [45]. The overlap graph is thus transformed into a chunk or unitig graph that has the same branching structure as the reduced overlap graph. A unitig can comprise fragments from a single interval of the target sequence (a unique unitig), from multiple intervals that have been collapsed or glued together (a repeat unitig), or unfortunately a combination of the two. A combined unitig (part unique, part repeat) occurs only when the boundary between the end of the repeat and the unique sequence is not detected for at least two copies of the repeat. This can occur due to low sequence coverage (the boundary is not sampled) or some failure in overlap detection. For deep sequence coverage, combined unitigs are rare, so we will set aside this problem for now. If we can distinguish unique from repeat unitigs, the layout problem will be greatly simplified. There have been two complementary approaches for differentiating unique from repeat unitigs. One method looks at the density of fragments within the unitig (sometimes called arrival rate) and determines the probability or likelihood ratio that the fragment density is consistent with randomly sampling from a single interval or multiple intervals [45]. This is analogous to the previous approach of determining that a fragment end is in a repeat based on the number of overlaps that include it. The density measure becomes much more powerful than the fragment overlap measure as the length of the unitig increases because the density distributions for unique and repeat unitigs intersect less as unitig length increases. The separation power of the density measure also increases with coverage depth of the random sampling in the same fashion as the fragment overlap measure.
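One way to turn the density measure into a number is a log-likelihood ratio under a Poisson arrival model. The sketch below compares the hypothesis that a unitig is a single interval against the hypothesis that it is two collapsed copies; the two-copy alternative, the use of the global fragment rate, and the function name are assumptions of this sketch and are not claimed to be the exact statistic of [45].

```python
import math

def unique_log_odds(n_frags, unitig_len, total_frags, genome_len):
    """Natural-log ratio P(n_frags | one copy) / P(n_frags | two copies).

    Fragment starts are assumed to arrive as a Poisson process with rate
    total_frags / genome_len per base. With expected count E for one copy
    and 2E for two copies, the ratio simplifies to E - n_frags * ln 2.
    Positive values favor a unique unitig, negative values a repeat.
    """
    rate = total_frags / genome_len
    expected = rate * unitig_len
    return expected - n_frags * math.log(2)

# Same density (1 start per 100 bases), short versus long unitig.
print(unique_log_odds(5, 500, 1000, 100_000))
print(unique_log_odds(100, 10_000, 1000, 100_000))
```

At equal fragment density the magnitude of the score grows linearly with unitig length, which is the sense in which the measure becomes more powerful for longer unitigs.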



Figure 3.11 Short repeats with spanning reads (A) that produce a unique, reducible layout graph pattern (B) and the corresponding solvable pattern in the k-mer graph approach (C).

A second method looks at the local branching structure of the unitig graph [24]. Typically each end of a unique unitig has either a single edge (overlap) with another unitig or no edge if it abuts a gap in the fragment coverage. In contrast, both ends of a repeat unitig have multiple edges to other unitigs representing the different unique intervals that flank each copy of the collapsed repeat (see figure 3.6). Unfortunately, there are rare as well as more common counterexamples to this simple rule (see figures 3.11 and 3.12). One approach to overcoming the branching associated with short repeats is to look instead at mate pair branching (see figure 3.13) [24]. Mate pairs should appear in the layout with a known orientation and distance between them. If multiple sets of mate pairs have one fragment in a repeat unitig and the other in different unique flanking unitigs, then from the perspective of the repeat unitig, the unique unitigs should all occupy the same interval. Just as with the overlap branching, this pattern would identify the unitig as a repeat with multiple different flanking regions. Another elegant approach for constructing the unitig/chunk/repeat graph is based on the k-mer structure of fragments rather than explicit fragment overlaps [25,28]. Recall from the overlap phase section that a k-mer graph represents the fragments, and implicitly their overlaps, in terms of the set of sequenced k-mers, which are drawn as edges between the prefix (k−1)-mer and suffix (k−1)-mer of each k-mer. A fragment is just a path in the graph starting with the edge representing the first k-mer in the fragment, proceeding in order through the rest of the edges representing k-mers in the fragment, and ending with the edge representing the last k-mer in the fragment. If there is no sequencing

Figure 3.12 Example of repeats within repeats and unique between repeats (A) and unitig graph showing unique/repeat multiplicities (B).

Figure 3.13 Initial scaffold graph from figure 3.2 example (A), reduced scaffold graph (B), and final scaffold (C).



error, or the sequencing error can be corrected as discussed in the overlap phase section, then the branching structure in the k-mer graph is largely the same as that for the unitig graph, so that branching occurs only where a k-mer crosses a repeat boundary. A few minor differences do exist between the k-mer graph and the unitig graph due to the vertices and edges representing slightly different objects. In the k-mer graph a vertex represents a fixed length (k−1) interval of the target sequence; in the unitig graph (or more correctly the reduced overlap graph before dovetail paths are coalesced into unitigs) a vertex represents a variable length (a fragment length) interval of the target sequence which is usually at least an order of magnitude larger than k–1. In the k-mer graph an edge represents a fixed length (k) interval of the target sequence which contains the two intervals represented by the two vertices it connects; in the unitig graph an edge represents a variable length interval of the target sequence which is the intersection or overlap of the two intervals represented by the two vertices it connects. In the k-mer graph any (k−1)-mer that occurs multiple times in the target sequence will be represented by a single vertex. At the boundary of a repeat where the next (k−1)-mer occurs only once in the target sequence, the k-mer graph must branch with edges to each of the unique (k−1)-mers flanking each copy of the repeat. As a result, repeat boundaries are known precisely in the k-mer graph. In particular, a vertex at the end of a repeat will have edges to several neighboring vertices across the repeat boundary. The neighboring vertices that are in unique sequence will have only one edge in the direction of the repeat boundary, to the last vertex in the repeat. This overcomes the short repeat branching pattern encountered in unitig graphs (see figure 3.11) but not the repeat within a repeat branching pattern (see figure 3.12). 
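The branching behavior described above can be seen in a small sketch that builds a k-mer graph from error-free fragments. It considers a single strand only and ignores reverse complements and error correction, both simplifying assumptions; the function names are illustrative.

```python
from collections import defaultdict

def kmer_graph(fragments, k):
    """Build successor and predecessor maps over (k-1)-mer vertices.

    Each distinct k-mer contributes one edge from its prefix (k-1)-mer
    to its suffix (k-1)-mer; repeated k-mers collapse onto the same edge.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            kmer = frag[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred

def branching_vertices(succ, pred):
    """Vertices with more than one outgoing or more than one incoming edge."""
    vertices = set(succ) | set(pred)
    return {v for v in vertices
            if len(succ.get(v, ())) > 1 or len(pred.get(v, ())) > 1}

# Target with the repeat TGCA occurring twice between unique flanks.
target = "AACTGCAGGTTGCACCA"
fragments = [target[0:9], target[6:17]]
succ, pred = kmer_graph(fragments, k=4)
print(sorted(branching_vertices(succ, pred)))
```

With k = 4 the only branching vertices are the first and last (k−1)-mers of the repeat, so the repeat boundaries are located exactly.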
The short repeat pattern occurs in the unitig graph because, even though each fragment interval in a unique unitig occurs only once in the target sequence, a fragment subinterval at the end of the unique unitig occurs multiple times (a repeat). This short repeat is represented by multiple edges from the end vertex of the unique unitig: each edge represents the intersection of two fragment intervals, and these subintervals are part of a repeat. The k-mer graph avoids this problem by using edges to represent unions rather than intersections of the (k−1)-mer intervals represented by vertices. So, edges are not subintervals of the vertices. Dovetail paths in the k-mer graph can be coalesced in a fashion similar to dovetail paths in the reduced overlap graph. In the reduced overlap graph the union of fragment vertices along the dovetail path is replaced with a vertex representing a unitig; in the k-mer graph the union of k-mer edges along a dovetail path becomes an edge representing a unitig, and the two bounding vertices remain the same [25,28]. The only exception is the trivial dovetail path that has no edges, where



the single vertex represents a (k−1) length repeat. We will call this coalesced k-mer graph the k-mer unitig graph. As in the unitig graph, unitigs in the k-mer unitig graph can represent unique or repeat intervals in the target sequence. Gaps in fragment coverage at repeat boundaries can obfuscate the true repeat branching structure in either graph. The approach to finding unique unitigs in the k-mer unitig graph is actually a subproblem of determining the multiplicity (number of copies) of each unitig where a unique unitig has multiplicity one. The assumption is that each unitig has multiplicity at least one and that most unitigs have multiplicity exactly one. The local flow of unitig multiplicities into, through, and out of a unitig must be balanced (Kirchhoff’s law). A simple heuristic is to start all unitig multiplicities at one and iteratively apply the balancing condition until a stable balanced solution is reached [46]. For example, if two single copy unitigs both precede a unitig in the unitig graph, that unitig must have at least two copies. This approach cannot correctly solve every situation (see figure 3.12) and even the more rigorous minimal flow algorithm [46,47] does not solve this example. Nevertheless, this approach has been shown to work well in practice for bacterial genomes [25,46]. It remains to be seen if the heuristic or the minimal flow algorithm can scale to large, complex genomes. Perhaps a more promising approach is to combine the depth of coverage as an initial estimate of unitig multiplicity and then apply the heuristic balancing of flows. These initial multiplicity estimates and balanced flows would be real valued but could be forced over iterations to converge to integers. This would give a high probability of correctly solving the example cited above. 
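The simple balancing heuristic can be sketched as follows. Unitigs are edges of the k-mer unitig graph, every multiplicity starts at one, and a multiplicity is raised only where a vertex has a single edge on one side that must carry the whole flow of the other side. This is one possible reading of the heuristic, written for illustration; it is not the implementation of [46], and the round limit is an assumed safeguard against graphs that admit no balanced solution.

```python
def balance_multiplicities(edges, max_rounds=100):
    """Estimate unitig multiplicities by local flow balancing.

    edges: dict mapping unitig name -> (tail_vertex, head_vertex).
    Returns a dict mapping unitig name -> estimated multiplicity.
    """
    mult = {name: 1 for name in edges}
    ins, outs = {}, {}
    for name, (tail, head) in edges.items():
        outs.setdefault(tail, []).append(name)
        ins.setdefault(head, []).append(name)

    for _ in range(max_rounds):
        changed = False
        # Only vertices with both incoming and outgoing edges are balanced;
        # the ends of the target sequence are left alone.
        for v in set(ins) & set(outs):
            flow_in = sum(mult[e] for e in ins[v])
            flow_out = sum(mult[e] for e in outs[v])
            if len(outs[v]) == 1 and flow_out < flow_in:
                mult[outs[v][0]] = flow_in
                changed = True
            elif len(ins[v]) == 1 and flow_in < flow_out:
                mult[ins[v][0]] = flow_out
                changed = True
        if not changed:
            break
    return mult

# U1 -> R -> U2 -> R -> U3: two single-copy unitigs precede R.
graph = {"U1": ("a", "x"), "R": ("x", "y"), "U2": ("y", "x"), "U3": ("y", "z")}
print(balance_multiplicities(graph))
```

In the example, R is assigned multiplicity two because two single-copy unitigs enter its start vertex, matching the reasoning in the text.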
With the k-mer unitig graph and unitig graph (called here the fragment unitig graph to clearly differentiate from the k-mer unitig graph), the goal of the layout phase is to find one or more paths which together include each unique unitig once and each repeat unitig the number of times it occurs in the target sequence. This starts with an initial labeling of the unique unitigs or the multiplicity of the unitigs in the unitig graph. Already for both unitig graphs the nonbranching portions of the overlap graph have been coalesced into unitigs. The next step is to extend unique unitigs across repeat boundaries using fragments that start in the unique unitig. For the k-mer unitig graph the repeat boundary is at the end of the unitig, so all fragment paths that include the unitig boundary (k−1)-mer start before the repeat boundary. For the fragment unitig graph, the unique unitigs branching from the repeat unitigs must be aligned to identify the repeat boundaries within each unique unitig [45] (see figure 3.10). In the fragment unitig graph, unitigs that overlap the unique fragment on the other side of the repeat boundary must be the correct overlap, and any conflicting overlaps with that end of the unique unitig (branching in the graph) can be discarded. In the k-mer unitig graph, a set of equivalent graph transformations is



defined which allows unitigs with multiplicity greater than one to be duplicated and the edges adjacent to those unitigs to be assigned to exactly one of the copies [25]. This can only be done when all of the fragment paths through the repeat unitig are consistent with the assignment of the adjacent edges to the unitig copies. This means that if a fragment spans from one unique unitig across a short repeat unitig to another unique unitig, then the two unique unitigs can be combined into a single unique unitig (containing the short repeat unitig) (see figure 3.11). All shotgun fragment assemblers that use only fragment sequence data are stymied by identical repeats that are long enough that they cannot be spanned by a single fragment. The reason for this is that once one traverses an edge from a unique unitig into a long repeat unitig, the fragment data cannot indicate which edge to follow leaving the repeat. For very simple repeat structures, if there is complete fragment coverage of the target sequence there will be only one path that includes every unique unitig in a single path (see figure 3.14A), but

Figure 3.14 Subsequence from U1 to U3 has two Eulerian tours with the same sequence (given R1 and R2 are identical) (A). Addition of third copy of repeat R makes order of U2 and U3 ambiguous in different Eulerian tours (B). Hamiltonian representation is shown on the left to illustrate increased complexity.



even for a slightly more complicated repeat structure this is no longer true (see figure 3.14B). For the SBH problem, in which all that is known about a short target sequence is the set of k-mers that appear in it, the k-mer graph was originally designed to address this issue. By having the edges represent k-mers and the vertices represent (k−1)-mers (overlaps of k-mers) instead of the reverse, the desired solution becomes a path that uses every edge exactly once (an Eulerian path) rather than a path that uses every vertex exactly once (a Hamiltonian path) [48] (see figure 3.14). In general a Hamiltonian path takes exponential time to compute. This means that in practice only target sequences with no duplicated (k−1)-mers can be solved using the Hamiltonian path approach. With no (k−1)-mer duplications all vertices have a maximum of one outgoing and one incoming edge, which makes the determination of the Hamiltonian path trivial [49]. An Eulerian path is efficient to compute (linear in the number of edges) if one exists, and as mentioned above only a single Eulerian path is possible for very simple repeat structures. For short, random target sequences (length 200) and k = 8, a simple Hamiltonian path (maximum degree ≤ 1) was found in 46% of test cases whereas a single Eulerian path was found 94% of the time [48]. Unfortunately for large target sequences that have complex repeat structures, the number of Eulerian paths quickly becomes exponential. So the Eulerian formulation provides no efficiency in solving the layout problem. The equivalent graph transformations approach for the k-mer unitig graph is, however, very useful for simplifying the k-mer unitig graph. This approach has been called the Eulerian Superpath Problem, and it is important to understand that the power of it comes from simplifying the structure of the graph by splitting spanned repeat unitigs and not by any computational advantage of the Eulerian versus Hamiltonian framing of the problem. 
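For the simple case in which an Eulerian path exists, it can be found in time linear in the number of edges with Hierholzer's algorithm. The sketch below reconstructs a short sequence from its set of k-mers under the SBH assumptions (error-free k-mers, single strand); the function names are illustrative, and when several Eulerian paths exist it returns just one of them.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Return the vertex sequence of an Eulerian path, or None if none exists.

    edges: non-empty list of (tail, head) pairs of a directed multigraph.
    """
    if not edges:
        return None
    out = defaultdict(list)
    balance = defaultdict(int)              # out-degree minus in-degree
    for tail, head in edges:
        out[tail].append(head)
        balance[tail] += 1
        balance[head] -= 1
    starts = [v for v, d in balance.items() if d == 1]
    if any(abs(d) > 1 for d in balance.values()) or len(starts) > 1:
        return None                         # degree conditions violated
    stack = [starts[0] if starts else edges[0][0]]
    path = []
    while stack:                            # Hierholzer's algorithm
        v = stack[-1]
        if out[v]:
            stack.append(out[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # A shorter walk means some edges were unreachable (disconnected graph).
    return path if len(path) == len(edges) + 1 else None

def reconstruct(kmers):
    """Spell the sequence along an Eulerian path of the k-mer graph."""
    path = eulerian_path([(km[:-1], km[1:]) for km in kmers])
    if path is None:
        return None
    return path[0] + "".join(v[-1] for v in path[1:])

print(reconstruct(["TTG", "ACG", "GCA", "CGT", "TGC", "GTT"]))
```

The example has no duplicated (k−1)-mers, so the path is unique; with repeats the same routine still returns a path, but it is only one of possibly many.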
Long Identical Repeats

When long, identical repeats are encountered, the solution to the layout problem must use additional information beyond that of the fragment sequences. The most useful and easily obtained auxiliary information used by shotgun fragment assemblers is mate pair data. Mate pair data is obtained when sequence fragments are generated from both ends of a cloned piece of DNA (see figures 3.1, 3.2, and 3.5). Because of this, the method is referred to as double-barreled sequencing, double-ended sequencing, or pairwise-end sequencing. Generally, the DNA is inserted into a sequencing vector with universal sequencing primers at either end of the inserted DNA to produce two fragments. The inserted DNA must be double-stranded, but current sequencing technology can process only a single strand of DNA at a time, from the 5′ end to the 3′ end. This imposes the constraint that the mate pair fragments must be located on opposite strands of the solution. A more useful constraint is the



approximate distance between the 5′ ends of each mate pair. There are standard size selection techniques for creating collections (called libraries) of DNA clones that have an approximately known length distribution. The distribution is often roughly normal (Gaussian) with an approximate mean and variance. The use of mate pairs was first proposed as a method to determine which clones spanned gaps in fragment coverage of the target sequence [50]. The entire clone or just the portion needed to close the gap could then be sequenced to finish the target sequence. This is much more efficient than continuing to generate more random shotgun fragments with diminishing probability that a shotgun fragment would be encountered from the missing portions of the target sequence (the gaps). The clones used to generate the mate pairs are usually larger than the length of the mate pair fragments summed together. The amount of target sequence contained in the clones is larger than the amount contained in the fragments, but the random coverage of the target sequence by the clones, called the clone coverage, follows the same statistical model as the random coverage by the fragments (sequence coverage) [33]. Given sufficient sequence coverage to be able to place mate pair fragments into contigs, the probability that a gap in fragment coverage (called a sequencing gap) will be spanned by a mate pair increases with the clone coverage [51]. A mate pair that spans a sequencing gap constrains the orientation (which strand of the target sequence) and the order (before or after) the two contigs (or unitigs) containing the mate pair fragments have with respect to each other (see figure 3.13). The distance between the two contigs (length of the gap) can also be estimated based on the length distribution of the library that the mate pair was sampled from. 
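The gap estimate can be sketched as simple arithmetic on the library mean. For each spanning mate pair, the portion of the clone lying inside each contig is subtracted from the mean clone length. The sketch assumes the clones are independent draws from one library with known mean and standard deviation, so the estimates are averaged and the standard error shrinks with the square root of the number of mate pairs; the function name and argument layout are illustrative.

```python
import math

def gap_estimate(mates, lib_mean, lib_sd):
    """Estimate the length of a sequencing gap from spanning mate pairs.

    mates: list of (in_left, in_right), the number of clone bases that lie
           inside the left contig and inside the right contig respectively.
    Returns (mean gap estimate, standard error of that mean).
    A negative estimate suggests the contigs may actually overlap.
    """
    estimates = [lib_mean - in_left - in_right for in_left, in_right in mates]
    mean_gap = sum(estimates) / len(estimates)
    std_error = lib_sd / math.sqrt(len(estimates))
    return mean_gap, std_error

# Two mate pairs from a 2,000 bp library with a 200 bp standard deviation.
print(gap_estimate([(800, 700), (500, 900)], lib_mean=2000, lib_sd=200))
```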
In practice, the initial estimate of the size distribution of the library is determined based on the apparent size of the clones as measured by the dispersion on an agarose gel run. A refined estimate can be computed by bootstrapping. The assembler is run to generate contigs or unitigs. The lengths of the clones as implied by the positions of the mate pairs in the contigs can be used to estimate the library size distribution. In the absence of repeats, the fragment unitig graph, the k-mer unitig graph, or any of even the simplest greedy layout algorithms will produce a set of unitigs that terminate at sequencing gaps or the ends of the target sequence. The mate pairs allow the unitigs to be placed into structures called scaffolds (see figures 3.13 and 3.15) [51]. A scaffold is thus an ordered and oriented set of contigs separated by gaps. A mate pair graph that is analogous to the fragment unitig graph can be constructed. Vertices are still the unitigs but edges are now the distances between the unitigs as computed by the mate pairs connecting them. Overlap edges in the fragment unitig graph always have negative distances because the sequence in the overlap is shared in common



Figure 3.15 Mate pair edges between i–j and i–k imply distance and orientation between j and k.

between the two unitigs and the distance (length of overlap) is known precisely. Mate pair edges usually have a positive distance but can have a negative distance (indicating a possibly undetected overlap) and the variance on the distance estimate is much larger. In a target sequence without repeats (a line), the unitigs represent single intervals along the line. Since the mate pair edges constrain the relative distance between these intervals, there is a condition analogous to the triangle condition in the overlap graph: if the distances from unitig i to unitig j and from j to k are known from mate pair edges, then the distance from i to k is known (within some variance) and any mate pair edge between i and k must be consistent with this distance if the unitigs are really single intervals. Exactly as in the overlap graph, chordal edge removal can be performed to leave only the mate pair edges between adjacent unitigs. In contrast to the overlap graph, edges may be missing between adjacent unitigs (intervals), but these edges can often be inferred from edges between nonadjacent unitigs (see figure 3.15). Chordal edge removal in the mate pair graph removes all branching in the unique intervals of the genome but leaves the same problem as the unitig graph with branching still occurring at the repeat unitigs. As described above for both types of unitig graphs, if a fragment spans a short repeat unitig, it crosses the repeat boundary on both sides and connects the two flanking unique unitigs. This makes it possible to merge the two unique unitigs with a copy of the intervening repeat unitig into a single contig that represents the correct local portion of the solution path. This would also require removing any edges from the unique unitig ends that are internal to the newly formed contig and replacing the edges adjacent to the external ends of the unique unitigs with edges to the external ends of the newly formed contig. 
Meanwhile, the repeat unitig and its other edges may still be used for other parts of the solution path. This equivalent graph transformation simplifies the graph and would yield the solution if all of the repeats could be spanned. Whereas fragments will not span long repeats, mate pairs can, and a wide array of clone lengths is possible using a number of different cloning vectors and methods. In contrast to a fragment that spans a short repeat between unique unitigs, which gives a path through the unitig graph and determines the sequence, a mate pair edge provides a distance estimate but no path.



If only a single path exists in the unitig graph and it is consistent with the mate pair edge distance, then it is likely to be correct. There is often only one consistent path for bacterial genomes [46] but not for more complex genomes [45]. Another approach is to require that every unitig on a consistent path between the unique unitigs also have a consistent mate pair edge to the flanking unique unitigs [52]. Spanning and placing the repeat unitigs between mate-pair-connected unique unitigs greatly simplifies the unitig graph. Unfortunately, imperfect labeling of unique unitigs and other imperfections of the data ultimately lead most shotgun fragment assemblers to resort to some form of greedy heuristic, such as using mate pair edges with the largest number of supporting mate pairs in the presence of conflicting edges, to generate a final solution.
Mate Pair Validation

Mate pairs are a powerful tool for validating that the layout phase has produced a correct solution. Even though more recent assemblers use mate pairs in some fashion to guide the layout phase [24,40,42,53] of the assembly, mistakes are still made due to data and software imperfections. Patterns of unsatisfied mate pairs can identify many of these mistakes. A mate pair is satisfied if the two fragments from opposite ends and strands of the same clone appear on opposite strands and at a distance consistent with the clone size distribution (see figure 3.16A and B).

Figure 3.16 End reads of a clone as in target sequence (A), correctly assembled— satisfied (B), too far apart—stretched (C), too close together—compressed (D), misoriented to the right—normal (E), misoriented to the left—antinormal (F), and misoriented away from each other—outtie (G).
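The categories in figure 3.16 can be expressed as a small classifier. The sketch below is illustrative only: `PlacedRead` and `classify_mate_pair` are hypothetical names, the clone size distribution is summarized by an assumed mean and standard deviation, the three-standard-deviation cutoff is an arbitrary choice, and "misoriented to the right" is read here as both reads pointing rightward in the layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedRead:
    """Hypothetical placement of one end read in the layout."""
    start: int     # leftmost layout coordinate covered by the read
    end: int       # rightmost layout coordinate (exclusive)
    forward: bool  # True if the read points left to right in the layout


def classify_mate_pair(a: PlacedRead, b: PlacedRead,
                       mean: float, sd: float, n_sigma: float = 3.0) -> str:
    """Label a mate pair using the category names of figure 3.16."""
    left, right = sorted((a, b), key=lambda r: r.start)

    # Orientation is checked first; distance only matters when the
    # reads point toward each other.
    if left.forward and right.forward:
        return "normal"        # both point right
    if not left.forward and not right.forward:
        return "antinormal"    # both point left
    if not left.forward and right.forward:
        return "outtie"        # point away from each other

    # Reads face each other: compare the implied clone length
    # with the expected clone size distribution.
    span = right.end - left.start
    if span > mean + n_sigma * sd:
        return "stretched"
    if span < mean - n_sigma * sd:
        return "compressed"
    return "satisfied"
```

Every label other than "satisfied" marks an unsatisfied mate pair in the sense used in this section.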



A mate pair that fails either of these two conditions is called unsatisfied. If two nonadjacent intervals of the target sequence have been inappropriately joined in the layout solution, this creates a bad junction, and mate pairs that would have spanned the correct junction will be unsatisfied (see figure 3.16C–G). Given deep clone coverage, most bad junctions will lead to multiple unsatisfied mate pairs. Each unsatisfied mated fragment defines an interval within which its mate should have appeared, implying that there is a bad junction somewhere within this interval. The intersection of these overlapping intervals can then be used to narrowly bound the region in which the bad junction must occur. This kind of analysis has been used to identify bad junctions in target sequence reconstructions [52,54]. In addition, these bad junctions often have co-occurring overlap signatures: a short or low-quality overlap or a layout that is only one sequence deep at that point (see chimeric fragments below). Some assemblers make use of these unsatisfied mate pair patterns to break and rejoin layouts, particularly when they coincide with a weak overlap pattern or chimeric fragment [42,53,55].

Chimeric Fragments

A chimeric clone is produced during the process of creating clone libraries when two pieces of fractured DNA fuse together. This fused piece of DNA no longer corresponds to a contiguous interval of the target sequence. If a fragment sequenced from a chimeric clone is long enough to cross the fused or chimeric junction, then the resulting fragment is called a chimeric fragment. Incorporating chimeric fragments into a layout would result in an incorrect reconstruction of the target sequence. Chimeric fragments tend to share a characteristic pattern of overlaps with nonchimeric fragments. Overlaps that are short enough not to reach (or that go just slightly beyond) the chimeric junction will be found; but, barring the unlikely event of another nearly identical chimeric fragment, there should be no overlaps with the chimeric fragment that cross the chimeric junction by more than a few bases. What distinguishes this pattern from a low coverage region is that fragments will exist that overlap with the fragments overlapping both ends of the chimeric fragment causing a branching in the unitig graph. This pattern is easy to detect after the overlap phase [35,56] and is incorporated by most assemblers by discarding the chimeric fragments before the layout phase. A chimeric fragment can also be recognized and discarded during the layout phase based on the unitig graph pattern (see figure 3.17A) in which a unitig composed of a single fragment causes branching in two intervals of the unitig graph that would otherwise be unbranched. Unfortunately, the same overlap pattern, or equivalent unitig graph pattern, can be induced by a pair of two-copy repeats in close proximity. To compensate for this we can use the previously described techniques



Figure 3.17 Unitig graph pattern of a chimeric single-read unitig U3 (A), spur fragment U3 (B), and polymorphic unitigs U2 and U3 (C).

to determine the likely multiplicity of the two unitigs with edges to the apparently chimeric fragment. If both unitigs appear to have multiplicity two, then the fragment should be retained.

Chimeric Mate Pairs

A chimeric mate pair occurs when the mated fragments from a single clone do not meet the clone constraints (opposite strand and expected distance) when placed on the target sequence. This can occur in at least two basic ways: the clone is chimeric as above, or fragments from different clones are mislabeled as being mated. Before capillary sequencing machines, parallel sequencing lanes on an agarose gel were often mistracked, associating the wrong fragment with a clone. Even after capillary sequencing, sequencing plates used in sequencing machines can be rotated or mislabeled, associating fragments with the wrong clones. Undoubtedly, there are and will continue to be other clever ways by which lab techniques or software misassociate fragments and clones. For this reason, most assemblers do not consider any single, uncorroborated mate pair to be reliable. Most assemblers will only use mate pair edges (as discussed above) if they are supported by at least two mate pairs. For large genomes the chance that any two chimeric mate pairs will support the same mate pair edge is small [45]. For the same reason, bad junction detection based on unsatisfied mate pairs also sets a threshold of the intersection of at least two unsatisfied intervals.

Spur Fragments

Spur fragments (also called dead-end fragments [24]) are fragments whose sequence on one end does not overlap any other fragment.



Of course this is true of fragments on the boundary of a sequencing gap, so an additional criterion that the fragment cause a branching in the unitig graph is also needed to define a spur fragment. The spur pattern in the unitig graph is similar to the chimeric pattern where a single fragment unitig causes a branching in the unitig graph that would otherwise not occur (see figure 3.17B). Some of the reasons that spur fragments occur are undetected low-quality or artifactual sequence generated by the sequencing process that has not been trimmed and vector sequence that is undetected and therefore untrimmed. Spur fragments can also result from chimeric junctions close enough to a fragment end that overlaps with the short chimeric portion of the fragment will not be detected. Spur fragments are easy to detect using overlap or unitig graph patterns and can then be discarded. As with chimeric fragments, there are conditions under which a spur pattern can occur even though the spur fragment accurately reflects the target sequence. This can happen if, for instance, a sequencing gap exists in the unique flanking region near a repeat boundary of a two-copy repeat and the fragment coverage is low (single fragment). If the unitig at the branch point caused by the spur appears to be multiplicity two, the spur should probably be retained.

Vector and Quality Trimming

Current sequencing technology usually requires that some known DNA sequence be attached to both ends of each randomly sheared piece of the target sequence (often a plasmid cloning vector). Part of this so-called vector sequence is almost always included at the beginning, or 5′ end, of a fragment as part of the sequencing process. If the sheared piece of target sequence is short, the sequencing process can run into vector sequence at the 3′ end of a fragment (see figure 3.1C). Although today the majority, if not the entire length, of most fragment sequences is of very high quality, both the beginning and end of the sequence are sometimes of such low quality that overlaps cannot be detected in these regions, even with error correction. The vector and low-quality regions of fragments do not reflect the underlying target sequence and can cause overlaps to be missed. A preprocess called vector and quality trimming is performed before the overlap phase in most assemblers to attempt to detect and remove these regions from the fragments. The vector can be detected using standard sequence alignment techniques that are only complicated in two cases: the sequence quality is low, which can be addressed by quality trimming, or the sequencing vector is very short so that a significant alignment (greater than random chance) does not exist. The latter can be addressed by trimming off any short alignment at the potential cost of trimming off a few nonvector base pairs. Quality trimming is usually based on the quality values (error estimates) for the base calls. Using these error estimates, a maximum number of expected errors per fixed window length is allowed, and the maximum contiguous set of these windows (intervals) is retained as the quality-trimmed fragment. This trimming is usually somewhat conservative and so a complementary method using the overlap machinery is sometimes employed [35,53]. Instead of insisting that an overlap extend all the way to the end of both trimmed fragments, high-quality alignments that terminate before the end of untrimmed fragments can be considered. If the alignment is significant, then it is due to the fragments being from the same interval of the target sequence or sharing a repeat in the target sequence. In either case the aligned portion of the fragment is not likely to be low-quality sequence and can be used to extend the trimming based on quality values.

Polymorphic Target Sequences

If a clonal target sequence is asexually reproduced DNA, a single version of the target sequence is copied with little or no error, and we can conceptually think of each random shotgun fragment as having been sampled from a single target sequence. Unfortunately when the copies of the target DNA to be sheared are acquired from multiple individuals or even a single individual with two copies of each chromosome (one from each parent), this assumption is incorrect and we must allow for variance between the different copies of the target sequence. If the variance between copies is very low (say a single base pair difference per 1000), then the overlap and layout phases are unlikely to be impacted. A rate of variance that is well within the sequencing error rate (or corrected error rate) will not prevent any overlaps from being discovered. Even the most aggressive repeat separation strategies require at least two differences between fragments for separation, so variance with at most one difference per fragment length will not affect the layout phase. Unfortunately, polymorphic variance is often significantly greater than sequencing error. If the polymorphic variation in all intervals of a two-haplotype target sequence exceeds the sequencing error variation, the problem would be the same as assembling a target sequence that was twice as long as expected, since we could easily separate the two haplotypes. Polymorphic variation more often varies from quite low to high from region to region within the target sequence. The low-variance unique regions end up in a single unitig whereas the high-variance unique regions get split into multiple unitigs (two in the case of two haplotypes). This complicates the branching in the unitig graph and makes it more difficult to determine unitig multiplicities based on the branching structure. 
In some cases a polymorphic branching pattern within a unique region of the target sequence can be recognized and collapsed into a single unitig [57]. A common polymorphic pattern called a bubble occurs when unitig U1 branches out to unitigs U2 and U3 which then converge back into unitig U4 (see figure 3.17C). There are two possibilities in the underlying target sequence to account for the bubble: unitigs U1 and U4 are unique unitigs and unitigs U2 and U3 are polymorphic haplotypes of the same analogous unique region between U1 and U4, or unitigs U1 and U4 are both repeats and unitigs U2 and U3 are different intervals of the target between copies of U1 and U4. These two cases can often be distinguished by the depth of coverage of the unitigs U1, U2, U3, and U4.

CONSENSUS PHASE

The layout phase determines the order, orientation (strand), and amount of overlap between the fragments. The consensus phase determines the most likely reconstruction of the target sequence, usually called the consensus sequence, which is consistent with the layout of the fragments. As we discussed above, an overlap between fragments i and j, which defines a pairwise alignment, and an overlap between fragments j and k, when taken together create a multiple sequence alignment between fragments i, j, and k (see figures 3.7, 3.9, and 3.18). In general, the pairwise alignments between adjacent fragments in the layout can be used to create a multiple sequence alignment of all of the fragments. At each position, or column, of the multiple sequence alignment, different base calls or gaps inserted within a fragment for alignment may be present for each of the fragments that span that position of the target sequence. Which target sequence base pair (or gap in the absence of a base pair) is most likely to have resulted in the base calls seen in the column? In the absence of information other than the base calls and when the accuracy of the fragments is high, a simple majority voting algorithm works well. With quality values available as error estimates for the base calls, a quality value weighted voting improves the result. A Bayesian estimate which can also incorporate the a priori base pair composition propensities can be used to make the base call and provide an error estimate (quality value) for the consensus base call [22]. If the target sequence copies are polymorphic, the same Bayesian model can be used to assign probabilities that a column reflects sampling from a polymorphic position in the target sequence. The difficulty in generating the best consensus sequence does not lie in calling a base for a given column but in generating optimal columns in the multiple sequence alignment. Optimal multiple sequence alignments have an entire literature of their own. 
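The two simplest column-calling rules mentioned above, majority voting and quality-weighted voting, can be sketched as follows. This is an illustration under stated assumptions rather than the method of any cited assembler: the function names are hypothetical, each column entry is assumed to be a (base, quality) pair with a Phred-style quality value, a gap is written as "-" and is assumed to have been given some quality value (for example, from its neighboring bases), and ties are broken arbitrarily in favor of the later character, so a gap never wins a tie against a base.

```python
from collections import defaultdict

GAP = "-"


def majority_call(column):
    """Call the symbol seen most often in the column.

    column: iterable of (base, quality) pairs; quality is ignored here.
    """
    counts = defaultdict(int)
    for base, _quality in column:
        counts[base] += 1
    # The symbol itself is the tie-breaker; GAP sorts below A, C, G, T.
    return max(counts, key=lambda b: (counts[b], b))


def weighted_call(column):
    """Call the symbol with the largest summed quality value."""
    weights = defaultdict(float)
    for base, quality in column:
        weights[base] += quality
    return max(weights, key=lambda b: (weights[b], b))
```

For a column such as `[("A", 10), ("A", 12), ("C", 40)]` the two rules disagree: two low-quality A calls outvote the single C, but the C carries more total quality. A Bayesian estimate of the kind described in [22] would additionally use prior base composition and return an error estimate for the call.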
Dynamic programming is generally considered to give optimal pairwise alignments in a reasonable amount of computation (at least for short sequences) needing time proportional to the length of the fragments involved squared. Dynamic programming alignment can be easily extended to multiple sequence alignment but takes time proportional to the length of the fragments raised to the number of fragments to be aligned, which is impractical.
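The quadratic cost of pairwise dynamic programming is visible in a minimal implementation. The sketch below computes a global alignment with unit mismatch and gap costs; the function name `align` and the cost values are assumptions made for illustration, and a fragment assembler would normally use an overlap (end-gap-free) variant rather than a global alignment.

```python
def align(a: str, b: str, mismatch: int = 1, gap: int = 1):
    """Global alignment of a and b by dynamic programming.

    Returns (cost, aligned_a, aligned_b). The table has
    (len(a) + 1) * (len(b) + 1) cells, hence the quadratic time.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(diag, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
                0 if a[i - 1] == b[j - 1] else mismatch):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))
```

Extending the same recurrence to k sequences requires a k-dimensional table, which is the source of the exponential cost noted above.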



The most common practical approach is to determine a likely optimal order of pairwise alignments to perform to create a multiple sequence alignment. The order of pairwise alignments is usually determined by maximal pairwise similarity. After a pairwise alignment the sequences are merged either into a consensus sequence or a profile representing each column as a weighted vector rather than as a consensus call. For shotgun fragment assembly the fragments are so similar (except if larger polymorphisms have been collapsed together into a single unitig) that the order of pairwise alignment and merged sequence representation has little impact. As mentioned above, the obvious choice is just to proceed from the first fragment in a contig and use the pairwise alignment with the adjacent fragment until the last fragment is reached. There is one glaring shortcoming resulting from this approach where gaps in adjacent columns are not optimally aligned (see figure 3.18). Alignment B is better because fewer sequencing errors are needed to explain it and sequencing errors are rare events. Two different methods have been developed to refine the initial multiple sequence alignment to correct this problem. The first removes fragments one at a time from the initial multiple sequence alignment and then realigns each fragment to the new profile of the multiple sequence alignment resulting from its removal [58]. This process iterates until the multiple sequence alignment score stops improving or a time limit is reached. The second method first finds some small number of consecutive columns, say six, which have no internal differences (all base calls in a column are the same with no gaps). These anchoring columns are unlikely to be in error and even more unlikely to be improved by any local multiple sequence alignment refinement technique. 
The abacus method, so called because gaps are shifted like beads on an abacus, then tries to reposition the gaps between anchors so that the gaps are concentrated in fewer columns [45]. Neither method always produces optimal results, but both methods produce significantly improved results over unrefined alignments.
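The idea of sliding gaps like beads can be illustrated with a deliberately simplified sketch. It is not the implementation described in [45]: the function names are hypothetical, the input is assumed to be only the alignment rows between two anchors (equal-length strings with "-" for gaps), and a gap is allowed to slide past a neighboring base only when that base matches the most common base of the column it moves into, so that no new mismatches are created. The original, left-shifted, and right-shifted windows are compared and the one with the fewest gap-containing columns is kept.

```python
from collections import Counter

GAP = "-"


def column_consensus(rows):
    """Most common non-gap base per column (None if a column is all gaps)."""
    consensus = []
    for col in zip(*rows):
        bases = Counter(b for b in col if b != GAP)
        consensus.append(bases.most_common(1)[0][0] if bases else None)
    return consensus


def gap_columns(rows):
    """Number of columns that contain at least one gap."""
    return sum(1 for col in zip(*rows) if GAP in col)


def slide_gaps(rows, step):
    """Slide gaps left (step=-1) or right (step=+1) within each row."""
    consensus = column_consensus(rows)
    width = len(consensus)
    shifted = []
    for row in rows:
        cells = list(row)
        # Visit gaps starting from the side they move toward, so that
        # neighboring gaps do not block each other.
        order = range(width - 1, -1, -1) if step > 0 else range(width)
        for c in order:
            while (cells[c] == GAP and 0 <= c + step < width
                   and cells[c + step] == consensus[c]):
                cells[c], cells[c + step] = cells[c + step], cells[c]
                c += step
        shifted.append("".join(cells))
    return shifted


def abacus_refine(rows):
    """Keep whichever candidate window has the fewest gap columns."""
    candidates = [list(rows), slide_gaps(rows, -1), slide_gaps(rows, +1)]
    return min(candidates, key=gap_columns)
```

With the rows `["ACGGT", "ACG-T", "AC-GT"]`, the two gaps initially occupy different columns; after refinement they share one column, which needs fewer sequencing errors to explain.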

Figure 3.18 Nonoptimal multiple sequence alignment (A) and optimal alignment (B).



An entirely different approach to consensus avoids the gap alignment optimization problem by using only a single, high-quality base from one fragment instead of letting bases from all fragments in the column vote in calling a given consensus base [35]. The quality values indicate that there are likely to be very few errors in each interval. A transition region where two fragments’ base calls match exactly is chosen to switch from one high-quality fragment to the next. If desired, this consensus sequence can just be the starting point for the consensus approaches discussed above. First all of the fragments would be aligned pairwise against the consensus sequence and then either or both of the above refinements could be performed.

PAST AND FUTURE

The first whole genome shotgun assembly was performed with a lot of manual intervention to determine the 48,502 base pair genome of the bacteriophage lambda virus [59,60]. As larger genomic clones and genomes were shotgun sequenced, more automated methods were developed for assembling them. Many felt there was a limit to the genome size or the complexity of the genomic content that whole genome shotgun assemblers could be designed to handle. Of course at the extreme one can imagine genomes such as 10 million base pairs of a single nucleotide, say A with only a smattering of C, T, or G nucleotides intermingled, where there is no hope of using whole genome shotgun assembly or any other current assembly method. The interesting question becomes: for genomes we wish to sequence and assemble, can sufficiently sophisticated whole genome shotgun methodologies and assembly algorithms be devised to produce the sequence for these genomes? The frontier has continued to be expanded in the face of skepticism from 1 million base pair bacteria [61], to 100 million base pair invertebrates [45], to 3 billion base pair mammals [62]. Our belief is that while large strides have been made in the capabilities of whole genome shotgun assembly algorithms, there is much that can still be done to push the frontier further out and at the same time reduce the finishing effort required for genomes within our current capabilities. No single assembler incorporates the most advanced version of the methods discussed above and approaches to deal with polymorphism, tandem repeats [63], and large segmental duplications [64] are in their infancy.

LITERATURE

The first shotgun assembly programs [1,2] were primarily concerned with finding overlaps and constructing contigs from these overlaps that could be presented to the scientists for confirmation. The target sequences were small enough that a high degree of manual inspection and correction was acceptable and any repeat structure was relatively simple. Even at this early stage the tradeoff between sensitivity and specificity in overlap detection was understood. These early programs assumed that any significant overlap was likely to be real and could be used on a first-come, first-served basis to construct contigs. Any mistakes could be corrected with manual intervention. Shotgun fragment assembly was quickly posed as a mathematical problem, Shortest Common Superstring (SCS), with a well-defined criterion, the length of the superstring, to be optimized. A simple greedy approach of merging fragments with the longest remaining overlap was proposed as a solution and bounds on performance were proven and later improved [5–13]. This new approach was then put into practice and violations of the triangle condition were recognized as indications of repeat boundaries that could not be solved using this simple approach [14]. The next wave of fragment assemblers began arriving almost a decade later with CAP [23], which has continued to be improved over time with CAP2 [56], CAP3 [53], and PCAP [27]. CAP used improved sensitivity and specificity measures but, more importantly, introduced the first version of sequence error correction based on overlap alignments. CAP2 recognized repeat contigs based on triangle condition violations at repeat boundaries and attempted to separate the copies of these repeats based on small differences (dividing columns) between different copies. CAP2 also used an elegant chimeric fragment detection algorithm. CAP3 introduced the use of unsatisfied mate pairs to break and repair incorrect layouts. Phrap [35] used base call quality values generated by Phred [26] for much better error correction and repeat separation. Work on distinguishing columns perhaps has the most promise for repeat separation [36–39]. Others had also recognized the value of quality values [65,66]. 
Another method for detecting repeat boundaries was based on determining cliques in the overlap graph [67] which would share fragments with adjoining cliques. If there was more than one adjoining clique on each end, a repeat boundary was present. All of these approaches to finding repeat boundaries as violations of the triangle condition were made explicit in the transitive, or chordal, graph reduction approach [4] that removes overlap edges from the overlap graph until only branching edges due to repeat boundaries are left. TIGR Assembler made the first use of mate pairs to guide the layout phase of the assembly [40,61]. Other assemblers improved the efficiency of some stages of assembly [58,68–73]. A nice formalization of several phases of fragment assembly is presented in [21], but the branch and bound algorithms presented are only practical for target sequences with low repeat complexity. Genetic algorithm and simulated annealing approaches for searching the space of good layouts can outperform the simple greedy heuristic for target sequences with a few repeats, but the search does not scale for complex repeat structures [74–76].



A very different approach, the k-mer graph, was also developed in this intermediary time frame [28] and then expanded recently [25,34,46,77]. A new set of fragment assemblers have recently been developed that build on previous work and can scale to mammalian size genomes [24,27,41–43,45,55]. The results of one of these assemblers for Drosophila [78,79] and human [62] have been compared to finished versions of these genomes. Different genome sequencing strategies have been debated [50,51,80–83]. The impact of a new sequencing technology which produces short reads at low cost allowing for deep coverage has been evaluated for the k-mer graph approach [84]. We should like to recommend a few general supplementary texts for the interested reader [3,85,86].

ACKNOWLEDGMENTS

We should like to thank all of those mentioned in this chapter or inadvertently overlooked who have worked on and contributed directly or indirectly to the problem of shotgun fragment assembly. We should like to recognize all of our colleagues, who are too numerous to list here, who have worked either directly with us on shotgun fragment assembly or more generally on whole genome shotgun sequencing for their efforts and encouragement. A special thanks goes to the group of people who helped design and build the Celera Assembler but more importantly made it a joy to come to work: Eric Anson, Vineet Bafna, Randall Bolanos, Hui-Hsien Chou, Art Delcher, Nathan Edwards, Dan Fasulo, Mike Flanigan, Liliana Florea, Bjarni Halldorsson, Sridhar Hannenhalli, Aaron Halpern, Merissa Henry, Daniel Huson, Saul Kravitz, Zhongwu Lai, Ross Lippert, Stephano Lonardi, Jason Miller, Clark Mobarry, Laurent Mouchard, Gene Myers, Michelle Perry, Knut Reinert, Karin Remington, Hagit Shatkay, Russell Turner, Brian Walenz, and Shibu Yooseph. Finally, we owe a debt of thanks to Mark Adams, Mike Hunkapiller, and Craig Venter for providing the opportunity and data to work on the Drosophila, human, and many other exciting genomes.

REFERENCES

1. Gingeras, T., J. Milazzo, D. Sciaky and R. Roberts. Computer programs for the assembly of DNA sequences. Nucleic Acids Research, 7(2):529–45, 1979.
2. Staden, R. Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing. Nucleic Acids Research, 10(15):4731–51, 1982.
3. Setubal, J. and J. Meidanis. Introduction to Computational Molecular Biology (pp. 105–42). PWS Publishing Company, Boston, 1997.
4. Myers, E. Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology, 2(2):275–90, 1995.
5. Tarhio, J. and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science, 57(1):131–45, 1988.



6. Turner, J. Approximation algorithms for the shortest common superstring. Information and Computation, 83(1):1–20, 1989.
7. Gallant, J., D. Maier and J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences, 20:50–8, 1980.
8. Gallant, J. The complexity of the overlap method for sequencing biopolymers. Journal of Theoretical Biology, 101(1):1–17, 1983.
9. Blum, A., T. Jiang, M. Li, J. Tromp and M. Yannakakis. Linear approximation of shortest superstrings. Proceedings of the 23rd ACM Symposium on Theory of Computing, 328–36, 1991.
10. Blum, A., T. Jiang, M. Li, J. Tromp and M. Yannakakis. Linear approximation of shortest superstrings. Journal of the ACM, 41:634–47, 1994.
11. Armen, C. and C. Stein. A 2.75 approximation algorithm for the shortest superstring problem. Technical Report PCS-TR94-214, Department of Computer Science, Dartmouth College, Hanover, N.H., 1994.
12. Armen, C. and C. Stein. A 2 2/3-approximation algorithm for the shortest superstring problem. Combinatorial Pattern Matching, 87–101, 1996.
13. Kosaraju, R., J. Park and C. Stein. Long tours and short superstrings. Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 166–77, 1994.
14. Peltola, H., H. Söderlund, J. Tarhio and E. Ukkonen. Algorithms for some string matching problems arising in molecular genetics. Proceedings of the 9th IFIP World Computer Congress, 59–64, 1983.
15. Peltola, H., H. Söderlund and E. Ukkonen. SEQAID: a DNA sequence assembling program based on a mathematical model. Nucleic Acids Research, 12(1 Pt 1):307–21, 1984.
16. Needleman, S. and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48(3):443–53, 1970.
17. Smith, T. and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–7, 1981.
18. Sellers, P. The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms, 1:359–73, 1980.
19. Sankoff, D. and J. Kruskal. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass., 1983.
20. Myers, E. Incremental Alignment Algorithms and Their Applications. Technical Report TR 86-2, Department of Computer Science, University of Arizona, Tucson, 1986.
21. Kececioglu, J. and E. Myers. Combinatorial algorithms for DNA sequence assembly. Algorithmica, 13(1/2):7–51, 1995.
22. Churchill, G. and M. Waterman. The accuracy of DNA sequences: estimating sequence quality. Genomics, 14(1):89–98, 1992.
23. Huang, X. A contig assembly program based on sensitive detection of fragment overlaps. Genomics, 14(1):18–25, 1992.
24. Batzoglou, S., D. Jaffe, K. Stanley, J. Butler, S. Gnerre, et al. ARACHNE: a whole genome shotgun assembler. Genome Research, 12:177–89, 2002.
25. Pevzner, P., H. Tang and M. Waterman. An eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences USA, 98(17):9748–53, 2001.



26. Ewing, B. and P. Green. Base-calling of automated sequencer traces using phred. ii. error probabilities. Genome Research, 8(3):186–94, 1998.
27. Huang, X. and J. Wang. PCAP: a whole-genome assembly program. Genome Research, 13(9):2164–70, 2003.
28. Idury, R. and M. Waterman. A new algorithm for DNA sequence assembly. Journal of Computational Biology, 2(2):291–306, 1995.
29. Drmanac, R., I. Labat, I. Brukner and R. Crkvenjakov. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics, 4(2):114–28, 1989.
30. Drmanac, R., I. Labat and R. Crkvenjakov. An algorithm for the DNA sequence generation from k-tuple word contents of the minimal number of random fragments. Journal of Biomolecular Structure and Dynamics, 8(5):1085–1102, 1991.
31. Golumbic, M. Algorithmic Graph Theory and Perfect Graphs. Academic Press, London, 1980.
32. Fishburn, P. Interval Orders and Interval Graphs: A Study of Partially Ordered Sets. Wiley, New York, 1985.
33. Lander, E. and M. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–9, 1988.
34. Pevzner, P., H. Tang and G. Tesler. De novo repeat classification and fragment assembly. Genome Research, 14(9):1786–96, 2004.
35. Green, P. PHRAP documentation, 1994.
36. Kececioglu, J. and J. Yu. Separating repeats in DNA sequence assembly. Proceedings of the 5th ACM Conference on Computational Molecular Biology, 176–83, 2001.
37. Roberts, M., B. Hunt, J. Yorke, R. Bolanos and A. Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4):734–52, 2004.
38. Tammi, M., E. Arner, T. Britton and B. Andersson. Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs. Bioinformatics, 18(3):379–88, 2002.
39. Tammi, M., E. Arner, E. Kindlund and B. Andersson. Correcting errors in shotgun sequences. Nucleic Acids Research, 31(15):4663–72, 2003.
40. Sutton, G., O. White, M. Adams and A. Kerlavage. TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Science and Technology, 1:9–19, 1995.
41. Wang, J., G. Wong, P. Ni, Y. Han, X. Huang, et al. RePS: a sequence assembler that masks exact repeats identified from the shotgun data. Genome Research, 12(5):824–31, 2002.
42. Mullikin, J. and Z. Ning. The phusion assembler. Genome Research, 13(1):81–90, 2003.
43. Havlak, P., R. Chen, K. Durbin, A. Egan, Y. Ren, et al. The Atlas genome assembly system. Genome Research, 14(4):721–32, 2004.
44. Roach, J. Random subcloning. Genome Research, 5(5):464–73, 1995.
45. Myers, E., G. Sutton, A. Delcher, I. Dew, D. Fasulo, et al. A whole-genome assembly of Drosophila. Science, 287(5461):2196–204, 2000.
46. Pevzner, P. and H. Tang. Fragment assembly with double-barreled data. Bioinformatics, 17(Suppl 1):S225–33, 2001.



47. Grotschel, M., L. Lovasz, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer-Verlag, Berlin, 1993.
48. Pevzner, P. l-tuple DNA sequencing: computer analysis. Journal of Biomolecular Structure and Dynamics, 7(1):63–73, 1989.
49. Lysov, Y., V. Florentiev, A. Khorlin, K. Khrapko, V. Shik and A. Mirzabekov. DNA sequencing by hybridization with oligonucleotides. Dokl. Academy of Sciences USSR, 303:1508–11, 1988.
50. Edwards, A. and C. Caskey. Closure strategies for random DNA sequencing. Methods: A Companion to Methods in Enzymology, 3:41–47, 1990.
51. Roach, J., C. Boysen, K. Wang and L. Hood. Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics, 26(2):345–53, 1995.
52. Venter, J., M. Adams, E. Myers, P. Li, R. Mural and G. Sutton. The sequence of the human genome. Science, 291(5507):1304–51, 2001.
53. Huang, X. and A. Madan. CAP3: a DNA sequence assembly program. Genome Research, 9(9):868–77, 1999.
54. Huson, D., A. Halpern, Z. Lai, E. Myers, K. Reinert and G. Sutton. Comparing assemblies using fragments and mate-pairs. Proceedings of the 1st Workshop on Algorithms Bioinformatics, WABI-01:294–306, 2001.
55. Jaffe, D., J. Butler, S. Gnerre, E. Mauceli, K. Lindblad-Toh, et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Research, 13(1):91–6, 2003.
56. Huang, X. An improved sequence assembly program. Genomics, 33(1):21–31, 1996.
57. Fasulo, D., A. Halpern, I. Dew and C. Mobarry. Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics, 18(Suppl 1):S294–302, 2002.
58. Anson, E. and E. Myers. Realigner: a program for refining DNA sequence multialignments. Journal of Computational Biology, 4(3):369–83, 1997.
59. Sanger, F., A. Coulson, G. Hong, D. Hill and G. Petersen. Nucleotide sequence of bacteriophage λ DNA. Journal of Molecular Biology, 162(4):729–73, 1982.
60. Staden, R. A new computer method for the storage and manipulation of DNA gel reading data. Nucleic Acids Research, 8(16):3673–94, 1980.
61. Fleischmann, R., M. Adams, O. White, R. Clayton, E. Kirkness, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269(5253):496–512, 1995.
62. Istrail, S., G. Sutton, L. Florea, A. Halpern, et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proceedings of the National Academy of Sciences USA, 101(7):1916–21, 2004.
63. Tammi, M., E. Arner and B. Andersson. TRAP: Tandem Repeat Assembly Program produces improved shotgun assemblies of repetitive sequences. Computer Methods and Programs in Biomedicine, 70(1):47–59, 2003.
64. Eichler, E. Masquerading repeats: paralogous pitfalls of the human genome. Genome Research, 8(8):758–62, 1998.
65. Lawrence, E. and V. Solovyev. Assignment of position-specific error probability to primary DNA sequence data. Nucleic Acids Research, 22(7):1272–80, 1994.
66. Bonfield, J. and R. Staden. The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Research, 23:1406–10, 1995.

Shotgun Fragment Assembly


67. Gleizes, A. and A. Henaut. A global approach for contig construction. Computer Applications in the Biosciences, 10(4):401–8, 1994. 68. Kim, S. and A. Segre. AMASS: a structured pattern matching approach to shotgun sequence assembly. Journal of Computational Biology, 6(2):163–86, 1999. 69. Bonfield, J., K. Smith and R. Staden. A new DNA sequence assembly program. Nucleic Acids Research, 23(24):4992–9, 1995. 70. Gryan, G. Faster sequence assembly software for megabase shotgun assemblies. Genome Sequencing and Analysis Conference VI, 1994. 71. Chen, T. and S. Skiena. Trie-based data structures for sequence assembly. 8th Symposium on Combinatorial Pattern Matching, 206–23, 1997. 72. Pop, M., D. Kosack and S. Salzberg. Hierarchical scaffolding with Bambus. Genome Research, 14(1):149–59, 2004. 73. Kosaraju, R. and A. Delcher. Large-scale assembly of DNA strings and space-efficient construction of suffix trees. Proceedings of the 27th ACM Symposium on Theory of Computing, 169–77, 1995. 74. Burks, C., R. Parsons and M. Engle. Integration of competing ancillary assertions in genome assembly. ISMB 1994, 62–9, 1994. 75. Parsons, R., S. Forrest and C. Burks. Genetic algorithms: operators, and DNA fragment assembly. Machine Learning, 21(1-2):11–33, 1995. 76. Parsons, R, and M. Johnson. DNA sequence assembly and genetic algorithms: new results and puzzling insights. Proceedings of Intelligent Systems in Molecular Biology, 3:277–84, 1995. 77. Mulyukov, Z. and P. Pevzner. EULER-PCR: finishing experiments for repeat resolution. Pacific Symposium in Biocomputing 2002, 199–210, 2002. 78. Celniker, S., D. Wheeler, B. Kronmiller, J. Carlson, A. Halpern, et al. Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biology, 3(12):1–14, 2002. 79. Hoskins, R., C. Smith, J. Carlson, A. Carvalho, A. Halpern, et al. Heterochromatic sequences in a Drosophila whole-genome shotgun assembly. Genome Biology, 3(12):1–16, 2002. 80. 
Webber, J. and E. Myers. Human whole-genome shotgun sequencing. Genome Research, 7(5):401–9, 1997. 81. Green, P. Against a whole-genome shotgun. Genome Research, 7(5):410–17, 1997. 82. Anson, E. and E. Myers. Algorithms for whole genome shotgun sequencing. Proceedings RECOMB’99, 1–9, 1999. 83. Chen, E., D. Schlessinger and J. Kere. Ordered shotgun sequencing, a strategy for integrated mapping and sequencing of YAC clones. Genomics, 17(3):651–6, 1993. 84. Chaisson, M., P. Pevzner and H. Tang. Fragment assembly with short reads. Bioinformatics, 20(13):2067–74, 2004. 85. Myers, E. Advances in sequence assembly. In M.D. Adams, C. Fields and J. C. Venter (Eds.), Automated DNA Sequencing and Analysis (pp. 231–8) Academic Press, London, 1994. 86. Myers, G. Whole-genome DNA sequencing. Computing in Science and Engineering, 1(3):33–43, 1999.

4 Gene Finding

John Besemer & Mark Borodovsky

Recent advances in sequencing technology have created an unprecedented opportunity to obtain the complete genomic sequence of any biological species in a short time and at a reasonable cost. Computational gene-finding approaches allow researchers to quickly transform these texts, strings of millions of nucleotides with little obvious meaning, into priceless annotated books of life. Immediately, researchers can start extracting pieces of fundamental knowledge about the species at hand: translated protein products can be characterized and added to protein families; predicted gene sequences can be used in phylogenetic studies; the order of predicted prokaryotic genes can be compared to other genomes to elucidate operon structures; and so on.

While software programs for gene finding have reached a high level of accuracy, especially for prokaryotic genomes, they are not designed to replace human experts in the genome annotation process. The programs are extremely useful tools that greatly reduce the time required for sequence annotation, and the overall quality of annotations is improving as the programs become better. This acceleration is becoming a critical issue, as a recent attempt to sequence microbial populations en masse, rather than individual genomes, produced DNA sequences from over 1800 species including 148 novel phylotypes [1].

While genes can be found experimentally, these procedures are both time-consuming and costly and are best utilized on small scales. Sets of experimentally validated genes, however, are of the utmost importance to computational gene finding, as they provide the most trustworthy sets for testing programs. Currently, such sets are rather few in number ([2] and [3] give two notable examples for Escherichia coli K12; [4] and [5] are well known for Homo sapiens and Arabidopsis thaliana, respectively). Typically, they contain a small number of genes used to validate the predictions made in a particular study [6]. Recently, RT-PCR and direct sequencing techniques have been used to verify specific computational predictions [7].

COMPONENTS OF GENE FINDERS

Before considering specific gene finders, it is important to mention the two major components present in many current programs: the prediction algorithm and the statistical models that typically reflect the genomic features of a particular species (or, perhaps, group of species such as plants or mammals). The algorithm defines the mathematical framework on which a gene finder is based. While popular gene finders frequently use statistical methods [often based on Markov models or hidden Markov models (HMMs)] and dynamic programming algorithms, other approaches including artificial neural networks have been attempted with considerable success as well. Selection of an appropriate framework (algorithm) requires knowledge of the organization of the genome under study, so it relies on biological knowledge in addition to mathematics.

The second major component, which specifically influences ab initio, or single sequence, gene finders, is the set of model parameters the program uses to make gene predictions in anonymous DNA. These models can be of many types: homogeneous Markov chains to model noncoding DNA; inhomogeneous three-periodic Markov chains and interpolated Markov chains to model coding DNA; position-specific weight matrices (which also could be derived from statistical models) to model ribosomal binding sites (RBS), splice sites and other motifs; and exponential and gamma-type distributions to model distances within and between particular gene elements in the genome. A sample of the models used by the prokaryotic and eukaryotic versions of GeneMark.hmm is shown in figure 4.1. Exactly which combination of models is used depends on both the species being studied and the amount of available data. For instance, at the early stages of a genome sequencing project, the number of experimentally validated genes is typically not sufficient to derive the parameters of high-precision models of the protein-coding DNA, but there are ways to circumvent this problem (see below).

MAJOR CHALLENGES IN GENE FINDING

Even though current gene-finding programs have reached high levels of accuracy, there remains significant room for improvement. In prokaryotes, where impressive average sensitivity figures over 95% have frequently been published [8–11], false positive rates of 10–20% are routine. These numbers can be much worse when the focus shifts to short genes, given the “evil little fellows” (ELFs) moniker due to difficulties distinguishing true short genes from random noncoding open reading frames (ORFs) [12]. The overannotation of short genes is a problem that plagues nearly all annotated microbial genomes [13]. Exact determination of the 5′-ends of prokaryotic genes has taken great strides from times when programs simply predicted genes as ORFs with ambiguous starts or extended all predicted genes to the longest ORF. Relatively recently, models of the RBS started to be used in the algorithms in a more advanced manner. Still, as of yet, the issue of accurate gene start prediction is not closed.



Figure 4.1 A sample of statistical models. (A) Two-component model of the Bacillus subtilis RBS: nucleotide frequency distribution displayed as a sequence logo [121], left; and distribution of spacer lengths, right (used by the prokaryotic version of GeneMark.hmm). (B) Graphical representation of the donor site model for A. thaliana, displayed as a pictogram (Burge, C., pictogram.html). (C) Same as in B, for the acceptor site model. (D) Distribution of exon lengths for A. thaliana (used by the eukaryotic version of GeneMark.hmm).

The complicated organization of eukaryotic genes presents even more challenges. While the accuracy of predicting exon boundaries is approaching 90% for some genomes, assembling the predicted exons into complete gene structures can be an arduous task. In addition, the errors made here tend to multiply; a missed splice site may corrupt the whole gene prediction unless one or more additional missed splice sites downstream allow the exon–intron structure prediction to get back on track. Moreover, while some genome organization features may be common to all prokaryotes (for example, gene length distributions in all prokaryotes are similar to that of E. coli [10]), current data show that eukaryotes tend to be much more diverse. There is no universal exon or intron length distribution; the average number of introns per gene is variable; branch points are prominent in some genomes and seemingly missing in others; and so on. To deal with this diversity in genome organization one may need algorithms that can alter their structure, typically the HMM architecture, to better fit the genetic language grammar of a particular genome [14].

While comparisons of different programs are discussed extensively in the literature, the area of gene prediction lacks thoroughly organized competitions such as CASP in the area of protein structure prediction. Recent initiatives such as ENCODE ( ENCODE/) attempt to fill this void. Several publications have set out to determine which program is the “best” gene finder for a particular genome [15–18]. However detailed these studies are, their results are difficult to extrapolate beyond the data sets they used, as the performance differences among gene finders are tightly correlated with differences in the sequence data used for training and testing. As performance tests are clear drivers of the development of practical gene finders, it is important that algorithm developers also pursue another goal: the creation of programs that not only make accurate predictions, but also improve our understanding of the biological processes that have brought the genomes, as they are now, to life.

CLASSIFYING GENE FINDERS

In the early years of gene finding, it was quite easy to classify gene finders into two broad categories: intrinsic and extrinsic [19]. Ideally, the intrinsic approach, which gives rise to ab initio or single sequence gene finders, uses no explicit information about DNA sequences other than the one being analyzed. This definition is not perfect though, since an intrinsic approach may rely on statistical models with parameters derived from other sequences. This loophole in the definition of the intrinsic approach is tolerable, provided the term “intrinsic” conveys the meaning of statistical model-based approaches as opposed to similarity search-based ones. Therefore, intrinsic methods rely on the parameters of statistical models which are learned from collections of known genes. In general, this learning has to be genome-specific, though recent studies have shown that reasonable predictions can be obtained even with models deviating from those precisely tuned for a particular genome [20,21]. Initially, this was observed with the Markov models generated from E. coli K12 genes of class III, a rather small class which displayed the least pronounced E. coli-specific codon usage pattern and presumably contained a substantial portion of laterally transferred genes [22]. These “atypical models” were able to predict the majority of genes of E. coli. This observation led to the development of a heuristic approach for deriving models which capture the basic, but still genome-specific, pattern of nucleotide ordering in protein-coding DNA [20]. Heuristic models can be tuned for a particular genome by adjusting just a few parameters reflecting specific nucleotide composition. This approach is also useful for deriving models for rather short inhomogeneous sections of genomes such as pathogenicity islands or for the genomes of viruses and phages for which there is not enough data for conventional model training [23,24].

Extrinsic gene-finding approaches utilize sequence comparison methods, such as BLASTX (nucleotide query translated in six frames versus protein database), TBLASTX (nucleotide query translated in six frames versus nucleotide database translated in six frames), or BLASTN (nucleotide query versus nucleotide database) [25]. Robison et al. [26] introduced the first extrinsic gene finders for bacterial genomes. Programs performing alignment of DNA to libraries of nucleotide sequences known to be expressed (cDNA and EST sequences) have to properly compensate for large gaps (which represent introns in the genomic sequence) to be useful for detecting genes in eukaryotic DNA. There are several programs that accomplish this task, including est_genome [27], sim4 [28], BLAT [29], and GeneSeqer [30].
Among these, GeneSeqer stands out as the best performing, with this leadership status apparently gained by making use of “intrinsic” features, namely, species-specific probabilistic models of splice sites. The utilization of these models allows the program to better select biologically relevant alignments from a few alternatives with similar scores, resulting in more accurate exon prediction.

The classification of modern gene finders becomes more difficult due to the integrated nature of the new methods. As new high-throughput methods are developed and new types of data become available in vast amounts (cDNA, EST, gene expression data, etc.), more complex gene-finding approaches are needed to utilize all this information. The ENSEMBL project [31] serves as an excellent example of a system that intelligently integrates both intrinsic and extrinsic data. In current practice, nearly all uses of gene finding are integrated in nature. For example, the application of the ab initio gene finders to eukaryotic genomes with frequent repetitive sequences is always preceded by a run of RepeatMasker (Smit, A.M.A., Hubley, R., and Green, P.) to remove species-specific genomic interspersed repeats revealed by similarity search through an existing repeat database. To fine-tune gene start predictions, the prokaryotic gene finders may rely on prior knowledge of the 3′-tail of the 16S rRNA of the species being analyzed.

The ab initio definition has to become more general with the recent introduction of gene-finding approaches based on phylogenetic hidden Markov models (phylo-HMMs), such as Siepel and Haussler’s method [32] to predict conserved exons in unannotated orthologous genomic fragments of multiple genomes. While such a method belongs to the ab initio class as defined, since no knowledge of the gene content of the multiple DNA sequences is required, the algorithm relies heavily on the assumption that the fragments being considered are orthologous and the phylogenetic relationships of the species considered are known.

While developments of intrinsic and extrinsic approaches will advance further in coming years, genome annotators will continue to rely on integrated approaches. In addition, researchers are also frequently combining the predictions of multiple gene finders into a joint set of metapredictions [33]. Such methods are gaining in sophistication and popularity in genome sequencing projects [34].

ACCURACY EVALUATION

The quality of gene prediction is frequently characterized by values of sensitivity (Sn) and specificity (Sp). Sensitivity is defined as the ratio of the number of true positive predictions made by a prediction algorithm to the number of all positives in the test set. Specificity is defined as the ratio of the number of true positive predictions to the total number of predictions made. Readers with a computer science background may be familiar with the terms recall and precision rather than sensitivity and specificity, respectively. For gene prediction algorithms, sensitivity and specificity are often determined in terms of individual nucleotides, splice sites, translation starts and ends, separate exons, and whole genes. Both sensitivity and specificity must be determined on test sets of sequences with experimentally validated genes. Some of the levels of sensitivity and specificity definition (i.e., nucleotides, exons, complete genes) are more useful for the realistic evaluation of practical algorithm performance than others.

For prokaryotes, the basic unit of gene structure is the complete gene. For state-of-the-art prokaryotic gene finders, Sn at the level of complete genes is typically above 90% and for many species close to 100%. The Sp value in the “balanced” case is expected to be about the same as Sn, but some prediction programs are tuned to exhibit higher Sn than Sp. The rationale here is that a human expert working with a prediction program would rather take the time to eliminate some (usually low-scoring) predictions deemed to be false positives than leave this elimination entirely to the computer. At first glance, the high Sn and Sp figures may indicate that the problem of prokaryotic gene finding is “solved.” This, however, is not the case, as such overall figures do not adequately reflect the errors in finding the exact gene starts or the rate of erroneous prediction in the group of short genes (shorter than 400 nt).

In most eukaryotes, the basic unit of gene structure is the exon. Thus, exon-level accuracy is a quite natural and informative measure. In state-of-the-art eukaryotic gene finders, exon-level Sn and Sp approach 85%. Interestingly, complete gene prediction accuracy will not be a highly relevant measure until exon-level accuracy approaches 100%. Even with 90% Sn at the exon level, the probability of correctly predicting all exons of a ten-exon gene (and thus the complete gene) is only 0.9^10, or ~35%, though this is a rough estimate as the events are not strictly independent.

For eukaryotes, Sn and Sp are often presented at the nucleotide level as well. However, such data should be used with caution, as some nucleotides are more important than others. For instance, the knowledge of the exact locations of gene boundaries (splice sites and gene starts and stops) is especially important. Misplacement of an exon border by one nucleotide may dramatically affect a large portion of the sequence of the predicted protein product. In the worst case, it is possible to predict a gene with near 100% nucleotide-level Sn and Sp while missing every single splice site.

Gene-finding programs are often compared based on Sn and Sp calculated for particular test sets. This approach meets the difficulty of operating with multiple criteria, that is, the highest performing tool in terms of Sn may have lower Sp than the others. One of the ways to combine Sn and Sp into a single measure is to employ the F-measure, defined as 2 * Sn * Sp/(Sn + Sp). Yet another integrative method is to use the ROC curve [35].

GENE FINDING IN PROKARYOTES

Organization of protein-coding genes in prokaryotes is relatively simple. In the majority of prokaryotic genomes sequenced to date, genes are tightly packed and make up approximately 90% of the total genomic DNA. Typically, a prokaryotic gene is a continuous segment of a DNA strand (containing an integral number of triplets) which starts with the triplet ATG (the major start codon) or, less frequently, with GTG, CTG, or TTG, and ends with one of the gene-terminating triplets TAG, TGA, or TAA. Traditionally, a triplet sequence which starts with ATG and ends with a stop codon is called an open reading frame (ORF). Note that an ORF may or may not code for protein. However, the length distributions of ORFs known to code for protein and ORFs that simply occur by chance differ significantly. Figure 4.2 shows the probability densities of the length distributions of both random ORFs and GenBank annotated genes for the E. coli K12 genome [36]. The exponential and gamma distributions are typically used to approximate the length distributions for random ORFs and protein-coding ORFs, respectively [10]. Parameters of these distributions can vary across genomes with different G+C contents.

Figure 4.2 Length distributions of arguably noncoding ORFs and GenBank annotated protein-coding ORFs for the E. coli K12 genome.

As was mentioned, a challenging problem in prokaryotic gene finding is the discrimination of the relatively few true short genes from the large number of random ORFs of similar length. According to the GenBank annotation [36] there are 382 E. coli K12 genes 300 nt or shorter, while NCBI’s OrfFinder reports over 17,000 ORFs in the same length range. This statistic is illustrated in the inset of figure 4.2.

The development of ab initio approaches for gene finding in prokaryotes has a rather long history initiated by the works of Fickett [37], Gribskov et al. [38], and Staden [39]. The first characterization of nucleotide compositional bias related to DNA protein-coding function was apparently done by Erickson and Altman in 1979 [40]. It is worth noting that a frequently used gene finder, FramePlot [41], utilizes the simple measure of positional G+C frequencies to predict genes in prokaryotic genomes with high G+C%.
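The ORF definition given above (an ATG followed in frame by the first stop codon) translates directly into a short scanner. The Python sketch below is illustrative only and is not the procedure of any program named in the text: the names `Orf`, `orfs_on_strand`, and `find_orfs` are hypothetical, only ATG is treated as a start (the rarer GTG, CTG, and TTG starts are ignored), and for each stop codon the most upstream in-frame ATG is used, which yields the longest ORF.

```python
from typing import Iterator, NamedTuple

START = "ATG"
STOPS = {"TAG", "TGA", "TAA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class Orf(NamedTuple):
    strand: str  # "+" for the given strand, "-" for its reverse complement
    frame: int   # 0, 1, or 2: offset of the reading frame on the scanned strand
    start: int   # 0-based index of the A of ATG on the scanned strand
    end: int     # index one past the last base of the stop codon
    length: int  # end - start, always a multiple of 3


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_on_strand(seq: str, strand: str) -> Iterator[Orf]:
    """Yield ATG...stop ORFs in the three frames of one strand."""
    for frame in range(3):
        start = None  # position of the first ATG since the previous stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                yield Orf(strand, frame, start, i + 3, i + 3 - start)
                start = None


def find_orfs(seq: str, min_len: int = 0) -> list[Orf]:
    """Collect ORFs of at least min_len nucleotides on both strands."""
    seq = seq.upper()
    found = list(orfs_on_strand(seq, "+"))
    found += list(orfs_on_strand(reverse_complement(seq), "-"))
    return [orf for orf in found if orf.length >= min_len]
```

Calling `find_orfs(genome, min_len=90)` on a whole genome would return many more candidates than annotated genes, which is the short-ORF discrimination problem described above; the 90 nt cutoff here is an arbitrary choice for the example.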



Application of Markov Chain Models

In the 1980s a number of measures of DNA coding potential were suggested based on various statistically detectable features of protein-coding sequences (Fickett and Tung [42] reviewed and compared 21 different measures). Markov chain theory provided a natural basis for the mathematical treatment of DNA sequence [43] and ordinary Markov models have been used since the 1970s [44]. When a sufficient amount of sequence data became available, three-periodic inhomogeneous Markov models were introduced and proven to be more informative and useful for protein-coding sequence modeling and recognition than other types of statistics [45–47]. The three-periodic Markov chain models have not only an intuitive connection to the triplet structure of the genetic code, but also reflect fundamental frequency patterns generated by this code in protein-coding regions. Subsequently, Markov models of different types, ordinary (homogeneous) and inhomogeneous, necessary to describe functionally distinct regions of DNA, were integrated within the architecture of a hidden Markov model (HMM) with duration (see below). The first gene-finding program using Markov chain models, GeneMark [48], uses a Bayesian formalism to assess the a posteriori probability that the functional role of a given short fragment of DNA sequence is coding (in one of the six possible frames) or noncoding. These calculations are performed using a three-periodic (inhomogeneous) Markov model of protein-coding DNA sequence and an ordinary Markov model of noncoding DNA. To analyze a long sequence, the sliding window technique is used and the Bayesian algorithmic step is repeated for each successive window. The default window size and sliding step size are 96 nt and 12 nt respectively. GeneMark has been shown to be quite accurate at assigning functional roles to small fragments [48]. The posterior probabilities of a particular function defined for overlapping windows covering a given ORF are then averaged into a single score. 
The ORF is predicted as a protein-coding gene if the score is above the preselected threshold. The GeneMark program has been used as the primary annotation tool in many large-scale sequencing projects, including such milestones as the pioneering projects on the first bacterial genome, Haemophilus influenzae, the first archaeal genome, Methanococcus jannaschii, and the E. coli genome project. Interestingly, the new approach subsequently implemented in GeneMark.hmm (see below) turned out to be a method with properties complementary to those of GeneMark, rather than a better version of GeneMark (referred to as the HMM-like algorithm [34]). It was shown [6] that this complementarity is akin to the complementarity of the Viterbi algorithm (GeneMark.hmm) and the posterior decoding algorithm (GeneMark), both frequently used in HMM applications.
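The sliding-window Bayesian step described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the GeneMark implementation: it uses zero-order (position-specific frequency) models instead of higher-order Markov chains, considers only the three direct-strand frames plus a noncoding hypothesis rather than all six frames, and all probabilities and the prior are hypothetical toy values. Only the window size (96 nt) and step (12 nt) are taken from the text.

```python
import math

WINDOW, STEP = 96, 12  # default window and step sizes quoted in the text

# Hypothetical, untrained toy parameters: P(nucleotide | codon position).
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},  # codon position 1
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},  # codon position 2
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # codon position 3
]
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def window_posteriors(window: str, prior_coding: float = 0.5) -> dict[str, float]:
    """Posterior over 'frame0', 'frame1', 'frame2', and 'noncoding'.

    'frameK' is the hypothesis that window[0] occupies codon position K
    (0-based) of a gene on the direct strand.
    """
    log_like = {"noncoding": sum(math.log(NONCODING[b]) for b in window)}
    for frame in range(3):
        log_like[f"frame{frame}"] = sum(
            math.log(CODING[(frame + i) % 3][b]) for i, b in enumerate(window)
        )
    # The coding prior is split evenly among the three frames.
    log_prior = {k: math.log(prior_coding / 3) for k in log_like}
    log_prior["noncoding"] = math.log(1 - prior_coding)
    joint = {k: log_like[k] + log_prior[k] for k in log_like}
    top = max(joint.values())  # subtract the maximum to avoid underflow
    norm = sum(math.exp(v - top) for v in joint.values())
    return {k: math.exp(v - top) / norm for k, v in joint.items()}


def orf_score(orf: str) -> float:
    """Average, over sliding windows, the posterior of the ORF's own frame."""
    last_start = max(len(orf) - WINDOW, 0)
    scores = []
    for start in range(0, last_start + 1, STEP):
        post = window_posteriors(orf[start:start + WINDOW])
        # STEP is a multiple of 3, so each window begins at codon position 0.
        scores.append(post["frame0"])
    return sum(scores) / len(scores)
```

An ORF would then be called a gene when `orf_score(orf)` exceeds a chosen threshold; the threshold value is a tuning decision that this sketch leaves open.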

HMM Algorithms

There are some inherent limitations of the sliding window approach: (i) it is difficult to identify short genes, those of length comparable to the window size, and (ii) it is difficult to pinpoint real gene starts when alternative starts are separated by a distance smaller than half of the window length. The HMM modeling paradigm, initially developed in speech recognition [49] and introduced to biological sequence analysis in the mid-1990s, could be naturally used to reformulate the gene-finding problem statement in the HMM terms. This approach removed the need for the sliding window, and the general Viterbi algorithm adjusted for the case of the HMM model of genomic DNA would deliver the maximum likelihood genomic sequence parse into protein-coding and noncoding regions. The first algorithm, ECOPARSE, explicitly using a hidden Markov model for gene prediction in the E. coli genome was developed by Krogh et al. [50]. The HMM technique implies, in general, that the DNA sequence is interpreted as a sequence of observed states (the nucleotides) emitted stochastically by the hidden states (labeled by the nucleotide function: protein-coding, noncoding, etc.) which, in turn, experience transitions regulated by probabilistic rules. In its classic form, an HMM would emit an observed state (a nucleotide) from each hidden state. This assumption causes the lengths of protein-coding genes to be distributed geometrically, a significant deviation from the length distribution of real genes. The classic HMM can be modified to allow a single hidden state to emit a whole nucleotide segment with length distribution of a desirable shape. This modification is known as HMM “with duration,” a generalized HMM (GHMM), or a semi-Markov HMM [49]. 
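To make the classic (non-generalized) HMM formulation concrete, here is a minimal Python sketch of Viterbi decoding with two hidden states, together with the geometric duration distribution that a self-transition implies. It is not ECOPARSE or GeneMark.hmm: the state set, the zero-order emission probabilities, and the transition probabilities are hypothetical toy values chosen only so that the example runs.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters, for illustration only.
START_P = {"noncoding": 0.5, "coding": 0.5}
TRANS_P = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq: str) -> list[str]:
    """Return the most probable hidden-state path for seq (log-space Viterbi)."""
    score = [{s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]]) for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[-1][r] + math.log(TRANS_P[r][s]))
            col[s] = (score[-1][prev] + math.log(TRANS_P[prev][s])
                      + math.log(EMIT_P[s][base]))
            ptr[s] = prev
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def duration_pmf(d: int, stay: float) -> float:
    """P(a state is occupied for exactly d steps) given self-transition 'stay'."""
    return stay ** (d - 1) * (1 - stay)
```

With a self-transition probability of 0.9, `duration_pmf` decays geometrically from d = 1, so the shortest segments are always the most probable. This is the mismatch with real gene lengths that motivates the HMM with duration, in which a state emits a whole segment whose length is drawn from an explicit distribution.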
The Markov models of protein-coding regions (with separate submodels for typical and atypical gene classes) and models of noncoding regions can then be incorporated into the HMM framework to assess the probability of a stretch of DNA sequence emitted by a particular hidden state. The performance of an HMM-based algorithm critically depends on the choice of the HMM architecture, that is, the choice of the hidden states and transition links between them. For instance, the prokaryotic version of GeneMark.hmm uses the HMM architecture shown in figure 4.3. With all components of the HMM in place, the problem is reduced to finding the maximum likelihood sequence of hidden states associated with emitted DNA fragments, thus the sequence parse, given the whole sequence of nucleotides (observed states). This problem is solved by the modified Viterbi algorithm.

Figure 4.3 Simplified diagram of hidden state transitions in the prokaryotic version of GeneMark.hmm. The hidden state “gene” represents the protein-coding sequence as well as an RBS and a spacer sequence. Two distinct Markov chain models represent the typical and atypical genes, thus genes of both classes can be predicted. For simplicity, only the direct strand is shown and gene overlaps, while considered in the algorithm, are not depicted.

Interestingly, the classic notion of statistical significance, which has been used frequently in the evaluation of the strength of pairwise sequence similarity, has not been used in gene prediction algorithms until recently. This measure was reintroduced in the EasyGene algorithm [35], which evaluates the score of an ORF with regard to the expected number of ORFs of the same or higher score in a random sequence of similar composition.

Gene Start Prediction

Commonly, there exist several potential start codons for a predicted gene. Unless a gene finder with strong enough discrimination power for true gene start prediction was at hand, the codon ATG producing the longest ORF was identified by annotators (or the program itself) as the predicted gene start. It was estimated that this simple method pinpoints the true start for approximately 75% of the real genes [8]. We emphasize that 75% is a rough estimate, obtained under the assumption that there is no use of (relatively rarely occurring) GTG, CTG, and TTG as start codons and that the DNA sequence is described by the simplest multinomial model with equal percentages of each of the four nucleotides. Still, there is a need to predict gene starts more accurately. Such an improvement would not only give the obvious benefit of providing more reliable genes and proteins, but also would improve the delineation of intergenic regions containing sites involved in the regulation of gene expression. To improve gene start prediction, the HMM architecture of prokaryotic GeneMark.hmm contains hidden states for the RBS, modeled with a position-specific probability matrix [51], and the spacer between the RBS and the predicted translation start codon. This two-component RBS and spacer model is illustrated in figure 4.1A. Note that the accurate detection of gene starts can also be delayed to a postprocessing stage, following the initial rough gene prediction. Such an approach was implemented by Hannenhalli et al. [52] in RBSfinder for the Glimmer program [53]; in MED-Start [54]; and in the initial version of GeneMark.hmm [10].

Markov Models Are Not the Only Way to Go

While Markov models and HMMs provide a solid mathematical framework for the formalization of the gene-finding problem, a variety of other approaches have been applied as well. An algorithm called ZCURVE [11] uses positional nucleotide frequencies, along with phase-specific dinucleotide frequencies (only dinucleotides covering the first two and last two codon positions are considered), to represent fragments of DNA (such as ORFs) as vectors in 33-dimensional space. Training sets of coding and noncoding DNA are used by the Fisher discriminant algorithm to define a boundary in space (the Z curve) separating the coding and noncoding sequences. The authors recognized that, while a set of ORFs with strong evidence of being protein-coding is not difficult to compile, acquiring a reliable noncoding set is more challenging, as the intergenic regions in bacteria are short and may be densely populated with RNA genes and regulatory motifs which may alter the base composition. The authors proposed building the noncoding training set from the reverse complements of shuffled versions of the sequences used in the coding training set. In tests on 18 complete bacterial genomes, the ZCURVE program demonstrated sensitivity similar to Glimmer and approximately 10% higher specificity [11]. The Bio-Dictionary Gene Finder (BDGF) exhibits features of both extrinsic and intrinsic gene finders [55]. The Bio-Dictionary [56] itself is a database of so-called “seqlets”, patterns extracted from the GenPept protein database using the Teiresias algorithm [57]. The version of the Bio-Dictionary used with BDGF contains approximately 26 million seqlets which represent all patterns of length 15 or less that start and end with a literal (i.e., one of the 20 amino acid characters), contain at least six literals, and occur two or more times in GenPept. The Teiresias algorithm extracts these patterns in an unsupervised mode. Utilizing BDGF to find genes is relatively straightforward.
All ORFs in a genomic sequence are collected and translated into proteins. These proteins are scanned for the presence of seqlets and if the number of detected seqlets is sufficiently large, the ORF is predicted as a gene. In practice, the seqlets are weighted based on their amino acid composition, and these weights are used to calculate the score. These precomputed weights are not species-specific parameters; thus, BDGF can be applied



to any genome without the need for additional training. Tested on 17 prokaryotic genomes, BDGF was shown to predict genes with approximately 95% sensitivity and specificity. As viral genomes are much smaller than those of prokaryotes, gene finders requiring species-specific training are often unsuccessful. BDGF has proven to be a useful tool in this respect, as shown by its use in a reannotation effort of the human cytomegalovirus [58]. The CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) suite of programs [59] uses a combination of intrinsic and extrinsic approaches. The extrinsic information is provided by BLASTN alignments against a database of DNA sequences. For a genomic region aligned to a database sequence, “comparative coding scores” are generated for all six reading frames. In a particular frame, synonymous differences in the nucleotide sequence, those that do not change the encoded amino acid, contribute positively to the score and nonsynonymous changes contribute negatively. Intrinsic information is included in a second score calculated using a version of the dicodon method of Claverie and Bougueleret [60]. These intrinsic and extrinsic scores are used to identify regions of DNA with significant evidence of being coding. Subsequently, regions with significant coding evidence are extended downstream to the first available stop codon. Finally, a score derived from a predefined RBS motif along with a local coding score are used to define the start codon. In a recent test of several prokaryotic gene finders [33], the largely heuristic CRITICA was shown to have the highest specificity at the cost of some sensitivity when compared to Glimmer [9], ORPHEUS [61], and ZCURVE [11]. The Role of Threshold

Regardless of the methods used, all prokaryotic gene finders must divide the set of all ORFs in a given sequence into two groups: those that are coding and those that are noncoding. In this context, the role of a threshold must be discussed. The thresholds (user defined in some programs, hard-coded in others) essentially determine the number of predicted genes. As such, the threshold directly affects the sensitivity and specificity of the prediction method. The following trivial cases illustrate the two possible extremes in choosing thresholds: (i) the program predicts all ORFs as genes (100% Sn, low Sp); and (ii) the program predicts only the highest scoring ORFs as genes, thus making a small number of predictions (100% Sp, low Sn). A rather appealing approach to avoid these extremes would be to define a “balanced” threshold such that the number of false positives and the number of false negatives would be equal. Most programs, however, lean toward the case of higher sensitivity and lower specificity, perhaps because overprediction is deemed to be the lesser evil, given the hope that false positives would be filtered out by human experts. An experiment with GeneMark, a program that allows the user to adjust the threshold parameter, and GeneMark.hmm, one that does not, demonstrates the effect of the threshold value on the overall prediction result for a particular genome (table 4.1). GeneMark with a threshold of 0.4 performs approximately as well as GeneMark.hmm in terms of sensitivity and specificity.

Table 4.1 Sensitivity and specificity values of E. coli gene predictions by GeneMark (with different thresholds) and GeneMark.hmm, abbreviated GM and GM-HMM, respectively

Program   Threshold   Predictions (no.)   Sensitivity (%)   Specificity (%)
GM        0.3         4,447               93.4              88.5
GM        0.4         4,086               91.4              94.9
GM-HMM    n/a         4,045               92.4              96.9
GM        0.5         3,829               88.2              97.7
GM        0.6         3,623               84.5              99.0

GENE FINDING IN EUKARYOTES

The definition of a gene in eukaryotes is more complex than in prokaryotes. Eukaryotic genes are split into alternating coding and noncoding fragments, exons and introns, respectively. The boundaries of introns are referred to as splice sites. Nearly all introns begin with GT and end with AG dinucleotides, a fact exploited by all eukaryotic gene finders. While the nucleotide signatures of the splice sites are highly conserved, other basic features of introns, such as their average lengths and numbers per gene, vary among species. Therefore, species-specific training and algorithmic implementations are of high importance in this area. Gene prediction in eukaryotic DNA is further complicated by the existence of pseudogenes, genomic sequences apparently evolved from functional genes that have lost their protein-coding function. One particular class, called processed pseudogenes, shares many features with single-exon genes and has been a common source of false positive errors in eukaryotic gene finding. Recently, methods for the accurate identification of processed pseudogenes have been developed [62,63].

HMM-Based Algorithms

The GENSCAN program [64], intensively used in annotation of the whole human genome, leveraged key features of earlier successful gene finders, such as the use of three-periodic inhomogeneous Markov models [45] and the parallel processing of direct and reverse strands [48]. The major innovation of GENSCAN was the independent introduction of a generalized hidden Markov model (described earlier by Kulp et al. [65]), along with the model of splice sites using statistical decomposition to account for the most informative patterns. Unlike the genomes of prokaryotes, where within the HMM framework a single Markov model of second order could accurately detect the majority of genes, several models of fourth or fifth order are required for eukaryotic genomes. The reasons for that are as follows. First, exons are rather short in comparison with ORF-long prokaryotic genes, and accurate detection of exons requires high-order models. Second, the genomes of many higher eukaryotes (including human) are composed of distinct regions of differing G+C content termed isochores, which are typically hundreds of kilobases in length. Thus, on the scale of whole chromosomes, eukaryotic genomes are quite inhomogeneous and the use of only one model is not practical. Therefore, to estimate the parameters of several models, the training sequences were divided into empirically defined clusters covering the whole range of G+C content of the human genome.

The probabilistic framework, that is, the HMM architecture, used by GENSCAN (figure 4.4) includes multi-intron genes, genes without introns, intergenic regions, promoters, and polyadenylation signals. With all of these genomic elements permitted to appear in the direct or reverse strand, this has been a quite complete gene model for a eukaryotic genome. In recent years some other elements, such as exonic splicing enhancers, have been explored in considerable detail [66]. In addition to the predictions made by the Viterbi algorithm (i.e., the most likely sequence parse), GENSCAN provided an assessment of the confidence of predicted exons in terms of an a posteriori probability of the exon computed by the posterior decoding algorithm. This feature has been further utilized by Rogic et al. in a program combining the predictions of GENSCAN and HMMGene [67] to obtain predictions with higher specificity [68].

Following the release of GENSCAN, a number of programs using HMM techniques for eukaryotic gene finding have become available, including AUGUSTUS [69], FGENESH [70], HMMGene [67], and the eukaryotic version of GeneMark.hmm ([10], http://opal.biology. In tests of gene-finding accuracy, GENSCAN is still among the top performing programs [15,16]; in one comparative study, the GENSCAN predictions were even used as a standard by which other approaches were judged [71]. However, it seems fair to say that at this time no ab initio gene-finding program is uniformly better than the others for all currently known genomes. For instance, GENSCAN and FGENESH have been among the most accurate for the human genome, while GeneMark.hmm has been one of the most accurate for plant genomes [18], the Genie program has been tuned up for the Drosophila genome [69], and so on.
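The difference between the Viterbi parse and posterior decoding mentioned above can be sketched on a toy model. The following Python fragment is only an illustration under stated assumptions: it uses an ordinary two-state HMM (coding "C" and noncoding "N") with invented probabilities, not the generalized HMM or the exon-level posteriors of GENSCAN, and all names in it are hypothetical.

```python
import math

# Toy two-state HMM; every number below is invented for illustration only.
STATES = ("C", "N")  # C = coding, N = noncoding
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
EMIT = {
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the single most likely state path (the 'parse') as a string."""
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for x in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda p: score[-1][p] + math.log(TRANS[p][s]))
            row[s] = score[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][x])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):  # trace the back-pointers from the end
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


def posterior(seq):
    """Return P(state at position i | whole sequence) via forward-backward.

    Plain probabilities are used, which is adequate only for short sequences.
    """
    n = len(seq)
    fwd = [{s: START[s] * EMIT[s][seq[0]] for s in STATES}]
    for x in seq[1:]:
        prev = fwd[-1]
        fwd.append({s: EMIT[s][x] * sum(prev[p] * TRANS[p][s] for p in STATES)
                    for s in STATES})
    bwd = [{s: 1.0 for s in STATES} for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for s in STATES:
            bwd[i][s] = sum(TRANS[s][t] * EMIT[t][seq[i + 1]] * bwd[i + 1][t]
                            for t in STATES)
    total = sum(fwd[-1][s] for s in STATES)
    return [{s: fwd[i][s] * bwd[i][s] / total for s in STATES} for i in range(n)]


if __name__ == "__main__":
    dna = "GCGCGCGCATATATAT"
    print(viterbi(dna))
    print([round(p["C"], 2) for p in posterior(dna)])
```

The Viterbi path gives one labeling of the sequence, while the posterior values attach a confidence to each label; a combining program of the kind described above could, for example, keep only the predictions whose posterior exceeds a cutoff.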

Figure 4.4 Diagram of the hidden state transitions in the HMM of the GENSCAN program. Protein-coding sequences (exons) and noncoding sequences (introns and intergenic regions) are represented by circles and diamonds, respectively [64].



The AUGUSTUS program makes use of a novel intron submodel, which treats the introns as members of two groups clustered merely by their length [69]. The paradigm of handling short and long introns separately follows the current biological concept claiming the existence of two mechanisms of intron splicing, the “intron definition” and the “exon definition,” related to short and long introns, respectively [72]. In many eukaryotic genomes, the intron length distribution contains a peak near 100 nt and a very long tail. Mathematically, this distribution is often best described as a mixture of two lognormal distributions [72,73], though the exact shape of the distribution varies significantly among species. In the AUGUSTUS algorithm, short introns are precisely modeled with length distributions calculated from sets of known genes and long introns are modeled with a geometric distribution. Modeling the splicing mechanisms in even more detail was employed in the INTRONSCAN algorithm, which is focused on detecting short introns specifically rather than complete gene structures [73]. In a significant exploratory effort, the authors identified the amount of information contained in donor and acceptor sites, branch points, and oligonucleotide composition in several eukaryotic genomes.

The recent SNAP program [14] is also HMM based, but uses a reduced genome model as compared to GENSCAN. It does not include hidden states for promoters, polyadenylation signals, and UTRs. Even with this simplified model, the program accuracy was shown to be high enough on many test sets, a fact attributed to species-specific training. An interesting feature of the SNAP program is that its HMM state diagram is not fixed. A user can alter the structure of the HMM to better match the architecture to the genome under study.

Gene Prediction in Genome Pairs

While HMM-based intrinsic approaches have been the main direction in eukaryotic gene finding for some time, efforts utilizing comparative genomics are now becoming more and more widespread. One of these approaches, implemented in a program called SLAM [74], uses a generalized pair HMM (GPHMM) to simultaneously construct an alignment between orthologous regions of two genomes, such as H. sapiens and Mus musculus, and identify genes in the aligned regions of both. A GPHMM is described as a hybrid of a generalized HMM and a pair HMM, which emits pairs of symbols (including a gap symbol) and is useful in the area of sequence alignment (see ch. 4 in [43]). As input, SLAM takes two DNA sequences along with their approximate alignment, defined as a set of “reasonable” alignments determined by the AVID global alignment tool [75]. These alignments help limit the search space, and thus the computational complexity, of the GPHMM. The output consists of predicted gene structures for each of the DNA sequences. One difficulty that SLAM attempts to overcome is that, perhaps surprisingly, there is a large amount of noncoding sequence conserved between the human and mouse genomes. The implementation of a conserved noncoding state in the GPHMM decreases the rate of false positive predictions by eliminating the possibility of predicting these conserved sequences as exons.

Another program based on a pair HMM is Doublescan [76]. Doublescan does not require the two homologous DNA sequences to be prealigned. Still, it imposes a restriction that the features to be matched in the two sequences must be collinear to one another. The authors explain this restriction by the fact that the sequences intended to be analyzed are relatively short and contain only a small number of genes. When tested on a set of 80 human/mouse orthologous pairs, Doublescan exhibited 10% higher Sn and 4% higher Sp than GENSCAN at the level of complete gene structures, even though GENSCAN performed better at the level of individual nucleotides and exons.

Two programs, SGP2 [77] and TWINSCAN [78], attempted to improve prediction specificity using the informant genome approach. Both of these programs exploit homology between two genomes, the target genome and the informant (or reference) genome. SGP2 heuristically integrates the ab initio gene finder GeneID [79,80] with the TBLASTX similarity search program [25]. The GeneID program provides a score, a log-likelihood ratio, for each predicted exon. SGP2 adds the GeneID score to a weighted score (also a log-likelihood ratio) of high-scoring pairs identified by the TBLASTX search against the informant genome database. Predicted exons are then combined into complete gene structures “maximizing the sum of the scores of the assembled exons,” the same principle used by GeneID itself [77].
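The scoring principle just described for SGP2 (an ab initio exon score plus a weighted similarity score, followed by an assembly that maximizes the summed score) can be sketched as follows. This is a simplified illustration, not the SGP2 or GeneID implementation: the `Exon` record, the weight value, and the assembly rule (any set of non-overlapping exons, with reading frame and splice-site compatibility ignored) are assumptions made for the example.

```python
from bisect import bisect_right
from typing import NamedTuple


class Exon(NamedTuple):
    """Hypothetical candidate exon with two log-likelihood ratio scores."""
    start: int         # first position, inclusive
    end: int           # last position, inclusive
    ab_initio: float   # score from the ab initio predictor
    similarity: float  # score from hits against the informant genome


def combined_score(exon, weight):
    """Ab initio score plus the weighted similarity score."""
    return exon.ab_initio + weight * exon.similarity


def best_assembly(exons, weight=0.5):
    """Choose non-overlapping exons maximizing the sum of combined scores.

    Classic weighted-interval dynamic programming over exons sorted by end.
    Returns (total score, list of chosen exons in genomic order).
    """
    exons = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in exons]
    # best[i] holds the optimum using only the first i exons.
    best = [(0.0, [])]
    for i, exon in enumerate(exons):
        # Number of earlier exons that end strictly before this one starts.
        j = bisect_right(ends, exon.start - 1, 0, i)
        take = best[j][0] + combined_score(exon, weight)
        if take > best[i][0]:
            best.append((take, best[j][1] + [exon]))
        else:
            best.append(best[i])
    return best[-1]


if __name__ == "__main__":
    candidates = [
        Exon(1, 100, 2.0, 1.0),
        Exon(50, 150, 1.0, 6.0),
        Exon(120, 200, -1.0, 0.0),
        Exon(160, 250, 1.5, 0.0),
    ]
    print(best_assembly(candidates, weight=0.5))
    print(best_assembly(candidates, weight=0.0))
```

Running the two calls side by side shows how the informant-genome evidence can change which of two overlapping candidate exons enters the final structure.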
The first step in the TWINSCAN algorithm is the generation of a conservation sequence, which replaces the nucleotides of the target sequence (with repeats masked by RepeatMasker) with one of three symbols indicating a match, a mismatch, or an unaligned position relative to the top four homologs in a database of sequences from the informant genome. The probability of the conservation sequence is calculated given the conservation model. This conservation model is a Markov model of certain order with transition probabilities defined for the three state symbols that make up the conservation sequence (e.g., the probability of a gap following five match characters) rather than for the nucleotide alphabet. The TWINSCAN program successfully proved the value of human/mouse genome comparisons for producing accurate computer annotations [81]. While the resulting annotation is quite conservative, with only 25,622 genes predicted, its sensitivity is slightly higher than that of GENSCAN, at the level of both exons and complete genes, in concert with high exon-level specificity.

ROSETTA [82] and AGenDA (Alignment-based Gene-Detection Algorithm) [83] are algorithms that represent yet another approach to finding genes in pairs of homologous DNA sequences, targeting once again the human and mouse genomes. Elements of intrinsic gene finders are utilized in both programs to score the potential gene structures determined from the alignment of the human and mouse sequences. The alignments, identifying syntenic regions, are provided by the GLASS global sequence alignment program. ROSETTA predicts genes by identifying elements of coincident gene structure (i.e., splice sites, exon length, sequence similarity, etc.) in syntenic regions of DNA from two genomes. ROSETTA uses a dynamic programming algorithm to define candidate gene structures in both aligned sequences. Each of the gene structures is scored by measures of splice site strength, codon usage bias, amino acid similarity, and exon length. Parameters of the scoring models are estimated from a set of known orthologs. AGenDA searches for conserved splice sites in locally homologous sequences, as determined by the DIALIGN program [84], to define candidate exons. These candidates are then assembled into complete gene structures via a dynamic programming procedure. The only models utilized are relatively simple consensus sequences used to score splice sites, but still a training set is required. In a test on 117 mouse/human gene pairs, ROSETTA performed approximately identically to GENSCAN in terms of nucleotide-level sensitivity and slightly better in terms of nucleotide-level specificity. AGenDA, more recent than ROSETTA, in a test on the same set performed approximately identically to GENSCAN in terms of both exon-level sensitivity and specificity.

Construction of the initial alignment of the genomic sequences (sometimes complete genomes) presents a significant challenge for conventional alignment algorithms in terms of prohibitively long running time. A new class of genomic alignment tools, such as MUMmer, OWEN, and VISTA [85–87], has been introduced to address these concerns. While these tools make no attempt to pinpoint protein-coding regions and their borders, they are efficient for determining the syntenic regions that are used as input for ROSETTA and similar programs.

Multigenome Gene Finders: Phylogenetic HMMs

The extension of the comparative gene-finding approach to more than two genomes, hence requiring the use of multiple alignments, has recently been implemented on the basis of phylo-HMMs [88] and evolutionary hidden Markov models (EHMMs) [89]. These approaches combine finite HMM techniques of gene modeling with continuous Markov chains, frequently used in the field of molecular evolution. In addition to the alignments utilized by GPHMM methods, a phylo-HMM requires a phylogenetic tree describing the evolutionary relationship of the genomes under study.
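To make the combination of an HMM with a continuous-time Markov chain more concrete, the sketch below computes the likelihood of a single alignment column under a substitution model and a tree; in a phylo-HMM, a likelihood of this kind plays the role of the emission probability of a hidden state. The choices made here are assumptions for illustration only: the Jukes-Cantor substitution model, a star-shaped tree with one branch per species, and "slow" and "fast" rate multipliers standing in for conserved and nonconserved states. None of these are taken from the cited programs.

```python
import math

BASES = "ACGT"


def jc69(parent, child, distance):
    """Jukes-Cantor probability of `child` given `parent`.

    `distance` is the branch length in expected substitutions per site.
    """
    decay = math.exp(-4.0 * distance / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def column_likelihood(column, branch_lengths, rate=1.0):
    """Likelihood of one alignment column on a star tree.

    The unknown ancestral base is summed out with a uniform prior; `rate`
    rescales all branch lengths (a small rate models a conserved region).
    """
    total = 0.0
    for root in BASES:
        prob = 0.25
        for base, length in zip(column, branch_lengths):
            prob *= jc69(root, base, rate * length)
        total += prob
    return total


def log_odds_conserved(column, branch_lengths, slow=0.3, fast=1.0):
    """Log-likelihood ratio of the slow (conserved) versus the fast model."""
    return math.log(column_likelihood(column, branch_lengths, slow)
                    / column_likelihood(column, branch_lengths, fast))


if __name__ == "__main__":
    branches = (0.2, 0.2, 0.2)  # hypothetical branch lengths for three species
    for col in ("AAA", "AAG", "ACG"):
        print(col, round(log_odds_conserved(col, branches), 3))
```

Columns in which all species agree favor the slowly evolving model, while variable columns favor the fast one; an HMM over such columns can then segment the alignment into conserved and nonconserved stretches.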



Depending on the type of data provided, a phylo-HMM-based procedure could run (i) as a single sequence gene finder if only one sequence is available, (ii) as GPHMM if a pairwise alignment is provided, or (iii) as a bona fide phylo-HMM when a multiple alignment and a tree are provided [89]. The EHMM implementation of Pedersen and Hein [89] was presented as a proof of concept and was limited by the choice of a simplistic HMM of gene structure. Although the accuracy of the EHMM predictions did not match GENSCAN in tests, it is important to note that the optimal input for an EHMM would be a multiple alignment of several closely related complete genomes; currently, such data frequently are not available [89].

The phylo-HMM-based procedure ExoniPhy developed by Siepel and Haussler [32] for identification of evolutionarily conserved exons is quite sophisticated [90]. This approach targets the exons of core genes (those found in all domains of life), rather than the complete gene structures, because exons are more likely to be preserved over the course of evolution than complete genes. The predicted exons can later be pieced together into complete genes with a dynamic programming algorithm as in SGP2 or GeneID. The diagram of hidden state transitions used in ExoniPhy is shown in figure 4.5. ExoniPhy includes three major features that improve the performance of phylo-HMMs in terms of exon prediction. The first is the use of context-dependent phylogenetic models. The second is the explicit modeling of conserved noncoding DNA as in SLAM. The third is the modeling of insertions and deletions (indels), taking into account that the pattern of indels frequently is quite different in coding and noncoding sequences. Interestingly, almost 90% of the conserved exons in mouse, human, and rat have no gaps in their alignments. As an exon predictor, ExoniPhy was shown to perform comparably to GENSCAN, SGP2, TWINSCAN, and SLAM. However, the authors admit that there clearly is room for improvement, as the current version of ExoniPhy does not contain several advanced features common to other gene finders, such as species-specific distributions of exon lengths and higher-order splice site models.

Extrinsic Gene Finders

Increased availability of experimental data indicating DNA sequence transcriptional activity in the form of cDNA and EST sequences (typically from the same genome) and protein sequences (from other genomes) has led to the development of gene finders that leverage this information. Mapping EST and cDNA sequences to genomic DNA as a method of predicting the transcribed genes generally falls in the realm of pairwise sequence alignment. Therefore, such approaches (covered to some extent earlier in this chapter) are not considered here in more detail.



Figure 4.5 Diagram of hidden state transitions in the HMM of the ExoniPhy program. States in the direct strand are shown on top and states in the reverse strand are shown at the bottom. Circles represent states with variable length, while boxes represent fixed-length states [32].

The concepts of extrinsic evidence-based eukaryotic gene finding were implemented in a rather sophisticated way in the algorithms Procrustes [91] and GeneWise [92]. Both programs are computationally expensive and rely on significant computational resources to identify the piece of extrinsic evidence, the reference protein (if such exists at all) homologous to the one encoded in the given DNA sequence. Essentially, both programs screen the protein database or a sizable subset of it, one protein at a time, attempting to extract from a given genomic DNA a set of protein-coding exons (with associated introns) that would be translated into a protein product homologous to the database protein.

Procrustes works by first determining all subsequences that could be potential exons; a basic approach is selecting those sequences bounded by AG at their 5′-ends and GT at their 3′-ends. This set, collectively referred to as “candidate blocks,” can be assembled into a large number of potential complete gene structures. Procrustes uses the spliced alignment algorithm to efficiently scan all of the possible assemblies and find the one with the highest similarity score to a given database protein. The authors have determined that if a sufficiently similar protein exists in the protein database, the highest scoring block assembly is “almost guaranteed” to represent the correct gene structure. In a set of human genes having a homologous protein in another primate, 87% were predicted exactly.

GeneWise was described in detail relatively recently [92], though it has been used in the ENSEMBL pipeline since 1997 [31] and has been used in a number of genome annotation projects. Its predecessor, PairWise [93], had been developed with the goal of finding frameshifts in genes with protein products belonging to known families. Availability of a protein profile built from a multiple alignment of the family members, thus delineating a conserved domain, allowed getting extrinsic evidence of a frameshift that would destroy the fit of the newly identified protein product to the profile. GeneWise represents a significant advance over PairWise. GeneWise states the goal of gene prediction, rather than frameshift detection, and to reach the stated goal it employs a consistent approach based on HMM theory.
Given that models of both pairwise alignment and protein-coding gene prediction are readily represented by HMMs, GeneWise uses a combined (merged) HMM model with hidden states reflecting the status of alignment between the amino acids of the database protein and triplets of genomic DNA that may fall either into coding (exon) or noncoding (intron) regions. The HMM modeling the gene structure is much simpler than in GENSCAN: only one strand is considered, with nucleotide triplets being observed states in protein-coding exons, intron boundaries not allowed to split codons, and so on. In the case when a reference database protein is a member of a family, the pairwise sequence alignment is extended to alignment with the family-specific profile HMM with parameters already defined in the HMMER package [94]. As compared to GENSCAN, GeneWise exhibits higher specificity, as would be expected, with predictions being attached to extrinsic evidence. However, its sensitivity is lower than that of GENSCAN, as GeneWise has no means to identify genes for which protein products do not generate well-detectable hits in the protein databases. The accuracy of GeneWise decreases as the percent identity and the length of the alignment to a reference protein decrease. The most accurate predictions are made when the database screening hits a reference protein that is more than 85% identical to the protein to be predicted by GeneWise along the whole length of this new prediction.

Site Detectors

While many eukaryotic gene finders attempt to predict complete gene structures, there are a number of recent programs that focus on the use of advanced techniques to detect gene components such as promoters [95], transcription starts [96], splice sites [97], and exonic splicing enhancers [66]. There is considerable innovation in this area and the types of algorithms being introduced are quite diverse. Still, the common feature of these approaches is that each one introduces a new concept that can later be integrated into a full-scale gene finder or annotation pipeline.

The accurate prediction of promoters is an important task that can contribute to improving the accuracy of eukaryotic gene finders [98]. As promoters are located upstream of the start of transcription, finding a promoter helps narrow down the region where translation starts may be located. This information is important for gene-finding algorithms, as two common sources of error at the level of complete gene structure prediction are the joining of two adjacent genes into one and the splitting of a single gene into two. Either of these errors could be prevented if the gene finder uses a priori knowledge of the promoter locations. The early development of promoter prediction programs was challenged by a notoriously large number of false positive predictions, on the order of more than ten false predictions for each true positive [98]. The PromoterInspector program [95], which specifically predicts polymerase II promoter regions, reduced this overprediction rate to approximately a one-to-one ratio. A promoter region in terms of PromoterInspector is a sequence that contains a promoter either on the direct or the reverse strand. PromoterInspector utilizes an unsupervised learning technique to extract sets of oligonucleotides (with mismatches) from training sets containing promoter and nonpromoter sequences. Genomic sequences are processed using a sliding window approach, and the prediction of a promoter region requires the classification of a certain number of successive windows as promoters. Scherf et al. demonstrated the power of integrating promoter predictions with ab initio gene predictions [99]. In a test on annotated human chromosome 22, promoters predicted in regions compatible with the 5′-end predictions of GENSCAN matched annotated genes with high frequency.

Closely related to the prediction of promoters is the prediction of transcriptional starts. In human and other mammalian genomes, transcriptional starts are frequently located in the vicinity of CpG islands. To predict transcriptional starts, the Dragon Gene Start Finder (Dragon GSF) [96] first identifies the locations of CpG islands in the genome.



Then it uses an artificial neural network to evaluate all of the predicted transcription start sites, supplied by the Dragon Promoter Finder [100], with respect to the locations and compositions of the CpG islands and downstream sequence elements. The sequence site where the sum of scores of these factors reaches the highest value (provided it is above a preset threshold) pinpoints the transcription start. Currently, the program can only find transcription starts associated with CpG islands. This restriction imposes a limit on the sensitivity the program can achieve. Still, these types of signal sensors may suffer more from relatively low specificity, given that the number of detected CpG islands frequently exceeds the number of genes in a sequence by a large margin.

There are numerous techniques used to identify splice sites in genomic sequences, as their accurate detection is imperative for nearly all gene-finding systems for eukaryotes. Castelo and Guigo recently presented a new method using inclusion-driven Bayesian networks (idlBNs) [97]. The idlBN method performs comparably to the best of the previously utilized approaches for splice site identification (including position-specific weight matrices based on zero and first-order Markov models). This method shows superior training dynamics; as the training size increases, the false positive rate decreases more quickly for idlBNs than it does for weight matrix or Markov chain-based approaches. Rather surprisingly, the integration of the idlBN method with a gene prediction program, GeneID, showed that improved signal detection does not necessarily lead to large improvements in gene-finding accuracy. The authors offered the caveat that this relationship might depend on the specific gene finder, while their testing was limited to only one.

Exonic splicing enhancers (ESEs) are short sequence motifs located in the exons near splice sites and implicated in enhancing splicing activity. The detection of ESEs is beneficial for gene-finding programs, as ESEs can help delineate the boundaries between coding and noncoding DNA, especially when the sequence patterns at these boundary sites are weak [101]. The RESCUE method (Relative Enhancer and Silencer Classification by Unanimous Enrichment), a general sequence motif detection method, was applied for ESE detection and implemented in a program called RESCUE-ESE [66]. In a set of human genes, RESCUE-ESE was able to detect ten candidate splice enhancer motifs. Biochemical experiments confirmed the ESE activity of all ten predictions.

MODEL TRAINING

The accuracy of the gene predictions made by a particular program is highly dependent on the choice of training data and training methods. Thus, making optimal choices is another part of the science (or art) of gene finding.



Determination of Training Set Size

Training sets are typically derived from the expert annotated collections of genomic sequences. Sets of the required size may not always be available either for technical reasons (e.g., at the beginning of genome sequencing projects) or for more fundamental ones (e.g., extremely small genome size). While in practice it has been observed that three-periodic Markov chain models are effective for gene finding, any discussion of the application of Markov models for DNA sequence analysis would be incomplete without addressing the question of which order model is most appropriate [102]. In general, accuracy of gene prediction, especially for short genes (or exons), increases with an increase in the model order. As far as minimum orders are concerned, it was observed that models of less than order two do not perform well for gene prediction applications, largely because the second-order Markov chains are the shortest chains for which entire codons are included in the frequency statistics and codon usage frequency can be captured. Models of order five have an additional advantage, as they capture the frequency statistics of all oligonucleotides up to dicodons (hexamers).

The maximum order of the model that can be used is limited by the size of the available training sequence. The minimal size of the training set can be defined in terms of 100(1−α)% confidence intervals for the estimated transition probabilities. Then, for α = 0.05 (and assuming a genome with about equal frequencies of each nucleotide), the number of observations required to estimate each transition probability is approximately equal to 400 [103]. Markov models of higher order have a larger number of parameters. As the amount of training data needed per parameter does not change, the required training set size grows geometrically as a function of model order.

In real genomes, certain oligomers are overrepresented while others are underrepresented. Therefore, some transition probabilities will be defined with higher accuracy than others. To deal with this effect, the Glimmer program [9] employs a special class of Markov chain models called interpolated Markov models (IMMs). IMMs use a combination of high-order and lower-order Markov chains with weights depending on the frequencies of each oligomer. Further generalization of the interpolated models to so-called models with deleted interpolation (DI) is possible [102]. The performance of different types of Markov chain models, both conventional fixed order (FO) models and models with interpolation, was assessed within the framework of the GeneMark algorithm [102]. It was observed that the DI models slightly outperformed other types of models in detecting genes in genomes with medium G+C content. For genomic DNA with high (or low) G+C content, it was observed that the DI models were in some cases slightly outperformed by the FO models.
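The geometric growth of the required training set with model order, discussed above, can be made tangible with a back-of-the-envelope calculation. The sketch below is only one possible reading of the argument and rests on assumptions: each transition probability of an order-k chain is taken to correspond to one oligomer of length k+1, all oligomers are assumed equally frequent, a three-periodic model is assumed to need a separate parameter set for each codon position, and the figure of 400 observations per parameter is taken from the text. The function names are hypothetical.

```python
def transition_parameters(order, phases=3):
    """Number of transition probabilities in an order-`order` Markov chain.

    There are 4**order contexts with 4 possible next nucleotides each;
    `phases=3` models a three-periodic (codon-position-specific) chain.
    """
    return phases * 4 ** (order + 1)


def required_training_nucleotides(order, per_parameter=400, phases=3):
    """Rough training-set size, assuming equally frequent oligomers."""
    return per_parameter * transition_parameters(order, phases)


if __name__ == "__main__":
    for k in range(6):
        print(f"order {k}: about {required_training_nucleotides(k):,} nt")
```

Under these assumptions each increment of the model order multiplies the estimate by four, which is why uneven oligomer frequencies in real genomes motivate interpolated models rather than a uniformly high fixed order.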

Nonsupervised Training Methods

Frequently, it is difficult to find reliably annotated DNA sequence in sufficient amounts to build models by supervised training. However, the total length of sequenced DNA could be sufficient to harbor training sets for high-order models and nonsupervised training would be a valuable option. Nonsupervised training algorithms have been described for prokaryotic gene finders such as GeneMark or Glimmer [9,21,104,105]. Also, a nonsupervised training procedure, GeneMarkS [8], was proposed for building models for GeneMark.hmm. GeneMarkS starts the iterative training process from models with heuristically defined pseudocounts [20]. The rounds of sequence labeling into coding and noncoding regions, recompilation of the training sets, and model training follow until convergence. The heuristic approach [20] by itself may produce sufficiently accurate models without a training set. Models built by the heuristic approach have been successfully used for gene prediction in the genomes of viruses and phages, often too small to provide enough sequence to estimate parameters of statistical models via regular training. The heuristic approach was used for annotation of viral genes in the VIOLIN database [24], which contains computer reannotations for more than 2000 viral genomes. Self-training methods may also successfully incorporate similarity search in databases of protein sequences to identify members of the emerging training set, as is done by the ORPHEUS [61] and EasyGene [35] programs. Nonsupervised training may use clustering routines to separate genes of the atypical class, presumably populated with genes horizontally transferred into a given (microbial) genome in the course of evolution [106]. For analysis of new genomes, in the absence of substantial training sets, models from close phylogenetic neighbors with similar G+C content were used, albeit with varying degrees of success. 
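The iterative cycle of labeling, rebuilding the training set, and retraining until convergence, as described above for self-training, can be sketched in a deliberately simplified form. This is not the GeneMarkS procedure: the sketch assumes a list of candidate ORFs given as in-frame strings, replaces the Markov chain and HMM machinery with a plain codon-frequency model scored against a uniform background, and takes the initial model as a caller-supplied function standing in for a heuristically derived one. All names are hypothetical.

```python
import math
from collections import Counter


def codons(orf):
    """Split an in-frame ORF into complete codons, ignoring any trailing partial codon."""
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def train(orfs, pseudocount=1.0):
    """Estimate codon log-probabilities from the ORFs currently labeled as coding."""
    counts = Counter(c for orf in orfs for c in codons(orf))
    total = sum(counts.values()) + pseudocount * 64
    return lambda codon: math.log((counts[codon] + pseudocount) / total)


def score(orf, coding_logp):
    """Average per-codon log-odds of the coding model against a uniform background."""
    cs = codons(orf)
    if not cs:
        return float("-inf")
    return sum(coding_logp(c) - math.log(1.0 / 64) for c in cs) / len(cs)


def self_train(candidates, initial_logp, max_rounds=20):
    """Alternate labeling and retraining until the labeling stops changing."""
    logp, labels = initial_logp, None
    for _ in range(max_rounds):
        new_labels = [score(orf, logp) > 0 for orf in candidates]
        if new_labels == labels:
            break  # converged: another round would rebuild the same training set
        labels = new_labels
        coding = [orf for orf, keep in zip(candidates, labels) if keep]
        if not coding:
            break  # nothing labeled as coding, so no model can be retrained
        logp = train(coding)
    return labels, logp


if __name__ == "__main__":
    def initial(codon):
        # Stand-in for a heuristic starting model: a mild preference for one codon.
        return math.log(0.05) if codon == "GCG" else math.log(0.95 / 63)

    labels, _ = self_train(["GCGAAAGCGAAA", "AAAGCGGCGAAA", "TTTCCCTATCAC"], initial)
    print(labels)
```

The stopping rule compares successive labelings rather than model parameters, which is one simple way to express convergence. A real implementation would work on the whole sequence rather than on a fixed list of candidates.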
While the reference genome model could be successful for a number of cases, a simple test on prokaryotic genomes can show why this method should be applied cautiously. For instance, the genomes of E. coli K12 and Prochlorococcus marinus str. MIT9313 have G+C% of 50.8 and 50.7, respectively. With a model trained on P. marinus MIT9313 sequence, GeneMark.hmm detects 92% of the genes in E. coli K12, while using a model trained on E. coli K12 sequence GeneMark.hmm detects only 74% of the P. marinus MIT9313 genes. Therefore, the operation of choosing a reasonable reference genome is not a symmetrical one. Interestingly, a complete genome of another strain of P. marinus (P. marinus subsp. marinus str. CCMP1375) with a much lower G+C% (36.4) is available and the usefulness of phylogenetic distance as a criterion for selecting a reference genome can be immediately tested. Cross-species tests show, somewhat surprisingly, that with MIT9313 models GeneMark.hmm detects 90% of the CCMP1375 genes, but with models derived from CCMP1375 the program detects a mere 8% of MIT9313 genes.

Recently, eukaryotic genome sequencing has experienced an acceleration akin to prokaryotic genome sequencing in the late 1990s. The feasibility of unsupervised model estimation for eukaryotic genomes has recently been shown (Ter-Hovhannisyan, V., Lomsadze, A., Borodovsky, M., unpublished).

COMBINING THE OUTPUT OF GENE FINDERS

In 1996, Burset and Guigo combined several gene finders employing different methodologies to improve the accuracy of eukaryotic gene prediction [4]. At that time, GENSCAN and similar HMM-based gene finders were not available, and the eukaryotic sequence contigs used in the tests were relatively short and contained strictly one gene per sequence. As even the best eukaryotic gene finders of the time had rather low sensitivity, a significant number of exons were missed by any single program. However, with several methods used in concert, only about 1% of the real exons were missed completely. Exons predicted by all the programs in exactly the same way were labeled “almost certainly true” [4]. Since Burset and Guigo’s paper, combination of the predictions of multiple gene finders into a set of metapredictions has been a popular idea implemented in several programs (see below). Interestingly, in some ways these programs mirror the mode of operations of expert annotators running different gene prediction programs on an anonymous genomic DNA sequence at hand. After the release of GENSCAN, Murakami and Takagi used several methods to combine GENSCAN with three other programs: FEXH [107], GeneParser3 [108], and Grail [109]. The best performing combination, however, achieved only modest improvements over the predictions of GENSCAN alone. Recently, McHardy et al. [110] combined the outputs of the prokaryotic gene finders Glimmer [9] and CRITICA [59], representing the intrinsic and extrinsic classes, respectively. 
The three combination methods were (i) the union of CRITICA and a special run of Glimmer with its model parameters estimated from the predictions of CRITICA; (ii) an overlap threshold rule in which a Glimmer prediction was discarded if it significantly overlapped a CRITICA prediction; and (iii) a vote score threshold strategy in which predictions of Glimmer (again trained on the set of predictions of CRITICA) were discarded if the vote score (defined as the sum of the scores of the ORF analyzed in all reading frames other than the one voted for) was below a certain threshold. In a test on 113 genomes, the best of these methods, the vote score approach, showed accuracy comparable to the YACOP program [33], which combines the predictions of CRITICA, Glimmer, and ZCURVE using the Boolean operation CRITICA ∪ (Glimmer ∩ ZCURVE) and outperforms individual gene finders in terms of reducing false positive predictions.

Rogic et al. [68] combined two eukaryotic gene finders, GENSCAN and HMMgene, using exon posterior probability scores validated earlier as sufficiently reliable [16]. In tests on long multigene sequences it was shown that the best performance was provided by the method called Exon Union-Intersection with Reading Frame Consistency. This method works by first selecting gene structures based on the union and intersection of the predictions of the two programs and by choosing the program producing the higher gene probability (average of the exon probabilities for all exons in a gene) as the one imposing the reading frame for the complete predicted gene.

Yet another method to combine the outputs of several gene prediction programs (eukaryotic GeneMark.hmm, GENSCAN, GeneSplicer [111], GlimmerM [112], and TWINSCAN), along with protein sequence alignments, cDNA alignments, and EST alignments, has been implemented in a program called Combiner [113]. In tests on 1783 A. thaliana genes confirmed by cDNA, Combiner consistently outperformed the individual gene finders in terms of accurate prediction of both complete gene structures and separate exons. Note that Combiner specifically employed programs showing high accuracy of gene prediction in plant genomes [114].

GAZE [115] utilizes a dynamic programming algorithm to integrate information from a variety of intrinsic and extrinsic sources into the prediction of complete gene structures. The novel feature of GAZE is that the model of “legal” gene structure is defined in an easily edited external XML format as a list of features (e.g., translation start and stop, donors and acceptors).
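Two of the combination rules mentioned above lend themselves to a short illustration: the Boolean operation CRITICA ∪ (Glimmer ∩ ZCURVE) used by YACOP, and the gene probability (the average of the exon probabilities) that decides which program imposes the reading frame in the method of Rogic et al. The sketch below makes assumptions beyond the text: predictions are represented as sets of hypothetical (start, end, strand) tuples that are matched only when identical, and a tie between gene probabilities is resolved in favor of GENSCAN. Neither choice is taken from the cited programs.

```python
def yacop_combine(critica, glimmer, zcurve):
    """CRITICA ∪ (Glimmer ∩ ZCURVE) over sets of gene predictions.

    Each prediction is a hashable key such as (start, end, strand); two
    predictions count as the same gene only if their keys are identical
    (an illustrative simplification).
    """
    return set(critica) | (set(glimmer) & set(zcurve))


def gene_probability(exon_probabilities):
    """Average of the exon probabilities for all exons in a predicted gene."""
    return sum(exon_probabilities) / len(exon_probabilities)


def frame_source(genscan_exon_probs, hmmgene_exon_probs):
    """Name the program whose prediction imposes the reading frame.

    The program with the higher gene probability is chosen; ties go to
    GENSCAN, which is an arbitrary choice made for this sketch.
    """
    if gene_probability(genscan_exon_probs) >= gene_probability(hmmgene_exon_probs):
        return "GENSCAN"
    return "HMMgene"


if __name__ == "__main__":
    critica = {(100, 400, "+"), (900, 1500, "-")}
    glimmer = {(100, 400, "+"), (2000, 2600, "+"), (3000, 3300, "-")}
    zcurve = {(2000, 2600, "+"), (5000, 5200, "+")}
    print(sorted(yacop_combine(critica, glimmer, zcurve)))
    print(frame_source([0.9, 0.8], [0.95, 0.6]))
```

In this form a prediction made by Glimmer alone or by ZCURVE alone is dropped, while every CRITICA prediction is kept, which matches the stated aim of reducing false positive predictions.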
The use of an external XML gene model description allows GAZE to be quickly reconfigured to gain the ability to handle specific features of particular genomes, such as trans-splicing for the Caenorhabditis elegans model. The EuGene program [116], using an approach similar to GAZE, was developed to integrate several sources of information for gene prediction in A. thaliana. An extension of this program, called EuGeneHom, utilizes the EuGene framework to predict eukaryotic genes based on similarity to multiple homologous proteins [117].

CONCLUSIONS

Though the problem of prokaryotic gene finding has already been addressed at a level that meets the satisfaction of experimental biologists, considerable innovation is still needed to drive this field to perfection. Thus, new gene finders utilizing novel mathematical approaches such as Spectral Rotation Measure [118] and Self-Organizing Map [119] are still being introduced. The accuracy of eukaryotic gene-finding programs, though facilitated by the development of new training algorithms and more precise methods to locate short signals in genomes, is not expected to reach the same level as in prokaryotes in the near future. Perhaps for this reason, innovation in eukaryotic gene finding is currently focused less on the application of novel statistics-based methods than on new methods that leverage the power of comparative genomics in ever more complex fashions.

While the major challenges in prokaryotic gene finding have been narrowed down to the prediction of gene starts, discrimination of short genes from random ORFs, and prediction of atypical genes, the issues facing eukaryotic gene finding are more numerous. The extent of alternatively spliced genes in the genomes of higher eukaryotes, along with exactly how to evaluate alternative splicing predictions, is currently under study. The initial applications of phylo-HMMs have shown the power of ab initio gene finders that can handle multiple genomes simultaneously and future improvements in this direction are expected. Overlapping (or nested) genes, initially thought to be rare but now known to be quite frequent in prokaryotes and eukaryotes, may be a rather challenging target for eukaryotic gene finding [120]. More precise models of promoters, terminators, and regulatory sites that may aid in the determination of gene starts and 5′ and 3′ untranslated regions are under development as well.

ACKNOWLEDGMENTS

The authors would like to thank Vardges Ter-Hovhannisyan and Wenhan Zhu for computational support and Alexandre Lomsadze, Alexander Mitrophanov, and Mikhail Roytberg for useful discussions. This work was supported in part by grants from the U.S. Department of Energy and the U.S. National Institutes of Health.

REFERENCES

1. Venter, J. C., K. Remington, J. F. Heidelberg, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667):66–74, 2004.
2. Link, A. J., K. Robison and G. M. Church. Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K12. Electrophoresis, 18(8):1259–1313, 1997.
3. Rudd, K. E. EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Research, 28(1):60–4, 2000.
4. Burset, M. and R. Guigo. Evaluation of gene structure prediction programs. Genomics, 34(3):353–67, 1996.

5. Korning, P. G., S. M. Hebsgaard, P. Rouze and S. Brunak. Cleaning the GenBank Arabidopsis thaliana data set. Nucleic Acids Research, 24(2):316–20, 1996.
6. Slupska, M. M., A. G. King, S. Fitz-Gibbon, et al. Leaderless transcripts of the crenarchaeal hyperthermophile Pyrobaculum aerophilum. Journal of Molecular Biology, 309(2):347–60, 2001.
7. Guigo, R., E. T. Dermitzakis, P. Agarwal, et al. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proceedings of the National Academy of Sciences USA, 100(3):1140–5, 2003.
8. Besemer, J., A. Lomsadze and M. Borodovsky. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Research, 29(12):2607–18, 2001.
9. Delcher, A. L., D. Harmon, S. Kasif, et al. Improved microbial gene identification with GLIMMER. Nucleic Acids Research, 27(23):4636–41, 1999.
10. Lukashin, A. V. and M. Borodovsky. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Research, 26(4):1107–15, 1998.
11. Guo, F. B., H. Y. Ou and C. T. Zhang. ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Research, 31(6):1780–9, 2003.
12. Ochman, H. Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends in Genetics, 18(7):335–7, 2002.
13. Skovgaard, M., L. J. Jensen, S. Brunak, et al. On the total number of genes and their length distribution in complete microbial genomes. Trends in Genetics, 17(8):425–8, 2001.
14. Korf, I. Gene finding in novel genomes. BMC Bioinformatics, 5(1):59, 2004.
15. Guigo, R., P. Agarwal, J. F. Abril, et al. An assessment of gene prediction accuracy in large DNA sequences. Genome Research, 10(10):1631–42, 2000.
16. Rogic, S., A. K. Mackworth and F. B. Ouellette. Evaluation of gene-finding programs on mammalian sequences. Genome Research, 11(5):817–32, 2001.
17. Kraemer, E., J. Wang, J. Guo, et al. An analysis of gene-finding programs for Neurospora crassa. Bioinformatics, 17(10):901–12, 2001.
18. Mathe, C., P. Dehais, N. Pavy, et al. Gene prediction and gene classes in Arabidopsis thaliana. Journal of Biotechnology, 78(3):293–9, 2000.
19. Borodovsky, M., K. E. Rudd and E. V. Koonin. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Research, 22(22):4756–67, 1994.
20. Besemer, J. and M. Borodovsky. Heuristic approach to deriving models for gene finding. Nucleic Acids Research, 27(19):3911–20, 1999.
21. Hayes, W. S. and M. Borodovsky. How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Research, 8(11):1154–71, 1998.
22. Borodovsky, M., J. D. McIninch, E. V. Koonin, et al. Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Research, 23(17):3554–62, 1995.
23. Huang, S. H., Y. H. Chen, G. Kong, et al. A novel genetic island of meningitic Escherichia coli K1 containing the ibeA invasion gene (GimA): functional annotation and carbon-source-regulated invasion of human brain microvascular endothelial cells. Functional and Integrative Genomics, 1(5):312–22, 2001.
24. Mills, R., M. Rozanov, A. Lomsadze, et al. Improving gene annotation of complete viral genomes. Nucleic Acids Research, 31(23):7041–55, 2003.
25. Altschul, S. F., W. Gish, W. Miller, et al. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–10, 1990.
26. Robison, K., W. Gilbert and G. M. Church. Large scale bacterial gene discovery by similarity search. Nature Genetics, 7(2):205–14, 1994.
27. Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Computer Applications in the Biosciences, 13(4):477–8, 1997.
28. Florea, L., G. Hartzell, Z. Zhang, et al. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Research, 8(9):967–74, 1998.
29. Kent, W. J. BLAT—the BLAST-like alignment tool. Genome Research, 12(4):656–64, 2002.
30. Usuka, J., W. Zhu and V. Brendel. Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics, 16(3):203–11, 2000.
31. Birney, E., T. D. Andrews, P. Bevan, et al. An overview of Ensembl. Genome Research, 14(5):925–8, 2004.
32. Siepel, A. and D. Haussler. Computational identification of evolutionarily conserved exons. In RECOMB '04: Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (pp. 177–86). ACM Press, New York, 2004.
33. Tech, M. and R. Merkl. YACOP: enhanced gene prediction obtained by a combination of existing methods. In Silico Biology, 3(4):441–51, 2003.
34. Iliopoulos, I., S. Tsoka, M. A. Andrade, et al. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics, 19(6):717–26, 2003.
35. Larsen, T. S. and A. Krogh. EasyGene—a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics, 4(1):21, 2003.
36. Blattner, F. R., G. Plunkett, 3rd, C. A. Bloch, et al. The complete genome sequence of Escherichia coli K-12. Science, 277(5331):1453–74, 1997.
37. Fickett, J. W. Recognition of protein coding regions in DNA sequences. Nucleic Acids Research, 10(17):5303–18, 1982.
38. Gribskov, M., J. Devereux and R. R. Burgess. The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Research, 12(1 Pt 2):539–49, 1984.
39. Staden, R. Measurements of the effects that coding for a protein has on a DNA-sequence and their use for finding genes. Nucleic Acids Research, 12(1):551–67, 1984.
40. Erickson, J. W. and G. G. Altman. Search for patterns in the nucleotide sequence of the MS2 genome. Journal of Mathematical Biology, 7(3):219–30, 1979.
41. Ishikawa, J. and K. Hotta. FramePlot: a new implementation of the frame analysis for predicting protein-coding regions in bacterial DNA with a high G+C content. FEMS Microbiology Letters, 174(2):251–3, 1999.
42. Fickett, J. W. and C. S. Tung. Assessment of protein coding measures. Nucleic Acids Research, 20(24):6441–50, 1992.

43. Durbin, R., S. Eddy, A. Krogh and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, 1998.
44. Gatlin, L. L. Information Theory and the Living System. Columbia University Press, New York, 1972.
45. Borodovsky, M., Y. A. Sprizhitsky, E. I. Golovanov and A. A. Alexandrov. Statistical features in the Escherichia coli genome functional primary structure. II. Non-homogeneous Markov chains. Molekuliarnaia Biologiia, 20:833–40, 1986.
46. Borodovsky, M., Y. A. Sprizhitsky, E. I. Golovanov and A. A. Alexandrov. Statistical features in the Escherichia coli genome functional primary structure. III. Computer recognition of protein coding regions. Molekuliarnaia Biologiia, 20:1144–50, 1986.
47. Tavare, S. and B. Song. Codon preference and primary sequence structure in protein-coding regions. Bulletin of Mathematical Biology, 51(1):95–115, 1989.
48. Borodovsky, M. and J. McIninch. Genmark—parallel gene recognition for both DNA strands. Computers and Chemistry, 17(2):123–33, 1993.
49. Rabiner, L. R. A tutorial on hidden Markov-models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–86, 1989.
50. Krogh, A., I. S. Mian and D. Haussler. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Research, 22(22):4768–78, 1994.
51. Staden, R. Computer methods to locate signals in nucleic-acid sequences. Nucleic Acids Research, 12(1):505–19, 1984.
52. Hannenhalli, S. S., W. S. Hayes, A. G. Hatzigeorgiou and J. W. Fickett. Bacterial start site prediction. Nucleic Acids Research, 27(17):3577–82, 1999.
53. Suzek, B. E., M. D. Ermolaeva, M. Schreiber and S. L. Salzberg. A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics, 17(12):1123–30, 2001.
54. Zhu, H. Q., G. Q. Hu, Z. Q. Ouyang, et al. Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics, 20(18):3308–17, 2004.
55. Shibuya, T. and I. Rigoutsos. Dictionary-driven prokaryotic gene finding. Nucleic Acids Research, 30(12):2710–25, 2002.
56. Rigoutsos, I., A. Floratos, C. Ouzounis, et al. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins, 37(2):264–77, 1999.
57. Rigoutsos, I. and A. Floratos. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 14(1):55–67, 1998.
58. Murphy, E., I. Rigoutsos, T. Shibuya and T. E. Shenk. Reevaluation of human cytomegalovirus coding potential. Proceedings of the National Academy of Sciences USA, 100(23):13585–90, 2003.
59. Badger, J. H. and G. J. Olsen. CRITICA: coding region identification tool invoking comparative analysis. Molecular Biology and Evolution, 16(4):512–24, 1999.
60. Claverie, J. M. and L. Bougueleret. Heuristic informational analysis of sequences. Nucleic Acids Research, 14(1):179–96, 1986.
61. Frishman, D., A. Mironov, H. W. Mewes and M. Gelfand. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Research, 26(12):2941–7, 1998.



62. Zhang, Z. and M. Gerstein. Large-scale analysis of pseudogenes in the human genome. Current Opinion in Genetics and Development, 14(4):328–35, 2004.
63. Coin, L. and R. Durbin. Improved techniques for the identification of pseudogenes. Bioinformatics, 20(Suppl 1):I94–100, 2004.
64. Burge, C. and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268(1):78–94, 1997.
65. Kulp, D., D. Haussler, M. G. Reese and F. H. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 4:134–42, 1996.
66. Fairbrother, W. G., R. F. Yeh, P. A. Sharp and C. B. Burge. Predictive identification of exonic splicing enhancers in human genes. Science, 297(5583):1007–13, 2002.
67. Krogh, A. Two methods for improving performance of an HMM and their application for gene finding. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5:179–86, 1997.
68. Rogic, S., B. F. Ouellette and A. K. Mackworth. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics, 18(8):1034–45, 2002.
69. Stanke, M. and S. Waack. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics, 19(Suppl 2):II215–25, 2003.
70. Salamov, A. A. and V. V. Solovyev. Ab initio gene finding in Drosophila genomic DNA. Genome Research, 10(4):516–22, 2000.
71. Reese, M. G., G. Hartzell, N. L. Harris, et al. Genome annotation assessment in Drosophila melanogaster. Genome Research, 10(4):483–501, 2000.
72. Berget, S. M. Exon recognition in vertebrate splicing. Journal of Biological Chemistry, 270(6):2411–14, 1995.
73. Lim, L. P. and C. B. Burge. A computational analysis of sequence features involved in recognition of short introns. Proceedings of the National Academy of Sciences USA, 98(20):11193–8, 2001.
74. Alexandersson, M., S. Cawley and L. Pachter. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Research, 13(3):496–502, 2003.
75. Bray, N., I. Dubchak and L. Pachter. AVID: a global alignment program. Genome Research, 13(1):97–102, 2003.
76. Meyer, I. M. and R. Durbin. Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics, 18(10):1309–18, 2002.
77. Parra, G., P. Agarwal, J. F. Abril, et al. Comparative gene prediction in human and mouse. Genome Research, 13(1):108–17, 2003.
78. Korf, I., P. Flicek, D. Duan and M. R. Brent. Integrating genomic homology into gene structure prediction. Bioinformatics, 17(Suppl 1):S140–8, 2001.
79. Guigo, R., S. Knudsen, N. Drake and T. Smith. Prediction of gene structure. Journal of Molecular Biology, 226(1):141–57, 1992.
80. Parra, G., E. Blanco and R. Guigo. GeneID in Drosophila. Genome Research, 10(4):511–15, 2000.
81. Flicek, P., E. Keibler, P. Hu, et al. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Research, 13(1):46–54, 2003.

82. Batzoglou, S., L. Pachter, J. P. Mesirov, et al. Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Research, 10(7):950–8, 2000.
83. Rinner, O. and B. Morgenstern. AGenDA: gene prediction by comparative sequence analysis. In Silico Biology, 2(3):195–205, 2002.
84. Morgenstern, B., A. Dress and T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proceedings of the National Academy of Sciences USA, 93(22):12098–103, 1996.
85. Kurtz, S., A. Phillippy, A. L. Delcher, et al. Versatile and open software for comparing large genomes. Genome Biology, 5(2):R12, 2004.
86. Ogurtsov, A. Y., M. A. Roytberg, S. A. Shabalina and A. S. Kondrashov. OWEN: aligning long collinear regions of genomes. Bioinformatics, 18(12):1703–4, 2002.
87. Couronne, O., A. Poliakov, N. Bray, et al. Strategies and tools for whole-genome alignments. Genome Research, 13(1):73–80, 2003.
88. Siepel, A. and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology, 11(2–3):413–28, 2004.
89. Pedersen, J. S. and J. Hein. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics, 19(2):219–27, 2003.
90. Brent, M. R. and R. Guigo. Recent advances in gene structure prediction. Current Opinion in Structural Biology, 14(3):264–72, 2004.
91. Gelfand, M. S., A. A. Mironov and P. A. Pevzner. Gene recognition via spliced sequence alignment. Proceedings of the National Academy of Sciences USA, 93(17):9061–6, 1996.
92. Birney, E., M. Clamp and R. Durbin. GeneWise and Genomewise. Genome Research, 14(5):988–95, 2004.
93. Birney, E., J. D. Thompson and T. J. Gibson. PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Research, 24(14):2730–9, 1996.
94. Eddy, S. R. Profile hidden Markov models. Bioinformatics, 14(9):755–63, 1998.
95. Scherf, M., A. Klingenhoff and T. Werner. Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. Journal of Molecular Biology, 297(3):599–606, 2000.
96. Bajic, V. B. and S. H. Seah. Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Research, 13(8):1923–9, 2003.
97. Castelo, R. and R. Guigo. Splice site identification by idlBNs. Bioinformatics, 20(Suppl 1):I69–76, 2004.
98. Fickett, J. W. and A. G. Hatzigeorgiou. Eukaryotic promoter recognition. Genome Research, 7(9):861–78, 1997.
99. Scherf, M., A. Klingenhoff, K. Frech, et al. First pass annotation of promoters on human chromosome 22. Genome Research, 11(3):333–40, 2001.
100. Bajic, V. B., S. H. Seah, A. Chong, et al. Dragon Promoter Finder: recognition of vertebrate RNA polymerase II promoters. Bioinformatics, 18(1):198–9, 2002.



101. Graveley, B. R. Sorting out the complexity of SR protein functions. RNA, 6(9):1197–211, 2000.
102. Azad, R. K. and M. Borodovsky. Effects of choice of DNA sequence model structure on gene identification accuracy. Bioinformatics, 20(7):993–1005, 2004.
103. Borodovsky, M., W. S. Hayes and A. V. Lukashin. Statistical predictions of coding regions in prokaryotic genomes by using inhomogeneous Markov models. In R. L. Charlebois (Ed.), Organization of the Prokaryotic Genome (pp. 11–34). ASM Press, Washington, D.C., 1999.
104. Audic, S. and J. M. Claverie. Self-identification of protein-coding regions in microbial genomes. Proceedings of the National Academy of Sciences USA, 95(17):10026–31, 1998.
105. Baldi, P. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. Bioinformatics, 16(4):367–71, 2000.
106. Hayes, W. S. and M. Borodovsky. Deriving ribosomal binding site (RBS) statistical models from unannotated DNA sequences and the use of the RBS model for N-terminal prediction. In Pacific Symposium on Biocomputing (pp. 279–90). World Scientific, Singapore, 1998.
107. Solovyev, V. and A. Salamov. The Gene-Finder computer tools for analysis of human and model organisms genome sequences. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5:294–302, 1997.
108. Snyder, E. E. and G. D. Stormo. Identification of protein coding regions in genomic DNA. Journal of Molecular Biology, 248(1):1–18, 1995.
109. Xu, Y., R. Mural, M. Shah and E. Uberbacher. Recognizing exons in genomic sequence using GRAIL II. Genetic Engineering, 16:241–53, 1994.
110. McHardy, A. C., A. Goesmann, A. Puhler and F. Meyer. Development of joint application strategies for two microbial gene finders. Bioinformatics, 20(10):1622–31, 2004.
111. Pertea, M., X. Lin and S. L. Salzberg. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research, 29(5):1185–90, 2001.
112. Salzberg, S. L., M. Pertea, A. L. Delcher, et al. Interpolated Markov models for eukaryotic gene finding. Genomics, 59(1):24–31, 1999.
113. Allen, J. E., M. Pertea and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research, 14(1):142–8, 2004.
114. Pavy, N., S. Rombauts, P. Dehais, et al. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics, 15(11):887–99, 1999.
115. Howe, K. L., T. Chothia and R. Durbin. GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Research, 12(9):1418–27, 2002.
116. Schiex, T., A. Moisan and P. Rouze. EuGene: an eucaryotic gene finder that combines several sources of evidence. In O. Gascuel and M. F. Sagot (Eds.), Computational Biology. LNCS 2066 (pp. 111–25). Springer, Heidelberg, 2001.
117. Foissac, S., P. Bardou, A. Moisan, et al. EuGeneHom: a generic similarity-based gene finder using multiple homologous sequences. Nucleic Acids Research, 31(13):3742–5, 2003.
118. Kotlar, D. and Y. Lavner. Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions. Genome Research, 13(8):1930–7, 2003.

119. Mahony, S., J. O. McInerney, T. J. Smith and A. Golden. Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models. BMC Bioinformatics, 5(1):23, 2004.
120. Veeramachaneni, V., W. Makalowski, M. Galdzicki, et al. Mammalian overlapping genes: the comparative perspective. Genome Research, 14(2):280–6, 2004.
121. Schneider, T. D. and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18(20):6097–100, 1990.

5 Local Sequence Similarities

Temple F. Smith

In today’s genomic era, the use of computer-based comparative genetic sequence analysis has become routine. It is used to identify the function of newly sequenced genes, to identify conserved functional sites, to reconstruct probable evolutionary histories, and to investigate many other biological questions. DNA and protein comparative sequence analysis is often considered to be the founding aspect of what is now called Bioinformatics and Genomics. Given that the development of computational tools played a key role, a bit of history will help in understanding both the motivations and the sequence of ideas that led to many comparative sequence tools, including the local dynamic or Smith–Waterman alignment algorithm. It has often been stated that Zuckerkandl and Pauling introduced in 1965 the idea of using comparative sequence information to study evolution [1]. It is clear, however, that many were already considering that method by 1964, as seen by the lively discussions at the symposium on “Evolving Genes and Proteins” held September 17–18, 1964, at Rutgers University [2]. Three papers in particular provide sequence comparative alignments: one on a cytochrome c by Margoliash and Smith [3]; one on a dehydrogenase by N. O. Kaplan [4]; and one by Pauling [5]. The first paper was the precursor to the famous paper by Fitch and Margoliash [6] presenting the first large-scale evolutionary reconstruction based on protein sequence alignments. This meeting at Rutgers was attended by over 250 researchers from departments of chemistry, microbiology, applied physics, schools of medicine, and others, foretelling the highly multidisciplinary nature of biology’s future and the future of bioinformatics, an interdisciplinary area often traced back to sequence comparative analyses. Since Sanger’s seminal work on sequencing insulin [7], it had taken fewer than ten years for a large number of scientists to recognize the wealth of information available in such protein sequences. 
By 1969, fourteen years later, Margaret Dayhoff had collected and aligned over 300 protein sequences and organized them into evolutionary-based clusters [8]. Alignments by hand were straightforward for many of the proteins sequenced early on, such as the cytochrome c’s and globins. It became clear, however, that there were two fundamental problems requiring a 154

Local Sequence Similarities


more rigorous approach. These were the often degenerate or alternative possible placements of alignment gaps, and the need for a means of weighting the differences or similarities between different amino acids that one placed in homologous or aligned positions. There were a number of early heuristic approaches, such as those by Fitch [9] and Dayhoff [8]. It was the algorithm by Needleman and Wunsch in 1970 [10], however, that set the stage for nearly all later biological sequence alignment tools. Although not initially recognized as such, this work by Needleman and Wunsch was an application of Bellman’s 1957 dynamic programming methodology [11]. The fully rigorous application of dynamic programming came slowly with work by Sankoff [12], Reichert et al. [13], and Sellers [14]. In the latter, Sellers was able to introduce a true metric or distance measure between sequence pairs, a metric that Waterman et al. [15] were then able to generalize.

This outline of the early sequence alignment algorithm developments lay behind the work of my colleague, Michael Waterman, and myself that led to the local or optimal maximum similar common subsequence algorithm, since labeled the “Smith–Waterman algorithm.” It was work on a related problem, however, the prediction of RNA secondary structure, that was our critical introduction to dynamic programming. Much of our early collaboration took place in the summers at Los Alamos, New Mexico. I first met Michael when we were both invited to spend the summer of 1974 at Los Alamos National Laboratory through the efforts of William Beyer. I had previously spent time at Los Alamos interacting with researchers in the Biological Sciences group and with the mathematician, Stanislaw Ulam. By Michael’s own account (Waterman, Skiing the Sun: New Mexico Essays, 1997) our meeting was not altogether an auspicious one. We were both young faculty from somewhat backwater, nonresearch universities.
We were thus highly motivated by the opportunity to get some real research done over the short summer. Our earliest efforts focused on protein sequence comparisons and evolutionary tree building, yet interestingly, neither we nor our more senior “leader,” Bill Beyer, were initially aware of the key work of Needleman and Wunsch, nor of dynamic programming. Much of this ignorance is reflected in a first paper on the subject [16] produced prior to working with Waterman. Over the next couple of summers we became not only very familiar with the work of Needleman and Wunsch, Sankoff, and Sellers, but were able to contribute to the mathematics of sequence comparison matrices [15]. By the end of 1978 we had successfully applied the mathematics of dynamic programming to the prediction of tRNA secondary structures. Here, as in nearly all of our joint work, Waterman was the true mathematician while I struggled with the biochemistry and the writing of hundreds of lines of FORTRAN code. We made a good team. Los Alamos was a special place at that time, filled with very bright and



Figure 5.1 A photo of Mike Waterman (right) and Temple Smith (left) taken in the summer of 1976 at Los Alamos National Laboratory, Los Alamos, New Mexico, by David Lipman. Dr. Lipman is one of the key developers of the two major heuristic high-speed generalizations of the Smith–Waterman algorithm, FASTA and Blast.

unique people, and a place where one felt connected to the grand wide-open spaces of the American southwest (see figure 5.1). At about this same time there were discussions about creating a database for DNA sequences similar to what had been done for both protein sequences and 3D-determined structures [17]. As members of the Los Alamos group considering applying for grant support, both Michael and I were active in discussions on likely future analysis needs. These needs included not only using sequence alignment to compare two sequences of known functional similarity, but searching long DNA sequences for matches to much shorter single gene sequences. The need to identify short matches within long sequences was reinforced by the discovery of introns in 1977 [18–21]. Peter Sellers [14] and David Sankoff [12] had each worked on this problem, but without obtaining a timely general solution. Waterman and I were lucky enough to have recognized the similarity between protein pairwise sequence alignment and the geology problem called stratigraphic correlation, the problem of aligning two sequences of geological strata. Upon recognition, we were able to write a paper [22] in just a few days that included the dynamic programming solution to finding the optimal matching subsequence of a short sequence within a much longer one. In that solution lay one of the keys to the real problem that needed



answering, the initiation of the traceback matrix boundaries with zeros (see below). To understand the simple, yet subtle, modification we made to the dynamic programming algorithms developed by 1978, we need to recall that nearly everyone was working and thinking in terms of distances between sequences and sequence elements. This was natural from the evolutionary viewpoint since one was generally attempting to infer the distance between organisms, or more accurately between them and their last common ancestor. Various researchers (again including Dayhoff) were working with the probability that any two amino acids would be found in homologous positions or aligned across from one another. These were nearly always converted to some form of distance [8,23], typically an edit distance, the minimum number of mutations or edits required to convert one sequence into the other. In modern form these amino acid pair probabilities are converted to log-likelihoods where the object is to maximize the total likelihood rather than minimize a distance. These log-likelihoods have the form

$$LL(i,j) = \log\left[\frac{P(a_i,a_j)}{P(a_i)\,P(a_j)}\right] \tag{1}$$


providing a similarity measure of how likely it is to see amino acid, i, in the same homologous position as amino acid, j, in two related proteins as compared to observing such an association by chance. Such measures range from positive values, similar, through zero, to negative values or dissimilar. Clearly, as recognized by Dayhoff and others, the involved probabilities depend both on how great the evolutionary distance is between the containing organisms and on the degree and type of selection pressures affecting and/or allowing the changes. More importantly, one normally has to obtain estimates of the probabilities from the very groups of proteins that one is trying to align.

Why was the concept of a similarity measure so critical to us? If one wants to define an optimal substring match between two long sequences as one having the minimal distance, then any run of identities including single matches has the minimum distance of zero. There will clearly be many such optimal or minimum distance matches between nearly any two real biological sequences. Sellers, attempting to deal with this potential problem [14], introduced a complex optimization of maximum length with minimum distance that later proved to be incorrect [24]. Waterman and I recognized that by using a similarity measure and not having any cost for dissimilar or nonmatching terminal or end subsequences, the rigorous dynamic programming logic could be applied to the problem of identifying the maximum similar subsequences between any two longer sequences. The maximum similarity, not minimum distance, is just what was needed.
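The sign behavior of the log-likelihood ratio LL(i,j) defined above can be checked with a small numerical sketch. The function name `log_odds` and the probabilities used here are hypothetical values chosen only for illustration; they are not taken from any published substitution matrix.

```python
from math import log

def log_odds(p_joint, p_i, p_j):
    """Log of the observed pair probability over the chance expectation.

    p_joint is P(a_i, a_j), the probability of seeing the two residues
    aligned in related proteins; p_i and p_j are their background
    probabilities. All values passed below are made up for illustration.
    """
    return log(p_joint / (p_i * p_j))

# Pair seen twice as often as chance predicts: positive (similar).
print(log_odds(0.125, 0.25, 0.25))
# Pair seen exactly as often as chance predicts: zero.
print(log_odds(0.0625, 0.25, 0.25))
# Pair seen half as often as chance predicts: negative (dissimilar).
print(log_odds(0.03125, 0.25, 0.25))
```

Because the measure is positive for favored pairs and negative for disfavored ones, a sum of such terms along an alignment can rise and fall, which is what makes a maximum over subsequences meaningful.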




The algorithm, like that of Needleman and Wunsch, is a deceptively simple iterative application of dynamic programming that is today routinely encoded by many undergraduates. It is described formally as follows. Given two sequences, A = {a1,a2,…,an} and B = {b1,b2,…,bm}, of length n and m on a common alphabet; a measure of similarity between the elements of that alphabet, s(ai,bj); and a cost or dissimilarity value, W(k), for a deletion of length k introduced into either sequence, a matrix H is defined such that

$$H_{k0} = H_{0l} = 0 \quad \text{for } 0 \le k \le n \text{ and } 0 \le l \le m \tag{2a}$$

and

$$H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i,b_j),\; \max_{k \ge 1}\left[H_{i-k,j} - W(k)\right],\; \max_{l \ge 1}\left[H_{i,j-l} - W(l)\right],\; 0 \right\} \tag{2b}$$

for 1 ≤ i ≤ n and 1 ≤ j ≤ m. The elements of H have the interpretation that the value of matrix element, Hij, is the maximum similarity of two segments ending in ai and bj, respectively. The formula for Hij follows by considering four possibilities:

1. If ai and bj are associated or aligned, the similarity is Hi−1,j−1 + s(ai,bj).
2. If ai is at the end of a deletion of length k, the similarity is Hi−k,j − W(k).
3. If bj is at the end of a deletion of length l, the similarity is Hi,j−l − W(l).
4. Finally, a zero is assigned if all of the above three options would result in a calculated negative similarity value, Hij = 0, indicating no positive similarity subsequence starting anywhere and ending at ai and bj.

The zeros in equations (2a) and (2b) allow the exclusion of leading and trailing subsequences whose total sum of element pair similarities, s(ai,bj), plus any alignment insertion/deletions is negative, thus having a net dissimilarity. The matrix of sequence element similarities, s(ai,bj), is arbitrary to a high degree. In most biological cases, s(ai,bj) directly or indirectly reflects some evolutionary or statistical assumptions, as in equation (1). These assumptions influence the constant values in the deletion weight function, W(k), as well. For example, one normally assumes that a sequence deletion is less likely than any point or single element mutation. In terms of the scores, this requires the contribution of a deletion, −W(l) in equation (2b), to be less than the minimum value of s(ai,bj).



The pair of maximally similar subsequences is found by first locating the maximum element in the matrix H. Then by sequentially determining the H matrix elements that lead to that maximum H value, with a traceback procedure ending at the last nonzero value, the two maximally similar subsequences are obtained. The two subsequences are composed of the sequence of pairs associated with the H elements composing this traceback. The traceback itself also produces the optimal alignment, the one having the highest similarity value between these two subsequences. The zero in equation (2b) thus defines via the alphabet similarities, s(ai,bj), the level of alphabet similarity considered dissimilar. In most biological cases s(ai,bj) directly or indirectly reflects some evolutionary or statistical assumptions. These assumptions also determine the relationship between the deletion weight, W(k), value and s(ai,bj).
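The recursion and traceback described above can be sketched in a few lines of Python. This is a minimal illustration, not the original FORTRAN implementation: the function name `smith_waterman`, the scoring values (+3 for a match, −3 for a mismatch), the gap cost W(k) = 2 + 2(k − 1), and the two short test sequences are all assumptions chosen for the example. The inner maxima over k and l are evaluated directly, so the running time grows roughly with the cube of the sequence length.

```python
from math import isclose

def smith_waterman(a, b, s, W):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    n, m = len(a), len(b)
    # Boundary condition: the first row and column of H are zero.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Option 1: a_i aligned with b_j.
            diag = H[i - 1][j - 1] + s(a[i - 1], b[j - 1])
            # Option 2: a_i ends a deletion of length k.
            gap_a = max(H[i - k][j] - W(k) for k in range(1, i + 1))
            # Option 3: b_j ends a deletion of length l.
            gap_b = max(H[i][j - l] - W(l) for l in range(1, j + 1))
            # Option 4: the zero, which lets a new local alignment start here.
            H[i][j] = max(diag, gap_a, gap_b, 0.0)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the maximum element until a zero is reached.
    # Both output lists are built back to front and reversed at the end.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if isclose(H[i][j], H[i - 1][j - 1] + s(a[i - 1], b[j - 1])):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, i + 1)
                  if isclose(H[i][j], H[i - k][j] - W(k))), None)
        if k is not None:
            top.extend(reversed(a[i - k:i]))
            bottom.extend("-" * k)
            i -= k
        else:
            l = next(l for l in range(1, j + 1)
                     if isclose(H[i][j], H[i][j - l] - W(l)))
            top.extend("-" * l)
            bottom.extend(reversed(b[j - l:j]))
            j -= l
    return best, "".join(reversed(top)), "".join(reversed(bottom))

# Illustrative scoring scheme (assumed values, not from the chapter).
def match_score(x, y):
    return 3 if x == y else -3

def gap_cost(k):
    return 2 + 2 * (k - 1)

score, top, bottom = smith_waterman("TGTTACGG", "GGTTGACTA", match_score, gap_cost)
print(score)
print(top)
print(bottom)
```

When several tracebacks tie, this sketch reports only one of them, preferring the diagonal step; enumerating all equally scoring paths would require following every option that reproduces the cell value.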


The functional form of the gap function, W(k), was originally viewed as restricted by the dynamic program optimization to be a monotonic increasing function of the gap length, k. The simplest form used, and that used in the original published example of the Smith–Waterman algorithm [25], is the affine linear function of gap length,

$$W(k) = W_0 + W_1 (k - 1) \tag{3}$$


Here W0 is a cost associated with opening a gap in either sequence. W1 is the cost or penalty of extending the gap beyond a length of one sequence element. Such a function has limited biological justification. Surely in most cases longer gaps or insertion/deletions are less likely than shorter ones? However, all deletions above some minimum length may all be nearly as likely. Very long insertion/deletion events can be quite common, such as those in chromosomal rearrangements and in many protein sequences containing large introns. In addition, if one is searching for the optimal local alignments between protein-encoding DNA sequences, then nearly all expected DNA insertion/deletion events are modulo three. Also in the case of proteins such mutational events have highly varying probabilities along the sequence. They are much more likely in surface loop regions, for example. The latter is taken into account in many implementations of alignment dynamic programming tools. There W(k) is made a function of the sequence element ai or bj, or more exactly the three-dimensional environment associated with that sequence element, the amino acid.
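As a small sketch of the affine gap function W(k) given above, and of how it might be adapted to one of the cases just mentioned, the following defines the affine cost together with a variant for protein-encoding DNA. The names `affine_gap` and `codon_aware_gap`, the extra `frameshift_penalty` term, and all numerical values are assumptions made for illustration; the chapter does not prescribe any particular form for such a variant.

```python
def affine_gap(k, w0, w1):
    """Affine gap cost: w0 to open a gap, w1 for each element beyond the first."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return w0 + w1 * (k - 1)

def codon_aware_gap(k, w0, w1, frameshift_penalty):
    """Hypothetical variant for coding DNA.

    Gaps whose length is not a multiple of three would shift the reading
    frame, so they are charged an additional penalty on top of the affine cost.
    """
    extra = 0 if k % 3 == 0 else frameshift_penalty
    return affine_gap(k, w0, w1) + extra

# Costs for gap lengths 1 to 6 under both schemes (illustrative parameters).
for k in range(1, 7):
    print(k, affine_gap(k, 2, 1), codon_aware_gap(k, 2, 1, 10))
```

Either function can be passed wherever a gap cost W(k) is expected, since both take the gap length as their first argument once the remaining parameters are fixed.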




The somewhat arbitrary nature of the sequence element similarity or scoring matrix is closely related to the need to be able to identify suboptimal or alternative optimal alignments. Like the insertion/deletion cost function, W(k), discussed above, the “true” cost or likelihood of a given sequence element replacement can be, and generally is, a function of the local context. In DNA sequences this is a function not only of where, but of what type of information is being encoded. Also in DNA comparisons across different species there are the well-known differences in overall base composition or the varying likelihood of each nucleotide at each codon position. In protein sequences it is not only a function of local structure, but of the class of protein. Membrane-associated, excreted, and globular proteins all have fundamentally different amino acid background frequencies. This means that the typically employed s(ai,bj) element-to-element similarity matrices are at best an average over many context factors, and therefore even minor differences could produce different optimal alignments, particularly in detail. This in turn means that, in any case where the details of the

Figure 5.2a The H matrix generated using equation (3) and the following similarity scoring function: matches, s = +1; mismatches, s = −0.5; and W(k) = − (0.91 + 0.10 * k) for the two DNA input sequences, AAGCATCAAGTCT and TCATCAGCGG. The heavy boxed value of 5.49 is the global local maximum while the other three boxed values are the next three nearest suboptimal similarity values. The arrows indicate the diagonal traceback steps; dashes indicate horizontal traceback steps involving W(k). Displayed values are × 100.



alignment are important, one needs to look at any alternative alignments with the same or nearly the same total similarity score as the optimal. An example would be those cases where one is attempting to interpret evolutionary history between relatively distant sequences where there are likely many uncertainties in the alignment. There are in principle three different types of such alternative near-optimal alignments. There are those that contain a substantial portion of the optimal. These can be found by recalculating, at each step in the traceback, which of the H-matrix cells could have contributed to the current cell within some fixed value, X, and then proceeding through each of these in turn, subtracting the needed contribution from the value, X, until the traceback is complete or until the sum exceeds X. This is a very computationally intensive procedure for most cases. The simplest of such tracebacks, however, are those for which X is set equal to zero and only those alternative traceback alignments having the same score as the optimal are obtained. Figure 5.2a displays two such equally optimal tracebacks with the implied alignment difference being only the location of the central gap by one. A third class of suboptimal local alignments is obtained by rerunning the dynamic programming while not allowing the position pairs

Figure 5.2b Pairwise alignments for the maximum similar local aligned segments of the two input DNA sequences, AAGCATCAAGTCT and TCATCAGCGG, using the similarity scoring function given in figure 5.2a. Each of these alignments corresponds to one of the tracebacks shown in figure 5.2a.



aligned in the optimal to contribute. Those position pairs are set equal to zero, requiring any suboptimal sequence to avoid those pairs. The next best such alignment is then obtained by the standard traceback, beginning as before at the largest positive value. This procedure can be repeated to obtain a set of suboptimal alignments of decreasing total similarity. Examples of these are shown in figure 5.2b. Note that the particular cases displayed could have been obtained directly from the initial H matrix only because they do not involve any of the aligned pairs that contributed to the optimal.

STATISTICAL INTERPRETATION

It was noticed early on [26] that the optimal local alignment similarity varied linearly for a fixed alphabet with the logarithm of the product of the lengths of the two sequences. This is very pronounced as at least one of the sequences gets very long. This was initially only an empirical observation that was later clearly shown to be an instance of the Erdős–Rényi law [27]. One thus expects, as one or both sequences (or the database of sequences against which one is searching a single sequence) become large, the similarity to scale as S(a,b) ~ log(n*m), with an error on the order of log[log(n*m)]. This implies that the statistical significance of a local sequence alignment against a large database, that is, its deviation above the expected value, is a function of the logarithm of the database size. Considerable effort has gone into being able to estimate correctly the associated probabilities [28] from the extreme value distribution.

CONCLUSION

It must be noted that neither the motivation nor surely all of the ideas that went into our derivation of this local algorithm were unique to us. Many others, including Sankoff, Sellers, Goad [29], and even Dayhoff, were working on similar ideas, and had we not proposed this solution, one of them would have arrived there shortly. As one of the authors of this algorithm I have always enjoyed telling my own graduate students that my contribution was just adding a zero in the right place at the right time. But even that would have meant little without Waterman’s input and his solid mathematical proof that the result remained a rigorous application of Bellman’s dynamic programming logic [11]. It is not obvious from the iterative formulation of the algorithm that it is invariant under the reversal of the two sequences, or even that the elements lying between the position of the maximum H value and the first nonzero position reached by the traceback form the optimal alignment. There have been many efficiency improvements to the original formulation of the local or Smith–Waterman alignment algorithm, beginning with reducing the length cubed complexity [30] and identifying all

Local Sequence Similarities


nearby suboptimal similar common subsequences. The latter is conceptually very important since both s(ai,bj) and W(k) are normally obtained as averages over some data set of biologically “believed” aligned sequences, none of which can be assumed with much certainty to have the same selective history or even structural sequence constraints as any new sequence searched for its maximally similar subsequence against all of the currently available data. But most importantly this algorithm led to the development of very fast heuristic algorithms that obtain generally identical results, these being first FastA [31] and then Blast [32]. Blast and its associated variants have become today’s standard for searching very large genomic sequence databases for common or homologous subsequences.

There have been many applications of our algorithm to biological problems and to the comparisons with other sequence match tools, as testified by the original paper’s large citation list. One of these allows me to end this short historic review with one more Smith and Waterman anecdote. While on sabbatical at Yale University, Mike came to visit me, and on our way to lunch we passed through the Yale Geology Department. There stood two stratigraphic columns with strings connecting similar strata within the two columns: a sequence alignment of similar sediments! Given that Sankoff had recently pointed out to us that researchers studying bird songs had identified similar subsequences via time warping [33], we now faced the possibility that the geologist had also solved the problem before us! After a somewhat depressing lunch we went to the Geology Department chairman’s office and asked. Lo and behold, this was an unsolved problem in geology! This resulted in our first geology paper [34] basically written over the next couple of days.

REFERENCES

1. Zuckerkandl, E. and L. C. Pauling. Molecules as documents of evolutionary history. Journal of Theoretical Biology, 8:357–8, 1965.
2. Proceedings published as: Bryson, V. and H. Vogel (Eds.), Evolving Genes and Proteins. Academic Press, New York, 1965.
3. Margoliash, E. and E. Smith. Structural and functional aspects of cytochrome c in relation to evolution. In V. Bryson and H. Vogel (Eds.), Evolving Genes and Proteins (pp. 221–42). Academic Press, New York, 1965.
4. Kaplan, N. Evolution of dehydrogenases. In V. Bryson and H. Vogel (Eds.), Evolving Genes and Proteins (pp. 243–78). Academic Press, New York, 1965.
5. Zuckerkandl, E. and L. Pauling. Evolutionary divergence and convergence in proteins. In V. Bryson and H. Vogel (Eds.), Evolving Genes and Proteins (pp. 97–166). Academic Press, New York, 1965.
6. Fitch, W. and E. Margoliash. Construction of phylogenetic trees. Science, 155:279–84, 1967.
7. Sanger, F. The structure of insulin. In D. Green (Ed.), Currents in Biochemical Research. Interscience, New York, 1956.
8. Dayhoff, M. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, Md., 1969.
9. Fitch, W. An improved method of testing for evolutionary homology. Journal of Molecular Biology, 16:9–16, 1966.
10. Needleman, S. B. and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–53, 1970.
11. Bellman, R. Dynamic Programming. Princeton University Press, Princeton, N.J., 1957.
12. Sankoff, D. Matching sequences under deletion/insertion constraints. Proceedings of the National Academy of Sciences USA, 68:4–6, 1972.
13. Reichert, T., D. Cohen and A. Wong. An application of information theory to genetic mutations and the matching of polypeptide sequences. Journal of Theoretical Biology, 42:245–61, 1973.
14. Sellers, P. On the theory and computation of evolutionary distances. SIAM Journal of Applied Mathematics, 26:787–93, 1974.
15. Waterman, M., T. F. Smith and W. A. Beyer. Some biological sequence metrics. Advances in Mathematics, 20:367–87, 1976.
16. Beyer, W. A., M. L. Stein, T. F. Smith and S. M. Ulam. A molecular sequence metric and evolutionary trees. Mathematical Biosciences, 19:9–25, 1974.
17. Smith, T. F. The history of the genetic sequence databases. Genomics, 6:701–7, 1990.
18. Berget, S., A. Berk, T. Harrison and P. Sharp. Spliced segments at the 5’ termini of adenovirus-2 late mRNA: a role for heterogeneous nuclear RNA in mammalian cells. Cold Spring Harbor Symposia on Quantitative Biology, XLII:523–30, 1977.
19. Breathnach, R., C. Benoist, K. O’Hare, F. Gannon and P. Chambon. Ovalbumin gene: evidence for leader sequence in mRNA and DNA sequences at the exon-intron boundaries. Proceedings of the National Academy of Sciences USA, 75:4853–7, 1978.
20. Broker, T. R., L. T. Chow, A. R. Dunn, R. E. Gelinas, J. A. Hassell, D. F. Klessig, J. B. Lewis, R. J. Roberts and B. S. Zain. Adenovirus-2 messengers: an example of baroque molecular architecture. Cold Spring Harbor Symposia on Quantitative Biology, XLII:531–54, 1977.
21. Jeffreys, A. and R. Flavell. The rabbit β-globin gene contains a large insert in the coding sequence. Cell, 12:1097–1108, 1977.
22. Smith, T. and M. Waterman. New stratigraphic correlation techniques. Journal of Geology, 88:451–7, 1980.
23. Dayhoff, M. Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Silver Spring, Md., 1972.
24. Waterman, M. Sequence alignments. In M. S. Waterman (Ed.), Mathematical Methods for DNA Sequences (pp. 53–92). CRC Press, Boca Raton, Fl., 1989.
25. Smith, T. and M. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–7, 1981.
26. Smith, T. F. and C. Burks. Searching for sequence similarities. Nature, 301:174, 1983.
27. Arratia, R. and M. Waterman. An Erdős-Rényi law with shifts. Advances in Mathematics, 55:13–23, 1985.
28. Karlin, S. and S. F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences USA, 87:2264–8, 1990.
29. Goad, W. and M. Kanehisa. Pattern recognition in nucleic acid sequences: a general method for finding local homologies and symmetries. Nucleic Acids Research, 10:247–63, 1982.
30. Gotoh, O. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162:705–8, 1982.
31. Pearson, W. R. and D. J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences USA, 85:2444–8, 1988.
32. Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–10, 1990.
33. Bradley, D. W. and R. A. Bradley. Application of sequence comparison to the study of bird songs. In D. Sankoff and J. B. Kruskal (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (pp. 189–210). Addison-Wesley, Reading, Mass., 1983.
34. Smith, T. F. and M. S. Waterman. New stratigraphic correlation techniques. Journal of Geology, 88:451–7, 1980.

6 Complete Prokaryotic Genomes: Reading and Comprehension

Michael Y. Galperin & Eugene V. Koonin

The windfall of complete genomic sequences in the past ten years has dramatically changed the face of biology, which has started to lose its purely descriptive character. Instead, biology is gradually becoming a quantitative discipline that deals with firmly established numerical values and can be realistically described by mathematical models. Indeed, we now know that, for example, the bacterium Mycoplasma genitalium has a single chromosome which consists of 580,074 base pairs and carries genes for three ribosomal RNAs (5S, 16S, and 23S), 36 tRNAs, and 478 proteins [1,2]. We also know that about a hundred of the protein-coding genes can be disrupted without impairing the ability of this bacterium to grow on a synthetic peptide-rich broth containing the necessary nutrients [3], suggesting that the truly minimal gene set necessary for the cell life might be even smaller, in the 300–350 gene range [4–6]. Furthermore, we know that the cell of Aquifex aeolicus with its 1521 protein-coding genes is capable of autonomous, autotrophic existence in the environment, requiring for growth only hydrogen, oxygen, carbon dioxide, and mineral salts [7]. These observations bring us to the brink of finally answering the 60-year-old question posed by Erwin Schrödinger’s “What is Life?” Furthermore, although the descriptions of greatly degraded organelle-like cells like Buchnera aphidicola [8,9] and giant cell-like mimiviruses [10] necessarily complicate the picture, analysis of these genomes allows an even better understanding of what is necessary and what is dispensable for cellular life. That said, every microbial genome contains genes whose products have never been experimentally characterized and which lack experimentally characterized close homologs. The numbers of these “hypothetical” genes vary from a handful in small genomes of obligate parasites and/or symbionts of eukaryotes [11] to thousands in the much larger genomes of environmental microorganisms. 
As discussed previously, the existence of proteins with unknown function even in model organisms, such as Escherichia coli, Bacillus subtilis, or yeast Saccharomyces cerevisiae, poses a challenge not just to functional genomics but also to biology in general [12]. While some of these



represent species-specific “ORFans” which might account for the idiosyncrasies of each particular organism [13,14], there are also hundreds of “conserved hypothetical” proteins with a relatively broad phyletic distribution. As long as we do not understand functions of a significant fraction of genes in any given genome, “complete” understanding of these organisms as biological systems remains a moving target. Therefore, before attempting to disentangle the riveting complexity of interactions between the parts of biological machines and to develop theoretical and experimental models of these machines (the stated goals of systems biology), it will be necessary to gain at least a basic understanding of the role of each part. Fortunately, it appears that the central pathways of information processing and metabolism are already known, and the existing models of the central metabolism in simple organisms (e.g., obligate parasites such as Haemophilus influenzae or Helicobacter pylori) adequately describe the key processes [15–17]. However, even in these organisms, there are hundreds of uncharacterized proteins, which are expressed under the best growth conditions [18,19], let alone under nutritional, oxidative, or acidic stress. We will not be able to create full-fledged metabolic models accommodating stress-induced changes of the metabolism without a much better understanding of these processes, which critically depends on elucidation of functions of uncharacterized and poorly characterized proteins. Here, we briefly review the current state of functional annotation of complete genomes and discuss what can be realistically expected from complete genome sequences in the near future.

KNOWN, KNOWN UNKNOWN, AND UNKNOWN UNKNOWN PROTEINS

The analysis of the first several sequenced genomes proved to be an exciting but also a humbling exercise. It turned out that, even in the genomes of best-studied organisms, such as E. coli, B. subtilis, or yeast, fewer than half of all genes have ever been studied experimentally or assigned a phenotype [12]. For a certain portion of the genes, typically 30–40% of the genome, general functions could be assigned based on subtle sequence similarities of their products to experimentally characterized distant homologs. However, for at least 30–35% of genes in most genomes, there was no clue as to their cellular function. For convenience, borrowing the terminology from a popular expression, we shall refer to those genes whose function we know (or, rather, think that we know) as “knowns”; to those genes whose function we can describe only in some general terms as “known unknowns”; and to those genes whose function remains completely enigmatic as “unknown unknowns.” This classification will allow us to consider each of these classes of genes separately, concentrating on the



specific problems in understanding, and properly annotating, their functions.

Not All “Knowns” Are Really Understood

Whenever biologists talk about annotation of gene functions in sequenced genomes, they complain about the lack of solid data. The most typical question is, “Is this real (i.e., experimental) data or just a computer-based prediction?” It may come as a great surprise to anybody not immediately involved in database management that (i) experimental data are not always trustworthy and (ii) computational predictions are not always unfounded. It is true, however, that gene and protein annotations in most public and commercial databases are often unreliable. To get the most out of them, one needs to understand the underlying causes of this unreliability, which include misinterpretation of experimental results, misidentification of multidomain proteins, and change in function due to nonorthologous gene displacement and enzyme recruitment. These errors are exacerbated by their propagation through the databases by the (semi)automatic annotation methods used in most genome sequencing projects. In many cases, this results in biologically senseless annotation, which may sometimes be amusing but often becomes really annoying. Nevertheless, there will never be enough time and funds to experimentally validate all gene annotations in even the simplest genomes. Therefore, one necessarily has to rely on annotation generated with computational approaches. It is important, however, to distinguish between actual, experimental functional assignments and those derived from them on the basis of sequence similarity (sometimes questionable). Sometimes wrong annotations appear in the database because of an actual error in an experimental paper. The best example is probably tRNA-guanine transglycosylase (EC, the enzyme that inserts queuine (7-deazaguanine) into the first position of the anticodon of several tRNAs. While the bacterial enzyme is relatively well characterized [20], the eukaryotic one, reportedly consisting of two subunits, is not.
Ten years ago, both subunits of the human enzyme were purified and partially sequenced [21,22]. Later, however, it turned out that the putative N-terminal fragment of the 32 kD subunit (GenBank accession no. AAB34767) actually belongs to a short-chain dehydrogenase, a close homolog of peroxisomal 2-4-dienoyl-coenzyme A reductase (GenBank accession no. AF232010), while the putative 60 kD subunit (GenBank accession no. L37420) actually turned out to be a ubiquitin C-terminal hydrolase. Although the correct sequence of the eukaryotic tRNA-guanine transglycosylase, homologous to the bacterial form and consisting of a single 44 kD subunit, was later determined [23] and deposited in GenBank (accession no. AF302784), the original erroneous assignments persist in the public databases.

Complete Prokaryotic Genomes


In many cases, however, the blame for erroneous functional assignments lies not with the experimentalists but with genome analysts, who often make functional calls (semi)automatically, based solely on the definition line of the top BLAST hit. Even worse, this process often strips “putative” from the tentative names, such as “putative protoporphyrin oxidase,” assigned by the original authors. This leads to the paradoxical situation in which a gene referred to by the original authors only as hemX (GenBank accession no. CAA31772) is confidently annotated as uroporphyrin-III C-methyltransferase in E. coli strain O157:H7, Salmonella typhimurium, Yersinia pestis, and many other bacteria, even though it has never been studied experimentally and lacks the easily recognizable S-adenosylmethionine-binding sequence motif. Despite repeated warnings, for example in [24], this erroneous annotation found its way into the recently sequenced genomes of Chromobacterium violaceum, Photorhabdus luminescens, and Y. pseudotuberculosis. This seems to be a manifestation of the “crowd effect”: so many proteins of the HemX family have been misannotated that their sheer number convinces a casual user of the databases that this annotation must be true. Several examples of such persistent erroneous annotations are listed in table 6.1.

Loss of such qualifiers as “putative,” “predicted,” or “potential” is a common cause of confusion, as it produces a seemingly conclusive annotation out of inconclusive experimental data. For example, the ABC1 (activity of bc1) gene was originally described as a yeast nuclear gene whose product was required for the correct assembly and functioning of the mitochondrial cytochrome bc1 complex [25,26]. Later, mutations in this locus were shown to affect ubiquinone biosynthesis and to coincide with the previously described ubiquinone biosynthesis mutations ubiB in E. coli and COQ8 in yeast [27,28]. Still, the authors were careful not to claim that the ABC1 protein directly participates in ubiquinone biosynthesis, as the sequence of this protein clearly identifies it as a membrane-bound kinase, closely related to Ser/Thr protein kinases. Hence, the simplest suggestion was (and still remains) that ABC1 is a protein kinase that regulates ubiquinone biosynthesis and bc1 complex formation. Nevertheless, members of the ABC1 family are often annotated as “ubiquinone biosynthesis protein” or even “2-octaprenylphenol hydroxylase.” Even worse, because the name of this family is similar to that of ATP-binding cassette (ABC) transporters, ABC1 homologs are sometimes annotated as ABC transporters. Given that the original explanation that ABC1 might function as a chaperone has recently been questioned [29], one has to conclude that this ubiquitous and obviously important protein is misannotated in almost every organism. The story of the ABC1 protein shows that, besides being wrong per se, erroneous or incomplete annotations often obscure the fact that



Table 6.1 Some commonly misannotated protein families

Protein name | COG | Pfam | Erroneous annotation | More appropriate annotation
E. coli HemX |  | 04375 | Uroporphyrinogen III methylase | Uncharacterized protein, HemX family [84]
E. coli NusB |  | 01029 | N utilization substance protein B | Transcription antitermination factor [84]
E. coli PgmB |  |  | Phosphoglycerate mutase 2 | Broad specificity phosphatase [85]
E. coli PurE |  |  | Phosphoribosyl aminoimidazole carboxylase | Phosphoribosyl carboxyaminoimidazole mutase [86,87]
M. jannaschii MJ0010 |  |  | Phosphonopyruvate decarboxylase | Cofactor-independent phosphoglycerate mutase [88,89]
M. jannaschii MJ0697 |  |  | Fibrillarin (nucleolar protein 1) | rRNA methylase [90,91]
T. pallidum TP0953 |  |  | Pheromone shutdown protein TraB | Uncharacterized protein family [92]
A. fulgidus AF0238 |  |  | Centromere/microtubule-binding protein | tRNA pseudouridine synthase [93,94]
R. baltica RB4770 |  |  | ABC transporter | Predicted Ser/Thr protein kinase, regulates ubiquinone biosynthesis
F. tularensis FTT1298 |  |  | 2-polyprenylphenol 6-hydroxylase | Predicted Ser/Thr protein kinase, regulates ubiquinone biosynthesis
O. sativa OJ1191_G08.42 | K1839 | 05303 | Eukaryotic translation initiation factor 3 subunit | Uncharacterized protein with a TPR repeat, CLU1 family [95]

we actually might not know the function of such a misannotated protein. Another good example is the now famous case of the HemK protein family, originally annotated as an “unidentified gene that is involved in the biosynthesis of heme in Escherichia coli” [30], then recognized as a methyltransferase unrelated to heme biosynthesis [31] and reannotated as adenine-specific DNA methylase [32]. The recent characterization of this protein as a glutamine N5-methyltransferase of peptide release factors [33,34] revealed the importance of this posttranslational modification that had been previously overlooked [35,36]. Remarkably, orthologs of HemK in humans and other eukaryotes are still annotated (without experimental support)



as DNA methyltransferases. As a result, the role and extent of glutamine N5-methylation in eukaryotic proteins remain obscure.

Another example of the annotation of a poorly characterized protein as if it were a “known” is the TspO/CrtK/MBR family of integral membrane proteins, putative signaling proteins found in representatives of all domains of life, from archaea to humans [37,38]. These proteins, alternatively referred to as tryptophan-rich sensory proteins or as peripheral-type mitochondrial benzodiazepine receptors, contain five predicted transmembrane segments with 12–14 well-conserved aromatic amino acid residues, including seven Trp residues [37]. They have been shown to regulate photosynthesis gene expression in Rhodobacter sphaeroides, to respond to nutrient stress in Sinorhizobium meliloti, and to bind various benzodiazepines, tetrapyrroles, and steroids, including cholesterol, protoporphyrin IX, and many others [37,39–41]. None of these functions, however, would explain the role of these proteins in B. subtilis or Archaeoglobus fulgidus, which do not carry out photosynthesis and have no known affinity to benzodiazepines or steroids. Thus, instead of describing the function of these proteins, at least in bacteria, the existing annotation obscures the fact that it still remains enigmatic.

In practice, this means that any talk about “complete” understanding of the bacterial cell or even about creating the complete “parts list” should be taken with a grain of salt. There always remains a possibility that a confidently annotated gene (protein) might have a different function, in addition to or even instead of what has been assumed previously. Having said that, the problems with “knowns” are few and far between, particularly compared with the problems with annotation of “known unknowns” and “unknown unknowns.”

Known Unknowns

As noted above, even relatively small and simple microbial genomes contain numerous genes whose precise functions cannot be assigned with any degree of confidence. While some of these are “ORFans,” the number of genes that are found in different phylogenetic lineages and are commonly referred to as “conserved hypothetical” keeps increasing with every new sequenced genome. As repeatedly noted, annotation of an open reading frame as a “conserved hypothetical protein” does not necessarily mean that the function of its product is completely unknown, much less that its existence is questionable [12,24,42]. Generally speaking, if a conserved protein is found in several genomes, it is not really hypothetical anymore (see [43] for a discussion of possible exceptions). Even when a newly sequenced protein has no close homologs with known function, it is often possible to make a general prediction of its function based on: (1) subtle sequence similarity to a previously characterized protein, presence of a conserved sequence



motif, or a diagnostic structural feature [12,24,42]; (2) “genomic context,” that is, gene neighborhood or domain fusion data and phyletic patterns for the given protein family [24,44]; or (3) a specific expression pattern or protein–protein interaction data [19,45,46].

The methods in the first group are homology-based and rely on transfer of function from previously characterized proteins that possess the same structural fold and, typically, belong to the same structural superfamily. Since proteins with similar structures usually display similar biochemical properties and catalyze reactions with similar chemical mechanisms, homology-based methods often allow the prediction of the general biochemical activity of the protein in question but tell little about the specific biological process(es) it might be involved in. Table 6.2 lists several well-known superfamilies of enzymes with the biochemical reactions that they catalyze and the biological processes in which they are involved. One can see that the assignment of a novel protein to a particular superfamily is hardly sufficient for deducing its biological function. This is why we refer to such proteins as “known unknowns.”

Genome context-based methods of functional prediction do not rely on homology of the given protein to any previously characterized one. Instead, these methods derive predictions for uncharacterized genes from experimental or homology-based functional assignments for the genes that are either located next to the gene in question or are present (or absent) in the same organisms as that gene [44,47–55]. Thus, identification of homologs by sequence and structural similarity still plays a crucial role in the genome context methods, even if indirectly. In this type of analysis, the reliability of predictions for unknowns critically depends on the accuracy of the functional assignments for the neighboring gene(s) and the strength of the association between the two.
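As a rough illustration of the phyletic-pattern and gene-neighborhood ideas described above, the following Python sketch ranks other protein families by how closely their presence/absence pattern across genomes matches that of an uncharacterized family, and lists the genomes in which two families are encoded by adjacent genes. The family names, genome labels, the choice of the Jaccard index as a score, and both helper functions are hypothetical and were introduced only for this example; the published methods [44,47–55] use their own data and scoring schemes.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy data: for each protein family, the set of genomes encoding it.
PHYLETIC_PATTERNS: Dict[str, FrozenSet[str]] = {
    "famA_known_kinase": frozenset({"g1", "g2", "g3", "g5"}),
    "famB_known_synthase": frozenset({"g2", "g4"}),
    "famX_uncharacterized": frozenset({"g1", "g2", "g3", "g5"}),
}

# Hypothetical toy data: gene order (as family names) along each genome.
GENE_ORDERS: Dict[str, List[str]] = {
    "g1": ["famA_known_kinase", "famX_uncharacterized", "famC"],
    "g2": ["famX_uncharacterized", "famA_known_kinase", "famB_known_synthase"],
    "g3": ["famA_known_kinase", "famC", "famX_uncharacterized"],
}


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Shared genomes divided by all genomes in which either family occurs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_cooccurrence(
    query: str, patterns: Dict[str, FrozenSet[str]]
) -> List[Tuple[str, float]]:
    """Rank all other families by similarity of their phyletic pattern to the query."""
    q = patterns[query]
    scores = [(name, jaccard(q, p)) for name, p in patterns.items() if name != query]
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


def adjacent_genomes(
    gene_orders: Dict[str, List[str]], fam_a: str, fam_b: str
) -> List[str]:
    """Return the genomes in which fam_a and fam_b are immediate neighbors."""
    hits = []
    for genome, order in gene_orders.items():
        for left, right in zip(order, order[1:]):
            if {left, right} == {fam_a, fam_b}:
                hits.append(genome)
                break
    return sorted(hits)
```

A high co-occurrence score or a conserved adjacency is only a hint of functional association. As stated in the text, its value depends on how reliable the annotation of the associated family is and on how strong the association is.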
On the plus side, these assignments are purely computational, do not require any experimental analysis, and can be performed on a genomic scale. Such assignments proved particularly successful in identification of candidates for filling gaps (i.e., reactions lacking an assigned enzyme) in metabolic pathways [24,56–59]. These approaches also worked well for multi-subunit protein complexes like the proteasome, DNA repair systems, or the RNA-degrading exosome [60–62]. Table 6.3 shows some nontrivial genome context-based computational predictions that have been subsequently verified by direct experimental studies. Domain fusions (the so-called Rosetta Stone method) also can be used to deduce protein function [46,51,63]. This approach proved to be particularly fruitful in the analysis of signal transduction pathways, which include numerous multidomain proteins with a great variety of domain combinations. Sequence analysis of predicted signaling proteins encoded in bacterial and eukaryotic genomes revealed complex domain architectures and allowed the identification



Table 6.2 The range of biological functions among representatives of some common superfamilies of enzymes^a

Biochemical function | Biological function (pathway)

Acid phosphatase superfamily [96,97]
Phosphatidic acid phosphatase | Lipid metabolism
Diacylglycerol pyrophosphate phosphatase | Lipid metabolism, signaling
Glucose-6-phosphatase | Gluconeogenesis, regulation

ATP-grasp superfamily [98–101]
ATP-citrate lyase | TCA cycle
Biotin carboxylase | Fatty acid biosynthesis
Carbamoyl phosphate synthase | Pyrimidine biosynthesis
D-ala-D-ala ligase | Peptidoglycan biosynthesis
Glutathione synthetase | Redox regulation
Succinyl-CoA synthetase | TCA cycle
Lysine biosynthesis protein LysX | Lysine biosynthesis
Malate thiokinase | Serine cycle
Phosphoribosylamine-glycine ligase | Purine biosynthesis
Protein S6-glutamate ligase | Modification of the ribosome
Tubulin-tyrosine ligase | Microtubule assembly regulation
Synapsin | Regulation of nerve synapses

HAD superfamily [102–105]
Glycerol-3-phosphatase | Osmoregulation
Haloacid dehalogenase | Haloacid degradation
Histidinol phosphatase | Histidine biosynthesis
Phosphoserine phosphatase | Serine, pyridoxal biosynthesis
Phosphoglycolate phosphatase | Sugar metabolism (DNA repair)
Phosphomannomutase | Protein glycosylation
P-type ATPase | Cation transport
Sucrose phosphate synthase | Sugar metabolism

Alkaline phosphatase superfamily [106–108]
Alkaline phosphatase | Phosphate scavenging
N-Acetylgalactosamine 4-sulfatase | Chondroitin sulfate degradation
Nucleotide pyrophosphatase | Cellular signaling
Phosphoglycerate mutase | Glycolysis
Phosphoglycerol transferase | Osmoregulation
Phosphopentomutase | Nucleotide catabolism
Steroid sulfatase | Estrogen biosynthesis
Streptomycin-6-phosphatase | Streptomycin biosynthesis

^a Not all enzymatic activities or biological functions found in a given superfamily are listed.

of a number of novel conserved domains [38,64–68]. While the exact functions (e.g., ligand specificity) of some of these domains remain obscure, their association with known components of the signal transduction machinery strongly suggests their involvement in signal transduction [38,67,68]. Over the past several years, functional



Table 6.3 Recently verified genome context-based functional assignments

Protein name | COG | Pfam | Assigned function | References
E. coli NadR |  |  | Ribosylnicotinamide kinase | [109,110]
E. coli YjbN |  |  | tRNA-dihydrouridine synthase |
B. subtilis YgaA |  |  | Enoyl-ACP reductase |
H. pylori HP1533 |  |  | Thymidylate synthase |
Human COASY |  |  | Phosphopantetheine adenylyltransferase |
M. jannaschii MJ1440 |  |  | Shikimate kinase |
M. jannaschii MJ1249 |  |  | 3-Dehydroquinate synthase |
P. aeruginosa PA2081 |  |  | Kynurenine formamidase |
P. furiosus PF1956 |  |  | Fructose 1,6-bisphosphate aldolase |
P. horikoshii PH0272 |  |  | Methylmalonyl-CoA racemase | [123]
S. pneumoniae SP0415 |  |  | trans-2,cis-3-decenoyl-ACP isomerase | [124]
S. pneumoniae SP0415 |  |  | Enoyl-ACP reductase | [125]

predictions were made for many of these proteins. Several remarkable examples of recently characterized and still uncharacterized signaling domains are listed in table 6.4. Sometimes functional assignment for a “known unknown” gene can be made on the basis of experimental data on its expression under particular conditions (e.g., under nutritional stress), coimmunoprecipitation with another, functionally characterized protein, and two-hybrid or other protein–protein interaction screens [45,69]. Although these annotation methods are not entirely computational, results of large-scale gene expression and protein–protein interaction experiments are increasingly available in public databases and can be searched without performing actual experiments [69–73] (see [74] for a complete listing). These data, however, have an obvious drawback. Even if there is convincing evidence that a given protein is induced, for example by phosphate starvation or UV irradiation, it says virtually nothing about the actual biological function of this particular protein [19].



Table 6.4 Poorly characterized protein domains involved in signal transduction

Domain name | COG | Pfam | Predicted function | References
 | — | 02743 | Extracytoplasmic sensor domain | [126]
 | 3614 | 03924 | Extracytoplasmic sensor domain | [127,128]
 | 4252 | 05226 | Extracytoplasmic sensor domain | [68]
 | 5278 | 05227 | Extracytoplasmic sensor domain | [68]
 | 3322 | 05228 | Extracytoplasmic sensor domain | [68]
 | — | — | Extracytoplasmic sensor domain | [68]
 | 4250 | — | Extracytoplasmic sensor domain | [68]
 | 2205 | 02702 | Cytoplasmic turgor-sensing domain | [129]
 | 3300 | 03707 | Membrane-bound metal-binding sensor domain | [130]
 |  |  | Membrane-bound metal-binding sensor domain | [131]
MASE1 | 3447 | 05231 | Membrane-bound sensor domain | [132]
MASE2 | — | 05230 | Membrane-bound sensor domain | [132]
PfoR | 1299 | — | Membrane-bound sensor domain | [133,134]
PutP-like | 0591 | — | Membrane-bound sensor domain | [135]
HDOD | 1639 | 08668 | Signal output domain | [38]
TspO | 3476 | 03073 | Membrane-bound tryptophan-rich sensory protein | [37,39,41]

Likewise, even if an interaction of two proteins is well documented, it is often difficult to judge (1) whether this interaction is biologically relevant, that is, whether it occurs inside the cell at physiological concentrations of the interacting components; and (2) in which way, if at all, this interaction affects the function of the “known” member of the pair. Therefore, the clues to function provided by gene expression and protein–protein interaction experiments are usually just that: disparate clues that require careful analysis by a well-educated biologist to come up with even a tentative functional prediction. For this reason, many genes whose altered expression has been documented in microarray experiments still have to be assigned to the category of “unknown unknowns.” An important exception to that rule is a group of poorly characterized genes whose products participate in cell division (table 6.5). Although a significant fraction of such genes have no known (or predicted) enzymatic activity, they still qualify as “known unknowns” based on the mutation phenotypes and protein–protein interaction data. Table 6.5 lists some of the known cell division genes and the sparse clues available as to their roles in the process of cell division.
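The distinction drawn in this chapter between “knowns,” “known unknowns,” and “unknown unknowns” can be summarized as a simple decision rule. The Python sketch below is only one possible reading of that distinction. The GeneEvidence record, its field names, and the rule that a gene counts as “known” only when both a biochemical activity and a biological role are assigned are assumptions introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    KNOWN = "known"
    KNOWN_UNKNOWN = "known unknown"
    UNKNOWN_UNKNOWN = "unknown unknown"


@dataclass(frozen=True)
class GeneEvidence:
    """Hypothetical summary of what is known or predicted about one gene."""

    # A general biochemical activity, e.g., from assignment to an enzyme superfamily.
    biochemical_activity: bool
    # A biological process, e.g., cell division inferred from a mutation phenotype.
    # Assumption: an altered expression level alone does not set this flag.
    biological_role: bool


def categorize(ev: GeneEvidence) -> Category:
    """Assign a gene to one of the three categories used in the text."""
    if ev.biochemical_activity and ev.biological_role:
        return Category.KNOWN
    if ev.biochemical_activity or ev.biological_role:
        return Category.KNOWN_UNKNOWN
    return Category.UNKNOWN_UNKNOWN
```

Under this rule, a superfamily member with no known pathway and a cell division protein with no predicted enzymatic activity both fall into the “known unknown” category, whereas a gene documented only as differentially expressed remains an “unknown unknown.”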



Table 6.5 Cell division proteins of unknown biochemical function

Protein name | COG | Pfam | Functional assignment
BolA | 0271 | 01722 | Stress-induced morphogen
CrcB | 0239 | 02537 | Integral membrane protein possibly involved in chromosome condensation
DivIVA | 3599 | 05103 | Cell division initiation protein
EzrA | 4477 | 06160 | Negative regulator of septation ring formation
FtsB | 2919 | 04977 | Initiator of septum formation
FtsL | 3116 | 04999 | Cell division protein
FtsL | 4839 | — | Protein required for the initiation of cell division
FtsN | 3087 | — | Cell division protein
FtsW | 0772 | 01098 | Bacterial cell division membrane protein
FtsX | 2177 | 02687 | Cell division protein, putative permease
IspA | 2917 | 04279 | Intracellular septation protein A
Maf | 0424 | 02545 | Nucleotide-binding protein implicated in inhibition of septum formation
 |  |  | Uncharacterized protein involved in chromosome partitioning
 |  |  | Uncharacterized protein involved in chromosome partitioning
 |  |  | Uncharacterized protein involved in chromosome partitioning
 |  |  | Translational repressor of toxin–antitoxin stability system
SpoIID | 2385 | 08486 | Sporulation protein
StbD | 2161 | — | Antitoxin of toxin–antitoxin stability system
ZipA | 3115 | 04354 | Cell division membrane protein, interacts with FtsZ

Unknown Unknowns

By definition, “unknown unknowns” are those genes that cannot be assigned a biochemical function and have no clearly defined biological function either. A recent survey of the most common “unknown unknowns” showed that, although many of them have a wide phyletic distribution, very few (if any) are truly universal [42]. Many “unknown unknowns” are conserved only within one or more of the major divisions of life (bacteria, archaea, or eukaryotes) or, more often, are restricted to a particular phylogenetic lineage, such as proteobacteria or fungi. Presence of a gene in all representatives of a particular phylogenetic lineage suggests that it might perform a function that is essential for the organisms of that lineage. In contrast, many “unknown unknowns” have a patchy phylogenetic distribution, being present in some representatives of a given lineage and absent in other representatives of the same lineage. This patchy distribution is likely to reflect



Table 6.6 Uncharacterized proteins of known structure

Protein name | COG | Pfam | PDB code | Tentative annotation
 |  |  |  | Uncharacterized enzyme, butirosin synthesis
NIP7 | 1374 | 03657 | 1sqw | Possible RNA-binding protein
RtcB | 1690 | 01139 | 1uc2 | Possible role in RNA modification
TM0613 | 2250 | 05168 | 1o3u | Uncharacterized protein
TT1751 | 3439 | 03625 | 1j3m | Uncharacterized protein
YbgI | 0327 | 01784 | 1nmo | Possible transcriptional regulator
YchN | 1553 | 01205 | 1jx7 | Uncharacterized protein
YebC | 0217 | 01709 | 1kon | Potential role in DNA recombination
YgfB | 3079 | 03595 | 1izm | Uncharacterized protein
YjeF | 0062 | 03853 | 1jzt | Possible role in RNA processing
YigZ | 1739 | 01205 | 1vi7 | Possible enzyme of sugar metabolism
YodA | 3443 | — | 1s7d | Cadmium-induced protein

frequent horizontal gene transfer and/or gene loss, suggesting that the encoded function is not essential for cell survival. This nonessentiality, at least under standard laboratory conditions, could be the cause of the lack of easily detectable phenotypes, which makes these genes “unknown unknowns” in the first place. The progress in structural genomics has led to a paradoxical situation where a significant fraction of “unknown unknown” proteins have known 3D structure [19,75–77], which, however, does not really help in functional assignment. Table 6.6 lists some such “unknown unknown” proteins with determined 3D structure.

CONCLUSION

In conclusion, improved understanding of the cell as a biological system critically depends on the improvements in functional annotation. As long as there are numerous poorly characterized genes in every sequenced microbial genome, there always remains a chance that some key component of the cell metabolism or signal response mechanism has been overlooked [42]. The recent discoveries of the deoxyxylulose pathway [78,79] for terpenoid biosynthesis in bacteria and of the cyclic diguanylate (c-di-GMP)-based bacterial signaling system [38,67,80] indicate that these suspicions are not unfounded. Furthermore, several key metabolic enzymes have been described only in the past two to three years, indicating that there still are gaping holes in our understanding of microbial cell metabolism [58,81]. Recognizing the problem and identifying and enumerating these



holes through metabolic reconstruction [59,71] or integrated analysis approaches such as Gene Ontology [82,83] is a necessary prerequisite to launching projects that would aim at closing those holes (see [81]). Nevertheless, we would like to emphasize that the number of completely enigmatic “unknown unknowns” is very limited, particularly in the small genomes of heterotrophic parasitic bacteria. For many other uncharacterized genes, there are clear predictions of enzymatic activity that could (and should) be tested experimentally. It would still take a significant effort to create the complete “parts list,” that is, a catalog of the functions of all genes, even for the relatively simple bacteria and yeast [36]. However, the number of genes in these genomes is relatively small, and the end of the road is already in sight.

REFERENCES

1. Fraser, C. M., J. D. Gocayne, O. White, et al. The minimal gene complement of Mycoplasma genitalium. Science, 270:397–403, 1995.
2. Dandekar, T., M. Huynen, J. T. Regula, et al. Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames. Nucleic Acids Research, 28:3278–88, 2000.
3. Hutchison, C. A., S. N. Peterson, S. R. Gill, et al. Global transposon mutagenesis and a minimal Mycoplasma genome. Science, 286:2165–9, 1999.
4. Mushegian, A. R., and E. V. Koonin. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proceedings of the National Academy of Sciences USA, 93:10268–73, 1996.
5. Koonin, E. V. How many genes can make a cell: the minimal-gene-set concept. Annual Reviews in Genomics and Human Genetics, 1:99–116, 2000.
6. Peterson, S. N., and C. M. Fraser. The complexity of simplicity. Genome Biology, 2:comment2002.1–2002.8, 2001.
7. Deckert, G., P. V. Warren, T. Gaasterland, et al. The complete genome of the hyperthermophilic bacterium Aquifex aeolicus. Nature, 392:353–8, 1998.
8. Andersson, J. O. Evolutionary genomics: is Buchnera a bacterium or an organelle? Current Biology, 10:R866–8, 2000.
9. Gil, R., F. J. Silva, E. Zientz, et al. The genome sequence of Blochmannia floridanus: comparative analysis of reduced genomes. Proceedings of the National Academy of Sciences USA, 100:9388–93, 2003.
10. Raoult, D., S. Audic, C. Robert, et al. The 1.2-Mb genome sequence of mimivirus. Science, 306:1344–50, 2004.
11. Shimomura, S., S. Shigenobu, M. Morioka, et al. An experimental validation of orphan genes of Buchnera, a symbiont of aphids. Biochemical and Biophysical Research Communications, 292:263–7, 2002.
12. Galperin, M. Y. Conserved “hypothetical” proteins: new hints and new puzzles. Comparative and Functional Genomics, 2:14–18, 2001.
13. Siew, N., and D. Fischer. Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins, 53:241–51, 2003.
14. Siew, N., Y. Azaria and D. Fischer. The ORFanage: an ORFan database. Nucleic Acids Research, 32:D281–3, 2004.



15. Schilling, C. H., and B. O. Palsson. Assessment of the metabolic capabilities of Haemophilus influenzae Rd through a genome-scale pathway analysis. Journal of Theoretical Biology, 203:249–83, 2000.
16. Schilling, C. H., M. W. Covert, I. Famili, et al. Genome-scale metabolic model of Helicobacter pylori 26695. Journal of Bacteriology, 184:4582–93, 2002.
17. Raghunathan, A., N. D. Price, M. Y. Galperin, et al. In silico metabolic model and protein expression of Haemophilus influenzae strain Rd KW20 in rich medium. OMICS: A Journal of Integrative Biology, 8:25–41, 2004.
18. Kolker, E., S. Purvine, M. Y. Galperin, et al. Initial proteome analysis of model microorganism Haemophilus influenzae strain Rd KW20. Journal of Bacteriology, 185:4593–602, 2003.
19. Kolker, E., K. S. Makarova, S. Shabalina, et al. Identification and functional analysis of “hypothetical” genes expressed in Haemophilus influenzae. Nucleic Acids Research, 32:2353–61, 2004.
20. Ferre-D’Amare, A. R. RNA-modifying enzymes. Current Opinion in Structural Biology, 13:49–55, 2003.
21. Slany, R. K., and S. O. Muller. tRNA-guanine transglycosylase from bovine liver. Purification of the enzyme to homogeneity and biochemical characterization. European Journal of Biochemistry, 230:221–8, 1995.
22. Deshpande, K. L., P. H. Seubert, D. M. Tillman, et al. Cloning and characterization of cDNA encoding the rabbit tRNA-guanine transglycosylase 60-kilodalton subunit. Archives of Biochemistry and Biophysics, 326:1–7, 1996.
23. Deshpande, K. L., and J. R. Katze. Characterization of cDNA encoding the human tRNA-guanine transglycosylase (TGT) catalytic subunit. Gene, 265:205–12, 2001.
24. Koonin, E. V., and M. Y. Galperin. Sequence—Evolution—Function: Computational Approaches in Comparative Genomics. Kluwer Academic Publishers, Boston, 2002.
25. Bousquet, I., G. Dujardin and P. P. Slonimski. ABC1, a novel yeast nuclear gene has a dual function in mitochondria: it suppresses a cytochrome b mRNA translation defect and is essential for the electron transfer in the bc1 complex. EMBO Journal, 10:2023–31, 1991.
26. Brasseur, G., G. Tron, G. Dujardin, et al. The nuclear ABC1 gene is essential for the correct conformation and functioning of the cytochrome bc1 complex and the neighbouring complexes II and IV in the mitochondrial respiratory chain. European Journal of Biochemistry, 246:103–11, 1997.
27. Poon, W. W., D. E. Davis, H. T. Ha, et al. Identification of Escherichia coli ubiB, a gene required for the first monooxygenase step in ubiquinone biosynthesis. Journal of Bacteriology, 182:5139–46, 2000.
28. Do, T. Q., A. Y. Hsu, T. Jonassen, et al. A defect in coenzyme Q biosynthesis is responsible for the respiratory deficiency in Saccharomyces cerevisiae abc1 mutants. Journal of Biological Chemistry, 276:18161–8, 2001.
29. Hsieh, E. J., J. B. Dinoso and C. F. Clarke. A tRNA(TRP) gene mediates the suppression of cbs2-223 previously attributed to ABC1/COQ8. Biochemical and Biophysical Research Communications, 317:648–53, 2004.
30. Nakayashiki, T., K. Nishimura and H. Inokuchi. Cloning and sequencing of a previously unidentified gene that is involved in the biosynthesis of heme in Escherichia coli. Gene, 153:67–70, 1995.



31. Le Guen, L., R. Santos and J. M. Camadro. Functional analysis of the hemK gene product involvement in protoporphyrinogen oxidase activity in yeast. FEMS Microbiology Letters, 173:175–82, 1999.
32. Bujnicki, J. M., and M. Radlinska. Is the HemK family of putative S-adenosylmethionine-dependent methyltransferases a “missing” zeta subfamily of adenine methyltransferases? A hypothesis. IUBMB Life, 48:247–9, 1999.
33. Nakahigashi, K., N. Kubo, S. Narita, et al. HemK, a class of protein methyl transferase with similarity to DNA methyl transferases, methylates polypeptide chain release factors, and hemK knockout induces defects in translational termination. Proceedings of the National Academy of Sciences USA, 99:1473–8, 2002.
34. Heurgue-Hamard, V., S. Champ, A. Engstrom, et al. The hemK gene in Escherichia coli encodes the N5-glutamine methyltransferase that modifies peptide release factors. EMBO Journal, 21:769–78, 2002.
35. Clarke, S. The methylator meets the terminator. Proceedings of the National Academy of Sciences USA, 99:1104–6, 2002.
36. Roberts, R. J. Identifying protein function—a call for community action. PLoS Biology, 2:E42, 2004.
37. Yeliseev, A. A., and S. Kaplan. TspO of Rhodobacter sphaeroides. A structural and functional model for the mammalian peripheral benzodiazepine receptor. Journal of Biological Chemistry, 275:5657–67, 2000.
38. Galperin, M. Y. Bacterial signal transduction network in a genomic perspective. Environmental Microbiology, 6:552–67, 2004.
39. Gavish, M., I. Bachman, R. Shoukrun, et al. Enigma of the peripheral benzodiazepine receptor. Pharmacological Reviews, 51:629–50, 1999.
40. Lacapere, J. J., and V. Papadopoulos. Peripheral-type benzodiazepine receptor: structure and function of a cholesterol-binding protein in steroid and bile acid biosynthesis. Steroids, 68:569–85, 2003.
41. Davey, M. E., and F. J. de Bruijn. A homologue of the tryptophan-rich sensory protein TspO and FixL regulate a novel nutrient deprivation-induced Sinorhizobium meliloti locus. Applied and Environmental Microbiology, 66:5353–9, 2000.
42. Galperin, M. Y., and E. V. Koonin. “Conserved hypothetical” proteins: prioritization of targets for experimental study. Nucleic Acids Research, 32:5452–63, 2004.
43. Natale, D. A., M. Y. Galperin, R. L. Tatusov, et al. Using the COG database to improve gene recognition in complete genomes. Genetica, 108:9–17, 2000.
44. Galperin, M. Y., and E. V. Koonin. Who’s your neighbor? New computational approaches for functional genomics. Nature Biotechnology, 18:609–13, 2000.
45. Marcotte, E. M., M. Pellegrini, H. L. Ng, et al. Detecting protein function and protein-protein interactions from genome sequences. Science, 285:751–3, 1999.
46. Marcotte, E. M., M. Pellegrini, M. J. Thompson, et al. A combined algorithm for genome-wide prediction of protein function. Nature, 402:83–6, 1999.
47. Overbeek, R., M. Fonstein, M. D’Souza, et al. The use of contiguity on the chromosome to predict functional coupling. In Silico Biology, 1:93–108, 1998.



48. Overbeek, R., M. Fonstein, M. D’Souza, et al. The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences USA, 96:2896–901, 1999.
49. Huynen, M., B. Snel, W. Lathe, 3rd, et al. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Research, 10:1204–10, 2000.
50. Snel, B., P. Bork and M. A. Huynen. The identification of functional modules from the genomic association of genes. Proceedings of the National Academy of Sciences USA, 99:5890–5, 2002.
51. Dandekar, T., B. Snel, M. Huynen, et al. Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences, 23:324–8, 1998.
52. Tatusov, R. L., E. V. Koonin and D. J. Lipman. A genomic perspective on protein families. Science, 278:631–7, 1997.
53. Gaasterland, T., and M. A. Ragan. Microbial genescapes: phyletic and functional patterns of ORF distribution among prokaryotes. Microbial and Comparative Genomics, 3:199–217, 1998.
54. Pellegrini, M., E. M. Marcotte, M. J. Thompson, et al. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National Academy of Sciences USA, 96:4285–8, 1999.
55. Tatusov, R. L., M. Y. Galperin, D. A. Natale, et al. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research, 28:33–6, 2000.
56. Dandekar, T., S. Schuster, B. Snel, et al. Pathway alignment: application to the comparative analysis of glycolytic enzymes. Biochemical Journal, 343:115–24, 1999.
57. Huynen, M. A., T. Dandekar and P. Bork. Variation and evolution of the citric-acid cycle: a genomic perspective. Trends in Microbiology, 7:281–91, 1999.
58. Osterman, A., and R. Overbeek. Missing genes in metabolic pathways: a comparative genomics approach. Current Opinion in Chemical Biology, 7:238–51, 2003.
59. Green, M. L., and P. D. Karp. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, 5:76, 2004.
60. Koonin, E. V., Y. I. Wolf and L. Aravind. Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach. Genome Research, 11:240–52, 2001.
61. Verma, R., L. Aravind, R. Oania, et al. Role of Rpn11 metalloprotease in deubiquitination and degradation by the 26S proteasome. Science, 298:611–15, 2002.
62. Makarova, K. S., L. Aravind, N. V. Grishin, et al. A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Research, 30:482–96, 2002.
63. Enright, A. J., I. Iliopoulos, N. C. Kyrpides, et al. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402:86–90, 1999.
64. Aravind, L., and C. P. Ponting. The GAF domain: an evolutionary link between diverse phototransducing proteins. Trends in Biochemical Sciences, 22:458–9, 1997.



65. Aravind, L., and C. P. Ponting. The cytoplasmic helical linker domain of receptor histidine kinase and methyl-accepting proteins is common to many prokaryotic signalling proteins. FEMS Microbiology Letters, 176: 111–16, 1999. 66. Taylor, B. L., and I. B. Zhulin. PAS domains: internal sensors of oxygen, redox potential, and light. Microbiology and Molecular Biology Reviews, 63:479–506, 1999. 67. Galperin, M. Y., A. N. Nikolskaya and E. V. Koonin. Novel domains of the prokaryotic two-component signal transduction system. FEMS Microbiology Letters, 203:11–21, 2001. 68. Zhulin, I. B., A. N. Nikolskaya and M. Y. Galperin. Common extracellular sensory domains in transmembrane receptors for diverse signal transduction pathways in bacteria and archaea. Journal of Bacteriology, 185:285–94, 2003. 69. Salwinski, L., C. S. Miller, A. J. Smith, et al. The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 32:D449–51, 2004. 70. Edgar, R., M. Domrachev and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30:207–10, 2002. 71. Karp, P. D., M. Riley, M. Saier, et al. The EcoCyc Database. Nucleic Acids Research, 30:56–8, 2002. 72. Munch, R., K. Hiller, H. Barg, et al. PRODORIC: prokaryotic database of gene regulation. Nucleic Acids Research, 31:266–9, 2003. 73. Makita, Y., M. Nakao, N. Ogasawara, et al. DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Research, 32:D75–7, 2004. 74. Galperin, M. Y. The Molecular Biology Database Collection: 2005 update. Nucleic Acids Research, 33:D5–24, 2005. 75. Gilliland, G. L., A. Teplyakov, G. Obmolova, et al. Assisting functional assignment for hypothetical Haemophilus influenzae gene products through structural genomics. Current Drug Targets and Infectious Disorders, 2:339–53, 2002. 76. Frishman, D. What we have learned about prokaryotes from structural genomics. 
OMICS: A Journal of Integrative Biology, 7:211–24, 2003. 77. Kim, S. H., D. H. Shin, I. G. Choi, et al. Structure-based functional inference in structural genomics. Journal of Structural and Functional Genomics, 4:129–35, 2003. 78. Eisenreich, W., F. Rohdich and A. Bacher. Deoxyxylulose phosphate pathway to terpenoids. Trends in Plant Sciences, 6:78–84, 2001. 79. Eisenreich, W., A. Bacher, D. Arigoni, et al. Biosynthesis of isoprenoids via the non-mevalonate pathway. Cellular and Molecular Life Sciences, 61:1401–26, 2004. 80. Jenal, U. Cyclic di-guanosine-monophosphate comes of age: a novel secondary messenger involved in modulating cell surface structures in bacteria? Current Opinion in Microbiology, 7:185–91, 2004. 81. Karp, P. D. Call for an enzyme genomics initiative. Genome Biology, 5:401, 2004. 82. Harris, M. A., J. Clark, A. Ireland, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32:D258–61, 2004.

Complete Prokaryotic Genomes


83. Camon, E., D. Barrell, V. Lee, et al. The Gene Ontology Annotation (GOA) Database—an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biology, 4:5–6, 2004. 84. Sasarman, A., Y. Echelard, J. Letowski, et al. Nucleotide sequence of the hemX gene, the third member of the Uro operon of Escherichia coli K12. Nucleic Acids Research, 16:11835, 1988. 85. Rigden, D. J., I. Bagyan, E. Lamani, et al. A cofactor-dependent phosphoglycerate mutase homolog from Bacillus stearothermophilus is actually a broad specificity phosphatase. Protein Science 10:1835–46, 2001. 86. Mathews, I. I., T. J. Kappock, J. Stubbe, et al. Crystal structure of Escherichia coli PurE, an unusual mutase in the purine biosynthetic pathway. Structure with Folding and Design, 7:1395–1406, 1999. 87. Thoden, J. B., T. J. Kappock, J. Stubbe, et al. Three-dimensional structure of N5-carboxyaminoimidazole ribonucleotide synthetase: a member of the ATP grasp protein superfamily. Biochemistry, 38:15480–92, 1999. 88. van der Oost, J., M. A. Huynen and C. H. Verhees. Molecular characterization of phosphoglycerate mutase in archaea. FEMS Microbiology Letters, 212:111–20, 2002. 89. Graham, D. E., H. Xu and R. H. White. A divergent archaeal member of the alkaline phosphatase binuclear metalloenzyme superfamily has phosphoglycerate mutase activity. FEBS Letters, 517:190–4, 2002. 90. Feder, M., J. Pas, L. S. Wyrwicz, et al. Molecular phylogenetics of the RrmJ/fibrillarin superfamily of ribose 2’-O-methyltransferases. Gene, 302:129–38, 2003. 91. Deng, L., N. G. Starostina, Z. J. Liu, et al. Structure determination of fibrillarin from the hyperthermophilic archaeon Pyrococcus furiosus. Biochemical and Biophysical Research Communications, 315:726–32, 2004. 92. An, F. Y., and D. B. Clewell. Characterization of the determinant (traB) encoding sex pheromone shutdown by the hemolysin/bacteriocin plasmid pAD1 in Enterococcus faecalis. Plasmid, 31:215–21, 1994. 93. Koonin, E. V. 
Pseudouridine synthases: four families of enzymes containing a putative uridine-binding motif also conserved in dUTPases and dCTP deaminases. Nucleic Acids Research, 24:2411–15, 1996. 94. Lafontaine, D. L., C. Bousquet-Antonelli, Y. Henry, et al. The box H+ACA snoRNAs carry Cbf5p, the putative rRNA pseudouridine synthase. Genes and Development, 12:527–37, 1998. 95. Fields, S. D., M. N. Conrad and M. Clarke. The S. cerevisiae CLU1 and D. discoideum cluA genes are functional homologues that influence mitochondrial morphology and distribution. Journal of Cell Science, 111:1717–27, 1998. 96. Stukey, J., and G. M. Carman. Identification of a novel phosphatase sequence motif. Protein Science, 6:469–72, 1997. 97. Neuwald, A. F. An unexpected structural relationship between integral membrane phosphatases and soluble haloperoxidases. Protein Science, 6:1764–7, 1997. 98. Fan, C., P. C. Moews, Y. Shi, et al. A common fold for peptide synthetases cleaving ATP to ADP: glutathione synthetase and D-alanine:D-alanine ligase of Escherichia coli. Proceedings of the National Academy of Sciences USA, 92:1172–6, 1995.



99. Artymiuk, P. J., A. R. Poirrette, D. W. Rice, et al. Biotin carboxylase comes into the fold. Nature Structural Biology, 3:128–32, 1996. 100. Murzin, A. G. Structural classification of proteins: new superfamilies. Current Opinion in Structural Biology, 6:386–94, 1996. 101. Galperin, M. Y., and E. V. Koonin. A diverse superfamily of enzymes with ATP-dependent carboxylate-amine/thiol ligase activity. Protein Science, 6:2639–43, 1997. 102. Koonin, E. V., and R. L. Tatusov. Computer analysis of bacterial haloacid dehalogenases defines a large superfamily of hydrolases with diverse specificity. Application of an iterative approach to database search. Journal of Molecular Biology, 244:125–32, 1994. 103. Aravind, L., M. Y. Galperin and E. V. Koonin. The catalytic domain of the P-type ATPase has the haloacid dehalogenase fold. Trends in Biochemical Sciences, 23:127–9, 1998. 104. Collet, J. F., V. Stroobant, M. Pirard, et al. A new class of phosphotransferases phosphorylated on an aspartate residue in an amino-terminal DXDX(T/V) motif. Journal of Biological Chemistry, 273:14107–12, 1998. 105. Collet, J. F., V. Stroobant and E. Van Schaftingen. Mechanistic studies of phosphoserine phosphatase, an enzyme related to P-type ATPases. Journal of Biological Chemistry, 274:33985–90, 1999. 106. Grana, X., L. de Lecea, M. R. el-Maghrabi, et al. Cloning and sequencing of a cDNA encoding 2,3-bisphosphoglycerate-independent phosphoglycerate mutase from maize. Possible relationship to the alkaline phosphatase family. Journal of Biological Chemistry, 267:12797–803, 1992. 107. Galperin, M. Y., A. Bairoch and E. V. Koonin. A superfamily of metalloenzymes unifies phosphopentomutase and cofactor-independent phosphoglycerate mutase with alkaline phosphatases and sulfatases. Protein Science, 7:1829–35, 1998. 108. Galperin, M. Y., and M. J. Jedrzejas. Conserved core structure and active site residues in alkaline phosphatase superfamily enzymes. Proteins, 45:318–24, 2001. 109. Kurnasov, O. 
V., B. M. Polanuyer, S. Ananta, et al. Ribosylnicotinamide kinase domain of NadR protein: identification and implications in NAD biosynthesis. Journal of Bacteriology, 184:6906–17, 2002. 110. Singh, S. K., O. V. Kurnasov, B. Chen, et al. Crystal structure of Haemophilus influenzae NadR protein. A bifunctional enzyme endowed with NMN adenyltransferase and ribosylnicotinimide kinase activities. Journal of Biological Chemistry, 277:33291–9, 2002. 111. Bishop, A. C., J. Xu, R. C. Johnson, et al. Identification of the tRNAdihydrouridine synthase family. Journal of Biological Chemistry, 277: 25090–5, 2002. 112. Heath, R. J., N. Su, C. K. Murphy, et al. The enoyl-[acyl-carrier-protein] reductases FabI and FabL from Bacillus subtilis. Journal of Biological Chemistry, 275:40128–33, 2000. 113. Myllykallio, H., G. Lipowski, D. Leduc, et al. An alternative flavindependent mechanism for thymidylate synthesis. Science, 297:105–7, 2002. 114. Daugherty, M., B. Polanuyer, M. Farrell, et al. Complete reconstitution of the human coenzyme A biosynthetic pathway via comparative genomics. Journal of Biological Chemistry, 277:21431–9, 2002.

Complete Prokaryotic Genomes


115. Aghajanian, S., and D. M. Worrall. Identification and characterization of the gene encoding the human phosphopantetheine adenylyltransferase and dephospho-CoA kinase bifunctional enzyme (CoA synthase). Biochemical Journal, 365:13–18, 2002. 116. Zhyvoloup, A., I. Nemazanyy, A. Babich, et al. Molecular cloning of CoA synthase. The missing link in CoA biosynthesis. Journal of Biological Chemistry, 277:22107–10, 2002. 117. Daugherty, M., V. Vonstein, R. Overbeek, et al. Archaeal shikimate kinase, a new member of the GHMP-kinase family. Journal of Bacteriology, 183:292–300, 2001. 118. White, R. H. L-Aspartate semialdehyde and a 6-deoxy-5-ketohexose 1-phosphate are the precursors to the aromatic amino acids in Methanocaldococcus jannaschii. Biochemistry, 43:7618–27, 2004. 119. Kurnasov, O., L. Jablonski, B. Polanuyer, et al. Aerobic tryptophan degradation pathway in bacteria: novel kynurenine formamidase. FEMS Microbiology Letters, 227:219–27, 2003. 120. Galperin, M. Y., L. Aravind and E. V. Koonin. Aldolases of the DhnA family: a possible solution to the problem of pentose and hexose biosynthesis in archaea. FEMS Microbiology Letters, 183:259–64, 2000. 121. Siebers, B., H. Brinkmann, C. Dorr, et al. Archaeal fructose-1,6bisphosphate aldolases constitute a new family of archaeal type class I aldolase. Journal of Biological Chemistry, 276:28710–18, 2001. 122. Lorentzen, E., B. Siebers, R. Hensel, et al. Structure, function and evolution of the Archaeal class I fructose-1,6-bisphosphate aldolase. Biochemical Society Transactions, 32:259–63, 2004. 123. Bobik, T. A., and M. E. Rasche. Identification of the human methylmalonylCoA racemase gene based on the analysis of prokaryotic gene arrangements. Implications for decoding the human genome. Journal of Biological Chemistry, 276:37194–8, 2001. 124. Marrakchi, H., K. H. Choi and C. O. Rock. A new mechanism for anaerobic unsaturated fatty acid formation in Streptococcus pneumoniae. 
Journal of Biological Chemistry, 277:44809–16, 2002. 125. Marrakchi, H., W. E. Dewolf, Jr., C. Quinn, et al. Characterization of Streptococcus pneumoniae enoyl-(acyl-carrier protein) reductase (FabK). Biochemical Journal, 370:1055–62, 2003. 126. Anantharaman, V., and L. Aravind. Cache—a signaling domain common to animal Ca2+-channel subunits and a class of prokaryotic chemotaxis receptors. Trends in Biochemical Sciences, 25:535–7, 2000. 127. Mougel, C., and I. B. Zhulin. CHASE: an extracellular sensing domain common to transmembrane receptors from prokaryotes, lower eukaryotes and plants. Trends in Biochemical Sciences, 26:582–4, 2001. 128. Anantharaman, V., and L. Aravind. The CHASE domain: a predicted ligand-binding module in plant cytokinin receptors and other eukaryotic and bacterial receptors. Trends in Biochemical Sciences, 26:579–82, 2001. 129. Heermann, R., A. Fohrmann, K. Altendorf, et al. The transmembrane domains of the sensor kinase KdpD of Escherichia coli are not essential for sensing K+ limitation. Molecular Microbiology, 47:839–48, 2003. 130. Galperin, M. Y., T. A. Gaidenko, A. Y. Mulkidjanian, et al. MHYT, a new integral membrane sensor domain. FEMS Microbiology Letters, 205: 17–23, 2001.



131. Nikolskaya, A. N., and M. Y. Galperin. A novel type of conserved DNA-binding domain in the transcriptional regulators of the AlgR/ AgrA/LytR family. Nucleic Acids Research, 30:2453–9, 2002. 132. Nikolskaya, A. N., A. Y. Mulkidjanian, I. B. Beech, et al. MASE1 and MASE2: two novel integral membrane sensory domains. Journal of Molecular Microbiology and Biotechnology, 5:11–16, 2003. 133. Awad, M. M., and J. I. Rood. Perfringolysin O expression in Clostridium perfringens is independent of the upstream pfoR gene. Journal of Bacteriology, 184:2034–8, 2002. 134. Savic, D. J., W. M. McShan and J. J. Ferretti. Autonomous expression of the slo gene of the bicistronic nga-slo operon of Streptococcus pyogenes. Infection and Immunity, 70:2730–3, 2002. 135. Häse, C. C., N. D. Fedorova, M. Y. Galperin, et al. Sodium ion cycle in bacterial pathogens: evidence from cross-genome comparisons. Microbiology and Molecular Biology Reviews, 65:353–70, 2001.

7 Protein Structure Prediction

Jeffrey Skolnick & Yang Zhang

Over the past decade, the success of genome sequencing efforts has brought about a paradigm shift in biology [1]. There is increasing emphasis on the large-scale, high-throughput examination of all genes and gene products of an organism, with the aim of assigning their functions [2]. Of course, biological function is multifaceted, ranging from molecular/biochemical to cellular or physiological to phenotypical [3]. In practice, knowledge of the DNA sequence of an organism and the identification of its open reading frames (ORFs) does not directly provide functional insight. Here, the focus is on the proteins in a genome, namely, the proteome; the chapter recognizes that proteins are only a subset of all biologically important molecules, and it addresses aspects of molecular/biochemical function and protein–protein interactions. At present, evolution-based approaches can provide insights into some features of the biological function of about 40–60% of the ORFs in a given proteome [4]. However, purely evolution-based approaches increasingly fail as the protein families become more distant [5], and predicting the functions of the unassigned ORFs in a genome remains an important challenge. Because the biochemical function of a protein is ultimately determined by both the identity of the functionally important residues and the three-dimensional structure of the functional site, protein structures represent an essential tool in annotating genomes [6–11]. The recognition of the role that structure can play in elucidating function is one impetus for structural genomics, which aims for high-throughput protein structure determination [12]. Another is to provide a complete library of solved protein structures so that an arbitrary sequence is within modeling distance of an already known structure [13]. Then, the protein folding problem, that is, the prediction of a protein’s structure from its amino acid sequence, could be solved by enumeration. In practice, the ability to generate accurate models from distantly related templates will dictate the number of protein folds that need to be determined experimentally [14–16]. Protein–protein interactions, which are involved in virtually all cellular processes [17], represent another arena where protein structure prediction could play an important role. This area is in ferment, with considerable concern about the accuracy and consistency of high-throughput experimental methods [18].



In what follows, an overview of areas that comprise the focus of this chapter is presented. First, the state of the art of protein structure prediction is discussed. Then, the status of approaches to biochemical function prediction based on both protein sequence and structure is reviewed, followed by a review of the status of approaches for determining protein–protein interactions. Then, some recent promising advances in these areas are described. In the concluding section, the status of the field and directions for future research are summarized.

BACKGROUND

Historically, protein structure prediction approaches are divided into three general categories: Comparative Modeling (CM) [19], threading [20], and New Fold methods or ab initio folding [21–23], which are schematically depicted in figure 7.1. In CM, the protein’s structure is predicted by aligning the target protein’s sequence to an evolutionarily related template sequence with a solved structure in the PDB [24]; that is, two homologous sequences are aligned, and a three-dimensional model is built based on this alignment [25]. In threading, the goal is to match the target sequence whose structure is unknown to a template that adopts a known structure, whether or not the target and template are evolutionarily related [26]. Threading should therefore also identify analogous folds, that is, cases where the target and template adopt a similar fold without an apparent evolutionary relationship [27–29]. Note that the distinction between these approaches is becoming increasingly blurred [29–31]. Certainly, the general approach of CM and threading is the same: identify a structurally related template, identify an alignment between the target sequence and the template structure, build a continuous, full-length model, and then refine the resulting structure [26]. Ab initio folding usually refers to approaches that model protein structures on the basis of physicochemical principles. However, many recently developed New Fold/ab initio approaches exploit evolutionary and threading information [30] (e.g., predicted secondary structure or contacts), although some versions are more physics-based [32]; perhaps such approaches should be referred to as semi-first principles. Indeed, a number of groups have developed approaches spanning the range from CM to ab initio folding [29,30] that performed reasonably well in CASP5, the fifth biennial community-wide experiment to assess the status of the field of protein structure prediction [33].

Comparative Modeling
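Template-based modeling starts from an alignment between the target and a template, and the quality of that alignment is commonly summarized as percent sequence identity. The following is a minimal sketch of that calculation, assuming a precomputed pairwise alignment supplied as two equal-length gapped strings. The function name, the gap character, and the choice of denominator (columns where both sequences have a residue) are illustrative assumptions; several other definitions of identity are in use, and the example sequences are made up.

```python
def percent_identity(target_aln: str, template_aln: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where both sequences have a residue.

    Assumption: the denominator is the number of aligned (non-gap) column pairs.
    Other conventions divide by the shorter sequence length or the full alignment length.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")

    aligned = 0    # columns with a residue in both sequences
    identical = 0  # aligned columns with the same residue

    for a, b in zip(target_aln, template_aln):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        aligned += 1
        if a.upper() == b.upper():
            identical += 1

    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# Hypothetical toy alignment: 6 aligned columns, 5 of them identical.
identity = percent_identity("MKT-AYIA", "MRTQAY-A")
print(f"{identity:.1f}% identity")
```

Because the result depends on the denominator, identity values are only comparable when they are computed with the same convention.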

Figure 7.1 Schematic overview of the methodologies employed in Comparative Modeling/threading and ab initio folding.

Comparative Modeling (CM) can be used to predict the structure of those proteins whose sequence identity with a template protein sequence is above 30% [34], although progress has been reported at lower sequence identity [26]. An obvious limitation is that it requires a homologous protein, the template, whose structure is known. When proteins have more than 50% sequence identity to their templates, the backbone atoms of models built by CM techniques [19] can have up to a 1 Å root-mean-square deviation (RMSD) from native; this is comparable to experimental accuracy [9]. For target proteins with 30–50% sequence identity to their templates, the backbone atoms often have about 85% of their core regions within an RMSD of 3.5 Å from native, with errors mainly in the loops [19]. When the sequence identity drops below 30%, the model accuracy by CM sharply decreases because of the lack of significant template hits and substantial alignment errors. The sequence identity