1. Introduction
Information is central to life. The principle enunciated by Crick, that information flows from the gene (DNA) to the protein, occupies such a key place in modern molecular biology that it is frequently referred to as the central dogma: DNA acts as a template to replicate itself, DNA is transcribed into RNA, and RNA is translated into protein.
The mission of biology is to answer the question "What is life?" For many centuries, the study of the living world proceeded by examination of its external characteristics (i.e., of phenotype, including behaviour). This led to Linnaeus's hierarchical classification. A key advance was made about 150 years ago when Mendel established the notion of an unseen heritable principle. Improvements in experimental techniques led to a steady acceleration in the gathering of facts about the components of living matter, culminating in Watson and Crick's discovery of the DNA double helix half a century ago, which ushered in the modern era of molecular biology.
The mission of biology remained unchanged during these developments, but knowledge about life became steadily more detailed. As Sommerhoff has remarked, "To put it naïvely, the fundamental problem of theoretical biology is to discover how the behaviour of myriads of blind, stupid, and by inclination chaotic, atoms can obey the laws of physics and chemistry, and at the same time become integrated into organic wholes and into activities of such purpose-like character." Since he wrote those words, experimental molecular biology has advanced far and fast, yet the most important question of all, "What is life?", remains a riddle.
It is a curious fact that although information figures so prominently in the central dogma, the concept of information has continued to receive rather cursory treatment in molecular biology textbooks. Even today, the word "information" may not appear in the index. On the other hand, whole chapters are devoted to energy and energetics, energy being, like information, a fundamental, irreducible concept. Although the doctoral thesis of Shannon, one of the fathers of information theory, was entitled "An algebra for theoretical genetics", biology apart from genetics remained largely untouched by developments in information science.
One might speculate on why information was placed so firmly at the core of molecular biology by one of its pioneers. During the preceding decade, there had been tremendous advances in the theory of communication, the science of the transmission of information. Shannon published his seminal paper on the mathematical theory of communication only a few years before Watson and Crick's work. In that context, the notion of a sequence of DNA bases as a message with meaning seemed only natural, and the next major development, the establishment of the genetic code with which the DNA sequence could be transformed into a protein sequence, was cast very much in the language and concepts of communication theory. More puzzling is that there was not subsequently a more vigorous interchange between the two disciplines. Probably the lack of extensive datasets and of powerful computers, which made the necessary calculations intolerably tedious, or simply too long, provides sufficient explanation for this neglect. Hence, now that both these requirements (datasets and powerful computers) are being met, it is not surprising that there is a great revival in the application of information ideas to biology. One may indeed hope that this revival will at last lead to a real answer being advanced in response to the vital question "What is life?" In other words, information science is perhaps the missing discipline that, along with the physics and chemistry already being brought to bear, is needed to answer the question.
1.1 What is Bioinformatics?
The term "bioinformatics" seems to have been first used in the mid-1980s to describe the application of information science and technology in the life sciences. The definition was at that time very general, covering everything from robotics to artificial intelligence. Later, bioinformatics came to be somewhat prosaically defined as "the use of computers to retrieve, process, analyse, and simulate biological information". An even narrower definition was "the application of information technology to the management of biological data". Such definitions fail to capture the centrality of information in biology. If, indeed, information is the most fundamental concept underlying biology and bioinformatics is the exploration of all the ramifications and implications of that basis, then bioinformatics is excellently positioned to revive consideration of the central question "What is life?" A more appropriate definition of bioinformatics is, therefore, "the science of how information is generated, transmitted, received, stored, processed, and interpreted in biological systems" or, more succinctly, "the application of information science to biology".
The emergence of information theory by the middle of the twentieth century enabled the creation of a formal framework within which information could be quantified. To be sure, the theory was, and to some extent still is, incomplete, especially regarding those aspects that go beyond the merely faithful transmission of messages to enquire about, and even quantify, the meaning and significance of messages.
In parallel to these developments, other advances, including the development of the idea of algorithmic complexity, with which the names of Kolmogorov and Chaitin are associated, allowed a number of other crucial clarifications to be made, including the notion that randomness is minimally informative. The DNA sequence of a living organism must depart in some way from randomness, and the study of these departures could be said to constitute the core of bioinformatics.
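As a rough illustration of what a departure from randomness can look like in practice, the following Python sketch compares a uniformly random base sequence with a highly repetitive one. It computes the Shannon entropy of the single-base frequencies and uses the zlib-compressed size as a crude, computable stand-in for algorithmic complexity, which cannot itself be computed exactly. The function names, the sequence lengths, and the choice of compressor are assumptions made purely for illustration.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy_per_base(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the single-base frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compression_ratio(seq: str) -> float:
    """Compressed size divided by raw size: a crude proxy for
    algorithmic complexity (an assumption of this sketch)."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)  # fixed seed so the run is repeatable
random_seq = "".join(rng.choice("ACGT") for _ in range(10_000))
repeat_seq = "ACGT" * 2_500  # same length, maximally regular

for name, seq in [("random", random_seq), ("repetitive", repeat_seq)]:
    print(name,
          round(shannon_entropy_per_base(seq), 3),
          round(compression_ratio(seq), 3))
```

Both sequences have a single-base entropy of about 2 bits, because the four bases are (nearly) equally frequent in each. Only the compressed size separates them: the repetitive sequence shrinks to a small fraction of the size of the random one. First-order statistics alone can therefore miss structure that a measure closer to algorithmic complexity detects.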
Alongside information theory, cybernetics developed as a distinctive science at around the same time and largely within the same constellation. Its definition is well conveyed by the subtitle of Wiener's eponymous book: the study of control and communication in the animal and the machine. The word itself was coined by Ampère (as cybernétique) more than a century earlier. It is derived from the Greek κυβερνήτης, meaning steersman, from which we get our Latin gubernetes, morphing into governor. A governor such as Watt's for the steam engine uses a relatively simple feedback mechanism in its operation, and feedback has remained an important concept within cybernetics. The word appears to have already been used by Plato as a metaphor for governance in society (which was the interest of Ampère in the topic). According to Aristotle, κυβερνητική, the art of the steersman, implied teleological (goal-oriented) activity as well as knowledge, which is, as Sommerhoff has pointed out, perhaps the most characteristic apparent feature of living organisms. Information is, of course, central to considering how control and communication are enacted and, hence, bioinformatics and cybernetics become almost synonymous.
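The feedback principle behind a governor can be sketched in a few lines of Python. The toy model below is not a model of Watt's device: it assumes a hypothetical first-order engine whose speed rises with throttle and falls with load and friction, and a controller that opens the throttle in proportion to the gap between target and actual speed. All names and parameter values are illustrative assumptions.

```python
def simulate_governor(target: float, gain: float, load: float,
                      steps: int = 2000, dt: float = 0.01) -> list[float]:
    """Toy engine under proportional negative feedback.

    Assumed dynamics (illustrative only):
        d(speed)/dt = throttle - load - friction * speed
        throttle    = gain * (target - speed), never below zero
    """
    friction = 0.5  # hypothetical damping coefficient
    speed = 0.0
    history = []
    for _ in range(steps):
        error = target - speed               # information fed back
        throttle = max(0.0, gain * error)    # corrective action
        speed += dt * (throttle - load - friction * speed)
        history.append(speed)
    return history


light = simulate_governor(target=10.0, gain=20.0, load=1.0)
heavy = simulate_governor(target=10.0, gain=20.0, load=3.0)
print(round(light[-1], 2), round(heavy[-1], 2))
```

With these assumed values, the speed settles at about 9.71 under the lighter load and about 9.61 under the heavier one. Tripling the load barely shifts the outcome, because the measured speed is continually fed back to correct the throttle; this is the goal-oriented behaviour that the steersman metaphor describes.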
1.2 What Can Bioinformatics Do?
In a very short time, bioinformatics has become an extremely active research field. Although it began with sequence comparison (a branch of the study of the nonrandomness of DNA sequences), it now encompasses a far wider spread of activity, which truly epitomizes modern scientific research. It is highly interdisciplinary, requiring at least mathematical, biological, physical, and chemical knowledge, and its implementation may furthermore require knowledge of computer science, chemical engineering, biotechnology, medicine, pharmacology, and so on. There is, moreover, little distinction between work carried out in the public domain, whether in academic institutions (universities) or state research laboratories, and work carried out privately by commercial firms.