Sign In

Communications of the ACM

Communications of the ACM

Discovery in the Human Genome Project

Knowledge discovery in biology has profited enormously the past few years from technological improvements, especially in large-scale sequencing projects. This development, in turn, created new challenges for computational tools for discovery to exhaustively and efficiently analyze and interpret greater amounts of raw data. Two new major issues of the molecular biological community are:

  • The obvious need and technical difficulty to setup and maintain distributed databases to be linked, coordinated, and integrated over the Web as the basis for analyzing and interpreting biological organisms. Due to the wide variety and uncontrolled use of intuitively simple but difficult to recognize terms and definitions, semantic integration of heterogeneous knowledge sources poses unexpected formidable difficulties [2].
  • The growing possibility and scientific challenge to shift from the established, but severely limited, approach of predicting biological function from protein tertiary structure to the new approach of identifying biological function directly from within the Human Genome Project recently acquired masses of DNA sequences and expression data, with the ultimate goal of understanding biological organisms in biochemical and molecular biological terms.

These challenges prompted many new developments, for example, integrated data warehouses and distributed database federations to manage primary and derived genetic data; and new methods to analyze and interpret genomic sequences.

Back to Top

Knowledge Discovery Approaches

The ultimate challenge in molecular biology is to make sense of the billions of four-letter sequences and to understand their significance in the composition and behavior of biological beings. Mere statistical analysis often seems to capture the functional aspect hidden within genomic fragments inadequately. Hence, some researchers attempt to elucidate higher structure information from plain sequences by using knowledge discovery methods. Here, we discuss three such approaches.

Data mining for regulatory elements in yeast. Regulatory elements in the genome are stretches of DNA that exert control over other genes and thereby influence the phenotype and behavior of the organism. The Transcription Factor Combination Discoverer [1] is a data mining tool for finding and analyzing combinations of transcription-factor binding sites and potential promoter classes in the yeast genome.

Yeast genes and transcription-factor binding-site descriptions are taken from the MIPS and IMD databases. A combination of binding sites is characterized by the following parameters:

Coverage: The number of its occurrences in upstream regions.

Goodness: The ratio of the number of its occurrences in the upstream regions vs. the number of occurrences in random regions (of the same length and number).

Unexpectedness: The ratio of its occurrences vs. the expected number of occurrences based on the individual sites.

Combinations with high values for all these parameters can possibly define promoter classes. A high value of the coverage parameter ensures the combination is present in at least the given number of upstream regions. The goodness parameter ensures the frequency of the combination in upstream regions is not a mere consequence of a high frequency in the genome as the whole. The unexpectedness parameter ensures the frequency of the combination is not only a consequence of the high frequency of individual participating binding sites.

Automated clustering and assembly of large EST collections. Short DNA sequences experimentally confirmed to correspond to actual gene products are called "expressed sequence tags" (ESTs). It is an ongoing challenge to find the correct placement of ESTs and their corresponding genes on the chromosome.

The REX algorithm [4] is a simple yet effective algorithm that integrates the clustering and assembly steps necessary to create a consensus assembled sequence from EST data. It is fast and reflects an intuitive idea of extending an anchor sequence in direction with ESTs derived from an arbitrary number of sequence databases. A rough assembled sequence is created in the process of extending an anchor. This rough assembly is then improved by analyzing a multiple alignment of contigs to make consensus base calls and to remove potential insertion and deletion errors. This method also addresses the issues of splice variants and unspliced DNA data.

Genetic algorithms for protein threading. Since the prediction of protein tertiary structure from protein sequence turned out to be extremely difficult, researchers have attempted to succeed on the inverse route: Given a valid, native protein tertiary structure, which protein sequences fit that structure best?

This NP-hard problem is called "protein threading" and it is hoped that its reverse application will be beneficial to the original protein folding problem. Genetic Algorithms (GAs) have been used [3] in order to significantly improve the ability of threading algorithms to find the optimal alignment of a sequence to a structure, such as the alignment with the minimum free energy.

Major progress has been reported in the design of a representation of the threading alignment as a string of fixed length. The described algorithm is shown to perform well even without predefined core elements. The authors maintain that GA threading is capable of finding consistently good solutions of full alignments in search spaces of size up to 1070.

Back to Top

Summary and Conclusion

The three approaches presented here are meant to illustrate the status of awareness of computational biologists of methods from automatic knowledge discovery. It is there, but the latest algorithms are not always used and more attention to the domain of computational biology would certainly be welcome.

However, it must be emphasized the application of knowledge discovery methods to biological problems depends crucially on profound understanding of the application domain, in this case a domain where 90% experimental accuracy is considered highly significant, fuzzy rules are common, and the ability to abstract from irrelevant details is important but difficult. The sample studies presented also reflect the fact the issues in molecular biology today are many and subtle and most obvious approaches to the great problems have already been tried, and yet, there is much room for improvement.

Back to Top


1. Brazma, A., Vilo, J., Ukkonen, E., and Valtonen, K. Data mining for regulatory elements in yeast. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology 5 (1997), 65–74.

2. Schulze-Kremer, S. Ontologies for molecular biology. In Proceedings of the Third Pacific Symposium on Biocomputing 3 (1998), 693–704.

3. Yadgari, J., Amir, A., and Unger, R. Genetic algorithms for protein threading. In Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology 6 (1998), 193–202.

4. Yee, D. P. and Conklin, D. (1998). Automated clustering and assembly of large EST collections. In Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology 6 (1998), 203–211.

Back to Top


Steffen Schulze-Kremer ( is the informatics manager at the Resource Center within the German Human Genome Project at the Max Planck Institute for Molecular Genetics in Berlin.

©1999 ACM  0002-0782/99/1100  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 1999 ACM, Inc.


No entries found