Binding Site Analysis and Detection



Genetic information, stored in DNA sequences, controls and organizes many phenotypic processes, such as metabolism. Protein-DNA interactions are an important step in the flow of information from the genome to the phenotype. The analysis of protein-DNA interactions is therefore a highly relevant subject in bioinformatics. Such interactions take place at specific sites on the genome, called binding sites.


Information Theoretic Analysis

The binding sites for a given protein on the genome are characterized by the positional information Rfreq and the sequence information Rseq. We have investigated the relation between Rfreq and Rseq on the basis of maximum entropy analysis. By modelling co-evolution between the genome and the DNA-binding protein, we obtained the key result that the equality Rfreq = Rseq holds approximately for all biological systems which are genetically autonomous, i.e. which encode all their DNA-binding proteins within their own genome.
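As a rough illustration of the two quantities, the following sketch uses the definitions commonly found in the sequence-logo literature, which this page does not spell out and which are therefore an assumption here: Rfreq = log2(G / γ) for γ binding sites among G possible positions, and Rseq as the sum over site positions of (2 − H), where H is the Shannon entropy of the base frequencies at that position. No small-sample correction is applied. The function names `r_freq` and `r_seq` and the toy data are hypothetical.

```python
import math
from collections import Counter


def r_freq(num_sites, genome_positions):
    """Positional information in bits: log2(G / gamma) (assumed definition)."""
    return math.log2(genome_positions / num_sites)


def r_seq(sites):
    """Sequence information in bits: sum over positions of (2 - H).

    `sites` is a list of aligned binding sites of equal length.
    No small-sample correction is applied in this sketch.
    """
    length = len(sites[0])
    total = 0.0
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


# Toy data (hypothetical): 4 aligned sites in a genome with 4096 positions.
toy_sites = ["ACGT", "ACGA", "ACGT", "ACGC"]
print(r_freq(num_sites=4, genome_positions=4096))  # 10.0 bits
print(r_seq(toy_sites))                            # 6.5 bits
```

In this toy case the two values differ; the approximate equality described above is a statement about co-evolved, genetically autonomous systems, not about arbitrary data.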


Binding Site Detection: The Binding Matrix

The standard approach to binding site detection is the profile matrix. Based on a set of experimentally discovered binding sites of length L, the profile matrix is a 4 × L matrix that contains, for each position, the occurrence frequency of each of the four bases (A, C, G, T). Genomic sequences are scanned for binding sites by scoring each candidate site, i.e. each window of length L in the genome, and checking whether the score exceeds a prespecified threshold value.
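The profile matrix approach can be sketched as follows. This is a minimal illustration, not the implementation used in the work described here: the scoring rule (summing the per-position frequencies of the observed bases) is an assumption chosen for simplicity, and the names `profile_matrix`, `score`, and `scan` as well as the toy data are hypothetical.

```python
BASES = "ACGT"


def profile_matrix(sites):
    """Build a 4 x L matrix of base frequencies from aligned sites.

    Returned as a dict mapping each base to a list of L frequencies.
    """
    length = len(sites[0])
    n = len(sites)
    return {
        base: [sum(site[pos] == base for site in sites) / n for pos in range(length)]
        for base in BASES
    }


def score(word, matrix):
    """Score a word by summing the frequency of its base at each position.

    This additive rule is an assumption made for this sketch.
    """
    return sum(matrix[base][pos] for pos, base in enumerate(word))


def scan(sequence, matrix, threshold):
    """Return (offset, word, score) for every window scoring above threshold."""
    length = len(matrix["A"])
    hits = []
    for offset in range(len(sequence) - length + 1):
        word = sequence[offset:offset + length]
        s = score(word, matrix)
        if s > threshold:
            hits.append((offset, word, s))
    return hits


# Toy data (hypothetical): four known sites and a short sequence to scan.
known_sites = ["ACGT", "ACGA", "ACGT", "ACGC"]
matrix = profile_matrix(known_sites)
hits = scan("TTACGTTTACGGAA", matrix, threshold=3.2)
print(hits)  # [(2, 'ACGT', 3.5)]
```

The window ACGG at offset 8 scores 3.0 and is rejected, which shows how the choice of threshold decides which words are reported as binding sites.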

We have systematically addressed the binding site detection problem by considering a transcription factor as a molecular device that classifies words, defined as sequences of length L, as either binding words or non-binding words. The presence of a binding word at a given site implies that the site is a binding site. Our analysis revealed that the profile matrix is not the maximum likelihood estimator for the classifier implemented by the protein. We developed an algorithm for computing the maximum likelihood estimator, which we call the binding matrix. Performance tests based on the TRANSFAC database showed that the binding matrix improves specificity by one order of magnitude relative to the profile matrix.

We provide a Web-based service for binding matrix computation.




  • Thomas Martinetz, Jan E. Gewehr, Jan T. Kim, 2002: "A Statistical Learning Approach to Predicting Protein-DNA-Binding Sites". Poster at ECCB 2002 [PDF].
  • Jan T. Kim, Thomas Martinetz, Daniel Polani, 2002: "The Effects of the Transcription Factor on Binding Site Information Are Constrained by Genetic Autonomy". Poster at ECCB 2002 [PDF].
  • Jan T. Kim, Thomas Martinetz, Daniel Polani, 2003: "Bioinformatic Principles Underlying the Information Content of Transcription Factor Binding Sites". Journal of Theoretical Biology 220: 529-544.