Phylogenetic footprinting
   HOME

TheInfoList



OR:

Phylogenetic footprinting is a technique used to identify
transcription factor In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. The f ...
binding sites (TFBS) within a non-coding region of DNA of interest by comparing it to the
orthologous Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a s ...
sequence In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is calle ...
in different
species In biology, a species is the basic unit of classification and a taxonomic rank of an organism, as well as a unit of biodiversity. A species is often defined as the largest group of organisms in which any two individuals of the appropriate s ...
. When this technique is used with a large number of closely related species, this is called phylogenetic shadowing. Researchers have found that non-coding pieces of DNA contain binding sites for regulatory proteins that govern the spatiotemporal expression of genes. These transcription factor binding sites (TFBS), or regulatory motifs, have proven hard to identify, primarily because they are short in length, and can show
sequence In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is calle ...
variation. The importance of understanding transcriptional regulation to many fields of
biology Biology is the scientific study of life. It is a natural science with a broad scope but has several unifying themes that tie it together as a single, coherent field. For instance, all organisms are made up of cells that process hereditary i ...
has led researchers to develop strategies for predicting the presence of TFBS, many of which have led to publicly available databases. One such technique is
Phylogenetic In biology, phylogenetics (; from Greek φυλή/ φῦλον [] "tribe, clan, race", and wikt:γενετικός, γενετικός [] "origin, source, birth") is the study of the evolutionary history and relationships among or within groups o ...
Footprinting Footprinting (also known as reconnaissance) is the technique used for gathering information about computer systems and the entities they belong to. To get this information, a hacker might use various tools and technologies. This information is v ...
. Phylogenetic footprinting relies upon two major concepts: # The function and DNA binding preferences of
transcription factors In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. The fun ...
are well-conserved between diverse
species In biology, a species is the basic unit of classification and a taxonomic rank of an organism, as well as a unit of biodiversity. A species is often defined as the largest group of organisms in which any two individuals of the appropriate s ...
. # Important
non-coding DNA Non-coding DNA (ncDNA) sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules (e.g. transfer RNA, microRNA, piRNA, ribosomal RNA, and regula ...
sequences that are essential for regulating gene expression will show differential selective pressure. A slower rate of change occurs in TFBS than in other, less critical, parts of the non-coding genome.


History

Phylogenetic footprinting was first used and published by Tagle et al. in 1988, which allowed researchers to predict evolutionary conserved cis-regulatory elements responsible for embryonic ε and γ
globulin The globulins are a family of globular proteins that have higher molecular weights than albumins and are insoluble in pure water but dissolve in dilute salt solutions. Some globulins are produced in the liver, while others are made by the immune ...
gene expression in primates. Before phylogenetic footprinting,
DNase Deoxyribonuclease (DNase, for short) refers to a group of glycoprotein endonucleases which are enzymes that catalyze the hydrolytic cleavage of phosphodiester linkages in the DNA backbone, thus degrading DNA. The role of the DNase enzyme in cells ...
footprinting was used, where protein would be bound to DNA transcription factor binding sites (TFBS) protecting it from DNase digestion. One of the problems with this technique was the amount of time and labor it would take. Unlike DNase footprinting, phylogenetic footprinting relies on
evolutionary Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...
constraints within the genome, with the "important" parts of the sequence being conserved among the different species.


Protocol

It is important when using this technique to decide which genome your sequence should be aligned to. More divergent species will have less sequence similarity between orthologous genes. Therefore, the key is to pick species that are related enough to detect homology, but divergent enough to maximize non-alignment "noise". Step wise approach to Phylogenetic footprinting consists of : # One should decide on the gene of interest. # Carefully choose species with orthologous genes. # Decide on the length of the upstream or maybe downstream region to be looked at. # Align the sequences. # Look for conserved regions and analyse them.


Not all TFBS are found

Not all transcription binding sites can be found using phylogenetic footprinting due to the statistical nature of this technique. Here are several reasons why some TFBS are not found:


Species specific binding sites

Some binding sites seem to have no significant matches in most other species. Therefore, detecting these sites by phylogenetic footprinting is likely impossible unless a large number of closely related species are available.


Very short binding sites

Some binding sites show excellent conservation, but just in a shorter region than the ones were looked for. Such short motifs (e.g., GC-box) often occur by chance in nonfunctional sequences and detecting these motifs can be challenging.


Less specific binding factors

Some binding sites show some conservation but have had insertions or deletions. It is not obvious if these sequences with insertions or deletions are still functional. Though they may still be functional if the binding factor is less specific (or less 'picky' if you will). Because deletions and insertions are rare in binding sites, considering insertions and deletions in the sequence would detect a few more true TFBSs, but it could likely include many more false positives.


Not enough data

Some motifs are quite well conserved, but they are statistically insignificant in a specific dataset. The motif might have appeared in different species by chance. These motifs could be detected if sequences from more organisms are available. So this will be less of a problem in the future.


Compound binding regions

Some transcription factors bind as dimers. Therefore, their binding sites may consist of two conserved regions, separated by a few variable nucleotides. Because of the variable internal sequence, the motif cannot be detected. However, if we could use a program to search for motifs containing a variable sequence in the middle, without counting mutations, these motifs could be discovered.


Accuracy

It is important to keep in mind that not all conserved sequences are under selection pressure. To eliminate false positives statistical analysis must be performed that will show that the motifs reported have a mutation rate meaningfully less than that of the surrounding nonfunctional sequence. Moreover, results could be more accurate if the prior knowledge about the sequence is considered. For example, some regulatory elements are repeated 15 times in a promoter region (e.g., some metallothionein promoters have up to 15 metal response elements (MREs)). Thus, to eliminate false motifs with inconsistent order across species, the orientation and order of regulatory elements in a promoter region should be the same in all species. This type of information could help us to identify regulatory elements that are not adequately conserved but occur in several copies in the input sequence.Blanchette, M. and Tompa, M. 2002. Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. ''Genome Res.'' 12: 739-748


References

Phylogenetics