
In the fields of bioinformatics and computational biology, genome survey sequences (GSS) are nucleotide sequences similar to expressed sequence tags (ESTs); the only difference is that most of them are genomic in origin, rather than derived from mRNA (GenBank Flat File 96.0 Release Notes). Genome survey sequences are typically generated and submitted to NCBI by labs performing genome sequencing and are used, among other things, as a framework for the mapping and sequencing of genome-size pieces included in the standard GenBank divisions.


Contributions

Genome survey sequencing is a way to map genome sequences that does not depend on mRNA. Current genome sequencing approaches are mostly high-throughput shotgun methods, and GSS is often used in the first step of sequencing. Unlike ESTs, GSSs can provide an initial global view of a genome that includes both coding and non-coding DNA as well as its repetitive sections. For the estimation of repetitive sequences, GSS plays an important role in the early assessment of a sequencing project, since these data can affect the assessment of sequence coverage, library quality and the construction process (Otto, Thomas D., et al. "ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS)." BMC Bioinformatics 9.1 (2008): 366). For example, survey sequencing of the dog genome can be used to estimate global parameters such as the neutral mutation rate and repeat content. GSS is also an effective way to characterize, rapidly and on a large scale, the genomes of related species for which few gene sequences or maps are available. Low-coverage GSS can generate abundant information about the gene content and putative regulatory elements of the species being compared. The genes of related species can then be compared to find gene families that have relatively expanded or contracted. Combined with physical clone coverage, GSS lets researchers navigate the genome easily and characterize a specific genomic region by more extensive sequencing.
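As a rough illustration of how repeat content might be estimated from a set of survey reads, the sketch below counts k-mers across all reads and reports the fraction of k-mer occurrences that belong to over-represented k-mers. This is a generic k-mer heuristic, not the ReRep method cited above; the function names, the choice of `k` and the `min_count` threshold are assumptions made for the example, and reverse complements are not merged.

```python
from collections import Counter
from typing import Iterable


def kmer_counts(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer observed in the reads (k-mers containing N are skipped)."""
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def repeat_fraction(reads: Iterable[str], k: int = 12, min_count: int = 3) -> float:
    """Fraction of k-mer occurrences whose k-mer is seen at least min_count times.

    The threshold is an illustrative assumption: at low coverage, a k-mer
    seen several times is more likely to come from a repeat than from
    overlapping reads of a unique region.
    """
    counts = kmer_counts(list(reads), k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c >= min_count)
    return repeated / total


if __name__ == "__main__":
    # Hypothetical toy reads: the first contains a short tandem repeat.
    reads = ["ACGTACGTACGT", "TTTTGGGGCCCC"]
    print(repeat_fraction(reads, k=4, min_count=3))
```

On real data the threshold would need to be tuned to the sequencing coverage, since overlapping reads of unique regions also produce repeated k-mers.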


Limitation

The limitation of genome survey sequences is that they lack long-range continuity because of their fragmentary nature, which makes it harder to predict gene and marker order. For example, when detecting repetitive sequences in GSS data, it may not be possible to find all the repeats, since a repetitive region may be longer than the reads and is then difficult to recognize.


Types of data

The GSS division contains (but is not limited to) the following types of data:


Random "single pass read" genome survey sequences

Random "single pass read" genome survey sequences are GSSs generated by a single sequencing pass over randomly selected clones. Single-pass sequencing allows genomic data to be accumulated rapidly, but with lower accuracy. This category includes RAPD, RFLP, AFLP and so on.


Cosmid/BAC/YAC end sequences

Cosmid/BAC/YAC end sequences use cosmids, bacterial artificial chromosomes (BACs) or yeast artificial chromosomes (YACs) to sequence the genome from the ends of the cloned insert. These vectors behave like very low-copy plasmids, sometimes with only one copy per cell. To obtain enough DNA, a large volume of E. coli culture is needed; 2.5 to 5 litres may be a reasonable amount. Cosmids, BACs and YACs can also carry larger cloned DNA fragments than vectors such as plasmids and phagemids. A larger insert is often helpful for organizing clones in a sequencing project. Eukaryotic proteins can be expressed with posttranslational modification by using YACs. BACs cannot do that, but they represent human DNA much more reliably than YACs or cosmids.


Exon trapped genomic sequences

Exon trapped sequences are used to identify genes in cloned DNA; this is achieved by recognizing and trapping a carrier that contains an exon sequence. Exon trapping has two main features. First, it is independent of the availability of RNA expressing the target DNA. Second, isolated sequences can be derived directly from a clone without knowing which tissues express the gene to be identified. During splicing, exons are retained in the mRNA, and the information they carry can be translated into protein. Since a DNA fragment can be inserted into a sequence, if an exon is inserted into an intron, the transcript will be longer than usual, and this longer transcript can be trapped by analysis.


Alu PCR sequences

Alu repetitive elements are members of the short interspersed elements (SINEs) in mammalian genomes. There are about 300 to 500 thousand copies of the Alu repetitive element in the human genome, which means that on average one Alu element occurs every 4 to 6 kb. Alu elements are widely distributed in mammalian genomes, and repetitiveness is one of their characteristics, which is why they are called Alu repetitive elements. By using a particular Alu sequence as the target locus, specific human DNA can be obtained from TAC, BAC or PAC clones or from human-mouse cell hybrids. PCR is an approach used to amplify a small fragment of DNA. The fragment may be a whole gene or just part of a gene. PCR can only amplify very small DNA fragments, generally not exceeding 10 kbp. Alu PCR, that is, PCR-based genome fingerprinting with Alu elements, is a "DNA fingerprinting" technique that is rapid and easy to use. The fingerprint is obtained by analysing many genomic loci flanked by Alu repetitive elements, which are non-autonomous retrotransposons present in high copy number in primate genomes.


Transposon-tagged sequences

There are several ways to analyze the function of a particular gene sequence; the most direct is to replace it or mutate it and then analyze the results and effects. Three methods have been developed for this purpose: gene replacement, sense and anti-sense suppression, and insertional mutagenesis. Among these, insertional mutagenesis has proved to be a very successful approach. At first, T-DNA was used for insertional mutagenesis. However, using transposable elements can bring more advantages. Transposable elements were first discovered by Barbara McClintock in maize plants. She identified the first transposable genetic element, which she called the Dissociation (Ds) locus. Transposable elements are between 750 and 40,000 bp in size. They fall into two main classes: a very simple class called insertion sequences (IS), and a more complex class called transposons. A transposon has one or several characterized genes, which can be easily identified. An IS carries the transposase gene. A transposon can be used as a tag for DNA with a known sequence. A transposon can appear at another locus through transcription or reverse transcription, by the action of a nuclease. This movement shows that the genome is not static but continually changes its own structure.

Transposon tagging has two advantages. First, if a transposon is inserted into a gene sequence, the insertion is single and intact, which makes the tagged sequence easy to analyze at the molecular level. Second, many transposons are found to be excised from the tagged gene sequence when the transposase is analyzed. This confirms that the gene sequence was really tagged by the transposon.


Example of GSS file

The following is an example of a GSS file that can be submitted to GenBank (dbGSS_submit):
TYPE: GSS
STATUS:  New
CONT_NAME: Sikela JM
GSS#: Ayh00001
CLONE: HHC189
SOURCE: ATCC
SOURCE_INHOST: 65128
OTHER_GSS:  GSS00093, GSS000101
CITATION: 
Genomic sequences from Human 
brain tissue
SEQ_PRIMER: M13 Forward
P_END: 5'
HIQUAL_START: 1
HIQUAL_STOP: 285
DNA_TYPE: Genomic
CLASS: shotgun
LIBRARY: Hippocampus, Stratagene (cat. #936205)
PUBLIC: 
PUT_ID: Actin, gamma, skeletal
COMMENT:
SEQUENCE:
AATCAGCCTGCAAGCAAAAGATAGGAATATTCACCTACAGTGGGCACCTCCTTAAGAAGCTG
ATAGCTTGTTACACAGTAATTAGATTGAAGATAATGGACACGAAACATATTCCGGGATTAAA
CATTCTTGTCAAGAAAGGGGGAGAGAAGTCTGTTGTGCAAGTTTCAAAGAAAAAGGGTACCA
GCAAAAGTGATAATGATTTGAGGATTTCTGTCTCTAATTGGAGGATGATTCTCATGTAAGGT
GCAAAAGTGATAATGATTTGAGGATTTCTGTCTCTAATTGGAGGATGATTCTCATGTAAGGT
TGTTAGGAAATGGCAAAGTATTGATGATTGTGTGCTATGTGATTGGTGCTAGATACTTTAAC
TGAGTATACGAGTGAAATACTTGAGACTCGTGTCACTT
||
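To show how a record in this `FIELD: value` layout might be read programmatically, here is a minimal parsing sketch. It is not an official NCBI tool: the function names are hypothetical, the shortened record is made up for the example, the `||` terminator is treated as the end of the record, and `HIQUAL_START`/`HIQUAL_STOP` are assumed to be 1-based inclusive coordinates.

```python
import re
from typing import Dict

# A field line starts with an upper-case tag such as TYPE, GSS# or P_END.
FIELD_RE = re.compile(r"^([A-Z][A-Z_#]*):\s*(.*)$")


def parse_gss_record(text: str) -> Dict[str, str]:
    """Parse one submission record into a dict of field name -> value.

    Lines without a field tag are treated as continuations of the
    previous field (for example CITATION text or SEQUENCE lines).
    """
    fields: Dict[str, str] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "||":  # assumed end-of-record marker
            break
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2).strip()
        elif current is not None:
            # Sequence lines are concatenated; other text is joined with spaces.
            joiner = "" if current == "SEQUENCE" else " "
            fields[current] = (fields[current] + joiner + line).strip()
    return fields


def high_quality_region(fields: Dict[str, str]) -> str:
    """Return the high-quality part of the sequence (assumes 1-based, inclusive)."""
    start = int(fields["HIQUAL_START"])
    stop = int(fields["HIQUAL_STOP"])
    return fields["SEQUENCE"][start - 1:stop]


# Shortened, hypothetical record used only to exercise the parser.
EXAMPLE = """TYPE: GSS
STATUS:  New
GSS#: Ayh00001
CITATION:
Genomic sequences from Human
brain tissue
HIQUAL_START: 1
HIQUAL_STOP: 8
SEQUENCE:
AATCAGCC
TGCAAGCA
||
"""

if __name__ == "__main__":
    record = parse_gss_record(EXAMPLE)
    print(record["GSS#"], len(record["SEQUENCE"]), high_quality_region(record))
```

Keeping the parser tolerant of multi-line values matters here because both the citation and the sequence span several lines in the example file.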

