Bloom Filters In Bioinformatics

picture info	Bloom Filters In Bioinformatics Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter. Bloom filters are primarily used in bioin ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Bloom Filter A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash co ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	BLAST (biotechnology) In bioinformatics, BLAST (basic local alignment search tool) is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence (called a query) with a library or database of sequences, and identify database sequences that resemble alphabet above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the pig genome that resemble the mouse gene based on similarity of sequence. Background BLAST, which ''The New York Times'' called ''the Google of biological research'', is one of the most widely used bioinformatics programs for sequence searching. It addresses a fundamental problem in bioinformatics ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Next-generation Sequencing Massive parallel sequencing or massively parallel sequencing is any of several high-throughput approaches to DNA sequencing using the concept of massively parallel processing; it is also called next-generation sequencing (NGS) or second-generation sequencing. Some of these technologies emerged between 1994 and 1998 and have been commercially available since 2005. These technologies use miniaturized and parallelized platforms for sequencing of 1 million to 43 billion short reads (50 to 400 bases each) per instrument run. Many NGS platforms differ in engineering configurations and sequencing chemistry. They share the technical paradigm of massive parallel sequencing via spatially separated, clonally amplified DNA templates or single DNA molecules in a flow cell. This design is very different from that of Sanger sequencing—also known as capillary sequencing or first-generation sequencing—which is based on electrophoretic separation of chain-termination products produced in individ ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	De Bruijn Graph In graph theory, an -dimensional De Bruijn graph of symbols is a directed graph representing overlaps between sequences of symbols. It has vertices, consisting of all possible sequences of the given symbols; the same symbol may appear multiple times in a sequence. For a set of symbols the set of vertices is: :V=S^n=\. If one of the vertices can be expressed as another vertex by shifting all its symbols by one place to the left and adding a new symbol at the end of this vertex, then the latter has a directed edge to the former vertex. Thus the set of arcs (that is, directed edges) is :E=\. Although De Bruijn graphs are named after Nicolaas Govert de Bruijn, they were discovered independently by both De Bruijn and I. J. Good. Much earlier, Camille Flye Sainte-Marie implicitly used their properties. Properties * If , then the condition for any two vertices forming an edge holds vacuously, and hence all the vertices are connected, forming a total of edges. * Each vertex h ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Genome Assembly In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing genomic DNA, or gene transcript ( ESTs). The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable. Genom ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Comparing De Brujin Graph In Memory Comparison or comparing is the act of evaluating two or more things by determining the relevant, comparable characteristics of each thing, and then determining which characteristics of each are similar to the other, which are different, and to what degree. Where characteristics are different, the differences may then be evaluated to determine which thing is best suited for a particular purpose. The description of similarities and differences found between the two things is also called a comparison. Comparison can take many distinct forms, varying by field: To compare things, they must have characteristics that are similar enough in relevant ways to merit comparison. If two things are too different to compare in a useful way, an attempt to compare them is colloquially referred to in English as "comparing apples and oranges." Comparison is widely used in society, in science and in the arts. General usage Comparison is a natural activity, which even animals engage in when dec ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	European Nucleotide Archive The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database (also known as EMBL-bank). The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank. The ENA has grown out of the EMBL Data Library which was released in 1982 as the first internationally supported resource for nucleotide sequence data. As of early 2012, the ENA and other INSDC member databases each contained complete genomes of 5,682 organisms and sequence data for almost 700,000. Moreover, the volume of data i ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Document Retrieval Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words. Document retrieval is sometimes referred to as, or as a branch of, text retrieval. Text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text databases became decentralized thanks to the personal computer. Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines. Description Document retrieval systems find information to given criteria by matching text records (''documents'') against user queries, as opposed to expert systems that answer questions by inferring over a logical knowledge database. A document retrieval syst ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	RNA-Seq RNA-Seq (named as an abbreviation of RNA sequencing) is a sequencing technique which uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the continuously changing cellular transcriptome. Specifically, RNA-Seq facilitates the ability to look at alternative gene spliced transcripts, post-transcriptional modifications, gene fusion, mutations/ SNPs and changes in gene expression over time, or differences in gene expression in different groups or treatments. In addition to mRNA transcripts, RNA-Seq can look at different populations of RNA to include total RNA, small RNA, such as miRNA, tRNA, and ribosomal profiling. RNA-Seq can also be used to determine exon/intron boundaries and verify or amend previously annotated 5' and 3' gene boundaries. Recent advances in RNA-Seq include single cell sequencing, in situ sequencing of fixed tissue, and native RNA molecule sequencing with single-molecule real- ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Sequence Read Archive The Sequence Read Archive (SRA, previously known as the Short Read Archive) is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC), and run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ). The archive was established by the National Center for Biotechnology Information (NCBI) in 2007 in order to provide a repository for data produced by RNA-Seq and ChIP-Seq studies as well as large-scale studies including the Human Microbiome Project and the 1000 Genomes Project. Originally called the Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads. The volume of data deposited in the Seque ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Data Structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data. Usage Data structures serve as the basis for abstract data types (ADT). The ADT defines the logical form of the data type. The data structure implements the physical form of the data type. Different types of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, Relational database, relational databases commonly use B-tree indexes for data retrieval, while compiler Implementation, implementations usually use hash tables to look up identifiers. Data structures provide a means to manage large amounts of data efficiently for uses such a ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Reference Genome A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. As they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead a reference provides a haploid mosaic of different DNA sequences from each donor. For example, the most recent human reference genome (assembly '' GRCh38/hg38'') is derived from >60 genomic clone libraries. There are reference genomes for multiple species of viruses, bacteria, fungus, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]