File Compression

picture info	File Compression In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder. The process of reducing the size of a data file is often referred to as data compression. In the context of data transmission, it is called source coding; encoding done at the source of the data before it is stored or transmitted. Source coding should not be confused with channel coding, for error detection and correction or line coding, the means for mapping data onto a signal. Co ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Information Theory Information theory is the scientific study of the quantification (science), quantification, computer data storage, storage, and telecommunication, communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in the 1920s, and Claude Shannon in the 1940s. The field is at the intersection of probability theory, statistics, computer science, statistical mechanics, information engineering (field), information engineering, and electrical engineering. A key measure in information theory is information entropy, entropy. Entropy quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (with two equally likely outcomes) provides less information (lower entropy) than specifying the outcome from a roll of a dice, die (with six equally likely outcomes). Some other important measures in information theory are mutual informat ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Run-length Encoding Run-length encoding (RLE) is a form of lossless data compression in which ''runs'' of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. This is most efficient on data that contains many such runs, for example, simple graphic images such as icons, line drawings, Conway's Game of Life, and animations. For files that do not have many runs, RLE could increase the file size. RLE may also be used to refer to an early graphics file format supported by CompuServe for compressing black and white images, but was widely supplanted by their later Graphics Interchange Format (GIF). RLE also refers to a little-used image format in Windows 3.x, with the extension rle, which is a run-length encoded bitmap, used to compress the Windows 3.x startup screen. Example Consider a screen containing plain black text on a solid white background. There will be many long runs of white pixel ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Probabilistic Model A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" ( Herman Adèr quoting Kenneth Bollen). All statistical hypothesis tests and all statistical estimators are derived via statistical models. More generally, statistical models are part of the foundation of statistical inference. Introduction Informally, a statistical model can be thought of as a statistical assumption (or set of statistical assumptions) with a certain property: that the assumption allows us to calculate the probability of any event. As an example, consider a pair of ordinary six-si ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Burrows–Wheeler Transform The Burrows–Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is ''reversible'', without needing to store any additional data except the position of the first original character. The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation. The Burrows–Wheeler transform is an algorithm used to prepare data for use with data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while Burrows was working at DEC Systems Research Center in Palo Alto, California. It is based on a previously unpublished transformation discovered by Wheeler in 1983. The algorithm can be imple ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Prediction By Partial Matching Prediction by partial matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream. PPM algorithms can also be used to cluster data into predicted groupings in cluster analysis. Theory Predictions are usually reduced to symbol rankings. Each symbol (a letter, bit or any other amount of data) is ranked before it is compressed, and the ranking system determines the corresponding codeword (and therefore the compression rate). In many compression algorithms, the ranking is equivalent to probability mass function estimation. Given the previous letters (or given a context), each symbol is assigned with a probability. For instance, in arithmetic coding the symbols are ranked by their probabilities to appear after previous symbols, and the whole sequence is compressed into a single fraction that is computed according ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Randomized Algorithm A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performance in the "average case" over all possible choices of random determined by the random bits; thus either the running time, or the output (or both) are random variables. One has to distinguish between algorithms that use the random input so that they always terminate with the correct answer, but where the expected running time is finite (Las Vegas algorithms, for example Quicksort), and algorithms which have a chance of producing an incorrect result (Monte Carlo algorithms, for example the Monte Carlo algorithm for the MFAS problem) or fail to produce a result either by signaling a failure or failing to terminate. In some cases, probabilistic algorithms are the only practical means of solving a problem. In common practice, randomized algor ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Re-Pair Re-Pair (short for Recursive Pairing) is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar generating a single string: the input text. The grammar is built by recursively replacing the most frequent pair of characters occurring in the text. Once there is no pair of characters occurring twice, the resulting string is used as the axiom of the grammar. Therefore, the output grammar is such that all rules but the axiom have two symbols on the right-hand side. How it works Re-Pair was first introduced by NJ. Larsson and A. MoffatLarsson, N. J., & Moffat, A. (2000). Off-line dictionary-based compression. Proceedings of the IEEE, 88(11), 1722-1732. in 1999. In their paper the algorithm is presented together with a detailed description of the data structures required to implement it with linear time and space complexity. The experiments showed that Re-Pair achieves high compression ratios and offers good perf ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Sequitur Algorithm Sequitur (or Nevill-Manning algorithm) is a recursive algorithm developed by Craig Nevill-Manning and Ian H. Witten in 1997 that infers a hierarchical structure (context-free grammar) from a sequence of discrete symbols. The algorithm operates in linear space and time. It can be used in data compression software applications. Constraints The sequitur algorithm constructs a grammar by substituting repeating phrases in the given sequence with new rules and therefore produces a concise representation of the sequence. For example, if the sequence is : S→abcab, the algorithm will produce : S→AcA, A→ab. While scanning the input sequence, the algorithm follows two constraints for generating its grammar efficiently: digram uniqueness and rule utility. Digram uniqueness Whenever a new symbol is scanned from the sequence, it is appended with the last scanned symbol to form a new digram. If this digram has been formed earlier then a new rule is made to replace both occurrences of th ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Grammar-based Codes Grammar-based codes or Grammar-based compression are compression algorithms based on the idea of constructing a context-free grammar (CFG) for the string to be compressed. Examples include universal lossless data compression algorithms. To compress a data sequence x = x_1 \cdots x_n, a grammar-based code transforms x into a context-free grammar G. The problem of finding a smallest grammar for an input sequence (smallest grammar problem) is known to be NP-hard, so many grammar-transform algorithms are proposed from theoretical and practical viewpoints. Generally, the produced grammar G is further compressed by statistical encoders like arithmetic coding. Examples and characteristics The class of grammar-based codes is very broad. It includes block codes, the multilevel pattern matching (MPM) algorithm, variations of the incremental parsing Lempel-Ziv code, and many other new universal lossless compression algorithms. Grammar-based codes are universal in the sense that they can a ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Huffman Coding In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes". The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (''weight'') for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in time linear to the number of input weights if these weights are sorted. However, although opt ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Graphics Interchange Format The Graphics Interchange Format (GIF; or , see pronunciation) is a bitmap image format that was developed by a team at the online services provider CompuServe led by American computer scientist Steve Wilhite and released on 15 June 1987. It is in widespread usage on the World Wide Web due to its wide support and portability between applications and operating systems. The format supports up to 8 bits per pixel for each image, allowing a single image to reference its own palette of up to 256 different colors chosen from the 24-bit RGB color space. It also supports animations and allows a separate palette of up to 256 colors for each frame. These palette limitations make GIF less suitable for reproducing color photographs and other images with color gradients, but well-suited for simpler images such as graphics or logos with solid areas of color. GIF images are compressed using the Lempel–Ziv–Welch (LZW) lossless data compression technique to reduce the file size wi ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Lempel–Ziv–Welch Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It is the algorithm of the widely used Unix file compression utility compress and is used in the GIF image format. Algorithm The scenario described by Welch's 1984 paper encodes sequences of 8-bit data as fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting of the corresponding 8-bit character, and the codes 256 through 4095 are created in a dictionary for sequences encountered in the data as it is encoded. At each stage in compression, input bytes are gathered into a sequence until the next character would make a sequence with no code yet in the dictionary. The co ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]