Coding theory is the study of the properties of codes and their respective fitness for specific applications. Codes are used for data compression, cryptography, error detection and correction, data transmission, and data storage. Codes are studied by various scientific disciplines (such as information theory, electrical engineering, mathematics, linguistics, and computer science) for the purpose of designing efficient and reliable data transmission methods. This typically involves the removal of redundancy and the correction or detection of errors in the transmitted data.

There are four types of coding:
# Data compression (or ''source coding'')
# Error control (or ''channel coding'')
# Cryptographic coding
# Line coding

Data compression attempts to remove unwanted redundancy from the data from a source in order to transmit it more efficiently. For example, DEFLATE data compression makes files smaller, for instance to reduce Internet traffic. Data compression and error correction may be studied in combination.

Error correction adds useful redundancy to the data from a source to make the transmission more robust to disturbances present on the transmission channel. The ordinary user may not be aware of many applications using error correction. A typical music compact disc (CD) uses the Reed–Solomon code to correct for scratches and dust. In this application the transmission channel is the CD itself. Cell phones also use coding techniques to correct for the fading and noise of high-frequency radio transmission. Data modems, telephone transmissions, and the NASA Deep Space Network all employ channel coding techniques to get the bits through, for example the turbo code and LDPC codes.

History of coding theory

Claude Shannon's 1948 paper "A Mathematical Theory of Communication" focuses on the problem of how best to encode the information a sender wants to transmit. In this fundamental work he used tools in probability theory, developed by Norbert Wiener, which were in their nascent stages of being applied to communication theory at that time. Shannon developed information entropy as a measure for the uncertainty in a message while essentially inventing the field of information theory.

The binary Golay code was developed in 1949. It is an error-correcting code capable of correcting up to three errors in each 24-bit word, and detecting a fourth.

Richard Hamming won the Turing Award in 1968 for his work at Bell Labs in numerical methods, automatic coding systems, and error-detecting and error-correcting codes. He invented the concepts known as Hamming codes, Hamming windows, Hamming numbers, and Hamming distance.

In 1972, Nasir Ahmed proposed the discrete cosine transform (DCT), which he developed with T. Natarajan and K. R. Rao in 1973. The DCT is the most widely used lossy compression technique and the basis for multimedia formats such as JPEG, MPEG, and MP3.

Source coding

The aim of source coding is to take the source data and make it smaller.

Definition

Data can be seen as a random variable <math>X:\Omega\to\mathcal{X}</math>, where <math>x \in \mathcal{X}</math> appears with probability <math>\mathbb{P}[X=x]</math>.

Data are encoded by strings (words) over an alphabet <math>\Sigma</math>.

A code is a function
: <math>C:\mathcal{X}\to\Sigma^*</math> (or <math>\Sigma^+</math> if the empty string is excluded as a code word).

<math>C(x)</math> is the code word associated with <math>x</math>.

The length of the code word is written as
: <math>l(C(x)).</math>

The expected length of a code is
: <math>l(C) = \sum_{x\in\mathcal{X}} l(C(x))\,\mathbb{P}[X=x].</math>

The code is extended to sequences of source symbols by concatenating code words:
: <math>C(x_1, \ldots, x_k) = C(x_1)C(x_2) \cdots C(x_k).</math>

The code word of the empty string is the empty string itself:
: <math>C(\epsilon) = \epsilon</math>
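The definitions above can be made concrete with a small sketch. The four-symbol source, its probabilities, and the binary code below are assumptions chosen for illustration, not part of the definition; the function names are hypothetical.

```python
from fractions import Fraction

# Hypothetical source distribution P[X = x]; symbols and probabilities are assumptions.
probabilities = {
    "a": Fraction(1, 2),
    "b": Fraction(1, 4),
    "c": Fraction(1, 8),
    "d": Fraction(1, 8),
}

# Hypothetical code C mapping each source symbol to a word over the alphabet {0, 1}.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}


def expected_length(code, probabilities):
    """Return l(C), the sum over x of l(C(x)) * P[X = x]."""
    return sum(len(code[x]) * p for x, p in probabilities.items())


def encode(code, symbols):
    """Extend C to sequences by concatenation; the empty sequence maps to ''."""
    return "".join(code[x] for x in symbols)


average = expected_length(code, probabilities)  # Fraction(7, 4), i.e. 1.75 symbols
encoded = encode(code, "abad")                  # "0" + "10" + "0" + "111"
```

Exact fractions are used so that the expected length is not affected by floating-point rounding.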

Properties

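# <math>C:\mathcal{X}\to\Sigma^*</math> is non-singular if it is injective.
# <math>C:\mathcal{X}^*\to\Sigma^*</math> (the extension of the code to sequences) is uniquely decodable if it is injective.
# <math>C:\mathcal{X}\to\Sigma^*</math> is instantaneous if, for any <math>x_1, x_2 \in \mathcal{X}</math>, <math>C(x_1)</math> is not a proper prefix of <math>C(x_2)</math> (and vice versa).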
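As a rough illustration, the first and third properties can be checked mechanically for a finite code given as a dictionary. The helper names and the three example codes are hypothetical. Testing unique decodability in general requires a more involved procedure (for example the Sardinas–Patterson algorithm) and is not attempted in this sketch.

```python
from itertools import combinations


def is_non_singular(code):
    """C is non-singular if distinct symbols get distinct code words (C is injective)."""
    words = list(code.values())
    return len(set(words)) == len(words)


def is_instantaneous(code):
    """C is instantaneous if no code word is a proper prefix of another code word."""
    for u, v in combinations(code.values(), 2):
        if u != v and (u.startswith(v) or v.startswith(u)):
            return False
    return True


# Hypothetical example codes over the alphabet {0, 1}.
prefix_code = {"a": "0", "b": "10", "c": "11"}       # instantaneous
suffix_code = {"a": "0", "b": "01", "c": "011"}      # non-singular, but "0" prefixes "01"
singular_code = {"a": "0", "b": "0"}                 # two symbols share a code word

checks = {
    "prefix_code": (is_non_singular(prefix_code), is_instantaneous(prefix_code)),
    "suffix_code": (is_non_singular(suffix_code), is_instantaneous(suffix_code)),
    "singular_code": (is_non_singular(singular_code), is_instantaneous(singular_code)),
}
```

With an instantaneous code, a decoder can emit a source symbol as soon as the last symbol of its code word arrives, without looking ahead.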

Principle

Entropy of a source is the measure of information. Basically, source codes try to reduce the redundancy present in the source, and represent the source with fewer bits that carry more information.

Data compression which explicitly tries to minimize the average length of messages according to a particular assumed probability model is called entropy encoding.

Various techniques used by source coding schemes try to achieve the limit of entropy of the source: ''C''(''x'') ≥ ''H''(''x''), where ''H''(''x'') is the entropy of the source (bitrate), and ''C''(''x'') is the bitrate after compression. In particular, no lossless source coding scheme can achieve an average bitrate below the entropy of the source.
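The bound can be checked numerically on a toy source. The sketch below assumes a hypothetical four-symbol source with dyadic probabilities, for which a prefix code meets the entropy exactly; for other distributions the average length of a symbol-by-symbol code generally stays strictly above the entropy.

```python
import math

# Hypothetical source model; the probabilities are an assumption for illustration.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Two hypothetical binary codes for the same source.
fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}
prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}


def entropy(probabilities):
    """Shannon entropy H in bits per source symbol."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)


def average_length(code, probabilities):
    """Average number of code bits per source symbol."""
    return sum(len(code[x]) * p for x, p in probabilities.items())


h = entropy(probabilities)                              # 1.75 bits per symbol
fixed_avg = average_length(fixed_code, probabilities)   # 2.0 bits per symbol
prefix_avg = average_length(prefix_code, probabilities) # 1.75 bits per symbol
```

Both codes satisfy the bound; only the variable-length code reaches it for this particular distribution.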

Example

Facsimile transmission uses a simple run-length code. Source coding removes all data superfluous to the need of the transmitter, decreasing the bandwidth required for transmission.
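A minimal sketch of run-length coding on one black-and-white scan line follows. The scan line and function names are hypothetical, and real facsimile standards additionally represent the run lengths with variable-length code words, a step omitted here.

```python
from itertools import groupby


def rle_encode(pixels):
    """Replace each run of identical symbols by a (symbol, run length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(pixels)]


def rle_decode(runs):
    """Expand (symbol, run length) pairs back into the original string."""
    return "".join(symbol * count for symbol, count in runs)


# Hypothetical scan line: '0' is white, '1' is black.
scan_line = "0000000011100000"
runs = rle_encode(scan_line)   # [("0", 8), ("1", 3), ("0", 5)]
restored = rle_decode(runs)
```

The scheme pays off when runs are long, as on mostly white pages; on rapidly alternating data the list of runs can be longer than the input.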

Channel coding

The purpose of channel coding theory is to find codes which transmit quickly, contain many valid code words and can correct or at least detect many errors. While not mutually exclusive, performance in these areas is a trade-off, so different codes are optimal for different applications. The needed properties of a code mainly depend on the probability of errors happening during transmission. In a typical CD, the impairment is mainly dust or scratches.

CDs use cross-interleaved Reed–Solomon coding to spread the data out over the disk.

Although not a very good code, a simple repeat code can serve as an understandable example. Suppose we take a block of data bits (representing sound) and send it three times. At the receiver we examine the three repetitions bit by bit and take a majority vote. The twist on this is that we do not merely send the bits in order; we interleave them. The block of data bits is first divided into 4 smaller blocks. Then we cycle through the blocks and send one bit from the first, then the second, and so on. This is done three times to spread the data out over the surface of the disk. In the context of the simple repeat code, this may not appear effective. However, there are more powerful codes known which are very effective at correcting the "burst" error of a scratch or a dust spot when this interleaving technique is used.

Other codes are more appropriate for different applications. Deep space communications are limited by the thermal noise of the receiver, which is more of a continuous nature than a bursty nature. Likewise, narrowband modems are limited by the noise present in the telephone network, which is also modeled better as a continuous disturbance. Cell phones are subject to rapid fading: the high frequencies used can cause rapid fading of the signal even if the receiver is moved a few inches. Again there is a class of channel codes designed to combat fading.
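The repeat code with interleaving described in this section can be sketched as follows. The block length of 12 bits, the function names, and the position of the simulated burst are assumptions made for illustration; this is a toy model, not the coding actually used on CDs.

```python
def interleave(bits, depth):
    """Split bits into `depth` equal sub-blocks and read them out round-robin."""
    size = len(bits) // depth
    blocks = [bits[i * size:(i + 1) * size] for i in range(depth)]
    return [blocks[b][i] for i in range(size) for b in range(depth)]


def deinterleave(bits, depth):
    """Undo interleave(): every depth-th bit belongs to the same sub-block."""
    blocks = [bits[b::depth] for b in range(depth)]
    return [bit for block in blocks for bit in block]


def encode(data, depth=4, repeats=3):
    """Interleave the block, then send the interleaved block `repeats` times."""
    return interleave(data, depth) * repeats


def decode(received, depth=4, repeats=3):
    """Majority-vote across the repetitions bit by bit, then deinterleave."""
    n = len(received) // repeats
    copies = [received[r * n:(r + 1) * n] for r in range(repeats)]
    voted = [1 if 2 * sum(column) > repeats else 0 for column in zip(*copies)]
    return deinterleave(voted, depth)


# Hypothetical 12-bit data block (length must be divisible by the depth).
data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
sent = encode(data)

# Simulate a burst (e.g. a scratch) that flips six consecutive transmitted bits.
damaged = list(sent)
for position in range(9, 15):
    damaged[position] ^= 1

recovered = decode(damaged)
```

Because the burst is shorter than one repetition, each data bit is corrupted in at most one of its three copies and the majority vote restores it.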

Linear codes

The term algebraic coding theory denotes the sub-field of coding theory where the properties of codes are expressed in algebraic terms and then further researched.

Algebraic coding theory is basically divided into two major types of codes:
* Linear block codes
* Convolutional codes

It analyzes mainly the following three properties of a code:
* Code word length
* Total number of valid code words
* The minimum distance between two valid code words, using mainly the Hamming distance, sometimes also other distances like the Lee distance

Linear block codes

Linear block codes have the property of linearity, i.e. the sum of any two codewords is also a code word, and they are applied to the source bits in blocks, hence the name linear block codes. There are block codes that are not linear, but it is difficult to prove that a code is a good one without this property.

Linear block codes are summarized by their symbol alphabets (e.g., binary or ternary) and parameters (''n'', ''m'', ''d_min'') where
# ''n'' is the length of the codeword, in symbols,
# ''m'' is the number of source symbols that will be used for encoding at once,
# ''d_min'' is the minimum Hamming distance for the code.

There are many types of linear block codes, such as
# Cyclic codes (e.g., Hamming codes)
# Repetition codes
# Parity codes
# Polynomial codes (e.g., BCH codes)
# Reed–Solomon codes
# Algebraic geometric codes
# Reed–Muller codes
# Perfect codes
# Locally recoverable codes

Block codes are tied to the sphere packing problem, which has received some attention over the years. In two dimensions, it is easy to visualize. Take a bunch of pennies flat on the table and push them together. The result is a hexagon pattern like a bee's nest. But block codes rely on more dimensions which cannot easily be visualized. The powerful (24,12) Golay code used in deep space communications uses 24 dimensions. If used as a binary code (which it usually is) the dimensions refer to the length of the codeword as defined above.

The theory of coding uses the ''N''-dimensional sphere model. For example, how many pennies can be packed into a circle on a tabletop, or in 3 dimensions, how many marbles can be packed into a globe. Other considerations enter the choice of a code. For example, hexagon packing into the constraint of a rectangular box will leave empty space at the corners. As the dimensions get larger, the percentage of empty space grows smaller. But at certain dimensions, the packing uses all the space and these codes are the so-called "perfect" codes. The only nontrivial and useful perfect codes are the distance-3 Hamming codes with parameters satisfying (2^''r'' − 1, 2^''r'' − 1 − ''r'', 3), and the [23,12,7] binary and [11,6,5] ternary Golay codes.

Another code property is the number of neighbors that a single codeword may have. Again, consider pennies as an example. First we pack the pennies in a rectangular grid. Each penny will have 4 near neighbors (and 4 at the corners which are farther away). In a hexagon, each penny will have 6 near neighbors. When we increase the dimensions, the number of near neighbors increases very rapidly. The result is that the number of ways for noise to make the receiver choose a neighbor (hence an error) grows as well. This is a fundamental limitation of block codes, and indeed all codes. It may be harder to cause an error to a single neighbor, but the number of neighbors can be large enough so the total error probability actually suffers.

Properties of linear block codes are used in many applications. For example, the syndrome-coset uniqueness property of linear block codes is used in trellis shaping, one of the best-known shaping codes.
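The parameters and the "perfect" property can be verified by brute force for a small code. The sketch below uses one common systematic generator matrix for the binary (7, 4, 3) Hamming code (the case ''r'' = 3); the choice of matrix and the helper names are assumptions, and other equivalent matrices exist.

```python
from itertools import product

# One common systematic generator matrix of a binary (7, 4) Hamming code (an assumption;
# any matrix whose parity-check columns are the 7 nonzero 3-bit vectors would do).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def encode(message):
    """Multiply the 4-bit message by G over GF(2) to get a 7-bit codeword."""
    return tuple(sum(m * g for m, g in zip(message, column)) % 2 for column in zip(*G))


def add(u, v):
    """Symbol-wise sum of two words over GF(2)."""
    return tuple((a + b) % 2 for a, b in zip(u, v))


n, m = 7, 4
codewords = {encode(msg) for msg in product((0, 1), repeat=m)}

# Linearity: the sum of any two codewords is again a codeword.
is_linear = all(add(u, v) in codewords for u in codewords for v in codewords)

# For a linear code, d_min equals the smallest weight of a nonzero codeword.
d_min = min(sum(c) for c in codewords if any(c))

# Perfect code: radius-1 spheres (1 + n words each) around the codewords fill the space.
is_perfect = len(codewords) * (1 + n) == 2 ** n
```

Here 16 codewords times 8 words per sphere gives exactly the 128 binary words of length 7, so no space is left unused.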

Convolutional codes

The idea behind a convolutional code is to make every codeword symbol be the weighted sum of the various input message symbols. This is like the convolution used in LTI systems to find the output of a system when the input and impulse response are known. The output of a convolutional encoder is thus the convolution of the input bits with the states of the encoder's registers.

Fundamentally, convolutional codes do not offer more protection against noise than an equivalent block code. In many cases, they offer greater simplicity of implementation over a block code of equal power. The encoder is usually a simple circuit which has state memory and some feedback logic, normally XOR gates. The decoder can be implemented in software or firmware.

The Viterbi algorithm is the optimum algorithm used to decode convolutional codes. There are simplifications to reduce the computational load. They rely on searching only the most likely paths. Although not optimum, they have generally been found to give good results in low-noise environments.

Convolutional codes are used in voiceband modems (V.32, V.17, V.34) and in GSM mobile phones, as well as satellite and military communication devices.
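A minimal sketch of such an encoder is shown below. It assumes a feedforward rate-1/2 encoder with constraint length 3 and generator polynomials 7 and 5 in octal, a textbook choice rather than the code of any particular modem or phone standard; the function name is hypothetical and no decoder is included.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Feedforward convolutional encoder: one output bit per generator per input bit.

    The shift register holds the newest input bit in its least significant position.
    Each output bit is the XOR (sum over GF(2)) of the register bits selected by a generator.
    """
    mask = (1 << constraint_length) - 1
    state = 0
    output = []
    for bit in bits:
        state = ((state << 1) | bit) & mask
        for g in generators:
            output.append(bin(state & g).count("1") % 2)
    return output


# A single 1 followed by zeros reveals the impulse response of the encoder.
impulse_response = conv_encode([1, 0, 0])      # [1, 1, 1, 0, 1, 1]

# Hypothetical message; the output has two code bits per message bit.
encoded = conv_encode([1, 0, 1, 1])            # [1, 1, 1, 0, 0, 0, 0, 1]
```

Because the encoder is linear over GF(2), the output for any message is the XOR of shifted copies of the impulse response, which is the convolution the section refers to.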

Cryptographic coding

Cryptography or cryptographic coding is the practice and study of techniques for secure communication in the presence of third parties (called adversaries). More generally, it is about constructing and analyzing protocols that block adversaries; various aspects in information security such as data confidentiality, data integrity, authentication, and non-repudiation are central to modern cryptography. Modern cryptography exists at the intersection of the disciplines of mathematics, computer science, and electrical engineering. Applications of cryptography include ATM cards, computer passwords, and electronic commerce.

Cryptography prior to the modern age was effectively synonymous with ''encryption'', the conversion of information from a readable state to apparent nonsense. The originator of an encrypted message shared the decoding technique needed to recover the original information only with intended recipients, thereby precluding unwanted persons from doing the same. Since World War I and the advent of the computer, the methods used to carry out cryptology have become increasingly complex and its application more widespread.

Modern cryptography is heavily based on mathematical theory and computer science practice; cryptographic algorithms are designed around computational hardness assumptions, making such algorithms hard to break in practice by any adversary. It is theoretically possible to break such a system, but it is infeasible to do so by any known practical means. These schemes are therefore termed computationally secure; theoretical advances (e.g., improvements in integer factorization algorithms) and faster computing technology require these solutions to be continually adapted. There exist information-theoretically secure schemes that cannot be broken even with unlimited computing power (an example is the one-time pad), but these schemes are more difficult to implement than the best theoretically breakable but computationally secure mechanisms.
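The one-time pad mentioned above is simple to write down, which makes the practical difficulty easier to see. The sketch below is illustrative only and assumes the key is truly random, exactly as long as the message, shared in advance over a secure channel, and never reused; the function name is hypothetical.

```python
import secrets


def otp_xor(data: bytes, key: bytes) -> bytes:
    """XOR data with a key of equal length; the same call encrypts and decrypts."""
    if len(key) != len(data):
        raise ValueError("a one-time pad key must be exactly as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"coding theory"
key = secrets.token_bytes(len(message))   # fresh random key, used once
ciphertext = otp_xor(message, key)
recovered = otp_xor(ciphertext, key)      # XOR with the same key undoes the encryption
```

The key material is as large as all the traffic it protects and must be distributed securely beforehand, which is the implementation burden the section alludes to.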

Line coding

A line code (also called digital baseband modulation or digital baseband transmission method) is a code chosen for use within a communications system for baseband transmission purposes. Line coding is often used for digital data transport. It consists of representing the digital signal to be transported by an amplitude- and time-discrete signal that is optimally tuned for the specific properties of the physical channel (and of the receiving equipment). The waveform pattern of voltage or current used to represent the 1s and 0s of digital data on a transmission link is called ''line encoding''. The common types of line encoding are unipolar, polar, bipolar, and Manchester encoding.
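The four families can be sketched as mappings from bits to signal levels. The levels +1, 0, and −1 are normalized stand-ins for voltages, the bipolar variant shown is alternate mark inversion, and the Manchester polarity follows the IEEE 802.3 convention; all of these are assumptions, since real systems differ in levels and polarity.

```python
def unipolar(bits):
    """Unipolar NRZ: 1 -> positive level, 0 -> zero level."""
    return [1 if b else 0 for b in bits]


def polar(bits):
    """Polar NRZ: 1 -> positive level, 0 -> negative level."""
    return [1 if b else -1 for b in bits]


def bipolar_ami(bits):
    """Bipolar AMI: 0 -> zero level, successive 1s alternate between +1 and -1."""
    level, signal = 1, []
    for b in bits:
        if b:
            signal.append(level)
            level = -level
        else:
            signal.append(0)
    return signal


def manchester(bits):
    """Manchester (IEEE 802.3 polarity assumed): two half-bit levels per bit.

    1 -> low then high, 0 -> high then low, so every bit has a mid-bit transition.
    """
    signal = []
    for b in bits:
        signal.extend((-1, 1) if b else (1, -1))
    return signal


bits = [1, 0, 1, 1, 0]
waveforms = {
    "unipolar": unipolar(bits),
    "polar": polar(bits),
    "bipolar": bipolar_ami(bits),
    "manchester": manchester(bits),
}
```

The guaranteed mid-bit transition in Manchester encoding helps the receiver recover the clock, at the cost of doubling the signaling rate.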

Other applications of coding theory

Another concern of coding theory is designing codes that help synchronization. A code may be designed so that a phase shift can be easily detected and corrected and that multiple signals can be sent on the same channel.

Another application of codes, used in some mobile phone systems, is code-division multiple access (CDMA). Each phone is assigned a code sequence that is approximately uncorrelated with the codes of other phones. When transmitting, the code word is used to modulate the data bits representing the voice message. At the receiver, a demodulation process is performed to recover the data. The properties of this class of codes allow many users (with different codes) to use the same radio channel at the same time. To the receiver, the signals of other users will appear to the demodulator only as a low-level noise.

Another general class of codes are the automatic repeat-request (ARQ) codes. In these codes the sender adds redundancy to each message for error checking, usually by adding check bits. If the check bits are not consistent with the rest of the message when it arrives, the receiver will ask the sender to retransmit the message. All but the simplest wide area network protocols use ARQ. Common protocols include SDLC (IBM), TCP (Internet), X.25 (International) and many others. There is an extensive field of research on this topic because of the problem of matching a rejected packet against a new packet. Is it a new one or is it a retransmission? Typically numbering schemes are used, as in TCP.
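The ARQ idea can be sketched as a toy stop-and-wait loop. The frame layout (a one-byte sequence number, the payload, then a CRC-32 as check bits), the function names, and the simulated channel are all assumptions for illustration; they do not reproduce the frame format of SDLC, TCP, or X.25.

```python
import zlib


def make_frame(seq: int, payload: bytes) -> bytes:
    """Prefix a sequence number and append CRC-32 check bits."""
    body = bytes([seq & 0xFF]) + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return (seq, payload) if the check bits are consistent, else None."""
    if len(frame) < 5:
        return None
    body, check = frame[:-4], frame[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != check:
        return None
    return body[0], body[1:]


def make_flaky_channel(corrupt_first: int):
    """Hypothetical channel that flips one bit in the first `corrupt_first` frames."""
    sent = 0

    def channel(frame: bytes) -> bytes:
        nonlocal sent
        sent += 1
        if sent <= corrupt_first:
            damaged = bytearray(frame)
            damaged[1] ^= 0x01
            return bytes(damaged)
        return frame

    return channel


def send_with_arq(payload: bytes, channel, seq: int = 0, max_attempts: int = 5):
    """Retransmit until the receiver's check succeeds; return (result, attempts)."""
    for attempt in range(1, max_attempts + 1):
        result = check_frame(channel(make_frame(seq, payload)))
        if result is not None:
            return result, attempt
    raise RuntimeError("giving up after repeated corruption")


(result_seq, result_payload), attempts = send_with_arq(
    b"hello", make_flaky_channel(corrupt_first=1), seq=7
)
```

The sequence number carried in each frame is what lets a receiver tell a retransmission from a new message, which is the matching problem raised above.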

Group testing

Group testing uses codes in a different way. Consider a large group of items in which a very few are different in a particular way (e.g., defective products or infected test subjects). The idea of group testing is to determine which items are "different" by using as few tests as possible. The problem has its roots in the Second World War, when the United States Army Air Forces needed to test its soldiers for syphilis.
Analog coding

Information is encoded analogously in the neural networks of brains, in analog signal processing, and analog electronics. Aspects of analog coding include analog error correction, analog data compression and analog encryption.

Neural coding

Neural coding is a neuroscience-related field concerned with how sensory and other information is represented in the brain by networks of neurons. The main goal of studying neural coding is to characterize the relationship between the stimulus and the individual or ensemble neuronal responses and the relationship among electrical activity of the neurons in the ensemble. It is thought that neurons can encode both digital and analog information, and that neurons follow the principles of information theory and compress information, and detect and correct errors in the signals that are sent throughout the brain and wider nervous system.
