In the statistical theory of estimation, the German tank problem consists of estimating the maximum of a discrete uniform distribution from sampling without replacement. In simple terms, suppose there exists an unknown number of items which are sequentially numbered from 1 to ''N''. A random sample of these items is taken and their sequence numbers observed; the problem is to estimate ''N'' from these observed numbers. The problem can be approached using either frequentist inference or Bayesian inference, leading to different results. Estimating the population maximum based on a ''single'' sample yields divergent results, whereas estimation based on ''multiple'' samples is a practical estimation question whose answer is simple (especially in the frequentist setting) but not obvious (especially in the Bayesian setting).

The problem is named after its historical application by Allied forces in World War II to the estimation of the monthly rate of German tank production from very limited data. This exploited the manufacturing practice of assigning and attaching ascending sequences of serial numbers to tank components (chassis, gearbox, engine, wheels), with some of the tanks eventually being captured in battle by Allied forces.


Suppositions

The adversary is presumed to have manufactured a series of tanks marked with consecutive whole numbers, beginning with serial number 1. Additionally, regardless of a tank's date of manufacture, history of service, or the serial number it bears, the distribution over serial numbers becoming revealed to analysis is uniform, up to the point in time when the analysis is conducted.


Example

Assuming tanks are assigned sequential serial numbers starting with 1, suppose that four tanks are captured and that they have the serial numbers: 19, 40, 42 and 60.

The ''frequentist'' approach predicts the total number of tanks produced will be:
:N \approx 74
The ''Bayesian'' approach predicts that the median number of tanks produced will be very similar to the frequentist prediction:
:N_{\text{med}} \approx 74.5
whereas the Bayesian mean predicts that the number of tanks produced would be:
:N_{\text{av}} \approx 89

Let ''N'' equal the total number of tanks predicted to have been produced, ''m'' equal the highest serial number observed and ''k'' equal the number of tanks captured. Here ''m'' = 60 and ''k'' = 4.

The frequentist prediction is calculated as:
:N \approx m + \frac{m}{k} - 1 = 74
The Bayesian median is approximated by:
:N_{\text{med}} \approx m + \frac{m \ln 2}{k - 1} \approx 73.9
which is close to the value of about 74.5 at which the exact distribution given below crosses one half (between ''n'' = 74 and ''n'' = 75).

The Bayesian mean is calculated as:
:N_{\text{av}} = (m - 1)\frac{k - 1}{k - 2} = 88.5 \approx 89

Both Bayesian computations are based on the following probability mass function:
:\Pr(N=n) = \begin{cases} 0 &\text{if } n < m \\ \frac{k - 1}{k} \frac{\binom{m - 1}{k - 1}}{\binom{n}{k}} &\text{if } n \ge m. \end{cases}
This distribution has a positive skewness, related to the fact that there are at least 60 tanks. Because of this skewness, the mean may not be the most meaningful estimate. The median in this example is 74.5, in close agreement with the frequentist formula.

Using Stirling's approximation, the Bayesian probability function may be approximated as
:\Pr(N=n) \approx \begin{cases} 0 &\text{if } n < m \\ (k - 1)m^{k-1}n^{-k} &\text{if } n \ge m, \end{cases}
which results in the following approximation for the median:
:N_{\text{med}} \approx m + \frac{m \ln 2}{k - 1}

Finally, the average estimate by Bayesians, and its deviation, are computed as:
:\begin{align} N &\approx \mu \pm \sigma = 89 \pm 50, \\ \mu &= (m - 1)\frac{k - 1}{k - 2}, \\ \sigma &= \sqrt{\frac{(k - 1)(m - 1)(m - k + 1)}{(k - 3)(k - 2)^2}}. \end{align}
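The numbers in this example can be reproduced with a short Python sketch. The function names are illustrative and not part of the original article. The median function returns the smallest integer whose cumulative posterior probability reaches one half, which is an assumed convention: for this sample it returns 75, and the 74.5 quoted above lies between the two integers where the cumulative probability crosses one half.

```python
from fractions import Fraction
from math import comb, sqrt


def frequentist_estimate(m, k):
    """Frequentist point estimate m + m/k - 1."""
    return m + Fraction(m, k) - 1


def posterior_pmf(n, m, k):
    """Pr(N = n) = (k-1)/k * C(m-1, k-1) / C(n, k) for n >= m, else 0."""
    if n < m:
        return Fraction(0)
    return Fraction(k - 1, k) * Fraction(comb(m - 1, k - 1), comb(n, k))


def posterior_mean(m, k):
    """Bayesian mean (m-1)(k-1)/(k-2), finite for k >= 3."""
    return Fraction((m - 1) * (k - 1), k - 2)


def posterior_sd(m, k):
    """Bayesian standard deviation, finite for k >= 4."""
    return sqrt((k - 1) * (m - 1) * (m - k + 1) / ((k - 3) * (k - 2) ** 2))


def posterior_median(m, k):
    """Smallest n with Pr(N <= n) >= 1/2 (assumed convention)."""
    cumulative = Fraction(0)
    n = m
    while True:
        cumulative += posterior_pmf(n, m, k)
        if cumulative >= Fraction(1, 2):
            return n
        n += 1


serials = [19, 40, 42, 60]
m, k = max(serials), len(serials)
print("frequentist:", float(frequentist_estimate(m, k)))
print("Bayesian mean:", float(posterior_mean(m, k)))
print("Bayesian sd:", posterior_sd(m, k))
print("Bayesian median (integer):", posterior_median(m, k))
```

Exact fractions are used for the probabilities so that the cumulative sum is not affected by floating-point rounding near one half.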


Historical example of the problem

During the course of the Second World War, the Western Allies made sustained efforts to determine the extent of German production and approached this in two major ways: conventional intelligence gathering and statistical estimation. In many cases, statistical analysis substantially improved on conventional intelligence. In some cases, conventional intelligence was used in conjunction with statistical methods, as was the case in estimation of Panther tank production just prior to D-Day.

The Allied command structure had thought the Panzer V (Panther) tanks seen in Italy, with their high velocity, long-barreled 75 mm/L70 guns, were unusual heavy tanks and would only be seen in northern France in small numbers, much the same way as the Tiger I was seen in Tunisia. The US Army was confident that the Sherman tank would continue to perform well, as it had versus the Panzer III and Panzer IV tanks in North Africa and Sicily. Shortly before D-Day, rumors indicated that large numbers of Panzer V tanks were being used.

To determine whether this was true, the Allies attempted to estimate the number of tanks being produced. To do this, they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell in two unbroken sequences. Chassis and engine numbers were also used, though their use was more complicated. Various other components were used to cross-check the analysis. Similar analyses were done on wheels, which were observed to be sequentially numbered (i.e., 1, 2, 3, ..., ''N'').

The analysis of tank wheels yielded an estimate for the number of wheel molds that were in use. A discussion with British road wheel makers then estimated the number of wheels that could be produced from this many molds, which yielded the number of tanks that were being produced each month. Analysis of wheels from two tanks (32 road wheels each, 64 road wheels total) yielded an estimate of 270 tanks produced in February 1944, substantially more than had previously been suspected. German records after the war showed production for the month of February 1944 was 276. The statistical approach proved to be far more accurate than conventional intelligence methods, and the phrase "German tank problem" became accepted as a descriptor for this type of statistical analysis.

Estimating production was not the only use of this serial-number analysis. It was also used to understand German production more generally, including number of factories, relative importance of factories, length of supply chain (based on lag between production and use), changes in production, and use of resources such as rubber.


Specific data

According to conventional Allied intelligence estimates, the Germans were producing around 1,400 tanks a month between June 1940 and September 1942. Applying the formula below to the serial numbers of captured tanks, the number was calculated to be 246 a month. After the war, captured German production figures from the ministry of Albert Speer showed the actual number to be 245. Estimates for some specific months are given as:


Similar analyses

Similar serial-number analysis was used for other military equipment during World War II, most successfully for the V-2 rocket. Factory markings on Soviet military equipment were analyzed during the Korean War, and by German intelligence during World War II. In the 1980s, some Americans were given access to the production line of Israel's Merkava tanks. The production numbers were classified, but the tanks had serial numbers, allowing estimation of production. The formula has been used in non-military contexts, for example to estimate the number of Commodore 64 computers built, where the result (12.5 million) matches the low-end estimates.


Countermeasures

To confound serial-number analysis, serial numbers can be excluded, or usable auxiliary information reduced. Alternatively, serial numbers that resist cryptanalysis can be used, most effectively by randomly choosing numbers without replacement from a list that is much larger than the number of objects produced, or by producing random numbers and checking them against the list of already assigned numbers; collisions are likely to occur unless the number of digits possible is more than twice the number of digits in the number of objects produced (where the serial number can be in any base); see birthday problem. For this, a cryptographically secure pseudorandom number generator may be used. All these methods require a lookup table (or breaking the cipher) to back out from serial number to production order, which complicates use of serial numbers: a range of serial numbers cannot be recalled, for instance, but each must be looked up individually, or a list generated.

Alternatively, sequential serial numbers can be encrypted with a simple substitution cipher, which allows easy decoding, but is also easily broken by frequency analysis: even if starting from an arbitrary point, the plaintext has a pattern (namely, numbers are in sequence). One example is given in Ken Follett's novel ''Code to Zero'', where the serial numbers of the Jupiter-C rockets are encrypted by replacing each digit with a letter. The code word here is Huntsville (with repeated letters omitted), whose letters H, U, N, T, S, V, I, L, E stand for the digits 1 through 9, extended by one more letter to get a 10-letter key. The rocket number 13 was therefore "HN", and the rocket number 24 was "UT".
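The digit substitution described above can be sketched in a few lines of Python. The function names are illustrative. The tenth key letter, which stands for the digit 0, is not given in the text above, so the filler letter "X" used here is an assumption made only to complete the key.

```python
def build_key(code_word, filler="X"):
    """Drop repeated letters from the code word, then pad to 10 letters.

    The filler letter is an assumption; the text only states that the
    key has 10 letters.
    """
    key = []
    for ch in code_word.upper() + filler.upper():
        if ch not in key and len(key) < 10:
            key.append(ch)
    if len(key) != 10:
        raise ValueError("code word and filler must yield 10 distinct letters")
    return "".join(key)


def encode(number, key):
    """Replace each digit: 1..9 map to key[0..8], 0 maps to key[9]."""
    return "".join(key[(int(d) - 1) % 10] for d in str(number))


def decode(text, key):
    """Invert encode() by looking each letter up in the key."""
    return int("".join(str((key.index(ch) + 1) % 10) for ch in text))


key = build_key("Huntsville")
for serial in (13, 24):
    print(serial, "->", encode(serial, key))
```

Because consecutive serial numbers produce ciphertexts that differ in a regular way, an analyst who collects several encoded numbers can recover the digit order, which is the weakness noted above.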


Frequentist analysis


Minimum-variance unbiased estimator

For point estimation (estimating a single value for the total, \widehat{N}), the minimum-variance unbiased estimator (MVUE, or UMVU estimator) is given by:
:\widehat{N} = m(1 + k^{-1}) - 1,
where ''m'' is the largest serial number observed (sample maximum) and ''k'' is the number of tanks observed (sample size). Note that once a serial number has been observed, it is no longer in the pool and will not be observed again.

This has a variance
:\operatorname{var}\left(\widehat{N}\right) = \frac{1}{k}\frac{(N - k)(N + 1)}{(k + 2)} \approx \frac{N^2}{k^2} \text{ for small samples } k \ll N,
so the standard deviation is approximately ''N''/''k'', the expected size of the gap between sorted observations in the sample.

The formula may be understood intuitively as the sample maximum plus the average gap between observations in the sample. The sample maximum is chosen as the initial estimator because it is the maximum likelihood estimator, and the gap is added to compensate for the negative bias of the sample maximum as an estimator for the population maximum. It is written as
:\widehat{N} = m + \frac{m - k}{k} = m + mk^{-1} - 1 = m(1 + k^{-1}) - 1.
This can be visualized by imagining that the observations in the sample are evenly spaced throughout the range, with additional observations just outside the range at 0 and ''N'' + 1. If starting with an initial gap between 0 and the lowest observation in the sample (the sample minimum), the average gap between consecutive observations in the sample is (''m'' − ''k'')/''k''; the ''k'' is subtracted because the observations themselves are not counted in computing the gap between observations. A derivation of the expected value and the variance of the sample maximum is shown in the page of the discrete uniform distribution.

This philosophy is formalized and generalized in the method of maximum spacing estimation; a similar heuristic is used for plotting position in a Q–Q plot, plotting the ''i''-th of ''n'' sample points at ''i''/(''n'' + 1), which is evenly spaced on the uniform distribution, with a gap at the end.
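The unbiasedness and the variance of the estimator m(1 + 1/k) − 1 can be checked exactly for small populations by enumerating every possible sample without replacement. The following Python sketch does this with exact fractions; the function names and the population sizes tried are illustrative choices, not taken from the article.

```python
from fractions import Fraction
from itertools import combinations


def mvue(sample):
    """Estimator m + m/k - 1 from the sample maximum m and sample size k."""
    m, k = max(sample), len(sample)
    return m + Fraction(m, k) - 1


def exact_moments(N, k):
    """Mean and variance of the estimator over all size-k subsets of 1..N."""
    values = [mvue(s) for s in combinations(range(1, N + 1), k)]
    mean = sum(values, Fraction(0)) / len(values)
    variance = sum(((v - mean) ** 2 for v in values), Fraction(0)) / len(values)
    return mean, variance


def variance_formula(N, k):
    """Closed form (1/k) (N - k)(N + 1) / (k + 2)."""
    return Fraction((N - k) * (N + 1), k * (k + 2))


for N, k in [(10, 3), (15, 4)]:
    mean, variance = exact_moments(N, k)
    print(N, k, mean, variance, variance_formula(N, k))
```

Every subset is equally likely under sampling without replacement, so a plain average over all subsets gives the exact expectation.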


Confidence intervals

Instead of, or in addition to, ''point'' estimation, ''interval'' estimation can be carried out, such as confidence intervals. These are easily computed, based on the observation that the probability that ''k'' observations in the sample will fall in an interval covering ''p'' of the range (0 ≤ ''p'' ≤ 1) is ''p''^''k'' (assuming in this section that draws are ''with'' replacement, to simplify computations; if draws are without replacement, this overstates the likelihood, and intervals will be overly conservative).

Thus the sampling distribution of the quantile of the sample maximum is the graph ''x''^(1/''k'') from 0 to 1: the ''p''-th to ''q''-th quantiles of the sample maximum ''m'' form the interval [''p''^(1/''k'')''N'', ''q''^(1/''k'')''N'']. Inverting this yields the corresponding confidence interval for the population maximum of [''m''/''q''^(1/''k''), ''m''/''p''^(1/''k'')].

For example, taking the symmetric 95% interval ''p'' = 2.5% and ''q'' = 97.5% for ''k'' = 5 yields 0.025^(1/5) ≈ 0.48 and 0.975^(1/5) ≈ 0.995, so the confidence interval is approximately [1.005''m'', 2.09''m'']. The lower bound is very close to ''m'', thus more informative is the asymmetric confidence interval from ''p'' = 5% to 100%; for ''k'' = 5 this yields 0.05^(1/5) ≈ 0.55 and the interval [''m'', 1.82''m''].

More generally, the (downward biased) 95% confidence interval is [''m'', ''m''/0.05^(1/''k'')] = [''m'', ''m''·20^(1/''k'')]. For a range of ''k'' values, with the UMVU point estimator (plus 1 for legibility) for reference, this yields:

Immediate observations are:
* For small sample sizes, the confidence interval is very wide, reflecting great uncertainty in the estimate.
* The range shrinks rapidly, reflecting the exponentially decaying probability that ''all'' observations in the sample will be significantly below the maximum.
* The confidence interval exhibits positive skew, as ''N'' can never be below the sample maximum, but can potentially be arbitrarily high above it.

Note that ''m''/''k'' (or rather (''m'' + ''m''/''k'' − 1)/''k'') cannot be used naively as an estimate of the standard error ''SE'', as the standard error of an estimator is based on the ''population'' maximum (a parameter), and using an estimate to estimate the error in that very estimate is circular reasoning.
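The one-sided 95% interval from ''m'' to ''m''·20^(1/''k'') can be tabulated directly. The Python sketch below prints, for a few sample sizes, the factor multiplying ''m'' in the point estimate plus one, (1 + 1/''k''), next to the upper interval factor. The particular ''k'' values and the function names are illustrative assumptions rather than a reproduction of any table in the article.

```python
def point_estimate_factor(k):
    """(N_hat + 1) / m for the UMVU estimator: 1 + 1/k."""
    return 1 + 1 / k


def upper_factor(k, confidence=0.95):
    """Upper end of the one-sided interval, as a multiple of m.

    Assumes draws with replacement, as in the section above.
    """
    return (1 - confidence) ** (-1 / k)


print(f"{'k':>3} {'point estimate':>15} {'95% interval':>20}")
for k in (1, 2, 5, 10, 20):
    estimate = f"{point_estimate_factor(k):.2f}m"
    interval = f"[m, {upper_factor(k):.2f}m]"
    print(f"{k:>3} {estimate:>15} {interval:>20}")
```

The upper factor falls from 20 at ''k'' = 1 to about 1.82 at ''k'' = 5, which illustrates how quickly the interval narrows as more serial numbers are observed.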


Bayesian analysis

The Bayesian approach to the German tank problem is to consider the credibility (N=n\mid M=m, K=k) that the number of enemy tanks N is equal to the number n, when the number of observed tanks K is equal to the number k, and the maximum observed serial number M is equal to the number m. The answer to this problem depends on the choice of prior for N. One can proceed using a proper prior, e.g., the Poisson or negative binomial distribution, for which closed-form expressions for the posterior mean and posterior variance can be obtained. An alternative is to proceed using direct calculations as shown below. For brevity, in what follows, (N=n\mid M=m,K=k) is written (n\mid m,k).


Conditional probability

The rule for conditional probability gives
:(n\mid m,k)(m\mid k) = (m\mid n,k)(n\mid k) = (m,n\mid k)


Probability of ''M'' knowing ''N'' and ''K''

The expression
:(m\mid n,k) = (M=m\mid N=n,K=k)
is the conditional probability that the maximum serial number observed, M, is equal to m, when the number of enemy tanks, N, is known to be equal to n, and the number of enemy tanks observed, K, is known to be equal to k. It is
:(m\mid n,k) = \binom{m - 1}{k - 1}\binom{n}{k}^{-1} [k \le m][m \le n]
where \binom n k is a binomial coefficient and [k \le n] is an Iverson bracket.

The expression can be derived as follows: (m\mid n,k) answers the question: "What is the probability of a specific serial number m being the highest number observed in a sample of k tanks, given there are n tanks in total?" One can think of the sample of size k as the result of k individual draws. Assume m is observed on draw number d. The probability of this occurring is:
:\underbrace{\frac{m-1}{n} \cdot \frac{m-2}{n-1} \cdots \frac{m-d+1}{n-d+2}}_{d-1 \text{ draws}} \cdot \underbrace{\frac{1}{n-d+1}}_{\text{draw no. } d} \cdot \underbrace{\frac{m-d}{n-d} \cdot \frac{m-d-1}{n-d-1} \cdots \frac{m-k+1}{n-k+1}}_{k-d \text{ draws}} = \frac{(n-k)!}{n!} \cdot \frac{(m-1)!}{(m-k)!}.
As can be seen from the right-hand side, this expression is independent of d and therefore the same for each d \le k. As m can be drawn on k different draws, the probability of any specific m being the largest one observed is k times the above probability:
:(m\mid n,k) = k \cdot \frac{(n-k)!}{n!} \cdot \frac{(m-1)!}{(m-k)!} = \binom{m - 1}{k - 1}\binom{n}{k}^{-1}.
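The closed form C(m−1, k−1)/C(n, k) for the probability that the sample maximum equals m can be compared against a brute-force count over all samples of size k drawn without replacement from 1..n. The Python sketch below does this for a small population; the function names and the values of n and k are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def max_pmf(m, n, k):
    """Closed form: C(m-1, k-1) / C(n, k) when k <= m <= n, else 0."""
    if not (k <= m <= n):
        return Fraction(0)
    return Fraction(comb(m - 1, k - 1), comb(n, k))


def max_pmf_by_enumeration(m, n, k):
    """Fraction of size-k subsets of 1..n whose largest element is m."""
    samples = list(combinations(range(1, n + 1), k))
    hits = sum(1 for s in samples if max(s) == m)
    return Fraction(hits, len(samples))


n, k = 8, 3
for m in range(1, n + 1):
    print(m, max_pmf(m, n, k), max_pmf_by_enumeration(m, n, k))
```

The count works because a sample has maximum m exactly when it contains m together with k − 1 of the m − 1 smaller serial numbers.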


Probability of ''M'' knowing only ''K''

The expression (m\mid k) = (M=m\mid K=k) is the probability that the maximum serial number is equal to m once k tanks have been observed but before the serial numbers have actually been observed. The expression (m\mid k) can be re-written in terms of the other quantities by marginalizing over all possible n.
:\begin{align} (m\mid k) &= (m\mid k)\cdot 1 \\ &= (m\mid k)\sum_{n=0}^\infty (n\mid m,k) \\ &= (m\mid k)\sum_{n=0}^\infty (m\mid n,k)\frac{(n\mid k)}{(m\mid k)} \\ &= \sum_{n=0}^\infty (m\mid n,k)(n\mid k) \end{align}


Credibility of ''N'' knowing only ''K''

The expression
:(n\mid k) = (N=n\mid K=k)
is the credibility that the total number of tanks, N, is equal to n when the number K of tanks observed is known to be k, but before the serial numbers have been observed. Assume that it is some discrete uniform distribution
:(n\mid k) = (\Omega - k)^{-1} [k \le n][n < \Omega]
The upper limit \Omega must be finite, because the function
:f(n) = \lim_{\Omega \to \infty} (\Omega - k)^{-1} [k \le n][n < \Omega] = 0
is not a mass distribution function.


Credibility of ''N'' knowing ''M'' and ''K''

:(n\mid m,k) = (m\mid n,k)\left(\sum_{n=m}^{\Omega - 1} (m\mid n,k)\right)^{-1} [m \le n][n < \Omega]
If ''k'' ≥ 2, then \sum_{n=m}^\infty (m\mid n,k) < \infty, and the unwelcome variable \Omega disappears from the expression.
:(n\mid m,k) = (m\mid n,k)\left(\sum_{n=m}^{\infty} (m\mid n,k)\right)^{-1} [m \le n]
For ''k'' ≥ 1 the mode of the distribution of the number of enemy tanks is ''m''.

For ''k'' ≥ 2, the credibility that the number of enemy tanks is ''equal to'' n, is
:(N=n\mid m,k) = (k - 1)\binom{m - 1}{k - 1} k^{-1} \binom{n}{k}^{-1} [m \le n]
The credibility that the number of enemy tanks, ''N'', is ''greater than n'', is
:(N>n\mid m,k) = \begin{cases} 1 &\text{if } n < m \\ \frac{\binom{m - 1}{k - 1}}{\binom{n}{k - 1}} &\text{if } n \ge m \end{cases}
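The two closed forms above, the credibility that ''N'' equals n and the credibility that ''N'' exceeds n, should agree with each other: one minus the cumulative sum of the first must equal the second. The Python sketch below checks this with exact fractions; the function names and the test values m = 60, k = 4 are illustrative.

```python
from fractions import Fraction
from math import comb


def credibility_equal(n, m, k):
    """(N = n | m, k) = (k-1)/k * C(m-1, k-1) / C(n, k) for n >= m."""
    if n < m:
        return Fraction(0)
    return Fraction(k - 1, k) * Fraction(comb(m - 1, k - 1), comb(n, k))


def credibility_greater(n, m, k):
    """(N > n | m, k) = C(m-1, k-1) / C(n, k-1) for n >= m, else 1."""
    if n < m:
        return Fraction(1)
    return Fraction(comb(m - 1, k - 1), comb(n, k - 1))


def greater_from_partial_sum(n, m, k):
    """1 minus the cumulative credibility of N = m, ..., n."""
    total = sum((credibility_equal(j, m, k) for j in range(m, n + 1)), Fraction(0))
    return 1 - total


m, k = 60, 4
for n in (59, 60, 74, 75, 100):
    print(n, float(credibility_greater(n, m, k)),
          float(greater_from_partial_sum(n, m, k)))
```

The printed pairs coincide, and the first function is largest at n = m, in line with the statement that the mode is ''m''.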


Mean value and standard deviation

For ''k'' ≥ 3, ''N'' has the finite mean value:
:(m - 1)(k - 1)(k - 2)^{-1}
For ''k'' ≥ 4, ''N'' has the finite standard deviation:
:(k - 1)^{1/2}(k - 2)^{-1}(k - 3)^{-1/2}(m - 1)^{1/2}(m + 1 - k)^{1/2}
These formulas are derived below.


Summation formula

The following binomial coefficient identity is used below for simplifying series relating to the German tank problem.
:\sum_{n=m}^\infty \frac{1}{\binom{n}{k}} = \frac{k}{k - 1}\frac{1}{\binom{m - 1}{k - 1}}
This sum formula is somewhat analogous to the integral formula
:\int_{n=m}^\infty \frac{dn}{n^k} = \frac{1}{k - 1}\frac{1}{m^{k-1}}
These formulas apply for ''k'' > 1.
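The summation formula can be checked through its partial sums. The sketch below relies on the telescoping relation 1/C(n, k) = k/(k−1) · [1/C(n−1, k−1) − 1/C(n, k−1)], which is not stated in the text above but follows from the definition of the binomial coefficient; summing it from m to an upper limit leaves a remainder term that vanishes as the limit grows. Function names are illustrative.

```python
from fractions import Fraction
from math import comb


def partial_sum(m, k, upper):
    """Direct sum of 1 / C(n, k) for n = m, ..., upper."""
    return sum((Fraction(1, comb(n, k)) for n in range(m, upper + 1)), Fraction(0))


def partial_sum_closed_form(m, k, upper):
    """Telescoped form: k/(k-1) * (1/C(m-1, k-1) - 1/C(upper, k-1))."""
    return Fraction(k, k - 1) * (
        Fraction(1, comb(m - 1, k - 1)) - Fraction(1, comb(upper, k - 1))
    )


def infinite_sum(m, k):
    """Limit of the partial sums: k/(k-1) * 1/C(m-1, k-1), valid for k > 1."""
    return Fraction(k, k - 1) * Fraction(1, comb(m - 1, k - 1))


m, k = 60, 4
for upper in (100, 1000, 10000):
    gap = infinite_sum(m, k) - partial_sum_closed_form(m, k, upper)
    print(upper, float(gap))
```

The printed gap shrinks toward zero, as expected for k > 1.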


One tank

Observing one tank at random out of a population of ''n'' tanks gives the serial number ''m'' with probability 1/''n'' for ''m'' ≤ ''n'', and zero probability for ''m'' > ''n''. Using Iverson bracket notation this is written
:(M=m\mid N=n,K=1) = (m\mid n) = \frac{[m \le n]}{n}
This is the conditional probability mass distribution function of ''m''. When considered a function of ''n'' for fixed ''m'' this is a likelihood function.
:\mathcal{L}(n) = \frac{[n \ge m]}{n}
The maximum likelihood estimate for the total number of tanks is N_0 = ''m'', clearly a biased estimate since the true number can be more than this, potentially many more, but cannot be fewer.

The marginal likelihood (i.e. marginalized over all models) is infinite, being a tail of the harmonic series.
:\sum_n \mathcal{L}(n) = \sum_{n=m}^\infty \frac{1}{n} = \infty
but restricting the sum to ''n'' < \Omega, where \Omega is a prior upper limit on the number of tanks, gives a finite value
:\begin{align} \sum_n \mathcal{L}(n)[n < \Omega] &= \sum_{n=m}^{\Omega-1} \frac{1}{n} \\[6pt] &= H_{\Omega-1} - H_{m-1} \end{align}
where H_n is the harmonic number. The credibility mass distribution function depends on the prior limit \Omega:
:\begin{align} &(N=n\mid M=m,K=1) \\[6pt] = &(n\mid m) = \frac{[m \le n]}{n} \frac{[n < \Omega]}{H_{\Omega-1} - H_{m-1}} \end{align}
The mean value of N is
:\begin{align} \sum_n n\cdot(n\mid m) &= \sum_{n=m}^{\Omega-1} \frac{1}{H_{\Omega-1} - H_{m-1}} \\[6pt] &= \frac{\Omega - m}{H_{\Omega-1} - H_{m-1}} \\[6pt] &\approx \frac{\Omega - m}{\log\left(\frac{\Omega-1}{m-1}\right)} \end{align}
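The dependence on the prior limit can be made concrete with a small numerical sketch. The helper names, the observed serial number 60, and the prior limit 1000 below are all assumptions chosen for illustration; the code simply normalises 1/''n'' over ''m'' ≤ ''n'' < Ω and compares the resulting mean with the closed form (Ω - ''m'')/(H_{Ω-1} - H_{m-1}).

```python
from fractions import Fraction


def harmonic(n: int) -> Fraction:
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n (H_0 = 0)."""
    return sum((Fraction(1, i) for i in range(1, n + 1)), Fraction(0))


def one_tank_posterior(m: int, omega: int) -> dict:
    """Posterior (n | m) on m <= n < omega after observing a single serial m."""
    norm = harmonic(omega - 1) - harmonic(m - 1)
    return {n: Fraction(1, n) / norm for n in range(m, omega)}


def one_tank_mean(m: int, omega: int) -> Fraction:
    """Closed-form posterior mean (omega - m) / (H_{omega-1} - H_{m-1})."""
    return Fraction(omega - m) / (harmonic(omega - 1) - harmonic(m - 1))


if __name__ == "__main__":
    m, omega = 60, 1000  # illustrative values, not taken from the article
    posterior = one_tank_posterior(m, omega)
    direct_mean = sum(n * p for n, p in posterior.items())
    print(float(direct_mean), float(one_tank_mean(m, omega)))
```

Because the mean grows without bound as Ω increases, a single observation does not yield an estimate that is independent of the prior limit.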


Two tanks

If two tanks rather than one are observed, then the probability that the larger of the two observed serial numbers is equal to ''m'' is
:(M=m\mid N=n,K=2) = (m\mid n) = [m \le n]\frac{m-1}{\binom{n}{2}}
When considered a function of ''n'' for fixed ''m'' this is a likelihood function
:\mathcal{L}(n) = [n \ge m]\frac{m-1}{\binom{n}{2}}
The total likelihood is
:\begin{align} \sum_n \mathcal{L}(n) &= \frac{m-1}{1} \sum_{n=m}^\infty \frac{1}{\binom{n}{2}} \\[4pt] &= \frac{m-1}{1} \cdot \frac{2}{2-1} \cdot \frac{1}{\binom{m-1}{2-1}} \\[4pt] &= 2 \end{align}
and the credibility mass distribution function is
:\begin{align} &(N=n\mid M=m,K=2) \\[4pt] = &(n\mid m) \\[4pt] = &\frac{\mathcal{L}(n)}{\sum_n \mathcal{L}(n)} \\[4pt] = &[n \ge m]\frac{m-1}{n(n-1)} \end{align}
The median \tilde{N} satisfies
:\sum_n [n \ge \tilde{N}](n\mid m) = \frac{1}{2}
so
:\frac{m-1}{\tilde{N}-1} = \frac{1}{2}
and so the median is
:\tilde{N} = 2m - 1
but the mean value of N is infinite
:\mu = \sum_n n \cdot (n\mid m) = \frac{m-1}{1}\sum_{n=m}^\infty \frac{1}{n-1} = \infty
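The median condition for two tanks can be verified exactly. In the sketch below the function names and the sample value ''m'' = 60 are hypothetical. The tail probability uses the telescoping identity 1/(n(n-1)) = 1/(n-1) - 1/n, so that the sum over ''n'' ≥ ''x'' of (''m''-1)/(n(n-1)) collapses to (''m''-1)/(''x''-1).

```python
from fractions import Fraction


def two_tank_pmf(n: int, m: int) -> Fraction:
    """Posterior (n | m) = [n >= m] (m-1) / (n (n-1)); assumes m >= 2."""
    return Fraction(m - 1, n * (n - 1)) if n >= m else Fraction(0)


def two_tank_tail(x: int, m: int) -> Fraction:
    """P(N >= x | m) for x >= m, from the telescoping sum: (m-1)/(x-1)."""
    return Fraction(m - 1, x - 1)


if __name__ == "__main__":
    m = 60  # illustrative largest observed serial number
    median = 2 * m - 1
    # Mass strictly below the median, summed term by term.
    below = sum(two_tank_pmf(n, m) for n in range(m, median))
    print(below, two_tank_tail(median, m))
```

Each side of the split at 2''m'' - 1 carries exactly half of the probability mass, which is the sense in which that value is the median.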


Many tanks


Credibility mass distribution function

The conditional probability that the largest of ''k'' observations taken from the serial numbers {1, ..., ''n''} is equal to ''m'' is
:\begin{align} &(M=m\mid N=n,K=k\ge 2) \\ = &(m\mid n,k) \\ = &[m \le n]\frac{\binom{m-1}{k-1}}{\binom{n}{k}} \end{align}
The likelihood function of ''n'' is the same expression
:\mathcal{L}(n) = [n \ge m]\frac{\binom{m-1}{k-1}}{\binom{n}{k}}
The total likelihood is finite for ''k'' ≥ 2:
:\begin{align} \sum_n \mathcal{L}(n) &= \frac{\binom{m-1}{k-1}}{1} \sum_{n=m}^\infty \frac{1}{\binom{n}{k}} \\ &= \frac{\binom{m-1}{k-1}}{1} \cdot \frac{k}{k-1} \cdot \frac{1}{\binom{m-1}{k-1}} \\ &= \frac{k}{k-1} \end{align}
The credibility mass distribution function is
:\begin{align} &(N=n\mid M=m,K=k \ge 2) = (n\mid m,k) \\ = &\frac{\mathcal{L}(n)}{\sum_n \mathcal{L}(n)} \\ = &[n \ge m]\frac{k-1}{k} \frac{\binom{m-1}{k-1}}{\binom{n}{k}} \\ = &[n \ge m]\frac{m-1}{n} \frac{\binom{m-2}{k-2}}{\binom{n-1}{k-1}} \\ = &[n \ge m]\frac{m-1}{n} \frac{m-2}{n-1} \frac{k-1}{k-2} \frac{\binom{m-3}{k-3}}{\binom{n-2}{k-2}} \end{align}
The complementary cumulative distribution function is the credibility that ''N'' > ''x''
:\begin{align} &(N>x\mid M=m,K=k) \\[4pt] = &\begin{cases} 1 & \text{if } x < m \\ \sum_{n=x+1}^\infty (n\mid m,k) & \text{if } x \ge m \end{cases} \\ = &[x < m] + [x \ge m]\sum_{n=x+1}^\infty \frac{k-1}{k}\frac{\binom{m-1}{k-1}}{\binom{n}{k}} \\[4pt] = &[x < m] + [x \ge m]\frac{k-1}{k} \frac{\binom{m-1}{k-1}}{1} \sum_{n=x+1}^\infty \frac{1}{\binom{n}{k}} \\[4pt] = &[x < m] + [x \ge m]\frac{k-1}{k} \frac{\binom{m-1}{k-1}}{1} \cdot \frac{k}{k-1} \frac{1}{\binom{x}{k-1}} \\[4pt] = &[x < m] + [x \ge m]\frac{\binom{m-1}{k-1}}{\binom{x}{k-1}} \end{align}
The cumulative distribution function is the credibility that ''N'' ≤ ''x''
:\begin{align} &(N\le x\mid M=m,K=k) \\[4pt] = &1 - (N>x\mid M=m,K=k) \\[4pt] = &[x \ge m]\left(1 - \frac{\binom{m-1}{k-1}}{\binom{x}{k-1}}\right) \end{align}
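The credibility mass function and its cumulative form translate directly into code. The following is a minimal sketch, assuming ''m'' ≥ ''k'' ≥ 2; the function names, including the `credible_upper_bound` helper, are hypothetical and introduced only for illustration. Exact fractions are used so that the closed-form cumulative distribution can be compared with a term-by-term sum.

```python
from fractions import Fraction
from math import comb


def posterior_pmf(n: int, m: int, k: int) -> Fraction:
    """(n | m, k) = [n >= m] (k-1)/k * C(m-1, k-1) / C(n, k)."""
    if n < m:
        return Fraction(0)
    return Fraction(k - 1, k) * Fraction(comb(m - 1, k - 1), comb(n, k))


def posterior_cdf(x: int, m: int, k: int) -> Fraction:
    """P(N <= x | m, k) = [x >= m] (1 - C(m-1, k-1) / C(x, k-1))."""
    if x < m:
        return Fraction(0)
    return 1 - Fraction(comb(m - 1, k - 1), comb(x, k - 1))


def credible_upper_bound(m: int, k: int, level: Fraction) -> int:
    """Smallest x with P(N <= x | m, k) >= level, for 0 < level < 1."""
    x = m
    while posterior_cdf(x, m, k) < level:
        x += 1
    return x


if __name__ == "__main__":
    m, k = 60, 4  # illustrative: 4 tanks observed, largest serial 60
    summed = sum(posterior_pmf(n, m, k) for n in range(m, 101))
    print(summed == posterior_cdf(100, m, k))
    print(credible_upper_bound(m, k, Fraction(95, 100)))
```

The linear search in `credible_upper_bound` terminates because the cumulative distribution tends to 1 for ''k'' ≥ 2; for large bounds a bisection search would be the more economical choice.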


Order of magnitude

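The order of magnitude of the number of enemy tanks is
:\begin{align} \mu &= \sum_n n\cdot(N=n\mid M=m,K=k) \\[4pt] & = \sum_n n [n \ge m]\frac{m-1}{n} \frac{\binom{m-2}{k-2}}{\binom{n-1}{k-1}} \\[4pt] & = \frac{m-1}{1} \frac{\binom{m-2}{k-2}}{1}\sum_{n=m}^\infty \frac{1}{\binom{n-1}{k-1}} \\[4pt] & = \frac{m-1}{1} \frac{\binom{m-2}{k-2}}{1} \cdot \frac{k-1}{k-2}\frac{1}{\binom{m-2}{k-2}} \\[4pt] & = \frac{m-1}{1} \frac{k-1}{k-2} \end{align}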
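As a sanity check, the closed-form mean (''m'' - 1)(''k'' - 1)/(''k'' - 2) can be compared with a truncated numerical sum of ''n'' times the credibility mass. This is a sketch under stated assumptions: the function names are hypothetical, the values ''k'' = 4 and ''m'' = 60 are chosen only for illustration, and the summation formula requires ''k'' - 1 > 1, so the mean is finite only for ''k'' ≥ 3.

```python
from fractions import Fraction
from math import comb


def posterior_mean(m: int, k: int) -> Fraction:
    """Closed-form posterior mean (m-1)(k-1)/(k-2); finite only for k >= 3."""
    if k < 3:
        raise ValueError("the posterior mean is finite only for k >= 3")
    return Fraction((m - 1) * (k - 1), k - 2)


def truncated_mean(m: int, k: int, upper: int) -> float:
    """Floating-point sum of n * (n | m, k) for n = m .. upper."""
    scale = (k - 1) * comb(m - 1, k - 1) / k
    return sum(n * scale / comb(n, k) for n in range(m, upper + 1))


if __name__ == "__main__":
    m, k = 60, 4  # illustrative values
    print(float(posterior_mean(m, k)))      # 88.5
    print(truncated_mean(m, k, 100_000))    # slightly below 88.5
```

The truncated sum approaches the closed form from below, since every omitted term is positive.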


Statistical uncertainty

The statistical uncertainty is the standard deviation \sigma, satisfying the equation
:\sigma^2 + \mu^2 = \sum_n n^2 \cdot (N=n\mid M=m,K=k)
So
:\begin{align} \sigma^2+\mu^2-\mu & = \sum_n n(n-1)\cdot(N=n\mid M=m,K=k)\\[4pt] & = \sum_{n=m}^\infty n(n-1)\frac{m-1}{n} \frac{m-2}{n-1} \frac{k-1}{k-2} \frac{\binom{m-3}{k-3}}{\binom{n-2}{k-2}}\\[4pt] & = \frac{m-1}{1} \frac{m-2}{1} \frac{k-1}{k-2} \cdot \frac{\binom{m-3}{k-3}}{1} \sum_{n=m}^\infty \frac{1}{\binom{n-2}{k-2}}\\[4pt] & = \frac{m-1}{1} \frac{m-2}{1} \frac{k-1}{k-2} \frac{\binom{m-3}{k-3}}{1} \frac{k-2}{k-3} \frac{1}{\binom{m-3}{k-3}}\\[4pt] & = \frac{m-1}{1} \frac{m-2}{1} \frac{k-1}{k-3} \end{align}
and
:\begin{align} \sigma &= \sqrt{\frac{m-1}{1} \frac{m-2}{1} \frac{k-1}{k-3} + \mu - \mu^2} \\[4pt] &= \sqrt{\frac{(k-1)(m-1)(m-k+1)}{(k-3)(k-2)^2}} \end{align}
The variance-to-mean ratio is simply
:\frac{\sigma^2}{\mu} = \frac{m-k+1}{(k-3)(k-2)}
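The standard deviation and the variance-to-mean ratio can be evaluated in a few lines. The sketch below uses hypothetical function names and the illustrative values ''k'' = 4, ''m'' = 60; it assumes ''k'' ≥ 4, because the second factorial moment relies on the summation formula with ''k'' - 2 > 1.

```python
from math import sqrt


def posterior_std(m: int, k: int) -> float:
    """sigma = sqrt((k-1)(m-1)(m-k+1) / ((k-3)(k-2)^2)); needs k >= 4."""
    if k < 4:
        raise ValueError("the posterior variance is finite only for k >= 4")
    return sqrt((k - 1) * (m - 1) * (m - k + 1) / ((k - 3) * (k - 2) ** 2))


def variance_to_mean(m: int, k: int) -> float:
    """sigma^2 / mu = (m-k+1) / ((k-3)(k-2)); needs k >= 4."""
    if k < 4:
        raise ValueError("the posterior variance is finite only for k >= 4")
    return (m - k + 1) / ((k - 3) * (k - 2))


if __name__ == "__main__":
    m, k = 60, 4  # illustrative values
    mu = (m - 1) * (k - 1) / (k - 2)   # posterior mean, 88.5
    sigma = posterior_std(m, k)        # about 50.22
    print(mu, sigma, variance_to_mean(m, k))
```

For these values the ratio is 28.5, which agrees with dividing the variance 2522.25 by the mean 88.5.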


See also

* Mark and recapture, another method of estimating population size
* Maximum spacing estimation, which generalizes the intuition of "assume uniformly distributed"
* Copernican principle and Lindy effect, analogous predictions of lifetime assuming just one observation in the sample (current age).
** The Doomsday argument, application to estimate expected survival time of the human race.
* Generalized extreme value distribution, possible limit distributions of sample maximum (opposite question).
* Maximum likelihood
* Bias of an estimator
* Likelihood function


Further reading



Notes


References


Works cited
