The iterative proportional fitting procedure (IPF or IPFP, also known as biproportional fitting or biproportion in statistics or economics (input-output analysis, etc.), RAS algorithm in economics, raking in survey statistics, and matrix scaling in computer science) is the operation of finding the fitted matrix
which is the closest to an initial matrix
but with the row and column totals of a target matrix
(which provides the constraints of the problem; the interior of
is unknown). The fitted matrix being of the form
, where
and
are diagonal matrices such that
has the margins (row and column sums) of
. Some algorithms can be chosen to perform biproportion. We have also the entropy maximization, information loss minimization (or cross-entropy) or RAS which consists of factoring the matrix rows to match the specified row totals, then factoring its columns to match the specified column totals; each step usually disturbs the previous step’s match, so these steps are repeated in cycles, re-adjusting the rows and columns in turn, until all specified marginal totals are satisfactorily approximated. However, all algorithms give the same solution.
In three- or more-dimensional cases, adjustment steps are applied for the marginals of each dimension in turn, the steps likewise repeated in cycles.
History
IPF has been "re-invented" many times, the earliest by Kruithof in 1937
in relation to telephone traffic ("Kruithof’s double factor method"),
Deming and
Stephan in 1940 for adjusting census crosstabulations, and G.V. Sheleikhovskii for traffic as reported by Bregman. (Deming and Stephan proposed IPFP as an algorithm leading to a minimizer of the
Pearson X-squared statistic, which Stephan later reported it ''does not'',.
Early proofs of uniqueness and convergence came from Sinkhorn (1964), Bacharach (1965), Bishop (1967), and
Fienberg (1970). Bishop's proof that IPFP finds the maximum likelihood estimator for any number of dimensions extended a 1959 proof by Brown for 2x2x2... cases. Fienberg's proof by
differential geometry
Differential geometry is a mathematical discipline that studies the geometry of smooth shapes and smooth spaces, otherwise known as smooth manifolds. It uses the techniques of differential calculus, integral calculus, linear algebra and multili ...
exploits the method's constant crossproduct ratios, for strictly positive tables. Csiszár (1975). found necessary and sufficient conditions for general tables having zero entries. Pukelsheim and Simeone (2009)
give further results on convergence and error behavior.
An exhaustive treatment of the algorithm and its mathematical foundations can be found in the book of Bishop et al. (1975). Idel (2016) gives a more recent survey.
Other general algorithms can be modified to yield the same limit as the IPFP, for instance the
Newton–Raphson method
In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valu ...
and
the
EM algorithm
EM, Em or em may refer to:
Arts and entertainment Music
* EM, the E major musical scale
* Em, the E minor musical scale
* Electronic music, music that employs electronic musical instruments and electronic music technology in its production
* Ency ...
. In most cases, IPFP is preferred due to its computational speed, low storage requirements, numerical stability and algebraic simplicity.
Applications of IPFP have grown to include
trip distribution
Trip distribution (or destination choice or zonal interchange analysis) is the second component (after trip generation, but before mode choice and route assignment) in the traditional four-step transportation forecasting model. This step matches ...
models, Fratar or Furness and other applications in transportation planning (Lamond and Stewart), survey weighting, synthesis of cross-classified demographic data, adjusting
input–output model
In economics, an input–output model is a quantitative economic model that represents the interdependencies between different sectors of a national economy or different regional economies.Thijs Ten Raa, Input–Output Economics: Theory and App ...
s in economics, estimating expected quasi-independent
contingency tables
In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business ...
,
biproportional apportionment
Biproportional apportionment is a proportional representation method to allocate seats in proportion to two separate characteristics. That is, for two different partitions each part receives the proportional number of seats within the total numb ...
systems of political representation, and for a
preconditioner
In mathematics, preconditioning is the application of a transformation, called the preconditioner, that conditions a given problem into a form that is more suitable for numerical solving methods. Preconditioning is typically related to reducing ...
in linear algebra.
Biproportion
Biproportion, whatever the algorithm used to solve it, is the following concept:
, matrix
and matrix
are known real nonnegative matrices of dimension
; the interior of
is unknown and
is searched such that
has the same margins than
, i.e.
and
(
being the sum vector, and such that
is closed to
following a given criterion, the fitted matrix being of the form
, where
and
are diagonal matrices.
s.t.
, ∀
and
, ∀
.
The Lagrangian is
.
Thus
, for ∀
,
which, after posing
and
, yields
, ∀
, i.e.,
,
with
, ∀
and
, ∀
.
and
form a system that can be solve iteratively:
, ∀
and
, ∀
.
The solution
is independent of the initialization chosen (i.e., we can begin by
, ∀
or by
, ∀
. If the matrix
is “indecomposable”, then this process has a unique fixed-point because it is deduced from program a program where the function is a convex and continuously derivable function defined on a compact set. In some cases the solution may not exist: see de Mesnard's example cited by Miller and Blair (Miller R.E. & Blair P.D. (2009) Input-output analysis: Foundations and Extensions, Second edition, Cambridge (UK): Cambridge University Press, p. 335-336 (freely available)).
Some properties (see de Mesnard (1994)):
Lack of information: if
brings no information, i.e.,
, ∀
then
.
Idempotency:
if
has the same margins than
.
Composition of biproportions:
;
.
Zeros: a zero in
is projected as a zero in
. Thus, a bloc-diagonal matrix is projected as a bloc-diagonal matrix and a triangular matrix is projected as a triangular matrix.
Theorem of separable modifications: if
is premutiplied by a diagonal matrix and/or postmultiplied by a diagonal matrix, then the solution is unchanged.
Theorem of "unicity": If
is any non-specified algorithm, with
,
and
being unknown, then
and
can always be changed into the standard form of
and
. The demonstrations calls some above properties, particularly the Theorem of separable modifications and the composition of biproportions.
Algorithm 1 (classical IPF)
Given a two-way (''I'' × ''J'')-table
, we wish to estimate a new table
for all ''i'' and ''j'' such that the marginals satisfy
and
.
Choose initial values
, and for
set
:
:
Repeat these steps until row and column totals are sufficiently close to u and v.
Notes:
* For the RAS form of the algorithm, define the diagonalization operator
, which produces a (diagonal) matrix with its input vector on the main diagonal and zero elsewhere. Then, for each row adjustment, let
, from which
. Similarly each column adjustment's
, from which
. Reducing the operations to the necessary ones, it can easily be seen that RAS does the same as classical IPF. In practice, one would not implement actual matrix multiplication with the whole R and S matrices; the RAS form is more a notational than computational convenience.
Algorithm 2 (factor estimation)
Assume the same setting as in the classical IPFP.
Alternatively, we can estimate the row and column factors separately: Choose initial values
, and for
set
:
:
Repeat these steps until successive changes of a and b are sufficiently negligible (indicating the resulting row- and column-sums are close to u and v).
Finally, the result matrix is
Notes:
* The two variants of the algorithm are mathematically equivalent, as can be seen by formal induction. With factor estimation, it is not necessary to actually compute each cycle's
.
* The factorization is not unique, since it is
for all
.
Discussion
The vaguely demanded 'similarity' between M and X can be explained as follows: IPFP (and thus RAS)
maintains the crossproduct ratios, i.e.
:
since
This property is sometimes called structure conservation and directly leads to the geometrical interpretation of contingency tables and the proof of convergence in the seminal paper of Fienberg (1970).
Direct factor estimation (algorithm 2) is generally the more efficient way to solve IPF: Whereas a form of the classical IPFP needs
:
elementary operations in each iteration step (including a row and a column fitting step), factor estimation needs only
:
operations being at least one order in magnitude faster than classical IPFP.
IPFP can be used to estimate expected quasi-independent (incomplete) contingency tables, with
, and
for included cells and
for excluded cells. For fully independent (complete) contingency tables, estimation with IPFP concludes exactly in one cycle.
Comparison with the NM-method
Similar to the IPF, the
NM-method
The NM-method or Naszodi–Mendonca method is the operation that can be applied in statistics, econometrics, economics, sociology, and demography to construct counterfactual contingency tables. The method finds the matrix X ( X \in \mathbb^ ) ...
is also an operation of finding a matrix
which is the “closest” to matrix
(
) while its row totals and column totals are identical to those of a target matrix
.
However, there are differences between the NM-method and the IPF. For instance, the NM-method defines closeness of matrices of the same size differently from the IPF.
Also, the NM-method was developed to solve for matrix
in problems, where matrix
is not a sample from the population characterized by the row totals and column totals of matrix
, but represents another population.
[ In contrast, matrix is a sample from this population in problems where the IPF is applied as the ]maximum likelihood estimator
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed statist ...
.
Existence and uniqueness of MLEs
Necessary and sufficient conditions for the existence and uniqueness of MLEs are complicated in the general case (see), but sufficient conditions for 2-dimensional tables are simple:
* the marginals of the observed table do not vanish (that is, ) and
* the observed table is inseparable (i.e. the table does not permute to a block-diagonal shape).
If unique MLEs exist, IPFP exhibits linear convergence in the worst case (Fienberg 1970), but exponential convergence has also been observed (Pukelsheim and Simeone 2009). If a direct estimator (i.e. a closed form of ) exists, IPFP converges after 2 iterations. If unique MLEs do not exist, IPFP converges toward the so-called ''extended MLEs'' by design (Haberman 1974), but convergence may be arbitrarily slow and often computationally infeasible.
If all observed values are strictly positive, existence and uniqueness of MLEs and therefore convergence is ensured.
Example
Consider the following table, given with the row- and column-sums and targets.
For executing the classical IPFP, we first adjust the rows:
The first step exactly matched row sums, but not the column sums. Next we adjust the columns:
Now the column sums exactly match their targets, but the row sums no longer match theirs. After completing three cycles, each with a row adjustment and a column adjustment, we get a closer approximation:
Implementation
The R package mipfp (currently in version 3.1) provides a multi-dimensional implementation of the traditional iterative proportional fitting procedure. The package allows the updating of a ''N''-dimensional array with respect to given target marginal distributions (which, in turn can be multi-dimensional).
Python has an equivalent package, ipfn that can be installed via pip. The package supports numpy and pandas input objects.
See also
* Data cleansing
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the dat ...
* Data editing Data editing is defined as the process involving the review and adjustment of collected survey data. Data editing helps define guidelines that will reduce potential bias and ensure consistent estimates leading to a clear analysis of the data set by ...
* NM-method
The NM-method or Naszodi–Mendonca method is the operation that can be applied in statistics, econometrics, economics, sociology, and demography to construct counterfactual contingency tables. The method finds the matrix X ( X \in \mathbb^ ) ...
* Triangulation (social science)
In the social sciences, triangulation refers to the application and combination of several research methods in the study of the same phenomenon. By combining multiple observers, theories, methods, and empirical materials, researchers hope to overc ...
for quantitative and qualitative study data enhancement.
References
{{DEFAULTSORT:Iterative Proportional Fitting
Contingency table
Statistical algorithms