In statistics, the medcouple is a robust statistic that measures the skewness of a univariate distribution. It is defined as a scaled median difference between the left and right half of a distribution. Its robustness makes it suitable for identifying outliers in adjusted boxplots. Ordinary box plots do not fare well with skew distributions, since they label points in the longer tail as outliers. Using the medcouple, the whiskers of a boxplot can be adjusted for skew distributions, so that outliers are identified more accurately in non-symmetrical distributions.

As a kind of order statistic, the medcouple belongs to the class of incomplete generalised L-statistics. Like the ordinary median or mean, the medcouple is a nonparametric statistic, thus it can be computed for any distribution.

Definition

The following description uses zero-based indexing in order to harmonise with the indexing in many programming languages. Let

X := \{x_0 \geq x_1 \geq \ldots \geq x_{n-1}\}

be an ordered sample of size

n

, and let

x_m

be the median of

X

. Define the sets

X^+ := \{x_i \mid x_i \geq x_m\}

,

X^- := \{x_j \mid x_j \leq x_m\}

, of sizes

p := |X^+|

and

q := |X^-|

respectively. For

x_i^+ \in X^+

and

x_j^- \in X^-

, we define the ''kernel function''

h(x_i^+, x_j^-) := \begin{cases}
\displaystyle\frac{(x_i^+ - x_m) - (x_m - x_j^-)}{x_i^+ - x_j^-} & \text{if } x_i^+ > x_j^-, \\
\operatorname{signum}(p - 1 - i - j) & \text{if } x_i^+ = x_m = x_j^-,
\end{cases}

where

\operatorname{signum}

is the

sign function

. The ''medcouple'' is then the median of the set

\{h(x_i^+, x_j^-) \mid x_i^+ \in X^+ \text{ and } x_j^- \in X^-\}

. In other words, we split the distribution into all values greater than or equal to the median and all values less than or equal to the median. We define a kernel function whose first variable is over the

p

greater values and whose second variable is over the

q

lesser values. For the special case of values tied to the median, we define the kernel by the signum function. The medcouple is then the median over all

pq

values of

h(x_i^+, x_j^-)

. Since the medcouple is not a median applied to all

(x_i, x_j)

couples, but only to those for which

x_i^+ \geq x_m \geq x_j^-

, it belongs to the class of incomplete generalised L-statistics.
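As a small worked illustration of the definition (the three-point sample below is a hypothetical choice, not taken from the source), let X = \{5 \geq 2 \geq 1\}, so that x_m = 2, X^+ = \{5, 2\}, X^- = \{2, 1\} and p = q = 2. With zero-based indices, x_1^+ = 2 and x_0^- = 2 are both tied to the median, so that pair falls under the signum case:

```latex
\begin{aligned}
h(x_0^+, x_0^-) &= h(5, 2) = \frac{(5 - 2) - (2 - 2)}{5 - 2} = 1, \\
h(x_0^+, x_1^-) &= h(5, 1) = \frac{(5 - 2) - (2 - 1)}{5 - 1} = \frac{1}{2}, \\
h(x_1^+, x_0^-) &= h(2, 2) = \operatorname{signum}(2 - 1 - 1 - 0) = 0, \\
h(x_1^+, x_1^-) &= h(2, 1) = \frac{(2 - 2) - (2 - 1)}{2 - 1} = -1, \\
\operatorname{median}\left\{1, \tfrac{1}{2}, 0, -1\right\} &= \frac{1}{4}.
\end{aligned}
```

The positive value reflects the longer right tail of this sample.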

Properties of the medcouple

The medcouple has a number of desirable properties. A few of them are directly inherited from the kernel function.

The medcouple kernel

We make the following observations about the kernel function

h(x_i^+, x_j^-)

:

1. The kernel function is location-invariant. If we add or subtract any value to each element of the sample

X

, the corresponding values of the kernel function do not change.

2. The kernel function is scale-invariant. Scaling all elements of the sample

X

by the same positive factor does not alter the values of the kernel function.

These properties are in turn inherited by the medcouple. Thus, the medcouple is independent of the mean and standard deviation of a distribution, a desirable property for measuring skewness. For ease of computation, these properties enable us to define the two sets

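Z^+ := \left.\left\{\frac{x_i^+ - x_m}{r} ~\right|~ x_i^+ \in X^+\right\}

Z^- := \left.\left\{\frac{x_j^- - x_m}{r} ~\right|~ x_j^- \in X^-\right\}

where

r = 2 \max_{0 \leq i \leq n-1} |x_i|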

. This makes the set

Z := Z^+ \cup Z^-

have range at most 1 and median 0, while keeping the same medcouple as

X

. For

Z

, the medcouple kernel reduces to

h(z_i^+, z_j^-) := \begin{cases}
\displaystyle\frac{z_i^+ + z_j^-}{z_i^+ - z_j^-} & \text{if } z_i^+ > z_j^- \\
\operatorname{signum}(p - 1 - i - j) & \text{if } z_i^+ = 0 = z_j^-
\end{cases}
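The claim that recentring and rescaling leave the kernel unchanged can be checked numerically. The following is a minimal sketch in Python; the function names `kernel_raw` and `kernel_rescaled` and the sample values are hypothetical choices made for illustration.

```python
def kernel_raw(x_plus, x_minus, x_m):
    """Kernel on the original sample, valid when x_plus > x_minus."""
    return ((x_plus - x_m) - (x_m - x_minus)) / (x_plus - x_minus)


def kernel_rescaled(z_plus, z_minus):
    """Reduced kernel on the recentred, rescaled sample (median 0)."""
    return (z_plus + z_minus) / (z_plus - z_minus)


# Hypothetical sample in decreasing order; its median is 3.0.
sample = [9.0, 4.0, 3.0, 2.0, 1.0]
x_m = 3.0
r = 2 * max(abs(v) for v in sample)

# Largest discrepancy between the two kernels over all non-tied pairs.
max_diff = 0.0
for x_plus in (v for v in sample if v >= x_m):
    for x_minus in (v for v in sample if v <= x_m):
        if x_plus > x_minus:  # the tied pair is handled by signum instead
            raw = kernel_raw(x_plus, x_minus, x_m)
            rescaled = kernel_rescaled((x_plus - x_m) / r, (x_minus - x_m) / r)
            max_diff = max(max_diff, abs(raw - rescaled))

print(max_diff)
```

Up to floating-point rounding, the two kernels agree on every pair, which is what location and scale invariance predict.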

Using the recentred and rescaled set

Z

we can observe the following.

3. The kernel function is between -1 and 1, that is,

|h(z_i^+, z_j^-)| \leq 1

. This follows from the reverse triangle inequality

|a| - |b| \leq |a - b|

with

a = z_i^+

and

b = z_j^-

and the fact that

z_i^+ \geq 0 \geq z_j^-

.

4. The medcouple kernel

h(z_i^+, z_j^-)

is non-decreasing in each variable. This can be verified by the partial derivatives

\frac{\partial h}{\partial z_i^+} = \frac{-2 z_j^-}{(z_i^+ - z_j^-)^2}

and

\frac{\partial h}{\partial z_j^-} = \frac{2 z_i^+}{(z_i^+ - z_j^-)^2}

, both nonnegative, since

z_i^+ \geq 0 \geq z_j^-

. With properties 1, 2, and 4, we can thus define the following matrix,

H := (h_{ij}) = (h(z_i^+, z_j^-)) =
\begin{pmatrix}
h(z_0^+, z_0^-) & \cdots & h(z_0^+, z_{q-1}^-) \\
\vdots & \ddots & \vdots \\
h(z_{p-1}^+, z_0^-) & \cdots & h(z_{p-1}^+, z_{q-1}^-)
\end{pmatrix}.

If we sort the sets

Z^+

and

Z^-

in decreasing order, then the matrix

H

has sorted rows and sorted columns,

H =
\begin{pmatrix}
h(z_0^+, z_0^-)      & \geq & \cdots  & \geq & h(z_0^+, z_{q-1}^-) \\
\geq                 &      &         &      & \geq   \\
\vdots               &      & \ddots  &      & \vdots \\
\geq                 &      &         &      & \geq   \\
h(z_{p-1}^+, z_0^-) & \geq & \cdots  & \geq & h(z_{p-1}^+, z_{q-1}^-)
\end{pmatrix}.

The medcouple is then the median of this matrix with sorted rows and sorted columns. The fact that the rows and columns are sorted allows the implementation of a fast algorithm for computing the medcouple.
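The sortedness of the kernel matrix can be verified on a small example. The sketch below assumes hypothetical recentred and rescaled values that include two observations tied to the median, so that the signum case is exercised as well. The helper name `kernel_matrix` is an illustrative choice.

```python
def kernel_matrix(z_plus, z_minus):
    """Build H from vectors sorted in decreasing order (median already 0)."""
    p, q = len(z_plus), len(z_minus)

    def h(i, j):
        a, b = z_plus[i], z_minus[j]
        if a == b:  # only possible when both equal the median, 0
            s = p - 1 - i - j
            return (s > 0) - (s < 0)  # signum
        return (a + b) / (a - b)

    return [[h(i, j) for j in range(q)] for i in range(p)]


# Hypothetical values; the two zeros are ties with the median
# and therefore appear in both vectors.
z_plus = [0.5, 0.25, 0.0, 0.0]
z_minus = [0.0, 0.0, -0.125, -0.25]
H = kernel_matrix(z_plus, z_minus)

# Every row and every column should be non-increasing.
rows_sorted = all(row[j] >= row[j + 1]
                  for row in H for j in range(len(row) - 1))
cols_sorted = all(H[i][j] >= H[i + 1][j]
                  for i in range(len(H) - 1) for j in range(len(H[0])))
print(rows_sorted, cols_sorted)
```

The signum rule for tied values is what keeps the lower-left block of the matrix consistent with the ordering of the remaining entries.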

Robustness

The breakdown point is the proportion of values that a statistic can resist before it becomes meaningless, i.e. the proportion of arbitrarily large outliers that the data set

X

may have before the value of the statistic is affected. For the medcouple, the breakdown point is 25%, since it is a median taken over the couples

(x_i, x_j)

such that

x_i \geq x_m \geq x_j

.

Values

Like all measures of skewness, the medcouple is positive for distributions that are skewed to the right, negative for distributions skewed to the left, and zero for symmetrical distributions. In addition, the values of the medcouple are bounded by 1 in absolute value.

Algorithms for computing the medcouple

Before presenting medcouple algorithms, we recall that there exist

O(n)

algorithms for finding the median. Since the medcouple is a median, ordinary algorithms for median-finding are important.
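One common selection method is quickselect, sketched below in Python. Note the assumption: quickselect with a random pivot runs in linear time only in expectation. A worst-case linear bound requires a different pivot rule, such as median of medians, which is not shown here. The names `quickselect` and `linear_median` are illustrative.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (zero-based) of values."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [v for v in items if v < pivot]
        n_equal = sum(1 for v in items if v == pivot)
        if k < len(lows):
            items = lows                      # answer lies below the pivot
        elif k < len(lows) + n_equal:
            return pivot                      # answer is the pivot itself
        else:
            k -= len(lows) + n_equal          # answer lies above the pivot
            items = [v for v in items if v > pivot]


def linear_median(values):
    """Median via selection; averages the two middle values when n is even."""
    n = len(values)
    if n % 2 == 1:
        return quickselect(values, n // 2)
    return 0.5 * (quickselect(values, n // 2 - 1) + quickselect(values, n // 2))
```

Each pass discards part of the input, so the expected total work is proportional to n rather than to n log n as with a full sort.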

Naïve algorithm

The naïve algorithm for computing the medcouple is slow. It proceeds in two steps. First, it constructs the medcouple matrix

H

which contains all of the possible values of the medcouple kernel. In the second step, it finds the median of this matrix. Since there are

pq \approx \frac{n^2}{4}

entries in the matrix in the case when all elements of the data set

X

are unique, the

algorithmic complexity

of the naïve algorithm is

O(n^2)

. More concretely, the naïve algorithm proceeds as follows. Recall that we are using zero-based indexing.

    function naïve_medcouple(vector X):
        // X is a vector of size n.

        // Sorting in decreasing order can be done in-place in O(n log n) time
        sort_decreasing(X)

        xm := median(X)
        xscale := 2 * max(abs(X))

        // Define the upper and lower centred and rescaled vectors
        // they inherit X's own decreasing sorting
        Zplus  := [(x - xm)/xscale | x in X such that x >= xm]
        Zminus := [(x - xm)/xscale | x in X such that x <= xm]

        p := size(Zplus)
        q := size(Zminus)

        // Define the kernel function closing over Zplus and Zminus
        function h(i, j):
            a := Zplus[i]
            b := Zminus[j]

            if a == b:
                return signum(p - 1 - i - j)
            else:
                return (a + b) / (a - b)
            endif
        endfunction

        // O(n^2) operations necessary to form this vector
        H := [h(i, j) | i in [0, 1, ..., p - 1] and j in [0, 1, ..., q - 1]]

        return median(H)
    endfunction

The final call to median on a vector of size

O(n^2)

can be done itself in

O(n^2)

operations, hence the entire naïve medcouple algorithm is of the same complexity.
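The naïve algorithm translates almost line by line into Python. The following is a minimal sketch rather than a reference implementation: the name `naive_medcouple` is an illustrative choice, the median comes from the standard library's `statistics.median`, and the input is assumed to be non-empty with at least one nonzero value (otherwise the scale factor would be zero).

```python
from statistics import median


def signum(v):
    """Sign of v as -1, 0 or 1."""
    return (v > 0) - (v < 0)


def naive_medcouple(data):
    """O(n^2) medcouple: form every kernel value, then take the median."""
    x = sorted(data, reverse=True)            # decreasing order
    xm = median(x)
    xscale = 2 * max(abs(v) for v in x)       # assumes some v != 0

    # Recentred, rescaled halves; both inherit the decreasing order of x.
    zplus = [(v - xm) / xscale for v in x if v >= xm]
    zminus = [(v - xm) / xscale for v in x if v <= xm]
    p, q = len(zplus), len(zminus)

    def h(i, j):
        a, b = zplus[i], zminus[j]
        if a == b:                            # both tied to the median
            return signum(p - 1 - i - j)
        return (a + b) / (a - b)

    return median([h(i, j) for i in range(p) for j in range(q)])


print(naive_medcouple([1, 2, 5]))
print(naive_medcouple([1, 2, 3]))
```

The first call uses a sample with a longer right tail and gives a positive value. The second sample is symmetric and gives zero. The quadratic memory use of the list of kernel values is what the fast algorithm avoids.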

Fast algorithm

The fast algorithm outperforms the naïve algorithm by exploiting the sorted nature of the medcouple matrix

H

. Instead of computing all entries of the matrix, the fast algorithm uses the K-th pair algorithm of Johnson & Mizoguchi.

The first stage of the fast algorithm proceeds as the naïve algorithm. We first compute the necessary ingredients for the kernel matrix,

H = (h_{ij})

, with sorted rows and sorted columns in decreasing order. Rather than computing all values of

h_{ij}

, we instead exploit the monotonicity in rows and columns, via the following observations.

Comparing a value against the kernel matrix

First, we note that we can compare any

u

with all values

h_{ij}

of

H

in

O(n)

time. For example, for determining all

i

and

j

such that

h_{ij} > u

, we have the following function:

    function greater_h(kernel h, int p, int q, real u):
        // h is the kernel function, h(i,j) gives the ith, jth entry of H
        // p and q are the number of rows and columns of the kernel matrix H

        // vector of size p
        P := vector(p)

        // indexing from zero
        j := 0

        // starting from the bottom, compute the least upper bound