Cramér's V

	Cramér's V In statistics, Cramér's V (sometimes referred to as Cramér's phi and denoted as φ''c'') is a measure of association between two nominal variables, giving a value between 0 and +1 (inclusive). It is based on Pearson's chi-squared statistic and was published by Harald Cramér in 1946. Usage and interpretation φ''c'' is the intercorrelation of two discrete variablesSheskin, David J. (1997). Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton, Fl: CRC Press. and may be used with variables having two or more levels. φ''c'' is a symmetrical measure: it does not matter which variable we place in the columns and which in the rows. Also, the order of rows/columns does not matter, so φ''c'' may be used with nominal data types or higher (notably, ordered or numerical). Cramér's V varies from 0 (corresponding to no association between the variables) to 1 (complete association) and can reach 1 only when each variable is completely determined by the other. ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Statistics Statistics (from German language, German: ', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of statistical survey, surveys and experimental design, experiments. When census data (comprising every member of the target population) cannot be collected, statisticians collect data by developing specific experiment designs and survey sample (statistics), samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Tschuprow's T In statistics, Tschuprow's ''T'' is a measure of association between two nominal variables, giving a value between 0 and 1 (inclusive). It is closely related to Cramér's V, coinciding with it for square contingency tables. It was published by Alexander Tschuprow (alternative spelling: Chuprov) in 1939.Tschuprow, A. A. (1939) ''Principles of the Mathematical Theory of Correlation''; translated by M. Kantorowitsch. W. Hodge & Co. Definition For an ''r'' × ''c'' contingency table with ''r'' rows and ''c'' columns, let \pi_ be the proportion of the population in cell (i,j) and let :\pi_=\sum_^c\pi_ and \pi_=\sum_^r\pi_. Then the mean square contingency is given as : \phi^2 = \sum_^r\sum_^c\frac , and Tschuprow's ''T'' as : T = \sqrt . Properties ''T'' equals zero if and only if independence holds in the table, i.e., if and only if \pi_=\pi_\pi_. ''T'' equals one if and only there is perfect dependence in the table, i.e., if and only if for each ''i'' there is onl ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Statistical Ratios Statistics (from German: ', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments. When census data (comprising every member of the target population) cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the sy ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Effect Size In statistics, an effect size is a value measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the value of one parameter for a hypothetical population, or to the equation that operationalizes how statistics or parameters lead to the effect size value. Examples of effect sizes include the correlation between two variables, the regression coefficient in a regression, the mean difference, or the risk of a particular event (such as a heart attack) happening. Effect sizes are a complement tool for statistical hypothesis testing, and play an important role in power analyses to assess the sample size required for new experiments. Effect size are fundamental in meta-analyses which aim to provide the combined effect size based on data from multiple studies. The cluster of data-analysis methods concerning effect sizes is referred to as estima ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Contingency Table In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term ''contingency table'' was first used by Karl Pearson in "On the Theory of Contingency and Its Relation to Association and Normal Correlation", part of the '' Drapers' Company Research Memoirs Biometric Series I'' published in 1904. A crucial problem of multivariate statistics is finding the (direct-)dependence structure underlying the variables contained in high-dimensional contingency tables. If some of the conditional independences are revealed, then even the storage of the data can be done in a smarter way (see Lauritzen (2002)). In order to do this one can use in ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Fowlkes–Mallows Index The Fowlkes–Mallows index is an external evaluation method that is used to determine the similarity between two clusterings (clusters obtained after a clustering algorithm), and also a metric to measure confusion matrices. This measure of similarity could be either between two hierarchical clusterings or a clustering and a benchmark classification. A higher value for the Fowlkes–Mallows index indicates a greater similarity between the clusters and the benchmark classifications. It was invented by Bell Labs statisticians Edward Fowlkes and Collin Mallows in 1983. Preliminaries The Fowlkes–Mallows index, when results of two clustering algorithms are used to evaluate the results, is defined as : FM = \sqrt= \sqrt where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. TPR is the ''true positive rate'', also called '' sensitivity'' or ''recall'', and PPV is the ''positive predictive rate'', also known as '' p ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
picture info	Jaccard Index The Jaccard index is a statistic used for gauging the similarity and diversity of sample sets. It is defined in general taking the ratio of two sizes (areas or volumes), the intersection size divided by the union size, also called intersection over union (IoU). It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is often called the critical success index in meteorology. It was later developed independently by Paul Jaccard, originally giving the French name (coefficient of community), and independently formulated again by Taffee Tadashi Tanimoto. Thus, it is also called Tanimoto index or Tanimoto coefficient in some fields. Overview The Jaccard index measures similarity between finite non-empty sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets: : J(A, B) = \frac = \frac. Note that by design, 0 \le J(A, B) \le 1. If the sets A and B have no elements in common, their intersectio ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Dunn Index The Dunn index, introduced by Joseph C. Dunn in 1974, is a metric for evaluating clustering algorithms. This is part of a group of validity indices including the Davies–Bouldin index or Silhouette index, in that it is an internal evaluation scheme, where the result is based on the clustered data itself. As do all other such indices, the aim is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within cluster variance. For a given assignment of clusters, a higher Dunn index indicates better clustering. One of the drawbacks of using this is the computational cost as the number of clusters and dimensionality of the data increase. Preliminaries There are many ways to define the size or diameter of a cluster. It could be the distance between the farthest two points inside a cluster, it could be the mean of all the pairwise distance ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Davies–Bouldin Index The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. This is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset. This has a drawback that a good value reported by this method does not imply the best information retrieval. Preliminaries Given ''n'' dimensional points, let ''C''''i'' be a cluster of data points. Let ''X''''j'' be an ''n''-dimensional feature vector assigned to cluster ''C''''i''. : S_i = \left(\frac \sum_^ \right)^ Here A_i is the centroid of ''C''''i'' and ''T''''i'' is the size of the cluster ''i''. S_i is the ''q''th root of the ''q''th moment of the points in cluster ''i'' about the mean. If q=1 then S_i is the average distance between the feature vectors in cluster ''i'' and the centroid of the cluster. Usually the value of ''p'' is 2, which makes the dista ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Rand Index The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index may be defined that is adjusted for the chance grouping of elements, this is the adjusted Rand index. The Rand index is the accuracy of determining if a link belongs within a cluster or not. Rand index Definition Given a set of n elements S = \ and two partitions of S to compare, X = \, a partition of ''S'' into ''r'' subsets, and Y = \, a partition of ''S'' into ''s'' subsets, define the following: * a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y * b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y * c, the number of pairs of elements in S that are in the same subset in X and in different subsets in Y * d, the number of pairs of elements in S that are in different subsets in ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Goodman And Kruskall's Lambda In probability theory and statistics, Goodman & Kruskal's lambda (\lambda) is a measure of proportional reduction in error in cross tabulation analysis. For any sample with a nominal independent variable and dependent variable (or ones that can be treated nominally), it indicates the extent to which the modal categories and frequencies for each value of the independent variable differ from the overall modal category and frequency, i.e., for all values of the independent variable together. \lambda is defined by the equation :\lambda = \frac. where :\varepsilon_1 is the overall non-modal frequency, and :\varepsilon_2 is the sum of the non-modal frequencies for each value of the independent variable. Values for lambda range from zero (no association between independent and dependent variables) to one ( perfect association). Weaknesses Although Goodman and Kruskal's lambda is a simple way to assess the association between variables, it yields a value of 0 (no association) wheneve ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]
	Uncertainty Coefficient In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal Association (statistics), association. It was first introduced by Henri Theil and is based on the concept of information entropy. Definition Suppose we have samples of two discrete random variables, ''X'' and ''Y''. By constructing the joint distribution, , from which we can calculate the conditional probability distribution, conditional distributions, and , and calculating the various entropies, we can determine the degree of association between the two variables. The entropy of a single distribution is given as: : H(X)= -\sum_x P_X(x) \log P_X(x) , while the conditional entropy is given as: : H(X, Y) = -\sum_ P_(x,~y) \log P_(x, y) . The uncertainty coefficient or proficiency is defined as: : U(X, Y) = \frac = \frac , and tells us: given ''Y'', what fraction of the bits of ''X'' can we predict? In this case we can think of ''X'' as contain ... [...More Info...] [...Related Items...] OR: [Wikipedia] [Google] [Baidu]