Category utility
Category utility is a measure of "category goodness" defined in Gluck & Corter (1985) and Corter & Gluck (1992). It attempts to maximize both the probability that two objects in the same category have attribute values in common, and the probability that objects from different categories have different attribute values. It was intended to supersede more limited measures of category goodness such as "cue validity" (Reed 1972; Rosch & Mervis 1975) and the "collocation index" (Jones 1983). It provides a normative information-theoretic measure of the ''predictive advantage'' gained by the observer who possesses knowledge of the given category structure (i.e., the class labels of instances) over the observer who does ''not'' possess knowledge of the category structure. In this sense the motivation for the category utility measure is similar to the information gain metric used in decision tree learning. In certain presentations, it is also formally equivalent to the mutual information, as discussed below. A review of category utility in its probabilistic incarnation, with applications to machine learning, is provided in Witten & Frank (2005).


Probability-theoretic definition of category utility

The probability-theoretic definition of category utility given in Fisher (1987) and Witten & Frank (2005) is as follows:

: CU(C,F) = \tfrac{1}{p} \sum_{c_j \in C} p(c_j) \left[ \sum_{f_i \in F} \sum_{k=1}^m p(f_{ik}|c_j)^2 - \sum_{f_i \in F} \sum_{k=1}^m p(f_{ik})^2 \right]

where F = \{f_i\},\ i=1 \ldots n is a size-n set of m-ary features, and C = \{c_j\},\ j=1 \ldots p is a set of p categories. The term p(f_{ik}) designates the marginal probability that feature f_i takes on value k, and the term p(f_{ik}|c_j) designates the category-conditional probability that feature f_i takes on value k ''given'' that the object in question belongs to category c_j.

The motivation and development of this expression for category utility, and the role of the multiplicand \tfrac{1}{p} as a crude overfitting control, is given in the above sources. Loosely, the term \textstyle p(c_j) \sum_{f_i \in F} \sum_{k=1}^m p(f_{ik}|c_j)^2 is the expected number of attribute values that can be correctly guessed by an observer using a probability-matching strategy together with knowledge of the category labels, while \textstyle p(c_j) \sum_{f_i \in F} \sum_{k=1}^m p(f_{ik})^2 is the expected number of attribute values that can be correctly guessed by an observer using the same strategy but without any knowledge of the category labels. Their difference therefore reflects the relative advantage accruing to the observer by having knowledge of the category structure.
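To make the formula concrete, here is a minimal sketch in Python of the probability-theoretic category utility, estimating all probabilities by simple counting over a dataset of nominal feature vectors with category labels. The function name category_utility and the data layout are illustrative choices, not taken from the cited sources.

```python
from collections import Counter

def category_utility(instances, labels):
    """CU(C,F) = (1/p) * sum_j p(c_j) * [sum_i sum_k p(f_ik|c_j)^2
                                         - sum_i sum_k p(f_ik)^2]."""
    n_total = len(instances)
    n_feats = len(instances[0])
    cat_counts = Counter(labels)
    p = len(cat_counts)  # number of categories

    # Marginal term: sum over features i and values k of p(f_ik)^2.
    marginal = 0.0
    for i in range(n_feats):
        value_counts = Counter(x[i] for x in instances)
        marginal += sum((c / n_total) ** 2 for c in value_counts.values())

    cu = 0.0
    for cat, n_cat in cat_counts.items():
        # Category-conditional term: sum_i sum_k p(f_ik|c_j)^2.
        members = [x for x, y in zip(instances, labels) if y == cat]
        conditional = 0.0
        for i in range(n_feats):
            value_counts = Counter(x[i] for x in members)
            conditional += sum((c / n_cat) ** 2 for c in value_counts.values())
        cu += (n_cat / n_total) * (conditional - marginal)
    return cu / p

# Toy example: each category perfectly determines both features.
instances = [("red", "round"), ("red", "round"),
             ("blue", "square"), ("blue", "square")]
labels = ["a", "a", "b", "b"]
print(category_utility(instances, labels))  # 0.5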


Information-theoretic definition of category utility

The information-theoretic definition of category utility for a set of entities with size-n binary feature set F = \{f_i\},\ i=1 \ldots n, and a binary category C = \{c, \bar{c}\} is given in Gluck & Corter (1985) as follows:

: CU(C,F) = \left[ p(c) \sum_{i=1}^n p(f_i|c)\log p(f_i|c) + p(\bar{c}) \sum_{i=1}^n p(f_i|\bar{c})\log p(f_i|\bar{c}) \right] - \sum_{i=1}^n p(f_i)\log p(f_i)

where p(c) is the prior probability of an entity belonging to the positive category c (in the absence of any feature information), p(f_i|c) is the conditional probability of an entity having feature f_i given that the entity belongs to category c, p(f_i|\bar{c}) is likewise the conditional probability of an entity having feature f_i given that the entity belongs to category \bar{c}, and p(f_i) is the prior probability of an entity possessing feature f_i (in the absence of any category information).

The intuition behind the above expression is as follows: The term p(c)\textstyle \sum_{i=1}^n p(f_i|c)\log p(f_i|c) represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described belong to category c. Similarly, the term p(\bar{c})\textstyle \sum_{i=1}^n p(f_i|\bar{c})\log p(f_i|\bar{c}) represents the cost (in bits) of optimally encoding (or transmitting) feature information when it is known that the objects to be described belong to category \bar{c}. The sum of these two terms in the brackets is therefore the weighted average of these two costs. The final term, \textstyle \sum_{i=1}^n p(f_i)\log p(f_i), represents the cost (in bits) of optimally encoding (or transmitting) feature information when no category information is available. The value of the category utility will, in the above formulation, be non-negative: since x \log x is convex, the bracketed category-conditional sum can never be smaller than the unconditional sum, reflecting the fact that knowledge of the category structure can never increase the expected encoding cost.
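The following sketch (illustrative names and data layout; it assumes log base 2, consistent with costs measured in bits, and adopts the convention 0 log 0 = 0) computes this information-theoretic form for a binary category directly from the prior p(c) and the per-feature conditional probabilities.

```python
import math

def info_theoretic_cu(p_c, p_f_given_c, p_f_given_notc):
    """CU = [p(c) sum_i p(f_i|c) log p(f_i|c)
             + p(~c) sum_i p(f_i|~c) log p(f_i|~c)]
            - sum_i p(f_i) log p(f_i)."""
    log2 = lambda x: math.log2(x) if x > 0 else 0.0  # convention 0*log(0) = 0
    p_notc = 1.0 - p_c
    bracket, marginal = 0.0, 0.0
    for pfc, pfn in zip(p_f_given_c, p_f_given_notc):
        bracket += p_c * pfc * log2(pfc) + p_notc * pfn * log2(pfn)
        p_f = p_c * pfc + p_notc * pfn  # p(f_i) by total probability
        marginal += p_f * log2(p_f)
    return bracket - marginal

# Two binary features that are informative about the category.
print(info_theoretic_cu(0.5, [0.9, 0.8], [0.1, 0.3]))  # positive value
```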


Category utility and mutual information

Gluck & Corter (1985) and Corter & Gluck (1992) mention that the category utility is equivalent to the mutual information. Here is a simple demonstration of the nature of this equivalence. Assume a set of entities each having the same n features, i.e., feature set F = \{f_i\},\ i=1 \ldots n, with each feature variable having cardinality m. That is, each feature has the capacity to adopt any of m distinct values (which need ''not'' be ordered; all variables can be nominal); for the special case m=2 these features would be considered ''binary'', but more generally, for any m, the features are simply ''m-ary''. For the purposes of this demonstration, without loss of generality, feature set F can be replaced with a single aggregate variable F_a that has cardinality m^n, and adopts a unique value v_i,\ i=1 \ldots m^n corresponding to each feature combination in the Cartesian product \otimes F. (Ordinality does ''not'' matter, because the mutual information is not sensitive to ordinality.) In what follows, a term such as p(F_a=v_i) or simply p(v_i) refers to the probability with which F_a adopts the particular value v_i. (Using the aggregate feature variable F_a replaces multiple summations, and simplifies the presentation to follow.)

For this demonstration, also assume a single category variable C, which has cardinality p. This is equivalent to a classification system in which there are p non-intersecting categories; the special case p=2 reduces to the two-category case discussed above. From the definition of mutual information for discrete variables, the mutual information I(F_a;C) between the aggregate feature variable F_a and the category variable C is given by:

: I(F_a;C) = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i,c_j) \log \frac{p(v_i,c_j)}{p(v_i)\,p(c_j)}

where p(v_i) is the prior probability of feature variable F_a adopting value v_i, p(c_j) is the marginal probability of category variable C adopting value c_j, and p(v_i,c_j) is the joint probability of variables F_a and C simultaneously adopting those respective values. In terms of the conditional probabilities this can be re-written (or defined) as

: \begin{align}
I(F_a;C) & = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i,c_j) \log \frac{p(v_i|c_j)}{p(v_i)} \\
& = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \left[ \log p(v_i|c_j) - \log p(v_i) \right] \\
& = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i) \\
& = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i,c_j) \log p(v_i) \\
& = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} \log p(v_i) \sum_{c_j \in C} p(v_i,c_j) \\
& = \sum_{v_i \in F_a} \sum_{c_j \in C} p(v_i|c_j)\,p(c_j) \log p(v_i|c_j) - \sum_{v_i \in F_a} p(v_i) \log p(v_i)
\end{align}

If the original definition of the category utility from above is rewritten with C = \{c, \bar{c}\},

: CU(C,F) = \sum_{f_i \in F} \sum_{c_j \in C} p(f_i|c_j)\,p(c_j) \log p(f_i|c_j) - \sum_{f_i \in F} p(f_i) \log p(f_i)

This equation clearly has the same form as the final expression above for the mutual information between the feature set and the category variable; the difference is that the sum \textstyle \sum_{f_i \in F} in the category utility equation runs over independent binary variables F = \{f_i\},\ i=1 \ldots n, whereas the sum \textstyle \sum_{v_i \in F_a} in the mutual information runs over ''values'' of the single m^n-ary variable F_a. The two measures are actually equivalent then ''only'' when the features \{f_i\} are ''independent'' (and assuming that terms in the sum corresponding to p(\bar{f_i}) are also added).
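As a quick numeric check of this equivalence, the sketch below (hypothetical helper names; a hand-picked joint distribution over one aggregate feature variable with three values and two categories) computes both the mutual information and the rewritten category utility from the same joint table and confirms that they agree.

```python
import math

def mutual_information(joint):
    """I(F_a;C) from joint[j][i] = p(c_j, v_i)."""
    p_c = [sum(row) for row in joint]
    p_v = [sum(col) for col in zip(*joint)]
    return sum(p_ji * math.log2(p_ji / (p_c[j] * p_v[i]))
               for j, row in enumerate(joint)
               for i, p_ji in enumerate(row) if p_ji > 0)

def cu_from_joint(joint):
    """CU = sum_ij p(v_i|c_j) p(c_j) log p(v_i|c_j) - sum_i p(v_i) log p(v_i),
    using the identity p(v_i|c_j) p(c_j) = p(v_i, c_j)."""
    p_c = [sum(row) for row in joint]
    p_v = [sum(col) for col in zip(*joint)]
    conditional = sum(p_ji * math.log2(p_ji / p_c[j])
                      for j, row in enumerate(joint)
                      for p_ji in row if p_ji > 0)
    marginal = sum(pv * math.log2(pv) for pv in p_v if pv > 0)
    return conditional - marginal

joint = [[0.30, 0.10, 0.10],   # p(c_1, v_i)
         [0.05, 0.25, 0.20]]   # p(c_2, v_i)
assert abs(mutual_information(joint) - cu_from_joint(joint)) < 1e-12
```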


Insensitivity of category utility to ordinality

Like the mutual information, the category utility is not sensitive to any ''ordering'' in the feature or category variable values. That is, as far as the category utility is concerned, a category set \{c_1, c_2, c_3\} is not qualitatively different from the permuted category set \{c_3, c_1, c_2\}, since the formulation of the category utility does not account for any ordering of the class variable. Similarly, a feature variable adopting values \{1, 2, 3\} is not qualitatively different from a feature variable adopting values \{3, 1, 2\}. As far as the category utility or mutual information are concerned, ''all'' category and feature variables are ''nominal variables''. For this reason, category utility does not reflect any ''gestalt'' aspects of "category goodness" that might be based on such ordering effects. One possible adjustment for this insensitivity to ordinality is given by the weighting scheme described in the article for mutual information.
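This insensitivity is easy to verify with the category_utility sketch given earlier: arbitrarily re-coding the values of a feature (or the category labels) leaves the measure unchanged, since only the value counts enter the computation.

```python
# Re-coding feature values (here for the first feature) does not change CU,
# because category_utility treats all values as nominal.
instances = [("red", "round"), ("red", "square"),
             ("blue", "round"), ("blue", "square")]
labels = ["a", "a", "b", "b"]

recode = {"red": 2, "blue": 1}  # arbitrary new codes for the first feature
recoded = [(recode[x0], x1) for x0, x1 in instances]

assert abs(category_utility(instances, labels)
           - category_utility(recoded, labels)) < 1e-12
```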


Category "goodness": models and philosophy

This section provides some background on the origins of, and need for, formal measures of "category goodness" such as the category utility, and some of the history that led to the development of this particular metric.


What makes a good category?

At least since the time of Aristotle there has been a tremendous fascination in philosophy with the nature of concepts and universals. What kind of ''entity'' is a concept such as "horse"? Such abstractions do not designate any particular individual in the world, and yet we can scarcely imagine being able to comprehend the world without their use. Does the concept "horse" therefore have an independent existence outside of the mind? If it does, then what is the locus of this independent existence? The question of locus was an important issue on which the classical schools of Plato and Aristotle famously differed. However, they remained in agreement that universals ''did'' indeed have a mind-independent existence. There was, therefore, always a ''fact to the matter'' about which concepts and universals exist in the world.

In the late Middle Ages (perhaps beginning with Occam, although Porphyry also makes a much earlier remark indicating a certain discomfort with the status quo), however, the certainty that existed on this issue began to erode, and it became acceptable among the so-called nominalists and empiricists to consider concepts and universals as strictly mental entities or conventions of language. On this view of concepts, that is, as purely representational constructs, a new question then comes to the fore: "Why do we possess one set of concepts rather than another?" What makes one set of concepts "good" and another set of concepts "bad"? This is a question that modern philosophers, and subsequently machine learning theorists and cognitive scientists, have struggled with for many decades.


What purpose do concepts serve?

One approach to answering such questions is to investigate the "role" or "purpose" of concepts in cognition. Thus the answer given by many theorists to the question "What are concepts good for in the first place?" is that classification (conception) is a precursor to ''induction'': By imposing a particular categorization on the universe, an organism gains the ability to deal with physically non-identical objects or situations in an identical fashion, thereby gaining substantial predictive leverage. J.S. Mill argued along just these lines, and from this base reached a conclusion that foreshadows much subsequent thinking about category goodness, including the notion of category utility. One may compare this to the "category utility hypothesis" proposed by Corter & Gluck (1992): "A category is useful to the extent that it can be expected to improve the ability of a person to accurately predict the features of instances of that category." Mill, on this reading, seems to be suggesting that the best category structure is one in which object features (properties) are maximally informative about the object's class, and, simultaneously, the object class is maximally informative about the object's features. In other words, a useful classification scheme is one in which category knowledge can be used to accurately infer object properties, and property knowledge can be used to accurately infer object classes. One may also compare this idea to Aristotle's criterion of ''counter-predication'' for definitional predicates, as well as to the notion of concepts described in formal concept analysis.


Attempts at formalization

A variety of different measures have been suggested with an aim of formally capturing this notion of "category goodness," the best known of which is probably the "cue validity". Cue validity of a feature f_i with respect to category c_j is defined as the conditional probability of the category given the feature, p(c_j|f_i), or as the deviation of this conditional probability from the category base rate, p(c_j|f_i) - p(c_j). Clearly, these measures quantify only inference from feature to category (i.e., ''cue validity''), but not from category to feature, i.e., the ''category validity'' p(f_i|c_j). Also, while the cue validity was originally intended to account for the demonstrable appearance of ''basic categories'' in human cognition (categories of a particular level of generality that are evidently preferred by human learners), a number of major flaws in the cue validity quickly emerged in this regard. One attempt to address both problems by simultaneously maximizing both feature validity and category validity was made by Jones (1983) in defining the "collocation index" as the product p(c_j|f_i)\,p(f_i|c_j), but this construction was fairly ad hoc. The category utility was introduced as a more sophisticated refinement of the cue validity, which attempts to more rigorously quantify the full inferential power of a class structure. As shown above, on a certain view the category utility is equivalent to the mutual information between the feature variable and the category variable. It has been suggested that categories having the greatest overall category utility are not only those "best" in a normative sense, but also those that human learners prefer to use, e.g., "basic" categories. Other related measures of category goodness are "cohesion" and "salience".


Applications

* Category utility is used as the category evaluation measure in the popular conceptual clustering algorithm called COBWEB (Fisher 1987); a sketch of using category utility to score candidate partitions follows below.
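As an illustration only (this is not Fisher's COBWEB algorithm, merely a reuse of the category_utility sketch from earlier), a conceptual clustering procedure can score candidate partitions of the same instances and retain the highest-scoring one:

```python
# Score several candidate partitions of the same instances by their
# category utility and keep the best; COBWEB applies this kind of
# evaluation incrementally while building a hierarchy.
instances = [(0, 0), (0, 1), (1, 0), (1, 1)]
candidate_partitions = [
    ["a", "a", "b", "b"],   # split on the first feature
    ["a", "b", "a", "b"],   # split on the second feature
    ["a", "b", "b", "a"],   # an arbitrary split
]
best = max(candidate_partitions,
           key=lambda labels: category_utility(instances, labels))
# Either single-feature split (CU = 0.25) beats the arbitrary split (CU = 0).
```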


See also

* Abstraction
* Concept learning
* Universals
* Unsupervised learning


References

* Corter, J. E., & Gluck, M. A. (1992). "Explaining basic categories: Feature predictability and information." ''Psychological Bulletin'', 111(2), 291–303.
* Fisher, D. H. (1987). "Knowledge acquisition via incremental conceptual clustering." ''Machine Learning'', 2, 139–172.
* Gluck, M. A., & Corter, J. E. (1985). "Information, uncertainty, and the utility of categories." ''Proceedings of the Seventh Annual Conference of the Cognitive Science Society'', 283–287.
* Jones, G. V. (1983). "Identifying basic categories." ''Psychological Bulletin'', 94(3), 423–428.
* Reed, S. K. (1972). "Pattern recognition and categorization." ''Cognitive Psychology'', 3, 382–407.
* Rosch, E., & Mervis, C. B. (1975). "Family resemblances: Studies in the internal structure of categories." ''Cognitive Psychology'', 7, 573–605.
* Witten, I. H., & Frank, E. (2005). ''Data Mining: Practical Machine Learning Tools and Techniques''. Morgan Kaufmann.