Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.
Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things) with spam and non-spam e-mails and then using Bayes' theorem to calculate a probability that an email is or is not spam.
Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s.
History
Bayesian algorithms were used for email filtering as early as 1996. Although naive Bayesian filters did not become popular until later, multiple programs were released in 1998 to address the growing problem of unwanted email. The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998. That work was soon thereafter deployed in commercial spam filters.
Variants of the basic technique have been implemented in a number of research works and commercial software products. Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs. Server-side email filters, such as DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within mail server software itself.
CRM114, often cited as a Bayesian filter, is not intended to use a Bayes filter in production, but includes the "unigram" feature for reference.
Process
Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.
After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email (or only the most interesting words) contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.
As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.
The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever-evolving nature of spam.
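The training and correction steps described above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular filter: the `tokenize` helper, the `WordCounts` class and the sample messages are all hypothetical, and a real filter would use a more careful tokenizer and persistent storage.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: the set of lowercase word tokens in a message."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


class WordCounts:
    """Per-class counts that are updated each time the user labels a message."""

    def __init__(self):
        self.messages = {"spam": 0, "ham": 0}
        self.containing = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        # Count the message once, and once for every distinct word it contains.
        self.messages[label] += 1
        self.containing[label].update(tokenize(text))

    def untrain(self, text, label):
        # Undo an earlier train() call made with the wrong label.
        self.messages[label] -= 1
        self.containing[label].subtract(tokenize(text))

    def correct(self, text, wrong_label, right_label):
        # Refine the model after a false positive or false negative.
        self.untrain(text, wrong_label)
        self.train(text, right_label)

    def frequency(self, word, label):
        # Fraction of messages of this class that contain the word.
        total = self.messages[label]
        return self.containing[label][word] / total if total else 0.0


counts = WordCounts()
counts.train("Cheap replica watches, refinance now", "spam")
counts.train("Replica watches at low prices", "spam")
counts.train("Lunch with Alice on Friday?", "ham")
counts.train("Alice sent the refinance paperwork", "ham")

print(counts.frequency("replica", "spam"))  # fraction of spam containing "replica"
print(counts.frequency("alice", "ham"))     # fraction of ham containing "alice"
```

Storing raw counts rather than probabilities is a design choice made for this sketch: it lets a wrong judgement be reversed exactly with `correct`, after which the frequencies are simply recomputed.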
Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.
Mathematical foundation
Bayesian email filters utilize Bayes' theorem. Bayes' theorem is used several times in the context of spam:
* a first time, to compute the probability that the message is spam, knowing that a given word appears in this message;
* a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them);
* sometimes a third time, to deal with rare words.
Computing the probability that a message containing a given word is spam
Let's suppose the suspected message contains the word "replica". Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities.
The formula used by the software to determine that is derived from Bayes' theorem:
: $\Pr(S|W) = \frac{\Pr(W|S) \cdot \Pr(S)}{\Pr(W|S) \cdot \Pr(S) + \Pr(W|H) \cdot \Pr(H)}$
where:
* $\Pr(S|W)$ is the probability that a message is a spam, knowing that the word "replica" is in it;
* $\Pr(S)$ is the overall probability that any given message is spam;
* $\Pr(W|S)$ is the probability that the word "replica" appears in spam messages;
* $\Pr(H)$ is the overall probability that any given message is not spam (is "ham");
* $\Pr(W|H)$ is the probability that the word "replica" appears in ham messages.
(For a full demonstration, see the extended form of Bayes' theorem.)
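As a minimal sketch, the formula above translates directly into a small Python function. The function name, the parameter names and the example numbers are assumptions made for illustration; they do not come from any real filter or dataset.

```python
def spam_given_word(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Pr(S|W) from Pr(W|S), Pr(W|H) and the prior Pr(S), using Bayes' theorem."""
    p_ham = 1.0 - p_spam  # Pr(H): a message is either spam or ham
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    if denominator == 0.0:
        raise ValueError("the word was never seen in spam or in ham")
    return numerator / denominator


# Hypothetical figures: "replica" appears in 20% of spam and in 1% of ham.
print(spam_given_word(0.2, 0.01, p_spam=0.8))  # about 0.988 with an 80% spam prior
print(spam_given_word(0.2, 0.01))              # about 0.952 with equal priors
```

The two calls differ only in the prior $\Pr(S)$, which shows how much the assumed overall spam rate shifts the result for the same word statistics.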
The spamliness of a word
Statistics show that the current probability of any message being spam is 80%, at the very least:
: $\Pr(S) = 0.8; \quad \Pr(H) = 0.2$
However, most Bayesian spam detection software makes the assumption that there is no ''a priori'' reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%:
: $\Pr(S) = 0.5; \quad \Pr(H) = 0.5$
The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption permits simplifying the general formula to:
: $\Pr(S|W) = \frac{\Pr(W|S)}{\Pr(W|S) + \Pr(W|H)}$
This is functionally equivalent to asking what percentage of occurrences of the word "replica" appear in spam messages.
This quantity is called the "spamicity" (or "spaminess") of the word "replica", and can be computed. The number $\Pr(W|S)$ used in this formula is approximated by the frequency of messages containing "replica" among the messages identified as spam during the learning phase. Similarly, $\Pr(W|H)$ is approximated by the frequency of messages containing "replica" among the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough. It is also advisable that the learned set of messages conforms to the 50% hypothesis about the split between spam and ham, i.e. that the datasets of spam and ham are of the same size.
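The estimate described above can be sketched as follows. The function name and the message counts are hypothetical; the sketch only assumes that the learning phase has recorded how many spam and ham messages were seen, and how many of each contained the word.

```python
def spamicity(spam_with_word, spam_total, ham_with_word, ham_total):
    """Spamicity of a word under the 50% hypothesis, estimated from counts."""
    p_word_given_spam = spam_with_word / spam_total  # approximates Pr(W|S)
    p_word_given_ham = ham_with_word / ham_total     # approximates Pr(W|H)
    if p_word_given_spam + p_word_given_ham == 0.0:
        raise ValueError("the word was never seen during learning")
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)


# Equal-sized corpora: the result equals the share of occurrences found in spam.
balanced = spamicity(300, 1000, 10, 1000)
print(balanced, 300 / (300 + 10))

# Unequal corpora with the same per-class frequencies: the spamicity is
# unchanged, but the raw share of occurrences (300 out of 305) no longer matches.
unbalanced = spamicity(300, 1000, 5, 500)
print(unbalanced, 300 / (300 + 5))
```

The second call illustrates why learned sets of the same size are advisable: only then does the simplified formula coincide with the plain "percentage of occurrences in spam" reading.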
Of course, determining whether a message is spam or ham based only on the presence of the word "replica" is error-prone, which is why Bayesian spam software tries to consider several words and combine their spamicities to determine a message's overall probability of being spam.
Combining individual probabilities
Most Bayesian spam filtering algorithms are based on formulas that are strictly valid (from a probabilistic standpoint) only if the words present in the message are independent events. This condition is not generally satisfied (for example, in natural languages like English the probability of finding an adjective is affected by the probability of having a noun), but it is a useful idealization, especially since the statistical correlations between individual words are usually not known. On this basis, one can derive the following formula from Bayes' theorem:
: $p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$
where:
* $p$ is the probability that the suspect message is spam;
* $p_1$ is the probability $\Pr(S|W_1)$ that the message is spam, knowing that it contains the first word (for example "replica");
* $p_2$ is the probability $\Pr(S|W_2)$ that the message is spam, knowing that it contains the second word (for example "watches");
* etc.
Spam filtering software based on this formula is sometimes referred to as a naive Bayes classifier, as "naive" refers to the strong independence assumptions between the features. The result ''p'' is typically compared to a given threshold to decide whether the message is spam or not. If ''p'' is lower than the threshold, the message is considered as likely ham, otherwise it is considered as likely spam.
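A minimal Python sketch of this combination and threshold test is given below. It assumes that each input value is the spamicity of one word found in the message; the function names, the 0.95 default threshold and the example values are illustrative assumptions only.

```python
from math import prod


def combine(spamicities):
    """Combine per-word spamicities into the message's spam probability p."""
    spam_side = prod(spamicities)                  # p_1 * p_2 * ... * p_N
    ham_side = prod(1.0 - p for p in spamicities)  # (1 - p_1) * ... * (1 - p_N)
    return spam_side / (spam_side + ham_side)


def is_spam(spamicities, threshold=0.95):
    """A message is treated as likely spam unless p is lower than the threshold."""
    return combine(spamicities) >= threshold


# Hypothetical spamicities for the words of two messages.
print(combine([0.97, 0.9, 0.2]), is_spam([0.97, 0.9, 0.2]))  # p is about 0.986
print(combine([0.6, 0.3]), is_spam([0.6, 0.3]))              # p is about 0.391
```

In the first message, two strongly spam-associated words outweigh one ham-associated word; in the second, the evidence is weak and the message stays below the threshold.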
Alternative expression of the formula for combining individual probabilities
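Usually ''p'' is not directly computed using the above formula due to floating-point underflow. Instead, ''p'' can be computed in the log domain by rewriting the original equation as follows:
: $\frac{1}{p} - 1 = \prod_{i=1}^{N} \frac{1 - p_i}{p_i}$
Taking logs on both sides:
: $\ln\left(\frac{1}{p} - 1\right) = \sum_{i=1}^{N} \left[\ln(1 - p_i) - \ln p_i\right]$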
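The sketch below contrasts the direct product formula with the log-domain form. It assumes every spamicity lies strictly between 0 and 1. The name `eta` for the sum on the right-hand side is introduced here for illustration, and the last step solves $\ln(1/p - 1) = \eta$ for $p$, giving $p = 1/(1 + e^{\eta})$.

```python
import math


def combine_direct(spamicities):
    """Product formula; both products can underflow to 0.0 on long messages."""
    spam_side = math.prod(spamicities)
    ham_side = math.prod(1.0 - p for p in spamicities)
    return spam_side / (spam_side + ham_side)


def combine_log(spamicities):
    """Log-domain formula: sum the logs, then solve ln(1/p - 1) = eta for p."""
    eta = sum(math.log(1.0 - p) - math.log(p) for p in spamicities)
    if eta >= 0:
        # Rewrite 1 / (1 + e^eta) so that exp() cannot overflow for large eta.
        z = math.exp(-eta)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(eta))


# On a short message the two forms agree.
print(combine_direct([0.97, 0.9, 0.2]), combine_log([0.97, 0.9, 0.2]))

# A hypothetical long message with balanced evidence: 2000 words at 0.01
# and 2000 words at 0.99. Both direct products underflow to 0.0.
long_message = [0.01] * 2000 + [0.99] * 2000
try:
    combine_direct(long_message)
except ZeroDivisionError:
    print("direct formula failed: 0.0 / 0.0")
print(combine_log(long_message))  # close to 0.5
```

Summing logarithms keeps the intermediate value in a comfortable numeric range, so the result depends only on the balance of evidence and not on the message length.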