Naive Bayes classifier In statistics, naive Bayes classifiers are a family of simple " probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Baye ...

s are a popular

statistical Statistics (from German: '' Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industr ...

technique Technique or techniques may refer to: Music * The Techniques, a Jamaican rocksteady vocal group of the 1960s *Technique (band), a British female synth pop band in the 1990s * ''Technique'' (album), by New Order, 1989 * ''Techniques'' (album), by M ...

e-mail filtering Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages at an SMTP server, possibly appl ...

. They typically use bag-of-words features to identify

email spam Email spam, also referred to as junk email, spam mail, or simply spam, is unsolicited messages sent in bulk by email ( spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoi ...

, an approach commonly used in

text classification Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectual ...

. Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using

Bayes' theorem In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule), named after Thomas Bayes, describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For exa ...

to calculate a probability that an email is or is not spam. Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low

false positive A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test resul ...

spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s.

History

Bayesian algorithms were used for email filtering as early as 1996. Although naive Bayesian filters did not become popular until later, multiple programs were released in 1998 to address the growing problem of unwanted email. The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998. That work was soon thereafter deployed in commercial spam filters. Variants of the basic technique have been implemented in a number of research works and commercial

software Software is a set of computer programs and associated software documentation, documentation and data (computing), data. This is in contrast to Computer hardware, hardware, from which the system is built and which actually performs the work. ...

products. Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs.

Server-side In the client–server model, server-side refers to programs and operations that run on the server. This is in contrast to client-side programs and operations which run on the client. General concepts Typically, a server is a computer applicati ...

email filters, such as DSPAM,

SpamAssassin Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is ...

SpamBayes SpamBayes is a Bayesian spam filter written in Python which uses techniques laid out by Paul Graham in his essay "A Plan for Spam". It has subsequently been improved by Gary Robinson and Tim Peters, among others. The most notable difference ...

Bogofilter Bogofilter is a mail filter that classifies e-mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections. It was originally writ ...

and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within

mail server Within the Internet email system, a message transfer agent (MTA), or mail transfer agent, or mail relay is software Software is a set of computer programs and associated software documentation, documentation and data (computing), data. Thi ...

software itself. CRM114, oft cited as a Bayesian filter, is not intended to use a Bayes filter in production, but includes the ″unigram″ feature for reference.

Process

Particular words have particular

probabilities Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speakin ...

of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "

Viagra Sildenafil, sold under the brand name Viagra, among others, is a medication used to treat erectile dysfunction and pulmonary arterial hypertension. It is unclear if it is effective for treating sexual dysfunction in women. It is taken by m ...

" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members. After training, the word probabilities (also known as

likelihood function The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...

s) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability, or only the most interesting words. This contribution is called the

posterior probability The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood via an application of Bayes' rule. From an epistemological perspective, the posterior p ...

and is computed using

. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam. As in any other

spam filtering Various anti-spam techniques are used to prevent email spam (unsolicited bulk email). No technique is a complete solution to the spam problem, and each has trade-offs between incorrectly rejecting legitimate email ( false positives) as opposed t ...

technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implement

quarantine A quarantine is a restriction on the movement of people, animals and goods which is intended to prevent the spread of disease or pests. It is often used in connection to disease and illness, preventing the movement of those who may have been ...

mechanisms that define a time frame during which the user is allowed to review the software's decision. The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever-evolving nature of spam. Some spam filters combine the results of both Bayesian spam filtering and other

heuristics A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...

(pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.

Mathematical foundation

Bayesian

email filter Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages at an SMTP server, possibly appl ...

s utilize

. Bayes' theorem is used several times in the context of spam: * a first time, to compute the probability that the message is spam, knowing that a given word appears in this message; * a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them); * sometimes a third time, to deal with rare words.

Computing the probability that a message containing a given word is spam

Let's suppose the suspected message contains the word "

replica A 1:1 replica is an exact copy of an object, made out of the same raw materials, whether a molecule, a work of art, or a commercial product. The term is also used for copies that closely resemble the original, without claiming to be identical. Al ...

". Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities. The formula used by the software to determine that, is derived from

\Pr(S, W) = \frac

where: *

\Pr(S, W)

is the probability that a message is a spam, knowing that the word "replica" is in it; *

\Pr(S)

is the overall probability that any given message is spam; *

\Pr(W, S)

is the probability that the word "replica" appears in spam messages; *

\Pr(H)

is the overall probability that any given message is not spam (is "ham"); *

\Pr(W, H)

is the probability that the word "replica" appears in ham messages. (For a full demonstration, see Bayes' theorem#Extended form.)

The spamliness of a word

Statistics show that the current probability of any message being spam is 80%, at the very least: :

\Pr(S) = 0.8 ;  \Pr(H) = 0.2

However, most bayesian spam detection software makes the assumption that there is no ''a priori'' reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%: :

\Pr(S) = 0.5 ;  \Pr(H) = 0.5

The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption permits simplifying the general formula to: :

\Pr(S, W) = \frac

This is functionally equivalent to asking, "what percentage of occurrences of the word "replica" appear in spam messages?" This quantity is called "spamicity" (or "spaminess") of the word "replica", and can be computed. The number

\Pr(W, S)

used in this formula is approximated to the frequency of messages containing "replica" in the messages identified as spam during the learning phase. Similarly,

\Pr(W, H)

is approximated to the frequency of messages containing "replica" in the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough. It is also advisable that the learned set of messages conforms to the 50% hypothesis about repartition between spam and ham, i.e. that the datasets of spam and ham are of same size. Of course, determining whether a message is spam or ham based only on the presence of the word "replica" is error-prone, which is why bayesian spam software tries to consider several words and combine their spamicities to determine a message's overall probability of being spam.

Combining individual probabilities

Most bayesian spam filtering algorithms are based on formulas that are strictly valid (from a probabilistic standpoint) only if the words present in the message are

independent events Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are independent, statistically independent, or stochastically independent if, informally speaking, the occurrence of ...

. This condition is not generally satisfied (for example, in natural languages like English the probability of finding an adjective is affected by the probability of having a noun), but it is a useful idealization, especially since the statistical correlations between individual words are usually not known. On this basis, one can derive the following formula from Bayes' theorem: :

p = \frac

where: *

p

is the probability that the suspect message is spam; *

p_1

is the probability

p(W_1, S)

that the first word (for example "replica") appears, given that the message is spam; *

p_2

is the probability

p(W_2, S)

that the second word (for example "watches") appears, given that the message is spam; * etc... Spam filtering software based on this formula is sometimes referred to as a

naive Bayes classifier In statistics, naive Bayes classifiers are a family of simple " probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features (see Bayes classifier). They are among the simplest Baye ...

, as "naive" refers to the strong

independence Independence is a condition of a person, nation, country, or state in which residents and population, or some portion thereof, exercise self-government, and usually sovereignty, over its territory. The opposite of independence is the s ...

assumptions between the features. The result ''p'' is typically compared to a given threshold to decide whether the message is spam or not. If ''p'' is lower than the threshold, the message is considered as likely ham, otherwise it is considered as likely spam.

Other expression of the formula for combining individual probabilities

Usually ''p'' is not directly computed using the above formula due to floating-point underflow. Instead, ''p'' can be computed in the log domain by rewriting the original equation as follows: :

\frac - 1 = \frac

Taking logs on both sides: :

\ln \left ( \frac - 1  \right ) = \sum_^N \left \ln(1-p_i) - \ln p_i \right /math>

Let \eta = \sum_^N \left \ln(1-p_i) -\ln p_i \right . Therefore,

: \frac - 1 = e^\eta Hence the alternate formula for computing the combined probability:

: p = \frac

Dealing with rare words

In the case a word has never been met during the learning phase, both the numerator and the denominator are equal to zero, both in the general formula and in the spamicity formula. The software can decide to discard such words for which there is no information available. More generally, the words that were encountered only a few times during the learning phase cause a problem, because it would be an error to trust blindly the information they provide. A simple solution is to simply avoid taking such unreliable words into account as well. Applying again Bayes' theorem, and assuming the classification between spam and ham of the emails containing a given word ("replica") is a

random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. It is a mapping or a function from possible outcomes (e.g., the p ...

with

beta distribution In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval , 1in terms of two positive parameters, denoted by ''alpha'' (''α'') and ''beta'' (''β''), that appear as ...

, some programs decide to use a corrected probability: :

\Pr'(S, W) = \frac

where: *

\Pr'(S, W)

is the corrected probability for the message to be spam, knowing that it contains a given word ; *

s

is the ''strength'' we give to background information about incoming spam ; *

\Pr(S)

is the probability of any incoming message to be spam ; *

n

is the number of occurrences of this word during the learning phase ; *

\Pr(S, W)

is the spamicity of this word. (Demonstration:) This corrected probability is used instead of the spamicity in the combining formula.

\Pr(S)

can again be taken equal to 0.5, to avoid being too suspicious about incoming email. 3 is a good value for ''s'', meaning that the learned corpus must contain more than 3 messages with that word to put more confidence in the spamicity value than in the default value. This formula can be extended to the case where ''n'' is equal to zero (and where the spamicity is not defined), and evaluates in this case to

Pr(S)

Other heuristics

"Neutral" words like "the", "a", "some", or "is" (in English), or their equivalents in other languages, can be ignored. More generally, some bayesian filtering filters simply ignore all the words which have a spamicity next to 0.5, as they contribute little to a good decision. The words taken into consideration are those whose spamicity is next to 0.0 (distinctive signs of legitimate messages), or next to 1.0 (distinctive signs of spam). A method can be for example to keep only those ten words, in the examined message, which have the greatest absolute value , 0.5 − ''pI'', . Some software products take into account the fact that a given word appears several times in the examined message, others don't. Some software products use ''patterns'' (sequences of words) instead of isolated natural languages words. For example, with a "context window" of four words, they compute the spamicity of "Viagra is good for", instead of computing the spamicities of "Viagra", "is", "good", and "for". This method gives more sensitivity to context and eliminates the Bayesian noise better, at the expense of a bigger database.

Mixed methods

There are other ways of combining individual probabilities for different words than using the "naive" approach. These methods differ from it on the assumptions they make on the statistical properties of the input data. These different hypotheses result in radically different formulas for combining the individual probabilities. For example, assuming the individual probabilities follow a

chi-squared distribution In probability theory and statistics, the chi-squared distribution (also chi-square or \chi^2-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. The chi-squar ...

with 2''N'' degrees of freedom, one could use the formula: :

p = C^(-2 \ln(p_1 p_2 \cdots p_N), 2N) \,

where ''C''⁻¹ is the inverse of the chi-squared function. Individual probabilities can be combined with the techniques of the

Markovian discrimination Within the probability theory Markov model, Markovian discrimination in spam filtering is a method used in CRM114 and other spam filters to model the statistical behaviors of spam and nonspam more accurately than in simple Bayesian methods. A sim ...

too.

Discussion

Advantages

One of the main advantages of Bayesian spam filtering is that it can be trained on a per-user basis. The spam that a user receives is often related to the online user's activities. For example, a user may have been subscribed to an online newsletter that the user considers to be spam. This online newsletter is likely to contain words that are common to all newsletters, such as the name of the newsletter and its originating email address. A Bayesian spam filter will eventually assign a higher probability based on the user's specific patterns. The legitimate e-mails a user receives will tend to be different. For example, in a corporate environment, the company name and the names of clients or customers will be mentioned often. The filter will assign a lower spam probability to emails containing those names. The word probabilities are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is often superior to pre-defined rules. It can perform particularly well in avoiding false positives, where legitimate email is incorrectly classified as spam. For example, if the email contains the word "Nigeria", which is frequently used in

Advance fee fraud An advance-fee scam is a form of fraud and is one of the most common types of confidence tricks. The scam typically involves promising the victim a significant share of a large sum of money, in return for a small up-front payment, which the frauds ...

spam, a pre-defined rules filter might reject it outright. A Bayesian filter would mark the word "Nigeria" as a probable spam word, but would take into account other important words that usually indicate legitimate e-mail. For example, the name of a spouse may strongly indicate the e-mail is not spam, which could overcome the use of the word "Nigeria."

Disadvantages

Depending on the implementation, Bayesian spam filtering may be susceptible to

Bayesian poisoning Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering. Bayesian filtering relies on Bayesian probability to determine whether an incoming mail is spam ...

, a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources).

Spammer Spamming is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients for the purpose of commercial advertising, for the purpose of non-commercial proselytizing, for any prohibited purpose (especial ...

tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email's spam score, making it more likely to slip past a Bayesian spam filter. However, with (for example) Paul Graham's scheme only the most significant probabilities are used, so that padding the text out with non-spam-related words does not affect the detection probability significantly. Words that normally appear in large quantities in spam may also be transformed by spammers. For example, «Viagra» would be replaced with «Viaagra» or «V!agra» in the spam message. The recipient of the message can still read the changed words, but each of these words is met more rarely by the Bayesian filter, which hinders its learning process. As a general rule, this spamming technique does not work very well, because the derived words end up recognized by the filter just like the normal ones. Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked. The whole text of the message, or some part of it, is replaced with a picture where the same text is "drawn". The spam filter is usually unable to analyze this picture, which would contain the sensitive words like «Viagra». However, since many mail clients disable the display of linked pictures for security reasons, the spammer sending links to distant pictures might reach fewer targets. Also, a picture's size in bytes is bigger than the equivalent text's size, so the spammer needs more bandwidth to send messages directly including pictures. Some filters are more inclined to decide that a message is spam if it has mostly graphical contents. A solution used by

Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...

in its

Gmail Gmail is a free email service provided by Google. As of 2019, it had 1.5 billion active user (computing), users worldwide. A user typically accesses Gmail in a web browser or the official mobile app. Google also supports the use of email clien ...

email system is to perform an OCR (Optical Character Recognition) on every mid to large size image, analyzing the text inside.

General applications of Bayesian filtering

While Bayesian filtering is used widely to identify spam email, the technique can classify (or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. One example is a general purpose classification program calle
AutoClass
which was originally used to classify stars according to spectral characteristics that were otherwise too subtle to notice.

References

{{DEFAULTSORT:Bayesian Spam Filtering Spam filtering Bayesian estimation

History

Process

Mathematical foundation

Computing the probability that a message containing a given word is spam

The spamliness of a word

Combining individual probabilities

Other expression of the formula for combining individual probabilities

Dealing with rare words

Other heuristics

Mixed methods

Discussion

Advantages

Disadvantages

General applications of Bayesian filtering

See also

References