A Simplified Understanding of Naïve Bayes Algorithm

Commonly used in machine learning, Naive Bayes is a collection of classification algorithms based on Bayes’ theorem. It is not a single algorithm but a family of algorithms that all share a common principle: every feature being classified is independent of every other feature. For example, a vegetable may be considered a chili if it is red, tastes hot, and is about 4″ in length. A Naive Bayes classifier treats each of these “features” (red, hot, about 4″ long) as contributing independently to the probability that the vegetable is a chili, regardless of any correlations between the features. Features, however, are not always independent, which is often seen as a shortcoming of the Naive Bayes algorithm and is why it is labelled “naive”.
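To make the independence assumption concrete, here is a minimal sketch in Python. The prior and per-feature probabilities are made up for illustration; in practice they would be estimated from training data.

# Hypothetical, made-up probabilities purely for illustration.
prior = {"chili": 0.3, "not_chili": 0.7}

# P(feature | class) for each feature, assumed independent given the class.
likelihood = {
    "chili":     {"red": 0.8, "hot": 0.9, "about_4_inches": 0.7},
    "not_chili": {"red": 0.3, "hot": 0.1, "about_4_inches": 0.4},
}

observed = ["red", "hot", "about_4_inches"]

# The "naive" step: multiply per-feature probabilities as if they were independent.
score = {}
for cls in prior:
    score[cls] = prior[cls]
    for feature in observed:
        score[cls] *= likelihood[cls][feature]

# Normalise the scores into posterior probabilities.
total = sum(score.values())
for cls in score:
    print(cls, round(score[cls] / total, 3))

The “naive” part is the inner loop: each feature’s probability is multiplied in on its own, with no attempt to model how redness, hotness, and length might vary together.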

Bayes’ theorem

This theorem forms the core of naive Bayes classification. To understand how naive Bayes classifiers work, we briefly recapitulate Bayes’ rule. The probability model formulated by Thomas Bayes (1701-1761) is simple yet powerful; in words, it can be written as follows:

posterior probability = (conditional probability × prior probability) / evidence

or, in notation: P(class_i | features) = P(features | class_i) · P(class_i) / P(features)

In the context of a classification problem, it can be interpreted as: “What is the probability that a particular object belongs to class i given its observed feature values?”
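In code, that interpretation becomes a simple decision rule: score each class by its prior multiplied by the likelihood of the observed features, and pick the class with the highest score. The numbers below are placeholders rather than estimates from any real data; note that the denominator P(features) is the same for every class, so it can be dropped when comparing.

# Placeholder numbers: P(class_i) and P(features | class_i) for two classes.
priors = {"class_1": 0.6, "class_2": 0.4}
likelihoods = {"class_1": 0.02, "class_2": 0.05}

# Score each class by prior * likelihood; the shared denominator P(features)
# is identical for every class, so it is omitted from the comparison.
scores = {cls: priors[cls] * likelihoods[cls] for cls in priors}
predicted = max(scores, key=scores.get)

for cls, s in scores.items():
    print(cls, round(s, 3))
print("predicted class:", predicted)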

To put it simply, Naive Bayes works well with linear class boundaries, as shown below:

[Figure: two classes separated by a linear decision boundary, of the kind Naive Bayes handles well]

For a more detailed understanding, you may check out this post on Naive Bayes.

Why is the Naive Bayes algorithm fast?

Naive Bayes is fast because all it needs are the prior (a priori) and conditional probabilities, which can be ‘learnt’, or rather determined, with trivial operations such as counting and dividing. These values do not change, so they can be stored and reused.
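Here is a minimal sketch of that counting step on a toy categorical dataset; the feature values and labels are invented purely for illustration.

from collections import Counter, defaultdict

# Toy training data: (feature value, label) pairs; all values are invented.
data = [("red", "chili"), ("red", "chili"), ("green", "chili"),
        ("red", "tomato"), ("red", "tomato"), ("green", "capsicum")]

label_counts = Counter(label for _, label in data)
pair_counts = Counter(data)

# Priors: P(label) = count(label) / total number of examples
priors = {label: count / len(data) for label, count in label_counts.items()}

# Conditional probabilities: P(feature | label) = count(feature, label) / count(label)
conditionals = defaultdict(dict)
for (feature, label), count in pair_counts.items():
    conditionals[label][feature] = count / label_counts[label]

print(priors)             # e.g. P(chili) = 3/6 = 0.5
print(dict(conditionals)) # e.g. P(red | chili) = 2/3

Everything is a single pass of counting followed by a division, which is why training is so cheap compared with models that need iterative optimization.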

Difference between Naïve Bayes and Logistic regression

Reference: Brendan O’Connor’s answer on Quora

The difference between Naïve Bayes and logistic regression comes down to how you fit the weights from training data.

In NB, you set each feature’s weight independently, based on how much it correlates with the label. (Weights come out to be the features’ log-likelihood ratios for the different classes.)

In logistic regression, by contrast, you set all the weights together such that the linear decision function tends to be high for positive classes and low for negative classes. (Linear SVMs work the same, except for a technical tweak of what “tends to be high/low” means.)

The difference between NB and LogReg shows up when features are correlated. Say you have two features which are useful predictors (they correlate with the labels) but are themselves repetitive, having extra correlation with each other as well. NB will give both of them strong weights, so their influence is double-counted; logistic regression will compensate by weighting them lower.

This is a way to view the probabilistic assumptions of the models; namely, Naive Bayes makes a conditional independence assumption, which is violated when you have correlated/repetitive features.

One nice thing about NB is that training has no optimization step. You just calculate a count table for each feature and you are done; it is a single pass over the data and trivially parallelizable.

One nice thing about LR is that you can be sloppy with feature engineering. You can throw in multiple variations of a feature without hurting the overall model (provided you’re regularizing appropriately); but in NB this can be problematic.
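The double-counting effect is easy to see empirically. Below is a small sketch using scikit-learn’s BernoulliNB and LogisticRegression on artificial data, where one useful binary feature is given to each model once and then twice; the data generation and the numbers are purely illustrative. Naive Bayes should push its predicted probability further towards the extremes when the feature is duplicated, while the (L2-regularized) logistic regression should split the weight between the two copies so that its prediction barely moves.

import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

# Artificial data: one noisy but useful binary feature (flipped 20% of the time).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
x = np.where(rng.random(1000) < 0.2, 1 - y, y).astype(float)

X_once = x.reshape(-1, 1)              # the feature given to the model once
X_twice = np.column_stack([x, x])      # the exact same feature given twice

for name, model in [("BernoulliNB", BernoulliNB()),
                    ("LogisticRegression", LogisticRegression())]:
    p_once = model.fit(X_once, y).predict_proba([[1.0]])[0, 1]
    p_twice = model.fit(X_twice, y).predict_proba([[1.0, 1.0]])[0, 1]
    print(f"{name}: P(y=1 | feature on) once={p_once:.3f}  duplicated={p_twice:.3f}")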

Reference: Naive Bayes and Text Classification – Introduction and Theory
