Naive Bayes from scratch [Python 3.6]

The focus of this article is to introduce Bayesian decision theory from a practical point of view by implementing a naive Bayes classifier in Python.


  1. Bayesian Decision Theory
  2. Naive Bayes

Bayesian Decision Theory:

Conditional Probabilities:

Let’s start with a refresher of some probability rules. Recall that the probability of an event (A) given that an event (B) occurs is simply the joint probability of both events, normalized by the probability of the condition (B):

    P(A|B) = P(A ∩ B) / P(B)


A good way to visualize this equation is to imagine the following scenario:


Event A: You look outside and it is raining.
Event B: You look outside and there is thunder.
P(A ∩ B): The proportion of times where both thunder and rain were observed.
P(B): The proportion of times where thunder was observed.
P(A|B): Given that you looked outside and there is thunder, what is the probability that it’s raining? Simply put, it’s the number of times rain and thunder were observed together, divided by the number of times thunder was observed (with or without rain).

It’s much like flipping a coin, counting how many times you got heads, and dividing by the number of trials. The difference is that here we divide the joint count by a subset of the trials (those where B occurred) rather than by all of them.
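
The counting view above can be sketched in a few lines of Python. The observation log below is made up purely for illustration, and the variable names are not from the original article; exact fractions are used so the ratio is not blurred by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical log: one (raining, thunder) pair per look outside.
observations = [
    (True, True), (True, True), (True, True),   # rain and thunder
    (True, False), (True, False),               # rain only
    (False, True),                              # thunder only
    (False, False), (False, False),
    (False, False), (False, False),             # neither
]

n = len(observations)

# P(A ∩ B): fraction of looks with both rain and thunder.
p_a_and_b = Fraction(sum(1 for rain, thunder in observations if rain and thunder), n)

# P(B): fraction of looks with thunder, with or without rain.
p_b = Fraction(sum(1 for _, thunder in observations if thunder), n)

# P(A|B): the joint probability normalized by the condition.
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 3/4
```

Out of the 4 looks with thunder, 3 also had rain, which is where the 3/4 comes from.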

Bayes Rule:

Bayes’ rule relies on the fact that the joint probability of two events is symmetric, i.e. it does not depend on the order in which the events are written:

    P(A ∩ B) = P(B ∩ A)



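If you write down the conditional probability equations for P(A|B) and P(B|A), each can be solved for the joint probability, and because the joint probability is the same in both, the two expressions are equal:

    P(A|B) P(B) = P(A ∩ B) = P(B|A) P(A)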


You’ll end up with Bayes’ rule:

    P(A|B) = P(B|A) P(A) / P(B)
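
As a quick sanity check, the rule P(A|B) = P(B|A) P(A) / P(B) can be verified numerically. The counts below are hypothetical (10 looks outside, with A = rain and B = thunder) and only serve to show that computing P(A|B) directly and computing it through Bayes’ rule agree.

```python
from fractions import Fraction

# Hypothetical counts, chosen only for illustration.
n_total = 10    # number of looks outside
n_rain = 5      # looks where event A (rain) occurred
n_thunder = 4   # looks where event B (thunder) occurred
n_both = 3      # looks where A and B occurred together

p_a = Fraction(n_rain, n_total)
p_b = Fraction(n_thunder, n_total)

# Conditional probabilities obtained directly by counting.
p_a_given_b = Fraction(n_both, n_thunder)
p_b_given_a = Fraction(n_both, n_rain)

# The same P(A|B), obtained through Bayes' rule instead.
p_a_given_b_via_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, p_a_given_b_via_bayes)  # 3/4 3/4
```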


In this generic form Bayes’ rule can be hard to build an intuition for, so we will swap some of the notation and look at it from a machine learning point of view:

    P(Ci|x) = P(x|Ci) P(Ci) / P(x)


P(Ci|x): The probability that the data (x) belongs to class Ci, given that x was observed. [posterior]

P(x|Ci): The likelihood of observing the data (x) given that it belongs to class Ci. [likelihood]

P(Ci): The prior probability of class Ci. For example, if you have 3 classes [C1, C2, C3], then P(C1) = 1/3 if the classes are all equally likely. [prior]

P(x): The probability of observing the data (x) regardless of which class it belongs to. [evidence]

What Bayes’ rule says in this case is that the probability that the data (x) belongs to a certain class (Ci) is the likelihood of the data being observed in that class, multiplied by the prior probability of the class, and normalized by the probability of the evidence.
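
A minimal sketch of this computation, assuming the likelihoods and priors are already known. The function name `posteriors` and the three likelihood values are hypothetical; the priors reuse the 3-class example above. The evidence P(x) is obtained by summing likelihood times prior over all classes.

```python
from fractions import Fraction


def posteriors(likelihoods, priors):
    """Return P(Ci|x) for every class Ci.

    likelihoods: dict mapping class name -> P(x|Ci)
    priors:      dict mapping class name -> P(Ci)
    """
    # Numerator of Bayes' rule for each class: P(x|Ci) * P(Ci).
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Evidence P(x): the sum of the numerators over all classes.
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Three equally likely classes, with made-up likelihoods for some data x.
priors = {"C1": Fraction(1, 3), "C2": Fraction(1, 3), "C3": Fraction(1, 3)}
likelihoods = {"C1": Fraction(1, 5), "C2": Fraction(1, 2), "C3": Fraction(3, 10)}

result = posteriors(likelihoods, priors)
for name, p in result.items():
    print(name, p)
```

Because the evidence is the sum of the numerators, the posteriors always add up to 1, which is what makes them usable as class probabilities.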

If things are still vague, the following example will clear things up:


We have two classes, A and B. They are equally likely, hence the priors are:

    P(A) = P(B) = 0.5


When we look at our dataset, we can see that the given data (x) is labeled (A) 4 times and (B) 4 times as well. Assuming a uniform distribution, the likelihoods are therefore equal:

    P(x|A) = P(x|B)

The only thing that remains is P(x).

Remember that P(x) is a normalizer. It accounts for the occurrences of the data (x) in both class A and class B.

What we want is the sum of the joint probabilities of the data (x) with each class (the marginal probability):

    P(x) = P(x|A) P(A) + P(x|B) P(B)


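This is exactly the numerator, but summed over all of the classes. That’s why it’s a normalizing factor. Let’s plug in the values for class A and see the result. Because x appears equally often in both classes, P(x|A) = P(x|B), and the likelihoods cancel:

    P(A|x) = P(x|A) P(A) / (P(x|A) P(A) + P(x|B) P(B)) = 0.5 / (0.5 + 0.5) = 0.5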




And if you calculate the result for class B, it’ll be 0.5 as well.


I hope this example shows that if we have a labeled dataset distributed in some fashion (uniform, Bernoulli, Gaussian, etc.), we can use Bayes’ rule to train a classifier on it that returns the probability that the data belongs to a certain class.

In the next section we will use what we’ve learned to formulate a naive Bayes classifier that assumes a Gaussian distribution.
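
To make the idea of training on a labeled dataset concrete, here is a rough sketch that estimates the priors and likelihoods by counting, then applies Bayes’ rule. The dataset is hypothetical: it mirrors the example above (x labeled A 4 times and B 4 times), with a second made-up value "y" added so each class has 8 samples. The helpers `fit` and `predict_proba` are illustrative names, not part of any library.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical labeled dataset of (value, label) pairs.
dataset = (
    [("x", "A")] * 4 + [("y", "A")] * 4 +
    [("x", "B")] * 4 + [("y", "B")] * 4
)


def fit(samples):
    """Count samples per class and per (value, class) pair."""
    class_counts = Counter(label for _, label in samples)
    pair_counts = Counter(samples)
    return class_counts, pair_counts


def predict_proba(value, class_counts, pair_counts):
    """Return P(class | value) for every class using Bayes' rule."""
    total = sum(class_counts.values())
    joint = {}
    for label, n_label in class_counts.items():
        prior = Fraction(n_label, total)                             # P(Ci)
        likelihood = Fraction(pair_counts[(value, label)], n_label)  # P(x|Ci)
        joint[label] = likelihood * prior
    # P(x); this is zero (and the division fails) for a value never seen.
    evidence = sum(joint.values())
    return {label: p / evidence for label, p in joint.items()}


class_counts, pair_counts = fit(dataset)
probabilities = predict_proba("x", class_counts, pair_counts)
print({label: float(p) for label, p in probabilities.items()})  # {'A': 0.5, 'B': 0.5}
```

With this data both the priors and the likelihoods come out to 1/2, so the posterior for each class is 0.5, matching the hand calculation.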