Channel: A Developer Diary

Introduction to Hidden Markov Model


Hidden Markov Model (HMM) is an unsupervised* machine learning algorithm that belongs to the family of graphical models. However, an HMM is often trained with supervised learning when training data is available. In this introduction to Hidden Markov Model we will cover the foundational concepts, the uses, the intuition behind the algorithms and some basic examples. A little knowledge of probability is enough to follow this article fully.

What kind of problem does Hidden Markov Model help to solve?

It’s important to understand where the Hidden Markov Model actually fits in. In short, HMM is a graphical model which is generally used to predict hidden states from sequential data such as weather, text or speech.

Let’s take an example. Say a dishonest casino uses two dice (assume each die has 6 sides); one of them is fair and the other is unfair. Unfair means that the die does not have the outcome probabilities (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). The casino randomly rolls one of the two dice at any given time. Now assume we do not know which die was used at what time (the state is hidden). However, we do know the outcome of each roll (1 to 6), that is, the sequence of throws (the observations). A Hidden Markov Model can use these observations to predict when the unfair die was used (the hidden state).

In the picture below,

  1. First plot shows the sequence of throws for each side (1 to 6) of the die (Assume each die has 6 sides).
  2. 2nd plot is the prediction of Hidden Markov Model. Red = Use of Unfair Die.
  3. 3rd plot is the true (actual) data. Red = Use of Unfair Die.
  4. 4th plot shows the difference between predicted and true data. You can see how well HMM performs.
  5. Ignore the 5th plot for now; it shows the prediction confidence.

Introduction to Hidden Markov Model adeveloperdiary.com

Basic Understanding of Markov Model:

Before going through the Hidden Markov Model, let’s build an intuition for the Markov Model. With this concept in hand, HMM will be easier to understand. The Markov Model has been used to model randomly changing systems such as weather patterns. In a Markov Model all the states are visible (observable).

The most important point the Markov Model establishes is that the future state/event depends only on the current state/event and not on any older states (this is known as the Markov Property). For example, if we consider a weather pattern (sunny, rainy & cloudy), then tomorrow’s weather depends only on today’s weather and not on yesterday’s weather.

Mathematically, the probability of the state at time t depends only on the state at time step t-1. In other words, we model the probability of s(t) given s(t-1), that is \( p(s(t) | s(t-1)) \). This is known as a First Order Markov Model.

If the probability of the state s at time t depends on time steps t-1 and t-2, it’s known as a 2nd Order Markov Model. As you increase the dependency on past time steps, the order increases. The 2nd Order Markov Model can be written as \( p(s(t) | s(t-1), s(t-2)) \).

Eventually, the idea is to model the joint probability, such as the probability of \( s^T = \{ s_1, s_2, s_3 \} \) where s1, s2 and s3 happen sequentially. We can use the chain rule of joint and conditional probability, and then apply the first-order Markov Property in the last step:

\[
\begin{align}
p(s_3,s_2,s_1) &= p(s_3|s_2,s_1)p(s_2,s_1) \\
&= p(s_3|s_2,s_1)p(s_2|s_1)p(s_1) \\
&= p(s_3|s_2)p(s_2|s_1)p(s_1)
\end{align}
\]

Below is the diagram of a simple Markov Model as defined in the above equation.

Transition Probabilities:

The probability of one state changing to another state is called the Transition Probability. So if there are 3 states (Sun, Cloud, Rain), there will be 9 Transition Probabilities in total. As you can see in the diagram, we have defined all of them. A Transition Probability is generally denoted by \( a_{ij} \), which can be interpreted as the probability that the system is in state j at time step t+1 given that it is in state i at time step t.

Mathematically,
\[
a_{ij} = p(\text{ } s(t+1) = j \text{ } | \text{ }s(t) = i \text{ })
\]

For example, in the above state diagram, the Transition Probability from Sun to Cloud is defined as \( a_{12} \). Note that the transition might also be to the same state. This is a valid scenario too. If we have sun on two consecutive days, then the Transition Probability from sun to sun at time step t+1 will be \( a_{11} \).

Generally, the Transition Probabilities are defined using an (M x M) matrix, known as the Transition Probability Matrix. We can define the Transition Probability Matrix for our above example model as:

\[
A = \begin{bmatrix}a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\]

One important property to notice: the system must move to some state (possibly the same one) at each step, so the transition probabilities out of the current state must sum to 1. In our example \( a_{11}+a_{12}+a_{13} \) should be equal to 1.

Mathematically,
\[
\sum_{j=1}^{M} a_{ij} = 1 \; \; \; \forall i
\]

Initial Probability Distribution:

The machine/system has to start from one state. The initial state distribution of the Markov Model (at time step t = 0) is denoted as \( \pi \); it’s an M dimensional row vector. All the probabilities must sum to 1, that is \( \sum_{i=1}^{M} \pi_i = 1 \). In an implementation, a simple choice is to assign the same probability to all the states. In our weather example, we can define the initial distribution as \( \pi = [ \frac{1}{3} \; \frac{1}{3} \; \frac{1}{3}] \)

Note that in some cases we may have \( \pi_i = 0 \) for states that cannot be the initial state.

Markov Chain:

There are basically 4 types of Markov Models. When the system is fully observable and autonomous, it is called a Markov Chain. What we have learned so far is an example of a Markov Chain. Hence we can conclude that a Markov Chain consists of the following parameters:

  • A set of M states
  • A transition probability matrix A
  • An initial probability distribution \( \pi \)
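The three parameters above are enough to compute the probability of any state sequence. The following is a minimal sketch, assuming a made-up transition matrix for the weather example (the numbers and the helper name `sequence_probability` are illustrative and not from the article):

```python
import numpy as np

# Hypothetical weather chain; the numbers are made up for illustration.
states = ["sun", "cloud", "rain"]
A = np.array([
    [0.6, 0.3, 0.1],   # transitions out of sun
    [0.3, 0.4, 0.3],   # transitions out of cloud
    [0.2, 0.3, 0.5],   # transitions out of rain
])
pi = np.array([1 / 3, 1 / 3, 1 / 3])

# Each row of A, and the vector pi, must sum to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)


def sequence_probability(seq, A, pi):
    """p(s_1, ..., s_T) = pi[s_1] * product of A[s_{t-1}, s_t] (first-order chain)."""
    idx = [states.index(s) for s in seq]
    prob = pi[idx[0]]
    for prev, cur in zip(idx[:-1], idx[1:]):
        prob *= A[prev, cur]
    return prob


p = sequence_probability(["sun", "sun", "rain"], A, pi)
print(p)  # pi_sun * a_sun,sun * a_sun,rain
```

With these assumed numbers the product is 1/3 x 0.6 x 0.1, which mirrors the factorization \( p(s_3|s_2)p(s_2|s_1)p(s_1) \) derived earlier.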

Final/Absorbing State:

When the transition probabilities from a state to every other state are zero, so that it can only transition to itself, it is known as a Final/Absorbing State. Once the system enters a Final/Absorbing State, it never leaves.
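As a small illustration, an absorbing state shows up in the transition matrix as a row whose diagonal entry is 1. The sketch below uses a hypothetical 3-state matrix (not taken from the article):

```python
import numpy as np

# Hypothetical 3-state chain in which state 2 is absorbing.
A = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.0, 0.0, 1.0]])

# State i is absorbing when a_ii = 1, so every other entry in row i is 0.
absorbing = np.flatnonzero(np.isclose(np.diag(A), 1.0))
print(absorbing)

# Starting in the absorbing state, the state distribution never changes.
dist = np.array([0.0, 0.0, 1.0])
after_ten_steps = dist @ np.linalg.matrix_power(A, 10)
print(after_ten_steps)
```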

Hidden Markov Model:

In a Hidden Markov Model the state of the system is hidden (unknown); however, at every time step t the system in state s(t) emits an observable/visible symbol v(t). You can see an example of a Hidden Markov Model in the diagram below.

In our initial example of the dishonest casino, the die rolled (fair or unfair) is unknown or hidden. However, every time a die is rolled we know the outcome (which is between 1 and 6); this is the observed symbol.

Notes:

  • We can define a particular sequence of visible/observable states/symbols as \( V^T = \{ v(1), v(2) \dots v(T) \} \)
  • We will define our model as \( \theta \), so in any state s(t) we have a probability of emitting a particular visible state \( v_k(t) \)
  • Since we have access only to the visible states, while the s(t)’s are unobservable, such a model is called a Hidden Markov Model
  • Networks like this are called Finite-State Machines
  • When they are associated with transition probabilities, they are called Markov Networks

Emission Probability:

Now, let’s redefine our previous example. Assume that the mood of a person changes between happy and sad based on the weather of the day. Also assume the person is at a remote place and we do not know what the weather is there. We can only know the mood of the person. So in this case, the weather is the hidden state in the model and the mood (happy or sad) is the visible/observable symbol. Using an HMM, we should be able to predict the weather just by knowing the mood of the person. If we redraw the states it would look like this:

The observable symbols are \( \{ v_1 , v_2 \} \), one of which must be emitted from each state at every time step. The probability of emitting a symbol is known as the Emission Probability, generally denoted \( b_{jk}\): the probability of emitting symbol k given state j.
\[
b_{jk} = p(v_k(t) | s_j(t) )
\]

Emission probabilities are also defined using an M x C matrix (M hidden states, C visible symbols), named the Emission Probability Matrix.
\[
B = \begin{bmatrix}
b_{11} & b_{12} \\
b_{21} & b_{22} \\
b_{31} & b_{32}
\end{bmatrix}
\]

Again, just like the Transition Probabilities, the Emission Probabilities of each state also sum to 1.

\[
\sum_{k=1}^{C} b_{jk} = 1 \; \; \; \forall j
\]
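To see how \( \pi \), A and B work together, the following sketch samples a hidden weather sequence and the moods it emits. All numbers are made up for illustration, and `sample_hmm` is a hypothetical helper, not something defined in this series:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = ["sun", "cloud", "rain"]
visible = ["happy", "sad"]

# Hypothetical parameters, chosen only for illustration.
A = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
B = np.array([[0.9, 0.1],    # emissions from sun
              [0.6, 0.4],    # emissions from cloud
              [0.2, 0.8]])   # emissions from rain
pi = np.array([1 / 3, 1 / 3, 1 / 3])


def sample_hmm(T, A, B, pi, rng):
    """Draw T hidden states and the T symbols they emit."""
    s = np.empty(T, dtype=int)
    v = np.empty(T, dtype=int)
    s[0] = rng.choice(len(pi), p=pi)               # initial state from pi
    v[0] = rng.choice(B.shape[1], p=B[s[0]])       # emit from row s[0] of B
    for t in range(1, T):
        s[t] = rng.choice(A.shape[0], p=A[s[t - 1]])  # transition using row s[t-1] of A
        v[t] = rng.choice(B.shape[1], p=B[s[t]])      # emit from the new state
    return s, v


s, v = sample_hmm(10, A, B, pi, rng)
print([hidden[i] for i in s])   # what actually happened (hidden)
print([visible[k] for k in v])  # what we get to observe
```

In a real HMM problem only the second printed list would be available to us.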

So far we have defined different attributes/properties of the Hidden Markov Model. Prediction is the ultimate goal for any model/algorithm. However, before jumping into prediction we need to solve two main problems in HMM.

Central Issues with Hidden Markov Model:

1. Evaluation Problem:

Let’s first define the model ( \( \theta \) ) as following:
\[
\theta \rightarrow s, v, a_{ij},b_{jk}
\]

Given the model ( \( \theta \) ) and a sequence of visible/observable symbols ( \( V^T\) ), we need to determine the probability that this particular sequence ( \( V^T\) ) was generated by the model ( \( \theta \) ).

There could be many models \( \{ \theta_1, \theta_2 … \theta_n \} \). We need to find \( p(V^T | \theta_i) \), then use Bayes Rule to correctly classify the sequence \( V^T \).

\[
p(\theta | V^T ) = \frac{p(V^T | \theta) p(\theta)}{p(V^T)}
\]
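As a rough sketch of this classification step, assume we already have the likelihood \( p(V^T | \theta_i) \) of one sequence under three candidate models, together with a prior for each model. The numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical likelihoods p(V^T | theta_i) for three candidate models,
# e.g. as produced by an evaluation routine; the values are made up.
likelihoods = np.array([2.0e-5, 8.0e-5, 1.0e-5])
# Hypothetical prior probabilities p(theta_i).
priors = np.array([0.5, 0.3, 0.2])

evidence = np.sum(likelihoods * priors)        # p(V^T)
posteriors = likelihoods * priors / evidence   # p(theta_i | V^T)

best = int(np.argmax(posteriors))
print(posteriors)
print("most probable model index:", best)
```

The sequence is assigned to the model with the largest posterior. Since \( p(V^T) \) is the same for every model, it only rescales the values and does not change which model wins.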

Forward and Backward Algorithm in Hidden Markov Model

2. Learning Problem:

In general, HMM is an unsupervised learning process where the number of different visible symbol types is known (happy, sad etc.) but the number of hidden states is not. The idea is to try out different options; however, this may lead to more computation and processing time.

Hence we often use training data and a specific number of hidden states (sun, rain, cloud etc.) to train the model for faster and better prediction.

Once the high-level structure (Number of Hidden & Visible States) of the model is defined, we need to estimate the Transition (\( a_{ij}\)) & Emission (\( b_{jk}\)) Probabilities using the training sequences. This is known as the Learning Problem.

Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model

We will also be using the solution of the Evaluation Problem to solve the Learning Problem, so it’s important to understand how the Evaluation Problem really works. Another important note: the Expectation Maximization (EM) algorithm will be used to estimate the Transition (\( a_{ij}\)) & Emission (\( b_{jk}\)) Probabilities. The algorithm used to solve the Learning Problem is known as the Forward-Backward Algorithm or Baum-Welch Algorithm.

3. Decoding Problem:

Finally, once we have the estimates for the Transition (\( a_{ij}\)) & Emission (\( b_{jk}\)) Probabilities, we can use the model ( \( \theta \) ) to predict the Hidden States \( W^T\) which generated the Visible Sequence \( V^T \). The Decoding Problem is solved using the Viterbi Algorithm.

Implement Viterbi Algorithm in Hidden Markov Model using Python and R

Conclusion:

In this Introduction to Hidden Markov Model article we went through some of the intuition behind HMM. Next we will go through each of the three problems defined above, build the algorithms from scratch, and implement them ourselves in both Python and R without using any library.

Here is the list of all the articles in this series:

  1. Introduction to Hidden Markov Model
  2. Forward and Backward Algorithm in Hidden Markov Model
  3. Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model
  4. Implement Viterbi Algorithm in Hidden Markov Model using Python and R

The post Introduction to Hidden Markov Model appeared first on A Developer Diary.


Forward and Backward Algorithm in Hidden Markov Model


The Introduction to Hidden Markov Model article provided a basic understanding of the Hidden Markov Model. We also introduced the three main problems of HMM (Evaluation, Learning and Decoding). In this Understanding Forward and Backward Algorithm in Hidden Markov Model article we will dive deep into the Evaluation Problem. We will go through the mathematics and then use Python and R to build the algorithms ourselves.

Quick Recap:

Hidden Markov Model is a Markov Chain which is mainly used in problems with temporal sequence of data. Markov Model explains that the next step depends only on the previous step in a temporal sequence. In Hidden Markov Model the state of the system is hidden (invisible), however each state emits a symbol at every time step. HMM works with both discrete and continuous sequences of data. (Here we will only see the example of discrete data)

Basic Structure of HMM:

As we have discussed earlier, a Hidden Markov Model (\( \theta \)) has the following parameters:

  • Set of M Hidden States (\( S^M\))
  • A Transition Probability Matrix (A)
  • A sequence of T observations (\( V^T\))
  • An Emission Probability Matrix (also known as Observation Likelihood) (B)
  • An Initial Probability Distribution (\( \pi \))

In case you are not sure about any of the above terminology, please refer to my previous article on Introduction to Hidden Markov Model:

Introduction to Hidden Markov Model

Evaluation Problem:

As we have seen earlier, the Evaluation Problem can be stated as follows:
\[
\text{Given } \theta, V^T \rightarrow \text{Estimate } p(V^T|\theta) \\
\text{Where } \theta \rightarrow s, v, a_{ij},b_{jk}
\]

Solution:

  • First we need to find all possible sequences of hidden states of length T, where each state is one of the M hidden states in \( S^M\).
  • Then, for each of those sequences, find the probability that it generated the visible sequence of symbols \( V^T\), and add these probabilities up.
  • Mathematically, \( p(V^T|\theta) \) can be computed as,
    \[
    p(V^T|\theta)= \sum_{r=1}^{R} p(V^T|S_r^T)p(S_r^T) \\
    \text{where }S_r^T = \{ s(1), s(2) \dots s(T)\} \text{ is the r-th sequence of hidden states}
    \]
    and R = the number of possible sequences of hidden states

So, if there are M hidden states, then we can define R as:
\[
R=M^T
\]

In order to compute the probability that the model generated the particular sequence of T visible symbols \( V^T\), we should take each conceivable sequence of hidden states, calculate the probability that it produced \(V^T\), and then add up these probabilities.

Question :

A question you might have is how to prove that the above equation is valid. Let’s try to understand this in a different way.

Remember our example? So here is the diagram of a specific sequence of 3 states. The transition between the Hidden Layers have been grayed out intentionally, we will come back to that in a moment.

Forward and Backward Algorithm in Hidden Markov Model adeveloperdiary.com

If, in the above example, we already knew the sequence of hidden states (i.e. sun, sun, rain) which generated the 3 visible symbols happy, sad & happy, it would be very easy to calculate the probability of the visible symbols/states given the hidden states. So we can write the probability of \( V^T\) given \(S^T\) as:

p(happy, sad, happy | sun, sun, rain ) = p(happy|sun) x p(sad|sun) x p(happy|rain)

Mathematically,
\[
p(V^T|S_r^T)=\prod_{t=1}^{T} p(v(t) | s(t))
\]

Unfortunately, we do not know the specific sequence of hidden states which generated the visible symbols happy, sad & happy. Hence we need to compute the probability of the mood sequence happy, sad & happy by summing over all possible weather sequences, each weighted by its probability (given by the transition probabilities).

We now have the same state diagram, however now the transition probabilities have been given here.

Forward and Backward Algorithm in Hidden Markov Model adeveloperdiary.com

We can calculate the joint probability of the sequence of visible symbols \(V^T\) and a specific sequence of hidden states \(S^T\) that generated it as:

p(happy,sad,happy,sun,sun,rain) = p(sun|initial state) x p(sun|sun) x p(rain|sun) x p(happy|sun) x p(sad|sun) x p(happy|rain)

Mathematically,

\[
p(V^T,S^T)=p(V^T | S^T)p(S^T)
\]

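Since we are using a First-Order Markov model, the probability of a sequence of T hidden states is the product of the probabilities of each transition. Here \( p(s(1) | s(0)) \) stands for the probability of the first state given the initial state, that is \( \pi_{s(1)} \).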
\[
p(S^T)=\prod_{t=1}^{T} p(s(t) | s(t-1))
\]

We can then write the joint probability as follows:

\[
\begin{align}
p(V^T,S^T) &=p(V^T | S^T)p(S^T) \\
&=\prod_{t=1}^{T} p(v(t) | s(t)) \prod_{t=1}^{T} p(s(t) | s(t-1))
\end{align}
\]

As you can see, we are slowly getting close to our original equation. Just one more step is left. The above equation is for one specific sequence of hidden states that we thought might have generated the visible sequence of symbols/states. We can now compute the probability over all the different possible sequences of hidden states by summing the joint probabilities of \(V^T\) and \(S^T\).

In our example, we have a sequence of 3 visible symbols/states and 2 different hidden states (sun and rain). So there can be \(2^3 = 8\) possible sequences. We can write them as:

p(happy,sad,happy|model) = p(happy,sad,happy,sun,sun,sun) + p(happy,sad,happy,sun,sun,rain) + p(happy,sad,happy,sun,rain,rain)+ . . .

We can write the generalized equation as:

\[
\begin{align}
p(V^T|\theta) &=\sum_{\text{All Seq of S}} p(V^T, S^T) \\
&=\sum_{\text{All Seq of S}} p(V^T | S^T)p(S^T) \\
&=\sum_{r=1}^R \prod_{t=1}^{T} p(v(t) | s(t)) \prod_{t=1}^{T} p(s(t) | s(t-1)) \\
&=\sum_{r=1}^R \prod_{t=1}^{T} p(v(t) | s(t)) p(s(t) | s(t-1))
\end{align}
\]

Again, R=Maximum Number of possible sequences of the hidden state.

The above solution is simple; however, the computational complexity is \( O(M^T \cdot T) \), where M is the number of hidden states, which is very high for practical scenarios. So even though we have derived the solution to the Evaluation Problem, we need an alternative that is cheaper to compute.

We will use a recursive dynamic programming approach to overcome the exponential computation of the solution above. There are two such algorithms, the Forward Algorithm and the Backward Algorithm.
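To make the cost of the direct solution concrete, here is a minimal brute-force sketch that enumerates every hidden sequence with `itertools.product`. It assumes a made-up two-state model (sun, rain) with two symbols (happy, sad); the function names `joint` and `brute_force_likelihood` are illustrative only:

```python
import itertools
import numpy as np

# Hypothetical two-state model (0 = sun, 1 = rain) with two symbols
# (0 = happy, 1 = sad); the numbers are made up for illustration.
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])


def joint(V, S, A, B, pi):
    """p(V^T, S^T) for one specific hidden sequence S."""
    p = pi[S[0]] * B[S[0], V[0]]
    for t in range(1, len(V)):
        p *= A[S[t - 1], S[t]] * B[S[t], V[t]]
    return p


def brute_force_likelihood(V, A, B, pi):
    """Sum p(V^T, S^T) over all M**T hidden sequences."""
    M = A.shape[0]
    return sum(joint(V, S, A, B, pi)
               for S in itertools.product(range(M), repeat=len(V)))


V = [0, 1, 0]  # happy, sad, happy
likelihood = brute_force_likelihood(V, A, B, pi)
print(likelihood)
```

For T = 3 this visits only 8 sequences, but the count grows as \( M^T \), so the same loop becomes impractical for sequences of even a few dozen observations.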

Forward Algorithm:

In the Forward Algorithm (as the name suggests), we use the probability computed at the current time step to derive the probability of the next time step. Hence it is computationally more efficient, \(O(M^2 \cdot T)\).

We need to find the answer of the following question to make the algorithm recursive:

Given a sequence of visible states \(V^T\), what is the probability that the Hidden Markov Model will be in a particular hidden state s at a particular time step t?

If we write the above question mathematically, it might be easier to understand.

\[
\alpha_j(t) = p(v(1)…v(t),s(t)= j)
\]

First, we will derive the equation using just probability & then will solve again using trellis diagram. So don’t worry if you are not able to fully understand the next section, just read along and come back after going through the trellis diagram.

Solution using Probabilities:

When t = 1 :

Rewrite the above equation when t=1

\[
\begin{align}
\alpha_j(1) &= p(v_k(1),s(1)= j) \\
&= p(v_k(1)|s(1)=j)p(s(1)=j) \\
&= \pi_j p(v_k(1)|s(1)=j) \\
&= \pi_j b_{jkv(1)} \\
\text{where } \pi &= \text{ initial distribution, } \\
b_{jkv(1)} &= \text{ Emission Probability at } t = 1
\end{align}
\]

When t = 2 :

So we have the solution when t=1. Now let’s rewrite the same when t=2. Our objective here is to come up with an equation where \(\alpha_i(1)\) is part of it, so that we can use recursion.

\[
\begin{align}
\alpha_j(2) &= p \Big( v_k(1),v_k(2),s(2)= j \Big) \\
&= \color{Blue}{\sum_{i=1}^M} p \Big( v_k(1),v_k(2),\color{Blue}{s(1)= i}, s(2)= j \Big) \\
&= \sum_{i=1}^M p \Big( v_k(2) | s(2)= j, v_k(1),s(1)= i \Big) p \Big( v_k(1),s(2)= j,s(1)= i \Big)\\
&= \sum_{i=1}^M p \Big( v_k(2) | s(2)= j, \color{Red}{v_k(1), s(1)= i} \Big) p \Big( s(2)= j | \color{Red}{v_k(1),}s(1)= i \Big) p \Big(v_k(1),s(1)= i \Big) \\
&= \sum_{i=1}^M p \Big( v_k(2) | s(2)= j \Big) p \Big(s(2)= j | s(1)= i \Big) p \Big(v_k(1),s(1)= i \Big)\\
&= \color{DarkRed}{p \Big( v_k(2) | s(2)= j \Big) }\sum_{i=1}^M p \Big( s(2)= j | s(1)= i \Big) \color{Blue}{p \Big( v_k(1),s(1)= i \Big)} \\
&= \color{DarkRed}{b_{jk v(2)}} \sum_{i=1}^M a_{ij} \color{Blue} {\alpha_i(1)}\\
\text{where } a_{ij} &= \text{ Transition Probability } \\
b_{jk v(2)} &= \text{ Emission Probability at } t=2 \\
\alpha_i(1) &= \text{ Forward probability at } t=1
\end{align}
\]

Let me explain some parts of it. We have used the joint probability rule to break the equation into different parts.

In Line 2 we have introduced \(s(1)=i\) and summed over it, since there are M different hidden states. The red highlighted sections in Line 4 can be removed: given \(s(2)=j\), the symbol \(v_k(2)\) does not depend on anything earlier, and given \(s(1)=i\), the state \(s(2)\) does not depend on \(v_k(1)\). Finally, Line 6 has 3 parts which are highlighted in colors. Since \( p ( v_k(2) | s(2)= j ) \) does not depend on i, we can move it outside the summation. The final equation contains \( \alpha_i(1) \), which we have already calculated for t=1.

Generalized Equation :

Let’s generalize the equation now for any time step t+1:

\[
\begin{align}
\alpha_j(t+1) &= p \Big( v_k(1) \dots v_k(t+1),s(t+1)= j \Big) \\
&= \color{Blue}{\sum_{i=1}^M} p\Big(v_k(1) \dots v_k(t+1),\color{Blue}{s(t)= i}, s(t+1)= j \Big) \\
&= \sum_{i=1}^M p\Big(v_k(t+1) | s(t+1)= j, v_k(1) \dots v_k(t),s(t)= i\Big) \\
& \quad p\Big(v_k(1) \dots v_k(t),s(t+1)= j,s(t)= i \Big) \\
&= \sum_{i=1}^M p\Big(v_k(t+1) | s(t+1)= j, \color{Red}{v_k(1) \dots v_k(t), s(t)= i}\Big) \\
& \quad p\Big(s(t+1)= j | \color{Red}{v_k(1) \dots v_k(t),}s(t)= i\Big) p\Big(v_k(1) \dots v_k(t),s(t)= i\Big)\\
&= \sum_{i=1}^M p\Big(v_k(t+1) | s(t+1)= j\Big) p\Big(s(t+1)= j | s(t)= i\Big) p\Big(v_k(1) \dots v_k(t),s(t)= i\Big)\\
&= \color{DarkRed}{p\Big(v_k(t+1) | s(t+1)= j\Big) }\sum_{i=1}^M p\Big(s(t+1)= j | s(t)= i\Big) \color{Blue}{p\Big(v_k(1) \dots v_k(t),s(t)= i\Big)} \\
&= \color{DarkRed}{b_{jk v(t+1)}} \sum_{i=1}^M a_{ij} \color{Blue}{\alpha_i(t)}
\end{align}
\]

The above equation follows the same derivation as we did for t=2. This equation is really easy to implement in any programming language. We won’t use a recursive function; we will just use the pre-calculated values in a loop (more on this later).

Intuition using Trellis:

We will use a Trellis Diagram to get the intuition behind the Forward Algorithm. In case you have not understood the derivation using the joint probability rule, this section should help you understand the equation.

Here is the same question again:
Given a sequence of visible states \(V^T\), what is the probability that the Hidden Markov Model will be in a particular hidden state s at a particular time step t?

Step by Step Derivation:

Please refer to the Trellis diagram below and assume that the probability of the system/machine being in hidden state \(s_1\) at time \( (t-1) \), after emitting the first \( (t-1) \) visible symbols, is \( \alpha_1(t-1) \). The probability of then transitioning to hidden state \( s_2 \) at time step t can be written as,

\[
\alpha_1(t-1) a_{12}
\]

Forward and Backward Algorithm in Hidden Markov Model adeveloperdiary.com

Likewise, if we sum all the probabilities where the machine transitions to state \( s_2 \) at time t from any state at time \((t-1)\), we get the total probability that there will be a transition from any hidden state at \((t-1)\) to \( s_2 \) at time step t.

Mathematically,
\[
\sum_{i=1}^M \alpha_i(t-1) a_{i2}
\]

Finally, the probability that the machine is in hidden state \( s_2 \) at time t, after emitting the first t visible symbols of the sequence \(V^T\), is given by the following (we simply multiply the above expression by the emission probability):

\[
b_{2k} \sum_{i=1}^M \alpha_i(t-1) a_{i2}
\]

Now we can extend this to a recursive algorithm to find the probability that sequence \(V^T\) was generated by HMM \(\theta\). Here is the generalized version of the equation.

\[
\alpha_j(t)= \begin{cases}
\pi_jb_{jk} & \text{ when }t = 1 \\
b_{jk} \sum_{i=1}^M \alpha_i(t-1) a_{ij} & \text{ when } t \text{ greater than } 1
\end{cases}
\]

Here \(\alpha_j(t)\) is the probability that the machine will be at hidden state \(s_j\) at time step t, after emitting first t visible sequence of symbols.

Implementation of Forward Algorithm:

Now let’s work on the implementation. We will use both Python and R for this.

Data:

In our example we have 2 Hidden States (A,B) and 3 Visible States (0,1,2) (in the R file, they are (1,2,3)). Assume that we already know our transition matrix A and emission matrix B.

\[
A=
\begin{bmatrix}
0.54 & 0.46\\
0.49 & 0.51
\end{bmatrix}
\]
\[
B= \begin{bmatrix}
0.16 & 0.26 & 0.58\\
0.25 & 0.28 & 0.47
\end{bmatrix}
\]

The files data_python.csv & data_r.csv have two columns, named Hidden and Visible. The only difference between the Python and R files is the starting index of the Visible column: the Python file has 0, 1, 2 whereas the R file has 1, 2, 3.

Python:

First Load the data.

import pandas as pd
import numpy as np

data = pd.read_csv('data_python.csv')

V = data['Visible'].values

Then set the values for transition probability, emission probabilities and initial distribution.

# Transition Probabilities
a = np.array(((0.54, 0.46), (0.49, 0.51)))

# Emission Probabilities
b = np.array(((0.16, 0.26, 0.58), (0.25, 0.28, 0.47)))

# Equal Probabilities for the initial distribution
initial_distribution = np.array((0.5, 0.5))

In python the index starts from 0, hence our t will start from 0 to T-1.

Next, we will have the forward function. Here we will store and return all the \(\alpha_0(0), \alpha_1(0) … \alpha_0(T-1),\alpha_1(T-1)\)

def forward(V, a, b, initial_distribution):
    alpha = np.zeros((V.shape[0], a.shape[0]))
    alpha[0, :] = initial_distribution * b[:, V[0]]

    for t in range(1, V.shape[0]):
        for j in range(a.shape[0]):
            # Matrix Computation Steps
            #                  ((1x2) . (1x2))      *     (1)
            #                        (1)            *     (1)
            alpha[t, j] = alpha[t - 1].dot(a[:, j]) * b[j, V[t]]

    return alpha

alpha = forward(V, a, b, initial_distribution)
print(alpha)

First we create the alpha matrix with 2 columns and T rows.
As per our equation, multiply initial_distribution by \( b_{jkv(0)} \) to calculate \(\alpha_0(0) , \alpha_1(0) \). This is a simple element-wise vector multiplication since initial_distribution and \( b_{jkv(0)} \) are of the same size.

  • We then loop through the time steps, starting from 1 (remember that Python indexing starts from 0).
  • Another loop runs over each hidden state j.
  • Use the same formula for calculating the \( \alpha \) values.
  • Return all of the alpha values.
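As a hand check of the steps above, here is a sketch of the first two time steps using the A, B and uniform initial distribution defined earlier. It assumes the first two observations are 0 and 1. That is an assumption, since the data file is not reproduced here, although it yields the same numbers as the first two rows of the printed output.

\[
\begin{align}
\alpha_A(1) &= \pi_A \, b_{A0} = 0.5 \times 0.16 = 0.08 \\
\alpha_B(1) &= \pi_B \, b_{B0} = 0.5 \times 0.25 = 0.125 \\
\alpha_A(2) &= b_{A1} \big( \alpha_A(1) a_{AA} + \alpha_B(1) a_{BA} \big) = 0.26 \times (0.08 \times 0.54 + 0.125 \times 0.49) = 0.26 \times 0.10445 = 0.027157 \\
\alpha_B(2) &= b_{B1} \big( \alpha_A(1) a_{AB} + \alpha_B(1) a_{BB} \big) = 0.28 \times (0.08 \times 0.46 + 0.125 \times 0.51) = 0.28 \times 0.10055 = 0.028154
\end{align}
\]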

Output:

[[8.00000000e-002 1.25000000e-001]
 [2.71570000e-002 2.81540000e-002]
 [1.65069392e-002 1.26198572e-002]
 [8.75653677e-003 6.59378003e-003]
…
…
 [8.25847348e-221 6.30684489e-221]
 [4.37895921e-221 3.29723269e-221]
 [1.03487332e-221 1.03485477e-221]
 [6.18228050e-222 4.71794300e-222]]
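Notice how small the alpha values in the output become (around 1e-222 in the last rows). For longer sequences they would underflow to zero in floating point. A common remedy, not covered in this article, is to normalize alpha at every time step and accumulate the logarithm of the normalizers. The following is a minimal sketch of that idea; `forward_scaled` and the short observation sequence are hypothetical stand-ins, not part of the original code:

```python
import numpy as np


def forward_scaled(V, a, b, initial_distribution):
    """Forward pass with per-step normalisation.

    Returns the normalised alphas and log p(V | theta).
    """
    T, M = len(V), a.shape[0]
    alpha_hat = np.zeros((T, M))
    log_likelihood = 0.0

    alpha = initial_distribution * b[:, V[0]]
    for t in range(T):
        if t > 0:
            alpha = (alpha_hat[t - 1] @ a) * b[:, V[t]]
        c = alpha.sum()               # scaling factor for step t
        alpha_hat[t] = alpha / c      # each row now sums to 1
        log_likelihood += np.log(c)   # the product of all c equals p(V | theta)
    return alpha_hat, log_likelihood


a = np.array(((0.54, 0.46), (0.49, 0.51)))
b = np.array(((0.16, 0.26, 0.58), (0.25, 0.28, 0.47)))
initial_distribution = np.array((0.5, 0.5))

# Hypothetical observations standing in for data['Visible'].values.
V = np.array([0, 1, 2, 2, 1, 0, 2, 1])
alpha_hat, ll = forward_scaled(V, a, b, initial_distribution)
print(ll)

# A long random sequence: the unscaled alphas would underflow to 0 here,
# while the log-likelihood stays finite.
V_long = np.random.default_rng(0).integers(0, 3, size=2000)
_, ll_long = forward_scaled(V_long, a, b, initial_distribution)
print(ll_long)
```

The design choice here is to return log p(V | θ) instead of the probability itself, since the probability of a long sequence is usually too small to represent directly.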

R Code:

Here is the same Forward Algorithm implemented in R. If you notice, we have removed the 2nd for loop in R code. You can do the same in python too.

data = read.csv("data_r.csv")

a = matrix(c(0.54, 0.49, 0.46, 0.51),nrow = 2,ncol = 2)
b = matrix(c(0.16, 0.25, 0.26, 0.28, 0.58, 0.47),nrow = 2,ncol = 3)
initial_distribution = c(1/2, 1/2)

forward = function(v, a, b, initial_distribution){
  
  T = length(v)
  m = nrow(a)
  alpha = matrix(0, T, m)
  
  alpha[1, ] = initial_distribution*b[, v[1]]
  
  for(t in 2:T){
    tmp = alpha[t-1, ] %*% a
    alpha[t, ] = tmp * b[, v[t]]
  }
  return(alpha)
}

forward(data$Visible,a,b,initial_distribution)
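The R version above drops the inner loop over j by using a matrix product. The same idea in Python can be sketched as follows, assuming NumPy arrays shaped like the ones used earlier; the name `forward_vectorized` and the short observation sequence are illustrative assumptions:

```python
import numpy as np


def forward_vectorized(V, a, b, initial_distribution):
    """Forward algorithm with the loop over hidden states replaced by a matrix product."""
    alpha = np.zeros((len(V), a.shape[0]))
    alpha[0] = initial_distribution * b[:, V[0]]
    for t in range(1, len(V)):
        # (1 x M) @ (M x M) gives sum_i alpha_i(t-1) * a_ij for every j at once,
        # then we multiply element-wise by the emission probabilities of v(t).
        alpha[t] = (alpha[t - 1] @ a) * b[:, V[t]]
    return alpha


a = np.array(((0.54, 0.46), (0.49, 0.51)))
b = np.array(((0.16, 0.26, 0.58), (0.25, 0.28, 0.47)))
initial_distribution = np.array((0.5, 0.5))

# Hypothetical observations standing in for data['Visible'].values.
V = np.array([0, 1, 2, 2, 1])
alpha = forward_vectorized(V, a, b, initial_distribution)
print(alpha)
```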

Backward Algorithm:

Backward Algorithm is the time-reversed version of the Forward Algorithm. In Backward Algorithm we need to find the probability that the machine will be in hidden state \( s_i \) at time step t and will generate the remaining part of the sequence of the visible symbol \(V^T\).

Derivation of Backward Algorithm:

Below is the derivation of the Backward Algorithm using probability theory. The concepts are the same as for the Forward Algorithm.

\[
\begin{align}
\beta_i(t) &= p \Big( v_k(t+1) \dots v_k(T) | s(t) = i \Big) \\
&= \sum_{j=1}^M p\Big( v_k(t+1) \dots v_k(T), s(t+1) = j | s(t) = i \Big) \\
&= \sum_{j=1}^M p\Big( v_k(t+2) \dots v_k(T) | v_k(t+1) , s(t+1) = j , s(t) = i \Big) \\
& \quad p \Big( v_k(t+1) , s(t+1) = j | s(t) = i \Big) \\
&= \sum_{j=1}^M p\Big( v_k(t+2) \dots v_k(T) | v_k(t+1) , s(t+1) = j , s(t) = i \Big) \\
& \quad p \Big( v_k(t+1) | s(t+1) = j , s(t) = i \Big) p \Big( s(t+1) = j | s(t) = i \Big) \\
&= \sum_{j=1}^M p\Big( v_k(t+2) \dots v_k(T) | s(t+1) = j \Big) p \Big( v_k(t+1) | s(t+1) = j \Big) \\
& \quad p \Big( s(t+1) = j | s(t) = i \Big) \\
&= \sum_{j=1}^M \beta_j(t+1) b_{jkv(t+1)} a_{ij} \\
\text{where } a_{ij} &= \text{ Transition Probability } \\
b_{jk v(t+1)} &= \text{ Emission Probability at time } t+1 \\
\beta_j(t+1) &= \text{ Backward probability at time } t+1
\end{align}
\]

Intuition using Trellis:

Here is the Trellis diagram of the Backward Algorithm. Mathematically, the algorithm can be written in following way:

\[
\beta_i(t)= \begin{cases}
1 & \text{ when }t = T \\
\sum_{j=1}^M a_{ij} b_{jkv(t+1)}\beta_j(t+1) & \text{ when } t \text{ less than } T
\end{cases}
\]

Forward and Backward Algorithm in Hidden Markov Model adeveloperdiary.com

Implementation of Backward Algorithm:

We will use the same data file and parameters as defined for Forward Algorithm.

Python Code :

import pandas as pd
import numpy as np

data = pd.read_csv('data_python.csv')

V = data['Visible'].values

# Transition Probabilities
a = np.array(((0.54, 0.46), (0.49, 0.51)))

# Emission Probabilities
b = np.array(((0.16, 0.26, 0.58), (0.25, 0.28, 0.47)))


def backward(V, a, b):
    beta = np.zeros((V.shape[0], a.shape[0]))

    # setting beta(T) = 1
    beta[V.shape[0] - 1] = np.ones((a.shape[0]))

    # Loop backward from T-1 to 1
    # Due to python indexing the actual loop will be T-2 to 0
    for t in range(V.shape[0] - 2, -1, -1):
        for j in range(a.shape[0]):
            beta[t, j] = (beta[t + 1] * b[:, V[t + 1]]).dot(a[j, :])

    return beta


beta = backward(V, a, b)
print(beta)

R Code :

data = read.csv("data_r.csv")

a = matrix(c(0.54, 0.49, 0.46, 0.51),nrow = 2,ncol = 2)
b = matrix(c(0.16, 0.25, 0.26, 0.28, 0.58, 0.47),nrow = 2,ncol = 3)

backward = function(V, A, B){
  T = length(V)
  m = nrow(A)
  beta = matrix(1, T, m)
  
  for(t in (T-1):1){
    tmp = as.matrix(beta[t+1, ] * B[, V[t+1]])
    beta[t, ] = t(A %*% tmp)
  }
  return(beta)
}

backward(data$Visible,a,b)

Output :

[[5.30694627e-221 5.32373319e-221]
 [1.98173335e-220 1.96008747e-220]
 [3.76013005e-220 3.71905927e-220]
 [7.13445025e-220 7.05652279e-220]
...
...
 [7.51699476e-002 7.44006456e-002]
 [1.41806080e-001 1.42258480e-001]
 [5.29400000e-001 5.23900000e-001]
 [1.00000000e+000 1.00000000e+000]]
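One way to make sense of these numbers is to check that the forward and backward passes agree on \( p(V^T|\theta) \). The sketch below assumes the same A and B as above, a uniform initial distribution, and a short made-up observation sequence in place of the data file; the vectorized `forward` and `backward` here are compact variants written for this check:

```python
import numpy as np

a = np.array(((0.54, 0.46), (0.49, 0.51)))
b = np.array(((0.16, 0.26, 0.58), (0.25, 0.28, 0.47)))
pi = np.array((0.5, 0.5))


def forward(V, a, b, pi):
    alpha = np.zeros((len(V), a.shape[0]))
    alpha[0] = pi * b[:, V[0]]
    for t in range(1, len(V)):
        alpha[t] = (alpha[t - 1] @ a) * b[:, V[t]]
    return alpha


def backward(V, a, b):
    beta = np.ones((len(V), a.shape[0]))   # beta(T) = 1
    for t in range(len(V) - 2, -1, -1):
        # beta_i(t) = sum_j a_ij * b_j(v(t+1)) * beta_j(t+1)
        beta[t] = a @ (beta[t + 1] * b[:, V[t + 1]])
    return beta


# Hypothetical observations standing in for data['Visible'].values.
V = np.array([0, 1, 2, 2, 1, 0])
alpha = forward(V, a, b, pi)
beta = backward(V, a, b)

# p(V | theta) from the forward pass: sum of the final alphas.
p_forward = alpha[-1].sum()
# p(V | theta) from the backward pass: sum_j pi_j * b_j(v(1)) * beta_j(1).
p_backward = np.sum(pi * b[:, V[0]] * beta[0])
# At every time step t, sum_j alpha_j(t) * beta_j(t) gives the same value.
p_each_t = (alpha * beta).sum(axis=1)

print(p_forward, p_backward)
print(p_each_t)
```

All three quantities should match up to floating point error, which is a handy sanity check when implementing both algorithms.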

Conclusion:

In our next article we will use both the forward and backward algorithms to solve the learning problem. Here I have provided a very detailed overview of the Forward and Backward Algorithm. The output of the programs may not make a lot of sense now; however, the next article will provide more insight.

Here is the link to the code and data file in github.

Also, here is the list of all the articles in this series:

  1. Introduction to Hidden Markov Model
  2. Forward and Backward Algorithm in Hidden Markov Model
  3. Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model
  4. Implement Viterbi Algorithm in Hidden Markov Model using Python and R

Feel free to post any question you may have.

The post Forward and Backward Algorithm in Hidden Markov Model appeared first on A Developer Diary.

Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model


The most important and complex part of the Hidden Markov Model is the Learning Problem. Even though it can be used in an unsupervised way, the more common approach is to use supervised learning just for defining the number of hidden states. In this Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model article we will go through the step by step derivation of the Baum Welch Algorithm (a.k.a. Forward-Backward Algorithm) and then implement it using both Python and R.

Quick Recap:

This is the 3rd part of the Introduction to Hidden Markov Model tutorial. So far we have gone through the intuition of HMM and the derivation and implementation of the Forward and Backward Algorithm. In case you need a refresher, please refer to part 2 of the tutorial series.

Forward and Backward Algorithm in Hidden Markov Model

Learning Problem : HMM Training

  • The objective of the Learning Problem is to estimate \( a_{ij}\) and \( b_{jk}\) using the training data.
  • The standard algorithm for Hidden Markov Model training is the Forward-Backward or Baum-Welch Algorithm.
  • This algorithm uses a special case of the Expectation Maximization (EM) Algorithm.

Example using Maximum Likelihood Estimate:

Now let’s try to get an intuition using an example of Maximum Likelihood Estimate. Consider training a simple Markov Model where the hidden state is visible.

We will use the example from the programming section (you should already have it if you have followed part 2), where we had 2 hidden states [A,B] and 3 visible states [1,2,3]. (Assume in this example the hidden states are also known.)

Here we have 4 different sequences of 3 observations each (shown in alternating colors in the original; separated by | below). The first row holds the visible symbols and the second row the hidden states.

3 2 2 | 1 1 3 | 1 2 3 | 2 1 1
B B A | A A B | B A B | B A A

Now we will compute the HMM parameters by Maximum Likelihood Estimation using the sample data above.

Estimate Initial Probability Distribution

We will initialize \( \pi \) using the probabilities derived from the above sequences. In the example above, one of the four sequences starts with A and the other 3 start with B. We can define,

\[
\pi_A=1/4 , \pi_B=3/4
\]

Estimate Transition Probabilities:

Let’s define our Transition Probability Matrix first as:

\[
\hat{A} = \begin{bmatrix}
p(A|A) & p(B|A) \\
p(A|B) & p(B|B)
\end{bmatrix}
\]

We can calculate the probabilities from the example as follows (ignore the final hidden state of each sequence, since there is no state to transition to):

\[
\hat{A} = \begin{bmatrix}
2/4 & 2/4 \\
3/4 & 1/4
\end{bmatrix}
\]

Estimate Emission Probabilities:

In the same way, the following is our Emission Probability Matrix.
\[
\hat{B} =\begin{bmatrix}
p(1|A) & p(2|A) & p(3|A) \\
p(1|B) & p(2|B) & p(3|B)
\end{bmatrix}
\]

Here are the calculated probabilities:

\[
\hat{B} =\begin{bmatrix}
4/6 & 2/6 & 0/6 \\
1/6 & 2/6 & 3/6
\end{bmatrix}
\]
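
The counting above can be reproduced with a few lines of Python. This is only a sketch: it assumes the 12 observations are split into four sequences of length 3 (which is what the counts in the matrices imply), and the names `sequences`, `pi_hat`, `A_hat` and `B_hat` are my own, not part of the code used later in this article.

```python
from collections import Counter
from fractions import Fraction

# The example data as (visible symbols, hidden states) pairs.
# Assumption: four sequences of three observations each.
sequences = [
    ([3, 2, 2], ["B", "B", "A"]),
    ([1, 1, 3], ["A", "A", "B"]),
    ([1, 2, 3], ["B", "A", "B"]),
    ([2, 1, 1], ["B", "A", "A"]),
]
states, symbols = ["A", "B"], [1, 2, 3]

start, trans, emit = Counter(), Counter(), Counter()
for visible, hidden in sequences:
    start[hidden[0]] += 1                       # first hidden state of the sequence
    for prev, nxt in zip(hidden, hidden[1:]):   # last state has no outgoing transition
        trans[(prev, nxt)] += 1
    for s, v in zip(hidden, visible):
        emit[(s, v)] += 1


def normalise(counts, row, cols):
    """Turn the counts of one row into probabilities that sum to 1."""
    total = sum(counts[(row, c)] for c in cols)
    return [Fraction(counts[(row, c)], total) for c in cols]


pi_hat = {s: Fraction(start[s], len(sequences)) for s in states}
A_hat = {s: normalise(trans, s, states) for s in states}
B_hat = {s: normalise(emit, s, symbols) for s in states}

print(pi_hat)
print(A_hat)
print(B_hat)
```

Note that `Fraction` reduces the results, so 2/4 is printed as 1/2 and 4/6 as 2/3.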

Baum-Welch Algorithm:

The above maximum likelihood estimate works only when the sequence of hidden states is known. However, that’s not the case for us. Hence we need to find another way to estimate the Transition and Emission Matrices.

That way is the Baum-Welch Algorithm, also known as the Forward-Backward Algorithm; it is a special case of the Expectation Maximization (EM) algorithm.

High Level Steps of the Algorithm (EM):

Let’s first understand what we need in order to estimate the parameters of the HMM. Here are the high level steps:

  1. Start with initial probability estimates [A,B]. Initially set equal probabilities or define them randomly.
  2. Compute the expectation of how often each transition/emission has been used. We will estimate the latent variables [ \( \xi , \gamma \) ] (this is a common approach for the EM Algorithm).
  3. Re-estimate the probabilities [A,B] based on those estimates (latent variables).
  4. Repeat until convergence

How to solve Baum-Welch Algorithm?:

There are two main ways we can solve the Baum-Welch Algorithm.

  • Probabilistic Approach : HMM is a generative model, hence we can solve Baum-Welch using a probabilistic approach.
  • Lagrange Multipliers : The Learning problem can be defined as a constrained optimization problem, hence it can also be solved using Lagrange Multipliers.

The final equations for A and B look the same irrespective of which approach is used, since both A and B can be defined using joint and marginal probabilities. Let’s look at their formal definitions:

Estimate for \( a_{ij}\):

\[
\hat{a_{ij}} = \frac{\text{expected number of transitions from hidden state i to state j}}{\text{expected number of transitions from hidden state i}}
\]

Estimate for \( b_{jk}\):

\[
\hat{b_{jk}} = \frac{\text{expected number of times in hidden state j and observing v(k) }}{\text{expected number of times in hidden state j}}
\]

The above definition is just the generalized view of the Maximum Likelihood example we went through. Let’s use the Probabilistic Approach and find out how we can estimate the parameters A and B.

Probabilistic Approach:

Derivation of \( \hat{a_{ij}}\):

If we know the probability of a given transition from i to j at time step t, then we can sum over all the time steps to estimate the numerator in our equation for \( \hat{A}\).

By the way \( \hat{A}\) is just the matrix representation of \( \hat{a_{ij}}\), so don’t be confused.

We can define this as the probability of being in state i at time t and in state j at time t+1, given the observation sequence and the model.

Mathematically,
\[
p(s(t) = i,s(t+1)=j | V^T, \theta )
\]

We already know from basic probability theory that,

\[
\begin{align}
p(X, Y | Z) &= p(X | Y, Z) p( Y | Z ) \\
p(X | Y, Z) &= \frac{p(X, Y | Z) }{p( Y | Z )}
\end{align}
\]

We can now say,

\[
\begin{align}
p(s(t) = i,s(t+1)=j | V^T, \theta ) &=\frac{ p(s(t) = i,s(t+1)=j , V^T | \theta )}{p(V^T| \theta )}
\end{align}
\]

The numerator of the equation can be expressed using the Forward and Backward Probabilities (refer to the diagram below):

\[
\begin{align}
p(s(t) = i,s(t+1)=j , V^T | \theta ) = \alpha_i(t) a_{ij} b_{jk \text{ } v(t+1) }\beta_j(t+1)
\end{align}
\]

Derivation and implementation of Baum Welch Algorithm for Hidden Markov Models adeveloperdiary.com

The denominator \( p(V^T|\theta)\) is the probability of the observation sequence \( V^T\) by any path given the model \( \theta \). It can be expressed as the marginal probability:

\[
\begin{align}
p(V^T | \theta ) = \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i(t) a_{ij} b_{jk \text{ } v(t+1) }\beta_j(t+1)
\end{align}
\]

We will define \(\xi \) as the latent variable representing \( p(s(t) = i,s(t+1)=j | V^T, \theta ) \). We can now define \(\xi_{ij} (t) \) as:

\[
\xi_{ij} (t) = \frac{\alpha_i(t) a_{ij} b_{jk \text{ } v(t+1) }\beta_j(t+1)}{\sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i(t) a_{ij} b_{jk \text{ } v(t+1) }\beta_j(t+1)}
\]

The \(\xi_{ij} (t) \) defined above is for only one time step. We need to sum over all time steps t = 1 to T-1 to get the expected number of transitions from hidden state i to hidden state j. This will be the numerator of the equation of \( \hat{a_{ij}} \).

For the denominator, we need the marginal probability, which can be expressed as follows,

\[
\sum_{t=1}^{T-1} \sum_{j=1}^{M} \xi_{ij} (t)
\]

Now we can define \( \hat{a_{ij}} \) as,

\[
\hat{a_{ij}} = \frac{\sum_{t=1}^{T-1} \xi_{ij} (t)}{\sum_{t=1}^{T-1} \sum_{j=1}^{M} \xi_{ij} (t)} \qquad (1)
\]
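
To make the indices concrete, here is a small plain Python sketch that computes \( \xi_{ij}(t) \) for every time step and then applies the ratio above. The parameters `A`, `B`, `pi` and the sequence `V` are toy values I picked for illustration; the `forward` and `backward` helpers follow the recursions from part 2 with 0-based indexing.

```python
# Toy model, assumed for illustration only.
A = [[0.54, 0.46], [0.49, 0.51]]
B = [[0.16, 0.26, 0.58], [0.25, 0.28, 0.47]]
pi = [0.5, 0.5]
V = [0, 1, 2, 2, 1, 0]
M, T = len(A), len(V)


def forward(V, A, B, pi):
    alpha = [[pi[j] * B[j][V[0]] for j in range(M)]]
    for t in range(1, T):
        last = alpha[-1]
        alpha.append([sum(last[i] * A[i][j] for i in range(M)) * B[j][V[t]]
                      for j in range(M)])
    return alpha


def backward(V, A, B):
    beta = [[1.0] * M for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][V[t + 1]] * beta[t + 1][j] for j in range(M))
                   for i in range(M)]
    return beta


alpha, beta = forward(V, A, B, pi), backward(V, A, B)


def xi_at(t):
    """xi_ij(t): numerator alpha_i(t) a_ij b_j(v(t+1)) beta_j(t+1), normalised."""
    raw = [[alpha[t][i] * A[i][j] * B[j][V[t + 1]] * beta[t + 1][j]
            for j in range(M)] for i in range(M)]
    total = sum(map(sum, raw))          # double sum over i and j = p(V | theta)
    return [[x / total for x in row] for row in raw]


xi = [xi_at(t) for t in range(T - 1)]   # only T-1 transitions exist

# Equation (1): expected i->j transitions over expected transitions out of i.
a_hat = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(xi[t][i][k] for t in range(T - 1) for k in range(M))
          for j in range(M)] for i in range(M)]
print(a_hat)
```

Each \( \xi(t) \) matrix sums to 1 over all (i, j) pairs, and each row of `a_hat` sums to 1, which is a handy check when debugging.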

Probabilistic view of the Denominator:

Before we move on to estimating B, let’s look more closely at the denominator of \( \hat{a_{ij}} \). It sums, over the time steps, the probability of being in state i at time t, which can be expressed as:

\[
\begin{align}
p(s(t)=i | V^T , \theta) & = \frac{p(s(t)=i, V^T | \theta)}{p(V^T| \theta)} \\
&= \frac{ p(v(1) \dots v(t), s(t)=i | \theta) p(v(t+1) \dots v(T)| s(t)=i , \theta) }{ p(V^T| \theta) } \\
&=\frac{\alpha_i(t) \beta_i(t)}{p(V^T| \theta)} \\
&= \frac{\alpha_i(t) \beta_i(t)}{ \sum_{i=1}^M \alpha_i(t) \beta_i(t)} = \gamma_i(t)
\end{align}
\]

Derivation and implementation of Baum Welch Algorithm for Hidden Markov Models adeveloperdiary.com

If we use the above equation to define our estimate for A, it will be,

\[
\hat{a_{ij}} = \frac{\sum_{t=1}^{T-1} \xi_{ij} (t)}{\sum_{t=1}^{T-1} \gamma_i(t)} \qquad (2)
\]

This is the same as equation \( (1) \) we derived earlier.

However, since
\[
\gamma_i(t) = \sum_{j=1}^M \xi_{ij}(t)
\]

we can just use \(\xi_{ij}(t)\) to define \(\hat{a_{ij}}\). This will save some computation.

In summary, in case you see the estimate of \(a_{ij}\) written with equation \( (2) \), don’t be confused: \((1) \) and \( (2)\) are identical, even though the representations are different.
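
The identity \( \gamma_i(t) = \sum_j \xi_{ij}(t) \) is easy to verify numerically. The sketch below computes \( \gamma \) in both ways for a toy model (all parameter values are assumptions made up for this check) and stores the results in `gamma_from_xi` and `gamma_from_alpha_beta`.

```python
# Toy model, assumed for illustration only.
A = [[0.54, 0.46], [0.49, 0.51]]
B = [[0.16, 0.26, 0.58], [0.25, 0.28, 0.47]]
pi = [0.5, 0.5]
V = [0, 1, 2, 2, 1, 0]
M, T = len(A), len(V)

# Forward probabilities alpha[t][j].
alpha = [[pi[j] * B[j][V[0]] for j in range(M)]]
for t in range(1, T):
    last = alpha[-1]
    alpha.append([sum(last[i] * A[i][j] for i in range(M)) * B[j][V[t]]
                  for j in range(M)])

# Backward probabilities beta[t][i], last row fixed to 1.
beta = [[1.0] * M for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][V[t + 1]] * beta[t + 1][j] for j in range(M))
               for i in range(M)]

p_v = sum(alpha[-1])  # p(V | theta)

# gamma_i(t) obtained by summing xi_ij(t) over j (defined for t = 0 .. T-2).
gamma_from_xi = [[sum(alpha[t][i] * A[i][j] * B[j][V[t + 1]] * beta[t + 1][j]
                      for j in range(M)) / p_v
                  for i in range(M)] for t in range(T - 1)]

# gamma_i(t) obtained directly from alpha and beta.
gamma_from_alpha_beta = [[alpha[t][i] * beta[t][i] / p_v for i in range(M)]
                         for t in range(T - 1)]

for row_xi, row_ab in zip(gamma_from_xi, gamma_from_alpha_beta):
    print(row_xi, row_ab)
```

The two columns of output should agree up to floating point noise, and each row should sum to 1.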

Derivation of \( \hat{b_{jk}}\):

\( b_{jk}\) is the probability of observing a given symbol \(v_k\) from the observations V, given a hidden state j.

We already know the probability of being in state j at time t.

\[
\gamma_j(t) = \frac{\alpha_j(t) \beta_j(t)}{ \sum_{j=1}^M \alpha_j(t) \beta_j(t)}
\]

We can compute \( \hat{b_{jk}}\) using \(\gamma_j(t)\),

\[
\hat{b_{jk}} = \frac{\sum_{t=1}^T \gamma_j(t) 1(v(t)=k)}{\sum_{t=1}^T \gamma_j(t) }
\]

where \(1(v(t)=k)\) is the indicator function.
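
Here is a minimal sketch of this estimate in plain Python, with the indicator function written out as an explicit `1 if ... else 0`. As before, the model parameters and the sequence `V` are toy values assumed only for this example.

```python
# Toy model, assumed for illustration only.
A = [[0.54, 0.46], [0.49, 0.51]]
B = [[0.16, 0.26, 0.58], [0.25, 0.28, 0.47]]
pi = [0.5, 0.5]
V = [0, 1, 2, 2, 1, 0]
M, K, T = len(A), len(B[0]), len(V)

# Forward probabilities alpha[t][j].
alpha = [[pi[j] * B[j][V[0]] for j in range(M)]]
for t in range(1, T):
    last = alpha[-1]
    alpha.append([sum(last[i] * A[i][j] for i in range(M)) * B[j][V[t]]
                  for j in range(M)])

# Backward probabilities beta[t][i], last row fixed to 1.
beta = [[1.0] * M for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][V[t + 1]] * beta[t + 1][j] for j in range(M))
               for i in range(M)]

p_v = sum(alpha[-1])
gamma = [[alpha[t][j] * beta[t][j] / p_v for j in range(M)] for t in range(T)]

# b_hat[j][k]: expected time in state j while observing symbol k,
# divided by the expected total time in state j.
b_hat = [[sum(gamma[t][j] * (1 if V[t] == k else 0) for t in range(T)) /
          sum(gamma[t][j] for t in range(T))
          for k in range(K)] for j in range(M)]
print(b_hat)
```

Unlike the transition estimate, the sums here run over all T time steps, because every time step emits a symbol.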

Final EM Algorithm:

  • initialize A and B
  • iterate until convergence
    • E-Step
      • \( \xi_{ij} (t) = \frac{\alpha_i(t) a_{ij} b_{jk \text{ } v(t+1) }\beta_j(t+1)}{\sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i(t) a_{ij} b_{jk \text{ } v(t+1) }\beta_j(t+1)} \)
      • \( \gamma_i(t) = \sum_{j=1}^M \xi_{ij}(t)\)
    • M-Step
      • \( \hat{a_{ij}} = \frac{\sum_{t=1}^{T-1} \xi_{ij} (t)}{\sum_{t=1}^{T-1} \sum_{j=1}^{M} \xi_{ij} (t)} \)
      • \( \hat{b_{jk}} = \frac{\sum_{t=1}^T \gamma_j(t) 1(v(t)=k)}{\sum_{t=1}^T \gamma_j(t) } \)
  • return A,B
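
The steps above say "iterate until convergence". One common way to decide that (an assumption on my part; it is not what the fixed-iteration code later in this article does) is to track the log-likelihood \( \log p(V^T | \theta) \) and stop when it improves by less than a small tolerance. Below is a plain Python sketch; `em_step`, `baum_welch_until_converged`, `tol` and the toy sequence `V` are hypothetical names and values, and the initial distribution is kept fixed.

```python
import math


def forward(V, A, B, pi):
    M = len(A)
    alpha = [[pi[j] * B[j][V[0]] for j in range(M)]]
    for t in range(1, len(V)):
        last = alpha[-1]
        alpha.append([sum(last[i] * A[i][j] for i in range(M)) * B[j][V[t]]
                      for j in range(M)])
    return alpha


def backward(V, A, B):
    M, T = len(A), len(V)
    beta = [[1.0] * M for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][V[t + 1]] * beta[t + 1][j] for j in range(M))
                   for i in range(M)]
    return beta


def em_step(V, A, B, pi):
    """One E-step and M-step. Also returns log p(V | theta) of the INPUT parameters."""
    M, K, T = len(A), len(B[0]), len(V)
    alpha, beta = forward(V, A, B, pi), backward(V, A, B)
    p_v = sum(alpha[-1])
    # E-step: xi for t = 0 .. T-2, gamma for t = 0 .. T-1
    xi = [[[alpha[t][i] * A[i][j] * B[j][V[t + 1]] * beta[t + 1][j] / p_v
            for j in range(M)] for i in range(M)] for t in range(T - 1)]
    gamma = [[alpha[t][i] * beta[t][i] / p_v for i in range(M)] for t in range(T)]
    # M-step
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(M)] for i in range(M)]
    new_B = [[sum(gamma[t][j] for t in range(T) if V[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(K)] for j in range(M)]
    return new_A, new_B, math.log(p_v)


def baum_welch_until_converged(V, A, B, pi, tol=1e-6, max_iter=200):
    history = []  # log-likelihood before each update
    for _ in range(max_iter):
        A, B, log_lik = em_step(V, A, B, pi)
        history.append(log_lik)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return A, B, history


# Toy data and starting values (assumed).
V = [2, 2, 1, 0, 2, 1, 2, 0, 1, 2, 2, 0, 1, 2, 2, 1, 0, 2, 2, 1]
A0 = [[0.5, 0.5], [0.5, 0.5]]
B0 = [[1 / 9, 3 / 9, 5 / 9], [2 / 12, 4 / 12, 6 / 12]]
A_hat, B_hat, history = baum_welch_until_converged(V, A0, B0, [0.5, 0.5])
print(len(history), history[0], history[-1])
```

Because this sketch multiplies raw probabilities, it is only suitable for short sequences; for long ones the forward and backward values underflow and need scaling or log-space arithmetic.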

Lagrange Multipliers:

We can represent the Learning problem as a constrained optimization problem and define it as,

\[
\begin{align}
\text{Maximize } & p(V^T| \theta) \\
\text{ where } \theta &= \{\pi, A , B \} \\
\text{Subject to } &
\begin{cases} \sum_{i=1}^M \pi_i=1\\
\sum_{j=1}^M a_{ij}=1, \forall i \in \{ 1,\dots,M\} \\
\sum_{k=1}^K b_{jk}=1, \forall j \in \{ 1,\dots,M\}
\end{cases}
\end{align}
\]

We can then solve this using Lagrange Multipliers and by taking the derivatives. We are not going to go through the details of that derivation here; however, if you are interested, let me know and I can expand this section.

Code :

R-Script:

Here is the implementation of the algorithm.

  • In line# 23-24, we are appending the T‘th data into the \(\gamma\) since \( \xi \)’s length is T-1
  • We are using \( \xi \) to derive \(\gamma\).
  • The indicator function has been implemented using which in line# 26.

BaumWelch = function(v, a, b, initial_distribution, n.iter = 100){

  for(i in 1:n.iter){
    T = length(v)
    M = nrow(a)
    K=ncol(b)
    alpha = forward(v, a, b, initial_distribution)
    beta = backward(v, a, b)
    xi = array(0, dim=c(M, M, T-1))
    
    for(t in 1:(T-1)){  # 1:T-1 would parse as (1:T)-1, i.e. 0:(T-1)
      denominator = ((alpha[t,] %*% a) * b[,v[t+1]]) %*% matrix(beta[t+1,]) 
      for(s in 1:M){
        numerator = alpha[t,s] * a[s,] * b[,v[t+1]] * beta[t+1,]
        xi[s,,t]=numerator/as.vector(denominator)
      }
    }
    
    
    xi.all.t = rowSums(xi, dims = 2)
    a = xi.all.t/rowSums(xi.all.t)
    
    gamma = apply(xi, c(1, 3), sum)  
    gamma = cbind(gamma, colSums(xi[, , T-1]))
    for(l in 1:K){
      b[, l] = rowSums(gamma[, which(v==l)])
    }
    b = b/rowSums(b)
    
  }
  return(list(a = a, b = b, initial_distribution = initial_distribution))
}

Here is the full code.

forward = function(v, a, b, initial_distribution){
  
  T = length(v)
  M = nrow(a)
  alpha = matrix(0, T, M)
  
  alpha[1, ] = initial_distribution*b[, v[1]]
  
  for(t in 2:T){
    tmp = alpha[t-1, ] %*% a
    alpha[t, ] = tmp * b[, v[t]]
  }
  return(alpha)
}

backward = function(v, a, b){
  T = length(v)
  M = nrow(a)
  beta = matrix(1, T, M)
  
  for(t in (T-1):1){
    tmp = as.matrix(beta[t+1, ] * b[, v[t+1]])
    beta[t, ] = t(a %*% tmp)
  }
  return(beta)
}


BaumWelch = function(v, a, b, initial_distribution, n.iter = 100){

  for(i in 1:n.iter){
    T = length(v)
    M = nrow(a)
    K=ncol(b)
    alpha = forward(v, a, b, initial_distribution)
    beta = backward(v, a, b)
    xi = array(0, dim=c(M, M, T-1))
    
    for(t in 1:(T-1)){  # 1:T-1 would parse as (1:T)-1, i.e. 0:(T-1)
      denominator = ((alpha[t,] %*% a) * b[,v[t+1]]) %*% matrix(beta[t+1,]) 
      for(s in 1:M){
        numerator = alpha[t,s] * a[s,] * b[,v[t+1]] * beta[t+1,]
        xi[s,,t]=numerator/as.vector(denominator)
      }
    }
    
    
    xi.all.t = rowSums(xi, dims = 2)
    a = xi.all.t/rowSums(xi.all.t)
    
    gamma = apply(xi, c(1, 3), sum)  
    gamma = cbind(gamma, colSums(xi[, , T-1]))
    for(l in 1:K){
      b[, l] = rowSums(gamma[, which(v==l)])
    }
    b = b/rowSums(b)
    
  }
  return(list(a = a, b = b, initial_distribution = initial_distribution))
}

data = read.csv("data_r.csv")

M=2; K=3
A = matrix(1, M, M)
A = A/rowSums(A)
B = matrix(1:6, M, K)
B = B/rowSums(B)
initial_distribution = c(1/2, 1/2)

(myout = BaumWelch(data$Visible, A, B, initial_distribution, n.iter = 100))

Output:

$a
          [,1]      [,2]
[1,] 0.5381634 0.4618366
[2,] 0.4866444 0.5133556

$b
          [,1]      [,2]      [,3]
[1,] 0.1627751 0.2625807 0.5746441
[2,] 0.2514996 0.2778097 0.4706907

$initial_distribution
[1] 0.5 0.5

Validate Result:

Let’s validate our result with the HMM R package.

library(HMM)
hmm =initHMM(c("A", "B"), c(1, 2, 3), 
              startProbs = initial_distribution,
              transProbs = A, emissionProbs = B)

true.out = baumWelch(hmm, data$Visible, maxIterations=100, pseudoCount=0)
true.out$hmm

Here is the output, which is exactly the same as our output.

$States
[1] "A" "B"

$Symbols
[1] 1 2 3

$startProbs
  A   B 
0.5 0.5 

$transProbs
    to
from         A         B
   A 0.5381634 0.4618366
   B 0.4866444 0.5133556

$emissionProbs
      symbols
states         1         2         3
     A 0.1627751 0.2625807 0.5746441
     B 0.2514996 0.2778097 0.4706907

Python:

Here is the Python code for the Baum-Welch algorithm; the logic is the same as in the R version.

def baum_welch(V, a, b, initial_distribution, n_iter=100):
    M = a.shape[0]
    T = len(V)

    for n in range(n_iter):
        alpha = forward(V, a, b, initial_distribution)
        beta = backward(V, a, b)

        xi = np.zeros((M, M, T - 1))
        for t in range(T - 1):
            denominator = np.dot(np.dot(alpha[t, :].T, a) * b[:, V[t + 1]].T, beta[t + 1, :])
            for i in range(M):
                numerator = alpha[t, i] * a[i, :] * b[:, V[t + 1]].T * beta[t + 1, :].T
                xi[i, :, t] = numerator / denominator

        gamma = np.sum(xi, axis=1)
        a = np.sum(xi, 2) / np.sum(gamma, axis=1).reshape((-1, 1))

        # Add additional T'th element in gamma
        gamma = np.hstack((gamma, np.sum(xi[:, :, T - 2], axis=0).reshape((-1, 1))))

        K = b.shape[1]
        denominator = np.sum(gamma, axis=1)
        for l in range(K):
            b[:, l] = np.sum(gamma[:, V == l], axis=1)

        b = np.divide(b, denominator.reshape((-1, 1)))

    return {"a":a, "b":b}

Here is the full code:

import pandas as pd
import numpy as np


def forward(V, a, b, initial_distribution):
    alpha = np.zeros((V.shape[0], a.shape[0]))
    alpha[0, :] = initial_distribution * b[:, V[0]]

    for t in range(1, V.shape[0]):
        for j in range(a.shape[0]):
            # Matrix Computation Steps
            #                  ((1x2) . (1x2))      *     (1)
            #                        (1)            *     (1)
            alpha[t, j] = alpha[t - 1].dot(a[:, j]) * b[j, V[t]]

    return alpha


def backward(V, a, b):
    beta = np.zeros((V.shape[0], a.shape[0]))

    # setting beta(T) = 1
    beta[V.shape[0] - 1] = np.ones((a.shape[0]))

    # Loop in backward way from T-1 to
    # Due to python indexing the actual loop will be T-2 to 0
    for t in range(V.shape[0] - 2, -1, -1):
        for j in range(a.shape[0]):
            beta[t, j] = (beta[t + 1] * b[:, V[t + 1]]).dot(a[j, :])

    return beta


def baum_welch(V, a, b, initial_distribution, n_iter=100):
    M = a.shape[0]
    T = len(V)

    for n in range(n_iter):
        alpha = forward(V, a, b, initial_distribution)
        beta = backward(V, a, b)

        xi = np.zeros((M, M, T - 1))
        for t in range(T - 1):
            denominator = np.dot(np.dot(alpha[t, :].T, a) * b[:, V[t + 1]].T, beta[t + 1, :])
            for i in range(M):
                numerator = alpha[t, i] * a[i, :] * b[:, V[t + 1]].T * beta[t + 1, :].T
                xi[i, :, t] = numerator / denominator

        gamma = np.sum(xi, axis=1)
        a = np.sum(xi, 2) / np.sum(gamma, axis=1).reshape((-1, 1))

        # Add additional T'th element in gamma
        gamma = np.hstack((gamma, np.sum(xi[:, :, T - 2], axis=0).reshape((-1, 1))))

        K = b.shape[1]
        denominator = np.sum(gamma, axis=1)
        for l in range(K):
            b[:, l] = np.sum(gamma[:, V == l], axis=1)

        b = np.divide(b, denominator.reshape((-1, 1)))

    return {"a":a, "b":b}


data = pd.read_csv('data_python.csv')

V = data['Visible'].values

# Transition Probabilities
a = np.ones((2, 2))
a = a / np.sum(a, axis=1).reshape((-1, 1))

# Emission Probabilities
b = np.array(((1, 3, 5), (2, 4, 6)))
b = b / np.sum(b, axis=1).reshape((-1, 1))

# Equal Probabilities for the initial distribution
initial_distribution = np.array((0.5, 0.5))

print(baum_welch(V, a, b, initial_distribution, n_iter=100))

Output:

Here is the output of our code. It’s the same as the previous one; only the printed precision is different.

{
'a': array([[0.53816345, 0.46183655],
       [0.48664443, 0.51335557]]), 

'b': array([[0.16277513, 0.26258073, 0.57464414],
       [0.2514996 , 0.27780971, 0.47069069]])
}

Conclusion:

We went through the details of the Learning Algorithm of HMM here. I hope that this article helped you to understand the concept.

Click on the link to get the code:

Also, here is the list of all the articles in this series:

  1. Introduction to Hidden Markov Model
  2. Forward and Backward Algorithm in Hidden Markov Model
  3. Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model
  4. Implement Viterbi Algorithm in Hidden Markov Model using Python and R

The post Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model appeared first on A Developer Diary.

Implement Viterbi Algorithm in Hidden Markov Model using Python and R


The 3rd and final problem in Hidden Markov Model is the Decoding Problem. In this article we will implement the Viterbi Algorithm in Hidden Markov Model using Python and R. The Viterbi Algorithm is a dynamic programming algorithm and is computationally very efficient. We will start with the formal definition of the Decoding Problem, then go through the solution and finally implement it. This is the 4th part of the Introduction to Hidden Markov Model tutorial series. This one might be the easiest one to follow along.

Quick Recap:

We have learned about the three problems of HMM. We went through the Evaluation and Learning Problems in detail, including implementation using Python and R, in my previous articles. In case you want to refresh your memory, please refer to them.

Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model

Decoding Problem:

Given a sequence of visible symbols \(V^T\) and the model ( \( \theta \rightarrow \{ A, B \} \) ), find the most probable sequence of hidden states \(S^T\).

In general we could enumerate all the possible sequences of hidden states for the given sequence of visible symbols and then identify the most probable one. However, just like we have seen earlier, it will be an exponentially complex problem \( O(N^T \cdot T) \) to solve, where N is the number of hidden states.
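
To see what the brute-force approach looks like, here is a small plain Python sketch that enumerates every possible hidden path for a 6-symbol sequence and picks the most probable one. The model values and the sequence are toy assumptions, and `joint_probability` is a helper name I made up for this example.

```python
from itertools import product

# Toy model, assumed for illustration only (2 hidden states, 3 symbols).
A = [[0.54, 0.46], [0.49, 0.51]]
B = [[0.16, 0.26, 0.58], [0.25, 0.28, 0.47]]
pi = [0.5, 0.5]
V = [0, 1, 2, 2, 1, 0]


def joint_probability(path, V, A, B, pi):
    """p(path, V | theta): T multiplications of transition and emission terms."""
    p = pi[path[0]] * B[path[0]][V[0]]
    for t in range(1, len(V)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][V[t]]
    return p


# Every possible hidden path: (number of states) ** T of them.
all_paths = list(product(range(len(A)), repeat=len(V)))
best_path = max(all_paths, key=lambda path: joint_probability(path, V, A, B, pi))

print(len(all_paths))
print(best_path, joint_probability(best_path, V, A, B, pi))
```

With 2 states and 6 symbols there are only 64 paths, but the count doubles with every additional observation, so this is not usable for sequences of realistic length.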

Viterbi Algorithm:

We will be using a much more efficient algorithm, named the Viterbi Algorithm, to solve the decoding problem. So far in HMM we went deep into deriving equations for all the algorithms in order to understand them clearly. However, the Viterbi Algorithm is best understood using a worked example rather than equations. I will provide the mathematical definition of the algorithm first, then work through a specific example.

Probabilistic View:

The decoding problem is similar to the Forward Algorithm. In the Forward Algorithm we compute the likelihood of the observation sequence by summing over all possible hidden state sequences; in the decoding problem we instead keep only the most probable path into each hidden state at every time step t.

The following equation represents the highest probability along a single path for the first t observations which ends at state i.

\[
\omega _i(t)= \max_{s_1,\dots,s_{t-1}} p(s_1,s_2 \dots s_t=i, v_1,v_2 \dots v_t | \theta)
\]

We can use the same approach as the Forward Algorithm to calculate \( \omega _j(t+1) \), replacing the sum over the previous states with a max:

\[
\omega _j(t+1)= \max_i \Big( \omega _i(t) a_{ij} b_{jk v(t+1)} \Big)
\]

Now, to find the sequence of hidden states, we need to record, for each state j at time t+1, the previous state i that achieves this maximum (the backpointer).

\[
\arg \max_i \Big( \omega _i(t) a_{ij} \Big)
\]

Once we complete the above steps for all the observations, we first find the last hidden state by maximum likelihood, then use the backpointers to backtrack the most likely hidden path.

Everything I said above may not make a lot of sense now. Go through the example below and then come back to read this part. I hope it will be easier to understand once you have the intuition.

Example:

Our example will be the same one used during programming, where we have two hidden states A,B and three visible symbols 1,2,3. Assume we have a sequence of 6 visible symbols and the model \( \theta \). We need to predict the sequence of the hidden states for the visible symbols.

If we draw the trellis diagram, it will look like the fig 1. Note, here \( S_1 = A\) and \( S_2 = B\).

As stated earlier, we need to find out for every time step t and each hidden state what will be the most probable next hidden state.

Assume when t = 2, the probability of transitioning to \( S_2(2) \) from \( S_1(1) \) is higher than transitioning to \( S_1(2) \), so we keep track of this. This is highlighted by the red arrow from \( S_1(1) \) to \( S_2(2) \) in the below diagram. The other path is in gray dashed line, which is not required now.

Likewise, we repeat the same for each hidden state. In other words, assume that at t=1 \( S_2(1) \) was the hidden state and that at t=2 the probability of transitioning to \( S_1(2) \) from \( S_2(1) \) is higher; hence it’s highlighted in red.

Implement Viterbi Algorithm in Hidden Markov Model using Python and R adeveloperdiary.com

We can repeat the same process for all the remaining observations. The trellis diagram will look like following.

Implement Viterbi Algorithm in Hidden Markov Model using Python and R adeveloperdiary.com

The output of the above process is the sequence of the most probable states (1) [below diagram] and the corresponding probabilities (2). As we go through finding the most probable state (1) for each time step, we will have a 2x5 matrix (in general M x (T-1)) as below:

Implement Viterbi Algorithm in Hidden Markov Model using Python and R adeveloperdiary.com

The first number 2 in the above diagram indicates that the current hidden state 1 (since it’s in the 1st row) transitioned from the previous hidden state 2.

Let’s take one more example: the 2 in the 2nd row, 2nd column indicates that the current state 2 (since it’s in the 2nd row) transitioned from the previous hidden state 2. If you refer to fig 1, you can see it’s true, since at time 3 the hidden state \(S_2\) transitioned from \(S_2\) [as per the red arrow line].

Similar to the most probable state (at each time step), we will have another matrix of size 2 x 6 (in general M x T) for the corresponding probabilities (2). Next we find the last state by comparing the probabilities (2) of the T’th step in this matrix.

Assume, in this example, the last state is 1 (A). We add that to our empty path array, then we find the previous most probable hidden state by backtracking in the most probable states (1) matrix. Refer to fig 3 below for the derived most probable path. The path could have been different if the last hidden state was 2 (B).

Implement Viterbi Algorithm in Hidden Markov Model using Python and R adeveloperdiary.com

The final most probable path in this case is given in the diagram below, which matches the path shown in fig 1.

Implement Viterbi Algorithm in Hidden Markov Model using Python and R adeveloperdiary.com
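
The two matrices from the example can be reproduced for a 6-symbol sequence with a few lines of plain Python. This is a sketch with toy parameters that I assumed for illustration, so the numbers will not match the diagrams; states are numbered 0 and 1 here (for A and B), and `prev` is stored as (T-1) rows of M backpointers, i.e. the transpose of the M x (T-1) layout described above.

```python
import math

# Toy model, assumed for illustration only.
A = [[0.54, 0.46], [0.49, 0.51]]
B = [[0.16, 0.26, 0.58], [0.25, 0.28, 0.47]]
pi = [0.5, 0.5]
V = [0, 1, 2, 2, 1, 0]
T, M = len(V), len(A)

# omega[t][j]: log-probability of the best path ending in state j at time t (2)
omega = [[math.log(pi[j] * B[j][V[0]]) for j in range(M)]]
# prev[t-1][j]: best previous state for state j at time t (1)
prev = []
for t in range(1, T):
    row, back = [], []
    for j in range(M):
        scores = [omega[t - 1][i] + math.log(A[i][j]) + math.log(B[j][V[t]])
                  for i in range(M)]
        best_i = max(range(M), key=scores.__getitem__)
        back.append(best_i)
        row.append(scores[best_i])
    omega.append(row)
    prev.append(back)

# Pick the most probable last state, then follow the backpointers.
state = max(range(M), key=omega[-1].__getitem__)
path = [state]
for back in reversed(prev):
    state = back[state]
    path.append(state)
path.reverse()
labels = ["AB"[s] for s in path]


def log_joint(states):
    """log p(states, V | theta), used only to double-check the result."""
    total = math.log(pi[states[0]] * B[states[0]][V[0]])
    for t in range(1, T):
        total += math.log(A[states[t - 1]][states[t]]) + math.log(B[states[t]][V[t]])
    return total


print(list(zip(*prev)))   # backpointers in the M x (T-1) layout
print(labels, log_joint(path))
```

The log-probability of the backtracked path should equal the largest value in the last row of `omega`, which is a cheap consistency check for the backtracking loop.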

Code:

Now let’s look at the code. We will start with Python first.

Python:

The code has comments and follows the same intuition as the example. One implementation trick is to use the log scale so that we don’t get an underflow error.

def viterbi(V, a, b, initial_distribution):
    T = V.shape[0]
    M = a.shape[0]

    omega = np.zeros((T, M))
    omega[0, :] = np.log(initial_distribution * b[:, V[0]])

    prev = np.zeros((T - 1, M))

    for t in range(1, T):
        for j in range(M):
            # Same as Forward Probability
            probability = omega[t - 1] + np.log(a[:, j]) + np.log(b[j, V[t]])

            # This is our most probable state given previous state at time t (1)
            prev[t - 1, j] = np.argmax(probability)

            # This is the probability of the most probable state (2)
            omega[t, j] = np.max(probability)

    # Path Array
    S = np.zeros(T)

    # Find the most probable last hidden state
    last_state = np.argmax(omega[T - 1, :])

    S[0] = last_state

    backtrack_index = 1
    for i in range(T - 2, -1, -1):
        S[backtrack_index] = prev[i, int(last_state)]
        last_state = prev[i, int(last_state)]
        backtrack_index += 1

    # Flip the path array since we were backtracking
    S = np.flip(S, axis=0)

    # Convert numeric values to actual hidden states
    result = []
    for s in S:
        if s == 0:
            result.append("A")
        else:
            result.append("B")

    return result
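
The reason for working in log scale is easy to demonstrate. In the sketch below (the value 0.3 and the length 1000 are arbitrary choices of mine), multiplying many probabilities collapses to exactly 0.0 in double precision, while the sum of their logs stays a perfectly usable number.

```python
import math

# 1000 probabilities of 0.3 each (arbitrary illustrative values).
probs = [0.3] * 1000

# Plain multiplication: the true value is far below the smallest
# positive double, so the running product underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Log scale: add the logs instead of multiplying the probabilities.
log_sum = sum(math.log(p) for p in probs)

print(product)   # underflowed
print(log_sum)   # still finite
```

Since log is monotonic, taking the max or argmax over log values picks the same state as it would over the raw probabilities, which is why the Viterbi code can switch to sums of logs without changing the result.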

Here is the full Python Code:

import pandas as pd
import numpy as np


def forward(V, a, b, initial_distribution):
    alpha = np.zeros((V.shape[0], a.shape[0]))
    alpha[0, :] = initial_distribution * b[:, V[0]]

    for t in range(1, V.shape[0]):
        for j in range(a.shape[0]):
            # Matrix Computation Steps
            #                  ((1x2) . (1x2))      *     (1)
            #                        (1)            *     (1)
            alpha[t, j] = alpha[t - 1].dot(a[:, j]) * b[j, V[t]]

    return alpha


def backward(V, a, b):
    beta = np.zeros((V.shape[0], a.shape[0]))

    # setting beta(T) = 1
    beta[V.shape[0] - 1] = np.ones((a.shape[0]))

    # Loop in backward way from T-1 to
    # Due to python indexing the actual loop will be T-2 to 0
    for t in range(V.shape[0] - 2, -1, -1):
        for j in range(a.shape[0]):
            beta[t, j] = (beta[t + 1] * b[:, V[t + 1]]).dot(a[j, :])

    return beta


def baum_welch(V, a, b, initial_distribution, n_iter=100):
    M = a.shape[0]
    T = len(V)

    for n in range(n_iter):
        alpha = forward(V, a, b, initial_distribution)
        beta = backward(V, a, b)

        xi = np.zeros((M, M, T - 1))
        for t in range(T - 1):
            denominator = np.dot(np.dot(alpha[t, :].T, a) * b[:, V[t + 1]].T, beta[t + 1, :])
            for i in range(M):
                numerator = alpha[t, i] * a[i, :] * b[:, V[t + 1]].T * beta[t + 1, :].T
                xi[i, :, t] = numerator / denominator

        gamma = np.sum(xi, axis=1)
        a = np.sum(xi, 2) / np.sum(gamma, axis=1).reshape((-1, 1))

        # Add additional T'th element in gamma
        gamma = np.hstack((gamma, np.sum(xi[:, :, T - 2], axis=0).reshape((-1, 1))))

        K = b.shape[1]
        denominator = np.sum(gamma, axis=1)
        for l in range(K):
            b[:, l] = np.sum(gamma[:, V == l], axis=1)

        b = np.divide(b, denominator.reshape((-1, 1)))

    return (a, b)


def viterbi(V, a, b, initial_distribution):
    T = V.shape[0]
    M = a.shape[0]

    omega = np.zeros((T, M))
    omega[0, :] = np.log(initial_distribution * b[:, V[0]])

    prev = np.zeros((T - 1, M))

    for t in range(1, T):
        for j in range(M):
            # Same as Forward Probability
            probability = omega[t - 1] + np.log(a[:, j]) + np.log(b[j, V[t]])

            # This is our most probable state given previous state at time t (1)
            prev[t - 1, j] = np.argmax(probability)

            # This is the probability of the most probable state (2)
            omega[t, j] = np.max(probability)

    # Path Array
    S = np.zeros(T)

    # Find the most probable last hidden state
    last_state = np.argmax(omega[T - 1, :])

    S[0] = last_state

    backtrack_index = 1
    for i in range(T - 2, -1, -1):
        S[backtrack_index] = prev[i, int(last_state)]
        last_state = prev[i, int(last_state)]
        backtrack_index += 1

    # Flip the path array since we were backtracking
    S = np.flip(S, axis=0)

    # Convert numeric values to actual hidden states
    result = []
    for s in S:
        if s == 0:
            result.append("A")
        else:
            result.append("B")

    return result


data = pd.read_csv('data_python.csv')

V = data['Visible'].values

# Transition Probabilities
a = np.ones((2, 2))
a = a / np.sum(a, axis=1).reshape((-1, 1))

# Emission Probabilities
b = np.array(((1, 3, 5), (2, 4, 6)))
b = b / np.sum(b, axis=1).reshape((-1, 1))

# Equal Probabilities for the initial distribution
initial_distribution = np.array((0.5, 0.5))

a, b = baum_welch(V, a, b, initial_distribution, n_iter=100)

print(viterbi(V, a, b, initial_distribution))

Output:

I am showing only a partial result here. Later we will compare this with the HMM library.

['B', 'B', 'A', 'A', 
... 
'A', 'A', 
'A', 'A', 'B', 'B', 'B', 'A', 
'A', 'A', 'A', 'A', 'A', 'A']

R Script:

The R code below does not have any comments. You can find them in the python code ( they are structurally the same )

Viterbi=function(v,a,b,initial_distribution) {
  
  T = length(v)
  M = nrow(a)
  prev = matrix(0, T-1, M)
  omega = matrix(0, M, T)
  
  omega[, 1] = log(initial_distribution * b[, v[1]])
  for(t in 2:T){
    for(s in 1:M) {
      probs = omega[, t - 1] + log(a[, s]) + log(b[s, v[t]])
      prev[t - 1, s] = which.max(probs)
      omega[s, t] = max(probs)
    }
  }
  
  S = rep(0, T)
  last_state=which.max(omega[,ncol(omega)])
  S[1]=last_state
  
  j=2
  for(i in (T-1):1){
    S[j]=prev[i,last_state] 
    last_state=prev[i,last_state] 
    j=j+1
  }
  
  S[which(S==1)]='A'
  S[which(S==2)]='B'
  
  S=rev(S)
  
  return(S)
  
}

Full R Code:

forward = function(v, a, b, initial_distribution){
  
  T = length(v)
  M = nrow(a)
  alpha = matrix(0, T, M)
  
  alpha[1, ] = initial_distribution*b[, v[1]]
  
  for(t in 2:T){
    tmp = alpha[t-1, ] %*% a
    alpha[t, ] = tmp * b[, v[t]]
  }
  return(alpha)
}

backward = function(v, a, b){
  T = length(v)
  M = nrow(a)
  beta = matrix(1, T, M)
  
  for(t in (T-1):1){
    tmp = as.matrix(beta[t+1, ] * b[, v[t+1]])
    beta[t, ] = t(a %*% tmp)
  }
  return(beta)
}


BaumWelch = function(v, a, b, initial_distribution, n.iter = 100){

  for(i in 1:n.iter){
    T = length(v)
    M = nrow(a)
    K=ncol(b)
    alpha = forward(v, a, b, initial_distribution)
    beta = backward(v, a, b)
    xi = array(0, dim=c(M, M, T-1))
    
    for(t in 1:(T-1)){  # 1:T-1 would parse as (1:T)-1, i.e. 0:(T-1)
      denominator = ((alpha[t,] %*% a) * b[,v[t+1]]) %*% matrix(beta[t+1,]) 
      for(s in 1:M){
        numerator = alpha[t,s] * a[s,] * b[,v[t+1]] * beta[t+1,]
        xi[s,,t]=numerator/as.vector(denominator)
      }
    }
    
    
    xi.all.t = rowSums(xi, dims = 2)
    a = xi.all.t/rowSums(xi.all.t)
    
    gamma = apply(xi, c(1, 3), sum)  
    gamma = cbind(gamma, colSums(xi[, , T-1]))
    for(l in 1:K){
      b[, l] = rowSums(gamma[, which(v==l)])
    }
    b = b/rowSums(b)
    
  }
  return(list(a = a, b = b, initial_distribution = initial_distribution))
}


Viterbi=function(v,a,b,initial_distribution) {
  
  T = length(v)
  M = nrow(a)
  prev = matrix(0, T-1, M)
  omega = matrix(0, M, T)
  
  omega[, 1] = log(initial_distribution * b[, v[1]])
  for(t in 2:T){
    for(s in 1:M) {
      probs = omega[, t - 1] + log(a[, s]) + log(b[s, v[t]])
      prev[t - 1, s] = which.max(probs)
      omega[s, t] = max(probs)
    }
  }
  
  S = rep(0, T)
  last_state=which.max(omega[,ncol(omega)])
  S[1]=last_state
  
  j=2
  for(i in (T-1):1){
    S[j]=prev[i,last_state] 
    last_state=prev[i,last_state] 
    j=j+1
  }
  
  S[which(S==1)]='A'
  S[which(S==2)]='B'
  
  S=rev(S)
  
  return(S)
  
}

data = read.csv("data_r.csv")

M=2; K=3
A = matrix(1, M, M)
A = A/rowSums(A)
B = matrix(1:6, M, K)
B = B/rowSums(B)
initial_distribution = c(1/2, 1/2)

myout = BaumWelch(data$Visible, A, B, initial_distribution, n.iter = 100)
myout.hidden=Viterbi(data$Visible,myout$a,myout$b,initial_distribution)
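
For readers following the Python side of this series, the Viterbi() function above can be sketched with NumPy as below. This is only a sketch under stated assumptions, not the code from the Python article: observations are assumed to be 0-indexed integers (the R code uses 1-indexed symbols), states are returned as integers instead of 'A'/'B', and the small transition and emission matrices are made up for illustration.

```python
import numpy as np

def viterbi(v, a, b, initial_distribution):
    """Most likely hidden state path, computed in log space like the R version."""
    T, M = len(v), a.shape[0]
    omega = np.zeros((T, M))              # best log-probability ending in each state
    prev = np.zeros((T - 1, M), dtype=int)  # best predecessor of each state

    omega[0] = np.log(initial_distribution * b[:, v[0]])
    for t in range(1, T):
        for s in range(M):
            probs = omega[t - 1] + np.log(a[:, s]) + np.log(b[s, v[t]])
            prev[t - 1, s] = np.argmax(probs)
            omega[t, s] = np.max(probs)

    # Backtrack from the best final state
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(omega[-1])
    for t in range(T - 2, -1, -1):
        path[t] = prev[t, path[t + 1]]
    return path

# Hypothetical 2-state, 2-symbol model used only to exercise the function
a = np.array([[0.8, 0.2], [0.3, 0.7]])
b = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
path = viterbi([0, 0, 1, 1, 1], a, b, pi)
print(path)
```

With these made-up parameters the path switches from state 0 to state 1 exactly where the observations switch from symbol 0 to symbol 1.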

We can compare our output with the HMM library. Here is the result.

library(HMM)
hmm =initHMM(c("A", "B"), c(1, 2, 3), 
              startProbs = initial_distribution,
              transProbs = A, emissionProbs = B)

true.out = baumWelch(hmm, data$Visible, maxIterations=100, pseudoCount=0)

true.viterbi = viterbi(true.out$hmm, data$Visible)
sum(true.viterbi != myout.hidden)

Output:

> sum(true.viterbi != myout.hidden)
[1] 0
>

Conclusion:

This “Implement Viterbi Algorithm in Hidden Markov Model using Python and R” article is the last part of the Introduction to the Hidden Markov Model tutorial series. I believe these articles will help anyone understand HMM. Here we went through the algorithm for a sequence of discrete visible symbols; the equations are a little different for continuous visible symbols. Please post a comment in case you need more clarification on any of the sections.

Do share this article if you find it useful. The full code can be found at:

Also, here is the list of all the articles in this series:

  1. Introduction to Hidden Markov Model
  2. Forward and Backward Algorithm in Hidden Markov Model
  3. Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model
  4. Implement Viterbi Algorithm in Hidden Markov Model using Python and R

The post Implement Viterbi Algorithm in Hidden Markov Model using Python and R appeared first on A Developer Diary.

Understand and Implement the Backpropagation Algorithm From Scratch In Python

It’s very important to have a clear understanding of how to implement a simple Neural Network from scratch. In this Understand and Implement the Backpropagation Algorithm From Scratch In Python tutorial we go through the step-by-step process of understanding and implementing a Neural Network. We will start from Linear Regression and use the same concept to build a 2-Layer Neural Network. Then we will code an N-Layer Neural Network using Python from scratch. As a prerequisite, you need to have a basic understanding of Linear/Logistic Regression with Gradient Descent.

Let’s see how we can slowly move towards building our first neural network.

Linear Regression:

Here we have represented Linear Regression in a graphical format (bias b is not shown). As you see in the diagram below, we have two input features ( \( x_1, x_2\) ). Z represents the linear combination of the inputs, weighted by the vector w. The node with Z can also be called a hidden unit, since X & Y are visible ( for training ) and Z is something defined inside the model.

We can write the equation for predicting values using the above linear regression as (this is shown using the blue arrow),
\[
\hat{y}= z = b+x_1w_1+x_2w_2
\]

So in order to find the best w, we first need to define the cost function J. To use gradient descent, take the derivative of the cost function J w.r.t. w and b, then update w and b by a fraction (the learning rate) of dw and db until convergence (this is shown using the red arrow).
We can write dw and db as follows ( using the chain rule ).

\[
\frac{dJ}{dW}=\frac{dJ}{dZ}\frac{dZ}{dW} \\
\frac{dJ}{db}=\frac{dJ}{dZ}\frac{dZ}{db}
\]

And the gradient descent equations for updating w and b are,

\[
W := W-\alpha \frac{dJ}{dW} \\
b := b-\alpha \frac{dJ}{db}
\]

In summary, first we predict the \( \hat{y}\), then using this we calculate the cost, after that using gradient descent we adjust the parameters of the model. This happens in a loop and eventually we learn the best parameters (w and b ) to be used in prediction. The above picture depicts the same.
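
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the original post: the cost function is assumed to be the mean squared error \( J = \frac{1}{2n}\sum(\hat{y}-y)^2 \), and the data is synthetic, generated from known weights so that the result can be checked.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 1 + 2*x1 - 3*x2, no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -3.0])

w = np.zeros(2)
b = 0.0
alpha = 0.1  # learning rate

for _ in range(500):
    y_hat = X @ w + b            # forward: z = b + x1*w1 + x2*w2
    dz = (y_hat - y) / len(y)    # dJ/dz for the assumed mean squared error cost
    dw = X.T @ dz                # chain rule: dJ/dw = dJ/dz * dz/dw
    db = dz.sum()                # dz/db = 1
    w -= alpha * dw              # gradient descent update
    b -= alpha * db

print(w, b)
```

After the loop, w and b should be very close to the weights used to generate the data (2, -3 and 1).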

Logistic Regression:

Here we will try to represent Logistic Regression in the same way. Mathematically, Logistic Regression differs from Linear Regression in the following two ways:

  • Logistic Regression has a different Cost Function J
  • It applies a non-linear transformation (Sigmoid) on Z to predict the probability of the class label ( Binary Classification )

As you see in the below diagram the blue arrow indicates the Forward Propagation.

Here are the steps of Forward Propagation in Logistic Regression. ( Matrix Format )

\[
Z=W^TX+b \\
\hat{y}= A = \sigma(Z)
\]

The Gradient Descent ( a.k.a. Backpropagation ) step in Logistic Regression has an additional derivative to calculate.

\[
\frac{dJ}{dW}=\frac{dJ}{dA}\frac{dA}{dZ}\frac{dZ}{dW} \\
\frac{dJ}{db}=\frac{dJ}{dA}\frac{dA}{dZ}\frac{dZ}{db}
\]

The gradient descent equations for updating w and b will be exactly the same as for Linear Regression (they are the same for a Neural Network too),

\[
W := W-\alpha \frac{dJ}{dW} \\
b := b-\alpha \frac{dJ}{db}
\]

The process flow diagram is exactly the same for Logistic Regression too.

We can say that Logistic Regression is a 1-Layer Neural Network. Now we will extend the idea to a 2-Layer Neural Network.
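
To make the "1-Layer Neural Network" view concrete, here is a small sketch of Logistic Regression trained with the forward and backward steps above. The toy data, the learning rate and the number of iterations are assumptions chosen for illustration. The product \( \frac{dJ}{dA}\frac{dA}{dZ} \) is written directly as A - Y, a simplification that is derived later in this article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy, linearly separable data (an assumption for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 300))                        # features in rows, samples in columns
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)   # labels, shape (1, n)

W = np.zeros((2, 1))
b = 0.0
alpha = 0.5
n = X.shape[1]

for _ in range(1000):
    Z = W.T @ X + b          # forward: Z = W^T X + b, shape (1, n)
    A = sigmoid(Z)           # y_hat
    dZ = A - Y               # dJ/dA * dA/dZ for cross entropy with sigmoid
    dW = X @ dZ.T / n        # dJ/dW, shape (2, 1)
    db = dZ.sum() / n        # dJ/db
    W -= alpha * dW
    b -= alpha * db

accuracy = np.mean((A > 0.5) == (Y == 1))
print(accuracy)
```

The only differences from the Linear Regression loop are the sigmoid in the forward step and the cost function behind dZ.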

2-Layer Neural Network:

Extend the same concept to a 2-Layer Neural Network. Refer to the diagram below ( the bias term is not displayed ). There are some minor notation changes; for example, the superscript now denotes the layer number. We have added two more hidden units to our model. The matrix W will have a different dimension for each layer.

In case you are new to Neural Networks, imagine that the output of the first layer is used as the input to the next layer. Earlier, in the case of Logistic Regression, we didn’t have multiple layers. These intermediate hidden layers, together with their non-linear activations, provide a way to solve complex (non-linear) tasks.

We can write the forward propagation in two steps as (Consider uppercase letters as Matrix).

\[
\begin{align}
Z^{[1]}=& W^{[1]}X+b^{[1]} \\
A^{[1]}=& \sigma(Z^{[1]}) \\
Z^{[2]}=& W^{[2]}A^{[1]}+b^{[2]} \\
\hat{y}=& A^{[2]}=\sigma(Z^{[2]})
\end{align}
\]
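
The four forward propagation equations translate almost one to one into NumPy. The sketch below assumes sigmoid for both layers and uses tiny, made-up layer sizes so that the shapes are easy to follow; samples are stored in columns.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, n_hidden, n_samples = 2, 2, 5   # sizes chosen only for illustration

X = rng.normal(size=(n_features, n_samples))
W1 = rng.normal(size=(n_hidden, n_features))
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(size=(1, n_hidden))
b2 = np.zeros((1, 1))

Z1 = W1 @ X + b1      # Z[1] = W[1] X + b[1]
A1 = sigmoid(Z1)      # A[1] = sigma(Z[1])
Z2 = W2 @ A1 + b2     # Z[2] = W[2] A[1] + b[2]
A2 = sigmoid(Z2)      # y_hat, one probability per sample

print(A1.shape, A2.shape)   # (2, 5) (1, 5)
```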

Again, just like Linear and Logistic Regression, gradient descent can be used to find the best W and b. The approach is basically the same:

  • Define a cost function J.
  • Take the derivative (dw, db) of the cost function J w.r.t. w and b.
  • Update w and b using dw, db.

The backpropagation is shown in the above diagram using the red arrows. Let’s find dw and db using the chain rule. This might look complicated; however, if you just follow the arrows you can easily correlate them with the equations.

\[
\begin{align}
dW^{[2]}=&\frac{dJ}{dW^{[2]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dW^{[2]}}\\
db^{[2]}=&\frac{dJ}{db^{[2]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{db^{[2]}}\\
dW^{[1]}=&\frac{dJ}{dW^{[1]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}}\frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{dW^{[1]}}\\
db^{[1]}=&\frac{dJ}{db^{[1]}}=\frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}}\frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{db^{[1]}}
\end{align}
\]

Finally, we will update w and b as follows ( same as the other algorithms ):

\[
W^{[1]} := W^{[1]}-\alpha \frac{dJ}{dW^{[1]}} \\
b^{[1]} := b^{[1]}-\alpha \frac{dJ}{db^{[1]}} \\
W^{[2]} := W^{[2]}-\alpha \frac{dJ}{dW^{[2]}} \\
b^{[2]} := b^{[2]}-\alpha \frac{dJ}{db^{[2]}}
\]

As you see, technically the steps are the same for Linear Regression, Logistic Regression and Neural Network.

In an Artificial Neural Network, the steps in the direction of the blue arrows are called Forward Propagation and the steps in the direction of the red arrows are called Back-Propagation.

Backpropagation:

One major disadvantage of Backpropagation is its computational complexity. Even for a 2-layer Neural Network with 2 hidden units in layer one, we already have pretty complex equations to solve. Imagine the computational complexity for a network having 100s of layers and 1000s of hidden units in each layer. To solve this problem we can use dynamic programming.

The high-level idea is to express the derivative \(dW^{[l]}\) ( where l is the current layer ) using the already calculated values ( \(dA^{[l+1]} , dZ^{[l+1]}\) etc. ) of layer l+1. In a nutshell, this is the Backpropagation Algorithm.

We will derive the Backpropagation algorithm for a 2-Layer Network and then generalize it to an N-Layer Network.

Derivation of 2-Layer Neural Network:

For simplicity, let’s assume our 2-Layer Network only does binary classification. So the final (output) layer will use a Sigmoid Activation function and our Cost function will simply be the Binary Cross Entropy Error Function used in Logistic Regression. The Activation function of the remaining hidden layer can be anything.

Why the above assumptions are important:

Since Backpropagation starts by taking the derivative of the cost/error function, the derivation will be different if we use a different activation function such as Softmax (at the final layer only). Softmax can be used for multi-class classification; I will have a separate post for that.

I will be referring to the diagram above, which I drew to show the Forward and Backpropagation of the 2-Layer Network. So that you don’t have to scroll up and down, I am including the same diagram here again.

Our first objective is to find \( \frac{dJ}{dW^{[2]}} \), where J is the cost function and \( W^{[2]} \) is a matrix of all the weights in the final layer. Using the chain rule of partial derivatives we can write the following ( follow the red path of the Backpropagation in the picture above if you are confused ):

\[
\frac{dJ}{dW^{[2]}} = \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dW^{[2]}}
\]

Our Cross Entropy Error Function for binary classification is :

\[
J= - \frac{1}{n} \bigg( Y\log \left ( A^{[2]} \right ) + \left ( 1-Y \right )\log \left ( 1 - A^{[2]} \right ) \bigg)
\]

Remember, in the above equation \( A^{[2]} \) is nothing but \( \hat{y} \)

Now we can define our \( \frac{dJ}{dW^{[2]}} \) as,

\[
\frac{dJ}{dW^{[2]}} = \Bigg[ -\frac{Y}{A^{[2]}} + \frac{1-Y}{1- A^{[2]}} \Bigg] \Bigg[ A^{[2]} (1- A^{[2]})\Bigg] \Bigg[ A^{[1]}\Bigg]
\]

Let’s take a minute to understand what just happened here. The 1st part is the derivative of the Cost Function w.r.t. \( A^{[2]} \). As long as you know the derivative of log, you can see how this makes sense. ( I have omitted the 1/n factor here; we will ignore it for now, but during coding we will make sure to divide the result by n. )

The 2nd part is the derivative of the Sigmoid activation function. Again, you can derive it yourself just by knowing the derivative of \( e^x \) w.r.t. x.

The 3rd part is the derivative of \( Z^{[2]} \) w.r.t. \( W^{[2]} \).

We already know \( Z^{[2]}\) from our forward propagation,
\[
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}
\]

The derivative of the above \( Z^{[2]}\) w.r.t \( W^{[2]} \) will simply be \( A^{[1]} \).

Simplifying the equation, we get

\[
\require{cancel}
\begin{align}
\frac{dJ}{dW^{[2]}} &= \Bigg[ -Y + \cancel{YA^{[2]}} + A^{[2]} - \cancel{YA^{[2]}} \Bigg] \Bigg[ A^{[1]}\Bigg] \\
&=\Bigg[ A^{[2]} - Y\Bigg] \Bigg[ A^{[1]}\Bigg] \\
&= dZ^{[2]} A^{[1]}
\end{align}
\]

Just note that, (we will use this later)
\[
dZ^{[2]} = \frac{dJ}{dZ^{[2]}} = \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}} = \Bigg[ A^{[2]} - Y\Bigg]
\]
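
The identity \( dZ^{[2]} = A^{[2]} - Y \) is easy to verify numerically. The sketch below uses made-up values for \( Z^{[2]} \) and Y and compares the long chain-rule form with the simplified form.

```python
import numpy as np

rng = np.random.default_rng(0)
Z2 = rng.normal(size=(1, 6))                    # made-up pre-activations
Y = np.array([[0., 1., 1., 0., 1., 0.]])        # made-up labels

A2 = 1.0 / (1.0 + np.exp(-Z2))                  # sigmoid
dA2 = -Y / A2 + (1 - Y) / (1 - A2)              # derivative of the cost w.r.t. A2 (1/n omitted)
dZ2_chain = dA2 * A2 * (1 - A2)                 # dJ/dA2 * dA2/dZ2
dZ2_short = A2 - Y                              # simplified form

print(np.allclose(dZ2_chain, dZ2_short))        # True
```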

Similarly we can define \( \frac{dJ}{db^{[2]}} \) as,

\[
\begin{align}
\frac{dJ}{db^{[2]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{db^{[2]}} \\
&=\Bigg[ A^{[2]} - Y\Bigg] \Bigg[ 1 \Bigg] \\
&=\Bigg[ A^{[2]} - Y\Bigg] \\
&=dZ^{[2]}
\end{align}
\]

We will now move to the first layer, (following the red arrows in the picture)

\[
\begin{align}
\frac{dJ}{dW^{[1]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{dW^{[1]}}\\
&= \frac{dJ}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{dW^{[1]}}\\
&= \Bigg[ A^{[2]} - Y\Bigg] \Bigg[ W^{[2]} \Bigg] \Bigg[ g'{\left ( Z^{[1]} \right )} \Bigg] \Bigg[ A^{[0]}\Bigg] \\
&= dZ^{[2]} W^{[2]} g'{\left ( Z^{[1]} \right )} A^{[0]} \\
& = dZ^{[1]} A^{[0]}
\end{align}
\]

There are a few points to note.

  • First is reusability. The whole objective of dynamic programming is to reuse already computed values in future computation. That’s the reason we are reusing \( dZ^{[2]} \).
  • \(A^{[0]}\) here is nothing but our input X; however, if you have more than 2 layers, it will just be the activation output of the previous layer.
  • We can generalize this equation for any layer except the final layer (the final layer equation depends on the activation of that layer).

We also need the following,
\[
\begin{align}
\frac{dJ}{dA^{[1]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \\
&= \frac{dJ}{dZ^{[2]}}W^{[2]} \\
&= dZ^{[2]}W^{[2]}
\end{align}
\]

Same for \( db^{[1]} \)

\[
\begin{align}
\frac{dJ}{db^{[1]}} &= \frac{dJ}{dA^{[2]}}\frac{dA^{[2]}}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{db^{[1]}}\\
&= \frac{dJ}{dZ^{[2]}}\frac{dZ^{[2]}}{dA^{[1]}} \frac{dA^{[1]}}{dZ^{[1]}}\frac{dZ^{[1]}}{db^{[1]}}\\
&= \Bigg[ A^{[2]} - Y\Bigg] \Bigg[ W^{[2]} \Bigg] \Bigg[ g'{\left ( Z^{[1]} \right )} \Bigg] \Bigg[ 1\Bigg] \\
&= dZ^{[2]} W^{[2]} g'{\left ( Z^{[1]} \right )} \\
& = dZ^{[1]}
\end{align}
\]

Since we have the required derivatives, \( dW^{[2]}, db^{[2]}, dW^{[1]}, db^{[1]}\), it’s time that we define the full algorithm.
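
Before generalizing, it can be reassuring to check the derived formulas against a numerical gradient. The sketch below is an illustration under stated assumptions: sigmoid is used for g as well, the network sizes and data are made up, and the helper numeric_grad is a hypothetical name introduced here. The transposes and the 1/n factor are included so that the shapes and scale match the cost function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W1, b1, W2, b2, X, Y):
    A2 = sigmoid(W2 @ sigmoid(W1 @ X + b1) + b2)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(3, n))                          # made-up inputs
Y = rng.integers(0, 2, size=(1, n)).astype(float)    # made-up labels
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Backpropagation using the derived formulas
A1 = sigmoid(W1 @ X + b1)
A2 = sigmoid(W2 @ A1 + b2)
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / n
dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)    # dA1 * g'(Z1), with g = sigmoid
dW1 = dZ1 @ X.T / n

def numeric_grad(param, idx, eps=1e-6):
    """Central difference of the cost w.r.t. one entry of param (restored afterwards)."""
    old = param[idx]
    param[idx] = old + eps
    plus = cost(W1, b1, W2, b2, X, Y)
    param[idx] = old - eps
    minus = cost(W1, b1, W2, b2, X, Y)
    param[idx] = old
    return (plus - minus) / (2 * eps)

err_W1 = abs(numeric_grad(W1, (2, 1)) - dW1[2, 1])
err_W2 = abs(numeric_grad(W2, (0, 3)) - dW2[0, 3])
print(err_W1, err_W2)
```

Both differences should be tiny (close to floating point noise), which indicates that the analytic derivatives agree with the numerical ones.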

N-Layer Neural Network Algorithm :

We will now define the full algorithm of an N-Layer Neural Network by generalizing the equations we have derived for our 2-Layer Network.

\[
\begin{align}
& \bullet \text{Initialize } W^{[1]} \dots W^{[L]}, b^{[1]} \dots b^{[L]} \\
& \bullet \text{Set } A^{[0]} = X \text{ ( Input ) }, L = \text{Total Layers} \\
& \bullet \text{Loop } \text{epoch} = 1 \text{ to } \text{ max iteration } \\
& \rule{1cm}{0pt} \bullet \text{Forward Propagation} \\
& \rule{2cm}{0pt} \bullet \text{Loop } l=1 \text{ to } L-1 \\
& \rule{3cm}{0pt} \bullet Z^{[l]} = W^{[l]}A^{[l-1]}+b^{[l]} \\
& \rule{3cm}{0pt} \bullet A^{[l]} = g \left ( Z^{[l]} \right ) \\
& \rule{3cm}{0pt} \bullet \text{Save } A^{[l]},W^{[l]},Z^{[l]} \text{ in memory for later use } \\
& \rule{2cm}{0pt} \bullet Z^{[L]} = W^{[L]}A^{[L-1]}+b^{[L]} \\
& \rule{2cm}{0pt} \bullet A^{[L]} = \sigma \left ( Z^{[L]} \right ) \\
& \rule{1cm}{0pt} \bullet \text{Cost } J= - \frac{1}{n} \bigg( Y\log \left ( A^{[L]} \right ) + \left ( 1-Y \right )\log \left ( 1 - A^{[L]} \right ) \bigg)\\
& \rule{1cm}{0pt} \bullet \text{Backward Propagation} \\
& \rule{2cm}{0pt} \bullet dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1- A^{[L]}} \\
& \rule{2cm}{0pt} \bullet dZ^{[L]} = dA^{[L]} \sigma'\left ( Z^{[L]} \right ) \\
& \rule{2cm}{0pt} \bullet dW^{[L]} = dZ^{[L]} A^{[L-1]} \\
& \rule{2cm}{0pt} \bullet db^{[L]} = dZ^{[L]} \\
& \rule{2cm}{0pt} \bullet dA^{[L-1]} = dZ^{[L]} W^{[L]} \\
& \rule{2cm}{0pt} \bullet \text{Loop } l=L-1 \text{ to } 1 \\
& \rule{3cm}{0pt} \bullet dZ^{[l]} = dA^{[l]} g'\left ( Z^{[l]} \right ) \\
& \rule{3cm}{0pt} \bullet dW^{[l]} = dZ^{[l]} A^{[l-1]} \\
& \rule{3cm}{0pt} \bullet db^{[l]} = dZ^{[l]} \\
& \rule{3cm}{0pt} \bullet dA^{[l-1]} = dZ^{[l]} W^{[l]} \\
& \rule{1cm}{0pt} \bullet \text{Update W and b} \\
& \rule{2cm}{0pt} \bullet \text{Loop } l=1 \text{ to } L \\
& \rule{3cm}{0pt} \bullet W^{[l]} =W^{[l]} -\alpha . dW^{[l]} \\
& \rule{3cm}{0pt} \bullet b^{[l]} =b^{[l]} -\alpha . db^{[l]}
\end{align}
\]

The algorithm above is easy to understand; it is just the generalized version of our previous derivation. Feel free to ask me questions in the comments section in case you have any doubts.

Python Implementation:

At this point we could technically jump directly into the code; however, you would surely run into issues with matrix dimensions. Hence, let’s make sure that we fully understand the matrix dimensions before coding. Once you do this, the coding should be very simple.

We will use the MNIST dataset for our implementation. ( You can google it in case you are hearing about this dataset for the first time. )
MNIST has 60,000 gray scale training images of dimension 28x28 and a total of 10 different classes; however, since we will be focusing on binary classification here, we will choose all images with label 5 and 8 (11272 in total). We will write a function which returns the data we need.

Each pixel will be a feature for us, so we will first flatten each image into a 28x28 = 784 dimensional vector. The input dimension will be 11272 X 784.

In our Neural Network we will have a total of 2 layers, so it will look like 784 (input Layer)->196->1.

Forward Propagation – Layer 1:

\[
\begin{align}
X &= \left ( 11272,784 \right ) \\
W^{[1]} &=\left ( 196, 784 \right ) \\
b^{[1]} &=\left ( 196, 1 \right ) \\
A^{[0]} &= X^T\\
&=\left ( 784,11272 \right ) \\
Z^{[1]} &=W^{[1]}A^{[0]}+b^{[1]} \\
&= \left ( 196,784 \right ) * \left ( 784,11272 \right ) + \left ( 196, 1 \right ) \\
&= \left ( 196,11272 \right ) + \left ( 196, 1 \right ) \\
&= \left ( 196,11272 \right ) \\
A^{[1]} &=g\left ( Z^{[1]} \right ) \\
&=\left ( 196,11272 \right ) \\
\end{align}
\]

Forward Propagation – Layer 2:

\[
\begin{align}
W^{[2]} &=\left ( 1, 196 \right ) \\
b^{[2]} &=\left ( 1, 1 \right ) \\
Z^{[2]} &=W^{[2]}A^{[1]}+b^{[2]} \\
&= \left ( 1, 196 \right ) * \left ( 196,11272 \right ) + \left ( 1, 1 \right ) \\
&= \left ( 1,11272 \right ) + \left ( 1, 1 \right ) \\
&= \left ( 1,11272 \right ) \\
A^{[2]} &=g\left ( Z^{[2]} \right ) \\
&=\left ( 1,11272 \right ) \\
\end{align}
\]

Backward Propagation – Layer 2:

\[
\begin{align}
Y^T &= \left ( 1, 11272 \right ) \\
dA^{[2]} &=-\frac{Y^T}{A^{[2]}} + \frac{1-Y^T}{1- A^{[2]}} \\
&=\left ( 1, 11272 \right ) \\
dZ^{[2]} &=dA^{[2]} g'(Z^{[2]}) \\
&= \left ( 1, 11272 \right ) * \left ( 1, 11272 \right )\\
&= \left ( 1, 11272 \right )\\
dW^{[2]} &=dZ^{[2]} \left ( A^{[1]} \right )^T \\
&=\left ( 1, 11272 \right ) * \left ( 11272,196 \right ) \\
&= \left ( 1, 196 \right )\\
db^{[2]} &=dZ^{[2]} \\
&= \left ( 1, 1 \right )\\
dA^{[1]} &= \left ( W^{[2]} \right )^T dZ^{[2]} \\
&=\left ( 196,1 \right ) * \left ( 1, 11272 \right )\\
&=\left ( 196, 11272 \right )\\
\end{align}
\]

Backward Propagation – Layer 1:

\[
\begin{align}
dZ^{[1]} &=dA^{[1]} g'(Z^{[1]}) \\
&= \left ( 196, 11272 \right ) * \left ( 196,11272 \right )\\
&= \left ( 196, 11272 \right )\\
dW^{[1]} &=dZ^{[1]} \left ( A^{[0]} \right )^T \\
&=\left ( 196, 11272 \right ) * \left ( 11272, 784 \right ) \\
&= \left ( 196, 784\right )\\
db^{[1]} &=dZ^{[1]} \\
&= \left ( 196, 1 \right )\\
\end{align}
\]

Two important points:

  • I haven’t fully explained the calculation for b above. We need to sum \(dZ^{[l]}\) over all the samples (the columns) to make sure the dimensions of \(b^{[l]}\) and \(db^{[l]}\) match. We will use numpy’s axis=1 and keepdims=True options for this.
  • We have completely ignored the division by n (it was part of our cost function). So in practice, whenever we calculate the derivatives of W and b, we will divide the result by n.
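
The effect of axis=1 and keepdims=True is easiest to see on a tiny example. The array below is a made-up stand-in for the dZ of a layer with 3 units and 5 samples.

```python
import numpy as np

n = 5                                             # number of samples (illustrative)
dZ = np.arange(15, dtype=float).reshape(3, n)     # stand-in for dZ, shape (units, samples)
b = np.zeros((3, 1))

db = np.sum(dZ, axis=1, keepdims=True) / n        # sum over the samples, keep the column shape
db_flat = np.sum(dZ, axis=1) / n                  # without keepdims the result is 1-D

print(db.shape, db_flat.shape)                    # (3, 1) (3,)
# The per-unit averages are 2, 7 and 12
```

Because db keeps the shape (3, 1), it can be subtracted from b directly in the update step without any accidental broadcasting.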

We will be using a Python library to load the MNIST data. It just helps us to focus on the algorithm. You can install it by running the following command.

pip install python-mnist

We will create a class named ANN and have the following methods defined there.

ann = ANN(layers_dims)
ann.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
ann.predict(train_x, train_y)
ann.predict(test_x, test_y)
ann.plot_cost()

We will get the data, then preprocess it and invoke our ANN class. Our main will look like this. We should also be able to pass the number of layers we need in our model. We don’t want to fix the number of layers; rather, we want to pass that as an array to our ANN class.

if __name__ == '__main__':
    train_x, train_y, test_x, test_y = get_binary_dataset()

    train_x, test_x = pre_process_data(train_x, test_x)

    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))

    layers_dims = [196, 1]

    ann = ANN(layers_dims)
    ann.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
    ann.predict(train_x, train_y)
    ann.predict(test_x, test_y)
    ann.plot_cost()

The get_binary_dataset() function above will provide the Train and Test data. The dimension of the data will be as we have seen above. In the pre_process_data() function we will just normalize the data.

def pre_process_data(train_x, test_x):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    return train_x, test_x

Below is the constructor of the ANN class. Here the layer sizes will be passed as an array. self.parameters will be a dictionary object where we keep all the W and b.

def __init__(self, layers_size):
	self.layers_size = layers_size
	self.parameters = {}
	self.L = len(self.layers_size)
	self.n = 0
	self.costs = []

The fit() function will first call initialize_parameters() to create all the necessary W and b for each layer. Then we will run the training n_iterations times. Inside the loop we first call the forward() function, then calculate the cost and call the backward() function. Afterwards, we update the W and b for all the layers.

def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
	np.random.seed(1)

	self.n = X.shape[0]

	self.layers_size.insert(0, X.shape[1])

	self.initialize_parameters()
	for loop in range(n_iterations):
		A, store = self.forward(X)
		cost = np.squeeze(-(Y.dot(np.log(A.T)) + (1 - Y).dot(np.log(1 - A.T))) / self.n)
		derivatives = self.backward(X, Y, store)

		for l in range(1, self.L + 1):
			self.parameters["W" + str(l)] = self.parameters["W" + str(l)] - learning_rate * derivatives[
				"dW" + str(l)]
			self.parameters["b" + str(l)] = self.parameters["b" + str(l)] - learning_rate * derivatives[
				"db" + str(l)]

		if loop % 100 == 0:
			print(cost)
			self.costs.append(cost)

Since the W1 parameter needs the number of features present in the training data, we will insert that in the layers_size array before invoking initialize_parameters().

In the initialize_parameters() function we loop through the layers_size array and store the parameters in the self.parameters dictionary.

def initialize_parameters(self):
	np.random.seed(1)

	for l in range(1, len(self.layers_size)):
		self.parameters["W" + str(l)] = np.random.randn(self.layers_size[l], self.layers_size[l - 1]) / np.sqrt(
			self.layers_size[l - 1])
		self.parameters["b" + str(l)] = np.zeros((self.layers_size[l], 1))

Once you run the code the self.parameters variable will look like this:
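
As a minimal sketch of what to expect, assuming layers_size has become [784, 196, 1] after fit() inserts the input size, the shapes stored in the dictionary can be listed as follows. The loop mirrors initialize_parameters() above and runs outside the class only for illustration.

```python
import numpy as np

layers_size = [784, 196, 1]   # input size followed by the two layer sizes used in this post
np.random.seed(1)

parameters = {}
for l in range(1, len(layers_size)):
    parameters["W" + str(l)] = np.random.randn(layers_size[l], layers_size[l - 1]) / np.sqrt(
        layers_size[l - 1])
    parameters["b" + str(l)] = np.zeros((layers_size[l], 1))

for name, value in parameters.items():
    print(name, value.shape)
# W1 (196, 784)
# b1 (196, 1)
# W2 (1, 196)
# b2 (1, 1)
```

These are the same dimensions we worked out in the forward propagation section.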

The forward() function is very easy to understand. Even though we are using the Sigmoid Activation function in all the layers, we will keep the calculation for the final layer outside of the loop so that we can easily plug in a Softmax function there (Softmax is not covered in this tutorial).

We will also create a new store dictionary object and keep the A, W and Z for each layer so that we can use them during backpropagation.

def forward(self, X):
	store = {}

	A = X.T
	for l in range(self.L - 1):
		Z = self.parameters["W" + str(l + 1)].dot(A) + self.parameters["b" + str(l + 1)]
		A = self.sigmoid(Z)
		store["A" + str(l + 1)] = A
		store["W" + str(l + 1)] = self.parameters["W" + str(l + 1)]
		store["Z" + str(l + 1)] = Z

	Z = self.parameters["W" + str(self.L)].dot(A) + self.parameters["b" + str(self.L)]
	A = self.sigmoid(Z)
	store["A" + str(self.L)] = A
	store["W" + str(self.L)] = self.parameters["W" + str(self.L)]
	store["Z" + str(self.L)] = Z

	return A, store

The value A returned at the end of forward() is basically \( \hat{y}\).

In the backward() function, as in the derivation, we first calculate dA, dW and db for the L'th layer and then, in the loop, find the derivatives for the remaining layers.

The code below follows the derivations we went through earlier. We keep all the derivatives in the derivatives dictionary and return it to the fit() function.

def backward(self, X, Y, store):

	derivatives = {}

	store["A0"] = X.T

	A = store["A" + str(self.L)]
	dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)

	dZ = dA * self.sigmoid_derivative(store["Z" + str(self.L)])
	dW = dZ.dot(store["A" + str(self.L - 1)].T) / self.n
	db = np.sum(dZ, axis=1, keepdims=True) / self.n
	dAPrev = store["W" + str(self.L)].T.dot(dZ)

	derivatives["dW" + str(self.L)] = dW
	derivatives["db" + str(self.L)] = db

	for l in range(self.L - 1, 0, -1):
		dZ = dAPrev * self.sigmoid_derivative(store["Z" + str(l)])
		dW = 1. / self.n * dZ.dot(store["A" + str(l - 1)].T)
		db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
		if l > 1:
			dAPrev = store["W" + str(l)].T.dot(dZ)

		derivatives["dW" + str(l)] = dW
		derivatives["db" + str(l)] = db

	return derivatives

Here is the code for the sigmoid() and sigmoid_derivative() functions. In a later tutorial we will see how to use ReLU and Softmax.

def sigmoid(self, Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(self, Z):
    s = 1 / (1 + np.exp(-Z))
    return s * (1 - s)
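
A quick way to convince yourself that sigmoid_derivative() is right is to compare it with a central finite difference. The sketch below rewrites the two methods as plain functions (an assumption made only so the snippet is self-contained) and checks them on a few sample points.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def sigmoid_derivative(Z):
    s = sigmoid(Z)
    return s * (1 - s)

Z = np.linspace(-4, 4, 9)      # sample points chosen for illustration
eps = 1e-6
numeric = (sigmoid(Z + eps) - sigmoid(Z - eps)) / (2 * eps)   # central difference
max_error = np.max(np.abs(numeric - sigmoid_derivative(Z)))

print(max_error)
```

The printed error should be many orders of magnitude smaller than the derivative itself, which peaks at 0.25 for Z = 0.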

In the predict() function we will just use the current W and b and compute the probability using the forward() function. Then we will convert the probability to a predicted class, 0 or 1.

def predict(self, X, Y):
	A, cache = self.forward(X)
	n = X.shape[0]
	p = np.zeros((1, n))

	for i in range(0, A.shape[1]):
		if A[0, i] > 0.5:
			p[0, i] = 1
		else:
			p[0, i] = 0

	print("Accuracy: " + str(np.sum((p == Y) / n)))
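
The thresholding loop in predict() can also be written without an explicit loop. The following is a sketch of an equivalent vectorized form, using made-up probabilities and labels in place of the real forward() output.

```python
import numpy as np

A = np.array([[0.91, 0.12, 0.50, 0.73]])   # made-up sigmoid outputs, shape (1, n)
Y = np.array([1, 0, 1, 1])                 # made-up labels, shape (n,)

p = (A > 0.5).astype(int)       # strictly greater, same rule as predict(): 0.50 maps to class 0
accuracy = np.mean(p == Y)      # Y broadcasts against the (1, n) predictions

print(p, accuracy)
```

Here three of the four made-up samples are classified correctly, so the accuracy is 0.75.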

Let’s look at the output. You will get around 96% Train and Test Accuracy.

train_x's shape: (11272, 784)
test_x's shape: (1866, 784)
0.7043777294167167
0.3094035971595143
0.19106252272122817
0.15772416612846746
0.14255528419489316
0.1336554279807337
0.12762011948747812
0.12313725638495653
0.11959735842488138
0.11667822494436252
Accuracy: 0.9599893541518808
Accuracy: 0.9598070739549842

The cost gradually goes down as we run more iterations.

The best part of writing the code in a generic way is that we can easily try different layer sizes. Let’s try the following:

layers_dims = [392,196,98,1]

Here is the result.

train_x's shape: (11272, 784)
test_x's shape: (1866, 784)
0.6941917096801075
0.689779337934555
0.6864347273157968
0.680851445965145
0.6693297859482221
0.6392888056143693
0.5391389182596976
0.30972952941407295
0.1900953225522053
0.15499153620779857
Accuracy: 0.9491660752306601
Accuracy: 0.9555198285101825

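Naturally, with the same data, number of iterations and learning rate, the larger Network performs worse than the smaller one. The printed costs show why: the cost of the larger network barely moves from about 0.69 during the first 400 iterations before it starts to fall. If you were expecting a different result then let me know in the comment section and we can discuss it.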

Below is the full code of the ANN class:

import numpy as np
import datasets.mnist.loader as mnist
import matplotlib.pylab as plt

class ANN:
    def __init__(self, layers_size):
        self.layers_size = layers_size
        self.parameters = {}
        self.L = len(self.layers_size)
        self.n = 0
        self.costs = []

    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-Z))

    def initialize_parameters(self):
        np.random.seed(1)

        for l in range(1, len(self.layers_size)):
            self.parameters["W" + str(l)] = np.random.randn(self.layers_size[l], self.layers_size[l - 1]) / np.sqrt(
                self.layers_size[l - 1])
            self.parameters["b" + str(l)] = np.zeros((self.layers_size[l], 1))

    def forward(self, X):
        store = {}

        A = X.T
        for l in range(self.L - 1):
            Z = self.parameters["W" + str(l + 1)].dot(A) + self.parameters["b" + str(l + 1)]
            A = self.sigmoid(Z)
            store["A" + str(l + 1)] = A
            store["W" + str(l + 1)] = self.parameters["W" + str(l + 1)]
            store["Z" + str(l + 1)] = Z

        Z = self.parameters["W" + str(self.L)].dot(A) + self.parameters["b" + str(self.L)]
        A = self.sigmoid(Z)
        store["A" + str(self.L)] = A
        store["W" + str(self.L)] = self.parameters["W" + str(self.L)]
        store["Z" + str(self.L)] = Z

        return A, store

    def sigmoid_derivative(self, Z):
        s = 1 / (1 + np.exp(-Z))
        return s * (1 - s)

    def backward(self, X, Y, store):

        derivatives = {}

        store["A0"] = X.T

        A = store["A" + str(self.L)]
        dA = -np.divide(Y, A) + np.divide(1 - Y, 1 - A)

        dZ = dA * self.sigmoid_derivative(store["Z" + str(self.L)])
        dW = dZ.dot(store["A" + str(self.L - 1)].T) / self.n
        db = np.sum(dZ, axis=1, keepdims=True) / self.n
        dAPrev = store["W" + str(self.L)].T.dot(dZ)

        derivatives["dW" + str(self.L)] = dW
        derivatives["db" + str(self.L)] = db

        for l in range(self.L - 1, 0, -1):
            dZ = dAPrev * self.sigmoid_derivative(store["Z" + str(l)])
            dW = 1. / self.n * dZ.dot(store["A" + str(l - 1)].T)
            db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
            if l > 1:
                dAPrev = store["W" + str(l)].T.dot(dZ)

            derivatives["dW" + str(l)] = dW
            derivatives["db" + str(l)] = db

        return derivatives

    def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
        np.random.seed(1)

        self.n = X.shape[0]

        self.layers_size.insert(0, X.shape[1])

        self.initialize_parameters()
        for loop in range(n_iterations):
            A, store = self.forward(X)
            cost = np.squeeze(-(Y.dot(np.log(A.T)) + (1 - Y).dot(np.log(1 - A.T))) / self.n)
            derivatives = self.backward(X, Y, store)

            for l in range(1, self.L + 1):
                self.parameters["W" + str(l)] = self.parameters["W" + str(l)] - learning_rate * derivatives[
                    "dW" + str(l)]
                self.parameters["b" + str(l)] = self.parameters["b" + str(l)] - learning_rate * derivatives[
                    "db" + str(l)]

            if loop % 100 == 0:
                print(cost)
                self.costs.append(cost)

    def predict(self, X, Y):
        A, cache = self.forward(X)
        n = X.shape[0]
        p = np.zeros((1, n))

        for i in range(0, A.shape[1]):
            if A[0, i] > 0.5:
                p[0, i] = 1
            else:
                p[0, i] = 0

        print("Accuracy: " + str(np.sum((p == Y) / n)))

    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()


def get_binary_dataset():
    train_x_orig, train_y_orig, test_x_orig, test_y_orig = mnist.get_data()

    index_5 = np.where(train_y_orig == 5)
    index_8 = np.where(train_y_orig == 8)

    index = np.concatenate([index_5[0], index_8[0]])
    np.random.seed(1)
    np.random.shuffle(index)

    train_y = train_y_orig[index]
    train_x = train_x_orig[index]

    train_y[np.where(train_y == 5)] = 0
    train_y[np.where(train_y == 8)] = 1

    index_5 = np.where(test_y_orig == 5)
    index_8 = np.where(test_y_orig == 8)

    index = np.concatenate([index_5[0], index_8[0]])
    np.random.shuffle(index)

    test_y = test_y_orig[index]
    test_x = test_x_orig[index]

    test_y[np.where(test_y == 5)] = 0
    test_y[np.where(test_y == 8)] = 1

    return train_x, train_y, test_x, test_y

def pre_process_data(train_x, test_x):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    return train_x, test_x


if __name__ == '__main__':
    train_x, train_y, test_x, test_y = get_binary_dataset()

    train_x, test_x = pre_process_data(train_x, test_x)

    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))

    layers_dims = [196, 1]

    ann = ANN(layers_dims)
    ann.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
    ann.predict(train_x, train_y)
    ann.predict(test_x, test_y)
    ann.plot_cost()

You can access the full project here:

Conclusion:

I hope that this tutorial provides a detailed view of the backpropagation algorithm. Since backpropagation is the backbone of any Neural Network, it’s important to understand it in depth. We can make many optimizations from this point onwards for improving the accuracy, speeding up the computation etc. Next we will see how to implement the same using both TensorFlow and PyTorch.

Below are the articles on implementing the Neural Network using TensorFlow and PyTorch.

  1. Understanding and implementing Neural Network with SoftMax in Python from scratch
  2. Implement Neural Network using TensorFlow
  3. Implement Neural Network using PyTorch

The post Understand and Implement the Backpropagation Algorithm From Scratch In Python appeared first on A Developer Diary.

Implement Neural Network using TensorFlow


In the previous article we implemented a Neural Network in Python from scratch. For real implementations, however, we mostly use a framework, which generally provides faster computation and better support for best practices. In this article we will implement a Neural Network using TensorFlow. At present, TensorFlow is probably the most popular deep learning framework available.

Here is my previous post on “Understand and Implement the Backpropagation Algorithm From Scratch In Python”. We will implement a similar example here using TensorFlow. In case you need a refresher, please refer to the article below:

Understand and Implement the Backpropagation Algorithm From Scratch In Python

Notes on TensorFlow:

There are a few points worth understanding about TensorFlow that differ from other frameworks and from our previous implementation.

  • TensorFlow has been written with production deployment in mind. I would say it is the exact opposite of R, which is very much focused on analysis and study. Hence, there are a few design considerations you need to understand.
  • TensorFlow mainly supports a Static Computation Graph (however, there will be support for a Dynamic Computation Graph in version 2; you might have already heard of Eager Execution). In this example we will work with the Static Computation Graph. The main problem with a Static Computation Graph is debugging support: if you make a mistake, you can't really use a Python editor to step through the code line by line.
  • In past years, TensorFlow went through many design changes. Hence you might find many different ways of creating a Neural Network using TensorFlow. I will be using the most recent recommendations. If you search online you will find many variations of the same code; be careful with the version they use, since the code might already be outdated.
  • TensorFlow is currently integrating Keras as its high-level API. If you visit the TensorFlow website you will find plenty of examples using Keras.
  • In a real production system you will generally save the model after training and run predictions by loading it separately, during testing or real use. There is no way to save the model as an instance variable in Python. Since we just want to focus on building the network, we will run both train and test inside the fit_predict() function using the same session. You will generally never do this in a real production system.
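To make the static-graph idea from the list above more concrete, here is a toy sketch in plain Python. It is not TensorFlow code: the `Placeholder`, `Add` and `Mul` classes and the `run()` helper are hypothetical names invented for illustration. The only point is that building the graph computes nothing; values flow through it when data is fed at run time.

```python
# Toy "define first, run later" graph. All class and function names here are
# hypothetical and only mimic the idea behind a static computation graph.

class Placeholder:
    """A named input whose value is supplied only at run time."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) + self.right.evaluate(feed)


class Mul:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def evaluate(self, feed):
        return self.left.evaluate(feed) * self.right.evaluate(feed)


def run(node, feed):
    """Evaluate a graph node using the values in `feed` (similar in spirit to feed_dict)."""
    return node.evaluate(feed)


# Step 1: define the graph z = w * x + b. Nothing is computed here.
x = Placeholder("x")
w = Placeholder("w")
b = Placeholder("b")
z = Add(Mul(w, x), b)

# Step 2: run the same graph with different data.
first = run(z, {"x": 2.0, "w": 3.0, "b": 1.0})
second = run(z, {"x": 5.0, "w": 3.0, "b": 1.0})
print(first, second)
```

Because the arithmetic only happens inside `run()`, a breakpoint placed on the line that builds `z` shows graph objects rather than numbers, which is roughly why line-by-line debugging is awkward with a static graph.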

Dataset:

We will be using the MNIST dataset. It has 60K training images, each 28x28 pixels in grayscale. There are 10 classes in total to classify. You can find more details about it on the following sites:

https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/index.html

Coding using TensorFlow:

We will create a class named ANN and define the following functions. The fit_predict() function will train our model and then run the prediction on the test data using the same session.

As you may notice below, we pass the layers needed for our network as a list so that we don’t have to code them explicitly.

model = ANN(layers_size=[196, 10])
model.fit_predict(train_x, train_y, test_x, test_y, learning_rate=0.1, n_iterations=1000)
model.plot_cost()

__init__() function:

Start by defining the __init__() method. The self.parameters dictionary will hold the W and b variables, and the self.store dictionary will save the values computed during forward() so that they can be reused.

def __init__(self, layers_size):
    self.costs = []
    self.layers_size = layers_size
    self.parameters = {}
    self.L = len(layers_size)
    self.store = {}
    self.X = None
    self.Y = None

initialize_parameters() function:

The initialize_parameters() function will be used to initialize the W and b parameters for our Network.

If we already know the number of layers and hidden units, we can simply define them as follows. However, defining them this way will not help if we want to try out more layers with different hidden units.

W1 = tf.get_variable("W1", shape=[196, f], initializer=tf.contrib.layers.xavier_initializer(seed=1))
b1 = tf.get_variable("b1", shape=[196, 1], initializer=tf.zeros_initializer())
W2 = tf.get_variable("W2", shape=[10, 196], initializer=tf.contrib.layers.xavier_initializer(seed=1))
b2 = tf.get_variable("b2", shape=[10, 1], initializer=tf.zeros_initializer())

Hence we will dynamically define them by looping through the self.layers_size list.

We will use the get_variable() function, a relatively recent addition to TensorFlow, which lets us specify a supported initializer. Here we will use xavier_initializer for W and zeros_initializer for b.

Note: In case you do not know what the Xavier Initializer is, don’t worry about it. I will make another tutorial on it later.

def initialize_parameters(self):
    tf.set_random_seed(1)

    for l in range(1, self.L + 1):
        self.parameters["W" + str(l)] = tf.get_variable("W" + str(l),
                                                        shape=[self.layers_size[l], self.layers_size[l - 1]],
                                                        initializer=tf.contrib.layers.xavier_initializer(seed=1))
        self.parameters["b" + str(l)] = tf.get_variable("b" + str(l), shape=[self.layers_size[l], 1],
                                                        initializer=tf.zeros_initializer())
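To see which shapes the loop above produces, here is a minimal NumPy sketch. It is not the TensorFlow code: `xavier_uniform` is a hypothetical helper, and it assumes the common Glorot (Xavier) uniform rule, where weights are drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The layer sizes assume the 784 input features have already been inserted at index 0.

```python
import numpy as np


def xavier_uniform(fan_out, fan_in, rng):
    """Sample a (fan_out, fan_in) matrix from U(-limit, limit) (assumed Glorot rule)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))


rng = np.random.RandomState(1)
layers_size = [784, 196, 10]  # input size already inserted at index 0
parameters = {}

for l in range(1, len(layers_size)):
    # W has one row per unit in layer l and one column per unit in layer l - 1.
    parameters["W" + str(l)] = xavier_uniform(layers_size[l], layers_size[l - 1], rng)
    # b is a column vector of zeros, one entry per unit in layer l.
    parameters["b" + str(l)] = np.zeros((layers_size[l], 1))

for name in sorted(parameters):
    print(name, parameters[name].shape)
```

Storing W as (units in this layer, units in previous layer) is what lets the forward pass multiply W by the transposed input directly.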

forward() function:

Next let’s define the forward() function.

We can code for the fixed number of layers like following.

Z1 = tf.add(tf.matmul(W1, tf.transpose(self.X)), b1)
A1 = tf.nn.relu(Z1)
Z2 = tf.add(tf.matmul(W2, A1), b2)

However, it is wiser to perform the forward propagation dynamically. A few points to note:

  • We are using ReLu Activation function.
  • When l=1, \(A^{[0]}\) will be equal to X
  • For all layers, calculate and store the \(Z^{[l]}\) in memory inside the loop.
  • For all layers from l=1 to L-1, calculate and store the \(A^{[l]}\) in memory inside the loop.
  • Since final layer will use Softmax activation, we will use TensorFlow’s builtin function softmax_cross_entropy_with_logits_v2().

def forward(self):
    for l in range(1, len(self.layers_size)):
        if l == 1:
            self.store["Z" + str(l)] = tf.add(tf.matmul(self.parameters["W" + str(l)], tf.transpose(self.X)),
                                              self.parameters["b" + str(l)])
        else:
            self.store["Z" + str(l)] = tf.add(
                tf.matmul(self.parameters["W" + str(l)], self.store["A" + str(l - 1)]),
                self.parameters["b" + str(l)])
        if l < self.L:
            self.store["A" + str(l)] = tf.nn.relu(self.store["Z" + str(l)])

    softmax = tf.nn.softmax_cross_entropy_with_logits_v2(logits=tf.transpose(self.store["Z" + str(self.L)]),
                                                         labels=self.Y)
    return softmax
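The same forward pass can be sketched in NumPy to make the shapes explicit. This is only an illustration under assumptions: the tiny layer sizes and random data are made up, and `softmax_cross_entropy` is a hand-written stand-in for the per-example loss that the TensorFlow function returns, not its actual implementation.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax_cross_entropy(logits, labels):
    """Per-example cross entropy; logits and labels have shape (m, classes)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


def forward(X, parameters, n_layers):
    """X has shape (m, features); each W has shape (units_out, units_in)."""
    A = X.T  # A^[0] is the transposed input
    for l in range(1, n_layers + 1):
        Z = parameters["W" + str(l)] @ A + parameters["b" + str(l)]
        if l < n_layers:
            A = relu(Z)  # ReLU on hidden layers only
    return Z.T  # logits of the last layer, shape (m, classes)


rng = np.random.RandomState(0)
m, f, h, c = 5, 8, 4, 3  # tiny, made-up sizes
parameters = {
    "W1": rng.randn(h, f) * 0.1, "b1": np.zeros((h, 1)),
    "W2": rng.randn(c, h) * 0.1, "b2": np.zeros((c, 1)),
}
X = rng.rand(m, f)
Y = np.eye(c)[rng.randint(0, c, size=m)]  # one-hot labels

logits = forward(X, parameters, n_layers=2)
losses = softmax_cross_entropy(logits, Y)
print(logits.shape, losses.shape)
```

Note that the last layer returns raw scores (logits); the softmax is folded into the loss, which is why no activation is applied to the final Z.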

fit_predict() function:

TensorFlow will automatically calculate the derivatives for us, hence the backpropagation will be just a line of code. Let's go through the fit_predict() function.

First we will find the number of features from the shape of X_train and the number of classes from the shape of Y_train. In our example, the shape of X_train is (60000, 784) and the shape of Y_train is (60000, 10).

Then we will define two placeholder X,Y based on number of features and classes.

We will insert the number of features in our layers_size list since technically the input layer is layer 0. We will need the size to define the W1.

Finally we will call self.initialize_parameters() and self.forward() function.

tf.set_random_seed(1)
_, f = X_train.shape
_, c = Y_train.shape

self.X = tf.placeholder(tf.float32, shape=[None, f], name='X')
self.Y = tf.placeholder(tf.float32, shape=[None, c], name='Y')

self.layers_size.insert(0, f)

self.initialize_parameters()

softmax = self.forward()

Next we will define our cost function and then use TensorFlow’s builtin function for Gradient Descent Optimization. Feel free to try out other optimization functions available. The minimize() function will help to calculate all the derivatives with respect to the cost function.

cost = tf.reduce_mean(softmax)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)
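To show what a single minimize() step amounts to, here is a hand-written NumPy sketch for a network with one softmax layer. It is an assumption-laden illustration, not what TensorFlow executes internally: the data is random, the sizes are made up, and the gradient is derived by hand using the fact that the gradient of the mean cross entropy with respect to the logits is (probs - Y) / m.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def mean_cost(W, b, X, Y):
    """Mean softmax cross entropy, the counterpart of reduce_mean(softmax) above."""
    probs = softmax(X @ W.T + b)
    return -np.mean(np.sum(Y * np.log(probs), axis=1))


rng = np.random.RandomState(1)
m, f, c = 20, 6, 3  # made-up sizes
X = rng.rand(m, f)
Y = np.eye(c)[rng.randint(0, c, size=m)]
W = np.zeros((c, f))
b = np.zeros(c)
learning_rate = 0.1

cost_before = mean_cost(W, b, X, Y)

# Backward pass written by hand for this one-layer case.
probs = softmax(X @ W.T + b)
dZ = (probs - Y) / m
dW = dZ.T @ X
db = dZ.sum(axis=0)

# Plain gradient descent update.
W -= learning_rate * dW
b -= learning_rate * db

cost_after = mean_cost(W, b, X, Y)
print(cost_before, cost_after)
```

With automatic differentiation the framework derives dW and db for every layer, so only the cost has to be written down.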

We are done defining our static computation graph. We have not fed any data into our model yet. Let's do that next by creating a TensorFlow session.

We need to call global_variables_initializer() for TensorFlow’s Global Variable Initialization. Next we will create a Session and loop through the n_iterations.

Inside the loop we will have TensorFlow compute the optimizer and cost variable. At this point we need to pass the data using the feed_dict parameter.

init = tf.global_variables_initializer()
with tf.Session() as sess:
	sess.run(init)
	for epoch in range(n_iterations):
		_, epoch_cost = sess.run([optimizer, cost], feed_dict={self.X: X_train, self.Y: Y_train})

We want to calculate the train accuracy every 100 iterations and also save the cost every 10 iterations. The code below is very simple: we compare the predicted values with the target variable, then find the accuracy by calling the reduce_mean() function.

if epoch % 100 == 0:
	correct_prediction = tf.equal(tf.argmax(self.store["Z" + str(self.L)]),
								  tf.argmax(tf.transpose(self.Y)))

	accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
	print("Cost after epoch %i: %f, Accuracy %f" % (
		epoch, epoch_cost, accuracy.eval({self.X: X_train, self.Y: Y_train})))

if epoch % 10 == 0:
	self.costs.append(epoch_cost)
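The accuracy computation above depends on the layout of the tensors: the last layer's Z has one column per example, while Y has one row per example and is therefore transposed first. The NumPy sketch below uses made-up numbers and assumes the argmax is taken down the class axis (axis 0), which is how I read the calls above since no axis is passed.

```python
import numpy as np

# Made-up logits for 4 examples and 3 classes, laid out as (classes, m).
Z = np.array([[2.0, 0.1, 0.3, 0.2],
              [0.5, 1.5, 0.2, 0.9],
              [0.1, 0.4, 3.0, 0.4]])

# One-hot labels laid out as (m, classes), like the Y placeholder.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

predicted = np.argmax(Z, axis=0)    # most likely class for each column
actual = np.argmax(Y.T, axis=0)     # position of the 1 in each label
accuracy = np.mean(predicted == actual)
print(predicted, actual, accuracy)
```

Three of the four made-up examples are classified correctly, so the sketch reports an accuracy of 0.75.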

Once training is complete we will calculate the accuracy of the test data inside the same session.

correct_prediction = tf.equal(tf.argmax(self.store["Z" + str(self.L)]),tf.argmax(tf.transpose(self.Y)))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print("Test Accuracy %f" % (accuracy.eval({self.X: X_test, self.Y: Y_test})))

__main__():

Finally let’s look at our main() method. First we will get the data. Then we will preprocess it. Afterwards, call the fit_predict() function of the ANN() class.

if __name__ == '__main__':
    train_x_orig, train_y_orig, test_x_orig, test_y_orig = mnist.get_data()

    train_x, train_y, test_x, test_y = pre_process_data(train_x_orig, train_y_orig, test_x_orig, test_y_orig)

    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))

    model = ANN(layers_size=[196, 10])
    model.fit_predict(train_x, train_y, test_x, test_y, learning_rate=0.1, n_iterations=1000)
    model.plot_cost()

pre_process_data():

In the preprocessing step we will first normalize the data by dividing by 255. Then we will use OneHotEncoder of the sklearn package to transform the target variable.

def pre_process_data(train_x, train_y, test_x, test_y):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    enc = OneHotEncoder(sparse=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y
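For readers who want to see what the one-hot transformation does to the labels, here is a small NumPy sketch. The `one_hot` helper is hypothetical and takes the number of classes as an argument, whereas OneHotEncoder with categories='auto' infers the categories from the data it is fitted on.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Return an array of shape (len(labels), n_classes) with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], n_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded


train_y = np.array([5, 0, 4, 1])  # made-up digit labels
encoded = one_hot(train_y, n_classes=10)
print(encoded.shape)
print(encoded[0])
```

Each row has a 1 in the column of its digit, which is the (m, 10) layout that the Y placeholder expects.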

Output:

Now it's time to run our code. With just a 2-layer network and 1000 epochs we get around 94% accuracy.

train_x's shape: (60000, 784)
test_x's shape: (10000, 784)
Cost after epoch 0: 2.336267, Accuracy 0.192117
Cost after epoch 100: 0.455633, Accuracy 0.881817
Cost after epoch 200: 0.358374, Accuracy 0.901867
Cost after epoch 300: 0.317593, Accuracy 0.911667
Cost after epoch 400: 0.291452, Accuracy 0.918683
Cost after epoch 500: 0.271708, Accuracy 0.924483
Cost after epoch 600: 0.255534, Accuracy 0.929317
Cost after epoch 700: 0.241680, Accuracy 0.932850
Cost after epoch 800: 0.229520, Accuracy 0.936000
Cost after epoch 900: 0.218720, Accuracy 0.939083
Test Accuracy 0.942200

Here is the plot of the cost function.

You can try using different Network Layout such as:

layers_size = [392, 196, 98, 10]

Here is the result. Our accuracy increased to 96%.

train_x's shape: (60000, 784)
test_x's shape: (10000, 784)
Cost after epoch 0: 2.338752, Accuracy 0.187967
Cost after epoch 100: 0.377190, Accuracy 0.893700
Cost after epoch 200: 0.270761, Accuracy 0.922600
Cost after epoch 300: 0.228978, Accuracy 0.934517
Cost after epoch 400: 0.195207, Accuracy 0.944633
Cost after epoch 500: 0.171445, Accuracy 0.951867
Cost after epoch 600: 0.152584, Accuracy 0.956917
Cost after epoch 700: 0.137145, Accuracy 0.961433
Cost after epoch 800: 0.124189, Accuracy 0.965167
Cost after epoch 900: 0.113151, Accuracy 0.968233
Test Accuracy 0.964000

Here is the plot of the cost function.

Below is the full code of the ANN class.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import datasets.mnist.loader as mnist
from sklearn.preprocessing import OneHotEncoder

class ANN:
    def __init__(self, layers_size):
        self.costs = []
        self.layers_size = layers_size
        self.parameters = {}
        self.L = len(layers_size)
        self.store = {}
        self.X = None
        self.Y = None

    def initialize_parameters(self):
        tf.set_random_seed(1)

        for l in range(1, self.L + 1):
            self.parameters["W" + str(l)] = tf.get_variable("W" + str(l),
                                                            shape=[self.layers_size[l], self.layers_size[l - 1]],
                                                            initializer=tf.contrib.layers.xavier_initializer(seed=1))
            self.parameters["b" + str(l)] = tf.get_variable("b" + str(l), shape=[self.layers_size[l], 1],
                                                            initializer=tf.zeros_initializer())

    def forward(self):
        for l in range(1, len(self.layers_size)):

            if l == 1:
                self.store["Z" + str(l)] = tf.add(tf.matmul(self.parameters["W" + str(l)], tf.transpose(self.X)),
                                                  self.parameters["b" + str(l)])
            else:
                self.store["Z" + str(l)] = tf.add(
                    tf.matmul(self.parameters["W" + str(l)], self.store["A" + str(l - 1)]),
                    self.parameters["b" + str(l)])
            if l < self.L:
                self.store["A" + str(l)] = tf.nn.relu(self.store["Z" + str(l)])

        softmax = tf.nn.softmax_cross_entropy_with_logits_v2(logits=tf.transpose(self.store["Z" + str(self.L)]),
                                                             labels=self.Y)

        return softmax

    def fit_predict(self, X_train, Y_train, X_test, Y_test, learning_rate=0.01, n_iterations=2500):
        tf.set_random_seed(1)
        _, f = X_train.shape
        _, c = Y_train.shape

        self.X = tf.placeholder(tf.float32, shape=[None, f], name='X')
        self.Y = tf.placeholder(tf.float32, shape=[None, c], name='Y')

        self.layers_size.insert(0, f)

        self.initialize_parameters()

        softmax = self.forward()

        cost = tf.reduce_mean(softmax)

        optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cost)

        init = tf.global_variables_initializer()

        with tf.Session() as sess:
            sess.run(init)
            for epoch in range(n_iterations):
                _, epoch_cost = sess.run([optimizer, cost], feed_dict={self.X: X_train, self.Y: Y_train})

                if epoch % 100 == 0:
                    correct_prediction = tf.equal(tf.argmax(self.store["Z" + str(self.L)]),
                                                  tf.argmax(tf.transpose(self.Y)))

                    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
                    print("Cost after epoch %i: %f, Accuracy %f" % (
                        epoch, epoch_cost, accuracy.eval({self.X: X_train, self.Y: Y_train})))

                if epoch % 10 == 0:
                    self.costs.append(epoch_cost)

            correct_prediction = tf.equal(tf.argmax(self.store["Z" + str(self.L)]),
                                          tf.argmax(tf.transpose(self.Y)))

            accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
            print("Test Accuracy %f" % (accuracy.eval({self.X: X_test, self.Y: Y_test})))

    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()


def pre_process_data(train_x, train_y, test_x, test_y):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    enc = OneHotEncoder(sparse=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))
    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y


if __name__ == '__main__':
    train_x_orig, train_y_orig, test_x_orig, test_y_orig = mnist.get_data()

    train_x, train_y, test_x, test_y = pre_process_data(train_x_orig, train_y_orig, test_x_orig, test_y_orig)

    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))

    model = ANN(layers_size=[196, 10])
    model.fit_predict(train_x, train_y, test_x, test_y, learning_rate=0.1, n_iterations=1000)
    model.plot_cost()

You can access the full project here:


The post Implement Neural Network using TensorFlow appeared first on A Developer Diary.

Implement Neural Network using PyTorch


PyTorch is gaining popularity, especially among students, since it’s much more developer friendly. PyTorch helps you focus on the core concepts of deep learning, unlike TensorFlow, which is more focused on running optimized models on production systems. In this tutorial we will implement a Neural Network using PyTorch and understand some of the core concepts of PyTorch.

This tutorial is a follow-up to the previous tutorial, Understand and Implement the Backpropagation Algorithm From Scratch In Python. If you need a refresher, please review that article.

Understand and Implement the Backpropagation Algorithm From Scratch In Python

Notes on PyTorch:

  • At the time of writing, PyTorch models cannot be deployed to a production system directly. They need to be converted to Caffe2 using ONNX and then deployed to production.
  • PyTorch supports computation on the GPU (CUDA) for faster processing. I will explain how to do this in this tutorial.
  • In other deep learning frameworks such as TensorFlow or Theano, you can just feed the input data in NumPy format to the model. It’s easy to implement this way, especially when you are trying it out for the first time or learning. However, batching, shuffling, parallel data loading, etc. need to be handled manually in a real implementation. PyTorch provides all of this functionality out of the box through torch.utils.data.Dataset and torch.utils.data.DataLoader.
  • PyTorch automatically calculates the derivative of any function, hence our backpropagation will be very easy to implement.
  • PyTorch provides Modules, which are nothing but abstract classes or interfaces. If you are familiar with OOP then you already know about inheritance. Modules help integrate our custom code with the PyTorch core framework.

Dataset:

We will be using the MNIST dataset. It has 60K training images, each 28x28 pixels in grayscale. There are 10 classes in total to classify. You can find more details about it on the following sites:

https://en.wikipedia.org/wiki/MNIST_database
http://yann.lecun.com/exdb/mnist/index.html

Implementation:

In PyTorch we need to define our Neural Network using a class. We will name our class ANN. We will also add the fit() and predict() functions so that we can invoke them from the main() function.

__main__():

Let's look at our simple main method. We will first get the data from the get_data() function. I am using an external library to load the MNIST data. You can install it using the command below.

pip install python-mnist

We have defined the device variable before the main function. It detects whether the machine has a CUDA-capable GPU so that we can run our model faster.

First we will run our pre-processing step, where we normalize the data. Then we instantiate the ANN class by passing the layers_size. We will code the ANN class in such a way that we can define the layers dynamically.

Then we will call the fit() and predict() function.

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

if __name__ == '__main__':
    train_x_orig, train_y_orig, test_x_orig, test_y_orig = mnist.get_data()
    train_x, train_y, test_x, test_y = pre_process_data(train_x_orig, train_y_orig, test_x_orig, test_y_orig)

    model = ANN(layers_size=[196, 10])
    model.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
    model.predict(test_x, test_y)
    model.plot_cost()

__init__() of ANN Class:

As discussed earlier, PyTorch provides Modules for specific type of Neural Networks. We will be extending the torch.nn.Module while creating the ANN class.

The __init__() method is very straightforward. In the first line we call the __init__() method of the parent class torch.nn.Module.

class ANN(nn.Module):    
    def __init__(self, layers_size):
        super(ANN, self).__init__()
        self.layers_size = layers_size
        self.L = len(layers_size)
        self.costs = []

initialize_parameters():

In the initialize_parameters() function we will define our layers with their W’s and b’s. Since we don't want to create a fixed set of layers, we will loop through our self.layers_size list and call the nn.Linear() function.

There are two important points to note here:

  • We will be calling nn.Linear().to(device) so that PyTorch can select GPU ( if available ) for computation.
  • add_module() function is part of torch.nn.Module. PyTorch provides this function so that we can define all the layers dynamically.

def initialize_parameters(self):
	for l in range(0, self.L):
		self.add_module("fc" + str(l + 1), nn.Linear(self.layers_size[l], self.layers_size[l + 1]).to(device))
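To check which layers the loop above creates, the index arithmetic can be traced in plain Python without PyTorch. The dictionary `linear_shapes` is a hypothetical stand-in that records the (in_features, out_features) pair each nn.Linear call would receive, assuming 784 input features and the [196, 10] layout used in this tutorial.

```python
layers_size = [196, 10]       # what the caller passes to ANN
n_layers = len(layers_size)   # stored as self.L before the input size is added
layers_size.insert(0, 784)    # fit() prepends the number of input features

linear_shapes = {}
for l in range(0, n_layers):
    # Layer l + 1 maps layers_size[l] inputs to layers_size[l + 1] outputs.
    linear_shapes["fc" + str(l + 1)] = (layers_size[l], layers_size[l + 1])

print(linear_shapes)
```

Because self.L is computed before the input size is inserted, the loop runs once per trainable layer and never indexes past the end of the list.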

forward():

The forward() function is declared by torch.nn.Module, which means you always need to define a function named forward(). Otherwise PyTorch won't be able to run the model when you call it.

The logic in this function is easy to understand. We loop through all the layers that were added by calling self.add_module() and calculate both Z and A for each of them. (Z is the output before the activation and A is the output of the activation.)

We use ReLU as the activation function for all the hidden layers but not for the last layer. That’s why we skip it for the last layer L inside the loop.

We call the torch.nn.functional.log_softmax() function for the Softmax activation.

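def forward(self, X):
    for l, (name, m) in enumerate(self.named_modules()):
        # named_modules() yields the ANN module itself first (l == 0),
        # so fc1 ... fcL arrive at l == 1 ... self.L.
        if l > 0:
            if l == self.L:
                X = m(X)  # output layer: no ReLU
            else:
                X = F.relu(m(X))

    return F.log_softmax(X, dim=1)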

fit():

The fit() function drives all the work for us, hence we will break it down to understand fully.

The self.to() is a built in function which is part of the torch.nn.Module. We will pass the device here so that PyTorch knows whether to execute the computation in CPU or GPU.

Next we will insert the feature size to the self.layers_size list since technically X is the layer 0.

Invoke self.initialize_parameters() to create the required layers. Use torch.optim.SGD() to update the parameters using Stochastic Gradient Descent. We need to pass it the parameters, obtained by calling self.parameters() (which is again part of torch.nn.Module), and the learning rate.

We can define the negative log likelihood loss function just by calling torch.nn.NLLLoss().

def fit(self, X, Y, learning_rate=0.1, n_iterations=2500):
	self.to(device)
	self.layers_size.insert(0, X.shape[1])
	self.initialize_parameters()

	optimizer = torch.optim.SGD(self.parameters(), lr=learning_rate)
	criterion = nn.NLLLoss()
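The pairing of log_softmax() in forward() with the negative log likelihood loss here is equivalent to the usual softmax cross entropy. The NumPy sketch below illustrates that with made-up scores; `log_softmax` and `nll_loss` are hand-written stand-ins rather than the PyTorch implementations, and the loss is averaged over the batch, which I assume matches the default behaviour of nn.NLLLoss().

```python
import numpy as np


def log_softmax(x):
    """Row-wise log of the softmax, computed in a numerically stable way."""
    shifted = x - x.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def nll_loss(log_probs, targets):
    """Mean negative log likelihood; targets are integer class indices."""
    rows = np.arange(log_probs.shape[0])
    return -np.mean(log_probs[rows, targets])


scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])  # made-up network outputs
targets = np.array([0, 1])            # class indices, not one-hot vectors

log_probs = log_softmax(scores)
loss = nll_loss(log_probs, targets)
print(loss)
```

The loss only picks out the log probability of the correct class for each row, which is why integer class labels are enough and no one-hot encoding is needed.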

We are all set to run our training iterations. However as discussed earlier, we need to make sure PyTorch can retrieve the data using the torch.utils.data.DataLoader class.

PyTorch DataLoader:

We need to inherit the torch.utils.data.Dataset class and provide implementation of the necessary methods.

Here is the structure of our class MyDataLoader. The __init__() method initializes data and target. We could actually read the data from a file in the __init__() method itself, since it is executed only once; however, to keep the code simple, we will just pass in our already loaded NumPy data.

__len__():

The __len__() method needs to return the length of the dataset.

__getitem__():

At runtime PyTorch will call the __getitem__() method to assemble each mini batch. We just need to return the feature row vector and target class as a tuple, based on the index that was passed. Here we will convert the NumPy array to a torch.Tensor.

Also, remember that we don’t have to transform the target using OneHotEncoding, since nn.NLLLoss() expects class indices as targets.

We simply convert the target class to a Python int; the DataLoader then collates these ints into a tensor for each batch.

class MyDataLoader(data.Dataset):
	def __init__(self, X, Y):
		self.data = X
		self.target = Y
		self.n_samples = self.data.shape[0]

	def __len__(self):
		return self.n_samples

	def __getitem__(self, index):
		return torch.Tensor(self.data[index]), int(self.target[index])

Once the MyDataLoader class is complete, we can create an instance of it by passing our train feature matrix and target class vector.

Next we will pass the instance of MyDataLoader to the torch.utils.data.DataLoader class. We also need to provide the batch_size and num_workers. I have selected a batch size of 2048, and num_workers will mostly be the number of CPU cores you have. Since I have 32, I have used that value.

PyTorch’s DataLoader class takes care of batching, shuffling, parallel data loading, etc. (shuffling requires passing shuffle=True, which we do not do here). Nice!

train_dataset = self.MyDataLoader(X, Y)
data_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=2048, num_workers=32)
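To get a feel for what the DataLoader hands to the training loop, here is a simplified NumPy sketch of mini-batch iteration. The `iterate_minibatches` generator is a hypothetical helper, not the PyTorch class: it runs in a single process and omits the worker pool and tensor conversion. The array with 4 columns is a small stand-in for the 784 pixel columns.

```python
import numpy as np


def iterate_minibatches(X, Y, batch_size, shuffle=False, seed=0):
    """Yield (inputs, targets) batches; the last batch may be smaller."""
    indices = np.arange(X.shape[0])
    if shuffle:
        np.random.RandomState(seed).shuffle(indices)
    for start in range(0, X.shape[0], batch_size):
        batch = indices[start:start + batch_size]
        yield X[batch], Y[batch]


X = np.zeros((60000, 4), dtype=np.float32)  # stand-in for the MNIST features
Y = np.zeros(60000, dtype=np.int64)

batch_sizes = [len(targets) for _, targets in iterate_minibatches(X, Y, 2048)]
print(len(batch_sizes), batch_sizes[-1])
```

With 60000 training rows and a batch size of 2048, each epoch consists of 30 parameter updates, the last one on a smaller batch of 608 rows.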

Training Loop:

Back to our training loop inside the fit() function. First we will loop through the n_iterations and then the data_loader.

The data_loader will return a batch of train data. In order to use it for training we need to send it to the appropriate device, CPU or GPU. Just call the .to() function so that the data is moved to GPU memory or stays in main memory.

Then we will reset the gradient by calling optimizer.zero_grad(). self(inputs) will automatically execute the forward() function.

Next, the loss will be calculated using the predicted value and ground truth. Afterwards, call loss.backward() for computing the backpropagation and update the parameters using the optimizer.step() function.

for epoch in range(n_iterations):
	for k, (inputs, target) in enumerate(data_loader):
		inputs, target = inputs.to(device), target.to(device)

		optimizer.zero_grad()
		forward = self(inputs)
		loss = criterion(forward, target)
		loss.backward()
		optimizer.step()

predict():

We will use the MyDataLoader class for loading the test data too. Here we will use with torch.no_grad() in order to inform PyTorch that there is no need to track for gradients (This will save some computation).

The code below is very straightforward. I will let you go through it; ask questions as needed.

def predict(self, X, Y):
	dataset = self.MyDataLoader(X, Y)
	data_loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=2048, num_workers=32)
	with torch.no_grad():
		correct = 0
		total = 0
		for inputs, target in data_loader:
			inputs, target = inputs.to(device), target.to(device)
			forward = self(inputs)
			_, predicted = torch.max(forward.data, 1)
			total += target.size(0)
			correct += (predicted == target).sum().item()

		print('Accuracy of the network on the {} images: {} %'.format(Y.shape[0], 100 * correct / total))

Results:

Using just 2 layers, [196, 10], we can achieve 92.77% accuracy on the test set.

Train Epoch: 0 	Loss: 1.382688
Train Epoch: 100 	Loss: 0.211236
Train Epoch: 200 	Loss: 0.187463
Train Epoch: 300 	Loss: 0.164462
Train Epoch: 400 	Loss: 0.149739
Train Epoch: 500 	Loss: 0.140264
Train Epoch: 600 	Loss: 0.133282
Train Epoch: 700 	Loss: 0.127745
Train Epoch: 800 	Loss: 0.122800
Train Epoch: 900 	Loss: 0.118388
Train Accuracy: 94.23 %
Accuracy of the network on the 10000 images: 92.77 %

Here is the plot of the Cost function.

Try using different layers and hidden units and see how the accuracy changes.

Full ANN Class:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as data
import datasets.mnist.loader as mnist
import matplotlib.pyplot as plt
import numpy as np

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class ANN(nn.Module):
    class MyDataLoader(data.Dataset):
        def __init__(self, X, Y):
            self.data = X
            self.target = Y
            self.n_samples = self.data.shape[0]

        def __len__(self):
            return self.n_samples

        def __getitem__(self, index):
            return torch.Tensor(self.data[index]), int(self.target[index])

    def __init__(self, layers_size):
        super(ANN, self).__init__()
        self.layers_size = layers_size
        self.L = len(layers_size)
        self.costs = []

    def initialize_parameters(self):
        for l in range(0, self.L):
            self.add_module("fc" + str(l + 1), nn.Linear(self.layers_size[l], self.layers_size[l + 1]).to(device))

    def forward(self, X):

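        for l, (name, m) in enumerate(self.named_modules()):
            # named_modules() yields the ANN module itself first (l == 0),
            # so fc1 ... fcL arrive at l == 1 ... self.L.
            if l > 0:
                if l == self.L:
                    X = m(X)  # output layer: no ReLU
                else:
                    X = F.relu(m(X))

        return F.log_softmax(X, dim=1)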

    def fit(self, X, Y, learning_rate=0.1, n_iterations=2500):

        self.to(device)

        self.layers_size.insert(0, X.shape[1])

        self.initialize_parameters()

        optimizer = torch.optim.SGD(self.parameters(), lr=learning_rate)
        criterion = nn.NLLLoss()

        train_dataset = self.MyDataLoader(X, Y)
        data_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=2048, num_workers=32)

        for epoch in range(n_iterations):
            for k, (inputs, target) in enumerate(data_loader):
                inputs, target = inputs.to(device), target.to(device)

                optimizer.zero_grad()
                forward = self(inputs)
                loss = criterion(forward, target)
                loss.backward()
                optimizer.step()

            if epoch % 100 == 0:
                print('Train Epoch: {} \tLoss: {:.6f}'.format(epoch, loss.item()))

            if epoch % 10 == 0:
                self.costs.append(loss.item())

        with torch.no_grad():
            correct = 0
            total = 0
            for inputs, labels in data_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = self(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
            print('Train Accuracy: {:.2f} %'.format(100 * correct / total))

    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()

    def predict(self, X, Y):
        dataset = self.MyDataLoader(X, Y)
        data_loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=2048, num_workers=32)
        with torch.no_grad():
            correct = 0
            total = 0
            for inputs, target in data_loader:
                inputs, target = inputs.to(device), target.to(device)
                forward = self(inputs)
                _, predicted = torch.max(forward.data, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

            print('Accuracy of the network on the {} images: {} %'.format(Y.shape[0], 100 * correct / total))


def pre_process_data(train_x, train_y, test_x, test_y):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    return train_x, train_y, test_x, test_y


if __name__ == '__main__':
    train_x_orig, train_y_orig, test_x_orig, test_y_orig = mnist.get_data()
    train_x, train_y, test_x, test_y = pre_process_data(train_x_orig, train_y_orig, test_x_orig, test_y_orig)

    model = ANN(layers_size=[196, 10])
    model.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
    model.predict(test_x, test_y)
    model.plot_cost()

Please find the full project here:

The post Implement Neural Network using PyTorch appeared first on A Developer Diary.

Understanding and implementing Neural Network with SoftMax in Python from scratch


Understanding multi-class classification using a Feedforward Neural Network is the foundation for most of the other, more complex and domain-specific architectures. However, most lectures and books go through binary classification using Binary Cross Entropy Loss in detail and skip the derivation of backpropagation with the Softmax activation. In this tutorial, Understanding and implementing Neural Network with Softmax in Python from scratch, we will go through the mathematical derivation of backpropagation with the Softmax activation and also implement it in Python from scratch.

We will continue from where we left off in the previous tutorial on backpropagation using the binary cross entropy loss function. We will extend the same code to work with the Softmax activation. In case you need to refer to it, please find the previous tutorial here:

Understand and Implement the Backpropagation Algorithm From Scratch In Python

Softmax:

The Sigmoid activation function we used earlier for binary classification needs to be changed for multi-class classification. The basic idea of Softmax is to distribute the probability across the different classes so that they sum to 1. Earlier we used only one Sigmoid unit in the output layer; now the number of Softmax output units needs to be the same as the number of classes. Since we will be using the full MNIST dataset here, we have 10 classes in total, hence we need 10 units at the final layer of our network. The Softmax activation function looks at the Z values from all (10 here) output units and provides the probability for each class. Later, during prediction, we can just take the most probable one and treat it as the final output.

As you can see in the picture below, there are 5 units at the final layer, each corresponding to a specific class.

Understanding and implementing Neural Network with SoftMax in Python from scratch adeveloperdiary.com

Mathematical Definition of Softmax:

The Softmax function can be defined as below, where c is equal to the number of classes.

\[
a_i = \frac{e^{z_i}}{\sum_{k=1}^c e^{z_k}} \\
\text{where} \sum_{i=1}^c a_i = 1
\]

The diagram below shows the Softmax function; each unit at the last layer outputs a number between 0 and 1.

Understanding and implementing Neural Network with SoftMax in Python from scratch adeveloperdiary.com

Implementation Note:

The Softmax function as defined above is not numerically stable: if you implement it directly in Python, large values of \(z\) overflow inside np.exp() and you will frequently get nan. To avoid that, we can multiply both the numerator and the denominator by a constant \(C\) (not to be confused with the number of classes \(c\)).

\[
\begin{align}
a_i =& \frac{Ce^{z_i}}{C\sum_{k=1}^c e^{z_k}} \\
=& \frac{e^{z_i+\log C}}{\sum_{k=1}^c e^{z_k+\log C}} \\
\end{align}
\]

A popular choice for the \( \log C \) constant is \( -max\left ( z \right ) \)

\[
a_i = \frac{e^{z_i - max\left ( z \right )}}{\sum_{k=1}^c e^{z_k - max\left ( z \right )}}
\]

def softmax(self, Z):
	expZ = np.exp(Z - np.max(Z))
	return expZ / expZ.sum(axis=0, keepdims=True)
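
To see why the shift matters, here is a small sketch that compares a direct translation of the definition with a shifted version. The function names and the sample values are made up for illustration, and the shifted version subtracts the per-column maximum (a common variation) rather than the global maximum used in the snippet above.

```python
import numpy as np


def softmax_naive(Z):
    # Direct translation of the definition; np.exp overflows for large Z.
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=0, keepdims=True)


def softmax_stable(Z):
    # Subtract the maximum of each column (each sample) before exponentiating.
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)


# One row per class, one column per sample (same layout as the code above).
Z = np.array([[1000.0, 1.0],
              [1001.0, 2.0],
              [1002.0, 3.0]])

# Silence the overflow warnings so the comparison is easy to read.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(Z)

stable = softmax_stable(Z)

print("naive :", naive[:, 0])    # inf / inf in the first column
print("stable:", stable[:, 0])
print("column sums:", stable.sum(axis=0))
```

Both columns of Z differ only by a constant shift, so the stable version returns the same probabilities for both samples, while the naive version fails on the first one.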

SoftMax in Forward Propagation:

In our previous tutorial we used the Sigmoid at the final layer. Now we will just replace that with the Softmax function. That's all the change you need to make.

def forward(self, X):
	store = {}

	A = X.T
	for l in range(self.L - 1):
		Z = self.parameters["W" + str(l + 1)].dot(A) + self.parameters["b" + str(l + 1)]
		A = self.sigmoid(Z)
		store["A" + str(l + 1)] = A
		store["W" + str(l + 1)] = self.parameters["W" + str(l + 1)]
		store["Z" + str(l + 1)] = Z

	Z = self.parameters["W" + str(self.L)].dot(A) + self.parameters["b" + str(self.L)]
	A = self.softmax(Z) # Replace this line
	store["A" + str(self.L)] = A
	store["W" + str(self.L)] = self.parameters["W" + str(self.L)]
	store["Z" + str(self.L)] = Z

	return A, store

Loss Function:

We will be using the Cross-Entropy Loss (in log scale) with the SoftMax, which can be defined as,

\[
L = - \sum_{i=1}^c y_i log a_i
\]

cost = -np.mean(Y * np.log(A.T + 1e-8))

Numerical Approximation:

As you have seen in the above code, we have added a very small number, 1e-8, inside the log to avoid taking the log of zero, which would produce -inf.

Because of this, our loss may never be exactly 0.
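
The following sketch evaluates the loss on a tiny, made-up batch to show what the cost line computes. The arrays A and Y here are hypothetical and already have one row per sample, so no transpose is needed. One detail worth noticing: np.mean over the whole matrix divides by the number of samples times the number of classes, so the reported cost is the per-sample cross-entropy divided by the number of classes.

```python
import numpy as np

# Hypothetical predictions for 2 samples and 3 classes (one row per sample).
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])

# One-hot labels: sample 0 belongs to class 0, sample 1 to class 1.
Y = np.array([[1, 0, 0],
              [0, 1, 0]])

eps = 1e-8

# Cross-entropy of each sample: -sum_i y_i * log(a_i)
per_sample = -np.sum(Y * np.log(A + eps), axis=1)

# Average over the samples only.
mean_over_samples = per_sample.mean()

# Average over every entry of the matrix, as the cost line above does.
mean_over_entries = -np.mean(Y * np.log(A + eps))

# With the small constant, a predicted probability of exactly 0 stays finite.
worst_case = -np.log(0.0 + eps)

print(per_sample, mean_over_samples, mean_over_entries, worst_case)
```

The scaling does not change where the minimum is, it only rescales the number that gets printed during training.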

Derivative of SoftMax:

Our main focus is to understand how to use this Softmax function during backpropagation. As you already know (please refer to my previous post if needed), we start backpropagation by taking the derivative of the Loss/Cost function. However, there is a neat trick we can apply to make the derivation simpler. To do so, let's first understand the derivative of the Softmax function.

We know that if \(f(x) = \frac{g(x)}{h(x)}\) then we can take the derivative of \(f(x)\) using the following formula,

\[
f'(x) = \frac{g'(x)h(x) - h'(x)g(x)}{h(x)^2}
\]

In case of Softmax function,

\[
\begin{align}
g(x) &= e^{z_i} \\
h(x) &=\sum_{k=1}^c e^{z_k}
\end{align}
\]

Now,
\[
\frac{da_i}{dz_j} = \frac{d}{dz_j} \bigg( \frac{e^{z_i}}{\sum_{k=1}^c e^{z_k}} \bigg) = \frac{d}{dz_j} \bigg( \frac{g(x)}{h(x)} \bigg)
\]

Calculate \(g'(x)\):

\[
\begin{align}
\frac{d}{dz_j} \big( g(x)\big) &= \frac{d}{dz_j} (e^{z_i}) \\
&=\frac{d}{dz_i} (e^{z_i})\frac{dz_i}{dz_j} \\
&= e^{z_i} \frac{dz_i}{dz_j} \\
&= \left\{\begin{matrix}
& e^{z_i} \text{ if } i = j\\
& 0 \text{ if } i \not= j
\end{matrix}\right.
\end{align}
\]

Calculate \(h'(x)\) :

\[
\begin{align}
\frac{d}{dz_j} \big( h(x)\big) &= \frac{d}{dz_j} \big( \sum_{k=1}^c e^{z_k}\big) \\
&= \frac{d}{dz_j} \big( \sum_{k=1, k \not=j}^c e^{z_k} + e^{z_j}\big) \\
&= \frac{d}{dz_j} \big( \sum_{k=1, k \not=j}^c e^{z_k} \big) + \frac{d}{dz_j} \big( e^{z_j}\big) \\
&=0+ e^{z_j} \\
&= e^{z_j} \\
\end{align}
\]

So we have two scenarios, when \( i = j \):

\[
\begin{align}
\frac{da_i}{dz_j} &= \frac{e^{z_i}\sum_{k=1}^c e^{z_k} -e^{z_j}e^{z_i} }{\big( \sum_{k=1}^c e^{z_k} \big)^2} \\
&= \frac{e^{z_i} \big(\sum_{k=1}^c e^{z_k} -e^{z_j} \big)}{\big( \sum_{k=1}^c e^{z_k} \big)^2} \\
&= \frac{e^{z_i}}{\sum_{k=1}^c e^{z_k}} . \frac{\sum_{k=1}^c e^{z_k} -e^{z_j}}{\sum_{k=1}^c e^{z_k}} \\
&= a_i (1- a_j) \\
&= a_i (1- a_i) \text{ ; since } i=j
\end{align}
\]

And when \(i\not=j\):

\[
\begin{align}
\frac{da_i}{dz_j} &= \frac{0 \sum_{k=1}^c e^{z_k} -e^{z_j}e^{z_i} }{\big( \sum_{k=1}^c e^{z_k} \big)^2} \\
&= \frac{ - e^{z_j}e^{z_i}}{\big( \sum_{k=1}^c e^{z_k} \big)^2} \\
&= -a_i a_j \\
\end{align}
\]
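
The two cases can be combined into a single matrix, diag(a) minus the outer product of a with itself. The sketch below builds that matrix and compares it with a finite-difference estimate; the helper names and the test vector are made up for illustration.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def softmax_jacobian(z):
    a = softmax(z)
    # J[i, j] = a_i * (1 - a_i) when i == j, and -a_i * a_j otherwise.
    return np.diag(a) - np.outer(a, a)


def numerical_jacobian(z, h=1e-6):
    # Central differences: perturb one z_j at a time.
    n = z.size
    J = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        J[:, j] = (softmax(z + step) - softmax(z - step)) / (2 * h)
    return J


z = np.array([0.5, -1.0, 2.0])
J_analytic = softmax_jacobian(z)
J_numeric = numerical_jacobian(z)

print("max abs difference:", np.max(np.abs(J_analytic - J_numeric)))
```

Each column of the matrix sums to zero, which reflects the fact that the outputs always sum to 1: raising one probability has to lower the others by the same total amount.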

Derivative of Cross-Entropy Loss with Softmax:

As we have already done for backpropagation using Sigmoid, we now need to calculate \( \frac{dL}{dw_i} \) using the chain rule. The first step would be to calculate the derivative of the Loss function w.r.t. \(a\). However, when we use the Softmax activation function we can derive \( \frac{dL}{dz_i} \) directly. Hence during programming we can skip one step.

Later you will find that the backpropagation for Softmax and Sigmoid is exactly the same. You can go back to the previous tutorial and modify it to compute \(dZ^L\) directly instead of \(dA^L\). We computed \(dA^L\) there so that it's easy to follow at first.

\[
\require{cancel}
\begin{align}
\frac{dL}{dz_i} &= \frac{d}{dz_i} \bigg[ - \sum_{k=1}^c y_k log (a_k) \bigg] \\
&= - \sum_{k=1}^c y_k \frac{d \big( log (a_k) \big)}{dz_i} \\
&= - \sum_{k=1}^c y_k \frac{d \big( log (a_k) \big)}{da_k} . \frac{da_k}{dz_i} \\
&= - \sum_{k=1}^c\frac{y_k}{a_k} . \frac{da_k}{dz_i} \\
&= - \bigg[ \frac{y_i}{a_i} . \frac{da_i}{dz_i} + \sum_{k=1, k \not=i}^c \frac{y_k}{a_k} \frac{da_k}{dz_i} \bigg] \\
&= - \frac{y_i}{\cancel{a_i}} . \cancel{a_i}(1-a_i) \text{ } - \sum_{k=1, k \not=i}^c \frac{y_k}{\cancel{a_k}} . (-\cancel{a_k}a_i) \\
&= - y_i +y_ia_i + \sum_{k=1, k \not=i}^c y_ka_i \\
&= a_i \big( y_i + \sum_{k=1, k \not=i}^c y_k \big) - y_i \\
&= a_i \sum_{k=1}^c y_k -y_i \\
&= a_i . 1 - y_i \text{ , since } \sum_{k=1}^c y_k =1 \\
&= a_i - y_i
\end{align}
\]

If you look closely, this is the same equation as we had for Binary Cross-Entropy Loss (refer to the previous article).
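
A quick way to convince yourself of the result is a numerical gradient check on a single sample. This is only a sketch: the vector z, the one-hot label y and the helper names are invented for the check.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def loss(z, y):
    # Cross-entropy of one sample: -sum_k y_k * log(a_k)
    return -np.sum(y * np.log(softmax(z)))


z = np.array([0.2, 1.5, -0.3, 0.9])
y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot label, true class is index 2

# Result of the derivation: dL/dz = a - y
analytic = softmax(z) - y

# Central finite differences, one coordinate at a time.
h = 1e-6
numeric = np.array([
    (loss(z + h * e, y) - loss(z - h * e, y)) / (2 * h)
    for e in np.eye(z.size)
])

print("analytic:", analytic)
print("numeric :", numeric)
```

The gradient is negative only for the true class, so a gradient descent step raises that z value and lowers the rest.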

Backpropagation:

Now we will use the previously derived derivative of Cross-Entropy Loss with Softmax to complete the Backpropagation.

The matrix form of the previous derivation can be written as :

\[
\begin{align}
\frac{dL}{dZ} &= A - Y
\end{align}
\]

For the final layer L we can define as:

\[
\begin{align}
\frac{dL}{dW^L} &= \frac{dL}{dZ^L} \frac{dZ^L}{dW^L} \\
&= (A^L-Y) \frac{d}{dW^L} \big( A^{L-1}W^L + b^L \big) \\
&= (A^L-Y) A^{L-1}
\end{align}
\]

For all other layers except the layer L we can define as:

\[
\begin{align}
\frac{dL}{dW^{L-1}} &= \frac{dL}{dZ^L} \frac{dZ^L}{dA^{L-1}}\frac{dA^{L-1}}{dZ^{L-1}} \frac{dZ^{L-1}}{dW^{L-1}}\\
&= (A^L-Y) \frac{d}{dA^{L-1}} \big( A^{L-1}W^L + b^L \big) \\
& \frac{d}{dZ^{L-1}} \big( \sigma(Z^{L-1}) \big) \frac{d}{dW^{L-1}} \big( A^{L-2}W^{L-1} + b^{L-1} \big)\\
&= (A^L-Y) W^L\sigma'(Z^{L-1})A^{L-2}
\end{align}
\]

This is exactly the same as our existing solution.

Code:

Below is the code of the backward() function. The only difference between this and the previous version is that we calculate \(dZ\) for the final layer directly (dZ = A - Y.T) instead of going through \(dA\):

def backward(self, X, Y, store):

	derivatives = {}

	store["A0"] = X.T

	A = store["A" + str(self.L)]
	dZ = A - Y.T

	dW = dZ.dot(store["A" + str(self.L - 1)].T) / self.n
	db = np.sum(dZ, axis=1, keepdims=True) / self.n
	dAPrev = store["W" + str(self.L)].T.dot(dZ)

	derivatives["dW" + str(self.L)] = dW
	derivatives["db" + str(self.L)] = db

	for l in range(self.L - 1, 0, -1):
		dZ = dAPrev * self.sigmoid_derivative(store["Z" + str(l)])
		dW = dZ.dot(store["A" + str(l - 1)].T) / self.n
		db = np.sum(dZ, axis=1, keepdims=True) / self.n
		if l > 1:
			dAPrev = store["W" + str(l)].T.dot(dZ)

		derivatives["dW" + str(l)] = dW
		derivatives["db" + str(l)] = db

	return derivatives

One Hot Encoding:

Instead of using 0 and 1 as in binary classification, we need to use a One Hot Encoding transformation of Y. We will be using the sklearn.preprocessing.OneHotEncoder class. In our example, the transformed Y will have 10 columns since we have 10 different classes.

We will add the additional transformation in the pre_process_data() function.

def pre_process_data(train_x, train_y, test_x, test_y):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    # scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
    enc = OneHotEncoder(sparse_output=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))

    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y
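
If you want to see what the transformation produces without going through scikit-learn, the sketch below builds the same kind of matrix with plain NumPy. The label values are made up, and indexing an identity matrix is just one convenient way to do it, not what OneHotEncoder does internally.

```python
import numpy as np

# Hypothetical digit labels for four samples.
labels = np.array([3, 0, 9, 1])
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes)[labels]

# np.argmax along the class axis converts the encoding back to labels.
recovered = np.argmax(one_hot, axis=1)

print(one_hot.shape)
print(one_hot[0])
print(recovered)
```

Each row contains a single 1 in the column of its class.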

Predict():

The predict() function changes for Softmax. First we get the most probable class by calling np.argmax() on the network output, then do the same for the one-hot encoded Y values to convert them back to numeric labels. Finally we calculate the accuracy.

def predict(self, X, Y):
	A, cache = self.forward(X)
	y_hat = np.argmax(A, axis=0)
	Y = np.argmax(Y, axis=1)
	accuracy = (y_hat == Y).mean()
	return accuracy * 100

Full Code:

import numpy as np
import datasets.mnist.loader as mnist
import matplotlib.pylab as plt
from sklearn.preprocessing import OneHotEncoder


class ANN:
    def __init__(self, layers_size):
        self.layers_size = layers_size
        self.parameters = {}
        self.L = len(self.layers_size)
        self.n = 0
        self.costs = []

    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-Z))

    def softmax(self, Z):
        expZ = np.exp(Z - np.max(Z))
        return expZ / expZ.sum(axis=0, keepdims=True)

    def initialize_parameters(self):
        np.random.seed(1)

        for l in range(1, len(self.layers_size)):
            self.parameters["W" + str(l)] = np.random.randn(self.layers_size[l], self.layers_size[l - 1]) / np.sqrt(
                self.layers_size[l - 1])
            self.parameters["b" + str(l)] = np.zeros((self.layers_size[l], 1))

    def forward(self, X):
        store = {}

        A = X.T
        for l in range(self.L - 1):
            Z = self.parameters["W" + str(l + 1)].dot(A) + self.parameters["b" + str(l + 1)]
            A = self.sigmoid(Z)
            store["A" + str(l + 1)] = A
            store["W" + str(l + 1)] = self.parameters["W" + str(l + 1)]
            store["Z" + str(l + 1)] = Z

        Z = self.parameters["W" + str(self.L)].dot(A) + self.parameters["b" + str(self.L)]
        A = self.softmax(Z)
        store["A" + str(self.L)] = A
        store["W" + str(self.L)] = self.parameters["W" + str(self.L)]
        store["Z" + str(self.L)] = Z

        return A, store

    def sigmoid_derivative(self, Z):
        s = 1 / (1 + np.exp(-Z))
        return s * (1 - s)

    def backward(self, X, Y, store):

        derivatives = {}

        store["A0"] = X.T

        A = store["A" + str(self.L)]
        dZ = A - Y.T

        dW = dZ.dot(store["A" + str(self.L - 1)].T) / self.n
        db = np.sum(dZ, axis=1, keepdims=True) / self.n
        dAPrev = store["W" + str(self.L)].T.dot(dZ)

        derivatives["dW" + str(self.L)] = dW
        derivatives["db" + str(self.L)] = db

        for l in range(self.L - 1, 0, -1):
            dZ = dAPrev * self.sigmoid_derivative(store["Z" + str(l)])
            dW = 1. / self.n * dZ.dot(store["A" + str(l - 1)].T)
            db = 1. / self.n * np.sum(dZ, axis=1, keepdims=True)
            if l > 1:
                dAPrev = store["W" + str(l)].T.dot(dZ)

            derivatives["dW" + str(l)] = dW
            derivatives["db" + str(l)] = db

        return derivatives

    def fit(self, X, Y, learning_rate=0.01, n_iterations=2500):
        np.random.seed(1)

        self.n = X.shape[0]

        self.layers_size.insert(0, X.shape[1])

        self.initialize_parameters()
        for loop in range(n_iterations):
            A, store = self.forward(X)
            cost = -np.mean(Y * np.log(A.T+ 1e-8))
            derivatives = self.backward(X, Y, store)

            for l in range(1, self.L + 1):
                self.parameters["W" + str(l)] = self.parameters["W" + str(l)] - learning_rate * derivatives[
                    "dW" + str(l)]
                self.parameters["b" + str(l)] = self.parameters["b" + str(l)] - learning_rate * derivatives[
                    "db" + str(l)]

            if loop % 100 == 0:
                print("Cost: ", cost, "Train Accuracy:", self.predict(X, Y))

            if loop % 10 == 0:
                self.costs.append(cost)

    def predict(self, X, Y):
        A, cache = self.forward(X)
        y_hat = np.argmax(A, axis=0)
        Y = np.argmax(Y, axis=1)
        accuracy = (y_hat == Y).mean()
        return accuracy * 100

    def plot_cost(self):
        plt.figure()
        plt.plot(np.arange(len(self.costs)), self.costs)
        plt.xlabel("epochs")
        plt.ylabel("cost")
        plt.show()


def pre_process_data(train_x, train_y, test_x, test_y):
    # Normalize
    train_x = train_x / 255.
    test_x = test_x / 255.

    # scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
    enc = OneHotEncoder(sparse_output=False, categories='auto')
    train_y = enc.fit_transform(train_y.reshape(len(train_y), -1))

    test_y = enc.transform(test_y.reshape(len(test_y), -1))

    return train_x, train_y, test_x, test_y


if __name__ == '__main__':
    train_x, train_y, test_x, test_y = mnist.get_data()

    train_x, train_y, test_x, test_y = pre_process_data(train_x, train_y, test_x, test_y)

    print("train_x's shape: " + str(train_x.shape))
    print("test_x's shape: " + str(test_x.shape))

    layers_dims = [50, 10]

    ann = ANN(layers_dims)
    ann.fit(train_x, train_y, learning_rate=0.1, n_iterations=1000)
    print("Train Accuracy:", ann.predict(train_x, train_y))
    print("Test Accuracy:", ann.predict(test_x, test_y))
    ann.plot_cost()

Output:

Here is the plot of the cost function:

Understanding and implementing Neural Network with SoftMax in Python from scratch adeveloperdiary.com

This is the output after 1000 iterations. Here our test accuracy is higher than the train accuracy; do you know why? Post a comment in case you are not sure and I will explain.

train_x's shape: (60000, 784)
test_x's shape: (10000, 784)
Cost:  0.24014291022543646 Train Accuracy: 8.393333333333333
Cost:  0.16293340442170298 Train Accuracy: 70.35833333333333
Cost:  0.11068081204697405 Train Accuracy: 79.54833333333333
Cost:  0.08353159072761683 Train Accuracy: 83.24833333333333
Cost:  0.06871067093157585 Train Accuracy: 85.32
Cost:  0.05959970354422914 Train Accuracy: 86.56666666666666
Cost:  0.05347708397827516 Train Accuracy: 87.46333333333334
Cost:  0.049101880831507155 Train Accuracy: 88.12
Cost:  0.04583107963137556 Train Accuracy: 88.59666666666666
Cost:  0.04329685602394087 Train Accuracy: 89.00833333333334
Train Accuracy: 89.31333333333333
Test Accuracy: 89.89

Please find the full project here:

Conclusion:

Below are the articles on implementing the Neural Network using TensorFlow and PyTorch.

  1. Implement Neural Network using TensorFlow
  2. Implement Neural Network using PyTorch

The post Understanding and implementing Neural Network with SoftMax in Python from scratch appeared first on A Developer Diary.


Introduction to Naive Bayes Classifier using R and Python


Naive Bayes Classifier is one of the simplest Machine Learning algorithms to implement, hence it is often taught as the first classifier to many students. However, many of the tutorials are rather incomplete and do not provide a proper understanding. Hence, in this Introduction to Naive Bayes Classifier using R and Python tutorial we will learn this simple yet useful concept. Bayesian Modeling is the foundation of many important statistical concepts such as Hierarchical Models (Bayesian networks), Markov Chain Monte Carlo etc.

Naive Bayes Classifier is a special, simplified case of Bayesian networks where we assume that the feature values are independent of each other given the class. Hierarchical Models can be used to define the dependency between features, and we can build much more complex and accurate models using JAGS, BUGS or Stan (which is out of scope of this tutorial).

Prerequisites:

This tutorial expects you to already know Bayes Theorem and to have some understanding of Gaussian distributions.

Objective:

Say we have a dataset and the classes (label/target) associated with each sample. For example, if we consider the Iris dataset with only 2 types of flower, Versicolor and Virginica, then the feature vector ( X ) will contain 4 features: Petal length, Petal width, Sepal length and Sepal width. Versicolor and Virginica will be the class ( Y ) of each sample. Using the training data, we would like to build our Naive Bayes Classifier so that, given any unlabeled data, we are able to classify the flower correctly.

Bayes Theorem:

We can write the Bayes Theorem as following where X is the feature vector and Y is the output class/target variable.

\[
p(Y|X) = \frac{p(X|Y)p(Y)}{p(X)}
\]

As you already know, the terms of this equation are named as follows:

\[
\text{posterior} = \frac{\text{likelihood} * \text{prior}}{ \text{marginal} }
\]

Naive Bayes Classifier:

We will now use the above Bayes Theorem to come up with Bayes Classifier.

Simplify the Posterior Probability:

Say we have only two classes, 0 and 1 [0 = Versicolor, 1 = Virginica]. Our objective is then to find the values of \( p(Y=0|X) \) and \( p(Y=1|X) \); whichever probability is larger, we predict that the data belongs to that class.

We can define that mathematically as:

\[
\arg\max_{c} (p(y_c|X) )
\]

Simplify the Likelihood Probability:

By saying Naive, we assume that the features are independent of each other given the class. We can then define the likelihood as the product of the probabilities of each of the features given the class.

\[
\begin{align}
p(X|Y) &= p(x_1|y_c)p(x_2|y_c) \cdots p(x_n|y_c) \\
&= \prod_{i=1}^n p(x_i|y_c) \\
& \text{where } y_c \text{ is any specific class, 0 or 1 }
\end{align}
\]

Simplify the Prior Probability:

We can define the prior as \(\frac{m_c}{m}\), where \(m_c\) is the number of samples for the class \(c\) and \(m\) is the total number of samples in our dataset.

Simplify the Marginal Probability:

The marginal probability \(p(X)\) is not really useful to us, since it does not depend on Y and is therefore the same for all the classes. So we can write:

\[
\begin{align}
p(Y|X) = & \text{ } k \, p(X|Y)p(Y) \\
\propto & \text{ } p(X|Y)p(Y) \\
& \text{where } k = \frac{1}{p(X)} \text{ is the same constant for every class}
\end{align}
\]

The constant k can be dropped during implementation: multiplying every class score by the same positive number does not change which class has the largest score.

Final Equation:

The final equation looks like the following:

\[
\begin{align}
\text{prediction} &= \arg\max_{c} (p(y_c|X) ) \\
&= \arg\max_{c} \prod_{i=1}^n p(x_i|y_c) p(y_c) \\
\end{align}
\]

However, a product of many small probabilities can underflow to zero. We will therefore work in log scale in our implementation; since log is a monotonically increasing function, the class with the largest score stays the same.

\[
\begin{align}
\text{prediction} =& \arg\max_{c} \bigg( \sum_{i=1}^n log(p(x_i|y_c))+ log(p(y_c)) \bigg) \\
=& \arg\max_{c} \bigg( \sum_{i=1}^n log(p(x_i|y_c))+ log(\frac{m_c}{m}) \bigg)\\
\end{align}
\]
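
The numerical issue is easy to reproduce. In the sketch below, the two arrays stand in for the per-feature likelihoods of two classes; the numbers (400 features, probabilities of 0.01 and 0.02) are invented purely to trigger the underflow.

```python
import numpy as np

# Hypothetical per-feature likelihoods for two classes.
likelihoods_a = np.full(400, 0.01)
likelihoods_b = np.full(400, 0.02)

# The true products (1e-800 and about 1e-680) are far below the smallest
# positive float64, so both collapse to exactly 0.0.
prod_a = np.prod(likelihoods_a)
prod_b = np.prod(likelihoods_b)

# Summing logs keeps the scores finite and still comparable.
log_a = np.sum(np.log(likelihoods_a))
log_b = np.sum(np.log(likelihoods_b))

print(prod_a, prod_b)
print(log_a, log_b)
```

With the raw products the two classes are indistinguishable, while the log scores still show that the second class is the more likely one.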

Believe it or not, we are done defining our Naive Bayes Classifier. There is just one thing pending: we need to define a model to calculate the likelihood.

How to define the Likelihood?

There are different strategies and it really depends on the features.

Discrete Variable:

In case the features are discrete variables, we can define the likelihood simply using the probability of each feature value given the class. For example, in case we are creating a classifier to detect spam emails, and we have three words (discount, offer and dinner) as our features, then we can define our likelihood as:

\[
\begin{align}
p(X|Y=\text{ spam })=& p(\text{discount=yes}|spam)*p(\text{offer=yes}|spam) \\ & *p(\text{dinner=no}|spam)\\
=& (10/15)*( 7/15 )*( 1/15 )
\end{align}
\]

You can then calculate the prior and easily classify the data using the final equation we have.
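
Here is how the full calculation could look in code. The spam likelihoods are the ones from the equation above, reading the denominators as 15 spam emails (an assumption). Everything about the second class (25 non-spam emails and their word counts) is hypothetical and only there to make the comparison possible.

```python
# Likelihood for the spam class, using the fractions from the example above.
n_spam = 15
likelihood_spam = (10 / n_spam) * (7 / n_spam) * (1 / n_spam)

# Hypothetical counts for the non-spam class, made up for this illustration:
# discount=yes in 2 emails, offer=yes in 5 emails, dinner=no in 10 emails.
n_ham = 25
likelihood_ham = (2 / n_ham) * (5 / n_ham) * (10 / n_ham)

# Priors are the class frequencies m_c / m.
prior_spam = n_spam / (n_spam + n_ham)
prior_ham = n_ham / (n_spam + n_ham)

# Unnormalized posteriors: likelihood * prior.
score_spam = likelihood_spam * prior_spam
score_ham = likelihood_ham * prior_ham

prediction = "spam" if score_spam > score_ham else "not spam"

# Dividing by the sum recovers the actual posterior probability.
posterior_spam = score_spam / (score_spam + score_ham)

print(prediction, posterior_spam)
```

The normalization in the last step is optional for classification, since the comparison of the two scores already decides the class.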

Note: Often in exams this comes as a problem to solve by hand.

Continuous Variable:

In case our features are continuous ( like we have in our iris dataset ) we have two options:

  • Quantize the continuous values and use them as categorical variable.
  • Define a distribution and model the likelihood using it.

I will talk about vector quantization in a future video; for now, let's look more into the 2nd option.

If you plot any feature \(x_1\) the distribution might look Normal/Gaussian, hence we can use normal distribution to define our likelihood. For simplicity, assume we have only one feature and if we plot the data for both the classes, it might look like following:

Introduction to Naive Bayes Classifier using R and Python adeveloperdiary

In the above case, any new point on the left side will have a higher probability for \( p(x_1|y=0) \) than for \( p(x_1|y=1) \). We can define the probability using the Univariate Gaussian Distribution.

\[
P(x| \mu, \sigma) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{-(x-\mu)^2 / 2 \sigma^2}
\]

We can easily estimate the mean and variance from our train data.

So our likelihood will be, \( P(x| \mu, \sigma, y_c) \)
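
As a small illustration, the sketch below evaluates the Gaussian density of one new point under two classes. The means, standard deviations and the new value are hypothetical numbers, not estimates from the iris data.

```python
import math


def gaussian_pdf(x, mu, sigma):
    # Univariate Gaussian density, as in the formula above.
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Hypothetical per-class estimates for a single feature.
mu0, sigma0 = 4.3, 0.5   # class 0
mu1, sigma1 = 5.6, 0.6   # class 1

x_new = 4.5
p0 = gaussian_pdf(x_new, mu0, sigma0)   # likelihood under class 0
p1 = gaussian_pdf(x_new, mu1, sigma1)   # likelihood under class 1

print(p0, p1)
```

The point lies much closer to the mean of class 0, so its likelihood under class 0 is the larger of the two.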

Important Note:

Now, you might be tempted to plot the features and, in case they look like an exponential distribution, use the exponential distribution to define the likelihood. I must tell you that you shouldn't do anything like that. There are a couple of reasons:

  • Limited data might not show the true distribution accurately, hence the prediction can be wrong.
  • We really don't need to match the distribution exactly with the data; as long as we can separate the classes, our classifier will work perfectly well.

So we mostly use the Gaussian distribution for continuous variables (and the Bernoulli distribution for binary ones).

Code Naive Bayes Classifier using Python from scratch:

Enough of theory, let’s now actually build the classifier using Python from scratch.

First let’s understand the structure of our NaiveBayes class.

import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from Logging.Logging import output_log
import math


class NaiveBayes:
    def __init__(self):
        pass    

    def fit(self, X, y):
        pass

    def predict(self, X):
        return True

    def accuracy(self, y, prediction):
        return True

if __name__ == '__main__':
    iris = sns.load_dataset("iris")
    iris = iris.loc[iris["species"] != "setosa"]

    le = preprocessing.LabelEncoder()
    y = le.fit_transform(iris["species"])
    X = iris.drop(["species"], axis=1).values

    train_accuracy = np.zeros([100])
    test_accuracy = np.zeros([100])

    for loop in range(100):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=loop)

        model = NaiveBayes()
        model.fit(X_train, y_train)
		
        prediction = model.predict(X_train)
        train_accuracy[loop] = model.accuracy(y_train, prediction)
		
        prediction = model.predict(X_test)
        test_accuracy[loop] = model.accuracy(y_test, prediction)

    output_log("Average Train Accuracy {}%".format(np.mean(train_accuracy)))
    output_log("Average Test Accuracy {}%".format(np.mean(test_accuracy)))

We will be using seaborn package just to access the iris data without writing any code. Then we will define the skeleton of the NaiveBayes class.

In this example we will work on binary classification, hence we won't use the setosa flower type. There will be only 100 samples. We convert the class labels to numeric values using LabelEncoder.

In order to get a better estimate of our classifier, we will run the classification 100 times on randomly split data and then average them out to get our estimate. Hence we have the loop and inside the loop we are splitting the data into train/test sets.

Finally, inside the loop we instantiate our class and invoke the fit() function once per split.

fit():

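The fit function won't return any value. Standardizing the features is a common preprocessing step. For a Gaussian likelihood it does not change which class wins, because rescaling a feature multiplies the density of every class by the same factor, but it keeps all the features on a comparable numeric scale. Here we will normalize each feature so that the mean is 0 and the standard deviation is 1.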

We will start by calling a function named calculate_mean_sd() which will calculate and store \( \mu \) and \(\sigma\) as class variable. Then we will call normalize() function to scale the data.

Next, we need to calculate the \( \mu \) and \(\sigma\) for each class. Finally, we calculate the prior for each class and save them as class variables.

def fit(self, X, y):
	self.calculate_mean_sd(X)
	train_scaled = self.normalize(X)

	X_class1 = train_scaled[np.where(y == 0)]
	X_class2 = train_scaled[np.where(y == 1)]

	self.class1_mean = np.mean(X_class1, axis=0)
	self.class1_sd = np.std(X_class1, axis=0)

	self.class2_mean = np.mean(X_class2, axis=0)
	self.class2_sd = np.std(X_class2, axis=0)

	self.class1_prior = X_class1.shape[0] / X.shape[0]
	self.class2_prior = X_class2.shape[0] / X.shape[0]

Below are the calculate_mean_sd() and normalize() function.

def calculate_mean_sd(self, X):
	self.train_mean = np.mean(X, axis=0)
	self.train_sd = np.std(X, axis=0)

def normalize(self, X):
	train_scaled = (X - self.train_mean) / self.train_sd
	return train_scaled
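
The important detail is that the mean and standard deviation come from the training data only and are then reused for the test data. The sketch below shows this on randomly generated data; the shapes, seed and distribution parameters are arbitrary choices for the demonstration.

```python
import numpy as np

# Synthetic stand-in for a train/test split with two features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 1.5], scale=[0.8, 0.3], size=(70, 2))
X_test = rng.normal(loc=[5.0, 1.5], scale=[0.8, 0.3], size=(30, 2))

# Statistics are estimated on the training data only.
train_mean = X_train.mean(axis=0)
train_sd = X_train.std(axis=0)

# The same statistics scale both sets.
train_scaled = (X_train - train_mean) / train_sd
test_scaled = (X_test - train_mean) / train_sd

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

The scaled training data has mean 0 and standard deviation 1 per feature by construction. The scaled test data is only approximately standardized, which is expected.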

predict():

We will pass the test data into the predict() function. First scale the data using normalize() function, which uses the \( \mu \) and \(\sigma\) calculated during training.

Next, go through each row and calculate the likelihood by looping through each feature. Remember, this is very inefficient code since it's not vectorized. In our R code we will see a much faster version.

Python does not have a built-in dnorm function to calculate the density of a Normal Distribution, hence we will write our own dnorm() function.

Finally, we compare the two output and predict the class based on the larger value.

def predict(self, X):
	test_scaled = self.normalize(X)

	n_rows = test_scaled.shape[0]  # avoid shadowing the built-in len()

	prediction = np.zeros([n_rows])

	for row in range(n_rows):

		log_sum_class1 = 0
		log_sum_class2 = 0

		for col in range(test_scaled.shape[1]):
			log_sum_class1 += math.log(self.dnorm(test_scaled[row, col], self.class1_mean[col], self.class1_sd[col]))
			log_sum_class2 += math.log(self.dnorm(test_scaled[row, col], self.class2_mean[col], self.class2_sd[col]))

		log_sum_class1 += math.log(self.class1_prior)
		log_sum_class2 += math.log(self.class2_prior)

		if log_sum_class1 < log_sum_class2:
			prediction[row] = 1

	return prediction

Here is the dnorm() function.

def dnorm(self, x, mu, sd):
	return 1 / (np.sqrt(2 * np.pi) * sd) * np.e ** (-np.power((x - mu) / sd, 2) / 2)
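
For comparison, the same computation can be written without Python loops by working with the log-density directly and letting NumPy broadcast over samples and classes. This is only a sketch of one possible vectorized version: the function names, the array layout (one row per class in means and sds) and the sample values are assumptions, not part of the class above.

```python
import numpy as np


def log_gaussian(x, mu, sd):
    # Log of the Gaussian density; avoids computing exp() and then log().
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2


def predict_vectorized(X_scaled, means, sds, priors):
    # X_scaled: (n_samples, n_features)
    # means, sds: (n_classes, n_features); priors: (n_classes,)
    # Broadcasting gives an array of shape (n_samples, n_classes, n_features).
    log_lik = log_gaussian(X_scaled[:, None, :], means[None, :, :], sds[None, :, :])
    # Sum the log-likelihoods over the features and add the log prior.
    scores = log_lik.sum(axis=2) + np.log(priors)
    # Pick the class with the highest score for every sample.
    return np.argmax(scores, axis=1)


# Hypothetical per-class statistics for two features.
means = np.array([[-1.0, -1.0],
                  [1.0, 1.0]])
sds = np.array([[1.0, 1.0],
                [1.0, 1.0]])
priors = np.array([0.5, 0.5])

X_scaled = np.array([[-0.9, -1.2],
                     [1.1, 0.7],
                     [-2.0, 0.1]])

pred = predict_vectorized(X_scaled, means, sds, priors)
print(pred)
```

On a tie, np.argmax returns the first class, which matches the loop version where class 1 is predicted only if its score is strictly larger.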

accuracy():

The accuracy() function is very easy. Here is the code:

def accuracy(self, y, prediction):
	accuracy = (prediction == y).mean()
	return accuracy * 100

Output:

[OUTPUT] Average Train Accuracy 93.9857142857143%
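[OUTPUT] Average Test Accuracy 93.9857142857143%

Note that both lines show the same number because the second message was also formatted with np.mean(train_accuracy); the value on the second line is therefore the train accuracy again, not a separately measured test accuracy.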

Full Python Code:

import numpy as np
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from Logging.Logging import output_log
import math


class NaiveBayes:
    def __init__(self):
        self.train_mean = None
        self.train_sd = None
        self.class1_mean = None
        self.class1_sd = None
        self.class2_mean = None
        self.class2_sd = None

    def dnorm(self, x, mu, sd):
        return 1 / (np.sqrt(2 * np.pi) * sd) * np.e ** (-np.power((x - mu) / sd, 2) / 2)

    def calculate_mean_sd(self, X):
        self.train_mean = np.mean(X, axis=0)
        self.train_sd = np.std(X, axis=0)

    def normalize(self, X):
        train_scaled = (X - self.train_mean) / self.train_sd
        return train_scaled

    def fit(self, X, y):
        self.calculate_mean_sd(X)
        train_scaled = self.normalize(X)

        X_class1 = train_scaled[np.where(y == 0)]
        X_class2 = train_scaled[np.where(y == 1)]

        self.class1_mean = np.mean(X_class1, axis=0)
        self.class1_sd = np.std(X_class1, axis=0)

        self.class2_mean = np.mean(X_class2, axis=0)
        self.class2_sd = np.std(X_class2, axis=0)

        self.class1_prior = X_class1.shape[0] / X.shape[0]
        self.class2_prior = X_class2.shape[0] / X.shape[0]

    def predict(self, X):
        test_scaled = self.normalize(X)

        n_rows = test_scaled.shape[0]  # avoid shadowing the built-in len()

        prediction = np.zeros([n_rows])

        for row in range(n_rows):

            log_sum_class1 = 0
            log_sum_class2 = 0

            for col in range(test_scaled.shape[1]):
                log_sum_class1 += math.log(self.dnorm(test_scaled[row, col], self.class1_mean[col], self.class1_sd[col]))
                log_sum_class2 += math.log(self.dnorm(test_scaled[row, col], self.class2_mean[col], self.class2_sd[col]))

            log_sum_class1 += math.log(self.class1_prior)
            log_sum_class2 += math.log(self.class2_prior)

            if log_sum_class1 < log_sum_class2:
                prediction[row] = 1

        return prediction

    def accuracy(self, y, prediction):
        accuracy = (prediction == y).mean()
        return accuracy * 100


if __name__ == '__main__':
    iris = sns.load_dataset("iris")
    iris = iris.loc[iris["species"] != "setosa"]

    le = preprocessing.LabelEncoder()
    y = le.fit_transform(iris["species"])
    X = iris.drop(["species"], axis=1).values

    train_accuracy = np.zeros([100])
    test_accuracy = np.zeros([100])

    for loop in range(100):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=loop)

        model = NaiveBayes()
        model.fit(X_train, y_train)
        prediction = model.predict(X_train)
        train_accuracy[loop] = model.accuracy(y_train, prediction)
        prediction = model.predict(X_test)
        test_accuracy[loop] = model.accuracy(y_test, prediction)

    output_log("Average Train Accuracy {}%".format(np.mean(train_accuracy)))
    output_log("Average Test Accuracy {}%".format(np.mean(test_accuracy)))

Code Naive Bayes Classifier using R from scratch:

I will not walk through the full code here; the inline comments explain each step. Fundamentally it is the same as the Python version, with two main differences:

  • Using built-in dnorm() function.
  • Operations are vectorized

library(caret)

#Data Preparation
data=iris[which(iris$Species!='setosa'),]
data$Species=as.numeric(data$Species)
data$Species=data$Species-2
data=as.matrix(data)

y_index=ncol(data)

#Placeholder for test & train accuracy
trainData_prediction=rep(1,100)
tstData_prediction=rep(1,100)

# Execute 100 times and later average the accuracy
for(count in c(1:100))
{
  
  #Split the data in train & test set
  set.seed(count)
  split=createDataPartition(y=data[,y_index], p=0.7, list=FALSE)
  
  training_data=data[split,]
  test_data=data[-split,]
  
  training_x=training_data[,-y_index]
  training_y=training_data[,y_index]
  
  #Normalize Train Data
  tr_ori_mean <- apply(training_x,2, mean)  
  tr_ori_sd   <- apply(training_x,2, sd)    
  
  tr_offsets <- t(t(training_x) - tr_ori_mean)         
  tr_scaled_data  <- t(t(tr_offsets) / tr_ori_sd)
  
  #Get Positive class Index
  positive_idx = which(training_data[,y_index] == 1)
  
  
  positive_data = tr_scaled_data[positive_idx,]
  negative_data = tr_scaled_data[-positive_idx,]
  
  
  #Get Means and SD on Scaled Data
  pos_means=apply(positive_data,2,mean)
  pos_sd=apply(positive_data,2,sd)
  
  neg_means=apply(negative_data,2,mean)
  neg_sd=apply(negative_data,2,sd)
  
  test_x=test_data[,1:(y_index-1)]
  
  predict_func=function(test_x_row){
    
    target=0;
    
    #Used dnorm() function for normal distribution and calculate probability
    p_pos=sum(log(dnorm(test_x_row,pos_means,pos_sd)))+log(length(positive_idx)/length(training_y))
    p_neg=sum(log(dnorm(test_x_row,neg_means,neg_sd)))+log( 1 - (length(positive_idx)/length(training_y)))
    
    if(p_pos>p_neg){
      target=1
    }else{
      target=0
    }  
  }
  
  #Scale Test Data
  tst_offsets <- t(t(test_x) - tr_ori_mean)         
  tst_scaled_data  <- t(t(tst_offsets) / tr_ori_sd)
  
  #Predict for test data, get prediction for each row
  y_pred=apply(tst_scaled_data,1,predict_func)
  target=test_data[,y_index]
  
  tstData_prediction[count]=length(which((y_pred==target)==TRUE))/length(target)
  
  #Predict for train data ( optional, output not printed )
  y_pred_train=apply(tr_scaled_data,1,predict_func)
  
  trainData_prediction[count]=length(which((y_pred_train==training_y)==TRUE))/length(training_y)
  
}
print(paste("Average Train Data Accuracy:",mean(trainData_prediction)*100.0,sep = " "))
print(paste("Average Test Data Accuracy:",mean(tstData_prediction)*100.0,sep = " "))

Please find the full project here:

The post Introduction to Naive Bayes Classifier using R and Python appeared first on A Developer Diary.

Applying Gaussian Smoothing to an Image using Python from scratch


Smoothing (blurring) an image with a Gaussian filter/kernel is a very important tool in Computer Vision. Many algorithms apply it as a preprocessing step before actually processing the image. Today we will apply Gaussian smoothing to an image using Python from scratch, without relying on a ready-made smoothing function from a library like OpenCV.

High Level Steps:

There are two steps to this process:

  • Create a Gaussian Kernel/Filter
  • Perform Convolution and Average

Gaussian Kernel/Filter:

Create a function named gaussian_kernel(), which takes two main parameters: the size of the kernel and the standard deviation.

def gaussian_kernel(size, sigma=1, verbose=False):

    kernel_1D = np.linspace(-(size // 2), size // 2, size)
    for i in range(size):
        kernel_1D[i] = dnorm(kernel_1D[i], 0, sigma)
    kernel_2D = np.outer(kernel_1D.T, kernel_1D.T)

    kernel_2D *= 1.0 / kernel_2D.max()

    if verbose:
        plt.imshow(kernel_2D, interpolation='none',cmap='gray')
        plt.title("Kernel ( {}X{} )".format(size, size))
        plt.show()

    return kernel_2D

First, create a vector of equally spaced numbers using the size argument. When size = 5, kernel_1D will look like the following:

array([-2., -1.,  0.,  1.,  2.])

Next we call the dnorm() function, which returns the normal density for a mean of 0 and the given standard deviation. We will see the function definition later. With sigma = 1, the kernel_1D vector will look like:

array([0.05399097, 0.24197072, 0.39894228, 0.24197072, 0.05399097])

Then we will create the outer product and normalize to make sure the center value is always 1.
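The steps above can be checked on a small case. The following is a minimal sketch, assuming size = 5 and sigma = 1; it uses a vectorized call instead of the loop, and the names kernel_1d, kernel_2d, center and corner are only illustrative.

```python
import numpy as np


def dnorm(x, mu, sd):
    # Univariate normal density, same formula as the tutorial's dnorm()
    return 1 / (np.sqrt(2 * np.pi) * sd) * np.exp(-((x - mu) / sd) ** 2 / 2)


size, sigma = 5, 1

# Evaluate the density at -2, -1, 0, 1, 2 in one vectorized call
kernel_1d = dnorm(np.linspace(-(size // 2), size // 2, size), 0, sigma)

# The outer product turns the 1D profile into a 2D kernel
kernel_2d = np.outer(kernel_1d, kernel_1d)

# Scale so that the largest (center) value becomes 1
kernel_2d /= kernel_2d.max()

center = kernel_2d[size // 2, size // 2]
corner = kernel_2d[0, 0]
print(center, corner)
```

After scaling, the center is 1 and every other entry is a fraction of it; with sigma = 1 the corner value equals exp(-4).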

Kernel Output:

In order to set sigma automatically, we will use the following equation (this works for our purpose, where the filter size is between 3 and 21):

sigma=math.sqrt(kernel_size)

Here is the output of different kernel sizes.

Applying Gaussian Smoothing to an image using Python from scratch adeveloperdiary.com

As you can see, the sigma value was set automatically and worked nicely. This simple trick saves you the time of finding a suitable sigma for each kernel size.

dnorm()

def dnorm(x, mu, sd):
    return 1 / (np.sqrt(2 * np.pi) * sd) * np.e ** (-np.power((x - mu) / sd, 2) / 2)

Here is the dnorm() function. It simply calculates the density using the formula of the univariate normal distribution.
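For reference, the density being evaluated is the standard univariate normal formula, where \(\mu\) corresponds to the mu argument and \(\sigma\) to the sd argument:

\[
f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\bigg( -\frac{(x-\mu)^2}{2\sigma^2} \bigg)
\]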

Convolution and Average:

We will create the convolution function in a generic way so that we can use it for other operations. This is not the most efficient way of writing a convolution function, and you can always replace it with one provided by a library. However, the main objective here is to perform all the basic operations from scratch.

I am not going to go into detail on the convolution ( or cross-correlation ) operation, since there are many fantastic tutorials available already. Here we will focus only on the implementation.

Let’s look at the convolution() function part by part.

def convolution(image, kernel, average=False, verbose=False):

    if len(image.shape) == 3:
        print("Found 3 Channels : {}".format(image.shape))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        print("Converted to Gray Channel. Size : {}".format(image.shape))
    else:
        print("Image Shape : {}".format(image.shape))

    print("Kernel Shape : {}".format(kernel.shape))

    if verbose:
        plt.imshow(image, cmap='gray')
        plt.title("Image")
        plt.show()

The function takes the image and the kernel as required parameters, and we will also pass average as the 3rd argument. The average argument is used only for smoothing filters. Notice that we can pass any filter/kernel, hence this function is not coupled to or dependent on the previously written gaussian_kernel() function.

Since our convolution() function only works on a single-channel image, we will convert the image to grayscale in case it has 3 channels ( color image ). Then we plot the grayscale image using matplotlib.

image_row, image_col = image.shape
kernel_row, kernel_col = kernel.shape

output = np.zeros(image.shape)

pad_height = int((kernel_row - 1) / 2)
pad_width = int((kernel_col - 1) / 2)

padded_image = np.zeros((image_row + (2 * pad_height), image_col + (2 * pad_width)))

padded_image[pad_height:padded_image.shape[0] - pad_height, pad_width:padded_image.shape[1] - pad_width] = image

We want the output image to have the same dimensions as the input image. This is technically known as “same convolution”. In order to achieve it we need to pad the image. Here we will use zero padding; we will talk about other types of padding later in the tutorial. For “same convolution” we calculate the size of the padding on each side using the following formula, where k is the size of the kernel (assumed to be odd).

\[
\frac{(k-1)}{2}
\]

In the last two lines, we create an all-zero 2D numpy array and then copy the image into the proper location, so that the padding surrounds the image. In the image below we have applied a padding of 7, hence you can see the black border.

Applying Gaussian Smoothing to an image using Python from scratch adeveloperdiary.com
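The padding arithmetic can be verified in isolation. This is a small sketch, assuming an odd kernel size; zero_pad is a hypothetical helper name, not part of the tutorial's code. A 15x15 kernel gives the padding of 7 mentioned above.

```python
import numpy as np


def zero_pad(image, kernel_shape):
    # (k - 1) / 2 pixels of padding on each side, for odd k
    pad_height = (kernel_shape[0] - 1) // 2
    pad_width = (kernel_shape[1] - 1) // 2

    padded = np.zeros((image.shape[0] + 2 * pad_height,
                       image.shape[1] + 2 * pad_width))

    # Copy the image into the middle; the zero border stays around it
    padded[pad_height:pad_height + image.shape[0],
           pad_width:pad_width + image.shape[1]] = image
    return padded


image = np.ones((4, 6))
padded = zero_pad(image, (15, 15))
print(padded.shape)
```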

for row in range(image_row):
    for col in range(image_col):
        output[row, col] = np.sum(kernel * padded_image[row:row + kernel_row, col:col + kernel_col])

Now simply implement the convolution operation using two loops.

if average:
    output[row, col] /= kernel.shape[0] * kernel.shape[1]

In order to apply the smooth/blur effect we divide each output pixel by the total number of pixels in the kernel/filter. This is done inside the loop, and only if average is set to True.
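One thing worth knowing about this averaging step: because the kernel is scaled so its center is 1, dividing by the number of kernel pixels does not make the weights sum to 1, so a flat region comes out somewhat darker. The sketch below compares it with one common alternative (not what this tutorial does), which is to scale the kernel so that it sums to 1. All variable names are illustrative.

```python
import numpy as np

size = 5
sigma = np.sqrt(size)

# Unnormalized Gaussian profile; the constant factor cancels after scaling
axis = np.linspace(-(size // 2), size // 2, size)
kernel_1d = np.exp(-(axis / sigma) ** 2 / 2)
kernel_2d = np.outer(kernel_1d, kernel_1d)

kernel_max1 = kernel_2d / kernel_2d.max()  # tutorial's scaling: center = 1
kernel_sum1 = kernel_2d / kernel_2d.sum()  # alternative: weights sum to 1

# A flat patch of intensity 100, as seen by one output pixel
patch = np.full((size, size), 100.0)

tutorial_value = np.sum(kernel_max1 * patch) / (size * size)
alternative_value = np.sum(kernel_sum1 * patch)
print(tutorial_value, alternative_value)
```

With the sum-to-1 kernel the flat patch keeps its intensity, while the tutorial's approach yields a lower value. For display purposes this usually does not matter, since matplotlib rescales the gray levels when plotting.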

We are finally done with our simple convolution function. Here is the output image.

Applying Gaussian Smoothing to an image using Python from scratch adeveloperdiary.com

gaussian_blur():

So the gaussian_blur() function will call the gaussian_kernel() function first to create the kernel and then invoke convolution() function.

def gaussian_blur(image, kernel_size, verbose=False):
    kernel = gaussian_kernel(kernel_size, sigma=math.sqrt(kernel_size), verbose=verbose)
    return convolution(image, kernel, average=True, verbose=verbose)

main():

In the main function, we just need to call our gaussian_blur() function by passing the arguments.

if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True, help="Path to the image")
    args = vars(ap.parse_args())

    image = cv2.imread(args["image"])

    gaussian_blur(image, 9, verbose=True)

Conclusion:

As you may have noticed, once we use a larger filter/kernel a black border appears in the final output. This is because we used zero padding, and a pixel value of zero is rendered as black. You can implement two different strategies in order to avoid this.

  • Don’t use any padding. The dimensions of the output image will be smaller than the input, but there won’t be any dark border.
  • Instead of zero padding, pad the image by repeating its edge pixels.
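The second strategy can be sketched with NumPy's np.pad() and its 'edge' mode, which repeats the border values outward. The tiny 2x2 array below is only for illustration.

```python
import numpy as np

image = np.array([[1, 2],
                  [3, 4]])

# Pad one pixel on every side by repeating the nearest edge value
padded = np.pad(image, pad_width=1, mode='edge')
print(padded)
```

In convolution(), such a call could replace the np.zeros() allocation and the copy that follows it, using pad_height and pad_width as the padding amounts.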

Full Code:

import numpy as np
import cv2
import matplotlib.pyplot as plt


def convolution(image, kernel, average=False, verbose=False):
    if len(image.shape) == 3:
        print("Found 3 Channels : {}".format(image.shape))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        print("Converted to Gray Channel. Size : {}".format(image.shape))
    else:
        print("Image Shape : {}".format(image.shape))

    print("Kernel Shape : {}".format(kernel.shape))

    if verbose:
        plt.imshow(image, cmap='gray')
        plt.title("Image")
        plt.show()

    image_row, image_col = image.shape
    kernel_row, kernel_col = kernel.shape

    output = np.zeros(image.shape)

    pad_height = int((kernel_row - 1) / 2)
    pad_width = int((kernel_col - 1) / 2)

    padded_image = np.zeros((image_row + (2 * pad_height), image_col + (2 * pad_width)))

    padded_image[pad_height:padded_image.shape[0] - pad_height, pad_width:padded_image.shape[1] - pad_width] = image

    if verbose:
        plt.imshow(padded_image, cmap='gray')
        plt.title("Padded Image")
        plt.show()

    for row in range(image_row):
        for col in range(image_col):
            output[row, col] = np.sum(kernel * padded_image[row:row + kernel_row, col:col + kernel_col])
            if average:
                output[row, col] /= kernel.shape[0] * kernel.shape[1]

    print("Output Image size : {}".format(output.shape))

    if verbose:
        plt.imshow(output, cmap='gray')
        plt.title("Output Image using {}X{} Kernel".format(kernel_row, kernel_col))
        plt.show()

    return output

import numpy as np
import cv2
import argparse
import matplotlib.pyplot as plt
import math
from Computer_Vision.Gaussian_Smoothing.convolution import convolution


def dnorm(x, mu, sd):
    return 1 / (np.sqrt(2 * np.pi) * sd) * np.e ** (-np.power((x - mu) / sd, 2) / 2)


def gaussian_kernel(size, sigma=1, verbose=False):
    kernel_1D = np.linspace(-(size // 2), size // 2, size)
    for i in range(size):
        kernel_1D[i] = dnorm(kernel_1D[i], 0, sigma)
    kernel_2D = np.outer(kernel_1D.T, kernel_1D.T)

    kernel_2D *= 1.0 / kernel_2D.max()

    if verbose:
        plt.imshow(kernel_2D, interpolation='none', cmap='gray')
        plt.title("Kernel ( {}X{} )".format(size, size))
        plt.show()

    return kernel_2D


def gaussian_blur(image, kernel_size, verbose=False):
    kernel = gaussian_kernel(kernel_size, sigma=math.sqrt(kernel_size), verbose=verbose)
    return convolution(image, kernel, average=True, verbose=verbose)


if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True, help="Path to the image")
    args = vars(ap.parse_args())

    image = cv2.imread(args["image"])

    gaussian_blur(image, 5, verbose=True)

Project in Github:

Please find the full project here:

The post Applying Gaussian Smoothing to an Image using Python from scratch appeared first on A Developer Diary.

How to implement Sobel edge detection using Python from scratch


Sobel edge detection is one of the foundational building blocks of Computer Vision. Even when you start learning deep learning, you will find references to the Sobel filter. In this tutorial we will learn how to implement Sobel edge detection using Python from scratch.

We will reuse the Convolution and Gaussian Smoothing functions from the following blog post.

Applying Gaussian Smoothing to an Image using Python from scratch

The objective will be to find the edges in the below image:

How to implement Sobel edge detection using Python from scratch

What is an edge?

An edge is a place of rapid change in the image intensity function.

How to detect an edge?

In order to detect an edge we need to detect the discontinuities in the image, and we know that we can use the derivative to detect discontinuities.

How to implement Sobel edge detection using Python from scratch adeveloperdiary.com sobel

Image Credit: http://stanford.edu/

As you can see in the above picture, the edges correspond to the extrema of the derivative. Since images are discrete in nature, we can easily take the derivative of an image using a 2D derivative mask.

However, derivatives are also affected by noise, hence it’s advisable to smooth the image before taking the derivative. Then we can convolve the image with the mask to detect the edges. Again, I am not going into the math; we will focus only on the implementation details here.

Sobel Operator:

The Sobel operator is a specific type of 2D derivative mask which is efficient at detecting the edges in an image. We will use the following two masks:

How to implement Sobel edge detection using Python from scratch adeveloperdiary.com sobel

main:

Let’s look at the implementation now.

if __name__ == '__main__':
    filter = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True, help="Path to the image")
    args = vars(ap.parse_args())

    image = cv2.imread(args["image"])
    image = gaussian_blur(image, 9, verbose=True)
    sobel_edge_detection(image, filter, verbose=True)

We will create the vertical mask as a numpy array. The horizontal mask will be derived from the vertical mask. We pass the mask as an argument so that the sobel_edge_detection() function can be used with any mask. Next we apply smoothing using the gaussian_blur() function. Please refer to my tutorial on Gaussian Smoothing for more details on this function.

Finally call the sobel_edge_detection() function by passing the image and the vertical filter.

sobel_edge_detection():

def sobel_edge_detection(image, filter, verbose=False):
    new_image_x = convolution(image, filter, verbose=verbose)

    if verbose:
        plt.imshow(new_image_x, cmap='gray')
        plt.title("Horizontal Edge")
        plt.show()

We will first call the convolution() function using the vertical mask. The output of the derivative looks like this:

How to implement Sobel edge detection using Python from scratch adeveloperdiary.com

new_image_y = convolution(image, np.flip(filter.T, axis=0), verbose=verbose)

if verbose:
    plt.imshow(new_image_y, cmap='gray')
    plt.title("Vertical Edge")
    plt.show()

Then apply the convolution using the horizontal mask. We simply take the transpose of the mask and flip it along the horizontal axis ( axis=0 ). Here is the output:

How to implement Sobel edge detection using Python from scratch adeveloperdiary.com
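To see what the transpose-and-flip produces, here is a short sketch that prints the derived mask. The names vertical_mask and horizontal_mask are illustrative; the tutorial's code passes the first one around as filter.

```python
import numpy as np

vertical_mask = np.array([[-1, 0, 1],
                          [-2, 0, 2],
                          [-1, 0, 1]])

# Transpose, then reverse the order of the rows (axis=0)
horizontal_mask = np.flip(vertical_mask.T, axis=0)
print(horizontal_mask)
```

The result has the positive weights in the top row and the negative weights in the bottom row, so it responds to intensity changes from top to bottom rather than from left to right.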

In order to combine both the vertical and horizontal edges (derivatives) we can use the following equation:

\[
G = \sqrt{G_x^2 + G_y^2}
\]

gradient_magnitude = np.sqrt(np.square(new_image_x) + np.square(new_image_y))

gradient_magnitude *= 255.0 / gradient_magnitude.max()

if verbose:
    plt.imshow(gradient_magnitude, cmap='gray')
    plt.title("Gradient Magnitude")
    plt.show()

We will implement the same equation and then normalize the output to be between 0 and 255.
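As a quick numeric check of the equation and the normalization, the sketch below uses two tiny made-up derivative arrays (gx and gy are illustrative values, not output from a real image).

```python
import numpy as np

gx = np.array([[3.0, 0.0],
               [6.0, 0.0]])
gy = np.array([[4.0, 0.0],
               [8.0, 1.0]])

# G = sqrt(Gx^2 + Gy^2), computed element by element
magnitude = np.sqrt(np.square(gx) + np.square(gy))

# Rescale so that the strongest response becomes 255
magnitude *= 255.0 / magnitude.max()
print(magnitude)
```

The raw magnitudes are 5, 0, 10 and 1; after scaling, the largest one maps to 255 and the others keep their relative strength. Note that this scaling assumes at least one non-zero gradient, otherwise the division by the maximum would fail.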

Output:

Here is the final output.

How to implement Sobel edge detection using Python from scratch adeveloperdiary.com

Limitation in Sobel Edge Detection Technique:

  • Poor localization, which means you will see many edges where there should actually be only one edge.
  • Can miss edges which are neither vertical nor horizontal.

Canny Edge Detector:

Next we will implement the Canny edge detector, where we will overcome these issues.

Full Code:

import numpy as np
import cv2
import argparse
import matplotlib.pyplot as plt
from Computer_Vision.Sobel_Edge_Detection.convolution import convolution
from Computer_Vision.Sobel_Edge_Detection.gaussian_smoothing import gaussian_blur


def sobel_edge_detection(image, filter, verbose=False):
    new_image_x = convolution(image, filter, verbose=verbose)

    if verbose:
        plt.imshow(new_image_x, cmap='gray')
        plt.title("Horizontal Edge")
        plt.show()

    new_image_y = convolution(image, np.flip(filter.T, axis=0), verbose=verbose)

    if verbose:
        plt.imshow(new_image_y, cmap='gray')
        plt.title("Vertical Edge")
        plt.show()

    gradient_magnitude = np.sqrt(np.square(new_image_x) + np.square(new_image_y))

    gradient_magnitude *= 255.0 / gradient_magnitude.max()

    if verbose:
        plt.imshow(gradient_magnitude, cmap='gray')
        plt.title("Gradient Magnitude")
        plt.show()

    return gradient_magnitude


if __name__ == '__main__':
    filter = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True, help="Path to the image")
    args = vars(ap.parse_args())

    image = cv2.imread(args["image"])
    image = gaussian_blur(image, 9, verbose=True)
    sobel_edge_detection(image, filter, verbose=True)

Project in Github:

Please find the full project here:

The post How to implement Sobel edge detection using Python from scratch appeared first on A Developer Diary.

Implement Canny edge detector using Python from scratch


The Canny edge detector is one of the most widely used edge detectors in Computer Vision, hence understanding and implementing it is very important for any CV Engineer. In this tutorial we will implement the Canny Edge Detection Algorithm using Python from scratch. There are many incomplete implementations available on GitHub; however, we will understand every step and build the complete algorithm.

Canny Edge Detector Steps:

  • Smooth image with Gaussian filter.
  • Find magnitude and orientation of gradient.
  • Apply Non-max suppression.
  • Apply Hysteresis threshold.

Smooth image with Gaussian filter:

I have already posted a tutorial on this. Please refer to it here:

Applying Gaussian Smoothing to an Image using Python from scratch

Find magnitude and orientation of gradient:

I also have a tutorial on calculating the magnitude of the gradient. Please refer to it here:

How to implement Sobel edge detection using Python from scratch

Orientation of Gradient:

We will update our sobel_edge_detection() function to calculate the Orientation of Gradient.

def sobel_edge_detection(image, filter, convert_to_degree=False, verbose=False):
    ...

First we will add another argument to the function named convert_to_degree.

The equation for calculating the Orientation of Gradient is:

\[
\theta = \tan^{-1} \bigg( \frac{G_y}{G_x} \bigg)
\]

gradient_direction = np.arctan2(new_image_y, new_image_x)

if convert_to_degree:
    gradient_direction = np.rad2deg(gradient_direction)
    gradient_direction += 180

return gradient_magnitude, gradient_direction

Use NumPy’s arctan2() function to calculate the gradient direction. The returned value is in radians. If convert_to_degree is set to True, we convert it to degrees by calling the rad2deg() function. This returns an angle between -180 and 180, which we shift to the range 0 to 360 by adding 180 to gradient_direction.

Finally we will return both the magnitude and direction of gradient.
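The direction calculation can be tried on a few hand-picked gradient vectors. This is a small sketch with made-up values; gx and gy stand in for new_image_x and new_image_y.

```python
import numpy as np

# Four sample gradients: right, down the row axis, left, and diagonal
gx = np.array([1.0, 0.0, -1.0, 1.0])
gy = np.array([0.0, 1.0, 0.0, 1.0])

# arctan2 takes the y component first, then the x component
direction = np.rad2deg(np.arctan2(gy, gx))

# Shift from [-180, 180] to [0, 360]
shifted = direction + 180
print(direction, shifted)
```

Adding 180 turns every angle into the opposite direction (for example 0 becomes 180). That is harmless for the non-max suppression step that follows, because it always compares the two neighbors on both sides of the pixel along the same line.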

As you have seen already we can plot the Gradient Magnitude.

Implement Canny Edge Detection Algorithm using Python from scratch

Apply Non-max Suppression:

This is probably the most important step in the Canny Edge Detection Algorithm. We have two different inputs:

  1. Magnitude of the Gradient
  2. Direction of the Gradient

Our objective is to remove redundant/duplicate edges identified by Sobel Edge Detection ( refer to the image above ). We want just one line to show the edge rather than multiple lines for the same edge. This can be done with the Non-Max Suppression Algorithm.

A pixel can have a total of 4 directions for the gradient (shown below), since there are a total of 8 neighboring pixels.

Implement Canny Edge Detection Algorithm using Python from scratch

We want to make sure no adjacent pixels represent the same edge. In order to do that, we compare the gradient magnitude of a pixel with that of its neighboring pixels along the gradient direction, and keep the pixel only if its magnitude is the largest. Let’s understand that in more detail.

We will start with a black image where all pixel values are zero.

Now, consider the following example. We have a pixel (the middle one) whose gradient direction is 90 degrees. Hence we compare its gradient magnitude with both the pixel above ( 90 degrees ) and the pixel below ( 270 degrees ). In this example white represents a higher value ( 255 ) and black represents a lower value ( 0 ). We can see that the bottom pixel has a higher value than the one we are comparing with, hence we won’t take the value of the selected pixel. Since we started with a black image, the value of this pixel will remain 0.

Implement Canny Edge Detection Algorithm using Python from scratch

Here is another example. Say the gradient direction of a given pixel is 0 degrees, so we compare its gradient magnitude with that of the right ( 0 degrees ) and left ( 180 degrees ) pixels. In this example the gradient magnitude of the selected pixel is clearly higher than the other two, hence we set the output pixel to the gradient magnitude of the selected pixel.

Implement Canny Edge Detection Algorithm using Python from scratch

We will repeat this for all the pixels except the border pixels of the image. The output will look like the following:

Implement Canny Edge Detection Algorithm using Python from scratch

There is one more concept to go through before we jump into the code. In our example we have taken the direction as 90 ( or 0 ) degree, where we know we had to compare the top or bottom pixel. What about 57 Degree or 113 Degree?

In order to solve this problem, we will consider a range of degrees to select a neighbor. Look at the picture below. If the direction of the middle pixel is between \(\frac{15\pi}{8}\) & \(\frac{\pi}{8}\), or between \(\frac{7\pi}{8}\) & \(\frac{9\pi}{8}\), then we compare the middle pixel with its left and right neighbor pixels.

Implement Canny Edge Detection Algorithm using Python from scratch

Now let’s look at the code:

def non_max_suppression(gradient_magnitude, gradient_direction, verbose):

    image_row, image_col = gradient_magnitude.shape

    output = np.zeros(gradient_magnitude.shape)

    PI = 180

Our non_max_suppression() function takes the gradient magnitude and the gradient direction as its two main arguments, along with a verbose flag.

Our output image starts as a black image with the same width and height as the input image. Since the gradient direction is expressed in degrees, we set PI to 180.

for row in range(1, image_row - 1):
    for col in range(1, image_col - 1):
        direction = gradient_direction[row, col]

        if (0 <= direction < PI / 8) or (7 * PI / 8 <= direction < 9 * PI / 8) or (15 * PI / 8 <= direction <= 2 * PI):
            before_pixel = gradient_magnitude[row, col - 1]
            after_pixel = gradient_magnitude[row, col + 1]

        elif (PI / 8 <= direction < 3 * PI / 8) or (9 * PI / 8 <= direction < 11 * PI / 8):
            before_pixel = gradient_magnitude[row + 1, col - 1]
            after_pixel = gradient_magnitude[row - 1, col + 1]

        elif (3 * PI / 8 <= direction < 5 * PI / 8) or (11 * PI / 8 <= direction < 13 * PI / 8):
            before_pixel = gradient_magnitude[row - 1, col]
            after_pixel = gradient_magnitude[row + 1, col]

        else:
            before_pixel = gradient_magnitude[row - 1, col - 1]
            after_pixel = gradient_magnitude[row + 1, col + 1]

        if gradient_magnitude[row, col] >= before_pixel and gradient_magnitude[row, col] >= after_pixel:
            output[row, col] = gradient_magnitude[row, col]

This loop is the main part of the algorithm. We loop through all the pixels of the gradient direction ( except the border pixels ). Based on the value of the gradient direction, we store the gradient magnitude of the two neighboring pixels. At the end we check whether the selected/middle pixel has the highest gradient magnitude. If not, we continue with the loop; otherwise we update the output image at the given row and col with the value of the gradient magnitude.
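The neighbor selection can also be written more compactly. The sketch below is one possible reformulation, not the tutorial's code: since opposite directions share the same pair of neighbors, it folds the angle into the range 0 to 180 first. It assumes the left/right range also covers 7π/8 to 9π/8, as described in the text, and neighbor_offsets is a hypothetical helper name.

```python
def neighbor_offsets(direction_deg):
    """Return the (row, col) offsets of the two neighbors to compare."""
    # Opposite directions lie on the same line, so fold into [0, 180)
    angle = direction_deg % 180

    if angle < 22.5 or angle >= 157.5:
        return (0, -1), (0, 1)      # left and right
    if angle < 67.5:
        return (1, -1), (-1, 1)     # same diagonal as the second branch above
    if angle < 112.5:
        return (-1, 0), (1, 0)      # above and below
    return (-1, -1), (1, 1)         # the remaining diagonal


for degrees in (0, 57, 90, 113, 180):
    print(degrees, neighbor_offsets(degrees))
```

This answers the earlier question about 57 and 113 degrees: 57 falls into the first diagonal range and 113 into the other one. Inside the loop, the two offsets would be added to row and col to look up before_pixel and after_pixel.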

Apply Hysteresis threshold:

If you notice, the output after non-max suppression has a few edges in bright white, however many of them are between gray and dark gray. Again, our objective is to produce clear edges ( all edge pixels will be 255 ). We will achieve this using the Hysteresis Threshold.

We will break the concept into two parts:

  • Apply thresholding
  • Apply hysteresis

Apply thresholding:

The main idea of thresholding, as the name suggests, is to take all the edges and classify them as either weak ( some low number, say 50 ) or strong ( white – 255 ). It will be easier to understand when you look at the output:

Implement Canny Edge Detection Algorithm using Python from scratch

def threshold(image, low, high, weak, verbose=False):
    info_log("threshold()")

    output = np.zeros(image.shape)

    strong = 255

    strong_row, strong_col = np.where(image >= high)
    weak_row, weak_col = np.where((image <= high) & (image >= low))

    output[strong_row, strong_col] = strong
    output[weak_row, weak_col] = weak

    if verbose:
        plt.imshow(output, cmap='gray')
        plt.title("threshold")
        plt.show()

    return output

In our threshold() function, if the value of a pixel is higher than the high value, we set it to 255. We assume these are proper edges. Next, if a pixel is between the low and high values, we set it to the weak value ( passed as an argument ). The remaining pixels will all be 0.

The function call will look like below:

weak = 100

new_image = threshold(new_image, 5, 20, weak=weak, verbose=args["verbose"])
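To make the three outcomes concrete, here is a minimal sketch of the same classification on a one-row array, using boolean masks instead of np.where(). The name threshold_sketch and the sample magnitudes are made up for illustration; the low, high and weak values match the call above.

```python
import numpy as np


def threshold_sketch(image, low, high, weak, strong=255):
    output = np.zeros(image.shape)
    # Strong edges first, then weak ones, in the same order as threshold()
    output[image >= high] = strong
    output[(image >= low) & (image <= high)] = weak
    return output


magnitudes = np.array([[3.0, 10.0, 25.0]])
result = threshold_sketch(magnitudes, low=5, high=20, weak=100)
print(result)
```

The value below low is discarded ( 0 ), the one between low and high becomes weak ( 100 ), and the one above high becomes strong ( 255 ).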

Apply hysteresis:

So we already have the confirmed edges as white pixels ( 255 ) and the other candidate pixels at some weak value ( say 50 ). The objective of the hysteresis function is to identify the weak pixels which can be edges and discard the rest.

The obvious question is how to determine which pixels are part of real edges. We want to find out whether a selected pixel is connected to the already defined edge pixels; if so, we can consider this pixel to be part of an edge as well. The simple solution is to check whether any of a given pixel’s neighbors ( as we have seen earlier, there are 8 in total ) has a value of 255. If yes, change the value of the pixel to 255, otherwise discard the pixel by setting its value to 0.

def hysteresis(image, weak):
    image_row, image_col = image.shape

    top_to_bottom = image.copy()

    # Stop one pixel short of the border so that row + 1 and col + 1 stay in range
    for row in range(1, image_row - 1):
        for col in range(1, image_col - 1):
            if top_to_bottom[row, col] == weak:
                # The 3x3 window covers the 8 neighbors; the center is weak,
                # so it cannot match 255 itself
                if (top_to_bottom[row - 1:row + 2, col - 1:col + 2] == 255).any():
                    top_to_bottom[row, col] = 255
                else:
                    top_to_bottom[row, col] = 0

We loop through each pixel in the image and, if the pixel is weak ( we have to do this only for weak pixels ), check whether any neighboring pixel has the value 255. If so, we set the pixel to 255; if not, we set it to 0.

Many of the tutorials available online implement hysteresis only partially. The above code detects connected edges only if the weak pixels come after the strong pixels in scan order. Let’s look at an example.

The thresholding output has a circular edge in the hat, where the middle part contains strong pixels and the left and right sides have weak pixels. As per the hysteresis algorithm the entire edge should be selected, since it’s connected and continuous.

However, if we implement just the algorithm we have learned so far, the left part of the edge will be detected but not the right part ( 2nd image ). This is because we are scanning from top-left to bottom-right. When we arrive at the pixel at the top right corner of the right weak edge, there are no neighboring pixels with the value 255, hence it is set to 0. However, that’s not the case when we arrive at the first pixel of the left weak edge ( blue arrow ).

Implement Canny Edge Detection Algorithm using Python from scratch

In order to fix the problem, we also need to scan the image from the bottom-right to the top-left corner, which helps to detect the right part of the edge ( 3rd image above ). We will do this a total of 4 times, once from each corner.

bottom_to_top = image.copy()

# Start one pixel inside the border so that row + 1 and col + 1 stay in range
for row in range(image_row - 2, 0, -1):
    for col in range(image_col - 2, 0, -1):
        if bottom_to_top[row, col] == weak:
            # The 3x3 window covers the 8 neighbors of the weak center pixel
            if (bottom_to_top[row - 1:row + 2, col - 1:col + 2] == 255).any():
                bottom_to_top[row, col] = 255
            else:
                bottom_to_top[row, col] = 0

right_to_left = image.copy()

for row in range(1, image_row - 1):  # stay off the border so row + 1 is valid
    for col in range(image_col - 2, 0, -1):  # stay off the border so col + 1 is valid
        if right_to_left[row, col] == weak:
            if right_to_left[row, col + 1] == 255 or right_to_left[row, col - 1] == 255 or right_to_left[row - 1, col] == 255 or right_to_left[
                row + 1, col] == 255 or right_to_left[
                row - 1, col - 1] == 255 or right_to_left[row + 1, col - 1] == 255 or right_to_left[row - 1, col + 1] == 255 or right_to_left[
                row + 1, col + 1] == 255:
                right_to_left[row, col] = 255
            else:
                right_to_left[row, col] = 0

left_to_right = image.copy()

for row in range(image_row - 2, 0, -1):  # stay off the border so row + 1 is valid
    for col in range(1, image_col - 1):  # stay off the border so col + 1 is valid
        if left_to_right[row, col] == weak:
            if left_to_right[row, col + 1] == 255 or left_to_right[row, col - 1] == 255 or left_to_right[row - 1, col] == 255 or left_to_right[
                row + 1, col] == 255 or left_to_right[
                row - 1, col - 1] == 255 or left_to_right[row + 1, col - 1] == 255 or left_to_right[row - 1, col + 1] == 255 or left_to_right[
                row + 1, col + 1] == 255:
                left_to_right[row, col] = 255
            else:
                left_to_right[row, col] = 0

Sum all the pixels to create our final image. The white pixels will add up, hence to make sure there is no pixel value greater than 255, we threshold them to 255. Return the final_image.

final_image = top_to_bottom + bottom_to_top + right_to_left + left_to_right

final_image[final_image > 255] = 255

return final_image
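The four scans above differ only in the direction of the row and column ranges, so they can be folded into one helper. Here is a minimal sketch of that idea; the names hysteresis_pass and hysteresis_all_corners are my own and not part of the project code, and the ranges stop one pixel short of the border so the neighbor lookups stay in bounds.

```python
import numpy as np

STRONG = 255


def hysteresis_pass(image, weak, rows, cols):
    """One scan in the order given by rows/cols; weak pixels become 255 or 0."""
    out = image.copy()
    for row in rows:
        for col in cols:
            if out[row, col] != weak:
                continue
            # 3x3 neighbourhood; the centre is weak, so it can never match 255 itself.
            window = out[row - 1:row + 2, col - 1:col + 2]
            out[row, col] = STRONG if (window == STRONG).any() else 0
    return out


def hysteresis_all_corners(image, weak):
    """Run the scan from all four corners and merge the results."""
    n_rows, n_cols = image.shape
    down, up = range(1, n_rows - 1), range(n_rows - 2, 0, -1)
    right, left = range(1, n_cols - 1), range(n_cols - 2, 0, -1)
    passes = [hysteresis_pass(image, weak, r, c) for r in (down, up) for c in (right, left)]
    # Strong pixels add up across passes, so cap the sum at 255.
    return np.clip(sum(passes), 0, STRONG)


# Tiny example: a strong pixel with two weak pixels on each side.
img = np.zeros((5, 7))
img[2, 1:6] = [50, 50, 255, 50, 50]
print(hysteresis_all_corners(img, weak=50)[2])
```

A single left-to-right scan drops the leftmost weak pixel in this example, while the merged result keeps the whole edge.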

The output will look like:

Implement Canny Edge Detection Algorithm using Python from scratch

Full Code:

import numpy as np
import cv2
import argparse

from Computer_Vision.Canny_Edge_Detection.sobel import sobel_edge_detection
from Computer_Vision.Canny_Edge_Detection.gaussian_smoothing import gaussian_blur

import matplotlib.pyplot as plt


def non_max_suppression(gradient_magnitude, gradient_direction, verbose):
    image_row, image_col = gradient_magnitude.shape

    output = np.zeros(gradient_magnitude.shape)

    PI = 180

    for row in range(1, image_row - 1):
        for col in range(1, image_col - 1):
            direction = gradient_direction[row, col]

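            # Horizontal gradient: angles near 0, 180 and 360 degrees all compare the left/right neighbours
            if (0 <= direction < PI / 8) or (7 * PI / 8 <= direction < 9 * PI / 8) or (15 * PI / 8 <= direction <= 2 * PI):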
                before_pixel = gradient_magnitude[row, col - 1]
                after_pixel = gradient_magnitude[row, col + 1]

            elif (PI / 8 <= direction < 3 * PI / 8) or (9 * PI / 8 <= direction < 11 * PI / 8):
                before_pixel = gradient_magnitude[row + 1, col - 1]
                after_pixel = gradient_magnitude[row - 1, col + 1]

            elif (3 * PI / 8 <= direction < 5 * PI / 8) or (11 * PI / 8 <= direction < 13 * PI / 8):
                before_pixel = gradient_magnitude[row - 1, col]
                after_pixel = gradient_magnitude[row + 1, col]

            else:
                before_pixel = gradient_magnitude[row - 1, col - 1]
                after_pixel = gradient_magnitude[row + 1, col + 1]

            if gradient_magnitude[row, col] >= before_pixel and gradient_magnitude[row, col] >= after_pixel:
                output[row, col] = gradient_magnitude[row, col]

    if verbose:
        plt.imshow(output, cmap='gray')
        plt.title("Non Max Suppression")
        plt.show()

    return output


def threshold(image, low, high, weak, verbose=False):
    output = np.zeros(image.shape)

    strong = 255

    strong_row, strong_col = np.where(image >= high)
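    # Use a strict upper bound so a pixel equal to `high` stays strong instead of being overwritten as weak
    weak_row, weak_col = np.where((image < high) & (image >= low))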

    output[strong_row, strong_col] = strong
    output[weak_row, weak_col] = weak

    if verbose:
        plt.imshow(output, cmap='gray')
        plt.title("threshold")
        plt.show()

    return output


def hysteresis(image, weak):
    image_row, image_col = image.shape

    top_to_bottom = image.copy()

    for row in range(1, image_row - 1):  # stay off the border so row + 1 is valid
        for col in range(1, image_col - 1):  # stay off the border so col + 1 is valid
            if top_to_bottom[row, col] == weak:
                if top_to_bottom[row, col + 1] == 255 or top_to_bottom[row, col - 1] == 255 or top_to_bottom[row - 1, col] == 255 or top_to_bottom[
                    row + 1, col] == 255 or top_to_bottom[
                    row - 1, col - 1] == 255 or top_to_bottom[row + 1, col - 1] == 255 or top_to_bottom[row - 1, col + 1] == 255 or top_to_bottom[
                    row + 1, col + 1] == 255:
                    top_to_bottom[row, col] = 255
                else:
                    top_to_bottom[row, col] = 0

    bottom_to_top = image.copy()

    for row in range(image_row - 2, 0, -1):  # stay off the border so row + 1 is valid
        for col in range(image_col - 2, 0, -1):  # stay off the border so col + 1 is valid
            if bottom_to_top[row, col] == weak:
                if bottom_to_top[row, col + 1] == 255 or bottom_to_top[row, col - 1] == 255 or bottom_to_top[row - 1, col] == 255 or bottom_to_top[
                    row + 1, col] == 255 or bottom_to_top[
                    row - 1, col - 1] == 255 or bottom_to_top[row + 1, col - 1] == 255 or bottom_to_top[row - 1, col + 1] == 255 or bottom_to_top[
                    row + 1, col + 1] == 255:
                    bottom_to_top[row, col] = 255
                else:
                    bottom_to_top[row, col] = 0

    right_to_left = image.copy()

    for row in range(1, image_row - 1):  # stay off the border so row + 1 is valid
        for col in range(image_col - 2, 0, -1):  # stay off the border so col + 1 is valid
            if right_to_left[row, col] == weak:
                if right_to_left[row, col + 1] == 255 or right_to_left[row, col - 1] == 255 or right_to_left[row - 1, col] == 255 or right_to_left[
                    row + 1, col] == 255 or right_to_left[
                    row - 1, col - 1] == 255 or right_to_left[row + 1, col - 1] == 255 or right_to_left[row - 1, col + 1] == 255 or right_to_left[
                    row + 1, col + 1] == 255:
                    right_to_left[row, col] = 255
                else:
                    right_to_left[row, col] = 0

    left_to_right = image.copy()

    for row in range(image_row - 2, 0, -1):  # stay off the border so row + 1 is valid
        for col in range(1, image_col - 1):  # stay off the border so col + 1 is valid
            if left_to_right[row, col] == weak:
                if left_to_right[row, col + 1] == 255 or left_to_right[row, col - 1] == 255 or left_to_right[row - 1, col] == 255 or left_to_right[
                    row + 1, col] == 255 or left_to_right[
                    row - 1, col - 1] == 255 or left_to_right[row + 1, col - 1] == 255 or left_to_right[row - 1, col + 1] == 255 or left_to_right[
                    row + 1, col + 1] == 255:
                    left_to_right[row, col] = 255
                else:
                    left_to_right[row, col] = 0

    final_image = top_to_bottom + bottom_to_top + right_to_left + left_to_right

    final_image[final_image > 255] = 255

    return final_image


if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True, help="Path to the image")
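    # type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly
    ap.add_argument("-v", "--verbose", type=lambda s: s.lower() in ("true", "1", "yes"), default=False, help="Pass True to show the intermediate plots")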
    args = vars(ap.parse_args())

    image = cv2.imread(args["image"])

    blurred_image = gaussian_blur(image, kernel_size=9, verbose=False)

    edge_filter = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])

    gradient_magnitude, gradient_direction = sobel_edge_detection(blurred_image, edge_filter, convert_to_degree=True, verbose=args["verbose"])

    new_image = non_max_suppression(gradient_magnitude, gradient_direction, verbose=args["verbose"])

    weak = 50

    new_image = threshold(new_image, 5, 20, weak=weak, verbose=args["verbose"])

    new_image = hysteresis(new_image, weak)

    plt.imshow(new_image, cmap='gray')
    plt.title("Canny Edge Detector")
    plt.show()

Project in Github:

Please find the full project here:

The post Implement Canny edge detector using Python from scratch appeared first on A Developer Diary.

How to prepare Imagenet dataset for Image Classification


Imagenet is one of the most widely used large scale datasets for benchmarking Image Classification algorithms. In case you are starting with Deep Learning and want to test your model against the Imagenet dataset, or are just trying to implement existing publications, you can download the dataset from the Imagenet website. The downloaded dataset is not human readable, hence in this How to prepare Imagenet dataset for Image Classification tutorial I will explain how you can use this dataset.

How to download imagenet dataset?

Option 1:

You need to have an .edu email address to download directly from the Imagenet website. Open the website below and log in using your .edu email id. You need to register in case you don’t have a profile created.
http://image-net.org/download-images
The dataset has not changed since 2012, so I recommend downloading from the 2015-2017 links. Click on any of the following links:
How to prepare Imagenet dataset for Image Classification adeveloperdiary.com

Then download the Development Kit ( it has the labels ) and the CLS-LOC dataset which is 155GB.

How to prepare Imagenet dataset for Image Classification adeveloperdiary.com

I strongly recommend using a download manager where you can pause or resume the downloads, since it’s going to take a while depending on your internet connection speed.

Option 2:

There might be other ways to download the dataset without having an .edu email address. My suggestion is to search on Google for those options.

Downloaded folder structure:

Once the zip files are downloaded, extract them. If you are using an SSD, the extraction will be much faster than on an HDD.

The train/val/test data will be in the following folders:

  • /ILSVRC2015/Data/CLS-LOC/train
  • /ILSVRC2015/Data/CLS-LOC/test
  • /ILSVRC2015/Data/CLS-LOC/val

There will be 1000 folders, and only inside the train folder. However, the folder names are not the image labels; for those we need to look into the devkit.

The /devkit/data/map_clsloc.txt has the training labels and /devkit/data/ILSVRC2015_clsloc_validation_ground_truth.txt has the validation labels.

Update the train folders:

What we want to do is rename the folders inside the train folder to their respective class labels. If you view the map_clsloc.txt file, its structure looks like the following:

How to prepare Imagenet dataset for Image Classification adeveloperdiary.com

Each row has three columns separated by a space: the first column is the name of the train folder, the 2nd one is the class id and the 3rd one is the mapped label.

We want to make sure the folder names inside the train folder match the label names of their images. We can easily do that using a simple script.

Read the map_clsloc.txt file and create two Python dictionaries. The class_dir_map will have the current folder name as key and the respective class label as value. The id_class_map will have the class id as key and the respective class label as value. We will need the id_class_map for the validation dataset.

import os

MAP_CLASS_LOC = "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/devkit/data/map_clsloc.txt"

class_dir_map = {}
id_class_map = {}

with open(MAP_CLASS_LOC, "rb") as map_class_file:
    rows = map_class_file.readlines()
    for row in rows:
        row = row.strip()
        arr = row.decode("utf-8").split(" ")
        class_dir_map[arr[0]] = arr[2]
        id_class_map[int(arr[1])] = arr[2]

TRAIN_DATA_FOLDER = "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/Data/CLS-LOC_100/train/"

Issues with imagenet dataset:

There are a couple of issues with the current dataset, let’s fix them before we proceed further.

Issue 1: crane

There are two identical labels named “crane”. One is a bird and the other is a mechanical crane. We will rename the bird to crane_bird, so that its row in the map file reads:

n02012849 429 crane_bird

Issue 2: maillot

maillot has the same problem. We will rename the two classes to maillot_1 and maillot_2.

n03710637 782 maillot_1
n03710721 977 maillot_2
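If you want to check for such clashes yourself, a small helper can list every label that is mapped to more than one folder. This is only a sketch: find_duplicate_labels is a name I made up, and the sample rows use placeholder ids except for the three ids quoted above.

```python
from collections import Counter


def find_duplicate_labels(rows):
    """Return the labels (3rd column) that appear in more than one row."""
    labels = [row.split()[2] for row in rows if row.strip()]
    return sorted(label for label, count in Counter(labels).items() if count > 1)


rows = [
    "n02012849 429 crane",          # the bird, before renaming
    "n00000001 1 crane",            # placeholder id standing in for the mechanical crane
    "n03710637 782 maillot",
    "n03710721 977 maillot",
    "n00000002 2 example_label",    # placeholder row without a clash
]
print(find_duplicate_labels(rows))
```

Running the same helper on the lines of map_clsloc.txt after the renames should return an empty list.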

Now we will update the subfolders under train as per their label name.

for key in class_dir_map.keys():
    if os.path.isdir(TRAIN_DATA_FOLDER + key):
        os.rename(TRAIN_DATA_FOLDER + key, TRAIN_DATA_FOLDER + class_dir_map[key])

Train/Test Split:

Since the labels for the test dataset have not been given, we will take 50000 images from the train data ( 50 from each label ) and create the test dataset.

Create two lists, one containing the path of each image and the other their class labels. Then use sklearn.model_selection.train_test_split() to create the test dataset. Make sure to set stratify=labels, so that the train_test_split() function distributes the test images evenly across the labels.

import glob

files = glob.glob(TRAIN_DATA_FOLDER + "**/*.JPEG")
paths = []
labels = []

for file in files:
    label_str = file.split("/")[-2]
    paths.append(file)
    labels.append(label_str)

from sklearn.model_selection import train_test_split

(trainPaths, testPaths, trainLabels, testLabels) = train_test_split(paths, labels, test_size=50000, stratify=labels, random_state=42)

All that is left now is to move the identified test images to a test folder.

TEST_DATA_FOLDER = "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/Data/CLS-LOC/test/"

for testPath, testLabel in zip(testPaths, testLabels):

    if not os.path.isdir(TEST_DATA_FOLDER + testLabel):
        os.mkdir(TEST_DATA_FOLDER + testLabel)

    os.rename(testPath, TEST_DATA_FOLDER + testLabel + "/" + testPath.split("/")[-1])

Here is how the train/test folders look:

How to prepare Imagenet dataset for Image Classification adeveloperdiary.com

Update the validation folders:

The val folder has only a flat list of images. The file names look like the following:

How to prepare Imagenet dataset for Image Classification adeveloperdiary.com

The last part of the filename is the sequence id of each file. It starts from 00000001.

The ILSVRC2015_clsloc_validation_ground_truth.txt file just has the list of class ids for the validation set, one per line, in the order of the sequence ids. So the file named ILSVRC2012_val_00000001.JPEG has the label id given on the first line, which is 490. The label name of the label id 490 is sea_snake ( you can verify by opening the file itself ).

How to prepare Imagenet dataset for Image Classification adeveloperdiary.com

Now, some of the validation images are very difficult to classify, hence they are not used. These files are listed in another file named /devkit/data/ILSVRC2015_clsloc_validation_blacklist.txt. The structure of this file is the same as the validation ground truth ( one number per line ), except that here each number is the sequence id of an image to skip. We need to make sure not to include these in our validation set.

First we need to read the blacklist file and store the ids in a list named black_list.

BLACK_LIST = "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/devkit/data/ILSVRC2015_clsloc_validation_blacklist.txt"
VAL_CLASS_PATH = "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/devkit/data/ILSVRC2015_clsloc_validation_ground_truth.txt"

VAL_DATA_PATH = "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/Data/CLS-LOC/val/"

VAL_ORI_DATA_PATH = "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/Data/CLS-LOC/val_original/*.JPEG"

black_list = []

with open(BLACK_LIST) as b_file:
    rows = b_file.readlines()
    for row in rows:
        row = int(row.strip())
        black_list.append(row)

Next, read the validation ground truth and do the same, storing the class ids in val_class.

val_class = []

with open(VAL_CLASS_PATH) as val_file:
    rows = val_file.readlines()
    for row in rows:
        row = int(row.strip())
        val_class.append(row)

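Finally, loop through each validation image file. Note that VAL_ORI_DATA_PATH above points to a val_original folder, so the original val images are expected to be there.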

  • Parse the sequence id.
  • Make sure it’s not in the black list.
  • Find the class id and class label name.
  • Create a folder with the label name in the val directory.
  • Move the validation image inside that folder.

val_files = glob.glob(VAL_ORI_DATA_PATH)

for file in val_files:
    seq_num = int(file.split("/")[-1].split("_")[-1].split(".")[0])
    if seq_num not in black_list:
        class_id = val_class[seq_num - 1]
        class_name = id_class_map[class_id]

        if not os.path.isdir(VAL_DATA_PATH + class_name):
            os.mkdir(VAL_DATA_PATH + class_name)

        os.rename(file, VAL_DATA_PATH + class_name + "/" + file.split("/")[-1])
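Before moving 50000 files it can be worth checking the mapping on a single file name. The sketch below pulls the lookup logic from the loop above into a function that only computes the destination path and does not touch the disk. The name val_destination and the /data/... paths are assumptions for illustration.

```python
from pathlib import Path


def val_destination(file_path, val_class, id_class_map, black_list, val_root):
    """Return where a validation image should go, or None if it is blacklisted."""
    file_path = Path(file_path)
    # ILSVRC2012_val_00000001.JPEG -> 1
    seq_num = int(file_path.stem.split("_")[-1])
    if seq_num in black_list:
        return None
    # Line N of the ground truth file holds the class id of sequence id N.
    class_name = id_class_map[val_class[seq_num - 1]]
    return Path(val_root) / class_name / file_path.name


# The first validation image has class id 490, which maps to sea_snake.
print(val_destination("/data/val_original/ILSVRC2012_val_00000001.JPEG",
                      val_class=[490], id_class_map={490: "sea_snake"},
                      black_list=set(), val_root="/data/val"))
```

Passing the black list as a set also makes the membership check faster than with a list.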

Create Label Map:

Just for convenience we will store each class label and its number encoding in a JSON file, so that we can use that to set the target class. You can also store them with one hot encoding here.

import json
import glob

label_map = {}

dirs = glob.glob(TRAIN_DATA_FOLDER + "*")
for i, dir in enumerate(dirs):
    label_map[dir.split("/")[-1]] = i

with open("label_map.json", "w") as file:
    file.write(json.dumps(label_map))

Here is a part of the JSON file:

{
  "Persian_cat": 0,
  "barracouta": 1,
  "pug": 2,
  "whiskey_jug": 3,
  "pot": 4,
  "cassette": 5,
  "solar_dish": 6,
  "tub": 7,
  "gorilla": 8,
  "microphone": 9,
  "cabbage_butterfly": 10,
  ....
  ....
}
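Later on you will probably want to go the other way as well, from a predicted number back to the label name. A minimal sketch, using only the first three entries of the sample above as an inline string instead of reading label_map.json:

```python
import json

# Inline stand-in for the content of label_map.json (first three sample entries only).
label_map_json = '{"Persian_cat": 0, "barracouta": 1, "pug": 2}'

label_map = json.loads(label_map_json)
# Invert the mapping: number -> label name.
index_to_label = {index: label for label, index in label_map.items()}

print(index_to_label[2])
```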

Conclusion:

This should get you to the point where you can start your preprocessing. In the future I will post another article on how to prepare the Imagenet dataset for object detection.

The post How to prepare Imagenet dataset for Image Classification appeared first on A Developer Diary.

Imagenet PreProcessing using TFRecord and Tensorflow 2.0 Data API


Image PreProcessing is the first step of any Computer Vision application. Beginners tend to neglect this step, since most of the time while learning we take a small dataset of only a couple of thousand images that fits in memory. However, in real life that’s not the case, and learning to build an efficient pipeline for Image PreProcessing can be really helpful when working on a tight deadline. In this Imagenet PreProcessing using TFRecord and Tensorflow 2.0 tutorial we will learn not only how to effectively use TFRecord and the new TensorFlow 2.0 Data API features, but also how to fully use the available computational resources.

Scope:

Let’s assume that we want to replicate AlexNet using the 2015 Imagenet data. Now, Imagenet is around 166GB, so it’s probably not a good idea to plan to store the entire dataset in computer memory, hence we must build an efficient pipeline.

Now, look at the process diagram of a typical Convolutional Neural Network application. This is at a very high level and focuses on the data preparation part and not the real-time active/online learning and prediction.

Imagenet PreProcessing using TFRecord and Tensorflow 2.0 Data API adeveloperdiary.com

In the above diagram the data-preparation steps are highlighted in red. In this tutorial we will mainly focus on the Image Pre-Processing step.

In case you want to understand how to prepare the Imagenet data, please refer to the following tutorial to know more about the Data Preparation step.

How to prepare Imagenet dataset for Image Classification

Assumptions:

  • The images ( train/val/test ) are organized in folders named after their respective class labels. The Dataset Preparation step organizes them accordingly. Here is a sample structure of how it should look.
Imagenet PreProcessing using TFRecord and Tensorflow 2.0 Data API adeveloperdiary.com
  • There is a JSON file mapping each class label to a number. Here is an example of that. This also should have been generated in the Dataset Preparation step. ( Later during Training we will be using the sparse categorical cross entropy loss function rather than categorical cross entropy )

    {
      "Persian_cat": 0,
      "barracouta": 1,
      "pug": 2,
      "whiskey_jug": 3,
      "pot": 4,
      "cassette": 5,
      "solar_dish": 6,
      "tub": 7,
      "gorilla": 8,
      "microphone": 9,
      "cabbage_butterfly": 10,
      ....
      ....
    }

  • We can also directly have one hot encoding in this file rather than using number encoding. Here just for simplicity I am using number encoding for the classes.

    {
        "car":[1,0,0],
        "road":[0,1,0],
        "tree":[0,0,1]
    }
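If you start from the number encoding and later decide you want one hot vectors, the conversion is a one-liner per class. A minimal sketch with NumPy, where one_hot is a helper name of my own and the three classes are the ones from the example above:

```python
import numpy as np


def one_hot(index, num_classes):
    """Vector of zeros with a single 1 at position `index`."""
    vector = np.zeros(num_classes, dtype=np.int64)
    vector[index] = 1
    return vector


label_map = {"car": 0, "road": 1, "tree": 2}
one_hot_map = {label: one_hot(i, len(label_map)).tolist() for label, i in label_map.items()}
print(one_hot_map)
```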

Objectives:

It’s important to define what we want to do in the beginning.

Functional Objectives:

As per the AlexNet paper, we will perform following operations:

  1. Mean RGB Calculation
  2. Image Preprocessing:
  • Image Resize
  • Create TFRecord and store them in filesystem

Technical Objectives:

We will also try to use as much of the computation power available in the system as we can, which should also lead to faster processing times. As you can see in the first picture, this step could be executed more than once, hence having a faster pipeline will help us in the long run.

  • Use more than one CPU core
  • Reduce Processing Timeframe
  • The size of the TFRecord files should be about the same as the original data size.
    • In case you have enough storage you can ignore this. More on this in later section.

Implementations:

1. Mean RGB Calculation:

In the Mean RGB Calculation we will calculate the mean values of the R, G & B channels across all the images. You can always use a subsample derived from the main dataset for this task, which should give you approximately the same result as long as the sample is a good representation of the original dataset.

There are two ways we can implement this:

  • Use the original images and calculate the mean rgb.
    • This is helpful when the rgb mean normalization is part of the pre-processing, so that you can apply it when you create the TFRecord.
    • This will be a bit complex to implement since we need to manage the worker processes ourselves.
  • Use the generated TFRecord files after pre-processing
    • This is helpful when you want to perform the rgb mean normalization during your training pipeline.
    • This will also be more accurate since the mean of rgb will be calculated on the resized, cropped images.
    • Easy to implement since Tensorflow data API will perform most of the work.

Implementing using python multiprocessing:

The __master_get_mean_rgb() function starts many worker processes running the __worker_calculate_mean() function, each calculating the mean of a subset of images, and returns the combined result to the get_mean_rgb() function, which will then save the final values in JSON format to the disk.

Imagenet PreProcessing using TFRecord and Tensorflow 2.0 Data API adeveloperdiary.com

__worker_calculate_mean():

This is a very straightforward function. We will be using the progressbar package to create a nice progress bar in the console. The function’s arguments are a list of image paths and an identifier named core_num ( just to show in the progress bar, not required for the actual functionality ).

We first create 3 empty lists named R, G and B, then loop through the image paths and read them using opencv. Next, we call the cv2.mean() function to get the mean of each channel. We take only the first 3 values, since the 4th one is for the alpha channel which we don’t need.

Then append the values to the R, G, B lists. Remember opencv uses B,G,R format and not R,G,B. Once the loop is completed, we calculate the mean of the batch of images and return that to the calling function.

def __worker_calculate_mean(files, core_num):
    (R, G, B) = ([], [], [])

    widgets = [
        'Calculating Mean - [', str(core_num), ']',
        progressbar.Bar('#', '[', ']'),
        ' [', progressbar.Percentage(), '] ',
        '[', progressbar.Counter(format='%(value)02d/%(max_value)d'), '] '

    ]

    bar = progressbar.ProgressBar(maxval=len(files), widgets=widgets)
    bar.start()

    for i, file in enumerate(files):
        image = cv2.imread(file)
        (b, g, r) = cv2.mean(image)[:3]
        R.append(r)
        G.append(g)
        B.append(b)
        bar.update(i + 1)
    bar.finish()
    return np.mean(R), np.mean(G), np.mean(B)

__master_get_mean_rgb():

We will be using the multiprocessing package to create and manage the worker processes. First we get the core count of the processor by calling multiprocessing.cpu_count() and then split the list of images into batches based on the core count.

So if we have 1M images and 10 cores in the CPU, we will have 100K images per CPU core to process.

Use the Pool class to create a pool of worker processes, then invoke the starmap() function. The first argument of this function is the __worker_calculate_mean() function and the 2nd argument is the list of parameters to be passed to it.

After all workers have completed, loop through the results and append the batch means to the local R, G and B lists. Finally, return the mean values.

def __master_get_mean_rgb(files):
    (R, G, B) = ([], [], [])

    cpu_core = multiprocessing.cpu_count()
    item_per_thread = int(len(files) / cpu_core) + 1

    split_file_list = [files[x:x + item_per_thread] for x in range(0, len(files), item_per_thread)]
    p = multiprocessing.Pool(cpu_core)
    results = p.starmap(__worker_calculate_mean, zip(split_file_list, list(range(len(split_file_list)))))
    p.close()
    p.join()

    for val in results:
        R.append(val[0])
        G.append(val[1])
        B.append(val[2])

    return np.mean(R), np.mean(G), np.mean(B)
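One detail to keep in mind: the last batch is usually smaller than the others, so a plain mean of the batch means gives that batch slightly too much weight. The difference is tiny for a dataset of this size, but if you want the exact value you can weight each batch by its size. A minimal sketch, where split_into_chunks mirrors the split above and combine_chunk_means is a helper name I made up:

```python
import numpy as np


def split_into_chunks(items, n_workers):
    """Same split as above: roughly equal chunks, the last one may be shorter."""
    per_worker = int(len(items) / n_workers) + 1
    return [items[i:i + per_worker] for i in range(0, len(items), per_worker)]


def combine_chunk_means(chunk_means, chunk_sizes):
    """Average the per-chunk (R, G, B) means, weighted by the chunk sizes."""
    return np.average(np.asarray(chunk_means, dtype=np.float64), axis=0, weights=chunk_sizes)


# Fake per-image (R, G, B) means for 10 images, split across 4 workers.
per_image = np.random.default_rng(0).uniform(0, 255, size=(10, 3))
chunks = split_into_chunks(list(range(10)), 4)
chunk_means = [per_image[chunk].mean(axis=0) for chunk in chunks]
combined = combine_chunk_means(chunk_means, [len(chunk) for chunk in chunks])
print(np.allclose(combined, per_image.mean(axis=0)))
```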

Using this approach, we should be able to speed up the computations. I have a Threadripper 1950x with 32 threads, which significantly improved the performance, completing in just 5 mins while using all 1.2M images. I was reading the data from a WD Black 4TB HDD; if you use an NVMe drive or a regular SSD the performance will be even better.

As you see in the below picture, all the cpu cores in my machine were utilized.

Imagenet PreProcessing using TFRecord and Tensorflow 2.0 Data API adeveloperdiary.com

These RGB values need to be stored in a JSON file on the disk for later use.

def get_mean_rgb(image_dir, output_file):
    files = glob.glob(image_dir)
    R, G, B = __master_get_mean_rgb(files)

    with open(output_file, "w+") as f:
        f.write(json.dumps({"R": R, "G": G, "B": B}))

You can execute the script using following command:

python mean_rgb_calc.py -i "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/Data/CLS-LOC/test/**/*.JPEG" -o "imagenet_test_mean_rgb.json"

You can see the progress in the console:

Calculating Mean - [5][#########################################################] [100%] [1313/1313]  
Calculating Mean - [13][########################] [100%] [1313/1313] 
Calculating Mean - [28][########################] [100%] [1313/1313] 
Calculating Mean - [15][########################] [100%] [1313/1313] 
Calculating Mean - [23][########################] [100%] [1313/1313] 
Calculating Mean - [0][#########################] [100%] [1313/1313] 
Calculating Mean - [21][########################] [100%] [1313/1313] 
Calculating Mean - [19][########################] [100%] [1313/1313] 
Calculating Mean - [26][########################] [100%] [1313/1313] 
Calculating Mean - [1][#########################] [100%] [1313/1313] 
Calculating Mean - [8][#########################] [100%] [1313/1313] 
Calculating Mean - [7][#########################] [100%] [1313/1313]

Here is the output JSON file:

{
	"R": 122.58534800031481, 
	"G": 116.7101693473191, 
	"B": 104.37388196859331
}

Implementing using TFRecord:

If you have completed Step 2 ( image PreProcessing ) and saved the data using TFRecord then those files can be used for RGB Mean calculation as well.

This code will be simple since TensorFlow’s Data API will take care of creating multiple threads for efficiency.

I strongly encourage you to skip this section and come back after reading through the TFRecord creation in case you are new to TFRecord.

__master_get_mean_rgb_from_tfrecord():

First let’s read the tfrecord files using tensorflow’s Data API, then call the parse_image() function to parse each TFRecord into an image tensor and a label. We will set repeat to 1 so that every record is accessed only once. You can set any batch size, I am setting it to 1024.

In order to make sure everything happens using parallel calls, set tf.data.experimental.AUTOTUNE as the num_parallel_calls in the map function and as the buffer_size in the prefetch function. Based on the available hardware, tensorflow will automatically set the number of parallel threads.

Next, loop through the dataset by calling take() function. Set count to -1 in order to retrieve all the data in loop.

Inside the for loop:

  • Change the data type to float64 using tf.cast() function.
  • Calculate the mean using tf.reduce_mean().
    • Set the axis to (0,1,2) since the shape of the image will be (1024, 256,256, 3) and we want to calculate the mean for each channel. The final output will be a vector with 3 values [R,G,B]
  • Store the mean of each batch in the rgb_mean_arr list.

Finally, convert the python list to a numpy array and call np.mean with axis set to 0 to calculate the final rgb mean.

def __master_get_mean_rgb_from_tfrecord(files):
    def parse_image(record):
        features = {
            'label': tf.io.FixedLenFeature([], tf.int64),
            'image_raw': tf.io.FixedLenFeature([], tf.string)
        }
        parsed_record = tf.io.parse_single_example(record, features)
        image = tf.io.decode_jpeg(parsed_record['image_raw'], channels=3)
        label = tf.cast(parsed_record['label'], tf.int32)
        return image, label

    record_files = tf.data.Dataset.list_files(files)

    dataset = tf.data.TFRecordDataset(filenames=record_files, compression_type="GZIP")

    dataset = dataset.map(parse_image, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
        .repeat(1) \
        .batch(1024) \
        .prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

    rgb_mean_arr = []

    for i, (image, label) in enumerate(dataset.take(count=-1)):
        rgb_mean_arr.append(tf.reduce_mean(tf.cast(image, tf.float64), axis=(0, 1, 2)))

    return np.mean(np.array(rgb_mean_arr), axis=0)
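If the axis=(0, 1, 2) part is not obvious, here is the same reduction in plain NumPy on a tiny fake batch. This is only an analogue of the tf.reduce_mean() call above, with made-up pixel values, to show that averaging over the batch, height and width axes leaves one value per channel.

```python
import numpy as np

# Fake batch of 4 images, 8x8 pixels, 3 channels, each channel filled with a constant.
batch = np.zeros((4, 8, 8, 3), dtype=np.uint8)
batch[..., 0] = 10
batch[..., 1] = 20
batch[..., 2] = 30

# Cast first (uint8 would be too small to accumulate in), then average over batch, height and width.
channel_mean = batch.astype(np.float64).mean(axis=(0, 1, 2))
print(channel_mean.shape, channel_mean)
```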

The get_mean_rgb() function now needs to be updated to accommodate both the options.

def get_mean_rgb(image_dir, output_file, useTFRecord=False):
    files = glob.glob(image_dir)

    if useTFRecord:
        rgb_mean = __master_get_mean_rgb_from_tfrecord(files)
        R, G, B = rgb_mean[0], rgb_mean[1], rgb_mean[2]
    else:
        R, G, B = __master_get_mean_rgb(files)

    with open(output_file, "w+") as f:
        f.write(json.dumps({"R": R, "G": G, "B": B}))

Let’s create the main function.

if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image_dir", required=True, help="path to the input image dir. e.g. /home/Dataset/train/**/*.jpg", )
    ap.add_argument("-o", "--output_file_name", required=True, help="path to the json output")
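    # type=bool would treat any non-empty string (even "False") as True, so parse the text explicitly
    ap.add_argument("-tf", "--use_tfrecord", required=False, type=lambda s: s.lower() in ("true", "1", "yes"), default=False, help="pass True to read the input from TFRecord files")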
    args = vars(ap.parse_args())

    get_mean_rgb(args["image_dir"], args["output_file_name"], args["use_tfrecord"])

You can execute the script using following command:

python mean_rgb_calc.py -i "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/tf_records/test/*.tfrecord" -o "imagenet_test_mean_rgb_tf.json" -tf True

The output JSON will be:

{
	"R": 122.10927936917298, 
	"G": 116.5416959998387, 
	"B": 102.61744377213829
}

Notice that there is a small difference between the two approaches. The 2nd one is more appropriate though, since it is calculated on the resized and cropped images.

Please find the full code in github.

2. Image Preprocessing:

In this Image Preprocessing section, we will first resize the images and crop them as per AlexNet paper and then will store them in TFRecord format for faster processing at training time.

Image Resize:

We will be using opencv for the image processing tasks.

scale_image():

The scale_image() method takes the raw image as an array of shape [h,w,3] and upscales/downscales the shortest side to 256 pixels. This is very straightforward code.

def scale_image(image, size):
    image_height, image_width = image.shape[:2]

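    if image_height <= image_width:
        # Height is the shorter side: set it to `size` and scale the width by the aspect ratio
        ratio = image_width / image_height
        h = size
        w = int(ratio * size)

        image = cv2.resize(image, (w, h))

    else:
        # Width is the shorter side: set it to `size` and scale the height by the aspect ratio
        ratio = image_height / image_width
        w = size
        h = int(ratio * size)

        image = cv2.resize(image, (w, h))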

    return image
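To see what sizes this produces without loading any image, the size calculation can be pulled out on its own. A small sketch, where scaled_size is a helper name of my own that returns (height, width) using the same arithmetic as scale_image():

```python
def scaled_size(image_height, image_width, size):
    """Return (height, width) after scaling the shorter side to `size`."""
    if image_height <= image_width:
        return size, int(image_width / image_height * size)
    return int(image_height / image_width * size), size


# A 480x640 (height x width) landscape image and its portrait counterpart.
print(scaled_size(480, 640, 256))
print(scaled_size(640, 480, 256))
```

The longer side is truncated to an integer, which is why it can end up an odd number of pixels away from 256.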

center_crop():

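This function crops the center part of the resized image. We first need to make sure the longer side is not just 1 pixel longer than the target ( e.g. 257 pixels ), since the crop offset would then be 0 and the slice would return an empty image.

In case the width & height are still not equal to 256px after cropping, we just resize the image to 256px X 256px at the end. This happens when the longer side has 257px, or when the difference between the longer side and 256 is an odd number.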

def center_crop(image, size):
    image_height, image_width = image.shape[:2]

    if image_height <= image_width and abs(image_width - size) > 1:
        # Landscape: trim an equal margin from the left and the right
        dx = int((image_width - size) / 2)
        image = image[:, dx:-dx, :]
    elif abs(image_height - size) > 1:
        # Portrait: trim an equal margin from the top and the bottom
        dy = int((image_height - size) / 2)
        image = image[dy:-dy, :, :]

    image_height, image_width = image.shape[:2]
    # Compare by value; `is not` checks object identity, not equality
    if image_height != size or image_width != size:
        image = cv2.resize(image, (size, size))

    return image
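
As an alternative sketch (not the version used in this post), slicing with an explicit start and length always yields exactly size pixels, so the odd-length case needs no extra resize. The name center_crop_exact is hypothetical, and the function assumes both sides are at least size pixels long.

```python
import numpy as np


def center_crop_exact(image, size):
    """Crop a size x size window from the center of an [h, w, c] array.

    Assumes both sides are at least `size` pixels long.
    """
    image_height, image_width = image.shape[:2]
    top = (image_height - size) // 2
    left = (image_width - size) // 2
    return image[top:top + size, left:left + size, :]


wide = np.zeros((256, 341, 3), dtype=np.uint8)
tall = np.zeros((257, 256, 3), dtype=np.uint8)
print(center_crop_exact(wide, 256).shape)
print(center_crop_exact(tall, 256).shape)
```

The trade-off is that an odd surplus is cropped one pixel off-center rather than rescaled, which is usually negligible at this resolution.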

process_image():

The process_image() function just invokes the above two functions and returns the result.

def process_image(image, size):
    image = scale_image(image, size)
    image = center_crop(image, size)
    return image

TFRecord Creation:

Once the image pre-processing has been completed, we can now store them in TFRecord format.

Note – We have not done the RGB Mean normalization here; we will perform that during training (read more on that later).

worker_tf_write():

This function takes a list of image paths and stores all of them in TFRecord format. TFRecord supports GZIP compression, which can be passed as an argument and gave us around 4% storage benefit.

We loop through the image files and call the process_image() function, passing the image vector. This function returns the resized image.
We could store the image vector directly in the TFRecord, however the size would be 5-7 times larger. Hence we compress the image vector to JPG and store the encoded bytes (binary data).

The cv2.imencode() function encodes the image vector as JPG. We can define the quality of the JPG compression using encode_param. A small decode cost has to be paid during training, however that can be alleviated by decoding on the CPU instead of the GPU (the next tutorial will have more details on this).

Also, we are not normalizing the images using the Mean RGB here: a mean-subtracted float image cannot be stored as a compressed JPG without losing that information, so we will perform the mean RGB normalization during training. However, if you are saving the images as raw vectors rather than JPG-encoded images, you can perform both the RGB Mean normalization and the float32 conversion before storing the image vector in the TFRecord.

With 70% JPG Compression Quality our training output size was ~13GB and with 90% JPG Compression Quality it was ~25GB. We are seeing a significant reduction in size as we are resizing the images to 256×256 from the original ~138GB training data.

In case you are interested in how model performance degrades as JPG compression becomes stronger (lower quality), here is the reference chart from a recent study. A compression quality of 70%-80% won't affect the accuracy of the model much. Also, not all the ImageNet images have 100% quality; I found many images with 70%-80% quality.

Imagenet PreProcessing using TFRecord and Tensorflow 2.0 Data API adeveloperdiary.com

If you want to know more on this, please refer to this paper: https://arxiv.org/abs/1604.04004

If the encoding is successful, then convert it to bytes using tobytes() function and store it.

The label name is captured by parsing the image path, and the label number is then retrieved using the label_map. I will be storing the labels as integers (NumberEncoding), however you can convert them to OneHotEncoding and store them instead.
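
The label lookup can be sketched in isolation as follows. This is an illustration only: the helper name label_from_path, the example path and the two-entry label map are all hypothetical, and it assumes the class name is the folder that directly contains the image.

```python
from pathlib import PurePosixPath


def label_from_path(file_path, label_map):
    """Map an image path to its numeric label via the parent folder name."""
    label_str = PurePosixPath(file_path).parent.name
    return label_map[label_str]


# Hypothetical two-class label map and path, for illustration only.
example_map = {"n01440764": 0, "n01443537": 1}
example_path = "/data/train/n01443537/image_0001.JPEG"
print(label_from_path(example_path, example_map))
```

Using the parent folder name instead of splitting on "/" keeps the intent explicit; on Windows paths, pathlib.PureWindowsPath would be the equivalent choice.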

Using the tf.train.Example() convert the label and raw image into TFRecord data.

Then using tf_writer.write() function, write the record to the file.

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def worker_tf_write(files, tf_record_path, label_map, size, image_quality, tf_record_options, number):
    encode_param = [int(cv2.IMWRITE_JPEG_QUALITY), image_quality]
    tf_record_options = tf.io.TFRecordOptions(compression_type=tf_record_options)

    with tf.io.TFRecordWriter(tf_record_path, tf_record_options) as tf_writer:
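        for i, file in enumerate(files):
            raw_image = cv2.imread(file)
            if raw_image is None:
                # cv2.imread() returns None for unreadable or corrupt files
                print("Error reading " + file)
                continue

            image = process_image(raw_image, size)
            is_success, im_buf_arr = cv2.imencode(".jpg", image, encode_param)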

            if is_success:
                label_str = file.split("/")[-2]

                label_number = label_map[label_str]

                image_raw = im_buf_arr.tobytes()
                row = tf.train.Example(features=tf.train.Features(feature={
                    'label': _int64_feature(label_number),
                    'image_raw': _bytes_feature(image_raw)
                }))

                tf_writer.write(row.SerializeToString())
            else:
                print("Error processing " + file)

Note: The GitHub code also contains a progress bar; for simplicity I have removed it from the code sample above.

master_tf_write():

We will invoke worker_tf_write() in parallel from the master_tf_write() function, using a multiprocessing pool with one worker process per CPU core. The code is simple to read through.

def master_tf_write(split_file_list, tf_record_paths, size, image_quality, label_map, tf_record_options):
    cpu_core = multiprocessing.cpu_count()

    p = multiprocessing.Pool(cpu_core)
    results = p.starmap(worker_tf_write,
                        zip(split_file_list, tf_record_paths, repeat(label_map), repeat(size), repeat(image_quality), repeat(tf_record_options),
                            list(range(len(tf_record_paths)))))
    p.close()
    p.join()
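
The zip() and repeat() combination above is what turns per-shard and shared arguments into one tuple per worker call. Here is a small sketch of just that step, without starting any processes. All input values are hypothetical.

```python
from itertools import repeat

# Hypothetical inputs: two chunks of files and their output paths.
split_file_list = [["a.jpg", "b.jpg"], ["c.jpg"]]
tf_record_paths = ["train-0.tfrecord", "train-1.tfrecord"]
label_map = {"cat": 0}

# zip() stops at the shortest iterable, so the infinite repeat() iterators
# contribute exactly one value per chunk.
task_args = list(zip(split_file_list, tf_record_paths, repeat(label_map),
                     repeat(256), repeat(90), repeat("GZIP"),
                     range(len(tf_record_paths))))

for task in task_args:
    print(task)
```

Each tuple lines up positionally with the parameters of worker_tf_write(), which is why starmap() can unpack it directly.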

create_tf_record():

The create_tf_record() function has 8 different arguments. Get the list of files first using the glob.glob() function, then shuffle the list. Afterwards, find how many tfrecord files need to be created based on the split_number. In a loop create the list of tfrecord files and then invoke master_tf_write() function by passing all the required parameters.

def create_tf_record(image_folder, record_path, identifier, label_map, size=256, split_number=1000, image_quality=90, tf_record_options=None):
    print("creating " + identifier + " records")

    files = glob.glob(image_folder)

    random.shuffle(files)

    split_file_list = [files[x:x + split_number] for x in range(0, len(files), split_number)]

    tf_record_paths = []

    for i in range(len(split_file_list)):
        tf_record_paths.append(record_path + identifier + "-" + str(i) + ".tfrecord")

    master_tf_write(split_file_list, tf_record_paths, size, image_quality, label_map, tf_record_options)
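
The list comprehension that builds split_file_list decides how many TFRecord shards are written and how many images land in each. Below is a minimal sketch of that chunking. The helper name chunk_sizes is hypothetical, and the total of 7686 files is an assumption chosen so that the shard sizes match the 6000 and 1686 counts in the sample output later in this post.

```python
def chunk_sizes(total_files, split_number):
    """Return the number of files written to each TFRecord shard."""
    files = list(range(total_files))  # stand-in for the shuffled file list
    chunks = [files[x:x + split_number]
              for x in range(0, len(files), split_number)]
    return [len(chunk) for chunk in chunks]


print(chunk_sizes(7686, 6000))
print(chunk_sizes(12000, 6000))
```

Only the last shard can be smaller than split_number, so a very small remainder produces one tiny file; that is harmless but worth knowing when sizing shards.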

JSON Config:

In case we need to create the tfrecords for train, test and validation images at once, we can define a JSON Config file to have all the necessary configurations.

{
  "label_map": "label_map_100.json",
  "split_number": 6000,
  "image_folder": "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/Data/CLS-LOC_100/",
  "record_path": "/media/4TB/datasets/ILSVRC2015/ILSVRC2015/tf_records_159/",
  "image_type": "JPEG",
  "crop_size": 256,
  "image_quality": 90,
  "tf_record_compression": "GZIP",
  "batch": [
    "val",
    "test",
    "train"
  ]
}

__name__:

In the main method we can first parse the json and call create_tf_record() inside the for loop for each split type.

if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    ap.add_argument("-c", "--config", required=True, help="path to the config JSON.", )
    args = vars(ap.parse_args())

    with open(args["config"], "rb") as file:
        config = json.loads(file.read())

    with open(config["label_map"], "rb") as file:
        label_map = json.loads(file.read())

    for identifier in config["batch"]:
        image_folder = config["image_folder"] + identifier + "/**/*." + config["image_type"]
        record_path = config["record_path"] + identifier + "/"

        if not os.path.isdir(record_path):
            os.makedirs(record_path)

        create_tf_record(image_folder, record_path, identifier, label_map, config["crop_size"], config["split_number"], config["image_quality"],
                         config["tf_record_compression"])

You can execute the script using the following command:

python imagenet_preprocessing.py -c pre_process_config.json

The output will look like the following:

creating val records
Processing Images - [1][################################] [100%] [1686/1686] 
Processing Images - [0][################################] [100%] [6000/6000] 
creating test records
Processing Images - [1][################################] [100%] [2004/2004] 
Processing Images - [0][################################] [100%] [6000/6000]

You can find the source code in github.

Conclusion:

These preprocessing steps can be extended to any dataset. Next we will learn how to use the TFRecord files to fetch the data at training time.

The post Imagenet PreProcessing using TFRecord and Tensorflow 2.0 Data API appeared first on A Developer Diary.

Linear Discriminant Analysis – from Theory to Code

Linear Discriminant Analysis (LDA) is an important tool for both Classification and Dimensionality Reduction. Most textbooks cover this topic only in general terms; in this Linear Discriminant Analysis – from Theory to Code tutorial we will work through the mathematical derivations and also implement a simple LDA in Python. You should be confident about LDA after going through the post end to end.

Prerequisite:

You need to understand basic Linear Algebra in order to follow this tutorial. Specifically, you should know about Vectors, Matrices and Eigen Decomposition.

Objective:

Linear Discriminant Analysis can be used for both Classification and Dimensionality Reduction. The basic idea is to find a vector w which maximizes the separation between the target classes after projecting the data onto w. Refer to the diagram below for a better idea: the first plot shows a non-optimal projection of the data points, and the 2nd plot shows an optimal projection in which the classes are well separated.

LDA is a supervised algorithm ( Unlike PCA ), hence we need to have the target classes specified.

Linear Discriminant Analysis from Theory to Code adeveloperdiary.com

Credit: Above picture has been taken from “Pattern Recognition and Machine Learning” by “Christopher Bishop”

Orthogonal Projection:

As per the objective, we need to project the input data onto the vector w. Assume that we already know w (I will show how to find the optimal w later). Now let’s derive the equation for orthogonally projecting a vector x (the input vector) onto the vector w.

In the diagram below, the vector p is called the Orthogonal Projection of x onto w. The vector d is perpendicular to w and joins p to x; its length is the perpendicular distance from x to w. It is also known as the residual or error vector.

\[
d= x-p
\]

Linear Discriminant Analysis from Theory to Code adeveloperdiary.com

Since p is parallel to w, we can write,

\[
p=cw \\
\text{where } c = \text{some scalar}
\]

We can rewrite d as,

\[
\begin{align}
d & = x-p\\
& = x-cw
\end{align}
\]

We know that if two vectors are orthogonal to each other, then the dot product between them is 0. In this case p and d are orthogonal to each other, hence we can write,

\[
\begin{align}
p^Td & = 0 \\
(cw)^T(x-cw) & = 0 \\
cw^Tx - c^2w^Tw & = 0 \\
cw^Tw & = w^Tx \\
c & = \frac{w^Tx}{w^Tw}
\end{align}
\]

We can now rewrite the expression of p as,

\[
\begin{align}
p & = \left ( \frac{w^Tx}{w^Tw} \right ) w
\end{align}
\]
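
As a quick numerical check of the projection formula, here is a minimal NumPy sketch. The vectors x and w are arbitrary illustrative values and the helper name project is hypothetical.

```python
import numpy as np


def project(x, w):
    """Orthogonal projection of vector x onto the line spanned by w."""
    c = np.dot(w, x) / np.dot(w, w)
    return c * w


x = np.array([3.0, 4.0])
w = np.array([2.0, 0.0])

p = project(x, w)
d = x - p  # residual vector

print(p, d, np.dot(p, d))
```

Here c = 6/4 = 1.5, so p = (3, 0) and d = (0, 4); their dot product is 0, confirming that the residual is orthogonal to the projection.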

Optimal Linear Discriminant:

First we will develop the general concept on building an Optimal Linear Discriminant, then work on the objective function later.

Assume w to be a unit vector, which means,

\[
w^Tw = 1
\]

We can write the orthogonal projection \( \hat{x_i} \) of any d-dimensional vector \( x_i \) onto the vector w as,

\[
\begin{align}
\hat{x_i} & = \left ( \frac{w^Tx_i}{w^Tw} \right ) w \\
& = \left ( w^Tx_i \right ) w \\
& = a_i w
\end{align}
\]

Above, \( a_i \) is the offset/coordinate of \( \hat{x_i} \) along the line w. The set of these scalar values \( \left \{ a_1, a_2, \dots, a_n \right \} \) represents the mapping from \( \mathbf{R}^d \rightarrow \mathbf{R} \). This means that if we know \( w \), we can reduce the data from d dimensions to 1 dimension.

Dataset Assumption:

Just for simplicity, assume our dataset has only 2 target classes and 2 input fields. This will help us to visualize the results effectively. So we can say,

\[
\begin{align}
x_i \in \mathbf{R}^d \\
\text{where } d = 2
\end{align}
\]

Also, we can define \(D_i\) as follows,
\[
\begin{align}
D_i = \left \{ x_j|y_j = c_i \right \}
\end{align}
\]

Here \( y_j \) is the target label of \( x_j \) and \( c_i \) is a target class. \( D_i \) is the subset of the dataset labelled with class \(c_i\).

Mean of the Projected Points:

Since we already know the target class labels, we can easily calculate the mean of the projected data points for each class separately as,

\[
\begin{align}
m_1 & = \frac{1}{n_1} \sum_{x_i \in D_1} a_i \\
& = \frac{1}{n_1} \sum_{x_i \in D_1} w^Tx_i \\
& = w^T \left ( \frac{1}{n_1} \sum_{x_i \in D_1} x_i \right ) \\
& = w^T \mu_1
\end{align}
\]

Here, \( \mu_1\) is the mean of all input data points in \(D_1\)

Very similarly, we can define \(m_2\) as,

\[
\begin{align}
m_2 & = w^T \mu_2
\end{align}
\]
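
The identity \( m_1 = w^T \mu_1 \) can be checked numerically. The sketch below uses three made-up points for \(D_1\) and an arbitrary unit vector w; both are assumptions for illustration.

```python
import numpy as np

# Hypothetical class-1 points and a unit direction, for illustration.
D1 = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]])
w = np.array([0.6, 0.8])  # unit vector: 0.36 + 0.64 = 1

a = D1 @ w                 # projected coordinates a_i = w^T x_i
m1_projected = a.mean()    # mean of the projected points
mu1 = D1.mean(axis=0)      # mean of the original points
m1_from_mu = w @ mu1       # w^T mu_1

print(m1_projected, m1_from_mu)
```

Both routes give the same number, so the projected mean can be obtained from the class mean without projecting every point.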

Two Important Ideas:

In order to maximize the separation between the classes ( in the projected space ), we can think of maximizing the difference between the projected means \( |m_1-m_2| \).

However this condition alone is not sufficient to make sure in the projected space the data points are separated for each class. We need to think about the variance of the projected data in each class. A large variance will lead to overlapping data points among two classes ( we have only 2 classes in our data ).

So we also need to minimize the variance within each class. LDA does not use the variance directly; rather, it uses the following formulation.

\[
\begin{align}
s_i^2 = \sum_{x_j \in D_i} (a_j - m_i)^2
\end{align}
\]

\( s_i^2 \) is the total squared deviation from the mean (remember, variance is the mean squared deviation). This is also known as the scatter of the projected points.

Fisher’s LDA:

We can incorporate the above two ideas,

  • Maximize the distance between projected means
  • Minimize the sum of the projected scatter

into one equation, named as Fisher’s LDA.

\[
\begin{align}
\max_w J(w) = \frac{(m_1-m_2)^2}{s_1^2+s_2^2}
\end{align}
\]

So the goal of LDA is to find the vector \( w \) which maximizes \( J(w) \). This vector \( w \) is also called the Optimal Linear Discriminant.
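
To get a feel for the criterion, the sketch below evaluates \( J(w) \) in the projected space for two candidate directions. The toy dataset is made up so that the classes differ only along the first axis, the helper name fisher_score is hypothetical, and the class labels are 0 and 1 rather than 1 and 2.

```python
import numpy as np


def fisher_score(X, y, w):
    """Fisher criterion J(w) for a two-class dataset with labels 0 and 1."""
    a = X @ w                                  # projected coordinates
    a0, a1 = a[y == 0], a[y == 1]
    m0, m1 = a0.mean(), a1.mean()              # projected means
    s0 = np.sum((a0 - m0) ** 2)                # projected scatter, class 0
    s1 = np.sum((a1 - m1) ** 2)                # projected scatter, class 1
    return (m0 - m1) ** 2 / (s0 + s1)


# Hypothetical toy data: the classes differ along the first axis only.
X = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 1.0],
              [5.0, 0.0], [5.0, 2.0], [6.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

j_good = fisher_score(X, y, np.array([1.0, 0.0]))
j_bad = fisher_score(X, y, np.array([0.0, 1.0]))
print(j_good, j_bad)
```

Projecting onto the first axis gives a large score, while projecting onto the second axis makes the two projected means coincide and the score drops to zero.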

Rewrite the Equation using Input Data:

The above objective function is in projected space, so let’s express it using the input data, since \( w \) is still unknown.

\[
\begin{align}
(m_1-m_2)^2 = & (w^T\mu_1 - w^T\mu_2)^2 \\
= & [w^T(\mu_1-\mu_2)]^2 \\
= & w^T(\mu_1-\mu_2)(\mu_1-\mu_2)^Tw \\
= & w^TBw
\end{align}
\]

Here \(B = (\mu_1-\mu_2)(\mu_1-\mu_2)^T\) is called the between-class scatter matrix. It is a \(d \times d\) matrix.

\[
\begin{align}
s_1^2= & \sum_{x_i \in D_1} (a_i - m_1)^2 \\
= & \sum_{x_i \in D_1} (w^Tx_i - w^T\mu_1)^2 \\
= & \sum_{x_i \in D_1} \bigg( w^T \left ( x_i - \mu_1 \right ) \bigg)^2 \\
= & w^T \bigg( \sum_{x_i \in D_1} \left ( x_i - \mu_1 \right )\left ( x_i - \mu_1 \right )^T \bigg) w \\
= & w^TS_1w
\end{align}
\]

Above, \(S_1\) is the scatter matrix for \(D_1\).

Similarly we can define \(S_2\).

\[
\begin{align}
s_2^2 = & w^TS_2w
\end{align}
\]

We can combine the above equations,

\[
\begin{align}
s_1^2+s_2^2 & = w^TS_1w+w^TS_2w\\
& = w^T(S_1+S_2)w \\
& = w^TSw
\end{align}
\]

Now, we can rewrite the LDA Objective Function as,

\[
\begin{align}
\max_w J(w) = \frac{w^TBw}{w^TSw}
\end{align}
\]
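
The matrix form and the projected-space form of \( J(w) \) should agree for any direction w. Here is a minimal NumPy check on a made-up two-class dataset (labels 0 and 1; all values are illustrative assumptions).

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 1.0],
              [5.0, 0.0], [5.0, 2.0], [6.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = np.array([0.6, 0.8])

mu0 = X[y == 0].mean(axis=0)
mu1 = X[y == 1].mean(axis=0)

diff = (mu0 - mu1).reshape(-1, 1)
B = diff @ diff.T                         # between-class scatter

S = np.zeros((2, 2))
for cls, mu in ((0, mu0), (1, mu1)):
    centered = X[y == cls] - mu
    S += centered.T @ centered            # sum of per-class scatter matrices

j_matrix = (w @ B @ w) / (w @ S @ w)

# Same quantity computed directly in the projected space.
a = X @ w
m0, m1 = a[y == 0].mean(), a[y == 1].mean()
scatter = np.sum((a[y == 0] - m0) ** 2) + np.sum((a[y == 1] - m1) ** 2)
j_projected = (m0 - m1) ** 2 / scatter

print(j_matrix, j_projected)
```

The two values match, which is what allows the optimization to be carried out on B and S without projecting the data for every candidate w.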

Solving the Objective Function:

As you might have guessed, to solve for the best direction \(w\) we differentiate the objective function w.r.t. \(w\) and set the derivative to zero.

Recall that if \(f(x)\) and \(g(x)\) are two functions, then the derivative of their quotient w.r.t. \(x\) is,

\[
\frac{d}{dx} \bigg( \frac{f(x)}{g(x)} \bigg) = \frac{f^{\prime}(x)g(x) - g^{\prime}(x)f(x)}{g(x)^2}
\]

We can use the above formula to differentiate our cost/objective function.

\[
\begin{align}
\frac{d}{dw} J(w) = & \frac{(2Bw)(w^TSw) - (2Sw)(w^TBw)}{(w^TSw)^2} = 0 \\
Bw(w^TSw)= & Sw(w^TBw) \\
Bw =& Sw \bigg ( \frac{w^TBw}{w^TSw} \bigg) \\
Bw =& J(w) Sw \\
Bw =& \lambda Sw \\
S^{-1}Bw =& \lambda S^{-1}Sw \\
(S^{-1}B)w =& \lambda w
\end{align}
\]

If \(S\) is not a singular matrix (its inverse exists), then the above equation is a standard Eigenvalue-Eigenvector problem.
\(\lambda = J(w)\) is the Eigenvalue and \(w\) is the Eigenvector of the matrix \(S^{-1}B\).

Now all we have to do is calculate the Eigen decomposition of \(S^{-1}B\), then take the Eigenvector corresponding to the largest Eigenvalue. That will be our optimal \(w\) vector.
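
For the two-class case there is a handy cross-check: since \(Bw\) always points along \( (\mu_1-\mu_2) \), the top Eigenvector of \(S^{-1}B\) must be parallel to \(S^{-1}(\mu_1-\mu_2)\). The sketch below verifies this on a made-up dataset with labels 0 and 1; the data values are assumptions for illustration.

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 1.0],
              [5.0, 0.0], [5.0, 2.0], [6.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

mu0 = X[y == 0].mean(axis=0)
mu1 = X[y == 1].mean(axis=0)

diff = (mu0 - mu1).reshape(-1, 1)
B = diff @ diff.T

S = np.zeros((2, 2))
for cls, mu in ((0, mu0), (1, mu1)):
    centered = X[y == cls] - mu
    S += centered.T @ centered

eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S) @ B)
top = np.argmax(eig_vals.real)
w_eig = eig_vecs[:, top].real

# Closed-form direction for two classes (valid up to scale and sign).
w_closed = np.linalg.inv(S) @ (mu0 - mu1)

cosine = abs(w_eig @ w_closed) / (np.linalg.norm(w_eig) * np.linalg.norm(w_closed))
print(eig_vals.real[top], cosine)
```

The cosine of the angle between the two directions is 1, and the largest Eigenvalue equals the value of \(J(w)\) at the optimum.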

LDA for Multi-Class:

In case you want to use LDA for more than 2 classes, there is a small change in the way we calculate \(B\). The scatter matrix \(S\) is still computed the same way, as the sum of the scatter matrices of all the classes.

\[
\begin{align}
B = \sum_{i = 1}^C n_i (\mu_i - \mu)(\mu_i - \mu)^T
\end{align}
\]

Where,

  • \(C\) = Number of target classes
  • \(n_i\) = Number of data points in class \(i\)
  • \(\mu_i\) = Mean of the data points in class \(i\)
  • \(\mu\) = Mean of the entire dataset
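
It is worth noting how this formula relates to the two-class one. With exactly two classes, the multi-class \(B\) works out to \( \frac{n_1 n_2}{n_1+n_2} (\mu_1-\mu_2)(\mu_1-\mu_2)^T \), a positive multiple of the two-class \(B\), so the Eigenvectors are unchanged. A minimal sketch on made-up data with unequal class sizes (the helper name between_class_scatter is hypothetical):

```python
import numpy as np


def between_class_scatter(X, y):
    """Multi-class between-class scatter: sum_i n_i (mu_i - mu)(mu_i - mu)^T."""
    mu = X.mean(axis=0)
    B = np.zeros((X.shape[1], X.shape[1]))
    for cls in np.unique(y):
        X_cls = X[y == cls]
        diff = (X_cls.mean(axis=0) - mu).reshape(-1, 1)
        B += X_cls.shape[0] * (diff @ diff.T)
    return B


# Hypothetical two-class data with unequal class sizes.
X = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 1.0], [5.0, 0.0], [6.0, 2.0]])
y = np.array([0, 0, 0, 1, 1])

B_multi = between_class_scatter(X, y)

d = (X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)).reshape(-1, 1)
B_two = d @ d.T
n0, n1 = 3, 2
print(np.allclose(B_multi, (n0 * n1 / (n0 + n1)) * B_two))
```

Because only the scale differs, using the multi-class formula everywhere is a reasonable simplification when only the directions are needed.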

LDA Python Implementation:

First we will create an LDA class, implement the logic there and then run different experiments.

class LDA:
    def __init__(self):
        pass

    def fit(self, X, y):
        pass

fit():

Then inside the fit() function, first get the target classes, then calculate the mean for each class. Store them in mean_vectors list.

def fit(self, X, y):
        target_classes = np.unique(y)

        mean_vectors = []

        for cls in target_classes:
            mean_vectors.append(np.mean(X[y == cls], axis=0))

If the number of class is 2, then simply calculate the B matrix as per the above equation.

mu1_mu2 = (mean_vectors[0] - mean_vectors[1]).reshape(1, X.shape[1])
B = np.dot(mu1_mu2.T, mu1_mu2)

In case there are more than 2 classes, we need to first calculate the mean of the entire dataset. Then using that calculate the B matrix.

data_mean = np.mean(X, axis=0).reshape(1, X.shape[1])
B = np.zeros((X.shape[1], X.shape[1]))
for cls, mean_vec in zip(target_classes, mean_vectors):
    # Use the actual class label, not the loop index, to select the rows
    n = X[y == cls].shape[0]
    mean_vec = mean_vec.reshape(1, X.shape[1])
    mu1_mu2 = mean_vec - data_mean

    B += n * np.dot(mu1_mu2.T, mu1_mu2)

Now it's time to create the S matrix. Define an empty list s_matrix to store the scatter matrix for each class. Then loop through the mean_vectors and the data points of each class to calculate Si, and append each Si to the list. Again, we are just following the formula above. In case you are confused, please refer to the equations.

s_matrix = []

for cls, mean in zip(target_classes, mean_vectors):
    Si = np.zeros((X.shape[1], X.shape[1]))
    for row in X[y == cls]:
        t = (row - mean).reshape(1, X.shape[1])
        Si += np.dot(t.T, t)
    s_matrix.append(Si)

Create the S matrix and loop through the s_matrix list to sum up all the per-class scatter matrices.

S = np.zeros((X.shape[1], X.shape[1]))
for s_i in s_matrix:
	S += s_i

Calculate the \(S^{-1}\) using np.linalg.inv() then calculate the \(S^{-1}B\). Finally call np.linalg.eig() to get the eigen vector and eigen values.

S_inv = np.linalg.inv(S)

S_inv_B = S_inv.dot(B)

eig_vals, eig_vecs = np.linalg.eig(S_inv_B)

The np.linalg.eig() function does not sort the eigen values, so we need to do that programmatically. Use the argsort() function, then reverse the array by using [::-1]. Use the index to order the eig_vals and eig_vecs. Finally return the eig_vecs.

idx = eig_vals.argsort()[::-1]

eig_vals = eig_vals[idx] # Not needed
eig_vecs = eig_vecs[:, idx]

return eig_vecs
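
For reference, the fragments above can be assembled into one compact class. The following is a vectorized sketch under stated assumptions, not the exact code from the GitHub repository: the class name SimpleLDA is hypothetical, the multi-class formula for B is used for any number of classes, and the toy data is made up.

```python
import numpy as np


class SimpleLDA:
    """Compact LDA sketch: returns eigenvectors of S^-1 B, best first."""

    def fit(self, X, y):
        n_features = X.shape[1]
        data_mean = X.mean(axis=0)

        B = np.zeros((n_features, n_features))  # between-class scatter
        S = np.zeros((n_features, n_features))  # sum of class scatter matrices

        for cls in np.unique(y):
            X_cls = X[y == cls]
            mean_vec = X_cls.mean(axis=0)

            diff = (mean_vec - data_mean).reshape(-1, 1)
            B += X_cls.shape[0] * (diff @ diff.T)

            centered = X_cls - mean_vec
            S += centered.T @ centered

        eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S) @ B)

        # S^-1 B has real eigenvalues when S is positive definite, so any
        # complex dtype returned by eig() carries only round-off imaginary parts.
        eig_vals, eig_vecs = eig_vals.real, eig_vecs.real

        idx = eig_vals.argsort()[::-1]
        return eig_vecs[:, idx]


# Hypothetical toy data: the classes differ along the first axis only.
X = np.array([[0.0, 0.0], [0.0, 2.0], [1.0, 1.0],
              [5.0, 0.0], [5.0, 2.0], [6.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

eig_vecs = SimpleLDA().fit(X, y)
W = eig_vecs[:, :1]
print(W.ravel())
```

On this toy data the first Eigenvector points along the first axis (up to sign), which is the direction that separates the two classes.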

load_data():

We will be using the iris dataset, hence we need a function to load the data as needed for running the various experiments. I am not explaining load_data() here; feel free to post comments if you need any help.

def load_data(cols, load_all=False, head=False):
    iris = sns.load_dataset("iris")

    if not load_all:
        if head:
            iris = iris.head(100)
        else:
            iris = iris.tail(100)

    le = preprocessing.LabelEncoder()
    y = le.fit_transform(iris["species"])

    X = iris.drop(["species"], axis=1)

    if len(cols) > 0:
        X = X[cols]

    return X.values, y

Experiment 1: 2 Classes, 2 Column Dataset

We will use petal_length and petal_width only. Also we will use only 2 classes from the dataset. Call load_data(). The shape of X will be (100, 2)

cols = ["petal_length", "petal_width"]
X, y = load_data(cols, load_all=False, head=True)
print(X.shape)

Invoke fit() by passing the X and y, which will return all the eigen vectors.

lda = LDA()
eig_vecs = lda.fit(X, y)

We can just get the first eigen vector since we want to reduce the dimension to 1.

W = eig_vecs[:, :1]

We will loop through each data point, plot them using ax.scatter() then use the Orthogonal Projection to calculate the projected space.

colors = ['red', 'green', 'blue']
fig, ax = plt.subplots(figsize=(10, 8))
for point, pred in zip(X, y):
    ax.scatter(point[0], point[1], color=colors[pred], alpha=0.3)
    proj = (np.dot(point, W) * W) / np.dot(W.T, W)

    ax.scatter(proj[0], proj[1], color=colors[pred], alpha=0.3)

plt.show()

Here is the plot. You can see that the LDA algorithm has found the direction (line) onto which the projected classes are optimally separated.

Linear Discriminant Analysis from Theory to Code adeveloperdiary.com

Experiment 2: 3 Classes, 2 Column Dataset:

We will use the full iris dataset with 3 classes. Pass load_all=True in the load_data() function. There are no other changes required.

cols = ["petal_length", "petal_width"]
X, y = load_data(cols, load_all=True, head=True)
print(X.shape)

lda = LDA()
eig_vecs = lda.fit(X, y)
W = eig_vecs[:, :1]

colors = ['red', 'green', 'blue']
fig, ax = plt.subplots(figsize=(10, 8))
for point, pred in zip(X, y):
    ax.scatter(point[0], point[1], color=colors[pred], alpha=0.3)
    proj = (np.dot(point, W) * W) / np.dot(W.T, W)

    ax.scatter(proj[0], proj[1], color=colors[pred], alpha=0.3)

plt.show()

Here is the plot. You can see that all three classes are well separated in 1 dimension.

Linear Discriminant Analysis from Theory to Code adeveloperdiary.com

Experiment 3: 3 Classes, 4 Column Dataset:

Now, if we want to reduce the dimension of the data from 4 to 2, we can take the first 2 eigen vectors and then transform the data using X.dot(W).

X, y = load_data([], load_all=True, head=True)
print(X.shape)

lda = LDA()
eig_vecs = lda.fit(X, y)
W = eig_vecs[:, :2] # Take 2 eigen vectors

transformed = X.dot(W)

plt.scatter(transformed[:, 0], transformed[:, 1], c=y, cmap=plt.cm.Set1)
plt.show()

Here is the projected data in 2 dimensions. As you can see, we are able to reduce the dimension from 4 to 2 using LDA without losing much information.

Linear Discriminant Analysis from Theory to Code adeveloperdiary.com

sklearn library:

You can also use the built-in LinearDiscriminantAnalysis class from the sklearn library.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

clf = LinearDiscriminantAnalysis()
clf.fit(X, y)
transformed = clf.transform(X)

plt.scatter(transformed[:, 0], transformed[:, 1], c=y, cmap=plt.cm.Set1)
plt.show()

The plot looks the same; it is just flipped.
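
One plausible reason for the flip (an assumption, not verified against sklearn's internals) is that an Eigenvector is only defined up to sign and scale: if \(w\) solves \(Aw = \lambda w\), so does \(-w\). Different solvers are free to return either one. A tiny sketch with an arbitrary matrix A:

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0]])
eig_vals, eig_vecs = np.linalg.eig(A)

lam = eig_vals[0]
w = eig_vecs[:, 0]

# Both w and -w satisfy A v = lambda v.
print(np.allclose(A @ w, lam * w), np.allclose(A @ (-w), lam * (-w)))
```

Projecting onto \(-w\) mirrors the coordinates along that axis, which changes the look of the plot but not the separation between the classes.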

Linear Discriminant Analysis from Theory to Code adeveloperdiary.com

Conclusion:

Linear Discriminant Analysis (LDA) is a simple yet powerful tool. PCA and LDA are often compared, however LDA is a Supervised Learning Method and PCA is an Unsupervised Learning Method. Other extensions of LDA are also available, such as Kernel LDA, QDA etc.

You can find the full code in GitHub.

The post Linear Discriminant Analysis – from Theory to Code appeared first on A Developer Diary.


Support Vector Machines for Beginners – Linear SVM

Support Vector Machines (SVM) is a very popular machine learning algorithm for classification. We still use it where we don’t have enough data to train Artificial Neural Networks. In academia almost every Machine Learning course has SVM as part of the curriculum since it’s very important for every ML student to learn and understand SVM. […]

The post Support Vector Machines for Beginners – Linear SVM appeared first on A Developer Diary.

Support Vector Machines for Beginners – Duality Problem

The Objective Function of the Primal Problem works fine for a Linearly Separable Dataset, however it cannot handle a Non-Linear Dataset. In this Support Vector Machines for Beginners – Duality Problem article we will dive deep into transforming the Primal Problem into the Dual Problem and solving the objective functions using Quadratic Programming. Don’t worry if this sounds too complicated, […]

The post Support Vector Machines for Beginners – Duality Problem appeared first on A Developer Diary.

Support Vector Machines for Beginners – Kernel SVM

Kernel Methods are widely used in Clustering and Support Vector Machines. Even though the concept is very simple, most of the time students are not clear on the basics. We can use Linear SVM to perform Non Linear Classification just by adding the Kernel Trick. All the detailed derivations from Primal Problem to Dual Problem had […]

The post Support Vector Machines for Beginners – Kernel SVM appeared first on A Developer Diary.

Support Vector Machines for Beginners – Training Algorithms

We will now work on training SVM using the optimization algorithms (Primal and Dual) that we have defined. Even though these training algorithms can be a good foundation for more complex and efficient algorithms, they are only useful for learning purposes and not for real applications. Generally, SVM Training algorithms need loops rather than vectorized implementations, hence […]

The post Support Vector Machines for Beginners – Training Algorithms appeared first on A Developer Diary.

Machine Translation using Recurrent Neural Network and PyTorch

Seq2Seq (Encoder-Decoder) Model Architecture has become ubiquitous due to the advancement of Transformer Architecture in recent years. Large corporations started to train huge networks and published them to the research community. Recently OpenAI has licensed their most advanced pre-trained Transformer model GPT-3 to Microsoft. Even though the practical implementation of RNN has become almost […]

The post Machine Translation using Recurrent Neural Network and PyTorch appeared first on A Developer Diary.
