#### Deep Learning from first principles in Python, R and Octave – Part 1

Feed: R-bloggers.

Author: Tinniam V Ganesh.

*“You don’t perceive objects as they are. You perceive them as you are.”**“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”**“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”*

` David Eagleman - The Brain: The Story of You`

This is the first in the series of posts, I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton also on Coursera. In this post I implement Logistic regression with a 2 layer Neural Network i.e. a Neural Network that just has an input layer and an output layer and with no hidden layer.*I am certain that any self-respecting Deep Learning/Neural Network would consider a Neural Network without hidden layers as no Neural Network at all!*

This 2 layer network is implemented in Python, R and Octave languages. I have included Octave, into the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations. So, if you are familiar in any one of the languages, you should be able to look at the corresponding code in the other two. You can download this R Markdown file and Octave code from DeepLearning -Part 1

To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below

## 1. Logistic Regression

```
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
```

```
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96
```

To check on other classification algorithms, check my post Practical Machine Learning with R and Python – Part 2. You could also take a look at my book, available on Amazon , in which I implement the most popular Machine Learning algorithms in both R and Python – My book ‘Practical Machine Learning with R and Python’ on Amazon

## 2. Logistic Regression as a 2 layer Neural Network

In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below

The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

This simple 2 layer Neural Network is shown below

At the input layer there are 30 features and the corresponding weights of these inputs which are initialized to small random values.

where ‘b’ is the bias term

The Activation function is the sigmoid function which is

The Loss, when the sigmoid function is used in the output layer, is given by

(1)

## Gradient Descent

### Forward propagation

In forward propagation cycle of the Neural Network the output Z and the output of activation function, the sigmoid function, is first computed. Then using the output ‘y’ for the given features, the ‘Loss’ is computed using equation (1) above.

### Backward propagation

The backward propagation cycle determines how the ‘Loss’ is impacted for small variations from the previous layers upto the input layer. In other words, backward propagation computes the changes in the weights at the input layer, which will minimize the loss. Several cycles of gradient descent are performed in the path of steepest descent to find the local minima. In other words the set of weights and biases, at the input layer, which will result in the lowest loss is computed by gradient descent. The weights at the input layer are decreased by a parameter known as the ‘learning rate’. Too big a ‘learning rate’ can overshoot the local minima, and too small a ‘learning rate’ can take a long time to reach the local minima. This is done for ‘m’ training examples.

**Chain rule of differentiation**

Let y=f(u)

and u=g(x) then

**Derivative of sigmoid**

Let then

Using the chain rule of differentiation we get

Therefore -(2)

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are

-(a)

-(b)

-(c)

The back propagation step requires the computation of

and . In the case of regression it would be and where dE is the Mean Squared Error function.

Computing the derivatives for back propagation we have

-(d)

because

Also from equation (2) we get

– (e)

By chain rule

therefore substituting the results of (d) & (e) we get

(f)

Finally

-(g)

– (h)

and from (f) we have

Therefore (g) reduces to

-(i)

Also

-(j)

Since

The gradient computes the weights at the input layer and the corresponding bias by using the values

of

and

I found the computation graph representation in the book Deep Learning: Ian Goodfellow, Yoshua Bengio, Aaron Courville, very useful to visualize and also compute the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

### 3. Neural Network for Logistic Regression -Python code (vectorized)

```
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
def sigmoid(z):
a=1/(1+np.exp(-z))
return a
def initialize(dim):
w = np.zeros(dim).reshape(dim,1)
b = 0
return w
def computeLoss(numTraining,Y,A):
loss=-1/numTraining *np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
return(loss)
def forwardPropagation(w,b,X,Y):
Z=np.dot(w.T,X)+b
numTraining=float(len(X))
A=sigmoid(Z)
loss = computeLoss(numTraining,Y,A)
dZ=A-Y
dw=1/numTraining*np.dot(X,dZ.T)
db=1/numTraining*np.sum(dZ)
gradients = {"dw": dw,
"db": db}
loss = np.squeeze(loss)
return gradients,loss
def gradientDescent(w, b, X, Y, numIerations, learningRate):
losses=[]
idx =[]
for i in range(numIerations):
gradients,loss=forwardPropagation(w,b,X,Y)
dw = gradients["dw"]
db = gradients["db"]
w = w-learningRate*dw
b = b-learningRate*db
if i % 100 == 0:
idx.append(i)
losses.append(loss)
params = {"w": w,
"b": b}
grads = {"dw": dw,
"db": db}
return params, grads, losses,idx
def predict(w,b,X):
size=X.shape[1]
yPredicted=np.zeros((1,size))
Z=np.dot(w.T,X)
A=sigmoid(Z)
for i in range(A.shape[1]):
if(A[0][i] > 0.5):
yPredicted[0][i]=1
else:
yPredicted[0][i]=0
return yPredicted
def normalize(x):
x_norm = None
x_norm = np.linalg.norm(x,axis=1,keepdims=True)
x= x/x_norm
return x
from sklearn.datasets import load_breast_cancer
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
random_state = 0)
X_train1=normalize(X_train)
w=np.zeros((X_train.shape[1],1))
b=0
X_train1=normalize(X_train)
X_train2=X_train1.T
y_train1=y_train.reshape(len(y_train),1)
y_train2=y_train1.T
parameters, grads, costs,idx = gradientDescent(w, b, X_train2, y_train2, numIerations=4000, learningRate=0.75)
w = parameters["w"]
b = parameters["b"]
X_test1=normalize(X_test)
X_test2=X_test1.T
y_test1=y_test.reshape(len(y_test),1)
y_test2=y_test1.T
yPredictionTest = predict(w, b, X_test2)
yPredictionTrain = predict(w, b, X_train2)
print("train accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTrain - y_train2)) * 100))
print("test accuracy: {} %".format(100 - np.mean(np.abs(yPredictionTest - y_test)) * 100))
fig1=plt.plot(idx,costs)
fig1=plt.title("Gradient descent-Cost vs No of iterations")
fig1=plt.xlabel("No of iterations")
fig1=plt.ylabel("Cost")
fig1.figure.savefig("fig1", bbox_inches='tight')
```

```
## train accuracy: 90.3755868545 %
## test accuracy: 89.5104895105 %
```

**Note**: It can be seen that the Accuracy on the training and test set is 90.37% and 89.51%. This is comparatively poorer than the 96% which the logistic regression of sklearn achieves! But this is mainly because of the absence of hidden layers which is the real power of neural networks.

### 4. Neural Network for Logistic Regression -R code (vectorized)

`source("RFunctions-1.R")`

```
sigmoid function(z){
a 1/(1+ exp(-z))
a
}
computeLoss function(numTraining,Y,A){
loss -1/numTraining* sum(Y*log(A) + (1-Y)*log(1-A))
return(loss)
}
forwardPropagation function(w,b,X,Y){
Z t(w) %*% X +b
numTraining ncol(X)
A=sigmoid(Z)
loss computeLoss(numTraining,Y,A)
dZA-Y
dw1/numTraining * X %*% t(dZ)
db1/numTraining*sum(dZ)
fwdProp list("loss" = loss, "dw" = dw, "db" = db)
return(fwdProp)
}
gradientDescent function(w, b, X, Y, numIerations, learningRate){
losses NULL
idx NULL
for(i in 1:numIerations){
fwdProp forwardPropagation(w,b,X,Y)
dw fwdProp$dw
db fwdProp$db
w = w-learningRate*dw
b = b-learningRate*db
l fwdProp$loss
if(i %% 100 == 0){
idx c(idx,i)
losses c(losses,l)
}
}
gradDescnt list("w"=w,"b"=b,"dw"=dw,"db"=db,"losses"=losses,"idx"=idx)
return(gradDescnt)
}
predict function(w,b,X){
m=dim(X)[2]
yPredicted=matrix(rep(0,m),nrow=1,ncol=m)
Z t(w) %*% X +b
A=sigmoid(Z)
for(i in 1:dim(A)[2]){
if(A[1,i] > 0.5)
yPredicted[1,i]=1
else
yPredicted[1,i]=0
}
return(yPredicted)
}
normalize function(x){
nas.matrix(sqrt(rowSums(x^2)))
normalizedsweep(x, 1, n, FUN="/")
return(normalized)
}
cancer read.csv("cancer.csv")
names(cancer) c(seq(1,30),"output")
train_idx trainTestSplit(cancer,trainPercent=75,seed=5)
train cancer[train_idx, ]
test cancer[-train_idx, ]
X_train train[,1:30]
y_train train[,31]
X_test test[,1:30]
y_test test[,31]
w matrix(rep(0,dim(X_train)[2]))
b 0
X_train1 normalize(X_train)
X_train2=t(X_train1)
y_train1=as.matrix(y_train)
y_train2=t(y_train1)
gradDescent= gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)
X_test1=normalize(X_test)
X_test2=t(X_test1)
y_test1=as.matrix(y_test)
y_test2=t(y_test1)
yPredictionTest = predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain = predict(gradDescent$w, gradDescent$b, X_train2)
sprintf("Train accuracy: %f",(100 - mean(abs(yPredictionTrain - y_train2)) * 100))
```

`## [1] "Train accuracy: 90.845070"`

`sprintf("test accuracy: %f",(100 - mean(abs(yPredictionTest - y_test)) * 100))`

`## [1] "test accuracy: 87.323944"`

```
df data.frame(gradDescent$idx, gradDescent$losses)
names(df) c("iterations","losses")
ggplot(df,aes(x=iterations,y=losses)) + geom_point() + geom_line(col="blue") +
ggtitle("Gradient Descent - Losses vs No of Iterations") +
xlab("No of iterations") + ylab("Losses")
```

### 4. Neural Network for Logistic Regression -Octave code (vectorized)

1;

# Define sigmoid function

function a = sigmoid(z)

a = 1 ./ (1+ exp(-z));

end

# Compute the loss

function loss=computeLoss(numtraining,Y,A)

loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));

end

# Perform forward propagation

function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)

% Compute Z

Z = w' * X + b;

numtraining = size(X)(1,2);

# Compute sigmoid

A = sigmoid(Z);

#Compute loss. Note this is element wise product

loss =computeLoss(numtraining,Y,A);

# Compute the gradients dZ, dw and db

dZ = A-Y;

dw = 1/numtraining* X * dZ';

db =1/numtraining*sum(dZ);

end

# Compute Gradient Descent

function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)

#Initialize losses and idx

losses=[];

index=[];

# Loop through the number of iterations

for i=1:numIerations,

[loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);

# Perform Gradient descent

w = w - learningRate*dw;

b = b - learningRate*db;

if(mod(i,100) ==0)

# Append index and loss

index = [index i];

losses = [losses loss];

endif

end

end

# Determine the predicted value for dataset

function yPredicted = predict(w,b,X)

m = size(X)(1,2);

yPredicted=zeros(1,m);

# Compute Z

Z = w' * X + b;

# Compute sigmoid

A = sigmoid(Z);

for i=1:size(X)(1,2),

# Set predicted as 1 if A > 0,5

if(A(1,i) >= 0.5)

yPredicted(1,i)=1;

else

yPredicted(1,i)=0;

endif

end

end

# Normalize by dividing each value by the sum of squares

function normalized = normalize(x)

# Compute Frobenius norm. Square the elements, sum rows and then find square root

a = sqrt(sum(x .^ 2,2));

# Perform element wise division

normalized = x ./ a;

end

# Split into train and test sets

function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)

# Create a random index

ix = randperm(length(dataset));

# Split into training

trainSize = floor(trainPercent/100 * length(dataset));

train=dataset(ix(1:trainSize),:);

# And test

test=dataset(ix(trainSize+1:length(dataset)),:);

X_train = train(:,1:30);

y_train = train(:,31);

X_test = test(:,1:30);

y_test = test(:,31);

end

cancer=csvread("cancer.csv");

[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);

w=zeros(size(X_train)(1,2),1);

b=0;

X_train1=normalize(X_train);

X_train2=X_train1';

y_train1=y_train';

[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);

# Normalize X_test

X_test1=normalize(X_test);

#Transpose X_train so that we have a matrix as (features, numSamples)

X_test2=X_test1';

y_test1=y_test';

# Use the values of the weights generated from Gradient Descent

yPredictionTest = predict(w1, b1, X_test2);

yPredictionTrain = predict(w1, b1, X_train2);

trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100

testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100

trainAccuracy = 90.845

testAccuracy = 89.510

graphics_toolkit('gnuplot')

plot(idx,losses);

title ('Gradient descent- Cost vs No of iterations');

xlabel ("No of iterations");

ylabel ("Cost");

**Conclusion**

This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is comparatively poor to the highly optimized sklearn’s Logistic Regression. This is because the above neural network did not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers

The Deep Learning journey has begun… Don’t miss the bus!

Stay tuned for more interesting posts in Deep Learning!!

References

1. Deep Learning Specialization

2. Neural Networks for Machine Learning

3. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville

4. Neural Networks: The mechanics of backpropagation

5. Machine Learning

Also see

1. My book ‘Practical Machine Learning with R and Python’ on Amazon

2. Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4

3. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon

4. Practical Machine Learning with R and Python – Part 4

5. Introducing QCSimulator: A 5-qubit quantum computing simulator in R

6. A Bluemix recipe with MongoDB and Node.js

7. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)

To see all posts check Index of posts

*Related*

R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more…

## Leave a Reply

You must be logged in to post a comment.