Restricted Boltzmann Machine for Diabetes Prediction

- 12 mins

Restricted Boltzmann machine (RBM)

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. It has seen wide application in different areas of supervised and unsupervised machine learning, such as feature learning, dimensionality reduction, classification, collaborative filtering, and topic modeling. (Source)
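Concretely, an RBM with visible units $v$ and hidden units $h$ defines the following joint distribution (this is the standard formulation, not something specific to this post):

$$
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j,
\qquad
P(v, h) = \frac{e^{-E(v, h)}}{Z}
$$

where $W$, $a$ and $b$ are the learned weights and biases and $Z$ is the partition function. "Restricted" means there are no visible-visible or hidden-hidden connections, so the units of one layer are conditionally independent given the other.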

Bernoulli RBM

We will use the BernoulliRBM from scikit-learn together with logistic regression. BernoulliRBM is an unsupervised method, so it is mostly used for non-linear feature extraction; the extracted features can then be fed to a classifier (in this case LogisticRegression).

Objective

To train a classifier for predicting diabetes.

Data source

The Pima Indians Diabetes dataset is used here; it can be downloaded from Kaggle. (Note that the diabetes dataset bundled with scikit-learn is a different, regression-oriented one.)

The Notebook

The notebook for this work can be found at [Restricted Boltzmann machine (RBM)]

import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from numpy import set_printoptions
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# GridSearchCV moved from the removed sklearn.grid_search module
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
data = pd.read_csv('Diabetes.csv')

Dimensions of Your Data

shape = data.shape
print(shape)
(768, 9)
data.head()
   preg_count  glucose_concentration  blood_pressure  skin_thickness  serum_insulin   bmi  pedigree_function  age  class
0           6                    148              72              35              0  33.6              0.627   50      1
1           1                     85              66              29              0  26.6              0.351   31      0
2           8                    183              64               0              0  23.3              0.672   32      1
3           1                     89              66              23             94  28.1              0.167   21      0
4           0                    137              40              35            168  43.1              2.288   33      1

Descriptive Statistics

# Statistical Summary
from pandas import set_option
set_option('display.width', 100)
set_option('display.precision', 3)
description = data.describe()
print(description)
preg_count glucose_concentration blood_pressure skin_thickness serum_insulin bmi pedigree_function age class
count 768.000 768.000 768.000 768.000 768.000 768.000 768.000 768.000 768.000
mean 3.845 120.895 69.105 20.536 79.799 31.993 0.472 33.241 0.349
std 3.370 31.973 19.356 15.952 115.244 7.884 0.331 11.760 0.477
min 0.000 0.000 0.000 0.000 0.000 0.000 0.078 21.000 0.000
25% 1.000 99.000 62.000 0.000 0.000 27.300 0.244 24.000 0.000
50% 3.000 117.000 72.000 23.000 30.500 32.000 0.372 29.000 0.000
75% 6.000 140.250 80.000 32.000 127.250 36.600 0.626 41.000 1.000
max 17.000 199.000 122.000 99.000 846.000 67.100 2.420 81.000 1.000
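
Note the minimum of 0 for glucose_concentration, blood_pressure, skin_thickness, serum_insulin and bmi: a zero is physiologically implausible for these measurements and almost certainly encodes a missing value. A quick count, as a sketch reusing the column names above:

# columns where a value of 0 most likely encodes a missing measurement
zero_cols = ['glucose_concentration', 'blood_pressure',
             'skin_thickness', 'serum_insulin', 'bmi']
print((data[zero_cols] == 0).sum())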

Check if the data is balanced

class_counts = data.groupby('class').size()
print(class_counts)
class
0    500
1    268
dtype: int64
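
So the classes are imbalanced at roughly 65% / 35%. One cheap safeguard, not used in the original run below, is to stratify the train/test split so both sets keep that ratio; a sketch, assuming the X_scaled and y arrays prepared later:

# hypothetical variant: stratify=y keeps the 500/268 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=2017)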

Correlations Between Attributes

correlations = data.corr(method='pearson')
print(correlations)
preg_count glucose_concentration blood_pressure skin_thickness serum_insulin bmi pedigree_function age class
preg_count 1.000 0.129 0.141 -0.082 -0.074 0.018 -0.034 0.544 0.222
glucose_concentration 0.129 1.000 0.153 0.057 0.331 0.221 0.137 0.264 0.467
blood_pressure 0.141 0.153 1.000 0.207 0.089 0.282 0.041 0.240 0.065
skin_thickness -0.082 0.057 0.207 1.000 0.437 0.393 0.184 -0.114 0.075
serum_insulin -0.074 0.331 0.089 0.437 1.000 0.198 0.185 -0.042 0.131
bmi 0.018 0.221 0.282 0.393 0.198 1.000 0.141 0.036 0.293
pedigree_function -0.034 0.137 0.041 0.184 0.185 0.141 1.000 0.034 0.174
age 0.544 0.264 0.240 -0.114 -0.042 0.036 0.034 1.000 0.238
class 0.222 0.467 0.065 0.075 0.131 0.293 0.174 0.238 1.000

Skew and kurtosis of the univariate data

print(data.skew())

preg_count               0.902
glucose_concentration    0.174
blood_pressure          -1.844
skin_thickness           0.109
serum_insulin            2.272
bmi                     -0.429
pedigree_function        1.920
age                      1.130
class                    0.635
dtype: float64
data.kurtosis()
preg_count               0.159
glucose_concentration    0.641
blood_pressure           5.180
skin_thickness          -0.520
serum_insulin            7.214
bmi                      3.290
pedigree_function        5.595
age                      0.643
class                   -1.601
dtype: float64
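
serum_insulin stands out, with skew 2.27 and kurtosis 7.21; a log transform is a common way to tame such a long right tail. A hypothetical illustration (not part of the pipeline below):

import numpy as np
# log1p handles the zeros in serum_insulin and compresses the right tail
print(np.log1p(data['serum_insulin']).skew())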

Univariate Visualization

data.hist(figsize=(16,7))

[Figure: histogram of each of the nine attributes]

Density curves to inspect the shape of each distribution more clearly

data.plot(kind='density', subplots=True, layout=(3,3), sharex=False,figsize=(15,6))

[Figure: density plot of each of the nine attributes, 3×3 grid]

Multivariate Plots

import numpy
correlations = data.corr()
# plot correlation matrix
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(data.columns)
ax.set_yticklabels(data.columns)
plt.show()

[Figure: correlation matrix heatmap labelled with the attribute names]

Data standardization

# Standardize data (0 mean, 1 stdev)
data_array = data.values
# separate the array into input features and the output class
X = data_array[:, 0:8]
y = data_array[:, 8]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# summarize the transformed data
set_printoptions(precision=3)
print(X_scaled[0:8, :])
[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]
 [ 0.343 -0.153  0.253 -1.288 -0.693 -0.811 -0.818 -0.276]
 [-0.251 -1.342 -0.988  0.719  0.071 -0.126 -0.676 -0.616]
 [ 1.828 -0.184 -3.573 -1.288 -0.693  0.42  -1.02  -0.361]]
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=2017)

print(X.shape)
print(y.shape)
(768, 8)
(768,)
X_train[0]
array([ 0.64 , -0.059, -0.988,  0.092,  0.835, -0.621,  2.555, -0.02 ])
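
One caveat: scikit-learn's BernoulliRBM models binary visible units, and its documentation assumes the inputs are binary or lie in [0, 1], whereas StandardScaler produces negative values. The MinMaxScaler imported at the top would give the expected range; a sketch of that alternative:

# scale every feature to [0, 1], the range BernoulliRBM expects
scaler01 = MinMaxScaler()
X_01 = scaler01.fit_transform(X)
print(X_01.min(), X_01.max())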

Building Machine Learning Pipelines

# initialize the RBM + Logistic Regression pipeline
rbm = BernoulliRBM()
logistic = LogisticRegression()
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])

params = {
    "rbm__learning_rate": [0.1, 0.01, 0.001, 0.05, 0.005],
    "rbm__n_iter": [20, 40, 50, 80, 90, 100],
    "rbm__n_components": [20, 30, 50, 100, 80, 200],
    "logistic__C": [1.0, 5.0, 10.0, 20.0, 30.0, 100.0]}

# perform a grid search over the parameter
gs = GridSearchCV(classifier, params, cv=3, n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print "Best Score: %0.3f" % (gs.best_score_)

print "RBM + Logistic Regression parameters"
bestParams = gs.best_estimator_.get_params()

# loop over the parameters and print each of them out
# so they can be manually set
for p in sorted(params.keys()):
    print "\t %s: %f" % (p, bestParams[p])
Fitting 3 folds for each of 1080 candidates, totalling 3240 fits


[Parallel(n_jobs=-1)]: Done  73 tasks      | elapsed:   21.2s
[Parallel(n_jobs=-1)]: Done 223 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 473 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 823 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 1273 tasks      | elapsed:  7.7min
[Parallel(n_jobs=-1)]: Done 1823 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done 2473 tasks      | elapsed: 15.0min
[Parallel(n_jobs=-1)]: Done 3223 tasks      | elapsed: 19.6min
[Parallel(n_jobs=-1)]: Done 3240 out of 3240 | elapsed: 19.8min finished


Best Score: 0.762
RBM + Logistic Regression parameters
	 logistic__C: 100.000000
	 rbm__learning_rate: 0.001000
	 rbm__n_components: 100.000000
	 rbm__n_iter: 90.000000
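
The exhaustive search above fits 3,240 models and takes about 20 minutes. Sampling the same grid with a randomized search is a common way to cut that cost; a minimal sketch, not the author's original run:

from sklearn.model_selection import RandomizedSearchCV

# try 100 random parameter combinations instead of all 1,080
rs = RandomizedSearchCV(classifier, params, n_iter=100, cv=3,
                        n_jobs=-1, random_state=2017, verbose=1)
rs.fit(X_train, y_train)
print("Best Score: %0.3f" % rs.best_score_)
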
# initialize the RBM + Logistic Regression classifier (note: n_components
# and n_iter here differ slightly from the grid-search optimum above)
rbm = BernoulliRBM(n_components = 200, n_iter = 80, learning_rate = 0.001, verbose = False)
logistic = LogisticRegression(C = 100)

# train the classifier and show an evaluation report
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])
classifier.fit(X_train, y_train)

print(metrics.accuracy_score(y_train, classifier.predict(X_train)))
print(metrics.accuracy_score(y_test, classifier.predict(X_test)))
0.7768729641693811
0.8051948051948052
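
KFold and cross_val_score were imported at the top but never used; a quick k-fold check of this configuration on the full dataset would look like the following sketch (the scores will vary from the single split above):

kfold = KFold(n_splits=5, shuffle=True, random_state=2017)
scores = cross_val_score(classifier, X_scaled, y, cv=kfold)
print("CV accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))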

Confusion Matrix

matrix = confusion_matrix(y_test, classifier.predict(X_test))
# rows of confusion_matrix are actual classes, columns are predicted classes
matrix = pd.DataFrame(matrix, index=['Actual 0', 'Actual 1'],
                      columns=['Predicted 0', 'Predicted 1'])
matrix
          Predicted 0  Predicted 1
Actual 0           91           12
Actual 1           18           33
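
seaborn was imported at the top but never used; the same matrix plots nicely as an annotated heatmap, as in this sketch:

# annotate each cell of the confusion matrix with its count
sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues')
plt.show()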

Classification report


report = classification_report(y_test, classifier.predict(X_test))
print(report)



             precision    recall  f1-score   support

        0.0       0.83      0.88      0.86       103
        1.0       0.73      0.65      0.69        51

avg / total       0.80      0.81      0.80       154
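
These numbers follow directly from the confusion matrix above: for class 1, precision = 33 / (33 + 12) ≈ 0.73 and recall = 33 / (33 + 18) ≈ 0.65; for class 0, precision = 91 / (91 + 18) ≈ 0.83 and recall = 91 / (91 + 12) ≈ 0.88, matching the report.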

Actual vs. predicted class for a sample of test instances

print "\tACTUAL class (1-5) "
print y_test[1:10]
	ACTUAL class (1-5) 
[1. 1. 1. 1. 0. 0. 0. 0. 0.]
print "\tPREDICTED class( 1-5) "
print classifier.predict(X_test)[1:10]
	PREDICTED class( 1-5) 
[1. 1. 1. 1. 0. 0. 0. 0. 0.]
Mustapha Omotosho

Constant learner, machine learning enthusiast, huge Barcelona fan.
