Restricted Boltzmann Machine for Diabetes Prediction

Restricted Boltzmann machine (RBM)

A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. RBMs have been widely applied in supervised and unsupervised machine learning, including feature learning, dimensionality reduction, classification, collaborative filtering, and topic modeling. Source

Bernoulli RBM

We will use the BernoulliRBM in scikit-learn together with logistic regression. BernoulliRBM is an unsupervised method, so it is mostly used for non-linear feature extraction; the extracted features can then be fed to a classifier (in this case logistic regression).


The goal is to train a classifier for predicting diabetes.

Data source

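The data set used here is the Pima Indians Diabetes data set (768 rows, 9 columns), which can be downloaded from Kaggle. It is not the same as the diabetes regression data set bundled with scikit-learn (sklearn.datasets.load_diabetes).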

The Notebook

The notebook for this work can be found at [Restricted Boltzmann machine (RBM)]

import random
import pandas as pd
import seaborn as sns
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from numpy import set_printoptions
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = pd.read_csv('Diabetes.csv')

Dimensions of Your Data

shape = data.shape
print(shape)
(768, 9)
data.head()
   preg_count  glucose_concentration  blood_pressure  skin_thickness  serum_insulin   bmi  pedigree_function  age  class
0           6                    148              72              35              0  33.6              0.627   50      1
1           1                     85              66              29              0  26.6              0.351   31      0
2           8                    183              64               0              0  23.3              0.672   32      1
3           1                     89              66              23             94  28.1              0.167   21      0
4           0                    137              40              35            168  43.1              2.288   33      1

Descriptive Statistics

# Statistical Summary
from pandas import set_option
set_option('display.width', 100)
# use the full option name; the bare 'precision' key is ambiguous in recent pandas
set_option('display.precision', 3)
description = data.describe()
print(description)
       preg_count  glucose_concentration  blood_pressure  skin_thickness  serum_insulin      bmi  pedigree_function      age    class
count     768.000                768.000         768.000         768.000        768.000  768.000            768.000  768.000  768.000
mean        3.845                120.895          69.105          20.536         79.799   31.993              0.472   33.241    0.349
std         3.370                 31.973          19.356          15.952        115.244    7.884              0.331   11.760    0.477
min         0.000                  0.000           0.000           0.000          0.000    0.000              0.078   21.000    0.000
25%         1.000                 99.000          62.000           0.000          0.000   27.300              0.244   24.000    0.000
50%         3.000                117.000          72.000          23.000         30.500   32.000              0.372   29.000    0.000
75%         6.000                140.250          80.000          32.000        127.250   36.600              0.626   41.000    1.000
max        17.000                199.000         122.000          99.000        846.000   67.100              2.420   81.000    1.000
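
The summary shows a minimum of 0 for glucose_concentration, blood_pressure, skin_thickness, serum_insulin and bmi. A zero is not a plausible measurement for these quantities, so it may stand for a missing value; that is an assumption, since the post does not say how missing values were encoded. The sketch below counts zeros per column on a toy frame built from the first five records shown earlier (three columns only); the name `toy` is made up for this example.

```python
import pandas as pd

# Toy rows copied from the first five records shown earlier (three columns only).
toy = pd.DataFrame({
    "glucose_concentration": [148, 85, 183, 89, 137],
    "skin_thickness": [35, 29, 0, 23, 35],
    "serum_insulin": [0, 0, 0, 94, 168],
})

# Compare every cell with zero and count the matches per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

On the full data the same expression would be `(data == 0).sum()`, restricted to the columns where a zero cannot be a real measurement.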

Check Whether the Classes Are Balanced

class_counts = data.groupby('class').size()
print(class_counts)
0    500
1    268
dtype: int64
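
To put the counts in perspective, the short calculation below turns them into class shares and derives the accuracy of a trivial model that always predicts the majority class. The counts are copied from the output above; the variable names are chosen for this example.

```python
import pandas as pd

# Class counts as reported above (0 = no diabetes, 1 = diabetes).
counts = pd.Series({0: 500, 1: 268})

# Share of each class in the data set.
proportions = counts / counts.sum()
print(proportions.round(3))

# A model that always predicts the majority class reaches this accuracy,
# so it is a useful lower bar for any trained classifier.
baseline_accuracy = proportions.max()
print(round(baseline_accuracy, 3))
```

Roughly two thirds of the records belong to class 0, so the data is moderately imbalanced and accuracy alone should be read with that baseline in mind.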

Correlations Between Attributes

correlations = data.corr(method='pearson')
print(correlations)
                       preg_count  glucose_concentration  blood_pressure  skin_thickness  serum_insulin     bmi  pedigree_function     age   class
preg_count                  1.000                  0.129           0.141          -0.082         -0.074   0.018             -0.034   0.544   0.222
glucose_concentration       0.129                  1.000           0.153           0.057          0.331   0.221              0.137   0.264   0.467
blood_pressure              0.141                  0.153           1.000           0.207          0.089   0.282              0.041   0.240   0.065
skin_thickness             -0.082                  0.057           0.207           1.000          0.437   0.393              0.184  -0.114   0.075
serum_insulin              -0.074                  0.331           0.089           0.437          1.000   0.198              0.185  -0.042   0.131
bmi                         0.018                  0.221           0.282           0.393          0.198   1.000              0.141   0.036   0.293
pedigree_function          -0.034                  0.137           0.041           0.184          0.185   0.141              1.000   0.034   0.174
age                         0.544                  0.264           0.240          -0.114         -0.042   0.036              0.034   1.000   0.238
class                       0.222                  0.467           0.065           0.075          0.131   0.293              0.174   0.238   1.000
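
The last row of the matrix is the most relevant one for prediction, because it gives the linear correlation of each feature with the target. The sketch below ranks the features by that value; the numbers are copied from the table above and the names `corr_with_class` and `ranked` are made up for this example.

```python
import pandas as pd

# Correlation of each feature with `class`, copied from the last row of the table above.
corr_with_class = pd.Series({
    "preg_count": 0.222,
    "glucose_concentration": 0.467,
    "blood_pressure": 0.065,
    "skin_thickness": 0.075,
    "serum_insulin": 0.131,
    "bmi": 0.293,
    "pedigree_function": 0.174,
    "age": 0.238,
})

# Rank by absolute correlation, strongest first.
ranked = corr_with_class.abs().sort_values(ascending=False)
print(ranked)
```

With the full matrix in memory, the same ranking could be obtained from `correlations['class'].drop('class')`. Pearson correlation only captures linear relationships, so a low value here does not rule out a useful non-linear signal.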

Skew and Kurtosis of Univariate Data

skew = data.skew()
print(skew)
preg_count               0.902
glucose_concentration    0.174
blood_pressure          -1.844
skin_thickness           0.109
serum_insulin            2.272
bmi                     -0.429
pedigree_function        1.920
age                      1.130
class                    0.635
dtype: float64
kurtosis = data.kurt()
print(kurtosis)
preg_count               0.159
glucose_concentration    0.641
blood_pressure           5.180
skin_thickness          -0.520
serum_insulin            7.214
bmi                      3.290
pedigree_function        5.595
age                      0.643
class                   -1.601
dtype: float64
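
One way to read the skew values is to flag the columns whose distribution is far from symmetric. The sketch below applies a cutoff of |skew| > 1, which is a common rule of thumb and an assumption of this example rather than something prescribed by the post. The values are copied from the skew output above (the target column `class` is left out).

```python
import pandas as pd

# Skew per feature, copied from the output above.
skew = pd.Series({
    "preg_count": 0.902,
    "glucose_concentration": 0.174,
    "blood_pressure": -1.844,
    "skin_thickness": 0.109,
    "serum_insulin": 2.272,
    "bmi": -0.429,
    "pedigree_function": 1.920,
    "age": 1.130,
})

# Keep the features whose absolute skew exceeds the (assumed) cutoff of 1.
strongly_skewed = skew[skew.abs() > 1.0].sort_values()
print(strongly_skewed)
```

A negative value indicates a longer left tail and a positive value a longer right tail. Columns flagged this way are candidates for a transformation such as a log or power transform before modeling.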

Univariate Visualization

Density curves show the shape of each distribution clearly.

data.plot(kind='density', subplots=True, layout=(3,3), sharex=False, figsize=(15,6))

Multivariate Plots

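import numpy
correlations = data.corr()
# plot correlation matrix
fig = plt.figure(figsize=(15,7))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# label both axes with the column names
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(data.columns, rotation=90)
ax.set_yticklabels(data.columns)
plt.show()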

Data standardization

# Standardize data (0 mean, 1 stdev)
data_array = data.values
# separate array into input and output components
X = data_array[:,0:8]
Y = data_array[:,8]
scaler = StandardScaler()
scaledX = scaler.fit_transform(X)
# summarize transformed data (first 8 rows)
set_printoptions(precision=3)
print(scaledX[0:8,:])
[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]
 [ 0.343 -0.153  0.253 -1.288 -0.693 -0.811 -0.818 -0.276]
 [-0.251 -1.342 -0.988  0.719  0.071 -0.126 -0.676 -0.616]
 [ 1.828 -0.184 -3.573 -1.288 -0.693  0.42  -1.02  -0.361]]
X = data_array[:,0:8]
y = data_array[:,8]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=2017)

print(X.shape)
print(y.shape)
(768, 8)
array([ 0.64 , -0.059, -0.988,  0.092,  0.835, -0.621,  2.555, -0.02 ])
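
The code above scales the whole data set before splitting it, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the arrays, the `_demo` names and the use of `stratify` (which keeps the class ratio equal in both splits) are assumptions of this example, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 8 features, 35% positives.
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 8))
y_demo = np.array([0] * 130 + [1] * 70)

# Split first; stratify keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2017, stratify=y_demo)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Placing the scaler inside the Pipeline has the same effect during cross-validation, because the pipeline refits every step on each training fold.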

Building Machine Learning Pipelines

# initialize the RBM + Logistic Regression pipeline
rbm = BernoulliRBM()
logistic = LogisticRegression()
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])

params = {
    "rbm__learning_rate": [0.1, 0.01, 0.001,0.05,0.005],
    "rbm__n_iter": [20, 40,50, 80,90,100],
    "rbm__n_components": [20,30,50, 100,80, 200],
    "logistic__C": [1.0,5.0 ,10.0,20.0,30.0,100.0]}

# perform a grid search over the parameters (3-fold cross-validation)
gs = GridSearchCV(classifier, params, cv=3, n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)

print("Best Score: %0.3f" % gs.best_score_)

print("RBM + Logistic Regression parameters")
bestParams = gs.best_estimator_.get_params()

# loop over the parameters and print each of them out
# so they can be manually set
for p in sorted(params.keys()):
    print("\t %s: %f" % (p, bestParams[p]))
Fitting 3 folds for each of 1080 candidates, totalling 3240 fits

Best Score: 0.762
RBM + Logistic Regression parameters
	 logistic__C: 100.000000
	 rbm__learning_rate: 0.001000
	 rbm__n_components: 100.000000
	 rbm__n_iter: 90.000000
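
The number of fits reported in the log follows directly from the size of the grid. The sketch below reproduces the arithmetic with scikit-learn's ParameterGrid; the dictionary repeats the grid from above under the name `grid_params`, and the fold count of 3 is taken from the log line.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so the example is self-contained.
grid_params = {
    "rbm__learning_rate": [0.1, 0.01, 0.001, 0.05, 0.005],
    "rbm__n_iter": [20, 40, 50, 80, 90, 100],
    "rbm__n_components": [20, 30, 50, 100, 80, 200],
    "logistic__C": [1.0, 5.0, 10.0, 20.0, 30.0, 100.0],
}

# Every combination of values is one candidate: 5 * 6 * 6 * 6.
n_candidates = len(ParameterGrid(grid_params))

# Each candidate is fitted once per cross-validation fold.
n_folds = 3
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Because the cost grows multiplicatively with every added value, a randomized search over the same ranges may be a cheaper option when the grid becomes this large.
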
# initialize the RBM + Logistic Regression classifier; C and learning_rate
# follow the grid search above, while n_components and n_iter are set to
# 200 and 80 rather than the reported best values of 100 and 90
rbm = BernoulliRBM(n_components=200, n_iter=80, learning_rate=0.001, verbose=False)
logistic = LogisticRegression(C=100)

# train the classifier and show an evaluation report
classifier = Pipeline([("rbm", rbm), ("logistic", logistic)])
classifier.fit(X_train, y_train)

print(metrics.accuracy_score(y_train, classifier.predict(X_train)))
print(metrics.accuracy_score(y_test, classifier.predict(X_test)))

Confusion Matrix

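matrix = confusion_matrix(y_test, classifier.predict(X_test))
# rows are actual classes, columns are predicted classes (scikit-learn convention)
matrix = pd.DataFrame(matrix, index=['Actual Negative','Actual Positive'], columns=['Predicted Negative','Predicted Positive'])
print(matrix)
                 Predicted Negative  Predicted Positive
Actual Negative                  91                  12
Actual Positive                  18                  33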
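
The usual summary metrics can be derived by hand from these four counts. The sketch below assumes scikit-learn's layout for `confusion_matrix`, in which rows are the actual classes and columns the predicted classes, both ordered 0 then 1; the counts are copied from the output above.

```python
import numpy as np

# Counts copied from the confusion matrix above.
# Row 0 = actual class 0, row 1 = actual class 1; columns = predicted class.
cm = np.array([[91, 12],
               [18, 33]])

# Flattening a 2x2 matrix row by row yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # share of correct predictions
precision_pos = tp / (tp + fp)           # how many predicted positives are real
recall_pos = tp / (tp + fn)              # how many real positives are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 3), round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
```

The precision, recall and F1 values computed here for class 1 agree with the corresponding row of the classification report. The 18 false negatives are the diabetic cases the model misses, which is usually the costlier error in a screening setting.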

Classification report

report = classification_report(y_test, classifier.predict(X_test))
print(report)

             precision    recall  f1-score   support

        0.0       0.83      0.88      0.86       103
        1.0       0.73      0.65      0.69        51

avg / total       0.80      0.81      0.80       154

Actual Class vs. Predicted Class (Test Samples at Index 1 to 9)

print("\tACTUAL class (index 1-9)")
print(y_test[1:10])
	ACTUAL class (index 1-9)
[1. 1. 1. 1. 0. 0. 0. 0. 0.]
print("\tPREDICTED class (index 1-9)")
print(classifier.predict(X_test)[1:10])
	PREDICTED class (index 1-9)
[1. 1. 1. 1. 0. 0. 0. 0. 0.]
