Andrew Beam does a great job showing that small datasets are not off limits for current neural net methods. If you use the regularisation methods at hand – ANNs is entirely possible to use instead of classic methods.

Let’s see how this holds up on up on some benchmark datasets.

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')

seed = 123456
np.random.seed(seed)


Let’s start with the iris dataset that you nicely can pull with the pandas read_csv function right of the internets.

target_variable = 'species'
df = (

# Rename columns to lowercase and underscores
.pipe(lambda d: d.rename(columns={
k: v for k, v in zip(
d.columns,
[c.lower().replace(' ', '_') for c in d.columns]
)
}))
# Switch categorical classes to integers
.assign(**{target_variable: lambda r: r[target_variable].astype('category').cat.codes})
)


There’s three classes and 150 datapoints. Not really B̫͕̟̱I̜̼͈̖̫G͉ d̙͕a͇͍͕̝̟t̪̝̹̻͉̭ͅa.

df[target_variable].value_counts().sort_index().plot.bar()


We create a feature matrix X and a target y from the Pandas dataframe. And since an ANN needs the features to be normalized, let’s do some min-max-scaling before anything else.

y = df[target_variable].values
X = (
# Drop target variable
df.drop(target_variable, axis=1)
# Min-max-scaling (only needed for the DL model)
.pipe(lambda d: (d-d.min())/d.max()).fillna(0)
.as_matrix()
)


We split into a training set and a test set.

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=seed
)


Import some keras goodness (and perhaps run pip install keras first if you need it).

from keras.models import Sequential
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Activation, Dropout
from keras import optimizers


And set up a thee layer deep and 128 wide net. Nothing fancy, not even sure this would qualify as deep learning – but throw in some dropout between them to help it to not overfit.

Learning rate for the optimization method Adam might be something to tune on other datasets but here 0.001 seems to work nicely.

m = Sequential()

m.compile(
loss='categorical_crossentropy',
metrics=['accuracy']
)


EarlyStopping helps us stop the training when the validation set is not improving any more – which helps us avoid overfitting. And to keep the checkpoint just before overfitting occurs, ModelCheckpoints let’s us save the best model before decline in validation set performance.

m.fit(
# Feature matrix
X_train,
# Target class one-hot-encoded
pd.get_dummies(pd.DataFrame(y_train), columns=[0]).as_matrix(),
# Iterations to be run if not stopped by EarlyStopping
epochs=200,
callbacks=[
# Stop iterations when validation loss has not improved
EarlyStopping(monitor='val_loss', patience=25),
# Nice for keeping the last model before overfitting occurs
ModelCheckpoint(
'best.model',
monitor='val_loss',
save_best_only=True,
verbose=1
)
],
verbose=2,
validation_split=0.1,
batch_size=256,
)

# Load the best model

# Keep track of what class corresponds to what index
mapping = (
pd.get_dummies(pd.DataFrame(y_train), columns=[0], prefix='', prefix_sep='')
.columns.astype(int).values
)
y_test_preds = [mapping[pred] for pred in m.predict(X_test).argmax(axis=1)]


Now we can assess the performance on the test set. Below is a confusion matrix showing all predictions held up to reality.

pd.crosstab(
pd.Series(y_test, name='Actual'),
pd.Series(y_test_preds, name='Predicted'),
margins=True
)

Predicted 0 1 2 All
Actual
0 18 0 0 18
1 0 16 0 16
2 0 0 16 16
All 18 16 16 50
print 'Accuracy: {0:.3f}'.format(accuracy_score(y_test, y_test_preds))

Accuracy: 1.000


Actually a perfect score. We will now do the same with an good old xgboost (conda install xgboost) with the nice sklearn api.

Finding the right hyperparameters is a task well suited for an Bayesian approach that can test the alternatives in an effective way without any gradient. GridSearch and such takes alot of time – this way we instead give it a parameter space and a “budget”. It will then in it’s most cost effective way test the hyperparameters of xgboost under those constraints.

from xgboost.sklearn import XGBClassifier

params_fixed = {
'objective': 'binary:logistic',
'silent': 1,
'seed': seed,
}

# The space to search
space = {
'max_depth': (1, 5),
'learning_rate': (10**-4, 10**-1),
'n_estimators': (10, 200),
'min_child_weight': (1, 20),
'subsample': (0, 1),
'colsample_bytree': (0.3, 1)
}

reg = XGBClassifier(**params_fixed)

def objective(params):
""" Wrap a cross validated inverted accuracy as objective func """
reg.set_params(**{k: p for k, p in zip(space.keys(), params)})
return 1-np.mean(cross_val_score(
reg, X_train, y_train, cv=5, n_jobs=-1,
scoring='accuracy')
)


For this we use skopt (pip install scikit-optimize). I’ve given it 50 iterations to explore the hyperparameter space. Might be some more performance to squeeze out, but probably not.

from skopt import gp_minimize
res_gp = gp_minimize(objective, space.values(), n_calls=50, random_state=seed)
best_hyper_params = {k: v for k, v in zip(space.keys(), res_gp.x)}

print "Best accuracy score =", 1-res_gp.fun
print "Best parameters =", best_hyper_params

Best accuracy score = 0.96
Best parameters = {'colsample_bytree': 1.0, 'learning_rate': 0.10000000000000001, 'min_child_weight': 5, 'n_estimators': 45, 'subsample': 1, 'max_depth': 5}

from skopt.plots import plot_convergence
plot_convergence(res_gp)


Now let’s fix these hyperparameters and evaluate on the test set. This is exactly the same test set as Keras got to be clear.

params = best_hyper_params.copy()
params.update(params_fixed)

clf = XGBClassifier(**params)

clf.fit(X_train, y_train)
y_test_preds = clf.predict(X_test)

pd.crosstab(
pd.Series(y_test, name='Actual'),
pd.Series(y_test_preds, name='Predicted'),
margins=True
)

Predicted 0 1 2 All
Actual
0 18 0 0 18
1 0 15 1 16
2 0 2 14 16
All 18 17 15 50
print 'Accuracy: {0:.3f}'.format(accuracy_score(y_test, y_test_preds))

Accuracy: 0.940


The deep (?) net got all datapoints right while xgboost missed three of them. On the other hand if you change the seed and rerun the code it might as well be xgboost coming up on top so I wouldn’t read to much into it.

Let’s generalize the code above so that we can plug in any dataset of choosing and see if this holds for harder problems. While we’re at it, I’ve added a boostrap on the accuracy statistic to give a feel about the uncertanty. I’ve put the code in this gist since it’s more or less just a repetition of the code above.

## Telecom churn dataset (n=2325)

compare_on_dataset(
target_variable='churn',
lr=0.0005,
patience=5
)


#### ANN

Predicted 0 1 All
Actual
0 1478 270 1748
1 218 359 577
All 1696 629 2325
Accuracy: 0.790
Boostrapped accuracy 95 % interval 0.770223752151 0.809810671256


#### Xgboost

Predicted 0 1 All
Actual
0 1563 185 1748
1 265 312 577
All 1828 497 2325
Accuracy: 0.806
Boostrapped accuracy 95 % interval 0.78743545611 - 0.825301204819


Churn is a bit harder problem, but both methods do well though.

## Three class wine dataset (n=59)

compare_on_dataset(
'https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv',
target_variable='wine',
lr=0.001
)


#### ANN

Predicted 0 1 2 All
Actual
0 18 1 0 19
1 0 19 0 19
2 0 0 21 21
All 18 20 21 59
Accuracy: 0.983
Boostrapped accuracy 95 % interval 0.931034482759 1.0


#### Xgboost

Predicted 0 1 2 All
Actual
0 19 0 0 19
1 0 19 0 19
2 0 0 21 21
All 19 19 21 59
Accuracy: 1.000
Boostrapped accuracy 95 % interval 1.0 1.0


A very easy dataset that both methods have no problem with. Since the sample size is so small the boostrap will not be much of use to us here.

## German Credit Data (n=1000)

compare_on_dataset(
'https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/german_credit.csv',
target_variable='creditability',
lr=0.001,
patience=5
)


#### ANN

Predicted 0 1 All
Actual
0 43 45 88
1 28 214 242
All 71 259 330
Accuracy: 0.779
Boostrapped accuracy 95 % interval 0.727272727273 0.830303030303


#### Xgboost

Predicted 0 1 All
Actual
0 37 51 88
1 25 217 242
All 62 268 330
Accuracy: 0.770
Boostrapped accuracy 95 % interval 0.715151515152 0.824242424242


So sometimes the ANN comes out on top, and sometime it’s xgboost. I think it’s fair to say that ANNs, controlled for overfitting/overtraining works kinda great even small data. At least pair with xgboost.

And the necessary hyperparameter tuning of xgboost is a pain since it really takes time. On these datasets, training the ANN takes no time at all. So let’s see if ANNs will start to eat small data also anytime soon.