Using ANNs on small data – Deep Learning vs. Xgboost
Andrew Beam does a great job of showing that small datasets are not off limits for current neural net methods. With the regularisation methods at hand, it’s entirely possible to use ANNs instead of classical methods.
Let’s see how this holds up on some benchmark datasets.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
seed = 123456
np.random.seed(seed)
Let’s start with the iris dataset, which you can conveniently pull with the pandas read_csv
function right off the internets.
target_variable = 'species'
df = (
pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/d546eaee765268bf2f487608c537c05e22e4b221/iris.csv')
# Rename columns to lowercase with underscores
.rename(columns=lambda c: c.lower().replace(' ', '_'))
# Switch categorical classes to integers
.assign(**{target_variable: lambda r: r[target_variable].astype('category').cat.codes})
)
There are three classes and 150 data points. Not really B̫͕̟̱I̜̼͈̖̫G͉ d̙͕a͇͍͕̝̟t̪̝̹̻͉̭ͅa.
df[target_variable].value_counts().sort_index().plot.bar()
We create a feature matrix X and a target y from the pandas dataframe. And since an ANN needs the features to be normalized, let’s do some min-max scaling before anything else.
y = df[target_variable].values
X = (
# Drop target variable
df.drop(target_variable, axis=1)
# Min-max scaling (only needed for the DL model)
.pipe(lambda d: (d - d.min()) / (d.max() - d.min())).fillna(0)
.values
)
We split into a training set and a test set.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=seed
)
Import some keras goodness (and perhaps run pip install keras first if you need it).
from keras.models import Sequential
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Activation, Dropout
from keras import optimizers
And set up a three-layer-deep and 128-wide net. Nothing fancy – not even sure this would qualify as deep learning – but throw in some dropout between the layers to help it not overfit. The learning rate for the Adam optimizer might be something to tune on other datasets, but here 0.001 seems to work nicely.
m = Sequential()
m.add(Dense(128, activation='relu', input_shape=(X.shape[1],)))
m.add(Dropout(0.5))
m.add(Dense(128, activation='relu'))
m.add(Dropout(0.5))
m.add(Dense(128, activation='relu'))
m.add(Dropout(0.5))
m.add(Dense(len(np.unique(y)), activation='softmax'))
m.compile(
optimizer=optimizers.Adam(lr=0.001),
loss='categorical_crossentropy',
metrics=['accuracy']
)
EarlyStopping helps us stop the training when the validation set is no longer improving, which helps us avoid overfitting. And ModelCheckpoint lets us save the best model along the way, so we keep the checkpoint from just before validation performance starts to decline.
m.fit(
# Feature matrix
X_train,
# Target class one-hot-encoded
pd.get_dummies(pd.DataFrame(y_train), columns=[0]).values,
# Iterations to be run if not stopped by EarlyStopping
epochs=200,
callbacks=[
# Stop iterations when validation loss has not improved
EarlyStopping(monitor='val_loss', patience=25),
# Nice for keeping the last model before overfitting occurs
ModelCheckpoint(
'best.model',
monitor='val_loss',
save_best_only=True,
verbose=1
)
],
verbose=2,
validation_split=0.1,
batch_size=256,
)
# Load the best model
m.load_weights("best.model")
# Keep track of what class corresponds to what index
mapping = (
pd.get_dummies(pd.DataFrame(y_train), columns=[0], prefix='', prefix_sep='')
.columns.astype(int).values
)
y_test_preds = [mapping[pred] for pred in m.predict(X_test).argmax(axis=1)]
Now we can assess the performance on the test set. Below is a confusion matrix showing all predictions held up against reality.
pd.crosstab(
pd.Series(y_test, name='Actual'),
pd.Series(y_test_preds, name='Predicted'),
margins=True
)
| Actual \ Predicted | 0 | 1 | 2 | All |
|---|---|---|---|---|
| 0 | 18 | 0 | 0 | 18 |
| 1 | 0 | 16 | 0 | 16 |
| 2 | 0 | 0 | 16 | 16 |
| All | 18 | 16 | 16 | 50 |
print('Accuracy: {0:.3f}'.format(accuracy_score(y_test, y_test_preds)))
Accuracy: 1.000
Actually a perfect score. We will now do the same with a good old xgboost (conda install xgboost) with the nice sklearn API.
Finding the right hyperparameters is a task well suited for a Bayesian approach, which can test the alternatives effectively without any gradient. GridSearch and the like take a lot of time – this way we instead give it a parameter space and a “budget”, and it will test the hyperparameters of xgboost in its most cost-effective way under those constraints.
from xgboost.sklearn import XGBClassifier
params_fixed = {
'objective': 'binary:logistic',  # XGBClassifier switches to a multiclass objective by itself when it sees more than two classes
'silent': 1,
'seed': seed,
}
# The space to search
space = {
'max_depth': (1, 5),
'learning_rate': (10**-4, 10**-1),
'n_estimators': (10, 200),
'min_child_weight': (1, 20),
'subsample': (0.1, 1.0),  # floats make skopt treat this as a continuous dimension; subsample must be > 0
'colsample_bytree': (0.3, 1)
}
reg = XGBClassifier(**params_fixed)
def objective(params):
    """Wrap a cross-validated, inverted accuracy as the objective function."""
    # dict order is stable in Python 3.7+, so space.keys() lines up with
    # the dimensions passed to gp_minimize via space.values()
    reg.set_params(**{k: p for k, p in zip(space.keys(), params)})
    return 1 - np.mean(cross_val_score(
        reg, X_train, y_train, cv=5, n_jobs=-1,
        scoring='accuracy'
    ))
For this we use skopt (pip install scikit-optimize). I’ve given it 50 iterations to explore the hyperparameter space. There might be some more performance to squeeze out, but probably not much.
from skopt import gp_minimize
res_gp = gp_minimize(objective, list(space.values()), n_calls=50, random_state=seed)
best_hyper_params = {k: v for k, v in zip(space.keys(), res_gp.x)}
print("Best accuracy score =", 1 - res_gp.fun)
print("Best parameters =", best_hyper_params)
Best accuracy score = 0.96
Best parameters = {'colsample_bytree': 1.0, 'learning_rate': 0.10000000000000001, 'min_child_weight': 5, 'n_estimators': 45, 'subsample': 1, 'max_depth': 5}
from skopt.plots import plot_convergence
plot_convergence(res_gp)
Now let’s fix these hyperparameters and evaluate on the test set. To be clear, this is exactly the same test set the Keras model was evaluated on.
params = best_hyper_params.copy()
params.update(params_fixed)
clf = XGBClassifier(**params)
clf.fit(X_train, y_train)
y_test_preds = clf.predict(X_test)
pd.crosstab(
pd.Series(y_test, name='Actual'),
pd.Series(y_test_preds, name='Predicted'),
margins=True
)
| Actual \ Predicted | 0 | 1 | 2 | All |
|---|---|---|---|---|
| 0 | 18 | 0 | 0 | 18 |
| 1 | 0 | 15 | 1 | 16 |
| 2 | 0 | 2 | 14 | 16 |
| All | 18 | 17 | 15 | 50 |
print('Accuracy: {0:.3f}'.format(accuracy_score(y_test, y_test_preds)))
Accuracy: 0.940
The deep (?) net got all data points right while xgboost missed three of them. On the other hand, if you change the seed and rerun the code it might just as well be xgboost coming out on top, so I wouldn’t read too much into it.
Let’s generalize the code above so that we can plug in any dataset of our choosing and see if this holds for harder problems. While we’re at it, I’ve added a bootstrap on the accuracy statistic to give a feel for the uncertainty. I’ve put the code in this gist since it’s more or less just a repetition of the code above; a rough sketch of the bootstrap part is shown below.
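The bootstrapped interval can be computed with a simple percentile bootstrap over the test-set predictions. A minimal sketch of how that could look (a hypothetical helper, not the exact gist code):
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=123456):
    """Percentile bootstrap for accuracy (hypothetical helper, not the gist code)."""
    rng = np.random.RandomState(seed)
    # Per-observation 0/1 hits, resampled with replacement n_boot times
    hits = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    idx = rng.randint(0, len(hits), size=(n_boot, len(hits)))
    accs = hits[idx].mean(axis=1)
    # Empirical (alpha/2, 1 - alpha/2) quantiles of the resampled accuracies
    return np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
Called as bootstrap_accuracy_ci(y_test, y_test_preds), this returns the kind of lower and upper bounds reported below each confusion matrix.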
Telecom churn dataset (n=2325)
compare_on_dataset(
'https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv?cm_mc_uid=06267660176214972094054&cm_mc_sid_50200000=1497209405&cm_mc_sid_52640000=1497209405',
target_variable='churn',
lr=0.0005,
patience=5
)
ANN
| Actual \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 1478 | 270 | 1748 |
| 1 | 218 | 359 | 577 |
| All | 1696 | 629 | 2325 |
Accuracy: 0.790
Bootstrapped accuracy 95 % interval 0.770223752151 - 0.809810671256
Xgboost
| Actual \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 1563 | 185 | 1748 |
| 1 | 265 | 312 | 577 |
| All | 1828 | 497 | 2325 |
Accuracy: 0.806
Bootstrapped accuracy 95 % interval 0.78743545611 - 0.825301204819
Churn is a somewhat harder problem, but both methods do well here too.
Three-class wine dataset (n=59)
compare_on_dataset(
'https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv',
target_variable='wine',
lr=0.001
)
ANN
| Actual \ Predicted | 0 | 1 | 2 | All |
|---|---|---|---|---|
| 0 | 18 | 1 | 0 | 19 |
| 1 | 0 | 19 | 0 | 19 |
| 2 | 0 | 0 | 21 | 21 |
| All | 18 | 20 | 21 | 59 |
Accuracy: 0.983
Bootstrapped accuracy 95 % interval 0.931034482759 - 1.0
Xgboost
| Actual \ Predicted | 0 | 1 | 2 | All |
|---|---|---|---|---|
| 0 | 19 | 0 | 0 | 19 |
| 1 | 0 | 19 | 0 | 19 |
| 2 | 0 | 0 | 21 | 21 |
| All | 19 | 19 | 21 | 59 |
Accuracy: 1.000
Bootstrapped accuracy 95 % interval 1.0 - 1.0
A very easy dataset that neither method has any problem with. Since the sample size is so small, the bootstrap will not be of much use to us here.
German Credit Data (n=1000)
compare_on_dataset(
'https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/german_credit.csv',
target_variable='creditability',
lr=0.001,
patience=5
)
ANN
| Actual \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 43 | 45 | 88 |
| 1 | 28 | 214 | 242 |
| All | 71 | 259 | 330 |
Accuracy: 0.779
Bootstrapped accuracy 95 % interval 0.727272727273 - 0.830303030303
Xgboost
| Actual \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 37 | 51 | 88 |
| 1 | 25 | 217 | 242 |
| All | 62 | 268 | 330 |
Accuracy: 0.770
Bootstrapped accuracy 95 % interval 0.715151515152 - 0.824242424242
So sometimes the ANN comes out on top, and sometimes it’s xgboost. I think it’s fair to say that ANNs, when controlled for overfitting/overtraining, work pretty well even on small data – at least on par with xgboost.
And the necessary hyperparameter tuning of xgboost is a pain, since it really takes time. On these datasets, training the ANN takes no time at all. So let’s see if ANNs start to eat small data too anytime soon.