19 Hyper-Parameter Optimization

by Magnus Erik Hvass Pedersen / GitHub / Videos on YouTube

Introduction

There are many parameters you can select when building and training a Neural Network in TensorFlow. These are often called Hyper-Parameters. For example, there is a hyper-parameter for how many layers the network should have, and another hyper-parameter for how many nodes per layer, and another hyper-parameter for the activation function to use, etc. The optimization method also has one or more hyper-parameters you can select, such as the learning-rate.

One way of searching for good hyper-parameters is by hand-tuning: you try one set of parameters and see how they perform, then you try another set and see if the performance improves. You try to build an intuition for what works well and guide your parameter-search accordingly. Not only is this extremely time-consuming for a human researcher, but the optimal parameters are often counter-intuitive to humans, so you may never find them this way!

Another way of searching for good hyper-parameters is to divide each parameter’s valid range into evenly spaced values, and then simply have the computer try all combinations of parameter-values. This is called Grid Search. Although it is run entirely by the computer, it quickly becomes extremely time-consuming because the number of parameter-combinations increases exponentially as you add more hyper-parameters. This problem is known as the Curse of Dimensionality. For example, if you have just 4 hyper-parameters to tune and each of them is allowed 10 possible values, then there is a total of 10^4 parameter-combinations. If you add just one more hyper-parameter then there are 10^5 parameter-combinations, and so on.
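The combinatorial blow-up is easy to demonstrate in plain Python. This is a toy sketch: the parameter values are just placeholder integers, not real hyper-parameters.

```python
import itertools

# Grid Search must visit every combination, so the number of
# evaluations grows exponentially with the number of hyper-parameters.
values_per_param = 10
grid_sizes = {}
for num_params in range(1, 6):
    grid = itertools.product(range(values_per_param), repeat=num_params)
    grid_sizes[num_params] = sum(1 for _ in grid)
print(grid_sizes)
# → {1: 10, 2: 100, 3: 1000, 4: 10000, 5: 100000}
```

Each extra hyper-parameter multiplies the grid size by another factor of 10, exactly as described above.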

Yet another way of searching for good hyper-parameters is by random search. Instead of systematically trying every single parameter-combination as in Grid Search, we now try a number of parameter-combinations completely at random. This is like searching for “a needle in a haystack” and as the number of parameters increases, the probability of finding the optimal parameter-combinations by random sampling decreases to zero.
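Random search can be sketched in a few lines of plain Python. The toy_score function below is a hypothetical stand-in for the expensive train-and-evaluate step, not part of this tutorial's code:

```python
import random

random.seed(0)

def toy_score(lr, layers):
    # Hypothetical stand-in for training and evaluating a network:
    # best around lr=1e-3 with 2 layers.
    return -abs(lr - 1e-3) - abs(layers - 2)

# Try 20 parameter-combinations drawn at random instead of a full grid.
trials = [(10 ** random.uniform(-6, -2), random.randint(1, 5))
          for _ in range(20)]
best_lr, best_layers = max(trials, key=lambda p: toy_score(*p))
```

With only a handful of samples in a large search-space, the chance of landing near the optimum is small, which is the weakness described above.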

This tutorial uses a clever method for finding good hyper-parameters known as Bayesian Optimization. You should be familiar with TensorFlow, Keras and Convolutional Neural Networks, see Tutorials #01, #02 and #03-C.

Flowchart

The problem with hyper-parameter optimization is that it is extremely costly to assess the performance of a set of parameters. This is because we first have to build the corresponding neural network, then we have to train it, and finally we have to measure its performance on a validation-set. In this tutorial we will use the small MNIST problem so this training can be done very quickly, but on more realistic problems the training may take hours, days or even weeks on a very fast computer. So we need an optimization method that can search for hyper-parameters as efficiently as possible, by only evaluating the actual performance when absolutely necessary.

The idea with Bayesian optimization is to construct another model of the search-space for hyper-parameters. One kind of model is known as a Gaussian Process. This gives us an estimate of how the performance varies with changes to the hyper-parameters. Whenever we evaluate the actual performance for a set of hyper-parameters, we know for a fact what the performance is - except perhaps for some noise. We can then ask the Bayesian optimizer to give us a new suggestion for hyper-parameters in a region of the search-space that we haven’t explored yet, or hyper-parameters that the Bayesian optimizer thinks will bring us most improvement. We then repeat this process a number of times until the Bayesian optimizer has built a good model of how the performance varies with different hyper-parameters, so we can choose the best parameters.

The flowchart of the algorithm is roughly:

Flowchart

Imports

%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import math

We need to import several things from Keras.

# from tf.keras.models import Sequential  # This does not work!
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import InputLayer, Input
from tensorflow.python.keras.layers import Reshape, MaxPooling2D
from tensorflow.python.keras.layers import Conv2D, Dense, Flatten
from tensorflow.python.keras.callbacks import TensorBoard
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.models import load_model

NOTE: We will save and load models using Keras so you need to have h5py installed. You also need to have scikit-optimize installed for doing the hyper-parameter optimization.

You should be able to run the following command in a terminal to install them both:

pip install h5py scikit-optimize

NOTE: This Notebook requires plotting functions in scikit-optimize that have not been merged into the official release at the time of this writing. If this Notebook cannot run with the version of scikit-optimize installed by the command above, you may have to install scikit-optimize from a development branch by running the following command instead:

pip install git+git://github.com/Hvass-Labs/scikit-optimize.git@dd7433da068b5a2509ef4ea4e5195458393e6555

import skopt
from skopt import gp_minimize, forest_minimize
from skopt.space import Real, Categorical, Integer
from skopt.plots import plot_convergence
from skopt.plots import plot_objective, plot_evaluations
from skopt.plots import plot_histogram, plot_objective_2D
from skopt.utils import use_named_args

This was developed using Python 3.6 (Anaconda) and package versions:

tf.__version__
'1.4.0'
tf.keras.__version__
'2.0.8-tf'
skopt.__version__
'0.4'

Hyper-Parameters

In this tutorial we want to find the hyper-parameters that make a simple Convolutional Neural Network perform best at classifying the MNIST dataset of hand-written digits.

For this demonstration we want to find the following hyper-parameters:

  • The learning-rate of the optimizer.
  • The number of fully-connected / dense layers.
  • The number of nodes for each of the dense layers.
  • Whether to use ‘sigmoid’ or ‘relu’ activation in all the layers.

We will use the Python package scikit-optimize (or skopt) for finding the best choices of these hyper-parameters. Before we begin with the actual search for hyper-parameters, we first need to define the valid search-ranges or search-dimensions for each of these parameters.

This is the search-dimension for the learning-rate. It is a real number (floating-point) with a lower bound of 1e-6 and an upper bound of 1e-2. But instead of searching between these bounds directly, we use a logarithmic transformation, so we search for the exponent k in 10^k (written 1ek in scientific notation), which is only bounded between -6 and -2. This is better than searching the raw range directly, where uniform sampling would almost never try the smallest values.

dim_learning_rate = Real(low=1e-6, high=1e-2, prior='log-uniform',
                         name='learning_rate')
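The effect of the log-uniform prior can be mimicked in plain Python by sampling the exponent uniformly and then exponentiating. This is only a sketch of the idea, not skopt's internals:

```python
import random

random.seed(0)

# Sample the exponent k uniformly in [-6, -2] and return 10**k,
# so every decade between 1e-6 and 1e-2 is equally likely.
def sample_log_uniform(low_exp=-6, high_exp=-2):
    k = random.uniform(low_exp, high_exp)
    return 10.0 ** k

samples = [sample_log_uniform() for _ in range(1000)]
```

Roughly half the samples fall below 1e-4, whereas sampling the raw range [1e-6, 1e-2] uniformly would put almost all samples near the top.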

This is the search-dimension for the number of dense layers in the neural network. This is an integer and we want at least 1 dense layer and at most 5 dense layers in the neural network.

dim_num_dense_layers = Integer(low=1, high=5, name='num_dense_layers')

This is the search-dimension for the number of nodes for each dense layer. This is also an integer and we want at least 5 and at most 512 nodes in each layer of the neural network.

dim_num_dense_nodes = Integer(low=5, high=512, name='num_dense_nodes')

This is the search-dimension for the activation-function. This is a combinatorial or categorical parameter which can be either ‘relu’ or ‘sigmoid’.

dim_activation = Categorical(categories=['relu', 'sigmoid'],
                             name='activation')

We then combine all these search-dimensions into a list.

dimensions = [dim_learning_rate,
              dim_num_dense_layers,
              dim_num_dense_nodes,
              dim_activation]

It is helpful to start the search for hyper-parameters with a decent choice that we have found by hand-tuning. But we will use the following parameters that do not perform so well, so as to better demonstrate the usefulness of hyper-parameter optimization: A learning-rate of 1e-5, a single dense layer with 16 nodes, and relu activation-functions.

Note that these hyper-parameters are packed into a single list. This is how skopt works internally on hyper-parameters. You therefore need to ensure that the order of the values is consistent with the order of the search-dimensions in dimensions above.

default_parameters = [1e-5, 1, 16, 'relu']
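Because the flat list only has meaning through the positions of its elements, it can help to pair it with the dimension names to make the ordering explicit. This is just an illustration; the dimension_names list below is written out by hand to match the dimensions above:

```python
# The flat parameter list only has meaning through its positions,
# so pair it with the dimension names to make the order explicit.
dimension_names = ['learning_rate', 'num_dense_layers',
                   'num_dense_nodes', 'activation']
default_parameters = [1e-5, 1, 16, 'relu']

params = dict(zip(dimension_names, default_parameters))
print(params['learning_rate'])  # → 1e-05
```

If the list were given in the wrong order, e.g. with the activation first, skopt would silently interpret the values as the wrong hyper-parameters.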

Helper-function for log-dir-name

We will log the training-progress for all parameter-combinations so they can be viewed and compared using TensorBoard. This is done by setting a common parent-dir and then having a sub-dir with an appropriate name for each parameter-combination.

def log_dir_name(learning_rate, num_dense_layers,
                 num_dense_nodes, activation):

    # The dir-name for the TensorBoard log-dir.
    s = "./19_logs/lr_{0:.0e}_layers_{1}_nodes_{2}_{3}/"

    # Insert all the hyper-parameters in the dir-name.
    log_dir = s.format(learning_rate,
                       num_dense_layers,
                       num_dense_nodes,
                       activation)

    return log_dir
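For the default parameters above, the format string produces a dir-name like the following (a quick stand-alone check of the formatting, using the same format string as the function):

```python
# Same format string as in log_dir_name() above.
s = "./19_logs/lr_{0:.0e}_layers_{1}_nodes_{2}_{3}/"
log_dir = s.format(1e-5, 1, 16, 'relu')
print(log_dir)  # → ./19_logs/lr_1e-05_layers_1_nodes_16_relu/
```

The {0:.0e} specifier writes the learning-rate in compact scientific notation, so the dir-names stay short even for very small learning-rates.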

Load Data

The MNIST data-set is about 12 MB and will be downloaded automatically if it is not located in the given path.

from tensorflow.examples.tutorials.mnist import input_data
data = input_data.read_data_sets('data/MNIST/', one_hot=True)
Extracting data/MNIST/train-images-idx3-ubyte.gz
Extracting data/MNIST/train-labels-idx1-ubyte.gz
Extracting data/MNIST/t10k-images-idx3-ubyte.gz
Extracting data/MNIST/t10k-labels-idx1-ubyte.gz

The MNIST data-set has now been loaded and consists of 70,000 images and associated labels (i.e. classifications of the images). The data-set is split into 3 mutually exclusive sub-sets.

print("Size of:")
print("- Training-set:\t\t{}".format(len(data.train.labels)))
print("- Test-set:\t\t{}".format(len(data.test.labels)))
print("- Validation-set:\t{}".format(len(data.validation.labels)))
Size of:
- Training-set:     55000
- Test-set:     10000
- Validation-set:   5000

The class-labels are One-Hot encoded, which means that each label is a vector with 10 elements, all of which are zero except for one element. The index of this one element is the class-number, that is, the digit shown in the associated image. We also need the class-numbers as integers for the test-set, so we calculate it now.

data.test.cls = np.argmax(data.test.labels, axis=1)
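np.argmax simply finds the index of the largest element in each label-vector; for a one-hot label that index is the class-number. A plain-Python illustration for a single label:

```python
# One-hot label for the digit 7: all zeros except index 7.
label = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

# argmax over the label recovers the class-number.
cls = max(range(len(label)), key=label.__getitem__)
print(cls)  # → 7
```

The NumPy call above does the same thing for all 10,000 test-labels at once, along axis 1.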

We use the performance on the validation-set as an indication of which choice of hyper-parameters performs the best on previously unseen data. The Keras API needs the validation-set as a tuple.

validation_data = (data.validation.images, data.validation.labels)

Data Dimensions

The data dimensions are used in several places in the source-code below. They are defined once here, so we can use these named variables instead of magic numbers throughout the source-code.

# We know that MNIST images are 28 pixels in each dimension.
img_size = 28

# Images are stored in one-dimensional arrays of this length.
img_size_flat = img_size * img_size

# Tuple with height and width of images used to reshape arrays.
# This is used for plotting the images.
img_shape = (img_size, img_size)

# Tuple with height, width and depth used to reshape arrays.
# This is used for reshaping in Keras.
img_shape_full = (img_size, img_size, 1)

# Number of colour channels for the images: 1 channel for gray-scale.
num_channels = 1

# Number of classes, one class for each of 10 digits.
num_classes = 10

Helper-function for plotting images

Function used to plot 9 images in a 3x3 grid, writing the true and predicted classes below each image.

def plot_images(images, cls_true, cls_pred=None):
    assert len(images) == len(cls_true) == 9
    
    # Create figure with 3x3 sub-plots.
    fig, axes = plt.subplots(3, 3)
    fig.subplots_adjust(hspace=0.3, wspace=0.3)

    for i, ax in enumerate(axes.flat):
        # Plot image.
        ax.imshow(images[i].reshape(img_shape), cmap='binary')

        # Show true and predicted classes.
        if cls_pred is None:
            xlabel = "True: {0}".format(cls_true[i])
        else:
            xlabel = "True: {0}, Pred: {1}".format(cls_true[i], cls_pred[i])

        # Show the classes as the label on the x-axis.
        ax.set_xlabel(xlabel)
        
        # Remove ticks from the plot.
        ax.set_xticks([])
        ax.set_yticks([])
    
    # Ensure the plot is shown correctly with multiple plots
    # in a single Notebook cell.
    plt.show()

Plot a few images to see if data is correct

# Get the first images from the test-set.
images = data.test.images[0:9]

# Get the true classes for those images.
cls_true = data.test.cls[0:9]

# Plot the images and labels using our helper-function above.
plot_images(images=images, cls_true=cls_true)


Helper-function to plot example errors

Function for plotting examples of images from the test-set that have been mis-classified.

def plot_example_errors(cls_pred):
    # cls_pred is an array of the predicted class-number for
    # all images in the test-set.

    # Boolean array whether the predicted class is incorrect.
    incorrect = (cls_pred != data.test.cls)

    # Get the images from the test-set that have been
    # incorrectly classified.
    images = data.test.images[incorrect]
    
    # Get the predicted classes for those images.
    cls_pred = cls_pred[incorrect]

    # Get the true classes for those images.
    cls_true = data.test.cls[incorrect]
    
    # Plot the first 9 images.
    plot_images(images=images[0:9],
                cls_true=cls_true[0:9],
                cls_pred=cls_pred[0:9])

Hyper-Parameter Optimization

There are several steps required to do hyper-parameter optimization.

Create the Model

We first need a function that takes a set of hyper-parameters and creates the Convolutional Neural Network corresponding to those parameters. We use Keras to build the neural network in TensorFlow, see Tutorial #03-C for more details.

def create_model(learning_rate, num_dense_layers,
                 num_dense_nodes, activation):
    """
    Hyper-parameters:
    learning_rate:     Learning-rate for the optimizer.
    num_dense_layers:  Number of dense layers.
    num_dense_nodes:   Number of nodes in each dense layer.
    activation:        Activation function for all layers.
    """
    
    # Start construction of a Keras Sequential model.
    model = Sequential()

    # Add an input layer which is similar to a feed_dict in TensorFlow.
    # Note that the input-shape must be a tuple containing the image-size.
    model.add(InputLayer(input_shape=(img_size_flat,)))

    # The input from MNIST is a flattened array with 784 elements,
    # but the convolutional layers expect images with shape (28, 28, 1)
    model.add(Reshape(img_shape_full))

    # First convolutional layer.
    # There are many hyper-parameters in this layer, but we only
    # want to optimize the activation-function in this example.
    model.add(Conv2D(kernel_size=5, strides=1, filters=16, padding='same',
                     activation=activation, name='layer_conv1'))
    model.add(MaxPooling2D(pool_size=2, strides=2))

    # Second convolutional layer.
    # Again, we only want to optimize the activation-function here.
    model.add(Conv2D(kernel_size=5, strides=1, filters=36, padding='same',
                     activation=activation, name='layer_conv2'))
    model.add(MaxPooling2D(pool_size=2, strides=2))

    # Flatten the 4-rank output of the convolutional layers
    # to 2-rank that can be input to a fully-connected / dense layer.
    model.add(Flatten())

    # Add fully-connected / dense layers.
    # The number of layers is a hyper-parameter we want to optimize.
    for i in range(num_dense_layers):
        # Name of the layer. This is not really necessary
        # because Keras should give them unique names.
        name = 'layer_dense_{0}'.format(i+1)

        # Add the dense / fully-connected layer to the model.
        # This has two hyper-parameters we want to optimize:
        # The number of nodes and the activation function.
        model.add(Dense(num_dense_nodes,
                        activation=activation,
                        name=name))

    # Last fully-connected / dense layer with softmax-activation
    # for use in classification.
    model.add(Dense(num_classes, activation='softmax'))
    
    # Use the Adam method for training the network.
    # We want to find the best learning-rate for the Adam method.
    optimizer = Adam(lr=learning_rate)
    
    # In Keras we need to compile the model so it can be trained.
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    
    return model

Train and Evaluate the Model

The neural network with the best hyper-parameters is saved to disk so it can be reloaded later. This is the filename for the model.

path_best_model = '19_best_model.keras'

This is the best classification accuracy found so far, belonging to the model saved to disk. It is a global variable which will be updated during the optimization of the hyper-parameters.

best_accuracy = 0.0

This is the function that creates and trains a neural network with the given hyper-parameters, and then evaluates its performance on the validation-set. The function then returns the so-called fitness value (aka. objective value), which is the negative classification accuracy on the validation-set. It is negative because skopt performs minimization instead of maximization.

Note the function decorator @use_named_args which wraps the fitness function so that it can be called with all the parameters as a single list, for example: fitness(x=[1e-4, 3, 256, 'relu']). This is the calling-style skopt uses internally.

@use_named_args(dimensions=dimensions)
def fitness(learning_rate, num_dense_layers,
            num_dense_nodes, activation):
    """
    Hyper-parameters:
    learning_rate:     Learning-rate for the optimizer.
    num_dense_layers:  Number of dense layers.
    num_dense_nodes:   Number of nodes in each dense layer.
    activation:        Activation function for all layers.
    """

    # Print the hyper-parameters.
    print('learning rate: {0:.1e}'.format(learning_rate))
    print('num_dense_layers:', num_dense_layers)
    print('num_dense_nodes:', num_dense_nodes)
    print('activation:', activation)
    print()
    
    # Create the neural network with these hyper-parameters.
    model = create_model(learning_rate=learning_rate,
                         num_dense_layers=num_dense_layers,
                         num_dense_nodes=num_dense_nodes,
                         activation=activation)

    # Dir-name for the TensorBoard log-files.
    log_dir = log_dir_name(learning_rate, num_dense_layers,
                           num_dense_nodes, activation)
    
    # Create a callback-function for Keras which will be
    # run after each epoch has ended during training.
    # This saves the log-files for TensorBoard.
    # Note that there are complications when histogram_freq=1.
    # It might give strange errors and it also does not properly
    # support Keras data-generators for the validation-set.
    callback_log = TensorBoard(
        log_dir=log_dir,
        histogram_freq=0,
        batch_size=32,
        write_graph=True,
        write_grads=False,
        write_images=False)
   
    # Use Keras to train the model.
    history = model.fit(x=data.train.images,
                        y=data.train.labels,
                        epochs=3,
                        batch_size=128,
                        validation_data=validation_data,
                        callbacks=[callback_log])

    # Get the classification accuracy on the validation-set
    # after the last training-epoch.
    accuracy = history.history['val_acc'][-1]

    # Print the classification accuracy.
    print()
    print("Accuracy: {0:.2%}".format(accuracy))
    print()

    # Save the model if it improves on the best-found performance.
    # We use the global keyword so we update the variable outside
    # of this function.
    global best_accuracy

    # If the classification accuracy of the saved model is improved ...
    if accuracy > best_accuracy:
        # Save the new model to harddisk.
        model.save(path_best_model)
        
        # Update the classification accuracy.
        best_accuracy = accuracy

    # Delete the Keras model with these hyper-parameters from memory.
    del model
    
    # Clear the Keras session, otherwise it will keep adding new
    # models to the same TensorFlow graph each time we create
    # a model with a different set of hyper-parameters.
    K.clear_session()
    
    # NOTE: Scikit-optimize does minimization so it tries to
    # find a set of hyper-parameters with the LOWEST fitness-value.
    # Because we are interested in the HIGHEST classification
    # accuracy, we need to negate this number so it can be minimized.
    return -accuracy
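The sign-flip at the end is all that is needed to turn accuracy-maximization into the minimization that skopt expects, as this toy illustration with made-up accuracies shows:

```python
# Validation accuracies from three hypothetical runs.
accuracies = {'run_a': 0.6654, 'run_b': 0.9846, 'run_c': 0.8514}

# skopt minimizes, so each run reports -accuracy as its fitness.
fitness_values = {name: -acc for name, acc in accuracies.items()}

# Minimizing the negated values picks the run with the highest accuracy.
best_run = min(fitness_values, key=fitness_values.get)
print(best_run)  # → run_b
```

The set of optimal hyper-parameters is unchanged by the negation; only the direction of the search is.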

Test Run

Before we run the hyper-parameter optimization, let us first check that the various functions above actually work, when we pass the default hyper-parameters.

fitness(x=default_parameters)
learning rate: 1.0e-05
num_dense_layers: 1
num_dense_nodes: 16
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.2525 - acc: 0.1995 - val_loss: 2.1754 - val_acc: 0.3578
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 2.0279 - acc: 0.4612 - val_loss: 1.8432 - val_acc: 0.5558
Epoch 3/3
55000/55000 [==============================] - 4s - loss: 1.6227 - acc: 0.5998 - val_loss: 1.3877 - val_acc: 0.6654

Accuracy: 66.54%

-0.66539999999999999

Run the Hyper-Parameter Optimization

Now we are ready to run the actual hyper-parameter optimization using Bayesian optimization from the scikit-optimize package. Note that it first calls fitness() with default_parameters as the starting point. Starting from a decent hand-tuned point would normally help the optimizer locate better hyper-parameters faster; here we deliberately start from poor defaults to better demonstrate the improvement.

There are many more parameters you can experiment with here, including the number of calls to the fitness() function which we have set to 40. But fitness() is very expensive to evaluate so it should not be run too many times, especially for larger neural networks and datasets.

You can also experiment with the so-called acquisition function, which determines how the next set of hyper-parameters is chosen from the Bayesian optimizer's internal model. You can also try another surrogate model such as Random Forests, using forest_minimize which was imported above.
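The 'EI' (Expected Improvement) acquisition used below has a simple closed form. For a surrogate that predicts mean mu and standard deviation sigma at a candidate point, EI measures how much improvement over the best fitness found so far we can expect. This is a sketch of the standard formula, not skopt's implementation:

```python
import math

def expected_improvement(mu, sigma, best_f):
    """EI for minimization: expected amount by which a point with
    predicted mean mu and std-dev sigma improves on best_f."""
    if sigma == 0.0:
        return 0.0
    z = (best_f - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_f - mu) * cdf + sigma * pdf
```

Points predicted to lie well below best_f score high (exploitation), but so do points with large sigma (exploration), which is why the Bayesian optimizer balances trying promising regions against unexplored ones.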

search_result = gp_minimize(func=fitness,
                            dimensions=dimensions,
                            acq_func='EI', # Expected Improvement.
                            n_calls=40,
                            x0=default_parameters)
learning rate: 1.0e-05
num_dense_layers: 1
num_dense_nodes: 16
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 2.2287 - acc: 0.1868 - val_loss: 2.1264 - val_acc: 0.3182
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 1.9607 - acc: 0.4438 - val_loss: 1.7713 - val_acc: 0.5082
Epoch 3/3
55000/55000 [==============================] - 5s - loss: 1.5763 - acc: 0.5579 - val_loss: 1.3832 - val_acc: 0.6166

Accuracy: 61.66%

learning rate: 6.1e-04
num_dense_layers: 2
num_dense_nodes: 474
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 1.3354 - acc: 0.5258 - val_loss: 0.3002 - val_acc: 0.9112
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.2336 - acc: 0.9269 - val_loss: 0.1626 - val_acc: 0.9538
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.1403 - acc: 0.9563 - val_loss: 0.1113 - val_acc: 0.9692

Accuracy: 96.92%

learning rate: 6.1e-06
num_dense_layers: 2
num_dense_nodes: 333
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.1702 - acc: 0.5067 - val_loss: 1.9186 - val_acc: 0.6892
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 1.4878 - acc: 0.7480 - val_loss: 1.0546 - val_acc: 0.7940
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.8226 - acc: 0.8264 - val_loss: 0.6324 - val_acc: 0.8514

Accuracy: 85.14%

learning rate: 1.7e-04
num_dense_layers: 4
num_dense_nodes: 252
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.3075 - acc: 0.1058 - val_loss: 2.2968 - val_acc: 0.1126
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 1.5272 - acc: 0.4944 - val_loss: 0.8210 - val_acc: 0.7386
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.6595 - acc: 0.7967 - val_loss: 0.4940 - val_acc: 0.8544

Accuracy: 85.44%

learning rate: 7.3e-03
num_dense_layers: 3
num_dense_nodes: 166
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.1821 - acc: 0.9431 - val_loss: 0.0705 - val_acc: 0.9808
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.0605 - acc: 0.9829 - val_loss: 0.0678 - val_acc: 0.9848
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.0549 - acc: 0.9855 - val_loss: 0.0736 - val_acc: 0.9846

Accuracy: 98.46%

learning rate: 6.1e-05
num_dense_layers: 2
num_dense_nodes: 209
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 2.3187 - acc: 0.1073 - val_loss: 2.3030 - val_acc: 0.0924
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 2.3016 - acc: 0.1121 - val_loss: 2.2993 - val_acc: 0.1126
Epoch 3/3
55000/55000 [==============================] - 5s - loss: 2.2858 - acc: 0.1573 - val_loss: 2.2243 - val_acc: 0.2898

Accuracy: 28.98%

learning rate: 1.8e-04
num_dense_layers: 4
num_dense_nodes: 453
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.3601 - acc: 0.8920 - val_loss: 0.1234 - val_acc: 0.9640
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.0850 - acc: 0.9741 - val_loss: 0.0576 - val_acc: 0.9830
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.0566 - acc: 0.9824 - val_loss: 0.0535 - val_acc: 0.9856

Accuracy: 98.56%

learning rate: 5.5e-06
num_dense_layers: 4
num_dense_nodes: 186
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.3129 - acc: 0.1039 - val_loss: 2.3025 - val_acc: 0.1100
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 2.3016 - acc: 0.1106 - val_loss: 2.3010 - val_acc: 0.1126
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 2.3013 - acc: 0.1123 - val_loss: 2.3011 - val_acc: 0.1126

Accuracy: 11.26%

learning rate: 3.1e-05
num_dense_layers: 3
num_dense_nodes: 427
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.3132 - acc: 0.1070 - val_loss: 2.3007 - val_acc: 0.1126
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 2.3029 - acc: 0.1080 - val_loss: 2.3020 - val_acc: 0.1126
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 2.3021 - acc: 0.1093 - val_loss: 2.3016 - val_acc: 0.1126

Accuracy: 11.26%

learning rate: 1.4e-04
num_dense_layers: 2
num_dense_nodes: 29
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 0.8474 - acc: 0.7524 - val_loss: 0.2954 - val_acc: 0.9190
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 0.2392 - acc: 0.9315 - val_loss: 0.1741 - val_acc: 0.9512
Epoch 3/3
55000/55000 [==============================] - 5s - loss: 0.1643 - acc: 0.9517 - val_loss: 0.1346 - val_acc: 0.9612

Accuracy: 96.12%

learning rate: 3.7e-04
num_dense_layers: 4
num_dense_nodes: 338
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.1610 - acc: 0.1844 - val_loss: 1.0813 - val_acc: 0.6678
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.5982 - acc: 0.8131 - val_loss: 0.3252 - val_acc: 0.9100
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.2712 - acc: 0.9201 - val_loss: 0.1858 - val_acc: 0.9468

Accuracy: 94.68%

learning rate: 1.7e-06
num_dense_layers: 4
num_dense_nodes: 512
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.2568 - acc: 0.3895 - val_loss: 2.1984 - val_acc: 0.6048
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 2.0854 - acc: 0.6719 - val_loss: 1.9276 - val_acc: 0.7052
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 1.7106 - acc: 0.7158 - val_loss: 1.4589 - val_acc: 0.7290

Accuracy: 72.90%

learning rate: 1.4e-03
num_dense_layers: 2
num_dense_nodes: 62
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 0.2396 - acc: 0.9249 - val_loss: 0.0643 - val_acc: 0.9822
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 0.0587 - acc: 0.9819 - val_loss: 0.0536 - val_acc: 0.9838
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 0.0427 - acc: 0.9867 - val_loss: 0.0480 - val_acc: 0.9842

Accuracy: 98.42%

learning rate: 2.7e-03
num_dense_layers: 2
num_dense_nodes: 364
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 1.3014 - acc: 0.5223 - val_loss: 0.2531 - val_acc: 0.9232
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.1956 - acc: 0.9386 - val_loss: 0.1221 - val_acc: 0.9650
Epoch 3/3
55000/55000 [==============================] - 5s - loss: 0.1138 - acc: 0.9646 - val_loss: 0.0846 - val_acc: 0.9758

Accuracy: 97.58%

learning rate: 5.6e-04
num_dense_layers: 5
num_dense_nodes: 13
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.9357 - acc: 0.6775 - val_loss: 0.3024 - val_acc: 0.9184
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.2416 - acc: 0.9338 - val_loss: 0.1749 - val_acc: 0.9520
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.1685 - acc: 0.9525 - val_loss: 0.1541 - val_acc: 0.9570

Accuracy: 95.70%

learning rate: 1.0e-02
num_dense_layers: 5
num_dense_nodes: 352
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.3316 - acc: 0.1049 - val_loss: 2.3019 - val_acc: 0.1070
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 2.3024 - acc: 0.1090 - val_loss: 2.3017 - val_acc: 0.1126
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 2.3020 - acc: 0.1104 - val_loss: 2.3014 - val_acc: 0.1126

Accuracy: 11.26%

learning rate: 1.5e-03
num_dense_layers: 1
num_dense_nodes: 5
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 1.7072 - acc: 0.4784 - val_loss: 1.2153 - val_acc: 0.6980
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 0.9949 - acc: 0.7914 - val_loss: 0.7749 - val_acc: 0.8564
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 0.6663 - acc: 0.8663 - val_loss: 0.5469 - val_acc: 0.9014

Accuracy: 90.14%

learning rate: 1.0e-03
num_dense_layers: 3
num_dense_nodes: 496
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.1843 - acc: 0.9426 - val_loss: 0.0483 - val_acc: 0.9852
Epoch 2/3
55000/55000 [==============================] - 5s - loss: 0.0506 - acc: 0.9840 - val_loss: 0.0471 - val_acc: 0.9856
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.0347 - acc: 0.9889 - val_loss: 0.0451 - val_acc: 0.9856

Accuracy: 98.56%

learning rate: 3.7e-03
num_dense_layers: 5
num_dense_nodes: 512
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 4s - loss: 0.2060 - acc: 0.9377 - val_loss: 0.0739 - val_acc: 0.9832
Epoch 2/3
55000/55000 [==============================] - 5s - loss: 0.0781 - acc: 0.9814 - val_loss: 0.0765 - val_acc: 0.9842
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 0.0908 - acc: 0.9818 - val_loss: 0.1368 - val_acc: 0.9766

Accuracy: 97.66%

learning rate: 1.0e-02
num_dense_layers: 5
num_dense_nodes: 512
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.3199 - acc: 0.1105 - val_loss: 2.3015 - val_acc: 0.1126
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 2.3020 - acc: 0.1104 - val_loss: 2.3011 - val_acc: 0.1126
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 2.3018 - acc: 0.1110 - val_loss: 2.3013 - val_acc: 0.1126

Accuracy: 11.26%

learning rate: 1.9e-04
num_dense_layers: 4
num_dense_nodes: 418
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.3595 - acc: 0.8999 - val_loss: 0.0888 - val_acc: 0.9732
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.0868 - acc: 0.9738 - val_loss: 0.0686 - val_acc: 0.9782
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 0.0584 - acc: 0.9821 - val_loss: 0.0478 - val_acc: 0.9850

Accuracy: 98.50%

learning rate: 2.4e-03
num_dense_layers: 4
num_dense_nodes: 144
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.1906 - acc: 0.9390 - val_loss: 0.0576 - val_acc: 0.9834
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.0550 - acc: 0.9840 - val_loss: 0.0402 - val_acc: 0.9890
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.0380 - acc: 0.9885 - val_loss: 0.0459 - val_acc: 0.9880

Accuracy: 98.80%

learning rate: 6.8e-03
num_dense_layers: 2
num_dense_nodes: 105
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 0.1552 - acc: 0.9507 - val_loss: 0.0498 - val_acc: 0.9860
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 0.0485 - acc: 0.9853 - val_loss: 0.0534 - val_acc: 0.9836
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 0.0417 - acc: 0.9875 - val_loss: 0.0496 - val_acc: 0.9852

Accuracy: 98.52%

learning rate: 2.5e-04
num_dense_layers: 2
num_dense_nodes: 435
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.3258 - acc: 0.9131 - val_loss: 0.1024 - val_acc: 0.9676
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.0856 - acc: 0.9742 - val_loss: 0.0603 - val_acc: 0.9812
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.0601 - acc: 0.9819 - val_loss: 0.0477 - val_acc: 0.9868

Accuracy: 98.68%

learning rate: 2.5e-06
num_dense_layers: 1
num_dense_nodes: 409
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.2504 - acc: 0.3689 - val_loss: 2.1796 - val_acc: 0.5498
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 2.0835 - acc: 0.6384 - val_loss: 1.9688 - val_acc: 0.6812
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 1.8409 - acc: 0.7098 - val_loss: 1.6977 - val_acc: 0.7404

Accuracy: 74.04%

learning rate: 4.4e-03
num_dense_layers: 3
num_dense_nodes: 311
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.1504 - acc: 0.9523 - val_loss: 0.0746 - val_acc: 0.9800
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.0559 - acc: 0.9842 - val_loss: 0.0751 - val_acc: 0.9812
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.0431 - acc: 0.9884 - val_loss: 0.0500 - val_acc: 0.9870

Accuracy: 98.70%

learning rate: 2.1e-03
num_dense_layers: 5
num_dense_nodes: 436
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.1884 - acc: 0.9418 - val_loss: 0.0664 - val_acc: 0.9840
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.0598 - acc: 0.9837 - val_loss: 0.0454 - val_acc: 0.9880
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 0.0435 - acc: 0.9887 - val_loss: 0.0553 - val_acc: 0.9864

Accuracy: 98.64%

learning rate: 1.9e-04
num_dense_layers: 3
num_dense_nodes: 441
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.3664 - acc: 0.8989 - val_loss: 0.1076 - val_acc: 0.9698
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.0872 - acc: 0.9736 - val_loss: 0.0626 - val_acc: 0.9816
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.0583 - acc: 0.9824 - val_loss: 0.0504 - val_acc: 0.9856

Accuracy: 98.56%

learning rate: 1.7e-03
num_dense_layers: 1
num_dense_nodes: 512
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 1.2528 - acc: 0.5598 - val_loss: 0.2764 - val_acc: 0.9186
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.2010 - acc: 0.9369 - val_loss: 0.1251 - val_acc: 0.9592
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 0.1203 - acc: 0.9629 - val_loss: 0.0916 - val_acc: 0.9734

Accuracy: 97.34%

learning rate: 1.5e-03
num_dense_layers: 5
num_dense_nodes: 285
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.1378 - acc: 0.1588 - val_loss: 1.2723 - val_acc: 0.4116
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.5670 - acc: 0.7991 - val_loss: 0.2616 - val_acc: 0.9266
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 0.1877 - acc: 0.9460 - val_loss: 0.1365 - val_acc: 0.9618

Accuracy: 96.18%

learning rate: 3.3e-04
num_dense_layers: 5
num_dense_nodes: 5
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.3570 - acc: 0.0907 - val_loss: 2.3175 - val_acc: 0.0868
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 2.3074 - acc: 0.0952 - val_loss: 2.3029 - val_acc: 0.1126
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 2.3019 - acc: 0.1123 - val_loss: 2.3013 - val_acc: 0.1126

Accuracy: 11.26%

learning rate: 2.3e-04
num_dense_layers: 4
num_dense_nodes: 512
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 2.1591 - acc: 0.1861 - val_loss: 1.0381 - val_acc: 0.6422
Epoch 2/3
55000/55000 [==============================] - 5s - loss: 0.6686 - acc: 0.7868 - val_loss: 0.4403 - val_acc: 0.8662
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 0.3814 - acc: 0.8831 - val_loss: 0.2920 - val_acc: 0.9090

Accuracy: 90.90%

learning rate: 2.6e-03
num_dense_layers: 1
num_dense_nodes: 126
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 1.1633 - acc: 0.5922 - val_loss: 0.1928 - val_acc: 0.9422
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 0.1490 - acc: 0.9550 - val_loss: 0.0859 - val_acc: 0.9778
Epoch 3/3
55000/55000 [==============================] - 5s - loss: 0.0885 - acc: 0.9729 - val_loss: 0.0735 - val_acc: 0.9786

Accuracy: 97.86%

learning rate: 5.7e-04
num_dense_layers: 1
num_dense_nodes: 246
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 6s - loss: 0.2579 - acc: 0.9261 - val_loss: 0.0748 - val_acc: 0.9782
Epoch 2/3
55000/55000 [==============================] - 6s - loss: 0.0691 - acc: 0.9787 - val_loss: 0.0502 - val_acc: 0.9858
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 0.0465 - acc: 0.9854 - val_loss: 0.0423 - val_acc: 0.9880

Accuracy: 98.80%

learning rate: 2.4e-04
num_dense_layers: 1
num_dense_nodes: 164
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 0.4321 - acc: 0.8849 - val_loss: 0.1429 - val_acc: 0.9608
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 0.1163 - acc: 0.9654 - val_loss: 0.0821 - val_acc: 0.9766
Epoch 3/3
55000/55000 [==============================] - 4s - loss: 0.0794 - acc: 0.9762 - val_loss: 0.0679 - val_acc: 0.9796

Accuracy: 97.96%

learning rate: 1.0e-06
num_dense_layers: 2
num_dense_nodes: 5
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 2.3000 - acc: 0.1046 - val_loss: 2.2987 - val_acc: 0.1122
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 2.2981 - acc: 0.1124 - val_loss: 2.2965 - val_acc: 0.1224
Epoch 3/3
55000/55000 [==============================] - 4s - loss: 2.2959 - acc: 0.1221 - val_loss: 2.2941 - val_acc: 0.1290

Accuracy: 12.90%

learning rate: 1.3e-05
num_dense_layers: 2
num_dense_nodes: 512
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 1.6243 - acc: 0.6472 - val_loss: 0.7587 - val_acc: 0.8260
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.5184 - acc: 0.8724 - val_loss: 0.3656 - val_acc: 0.9038
Epoch 3/3
55000/55000 [==============================] - 7s - loss: 0.3292 - acc: 0.9091 - val_loss: 0.2724 - val_acc: 0.9268

Accuracy: 92.68%

learning rate: 7.6e-05
num_dense_layers: 1
num_dense_nodes: 241
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 2s - loss: 0.7636 - acc: 0.8233 - val_loss: 0.2393 - val_acc: 0.9368
Epoch 2/3
55000/55000 [==============================] - 2s - loss: 0.1961 - acc: 0.9448 - val_loss: 0.1449 - val_acc: 0.9612
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 0.1309 - acc: 0.9617 - val_loss: 0.1068 - val_acc: 0.9688

Accuracy: 96.88%

learning rate: 2.0e-03
num_dense_layers: 4
num_dense_nodes: 512
activation: relu

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 0.1668 - acc: 0.9474 - val_loss: 0.0605 - val_acc: 0.9832
Epoch 2/3
55000/55000 [==============================] - 4s - loss: 0.0548 - acc: 0.9845 - val_loss: 0.0419 - val_acc: 0.9902
Epoch 3/3
55000/55000 [==============================] - 8s - loss: 0.0408 - acc: 0.9890 - val_loss: 0.0596 - val_acc: 0.9844

Accuracy: 98.44%

learning rate: 2.2e-03
num_dense_layers: 2
num_dense_nodes: 326
activation: sigmoid

Train on 55000 samples, validate on 5000 samples
Epoch 1/3
55000/55000 [==============================] - 3s - loss: 1.1358 - acc: 0.5865 - val_loss: 0.2104 - val_acc: 0.9356
Epoch 2/3
55000/55000 [==============================] - 3s - loss: 0.1576 - acc: 0.9505 - val_loss: 0.0999 - val_acc: 0.9712
Epoch 3/3
55000/55000 [==============================] - 6s - loss: 0.1011 - acc: 0.9679 - val_loss: 0.0856 - val_acc: 0.9726

Accuracy: 97.26%

Optimization Progress

The progress of the hyper-parameter optimization can easily be plotted. The best fitness value found so far is plotted on the y-axis; remember that this is the negated classification accuracy on the validation-set.

Note how few hyper-parameter combinations had to be tried before substantial improvements were found.

plot_convergence(search_result)
<matplotlib.axes._subplots.AxesSubplot at 0x7fc2830202e8>

png
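The convergence curve is simply the running minimum of the fitness values. As a sketch, we can reproduce its y-values ourselves with NumPy; the array below is a hypothetical stand-in for `search_result.func_vals`, which is what the real code would use.

```python
import numpy as np

# func_vals holds the negated validation accuracy of each call to fitness(),
# so the running minimum is the best (most negative) fitness found so far.
# These values are hypothetical stand-ins for search_result.func_vals.
func_vals = np.array([-0.9842, -0.9758, -0.9570, -0.1126, -0.9880, -0.9852])

running_best = np.minimum.accumulate(func_vals)

# Negate back to get the best classification accuracy found so far.
best_accuracy_so_far = -running_best
print(best_accuracy_so_far)
```

Plotting `best_accuracy_so_far` against the iteration number gives the same shape as `plot_convergence`, just with the sign flipped.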

Best Hyper-Parameters

The best hyper-parameters found by the Bayesian optimizer are packed as a list, because that is what the optimizer uses internally.

search_result.x
[0.0023584457378584664, 4, 144, 'relu']

We can convert these parameters to a dict with proper names for the search-space dimensions.

First we need a reference to the search-space object.

space = search_result.space

Then we can use it to create a dict where the hyper-parameters have the proper names of the search-space dimensions. This is a bit awkward.

space.point_to_dict(search_result.x)
{'activation': 'relu',
 'learning_rate': 0.0023584457378584664,
 'num_dense_layers': 4,
 'num_dense_nodes': 144}
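The skopt API has changed across releases, so if `point_to_dict` is not available in your version, the same mapping can be built with a plain `zip` over the dimension names. The names and values below are copied from this tutorial's search-space; in the real code you would pass `search_result.x` instead of `best_point`.

```python
# Sketch: pair the search-space dimension names with the best point found.
dim_names = ['learning_rate', 'num_dense_layers', 'num_dense_nodes', 'activation']
best_point = [0.0023584457378584664, 4, 144, 'relu']

best_params = dict(zip(dim_names, best_point))
print(best_params)
```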

This is the fitness value associated with these hyper-parameters. It is a negative number because the Bayesian optimizer performs minimization, so we had to negate the classification accuracy, which is otherwise posed as a maximization problem.

search_result.fun
-0.98799999999999999

We can also see all the hyper-parameters tried by the Bayesian optimizer and their associated fitness values (the negated classification accuracies). These are sorted so that the highest classification accuracies are shown first.

It appears that the ‘relu’ activation was generally better than ‘sigmoid’. Otherwise it is difficult to see a pattern in which parameter choices are good, so we really need to plot these results.

sorted(zip(search_result.func_vals, search_result.x_iters))
[(-0.98799999999999999, [0.00057102338020535671, 1, 246, 'relu']),
 (-0.98799999999999999, [0.0023584457378584664, 4, 144, 'relu']),
 (-0.98699999999999999, [0.0043924439217142824, 3, 311, 'relu']),
 (-0.98680000000000001, [0.00025070302453255417, 2, 435, 'relu']),
 (-0.98640000000000005, [0.0020904801989242469, 5, 436, 'relu']),
 (-0.98560000000000003, [0.00017567744133971055, 4, 453, 'relu']),
 (-0.98560000000000003, [0.00018871091218374878, 3, 441, 'relu']),
 (-0.98560000000000003, [0.0010013922052631494, 3, 496, 'relu']),
 (-0.98519999999999996, [0.006752254693985822, 2, 105, 'relu']),
 (-0.98499999999999999, [0.0001905308801138268, 4, 418, 'relu']),
 (-0.98460000000000003, [0.0073224617473678331, 3, 166, 'relu']),
 (-0.98440000000000005, [0.0020143982003767271, 4, 512, 'relu']),
 (-0.98419999999999996, [0.0014193250864683331, 2, 62, 'relu']),
 (-0.97960000000000003, [0.00023735076383216567, 1, 164, 'relu']),
 (-0.97860000000000003, [0.0026064900033469073, 1, 126, 'sigmoid']),
 (-0.97660000000000002, [0.0037123587226393501, 5, 512, 'relu']),
 (-0.9758, [0.0027230837381696737, 2, 364, 'sigmoid']),
 (-0.97340000000000004, [0.0016597651372777609, 1, 512, 'sigmoid']),
 (-0.97260000000000002, [0.0022460993827137423, 2, 326, 'sigmoid']),
 (-0.96919999999999995, [0.00060563429543890952, 2, 474, 'sigmoid']),
 (-0.96879999999999999, [7.5808558985641429e-05, 1, 241, 'relu']),
 (-0.96179999999999999, [0.0014963322170155162, 5, 285, 'sigmoid']),
 (-0.96120000000000005, [0.00013559943302194881, 2, 29, 'relu']),
 (-0.95699999999999996, [0.00056441093780360571, 5, 13, 'relu']),
 (-0.94679999999999997, [0.00036704404112128516, 4, 338, 'sigmoid']),
 (-0.92679999999999996, [1.3066947342663859e-05, 2, 512, 'relu']),
 (-0.90900000000000003, [0.00023277413216549582, 4, 512, 'sigmoid']),
 (-0.90139999999999998, [0.001544493082361837, 1, 5, 'sigmoid']),
 (-0.85440000000000005, [0.00016937303683800523, 4, 252, 'sigmoid']),
 (-0.85140000000000005, [6.1458838378363633e-06, 2, 333, 'relu']),
 (-0.74039999999999995, [2.4847514577863683e-06, 1, 409, 'relu']),
 (-0.72899999999999998, [1.7068698743151031e-06, 4, 512, 'relu']),
 (-0.61660000000000004, [1e-05, 1, 16, 'relu']),
 (-0.2898, [6.1011365846453456e-05, 2, 209, 'sigmoid']),
 (-0.129, [9.9999999999999995e-07, 2, 5, 'relu']),
 (-0.11260000000000001, [5.4599879082087208e-06, 4, 186, 'sigmoid']),
 (-0.11260000000000001, [3.1218037895598157e-05, 3, 427, 'sigmoid']),
 (-0.11260000000000001, [0.00033099542158994725, 5, 5, 'sigmoid']),
 (-0.11260000000000001, [0.01, 5, 352, 'sigmoid']),
 (-0.11260000000000001, [0.01, 5, 512, 'relu'])]
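The impression that ‘relu’ generally beats ‘sigmoid’ can be checked by averaging the accuracies per activation function. The sketch below uses a few (fitness, parameters) pairs copied from the sorted results above; the full list would come from `zip(search_result.func_vals, search_result.x_iters)`.

```python
from collections import defaultdict

# A few (fitness, [learning_rate, layers, nodes, activation]) pairs
# copied from the sorted results above.
results = [
    (-0.9880, [0.00057, 1, 246, 'relu']),
    (-0.9880, [0.00236, 4, 144, 'relu']),
    (-0.9758, [0.00272, 2, 364, 'sigmoid']),
    (-0.9734, [0.00166, 1, 512, 'sigmoid']),
    (-0.1126, [0.01,    5, 352, 'sigmoid']),
]

# Group the accuracies (negated fitness values) by activation function.
accuracies = defaultdict(list)
for fitness, params in results:
    activation = params[-1]
    accuracies[activation].append(-fitness)

# Average accuracy per activation function.
for activation, accs in accuracies.items():
    print(activation, sum(accs) / len(accs))
```

On this subset the ‘relu’ average is clearly higher, but note that an average over all 40 runs mixes in the failed runs (11.26% accuracy), so the plots below give a more nuanced picture.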

Plots

There are several plotting functions available in the skopt library. For example, we can plot a histogram for the activation parameter, which shows the distribution of samples during the hyper-parameter optimization.

fig, ax = plot_histogram(result=search_result,
                         dimension_name='activation')

png

We can also make a landscape-plot of the estimated fitness values for two dimensions of the search-space, here taken to be learning_rate and num_dense_layers.

The Bayesian optimizer works by building a surrogate model of the search-space and then searching this model instead of the real search-space, because it is much faster. The plot shows the last surrogate model built by the Bayesian optimizer where yellow regions are better and blue regions are worse. The black dots show where the optimizer has sampled the search-space and the red star shows the best parameters found.

Several things should be noted here. Firstly, this surrogate model of the search-space may not be accurate. It is built from only 40 calls to the fitness() function, each of which trains a neural network with a given choice of hyper-parameters, so the modelled fitness landscape may differ significantly from the true values, especially in regions of the search-space with few samples. Secondly, the plot may change each time the hyper-parameter optimization is run, because of random noise in the training process of the neural network. Thirdly, this plot shows the effect of changing the two parameters num_dense_layers and learning_rate when averaged over all other dimensions in the search-space. This is also called a Partial Dependence plot and is a way of visualizing a high-dimensional space in only 2 dimensions.

fig = plot_objective_2D(result=search_result,
                        dimension_name1='learning_rate',
                        dimension_name2='num_dense_layers',
                        levels=50)

png

We cannot make a landscape plot for the activation hyper-parameter, because it is a categorical variable that can be one of two strings: ‘relu’ or ‘sigmoid’. How it is encoded depends on the Bayesian optimizer, for example, whether it is using Gaussian Processes or Random Forests, but it cannot currently be plotted using the built-in functions of skopt.

Instead we only use the real- and integer-valued dimensions of the search-space, which we identify by their names.

dim_names = ['learning_rate', 'num_dense_nodes', 'num_dense_layers']

We can then make a matrix-plot of all combinations of these dimensions.

The diagonal shows the influence of a single dimension on the fitness. This is a so-called Partial Dependence plot for that dimension. It shows how the approximated fitness value changes with different values in that dimension.

The plots below the diagonal show the Partial Dependence for two dimensions. This shows how the approximated fitness value changes when we are varying two dimensions simultaneously.

These Partial Dependence plots are only approximations of the modelled fitness function, which in turn is only an approximation of the true fitness function in fitness(). This may be a bit difficult to understand. For example, the Partial Dependence for the learning_rate is calculated by fixing one value of the learning_rate and taking a large number of random samples for the remaining dimensions in the search-space. The estimated fitness values for all these points are then averaged. This process is repeated for other values of the learning_rate to show how it affects the fitness on average. A similar procedure is used for the plots that show the Partial Dependence of two dimensions.

fig, ax = plot_objective(result=search_result, dimension_names=dim_names)

png
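The averaging procedure described above can be sketched for a toy 2-parameter fitness function. This is not skopt's internal implementation, just an illustration of the idea: fix the parameter of interest, sample the remaining parameter at random many times, and average.

```python
import random

random.seed(0)

# Toy stand-in for the modelled fitness: it depends on a parameter x
# (think learning_rate) and a "nuisance" parameter z we want to average out.
def toy_fitness(x, z):
    return (x - 0.3) ** 2 + 0.1 * z

# Partial dependence of f on x: for each x of interest, sample z at
# random many times and average the fitness values.
def partial_dependence(f, x_values, n_samples=1000):
    averages = []
    for x in x_values:
        samples = [f(x, random.uniform(0.0, 1.0)) for _ in range(n_samples)]
        averages.append(sum(samples) / n_samples)
    return averages

pd = partial_dependence(toy_fitness, [0.0, 0.3, 0.6])
print(pd)
```

The averaged curve has its minimum near x = 0.3, the optimum of the toy function, regardless of the nuisance parameter; this is exactly what the diagonal plots in the matrix show for each real hyper-parameter.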

We can also show another type of matrix-plot. Here the diagonal shows histograms of the sample distributions for each of the hyper-parameters during the Bayesian optimization. The plots below the diagonal show the location of samples in the search-space and the colour-coding shows the order in which the samples were taken. For larger numbers of samples you will likely see that the samples eventually become concentrated in a certain region of the search-space.

fig, ax = plot_evaluations(result=search_result, dimension_names=dim_names)

png

Evaluate Best Model on Test-Set

We can now use the best model on the test-set. It is very easy to reload the model using Keras.

model = load_model(path_best_model)

We then evaluate its performance on the test-set.

result = model.evaluate(x=data.test.images,
                        y=data.test.labels)
 8960/10000 [=========================>....] - ETA: 0s

We can print all the performance metrics for the test-set.

for name, value in zip(model.metrics_names, result):
    print(name, value)
loss 0.0363312054525
acc 0.9888

Or we can just print the classification accuracy.

print("{0}: {1:.2%}".format(model.metrics_names[1], result[1]))
acc: 98.88%

Predict on New Data

We can also predict the classification for new images. We will just use some images from the test-set but you could load your own images into numpy arrays and use those instead.

images = data.test.images[0:9]

These are the true class-numbers for those images. They are only used when plotting the images.

cls_true = data.test.cls[0:9]

Get the model's output, which is an array of predicted class-probabilities for each image.

y_pred = model.predict(x=images)

Get the predicted classes as integers.

cls_pred = np.argmax(y_pred, axis=1)
plot_images(images=images,
            cls_true=cls_true,
            cls_pred=cls_pred)

png

Examples of Mis-Classified Images

We can plot some examples of mis-classified images from the test-set.

First we get the predicted classes for all the images in the test-set:

y_pred = model.predict(x=data.test.images)

Then we convert the arrays of predicted class-probabilities to integer class-numbers.

cls_pred = np.argmax(y_pred, axis=1)

Plot some of the mis-classified images.

plot_example_errors(cls_pred)

png
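The helper plot_example_errors presumably selects the images where the predicted class differs from the true class. That selection can be sketched with NumPy; the two arrays below are small hypothetical stand-ins for cls_pred and data.test.cls.

```python
import numpy as np

# Hypothetical stand-ins for the predicted and true class-numbers.
cls_pred = np.array([7, 2, 1, 0, 4, 1, 4, 9, 6])
cls_true = np.array([7, 2, 1, 0, 4, 1, 4, 9, 5])

# Boolean mask of the mis-classified images, and their indices.
incorrect = cls_pred != cls_true
error_indices = np.flatnonzero(incorrect)

print(error_indices)
```

Indexing the test-images with `error_indices` then gives exactly the images to plot.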

Conclusion

This tutorial showed how to optimize the hyper-parameters of a neural network using Bayesian optimization. We used the scikit-optimize (skopt) library which is still under development, but it is already an extremely powerful tool. It was able to substantially improve on hand-tuned hyper-parameters in a small number of iterations. This is vastly superior to Grid Search and Random Search of the hyper-parameters, which would require far more computational time, and would most likely find inferior hyper-parameters, especially for more difficult problems.

Exercises

These are a few suggestions for exercises that may help improve your skills with TensorFlow. It is important to get hands-on experience with TensorFlow in order to learn how to use it properly.

You may want to backup this Notebook before making any changes.

  • Try and run 100 or 200 iterations of the optimization instead of just 40 iterations. What happens to the plotted landscapes?
  • Try some of the other optimization methods from scikit-optimize such as forest_minimize instead of gp_minimize. How do they perform?
  • Try using another acquisition function for the optimizer e.g. Probability of Improvement.
  • Try optimizing more hyper-parameters with the Bayesian optimization. For example, the kernel-size and number of filters in the convolutional-layers, or the batch-size used in training.
  • Add a hyper-parameter for the number of convolutional layers and implement it in create_model(). Note that if you have pooling-layers after the convolution then the images are downsampled, so there is a limit to the number of layers you can have before the images become too small.
  • Look at the plots. Do you think that some of the hyper-parameters may be irrelevant? Try and remove these parameters and redo the optimization of the remaining hyper-parameters.
  • Use another and more difficult dataset with image-files.
  • Train for more epochs. Does it improve the classification accuracy on the validation- and test-sets? How does it affect the time-usage?
  • Explain to a friend how the program works.

License (MIT)

Copyright © 2016-2018 by Magnus Erik Hvass Pedersen

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.