Visualizing Training Monitor Experiments (Alpha)

This feature is currently in alpha. Please contact us if you would like more information.

See the Training Monitor documentation for a complete introduction to training monitor experiments.

Our main goal in providing training monitor experiments is to build a workflow that supports users building neural networks. A key part of that workflow is visualizing both the training process and the hyperparameter optimization.

In this document, we demonstrate how to create and access useful visualizations of individual training runs, as well as collections of training runs for comparing performance.

We also demonstrate how to incorporate SigOpt's training monitor into PyTorch and TensorFlow code in Python. These demonstrations will eventually appear in the Gallery, once training monitor moves out of early testing.

Example - Visualizing the Metric Curve for a Single Training Run

Whenever a suggestion is created in a training monitor experiment, new elements are added to the suggestion modal. In particular, relevant plots associated with the training run can be reviewed within this modal, which also appears on an observation once one has been created. By default, this visualization plots the value of the optimized metric at each checkpoint.

Customized Visualizations for a Single Training Run

Practitioners building neural networks use a variety of tools and visualizations to power their development. Our goal was to build this feature in a flexible fashion and empower users to analyze their training runs however they find most beneficial.

This flexibility is afforded through the use of checkpoint metadata; values stored in this metadata will be available for plotting on the training run graphs, subject to the constraints below (a minimal sketch follows the list).

  • For values to appear on these training run plots, those values must be reported in the metadata of ALL the checkpoints of that training run.
  • While this feature is being developed, only 5 metadata entries are permitted per checkpoint.
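
As a minimal sketch of these rules, the loop below reports the same two metadata entries on every checkpoint, comfortably under the five-entry limit. Here model, conn, experiment, and training_run are assumed to exist as in the fuller examples below, and train_one_epoch, compute_validation_accuracy, and compute_training_loss are hypothetical placeholders.

for epoch in range(10):
  train_one_epoch(model)  # hypothetical training step

  # The same metadata keys appear in every checkpoint, so both 'epoch'
  # and 'training_loss' will be available for plotting
  conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
    values=[{'name': 'Validation Accuracy', 'value': compute_validation_accuracy(model)}],
    metadata={'epoch': epoch, 'training_loss': compute_training_loss(model)},
  )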

Beyond that minimal pattern, here are some strategies that may prove useful.

Plotting Training Runs by Epoch Number

In this example, we report checkpoints only once every two epochs, but we want to visualize training progress as a function of the number of epochs (not the number of checkpoints). We report the epoch number in metadata and then change the x-axis of the plot accordingly.

suggestion = conn.experiments(experiment.id).suggestions().create()

# form_model_from_suggestion and form_optimizer_from_suggestion are
# placeholders for building the model and optimizer from the suggested
# hyperparameter assignments
model = form_model_from_suggestion(suggestion)
optimizer = form_optimizer_from_suggestion(suggestion)
training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
for epoch in range(20):
  optimizer.advance_model(model)

  # Report a checkpoint every two epochs, storing the epoch number in
  # metadata so it can be used as the x-axis of the plot
  if epoch % 2 == 1:
    validation_accuracy = model.compute_validation_accuracy()
    checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
      values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
      metadata={'epoch': epoch},
    )

Plotting Alternate Metrics from Training Runs

In this example, we report checkpoints every epoch, and we want to plot the training loss (the quantity minimized during training) in addition to the validation accuracy (which, as the metric being optimized, is always available). We report the training loss in metadata and then change the y-axis of the plot accordingly.

suggestion = conn.experiments(experiment.id).suggestions().create()

model = form_model_from_suggestion(suggestion)
optimizer = form_optimizer_from_suggestion(suggestion)
training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
for _ in range(20):
  optimizer.advance_model(model)

  # Record the training loss in metadata so it can be selected as the
  # y-axis on the training run graphs
  training_loss = optimizer.training_loss
  validation_accuracy = model.compute_validation_accuracy()
  checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
    values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
    metadata={'training_loss': training_loss},
  )

Simultaneously Visualizing Multiple Training Runs

During the hyperparameter optimization loop, we may want to analyze multiple training runs in comparison to each other. Such a comparison can be conducted from the analysis page after a number of training runs have been completed.

Recording Checkpoints at Fixed Intervals in Time

Sometimes, we may want to wait a fixed amount of time between checkpoints, rather than a fixed number of epochs. If the training can be divided into small steps (as is often the case when training in minibatches), this can be achieved with training monitor. The first code block below shows how to create such an experiment, which will use multiple parallel suggestions.

time_between_checkpoints = 300  # five minutes
maximum_time_per_training = 3600  # one hour
max_checkpoints = maximum_time_per_training // time_between_checkpoints  # 12 checkpoints per training run
maximum_time_for_experiment = 86400  # one day
parallel_bandwidth = 10
# With the values above, (86400 * 10) // 3600 == 240 observations
observation_budget = (maximum_time_for_experiment * parallel_bandwidth) // maximum_time_per_training

experiment = conn.experiments().create(
  name='Checkpoint In Time',
  parameters=[{'name': 'gamma', 'bounds': {'min': .001, 'max': .1}, 'type': 'double'}],
  metrics=[{'name': 'Validation Accuracy', 'objective': 'maximize'}],
  observation_budget=observation_budget,
  parallel_bandwidth=parallel_bandwidth,
  training_monitor={'max_checkpoints': max_checkpoints},
  type='offline',
  metadata={
    'time_between_checkpoints': time_between_checkpoints,
    'maximum_time_per_training': maximum_time_per_training,
    'maximum_time_for_experiment': maximum_time_for_experiment,
  },
)

After the experiment has been created, each of the parallel workers executes the code below to run the loop and report a checkpoint whenever time_between_checkpoints seconds have passed. The loop is guaranteed to terminate after maximum_time_per_training seconds, plus however long a single call to optimizer.advance_model_one_batch takes.

import time
passed_experiment_id = 'MUST BE SUPPLIED'

experiment = conn.experiments(passed_experiment_id).fetch()
time_between_checkpoints = experiment.metadata['time_between_checkpoints']
maximum_time_per_training = experiment.metadata['maximum_time_per_training']

while experiment.progress.observation_count < experiment.observation_budget:
  suggestion = conn.experiments(experiment.id).suggestions().create()

  model = form_model_from_suggestion(suggestion)
  optimizer = form_optimizer_from_suggestion(suggestion)

  training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
  start_training_time = time.time()
  last_checkpoint_time = start_training_time
  while time.time() < start_training_time + maximum_time_per_training:
    optimizer.advance_model_one_batch(model)

    # Report a checkpoint once time_between_checkpoints seconds have
    # elapsed since the previous one
    if time.time() > last_checkpoint_time + time_between_checkpoints:
      validation_accuracy = model.compute_validation_accuracy()
      checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
        values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
      )
      last_checkpoint_time = time.time()

      # should_stop becomes true once max_checkpoints has been reached
      # (no early stopping criteria were defined for this experiment)
      if checkpoint.should_stop:
        break

  # Create the observation from the training run whether or not it was
  # cut short, then refresh the experiment to update the progress count
  observation = conn.experiments(experiment.id).observations().create(training_run=training_run.id)
  experiment = conn.experiments(experiment.id).fetch()
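
To take advantage of the parallel bandwidth, this loop must be running on parallel_bandwidth workers at once. As a minimal sketch, assuming the loop above has been wrapped in a hypothetical function run_worker(experiment_id), and that each worker creates its own Connection, the workers could be launched as separate processes on one machine:

import multiprocessing

# Launch one worker process per unit of parallel bandwidth; run_worker
# is a hypothetical wrapper around the loop above
processes = [
  multiprocessing.Process(target=run_worker, args=(passed_experiment_id,))
  for _ in range(10)  # parallel_bandwidth
]
for process in processes:
  process.start()
for process in processes:
  process.join()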

PyTorch Demonstration

This is an example using a two-layer neural network to fit manufactured regression data.

import torch
import numpy as np
from sigopt import Connection

DIM = 51
NUM_DATA = 12345
NUM_VALID = 1234
NUM_TEST = 987
NUM_HIDDEN = 45
NOISE = 7.89e-4

class TwoLayerNet(torch.nn.Module):
  def __init__(self, num_hidden=NUM_HIDDEN):
    super().__init__()
    self.linear1 = torch.nn.Linear(DIM, num_hidden)
    self.linear2 = torch.nn.Linear(num_hidden, 1)

  def forward(self, x):
    h_relu = self.linear1(x).clamp(min=0)
    y_pred = self.linear2(h_relu)
    return y_pred

def make_data(num_data=NUM_DATA, num_valid=NUM_VALID, num_test=NUM_TEST, noise=NOISE):
  x = torch.randn(num_data, DIM)
  y = (
    torch.cos(x[:, 0:1] + x[:, 3:4] + 2 * x[:, 22:23] + x[:, 34:35] + x[:, 43:44]) +
    noise * torch.randn(num_data, 1)
  )

  x_train = x[num_valid + num_test:, :]
  y_train = y[num_valid + num_test:, :]
  x_valid = x[num_test:num_test + num_valid, :]
  y_valid = y[num_test:num_test + num_valid, :]
  x_test = x[:num_test, :]
  y_test = y[:num_test, :]

  return x_train, y_train, x_valid, y_valid, x_test, y_test

def form_experiment_meta(num_epochs=123, checkpoint_frequency=8, observation_budget=25):
  max_checkpoints = int(np.ceil(num_epochs / checkpoint_frequency))

  experiment_meta = {
    'name': 'SGD Test',
    'parameters': [
      {'name': 'log_learning_rate', 'type': 'double', 'bounds': {'min': np.log(1e-4), 'max': np.log(1e1)}}
    ],
    'metrics': [{'name': 'Validation MSE Loss', 'objective': 'minimize'}],
    'observation_budget': observation_budget,
    'training_monitor': {
      'max_checkpoints': max_checkpoints,
      'early_stopping_criteria': [
        {
          'name': 'Look Back 1',
          'metric': 'Validation MSE Loss',
          'type': 'convergence',
          'min_checkpoints': 3,
          'lookback_checkpoints': 1
        },
      ],
    },
    'metadata': {
      'num_epochs': num_epochs,
      'checkpoint_frequency': checkpoint_frequency,
    },
    'type': 'offline',
  }
  return experiment_meta

def main(num_epochs=123, checkpoint_frequency=8, observation_budget=15):
  experiment_meta = form_experiment_meta(num_epochs, checkpoint_frequency, observation_budget)
  max_checkpoints = experiment_meta['training_monitor']['max_checkpoints']
  x_train, y_train, x_valid, y_valid, x_test, y_test = make_data()

  conn = Connection(client_token='INSERT YOUR SIGOPT API TOKEN HERE')
  experiment = conn.experiments().create(**experiment_meta)
  for k in range(experiment.observation_budget):

    model = TwoLayerNet()
    loss_func = torch.nn.MSELoss()

    suggestion = conn.experiments(experiment.id).suggestions().create()
    learning_rate = np.exp(suggestion.assignments['log_learning_rate'])
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
    # Space the checkpoints so that the final one lands exactly on the last epoch
    next_epoch_for_checkpoint = num_epochs - (max_checkpoints - 1) * checkpoint_frequency - 1
    for epoch in range(num_epochs):

      y_pred = model(x_train)
      training_loss = loss_func(y_pred, y_train)
      optimizer.zero_grad()
      training_loss.backward()
      optimizer.step()

      if epoch == next_epoch_for_checkpoint:
        next_epoch_for_checkpoint += checkpoint_frequency
        y_pred = model(x_valid)
        validation_mse_loss = loss_func(y_pred, y_valid)

        checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
          values=[
            {
              'name': 'Validation MSE Loss',
              'value': validation_mse_loss.item(),
            },
          ],
          metadata={
            'Training Loss': training_loss.item(),
            'epoch': epoch
          },
        )

        # should_stop is true once the early stopping criterion fires or
        # the final checkpoint has been reported; either way, score the
        # model on the held-out test set and report the observation
        if checkpoint.should_stop:
          y_pred = model(x_test)
          test_mse_loss = loss_func(y_pred, y_test)
          conn.experiments(experiment.id).observations().create(
            training_run=training_run.id,
            metadata={'Test MSE Loss': test_mse_loss.item()},
          )
          break

if __name__ == '__main__':
  main()
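
Once the loop completes, the best observed hyperparameters can be retrieved through SigOpt's best_assignments endpoint. A minimal sketch, assuming conn and experiment from main are in scope:

# Fetch the best observation's assignments after optimization finishes
best_assignments = conn.experiments(experiment.id).best_assignments().fetch()
print(best_assignments.data[0].assignments)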

TensorFlow Demonstration

This is an example using a convolutional neural network for MNIST image classification, written against the TensorFlow 1.x API.

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf
import numpy as np
from sigopt import Connection

CONV1_KERNEL = 4
CONV1_OUTPUT = 11
CONV1_ACT = tf.nn.relu
CONV2_KERNEL = 3
CONV2_OUTPUT = 13
CONV2_ACT = tf.sigmoid
FC1_HIDDEN = 234
FC1_ACT = tf.tanh
OPTIMIZER = tf.train.RMSPropOptimizer
NUM_EPOCHS = 10
BATCH_SIZE = 500


def form_model(x_img):
  w_c1 = tf.Variable(tf.random_normal([CONV1_KERNEL, CONV1_KERNEL, 1, CONV1_OUTPUT]))
  b_c1 = tf.Variable(tf.random_normal([CONV1_OUTPUT]))
  conv1 = tf.nn.conv2d(x_img, w_c1, strides=[1, 1, 1, 1], padding='SAME')
  conv1 = tf.add(conv1, b_c1)
  conv1 = tf.nn.max_pool(value=conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
  conv1 = CONV1_ACT(conv1)

  w_c2 = tf.Variable(tf.random_normal([CONV2_KERNEL, CONV2_KERNEL, CONV1_OUTPUT, CONV2_OUTPUT]))
  b_c2 = tf.Variable(tf.random_normal([CONV2_OUTPUT]))
  conv2 = tf.nn.conv2d(conv1, w_c2, strides=[1, 1, 1, 1], padding='SAME')
  conv2 = tf.add(conv2, b_c2)
  conv2 = CONV2_ACT(conv2)

  conv2_flat = tf.contrib.layers.flatten(conv2)

  w_fc1 = tf.Variable(tf.random_normal([conv2_flat.get_shape()[1].value, FC1_HIDDEN]))
  b_fc1 = tf.Variable(tf.random_normal([FC1_HIDDEN]))
  fc1 = tf.add(tf.matmul(conv2_flat, w_fc1), b_fc1)
  fc1 = FC1_ACT(fc1)

  w_out = tf.Variable(tf.random_normal([FC1_HIDDEN, 10]))
  b_out = tf.Variable(tf.random_normal([10]))
  out = tf.add(tf.matmul(fc1, w_out), b_out)

  return out

def make_data():
  mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

  x = tf.placeholder("float", [None, 784])
  x_img = tf.reshape(x, [-1, 28, 28, 1])
  y = tf.placeholder("float", [None, 10])

  return mnist, x, x_img, y

def form_experiment_meta(observation_budget=10):
  experiment_meta = {
    'name': 'SGD Test',
    'parameters': [
      {'name': 'log_learning_rate', 'type': 'double', 'bounds': {'min': np.log(1e-4), 'max': np.log(1e1)}}
    ],
    'metrics': [{'name': 'Validation Accuracy', 'objective': 'maximize'}],
    'observation_budget': observation_budget,
    'training_monitor': {
      'max_checkpoints': NUM_EPOCHS,
      'early_stopping_criteria': [
        {
          'name': 'Look Back 1',
          'metric': 'Validation Accuracy',
          'type': 'convergence',
          'min_checkpoints': 2,
          'lookback_checkpoints': 1
        },
      ],
    },
    'metadata': {
      'num_epochs': NUM_EPOCHS,
      'checkpoint_frequency': 1,
      'batch_size': BATCH_SIZE,
    },
    'type': 'offline',
  }
  return experiment_meta

def main(observation_budget=10):
  experiment_meta = form_experiment_meta(observation_budget)
  mnist, x, x_img, y = make_data()

  conn = Connection(client_token='INSERT YOUR SIGOPT API TOKEN HERE')
  experiment = conn.experiments().create(**experiment_meta)
  for k in range(experiment.observation_budget):

    suggestion = conn.experiments(experiment.id).suggestions().create()
    model = form_model(x_img)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=model, labels=y))
    learning_rate = np.exp(suggestion.assignments['log_learning_rate'])
    optimizer = OPTIMIZER(learning_rate=learning_rate).minimize(cost)
    with tf.Session() as sess:
      sess.run(tf.global_variables_initializer())

      training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
      for epoch in range(NUM_EPOCHS):
        num_batches = int(mnist.train.num_examples / BATCH_SIZE)

        for b in range(num_batches):
          batch_x, batch_y = mnist.train.next_batch(BATCH_SIZE)
          sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})

        correct_prediction = tf.equal(tf.argmax(model, 1), tf.argmax(y, 1))
        acc = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        validation_accuracy = sess.run(acc, feed_dict={x: mnist.validation.images, y: mnist.validation.labels})

        checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
          values=[
            {
              'name': 'Validation Accuracy',
              'value': validation_accuracy,
            },
          ],
          metadata={'epoch': epoch},
        )

        # should_stop is true once the early stopping criterion fires or
        # the final checkpoint has been reported
        if checkpoint.should_stop:
          conn.experiments(experiment.id).observations().create(training_run=training_run.id)
          break

if __name__ == '__main__':
  main()