Training Monitor for Neural Network Experiments (Alpha)

This feature is currently in alpha. Please contact us if you would like more information.

Store and visualize your neural network training progress during hyperparameter optimization.

Neural networks are often trained with a stochastic gradient descent method (such as Adam or RMSProp). This training process, which occurs for each Suggestion from SigOpt, can be slow, often taking many hours. In recognition of this, Training Monitor experiments allow users to report intermediate progress towards a converged model and to monitor that progress on the SigOpt website.

In this new workflow, a Training Run object accumulates all the work associated with executing a suggested set of assignments. Each time a neural network is trained, a new Training Run object is created to contain and organize the information associated with that training. Progress during the training run is reported to SigOpt through Checkpoint objects. When the neural network training finishes, all of the checkpoints associated with the training run are condensed into a standard SigOpt Observation.

Supporting Documentation

During this internal development period, Object/Endpoint documentation is hidden from discovery. Relevant links to objects and their functionality are provided below.

Example - Basic Checkpoint Structure

To train a deep neural network, we may use stochastic gradient descent to fit our data and a Training Monitor experiment to tune the descent hyperparameters and network architecture. We may choose to run 50 epochs (cycles through the data) and report the metric value (accuracy on a validation data set) to SigOpt at the end of each epoch. With a Training Monitor experiment, we create a training run and then create a checkpoint after each epoch to store the progress to be monitored on the SigOpt site.
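
A minimal sketch of that loop is shown below. Here model, validation_data, and validation_values are hypothetical placeholders (the same ones used in the complete examples later on this page), and experiment and suggestion are assumed to have been created already.

num_epochs = 50  # one checkpoint per epoch, so max_checkpoints must be at least 50

# Open a training run for the suggestion under evaluation
training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
for epoch in range(num_epochs):
  model.step()  # one pass through the training data
  validation_accuracy = model.evaluate_accuracy(validation_data, validation_values)

  # Report this epoch's metric value as a checkpoint
  conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
    values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
  )

# Condense the checkpoints into a single observation, closing the training run
conn.experiments(experiment.id).observations().create(training_run=training_run.id)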

Example - Checkpointing Less Frequently

If we were training a convolutional neural network for image classification, we might run 35 epochs but evaluate the validation accuracy only once every 5 epochs (because the inference cost is nontrivial). In this scenario, a checkpoint would be created each time the validation accuracy was recorded, for a total of 7 checkpoints during the 35 epochs.
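
A hedged sketch of this schedule follows, reusing the same hypothetical placeholders (experiment, suggestion, model, and the validation data); the only change from per-epoch checkpointing is the modulo test that gates the evaluation.

num_epochs = 35
checkpoint_frequency = 5  # evaluate the validation accuracy every 5 epochs

training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
for epoch in range(num_epochs):
  model.step()
  if (epoch + 1) % checkpoint_frequency == 0:  # epochs 5, 10, ..., 35 -> 7 checkpoints
    validation_accuracy = model.evaluate_accuracy(validation_data, validation_values)
    conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
      values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
    )

conn.experiments(experiment.id).observations().create(training_run=training_run.id)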

Creating the Experiment

Certain fields are required when creating a Training Monitor experiment.

  • Training Monitor experiments require a training_monitor field, set to a mapping.
    • max_checkpoints must be set; this is a positive integer denoting the maximum number of checkpoints that will be generated per training run.
  • The type must be set to offline.
  • The observation_budget must be set.
  • The metrics field must also be used to name the metric under consideration for this experiment.
    • This is in contrast to other single-metric experiments, where the metrics field may be omitted.

This feature is currently in alpha and is supported only by the Python API client library. Below is a sample Training Monitor experiment create call.

experiment = conn.experiments().create(
  name='Classifier Accuracy',
  parameters=[
    {
      'name': 'gamma',
      'bounds': {'min': 0.001, 'max': 1.0},
      'type': 'double',
    },
  ],
  metrics=[
    {'name': 'Validation Accuracy', 'objective': 'maximize'},
  ],
  observation_budget=47,  # required
  parallel_bandwidth=1,
  training_monitor={
    'max_checkpoints': 10,  # required, must be less than 200
  },
  type='offline',  # required
)

Creating Training Runs, Checkpoints and Observations

Training Monitor experiments provide a new workflow that more closely mirrors the iterative neural network training process.

  • Once the experiment object has been created, we ask for a new suggestion by calling Suggestion Create. This is the same as a normal experiment.
  • Using that suggestion, we then call Training Run Create to create a Training Run.
    • The training_run organizes the progress of the neural network training.
  • During this training process, we call Checkpoint Create to create a Checkpoint whenever the metric is evaluated.
    • Here, we are specifically interested in the metric which SigOpt is optimizing, even if other metrics may be computed during the optimization (e.g., the training loss).
  • After the training has completed, Observation Create must be called.
    • This observation condenses all of the checkpoints into a single value equal to the best performance of the suggested assignments over all the reported checkpoints.
    • Doing so closes the training run.

The code snippet below demonstrates this process.

# experiment_meta is assumed to hold the experiment fields shown in the create call above;
# form_model, training_data, and the validation data are placeholders for the user's own
# model-building code and data sets.
experiment = conn.experiments().create(**experiment_meta)

for _ in range(experiment.observation_budget):
  suggestion = conn.experiments(experiment.id).suggestions().create()
  model = form_model(training_data, training_values, **suggestion.assignments)

  # One training run organizes all the work for this suggestion
  training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
  for t in range(num_epochs):
    model.step()
    validation_accuracy = model.evaluate_accuracy(validation_data, validation_values)

    # Report progress after every epoch
    checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
      values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
    )

  # Condense the checkpoints into a single observation, which closes the training run
  observation = conn.experiments(experiment.id).observations().create(training_run=training_run.id)

It is possible to report observations as failures at any point during the training (e.g., if the machine conducting the training ran out of memory).
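
For example, a single training step might be guarded as sketched below. The failed flag is the standard way to mark an observation as a failure; whether it is paired with training_run or with the suggestion id in this workflow should be confirmed against the Observation documentation.

try:
  model.step()
except MemoryError:
  # Report the failure and stop working on this training run; per the limitations below,
  # no checkpoints need to exist before a failed observation.
  # Assumption: the standard failed flag is accepted alongside training_run here, just as
  # it is alongside suggestion in ordinary experiments.
  conn.experiments(experiment.id).observations().create(
    training_run=training_run.id,
    failed=True,
  )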

Specifying Early Stopping Criteria

It may be the case that the full number of epochs need not be executed during training; some parameter assignments may converge more quickly than others, and some may overfit if trained for too long. Training Monitor experiments provide a strategy to help users automate stopping neural network training earlier than the maximum time.

To define early stopping behavior, add relevant early_stopping_criteria to the training_monitor component during experiment creation; an example is provided below.

training_monitor={
  'max_checkpoints': 10,
  'early_stopping_criteria': [
    {
      'type': 'convergence',  # Only permitted value during alpha testing
      'name': 'Look Back 2 Steps',
      'metric': 'Validation Accuracy',
      'lookback_checkpoints': 2,
      'min_checkpoints': 3,  # Minimum number of checkpoints before the criterion is considered
    },
  ],
}

  • At present, the available early stopping criteria are restricted to only detecting convergence.
    • We compare the metric value at the most recent checkpoint to the value at an earlier checkpoint.
      • The number of checkpoints to look back is specified by the positive integer lookback_checkpoints.
      • If the current value is no better, the criterion is considered satisfied.
    • Several criteria can be defined per experiment.
  • Whenever a checkpoint is created, all the defined criteria are checked.
    • The checkpoint has an entry called stopping_reasons which reports the status of all the criteria at that checkpoint (whether or not they were satisfied); a short sketch follows this list.
    • If fewer than min_checkpoints have been recorded, the criterion is automatically considered not satisfied.
    • The checkpoint also has a boolean should_stop which takes the value True if all the early stopping criteria were satisfied.
      • It will also take the value True if max_checkpoints checkpoints have already been reported.
    • The Checkpoint page explains this in detail and provides an example.
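
The fragment below sketches how those fields might be inspected right after a checkpoint is created. The exact structure of stopping_reasons is described on the Checkpoint page; a mapping from criterion names to booleans is assumed here.

checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
  values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
)

# Per-criterion status at this checkpoint (assumed shape: {'Look Back 2 Steps': True, ...})
print(checkpoint.stopping_reasons)

# True once every criterion is satisfied, or once max_checkpoints checkpoints have been reported
if checkpoint.should_stop:
  conn.experiments(experiment.id).observations().create(training_run=training_run.id)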

A more complicated Training Monitor demonstration, utilizing early stopping, is provided below. We run at most 123 epochs and check the metric value only once every 14 epochs. The early stopping criterion checks whether the current value is better than the value 2 checkpoints earlier, but only after at least 4 checkpoints have been reported.

import numpy as np

num_epochs = 123
checkpoint_frequency = 14
max_checkpoints = int(np.ceil(num_epochs / checkpoint_frequency))  # 9 checkpoints

experiment_meta['training_monitor'] = {
  'max_checkpoints': max_checkpoints,
  'early_stopping_criteria': [
    {
      'type': 'convergence',
      'name': 'Look Back 2 Steps',
      'metric': 'Validation Accuracy',
      'lookback_checkpoints': 2,
      'min_checkpoints': 4,
    },
  ],
}
experiment = conn.experiments().create(**experiment_meta)

for _ in range(experiment.observation_budget):
  suggestion = conn.experiments(experiment.id).suggestions().create()
  model = form_model(training_data, training_values, **suggestion.assignments)

  training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
  # Space the checkpoints so that the last of the max_checkpoints falls on the final epoch
  next_iteration_for_checkpoint = num_epochs - (max_checkpoints - 1) * checkpoint_frequency - 1
  for t in range(num_epochs):
    model.step()

    if t == next_iteration_for_checkpoint:
      next_iteration_for_checkpoint += checkpoint_frequency
      validation_accuracy = model.evaluate_accuracy(validation_data, validation_values)

      checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
        values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
      )

      # should_stop is True once all early stopping criteria are satisfied, or once the
      # final (max_checkpoints-th) checkpoint has been reported, so every training run
      # ends with an observation
      if checkpoint.should_stop:
        observation = conn.experiments(experiment.id).observations().create(training_run=training_run.id)
        break

Limitations

Training Monitor experiments have some limitations while in development. This list is likely to change as the feature progresses toward its beta release.

  • Neither training runs nor checkpoints may be updated or deleted. If a mistake is made, any associated observations or suggestions can still be deleted, and training runs may be abandoned.
  • Training runs must be created with a suggestion field, i.e., they cannot be defined using an assignments field. Suggestions may still be created or enqueued using assignments, though.
  • At present, only one training run per suggestion is permitted.
  • Only one metric is permitted and the number of solutions must be one.
  • No more than 3 early stopping criteria are permitted.
  • All early stopping criteria must have unique names.
  • The value max_checkpoints must be less than 200.
  • Creating more than max_checkpoints checkpoints for a given training run is strictly forbidden and will yield an error. As a result, we assume there is no zeroth checkpoint, i.e., no initial computation of the metric before training has begun.
  • The metadata associated with a checkpoint may have no more than 4 entries.
  • Unless reporting a failure, at least one checkpoint must be created before an observation can be created.

Monitoring Training Through the API

During training, the training_run object can be retrieved through its id or through the id of the suggestion from which it was created; after an observation is created from a training run, the training_run object can also be retrieved by querying the observation id. Checkpoints can be retrieved individually or as a list of all the checkpoints associated with a single training run.
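
A brief sketch of the direct retrievals with the Python client is shown below; training_run_id and checkpoint_id are placeholder ids, and the pagination helper iterate_pages is the client's usual way to walk a listed resource. The suggestion-based and observation-based lookups follow the patterns described in the object documentation and are not repeated here.

# Retrieve a training run directly by its id
training_run = conn.experiments(experiment.id).training_runs(training_run_id).fetch()

# Retrieve a single checkpoint by id, or list every checkpoint for the training run
single_checkpoint = conn.experiments(experiment.id).training_runs(training_run_id).checkpoints(checkpoint_id).fetch()
all_checkpoints = conn.experiments(experiment.id).training_runs(training_run_id).checkpoints().fetch()
for checkpoint in all_checkpoints.iterate_pages():
  print(checkpoint.values)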