Training Monitor for Neural Network Experiments (Alpha)
Store and visualize your neural network training progress during hyperparameter optimization.
Neural networks are often trained with a stochastic gradient descent optimizer (such as Adam or RMSProp). This training process, which occurs for each Suggestion from SigOpt, can be slow, often taking many hours. In recognition of this, Training Monitor experiments allow users to report intermediate progress towards a converged model and monitor that progress on the SigOpt website.
In this new workflow, a Training Run object accumulates all the work associated with executing a suggested set of assignments. Each time a neural network is trained, a new Training Run object is created to contain and organize the information associated with that training run. Progress during the training run is reported to SigOpt with the Checkpoint object. At the end of training, all of the checkpoints associated with a given training run are accumulated into a standard SigOpt Observation. The image below illustrates this process.
Contents
- Supporting Documentation
- Motivating Examples
- Creating the Experiment
- Creating Training Runs, Checkpoints and Observations
- Stating Early Stopping Criteria
- Limitations
- Monitoring Training Through the API
Supporting Documentation
During this internal development period, Object/Endpoint documentation is hidden from discovery. Relevant links to objects and their functionality are provided below.
Motivating Examples
Example - Basic Checkpoint Structure
To train a deep neural network, we may choose to perform stochastic gradient descent to fit our data and use a Training Monitor experiment to tune the descent hyperparameters and network architecture. We may choose to run 50 epochs (cycles through the data) and report the metric value (accuracy on a validation data set) to SigOpt at the end of each epoch. Using a Training Monitor experiment, we would create a training run and then create checkpoints after each epoch to store the progress to be monitored on the SigOpt site.
Example - Checkpointing Less Frequently
If we were training a convolutional neural network for image classification, we might run 35 epochs but evaluate the validation accuracy only once every 5 epochs (because the inference cost is nontrivial). In this scenario, a checkpoint would be created each time the validation accuracy was recorded, for a total of 7 checkpoints during the 35 epochs.
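As a rough sketch of this pattern (assuming a conn client plus an experiment and training_run already created as shown in the sections below, and hypothetical train_one_epoch and compute_validation_accuracy helpers standing in for your own training code), the checkpoint-every-5-epochs loop might look like:
# Minimal sketch of the less-frequent checkpointing pattern described above.
# train_one_epoch and compute_validation_accuracy are hypothetical stand-ins.
num_epochs = 35
eval_frequency = 5
for epoch in range(num_epochs):
    train_one_epoch(model)
    if (epoch + 1) % eval_frequency == 0:  # evaluate every 5 epochs -> 7 checkpoints total
        validation_accuracy = compute_validation_accuracy(model)
        conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
            values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
        )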
Creating the Experiment
Certain fields are required when creating a Training Monitor experiment.
- Training Monitor experiments require a training_monitor field, set to a mapping. Within it, max_checkpoints must be set; this is a positive integer denoting the maximum number of checkpoints that will be generated per training run.
- The type must be set to offline.
- The observation_budget must be set.
- The metrics field must also be used to name the metric under consideration for this experiment. This is in contrast to other single-metric experiments, where the metrics field could be omitted.
This feature is currently in alpha and is supported only by the Python API client library. Below is a sample Training Monitor experiment create call.
experiment = conn.experiments().create(
    name='Classifier Accuracy',
    parameters=[
        {
            'name': 'gamma',
            'bounds': {'min': 0.001, 'max': 1.0},
            'type': 'double',
        },
    ],
    metrics=[
        {'name': 'Validation Accuracy', 'objective': 'maximize'},
    ],
    observation_budget=47,  # required
    parallel_bandwidth=1,
    training_monitor={
        'max_checkpoints': 10,  # required, cannot exceed 200
    },
    type='offline',  # required
)
Creating Training Runs, Checkpoints and Observations
Training Monitor experiments provide a new workflow which more closely mirrors the iterative neural network training process.
- Once the experiment object has been created, we ask for a new suggestion by calling Suggestion Create. This is the same as a normal experiment.
- Using that suggestion, we then call Training Run Create to create a Training Run. The training_run organizes the progress of the neural network training.
- During this training process, we call Checkpoint Create to create a Checkpoint whenever the metric is evaluated. Here, we are specifically interested in the metric which SigOpt is optimizing, even if other metrics (e.g., the training loss) may be computed during training.
- After the training has completed, Observation Create must be called. This observation condenses all of the checkpoints into a single value equal to the best performance of the suggested assignments over all the reported checkpoints. Doing so closes the training run.
The code snippet below demonstrates this process.
experiment = conn.experiments().create(**experiment_meta)
for _ in range(experiment.observation_budget):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    model = form_model(training_data, training_values, **suggestion.assignments)
    training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
    # One checkpoint is created per epoch, so num_epochs should not exceed max_checkpoints
    for t in range(num_epochs):
        model.step()
        validation_accuracy = model.evaluate_accuracy(validation_data, validation_values)
        checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
            values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
        )
    # Condense the checkpoints into an observation and close the training run
    observation = conn.experiments(experiment.id).observations().create(training_run=training_run.id)
It is possible to report observations as failures at any point during the training (e.g., if the machine conducting the training ran out of memory).
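As a rough illustration (not a confirmed pattern for this alpha feature), a failure could be reported by wrapping the training loop in a try/except; whether the failed observation should reference the suggestion or the training_run is an assumption here, so check the Observation documentation for the supported fields.
try:
    for t in range(num_epochs):
        model.step()
        validation_accuracy = model.evaluate_accuracy(validation_data, validation_values)
        conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
            values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
        )
except (MemoryError, RuntimeError):
    # e.g., the machine conducting the training ran out of memory.
    # Assumption: the failed observation is reported in the usual way with failed=True;
    # whether to pass suggestion=... or training_run=... here is not confirmed in this doc.
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        failed=True,
    )
else:
    conn.experiments(experiment.id).observations().create(training_run=training_run.id)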
Stating Early Stopping Criteria
It may be the case that the full number of epochs need not be executed during training; some parameter assignments may converge more quickly than others, and some may overfit if trained for too long. Training Monitor experiments provide a strategy to help users automate the process of stopping neural network training earlier than the maximum time.
To define early stopping behavior, add relevant early_stopping_criteria to the training_monitor component during experiment creation; an example is provided below.
training_monitor={
    'max_checkpoints': 10,
    'early_stopping_criteria': [
        {
            'type': 'convergence',  # Only permitted value during alpha testing
            'name': 'Look Back 2 Steps',
            'metric': 'Validation Accuracy',
            'lookback_checkpoints': 2,
            'min_checkpoints': 3,  # Minimum checkpoints before this criterion is considered
        },
    ],
}
- At present, the available early stopping criteria are restricted to detecting convergence only.
- We compare the metric value at the most recent checkpoint to the value at an earlier checkpoint. The number of checkpoints to look back is specified by the positive integer lookback_checkpoints. If the current value is no better, the early stopping criterion is considered satisfied.
- Several criteria can be defined per experiment.
- Whenever a checkpoint is created, all the defined criteria are checked. The checkpoint has an entry called stopping_reasons which returns the status of all the criteria at that checkpoint (whether or not they were satisfied). If fewer than min_checkpoints checkpoints have been recorded, the criterion is automatically considered not satisfied.
- The checkpoint also has a boolean should_stop which takes the value True if all the early stopping criteria were satisfied. It will also take the value True if max_checkpoints checkpoints have already been reported.
- The Checkpoint page explains this in detail and provides an example.
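In code, these fields might be inspected right after each checkpoint is created, roughly as in the minimal sketch below (continuing the earlier snippet); the exact structure of stopping_reasons is assumed here to map criterion names to booleans, so consult the Checkpoint page for the definitive form.
checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
    values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
)
# Assumption: stopping_reasons maps each criterion's name to whether it was satisfied
for criterion_name, satisfied in checkpoint.stopping_reasons.items():
    print(criterion_name, satisfied)
if checkpoint.should_stop:  # all criteria satisfied, or max_checkpoints reached
    conn.experiments(experiment.id).observations().create(training_run=training_run.id)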
A more complicated Training Monitor demonstration, utilizing early stopping, is provided below. We use at most 123 epochs and check the metric value only once every 14 epochs. The early stopping criterion checks whether the current value is better than the value 2 checkpoints earlier, but only after at least 4 checkpoints have been reported.
import numpy as np

num_epochs = 123
checkpoint_frequency = 14
max_checkpoints = int(np.ceil(num_epochs / checkpoint_frequency))
experiment_meta['training_monitor'] = {
    'max_checkpoints': max_checkpoints,
    'early_stopping_criteria': [
        {
            'type': 'convergence',
            'name': 'Look Back 2 Steps',
            'metric': 'Validation Accuracy',
            'lookback_checkpoints': 2,
            'min_checkpoints': 4,
        },
    ],
}
experiment = conn.experiments().create(**experiment_meta)
for _ in range(experiment.observation_budget):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    model = form_model(training_data, training_values, **suggestion.assignments)
    training_run = conn.experiments(experiment.id).training_runs().create(suggestion=suggestion.id)
    # Space the checkpoints so that the final checkpoint falls on the final epoch
    next_iteration_for_checkpoint = num_epochs - (max_checkpoints - 1) * checkpoint_frequency - 1
    for t in range(num_epochs):
        model.step()
        if t == next_iteration_for_checkpoint:
            next_iteration_for_checkpoint += checkpoint_frequency
            validation_accuracy = model.evaluate_accuracy(validation_data, validation_values)
            checkpoint = conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
                values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
            )
            if checkpoint.should_stop:
                # should_stop is True once all criteria are satisfied or max_checkpoints is reached,
                # so an observation is always created before leaving the epoch loop.
                observation = conn.experiments(experiment.id).observations().create(training_run=training_run.id)
                break
Limitations
Training Monitor experiments have some limitations while in development. This list is likely to change as the feature develops during its beta release.
- Neither training runs nor checkpoints may be updated or deleted. If a mistake is made, any associated observations or suggestions can still be deleted, and training runs may be abandoned.
- Training runs must be created with a suggestion field, i.e., they cannot be defined using an assignments field. Suggestions may still be created or enqueued using assignments, though.
- At present, only one training run per suggestion is permitted.
- Only one metric is permitted and the number of solutions must be one.
- No more than 3 early stopping criteria are permitted.
- All early stopping criteria must have unique names.
- The value max_checkpoints must be less than 200.
- Creating more than max_checkpoints checkpoints for a given training run is strictly forbidden and will yield an error. As a result, we assume there is no zeroth checkpoint -- no initial computation of the metric before training has begun.
- The metadata associated with a checkpoint may have no more than 4 entries.
- Unless reporting a failure, at least one checkpoint must be created before an observation can be created.
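For instance, attaching metadata to a checkpoint (staying within the 4-entry limit) might look like the sketch below, continuing the earlier snippets; the metadata keys shown here are hypothetical.
# Sketch of attaching metadata to a checkpoint; the keys are hypothetical,
# and no more than 4 entries are allowed per the limitation above.
conn.experiments(experiment.id).training_runs(training_run.id).checkpoints().create(
    values=[{'name': 'Validation Accuracy', 'value': validation_accuracy}],
    metadata={
        'epoch': t,
        'learning_rate_schedule': 'cosine',
    },
)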
Monitoring Training Through the API
During training, the training_run object can be recovered through its id or through the id of the suggestion from which it was created; after an observation is created using a training run, the training_run object can also be recovered by querying the observation id. Checkpoints can be retrieved individually or as a group of all the checkpoints associated with a single training run.
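The retrieval calls might look roughly like the sketch below; the exact endpoints and pagination behavior for this alpha feature are assumptions based on the usual SigOpt Python client conventions, and training_run_id is a hypothetical stored id.
# Assumed retrieval patterns, following standard SigOpt Python client conventions;
# the exact endpoints for this alpha feature may differ.
training_run = conn.experiments(experiment.id).training_runs(training_run_id).fetch()
checkpoints = conn.experiments(experiment.id).training_runs(training_run_id).checkpoints().fetch()
for checkpoint in checkpoints.iterate_pages():
    print(checkpoint.values)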