Documentation

Welcome to the developer documentation for SigOpt. If you have a question you can’t find an answer to, feel free to contact us!
This feature is currently in alpha. Please contact us if you would like more information.

SigOpt Orchestrate Reference

SigOpt Orchestrate is a command-line tool for managing training clusters and running optimization experiments.

Cluster Configuration File

The cluster configuration file is commonly referred to as cluster.yml, but you can name yours anything you like. The file is used when you create a SigOpt Orchestrate cluster with orchestrate cluster create -f cluster.yml. After the cluster has been created, you can update the cluster configuration file to change the number of nodes in your cluster or change instance types, then apply the changes by running orchestrate cluster update -f cluster.yml. Some updates might not be supported, for example introducing GPU nodes to your cluster in some regions. If an update is not supported, you will need to destroy the cluster and create it again.
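
A typical workflow with these commands looks like the following (assuming your configuration file is named cluster.yml):

# Create a new cluster from the configuration file
orchestrate cluster create -f cluster.yml

# Edit cluster.yml (e.g. change max_nodes or instance_type), then apply the changes
orchestrate cluster update -f cluster.yml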

The available fields are:

  • cpu, gpu (at least one required): Define the compute that your cluster will need in terms of instance_type, max_nodes, and min_nodes. It is recommended that you set min_nodes to 0 so the autoscaler can remove all of your expensive compute nodes when they aren't in use. It's fine for max_nodes and min_nodes to share the same value, as long as max_nodes is not 0.
  • cluster_name (required): A name for your cluster. You will share this with anyone else who wants to connect to your cluster.
  • aws (optional): Override environment-provided values for aws_access_key_id or aws_secret_access_key.
  • kubernetes_version (optional): The version of Kubernetes to use for your cluster. Currently supports Kubernetes 1.16, 1.17, 1.18, and 1.19. Defaults to the latest stable version supported by SigOpt Orchestrate, which is currently 1.18.
  • provider (optional): Currently, AWS is the only supported provider for creating clusters. You can, however, use a custom provider to connect to your own Kubernetes cluster with orchestrate cluster connect. See the section on Bringing your own K8s Cluster.
  • system (optional): System nodes are required to run the autoscaler. You can specify the number and type of system nodes with min_nodes, max_nodes, and instance_type. The value of min_nodes must be at least 1 so that you always have at least one system node. The defaults for system are:
      • min_nodes: 1
      • max_nodes: 2
      • instance_type: "t3.large"
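
For instance, a cluster configuration that overrides the default system node pool and supplies AWS credentials directly might look like the sketch below (the credential values are placeholders, and the nesting of the aws keys simply follows the field descriptions above):

# cluster.yml (excerpt)
aws:
  aws_access_key_id: YOUR_ACCESS_KEY_ID
  aws_secret_access_key: YOUR_SECRET_ACCESS_KEY

system:
  min_nodes: 1
  max_nodes: 2
  instance_type: t3.large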

Example

The example YAML file below defines a CPU cluster named tiny-cluster with up to two t2.small AWS instances.

# cluster.yml

# AWS is currently our only supported provider for cluster create
# You can connect to custom clusters via orchestrate connect
provider: aws

# We have provided a name that is short and descriptive
cluster_name: tiny-cluster

# Your cluster config can have CPU nodes, GPU nodes, or both.
# The configuration of your nodes is defined in the sections below.

# (Optional) Define CPU compute here
cpu:
  # AWS instance type
  instance_type: t2.small
  max_nodes: 2
  min_nodes: 0

# # (Optional) Define GPU compute here
# gpu:
#   # AWS GPU-enabled instance type
#   # This can be any p* instance type
#   instance_type: p2.xlarge
#   max_nodes: 2
#   min_nodes: 0

kubernetes_version: '1.18'

Configure training orchestration

The SigOpt Orchestrate configuration file tells SigOpt Orchestrate how to set up and run your model, which metrics to track, and which hyperparameters to tune.

You can use a SigOpt Orchestrate config YAML file you've already created, or SigOpt Orchestrate will auto-generate an orchestrate.yml template file for you if you run the following:

orchestrate init

The available fields are:

  • image (required): Name of the Docker container SigOpt Orchestrate creates for you. You can also point this to an existing Docker container to use for SigOpt Orchestrate.
  • name (required): Name for your training run or HPO experiment.
  • run (required): Model file to execute.
  • resources_per_model (optional): Resources required by your training run or HPO experiment. Can specify: gpus, cpu, memory.
  • optimization (optional): Enables an HPO experiment. Requires at least one metric. Its sub-fields are:
      • metrics: One or more metrics to optimize for your HPO experiment, along with each metric's strategy.
      • parameters: The name, type, bounds, and optimization strategy for each hyperparameter you want to optimize in your HPO experiment.
      • parallel_bandwidth (optional): Number of Kubernetes pods to use in parallel for your training run or HPO experiment. Note: SigOpt Orchestrate does not support multi-GPU, multi-machine distributed training. However, you can run a single-machine, multi-GPU training run.

Considerations for resources_per_model

When specifying CPUs, valid amounts are whole numbers (1, 2), and fractional numbers or millis (1.5 and 1500m both represent 1.5 CPU). When specifying memory, valid amounts are shown in the Kubernetes documentation for memory resources, but some examples are 1e9, 1Gi, 500M. For gpus, only whole numbers are valid.

When choosing the resources for a single model training run, it's important to keep in mind that some resources on your cluster will be auto-reserved for Kubernetes processes. For this reason, you must specify fewer resources for your model than are available on each node. A good rule of thumb is to assume that your node will have 0.5 CPU less than the total to run your model.

For example, if your nodes have 8 CPUs then you must specify fewer than 8 CPUs in the requests section of your resources_per_model in order for your model to run. Keep in mind that you can specify fractional amounts of CPU, e.g. 7.5 or 7500m.
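
As an illustration only (the exact values depend on your model), a requests block for a node with 8 CPUs might look like this:

# orchestrate.yml (excerpt)
resources_per_model:
  requests:
    cpu: 7500m      # 7.5 CPUs, leaving headroom for Kubernetes processes
    memory: 1Gi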

Example

Here's an example SigOpt Orchestrate configuration file.

# orchestrate.yml

resources_per_model:
  requests:
    cpu: 0.5
    memory: 512Mi
  limits:
    cpu: 1
    memory: 512Mi
# We don't need any GPUs for this example, so we'll leave this commented out
#  gpus: 1

# Choose a descriptive name for your model
name: Orchestrate SGD Classifier (python)

# Model run command
run: python model.py

# Optimization details
optimization:
  # Every experiment needs at least one named optimized metric.
  metrics:
    - name: accuracy
      strategy: optimize
      objective: maximize

  # Parameters that are defined here are available to SigOpt Orchestrate
  parameters:
    - name: l1_ratio
      type: double
      bounds:
        min: 0
        max: 1.0
    - name: log_alpha
      type: double
      bounds:
        min: -5
        max: 2

  # Our example cluster has two machines, so we have enough compute power
  # to execute two models in parallel.
  parallel_bandwidth: 2

  # We want to evaluate our model on sixty different sets of hyperparameters
  observation_budget: 60

# SigOpt Orchestrate creates a container for your model. Since we're using an AWS
# cluster, it's easy to securely store the model in the Amazon Elastic Container Registry.
# Choose a descriptive and unique name for each new experiment configuration file.
image: orchestrate/sgd-classifier

SigOpt Orchestrate Commands

The best way to get the most up-to-date information about cluster commands is from the command-line interface (CLI) itself! Append --help to any command to learn about subcommands, arguments, and flags.

For example, to learn more about all SigOpt Orchestrate commands, run:

orchestrate --help

To learn more about the specific orchestrate optimize command, run:

orchestrate optimize --help

The available commands and flags are:

  • orchestrate clean: Clean up SigOpt Orchestrate Docker images to free up disk space.
  • orchestrate cluster connect: Connect to an existing Kubernetes cluster on the specified cloud provider.
      • --cluster-name <cluster-name>
      • --kubeconfig <kubeconfig file>: Optional. Defaults to checking ~/.kube/config.
      • --provider <provider>: Valid inputs are: aws, custom.
      • --registry <registry-url>: Optional. Defaults to docker.io on a custom cluster, or to ECR for clusters created with SigOpt Orchestrate.
  • orchestrate cluster create -f <cluster config yaml>: AWS only. Creates a Kubernetes cluster using the specifications from the config YAML.
  • orchestrate cluster destroy: Destroy a cluster. At this time, only AWS clusters can be destroyed.
      • --cluster-name <cluster name>: Specify the cluster to destroy.
      • --provider <provider name>: Currently only supports: aws.
  • orchestrate cluster disconnect: Disconnect from a cluster. Requires one of the following flags.
      • --cluster-name <cluster name>: Specify the cluster to disconnect from.
      • -a: Disconnect from all clusters.
  • orchestrate cluster status: See the status of all your training runs and HPO experiments.
  • orchestrate cluster test: Checks the SigOpt Orchestrate cluster connection.
  • orchestrate cluster update -f <cluster config yaml>: AWS only. Updates the cluster with the changes in the config YAML. Not supported for all operations, e.g. introducing GPU nodes to your cluster in some regions.
  • orchestrate config: Configures the SigOpt Orchestrate CLI. You will need your SIGOPT_API_TOKEN, found on the SigOpt web app. You can also (optionally) enable log collection, so that stdout and stderr from your model go straight to the SigOpt web app.
  • orchestrate init: Creates a SigOpt Orchestrate configuration YAML file and Dockerfile that can be modified as needed.
  • orchestrate kubectl: Run kubectl commands directly on your cluster.
  • orchestrate optimize -f <experiment config yaml>: Executes an HPO experiment on your Kubernetes cluster using SigOpt Orchestrate.
  • orchestrate run -f <experiment config yaml>: Executes a training run on your Kubernetes cluster using SigOpt Orchestrate.
  • orchestrate status <object-type>/<id>: See the status of one or more training runs or HPO experiments. Requires at least one argument. A run ID is formatted as run/<run-id> and an experiment ID is formatted as experiment/<experiment-id>.
  • orchestrate stop: Stops and archives an experiment or a run on the cluster. If an experiment is stopped, all in-progress runs will be halted. Objects will still exist on sigopt.com.
      • --experiment <experiment-id>: Stops an HPO experiment.
      • --run <run-id>: Stops a run. Valid inputs for the ID argument are: <run-id>, run/<run-id>, suggestion/<suggestion-id>, <pod-name>.
  • orchestrate test-run -f <experiment config yaml>: Tests the execution of a single training run on your Kubernetes cluster using SigOpt Orchestrate. Outputs debugging information and logs from your training run.
  • orchestrate version: Prints the current version.
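
For example, a couple of common invocations might look like this (the run and experiment IDs below are placeholders):

# List the pods running on your cluster
orchestrate kubectl get pods

# Check the status of a specific run and a specific HPO experiment
orchestrate status run/1234
orchestrate status experiment/5678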

Adding Additional AWS Policies

Users creating AWS clusters with SigOpt Orchestrate can easily interface with other AWS services. To grant your cluster permission to access those services, provide additional AWS policies in the aws.additional_policies section of the cluster configuration file.
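
For example, granting the cluster read access to S3 might look like the sketch below (the policy ARN is only an illustration; list whichever policies your workload needs):

# cluster.yml (excerpt)
aws:
  additional_policies:
    - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess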

SigOpt Orchestrate Logging

SigOpt Orchestrate integrates seamlessly with the SigOpt API to optimize the hyperparameters of your model. SigOpt Orchestrate is built to handle communication with the SigOpt API under the hood, so that you only need to focus on your model, some lightweight installation requirements, and your experiment configuration file.

As you write your model, use a few lines of code from the sigopt package to read hyperparameters and write your model's metric(s).

Logging Example

Below is a comparison of two nearly identical multilayer perceptron models. The second model is written for SigOpt Orchestrate; the first is not. As you can see, the SigOpt Orchestrate version uses sigopt.get_parameter to read assignments from SigOpt Orchestrate, as well as sigopt.log_metric to send its metric value back to SigOpt Orchestrate.

import numpy
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD


x_train = numpy.random.random((1000, 20))
y_train = keras.utils.to_categorical(
  numpy.random.randint(10, size=(1000, 1)),
  num_classes=10,
)
x_test = numpy.random.random((100, 20))
y_test = keras.utils.to_categorical(
  numpy.random.randint(10, size=(100, 1)),
  num_classes=10,
)

dropout_rate = 0.5
model = Sequential()
model.add(Dense(
  units=64,
  activation='relu',
  input_dim=20,
))
model.add(Dropout(dropout_rate))
model.add(Dense(
  units=64,
  activation='relu',
))
model.add(Dropout(dropout_rate))
model.add(Dense(10, activation='softmax'))

sgd = SGD(
  lr=0.01,
  decay=1e-6,
  momentum=0.9,
  nesterov=True,
)
model.compile(
  loss='categorical_crossentropy',
  optimizer=sgd,
  metrics=['accuracy'],
)

model.fit(
  x=x_train,
  y=y_train,
  epochs=20,
  batch_size=128,
)
evaluation_loss, accuracy = model.evaluate(
  x=x_test,
  y=y_test,
  batch_size=128,
)
print('evaluation_loss:', evaluation_loss)
print('accuracy:', accuracy)
import numpy
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.optimizers import SGD
import sigopt

x_train = numpy.random.random((1000, 20))
y_train = keras.utils.to_categorical(
  numpy.random.randint(10, size=(1000, 1)),
  num_classes=10,
)
x_test = numpy.random.random((100, 20))
y_test = keras.utils.to_categorical(
  numpy.random.randint(10, size=(100, 1)),
  num_classes=10,
)

dropout_rate = sigopt.get_parameter('dropout_rate', default=0.5)
model = Sequential()
model.add(Dense(
  units=sigopt.get_parameter('hidden_1', default=64),
  activation=sigopt.get_parameter('activation_1', default='relu'),
  input_dim=20,
))
model.add(Dropout(dropout_rate))
model.add(Dense(
  units=sigopt.get_parameter('hidden_2', default=64),
  activation=sigopt.get_parameter('activation_2', default='relu'),
))
model.add(Dropout(dropout_rate))
model.add(Dense(10, activation='softmax'))

sgd = SGD(
  lr=10**sigopt.get_parameter('log_learning_rate', default=-2),
  decay=10**sigopt.get_parameter('log_decay', default=-6),
  momentum=sigopt.get_parameter('momentum', default=0.9),
  nesterov=True,
)
model.compile(
  loss=sigopt.get_parameter('loss', default='categorical_crossentropy'),
  optimizer=sgd,
  metrics=['accuracy'],
)

model.fit(
  x=x_train,
  y=y_train,
  epochs=sigopt.get_parameter('epochs', default=20),
  batch_size=sigopt.get_parameter('batch_size', default=128),
)
evaluation_loss, accuracy = model.evaluate(
  x=x_test,
  y=y_test,
  batch_size=128,
)
sigopt.log_metric('evaluation_loss', evaluation_loss)
sigopt.log_metric('accuracy', accuracy)

Logging Reference

  • sigopt.get_parameter(name, default=None): Read a single assignment for a parameter by name. If you set up your parameter in the parameters section of your experiment configuration file, this value will be a new assignment to try. If the parameter is not included in the experiment configuration file, the default will be returned. The default values will also be used if you execute your model file directly, without using orchestrate.
  • sigopt.get_suggestion: The latest Suggestion from the SigOpt API. This is automatically created for you, and is available to read.
  • sigopt.log_metric(name, value, stddev=None): Log a metric value for your model, like "accuracy". You may include an optional standard deviation. Any metrics not defined in the experiment configuration file will be recorded as metadata (see below).
  • sigopt.log_metadata(key, value): Log a piece of metadata.
  • sigopt.log_failure(): Indicate that this model is a failure.
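
A minimal sketch that puts these calls together (the parameter name, metric name, and train_and_evaluate function are hypothetical, shown only to illustrate the calls above):

import sigopt

learning_rate = sigopt.get_parameter('learning_rate', default=0.01)
sigopt.log_metadata('optimizer', 'sgd')

try:
  accuracy = train_and_evaluate(learning_rate)  # hypothetical training function
  sigopt.log_metric('accuracy', accuracy)
except Exception:
  # Report a failed evaluation so the optimizer can move on
  sigopt.log_failure()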

In the backend, the information from the sigopt.log_* calls will be automatically bundled into a SigOpt Observation. The JSON payload for every SigOpt Observation create call is logged to standard out. You will see log warnings if:

  • You use sigopt.get_parameter(parameter_name, ...) and parameter_name is not a parameter in your orchestrate.yml file.
  • You use sigopt.log_metric(metric_name, ...) and metric_name is not a metric in your orchestrate.yml file.
  • You use sigopt.log_failure() during the evaluation of your model.
  • Any of your run commands failed. This is also interpreted as a failure.
  • At least one of your metrics was not logged with sigopt.log_metric. This is also interpreted as a failure.

SigOpt Orchestrate Compute Resources

If you're training a model that needs a GPU, you will want to use resources_per_model to ensure that your model has access to GPUs. Requests and limits are optional, but may be helpful if your model is having trouble getting enough memory or CPU resources.

Requests are resource guarantees and will cause your model to wait until the cluster has available resources before running. Limits prevent your model from using additional resources. These map directly to Kubernetes requests and limits.
Note: If you only set a limit, a request of the same value is set automatically. See the Kubernetes documentation for details.

Resource Types

  • CPU resources are measured in the number of "logical cores" and can be decimal values. This is generally a vCPU in the cloud and a hyperthread on a custom cluster. See Meaning of CPU in the Kubernetes documentation for cloud-specific and formatting details.
  • Memory is measured in bytes but can be suffixed with "Mi" or "Gi" for mebibytes and gibibytes respectively. See Meaning of Memory in the Kubernetes documentation for details, and see the simple example below.
  • The gpus field is currently specific to Nvidia GPUs tagged as "nvidia.com/gpu". Alternatives can be used by adding them to the limits field.

The example below guarantees that 20% (0.2) of a logical core, 200 mebibytes of memory, and one GPU are available for your model to run. If the cluster you are running on does not have enough free compute resources, it will wait until they become available before running your model. This example will also limit your model so that it does not use more than 1 logical core and 2 gibibytes of memory.

name: My Orchestrate Experiment
install:
  -  pip install -r requirements.txt
run: python model.py
image: example/foobar
resources_per_model:
  gpus: 1
  requests:
    cpu: .2
    memory: 200Mi
  limits:
    cpu: 1
    memory: 2Gi
optimization:
...

Docker

Orchestrate uses Docker to build and upload your model environment. If you find that orchestrate optimize is taking a long time, then you may want to try some of the following tips to reduce the build and upload time of your model:

Keep your model directory free of extra files

Omit files like logs, saved models, tests, and virtual environments. Changes to these extra files will cause Orchestrate to re-build your model environment.

Reduce the complexity of your install commands

If you can, use one of our pre-built frameworks as a starting point.

Omit your training data from your model directory

You can try downloading or streaming your training data in your run commands instead.
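
For example, a run command could fetch the data before starting training (the bucket path below is purely hypothetical, and this assumes the AWS CLI is available in your model environment):

# orchestrate.yml (excerpt)
run: bash -c "aws s3 cp s3://my-bucket/training-data.csv data/training-data.csv && python model.py"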

Create a .dockerignore file in your model directory

This file should contain a list of the files that you want to omit from your model environment.

# python bytecode
**/*.pyc
**/__pycache__/

# virtual environment
venv/

# training data
data/

# tests
tests/

# anything else
.git/
saved_models/
logs/

See the official Docker documentation for more information.

(Advanced) Custom Image Registries

Clusters with the provider aws will use AWS ECR as their default container registry, and clusters with the provider custom will use Docker Hub.

To use a custom image registry, provide the registry argument when you connect to your cluster:

orchestrate cluster connect \
  --cluster-name tiny-cluster \
  --provider custom \
  --kubeconfig /path/to/kubeconfig \
  --registry myregistrydomain:port