SigOpt and scikit-learn

SigOpt's Python API Client works naturally with any machine learning library in Python, but to make things even easier we offer an additional SigOpt + scikit-learn package that can train and tune a model in just one line of code.

The SigOpt + scikit-learn package supports:

- SigOptSearchCV, a cross-validated hyperparameter search over scikit-learn estimators
- hyperparameter search over XGBoost's XGBClassifier via the same SigOptSearchCV interface
- SigOptEnsembleClassifier, which concurrently trains and tunes several classification models
- timeouts on individual CV fold evaluations, so overly slow configurations can be avoided

SigOpt's sklearn package is available via pip, with source code on GitHub:

pip install sigopt_sklearn

Find your SigOpt API token on the API tokens page.
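
One common pattern, shown here as a sketch rather than a package requirement, is to read the token from an environment variable so the snippets below can refer to it as SIGOPT_API_TOKEN without hard-coding a secret:

import os

# assumes you have exported the token in your shell, e.g.
#   export SIGOPT_API_TOKEN=<your token>
SIGOPT_API_TOKEN = os.environ["SIGOPT_API_TOKEN"]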

SigOptSearchCV

The simplest use case for SigOpt in conjunction with scikit-learn is optimizing estimator hyperparameters using cross validation. A short example that tunes the parameters of an SVM on a small dataset is provided below:

from sklearn import svm, datasets
from sigopt_sklearn.search import SigOptSearchCV

# find your SigOpt client token here : https://app.sigopt.com/tokens/info
client_token = SIGOPT_API_TOKEN

iris = datasets.load_iris()

# define parameter domains
svc_parameters = {'kernel': ['linear', 'rbf'], 'C': [0.5, 100]}

# define sklearn estimator
svr = svm.SVC()

# define SigOptCV search strategy
clf = SigOptSearchCV(svr, svc_parameters, cv=5,
                     client_token=client_token, n_jobs=5, n_iter=20)

# perform CV search for best parameters and fits estimator
# on all data using best found configuration
clf.fit(iris.data, iris.target)

# clf.predict() now uses best found estimator
# clf.best_score_ contains CV score for best found estimator
# clf.best_params_ contains best found param configuration
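
For example, once fit completes you can inspect the best configuration and reuse the refit estimator directly (a minimal sketch continuing the example above):

print(clf.best_params_)
print(clf.best_score_)

# predictions come from the estimator refit with the best found configuration
predictions = clf.predict(iris.data)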

XGBoostClassifier

SigOptSearchCV also works with XGBoost's XGBClassifier wrapper. A hyperparameter search over XGBClassifier models can be done using the same interface:

import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import datasets
from sigopt_sklearn.search import SigOptSearchCV

# find your SigOpt client token here : https://app.sigopt.com/tokens/info
client_token = SIGOPT_API_TOKEN
iris = datasets.load_iris()

# define parameter domains as [min, max] ranges
xgb_params = {
    'learning_rate': [0.01, 0.5],
    'n_estimators': [10, 50],
    'max_depth': [3, 10],
    'min_child_weight': [6, 12],
    'gamma': [0, 0.5],
    'subsample': [0.6, 1.0],
    'colsample_bytree': [0.6, 1.0]
}

xgbc = XGBClassifier()

clf = SigOptSearchCV(xgbc, xgb_params, cv=5,
                     client_token=client_token, n_jobs=5, n_iter=70, verbose=1)

clf.fit(iris.data, iris.target)

SigOptEnsembleClassifier

This class concurrently trains and tunes several scikit-learn classification models to facilitate model selection when investigating new datasets. In a companion tutorial video, SigOpt Research Engineer Ian Dewancker walks through how to use the ensemble classifier on an example activity recognition dataset using Amazon Web Services (AWS):

To run the SigOpt ensemble classifier on your own, first run the following on the command line to download the example dataset:

# Human Activity Recognition Using Smartphone
# https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip
unzip UCI\ HAR\ Dataset.zip
cd UCI\ HAR\ Dataset

Next, run the following code in Python:

import numpy
import pandas as pd
from sigopt_sklearn.ensemble import SigOptEnsembleClassifier

def load_datafile(filename):
    # each line of the file is a whitespace-separated row of floats
    X = []
    with open(filename, "r") as f:
        for l in f:
            X.append(numpy.array([float(v) for v in l.split()]))
    X = numpy.vstack(X)
    return X

X_train = load_datafile("train/X_train.txt")
y_train = load_datafile("train/y_train.txt").ravel()
X_test = load_datafile("test/X_test.txt")
y_test = load_datafile("test/y_test.txt").ravel()

# fit and tune several classification models concurrently
# find your SigOpt client token here : https://app.sigopt.com/tokens/info
sigopt_clf = SigOptEnsembleClassifier()
sigopt_clf.parallel_fit(X_train, y_train, est_timeout=(40 * 60),
                        client_token=SIGOPT_API_TOKEN)

# compare model performance on hold out set
ensemble_train_scores = [
    est.score(X_train, y_train)
    for est
    in sigopt_clf.estimator_ensemble
]
ensemble_test_scores = [
    est.score(X_test, y_test)
    for est
    in sigopt_clf.estimator_ensemble
]
data = sorted(
    zip(
        [est.__class__.__name__ for est in sigopt_clf.estimator_ensemble],
        ensemble_train_scores,
        ensemble_test_scores
    ),
    reverse=True,
    key=lambda x: (x[2], x[1])
)
pd.DataFrame(data, columns=['Classifier ALGO.', 'Train ACC.', 'Test ACC.'])
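
If you want a single model out of the ensemble rather than the full comparison table, one option (a sketch, assuming plain hold-out accuracy is the selection criterion you care about) is:

# select the estimator with the highest hold-out accuracy
best_est = max(
    sigopt_clf.estimator_ensemble,
    key=lambda est: est.score(X_test, y_test)
)
print(best_est.__class__.__name__)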

CV Fold Timeouts

SigOptSearchCV performs evaluations on cv folds in parallel using joblib. Timeouts are now supported in the master branch of joblib, and SigOpt can use this timeout information to learn to avoid hyperparameter configurations that are too slow.

You'll need to install joblib from source for this example to work:

pip uninstall joblib
git clone https://github.com/joblib/joblib.git
cd joblib; python setup.py install

Next, run the following code in Python:

from sklearn import svm, datasets
from sigopt_sklearn.search import SigOptSearchCV

# find your SigOpt client token here : https://app.sigopt.com/tokens/info
client_token = SIGOPT_API_TOKEN
dataset = datasets.fetch_20newsgroups_vectorized()
X = dataset.data
y = dataset.target

# define parameter domains
svc_parameters = {'kernel': ['linear', 'rbf'], 'C': [0.5, 100],
                  'max_iter': [10, 200], 'tol': [1e-6, 1e-2]}
svr = svm.SVC()

# SVM fitting can be quite slow, so we set timeout = 180 seconds
# for each fit. SigOpt will then avoid configurations that are too slow
clf = SigOptSearchCV(svr, svc_parameters, cv=5, timeout=180,
                     client_token=client_token, n_jobs=5, n_iter=40)

clf.fit(X, y)