Bayesian Optimization

class edbo.bro.BO(results_path=None, results=Empty DataFrame Columns: [] Index: [], domain_path=None, domain=Empty DataFrame Columns: [] Index: [], exindex_path=None, exindex=Empty DataFrame Columns: [] Index: [], model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, duplicate_experiments=False, gpu=False, fast_comp=False, noise_constraint=1e-05, matern_nu=2.5, lengthscale_prior=[GammaPrior(), 5.0], outputscale_prior=[GammaPrior(), 8.0], noise_prior=[GammaPrior(), 1.0], computational_objective=None)

Main method for calling Bayesian optimization algorithm.

Class provides a unified framework for selecting experimental conditions for the parallel optimization of chemical reactions and for the simulation of known objectives. The algorithm is implemented on a user defined grid of domain points and is flexible to any numerical encoding.
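As a rough illustration of what a "grid of domain points" means, the sketch below enumerates every combination of a few reaction parameters using only the standard library. The parameter names and values are hypothetical, and `build_domain` is an illustrative helper, not part of edbo.

```python
from itertools import product

# Hypothetical reaction parameters; names and values are illustrative only.
parameters = {
    "temperature": [20, 30, 40],
    "concentration": [0.1, 0.2, 0.3],
    "catalyst_loading": [1.0, 2.5],
}


def build_domain(parameters):
    """Return every combination of parameter values as a list of row dicts."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]


domain_rows = build_domain(parameters)
print(len(domain_rows))   # 3 * 3 * 2 = 18 candidate experiments
print(domain_rows[0])
```

A list of row dictionaries like this can be turned into a table with pandas.DataFrame(domain_rows), which is the kind of object the domain argument expects.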

__init__(results_path=None, results=Empty DataFrame Columns: [] Index: [], domain_path=None, domain=Empty DataFrame Columns: [] Index: [], exindex_path=None, exindex=Empty DataFrame Columns: [] Index: [], model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, duplicate_experiments=False, gpu=False, fast_comp=False, noise_constraint=1e-05, matern_nu=2.5, lengthscale_prior=[GammaPrior(), 5.0], outputscale_prior=[GammaPrior(), 8.0], noise_prior=[GammaPrior(), 1.0], computational_objective=None)

Experimental results, experimental domain, and experiment index of known results can be passed as paths to .csv or .xlsx files or as DataFrames.

Parameters
  • results_path (str, optional) – Path to experimental results.

  • results (pandas.DataFrame, optional) – Experimental results with X values matching the domain.

  • domain_path (str, optional) –

    Path to experimental domain.

    Note

    Either domain_path or domain is required.

  • domain (pandas.DataFrame, optional) – Experimental domain specified as a matrix of possible configurations.

  • exindex_path (str, optional) – Path to experiment results index if available.

  • exindex (pandas.DataFrame, optional) – Experiment results index matching domain format. Used as lookup table for simulations.

  • model (edbo.models) – Surrogate model object used for Bayesian optimization. See edbo.models for predefined models and specification of custom models.

  • acquisition_function (str) – Acquisition function used for selecting a batch of domain points to evaluate. Options: (TS) Thompson Sampling, (EI) Expected Improvement, (PI) Probability of Improvement, (UCB) Upper Confidence Bound, (EI-TS) EI (first choice) + TS (n-1 choices), (PI-TS) PI (first choice) + TS (n-1 choices), (UCB-TS) UCB (first choice) + TS (n-1 choices), (MeanMax-TS) Mean maximization (first choice) + TS (n-1 choices), (VarMax-TS) Variance maximization (first choice) + TS (n-1 choices), (MeanMax) Top predicted values, (VarMax) Variance maximization, (rand) Random selection.

  • init_method (str) – Strategy for selecting initial points for evaluation. Options: (rand) Random selection, (pam) k-medoids algorithm, (kmeans) k-means algorithm, (external) User-defined external data read in as results.

  • target (str) – Column label of optimization objective. If set to -1, the last column of the DataFrame will be set as the target.

  • batch_size (int) – Number of experiments selected via acquisition and initialization functions.

  • duplicate_experiments (bool) – Allow the acquisition function to select experiments already present in results.

  • gpu (bool) – Carry out GPyTorch computations on a GPU if available.

  • fast_comp (bool) – Enable fast computation features for GPyTorch models.

  • noise_constraint (float) – Noise constraint for GPyTorch models.

  • matern_nu (0.5, 1.5, 2.5) – Parameter value for model Matern kernel.

  • lengthscale_prior ([gpytorch.priors, initial_value]) – Specify a prior over GP length scale parameters.

  • outputscale_prior ([gpytorch.priors, initial_value]) – Specify a prior over the GP output scale parameter.

  • noise_prior ([gpytorch.priors, initial_value]) – Specify a prior over the GP noise parameter.

  • computational_objective (function, optional) – Function to be optimized for computational objectives.
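To make the hybrid acquisition options more concrete, here is a minimal standard-library sketch of the "EI (first choice) + TS (n-1 choices)" idea. It uses the textbook Expected Improvement formula for maximization and is not edbo's implementation. The function names, the jitter term, and the simplification that Thompson draws treat each point as an independent normal are all assumptions made for illustration.

```python
import math
import random


def expected_improvement(mean, std, best, jitter=0.01):
    """Textbook EI for maximization under a normal predictive distribution."""
    if std <= 0.0:
        return max(mean - best - jitter, 0.0)
    z = (mean - best - jitter) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - best - jitter) * cdf + std * pdf


def ei_ts_batch(means, stds, best, batch_size, seed=None):
    """Pick one point by EI, then fill the batch with Thompson-style draws.

    Assumption: each draw samples points as independent normals, which
    ignores the covariance a real GP posterior sample would carry.
    """
    rng = random.Random(seed)
    scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
    chosen = [max(range(len(means)), key=scores.__getitem__)]
    while len(chosen) < min(batch_size, len(means)):
        draw = [rng.gauss(m, s) for m, s in zip(means, stds)]
        remaining = [i for i in range(len(means)) if i not in chosen]
        chosen.append(max(remaining, key=draw.__getitem__))
    return chosen


# Hypothetical predicted means and standard deviations for four domain points.
means = [0.2, 0.5, 0.9, 0.4]
stds = [0.3, 0.1, 0.05, 0.3]
batch = ei_ts_batch(means, stds, best=0.8, batch_size=3, seed=0)
print(batch)
```

The first index in the batch is the EI maximizer; the remaining indices depend on the random draws, so they change with the seed.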

init_sample(seed=None, append=False, export_path=None, visualize=False)

Generate initial samples via an initialization method.

Parameters
  • seed (None, int) – Random seed used for selecting initial points.

  • append (bool) – Append points to results if computational objective or experiment index are available.

  • export_path (str) – Path to export SVG of clustering results if pam or kmeans methods are used for selecting initial points.

  • visualize (bool) – If initialization method is set to ‘pam’ or ‘kmeans’ and visualize is set to True then a 2D embedding of the clustering results will be generated.

Returns

Domain points for proposed experiments.

Return type

pandas.DataFrame
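The role of seed can be pictured with a small standard-library sketch of random initialization: the same seed yields the same initial selection. The helper `random_initial_indices` is hypothetical and only mimics the idea behind the rand option; the clustering-based options (pam, kmeans) are not shown.

```python
import random


def random_initial_indices(n_domain_points, batch_size, seed=None):
    """Choose batch_size distinct row indices of the domain at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_domain_points), batch_size))


first = random_initial_indices(100, 5, seed=42)
second = random_initial_indices(100, 5, seed=42)
print(first == second)  # True: same seed, same initial experiments
```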

run(append=False, n_restarts=0, learning_rate=0.1, training_iters=100)

Run a single iteration of optimization with known results.

Note

Use run for human-in-the-loop optimization.

Parameters
  • append (bool) – Append points to results if computational objective or experiment index are available.

  • n_restarts (int) – Number of restarts used when optimizing GPyTorch model parameters.

  • learning_rate (float) – ADAM learning rate used when optimizing GPyTorch model parameters.

  • training_iters (int) – Number of iterations to run ADAM when optimizing GPyTorch model parameters.

Returns

Domain points for proposed experiments.

Return type

pandas.DataFrame
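A human-in-the-loop round might be wired up roughly as follows. This is a sketch built only from the parameters documented above and has not been checked against a particular edbo release. The wrapper `propose_next_batch` is hypothetical, and it assumes results holds the domain columns followed by the target in the last column.

```python
def propose_next_batch(domain, results, batch_size=5):
    """Fit the surrogate to known results and return the next experiments.

    domain and results are pandas.DataFrames as described for BO.__init__.
    Assumption: results contains the domain columns plus the target as the
    last column.
    """
    # Imported lazily so this sketch can be loaded without edbo installed.
    from edbo.bro import BO

    bo = BO(
        domain=domain,
        results=results,
        init_method='external',      # results are supplied by the user
        acquisition_function='EI',
        batch_size=batch_size,
        target=-1,                   # use the last column of results
    )
    proposed = bo.run()              # one optimization iteration
    return bo, proposed
```

After the proposed experiments have been carried out, the new rows would be appended to results and the function called again; acquisition_summary() on the returned object reports the predicted mean and variance for the proposed points.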

simulate(iterations=1, seed=None, update_priors=False, n_restarts=0, learning_rate=0.1, training_iters=100)

Run autonomous BO loop.

Run N iterations of optimization with initial results obtained via the initialization method and experiments selected from the experiment index via the acquisition function. Simulations require known objectives, supplied either as an index of results or as a function.

Note

Requires a computational objective or experiment index.

Parameters
  • iterations (int) – Number of optimization iterations to run.

  • n_restarts (int) – Number of restarts used when optimizing GPyTorch model parameters.

  • learning_rate (float) – ADAM learning rate used when optimizing GPyTorch model parameters.

  • training_iters (int) – Number of iterations to run ADAM when optimizing GPyTorch model parameters.

  • seed (None, int) – Random seed used for initialization.

  • update_priors (bool) – Use parameter estimates from optimization step N-1 as initial values for step N.
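The shape of a simulation against an experiment index can be sketched with a toy closed loop. This is not edbo's code: selection here is purely random (comparable in spirit to the rand acquisition option) so that no surrogate model is needed, and `simulate_lookup` along with the example index are hypothetical.

```python
import random


def simulate_lookup(exindex, batch_size=2, iterations=3, seed=None):
    """Toy closed loop: pick untested points, read objectives from exindex.

    exindex maps a domain point (tuple) to its known objective value.
    Selection is random, standing in for a model-based acquisition function.
    """
    rng = random.Random(seed)
    untested = list(exindex)
    results = {}
    best_so_far = []
    # One initialization round plus `iterations` optimization rounds.
    for _ in range(iterations + 1):
        batch = rng.sample(untested, min(batch_size, len(untested)))
        for point in batch:
            results[point] = exindex[point]   # "run" the experiment via lookup
            untested.remove(point)
        best_so_far.append(max(results.values()))
    return results, best_so_far


# Hypothetical index of results: (temperature, concentration) -> yield.
exindex = {(t, c): round(t * c, 2) for t in (20, 30, 40) for c in (0.1, 0.2, 0.3)}
results, trace = simulate_lookup(exindex, batch_size=2, iterations=3, seed=1)
print(trace)  # best observed value after each round, never decreasing
```

The trace of best observed values per round is the kind of quantity a convergence plot displays.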

clear_results()

Clear results manually.

Note

‘rand’ and ‘pam’ initialization methods clear results automatically.

plot_convergence(export_path=None)

Plot optimizer convergence.

Parameters

export_path (None, str) – Path to export SVG of optimizer convergence plot.

Returns

Plot of optimizer convergence.

Return type

matplotlib.pyplot

acquisition_summary()

Summarize predicted mean and variance for proposed points.

Returns

Summary table.

Return type

pandas.DataFrame

best()

Best observed objective values and corresponding domain point.

save(path='BO.pkl')

Save BO state.

Parameters

path (str) – Path to export <BO state dict>.pkl.

Returns

Return type

None

load(path='BO.pkl')

Load BO state.

Parameters

path (str) – Path to <BO state dict>.pkl.

Returns

Return type

None

class edbo.bro.BO_express(reaction_components={}, encoding={}, descriptor_matrices={}, model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, computational_objective=None)

Quick method for auto-generating a reaction space, encoding, and BO.

Class provides a unified framework for defining reaction spaces, encoding reactions, selecting experimental conditions for the parallel optimization of chemical reactions, and analyzing results.

BO_express automates most of the steps required for BO, such as featurization of the reaction space, preprocessing of data, and selection of Gaussian process priors.

Reaction components and encodings are passed to BO_express using dictionaries. BO_express attempts to encode each component based on the specified encoding. If there is an error in a SMILES string or the name could not be found in the NIH database an edbo bot is spawned to help resolve the issue. Once instantiated, BO_express.help() will also spawn an edbo bot to help with tasks.

Example

Defining a reaction space

from edbo.bro import BO_express

# (1) Define a dictionary of components
reaction_components={
    'aryl_halide':['chlorobenzene','iodobenzene','bromobenzene'],
    'base':['DBU', 'MTBD', 'potassium carbonate', 'potassium phosphate'],
    'solvent':['THF', 'Toluene', 'DMSO', 'DMAc'],
    'ligand': ['c1ccc(cc1)P(c2ccccc2)c3ccccc3', # PPh3
               'C1CCC(CC1)P(C2CCCCC2)C3CCCCC3', # PCy3
               'CC(C)c1cc(C(C)C)c(c(c1)C(C)C)c2ccccc2P(C3CCCCC3)C4CCCCC4' # X-Phos
               ],
    'concentration':[0.1, 0.2, 0.3],
    'temperature': [20, 30, 40],
    'additive': '<defined in descriptor_matrices>'}

# (2) Define a dictionary of desired encodings
encoding={'aryl_halide':'resolve',
          'base':'ohe',
          'solvent':'resolve',
          'ligand':'smiles',
          'concentration':'numeric',
          'temperature':'numeric'}

# (3) Add any user-defined descriptor matrices directly
import pandas as pd

A = pd.DataFrame(
         [['a1', 1,2,3,4],['a2',1,5,2,0],['a3', 3,5,1,25]],
         columns=['additive', 'A_des1', 'A_des2', 'A_des3', 'A_des4'])

descriptor_matrices = {'additive': A}

# (4) Instantiate BO_express
bo = BO_express(reaction_components=reaction_components,
                encoding=encoding,
                descriptor_matrices=descriptor_matrices,
                batch_size=10,
                acquisition_function='TS',
                target='yield')

__init__(reaction_components={}, encoding={}, descriptor_matrices={}, model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, computational_objective=None)

Parameters
  • reaction_components (dict) –

    Dictionary of reaction components of the form:

    Example

    Defining reaction components

    {'A': [a1, a2, a3, ...],
     'B': [b1, b2, b3, ...],
     'C': [c1, c2, c3, ...],
                 .
     'N': [n1, n2, n3, ...]}
    

    Components can be specified as: (1) arbitrary names, (2) chemical names or nicknames, (3) SMILES strings, or (4) numeric values.

    Note

    A reaction component will not be encoded unless its key is present in the reaction_components dictionary.

  • encoding (dict) –

    Dictionary of encodings with keys corresponding to reaction_components. Encoding dictionary has the form:

    Example

    Defining reaction encodings

    {'A': 'resolve',
     'B': 'ohe',
     'C': 'smiles',
            .
     'N': 'numeric'}
    

    Encodings can be specified as: (‘resolve’) resolve a compound name using the NIH database and compute Mordred descriptors, (‘ohe’) one-hot-encode, (‘smiles’) compute Mordred descriptors using a SMILES string, (‘numeric’) numerical reaction parameters are used as passed. If no encoding is specified for a component, it will be automatically one-hot-encoded.

  • descriptor_matrices (dict) –

    Dictionary of descriptor matrices where keys correspond to reaction_components and values are pandas.DataFrames.

    Descriptor dictionary has the form:

    Example

    User defined descriptor matrices

    # DataFrame where the first column is the identifier (e.g., a SMILES string)
    
    A = pd.DataFrame([....], columns=[...])
    
    --------------------------------------------
      A_SMILES  |  des1  |  des2  | des3 | ...
    --------------------------------------------
          .         .        .       .     ...
          .         .        .       .     ...
    --------------------------------------------
    
    # Dictionary of descriptor matrices defined as DataFrames
    
    descriptor_matrices = {'A': A}
    

    Note

    If a key is present in both encoding and descriptor_matrices then the descriptor matrix will take precedence.

  • model (edbo.models) – Surrogate model object used for Bayesian optimization. See edbo.models for predefined models and specification of custom models.

  • acquisition_function (str) – Acquisition function used for selecting a batch of domain points to evaluate. Options: (TS) Thompson Sampling, (EI) Expected Improvement, (PI) Probability of Improvement, (UCB) Upper Confidence Bound, (EI-TS) EI (first choice) + TS (n-1 choices), (PI-TS) PI (first choice) + TS (n-1 choices), (UCB-TS) UCB (first choice) + TS (n-1 choices), (MeanMax-TS) Mean maximization (first choice) + TS (n-1 choices), (VarMax-TS) Variance maximization (first choice) + TS (n-1 choices), (MeanMax) Top predicted values, (VarMax) Variance maximization, (rand) Random selection.

  • init_method (str) – Strategy for selecting initial points for evaluation. Options: (rand) Random selection, (pam) k-medoids algorithm, (kmeans) k-means algorithm, (external) User-defined external data read in as results.

  • target (str) – Column label of optimization objective. If set to -1, the last column of the DataFrame will be set as the target.

  • batch_size (int) – Number of experiments selected via acquisition and initialization functions.

  • computational_objective (function, optional) – Function to be optimized for computational objectives.
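The precedence rules stated above (a descriptor matrix overrides a declared encoding, and a component without any declared encoding falls back to one-hot encoding) can be summarized in a short standard-library sketch. The helpers `effective_encoding` and `one_hot` and the label 'descriptor_matrix' are hypothetical and only illustrate the rules; they do not reproduce how BO_express encodes components internally.

```python
def effective_encoding(component, encoding, descriptor_matrices):
    """Return which encoding applies to a component under the stated rules."""
    if component in descriptor_matrices:
        return 'descriptor_matrix'          # user descriptors take precedence
    return encoding.get(component, 'ohe')   # default: one-hot-encode


def one_hot(values):
    """One-hot-encode a list of category labels into rows of 0/1 columns."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


encoding = {'base': 'ohe', 'ligand': 'smiles', 'additive': 'smiles'}
descriptor_matrices = {'additive': object()}   # placeholder for a DataFrame

print(effective_encoding('additive', encoding, descriptor_matrices))  # descriptor_matrix
print(effective_encoding('solvent', encoding, descriptor_matrices))   # ohe
print(one_hot(['DBU', 'MTBD', 'DBU']))  # [[1, 0], [0, 1], [1, 0]]
```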

get_experiments(structures=False)

Return indexed experiments proposed by Bayesian optimization algorithm.

edbo.BO works directly with a standardized encoded reaction space. This method returns proposed experiments as the original SMILES strings, categories, or numerical values.

Parameters

structures (bool) – If True, use RDKit to print out the chemical structures of any encoded SMILES strings.

Returns

Proposed experiments.

Return type

pandas.DataFrame

add_results(results_path=None)

Add experimental results.

Experimental results should be added with the same column headings as those returned by BO_express.get_experiments. If a path to the results is not specified, an edbo bot is spawned to help load results. It does so by exporting the entire reaction space to a CSV file in the working directory.

Note: The first column in the CSV/Excel results file must contain the same index as the corresponding experiment. Try BO_express.export_proposed() to export a CSV file with the proper format.

Parameters

results_path (str) – Imports results from a CSV/EXCEL file with system path results_path.

Returns

Return type

None
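Before loading a results file, it can be worth checking that its first column really carries the indices of the proposed experiments. The sketch below does this with the standard csv module. It assumes a header row and integer experiment indices; the column names and the helper `check_results_index` are hypothetical and not part of edbo.

```python
import csv
import io


def check_results_index(results_csv_text, proposed_indices):
    """Return indices in the results file that were not proposed experiments.

    Assumes a header row and an integer experiment index in the first column.
    """
    proposed = set(proposed_indices)
    reader = csv.reader(io.StringIO(results_csv_text))
    next(reader)                                   # skip the header row
    found = [int(row[0]) for row in reader if row]
    return [i for i in found if i not in proposed]


# Hypothetical results file contents; column names are illustrative.
text = "index,base,temperature,yield\n12,DBU,20,45.1\n87,MTBD,40,63.0\n"
print(check_results_index(text, proposed_indices=[12, 87]))   # []
print(check_results_index(text, proposed_indices=[12, 90]))   # [87]
```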

export_proposed(path=None)

Export proposed experiments.

edbo.BO works directly with a standardized encoded reaction space. This method exports proposed experiments as the original SMILES strings, categories, or numerical values. If a path is not specified, a CSV file entitled ‘experiments.csv’ will be exported to the current working directory.

Parameters

path (str) – Export a CSV file to path.

Returns

Return type

None

help()

Spawn an edbo bot to help with tasks.

If you are not familiar with edbo commands, BO_express.help() will spawn an edbo bot to help with tasks. Natural language can be used to interact with the edbo bot in the terminal to accomplish tasks such as: initializing (selecting initial experiments using the chosen init method), optimizing (running the BO algorithm with available data to choose the next experiments), getting proposed experiments, adding experimental results, checking the underlying model's regression performance, saving the BO instance so you can load it for later use, and exporting proposed experiments to a CSV file.