Bayesian Optimization
-
class
edbo.bro.
BO
(results_path=None, results=Empty DataFrame Columns: [] Index: [], domain_path=None, domain=Empty DataFrame Columns: [] Index: [], exindex_path=None, exindex=Empty DataFrame Columns: [] Index: [], model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, duplicate_experiments=False, gpu=False, fast_comp=False, noise_constraint=1e-05, matern_nu=2.5, lengthscale_prior=[GammaPrior(), 5.0], outputscale_prior=[GammaPrior(), 8.0], noise_prior=[GammaPrior(), 1.0], computational_objective=None)¶ Main method for calling Bayesian optimization algorithm.
Class provides a unified framework for selecting experimental conditions for the parallel optimization of chemical reactions and for the simulation of known objectives. The algorithm is implemented on a user defined grid of domain points and is flexible to any numerical encoding.
-
__init__
(results_path=None, results=Empty DataFrame Columns: [] Index: [], domain_path=None, domain=Empty DataFrame Columns: [] Index: [], exindex_path=None, exindex=Empty DataFrame Columns: [] Index: [], model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, duplicate_experiments=False, gpu=False, fast_comp=False, noise_constraint=1e-05, matern_nu=2.5, lengthscale_prior=[GammaPrior(), 5.0], outputscale_prior=[GammaPrior(), 8.0], noise_prior=[GammaPrior(), 1.0], computational_objective=None)¶ Experimental results, experimental domain, and experiment index of known results can be passed as paths to .csv or .xlsx files or as DataFrames.
- Parameters
results_path (str, optional) – Path to experimental results.
results (pandas.DataFrame, optional) – Experimental results with X values matching the domain.
domain_path (str, optional) –
Path to experimental domain.
Note
Either domain_path or domain is required.
domain (pandas.DataFrame, optional) – Experimental domain specified as a matrix of possible configurations.
exindex_path (str, optional) – Path to experiment results index if available.
exindex (pandas.DataFrame, optional) – Experiment results index matching domain format. Used as lookup table for simulations.
model (edbo.models) – Surrogate model object used for Bayesian optimization. See edbo.models for predefined models and specification of custom models.
acquisition_function (str) – Acquisition function used for selecting a batch of domain points to evaluate. Options: (TS) Thompson Sampling, (EI) Expected Improvement, (PI) Probability of Improvement, (UCB) Upper Confidence Bound, (EI-TS) EI (first choice) + TS (n-1 choices), (PI-TS) PI (first choice) + TS (n-1 choices), (UCB-TS) UCB (first choice) + TS (n-1 choices), (MeanMax-TS) Mean maximization (first choice) + TS (n-1 choices), (VarMax-TS) Variance maximization (first choice) + TS (n-1 choices), (MeanMax) Top predicted values, (VarMax) Variance maximization, (rand) Random selection.
init_method (str) – Strategy for selecting initial points for evaluation. Options: (rand) Random selection, (pam) k-medoids algorithm, (kmeans) k-means algorithm, (external) User-defined external data read in as results.
target (str) – Column label of optimization objective. If set to -1, the last column of the DataFrame will be set as the target.
batch_size (int) – Number of experiments selected via acquisition and initialization functions.
duplicate_experiments (bool) – Allow the acquisition function to select experiments already present in results.
gpu (bool) – Carry out GPyTorch computations on a GPU if available.
fast_comp (bool) – Enable fast computation features for GPyTorch models.
noise_constraint (float) – Noise constraint for GPyTorch models.
matern_nu (0.5, 1.5, 2.5) – Parameter value for model Matern kernel.
lengthscale_prior ([gpytorch.priors.Prior, initial_value]) – Specify a prior over GP length scale parameters.
outputscale_prior ([gpytorch.priors.Prior, initial_value]) – Specify a prior over the GP output scale parameter.
noise_prior ([gpytorch.priors.Prior, initial_value]) – Specify a prior over the GP noise parameter.
computational_objective (function, optional) – Function to be optimized for computational objectives.
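As a rough illustration of what the EI option computes, the sketch below evaluates the textbook expected improvement formula for maximization on a few candidate points and picks the top-scoring batch. It is not taken from edbo: the function names, the example predictions, and the greedy top-k batching are assumptions made for illustration, and edbo's own batch selection may differ.

```python
import math


def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """Textbook EI (maximization) for one candidate point."""
    if std <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * normal_cdf(z) + std * normal_pdf(z)


def select_batch(means, stds, best, batch_size):
    """Indices of the batch_size candidates with the highest EI (greedy top-k)."""
    scores = [expected_improvement(m, s, best) for m, s in zip(means, stds)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical surrogate predictions (mean, standard deviation) for four domain points.
means = [0.50, 0.70, 0.65, 0.20]
stds = [0.05, 0.01, 0.20, 0.30]
batch = select_batch(means, stds, best=0.68, batch_size=2)
```

Note how the third point can outrank the second even though its predicted mean is lower: its larger predictive uncertainty leaves more room for improvement over the current best.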
-
init_sample
(seed=None, append=False, export_path=None, visualize=False)¶ Generate initial samples via an initialization method.
- Parameters
seed (None, int) – Random seed used for selecting initial points.
append (bool) – Append points to results if computational objective or experiment index are available.
export_path (str) – Path to export SVG of clustering results if pam or kmeans methods are used for selecting initial points.
visualize (bool) – If initialization method is set to ‘pam’ or ‘kmeans’ and visualize is set to True then a 2D embedding of the clustering results will be generated.
- Returns
Domain points for proposed experiments.
- Return type
pandas.DataFrame
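The effect of the seed argument with the rand initialization method can be pictured with the sketch below, which draws a reproducible random batch from a small grid. It is a standalone illustration using Python's random module; the helper name random_init and the toy domain are hypothetical, and edbo's internal sampling may be implemented differently.

```python
import random


def random_init(domain, batch_size, seed=None):
    """Pick batch_size distinct domain points; a fixed seed gives a repeatable choice."""
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(domain)), batch_size))
    return [domain[i] for i in indices]


# Toy domain: every (temperature, concentration) combination.
domain = [(t, c) for t in (20, 30, 40) for c in (0.1, 0.2, 0.3)]

first = random_init(domain, batch_size=3, seed=0)
again = random_init(domain, batch_size=3, seed=0)  # same seed, same points
```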
-
run
(append=False, n_restarts=0, learning_rate=0.1, training_iters=100)¶ Run a single iteration of optimization with known results.
Note
Use run for human-in-the-loop optimization.
- Parameters
append (bool) – Append points to results if computational objective or experiment index are available.
n_restarts (int) – Number of restarts used when optimizing GPyTorch model parameters.
learning_rate (float) – ADAM learning rate used when optimizing GPyTorch model parameters.
training_iters (int) – Number of iterations to run ADAM when optimizing GPyTorch model parameters.
- Returns
Domain points for proposed experiments.
- Return type
pandas.DataFrame
-
simulate
(iterations=1, seed=None, update_priors=False, n_restarts=0, learning_rate=0.1, training_iters=100)¶ Run autonomous BO loop.
Run N iterations of optimization with initial results obtained via the initialization method and experiments selected from the experiment index via the acquisition function. Simulations require known objectives, supplied either as an index of results or as a function.
Note
Requires a computational objective or experiment index.
- Parameters
iterations (int) – Number of optimization iterations to run.
n_restarts (int) – Number of restarts used when optimizing GPyTorch model parameters.
learning_rate (float) – ADAM learning rate used when optimizing GPyTorch model parameters.
training_iters (int) – Number of iterations to run ADAM when optimizing GPyTorch model parameters.
seed (None, int) – Random seed used for initialization.
update_priors (bool) – Use parameter estimates from optimization step N-1 as initial values for step N.
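The role of the experiment index as a lookup table can be sketched as a closed loop: an initial batch is drawn at random, its results are read from the index, and each later batch is chosen from the points not yet observed. The code below is a toy under stated assumptions. The function simulate_with_lookup is hypothetical, and the selection rule (prefer points nearest the best observation) is a deliberately simple stand-in for the surrogate model and acquisition function, not what edbo does.

```python
import random


def simulate_with_lookup(exindex, batch_size=2, iterations=3, seed=None):
    """Toy closed loop; exindex maps a 1-D domain point to its known objective."""
    rng = random.Random(seed)
    domain = sorted(exindex)

    # Initialization: random batch, results read from the lookup table.
    results = {x: exindex[x] for x in rng.sample(domain, batch_size)}
    history = [max(results.values())]

    for _ in range(iterations):
        remaining = [x for x in domain if x not in results]
        if not remaining:
            break
        # Stand-in for model + acquisition: rank by distance to the best point so far.
        best_x = max(results, key=results.get)
        remaining.sort(key=lambda x: abs(x - best_x))
        for x in remaining[:batch_size]:
            results[x] = exindex[x]  # "run" the experiment via lookup
        history.append(max(results.values()))

    return results, history


# Known objective on an 11-point grid, with its maximum at x = 7.
exindex = {x: -(x - 7) ** 2 for x in range(11)}
results, history = simulate_with_lookup(exindex, batch_size=2, iterations=3, seed=1)
```

Here history records the best observed objective after initialization and after each iteration, which is the kind of quantity a convergence plot would show.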
-
clear_results
()¶ Clear results manually.
Note
‘rand’ and ‘pam’ initialization methods clear results automatically.
-
plot_convergence
(export_path=None)¶ Plot optimizer convergence.
- Parameters
export_path (None, str) – Path to export SVG of optimizer convergence plot.
- Returns
Plot of optimizer convergence.
- Return type
matplotlib.pyplot
-
acquisition_summary
()¶ Summarize predicted mean and variance for proposed points.
- Returns
Summary table.
- Return type
pandas.DataFrame
-
best
()¶ Best observed objective values and corresponding domain point.
-
save
(path='BO.pkl')¶ Save BO state.
- Parameters
path (str) – Path to export <BO state dict>.pkl.
- Returns
- Return type
None
-
load
(path='BO.pkl')¶ Load BO state.
- Parameters
path (str) – Path to <BO state dict>.pkl.
- Returns
- Return type
None
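The save and load pair amounts to a round trip of a state dictionary through a .pkl file. The sketch below shows that pattern with Python's standard pickle module on a plain dictionary. It assumes a pickle-style format based only on the file extension; the helper names and the contents of the state dictionary are hypothetical, and edbo may serialize its state differently.

```python
import os
import pickle
import tempfile


def save_state(state, path="BO.pkl"):
    """Write a state dictionary to disk."""
    with open(path, "wb") as handle:
        pickle.dump(state, handle)


def load_state(path="BO.pkl"):
    """Read a state dictionary back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical state: a few settings plus one observed result row.
state = {
    "batch_size": 5,
    "acquisition_function": "EI",
    "results": [[20, 0.1, 0.42]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "BO.pkl")
    save_state(state, path)
    restored = load_state(path)
```

Pickle files can execute arbitrary code when loaded, so only load state files from a trusted source.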
-
-
class
edbo.bro.
BO_express
(reaction_components={}, encoding={}, descriptor_matrices={}, model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, computational_objective=None)¶ Quick method for auto-generating a reaction space, encoding, and BO.
Class provides a unified framework for defining reaction spaces, encoding reactions, selecting experimental conditions for the parallel optimization of chemical reactions, and analyzing results.
BO_express automates most of the process required for BO, such as the featurization of the reaction space, preprocessing of data, and selection of Gaussian process priors.
Reaction components and encodings are passed to BO_express using dictionaries. BO_express attempts to encode each component based on the specified encoding. If there is an error in a SMILES string or the name could not be found in the NIH database an edbo bot is spawned to help resolve the issue. Once instantiated, BO_express.help() will also spawn an edbo bot to help with tasks.
Example
Defining a reaction space
from edbo.bro import BO_express

# (1) Define a dictionary of components
reaction_components = {
    'aryl_halide': ['chlorobenzene', 'iodobenzene', 'bromobenzene'],
    'base': ['DBU', 'MTBD', 'potassium carbonate', 'potassium phosphate'],
    'solvent': ['THF', 'Toluene', 'DMSO', 'DMAc'],
    'ligand': ['c1ccc(cc1)P(c2ccccc2)c3ccccc3',  # PPh3
               'C1CCC(CC1)P(C2CCCCC2)C3CCCCC3',  # PCy3
               'CC(C)c1cc(C(C)C)c(c(c1)C(C)C)c2ccccc2P(C3CCCCC3)C4CCCCC4'  # X-Phos
               ],
    'concentration': [0.1, 0.2, 0.3],
    'temperature': [20, 30, 40],
    'additive': '<defined in descriptor_matrices>'}

# (2) Define a dictionary of desired encodings
encoding = {'aryl_halide': 'resolve',
            'base': 'ohe',
            'solvent': 'resolve',
            'ligand': 'smiles',
            'concentration': 'numeric',
            'temperature': 'numeric'}

# (3) Add any user-defined descriptor matrices directly
import pandas as pd

A = pd.DataFrame(
    [['a1', 1, 2, 3, 4], ['a2', 1, 5, 2, 0], ['a3', 3, 5, 1, 25]],
    columns=['additive', 'A_des1', 'A_des2', 'A_des3', 'A_des4'])

descriptor_matrices = {'additive': A}

# (4) Instantiate BO_express
bo = BO_express(reaction_components=reaction_components,
                encoding=encoding,
                descriptor_matrices=descriptor_matrices,
                batch_size=10,
                acquisition_function='TS',
                target='yield')
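To get a feel for how large the reaction space in this example is, the sketch below counts and enumerates the combinations of its components. It assumes the space is the full Cartesian product of all component values, which this page does not state explicitly. The ligand nicknames and the expanded additive list stand in for the SMILES strings and the descriptor matrix rows.

```python
from itertools import islice, product
from math import prod

# Component values from the example; ligands are shown by nickname for brevity,
# and the additive identifiers come from the first column of the descriptor matrix.
components = {
    'aryl_halide': ['chlorobenzene', 'iodobenzene', 'bromobenzene'],
    'base': ['DBU', 'MTBD', 'potassium carbonate', 'potassium phosphate'],
    'solvent': ['THF', 'Toluene', 'DMSO', 'DMAc'],
    'ligand': ['PPh3', 'PCy3', 'X-Phos'],
    'concentration': [0.1, 0.2, 0.3],
    'temperature': [20, 30, 40],
    'additive': ['a1', 'a2', 'a3'],
}

# Size of the full grid: product of the number of options per component.
n_configurations = prod(len(values) for values in components.values())

# First two configurations, as dictionaries keyed by component name.
first_rows = [
    dict(zip(components, combo))
    for combo in islice(product(*components.values()), 2)
]
```

Under that assumption the grid has 3 × 4 × 4 × 3 × 3 × 3 × 3 = 3888 candidate experiments, which is why a batch of 10 per round is a small sample of the space.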
-
__init__
(reaction_components={}, encoding={}, descriptor_matrices={}, model=<class 'edbo.models.GP_Model'>, acquisition_function='EI', init_method='rand', target=-1, batch_size=5, computational_objective=None)¶
- Parameters
reaction_components (dict) –
Dictionary of reaction components of the form:
Example
Defining reaction components
{'A': [a1, a2, a3, ...],
 'B': [b1, b2, b3, ...],
 'C': [c1, c2, c3, ...],
 ...
 'N': [n1, n2, n3, ...]}
Components can be specified as: (1) arbitrary names, (2) chemical names or nicknames, (3) SMILES strings, or (4) numeric values.
Note
A reaction component will not be encoded unless its key is present in the reaction_components dictionary.
encoding (dict) –
Dictionary of encodings with keys corresponding to reaction_components. Encoding dictionary has the form:
Example
Defining reaction encodings
{'A': 'resolve',
 'B': 'ohe',
 'C': 'smiles',
 ...
 'N': 'numeric'}
Encodings can be specified as: (‘resolve’) resolve a compound name using the NIH database and compute Mordred descriptors, (‘ohe’) one-hot-encode, (‘smiles’) compute Mordred descriptors using a SMILES string, (‘numeric’) numerical reaction parameters are used as passed. If no encoding is specified, the space will be automatically one-hot-encoded.
descriptor_matrices (dict) –
Dictionary of descriptor matrices where keys correspond to reaction_components and values are pandas.DataFrames.
Descriptor dictionary has the form:
Example
User defined descriptor matrices
# DataFrame where the first column is the identifier (e.g., a SMILES string)
A = pd.DataFrame([....], columns=[...])

--------------------------------------------
A_SMILES  |  des1  |  des2  |  des3  | ...
--------------------------------------------
   .          .        .        .      ...
   .          .        .        .      ...
--------------------------------------------

# Dictionary of descriptor matrices defined as DataFrames
descriptor_matrices = {'A': A}
Note
If a key is present in both encoding and descriptor_matrices then the descriptor matrix will take precedence.
model (edbo.models) – Surrogate model object used for Bayesian optimization. See edbo.models for predefined models and specification of custom models.
acquisition_function (str) – Acquisition function used for selecting a batch of domain points to evaluate. Options: (TS) Thompson Sampling, (EI) Expected Improvement, (PI) Probability of Improvement, (UCB) Upper Confidence Bound, (EI-TS) EI (first choice) + TS (n-1 choices), (PI-TS) PI (first choice) + TS (n-1 choices), (UCB-TS) UCB (first choice) + TS (n-1 choices), (MeanMax-TS) Mean maximization (first choice) + TS (n-1 choices), (VarMax-TS) Variance maximization (first choice) + TS (n-1 choices), (MeanMax) Top predicted values, (VarMax) Variance maximization, (rand) Random selection.
init_method (str) – Strategy for selecting initial points for evaluation. Options: (rand) Random selection, (pam) k-medoids algorithm, (kmeans) k-means algorithm, (external) User-defined external data read in as results.
target (str) – Column label of optimization objective. If set to -1, the last column of the DataFrame will be set as the target.
batch_size (int) – Number of experiments selected via acquisition and initialization functions.
computational_objective (function, optional) – Function to be optimized for computational objectives.
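Of the encodings listed above, ‘ohe’ is the simplest to picture: each distinct value of a component becomes its own 0/1 column. The sketch below shows the idea in plain Python. The helper one_hot_encode and the column naming scheme are hypothetical, and edbo's actual output columns may be named or ordered differently.

```python
def one_hot_encode(name, values):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    # dict.fromkeys keeps the first-seen order while dropping duplicates.
    categories = list(dict.fromkeys(values))
    columns = [f"{name}_{category}" for category in categories]
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return columns, rows


# Three experiments using two distinct bases.
columns, rows = one_hot_encode("base", ["DBU", "MTBD", "DBU"])
```

Each row contains exactly one 1, so a one-hot-encoded component carries no notion of similarity between its values, unlike the descriptor-based ‘resolve’ and ‘smiles’ encodings.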
-
get_experiments
(structures=False)¶ Return indexed experiments proposed by Bayesian optimization algorithm.
edbo.BO works directly with a standardized encoded reaction space. This method returns proposed experiments as the original SMILES strings, categories, or numerical values.
- Parameters
structures (bool) – If True, use RDKit to print out the chemical structures of any encoded SMILES strings.
- Returns
Proposed experiments.
- Return type
pandas.DataFrame
-
add_results
(results_path=None)¶ Add experimental results.
Experimental results should be added with the same column headings as those returned by BO_express.get_experiments. If a path to the results is not specified, an edbo bot is spawned to help load results. It does so by exporting the entire reaction space to a CSV file in the working directory.
Note: The first column in the CSV/Excel results file must contain the same index as the corresponding experiment. Try BO_express.export_proposed() to export a CSV file with the proper format.
- Parameters
results_path (str) – Path to the CSV/Excel file from which results are imported.
- Returns
- Return type
None
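Because results are matched to experiments through the index in the first column, a quick pre-flight check of a results file can catch missing rows before it is passed to add_results. The sketch below is one way to do that with the standard csv module. The helper names, column headings, and index values are hypothetical, and this check is not part of edbo.

```python
import csv
import io


def read_results(handle):
    """Map experiment index (first column) to the remaining columns of that row."""
    reader = csv.reader(handle)
    header = next(reader)
    return {int(row[0]): dict(zip(header[1:], row[1:])) for row in reader}


def missing_indices(proposed_indices, results):
    """Proposed experiment indices that have no row in the results file."""
    return sorted(set(proposed_indices) - set(results))


# Stand-in for a results file; the first column holds the experiment index.
csv_text = (
    "index,base,temperature,yield\n"
    "12,DBU,20,41.5\n"
    "57,MTBD,40,63.0\n"
)

results = read_results(io.StringIO(csv_text))
missing = missing_indices([12, 57, 90], results)  # experiment 90 has no result yet
```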
-
export_proposed
(path=None)¶ Export proposed experiments.
edbo.BO works directly with a standardized encoded reaction space. This method exports proposed experiments as the original SMILES strings, categories, or numerical values. If a path is not specified, a CSV file entitled ‘experiments.csv’ will be exported to the current working directory.
- Parameters
path (str) – Export a CSV file to path.
- Returns
- Return type
None
-
help
()¶ Spawn an edbo bot to help with tasks.
If you are not familiar with edbo commands, BO_express.help() will spawn an edbo bot to help with tasks. Natural language can be used to interact with the edbo bot in the terminal to accomplish tasks such as: initializing (selecting initial experiments using the chosen init method), optimizing (running the BO algorithm with available data to choose the next experiments), getting proposed experiments, adding experimental results, checking the underlying model's regression performance, saving the BO instance so you can load it for later use, and exporting proposed experiments to a CSV file.
-