Bayesian Reaction Optimization Using EDBO - Part IV
In part I we installed the pre-release of EDBO and ran some basic functionality tests. Now in part II we can dive into a basic introduction to using the software. In this post we provide example code for Bayesian optimization of a 1D objective, which can be used to explore some of the software’s features. The main Bayesian optimization program is accessed through the edbo.bro module. The main BO classes, edbo.bro.BO and edbo.bro.BO_express, enable users to select initial experiments using experimental designs, run BO on human-in-the-loop or computational objectives, model data, and analyze results. Note: BO parameters are preset to those optimized for DFT encodings in the paper. However, BO_express attempts to automate the selection of priors based on the search space. In general, the BO class is more flexible but as a result less user friendly. Therefore, let’s use the BO_express class in this demonstration.
To start we need to define a search space and an objective. In general, for any application it is up to us to define where the optimizer will search for conditions that maximize our objective. For a reaction, your objective may be the yield of the desired product; here I am using an arbitrary function, so feel free to change it to anything you want for this demo.
import numpy as np
import matplotlib.pyplot as plt

# Define a computational objective
# EDBO works with feature vectors so even a 1D objective needs to be vectorized
def f(x):
    """Noise free objective."""
    return np.sin(x[0]) * x[0] * 5 + 30

def g(x):
    """With noise."""
    return f(x) + (np.random.random() - 0.5) * 15

# BO uses a user defined domain
X = np.linspace(0, 10, 1000)  # Grid of 1000 points between 0 and 10
Now we can use matplotlib to visualize the objective.
sample = np.random.choice(X, 100)
plt.figure(figsize=(5,5))
plt.plot(X, [f([x]) for x in X])
plt.scatter(sample, [g([x]) for x in sample], alpha=0.5)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('"Unknown" Objective')
plt.show()
With our search space prepared we can now use EDBO to choose initial experiments, evaluate models, and run Bayesian optimization. There are several ways in which the main BO methods can be used. Let’s start by checking out the options when instantiating BO objects. Here is a link to the documentation page: edbo.bro.
First, as we are checking out some of EDBO’s features, it will be handy to have a nice plotting function.
# Handy functions to visualize the results
def map_corr(df):
    """Get corresponding points in unstandardized domain."""
    index = []
    for x in df.values:
        i = np.argwhere(bo.obj.domain.values == x).flatten()[0]
        index.append(i)
    return bo.reaction.get_experiments(index)

def plot_results(export_path=None, plot_samples=True):
    """Plot summary of 1D BO simulations."""
    mean = bo.obj.scaler.unstandardize(bo.model.predict(bo.obj.domain))  # GP posterior mean
    std = np.sqrt(bo.model.variance(bo.obj.domain)) * bo.obj.scaler.std * 2  # 2x GP posterior standard deviation
    next_points = bo.reaction.get_experiments(bo.proposed_experiments.index.values).copy()  # Next points proposed by BO
    next_points['g(x)'] = [f(x) for x in next_points.values]
    results = map_corr(bo.obj.results.drop('g(x)', axis=1))  # Results for known data
    results['g(x)'] = [g(x) for x in results.values]

    plt.figure(1, figsize=(8,8))

    # Model mean and standard deviation
    plt.subplot(211)
    plt.plot(X, [f([x]) for x in X], color='black')
    plt.plot(X, mean, label='GP')
    plt.fill_between(X, mean-std, mean+std, alpha=0.4)

    # Known results and next selected point
    plt.scatter(results['x_index'], results['g(x)'], color='black', label='known')
    plt.scatter(next_points['x_index'], next_points['g(x)'], color='red', label='next')
    plt.ylabel('f(x)')

    # Plot some posterior samples
    if plot_samples:
        samples = bo.obj.scaler.unstandardize(bo.model.sample_posterior(bo.obj.domain, batch_size=2))
        i = 1
        for sample in samples:
            plt.plot(X, sample.numpy(), '--', label='sample' + str(i))
            i += 1

    plt.legend(loc='lower left')

    # Plot the acquisition function
    plt.subplot(212)
    for p in bo.acq.function.projections:
        plt.plot(bo.obj.domain['x'], p)
    plt.xlabel('x')
    plt.ylabel('Acquisition Function')

    if export_path is not None:
        plt.savefig(export_path, format='svg', dpi=1200, bbox_inches='tight')

    plt.show()
Suppose we have no data and want to start by selecting initial experiments to run. We can do this at random or with the clustering methods built into EDBO. I have also written some DOE add-on modules which enable you to use response surface (e.g., central composite) and fractional factorial designs; however, these are not included in EDBO 0.0.0. Here we use the centroids from k-means clustering for initialization (sketched below).
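For intuition, here is a minimal sketch of what k-means initialization amounts to: cluster the encoded domain points and take the point nearest each centroid as an initial experiment. This is an illustration only, with a hypothetical helper name; BO_express handles all of this internally via init_method='kmeans' in the code that follows.

# Minimal sketch of k-means initialization (illustration only; BO_express
# does this internally when init_method='kmeans')
import numpy as np
from sklearn.cluster import KMeans

def kmeans_initial_points(domain, k=2, seed=0):
    """Return indices of the domain points closest to the k centroids."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(domain)
    return [int(np.argmin(np.linalg.norm(domain - c, axis=1)))
            for c in km.cluster_centers_]

# e.g., kmeans_initial_points(X.reshape(-1, 1), k=2)

With that intuition, here is the actual EDBO workflow: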
from edbo.bro import BO_express

# (1) Define a dictionary of components
components = {'x': X}

# (2) Define a dictionary of desired encodings
encoding = {'x': 'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                batch_size=2,
                target='g(x)',
                init_method='kmeans')

# (4) Choose initial experiments using k-means
bo.init_sample()

print('\nNormalized domain points:')
bo.proposed_experiments
Normalized domain points:

|     | x        |
|-----|----------|
| 252 | 0.252252 |
| 751 | 0.751752 |
We can get the unnormalized experiments (or SMILES strings etc.) using the get_experiments method.
print('\nDomain points:')
bo.get_experiments()
Domain points:

|     | x_index |
|-----|---------|
| 252 | 2.52252 |
| 751 | 7.51752 |
And we can plot the choices on the domain.
plt.figure(figsize=(6,1))
plt.scatter(bo.obj.domain['x'], np.ones((len(bo.obj.domain))))
plt.scatter(bo.proposed_experiments, np.ones((len(bo.proposed_experiments))), s=100)
plt.xlabel('x')
plt.yticks([])
plt.show()
Now we can move on to the optimization. If you were really running experiments in the lab, you would likely just want to use the run method to iteratively choose experiments. Then go into the lab, run the experiments, collect the results, and read them back into the optimizer. Let’s see what that would look like. First, let’s export the proposed experiments to a CSV file so we can add the results after we “run” the experiments.
# Without an argument this will export 'experiments.csv' to the cwd.
bo.export_proposed()
Since this is actually a computational objective we can “run” the experiments right here.
# "Run" the experiments
expts = bo.get_experiments()
expts['g(x)'] = [g(x) for x in expts.values]
# Save the results as a CSV
expts.to_csv('results.csv')
# Load the results
bo.add_results('results.csv')
Then in order to choose the next experiments we simply use the run method.
bo.run()
And we can get a basic analysis of the acquisition process using the acquisition_summary method.
bo.acquisition_summary()
|     | x        | predicted g(x) | variance |
|-----|----------|----------------|----------|
| 960 | 0.960961 | 60.7232        | 1581.51  |
| 631 | 0.631632 | 64.5289        | 969.982  |
You can continue this process iteratively until the objective is maximized or you run out of resources. We can get an idea of what is going on under the hood using our plotting function. In the top plot, notice that the model mean fits the experimental results well and that the model confidence region (2$\sigma$) captures the unknown objective. As a result, when we sample the posterior predictive distribution of the model, you can see that one of the random functions (yellow dashed) actually captures most of the variation in the objective. The default acquisition function used by EDBO is parallel expected improvement (EI). The computed EI, used to select the next round of experiments, is shown in the bottom plot. Notice that the argmax of the acquisition function gives the next two experiments (red points).
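For reference, sequential expected improvement has a simple closed form in terms of the model’s posterior mean and standard deviation. The sketch below is only to make the acquisition step concrete; it is not EDBO’s internal implementation, which also handles batching (see the Kriging Believer discussion below).

# Closed-form expected improvement for a maximization problem
# (illustration only; not EDBO's internal parallel EI)
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_observed):
    std = np.clip(std, 1e-12, None)  # guard against zero variance
    z = (mean - best_observed) / std
    return (mean - best_observed) * norm.cdf(z) + std * norm.pdf(z)

# The argmax over the domain gives the next experiment:
# next_index = np.argmax(expected_improvement(mean, std, best_observed))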
Given that $f$ is actually a computational objective, we could just let EDBO optimize the objective automatically. Below is some sample code showing how to do this using the computational objective option.
# EDBO works on the normalized search space
# We need a new function that maps to the real domain
def h(x):
    """Deal with scaling."""
    i = np.argwhere(bo.obj.domain.values == x).flatten()[0]
    df = bo.reaction.get_experiments(i)
    return g(df.values)
# Use the computational_objective argument
bo = BO_express(components,
                encoding,
                batch_size=2,
                target='g(x)',
                computational_objective=h)

# Run the optimization automatically using simulate
bo.simulate(seed=4, iterations=5)

# Plot the results
plot_results()
Models. In Bayesian optimization, the surrogate model defines a prior over functions which captures our assumptions about the shape of the response surface. When we combine this prior with observed reaction data, we get a posterior distribution over functions which we can use to reason about the possible positions of global optima. Practically speaking, many acquisition functions (though not all, e.g., Thompson sampling) are formulated from the surrogate model’s mean and variance. Thus, in principle any regression model can be employed in Bayesian optimization (e.g., by bootstrapping variance estimates). EDBO currently has three different surrogate models built into the edbo.models module: Gaussian processes (edbo.models.GP_Model, GPyTorch), random forests (edbo.models.RF_Model, Scikit-Learn), and Bayesian linear regression (edbo.models.Bayesian_Linear_Model, Scikit-Learn). See the edbo.models documentation page for more details. We can get an idea of the shape of these functions using the plotting method we wrote above (vide supra). It is also straightforward to implement your own model - see the edbo.models module for examples and the sketch after the figures below. Below is an example code block for utilizing a random forest model instead of the default Gaussian process.
from edbo.bro import BO_express
from edbo.models import RF_Model

# (1) Define a dictionary of components
components = {'x': X}

# (2) Define a dictionary of desired encodings
encoding = {'x': 'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                model=RF_Model)
Gaussian Process Regression (EDBO’s default model):
Random Forest Regression:
Bayesian Linear Regression:
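As noted above, since many acquisition functions only require mean and variance predictions, rolling your own surrogate is straightforward. Here is a minimal sketch of a random forest model with bootstrapped variance estimates; the class name is hypothetical and the method names simply mirror the usage in our plotting function above (fit, predict, variance), so check edbo.models for the exact interface EDBO expects.

# Minimal sketch of a custom surrogate with bootstrapped variance estimates
# (illustration only; see edbo.models for the exact interface EDBO expects)
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class BootstrapRF:
    """Random forest surrogate; tree-wise spread serves as uncertainty."""
    def __init__(self, n_estimators=500):
        self.model = RandomForestRegressor(n_estimators=n_estimators)

    def fit(self, X, y):
        self.model.fit(np.asarray(X, dtype=float), np.asarray(y, dtype=float).ravel())

    def predict(self, X):
        return self.model.predict(np.asarray(X, dtype=float))

    def variance(self, X):
        # Variance across the predictions of the individual (bootstrapped) trees
        preds = np.stack([t.predict(np.asarray(X, dtype=float))
                          for t in self.model.estimators_])
        return preds.var(axis=0)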
Acquisition functions. The acquisition function is the algorithm responsible for selecting the next experiments to run based on the information captured by the surrogate model. Most acquisition functions are built to balance exploration of the search space with exploitation of the information available from evaluated experiments. EDBO has several acquisition functions available via keyword arguments to the BO and BO_express classes. A full list can be found in the documentation. The default acquisition function, expected improvement, is derived from the expectation value of the improvement utility function. Below is an example code block for choosing different acquisition functions, followed by a few examples of parallel acquisition functions which utilize the Kriging Believer algorithm for batching (sketched after the list below).
from edbo.bro import BO_express

# (1) Define a dictionary of components
components = {'x': X}

# (2) Define a dictionary of desired encodings
encoding = {'x': 'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                acquisition_function='UCB')
Expected improvement (EDBO’s default acquisition function):
Probability of improvement:
Upper confidence bound:
Mean maximization (pure exploitation):
Variance maximization (pure exploration):
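To make the Kriging Believer idea concrete, here is a minimal sketch using scikit-learn’s GP (illustration only, with a hypothetical function name; not EDBO’s implementation): the batch is built sequentially, and after each selection the model’s own posterior mean is appended as a “believed” pseudo-observation before the next point is chosen.

# Minimal sketch of Kriging Believer batching with a UCB acquisition
# (illustration only; not EDBO's internal implementation)
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def kriging_believer_batch(X_train, y_train, X_domain, batch_size=2):
    X_b = np.array(X_train, dtype=float)
    y_b = np.array(y_train, dtype=float)
    batch = []
    for _ in range(batch_size):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X_b, y_b)
        mean, std = gp.predict(X_domain, return_std=True)
        i = int(np.argmax(mean + 2 * std))  # UCB selection
        batch.append(i)
        # "Believe" the model: treat its posterior mean as an observation
        X_b = np.vstack([X_b, X_domain[i:i + 1]])
        y_b = np.append(y_b, mean[i])
    return batch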
During optimization we can run miscellaneous analyses using some of EDBO’s built-in functions. For example, we can plot the optimizer’s path.
bo.plot_convergence()
And we can evaluate how well the model fits the experimental data.
bo.model.regression()
Finally, note that if you need help, EDBO has a basic BOT which can run most of its methods. You can call the BOT using the help method. For example, suppose you wanted to save your workspace for later.
bo.help()
This will spawn an interactive session:
edbo bot: What can I help you with?
~ Save my workspace
edbo bot: Can you clarify: pickle BO object for later, export proposed, or exit?
~ pickle it
edbo bot: Save instance? (yes or no) You can load instance later with edbo.BO_express.load().
~ yes
edbo bot: Saving edbo.BO instance...
edbo bot: What can I help you with?
~ exit
edbo bot: Exiting...
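Per the bot’s note, the pickled workspace can be restored later with edbo.BO_express.load(). A sketch, assuming load() returns the saved instance (the exact semantics may differ by version, so check the documentation):

# Restore a pickled BO_express workspace (sketch based on the bot's note above)
from edbo.bro import BO_express
bo = BO_express.load()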
EDBO and the main BO classes have a lot more features but hopefully this gives you an idea of how it could be used. In the next post we will see how to apply EDBO to chemical reaction data.
Recently, in collaboration with folks over at Princeton and Bristol Myers Squibb, I finished writing a Python package called Experimental Design via Bayesian Optimization (EDBO) for reaction optimization, which enables the application of Bayesian optimization, an uncertainty-guided response surface method, to chemical reactions in the laboratory. Now, the paper is submitted for publication and under review, so I have not yet made the repository public. However, to facilitate training and beta testing I am writing a few preliminary posts on (1) installation and basic software usage, (2) simulations with real chemical reaction data, (3) using EDBO in the lab, and (4) tackling computational optimization problems.
Reference: Shields, Benjamin J.; Stevens, Jason; Li, Jun; Parasram, Marvin; Damani, Farhan; Martinez Alvarado, Jesus; Janey, Jacob; Adams, Ryan P.; Doyle, Abigail G. “Bayesian Reaction Optimization as a Tool for Chemical Synthesis” Manuscript Submitted.
OK, boring stuff first. In this post we will be tackling software installation from the code in my private repository (so no Git, PyPI, or Anaconda for now).
If you haven’t already installed Anaconda (or Miniconda) on your machine, you can follow the instructions provided by conda.
Windows Script
I wrote a shell script (install.sh) to install EDBO on Windows machines. You will find a copy in the edbo.zip folder provided.
Download and unzip the folder.
Open an anaconda prompt, navigate to the edbo directory, and run the script.
cd path/to/edbo/directory
sh install.sh
Mac/Linux Script
I wrote a slightly different shell script (install_mac.sh) to install EDBO on Mac/Linux machines. You will find a copy in the edbo.zip folder provided.
Download and unzip the folder.
Open a terminal and create a conda environment for EDBO.
conda create -y --name edbo python=3.7.5
conda activate edbo
cd path/to/edbo/directory
sh install_mac.sh
Use the pytest framework to run some basic software tests to make sure the installation worked. In the anaconda prompt (or terminal for Mac/Linux), navigate to the folder containing edbo. Then run the following commands and you will see test logs appear in the testing directory. These may take a few minutes to run, and you should see some warnings but no failed tests. If you do see failures, please let me know so I can fix the issue and update the software.
conda activate edbo
cd tests
sh basic_tests.sh
That wraps up this post. In Part II we will walk through a basic introduction to the software.
My site basics are finished! Later I will add some posts on things I am working on.