Automatic Design of SARS-CoV-2 M^pro Inhibitors via Machine Learning & Molecular Docking

22 minute read

Published: August 09, 2020

Introduction

One of my current interests is developing machine learning (ML) methods for optimizing chemical processes. Recently, in collaboration with the folks over at Princeton and Bristol Myers Squibb, I finished writing a Python package called Experimental Design via Bayesian Optimization (EDBO), which enables the application of Bayesian optimization, an uncertainty-guided response surface method, to chemical reactions in the laboratory. While the software was developed with this specific application in mind, it is a general implementation, and I was curious what other fun tasks it might interface with.

As a synthetic chemist I have heard quite a bit about docking simulations and the role they play in the drug discovery process. In this context, docking refers to molecular modeling methods which predict the binding affinity between a drug-like molecule and a biological target such as a protein. In turn, the predicted binding behavior and conformation can help chemists and biologists identify promising structures for experimental evaluation as potential small molecule therapeutics. With this general understanding, the objective of this post is to get a little hands-on experience. Specifically, we will take a quick dive into the process of molecular docking using open-source tools. And to make things even more fun, we will also put together a fully automated artificial intelligence (AI) based ligand discovery pipeline using EDBO, run it on my laptop, and see what it discovers!

As I am writing this post, we are in the midst of the COVID-19 pandemic. Accordingly, I could see no better target for this project than SARS-CoV-2, the virus responsible for our current situation. In a recent report (Jin, et al. “Structure of M^pro from SARS-CoV-2 and discovery of its inhibitors” Nature, 2020, 582, 289), researchers disclosed the identification of several inhibitors of M^pro, a protease enzyme critical to the replication cycle of SARS-CoV-2. The polyproteins responsible for viral replication and transcription must undergo proteolytic processing (digestion) by M^pro. Thus, M^pro is an ideal target for antiviral drugs. A ribbon structure of M^pro is shown below.

In order to pull this off we need to code a few key components: (1) a molecular structure generator, which converts a string representation to 3D coordinates, carries out geometry optimization, and generates conformers, (2) a molecular docking simulator, which submits the generated structures and returns a binding score, and (3) an automated AI optimizer, which optimizes molecular structure to maximize the binding score. In turn, the AI method requires: (A) a search space, the chemical space to optimize over, (B) an encoding, a numerical representation of molecular structures, and (C) an optimizer, which selects the next experiments to run according to some utility function. For the search space and encoding I thought it would be interesting to use a variational autoencoder (vide infra), trained on the structures of drug-like molecules, to generate a numerical embedding for optimization.
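
To make the division of labor concrete, here is a minimal sketch of how the three components might fit together. Everything in it is a placeholder: `generate_structures`, `docking_score`, and `optimize` are hypothetical names, the score is a made-up deterministic number rather than a docking result, and plain random search stands in for the Bayesian optimizer that is used later in the post.

```python
import random

def generate_structures(smiles, n_confs=3):
    """Stand-in for component (1): return placeholder conformer labels."""
    return [f"{smiles}_conf{i}" for i in range(n_confs)]

def docking_score(conformer):
    """Stand-in for component (2): a made-up deterministic score, not a docking result."""
    return (sum(ord(c) for c in conformer) % 100) / 10.0

def objective(smiles):
    """Best score over all conformers of one molecule."""
    return max(docking_score(c) for c in generate_structures(smiles))

def optimize(search_space, objective, n_iter, seed=0):
    """Stand-in for component (3): random search instead of Bayesian optimization."""
    rng = random.Random(seed)
    evaluated = {}
    for smiles in rng.sample(search_space, n_iter):
        evaluated[smiles] = objective(smiles)
    best = max(evaluated, key=evaluated.get)
    return best, evaluated

search_space = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "CCOC"]
best, evaluated = optimize(search_space, objective, n_iter=3)
print(best, evaluated[best])
```

The point of the sketch is only the interface: the optimizer sees a list of candidates and a function that maps one candidate to one number, so any of the three pieces can be swapped out independently.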

Automating 3D structure generation

Generating representative 3D molecular structures from SMILES strings is critical for this application because we are going to use a variational autoencoder trained on SMILES strings. In addition, I wanted to code this application in Python and avoid quantum mechanical (e.g., DFT) geometry optimization. Therefore, let’s use Open Babel’s Python API to generate 3D structures from SMILES strings via: (1) creating the initial structure using rules and fragment templates, (2) steepest descent geometry optimization with the MMFF94 force field, (3) a weighted rotor conformational search, and (4) a final conjugate gradient geometry optimization on the lowest energy conformer. Additional geometries can then be generated using Open Babel’s built-in genetic algorithm for diverse conformer generation. Let’s wrap all of the methods we will need into one handy Python class. You can see the details for each method in the accompanying docstrings.

from openbabel import pybel
import re

# Molecule class handles structure generation, conformer searching, and input generation

class molecule:
    """
    Class for handling structure generation and conformer searching.
    The methods defined here will allow us to generate 3D coordinates
    to feed into docking simulations.
    """
    
    def __init__(self, SMILES, NAME, correct_for_ph=True, ph=7.4):
        
        # Save SMILES and NAME
        self.smi = SMILES
        self.name = NAME
        self.correct_ph = correct_for_ph
        self.ph = ph
        
        # Open-Babel
        self.conv = pybel.ob.OBConversion()
        self.obmol = self.smiles_to_obmol()
        self.obmol.AddHydrogens()
        if correct_for_ph:
            self.obmol.CorrectForPH(ph)
        self.obmol.SetTitle(self.name)
        
    def smiles_to_obmol(self, input_format='smi'):
        """
        Convert a SMILES string to an obmol object.
        """
        
        mol = pybel.ob.OBMol()
        self.conv.SetInFormat(input_format)
        self.conv.ReadString(mol, self.smi)

        return mol
    
    def to_string(self, output_format='mol2'):
        """
        Using the current obmol object generate a formatted
        string.
        """
        
        self.conv.SetOutFormat(output_format)
        out = self.conv.WriteString(self.obmol).strip()
        
        return out
    
    def to_file(self, path, confs=False, output_format='mol2'):
        """
        Write formatted structure data for optimized geometries.
        """
        
        if confs:
            # Open output file
            file = pybel.Outputfile(output_format,
                                    path + '.' + output_format, 
                                    overwrite=True)
            
            # Write conformers
            for c in range(self.n_confs):
                # Conformer name
                self.obmol.SetConformer(c)
                self.obmol.SetTitle(self.name + str(c))
                file.write(pybel.Molecule(self.obmol))
            
            # Close file
            file.close()
        else:
            self.conv.SetOutFormat(output_format)
            self.conv.WriteFile(self.obmol, path + '.' + output_format)
            
        # Reset name
        self.obmol.SetTitle(self.name)
    
    def _geom(self):
        """
        Generate a starting geometry.
        """
        
        # Initial generation
        gen3D = pybel.ob.OBOp.FindType("gen3D")
        gen3D.Do(self.obmol, 'best')
        
        # Sometimes this fails so we need to check for proper 3D coordinates
        txt = self.to_string(output_format='xyz')
        coord = re.findall(r'[\d]+', txt)  # raw string avoids an invalid-escape warning
        if len(set(coord)) < 10:    
            pybel.Molecule(self.obmol).make3D(forcefield='mmff94', 
                                              steps=250)
        
    def conf_gen(self, n_confs=30):
        """
        Run a conformer generation generating up to n_confs
        conformers.
        """
        
        self._geom()
        
        confSearch = pybel.ob.OBConformerSearch()
        confSearch.Setup(self.obmol, n_confs)
        confSearch.Search()
        confSearch.GetConformers(self.obmol)
        
        self.n_confs = self.obmol.NumConformers()

We can run a quick test of the class and its methods using a known binder of M^pro (called N3 in the recent paper). Here is an example block of code to instantiate a molecule, generate conformers, and write them to a MOL2 file.

m = molecule('O=C(N[C@@H](C)C(N[C@@H](C(C)C)C(N[C@@H](CC(C)C)C(N[C@H](/C=C/C(OCC1=CC=CC=C1)=O)C[C@@H]2CCNC2=O)=O)=O)=O)C3=NOC(C)=C3', 'N3')
m.conf_gen(n_confs=3)
m.to_file(m.name, confs=True)
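
The coordinate sanity check inside `_geom` can be tried on its own, without Open Babel. The sketch below applies the same idea (count the distinct runs of digits in an XYZ string) to two hand-written XYZ blocks. The helper name `has_3d_coordinates` and the coordinates themselves are made up for illustration.

```python
import re

def has_3d_coordinates(xyz_text, min_unique=10):
    """Return True if the XYZ text contains at least min_unique distinct digit runs."""
    tokens = re.findall(r'\d+', xyz_text)
    return len(set(tokens)) >= min_unique

# Made-up XYZ block where 3D generation "failed": every coordinate is zero
failed = """4
example
C 0.00000 0.00000 0.00000
O 0.00000 0.00000 0.00000
H 0.00000 0.00000 0.00000
H 0.00000 0.00000 0.00000
"""

# Made-up XYZ block with distinct, non-zero coordinates
generated = """4
example
C 1.20410 -0.33172 0.05861
O 2.41873 0.29954 -0.18236
H 0.87325 -1.02647 0.83519
H 0.39108 0.41762 -0.61493
"""

print(has_3d_coordinates(failed), has_3d_coordinates(generated))
```

Because the test only counts distinct digit runs, a very small or highly symmetric molecule can fall under the threshold even when its geometry is fine. In the class above that would just trigger an extra `make3D` call, so the heuristic errs on the safe side.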

Molecular docking simulations

Next, we require a docking simulator and a Python interface which takes conformations of a given molecule as input and returns a predicted binding affinity. For the docking simulations we will use the freely available software AutoDock Vina. AutoDock Vina computes a binding score via a scoring function that approximates the chemical potential of the system (Trott et al. “AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading”, J. Comput. Chem., 2010, 31, 455). However, it has recently been demonstrated that including an additional machine learning (ML) based scoring function, trained on active and inactive molecules docked to a large set of targets, as a post-processing step can give substantial improvements in virtual screening hit rate (Wójcikowski et al. “Performance of machine-learning scoring functions in structure-based virtual screening”, Scientific Reports, 2017, 7, 46710). Here we will use RF-Score (a random forest scoring function), which takes the AutoDock Vina output as input and returns an adjusted score. Notably, using AutoDock Vina for molecular docking and the Open Drug Discovery Toolkit (ODDT) as a Python interface, this whole pipeline is actually pretty straightforward to implement.

Let’s write a Python class which: (1) parses input for an arbitrary protein, bound ligand, and candidate ligands, (2) interfaces with AutoDock Vina to run docking simulations, (3) computes the RF-Score from the output, and (4) parses the results (which come in the form of MOL2 files) to return scores and docked structures. You can see the function of each method in the accompanying docstrings.

from oddt.virtualscreening import virtualscreening as vs
import os
import shutil
import re
import pandas as pd

# Docking class handles automatic docking and output parsing

class dock:
    """
    Class for handling automated docking simulations, file manipulations,
    and result parsing.
    """
    
    def __init__(self, protein, protein_ligand, ligands, n_cpu=-1):
      
        # Input file paths
        self.protein = protein
        self.protein_ligand = protein_ligand
        self.ligands = ligands
        self.executable = 'PATH_TO_VINA.EXE'
        
        # Virtual screening pipeline
        self.pipeline = vs(n_cpu=n_cpu)

        # Load ligands from a mol2 file
        self.pipeline.load_ligands('mol2', self.ligands)

        # Dock entire library to receptor, autocenter docking box on ligand
        self.pipeline.dock('autodock_vina', 
                           self.protein, 
                           self.protein_ligand, 
                           executable=self.executable)
        
        # Post-processing using a scoring function
        self.pipeline.score(function='rfscore',
                            protein=self.protein,
                            version=1)
        
    def run(self, output_file, overwrite=False):
        """
        Run docking simulation for a given protein and ligand.
        """
        
        # Unique output file name
        output_file += '.mol2'
        
        # If you don't want to overwrite previous output, append a counter
        # to the base name until the file name is unused
        if not overwrite:
            base, ext = os.path.splitext(output_file)
            counter = 1
            while os.path.isfile(output_file):
                output_file = base + str(counter) + ext
                counter += 1
            
        # Run docking simulation and write output
        self.pipeline.write('mol2', 
                            output_file, 
                            # overwrite=True, 
                            opt={'c':None})
        
        # Get results text
        with open(output_file, 'r') as f:
            self.results_text = f.read()
        self.parse_results()
        self.clean_dir()
        
    def parse_results(self):
        """
        Parse docking output to get Vina docking scores and
        scoring function results.
        """
      
        # Break file up into each conformer
        mols = self.results_text.split('\n\n##########')
        
        # Extract results for each
        results = []
        for mol in mols:
            
            # Get scores
            output = re.findall(r'\t(.+):\t([0-9]+[.]?[0-9]+)', mol)
            names = re.findall(r'MOLECULE\n(.+)\n', mol)
        
            # Tabulate results
            columns = ['Conformer']
            tabulated = [names[0]]
            for entry in output:
                columns.append(entry[0])
                tabulated.append(entry[1])

            results.append(pd.DataFrame([tabulated], columns=columns))
        
        # Generate a results DataFrame. ignore_index gives every conformer a
        # unique row label, which write_best_conformer uses to pick the block
        # to write (without it every row is labeled 0).
        self.results = pd.concat(results, ignore_index=True)
        
        # Make sure RF Score is the last column
        last = 0
        while 'score' not in self.results.columns.values[last]:
            last += 1
            
        self.results = self.results.iloc[:,:last + 1]
    
    def write_conformer(self, conformer_number):
        """
        Write a docked conformer to a .mol2 file. Use this to
        visualize docked ligand conformations in PMV.
        """
        
        f = open('conf.mol2', 'w')
        f.write(self.results_text.split('\n\n##########')[conformer_number])
        f.close()
    
    def write_best_conformer(self):
        """
        Parse results for conformer with best score and write
        coordinates to a .mol2 file for evaluation.
        """
        
        best = self.results.sort_values(self.results.columns.values[-1]).iloc[[-1]].index.values[0]
        self.write_conformer(best)
        
    def clean_dir(self):
        """
        Garbage collector. In some cases I have seen exceptions b/c
        oddt wasn't able to remove temporary files.
        """
        
        # Get files in cwd
        files = [x[0].split(os.sep) for x in os.walk('.')]  # os.sep keeps this portable
        
        # Try to remove temporary files
        try:       
            for file in files:
                if len(file) == 2:
                    if 'autodock_vina' in file[1]:
                        shutil.rmtree(file[1])
        except Exception as e:
            print(e)
            if 'cannot access' in str(e):
                print('Could not remove temporary files...')
                pass  
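
One detail of the parsing step is worth a quick illustration: the scores come out of `re.findall` as text, so a table built from them holds strings, and sorting strings is not the same as sorting numbers. The sketch below uses made-up conformer names and scores to show the difference. It would only matter if a score ever reached two digits before the decimal point, but converting to `float` before sorting costs nothing.

```python
# Hypothetical parsed rows: (conformer name, score as text)
rows = [("N30", "7.9"), ("N31", "10.2"), ("N32", "8.4")]

# Sorting the raw strings compares character by character,
# so "10.2" is placed before "7.9"
by_text = sorted(rows, key=lambda r: r[1])

# Converting to float first gives the numerical ordering
by_value = sorted(rows, key=lambda r: float(r[1]))

best_name, best_score = by_value[-1]
print("last by text: ", by_text[-1])
print("last by value:", (best_name, float(best_score)))
```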

Now, for our particular problem we need structural data for M^pro with a bound ligand (N3). We can get this from the recent publication’s supplemental information (vide supra). With the raw data in hand, we then need to prepare the individual protein and bound ligand structure data (PDBQT files) for modeling using Python Molecular Viewer (PMV) (e.g., by removing water molecules, adding hydrogens, and generating a grid). I did this by following the tutorial on the AutoDock Vina website. Then to test our docking class we will run simulations using the N3 structures generated with the molecule class above.

protein = 'proteins/6lu7_protein.pdbqt'          # Protein structure
protein_ligand = 'proteins/6lu7_ligand.pdbqt'    # Bound ligand structure
ligands = m.name + '.mol2'                       # Structures generated with molecule class

d = dock(protein, protein_ligand, ligands)
d.run(m.name + '_docking')

Now let’s visualize the results for the best predicted binding mode of N3 using PMV. You can see that the predicted structure fits nicely in the binding pocket and that the conformation with the best predicted binding affinity has good agreement with the experimental structure.

Crystal Structure:

Sample of Predicted Binding Modes:

Structure Overlay:

Variational autoencoder

The next piece of the puzzle is the variational autoencoder (VAE), which is used to convert SMILES representations of molecules to and from a continuous numerical encoding. Importantly, this model will allow us to run an optimization in the continuous representation while decoding to discrete chemical structures. For this demonstration we will use an autoencoder trained on drug-like molecules from a recent paper (Gómez-Bombarelli et al. “Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules”, ACS Cent. Sci., 2018, 4, 268). Here is some example code from the authors’ repository for encoding and decoding SMILES strings:

from os import environ
environ['KERAS_BACKEND'] = 'tensorflow'
from chemvae.vae_utils import VAEUtils
from chemvae import mol_utils as mu
import numpy as np
import pandas as pd

# Load model from paper
vae = VAEUtils(directory='chemical_vae-master/models/zinc_properties')

# Encode a SMILES string
smiles = mu.canon_smiles('CSCC(=O)NNC(=O)c1c(C)oc(C)c1C')
X = vae.smiles_to_hot(smiles, canonize_smiles=True)
z = vae.encode(X)

# Decode a point in the continuous embedding
df = vae.z_to_smiles(z, decode_attempts=100, noise_norm=5.0)

print('Found {:d} unique mols, out of {:d}.'.format(len(set(df['smiles'])),sum(df['count'])))
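
Before a SMILES string reaches the encoder it has to become a fixed-size numerical array, which is the role `smiles_to_hot` plays above. As a toy illustration of the idea only, the sketch below one-hot encodes a padded SMILES string with a tiny hand-picked character set. The character set, padding scheme, and function names are assumptions made for this example and are not taken from chemvae.

```python
def one_hot(smiles, charset, max_len):
    """Pad a SMILES string with spaces and one-hot encode each character."""
    index = {c: i for i, c in enumerate(charset)}
    padded = smiles.ljust(max_len)
    return [[1 if index[ch] == j else 0 for j in range(len(charset))]
            for ch in padded]

def from_one_hot(matrix, charset):
    """Invert one_hot: read off the active character in each row and drop padding."""
    return ''.join(charset[row.index(1)] for row in matrix).strip()

# Toy character set (an assumption); the first entry is the padding character
charset = [' ', 'C', 'O', 'N', '(', ')', '=', '1', 'c']

X_toy = one_hot('CC(=O)N', charset, max_len=10)
print(len(X_toy), len(X_toy[0]))      # rows = max_len, columns = len(charset)
print(from_one_hot(X_toy, charset))
```

The VAE then compresses this sparse matrix into a short dense vector `z`, and it is that vector, not the one-hot matrix, that the optimizer works with.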

Bayesian optimization

Bayesian optimization is an iterative, response surface-based global optimization algorithm which has demonstrated excellent performance in a number of tasks (Shahriari et al. “Taking the Human Out of the Loop: A Review of Bayesian Optimization”, Proceedings of the IEEE, 2016, 104, 148). In a recent collaboration, we developed a framework for Bayesian optimization that is compatible with encoded chemical data, along with an open-source Python software tool (EDBO) that enables easy integration with a given task.
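
To give a feel for how an uncertainty-guided optimizer ranks candidates, here is a minimal sketch of the textbook expected improvement (EI) criterion for a Gaussian posterior, written with the standard library only. The function name, the `xi` exploration offset, and the candidate numbers are illustrative assumptions, and EDBO's own EI implementation may differ in its details.

```python
import math

def expected_improvement(mean, std, best, xi=0.0):
    """EI for maximization under a Gaussian posterior N(mean, std**2)."""
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly
        return max(mean - best - xi, 0.0)
    z = (mean - best - xi) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best - xi) * cdf + std * pdf

# Made-up posterior (mean, std) for three candidates and the best observed value
candidates = {"a": (1.0, 0.1), "b": (0.8, 1.0), "c": (1.2, 0.0)}
best_so_far = 1.1

scores = {k: expected_improvement(m, s, best_so_far)
          for k, (m, s) in candidates.items()}
pick = max(scores, key=scores.get)
print(scores, pick)
```

Candidate "b" has the lowest predicted mean but the largest uncertainty, and it still earns the highest EI in this toy example. That trade-off between exploiting good predictions and exploring uncertain regions is what drives the search.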

Below is a brief demonstration of an arbitrary 1D objective with a discretized domain. While EDBO is designed to work with human-in-the-loop experimentation you can also use computational objectives (we also demonstrate this feature below). Let’s start by defining the simulation functions.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from gpytorch.priors import GammaPrior
from edbo.bro import BO

# Define a computational objective
def f(x):
    """Noise free objective."""
    
    return np.sin(10 * x[0]) * x[0] * 100
  
# Bayesian optimization
X = np.linspace(0,1,1000).reshape(1000, 1)
domain = pd.DataFrame(X, columns=['x'])                        # Search space

bo = BO(domain=domain,                                         # Search space
        target='f(x)',                                         # Name of target (not required but nice)
        acquisition_function='EI',                             # Acquisition function
        init_method='rand',                                    # Initialization method
        lengthscale_prior=[GammaPrior(1.2,1.1), 0.2],          # GP length scale prior and initial value
        noise_prior=None,                                      # No noise prior
        batch_size=2,                                          # Number of experiments to choose in parallel
        fast_comp=True,                                        # Use gpytorch fast computation features
        computational_objective=f)                             # The objective is defined as a function

# Plot model and some posterior predictive samples
def plot_results(export_path):
    """Plot summary of 1D BO simulations"""

    mean = bo.obj.scaler.unstandardize(bo.model.predict(bo.obj.domain))                             # GP posterior mean
    std = np.sqrt(bo.model.variance(bo.obj.domain)) * bo.obj.scaler.std * 2                         # GP posterior standard deviation
    samples = bo.obj.scaler.unstandardize(bo.model.sample_posterior(bo.obj.domain, batch_size=3))   # GP samples
    next_points = bo.obj.get_results(bo.proposed_experiments)                                       # Next points proposed by BO
    results = bo.obj.results_input()                                                                # Results for known data

    plt.figure(1, figsize=(8,8))

    # Model mean and standard deviation
    plt.subplot(211)
    plt.plot(X.flatten(), [f(x) for x in X], color='black')
    plt.plot(X.flatten(), mean, label='GP')
    plt.fill_between(X.flatten(), mean-std, mean+std, alpha=0.4)

    # Known results and next selected point
    plt.scatter(results['x'], results['f(x)'], color='black', label='known')
    plt.scatter(next_points['x'], next_points['f(x)'], color='red', label='next_experiments')
    plt.ylabel('f(x)')

    # Samples
    plt.subplot(212)
    for sample in samples:
        plt.plot(X.flatten(), sample.numpy())
    plt.xlabel('x')
    plt.ylabel('Posterior Samples')
    
    plt.savefig(export_path + '.svg', format='svg', dpi=1200, bbox_inches='tight')
    plt.show()

Now we can run a single round of Bayesian optimization and visualize the results in terms of the selected points, the model fit, and samples from the posterior predictive distribution.

# Run a single iteration and plot results
bo.init_sample(append=True, seed=4)
bo.run()
plot_results('bo_demo1')

To run a simulation for an arbitrary number of iterations we can use the simulate method.

# Run a simulation and plot results
bo.simulate(iterations=5, seed=4)
plot_results('bo_demo2')

Final Modeling Pipeline

Selecting the initial search space. To start we want to get a general idea of the molecular structures which are predicted to bind tightly to M^pro. To do this I used Bayesian optimization to carry out an initial search over drug-like molecules from the ZINC database, encoded using the VAE. A rigorous application might run the optimization over the entire ZINC database (after filtering by criteria of interest like Lipinski’s rule of five), using an HPC cluster to expedite the calculations. However, I am running this demonstration on a 2-core i3 laptop with 8 GB of RAM, so I restrict the initial search space to 10,000 randomly sampled points.

from edbo.utils import Data

molecule_grid_size = 10000

# Sample autoencoder
Z, data, smiles = vae.ls_sampler_w_prop(size=molecule_grid_size, return_smiles=True)

# Define and standardize domain
domain = Data(pd.DataFrame(Z, columns=['z' + str(i) for i in range(len(Z[0]))]))
domain.base_data.insert(0, 'SMILES', smiles)
domain.clean()
domain.standardize(scaler='minmax', target=None)
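
As a rough picture of what min-max standardization does to the latent coordinates, the sketch below rescales each column of a small made-up matrix to the range [0, 1] using plain Python. The helper `minmax_scale` is hypothetical, and the exact range and edge-case handling used by EDBO's scaler are assumptions here.

```python
def minmax_scale(rows):
    """Scale each column of a list of rows to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0   # constant columns map to 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled

# Made-up 2D latent coordinates for three molecules
Z_toy = [[-2.0, 0.5], [0.0, 1.5], [2.0, 1.0]]
Z_scaled = minmax_scale(Z_toy)
print(Z_scaled)
```

Putting every latent dimension on the same scale keeps any one coordinate from dominating the distance calculations inside the surrogate model.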

A sample of the selected structures can be seen below:

Optimization. Next we need to define the computational objective to optimize. The objective uses the molecule class to generate structures and the dock class to run docking simulations, and returns the RF-Score for the tightest binding conformer of the molecule.

def best_docking_score(x, n_confs=3):
    """
    Objective function to be used for Bayesian optimization. Returns the predicted binding affinity for the conformer which
    binds best to Mpro.
    """
    
    # find correspondence to initial Z matrix
    index = domain.data.where(domain.data == x).dropna().index.values[0]
    
    # Get SMILES
    smiles = domain.base_data['SMILES'].iloc[index]
    
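    # Print molecule (ChemDraw is not imported in this snippet; it is assumed
    # here to be the drawing helper from edbo.chem_utils)
    cdx = ChemDraw([smiles])
    cdx.show()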
    
    # Get structures
    name = 'step_' + str(index)
    m = molecule(smiles, name)
    m.conf_gen(n_confs=n_confs)
    m.to_file(name, confs=True)
    
    # Run docking - deal with some exceptions
    try:
        protein = 'proteins/6lu7_protein.pdbqt'           # Protein (delete water, add all H, merge non-polar, grid-->macromol-->choose)
        protein_ligand = 'proteins/6lu7_ligand.pdbqt'     # Protein + bound ligand structure - protein
        ligands = m.name + '.mol2'                        # Candidate ligand structures for binding
        d = dock(protein, protein_ligand, ligands)
        d.run(name + '_docking')
        score = d.results.sort_values(d.results.columns.values[-1]).iloc[-1].values[-1]
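    except Exception as e:
        if 'cannot access' in str(e):
            # Temporary-file clash: report it and retry the same molecule
            print(e)
            score = best_docking_score(x, n_confs=n_confs)
        else:
            # Any other failure: fall back to a low placeholder score
            score = 4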
    
    return float(score)
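
The retry in the objective above is recursive and has no upper bound, so a persistent file-access problem would eventually exhaust the recursion limit. One possible alternative is a small wrapper that retries a fixed number of times and then gives up. The sketch below is only an illustration of that pattern: `with_retries` and `flaky_docking` are hypothetical names, and the fallback value simply mirrors the placeholder score used above.

```python
def with_retries(func, max_retries=3, fallback=4.0, retry_on='cannot access'):
    """Call func(); retry on matching errors, return fallback otherwise."""
    for attempt in range(max_retries + 1):
        try:
            return float(func())
        except Exception as e:
            if retry_on in str(e) and attempt < max_retries:
                print('Retrying after:', e)
                continue
            # Non-matching error, or retries exhausted
            return fallback

# Stand-in for a docking call that fails twice before succeeding
calls = {'n': 0}

def flaky_docking():
    calls['n'] += 1
    if calls['n'] < 3:
        raise OSError('cannot access temporary file')
    return 7.2

score = with_retries(flaky_docking)
print(score, calls['n'])
```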

Now with this computational objective in hand we can run an initial Bayesian optimization over the randomly sampled search space. Since this is just a demonstration, we will just run the optimizer from a single initial starting point randomly selected from the domain. Keep in mind that in practice it would be better to initialize the optimization with numerous points and to utilize a larger search space.

from edbo.bro import BO

bo = BO(domain=domain.data,
        computational_objective=best_docking_score,
        batch_size=1,
        target='score')

bo.simulate(seed=0, iterations=45)

Now we can plot the optimization results in terms of the scores for each molecule and visualize the best scoring ligand. Interestingly, the initially selected molecule actually had a reasonably high docking score. However, over the course of the initial optimization we are still able to see an improvement in predicted binding affinity.

We can dive deeper into the results by visualizing the ligand conformer for the best conformation using PMV. What is very nice to see in the space filling model is that the identified structure actually fits very nicely into the binding site.

Local optimization. As a final step, we can run a more fine-grained optimization by using the VAE to generate a new search space centered on one of the interesting structures identified in the initial optimization. Let’s do this by first decoding randomly sampled points in the embedded space around the structure of interest. Then, we can generate additional structures branching out from this region of the space in rounds by carrying out the same procedure on each of the initially decoded points. Finally, the resulting list of structures will define the new search space. The following code will allow us to carry out this local expansion procedure.

def local_search_space(SMILES, noise_norm=10.0, decode_attempts=250, n_points=100):
    """
    Generate a list of similar structures by sampling random points about a given SMILES string
    in the encoded space. Returns the encoded space as a pandas.DataFrame.
    """
    
    # Encode smiles
    smiles = mu.canon_smiles(SMILES)
    X = vae.smiles_to_hot(smiles, canonize_smiles=True)
    z = vae.encode(X)
    
    # Randomly sample about point
    # print('Searching molecules randomly sampled from {:.2f} std (z-distance) from the point...'.format(noise_norm))
    df = vae.z_to_smiles(z, 
                         decode_attempts=decode_attempts, 
                         noise_norm=noise_norm, 
                         n_points=n_points,
                         constant_norm=False)
    #print('Found {:d} unique molecules, out of {:d}.'.format(len(set(df['smiles'])),sum(df['count'])))

    return df

def mutate_search_space(SMILES, n_mutations=1, noise_norm=10.0, decode_attempts=250):
    """
    Carries out a series of local searches iteratively and returns a list of SMILES strings.
    """
    
    smiles_list = [[SMILES]]
    out = [SMILES]
    
    # Run mutations over identified smiles
    for i in range(n_mutations + 1):
        expansion = []
        for smi in smiles_list[-1]:
            smiles = local_search_space(smi, 
                                        noise_norm=noise_norm, 
                                        decode_attempts=decode_attempts)['smiles'].drop_duplicates().values
            expansion = expansion + list(smiles)
        
        expansion = list(set(expansion))
        print('Mutation round ' + str(i) + ':', 'Identified ' + str(len(expansion)) + ' unique structures...')
        smiles_list.append(expansion)
        out = out + list(expansion)
    
    # Remove initial point
    out = list(set(out) - set([SMILES]))
    
    print('Identified ' + str(len(out)) + ' total unique structures....')
    
    return out

def smiles_to_z(smiles_list):
    """
    Convert a list of SMILES strings to encoded values using the VAE.
    """
    
    Z = []
    S = []
    for s in smiles_list:
        try:
            smiles = mu.canon_smiles(s)
            X = vae.smiles_to_hot(smiles, canonize_smiles=True)
            z = vae.encode(X)
            Z.append(z[0])
            S.append(s)
        except Exception:
            # Skip structures the VAE cannot encode
            continue
        
    df = pd.DataFrame(Z, columns=['z' + str(i) for i in range(len(Z[0]))])
    df.insert(0, 'SMILES', S)
        
    return df
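
To visualize what "sampling about a point" means in the latent space, here is a small standard-library sketch that perturbs a latent vector along a random direction by a random distance of at most `noise_norm`. This is an assumed perturbation scheme chosen for illustration. The `perturb` helper is hypothetical and may not match what chemvae does internally when `constant_norm=False`.

```python
import math
import random

def perturb(z, noise_norm, rng):
    """Move z along a random direction by a length drawn uniformly from [0, noise_norm]."""
    direction = [rng.gauss(0.0, 1.0) for _ in z]
    length = math.sqrt(sum(d * d for d in direction))
    radius = rng.uniform(0.0, noise_norm)
    return [zi + radius * d / length for zi, d in zip(z, direction)]

rng = random.Random(0)
z_center = [0.0] * 8                     # toy 8-dimensional latent point
neighbours = [perturb(z_center, noise_norm=10.0, rng=rng) for _ in range(100)]

# Every sampled point lies within noise_norm of the starting point
distances = [math.sqrt(sum(v * v for v in p)) for p in neighbours]
print(len(neighbours), max(distances) <= 10.0 + 1e-9)
```

Each perturbed vector would then be decoded back to a SMILES string. Many nearby points decode to the same molecule, which is why the functions above deduplicate after every round.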

For this demonstration, I carried out the sampling procedure over 3 total rounds, starting from the best scoring structure from the initial optimization, to give 1166 unique structures. You will notice that the VAE decoded to some nonsense (not chemically reasonable) structures. However, it is a simple matter to filter such structures after the optimization.

# Generate local search space
smiles = mutate_search_space('COc1cccc(CN2C(=O)N[C@](C)(C3CCN(C(=O)c4ccc(C)nc4)CC3)C2=O)c1OC',
                             noise_norm=20.0,
                             decode_attempts=500,
                             n_mutations=2)

# Clean and normalize encoded data
domain = Data(smiles_to_z(smiles))
domain.clean()
domain.standardize(scaler='minmax', target=None)

Sample of structures:

Finally, we can run the second round of optimization over this new domain using the same methodology as before. In this round the optimizer was able to discover several structures with predicted binding affinity at or above that of the best structure from the initial search. The structure with the highest predicted binding affinity and the optimization path is shown below.

Once again, we can check out the results for the best ligand conformer using PMV.

This about wraps up our hands-on investigation of molecular docking. In this post, we have built a general and fully automated ligand discovery pipeline using molecular docking and ML. This system carries out molecular design (using a VAE), optimization (using BO), and scoring (via molecular docking and an RF model) using freely available and open-source software. As a demonstration, we used this approach to identify promising binders of the SARS-CoV-2 protease M^pro. Notably, the optimizer identified molecular structures with near nanomolar predicted binding affinity using a low-power laptop and evaluating < 100 candidate structures. Therefore, if this system were improved and scaled to include a larger search space, it is plausible that even more promising molecular structures could be identified. Finally, to put this demonstration in context, in a real drug development program it would next be up to synthetic chemists and biologists to experimentally validate such findings.


Bayesian Reaction Optimization Using EDBO - Part IV

1 minute read

Published: October 06, 2020

Part IV - Bayesian Reaction Optimization Workshop

Bayesian Reaction Optimization Using EDBO - Part III

20 minute read

Published: October 04, 2020

Part III - Bayesian Reaction Optimization

Bayesian Reaction Optimization Using EDBO - Part II

11 minute read

Published: October 01, 2020

Part II - Software introduction

In part I we installed the pre-release of EDBO and ran some basic functionality tests. Now in part II we can dive into a basic introduction to using the software. In this post we provide example code for Bayesian optimization of a 1D objective which can be used to explore some of the software’s features. The main Bayesian optimization program is accessed through the edbo.bro module. The main BO classes, edbo.bro.BO and edbo.bro.BO_express, enable users to select initial experiments with experimental designs, run BO on human-in-the-loop or computational objectives, model data, and analyze results. Note: BO parameters are preset to those optimized for DFT encodings in the paper. However, BO_express attempts to automate the selection of priors based on the search space. In general, the BO class is more flexible but, as a result, less user friendly. Therefore, let’s use the BO_express class in this demonstration.

To start we need to define a search space and an objective. In general, for any application it is up to us to define where the optimizer will search for conditions that maximize our objective. For a reaction your objective may be the yield of the desired product. Here I am using an arbitrary function, so feel free to change it to anything you want for this demo.

Define Objective and Search Space

import numpy as np
import matplotlib.pyplot as plt

# Define a computational objective
# EDBO works with feature vectors so even a 1D objective needs to be vectorized

def f(x):
    """Noise free objective."""
    
    return np.sin(x[0]) * x[0] * 5 + 30

def g(x):
    """With noise."""
    
    return f(x) + (np.random.random() - 0.5) * 15
  
# BO uses a user defined domain

X = np.linspace(0,10,1000)    # Grid of 1000 points between 0 and 10

Now we can use matplotlib to visualize the objective.

sample = np.random.choice(X, 100)
plt.figure(figsize=(5,5))
plt.plot(X, [f([x]) for x in X])
plt.scatter(sample, [g([x]) for x in sample], alpha=0.5)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('"Unknown" Objective')
plt.show()
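
Before handing the problem to an optimizer, it can help to know the answer we are hoping it finds. The sketch below is a plain NumPy brute-force check, not part of EDBO: it evaluates the noise-free objective on the same grid and reports the best grid point. The names values, best_index, x_best, and f_best are chosen here for illustration.

```python
import numpy as np

def f(x):
    """Noise free objective (same definition as above)."""
    return np.sin(x[0]) * x[0] * 5 + 30

# Same domain as above: grid of 1000 points between 0 and 10
X = np.linspace(0, 10, 1000)

# Evaluate the objective at every grid point and locate the maximum
values = np.array([f([x]) for x in X])
best_index = int(np.argmax(values))
x_best, f_best = X[best_index], values[best_index]

print('Best grid point: x = %.3f, f(x) = %.2f' % (x_best, f_best))
```

An exhaustive search like this is only possible because the objective is cheap. With real experiments each evaluation costs time and material, which is where Bayesian optimization comes in.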

Using EDBO

With our search space prepared we can now use EDBO to choose initial experiments, evaluate models, and run Bayesian optimization. There are several ways in which the main BO methods can be used. Let’s start by checking out the options when instantiating BO objects. Here is a link to the documentation page: edbo.bro.

First, as we are checking out some of EDBO’s features, it will be handy to have a nice plotting function.

# Handy function to visualize the results

def map_corr(df):
    """Get corresponding points in unstandardized domain."""
    
    index = []
    for x in df.values:
        i = np.argwhere(bo.obj.domain.values == x).flatten()[0]
        index.append(i)
    
    return bo.reaction.get_experiments(index)

def plot_results(export_path=None, plot_samples=True):
    """Plot summary of 1D BO simulations."""

    mean = bo.obj.scaler.unstandardize(bo.model.predict(bo.obj.domain))                             # GP posterior mean
    std = np.sqrt(bo.model.variance(bo.obj.domain)) * bo.obj.scaler.std * 2                         # GP posterior standard deviation
    next_points = bo.reaction.get_experiments(bo.proposed_experiments.index.values).copy()          # Next points proposed by BO
    next_points['g(x)'] = [f(x) for x in next_points.values]
    results = map_corr(bo.obj.results.drop('g(x)', axis=1))                                         # Results for known data
    results['g(x)'] = [g(x) for x in results.values]    
    
    plt.figure(1, figsize=(8,8))

    # Model mean and standard deviation
    plt.subplot(211)
    plt.plot(X, [f([x]) for x in X], color='black')
    plt.plot(X, mean, label='GP')
    plt.fill_between(X, mean-std, mean+std, alpha=0.4)

    # Known results and next selected point
    plt.scatter(results['x_index'], results['g(x)'], color='black', label='known')
    plt.scatter(next_points['x_index'], next_points['g(x)'], color='red', label='next')
    plt.ylabel('f(x)')
    
    # Plot some posterior samples
    if plot_samples:
        samples = bo.obj.scaler.unstandardize(bo.model.sample_posterior(bo.obj.domain, batch_size=2))
        i = 1
        for sample in samples:
            plt.plot(X, sample.numpy(), '--', label='sample' + str(i))
            i += 1
    
    plt.legend(loc='lower left')

    # Plot the acquisition function
    plt.subplot(212)
    for p in bo.acq.function.projections:
        plt.plot(bo.obj.domain['x'], p)

    plt.xlabel('x')
    plt.ylabel('Acquisition Function')
    
    if export_path is not None:
        plt.savefig(export_path, format='svg', dpi=1200, bbox_inches='tight')
    
    plt.show()

Initialization methods

Suppose we have no data and want to start by selecting initial experiments to run. With EDBO we can do this at random or by using clustering methods. I have also written some DOE add-on modules which enable you to use response surface (e.g., central composite) and fractional factorial designs. However, these are not included in EDBO 0.0.0. Here we use the centroids from k-means clustering for initialization.

from edbo.bro import BO_express

# (1) Define a dictionary of components
components = {'x':X}

# (2) Define a dictionary of desired encodings
encoding={'x':'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                batch_size=2,
                target='g(x)',
                init_method='kmeans')

# (4) Choose initial experiments using k-means
bo.init_sample()

print('\nNormalized domain points:')
bo.proposed_experiments

Normalized domain points:

	x
252	0.252252
751	0.751752

We can get the unnormalized experiments (or SMILES strings etc.) using the get_experiments method.

print('\nDomain points:')
bo.get_experiments()

Domain points:

	x_index
252	2.52252
751	7.51752

And we can plot the choices on the domain.

plt.figure(figsize=(6,1))
plt.scatter(bo.obj.domain['x'], np.ones((len(bo.obj.domain))))
plt.scatter(bo.proposed_experiments, np.ones((len(bo.proposed_experiments))), s=100)
plt.xlabel('x')
plt.yticks([])
plt.show()
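
To give a feel for what centroid-based initialization is doing, here is a minimal NumPy sketch of the idea: cluster the domain with Lloyd's algorithm and pick the domain point closest to each centroid. This is an illustration under my own assumptions, not EDBO's implementation, so the exact indices can differ slightly from those shown above. The function name kmeans_init and its arguments are hypothetical.

```python
import numpy as np

def kmeans_init(domain, k, n_iter=100, seed=0):
    """Return indices of the domain points closest to k k-means centroids."""
    rng = np.random.default_rng(seed)
    points = np.asarray(domain, dtype=float).reshape(len(domain), -1)

    # Start from k distinct, randomly chosen domain points
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Experiments must come from the domain, so snap centroids to the grid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.sort(dists.argmin(axis=0))

X = np.linspace(0, 10, 1000)
init_index = kmeans_init(X, k=2)
print(init_index, X[init_index])
```

On a uniform 1D grid the two clusters split the domain in half, so the selected points land near x = 2.5 and x = 7.5, which spreads the initial experiments across the search space.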

Human-in-the-loop optimization

Now we can move on to the optimization. If you were really running experiments in the lab, you would likely just want to use the run method to iteratively choose experiments, then go into the lab, run the experiments, collect the results, and read them back into the optimizer. Let’s see what that would look like. First, let’s export the proposed experiments to a CSV file so we can add the results after we “run” the experiments.

# Without an argument this will export 'experiments.csv' to the cwd.
bo.export_proposed()

Since this is actually a computational objective we can “run” the experiments right here.

# "Run" the experiments
expts = bo.get_experiments()
expts['g(x)'] = [g(x) for x in expts.values]

# Save the results as a CSV
expts.to_csv('results.csv')

# Load the results
bo.add_results('results.csv')

Then in order to choose the next experiments we simply use the run method.

bo.run()

And we can return basic analysis of the acquisition process using the acquisition_summary method.

bo.acquisition_summary()

	x	predicted g(x)	variance
960	0.960961	60.7232	1581.51
631	0.631632	64.5289	969.982

You can continue this process iteratively until the objective is maximized or you run out of resources. We can get an idea of what is going on under the hood using our plotting function. In the top plot, notice that the model mean fits the experimental results well and that the model confidence region (2$\sigma$) captures the unknown objective. As a result, when we sample the posterior predictive distribution of the model, you can see that one of the random functions (yellow dashed) actually captures most of the variation in the objective. The default acquisition function used by EDBO is parallel expected improvement (EI). The computed EI, used to select the next round of experiments, is shown in the bottom plot. Notice that the ArgMax of the acquisition function gives the next two experiments (red points).
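
For intuition about the bottom plot, here is a minimal sketch of the closed-form, single-point expected improvement computed from a model's predicted mean and standard deviation. This is not EDBO's code: EDBO's default is a parallel (batch) variant, and the function name expected_improvement and its arguments are hypothetical names chosen for illustration.

```python
import math
import numpy as np

def expected_improvement(mean, std, best):
    """Analytic EI for maximization, given posterior mean and std."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    improvement = mean - best

    # Avoid division by zero where the model has no uncertainty
    safe_std = np.where(std > 0, std, 1.0)
    z = improvement / safe_std

    # Standard normal CDF and PDF evaluated at z
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

    ei = improvement * cdf + std * pdf

    # With zero uncertainty, EI is just the (non-negative) improvement
    return np.where(std > 0, ei, np.maximum(improvement, 0.0))

# Two candidates with the same predicted mean but different uncertainty
print(expected_improvement([60.0, 60.0], [5.0, 30.0], best=65.0))
```

The first term rewards candidates predicted to beat the best result so far (exploitation), and the second rewards uncertainty (exploration), which is why a candidate with a mediocre predicted mean but a large variance can still be selected.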

Automated optimization

Given that $f$ is actually a computational objective, we could just use EDBO to automatically optimize the objective. Below is some sample code for how you can do this using the computational objective option.

# EDBO works on the normalized search space
# We need a new function that maps to the real domain
def h(x):
    """Deal with scaling."""
    
    i = np.argwhere(bo.obj.domain.values == x).flatten()[0]
    df = bo.reaction.get_experiments(i)
    
    return g(df.values)

# Use the computational_objective argument
bo = BO_express(components,
                encoding,
                batch_size=2,
                target='g(x)',
                computational_objective=h)

# Run the optimization automatically using simulate
bo.simulate(seed=4, iterations=5)

# Plot the results
plot_results()

Configuring the optimizer

Models. In Bayesian optimization, the surrogate model type defines a prior over functions which captures our assumptions about the shape of the response surface. When we combine this prior with observed reaction data, we get a posterior distribution of functions which we can use to reason about the possible positions of global optima. Practically speaking, many acquisition functions (but not all, e.g., Thompson Sampling) are formulated from the surrogate model’s mean and variance. Thus, in principle any regression model can be employed in Bayesian optimization (e.g., by bootstrapping variance estimates). EDBO currently has three different surrogate models built into the edbo.models module: Gaussian processes (edbo.models.GP_Model, GPyTorch), random forests (edbo.models.RF_Model, Scikit-Learn), and Bayesian linear regression (edbo.models.Bayesian_Linear_Model, Scikit-Learn). See the edbo.models documentation page for more details. We can get an idea of the shape of these functions using the plotting method we wrote above (vide infra). It is also straightforward to implement your own model; see the edbo.models module for examples. Below is an example code block for utilizing a random forest model instead of the default Gaussian process.

from edbo.bro import BO_express
from edbo.models import RF_Model

# (1) Define a dictionary of components
components = {'x':X}

# (2) Define a dictionary of desired encodings
encoding={'x':'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                model=RF_Model)
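
To make the "prior plus data gives posterior" idea concrete, here is a small NumPy sketch of Gaussian process regression with a zero-mean prior and an RBF kernel. It is a textbook-style illustration, not EDBO's GP_Model (which is built on GPyTorch). The function names, the kernel choice, and the hyperparameter values are all assumptions made for this example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared exponential kernel between two sets of 1D points."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    return variance * np.exp(-0.5 * (a - b.T) ** 2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale=1.0, variance=1.0, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    y_train = np.asarray(y_train, dtype=float)
    K = rbf_kernel(x_train, x_train, length_scale, variance)
    K = K + noise * np.eye(len(y_train))          # observation noise / jitter
    K_s = rbf_kernel(x_train, x_test, length_scale, variance)

    # Solve with a Cholesky factorization for numerical stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)

    mean = K_s.T @ alpha
    var = variance - np.sum(v ** 2, axis=0)       # prior variance minus explained part
    return mean, np.maximum(var, 0.0)

x_train = np.array([2.5, 7.5])
y_train = np.array([1.0, -1.0])    # standardized observations (made-up values)
x_test = np.array([2.5, 5.0, 7.5, 100.0])
mean, var = gp_posterior(x_train, y_train, x_test)
print(mean, var)
```

Near the observed points the posterior variance collapses toward zero, while far from the data it returns to the prior variance. That behavior is what lets an acquisition function tell well-explored regions from unexplored ones.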

Gaussian Process Regression (EDBO’s default model):

Random Forest Regression:

Bayesian Linear Regression:


Acquisition functions. The acquisition function is the algorithm responsible for selecting the next experiments to run based on the information captured by the surrogate model. Most acquisition functions are built to balance the exploration of the search space with the exploitation of information available from evaluated experiments. EDBO has several acquisition functions available via keyword arguments from the BO and BO_express classes. A full list can be found in the documentation. The default acquisition function, expected improvement, is derived from the expectation value of the improvement utility function. Below is an example code block for choosing different acquisition functions and a few examples of parallel acquisition functions which utilize the Kriging Believer algorithm for batching.

from edbo.bro import BO_express

# (1) Define a dictionary of components
components = {'x':X}

# (2) Define a dictionary of desired encodings
encoding={'x':'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                acquisition_function='UCB')
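
To see how these choices trade off exploration and exploitation, here is a minimal sketch that scores candidates from a predicted mean and standard deviation. It is only an illustration: the function acquisition_scores, the kind strings other than 'UCB', and the kappa parameter are hypothetical and are not EDBO's keyword values (see the EDBO documentation for those).

```python
import math
import numpy as np

def acquisition_scores(mean, std, best, kind='UCB', kappa=2.0):
    """Score candidates; the highest score is the next experiment."""
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)

    if kind == 'UCB':        # upper confidence bound: mean plus a bonus for uncertainty
        return mean + kappa * std
    if kind == 'PI':         # probability that a candidate beats the best result so far
        z = (mean - best) / np.maximum(std, 1e-12)
        return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    if kind == 'mean':       # pure exploitation
        return mean
    if kind == 'variance':   # pure exploration
        return std ** 2
    raise ValueError('Unknown acquisition function: %s' % kind)

# Three candidates: uncertain, confidently mediocre, confidently good
mean = [1.0, 2.0, 3.0]
std = [3.0, 0.1, 0.5]
for kind in ['UCB', 'PI', 'mean', 'variance']:
    scores = acquisition_scores(mean, std, best=2.5, kind=kind)
    print(kind, 'selects candidate', int(np.argmax(scores)))
```

Mean maximization and probability of improvement favor the candidate the model already believes is good, whereas variance maximization and (with a large enough kappa) UCB favor the candidate the model knows least about.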

Expected improvement (EDBO’s default acquisition function):

Probability of improvement:

Upper confidence bound:

Mean maximization (pure exploitation):

Variance maximization (pure exploration):

Analysis

During optimization we can run miscellaneous analyses using some of EDBO’s built-in functions. For example, we can plot the optimizer’s path.

bo.plot_convergence()

And we can evaluate how well the model fits the experimental data.

bo.model.regression()
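
If you want numbers to go with a fit check like this, two common summaries are the root-mean-square error and the coefficient of determination. The sketch below computes both with NumPy. It is a generic illustration and I am not claiming it reproduces what bo.model.regression() reports. The function name fit_metrics and the example values are hypothetical.

```python
import numpy as np

def fit_metrics(observed, predicted):
    """Return (RMSE, R^2) comparing model predictions to observations."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted

    rmse = np.sqrt(np.mean(residuals ** 2))

    # R^2 compares the residual error to the spread of the observations
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Made-up observations and predictions for illustration
rmse, r2 = fit_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print('RMSE = %.3f, R^2 = %.2f' % (rmse, r2))
```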

Finally, note that if you need help, EDBO has a basic BOT which can run most of its methods. You can call the BOT using the help method. For example, suppose you wanted to save your workspace for later.

bo.help()

This will spawn an interactive session:

edbo bot: What can I help you with?
~  Save my workspace

edbo bot: Can you clarify: pickle BO object for later, export proposed, or exit?
~  pickle it

edbo bot: Save instace? (yes or no) You can load instance later with edbo.BO_express.load().
~  yes

edbo bot: Saving edbo.BO instance...


edbo bot: What can I help you with?
~  exit

edbo bot: Exiting...

Up next

EDBO and the main BO classes have a lot more features, but hopefully this gives you an idea of how they can be used. In the next post we will see how to apply EDBO to chemical reaction data.

Bayesian Reaction Optimization Using EDBO - Part I

2 minute read

Published: September 30, 2020

Recently, in collaboration with folks over at Princeton and Bristol Myers Squibb, I finished writing a python package called Experimental Design via Bayesian Optimization (EDBO) for reaction optimization which enables the application of Bayesian optimization, an uncertainty guided response surface method, to chemical reactions in the laboratory. Now, the paper is submitted for publication and under review so I have not yet made the repository public. However, to facilitate training and beta testing I am writing a few preliminary posts on (1) installation and basic software usage, (2) simulations with real chemical reaction data, (3) using EDBO in the lab, and (4) tackling computational optimization problems.

Reference: Shields, Benjamin J.; Stevens, Jason; Li, Jun; Parasram, Marvin; Damani, Farhan; Martinez Alvarado, Jesus; Janey, Jacob; Adams, Ryan P.; Doyle, Abigail G. “Bayesian Reaction Optimization as a Tool for Chemical Synthesis” Manuscript Submitted.

Part I - Installation

Ok boring stuff first. In this post we will be tackling software installation from the code in my private repository (so no Git, PyPI, or Anaconda for now).

Install conda

If you haven’t already installed anaconda (or miniconda) on your machine you can follow the instructions provided by conda.

Install EDBO

Windows Script

I wrote a shell script (install.sh) to install EDBO on Windows machines. You will find a copy in the edbo.zip folder provided.

Download and unzip the folder.
Open an anaconda prompt, navigate to the edbo directory, and run the script.

cd path/to/edbo/directory
sh install.sh

Mac/Linux Script

I wrote a slightly different shell script (install_mac.sh) to install EDBO on Mac/Linux machines. You will find a copy in the edbo.zip folder provided.

Download and unzip the folder.
Open a terminal and create a conda environment for EDBO.

conda create -y --name edbo python=3.7.5
conda activate edbo

Navigate to the edbo directory and run the script.

cd path/to/edbo/directory
sh install_mac.sh

Software tests

Use the pytest framework to run some basic software tests to make sure the installation worked. In the anaconda prompt (or terminal for Mac/Linux), navigate to the folder containing edbo. Then run the following commands and you will see test logs appear in the testing directory. These may take a few minutes to run, and you should see some warnings but no failed tests. If you do see failed tests, please let me know so I can fix the issue and update the software.

conda activate edbo
cd tests
sh basic_tests.sh

Up next

That wraps up this post. In Part II we will walk through a basic introduction to the software.

Benjamin J. Shields, Ph.D.
