Machine learning in methods development: From reaction outcome prediction to mechanistic understanding

Date:

Machine learning (ML), the development and study of computer algorithms that learn from data, is increasingly important across a wide array of applications in chemistry. For example, ML has facilitated virtual screening of druglike molecules for medical applications, rapid prediction of physical data, and computer-aided synthesis planning. While ML has become well established in these areas, scientists have only just begun to develop tools for synthetic methods development (reaction optimization, prediction, and mechanistic study). Though these burgeoning areas of research have already added to the synthetic chemist’s toolbox, everyday research practices have remained relatively unaffected. One approach to facilitating the adoption of ML in synthetic chemistry is to develop applications that integrate seamlessly with the typical methods of synthetic chemists. Here I will discuss approaches to some obstacles to incorporating ML into mainstream synthetic practice, including: (1) Interpretability: scientists may not trust a model because its predictions appear unintelligible or arbitrarily derived from the regressors. This challenge could be overcome by using simple, interpretable graphics and traditional physical organic chemistry to explain and experimentally probe ML results. (2) Data: current approaches to applying ML in synthetic chemistry have focused on mining the chemical literature or actively generating new datasets on a per-problem basis. However, mined data is sparse, noisy, and often incomplete, and dataset curation imposes a heavy experimental cost. An alternative approach is to draw on the success of ML in other areas that incorporate data endogenous to a given domain (e.g., product recommendation systems). Much of the data collected in synthetic chemistry laboratories is derived from the optimization of reactions. While this data is typically leveraged only toward the discovery of optimal conditions, a method that draws on optimization data, quantum chemical calculations, and ML could integrate naturally with synthetic research practices.