Synergy of semiempirical models and machine learning in … – American Institute of Physics

A detailed review of SEQM methods is available in the literature.42–44 Figure 1(a) illustrates how SEQM fits into the broader landscape of computational chemistry methodologies. Note that this depiction is simplified: it assumes the methods are applied to a hypothetical set of small molecules, such as drug-like compounds, as in previous studies.27,45 At the lower end of the accuracy scale, classical force fields (FFs) are employed. FFs typically use simple, physically motivated terms46 to account for phenomena such as bond stretching and angle bending in the harmonic approximation, non-covalent interactions, and Coulomb electrostatics. Their computational efficiency enables large-scale applications, such as protein folding simulations.47 More advanced FFs incorporate additional effects, including polarization and even bond breaking/formation, as exemplified by ReaxFF.48 Machine learning interatomic potentials (MLIAPs) trained on high-quality datasets can potentially achieve greater accuracy in specific applications, such as torsional data benchmarks.27,45 Relative to traditional FFs, a typical MLIAP may contain two or three orders of magnitude more model parameters, and its computational cost grows commensurately. MLIAP simulations of up to 10^7 atoms have been achieved.49,50 Both FFs and MLIAPs avoid the self-consistent solution of a quantum Hamiltonian and instead make strong assumptions about the spatial locality of chemical interactions, which leads to linear scaling of computational cost with system size. We point curious readers to the comprehensive literature on MLIAPs.4,8,12,24,51–53 Note, though, that the accuracy and transferability of MLIAPs cannot be rigorously compared with those of electronic structure methods, because MLIAPs are trained for specific elemental compositions and/or crystal structures, and their high accuracy is confined to the training domain.
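As a concrete illustration of the force-field terms listed above, a generic functional form can be sketched as follows. This is a textbook-style expression, not one taken from the source: the Lennard-Jones form for the non-covalent term and all symbols (force constants k_b and k_θ, equilibrium values r_0 and θ_0, pair parameters ε_ij and σ_ij, partial charges q_i) are assumptions chosen for illustration, and torsional terms are omitted.

```latex
E_{\mathrm{FF}} =
  \sum_{\mathrm{bonds}} \frac{k_b}{2}\,(r - r_0)^2
+ \sum_{\mathrm{angles}} \frac{k_\theta}{2}\,(\theta - \theta_0)^2
+ \sum_{i<j} 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12}
                                   - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]
+ \sum_{i<j} \frac{q_i q_j}{4\pi\varepsilon_0\, r_{ij}}
```

Each term is a closed-form function of the atomic positions, so no self-consistent quantum problem has to be solved. With a distance cutoff on the pair sums, the cost per atom stays roughly constant, which is consistent with the linear scaling noted above.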
The opposite end of the scale is dominated by a family of coupled cluster (CC) approaches,54 which remain the gold standard for accurate electronic structure calculations while retaining polynomial scaling. Most other methods naturally fall in between. In the space of transferability, accuracy, and cost, SEQM methods occupy the middle ground between FFs and density functional theory (DFT) methods,55 the workhorse of computational chemistry. Much more affordable than DFT, SEQM methods are usually applied to large systems (10^2–10^3 atoms) that, for physical reasons, should not be treated classically. Over the past decade, the use of SEQM has been substantially limited by the development of accurate and affordable DFT functionals and their highly parallelized implementations. However, we expect this balance to change with the arrival of ML-parametrized SEQM (ML-SEQM), which can offer accuracy on par with or exceeding that of DFT at a much lower computational cost [Fig. 1(a)]. Recent advances in ML-SEQM will be the main focus of this discussion.

The original models introduced characteristic approximations to reduce the number of electronic interactions that must be calculated; for example, 3- and 4-center Coulomb integrals are neglected entirely in the popular Modified Neglect of Differential Overlap (MNDO) approach.41,56,57 Furthermore, 1-center and 2-center integrals are simplified through monopole interactions and atom-specific constants (the latter will be subjected to ML parametrization). The model parameters are optimized to reproduce a set of reference values given the provided interatomic distances. The structural knowledge encoded in those models is simplified, since it does not contain angles, bond connectivity, etc., and the final parametrization yields a single set of parameters for each element. In contrast, descriptors in ML models typically encode radial and angular information on neighboring (or even more distant) atoms along with atom types,10,13,15,18,53 allowing a more bespoke fit. The original approach can also lead to an abundance of outliers beyond the dataset and to poor transferability. "As a result, any deficiencies in the reference dataset would be reflected in deficiencies in the resulting method." This statement could easily be mistaken for a direct quote from a modern ML paper, even though it is taken from Stewart's work published back in 2002.58 The work goes on to say, "The lesson learned from this experience, an important lesson painfully learned, was that the composition of the reference dataset is of paramount importance." In 2023, this lesson may sound elementary to modern ML practitioners, but it marked the beginning of data-driven techniques for quantum chemistry, and the foothold that ML has gained in the field is, therefore, no surprise. Back then, however, outliers were tackled manually: "An effective way to prevent the errors of the type that were found would have been to use rules. Such rules would likely have prevented the types of errors that are present in PM3."58
To eliminate severe outliers, SEQM was often further modified by ad hoc rules, such as additional or manually corrected terms for specific systems [water clusters, Cu-liganded complexes,59 or peptide bonds, as a result of an improper description of nitrogen;58 Fig. 1(b)]. Along with the careful selection of target values (enthalpies, ionization potentials, etc.), implementing system-specific rules requires high-level human expertise grounded in method development, programming, and chemical research experience. It also means that system-specific rules must be recalibrated almost by hand for new applications or chemical families. We would like to conclude this historical overview with a prediction made by Stewart himself: "Finally, as more and more elements are parameterized, and as methods become increasingly sophisticated, the transition will have to be made to a purely mathematical approach."58 Stewart had foreseen that the differences between chemical systems could rarely be captured by automated if-else conditional logic. All this goes to show that the use of ML methods, designed to automatically identify patterns and hidden relationships in data, is in fact a highly logical direction that has been anticipated for decades.
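The conventional parametrization discussed in this overview, in which one static parameter set per element is tuned against reference values, is commonly posed as a weighted least-squares problem. The expression below is a sketch in notation of our own choosing rather than the source's: θ collects the element-specific parameters, y_i^ref are the reference values (for example enthalpies or ionization potentials), y_i^calc(θ) are the corresponding model predictions, and w_i are user-chosen weights.

```latex
\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} w_i
  \left[\, y_i^{\mathrm{calc}}(\theta) - y_i^{\mathrm{ref}} \,\right]^{2}
```

Written this way, the dependence on the dataset is explicit: the optimum θ* is determined entirely by which reference systems i enter the sum and how they are weighted, which is the point Stewart makes about the composition of the reference dataset.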

To the best of our knowledge, the first true application of ML in SEQM can be traced to a 2015 report60 in which a workflow for automatic parameterization was established. Rupp et al. suggested the use of an invariant Coulomb matrix61 to take into account the structure of the molecules comprising the dataset. Given that the established static parameters in orthogonalized model 2 (OM2)44 are already optimized to give the best average performance, this work builds upon these parameters and suggests only small structure-specific corrections. The pipeline is very simple: vary one parameter P at a time to find the optimal correction ΔP for each individual molecule using the Levenberg–Marquardt algorithm;62,63 train the ML model on the derived corrections ΔP via kernel ridge regression to learn the variation of ΔP with respect to the structure; predict ΔP for the molecules in a test set using the ML model; evaluate the performance of OM2 based on the P + ΔP parameters. Figure 1(c) shows a performance comparison between the original OM2 model, a revised OM2 model (rOM2; a variant conventionally reparameterized for a specific dataset in that work), and the ML-OM2 model derived with automatic parameterization in the 2015 report.60 Although rOM2 improves upon the original OM2 model, ML-OM2 exhibits slightly better accuracy for atomization enthalpies, the best among the three models. Not only is the error distribution for ML-OM2 centered at zero, but the magnitude of the errors is also noticeably reduced, narrowing the gap between SEQM and DFT. This seminal work suggests that ML is a powerful approach for a broad improvement of the SEQM family without sacrificing its favorable computational cost, even taking into account the small overhead of on-the-fly ΔP predictions.
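The four-step pipeline described above can be sketched on synthetic data. This is a minimal illustration under loudly stated assumptions, not the implementation from the 2015 report: `toy_model` is a hypothetical stand-in for an OM2 calculation, each "molecule" is reduced to a one-component descriptor instead of a Coulomb matrix, the reference data are generated from a made-up structure dependence (`true_delta`), and the Levenberg-Marquardt and kernel ridge regression routines are small hand-written versions so that the block needs only the standard library.

```python
import math

P0 = 1.0  # static baseline parameter (hypothetical stand-in for one OM2 parameter)


def toy_model(p, x):
    """Hypothetical stand-in for an SEQM calculation on 'molecule' x."""
    return x[0] * p + 0.5 * p * p


def true_delta(x):
    """Made-up structure dependence, used only to build synthetic references."""
    return 0.2 * math.sin(x[0])


def reference(x):
    return toy_model(P0 + true_delta(x), x)


def fit_delta_lm(x, ref, lam=1e-2, steps=50, h=1e-6):
    """Step 1: one-parameter Levenberg-Marquardt search for the correction dP."""
    def residual(d):
        return toy_model(P0 + d, x) - ref

    delta, r = 0.0, residual(0.0)
    for _ in range(steps):
        if abs(r) < 1e-12:
            break
        jac = (residual(delta + h) - residual(delta - h)) / (2.0 * h)
        step = -jac * r / (jac * jac + lam)  # damped Gauss-Newton step
        r_new = residual(delta + step)
        if abs(r_new) < abs(r):  # accept the step and relax the damping
            delta, r, lam = delta + step, r_new, lam * 0.5
        else:  # reject the step and increase the damping
            lam *= 10.0
    return delta


def gaussian_kernel(a, b, sigma):
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def solve_linear(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda row: abs(m[row][c]))
        m[c], m[piv] = m[piv], m[c]
        for row in range(c + 1, n):
            f = m[row][c] / m[c][c]
            for k in range(c, n + 1):
                m[row][k] -= f * m[c][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def train_krr(xs, ys, sigma, reg):
    """Step 2: kernel ridge regression, alpha = (K + reg * I)^-1 y."""
    n = len(xs)
    k = [[gaussian_kernel(xs[i], xs[j], sigma) + (reg if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve_linear(k, ys)
    return lambda x: sum(a * gaussian_kernel(x, xi, sigma)
                         for a, xi in zip(alpha, xs))


train_x = [[1.0 + 0.1 * i] for i in range(21)]
test_x = [[1.05 + 0.1 * i] for i in range(20)]

train_delta = [fit_delta_lm(x, reference(x)) for x in train_x]    # step 1
predict_delta = train_krr(train_x, train_delta, sigma=0.5, reg=1e-6)  # step 2
test_delta = [predict_delta(x) for x in test_x]                   # step 3


def mean_abs_error(deltas):
    """Step 4: evaluate the model with parameters P0 + dP on the test set."""
    errs = [abs(toy_model(P0 + d, x) - reference(x))
            for x, d in zip(test_x, deltas)]
    return sum(errs) / len(errs)


mae_base = mean_abs_error([0.0] * len(test_x))
mae_ml = mean_abs_error(test_delta)
print(f"MAE with static P0:   {mae_base:.4f}")
print(f"MAE with P0 + ML dP:  {mae_ml:.4f}")
```

The design mirrors the text: the static parameter is never refitted, and the ML model only learns a small structure-dependent correction on top of it. In a realistic setting the descriptor, the kernel width `sigma`, and the regularization `reg` would need to be chosen by validation, and each residual evaluation would be a full semiempirical calculation.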
