# Features – MLatom 2.0


## Tasks Performed by MLatom

A brief overview of *MLatom* capabilities. See sections below for more details.

### Tasks

- Estimating the accuracy of ML models.
- Creating an ML model and saving it to a file.
- Loading an existing ML model from a file and performing ML calculations with it.
- ML-accelerated calculation of absorption spectra within the nuclear ensemble approach.
- Learning curves.

### Data Set Operations

- Converting XYZ coordinates into an input vector (molecular descriptor) for ML.
- Sampling subsets from a data set.

## Sampling

- none: the data set is simply split into the training and test sets and, if necessary, the training set is split into the subtraining and validation sets (in this order), without changing the order of indices.
- random sampling.
- user-defined: requests *MLatom* to read indices for the training, test, and, if necessary, subtraining and validation sets from files.
- structure-based sampling
  - from unsliced and sliced data.
- farthest-point traversal iterative procedure, which starts from the two points farthest apart.
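The farthest-point traversal mentioned above can be illustrated with a minimal NumPy sketch. It is not *MLatom*'s implementation: the function name `farthest_point_sampling` and the choice of Euclidean distance are assumptions made for illustration. The procedure starts from the two points farthest apart and then repeatedly adds the point whose distance to its nearest already-selected point is largest.

```python
import numpy as np

def farthest_point_sampling(X, n_samples):
    """Return n_samples row indices of X chosen by farthest-point traversal.

    Hypothetical helper for illustration; Euclidean distance is assumed.
    """
    X = np.asarray(X, dtype=float)
    # Full pairwise distance matrix (O(N^2) memory, acceptable for a sketch).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Start from the two points that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    # For every point, distance to its nearest already-selected point.
    min_dist = np.minimum(dist[i], dist[j])
    while len(selected) < n_samples:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return selected[:n_samples]

# One-dimensional toy data: the extremes 0 and 10 are picked first,
# then the point 5, which is farthest from both of them.
points = np.array([[0.0], [1.0], [2.0], [10.0], [5.0]])
chosen = farthest_point_sampling(points, 3)
```

For large data sets the full distance matrix becomes expensive; computing distances to the newly selected point only would avoid storing it.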

## ML Algorithm

Kernel ridge regression with the following kernels:

- Gaussian.
- Laplacian.
- exponential.
- Matérn (*details of implementation*).

A permutationally invariant kernel and self-correction are also supported.
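As a rough illustration of kernel ridge regression with two of the kernels listed above, the following NumPy sketch fits and evaluates a model. The function names, the kernel normalization conventions (σ as a length scale, Euclidean distance in the Gaussian kernel, Manhattan distance in the Laplacian kernel) and the toy data are assumptions for illustration and may differ from what *MLatom* uses internally.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # k(a, b) = exp(-|a - b|^2 / (2 sigma^2)); convention assumed here.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def laplacian_kernel(A, B, sigma):
    # k(a, b) = exp(-|a - b|_1 / sigma); convention assumed here.
    d1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-d1 / sigma)

def krr_fit(X, y, kernel, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients."""
    K = kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, kernel, sigma):
    """Prediction is a kernel-weighted sum over the training points."""
    return kernel(X_new, X_train, sigma) @ alpha

# Toy one-dimensional data set.
X = np.linspace(0.0, 1.0, 8)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])

alpha_g = krr_fit(X, y, gaussian_kernel, sigma=0.1, lam=1e-8)
alpha_l = krr_fit(X, y, laplacian_kernel, sigma=0.1, lam=1e-8)
pred_g = krr_predict(X, alpha_g, X, gaussian_kernel, sigma=0.1)
pred_l = krr_predict(X, alpha_l, X, laplacian_kernel, sigma=0.1)
```

With a very small λ the model nearly interpolates the training data; larger values trade training accuracy for smoothness.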

## Hybrid QM/ML Approaches

## Molecular Descriptors

- Coulomb matrix
  - sorted by norms of its rows;
  - unsorted;
  - permuted.
- Normalized inverse internuclear distances (RE descriptor)
  - sorted for user-defined atoms by the sum of their nuclear repulsions to all other atoms;
  - unsorted;
  - permuted.
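The two descriptor families above can be sketched as follows. This is a minimal illustration, not *MLatom* code: the function names are hypothetical, the Coulomb matrix uses the commonly published definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/r_ij), and the RE descriptor is assumed to be the ratio of each internuclear distance in a reference (equilibrium) geometry to the same distance in the current geometry. Units are left to the caller.

```python
import numpy as np

def pairwise_distances(R):
    R = np.asarray(R, dtype=float)
    return np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)

def coulomb_matrix(Z, R, sort_by_row_norm=False):
    """Coulomb matrix; optionally reorder atoms by descending row norm."""
    Z = np.asarray(Z, dtype=float)
    dist = pairwise_distances(R)
    np.fill_diagonal(dist, 1.0)          # placeholder, avoids division by zero
    M = np.outer(Z, Z) / dist
    np.fill_diagonal(M, 0.5 * Z ** 2.4)  # conventional diagonal term
    if sort_by_row_norm:
        order = np.argsort(-np.linalg.norm(M, axis=1), kind="stable")
        M = M[order][:, order]
    return M

def re_descriptor(R, R_eq):
    """Unsorted RE vector: r_eq / r for every unique atom pair (assumed form)."""
    iu = np.triu_indices(len(R), k=1)
    return pairwise_distances(R_eq)[iu] / pairwise_distances(R)[iu]

# Water-like toy geometry (H, O, H); coordinates are illustrative only.
Z = [1, 8, 1]
R_eq = np.array([[0.757, 0.586, 0.0],
                 [0.000, 0.000, 0.0],
                 [-0.757, 0.586, 0.0]])

cm_unsorted = coulomb_matrix(Z, R_eq)
cm_sorted = coulomb_matrix(Z, R_eq, sort_by_row_norm=True)
re_at_eq = re_descriptor(R_eq, R_eq)             # all ones at the reference
re_stretched = re_descriptor(2.0 * R_eq, R_eq)   # all 0.5 when distances double
```

Sorting by row norm moves the heaviest atom (here oxygen) to the first row and column, which makes the matrix independent of the input atom order when row norms are distinct.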

## The KREG model

The **KREG** (**K**ernel-ridge-regression using the **RE** descriptor and the **G**aussian kernel function) model is the default ML method.

## Model Validation

An ML model can be validated (its generalization error can be estimated) in several ways:

- on a hold-out **test** set not used for training. Both the training and test sets can be **sampled** in one of the ways described above;
- by performing N-fold cross-validation. The user can define the number of folds N. If N is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation;
- by performing leave-one-out cross-validation (a special case of N-fold cross-validation).
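The relationship between N-fold and leave-one-out cross-validation can be made concrete with a small index-splitting sketch. The generator name `kfold_indices` and its arguments are hypothetical and only illustrate the idea; with `shuffle=False` it corresponds to "no sampling" and with `shuffle=True` to random sampling.

```python
import numpy as np

def kfold_indices(n_points, n_folds, shuffle=False, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    if not 2 <= n_folds <= n_points:
        raise ValueError("n_folds must be between 2 and n_points")
    idx = np.arange(n_points)
    if shuffle:
        idx = np.random.default_rng(seed).permutation(n_points)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        yield train, test

# 3-fold split of 6 points without reordering.
splits = list(kfold_indices(6, 3))

# Setting the number of folds to the number of points gives leave-one-out.
loo_splits = list(kfold_indices(5, 5))
```

Each data point appears in exactly one test fold, so the errors collected over all folds cover the whole data set once.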

*MLatom* prints out the mean absolute error (MAE), mean signed error (MSE), root-mean-squared error (RMSE), mean values of the reference and estimated values, the largest positive and negative outliers, the correlation coefficient and its squared value R², as well as the coefficients of linear regression and their standard deviations.
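A subset of these statistics can be computed with a few lines of NumPy. The sketch below is illustrative only: the function name, the dictionary keys, and the sign convention (error = estimated minus reference) are assumptions, and the standard deviations of the regression coefficients are omitted for brevity.

```python
import numpy as np

def error_summary(reference, estimated):
    """Return basic error statistics; errors are estimated minus reference."""
    ref = np.asarray(reference, dtype=float)
    est = np.asarray(estimated, dtype=float)
    err = est - ref
    # Linear regression of estimated values against reference values.
    slope, intercept = np.polyfit(ref, est, 1)
    r = np.corrcoef(ref, est)[0, 1]
    return {
        "MAE": float(np.mean(np.abs(err))),
        "MSE": float(np.mean(err)),              # mean signed error
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "mean_reference": float(ref.mean()),
        "mean_estimated": float(est.mean()),
        "largest_positive_outlier": float(err.max()),
        "largest_negative_outlier": float(err.min()),
        "R": float(r),
        "R2": float(r ** 2),
        "slope": float(slope),
        "intercept": float(intercept),
    }

stats = error_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

Note that MSE here denotes the mean signed error, as in the text above, not the mean squared error.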

## Hyperparameter Tuning

The Gaussian, Laplacian, and Matérn kernels have tunable hyperparameters σ and λ. *MLatom* can determine them by performing a user-defined number of iterations of hyperparameter optimization on a logarithmic grid. The user can adjust the number of grid points and the starting and finishing points of the grid. Hyperparameters are tuned to minimize either the mean absolute error or the root-mean-squared error, as chosen by the user. The error being minimized can be:

- the error of the ML model trained on the subtraining set, evaluated on a hold-out **validation** set. Both the subtraining and validation sets are parts of the training set, which can be used at the end with the optimal hyperparameters for training the final ML model. These sets ideally should not overlap and can be **sampled** from the training set in one of the ways described above;
- the N-fold cross-validation error. The user can define the number of folds N. If N is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.

Note that hyperparameter tuning can be performed together with model validation. For example, one can run an outer cross-validation loop for model validation and tune the hyperparameters in an inner cross-validation loop.
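A single pass of a logarithmic grid search over σ and λ, using a hold-out validation set and the RMSE as the target, might look like the sketch below. It is not *MLatom*'s optimizer: the function names, the grid bounds, the Gaussian kernel convention and the toy data are all assumptions for illustration, and the iterative refinement of the grid is left out.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def validation_rmse(X_sub, y_sub, X_val, y_val, sigma, lam):
    """Train KRR on the subtraining set and score it on the validation set."""
    K = gaussian_kernel(X_sub, X_sub, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_sub)), y_sub)
    pred = gaussian_kernel(X_val, X_sub, sigma) @ alpha
    return float(np.sqrt(np.mean((pred - y_val) ** 2)))

def log_grid_search(X_sub, y_sub, X_val, y_val, sigmas, lams):
    """Return (best_rmse, best_sigma, best_lambda) over the given grid."""
    best = (np.inf, None, None)
    for sigma in sigmas:
        for lam in lams:
            rmse = validation_rmse(X_sub, y_sub, X_val, y_val, sigma, lam)
            if rmse < best[0]:
                best = (rmse, float(sigma), float(lam))
    return best

# Toy data: the first 45 points form the subtraining set, the rest validation.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
y = np.sin(X[:, 0])
X_sub, y_sub, X_val, y_val = X[:45], y[:45], X[45:], y[45:]

sigmas = np.logspace(-2, 2, 9, base=2.0)   # 2^-2 ... 2^2, evenly spaced in log2
lams = np.logspace(-8, -2, 4)              # 1e-8 ... 1e-2
best_rmse, best_sigma, best_lam = log_grid_search(
    X_sub, y_sub, X_val, y_val, sigmas, lams)
```

Once the best pair is found, the final model would be retrained on the full training set (subtraining plus validation) with those values.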

Apart from the natively implemented logarithmic grid search for hyperparameters, *MLatom* also provides an interface to the hyperopt package, which implements hyperparameter optimization using Bayesian methods with the Tree-structured Parzen Estimator (TPE).

## First Derivatives

*MLatom* can also be used to estimate first derivatives from an ML model. The following scenarios are possible:

- partial derivatives are calculated for each dimension of the given input vectors (analytical derivatives for the Gaussian and Matérn kernels);
- first derivatives are calculated in XYZ coordinates for input files containing molecular XYZ coordinates (analytical derivatives for the RE and Coulomb matrix descriptors);
- derivatives for interfaced models.
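For the first scenario, the analytical derivative of a Gaussian-kernel KRR prediction with respect to each input dimension follows directly from differentiating the kernel. The sketch below assumes the kernel form exp(-|x - x_i|² / (2σ²)) and uses hypothetical function names; it checks the analytical gradient against central finite differences on random data.

```python
import numpy as np

def kernel_vector(x, X_train, sigma):
    # Gaussian kernel between one input x and every training point.
    d2 = ((x[None, :] - X_train) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def predict(x, X_train, alpha, sigma):
    # f(x) = sum_i alpha_i * k(x, x_i)
    return float(kernel_vector(x, X_train, sigma) @ alpha)

def gradient(x, X_train, alpha, sigma):
    # df/dx_d = sum_i alpha_i * k(x, x_i) * (x_i,d - x_d) / sigma^2
    k = kernel_vector(x, X_train, sigma)
    return ((alpha * k)[:, None] * (X_train - x[None, :])).sum(axis=0) / sigma ** 2

# Random toy model: 10 training points in 3 dimensions.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(10, 3))
alpha = rng.normal(size=10)
x = rng.normal(size=3)
sigma = 1.5

analytic = gradient(x, X_train, alpha, sigma)

# Central finite differences as an independent check.
h = 1e-5
numeric = np.array([
    (predict(x + h * e, X_train, alpha, sigma)
     - predict(x - h * e, X_train, alpha, sigma)) / (2.0 * h)
    for e in np.eye(3)
])
```

Derivatives in XYZ coordinates additionally require the chain rule through the descriptor, which is not shown here.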

## Cross Section

*MLatom* can significantly accelerate the calculation of absorption cross sections with the nuclear ensemble approach (NEA).

In brief, this feature uses fewer quantum chemical (QC) calculations to achieve higher precision at reduced computational cost. More details can be found in this paper (please cite it when using this feature):

Bao-Xin Xue, Mario Barbatti*, Pavlo O. Dral*, Machine Learning for Absorption Cross Sections, *J. Phys. Chem. A* **2020**, *124*, 7199–7210. DOI: 10.1021/acs.jpca.0c05310.

Preprint on ChemRxiv, DOI: 10.26434/chemrxiv.12594191.

## Interfaces to Third-Party Software

*MLatom* also provides interfaces to some third-party software packages in which additional ML model types are natively implemented. This allows users to access other popular ML model types within the *MLatom* workflow. Currently available third-party model types are: