# Manual – MLatom 2.0

Please consult Features for an overview of *MLatom* capabilities. This page provides details on how to use *MLatom* for various types of calculations.

## Installation and usage

### Installation with pip

One way to install MLatom is with pip:

`python3 -m pip install -U MLatom`

You can then run MLatom with the command:

`mlatom [options]`

### Installation from a zipped package

Alternatively, you can download a zipped package with MLatomPy (requires Python 3.7+) and a statically compiled binary of `MLatomF` and `cs.so` for Linux systems. These files can be unpacked in any directory and used directly without modifying environment variables. You may need to make the files executable with the command `chmod +x MLatom.py MLatomF cs.so`.

You can also add your MLatom path to the `$PATH` variable with the command (in bash):

`export PATH=$PATH:/path/to/MLatom`

It is convenient to add this line to your `.bashrc` file.

For instructions on installing the interfaced third-party programs, see below.

To run *MLatom*, provide the path to *MLatom.py* and the necessary command-line options (see the next section), i.e. type in your terminal:

`$pathToMLatom/MLatom.py [command-line options or the name of an input file with options]`

In the following, the notation `mlatom` is used instead of `$pathToMLatom/MLatom.py` (it is useful to set up such an alias in your shell).

## Running MLatom

All options are case insensitive, i.e. you can type either

`mlatom help`

or

`mlatom Help`

with the same result (the command will print available options on your computer screen).

To run *MLatom* you need several input files, as described below. Note that input and output file names are case sensitive! For example, `xyz.dat` and `XYZ.dat` are two different file names.

By default, *MLatom* uses all available threads on your computer. To limit the number of threads to N, use the option `nthreads=N`.

## Getting Help and List of Options for a Current Version

You can ask your installed version of *MLatom* to print an overview of its available options with the command:

`mlatom help`

## Input

MLatom can be run by providing it with an input file. Example:

`mlatom myinputfile.inp`

The file `myinputfile.inp` can look like this (comments follow the # symbol):

`estAccMLmodel # one command on one line`

`# Lines with comments etc.`

`# createMLmodel # Requests creating ML model`

`# MLmodelOut=mlmod_E_FCI_Gaussian_20random.unf # saves the model to file`

`XfileIn=R_451.dat Yfile=E_FCI_451.dat # Several commands on one line`

`sigma=opt # Requests optimizing sigma parameter`

All above options can be given directly to MLatom in a single command:

`mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat sigma=opt`
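The equivalence between an input file and a single command line can be pictured with a short Python sketch. This is not MLatom's own parser: the function name `parse_options` and the choice to store bare keywords as `True` are assumptions made for illustration. It only relies on what is stated above, namely that comments follow `#`, that several options may share a line, and that option names are case insensitive.

```python
def parse_options(text):
    """Collect options from input-file text into a dict keyed by lowercased name."""
    options = {}
    for line in text.splitlines():
        # Everything after '#' is a comment.
        line = line.split("#", 1)[0]
        for token in line.split():
            key, sep, value = token.partition("=")
            # Option names are case insensitive; values such as file names are not.
            options[key.lower()] = value if sep else True
    return options


example = """estAccMLmodel # one command on one line
# createMLmodel # Requests creating ML model
XfileIn=R_451.dat Yfile=E_FCI_451.dat # Several commands on one line
sigma=opt # Requests optimizing sigma parameter
"""
opts = parse_options(example)
print(opts)
```

The commented-out `createMLmodel` line contributes nothing, so the resulting options match those of the one-line command.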

Along with options *MLatom* needs to read various files from disk depending on the task. File names should be specified using the following options:

`XYZfile=[name of file with molecular XYZ coordinates]`

`XfileIn=[name of file with molecular descriptor (ML input) vectors]`

`Yfile=[name of file with reference values]`

`Yb=[file name with the data obtained with the baseline method for Δ-ML]`

`Yt=[file name with the reference data obtained with the target method for Δ-ML]`

`MLmodelIn=[name of file with ML model]`

`iTrainIn=[name of file with indices of training points]`

`iTestIn=[name of file with indices of test points]`

`iSubtrainIn=[name of file with indices of sub-training points]`

`iValidateIn=[name of file with indices of validation points]`

`iCVtestPrefIn=[prefix of names of files with indices for CVtest]`

`iCVoptPrefIn=[prefix of names of files with indices for CVopt]`

If the requested input file does not exist, MLatom will terminate with a request to provide it. This check is not performed for files with indices involved in cross-validation.

File extensions are arbitrary.

It is sometimes useful to use only part of a big data set. This can be requested with the option `Nuse=N`, which means that only the first N entries of the input files will be used.

### File Formats

The `XYZfile` option requires a file with the XYZ coordinates of molecules one after another. For each molecule, the first line specifies the number of atoms, followed by one blank line and then by the Cartesian coordinates of the nuclei, e.g. for three molecules:

    5

    C 0.000 0.000 0.000
    Cl 1.776 0.000 0.000
    H -0.342 1.027 0.000
    H -0.342 -0.513 -0.890
    H -0.342 -0.513 0.890
    5

    C 0.000 0.000 0.000
    Cl 1.776 0.000 0.000
    H -0.343 1.027 0.000
    H -0.342 -0.513 -0.890
    H -0.342 -0.513 0.890
    5

    C 0.000 0.000 0.000
    Cl 1.776 0.000 0.000
    H -0.339 1.028 0.000
    H -0.342 -0.513 -0.890
    H -0.342 -0.513 0.890

Nuclear charges can be used instead of element symbols. Coordinates are given in Å.
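To check that a file follows this layout before passing it to MLatom, a small reader can be sketched in Python. The function name `read_xyz_molecules` is hypothetical and this is not MLatom's own reader; it simply assumes the structure described above (atom count, one line that is skipped, then one line per atom) and keeps the first column as a string so that either element symbols or nuclear charges are accepted.

```python
def read_xyz_molecules(text):
    """Split multi-molecule XYZ text into a list of molecules.

    Each molecule is a list of (label, x, y, z) tuples, with coordinates in Å.
    """
    lines = text.splitlines()
    molecules = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # tolerate stray blank lines between molecules
            continue
        n_atoms = int(lines[i])
        atoms = []
        # lines[i + 1] is the blank line after the atom count and is skipped.
        for line in lines[i + 2 : i + 2 + n_atoms]:
            label, x, y, z = line.split()
            atoms.append((label, float(x), float(y), float(z)))
        if len(atoms) != n_atoms:
            raise ValueError("file ended before all atoms were read")
        molecules.append(atoms)
        i += 2 + n_atoms
    return molecules


sample = "\n".join([
    "2", "", "C 0.000 0.000 0.000", "Cl 1.776 0.000 0.000",
    "2", "", "C 0.000 0.000 0.000", "Cl 1.780 0.000 0.000",
])
molecules = read_xyz_molecules(sample)
print(len(molecules), molecules[1][1])
```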

The `XfileIn` option requires a file with input vectors, one vector per line, e.g.:

1.0093 1.0009 1.0009 1.0080 1.0229 1.0004 1.0009 0.9947 0.9738

The `Yfile`, `Yb`, and `Yt` options each require a file with one reference datum per line, e.g.:

    6.349
    23.852
    60.872

The `MLmodelIn` option requires a file with an ML model generated by *MLatom*, version 1.0 or version 1.1.

Files with indices should contain one index per line.

## Output

*MLatom* prints a summary of its calculations to the standard output, so it is recommended to redirect it to a file, e.g.:

`mlatom help > mlatom.out`

It can also write files to the disk depending on the task. File names should be specified using the following options:

`XfileOut=[name of file to write input vectors to]`

`XYZsortedFileOut=[name of file to write sorted XYZ coordinates]`

`MLmodelOut=[name of file to write ML model to]`

`YestFile=[name of file with values predicted by ML or with corrections predicted by Δ-ML]`

`YestT=[file name with the Δ-ML predictions estimating the target method]`

`YgradEstFile=[name of file to write gradients predicted by ML to]`

`iTrainOut=[name of file to write training point indices to]`

`iTestOut=[name of file to write test point indices to]`

`iSubtrainOut=[name of file to write sub-training point indices to]`

`iValidateOut=[name of file to write validation point indices to]`

If an output file with the same name already exists, *MLatom* will terminate with a request to remove or rename it. This check is not performed for files with indices generated during cross-validation.

File extensions are arbitrary.

The option `XYZsortedFileOut` only works with the options `molDescriptor=RE molDescrType=sorted`. The option `permInvNuclei=[atomic indices separated by '-']` can be provided to specify which atoms to sort.

You can request additional output with the `debug` option. It will, for example, print the regression coefficients α when using an ML model.

## ML Tasks

### estAccMLmodel

You can estimate the accuracy of an ML model, i.e. estimate its *generalization error*, by using the option `estAccMLmodel` with other options:

`mlatom estAccMLmodel [other options]`

For default settings and other mandatory options see the corresponding sections below, specifically the section Model Validation.

Example:

`mlatom estAccMLmodel Yfile=y.dat XYZfile=xyz.dat kernel=Gaussian sigma=opt lambda=opt`

This command requests estimation of the generalization error of an ML model for molecules provided in Cartesian coordinates in the `xyz.dat` file, with reference data in the `y.dat` file. The Gaussian kernel will be used and the hyperparameters σ and λ will be optimized.

### createMLmodel

To create an ML model and save it to a file on disk, use the option `createMLmodel`:

`mlatom createMLmodel [other options]`

For both `estAccMLmodel` and `createMLmodel`, the additional input option `Yfile` should be used (see Section Input).

Example:

`mlatom createMLmodel Yfile=y.dat XYZfile=xyz.dat MLmodelOut=mlmod.unf kernel=Gaussian sigma=opt lambda=opt`

This command requests creating an ML model for molecules provided in Cartesian coordinates in the `xyz.dat` file, with reference data in the `y.dat` file, and saving it to the `mlmod.unf` file. The Gaussian kernel will be used and the hyperparameters σ and λ will be optimized.

### useMLmodel

An existing ML model can be loaded from a file and used for ML calculations with the option `useMLmodel`:

`mlatom useMLmodel [other options]`

For `useMLmodel`, the additional input option `MLmodelIn` should be used (see Section Input).

Example:

`mlatom useMLmodel MLmodelIn=mlmod.unf XYZfile=xyz.dat YestFile=yest.dat`

This command requests making predictions with an ML model read from the `mlmod.unf` file for molecules provided in Cartesian coordinates in the `xyz.dat` file, and saving the predicted values in the `yest.dat` file. The program will output a summary of the loaded model, such as the kernel and the values of the hyperparameters used to create it.

### deltaLearn

All of the above ML operations can also be performed within the Δ-ML approach, requested with `deltaLearn`. The baseline values should be provided using the `Yb` option. The target values for training should be provided with the `Yt` option. The Δ-ML predictions estimating the target method can be saved to the file specified with `YestT`, and the corrections themselves to the file specified with `YestFile`.
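The relation between the four Δ-ML files can be illustrated with a few lines of Python. This is only a sketch of the standard Δ-ML arithmetic, not MLatom code: the function names are hypothetical, and the "predicted" corrections are taken to be the exact differences so that the bookkeeping is easy to follow.

```python
def delta_training_values(y_target, y_baseline):
    """Differences target - baseline, i.e. the quantity the ML model is trained on."""
    return [t - b for t, b in zip(y_target, y_baseline)]


def estimate_target(y_baseline, corrections):
    """Baseline values plus ML corrections give the estimate of the target method."""
    return [b + c for b, c in zip(y_baseline, corrections)]


y_b = [10.0, 20.0, 30.0]    # contents of the Yb file (baseline method)
y_t = [10.5, 20.25, 29.5]   # contents of the Yt file (target method)

corrections = delta_training_values(y_t, y_b)   # what YestFile would approximate
y_est_t = estimate_target(y_b, corrections)     # what YestT would approximate
print(corrections, y_est_t)
```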

### selfCorrect

Self-correction can be requested with the option `selfCorrect`. Currently it works only with four layers, and the file with reference values should be named `y.dat`.

### learningCurve

The learning curve task automatically trains models, estimates their accuracy, and writes a summary file in comma-separated values (CSV) format. Use the `learningCurve` option to perform this task. The options for this task are the same as those for `estAccMLmodel`, except that `Ntrain` is replaced by `lcNtrains` and there is an extra option, `lcNrepeats`. As their names suggest, these two options define the training set sizes and the number of repeats.

Example:

`mlatom learningCurve Yfile=y.dat XYZfile=xyz.dat kernel=Gaussian sigma=opt lambda=opt lcNtrains=100,250,500,1000,2500,5000,10000 lcNrepeats=64,32,16,8,4,2,1`

With this command, the training set sizes listed in `lcNtrains` will be tested repeatedly 64, 32, 16, 8, 4, 2, and 1 time(s), respectively. All generated data (including CSV reports) will be stored in the folder *learningCurve* under the current directory.
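The pairing of the two comma-separated lists can be made explicit with a small Python sketch. The helper `learning_curve_schedule` is hypothetical and only mirrors the "respectively" pairing described above; it is not part of MLatom.

```python
def learning_curve_schedule(lc_ntrains, lc_nrepeats):
    """Pair each training set size with its number of repeats."""
    sizes = [int(s) for s in lc_ntrains.split(",")]
    repeats = [int(r) for r in lc_nrepeats.split(",")]
    if len(sizes) != len(repeats):
        raise ValueError("lcNtrains and lcNrepeats must have the same length")
    return list(zip(sizes, repeats))


schedule = learning_curve_schedule(
    "100,250,500,1000,2500,5000,10000", "64,32,16,8,4,2,1"
)
# Total number of models trained over the whole learning curve.
total_models = sum(n_repeats for _, n_repeats in schedule)
print(schedule[0], schedule[-1], total_models)
```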

### crossSection

MLatom can accelerate the calculation of absorption cross sections with the nuclear ensemble approach (NEA).

**Newton-X and Gaussian should be available.**

**Requirements for ML-NEA**

To run ML-NEA calculations of absorption cross sections, you also need to set up the environment:

- Install Newton-X (NX).
- Use `export NX=/path/to/Newton-X` to define the `$NX` environment variable.
- Install matplotlib with the command `python3 -m pip install matplotlib`.
- Have Gaussian installed.

**usage**: `MLatom.py cross-section [optional arguments]`

**optional arguments:**

- `Nexcitations=N`: number of excited states to calculate (default=3).
- `nQMpoints=N`: user-defined number of QM calculations for training ML (default=0, the number of QM calculations will be determined iteratively).
- `plotQCNEA`: requests plotting the QC-NEA cross section.
- `deltaQCNEA=float`: defines the broadening parameter of the QC-NEA cross section.
- `plotQCSPC`: requests plotting the cross section obtained via single-point convolution.

**advanced arguments (not recommended to modify):**

- `nMaxPoints=N`: maximum number of QC calculations in the iterative procedure (default=10000).
- `MLpoints=N`: number of ML calculations (default=50000).

**environment variables**

- `$NX`: Newton-X environment.
- Environment for calculations with the Gaussian program package.

In bash, you can for example use the following command (provide the correct path to the Newton-X bin directory):

`export NX=/home/users/bxxue/NX/bin`

**required files:**

- mandatory files
  - `gaussian_optfreq.com`: input file for Gaussian opt and freq calculations. Alternatively, the files `eq.xyz` (XYZ file with the equilibrium, i.e. optimized, geometry) and `nea_geoms.xyz` (file with all geometries in the nuclear ensemble) can be provided.
  - `gaussian_ef.com`: template file for calculating excitation energies and oscillator strengths with Gaussian.
- optional files
  - `cross-section_ref.dat`: reference cross section file in a format similar to that of Newton-X (1st column: DE/eV; 2nd column: lambda/nm; 3rd column: sigma/A^2).
  - `eq.xyz`: file with the optimized geometry (has to be used together with `nea_geoms.xyz`).
  - `nea_geoms.xyz`: file with all geometries in the nuclear ensemble (has to be used together with `eq.xyz`).
  - `E1.dat E2.dat ...` and `f1.dat f2.dat ...`: files that store the excitation energies and oscillator strengths, one per line, corresponding to the geometries in `nea_geoms.xyz`.

**output files:**

- `cross-section/cross-section_ml-nea.dat`: cross-section spectrum calculated with the ML-NEA method.
- `cross-section/cross-section_qc-nea.dat`: cross-section spectrum calculated with the QC-NEA method.
- `cross-section/cross-section_spc.dat`: cross-section spectrum calculated with single-point convolution.
- `cross-section/plot.png`: plot of the cross sections calculated with the different methods.

## Data Set Tasks

### XYZ2X

Converting XYZ coordinates into an input vector (molecular descriptor) for ML.

You can use the `XYZ2X` option to convert the XYZ coordinates of a series of molecules, provided in the file given by the option `XYZfile=[filename]`, to molecular descriptor (input) vectors for ML calculations, saved in the file given by the option `XfileOut=[filename]`. Other options can be given as in `estAccMLmodel`.

Example:

`mlatom XYZ2X XYZfile=xyz.dat XfileOut=x.dat`

Given a data set of molecules either in XYZ format or in molecular descriptor form, you can sample subsets of it (e.g. the training and test sets) by using the `sample` option:

`mlatom sample [other options]`

Basically, one can use this option to generate indices of the training, test, sub-training, and validation sets without performing ML calculations. Thus, other options used for Model Validation and Hyperparameter Tuning are applicable.

### analyze

This task requires reference data and estimated data to give a statistical report.

For reference data, at least one of the arguments below is required:

- `Yfile=S`
- `YgradXYZfile=S`

For estimated data, correspondingly:

- `YestFile=S`
- `YgradXYZestFile=S`

Example:

`MLatom.py analyze Yfile=en.dat YestFile=enest.dat`

### sample

You can specify the type of sampling into the training and other sets using the option `sampling=[type of sampling]`. Available types of sampling are `none`, `random`, `user-defined`, `structure-based`, and `farthest-point`:

- `random`: default. Simple random sampling.
- `user-defined`: requests *MLatom* to read indices for the training, test, and, if necessary, sub-training and validation sets from the files defined by the options `iTrainIn`, `iTestIn`, `iSubtrainIn`, `iValidateIn`. The corresponding options `Ntrain`, `Ntest`, `Nsubtrain`, and `Nvalidate` can be used as well. Cross-validation parts can be read in from files with names starting with the prefixes specified by the options `iCVtestPrefIn` and `iCVoptPrefIn`.
- `structure-based`: performs structure-based sampling. Only works with `molDescriptor=RE`.
- `farthest-point`: farthest-point traversal iterative procedure, which starts from the two points farthest apart.
- `none`: simply splits the data set into the training and test sets and, if necessary, the training set into the sub-training and validation sets (in this order) without changing the order of indices.

### Structure-based Sampling from Sliced Data

Options for sorting geometries by the Euclidean distance of their corresponding ML input vector to the input vector of the equilibrium geometry and slicing the ordered data set into requested number of regions of the same size:

- `slice`: slice the data set
- `nslices=[number of slices]` [default = 3]
- `XfileIn=[file with input vectors X]`
- `eqXfileIn=[file S with input vector for the equilibrium]`

These options create the files `xordered.dat` (input vectors sorted by distance), `indices_ordered.dat` (indices of the ordered data set with respect to the original data set), and `distances_ordered.dat` (list of Euclidean distances of the ordered data points to the equilibrium). They also create the directories `slice1`, `slice2`, etc. Each of them contains three files, `x.dat`, `slice_indices.dat`, and `slice_distances.dat`, that are slices of the corresponding files of the entire data set.
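The ordering and slicing step can be sketched in a few lines of Python. This is an illustration of the idea rather than MLatom's implementation: the function name `order_and_slice` is hypothetical, indices are 0-based here, and putting any leftover points into the last slice is an assumption for the case where the data set size is not divisible by the number of slices.

```python
import math


def order_and_slice(x_vectors, x_eq, nslices=3):
    """Order points by Euclidean distance to x_eq and cut the order into slices."""

    def distance(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_eq)))

    # Indices of the original data set, closest to the equilibrium first.
    order = sorted(range(len(x_vectors)), key=lambda i: distance(x_vectors[i]))
    size = len(order) // nslices
    slices = [order[k * size:(k + 1) * size] for k in range(nslices - 1)]
    slices.append(order[(nslices - 1) * size:])  # assumption: remainder goes last
    return order, slices


x_eq = [1.0, 1.0]
x_vectors = [[1.0, 1.3], [1.0, 1.0], [1.2, 1.0], [1.0, 1.1], [1.5, 1.0], [1.0, 1.4]]
order, slices = order_and_slice(x_vectors, x_eq, nslices=3)
print(order, slices)
```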

Options to perform structure-based sampling to sample the desired number of data from each slice:

- `sampleFromSlices`: sample from each slice
- `nslices=[number of slices]` [default = 3]
- `Ntrain=[total integer number N of training points from all slices]`

This command creates `itrain.dat` files with training set indices in each `slice[1-...]` directory. Note: it is possible to modify the `sliceData.py` script to submit the jobs in parallel to the queue.

To merge sampled indices from all slices into indices files for the training, test, sub-training, and validation sets using the same order of data points as in original data set:

- `mergeSlices`: merges indices from slices [see sliceData help]
- `nslices=[number of slices]` [default = 3]
- `Ntrain=[total integer number N of training points from all slices]`

This command creates four files with indices: `itrain.dat` (with 4480 points for training), `isubtrain.dat` (with 80% of training points also chosen using structure-based sampling), `itest.dat`, and `ivalidate.dat`.

## ML Algorithm

### KRR

You can use the following options for performing kernel ridge regression calculations:

- `lambda=R`: sets the regularization parameter λ to a floating-point number `R`. The default value is 0.0. You can request optimization of this parameter with `lambda=opt`; see below for more options related to hyperparameter tuning.
- `kernel=[type of kernel]`: requests using one of the available types of kernel, whose names are self-explanatory:
  - `kernel=Gaussian` (set by default)
  - `kernel=Laplacian`
  - `kernel=exponential`
  - `kernel=Matern`

The kernel width σ is a parameter that can also be changed by the user with the following option:

- `sigma=R`: sets σ to a floating-point number `R`. You can request optimization of this parameter with `sigma=opt`; see below for more options related to hyperparameter tuning. Default values are different for different kernels:
  - `sigma=100.0` for the Gaussian and Matérn kernels
  - `sigma=800.0` for the Laplacian and exponential kernels

In the case of the Matérn kernel, there is an additional integer parameter n, which is set to 2 by default and can be changed to an integer number `N` using the option `nn=N`.

Permutation of atomic indices (especially of the same element) should not change predictions made by an ML model. This can be achieved by using a permutationally invariant kernel (preferred) or by sorting the indices of atoms in some unique way (described below in Section Molecular Descriptors). Calculations with a permutationally invariant kernel can be requested by using the option `permInvKernel` and:

- by providing a file with molecular geometries in XYZ format and specifying the atoms to permute using the options `permInvNuclei=[atomic indices separated by '-'] molDescrType=permuted`.
- by providing a file with input vectors and specifying the number of permutations using the option `Nperm=[number of permutations]`. Each line of the input vector file must contain the input vectors with molecular descriptors concatenated for all atomic permutations of a single geometry.

### Molecular Descriptors

`molDescriptor=[molecular descriptor]`: requests using one of the available molecular descriptors:

- `molDescriptor=CM`: requests using the Coulomb matrix.
- `molDescriptor=RE`: requests using the RE descriptor (normalized inverted internuclear distances; default). It is a vector {r^{eq}/r}, where r is an internuclear distance in the current molecule and r^{eq} is the corresponding internuclear distance in the equilibrium (or other reference) structure. The equilibrium structure should be provided in a file named `eq.xyz` in XYZ format.
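The definition of the RE descriptor translates directly into code. The following Python sketch computes r^{eq}/r for every pair of atoms; the function names are hypothetical, and the order in which atom pairs appear in the vector (pairs (i, j) with i < j) is an assumption, so the element order may differ from the vectors written by MLatom.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def re_descriptor(coords, eq_coords):
    """Return the vector of r_eq / r over all atom pairs (i < j)."""
    pairs = combinations(range(len(coords)), 2)
    return [
        distance(eq_coords[i], eq_coords[j]) / distance(coords[i], coords[j])
        for i, j in pairs
    ]


eq_geometry = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

x_eq = re_descriptor(eq_geometry, eq_geometry)   # all elements equal 1
x_stretched = re_descriptor(stretched, eq_geometry)
print(x_eq, x_stretched)
```

At the reference geometry every element equals 1, and an element drops below 1 when the corresponding distance is stretched.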

Variants of these descriptors can be requested with the option `molDescrType=[type]`:

- `molDescrType=unsorted`: uses the same order of atoms as in the input file with XYZ coordinates of molecules. Default for `molDescriptor=RE`.
- `molDescrType=sorted`: sorts atoms and ensures permutation invariance on the input vector level (especially useful for sorting; when possible, a permutationally invariant kernel should be preferred for ML calculations):
  - sorts the Coulomb matrix by the norms of its rows for `molDescriptor=CM` (default for the Coulomb matrix).
  - sorts atoms by the sum of their nuclear repulsions to all other atoms for `molDescriptor=RE`. Atoms to sort can be specified with the option `permInvNuclei=[atomic indices separated by '-']`. If the option `permInvNuclei` is not used, all atoms are sorted.
- `molDescrType=permuted`: generates multiple XYZ structures of a single geometry by permuting the atoms specified with the option `permInvNuclei=[atomic indices separated by '-']`, converts each of them to a molecular descriptor, and concatenates the latter into a single input vector. This option is necessary to run calculations with a permutationally invariant kernel.

### KREG model

The KREG model is the default ML model of MLatom. It is the KRR algorithm with the Gaussian kernel function and the RE molecular descriptor.

### Model Validation

ML model can be validated (generalization error can be estimated) in several ways:

- on a hold-out **test** set not used for training. Both the training and test sets can be **sampled** in one of the ways described above. The number of points in the training and test sets is set by the options `Ntrain=R` and `Ntest=R`, respectively. If `R` is an integer larger than or equal to 1, this number of points is sampled from the data set. If `R` is a floating-point number less than 1.0, it defines the fraction of the data set points to sample. By default, 80% of the data set points are used as the training set and the remaining 20% as the test set;
- by performing N-fold cross-validation. You can request this procedure with the option `CVtest` and define the number of folds N with the option `NcvTestFolds=N`. By default, 5-fold cross-validation is used. If N is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation;
- by performing leave-one-out cross-validation. You can request this procedure with the option `LOOtest`. Only random or no sampling can be used.
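The two meanings of `R` in `Ntrain=R` and `Ntest=R` can be illustrated with a short Python sketch of a random hold-out split. It is not MLatom code: the function names are hypothetical, indices are 0-based, and rounding a fractional size to the nearest integer is an assumption.

```python
import random


def resolve_set_size(r, n_total):
    """Turn an Ntrain/Ntest-style value R into a number of points."""
    if r >= 1:
        return int(r)                      # R >= 1: absolute number of points
    if 0 < r < 1:
        return int(round(r * n_total))     # R < 1: fraction (rounding is assumed)
    raise ValueError("R must be positive")


def hold_out_split(n_total, ntrain=0.8, ntest=0.2, seed=0):
    """Randomly sample disjoint training and test indices from n_total points."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    n_train = resolve_set_size(ntrain, n_total)
    n_test = resolve_set_size(ntest, n_total)
    if n_train + n_test > n_total:
        raise ValueError("training and test sets do not fit into the data set")
    return indices[:n_train], indices[n_train:n_train + n_test]


train, test = hold_out_split(100)   # default 80% / 20% split
print(len(train), len(test))
```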

### Hyperparameter Tuning

The Gaussian, Laplacian, exponential, and Matérn kernels have the tunable hyperparameters σ and λ. Their optimization can be requested with the options `sigma=opt` and `lambda=opt`, respectively.

*MLatom* can tune hyperparameters to minimize either the mean absolute error or the root-mean-square error, as selected with the option `minimizeError=MAE` or `minimizeError=RMSE` (default), respectively. Hyperparameters can be tuned to minimize

- the error of the ML model trained on the **sub-training** set in a hold-out **validation** set. Both the sub-training and validation sets can be **sampled** from the training set in one of the ways described above. The number of points in the sub-training and validation sets is set by the options `Nsubtrain=R` and `Nvalidate=R`, respectively. If `R` is an integer larger than or equal to 1, this number of points is sampled from the training set. If `R` is a floating-point number less than 1.0, it defines the fraction of the training set points to sample. By default, 80% of the training set points are used as the sub-training set and the remaining 20% as the validation set;
- the N-fold cross-validation error. You can request this procedure with the option `CVopt` and define the number of folds N with the option `NcvOptFolds=N`. By default, 5-fold cross-validation is used. If N is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation;
- the leave-one-out cross-validation error. You can request this procedure with the option `LOOopt`. Only random or no sampling can be used.

*MLatom* searches for optimal parameters on a logarithmic grid. After the best parameters are found in the first iteration, *MLatom* can perform more iterations of the logarithmic grid search. The number of iterations is controlled by the `lgOptDepth=N` keyword, with `N`=3 by default. You can adjust the number of grid points and the starting and finishing points on the grid with the following options for the

- λ hyperparameter:
  - `NlgLambda=N`: defines the number of points on the logarithmic grid (base 2). By default 6 points are used.
  - `lgLambdaL=R`: lowest value of log_{2}λ for a logarithmic grid optimization of lambda. Default value is -35.0.
  - `lgLambdaH=R`: highest value of log_{2}λ for a logarithmic grid optimization of lambda. Default value is -6.0.
- σ hyperparameter:
  - `NlgSigma=N`: defines the number of points on the logarithmic grid (base 2). By default 6 points are used.
  - `lgSigmaL=R`: lowest value of log_{2}σ for a logarithmic grid optimization of sigma. Default value is 2.0 for the Gaussian and Matérn kernels, 5.0 for the Laplacian and exponential kernels.
  - `lgSigmaH=R`: highest value of log_{2}σ for a logarithmic grid optimization of sigma. Default value is 9.0 for the Gaussian and Matérn kernels, 12.0 for the Laplacian and exponential kernels.
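To get a feeling for the values covered by the default grids, a base-2 logarithmic grid can be generated with a few lines of Python. The helper `log2_grid` is hypothetical, and it assumes that the grid points are evenly spaced in log_{2} between the lowest and highest values with both end points included; how MLatom places the points and refines the grid in later iterations is not shown here.

```python
def log2_grid(lowest, highest, n_points):
    """Values 2**e for n_points exponents evenly spaced from lowest to highest."""
    if n_points < 2:
        raise ValueError("at least two grid points are needed")
    span = highest - lowest
    return [2.0 ** (lowest + span * k / (n_points - 1)) for k in range(n_points)]


# Defaults quoted above for the Gaussian kernel: 6 points each.
sigma_grid = log2_grid(2.0, 9.0, 6)       # sigma from 2**2 to 2**9
lambda_grid = log2_grid(-35.0, -6.0, 6)   # lambda from 2**-35 to 2**-6
print(sigma_grid)
print(lambda_grid)
```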

Another approach to hyperparameter tuning in *MLatom* is the *hyperopt* interface (https://github.com/hyperopt/hyperopt). *hyperopt* is a package that provides a general solution to optimization problems. To trigger this approach in *MLatom*, simply substitute the numeric value(s) you want to optimize with the function-like expression `hyperopt.xxx()`, where several options are available for `xxx`:

- `hyperopt.uniform(lb,ub)`: linear search space from the lower bound `lb` to the upper bound `ub`.
- `hyperopt.loguniform(lb,ub)`: logarithmic search space, base 2.
- `hyperopt.quniform(lb,ub,q)`: discrete linear space, rounded by `q`.

The maximum number of attempts is defined by the option `hyperopt.max_evals=N`, and the type of optimization loss is determined by `hyperopt.losstype=S`, where `S` can be `geomean` (default) or `weighted`. If the latter is chosen, the weight for gradients can be defined by `hyperopt.w_grad`.

To enable this approach, install *hyperopt* with `pip install hyperopt`.

### First Derivatives

*MLatom* can also be used to calculate first derivatives, given a file with an existing ML model. To request such calculations, simply add to the options used with the `useMLmodel` option the additional output option `YgradEstFile=[name of a file to save gradients in]` or `YgradXYZestFile=[name of a file to save XYZ gradients in]`.

Example:

`mlatom useMLmodel MLmodelIn=mlmod.unf XYZfile=xyz.dat YgradXYZestFile=ygradest.dat`

This command requests making predictions with an ML model read from the `mlmod.unf` file for molecules provided in Cartesian coordinates in the `xyz.dat` file, and saving the predicted gradients in the `ygradest.dat` file.

## Interfaces to third-party programs

*MLatom* also provides interfaces to some third-party software.

To use a third-party program, `MLprog` should be set to that program.

Currently implemented programs and the default choices when `MLmodelType` or `MLprog` is defined:

**Supported ML model types and default programs:**

| MLmodelType | default MLprog |
|-------------|----------------|
| KREG        | MLatomF        |
| sGDML       | sGDML          |
| GAP-SOAP    | GAP            |
| PhysNet     | PhysNet        |
| DeepPot-SE  | DeePMD-kit     |
| ANI         | TorchANI       |

**Supported interfaces with default and tested ML model types:**

| MLprog     | MLmodelType                                |
|------------|--------------------------------------------|
| MLatomF    | KREG [default]; see `MLatom.py KRR help`   |
| sGDML      | sGDML [default], GDML                      |
| GAP        | GAP-SOAP                                   |
| PhysNet    | PhysNet                                    |
| DeePMD-kit | DeepPot-SE [default], DPMD                 |
| TorchANI   | ANI [default]                              |

### DeePMD-kit

**Installation**

1. Download the installer for *DeePMD-kit* from GitHub, `https://github.com/deepmodeling/deepmd-kit/releases` (tested with v1.2.2).

2. Run the installer.

3. Add the environment variable `$DeePMDkit` pointing to the directory where the dp binary is located (`bin/` in your installation directory), e.g. `export DeePMDkit=/export/home/fcge/deepmd-kit-1.2/bin`

**usage**

Set `MLprog=DeePMD-kit` to enable the interface.

**options**

Expressions like `deepmd.xxx.xxx=X` specify arguments for DeePMD-kit, following the structure of DeePMD-kit's json input file.

For example:

`deepmd.training.stop_batch=N`

is an equivalent of

` { `

` ... `

` "training": { `

` ... `

` "stop_batch": N `

` ... `

` } `

` ... `

` } `

in DeePMD-kit's json input.

In addition, the option `deepmd.input=S` takes an input json file `S` as a template. The final input file will be generated based on it together with the `deepmd.xxx.xxx=X` options (if any). Check the default template file `bin/interfaces/DeePMDkit/template.json` for the default values.
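The mapping from dotted options to the nested json structure can be sketched in Python. This illustrates the idea only and is not the code used by the interface: the function name `apply_dotted_options` is hypothetical, and the small template below is a made-up stand-in for a real template file.

```python
import json


def apply_dotted_options(template, options, prefix="deepmd."):
    """Return a copy of template with deepmd.a.b=X options written to config[a][b]."""
    config = json.loads(json.dumps(template))   # deep copy of the template
    for key, value in options.items():
        if not key.startswith(prefix):
            continue                             # not a DeePMD-kit option
        path = key[len(prefix):].split(".")
        if path == ["input"]:
            continue                             # names the template file itself
        node = config
        for name in path[:-1]:
            node = node.setdefault(name, {})     # create missing levels
        node[path[-1]] = value
    return config


# Illustrative template, standing in for the contents of a json template file.
template = {"training": {"stop_batch": 1000, "batch_size": 4}}
options = {"deepmd.training.stop_batch": 4000000, "nthreads": 4}

config = apply_dotted_options(template, options)
print(json.dumps(config, indent=2))
```

Only the entry addressed by the dotted option is changed; all other values are inherited from the template.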

### GAP and QUIP

**Installation**

1. Compile *QUIP* and *GAP* from source.

1.1 Install prerequisites (for systems that use apt; do the equivalent for your OS):

`sudo apt-get install gcc gfortran python python-pip libblas-dev liblapack-dev`

`pip install numpy ase f90wrap`

1.2 Get the source code of *QUIP* and *GAP*:

`git clone --recursive https://github.com/libAtoms/QUIP.git`

Get the source code of *GAP* from http://www.libatoms.org/gap/gap_download.html (form-filling required).

Then put the source code in `QUIP/src/`.

1.3 compile

`cd QUIP`

`export QUIP_ARCH=linux_x86_64_gfortran_openmp # enable multi-threading, use 'export QUIP_ARCH=linux_x86_64_gfortran' if no OpenMP thus no MT capability`

`export QUIPPY_INSTALL_OPTS=--user # omit for a system-wide installation`

`make config`

Enter `Y` for gap or edit `build/linux_x86_64_gfortran/Makefile.inc` to set `HAVE_GAP=1`, then run:

`make`

The built binaries are `QUIP/build/linux_x86_64_gfortran/quip` and `QUIP/build/linux_x86_64_gfortran/gap_fit`.

2. Add the environment variables `$quip` and `$gap_fit` for *quip* and *gap_fit*, e.g.:

`export quip='/export/home/fcge/GAP-SOAP/QUIP/build/linux_x86_64_gfortran_openmp/quip'`

`export gap_fit='/export/home/fcge/GAP-SOAP/QUIP/build/linux_x86_64_gfortran_openmp/gap_fit'`

Visit https://libatoms.github.io/GAP/index.html for more info.

**usage**

Set `MLprog=GAP` to enable the interface.

**options**

- `gapfit.xxx=x`: xxx can be any option for gap_fit (e.g. `default_sigma`). Note that there is no need to set `at_file` and `gp_file`.
- `gapfit.gap.xxx=x`: xxx can be any option for gap.

- `gapfit.default_sigma={0.0005,0.001,0,0}`: hyperparameter sigmas for energies, forces, virials and hessians
- `gapfit.e0_method=average`: method for determining e0
- `gapfit.gap.type=soap`: descriptor type
- `gapfit.gap.l_max=6`: max number of angular basis functions
- `gapfit.gap.n_max=6`: max number of radial basis functions
- `gapfit.gap.atom_sigma=0.5`: hyperparameter for Gaussian smearing of atom density
- `gapfit.gap.zeta=4`: hyperparameter for kernel sensitivity
- `gapfit.gap.cutoff=6.0`: cutoff radius of local environment
- `gapfit.gap.cutoff_transition_width=0.5`: cutoff transition width
- `gapfit.gap.delta=1`: hyperparameter delta for kernel scaling

### TorchANI

**Installation**

1. Install *NumPy* and the nightly version of *PyTorch*:

`pip install numpy`

`pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu100/torch_nightly.html`

2. Install *TorchANI*:

`pip install torchani`

Visit https://aiqm.github.io/torchani/ for more info.

**usage**

Set `MLprog=TorchANI` to enable the interface.

**options**

Arguments with their default values:

- `ani.batch_size=8`: batch size
- `ani.max_epochs=10000000`: max epochs
- `ani.early_stopping_learning_rate=0.00001`: learning rate that triggers early stopping
- `ani.force_coefficient=0.1`: weight for force
- `ani.Rcr=5.2`: radial cutoff radius
- `ani.Rca=3.5`: angular cutoff radius
- `ani.EtaR=1.6`: radial smoothness in radial part
- `ani.ShfR=0.9,1.16875,1.4375,1.70625,1.975,2.24375,2.5125,2.78125,3.05,3.31875,3.5875,3.85625,4.125,4.39375,4.6625,4.93125`: radial shifts in radial part
- `ani.Zeta=32`: angular smoothness
- `ani.ShfZ=0.19634954,0.58904862,0.9817477,1.3744468,1.7671459,2.1598449,2.552544,2.9452431`: angular shifts
- `ani.EtaA=8`: radial smoothness in angular part
- `ani.ShfA=0.9,1.55,2.2,2.85`: radial shifts in angular part
- `ani.Neuron_l1=160`: number of neurons in layer 1
- `ani.Neuron_l2=128`: number of neurons in layer 2
- `ani.Neuron_l3=96`: number of neurons in layer 3
- `ani.AF1='CELU'`: activation function for layer 1
- `ani.AF2='CELU'`: activation function for layer 2
- `ani.AF3='CELU'`: activation function for layer 3

### PhysNet

**Installation**

1. Clone from *PhysNet*'s GitHub page:

`git clone https://github.com/MMunibas/PhysNet.git`

2. install *TensorFlow*:

`pip install tensorflow`

3. If you use *TensorFlow v2*, you need to execute the command below in *PhysNet*'s directory to make the scripts compatible with *TFv2*:

`for i in `find . -name '*.py'`; do sed -i -e 's/import tensorflow as tf/import tensorflow.compat.v1 as tf\ntf.disable_v2_behavior()/g' -e 's/import tensorflow as tf/import tensorflow.compat.v1 as tf\ntf.disable_v2_behavior()/g' $i; done`

4. Add the environment variable `$PhysNet` pointing to the *PhysNet* directory, e.g. `export PhysNet=/export/home/fcge/PhysNet/`

**usage**

Set `MLprog=PhysNet` to enable the interface.

**options**

Arguments with their default values:

- `physnet.num_features=128`: number of input features
- `physnet.num_basis=64`: number of radial basis functions
- `physnet.num_blocks=5`: number of stacked modular building blocks
- `physnet.num_residual_atomic=2`: number of residual blocks for atom-wise refinements
- `physnet.num_residual_interaction=3`: number of residual blocks for refinements of proto-message
- `physnet.num_residual_output=1`: number of residual blocks in output blocks
- `physnet.cutoff=10.0`: cutoff radius for interactions in the neural network
- `physnet.seed=42`: random seed
- `physnet.learning_rate=0.0008`: starting learning rate
- `physnet.decay_steps=10000000`: decay steps
- `physnet.decay_rate=0.1`: decay rate for learning rate
- `physnet.batch_size=12`: training batch size
- `physnet.valid_batch_size=2`: validation batch size
- `physnet.force_weight=52.91772105638412`: weight for force
- `physnet.summary_interval=5`: interval for summary
- `physnet.validation_interval=5`: interval for validation
- `physnet.save_interval=10`: interval for model saving

### sGDML

**Installation**

1. Install *sGDML*:

`pip install sgdml`

2. Add the path of the sGDML binary to the environment variable `$sGDML`, e.g. `export sGDML=/export/home/fcge/.linuxbrew/bin/sgdml`

Visit http://quantum-machine.org/gdml/doc/ for more info.

**usage**

Set `MLprog=sGDML` to enable the interface.

**options**

Arguments with their default values:

- `sgdml.gdml=False`: use GDML instead of sGDML
- `sgdml.cprsn=False`: compress kernel matrix along symmetric degrees of freedom
- `sgdml.no_E=False`: do not predict energies
- `sgdml.E_cstr=False`: include the energy constraints in the kernel
- `sgdml.s=<s1>[,<s2>[,...]] or <start>:[<step>:]<stop>`: set hyperparameter sigma; see `sgdml create -h` for details.

## Support

If you want to collaborate, have some suggestions for improving the program, or want to report a bug, please write to me.