# Manual

**NOTE! See the new manuals and tutorials for MLatom in English and Chinese (this page only contains the legacy documentation which will be slowly migrated to the new documentation).**

A brief overview of the capabilities of MLatom 3 is given in the release broadcast video:

## Python API

NEW! The manual and tutorials for the MLatom Python API are available in English and Chinese.

## Command-line options not transferred yet to the new docs

Below is an overview of command-line (input file) options of MLatom not transferred yet to the new docs.

## Learning

- training generic ML models (kernel ridge regression with many kernels)
- Δ-learning
- self-correction

## Data

- converting XYZ coordinates to molecular descriptor (RE, Coulomb matrix, …)
- analyzing data sets
- sampling (random, structure-based, farthest-point) and splitting datasets

## Simulations

### Quantum dynamics with machine learning

MLatom can perform quantum dissipative dynamics with a range of machine-learning methods via an interface to the MLQD program. Supported methods, as listed on the program's website:

- **Kernel ridge regression (KRR)-based recursive (iterative) quantum dissipative dynamics method.** Corresponding article: *Speeding up quantum dissipative dynamics of open systems with kernel methods*. In a more recent comparative study, the KRR method outperformed NN models: *A comparative study of different machine learning methods for dissipative quantum dynamics*.
- **AIQD non-recursive (non-iterative) approach.** Corresponding article: *Predicting the future of excitation energy transfer in light-harvesting complex with artificial intelligence-based quantum dynamics*.
- **The blazingly fast OSTL non-recursive (non-iterative) approach.** Corresponding article: *One-Shot Trajectory Learning of Open Quantum Systems Dynamics*.

**Input and output arguments**

Arguments | Available and default parameters | Description |
---|---|---|
QDmodel=[createQDmodel or useQDmodel] (not optional) | default option is useQDmodel | requests MLQD to create or use a QD model |
QDmodelIn=[user-provided model file] | not optional if QDmodel=useQDmodel | name of the file with the trained model |
QDmodelOut=[user-defined name of created model] (optional) | can be passed if QDmodel=createQDmodel; if it is not passed, MLQD chooses a random name | MLQD saves the trained model under this name |
QDmodelType=[KRR or AIQD or OSTL] | default option is OSTL | tells MLQD what type of QD model to use |
systemType=[SB or FMO] (not optional) | no default option | tells MLQD the type of the system: spin-boson model (SB) or FMO complex (FMO) |
QDtrajOut=[file name for the output trajectory] (optional) | can be passed if QDmodel=useQDmodel; if it is not passed, MLQD chooses a random name | MLQD saves the predicted dynamics under this name |
prepInput=[True or False] | default is False; case sensitive | prepare input files X and Y from the data |
hyperParam=[True or False] | default is False; case sensitive | optimize the hyperparameters of the model |
patience=[non-negative integer] | default value is 10 | patience for early stopping in CNN training |
epochs=[non-negative integer] | default value is 100 | number of epochs for training and optimization of the CNN model (OSTL and AIQD methods) |
max_evals=[non-negative integer] | default value is 100 | maximum number of evaluations in the hyperopt optimization of the CNN model (OSTL and AIQD methods) |
XfileIn=[name of X file] | default is x_data if QDmodel=createQDmodel and prepInput=True | Name of the X file. Optional if QDmodel=createQDmodel: the X file is saved under this name if prepInput=True and read from it if prepInput=False. Not optional if QDmodel=useQDmodel and QDmodelType=KRR: in that case it must provide the input short-time trajectory. |
YfileIn=[name of Y file] | default is y_data if QDmodel=createQDmodel and prepInput=True | Name of the Y file. Optional if QDmodel=createQDmodel: the Y file is saved under this name if prepInput=True and read from it if prepInput=False. |
dataPath=[absolute or relative path with data] | needs to be passed if QDmodel=createQDmodel and prepInput=True | Path to the data from which MLQD prepares the X and Y files. The data should be in the same format as in our data set QD3SET-1 (to be published), especially when QDmodelType=OSTL or AIQD. |
n_states=[number of states or sites, integer] | default is 2 for SB and 7 for FMO | number of states (SB) or sites (FMO) |
initState=[number of initial site] | default value is 1 (initial excitation is on site 1) | initial site in the FMO complex; only required when dynamics is propagated with the OSTL or AIQD method |
time=[propagation time] | default is 20 for SB and 50 for FMO | propagation time in picoseconds (ps) for the FMO complex and in atomic units (a.u.) for the spin-boson model |
time_step=[time step of propagation] | default is 0.05 for SB and 0.005 for FMO | time step of propagation |
energyDiff=[energy difference] | default value is 1.0 | energy difference between the states in the case of SB; needed only when QDmodelType=OSTL or AIQD |
Delta=[tunneling matrix element] | default value is 1.0 | tunneling matrix element in the case of SB; needed only when QDmodelType=OSTL or AIQD |
gamma=[characteristic frequency] | default value is 10 for SB and 500 for FMO | characteristic frequency, in cm^-1 for FMO and in a.u. for SB; needed only when QDmodelType=OSTL or AIQD |
lamb=[system-bath coupling strength] | default value is 1.0 for SB and 520 for FMO | system-bath coupling strength, in cm^-1 for FMO and in a.u. for SB; needed only when QDmodelType=OSTL or AIQD |
temp=[temperature] | default value is 1.0 for SB and 510 for FMO | temperature (K) in the case of the FMO complex and inverse temperature in the case of SB; needed only when QDmodelType=OSTL or AIQD |
energyNorm=[normalizer] | default value is 1.0 | normalizer for the energy difference between the states in the case of SB |
DeltaNorm=[normalizer] | default value is 1.0 | normalizer for the tunneling matrix element in the case of SB |
gammaNorm=[normalizer] | default value is 10 for SB and 500 for FMO | normalizer for the characteristic frequency |
lambNorm=[normalizer] | default value is 1.0 for SB and 520 for FMO | normalizer for the system-bath coupling strength |
tempNorm=[normalizer] | default value is 1.0 for SB and 510 for FMO | normalizer for the temperature in the case of FMO and for the inverse temperature in the case of SB |
numLogf=[number of logistic functions] | default value is 1 | number of logistic functions normalizing the dimension of time |
LogCa=[coefficient] | default value is 1.0 | coefficient “a” in the logistic function |
LogCb=[coefficient] | default value is 15.0 | coefficient “b” in the logistic function |
LogCc=[coefficient] | default value is -1.0 | coefficient “c” in the logistic function |
LogCd=[coefficient] | default value is 1.0 | coefficient “d” in the logistic function |
dataCol=[column number] | default value is 1 | QDmodelType=KRR works only for single output values. If there are multiple columns in your data files, you need to specify which column to use. |
dtype=[real or imag] | default is real | if the column selected with dataCol contains complex data, specifies which part MLQD should use: real or imaginary |
xlength=[number of time steps in the short seed trajectory] | default value is 81 | length of the input short trajectory, i.e., the number of time steps in the data selected with dataCol |
refTraj | optional | reference trajectory; if it is provided, MLQD plots the predicted dynamics against it, otherwise no plot is made |
xlim=[x-axis limit] | default is equal to the propagation time | x-axis limit for plotting |
pltNstates=[number of states to be plotted] | default is to plot all states | number of states MLQD should plot |

#### Examples

**These are just very brief examples, please see our detailed tutorial**.

##### Training a KRR model

For the spin-boson model, we provide 20 trajectories from our QD3SET-1 database for demonstration. MLQD uses them automatically if you do not pass a data path.

```
MLQD
QDmodel=createQDmodel
QDmodelType=KRR
prepInput=True
dataCol=1
dtype=real
xlength=81
systemType=SB
QDmodelOut=KRR_SB_model
```

##### Propagation of dynamics with the trained KRR model

We provide a short input trajectory saved as `state_1_pop.txt`:

```
MLQD
time=20
time_step=0.05
QDmodel=useQDmodel
QDmodelType=KRR
XfileIn=state_1_pop.txt
systemType=SB
QDmodelIn=KRR_SB_model
QDtrajOut=KRR_trajectory
```
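The recursive (iterative) idea behind the KRR propagation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MLQD's implementation: `propagate_recursively`, `predict_next` and the toy model are hypothetical names, and the assumption is that the model maps the most recent window of points (the seed length, cf. `xlength`) to the next point.

```python
import numpy as np

def propagate_recursively(seed, predict_next, n_steps):
    """Extend a short seed trajectory one time step at a time.

    `predict_next` is any callable (hypothetical stand-in for a trained
    model) mapping the last len(seed) points to the next point.
    """
    window = len(seed)
    traj = list(seed)
    for _ in range(n_steps):
        x = np.asarray(traj[-window:])      # most recent `window` points
        traj.append(float(predict_next(x)))  # prediction becomes new input
    return np.array(traj)

# Toy stand-in for a trained model: linear extrapolation from the last two points.
toy_model = lambda x: 2.0 * x[-1] - x[-2]

seed = np.array([1.0, 0.9, 0.8])
trajectory = propagate_recursively(seed, toy_model, n_steps=4)
print(trajectory)
```

With `time=20` and `time_step=0.05`, a seed of 81 points would leave roughly 320 further steps to be predicted in this way.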

##### Training an AIQD model

```
MLQD
n_states=2
time=20
time_step=0.05
QDmodel=createQDmodel
QDmodelType=AIQD
prepInput=True
numLogf=10
LogCa=1.0
LogCb=15.0
LogCc=-1.0
LogCd=1.0
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
systemType=SB
hyperParam=True
patience=10
epochs=10
max_evals=10
QDmodelOut=AIQD_SB_model
```

##### Propagation of dynamics with the trained AIQD model

We only pass the parameters, and the trained AIQD model should be able to predict the corresponding dynamics.

```
MLQD
n_states=2
time=20
time_step=0.05
energyDiff=1.0
Delta=1.0
gamma=4.0
lamb=0.1
temp=1.0
QDmodel=useQDmodel
QDmodelType=AIQD
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
numLogf=10
systemType=SB
QDmodelIn=AIQD_SB_model.hdf5
QDtrajOut=Qd_trajectory
```

##### Training an OSTL model

```
MLQD
n_states=2
QDmodel=createQDmodel
QDmodelType=OSTL
prepInput=True
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
systemType=SB
hyperParam=True
patience=10
epochs=10
max_evals=10
QDmodelOut=OSTL_SB_model
```

##### Propagation of dynamics with the trained OSTL model

We only pass the parameters, and the trained OSTL model should be able to predict the corresponding dynamics in one shot.

```
MLQD
n_states=2
time=20
time_step=0.05
energyDiff=1.0
Delta=1.0
gamma=4.0
lamb=0.1
temp=1.0
QDmodel=useQDmodel
QDmodelType=OSTL
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
systemType=SB
QDmodelIn=OSTL_SB_model.hdf5
QDtrajOut=Qd_trajectory
```

## Learning

### Training generic ML models

MLatom can train kernel ridge regression (KRR) models for any generic data set with input vectors X and reference labels Y. A range of kernel functions is supported. **Instead of using this option, it may be more convenient to use one of the popular ML models available in MLatom.**

#### Required arguments

The arguments below are required, but typically more options are needed, e.g., for choosing a molecular descriptor and algorithm hyperparameters, as shown later.

Arguments | Available and default parameters | Description |
---|---|---|
`createMLmodel` | | requests training an ML model. Currently only KRR models are supported. |
`XYZfile=[input file with XYZ coordinates]` or `XfileIn=[input file with input vectors X]` | exactly one of these two options must be chosen. No default file names. | `XYZfile`: requests training on a data set with many molecules provided in a file with their XYZ coordinates. The units of the coordinates are arbitrary, but many simulations with MLatom require Å, which is therefore recommended. `XfileIn`: requests training on a data set with many input vectors (one input vector per line in a text file), which are typically molecular descriptors. |
`Yfile=[input file with reference values]` and/or `YgradXYZfile=[input file with reference XYZ gradients]` | one or both of these two options can be chosen. No default file names. | `Yfile`: often energies; it is recommended to use Hartree if the model is intended to be used in further simulations. `YgradXYZfile`: often energy gradients; it is recommended to use Hartree/Å. Note that gradients are negative forces and the appropriate sign should be used. Sparse gradients can also be provided: for geometries without gradients, the `YgradXYZfile` file should contain ‘0’ followed by a blank line (see tutorial). |
`MLmodelOut=[output file with trained model]` | no default file name. | saves the model to a user-defined file, commonly with the .unf extension. If the file already exists, MLatom will not overwrite it and will stop. |

#### KRR-related arguments

Arguments | Available and default parameters | Description |
---|---|---|
`prior=[offset of reference values]` | `0.0` [default]; `mean`: use the average of the reference scalar values; any other user-defined decimal/integer number. | It is often useful to offset reference values, e.g., by removing the average value. This may improve the stability of the model and make learning easier. |
`KRRtask=[one of tasks]` | `learnVal`: learns reference values [default if only `Yfile` is provided]; `learnGradXYZ`: explicitly learns only XYZ gradients (should be requested for correct simulations), works only with the KREG model (RE descriptor and Gaussian kernel); `learnValGradXYZ`: explicitly learns both scalar values and XYZ gradients [default if both `Yfile` and `YgradXYZfile` are provided], works only with the KREG model (RE descriptor and Gaussian kernel). | specifies what to learn: scalar values and/or XYZ gradients. |
`lambda=[regularization hyperparameter]` | `0.0` [default]; `opt`: optimize the hyperparameter, see the dedicated manual; any other user-defined nonnegative decimal/integer number. | It is recommended to always optimize this hyperparameter. Usually, lambda should be rather small but larger than zero, e.g., 10^{-6}. |
`lambdaGradXYZ=[regularization hyperparameter for XYZ gradients part]` | similar to `lambda`. Can be used for `KRRtask=learnGradXYZ` and `KRRtask=learnValGradXYZ`. For `KRRtask=learnGradXYZ`, `lambda` and `lambdaGradXYZ` are equivalent. | similar to `lambda`; may be helpful if it is hard to learn both scalar values and XYZ gradients with a single lambda. |
`kernel=[kernel function]` | `Gaussian` [default], hyperparameter: `sigma`; modifications of the Gaussian kernel: `periodKernel` (hyperparameters: `sigma`, `period`) and `decayKernel` (hyperparameters: `sigma`, `sigmap`, `period`); `Laplacian`, hyperparameter: `sigma`; `exponential`, hyperparameter: `sigma`; `Matern`, the most flexible but relatively slow, hyperparameters: `nn`, `sigma` (`nn=0` makes the Matern kernel equivalent to the exponential kernel, a very large `nn` makes it equivalent to the Gaussian kernel); `linear`, no hyperparameters. | Many of these kernel functions have hyperparameters, which should be defined with the indicated arguments. The linear kernel makes KRR equivalent to ridge regression, i.e., kernelized multiple linear regression (MLR), and MLatom prints out the coefficients of an equivalent MLR model. |
`sigma=[length scale hyperparameter]` | `100.0` [default for `kernel=Gaussian` and `kernel=Matern`]; `800.0` [default for `kernel=Laplacian` and `kernel=exponential`]; `opt`: optimize the hyperparameter, see the dedicated manual; any other user-defined positive decimal/integer number. | length scale hyperparameter present in most kernel functions. It is recommended to always optimize this hyperparameter; no good general default value can be recommended. |
`sigmap=[length scale hyperparameter of a periodic part]` | `100.0` [default, can be used only with `kernel=decayKernel`]; `opt`: optimize the hyperparameter, see the dedicated manual; any other user-defined positive decimal/integer number. | It is recommended to always optimize this hyperparameter; no good general default value can be recommended. |
`period=[length scale hyperparameter]` | `1.0` [default, can be used with both `kernel=periodKernel` and `kernel=decayKernel`]; `opt`: optimize the hyperparameter, see the dedicated manual; any other user-defined positive decimal/integer number. | It is recommended to always optimize this hyperparameter; no good general default value can be recommended. |
`nn=[length scale hyperparameter]` | `2` [default, can only be used with `kernel=Matern`]; `opt`: optimize the hyperparameter, see the dedicated manual; any other user-defined positive integer number. | Since it is an integer hyperparameter, it is usually easy to check several values from 1 to 5 manually, because 0 corresponds to the exponential kernel and values above 5 are already close to the Gaussian kernel. |
`permInvKernel` | optional. Related options: `molDescrType=permuted`, `permInvGroups`, `permInvNuclei`, `Nperm`, `selectperm`, `permIndIn`, `permlen`. | requests calculations with a permutationally invariant kernel. Recommended for small data sets to ensure that permutation of homonuclear atoms does not change ML predictions. |
`Nperm=[number of permutations]` | optional, can only be used with `permInvKernel` and `XfileIn`. | defines the number of permutations in the user-provided file with reference values. Each line of the input vector file must contain input vectors with molecular descriptors concatenated for all atomic permutations of a single geometry. See also related tutorial. |
`selectperm` | optional, can only be used with `permInvKernel` and `molDescrType=permuted`. | may be useful to find the most relevant permutations and reduce the number of permutations by minimizing the RMSD to an equilibrium structure. Prints out the list of selected permutations. See also related tutorial. |
`permIndIn=[file with permutations list]` | optional, can only be used with `permInvKernel`, `molDescrType=permuted` and `permlen`. | See also related tutorial. |
`permlen=[number of permutations in permIndIn]` | optional, can only be used with `permInvKernel`, `molDescrType=permuted` and `permIndIn`. | See also related tutorial. |
`matDecomp=[type of matrix decomposition]` | `Cholesky` [default]; `LU`; `Bunch-Kaufman` | `Cholesky` is the most efficient, but for very difficult cases (e.g., a too small hyperparameter lambda) other types can be used. MLatom first tries Cholesky decomposition; if it fails, MLatom tries Bunch-Kaufman and, finally, LU. Thus, the user usually does not need to worry about this option. |
`invMatrix` | not used by default. Optional. | requests inverting the kernel matrix to train the model. Not recommended because it is much slower than the default option. |
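To make the relation between the kernels in the table above more concrete, here is a minimal Python sketch of commonly used functional forms. The exact norms and prefactors are assumptions and may differ from MLatom's implementation; the function names are hypothetical. The Matern form used here reduces to the exponential kernel for `nn = 0`, which is the behavior described in the table.

```python
import math
import numpy as np

def gaussian(x1, x2, sigma):
    # Assumed form: exp(-|x1 - x2|_2^2 / (2 sigma^2))
    d = np.linalg.norm(np.subtract(x1, x2))
    return math.exp(-d**2 / (2.0 * sigma**2))

def laplacian(x1, x2, sigma):
    # Assumed form: exp(-|x1 - x2|_1 / sigma), using the L1 norm
    d1 = np.sum(np.abs(np.subtract(x1, x2)))
    return math.exp(-d1 / sigma)

def exponential(x1, x2, sigma):
    # Assumed form: exp(-|x1 - x2|_2 / sigma), using the Euclidean norm
    d = np.linalg.norm(np.subtract(x1, x2))
    return math.exp(-d / sigma)

def matern(x1, x2, sigma, nn):
    # Assumed form: exponential kernel times a polynomial of degree nn in d/sigma
    d = np.linalg.norm(np.subtract(x1, x2))
    poly = sum(
        math.factorial(nn + k) / (math.factorial(k) * math.factorial(nn - k))
        * (2.0 * d / sigma) ** (nn - k)
        for k in range(nn + 1)
    )
    return math.exp(-d / sigma) * math.factorial(nn) / math.factorial(2 * nn) * poly

a, b = [0.0, 0.0], [3.0, 4.0]   # Euclidean distance 5, L1 distance 7
print(gaussian(a, b, 5.0), laplacian(a, b, 5.0), exponential(a, b, 5.0))
print(matern(a, b, 5.0, 0), matern(a, b, 5.0, 1))
```

All four functions return 1 for identical inputs, so `sigma` only controls how quickly the similarity decays with distance.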

#### Molecular descriptor arguments

If the user only provides an XYZ file with the `XYZfile` argument, the XYZ coordinates first need to be converted into a molecular descriptor.

Arguments | Available and default parameters | Description |
---|---|---|
`molDescriptor=[molecular descriptor]` | `RE` [default] (relative-to-equilibrium); `CM` (Coulomb matrix); `ID` (inverse internuclear distances) | `RE`: well suited for an accurate description of a single-molecule PES. `CM`: a popular (but somewhat outdated) descriptor which can in principle also be applied to different molecules. In MLatom, the full CM (vectorized) is used, not its eigenvalues as in the original publication. `ID`: a popular inverse internuclear distances descriptor used in many ML models, applicable to a single-molecule PES and similar to the RE descriptor. |
`molDescrType=[type of molecular descriptor]` | `unsorted` [default for RE]; `sorted` [default for CM]; `permuted` (optional, can be used for both RE and CM) | `unsorted`: the original descriptors, which do not ensure permutational invariance with respect to homonuclear atoms. `sorted`: ensures permutational invariance and is typically used for the CM descriptor (where the CM is sorted by its norms). In the case of the RE descriptor, sorting is done by nuclear repulsions. It can be used for structure-based sampling, but it introduces discontinuities in the interpolant and should not be used for simulations. Related options: `XYZsortedFileOut`, `permInvGroups`, `permInvNuclei`. See also related tutorial. `permuted`: augments the descriptor with the permutations of user-defined atoms. Related arguments: `permInvKernel`, `permInvGroups`, `permInvNuclei`. See also related tutorial. |
`XYZsortedFileOut=[output file with sorted XYZ coordinates]` | optional. Only works with `molDescriptor=RE molDescrType=sorted`. | saves a file with XYZ coordinates after sorting the chosen atoms by nuclear repulsions. |
`permInvNuclei=[permutationally invariant nuclei]` | optional. Should be used with `molDescrType=permuted` (and often with `permInvKernel`). | E.g., `permInvNuclei=2-3.5-6` will permute atoms 2,3 and 5,6. See also related tutorial. |
`permInvGroups=[permutationally invariant groups]` | optional. Should be used with `molDescrType=permuted` (and often with `permInvKernel`). | E.g., for the water dimer, `permInvGroups=1,2,3-4,5,6` generates permuted atom indices by flipping the monomers in the dimer. |
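As an illustration of what the three descriptors in the table encode, the following Python sketch builds them from Cartesian coordinates using their textbook definitions: inverse distances for ID, equilibrium distance divided by current distance for RE, and the standard Coulomb matrix. The function names, the ordering of atom pairs and the water geometry are assumptions made for this example, and details such as units may differ from MLatom's implementation.

```python
import numpy as np

def pair_distances(xyz):
    """Internuclear distances for all unique atom pairs (i < j)."""
    xyz = np.asarray(xyz, dtype=float)
    i, j = np.triu_indices(len(xyz), k=1)
    return np.linalg.norm(xyz[i] - xyz[j], axis=1)

def id_descriptor(xyz):
    # ID: inverse internuclear distances 1/r_ij
    return 1.0 / pair_distances(xyz)

def re_descriptor(xyz, xyz_eq):
    # RE: equilibrium distance divided by current distance, r_eq_ij / r_ij
    return pair_distances(xyz_eq) / pair_distances(xyz)

def coulomb_matrix(charges, xyz):
    # Full CM: 0.5 * Z_i^2.4 on the diagonal, Z_i * Z_j / r_ij off the diagonal
    z = np.asarray(charges, dtype=float)
    xyz = np.asarray(xyz, dtype=float)
    r = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    np.fill_diagonal(r, 1.0)            # placeholder, avoids division by zero
    cm = np.outer(z, z) / r
    np.fill_diagonal(cm, 0.5 * z**2.4)
    return cm

# Approximate water geometry (illustrative numbers only)
water_eq = np.array([[0.0, 0.0, 0.1173],
                     [0.0, 0.7572, -0.4692],
                     [0.0, -0.7572, -0.4692]])
stretched = 2.0 * water_eq              # all distances doubled

print(re_descriptor(water_eq, water_eq))   # all ones at equilibrium
print(re_descriptor(stretched, water_eq))  # all 0.5 when distances double
print(coulomb_matrix([8, 1, 1], water_eq))
```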

#### Additional output arguments

Arguments | Available and default parameters | Description |
---|---|---|
`YestFile=[output file with estimated Y values]` | optional; no default parameters. | makes predictions Y for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, the program terminates and does not overwrite it. |
`YgradXYZestFile=[output file with estimated XYZ gradients]` | optional; no default parameters. | should be used only with the `XYZfile` option. Calculates first XYZ derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, the program terminates and does not overwrite it. |
`YgradEstFile=[output file with estimated gradients]` | optional; no default parameters. | should be used only with the `XfileIn` option. Calculates first derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, the program terminates and does not overwrite it. |

#### Example

Here we show how to train a simple model for the H_{2} dissociation curve with kernel ridge regression.

Download the `R_20.dat` file with 20 points corresponding to internuclear distances in the H_{2} molecule in Å.

Download the `E_FCI_20.dat` file with full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree) for the above 20 points.

Train an ML model (option `createMLmodel`) and save it to a file (option `MLmodelOut=mlmod_E_FCI_20_overfit.unf`) using the above data (training set). The following command requests fitting with kernel ridge regression, the Gaussian kernel function and the hyperparameters σ=10^{−11} and λ=0:

`mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=0.00000000001 lambda=0.0 sampling=none > create_E_FCI_20_overfit.out`

In the output file `create_E_FCI_20_overfit.out` you can see that the error of the created ML model is essentially zero for the training set. Option `sampling=none` ensures that the order of training points remains the same as in the original data set (it does not matter for creating this ML model, but will be useful later). You can use the created ML model (options `useMLmodel` and `MLmodelIn`) to calculate energies for its own training set and save them to the `E_ML_20_overfit.dat` file:

`mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat YestFile=E_ML_20_overfit.dat debug > use_E_FCI_20_overfit.out`

Now you can compare the reference FCI values with the ML-predicted values and see that they are the same. Option `debug` also prints the values of the regression coefficients alpha to the output file `use_E_FCI_20_overfit.out`. You can compare them with the reference FCI energies and see that they are exactly the same (they are given in the same order as the training points).

Now try to calculate the energy with the ML model for any other internuclear distance not present in the training set and see that the predictions are zero. This means that the ML model is overfitted and cannot generalize to new situations because of the hyperparameter choice: with such a tiny σ, the Gaussian kernel between any two different points is numerically zero. Thus, optimization of hyperparameters is strongly recommended.
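The overfitting behavior described in this example can be reproduced with a few lines of NumPy. The sketch below is not MLatom code: it assumes the Gaussian kernel has the form exp(−(r − r′)² / (2σ²)), and it uses made-up numbers rather than the contents of `R_20.dat` and `E_FCI_20.dat`.

```python
import numpy as np

def gaussian_kernel_matrix(xa, xb, sigma):
    # Assumed kernel form: exp(-(x - x')^2 / (2 sigma^2)) for 1D inputs
    d2 = (np.asarray(xa)[:, None] - np.asarray(xb)[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

# Made-up training data (distances in Å, energies in Hartree)
r_train = np.array([0.5, 0.7, 0.9, 1.1])
e_train = np.array([-1.06, -1.17, -1.16, -1.13])

sigma, lam = 1e-11, 0.0
K = gaussian_kernel_matrix(r_train, r_train, sigma)   # numerically the identity
alpha = np.linalg.solve(K + lam * np.eye(len(r_train)), e_train)

def predict(r_new):
    return gaussian_kernel_matrix(np.atleast_1d(r_new), r_train, sigma) @ alpha

print(alpha)          # equal to the training energies
print(predict(0.7))   # training point: reproduced exactly
print(predict(0.8))   # unseen point: prediction collapses to zero
```

Because the kernel matrix is the identity matrix for such a small σ, solving the KRR equations simply copies the reference values into the regression coefficients, and every kernel value between a new point and the training points vanishes.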

### Δ-learning

Δ-machine learning can be used with one of the usual options. Below, the arguments unique to Δ-learning are described. See also a tutorial.

Arguments | Description |
---|---|
`deltaLearn` | required. Should be used with one of: `createMLmodel`, `useMLmodel MLmodelIn`, `estAccMLmodel`. |
`Yb=[file with data obtained with baseline method]` | required for both training and predictions. |
`Yt=[file with data obtained with target method]` | required only for training. |
`YestT=[file with ML estimations of target method]` | required for predictions. |
`YestFile=[file with ML corrections to baseline method]` | required for predictions. |
`YgradXYZb=[file with baseline XYZ gradients]` | optional. |
`YgradXYZt=[file with target XYZ gradients]` | optional. |
`YgradXYZestT=[file with ML estimations of target XYZ gradients]` | optional. |
`YgradXYZestFile=[file with ML corrections to baseline XYZ gradients]` | optional. |

#### Example

`mlatom estAccMLmodel deltaLearn XfileIn=x.dat Yb=UHF.dat Yt=FCI.dat YestT=D-ML.dat YestFile=corr_ML.dat`
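The roles of the files in this command can be summarized with a short arithmetic sketch. It assumes the usual Δ-learning scheme, in which the model is trained on the difference between target and baseline values and the prediction of the target value is the baseline plus the ML correction. The numbers and variable names are made up for illustration, and the "model" is replaced by a perfect stand-in.

```python
import numpy as np

# Made-up values standing in for the contents of the Yb and Yt files
y_baseline = np.array([-1.10, -1.12, -1.08])   # e.g., UHF energies
y_target = np.array([-1.15, -1.18, -1.12])     # e.g., FCI energies

# Training labels for the ML model: target minus baseline
delta_train = y_target - y_baseline

# Stand-in for the ML-predicted corrections (what YestFile would hold);
# a real model would only approximate delta_train
delta_pred = delta_train.copy()

# Estimate of the target method (what YestT would hold)
y_est_target = y_baseline + delta_pred
print(y_est_target)
```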

### Self-correction

Self-correction, as described here, can be used with one of the usual options. Below, the arguments unique to self-correction are described. See also a tutorial.

Arguments | Description |
---|---|
`selfCorrect` | required. Should be used with one of: `createMLmodel`, `useMLmodel MLmodelIn`, `estAccMLmodel`. |

#### Example

`mlatom estAccMLmodel selfCorrect XYZfile=xyz.dat Yfile=y.dat`

## Data

### Converting XYZ coordinates to molecular descriptor

Arguments |
---|
`XYZ2X` |
`XYZfile=[input file S with XYZ coordinates]` |
`XfileOut=[output file S with X values]` |

#### Molecular descriptor arguments

If the user only provides an XYZ file with the `XYZfile` argument, the XYZ coordinates first need to be converted into a molecular descriptor.

Arguments | Available and default parameters | Description |
---|---|---|
`molDescriptor=[molecular descriptor]` | `RE` [default] (relative-to-equilibrium); `CM` (Coulomb matrix); `ID` (inverse internuclear distances) | `RE`: well suited for an accurate description of a single-molecule PES. `CM`: a popular (but somewhat outdated) descriptor which can in principle also be applied to different molecules. In MLatom, the full CM (vectorized) is used, not its eigenvalues as in the original publication. `ID`: a popular inverse internuclear distances descriptor used in many ML models, applicable to a single-molecule PES and similar to the RE descriptor. |
`molDescrType=[type of molecular descriptor]` | `unsorted` [default for RE]; `sorted` [default for CM]; `permuted` (optional, can be used for both RE and CM) | `unsorted`: the original descriptors, which do not ensure permutational invariance with respect to homonuclear atoms. `sorted`: ensures permutational invariance and is typically used for the CM descriptor (where the CM is sorted by its norms). In the case of the RE descriptor, sorting is done by nuclear repulsions. It can be used for structure-based sampling, but it introduces discontinuities in the interpolant and should not be used for simulations. Related options: `XYZsortedFileOut`, `permInvGroups`, `permInvNuclei`. See also related tutorial. `permuted`: augments the descriptor with the permutations of user-defined atoms. Related arguments: `permInvKernel`, `permInvGroups`, `permInvNuclei`. See also related tutorial. |
`XYZsortedFileOut=[output file with sorted XYZ coordinates]` | optional. Only works with `molDescriptor=RE molDescrType=sorted`. | saves a file with XYZ coordinates after sorting the chosen atoms by nuclear repulsions. |
`permInvNuclei=[permutationally invariant nuclei]` | optional. Should be used with `molDescrType=permuted` (and often with `permInvKernel`). | E.g., `permInvNuclei=2-3.5-6` will permute atoms 2,3 and 5,6. See also related tutorial. |
`permInvGroups=[permutationally invariant groups]` | optional. Should be used with `molDescrType=permuted` (and often with `permInvKernel`). | E.g., for the water dimer, `permInvGroups=1,2,3-4,5,6` generates permuted atom indices by flipping the monomers in the dimer. |
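The water dimer example for `permInvGroups` can be made explicit with a small Python sketch. The parsing rule used here is an assumption inferred from that single example (commas join the atoms of one group, hyphens separate interchangeable groups), and `permuted_indices_from_groups` is a hypothetical helper, not an MLatom function.

```python
from itertools import permutations

def permuted_indices_from_groups(spec):
    """List atom-index orderings obtained by exchanging whole groups.

    Assumed syntax: "1,2,3-4,5,6" means group (1,2,3) and group (4,5,6)
    can be swapped with each other.
    """
    groups = [[int(i) for i in g.split(",")] for g in spec.split("-")]
    return [[atom for g in order for atom in g] for order in permutations(groups)]

print(permuted_indices_from_groups("1,2,3-4,5,6"))
# [[1, 2, 3, 4, 5, 6], [4, 5, 6, 1, 2, 3]]
```

The first ordering is the original geometry and the second is the dimer with its two monomers exchanged; a descriptor of type `permuted` would be built for each such ordering.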

#### Example

`MLatom.py XYZ2X XYZfile=CH3Cl.xyz XfileOut=CH3Cl.x`

### Analyzing data sets

MLatom can analyze data sets by comparing them, mostly by calculating the errors of ML-predicted values with respect to available reference values. All files are input files, and the MLatom output is a statistical analysis.

Arguments |
---|
`analyze` |
`Yfile=[input file with values]` |
`YgradXYZfile=[input file with gradients in XYZ coordinates]` |
`YestFile=[input file with estimated Y values]` |
`YgradXYZestFile=[input file with estimated XYZ gradients]` |

#### Example

`MLatom.py analyze Yfile=en.dat YestFile=enest.dat`
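For orientation, the kind of error statistics such an analysis is based on can be computed with NumPy as sketched below. This is not a reproduction of MLatom's output, which may contain more or differently named quantities; `error_statistics` is a hypothetical helper and the numbers are made up. For real files, the arrays could be read with `np.loadtxt("en.dat")` and `np.loadtxt("enest.dat")`, assuming one value per line.

```python
import numpy as np

def error_statistics(y_ref, y_est):
    """Basic error measures of estimated values with respect to reference values."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_est = np.asarray(y_est, dtype=float)
    err = y_est - y_ref
    return {
        "mean_absolute_error": float(np.mean(np.abs(err))),
        "mean_signed_error": float(np.mean(err)),
        "root_mean_squared_error": float(np.sqrt(np.mean(err**2))),
        "largest_absolute_error": float(np.max(np.abs(err))),
    }

stats = error_statistics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```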

### Sampling and splitting

#### Arguments for sampling and splitting

Arguments | Available and default parameters | Description |
---|---|---|
`sample` | requires at least one of `iTrainOut`, `CVtest`, `LOOtest`, `CVopt`, `LOOopt`. | see also tutorial. |
`XYZfile=[file with XYZ coordinates]` or `XfileIn=[file with input vectors X]` | required. | |
`iTrainOut=[file with indices of training points]` | no default file names. | generates indices for the training set. |
`iTestOut=[file with indices of test points]` | no default file names. | generates indices for the test set. |
`iSubtrainOut=[file with indices of sub-training points]` | no default file names. | generates indices for the sub-training set. |
`iValidateOut=[file with indices of validation points]` | no default file names. | generates indices for the validation set. |
`CVtest` | optional. Related option: `NcvTestFolds`. | generates indices for the splits in N-fold cross-validation. By default, 5-fold cross-validation is used. |
`NcvTestFolds=[number of CV folds]` | `5` [default]. Can be used only with `CVtest`. | if this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation. |
`LOOtest` | optional. | leave-one-out cross-validation. Only random or no sampling can be used. |
`iCVtestPrefOut=[prefix of files with indices for CVtest]` | no default prefixes. | file names will include the required prefix. |
`CVopt` | optional. Related option: `NcvOptFolds`. | generates indices for N-fold cross-validation for hyperparameter optimization. By default, 5-fold cross-validation is used. |
`NcvOptFolds=[number of CV folds]` | `5` [default]. Can be used only with `CVopt`. | if this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation. |
`LOOopt` | optional. | leave-one-out cross-validation. Only random or no sampling can be used. |
`iCVoptPrefOut=[prefix of files with indices for CVopt]` | no default prefixes. | file names will include the required prefix. |
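The index files produced for cross-validation contain, conceptually, a partition of the data points into folds. A minimal sketch of such a random N-fold split is given below. It is not MLatom's implementation: `cv_fold_indices` is a hypothetical helper, and the use of 1-based indices is an assumption.

```python
import numpy as np

def cv_fold_indices(n_points, n_folds=5, seed=0):
    """Randomly partition point indices into n_folds nearly equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_points) + 1   # 1-based indices (assumption)
    return np.array_split(shuffled, n_folds)

folds = cv_fold_indices(12, n_folds=5)
for k, test_idx in enumerate(folds):
    # Each fold serves once as the held-out part; the rest is used for training
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
    print(f"fold {k + 1}: held out {sorted(test_idx.tolist())}, "
          f"{len(train_idx)} points for training")
```

Setting the number of folds equal to the number of data points puts exactly one point into each fold, which corresponds to leave-one-out cross-validation as noted in the table.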

#### Additional optional arguments for sampling

Arguments used with the `sample` argument.

Arguments | Available and default parameters |
---|---|
`sampling=[type of data set sampling into splits]` | `random` [default]: random sampling; `none`: simply split the unshuffled data set into the training and test sets (in this order), and into the sub-training and validation sets; `structure-based`: structure-based sampling; `farthest-point`: farthest-point traversal iterative procedure. |
`Nuse=[N first entries of the data set file to be used]` | 100% [default]. Optional. |
`Ntrain=[number of the training points or a fraction of the total set]` | 80% of the total set by default. If the parameter is a decimal number less than 1, it is considered to be a fraction of the total set. |
`Ntest=[number of the test points or a fraction of the total set]` | by default, the remaining points of the total set after subtracting the training points. If the parameter is a decimal number less than 1, it is considered to be a fraction of the total set. |
`Nsubtrain=[number of the sub-training points or a fraction of the training points]` | 80% of the training set by default. If the parameter is a decimal number less than 1, it is considered to be a fraction of the training set. |
`Nvalidate=[number of the validation points or a fraction of the training points]` | by default, the remaining points of the training set after subtracting the sub-training points. If the parameter is a decimal number less than 1, it is considered to be a fraction of the training set. |

#### Example

Structure-based sampling:

`mlatom sample sampling=structure-based XYZfile=CH3Cl.xyz Ntrain=1000 Ntest=10000 iTrainOut=itrain.dat iTestOut=itest.dat`
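Of the sampling types listed above, farthest-point traversal is easy to illustrate. The sketch below is a generic version of the procedure, not MLatom's code: each new point is the one whose distance to the nearest already selected point is largest. The choice of the starting point and the use of the Euclidean distance between descriptor vectors are assumptions.

```python
import numpy as np

def farthest_point_sampling(x, n_select, start=0):
    """Greedy farthest-point traversal over the rows of x."""
    x = np.asarray(x, dtype=float)
    selected = [start]
    # Distance of every point to its nearest selected point
    min_dist = np.linalg.norm(x - x[start], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))          # farthest from the selected set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[nxt], axis=1))
    return selected

points = np.array([[0.0], [1.0], [2.0], [10.0]])
print(farthest_point_sampling(points, 3))   # [0, 3, 2]
```

Starting from the first point, the procedure first picks the most distant point (index 3) and then the point that is farthest from both selected points (index 2), so the selected set covers the data range as evenly as possible.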

#### Slicing

Sometimes it is useful to slice data by the Euclidean distance of their descriptors to the equilibrium descriptor. See tutorial.

Arguments for slicing:

Arguments | Available and default parameters |
---|---|
`slice` | required. |
`XfileIn=[file with input vectors X]` | required. |
`eqXfileIn=[file with input vector for equilibrium geometry]` | required. |
`Nslices=[number of slices]` | `3` [default]. Optional. |

Arguments for sampling from slices:

Arguments | Available and default parameters |
---|---|
`sampleFromSlices` | |
`Ntrain=[total integer number N of training points from all slices]` | required. |
`Nslices=[number of slices]` | `3` [default]. Optional. |

Arguments for merging indices from slices:

Arguments | Available and default parameters |
---|---|
`mergeSlices` | |
`Ntrain=[total integer number N of training points from all slices]` | required. |
`Nslices=[number of slices]` | `3` [default]. Optional. |

#### Examples

See tutorial.

`MLatom.py slice Nslices=3 XfileIn=x_sorted.dat eqXfileIn=eq.x`

`mlatom sampleFromSlices Nslices=3 sampling=structure-based Ntrain=4480`

`mlatom mergeSlices Nslices=3 Ntrain=4480`
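The slicing step can be pictured with the following Python sketch, which orders the points by the Euclidean distance of their descriptor to the equilibrium descriptor and cuts the ordered list into slices. How MLatom places the slice boundaries is not stated here, so the equal-count split is an assumption, as are the hypothetical function name and the zero-based indices.

```python
import numpy as np

def slice_by_distance(x, x_eq, n_slices=3):
    """Split point indices into slices ordered by distance to x_eq."""
    x = np.asarray(x, dtype=float)
    x_eq = np.asarray(x_eq, dtype=float)
    dist = np.linalg.norm(x - x_eq, axis=1)   # distance to equilibrium descriptor
    order = np.argsort(dist)                  # closest points first
    return np.array_split(order, n_slices)    # equal-count slices (assumption)

# Toy one-dimensional descriptors; the equilibrium descriptor is at 0
x = [[3.0], [1.0], [2.0], [6.0], [5.0], [4.0]]
slices = slice_by_distance(x, [0.0], n_slices=3)
for k, idx in enumerate(slices, start=1):
    print(f"slice {k}: point indices {idx.tolist()}")
```

Training points can then be sampled within each slice separately and the resulting indices merged, which is the workflow of the three commands above.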
