Manual of MLatom, version 0.92, revision 102
Please consult Features for an overview of MLatom capabilities. This page provides details on how to use MLatom for various types of calculations.
Installation
Currently, only a statically compiled binary for Linux systems is provided. This binary can be saved in any directory and used directly, without any changes to environment variables or other settings.
Running MLatom
To run MLatom, provide the path to the binary, called MLatomF, and the necessary command-line options (see the next section), i.e. type in your terminal:
$pathToMLatom/MLatomF [options]
In the following, the notation mlatom (it is useful to set up such an alias in your shell) is used instead of $pathToMLatom/MLatomF.
All options are case insensitive, i.e. you can type either
mlatom help
or
mlatom Help
with the same result (the command will print available options on your computer screen).
In order to run MLatom you have to have several input files as described below. Note that input and output file names are case sensitive! For example, xyz.dat
and XYZ.dat
are two different file names.
By default, MLatom will use all available threads on your computer. If you want to limit the number of threads, set the environment variables OMP_NUM_THREADS and MKL_NUM_THREADS. On machines with hyperthreading, a significant speedup can often be achieved by setting the environment variable OMP_PROC_BIND to 'true'.
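As a rough illustration, the thread settings above can be prepared from a script before launching the program. The Python sketch below only builds the environment and the command line and runs nothing. The helper name mlatom_environment and the path /opt/mlatom/MLatomF are hypothetical and not part of MLatom.

```python
import os

def mlatom_environment(n_threads, bind_threads=False, base_env=None):
    """Return a copy of the environment with the thread-related variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(n_threads)
    env["MKL_NUM_THREADS"] = str(n_threads)
    if bind_threads:
        # May help on machines with hyperthreading, as noted above.
        env["OMP_PROC_BIND"] = "true"
    return env

env = mlatom_environment(4, bind_threads=True, base_env={})
command = ["/opt/mlatom/MLatomF", "help"]  # hypothetical install path
# To launch for real: subprocess.run(command, env=env, check=True)
```

Passing the environment explicitly keeps the thread limits local to the MLatom run instead of changing them for the whole shell session.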
Getting Help and List of Options for a Current Version
You can directly request your current version of MLatom to print its available options with the command:
mlatom help
Input
Along with command-line options, MLatom needs to read various files from disk depending on the task. File names should be specified using the following options:
XYZfile=[name of file with molecular XYZ coordinates]
XfileIn=[name of file with molecular descriptor (ML input) vectors]
Yfile=[name of file with reference values]
MLmodelIn=[name of file with ML model]
If the requested input file does not exist, MLatom will terminate with a request to provide it.
File extensions are arbitrary.
It is sometimes useful to use only part of a big data set. This can be requested with option Nuse=N, which specifies that only the first N entries of the input files will be used.
File Formats
The XYZfile option requires a file with the XYZ coordinates of molecules listed one after another. For each molecule, the first line specifies the number of atoms, followed by one blank line and then by the Cartesian coordinates of the nuclei, e.g. for three molecules:
5

C 0.000 0.000 0.000
Cl 1.776 0.000 0.000
H 0.342 1.027 0.000
H 0.342 0.513 0.890
H 0.342 0.513 0.890
5

C 0.000 0.000 0.000
Cl 1.776 0.000 0.000
H 0.343 1.027 0.000
H 0.342 0.513 0.890
H 0.342 0.513 0.890
5

C 0.000 0.000 0.000
Cl 1.776 0.000 0.000
H 0.339 1.028 0.000
H 0.342 0.513 0.890
H 0.342 0.513 0.890
Nuclear charges can be used instead of element symbols. Coordinates are given in Å.
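To make the layout concrete, here is a minimal Python sketch of a reader for such a multi-molecule file. It is not MLatom's own parser. The function name read_xyz_molecules and the n_use argument (which mimics the Nuse=N option) are illustrative assumptions, and atom labels are kept as strings so that either element symbols or nuclear charges are accepted.

```python
def read_xyz_molecules(text, n_use=None):
    """Parse concatenated XYZ blocks: atom count, one blank line, coordinates."""
    lines = text.splitlines()
    molecules = []
    pos = 0
    while pos < len(lines):
        if not lines[pos].strip():
            pos += 1                      # tolerate stray blank lines between molecules
            continue
        n_atoms = int(lines[pos])
        pos += 2                          # skip the count line and the blank line after it
        atoms = []
        for line in lines[pos:pos + n_atoms]:
            label, x, y, z = line.split()[:4]
            atoms.append((label, float(x), float(y), float(z)))
        if len(atoms) != n_atoms:
            raise ValueError("truncated molecule block")
        molecules.append(atoms)
        pos += n_atoms
        if n_use is not None and len(molecules) == n_use:
            break                         # keep only the first n_use molecules
    return molecules

# Hypothetical two-molecule input; the second block uses nuclear charges as labels.
sample = "2\n\nH 0.0 0.0 0.0\nH 0.74 0.0 0.0\n2\n\n1 0.0 0.0 0.0\n1 0.75 0.0 0.0\n"
molecules = read_xyz_molecules(sample)
```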
The XfileIn option requires a file with input vectors, where each vector should be on one line, e.g.:
1.0093 1.0009 1.0009
1.0080 1.0229 1.0004
1.0009 0.9947 0.9738
The Yfile option requires a file with one reference datum per line, e.g.:
6.349
23.852
60.872
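The two plain-text formats above are simple enough to check before a long calculation. The following Python sketch reads both and verifies that they describe the same number of entries. The function names are hypothetical, and the requirement that all input vectors have equal length is an assumption made here, not something this manual states.

```python
def read_x_vectors(text):
    """One whitespace-separated input vector per non-empty line."""
    return [[float(v) for v in line.split()] for line in text.splitlines() if line.strip()]

def read_y_values(text):
    """One reference value per non-empty line."""
    return [float(line) for line in text.splitlines() if line.strip()]

def check_consistent(x_vectors, y_values):
    """Raise ValueError if the two data sets cannot belong together."""
    if len(x_vectors) != len(y_values):
        raise ValueError("number of input vectors and reference values differ")
    if len({len(v) for v in x_vectors}) > 1:
        raise ValueError("input vectors have different lengths")

x_vectors = read_x_vectors("1.0 2.0\n3.0 4.0\n")   # made-up example vectors
y_values = read_y_values("6.349\n23.852\n")
check_consistent(x_vectors, y_values)
```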
MLmodelIn
requires a file with ML model generated by MLatom.
Output
MLatom prints a summary of its calculations to the standard output, so it is recommended to redirect it to a file, e.g.:
mlatom help > mlatom.out
It can also write files to the disk depending on the task. File names should be specified using the following options:
XfileOut=[name of file to write input vectors to]
MLmodelOut=[name of file to write ML model to]
YestFile=[name of file to write values predicted by ML to]
YgradEstFile=[name of file to write gradients predicted by ML to]
iTrainOut=[name of file to write training point indices to]
iTestOut=[name of file to write test point indices to]
iSubtrainOut=[name of file to write subtraining point indices to]
iValidateOut=[name of file to write validation point indices to]
If an output file with the same name already exists, MLatom will terminate with a request to remove or rename it.
File extensions are arbitrary.
Tasks Performed by MLatom
This section gives a brief overview of how to request MLatom to perform its tasks. See the sections below for additional options.
For all tasks, at least one of the XYZfile and XfileIn options should be used (see Section Input).
ML operations
You can estimate the accuracy of an ML model, i.e. estimate its generalization error, by using option estAccMLmodel
with other options:
mlatom estAccMLmodel [other options]
For default settings and other mandatory options, see the corresponding sections below, specifically section Model Validation.
Example:
mlatom estAccMLmodel Yfile=y.dat XYZfile=xyz.dat kernel=Gaussian sigma=opt lambda=opt
This command will request estimation of the generalization error of an ML model for molecules provided in Cartesian coordinates in xyz.dat
file and reference data in y.dat
file. Gaussian kernel will be used and hyperparameters σ and λ will be optimized.
In order to create an ML model and save it to a file on a disk, use option createMLmodel
:
mlatom createMLmodel [other options]
For both estAccMLmodel
and createMLmodel
additional input option Yfile
should be used (see Section Input).
Example:
mlatom createMLmodel Yfile=y.dat XYZfile=xyz.dat MLmodelOut=mlmod.unf kernel=Gaussian sigma=opt lambda=opt
This command will request creating an ML model for molecules provided in Cartesian coordinates in xyz.dat
file and reference data in y.dat
file and save it to mlmod.unf
file. Gaussian kernel will be used and hyperparameters σ and λ will be optimized.
Loading existing ML model from a file and performing ML calculations with this model can be done with option useMLmodel
:
mlatom useMLmodel [other options]
For useMLmodel
additional input option MLmodelIn
should be used (see Section Input).
Example:
mlatom useMLmodel MLmodelIn=mlmod.unf XYZfile=xyz.dat YestFile=yest.dat
This command will request making predictions with an ML model read from mlmod.unf
file for molecules provided in Cartesian coordinates in xyz.dat
file and save predicted values in yest.dat
file. The program will output a summary of the loaded model, such as the kernel used and the hyperparameter values used to create it.
Data Set Operations
Converting XYZ coordinates into an input vector (molecular descriptor) for ML
You can use the XYZ2X option to convert the XYZ coordinates of a series of molecules, provided in the file specified by option XYZfile=[filename], to molecular descriptor (input) vectors for ML calculations, which are saved to the file specified by option XfileOut=[filename].
Example:
mlatom XYZ2X XYZfile=xyz.dat XfileOut=x.dat
Given a data set of molecules either in XYZ format or in molecular descriptor form, you can sample their subsets (e.g. the training and test sets), by using sample
option:
mlatom sample [other options]
Basically, one can use this option to generate indices of the training, test, subtraining, and validation sets without performing ML calculations. Thus, other options used for Model Validation and Hyperparameter Tuning are applicable. Bug in this release: Testing and training indices generated and saved to text files with sample sampling=random
options are not randomly sampled; this bug does not affect ML operations.
Sampling
You can specify the type of sampling into the training and other sets using option sampling=[type of sampling]. Available types of sampling are:
none: simply splits the data set into the training and test sets and, if necessary, the training set into the subtraining and validation sets (in this order), without changing the order of indices
random: simple random sampling
user-defined: requests MLatom to read indices for the training, test, and, if necessary, subtraining and validation sets from the files itrain.dat, itest.dat, isubtrain.dat, and ivalidate.dat. The corresponding options Ntrain, Ntest, Nsubtrain, and Nvalidate can be used as well.
structure-based: performs structure-based sampling
farthest-point: farthest-point traversal iterative procedure, which starts from the two points farthest apart
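For readers unfamiliar with farthest-point traversal, the Python sketch below shows the usual greedy form of the procedure. It starts from the two points farthest apart, then repeatedly adds the point whose distance to its nearest already-selected point is largest. It is a generic illustration, not MLatom's implementation; the use of Euclidean distances between input vectors and the function name are assumptions.

```python
import math

def farthest_point_sample(points, n_select):
    """Greedy farthest-point traversal; returns indices in selection order."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    # Start from the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]
    # Distance from every point to its nearest already-selected point.
    nearest = [min(math.dist(p, points[first]), math.dist(p, points[second])) for p in points]
    while len(selected) < n_select:
        candidate = max((i for i in range(n) if i not in selected), key=lambda i: nearest[i])
        selected.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[candidate]))
    return selected

# Made-up one-dimensional data set.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (5.0,)]
training_indices = farthest_point_sample(points, 4)
```

The initial pair search is quadratic in the number of points, which is fine for a sketch but would need care for large data sets.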
ML Algorithm
You can use the following options for performing kernel ridge regression calculations:
lambda=R: sets the regularization parameter λ to a floating-point number R. Default value is 0.0. You can request optimization of this parameter with lambda=opt; see below for more options related to hyperparameter tuning.
kernel=[type of kernel]: requests using one of the available types of kernel, whose names are self-explanatory:
  kernel=Gaussian (set by default)
  kernel=Laplacian
  kernel=Matern
The kernel width σ is a parameter that can also be changed by the user with the following option:
sigma=R: sets σ to a floating-point number R. You can request optimization of this parameter with sigma=opt; see below for more options related to hyperparameter tuning. Default values differ between kernels:
  sigma=100.0 for Gaussian and Matérn kernels
  sigma=800.0 for Laplacian kernel
In the case of the Matérn kernel, there is an additional integer parameter n, which is set to 2 by default and can be changed to an integer N using option nn=N.
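The role of σ and n is easier to see from explicit formulas. The Python sketch below uses commonly quoted forms of the three kernels. This manual does not state the exact expressions used inside MLatom, so the normalization (for example the factor 2σ² in the Gaussian kernel, the L1 norm in the Laplacian kernel, and the polynomial form of the Matérn kernel) should be treated as an assumption.

```python
import math

def gaussian_kernel(x, y, sigma=100.0):
    """exp(-|x - y|_2^2 / (2 sigma^2))"""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def laplacian_kernel(x, y, sigma=800.0):
    """exp(-|x - y|_1 / sigma)"""
    d1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-d1 / sigma)

def matern_kernel(x, y, sigma=100.0, nn=2):
    """exp(-d/sigma) * sum_k (nn+k)!/(2nn)! * C(nn,k) * (2d/sigma)^(nn-k), d = |x - y|_2"""
    d = math.dist(x, y)
    total = sum(
        math.factorial(nn + k) / math.factorial(2 * nn)
        * math.comb(nn, k) * (2.0 * d / sigma) ** (nn - k)
        for k in range(nn + 1)
    )
    return math.exp(-d / sigma) * total
```

In this form the Matérn kernel with nn=0 reduces to a plain exponential of the Euclidean distance, and larger nn gives a smoother kernel.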
Molecular Descriptors
molDescriptor=[type of molecular descriptor]: requests using one of the available types of molecular descriptor:
molDescriptor=CM: requests using the Coulomb matrix. Two types of this descriptor are available:
  CMtype=sorted (set by default) sorts the Coulomb matrix by the norms of its rows.
  CMtype=unsorted uses the same order of atoms as in the input file with XYZ coordinates of molecules.
molDescriptor=RE: requests using the unsorted vector {r^{eq}/r}, where r is an internuclear distance in the current molecule and r^{eq} is the corresponding internuclear distance in the equilibrium (or other reference) structure. The equilibrium structure should be provided in a file named ‘eq.xyz’ in XYZ format.
Option CMtype=sorted is deprecated. In new releases, use molDescrType=sorted instead.
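A short Python sketch may help to picture the two descriptors. The Coulomb matrix below follows the widely used definition (0.5 Z^2.4 on the diagonal, Z_i Z_j / r_ij off the diagonal). The descending sort order, the absence of any unit conversion, and the function names are assumptions made for illustration, and MLatom's own conventions may differ.

```python
import math
from itertools import combinations

def re_descriptor(coords, eq_coords):
    """Unsorted vector of r_eq / r over all atom pairs (i < j)."""
    return [
        math.dist(eq_coords[i], eq_coords[j]) / math.dist(coords[i], coords[j])
        for i, j in combinations(range(len(coords)), 2)
    ]

def coulomb_matrix(charges, coords, sort_rows=True):
    """Coulomb matrix, optionally with rows and columns reordered by row norm."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                matrix[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    if sort_rows:
        norms = [math.sqrt(sum(v * v for v in row)) for row in matrix]
        order = sorted(range(n), key=lambda i: norms[i], reverse=True)  # largest norm first
        matrix = [[matrix[i][j] for j in order] for i in order]
    return matrix

# Made-up water-like geometry: H, O, H.
charges = [1, 8, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (1.2, 0.93, 0.0)]
cm_sorted = coulomb_matrix(charges, coords)
re_vector = re_descriptor(coords, coords)
```

Sorting makes the matrix independent of the atom order in the input file, while the RE vector relies on every molecule using the same atom order as the reference structure.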
Model Validation
An ML model can be validated (its generalization error can be estimated) in several ways:
 on a hold-out test set not used for training. Both the training and test sets can be sampled in one of the ways described above. The number of points in the training and test sets is set by options Ntrain=R and Ntest=R, respectively. If R is an integer greater than or equal to 1, this number of points is sampled from the data set. If R is a floating-point number less than 1.0, it defines the fraction of the data set points to sample. By default, 80% of the data set points are used as the training set and the remaining 20% as the test set;
 by performing N-fold cross-validation. The user can request this procedure using option CVtest and define the number of folds N using option NcvTestFolds=N. By default, 5-fold cross-validation is used. If N is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.
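The two validation schemes can be sketched in a few lines of Python. The helper below interprets a value R the way the Ntrain and Ntest options are described above (a count if at least 1, otherwise a fraction). Rounding of fractional sizes, the random shuffling, and all function names are assumptions for illustration only.

```python
import random

def resolve_count(value, n_total):
    """Interpret R as a point count (>= 1) or as a fraction (< 1) of n_total."""
    if value >= 1:
        return int(value)
    return int(round(value * n_total))

def holdout_split(n_total, n_train=0.8, n_test=0.2, seed=None):
    """Randomly sampled, disjoint training and test indices."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    n_tr = resolve_count(n_train, n_total)
    n_te = resolve_count(n_test, n_total)
    return indices[:n_tr], indices[n_tr:n_tr + n_te]

def cv_folds(n_total, n_folds=5, seed=None):
    """Split shuffled indices into n_folds nearly equal folds."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return [indices[k::n_folds] for k in range(n_folds)]

train, test = holdout_split(100, seed=1)
folds = cv_folds(10, n_folds=5, seed=0)
```

In cross-validation each fold serves once as the test set while the remaining folds are used for training; with as many folds as data points every fold holds a single point, which is the leave-one-out case.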
Hyperparameter Tuning
Gaussian, Laplacian, and Matérn kernels have σ and λ tunable hyperparameters. Their optimization can be requested with options sigma=opt
and lambda=opt
, respectively.
MLatom can tune hyperparameters to minimize either the mean absolute error or the root-mean-square error, as requested with option minimizeError=MAE or minimizeError=RMSE (default), respectively. Hyperparameters can be tuned to minimize
 the error of the ML model trained on the subtraining set, evaluated on a hold-out validation set. Both the subtraining and validation sets can be sampled from the training set in one of the ways described above. The number of points in the subtraining and validation sets is set by options Nsubtrain=R and Nvalidate=R, respectively. If R is an integer greater than or equal to 1, this number of points is sampled from the training set. If R is a floating-point number less than 1.0, it defines the fraction of the training set points to sample. By default, 80% of the training set points are used as the subtraining set and the remaining 20% as the validation set;
 the N-fold cross-validation error. The user can request this procedure using option CVopt and define the number of folds N using option NcvOptFolds=N. By default, 5-fold cross-validation is used. If N is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation.
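The choice between MAE and RMSE matters because the two measures can rank the same candidates differently. The Python sketch below defines both errors and picks the better of several hyperparameter settings from their validation-set predictions. The function pick_best and the candidate labels are hypothetical and only illustrate the idea.

```python
import math

def mean_absolute_error(y_ref, y_est):
    return sum(abs(r - e) for r, e in zip(y_ref, y_est)) / len(y_ref)

def root_mean_square_error(y_ref, y_est):
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(y_ref, y_est)) / len(y_ref))

def pick_best(candidates, y_ref, minimize_error="RMSE"):
    """candidates maps a hyperparameter label to its validation-set predictions."""
    metric = {"MAE": mean_absolute_error, "RMSE": root_mean_square_error}[minimize_error.upper()]
    return min(candidates, key=lambda label: metric(y_ref, candidates[label]))

# Made-up validation data: A is uniformly off by 1, B is exact except for one outlier.
y_ref = [0.0, 0.0, 0.0, 0.0]
candidates = {"A": [1.0, 1.0, 1.0, 1.0], "B": [0.0, 0.0, 0.0, 3.0]}
best_by_mae = pick_best(candidates, y_ref, "MAE")
best_by_rmse = pick_best(candidates, y_ref, "RMSE")
```

RMSE penalizes the single large error of candidate B more strongly than MAE does, so the two criteria select different candidates here.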
MLatom searches for optimal parameters on a logarithmic grid. After the best parameters are found in the first iteration, MLatom can perform more iterations of the logarithmic grid search. The number of iterations is controlled by the lgOptDepth=N keyword, with N=2 by default. The user can adjust the number of grid points and the starting and finishing points of the grid using the following options for
 the λ hyperparameter:
NlgLambda=N: defines the number of points on the logarithmic grid (base 2). By default, 6 points are used (the command-line help of this revision erroneously says that the default is 21 points).
lgLambdaL=R: lowest value of log_{2} λ for a logarithmic grid optimization of lambda. Default value is -16.0.
lgLambdaH=R: highest value of log_{2} λ for a logarithmic grid optimization of lambda. Default value is -6.0.
 the σ hyperparameter:
NlgSigma=N: defines the number of points on the logarithmic grid (base 2). By default, 11 points are used.
lgSigmaL=R: lowest value of log_{2} σ for a logarithmic grid optimization of sigma. Default value is 2.0 for Gaussian and Matérn kernels, 5.0 for Laplacian kernel.
lgSigmaH=R: highest value of log_{2} σ for a logarithmic grid optimization of sigma. Default value is 9.0 for Gaussian and Matérn kernels, 12.0 for Laplacian kernel.
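As an illustration of what a base-2 logarithmic grid looks like, the Python sketch below generates the grid for σ with the defaults quoted above for the Gaussian kernel (11 points, log2 σ from 2.0 to 9.0). Equal spacing of the exponents with both end points included is an assumption, and the function name log2_grid is hypothetical.

```python
def log2_grid(lowest, highest, n_points):
    """Values 2**e for n_points exponents equally spaced between lowest and highest."""
    if n_points == 1:
        return [2.0 ** lowest]
    step = (highest - lowest) / (n_points - 1)
    return [2.0 ** (lowest + i * step) for i in range(n_points)]

# Default-like settings for the Gaussian kernel: NlgSigma=11, lgSigmaL=2.0, lgSigmaH=9.0
sigma_grid = log2_grid(2.0, 9.0, 11)
```

With these settings the candidate values of σ run from 4 to 512, and each grid point is larger than the previous one by a constant factor of 2^0.7.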
First Derivatives
MLatom can also be used to calculate first derivatives given a file with an existing ML model. To request such calculations, simply add the output option YgradEstFile=[name of a file to save gradients in] to the options used with the useMLmodel option.
Example:
mlatom useMLmodel MLmodelIn=mlmod.unf XYZfile=xyz.dat YgradEstFile=ygradest.dat
This command will request making predictions with an ML model read from mlmod.unf
file for molecules provided in Cartesian coordinates in xyz.dat
file and save predicted gradients in ygradest.dat
file.
Note that YestFile
cannot be used together with YgradEstFile
.
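To give an idea of what an analytical first derivative of a kernel model involves, the Python sketch below differentiates a Gaussian kernel ridge regression prediction with respect to the input vector. It is a generic illustration under stated assumptions: the regression coefficients (called alphas here) are made up, the Gaussian kernel form is the common textbook one, and derivatives with respect to Cartesian coordinates would additionally require the chain rule through the molecular descriptor, which is not shown.

```python
import math

def krr_predict(x, train_x, alphas, sigma):
    """f(x) = sum_i alpha_i * exp(-|x - x_i|^2 / (2 sigma^2))"""
    return sum(
        a * math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
        for a, xi in zip(alphas, train_x)
    )

def krr_gradient(x, train_x, alphas, sigma):
    """Analytical gradient of krr_predict with respect to the components of x."""
    grad = [0.0] * len(x)
    for a, xi in zip(alphas, train_x):
        k = math.exp(-math.dist(x, xi) ** 2 / (2.0 * sigma ** 2))
        for d in range(len(x)):
            grad[d] += a * k * (xi[d] - x[d]) / sigma ** 2
    return grad

# Made-up model with two training points in a two-dimensional input space.
train_x = [(0.0, 0.0), (1.0, 0.5)]
alphas = [0.7, -0.3]
sigma = 1.5
x = (0.4, -0.2)
gradient = krr_gradient(x, train_x, alphas, sigma)
```

A quick way to validate such a gradient is to compare it with central finite differences of krr_predict.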
Support
Generally, no support is provided, because I already have many responsibilities. However, if you want to collaborate, have suggestions for improving the program, or want to report a bug, please write to me.