Manual for MLatom 3
NOTE! See the new manuals and tutorials for MLatom in English and Chinese (this page only contains the legacy documentation which will be slowly migrated to the new documentation).
A brief overview of the capabilities of MLatom 3 is given in the release broadcast video:
Python API
NEW! The manual and tutorials for the MLatom Python API are available in English and Chinese.
Command-line options
Below is an overview of command-line (input file) options of MLatom.
Simulations
- single-point calculations
- geometry optimizations (minima and transition states, IRC)
- frequencies & thermochemistry
- simulations with AI-enhanced QM methods and pre-trained ML models (AIQM1, ANI-1ccx, etc.)
- NEW! simulations with QM methods
- simulations with user-trained models
- NEW! molecular dynamics
- NEW! quantum dynamics with machine learning
- NEW! IR and power spectra from MD
- UV/vis spectra (ML-NEA)
- two-photon absorption cross sections (ML-TPA)
Learning
- training popular ML models (KREG, ANI, sGDML, PhysNet, DPMD, GAP-SOAP, KRR-CM)
- training generic ML models (kernel ridge regression with many kernels)
- optimizing hyperparameters
- evaluating ML models (also with learning curves)
- Δ-learning
- self-correction
Data
- converting XYZ coordinates to molecular descriptors (RE, Coulomb matrix, …)
- analyzing data sets
- sampling (random, structure-based, farthest-point) and splitting datasets
Simulations
Single-point calculations
Single-point calculations can be performed for either given geometries (one or many) or generic input vector(s) X (often, molecular descriptors representing a molecule) and can be performed either with a pre-trained model supported by MLatom or with a user-trained model.
Input arguments
The user has to choose one model, either with MLmodelIn or by giving the name of a pre-trained model.
One of the arguments, XYZfile or XfileIn, has to be chosen.
Arguments | Available and default parameters | Description |
useMLmodel MLmodelIn=[file with ML model] or AIQM1, ANI-1ccx, … | One and only one of these two options can be chosen. useMLmodel MLmodelIn: see user-trained models for details; no default parameters to MLmodelIn are provided. AIQM1, ANI-1ccx, …: see pre-trained models for a full list; this argument is optional and no default parameters are provided. | useMLmodel MLmodelIn: requests to read a file with an ML model. AIQM1, ANI-1ccx, …: requests one of the pre-trained models supported by MLatom. |
XYZfile=[file with XYZ coordinates] or XfileIn=[file with input vectors X] | One and only one of these two options can be chosen. No default file names. | XYZfile: requests predictions for one or many molecules provided in a file with their XYZ coordinates. The units of the coordinates depend on the model; for pre-trained models, Å should be used. XfileIn: requests predictions for the provided list of input vectors (one input vector per line in a text file), which are typically molecular descriptors. |
Output file arguments
At least one of the below arguments is required.
Arguments | Available and default parameters | Description |
YestFile=[file with estimated Y values] | This argument is optional and no default parameters are provided. | Saves predictions Y to the requested file. If a file with the same name already exists, the program will terminate and not overwrite it. Predictions made with pre-trained models are energies in Hartree; for other models, the predicted quantity depends on the model. |
YgradXYZestFile=[file with estimated XYZ gradients] | This argument is optional and no default parameters are provided. | Should be used only with the XYZfile option. Saves predicted XYZ gradients (first derivatives) to the requested file. If a file with the same name already exists, the program will terminate and not overwrite it. Predictions made with pre-trained models are XYZ gradients in Hartree/Å. |
YgradEstFile=[file with estimated gradients] | This argument is optional and no default parameters are provided. | Should be used only with the XfileIn option. Saves predicted gradients (first derivatives) with respect to the elements of the input vector X to the requested file. If a file with the same name already exists, the program will terminate and not overwrite it. |
Note: calculations with AIQM1-based models also generate additional output files. If properties other than those available via the above options are needed, one should use a single XYZ geometry and look at the MNDO output file mndo.out.
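The argument rules above can be checked before a job is submitted. The following is a minimal sketch of a hypothetical pre-flight helper (it is not part of MLatom). It assumes that options are whitespace-separated, case-insensitive, and that `#` starts a comment, as in the examples on this page; the set of pre-trained model names is an abbreviated, illustrative list.

```python
PRETRAINED = {"aiqm1", "ani-1ccx"}  # abbreviated, illustrative list
OUTPUTS = {"yestfile", "ygradxyzestfile", "ygradestfile"}


def parse_options(text):
    """Return {lowercased option name: value or None} from input-file text."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        for token in line.split():
            key, _, value = token.partition("=")
            options[key.lower()] = value or None
    return options


def check_single_point(text):
    """Return a list of problems; an empty list means the rules hold."""
    opts = parse_options(text)
    problems = []
    n_models = int("mlmodelin" in opts) + sum(name in PRETRAINED for name in opts)
    if n_models != 1:
        problems.append("choose exactly one model")
    if int("xyzfile" in opts) + int("xfilein" in opts) != 1:
        problems.append("choose exactly one of XYZfile and XfileIn")
    if not OUTPUTS.intersection(opts):
        problems.append("request at least one output file")
    if "ygradxyzestfile" in opts and "xyzfile" not in opts:
        problems.append("YgradXYZestFile needs XYZfile")
    if "ygradestfile" in opts and "xfilein" not in opts:
        problems.append("YgradEstFile needs XfileIn")
    return problems
```

Calling `check_single_point(open("sp.inp").read())` on an input file returns an empty list when all of the rules in the two tables are satisfied.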
Example
A single-point calculation of energies (and gradients if needed) of closed-shell molecules in the electronic ground state is the simplest job. It can be run with a 3-4 line MLatom input file, e.g., sp.inp:
AIQM1 # or useMLmodel MLmodelIn=CH3Cl.unf if you, e.g., want to use CH3Cl.unf model
xyzfile=sp.xyz
yestfile=enest.dat
ygradxyzestfile=xyzgradest.dat
This input requires a sp.xyz file with the XYZ geometries of the molecules (you can provide many molecules as usual for MLatom), e.g., for hydrogen and methane the sp.xyz file can look like this (geometries in Å):
2

H 0.000000 0.000000 0.363008
H 0.000000 0.000000 -0.363008
5

C 0.000000 0.000000 0.000000
H 0.627580 0.627580 0.627580
H -0.627580 -0.627580 0.627580
H 0.627580 -0.627580 -0.627580
H -0.627580 0.627580 -0.627580
After you have prepared the input files sp.inp and sp.xyz, you can run MLatom as usual:
mlatom sp.inp > sp.out
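The two input files can also be generated from a script. The sketch below is illustrative only: `xyz_block` and `write_job` are hypothetical helper names, and the layout assumes the standard XYZ convention in which the line after the atom count is a (possibly empty) comment line.

```python
from pathlib import Path


def xyz_block(atoms, comment=""):
    """Format one molecule: atom count, comment line, then 'El x y z' rows."""
    rows = [f"{el} {x:.6f} {y:.6f} {z:.6f}" for el, x, y, z in atoms]
    return "\n".join([str(len(atoms)), comment, *rows]) + "\n"


H2 = [("H", 0.0, 0.0, 0.363008), ("H", 0.0, 0.0, -0.363008)]
CH4 = [
    ("C", 0.0, 0.0, 0.0),
    ("H", 0.627580, 0.627580, 0.627580),
    ("H", -0.627580, -0.627580, 0.627580),
    ("H", 0.627580, -0.627580, -0.627580),
    ("H", -0.627580, 0.627580, -0.627580),
]

# Same options as in the sp.inp example above.
SP_INP = "AIQM1\nxyzfile=sp.xyz\nyestfile=enest.dat\nygradxyzestfile=xyzgradest.dat\n"


def write_job(directory):
    """Write sp.inp and sp.xyz into `directory` and return both paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    inp, xyz = directory / "sp.inp", directory / "sp.xyz"
    inp.write_text(SP_INP)
    xyz.write_text(xyz_block(H2) + xyz_block(CH4))
    return inp, xyz
```

After `write_job("sp_job")`, the job is started in that directory with the same `mlatom sp.inp > sp.out` command.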
After the calculations finish, the MLatom output sp.out will contain information that depends on the model, e.g., for AIQM1 it will contain the standard deviation of the NN predictions and the components of the AIQM1 energies:
Standard deviation of NN contribution : 0.00892407 Hartree 5.59994 kcal/mol
NN contribution : -0.00210898 Hartree
Sum of atomic self energies : -0.08587317 Hartree
ODM2* contribution : -1.09094119 Hartree
D4 contribution : -0.00000889 Hartree
Total energy : -1.17893224 Hartree
Standard deviation of NN contribution : 0.00025608 Hartree 0.16069 kcal/mol
NN contribution : 0.00958812 Hartree
Sum of atomic self energies : -33.60470494 Hartree
ODM2* contribution : -6.86968756 Hartree
D4 contribution : -0.00010193 Hartree
Total energy : -40.46490632 Hartree
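As a sanity check on such output, the printed components can be summed and compared with the reported total energy. That the four components add up to the total is read off the numbers printed above, not taken from the MLatom source, and the Hartree to kcal/mol factor used here is an approximate value. A minimal sketch:

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # approximate conversion factor

# Components copied from the sample output above (Hartree).
H2 = {"nn": -0.00210898, "self": -0.08587317, "odm2": -1.09094119, "d4": -0.00000889}
CH4 = {"nn": 0.00958812, "self": -33.60470494, "odm2": -6.86968756, "d4": -0.00010193}


def total_energy(components):
    """Sum NN, atomic self-energy, ODM2* and D4 contributions."""
    return sum(components.values())


def to_kcal_per_mol(hartree):
    """Convert an energy from Hartree to kcal/mol."""
    return hartree * HARTREE_TO_KCAL_PER_MOL


print(f"H2  total: {total_energy(H2):.8f} Hartree")
print(f"CH4 total: {total_energy(CH4):.8f} Hartree")
print(f"H2 NN std: {to_kcal_per_mol(0.00892407):.5f} kcal/mol")
```

The sums agree with the printed totals to within the rounding of the last printed digit.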
In any case, MLatom will save the predicted values in the file enest.dat, which for the above calculations will contain AIQM1 energies in Hartree:
-1.178932238420
-40.464906315250
and the XYZ gradients in Hartree/Å will be saved in the file xyzgradest.dat, which looks like this:
2
0.000000000000 0.000000000000 0.000032023551
0.000000000000 0.000000000000 -0.000032023551
5
-0.000000000000 -0.000000000000 0.000000000000
0.000490470799 0.000490470714 0.000490470881
-0.000490470799 -0.000490470714 0.000490470881
0.000490470799 -0.000490470714 -0.000490470881
-0.000490470799 0.000490470714 -0.000490470881
Note that your output may have very minor numerical differences.
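For post-processing, the gradient file can be read back into per-molecule lists. The reader below is a sketch under the assumption that each block starts with the atom count and is followed by one row of three numbers per atom; it tolerates an optional blank or short comment line after the count, but a comment line with three or more words would not be handled. The function name is hypothetical.

```python
def read_xyz_blocks(text):
    """Split XYZ-style text into molecules; each is a list of [x, y, z] rows."""
    lines = text.splitlines()
    molecules, i = [], 0
    while i < len(lines):
        if not lines[i].strip():  # skip blank lines between blocks
            i += 1
            continue
        n_atoms = int(lines[i])
        i += 1
        rows = []
        while len(rows) < n_atoms:
            fields = lines[i].split()
            i += 1
            if len(fields) < 3:  # blank or short comment line after the count
                continue
            rows.append([float(value) for value in fields[-3:]])
        molecules.append(rows)
    return molecules


SAMPLE = """2
0.000000000000 0.000000000000 0.000032023551
0.000000000000 0.000000000000 -0.000032023551
5
-0.000000000000 -0.000000000000 0.000000000000
0.000490470799 0.000490470714 0.000490470881
-0.000490470799 -0.000490470714 0.000490470881
0.000490470799 -0.000490470714 -0.000490470881
-0.000490470799 0.000490470714 -0.000490470881
"""

gradients = read_xyz_blocks(SAMPLE)
print(len(gradients), [len(molecule) for molecule in gradients])
```

Because only the last three fields of each row are used, the same function also reads coordinate rows that start with an element symbol.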
Geometry optimizations
Geometry optimizations can be performed for given geometries (one or many) with either a pre-trained model supported by MLatom or a user-trained model. Optimizations are performed using third-party software (Gaussian or ASE; please see the installation instructions). An experimental option is to use SciPy geometry optimization, but it is not well tested. On the MLatom@XACS cloud, ASE is used by default.
Input and output arguments
The user has to choose a task (minimum-energy geometry optimization, TS optimization, or IRC) and a model, either with MLmodelIn or by giving the name of a pre-trained model.
Arguments | Available and default parameters | Description |
geomopt or TS or IRC | One and only one of these arguments is required. | geomopt: requests minimum-energy geometry optimization. TS: requests optimization of a transition state structure; only works via the interface to Gaussian. IRC: requests intrinsic reaction coordinate calculations to follow the reaction path from the TS structure; only works via the interface to Gaussian. |
useMLmodel MLmodelIn=[file with ML model] | This argument is optional and no default parameters are provided. | Requests to read a file with an ML model. No default models are provided. Note that the model should predict energies in Hartree for a successful geometry optimization. |
AIQM1, ANI-1ccx, … | See pre-trained models for a full list. This argument is optional and no default parameters are provided. | Requests one of the pre-trained models supported by MLatom, e.g., AIQM1, ANI-1ccx, etc. |
XYZfile=[file with XYZ coordinates] | This argument is required; no default parameters are provided. | File with the initial XYZ coordinates of one or many molecules (geomopt or TS) or file with the optimized TS structure (with IRC). The units of the coordinates should be Å. |
optprog=[either Gaussian or ASE or SciPy] | By default, MLatom will try to use Gaussian. If Gaussian is not found, it will try to use ASE. If ASE is not found, it will try to use SciPy. | Chooses a third-party program for optimization. The default algorithms in Gaussian, ASE, and SciPy are Berny optimization, LBFGS, and BFGS, respectively. Optimization settings can be changed by using additional options for the third-party program, see below. If ASE is used, it may terminate after the maximum number of iterations without informing the user that the geometry is not optimized. |
optxyz=[file with optimized XYZ coordinates] | default file name: optgeoms.xyz . | saves optimized geometries in the requested file. If this file already exists, MLatom will terminate without overwriting it. |
Note: optimizations with the Gaussian program also generate Gaussian output files mol_1.log, mol_2.log, etc., which contain additional information.
Additional options for third-party programs
Arguments | Default parameters |
ase.fmax=[threshold of maximum force (in eV/Å)] | 0.02 |
ase.steps=[maximum number of optimization steps] | 200 |
ase.optimizer=[LBFGS or BFGS] | LBFGS. |
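To get a feeling for the ase.fmax threshold, predicted gradients in Hartree/Å can be converted to forces in eV/Å and compared with it. The sketch below assumes the usual ASE convergence criterion (the largest per-atom force norm must drop below fmax) and assumes that MLatom passes ase.fmax through unchanged; the conversion factor is approximate and the function names are illustrative.

```python
import math

HARTREE_TO_EV = 27.211386  # approximate conversion factor


def max_force(gradients):
    """Largest per-atom force norm in eV/Å from gradients in Hartree/Å."""
    norms = (math.sqrt(gx * gx + gy * gy + gz * gz) for gx, gy, gz in gradients)
    return max(norms) * HARTREE_TO_EV


def is_converged(gradients, fmax=0.02):
    """True if every atomic force norm is below fmax (in eV/Å)."""
    return max_force(gradients) < fmax


# Gradients taken from the single-point example (xyzgradest.dat), in Hartree/Å.
H2_GRAD = [(0.0, 0.0, 0.000032023551), (0.0, 0.0, -0.000032023551)]
CH4_GRAD = [
    (0.0, 0.0, 0.0),
    (0.000490470799, 0.000490470714, 0.000490470881),
    (-0.000490470799, -0.000490470714, 0.000490470881),
    (0.000490470799, -0.000490470714, -0.000490470881),
    (-0.000490470799, 0.000490470714, -0.000490470881),
]

print(f"H2:  {max_force(H2_GRAD):.5f} eV/Å")
print(f"CH4: {max_force(CH4_GRAD):.5f} eV/Å")
```

The force and the gradient differ only in sign, so their norms are identical and the sign does not matter for this check.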
Example
Geometry optimization of closed-shell molecules in the electronic ground state is as simple as running single-point calculations. A 4-line MLatom input file, e.g., opt.inp, looks like this:
AIQM1 # or useMLmodel MLmodelIn=CH3Cl.unf if you, e.g., want to use CH3Cl.unf model
xyzfile=init.xyz
optxyz=opt.xyz
geomopt
This input requires an init.xyz file with the initial XYZ geometries of the molecules to be optimized (you can provide many molecules as usual for MLatom), e.g., for hydrogen and methane the init.xyz file can look like this (geometries in Å):
2
Hydrogen molecule
H 0.0000000000 0.0000000000 0.0000000000
H 0.7414000000 0.0000000000 0.0000000000
5
Methane molecule
C 0.0000000000 0.0000000000 0.0000000000
H 1.0870000000 0.0000000000 0.0000000000
H -0.3623333220 -1.0248334322 -0.0000000000
H -0.3623333220 0.5124167161 -0.8875317869
H -0.3623333220 0.5124167161 0.8875317869
After you have prepared the input files opt.inp and init.xyz, you can run MLatom as usual:
mlatom opt.inp > opt.out
The MLatom output file opt.out should contain lines similar to those below (if the interface to the Gaussian program was used for optimization):
******************************************************************************
optprog: Gaussian 16
Standard deviation of NN contribution : 0.00892062 Hartree 5.59777 kcal/mol
NN contribution : -0.00210740 Hartree
Sum of atomic self energies : -0.08587317 Hartree
ODM2* contribution : -1.09094281 Hartree
D4 contribution : -0.00000889 Hartree
Total energy : -1.17893227 Hartree
Standard deviation of NN contribution : 0.00025608 Hartree 0.16069 kcal/mol
NN contribution : 0.00958812 Hartree
Sum of atomic self energies : -33.60470494 Hartree
ODM2* contribution : -6.86968742 Hartree
D4 contribution : -0.00010193 Hartree
Total energy : -40.46490617 Hartree
==============================================================================
Wall-clock time: 21.60 s (0.36 min, 0.01 hours)
MLatom terminated on 11.11.2022 at 13:21:37
==============================================================================
After the calculations finish, the optimized geometries are saved in a single file opt.xyz, which for our example looks like this (geometries in Å; there can be slight numerical differences depending on the machine, etc.):
2
H 0.00770082 0.00000000 0.00000000
H 0.73369918 0.00000000 0.00000000
5
C 0.00000000 0.00000000 0.00000000
H 1.08666998 -0.00000000 0.00000000
H -0.36222332 -1.02452229 -0.00000000
H -0.36222332 0.51226114 -0.88726233
H -0.36222332 0.51226114 0.88726233
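A quick way to see what the optimization changed is to compare bond lengths before and after. The sketch below uses only the coordinates printed in this example; the helper name is illustrative.

```python
import math


def bond_lengths(center, others):
    """Distances (in Å) from one atom to each atom in `others`."""
    return [math.dist(center, atom) for atom in others]


# H2: initial (init.xyz) and optimized (opt.xyz) coordinates in Å.
h2_initial = [(0.0, 0.0, 0.0), (0.7414, 0.0, 0.0)]
h2_optimized = [(0.00770082, 0.0, 0.0), (0.73369918, 0.0, 0.0)]

# Methane: optimized coordinates in Å, carbon first.
ch4_optimized = [
    (0.0, 0.0, 0.0),
    (1.08666998, 0.0, 0.0),
    (-0.36222332, -1.02452229, 0.0),
    (-0.36222332, 0.51226114, -0.88726233),
    (-0.36222332, 0.51226114, 0.88726233),
]

hh_before = math.dist(*h2_initial)
hh_after = math.dist(*h2_optimized)
ch_after = bond_lengths(ch4_optimized[0], ch4_optimized[1:])

print(f"H-H: {hh_before:.5f} -> {hh_after:.5f} Å")
print("C-H:", ", ".join(f"{d:.5f}" for d in ch_after))
```

For the numbers above, the H-H distance shortens slightly and the four C-H distances remain equal to each other, as expected for a tetrahedral structure.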
Frequencies and thermochemistry
Calculation of frequencies and thermochemical properties can be performed for given optimized geometries (one or many) with either a pre-trained model supported by MLatom or a user-trained model. The geometries should first be optimized with the same model before these calculations are performed. The calculations require third-party software (Gaussian or ASE; please see the installation instructions). On the MLatom@XACS cloud, ASE is used.
Input arguments
The user has to choose a model, either with MLmodelIn or by giving the name of a pre-trained model.
Arguments | Available and default parameters | Description |
freq | This argument is required. | Requests frequency calculations. |
useMLmodel MLmodelIn=[file with ML model] | This argument is optional and no default parameters are provided. | Requests to read a file with an ML model. No default models are provided. Note that the model should predict energies in Hartree for successful frequency calculations. |
AIQM1 , ANI-1ccx , … | see pre-trained model for a full list. This argument is optional and no default parameters are provided. | requests one of the pre-trained models supported by MLatom, e.g., AIQM1, ANI-1ccx, etc. If AIQM1 or ANI-1ccx is requested, MLatom will also calculate enthalpies of formation at 298 K, see tutorial. The standard deviation of neural networks in AIQM1 and ANI-1ccx models will be reported and if it is larger than 0.41 and 1.68 kcal/mol, respectively, predicted enthalpies of formation have potentially too high uncertainty and a warning will be reported in the output file. |
XYZfile=[file with optimized XYZ coordinates] | this argument is required; no default parameters are provided. | file with optimized XYZ coordinates of one or many molecules. The units of coordinates should be Å. |
optprog=[either Gaussian or ASE] | by default, MLatom will try to use Gaussian. If Gaussian is not found, it will try to use ASE. | chooses a third-party program for frequencies calculations (optprog is a correct name, it is not freqprog ). |
Note: frequency calculations with the Gaussian program also generate Gaussian output files mol_1.log, mol_2.log, etc., which contain additional information.
Additional options for third-party programs
Arguments | Available and default parameters | Description |
ase.linear=N,…,N | 0 [default] or 1 | 0 for a nonlinear molecule, 1 for a linear molecule. The order is the same as in the XYZ file. |
ase.symmetrynumber=N,…,N | 1 [default] | Rotational symmetry number of each molecule (see Table 10.1 and Appendix B of C. Cramer “Essentials of Computational Chemistry”, 2nd Ed.). This number only affects the entropy and free energy, and its influence is usually very small. The order is the same as in the XYZ file. |
Example
Thermochemical properties of closed-shell molecules in the electronic ground state can be calculated at the ANI-1ccx or AIQM1 level by adding the argument freq to the MLatom input file, e.g., freq.inp. The calculations should be run on geometries optimized with the corresponding model. An example of an MLatom input file using ANI-1ccx-optimized geometries:
ANI-1ccx
xyzfile=optgeoms.xyz
freq
optprog=ASE # or optprog=gaussian if you choose Gaussian
ase.linear=1,0
ase.symmetrynumber=2,12
When ASE is used for the calculation of thermochemical properties, you should specify the two keywords ase.linear and ase.symmetrynumber. ase.linear is 0 for a nonlinear molecule and 1 for a linear molecule, and ase.symmetrynumber is the rotational symmetry number of each molecule (see Table 10.1 and Appendix B of C. Cramer “Essentials of Computational Chemistry”, 2nd Ed.). For example, for the two molecules hydrogen and methane, you should set ase.linear=1,0 and ase.symmetrynumber=2,12.
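The ase.linear flags can be derived from the geometries instead of being typed by hand. The sketch below is a hypothetical helper, not an MLatom feature: it tests whether all atoms lie on one line (assuming the first two atoms do not coincide) and formats both keywords in the order of the molecules. The symmetry numbers still have to be supplied by the user.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def is_linear(coords, tol=1e-6):
    """True if all atoms lie on one line; any diatomic counts as linear."""
    if len(coords) < 3:
        return True
    axis = _sub(coords[1], coords[0])
    for point in coords[2:]:
        offset = _cross(axis, _sub(point, coords[0]))
        if math.sqrt(sum(c * c for c in offset)) > tol:
            return False
    return True


def ase_keywords(molecules, symmetry_numbers):
    """Return the ase.linear and ase.symmetrynumber lines for an input file."""
    linear = ",".join("1" if is_linear(m) else "0" for m in molecules)
    symmetry = ",".join(str(n) for n in symmetry_numbers)
    return [f"ase.linear={linear}", f"ase.symmetrynumber={symmetry}"]


h2 = [(0.15255733, 0.0, 0.0), (0.58884267, 0.0, 0.0)]
ch4 = [(0.0, 0.0, 0.0), (1.08733372, 0.0, 0.0), (-0.36244456, -1.02514806, 0.0),
       (-0.36244456, 0.51257403, -0.88780426), (-0.36244456, 0.51257403, 0.88780426)]

print("\n".join(ase_keywords([h2, ch4], [2, 12])))
```

For hydrogen followed by methane, this reproduces the two lines used in the input file above.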
The file with preoptimized geometries optgeoms.xyz for our example is (geometries in Å):
2
hydrogen
H 0.15255733 0.00000000 0.00000000
H 0.58884267 0.00000000 0.00000000
5
methane
C 0.00000000 0.00000000 0.00000000
H 1.08733372 -0.00000000 0.00000000
H -0.36244456 -1.02514806 0.00000000
H -0.36244456 0.51257403 -0.88780426
H -0.36244456 0.51257403 0.88780426
After you have prepared the input files freq.inp and optgeoms.xyz, you can run MLatom as usual:
mlatom freq.inp > freq.out
After the calculations finish, the MLatom output freq.out will contain a summary with the atomization enthalpy at 0 K, the ZPVE-exclusive atomization energy at 0 K, and the heat of formation at 298.15 K for each molecule. If you use ASE, the MLatom output will also include additional data such as the entropy and the Gibbs free energy:
......
Zero-point vibrational energy : 4.07528 kcal/mol
Atomization enthalpy at 0 K : 126.42974 kcal/mol
ZPE exclusive atomization energy at 0 K : 130.50502 kcal/mol
Heat of formation at 298.15 K : -23.11424 kcal/mol
* Warning * Heat of formation have high uncertainty!
......
Zero-point vibrational energy : 27.87144 kcal/mol
Atomization enthalpy at 0 K : 391.92513 kcal/mol
ZPE exclusive atomization energy at 0 K : 419.79657 kcal/mol
Heat of formation at 298.15 K : -17.63420 kcal/mol
If you use Gaussian, the Gaussian output files of the frequency calculations are saved in the files mol_1.log, mol_2.log, … for each molecule; these files contain the ZPVE and a lot of thermochemical data such as the entropy and the Gibbs free energy.
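The summary values are related to each other: in the sample output, the ZPE-exclusive atomization energy equals the atomization enthalpy at 0 K plus the zero-point vibrational energy. The sketch below parses such a summary block and checks this relation; the parser is a hypothetical helper that assumes the `label : value kcal/mol` layout shown in the sample output.

```python
SUMMARY_H2 = """\
Zero-point vibrational energy : 4.07528 kcal/mol
Atomization enthalpy at 0 K : 126.42974 kcal/mol
ZPE exclusive atomization energy at 0 K : 130.50502 kcal/mol
Heat of formation at 298.15 K : -23.11424 kcal/mol
* Warning * Heat of formation have high uncertainty!
"""

SUMMARY_CH4 = """\
Zero-point vibrational energy : 27.87144 kcal/mol
Atomization enthalpy at 0 K : 391.92513 kcal/mol
ZPE exclusive atomization energy at 0 K : 419.79657 kcal/mol
Heat of formation at 298.15 K : -17.63420 kcal/mol
"""


def parse_summary(text):
    """Return {label: value in kcal/mol} for lines like 'label : 1.0 kcal/mol'."""
    values = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and len(fields) == 2 and fields[1] == "kcal/mol":
            values[label.strip()] = float(fields[0])
    return values


def zpe_mismatch(values):
    """ZPE-exclusive energy minus (0 K atomization enthalpy + ZPVE)."""
    return (values["ZPE exclusive atomization energy at 0 K"]
            - values["Atomization enthalpy at 0 K"]
            - values["Zero-point vibrational energy"])


for name, text in (("H2", SUMMARY_H2), ("CH4", SUMMARY_CH4)):
    print(name, f"{zpe_mismatch(parse_summary(text)):+.5f} kcal/mol")
```

A mismatch close to zero for both molecules confirms that the three quantities are mutually consistent.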
UV/vis spectra
UV/vis spectra (cross sections) can be calculated with the ML-Nuclear Ensemble Approach (ML-NEA). Detailed tutorial 1 and tutorial 2 are available.
For full functionality, Newton-X (tested with version 2.2) and Gaussian should be installed (see the installation instructions, including setting the appropriate environment variables such as $NX and $GAUSS_EXEDIR). Neither Newton-X nor Gaussian is available on the MLatom@XACS cloud.
optional arguments:
- Nexcitations=N: number of excited states to calculate (default=3)
- nQMpoints=N: user-defined number of QM calculations for training ML (default=0, the number of QM calculations will be determined iteratively)
- plotQCNEA: requests plotting the QC-NEA cross section
- deltaQCNEA=float: defines the broadening parameter of the QC-NEA cross section
- plotQCSPC: requests plotting the cross section obtained via single-point convolution
advanced arguments (not recommended to modify):
- nMaxPoints=N: maximum number of QC calculations in the iterative procedure (default=10000)
- MLpoints=N: number of ML calculations (default=50000)
required files:
- mandatory files
  - gaussian_optfreq.com: input file for Gaussian opt and freq calculations. Alternatively, the files eq.xyz (XYZ file with the equilibrium, optimized, geometry) and nea_geoms.xyz (file with all geometries in the nuclear ensemble) can be provided.
  - gaussian_ef.com: template file for calculating excitation energies and oscillator strengths with Gaussian.
- optional files
  - cross-section_ref.dat: reference cross section file in a format similar to that of Newton-X (1st column: DE/eV; 2nd column: lambda/nm; 3rd column: sigma/Å^2)
  - eq.xyz: file with the optimized geometry (has to be used together with nea_geoms.xyz)
  - nea_geoms.xyz: file with all geometries in the nuclear ensemble (has to be used together with eq.xyz)
  - E1.dat E2.dat ... and f1.dat f2.dat ...: files that store the excitation energies and oscillator strengths, one per line, in the order of the geometries in nea_geoms.xyz.
output files:
- cross-section/cross-section_ml-nea.dat: cross-section spectrum calculated with the ML-NEA method
- cross-section/cross-section_qc-nea.dat: cross-section spectrum calculated with the QC-NEA method
- cross-section/cross-section_spc.dat: cross-section spectrum calculated with single-point convolution
- cross-section/plot.png: plot containing the cross sections calculated with the different methods.
Simulations with AI-enhanced QM methods and pre-trained ML models
MLatom supports calculations with the following pre-trained ML-based models (MLatom arguments are spelled exactly the same way as the method names given below):
- AIQM1, AIQM1@DFT, AIQM1@DFT* (tutorial with examples and installation instructions)
  - Strengths: AIQM1 approaches CCSD(T)/CBS accuracy at the speed of semiempirical methods (a thousand times faster than DFT) for energy calculations and geometry optimizations of closed-shell molecules in their ground state. It is also transferable, with good accuracy, to calculations of charged and radical species as well as to excited-state calculations. MLatom also reports the standard deviation of the neural network correction; if it is larger than 0.41 kcal/mol, the AIQM1 calculations have potentially too high uncertainty, and a warning is written to the output file if heats of formation are predicted.
  - Limitations: only CHNO elements are supported. On the XACS cloud, no analytical gradients are available, i.e., geometry optimizations and frequency calculations are rather slow; install MLatom locally if higher efficiency is needed.
- ANI-1ccx, ANI-1x, ANI-2x, ANI-1x-D4 and ANI-2x-D4 (requires TorchANI to be installed)
  - Strengths: faster than AIQM1. ANI-1ccx also approaches CCSD(T)/CBS accuracy for energy calculations and geometry optimizations of closed-shell molecules in their ground state but is generally less accurate and reliable than AIQM1. MLatom also reports the standard deviation of the neural networks; if it is larger than 1.68 kcal/mol, the ANI-1ccx calculations have potentially too high uncertainty, and a warning is written to the output file if heats of formation are predicted.
  - Limitations: only CHNO elements are supported by ANI-1ccx and ANI-1x; CHNOFClS are supported by ANI-2x. Not transferable to calculations of charged and radical species or to excited-state calculations. The accuracy for noncovalent interactions is not good if no D4 correction is included.
They can be used for such typical simulations as (see the corresponding sections for more details):
- single-point calculations
- geometry optimizations
- frequencies and thermochemistry
- molecular dynamics
- infrared spectra
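The uncertainty thresholds quoted above (0.41 kcal/mol for AIQM1 and 1.68 kcal/mol for ANI-1ccx) can be applied to the standard deviations printed in the output. The following sketch only illustrates the comparison; it is not MLatom's own implementation, and the Hartree to kcal/mol factor is an approximate value.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095  # approximate conversion factor

# Thresholds in kcal/mol as stated in this manual.
NN_STD_THRESHOLDS = {"AIQM1": 0.41, "ANI-1ccx": 1.68}


def has_high_uncertainty(model, std_hartree):
    """True if the NN standard deviation exceeds the threshold for `model`."""
    std_kcal = std_hartree * HARTREE_TO_KCAL_PER_MOL
    return std_kcal > NN_STD_THRESHOLDS[model]


# Standard deviations from the AIQM1 sample output (Hartree).
print("H2 :", has_high_uncertainty("AIQM1", 0.00892407))
print("CH4:", has_high_uncertainty("AIQM1", 0.00025608))
```

Note that, according to the text above, MLatom reports the warning only when heats of formation are predicted; the function here checks the threshold alone.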
Optional arguments
AIQM1 uses interfaces to MNDO or Sparrow to calculate the QM contributions. Thus, the following AIQM1-specific arguments can be used:
Arguments | Available and default parameters | Description |
QMprog=[program] | MNDO [default], Sparrow [default if MNDO is not found] | Chooses a program for calculating the QM part of AIQM1. If neither the MNDO nor the Sparrow program is found, MLatom will not be able to run AIQM1 calculations. |
mndokeywords=[file with MNDO keywords, e.g., mndokw] | Allows modifying the input to MNDO to request non-standard calculations, e.g., to define the charge, multiplicity, excited-state calculation settings, convergence criteria, etc. These keywords can be provided to MLatom via an MNDO keyword file, e.g., a mndokw file, which should contain at least the keywords iop=-22 immdp=-1. For more details, see the AIQM1 tutorial and the MNDO documentation. |
Note: calculations with AIQM1-based models also generate additional output files. If properties other than those available via the above arguments are needed, one should perform calculations on a single XYZ geometry and look at the MNDO output file mndo.out.
Example
Geometry optimization of closed-shell molecules in the electronic ground state is as simple as running single-point calculations. A 4-line MLatom input file, e.g., opt.inp, looks like this:
AIQM1 # or ANI-1ccx, ANI-2x, etc.
xyzfile=init.xyz
optxyz=opt.xyz
geomopt
This input requires an init.xyz file with the initial XYZ geometries of the molecules to be optimized (you can provide many molecules as usual for MLatom), e.g., for hydrogen and methane the init.xyz file can look like this (geometries in Å):
2
Hydrogen molecule
H 0.0000000000 0.0000000000 0.0000000000
H 0.7414000000 0.0000000000 0.0000000000
5
Methane molecule
C 0.0000000000 0.0000000000 0.0000000000
H 1.0870000000 0.0000000000 0.0000000000
H -0.3623333220 -1.0248334322 -0.0000000000
H -0.3623333220 0.5124167161 -0.8875317869
H -0.3623333220 0.5124167161 0.8875317869
After you have prepared the input files opt.inp and init.xyz, you can run MLatom as usual:
mlatom opt.inp > opt.out
The MLatom output file opt.out should contain lines similar to those below (if the interface to the Gaussian program was used for optimization):
******************************************************************************
optprog: Gaussian 16
Standard deviation of NN contribution : 0.00892062 Hartree 5.59777 kcal/mol
NN contribution : -0.00210740 Hartree
Sum of atomic self energies : -0.08587317 Hartree
ODM2* contribution : -1.09094281 Hartree
D4 contribution : -0.00000889 Hartree
Total energy : -1.17893227 Hartree
Standard deviation of NN contribution : 0.00025608 Hartree 0.16069 kcal/mol
NN contribution : 0.00958812 Hartree
Sum of atomic self energies : -33.60470494 Hartree
ODM2* contribution : -6.86968742 Hartree
D4 contribution : -0.00010193 Hartree
Total energy : -40.46490617 Hartree
==============================================================================
Wall-clock time: 21.60 s (0.36 min, 0.01 hours)
MLatom terminated on 11.11.2022 at 13:21:37
==============================================================================
After the calculations finish, the optimized geometries are saved in a single file opt.xyz, which for our example looks like this (geometries in Å; there can be slight numerical differences depending on the machine, etc.):
2
H 0.00770082 0.00000000 0.00000000
H 0.73369918 0.00000000 0.00000000
5
C 0.00000000 0.00000000 0.00000000
H 1.08666998 -0.00000000 0.00000000
H -0.36222332 -1.02452229 -0.00000000
H -0.36222332 0.51226114 -0.88726233
H -0.36222332 0.51226114 0.88726233
Simulations with QM methods
MLatom supports QM calculations with various popular programs.
Arguments | Available and default parameters |
method=[QM method] (required, case insensitive), or one of the standard methods recognized by MLatom from the following short list: GFN2-xTB (interface to xtb); CCSD(T)*/CBS (interface to ORCA); ODM2 (interface to MNDO); ODM2* (interface to MNDO and Sparrow) | QM methods can be given in the usual format such as method=B3LYP/6-31G*. Depending on the available interfaced QM program (see below), supported methods include ab initio, DFT, and semiempirical QM methods. |
qmprog=[supported QM program] (case insensitive) | Supported QM programs include PySCF, Gaussian, xtb, MNDO, and Sparrow (see their manuals in the links for more information about the methods implemented there; a brief but incomplete overview is given below). |
QMprogramKeywords=[file with keywords of QM program] (optional) | Currently, only xtb, MNDO, and Sparrow keywords are supported for the respective programs. |
multiplicities=[multiplicities of molecules] | Default value is 1. If more than one molecule is provided, please use commas to separate the multiplicities, e.g., multiplicities=3,3. |
charges=[charges of molecules] | Default value is 0. If more than one molecule is provided, please use commas to separate the charges, e.g., charges=1,-1. |
nthreads=[number of threads used] | Default value is 1. |
Example:
When doing simulations with QM methods, it is necessary to include the method and qmprog keywords in your input file (except for xTB, which does not require qmprog). The input file for each QM program is shown in detail after the example. The molecular structure file and output file can be defined as usual; see the section on single-point calculations above. Generally, the input file sp.inp should look like this:
method=B3LYP/6-31G*
qmprog=gaussian
xyzfile=sp.xyz
yestfile=enest.dat
where the input structure file sp.xyz
contains:
5

C 0.00000000 0.00000000 0.00000000
H 0.62783705 -0.62783705 0.62783705
H -0.62783705 0.62783705 0.62783705
H -0.62783705 -0.62783705 -0.62783705
H 0.62783705 0.62783705 -0.62783705
After running mlatom sp.inp > sp.out, the output file sp.out will give the calculated single-point energy (also saved in enest.dat):
******************************************************************************
You are going to use feature(s) listed below.
Please cite corresponding work(s) in your paper:
Gaussian program:
See the Gaussian output file for
the proper citation
******************************************************************************
Energy of molecule 1: -40.5182964000000 Hartree
==============================================================================
Wall-clock time: 1.07 s (0.02 min, 0.00 hours)
MLatom terminated on 08.10.2023 at 10:13:54
==============================================================================
If definition of charges and multiplicities is required, the input file can also look like this:
method=B3LYP/6-31G*
qmprog=gaussian
xyzfile=sp.xyz
yestfile=enest.dat
charges=0,0,1
multiplicities=3,3,1
where the structures of 3 molecules are included in sp.xyz. Charges and multiplicities should be defined in the same order as the molecules.
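Before starting a QM job, it can be worth checking that each charge and multiplicity pair is physically possible for its molecule: the number of unpaired electrons (multiplicity minus one) must have the same parity as the total number of electrons. The sketch below is a hypothetical stand-alone check with a deliberately small element table; it is not something MLatom is stated to do.

```python
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8}  # small illustrative table


def parse_int_list(option):
    """Turn an option such as 'charges=0,0,1' into [0, 0, 1]."""
    return [int(value) for value in option.split("=", 1)[1].split(",")]


def spin_is_consistent(elements, charge, multiplicity):
    """True if `multiplicity` is possible for the molecule's electron count."""
    electrons = sum(ATOMIC_NUMBERS[el] for el in elements) - charge
    unpaired = multiplicity - 1
    return 0 <= unpaired <= electrons and (electrons - unpaired) % 2 == 0


methane = ["C", "H", "H", "H", "H"]
charges = parse_int_list("charges=0,1")
multiplicities = parse_int_list("multiplicities=1,2")

# Neutral singlet methane and the doublet methane cation are both consistent.
for charge, multiplicity in zip(charges, multiplicities):
    print(charge, multiplicity, spin_is_consistent(methane, charge, multiplicity))
```

Comparing `len(charges)` and `len(multiplicities)` with the number of molecules in the XYZ file catches the other common mistake, a list that is too short or too long.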
QM programs and their supported QM methods:
Gaussian
Gaussian is a commonly used QM program that supports various types of calculations at different levels of theory. Each Gaussian job should specify both the method and the basis set (usually separated by ‘/’), which are given via the method keyword in MLatom. For the methods available, see https://gaussian.com/capabilities/?tabid=0. A typical MLatom input file using Gaussian looks like this:
method=B3LYP/6-31G*
qmprog=gaussian
xyzfile=sp.xyz
yestfile=enest.dat
PySCF
The Python-based Simulations of Chemistry Framework (PySCF) is an open-source Python package that possesses various electronic structure modules. It can be used to simulate the properties of molecules, crystals, and custom Hamiltonians using mean-field and post-mean-field methods. For methods available, see https://pyscf.org/user.html
Here is the list of methods and jobs currently supported with PySCF.
- Energy: HF, MP2, DFT, CISD, FCI, CCSD/CCSD(T), TD-DFT/TD-HF
- Gradients: HF, MP2, DFT, CISD, CCSD, RCCSD(T), TD-DFT/TD-HF
- Hessian: HF, DFT
method=b3lyp/6-31g*
qmprog=pyscf
xyzfile=sp.xyz
yestfile=enest.dat
xTB
The open-source semiempirical extended tight binding (xTB) program supports calculations with the popular semiempirical quantum mechanical methods GFNn-xTB. Currently, MLatom only supports GFN2-xTB.
method=GFN2-xTB
xyzfile=sp.xyz
yestfile=enest.dat
QMprogramKeywords=xtb_kw # optional
The xtb_kw file looks like this (for details, see the xTB command-line options at https://xtb-docs.readthedocs.io/en/latest/commandline.html):
-c 1 -u 3
Note:
- When using xTB, there is no need to specify qmprog=xTB.
- If GFN-xTB is to be used, please specify --gfn 1 in the keyword file.
MNDO
MNDO is a semiempirical quantum chemistry program that supports semiempirical calculations using orthogonalization corrections. For methods available, please refer to https://mndo.kofo.mpg.de/input.php
method=ODM2
qmprog=mndo
xyzfile=sp.xyz
yestfile=enest.dat
QMprogramKeywords=mndokw # optional
Sparrow
SCINE Sparrow is an open-source command line tool for various semiempirical methods including MNDO-type models and DFTB models. For methods available, please refer to https://scine.ethz.ch/download/sparrow
method=ODM2*
qmprog=sparrow
xyzfile=sp.xyz
yestfile=enest.dat
QMprogramKeywords=SparrowKW # optional
It might be useful to use QMprogramKeywords for, e.g., printing orbitals in the Molden wavefunction format. The content of the file would then be -W.
Simulations with user-trained models
MLatom can read a user-trained model from a file to make predictions (single-point calculations) with it for new data (given either as input vectors X or as XYZ coordinates) and ultimately to perform the following simulations (for data given in XYZ coordinates):
The models can be either native MLatom or from third-party interfaces to popular ML model types:
- kernel ridge regression (KRR) models
- KREG (native). See tutorial. Can only be used for single-molecule PES.
- KRR-CM (KRR with Coulomb matrix, native).
- ANI (through TorchANI)
- DeepPot-SE and DPMD (through DeePMD-kit)
- GAP–SOAP (through GAP suite and QUIP)
- PhysNet (through PhysNet)
- sGDML (through sGDML). Can only be used for single-molecule PES
Arguments Top↑
Calculations with native implementations do not require additional arguments, while the use of third-party models requires the specification of a model type and/or the name of a third-party program, i.e., the user should provide the MLmodelType
and/or MLprog
argument (see also installation instructions). They can be used for such typical simulations as (see the corresponding sections for more details).
Arguments | Available and default parameters |
MLmodelType=[supported ML model type] | Supported model types with their default MLprog: KREG (MLatomF), sGDML (sGDML), GAP-SOAP (GAP), PhysNet (PhysNet), DeepPot-SE (DeePMD-kit), ANI (TorchANI). |
MLprog=[supported ML program] | Supported interfaces with default and tested ML model types: MLatomF: KREG [default], see MLatom.py KRR help; sGDML: sGDML [default], GDML; GAP: GAP-SOAP; PhysNet: PhysNet; DeePMD-kit: DeepPot-SE [default], DPMD; TorchANI: ANI [default]. |
Note: calculations with third-party programs may also generate additional output files.
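The relation between MLmodelType and MLprog can be summarized as a simple lookup. The following sketch only restates the two tables above; the function name `resolve` is hypothetical, and falling back to KREG with MLatomF when neither argument is given is an assumption based on the statement that native implementations need no additional arguments.

```python
# Default program for each model type (first table above; GDML and DPMD
# are the additional tested types listed in the second table).
DEFAULT_MLPROG = {
    "KREG": "MLatomF",
    "sGDML": "sGDML",
    "GDML": "sGDML",
    "GAP-SOAP": "GAP",
    "PhysNet": "PhysNet",
    "DeepPot-SE": "DeePMD-kit",
    "DPMD": "DeePMD-kit",
    "ANI": "TorchANI",
}

# Default model type for each program (second table above).
DEFAULT_MLMODELTYPE = {
    "MLatomF": "KREG",
    "sGDML": "sGDML",
    "GAP": "GAP-SOAP",
    "PhysNet": "PhysNet",
    "DeePMD-kit": "DeepPot-SE",
    "TorchANI": "ANI",
}


def resolve(mlmodeltype=None, mlprog=None):
    """Fill in whichever of the two arguments was not given."""
    if mlmodeltype is None and mlprog is None:
        return "KREG", "MLatomF"  # assumed native default
    if mlmodeltype is None:
        mlmodeltype = DEFAULT_MLMODELTYPE[mlprog]
    if mlprog is None:
        mlprog = DEFAULT_MLPROG[mlmodeltype]
    return mlmodeltype, mlprog


print(resolve(mlmodeltype="ANI"))
print(resolve(mlprog="DeePMD-kit"))
```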
Example Top↑
Below is an example input file that uses a KREG model to optimize a geometry (see tutorial):
geomopt # Request geometry optimization
useMLmodel # using existing ML model
MLmodelIn=energies.unf # in energies.unf file
MLmodelType=KREG # of the KREG type
xyzfile=eq.xyz # The file with initial guess
Molecular dynamics Top↑
MLatom can now perform molecular dynamics of molecular systems with various methods and models through its interfaces to many popular quantum chemistry and machine learning packages.
Input and output arguments Top↑
Arguments | Available and default parameters | Description |
---|---|---|
dt | 0.1 by default | time step; unit: fs |
trun | 1000 by default | length of trajectory; unit: fs |
initXYZ | required | user-provided initial geometry (should be in Angstrom) |
initVXYZ | required when initConditions=user-defined | user-provided initial velocity (should be in Angstrom/fs) |
initConditions | user-defined by default, other options: random | algorithm of generating initial conditions |
initTemperature | 300 by default | initial temperature; unit: K; necessary when initConditions=random |
initXYZout | | output file of initial geometry |
initVXYZout | | output file of initial velocity |
Thermostat | NVE by default, other options: Andersen, Nose-Hoover | MD thermostat |
Temperature | 300 by default | environment temperature; unit: K |
Gamma | 0.2 by default, required when Thermostat=Andersen | collision frequency; unit: fs^-1 |
NHClength | 3 by default, required when Thermostat=Nose-Hoover | Nose-Hoover chain length |
Nc | 3 by default, required when Thermostat=Nose-Hoover | multiple time step |
Nys | 7 by default, required when Thermostat=Nose-Hoover | number of Yoshida-Suzuki steps; only 1,3,5,7 are available |
NHCfreq | 0.0625 by default, required when Thermostat=Nose-Hoover | Nose-Hoover chain frequency; unit: fs^-1 |
trajH5MDout | traj.h5 by default | trajectory saved in H5MD file format |
trajTextout | traj by default | trajectory saved in plain text format |
Example Top↑
Below is an example of how to run MD with MLatom:
MD # Molecular dynamics
method=AIQM1 # Use AIQM1
initConditions=user-defined # Use user-defined initial conditions
initXYZ=init.xyz # File with initial geometry
initVXYZ=init.vxyz # File with initial velocities
dt=0.1 # Time step
trun=100000 # Length of trajectory
thermostat=nose-hoover # Use Nose-Hoover thermostat
temperature=300 # Set temperature
trajH5MDout=traj.h5 # Save trajectory in traj.h5
qmprog=mndo
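When preparing long runs it can help to check how many integration steps an input implies. The sketch below parses an input like the one above and divides trun by dt. The parser `parse_mlatom_input` is a hypothetical helper, not part of MLatom, and it assumes that option names are case-insensitive and that `#` starts a comment.

```python
def parse_mlatom_input(text):
    """Parse bare keywords and key=value lines; '#' starts a comment."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        # Bare keywords such as MD are stored as True
        options[key.strip().lower()] = value.strip() if sep else True
    return options


md_input = """
MD                            # Molecular dynamics
method=AIQM1                  # Use AIQM1
initXYZ=init.xyz              # File with initial geometry
initVXYZ=init.vxyz            # File with initial velocities
dt=0.1                        # Time step, fs
trun=100000                   # Length of trajectory, fs
"""

opts = parse_mlatom_input(md_input)
# Number of time steps implied by the trajectory length and the time step
n_steps = round(float(opts["trun"]) / float(opts["dt"]))
print(n_steps)  # 1000000
```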
Two-photon absorption cross sections Top↑
This simulation type is performed as described in this publication. It is currently only available on the MLatom@XACS cloud and will be released soon. See the original source code on GitHub.
To run ML-TPA calculations locally, the following packages have to be installed:
- python >= 3.7
- scikit-learn<1.0.0
- xgboost>=1.5.0
- rdkit>=2022.03.3
- numpy>=1.21.1
- pandas>=1.0.1
After a proper Python environment is built (using conda is recommended), install the packages, i.e., run:
pip install pandas
pip install numpy==1.22
pip install scikit-learn==0.24.2
pip install xgboost==1.5
pip install rdkit
Input and output arguments Top↑
Arguments | Available and default parameters | Description |
---|---|---|
MLTPA | required. | requests calculation of the two-photon absorption (TPA) cross section for a spectrum or a given wavelength. |
SMILESfile=[file with SMILES] | this argument is required; no default file name. | file with SMILES of one or many molecules. |
Output contains the comma-separated predicted ML-TPA cross section values or spectra in units of GM for each wavelength. Output files tpa[molecular index as in SMILESfile].txt
are saved in a folder tpa[absolute time]
in current path.
Additional options for wavelength and solvent Top↑
Arguments | Default parameters |
---|---|
auxfile=[file with the information of wavelength and Et30 in the format of 'wavelength_lowbound,wavelength_upbound,Et30'] (wavelength in nm.) | If the auxiliary file does not exist, then the default value of Et30 will be 33.9 (toluene) and the whole spectrum between 600-1100 nm will be provided. The entries (lines) should be provided in the same order as in SMILESfile. See the list with the solvents and their Et30 values. |
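The auxiliary file format can be checked before a run with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, each line is taken to be `wavelength_lowbound,wavelength_upbound,Et30`, and equal bounds are interpreted as a request for a single wavelength.

```python
DEFAULT_RANGE_NM = (600.0, 1100.0)  # default spectrum range from the table
DEFAULT_ET30 = 33.9                 # default Et30 (toluene)


def parse_aux(text):
    """Parse lines of the form 'low,high,Et30' (wavelengths in nm)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        low, high, et30 = (float(field) for field in line.split(","))
        if low > high:
            raise ValueError(f"lower bound exceeds upper bound: {line}")
        entries.append({
            "low_nm": low,
            "high_nm": high,
            "et30": et30,
            # Assumption: equal bounds mean one wavelength, not a spectrum
            "single_wavelength": low == high,
        })
    return entries


def settings_for(n_molecules, aux_text=None):
    """One settings entry per molecule, in the order of the SMILES file."""
    if aux_text is None:
        low, high = DEFAULT_RANGE_NM
        return [{"low_nm": low, "high_nm": high, "et30": DEFAULT_ET30,
                 "single_wavelength": False} for _ in range(n_molecules)]
    entries = parse_aux(aux_text)
    if len(entries) != n_molecules:
        raise ValueError("need one auxiliary line per molecule")
    return entries


entries = parse_aux("600,850,55.4\n600,600,33.9\n")
print(entries[0]["et30"], entries[1]["single_wavelength"])
```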
Example Top↑
Here we show how to calculate TPA cross section for RHODAMINE 6G and RHODAMINE 123 molecules with MLatom input file mltpa.inp
:
MLTPA
SMILESfile=Smiles.csv
auxfile=_aux.txt
This input requires Smiles.csv
file with SMILES of molecules:
CCNC1=CC2=C(C=C1C)C(=C3C=C(C(=[NH+]CC)C=C3O2)C)C4=CC=CC=C4C(=O)OCC.[Cl-]
COC(=O)C1=CC=CC=C1C2=C3C=CC(=N)C=C3OC4=C2C=CC(=C4)N.Cl
and optional _aux.txt
:
600,850,55.4
600,600,33.9
After you have prepared your input files mltpa.inp
, Smiles.csv
, and _aux.txt
, you can run MLatom as usual:
mlatom mltpa.inp > mltpa.out
After the calculations finish, the predicted TPA cross section values are saved in a folder named tpa[absolute time]
. In the folder, there are two files for two molecules: tpa1.txt
and tpa2.txt
. For our examples, they look like:
wavelength,predicted_sigma (GM)
600.0,285.19455
610.0,297.71707
620.0,284.11694
......
810.0,121.51988
820.0,116.537994
830.0,118.04909
840.0,103.65925
850.0,113.72374
wavelength,predicted_sigma (GM)
600.0,138.2346
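Since the output files are plain comma-separated text with one header line, they can be post-processed with the Python standard library. The sketch below uses only the first rows shown above as sample data and finds the wavelength with the largest predicted cross section; the function name `read_tpa` is hypothetical.

```python
import csv
import io


def read_tpa(text):
    """Return a list of (wavelength in nm, cross section in GM) pairs."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    return [(float(wavelength), float(sigma)) for wavelength, sigma in reader]


# First rows of the example output; a real run would read tpa1.txt instead
sample = """wavelength,predicted_sigma (GM)
600.0,285.19455
610.0,297.71707
620.0,284.11694
"""

rows = read_tpa(sample)
peak = max(rows, key=lambda row: row[1])
print(peak)  # (610.0, 297.71707)
```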
Quantum dynamics with machine learning Top↑
MLatom can perform quantum dissipative dynamics with a range of machine-learning methods via an interface to the MLQD program. Supported methods from the program’s website:
- Kernel Ridge Regression (KRR)-based recursive (iterative) Quantum Dissipative Dynamics method: Here is the corresponding article → Speeding up quantum dissipative dynamics of open systems with kernel methods. Recently, we performed a comparative study in which the KRR method outperforms NN models; here is the article → A comparative study of different machine learning methods for dissipative quantum dynamics
- AIQD non-recursive (non-iterative) approach: Here is the corresponding article → Predicting the future of excitation energy transfer in light-harvesting complex with artificial intelligence-based quantum dynamics
- The blazingly fast OSTL non-recursive (non-iterative) approach: Here is the corresponding article → One-Shot Trajectory Learning of Open Quantum Systems Dynamics
Lecture & Tutorial
Input and output arguments Top↑
Arguments | Available and default parameters | Description |
---|---|---|
QDmodel=[createQDmodel or useQDmodel] (not optional) | default option is useQDmodel | requests MLQD to create or use QD model |
QDmodelIn=[user-provided model file] | Not optional if QDmodel=useQDmodel | Name of the file with the trained model |
QDmodelOut=[user-defined name of created model] (optional) | | You can pass it if QDmodel=createQDmodel, and MLQD will save the trained model with this name. It is optional; if you don’t pass it, MLQD will choose a random name. |
QDmodelType=[KRR or AIQD or OSTL] | default option is OSTL | It tells MLQD what type of QD model to use |
systemType=[SB or FMO](not optional) | no default option | It tells MLQD the type of the system |
QDtrajOut=[file name for the output trajectory] | | You can pass it if QDmodel=useQDmodel, and MLQD will save the predicted dynamics with this name. It is optional; if you don’t pass it, MLQD will choose a random name. |
prepInput=[True or False] | default is False. Case sensitive | Prepare input files X and Y from the data |
hyperParam=[True or False] | default is False. Case sensitive | Optimize the hyper parameters of the model |
patience=[integer non-negative number] | Default value is 10 | Patience for early stopping in CNN training |
epochs=[integer non-negative number] | Default value is 100 | Number of epochs for training and optimization of CNN model [OSTL and AIQD methods] |
max_evals=[integer non-negative number] | Default value is 100 | Number of maximum evaluations in hyperopt optimization of CNN model [OSTL and AIQD methods] |
XfileIn=[name of X file] | Default is x_data if QDmodel=createQDmodel and prepInput=True | In the case of QDmodel=createQDmodel, it is optional. It passes the name of the X file: the X file is saved with this name if prepInput=True and read from it if prepInput=False. However, if QDmodel=useQDmodel and QDmodelType=KRR, it is not optional: you need to pass the input short-time trajectory. |
YfileIn=[name of Y file] | Default is y_data if QDmodel=createQDmodel and prepInput=True | In the case of QDmodel=createQDmodel, it is optional. It passes the name of the Y file: the Y file is saved with this name if prepInput=True and read from it if prepInput=False. |
dataPath=[absolute or relative path with data] | | In the case of QDmodel=createQDmodel and prepInput=True, the path to the data must be passed so that MLQD can prepare the X and Y files. Note that the data should be in the same format as in our data set QD3SET-1 (to be published), especially when QDmodelType=OSTL or AIQD. |
n_states=[number of states or sites, integer] | Default is 2 for SB and 7 for FMO | Number of states (SB) or sites (FMO) |
initState=[number of initial site] | Default value is 1 (initial excitation is on site-1) | It represents the initial site in the FMO complex. Only required when we propagate dynamics with the OSTL or AIQD method |
time=[propagation time] | Default is 20 for SB and 50 for FMO | Propagation time in picoseconds (ps) for FMO complex and in atomic units (a.u.) for spin-boson model |
time_step=[time step of propagation] | Default is 0.05 for SB and 0.005 for FMO | time step of propagation |
energyDiff=[energy difference] | Default value is 1.0 | Energy difference between the states in the case of SB, needed only when QDmodelType=OSTL or AIQD |
Delta=[tunneling matrix element] | Default value is 1.0 | The tunneling matrix element in the case of SB, needed only when QDmodelType = OSTL or AIQD |
gamma=[characteristic frequency] | Default value is 10 in the case of SB and 500 in the case of FMO | Characteristic frequency. In cm^-1 for FMO and in (a.u.) for SB, and needed only when QDmodelType=OSTL or AIQD |
lamb=[system-bath coupling strength] | Default value is 1.0 in the case of SB and 520 in the case of FMO | System-bath coupling strength. In cm^-1 for FMO and in (a.u.) for SB, and needed only when QDmodelType=OSTL or AIQD |
temp=[temperature] | Default value is 1.0 in the case of SB and 510 in the case of FMO | Temperature (K) in the case FMO complex and inverse temperature in the case of SB, and needed only when QDmodelType=OSTL or AIQD |
energyNorm=[normalizer] | Default value is 1.0 | Normalizer for the energy difference between the states in the case of SB |
DeltaNorm=[normalizer] | Default value is 1.0 | Normalizer for the tunneling matrix element in the case of SB |
gammaNorm=[normalizer] | Default value is 10 in the case of SB and 500 in the case of FMO | Normalizer for characteristic frequency |
lambNorm=[normalizer] | Default value is 1.0 in the case of SB and 520 in the case of FMO | Normalizer for system-bath coupling strength |
tempNorm=[normalizer] | Default value is 1.0 in the case of SB and 510 in the case of FMO | Normalizer for temperature in the case of FMO and for inverse temperature in the case of SB |
numLogf=[number of logistic functions] | Default value is 1 | Number of logistic functions normalizing the dimension of time |
LogCa=[coefficient] | Default value is 1.0 | Coefficient “a” in the logistic function |
LogCb=[coefficient] | Default value is 15.0 | Coefficient “b” in the logistic function |
LogCc=[coefficient] | Default value is -1.0 | Coefficient “c” in the logistic function |
LogCd=[coefficient] | Default value is 1.0 | Coefficient “d” in the logistic function |
dataCol=[column number] | Default value is 1 | When QDmodelType=KRR, it only works for single output values. If there are multiple columns in your data files, you need to mention which column to grab |
dtype=[real or imag] | Default is real | When you pass the column with dataCol and your data is complex, you need to mention which part of the complex data MLQD should grab, real or imaginary |
xlength=[number of time steps in the short seed trajectory] | Default value is 81 | Length of the input short trajectory. It is the number of time steps in the data you passed with dataCol |
refTraj | | Optional. MLQD can plot the predicted dynamics against a reference trajectory; if a reference trajectory is provided, MLQD will produce the plot, otherwise not |
xlim=[xaxis limit] | Default option is equal to the propagation time | The user can define xaxis limit for plotting |
pltNstates=[number of states to be plotted] | Default option is to plot all states | Users can define how many states should be plotted by MLQD |
Examples
These are just very brief examples, please see our detailed tutorial.
Training a KRR model
For the spin-boson model, we provide 20 trajectories from our QD3SET-1 database for demonstration. MLQD will use them automatically if you don’t pass a data path.
MLQD
QDmodel=createQDmodel
QDmodelType=KRR
prepInput=True
dataCol=1
dtype=real
xlength=81
systemType=SB
QDmodelOut=KRR_SB_model
Propagation of dynamics with the trained KRR model
We are providing a short input trajectory saved as state_1_pop.txt
:
MLQD
time=20
time_step=0.05
QDmodel=useQDmodel
QDmodelType=KRR
XfileIn=state_1_pop.txt
systemType=SB
QDmodelIn=KRR_SB_model
QDtrajOut=KRR_trajectory
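The idea behind recursive propagation is that a model predicts the next point from a window of previous points and each prediction is fed back as input. The sketch below illustrates only this loop. It is not MLQD code: `toy_predict_next` is a hypothetical stand-in for a trained KRR model (an exact linear recurrence for a damped oscillation), and it is assumed that the seed contains xlength points and that the trajectory includes the point at t=0.

```python
import math


def propagate_recursively(seed, predict_next, time, time_step):
    """Extend a short seed trajectory by repeatedly predicting the next point."""
    n_points = round(time / time_step) + 1  # assumes t=0 is included
    window = len(seed)
    trajectory = list(seed)
    while len(trajectory) < n_points:
        # The model sees only the most recent `window` points
        trajectory.append(predict_next(trajectory[-window:]))
    return trajectory


A, W, DT = 0.1, 1.0, 0.05  # damping, frequency, time step of the toy signal


def exact(n):
    """Toy reference: damped cosine sampled at step n."""
    return math.exp(-A * n * DT) * math.cos(W * n * DT)


def toy_predict_next(window):
    """Stand-in for a trained model: exact two-point recurrence of the toy signal."""
    c1 = 2.0 * math.exp(-A * DT) * math.cos(W * DT)
    c2 = math.exp(-2.0 * A * DT)
    return c1 * window[-1] - c2 * window[-2]


seed = [exact(n) for n in range(81)]  # 81 points, as with xlength=81
trajectory = propagate_recursively(seed, toy_predict_next, time=20, time_step=0.05)
print(len(trajectory))  # 401
```

With a real model, prediction errors accumulate along the trajectory, which is one motivation for the non-recursive AIQD and OSTL approaches listed above.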
The reference trajectory for comparison:
Training an AIQD model
MLQD
n_states=2
time=20
time_step=0.05
QDmodel=createQDmodel
QDmodelType=AIQD
prepInput=True
numLogf=10
LogCa=1.0
LogCb=15.0
LogCc=-1.0
LogCd=1.0
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
systemType=SB
hyperParam=True
patience=10
epochs=10
max_evals=10
QDmodelOut=AIQD_SB_model
Propagation of dynamics with the trained AIQD model
We just pass the parameters and the trained AIQD model should be able to predict the corresponding dynamics
MLQD
n_states=2
time=20
time_step=0.05
energyDiff=1.0
Delta=1.0
gamma=4.0
lamb=0.1
temp=1.0
QDmodel=useQDmodel
QDmodelType=AIQD
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
numLogf=10
systemType=SB
QDmodelIn=AIQD_SB_model.hdf5
QDtrajOut=Qd_trajectory
Training an OSTL model
MLQD
n_states=2
QDmodel=createQDmodel
QDmodelType=OSTL
prepInput=True
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
systemType=SB
hyperParam=True
patience=10
epochs=10
max_evals=10
QDmodelOut=OSTL_SB_model
Propagation of dynamics with the trained OSTL model
We just pass the parameters and the trained OSTL model should be able to predict the corresponding dynamics in one shot
MLQD
n_states=2
time=20
time_step=0.05
energyDiff=1.0
Delta=1.0
gamma=4.0
lamb=0.1
temp=1.0
QDmodel=useQDmodel
QDmodelType=OSTL
energyNorm=1.0
DeltaNorm=1.0
gammaNorm=10
lambNorm=1.0
tempNorm=1.0
systemType=SB
QDmodelIn=OSTL_SB_model.hdf5
QDtrajOut=Qd_trajectory
IR and power spectra from MD Top↑
An MD trajectory can be used to generate IR and power spectra.
Input and output arguments Top↑
Arguments | Available and default parameters | Description |
---|---|---|
trajH5MDin | required if trajdpin is not provided | file with trajectory in H5MD format |
trajVXYZin | required if trajH5MDin is not provided | plain text file containing velocities |
trajdpin | required if trajH5MDin is not provided | plain text file containing dipole moments |
start_time | 0.0 by default | unit: fs; use trajectory from start_time to end_time |
end_time | maximum time by default | unit: fs; use trajectory from start_time to end_time |
autocorrelationDepth | 1024 by default | autocorrelation depth; unit: fs |
zeropadding | 1024 by default | zero padding; unit: fs |
title | | title of the plot |
output | its value can be ir or ps | which spectrum to output |
The plot will be saved in ir.png or ps.png, and the spectrum will be saved in ir.npy or ps.npy.
Example Top↑
Below is an example of how to generate an IR spectrum from an MD trajectory:
MD2vibr # Generate vibrational spectrum from MD trajectory
trajH5MDin=traj.h5 # Read MD trajectory from traj.h5
dt=0.5 # Time step
start_time=3000 # Start time
end_time=100000 # End time
autocorrelationDepth=1024 # Autocorrelation depth
zeropadding=1024 # Zero padding
output=ir # Generate IR spectrum
Below is an example of how to generate a power spectrum from an MD trajectory:
MD2vibr
trajH5MDin=traj.h5
dt=0.5
start_time=0
end_time=10000
autocorrelationDepth=1024
zeropadding=1024
output=ps
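To illustrate what the autocorrelation depth and zero padding control, here is a generic NumPy sketch of a power spectrum obtained from a velocity autocorrelation function followed by a Fourier transform. It is not MLatom's implementation: the function name, the Hann taper, and the conversion of the depth and padding from fs to numbers of samples (dividing by dt) are assumptions made for this example, and a single one-dimensional velocity signal is used instead of a full trajectory.

```python
import numpy as np

C_CM_PER_FS = 2.99792458e-5  # speed of light in cm/fs


def power_spectrum(velocity, dt, depth_fs=1024.0, zeropad_fs=1024.0):
    """Spectrum of a 1D velocity signal sampled every dt fs.

    Returns (wavenumbers in cm^-1, intensities in arbitrary units).
    """
    v = np.asarray(velocity, dtype=float)
    depth = int(round(depth_fs / dt))    # autocorrelation depth in samples
    n_pad = int(round(zeropad_fs / dt))  # zero padding in samples
    n = len(v)
    # Velocity autocorrelation function for lags 0 .. depth-1
    acf = np.array([np.dot(v[:n - lag], v[lag:]) / (n - lag)
                    for lag in range(depth)])
    # Decaying half of a Hann window reduces truncation artifacts
    acf = acf * np.hanning(2 * depth)[depth:]
    # Zero padding refines the frequency grid of the transform
    padded = np.concatenate([acf, np.zeros(n_pad)])
    intensity = np.abs(np.fft.rfft(padded))
    wavenumber = np.fft.rfftfreq(len(padded), d=dt) / C_CM_PER_FS
    return wavenumber, intensity


# Test signal: a single oscillation at 1000 cm^-1, sampled every 0.5 fs
dt = 0.5
t = np.arange(0.0, 10000.0, dt)
velocity = np.cos(2.0 * np.pi * 1000.0 * C_CM_PER_FS * t)

wavenumber, intensity = power_spectrum(velocity, dt)
peak = wavenumber[np.argmax(intensity)]
print(f"peak near {peak:.0f} cm^-1")
```

With these settings the frequency grid spacing is about 16 cm^-1, so the peak is located to within one grid point of 1000 cm^-1; a longer zero padding gives a finer grid.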
Learning Top↑
Training popular ML models Top↑
The models can be either native MLatom or from third-party interfaces to popular ML model types:
- kernel ridge regression (KRR) models
- KREG (native). See tutorial. Can only be used for single-molecule PES.
- KRR-CM (KRR with Coulomb matrix, native).
- ANI (through TorchANI)
- DeepPot-SE and DPMD (through DeePMD-kit)
- GAP–SOAP (through GAP suite and QUIP)
- PhysNet (through PhysNet)
- sGDML (through sGDML). Can only be used for single-molecule PES
Arguments Top↑
Arguments | Available and default parameters | Description |
createMLmodel | requests training an ML model. | |
XYZfile=[input file with XYZ coordinates] | no default file names. | requests to train on a data set with many molecules provided in file with their XYZ coordinates. The units of coordinates are arbitrary, but many simulations with MLatom require Å which are recommended. |
Yfile=[input file with reference values] and/or YgradXYZfile=[input file with reference XYZ gradients] | Yfile or both of these two arguments can be chosen. No default file names. | Yfile values are often energies; it is recommended to use Hartree if the model is intended to be used in further simulations. YgradXYZfile values are often energy gradients; it is recommended to use Hartree/Å. Note that gradients are negative forces and the appropriate sign should be used. Also, note that sparse gradients can be provided, where for geometries without gradients, the YgradXYZfile file should contain ‘0’ followed by a blank line (see tutorial). |
MLmodelOut=[output file with trained model] | no default file name. | saves model to a user-defined file. If the file already exists, MLatom will not overwrite it and stop. |
MLmodelType=[supported ML model type] | KREG [default]. Available model types and corresponding default programs (MLatomF is a native program): KREG (MLatomF), sGDML (sGDML), GAP-SOAP (GAP), PhysNet (PhysNet), DeepPot-SE (DeePMD-kit), ANI (TorchANI). | Calculations with native implementations do not require this argument. For third-party models the user should provide the MLmodelType and/or MLprog argument (see also installation instructions). Note that to request the KRR-CM model, one has to choose descriptor and algorithm details manually. |
MLprog=[supported ML program] | It is recommended to use MLmodelType instead of this option. Supported interfaces with default and tested ML model types: MLatomF: KREG [default], see MLatom.py KRR help; sGDML: sGDML [default], GDML; GAP: GAP-SOAP; PhysNet: PhysNet; DeePMD-kit: DeepPot-SE [default], DPMD; TorchANI: ANI [default]. | Calculations with native implementations do not require this argument. For third-party models the user should provide the MLmodelType and/or MLprog argument (see also installation instructions). Note that to request the KRR-CM model, one has to choose descriptor and algorithm details manually. |
eqXYZfileIn=[file with XYZ coordinates of equilibrium geometry] | optional. By default, tries to look for eq.xyz file, if not found, uses the minimum-energy structure in the data set. | can only be used for the KREG model to construct the RE descriptor. |
Additional output arguments Top↑
Arguments | Available and default parameters | Description |
YestFile=[output file with estimated Y values] | this argument is optional and no default parameters are provided. | makes predictions Y for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it. |
YgradXYZestFile=[output file with estimated XYZ gradients] | this argument is optional and no default parameters are provided. | should be used only with XYZfile option. Calculates first XYZ derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it. |
YgradEstFile=[output file with estimated gradients] | this argument is optional and no default parameters are provided. | should be used only with XfileIn option. Calculates first derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it. |
Note: calculations with third-party programs may also generate additional output files.
Additional options for TorchANI interface Top↑
Arguments with their default values:
ani.batch_size=8
batch size
ani.max_epochs=10000000
max epochs
ani.early_stopping_learning_rate=0.00001
learning rate that triggers early-stopping
ani.force_coefficient=0.1
weight for force
ani.Rcr=5.2
radial cutoff radius
ani.Rca=3.5
angular cutoff radius
ani.EtaR=1.6
radial smoothness in radial part
ani.ShfR=0.9,1.16875,1.4375,1.70625,1.975,2.24375,2.5125,2.78125,3.05,3.31875,3.5875,3.85625,4.125,4.39375,4.6625,4.93125
radial shifts in radial part
ani.Zeta=32
angular smoothness
ani.ShfZ=0.19634954,0.58904862,0.9817477,1.3744468,1.7671459,2.1598449,2.552544,2.9452431
angular shifts
ani.EtaA=8
radial smoothness in angular part
ani.ShfA=0.9,1.55,2.2,2.85
radial shifts in angular part
ani.Neuron_l1=160
number of neurons in layer 1
ani.Neuron_l2=128
number of neurons in layer 2
ani.Neuron_l3=96
number of neurons in layer 3
ani.AF1='CELU'
activation function for layer 1
ani.AF2='CELU'
activation function for layer 2
ani.AF3='CELU'
activation function for layer 3
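The default shift values follow regular grids, which makes them easy to regenerate or adapt, for example when changing a cutoff radius. The NumPy sketch below reproduces the defaults: radial and angular-part radial shifts are evenly spaced from 0.9 up to (but not including) the respective cutoff, and angular shifts are odd multiples of π/16. That the grids are tied to the cutoffs in this way is an observation about the listed numbers, not a documented rule.

```python
import numpy as np

rcr, rca = 5.2, 3.5  # ani.Rcr and ani.Rca defaults

# 16 radial shifts from 0.9 in steps of (5.2 - 0.9) / 16 = 0.26875
shf_r = np.linspace(0.9, rcr, 16, endpoint=False)
# 4 radial shifts of the angular part in steps of (3.5 - 0.9) / 4 = 0.65
shf_a = np.linspace(0.9, rca, 4, endpoint=False)
# 8 angular shifts: pi/16, 3*pi/16, ..., 15*pi/16
shf_z = (2 * np.arange(8) + 1) * np.pi / 16

print(np.round(shf_r, 5))
print(np.round(shf_a, 5))
print(np.round(shf_z, 8))
```

On this grid the fourteenth radial shift is 4.39375, between 4.125 and 4.6625.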
Additional options for sGDML Top↑
Arguments with their default values:
sgdml.gdml=False
use GDML instead of sGDML
sgdml.cprsn=False
compress kernel matrix along symmetric degrees of freedom
sgdml.no_E=False
not to predict energies
sgdml.E_cstr=False
include the energy constraints in the kernel
sgdml.s=<s1>[,<s2>[,...]] or <start>:[<step>:]<stop>
set hyperparameter sigma, see sgdml create -h for details.
Additional options for PhysNet Top↑
Arguments with their default values:
physnet.num_features=128
number of input features
physnet.num_basis=64
number of radial basis functions
physnet.num_blocks=5
number of stacked modular building blocks
physnet.num_residual_atomic=2
number of residual blocks for atom-wise refinements
physnet.num_residual_interaction=3
number of residual blocks for refinements of proto-message
physnet.num_residual_output=1
number of residual blocks in output blocks
physnet.cutoff=10.0
cutoff radius for interactions in the neural network
physnet.seed=42
random seed
physnet.learning_rate=0.0008
starting learning rate
physnet.decay_steps=10000000
decay steps
physnet.decay_rate=0.1
decay rate for learning rate
physnet.batch_size=12
training batch size
physnet.valid_batch_size=2
validation batch size
physnet.force_weight=52.91772105638412
weight for force
physnet.summary_interval=5
interval for summary
physnet.validation_interval=5
interval for validation
physnet.save_interval=10
interval for model saving
Additional options for GAP and QUIP Top↑
gapfit.xxx=x
xxx could be any option for gap_fit (e.g., default_sigma). Note that there’s no need to set at_file and gp_file.
gapfit.gap.xxx=x
xxx could be any option for gap.
Arguments with their default values:
gapfit.default_sigma={0.0005,0.001,0,0}
hyperparameter sigmas for energies, forces, virials and hessians
gapfit.e0_method=average
method for determining e0
gapfit.gap.type=soap
descriptor type
gapfit.gap.l_max=6
max number of angular basis functions
gapfit.gap.n_max=6
max number of radial basis functions
gapfit.gap.atom_sigma=0.5
hyperparameter for Gaussian smearing of atom density
gapfit.gap.zeta=4
hyperparameter for kernel sensitivity
gapfit.gap.cutoff=6.0
cutoff radius of local environment
gapfit.gap.cutoff_transition_width=0.5
cutoff transition width
gapfit.gap.delta=1
hyperparameter delta for kernel scaling
Additional options for DeePMD-kit Top↑
Expressions like deepmd.xxx.xxx=X
specify arguments for DeePMD, following the structure of DeePMD’s json input file.
For example:
deepmd.training.stop_batch=N
is an equivalent of
{
...
"training": {
...
"stop_batch": N
...
}
...
}
in DeePMD-kit’s json input.
In addition, the option deepmd.input=S
takes an input json file S
as a template. The final input file will be generated based on it together with the deepmd.xxx.xxx=X
options (if any). Check the default template file bin/interfaces/DeePMDkit/template.json
for default values.
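The mapping from dotted options to the nested json structure can be sketched in a few lines of Python. This only illustrates the idea described above and is not the code used by MLatom; the function name `apply_deepmd_options` and the template contents are hypothetical.

```python
import copy
import json


def apply_deepmd_options(template, options):
    """Overlay options such as 'deepmd.training.stop_batch' on a template dict."""
    result = copy.deepcopy(template)  # leave the template untouched
    for dotted_key, value in options.items():
        if dotted_key == "deepmd.input":
            continue  # names the template file, not a json entry
        parts = dotted_key.split(".")
        if parts[0] != "deepmd" or len(parts) < 2:
            raise ValueError(f"not a deepmd option: {dotted_key}")
        node = result
        # Walk (and create if needed) the intermediate json objects
        for name in parts[1:-1]:
            node = node.setdefault(name, {})
        node[parts[-1]] = value
    return result


# Hypothetical template; the real defaults live in template.json
template = {"training": {"stop_batch": 1000, "seed": 1}}
merged = apply_deepmd_options(template, {"deepmd.training.stop_batch": 4000})
print(json.dumps(merged, indent=2))
```

Entries that are not overridden keep their template values, which matches the description of the template being used as the basis of the final input file.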
Example Top↑
See tutorial for training the KREG models.
Here we show how to train an ANI-type model on ethanol PES (trains only on energies).
In MLatom, except for the KREG model, we need to specify MLmodelType. The input is very simple:
createMLmodel # Specify the task for MLatom
MLmodelType=ANI # Specify the model type
XYZfile=ethanol_geometries.xyz # File with XYZ geometries
Yfile=ethanol_energies.txt # File with reference energies
Training generic ML models Top↑
MLatom allows training kernel ridge regression (KRR) models for any generic data set with input vectors X and reference labels Y. A range of kernel functions is supported. Instead of using this option, it may be more convenient to use one of the popular ML models available in MLatom.
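For readers unfamiliar with KRR, the following NumPy sketch shows the basic procedure for generic input vectors X and labels Y: build a kernel matrix, solve a regularized linear system for the regression coefficients, and predict with kernel values between new and training points. It is a textbook illustration, not MLatom's implementation. The Gaussian kernel is written in the common form exp(-|x - x'|^2 / (2 sigma^2)), which is an assumption about the convention, and the function names are hypothetical. The roles of `sigma`, `lam`, and `prior` correspond to the length scale, the regularization hyperparameter, and the offset of reference values.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def krr_train(x, y, sigma, lam, prior=0.0):
    """Solve (K + lam * I) alpha = y - prior for the coefficients alpha."""
    k = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(k + lam * np.eye(len(x)), y - prior)


def krr_predict(x_train, alpha, x_new, sigma, prior=0.0):
    """Prediction: kernel-weighted sum of coefficients plus the offset."""
    return gaussian_kernel(x_new, x_train, sigma) @ alpha + prior


# Toy data set: 8 one-dimensional input vectors and their labels
x = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
y = np.sin(2.0 * np.pi * x[:, 0])
prior = y.mean()  # offset by the average of the reference values

alpha = krr_train(x, y, sigma=0.1, lam=1e-8, prior=prior)
y_fit = krr_predict(x, alpha, x, sigma=0.1, prior=prior)
print(np.max(np.abs(y_fit - y)))
```

With a very small regularization the model reproduces the training labels almost exactly; larger values trade this for smoother predictions, which is why both hyperparameters are usually optimized rather than fixed.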
Required arguments Top↑
Below are required arguments but typically more options are needed, e.g., for choosing a molecular descriptor and algorithm hyperparameters, as shown later.
Arguments | Available and default parameters | Description |
createMLmodel | requests training an ML model. Currently only KRR models are supported. | |
XYZfile=[input file with XYZ coordinates] or XfileIn=[input file with input vectors X] | one and only one of these two options can be chosen. No default file names. | XYZfile: requests to train on a data set with many molecules provided in file with their XYZ coordinates. The units of coordinates are arbitrary, but many simulations with MLatom require Å which are recommended. XfileIn: requests to train on a data set with many input vectors (one input vector per line in text file), which are typically molecular descriptors. |
Yfile=[input file with reference values] and/or YgradXYZfile=[input file with reference XYZ gradients] | one or both of these two options can be chosen. No default file names. | Yfile values are often energies; it is recommended to use Hartree if the model is intended to be used in further simulations. YgradXYZfile values are often energy gradients; it is recommended to use Hartree/Å. Note that gradients are negative forces and the appropriate sign should be used. Also, note that sparse gradients can be provided, where for geometries without gradients, the YgradXYZfile file should contain ‘0’ followed by a blank line (see tutorial). |
MLmodelOut=[output file with trained model] | no default file name. | saves model to a user-defined file, commonly with .unf extension. If the file already exists, MLatom will not overwrite it and stop. |
KRR-related arguments Top↑
Arguments | Available and default parameters | Description |
prior=[offset of reference values] | 0.0 [default]; mean: use the average of reference scalar values; any other user-defined decimal/integer number. | It is often useful to offset reference values, e.g., by removing the average value. This may improve stability of the model and make learning easier. |
KRRtask=[one of tasks] | learnVal: learns reference values [default if only Yfile provided]; learnGradXYZ: explicitly learns only XYZ gradients (should be requested for correct simulations), works only with the KREG model (RE descriptor and Gaussian kernel); learnValGradXYZ: explicitly learns both scalar values and XYZ gradients [default if both Yfile and YgradXYZfile are provided], works only with the KREG model (RE descriptor and Gaussian kernel). | specifies what to learn: scalar values and/or XYZ gradients. |
lambda=[regularization hyperparameter] | 0.0 [default]; opt: optimize hyperparameter, see dedicated manual; any other user-defined nonnegative decimal/integer number. | It is recommended to always optimize this hyperparameter. Usually, the lambda parameter should be rather small but larger than zero, e.g., 10^-6. |
lambdaGradXYZ=[regularization hyperparameter for XYZ gradients part] | similar to lambda. Can be used for KRRtask=learnGradXYZ and KRRtask=learnValGradXYZ. For KRRtask=learnGradXYZ, both lambda and lambdaGradXYZ are equivalent. | similar to lambda; may be helpful if it is hard to learn both scalar values and XYZ gradients with a single lambda. |
kernel=[kernel function] | Gaussian [default], hyperparameter: sigma; modifications of the Gaussian kernel: periodKernel (hyperparameters: sigma, period) and decayKernel (hyperparameters: sigma, sigmap, period); Laplacian, hyperparameter: sigma; exponential, hyperparameter: sigma; Matern is the most flexible but relatively slow, hyperparameters: nn, sigma (nn = 0 makes the Matern kernel equivalent to the exponential kernel, a very large nn makes it equivalent to the Gaussian kernel); linear, no hyperparameters. | Many of these kernel functions have hyperparameters that are recommended to be defined by the indicated arguments. The linear kernel makes KRR equivalent to ridge regression, i.e., kernelized multiple linear regression (MLR), and MLatom prints out the coefficients of an equivalent MLR model. |
sigma=[length scale hyperparameter] | 100.0 [default for kernel=Gaussian and kernel=Matern]; 800.0 [default for kernel=Laplacian and kernel=exponential]; opt: optimize the hyperparameter, see the dedicated manual; any other user-defined positive decimal or integer number. | length scale hyperparameter present in most kernel functions. It is recommended to always optimize this hyperparameter; no good general default value can be recommended. |
sigmap=[length scale hyperparameter of a periodic part] | 100.0 [default, can be used only with kernel=decayKernel]; opt: optimize the hyperparameter, see the dedicated manual; any other user-defined positive decimal or integer number. | It is recommended to always optimize this hyperparameter; no good general default value can be recommended. |
period=[length scale hyperparameter] | 1.0 [default, can be used with both kernel=periodKernel and kernel=decayKernel]; opt: optimize the hyperparameter, see the dedicated manual; any other user-defined positive decimal or integer number. | It is recommended to always optimize this hyperparameter; no good general default value can be recommended. |
nn=[integer hyperparameter of the Matern kernel] | 2 [default, can only be used with kernel=Matern]; opt: optimize the hyperparameter, see the dedicated manual; any other user-defined positive integer number. | Since this is an integer hyperparameter, it is usually easy to check several values from 1 to 5 manually: 0 corresponds to the exponential kernel, and values above 5 are already close to the Gaussian kernel. |
permInvKernel | optional. Related options: molDescrType=permuted , permInvGroups , permInvNuclei , Nperm , selectperm , permIndIn , permlen . | requests calculations with permutationally invariant kernel. Recommended for small data sets to ensure that permutation of homonuclear atoms will not change ML predictions. |
Nperm=[number of permutations] | optional, can only be used with permInvKernel and XfileIn. | defines the number of permutations in the user-provided file with reference values. Each line of the input vector file must contain the molecular descriptors concatenated for all atomic permutations of a single geometry. See also related tutorial. |
selectperm | optional, can only be used with permInvKernel and molDescrType=permuted. | may be useful to find the most relevant permutations and reduce the number of permutations by minimizing the RMSD distance to an equilibrium structure. Prints out the list of selected permutations. See also related tutorial. |
permIndIn=[file with permutations list] | optional, can only be used with permInvKernel and molDescrType=permuted and permlen . | See also related tutorial. |
permlen=[number of permutations in permIndIn] | optional, can only be used with permInvKernel and molDescrType=permuted and . | See also related tutorial. |
matDecomp=[type of matrix decomposition] | Cholesky [default]LU Bunch-Kaufman | Cholesky is the most efficient, but for very difficult cases (e.g., too small hyperparameter lambda), other types can be used. MLatom first tries to do Cholesky decomposition, if it fails, MLatom tries to do Bunch-Kaufman and, finally, LU. Thus, usually, the user does not need to worry about this option. |
invMatrix | not used by default. Optional. | requests inverting kernel matrix to train the model. Not recommended because it is much slower than the default option. |
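The kernel choices in the table above can be illustrated with a short Python sketch. The functional forms below are commonly quoted textbook expressions; whether MLatom uses exactly these normalizations is an assumption, and the function names are hypothetical. The sketch only makes the stated relation concrete that the Matern kernel with nn = 0 reduces to the exponential kernel.

```python
import math

def euclidean(x, y):
    """Euclidean (L2) distance between two input vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def gaussian_kernel(x, y, sigma):
    # Assumed form: exp(-|x - y|^2 / (2 sigma^2))
    d = euclidean(x, y)
    return math.exp(-d * d / (2.0 * sigma ** 2))

def laplacian_kernel(x, y, sigma):
    # Assumed form: exp(-|x - y|_1 / sigma), using the L1 (Manhattan) distance
    d1 = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-d1 / sigma)

def exponential_kernel(x, y, sigma):
    # Assumed form: exp(-|x - y|_2 / sigma), using the Euclidean distance
    return math.exp(-euclidean(x, y) / sigma)

def matern_kernel(x, y, sigma, nn):
    # Assumed form: exp(-r) * sum_k (nn+k)!/(2nn)! * C(nn,k) * (2r)^(nn-k), r = d/sigma
    r = euclidean(x, y) / sigma
    total = 0.0
    for k in range(nn + 1):
        coeff = math.factorial(nn + k) / math.factorial(2 * nn) * math.comb(nn, k)
        total += coeff * (2.0 * r) ** (nn - k)
    return math.exp(-r) * total
```

With nn = 0 the sum has a single term equal to 1, so `matern_kernel` returns the same value as `exponential_kernel` for any pair of inputs.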
Molecular descriptor arguments Top↑
If the user only provides XYZ file with XYZfile
argument, XYZ coordinates need to be first converted into the molecular descriptor.
Arguments | Available and default parameters | Description |
molDescriptor=[molecular descriptor] | RE [default] (relative-to-equilibrium); CM (Coulomb matrix); ID (inverse internuclear distances) | The RE descriptor is well suited for accurate description of a single-molecule PES. CM is a popular (but somewhat outdated) descriptor which can in principle also be applied to different molecules. In MLatom, the full (vectorized) CM is used, not its eigenvalues as in the original publication. ID is a popular inverse internuclear distances descriptor used in many ML models; it is applicable to a single-molecule PES and is similar to the RE descriptor. |
molDescrType=[type of molecular descriptor] | unsorted [default for RE]; sorted [default for CM]; permuted (optional, can be used for both RE and CM) | unsorted descriptors are the original descriptors, but they do not ensure permutational invariance of homonuclear atoms. sorted descriptors ensure permutational invariance and are typically used for the CM descriptor (where the CM is sorted by its norms). For the RE descriptor, sorting is done by nuclear repulsions; it can be used for structure-based sampling, but it introduces discontinuities in the interpolant and should not be used for simulations. Related options: XYZsortedFileOut, permInvGroups, permInvNuclei. See also related tutorial. permuted augments the descriptor with the permutations of user-defined atoms. Related arguments: permInvKernel, permInvGroups, permInvNuclei. See also related tutorial. |
XYZsortedFileOut=[output file with sorted XYZ coordinates] | optional. Only works with molDescriptor=RE molDescrType=sorted. | sorts the chosen atoms by nuclear repulsion and saves the resulting XYZ coordinates to the file. |
permInvNuclei=[permutationally invariant nuclei] | optional. Should be used with molDescrType=permuted (and often with permInvKernel) | E.g. permInvNuclei=2-3.5-6 will permute atoms 2,3 and 5,6. See also related tutorial. |
permInvGroups=[permutationally invariant groups] | optional. Should be used with molDescrType=permuted (and often with permInvKernel ) | E.g. for water dimer permInvGroups=1,2,3-4,5,6 generates permuted atom indices by flipping the monomers in a dimer. |
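To make the water dimer example concrete, here is a minimal Python sketch of how a `permInvGroups` string could be expanded into permuted atom orders. The parsing rule (commas separate atoms within a group, hyphens separate groups) is inferred only from the example above, and the function names are hypothetical; this is not MLatom's own implementation.

```python
from itertools import permutations

def parse_perm_inv_groups(spec):
    """Parse e.g. '1,2,3-4,5,6' into [[1, 2, 3], [4, 5, 6]] (assumed syntax)."""
    return [[int(atom) for atom in group.split(",")] for group in spec.split("-")]

def permuted_atom_orders(spec, n_atoms):
    """Return all atom orders (1-based) obtained by exchanging whole groups."""
    groups = parse_perm_inv_groups(spec)
    orders = []
    for perm in permutations(groups):
        order = list(range(1, n_atoms + 1))
        # Place the atoms of each permuted group into the slots of the original group
        for slot, group in zip(groups, perm):
            for position, atom in zip(slot, group):
                order[position - 1] = atom
        orders.append(order)
    return orders

# Water dimer: atoms 1-3 form the first monomer, atoms 4-6 the second
dimer_orders = permuted_atom_orders("1,2,3-4,5,6", 6)
```

For two groups this yields the identity order and the order with the two monomers flipped.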
Additional output arguments Top↑
Arguments | Available and default parameters | Description |
YestFile=[output file with estimated Y values] | this argument is optional and no default parameters are provided. | makes predictions Y for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it. |
YgradXYZestFile=[output file with estimated XYZ gradients] | this argument is optional and no default parameters are provided. | should be used only with XYZfile option. Calculates first XYZ derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it. |
YgradEstFile=[output file with estimated gradients] | this argument is optional and no default parameters are provided. | should be used only with XfileIn option. Calculates first derivatives for the entire data set with the trained model and saves them to the requested file. If a file with the same name already exists, program will terminate and not overwrite it. |
Example Top↑
Here we show how to train a simple model for the H2 dissociation curve with kernel ridge regression.
Download R_20.dat
file with 20 points corresponding to internuclear distances in the H2 molecule in Å:
Download E_FCI_20.dat
file with full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree) for above 20 points:
Train (option createMLmodel
) ML model and save it to a file (option MLmodelOut=mlmod_E_FCI_20_overfit.unf
) using the above data (training set) and the following command, which requests fitting with kernel ridge regression, the Gaussian kernel function, and the hyperparameters σ=10⁻¹¹ and λ=0:
mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=0.00000000001 lambda=0.0 sampling=none > create_E_FCI_20_overfit.out
In the output file create_E_FCI_20_overfit.out
you can see that the error for the created ML model is essentially zero for the training set. Option sampling=none
ensures that the order of training points remains the same as in the original data set (it does not matter for creating this ML model, but will be useful later). You can use the created ML model (options useMLmodel
MLmodelIn
) for calculating energies for its own training set and save them to E_ML_20_overfit.dat
file:
mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat YestFile=E_ML_20_overfit.dat debug > use_E_FCI_20_overfit.out
Now you can compare the reference FCI values with the ML predicted values and see that they are the same. Option debug
also prints the values of the regression coefficients alpha to the output file use_E_FCI_20_overfit.out
. You can compare them with the reference FCI energies and see that they are exactly the same (they are given in the same order as the training points).
Now try to calculate energy with the ML model for any other internuclear distance not present in the training set and see that predictions are zero. It means that the ML model is overfitted and cannot generalize well to new situations, because of the hyperparameter choice. Thus, optimization of hyperparameters is strongly recommended.
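The overfitting behavior described in this example can be reproduced with a minimal pure-Python sketch of kernel ridge regression, in which the regression coefficients solve (K + λI)α = y. The training data below are synthetic values generated from a simple parabola, not the contents of R_20.dat or E_FCI_20.dat, and the Gaussian kernel form and all function names are assumptions made for illustration.

```python
import math

def gaussian_kernel(r1, r2, sigma):
    # Assumed form: exp(-(r1 - r2)^2 / (2 sigma^2)); underflows to 0.0 for tiny sigma
    return math.exp(-((r1 - r2) ** 2) / (2.0 * sigma ** 2))

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def train_krr(xs, ys, sigma, lam):
    """Return regression coefficients alpha from (K + lam * I) alpha = y."""
    n = len(xs)
    k = [[gaussian_kernel(xs[i], xs[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear(k, ys)

def predict(x_new, xs, alphas, sigma):
    return sum(a * gaussian_kernel(x_new, x, sigma) for a, x in zip(alphas, xs))

# Synthetic training set (hypothetical, for illustration only)
r_train = [0.5, 0.7, 0.9, 1.1, 1.5]
e_train = [(r - 1.0) ** 2 for r in r_train]

alphas = train_krr(r_train, e_train, sigma=1e-11, lam=0.0)
e_between = predict(0.8, r_train, alphas, sigma=1e-11)
```

With σ this small, every off-diagonal kernel element underflows to zero, so the kernel matrix is the identity, the coefficients equal the reference values, and the prediction at any point outside the training set is zero.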
Optimizing hyperparameters Top↑
It is often desirable or necessary to optimize hyperparameters, although many models have reasonable default hyperparameters and/or optimize them by default. MLatom offers two main ways to optimize hyperparameters, described below: 1) grid search for KRR models (including KREG & KRR-CM), and 2) optimization with hyperopt. Grid search is applicable to a small number of hyperparameters (one or two) and is very robust. Optimization with hyperopt is more flexible but never guarantees that good hyperparameters are found.
Arguments Top↑
The optimization objective is to minimize the validation error. For this, the training data set has to be split into the sub-training and validation sets.
Arguments | Available and default parameters | Description |
minimizeError=[type of validation error to minimize] | RMSE [default]MAE | |
Nsubtrain=[number of the sub-training points or a fraction of the training points] | 80% of the training set by default. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set. | points can be sampled in one of the usual ways using sampling argument. By default, randomly. |
Nvalidate=[number of the validation points or a fraction of the training points] | By default, the remaining points of the training set after subtracting the sub-training points. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set. | points can be sampled in one of the usual ways using sampling argument. By default, randomly. |
CVopt | optional. Related option NcvOptFolds . | N-fold cross-validation error. By default, 5-fold cross-validation is used. |
NcvOptFolds=[number of CV folds] | 5 [default]. Can be used only with CVopt . | If this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation. |
LOOopt | optional. | Leave-one-out cross-validation. Only random or no sampling can be used. |
iCVoptPrefOut=[prefix of files with indices for CVopt] | optional. No default prefixes. | file names will include the required prefix. |
Nuse=[N first entries of the data set file to be used] | 100% [default] optional. | sometimes it is useful for tests to use just a part of a data set. |
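The splitting rule described in the table (a value below 1 is read as a fraction, otherwise as a number of points) can be sketched in Python as follows. The rounding of fractional sizes, the 1-based indices, and the function names are assumptions of this sketch and may differ from what MLatom does internally.

```python
import random

def resolve_count(value, n_total):
    """Interpret a value below 1 as a fraction of n_total, otherwise as a count."""
    if value < 1:
        return int(round(value * n_total))  # rounding rule is an assumption
    return int(value)

def split_subtrain_validate(n_train, n_subtrain=0.8, n_validate=None, seed=0):
    """Randomly split training indices into sub-training and validation sets."""
    n_sub = resolve_count(n_subtrain, n_train)
    if n_validate is None:
        n_val = n_train - n_sub  # default: all remaining points
    else:
        n_val = resolve_count(n_validate, n_train)
    if n_sub + n_val > n_train:
        raise ValueError("sub-training and validation sets exceed the training set")
    indices = list(range(1, n_train + 1))  # 1-based indices (assumption)
    random.Random(seed).shuffle(indices)
    return indices[:n_sub], indices[n_sub:n_sub + n_val]

subtrain, validate = split_subtrain_validate(100)
```

With the defaults, 100 training points are split into 80 sub-training and 20 validation points.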
Grid search for kernel ridge regression models Top↑
Grid search is performed on a logarithmic grid. After the best parameters are found in the first iteration, MLatom can perform more iterations of a logarithmic grid search. This option is used only for λ and/or σ hyperparameters of KRR.
Arguments | Available and default parameters | Description |
lgOptDepth=[depth of log search] | 3 [default] | often, depth of one or two suffices and is much faster. 3 is a safer option. |
NlgLambda=[number of points on the logarithmic grid (base 2) for optimization of lambda] | 6 [default] | used with kernel ridge regression and lambda=opt argument. |
lgLambdaL=[lowest value of log2 λ for a logarithmic grid optimization of lambda] | -35.0 [default] | used with kernel ridge regression and lambda=opt argument. |
lgLambdaH=[highest value of log2 λ for a logarithmic grid optimization of lambda] | -6.0 [default] | used with kernel ridge regression and lambda=opt argument. |
NlgSigma=[number of points on the logarithmic grid (base 2) for optimization of sigma] | 6 [default] | used with kernel ridge regression and sigma=opt argument. |
lgSigmaL=[lowest value of log2 σ for a logarithmic grid optimization of sigma] | 6.0 [default for kernel=Gaussian and kernel=Matern ]5.0 [default for kernel=Laplacian and kernel=exponential ] | used with kernel ridge regression and sigma=opt argument. |
lgSigmaH=[highest value of log2 σ for a logarithmic grid optimization of sigma] | 9.0 [default for kernel=Gaussian and kernel=Matern ]12.0 [default for kernel=Laplacian and kernel=exponential ] | used with kernel ridge regression and sigma=opt argument. |
on-the-fly | not used by default. Optional. | on-the-fly calculation of kernel matrix elements for validation, by default it is false and those elements are stored making calculations faster |
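As an illustration of the default grid settings above, the following Python sketch builds a base-2 logarithmic grid from the lowest exponent, the highest exponent, and the number of points. It assumes that the points are evenly spaced in log2 and that both end points are included, which the table does not state explicitly; the function name is hypothetical, and the refinement iterations controlled by lgOptDepth are not modeled.

```python
def log2_grid(lg_low, lg_high, n_points):
    """Return n_points values 2**e with e evenly spaced in [lg_low, lg_high]."""
    if n_points == 1:
        return [2.0 ** lg_low]
    step = (lg_high - lg_low) / (n_points - 1)
    return [2.0 ** (lg_low + i * step) for i in range(n_points)]

# Defaults from the table: lgLambdaL=-35, lgLambdaH=-6, NlgLambda=6
lambda_grid = log2_grid(-35.0, -6.0, 6)
# Defaults for kernel=Gaussian: lgSigmaL=6, lgSigmaH=9, NlgSigma=6
sigma_grid = log2_grid(6.0, 9.0, 6)
```

Under these assumptions, optimizing both hyperparameters means evaluating the validation error on a 6 × 6 grid in the first iteration.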
Optimization with hyperopt Top↑
Optimization with hyperopt requires installation of the hyperopt package. This package provides a general solution to the optimization problem using Bayesian methods with Tree-structured Parzen Estimator (TPE).
Arguments | Available and default parameters | Description |
[argument name of hyperparameter to optimize, e.g., sigma]=hyperopt.uniform(lb,ub) or [argument name of hyperparameter to optimize, e.g., sigma]=hyperopt.loguniform(lb,ub) or [argument name of hyperparameter to optimize, e.g., sigma]=hyperopt.quniform(lb,ub,q) | No default values. | lb is the lower bound and ub is the upper bound. hyperopt.uniform(lb,ub): linear search space. hyperopt.loguniform(lb,ub): logarithmic search space, base 2. hyperopt.quniform(lb,ub,q): discrete linear space, rounded by q. |
hyperopt.max_evals=[maximum number of attempts] | 8 [default] | often, several hundreds or even thousands of evaluations are required. |
hyperopt.losstype=[type of loss for several reference properties] | geomean [default]weighted (used with hyperopt.w_grad ) | geomean uses the geometric mean of losses for different properties (typically, energies and forces)weighted currently only needs to define weight for forces (negative XYZ gradients) |
hyperopt.w_grad=[weight for XYZ gradients] | 0.1 [default]Should be used with hyperopt.losstype=weighted | |
hyperopt.points_to_evaluate=[xx,xx,...],[xx,xx,...],... | optional, no default parameters. | specifies initial guesses to evaluate before the automatic search. Each point, enclosed in a pair of square brackets, should list all values to be optimized, in order. These evaluations are NOT counted in max_evals. |
Examples Top↑
Two typical examples:
mlatom createMLmodel XYZfile=CH3Cl.xyz Yfile=en.dat MLmodelOut=CH3Cl.unf sigma=opt kernel=Matern
mlatom estAccMLmodel XYZfile=CH3Cl.xyz Yfile=en.dat sigma=hyperopt.loguniform(4,20)
Evaluating ML models Top↑
MLatom can evaluate the ML model, i.e., estimate its generalization error. For this, the total data set should be split into the training and test sets. ML model can be either trained as usual (with a generic model or a popular model) or provided with MLmodelIn
argument. If the model is trained, the user can choose the required arguments to train the model. Below, only arguments unique to this feature are given.
MLatom can also calculate learning curves (test error vs. the number of training points).
Arguments | Available and default parameters | Description |
estAccMLmodel | required. | requests estimating the generalization error on the test set. This argument cannot be used together with createMLmodel or useMLmodel. The ML model can either be trained as usual (with a generic model or a popular model) or be provided with the MLmodelIn argument. |
Ntrain=[number of the training points or a fraction of the total set] | 80% of the total set by default. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set. | points can be sampled in one of the usual ways using sampling argument. By default, randomly. |
Ntest=[number of the test points or a fraction of the total set] | By default, the remaining points of the total set after subtracting the training points. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set. | points can be sampled in one of the usual ways using sampling argument. By default, randomly. |
CVtest | optional. Related option NcvTestFolds. | N-fold cross-validation error. By default, 5-fold cross-validation is used. |
NcvTestFolds=[number of CV folds] | 5 [default]. Can be used only with CVtest. | if this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation. |
LOOtest | optional. | leave-one-out cross-validation. Only random or no sampling can be used. |
learningCurve | should be used with lcNtrains argument | produces learning curves. This option produces the following output files in directory learningCurve :– results.json JSON database file with all results– lcy.csv CSV database file with results for values– lcygradxyz.csv CSV database file with results for XYZ gradients– lctimetrain.csv CSV database file with training timings– lctimepredict.csv CSV database file with prediction timings |
lcNtrains=[N,N,N,...,N training set sizes] | required argument if learningCurve is requested | |
lcNrepeats=[N,N,N,...,N numbers of repeats for each Ntrain] or lcNrepeats=[N,N,N,...,N number of repeats for all Ntrains] | 3 [default] | necessary to get error bars. |
Nuse=[N first entries of the data set file to be used] | 100% [default] optional. | sometimes it is useful for tests to use just a part of a data set. |
sampling=user-defined | optional. Requires arguments iTrainIn , iTestIn , and/or iCVtestPrefIn . | reads in indices for the training and test sets from files defined by arguments iTrainIn , iTestIn , and/or iCVtestPrefIn . |
iTrainIn=[file with indices of training points] | optional. no default file names. | |
iTestIn=[file with indices of test points] | optional. no default file names. | |
iCVtestPrefIn=[prefix of files with indices for CVtest] | optional. no default file names. | |
MLmodelIn=[file with ML model] | optional. no default file names. | requests to read a file with ML model. |
iTrainOut=[file with indices of training points] | optional. No default file names. | generates indices for the training set. |
iTestOut=[file with indices of test points] | optional. No default file names. | generates indices for the test set. |
iSubtrainOut=[file with indices of sub-training points] | optional. No default file names. | generates indices for the sub-training set. |
iValidateOut=[file with indices of validation points] | optional. No default file names. | generates indices for the validation set. |
iCVtestPrefOut=[prefix of files with indices for CVtest] | optional. No default file names. | file names will include the required prefix. |
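The N-fold cross-validation options above can be pictured with a small Python sketch that generates the index splits. It is a generic illustration with hypothetical function names and 0-based indices, not the procedure or file format MLatom uses; it also shows why setting the number of folds equal to the number of data points gives leave-one-out cross-validation.

```python
import random

def kfold_indices(n_points, n_folds=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs for N-fold CV."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)  # random sampling before splitting
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # The training part is the union of all folds except the i-th one
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train, test))
    return splits

five_fold = kfold_indices(10, n_folds=5)
leave_one_out = kfold_indices(4, n_folds=4)
```

Each point appears in exactly one test fold, so averaging the test errors over the folds uses every point once for testing.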
Examples Top↑
Simple example:
mlatom estAccMLmodel XYZfile=CH3Cl.xyz Yfile=en.dat sigma=opt lambda=opt
Example of learning curve:
mlatom learningCurve Yfile=y.dat XYZfile=xyz.dat kernel=Gaussian sigma=opt lambda=opt lcNtrains=100,250,500,1000,2500,5000,10000 lcNrepeats=64,32,16,8,4,2,1
With this command training set sizes listed in lcNtrains
will be tested repeatedly for 64, 32, 16, 8, 4, 2, 1 time(s), respectively. All data generated (including csv reports) will be stored in the folder learningCurve under the current directory.
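The pairing between `lcNtrains` and `lcNrepeats` described above can be sketched in Python. The function name is hypothetical and the parsing is only an illustration of the rule stated in the table: either one number of repeats per training set size, or a single number applied to all sizes.

```python
def learning_curve_schedule(lc_ntrains, lc_nrepeats="3"):
    """Pair each training set size with its number of repeats."""
    sizes = [int(s) for s in lc_ntrains.split(",")]
    repeats = [int(s) for s in lc_nrepeats.split(",")]
    if len(repeats) == 1:
        repeats = repeats * len(sizes)  # one value applies to all sizes
    if len(repeats) != len(sizes):
        raise ValueError("lcNrepeats must have one value or one value per size")
    return list(zip(sizes, repeats))

schedule = learning_curve_schedule(
    "100,250,500,1000,2500,5000,10000", "64,32,16,8,4,2,1"
)
total_trainings = sum(n_repeat for _, n_repeat in schedule)
```

For the command above this gives 127 training runs in total; most of them use the small training sets, where the spread between repeats is usually largest.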
Δ-learning Top↑
Δ-machine learning can be used with one of the usual options. Below, arguments unique to delta-learning are described. See also a tutorial.
Arguments | Description |
deltaLearn | required. Should be used with one of: – createMLmodel – useMLmodel MLmodelIn – estAccMLmodel |
Yb=[file with data obtained with baseline method] | required for both training and predictions. |
Yt=[file with data obtained with target method] | required only for training. |
YestT=[file with ML estimations of target method] | required for predictions. |
YestFile=[file with ML corrections to baseline method] | required for predictions. |
YgradXYZb=[file with baseline XYZ gradients] | optional. |
YgradXYZt=[file with target XYZ gradients] | optional. |
YgradXYZestT=[file with ML estimations of target XYZ gradients] | optional. |
YgradXYZestFile=[file with ML corrections to baseline XYZ gradients] | optional. |
Example Top↑
mlatom estAccMLmodel deltaLearn XfileIn=x.dat Yb=UHF.dat Yt=FCI.dat YestT=D-ML.dat YestFile=corr_ML.dat
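The arithmetic behind the files in this example can be sketched in a few lines of Python: the model is trained on the differences between target and baseline values, and the target-level estimate is the baseline value plus the ML correction. The function names and the numbers are hypothetical and serve only to show how `Yb`, `Yt`, `YestFile`, and `YestT` relate to each other.

```python
def delta_targets(y_baseline, y_target):
    """Training labels for the ML model: target minus baseline (from Yt and Yb)."""
    return [t - b for b, t in zip(y_baseline, y_target)]

def apply_correction(y_baseline, corrections):
    """Target-level estimates (YestT): baseline (Yb) plus ML corrections (YestFile)."""
    return [b + c for b, c in zip(y_baseline, corrections)]

# Made-up numbers for illustration only
y_b = [1.0, 2.0]
y_t = [1.5, 1.75]
deltas = delta_targets(y_b, y_t)
# A perfect correction model would reproduce the deltas exactly
y_est_t = apply_correction(y_b, deltas)
```

In practice the corrections come from the trained ML model rather than from the reference differences, so the estimates are only as good as the learned correction.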
Self-correction Top↑
Self-correction as described here. Can be used with one of the usual options. Below, arguments unique to self-correction are described. See also a tutorial.
Arguments | Description |
selfCorrect | required. Should be used with one of: – createMLmodel – useMLmodel MLmodelIn – estAccMLmodel |
Example Top↑
mlatom estAccMLmodel selfCorrect XYZfile=xyz.dat Yfile=y.dat
Data Top↑
Converting XYZ coordinates to molecular descriptor Top↑
Arguments |
XYZ2X |
XYZfile=[input file S with XYZ coordinates] |
XfileOut=[output file S with X values] |
Molecular descriptor arguments Top↑
If the user only provides XYZ file with XYZfile
argument, XYZ coordinates need to be first converted into the molecular descriptor.
Arguments | Available and default parameters | Description |
molDescriptor=[molecular descriptor] | RE [default] (relative-to-equilibrium); CM (Coulomb matrix); ID (inverse internuclear distances) | The RE descriptor is well suited for accurate description of a single-molecule PES. CM is a popular (but somewhat outdated) descriptor which can in principle also be applied to different molecules. In MLatom, the full (vectorized) CM is used, not its eigenvalues as in the original publication. ID is a popular inverse internuclear distances descriptor used in many ML models; it is applicable to a single-molecule PES and is similar to the RE descriptor. |
molDescrType=[type of molecular descriptor] | unsorted [default for RE]; sorted [default for CM]; permuted (optional, can be used for both RE and CM) | unsorted descriptors are the original descriptors, but they do not ensure permutational invariance of homonuclear atoms. sorted descriptors ensure permutational invariance and are typically used for the CM descriptor (where the CM is sorted by its norms). For the RE descriptor, sorting is done by nuclear repulsions; it can be used for structure-based sampling, but it introduces discontinuities in the interpolant and should not be used for simulations. Related options: XYZsortedFileOut, permInvGroups, permInvNuclei. See also related tutorial. permuted augments the descriptor with the permutations of user-defined atoms. Related arguments: permInvKernel, permInvGroups, permInvNuclei. See also related tutorial. |
XYZsortedFileOut=[output file with sorted XYZ coordinates] | optional. Only works with molDescriptor=RE molDescrType=sorted. | sorts the chosen atoms by nuclear repulsion and saves the resulting XYZ coordinates to the file. |
permInvNuclei=[permutationally invariant nuclei] | optional. Should be used with molDescrType=permuted (and often with permInvKernel) | E.g. permInvNuclei=2-3.5-6 will permute atoms 2,3 and 5,6. See also related tutorial. |
permInvGroups=[permutationally invariant groups] | optional. Should be used with molDescrType=permuted (and often with permInvKernel ) | E.g. for water dimer permInvGroups=1,2,3-4,5,6 generates permuted atom indices by flipping the monomers in a dimer. |
Example Top↑
MLatom.py XYZ2X XYZfile=CH3Cl.xyz XfileOut=CH3Cl.x
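To show what such a conversion produces, here is a minimal Python sketch of the three descriptors in their commonly cited forms: ID as inverse internuclear distances, RE as the ratio of equilibrium to current internuclear distances, and the Coulomb matrix with 0.5·Z^2.4 on the diagonal and Z_i·Z_j/r_ij off the diagonal. The ordering of atom pairs, the units, and the function names are assumptions of this sketch and are not taken from MLatom.

```python
import math
from itertools import combinations

def distances(coords):
    """Internuclear distances for all unique atom pairs (i < j)."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def id_descriptor(coords):
    """Inverse internuclear distances, 1 / r_ij."""
    return [1.0 / r for r in distances(coords)]

def re_descriptor(coords, eq_coords):
    """Relative-to-equilibrium descriptor, r_ij(eq) / r_ij."""
    return [r_eq / r for r, r_eq in zip(distances(coords), distances(eq_coords))]

def coulomb_matrix(charges, coords):
    """Full Coulomb matrix, flattened row by row into a vector."""
    n = len(charges)
    vector = []
    for i in range(n):
        for j in range(n):
            if i == j:
                vector.append(0.5 * charges[i] ** 2.4)
            else:
                vector.append(charges[i] * charges[j] / math.dist(coords[i], coords[j]))
    return vector

# H2 at an assumed reference distance of 0.74 and stretched to twice that value
h2_eq = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)]
h2_stretched = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.48)]
re_values = re_descriptor(h2_stretched, h2_eq)
```

For a diatomic molecule each descriptor based on distances has a single element; for N atoms there are N(N-1)/2 unique pairs.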
Analyzing data sets Top↑
MLatom can analyze data sets by comparing them, typically by calculating the errors of ML-predicted values with respect to available reference values. All files are input files, and the MLatom output is a statistical analysis.
Arguments |
analyze |
Yfile=[input file with values] |
YgradXYZfile=[input file with gradients in XYZ coordinates] |
YestFile=[input file with estimated Y values] |
YgradXYZestFile=[input file with estimated XYZ gradients] |
Example Top↑
MLatom.py analyze Yfile=en.dat YestFile=enest.dat
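The kind of statistics such an analysis reports can be sketched in Python. The function below computes a few standard error measures between reference and estimated values; the exact set of statistics printed by MLatom is not listed here, so the selection and the function name are assumptions.

```python
import math

def error_statistics(y_ref, y_est):
    """Return standard error measures of estimated vs. reference values."""
    errors = [est - ref for ref, est in zip(y_ref, y_est)]
    n = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,             # mean absolute error
        "MSE": sum(errors) / n,                             # mean signed error
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),  # root-mean-square error
        "max_abs_error": max(abs(e) for e in errors),
    }

# Made-up numbers for illustration only
stats = error_statistics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
```

The reference values would come from the file given with `Yfile` and the estimates from the file given with `YestFile`.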
Sampling and splitting Top↑
Arguments for sampling and splitting
Arguments | Available and default parameters | Description |
sample | it requires at least one of iTrainOut , CVtest , LOOtest , CVopt , LOOopt | see also tutorial. |
XYZfile=[file with XYZ coordinates] or XfileIn=[file with input vectors X] | required. | |
iTrainOut=[file with indices of training points] | no default file names. | generates indices for the training set. |
iTestOut=[file with indices of test points] | no default file names. | generates indices for the test set. |
iSubtrainOut=[file with indices of sub-training points] | no default file names. | generates indices for the sub-training set. |
iValidateOut=[file with indices of validation points] | no default file names. | generates indices for the validation set. |
CVtest | optional. Related option NcvTestFolds. | generates indices for splits in N-fold cross-validation. By default, 5-fold cross-validation is used. |
NcvTestFolds=[number of CV folds] | 5 [default]. Can be used only with CVtest. | if this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation. |
LOOtest | optional. | leave-one-out cross-validation. Only random or no sampling can be used. |
iCVtestPrefOut=[prefix of files with indices for CVtest] | no default prefixes. | file names will include the required prefix. |
CVopt | optional. Related option NcvOptFolds . | generates indices for N-fold cross-validation for hyperparameters optimization. By default, 5-fold cross-validation is used. |
NcvOptFolds=[number of CV folds] | 5 [default]. Can be used only with CVopt . | If this number is equal to the number of data points, leave-one-out cross-validation is performed. Only random or no sampling can be used for cross-validation. |
LOOopt | optional. | Leave-one-out cross-validation. Only random or no sampling can be used. |
iCVoptPrefOut=[prefix of files with indices for CVopt] | no default prefixes. | file names will include the required prefix. |
Additional optional arguments for sampling Top↑
Arguments used with sample
argument.
Arguments | Available and default parameters |
sampling=[type of data set sampling into splits] | random [default]: random sampling. none: simply splits the unshuffled data set into the training and test sets (in this order), and likewise into the sub-training and validation sets. structure-based: structure-based sampling. farthest-point: farthest-point traversal iterative procedure. |
Nuse=[N first entries of the data set file to be used] | 100% [default] optional. |
Ntrain=[number of the training points or a fraction of the total set] | 80% of the total set by default. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set. |
Ntest=[number of the test points or a fraction of the total set] | By default, the remaining points of the total set after subtracting the training points. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the total set. |
Nsubtrain=[number of the sub-training points or a fraction of the training points] | 80% of the training set by default. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set. |
Nvalidate=[number of the validation points or a fraction of the training points] | By default, the remaining points of the training set after subtracting the sub-training points. If a parameter is a decimal number less than 1, then it is considered to be a fraction of the training set. |
Example Top↑
Structure-based sampling:
mlatom sample sampling=structure-based XYZfile=CH3Cl.xyz Ntrain=1000 Ntest=10000 iTrainOut=itrain.dat iTestOut=itest.dat
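For comparison with random sampling, the farthest-point traversal mentioned among the sampling options can be sketched generically in Python: starting from one point, the procedure repeatedly adds the point whose distance to the already selected set is largest. The starting point, the Euclidean metric on the input vectors, and the function name are assumptions of this sketch, not MLatom's implementation.

```python
import math

def farthest_point_sampling(points, n_select, start=0):
    """Greedy farthest-point traversal; returns indices of the selected points."""
    if n_select > len(points):
        raise ValueError("cannot select more points than available")
    selected = [start]
    # Distance of every point to its nearest selected point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n_select:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# One-dimensional toy input vectors (hypothetical)
toy_points = [(0.0,), (1.0,), (2.0,), (10.0,)]
picked = farthest_point_sampling(toy_points, 3)
```

The selected points spread out over the sampled region, which is the motivation for using such a procedure instead of random sampling when the training set must be small.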
Slicing Top↑
Sometimes it is useful to slice data by the Euclidean distance of their descriptors to the equilibrium descriptor. See tutorial.
Arguments for slicing:
Arguments | Available and default parameters |
slice | required. |
XfileIn=[file with input vectors X] | required. |
eqXfileIn=[file with input vector for equilibrium geometry] | required. |
Nslices=[number of slices] | 3 [default]optional. |
Arguments for sampling from slices:
Arguments | Available and default parameters |
sampleFromSlices | |
Ntrain=[total integer number N of training points from all slices] | required. |
Nslices=[number of slices] | 3 [default]optional. |
Arguments for merging indices from slices:
Arguments | Available and default parameters |
mergeSlices | |
Ntrain=[total integer number N of training points from all slices] | required. |
Nslices=[number of slices] | 3 [default]optional. |
Examples Top↑
See tutorial.
MLatom.py slice Nslices=3 XfileIn=x_sorted.dat eqXfileIn=eq.x
mlatom sampleFromSlices Nslices=3 sampling=structure-based Ntrain=4480
mlatom mergeSlices Nslices=3 Ntrain=4480