# Tutorial for AQC Chapter

On this page you can find instructions how to follow the examples given in the book chapter:

Pavlo O. Dral, Quantum Chemistry Assisted by Machine Learning. In *Advances in Quantum Chemistry*: *Chemical Physics and Quantum Chemistry, Volume 81*, 1st ed.; Kenneth Ruud, Erkki J. Brändas, Eds. Academic Press: 2020; Vol. 81. DOI: 10.1016/bs.aiq.2020.05.002. (free copy online till 6 Nov. 2020)

Please see Manual of *MLatom *and other pages on this website for more details about how to use this program package. For examples you need *MLatom* 1.1 or newer version.

Table of Contents

## Example 1: Overfitting of the H_{2} Dissociation Curve

Download `R_20.dat`

file with 20 points corresponding to internuclear distances in the H_{2} molecule in Å:

Download `E_FCI_20.dat`

file with full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree) for above 20 points:

Train (option `createMLmodel`

) ML model and save it to a file (option `MLmodelOut=mlmod_E_FCI_20_overfit.unf`

) using above data (training set) and the following command requesting fitting with the kernel ridge regression, and Gaussian kernel function and the hyperparameters σ=10^{−11} and λ=0:

`mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=0.00000000001 lambda=0.0 sampling=none > create_E_FCI_20_overfit.out`

In the output file `create_E_FCI_20_overfit.out`

you can see that the error for the created ML model is essentially zero for the training set. Option `sampling=none`

ensures that the order of training points remains the same as in the original data set (it does not matter for creating this ML model, but will be useful later). You can use the created ML model (option `useMLmodel`

) for calculating energies for its own training set and save them to `E_ML_20_overfit.dat`

file:

`mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat YestFile=E_ML_20_overfit.dat debug > use_E_FCI_20_overfit.out`

Now you can compare the reference FCI values with the ML predicted values and see that they are the same. Option `debug`

also prints the values of the regression coefficients alpha to the output file `use_E_FCI_20_overfit.out`

. You can compare them with the reference FCI energies and see that they are exactly the same (they are given in the same order as the training points).

Now try to calculate energy with the ML model for any other internuclear distance not present in the training set and see that predictions are zero. It means that the ML model is overfitted and cannot generalize well to new situations, because the hyperparameter choice.

## Example 2: Underfitting of the H_{2} Dissociation Curve

Train (option `createMLmodel`

) ML model and save it to a file (option `MLmodelOut=mlmod_E_FCI_20_underfit.unf`

) using the data (training set) from Example 1 and the following command requesting fitting with the kernel ridge regression, and Gaussian kernel function and the hyperparameters σ=10^{9} and λ=1:

`mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=1000000000 lambda=1.0 sampling=none > create_E_FCI_20_underfit.out`

In the output file `create_E_FCI_20_underfit.out`

you can see that the error for the created ML model is very large for the training set. You can use the created ML model (option `useMLmodel`

) for calculating energies for its own training set and save them to `E_ML_20_underfit.dat`

file:

`mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat YestFile=E_ML_20_underfit.dat debug > use_E_FCI_20_underfit.out`

You can see that the ML predicted values are the same for all points. Option `debug`

also prints the values of the regression coefficients alpha to the output file `use_E_FCI_20_underfit.out`

. You can sum them up and see that their sum is equal to the ML energies.

Now try to calculate energy with the ML model for any other internuclear distance not present in the training set and see that predictions are do not change. It means that the ML model is underfitted and cannot generalize well to new situations, because the hyperparameter choice.

## Example 3: Model Selection and Evaluation for the H_{2} Dissociation Curve

Download the full data set of with 451 points along the H_{2} dissociation curve. File with internuclear distances in the H_{2} molecule in Å:

File with the full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree):

Now you can download the indices of the points in the data set to be used as the training, test, sub-training, and validation sets. You can check that all the training points are the same as used in Examples 1 and 2.

Now we can optimize hyperparameters and evaluate the generalization error of the ML models using a single command. For example, for the Gaussian kernel function:

`mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lgSigmaL=-10 lgSigmaH=10 lambda=opt lgLambdaL=-40 sampling=user-defined Ntrain=20 Nsubtrain=16 iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat`

You can compare the ML predicted values saved to the file `E_ML_451_Gaussian.dat`

with the reference values and see that this model generalizes much better than models trained in Examples 1 and 2.

Note that many of these options ( `lgSigmaL=-10 lgSigmaH=10`

`lgLambdaL=-40 sampling=user-defined Ntrain=20 Nsubtrain=16 iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat`

) are given to ensure that you obtain the same result every time you run the command.

You can simply run:

`mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lambda=opt`

This command will use the random 20% for testing and 20% of training points for validating, i.e. each time you run the command you will see different result. In addition, since MLatom uses the logarithmic search for hyperparameter optimization, you will need to modify the lowest boundaries using commands `lgSigmaL=-10`

and `lgLambdaL=-40`

to get better results:

`mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lgSigmaL=-10 lambda=opt lgLambdaL=-40`

## Example 4: Extrapolation vs Interpolation of the H_{2} Dissociation Curve

Use the files from Example 3, but move the first three indices from the itrain.dat to itest.dat file and remove the first three indices from the isubtrain.dat file. Use the same command as in Example 3 to test the accuracy, but change the number of training and subtraining points accordingly.

## Example 5: Delta-learning of the H_{2} Dissociation Curve

Download the file with the UHF/STO-3G energies (in Hartree):

Follow the procedure described in Example 4, but now instead of learning the FCI/aug-cc-pV6Z energies directly, use the Δ-ML. For this, you need to use the option `deltaLearn`

and instead of `Yfile`

you need to specify files with the baseline (low-level) reference values (UHF/STO-3G, option `Yb=E_UHF_451.dat`

) and with the target (high-level) reference values (FCI/aug-cc-pV6Z, option `Yt=E_FCI_451.dat`

). In addition, you can specify the filenames for file with the Δ-ML energies estimating target level of theory (option `E_D-ML_451.dat`

) and file with predicted ML corrections (option `YestFile=corr_ML_451.dat`

).

Your input file `mlatom.inp`

for *MLatom *should look like:

estAccMLmodel

deltaLearn

XfileIn=R_451.dat

Yb=E_UHF_451.dat

Yt=E_FCI_451.dat

YestT=E_D-ML_451.dat

YestFile=corr_ML_451.dat

kernel=Gaussian

sigma=opt lgSigmaL=-10 lgSigmaH=10

lambda=opt lgLambdaL=-40

Ntrain=17

Nsubtrain=13

sampling=user-defined

iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat

To run the calculations, just execute the command:

`mlatom mlatom.inp > mlatom.out`

You can see that the delta-ML predictions in the file `E_D-ML_451.dat`

are much better than in Example 4.

## Example 6: Training on Randomly Sampled Points on the H_{2} Dissociation Curve

Use the files `R_451.dat`

and `E_FCI_451.dat`

from Example 3. Create the input file `mlatom.inp`

for *MLatom *with the following options (explanation for them is given in comments after ‘#’):

estAccMLmodel # Requests model evaluation

# Comment the previous option and uncommend the following two options to save the model to file.

# createMLmodel # Requests creating ML model

# MLmodelOut=mlmod_E_FCI_Gaussian_20random.unf # saves the model to file

XfileIn=R_451.dat # File with input vectors

Yfile=E_FCI_451.dat # File with reference data at FCI

YestFile=E_ML_451_Gaussian_20random.dat # File with ML predictions

kernel=Gaussian # Specifies the kernel function

sigma=opt # Requests optimizing sigma parameter

lgSigmaL=-10 lgSigmaH=10 # ... and sets optimization limits

lambda=opt lgLambdaL=-40 # The same for lambda parameter

sampling=random # Requests random sampling of points for training, sub-training, and validation

Ntrain=20 Nsubtrain=16 # Requests using 20 points for training and 16 for sub-training

iTrainOut=itrain_20random.dat # Saves indices of random points chosen for training

iSubtrainOut=isubtrain_20random.dat # Saves indices of random points chosen for sub-training

iValidateOut=ivalidate_20random.dat # Saves indices of random points chosen for validation

iTestOut=itest_20random.dat # Saves indices of random points chosen for testing

To run the calculations, just execute the command:

`mlatom mlatom.inp > mlatom.out`

You can find in the output file various error measures for the random hold-out test set, that should look something like this:

Statistical analysis for 431 entries in the test set

MAE = 0.0008701190755

MSE = -0.0008594726288

RMSE = 0.0045194135782

mean(Y) = -1.0363539894174

mean(Yest) = -1.0372134620462

largest positive outlier

error = 0.0001099473549

index = 45

estimated value = -1.1566967526451

reference value = -1.1568067000000

largest negative outlier

error = -0.0421702385754

index = 1

estimated value = -1.1463026385754

reference value = -1.1041324000000

correlation coefficient = 0.9974017799004

linear regression of {y, y_est} by f(a,b) = a + b * y

R^2 = 0.9948103105483

a = 0.0277384800540

b = 1.0275947726113

SE_a = 0.0037190821572

SE_b = 0.0035833876014

By running this command several times you will see that above numbers are different each time, because each time different training points are chosen as you can check in the `itrain_20random.dat`

file. Note: the keywords requesting saving indices are optional. MLatom does not overwrite these files and stops if it detects them. Thus, if you run this command again, you should either comment these lines or remove previous files or give them each time different name.

## Example 7: Potential energy surface of CH_{3}Cl

Here we follow some of the steps for creating machine learning potential energy surface (PES) of CH_{3}Cl as published in [*J. Chem. Phys.* **2017**, *146*, 244108]. We use the data set with 44819 points from [J. Chem. Phys. **2015**, *142*, 244306], which was kindly provided by Dr. Alec Owens.

Here we show how to:

- Convert geometries from XYZ format to the ML input vectors.
- Sample points to the training, test, sub-training, and validation sets only from ML input using structure-based sampling from the data sliced into three regions.
- Train the s10%-ML model using self-correction and structure-based sampled training set.

### Converting geometries to ML input vector

Download `xyz.dat`

file with Cartesian coordinates of CH_{3}Cl

Download `eq.xyz`

file with the near-equilibrium geometry of CH_{3}Cl in Cartesian coordinates:

Convert (option `XYZ2X`

) geometries of CH_{3}Cl in Cartesian coordinates (option `XYZfile=xyz.dat`

) to the ML input vector of normalized inverted internuclear distances (option `molDescriptor=RE`

) with hydrogen atoms (3, 4, 5 in the XYZ files, option `permInvNuclei=3-4-5`

) sorted (option `molDescrType=sorted`

) by their nuclear repulsions to all other atoms:

`mlatom XYZ2X XYZfile=xyz.dat XfileOut=x_sorted.dat molDescriptor=RE molDescrType=sorted permInvNuclei=3-4-5 > XYZ2X.out`

You should get the file x_sorted.dat (option `XfileOut=x.dat`

) with 44819 lines, each with 10 entries. In the `XYZ2X.out`

file you should see the line `Number of permutations: 6`

.

### Sampling points

Here we show how to sample points to the training, test, sub-training, and validation sets only from ML input using structure-based sampling from the data sliced into three regions.

Get the input vector for the equilibrium geometry using the same procedure as in preceding section:

`mlatom XYZ2X XYZfile=eq.xyz XfileOut=eq.x molDescriptor=RE molDescrType=sorted permInvNuclei=3-4-5 > eqXYZ2X.out`

You should get a vector with ten 1.0’s in `eq.x`

.

Sort geometries by the Euclidean distance of their corresponding ML input vector to the input vector of the equilibrium geometry and slice the ordered data set into 3 regions of the same size:

`mlatom slice nslices=3 XfileIn=x_sorted.dat eqXfileIn=eq.x > slice.out`

This command should have created files `xordered.dat`

(input vectors sorted by distance), `indices_ordered.dat`

(indices of ordered data set wrt the original data set), and `distances_ordered.dat`

(list of Euclidean distances of ordered data points to the equilibrium). It has also created directories `slice1`

, `slice2`

, and `slice3`

. Each of them contains three files: `x.dat`

, `slice_indices.dat`

, and `slice_distances.dat`

that are slices of the corresponding files of the entire data set.

Use structure-based sampling to sample the desired number of data from each slice:

`mlatom sampleFromSlices nslices=3 sampling=structure-based Ntrain=4480 > sbs.out`

This command should create `itrain.dat`

files with training set indices in each `slice[1-3]`

directory. Note: it is possible to modify `sliceData.py`

script to submit the jobs in parallel to the queue.

Merge sampled indices from all slices into indices files for the training, test, sub-training, and validation sets using the same order of data points as in original data set:

`mlatom mergeSlices nslices=3 Ntrain=4480 > merge.out`

This command will create four files with indices: `itrain.dat`

(with 4480 points for training), `isubtrain.dat`

(with 80% of training points also chosen using structure-based sampling), `itest.dat`

, and `ivalidate.dat`

.

### Creating ML model

Download `y.dat`

file with the reference energies:

In [*J. Chem. Phys.* **2017**, *146*, 244108] we optimized hyperparameters using the validation set with points having deformation energy not higher than 10000 cm^{-1}. You need to filter out the points with higher energies from the validation set. To do this move the file `ivalidate.dat`

to `ivalidateall.dat`

and use the following Python 2 script:

Now you can train ML model using the reference data for 4480 training points defined in itrain.dat file using the same procedure as described in [*J. Chem. Phys.* **2017**, *146*, 244108]. Check how many points are in files with indices and use the following input file `create.inp`

for *MLatom*:

createMLmodel

XYZfile=xyz.dat

Yfile=y.dat

molDescriptor=RE

molDescrType=permuted

permInvNuclei=3-4-5 # specifies which atoms to permute

kernel=Gaussian

permInvKernel # requests using permutationally invariant kernel

selfCorrect # requests self-correction procedure

sigma=opt lgsigmal=-10

lambda=opt NlgLambda=11 lglambdal=-25 lglambdah=-1

minimizeError=RMSE

sampling=user-defined

Ntrain=4480

Nsubtrain=3584

Nvalidate=281

itrainin=itrain.dat

isubtrainin=isubtrain.dat

ivalidatein=ivalidate.dat

Run the command:

`mlatom create.inp > create.out`

This command created mlmodlayer[1-4].unf files with ML models for each of the 4 layers of the self-correcting procedure. Since we used 4480 out of 44819 points this model is equivalent to s10%-ML model in [*J. Chem. Phys.* **2017**, *146*, 244108]. The above command created files `ylayer[1-4].dat`

with ML predictions for each of the layers. The final predictions are in the `ylayer4.dat`

. You can compare this values with the reference values for the entire data set using the following Python 2 script:

The error should be 3.44 cm^{-1}.

## Example 8: Importance of Sampling in Critical Regions

Download files `R.dat`

with coordinates and `NAC.dat`

with reference nonadiabatic couplings calculated with the spin-boson Hamiltonian model in 1-D (*J. Phys. Chem. Lett.* **2018**, *9*, 5660):

Train ML model on 10 randomly sampled points and make predictions with this model for all points in the data set using the following input file:

estAccMLmodel # Requests model evaluation

XfileIn=R.dat # File with input vectors

Yfile=NAC.dat # File with reference data

YestFile=ML-NAC.dat # File with ML predictions

kernel=Matern # Specifies the kernel function

nn=1 # Hyperparameter

sigma=opt # Requests optimizing sigma parameter

lgSigmaL=-10 lgSigmaH=10 # … and sets optimization limits

lambda=opt lgLambdaL=-40 # The same for lambda parameter

sampling=random # Requests random sampling of points for training, sub-training, and validation

Ntrain=10 # Requests using 16 points for training

iTrainOut=itrain_random.dat # Saves indices of random points chosen for training

iSubtrainOut=isubtrain_random.dat # Saves indices of random points chosen for sub-training

iValidateOut=ivalidate_random.dat # Saves indices of random points chosen for validation

iTestOut=itest_random.dat # Saves indices of random points chosen for testing

You can compare reference nonadiabatic couplings with the ML couplings saved to file `ML-NAC.dat`

. If you are very lucky, ML can describe the narrow peack. In most cases though, such an ML model will miss the peak. You can however describe this peak properly, if you know the position of the peak beforehand and add it to the training set. Add the index of the minimum in the data set to the files with indices of training and sub-training points, then use the following input file to train ML model on 11 points (10 previously sampled random points + 1 critical point) and to use this ML model to predict nonadiabatic couplings for all points in the data set:

estAccMLmodel # Requests model evaluation

XfileIn=R.dat # File with input vectors

Yfile=NAC.dat # File with reference data

YestFile=ML-NAC_critical.dat # File with ML predictions

kernel=Matern # Specifies the kernel function

nn=1 # Hyperparameter

sigma=opt # Requests optimizing sigma parameter

lgSigmaL=-10 lgSigmaH=10 # … and sets optimization limits

lambda=opt lgLambdaL=-40 # The same for lambda parameter

sampling=user-defined # Requests random sampling of points for training, sub-training, and validation

Ntrain=11 # Requests using 16 points for training

Nsubtrain=9

iTrainIn=itrain_random+critical.dat # Reads indices from file

iSubtrainIn=isubtrain_random+critical.dat # Reads indices from file

iValidateIn=ivalidate_random+critical.dat # Reads indices from file

iTestIn=itest_random+critical.dat # Reads indices from file

As you can see, the ML predictions improved significantly.

How can I have access to the regression coefficients (alphas) in the case KRR model?

Hi Oyeniyi Ezekiel,

see above tutorial, you can load KRR model from saved file and request to output the regression coefficients with debug option, for example:

mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat YestFile=E_ML_20_underfit.dat debug > use_E_FCI_20_underfit.out

Got it. Thank you.

Hi Dral,

It seems the learning curve ML task only works for molDescriptor=RE. Please, how can I use it with molDescriptor=CM?

Thank you?

Hi Oyeniyi,

this tutorial page does not mention any learning curves.

How to do learning curves is described at http://mlatom.com/tutorial-on-benchmarking-machine-learning-potentials/#Collecting_learning_curve_data

Please check there and see if it works. It should work with molDescriptor=CM. If you still have troubles, please post your input file and any error messages you get.

Thank you for your response. I will get back to you.

Maybe you have already ran learning curve task with RE descriptor, then when you only change molDescriptor to CM, the program will not see it as a new learning curve and skip it….

You can modifiy your input like:

…

mlmodeltype=krr-cm

molDescriptor=CM

…

then the learning curve will run with folder name ‘mlatomf_krr-cm_en’ or ‘mlatomf_krr-cm_en_grad’

Hi, thank you for the very nice tutorial. I wonder is it possible to use different training set for the delta-learning in the hierarchical model (e.g., use smaller data set for the high-level reference value). As stated in the manual, we can only use the same XfileIn for each model, right?

Hi, thank you for your comment. Maybe I do not understand exactly what you mean.

Delta-learning just uses the same amount of high-level and low-level data because it learns on their differences.

However, we have hierarchical ML approach (https://doi.org/10.1063/5.0006498) where you can also train a separate model on lots of low-level data so that you can use this separate model as a baseline instead of low-level QM method. Then you can indeed have different data sets for high and low level data but would need to curate data sets and models separately.

Also, note that we have much better ways to do all this in Python, http://mlatom.com/docs