Tutorial for AQC Chapter

On this page you can find instructions on how to follow the examples given in the book chapter:

Pavlo O. Dral. Quantum Chemistry Assisted by Machine Learning. In Chemical Physics and Quantum Chemistry, Advances in Quantum Chemistry, Vol. 81, 1st ed.; Kenneth Ruud, Erkki J. Brändas, Eds.; Academic Press, 2020. DOI: 10.1016/bs.aiq.2020.05.002. (free copy online till 6 Nov. 2020)

Please see the Manual of MLatom and other pages on this website for more details about how to use this program package. For these examples you need MLatom 1.1 or a newer version.

Example 1: Overfitting of the H2 Dissociation Curve

Download the R_20.dat file with 20 points corresponding to internuclear distances in the H2 molecule in Å.

Download the E_FCI_20.dat file with full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree) for the above 20 points.

Train an ML model (option createMLmodel) and save it to a file (option MLmodelOut=mlmod_E_FCI_20_overfit.unf) using the above data (training set) and the following command, which requests kernel ridge regression with the Gaussian kernel function and the hyperparameters σ = 10^-11 and λ = 0:

mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=0.00000000001 lambda=0.0 sampling=none > create_E_FCI_20_overfit.out

In the output file create_E_FCI_20_overfit.out you can see that the error of the created ML model on the training set is essentially zero. The option sampling=none ensures that the order of the training points remains the same as in the original data set (this does not matter for creating this ML model, but will be useful later). You can use the created ML model (option useMLmodel) to calculate energies for its own training set and save them to the E_ML_20_overfit.dat file:

mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat YestFile=E_ML_20_overfit.dat debug > use_E_FCI_20_overfit.out

Now you can compare the reference FCI values with the ML-predicted values and see that they are the same. The option debug also prints the values of the regression coefficients alpha to the output file use_E_FCI_20_overfit.out. You can compare them with the reference FCI energies and see that they are exactly the same (they are given in the same order as the training points).

Now try to calculate the energy with this ML model for any internuclear distance not present in the training set and see that the predictions are zero. This means that the ML model is overfitted and cannot generalize to new situations because of the hyperparameter choice.
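
To see why this happens, note that KRR predicts f(x) = Σ_i α_i k(x, x_i) with α = (K + λI)^-1 y. With σ = 10^-11 the Gaussian kernel k(x, x') = exp(-||x - x'||^2/(2σ^2)) vanishes for any two distinct points, so the kernel matrix K is essentially the identity, α equals y (which is what the debug output shows), and the prediction drops to zero away from the training points. Below is a minimal standalone NumPy sketch of this behavior (not MLatom itself; it uses toy stand-in data and assumes the common 2σ^2 convention of the Gaussian kernel):

import numpy as np

# toy stand-in for R_20.dat / E_FCI_20.dat
X = np.linspace(0.5, 5.0, 20)          # internuclear distances, Å
y = -1.0 + 0.5 * (X - 0.74)**2         # stand-in for the FCI energies

def krr_fit_predict(X, y, Xnew, sigma, lam):
    # kernel ridge regression with a Gaussian kernel
    K = np.exp(-(X[:, None] - X[None, :])**2 / (2 * sigma**2))
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    k = np.exp(-(Xnew[:, None] - X[None, :])**2 / (2 * sigma**2))
    return k @ alpha, alpha

# overfitted regime: sigma -> 0, lambda = 0
ytrain, alpha = krr_fit_predict(X, y, X, sigma=1e-11, lam=0.0)
print(np.allclose(ytrain, y))   # True: zero error on the training set
print(np.allclose(alpha, y))    # True: alpha coincides with the reference values
ynew, _ = krr_fit_predict(X, y, np.array([1.23]), sigma=1e-11, lam=0.0)
print(ynew)                     # [0.]: no generalization off the training grid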

Example 2: Underfitting of the H2 Dissociation Curve

Train an ML model (option createMLmodel) and save it to a file (option MLmodelOut=mlmod_E_FCI_20_underfit.unf) using the data (training set) from Example 1 and the following command, which requests kernel ridge regression with the Gaussian kernel function and the hyperparameters σ = 10^9 and λ = 1:

mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=1000000000 lambda=1.0 sampling=none > create_E_FCI_20_underfit.out

In the output file create_E_FCI_20_underfit.out you can see that the error of the created ML model on the training set is very large. You can use the created ML model (option useMLmodel) to calculate energies for its own training set and save them to the E_ML_20_underfit.dat file:

mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat YestFile=E_ML_20_underfit.dat debug > use_E_FCI_20_underfit.out

You can see that the ML-predicted values are the same for all points. The option debug also prints the values of the regression coefficients alpha to the output file use_E_FCI_20_underfit.out. You can sum them up and see that their sum is equal to the (constant) ML-predicted energy.

Now try to calculate the energy with this ML model for any internuclear distance not present in the training set and see that the predictions do not change. This means that the ML model is underfitted and cannot generalize to new situations because of the hyperparameter choice.
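
This is the opposite limit: for σ → ∞ the kernel is essentially 1 for every pair of points, so f(x) = Σ_i α_i k(x, x_i) ≈ Σ_i α_i, the same constant everywhere, which is why summing the printed alphas reproduces the ML energy. Reusing krr_fit_predict from the sketch in Example 1 (same assumptions):

ypred, alpha = krr_fit_predict(X, y, X, sigma=1e9, lam=1.0)
print(np.ptp(ypred))                      # ~0: the same value for every point
print(np.isclose(alpha.sum(), ypred[0]))  # True: the constant is the sum of alphas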

Example 3: Model Selection and Evaluation for the H2 Dissociation Curve

Download the full data set with 451 points along the H2 dissociation curve. File R_451.dat with internuclear distances in the H2 molecule in Å.

File E_FCI_451.dat with the full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree).

Now you can download the indices of the points in the data set to be used as the training, test, sub-training, and validation sets. You can check that all the training points are the same as used in Examples 1 and 2.

Now we can optimize hyperparameters and evaluate the generalization error of the ML models using a single command. For example, for the Gaussian kernel function:

mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lgSigmaL=-10 lgSigmaH=10 lambda=opt lgLambdaL=-40 sampling=user-defined Ntrain=20 Nsubtrain=16 iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat

You can compare the ML-predicted values saved to the file E_ML_451_Gaussian.dat with the reference values and see that this model generalizes much better than the models trained in Examples 1 and 2.

Note that many of these options (lgSigmaL=-10 lgSigmaH=10 lgLambdaL=-40 sampling=user-defined Ntrain=20 Nsubtrain=16 iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat) are given to ensure that you obtain the same result every time you run the command.

Alternatively, you can simply run:

mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lambda=opt

This command will use a random 20% of the points for testing and 20% of the training points for validation, i.e. each time you run the command you will see a different result. In addition, since MLatom uses a logarithmic search for hyperparameter optimization, you will need to lower the search boundaries with the options lgSigmaL=-10 and lgLambdaL=-40 to get better results:

mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lgSigmaL=-10 lambda=opt lgLambdaL=-40
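
For intuition, sigma=opt and lambda=opt amount, in essence, to a search over a logarithmic grid of hyperparameter values, where each (σ, λ) pair is scored on the validation set and lgSigmaL/lgSigmaH and lgLambdaL/lgLambdaH set the log10 bounds of that grid. Below is a minimal self-contained sketch of such a search (an illustration only, not MLatom's actual optimizer; the Gaussian-kernel convention is the same assumption as in the Example 1 sketch):

import numpy as np

def validation_rmse(Xsub, ysub, Xval, yval, sigma, lam):
    # train KRR on the sub-training set, evaluate on the validation set
    K = np.exp(-(Xsub[:, None] - Xsub[None, :])**2 / (2 * sigma**2))
    alpha = np.linalg.solve(K + lam * np.eye(len(Xsub)), ysub)
    k = np.exp(-(Xval[:, None] - Xsub[None, :])**2 / (2 * sigma**2))
    return np.sqrt(np.mean((k @ alpha - yval)**2))

def log_grid_search(Xsub, ysub, Xval, yval,
                    lg_sigma=(-10.0, 10.0), lg_lambda=(-40.0, 0.0), n=21):
    best = (np.inf, None, None)
    for sigma in np.logspace(lg_sigma[0], lg_sigma[1], n):
        for lam in np.logspace(lg_lambda[0], lg_lambda[1], n):
            try:
                rmse = validation_rmse(Xsub, ysub, Xval, yval, sigma, lam)
            except np.linalg.LinAlgError:
                continue  # skip numerically singular combinations
            if rmse < best[0]:
                best = (rmse, sigma, lam)
    return best  # (validation RMSE, sigma, lambda)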

Example 4: Extrapolation vs Interpolation of the H2 Dissociation Curve

Use the files from Example 3, but move the first three indices from the itrain.dat file to the itest.dat file and remove the first three indices from the isubtrain.dat file. Use the same command as in Example 3 to test the accuracy, but change the numbers of training and sub-training points accordingly (Ntrain=17 and Nsubtrain=13, as also used in Example 5).
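
A short script can do this rearrangement (a sketch; it assumes the index files contain one index per line, as MLatom writes them, and it drops the same three moved indices from the sub-training set):

# move the first three training indices to the test set and
# drop those indices from the sub-training set
with open('itrain.dat') as f:
    itrain = f.read().split()
moved, itrain = itrain[:3], itrain[3:]

with open('itest.dat') as f:
    itest = f.read().split() + moved

with open('isubtrain.dat') as f:
    isubtrain = [i for i in f.read().split() if i not in moved]

for name, idx in (('itrain.dat', itrain), ('itest.dat', itest),
                  ('isubtrain.dat', isubtrain)):
    with open(name, 'w') as f:
        f.write('\n'.join(idx) + '\n')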

Example 5: Delta-learning of the H2 Dissociation Curve

Download the E_UHF_451.dat file with the UHF/STO-3G energies (in Hartree).

Follow the procedure described in Example 4, but instead of learning the FCI/aug-cc-pV6Z energies directly, use Δ-ML. For this, you need the option deltaLearn, and instead of Yfile you need to specify a file with the baseline (low-level) reference values (UHF/STO-3G, option Yb=E_UHF_451.dat) and a file with the target (high-level) reference values (FCI/aug-cc-pV6Z, option Yt=E_FCI_451.dat). In addition, you can specify the filenames for the file with the Δ-ML energies estimating the target level of theory (option YestT=E_D-ML_451.dat) and for the file with the predicted ML corrections (option YestFile=corr_ML_451.dat).

Your input file mlatom.inp for MLatom should look like this:

estAccMLmodel
deltaLearn
XfileIn=R_451.dat
Yb=E_UHF_451.dat
Yt=E_FCI_451.dat
YestT=E_D-ML_451.dat
YestFile=corr_ML_451.dat
kernel=Gaussian
sigma=opt lgSigmaL=-10 lgSigmaH=10
lambda=opt lgLambdaL=-40
Ntrain=17
Nsubtrain=13
sampling=user-defined
iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat

To run the calculations, just execute the command:

mlatom mlatom.inp > mlatom.out

You can see that the Δ-ML predictions in the file E_D-ML_451.dat are much better than the predictions in Example 4.
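
By construction, Δ-ML estimates the target as the baseline value plus an ML-predicted correction, E_Δ-ML = E_UHF + ΔE_ML. You can verify this relation and the accuracy from the output files (a sketch, assuming all files hold one energy per line in the original data-set order):

import numpy as np

e_uhf = np.loadtxt('E_UHF_451.dat')     # baseline energies
corr  = np.loadtxt('corr_ML_451.dat')   # ML-predicted corrections
e_dml = np.loadtxt('E_D-ML_451.dat')    # delta-ML target estimates
e_fci = np.loadtxt('E_FCI_451.dat')     # reference target energies

print(np.allclose(e_uhf + corr, e_dml))      # delta-ML = baseline + correction
print(np.sqrt(np.mean((e_dml - e_fci)**2)))  # RMSE vs the FCI reference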

Example 6: Training on Randomly Sampled Points on the H2 Dissociation Curve

Use the files R_451.dat and E_FCI_451.dat from Example 3. Create the input file mlatom.inp for MLatom with the following options (explanations are given in comments after '#'):

estAccMLmodel # Requests model evaluation
# Comment out the previous option and uncomment the following two options to save the model to a file.
# createMLmodel # Requests creating ML model
# MLmodelOut=mlmod_E_FCI_Gaussian_20random.unf # saves the model to file
XfileIn=R_451.dat # File with input vectors
Yfile=E_FCI_451.dat # File with reference data at FCI
YestFile=E_ML_451_Gaussian_20random.dat # File with ML predictions
kernel=Gaussian # Specifies the kernel function
sigma=opt # Requests optimizing sigma parameter
lgSigmaL=-10 lgSigmaH=10 # ... and sets optimization limits
lambda=opt lgLambdaL=-40 # The same for lambda parameter
sampling=random # Requests random sampling of points for training, sub-training, and validation
Ntrain=20 Nsubtrain=16 # Requests using 20 points for training and 16 for sub-training
iTrainOut=itrain_20random.dat # Saves indices of random points chosen for training
iSubtrainOut=isubtrain_20random.dat # Saves indices of random points chosen for sub-training
iValidateOut=ivalidate_20random.dat # Saves indices of random points chosen for validation
iTestOut=itest_20random.dat # Saves indices of random points chosen for testing

To run the calculations, just execute the command:

mlatom mlatom.inp > mlatom.out

In the output file you can find various error measures for the random hold-out test set; they should look something like this:

Statistical analysis for 431 entries in the test set
MAE = 0.0008701190755
MSE = -0.0008594726288
RMSE = 0.0045194135782
mean(Y) = -1.0363539894174
mean(Yest) = -1.0372134620462
largest positive outlier
error = 0.0001099473549
index = 45
estimated value = -1.1566967526451
reference value = -1.1568067000000
largest negative outlier
error = -0.0421702385754
index = 1
estimated value = -1.1463026385754
reference value = -1.1041324000000
correlation coefficient = 0.9974017799004
linear regression of {y, y_est} by f(a,b) = a + b * y
R^2 = 0.9948103105483
a = 0.0277384800540
b = 1.0275947726113
SE_a = 0.0037190821572
SE_b = 0.0035833876014

By running this command several times you will see that the above numbers are different each time, because different training points are chosen each time, as you can check in the itrain_20random.dat file. Note: the keywords requesting saving of indices are optional. MLatom does not overwrite these files and stops if it detects them; thus, if you run this command again, you should either comment out these lines, remove the previous files, or give the files different names each time.
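
The headline measures in this summary are easy to reproduce from the saved files. Note that MSE here is the mean signed error, not the mean squared error, as its negative value shows. A sketch (assuming the prediction file covers all points in the data-set order and the index file is 1-based, one index per line):

import numpy as np

y    = np.loadtxt('E_FCI_451.dat')
yest = np.loadtxt('E_ML_451_Gaussian_20random.dat')
# restrict the statistics to the hold-out test set
itest = np.loadtxt('itest_20random.dat', dtype=int) - 1
err = yest[itest] - y[itest]

print('MAE  =', np.mean(np.abs(err)))
print('MSE  =', np.mean(err))            # mean signed error
print('RMSE =', np.sqrt(np.mean(err**2)))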

Example 7: Potential Energy Surface of CH3Cl

Here we follow some of the steps for creating the machine learning potential energy surface (PES) of CH3Cl published in [J. Chem. Phys. 2017, 146, 244108]. We use the data set with 44819 points from [J. Chem. Phys. 2015, 142, 244306], which was kindly provided by Dr. Alec Owens.

Here we show how to:

  • Convert geometries from XYZ format to the ML input vectors.
  • Sample points into the training, test, sub-training, and validation sets using structure-based sampling from the ML input only, with the data sliced into three regions.
  • Train the s10%-ML model using self-correction and the structure-based sampled training set.

Converting geometries to ML input vector

Download the xyz.dat file with the Cartesian coordinates of CH3Cl.

Download the eq.xyz file with the near-equilibrium geometry of CH3Cl in Cartesian coordinates.

Convert (option XYZ2X) the geometries of CH3Cl in Cartesian coordinates (option XYZfile=xyz.dat) to the ML input vector of normalized inverted internuclear distances (option molDescriptor=RE), with the hydrogen atoms (atoms 3, 4, 5 in the XYZ files; option permInvNuclei=3-4-5) sorted (option molDescrType=sorted) by their nuclear repulsions to all other atoms:

mlatom XYZ2X XYZfile=xyz.dat XfileOut=x_sorted.dat molDescriptor=RE molDescrType=sorted permInvNuclei=3-4-5 > XYZ2X.out

You should get the file x_sorted.dat (option XfileOut=x_sorted.dat) with 44819 lines, each with 10 entries. In the XYZ2X.out file you should see the line "Number of permutations: 6".
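
For reference, the RE descriptor contains one entry per atom pair, the equilibrium internuclear distance divided by the actual one; for the 5 atoms of CH3Cl that gives 5*4/2 = 10 entries, which is why each line of x_sorted.dat has 10 columns. A sketch of the unsorted descriptor for a single geometry (the sorting of the hydrogens by nuclear repulsion performed by molDescrType=sorted is omitted here):

import numpy as np

def re_descriptor(coords, coords_eq):
    # coords, coords_eq: (natoms, 3) Cartesian coordinates
    n = len(coords)
    x = []
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            r_eq = np.linalg.norm(coords_eq[i] - coords_eq[j])
            x.append(r_eq / r)  # normalized inverted internuclear distance
    return np.array(x)          # 10 entries for the 5 atoms of CH3Cl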

Sampling points

Here we show how to sample points into the training, test, sub-training, and validation sets using structure-based sampling from the ML input only, with the data sliced into three regions.

Get the input vector for the equilibrium geometry using the same procedure as in the preceding section:

mlatom XYZ2X XYZfile=eq.xyz XfileOut=eq.x molDescriptor=RE molDescrType=sorted permInvNuclei=3-4-5 > eqXYZ2X.out

You should get a vector of ten 1.0's in eq.x (at the equilibrium geometry every internuclear distance equals its equilibrium value, so every descriptor entry is 1).

Sort the geometries by the Euclidean distance of their ML input vectors to the input vector of the equilibrium geometry and slice the ordered data set into 3 regions of the same size:

mlatom slice nslices=3 XfileIn=x_sorted.dat eqXfileIn=eq.x > slice.out

This command should have created the files xordered.dat (input vectors sorted by distance), indices_ordered.dat (indices of the ordered data set with respect to the original data set), and distances_ordered.dat (Euclidean distances of the ordered data points to the equilibrium). It has also created the directories slice1, slice2, and slice3, each containing three files (x.dat, slice_indices.dat, and slice_distances.dat) that are slices of the corresponding files for the entire data set.
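
The ordering and slicing logic corresponds to this sketch (indices here are 0-based; the indices in MLatom's own output files may be 1-based):

import numpy as np

x  = np.loadtxt('x_sorted.dat')  # 44819 input vectors, 10 entries each
eq = np.loadtxt('eq.x')          # input vector of the equilibrium geometry

d = np.linalg.norm(x - eq, axis=1)  # Euclidean distances to the equilibrium
order = np.argsort(d)               # ordering of the data set by distance
slices = np.array_split(order, 3)   # three slices of (nearly) equal size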

Use structure-based sampling to sample the desired number of data points from each slice:

mlatom sampleFromSlices nslices=3 sampling=structure-based Ntrain=4480 > sbs.out

This command should create an itrain.dat file with training set indices in each slice[1-3] directory. Note: it is possible to modify the sliceData.py script to submit the jobs to a queue in parallel.

Merge the sampled indices from all slices into index files for the training, test, sub-training, and validation sets, using the same order of data points as in the original data set:

mlatom mergeSlices nslices=3 Ntrain=4480 > merge.out

This command will create four files with indices: itrain.dat (with 4480 points for training), isubtrain.dat (with 80% of the training points, also chosen using structure-based sampling), itest.dat, and ivalidate.dat.

Creating ML model

Download the y.dat file with the reference energies.

In [J. Chem. Phys. 2017, 146, 244108] we optimized the hyperparameters using a validation set containing only points with deformation energies not higher than 10000 cm-1. You need to filter out the points with higher energies from the validation set. To do this, move the file ivalidate.dat to ivalidateall.dat and filter it with a short Python script.
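
The script used originally is a separate download and is not reproduced here; a minimal sketch of what it needs to do is given below (it assumes y.dat lists energies in cm-1, takes deformation energies relative to the lowest energy in the set, and assumes 1-based indices in MLatom index files):

import numpy as np

y = np.loadtxt('y.dat')                              # reference energies, cm-1
e_def = y - y.min()                                  # deformation energies
ival_all = np.loadtxt('ivalidateall.dat', dtype=int)

# keep only validation points with deformation energy <= 10000 cm-1;
# "- 1" converts the assumed 1-based indices to 0-based array positions
ival = ival_all[e_def[ival_all - 1] <= 10000.0]
np.savetxt('ivalidate.dat', ival, fmt='%d')
print(len(ival))  # the Nvalidate value for the input file below

The printed count should match the Nvalidate value in the input file below (281 for this data set).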

Now you can train the ML model using the reference data for the 4480 training points defined in the itrain.dat file, following the same procedure as described in [J. Chem. Phys. 2017, 146, 244108]. Check how many points are in the index files and use the following input file create.inp for MLatom:

createMLmodel
XYZfile=xyz.dat
Yfile=y.dat
molDescriptor=RE
molDescrType=permuted
permInvNuclei=3-4-5 # specifies which atoms to permute
kernel=Gaussian
permInvKernel # requests using permutationally invariant kernel
selfCorrect # requests self-correction procedure
sigma=opt lgsigmal=-10
lambda=opt NlgLambda=11 lglambdal=-25 lglambdah=-1
minimizeError=RMSE
sampling=user-defined
Ntrain=4480
Nsubtrain=3584
Nvalidate=281
itrainin=itrain.dat
isubtrainin=isubtrain.dat
ivalidatein=ivalidate.dat

Run the command:

mlatom create.inp > create.out

This command creates mlmodlayer[1-4].unf files with the ML models for each of the 4 layers of the self-correcting procedure. Since we used 4480 out of the 44819 points, this model is equivalent to the s10%-ML model in [J. Chem. Phys. 2017, 146, 244108]. The command also creates files ylayer[1-4].dat with the ML predictions of each layer; the final predictions are in ylayer4.dat. You can compare these values with the reference values for the entire data set using a short Python script.
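
Again, the original script is a separate download; a minimal sketch, assuming both files hold one energy per line in the same order and that the quoted error is the RMSE over the entire data set:

import numpy as np

y_ref = np.loadtxt('y.dat')        # reference energies for all 44819 points
y_ml  = np.loadtxt('ylayer4.dat')  # final (layer 4) ML predictions

rmse = np.sqrt(np.mean((y_ml - y_ref)**2))
print('RMSE = %.2f cm-1' % rmse)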

The error should be 3.44 cm-1.

Example 8: Importance of Sampling in Critical Regions

Download the files R.dat with coordinates and NAC.dat with reference nonadiabatic couplings calculated with a one-dimensional spin-boson Hamiltonian model (J. Phys. Chem. Lett. 2018, 9, 5660).

Train an ML model on 10 randomly sampled points and make predictions with this model for all points in the data set using the following input file:

estAccMLmodel # Requests model evaluation
XfileIn=R.dat # File with input vectors
Yfile=NAC.dat # File with reference data
YestFile=ML-NAC.dat # File with ML predictions
kernel=Matern # Specifies the kernel function
nn=1 # Matern kernel hyperparameter n
sigma=opt # Requests optimizing sigma parameter
lgSigmaL=-10 lgSigmaH=10 # … and sets optimization limits
lambda=opt lgLambdaL=-40 # The same for lambda parameter
sampling=random # Requests random sampling of points for training, sub-training, and validation
Ntrain=10 # Requests using 10 points for training
iTrainOut=itrain_random.dat # Saves indices of random points chosen for training
iSubtrainOut=isubtrain_random.dat # Saves indices of random points chosen for sub-training
iValidateOut=ivalidate_random.dat # Saves indices of random points chosen for validation
iTestOut=itest_random.dat # Saves indices of random points chosen for testing

You can compare the reference nonadiabatic couplings with the ML couplings saved to the file ML-NAC.dat. If you are very lucky, ML can describe the narrow peak; in most cases, though, such an ML model will miss it. You can, however, describe this peak properly if you know its position beforehand and add it to the training set. Add the index of the minimum in the data set to the files with the indices of the training and sub-training points (a sketch for preparing these index files is given at the end of this example), then use the following input file to train an ML model on 11 points (the 10 previously sampled random points plus 1 critical point) and to use this ML model to predict the nonadiabatic couplings for all points in the data set:

estAccMLmodel # Requests model evaluation
XfileIn=R.dat # File with input vectors
Yfile=NAC.dat # File with reference data
YestFile=ML-NAC_critical.dat # File with ML predictions
kernel=Matern # Specifies the kernel function
nn=1 # Matern kernel hyperparameter n
sigma=opt # Requests optimizing sigma parameter
lgSigmaL=-10 lgSigmaH=10 # … and sets optimization limits
lambda=opt lgLambdaL=-40 # The same for lambda parameter
sampling=user-defined # Requests user-defined sampling with indices read from files
Ntrain=11 # Requests using 11 points for training
Nsubtrain=9
iTrainIn=itrain_random+critical.dat # Reads indices from file
iSubtrainIn=isubtrain_random+critical.dat # Reads indices from file
iValidateIn=ivalidate_random+critical.dat # Reads indices from file
iTestIn=itest_random+critical.dat # Reads indices from file

As you can see, the ML predictions improved significantly.
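
For completeness, here is a sketch for preparing the index files used in the input above. It takes the critical point to be the extremum of the coupling curve (the point of largest absolute NAC value) and assumes 1-based indices in MLatom index files, one index per line; it also drops that point from the test and validation sets, where it initially resided:

import numpy as np

nac = np.loadtxt('NAC.dat')
# critical point: where the coupling peaks; "+ 1" for 1-based indices
icrit = str(int(np.argmax(np.abs(nac))) + 1)

# append the critical index to the training and sub-training sets ...
for old, new in (('itrain_random.dat', 'itrain_random+critical.dat'),
                 ('isubtrain_random.dat', 'isubtrain_random+critical.dat')):
    with open(old) as f:
        idx = f.read().split() + [icrit]
    with open(new, 'w') as f:
        f.write('\n'.join(idx) + '\n')

# ... and remove it from the test and validation sets if present
for old, new in (('itest_random.dat', 'itest_random+critical.dat'),
                 ('ivalidate_random.dat', 'ivalidate_random+critical.dat')):
    with open(old) as f:
        idx = [i for i in f.read().split() if i != icrit]
    with open(new, 'w') as f:
        f.write('\n'.join(idx) + '\n')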

9 Comments on “Tutorial for AQC Chapter”

    • Hi Oyeniyi Ezekiel,
      see the tutorial above: you can load a KRR model from a saved file and request output of the regression coefficients with the debug option, for example:

      mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat YestFile=E_ML_20_underfit.dat debug > use_E_FCI_20_underfit.out

  1. Hi Dral,
    It seems the learning curve ML task only works for molDescriptor=RE. Please, how can I use it with molDescriptor=CM?
    Thank you!

  2. Hi, thank you for the very nice tutorial. I wonder whether it is possible to use a different training set for the delta-learning in the hierarchical model (e.g., a smaller data set for the high-level reference values). As stated in the manual, we can only use the same XfileIn for each model, right?

    • Hi, thank you for your comment. Maybe I do not understand exactly what you mean.
      Δ-learning uses the same amount of high-level and low-level data because it learns on their differences.
      However, we have a hierarchical ML approach (https://doi.org/10.1063/5.0006498) where you can also train a separate model on lots of low-level data and use it as a baseline instead of the low-level QM method. Then you can indeed have different data sets for the high and low levels, but you would need to curate the data sets and models separately.
      Also, note that we have much better ways to do all this in Python: http://mlatom.com/docs
