Tutorial for AQC Chapter
On this page you can find instructions on how to follow the examples given in the book chapter:
Pavlo O. Dral, Quantum Chemistry Assisted by Machine Learning. In Advances in Quantum Chemistry: Chemical Physics and Quantum Chemistry, Vol. 81, 1st ed.; Kenneth Ruud, Erkki J. Brändas, Eds.; Academic Press, 2020. DOI: 10.1016/bs.aiq.2020.05.002. (Free copy available online until 6 Nov. 2020.)
Please see the Manual of MLatom and other pages on this website for more details about how to use this program package. For these examples you need MLatom 1.1 or a newer version.
Example 1: Overfitting of the H2 Dissociation Curve
Download the R_20.dat file with 20 points corresponding to internuclear distances in the H2 molecule in Å.
Download the E_FCI_20.dat file with full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree) for the above 20 points.
Train an ML model (option createMLmodel) and save it to a file (option MLmodelOut=mlmod_E_FCI_20_overfit.unf) using the above data (training set) and the following command, which requests fitting with kernel ridge regression using the Gaussian kernel function and the hyperparameters σ = 10⁻¹¹ and λ = 0:
mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=0.00000000001 lambda=0.0 sampling=none > create_E_FCI_20_overfit.out
In the output file create_E_FCI_20_overfit.out you can see that the error of the created ML model is essentially zero for the training set. Option sampling=none ensures that the order of the training points remains the same as in the original data set (this does not matter for creating this ML model, but will be useful later). You can use the created ML model (option useMLmodel) to calculate energies for its own training set and save them to the E_ML_20_overfit.dat file:
mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_overfit.unf XfileIn=R_20.dat YestFile=E_ML_20_overfit.dat debug > use_E_FCI_20_overfit.out
Now you can compare the reference FCI values with the ML-predicted values and see that they are the same. Option debug also prints the values of the regression coefficients alpha to the output file use_E_FCI_20_overfit.out. You can compare them with the reference FCI energies and see that they are exactly the same (they are given in the same order as the training points).
Now try to calculate the energy with this ML model for any other internuclear distance not present in the training set and see that the predictions are zero. This means that the ML model is overfitted and cannot generalize to new situations because of the hyperparameter choice.
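A short calculation explains this behavior. Kernel ridge regression with the Gaussian kernel predicts

\[ \hat{f}(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i), \qquad k(x, x') = \exp\!\left(-\frac{\lVert x - x'\rVert^2}{2\sigma^2}\right), \qquad \boldsymbol{\alpha} = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}. \]

With σ = 10⁻¹¹, the kernel matrix elements k(x_i, x_j) are essentially δ_ij, so K ≈ I and, with λ = 0, α ≈ y; this is why the printed alphas coincide with the FCI energies. For any x away from the training points, k(x, x_i) ≈ 0 and hence the prediction is ≈ 0.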
Example 2: Underfitting of the H2 Dissociation Curve
Train an ML model (option createMLmodel) and save it to a file (option MLmodelOut=mlmod_E_FCI_20_underfit.unf) using the data (training set) from Example 1 and the following command, which requests fitting with kernel ridge regression using the Gaussian kernel function and the hyperparameters σ = 10⁹ and λ = 1:
mlatom createMLmodel MLmodelOut=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat Yfile=E_FCI_20.dat kernel=Gaussian sigma=1000000000 lambda=1.0 sampling=none > create_E_FCI_20_underfit.out
In the output file create_E_FCI_20_underfit.out you can see that the error of the created ML model is very large for the training set. You can use the created ML model (option useMLmodel) to calculate energies for its own training set and save them to the E_ML_20_underfit.dat file:
mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat YestFile=E_ML_20_underfit.dat debug > use_E_FCI_20_underfit.out
You can see that the ML-predicted values are the same for all points. Option debug also prints the values of the regression coefficients alpha to the output file use_E_FCI_20_underfit.out. You can sum them up and see that their sum is equal to the ML energies.
Now try to calculate the energy with this ML model for any other internuclear distance not present in the training set and see that the predictions do not change. This means that the ML model is underfitted and cannot generalize to new situations because of the hyperparameter choice.
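The same kernel expression explains the underfitting. With σ = 10⁹, k(x, x') ≈ 1 for every pair of points in the data range, so

\[ \hat{f}(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i) \approx \sum_{i=1}^{N} \alpha_i, \]

a constant independent of x, which is why the sum of the alphas equals the (identical) ML energies.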
Example 3: Model Selection and Evaluation for the H2 Dissociation Curve
Download the full data set with 451 points along the H2 dissociation curve: the R_451.dat file with internuclear distances in the H2 molecule in Å and the E_FCI_451.dat file with the full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree).
Now download the files with the indices of the points in the data set to be used as the training, test, sub-training, and validation sets (itrain.dat, itest.dat, isubtrain.dat, and ivalidate.dat). You can check that all the training points are the same as those used in Examples 1 and 2.
Now we can optimize hyperparameters and evaluate the generalization error of the ML models using a single command. For example, for the Gaussian kernel function:
mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lgSigmaL=-10 lgSigmaH=10 lambda=opt lgLambdaL=-40 sampling=user-defined Ntrain=20 Nsubtrain=16 iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat
You can compare the ML-predicted values saved to the file E_ML_451_Gaussian.dat with the reference values and see that this model generalizes much better than the models trained in Examples 1 and 2.
Note that many of these options (lgSigmaL=-10 lgSigmaH=10 lgLambdaL=-40 sampling=user-defined Ntrain=20 Nsubtrain=16 iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat) are given to ensure that you obtain the same result every time you run the command.
Alternatively, you can simply run:
mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lambda=opt
This command will use a random 20% of the points for testing and 20% of the training points for validation, i.e. each time you run the command you will see a different result. In addition, since MLatom uses a logarithmic search for hyperparameter optimization, you will need to lower the search boundaries with the options lgSigmaL=-10 and lgLambdaL=-40 to get better results:
mlatom estAccMLmodel XfileIn=R_451.dat Yfile=E_FCI_451.dat YestFile=E_ML_451_Gaussian.dat kernel=Gaussian sigma=opt lgSigmaL=-10 lambda=opt lgLambdaL=-40
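To see what this logarithmic search does, here is a minimal Python sketch of the idea (an illustration only, not MLatom's actual optimizer; the grid ranges and the 2σ² kernel normalization are assumptions): σ and λ are scanned on logarithmic grids, and the pair minimizing the RMSE on the validation set is kept.

import numpy as np

# Files from this tutorial; MLatom index files are 1-based.
X = np.loadtxt('R_451.dat').reshape(-1, 1)
y = np.loadtxt('E_FCI_451.dat')
isub = np.loadtxt('isubtrain.dat', dtype=int) - 1
ival = np.loadtxt('ivalidate.dat', dtype=int) - 1

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def validation_rmse(sigma, lam):
    # Train KRR on the sub-training set, evaluate on the validation set.
    K = gaussian_kernel(X[isub], X[isub], sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(isub)), y[isub])
    pred = gaussian_kernel(X[ival], X[isub], sigma) @ alpha
    return np.sqrt(np.mean((pred - y[ival]) ** 2))

best = min(((s, l) for s in 2.0 ** np.arange(-10, 11)    # logarithmic grid for sigma
                   for l in 10.0 ** np.arange(-40, 1)),  # logarithmic grid for lambda
           key=lambda p: validation_rmse(*p))
print('best sigma = %g, best lambda = %g' % best)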
Example 4: Extrapolation vs Interpolation of the H2 Dissociation Curve
Use the files from Example 3, but move the first three indices from itrain.dat to itest.dat and remove the first three indices from the isubtrain.dat file. Use the same command as in Example 3 to test the accuracy, but change the numbers of training and sub-training points accordingly (Ntrain=17 and Nsubtrain=13, as also used in Example 5).
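This index bookkeeping can be done by hand or with a few lines of Python; a minimal sketch, assuming one index per line as MLatom writes them:

# Move the first three training indices to the test set (Example 4 setup).
itrain = open('itrain.dat').read().split()
itest = open('itest.dat').read().split()
isubtrain = open('isubtrain.dat').read().split()

itest = itrain[:3] + itest   # the first three training points become test points
itrain = itrain[3:]          # 20 -> 17 training points
isubtrain = isubtrain[3:]    # 16 -> 13 sub-training points

open('itrain.dat', 'w').write('\n'.join(itrain) + '\n')
open('itest.dat', 'w').write('\n'.join(itest) + '\n')
open('isubtrain.dat', 'w').write('\n'.join(isubtrain) + '\n')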
Example 5: Delta-learning of the H2 Dissociation Curve
Download the E_UHF_451.dat file with the UHF/STO-3G energies (in Hartree).
Follow the procedure described in Example 4, but instead of learning the FCI/aug-cc-pV6Z energies directly, use Δ-ML. For this you need the option deltaLearn, and instead of Yfile you need to specify the file with the baseline (low-level) reference values (UHF/STO-3G, option Yb=E_UHF_451.dat) and the file with the target (high-level) reference values (FCI/aug-cc-pV6Z, option Yt=E_FCI_451.dat). In addition, you can specify the name of the file with the Δ-ML energies estimating the target level of theory (option YestT=E_D-ML_451.dat) and of the file with the predicted ML corrections (option YestFile=corr_ML_451.dat).
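In the Δ-ML scheme the model is trained on the differences between the target and baseline reference values, and predictions are obtained by adding the ML correction back to the baseline:

\[ E^{\text{target}}_{\Delta\text{-ML}}(x) = E^{\text{baseline}}(x) + \Delta_{\text{ML}}(x), \]

where Δ_ML is trained on the differences E_i^target − E_i^baseline for the training points.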
Your input file mlatom.inp for MLatom should look like this:
estAccMLmodel
deltaLearn
XfileIn=R_451.dat
Yb=E_UHF_451.dat
Yt=E_FCI_451.dat
YestT=E_D-ML_451.dat
YestFile=corr_ML_451.dat
kernel=Gaussian
sigma=opt lgSigmaL=-10 lgSigmaH=10
lambda=opt lgLambdaL=-40
Ntrain=17
Nsubtrain=13
sampling=user-defined
iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat
To run the calculations, just execute the command:
mlatom mlatom.inp > mlatom.out
You can see that the Δ-ML predictions in the file E_D-ML_451.dat are much better than the predictions in Example 4.
Example 6: Training on Randomly Sampled Points on the H2 Dissociation Curve
Use the files R_451.dat and E_FCI_451.dat from Example 3. Create the input file mlatom.inp for MLatom with the following options (explanations are given in comments after ‘#’):
estAccMLmodel # Requests model evaluation
# Comment out the previous option and uncomment the following two options to save the model to a file.
# createMLmodel # Requests creating ML model
# MLmodelOut=mlmod_E_FCI_Gaussian_20random.unf # saves the model to file
XfileIn=R_451.dat # File with input vectors
Yfile=E_FCI_451.dat # File with reference data at FCI
YestFile=E_ML_451_Gaussian_20random.dat # File with ML predictions
kernel=Gaussian # Specifies the kernel function
sigma=opt # Requests optimizing sigma parameter
lgSigmaL=-10 lgSigmaH=10 # ... and sets optimization limits
lambda=opt lgLambdaL=-40 # The same for lambda parameter
sampling=random # Requests random sampling of points for training, sub-training, and validation
Ntrain=20 Nsubtrain=16 # Requests using 20 points for training and 16 for sub-training
iTrainOut=itrain_20random.dat # Saves indices of random points chosen for training
iSubtrainOut=isubtrain_20random.dat # Saves indices of random points chosen for sub-training
iValidateOut=ivalidate_20random.dat # Saves indices of random points chosen for validation
iTestOut=itest_20random.dat # Saves indices of random points chosen for testing
To run the calculations, just execute the command:
mlatom mlatom.inp > mlatom.out
In the output file you can find various error measures for the random hold-out test set, which should look something like this:
Statistical analysis for 431 entries in the test set
MAE = 0.0008701190755
MSE = -0.0008594726288
RMSE = 0.0045194135782
mean(Y) = -1.0363539894174
mean(Yest) = -1.0372134620462
largest positive outlier
error = 0.0001099473549
index = 45
estimated value = -1.1566967526451
reference value = -1.1568067000000
largest negative outlier
error = -0.0421702385754
index = 1
estimated value = -1.1463026385754
reference value = -1.1041324000000
correlation coefficient = 0.9974017799004
linear regression of {y, y_est} by f(a,b) = a + b * y
R^2 = 0.9948103105483
a = 0.0277384800540
b = 1.0275947726113
SE_a = 0.0037190821572
SE_b = 0.0035833876014
By running this command several times you will see that the above numbers are different each time, because different training points are chosen each time, as you can check in the itrain_20random.dat file. Note: the keywords requesting saving of the indices are optional. MLatom does not overwrite these files and stops if it detects them; thus, if you run this command again, you should either comment out these lines, remove the previous files, or give the files different names each time.
Example 7: Potential Energy Surface of CH3Cl
Here we follow some of the steps for creating a machine learning potential energy surface (PES) of CH3Cl as published in [J. Chem. Phys. 2017, 146, 244108]. We use the data set with 44819 points from [J. Chem. Phys. 2015, 142, 244306], which was kindly provided by Dr. Alec Owens.
Here we show how to:
- Convert the geometries from XYZ format to the ML input vectors.
- Sample points into the training, test, sub-training, and validation sets using only the ML input vectors, with structure-based sampling applied to the data sliced into three regions.
- Train the s10%-ML model using self-correction and the structure-based sampled training set.
Converting geometries to ML input vector
Download the xyz.dat file with Cartesian coordinates of CH3Cl and the eq.xyz file with the near-equilibrium geometry of CH3Cl in Cartesian coordinates.
Convert (option XYZ2X) the geometries of CH3Cl in Cartesian coordinates (option XYZfile=xyz.dat) to the ML input vectors of normalized inverted internuclear distances (option molDescriptor=RE), with the hydrogen atoms (atoms 3, 4, and 5 in the XYZ files, option permInvNuclei=3-4-5) sorted (option molDescrType=sorted) by their nuclear repulsions to all other atoms:
mlatom XYZ2X XYZfile=xyz.dat XfileOut=x_sorted.dat molDescriptor=RE molDescrType=sorted permInvNuclei=3-4-5 > XYZ2X.out
You should get the file x_sorted.dat (option XfileOut=x_sorted.dat) with 44819 lines, each with 10 entries. In the XYZ2X.out file you should see the line 'Number of permutations: 6'.
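For reference, the unsorted RE descriptor is easy to compute yourself: each entry is an internuclear distance in the near-equilibrium geometry divided by the corresponding distance in the current geometry, so the equilibrium geometry maps to a vector of ones (see the next section). A minimal Python sketch, which omits the sorting of the hydrogens by nuclear repulsion that MLatom performs with molDescrType=sorted:

import numpy as np

def read_xyz(fname):
    # Read the first geometry from an XYZ file (coordinates in Angstrom).
    with open(fname) as f:
        nat = int(f.readline())
        f.readline()  # comment line
        coords = [list(map(float, f.readline().split()[1:4])) for _ in range(nat)]
    return np.array(coords)

def re_descriptor(xyz, xyz_eq):
    # Normalized inverted internuclear distances r_eq(i,j)/r(i,j) for all pairs.
    n = len(xyz)
    desc = []
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(xyz[i] - xyz[j])
            r_eq = np.linalg.norm(xyz_eq[i] - xyz_eq[j])
            desc.append(r_eq / r)
    return np.array(desc)

eq = read_xyz('eq.xyz')
print(re_descriptor(eq, eq))  # ten entries, all exactly 1.0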
Sampling points
Here we show how to sample points into the training, test, sub-training, and validation sets using only the ML input vectors, with structure-based sampling applied to the data sliced into three regions.
Get the input vector for the equilibrium geometry using the same procedure as in the preceding section:
mlatom XYZ2X XYZfile=eq.xyz XfileOut=eq.x molDescriptor=RE molDescrType=sorted permInvNuclei=3-4-5 > eqXYZ2X.out
You should get a vector with ten 1.0's in eq.x.
Sort geometries by the Euclidean distance of their corresponding ML input vector to the input vector of the equilibrium geometry and slice the ordered data set into 3 regions of the same size:
mlatom slice nslices=3 XfileIn=x_sorted.dat eqXfileIn=eq.x > slice.out
This command should have created the files xordered.dat (input vectors sorted by distance), indices_ordered.dat (indices of the ordered data set with respect to the original data set), and distances_ordered.dat (the Euclidean distances of the ordered data points to the equilibrium). It has also created the directories slice1, slice2, and slice3. Each of them contains three files, x.dat, slice_indices.dat, and slice_distances.dat, which are slices of the corresponding files for the entire data set.
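Conceptually, this slicing step sorts the data by distance to the equilibrium descriptor and cuts the ordered list into equal parts. A minimal Python sketch of the idea (MLatom's slice task additionally writes the per-slice files and directories, and its exact conventions may differ):

import numpy as np

X = np.loadtxt('x_sorted.dat')   # 44819 x 10 input vectors
x_eq = np.loadtxt('eq.x')        # equilibrium input vector (all ones)

d = np.linalg.norm(X - x_eq, axis=1)   # Euclidean distance to the equilibrium
order = np.argsort(d)                  # indices of the ordered data set
slices = np.array_split(order, 3)      # three regions of (nearly) the same size
for k, idx in enumerate(slices, 1):
    print('slice%d: %d points' % (k, len(idx)))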
Use structure-based sampling to sample the desired number of data points from each slice:
mlatom sampleFromSlices nslices=3 sampling=structure-based Ntrain=4480 > sbs.out
This command should create an itrain.dat file with training-set indices in each slice[1-3] directory. Note: it is possible to modify the sliceData.py script to submit the jobs in parallel to a queue.
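Structure-based sampling selects training points that cover the descriptor space as evenly as possible. A greedy max-min (farthest-point) selection conveys the idea; the following is a simplified Python sketch, not MLatom's exact algorithm:

import numpy as np

def farthest_point_sample(X, n):
    # Greedy max-min selection: each new point maximizes the minimal
    # distance to the points selected so far.
    selected = [0]  # start from the first point of the slice (a simplification)
    dmin = np.linalg.norm(X - X[0], axis=1)
    while len(selected) < n:
        nxt = int(np.argmax(dmin))  # farthest remaining point
        selected.append(nxt)
        dmin = np.minimum(dmin, np.linalg.norm(X - X[nxt], axis=1))
    return selected

X1 = np.loadtxt('slice1/x.dat')
itrain1 = farthest_point_sample(X1, 4480 // 3)  # per-slice share; MLatom handles the exact split
print(len(itrain1), 'points sampled from slice1')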
Merge the sampled indices from all slices into index files for the training, test, sub-training, and validation sets, using the same order of data points as in the original data set:
mlatom mergeSlices nslices=3 Ntrain=4480 > merge.out
This command will create four files with indices: itrain.dat (with 4480 points for training), isubtrain.dat (with 80% of the training points, also chosen using structure-based sampling), itest.dat, and ivalidate.dat.
Creating ML model
Download the y.dat file with the reference energies.
In [J. Chem. Phys. 2017, 146, 244108] we optimized hyperparameters using a validation set containing only points with deformation energies not higher than 10000 cm⁻¹, so you need to filter out the points with higher energies from the validation set. To do this, move the file ivalidate.dat to ivalidateall.dat and filter it with a short Python script.
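A minimal sketch of this filtering step (it assumes that y.dat holds energies in cm⁻¹, that the deformation energy is counted from the data-set minimum, and that the index files store 1-based indices, one per line):

import numpy as np

energies = np.loadtxt('y.dat')
e_def = energies - energies.min()  # deformation energy relative to the minimum
ival_all = np.loadtxt('ivalidateall.dat', dtype=int)

# Keep only validation points with deformation energy <= 10000 cm-1.
kept = [i for i in ival_all if e_def[i - 1] <= 10000.0]
np.savetxt('ivalidate.dat', kept, fmt='%d')
print('%d of %d validation points kept' % (len(kept), len(ival_all)))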
Now you can train the ML model using the reference data for the 4480 training points defined in itrain.dat, following the same procedure as described in [J. Chem. Phys. 2017, 146, 244108]. Check how many points are in the files with indices and use the following input file create.inp for MLatom:
createMLmodel
XYZfile=xyz.dat
Yfile=y.dat
molDescriptor=RE
molDescrType=permuted
permInvNuclei=3-4-5 # specifies which atoms to permute
kernel=Gaussian
permInvKernel # requests using permutationally invariant kernel
selfCorrect # requests self-correction procedure
sigma=opt lgsigmal=-10
lambda=opt NlgLambda=11 lglambdal=-25 lglambdah=-1
minimizeError=RMSE
sampling=user-defined
Ntrain=4480
Nsubtrain=3584
Nvalidate=281
itrainin=itrain.dat
isubtrainin=isubtrain.dat
ivalidatein=ivalidate.dat
Run the command:
mlatom create.inp > create.out
This command creates mlmodlayer[1-4].unf files with the ML models for each of the 4 layers of the self-correcting procedure. Since we used 4480 out of 44819 points, this model is equivalent to the s10%-ML model in [J. Chem. Phys. 2017, 146, 244108]. The command also creates the files ylayer[1-4].dat with the ML predictions for each of the layers; the final predictions are in ylayer4.dat. You can compare these values with the reference values for the entire data set using a short Python script.
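A minimal sketch (computing the RMSE, the error measure optimized above):

import numpy as np

y_ref = np.loadtxt('y.dat')        # reference energies
y_ml = np.loadtxt('ylayer4.dat')   # final self-corrected ML predictions

rmse = np.sqrt(np.mean((y_ml - y_ref) ** 2))
print('RMSE = %.2f cm-1' % rmse)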
The error should be 3.44 cm⁻¹.
Example 8: Importance of Sampling in Critical Regions
Download the files R.dat with coordinates and NAC.dat with reference nonadiabatic couplings calculated with a one-dimensional spin-boson Hamiltonian model (J. Phys. Chem. Lett. 2018, 9, 5660).
Train an ML model on 10 randomly sampled points and make predictions with this model for all points in the data set using the following input file:
estAccMLmodel # Requests model evaluation
XfileIn=R.dat # File with input vectors
Yfile=NAC.dat # File with reference data
YestFile=ML-NAC.dat # File with ML predictions
kernel=Matern # Specifies the kernel function
nn=1 # Hyperparameter
sigma=opt # Requests optimizing sigma parameter
lgSigmaL=-10 lgSigmaH=10 # … and sets optimization limits
lambda=opt lgLambdaL=-40 # The same for lambda parameter
sampling=random # Requests random sampling of points for training, sub-training, and validation
Ntrain=10 # Requests using 10 points for training
iTrainOut=itrain_random.dat # Saves indices of random points chosen for training
iSubtrainOut=isubtrain_random.dat # Saves indices of random points chosen for sub-training
iValidateOut=ivalidate_random.dat # Saves indices of random points chosen for validation
iTestOut=itest_random.dat # Saves indices of random points chosen for testing
You can compare the reference nonadiabatic couplings with the ML couplings saved to the file ML-NAC.dat. If you are very lucky, ML can describe the narrow peak; in most cases, though, such an ML model will miss it. You can, however, describe this peak properly if you know its position beforehand and add it to the training set. Add the index of the minimum in the data set to the files with the indices of the training and sub-training points (this step can be scripted; see the sketch after the input file below), then use the following input file to train an ML model on 11 points (the 10 previously sampled random points plus 1 critical point) and to use this ML model to predict nonadiabatic couplings for all points in the data set:
estAccMLmodel # Requests model evaluation
XfileIn=R.dat # File with input vectors
Yfile=NAC.dat # File with reference data
YestFile=ML-NAC_critical.dat # File with ML predictions
kernel=Matern # Specifies the kernel function
nn=1 # Hyperparameter
sigma=opt # Requests optimizing sigma parameter
lgSigmaL=-10 lgSigmaH=10 # … and sets optimization limits
lambda=opt lgLambdaL=-40 # The same for lambda parameter
sampling=user-defined # Requests user-defined sampling of points for training, sub-training, and validation
Ntrain=11 # Requests using 11 points for training
Nsubtrain=9 # ... and 9 points for sub-training
iTrainIn=itrain_random+critical.dat # Reads indices from file
iSubtrainIn=isubtrain_random+critical.dat # Reads indices from file
iValidateIn=ivalidate_random+critical.dat # Reads indices from file
iTestIn=itest_random+critical.dat # Reads indices from file
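The index bookkeeping referred to above can be scripted; a minimal sketch, assuming 1-based indices, one per line, and taking the critical point to be the minimum of the reference couplings as described above:

import numpy as np
import shutil

nac = np.loadtxt('NAC.dat')
icrit = str(int(np.argmin(nac)) + 1)  # 1-based index of the minimum (the narrow peak)

# Append the critical point to the training and sub-training index files;
# if it happens to appear in the test or validation files, remove it there.
for src, dst in [('itrain_random.dat', 'itrain_random+critical.dat'),
                 ('isubtrain_random.dat', 'isubtrain_random+critical.dat')]:
    idx = open(src).read().split()
    if icrit not in idx:
        idx.append(icrit)
    open(dst, 'w').write('\n'.join(idx) + '\n')

shutil.copy('ivalidate_random.dat', 'ivalidate_random+critical.dat')
shutil.copy('itest_random.dat', 'itest_random+critical.dat')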
As you can see, the ML predictions improved significantly.
How can I get access to the regression coefficients (alphas) in the case of a KRR model?
Hi Oyeniyi Ezekiel,
see the tutorial above: you can load a KRR model from a saved file and request output of the regression coefficients with the debug option, for example:
mlatom useMLmodel MLmodelIn=mlmod_E_FCI_20_underfit.unf XfileIn=R_20.dat YestFile=E_ML_20_underfit.dat debug > use_E_FCI_20_underfit.out
Got it. Thank you.
Hi Dral,
It seems the learning curve ML task only works for molDescriptor=RE. Please, how can I use it with molDescriptor=CM?
Thank you!
Hi Oyeniyi,
this tutorial page does not mention any learning curves.
How to do learning curves is described at http://mlatom.com/tutorial-on-benchmarking-machine-learning-potentials/#Collecting_learning_curve_data
Please check there and see if it works. It should work with molDescriptor=CM. If you still have trouble, please post your input file and any error messages you get.
Thank you for your response. I will get back to you.
Maybe you have already run the learning curve task with the RE descriptor; then, when you only change molDescriptor to CM, the program will not see it as a new learning curve and will skip it…
You can modify your input like this:
…
mlmodeltype=krr-cm
molDescriptor=CM
…
Then the learning curve will run with the folder name ‘mlatomf_krr-cm_en’ or ‘mlatomf_krr-cm_en_grad’.
Hi, thank you for the very nice tutorial. I wonder whether it is possible to use a different training set for the delta-learning in the hierarchical model (e.g., use a smaller data set for the high-level reference values). As stated in the manual, we can only use the same XfileIn for each model, right?
Hi, thank you for your comment. Maybe I do not understand exactly what you mean.
Delta-learning just uses the same amount of high-level and low-level data because it learns on their differences.
However, we have the hierarchical ML approach (https://doi.org/10.1063/5.0006498), where you can also train a separate model on lots of low-level data and use this separate model as a baseline instead of a low-level QM method. Then you can indeed have different data sets for the high-level and low-level data, but you would need to curate the data sets and models separately.
Also, note that we have much better ways to do all this in Python; see http://mlatom.com/docs