Tutorial for CECAM MLQCDyn 2022

Hands-on session on machine learning:

PS3: First steps into ML (Pavlo O. Dral, Xiamen University, online; Max Pinheiro Jr, Aix-Marseille Université, on site)

This tutorial was prepared with help from Fuchun Ge, Yi-Fan Hou, and Shuang Zhang.

Please check the slides of the lectures

  • L3: General introduction to machine learning
  • L5: ML molecular dynamics: ground state (II)

by Pavlo O. Dral, covering the theoretical background of this tutorial.

In this tutorial, we will use the MLatom@XACS cloud computing service and provide some additional examples with Jupyter notebooks. Note: you can also install and use MLatom locally, but MLatom@XACS is a special, upgraded version compared to the latest distributed version, so not all examples will work with your local version (yet; all features are to be released later).

Before we start, please register a free account on the XACS cloud and check the brief instructions on how to use MLatom@XACS.

Creating a neural network model

We will use a data set with full CI (FCI)/aug-cc-pV6Z energies along the dissociation curve of H2 and train a feed-forward neural network using the Jupyter notebook and data from:

https://colab.research.google.com/github/dralgroup/MLinQCbook22-NN/blob/main/NN_train_colab.ipynb

Note how much time it took you to train the model.
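If you want a feel for what such a fit looks like outside the notebook, here is a minimal sketch (not the notebook's actual code) that fits a small feed-forward neural network to 1D data with scikit-learn; the file names R_451.dat and E_FCI_451.dat are the data files used in the MLatom exercise below, and the network size is an arbitrary choice:

import numpy as np
from sklearn.neural_network import MLPRegressor

# Load the 1D data: internuclear distances (Angstrom) and FCI energies.
R = np.loadtxt("R_451.dat").reshape(-1, 1)
E = np.loadtxt("E_FCI_451.dat")

# A small feed-forward NN; scaling the inputs/targets usually speeds up training.
nn = MLPRegressor(hidden_layer_sizes=(32, 32), activation="tanh",
                  max_iter=5000, tol=1e-8, random_state=0)
nn.fit(R, E)
print("training RMSE:", np.sqrt(np.mean((nn.predict(R) - E) ** 2)))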

Creating a kernel method model – Jupyter

We will use a data set with full CI (FCI)/aug-cc-pV6Z energies along the dissociation curve of H2 and train a kernel ridge regression model using the Jupyter notebook (fitting_with_KRR.ipynb) and data from:

https://colab.research.google.com/github/maxjr82/MLinQCbook22-KMs/blob/master/CECAM/fitting_with_KRR_colab.ipynb

Try to modify sigma and lambda of the Gaussian kernel to overfit and underfit the model.
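To make clear where sigma and lambda enter, here is a minimal numpy sketch of KRR with a Gaussian kernel (independent of the notebook); the toy data are made up. A very small sigma with a tiny lambda tends to overfit (a wiggly curve passing through every point), while a very large sigma or lambda tends to underfit (an over-smoothed curve):

import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # K_ij = exp(-(x1_i - x2_j)^2 / (2 sigma^2))
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2.0 * sigma ** 2))

def krr_fit(x_train, y_train, sigma, lam):
    # Solve (K + lambda I) alpha = y for the regression coefficients.
    K = gaussian_kernel(x_train, x_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

def krr_predict(x_new, x_train, alpha, sigma):
    return gaussian_kernel(x_new, x_train, sigma) @ alpha

# Toy 1D example (made-up data): vary sigma and lam and watch the fit change.
x = np.linspace(0.5, 5.0, 20)
y = np.exp(-x) * np.sin(3 * x)
alpha = krr_fit(x, y, sigma=0.5, lam=1e-8)
y_dense = krr_predict(np.linspace(0.5, 5.0, 200), x, alpha, sigma=0.5)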

Creating a kernel method model – MLatom@XACS

Download the R_451.dat and E_FCI_451.dat files with 451 points, corresponding to the internuclear distances (in Å) and FCI/aug-cc-pV6Z energies of the H2 molecule:

Now, you can train the KRR model with the Gaussian kernel (default) by providing the following input for MLatom:

createMLmodel                # Specify the task for MLatom
MLmodelOut=H2.unf            # Save model in H2.unf
XfileIn=R_451.dat            # File with R distances
Yfile=E_FCI_451.dat          # File with FCI energies
sigma=opt                    # Optimize hyperparameter sigma
lambda=opt                   # Optimize hyperparameter lambda

Your screen should look like:

Click Submit, and then click on Download Output file when the calculations finish and this button becomes active (you may need to wait a couple of seconds).

If everything is successful, in your output file with .log extension, you should see:

Using a saved ML model – MLatom@XACS

You can use your ML model, e.g., to make new predictions and to perform geometry optimizations, frequency calculations, dynamics, spectra simulations, etc. (all supported by MLatom).

Here, try to find the bond length of H2 with the KRR model trained in the previous exercise through a simple 1D grid search. For this, you may use this input:

useMLmodel                   # Specify the task for MLatom
MLmodelIn=H2.unf             # Read in model from H2.unf
XfileIn=R_pred.dat           # File with R distances
YestFile=E_est.dat           # The file with ML-estimated energies

What is the bond length in H2?
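One way to do the 1D grid search is to generate R_pred.dat yourself and, after the useMLmodel calculation, read E_est.dat and pick the distance with the lowest predicted energy. A minimal sketch (the grid range is an assumption):

import numpy as np

# 1) Before running MLatom: write a fine grid of H-H distances (Angstrom).
grid = np.linspace(0.5, 1.2, 701)
np.savetxt("R_pred.dat", grid)

# 2) After running MLatom: read the ML-estimated energies and find the minimum.
E = np.loadtxt("E_est.dat")
print("Estimated bond length: %.3f Angstrom" % grid[np.argmin(E)])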

Using a saved model for dynamics

Here we take a somewhat more complicated molecule (ethanol) and use a pre-trained model to run molecular dynamics with a machine learning potential (MLatom supports many types, such as KREG, ANI, sGDML, GAP-SOAP, DeepMD, PhysNet, etc.).

Here, we will run molecular dynamics with the KREG model, which employs kernel ridge regression (KRR) with the relative-to-equilibrium (RE) descriptor and the Gaussian kernel.

MD                                # molecular dynamics
initXYZ=/tmp/CECAM/ethanol.xyz    # initial geometry; Unit: Angstrom
initVXYZ=/tmp/CECAM/ethanol.vxyz  # initial velocity; Unit: Angstrom/fs
dt=0.3                            # time step; Unit: fs
trun=30                           # total time; Unit: fs
MLprog=MLatomF                    # KREG model is implemented in the Fortran part of MLatom (MLatomF)
MLmodelIn=/tmp/CECAM/en.unf       # file that saves the model
thermostat=nose-hoover            # NVT ensemble (Nosé-Hoover thermostat) 
temp=300                          # temperature for NVT; Unit: K

The KREG model is implemented in the Fortran part of MLatom (MLatomF), so we use MLprog=MLatomF here. The input files, such as the initial geometry, initial velocities, and the ML model, are already on the cloud.

If everything goes well, you will get your MD trajectory files as shown below:

Choosing the right model

To evaluate an ML model’s performance, we need to test it on unseen data. MLatom provides the estAccMLmodel task to do this for you by splitting the data set into training and test sets.

Here we demonstrate how to evaluate the generalization error when building a machine learning potential for ethanol trained only on energies. We use a very small data set with 100 points for a fast demonstration. You may need thousands of points for a real application.

We will test two machine learning potential models (see this paper for details):

  • KREG (based on kernel ridge regression and global descriptor)
  • ANI (based on neural network and local descriptor)

Training a KREG model

Here is a sample input for this task:

estAccMLmodel                            # Specify the task for MLatom
MLmodelType=KREG                         # Specify the model type (here - KREG)
XYZfile=ethanol_geometries.xyz           # File with XYZ geometries
Yfile=ethanol_energies.txt               # File with reference energies
sigma=opt                                # Optimize hyperparameter sigma
lambda=opt                               # Optimize hyperparameter lambda

Necessary data files:

Q: In the output file, compare the errors in the training and test sets.

1. Is it enough to use training errors for evaluating the accuracy of the ML model?

Training an ANI-type model

In MLatom, besides the KREG model, we can also use other MLP models, e.g., the ANI model (you can also try PhysNet or DPMD):

estAccMLmodel                            # Specify the task for MLatom
MLmodelType=ANI                          # Specify the model type
XYZfile=ethanol_geometries.xyz           # File with XYZ geometries
Yfile=ethanol_energies.txt               # File with reference energies

Q:

2. Can you draw a conclusion whether the KREG or ANI model is more accurate?

Learning curves

Comparing models based only on their performance on a test set after training on the same data set is not enough to claim that one model is better than another. The performance of models can change when you change the size of the data set. Thus, ML models’ performance should be compared by evaluating their generalization error over different training set sizes. MLatom provides the learningCurve task to automatically perform such calculations and summarize the results.

You can use the following input:

learningCurve                            # Specify the task for MLatom
MLmodelType=KREG                         # Specify the model type
XYZfile=ethanol_geometries.xyz           # File with XYZ geometries
Yfile=ethanol_energies.txt               # The file with reference energies
sigma=opt                                # Optimize hyperparameter sigma
lambda=opt                               # Optimize hyperparameter lambda
lcNtrains=10,25,50,75                    # Specify the training set sizes to be examined
lcNrepeats=10,8,5,3                      # Specify the number of repeats for each Ntrain

MLatom prepares several CSV files with the learning curve results in a folder named after your model type inside the learningCurve directory.

For example, lcy.csv records the learning curve of the RMSE in energies, which can be visualized with suitable programs such as Excel.
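If you prefer Python to Excel, a minimal plotting sketch with pandas and matplotlib could look like this (the path assumes MLmodelType=KREG, and the column names in lcy.csv should be inspected and adjusted):

import pandas as pd
import matplotlib.pyplot as plt

lc = pd.read_csv("learningCurve/KREG/lcy.csv")        # path assumes MLmodelType=KREG
print(lc.columns)                                     # inspect the actual column names

ntrain_col, rmse_col = lc.columns[0], lc.columns[1]   # adjust to the real columns
plt.loglog(lc[ntrain_col], lc[rmse_col], "o-")        # learning curves are usually shown on log-log axes
plt.xlabel("Number of training points")
plt.ylabel("RMSE of energies")
plt.savefig("learning_curve.png")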

Q:

1. Why do you need to repeat the calculations (see lcNrepeats for the number of repetitions) for each number of training points (Ntrain)?

2. Do you expect the error to drop more strongly when you increase the training set a) from 100 to 1000 or b) from 1000 to 2000 training points?

Dealing with data from several reference methods (multi-fidelity data)

Δ-learning & hierarchical ML

You can use this Jupyter notebook to investigate the power of learning from multi-fidelity data:

https://colab.research.google.com/github/dralgroup/MLinQCbook22-delta/blob/main/delta-ML_hML_colab.ipynb

MLatom also supports Δ-learning; see the manual and tutorial.
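The core idea of Δ-learning fits in a few lines: train the model on the difference between the target (high-level) and baseline (low-level) values, then add the predicted correction back to the baseline. Here is a conceptual sketch with made-up data and scikit-learn's kernel ridge regression:

import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Made-up 1D example: a cheap baseline and a more accurate target method.
x = np.linspace(0.5, 5.0, 50).reshape(-1, 1)
e_low = -1.0 / x.ravel()
e_high = e_low + 0.05 * np.exp(-(x.ravel() - 1.5) ** 2)

# Learn only the difference between the two levels of theory.
delta_model = KernelRidge(kernel="rbf", gamma=1.0, alpha=1e-8)
delta_model.fit(x, e_high - e_low)

# Prediction still requires the baseline value at the new point.
x_new = np.array([[2.0]])
e_high_pred = (-1.0 / x_new.ravel()) + delta_model.predict(x_new)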

Transfer learning

https://colab.research.google.com/github/dralgroup/MLinQCbook22-delta/blob/main/Transfer-learning_colab.ipynb
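The notebook above explores transfer learning; conceptually, a network is pre-trained on abundant lower-level data and then only part of it is re-optimized on a few high-level points. A schematic PyTorch sketch (not the notebook's code; all data and layer sizes are made up):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))

# Pre-training on plentiful, cheap data (stand-in functions used here).
x_cheap = torch.linspace(0.5, 5.0, 200).unsqueeze(1)
y_cheap = -1.0 / x_cheap
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x_cheap) - y_cheap) ** 2).mean()
    loss.backward()
    opt.step()

# Freeze all but the last layer, then fine-tune on a handful of accurate points.
for p in list(net.parameters())[:-2]:
    p.requires_grad = False
x_acc = torch.tensor([[0.7], [1.0], [1.4], [2.0]])
y_acc = -1.0 / x_acc + 0.05
opt = torch.optim.Adam([p for p in net.parameters() if p.requires_grad], lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = ((net(x_acc) - y_acc) ** 2).mean()
    loss.backward()
    opt.step()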

General-purpose ML-based methods

General-purpose ML-based methods such as AIQM1 and ANI-1ccx can, for many applications, be used directly instead of much slower and often less accurate traditional QM methods.

Geometry optimizations

Here we optimize a water molecule with ANI-1ccx, for which this method gives excellent accuracy. The input should look like this:

ANI-1ccx                             # Use ANI-1ccx
geomopt                              # Optimize geometry
xyzfile=/tmp/CECAM/water_init.xyz    # Initial geometry of water
optxyz=water_opt.xyz                 # Optimized geometry is saved in this file

The initial geometry is already on the cloud, so you do not need to upload the file. You are encouraged to generate your own initial geometry.

You can use visualization software to inspect the optimized geometry and check the bond length and bond angle.
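For a quick check without a visualization program, you can also read the XYZ file directly; this sketch assumes the atoms in water_opt.xyz are ordered O, H, H:

import numpy as np

# Read the optimized geometry (standard XYZ: atom count, comment, then atoms).
with open("water_opt.xyz") as f:
    lines = f.readlines()
coords = np.array([[float(x) for x in line.split()[1:4]] for line in lines[2:5]])
O, H1, H2 = coords

r1, r2 = H1 - O, H2 - O
print("O-H bond length: %.3f Angstrom" % np.linalg.norm(r1))
angle = np.degrees(np.arccos(np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2))))
print("H-O-H angle: %.1f degrees" % angle)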

Uncertainty quantification

Now we show how to calculate enthalpies of formation using the example of vinylacetylene (see the structure below):

You should use the optimized geometry (provided in /tmp/CECAM/final.xyz) to calculate thermochemical properties by using the freq option:

AIQM1            
xyzfile=/tmp/CECAM/final.xyz
freq

On the MLatom cloud computing platform http://xacs.xmu.edu.cn/records?attribution_program=MLatom, your screen should look like:

Click Submit, and then click on Download Output file when the calculations finish and this button becomes active.

If everything is successful, in your output file with .log extension, you should see:

You can check that the standard deviation of the NN predictions is very low. It is below the uncertainty criterion of 0.41 kcal/mol and thus the calculation is deemed certain. Nevertheless, the AIQM1 enthalpy of formation is quite different from the experimental value (see figure above). Double-checking the value against G4 calculations confirms that the AIQM1 result is likely to be correct. Indeed, we found in the NIST database an alternative experimental value that is closer to the AIQM1 prediction (see figure above).
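Conceptually, this uncertainty check boils down to comparing the standard deviation of the predictions of the NNs underlying AIQM1 with the 0.41 kcal/mol criterion; a toy sketch with made-up numbers:

import numpy as np

# Hypothetical ensemble of NN predictions for one property, in kcal/mol.
nn_predictions = np.array([70.23, 70.31, 70.19, 70.28])
std = np.std(nn_predictions)
if std <= 0.41:
    print("standard deviation %.2f kcal/mol <= 0.41: prediction deemed certain" % std)
else:
    print("standard deviation %.2f kcal/mol > 0.41: prediction deemed uncertain" % std)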

Dynamics & spectra

We will not run an MD trajectory here, but will use an existing trajectory propagated with AIQM1 for 3 ps to simulate the infrared spectrum of an ethanol molecule. AIQM1 also generates dipole moments, which are needed to calculate the intensities in infrared spectra. The MLatom input is very simple:

IRSS                            # Infrared Spectra Simulation
TrajH5MDIn=/tmp/CECAM/traj.h5   # H5MD file

You can see the spectrum (ir.png) in your folder after the computation is finished.
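For background, an IR spectrum can be obtained from an MD trajectory by Fourier-transforming the autocorrelation of the dipole moment's time derivative; the exact procedure used by the IRSS task may differ, and the file dipoles.dat below is hypothetical:

import numpy as np

dt = 0.3e-15                                   # time step in seconds
mu = np.loadtxt("dipoles.dat").reshape(-1, 3)  # hypothetical file with dipole moments along the trajectory
dmu = np.gradient(mu, dt, axis=0)              # time derivative of the dipole

# Autocorrelation of the dipole derivative, summed over Cartesian components.
acf = sum(np.correlate(dmu[:, i], dmu[:, i], mode="full")[len(dmu) - 1:] for i in range(3))
spectrum = np.abs(np.fft.rfft(acf))
freq_cm = np.fft.rfftfreq(len(acf), d=dt) / 2.99792458e10   # Hz -> cm^-1
# Plot spectrum vs freq_cm to compare with ir.png and the NIST spectrum below.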

Below is the spectrum from the NIST database:

Limitations

As with any computational method, ML-based methods have their limitations, especially for cases not covered by their training sets.

C60

Optimize the geometry of the fullerene (C60) molecule using ANI-1ccx.

ANI-1ccx
xyzfile=/tmp/CECAM/C60_init.xyz
optxyz=C60_final.xyz
geomopt

What is wrong with this geometry?

H2

Optimize H2 (geomopt keyword).

Then calculate the heat of formation (freq) of H2 with either ANI-1ccx or AIQM1.

Because H2 is a linear molecule, when you use the freq keyword, you should also specify the two keywords ase.linear and ase.symmetrynumber. ase.linear is 0 for a nonlinear molecule and 1 for a linear molecule, and ase.symmetrynumber is the rotational symmetry number of the molecule (see Table 10.1 and Appendix B of C. Cramer, “Essentials of Computational Chemistry”, 2nd Ed.). The rotational symmetry number only affects the entropy and free energy, and this influence is usually very small. For example, for hydrogen you should set ase.linear=1 and ase.symmetrynumber=2. Your MLatom input file then looks like this:

AIQM1               
xyzfile=/tmp/CECAM/H2_AIQM1_opt.xyz
freq
optprog=ase
ase.linear=1
ase.symmetrynumber=2

You will see in the log file the message:

This is because the standard deviation of the NN predictions (5.6 kcal/mol) exceeds the uncertainty criterion of 0.41 kcal/mol, and thus the calculation is deemed uncertain.

Q: 1. Why do you think the AIQM1 calculations for such a simple molecule are uncertain?