# Tutorial for KREG and pKREG

This tutorial shows how to train and use the (p)KREG model and create learning curves in MLatom. For theory and evaluation of (p)KREG models, see:

Yi-Fan Hou, Fuchun Ge, Pavlo O. Dral. Explicit learning of derivatives with the KREG and pKREG models on the example of accurate representation of molecular potential energy surfaces.

*J. Chem. Theory Comput.* **2023**, *19*, 2369–2379. DOI: 10.1021/acs.jctc.2c01038. Preprint on ChemRxiv: https://doi.org/10.26434/chemrxiv-2022-b5bnt.

KREG refers to kernel ridge regression (KRR) with the relative-to-equilibrium (RE) descriptor and the Gaussian kernel, while pKREG additionally uses a permutationally invariant kernel to ensure invariance under the permutation of equivalent atoms.
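The two ingredients named above can be sketched in a few lines of NumPy. This illustrates the definitions only, not MLatom's actual implementation:

```python
import numpy as np
from itertools import combinations

def re_descriptor(xyz, xyz_eq):
    """RE descriptor: ratios r_eq,ij / r_ij over all internuclear distances."""
    pairs = list(combinations(range(len(xyz)), 2))
    r = np.array([np.linalg.norm(xyz[i] - xyz[j]) for i, j in pairs])
    r_eq = np.array([np.linalg.norm(xyz_eq[i] - xyz_eq[j]) for i, j in pairs])
    return r_eq / r

def gaussian_kernel(x1, x2, sigma):
    """Gaussian kernel between two descriptor vectors."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))
```

At the equilibrium geometry the RE descriptor is a vector of ones, and the kernel of a point with itself is 1.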

In older versions of MLatom, the KREG and pKREG models only supported direct learning of values. Now we have implemented explicit learning of derivatives, so you can train the model either on gradients only or on values and gradients at the same time. In this tutorial, you will learn how to train and use a machine learning potential for ethanol with (p)KREG.

Before starting this tutorial, you need to get access to MLatom@XACS (http://xacs.xmu.edu.cn/records?attribution_program=MLatom). Register and log in to the XACS platform so that you can run calculations.

We provide all the input files used in this tutorial here:

## Train the KREG model on energies only

There are already tutorials on how to train the KREG model on energies only, but here we still show the input file (data files can be found in the KREG_tutorial subfolder):

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=en.unf # The model is saved in en.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train.dat # File containing XYZ coordinates of the training points
Yfile=en_train.dat # File containing energies of the training points
KRRtask=learnVal # Learn on energies only
Ntrain=100 # Number of training points
Nsubtrain=80 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain.dat # Indices for training set
iValidateIn=ivalidate.dat # Indices for validation set
iSubtrainIn=isubtrain.dat # Indices for subtraining set
sigma=opt # Optimize hyperparameter sigma using grid search
lambda=opt # Optimize hyperparameter lambda using grid search
```

Hyperparameters are optimized by logarithmic grid search.
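The idea behind `sigma=opt` and `lambda=opt` is to try exponentially spaced trial values and keep the pair with the lowest validation error. A toy sketch of such a search for one-dimensional KRR with a Gaussian kernel (the grid bounds below are illustrative, not MLatom's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (40, 1))
y = np.sin(3 * x[:, 0])
xs, ys, xv, yv = x[:30], y[:30], x[30:], y[30:]  # subtraining / validation split

def krr_val_rmse(sigma, lam):
    """Train KRR on the subtraining set, return RMSE on the validation set."""
    K = np.exp(-((xs - xs.T) ** 2) / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(xs)), ys)
    Kv = np.exp(-((xv - xs.T) ** 2) / (2 * sigma ** 2))
    return np.sqrt(np.mean((Kv @ alpha - yv) ** 2))

# logarithmic grid: trial values are powers of 2
best = min((krr_val_rmse(2.0 ** s, 2.0 ** l), 2.0 ** s, 2.0 ** l)
           for s in range(-2, 11) for l in range(-20, -3))
```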

After training the model, you can use it (by uploading the model) to predict energies and gradients for other points:

```
estAccMLmodel # Estimate accuracy of a ML model
mlmodelin=en.unf # Use an existing model
XYZfile=xyz_test.dat # File containing XYZ coordinates of the test points
Yfile=en_test.dat # File containing energies of the test points
YgradXYZfile=grad_test.dat # File containing gradients of the test points
Yestfile=enest.dat # Predicted energies are saved in enest.dat
YgradXYZestfile=gradest.dat # Predicted gradients are saved in gradest.dat
```

In the sections below, you can always use this input file to obtain predicted energies and gradients for the test points (don’t forget to use the correct model name).
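If you want to compute error metrics yourself from the output files, a small helper like the following works for the one-value-per-line energy files (a sketch, not part of MLatom):

```python
import numpy as np

def rmse_from_files(ref_file, est_file):
    """RMSE between reference and estimated values stored one number per line."""
    ref = np.loadtxt(ref_file)
    est = np.loadtxt(est_file)
    return np.sqrt(np.mean((est - ref) ** 2))

# e.g. rmse_from_files('en_test.dat', 'enest.dat')
```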

## Train the KREG model on gradients only

The input file is similar to that of the energies-only model:

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=grad.unf # The model is saved in grad.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train.dat # File containing XYZ coordinates of the training points
Yfile=en_train.dat # File containing energies of the training points
YgradXYZfile=grad_train.dat # File containing gradients of the training points
KRRtask=learnGradXYZ # Learn on gradients only
Ntrain=100 # Number of training points
Nsubtrain=80 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain.dat # Indices for training set
iValidateIn=ivalidate.dat # Indices for validation set
iSubtrainIn=isubtrain.dat # Indices for subtraining set
sigma=opt # Optimize hyperparameter sigma using grid search
lambda=opt # Optimize hyperparameter lambda using grid search
```

Though the model is not trained on energies directly, you still need to provide the file containing the energies of the training points so that the integration constant can be determined.
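The reason is that a model trained on gradients alone determines the potential energy surface only up to an additive constant; the training energies fix that constant. One simple way to obtain such a shift (a sketch of the idea, not necessarily how MLatom computes it):

```python
import numpy as np

def integration_constant(e_ref, e_model):
    """Additive shift aligning model energies (defined only up to a constant
    when trained on gradients) with the reference training energies."""
    return float(np.mean(np.asarray(e_ref) - np.asarray(e_model)))
```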

## Train the KREG model on both energies and gradients

You can train the KREG model on both energies and gradients by using the following inputs:

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=engrad.unf # The model is saved in engrad.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train.dat # File containing XYZ coordinates of the training points
Yfile=en_train.dat # File containing energies of the training points
YgradXYZfile=grad_train.dat # File containing gradients of the training points
KRRtask=learnValGradXYZ # Learn both energies and gradients
Ntrain=100 # Number of training points
Nsubtrain=80 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain.dat # Indices for training set
iValidateIn=ivalidate.dat # Indices for validation set
iSubtrainIn=isubtrain.dat # Indices for subtraining set
sigma=opt # Optimize hyperparameter sigma using grid search
lambda=opt # Optimize hyperparameter lambda using grid search
```

## Using different lambdas

The inputs for training KREG models with different lambdas are similar to those using the same lambda, except for the last four lines:

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=engrad_2lambda.unf # The model is saved in engrad_2lambda.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train.dat # File containing XYZ coordinates of the training points
Yfile=en_train.dat # File containing energies of the training points
YgradXYZfile=grad_train.dat # File containing gradients of the training points
KRRtask=learnValGradXYZ # Learn both energies and gradients
Ntrain=100 # Number of training points
Nsubtrain=80 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain.dat # Indices for training set
iValidateIn=ivalidate.dat # Indices for validation set
iSubtrainIn=isubtrain.dat # Indices for subtraining set
sigma=hyperopt.loguniform(-2,10) # Optimize hyperparameter sigma using hyperopt
lambda=hyperopt.loguniform(-45,-4) # Optimize hyperparameter lambda using hyperopt
lambdaGradXYZ=hyperopt.loguniform(-45,-4) # Optimize hyperparameter lambdaGradXYZ using hyperopt
hyperopt.max_evals=300 # Total number of evaluations in hyperopt
```

Here, hyperopt is used to optimize the three hyperparameters.
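`hyperopt.loguniform(lo,hi)` searches a range whose exponent is sampled uniformly, so trial values spread evenly on a logarithmic scale rather than a linear one. A random-search sketch of the same idea, assuming base-2 exponents (matching the base used by the grid search; check the MLatom manual for the exact convention):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_loguniform(lo, hi, base=2.0):
    """Draw a value whose base-`base` exponent is uniform in [lo, hi]."""
    return base ** rng.uniform(lo, hi)

# 300 trials over the ranges used in the input file above
trials = [(sample_loguniform(-2, 10),    # sigma
           sample_loguniform(-45, -4),   # lambda
           sample_loguniform(-45, -4))   # lambdaGradXYZ
          for _ in range(300)]
```

Hyperopt itself is smarter than pure random search (it models the objective with a tree-structured Parzen estimator), but the sampled ranges are the same.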

## Sparse gradients

The input file for sparse gradients is exactly the same as in the same-lambda or different-lambda cases above (depending on which one you want to use). The only difference is the format of the file containing gradients: for points without gradients, the gradient block is replaced with ‘0’. (See the example below.)

```
0
9
12.0298697242 21.8972781874 -6.5823924247
10.5414394334 58.5013407080 -11.2469708417
-15.0480544528 17.7139692749 31.2700652035
-2.9347985031 -45.3223506830 37.5055838435
-8.9881173177 15.1347152482 -36.4616417990
-28.3058878145 0.5893695072 1.9688557864
25.5490229720 -35.9494714255 25.9451997850
-10.3105361488 -25.2928224835 -13.4477353439
17.4305806338 -7.2344007606 -28.9043038970
```
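The excerpt above can be read as two XYZ-style blocks: an atom-count line of ‘0’ for a point without gradients, followed by a block with 9 atoms and its gradient components. A parser sketch under that assumption (the exact layout, e.g. whether comment lines are present, should be checked against your own data files):

```python
import numpy as np

def read_sparse_gradients(path):
    """Parse a gradient file where points without gradients are written
    as a block with '0' as the atom count (as in the excerpt above)."""
    grads = []
    with open(path) as f:
        lines = [ln for ln in f if ln.strip()]
    i = 0
    while i < len(lines):
        natoms = int(lines[i].split()[0])
        if natoms == 0:
            grads.append(None)  # no gradients provided for this point
            i += 1
        else:
            block = [list(map(float, ln.split()))
                     for ln in lines[i + 1:i + 1 + natoms]]
            grads.append(np.array(block))
            i += 1 + natoms
    return grads
```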

The input file is:

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=engrad_sparse.unf # The model is saved in engrad_sparse.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train_sparse.dat # File containing XYZ coordinates of the training points
Yfile=en_train_sparse.dat # File containing energies of the training points
YgradXYZfile=grad_train_sparse.dat # File containing gradients of the training points
KRRtask=learnValGradXYZ # Learn both energies and gradients
Ntrain=1000 # Number of training points
Nsubtrain=800 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain_sparse.dat # Indices for training set
iValidateIn=ivalidate_sparse.dat # Indices for validation set
iSubtrainIn=isubtrain_sparse.dat # Indices for subtraining set
sigma=hyperopt.loguniform(-2,10) # Optimize hyperparameter sigma using hyperopt
lambda=hyperopt.loguniform(-45,-4) # Optimize hyperparameter lambda using hyperopt
lambdaGradXYZ=hyperopt.loguniform(-45,-4) # Optimize hyperparameter lambdaGradXYZ using hyperopt
hyperopt.max_evals=300 # Total number of evaluations in hyperopt
```

## Train the pKREG model

Before training the pKREG model, you have to choose which permutations to use in the molecule.

Let’s first take a look at the input file:

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=pengrad.unf # The model is saved in pengrad.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train.dat # File containing XYZ coordinates of the training points
Yfile=en_train.dat # File containing energies of the training points
YgradXYZfile=grad_train.dat # File containing gradients of the training points
KRRtask=learnValGradXYZ # Learn both energies and gradients
Ntrain=100 # Number of training points
Nsubtrain=80 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain.dat # Indices for training set
iValidateIn=ivalidate.dat # Indices for validation set
iSubtrainIn=isubtrain.dat # Indices for subtraining set
sigma=opt # Optimize hyperparameter sigma using grid search
lambda=opt # Optimize hyperparameter lambda using grid search
moldescrtype=permuted # Keywords related to pKREG
permInvKernel # Use permutationally invariant kernel
permInvNuclei=4-5.6-7-8 # Choose atoms to permute
```

The last three lines are related to the pKREG model: the permutationally invariant kernel is used and the permutations are provided by the user. A ‘.’ separates different groups of atoms, and atom indices connected by ‘-’ belong to the same group. Here, 12 permutations are generated (2! = 2 for the 4-5 group times 3! = 6 for the 6-7-8 group).
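How the 12 permutations arise from `permInvNuclei=4-5.6-7-8` can be sketched with `itertools`: every within-group permutation of one group is combined with every within-group permutation of the other.

```python
from itertools import permutations, product

groups = [(4, 5), (6, 7, 8)]  # from permInvNuclei=4-5.6-7-8

# all combinations of within-group permutations: 2! * 3! = 12 in total
perms = [sum((list(p) for p in combo), [])
         for combo in product(*(permutations(g) for g in groups))]
```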

## Train the pKREG model (other tricks)

MLatom also supports semi-automatic reduction of the permutations initially defined by the user: for each point in the training set, the dRMSD (distance root mean square deviation, also referred to as the distance matrix error) is calculated between the equilibrium geometry and each permuted geometry. Only the permutation with the lowest dRMSD for each training point is added to the list of retained permutations; all remaining permutations are discarded. The input file is shown below:
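The dRMSD used for this selection is the RMSD over all interatomic distances, which can be sketched as follows (the definition only, not MLatom's code):

```python
import numpy as np
from itertools import combinations

def drmsd(xyz1, xyz2):
    """Distance root mean square deviation between two geometries:
    RMSD over all interatomic distances (the distance matrix error)."""
    pairs = list(combinations(range(len(xyz1)), 2))
    d1 = np.array([np.linalg.norm(xyz1[i] - xyz1[j]) for i, j in pairs])
    d2 = np.array([np.linalg.norm(xyz2[i] - xyz2[j]) for i, j in pairs])
    return float(np.sqrt(np.mean((d1 - d2) ** 2)))
```

For each training point, one would evaluate `drmsd` between the equilibrium geometry and every permuted copy of the point and keep the permutation giving the smallest value. Since dRMSD is built from interatomic distances, it is invariant to translation and rotation of the geometry.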

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=pengrad_sp.unf # The model is saved in pengrad_sp.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train.dat # File containing XYZ coordinates of the training points
Yfile=en_train.dat # File containing energies of the training points
YgradXYZfile=grad_train.dat # File containing gradients of the training points
KRRtask=learnValGradXYZ # Learn both energies and gradients
Ntrain=100 # Number of training points
Nsubtrain=80 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain.dat # Indices for training set
iValidateIn=ivalidate.dat # Indices for validation set
iSubtrainIn=isubtrain.dat # Indices for subtraining set
sigma=opt # Optimize hyperparameter sigma using grid search
lambda=opt # Optimize hyperparameter lambda using grid search
moldescrtype=permuted # Keywords related to pKREG
permInvKernel # Use permutationally invariant kernel
permInvNuclei=4-5.6-7-8 # Choose atoms to permute
selectperm # Select permutations
```

Besides, the user can also select permutations manually by providing a file (‘perm.dat’) containing lists of atom indices, for example:

```
4 5 6 7 8
4 5 7 8 6
4 5 8 6 7
5 4 6 8 7
5 4 7 6 8
5 4 8 7 6
```

Then the input file should look like this:

```
createMLmodel # Create a ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=pengrad_ud.unf # The model is saved in pengrad_ud.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz_train.dat # File containing XYZ coordinates of the training points
Yfile=en_train.dat # File containing energies of the training points
YgradXYZfile=grad_train.dat # File containing gradients of the training points
KRRtask=learnValGradXYZ # Learn both energies and gradients
Ntrain=100 # Number of training points
Nsubtrain=80 # Number of subtraining points
sampling=user-defined # User provides indices for datasets
iTrainIn=itrain.dat # Indices for training set
iValidateIn=ivalidate.dat # Indices for validation set
iSubtrainIn=isubtrain.dat # Indices for subtraining set
sigma=opt # Optimize hyperparameter sigma using grid search
lambda=opt # Optimize hyperparameter lambda using grid search
moldescrtype=permuted # Keywords related to pKREG
permInvKernel # Use permutationally invariant kernel
permInvNuclei=4-5.6-7-8 # Choose atoms to permute
permIndIn=perm.dat # File containing permutations
permlen=5 # Length of list in each permutation
```

The file containing all the permutations must be provided, and the length of each permutation must also be specified.

## Create learning curves

Here, for simplicity, we take the KREG model trained on both energies and gradients as an example. For other types of models, you can always modify the input file as shown above (data files can be found in the MD17 subfolder).

```
learningCurve # Learning Curve
lcNtrains=100,250,500,1000 # Choose training set sizes
lcNrepeats=5,3,1,1 # Number of repeats for each training set size
estAccMLmodel # Estimate accuracy of ML model
mlmodeltype=KREG # Use KREG model
mlmodelout=engrad.unf # The model is saved in engrad.unf
eqXYZfileIn=eq.xyz # File containing equilibrium geometry
XYZfile=xyz.dat # File containing XYZ coordinates of the training points
Yfile=en.dat # File containing energies of the training points
YgradXYZfile=grad.dat # File containing gradients of the training points
KRRtask=learnValGradXYZ # Learn both energies and gradients
sampling=random # Random sampling
sigma=opt # Optimize hyperparameter sigma using grid search
lambda=opt # Optimize hyperparameter lambda using grid search
```
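Learning curves are usually inspected on a log-log scale, where they are approximately linear; the slope tells you how fast the error decays with training set size. A sketch with made-up RMSE values (replace them with the numbers from your own learning-curve output):

```python
import numpy as np

ntrains = np.array([100, 250, 500, 1000])  # lcNtrains from the input above
rmses = np.array([1.2, 0.7, 0.45, 0.28])   # illustrative values only

# slope of the learning curve on a log-log scale (negative = error decays)
slope, intercept = np.polyfit(np.log(ntrains), np.log(rmses), 1)
```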