# MLatom GUI tutorial

*MLatom* has its own GUI (graphical user interface), MLatom-GUI, for user-friendly creation of MLatom input and for running calculations. This tool makes it easier to choose appropriate input, so users do not need to remember all the possible options or look them up in the manual every time.

The MLatom-GUI script `mlatom-gui.py` was written by Bao-Xin Xue and runs on Linux.

## GUI Layout

The GUI consists of four parts: first-level tabs (task type selection), second-level tabs (subtask selection), a parameter area for specifying input options, and an output/calculation area.

If you hover the mouse over an area, the GUI also shows hints and additional information:

MLatom-GUI also lets you choose files from a dialog, which avoids typos, and converts the chosen path into a relative path.
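How the GUI performs this conversion internally is not shown in this tutorial. As a minimal sketch, assuming the result is equivalent to what Python's standard `os.path.relpath` returns, an absolute path chosen in a file dialog can be expressed relative to the working directory like this (the helper name and all paths are hypothetical):

```python
import os.path


def to_relative(chosen_path, working_dir):
    """Return chosen_path expressed relative to working_dir."""
    return os.path.relpath(chosen_path, start=working_dir)


# Hypothetical example paths, for illustration only.
rel_same = to_relative("/home/user/h2/R_20.dat", "/home/user/h2")
rel_up = to_relative("/home/user/data/R_20.dat", "/home/user/h2")
print(rel_same)
print(rel_up)
```

A relative path keeps the generated input file valid when the whole working directory is copied to another machine.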

## Download & Usage

### Download & Installation

The GUI is released as a part of MLatom starting from version 2.0.4.

It can also be downloaded here for use with older versions:

Download this zip file and decompress it to get a file named `mlatom-gui.py`. Put this file in `~/bin/` or in any directory listed in your `$PATH` variable, then add execute permission with the command `chmod +x mlatom-gui.py`.

The GUI is written entirely with the Python standard library, so you do not need to install any additional Python packages.

### Usage

If you connect to a Linux server through SSH, you need an X server on your local machine before using the GUI. Windows users can install Xming and macOS users can install XQuartz. Linux users do not need to install anything.

Then enable X11 forwarding. On macOS and Linux, connect to the server with the command `ssh -Y user@ip -p port`; on Windows, enable X11 forwarding in your SSH software.

Now execute the command `mlatom-gui.py` and the MLatom GUI window will appear on your local machine.

To keep the current terminal free, we suggest running `mlatom-gui.py &`, which keeps the GUI running in the background while you continue to use the terminal.
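Before launching the GUI over SSH, it can help to confirm that X11 forwarding is active. With forwarding enabled, the SSH server sets the `DISPLAY` environment variable in the remote session. A minimal sketch of such a check follows; the helper `has_x11_display` is hypothetical and not part of MLatom:

```python
import os


def has_x11_display(environ=None):
    """Return True if a non-empty DISPLAY variable is set."""
    env = os.environ if environ is None else environ
    return bool(env.get("DISPLAY", "").strip())


if __name__ == "__main__":
    if has_x11_display():
        print("DISPLAY is set to", os.environ["DISPLAY"])
    else:
        print("DISPLAY is not set; reconnect with X11 forwarding (e.g. ssh -Y)")
```

A set `DISPLAY` does not guarantee that the local X server is reachable, but an unset one reliably explains why no window appears.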

## AQC Chapter using MLatom GUI

### Example 1: Overfitting of the H_{2} Dissociation Curve

Please create a new directory, then download the file `R_20.dat` with 20 points corresponding to internuclear distances in the H_{2} molecule in Å and the file `E_FCI_20.dat` with the full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree) for these 20 points:

```
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/R_20.dat
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/E_FCI_20.dat
```

We will use these data (the training set) to train an ML model (option `createMLmodel`) and save it to a file (option `MLmodelOut=E_FCI_20_overfit.unf`), requesting fitting with kernel ridge regression, the Gaussian kernel function, and the hyperparameters σ=10^{−11} and λ=0.

Please open the MLatom GUI with the command `mlatom-gui.py`.

Enter the necessary parameters, then click the KRR button:

Then click calculate and check the result.

You can inspect the ML model, the input file, the standard output file, and the error output file:

In the output file `E_FCI_20_overfit.inp.out` you can see that the error of the created ML model is essentially zero for the training set. The option `sampling=none` ensures that the order of the training points remains the same as in the original data set (it does not matter for creating this ML model, but will be useful later). You can use the created ML model (option `useMLmodel`) to calculate energies for its own training set and save them to the file `E_ML_20_overfit.dat`:

Now you can compare the reference FCI values with the ML predicted values and see that they are the same. The option `debug` also prints the values of the regression coefficients alpha to the output file `use_E_FCI_20_overfit.inp.out`. You can compare them with the reference FCI energies and see that they are exactly the same (they are given in the same order as the training points).

Now try to calculate the energy with the ML model for any other internuclear distance not present in the training set and see that the predictions are zero. This means that the ML model is overfitted and cannot generalize to new situations because of the hyperparameter choice.
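The behaviour described above can be reproduced with a few lines of plain Python. This is a minimal sketch, not MLatom's implementation: it assumes the Gaussian kernel has the form exp(−(x−x′)²/(2σ²)), that the coefficients solve (K + λI)α = y, and that a prediction is Σ αᵢ k(x, xᵢ). The distances and energies are toy numbers, not the FCI data.

```python
import math


def gaussian_kernel(x1, x2, sigma):
    """Gaussian kernel for scalar inputs (assumed form)."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * sigma ** 2))


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def train_krr(xs, ys, sigma, lam):
    """Return regression coefficients alpha from (K + lam*I) alpha = y."""
    n = len(xs)
    kmat = [[gaussian_kernel(xs[i], xs[j], sigma) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    return solve_linear(kmat, ys)


def predict_krr(x, xs, alpha, sigma):
    """Prediction as the kernel-weighted sum of the coefficients."""
    return sum(a * gaussian_kernel(x, xi, sigma) for a, xi in zip(alpha, xs))


# Toy distances and energies, NOT the data from R_20.dat / E_FCI_20.dat.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [-1.05, -1.12, -1.06, -1.02]
alpha = train_krr(xs, ys, sigma=1e-11, lam=0.0)

print(alpha)                                # equals ys: K is the identity matrix
print(predict_krr(1.0, xs, alpha, 1e-11))   # training point is reproduced
print(predict_krr(1.25, xs, alpha, 1e-11))  # unseen point: 0.0
```

With such a narrow kernel every training point only "sees" itself, so the kernel matrix is the identity, the coefficients equal the reference energies, and any point outside the training set gets a prediction of zero.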

### Example 2: Underfitting of the H_{2} Dissociation Curve

Train an ML model (option `createMLmodel`) and save it to a file (option `MLmodelOut=E_FCI_20_underfit.unf`) using the data (training set) from Example 1, requesting fitting with kernel ridge regression, the Gaussian kernel function, and the hyperparameters σ=10^{9} and λ=1:

Click KRR.

Close the KRR window and click calculate. In the output file `E_FCI_20_underfit.inp.out` you can see that the error of the created ML model is very large for the training set. You can use the created ML model (option `useMLmodel`) to calculate energies for its own training set and save them to the file `E_ML_20_underfit.dat`:

You can see that the ML predicted values are the same for all points. The option `debug` also prints the values of the regression coefficients alpha to the output file `use_E_FCI_20_underfit.inp.out`. You can sum them up and see that their sum is equal to the ML energies.

Now try to calculate the energy with the ML model for any other internuclear distance not present in the training set and see that the predictions do not change. This means that the ML model is underfitted and cannot generalize to new situations because of the hyperparameter choice.
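Why the predictions are constant and equal to the sum of the regression coefficients can be seen from a short derivation. It is a sketch that assumes plain kernel ridge regression, $(\mathbf{K} + \lambda\mathbf{I})\boldsymbol{\alpha} = \mathbf{y}$ with predictions $f(x) = \sum_i \alpha_i k(x, x_i)$ and no additional offset term. The symbols $N$ (number of training points) and $S$ (sum of the training energies) are introduced here only for illustration. For σ=10^{9} every Gaussian kernel value is practically 1, so:

$$
\begin{aligned}
k(x, x_i) &\approx 1 \quad\Rightarrow\quad \mathbf{K} \approx \mathbf{1}\mathbf{1}^{\mathsf T}, \qquad \lambda = 1, \\
(\mathbf{1}\mathbf{1}^{\mathsf T} + \mathbf{I})\,\boldsymbol{\alpha} = \mathbf{y}
\quad&\Rightarrow\quad
\alpha_i = y_i - \sum_{j=1}^{N} \alpha_j = y_i - \frac{S}{N+1}, \qquad S = \sum_{j=1}^{N} y_j, \\
f(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i) &\approx \sum_{i=1}^{N} \alpha_i = S - \frac{N S}{N+1} = \frac{S}{N+1}.
\end{aligned}
$$

Under these assumptions, a model trained on N=20 points predicts 20/21 of the mean training energy at every internuclear distance.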

### Example 3: Model Selection and Evaluation for the H_{2} Dissociation Curve

Please create a new directory and enter it to continue with the following tutorial:

Download the full data set with 451 points along the H_{2} dissociation curve: a file with the internuclear distances in the H_{2} molecule in Å and a file with the full CI energies (calculated with the aug-cc-pV6Z basis set, in Hartree):

```
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/R_451.dat
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/E_FCI_451.dat
```

Now you can download the indices of the points in the data set to be used as the training, test, sub-training, and validation sets. You can check that all the training points are the same as used in Examples 1 and 2.

```
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/itrain.dat
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/itest.dat
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/isubtrain.dat
wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/ivalidate.dat
```
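The check that the training points coincide with those of Examples 1 and 2 can be scripted. The sketch below assumes that the index files hold one 1-based index per line and that the data files hold one value per line; the helper names and toy values are hypothetical.

```python
def read_numbers(path, cast=float):
    """Read one number per line from a text file, skipping blank lines."""
    with open(path) as handle:
        return [cast(line.split()[0]) for line in handle if line.strip()]


def select_by_index(values, indices):
    """Pick values by 1-based indices (assumed convention of the index files)."""
    return [values[i - 1] for i in indices]


def same_training_points(r_all, itrain, r_train, tol=1e-8):
    """Check that the indexed points match r_train, ignoring their order."""
    picked = sorted(select_by_index(r_all, itrain))
    expected = sorted(r_train)
    return len(picked) == len(expected) and all(
        abs(a - b) <= tol for a, b in zip(picked, expected)
    )


# Toy values standing in for R_451.dat, itrain.dat and R_20.dat.
r_all = [0.5, 0.6, 0.7, 0.8, 0.9]
itrain = [1, 3, 5]
r_train = [0.5, 0.7, 0.9]
print(same_training_points(r_all, itrain, r_train))
```

For the tutorial files, the three lists would be loaded with `read_numbers("R_451.dat")`, `read_numbers("itrain.dat", cast=int)` and `read_numbers("R_20.dat")`.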

Now we can optimize hyperparameters and evaluate the generalization error of the ML models using a single command. For example, for the Gaussian kernel function:

Click KRR:

Close the KRR window, click calculate, and check the result in the output window.

You can compare the ML predicted values saved to the file `E_ML_451_Gaussian.dat` with the reference values and see that this model generalizes much better than the models trained in Examples 1 and 2.

Note that many of these options (`lgSigmaL=-10 lgSigmaH=10 lgLambdaL=-40 sampling=user-defined Ntrain=20 Nsubtrain=16 iTrainIn=itrain.dat iTestIn=itest.dat iSubtrainIn=isubtrain.dat iValidateIn=ivalidate.dat`) are given to ensure that you obtain the same result every time you run the command.

You can simply run:

This command will use a random 20% of the points for testing and 20% of the training points for validation, i.e. each time you run the command you will see a different result. In addition, since MLatom uses a logarithmic search for hyperparameter optimization, you will need to modify the lower boundaries using the options `lgSigmaL=-10` and `lgLambdaL=-40` to get better results:
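Separately from the GUI input, the random splitting described above (20% of all points for testing, then 20% of the remaining training points for validation) can be pictured with the following sketch. It is an illustration only: the rounding rule and the use of Python's `random` module are assumptions, not MLatom's actual procedure.

```python
import random


def random_split(n_points, test_fraction=0.2, validation_fraction=0.2, seed=None):
    """Randomly split 1-based indices into sub-training, validation and test sets."""
    rng = random.Random(seed)
    indices = list(range(1, n_points + 1))
    rng.shuffle(indices)
    n_test = int(round(test_fraction * n_points))
    test, train = indices[:n_test], indices[n_test:]
    n_validate = int(round(validation_fraction * len(train)))
    validate, subtrain = train[:n_validate], train[n_validate:]
    return sorted(subtrain), sorted(validate), sorted(test)


subtrain, validate, test = random_split(451, seed=1)
print(len(subtrain), len(validate), len(test))
```

Without a fixed seed each call produces different sets, which is why the errors change from run to run unless the index files are supplied explicitly.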

### Example 4: Extrapolation vs Interpolation of the H_{2} Dissociation Curve

Use the files from Example 3, but move the first three indices from the `itrain.dat` file to the `itest.dat` file and remove the first three indices from the `isubtrain.dat` file. Use the same command as in Example 3 to test the accuracy, but change the number of training and sub-training points accordingly.
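The index manipulation can also be done with a short script instead of a text editor. This is a sketch with hypothetical helper names; it assumes one index per line and appends the moved indices to the end of the test list.

```python
def move_first_to_test(itrain, itest, isubtrain, n_move=3):
    """Move the first n_move training indices to the test set.

    Returns new (itrain, itest, isubtrain) lists; the inputs are not modified.
    """
    moved = itrain[:n_move]
    new_train = itrain[n_move:]
    new_test = itest + moved           # appended at the end (an assumption)
    new_subtrain = isubtrain[n_move:]  # drop the first n_move entries
    return new_train, new_test, new_subtrain


def write_indices(path, indices):
    """Write one index per line."""
    with open(path, "w") as handle:
        handle.write("".join(f"{i}\n" for i in indices))


# Toy index lists, not the contents of the tutorial files.
itrain, itest, isubtrain = move_first_to_test(
    [1, 24, 47, 70, 93], [2, 3, 4], [1, 24, 47, 70]
)
print(itrain, itest, isubtrain)
```

Starting from the 20 training and 16 sub-training points of Example 3, this leaves 17 training and 13 sub-training points, so the input would use `Ntrain=17` and `Nsubtrain=13`.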

### Example 5: Delta-learning of the H_{2} Dissociation Curve

Download the file with the UHF/STO-3G energies (in Hartree):

`wget http://mlatom.com/wp-content/uploads/AQCtutorial/H2/E_UHF_451.dat`

Follow the procedure described in Example 4, but now instead of learning the FCI/aug-cc-pV6Z energies directly, use Δ-ML. For this, you need to use the option `deltaLearn`, and instead of `Yfile` you need to specify the files with the baseline (low-level) reference values (UHF/STO-3G, option `Yb=E_UHF_451.dat`) and with the target (high-level) reference values (FCI/aug-cc-pV6Z, option `Yt=E_FCI_451.dat`). In addition, you can specify the names of the file with the Δ-ML energies estimating the target level of theory (file `E_D-ML_451.dat`) and of the file with the predicted ML corrections (option `YestFile=corr_ML_451.dat`).

Your input parameters for *MLatom* should look like:

Then close the KRR window and click calculate. You can see that the Δ-ML predictions in the file `E_D-ML_451.dat` are much better than in Example 4.
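The arithmetic behind Δ-ML is simple: the model is trained on the difference between target and baseline values, and the final estimate is the baseline plus the predicted correction. A minimal sketch with toy numbers (not values from `E_UHF_451.dat` or `E_FCI_451.dat`; the function names are hypothetical):

```python
def delta_targets(y_target, y_baseline):
    """Corrections the ML model is trained on: target minus baseline."""
    return [t - b for t, b in zip(y_target, y_baseline)]


def delta_ml_estimate(y_baseline, ml_corrections):
    """Delta-ML estimate of the target level: baseline plus predicted correction."""
    return [b + c for b, c in zip(y_baseline, ml_corrections)]


# Toy energies in Hartree, for illustration only.
y_baseline = [-1.00, -1.10, -1.05]
y_target = [-1.05, -1.17, -1.10]

corrections = delta_targets(y_target, y_baseline)
# A perfect ML model would predict exactly these corrections.
estimate = delta_ml_estimate(y_baseline, corrections)
print(corrections)
print(estimate)
```

The correction is usually a smoother and smaller quantity than the total energy, which is why it can be learned from fewer points.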

### Example 6: Training on Randomly Sampled Points on the H_{2} Dissociation Curve

Use the files `R_451.dat` and `E_FCI_451.dat` from Example 3. Use the following options:

In the output file `rand20_E_FCI_451.inp.out` you can find various error measures for the random hold-out test set, which should look something like this:

```
Statistical analysis for 431 entries in the test set
MAE = 0.0008701190755
MSE = -0.0008594726288
RMSE = 0.0045194135782
mean(Y) = -1.0363539894174
mean(Yest) = -1.0372134620462
largest positive outlier
error = 0.0001099473549
index = 45
estimated value = -1.1566967526451
reference value = -1.1568067000000
largest negative outlier
error = -0.0421702385754
index = 1
estimated value = -1.1463026385754
reference value = -1.1041324000000
correlation coefficient = 0.9974017799004
linear regression of {y, y_est} by f(a,b) = a + b * y
R^2 = 0.9948103105483
a = 0.0277384800540
b = 1.0275947726113
SE_a = 0.0037190821572
SE_b = 0.0035833876014
```

By running this command several times you will see that the above numbers are different each time, because different training points are chosen each time, as you can check in the `itrain_20rand.dat` file. Note: the keywords requesting saving of the indices are optional. MLatom does not overwrite these files and stops if it detects them. Thus, if you run this command again, you should either comment out these lines, remove the previous files, or give the files a different name each time.
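The error measures in the listing above can be recomputed from the reference and estimated values. Judging from the listing, each error is the estimated value minus the reference value, and MSE is the mean signed error (it equals mean(Yest) minus mean(Y)), not a mean squared error. A minimal sketch under that reading, with toy numbers and a hypothetical function name:

```python
import math


def error_statistics(reference, estimated):
    """Return MAE, mean signed error (MSE) and RMSE of estimated values."""
    errors = [est - ref for ref, est in zip(reference, estimated)]
    n = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "largest positive": max(errors),
        "largest negative": min(errors),
    }


# Toy values, for illustration only.
stats = error_statistics([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
for name, value in stats.items():
    print(f"{name} = {value:.13f}")
```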

### Example 7: Potential energy surface of CH_{3}Cl

Here we follow some of the steps for creating a machine learning potential energy surface (PES) of CH_{3}Cl as published in [*J. Chem. Phys.* **2017**, *146*, 244108]. We use the data set with 44819 points from [*J. Chem. Phys.* **2015**, *142*, 244306], which was kindly provided by Dr. Alec Owens.

Here we show how to:

- Convert geometries from XYZ format to the ML input vectors.
- Sample points to the training, test, sub-training, and validation sets only from ML input using structure-based sampling from the data sliced into three regions.
- Train the s10%-ML model using self-correction and structure-based sampled training set.

Please create a new directory to continue.

#### Converting geometries to ML input vector

Download the file `xyz.dat` with the Cartesian coordinates of CH_{3}Cl and the file `eq.xyz` with the near-equilibrium geometry of CH_{3}Cl in Cartesian coordinates:

```
wget http://mlatom.com/wp-content/uploads/tutorial/CH3Cl/xyz.dat
wget http://mlatom.com/wp-content/uploads/tutorial/CH3Cl/eq.xyz
```

Convert (option `XYZ2X`) the geometries of CH_{3}Cl in Cartesian coordinates (option `XYZfile=xyz.dat`) to the ML input vector of normalized inverted internuclear distances (option `molDescriptor=RE`) with the hydrogen atoms (3, 4, 5 in the XYZ files, option `permInvNuclei=3-4-5`) sorted (option `molDescrType=sorted`) by their nuclear repulsions to all other atoms:

You should get the file `x_sorted.dat` (option `XfileOut=x.dat`) with 44819 lines, each with 10 entries. In the `XYZ2X.inp.out` file you should see the line `Number of permutations: 6`.

#### Sampling points

Here we show how to sample points to the training, test, sub-training, and validation sets only from ML input using structure-based sampling from the data sliced into three regions.

Get the input vector for the equilibrium geometry using the same procedure as in the preceding section:

You should get a vector with ten 1.0’s in `eq.x`.
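The numbers quoted above follow from the structure of the descriptor. The sketch below assumes that each entry of the RE vector is the equilibrium distance of an atom pair divided by its current distance; it leaves out the sorting of the hydrogen atoms and the permutations, and the coordinates are arbitrary, not CH_{3}Cl data.

```python
import itertools
import math


def re_descriptor(geometry, eq_geometry):
    """Return r_eq / r for every atom pair (unsorted; permutations not handled)."""
    descriptor = []
    for i, j in itertools.combinations(range(len(geometry)), 2):
        r = math.dist(geometry[i], geometry[j])
        r_eq = math.dist(eq_geometry[i], eq_geometry[j])
        descriptor.append(r_eq / r)
    return descriptor


# Hypothetical five-atom geometry (arbitrary coordinates, not CH3Cl data).
eq = [
    (0.0, 0.0, 0.0),
    (0.0, 0.0, 1.8),
    (1.0, 0.0, -0.4),
    (-0.5, 0.9, -0.4),
    (-0.5, -0.9, -0.4),
]
# Same structure with the second atom pulled away along z.
stretched = [eq[0], (0.0, 0.0, 2.0)] + eq[2:]

print(len(re_descriptor(eq, eq)))        # 5 atoms give 10 atom pairs
print(re_descriptor(eq, eq))             # ten 1.0 values
print(re_descriptor(stretched, eq)[0])   # 1.8 / 2.0 for the stretched pair
```

Five atoms form 10 atom pairs, which explains the 10 entries per line, and the equilibrium geometry compared with itself gives ten 1.0 values. The three interchangeable hydrogen atoms can be ordered in 3! = 6 ways, which matches the reported number of permutations.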

Sort geometries by the Euclidean distance of their corresponding ML input vector to the input vector of the equilibrium geometry and slice the ordered data set into 3 regions of the same size:

This command should have created the files `xordered.dat` (input vectors sorted by distance), `indices_ordered.dat` (indices of the ordered data set with respect to the original data set), and `distances_ordered.dat` (list of Euclidean distances of the ordered data points to the equilibrium). It has also created the directories `slice1`, `slice2`, and `slice3`. Each of them contains three files, `x.dat`, `slice_indices.dat`, and `slice_distances.dat`, which are slices of the corresponding files of the entire data set.
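The ordering and slicing step can be sketched in a few lines. This is not the code MLatom runs: the function name is hypothetical, indices are taken to be 1-based, and the way leftover points are assigned when the data set does not divide evenly is an assumption.

```python
import math


def slice_by_distance(x_vectors, eq_vector, n_slices=3):
    """Order points by distance to eq_vector and cut them into n_slices parts.

    Returns a list of slices; each slice is a list of (index, distance) pairs
    with 1-based indices referring to the original data set.
    """
    distances = [math.dist(x, eq_vector) for x in x_vectors]
    order = sorted(range(len(x_vectors)), key=lambda i: distances[i])
    ordered = [(i + 1, distances[i]) for i in order]
    # Ceiling division: the last slice is shorter when the sizes do not
    # divide evenly (an assumption; MLatom may distribute the rest differently).
    size = -(-len(ordered) // n_slices)
    return [ordered[k:k + size] for k in range(0, len(ordered), size)]


# Toy two-dimensional input vectors, for illustration only.
eq_x = (1.0, 1.0)
xs = [(1.0, 1.2), (1.0, 1.0), (0.7, 1.0), (1.1, 1.0), (1.5, 1.5), (1.0, 0.95)]
slices = slice_by_distance(xs, eq_x)
for number, part in enumerate(slices, start=1):
    print(f"slice{number}:", [index for index, _ in part])
```

Slicing by distance to the equilibrium structure makes it possible to sample the near-equilibrium, intermediate, and strongly distorted regions separately.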

Use structure-based sampling to sample the desired number of data from each slice:

This command should create `itrain.dat` files with training set indices in each `slice[1-3]` directory. Note: it is possible to modify the `sliceData.py` script to submit the jobs in parallel to the queue.

Merge sampled indices from all slices into indices files for the training, test, sub-training, and validation sets using the same order of data points as in original data set:

This command will create four files with indices: `itrain.dat` (with 4480 points for training), `isubtrain.dat` (with 80% of the training points, also chosen using structure-based sampling), `itest.dat`, and `ivalidate.dat`.

#### Creating ML model

Download the file `y.dat` with the reference energies:

`wget http://mlatom.com/wp-content/uploads/tutorial/CH3Cl/y.dat`

In [*J. Chem. Phys.* **2017**, *146*, 244108] we optimized hyperparameters using the validation set with points having deformation energy not higher than 10000 cm^{-1}. You need to filter out the points with higher energies from the validation set. To do this, move the file `ivalidate.dat` to `ivalidateall.dat` and use the following Python 2 script:

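```
# Keep the unfiltered validation indices under a new name
mv ivalidate.dat ivalidateall.dat
# Download the filtering script and give it a .py extension
wget http://mlatom.com/wp-content/uploads/tutorial/CH3Cl/ivalidate.txt
mv ivalidate.txt ivalidate.py
python2 ivalidate.py
```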
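The contents of the downloaded script are not reproduced in this tutorial. As a rough Python 3 illustration of the filtering idea only, the sketch below keeps the validation indices whose energy does not exceed a threshold. It assumes 1-based indices and energies that are already deformation energies in cm^{-1}; the function name and the numbers are hypothetical.

```python
def filter_by_energy(indices, energies, threshold):
    """Keep 1-based indices whose energy does not exceed threshold.

    energies holds one value per point of the full data set, in the same
    units as threshold (assumed here to be deformation energies in cm^-1).
    """
    return [i for i in indices if energies[i - 1] <= threshold]


# Toy energies and indices, for illustration only.
energies = [0.0, 12000.0, 5000.0, 10000.0, 25000.0]
ivalidate_all = [2, 3, 4, 5]
ivalidate = filter_by_energy(ivalidate_all, energies, threshold=10000.0)
print(ivalidate)
```

After filtering, the number of remaining validation points has to be counted, because the input for the next step depends on how many points each index file contains.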

Now you can train the ML model using the reference data for the 4480 training points defined in the `itrain.dat` file, following the same procedure as described in [*J. Chem. Phys.* **2017**, *146*, 244108]. Check how many points are in the files with indices and use the following input:

The calculation created `mlmodlayer[1-4].unf` files with ML models for each of the 4 layers of the self-correcting procedure. Since we used 4480 out of 44819 points, this model is equivalent to the s10%-ML model in [*J. Chem. Phys.* **2017**, *146*, 244108]. The above command also created the files `ylayer[1-4].dat` with ML predictions for each of the layers. The final predictions are in the file `ylayer4.dat`. You can compare these values with the reference values for the entire data set using the following Python 2 script:

```
wget http://mlatom.com/wp-content/uploads/tutorial/CH3Cl/compare_weighted.txt
mv compare_weighted.txt compare_weighted.py
python2 compare_weighted.py
```

The error should be 3.44 cm^{-1}.

### Example 8: Importance of Sampling in Critical Regions

Create a new directory then enter it.

Download the files `R.dat` with coordinates and `NAC.dat` with reference nonadiabatic couplings calculated with the spin-boson Hamiltonian model in 1-D (*J. Phys. Chem. Lett.* **2018**, *9*, 5660):

```
wget http://mlatom.com/wp-content/uploads/AQCtutorial/NACs/R.dat
wget http://mlatom.com/wp-content/uploads/AQCtutorial/NACs/NAC.dat
```

Train an ML model on 10 randomly sampled points and make predictions with this model for all points in the data set using the following input file:

You can compare the reference nonadiabatic couplings with the ML couplings saved to the file `ML-NAC.dat`. If you are very lucky, ML can describe the narrow peak. In most cases though, such an ML model will miss the peak. You can, however, describe this peak properly if you know its position beforehand and add it to the training set. Add the index of the minimum in the data set to the files with the indices of the training and sub-training points, then use the following input file to train an ML model on 11 points (10 previously sampled random points + 1 critical point) and to use this ML model to predict nonadiabatic couplings for all points in the data set:

As you can see, the ML predictions improved significantly.
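Finding the critical point and adding it to the index lists can be scripted as well. The sketch assumes 1-based indices and that the critical point is the smallest value in the list of couplings; the helper names and toy values are hypothetical.

```python
def index_of_minimum(values):
    """Return the 1-based index of the smallest value."""
    return min(range(len(values)), key=lambda i: values[i]) + 1


def add_critical_point(indices, critical_index):
    """Append critical_index unless it is already among the indices."""
    if critical_index in indices:
        return list(indices)
    return list(indices) + [critical_index]


# Toy couplings and training indices, not the contents of NAC.dat.
nac = [0.0, -0.1, -2.5, -0.2, 0.0]
itrain = [1, 5]

critical = index_of_minimum(nac)
itrain_new = add_critical_point(itrain, critical)
print(critical, itrain_new)
```

The same function would be applied to the sub-training indices, and the numbers of training and sub-training points in the input increase by one if the critical point was not sampled before.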

## Tutorial for Absorption Spectrum Simulation

### Example 1

We will start with a simple example: simulating an absorption spectrum with pre-calculated data.

First, download these data and decompress them with the `unzip` command.

You will find several files:

```
E1..10.dat # data
f1..10.dat # data
cross-section_ref.dat # reference spectrum data
eq.xyz # equilibrium geometry
nea_geoms.xyz # ensemble of 1001 conformations
inp # input file
```

We will not use the “inp” file this time, because we will generate the input with the GUI.

Open the MLatom GUI, select “spectrum simulation”, and fill in the parameters:

Click calculate and MLatom will start running:

After MLatom has finished, you will get a prompt:

Then check the log to see whether it ended normally:

Then check the result in the terminal:

Everything is this simple with the GUI: all you need to do is click!

You can also check the input file, the standard output, and the error output in the files named “mlatom.inp”, “mlatom.inp.out”, and “mlatom.inp.err”.

For more details, see this tutorial.

### Example 2

Now we will run the calculation without any pre-calculated data. Please download the data below and decompress them into a directory.

For this task, you first need to install Newton-X and Gaussian. For more details, see this tutorial.

This time the QC calculations will take a long time, so it is not recommended to run them directly within the GUI. Instead, you can modify the input file name, click “generate input file”, and then manually execute the command `MLatom.py mlatom.inp &> mlatom.log` in the terminal.

After the calculation has finished, you will find that MLatom chose 150 as the optimal number of points. However, the ML cross section can sometimes be of poor quality even after the convergence criterion is met, so we want to check 50 more points (nQMpoints=200).

Please create a new directory and decompress the zip data into it. To add the additional 50 points, modify the input parameters as follows:

The plot is saved at `cross-section/plot.png`; you can compare this figure with the previous one.

For more details, see this tutorial.