# Tutorial for ML-NEA Approach

This tutorial shows how to calculate absorption cross section using machine learning-nuclear ensemble approach (ML-NEA) as reported in:

Bao-Xin Xue, Mario Barbatti, Pavlo O. Dral, *Machine Learning for Absorption Cross Sections*, *submitted*.

Preprint on ChemRxiv, DOI: 10.26434/chemrxiv.12594191.

## Overview

ML-NEA cross section calculations consist of the following steps, each of which is described in the next sections:

- Generation of 50k ensemble points (set of geometries)
- Generation of training data: vertical excitation energies (E) and oscillator strengths (f) for small number (number of training points, Ntr, one can start with 50) of ensemble point and required number of excited states should be calculated with quantum chemical method
- Machine learning calculations of vertical excitation energies and oscillator strengths for the remaining 50k – Ntr ensemble points
- Cross section calculations from vertical excitation energies and oscillator strengths for all 50k ensemble points.
- Cross section validation. If its accuracy is not satisfactory, increase number of training points and repeat steps 2-5.

Before going through this step, you need to install MLatom, Newton-X, and, optionally, Gaussian 16. All examples are in bash shell.

Set some environmental variables:

```
export NX=[path to Newton-X bin folder]
export PATH=$PATH:[path to MLatom bin folder]
```

We show how above steps work on an example of benzene. To follow this example, create a folder `ML-cross-section`

(we call it `$root`

).

```
mkdir ML-cross-section
cd ML-cross-section
root=`pwd`
```

## Step 1: Generation of 50k ensemble points

We recommend using Newton-X to generate ensemble points with Wigner sampling. Please refer to the Newton-X tutorial & documentation for details how to use Newton-X.

Now we will take benzene for example to explain how the ML-cross-section is calculated and evaluated. Create a folder named `geometry`

at `$root`

and change directory to there.

`mkdir geometry ; cd geometry`

### Geometry optimization and frequency calculations

We will use B3LYP/def2tzvpp to optimize the geometry of benzene.

**Optional: **Optimize the geometry of benzene and perform frequency calculations using the Gaussian input file (freq.com):

```
# B3LYP/def2TZVPP opt freq
benzene opt & freq at b3lyp/def2tzvpp
0 1
C 1.20809735 0.69749533 -0.00000000
C 0.00000000 1.39499067 -0.00000000
C -1.20809735 0.69749533 -0.00000000
C -1.20809735 -0.69749533 -0.00000000
C 0.00000000 -1.39499067 -0.00000000
C 1.20809735 -0.69749533 -0.00000000
H 2.16038781 1.24730049 -0.00000000
H 0.00000000 2.49460097 -0.00000000
H -2.16038781 1.24730049 -0.00000000
H -2.16038781 -1.24730049 -0.00000000
H 0.00000000 -2.49460097 -0.00000000
H 2.16038781 -1.24730049 -0.00000000
```

Save the Gaussian output file to `freq.out`

.

You also need the file with optimized geometry from the freq.out. You can save it to file `opt.xyz`

in XYZ format.

```
12
C 0.00000000 1.39110400 0.00000000
C 1.20473100 0.69555200 0.00000000
C 1.20473100 -0.69555200 0.00000000
C -0.00000000 -1.39110400 0.00000000
C -1.20473100 -0.69555200 -0.00000000
C -1.20473100 0.69555200 0.00000000
H 0.00000000 2.47330200 0.00000000
H 2.14194200 1.23665100 0.00000000
H 2.14194200 -1.23665100 0.00000000
H -0.00000000 -2.47330200 0.00000000
H -2.14194200 -1.23665100 -0.00000000
H -2.14194200 1.23665100 0.00000000
```

You can download these two files here: `freq.out`

and `opt.xyz`

.

Then in the `geometry`

folder, run command:

`$NX/xyz2nx < opt.xyz`

and you will get a file named `geom`

.

### Sampling 50k points

Now, we need to create the input file of `initcond.pl`

. Because we only want the configuration of 50k ensemble points, we only set `nsf=2`

, and `chk_e=0`

to make Newton-X not to perform gaussian calculation.

Download `initqp_input`

, and execute the following command to generate 50k ensemble points.

`nohup $NX/initcond.pl > initcond.log 2>&1 &`

It will take a long time to finish (because of the low efficiency of perl), so you can directly download the output file below to save time.

Then execute the following command:

```
unzip final_output.zip
atom_num=12 # you can change the atom number here for your own molecule
sed -n /Geometry/,/Velocity\ in\ NX/p final_output | sed /Velocity/d | awk '{printf ("%s\t", $1)} {printf ("%.8f\t", $3*0.5291772083)} {printf ("%.8f\t", $4*0.5291772083)} {printf ("%.8f\n", $5*0.5291772083)}' | sed -r s/Geometry.+/$atom_num\\n/g > xyz.dat
```

Then you will get a file named `xyz.dat`

including 50001 ensemble points (the first point is the equilibrium geometry).

## Step 2: Generation of training data

Please create an folder named `gaussian-calc`

at `$root`

. We will use CAM-B3LYP/ma-TZVP for TD calculation.

```
cd $root
mkdir gaussian-calc
cd gaussian-calc
```

### Needed files

You can download all the needed files below:

Here are some description to files:

**itrain.dat**–> the indices of training in xyz.dat**run_multi_gaussian.py**–> script that can run many gaussian calculations**template.com**-> gaussian input file template, you can modify it if you like.**xyz.dat**–> including geometric configuration of 50k ensemble point. please copy it from`../geometry/`

(if you want to calculate your own molecule), or use the file in the zip archive.**multi-gaussian.bsub**(optional)- the batch script of BSUB, it’s optional, you must modify it before using it.
- 42th line defined the ma-TZVP basis, you should modify it at the correct path, you can download this file through this link.

Decomprese the folder at `gaussian-calc/`

folder.

`unzip gaussian-calc.zip`

### Quantum chemical calculations

The `itrain.dat`

has been included in the archive already, it includes the indices of geometry in `xyz.dat`

.

In default, it include 50001 points, because we need to create a 50k points reference for evaluation, but it will take too much time to calculate all the gaussian task, so you can use the command below to create a new `itrain.dat`

file.

The number of itrain_num should be enough for a good cross section spectrum result, normally it should be bigger than 200. But you can start with 50 for a quick tutorial.

```
itrain_num=50 # modify the number here
: > itrain.dat
for i in `seq 1 $itrain_num`; do
echo $i >> itrain.dat
done
```

Certainly you can also calculate 50k by yourself, but we will provide the calculated 50k points cross section data at later.

If you choose to calculate too much points, it’s highly recommended to split task (for advanced linux operators only), the procedure is implemented in this script.

You can directly run `python3 run_multi_gaussian.py`

, calculation results will appear in the `$root/gaussian-calc/log`

folder. (for BSUB circumstance, result will appear in `$root/gaussian-calc/output/log/`

).

After calculation, you can use the command below to copy all the log file into `../all_log`

.

```
cd $root/gaussian-calc
mkdir ../all_log
find . -type f -regex '.+?/[0-9]+?\.log' | xargs -I log_file cp log_file ../all_log/
```

For 500 points, it may take 18 hours at 18 cores server(2.6 GHz).

### Post-processing data

Now we have all the results of 50k TD ensemble points, we need to get all the data of excited energy (E) and oscillator strength (f). You can use this script at `$root`

to excerpt all the needed data.

```
cd $root
num=10 # you can modify it to decide how many excited states you want to excerpt
cat ./gaussian-calc/itrain.dat | while read line; do
for i in `seq 1 $num`; do
grep f=0 './all_log/'$line'.log' | grep ' '$i':' | awk '{print $5}' | sed s/f=//g >> 'E'$i
grep f=0 './all_log/'$line'.log' | grep ' '$i':' | awk '{print $9}' | sed s/f=//g >> 'f'$i
done
done
for i in `seq 1 $num`; do
paste ./gaussian-calc/itrain.dat E$i f$i > state$i'.index.E.f'
rm E$i f$i
done
mkdir all_data
mv *.index.E.f all_data
```

It will generate the files named `state$i.index.E.f`

in the folder `$root/all_data/`

. (the calculated data will be provided at the later part)

### Optional: QC cross section

Now we will generate the cross section reference data (50k) to evaluate how better is the ML method performs and how fast the deviation converged. (**If you calculated 50k data at the previous step, you can continue, or just download the calculated result at the end part of this section**.)

Please enter the folder `all_data`

, download this script to this folder, and then execute the following command.

```
cd $root/all_data
### save the script here ###
cp ../gaussian-calc/itrain.dat .
num=`wc itrain.dat -l | awk '{print $1}'`
python3 cross-section-TD.py $num . 10 # third parameter represents the state number defined at the previous step
```

It will generate a file named `cross-section.dat`

, which can be directly download through this link.

## Steps 3-4: Machine learning calculations

Please create a folder named `ML`

at `$root`

, and then enter it.

`cd $root; mkdir ML; cd ML`

### Use descriptor to generate the input file (x.dat)

Now, we will generate the input file for MLaotm, please create a folder named `descriptor`

at `$root`

, please run the command below:

```
cd $root; mkdir descriptor; cd descriptor
cp ../geometry/xyz.dat .
MLatom.py XYZ2X XYZfile=xyz.dat XfileOut=x.dat molDescriptor=RE > xyz2x.out
```

The outputted `x.dat`

is the input file in MLaotm. (this file can be downloaded at the next part)

### Files description

In Machine Learning procedure, we will use scripts and many data to make all the procedure automatically. You can download the zip file below and uncompress it into the `ML`

folder. The content including:

**script/**–> including all the needed script**data/**–> including all the needed data**train_num**–> a file includes how much point you want to train**ML_train.bsub**–> bsub task script for ML (optional)**ML_train.sh**–> the main script that bsub will use**ML-RIC.sh**–> the script to calculate RIC of ML (later part will use it)**TD-RIC.sh**–> the script to calculate RIC of TD (later part will use it)

```
cd $root/ML
#### upload ML.zip file to $root/ML/ ####
unzip ML.zip
```

If you have calculated the cross section of reference, you can execute this command at the folder `$root/ML/`

: (or just use the file provided in `ML.zip`

)

`cp ../all_data/cross-section.dat data`

At previous step, we calculated `x.dat`

, you can copy it to `data/`

, or just use the file provided in `ML.zip`

.

`cp ../descriptor/x.dat data # if you calculated it`

### Submit the Machine Learning task

After modifying some parameter in `ML_train.sh`

and `train_nums`

, just run `bash ML_train.sh`

to starting training. (or use `bsub < ML_train.bsub`

if you have modified the bsub script)

After training, in `$root/ML/train/$num/spectrum/`

, you will get 3 cross section output results, `2/cross-section.dat`

, `2-4/cross-section.dat`

, `all/cross-section.dat`

, which has take first excited state, first 3 excited states, 10 excited states into consideration.

As described in the paper, this script uses E and f calculated using quantum chemical method for the training set and machine learning estimates for the remaining data points. It also set negative f values sometimes predicted by ML with zeros.

### Plotting cross sections

The cross setion data is stored in the path `$root/ML/train/$num/spectrum/all/cross-section.dat`

, you can use the command to draw the figure:

```
train_num=100 # define what training point you want to draw
cd $root/ML/train/$train_num/spectrum/all
awk '{print $1"\t" $3}' cross-section.dat | sed 1d > plot.dat
gnuplot
plot 'plot.dat' u 1:2 w l
```

Or use other software to plot plot.dat, where first column is energy (eV), second column is cross section (Å^2/moledule).

## Step 5: Cross section validation

Obtained cross section can be compared with the any reference absorption spectrum available (e.g. experimental).

In case of benzene, for which very precise cross section calculated with 50k ensemble points available, you can compare the spectrum using the relative integral change (RIC).

### Explanation of RIC

Now we will introduce a evaluating criteria RIC which is the value of this green area divided by the whole area between cross section of referene (50k) and x axis.

### calculate RIC for different number of ensemble points

We need to calculate cross section spectrum (Newton-X method) for RIC. Generally, we use Newton-X to calculate all the cross section spectrum, but since we have calculated all the gaussian result data, so we needn’t to do that.

We implemented the cross section spectrum generation and RIC calculation in the script `TD-RIC.sh`

.

Just execute `bash TD-RIC.sh`

, and you will get a file named `TD-RIC.result`

, it include the RIC value of ensemble method corresponding to the points defined in `train_nums`

.

### calculate RIC for different number of ML points

the procedure of calculating RIC for ML is implemented in `ML-RIC.sh`

, simply execute `bash ML-RIC.sh`

, and you will get a result file named `ML-RIC.result`

corresponding to the points defined in `train_nums`

.