Tutorial for ML-NEA Approach

This tutorial shows how to calculate absorption cross section using machine learning-nuclear ensemble approach (ML-NEA) as reported in:

Bao-Xin Xue, Mario Barbatti, Pavlo O. Dral, Machine Learning for Absorption Cross SectionsJ. Phys. Chem. A 2020124, 7199–7210. DOI: 10.1021/acs.jpca.0c05310.
Preprint on ChemRxiv, DOI: 10.26434/chemrxiv.12594191.


ML-NEA cross section calculations consist of the following steps, each of which is described in the next sections:

  1. Generation of 50k ensemble points (set of geometries)
  2. Generation of training data: vertical excitation energies (E) and oscillator strengths (f) for small number (number of training points, Ntr, one can start with 50) of ensemble point and required number of excited states should be calculated with quantum chemical method
  3. Machine learning calculations of vertical excitation energies and oscillator strengths for the remaining 50k – Ntr ensemble points
  4. Cross section calculations from vertical excitation energies and oscillator strengths for all 50k ensemble points.
  5. Cross section validation. If its accuracy is not satisfactory, increase number of training points and repeat steps 2-5.

Before going through this step, you need to install MLatom, Newton-X, and, optionally, Gaussian 16. All examples are in bash shell.

Set some environmental variables:

export NX=[path to Newton-X bin folder]
export PATH=$PATH:[path to MLatom bin folder]

We show how above steps work on an example of benzene. To follow this example, create a folder ML-cross-section (we call it $root ).

mkdir ML-cross-section
cd ML-cross-section

Step 1: Generation of 50k ensemble points

We recommend using Newton-X to generate ensemble points with Wigner sampling. Please refer to the Newton-X tutorial & documentation for details how to use Newton-X.

Now we will take benzene for example to explain how the ML-cross-section is calculated and evaluated. Create a folder named geometry at $root and change directory to there.

mkdir geometry ; cd geometry

Geometry optimization and frequency calculations

We will use B3LYP/def2tzvpp to optimize the geometry of benzene.

Optional: Optimize the geometry of benzene and perform frequency calculations using the Gaussian input file (freq.com):

# B3LYP/def2TZVPP opt freq

benzene opt & freq at b3lyp/def2tzvpp

0 1
 C                  1.20809735    0.69749533   -0.00000000
 C                  0.00000000    1.39499067   -0.00000000
 C                 -1.20809735    0.69749533   -0.00000000
 C                 -1.20809735   -0.69749533   -0.00000000
 C                  0.00000000   -1.39499067   -0.00000000
 C                  1.20809735   -0.69749533   -0.00000000
 H                  2.16038781    1.24730049   -0.00000000
 H                  0.00000000    2.49460097   -0.00000000
 H                 -2.16038781    1.24730049   -0.00000000
 H                 -2.16038781   -1.24730049   -0.00000000
 H                  0.00000000   -2.49460097   -0.00000000
 H                  2.16038781   -1.24730049   -0.00000000

Save the Gaussian output file to freq.out.

You also need the file with optimized geometry from the freq.out. You can save it to file opt.xyz in XYZ format.


C       0.00000000      1.39110400      0.00000000
C       1.20473100      0.69555200      0.00000000
C       1.20473100      -0.69555200     0.00000000
C       -0.00000000     -1.39110400     0.00000000
C       -1.20473100     -0.69555200     -0.00000000
C       -1.20473100     0.69555200      0.00000000
H       0.00000000      2.47330200      0.00000000
H       2.14194200      1.23665100      0.00000000
H       2.14194200      -1.23665100     0.00000000
H       -0.00000000     -2.47330200     0.00000000
H       -2.14194200     -1.23665100     -0.00000000
H       -2.14194200     1.23665100      0.00000000

You can download these two files here: freq.out and opt.xyz.

Then in the geometry folder, run command:

$NX/xyz2nx < opt.xyz

and you will get a file named geom.

Sampling 50k points

Now, we need to create the input file of initcond.pl. Because we only want the configuration of 50k ensemble points, we only set nsf=2, and chk_e=0 to make Newton-X not to perform gaussian calculation.

Download  initqp_input, and execute the following command to generate 50k ensemble points.

nohup $NX/initcond.pl > initcond.log 2>&1 &

It will take a long time to finish (because of the low efficiency of perl), so you can directly download the output file below to save time.

Then execute the following command:

unzip final_output.zip
atom_num=12  # you can change the atom number here for your own molecule
sed -n /Geometry/,/Velocity\ in\ NX/p final_output | sed  /Velocity/d | awk '{printf ("%s\t", $1)} {printf ("%.8f\t", $3*0.5291772083)} {printf ("%.8f\t", $4*0.5291772083)} {printf ("%.8f\n", $5*0.5291772083)}'  | sed -r s/Geometry.+/$atom_num\\n/g > xyz.dat

Then you will get a file named  xyz.dat including 50001 ensemble points (the first point is the equilibrium geometry).

Step 2: Generation of training data

Please create an folder named gaussian-calcat $root . We will use CAM-B3LYP/ma-TZVP for TD calculation.

cd $root
mkdir gaussian-calc
cd gaussian-calc

Needed files

You can download all the needed files below:

Here are some description to files:

  • itrain.dat –> the indices of training in xyz.dat
  • run_multi_gaussian.py –> script that can run many gaussian calculations
  • template.com -> gaussian input file template, you can modify it if you like.
  • xyz.dat –> including geometric configuration of 50k ensemble point. please copy it from ../geometry/(if you want to calculate your own molecule), or use the file in the zip archive.
  • multi-gaussian.bsub (optional)
    1. the batch script of BSUB, it’s optional, you must modify it before using it.
    2. 42th line defined the ma-TZVP basis, you should modify it at the correct path, you can download this file through this link

Decomprese the folder at gaussian-calc/ folder.

unzip gaussian-calc.zip

Quantum chemical calculations

The itrain.dat has been included in the archive already, it includes the indices of geometry in xyz.dat.

In default, it include 50001 points, because we need to create a 50k points reference for evaluation, but it will take too much time to calculate all the gaussian task, so you can use the command below to create a new itrain.dat file.

The number of itrain_num should be enough for a good cross section spectrum result, normally it should be bigger than 200. But you can start with 50 for a quick tutorial.

itrain_num=50  # modify the number here
: > itrain.dat
for i in `seq 1 $itrain_num`; do
    echo $i >> itrain.dat

Certainly you can also calculate 50k by yourself, but we will provide the calculated 50k points cross section data at later.

If you choose to calculate too much points, it’s highly recommended to split task (for advanced linux operators only), the procedure is implemented in this script.

You can directly run python3 run_multi_gaussian.py, calculation results will appear in the $root/gaussian-calc/log folder. (for BSUB circumstance, result will appear in  $root/gaussian-calc/output/log/).

After calculation, you can use the command below to copy all the log file into ../all_log.

cd $root/gaussian-calc
mkdir ../all_log
find . -type f -regex '.+?/[0-9]+?.log' | xargs -I log_file cp log_file ../all_log/

For 500 points, it may take 18 hours at 18 cores server(2.6 GHz).

Post-processing data

Now we have all the results of 50k TD ensemble points, we need to get all the data of excited energy (E) and oscillator strength (f). You can use this script at $root to excerpt all the needed data. 

cd $root
num=10   # you can modify it to decide how many excited states you want to excerpt

cat ./gaussian-calc/itrain.dat | while read line; do
  for i in `seq 1 $num`; do
    grep f=0 './all_log/'$line'.log' | grep ' '$i':' | awk '{print $5}' | sed s/f=//g >> 'E'$i
    grep f=0 './all_log/'$line'.log' | grep ' '$i':' | awk '{print $9}' | sed s/f=//g >> 'f'$i
for i in `seq 1 $num`; do
  paste ./gaussian-calc/itrain.dat E$i f$i > state$i'.index.E.f'
  rm E$i f$i

mkdir all_data
mv *.index.E.f all_data

It will generate the files named  state$i.index.E.f in the folder $root/all_data/. (the calculated data will be provided at the later part)

Optional: QC cross section

Now we will generate the cross section reference data (50k) to evaluate how better is the ML method performs and how fast the deviation converged. (If you calculated 50k data at the previous step, you can continue, or just download the calculated result at the end part of this section.)

Please enter the folder all_data, download this script to this folder, and then execute the following command.

cd $root/all_data
### save the script here ###
cp ../gaussian-calc/itrain.dat .
num=`wc itrain.dat -l | awk '{print $1}'`
python3 cross-section-TD.py $num . 10   # third parameter represents the state number defined at the previous step

It will generate a file named cross-section.dat, which can be directly download through this link.

Steps 3-4: Machine learning calculations

Please create a folder named ML at $root, and then enter it.

cd $root; mkdir ML; cd ML

Use descriptor to generate the input file (x.dat)

Now, we will generate the input file for MLaotm, please create a folder named descriptor at $root, please run the command below:

cd $root;  mkdir descriptor;  cd descriptor
cp ../geometry/xyz.dat .
MLatom.py XYZ2X XYZfile=xyz.dat XfileOut=x.dat molDescriptor=RE > xyz2x.out

The outputted x.dat is the input file in MLaotm. (this file can be downloaded at the next part)

Files description

In Machine Learning procedure, we will use scripts and many data to make all the procedure automatically. You can download the zip file below and uncompress it into the ML folder. The content including:

  • script/ –> including all the needed script
  • data/ –> including all the needed data
  • train_num –> a file includes how much point you want to train
  • ML_train.bsub –> bsub task script for ML (optional)
  • ML_train.sh –> the main script that bsub will use
  • ML-RIC.sh –> the script to calculate RIC of ML (later part will use it)
  • TD-RIC.sh –> the script to calculate RIC of TD (later part will use it)
cd $root/ML
#### upload ML.zip file to $root/ML/ ####
unzip ML.zip

If you have calculated the cross section of reference, you can execute this command at the folder $root/ML/: (or just use the file provided in ML.zip)

cp ../all_data/cross-section.dat data

At previous step, we calculated x.dat, you can copy it to data/, or just use the file provided in ML.zip.

cp ../descriptor/x.dat data  # if you calculated it

Submit the Machine Learning task

After modifying some parameter in ML_train.sh and train_nums, just run bash ML_train.sh to starting training. (or use bsub < ML_train.bsub if you have modified the bsub script)

After training, in $root/ML/train/$num/spectrum/, you will get 3 cross section output results, 2/cross-section.dat2-4/cross-section.datall/cross-section.dat, which has take first excited state, first 3 excited states, 10 excited states into consideration.

As described in the paper, this script uses E and f calculated using quantum chemical method for the training set and machine learning estimates for the remaining data points. It also set negative f values sometimes predicted by ML with zeros.

Plotting cross sections

The cross setion data is stored in the path $root/ML/train/$num/spectrum/all/cross-section.dat, you can use the command to draw the figure:

train_num=100  # define what training point you want to draw
cd $root/ML/train/$train_num/spectrum/all
awk '{print $1"\t" $3}' cross-section.dat | sed 1d > plot.dat
plot 'plot.dat' u 1:2 w l

Or use other software to plot plot.dat, where first column is energy (eV), second column is cross section (Å^2/moledule).

Step 5: Cross section validation

Obtained cross section can be compared with the any reference absorption spectrum available (e.g. experimental).

In case of benzene, for which very precise cross section calculated with 50k ensemble points available, you can compare the spectrum using the relative integral change (RIC).

Explanation of RIC

Now we will introduce a evaluating criteria RIC which is the value of this green area divided by the whole area between cross section of referene (50k) and x axis.

calculate RIC for different number of ensemble points

We need to calculate cross section spectrum (Newton-X method) for RIC. Generally, we use Newton-X to calculate all the cross section spectrum, but since we have calculated all the gaussian result data, so we needn’t to do that.

We implemented the cross section spectrum generation and RIC calculation in the script TD-RIC.sh.

Just execute bash TD-RIC.sh, and you will get a file named TD-RIC.result, it include the RIC value of ensemble method corresponding to the points defined in train_nums.

calculate RIC for different number of ML points

the procedure of calculating RIC for ML is implemented in ML-RIC.sh, simply execute bash ML-RIC.sh, and you will get a result file named ML-RIC.result corresponding to the points defined in train_nums.