Tutorial for ML-NEA Approach
This tutorial shows how to calculate absorption cross section using machine learning-nuclear ensemble approach (ML-NEA) as reported in:
Bao-Xin Xue, Mario Barbatti, Pavlo O. Dral, Machine Learning for Absorption Cross Sections, J. Phys. Chem. A 2020, 124, 7199–7210. DOI: 10.1021/acs.jpca.0c05310.
Preprint on ChemRxiv, DOI: 10.26434/chemrxiv.12594191.
Table of Contents
Overview
ML-NEA cross section calculations consist of the following steps, each of which is described in the next sections:
- Generation of 50k ensemble points (set of geometries)
- Generation of training data: vertical excitation energies (E) and oscillator strengths (f) for small number (number of training points, Ntr, one can start with 50) of ensemble point and required number of excited states should be calculated with quantum chemical method
- Machine learning calculations of vertical excitation energies and oscillator strengths for the remaining 50k – Ntr ensemble points
- Cross section calculations from vertical excitation energies and oscillator strengths for all 50k ensemble points.
- Cross section validation. If its accuracy is not satisfactory, increase number of training points and repeat steps 2-5.
Before going through this step, you need to install MLatom, Newton-X, and, optionally, Gaussian 16. All examples are in bash shell.
Set some environmental variables:
export NX=[path to Newton-X bin folder]
export PATH=$PATH:[path to MLatom bin folder]
We show how above steps work on an example of benzene. To follow this example, create a folder ML-cross-section
(we call it $root
).
mkdir ML-cross-section
cd ML-cross-section
root=`pwd`
Step 1: Generation of 50k ensemble points
We recommend using Newton-X to generate ensemble points with Wigner sampling. Please refer to the Newton-X tutorial & documentation for details how to use Newton-X.
Now we will take benzene for example to explain how the ML-cross-section is calculated and evaluated. Create a folder named geometry
at $root
and change directory to there.
mkdir geometry ; cd geometry
Geometry optimization and frequency calculations
We will use B3LYP/def2tzvpp to optimize the geometry of benzene.
Optional: Optimize the geometry of benzene and perform frequency calculations using the Gaussian input file (freq.com):
# B3LYP/def2TZVPP opt freq
benzene opt & freq at b3lyp/def2tzvpp
0 1
C 1.20809735 0.69749533 -0.00000000
C 0.00000000 1.39499067 -0.00000000
C -1.20809735 0.69749533 -0.00000000
C -1.20809735 -0.69749533 -0.00000000
C 0.00000000 -1.39499067 -0.00000000
C 1.20809735 -0.69749533 -0.00000000
H 2.16038781 1.24730049 -0.00000000
H 0.00000000 2.49460097 -0.00000000
H -2.16038781 1.24730049 -0.00000000
H -2.16038781 -1.24730049 -0.00000000
H 0.00000000 -2.49460097 -0.00000000
H 2.16038781 -1.24730049 -0.00000000
Save the Gaussian output file to freq.out
.
You also need the file with optimized geometry from the freq.out. You can save it to file opt.xyz
in XYZ format.
12
C 0.00000000 1.39110400 0.00000000
C 1.20473100 0.69555200 0.00000000
C 1.20473100 -0.69555200 0.00000000
C -0.00000000 -1.39110400 0.00000000
C -1.20473100 -0.69555200 -0.00000000
C -1.20473100 0.69555200 0.00000000
H 0.00000000 2.47330200 0.00000000
H 2.14194200 1.23665100 0.00000000
H 2.14194200 -1.23665100 0.00000000
H -0.00000000 -2.47330200 0.00000000
H -2.14194200 -1.23665100 -0.00000000
H -2.14194200 1.23665100 0.00000000
You can download these two files here: freq.out
and opt.xyz
.
Then in the geometry
folder, run command:
$NX/xyz2nx < opt.xyz
and you will get a file named geom
.
Sampling 50k points
Now, we need to create the input file of initcond.pl
. Because we only want the configuration of 50k ensemble points, we only set nsf=2
, and chk_e=0
to make Newton-X not to perform gaussian calculation.
Download initqp_input
, and execute the following command to generate 50k ensemble points.
nohup $NX/initcond.pl > initcond.log 2>&1 &
It will take a long time to finish (because of the low efficiency of perl), so you can directly download the output file below to save time.
Then execute the following command:
unzip final_output.zip
atom_num=12 # you can change the atom number here for your own molecule
sed -n /Geometry/,/Velocity\ in\ NX/p final_output | sed /Velocity/d | awk '{printf ("%s\t", $1)} {printf ("%.8f\t", $3*0.5291772083)} {printf ("%.8f\t", $4*0.5291772083)} {printf ("%.8f\n", $5*0.5291772083)}' | sed -r s/Geometry.+/$atom_num\\n/g > xyz.dat
Then you will get a file named xyz.dat
including 50001 ensemble points (the first point is the equilibrium geometry).
Step 2: Generation of training data
Please create an folder named gaussian-calc
at $root
. We will use CAM-B3LYP/ma-TZVP for TD calculation.
cd $root
mkdir gaussian-calc
cd gaussian-calc
Needed files
You can download all the needed files below:
Here are some description to files:
- itrain.dat –> the indices of training in xyz.dat
- run_multi_gaussian.py –> script that can run many gaussian calculations
- template.com -> gaussian input file template, you can modify it if you like.
- xyz.dat –> including geometric configuration of 50k ensemble point. please copy it from
../geometry/
(if you want to calculate your own molecule), or use the file in the zip archive. - multi-gaussian.bsub (optional)
- the batch script of BSUB, it’s optional, you must modify it before using it.
- 42th line defined the ma-TZVP basis, you should modify it at the correct path, you can download this file through this link.
Decomprese the folder at gaussian-calc/
folder.
unzip gaussian-calc.zip
Quantum chemical calculations
The itrain.dat
has been included in the archive already, it includes the indices of geometry in xyz.dat
.
In default, it include 50001 points, because we need to create a 50k points reference for evaluation, but it will take too much time to calculate all the gaussian task, so you can use the command below to create a new itrain.dat
file.
The number of itrain_num should be enough for a good cross section spectrum result, normally it should be bigger than 200. But you can start with 50 for a quick tutorial.
itrain_num=50 # modify the number here
: > itrain.dat
for i in `seq 1 $itrain_num`; do
echo $i >> itrain.dat
done
Certainly you can also calculate 50k by yourself, but we will provide the calculated 50k points cross section data at later.
If you choose to calculate too much points, it’s highly recommended to split task (for advanced linux operators only), the procedure is implemented in this script.
You can directly run python3 run_multi_gaussian.py
, calculation results will appear in the $root/gaussian-calc/log
folder. (for BSUB circumstance, result will appear in $root/gaussian-calc/output/log/
).
After calculation, you can use the command below to copy all the log file into ../all_log
.
cd $root/gaussian-calc
mkdir ../all_log
find . -type f -regex '.+?/[0-9]+?.log' | xargs -I log_file cp log_file ../all_log/
For 500 points, it may take 18 hours at 18 cores server(2.6 GHz).
Post-processing data
Now we have all the results of 50k TD ensemble points, we need to get all the data of excited energy (E) and oscillator strength (f). You can use this script at $root
to excerpt all the needed data.
cd $root
num=10 # you can modify it to decide how many excited states you want to excerpt
cat ./gaussian-calc/itrain.dat | while read line; do
for i in `seq 1 $num`; do
grep f=0 './all_log/'$line'.log' | grep ' '$i':' | awk '{print $5}' | sed s/f=//g >> 'E'$i
grep f=0 './all_log/'$line'.log' | grep ' '$i':' | awk '{print $9}' | sed s/f=//g >> 'f'$i
done
done
for i in `seq 1 $num`; do
paste ./gaussian-calc/itrain.dat E$i f$i > state$i'.index.E.f'
rm E$i f$i
done
mkdir all_data
mv *.index.E.f all_data
It will generate the files named state$i.index.E.f
in the folder $root/all_data/
. (the calculated data will be provided at the later part)
Optional: QC cross section
Now we will generate the cross section reference data (50k) to evaluate how better is the ML method performs and how fast the deviation converged. (If you calculated 50k data at the previous step, you can continue, or just download the calculated result at the end part of this section.)
Please enter the folder all_data
, download this script to this folder, and then execute the following command.
cd $root/all_data
### save the script here ###
cp ../gaussian-calc/itrain.dat .
num=`wc itrain.dat -l | awk '{print $1}'`
python3 cross-section-TD.py $num . 10 # third parameter represents the state number defined at the previous step
It will generate a file named cross-section.dat
, which can be directly download through this link.
Steps 3-4: Machine learning calculations
Please create a folder named ML
at $root
, and then enter it.
cd $root; mkdir ML; cd ML
Use descriptor to generate the input file (x.dat)
Now, we will generate the input file for MLaotm, please create a folder named descriptor
at $root
, please run the command below:
cd $root; mkdir descriptor; cd descriptor
cp ../geometry/xyz.dat .
MLatom.py XYZ2X XYZfile=xyz.dat XfileOut=x.dat molDescriptor=RE > xyz2x.out
The outputted x.dat
is the input file in MLaotm. (this file can be downloaded at the next part)
Files description
In Machine Learning procedure, we will use scripts and many data to make all the procedure automatically. You can download the zip file below and uncompress it into the ML
folder. The content including:
- script/ –> including all the needed script
- data/ –> including all the needed data
- train_num –> a file includes how much point you want to train
- ML_train.bsub –> bsub task script for ML (optional)
- ML_train.sh –> the main script that bsub will use
- ML-RIC.sh –> the script to calculate RIC of ML (later part will use it)
- TD-RIC.sh –> the script to calculate RIC of TD (later part will use it)
cd $root/ML
#### upload ML.zip file to $root/ML/ ####
unzip ML.zip
If you have calculated the cross section of reference, you can execute this command at the folder $root/ML/
: (or just use the file provided in ML.zip
)
cp ../all_data/cross-section.dat data
At previous step, we calculated x.dat
, you can copy it to data/
, or just use the file provided in ML.zip
.
cp ../descriptor/x.dat data # if you calculated it
Submit the Machine Learning task
After modifying some parameter in ML_train.sh
and train_nums
, just run bash ML_train.sh
to starting training. (or use bsub < ML_train.bsub
if you have modified the bsub script)
After training, in $root/ML/train/$num/spectrum/
, you will get 3 cross section output results, 2/cross-section.dat
, 2-4/cross-section.dat
, all/cross-section.dat
, which has take first excited state, first 3 excited states, 10 excited states into consideration.
As described in the paper, this script uses E and f calculated using quantum chemical method for the training set and machine learning estimates for the remaining data points. It also set negative f values sometimes predicted by ML with zeros.
Plotting cross sections
The cross setion data is stored in the path $root/ML/train/$num/spectrum/all/cross-section.dat
, you can use the command to draw the figure:
train_num=100 # define what training point you want to draw
cd $root/ML/train/$train_num/spectrum/all
awk '{print $1"\t" $3}' cross-section.dat | sed 1d > plot.dat
gnuplot
plot 'plot.dat' u 1:2 w l
Or use other software to plot plot.dat, where first column is energy (eV), second column is cross section (Å^2/moledule).
Step 5: Cross section validation
Obtained cross section can be compared with the any reference absorption spectrum available (e.g. experimental).
In case of benzene, for which very precise cross section calculated with 50k ensemble points available, you can compare the spectrum using the relative integral change (RIC).
Explanation of RIC
Now we will introduce a evaluating criteria RIC which is the value of this green area divided by the whole area between cross section of referene (50k) and x axis.
calculate RIC for different number of ensemble points
We need to calculate cross section spectrum (Newton-X method) for RIC. Generally, we use Newton-X to calculate all the cross section spectrum, but since we have calculated all the gaussian result data, so we needn’t to do that.
We implemented the cross section spectrum generation and RIC calculation in the script TD-RIC.sh
.
Just execute bash TD-RIC.sh
, and you will get a file named TD-RIC.result
, it include the RIC value of ensemble method corresponding to the points defined in train_nums
.
calculate RIC for different number of ML points
the procedure of calculating RIC for ML is implemented in ML-RIC.sh
, simply execute bash ML-RIC.sh
, and you will get a result file named ML-RIC.result
corresponding to the points defined in train_nums
.