Tutorial for ML-NEA Approach

This tutorial shows how to perform absorption cross section calculations with ML-NEA as described in this paper (please cite it if you use this approach):

Bao-Xin Xue, Mario Barbatti*, Pavlo O. Dral*, Machine Learning for Absorption Cross SectionsJ. Phys. Chem. A 2020124, 7199–7210. DOI: 10.1021/acs.jpca.0c05310.
Preprint on ChemRxiv, DOI: 10.26434/chemrxiv.12594191.

original tutorial (describing all implementation steps) >>>

Introduction

Nuclear Ensemble Approach (NEA) is more accurate approach of calculating absorption spectra than often used single points convolution that only broadens oscillator strengths for equilibrium geometry. Here it is described by sum over Gaussian broadening functions.

The figure above shows different excited energies and oscillator strengths at different conformations (different color means different conformation). So we need to generate many (the more the better, typically hundreds or thousands are required) of conformations of one molecule, and then perform quantum chemical (QC) calculation for each of them.

The figure above show the basic scheme of our method, to replace most of QC calculations with ML predictions. So we only need to calculate very small part of them (blue sticks), and then train a model with this data set and make prediction for the remaining conformations (orange sticks).

Example 1: Use Pre-calculated Data

Now we will use the pre-calculated QC database to perform the ML-NEA calculation. It contains optimized equilibrium geometry and 50k conformations obtained with Newton-X via sampling from Wigner distribution. For each conformation, it has pre-calculated TDDFT vertical excitation energies and oscillator strengths for 10 states.

To calculate ML-NEA cross section, we will use the vertical excitation energies and oscillator strengths for 10 states at 500 different conformation from this database.

You can download the required files to follow this tutorial from

it includes these files:

  • f{1..10}.dat & E{1..10}.dat the pre-calculated QC dataset
  • cross-section_ref.dat: the reference cross section obtained from 50k calculations with reference QC method
  • inp: MLatom input file
  • nea_geoms.xyz: 50k geometries used to make a prediction
  • eq.xyz: the optimized geometry

now let’s see the inp file content:

cross-section     # the task type
Nexcitations=10   # the number of excited states
nQMpoints=500     # number of QC calculations
plotQCNEA         # plot QC-NEA figure
plotQCSPC         # plot cross section from single point convolution from QC calculations on equilibrium geometry
deltaQCNEA=0.02   # the broadening parameter of QC-NEA

just use command MLatom.py inp &> log, and program will use 500 point to train a model, and then use this model to make predictions for 50k conformations.

The result will look like this (see cross-section/plot.png):

Example 2: Calculate from Scratch

Now, we will calculate cross-section from scratch using iterative procedure, which means we will use our rRMSE criteria (see paper) to check whether it has satisfied our convergence criterion (change of geometrical mean of validation RMSEs less than 10% after adding 50 more points to the training set), and if not, continue performing QC calculations.

You can download the required files to follow this tutorial from

it included these file:

  • gaussian_ef.com: Gaussian TD calculation input file (doesn’t matters with the geometry, the geometry is only a placeholder)
  • gaussian_optfreq.com: Guassian opt and freq input file
  • cross-section_ref.dat: the reference cross-section data (calculated with Newton-X)
  • inp: MLatom input file
cross-section     # the task type
Nexcitations=3    # the number of excited states
plotQCNEA         # plot QC-NEA figure
plotQCSPC         # plot cross section from single point convolution from QC calculations on equilibrium geometry
deltaQCNEA=0.05   # the broadening parameter of QC-NEA

Just use command MLatom.py inp &> log, and program will invoke Newton-X to generate 50k conformations, and then perform TDDFT calculations for additional 50 QC point at each step.

[Note that this step will take long time, dozen(s) of hours depending on how fast your machine is]

When it meets criterion (rRMSE<=0.1), it will stop QC calculations. And then will train and use ML model to make predictions for 50k points.

The result will look like:

Pay Attention

Sometimes, ML cross section will be of bad quality even after meeting convergence criterion (as described in paper). For example, if it converged after performing 150 QC calculations, the user may request calculating additional 50 points by adding this line with user-defined number of total QC points (150+50).

nQMpoints=200


As you can see, spectrum improves a lot.
If you add another 50 points (nQMpoints=250) you can get even better result: