Integrals in PSI4¶
Introduction¶
PSI4 has a number of backends available to compute integrals. In order to accomodate these options, while providing a clean interface to the programmer, an abstraction layer is implemented within Libmints. A recent upgrade to the primary integral engine has seen some important changes to the way this interface layer is used; this document is designed to aid new developers as well as those familiar with the older calling conventions to ensure that the most efficient calling conventions are applied.
The older style¶
A very simple loop that does not use permutational symmetry might look something like this in the old scheme:
auto sieve = std::make_shared<ERISieve>(basisset, cutoff);
auto factory= std::make_shared<IntegralFactory>(basisset);
int deriv_level = 0;
bool use_shell_pairs = true;
auto eri = factory->eri(deriv_level, use_shell_pairs);
const double* buffer = eri_->buffer();
for (int P = 0; P < basisset->nshell(); ++P) {
const auto& Pshell = basisset->shell(P);
for (int Q = 0; Q < basisset->nshell(); ++Q) {
const auto& Qshell = basisset->shell(Q);
for (int R = 0; R < basisset->nshell(); ++R) {
const auto& Rshell = basisset->shell(R);
for (int S = 0; S < basisset->nshell(); ++S) {
const auto& Sshell = basisset->shell(S);
if(sieve->shell_significant(P, Q, R, S) {
eri->compute_shell(P, Q, R, S);
// results are in buffer, do something with them..
}
}
}
}
}
An integral factory is used, which can then produce integral object for various operator types and derivative levels. A sieve is also constructed; this allows a quick determination of whether an integral shell quartet will be significant in magnitude or not, potentially saving a lot of work. This simple scheme is clean and easy to understand, and is still supported in the latest version of PSI4 with only a small change to the sieve syntax and handling of buffer addresses, noted below.
The new syntax¶
The newer integral engines being interfaced to PSI4 may or may not require a group of similar integrals to be computed together in a block using vectorized instructions. To accomodate this possibility, a new syntax has been introduced in Libmints:
auto blocksPQ = ints[0]->get_blocks12();
auto blocksRS = ints[0]->get_blocks34();
auto factory= std::make_shared<IntegralFactory>(basisset);
int deriv_level = 0;
bool use_shell_pairs = true;
bool needs_exchange = true;
auto eri = factory->eri(deriv_level, use_shell_pairs, needs_exchange);
const auto &buffers = eri->buffers();
eri->update_density(D);
bool use_batching = eri->maximum_block_size() > 1;
// loop over all the blocks of (P>=Q|
for (size_t blockPQ_idx = 0; blockPQ_idx < blocksPQ.size(); blockPQ_idx++) {
const auto& blockPQ = blocksPQ[blockPQ_idx];
// loop over all the blocks of |R>=S)
size_t start = eri->first_RS_shell_block(blockPQ_idx);
for (int blockRS_idx = loop_start; blockRS_idx < blocksRS.size(); ++blockRS_idx) {
const auto& blockRS = blocksRS[blockRS_idx];
if (!eri->shell_block_significant(blockPQ_idx, blockRS_idx)) continue;
eri->compute_shell_blocks(blockPQ_idx, blockRS_idx);
const auto* block_start = buffers[0];
// Loop over all of the P,Q,R,S shells within the blocks. We have P>=Q, R>=S and PQ<=RS.
for (const auto& pairPQ : blockPQ) {
const auto &P = pairPQ.first;
const auto &Q = pairPQ.second;
const auto& Pshell = basisset->shell(P);
const auto& Qshell = basisset->shell(Q);
const auto Pam = Pshell.am();
const auto Qam = Qshell.am();
for (const auto& pairRS : blockRS) {
const auto &R = pairRS.first;
const auto &S = pairRS.second;
const auto& Rshell = basisset->shell(R);
const auto& Sshell = basisset->shell(S);
const auto Ram = Rshell.am();
const auto Sam = Sshell.am();
size_t block_size = Psize * Qsize * Rsize * Ssize;
// When there are chunks of shellpairs in RS, we need to make sure
// we filter out redundant combinations.
if (use_batching && Pam == Ram && Qam == Sam && ((P > R) || (P == R && Q > S))) {
block_start += block_size;
continue;
}
const double* int_ptr = block_start;
// Query P,Q,R,S shells for metadata and loop over that quartet
// as usual, getting the integrals from the int_ptr buffer.
block_start += block_size;
}
}
}
}
Although this looks more complex, it’s essentially doing the same thing. There are a number of differences that we’ll highlight now.
Sieving¶
This is one of two breaking changes to the old style syntax. Instead of constructing a sieve object, the integral object should be queried directly using the exact same syntax. Requests for whether a shell is significant or a shell block is significant are both supported. A sieve object is created if matching basis sets are found in either the bra or the ket. For a density fitting integral (PQ|0A) where 0 is the null basis set and A is an auxiliary basis set the (PQ| pair will be used to construct all of the sieving data.
Buffer address¶
The old code copied integrals into a buffer owned by the integral object, whose
address remained constant and could be retrieved by the buffer()
member
function. To avoid unnecessary copies, the new code instead uses the integrals
directly from the underlying integral engine’s memory, which may change with
each call to compute integrals. The integral engine provides a
std::vector<const double*>
containing the pointers to the start of each
“chunk” of integrals. For first derivatives there are 12 such “chunks”, which
are ordered Px,Py,Pz,Qx,Qy,Qz,Rx,Ry,Rz,Sx,Sy,Sz, where the Px refers to the x
derivative with respect to the basis functions in shell P. Note that all
integral derivatives are provided by the new integral code, unlike the previous
version where only 9 of 12 were provided and the user was responsible for using
translation invariance relationships to fill in the rest. The addresses for
each chunk are updated in the vector after each call to compute integrals, so
the user should keep a const reference to that object, and query that for the
address of interest.
Density Screening¶
The old code looked only at the integral to determine whether terms can be
avoided a priori. However, if the integral is to be contracted with a
density or a density-like quantity, the screening can be performed on the
product, which yields more sparsity. To enable this, simply call the integral
object’s update_density
member, passing it a SharedMatrix holding the
current density (remember that it changes during each iteration of the SCF) and
the product will be considered during screening. If only coulomb-like terms
are to be computed, the needs_exchange
argument to the integral object
constructor should be set to false, otherwise it should be true to correcly
account for products of the density and integrals that contribute to
exchange-like terms.
Shell blocking¶
Each underlying integral engine knows whether it will use blocks, and will set up the metadata automatically. Instead of looping over individual shells, the user should loop over blocks supplied by the integral object; these blocks will be just a single shell quartet combination for the case where blocking is not used. It is simple to loop over pairs within each block using C++11 syntax, as demonstrated in the code snippet above. Only shell pairs with significant overlap are included in the shell block information, making this an efficient way to loop over non-negligible terms.
Permutational symmetry¶
The pairs within each block are optimized for efficiency. First, they are screened during the integral object’s creation to ensure that only terms with appreciable overlap are stored. Second, only P,Q combinations that are permutationally unique are stored, ordered with the higher angular momentum first. Therefore care must be taken to ensure that the missing permutations are correctly accounted for when processing the integrals within the loop. See the DirectJK code in libfock for an example of using this scheme for a Fock matrix build.
Using bra-ket symmetry¶
In cases where there is no batching performed, bra-ket symmetry can be
trivially enforced by ensuring that one of the block indices is greater than or
equal to the other. When batching is used, the situation is trickier; some ket
batches may contain a mixture of integrals that are bra-ket unique and those
that are not. To handle this we must do a coarse check at the top of the loop
to see if any integrals in the batch are needed, which is implemented by
asking the integral engine where to start looping in the ket via the call to
eri->first_RS_shell_block(PQpair_idx)
. This is followed by a more fine
grained check within the loops to filter individual integrals in the case where
bra and ket have the same angular momentum and there’s a possibility of a
handful of integrals coming from the ket that are redundant. Note that the bra
is not batched in any of our engines currently: only the ket is. For this
reason, density fitting integrals should be written as (A0|PQ) rather than
(PQ|A0) where possible, because we want the ket to contain more functions than
the bra for efficient blocking.
Instantiating integral objects¶
With sieving being introduced in the new integral objects, the cost of their construction has increased. Although significantly cheaper than computing integrals themselves, construction of integral objects can be non-negligible, especially if many threads are used. For example, this pattern can be found in old versions of the code:
std::vector<std::shared_ptr<TwoBodyAOInt>> ints;
ints.push_back(std::shared_ptr<TwoBodyAOInt>(factory->eri()));
for (int thread = 1; thread < num_threads; thread++) {
ints.push_back(std::shared_ptr<TwoBodyAOInt>(factory->eri()));
}
This builds many objects and the cost can add up. With the new scheme, integral objects are forced to implement a clone() member that can be used as follows:
std::vector<std::shared_ptr<TwoBodyAOInt>> ints;
ints.push_back(std::shared_ptr<TwoBodyAOInt>(factory->eri()));
for (int thread = 1; thread < num_threads; thread++) {
ints.push_back(std::shared_ptr<TwoBodyAOInt>(ints[0]->clone()));
}
This method only incurs the cost of creating a single integral object, and performs much cheaper cloning operations to create the other objects for each thread. Moreover, if integral objects are created only in the initialization of each code that uses them, and stored persistently, the cost of integral object creation is further reduced.