10,000x Faster: The Ultimate High-Performance Normal Integration Solution, with Code
In derivatives pricing and large-scale risk backtesting, where every millisecond matters, traditional normal-distribution integration often becomes both ugly code and a performance bottleneck. This article introduces a fully vectorized numerical algorithm that lets you exploit the parallelism of modern CPUs, GPUs, and even TPUs to break through that bottleneck.
This article targets performance-sensitive scenarios, not those that demand ultra-high precision.
Introduction
In the engineering practice of quantitative finance, the cumulative distribution function of the standard normal distribution is a core component in derivatives pricing, risk measurement, and the numerical solution of many stochastic differential equations. In the Python ecosystem, most research-stage code directly calls scipy.stats.norm.cdf or scipy.stats.multivariate_normal.
However, this “standard approach” suffers from serious performance limitations:
- Call overhead and GIL constraints: although scipy is implemented in C/Fortran underneath, the overhead of the Python wrapper on every call is not negligible. In scenarios that are not pure matrix operations, repeated Python function calls inside loops become a fatal drag.
- Lack of SIMD vectorization support: standard library functions are often difficult for the compiler to optimize automatically into AVX-512 or Neon instructions. When we need concurrent computation over large tensors, a black-box function blocks pipeline optimization.
- Poor framework portability: modern high-performance trading systems and backtesting frameworks often rely on C++, Rust, or CUDA, or use JIT compilation through Numba/JAX inside Python. Pulling in a heavy scipy dependency makes it unusable in some JIT environments, such as Numba's nopython mode.
Therefore, we need an approximation algorithm built purely from elementary operations: addition, subtraction, multiplication, division, powers, and exponentials. Such methods do not depend on complex external libraries, are naturally suitable for SIMD vectorization, and are easy to port to GPU kernels or FPGA logic, while still providing enough accuracy for practical use.
Univariate Normal Distribution
For the univariate standard normal distribution \(N(0,1)\), the classical numerical approach is to approximate the error function with a polynomial fit, thereby avoiding expensive integration and keeping only one exponential evaluation, \(e^{-x^2/2}\), plus several multiply-add operations.
Because the normal distribution is symmetric around the \(y\)-axis, \(\Phi(-x) = 1 - \Phi(x)\), we only need to compute the case where \(x \ge 0\).
For \(x \ge 0\), the approximate cumulative distribution function is:
$$ \Phi(x) \approx 1 - Z(x) \left( b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5 \right) $$
where:
\(Z(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\) is the probability density function (PDF) of the standard normal distribution.
\(t = \frac{1}{1 + p x}\) is an auxiliary variable.
To reach the accuracy required by financial computation, with maximum absolute error \(|\epsilon(x)| < 7.5 \times 10^{-8}\), we use the following fitted parameters:
\(p = 0.2316419\)
\(b_1 = 0.319381530\)
\(b_2 = -0.356563782\)
\(b_3 = 1.781477937\)
\(b_4 = -1.821255978\)
\(b_5 = 1.330274429\)
The entire procedure contains only one exponential, one reciprocal, and five multiply-accumulate operations, so the computational complexity is extremely low. An error of \(10^{-7}\) is already sufficient for most option pricing tasks, including finite-difference calculations of Greeks. At the low level, this algorithm can be optimized easily into FMA (Fused Multiply-Add) instructions by the compiler, making it a first-choice approach for handwritten NormCDF implementations in C++ or Numba.
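To make the recipe concrete, here is a scalar Python sketch of the approximation exactly as written above: one exponential, one reciprocal, and a Horner-form polynomial (the function name `norm_cdf_scalar` is illustrative):

```python
import math

def norm_cdf_scalar(x: float) -> float:
    """Polynomial approximation of Phi(x); max absolute error below 7.5e-8."""
    p = 0.2316419
    b = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)
    ax = abs(x)
    t = 1.0 / (1.0 + p * ax)                              # auxiliary variable
    z = math.exp(-0.5 * ax * ax) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    # Horner form: five multiply-adds, FMA-friendly for a compiler
    poly = t * (b[0] + t * (b[1] + t * (b[2] + t * (b[3] + t * b[4]))))
    cdf = 1.0 - z * poly
    return cdf if x >= 0 else 1.0 - cdf                   # symmetry: Phi(-x) = 1 - Phi(x)
```

The same structure carries over verbatim to C++, Numba, or a vectorized array implementation, since it uses nothing beyond `abs`, `exp`, and arithmetic.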
Bivariate Normal Distribution
For the standard bivariate normal distribution function \(\Phi_2(x, y; \rho)\), the formula is:
$$ \Phi_2(x, y; \rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_{-\infty}^{x} \int_{-\infty}^{y} \exp\left( -\frac{u^2 - 2\rho u v + v^2}{2(1-\rho^2)} \right) du \, dv $$
Here, \(\rho\) is the correlation coefficient, satisfying \(-1 < \rho < 1\).
To deal with the difficulty of evaluating the double integral \(\iint\), the most elegant route is to use the Plackett identity (1954). This is a remarkably deep result: the partial derivative of the bivariate normal distribution with respect to the correlation coefficient \(\rho\) is exactly equal to its probability density function.
$$\frac{\partial \Phi_2(x,y;\rho)}{\partial \rho} = \phi_2(x,y;\rho)$$
By the fundamental theorem of calculus, we can decompose \(\Phi_2\) into a “starting point” plus an integral over the accumulated change.
Starting point (\(\rho=0\)): when the assets are uncorrelated, the bivariate distribution reduces to the product of two univariate distributions, namely \(\Phi(x)\Phi(y)\).
Path (\(\rho\)): the accumulated effect from 0 to \(\rho\).
This gives the core Drezner-Wesolowsky integral formula:
$$\Phi_2(x,y;\rho) = \Phi(x)\Phi(y) + \frac{1}{2\pi} \int_0^{\rho} \frac{1}{\sqrt{1-t^2}} \exp\left( -\frac{x^2 - 2txy + y^2}{2(1-t^2)} \right) dt$$
Dimensionality reduction: the original two-dimensional integral over \(dxdy\) is transformed into a one-dimensional integral over the correlation parameter \(t\).
Smoothness: the integrand \(g(t; x, y)\) is a highly smooth analytic function on the interval \([0, \rho]\). That means very high accuracy can be achieved without many sample points.
Role reversal of variables: inside the integral, \(x\) and \(y\) become fixed parameters and no longer participate as integration variables.
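As a sanity check on the integral formula (an illustration added here, not part of the original derivation): at \(x = y = 0\) the exponential vanishes, the integrand reduces to \(1/\sqrt{1-t^2}\), and the integral is \(\arcsin\rho\), giving the known closed form \(\Phi_2(0,0;\rho) = \tfrac{1}{4} + \arcsin(\rho)/(2\pi)\). A brute-force trapezoid check with NumPy:

```python
import numpy as np

def phi2_at_origin(rho: float, n: int = 100001) -> float:
    """Phi2(0, 0; rho) via the one-dimensional Drezner-Wesolowsky integral."""
    t = np.linspace(0.0, rho, n)
    g = 1.0 / np.sqrt(1.0 - t * t)                 # integrand at x = y = 0
    integral = np.sum((g[:-1] + g[1:]) * np.diff(t)) / 2.0  # trapezoid rule
    return 0.25 + integral / (2.0 * np.pi)         # Phi(0) * Phi(0) = 1/4
```

For \(\rho = 0.5\) this yields \(1/4 + (\pi/6)/(2\pi) = 1/3\), matching the closed form.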
Although the formula is simpler, there is still a major engineering issue: the upper limit of integration is \(\rho\). In quantitative applications, we may have \(N\) options, each with a different \(\rho_i\). That means each sample has a different integration path length. To make the method vectorizable, we must map all these uneven intervals \([0, \rho_i]\) onto the standard interval \([-1, 1]\).
Let the relationship between the integration variable \(t\) and the standard variable \(u\) be:
$$t(u) = \frac{\rho}{2}(u+1), \quad u \in [-1, 1]$$
Then the differential becomes:
$$dt = \frac{\rho}{2} du$$
Substituting this change of variable, the integral term becomes:
$$\int_0^\rho g(t) dt = \frac{\rho}{2} \int_{-1}^1 g\left( \frac{\rho}{2}(u+1) \right) du$$
According to Gauss-Legendre theory, an integral over \([-1, 1]\) can be approximated by a weighted sum of \(K\) fixed nodes \(u_j\) and weights \(w_j\):
$$\text{Integral} \approx \frac{\rho}{2} \sum_{j=1}^{K} w_j \cdot g\left( \frac{\rho}{2}(u_j+1); x, y \right)$$
Whether \(\rho\) is 0.1 or 0.9, we only ever need to evaluate those same fixed \(K\) points, for example \(K=10\). The nodes \(u_j\) and weights \(w_j\) are mathematical constants and can be precomputed.
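The precomputation step above can be sketched with NumPy's `leggauss` (the helper name `gl_integral` is illustrative):

```python
import math
import numpy as np

K = 10
u, w = np.polynomial.legendre.leggauss(K)  # fixed nodes and weights on [-1, 1]

def gl_integral(g, rho):
    """Approximate the integral of g over [0, rho] via the affine map t = rho*(u+1)/2."""
    t = 0.5 * rho * (u + 1.0)
    return 0.5 * rho * np.sum(w * g(t))

# Example: integral of 1/sqrt(1 - t^2) from 0 to 0.5 equals arcsin(0.5) = pi/6
```

Because the integrand is analytic on the interval, the \(K=10\) rule is already accurate to near machine precision, which is why a fixed, batch-friendly node count suffices for every \(\rho\).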
Code and Tests
Below is a ready-to-use JAX implementation that runs efficiently on CPU, GPU, and TPU.
```python
import jax
```
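A self-contained sketch consistent with the formulas above (my reconstruction, not the author's exact listing; `norm_cdf` and `binorm_cdf` are illustrative names, and float64 is enabled so the stated \(\sim 10^{-7}\) accuracy holds):

```python
import jax
import jax.numpy as jnp
import numpy as np

jax.config.update("jax_enable_x64", True)  # float64, so the ~1e-7 accuracy claim holds

INV_SQRT_2PI = 0.3989422804014327  # 1 / sqrt(2*pi)

@jax.jit
def norm_cdf(x):
    """Polynomial approximation of Phi(x); max absolute error below 7.5e-8."""
    p = 0.2316419
    b1, b2, b3, b4, b5 = 0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429
    ax = jnp.abs(x)
    t = 1.0 / (1.0 + p * ax)
    z = INV_SQRT_2PI * jnp.exp(-0.5 * ax * ax)
    poly = t * (b1 + t * (b2 + t * (b3 + t * (b4 + t * b5))))  # Horner, FMA-friendly
    cdf = 1.0 - z * poly
    return jnp.where(x >= 0, cdf, 1.0 - cdf)  # symmetry: Phi(-x) = 1 - Phi(x)

# Fixed K=10 Gauss-Legendre nodes/weights on [-1, 1], precomputed once
_u, _w = np.polynomial.legendre.leggauss(10)
_U, _W = jnp.asarray(_u), jnp.asarray(_w)

@jax.jit
def binorm_cdf(x, y, rho):
    """Drezner-Wesolowsky: Phi2 = Phi(x)Phi(y) + (1/2pi) * int_0^rho g(t) dt."""
    x, y, rho = jnp.asarray(x), jnp.asarray(y), jnp.asarray(rho)
    t = 0.5 * rho[..., None] * (_U + 1.0)  # map nodes from [-1, 1] to [0, rho]
    s = 1.0 - t * t
    g = jnp.exp(-(x[..., None] ** 2 - 2.0 * t * x[..., None] * y[..., None]
                  + y[..., None] ** 2) / (2.0 * s)) / jnp.sqrt(s)
    integral = 0.5 * rho * jnp.sum(_W * g, axis=-1)
    return norm_cdf(x) * norm_cdf(y) + integral / (2.0 * jnp.pi)
```

Both functions accept scalars or batched arrays (with `x`, `y`, `rho` broadcast elementwise), so a portfolio of \(N\) options with \(N\) different correlations is a single fused kernel call after JIT compilation.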
GPU Benchmark (Tesla T4, CPU: Xeon 7B13)
| Samples (N) | Task | Method | Time (s) | Speedup |
|---|---|---|---|---|
| 10,000 | Univariate | JAX | 0.15163 | 0.0x |
| 10,000 | Bivariate | JAX | 0.45608 | 12.1x |
| 100,000 | Univariate | JAX | 0.26846 | 0.0x |
| 100,000 | Bivariate | JAX | 0.46016 | 153.1x (Est.) |
| 1,000,000 | Univariate | JAX | 0.15443 | 0.4x |
| 1,000,000 | Bivariate | JAX | 0.27804 | 1139.2x (Est.) |
| 10,000,000 | Univariate | JAX | 0.08185 | 5.3x |
| 10,000,000 | Bivariate | JAX | 0.18332 | 12730.8x (Est.) |
| 100,000,000 | Univariate | JAX | 0.12218 | 36.6x |
| 100,000,000 | Bivariate | JAX | 0.22598 | 101728.1x (Est.) |
- For bivariate normal integration, as the sample size grows, GPU parallelism is increasingly saturated and the speedup expands dramatically, reaching an estimated \(10^5\times\)-level acceleration at the \(10^8\) scale.
- For the univariate CDF, kernel launch and JIT overhead dominate small batches, while GPU throughput advantages begin to appear in very large batches.