10,000x Faster: The Ultimate High-Performance Normal Integration Solution, with Code
In derivatives pricing and large-scale risk backtesting, where every millisecond matters, traditional normal-distribution integration often becomes both ugly code and a performance bottleneck. This article introduces a fully vectorized numerical algorithm that lets you exploit the parallelism of modern CPUs, GPUs, and even TPUs to break through that bottleneck.
This article targets performance-sensitive scenarios, not those that demand ultra-high precision.
Introduction
In the engineering practice of quantitative finance, the cumulative distribution function of the standard normal distribution is a core component in derivatives pricing, risk measurement, and the numerical solution of many stochastic differential equations. In the Python ecosystem, most research-stage code directly calls scipy.stats.norm.cdf or scipy.stats.multivariate_normal.
However, this “standard approach” suffers from serious performance limitations:
- Call overhead and GIL constraints: although scipy is implemented in C/Fortran underneath, the overhead of the Python wrapper on every call is not negligible. In scenarios that are not pure matrix operations, repeated Python function calls inside loops become a fatal drag.
- Lack of SIMD vectorization support: standard library functions are often difficult for the compiler to optimize automatically into AVX-512 or Neon instructions. When we need concurrent computation over large tensors, a black-box function blocks pipeline optimization.
- Poor framework portability: modern high-performance trading systems and backtesting frameworks often rely on C++, Rust, or CUDA, or use JIT compilation through Numba/JAX inside Python. Pulling in a heavy scipy dependency makes it unusable in some JIT environments, such as Numba's nopython mode.
Therefore, we need an approximation algorithm built purely from elementary operations: addition, subtraction, multiplication, division, powers, and exponentials. Such methods do not depend on complex external libraries, are naturally suitable for SIMD vectorization, and are easy to port to GPU kernels or FPGA logic, while still providing enough accuracy for practical use.
Univariate Normal Distribution
For the univariate standard normal distribution \(N(0,1)\), the classical numerical approach is to approximate the error function with a polynomial fit, thereby avoiding expensive integration and keeping only one exponential evaluation, \(e^{-x^2/2}\), plus several multiply-add operations.
Because the normal distribution is symmetric around the \(y\)-axis, \(\Phi(-x) = 1 - \Phi(x)\), we only need to compute the case where \(x \ge 0\).
For \(x \ge 0\), the approximate cumulative distribution function is:
$$ \Phi(x) \approx 1 - Z(x) \left( b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5 \right) $$
where:
\(Z(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\) is the probability density function (PDF) of the standard normal distribution.
\(t = \frac{1}{1 + p x}\) is an auxiliary variable.
To reach the accuracy required by financial computation, with maximum absolute error \(|\epsilon(x)| < 7.5 \times 10^{-8}\), we use the following fitted parameters:
\(p = 0.2316419\)
\(b_1 = 0.319381530\)
\(b_2 = -0.356563782\)
\(b_3 = 1.781477937\)
\(b_4 = -1.821255978\)
\(b_5 = 1.330274429\)
The entire procedure contains only one exponential, one reciprocal, and five multiply-accumulate operations, so the computational complexity is extremely low. An error of \(10^{-7}\) is already sufficient for most option pricing tasks, including finite-difference calculations of Greeks. At the low level, this algorithm can be optimized easily into FMA (Fused Multiply-Add) instructions by the compiler, making it a first-choice approach for handwritten NormCDF implementations in C++ or Numba.
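To make the recipe concrete, here is a scalar Python sketch of the approximation exactly as written above: one exponential, one reciprocal, and a Horner-form polynomial (the function name `norm_cdf_scalar` is illustrative):

```python
import math

def norm_cdf_scalar(x: float) -> float:
    """Polynomial approximation of Phi(x); max absolute error below 7.5e-8."""
    p = 0.2316419
    b = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)
    ax = abs(x)
    t = 1.0 / (1.0 + p * ax)                              # auxiliary variable
    z = math.exp(-0.5 * ax * ax) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    # Horner form: five multiply-adds, FMA-friendly for a compiler
    poly = t * (b[0] + t * (b[1] + t * (b[2] + t * (b[3] + t * b[4]))))
    cdf = 1.0 - z * poly
    return cdf if x >= 0 else 1.0 - cdf                   # symmetry: Phi(-x) = 1 - Phi(x)
```

The same structure carries over verbatim to C++, Numba, or a vectorized array implementation, since it uses nothing beyond `abs`, `exp`, and arithmetic.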
Bivariate Normal Distribution
For the standard bivariate normal distribution function \(\Phi_2(x, y; \rho)\), the formula is:
$$ \Phi_2(x, y; \rho) = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_{-\infty}^{x} \int_{-\infty}^{y} \exp\left( -\frac{u^2 - 2\rho u v + v^2}{2(1-\rho^2)} \right) du \, dv $$
Here, \(\rho\) is the correlation coefficient, satisfying \(-1 < \rho < 1\).
To deal with the difficulty of evaluating the double integral \(\iint\), the most elegant route is to use the Plackett identity (1954). This is a remarkably deep result: the partial derivative of the bivariate normal distribution with respect to the correlation coefficient \(\rho\) is exactly equal to its probability density function.
$$\frac{\partial \Phi_2(x,y;\rho)}{\partial \rho} = \phi_2(x,y;\rho)$$
By the fundamental theorem of calculus, we can decompose \(\Phi_2\) into a “starting point” plus an integral over the accumulated change.
Starting point (\(\rho=0\)): when the assets are uncorrelated, the bivariate distribution reduces to the product of two univariate distributions, namely \(\Phi(x)\Phi(y)\).
Path (\(\rho\)): the accumulated effect from 0 to \(\rho\).
This gives the core Drezner-Wesolowsky integral formula:
$$\Phi_2(x,y;\rho) = \Phi(x)\Phi(y) + \frac{1}{2\pi} \int_0^{\rho} \frac{1}{\sqrt{1-t^2}} \exp\left( -\frac{x^2 - 2txy + y^2}{2(1-t^2)} \right) dt$$
Dimensionality reduction: the original two-dimensional integral over \(dxdy\) is transformed into a one-dimensional integral over the correlation parameter \(t\).
Smoothness: the integrand \(g(t; x, y)\) is a highly smooth analytic function on the interval \([0, \rho]\). That means very high accuracy can be achieved without many sample points.
Role reversal of variables: inside the integral, \(x\) and \(y\) become fixed parameters and no longer participate as integration variables.
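As a sanity check on the integral formula (an illustration added here, not part of the original derivation): at \(x = y = 0\) the exponential vanishes, the integrand reduces to \(1/\sqrt{1-t^2}\), and the integral is \(\arcsin\rho\), giving the known closed form \(\Phi_2(0,0;\rho) = \tfrac{1}{4} + \arcsin(\rho)/(2\pi)\). A brute-force trapezoid check with NumPy:

```python
import numpy as np

def phi2_at_origin(rho: float, n: int = 100001) -> float:
    """Phi2(0, 0; rho) via the one-dimensional Drezner-Wesolowsky integral."""
    t = np.linspace(0.0, rho, n)
    g = 1.0 / np.sqrt(1.0 - t * t)                 # integrand at x = y = 0
    integral = np.sum((g[:-1] + g[1:]) * np.diff(t)) / 2.0  # trapezoid rule
    return 0.25 + integral / (2.0 * np.pi)         # Phi(0) * Phi(0) = 1/4
```

For \(\rho = 0.5\) this yields \(1/4 + (\pi/6)/(2\pi) = 1/3\), matching the closed form.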
Although the formula is simpler, there is still a major engineering issue: the upper limit of integration is \(\rho\). In quantitative applications, we may have \(N\) options, each with a different \(\rho_i\). That means each sample has a different integration path length. To make the method vectorizable, we must map all these uneven intervals \([0, \rho_i]\) onto the standard interval \([-1, 1]\).
Let the relationship between the integration variable \(t\) and the standard variable \(u\) be:
$$t(u) = \frac{\rho}{2}(u+1), \quad u \in [-1, 1]$$
Then the differential becomes:
$$dt = \frac{\rho}{2} du$$
Substituting this change of variable, the integral term becomes:
$$\int_0^\rho g(t) dt = \frac{\rho}{2} \int_{-1}^1 g\left( \frac{\rho}{2}(u+1) \right) du$$
According to Gauss-Legendre theory, an integral over \([-1, 1]\) can be approximated by a weighted sum of \(K\) fixed nodes \(u_j\) and weights \(w_j\):
$$\text{Integral} \approx \frac{\rho}{2} \sum_{j=1}^{K} w_j \cdot g\left( \frac{\rho}{2}(u_j+1); x, y \right)$$
Whether \(\rho\) is 0.1 or 0.9, we only ever need to evaluate those same fixed \(K\) points, for example \(K=10\). The nodes \(u_j\) and weights \(w_j\) are mathematical constants and can be precomputed.
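The precomputation step above can be sketched with NumPy's `leggauss` (the helper name `gl_integral` is illustrative):

```python
import math
import numpy as np

K = 10
u, w = np.polynomial.legendre.leggauss(K)  # fixed nodes and weights on [-1, 1]

def gl_integral(g, rho):
    """Approximate the integral of g over [0, rho] via the affine map t = rho*(u+1)/2."""
    t = 0.5 * rho * (u + 1.0)
    return 0.5 * rho * np.sum(w * g(t))

# Example: integral of 1/sqrt(1 - t^2) from 0 to 0.5 equals arcsin(0.5) = pi/6
```

Because the integrand is analytic on the interval, the \(K=10\) rule is already accurate to near machine precision, which is why a fixed, batch-friendly node count suffices for every \(\rho\).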
Code and Tests
Below is a ready-to-use JAX implementation that runs efficiently on CPU, GPU, and TPU.
```python
import jax
```
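A self-contained sketch consistent with the formulas above (my reconstruction, not the author's exact listing; `norm_cdf` and `binorm_cdf` are illustrative names, and float64 is enabled so the stated \(\sim 10^{-7}\) accuracy holds):

```python
import jax
import jax.numpy as jnp
import numpy as np

jax.config.update("jax_enable_x64", True)  # float64, so the ~1e-7 accuracy claim holds

INV_SQRT_2PI = 0.3989422804014327  # 1 / sqrt(2*pi)

@jax.jit
def norm_cdf(x):
    """Polynomial approximation of Phi(x); max absolute error below 7.5e-8."""
    p = 0.2316419
    b1, b2, b3, b4, b5 = 0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429
    ax = jnp.abs(x)
    t = 1.0 / (1.0 + p * ax)
    z = INV_SQRT_2PI * jnp.exp(-0.5 * ax * ax)
    poly = t * (b1 + t * (b2 + t * (b3 + t * (b4 + t * b5))))  # Horner, FMA-friendly
    cdf = 1.0 - z * poly
    return jnp.where(x >= 0, cdf, 1.0 - cdf)  # symmetry: Phi(-x) = 1 - Phi(x)

# Fixed K=10 Gauss-Legendre nodes/weights on [-1, 1], precomputed once
_u, _w = np.polynomial.legendre.leggauss(10)
_U, _W = jnp.asarray(_u), jnp.asarray(_w)

@jax.jit
def binorm_cdf(x, y, rho):
    """Drezner-Wesolowsky: Phi2 = Phi(x)Phi(y) + (1/2pi) * int_0^rho g(t) dt."""
    x, y, rho = jnp.asarray(x), jnp.asarray(y), jnp.asarray(rho)
    t = 0.5 * rho[..., None] * (_U + 1.0)  # map nodes from [-1, 1] to [0, rho]
    s = 1.0 - t * t
    g = jnp.exp(-(x[..., None] ** 2 - 2.0 * t * x[..., None] * y[..., None]
                  + y[..., None] ** 2) / (2.0 * s)) / jnp.sqrt(s)
    integral = 0.5 * rho * jnp.sum(_W * g, axis=-1)
    return norm_cdf(x) * norm_cdf(y) + integral / (2.0 * jnp.pi)
```

Both functions accept scalars or batched arrays (with `x`, `y`, `rho` broadcast elementwise), so a portfolio of \(N\) options with \(N\) different correlations is a single fused kernel call after JIT compilation.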
GPU Benchmark (Tesla T4, CPU: Xeon 7B13)
| Samples (N) | Task | Method | Time (s) | Speedup |
|---|---|---|---|---|
| 10,000 | Univariate | JAX | 0.15163 | 0.0x |
| 10,000 | Bivariate | JAX | 0.45608 | 12.1x |
| 100,000 | Univariate | JAX | 0.26846 | 0.0x |
| 100,000 | Bivariate | JAX | 0.46016 | 153.1x (Est.) |
| 1,000,000 | Univariate | JAX | 0.15443 | 0.4x |
| 1,000,000 | Bivariate | JAX | 0.27804 | 1139.2x (Est.) |
| 10,000,000 | Univariate | JAX | 0.08185 | 5.3x |
| 10,000,000 | Bivariate | JAX | 0.18332 | 12730.8x (Est.) |
| 100,000,000 | Univariate | JAX | 0.12218 | 36.6x |
| 100,000,000 | Bivariate | JAX | 0.22598 | 101728.1x (Est.) |
- For bivariate normal integration, as the sample size grows, GPU parallelism is increasingly saturated and the speedup expands dramatically, reaching an estimated \(10^5\times\)-level acceleration at the \(10^8\) scale.
- For the univariate CDF, kernel launch and JIT overhead dominate small batches, while GPU throughput advantages begin to appear in very large batches.