# Probability Theory for Physically Based Rendering

**Introduction**

Rendering frequently involves the evaluation of multidimensional definite integrals: e.g., the visibility of an area light, radiance arriving over the area of a pixel, radiance arriving over a period of time, and the irradiance arriving over the hemisphere of a surface point. Evaluation of these integrals is typically done using Monte Carlo integration, where the integral is replaced by the expected value of a stochastic experiment.

This document details the basic process of Monte-Carlo integration, as well as several techniques to reduce the variance of the approach. This will be done from a practical point of view: it is assumed that the reader is not intimately familiar with probability theory, but still wants to benefit from it for the development of efficient yet correct rendering algorithms.

**Definite Integrals**

A definite integral is an integral of the form \int_{a}^{b}f(x)dx, where [a,b] is an interval (or domain), x is a scalar and f(x) is a function that can be evaluated for each point in the interval. As worded by Wikipedia, a definite integral is defined as the signed area in the xy-plane that is bounded by the graph of f, the x-axis and the vertical lines x=a and x=b (*Figure 1a*).

The concept extends intuitively to higher dimensions: for a definite double integral, *signed area* becomes *signed volume* (*Figure 1b*) and in general, for definite multiple integrals, this becomes the *signed hyper-volume*.

In some cases, the area can be determined *analytically*, e.g. for f(x)=2: for the domain [a,b], the area is simply 2(b-a). In other cases an analytical solution is impossible, for example when we want to know the volume of the part of the iceberg that is above the water (*Figure 1c*). In such cases, f(x) can often only be determined by *sampling*.

**Numerical Integration**

We can estimate the value of complex integrals using *numerical integration*. One example of this is the *Riemann sum*. We calculate this sum by dividing the region into regular shapes (e.g. rectangles) that together form a region similar to the actual region. The Riemann sum is defined as:

\tag{1}\sum_{i=1}^{n}f(x_i)\Delta x_i

Here, n is the number of subintervals, and \Delta x_i=\frac{b-a}{n} is the width of one subinterval. For each interval i, we sample f at a fixed location x_i in the subinterval (in *Figure 2*: at the start of the subinterval).

Note that as we increase n the Riemann sum will converge to the actual value of the integral:

\tag{2}\int_{a}^{b}f(x)dx=\lim_{||\Delta x||\to 0}\sum_{i=1}^{n}f(x_i)\Delta x_i

Riemann sums also work in higher dimensions (*Figure 3*). However, we run into a problem: for a function with two parameters, the number of subintervals must be much larger if we want a resolution comparable to what we used in the one-dimensional case. This effect is known as the *curse of dimensionality*, and is amplified in higher dimensions.

We will now evaluate the accuracy of the Riemann sum for the following (deliberately convoluted) function:

\tag{3}f(x)=\left|\sin\left(\frac{1}{2}x+\frac{\pi}{2}\right)\tan\frac{x}{27}+\sin\left(\frac{3}{5}x^2\right)+\frac{4}{x+\pi+1}-1\right|

A plot of the function over the domain [-2.5,2.5] is shown below. For reference, we calculate the definite integral \int_{-2.5}^{2.5}f(x)dx via Wolfram Alpha, which reports that the area is 3.12970. The plot on the right shows the accuracy of numerical integration using the Riemann sum for increasing n.

To put some numbers on the accuracy: for n=50, the error is ~2\times10^{-3}. At n=100, the error is ~3\times10^{-4}. Another order of magnitude is obtained for n=200.

For additional information on Riemann sums, consider these resources:

- Khan Academy: https://www.khanacademy.org/math/ap-calculus-ab/ab-accumulation-riemann-sums/ab-riemann-sums/v/simple-riemann-approximation-using-rectangles
- Wikipedia: https://en.wikipedia.org/wiki/Riemann_sum

**Monte Carlo (1)**

For rendering, few integrals (none?) are *univariate*. This means that we quickly meet the curse of dimensionality. On top of that, sampling a function at regular intervals is prone to *undersampling* and *aliasing*: we may miss important values in the function, or end up with unintended interference between the sampled function and the sample pattern (*Figure 5*).

We solve these problems using a technique known as *Monte Carlo integration*. Similar to the Riemann sum this involves sampling the function at a number of points, but unlike the *deterministic* pattern in the Riemann sum, we turn to a fundamentally *non-deterministic* ingredient: random numbers.

Monte Carlo integration is based on the observation that an integral can be replaced by the *expected value* of a stochastic experiment:

\tag{4}\int_{a}^{b}f(x)dx=(b-a)E\left[f(X)\right]\approx\frac{b-a}{n}\sum_{i=1}^{n}f(X_i)

In other words: we sample the function n times at random locations within the domain (denoted by capital X), average the samples, and multiply by the width of the domain (for a univariate function). As with the Riemann sum, as n approaches infinity, the average of the samples converges to the expected value, and thus the true value of the integral.

**A Bit of Probability Theory**

It is important to grasp all the individual concepts here. Let’s start with the *expected value*: this is the value we expect for a single sample. Note that this is not necessarily a *possible* value, which may be counter-intuitive. For example, when we roll a die, the expected value is 3.5: the average of all possible outcomes, (1+2+3+4+5+6)/6=21/6=3.5.

The second concept is *random numbers*. It may sound obvious, but what we need for Monte Carlo integration are uniformly distributed random numbers, i.e. every value must have an equal probability of being generated. More on this later.

A third concept is *deviation* and, related to that, *variance*. Regardless of how many samples we take, the expected value of the average, like the expected value of each individual sample, is the same. However, an actual evaluation of *Equation 4* will rarely produce exactly this value. Deviation is the difference between the outcome of the experiment and the expected value: X-E(X).

In practice, this deviation has an interesting distribution:

This is a plot of the *normal distribution* or *bell curve*: it shows that not all deviations are equally probable. In fact, ~68.2% of our samples are within the range -1\sigma..1\sigma, where \sigma (sigma) is the *standard deviation*. Two useful ways to describe ‘standard deviation’ are:

- Standard deviation is a *measure of the dispersion* of the data.
- ~95% of the data points are within 2\sigma of the mean.

To determine the standard deviation we have two methods:

- Standard deviation \sigma=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i-E\left[X\right]\right)^2}: this works if we have a discrete probability distribution and the expected value E[X] is known. This is true for a die, where X=\{1,2,3,4,5,6\} and E[X]=3.5. Plugging in the numbers, we get \sigma=1.71.
- Alternatively, we can calculate the sample standard deviation using \sigma=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}, where \bar{X} is the sample mean. More information on Wikipedia.

Sanity check: does this make sense? If \sigma=1.71, we claim that 68.2% of the samples are within 1.71 of 3.5. We know that {2,3,4,5} satisfy this criterion; 1 and 6 do not. Four out of six is 66.7%. Only if our die could produce *any* value in the range [1..6] would we have hit the 68.2% mark exactly.

Instead of standard deviation we frequently use the related term *variance*, which is defined simply as Var\left[X\right]=\sigma^2. Being a square, variance is never negative, which helps in our calculations.
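For reference, a standard identity (textbook probability, not stated in the article) relates variance to plain expectations, which is often the easiest way to compute it:

```latex
Var\left[X\right]=E\left[\left(X-E\left[X\right]\right)^2\right]=E\left[X^2\right]-E\left[X\right]^2
```

For the die: E\left[X^2\right]=\frac{1+4+9+16+25+36}{6}=\frac{91}{6}, so Var\left[X\right]=\frac{91}{6}-3.5^2=\frac{35}{12}\approx2.92, matching \sigma=1.71 from above.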

**Monte Carlo (2)**

In an earlier section we evaluated *Equation 3* using a Riemann sum. We will now repeat this experiment using Monte Carlo integration. Recall that Monte Carlo integration is defined as

\tag{5}\int_{a}^{b}f(x)dx\approx\frac{b-a}{n}\sum_{i=1}^{n}f(X_i)

A straight translation to C code:

```c
double sum = 0;
for( int i = 0; i < n; i++ ) sum += f( Rand( 5 ) - 2.5 );
sum = (sum * 5.0) / (double)n;
```

The result for n=2 to n=200 is shown in the graph below. This suggests that Monte Carlo integration performs much worse than the Riemann sum. A closer inspection of the error shows that for n=200 the average error of the Riemann sum is 0.0002, while the error for Monte Carlo is 0.13.

In higher dimensions, this difference is reduced, but not eliminated. The following equation is an expanded version of the one we used before, taking two parameters:

\tag{6}f(x,y)=\left|\sin\left(\frac{1}{2}x+\frac{\pi}{2}\right)\tan\frac{x}{27}+\sin\left(\frac{1}{6}x^2\right)+\frac{4}{x+\pi+1}-1\right|\left|\sin\left(1.1y\right)\cos\left(2.3x\right)\right|

Over the domain x∈[-2.5,2.5],y∈[-2.5,2.5], the volume bounded by this function and the xy-plane is 6.8685. At n=400 (20×20 samples), the error of the Riemann sum is 0.043. At the same sample count, the average error using Monte Carlo integration is 0.33. This is better than the previous result, but the difference is still significant. To understand this problem we investigate a well-known variance reduction technique for Monte Carlo integration: stratification.

Stratification improves the *uniformity* of random numbers. In *Figure 8a*, eight random numbers are used to sample the function. Since each number is picked at random, they often are not evenly distributed over the domain. *Figure 8b* shows the effect of stratification: the domain is subdivided into eight strata, and in each stratum a random position is chosen, improving uniformity.

The effect on variance is quite pronounced. *Figure 9a* shows a plot of the results with and without stratification. *Figure 9b* shows the error in the estimate. For n=10, the average error for 8 strata is 0.05; for 20 strata it is 0.07, and for 200 strata it drops to 0.002. Based on these results it is tempting to use a large number of strata. Stratification has drawbacks, however, which are amplified as the stratum count increases. First of all, the number of samples must always be a multiple of the number of strata, and secondly, like the Riemann sum, stratification suffers from the curse of dimensionality.

**Importance Sampling**

In the previous sections, we have sampled functions uniformly. An extension to the Monte Carlo *integrator* allows us to change that:

\tag{7}\int_{a}^{b}f(x)dx\approx\frac{1}{n}\sum_{i=1}^{n}\frac{f(X_i)}{p(X_i)}

Here, p(X) is a *probability density function (pdf)*: it specifies the relative probability that a random variable takes on a particular value.

For a uniform random variable in the range 0..1, the pdf is simply 1 (*Figure 10a*), which means that each value has an equal probability of being chosen. If we integrate this function over the domain [0,0.5] we get 0.5: the probability that X<\frac{1}{2}. For X>\frac{1}{2}, we obviously get the same probability.

*Figure 10b* shows a different pdf. This time, the probability of generating a number smaller than \frac{1}{2} is 70%. This is achieved with the following code snippet:

```c
float SamplePdf()
{
	if (Rand() < 0.7f) return Rand( 0.5f );
	else return Rand( 0.5f ) + 0.5f;
}
```

This pdf is defined as:

\tag{8}p(x)=\left\{\begin{matrix}1.4, & \text{if }x<\frac{1}{2}\\0.6, & \text{otherwise}\end{matrix}\right.

The numbers 1.4 and 0.6 reflect the requirement that the probability that x<\frac{1}{2} is 70%. Integrating the pdf over [0,\frac{1}{2}] yields 1.4\times\frac{1}{2}=0.7, and 0.6\times\frac{1}{2} equals 0.3. This illustrates an important requirement for pdfs in general: the pdf *must* integrate to 1. Another requirement is that p(x) cannot be zero where f(x) is not zero: that would mean parts of f have a zero probability of being sampled, which obviously affects the estimate.

A few notes to help you grasp the concept of the pdf:

- A single value of the pdf does *not* represent a probability: the pdf can therefore locally be greater than 1 (e.g., in the pdf we just discussed).
- The integral over (part of) the domain of the pdf however *is* a probability, and therefore the pdf integrates to 1 over its full domain.

A single value can be interpreted as the *relative likelihood* that a particular value occurs.

Note that the *normal distribution* is a probability density function: it provides us with the probability that some random variable falls within a certain range. In the case of the normal distribution, this random variable is the deviation from the mean. Like a well-mannered pdf, the normal distribution integrates to 1.

*Equation 7* thus allows for non-uniform sampling. It compensates for this by dividing each sample by the relative probability that it is picked. Why this matters is illustrated in *Figure 11a*. The plotted function features a significant interval where its value is 0. Sampling this region is useless: nothing is added to the sum; we just divide by a larger number. Recall the iceberg in *Figure 1c*: there is no point in sampling the height in a large area around the iceberg.

A pdf that exploits this knowledge about the function is shown in *Figure 11b*. Notice that this pdf actually is zero for a range of values. This does not make it an invalid pdf: the function is zero at the same locations. We can extend this idea beyond zero values. Samples are best spent where the function has significant values. In fact, the *ideal pdf* is *proportional to the function we are sampling*. A very good pdf for our function is shown in *Figure 12a*. An even better pdf is shown in *Figure 12b*. In both cases, we must not forget to *normalize* the pdf, so that it integrates to 1.

The pdfs in *Figure 12* pose two challenges:

- how do we create such pdfs;
- how do we sample such pdfs?

The answer to both questions is: *we don't.* In many cases the function we wish to integrate is unknown, and the only way to determine where it is significant is by sampling it, which is precisely what we need the pdf for: a classic chicken-and-egg situation.

In other cases however, we have a coarse idea of where we may expect the function to yield higher values, or zero values. In those cases, a crude pdf is often better than no pdf.

We may also be able to build the pdf on the fly: a first batch of samples estimates the shape of the function, after which we aim subsequent samples at locations where we expect high values; these samples in turn improve the pdf, and so on.

In the next article we apply these concepts to rendering. A significant challenge is to construct pdfs. We will explore several cases where pdfs can help sampling.

Contact: bikker.j@gmail.com Twitter: @j_bikker

Github: https://github.com/jbikker/lighthouse2

## 18 thoughts on “Probability Theory for Physically Based Rendering”

Just a typo:

In equation 4, the middle term is missing a factor of b – a.

I.e.,

The integral from a to b of f(x) dx is (b – a) times the expectation of f(X).

The right-hand side includes the correct factor, so the left and right are close, but the middle isn't.

Thanks for spotting that. I fixed it in Eq. 4, as well as in 5 and 7.

“Standard deviation is the expected absolute deviation: \sigma=E\left[\left|X-E\left(X\right)\right|\right].”

This is not true. The standard deviation is the sqrt of the variance, so sqrt(E[(X – E(X))^2]) which is not the same thing (nonlinear functions can’t pass through expectation values usually).

Thank you, and you are right. I was looking for a way to describe what the standard deviation is, conceptually. The second bullet point wasn’t great, and as you pointed out, wrong. I replaced it with a concrete number on the distribution of the data, but I am still wondering if there is a better way to describe what it is?

One way that I’ve heard it described that may be intuitive for some people is that, if you consider your collection of n (univariate) independent samples {x_i} (from a distribution with mean mu) as a vector in n-dimensional space, the standard deviation is proportional to the expected value of the distance from the point (mu, mu, …, mu) to your sample. That’s not especially easy to visualize, though.

Possibly a better question is: why do we use the variance to measure dispersion, rather than some other quantity? (Others are used for some purposes, for example the interquartile range or Median absolute deviation from the median: see en.wikipedia.org/Statistical_dispersion ). We use it because it has useful properties, most notably that the variance of a sum of independent random variables is the same as the sum of the variances. This fact is closely related to the central limit theorem, and why the variance is the natural parameter for the normal distribution.

Why use the standard deviation instead of sticking to the variance? Because it has the same units as the data, so it is easier to interpret — it acts like a distance in data space. Relatedly, the standard deviation is a scale parameter, so that, for a scalar a, if X has standard deviation \sigma, aX has standard deviation a\sigma.

In your code just below equation 5: is there anywhere that shows the eval() and the Rand() functions?

```c
double sum = 0;
for( int i = 0; i < n; i++ ) sum += eval( Rand( 5 ) - 2.5 );
sum = (sum * 5.0) / (double)n;
```

Eval is a single evaluation of function f. I replaced ‘eval’ by ‘f’ to make this more clear. Rand() returns a uniform random number, in this case in the range 0..5.