In this series we explore the wonderful world of the bounding volume hierarchy: the mysterious data structure that enables real-time ray tracing on recent GPUs, in complex, animated scenes. Despite the broad adoption of ray tracing in games, in-depth understanding of the underlying technology seems to be reserved to a select few. In this article we show how a proper BVH is implemented without support from DXR / RTX / OptiX / Vulkan, in easy-to-read ‘sane C++’. Whether this is actually useful for you, or just interesting as background information, depends on your needs, obviously.
This series consists of ten articles. For each article, full source code is available on GitHub. Contents:
- Basics: data structure, simple construction and ray traversal.
- Faster rays: the Surface Area Heuristic and ordered traversal.
- Faster construction: binned BVH building.
- BVHs for animated models: refitting and rebuilding.
- BVHs for animated scenes: the top-level BVH.
- TLAS & BLAS part 2 – all together now.
- Consolidating, part 1: ray tracing textured and shaded meshes.
- Consolidating, part 2: Whitted-style ray tracing.
- OpenCL: GPU ray tracing, part 1.
- OpenCL: GPU ray tracing, part 2 (SERIES FINAL, this article).
- P.S.: Concurrency (under construction)
This is the end. Last time we ported the ray tracing code to the GPU, or at least: BLAS traversal and basic visualization. All that is left to do is some cleaning up, and TLAS traversal. With that in place, we will have the same power on CPU and GPU, in terms of animation: we will be able to specify a transform (rotate, translate, scale) for each instance, on top of the capability to rebuild or refit meshes on-the-fly.
In this final article we implement TLAS traversal and the recursive Whitted-style ray tracer on the GPU. For that a few details need to be sorted out. The final product deserves a fitting final demo, for which I will take some inspiration from the work of two of my students at Utrecht University.

Preparations
To keep the article 9a code intact we start with a new file, titled raytracer.cl, which is initially a copy of kernels.cl. The new file will be loaded from ::Init() on the CPU side:
```cpp
tracer = new Kernel( "cl/raytracer.cl", "render" );
```
The OpenCL source file currently includes template/common.h. We add cl/tools.h, which currently just contains some functions for generating random numbers.
```c
#include "cl/tools.h"
```
We can now move most of the code from raytracer.cl to tools.h. I propose to move the various structs (Intersection, Ray, Tri, TriEx, BVHNode, TLASNode and BVHInstance), as well as the color format conversion function RGB32FtoRGB8 and the intersection functions IntersectTri and IntersectAABB. And finally, also BVHIntersect, for BLAS traversal. All that remains in raytracer.cl is then:
float3 Trace(...), which replicates the functionality of the Trace function we used on the CPU, and __kernel void render(...), the OpenCL kernel, which we call from the C++ code.
The reorganized code lets us deal with high-level things in raytracer.cl, while the low-level details are hidden away in tools.h.
To match the theme of the article, a new mesh has been added to the assets folder: dragon.obj.

This version of the Stanford Dragon consists of 19332 triangles, and it replaces the 1024 triangle Utah teapot. Some changes to the Mesh class are needed to accommodate the larger number of polygons:
```cpp
class Mesh
{
public:
    Mesh() = default;
    Mesh( const char* objFile, const char* texFile );
    Tri tri[19500];     // triangle data for intersection
    TriEx triEx[19500]; // triangle data for shading
    ...
};
```
Similar changes are needed in the implementation of the Mesh constructor.
```cpp
Mesh::Mesh( const char* objFile, const char* texFile )
{
    // bare-bones obj file loader; only supports very basic meshes
    texture = new Surface( texFile );
    float2* UV = new float2[11050]; // enough for dragon.obj
    float3* N = new float3[11050], *P = new float3[19500];
```
Here, 11050 is approximately the number of vertices in the dragon mesh.
With these changes, the code still works, but there is a small issue:

The white shape that obscures the sky is a massive dragon. In the CPU code we would resolve this by assigning a scaling matrix to the instance of the dragon mesh. Or rather: we would apply the inverse of this matrix to the rays that intersect the dragon.
Time to add TLAS traversal.
GPU TLAS
The concept of the TLAS (a.k.a. top level BVH) was introduced in article 5 and article 6. The idea is to build and traverse a BVH over BVHs, where each of those BVHs can have a 4×4 matrix transform, for translation, rotation, and scaling. Using the same BVH multiple times but with a different matrix enables instancing.
We already have the necessary data on the GPU. There is the TLASNode struct:
```c
struct TLASNode
{
    float minx, miny, minz;
    uint leftRight; // 2x16 bits
    float maxx, maxy, maxz;
    uint BLAS;
};
```
And an array of BVHInstance objects, indexed by the BLAS field of the TLASNode:
```c
struct BVHInstance
{
    uint dummy1, dummy2;
    uint idx;
    float16 transform;
    float16 invTransform; // inverse transform
    uint dummy[6];
};
```
The strange dummy array at the end is there to ensure that the size and layout of the OpenCL struct match the CPU struct, at least to the extent that the transform and invTransform fields end up in the correct location. More on that in a second, because I made some serious mistakes here that need fixing…
We can now port the TLAS::Intersect(...) method. The process is similar to porting the BVH::Intersect method, so I paste the result here without going over every step in detail.
```c
void TLASIntersect( struct Ray* ray, struct Tri* tri, struct BVHInstance* bvhInstance,
    struct TLASNode* tlasNode, struct BVHNode* bvhNode, uint* triIdx )
{
    // initialize reciprocals for TLAS traversal
    ray->rD = (float3)(1.0f / ray->D.x, 1.0f / ray->D.y, 1.0f / ray->D.z);
    // use a local stack instead of a recursive function
    struct TLASNode* node = &tlasNode[0], *stack[32];
    uint stackPtr = 0;
    // traversal loop; terminates when the stack is empty
    while (1)
    {
        if (node->leftRight == 0) // isLeaf()
        {
            // current node is a leaf: intersect instance
            InstanceIntersect( ray, &bvhInstance[node->BLAS], node->BLAS, tri, bvhNode, triIdx );
            // pop a node from the stack; terminate if none left
            if (stackPtr == 0) break; else node = stack[--stackPtr];
            continue;
        }
        // current node is an interior node: visit child nodes, ordered
        struct TLASNode* child1 = &tlasNode[node->leftRight & 0xffff];
        struct TLASNode* child2 = &tlasNode[node->leftRight >> 16];
        float dist1 = IntersectAABB( ray, child1 );
        float dist2 = IntersectAABB( ray, child2 );
        if (dist1 > dist2)
        {
            float d = dist1; dist1 = dist2; dist2 = d;
            struct TLASNode* c = child1; child1 = child2; child2 = c;
        }
        if (dist1 == 1e30f)
        {
            // missed both child nodes; pop a node from the stack
            if (stackPtr == 0) break; else node = stack[--stackPtr];
        }
        else
        {
            // visit near node; push the far node if the ray intersects it
            node = child1;
            if (dist2 != 1e30f) stack[stackPtr++] = child2;
        }
    }
}
```
We call two functions from TLASIntersect that we have not defined yet.
The first one is particularly interesting: We call IntersectAABB, with a TLASNode* rather than a BVHNode*. To my surprise, this just works. The TLASNode struct has, for the purpose of AABB intersection, the same layout as a BVHNode. And OpenCL doesn’t complain about the pointer type mismatch… Great, for now, but definitely something to keep in mind: Next time this ‘flexibility’ may not be what we expect.
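To see why this works, here is a stand-alone sketch of a slab test (names hypothetical; the article's IntersectAABB takes a Ray). It only ever reads the six min/max floats, so any struct that stores them at the same offsets, BVHNode or TLASNode alike, can be passed in:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// First 32 bytes shared by BVHNode and TLASNode: three min floats,
// a 4-byte field, three max floats, another 4-byte field. The slab
// test below never touches pad1/pad2, which is why both node types work.
struct Box { float minx, miny, minz; unsigned pad1; float maxx, maxy, maxz; unsigned pad2; };

// Slab test against ray origin O and reciprocal direction rD;
// returns the entry distance, or 1e30f on a miss.
float IntersectAABB( const float O[3], const float rD[3], float rayT, const Box* b )
{
    float tx1 = (b->minx - O[0]) * rD[0], tx2 = (b->maxx - O[0]) * rD[0];
    float tmin = std::min( tx1, tx2 ), tmax = std::max( tx1, tx2 );
    float ty1 = (b->miny - O[1]) * rD[1], ty2 = (b->maxy - O[1]) * rD[1];
    tmin = std::max( tmin, std::min( ty1, ty2 ) ), tmax = std::min( tmax, std::max( ty1, ty2 ) );
    float tz1 = (b->minz - O[2]) * rD[2], tz2 = (b->maxz - O[2]) * rD[2];
    tmin = std::max( tmin, std::min( tz1, tz2 ) ), tmax = std::min( tmax, std::max( tz1, tz2 ) );
    if (tmax >= tmin && tmax > 0 && tmin < rayT) return tmin; else return 1e30f;
}
```

The fourth 4-byte field of each half is dead weight as far as this function is concerned, so the pointer-type mismatch is harmless here.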
The second function requires more work. Here is the code for InstanceIntersect, based on BVHInstance::Intersect from the CPU code:
```c
void InstanceIntersect( struct Ray* ray, struct BVHInstance* bvhInstance, int blasIdx,
    struct Tri* tri, struct BVHNode* bvhNode, uint* triIdx )
{
    // backup and transform ray using instance transform
    struct Ray backup = *ray;
    TransformRay( ray, &bvhInstance->invTransform );
    // traverse the BLAS
    BVHIntersect( ray, blasIdx, tri, bvhNode, triIdx );
    // restore ray without overwriting intersection record
    backup.hit = ray->hit;
    *ray = backup;
}
```
This function in turn calls TransformRay, which is defined as follows:
```c
void TransformRay( struct Ray* ray, float16* invTransform )
{
    // do the transform
    ray->D = (float3)(
        dot( invTransform->s012, ray->D ),
        dot( invTransform->s456, ray->D ),
        dot( invTransform->s89A, ray->D ) ); // see TransformVector in template.cpp
    ray->O = (float3)(
        dot( invTransform->s012, ray->O ) + invTransform->s3,
        dot( invTransform->s456, ray->O ) + invTransform->s7,
        dot( invTransform->s89A, ray->O ) + invTransform->sB ); // see TransformPosition in template.cpp
    // update ray direction reciprocals
    ray->rD = (float3)(1.0f / ray->D.x, 1.0f / ray->D.y, 1.0f / ray->D.z);
}
```
This implements the TransformVector and TransformPosition matrix functions we used in the CPU code: In OpenCL, we have to implement these ourselves.
Note how the float16 data is addressed here: to make a float3 out of cells 0, 1 and 2, we write invTransform->s012. The same technique is available for all OpenCL vector types. For instance, to access x of ray->D we can simply write ray->D.x, but we can also build a float2 from the fields, e.g. ray->D.xy, or ray->D.xx, which duplicates the first field, or ray->D.zyx, which flips the three available fields and yields a float3. Any combination of x, y and z works. The technique is called swizzling. For a float16, x, y and z are obviously insufficient; in this case the fields are labeled 0..9 and A..F (or a..f, if you prefer): the hexadecimal indices of the 16 matrix cells.
Debugging
There is a problem with all this new code: it doesn’t work. When that happens, we debug, of course. But in OpenCL that is easier said than done: Breakpoints do not exist, and would be somewhat pointless anyway, considering that the code runs on (potentially) thousands of GPU threads. We do have some tools however. We can for example use printf from a kernel:
```c
void TLASIntersect( struct Ray* ray, struct Tri* tri, struct BVHInstance* bvhInstance,
    struct TLASNode* tlasNode, struct BVHNode* bvhNode, uint* triIdx )
{
    printf( "hello\n" );
    ...
```
Besides that, we can visualize assumptions. For example:
```c
float3 Trace( struct Ray* ray, float* skyPixels, struct Tri* triData,
    struct BVHNode* bvhNodeData, uint* idxData )
{
    // check incoming data
    if (skyPixels == 0 || triData == 0) return (float3)(1, 0, 0);
    ...
```
In the latter case, Trace will return a red pixel if we passed it a non-existent skyPixels or triData buffer. Other colors can be used to return information on different problems. Since the visualization is per-pixel, this can yield more detailed information than the printf.
With these limited tools I found several issues. The first is in struct TLASNode, which is very similar to BVHNode, and should thus be ‘just fine’:
```c
struct TLASNode
{
    float minx, miny, minz;
    uint leftRight; // 2x16 bits
    float maxx, maxy, maxz;
    uint BLAS;
};
```
If this has you scratching your head: look no further, because the problem is not here, it is in the CPU version of this struct.
```cpp
// top-level BVH node
struct TLASNode
{
    float3 aabbMin;
    uint leftRight;
    float3 aabbMax;
    uint BLAS;
    bool isLeaf() { return leftRight == 0; }
};
```
Now, one would assume that the two structs have the same size. Sadly, they don’t. The float3 type takes 16 bytes in OpenCL, and therefore, the CPU-side version mimics this, to ensure data compatibility. We thus have a CPU TLASNode that is substantially larger than 32 bytes.
The problem is quickly resolved by changing the CPU-side struct:
```cpp
// top-level BVH node
struct TLASNode
{
    union { struct { float dummy1[3]; uint leftRight; }; float3 aabbMin; };
    union { struct { float dummy2[3]; uint BLAS; }; float3 aabbMax; };
    bool isLeaf() { return leftRight == 0; }
};
```
That is very ugly, but: Now the two 16-byte structs overlap the float3 fields in memory, which causes fields leftRight and BLAS to be exactly in the right location.
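Both the mismatch and the fix are easy to verify in isolation. The sketch below uses a hypothetical alignas(16) float3 to mimic the template's padded type (an assumption about the template; the real type may differ in detail). The static_asserts show the naive struct ballooning to twice the GPU size, and the union version matching the 32-byte OpenCL layout:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the template's float3, padded to 16 bytes
// like OpenCL's float3 (assumption: alignas(16) mimics it).
struct alignas(16) float3 { float x, y, z; };

// Naive CPU-side node: the compiler pads leftRight and BLAS
// out to 16 bytes each, doubling the size of the struct.
struct TLASNodeNaive
{
    float3 aabbMin; uint32_t leftRight;
    float3 aabbMax; uint32_t BLAS;
};

// Union version: leftRight and BLAS overlap the float3 padding,
// so the struct shrinks back to the expected 32 bytes.
// (Anonymous structs in unions are a common compiler extension.)
struct TLASNodeFixed
{
    union { struct { float dummy1[3]; uint32_t leftRight; }; float3 aabbMin; };
    union { struct { float dummy2[3]; uint32_t BLAS; }; float3 aabbMax; };
};

static_assert( sizeof( TLASNodeNaive ) == 64, "twice the GPU size" );
static_assert( sizeof( TLASNodeFixed ) == 32, "matches the OpenCL layout" );
```

If either static_assert fires on your compiler, the CPU and GPU structs are out of sync, which is exactly the bug this section is about.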
After this fix, we still get no useful results out of TLASIntersect. It turns out that there is another problematic struct; this time it is BVHInstance. The struct contains two float16 fields, which have the same size as a mat4 in the CPU code. The problem is: OpenCL dictates that any field must be positioned in a struct such that its offset is a multiple of its size. In this case, the size of a float16 is 64 bytes; we have only 12 bytes before the first float16 and therefore OpenCL will inject 52 unused bytes between field idx and the first transform…
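The padding rule can again be checked on the CPU, with a hypothetical alignas(64) stand-in for float16:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for OpenCL's float16: 64 bytes, 64-byte aligned (assumption:
// OpenCL aligns vector types to their size, mimicked here with alignas).
struct alignas(64) float16 { float s[16]; };

// Layout with 12 bytes of fields before the first matrix: the compiler
// must move 'transform' to offset 64, injecting 52 hidden padding bytes.
struct BVHInstanceBad
{
    uint32_t dummy1, dummy2, idx; // 12 bytes
    float16 transform;            // lands at offset 64, not 12
    float16 invTransform;
};

static_assert( offsetof( BVHInstanceBad, transform ) == 64, "52 bytes injected" );

// Putting the matrices first removes the hidden padding entirely.
struct BVHInstanceGood
{
    float16 transform;    // offset 0
    float16 invTransform; // offset 64
    uint32_t idx;
};

static_assert( offsetof( BVHInstanceGood, invTransform ) == 64, "naturally aligned" );
```

With the matrices up front, CPU and GPU agree on where transform and invTransform live, which is all the kernel needs.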
Fixing this is again easier on the CPU side. We modify the BVHInstance struct to:
```cpp
class BVHInstance
{
    ...
    mat4 transform;
    mat4 invTransform; // inverse transform
    aabb bounds;       // in world space
    BVH* bvh = 0;
    uint idx;
    int dummy[7];
};
```
In the new layout, transform and invTransform are properly aligned: transform now has an offset of 0, and invTransform is at offset 64. The array of dummy integers at the end ensures that the size of the whole struct is a multiple of 64 bytes: this way, in an array of BVHInstances, objects other than the first also have properly aligned float16 fields.
With these fixes, we finally have the full TLAS & BLAS functionality on the GPU, albeit without any fancy shading:

Restoring Light
The hard part is now done. Adding some shading is a matter of porting the body of the Trace function. This should not have too many surprises:
```c
float3 Trace( struct Ray* ray, float* skyPixels, struct BVHInstance* instData,
    struct TLASNode* tlasData, uint* texData, struct Tri* triData,
    struct TriEx* triExData, struct BVHNode* bvhNodeData, uint* idxData )
{
    TLASIntersect( ray, triData, instData, tlasData, bvhNodeData, idxData );
    struct Intersection i = ray->hit;
    if (i.t == 1e30f)
    {
        // sample sky
        ...
    }
    // calculate texture uv based on barycentrics
    uint triIdx = i.instPrim & 0xfffff;
    uint instIdx = i.instPrim >> 20;
    struct TriEx* tri = triExData + triIdx;
    float2 uv = i.u * tri->uv1 + i.v * tri->uv2 + (1 - (i.u + i.v)) * tri->uv0;
    int iu = (int)(uv.x * 1024) & 1023;
    int iv = (int)(uv.y * 1024) & 1023;
    uint texel = texData[iu + (iv << 10)];
    float3 albedo = RGB8toRGB32F( texel );
    // calculate the normal for the intersection
    float3 N0 = (float3)( tri->N0x, tri->N0y, tri->N0z );
    float3 N1 = (float3)( tri->N1x, tri->N1y, tri->N1z );
    float3 N2 = (float3)( tri->N2x, tri->N2y, tri->N2z );
    float3 N = i.u * N1 + i.v * N2 + (1 - (i.u + i.v)) * N0;
    N = normalize( TransformVector( &N, &instData[instIdx].transform ) );
    float3 I = ray->O + i.t * ray->D;
    // calculate the diffuse reflection in the intersection point
    float3 L = lightPos - I;
    float dist = length( L );
    L *= 1.0f / dist;
    return albedo * (ambient + max( 0.0f, dot( N, L ) ) * lightColor * (1.0f / (dist * dist)));
}
```
Note that the size of the brick texture is now hardcoded. We’ll want to change that at a later stage, to make the renderer more versatile. The light source is also still very much hardcoded, as it was in the CPU version. The three constants for the light source can be defined at global scope, somewhere above the Trace function:
```c
__constant float3 lightPos = (float3)(3, 10, 2);
__constant float3 lightColor = (float3)(150, 150, 120);
__constant float3 ambient = (float3)(0.2f, 0.2f, 0.4f);
```
To convert pixels read from the texture to float3 we also need a port of the RGB8toRGB32F function. It's straightforward, and should go to cl/tools.h:
```c
float3 RGB8toRGB32F( uint c )
{
    float s = 1 / 256.0f;
    int r = (c >> 16) & 255;
    int g = (c >> 8) & 255;
    int b = c & 255;
    return (float3)(r * s, g * s, b * s);
}
```
And finally, transforming the normal vector using the matrix for the intersected instance now uses a new function TransformVector, which in hindsight would come in handy in the TransformRay function as well. It is defined as:
```c
float3 TransformVector( float3* V, float16* T )
{
    return (float3)( dot( T->s012, *V ), dot( T->s456, *V ), dot( T->s89A, *V ) );
}
```
And with that we have properly textured and illuminated dragons.

Mirror Mirror
The next challenge for the OpenCL port of the recursive ray tracer is: Recursion, which we need for mirrors, as well as glass and water (dielectrics). There is however a problem with recursion in OpenCL: It is forbidden. Yet, in the original ::Trace function on the CPU, it was rather instrumental to our needs:
```cpp
float3 WhittedApp::Trace( Ray& ray, int rayDepth )
{
    ...
    if (mirror)
    {
        // calculate the specular reflection in the intersection point
        Ray secondary;
        secondary.D = ray.D - 2 * N * dot( N, ray.D );
        secondary.O = I + secondary.D * 0.001f;
        secondary.hit.t = 1e30f;
        if (rayDepth >= 10) return float3( 0 );
        return Trace( secondary, rayDepth + 1 );
    }
    else
    {
        // calculate the diffuse reflection in the intersection point
        ...
    }
}
```
The return statement in the mirror branch calls the Trace function itself with a newly constructed secondary ray, which starts at the intersection point and extends into the direction calculated using the specular reflection formula a few lines earlier.
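The reflection formula itself, R = D − 2N(N·D), is short enough to verify in isolation. A minimal sketch (helper operators hypothetical), assuming N has unit length:

```cpp
#include <cassert>

struct float3 { float x, y, z; };
float3 operator-( float3 a, float3 b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float3 operator*( float s, float3 v ) { return { s * v.x, s * v.y, s * v.z }; }
float dot( float3 a, float3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Specular reflection: R = D - 2 * N * dot( N, D ), for unit-length N.
// The component of D along N is flipped; the tangential part is kept,
// so the reflected ray leaves at the same angle it arrived.
float3 Reflect( float3 D, float3 N )
{
    return D - (2 * dot( N, D )) * N;
}
```

A ray coming in at 45 degrees, D = (1,−1,0), bounces off a floor with normal N = (0,1,0) into (1,1,0): same angle, mirrored about the normal.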
Luckily, we have ways to bypass recursion. We already did this to traverse the BVH: Instead of calling BVHIntersect for the left and right child nodes, we switch processing to the nearest child node, while we push the far node on a stack. And, whenever we would normally return, we now pop work from the stack, until the stack is empty.
We can deploy the same scheme here. Except, it will be simpler: Until we add glass or water, the path of a ray does not split up. It ends whenever we hit the skydome, but also when we hit a diffuse object. Only when we hit a reflective dragon do we bounce, but in that case we can just restart the Trace function after replacing the original ray by a reflected ray. Like this:
```c
float3 Trace( ... )
{
    int rayDepth = 0;
    // bounce until we hit the sky or a diffuse surface
    while (rayDepth < 10)
    {
        TLASIntersect( ray, ... );
        struct Intersection i = ray->hit;
        if (i.t == 1e30f)
        {
            // sample sky
            ...
            return ... ;
        }
        ...
        if (mirror)
        {
            // calculate the specular reflection in the intersection point
            ray->D = ray->D - 2 * N * dot( N, ray->D );
            ray->O = I + ray->D * 0.001f;
            ray->hit.t = 1e30f;
            rayDepth++;
        }
        else
        {
            // calculate the diffuse reflection in the intersection point
            return ... ;
        }
    }
    return (float3)( 1, 1, 1 );
}
```
Basically, this loop makes a ray bounce around, until it does not hit a mirror. Of course we need to be careful with infinite ‘recursion’ (in this case: an infinite loop). So, we keep track of the ‘bounce depth’ and once it exceeds a certain value, we terminate the loop. In that case the last surface we hit must have been a mirror; we thus return white, i.e. (float3)(1, 1, 1), which is probably better than any other color.
The scheme is quite effective, and brings back the shiny dragons:

MASSIVE
One advantage of ray tracing is how it handles massive scenes. For simple scenes, traditional rasterization is typically a lot faster, but at some point, the tables turn. This has everything to do with the BVH: once built (that's a rather important condition!), a BVH over N triangles requires on average log_2 N steps to get from the root of the tree to a leaf node. For 1,000 triangles we get to a single leaf in 10 steps; for 1,000,000 we get there in 20 steps: twice the cost, for a thousand times the complexity. Even a billion triangles is doable; that only takes three times more work than one thousand.
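These numbers follow directly from the logarithm; a quick back-of-the-envelope check:

```cpp
#include <cassert>
#include <cmath>

// Average number of root-to-leaf steps in a balanced BVH over N triangles.
int bvhSteps( double N )
{
    return (int)std::ceil( std::log2( N ) );
}
```

bvhSteps(1000) yields 10, bvhSteps(1e6) yields 20, and bvhSteps(1e9) yields 30: each factor of a thousand in triangle count adds only ten steps.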
To put that to the test we will create a scene, in which every vertex of the dragon model is replaced by a dragon. The dragon has 11,042 vertices, and 19,332 polygons; the scene thus will have a bit more than 213 million triangles. You’ll have to believe me when I say that adding another zero to that number will hardly matter.
The scene is constructed in the ::Init() function:
```cpp
void MassiveApp::Init()
{
    mesh = new Mesh( "assets/dragon.obj", "assets/bricks.png" );
    // load HDR sky
    ...
    // dragons in the shape of a dragon
    bvhInstance = new BVHInstance[11042];
    for( int i = 0; i < 11042; i++ )
    {
        bvhInstance[i] = BVHInstance( mesh->bvh, i );
        bvhInstance[i].SetTransform(
            mat4::Translate( mesh->P[i] * 0.2f ) *
            mat4::Scale( 0.0025f ) *
            mat4::Rotate( mesh->N[i], 0 ) *
            mat4::RotateX( PI / 2 ) );
    }
    tlas = TLAS( bvhInstance, 11042 );
    tlas.Build();
    ...
```
To make that work, the BVHInstance array in the application class definition in massive.h is changed to a pointer:
```cpp
BVHInstance* bvhInstance;
```
The ::Init() function needs access to vertex positions and normals. These were stored in local arrays (P and N) in the Mesh constructor. We add pointers to this data to the Mesh class, and make a small change to the Mesh constructor:
```cpp
Mesh::Mesh( const char* objFile, const char* texFile )
{
    // bare-bones obj file loader; only supports very basic meshes
    ...
    N = new float3[11043], P = new float3[11043];
    ...
```
Tiny but important detail: Indices in obj files start at 1; we thus either make the arrays 1 element larger, or subtract 1 from every index that we use – hence the 11043 figure.
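As an illustration of that off-by-one, here is a minimal face-line parser (a hypothetical sketch; the real loader in the repository handles the full `f v/vt/vn` syntax):

```cpp
#include <cassert>
#include <cstdio>

// Parses an obj face line such as "f 1 2 3". Obj indices are 1-based,
// so we subtract 1 here; the alternative (used above) is to oversize
// the vertex arrays by one element and index them 1-based.
bool ParseFace( const char* line, int idx[3] )
{
    if (sscanf( line, "f %i %i %i", &idx[0], &idx[1], &idx[2] ) != 3) return false;
    for (int i = 0; i < 3; i++) idx[i]--; // convert to 0-based
    return true;
}
```

Either approach works; mixing them, or forgetting both, is what produces the classic obj-loader garbage mesh.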
We are almost ready to show the scene: The scene is constructed, the (single) BLAS for the 11,042 dragon instances is built, and we have a TLAS – which constructs in about half a second on my machine. Half a second: That means the TLAS will have to be static, for now. We thus build it and copy it once to the GPU, in the ::Init() function. The tlasData->CopyToDevice() and instData->CopyToDevice() commands can now also be moved to ::Init(); we will render a static scene.
If you typed along, you now have a ‘dragon of dragons’, and it runs at a respectable speed.
Camera
Such a scene deserves a proper view.
For that, we will move the camera along a smooth path. The Catmull-Rom spline is ideal for this: it is smooth, and moves through a series of points.

Given P_0 .. P_3, we can find a point on the segment P_1 .. P_2 with the following code:
```cpp
float3 CatmullRom( float t, float3& p0, float3& p1, float3& p2, float3& p3 )
{
    float3 c = 2 * p0 - 5 * p1 + 4 * p2 - p3, d = 3 * (p1 - p2) + p3 - p0;
    return 0.5f * (2 * p1 + ((p2 - p0) * t) + (c * t * t) + (d * t * t * t));
}
```
By varying t from 0..1, this function allows us to smoothly travel the curve. Once we reach t=1 (or t=0, if we move backwards), we simply select the next set of four points on the curve; this way, the smooth path can consist of any number of points.
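A useful property to convince yourself of: whatever p0 and p3 are, the segment starts exactly at p1 and ends exactly at p2, which is what makes chaining segments seamless. A scalar sketch of the same polynomial:

```cpp
#include <cassert>
#include <cmath>

// Scalar version of CatmullRom, same coefficients as the float3 one.
float CatmullRom( float t, float p0, float p1, float p2, float p3 )
{
    float c = 2 * p0 - 5 * p1 + 4 * p2 - p3, d = 3 * (p1 - p2) + p3 - p0;
    return 0.5f * (2 * p1 + (p2 - p0) * t + c * t * t + d * t * t * t);
}
```

At t=0 every term but 2·p1 vanishes; at t=1 the terms sum to 2·p2. The outer points p0 and p3 only shape the tangents in between.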
Let’s take this one step further: If we move not just the camera over a smooth curve, but also a point at which the camera is aimed, we get a pretty versatile camera system. Consider the following set of points:
```cpp
float3 spline[] = {
    float3( 1.32, 11.35, 0.31 ),     float3( 0.35, 11.56, 0.18 ),
    float3( -0.92, 11.35, 2.92 ),    float3( -1.58, 11.55, 2.19 ),
    float3( -5.51, 13.43, 4.96 ),    float3( -5.82, 13.63, 4.03 ),
    float3( -10.90, 13.48, 4.53 ),   float3( -10.39, 13.68, 3.69 ),
    float3( -13.15, 15.33, 2.21 ),   float3( -12.20, 15.52, 1.96 ),
    float3( -12.44, 15.33, -1.92 ),  float3( -11.92, 15.51, -1.08 ),
    float3( -8.17, 15.48, -4.71 ),   float3( -7.32, 15.38, -4.19 ),
    float3( -0.19, 13.22, -6.09 ),   float3( 0.31, 12.72, -5.38 ),
    float3( 8.44, 8.37, -5.22 ),     float3( 8.01, 7.81, -4.52 ),
    float3( 14.31, 4.71, -1.89 ),    float3( 13.40, 4.46, -1.56 ),
    float3( 11.81, 4.97, 6.36 ),     float3( 11.19, 4.65, 5.64 ),
    float3( 10.57, 4.58, 5.19 ),     float3( 9.85, 4.39, 4.52 ),
    float3( 7.33, 4.66, 2.14 ),      float3( 6.61, 4.59, 1.45 ),
    float3( 6.01, 4.77, 0.68 ),      float3( 5.28, 4.70, 0 ),
    float3( 3.00, 4.48, -2.16 ),     float3( 2.27, 4.42, -2.85 ),
    float3( -1.53, 4.27, -5.49 ),    float3( -1.89, 4.21, -4.56 ),
    float3( -3.63, 4.12, -3.77 ),    float3( -3.80, 4.06, -2.78 ),
    float3( -4.97, 3.97, 2.62 ),     float3( -5.38, 3.98, 3.53 ),
    float3( -9.38, 3.97, 7.47 ),     float3( -8.92, 3.98, 6.59 ),
    float3( -16.17, 5.36, 7.89 ),    float3( -15.37, 5.37, 7.29 ),
    float3( -27.30, 10.08, -1.143 ), float3( -26.33, 10.00, -0.89 ),
    float3( -30.81, 10.88, -13.78 ), float3( -29.94, 10.80, -13.30 )
};
```
Each pair in this array contains one camera position (on the left) and a camera target (on the right). I obtained these points by manually navigating the scene, and recording 20 points to a file. The code for that is left in the program; feel free to replace the spline with a new path.
The camera can now be updated in the ::Tick() function, with a little help from the new SplineCam function:
```cpp
void SplineCam( int seg, float t )
{
    float3 c0 = spline[seg * 2 - 2], c1 = spline[seg * 2];
    float3 c2 = spline[seg * 2 + 2], c3 = spline[seg * 2 + 4];
    float3 t0 = spline[seg * 2 - 1], t1 = spline[seg * 2 + 1];
    float3 t2 = spline[seg * 2 + 3], t3 = spline[seg * 2 + 5];
    camPos = CatmullRom( t, c0, c1, c2, c3 );
    camTarget = CatmullRom( t, t0, t1, t2, t3 );
}

void MassiveApp::Tick( float deltaTime )
{
    // playback spline path
    static int seg = 1;
    static float t = 0;
    t += deltaTime * 0.0005f;
    if (t > 1) { t -= 1.0f; if (++seg == 20) seg = 1; }
    SplineCam( seg, t );
    mat4 M = mat4::LookAt( camPos, camTarget );
    ...
```
And with that, we have reached our final destination.

Bandwidth
There’s one thing that is rather silly about the current demo, and it has to do with the template used. It is however a generic thing, so let me explain.
To render an image, the CPU code creates a buffer in VRAM. The OpenCL kernel fills this buffer with pixels. Next, we copy the buffer back to the CPU. And then what? Well, under the hood the template maintains a single texture, the size of which matches the window size. The rendered pixels are copied to this texture once per frame, after ::Tick() completes. OpenGL is then used to draw a screen-filling quad using this texture. This is how a frame ends up on the monitor. We thus have:
- Pixels rendered on the GPU using OpenCL
- Pixels transferred to CPU RAM
- Pixels copied from CPU RAM to a texture in VRAM
That feels like a rather long route with unnecessary detours, and it is. Luckily, there is a shortcut, and the name of the shortcut is GL/CL interop. This is a technique where OpenGL makes a resource (in this case: a texture) available to OpenCL, so OpenCL can render directly to it. And this saves time, obviously.
To enable the shortcut, we replace one line in ::Init(). The line that currently reads
```cpp
target = new Buffer( SCRWIDTH * SCRHEIGHT * 4 );
```
is to be replaced by
```cpp
target = new Buffer( GetRenderTarget()->ID, 0, Buffer::TARGET );
screen = 0;
```
This is inevitably highly template-specific. The first line creates the buffer using an OpenGL texture identifier, obtained from the render target of the template. The second line is used to tell the template to stop sending frames to the OpenGL texture – we are using OpenCL for that now.
A few more changes are needed. In ::Tick(), we delete the two lines that ‘obtain the rendered result’. And, in the kernel code, the first argument is now different:
```c
__kernel void render( write_only image2d_t target, ... )
{
    ...
}
```
And finally, plotting to the target changes. We used an unsigned int array before, but now we have a floating point OpenGL texture to work with. We use the following line to write to it:
```c
write_imagef( target, (int2)(x, y), (float4)( color, 1 ) );
```
After this change the demo works again, this time without wasting time sending pixel data back and forth.
Future Work
Where do we go from here? Well, as a wise man once said, you can be “Ray Tracing – the Rest of Your Life”…
Faster BVH building – The BVH for the dragon is constructed in just a few milliseconds, on a single core. For multiple meshes multiple cores can be used – until they run out. Take it further by optimizing the BVH builder: Some SIMD magic should bring down the time significantly, and beyond that, well, there’s ways to make it very fast. An important trick is to build your BVH while the GPU is tracing rays: the only thing that needs to happen between frames is sending the updated data. And even that isn’t strictly necessary.
Faster traversal – We trace hundreds of millions of rays per second now on the GPU – without RTX. But, the code is far from optimal… There are BVH layouts that are better suited for the GPU, and there’s 4-wide and 8-wide BVHs, which further improve performance. On the CPU packet traversal makes a huge difference, especially for primary rays. By the way, we never implemented proper shadow rays, so that should be on the top of your list.
Better rendering – A few reflecting dragons is nice, but what about more interesting materials, normal maps, depth of field, motion blur, soft shadows, indirect light? This is a field in which you can explore endlessly. Start with basic path tracing, and quickly your code will be too slow again, forcing you to look into filtering techniques and ReSTIR for efficiently sampling many lights. A life is not going to be enough, I’m afraid.
Engine flexibility – The code of this tutorial is short and hopefully clear, but definitely not robust. Handling multiple meshes and textures and being able to delete them and add them at runtime will require a lot of administration – especially if you still want to render on the GPU. And while you’re at it: A fixed-size skydome is not how things should be, obviously. Do mind compile times, once you start cleaning things up: you may have noticed that the tutorial code compiles in a flash, and that’s not just a lucky accident.
OptiX-CL – NVIDIA provides RTX support from CUDA in the form of a library called OptiX. It would be nice if the functionality of this library became available outside of CUDA, and thus for all devices of all vendors. If you see a challenge here: Let me know! Perhaps we can join forces. Especially the now obsolete OptiX Prime API would be very interesting (and relatively simple) to replicate.
Lighthouse 2 – Speaking of OptiX: My Lighthouse 2 project uses OptiX to quickly render complex scenes on the GPU, but it is NVIDIA only, essentially. A proper render core using OpenCL or even using the CPU would be great to have. One day, perhaps.

That’s just my ideas though. Let me know what your plans are, but whatever you do, keep tracing.
And with that, allow me to sign off for once as if it’s 1999 on Flipcode:
Love you all,
Jacco Bikker, a.k.a. “The Phantom”.
Closing Remarks
The complete code for this project, in ten convenient incremental stages, can be found on GitHub (for Windows / Visual Studio).
Enjoyed the article? Spotted an error?
Ping me on Twitter (@j_bikker), comment on this article, or send an email to bikker.j@gmail.com.
You can also find me on Github: https://github.com/jbikker.
Or read some more articles: https://jacco.ompf2.com. Some favorites:
- Wavefront path tracing
- Probability Theory for Physically Based Rendering – part 1
- Probability Theory for Physically Based Rendering – part 2
- OPT#3: SIMD (part 1 of 2) – also check the other optimization blog posts!
