# Deferred Rendering Pipeline

Today we’re going to learn about the Deferred Rendering pipeline. The process of Deferred Rendering is storing information from a pass on our objects in one (or several) buffers known as the GBuffer. On a separate pass we calculate lighting information utilizing all of this cached information.

# Deferred Rendering

## The Pipeline

As mentioned, Deferred Rendering is simply a way of structuring your rendering. The first way we usually learn to structure our rendering pass(es) is by looping over our objects, and for each fragment we loop over every light applying the lighting calculation. The complexity of this operation can be greatly simplified and abstracted to the following sum:

$\sum^{Objs}(Ambient + \sum^{Lights}(Diffuse + Specular))$

We can usually spot an inefficient algorithm by how many sums deep it nests. The idea behind Deferred Rendering is to move that inner light sum outside of the object sum, so that the cost of a new light is approximately the cost of a new object in the scene.

$\sum^{Objs}(Ambient) + \sum^{Lights}(Diffuse + Specular)$
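To put rough numbers on the two sums, here is a small Python sketch that just counts the terms each formula evaluates. It is an abstraction in the same spirit as the sums themselves (it ignores fragment counts, overdraw, and bandwidth), and the function names are made up for illustration.

```python
def forward_terms(num_objects: int, num_lights: int) -> int:
    """Terms in sum_Objs(Ambient + sum_Lights(Diffuse + Specular))."""
    ambient_terms = num_objects               # one ambient term per object
    light_terms = num_objects * num_lights    # every object visits every light
    return ambient_terms + light_terms


def deferred_terms(num_objects: int, num_lights: int) -> int:
    """Terms in sum_Objs(Ambient) + sum_Lights(Diffuse + Specular)."""
    ambient_terms = num_objects               # one ambient term per object
    light_terms = num_lights                  # each light is visited once
    return ambient_terms + light_terms


print(forward_terms(100, 50))   # 5100
print(deferred_terms(100, 50))  # 150
```

Adding one more light costs 100 extra terms in the first count and a single extra term in the second, which is the whole appeal.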

That’s really all there is to know about what we’re trying to accomplish. The important thing to understand is how we plan on accomplishing it. This leads us towards understanding why Deferred Rendering is called Deferred Rendering. With Forward Rendering, we just blindly push information to our final presentation buffer – whether that’s the screen or another backbuffer isn’t important. For Deferred Rendering, we will push information into other buffers, and defer the lighting calculation to a later time, when we do our pass over all our lights.

All we really need is the position, normal, and diffuse color at any given point; with those stored, the lighting calculation for a given fragment can be added into another buffer, one light at a time. We can accomplish this with a Framebuffer Object, which allows us to write to different color attachments. After storing that information away, we can then do another pass over the scene to draw our lights and compose the scene back together. The entire algorithm is three passes: Ambient Pass, Lighting Pass, and Composition Pass.

[Figures: Geometry Buffers (Velocity and Depth not pictured) and Passes]

The velocity buffer is only used for fragment motion blur. It was relatively easy to add, but difficult to master. I don’t think my current implementation is ideal, but generally it looks better than without. There are still a few edge cases where motion blur does not look correct. I don’t present it here because generally there are only values if objects are in motion (or the camera is moving). For this scene, I wanted to keep the camera still for comparison – so there is no motion. Hence the velocity buffer is all 0s – completely black. Introducing motion will cause fragments to light up based on how far the fragment moves per frame.

You can see that the final image is appropriate, however it’s far from a complete pipeline. We still aren’t doing proper gamma correction in any of these screenshots. Since I don’t claim to fully understand the need for gamma correction yet (I’m still learning this stuff, after all – I will look into that for a future project) I will blindly accept the teachings of a section from GPU Gems 3 which claims you can simply manipulate the final colors to get proper gamma correction. Since I have not delved too deeply into understanding this, I will provide an image showing the final result gamma corrected to the best of my knowledge.

Though it’s a little hard to capture since it relies on motion, here is a screenshot of the motion blur as I back the camera through the scene.

## Generating the GBuffer

The first thing you need to do for Deferred Rendering is generate the GBuffer. The GBuffer (or Geometry Buffer) is actually a set of buffers that represent the final information of the geometry that actually makes its way to the screen. If you look above, you can see that there are several “buffers”. But even in our case, we don’t actually allocate six buffers’ worth of information. This particular implementation uses only 3 buffers, and even with the crunching we’re already doing to save space, we can crunch this down even further. (Though I won’t discuss those opportunities at length here, I will mention them.)

The current idea of the GBuffer is a set of buffers with different storage types that allow us to pack information away until a future date. The way I’ve decided to pack this information away is as follows (The xN represents how many positions in a RGBA texture the value is stored in):

| Attachment | Contents |
| --- | --- |
| Geometry Attachment | Normal (x2), Specular Exponent (x1), Depth (x1) |
| Material Attachment | Diffuse (x3), Specular Average (x1) |
| Dynamics Attachment | Velocity (x2), (Unused) (x2) |

Though we don’t need to name the different buffers, I find it’s easier to name them for reference. Notice I could actually keep the exact specular colors in the Dynamics Attachment, but instead I just leave it empty. In this case, I simply allocate less storage space for the Dynamics Attachment, but in other cases I could increase the amount of data stored.
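Storing a normal in only two channels means it has to be encoded somehow. This post does not say which encoding is used, so the sketch below swaps in octahedral encoding purely as an illustration of the idea; it is one common choice, not necessarily the one used here. It is written in Python for readability (a shader version would follow the same arithmetic), and the function names are hypothetical.

```python
import math


def _sign(v: float) -> float:
    # Sign that treats 0 as positive, so the fold below never collapses to 0.
    return 1.0 if v >= 0.0 else -1.0


def encode_normal(n):
    """Pack a non-zero 3D normal into two values in [-1, 1] (octahedral mapping)."""
    x, y, z = n
    s = abs(x) + abs(y) + abs(z)        # project onto the octahedron |x|+|y|+|z| = 1
    x, y, z = x / s, y / s, z / s
    if z < 0.0:                          # fold the lower hemisphere over the diagonals
        x, y = (1.0 - abs(y)) * _sign(x), (1.0 - abs(x)) * _sign(y)
    return (x, y)


def decode_normal(e):
    """Recover a unit-length normal from the two stored values."""
    x, y = e
    z = 1.0 - abs(x) - abs(y)
    t = max(-z, 0.0)                     # non-zero only for the folded hemisphere
    x += -t if x >= 0.0 else t
    y += -t if y >= 0.0 else t
    length = math.sqrt(x * x + y * y + z * z)
    return (x / length, y / length, z / length)
```

The two encoded values would still need remapping from [-1, 1] to whatever range the chosen texture format stores.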

Generating the GBuffer requires a shader pass over all of the objects in the scene. I’ve set this up to be instanced, and am using a Uniform Buffer Object to share uniform information between shaders. I’ve extended the functionality of GLSL by creating a pre-parser that adds an #include directive. So be careful and note that the #include directive does not actually exist in GLSL; I am just parsing for it and performing an inline expansion before I pass the source to the GPU. This is relatively easy, especially if you’ve ever made any kind of preprocessor or parser before, and it is basically required for GLSL to be extensible.
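For a sense of how little such a pre-parser needs, here is a minimal sketch in Python. It is not the pre-parser used in this project; the function name, the accepted `#include <name>` / `#include "name"` syntax, and the use of a dictionary in place of real files are all assumptions made for the example.

```python
import re

# Matches a whole line of the form: #include <name>  or  #include "name"
_INCLUDE = re.compile(r'^\s*#include\s+[<"]([^>"]+)[>"]\s*$')


def expand_includes(name, sources, _stack=()):
    """Return the text of sources[name] with every #include expanded inline.

    `sources` maps a shader name to its source text (a stand-in for the
    file system). Circular includes raise ValueError instead of recursing.
    """
    if name in _stack:
        chain = " -> ".join(_stack + (name,))
        raise ValueError("circular #include: " + chain)
    expanded = []
    for line in sources[name].splitlines():
        match = _INCLUDE.match(line)
        if match:
            # Replace the directive line with the included file's expanded text.
            expanded.append(expand_includes(match.group(1), sources, _stack + (name,)))
        else:
            expanded.append(line)
    return "\n".join(expanded)


example_sources = {
    "math.glsl": "float saturate(float x) { return clamp(x, 0.0, 1.0); }",
    "light.frag": "#version 330\n#include <math.glsl>\nvoid main() {}",
}
print(expand_includes("light.frag", example_sources))
```

The expanded string is what would be handed to the shader compiler. A fuller version would probably also emit `#line` directives so that compiler errors point back at the original files.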

The GBuffer pass is not as efficient as it could be. Below I outline some possible changes, which may (or may not) even be possible, and which may (or may not) speed up the render cycle.

1. It’s probably more efficient to just do an entire pass over the whole screen for our ambient light and atmospheric attenuation. This was just easier to draft up.
2. The position is derived from the depth we encode into the fGeometry attachment. We could instead rely on the attached depth buffer for this information.
3. Currently there are two VBOs per object; it’s possible to have only two dynamic VBOs for vertices and indices, which can be controlled via parameters to the glDrawElements* functions.
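Point 2 relies on the fact that a position can be rebuilt from a depth value plus the fragment’s screen coordinate. The exact depth encoding used here is not shown, so the Python sketch below makes its own assumptions: depth is stored as linear view-space distance along the view direction, the camera looks down -z (the usual OpenGL convention), and the projection is a symmetric perspective frustum. All names are hypothetical.

```python
import math


def project_to_uv(point, fov_y, aspect):
    """Project a view-space point to screen UV in [0, 1] plus its linear depth."""
    x, y, z = point
    depth = -z                               # camera looks down -z
    half_height = math.tan(fov_y / 2.0)      # half frustum height at depth 1
    ndc_x = x / (depth * half_height * aspect)
    ndc_y = y / (depth * half_height)
    return ((ndc_x + 1.0) / 2.0, (ndc_y + 1.0) / 2.0), depth


def reconstruct_position(uv, depth, fov_y, aspect):
    """Rebuild the view-space position from screen UV and stored linear depth."""
    half_height = math.tan(fov_y / 2.0)
    ndc_x = uv[0] * 2.0 - 1.0                # back from [0, 1] to [-1, 1]
    ndc_y = uv[1] * 2.0 - 1.0
    return (ndc_x * half_height * aspect * depth,
            ndc_y * half_height * depth,
            -depth)
```

Reading from the hardware depth buffer instead would add one more step, because those values are non-linear and have to be linearized with the near and far planes first.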

And there is even more (thanks to OpenGL 4+) that needs to be examined in maintaining these buffers. Some questions I have that I have yet to test or answer are:

1. Would it be any better/worse to simply keep a dynamic-sized Shader Storage Buffer Object for storing our GBuffer information?
2. If not, materials could be stored (since they’re shared) in a SSBO, and the per-instance variable would be the index. This could be stored in the GBuffer instead of actual values.
3. As mentioned by this OpenGL SuperBible article, it might help even further to move our samplers into bindless texture methodology. This will help us share the buffers between shaders, as well.
4. Indirect drawing with a MultiIndirect buffer could also speed things up; that way we only have to bind a buffer of commands for instanced rendering and tell the GPU to go!
5. Information is uploaded using mapping, but it’s possible to only map the pointer to our VBO buffers once using Persistent Coherent Buffer Ranges.

In order to check each bit of information, another shader is required that outputs the value returned from the buffer. These shaders are very simple: they just grab the decoded value from the texture samplers and draw it right to the screen. Some of them do a little extra arithmetic to make the results more appealing (for example, I take the absolute value of the normals to make them “prettier”).

## Generating the Light Accumulation Buffer

The next step after creating the GBuffer is to turn on blend mode, and turn off depth tests. We’re going to add values to another buffer which accumulates light for us. This pass is also instanced, which is ideal for point lights. The instanced point light GLSL shader can be found below:

Special Note: Under GPU-timing tests (using Timer Query Objects), I’ve found that discarding fragments that are not within range is faster than passing vec4(0.0) through fFragColor. I imagine these results are GPU-specific, but I like discard better. Also note that discard will not cease execution, so you must place the discard within a branching if-else statement for any actual gain.
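To make the accumulation step concrete, here is a CPU-side sketch in Python. It is not the GLSL shader referred to above. It models a one-dimensional row of fragments, treats additive blending as `buffer[i] += contribution`, and uses a made-up linear falloff, since the actual light equation is not shown here. It also shows why discarding and writing zero produce the same picture under additive blending, so the choice between them is purely a performance question.

```python
def accumulate(positions, lights, use_discard=True):
    """Accumulate light over a row of fragment positions.

    Each light is (position, radius, intensity). The linear falloff is an
    assumption for illustration only.
    """
    buffer = [0.0] * len(positions)
    for light_pos, radius, intensity in lights:
        for i, p in enumerate(positions):
            distance = abs(p - light_pos)
            if distance > radius:
                if use_discard:
                    continue              # like `discard`: nothing is written
                contribution = 0.0        # like writing vec4(0.0)
            else:
                contribution = intensity * (1.0 - distance / radius)
            buffer[i] += contribution     # additive blending: dst = dst + src
    return buffer


row = [0.0, 1.0, 2.0, 3.0, 4.0]
lights = [(1.0, 2.0, 1.0), (3.0, 2.0, 0.5)]
print(accumulate(row, lights))  # [0.5, 1.0, 0.75, 0.5, 0.25]
```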

The last part of the code can be manipulated for debugging purposes. Instead of simply discarding or setting the calculated light value, we can instead pass some color information back to see some useful information. In order to test that our Deferred Renderer isn’t calculating more information than it needs to, we simply modify the code as such:

The two small modifications will show us two important things:

• White represents all of the areas in which the Fragment Shader is being executed (but the value is being discarded).
Objects which are entirely occluded by white participate in none of the lighting calculation.
• Pink represents all of the areas in which the Fragment Shader is being executed (and the lighting calculation is performed).
Anything being lit should appear in some manner with additive blending of red.

I should be able to watch this sphere, which is a rough approximation of the perfect sphere formed by the point light, and the pink lighting executions should never be “cut off” by the boundary of the white discard executions. Since the sphere mesh is only an approximation, you cannot scale it by just the radius of the light’s influence and expect perfect results; the scale needs an extra factor on top of the radius.

## Perfect Scalar for UV Sphere Light Approximations

The million dollar question is: What is the scalar value we have to scale our mesh by?

Well, that’s complicated. It depends entirely on the mesh. I have assumed that the type of sphere I’m using is a UV Sphere, where every plotted point is exactly 1 unit from the center of the sphere. It looks something like this:

Such a sphere has three configurable variables upon creation:

• The segments of the sphere – this is how many parts the sphere is split into along the longitude of the sphere.
• The rings of the sphere – this is how many parts the sphere is split into along the latitude of the sphere.
• The size of the sphere – this is the distance from the center that any plotted point must be. (In our case, 1)

If you consider a top-down view of this problem, you can figure out where the loss of precision comes from (assume segment = 4):

Our approximate sphere is scaled in such a way that it fits perfectly inside the volume of light. What we want is to have the volume represented by our sphere approximation to contain the light volume – not the other way around. Let’s look at a section of the above light volume approximation. (I have stretched it to emphasize that it forms a right triangle.)

There are some knowns here:

• The length of either side of the section is 1 (unit sphere).
• The length of the dotted line is also 1 (imagine we had subdivided more).
• The arc-length of the upper arc is $\frac{2*\pi}{N_{Segments}}$

With this knowledge, we can say a bit more about this (and by association, every) segment of this approximation.

• Using polar coordinates, we can say the linear distance between the beginning and end of the arc is $\sqrt{r_{1}^{2} + r_{2}^{2} - 2 * r_{1} * r_{2} * cos(\theta_{2} - \theta_{1})}$
• We can then deduce that one half of that linear length is the right side of each of the two right triangles we formed.

This paints a pretty telling picture:

We now know the inner triangle’s right side F` and the length of the inner triangle’s hypotenuse (1). From these, Pythagoras’s Theorem gives the length of the inner triangle’s base (represented by the dotted line along L). With that much information, we can form a ratio of the dotted line L to the full line L and, using the ratios for similar triangles, compute the length of the outer triangle’s right side F. Finally, we use Pythagoras’s Theorem once more with the far right side F and the full line along L (which is 1) to find the length of the outer hypotenuse, which is the value we want. *Phew!*
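The steps above translate almost line for line into code. The Python sketch below follows them literally for the 2D case; the function and variable names are made up for the example. Carrying the algebra through, the half chord is $sin(\pi / N)$ and the dotted base is $cos(\pi / N)$, so the whole chain collapses to $1 / cos(\pi / N)$ for $N$ segments, which the last line checks.

```python
import math


def scale_2d(segments: int) -> float:
    """Scale that makes a regular N-gon with unit vertices enclose the unit circle."""
    if segments < 3:
        raise ValueError("need at least 3 segments to enclose anything")
    theta = 2.0 * math.pi / segments                 # angle spanned by one segment
    chord = math.sqrt(2.0 - 2.0 * math.cos(theta))   # polar distance with r1 = r2 = 1
    f_inner = chord / 2.0                            # F`: right side of the inner triangle
    l_inner = math.sqrt(1.0 - f_inner ** 2)          # dotted part of L (Pythagoras)
    f_outer = f_inner / l_inner                      # similar triangles, full L = 1
    return math.sqrt(1.0 + f_outer ** 2)             # hypotenuse of the outer triangle


print(abs(scale_2d(4) - 1.0 / math.cos(math.pi / 4)) < 1e-12)  # True
```

For four segments this gives $\sqrt{2}$, which matches the top-down picture: a square with corners on the unit circle has to grow by $\sqrt{2}$ before its edges touch the circle.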

However, we’re not done. The math we just did is an abstraction of the full problem. We just solved the problem in 2D – but the problem is more complicated than that. If we apply the above algorithm to a UV Sphere, and we provide the number of segments, this is about the scale we get:

Close, but not quite – so what happened?

1. We assumed that the sphere is split into segments with one of its rings falling along one of the major axes (X, Y, or Z).
2. We assumed we would only be looking down at the sphere, and did not take into account the percentage error that the rings themselves introduce.

In order to fix this, instead of running the algorithm in 2D, we have to run the algorithm along a slice of the sphere in 3D. The only real change comes in the form of calculating the diagonal distance between the largest face on the sphere approximation. The equation for distance in 3D polar coordinates is as follows:

$\sqrt{r_{1}^{2} + r_{2}^{2} - 2 * r_{1} * r_{2} * (cos(\gamma_{2})*cos(\gamma_{1})*cos(\theta_{2} - \theta_{1}) + sin(\gamma_{2}) * sin(\gamma_{1}))}$
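As a sanity check on this distance formula, the Python sketch below evaluates it in the standard law-of-cosines form (with the product $r_{1} * r_{2}$ in the last term) and provides a Cartesian conversion to compare against. It assumes $\gamma$ is latitude measured from the equator and $\theta$ is longitude, with y as the up axis; the function names are hypothetical.

```python
import math


def chord_length(r1, gamma1, theta1, r2, gamma2, theta2):
    """Straight-line distance between two points given as (radius, latitude, longitude)."""
    cos_angle = (math.cos(gamma1) * math.cos(gamma2) * math.cos(theta2 - theta1)
                 + math.sin(gamma1) * math.sin(gamma2))
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(0.0, r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * cos_angle))


def to_cartesian(r, gamma, theta):
    """Convert (radius, latitude, longitude) to (x, y, z) with y pointing up."""
    return (r * math.cos(gamma) * math.cos(theta),
            r * math.sin(gamma),
            r * math.cos(gamma) * math.sin(theta))
```

Comparing `chord_length(...)` with `math.dist(to_cartesian(...), to_cartesian(...))` for a few point pairs is a quick way to confirm that the angle conventions line up before building the scale function on top of it.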

So there are a few more angles here. Where we could easily identify $\theta$ in the previous equation as the angle formed by the segment divisions, here things seem a little more complicated.

Essentially we’re working with latitude and longitude. Our $\theta$ remains unchanged; that is our longitude. Regardless of what latitude we’re looking at, longitude remains unchanged. The tricky part is figuring out $\gamma$, our latitude.

If you think about it, we want the largest plane formed by the ringed subdivisions through our sphere. This is going to fall somewhere around the center – but this kind of splits our function into an if-check.

We can either have a number of rings in which one ring falls on one major axis, and thus the equation is simplified into:

$\sqrt{2 - 2 * cos(\gamma)*cos(\theta)}$

Or we could have the case where there is no ringed subdivision falling along any major axes, in which case we have:

$\sqrt{2 - 2 * (cos(\gamma_{2})*cos(\gamma_{1})*cos(\theta_{2}) + sin(\gamma_{2}) * sin(\gamma_{1}))}$

We can then have a piecewise function: if the number of rings passed in is even, one function is used; otherwise, the other is used. Or if we write it out functionally:

This is a pretty lengthy function, and it’s not simplified. If we plug the math behind this equation into WolframAlpha, we can simplify the entire function to the following:

In some cases this might seem like it overestimates, but it is 100% accurate. Just keep in mind that the area we want to encompass is represented by a sphere, so if you test the equation along a plane, it may (for some combinations of segments and rings) seem like it overestimates. Given the proper testing condition, any scalar matched up to its appropriate UV Sphere will accurately, and entirely, encompass the spherical volume of the point light.

It’s easiest to check by constructing a UV Sphere which has an odd number of rings, that way we can place an xz-plane at 0 along the y-axis, and we can see that the lighting calculation just barely touches the edge before the execution volume grows again.

## Update (3/4/15)

After playing around with my program a little more, I’ve come to discover that a few of the things I mentioned in this article aren’t 100% accurate. Most notably the discard statement. It’s true that if you must include an if-branch, the discard statement should be used to inform the GPU that it shouldn’t write to any buffers. However, it’s more accurate to say that the if-checks should be avoided altogether. Not having if-checks is such a major gain, that it completely dwarfs the savings of the discard statement. My GLSL shows older point light code which still includes these if-checks, however my more recent attempts remove the if-check completely, and the gain is substantial.

Further tests have shown me that functions which limit the range of a number pose less of a threat than the if-checks themselves. Though this seems counter-intuitive, I don’t believe much is to be gained by eliminating calls to max(), min(), or clamp(). So if you need to introduce any kind of branching into a light equation, try to do it numerically.
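As an illustration of what “numerically” can mean, the Python sketch below writes the same range cutoff twice: once with an if-check and once with a clamp. The linear falloff is a made-up stand-in for the real light equation, and Python says nothing about GPU timing; the only point is that the two forms return the same values, so the branch can be dropped without changing the image.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Same behaviour as GLSL clamp() for lo <= hi."""
    return max(lo, min(x, hi))


def falloff_branching(distance: float, radius: float) -> float:
    """Range cutoff written as an explicit if-check."""
    if distance > radius:
        return 0.0
    return 1.0 - distance / radius


def falloff_numeric(distance: float, radius: float) -> float:
    """Same cutoff with no if-check: out-of-range values clamp to zero."""
    return clamp(1.0 - distance / radius, 0.0, 1.0)
```

Both functions agree for any non-negative distance, and multiplying the light color by `falloff_numeric(...)` leaves out-of-range fragments with a zero contribution.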

Another point to be made is that my lights are instanced – there is a gain to that when there are many lights, but in general are we going to meet that need? It would be best to be able to turn this off/on (via glDrawElementsInstanced and glDrawElements). The setup of glDrawElementsInstanced might outweigh the savings of it – unless we have a very complicated light scene.

## Final Renders

After all is said and done, you can flip through all of the different buffers and see some interesting stuff. The final composition screen is made by simply adding the ambient buffer to the light buffer and applying the motion blur (if need be).
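Per pixel, that composition is tiny. The Python sketch below adds the two buffers for one RGB value and clamps the result. Motion blur is left out, and the optional gamma step uses the common approximation of raising the final color to 1/2.2; both the function name and that exponent are assumptions for the example rather than values taken from this project.

```python
def compose(ambient, light, gamma=None):
    """Combine one pixel of the ambient and light buffers into a final color."""
    # Add the buffers channel by channel and clamp to the displayable range.
    color = [min(max(a + l, 0.0), 1.0) for a, l in zip(ambient, light)]
    if gamma is not None:
        # Approximate gamma correction applied to the final color only.
        color = [c ** (1.0 / gamma) for c in color]
    return tuple(color)


linear = compose((0.1, 0.1, 0.1), (0.2, 0.3, 0.9))
corrected = compose((0.1, 0.1, 0.1), (0.2, 0.3, 0.9), gamma=2.2)
```

In the gamma-corrected result the mid-tones come out brighter, while pure black and pure white are unchanged.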

Here are a few screen captures that I found to be interesting:

Rendering only the light pass with 100 lights of radius 10 in a ring.

10 lights, colored random float values [0,1].

A center light for extra ambiance. :P

Several Stanford Bunny objs, though they are quite small.

Lamps! Why not?