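(Translation) Forward vs Deferred vs Forward+ Rendering with DirectX 11
Articles that compare the Forward, Deferred, and Forward+ rendering pipelines thoroughly and concretely are rare online, so I took the time to translate this one and share it.
Original article: https://www.3dgep.com/forward-plus/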
In this article, I will analyze and compare three rendering algorithms:
在本文中,我将分析并比较三种渲染算法:
- Forward Rendering Forward 渲染
- Deferred Shading 延迟着色
- Forward+ (Tiled Forward Rendering) Forward+(平铺前向渲染)
Contents
- 1 Introduction 1 介绍
- 2 Forward Rendering 2 前向渲染
- 3 Deferred Shading 3 延迟着色
- 4 Forward+ 4 Forward+
- 5 Experiment Setup and Performance Results 5 实验设置和性能结果
- 6 Future Considerations 6 未来考虑
- 7 Conclusion 7 结论
- 8 Download the Demo 下载演示
- 9 References 参考资料
Introduction 介绍
Forward rendering works by rasterizing each geometric object in the scene. During shading, a list of lights in the scene is iterated to determine how the geometric object should be lit. This means that every geometric object has to consider every light in the scene. Of course, we can optimize this by discarding geometric objects that are occluded or do not appear in the view frustum of the camera. We can further optimize this technique by discarding lights that are not within the view frustum of the camera. If the range of the lights is known, then we can perform frustum culling on the light volumes before rendering the scene geometry. Object culling and light volume culling provide limited optimizations for this technique, and light culling is often not practiced when using a forward rendering pipeline. It is more common to simply limit the number of lights that can affect a scene object. For example, some graphics engines will perform per-pixel lighting with the closest two or three lights and per-vertex lighting on three or four of the next closest lights. In the traditional fixed-function rendering pipelines provided by OpenGL and DirectX, the number of dynamic lights active in the scene at any time was limited to about eight. Even with modern graphics hardware, forward rendering pipelines are limited to about 100 dynamic scene lights before noticeable frame-rate issues start appearing.
前向渲染通过对场景中的每个几何对象进行光栅化来工作。在着色过程中,会迭代场景中的光源列表,以确定如何照亮几何对象。这意味着每个几何对象都必须考虑场景中的每个光源。当然,我们可以通过丢弃被遮挡或不出现在相机视锥体中的几何对象来优化这一过程。我们可以通过丢弃不在相机视锥体内的光源进一步优化这一方式。如果光源的范围已知,那么我们可以在渲染场景几何之前对光体积执行视锥体裁剪。对象裁剪和光体积裁剪为这一方式提供了有限的优化,而在使用前向渲染管线时通常不会实践光裁剪。更常见的做法是简单地限制可以影响场景对象的光源数量。例如,一些图形引擎将对最接近的两三个光源执行每像素光照,并对接下来最接近的三四个光源执行每顶点光照。 在 OpenGL 和 DirectX 提供的传统固定功能渲染管线中,场景中任何时候活动的动态光源数量被限制在大约八个左右。即使使用现代图形硬件,前向渲染管线在出现明显帧率问题之前仅限于大约 100 个动态场景光源。
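To make the "limit the number of lights per object" idea concrete, here is a minimal CPU-side sketch in C++. It is only an illustration, not the article's code: `Float3`, `DistanceSquared`, and `SelectClosestLights` are hypothetical names, and a real engine would likely also take light range and importance into account.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// Squared distance between two points (avoids a square root).
inline float DistanceSquared(const Float3& a, const Float3& b)
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Returns the indices of the (at most) maxLights lights closest to
// objectCenter, ordered from nearest to farthest.
inline std::vector<std::size_t> SelectClosestLights(const std::vector<Float3>& lightPositions,
                                                    const Float3& objectCenter,
                                                    std::size_t maxLights)
{
    std::vector<std::size_t> indices(lightPositions.size());
    for (std::size_t i = 0; i < indices.size(); ++i)
        indices[i] = i;

    const std::size_t count = std::min(maxLights, indices.size());
    // Only the first `count` entries need to be in sorted order.
    std::partial_sort(indices.begin(),
                      indices.begin() + static_cast<std::ptrdiff_t>(count),
                      indices.end(),
                      [&](std::size_t a, std::size_t b) {
                          return DistanceSquared(lightPositions[a], objectCenter) <
                                 DistanceSquared(lightPositions[b], objectCenter);
                      });
    indices.resize(count);
    return indices;
}
```

An engine following the scheme described above could, for example, use the first two or three returned indices for per-pixel lighting and the next few for per-vertex lighting.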
Deferred shading, on the other hand, works by rasterizing all of the scene objects (without lighting) into a series of 2D image buffers that store the geometric information that is required to perform the lighting calculations in a later pass. The information that is stored in the 2D image buffers is:
延迟着色则通过将场景中的所有对象(不带光照)光栅化到一系列 2D 图像缓冲区中,这些缓冲区存储了执行后续光照计算所需的几何信息。存储在 2D 图像缓冲区中的信息包括:
- screen space depth 屏幕空间深度
- surface normals 表面法线
- diffuse color 漫反射颜色
- specular color and specular power 镜面颜色和镜面率
The combination of these 2D image buffers is referred to as the Geometric Buffer (or G-buffer) [1].
这些 2D 图像缓冲区的组合被称为几何缓冲区(或 G-Buffer)[1]。
Other information could also be stored into the image buffers if it is required for the lighting calculations that will be performed later but each G-buffer texture requires at least 8.29 MB of texture memory at full HD (1080p) and 32-bits per pixel.
如果需要进行后续的光照计算,其他信息也可以存储到图像缓冲区中,但每个 G-Buffer纹理在全高清(1080p)和每像素 32 位的情况下至少需要 8.29 MB 的纹理内存。
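The 8.29 MB figure follows directly from the resolution and pixel size: 1920 × 1080 pixels × 4 bytes = 8,294,400 bytes. The short C++ sketch below checks that arithmetic at compile time. The four-texture G-buffer total is my own assumption (one 32-bit texture for each of the four items listed above), and `TextureBytes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for one texture of the given size and pixel format.
constexpr std::uint64_t TextureBytes(std::uint32_t width, std::uint32_t height,
                                     std::uint32_t bitsPerPixel)
{
    return static_cast<std::uint64_t>(width) * height * (bitsPerPixel / 8);
}

// One 32-bit texture at 1920x1080: 8,294,400 bytes, about 8.29 MB (1 MB = 10^6 bytes).
constexpr std::uint64_t kFullHdTextureBytes = TextureBytes(1920, 1080, 32);
static_assert(kFullHdTextureBytes == 8294400, "1920 * 1080 * 4 bytes");

// Assumed G-buffer layout: four such textures (depth, normals, diffuse, specular).
constexpr std::uint64_t kGBufferBytes = 4 * kFullHdTextureBytes;
static_assert(kGBufferBytes == 33177600, "about 33.2 MB in total");
```

Under that assumption, every additional 32-bit G-buffer texture adds another 8.29 MB at 1080p, which is why the amount of data stored per pixel is worth keeping small.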
After the G-buffer has been generated, the geometric information can then be used to compute the lighting information in the lighting pass. The lighting pass is performed by rendering each light source as a geometric object in the scene. Each pixel that is touched by the light’s geometric representation is shaded using the desired lighting equation.
生成 G-Buffer后,几何信息可以用来在光照pass中计算光照信息。光照pass通过将每个光源渲染为场景中的几何对象来执行。每个被光源几何表示触及的像素都使用所需的光照方程进行着色。
The obvious advantage with the deferred shading technique compared to forward rendering is that the expensive lighting calculations are only computed once per light per covered pixel. With modern hardware, the deferred shading technique can handle about 2,500 dynamic scene lights at full HD resolutions (1080p) before frame-rate issues start appearing when rendering only opaque scene objects.
与前向渲染相比,延迟着色方式的明显优势在于昂贵的光照计算仅针对每个受覆盖像素的光源计算一次。使用现代硬件,延迟着色方式可以处理约 2500 个动态场景光源,分辨率为全高清(1080p),在仅渲染不透明场景对象时才会出现帧率问题。
One of the disadvantages of using deferred shading is that only opaque objects can be rasterized into the G-buffers. The reason for this is that multiple transparent objects may cover the same screen pixels but it is only possible to store a single value per pixel in the G-buffers. In the lighting pass the depth value, surface normal, diffuse and specular colors are sampled for the current screen pixel that is being lit. Since only a single value from each G-buffer is sampled, transparent objects cannot be supported in the lighting pass. To circumvent this issue, transparent geometry must be rendered using the standard forward rendering technique, which limits either the amount of transparent geometry in the scene or the number of dynamic lights in the scene. A scene which consists of only opaque objects can handle about 2000 dynamic lights before frame-rate issues start appearing.
使用延迟着色的一个缺点是只有不透明物体可以被光栅化到 G-Buffer中。原因在于多个透明物体可能覆盖相同的屏幕像素,但在 G-Buffer中每个像素只能存储单个值。在光照pass中,为正在照亮的当前屏幕像素采样深度值、表面法线、漫反射和镜面颜色。由于只从每个 G-Buffer中采样单个值,透明物体无法在光照pass中受到支持。为了规避这个问题,透明几何体必须使用标准的前向渲染方式进行渲染,这限制了场景中透明几何体的数量或场景中动态光源的数量。一个只包含不透明物体的场景在出现帧速率问题之前可以处理大约 2000 个动态光源。
Another disadvantage of deferred shading is that only a single lighting model can be simulated in the lighting pass. This is because only a single pixel shader can be bound when rendering the light geometry. This is usually not an issue for pipelines that make use of übershaders, where rendering with a single pixel shader is the norm. However, if your rendering pipeline takes advantage of several different lighting models implemented in various pixel shaders, then switching it to deferred shading will be problematic.
延迟着色的另一个缺点是在光照pass中只能模拟单个光照模型。这是因为在渲染光几何体时只能绑定单个像素着色器。对于使用超级着色器的流水线来说,使用单个像素着色器进行渲染是正常的,通常不是问题,但是如果您的渲染流水线利用各种像素着色器实现了几种不同的光照模型,那么将渲染流水线切换到使用延迟着色将会成为一个问题。
Forward+ [2][3] (also known as tiled forward shading) [4][5] is a rendering technique that combines forward rendering with tiled light culling to reduce the number of lights that must be considered during shading. Forward+ primarily consists of two stages:
Forward+(也称为平铺前向着色)是一种渲染方式,它将前向渲染与平铺光照剔除相结合,以减少在着色过程中必须考虑的光源数量。Forward+主要包括两个阶段:
- Light culling 光照剔除
- Forward rendering 前向渲染
The first pass of the Forward+ rendering technique uses a uniform grid of tiles in screen space to partition the lights into per-tile lists.
Forward+渲染方式的第一遍使用屏幕空间中的均匀网格划分灯光为每个瓦片列表。
The second pass uses a standard forward rendering pass to shade the objects in the scene, but instead of looping over every dynamic light in the scene, the current pixel’s screen-space position is used to look up the list of lights in the grid that was computed in the previous pass. Light culling provides a significant performance improvement over the standard forward rendering technique because it greatly reduces the number of redundant lights that must be iterated to correctly light the pixel. Both opaque and transparent geometry can be handled in a similar manner without a significant loss of performance, and handling multiple materials and lighting models is natively supported with Forward+.
第二遍使用标准的前向渲染传递来着色场景中的物体,但是不是遍历场景中的每个动态光源,而是使用当前像素的屏幕空间位置来查找在前一遍计算的网格中的灯光列表。光源剔除大大提高了性能,因为它大大减少了必须迭代的冗余光源数量,以正确照亮像素。不透明和透明几何体可以以类似的方式处理,而不会显著降低性能,并且使用 Forward+本地支持处理多种材质和光照模型。
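The lookup from a pixel's screen-space position to its tile can be sketched as follows. This is a minimal C++ illustration, not the shader code itself: the 16×16 tile size and the names `kTileSize`, `TileCount`, and `TileIndex` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kTileSize = 16; // assumed tile edge length in pixels

// Number of tiles needed to cover `pixels` pixels (rounds up for a partial tile).
constexpr std::uint32_t TileCount(std::uint32_t pixels)
{
    return (pixels + kTileSize - 1) / kTileSize;
}

// Flattened index of the screen-space tile that contains pixel (x, y).
inline std::uint32_t TileIndex(std::uint32_t x, std::uint32_t y, std::uint32_t screenWidth)
{
    assert(x < screenWidth);
    const std::uint32_t tileX = x / kTileSize;
    const std::uint32_t tileY = y / kTileSize;
    return tileY * TileCount(screenWidth) + tileX;
}
```

With these assumptions a 1920×1080 target is covered by 120 × 68 tiles, and every pixel inside the same 16×16 block maps to the same per-tile light list.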
Since Forward+ incorporates the standard forward rendering pipeline into its technique, Forward+ can be integrated into existing graphics engines that were initially built using forward rendering. Forward+ does not make use of G-buffers and does not suffer the limitations of deferred shading. Both opaque and transparent geometry can be rendered using Forward+. Using modern graphics hardware, a scene consisting of 5,000 – 6,000 dynamic lights can be rendered in real-time at full HD resolutions (1080p).
由于 Forward+将标准的前向渲染管线纳入其方式中,因此 Forward+可以集成到最初使用前向渲染构建的现有图形引擎中。Forward+不使用 G-Buffer,也不受延迟着色的限制。透明和不透明几何体均可使用 Forward+进行渲染。在现代图形硬件的支持下,一个由 5,000 至 6,000 个动态光源组成的场景可以以全高清分辨率(1080p)实时渲染。
In the remainder of this article, I will describe the implementation of these three techniques:
在本文的其余部分,我将描述这三种方式的实现:
- Forward Rendering Forward 渲染
- Deferred Shading 延迟着色
- Forward+ (Tiled Forward Rendering) Forward+(平铺前向渲染)
I will also show performance statistics under various circumstances to try to determine under which conditions one technique performs better than the others.
我还将展示在各种情况下的性能统计数据,以确定在哪些条件下一种方式的表现优于其他方式。
Definitions 定义
In the context of this article, it is important to define a few terms so that the rest of the article is easier to understand. If you are familiar with the basic terminology used in graphics programming, you may skip this section.
在本文的背景下,重要的是定义一些术语,以便更容易理解文章的其余部分。如果您熟悉图形编程中使用的基本术语,可以跳过本节。
The scene refers to a nested hierarchy of objects that can be rendered. For example, all of the static objects that can be rendered will be grouped into a scene. Each individual renderable object is referenced in the scene using a scene node. Each scene node references a single renderable object (such as a mesh) and the entire scene can be referenced using the scene’s top-level node called the root node. The connection of scene nodes within the scene is also called a scene graph. Since the root node is also a scene node, scenes can be nested to create more complex scene graphs with both static and dynamic objects.
场景是指可以呈现的对象的嵌套层次结构。例如,所有可以呈现的静态对象将被分组到一个场景中。每个单独的可渲染对象在场景中使用场景节点进行引用。每个场景节点引用一个单个的可渲染对象(如网格),整个场景可以使用场景的顶级节点——根节点进行引用。场景中场景节点的连接也称为场景图。由于根节点也是一个场景节点,因此可以嵌套场景以创建具有静态和动态对象的更复杂的场景图。
A pass refers to a single operation that performs one step of a rendering technique. For example, the opaque pass is a pass that iterates over all of the objects in the scene and renders only the opaque objects. The transparent pass will also iterate over all of the objects in the scene but renders only the transparent objects. A pass could also be used for more general operations such as copying GPU resources or dispatching a compute shader.
一个pass是指执行渲染方式的一个步骤的单个操作。例如,不透明pass是一个遍历场景中所有对象并仅渲染不透明对象的pass。透明pass也会遍历场景中的所有对象,但仅渲染透明对象。pass也可以用于更一般的操作,比如复制 GPU 资源或dispatch compute shader。
A technique is the combination of several passes that must be executed in a particular order to implement a rendering algorithm.
方式是必须按特定顺序执行的几个步骤的组合,以实现渲染算法。
A pipeline state refers to the configuration of the rendering pipeline before an object is rendered. A pipeline state object encapsulates the following render state:
管线状态是指在对象渲染之前渲染管线的配置。管线状态对象封装了以下渲染状态:
- Shaders (vertex, tessellation, geometry, and pixel) 着色器(顶点、镶嵌、几何和像素)
- Rasterizer state (polygon fill mode, culling mode, scissor culling, viewports) 光栅化器状态(多边形填充模式、剔除模式、剪裁剔除、视口)
- Blend state 混合状态
- Depth/Stencil state 深度/模板状态
- Render target 渲染目标
DirectX 12 introduces a pipeline state object but my definition of the pipeline state varies slightly from the DirectX 12 definition.
DirectX 12 引入了管线状态对象,但我的管线状态定义与 DirectX 12 的定义略有不同。
Forward rendering
Forward rendering refers to a rendering technique that traditionally has only two passes:
正向渲染是一种传统上只有两个pass的渲染方式:
- Opaque Pass 不透明pass
- Transparent Pass 透明pass
The opaque pass will render all opaque objects in the scene ideally sorted front to back (relative to the camera) in order to minimize overdraw. During the opaque pass, no blending needs to be performed.
不透明pass将使场景中的所有不透明对象理想地按照从前到后(相对于摄像机)排序,以最小化过度绘制。在不透明pass期间,无需执行混合。
The transparent pass will render all transparent objects in the scene ideally sorted back to front (relative to the camera) in order to support correct blending. During the transparent pass, alpha blending needs to be enabled to allow for semi-transparent materials to be blended correctly with pixels already rendered to the render target’s color buffer.
透明pass将使场景中的所有透明对象理想地按照从后到前(相对于摄像机)排序,以支持正确的混合。在透明pass期间,需要启用 alpha 混合,以便半透明材质能够与已经渲染到渲染目标颜色缓冲区的像素正确混合。
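The two sort orders can be sketched on the CPU like this. It is a simplified C++ illustration under my own assumptions: `DrawItem` is a hypothetical record, and a single view-space depth per object stands in for whatever sort key an engine actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct DrawItem
{
    int   id;        // hypothetical handle to a renderable object
    float viewDepth; // distance from the camera along the view direction
};

// Opaque pass: nearest first, so the depth test can reject hidden pixels early.
inline void SortFrontToBack(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.viewDepth < b.viewDepth; });
}

// Transparent pass: farthest first, so nearer surfaces blend over farther ones.
inline void SortBackToFront(std::vector<DrawItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) { return a.viewDepth > b.viewDepth; });
}
```

Sorting whole objects by one depth value is only an approximation; intersecting or very large transparent objects can still blend in the wrong order.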
During forward rendering, all lighting is performed in the pixel shader together with all other material shading instructions.
在正向渲染期间,所有光照都在像素着色器中执行,同时执行所有其他材质着色指令。
Deferred shading
Deferred shading refers to a rendering technique that consists of three primary passes:
延迟着色是一种渲染方式,由三个主要pass组成:
- Geometry Pass 几何pass
- Lighting Pass 灯光pass
- Transparent Pass 透明pass
The first pass is the geometry pass which is similar to the opaque pass of the forward rendering technique because only opaque objects are rendered in this pass. The difference is that the geometry pass does not perform any lighting calculations but only outputs the geometric and material data to the G-buffer that was described in the introduction.
第一个pass是几何pass,类似于前向渲染方式中的不透明pass,因为在这个pass中只渲染不透明物体。不同之处在于几何pass不执行任何光照计算,而只将几何和材质数据输出到在介绍中描述的 G-Buffer。
In the lighting pass, the geometric volumes that represent the lights are rendered into the scene and the material information stored in the G-buffer is used to compute the lighting for the rasterized pixels.
在光照pass中,代表光源的几何体积被渲染到场景中,并且使用存储在 G-Buffer中的材质信息来计算光照以用于光栅化像素。
The final pass is the transparent pass. This pass is identical to the transparent pass of the forward rendering technique. Since deferred shading has no native support for transparent materials, transparent objects have to be rendered in a separate pass that performs lighting using the standard forward rendering method.
最终pass是透明pass。该pass与正向渲染方式的透明pass相同。由于延迟着色不支持透明材质,透明物体必须在单独的pass中渲染,该pass使用标准的正向渲染方法进行光照。
Forward+
Forward+ (also referred to as tiled forward rendering) is a rendering technique that consists of three primary passes:
Forward+(也称为平铺正向渲染)是一种渲染方式,由三个主要pass组成:
- Light Culling Pass 光照剔除pass
- Opaque Pass 不透明pass
- Transparent Pass 透明pass
As mentioned in the introduction, the light culling pass is responsible for sorting the dynamic lights in the scene into screen space tiles. A light index list is used to indicate which light indices (from the global light list) are overlapping each screen tile. In the light culling pass, two sets of light index lists will be generated:
正如在介绍中提到的,光照剔除pass负责将场景中的动态光源排序到屏幕空间的瓦片中。一个光源索引列表用于指示哪些光源索引(来自全局光源列表)与每个屏幕瓦片重叠。在光照剔除pass中,将生成两组光源索引列表:
- Opaque light index list 不透明光索引列表
- Transparent light index list 透明光索引列表
The opaque light index list is used when rendering opaque geometry and the transparent light index list is used when rendering transparent geometry.
在渲染不透明几何体时使用不透明光索引列表,在渲染透明几何体时使用透明光索引列表。
The opaque and transparent passes of the Forward+ rendering technique are identical to those of the standard forward rendering technique, but instead of looping over all of the dynamic lights in the scene, only the lights in the current fragment’s screen space tile need to be considered.
Forward+渲染方式的不透明和透明pass与标准前向渲染方式相同,但不是遍历场景中的所有动态光源,而是只需考虑当前片元屏幕空间瓦片中的光源。
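One way to picture a light index list is as a flat array of indices into the global light list, with each tile owning a contiguous range of it. The C++ sketch below assumes exactly that layout; `TileRange` and `AccumulateTileLights` are hypothetical names, and summing a scalar intensity stands in for evaluating the real lighting equation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed per-tile record: where this tile's indices start and how many there are.
struct TileRange
{
    std::uint32_t offset;
    std::uint32_t count;
};

// Visits only the lights assigned to one tile and sums their intensities.
inline float AccumulateTileLights(const std::vector<float>& lightIntensities, // global light list
                                  const std::vector<std::uint32_t>& lightIndexList,
                                  const TileRange& tile)
{
    assert(tile.offset + tile.count <= lightIndexList.size());
    float total = 0.0f;
    for (std::uint32_t i = 0; i < tile.count; ++i)
    {
        const std::uint32_t lightIndex = lightIndexList[tile.offset + i];
        total += lightIntensities[lightIndex];
    }
    return total;
}
```

The loop length is the tile's light count rather than the size of the global light list, which is where the savings over standard forward rendering come from.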
Light
A light refers to one of the following types of lights:
光指以下类型的光之一:
- Point light 点光源
- Spot light 聚光灯
- Directional light 定向光
All rendering techniques described in this article have support for these three light types. Area lights are not supported. The point light and the spot light are simulated as emanating from a single point of origin while the directional light is considered to emanate from a point infinitely far away emitting light everywhere in the same direction. Point lights and spot lights have a limited range after which their intensity falls off to zero. The fall-off of the intensity of the light is called attenuation. Point lights are geometrically represented as spheres, spot lights as cones, and directional lights as full-screen quads.
本文中描述的所有渲染方式都支持这三种光源类型。区域光不受支持。点光源和聚光灯被模拟为从单个起源点发出,而方向光被认为是从远处无限发光,朝着同一方向到处发光。点光源和聚光灯在超出一定范围后,其强度会衰减至零。光强度的衰减称为衰减。点光源在几何上被表示为球体,聚光灯为圆锥体,方向光为全屏四边形。
Let’s first take a more detailed look at the standard forward rendering technique.
让我们首先更详细地看一下标准的前向渲染方式。
Forward rendering is the simplest of the three lighting techniques and the most common technique used to render graphics in games. It is also the most computationally expensive technique for computing lighting and for this reason, it does not allow for a large number of dynamic lights to be used in the scene.
前向渲染是三种光照方式中最简单、在游戏中渲染图形最常用的方式。它也是计算光照最昂贵的方式,因此不允许在场景中使用大量动态光源。
Most graphics engines that use forward rendering will utilize various techniques to simulate many lights in the scene. For example, lightmapping and light probes are methods used to pre-compute the lighting contributions from static lights placed in the scene and store these contributions in textures that are loaded at runtime. Unfortunately, lightmapping and light probes cannot be used to simulate dynamic lights in the scene because the lights that were used to produce the lightmaps are often discarded at runtime.
大多数使用前向渲染的图形引擎会利用各种方式来模拟场景中的许多光源。例如,光照贴图和光探针是用于预先计算场景中静态光源的光照贡献并将这些光照贡献存储在纹理中,在运行时加载的方法。不幸的是,光照贴图和光探针无法用于模拟场景中的动态光源,因为用于生成光照贴图的光源通常在运行时被丢弃。
For this experiment, forward rendering is used as the ground truth to compare the other two rendering techniques. The forward rendering technique is also used to establish a performance baseline that can be used to compare the performance of the other rendering techniques.
对于这个实验,前向渲染被用作比较其他两种渲染方式的基准。前向渲染方式还被用来建立一个性能基准,可以用来比较其他渲染方式的性能。
Many functions of the forward rendering technique are reused in the deferred and Forward+ rendering techniques. For example, the vertex shader used in forward rendering is also used for both deferred shading and Forward+ rendering. The methods that compute the final lighting and material shading are also reused in all rendering techniques.
前向渲染方式的许多功能在延迟渲染和Forward+渲染方式中得到重复利用。例如,前向渲染中使用的顶点着色器也用于延迟着色和Forward+渲染。同时,计算最终光照和材质着色的方法在所有渲染方式中得到重复利用。
In the next section, I will describe the implementation of the forward rendering technique.
在下一节中,我将描述前向渲染方式的实现。
Forward Rendering Implementation
Vertex Shader 顶点着色器
The vertex shader is common to all rendering techniques. In this experiment, only static geometry is supported and there is no skeletal animation or terrain that would require a different vertex shader. The vertex shader is as simple as it can be while supporting the required functionality in the pixel shader such as normal mapping.
顶点着色器适用于所有渲染方式。在这个实验中,仅支持静态几何体,没有需要不同顶点着色器的骨骼动画或地形。顶点着色器尽可能简单,同时支持像法线贴图这样的像素着色器所需的功能。
Before I show the vertex shader code, I will describe the data structures used by the vertex shader.
在展示顶点着色器代码之前,我将描述顶点着色器使用的数据结构。
struct AppData
The AppData structure defines the data that is expected to be sent by the application code (for a tutorial on how to pass data from the application to a vertex shader, please refer to my previous article titled Introduction to DirectX 11). For normal mapping, in addition to the normal vector, we also need to send the tangent vector, and optionally the binormal (or bitangent) vector. The tangent and binormal vectors can either be created by the 3D artist when the model is created, or they can be generated by the model importer. In my case, I rely on the Open Asset Import Library [7] to generate the tangents and bitangents if they were not already created by the 3D artist.
AppData 结构定义了应用程序代码预期发送的数据(有关如何将数据从应用程序传递到顶点着色器的教程,请参阅我的上一篇文章,标题为《DirectX 11 入门》)。对于法线贴图,除了法线向量外,我们还需要发送切线向量,以及可选的副法线(或双切线)向量。切线和副法线向量可以由 3D 艺术家在创建模型时创建,也可以由模型导入器生成。在我的情况下,如果 3D 艺术家尚未创建切线和副切线,我依赖于 Open Asset Import Library [7] 来生成切线和副切线。
In the vertex shader, we also need to know how to transform the object space vectors that are sent by the application into view space, which is what the pixel shader requires. To do this, we need to send the world, view, and projection matrices to the vertex shader (for a review of the various spaces used in this article, please refer to my previous article titled Coordinate Systems). To store these matrices, I will create a constant buffer that will store the per-object variables needed by the vertex shader.
在顶点着色器中,我们还需要知道如何将应用程序发送的对象空间向量转换为像素着色器所需的视图空间。为此,我们需要将世界、视图和投影矩阵发送到顶点着色器(有关本文中使用的各种空间的回顾,请参阅我的上一篇文章,标题为坐标系)。为了存储这些矩阵,我将创建一个常量缓冲区,用于存储顶点着色器所需的每个对象变量。
cbuffer PerObject : register( b0 )
Since I don’t need to store the world matrix separately, I precompute the combined model-view matrix and the combined model-view-projection matrix in the application and send both matrices to the vertex shader in a single constant buffer.
由于我不需要单独存储世界矩阵,我在应用程序中预先计算了组合的模型、视图和组合的模型、视图和投影矩阵,并将这些矩阵一起发送到顶点着色器中的单个常量缓冲区。
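A CPU-side sketch of that precomputation might look like the following. It is a plain C++ illustration rather than the article's application code: `Matrix4`, `Identity`, `Multiply`, and `MakePerObject` are hypothetical, and the member names of `PerObject` are assumptions. It uses the column-vector convention, so a vertex is transformed as projection × view × model × v.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Matrix4 = std::array<std::array<float, 4>, 4>; // indexed as m[row][column]

inline Matrix4 Identity()
{
    Matrix4 m{};
    for (std::size_t i = 0; i < 4; ++i)
        m[i][i] = 1.0f;
    return m;
}

// Standard 4x4 matrix product: r = a * b.
inline Matrix4 Multiply(const Matrix4& a, const Matrix4& b)
{
    Matrix4 r{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            for (std::size_t k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Per-object constants, combined once per object on the CPU (names assumed).
struct PerObject
{
    Matrix4 modelView;
    Matrix4 modelViewProjection;
};

inline PerObject MakePerObject(const Matrix4& model, const Matrix4& view,
                               const Matrix4& projection)
{
    PerObject constants;
    constants.modelView = Multiply(view, model);
    constants.modelViewProjection = Multiply(projection, constants.modelView);
    return constants;
}
```

Combining the matrices once per object replaces two matrix-matrix products per vertex with work done once on the CPU.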
The output from the vertex shader (and consequently, the input to the pixel shader) looks like this:
顶点着色器的输出(因此也是像素着色器的输入)如下所示:
struct VertexShaderOutput
The VertexShaderOutput structure is used to pass the transformed vertex attributes to the pixel shader. The members that are named with a VS postfix indicate that the vector is expressed in view space. I chose to do all of the lighting in view space, as opposed to world space, because it is easier to work in view space coordinates when implementing the deferred shading and Forward+ rendering techniques.
VertexShaderOutput 结构用于将转换后的顶点属性传递给像素着色器。以 VS 后缀命名的成员表示该向量是以视图空间表示的。我选择在视图空间中进行所有光照计算,而不是在世界空间中,因为在实现延迟着色和Forward+渲染方式时,使用视图空间坐标更容易。
The vertex shader is fairly straightforward and minimal. Its only purpose is to transform the object space vectors passed by the application into view space to be used by the pixel shader.
顶点着色器非常简单和最小。它的唯一目的是将应用程序传递的对象空间向量转换为视图空间,以供像素着色器使用。
The vertex shader must also compute the clip space position that is consumed by the rasterizer. The SV_POSITION semantic is applied to the output value from the vertex shader to specify that the value is used as the clip space position, but this semantic can also be applied to an input variable of a pixel shader. When SV_POSITION is used as an input semantic to a pixel shader, the value is the position of the pixel in screen space [8]. In both the deferred shading and the Forward+ shaders, I will use this semantic to get the screen space position of the current pixel.
顶点着色器还必须计算由光栅化器使用的裁剪空间位置。SV_POSITION 语义被应用于顶点着色器的输出值,以指定该值用作裁剪空间位置,但此语义也可以应用于像素着色器的输入变量。当 SV_POSITION 用作像素着色器的输入语义时,该值是屏幕空间中像素的位置[8]。在延迟着色和Forward+着色器中,我将使用此语义来获取当前像素的屏幕空间位置。
VertexShaderOutput VS_main( AppData IN )
You will notice that I am pre-multiplying the input vectors by the matrices. This indicates that the matrices are stored in column-major order by default. Prior to DirectX 10, matrices in HLSL were loaded in row-major order and input vectors were post-multiplied by the matrices. Since DirectX 10, matrices are loaded in column-major order by default. You can change the default order by specifying the row_major type modifier on the matrix variable declarations [9].
您会注意到我正在将输入向量与矩阵进行预乘。这表明默认情况下矩阵是按列主序存储的。在 DirectX 10 之前,HLSL 中的矩阵是按行主序加载的,输入向量是由矩阵后乘的。自 DirectX 10 以来,默认情况下矩阵是按列主序加载的。您可以通过在矩阵变量声明中指定 row_major 类型修饰符来更改默认顺序。
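The link between multiplication order and matrix layout can be checked with a small experiment: multiplying a matrix by a column vector gives the same result as multiplying a row vector by the transposed matrix. The sketch below is plain C++, not HLSL, and all of its names (`Vector4`, `Matrix4`, `MulMatrixVector`, `MulVectorMatrix`, `Transpose`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vector4 = std::array<float, 4>;
using Matrix4 = std::array<std::array<float, 4>, 4>; // indexed as m[row][column]

// Column-vector convention: result = M * v (matrix on the left).
inline Vector4 MulMatrixVector(const Matrix4& m, const Vector4& v)
{
    Vector4 r{};
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t col = 0; col < 4; ++col)
            r[row] += m[row][col] * v[col];
    return r;
}

// Row-vector convention: result = v * M (matrix on the right).
inline Vector4 MulVectorMatrix(const Vector4& v, const Matrix4& m)
{
    Vector4 r{};
    for (std::size_t col = 0; col < 4; ++col)
        for (std::size_t row = 0; row < 4; ++row)
            r[col] += v[row] * m[row][col];
    return r;
}

inline Matrix4 Transpose(const Matrix4& m)
{
    Matrix4 t{};
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t col = 0; col < 4; ++col)
            t[col][row] = m[row][col];
    return t;
}
```

If the application and the shader disagree about the layout, the shader effectively sees the transposed matrix, and the multiplication order then has to be swapped (or the matrix transposed before upload) to get the same result.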
Pixel Shader 像素着色器
The pixel shader will compute all of the lighting and shading that is used to determine the final color of a single screen pixel. The lighting equations used in this pixel shader are described in a previous article titled Texturing and Lighting in DirectX 11. If you are not familiar with lighting models, you should read that article first before continuing.
像素着色器将计算用于确定单个屏幕像素最终颜色的所有光照和阴影。此像素着色器中使用的光照方程式在之前的一篇文章中描述,标题为“DirectX 11 中的纹理和光照”,如果您对光照模型不熟悉,则应该先阅读该文章,然后再继续。
The pixel shader uses several structures to do its work. The Material struct stores all of the information that describes the surface material of the object being shaded and the Light struct contains all of the parameters that are necessary to describe a light that is placed in the scene.
像素着色器使用几个结构来完成其工作。Material 结构存储描述被着色对象的表面材质的所有信息,Light 结构包含描述放置在场景中的光源所需的所有参数。
Material 材质
The Material struct defines all of the properties that are necessary to describe the surface of the object currently being shaded. Since some material properties can also have an associated texture (for example, diffuse textures, specular textures, or normal texture), we will also use the material to indicate if those textures are present on the object.
Material 结构定义了描述当前被着色对象表面所需的所有属性。由于一些材质属性也可以有关联的纹理(例如,漫反射纹理、镜面反射纹理或法线纹理),我们还将使用材质来指示这些纹理是否存在于对象上。
struct Material
The GlobalAmbient term is used to describe the ambient contribution applied to all objects in the scene globally. Technically, this variable should be a global variable (not specific to a single object) but since there is only a single material at a time in the pixel shader, I figured it was a fine place to put it.
GlobalAmbient 术语用于描述应用于场景中所有对象的环境贡献。从方式上讲,这个变量应该是一个全局变量(而不是特定于单个对象),但由于像素着色器中一次只有一个材质,我觉得这是一个合适的放置位置。
The ambient, emissive, diffuse, and specular color values have the same meaning as in my previous article titled Texturing and Lighting in DirectX 11 so I will not explain them in detail here.
环境光、自发光、漫反射和镜面颜色值的含义与我之前的文章《DirectX 11 中的纹理和光照》中相同,因此我不会在这里详细解释它们。
The Reflectance component could be used to indicate the amount of reflected color that should be blended with the diffuse color. This would require environment mapping to be implemented which I am not doing in this experiment so this value is not used here.
反射分量可用于指示应与漫反射颜色混合的反射颜色量。这将需要实现环境贴图,而我在这个实验中没有这样做,因此此值在此处未使用。
The Opacity value is used to determine the total opacity of an object. This value can be used to make objects appear transparent. This property is used to render semi-transparent objects in the transparent pass. If the opacity value is less than one (1 being fully opaque and 0 being fully transparent), the object will be considered transparent and will be rendered in the transparent pass instead of the opaque pass.
不透明度值用于确定物体的总不透明度。此值可用于使物体呈现为透明。此属性用于在透明pass中渲染半透明物体。如果不透明度值小于 1(1 表示完全不透明,0 表示完全透明),则该物体将被视为透明,并将在透明pass中渲染,而不是在不透明pass中。
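The rule for choosing a pass from the opacity value is simple enough to state in code. This is a minimal C++ sketch with hypothetical names (`RenderPass`, `SelectPass`), assuming the opacity has already been clamped to [0, 1].

```cpp
#include <cassert>

enum class RenderPass { Opaque, Transparent };

// Materials with an opacity below 1 are rendered in the transparent pass.
inline RenderPass SelectPass(float opacity)
{
    assert(opacity >= 0.0f && opacity <= 1.0f);
    return (opacity < 1.0f) ? RenderPass::Transparent : RenderPass::Opaque;
}
```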
The SpecularPower variable is used to determine how shiny the object appears. Specular power was described in my previous article titled Texturing and Lighting in DirectX 11 so I won’t repeat it here.
SpecularPower 变量用于确定物体看起来有多闪亮。Specular power 在我的上一篇文章中有描述,标题为 DirectX 11 中的纹理和光照,所以我这里不会重复。
The IndexOfRefraction variable can be applied on objects that should refract light through them. Since refraction requires environment mapping techniques that are not implemented in this experiment, this variable will not be used here.
IndexOfRefraction 变量可应用于应该通过它们折射光线的物体。由于折射需要环境贴图方式,而这些方式在此实验中未实现,因此此变量将不会在此处使用。
The HasTexture variables defined on lines 29-38 indicate whether the object being rendered has an associated texture for those properties. If the parameter is true then the corresponding texture will be sampled and the texel will be blended with the corresponding material color value.
HasTexture 变量在第 29-38 行定义,指示正在渲染的物体是否具有相关纹理。如果参数为 true,则将对应的纹理进行采样,并将 texel 与对应的材质颜色值混合。
The BumpIntensity variable is used to scale the height values from a bump map (not to be confused with normal mapping which does not need to be scaled) in order to soften or accentuate the apparent bumpiness of an object’s surface. In most cases models will use normal maps to add detail to the surface of an object without high tessellation but it is also possible to use a heightmap to do the same thing. If a model has a bump map, the material’s HasBumpTexture property will be set to true and in this case the model will be bump mapped instead of normal mapped.
BumpIntensity 变量用于缩放凸起贴图的高度值(不要与不需要缩放的法线贴图混淆),以软化或突出物体表面的凹凸感。在大多数情况下,模型将使用法线贴图为物体表面添加细节,而无需高细分,但也可以使用高度图来实现相同的效果。如果模型具有凸起贴图,则材质的 HasBumpTexture 属性将设置为 true,在这种情况下,模型将进行凸起贴图而不是法线贴图。
The SpecularScale variable is used to scale the specular power value that is read from a specular power texture. Since textures usually store values as unsigned normalized values, when sampling from the texture the value is read as a floating-point value in the range of [0…1]. A specular power of 1.0 does not make much sense (as was explained in my previous article titled Texturing and Lighting in DirectX 11) so the specular power value read from the texture will be scaled by SpecularScale before being used for the final lighting computation.
SpecularScale 变量用于缩放从镜面率纹理中读取的镜面率值。由于纹理通常将值存储为无符号归一化值,因此从纹理采样时,该值将作为浮点值读取,范围为[0…1]。镜面率为 1.0 并没有太多意义(正如我之前的文章《DirectX 11 中的纹理和光照》中所解释的那样),因此在用于最终光照计算之前,将从纹理中读取的镜面率值乘以 SpecularScale 进行缩放。
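As a small illustration of why the scale is needed, the sketch below decodes an 8-bit unsigned normalized texel to [0, 1] and then applies the scale. It is written in C++ for clarity only; `SpecularPowerFromTexel` is a hypothetical helper, and the scale of 128 used in the note below is just an example value.

```cpp
#include <cassert>
#include <cstdint>

// Converts an 8-bit texel from a specular power texture into a usable exponent.
inline float SpecularPowerFromTexel(std::uint8_t texel, float specularScale)
{
    assert(specularScale > 0.0f);
    const float normalized = static_cast<float>(texel) / 255.0f; // UNORM decode to [0, 1]
    return normalized * specularScale;
}
```

With a scale of 128, the brightest texel (255) maps to a specular power of 128 instead of 1.0.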
The AlphaThreshold variable can be used to discard pixels whose opacity is below a certain value using the “discard” command in the pixel shader. This can be used with “cut-out” materials where the object does not need to be alpha blended but it should have holes in the object (for example, a chain-link fence).
AlphaThreshold 变量可用于使用像素着色器中的“丢弃”命令丢弃不透明度低于某个值的像素。这可用于“切割”材质,其中对象不需要进行 alpha 混合,但对象应该有孔洞(例如,链环围栏)。
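The cut-out test amounts to a single comparison per pixel. The C++ sketch below mirrors it on the CPU; `ShouldDiscard` is a hypothetical name, and treating a pixel exactly at the threshold as kept is my own assumption.

```cpp
#include <cassert>

// Returns true when a pixel's opacity is below the threshold and it should be cut out.
inline bool ShouldDiscard(float alpha, float alphaThreshold)
{
    assert(alphaThreshold >= 0.0f && alphaThreshold <= 1.0f);
    return alpha < alphaThreshold;
}
```

Because the result is all-or-nothing, such materials can stay in the opaque pass and do not need to be sorted or blended.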
The Padding variable is used to explicitly add eight bytes of padding to the material struct. Although HLSL will implicitly add this padding to this struct to make sure the size of the struct is a multiple of 16 bytes, explicitly adding the padding makes it clear that the size and alignment of this struct is identical to its C++ counterpart.
Padding 变量用于显式地向材质结构体添加八个字节的填充。虽然 HLSL 会隐式地向该结构体添加此填充,以确保结构体的大小是 16 字节的倍数,但显式添加填充可以清楚地表明该结构体的大小和对齐方式与其 C++对应项相同。
The material properties are passed to the pixel shader using a constant buffer.
材质属性通过常量缓冲区传递给像素着色器。
cbuffer Material : register( b2 )
This constant buffer and buffer register slot assignment is used for all pixel shaders described in this article.
本文描述的所有像素着色器都使用此常量缓冲区和缓冲区寄存器分配。
Textures 纹理
The materials have support for eight different textures.
这些材质支持八种不同的纹理。
- Ambient 环境光
- Emissive 自发光
- Diffuse 漫反射
- Specular 镜面
- SpecularPower 反射率
- Normals 法线
- Bump 凹凸
- Opacity 不透明度
Not all scene objects will use all of the texture slots (normal and bump maps are mutually exclusive so they can probably reuse the same texture slot assignment). It is up to the 3D artist to determine which textures will be used by the models in the scene. The application will load the textures that are associated with a material. A texture parameter and an associated texture slot assignment are declared for each of these material properties.
并非所有场景对象都会使用所有的纹理槽(法线和凹凸贴图是互斥的,因此它们可能会重用相同的纹理槽分配)。由 3D 艺术家决定场景中的模型将使用哪些纹理。应用程序将加载与材质关联的纹理。为每个这些材质属性声明一个纹理参数和一个关联的纹理槽分配。
Texture2D AmbientTexture : register( t0 );
In every pixel shader described in this article, texture slots 0-7 will be reserved for these textures.
在本文中描述的每个像素着色器中,纹理槽 0-7 将被保留用于这些纹理。
Lights 灯光
The Light struct stores all the information necessary to define a light in the scene. Spot lights, point lights and directional lights are not separated into different structs and all of the properties necessary to define any of those light types are stored in a single struct.
Light 结构存储了定义场景中光源所需的所有信息。聚光灯、点光源和定向光源并未分开存储在不同的结构中,定义任何一种光源类型所需的所有属性都存储在单个结构中。
struct Light
The Position and Direction properties are stored in both world space (with the WS postfix) and in view space (with the VS postfix). Of course, the Position variable only applies to point and spot lights while the Direction variable only applies to spot and directional lights. I store both world space and view space position and direction vectors because I find it easier to work in world space in the application and then convert the world space vectors to view space before uploading the lights array to the GPU. This way I do not need to maintain multiple light lists, at the cost of additional space that is required on the GPU. Even 10,000 lights only require 1.12 MB on the GPU, so I figured this was a reasonable sacrifice. However, minimizing the size of the light structs could have a positive impact on caching on the GPU and improve rendering performance. This is further discussed in the Future Considerations section at the end of this article.
位置和方向属性分别存储在世界空间(带有 WS 后缀)和视图空间(带有 VS 后缀)中。当然,位置变量仅适用于点光源和聚光灯,而方向变量仅适用于聚光灯和方向光源。我同时存储世界空间和视图空间的位置和方向向量,因为我发现在应用程序中更容易在世界空间中工作,然后在将世界空间向量转换为视图空间之前将灯光数组上传到 GPU。这样,我就不需要维护多个光源列表,而需要的额外空间仅在 GPU 上。即使有 10,000 个光源,也仅需要 1.12 MB 的 GPU 空间,所以我认为这是一个合理的牺牲。但是,减小光源结构的大小可能对 GPU 上的缓存产生积极影响,并提高渲染性能。这在本文末尾的未来考虑部分进一步讨论。
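The 1.12 MB figure implies 112 bytes per light (1,120,000 bytes / 10,000 lights), or seven 16-byte registers. The C++ sketch below shows one layout that adds up to that size. It is a guess for illustration only: the members marked "assumed" are not described in the text above, and the 4-component vectors are an assumption as well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Float4 { float x, y, z, w; };

// Hypothetical CPU-side mirror of the light data described above.
struct Light
{
    Float4        PositionWS;     // world space position
    Float4        DirectionWS;    // world space direction
    Float4        PositionVS;     // view space position
    Float4        DirectionVS;    // view space direction
    Float4        Color;          // shared diffuse and specular color
    float         SpotlightAngle; // half-angle of the cone, in degrees
    float         Range;
    float         Intensity;
    std::uint32_t Enabled;        // assumed
    std::uint32_t Type;           // assumed: point, spot, or directional
    float         Padding[3];     // assumed: pads the struct to a multiple of 16 bytes
};

static_assert(sizeof(Light) == 112, "seven 16-byte registers per light");
static_assert(sizeof(Light) % 16 == 0, "size must be a multiple of 16 bytes");

// Total size of a buffer holding lightCount lights.
constexpr std::size_t LightBufferBytes(std::size_t lightCount)
{
    return lightCount * sizeof(Light);
}

static_assert(LightBufferBytes(10000) == 1120000, "1.12 MB for 10,000 lights");
```

Dropping the world space vectors from this layout would save 32 bytes per light, which is the kind of size reduction the paragraph above hints at.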
In some lighting models the diffuse and specular lighting contributions are separated. I chose not to separate the diffuse and specular color contributions because it is rare that these values differ. Instead I chose to store both the diffuse and specular lighting contributions in a single variable called Color.
在一些光照模型中,漫反射和镜面反射的光照贡献是分开的。我选择不分开漫反射和镜面反射的颜色贡献,因为这些值不经常不同。相反,我选择将漫反射和镜面反射的光照贡献都存储在一个名为 Color 的单个变量中。
The SpotlightAngle is the half-angle of the spotlight cone expressed in degrees. Working in degrees seems to be more intuitive than working in radians. Of course, the spotlight angle will be converted to radians in the shader when we need to compute the cosine angle of the spotlight and the light vector.
SpotlightAngle 是以度为单位表示的聚光灯锥体的半角。使用度数似乎比使用弧度更直观。当我们需要计算聚光灯和光矢量的余弦角时,聚光灯角度当然会在着色器中转换为弧度。
Spotlight Angle 聚光灯角度
The Range variable determines how far away the light will reach and still contribute light to a surface. Although not entirely physically correct (real lights have an attenuation that never actually reaches 0), lights are required to have a finite range to implement the deferred shading and forward+ rendering techniques. The units of this range are scene specific, but generally I try to adhere to the convention that 1 unit is 1 meter. For point lights, the range is the radius of the sphere that represents the light, and for spotlights, the range is the length of the cone that represents the light. Directional lights don’t use range because they are considered to be infinitely far away, pointing in the same direction everywhere.
范围变量确定光线的到达距离,并仍然为表面提供光线。虽然不完全符合物理规律(真实光源具有永远不会真正达到 0 的衰减),但需要光源具有有限范围以实现延迟着色和Foward+渲染方式。这个范围的单位是特定于场景的,但通常我尝试遵守 1 单位等于 1 米的规范。对于点光源,范围是代表光源的球体的半径,对于聚光灯,范围是代表光源的锥体的长度。定向光不使用范围,因为它们被认为是无限远,无论在哪里指向相同方向。
The Intensity variable is used to modulate the computed light contribution. By default, this value is 1 but it can be used to make some lights brighter or more subtle than other lights.
强度变量用于调节计算得到的光照贡献。默认情况下,该值为 1,但可以用来使一些灯光比其他灯光更亮或更微妙。
Lights in the scene can be toggled on and off with the Enabled flag. Lights whose Enabled flag is false will be skipped in the shader.
场景中的灯光可以通过 Enabled 标志进行开关控制。Enabled 标志为 false 的灯光将在着色器中被跳过。
Lights are editable in this demo. A light can be selected by clicking on it in the demo application and its properties can be modified. To indicate that a light is currently selected, the Selected flag will be set to true. When a light is selected in the scene, its visual representation will appear darker (less transparent) to indicate that it is currently selected.
在此演示中,灯光是可编辑的。可以通过在演示应用程序中单击灯光来选择灯光并修改其属性。要表示当前选择的灯光,Selected 标志将被设置为 true。当在场景中选择灯光时,其视觉表示将变暗(不透明度降低),以表示当前选择的状态。
The Type variable is used to indicate which type of light this is. It can have one of the following values:
Type 变量用于指示这是哪种类型的光。它可以具有以下值之一:
- Point light 点光源
- Spot light 聚光灯
- Directional light 定向光
Once again the Light struct is explicitly padded with 8 bytes to match the struct layout in C++ and to make the struct explicitly aligned to 16 bytes which is required in HLSL.
再次,Light 结构体明确填充了 8 字节,以匹配 C++ 中的结构布局,并使结构体明确对齐到 16 字节,这在 HLSL 中是必需的。
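As a rough illustration of the layout described above, the struct can be mirrored on the C++ side and its size checked at compile time. This is only a sketch: the member order, the `Float4` helper, and the use of 32-bit integers for the flags are assumptions based on the properties listed in the text, not the demo's actual declaration.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for an HLSL float4 (16 bytes).
struct Float4 { float x, y, z, w; };

// Hypothetical CPU-side mirror of the Light struct; the member order is assumed.
struct alignas(16) Light
{
    Float4        PositionWS;      // world space, point and spot lights
    Float4        DirectionWS;     // world space, spot and directional lights
    Float4        PositionVS;      // view space copy of the position
    Float4        DirectionVS;     // view space copy of the direction
    Float4        Color;           // shared diffuse and specular color
    float         SpotlightAngle;  // half-angle of the cone, in degrees
    float         Range;
    float         Intensity;
    std::uint32_t Enabled;         // an HLSL bool occupies 4 bytes
    std::uint32_t Selected;
    std::uint32_t Type;
    float         Padding[2];      // the explicit 8 bytes of padding
};

static_assert(sizeof(Light) == 112, "Light is expected to occupy 112 bytes");
static_assert(sizeof(Light) % 16 == 0, "Light must be a multiple of 16 bytes");

// Bytes needed to store `count` lights in one buffer.
constexpr std::size_t LightBufferBytes(std::size_t count)
{
    return count * sizeof(Light);
}
```

With this layout, `LightBufferBytes(10000)` evaluates to 1,120,000 bytes, which matches the 1.12 MB figure quoted earlier.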
The lights array is accessed through a StructuredBuffer. Most lighting shader implementations will use a constant buffer to store the lights array, but constant buffers are limited to 64 KB in size, which means that the array would be limited to about 570 lights before running out of constant memory on the GPU. Structured buffers are stored in texture memory, so their size is limited only by the amount of texture memory available on the GPU (usually in the GB range on desktop GPUs). Texture memory is also very fast on most GPUs, so storing the lights in a structured buffer did not hurt performance. In fact, on my particular GPU (NVIDIA GeForce GTX 680) I noticed a considerable performance improvement when I moved the lights array to a structured buffer.
通过 StructuredBuffer 访问 lights 数组。大多数光照着色器实现将使用常量缓冲区来存储 lights 数组,但常量缓冲区的大小限制为 64 KB,这意味着在 GPU 上的常量内存用完之前,lights 数组将受到限制,大约只能容纳约 570 个光源。Structured buffers 存储在纹理内存中,其大小受 GPU 上可用纹理内存的限制(通常在台式机 GPU 上为 GB 级别)。大多数 GPU 上的纹理内存速度也非常快,因此将 lights 存储在结构化缓冲区中并不会对性能产生影响。事实上,在我的特定 GPU(NVIDIA GeForce GTX 680)上,当我将 lights 数组移动到结构化缓冲区时,我注意到了显著的性能改进。
`StructuredBuffer<Light> Lights : register( t8 );`
Pixel Shader Continued 像素着色器
The pixel shader for the forward rendering technique is slightly more complicated than the vertex shader. If you have read my previous article titled Texturing and Lighting in DirectX 11 then you should already be familiar with most of the implementation of this shader, but I will explain it in detail here as it is the basis of all of the rendering algorithms shown in this article.
正向渲染方式的像素着色器比顶点着色器稍微复杂一些。如果您已经阅读了我之前的文章《DirectX 11 中的纹理和光照》,那么您应该已经熟悉了这个着色器的大部分实现,但我会在这里详细解释,因为它是本文中展示的所有渲染算法的基础。
Materials 材质
First, we need to gather the properties of the material. If the material has textures associated with its various components, the textures will be sampled before the lighting is computed. After the material properties have been initialized, all of the lights in the scene will be iterated and the lighting contributions will be accumulated and modulated with the material properties to produce the final pixel color.
首先,我们需要收集材质的材质属性。如果材质具有与其各个组件相关联的纹理,那么在计算光照之前将对纹理进行采样。初始化材质属性后,将迭代场景中的所有灯光,并将光照贡献累积并与材质属性调制,以生成最终像素颜色。
`[earlydepthstencil]`
The [earlydepthstencil] attribute before the function indicates that the GPU should take advantage of early depth and stencil culling [10]. This causes the depth/stencil tests to be performed before the pixel shader is executed. This attribute cannot be used on shaders that modify the pixel’s depth value by outputting a value using the SV_Depth semantic. Since this pixel shader only outputs a color value using the SV_TARGET semantic, it can take advantage of early depth/stencil testing to provide a performance improvement when a pixel is rejected. Most GPUs will perform early depth/stencil tests anyway, even without this attribute, and adding it to the pixel shader did not have a noticeable impact on performance, but I decided to keep the attribute.
在函数之前的[earlydepthstencil]属性表示 GPU 应该利用提前深度和模板剔除[10]。这会导致深度/模板测试在像素着色器执行之前执行。这个属性不能用于修改像素深度值的着色器,因为它使用 SV_Depth 语义输出值。由于这个像素着色器只使用 SV_TARGET 语义输出颜色值,它可以利用提前深度/模板测试来提高性能,当像素被拒绝时。大多数 GPU 即使没有这个属性也会执行提前深度/模板测试,将这个属性添加到像素着色器并没有明显影响性能,但我还是决定保留这个属性。
Since all of the lighting computations will be performed in view space, the eye position (the position of the camera) is always (0, 0, 0). This is a nice side effect of working in view space: the camera’s eye position does not need to be passed as an additional parameter to the shader.
由于所有的光照计算将在视图空间中执行,眼睛位置(相机的位置)始终为(0, 0, 0)。这是在视图空间中工作的一个好处;相机的眼睛位置不需要作为额外参数传递给着色器。
On line 24 a temporary copy of the material is created because its properties will be modified in the shader if there is an associated texture for the material property. Since the material properties are stored in a constant buffer, it is not possible to modify them directly through the constant buffer’s uniform variable, so a local temporary must be used.
在第 24 行,将材质的临时副本创建,因为如果材质属性有关联的纹理,那么在着色器中将修改这些属性。由于材质属性存储在常量缓冲区中,无法直接从常量缓冲区统一变量更新材质属性,因此必须使用本地临时变量。
Diffuse 漫反射
The first material property we will read is the diffuse color.
我们将要阅读的第一个材质属性是漫反射颜色。
`float4 diffuse = mat.DiffuseColor;`
The default diffuse color is the color assigned to the material’s DiffuseColor variable. If the material also has a diffuse texture associated with it, the color from the diffuse texture is blended with the material’s diffuse color. If the material’s diffuse color is black (0, 0, 0, 0), it is simply replaced by the color from the diffuse texture. The HLSL intrinsic function any can be used to find out whether any of the color components is non-zero.
默认的漫反射颜色是分配给材质的 DiffuseColor 变量的漫反射颜色。如果材质还有与之关联的漫反射纹理,那么来自漫反射纹理的颜色将与材质的漫反射颜色混合。如果材质的漫反射颜色是黑色(0, 0, 0, 0),那么材质的漫反射颜色将简单地被漫反射纹理中的颜色替换。可以使用 HLSL 内置函数 any 来判断颜色分量中是否有非零值。
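The selection logic described above can be sketched on the CPU in C++. This is a minimal sketch, assuming that "blended" means a component-wise multiplication; `Color4`, `AnyRGB` and `ResolveDiffuse` are illustrative names, not functions from the demo.

```cpp
struct Color4 { float r, g, b, a; };

// Mirrors HLSL any( c.rgb ): true when at least one color channel is non-zero.
bool AnyRGB(const Color4& c)
{
    return c.r != 0.0f || c.g != 0.0f || c.b != 0.0f;
}

// Returns the diffuse color the shader would continue with.
Color4 ResolveDiffuse(const Color4& materialColor, bool hasDiffuseTexture, const Color4& texel)
{
    if (!hasDiffuseTexture)
    {
        return materialColor;   // no texture: keep the material color
    }
    if (!AnyRGB(materialColor))
    {
        return texel;           // black material color: the texture replaces it
    }
    // Otherwise modulate the material color by the texture color (assumed blend).
    return { materialColor.r * texel.r, materialColor.g * texel.g,
             materialColor.b * texel.b, materialColor.a * texel.a };
}
```

The replace-when-black branch lets models that ship with a texture but an unset material color still render with the texture's color instead of turning black.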
Opacity 不透明度
The pixel’s alpha value is determined next.
像素的 alpha 值是接下来确定的。
`float alpha = diffuse.a;`
By default, the fragment’s transparency value is determined by the alpha component of the diffuse color. If the material has an opacity texture associated with it, the red component of the opacity texture is used as the alpha value, overriding the alpha value in the diffuse texture. In most cases, opacity textures store only a single channel in the first component of the color that is returned from the Sample method. In order to read from a single-channel texture, we must read from the red channel, not the alpha channel. The alpha channel of a single channel texture will always be 1 so reading the alpha channel from the opacity map (which is most likely a single channel texture) would not provide the value we require.
默认情况下,片元的透明度值由漫反射颜色的 alpha 分量确定。如果材质有与之关联的不透明度纹理,不透明度纹理的红色分量将用作 alpha 值,覆盖漫反射纹理中的 alpha 值。在大多数情况下,不透明度纹理仅在 Sample 方法返回的颜色的第一个分量中存储单个通道。为了从单通道纹理中读取,我们必须读取红色通道,而不是 alpha 通道。单通道纹理的 alpha 通道始终为 1,因此从不透明度图(很可能是单通道纹理)中读取 alpha 通道将无法提供我们需要的值。
Ambient and Emissive 环境和自发光
The ambient and emissive colors are read in a similar fashion as the diffuse color. The ambient color is also combined with the value of the material’s GlobalAmbient variable.
环境色和发射色的读取方式与漫反射色类似。环境色还与材质的 GlobalAmbient 变量的值相结合。
`float4 ambient = mat.AmbientColor;`
Specular Power 反射率
Next the specular power is computed.
接下来计算镜面率。
`if ( mat.HasSpecularPowerTexture )`
If the material has an associated specular power texture, the red component of the texture is sampled and scaled by the value of the material’s SpecularScale variable. In this case, the value of the SpecularPower variable in the material is replaced with the scaled value from the texture.
如果材质具有关联的高光强度纹理,纹理的红色分量将被采样并乘以材质的 SpecularScale 变量的值。在这种情况下,材质中的 SpecularPower 变量的值将被替换为纹理中的缩放值。
Normals 法线
If the material has either an associated normal map or a bump map, normal mapping or bump mapping will be performed to compute the normal vector. If neither a normal map nor a bump map texture is associated with the material, the input normal is used as-is.
如果材质具有关联的法线贴图或凹凸贴图,则将执行法线贴图以计算法线向量。如果材质未关联法线贴图或凹凸贴图纹理,则输入法线将按原样使用。
`// Normal mapping`
Normal Mapping 法线贴图
The DoNormalMapping function will perform normal mapping from the TBN (tangent, bitangent/binormal, normal) matrix and the normal map.
DoNormalMapping 函数将从 TBN(切线、双切线/法线副切线、法线)矩阵和法线贴图执行法线采样。
An example normal map texture of the lion head in the Crytek Sponza scene. [11]
狮子头在 Crytek Sponza 场景中的一个示例法线贴图纹理。[11]
`float3 ExpandNormal( float3 n )`
Normal mapping is pretty straightforward and is explained in more detail in a previous article titled Normal Mapping so I won’t explain it in detail here. Basically we just need to sample the normal from the normal map, expand the normal into the range [-1…1] and transform it from tangent space into view space by post-multiplying it by the TBN matrix.
法线贴图非常简单,之前的一篇名为“法线贴图”的文章中对此进行了更详细的解释,所以我就不在这里详细解释了。基本上,我们只需要从法线贴图中采样法线,将法线扩展到[-1…1]范围内,并通过将其与 TBN 矩阵进行后乘来将其从切线空间转换到视图空间。
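The three steps named above (sample, expand, transform) can be sketched on the CPU in C++. This is a hedged illustration rather than the article's shader: `Vec3`, `Mat3`, `Mul` and `Normalize` are small helpers assumed for the example, and the texture sample is passed in as a plain value.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Rows of the matrix are the view-space tangent, bitangent and normal.
struct Mat3 { Vec3 T, B, N; };

// Map a sampled texel from [0, 1] to [-1, 1].
Vec3 ExpandNormal(Vec3 n)
{
    return { n.x * 2.0f - 1.0f, n.y * 2.0f - 1.0f, n.z * 2.0f - 1.0f };
}

// Row vector times matrix, the CPU analogue of HLSL mul( v, TBN ).
Vec3 Mul(Vec3 v, const Mat3& m)
{
    return { v.x * m.T.x + v.y * m.B.x + v.z * m.N.x,
             v.x * m.T.y + v.y * m.B.y + v.z * m.N.y,
             v.x * m.T.z + v.y * m.B.z + v.z * m.N.z };
}

Vec3 Normalize(Vec3 v)
{
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / len, v.y / len, v.z / len };
}

// texel is the color sampled from the normal map, each channel in [0, 1].
Vec3 DoNormalMapping(const Mat3& TBN, Vec3 texel)
{
    return Normalize(Mul(ExpandNormal(texel), TBN));
}
```

A texel of (0.5, 0.5, 1.0) expands to (0, 0, 1) in tangent space, so the result is simply the N row of the TBN matrix, which is the unperturbed surface normal.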
Bump Mapping 凹凸贴图
Bump mapping works in a similar way, except instead of storing the normals directly in the texture, the bumpmap texture stores height values in the range [0…1]. The normal can be generated from the height map by computing the gradient of the height values in both the U and V texture coordinate directions. Taking the cross product of the gradients in each direction gives the normal in texture space. Post-multiplying the resulting normal by the TBN matrix will give the normal in view space. The height values read from the bump map can be scaled to produce more (or less) accentuated bumpiness.
凹凸贴图的工作方式类似,不同之处在于凹凸贴图纹理中不直接存储法线,而是存储高度值,范围为[0…1]。可以通过计算高度图中 U 和 V 纹理坐标方向上的高度值梯度来生成法线。在每个方向上梯度的叉积给出了纹理空间中的法线。将结果法线与 TBN 矩阵相乘将给出视图空间中的法线。从凹凸贴图中读取的高度值可以缩放以产生更多(或更少)突出的凹凸效果。
Bumpmap texture (left) and the corresponding head model (right). [12]
凹凸贴图纹理(左)和相应的头部模型(右)。【12】
`float4 DoBumpMapping( float3x3 TBN, Texture2D tex, sampler s, float2 uv, float bumpScale )`
I’m not sure if this bump mapping algorithm is 100% correct. I couldn’t find any resource that shows how to do correct bump mapping. Please leave a comment below if you can suggest a better (and correct) method for performing bump mapping.
我不确定这个凹凸贴图算法是否 100%正确。我找不到任何资源显示如何正确地进行凹凸贴图。如果您能提出更好(和正确)的执行凹凸贴图的方法,请在下面留言。
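The gradient-and-cross-product idea described for bump mapping can be sketched on the CPU in C++. This is an illustration under stated assumptions, not a verified reference implementation: the height samples come from a callback instead of a texture, neighbouring texels are assumed to be one unit apart, and the final transform by the TBN matrix is left out.

```cpp
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };

Vec3 Sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

Vec3 Cross(Vec3 a, Vec3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

Vec3 Normalize(Vec3 v)
{
    float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / len, v.y / len, v.z / len };
}

// sampleHeight(du, dv) stands in for sampling the bump map at a texel offset.
Vec3 BumpNormalTangentSpace(const std::function<float(int, int)>& sampleHeight, float bumpScale)
{
    float height  = sampleHeight(0, 0) * bumpScale;
    float heightU = sampleHeight(1, 0) * bumpScale;   // one texel along U
    float heightV = sampleHeight(0, 1) * bumpScale;   // one texel along V

    Vec3 p  = { 0.0f, 0.0f, height };
    Vec3 pU = { 1.0f, 0.0f, heightU };
    Vec3 pV = { 0.0f, 1.0f, heightV };

    // normal = tangent x bitangent, both estimated from the height differences
    return Normalize(Cross(Normalize(Sub(pU, p)), Normalize(Sub(pV, p))));
}
```

A flat height field yields (0, 0, 1), while a height that rises along U tilts the normal toward negative X; a larger `bumpScale` exaggerates that tilt.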
If the material does not have an associated normal map or a bump map, the normal vector from the vertex shader output is used directly.
如果材质没有关联的法线贴图或凹凸贴图,则直接使用顶点着色器输出的法线向量。
Now we have all of the data that is required to compute the lighting.
现在我们拥有计算光照所需的所有数据。
Lighting 灯光
The lighting calculations for the forward rendering technique are performed in the DoLighting function. This function accepts the following arguments:
正向渲染方式的灯光计算是在 DoLighting 函数中执行的。该函数接受以下参数:
- lights: The lights array (as a structured buffer)
  灯光:灯光数组(作为结构化缓冲区)
- mat: The material properties that were just computed
  材质属性刚刚计算出来的
- eyePos: The position of the camera in view space (which is always (0, 0, 0))
  eyePos: 相机在视图空间中的位置(始终为(0, 0, 0))
- P: The position of the point being shaded in view space
  P: 在视图空间中被着色的点的位置
- N: The normal of the point being shaded in view space.
  N:在视空间中着色的点的法线。
The DoLighting function returns a LightingResult structure that contains the diffuse and specular lighting contributions from all of the lights in the scene.
DoLighting 函数返回一个 LightingResult 结构,其中包含场景中所有灯光的漫反射和镜面光照贡献。
`// This lighting result is returned by the`
The view vector (V) is computed from the eye position and the position of the shaded pixel in view space.
视图向量(V)是从眼睛位置和视图空间中阴影像素的位置计算得出的。
The light buffer is iterated on line 439. Since we know that disabled lights and lights that are not within range of the point being shaded won’t contribute any lighting, we can skip those lights. Otherwise, the appropriate lighting function is invoked depending on the type of light.
光缓冲区在第 439 行上进行迭代。由于我们知道禁用的灯光和不在被着色点范围内的灯光不会提供任何光照,我们可以跳过这些灯光。否则,根据灯光类型调用适当的光照函数。
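The skip conditions described above can be sketched as a small C++ loop. This is a simplified illustration: the numeric values of `LightType` are hypothetical, the `Light` struct carries only the members the test needs, and the loop counts contributing lights instead of accumulating a lighting result.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical numeric values for the three light types named in the text.
enum LightType { PointLight = 0, SpotLight = 1, DirectionalLight = 2 };

struct Light
{
    Vec3      PositionVS;
    float     Range;
    bool      Enabled;
    LightType Type;
};

float Distance(Vec3 a, Vec3 b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// True when the light can affect the point P (both in view space).
bool ContributesLight(const Light& light, Vec3 P)
{
    if (!light.Enabled) return false;
    // Directional lights have no position, so the range test does not apply.
    if (light.Type != DirectionalLight && Distance(light.PositionVS, P) > light.Range) return false;
    return true;
}

int CountContributingLights(const std::vector<Light>& lights, Vec3 P)
{
    int count = 0;
    for (const Light& light : lights)
    {
        if (!ContributesLight(light, P)) continue;
        // A shader would dispatch on light.Type here and accumulate the result.
        ++count;
    }
    return count;
}
```

Every light still costs one iteration even when it is skipped, which is why the loop becomes the bottleneck of forward rendering as the light count grows.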
Each of the various light types will compute their diffuse and specular lighting contributions. Since diffuse and specular lighting are computed in the same way for every light type, I will define functions that compute the diffuse and specular lighting contributions independently of the light type.
各种光源类型将计算它们的漫反射和镜面反射光照贡献。由于漫反射和镜面反射光照对于每种光源类型的计算方式相同,我将定义函数来计算漫反射和镜面反射光照贡献,独立于光源类型。
Diffuse Lighting 漫反射光照
The DoDiffuse function is very simple and only needs to know about the light vector (L) and the surface normal (N).
DoDiffuse 函数非常简单,只需要知道光矢量(L)和表面法线(N)即可。
Diffuse Lighting 漫射光
`float4 DoDiffuse( Light light, float4 L, float4 N )`
The diffuse lighting is computed by taking the dot product between the light vector (L) and the surface normal (N). The DoDiffuse function expects both of these vectors to be normalized.
漫反射光照是通过计算光矢量(L)和表面法线(N)之间的点积来实现的。DoDiffuse 函数期望这两个向量都被归一化。
The resulting dot product is then multiplied by the color of the light to compute the diffuse contribution of the light.
然后将得到的点积乘以光的颜色,以计算光的漫反射贡献。
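The diffuse term described above can be sketched in C++ as follows. This is a minimal sketch: clamping the dot product to zero is the usual Lambert convention and is assumed here, and `Vec3`, `Color3` and `Dot` are helpers introduced only for the example.

```cpp
#include <algorithm>

struct Vec3   { float x, y, z; };
struct Color3 { float r, g, b; };

float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// L and N are expected to be normalized.
Color3 DoDiffuse(Color3 lightColor, Vec3 L, Vec3 N)
{
    // Surfaces facing away from the light receive no diffuse light (assumed clamp).
    float NdotL = std::max(Dot(N, L), 0.0f);
    return { lightColor.r * NdotL, lightColor.g * NdotL, lightColor.b * NdotL };
}
```

A surface that faces the light directly receives the full light color, and the contribution falls off with the cosine of the angle between N and L.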
Next, we’ll compute the specular contribution of the light.
接下来,我们将计算光的镜面贡献。
Specular Lighting 镜面光照
The DoSpecular function is used to compute the specular contribution of the light. In addition to the light vector (L) and the surface normal (N), this function also needs the view vector (V) to compute the specular contribution of the light.
DoSpecular 函数用于计算光的镜面贡献。除了光矢量(L)和表面法线(N)之外,该函数还需要视图矢量(V)来计算光的镜面贡献。
Specular Lighting 镜面光照
`float4 DoSpecular( Light light, Material material, float4 V, float4 L, float4 N )`
Since the light vector L is the vector pointing from the point being shaded to the light source, it needs to be negated so that it points from the light source to the point being shaded before we compute the reflection vector. The resulting dot product of the reflection vector (R) and the view vector (V) is raised to the power of the value of the material’s specular power variable and modulated by the color of the light. It’s important to remember that a specular power value in the range (0…1) is not a meaningful specular power value. For a detailed explanation of specular lighting, please refer to my previous article titled Texturing and Lighting in DirectX 11.
由于光矢量 L 是从被着色点指向光源的矢量,所以在计算反射矢量之前,需要对其取反,使其从光源指向被着色点。反射矢量(R)和视图矢量(V)的点积结果被提升到材质的高光率变量的值,并受光的颜色调制。重要的是要记住,范围在(0…1)的高光率值并不是一个有意义的高光率值。有关高光光照的详细解释,请参阅我之前发表的文章,标题为《DirectX 11 中的纹理和光照》。
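The reflection-based specular term described above can be sketched in C++. The sketch returns only the scalar factor, which would then be multiplied by the light color; `Reflect` follows the definition of the HLSL reflect intrinsic (i - 2 * dot(i, n) * n), and the clamp to zero is an assumption.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Same definition as the HLSL reflect intrinsic: i - 2 * dot(i, n) * n.
Vec3 Reflect(Vec3 i, Vec3 n)
{
    float d = 2.0f * Dot(i, n);
    return { i.x - d * n.x, i.y - d * n.y, i.z - d * n.z };
}

// V, L and N are expected to be normalized; L points from the surface to the light.
float DoSpecular(float specularPower, Vec3 V, Vec3 L, Vec3 N)
{
    // Negate L so the incident vector points from the light to the surface.
    Vec3 R = Reflect(Vec3{ -L.x, -L.y, -L.z }, N);
    float RdotV = std::max(Dot(R, V), 0.0f);
    return std::pow(RdotV, specularPower);
}
```

When the viewer looks exactly along the reflected ray the factor is 1, and higher specular powers make it fall off faster as the view vector moves away from that direction.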
Attenuation 衰减
Attenuation is the fall-off of the intensity of the light as the light gets further away from the point being shaded. In traditional lighting models, the attenuation is computed as the reciprocal of the sum of three terms, each built from one attenuation factor: the constant factor, the linear factor multiplied by the distance to the light, and the quadratic factor multiplied by the squared distance (as explained in Attenuation):
衰减是光强随着光离被遮挡点的距离增加而减弱的过程。在传统的光照模型中,衰减被计算为三个衰减因子之和的倒数乘以到光源的距离(如在衰减中所解释的)。
- Constant attenuation 恒定衰减
- Linear attenuation 线性衰减
- Quadratic attenuation 二次衰减
However this method of computing attenuation assumes that the fall-off of the light never reaches zero (lights have an infinite range). For deferred shading and forward+ we must be able to represent the lights in the scene as volumes with finite range so we need to use a different method to compute the attenuation of the light.
然而,这种计算衰减的方法假设光线的衰减永远不会达到零(灯光具有无限范围)。对于延迟着色和 Forward+,我们必须能够将场景中的灯光表示为具有有限范围的体积,因此我们需要使用不同的方法来计算光线的衰减。
One possible method to compute the attenuation of the light is to perform a linear blend from 1.0 when the point is closest to the light to 0.0 when the point is at a distance greater than the range of the light. However, a linear fall-off does not look very realistic, as attenuation in reality is more similar to the reciprocal of a quadratic function.
计算光线衰减的一种可能方法是从 1.0 开始进行线性混合,当点最靠近光源时为 1.0,如果点的距离大于光源的范围,则为 0.0。然而,线性衰减看起来并不是很现实,因为实际上衰减更类似于二次函数的倒数。
I decided to use the HLSL smoothstep intrinsic function, which returns a smooth interpolation between a minimum and maximum value.
我决定使用 smoothstep hlsl 内置函数,该函数返回最小值和最大值之间的平滑插值。
HLSL smoothstep intrinsic function
HLSL smoothstep 内置函数
`// Compute the attenuation based on the range of the light.`
The smoothstep function will return 0 when the distance to the light (d) is less than ¾ of the range of the light and 1 when the distance to the light is more than the range. Of course we want to reverse this interpolation so we just subtract this value from 1 to get the attenuation we need.
smoothstep 函数将在距离光源的距离(d)小于光源范围的 3/4 时返回 0,在距离光源的距离大于范围时返回 1。当然,我们希望反转这种插值,所以我们只需从 1 中减去这个值,以获得我们需要的衰减。
Optionally, we could adjust the smoothness of the attenuation of the light by parameterizing the 0.75f in the equation above. A smoothness factor of 0.0 should result in the intensity of the light remaining 1.0 all the way to the maximum range of the light, while a smoothness of 1.0 should result in the intensity of the light being interpolated through the entire range of the light.
可选地,我们可以通过上述方程中的 0.75f 的参数化来调整光线衰减的平滑度。平滑度因子为 0.0 应导致光线的强度一直保持为 1.0,直到光线的最大范围,而平滑度为 1.0 应导致光线的强度在整个光线范围内插值。
Variable attenuation smoothness.
变量衰减平滑度。
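The range-based attenuation and its optional smoothness parameter can be sketched in C++ as follows. This is a sketch under assumptions: `Smoothstep` is a CPU stand-in for the HLSL intrinsic, `smoothness` is a hypothetical parameter name, and the hard cut-off at smoothness 0 is one possible way to avoid a zero-width interpolation interval.

```cpp
#include <algorithm>
#include <cassert>

// CPU stand-in for the HLSL smoothstep intrinsic (requires edge0 < edge1).
float Smoothstep(float edge0, float edge1, float x)
{
    float t = std::min(std::max((x - edge0) / (edge1 - edge0), 0.0f), 1.0f);
    return t * t * (3.0f - 2.0f * t);
}

// d is the distance to the light. A smoothness of 0.25 reproduces the fixed 0.75f.
float DoAttenuation(float range, float d, float smoothness = 0.25f)
{
    assert(range > 0.0f && smoothness >= 0.0f && smoothness <= 1.0f);
    float start = range * (1.0f - smoothness);   // where the fall-off begins
    if (start >= range)
    {
        // smoothness == 0: hard cut-off, avoids dividing by zero in Smoothstep.
        return d < range ? 1.0f : 0.0f;
    }
    // Invert the interpolation so the result is 1 near the light and 0 at the range.
    return 1.0f - Smoothstep(start, range, d);
}
```

For a light with a range of 10, `DoAttenuation(10.0f, 5.0f)` stays at full intensity, the value drops to one half at a distance of 8.75, and it reaches zero at the range itself.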
Now let’s combine the diffuse, specular, and attenuation factors to compute the lighting contribution for each light type.
现在让我们结合漫反射、镜面反射和衰减因子来计算每种光类型的光照贡献。
Point Lights 点光源
Point lights combine the attenuation, diffuse, and specular values to determine the final contribution of the light.
点光源结合衰减、漫反射和镜面反射值来确定光的最终贡献。
`LightingResult DoPointLight( Light light, Material mat, float4 V, float4 P, float4 N )`
On lines 400-401, the diffuse and specular contributions are scaled by the attenuation and the light intensity factors before being returned from the function.
在第 400-401 行,漫反射和镜面反射的贡献在从函数返回之前会被衰减和光强因子缩放。
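How the pieces combine for a point light can be sketched in C++. To stay short, this sketch works with scalar factors instead of colors and inlines the diffuse, specular and attenuation terms; the parameter list is an assumption and differs from the HLSL signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct LightingResult { float Diffuse; float Specular; };   // scalar factors, color omitted

float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// 1 - smoothstep( range * 0.75, range, d )
float DoAttenuation(float range, float d)
{
    float t = std::min(std::max((d - range * 0.75f) / (range * 0.25f), 0.0f), 1.0f);
    return 1.0f - t * t * (3.0f - 2.0f * t);
}

// All vectors are in view space; V and N are normalized and P must differ from the light position.
LightingResult DoPointLight(Vec3 lightPosVS, float range, float intensity,
                            float specularPower, Vec3 V, Vec3 P, Vec3 N)
{
    assert(range > 0.0f);
    Vec3 L = { lightPosVS.x - P.x, lightPosVS.y - P.y, lightPosVS.z - P.z };
    float distance = std::sqrt(Dot(L, L));
    L = { L.x / distance, L.y / distance, L.z / distance };

    float attenuation = DoAttenuation(range, distance);
    float NdotL = Dot(N, L);
    // R = reflect( -L, N )
    Vec3 R = { 2.0f * NdotL * N.x - L.x, 2.0f * NdotL * N.y - L.y, 2.0f * NdotL * N.z - L.z };
    float RdotV = std::max(Dot(R, V), 0.0f);

    return { std::max(NdotL, 0.0f) * attenuation * intensity,
             std::pow(RdotV, specularPower) * attenuation * intensity };
}
```

Because both terms are multiplied by the attenuation, a point beyond the light's range receives exactly zero from it, which is what allows such lights to be skipped or culled.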
Spot Lights 聚光灯
In addition to the attenuation factor, spot lights also have a cone angle. In this case, the intensity of the light is scaled by the dot product between the light vector (L) and the direction of the spotlight. If the angle between light vector and the direction of the spotlight is less than the spotlight cone angle, then the point should be lit by the spotlight. Otherwise the spotlight should not contribute any light to the point being shaded. The DoSpotCone function will compute the intensity of the light based on the spotlight cone angle.
除了衰减因子外,聚光灯还有一个锥角。在这种情况下,光的强度由光矢量(L)与聚光灯方向之间的点积来缩放。如果光矢量与聚光灯方向之间的角度小于聚光灯锥角,则该点应该被聚光灯照亮。否则,聚光灯不应该为被遮蔽的点贡献任何光线。DoSpotCone 函数将根据聚光灯锥角计算光的强度。
`float DoSpotCone( Light light, float4 L )`
First, the cosine angle of the spotlight cone is computed. If the dot product between the direction of the spotlight and the light vector (L) is less than the min cosine angle then the contribution of the light will be 0. If the dot product is greater than max cosine angle then the contribution of the spotlight will be 1.
首先,计算聚光锥的余弦角。如果聚光灯方向与光矢量(L)的点积小于最小余弦角,则光的贡献将为 0。如果点积大于最大余弦角,则聚光的贡献将为 1。
The spotlight’s minimum and maximum cosine angles.
聚光灯的最小和最大余弦角。
It may seem counter-intuitive that the max cosine angle is a smaller angle than the min cosine angle but don’t forget that the cosine of 0° is 1 and the cosine of 90° is 0.
看起来可能有些反直觉,最大余弦角比最小余弦角小,但不要忘记 0°的余弦是 1,90°的余弦是 0。
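The cone test can be sketched in C++ as follows. Two details are assumptions made for the example: the maximum cosine is placed halfway between the minimum cosine and 1, and the light vector is negated before the dot product so that it points in the same general direction as the spotlight.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// CPU stand-in for the HLSL smoothstep intrinsic (requires edge0 < edge1).
float Smoothstep(float edge0, float edge1, float x)
{
    float t = std::min(std::max((x - edge0) / (edge1 - edge0), 0.0f), 1.0f);
    return t * t * (3.0f - 2.0f * t);
}

// L is the normalized vector from the shaded point to the light.
float DoSpotCone(Vec3 spotDirection, float spotlightAngleDegrees, Vec3 L)
{
    assert(spotlightAngleDegrees > 0.0f && spotlightAngleDegrees < 90.0f);
    const float kPi = 3.14159265358979f;
    // The half-angle is stored in degrees, so convert it to radians first.
    float minCos = std::cos(spotlightAngleDegrees * kPi / 180.0f);
    // Assumed choice: full intensity starts halfway between minCos and 1.
    float maxCos = 0.5f * (minCos + 1.0f);
    // Negate L so that both vectors point away from the light.
    float cosAngle = Dot(spotDirection, Vec3{ -L.x, -L.y, -L.z });
    return Smoothstep(minCos, maxCos, cosAngle);
}
```

Points on the spotlight's axis get a factor of 1, points outside the cone get 0, and the band between the two cosines gives the cone a soft edge.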
The DoSpotLight function will compute the spotlight contribution similar to that of the point light with the addition of the spotlight cone angle.
DoSpotLight 函数将计算聚光灯的贡献,类似于点光源,但增加了聚光锥角。
`LightingResult DoSpotLight( Light light, Material mat, float4 V, float4 P, float4 N )`
Directional Lights 定向光
Directional lights are the simplest light type because they do not attenuate over the distance to the point being shaded.
定向光是最简单的光类型,因为它们不会随着到达被着色点的距离而衰减。
`LightingResult DoDirectionalLight( Light light, Material mat, float4 V, float4 P, float4 N )`
Final Shading 最终着色
Now that we have the material properties and the summed lighting contributions of all of the lights in the scene, we can combine them to perform final shading.
现在我们有场景中所有灯光的材质属性和总和光照贡献,我们可以将它们组合起来执行最终的着色。
`float4 P = float4( IN.positionVS, 1 );`
On line 113 the lighting contributions are computed using the DoLighting function that was just described.
在第 113 行,使用刚刚描述的 DoLighting 函数计算光照贡献。
On line 115, the material’s diffuse color is modulated by the light’s diffuse contribution.
在第 115 行,材质的漫反射颜色受到光的漫反射贡献的调制。
If the material’s specular power is lower than 1.0, it will not be considered for final shading. Some artists will assign a specular power less than 1 if a material does not have a specular shine. In this case we just ignore the specular contribution and the material is considered diffuse only (Lambert reflectance only). Otherwise, if the material has a specular color texture associated with it, it will be sampled and combined with the material’s specular color before it is modulated with the light’s specular contribution.
如果材质的高光强度低于 1.0,则不会被考虑用于最终着色。一些艺术家会为没有高光反射的材质分配低于 1 的高光强度。在这种情况下,我们只忽略高光的贡献,将材质视为仅有漫反射(兰伯特反射)。否则,如果材质带有高光颜色纹理,将对其进行采样并与材质的高光颜色结合,然后再与光的高光贡献调制。
The final pixel color is the sum of the ambient, emissive, diffuse and specular components. The opacity of the pixel is determined by the alpha value that was determined earlier in the pixel shader.
最终像素颜色是环境光、自发光、漫反射和高光分量的总和。像素的不透明度由像素着色器中先前确定的 alpha 值决定。
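The final combination can be sketched in C++. This is an illustration only: the colors are passed in already resolved (texture sampling is left out), the threshold follows the wording above so that powers below 1.0 disable the specular term, and all type and function names are assumed for the example.

```cpp
struct Color3 { float r, g, b; };
struct Color4 { float r, g, b, a; };
struct LightingResult { Color3 Diffuse; Color3 Specular; };

Color3 Add(Color3 a, Color3 b) { return { a.r + b.r, a.g + b.g, a.b + b.b }; }
Color3 Mul(Color3 a, Color3 b) { return { a.r * b.r, a.g * b.g, a.b * b.b }; }

Color4 FinalShade(Color3 ambient, Color3 emissive, Color3 diffuse, Color3 specularColor,
                  float specularPower, float alpha, const LightingResult& lit)
{
    // Modulate the material's diffuse color by the summed diffuse light.
    Color3 litDiffuse = Mul(diffuse, lit.Diffuse);

    // A specular power below 1 marks the material as diffuse only.
    Color3 litSpecular = { 0.0f, 0.0f, 0.0f };
    if (specularPower >= 1.0f)
    {
        litSpecular = Mul(specularColor, lit.Specular);
    }

    Color3 sum = Add(Add(ambient, emissive), Add(litDiffuse, litSpecular));
    return { sum.r, sum.g, sum.b, alpha };
}
```

Keeping the diffuse and specular light sums separate until this point is what lets the material's diffuse and specular colors tint them independently.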
Deferred Shading 延迟渲染
The deferred shading technique consists of three passes:
延迟渲染方式包括三个pass:
- G-buffer pass
- Lighting pass
- Transparent pass
The G-buffer pass will fill the G-buffer textures that were described in the introduction. The lighting pass will render each light source as a geometric object and compute the lighting for covered pixels. The transparent pass will render transparent scene objects using the standard forward rendering technique.
G 缓冲pass将填充介绍中描述的 G 缓冲纹理。光照pass将每个光源渲染为几何对象,并计算覆盖像素的光照。透明pass将使用标准的前向渲染方式渲染透明场景对象。
G-Buffer Pass
The first pass of the deferred shading technique will generate the G-buffer textures. I will first describe the layout of the G-buffers.
延迟着色方式的第一步将生成 G 缓冲纹理。我将首先描述 G 缓冲的布局。
G-Buffer Layout
The layout of the G-buffer can be a subject of an entire article on this website. The layout I chose for this demonstration is based on simplicity and necessity. It is not the most efficient G-buffer layout as some data could be better packed into smaller buffers. There has been some discussion on packing attributes in the G-buffers but I did not perform any analysis regarding the effects of using various packing methods.
G-buffer 的布局可以成为本网站上一篇完整文章的主题。我为这个演示选择的布局是基于简单和必要性的。这不是最有效的 G-buffer 布局,因为一些数据可以更好地打包到更小的缓冲区中。关于在 G-buffer 中打包属性已经有一些讨论,但我没有进行任何关于使用各种打包方法的效果的分析。
The attributes that need to be stored in the G-buffers are:
需要存储在 G-buffer 中的属性是:
- Depth/Stencil 深度/模板
- Light Accumulation 光照累加
- Diffuse 漫反射
- Specular 高光
- Normals 法线
Depth/Stencil Buffer 深度/模板缓冲区
The Depth/Stencil texture is stored as 32 bits per pixel, with 24 bits for the depth value as an unsigned normalized value (UNORM) and 8 bits for the stencil value as an unsigned integer (UINT). The texture resource for the depth buffer is created using the R24G8_TYPELESS texture format and the depth/stencil view is created with the D24_UNORM_S8_UINT texture format. When accessing the depth buffer in the pixel shader, the shader resource view is created using the R24_UNORM_X8_TYPELESS texture format since the stencil value is unused.
深度/模板纹理以每像素 32 位存储,其中深度值为 24 位,作为无符号归一化值(UNORM),模板值为 8 位,作为无符号整数(UINT)。 深度缓冲区的纹理资源使用 R24G8_TYPELESS 纹理格式创建,深度/模板视图使用 D24_UNORM_S8_UINT 纹理格式创建。 在像素着色器中访问深度缓冲区时,着色器资源视图使用 R24_UNORM_X8_TYPELESS 纹理格式创建,因为模板值未使用。
The Depth/Stencil buffer will be attached to the output merger stage and will not be directly computed in the G-buffer pixel shader. The depth values that result from the vertex shader output are written directly to the depth/stencil buffer.
深度/模板缓冲区将附加到输出合并阶段,并且不会直接在 G 缓冲像素着色器中计算。 顶点着色器的结果直接写入深度/模板缓冲区。
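(Translator's note: the depth/stencil buffer here is created by the deferred renderer itself and is not the render pipeline's own depth/stencil buffer. They are two separate resources, so "custom depth buffer" would be a more fitting name. Unity's _CameraDepthTexture, for example, is a depth texture that Unity copies from the depth buffer for us, and we can just as well draw our own depth texture the way this deferred renderer does.)
(需要注意的是,此处的深度模板缓冲区是由Deferred Render创建的,而非渲染流水线的深度模板缓冲区,是两份资源,称之为自定义深度缓冲区更合适,对于Unity的_CameraDepthTexture就是Unity帮我们从深度缓冲区拷贝的一个深度纹理,我们完全可以像Deferred Render这样自己绘制一份DepthTexture)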
Output of the Depth/Stencil Buffer in the G-buffer pass
G-buffer pass中深度/模板缓冲区的输出
Light Accumulation Buffer
The light accumulation buffer is used to store the final result of the lighting pass. This is the same buffer as the back buffer of the screen. If your G-buffer textures are the same dimension as your screen, there is no need to allocate an additional buffer for the light accumulation buffer and the back buffer of the screen can be used directly.
光积累缓冲区用于存储光照pass的最终结果。这与屏幕的后备缓冲区相同。如果您的 G 缓冲纹理与屏幕的尺寸相同,则无需为光积累缓冲区分配额外的缓冲区,可以直接使用屏幕的后备缓冲区。
The light accumulation buffer is stored as a 32-bit 4-component unsigned normalized texture using the R8G8B8A8_UNORM texture format for both the texture resource and the shader resource view.
光积累缓冲区存储为 32 位 4 分量无符号归一化纹理,使用 R8G8B8A8_UNORM 纹理格式作为纹理资源和着色器资源视图。
The light accumulation buffer stores the emissive and ambient terms. This image has been considerably brightened to make the scene more visible.
光积累缓冲区存储发射和环境项。为了使场景更加清晰可见,这幅图像已经明显变亮。
After the G-buffer pass, the light accumulation buffer initially only stores the ambient and emissive terms of the lighting equation. This image was brightened considerably to make it more visible.
在 G 缓冲pass之后,光积累缓冲区最初仅存储光照方程中的环境和自发项。为了使其更加可见,这幅图像被明显加亮。
You may also notice that only the fully opaque objects in the scene are rendered. Deferred shading does not support transparent objects so only the opaque objects are rendered in the G-buffer pass.
您可能还注意到场景中仅呈现完全不透明的对象。延迟着色不支持透明对象,因此在 G 缓冲pass中仅呈现不透明对象。
As an optimization, you may also want to accumulate directional lights in the G-buffer pass and skip directional lights in the lighting pass. Since directional lights are rendered as full-screen quads in the lighting pass, accumulating them in the g-buffer pass may save some shader cycles if fill-rate is an issue. I’m not taking advantage of this optimization in this experiment because that would require storing directional lights in a separate buffer which is inconsistent with the way the forward and forward+ pixel shaders handle lighting.
作为一种优化,您可能还希望在 G 缓冲pass中累积定向光,并在光照pass中跳过定向光。由于定向光在光照pass中呈现为全屏幕四边形,如果填充率是一个问题,那么在 G 缓冲pass中累积它们可能会节省一些着色器周期。在这个实验中,我没有利用这种优化,因为这将需要将定向光存储在一个单独的缓冲区中,这与前向和Foward+像素着色器处理光照的方式不一致。
Diffuse Buffer 漫反射Buffer
The diffuse buffer is stored as a 32-bit 4-component unsigned normalized (UNORM) texture. Since only opaque objects are rendered in deferred shading, there is no need for the alpha channel in this buffer and it remains unused in this experiment. Both the texture resource and the shader resource view use the R8G8B8A8_UNORM texture format.
漫反射缓冲区存储为 32 位 4 分量无符号归一化(UNORM)纹理。由于延迟着色中只渲染不透明对象,因此在此缓冲区中不需要 alpha 通道,在此实验中保持未使用。纹理资源和着色器资源视图均使用 R8G8B8A8_UNORM 纹理格式。
The Diffuse buffer after the g-buffer pass.
G-buffer pass后的漫反射缓冲区。
The above image shows the result of the diffuse buffer after the G-buffer pass.
上述图像显示了 G-buffer pass后漫反射缓冲区的结果。
Specular Buffer 镜面缓冲区
Similar to the light accumulation and the diffuse buffers, the specular color buffer is stored as a 32-bit 4-component unsigned normalized texture using the R8G8B8A8_UNORM format. The red, green, and blue channels are used to store the specular color while the alpha channel is used to store the specular power. The specular power value is usually expressed in the range (1…256] (or higher) but it needs to be packed into the range [0…1] to be stored in the texture. To pack the specular power into the texture, I use the method described in a presentation given by Michiel van der Leeuw titled “Deferred Rendering in Killzone 2” [13]. In that presentation he uses the following equation to pack the specular power value:
与光累积和漫反射缓冲区类似,镜面反射颜色缓冲区使用 R8G8B8A8_UNORM 格式存储为 32 位 4 分量无符号归一化纹理。红色、绿色和蓝色通道用于存储镜面反射颜色,而 Alpha 通道用于存储镜面反射强度。镜面反射强度值通常以 (1…256]或更高范围表示,但需要将其打包到要存储在纹理中的范围 [0…1],要将镜面反射强度打包到纹理中,我使用所描述的方法。在 Michiel van der Leeuw 发表的题为“Deferred Rendering in Killzone 2”的演讲中,他使用以下公式来计算镜面反射强度值:
This function allows for packing of specular power values in the range [1…1448.15] and provides good precision for values in the normal specular range (1…256). The graph below shows the progression of the packed specular value.
此函数允许打包 [1…1448.15],范围内的镜面反射功率值,并为正常镜面反射中的值提供良好的精度范围 (1…256),下图显示了打包镜面反射值的趋势。
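The quoted range [1…1448.15] equals [2^0, 2^10.5], which is consistent with a packing of the form log2(power) / 10.5. The C++ sketch below assumes that form; `QuantizeUnorm8` is a hypothetical helper that imitates storing the packed value in an 8-bit UNORM channel.

```cpp
#include <cassert>
#include <cmath>

// Assumed packing: log2(power) / 10.5, which maps [1, 2^10.5] onto [0, 1].
float PackSpecularPower(float specularPower)
{
    assert(specularPower >= 1.0f);
    return std::log2(specularPower) / 10.5f;
}

// Inverse of PackSpecularPower, as it would be applied in the lighting pass.
float UnpackSpecularPower(float packed)
{
    return std::exp2(packed * 10.5f);
}

// Round to the nearest value an 8-bit UNORM channel can hold.
float QuantizeUnorm8(float v)
{
    return std::round(v * 255.0f) / 255.0f;
}
```

Because the packing is logarithmic, the 256 available steps are spent evenly on ratios rather than on absolute values, so a specular power of 64 survives the 8-bit round trip with an error of well under 1.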
And the result of the specular buffer after the G-buffer pass looks like this.
G-buffer pass后镜面缓冲区的结果看起来像这样。
Normal Buffer 法线缓冲区
The view space normals are stored in a 128-bit 4-component floating point buffer using the R32G32B32A32_FLOAT texture format. A normal buffer of this size is not really necessary and I could probably have packed the X and Y components of the normal into a 32-bit 2-component half-precision floating point buffer and recomputed the z-component in the lighting pass. For this experiment, I favored precision and simplicity over efficiency and since my GPU is not constrained by texture memory I used the largest possible buffer with the highest precision.
视图空间法线存储在一个 128 位 4 分量浮点缓冲区中,使用 R32G32B32A32_FLOAT 纹理格式。这样大小的法线缓冲区实际上并不是必需的,我可能可以将法线的 X 和 Y 分量打包到一个 32 位 2 分量半精度浮点缓冲区中,并在光照pass中重新计算 z 分量。对于这个实验,我更看重精度和简单性,而不是效率,因为我的 GPU 不受纹理内存的限制,我使用了具有最高精度的最大可能缓冲区。
It would be worthwhile to investigate other texture formats for the normal buffer and analyze the quality versus performance tradeoffs. My hypothesis is that using a smaller texture format (for example R16G16_FLOAT) for the normal buffer would produce similar quality results while providing improved performance.
值得研究其他法线缓冲区的纹理格式,并分析质量与性能之间的权衡。我的假设是,对于法线缓冲区使用较小的纹理格式(例如 R16G16_FLOAT)可能会产生类似质量的结果,同时提供改进的性能。
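The two-component idea mentioned above can be sketched in C++. This is only an illustration of the reconstruction step: it assumes unit-length normals whose Z component is non-negative (which does not hold for every view-space normal), and it does not model the reduced precision of half floats.

```cpp
#include <algorithm>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Keep only X and Y of a unit-length view-space normal.
Vec2 EncodeNormalXY(Vec3 n)
{
    return { n.x, n.y };
}

// Rebuild Z from the unit-length constraint; assumes Z was non-negative.
Vec3 DecodeNormalXY(Vec2 e)
{
    // Clamp at zero so rounding errors cannot produce a negative square root.
    float z = std::sqrt(std::max(0.0f, 1.0f - e.x * e.x - e.y * e.y));
    return { e.x, e.y, z };
}
```

The lost sign of Z is the main weakness of this scheme, which is one reason other normal encodings are worth comparing in any such analysis.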
The result of the normal buffer after the G-buffer pass.
法线缓冲区在 G-Buffer pass 后的结果。
The image above shows the result of the normal buffer after the G-buffer pass.
上面的图像显示了 G-buffer pass 后的法线缓冲区的结果。
Layout Summary 布局摘要
The total G-buffer layout looks similar to the table shown below.
整个 G-buffer 布局看起来与下面显示的表格类似。
Buffer | R | G | B | A |
---|---|---|---|---|
Depth/Stencil 深度/模板 | D24_UNORM | S8_UINT | | |
Light Accumulation 光积累 | R8_UNORM | G8_UNORM | B8_UNORM | A8_UNORM |
Diffuse | R8_UNORM | G8_UNORM | B8_UNORM | A8_UNORM |
Specular | R8_UNORM | G8_UNORM | B8_UNORM | A8_UNORM |
Normal | R32_FLOAT | G32_FLOAT | B32_FLOAT | A32_FLOAT |
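The memory cost of this layout can be estimated with a few lines of C++. The per-pixel sizes follow from the formats in the table; the 1280×720 resolution used in the note below is only an example and is not taken from the experiment.

```cpp
#include <cstddef>

// Bytes per pixel for each buffer, derived from the formats in the table.
constexpr std::size_t kDepthStencilBytes      = 4;    // 24-bit depth + 8-bit stencil
constexpr std::size_t kLightAccumulationBytes = 4;    // R8G8B8A8_UNORM
constexpr std::size_t kDiffuseBytes           = 4;    // R8G8B8A8_UNORM
constexpr std::size_t kSpecularBytes          = 4;    // R8G8B8A8_UNORM
constexpr std::size_t kNormalBytes            = 16;   // R32G32B32A32_FLOAT

// Total size of all G-buffer textures for the given resolution.
constexpr std::size_t GBufferBytes(std::size_t width, std::size_t height,
                                   std::size_t normalBytes = kNormalBytes)
{
    return width * height *
           (kDepthStencilBytes + kLightAccumulationBytes + kDiffuseBytes +
            kSpecularBytes + normalBytes);
}
```

At 1280×720 this comes to 32 bytes per pixel, or 29,491,200 bytes in total, and half of that is the normal buffer. Passing 4 for `normalBytes` (the size of an R16G16_FLOAT texel) lowers the total to 18,432,000 bytes.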
Pixel Shader 像素着色器
The pixel shader for the G-buffer pass is very similar to the pixel shader for the forward renderer. The primary difference is that no lighting calculations are performed in the G-buffer pass. Collecting the material properties is identical to the forward rendering technique, so I will not repeat that part of the shader code here.
G-buffer pass的像素着色器与前向渲染器的像素着色器非常相似。主要区别在于 G-buffer pass中不执行任何光照计算。在前向渲染方式中收集材质属性的过程与之前相同,因此我不会在此重复着色器代码的这一部分。
To output the G-buffer data to the textures, each G-buffer texture will be bound to a render target output using the PixelShaderOutput structure.
将 G-Buffer数据输出到纹理,每个 G-Buffer纹理将绑定到一个渲染目标输出,使用 PixelShaderOutput 结构。
1 | struct PixelShaderOutput |
Since the depth/stencil buffer is bound to the output-merger stage, we don’t need to output the depth value from the pixel shader.
由于深度/模板缓冲区绑定到输出合并阶段,因此我们不需要从像素着色器输出深度值。
Now let’s fill the G-buffer textures in the pixel shader.
现在让我们在像素着色器中填充 G 缓冲纹理。
1 | [earlydepthstencil] |
Once all of the material properties have been retrieved, we only need to save the properties to the appropriate render target. The source code to read all of the material properties has been skipped for brevity. You can download the source code at the end of this article to see the complete pixel shader.
一旦检索到所有材质属性,我们只需要将属性保存到适当的渲染目标中。为简洁起见,读取所有材质属性的源代码已被省略。您可以在本文末尾下载源代码以查看完整的像素着色器。
With the G-buffers filled, we can compute the final shading in the light pass. In the next sections, I will describe the method used by Guerrilla in Killzone 2 and I will also describe the implementation I used and explain why I used a different method.
有了 G-Buffer填充,我们可以在光pass中计算最终的着色。在接下来的部分,我将描述 Guerrilla 在 Killzone 2 中使用的方法,还将描述我使用的实现并解释为什么我使用了不同的方法。
Lighting Pass (Guerrilla)
The primary source of inspiration for the lighting pass of the deferred shading technique that I am using in this experiment comes from a presentation called “Deferred Rendering in Killzone 2” presented by Michiel van der Leeuw at the Sony Computer Entertainment Graphics Seminar at Palo Alto, California in August 2007 [13]. In Michiel’s presentation, he describes the lighting pass in four phases:
在这个实验中我使用的延迟着色方式的光照pass的主要灵感来源于 Michiel van der Leeuw 在 2007 年 8 月在加利福尼亚州帕洛阿尔托索尼计算机娱乐图形研讨会上的演示“Killzone 2 中的延迟渲染”[13]。在 Michiel 的演示中,他将光照pass描述为四个阶段:
- Clear stencil buffer to 0, 将模板缓冲区清除为 0,
- Mark pixels in front of the far light boundary, 标记在远光边界前面的像素,
- Count number of lit pixels inside the light volume, 计算光体内部照亮的像素数量,
- Shade the lit pixels 对被照亮的像素进行着色
I will briefly describe the last three steps. I will then present the method I chose to use to implement the lighting pass of the deferred shading technique and explain why I chose a different method than what was explained in Michiel’s presentation.
我将简要描述最后三个步骤。然后,我将介绍我选择用来实现延迟着色方式的光照pass的方法,并解释为什么我选择了与 Michiel 演示中所解释的方法不同的方法。
Determine Lit Pixels 确定照亮的像素
According to Michiel’s presentation, in order to determine which pixels are lit, you first need to render the back faces of the light volume and mark the pixels that are in front of the far light boundary. Then count the number of marked pixels that are behind the front faces of the light volume. Finally, shade the pixels that are marked and behind the front faces of the light volume.
根据 Michiel 的演示,要确定哪些像素被照亮,首先需要渲染光体的背面,并标记在远光边界前面的像素。然后计算在光体前面的像素数量。最后,着色标记的像素并在光体前面的像素后面。
Mark Pixels 标记像素
In the first phase, the pixels that are in front of the back faces of the light volume will be marked in the stencil buffer. To do this, you must first clear the stencil buffer to 0 then configure the pipeline state with the following settings:
在第一阶段,将标记在光体后面的像素标记在模板缓冲区中。为此,必须首先将模板缓冲区清零,然后使用以下设置配置管线状态:
- Bind only the vertex shader (no pixel shader is required) 仅绑定顶点着色器(不需要像素着色器)
- Bind only the depth/stencil buffer to the output merger stage (since no pixel shader is bound, there is no need for a color buffer) 仅将深度/模板缓冲区绑定到输出合并阶段(因为未绑定像素着色器,所以不需要颜色缓冲区)
- Rasterizer State: 光栅化器状态:
  - Set cull mode to FRONT to render only the back faces of the light volume 将剔除模式设置为 FRONT 以仅渲染光体的背面
- Depth/Stencil State: 深度/模板状态
  - Enable depth testing 启用深度测试
  - Disable depth writes 禁用深度写入
  - Set the depth function to GREATER_EQUAL 将深度函数设置为 GREATER_EQUAL
  - Enable stencil operations 启用模板操作
  - Set stencil reference to 1 将模板参考值设置为 1
  - Set stencil function to ALWAYS 将模板函数设置为始终
  - Set stencil operation to REPLACE on depth pass. 在深度pass上将模板操作设置为替换。
And render the light volume. The image below shows the effect of this operation.
并渲染光体积。下面的图像显示了此操作的效果。
The dotted line of the light volume is culled and only the back facing polygons are rendered. The green volumes show where the stencil buffer will be marked with the stencil reference value. The next step is to count the pixels inside the light volume.
光体的虚线被剔除,只渲染背面的多边形。绿色的体积显示模板缓冲区将被标记为模板参考值的位置。下一步是计算光体内的像素数量。
Count Pixels 计算像素
The next phase is to count the number of pixels that were both marked in the previous phase and are inside the light volume. This is done by rendering the front faces of the light volume and counting the number of pixels that are both stencil marked in the previous phase and behind the front faces of the light volume. In this case, the pipeline state should be configured with:
下一阶段是计算在前一阶段标记的像素数量以及位于光体内部的像素数量。这是通过渲染光体的正面并计算在前一阶段被标记的像素数量以及位于光体正面后面的像素数量来完成的。在这种情况下,管线状态应配置为:
- Bind only the vertex shader (no pixel shader is required) 仅绑定顶点着色器(不需要像素着色器)
- Bind only the depth/stencil buffer to the output merger stage (since no pixel shader is bound, there is no need for a color buffer) 仅将深度/模板缓冲区绑定到输出合并阶段(因为未绑定像素着色器,所以不需要颜色缓冲区)
- Configure the Rasterizer State: 配置光栅化器状态:
  - Set cull mode to BACK to render only the front faces of the light volume 将剔除模式设置为后向以仅渲染光体的前面
- Depth/Stencil State: 深度/模板状态
  - Enable depth testing 启用深度测试
  - Disable depth writes 禁用深度写入
  - Set the depth function to LESS_EQUAL 将深度函数设置为 LESS_EQUAL
  - Enable stencil operations 启用模板操作
  - Set stencil reference to 1 将模板参考值设置为 1
  - Set stencil operations to KEEP (don’t modify the stencil buffer) 将模板操作设置为 KEEP(不修改模板缓冲区)
  - Set stencil function to EQUAL 将模板函数设置为 EQUAL
And render the light volume again with an occlusion pixel query to count the number of pixels that pass both the depth and stencil operations. The image below shows the effect of this operation.
通过遮挡像素查询再次渲染光体积,以计算通过深度和模板操作的像素数量。下图显示了此操作的效果。
The red volume in the image shows the pixels that would be counted in this phase.
图像中的红色体积显示了在此阶段将被计入的像素。
If the number of pixels rasterized is below a certain threshold, then the shading step can be skipped. If the number of rasterized pixels is above a certain threshold then the pixels need to be shaded.
如果光栅化的像素数量低于一定阈值,则可以跳过着色步骤。如果光栅化的像素数量高于一定阈值,则需要对像素进行着色。
One step that was described in Michiel’s presentation but is skipped for this experiment is generating the light shadow maps. The primary purpose of the pixel query is to skip shadow map generation. Since I’m not doing shadow mapping in this experiment, I completely skip this step in my own implementation (as will be shown later).
Michiel 的演示中描述的一个步骤,但在这个实验中被跳过的是生成光影地图。像素查询的主要目的是跳过阴影地图的生成。由于在这个实验中我没有进行阴影贴图,所以我完全跳过了自己实现中的这一步骤(稍后将会展示)。
Shade Pixels 着色像素
The final step according to Michiel’s method is to shade the pixels that are inside the light volume. To do this the configuration of the pipeline state should be identical to the pipeline configuration of the count pixels phase with the addition of enabling additive blending, binding a pixel shader and attaching a color buffer to the output merger stage.
根据 Michiel 的方法,最后一步是对位于光体内的像素进行着色。为此,管线状态的配置应与计算像素阶段的管线配置相同,同时启用附加混合,绑定像素着色器,并将颜色缓冲附加到输出合并器阶段。
- Bind both vertex and pixel shaders 绑定顶点和像素着色器
- Bind depth/stencil and light accumulation buffer to the output merger stage 将深度/模板和光积累缓冲区绑定到输出合并阶段
- Configure the Rasterizer State: 配置光栅化器状态:
  - Set cull mode to BACK to render only the front faces of the light volume 将剔除模式设置为后向以仅渲染光体的前面
- Depth/Stencil State: 深度/模板状态
  - Enable depth testing 启用深度测试
  - Disable depth writes 禁用深度写入
  - Set the depth function to LESS_EQUAL 将深度函数设置为 LESS_EQUAL
  - Enable stencil operations 启用模板操作
  - Set stencil reference to 1 将模板参考值设置为 1
  - Set stencil operations to KEEP (don’t modify the stencil buffer) 将模板操作设置为 KEEP(不修改模板缓冲区)
  - Set stencil function to EQUAL 将模板函数设置为 EQUAL
- Blend State: 混合状态:
  - Enable blend operations 启用混合操作
  - Set source factor to ONE 将源因子设置为 ONE
  - Set destination factor to ONE 将目的地因子设置为 ONE
  - Set blend operation to ADD 将混合操作设置为 ADD
The result should be that only the pixels that are contained within the light volume are shaded.
结果应该是只有包含在光体内的像素才会被着色。
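The three phases can be modelled on the CPU for a single pixel. The C++ below is only a sketch of the depth and stencil logic under stated assumptions, not GPU code and not taken from the Killzone 2 presentation: `PixelSample` and the function names are hypothetical, depths are measured along the pixel's view ray with smaller values closer to the eye, and each depth test compares the incoming light-volume fragment against the stored scene depth.

```cpp
#include <cstdint>

// Depths along one pixel's view ray (smaller = closer to the eye).
struct PixelSample
{
    float sceneDepth;      // value already stored in the depth buffer
    bool  frontRasterized; // false if no front-face fragment covers this pixel
    float frontDepth;      // depth of the light volume's front face
    float backDepth;       // depth of the light volume's back face
};

// Mark phase: back faces, depth func GREATER_EQUAL, stencil REPLACE with ref 1.
std::uint8_t MarkPhase(const PixelSample& p)
{
    std::uint8_t stencil = 0;         // stencil buffer cleared to 0
    if (p.backDepth >= p.sceneDepth)  // scene pixel is in front of the far boundary
        stencil = 1;
    return stencil;
}

// Count and shade phases: front faces, depth func LESS_EQUAL,
// stencil func EQUAL with ref 1, stencil op KEEP.
bool PassesFrontFaceTest(const PixelSample& p, std::uint8_t stencil)
{
    if (!p.frontRasterized)
        return false;                 // no fragment, so nothing is counted or shaded
    return stencil == 1 && p.frontDepth <= p.sceneDepth;
}

bool IsShadedGuerrilla(const PixelSample& p)
{
    return PassesFrontFaceTest(p, MarkPhase(p));
}
```

With a light volume spanning depths 0.3 to 0.7, a scene pixel at 0.5 is shaded, while pixels at 0.2 (in front of the volume) and 0.9 (behind it) are rejected by the front-face depth test and the stencil test respectively.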
Lighting Pass (My Implementation)
光照pass(我的实现)
The problem with the lighting pass described in Michiel’s presentation is that the pixel query operation will almost certainly cause a stall while the CPU waits for the GPU query results to be returned. The stall can be avoided by relying on temporal coherence [15] and using the query results from the previous frame (or the previous two frames) instead of the results from the current frame. This would require multiple query objects to be created for each light source because a query object cannot be reused while its result must persist across multiple frames.
Michiel 演示中描述的光照pass存在的问题是,像素查询操作几乎肯定会导致停顿,因为 CPU 必须等待 GPU 查询结果返回。如果使用前一帧(或前 2 帧)的查询结果而不是依赖于当前帧的查询结果,可以避免停顿,这依赖于时间相干理论[15]。这将需要为每个光源创建多个查询对象,因为如果查询对象必须在多个帧之间持久存在,则无法重用查询对象。
Since I am not doing shadow mapping in my implementation there was no apparent need to perform the pixel occlusion query that is described in Michiel’s presentation thus avoiding the potential stalls that are incurred from the query operation.
由于我在我的实现中没有进行阴影贴图,因此没有明显需要执行 Michiel 演示中描述的像素遮挡查询,从而避免了由查询操作产生的潜在停顿。
The other problem with the method described in Michiel’s presentation is that if the eye is inside the light volume then no pixels will be counted or shaded in the count pixels and shade pixels phases.
Michiel 演示中描述的方法的另一个问题是,如果眼睛在光体内部,则在计算像素和着色像素阶段不会计算或着色任何像素。
The green volume shown in the image represents the pixels of the stencil buffer that were marked in the first phase. There is no red volume showing the pixels that were shaded because the front faces of the light volume are clipped by the view frustum. I tried to find a way around this issue by disabling depth clipping but this only prevents clipping of pixels in front of the viewer (pixels behind the eye are still clipped).
图像中显示的绿色体积代表在第一阶段标记在模板缓冲区中的像素。没有显示红色体积,因为光体的前表面被视锥体剪裁。我尝试通过禁用深度剪裁来解决这个问题,但这只能防止在观察者前面剪裁像素(眼睛后面的像素仍然被剪裁)。
To solve this problem, I reversed Michiel’s method:
为了解决这个问题,我颠倒了米歇尔的方法:
- Clear stencil buffer to 1, 清除模板缓冲区为 1,
- Unmark pixels in front of the near light boundary, 取消标记靠近光边界前面的像素,
- Shade pixels that are in front of the far light boundary 着色在远光边界前面的像素
I will explain the last two steps of my implementation and describe the method used to shade the pixels.
我将解释我的实现的最后两个步骤,并描述用于着色像素的方法。
Unmark Pixels 取消标记像素
In the first phase of my implementation we need to unmark all of the pixels that are in front of the front faces of the light’s geometric volume. This ensures that pixels that occlude the light volume are not rendered in the next phase. This is done by first clearing the stencil buffer to 1 to mark all pixels and unmark the pixels that are in front of the front faces of the light volume. The configuration of the pipeline state would look like this:
在我的实现的第一阶段中,我们需要取消标记所有位于光的几何体积前面的像素。这确保了遮挡光体积的像素不会在下一阶段渲染。首先清除模板缓冲区以将所有像素标记为 1,并取消标记位于光体积前面的像素。管线状态的配置如下:
- Bind only the vertex shader (no pixel shader is required) 仅绑定顶点着色器(不需要像素着色器)
- Bind only the depth/stencil buffer to the output merger stage (since no pixel shader is bound, there is no need for a color buffer) 仅将深度/模板缓冲区绑定到输出合并阶段(因为未绑定像素着色器,所以不需要颜色缓冲区)
- Rasterizer State: 光栅化器状态:
  - Set cull mode to BACK to render only the front faces of the light volume 将剔除模式设置为后向以仅渲染光体的前面
- Depth/Stencil State: 深度/模板状态
  - Enable depth testing 启用深度测试
  - Disable depth writes 禁用深度写入
  - Set the depth function to GREATER 将深度函数设置为 GREATER
  - Enable stencil operations 启用模板操作
  - Set stencil function to ALWAYS 将模板函数设置为始终
  - Set stencil operation to DECR_SAT on depth pass. 在深度pass上将模板操作设置为 DECR_SAT。
And render the light volume. The image below shows the result of this operation.
然后渲染光体积。下面的图像显示了此操作的结果。
Setting the stencil operation to DECR_SAT will decrement and clamp the value in the stencil buffer to 0 if the depth test passes. The green volume shows where the stencil buffer will be decremented to 0. Consequently, if the eye is inside the light volume, all pixels will still be marked in the stencil buffer because the front faces of the light volume would be clipped by the viewing frustum and no pixels would be unmarked.
将模板操作设置为 DECR_SAT 将在深度测试通过时将模板缓冲区中的值递减并夹紧到 0。绿色体积显示了模板缓冲区将被递减到 0 的位置。因此,如果眼睛在光体积内部,所有像素仍将在模板缓冲区中标记,因为光体积的前表面将被视锥体剪裁,没有像素将被取消标记。
In the next phase the pixels in front of the back faces of the light volume will be shaded.
在下一阶段,将对光体背面前的像素进行着色。
Shade Pixels 着色像素
In this phase the pixels that are both in front of the back faces of the light volume and not unmarked in the previous phase will be shaded. In this case, the configuration of the pipeline state would look like this:
在这个阶段,那些既位于光体背面之前、又没有在上一阶段被取消标记的像素将被着色。在这种情况下,管线状态的配置将如下所示:
- Bind both vertex and pixel shaders 绑定顶点和像素着色器
- Bind depth/stencil and light accumulation buffer to the output merger stage 将深度/模板和光积累缓冲区绑定到输出合并阶段
- Configure the Rasterizer State: 配置光栅化器状态:
  - Set cull mode to FRONT to render only the back faces of the light volume 将剔除模式设置为 FRONT 以仅渲染光体的背面
  - Disable depth clipping 禁用深度裁剪
- Depth/Stencil State: 深度/模板状态
  - Enable depth testing 启用深度测试
  - Disable depth writes 禁用深度写入
  - Set the depth function to GREATER_EQUAL 将深度函数设置为 GREATER_EQUAL
  - Enable stencil operations 启用模板操作
  - Set stencil reference to 1 将模板参考值设置为 1
  - Set stencil operations to KEEP (don’t modify the stencil buffer) 将模板操作设置为 KEEP(不修改模板缓冲区)
  - Set stencil function to EQUAL 将模板函数设置为 EQUAL
- Blend State: 混合状态:
  - Enable blend operations 启用混合操作
  - Set source factor to ONE 将源因子设置为 ONE
  - Set destination factor to ONE 将目的地因子设置为 ONE
  - Set blend operation to ADD 将混合操作设置为 ADD
You may have noticed that I also disable depth clipping in the rasterizer state for this phase. Doing this will ensure that if any part of the light volume exceeds the far clipping plane, it will not be clipped.
您可能已经注意到,我还在光栅化器状态中禁用了深度裁剪。这样做可以确保如果光体的任何部分超出了远裁剪平面,它不会被裁剪。
The image below shows the result of this operation.
下面的图像显示了此操作的结果。
The red volume shows pixels that will be shaded in this phase. This implementation will properly shade pixels even if the viewer is inside the light volume. In the second phase, only pixels that are both in front of the back faces of the light volume and not unmarked in the previous phase will be shaded.
红色体积显示了在这个阶段将被着色的像素。即使观察者在光体内部,此实现也将正确着色像素。在第二阶段,只有那些在光体后面的背面前面且在前一阶段未标记的像素将被着色。
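The reversed method can be checked with a small per-pixel model. This C++ sketch is an illustration under assumptions rather than the demo's code: `PixelSample` and the function names are hypothetical, depths are measured along the pixel's view ray with smaller values closer to the eye, and `frontRasterized` is false when the front faces are clipped because the eye sits inside the light volume.

```cpp
#include <cstdint>

// Depths along one pixel's view ray (smaller = closer to the eye).
struct PixelSample
{
    float sceneDepth;      // value already stored in the depth buffer
    bool  frontRasterized; // false when the eye is inside the light volume
    float frontDepth;      // depth of the light volume's front face
    float backDepth;       // depth of the light volume's back face
};

// Unmark phase: front faces, depth func GREATER, stencil DECR_SAT on depth pass.
std::uint8_t UnmarkPhase(const PixelSample& p)
{
    std::uint8_t stencil = 1;                      // stencil buffer cleared to 1
    if (p.frontRasterized && p.frontDepth > p.sceneDepth)
        stencil = stencil > 0 ? stencil - 1 : 0;   // decrement, saturating at 0
    return stencil;
}

// Shade phase: back faces, depth func GREATER_EQUAL,
// stencil func EQUAL with ref 1, stencil op KEEP.
bool IsShaded(const PixelSample& p)
{
    return UnmarkPhase(p) == 1 && p.backDepth >= p.sceneDepth;
}
```

For a volume spanning depths 0.3 to 0.7, a scene pixel at 0.5 is shaded and pixels at 0.2 or 0.9 are not. When the front faces produce no fragments, the stencil value stays at 1 and the pixel at 0.5 is still shaded.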
Next I’ll describe the pixel shader that is used to implement the deferred lighting pass.
接下来,我将描述用于实现延迟光照pass的像素着色器。
Pixel Shader 像素着色器
The pixel shader is only bound during the shade pixels phase described above. It will fetch the texture data from the G-buffers and use it to shade the pixel using the same lighting model that was described in the Forward Rendering section.
像素着色器仅在上面描述的着色像素阶段期间绑定。它将从 G-Buffer中提取纹理数据,并使用它来使用在前向渲染部分中描述的相同光照模型对像素进行着色。
Since all of our lighting calculations are performed in view space, we need to compute the view space position of the current pixel.
由于我们所有的光照计算都是在视图空间中执行的,因此我们需要计算当前像素的视图空间位置。
We will use the screen space position and the value in the depth buffer to compute the view space position of the current pixel. To do this, we will use the ClipToView function to convert clip space coordinates to view space and the ScreenToView function to convert screen coordinates to view space.
我们将使用屏幕空间位置和深度缓冲区中的值来计算当前像素的视图空间位置。为此,我们将使用 ClipToView 函数将裁剪空间坐标转换为视图空间,并使用 ScreenToView 函数将屏幕坐标转换为视图空间。
In order to facilitate these functions, we need to know the screen dimensions and the inverse projection matrix of the camera which should be passed to the shader from the application in a constant buffer.
为了方便这些功能,我们需要知道屏幕尺寸和摄像机的逆投影矩阵,这些信息应该从应用程序传递给着色器的常量缓冲区中。
1 | // Parameters required to convert screen space coordinates to view space. |
To convert the screen space coordinates to view space, we first scale and shift them into clip space, then transform the clip space coordinate into view space by multiplying it by the inverse of the projection matrix.
将屏幕空间坐标转换为裁剪空间,我们需要将屏幕空间坐标缩放和平移到裁剪空间,然后通过将裁剪空间坐标乘以投影矩阵的逆矩阵来将裁剪空间坐标转换为视图空间。
1 | // Convert clip space coordinates to view space |
First, we need to normalize the screen coordinates by dividing them by the screen dimensions. This will convert the screen coordinates that are expressed in the range ([0…SCREEN_WIDTH], [0…SCREEN_HEIGHT]) into the range ([0…1], [0…1]).
首先,我们需要通过屏幕尺寸来归一化屏幕坐标。这将把以范围([0…屏幕宽度], [0…屏幕高度])表示的屏幕坐标转换为范围([0…1], [0…1])。
In DirectX, the screen origin (0, 0) is the top-left corner of the screen and the screen’s y-coordinate increases from top to bottom. This is the opposite of the y-coordinate’s direction in clip space, so we need to flip the y-coordinate in normalized screen space to get it in the range ([0…1], [1…0]). Then we need to scale the normalized screen coordinate by 2 to get it in the range ([0…2], [2…0]) and shift it by -1 to get it in the range ([-1…1], [1…-1]).
在 DirectX 中,屏幕原点(0,0)位于屏幕的左上角,屏幕的 y 坐标从上到下递增。这与裁剪空间中的 y 坐标方向相反,因此我们需要翻转归一化屏幕空间中的 y 坐标,使其在范围内([0…1],[1…0])。然后,我们需要将归一化屏幕坐标缩放 2 倍,使其在范围内([0…2],[2…0]),并将其移位-1,使其在范围内([-1…1],[1…-1])。
Now that we have the clip space position of the current pixel, we can use the ClipToView function to convert it into view space. This is done by multiplying the clip space coordinate by the inverse of the camera’s projection matrix (line 195) and dividing by the w component to remove the perspective projection (line 197).
现在我们有了当前像素的剪辑空间位置,我们可以使用 ClipToView 函数将其转换为视图空间。这是通过将剪辑空间坐标乘以相机投影矩阵的逆(第 195 行)并除以 w 分量来完成的,以消除透视投影(第 197 行)。
Now let’s put this function to use in our shader.
现在让我们在着色器中使用这个函数。
1 | [earlydepthstencil] |
The input structure to the deferred lighting pixel shader is identical to the output of the vertex shader including the position parameter that is bound to the SV_Position system value semantic. When used in a pixel shader, the value of the parameter bound to the SV_Position semantic will be the screen space position of the current pixel being rendered. We can use this value and the value from the depth buffer to compute the view space position.
延迟光照像素着色器的输入结构与顶点着色器的输出相同,包括绑定到 SV_Position 系统值语义的位置参数。在像素着色器中使用时,绑定到 SV_Position 语义的参数的值将是当前正在渲染的像素的屏幕空间位置。我们可以使用这个值和深度缓冲区中的值来计算视图空间位置。
Since the G-buffer textures are the same dimension as the screen for the lighting pass, we can use the Texture2D.Load [16] method to fetch the texel from each of the G-buffer textures. The texture coordinate of the Texture2D.Load method is an int3 where the x and y components are the U and V texture coordinates in non-normalized screen coordinates and the z component is the mipmap level to sample. When sampling the G-buffer textures, we always want to sample mipmap level 0 (the most detailed mipmap level). Sampling from a less detailed mipmap level will cause the textures to appear blocky. If no mipmaps have been generated for the G-buffer textures, sampling from any other mipmap level will return black texels. The Texture2D.Load method does not perform any texture filtering when sampling the texture, making it faster than the Texture2D.Sample method when using linear filtering.
由于 G 缓冲纹理与屏幕具有相同的尺寸,因此我们可以使用 Texture2D.Load [16]方法从每个 G 缓冲纹理中提取纹素。Texture2D.Load 方法的纹理坐标是一个 int3,其中 x 和 y 分量是非规范化屏幕坐标中的 U 和 V 纹理坐标,z 分量是要采样的 mipmap 级别。在采样 G 缓冲纹理时,我们总是希望采样 mipmap 级别 0(最详细的 mipmap 级别)。从较低的 mipmap 级别采样会导致纹理显示为块状。如果没有为 G 缓冲纹理生成 mipmaps,则从较低的 mipmap 级别采样将返回黑色纹素。Texture2D.Load 方法在采样纹理时不执行任何纹理过滤,因此在使用线性过滤时比 Texture2D.Sample 方法更快。
Once we have the screen space position and the depth value, we can use the ScreenToView function to convert the screen space position to view space.
一旦我们有屏幕空间位置和深度值,我们就可以使用 ScreenToView 函数将屏幕空间位置转换为视图空间。
Before we can compute the lighting, we need to sample the other components from the G-buffer textures.
在计算光照之前,我们需要从 G 缓冲纹理中采样其他组件。
1 | // View vector |
On line 179 the specular power is unpacked from the alpha channel of the specular color using the inverse of the operation that was used to pack it in the specular texture in the G-buffer pass.
在第 179 行,从镜面颜色的 alpha pass中解压出镜面率,使用与在 G 缓冲pass中的镜面纹理中打包它所使用的操作的逆操作。
In order to retrieve the correct light properties, we need to know the index of the current light in the light buffer. For this, we will pass the light index of the current light in a constant buffer.
为了检索正确的光属性,我们需要知道光缓冲区中当前光的索引。为此,我们将在常量缓冲区中传递当前光的光索引。
1 | cbuffer LightIndexBuffer : register( b4 ) |
And retrieve the light properties from the light list and compute the final shading.
从光列表中检索光属性并计算最终的阴影。
1 | Light light = Lights[LightIndex]; |
You may notice that we don’t need to check if the light is enabled in the shader like we did in the forward rendering shader. If the light is not enabled, the light volume should not be rendered by the application.
您可能会注意到,我们不需要像在正向渲染着色器中那样检查光是否已启用。如果光未启用,则应用程序不应渲染光体积。
We also don’t need to check if the light is in range of the current pixel since the pixel shader should not be invoked on pixels that are out of range of the light.
我们也不需要检查光是否在当前像素的范围内,因为像素着色器不应该在超出光范围的像素上调用。
The lighting functions were already explained in the section on forward rendering so they won’t be explained here again.
光照函数已经在前向渲染部分进行了解释,所以这里不会再进行解释。
On line 203, the diffuse and specular terms are combined and returned from the shader. The ambient and emissive terms were already computed in the light accumulation buffer during the G-buffer shader. With additive blending enabled, all of the lighting terms will be summed correctly to compute final shading.
在第 203 行,漫反射和镜面项被合并并从着色器中返回。环境光和自发光项已经在光累积缓冲区中的 G 缓冲着色器中计算过了。启用了加法混合后,所有光照项将被正确求和以计算最终的阴影。
In the final pass, we need to render transparent objects.
在最后一步,我们需要渲染透明物体。
Transparent Pass 透明pass
The transparent pass for the deferred shading technique is identical to the forward rendering technique with alpha blending enabled. There is no new information to provide here. We will reflect on the performance of the transparent pass in the results section described later.
透明pass用于延迟着色方式与启用了 alpha 混合的前向渲染方式相同。这里没有新信息可提供。我们将在稍后描述的结果部分反思透明pass的性能。
Now let’s take a look at the final technique that will be explained in this article: Forward+.
现在让我们来看看本文将解释的最终方式; Forward+。
Forward+
Forward+ improves upon regular forward rendering by first determining which lights are overlapping which area in screen space. During the shading phase, only the lights that are potentially overlapping the current fragment need to be considered. I used the term “potentially” because the technique used to determine overlapping lights is not completely accurate as I will explain later.
Forward+提升了常规的前向渲染,首先确定哪些光源在屏幕空间中重叠。在着色阶段,只有潜在重叠当前片元的光源需要考虑。我使用“潜在”一词,因为用于确定重叠光源的方式并不完全准确,稍后我会解释。
The Forward+ technique consists primarily of these three passes:
Forward+ 方式主要包括以下三个pass:
- Light culling 光照剔除
- Opaque pass 不透明pass
- Transparent pass 透明pass
In the light culling pass, each light in the scene is sorted into screen space tiles.
在光照剔除过程中,场景中的每个光源都被排序到屏幕空间的瓦片中。
In the opaque pass, the light list generated from the light culling pass is used to compute the lighting for opaque geometry. In this pass, not all lights need to be considered; only the lights that were previously sorted into the current fragment’s screen space tile are considered when computing the lighting.
在不透明pass中,从光剔除pass生成的光列表用于计算不透明几何体的光照。在这个pass中,不需要考虑所有的灯光进行光照,只需要考虑之前被排序到当前片元屏幕空间瓦片中的灯光在计算光照时需要考虑。
The transparent pass is similar to the opaque pass except the light list used for computing lighting is slightly different. I will explain the difference between the light list for the opaque pass and the transparent pass in the following sections.
透明pass类似于不透明pass,只是用于计算光照的光列表略有不同。我将在接下来的部分中解释不透明pass和透明pass的光列表之间的区别。
Grid Frustums 网格视锥体
Before light culling can occur, we need to compute the culling frustums that will be used to cull the lights into the screen space tiles. Since the culling frustums are expressed in view space, they only need to be recomputed if the dimension of the grid changes (for example, if the screen is resized) or the size of a tile changes. I will explain the basis of how the frustum planes for a tile are defined.
在进行光照剔除之前,我们需要计算用于将光源剔除到屏幕空间瓦片中的剔除视锥体。由于剔除视锥体是以视图空间表示的,因此只有在网格的尺寸发生变化(例如,屏幕调整大小)或瓦片的尺寸发生变化时,才需要重新计算。我将解释瓦片的视锥体平面是如何定义的。
The screen is divided into a number of square tiles. I will refer to all of the screen tiles as the light grid. We need to specify a size for each tile. The size defines both the vertical and horizontal size of a single tile. The tile size should not be chosen arbitrarily; it should be chosen so that each tile can be computed by a single thread group in a DirectX compute shader [17]. The number of threads in a thread group should be a multiple of 64 (to take advantage of dual warp schedulers available on modern GPUs) and cannot exceed 1024 threads per thread group. Likely candidates for the dimensions of the thread group are:
屏幕被分成许多方形瓦片。我将所有屏幕瓦片称为光栅。我们需要为每个瓦片指定一个大小。该大小定义了单个瓦片的垂直和水平尺寸。瓦片大小不应该随意选择,而应该选择一个可以由 DirectX 计算着色器中的单个线程组计算的大小。线程组中的线程数应该是 64 的倍数(以利用现代 GPU 上可用的双 warp 调度程序),并且不能超过每个线程组的 1024 个线程。线程组的维度的可能候选者是:
- 8×8 (64 threads per thread group) 8×8(每个线程组 64 个线程)
- 16×16 (256 threads per thread group) 16×16(每个线程组 256 个线程)
- 32×32 (1024 threads per thread group) 32×32(每个线程组 1024 个线程)
For now, let’s assume that the thread group has a dimension of 16×16 threads. In this case, each tile for our light grid has a dimension of 16×16 screen pixels.
目前,让我们假设线程组的维度为 16×16 个线程。在这种情况下,我们光栅的每个瓦片都有 16×16 个屏幕像素的尺寸。
The image above shows a partial grid of 16×16 thread groups. Each thread group is divided by the thick black lines and the threads within a thread group are divided by the thin black lines. A tile used for light culling is also divided in the same way.
上面的图像显示了一个 16×16 线程组的部分网格。每个线程组由粗黑线分隔,线程组内的线程由细黑线分隔。用于光线剔除的瓦片也以相同的方式分隔。
If we were to view the tiles at an oblique angle, we can visualize the culling frustum that we need to compute.
如果我们以斜角观察瓦片,我们可以可视化需要计算的剔除视锥体。
The above image shows that the camera’s position (eye) is the origin of the frustum and the corner points of the tile denote the frustum corners. With this information, we can compute the planes of the tile frustum.
上图显示相机位置(眼睛)是截锥体的原点,瓦片的角点表示截锥体的角点。有了这些信息,我们可以计算出瓦片截锥体的平面。
A view frustum is composed of six planes, but to perform the light culling we want to pre-compute the four side planes for the frustum. The computation of the near and far frustum planes will be deferred until the light culling phase.
视锥由六个平面组成,但为了执行光剔除,我们希望预先计算视锥的四个侧面平面。近和远视锥平面的计算将推迟到光剔除阶段。
To compute the left, right, top, and bottom frustum planes we will use the following algorithm:
计算左、右、上和下视锥体平面,我们将使用以下算法:
- Compute the four corner points of the current tile in screen space. 计算屏幕空间中当前瓦片的四个角点。
- Transform the screen space corner points to the far clipping plane in view space. 将屏幕空间的角点转换到视图空间中的远裁剪平面。
- Build the frustum planes from the eye position and two other corner points. 从眼睛位置和另外两个角点构建视锥平面。
- Store the computed frustum in a RWStructuredBuffer. 将计算得到的视锥存储在 RWStructuredBuffer 中。
A plane can be computed if we know three points that lie on the plane [18]. If we number the corner points of a tile, as shown in the above image, we can compute the frustum planes using the eye position and two other corner points in view space.
如果我们知道位于平面上的三个点,那么可以计算出一个平面[18]。如果我们给瓦片的角点编号,如上图所示,我们可以使用视图空间中的眼睛位置和其他两个角点来计算视锥体平面。
For example, we can use the following points to compute the frustum planes assuming a counter-clockwise winding order:
例如,我们可以使用以下点来计算视锥体平面,假设逆时针顺序:
- Left Plane: Eye, Bottom-Left (2), Top-Left (0) 左平面:眼睛,左下角(2),左上角(0)
- Right Plane: Eye, Top-Right (1), Bottom-Right (3) 右平面:眼睛,右上角(1),右下角(3)
- Top Plane: Eye, Top-Left (0), Top-Right (1) 顶部平面:眼睛,左上角(0),右上角(1)
- Bottom Plane: Eye, Bottom-Right (3), Bottom-Left (2) 底部平面:眼睛,右下角(3),左下角(2)
If we know three non-collinear points A, B and C that lie in the plane (as shown in the above image), we can compute the normal to the plane n [18]:
如果我们知道位于平面上的三个不共线的点 A、B、C(如上图所示),就可以计算出该平面的法线 n [18]:
$$\mathbf{n} = (B - A) \times (C - A)$$
If n is normalized then a given point P that lies on the plane can be used to compute the signed distance from the origin to the plane:
如果 n 已被归一化,那么给定一个位于平面上的点 P,将它与法线点乘即可得出原点到平面的有符号距离(看不懂这里的可以去复习下从深度图重建片元世界空间坐标的内容):
$$d = \mathbf{n} \cdot P$$
This is referred to as the constant-normal form of the plane [18] and can also be expressed as
这被称为平面的常法线形式[18],也可以表示为
$$\mathbf{n} \cdot X - d = ax + by + cz - d = 0$$
Where n = (a, b, c) and X = (x, y, z) given that X is a point that lies in the plane.
In the HLSL shader, we can define a plane as a unit normal n and the distance to the origin d.
1 | struct Plane |
Given three non-collinear counter-clockwise points that lie in the plane, we can compute the plane using the ComputePlane function in HLSL.
给定三个不共线的逆时针排列在平面上的点,我们可以使用 HLSL 中的 ComputePlane 函数计算平面。
1 | // Compute a plane from 3 noncollinear points that form a triangle. |
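The plane construction can be mirrored on the CPU as follows. This is a C++ sketch of the math described above, not the HLSL listing itself: `Float3` and the small vector helpers are written out here only so the example is self-contained.

```cpp
#include <cmath>

struct Float3 { float x, y, z; };

// A plane in constant-normal form: unit normal N and signed distance d to the origin.
struct Plane { Float3 N; float d; };

Float3 Sub(Float3 a, Float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float  Dot(Float3 a, Float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Float3 Cross(Float3 a, Float3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

Float3 Normalize(Float3 v)
{
    const float len = std::sqrt(Dot(v, v));
    return { v.x / len, v.y / len, v.z / len };
}

// Compute a plane from three non-collinear points given in
// counter-clockwise order: n = (p1 - p0) x (p2 - p0), d = n . p0.
Plane ComputePlane(Float3 p0, Float3 p1, Float3 p2)
{
    const Float3 v0 = Sub(p1, p0);
    const Float3 v2 = Sub(p2, p0);
    Plane plane;
    plane.N = Normalize(Cross(v0, v2));
    plane.d = Dot(plane.N, p0);
    return plane;
}
```

For the points (0, 0, 5), (1, 0, 5) and (0, 1, 5) this gives N = (0, 0, 1) and d = 5, which is the plane z = 5.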
And a frustum is defined as a structure of four planes.
一个截头锥体被定义为由四个平面构成的结构。
1 | // Compute a plane from 3 noncollinear points that form a triangle. |
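To see how the four planes fit together, the C++ sketch below builds a tile frustum from the eye (the view-space origin) and four corner points, using the point ordering listed earlier. It is an illustration under assumptions: the `Frustum` layout (left, right, top, bottom) is chosen for this example, and `PointInsideFrustum` is a hypothetical helper used only to show that the resulting normals point into the frustum.

```cpp
#include <cmath>

struct Float3  { float x, y, z; };
struct Plane   { Float3 N; float d; };
struct Frustum { Plane planes[4]; }; // assumed order: left, right, top, bottom

Float3 Sub(Float3 a, Float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float  Dot(Float3 a, Float3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

Float3 Cross(Float3 a, Float3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

Float3 Normalize(Float3 v)
{
    const float len = std::sqrt(Dot(v, v));
    return { v.x / len, v.y / len, v.z / len };
}

Plane ComputePlane(Float3 p0, Float3 p1, Float3 p2)
{
    Plane plane;
    plane.N = Normalize(Cross(Sub(p1, p0), Sub(p2, p0)));
    plane.d = Dot(plane.N, p0);
    return plane;
}

// corners (view space): 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
Frustum BuildTileFrustum(const Float3 corners[4])
{
    const Float3 eye = { 0.0f, 0.0f, 0.0f };
    Frustum f;
    f.planes[0] = ComputePlane(eye, corners[2], corners[0]); // left
    f.planes[1] = ComputePlane(eye, corners[1], corners[3]); // right
    f.planes[2] = ComputePlane(eye, corners[0], corners[1]); // top
    f.planes[3] = ComputePlane(eye, corners[3], corners[2]); // bottom
    return f;
}

// Hypothetical check: a point is inside when it lies on the positive
// side of all four side planes (near and far planes are not tested).
bool PointInsideFrustum(const Frustum& f, Float3 p)
{
    for (const Plane& plane : f.planes)
        if (Dot(plane.N, p) - plane.d < 0.0f)
            return false;
    return true;
}
```

With corners at (±1, ±1, -1) in a right-handed view space, the point (0, 0, -5) straight ahead of the camera is inside, while (-10, 0, -1) and (0, 10, -1) fall outside the left and top planes respectively.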
To precompute the grid frustums we need to invoke a compute shader kernel for each tile in the grid. For example, if the screen resolution is 1280×720 and the light grid is partitioned into 16×16 tiles, we need to compute 80×45 (3,600) frustums. If a thread group contains 16×16 (256) threads we need to dispatch 5×2.8125 thread groups to compute all of the frustums. Of course we can’t dispatch partial thread groups so we need to round up to the nearest whole number when dispatching the compute shader. In this case, we will dispatch 5×3 (15) thread groups each with 16×16 (256) threads and in the compute shader we must make sure that we simply ignore threads that are out of the screen bounds.
为了预先计算网格视锥体,我们需要为网格中的每个瓦片调用一个计算着色器核心。例如,如果屏幕分辨率为 1280×720,光栅被划分为 16×16 个瓦片,我们需要计算 80×45(3,600)个视锥体。如果一个线程组包含 16×16(256)个线程,我们需要派发 5×2.8125 个线程组来计算所有的视锥体。当然,我们不能派发部分线程组,所以在派发计算着色器时,我们需要将其四舍五入到最接近的整数。在这种情况下,我们将派发 5×3(15)个线程组,每个线程组包含 16×16(256)个线程,在计算着色器中,我们必须确保简单地忽略超出屏幕边界的线程。
The above image shows the thread groups that will be invoked to generate the tile frustums assuming a 16×16 thread group. The thick black lines denote the thread group boundary and the thin black lines represent the threads in a thread group. The blue threads represent threads that will be used to compute a tile frustum and the red threads should simply skip the frustum tile computations because they extend past the size of the screen.
上面的图像显示了将被调用以生成瓦片视锥的线程组,假设一个 16×16 线程组。粗黑线表示线程组边界,细黑线代表线程组中的线程。蓝色线程代表将用于计算瓦片视锥的线程,红色线程应该简单地跳过视锥瓦片计算,因为它们超出了屏幕的大小。
We can use the following formula to determine the dimension of the dispatch:
我们可以使用以下公式来确定派遣的维度:
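$$\mathbf{g} = \left( \left\lceil \frac{w}{B} \right\rceil, \left\lceil \frac{h}{B} \right\rceil, 1 \right)$$
$$\mathbf{G} = \left( \left\lceil \frac{g_x}{B} \right\rceil, \left\lceil \frac{g_y}{B} \right\rceil, 1 \right)$$
Where g is the total number of threads that will be dispatched, w is the screen width in pixels, h is the screen height in pixels, B is the size of the thread group (in our example, this is 16) and G is the number of thread groups to execute.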
With this information we can dispatch the compute shader that will be used to precompute the grid frustums.
有了这些信息,我们可以调度用于预计算网格视锥的计算着色器。
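The dispatch dimensions can be computed with integer ceiling division. The C++ below is a small sketch of that arithmetic; `UInt3`, `CeilDiv`, `ComputeNumThreads` and `ComputeNumThreadGroups` are names made up for this example rather than functions from the demo.

```cpp
#include <cstdint>

struct UInt3 { std::uint32_t x, y, z; };

// Integer division rounded up, e.g. CeilDiv(45, 16) == 3.
constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b)
{
    return (a + b - 1) / b;
}

// One thread per tile: the number of tiles needed to cover the screen.
UInt3 ComputeNumThreads(std::uint32_t width, std::uint32_t height, std::uint32_t blockSize)
{
    return { CeilDiv(width, blockSize), CeilDiv(height, blockSize), 1 };
}

// Number of thread groups needed to run that many threads.
UInt3 ComputeNumThreadGroups(UInt3 numThreads, std::uint32_t blockSize)
{
    return { CeilDiv(numThreads.x, blockSize), CeilDiv(numThreads.y, blockSize), 1 };
}
```

For a 1280×720 screen and a block size of 16 this gives 80×45 threads and 5×3 thread groups, matching the numbers worked out above. The 5×3 groups launch 80×48 threads in total, so the last three rows of threads are the ones that must be skipped in the shader.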
Grid Frustums Compute Shader 网格截锥体计算着色器
By default, the size of a thread group for the compute shader will be 16×16 threads but the application can define a different block size during shader compilation.
默认情况下,计算着色器的线程组大小将为 16×16 个线程,但应用程序可以在着色器编译期间定义不同的块大小。
1 |
And we’ll define a common structure to store the common compute shader input variables.
我们将定义一个通用结构来存储常见的计算着色器输入变量。
1 | struct ComputeShaderInput |
See [10] for a list of the system value semantics that are available as inputs to a compute shader.
参见[10],列出可用作计算着色器输入的系统值语义。
In addition to the system values that are provided by HLSL, we also need to know the total number of threads and the total number of thread groups in the current dispatch. Unfortunately HLSL does not provide system value semantics for these properties. We will store the required values in a constant buffer called DispatchParams.
除了 HLSL 提供的系统值之外,我们还需要知道当前调度中线程的总数和线程组的总数。不幸的是,HLSL 没有为这些属性提供系统值语义。我们将把所需的值存储在一个名为 DispatchParams 的常量缓冲区中。
// Global variables
The value of the numThreads variable can be used to ensure that a thread in the dispatch is not used if it is out of bounds of the screen as described earlier.
numThreads 变量的值可用于确保调度中的线程不会超出屏幕范围,如前所述。
To store the result of the computed grid frustums, we also need to create a structured buffer that is large enough to store one frustum per tile. This buffer will be bound to the out_Frustums RWStructuredBuffer variable using an unordered access view (UAV).
为了存储计算的网格视锥体的结果,我们还需要创建一个足够大的结构化缓冲区,以存储每个瓦片的一个视锥体。这个缓冲区将通过无序访问视图(UAV)绑定到 out_Frustums RWStructuredBuffer 变量。
// View space frustums for the grid cells.
Tile Corners in Screen Space 屏幕空间中的瓦片角点
In the compute shader, the first thing we need to do is determine the screen space points of the corners of the tile frustum using the current thread’s global ID in the dispatch.
在计算着色器中,我们首先需要做的是使用调度中当前线程的全局 ID 确定瓦片视锥体角点的屏幕空间点。
// A kernel to compute frustums for the grid
To convert the global thread ID to the screen space position, we simply multiply by the size of a tile in the light grid. The z-component of the screen space position is -1 because I am using a right-handed coordinate system which has the camera looking in the -z axis in view space. If you are using a left-handed coordinate system, you should use 1 for the z-component. This gives us the screen space positions of the tile corners at the far clipping plane.
将全局线程 ID 转换为屏幕空间位置,我们只需乘以光栅中一个瓦片的大小。屏幕空间位置的 z 分量为-1,因为我使用的是右手坐标系,在视图空间中相机朝向-z 轴。如果您使用左手坐标系,应该将 z 分量设为 1。这样我们就得到了远裁剪平面上瓦片角点的屏幕空间位置。
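As a rough CPU-side illustration of this step, the sketch below computes the four screen space corners of one tile. The corner order (top-left, top-right, bottom-left, bottom-right) and the names `Float4` and `TileCornersScreenSpace` are assumptions made for this example; the z value of -1 follows the right-handed convention described above.

```cpp
#include <array>
#include <cstdint>

struct Float4 { float x, y, z, w; };

// Screen space corners of tile (tileX, tileY) on the far clipping plane.
// Assumed order: top-left, top-right, bottom-left, bottom-right.
inline std::array<Float4, 4> TileCornersScreenSpace( uint32_t tileX, uint32_t tileY, uint32_t blockSize )
{
    const float z = -1.0f; // use 1.0f for a left-handed coordinate system
    const float left   = static_cast<float>( tileX * blockSize );
    const float right  = static_cast<float>( ( tileX + 1 ) * blockSize );
    const float top    = static_cast<float>( tileY * blockSize );
    const float bottom = static_cast<float>( ( tileY + 1 ) * blockSize );

    return { {
        { left,  top,    z, 1.0f },
        { right, top,    z, 1.0f },
        { left,  bottom, z, 1.0f },
        { right, bottom, z, 1.0f },
    } };
}
```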
Tile Corners in View Space 在视图空间中的Tile角
Next we need to convert the screen space positions into view space using the ScreenToView function that was described in the section about the deferred rendering pixel shader.
接下来,我们需要使用关于延迟渲染像素着色器部分中描述的 ScreenToView 函数,将屏幕空间位置转换为视图空间。
float3 viewSpace[4];
Compute Frustum Planes 计算视锥体平面
Using the view space positions of the tile corners, we can build the frustum planes.
利用瓦片角的视图空间位置,我们可以构建视锥体平面。
// Now build the frustum planes from the view space points
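One way to build the four side planes on the CPU is sketched below. It assumes that the eye sits at the view space origin, that each plane is defined by the eye and two tile corners, and that the winding is chosen so the normals point into the frustum in a right-handed system. The helper names (`Vec3`, `ComputePlane`, `BuildTileFrustum`) and the plane order are choices made for this sketch, not a copy of the shader.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 Sub( Vec3 a, Vec3 b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
inline float Dot( Vec3 a, Vec3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 Cross( Vec3 a, Vec3 b )
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
inline Vec3 Normalize( Vec3 v )
{
    const float len = std::sqrt( Dot( v, v ) );
    return { v.x / len, v.y / len, v.z / len };
}

struct Plane { Vec3 N; float d; };   // unit normal and distance to the origin
struct Frustum { Plane planes[4]; }; // left, right, top, bottom (assumed order)

// Plane through three points; the normal follows the winding p0 -> p1 -> p2.
inline Plane ComputePlane( Vec3 p0, Vec3 p1, Vec3 p2 )
{
    const Vec3 n = Normalize( Cross( Sub( p1, p0 ), Sub( p2, p0 ) ) );
    return { n, Dot( n, p0 ) };
}

// Build the side planes of a tile frustum from its view space corners.
inline Frustum BuildTileFrustum( Vec3 topLeft, Vec3 topRight, Vec3 bottomLeft, Vec3 bottomRight )
{
    const Vec3 eye = { 0.0f, 0.0f, 0.0f }; // the camera is the view space origin
    Frustum f;
    f.planes[0] = ComputePlane( eye, bottomLeft, topLeft );     // left
    f.planes[1] = ComputePlane( eye, topRight, bottomRight );   // right
    f.planes[2] = ComputePlane( eye, topLeft, topRight );       // top
    f.planes[3] = ComputePlane( eye, bottomRight, bottomLeft ); // bottom
    return f;
}
```

With this winding, a point inside the frustum has a positive signed distance to all four planes, which is the convention the culling tests later in the article rely on.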
Store Grid Frustums 存储网格视锥体
And finally we need to write the frustum to global memory. We must be careful that we don’t access array elements that are out of bounds of the allocated frustum buffer.
最后,我们需要将截头锥体写入全局内存。我们必须小心,不要访问超出分配的截头锥体缓冲区边界的数组元素。
// Store the computed frustum in global memory (if our thread ID is in bounds of the grid).
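The bounds check can be illustrated with a small C++ helper. It assumes the frustum buffer is laid out row by row with one entry per tile; `TileBufferIndex` is a hypothetical name used only for this sketch.

```cpp
#include <cstdint>

// Returns true and writes the flat buffer index when (x, y) addresses a valid
// tile. Threads that fall outside the grid get false and must not write.
inline bool TileBufferIndex( uint32_t x, uint32_t y, uint32_t numThreadsX, uint32_t numThreadsY, uint32_t& index )
{
    if ( x >= numThreadsX || y >= numThreadsY )
    {
        return false;
    }
    index = x + y * numThreadsX; // row-major layout (assumption of this sketch)
    return true;
}
```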
Now that we have the precomputed grid frustums, we can use them in the light culling compute shader.
现在我们有预先计算的网格视锥体,我们可以在光照剔除计算着色器中使用它们。
Light Culling 光照剔除
The next step of the Forward+ rendering technique is to cull the lights using the grid frustums that were computed in the previous section. The grid frustums only need to be computed once at the beginning of the application, or again if the screen dimensions or the size of the tiles change. The light culling phase, however, must run on every frame in which the camera moves, a light moves, or an object in the scene changes in a way that affects the contents of the depth buffer. Any one of these events could occur, so it is generally safest to perform light culling every frame.
在 Forward+渲染方式的下一步中,使用在前一节中计算的网格视锥体来剔除光源。只需要在应用程序开始时计算一次网格视锥体,或者在屏幕尺寸或瓦片大小发生变化时重新计算。但是,光源剔除阶段必须在每一帧中发生,即摄像机移动、光源位置移动或场景中的对象发生变化影响深度缓冲区内容时。这些事件中的任何一个都可能发生,因此通常每一帧都执行光源剔除是安全的。
The basic algorithm for performing light culling is as follows:
执行光源剔除的基本算法如下:
- Compute the min and max depth values in view space for the tile
计算视图空间中瓦片的最小和最大深度值
- Cull the lights and record the lights into a light index list
剔除灯光并将灯光记录到灯光索引列表中
- Copy the light index list into global memory
将灯光索引列表复制到全局内存中
Compute Min/Max Depth Values 计算最小/最大深度值
The first step of the algorithm is to compute the minimum and maximum depth values per tile of the light grid. The minimum and maximum depth values will be used to compute the near and far planes for our culling frustum.
算法的第一步是计算光栅每个瓦片的最小和最大深度值。最小和最大深度值将用于计算我们剔除视锥体的近平面和远平面。
The image above shows an example scene. The blue objects represent opaque objects in the scene. The yellow objects represent light sources and the shaded gray areas represent the tile frustums that are computed from the minimum and maximum depth values per tile. The green lines represent the tile boundaries for the light grid. The tiles are numbered 1-7 from top to bottom and the opaque objects are numbered 1-5 and the lights are numbered 1-4.
上面的图像显示了一个示例场景。蓝色物体代表场景中的不透明物体。黄色物体代表光源,阴影灰色区域代表从每个瓦片的最小和最大深度值计算出的瓦片视锥体。绿色线条代表光栅的瓦片边界。从上到下,瓦片编号为 1-7,不透明物体编号为 1-5,灯光编号为 1-4。
The first tile has a maximum depth value of 1 (in projected clip space) because there are some pixels that are not covered by opaque geometry. In this case, the culling frustum is very large and may contain lights that don’t affect the geometry. For example, light 1 is contained within tile 1 but light 1 does not affect any geometry. At geometry boundaries, the culling frustum could potentially be very large and may contain lights that don’t affect any geometry.
第一个瓦片的最大深度值为 1(在投影剪辑空间中),因为有一些像素未被不透明几何体覆盖。在这种情况下,裁剪视锥体非常大,可能包含不影响几何体的灯光。例如,灯光 1 包含在瓦片 1 内,但灯光 1 不影响任何几何体。在几何边界处,裁剪视锥体可能非常大,可能包含不影响任何几何体的灯光。
The minimum and maximum depth values in tile 2 are the same because object 2 is directly facing the camera and fills the entire tile. This won’t be a problem as we will see later when we perform the actual clipping of the light volume.
在瓦片 2 中,最小和最大深度值相同,因为物体 2 直接面向摄像机并填充整个瓦片。当我们执行光体积的实际裁剪时,这不会成为问题。
Object 3 fully occludes light 3, so light 3 will not be considered when shading any fragments.
物体 3 完全遮挡了光 3,因此在着色任何片元时将不予考虑。
The above image depicts the minimum and maximum depth values per tile for opaque geometry. For transparent geometry, we can only clip light volumes that are behind the maximum depth planes, but we must consider all lights that are in front of all opaque geometry. The reason for this is that when performing the depth pre-pass step to generate the depth texture which is used to determine the minimum and maximum depths per tile, we cannot render transparent geometry into the depth buffer. If we did, then we would not correctly light opaque geometry that is behind transparent geometry. The solution to this problem is described in an article titled “Tiled Forward Shading” by Markus Billeter, Ola Olsson, and Ulf Assarsson [4]. In the light culling compute shader, two light lists will be generated. The first light list contains only the lights that are affecting opaque geometry. The second light list contains only the lights that could affect transparent geometry. When performing final shading on opaque geometry then I will send the first list and when rendering transparent geometry, I will send the second list to the fragment shader.
上述图像显示了不透明几何体每个瓦片的最小和最大深度值。对于透明几何体,我们只能裁剪在最大深度平面后面的光体积,但必须考虑所有在所有不透明几何体前面的光源。原因是在执行深度预处理步骤以生成用于确定每个瓦片的最小和最大深度的深度纹理时,我们不能将透明几何体渲染到深度缓冲区中。如果这样做,那么我们将无法正确照亮位于透明几何体后面的不透明几何体。这个问题的解决方案在一篇名为“平铺式前向着色”的文章中有所描述,作者是 Markus Billeter、Ola Olsson 和 Ulf Assarsson。在光照剔除计算着色器中,将生成两个光列表。第一个光列表仅包含影响不透明几何体的光源。第二个光列表仅包含可能影响透明几何体的光源。在对不透明几何体执行最终着色时,我将发送第一个列表,而在渲染透明几何体时,我将发送第二个列表到片元着色器。
Before I discuss the light culling compute shader, I will discuss the method that is used to build the light lists in the compute shader.
在讨论光剔除计算着色器之前,我将讨论在计算着色器中用于构建光列表的方法。
Light List Data Structure 光列表数据结构
The data structure that is used to store the per-tile light lists is described in the paper titled “Tiled Shading” from Ola Olsson and Ulf Assarsson [5]. Ola and Ulf describe a data structure in two parts. The first part is the light grid which is a 2D grid that stores an offset and a count of values stored in a light index list. This technique is similar to that of an index buffer which refers to the indices of vertices in a vertex buffer.
用于存储每个瓦片光列表的数据结构在 Ola Olsson 和 Ulf Assarsson 的论文“平铺着色”中有描述。Ola 和 Ulf 描述了一个分为两部分的数据结构。第一部分是光栅,它是一个存储光索引列表中值的偏移量和计数的二维网格。这种方式类似于索引缓冲区,它引用顶点缓冲区中的顶点索引。
The size of the light grid is based on the number of screen tiles that are used for light culling. The size of the light index list is based on the expected average number of overlapping lights per tile. For example, a screen resolution of 1280×720 and a tile size of 16×16 results in an 80×45 (3,600) light grid. Assuming an average of 200 lights per tile, this would require a light index list of 720,000 indices. Each light index costs 4 bytes (for a 32-bit unsigned integer) so the light list would consume 2.88 MB of GPU memory. Since we need a separate list for transparent and opaque geometry, this would consume a total of 5.76 MB. Although 200 lights may be an overestimation of the average number of overlapping lights per tile, the storage usage is not outrageous.
光栅的大小基于用于光剔除的屏幕瓦片数量。光索引列表的大小基于每个瓦片预期的平均重叠光数。例如,对于分辨率为 1280×720 且瓦片大小为 16×16 的屏幕,结果是 80×45(3,600)的光栅。假设每个瓦片平均有 200 个光,这将需要一个包含 720,000 个索引的光索引列表。每个光索引占用 4 个字节(对于 32 位无符号整数),因此光列表将消耗 2.88 MB 的 GPU 内存。由于我们需要为透明和不透明几何体分别列出列表,这将总共消耗 5.76 MB。尽管 200 个光可能是对每个瓦片平均重叠光数的过高估计,但存储使用并不过分。
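The storage estimate above is simple enough to check with a few lines of C++. The function name `LightIndexListBytes` is made up for this sketch, and it assumes one 32-bit index per light and a megabyte of 1,000,000 bytes, as the figures in the text imply.

```cpp
#include <cstdint>

// Bytes needed for one global light index list, given the screen size,
// the tile size in pixels, and the expected average lights per tile.
inline uint64_t LightIndexListBytes( uint32_t screenWidth, uint32_t screenHeight, uint32_t tileSize, uint32_t avgLightsPerTile )
{
    const uint64_t tilesX = ( screenWidth + tileSize - 1 ) / tileSize;  // round up
    const uint64_t tilesY = ( screenHeight + tileSize - 1 ) / tileSize; // round up
    return tilesX * tilesY * avgLightsPerTile * sizeof( uint32_t );
}
```

Calling `LightIndexListBytes( 1280, 720, 16, 200 )` gives 2,880,000 bytes for one list, so the opaque and transparent lists together need twice that.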
To generate the light grid and the light index list, a group-shared light index list is first generated in the compute shader. A global light index list counter is used to keep track of the current index into the global light index list. The global light index counter is atomically incremented so that no two thread groups can use the same range in the global light index list. Once the thread group has “reserved” space in the global light index list, the group-shared light index list is copied to the global light index list.
生成光栅和光索引列表,首先在计算着色器中生成一个组共享的光索引列表。使用全局光索引列表计数器来跟踪当前索引进入全局光索引列表。全局光索引计数器是原子地递增的,以便没有两个线程组可以使用全局光索引列表中的相同范围。一旦线程组在全局光索引列表中“保留”了空间,组共享的光索引列表就会被复制到全局光索引列表中。
The following pseudo code demonstrates this technique.
以下伪代码演示了这种方式。
function CullLights( L, C, G, I )
On the first three lines, the index of the current tile in the grid is defined as t. The local light index list is defined as i and the tile frustum that is used to perform light culling for the current tile is defined as f.
在前三行中,将网格中当前瓦片的索引定义为 t。本地光索引列表定义为 i,用于对当前瓦片执行光剔除的瓦片视锥体定义为 f。
Lines 4, 5, and 6 loop through the global light list and cull the lights against the current tile’s culling frustum. If the light is inside the frustum, the light index is added to the local light index list.
第 4、5 和 6 行循环遍历全局光列表,并根据当前切片的裁剪视锥体对光源进行裁剪。如果光源在视锥体内,则将光源索引添加到本地光源索引列表中。
On line 7 the current index in the global light index list is incremented by the number of lights that are contained in the local light index list. The original value of the global light index list counter before being incremented is stored in the local counter variable c.
在第 7 行,全局光源索引列表中的当前索引会增加本地光源索引列表中包含的光源数量。在增加之前,全局光源索引列表计数器的原始值会存储在本地计数器变量 c 中。
On line 8, the light grid G is updated with the current tile’s offset and count into the global light index list.
在第 8 行,光栅 G 会使用当前切片的偏移量和计数值更新全局光源索引列表。
And finally, on line 9 the local light index list is copied to the global light index list.
最后,在第 9 行,将本地光索引列表复制到全局光索引列表中。
The light grid and the global light index list are then used in the fragment shader to perform final shading.
然后,在片元着色器中使用光栅和全局光索引列表执行最终着色。
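Steps 7 to 9 of the pseudo code (reserve a range, record it in the light grid, copy the local list) can be sketched in C++ with `std::atomic` standing in for the GPU's atomic add. This is a CPU analogy only: `GridCell` and `StoreTileLightList` are hypothetical names, and the sketch assumes the global list was allocated large enough to hold every tile's indices.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// One light grid entry: where the tile's indices start and how many there are.
struct GridCell { uint32_t offset; uint32_t count; };

// Reserve space in the global light index list for one tile, copy the tile's
// local list into it, and return the grid entry for that tile.
// Assumes globalList is already sized to hold all tiles' indices.
inline GridCell StoreTileLightList( const std::vector<uint32_t>& localList,
                                    std::atomic<uint32_t>& globalCounter,
                                    std::vector<uint32_t>& globalList )
{
    const uint32_t count = static_cast<uint32_t>( localList.size() );
    // fetch_add returns the counter value from before the addition, so no two
    // callers can ever receive overlapping ranges.
    const uint32_t offset = globalCounter.fetch_add( count );
    for ( uint32_t i = 0; i < count; ++i )
    {
        globalList[offset + i] = localList[i];
    }
    return { offset, count };
}
```

Because each tile only learns its offset at reservation time, the order of tiles in the global list is not deterministic on a GPU; the light grid is what makes the list usable afterwards.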
Frustum Culling 视锥体裁剪
To perform frustum culling on the light volumes, two frustum culling methods will be presented:
对光体执行截锥体裁剪,将呈现两种截锥体裁剪方法:
- Frustum-Sphere culling for point lights
点光源的视锥-球体裁剪
- Frustum-Cone culling for spot lights
聚光灯的视锥-圆锥体裁剪
The culling algorithm for spheres is fairly straightforward. The culling algorithm for cones is slightly more complicated. First I will describe the frustum-sphere algorithm and then I will describe the cone-culling algorithm.
球体的裁剪算法相对简单。锥体的裁剪算法稍微复杂一些。首先我会描述视锥-球体算法,然后我会描述锥体裁剪算法。
Frustum-Sphere Culling 视锥-球体裁剪
We have already seen the definition of the culling frustum in the previous section titled Compute Grid Frustums. A sphere is defined as a center point in view space, and a radius.
我们已经在前一节中看到了修剪视锥的定义,标题为计算网格视锥。一个球被定义为视图空间中的一个中心点和一个半径。
struct Sphere
A sphere is considered to be “inside” a plane if it is fully contained in the negative half-space of the plane. If a sphere is completely “inside” any of the frustum planes then it is outside of the frustum.
如果一个球完全包含在平面的负半空间中,则认为球在平面“内部”。如果一个球完全包含在任何视锥平面的“内部”,则它在视锥之外。
We can use the following formula to determine the signed distance of a sphere from a plane [18]:
我们可以使用以下公式来确定球体与平面之间的有符号距离[18]:
$$l = (\mathbf{c} \cdot \mathbf{n}) - d$$
Where l is the signed distance from the center of the sphere to the plane, c is the center point of the sphere, n is the unit normal to the plane, and d is the distance from the plane to the origin.
其中 l 是球心到平面的有符号距离,c 是球体的中心点,n 是平面的单位法线,d 是平面到原点的距离。
If l is less than −r where r is the radius of the sphere, then we know that the sphere is fully contained in the negative half-space of the plane.
如果 l 小于 −r(r 为球体半径),那么球体完全包含在平面的负半空间中。
// Check to see if a sphere is fully behind (inside the negative halfspace of) a plane.
Then we can iteratively apply the SphereInsidePlane function to determine if the sphere is contained inside the culling frustum.
然后我们可以迭代应用 SphereInsidePlane 函数来确定球体是否包含在裁剪视锥体内部。
// Check to see if a light is partially contained within the frustum.
Since the sphere is described in view space, we can quickly determine if the light should be culled based on its z-position and the distance to the near and far clipping planes. If the sphere is either fully in front of the near clipping plane, or fully behind the far clipping plane, then the light can be discarded. Otherwise we have to check if the light is within the bounds of the culling frustum.
由于球是在视图空间中描述的,我们可以根据其 z 位置以及到近、远裁剪平面的距离快速确定是否应该剔除该光源。如果球要么完全在近裁剪平面的前面,要么完全在远裁剪平面的后面,那么该光源可以被丢弃。否则,我们必须检查光源是否在剔除视锥体的范围内。
The SphereInsideFrustum function assumes a right-handed coordinate system with the camera looking towards the negative z axis. In this case, the far plane is approaching negative infinity so we have to check if the sphere is further away (less than in the negative direction). For a left-handed coordinate system, the zNear and zFar variables should be swapped on line 268.
SphereInsideFrustum 假定一个右手坐标系,摄像机朝向负 z 轴。在这种情况下,远平面逼近负无穷大,因此我们必须检查球体是否更远(在负方向上小于)。对于左手坐标系,应该在第 268 行交换 zNear 和 zFar 变量。
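The two tests described above can be written as a CPU-side C++ reference. This is a sketch rather than the shader itself: it assumes a right-handed view space with the camera looking down the negative z axis, frustum plane normals that point into the frustum, and it redefines minimal `Vec3`, `Plane`, `Frustum`, and `Sphere` types so the example is self-contained.

```cpp
struct Vec3 { float x, y, z; };
struct Plane { Vec3 N; float d; };   // unit normal and distance to the origin
struct Frustum { Plane planes[4]; }; // left, right, top, bottom
struct Sphere { Vec3 c; float r; };  // center in view space and radius

inline float Dot( Vec3 a, Vec3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// True if the sphere is fully behind (in the negative half-space of) the plane.
inline bool SphereInsidePlane( const Sphere& sphere, const Plane& plane )
{
    return Dot( plane.N, sphere.c ) - plane.d < -sphere.r;
}

// True if the sphere is at least partially inside the frustum.
// zNear and zFar are view space depths, so zFar < zNear < 0 here.
inline bool SphereInsideFrustum( const Sphere& sphere, const Frustum& frustum, float zNear, float zFar )
{
    // Depth check first: fully in front of the near plane or beyond the far plane.
    if ( sphere.c.z - sphere.r > zNear || sphere.c.z + sphere.r < zFar )
    {
        return false;
    }
    // Then the four side planes.
    for ( const Plane& plane : frustum.planes )
    {
        if ( SphereInsidePlane( sphere, plane ) )
        {
            return false;
        }
    }
    return true;
}
```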
Frustum-Cone Culling 锥体截锥体裁剪
To perform frustum-cone culling, I will use the technique described by Christer Ericson in his book titled “Real-Time Collision Detection” [18]. A cone can be defined by its tip T, a normalized direction vector d, the height of the cone h and the radius of the base r.
为了执行截锥体剔除,我将使用 Christer Ericson 在其名为“实时碰撞检测”的书中描述的技术 [18]。圆锥体可以通过其尖端 T、归一化方向向量 d、圆锥体高度 h 和底面半径 r 来定义
T is the tip of the cone, d is the direction, h is the height and r is the radius of the base of the cone.
In HLSL the cone is defined as
在 HLSL 中,圆锥体被定义为
struct Cone
To test if a cone is completely contained in the negative half-space of a plane, only two points need to be tested.
要测试一个圆锥体是否完全包含在平面的负半空间中,只需要测试两个点。
- The tip T of the cone
- The point Q that is on the base of the cone that is farthest away from the plane in the direction of n
If both of these points are contained in the negative half-space of any of the frustum planes, then the cone can be culled.
如果这两点都包含在任何锥台平面的负半空间中,则可以剔除锥体。
To determine the point Q that is farthest away from the plane in the direction of n we will compute an intermediate vector m which is parallel but opposite to n and perpendicular to d.
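为了确定在 n 方向上距离平面最远的点 Q,我们将计算一个中间向量 m,它与 n 平行但方向相反,并且垂直于 d。
$$\mathbf{m} = (\mathbf{n} \times \mathbf{d}) \times \mathbf{d}$$
Q is obtained by stepping from the tip T along the cone axis d by a distance h, and then along the base of the cone toward the positive half-space of the plane (in the direction −m) by a factor of r.
Q 是这样得到的:从尖端 T 沿锥轴 d 前进距离 h,然后沿圆锥体底面朝平面的正半空间方向(即 −m 方向)前进 r 倍。
$$\mathbf{Q} = \mathbf{T} + h\mathbf{d} - r\mathbf{m}$$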
If n×d is zero, then the cone axis d is parallel to the plane normal n and m will be a zero vector. This special case does not need to be handled specifically because in this case the equation reduces to:
如果 n×d 为零,则锥轴 d 平行于平面法线 n,m 将是零向量。这种特殊情况不需要专门处理,因为在这种情况下,方程简化为:
$$\mathbf{Q} = \mathbf{T} + h\mathbf{d}$$
Which results in the correct point that needs to be tested.
导致需要测试的正确点。
With points T and Q computed, we can test whether both points are in the negative half-space of the plane. If they are, we can conclude that the light can be culled. To test if a point is in the negative half-space of the plane, we can use the following equation:
计算出 T 和 Q 后,我们可以测试这两个点是否位于平面的负半空间中。如果是的话,我们就可以得出结论,光线可以被剔除。为了测试一个点是否在平面的负半空间中,我们可以使用以下方程:
$$l = (\mathbf{n} \cdot \mathbf{X}) - d$$
Where l is the signed distance from the point to the plane and X is the point to be tested. If l is negative, then the point is contained in the negative half-space of the plane.
其中l 是点到平面的有符号距离,X 是要测试的点。如果 l 为负,则该点包含在平面的负半空间中。
In HLSL, the function PointInsidePlane is used to test if a point is inside the negative half-space of a plane.
在 HLSL 中,函数 PointInsidePlane 用于测试一个点是否在平面的负半空间内。
// Check to see if a point is fully behind (inside the negative halfspace of) a plane.
And the ConeInsidePlane function is used to test if a cone is fully contained in the negative half-space of a plane.
而 ConeInsidePlane 函数用于测试锥体是否完全包含在平面的负半空间中。
// Check to see if a cone is fully behind (inside the negative halfspace of) a plane.
The ConeInsideFrustum function is used to test if the cone is contained within the clipping frustum. This function will return true if the cone is inside the frustum or false if it is fully contained in the negative half-space of any of the clipping planes.
ConeInsideFrustum 函数用于测试锥体是否包含在裁剪截锥内。如果锥体在截锥内部,则此函数将返回 true;如果锥体完全包含在任何裁剪平面的负半空间中,则返回 false。
bool ConeInsideFrustum( Cone cone, Frustum frustum, float zNear, float zFar )
First we check if the cone is clipped by the near or far clipping planes. Otherwise we have to check the four planes of the culling frustum. If the cone is in the negative half-space of any of the clipping planes, the function will return false.
首先,我们检查圆锥体是否被近裁剪平面或远裁剪平面剪裁。否则,我们必须检查视锥体的四个平面。如果圆锥体在任何裁剪平面的负半空间中,该函数将返回 false。
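The point and cone tests against a single plane can be sketched in C++ as follows. The sketch follows the description in the text: m is the double cross product of the plane normal and the cone axis, and Q is reached by stepping h along d and r along −m. The vector m is used as written, without normalization. The `Vec3` helpers and the field layout of `Cone` are choices made for this self-contained example.

```cpp
struct Vec3 { float x, y, z; };
struct Plane { Vec3 N; float d; };
struct Cone
{
    Vec3 T;  // tip of the cone
    float h; // height of the cone
    Vec3 d;  // normalized direction of the cone axis
    float r; // radius of the base
};

inline Vec3 Add( Vec3 a, Vec3 b ) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
inline Vec3 Scale( Vec3 v, float s ) { return { v.x * s, v.y * s, v.z * s }; }
inline float Dot( Vec3 a, Vec3 b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 Cross( Vec3 a, Vec3 b )
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// True if the point is in the negative half-space of the plane.
inline bool PointInsidePlane( Vec3 p, const Plane& plane )
{
    return Dot( plane.N, p ) - plane.d < 0.0f;
}

// True if the cone is fully contained in the negative half-space of the plane.
inline bool ConeInsidePlane( const Cone& cone, const Plane& plane )
{
    // m is perpendicular to the cone axis and points away from the plane normal.
    const Vec3 m = Cross( Cross( plane.N, cone.d ), cone.d );
    // Q is the point on the base that reaches furthest toward the positive side.
    const Vec3 Q = Add( Add( cone.T, Scale( cone.d, cone.h ) ), Scale( m, -cone.r ) );
    return PointInsidePlane( cone.T, plane ) && PointInsidePlane( Q, plane );
}
```

When the cone axis is parallel to the plane normal, m becomes the zero vector and Q falls on the center of the base, which is the reduced form of the equation given earlier.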
Now we can put this together to define the light culling compute shader.
现在我们可以将这些组合起来定义光照剔除计算着色器。
Light Culling Compute Shader 光照剔除计算着色器
The purpose of the light culling compute shader is to update the global light index list and the light grid that is required by the fragment shader. Two lists need to be updated per frame:
光照剔除计算着色器的目的是更新片元着色器所需的全局光索引列表和光栅格。每帧需要更新两个列表:
- Light index list for opaque geometry
不透明几何的光索引列表
- Light index list for transparent geometry
透明几何的光索引列表
To differentiate between the two lists in the HLSL compute shader, I will use the prefix “o_” to refer to the opaque lists and “t_” to refer to transparent lists. Both lists will be updated in the light culling compute shader.
为了区分 HLSL 计算着色器中的这两个列表,我将使用前缀“o_”来指代不透明列表,“t_”来指代透明列表。这两个列表将在光剔除计算着色器中更新。
First we will declare the resources that are required by the light culling compute shader.
首先,我们将声明光照剔除计算着色器所需的资源。
// The depth from the screen space texture.
In order to read the depth values that are generated by the depth pre-pass, the resulting depth texture will need to be sent to the light culling compute shader. The DepthTextureVS texture contains the result of the depth pre-pass.
为了读取深度预pass生成的深度值,结果深度纹理需要被发送到光照剔除计算着色器。DepthTextureVS 纹理包含深度预pass的结果。
in_Frustums is the structured buffer that was filled by the grid frustums compute shader described in the section titled Grid Frustums Compute Shader.
in_Frustums 是在计算 frustums 计算着色器中计算并在标题为 Grid Frustums Compute Shader 的部分中描述的结构化缓冲区。
We also need to keep track of the index into the global light index lists.
我们还需要跟踪全局光索引列表中的索引。
// Global counter for current index into the light index list.
The o_LightIndexCounter is the current index of the global light index list for opaque geometry and the t_LightIndexCounter is the current index of the global light index list for transparent geometry.
o_LightIndexCounter 是不透明几何体全局光索引列表的当前索引,t_LightIndexCounter 是透明几何体全局光索引列表的当前索引。
Although the light index counters are of type RWStructuredBuffer these buffers only contain a single unsigned integer at index 0.
尽管光指数计数器是 RWStructuredBuffer 类型,但这些缓冲区仅在索引 0 处包含单个无符号整数。
// Light index lists and light grids.
The light index lists are stored as a 1D array of unsigned integers but the light grids are stored as 2D textures where each “texel” is a 2-component unsigned integer vector. The light grid texture is created using the R32G32_UINT format.
光索引列表存储为一维无符号整数数组,但光栅存储为二维纹理,其中每个“纹素”是一个二分量无符号整数向量。光栅纹理使用 R32G32_UINT 格式创建。
To store the min and max depth values per tile, we need to declare some group-shared variables. Atomic functions will be used to make sure that only one thread in a thread group can change the min/max depth values at a time but unfortunately, shader model 5.0 does not provide atomic functions for floating point values. To circumvent this limitation, the depth values will be stored as unsigned integers in group-shared memory which will be atomically compared and updated per thread.
为了存储每个瓦片的最小和最大深度值,我们需要声明一些组共享变量来存储最小和最大深度值。原子递增函数将被用来确保只有一个线程组中的线程可以更改最小/最大深度值,但不幸的是,着色器模型 5.0 不提供用于浮点值的原子函数。为了规避这个限制,深度值将被存储为无符号整数在组共享内存中,这将被原子地比较和更新每个线程。
groupshared uint uMinDepth;
Since the frustum used to perform culling will be the same frustum for all threads in a group, it makes sense to keep only one copy of the frustum for all threads in a group. Only thread 0 in the group will need to copy the frustum from the global memory buffer and we also reduce the amount of local register memory required per thread.
由于用于执行裁剪的视锥体对组中所有线程都是同一个视锥体,因此为组中所有线程只保留一份视锥体副本是有意义的。只有组中的线程 0 需要从全局内存缓冲区复制视锥体,我们还减少了每个线程所需的本地寄存器内存量。
groupshared Frustum GroupFrustum;
We also need to declare group-shared variables to create the temporary light lists. We will need a separate list for opaque and transparent geometry.
我们还需要声明组共享变量来创建临时光列表。我们将需要一个用于不透明和透明几何体的单独列表。
// Opaque geometry light lists.
The LightCount will keep track of the number of lights that are intersecting the current tile frustum.
LightCount 将跟踪与当前瓷砖视锥相交的灯光数量。
The LightIndexStartOffset is the offset into the global light index list. This index will be written to the light grid and is used as the starting offset when copying the local light index list to global light index list.
LightIndexStartOffset 是全局光索引列表中的偏移量。此索引将被写入光栅,并在将本地光索引列表复制到全局光索引列表时用作起始偏移量。
The local light index list will allow us to store as many as 1024 lights in a single tile. This maximum value will almost never be reached (at least it shouldn’t be!). Keep in mind that when we allocated storage for the global light list, we accounted for an average of 200 lights per tile. It is possible that there are some tiles that contain more than 200 lights (as long as it is not more than 1024) and some tiles that contain less than 200 lights but we expect the average to be about 200 lights per tile. As previously mentioned, the estimate of an average of 200 lights per tile is probably an overestimation but since GPU memory is not a limiting constraint for this project, I can afford to be liberal with my estimations.
本地光索引列表将允许我们在单个瓦片中存储多达 1024 个灯光。这个最大值几乎永远不会达到(至少不应该!)。请记住,当我们为全局光列表分配存储空间时,我们考虑了每个瓦片平均 200 个灯光。有可能有一些瓦片包含超过 200 个灯光(只要不超过 1024 个),也有一些瓦片包含少于 200 个灯光,但我们预计平均每个瓦片约有 200 个灯光。如前所述,每个瓦片平均 200 个灯光的估计可能是一个过高的估计,但由于 GPU 内存对于这个项目并不是一个限制性约束,我可以在估计上宽松一些。
To update the local light counter and the light list, I will define a helper function called AppendLight. Unfortunately I have not yet figured out how to pass group-shared variables as arguments to a function so for now I will define two versions of the same function. One version of the function is used to update the light index list for opaque geometry and the other version is for transparent geometry.
更新本地光计数器和光列表,我将定义一个名为 AppendLight 的辅助函数。不幸的是,我还没有弄清楚如何将组共享变量作为参数传递给函数,所以现在我将定义同一函数的两个版本。函数的一个版本用于更新不透明几何体的光索引列表,另一个版本用于透明几何体。
// Add the light to the visible light list for opaque geometry.
If you are reading this and you know how I can pass groupshared variables as arguments to a function in HLSL, please leave your solution in the comments below. (No guessing please. Make sure your solution works before suggesting it).
如果您正在阅读此内容,并且知道如何将组共享变量作为参数传递给 HLSL 中的函数,请在下方评论中留下您的解决方案。(请不要猜测。在建议之前确保您的解决方案有效)。
The InterlockedAdd function guarantees that the group-shared light count variable is only updated by a single thread at a time. This way we avoid any race conditions that may occur when multiple threads try to increment the group-shared light count at the same time.
InterlockedAdd 函数确保组共享的灯计数变量一次只能由一个线程更新。这样我们就避免了多个线程同时尝试增加组共享的灯计数时可能发生的竞争条件。
The value of the light count before it is incremented is stored in the index local variable and used to update the light index in the group-shared light index list.
在递增之前存储在索引本地变量中的光计数值,用于更新组共享光索引列表中的光索引。
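The append step can be mimicked on the CPU with `std::atomic::fetch_add`, which, like the three-argument form of InterlockedAdd, hands back the value from before the addition. The type `TileLightList` and the constant `kMaxLightsPerTile` are names invented for this sketch; the limit of 1024 comes from the local list size mentioned above.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t kMaxLightsPerTile = 1024; // size of the local light index list

// CPU stand-in for the group-shared light count and light index list.
struct TileLightList
{
    std::atomic<uint32_t> count{ 0 };
    uint32_t indices[kMaxLightsPerTile] = {};
};

// Append a light index to the tile's list. The slot is claimed atomically,
// so concurrent callers never write to the same element.
inline void AppendLight( TileLightList& list, uint32_t lightIndex )
{
    const uint32_t index = list.count.fetch_add( 1 ); // value before the increment
    if ( index < kMaxLightsPerTile )
    {
        list.indices[index] = lightIndex;
    }
}
```

Note that the counter keeps growing even when the list is full, so any code that reads the count afterwards may need to clamp it to the list capacity.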
The method to compute the minimum and maximum depth range per tile is taken from the presentation titled “DirectX 11 Rendering in Battlefield 3” by Johan Andersson in 2011 [3] and “Tiled Shading” by Ola Olsson and Ulf Assarsson [5].
计算每个瓦片的最小和最大深度范围的方法取自于 2011 年 Johan Andersson 的演示文稿“战地 3 中的 DirectX 11 渲染” [3] 和 Ola Olsson 与 Ulf Assarsson 的“平铺着色” [5]。
The first thing we will do in the light culling compute shader is read the depth value for the current thread. Each thread in the thread group samples the depth buffer exactly once, so together the threads in a group sample all of the depth values for a single tile.
在光剔除计算着色器中,我们将首先读取当前线程的深度值。线程组中的每个线程将仅为当前线程采样深度缓冲区一次,因此组中的所有线程将为单个瓦片采样所有深度值。
// Implementation of light culling compute shader is based on the presentation
Since we can only perform atomic operations on integers, on line 100 we reinterpret the bits from the floating-point depth as an unsigned integer. Since we expect all depth values in the depth map to be stored in the range [0…1] (that is, all positive depth values) then reinterpreting the float as an unsigned integer will still allow us to correctly perform comparisons on these values. As long as we don’t try to perform any arithmetic operations on the unsigned integer depth values, we should get the correct minimum and maximum values.
由于我们只能对整数执行原子操作,在第 100 行,我们将浮点深度的位重新解释为无符号整数。由于我们期望深度图中的所有深度值都存储在范围[0…1]内(即所有正深度值),因此将浮点数重新解释为整数仍然允许我们正确地对这些值执行比较。只要我们不尝试对无符号整数深度值执行任何算术运算,我们应该能够获得正确的最小值和最大值。
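The claim that non-negative floats keep their ordering when their bits are read as unsigned integers can be checked on the CPU. The helpers below are hypothetical C++ counterparts of HLSL's asuint and asfloat, and they assume 32-bit IEEE 754 floats.

```cpp
#include <cstdint>
#include <cstring>

static_assert( sizeof( float ) == sizeof( uint32_t ), "this sketch assumes 32-bit floats" );

// Reinterpret the bits of a float as an unsigned integer (like HLSL asuint).
inline uint32_t AsUint( float f )
{
    uint32_t u;
    std::memcpy( &u, &f, sizeof( u ) );
    return u;
}

// Reinterpret the bits of an unsigned integer as a float (like HLSL asfloat).
inline float AsFloat( uint32_t u )
{
    float f;
    std::memcpy( &f, &u, sizeof( f ) );
    return f;
}
```

For depth values in [0…1] the sign bit is always zero, so a larger float always has a larger bit pattern. That is why integer min/max on the reinterpreted values selects the same element that a floating-point min/max would.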
if ( IN.groupIndex == 0 ) // Avoid contention by other threads in the group.
Since we are setting group-shared variables, only one thread in the group needs to set them. In fact the HLSL compiler will generate a race-condition error if we don’t restrict the writing of these variables to a single thread in the group.
由于我们正在设置组共享变量,组中只需要一个线程来设置它们。实际上,如果我们不将这些变量的写入限制在组中的单个线程上,HLSL 编译器将生成竞争条件错误。
To make sure that every thread in the group has reached the same point in the compute shader, we invoke the GroupMemoryBarrierWithGroupSync function. This ensures that any writes to group shared memory have completed and the thread execution for all threads in a group have reached this point.
为了确保组中的每个线程在计算着色器中达到相同的点,我们调用 GroupMemoryBarrierWithGroupSync 函数。这确保了对组共享内存的任何写操作已经完成,并且组中所有线程的线程执行都已经达到了这一点。
Next, we’ll determine the minimum and maximum depth values for the current tile.
接下来,我们将确定当前瓦片的最小和最大深度值。
InterlockedMin( uMinDepth, uDepth );
The InterlockedMin and InterlockedMax methods are used to atomically update the uMinDepth and uMaxDepth group-shared variables based on the current thread’s depth value.
InterlockedMin 和 InterlockedMax 方法用于根据当前线程深度值原子更新 uMinDepth 和 uMaxDepth 组共享变量。
We again need to use the GroupMemoryBarrierWithGroupSync function to ensure all writes to group shared memory have been committed and all threads in the group have reached this point in the compute shader.
我们再次需要使用 GroupMemoryBarrierWithGroupSync 函数,以确保所有对组共享内存的写操作都已提交,并且组中的所有线程都已到达计算着色器中的此点。
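Standard C++ (before C++26) has no direct equivalent of InterlockedMin and InterlockedMax, but the same effect can be approximated with a compare-and-swap loop. The functions `AtomicMin` and `AtomicMax` below are illustrative helpers written for this sketch, not part of any library.

```cpp
#include <atomic>
#include <cstdint>

// Atomically store min( target, value ) into target.
inline void AtomicMin( std::atomic<uint32_t>& target, uint32_t value )
{
    uint32_t current = target.load();
    // On failure compare_exchange_weak refreshes `current`, so the loop
    // re-checks whether `value` is still the smaller of the two.
    while ( value < current && !target.compare_exchange_weak( current, value ) )
    {
    }
}

// Atomically store max( target, value ) into target.
inline void AtomicMax( std::atomic<uint32_t>& target, uint32_t value )
{
    uint32_t current = target.load();
    while ( value > current && !target.compare_exchange_weak( current, value ) )
    {
    }
}
```

As in the shader, the minimum would start at 0xffffffff and the maximum at 0 so that the first depth value written always replaces the initial value.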
After the minimum and maximum depth values for the current tile have been found, we can reinterpret the unsigned integer back to a float so that we can use it to compute the view space clipping planes for the current tile.
在找到当前瓦片的最小和最大深度值之后,我们可以将无符号整数重新解释为浮点数,以便我们可以使用它来计算当前瓦片的视空间裁剪平面。
float fMinDepth = asfloat( uMinDepth );
On line 118 the minimum and maximum depth values as unsigned integers need to be reinterpreted as floating point values so that they can be used to compute the correct points in view space.
在第 118 行,最小和最大深度值作为无符号整数需要重新解释为浮点值,以便可以用来计算视图空间中的正确点。
The view space depth values are computed using the ScreenToView function and extracting the z component of the position in view space. We only need these values to compute the near and far clipping planes in view space so we only need to know the distance from the viewer.
视图空间深度值是使用 ScreenToView 函数计算的,并提取视图空间中位置的 z 分量。我们只需要这些值来计算视图空间中的近裁剪面和远裁剪面,因此我们只需要知道与观察者的距离。
When culling lights for transparent geometry, we don’t want to use the minimum depth value from the depth map. Instead we will clip the lights using the camera’s near clipping plane. In this case, we will use the nearClipVS value which is the distance to the camera’s near clipping plane in view space.
在为透明几何体剔除光源时,我们不希望使用深度图中的最小深度值。相反,我们将使用摄像机的近裁剪面来裁剪光源。在这种情况下,我们将使用 nearClipVS 值,即到摄像机近裁剪面的距离。
Since I’m using a right-handed coordinate system with the camera pointing towards the negative z axis in view space, the minimum depth clipping plane is computed with a normal n pointing in the direction of the negative z axis and the distance to the origin d is -minDepth. We can verify that this is correct by using the constant-normal form of a plane:
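由于我使用右手坐标系,相机在视图空间中指向负 z 轴,因此最小深度裁剪平面使用指向负 z 轴方向的法线 n 来计算,其到原点的距离 d 为 -minDepth。我们可以通过使用平面的常量法线形式来验证这是正确的:
$$(\mathbf{n} \cdot \mathbf{X}) - d = 0$$
By substituting n = (0, 0, −1), X = (x, y, z) and d = −z_min we get:
通过代入 n=(0,0,-1)、X=(x,y,z) 和 d=-z_min,可得:
$$\begin{aligned} ((0, 0, -1) \cdot (x, y, z)) - (-z_{min}) &= 0 \\ -z + z_{min} &= 0 \\ z &= z_{min} \end{aligned}$$
Which implies that (0, 0, z_min) is a point on the minimum depth clipping plane.
这意味着 (0,0,z_min) 是最小深度裁剪平面上的一个点。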
// Cull lights
If every thread in the thread group checks one light in the global light list at the same time, then we can check 16×16 (256) lights per iteration of the for-loop defined on line 132. The loop starts with i=groupIndex and i is incremented by BLOCK_SIZE×BLOCK_SIZE for each iteration of the loop. This implies that for BLOCK_SIZE=16, each thread in the thread group will check every 256th light until all lights have been checked.
如果线程组中的每个线程同时检查全局灯光列表中的一个灯光,那么我们可以在第 132 行定义的 for 循环的每次迭代中检查 16×16 (256) 个灯光。循环从 i=groupIndex 开始,每次迭代 i 递增 BLOCK_SIZE×BLOCK_SIZE。这意味着对于 BLOCK_SIZE=16,线程组中的每个线程将每隔 256 个灯光检查一个,直到检查完所有灯光为止。
- Thread 0 checks: { 0, 256, 512, 768, … }
线程 0 检查:{ 0, 256, 512, 768, … }
- Thread 1 checks: { 1, 257, 513, 769, … }
线程 1 检查:{ 1, 257, 513, 769, … }
- Thread 2 checks: { 2, 258, 514, 770, … }
线程 2 检查:{ 2, 258, 514, 770, … }
- …
- Thread 255 checks: { 255, 511, 767, 1023, … }
线程 255 检查:{ 255, 511, 767, 1023, … }
For 10,000 lights, the for loop only needs 40 iterations (per thread) to check all lights for a tile.
对于 10,000 个灯光,for 循环只需要 40 次迭代(每个线程)来检查一个瓦片上的所有灯光。
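The strided access pattern is easy to reproduce on the CPU. The sketch below lists the light indices that a single thread of a group would visit; `LightsCheckedByThread` is a made-up helper for illustration and does no culling itself.

```cpp
#include <cstdint>
#include <vector>

// Indices of the lights visited by the thread with the given groupIndex,
// using the same start value and stride as the loop described in the text.
inline std::vector<uint32_t> LightsCheckedByThread( uint32_t groupIndex, uint32_t numLights, uint32_t blockSize )
{
    std::vector<uint32_t> visited;
    for ( uint32_t i = groupIndex; i < numLights; i += blockSize * blockSize )
    {
        visited.push_back( i );
    }
    return visited;
}
```

With 10,000 lights and a block size of 16, thread 0 visits 40 lights (0, 256, …, 9984) while thread 255 visits 39, so no thread needs more than 40 iterations.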
First we’ll check point lights using the SphereInsideFrustum function that was defined earlier.
首先,我们将使用之前定义的 SphereInsideFrustum 函数来检查点光源。
switch ( light.Type )
On line 142 a sphere is defined using the position and range of the light.
在第 142 行,使用光的位置和范围定义了一个球体。
First we check if the light is within the tile frustum using the near clipping plane of the camera and the maximum depth read from the depth buffer. If the light volume is in this range, it is added to the light index list for transparent geometry.
首先,我们检查光是否在瓦片视锥体内,使用摄像机的近裁剪平面和从深度缓冲区读取的最大深度。如果光体积在这个范围内,它将被添加到透明几何体的光索引列表中。
To check if the light should be added to the global light index list for opaque geometry, we only need to check the minimum depth clipping plane that was previously defined on line 128. If the light is within the culling frustum for transparent geometry and in front of the minimum depth clipping plane, the index of the light is added to the light index list for opaque geometry.
要检查光是否应该被添加到不透明几何体的全局光索引列表中,我们只需要检查之前在第 128 行定义的最小深度裁剪平面。如果光在透明几何体的裁剪视锥体内并且在最小深度裁剪平面的前面,则将光的索引添加到不透明几何体的光索引列表中。
Next, we’ll check spot lights.
接下来,我们将检查聚光灯。
case SPOT_LIGHT:
Checking cones is almost identical to checking spheres so I won’t go into any detail here. The radius of the base of the spotlight cone is not stored with the light so it needs to be calculated for the ConeInsideFrustum function. To compute the radius of the base of the cone, we can use the tangent of the spotlight angle multiplied by the height of the cone.
检查圆锥体几乎与检查球体相同,因此我在这里不会详细介绍。聚光锥体底部的半径未与光一起存储,因此需要为 ConeInsideFrustum 函数计算。要计算圆锥体底部的半径,我们可以使用聚光角的正切乘以圆锥体的高度。
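The cone radius calculation is a one-liner. The sketch below assumes the spotlight angle is the half-angle of the cone, expressed in radians, and that the cone height equals the light's range; the function name is hypothetical.

```cpp
#include <cmath>

// Assumption: spotlightAngleRadians is the half-angle of the cone and
// coneHeight is the distance from the apex to the base (the light's range).
inline float coneBaseRadius(float spotlightAngleRadians, float coneHeight)
{
    return std::tan(spotlightAngleRadians) * coneHeight;
}
```

With a half-angle of 45 degrees, the base radius equals the height of the cone.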
And finally we need to check directional lights. This is by far the easiest part of this function.
最后,我们需要检查方向灯。这绝对是这个功能中最容易的部分。
1 | case DIRECTIONAL_LIGHT: |
There is no way to reliably cull directional lights so if we encounter a directional light, we have no choice but to add its index to the light index list.
没有可靠的方法来筛选定向光源,因此如果我们遇到定向光源,我们别无选择,只能将其索引添加到光源索引列表中。
To ensure that all threads in the thread group have recorded their lights to the group-shared light index list, we will invoke the GroupMemoryBarrierWithGroupSync function to synchronize all threads in the group.
确保线程组中的所有线程都已将其光线记录到组共享的光线索引列表中,我们将调用 GroupMemoryBarrierWithGroupSync 函数来同步组中的所有线程。
After we have added all non-culled lights to the group-shared light index lists, we need to copy them to the global light index lists. First, we’ll update the global light index list counter.
在将所有未被剔除的光线添加到组共享的光线索引列表之后,我们需要将其复制到全局光线索引列表中。首先,我们将更新全局光线索引列表计数器。
1 | // Update global memory with visible light buffer. |
We will once again use the InterlockedAdd function to increment the global light index list counter by the number of lights that were appended to the group-shared light index list. On lines 194 and 198 the light grid is updated with the offset and light count of the global light index list.
我们将再次使用 InterlockedAdd 函数,将全局光线索引列表计数器增加已追加到组共享光线索引列表中的光线数量。在第 194 行和 198 行,光栅将使用全局光线索引列表的偏移量和光线计数进行更新。
To avoid race conditions, only the first thread in the thread group will be used to update the global memory.
为了避免竞态条件,只有线程组中的第一个线程将用于更新全局内存。
On line 201, all threads in the thread group must be synced again before we can update the global light index list.
在第 201 行,必须再次同步线程组中的所有线程,然后我们才能更新全局光索引列表。
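The way a tile reserves its slice of the global light index list can be illustrated with `std::atomic` standing in for HLSL's `InterlockedAdd`. This is a sketch under that substitution; `LightGridCell` and `reserveLightIndexRange` are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

struct LightGridCell { std::uint32_t offset; std::uint32_t count; };

// Reserves `count` consecutive slots in the global light index list and
// returns the grid entry describing them. fetch_add returns the counter's
// previous value, which becomes the tile's start offset.
inline LightGridCell reserveLightIndexRange(std::atomic<std::uint32_t>& globalCounter,
                                            std::uint32_t count)
{
    const std::uint32_t offset = globalCounter.fetch_add(count);
    return LightGridCell{offset, count};
}
```

Because the add is atomic, tiles processed concurrently receive non-overlapping ranges, although the order in which ranges are handed out is not deterministic.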
1 | // Update global memory with visible light buffer. |
To update the opaque and transparent global light index lists, we will allow all threads to write a single index into the light index list using a method similar to the one used to iterate the light list on lines 132-183 shown previously.
更新不透明和透明的全局光索引列表,我们将允许所有线程使用类似于之前在 132-183 行上显示的迭代光列表的方法,将单个索引写入光索引列表。
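The copy from the group-shared list into the reserved range follows the same strided pattern as the culling loop. The sketch below simulates one thread's share of the copy on the CPU; all names are hypothetical, and the caller is assumed to have reserved enough room in the global list.

```cpp
#include <cstdint>
#include <vector>

// Copies one tile's group-shared list into its reserved range of the global list.
// Thread `groupIndex` writes entries groupIndex, groupIndex + threadsPerGroup, ...
inline void copyTileListStrided(const std::vector<std::uint32_t>& groupList,
                                std::uint32_t startOffset,
                                std::uint32_t groupIndex,
                                std::uint32_t threadsPerGroup,
                                std::vector<std::uint32_t>& globalList)
{
    for (std::uint32_t i = groupIndex; i < groupList.size(); i += threadsPerGroup)
        globalList[startOffset + i] = groupList[i];
}
```

Calling the function once per thread index covers every entry exactly once, so no two threads write to the same slot.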
At this point both the light grid and the global light index list contain the necessary data to be used by the pixel shader to perform final shading.
在这一点上,光栅和全局光索引列表都包含了供像素着色器使用的必要数据,以执行最终着色。
Final Shading 最终着色
The last part of the Forward+ rendering technique is final shading. This method is no different from the standard forward rendering technique that was discussed in the section titled Forward Rendering – Pixel Shader except that instead of looping through the entire global light list, we use the light index list that was generated in the light culling phase.
Forward+渲染方式的最后部分是最终着色。这种方法与在标题为前向渲染-像素着色器的部分讨论的标准前向渲染方式没有什么不同,只是我们不再通过整个全局光列表进行循环,而是使用在光剔除阶段生成的光索引列表。
In addition to the properties that were described in the section about standard forward rendering, the Forward+ pixel shader also needs to take the light index list and the light grid that was generated in the light culling phase.
除了在标准前向渲染部分中描述的属性之外,Forward+像素着色器还需要获取在光照剔除阶段生成的光索引列表和光网格。
1 | StructuredBuffer<uint> LightIndexList : register( t9 ); |
When rendering opaque geometry, you must take care to bind the light index list and light grid for opaque geometry and when rendering transparent geometry, the light index list and light grid for transparent geometry. Of course this seems obvious but the only differentiating factor for the final shading pixel shader is the light index list and light grid that is bound to the pixel shader stage.
在渲染不透明几何体时,必须注意为不透明几何体绑定光索引列表和光网格;在渲染透明几何体时,为透明几何体绑定光索引列表和光网格。当然,这似乎是显而易见的,但最终着色像素着色器的唯一区别因素是绑定到像素着色器阶段的光索引列表和光网格。
1 | [earlydepthstencil] |
Most of the code for this pixel shader is identical to that of the forward rendering pixel shader so it is omitted here for brevity. The primary concept here is shown on line 298 where the tile index into the light grid is computed from the screen space position. Using the tile index, the start offset and light count is read from the light grid on lines 301 and 302.
大部分像素着色器的代码与前向渲染像素着色器的代码相同,因此为简洁起见在此省略。主要概念显示在第 298 行,其中从屏幕空间位置计算出光栅格中的瓦片索引。使用瓦片索引,从光栅格中在第 301 行和 302 行读取起始偏移和光计数。
The for-loop defined on line 306 loops over the light count, reads each light’s index from the light index list, and uses that index to retrieve the light from the global light list.
在定义在第 306 行的 for 循环中,循环遍历光计数并从光索引列表中读取光的索引,然后使用该索引从全局光列表中检索光。
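The lookup performed by the pixel shader can be sketched on the CPU as shown below. This is an illustration, not the article's shader: the light grid is flattened into a one-dimensional array here (an assumption made for simplicity), and the names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

struct LightGridCell { std::uint32_t offset; std::uint32_t count; };

constexpr std::uint32_t BLOCK_SIZE = 16;

// Returns the global light indices that affect the pixel at (pixelX, pixelY).
std::vector<std::uint32_t> lightsForPixel(std::uint32_t pixelX, std::uint32_t pixelY,
                                          std::uint32_t tilesPerRow,
                                          const std::vector<LightGridCell>& lightGrid,
                                          const std::vector<std::uint32_t>& lightIndexList)
{
    // The tile index is the screen-space position divided by the tile size.
    const std::uint32_t tileX = pixelX / BLOCK_SIZE;
    const std::uint32_t tileY = pixelY / BLOCK_SIZE;
    const LightGridCell cell = lightGrid[tileY * tilesPerRow + tileX];

    // Only the lights recorded for this tile are visited.
    std::vector<std::uint32_t> result;
    for (std::uint32_t i = 0; i < cell.count; ++i)
        result.push_back(lightIndexList[cell.offset + i]);
    return result;
}
```

Each returned value is an index into the global light list, so the shading loop touches only the lights that survived culling for that tile.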
Now let’s see how the performance of the various methods compares.
现在让我们看看各种方法的性能如何比较。
Experiment Setup and Performance Results 实验设置和性能结果
To measure the performance of the various rendering techniques, I used the Crytek Sponza scene [11] on an NVIDIA GeForce GTX 680 GPU at a screen resolution of 1280×720. The camera was placed close to the world origin and the lights were animated to rotate in a circle around the world origin.
为了衡量各种渲染方式的性能,我在 NVIDIA GeForce GTX 680 GPU 上使用了 Crytek Sponza 场景[11],屏幕分辨率为 1280×720。摄像机靠近世界原点,灯光被动画化以围绕世界原点旋转。
I tested each rendering technique using two scenarios:
我使用两种场景测试了每种渲染方式:
- Large lights with a range of 35-40 units
具有 35-40 个单位范围的大灯
- Small lights with a range of 1-2 units
1-2 个单位范围内的小灯光
Having a few (2-3) large lights in the scene is a realistic scenario (for example key light, fill light, and back light [25]). These lights may be shadow casters that set the mood and create the ambience for the scene. Having many (more than 5) large lights that fill the screen is not necessarily a realistic scenario but I wanted to see how the various techniques scaled when using large, screen-filling lights.
在场景中放置几个(2-3 个)大灯是一个现实的情景(例如主光、补光和背光[25])。这些灯光可能是投射阴影的灯光,用来营造氛围。在屏幕上放置许多(超过 5 个)大灯并不一定是一个现实的情景,但我想看看在使用大面积、填充屏幕的灯光时各种方式是如何扩展的。
Having many small lights is a more realistic scenario that might be commonly used in games. Many small lights can be used to simulate area lights or bounced lighting effects similar to the effects of global illumination algorithms that are usually only simulated using light maps or light probes as described in the section titled Forward Rendering.
在游戏中使用许多小灯光是一个更为现实的情景。许多小灯光可以用来模拟区域光或反射光效果,类似于全局光照算法的效果,通常只能通过光照图或光探针来模拟,如“前向渲染”部分所述。
Although the demo supports directional lights I did not test the performance of rendering using directional lights. Directional lights are large screen filling lights that are similar to lights having a range of 35-40 units (the first scenario).
尽管演示支持定向光,但我没有测试使用定向光进行渲染的性能。定向光是大面积填充屏幕的灯光,类似于具有 35-40 单位范围的灯光(第一个情景)。
In both scenarios lights were randomly placed throughout the scene within the boundaries of the scene. The sponza scene was scaled down so that its bounds were approximately 30 units in the X and Z axes and 15 units in the Y axis.
在这两种情况下,灯光是随机放置在场景的边界内的。 Sponza 场景被缩小,使其边界在 X 轴和 Z 轴上约为 30 个单位,在 Y 轴上约为 15 个单位。
Each graph displays a set of curves that represent the various phases of the rendering technique. The horizontal axis of the curve represents the number of lights in the scene and the vertical axis represents the running time measured in milliseconds. Each graph also displays a minimum and maximum threshold. The minimum threshold is displayed as a green horizontal line in the graph and represents the ideal frame-rate of 60 Frames-Per Second (FPS) or 16.6 ms. The maximum threshold is displayed as a red horizontal line in the graph and represents the lowest acceptable frame-rate of 30 FPS or 33.3 ms.
每个图表显示一组曲线,代表渲染方式的各个阶段。曲线的横轴代表场景中灯光的数量,纵轴代表以毫秒为单位的运行时间。每个图表还显示了最小和最大阈值。最小阈值显示为图表中的绿色水平线,代表理想帧率为每秒 60 帧(FPS)或 16.6 毫秒。最大阈值显示为图表中的红色水平线,代表最低可接受的帧率为每秒 30 帧(FPS)或 33.3 毫秒。
Forward Rendering Performance 前向渲染性能
Let us first analyze the performance of the forward rendering technique using large lights.
让我们首先分析使用大灯光的前向渲染方式的性能。
Large Lights 大灯
The graph below shows the performance results of the forward rendering technique using large lights.
下面的图表显示了使用大灯光的前向渲染方式的性能结果。
The graph displays the two primary phases of the forward rendering technique. The purple curve shows the opaque pass and the dark red curve shows the transparent pass. The orange line shows the total time to render the scene.
该图显示了前向渲染方式的两个主要阶段。紫色曲线显示了不透明pass,深红色曲线显示了透明pass。橙色线显示了渲染场景的总时间。
As can be seen from this graph, rendering opaque geometry takes the most time and increases exponentially as the number of lights increases. The time to render transparent geometry also increases exponentially but there is much less transparent geometry in the scene than opaque geometry so the increase seems more gradual.
如图所示,渲染不透明几何体需要最长的时间,并且随着灯光数量的增加呈指数增长。渲染透明几何体的时间也呈指数增长,但场景中的透明几何体要比不透明几何体少得多,因此增长看起来更为渐进。
Even with very large lights, standard forward rendering is able to render 64 dynamic lights while keeping the frame time below the maximum threshold of 33.3 ms (30 FPS). With more than 512 lights, the frame time became too high to be worth measuring.
即使使用非常大的灯光,标准前向渲染也可以渲染 64 个动态灯光,同时保持帧时间低于 33.3 毫秒(30 FPS)的最大阈值。超过 512 个灯光后,帧时间高到不再值得测量。
From this we can conclude that if the scene contains more than 64 large visible lights, you may want to consider using a different rendering technique than forward rendering.
由此我们可以得出结论,如果场景中包含超过 64 个大型可见光源,您可能需要考虑使用不同于前向渲染的渲染方式。
Small Lights 小灯
Forward rendering performs better when the scene contains many small lights. In this case, the rendering technique can handle twice as many lights while still maintaining acceptable performance. After more than 1024 lights, the frame time was so high, it was no longer worth measuring.
正向渲染在场景包含许多小灯时表现更好。在这种情况下,渲染方式可以处理两倍多的灯光,同时仍保持可接受的性能。超过 1024 个灯光后,帧时间变得如此之高,不再值得测量。
We see again that most of the time is spent rendering opaque geometry, which is not surprising. The trends for both large and small lights are similar but when using small lights, we can create twice as many lights while achieving acceptable frame-rates.
我们再次看到,大部分时间都花在渲染不透明几何体上,这并不令人意外。大灯和小灯的趋势相似,但使用小灯时,我们可以创建两倍多的灯光,同时实现可接受的帧速率。
Next I’ll analyze the performance of the deferred rendering technique.
接下来我将分析延迟渲染方式的性能。
Deferred Rendering Performance 延迟渲染性能
The same experiment was repeated but this time using the deferred rendering technique. Let’s first analyze the performance of using large screen-filling lights.
同样的实验被重复进行,但这次使用了延迟渲染方式。让我们首先分析使用大屏幕填充灯光的性能。
Large Lights 大灯
The graph below shows the performance results of deferred rendering using large lights.
下面的图表显示了使用大灯光的延迟渲染的性能结果。
Rendering large lights using deferred rendering proved to be only marginally better than forward rendering. Since rendering transparent geometry uses the exact same code paths as the forward rendering technique, the performance of rendering transparent geometry using forward versus deferred rendering is virtually identical. As expected, there is no performance benefit when rendering transparent geometry.
使用延迟渲染渲染大型灯光,与前向渲染相比,效果仅略有改善。由于渲染透明几何体使用与前向渲染方式完全相同的代码路径,因此使用前向渲染与延迟渲染渲染透明几何体的性能几乎相同。如预期的那样,在渲染透明几何体时没有性能优势。
The marginal performance benefit of rendering opaque geometry using deferred rendering is primarily due to the reduced number of redundant lighting computations that forward rendering performs on occluded geometry. Redundant lighting computations that are performed when using forward rendering can be mitigated by using a depth pre-pass which would allow for early z-testing to reject fragments before performing expensive lighting calculations. Deferred rendering implicitly benefits from early z-testing and stencil operations that are not performed during forward rendering.
使用延迟渲染渲染不透明几何体的边际性能优势主要是由于前向渲染在遮挡几何体上执行的冗余光照计算数量减少。使用深度预pass可以减轻前向渲染时执行的冗余光照计算,从而允许在执行昂贵的光照计算之前拒绝片元。延迟渲染隐式受益于早期 z 测试和前向渲染期间未执行的模板操作。
Small Lights 小灯
The graph below shows the performance results of deferred rendering using small lights.
下面的图表显示了使用小灯光进行延迟渲染的性能结果。
The graph shows that deferred rendering is capable of rendering 512 small dynamic lights while still maintaining acceptable frame rates. In this case the time to render transparent geometry greatly exceeds that of rendering opaque geometry. If rendering only opaque objects, then the deferred rendering technique is capable of rendering 2048 lights while keeping the frame time below the minimum threshold of 16.6 ms (60 FPS). The time to render transparent geometry greatly exceeds the maximum threshold after about 700 lights.
图表显示,延迟渲染能够渲染 512 个小型动态光源,同时仍保持可接受的帧率。在这种情况下,渲染透明几何体的时间远远超过了渲染不透明几何体的时间。如果只渲染不透明对象,那么延迟渲染方式能够渲染 2048 个光源,同时保持帧时间低于 16.6 毫秒(60 FPS)的最小阈值。渲染透明几何体的时间在大约 700 个光源后远远超过了最大阈值。
Forward Plus Performance Forward+性能
The same experiment was repeated once again using tiled forward rendering. First we will look at the performance characteristics using large lights.
同样的实验再次使用平铺的前向渲染进行了重复。首先,我们将分析使用大光源的性能特征。
Large Lights 大灯
The graph below shows the performance results of tiled forward rendering using large scene lights.
下面的图表显示了使用大场景灯光的平铺前向渲染的性能结果。
The graph shows that tiled forward rendering is not well suited for rendering scenes with many large lights. Rendering 512 screen filling lights in the scene caused issues because the demo only accounts for having an average of 200 lights per tile. With 512 large lights the 200 light average was exceeded and many tiles simply appeared black.
图表显示,瓷砖式前向渲染不适合渲染具有许多大光源的场景。在场景中渲染 512 个铺满屏幕的光源会导致问题,因为演示仅考虑每个瓷砖平均有 200 个光源。使用 512 个大光源,超过了 200 个光源的平均值,许多瓷砖简单地变黑。
Using large lights, the light culling phase never exceeded 1 ms but the opaque pass and the transparent pass quickly exceeded the maximum frame-time threshold of 33.3 ms (30 FPS).
使用大光源,光源剔除阶段从未超过 1 毫秒,但不透明pass和透明pass很快超过了 33.3 毫秒(30 FPS)的最大帧时间阈值。
Small Lights 小灯
The graph shows the performance of tiled forward rendering using small lights.
该图表显示了使用小光源的瓷砖式前向渲染的性能。
Forward plus really shines when using many small lights. In this case we see that the light culling phase (orange line) is the primary bottleneck of the rendering technique. Even with over 16,000 lights, the time to render opaque (blue line) and transparent (purple line) geometry stays below the minimum threshold needed to achieve the desired frame-rate of 60 FPS. The majority of the frame time is consumed by the light culling phase.
前向加真正在使用许多小灯光时发挥作用。在这种情况下,我们看到光剔除阶段(橙线)是渲染方式的主要瓶颈。即使有超过 16,000 个灯光,渲染不透明(蓝线)和透明(紫线)几何体都低于实现所需帧速率 60 FPS 的最低阈值。大部分帧时间被光剔除阶段消耗。
Now let’s see how the three techniques compare against each other.
现在让我们看看这三种方式如何相互比较。
Techniques Compared 方式比较
First we’ll look at how the three techniques compare when using large lights.
首先,我们将看看这三种方式在使用大灯时如何相互比较。
Large Lights 大灯
The graph below shows the performance of the three rendering techniques when using large lights.
下面的图表显示了使用大灯光时三种渲染方式的性能。
As expected, forward rendering is the most expensive rendering algorithm when rendering large lights. Deferred rendering and tiled forward rendering are comparable in performance. Even if we disregard rendering transparent geometry in the scene, deferred rendering and tiled forward rendering have similar performance characteristics.
正如预期的那样,在渲染大光源时,前向渲染是最昂贵的渲染算法。延迟渲染和瓦片前向渲染在性能上是可比的。即使我们忽略场景中透明几何体的渲染,延迟渲染和瓦片前向渲染具有类似的性能特征。
If we consider scenes with only a few large lights there is still no discernible performance difference between forward, deferred, and forward plus rendering.
如果我们只考虑有几个大灯光的场景,前向渲染、延迟渲染或前向加渲染之间仍然没有明显的性能优势。
If we consider the memory footprint required to perform forward rendering versus deferred rendering versus tiled forward rendering then traditional forward rendering has the smallest memory usage.
如果我们考虑执行前向渲染与延迟渲染与平铺前向渲染所需的内存占用,那么传统的前向渲染具有最小的内存使用量。
Regardless of the number of lights in the scene, deferred rendering requires about four bytes of GPU memory per pixel per additional G-buffer render target. Tiled forward rendering requires additional GPU storage for the light index list and the light grid which must be stored even when the scene contains only a few dynamic lights.
无论场景中灯光数量如何,延迟渲染每个额外的 G 缓冲渲染目标每像素大约需要四个字节的 GPU 内存。平铺前向渲染需要额外的 GPU 存储用于光索引列表和光栅格,即使场景只包含少量动态光也必须存储。
- Deferred Rendering (Diffuse, Specular, Normal @ 1280×720): +11 MB
延迟渲染(漫反射,镜面反射,法线 @ 1280×720):+11 MB
- Tiled Forward Rendering (Light Index List, Light Grid @ 1280×720): +5.76 MB
平铺式前向渲染(光索引列表,光栅格 @ 1280×720):+5.76 MB
The additional storage requirements for deferred rendering are based on an additional three full-screen buffers at 32 bits (4 bytes) per pixel. The depth/stencil buffer and the light accumulation buffers are not considered as additional storage because standard forward rendering uses these buffers as well.
延迟渲染的额外存储需求基于每像素 32 位(4 字节)的三个全屏缓冲区。深度/模板缓冲区和光积累缓冲区不被视为额外存储,因为标准前向渲染也使用这些缓冲区。
The additional storage requirements for tiled forward rendering are based on two light index lists that have enough storage for an average of 200 lights per tile and two 80×45 light grids that store a 2-component unsigned integer per grid cell.
平铺式前向渲染的额外存储需求基于两个光索引列表,每个瓦片平均存储 200 个光源,以及两个 80×45 的光栅,每个格子存储 2 个无符号整数。
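The figures above can be reproduced with a short calculation. The sketch assumes 4-byte light indices and 4-byte grid components, which the text implies but does not state outright; the constant names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t kWidth = 1280;
constexpr std::uint64_t kHeight = 720;

// Deferred: three extra full-screen render targets at 4 bytes per pixel.
constexpr std::uint64_t kDeferredExtraBytes = 3 * kWidth * kHeight * 4;   // 11,059,200

// Tiled forward: 16x16 tiles give an 80 x 45 grid (3,600 tiles).
constexpr std::uint64_t kTiles = (kWidth / 16) * (kHeight / 16);

// Two light index lists (opaque and transparent), 200 indices per tile on
// average, 4 bytes per index (assumed).
constexpr std::uint64_t kIndexListBytes = 2 * kTiles * 200 * 4;           // 5,760,000

// Two light grids, each cell holding two 4-byte unsigned integers (assumed).
constexpr std::uint64_t kLightGridBytes = 2 * kTiles * 2 * 4;             // 57,600
```

Under these assumptions the 5.76 MB figure corresponds to the two light index lists alone; the two light grids add roughly another 0.06 MB.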
If GPU storage is a rare commodity for the target platform and there is no need for many lights in the scene, traditional forward rendering is still the best choice.
如果 GPU 存储在目标平台上是一种稀缺资源,并且场景中不需要太多的光照,传统的前向渲染仍然是最佳选择。
Small Lights 小灯
The graph below shows the performance of the three rendering techniques when using small lights.
下面的图表显示了使用小灯光时三种渲染方式的性能。
In the case of small lights, tiled forward rendering clearly comes out as the winner in terms of rendering times. Up until somewhere around 128 lights, deferred and tiled forward rendering are comparable in performance but quickly diverge when the scene contains many dynamic lights. Also we must consider the fact that a large portion of the deferred rendering technique is consumed by rendering transparent objects. If transparent objects are not a requirement, then deferred rendering may be a viable option.
在小灯光的情况下,瓷砖式前向渲染在渲染时间方面显然是赢家。直到大约 128 个灯光左右,延迟和瓷砖式前向渲染在性能上是可比的,但当场景包含许多动态灯光时,它们很快就会分道扬镳。此外,我们必须考虑到延迟渲染方式的大部分消耗在渲染透明对象上。如果透明对象不是必需的话,那么延迟渲染可能是一个可行的选择。
Even with small lights, deferred rendering requires many more draw calls to render the geometry of the light volumes. Using deferred rendering, each light volume must be rendered at least twice: the first draw call updates the stencil buffer and the second draw call performs the lighting equations. If the graphics platform is very sensitive to excessive draw calls, then deferred rendering may not be the best choice.
即使使用小灯光,延迟渲染也需要更多的绘制调用来渲染光体的几何形状。使用延迟渲染,每个光体至少必须渲染两次,第一次绘制调用更新模板缓冲区,第二次绘制调用执行光照方程。如果图形平台对过多的绘制调用非常敏感,那么延迟渲染可能不是最佳选择。
Similar to the scenario with large lights, when rendering only a few lights in the scene then all three techniques have similar performance characteristics. In this case, we must consider the additional memory requirements that are imposed by deferred and tiled forward rendering. Again, if GPU memory is scarce and there is no need for many dynamic lights in the scene then standard forward rendering may be a viable solution.
与大灯光场景类似,当场景中只渲染少量灯光时,三种方式的性能特征相似。在这种情况下,我们必须考虑延迟和平铺前向渲染所施加的额外内存需求。同样,如果 GPU 内存稀缺且场景中不需要许多动态灯光,则标准前向渲染可能是一个可行的解决方案。
Future Considerations 未来考虑
While working on this project I have identified several issues that would benefit from consideration in the future.
在这个项目中工作时,我发现了一些问题,这些问题在未来值得考虑。
- General Issues: 一般问题:
  - Size of the light structure 光源结构的大小
- Forward Rendering: 前向渲染:
  - Depth pre-pass 深度预pass
  - View frustum culling of visible lights 可见光源的视锥体剔除
- Deferred Rendering: 延迟渲染:
  - Optimize G-buffers 优化 G-Buffer
  - Rendering of directional lights 定向光的渲染
- Tiled Forward Rendering: 平铺前向渲染:
  - Improve light culling 改进光剔除
General Considerations 一般考虑
For each of the rendering techniques used in this demo there is only a single global light list which stores directional, point, and spotlights in a single data structure. In order to store all of the properties necessary to perform correct lighting, each individual light structure requires 160 bytes of GPU memory. If we only store the absolute minimum amount of information needed to describe a light source we could take advantage of improved caching of the light data and potentially improve rendering performance across all rendering techniques. This may require having additional data structures to store only the relevant information that is needed by either the compute or the fragment shader or creating separate lists for directional, spot, and point lights so that no redundant information that is not relevant to the light source is stored in the data structure.
在此演示中使用的每种渲染方式中,只有一个存储定向光、点光和聚光灯的全局光列表,存储在单个数据结构中。为了存储执行正确光照所需的所有属性,每个单独的光结构需要 160 字节的 GPU 内存。如果我们只存储描述光源所需的绝对最小信息,我们可以利用改进的光数据缓存,并可能改善所有渲染方式的渲染性能。这可能需要有额外的数据结构,只存储计算着色器或片元着色器所需的相关信息,或者创建定向光、聚光灯和点光的单独列表,以便不存储与光源无关的冗余信息在数据结构中。
Forward Rendering 前向渲染
This implementation of the forward rendering technique makes no attempt to optimize the forward rendering pipeline. Culling lights against the view frustum would be a reasonable method to improve the rendering performance of the forward renderer.
此前向渲染方式的实现没有尝试优化前向渲染管线。根据视锥体剔除光源将是改善前向渲染器渲染性能的合理方法。
Performing a depth prepass as the first step of the forward rendering technique would allow us to take advantage of early z-testing to eliminate redundant lighting calculations.
作为正向渲染方式的第一步,执行深度预先通行将使我们能够利用早期 z 测试来消除冗余的光照计算。
Deferred Rendering 延迟渲染
When creating the implementation for the deferred rendering technique, I did not spend much time evaluating the performance of deferred rendering dependent on the format of the G-buffer textures used. The layout of the G-buffer was chosen for simplicity and ease of use. For example, the G-buffer texture to store view space normals uses a 4-component 32-bit floating-point buffer. Storing this render target as a 2-component 16-bit fixed-point buffer would not only reduce the buffer size by 75%, it would also improve texture caching. The only change that would need to be made to the shader is the method used to pack and unpack the normal data in the buffer. To pack the normal into the G-buffer, we would only need to cast the normalized 32-bit floating-point x and y values of the normal into 16-bit floating point values and store them in the render target. To unpack the normals in the lighting pass, we could read the 16-bit components from the buffer and compute the z-component of the normal by applying the following formula:
在创建延迟渲染方式的实现时,我并没有花太多时间评估延迟渲染的性能与所使用的 G-buffer 纹理格式有关。G-buffer 的布局是为了简单和易用而选择的。例如,用于存储视图空间法线的 G-buffer 纹理使用了一个 4 分量 32 位浮点缓冲。将这个渲染目标存储为一个 2 分量 16 位定点缓冲不仅会减少缓冲区大小 75%,还会改善纹理缓存。唯一需要更改的是在着色器中用于打包和解包法线数据的方法。为了将法线打包到 G-buffer 中,我们只需要将法线的归一化 32 位浮点 x 和 y 值转换为 16 位浮点值并存储在渲染目标中。在光照pass中解包法线时,我们可以从缓冲区中读取 16 位分量,并通过以下公式计算法线的 z 分量:
$$z = \sqrt{1 - (x^2 + y^2)}$$
This would result in the z-component of the normal always being positive in the range [0, 1]. This is usually not a problem since the normals are always stored in view-space and if the normal’s z-component is negative, then it would be back-facing and back-facing polygons should be culled anyway.
这将导致法线的 z 分量在 [0, 1] 范围内始终为正。这通常不是问题,因为法线始终存储在视图空间中,并且如果法线的 z 分量为负,则它将是背面的,并且背面的多边形无论如何都应该被剔除。
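A minimal sketch of the pack and unpack steps is shown below. It works on 32-bit floats and does not model the 16-bit quantization of the render target; the type and function names are hypothetical. The z component is rebuilt as the square root of 1 − (x² + y²), which relies on the normal being unit length.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Vec2 { float x, y; };

// Keep only x and y of a unit-length view-space normal.
inline Vec2 packNormal(const Vec3& n)
{
    return Vec2{n.x, n.y};
}

// Rebuild z from x and y, assuming the normal was unit length and z >= 0.
// The clamp guards against small negative values caused by rounding.
inline Vec3 unpackNormal(const Vec2& p)
{
    const float zSquared = std::max(0.0f, 1.0f - (p.x * p.x + p.y * p.y));
    return Vec3{p.x, p.y, std::sqrt(zSquared)};
}
```

The sign of z is lost in the round trip, which is why the scheme depends on back-facing normals not reaching the lighting pass.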
Another potential area of improvement for the deferred renderer is the handling of directional lights. Currently the implementation renders directional lights as full-screen quads in the lighting pass. This may not be the best approach as even a few directional lights will cause severe overdraw and could become a problem on fill-rate bound hardware. To mitigate this issue, we could move the lighting computations for directional lights into the G-buffer pass and accumulate the lighting contributions from directional lights into the light accumulation buffer similar to how ambient and emissive terms are being applied.
延迟渲染器的另一个潜在改进领域是处理定向光源。目前的实现将定向光源渲染为全屏四边形在光照pass中。这可能不是最佳方法,因为即使是少量的定向光源也会导致严重的过度绘制,并可能成为填充率受限硬件的问题。为了缓解这个问题,我们可以将定向光源的光照计算移到 G-buffer pass,并将定向光源的光照贡献累积到光积累缓冲区中,类似于环境和发射项的应用方式。
This technique could be further improved by performing a depth-prepass before the G-buffer pass to allow for early z-testing to remove redundant lighting calculations.
这种方式可以通过在 G-Buffer传递之前执行深度预pass来进一步改进,以允许提前进行 z 测试,以消除多余的光照计算。
One of the advantages of using deferred rendering is that shadow maps can be recycled because only a single light is being rendered in the lighting pass at a time so only one shadow map needs to be allocated. Moving the lighting calculations for directional lights to the G-buffer pass would require that any shadow maps used by the directional lights need to be available before the G-buffer pass. This is only a problem if there are a lot of shadow casting directional lights in the scene. If using a lot of shadow-casting directional lights, this method of performing lighting computations of directional lights in the G-buffer pass may not be feasible.
使用延迟渲染的一个优点是阴影贴图可以被回收利用,因为在光照pass一次只渲染一个光源,所以只需要分配一个阴影贴图。将定向光的光照计算移动到 G 缓冲pass会要求在 G 缓冲pass之前需要准备好定向光使用的任何阴影贴图。只有在场景中有很多投射阴影的定向光时才会出现问题。如果使用了很多投射阴影的定向光,在 G 缓冲pass中执行定向光的光照计算的方法可能不可行。
Tiled Forward Rendering 平铺式前向渲染
As can be seen from the experiment results, the light culling stage takes a considerable amount of time to perform. If the performance of the light culling phase could be improved then we could gain an overall performance improvement of the tiled forward rendering technique. Perhaps we could perform an early culling step that eliminates lights that are not in the viewing frustum. This would require creating another compute shader that performs view frustum culling against all lights in the scene but instead of culling all lights against 3,600 frustums, only the view frustum needs to be checked. This way, each thread in the dispatch would only need to check a very small subset of the lights against the view frustum. After culling the lights against the larger view frustum, the per-tile light culling compute shader would only have to check the lights that are contained in the view frustum.
从实验结果可以看出,光剔除阶段需要相当长的时间来执行。如果光剔除阶段的性能得到改进,那么我们就可以获得平铺式前向渲染方式的整体性能提升。也许我们可以执行一个早期剔除步骤,消除不在视锥体内的光源。这将需要创建另一个计算着色器,对场景中的所有光源执行视锥体剔除,但不是对所有光源执行 3,600 个视锥体的剔除,只需要检查视锥体。这样,调度中的每个线程只需要检查一小部分光源是否在视锥体内。在根据较大的视锥体剔除光源之后,每个瓦片的光剔除计算着色器只需要检查包含在视锥体内的光源。
Another improvement to the light culling phase may be achievable using sparse octrees to store a light list at each node of the octree. A node is split if it exceeds some maximum threshold for light counts. Nodes that don’t contain any lights in the octree can be removed from the octree and would not need to be considered during final rendering.
光照剔除阶段的另一个改进可能是使用稀疏八叉树来存储八叉树每个节点的光列表。如果节点中的光数量超过某个最大阈值,就会对节点进行分割。在八叉树中不包含任何光的节点可以从八叉树中移除,并且在最终渲染过程中无需考虑。
DirectX 12 introduces Volume Tiled Resources [20] which could be used to implement the sparse octree. Nodes in the octree that don’t have any lights would not need any backing memory. I’m not exactly sure how this would be implemented but it may be worth investigating.
DirectX 12 引入了体积平铺资源[20],可用于实现稀疏八叉树。八叉树中没有任何光源的节点不需要任何后备内存。我不确定这将如何实现,但值得调查。
Another area of improvement for the tiled forward rendering technique would be to improve the accuracy of the light culling. Frustum culling could result in a light being considered to be contained within a tile when in fact no part of the light volume is contained in the tile.
瓷砖化前向渲染方式的另一个改进领域是提高光照剔除的准确性。视锥剔除可能导致将光视为包含在瓷砖中,而实际上光体的任何部分都不包含在瓷砖中。
As can be seen in the above image, a point light is highlighted with a red circle. The blue tiles in the image show which tiles detect that the circle is contained within the frustum of the tile. Of course the tiles inside the red circle should detect the point light but the tiles at the corners are false positives. This happens because the sphere cannot be totally rejected by any plane of the tile’s frustum.
如上图所示,一个点光源用红色圆圈标出。图中的蓝色瓷砖显示了哪些瓷砖检测到圆圈包含在瓷砖的视锥体内。当然,红色圆圈内的瓷砖应该检测到点光源,但是在角落的瓷砖是误报。这是因为球体不能被瓷砖视锥体的任何平面完全拒绝。
If we zoom-in to the top-left tile (highlighted green in the video above) we can inspect the top, left, bottom, and right frustum planes of the tile. If you play the video you will see that the sphere is partially contained in all four of the tile’s frustum planes and thus the light cannot be culled.
如果我们放大到左上角的瓷砖(在上面的视频中用绿色标出),我们可以检查瓷砖的顶部、左侧、底部和右侧截锥面。如果您播放视频,您会看到球部分包含在瓷砖的四个截锥面中,因此光线无法被剔除。
In a GDC 2015 presentation, Gareth Thomas [21] presents several methods to improve the accuracy of tile-based compute rendering. He suggests using parallel reduction instead of atomic min/max functions in the light culling compute shader. His performance analysis shows that he was able to achieve an 11 – 14 percent performance increase by using parallel reduction instead of atomic min/max.
在 2015 年 GDC 由 Gareth Thomas [21]演示的演示中,他提出了几种改进基于瓦片的计算渲染准确性的方法。他建议在光遮挡计算着色器中使用并行归约而不是原子最小/最大函数。他的性能分析显示,通过使用并行归约而不是原子最小/最大,他能够实现 11-14%的性能提升。
In order to improve the accuracy of the light culling, Gareth suggests using an axis-aligned bounding box (AABB) to approximate the tile frustum. Using AABB’s to approximate the size of the tile frustum proves to be a successful method for reducing the number of false positives without incurring an expensive intersection test. To perform the sphere-AABB intersection test, Gareth suggests using a very simple algorithm described by James Arvo in the first edition of the Graphics Gems series [22].
为了提高光遮挡的准确性,Gareth 建议使用轴对齐边界框(AABB)来近似瓦片视锥体。使用 AABB 来近似瓦片视锥体的大小被证明是一种成功减少误报数目的方法,而不需要进行昂贵的相交测试。为了执行球体-AABB 相交测试,Gareth 建议使用 James Arvo 在《图形宝石》系列第一版中描述的非常简单的算法。
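The sphere-AABB test attributed to Arvo compares the squared distance from the sphere's center to the closest point of the box with the squared radius. The sketch below is a CPU-side illustration with hypothetical type names, not code from the presentation.

```cpp
struct Vec3 { float x, y, z; };
struct AABB { Vec3 min, max; };
struct Sphere { Vec3 center; float radius; };

// Squared distance from c to the interval [lo, hi] along one axis.
inline float axisDistanceSquared(float c, float lo, float hi)
{
    if (c < lo) { const float d = lo - c; return d * d; }
    if (c > hi) { const float d = c - hi; return d * d; }
    return 0.0f;
}

// True when the sphere touches or overlaps the box.
inline bool sphereIntersectsAABB(const Sphere& s, const AABB& box)
{
    const float distSq = axisDistanceSquared(s.center.x, box.min.x, box.max.x)
                       + axisDistanceSquared(s.center.y, box.min.y, box.max.y)
                       + axisDistanceSquared(s.center.z, box.min.z, box.max.z);
    return distSq <= s.radius * s.radius;
}
```

A sphere placed diagonally off a corner of the box is rejected by this test even when no single face plane separates it from the box, which is the false-positive case that plane-based frustum culling misses.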
Another issue with tile-based light culling using the min/max depth bounds occurs in tiles with large depth discontinuities, for example when foreground geometry only partially overlaps a tile.
使用最小/最大深度边界进行基于瓦片的光遮挡时的另一个问题出现在具有大深度不连续性的瓦片中,例如当前景几何体仅部分重叠瓦片时。
The blue and green tiles contain very few lights. In this case the minimum and maximum depth values are in close proximity. The red tiles indicate that the tile contains many lights due to a large depth disparity. In Gareth Thomas’s presentation [21] he suggests splitting the frustum in two halves and computing minimum and maximum depth values for each half of the split frustum. This implies that the light culling algorithm must perform twice as much work per tile but his performance analysis shows that total frame time is reduced by about 10 – 12 percent using this technique.
蓝色和绿色瓦片中包含非常少的光。在这种情况下,最小和最大深度值非常接近。红色瓦片表示该瓦片包含许多光,因为深度差异很大。在加雷斯·托马斯的演示中[21],他建议将视锥体分成两半,并为分割后的每半部分计算最小和最大深度值。这意味着光剔除算法必须对每个瓦片执行两倍的工作量,但他的性能分析显示,使用这种方式可以将总帧时间减少约 10-12%。
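One way to picture the split-frustum idea is to divide a tile's depth samples at the midpoint of their range and keep separate bounds for each half. The sketch below is only an illustration of that idea under stated assumptions (positive view-space depths, a midpoint split, hypothetical names); it is not taken from the presentation.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

struct DepthRange { float min, max; bool empty; };
struct SplitDepthBounds { DepthRange nearHalf, farHalf; };

// Assumption: depths are positive view-space distances for the pixels of one tile.
inline SplitDepthBounds splitDepthBounds(const std::vector<float>& tileDepths)
{
    const float inf = std::numeric_limits<float>::infinity();
    SplitDepthBounds b{{inf, -inf, true}, {inf, -inf, true}};
    if (tileDepths.empty()) return b;

    // Split the tile's overall depth range at its midpoint.
    const auto mm = std::minmax_element(tileDepths.begin(), tileDepths.end());
    const float mid = 0.5f * (*mm.first + *mm.second);

    // Track tight bounds for the samples that fall in each half.
    for (float d : tileDepths) {
        DepthRange& r = (d <= mid) ? b.nearHalf : b.farHalf;
        r.min = std::min(r.min, d);
        r.max = std::max(r.max, d);
        r.empty = false;
    }
    return b;
}

// True when a light spanning [lightMin, lightMax] in depth overlaps the range.
inline bool overlaps(const DepthRange& r, float lightMin, float lightMax)
{
    return !r.empty && lightMax >= r.min && lightMin <= r.max;
}
```

For a tile with depths clustered around 1 and 10, a light spanning depths 4 to 6 overlaps the single combined range but neither half, so it can be culled.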
A more interesting performance optimization is a method called Clustered Shading presented by Ola Olsson, Markus Billeter, and Ulf Assarsson in their paper titled “Clustered Deferred and Forward Shading” [23]. Their method groups view samples with similar properties (3D position and normals) into clusters. Lights in the scene are assigned to clusters and the per-cluster light lists are used in final shading. In their paper, they claim to be able to handle one million light sources while maintaining real-time frame-rates.
更有趣的性能优化是一种称为“集群着色”的方法,由 Ola Olsson、Markus Billeter 和 Ulf Assarsson 在他们的论文《集群延迟和前向着色》[23]中提出。他们的方法将具有相似属性(3D 位置和法线)的视图样本分组到集群中。场景中的灯光被分配到集群,并且每个集群的灯光列表在最终着色中使用。在他们的论文中,他们声称能够处理一百万个光源,同时保持实时帧速率。
Other space partitioning algorithms may also prove to be successful at improving the performance of tile-based compute shaders. For example, Binary Space Partitioning (BSP) trees could be used to split lights into the leaves of a binary tree. When performing final shading, only the lights in the leaf node of the BSP tree that contains the fragment need to be considered for lighting.
其他空间划分算法也可能在改进基于瓦片的计算着色器性能方面取得成功。例如,使用二叉空间划分(BSP)树将光源分割到二叉树的叶子节点中。在执行最终着色时,只需要考虑片元存在的 BSP 叶子节点中的光源进行光照。
Another possible data structure that could be used to reduce redundant lighting calculations is a sparse voxel octree as described by Cyril Crassin and Simon Green in OpenGL Insights [24]. Instead of using the octree to store material information, the data structure is used to store the light index lists of lights contained in each node. During final shading, the light index lists are queried from the octree depending on the 3D position of the fragment.
另一个可能用于减少冗余光照计算的数据结构是稀疏体素八叉树,由 Cyril Crassin 和 Simon Green 在 OpenGL Insights [24]中描述。该数据结构不是用于存储材质信息,而是用于存储每个节点中包含的光源索引列表。在最终着色期间,根据片元的 3D 位置从八叉树中查询光源索引列表。
Conclusion 结论
In this article I described the implementation of three rendering techniques:
在本文中,我描述了三种渲染方式的实现:
- Forward Rendering Forward 渲染
- Deferred Rendering 延迟渲染
- Tiled Forward (Forward+) Rendering
平铺前向(Forward+)渲染
I have shown that traditional forward rendering is well suited for scenarios that require support for multiple shading models and semi-transparent objects. Forward rendering is also well suited for scenes that have only a few dynamic lights. The analysis shows that scenes containing fewer than 100 dynamic lights still perform reasonably well on commercial hardware. Forward rendering also has a low memory footprint when multiple shadow maps are not required. When GPU memory is scarce and support for many dynamic lights is not a requirement (for example, on mobile or embedded devices), traditional forward rendering may be the best choice.
我已经表明传统前向渲染非常适合需要支持多种着色模型和半透明对象的场景。前向渲染也非常适合只有少量动态光源的场景。分析表明,包含少于 100 个动态场景光源的场景在商用硬件上仍然表现得相当好。当 GPU 内存稀缺且不需要支持许多动态光源时(例如在移动设备或嵌入式设备上),传统前向渲染可能是最佳选择,因为前向渲染在不需要多个阴影贴图时具有较低的内存占用。
Deferred rendering is best suited for scenarios that don’t have a requirement for multiple shading models or semi-transparent objects but do have a requirement of many dynamic scene lights. Deferred rendering is well suited for many shadow casting lights because a single shadow map can be shared between successive lights in the lighting pass. Deferred rendering is not well suited for devices with limited GPU memory. Amongst the three rendering techniques, deferred rendering has the largest memory footprint requiring an additional 4 bytes per pixel per G-buffer texture (~3.7 MB per texture at a screen resolution of 1280×720).
延迟渲染最适合不需要多个着色模型或半透明对象的情况,但需要许多动态场景光源的情况。延迟渲染非常适合许多投射阴影的光源,因为在光照pass中可以在连续光源之间共享单个阴影贴图。延迟渲染不适合 GPU 内存有限的设备。在三种渲染方式中,延迟渲染具有最大的内存占用量,每像素每个 G 缓冲纹理需要额外 4 字节(在分辨率为 1280×720 的屏幕分辨率下每个纹理约为 3.7 MB)。
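The per-texture figure quoted above follows directly from the resolution and the texel size. The sketch below reproduces that arithmetic; the function names and the texture count of three in the usage note are hypothetical, chosen only to show how the total scales with the number of G-buffer textures.

```cpp
#include <cstdint>

// Bytes needed for one G-buffer texture at the given resolution.
constexpr std::uint64_t GBufferTextureBytes(std::uint32_t width, std::uint32_t height,
                                            std::uint32_t bytesPerPixel = 4) {
    return static_cast<std::uint64_t>(width) * height * bytesPerPixel;
}

// Bytes needed for a set of equally sized G-buffer textures.
constexpr std::uint64_t GBufferBytes(std::uint32_t width, std::uint32_t height,
                                     std::uint32_t textureCount,
                                     std::uint32_t bytesPerPixel = 4) {
    return GBufferTextureBytes(width, height, bytesPerPixel) * textureCount;
}

// 1280 x 720 x 4 bytes = 3,686,400 bytes, roughly 3.7 MB per texture.
static_assert(GBufferTextureBytes(1280, 720) == 3686400, "per-texture size at 720p");
```

With a hypothetical layout of three such textures, `GBufferBytes(1280, 720, 3)` comes to 11,059,200 bytes, and the cost grows linearly with both resolution and texture count.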
Tiled forward rendering has a small initial overhead required to dispatch the light culling compute shader, but with many dynamic lights its performance quickly surpasses that of both forward and deferred rendering. Tiled forward rendering requires a small amount of additional memory: approximately 5.7 MB of additional storage is required to store the light index list and light grid using 16×16 tiles at a screen resolution of 1280×720. Tiled forward rendering requires that the target platform supports compute shaders. If compute shaders are not available, it is possible to perform the light culling on the CPU and pass the light index list and light grid to the pixel shader, but the performance trade-off might negate the benefit of performing light culling in the first place.
瓦片前向渲染需要一定的初始开销来调度光照剔除计算着色器,但是具有许多动态光源的瓦片前向渲染的性能很快超过了前向渲染和延迟渲染的性能。瓦片前向渲染需要少量额外的内存。在屏幕分辨率为 1280×720 时,需要大约 5.7 MB 的额外存储空间来存储光索引列表和光栅,使用 16×16 个瓦片。瓦片前向渲染要求目标平台支持计算着色器。如果计算着色器不可用,则可以在 CPU 上执行光照剔除,并将光索引列表和光栅传递给像素着色器,但性能折衷可能会抵消进行光照剔除的好处。
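A rough estimate of the tiled forward storage cost can be worked out as follows. The sketch rests on several assumptions that are not stated in this section: an average budget of 200 light indices per tile, 4-byte indices, two sets of buffers (one for opaque and one for semi-transparent geometry), and a light grid that stores an offset and a count per tile. All function names are hypothetical.

```cpp
#include <cstdint>

// Number of screen tiles, rounding partial tiles up.
constexpr std::uint32_t TileCount(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t tileSize) {
    return ((width + tileSize - 1) / tileSize) * ((height + tileSize - 1) / tileSize);
}

// Storage for the light index lists (assumed: 4-byte indices, listCount lists).
constexpr std::uint64_t LightIndexListBytes(std::uint32_t tiles,
                                            std::uint32_t avgLightsPerTile,
                                            std::uint32_t listCount = 2) {
    return static_cast<std::uint64_t>(tiles) * avgLightsPerTile *
           sizeof(std::uint32_t) * listCount;
}

// Storage for the light grids (assumed: one offset and one count per tile).
constexpr std::uint64_t LightGridBytes(std::uint32_t tiles, std::uint32_t listCount = 2) {
    return static_cast<std::uint64_t>(tiles) * 2 * sizeof(std::uint32_t) * listCount;
}

// 1280x720 with 16x16 tiles gives 80 x 45 = 3,600 tiles.
static_assert(TileCount(1280, 720, 16) == 3600, "tile count at 720p");
```

Under these assumptions, `LightIndexListBytes(3600, 200)` is 5,760,000 bytes and `LightGridBytes(3600)` is 57,600 bytes, which lands close to the figure quoted above. The index lists dominate, so the per-tile light budget is the main lever on memory use.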
Tiled forward shading natively supports both multiple materials and semi-transparent materials (using two light index lists), and both opaque and semi-transparent materials can benefit from the performance gains offered by tiled forward shading.
平铺式前向渲染本地支持多材质和半透明材质(使用两个光索引列表),不透明和半透明材质都可以从平铺式前向渲染提供的性能增益中受益。
Although tiled forward shading may seem like the answer to life, the universe and everything (actually, 42 is), there are improvements that can be made to this technique. Clustered deferred rendering [23] should be able to perform even better at the expense of additional memory requirements. Perhaps the memory requirements of clustered deferred rendering could be mitigated by the use of sparse volume textures [20] but that has yet to be seen.
尽管平铺的前向着色似乎是对生命、宇宙和一切的答案(实际上是 42),但这种方式仍有改进空间。聚类延迟渲染[23]应该能够在增加内存需求的情况下表现得更好。也许通过使用稀疏体纹理[20]可以缓解聚类延迟渲染的内存需求,但这还有待观察。
Download the Demo 下载演示
The source code (including pre-built executables) can be downloaded from GitHub using the link below. The repository is almost 1 GB in size and contains all of the pre-built third-party libraries and the Crytek Sponza scene [11].
源代码(包括预构建的可执行文件)可以从 GitHub 使用下面的链接下载。该存储库的大小接近 1GB,并包含所有预构建的第三方库和 Crytek Sponza 场景[11]。
https://github.com/3dgep/ForwardPlus