Whoa! Last time I (Matias) posted, we had a different website 🙂

What I’ve been working on:

I’ve started working on “Compute Shaders” (CS). They have been blocking future progress for far too long. They are required for modern techniques such as Forward+ and come in handy for things like tiled deferred rendering. Originally I started on compute shaders because I wanted them for a new terrain system that could generate the shadow maps in real time (yeah, there’s a new terrain system coming). CS are the only way to do it efficiently.

This has been a lot of work, and it currently lives in the unstable branch “2.1-pso-compute”. Actually, Compute Shaders are already working. However, what we still need to support is UAV buffers, which are, so to speak, the GPU equivalent of read & write malloc’ed arrays. We’ve already added support for UAV textures, but UAV buffers were missing and they are more flexible.

UAV buffers need to be treated with care though, because you’ll likely want to create them from C++ (since they’re almost like a GPU malloc), but the compositor needs to be aware of them.

Why, you ask? Because the compositor is in charge of placing memory barriers and resource transitions. In other words: it needs to prevent race conditions. You see, if you run Compute Shader A, then Compute Shader B, and then render geometry, you have no guarantee that A will be completed before B starts. In fact, rendering geometry may even finish before A does!

This is great for GPU parallelism (colloquially known as Async Compute), but sucks if there are data dependencies (e.g. B depends on A, or rendering geometry depends on A or B). Memory barriers/resource transitions ensure that shaders which must run in order are executed in order while, hopefully, shader executions that are independent can run in parallel without being stalled.

Note: D3D11 implicitly inserts memory barriers between compute shader executions, OpenGL only offers coarse memory barriers, and only Vulkan & D3D12 offer fine-grained memory barriers.

The Compositor is in the best spot for this kind of work because it analyzes dependencies once (during workspace initialization) and can see all inputs, outputs and data dependencies.
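To make the idea concrete, here is a minimal sketch of that kind of one-time analysis. It is not Ogre’s actual compositor code: the `Pass` struct, the `placeBarriers` function and the resource names are all hypothetical, and it models only a single coarse barrier (closer to what OpenGL offers) rather than fine-grained transitions. Each pass declares what it reads and writes, and a barrier is flagged whenever a pass touches something an earlier, not yet synchronized pass also touched in a conflicting way.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one pass (a compute job or a render pass):
// the resources it reads and the resources it writes.
struct Pass
{
    std::string           name;
    std::set<std::string> reads;
    std::set<std::string> writes;
};

// True if the two sets share at least one resource name.
static bool overlaps( const std::set<std::string> &a, const std::set<std::string> &b )
{
    for( const std::string &resource : a )
    {
        if( b.count( resource ) != 0 )
            return true;
    }
    return false;
}

// For every pass, decide whether a (coarse) barrier must be issued before it.
// The analysis runs once over the whole list, in submission order.
std::vector<bool> placeBarriers( const std::vector<Pass> &passes )
{
    std::vector<bool>     needsBarrier( passes.size(), false );
    std::set<std::string> pendingWrites;  // written since the last barrier
    std::set<std::string> pendingReads;   // read since the last barrier

    for( std::size_t i = 0; i < passes.size(); ++i )
    {
        const Pass &pass = passes[i];
        const bool hazard = overlaps( pass.reads, pendingWrites ) ||   // read after write
                            overlaps( pass.writes, pendingWrites ) ||  // write after write
                            overlaps( pass.writes, pendingReads );     // write after read
        if( hazard )
        {
            // A coarse barrier synchronizes everything submitted so far.
            needsBarrier[i] = true;
            pendingWrites.clear();
            pendingReads.clear();
        }
        pendingWrites.insert( pass.writes.begin(), pass.writes.end() );
        pendingReads.insert( pass.reads.begin(), pass.reads.end() );
    }
    return needsBarrier;
}
```

With passes “A writes a UAV buffer”, “B reads that buffer and writes a shadow map”, “C touches unrelated resources” and “render geometry reads the shadow map”, this flags a barrier before B and before the geometry pass, while C is free to overlap with B. A UAV bound behind the analysis’ back never shows up in `reads`/`writes`, which is exactly the failure mode the design has to prevent.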

That means that while UAV buffers can be created and managed from C++, some parts must be handed over or reported to the compositor to ensure proper behavior (whether via scripts or via code). This means I need to be extremely careful with the design to avoid a clueless programmer innocently setting a UAV as input/output of a compute shader directly without the compositor noticing.

To make things worse, D3D11 & OpenGL differ quite greatly in how UAV buffers should be handled.

All in all, progress is steady. Compute Shaders are coming.

Another thing that has grown in importance is support for multiple RenderTarget inputs to compositor workspaces. Right now we only allow defining one “final target”, which is treated as the final output (i.e. the RenderWindow), although it doesn’t necessarily have to be. The intention is to make more than one external RenderTarget available to a Compositor Workspace, which helps a lot in chaining multiple workspaces together (and is also very relevant for calculating memory barriers correctly).


Compute Shaders are defined via JSON and have access to the Hlms

This is something I’ve been wanting to do for a while. Instead of using the low level materials’ syntax or interface (which was half ill-suited, half well-suited for the job), a new special Hlms is in charge of Compute Shaders. We already have a working example (see all possible settings), although beware it’s subject to change. Auto params work, but not all of them, since some don’t make sense because they were meant for rendering (and for the moment, attempting to use the unsuitable ones will likely result in a crash).

Why am I excited about the Hlms access? Because of the preprocessor of course!

For example, you can use the Hlms to unroll a loop based on the width of a texture, thus reducing loop overhead during execution. Unrolling a loop can be critical in fully utilizing all bandwidth when performing certain tasks (such as parallel reduction), though it can hurt performance in other cases (particularly if the instruction count gets too high, or if it results in high register pressure).
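As a rough illustration of what “unrolling at shader generation time” means, here is a toy C++ generator. It does not use the Hlms or its template syntax; `unrollReduction`, `sharedData`, `tid` and `groupBarrier()` are made-up names, and the emitted lines are shader-like pseudocode rather than valid HLSL or GLSL. The point is only that the loop runs on the CPU while the source is produced, so the shader itself contains straight-line code.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Emits one shader-like statement per reduction step instead of a runtime loop.
// `width` is assumed to be a power of two that is known when the source is
// generated (e.g. taken from the texture bound to the job).
std::string unrollReduction( unsigned width )
{
    // Precondition: non-zero power of two.
    assert( width != 0u && ( width & ( width - 1u ) ) == 0u );

    std::ostringstream out;
    for( unsigned stride = width / 2u; stride > 0u; stride /= 2u )
    {
        // Each step folds the upper half of the active range onto the lower half.
        out << "if( tid < " << stride << "u ) sharedData[tid] += sharedData[tid + "
            << stride << "u];\n";
        // Placeholder for whatever group-wide synchronization the target language uses.
        out << "groupBarrier();\n";
    }
    return out.str();
}
```

Calling `unrollReduction( 8 )` produces three fold steps (strides 4, 2 and 1), each followed by the placeholder synchronization call. The register pressure caveat applies here too: a width of 1024 yields ten steps, which is fine, but unrolling a per-texel loop the same way would not be.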

You can also adapt your code based on the number of threads per group or automatically modify the shader depending on whether the bound texture is MSAA or not (which would normally require defining multiple shader programs and manually selecting the correct one).
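Here is a small sketch of what that saves you from writing by hand. Again, this is not the Hlms: `ComputeVariant` and `buildComputeSource` are hypothetical names, and the GLSL it emits is just an illustrative copy shader. One template-like function produces the right source for any thread group size and for both MSAA and non-MSAA inputs, instead of keeping several hand-written shader programs around.

```cpp
#include <sstream>
#include <string>

// Hypothetical set of properties that would be known when the job is set up.
struct ComputeVariant
{
    unsigned threadsPerGroupX;
    unsigned threadsPerGroupY;
    bool     msaa;  // whether the bound source texture is multisampled
};

// Builds GLSL compute shader source adapted to the given properties.
std::string buildComputeSource( const ComputeVariant &variant )
{
    std::ostringstream src;
    src << "#version 430\n";
    src << "layout( local_size_x = " << variant.threadsPerGroupX
        << ", local_size_y = " << variant.threadsPerGroupY << " ) in;\n";
    // Only the sampler type differs between the two variants.
    src << "uniform " << ( variant.msaa ? "sampler2DMS" : "sampler2D" ) << " srcTex;\n";
    src << "layout( rgba8 ) uniform writeonly image2D dstTex;\n";
    src << "void main()\n";
    src << "{\n";
    src << "    ivec2 xy = ivec2( gl_GlobalInvocationID.xy );\n";
    // Third argument: mip level for sampler2D, sample index for sampler2DMS.
    src << "    vec4 colour = texelFetch( srcTex, xy, 0 );\n";
    src << "    imageStore( dstTex, xy, colour );\n";
    src << "}\n";
    return src.str();
}
```

For instance, `buildComputeSource( { 16u, 8u, true } )` declares a 16x8 thread group and a `sampler2DMS` input, while flipping `msaa` to `false` yields the plain `sampler2D` version of the same shader.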


What’s next?

Once Compute Shaders are done, I can finish what little remains of this terrain system and release it. Afterwards I may end up resuming work on DERGO (an in-Blender live Ogre material editor) or continuing support for GLES3.

I really want to work more on DERGO, but GLES3 working again would mean three platforms being compatible again (OS X, Android, iOS) and once we have that in the bag, we might start talking about an official 2.1 SDK release date. Can’t say that isn’t tempting…


What else was accomplished in the past 2 months:

  1. User al2950 contributed PlaneBoundedVolumeSceneQuery to Ogre 2.x. This feature has been requested by many. Thanks!
  2. spookyboo deserves a special mention for reporting a lot of JSON bugs for our PBS materials. He has been heavily stress-testing our system as he’s been working on an Hlms material editor.
  3. A major bug involving reading of normal maps was fixed. Thanks to user GlowingPotato for noticing!
  4. Several fixes affecting the accuracy of our PBS implementation.
  5. Other minor bug fixes.

Thanks to all community users who have been reporting issues and helping make Ogre 2.1 more robust every day! Our PBS implementation has been under a lot of scrutiny lately and I like it. It hurts my ego of course, but it results in improved quality. This in turn means our users can focus on the important parts and not on the technical details.
