Wow! It’s been a long time since the last report.
A lot has happened since last time, so I thought and was pressed to inform of what’s new. I (dark_sylinc) have been bugged by two people on Twitter to write this update, so you can thank them for this report!
But first of all, Merry Christmass and Happy New Year!!!
If you don’t celebrate any or all of the two, then have a great day anyway!
Merry XMas! if you don’t celebrate it, good wishes anyway!
It’s been 9 months since our last progress report. We think it’s time for a new one! Oh boy. So much has been done and still in the works.
In case you didn’t notice: Ogre 2.1 now runs on Apple’s API Metal. And it’s stable! It only works on iOS for the moment, since a few tweaks are required to make it work on macOS as well.
You’ll need to use the 2.1-pso branch in order to use Metal. The 2.1-pso branch is scheduled to be merged with 2.1, once testing of the branch 2-1-pso-cache-legacy finishes (which implements a PSO cache utility meant to help users port their immediate style rendering code such as GUIs to support PSOs without major/significant changes).
MSAA resolving for D3D11
MSAA for Render Textures has been broken on D3D11 since like…forever. Not anymore. D3D11 MSAA targets will now be resolved appropriately, according to our implicit resolve rules (explicit resolve support still pending, but in that regard OpenGL is in the same state).
Parallax Corrected Cubemaps
PCC for short, aka Local Cubemaps, Local reflections, Cube projection.
PCC reflections are very important to achieve accurate local reflections.
View post on imgur.com
Our PCC implementation has two modes of operations: Automatic & Manual. Both have their strength and weaknesses.
Automatic “just works”. Probes get automatically blended together (based on camera position) and applied. However automatic may have trouble showing reflections from distant probes, and in some cases the blending may be too evident.
Manual solves the problem of distant reflections not showing up and the blending issue, but it requires you to explicitly set the probe to the material. Also if you don’t perfectly subdivide the geometry to fit the probe’s bounds, you may see gaps (since there is no blending happening at all).
You can actually mix automatic and manual behaviors.
Once the texture refactor is ready (keep reading) we may provide more powerful and superior automatic methods (by using Clustered Forward to select which probe to use and cube arrays to do the actual selection, which are only supported on DX11 class hardware or better).
For more information and experimentation you can look at our two samples LocalCubemaps and LocalCubemapsManualProbe.
Important note: The samples don’t make it yet too obvious that the PCC system reserves one visibility mask + Render Queue for their internal computations (i.e. it stores its Items into a RenderQueue of your choosing, set with a visibility bit also of your choosing). If you accidentally try to render those items, it will look funny. Keep in mind they may affect other things too, such as ray picking and Instant Radiosity generation (remember to filter those objects out).
Created Ogre 2.1 FAQ in the Wiki
We’ve addressed it in a news post already. We’ve written a wiki resource to address frequently asked questions regarding Ogre 2.1.
Texture Matrix Animation in Unlit
This has been requested a lot. Now you got it!
While we don’t yet provide easy ways to animate textures using material commands like old 1.x materials did, at least it’s now possible to animate them by hand if you need to.
We’re working on a technique called Instant Radiosity. The idea is very simple:
- Trace a lot of rays from the light.
- Generate a point light where the ray hits the surface. We’ll call this point light a VPL (Virtual Point Light)
- Cluster very close VPLs into one by summing their light contribution and averaging their locations.
- Use Forward3D or ForwardClustered (or Deferred Rendering) to use all these VPLs in scene.
The technique is an approximation but the results are very convincing, and lots of knobs to adjust for tweaking the results efficiently.
Instant Radiosity is not a very slow technique but neither a real time technique. When it comes to generating the VPLs, we still need to raytrace. Even if the raytrace takes e.g. 500 ms, it’s not suitable for real time. It was chosen because it was easy to implement and offers a lot of flexibility when compared to other techniques (such as Light Propagation Volumes) due to all the settings that can be adjusted, while also illuminating dynamic objects. In other words, the cost benefit ratio was really good.
This image is only lit by a spotlight + Global Illumination:
View post on imgur.com
And now same scene with VPL debug markers turned on:
View post on imgur.com
Another angle with parameters exaggerated:
View post on imgur.com
Instant Radiosity made it obvious that Forward3D was under-performing. While it was original research done by me (dark_sylinc), it was clear it wasn’t as well as I had estimated and hoped.
So I just went ahead and implemented Clustered Forward. It’s both threaded (slices are assigned to different threads) and SIMD optimized. Also the Frustum vs. Spotlight and Frustum vs. Pointlight intersection tests are much (!) tighter than the ones we use for Forward3D.
In debug builds, having many slices may take a noticeable hit on CPU when compared to debug Forward3D. Though you could just use less slices during debug, or switch to Forward3D.
Clustered Forward allows controlling many slices, which improves GPU speed and the tight frustum tests mean shaders don’t waste precious cycles trying to shade with lights that aren’t actually visible. This leads to an average performance improvement of 33%, though your mileage may vary (it can be 2x faster or 0x if your lights had gigantic ranges).
- Scene passes now have “enable_forwardplus” to explicitly turn of Forward3D and ForwardClustered in passes you don’t need them. This will improve CPU consumption by avoiding wasting cycles in building the light lists on something you won’t be using.
- Compositor workspaces now support more than one input (i.e. not just the “final target”, which was usually but not always the RenderWindow). “connect_output” still exists, but it just does the same as “connect_external 0”. Useful when you want a workspace to produce a lot of results for you, not just one.
- 2D Array textures, cubemaps in compositors: They can now be created via compositor scripts. See the manual for more information.
- UAV Buffers: Instead of creating UAV textures, you can also create buffers. We allow creating buffers of fixed byte sizes, and width x height sizes. You can also create them from C++ and treat them as external buffers for the workspace (they work just like external textures). Useful mostly for compute and some advanced rendering algorithms. See the manual for more information.
- Double-sided stencil: Some parameters have been moved to the “both” block. See Sample_StencilTest and the manual for more information.
Quadratic behavior when loading meshes
User 0xC0DEFACE noticed loading 60.000 meshes was taking more than 15 minutes due to O(N^2) behavior in our code. He fixed it with a very trivial change and brought it down to 4 seconds. Kudos!
Merged PSO branch
PSO had been very stable for a long while now. Plus there had been a bunch of very important bugfixes (like stencil support, some edge case viewport glitches) that were only in PSO branch.
Furthermore I found myself very often merging and cherry-picking between the different 2.1 branches, which is a sign that they needed to be merged. So I did.
The last non-pso commit was 06631aef218d73fdc2ca323da626a53650d941be
New user Hotshot5000 has stepped up to port the GLES3 RenderSystem to Ogre 2.1.
He’s claiming he is starting to see the light. We wish him good luck as we wait impatiently for more updates from him!
In Design: Texture Refactor
The texture refactor was announced. No coding work has been done yet, but we’re solidifying the foundations how texture loading will work in the future.
Whoa! Last time I (Matias) posted, we had a different website 🙂
What I’ve been working on:
I’ve started working on “Compute Shaders” (CS). They have been blocking future progress for far too long. They are required for modern techniques such as Forward+ and come in handy for things like tiled deferred rendering. Originally I started on compute shaders because I wanted them for a new terrain system that could generate the shadow maps in real time (yeah, there’s a new terrain system coming). CS are the only way to do it efficiently.
This has been a lot of work, and is currently under the unstable branch “2.1-pso-compute”. Actually, Compute Shaders are already working. However what we need to support is UAV buffers, which is sort of speak like read & write malloc’ed arrays of GPUs. We’ve already added support for UAV textures, but UAV buffers were missing and they are more flexible.
UAV buffers need to be treated with care though, because you’ll likely want to to create them from C++ (since they’re almost like a GPU malloc), but the compositor needs to be aware of them.
Why, you ask? Because the compositor is in charge of placing memory barriers and resource transitions. In other words: It needs to prevent race conditions. You’ll see … if you run Compute Shader A, and then Compute Shader B, and then render geometry, you have no guarantee that A will be completed before B. In fact, rendering geometry may even finish before A does!
This is great for GPU parallelism (colloquially known as Async Compute), but sucks if there were data dependencies (e. g. B depended on A, or Rendering Geometry depended on A or B). Memory barriers/resource transitions ensure shaders that must be run in order are executed in order while, hopefully, shader executions that are independent can run in parallel without being stalled.
Note: D3D11 implicitly inserts implicit memory barriers between compute shader executions, OpenGL only offers coarse memory barriers, but only Vulkan & D3D12 offer fine memory barriers.
The Compositor is in the best spot for this kind of work because it analyzes dependencies once (during workspace initialization) and can see all input, outputs and data dependencies.
That means that while UAV buffers can be created and managed from C++, some parts must be relinquished or informed to the compositor to ensure proper behavior (whether via scripts or via code). This means I need to be extremely careful with the design to avoid a clueless programmer innocently setting an UAV as input/output from a compute shader directly without the compositor noticing.
To make things worse, D3D11 & OpenGL differ quite greatly in how UAV buffers should be handled.
All in all, progress is steady. Compute Shaders are coming.
Other stuff that has increased in importance has been multiple RenderTarget inputs to compositor workspaces. Right now we only allow defining one “final target” which is treated as the final output (i. e. the RenderWIndow), although it doesn’t necessarily have to. The intention is to support more than just one external RenderTargets being available to a Compositor Workspace, which helps a lot in chaining multiple workspaces together (and are also very relevant for calculating memory barriers correctly).
Compute Shaders defined via JSON and have access to the Hlms
This is something I’ve been wanting to do for a while. Instead of using low level material’s syntax or interface (which was half ill-suited, half well-suited for the job), a new special Hlms is in charge of Compute shaders. We already have a working example (see all possible settings), although beware it’s subject to change. Auto params work, but not all of them since some don’t make sense because they were meant for rendering (and for the moment, attempting to use the unsuitable ones will likely result in a crash).
Why am I excited about the Hlms access? Because of the preprocessor of course!
For example, you can use the Hlms to unroll a loop based on the width of texture, thus reducing loop overhead during execution. Unrolling a loop can be critical in fully utilizing all bandwidth when performing certain tasks (such as parallel reduction), though it can hurt performance in other cases (particularly if the instruction length is too high, or it results in high register pressure).
You can also adapt your code based on the number of threads per group or automatically modify the shader depending on whether the bound texture is MSAA or not (which would normally require defining multiple shader programs and manually selecting the correct one).
Once Compute Shaders are over, I can finish what little remains of this terrain system and release it. Afterwards I may end up resuming on DERGO (a in-Blender live Ogre material editor) or continuing support for GLES3.
I really want to work more on DERGO, but GLES3 working again would mean three platforms being compatible again (OS X, Android, iOS) and once we have that in the bag, we might start talking about an official 2.1 SDK release date. Can’t say that isn’t tempting…
What else was accomplished in the past 2 months:
- User al2950 contributed PlaneBoundedVolumeSceneQuery to Ogre 2.x. This feature has been requested by many. Thanks!
- spookyboo deserves a special mention for reporting a lot of JSON bugs for our PBS materials. He has been severely stress testing our system as he’s been working on a Hlms material editor.
- A major bug involving reading of normal maps was fixed. Thanks to user GlowingPotato for noticing!
- Several fixes affecting the accuracy of our PBS implementation.
- Other minor bug fixes.
Thanks to all community users who have been reporting your issues and helping make Ogre 2.1 more robust every day! Our PBS implementation has been under a lot of scrutiny lately and I like it. It hurts my ego of course, but it results in improved quality. This of course means our users can focus on the important parts and not on the technical details.
A little late report. We know we missed April & May in the middle. But don’t worry. We’ve been busy!
So…what’s new in the Ogre 2.1 development branch?
1. Added depth texture support! This feature has been requested many times for a very long time. It was about time we added it!
Now you can write directly to depth buffers (aka depth-only passes) and read from them. This is very useful for Shadow Mapping. It also allows us to do PCF filtering in hardware with OpenGL.
But you can also read the depth buffers from regular passes, which is useful for reconstructing position in Deferred Shading systems, and post-processing effects that need depth, like SSAO and Depth of Field, without having to use MRT or another emulation to get the depth.
We make the distinction between “depth buffer” and “depth textures”. In Ogre, “depth textures” are buffers that have been requested to be read from as a texture at some point in time. If you want to ever use it as a texture, you’ll want to request a depth texture (controlled via RenderTarget::setPreferDepthTexture).
A “depth buffer” is a depth buffer that you will never be reading from as a texture and that can’t be used as such. This is because certain hardware gets certain optimizations or gets more precise depth formats available that would otherwise be unavailable if you ask for a depth textures.
For most modern hardware though, there’s probably no noticeable performance difference in this flag.