Wow! It’s been a long time since the last report.

A lot has happened since last time, so I felt pressed to write up what’s new. I (dark_sylinc) have been bugged by two people on Twitter to write this update, so you can thank them for this report!

But first of all, Merry Christmas and Happy New Year!!!

If you don’t celebrate either of the two, then have a great day anyway!

ESM (Exponential Shadow Maps) and Shadow Map enhancements

Shadow Mapping got a huge rewrite, which fixed many problems. For instance, it fixed a long-standing issue where Ogre would crash if all directional lights were disabled while PSSM was being used. This may look like a corner case, but it was annoying for editors, as it is common for a user to accidentally end up with no lights while PSSM is enabled.

As part of that rewrite, we implemented ESM (Exponential Shadow Maps), which gives shadows a soft appearance. Let the screenshot speak for itself:

Now, ESM isn’t perfect and doesn’t fit all scenarios (it works best for indoor scenes), but it is nonetheless a great soft-shadowing technique.

We now also support shadow map atlases, which means you can pack all shadow maps into one single giant texture instead of using multiple textures, or use multiple atlases if you want. Previously we were limited to one texture per shadow map.

What’s the advantage? Well, due to several API restrictions and legacy issues, Ogre is limited to up to 16 textures per shader (multiple slices in the same texture array count as one texture). Ogre 2.2 will be increasing that limit to 32 (more if you restrict yourself to certain platforms/APIs), but until 2.2 arrives, that limit has been annoying for some of our users. A few have even worked around it by patching Ogre 2.1 and sticking to D3D11 only. So if you used 3 PSSM splits + 2 additional shadow maps, that would cost you 5 slots. Ouch! Atlases allow you to merge them and consume just one slot.

Furthermore, some GPUs perform faster because fewer texture slots mean lower VGPR and SGPR pressure, which means better occupancy and thus better latency hiding. In other words, higher performance.

Point light shadows

Another feature that became part of this Shadow Mapping rewrite was point light shadows. Previously, they just didn’t work; all we had was directional and spot light shadow maps. Let the screenshots do the talking:

Static shadows

A new feature also introduced by the Shadow Mapping rewrite is static shadows. This doesn’t work well for directional lights (though we’re thinking about how to address that), but it works very well for point and spot shadow maps. If the light doesn’t rotate or move, and the objects around it are static, you can just “bake” the shadows instead of updating them every frame, allowing massive performance improvements and increasing the number of shadow-casting lights in the scene.

You may now be thinking: what’s the point of static shadows if I could use an offline raytracer (which is definitely going to produce very pretty shadows) and bake the results into a lightmap?

Well, the thing is, you can update these shadow maps on demand. If one of these lights needs to move, or too much dynamic geometry gets close to a particular light, you can start updating that light every frame until things quiet down again. Alternatively, you could dynamically switch which lights cast shadows as the player traverses a scene, with constant memory consumption (unlike a lightmap, which requires more memory the bigger the scene is).

If static shadows as a performance improvement are not your thing, you may still find this useful, because you can use this system to update the lights every frame while manually controlling which lights cast shadows. Previously, Ogre would decide for you which lights cast shadows based on distance to the camera.

While that works well in many scenarios, there are cases where the lights selected for shadow mapping would “jump”, causing the shadows to flicker or change quality abruptly. You can now control these shadows via setLightFixedToShadowMap and setStaticShadowMapDirty.
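
Here is a minimal sketch of how that could look, assuming SceneManager-level calls with roughly these signatures (check the API reference for the exact ones; the shadow map index is made up for illustration):

    Ogre::Light *lamp = sceneManager->createLight();
    lamp->setType( Ogre::Light::LT_POINT );

    // Pin this light to shadow map #3 so it is baked once instead of
    // being re-rendered every frame.
    sceneManager->setLightFixedToShadowMap( 3, lamp );

    // Later, if the light moved or dynamic geometry wandered too close,
    // flag the map so it gets re-rendered.
    sceneManager->setStaticShadowMapDirty( 3 );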

Screen Space Reflections and Hybrid Rendering

We’ve implemented SSR. There’s a new sample showcasing it.

In order to implement SSR, we need the depth. But to have the depth, we need to render first. It’s a chicken-and-egg problem.

Thus we are forced to perform a depth prepass. Depending on your scene complexity (particularly the number of vertices and objects), a depth prepass means faster performance for some of our users, but slower performance for others. To keep the latter happy, we implemented “hybrid rendering”, so called because it combines both Deferred and Forward shading concepts.

The core concept is that the depth prepass is no longer just a depth prepass: it also exports a GBuffer containing information such as normals, shadow mapping and roughness data. The 2nd pass then loads this data from the GBuffer instead of recalculating it.

We’re basically offloading some work from the 2nd pass into the 1st pass, resulting in thinner shaders with lower register pressure, thus better occupancy, which ultimately results in higher performance. I talk about this problem in depth in my own blogpost.
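
To make the idea concrete, here is an illustrative (and hypothetical — the actual channel packing in Ogre may differ) view of what the prepass exports and what the shading pass reads back:

    #include <cstdint>

    // Illustrative GBuffer layout only; Ogre's actual packing may differ.
    struct GBufferSample
    {
        uint8_t normalXY[2]; // view-space normal, 2-component encoding
        uint8_t roughness;   // material roughness
        uint8_t shadowTerm;  // precomputed shadow mapping factor
    };

    // Pass 1 (prepass): rasterize opaque geometry, write depth + GBufferSample.
    // Pass 2 (shading): fetch the GBufferSample per pixel instead of
    // re-evaluating normals, roughness and shadows, keeping the shading
    // shader thin (fewer registers, better occupancy).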

Hybrid rendering is more than just a performance optimization, as now that we have GBuffer data we can implement other rendering algorithms that may need it.

Compute based mipmapping (HQ)

Alongside SSR, we needed a way to sample from lower-resolution mips in order to display glossy reflections on surfaces with high roughness. Our automatic mipmap generation method (which basically relies on the driver) was not enough, as we needed the mips to be much blurrier.

By reusing and adapting the gaussian filter from GPUOpen that we initially used for the Terrain sample, we were able to generate the mipmaps and filter them with a large kernel radius at the same time using compute shaders.

Mipmap generation using compute allows for large kernel radii (thus very smooth mips) while also paving the way for future DX12 / Vulkan work, since those APIs don’t provide driver-based automatic mipmap generation.

Note that at the moment the Compute-based mipmap generation only works on 2D textures, and doesn’t support all formats.
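
For reference, this is the idea in plain C++ (the real implementation runs as a compute shader and uses a separable filter, i.e. a horizontal pass then a vertical one; the direct 2D version below is just easier to read):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Generate the next mip level by downsampling with a wide gaussian
    // kernel instead of the 2x2 box filter drivers typically use.
    std::vector<float> downsampleGaussian( const std::vector<float> &src,
                                           int srcW, int srcH,
                                           int radius, float sigma )
    {
        const int dstW = srcW / 2, dstH = srcH / 2;
        std::vector<float> dst( dstW * dstH, 0.0f );

        for( int y = 0; y < dstH; ++y )
        {
            for( int x = 0; x < dstW; ++x )
            {
                float sum = 0.0f, weightSum = 0.0f;
                for( int ky = -radius; ky <= radius; ++ky )
                {
                    for( int kx = -radius; kx <= radius; ++kx )
                    {
                        // Clamp-to-edge addressing into the larger mip.
                        const int sx = std::min( std::max( x * 2 + kx, 0 ), srcW - 1 );
                        const int sy = std::min( std::max( y * 2 + ky, 0 ), srcH - 1 );
                        const float w = std::exp( -( kx * kx + ky * ky ) /
                                                  ( 2.0f * sigma * sigma ) );
                        sum += w * src[sy * srcW + sx];
                        weightSum += w;
                    }
                }
                dst[y * dstW + x] = sum / weightSum;
            }
        }
        return dst;
    }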

macOS support (Metal AND GL3+)

Thanks to user johughes and a new team member, Deron Johnson aka berserkerviking, who joined us very briefly and sadly had to leave, our Metal RenderSystem now works on macOS! Support should probably be considered beta compared to Windows, Linux and iOS, but it still manages to work quite well.

But something incredible happened. User DimA took the improvements made by Hotshot5000 in his GLES3 branch and used them to get GL3+ working on OS X / macOS! Honestly, I didn’t think it would be possible!

Because macOS only supports up to GL 4.1, some features won’t be available (e.g. anything that relies on Compute Shaders), and most of the time performance cannot hope to match Metal. Nonetheless, this breakthrough greatly increases our compatibility, allowing our users to run Ogre3D 2.1 on their older Macs. Watch out for shader compile times though: the macOS GL drivers take much longer to compile shaders than other platforms and APIs, and sadly the shader cache doesn’t work there.

Fine light mask granularity in Forward+

The problem to solve was that users were asking for a way to have lights only affect a particular set of objects (basically, they were asking us to honour MovableObject::setLightMask).

In the case of lights using Forward+, originally it was only possible to apply “broad granularity” masks by playing with viewport light visibility masks (via CompositorPassSceneDef::mLightVisibilityMask) and putting objects affected by different lights into different render queues (or object visibility masks), in combination with multiple render scene passes. Most of the time this was fast (not always, because it affects the render order, and sometimes objects that should be rendered last end up being rendered first, negating Early Z) and would get the job done, but it was convoluted. Furthermore, it wouldn’t work with transparent objects, because you can’t easily alter their render order. It only worked for very broad masks, e.g. having a large group of objects unaffected by certain lights.

I was against “fine granularity” light mask culling due to performance concerns, but due to popular demand it is now possible to honour the light masks in an easy way. In the end, the performance price wasn’t that high; it seems I worried too much.

Nonetheless, the option is disabled by default so users don’t pay the price for a feature they may not need. You can enable it via the CMake setting OGRE_CONFIG_ENABLE_FINE_LIGHT_MASK_GRANULARITY.
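
Once enabled, usage is as simple as assigning matching mask bits (a minimal sketch — it assumes Light picks up setLightMask from MovableObject and that the objects come from your own scene setup):

    const Ogre::uint32 kInteriorLights = 1u << 0u;
    const Ogre::uint32 kExteriorLights = 1u << 1u;

    light->setLightMask( kInteriorLights );        // this lamp...
    interiorItem->setLightMask( kInteriorLights ); // ...affects this object,
    exteriorItem->setLightMask( kExteriorLights ); // ...but not this one.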

Integrated SMAA

SMAA (Enhanced Subpixel Morphological Antialiasing) from Jorge Jimenez aka iryoku and others was integrated into Ogre. SMAA is one of the best AA filters currently out there, great for when MSAA is not an option, or when you need to fight aliasing that does not come from geometry (MSAA only solves geometric aliasing). Note that SMAA supports several modes, and since they’re confusing and aren’t properly explained on the web, we’ll explain them for you:

  • SMAA 1x: This is the regular SMAA.
  • SMAA S2x: This version uses MSAA 2x and reads from both subsamples as if they were individual, different images; it then applies SMAA on each of them separately and merges the result. This results in higher quality AA than SMAA 1x.
  • SMAA T2x: This version is similar to S2x, but instead of using MSAA 2x to obtain the two individual texture sources, it uses the textures from the previous and current frames, while offsetting them each frame (jitter; see the sketch after this list) to reproduce the same subsample offsets MSAA 2x provides in S2x. If we didn’t apply SMAA on top of this, a static scene would shake, because it’s being constantly offset between odd and even frames. But because the result is averaged, you don’t see any shaking. If you’re familiar with it, the concept is very similar to interlaced video. The advantage over S2x is that it doesn’t require MSAA. The disadvantage is that because objects are often in motion, we need to track where objects were in the previous frame, aka a “velocity buffer”, which is not always easy. And when the velocity buffer gets this wrong, you’ll see ghosting or halo effects. For entirely static scenes (the camera must be static too), S2x and T2x produce identical results. But because of these artifacts when things are in motion, it’s safe to say S2x is superior in quality to T2x.
  • SMAA 4x: This applies T2x on top of S2x. As simple as that. It needs both MSAA 2x and a velocity buffer. It’s the highest quality option (unless the velocity buffer does a really poor job, in which case you’d see lots of ghosting artifacts everywhere).
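
To illustrate the jitter mentioned in T2x, here is a sketch of the alternating subpixel offset (a hypothetical helper, not Ogre’s actual SMAA integration):

    #include <cstdint>

    struct Jitter { float x, y; };

    // Alternate between +/- a quarter of a pixel each frame, mimicking
    // where MSAA 2x would place its two subsamples in S2x. NDC spans
    // 2 units, hence the 2.0f / dimension conversion.
    Jitter getT2xJitter( uint32_t frameIdx, uint32_t width, uint32_t height )
    {
        const float sign = ( frameIdx & 1u ) ? 1.0f : -1.0f;
        return { sign * 0.25f * 2.0f / float( width ),
                 sign * 0.25f * 2.0f / float( height ) };
    }

    // The returned offsets get added to the projection matrix's clip-space
    // translation terms, so the whole scene shifts by the subpixel amount.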

Note that at the moment Ogre only supports SMAA 1x, as it is the easiest to support (and also the most popular variant). Theoretically, supporting S2x should be much easier in Ogre 2.2, assuming your application can make use of MSAA. There are no plans to support the temporal variations until we have a velocity buffer in place. We may implement a velocity buffer to enhance SSR in the future; if that gets done, then T2x should be easy to support. But until then, there’s no ETA and it’s not a priority either.

HDR + MSAA + reversible tonemapping operator

You may have heard that combining HDR and MSAA can still result in nasty aliasing, sometimes even making it worse. This can happen if you’re looking at a very dark object against a very lit background (or vice versa), which is a common scenario if you’re looking at a bright sky through a window while indoors.

The problem happens because MSAA is supposed to average the RGB colour values meant for the monitor in order to make borders smooth. This means the MSAA resolve must happen after tonemapping, which is sloooow. But Ogre (like many other engines) performs tonemapping after resolving, which means that instead of averaging RGB monitor values, we average light values in lumens. It’s like putting the sugar and milk into the blender first, turning it on, and only later adding the fruit and calling it a milkshake. You need to turn on the blender after you’ve added the fruit.

That’s where reversible tonemapping comes in!

Reversible tonemapping is a fast method to solve this problem. It consists of the following:

  1. Load the individual subsamples
  2. Apply a simple tonemap operator
  3. Average the values together (i.e. resolve)
  4. Perform the reverse operation of step 2.
  5. Now perform the real tonemapping (which may be much more expensive, and probably not easily reversible)

While this method may not be completely “correct”, it gets the job done very well.
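
Here is a minimal sketch of steps 1–4, using one common choice of reversible operator, T(c) = c / (1 + max(c)) with inverse c / (1 − max(c)). The operator choice here is ours for illustration; any cheap invertible curve works:

    #include <algorithm>

    struct Rgb { float r, g, b; };

    static float maxComp( const Rgb &c )
    { return std::max( c.r, std::max( c.g, c.b ) ); }

    static Rgb tonemap( const Rgb &c )    // step 2
    {
        const float w = 1.0f / ( 1.0f + maxComp( c ) );
        return { c.r * w, c.g * w, c.b * w };
    }
    static Rgb invTonemap( const Rgb &c ) // step 4 (inputs are < 1)
    {
        const float w = 1.0f / ( 1.0f - maxComp( c ) );
        return { c.r * w, c.g * w, c.b * w };
    }

    // Steps 1 & 3: load the subsamples, tonemap each, average, reverse.
    Rgb resolveMsaa( const Rgb *subsamples, int numSamples )
    {
        Rgb sum = { 0.0f, 0.0f, 0.0f };
        for( int i = 0; i < numSamples; ++i )
        {
            const Rgb t = tonemap( subsamples[i] );
            sum.r += t.r; sum.g += t.g; sum.b += t.b;
        }
        const float inv = 1.0f / float( numSamples );
        sum = { sum.r * inv, sum.g * inv, sum.b * inv };
        return invTonemap( sum ); // back to HDR, ready for step 5
    }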

Planar Reflections

You must first enable it via the CMake option OGRE_BUILD_COMPONENT_PLANAR_REFLECTIONS.

Planar reflections are nothing new. The old Fresnel sample from Ogre 1.x in fact used planar reflections for the swimming pool.

However, it wasn’t working out of the box in Ogre 2.x. Now there is a much more advanced system: you can define many ‘actors’ with different plane orientations, and Ogre will automatically select the ones closest to the camera (up to N actors, depending on your performance settings) and use them to render reflections, without you having to write a single line of shader code.

Furthermore, we support reprojection. Traditional planar reflections only work well as long as the surface matches the reflection plane exactly. Our reprojection compensates for small deviations where the surface doesn’t align perfectly with the reflection plane.

It’s not magic: the more the surface deviates from the ideal reflection plane, the less we can hide the distortion.

Planar reflections are disabled by default to avoid having users pay a performance price for a feature they don’t use.

Planar reflections are integrated into both PBS and Unlit shaders, and you can use roughness maps for glossy planar reflections.
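
In code, usage looks roughly like this (a hypothetical sketch; check the PlanarReflections sample for the real constructor and addActor signatures):

    // Names and signatures below are illustrative.
    Ogre::PlanarReflections *planarReflections =
            new Ogre::PlanarReflections( sceneManager, compositorManager,
                                         /*maxDistance*/ 50.0f, 0 );

    // An 'actor' is a rectangle in space that can act as a mirror; Ogre
    // picks the N actors closest to the camera each frame.
    planarReflections->addActor( Ogre::PlanarReflectionActor(
            Ogre::Vector3( 0.0f, 1.0f, 0.0f ),  // mirror centre
            Ogre::Vector2( 4.0f, 4.0f ),        // half size
            Ogre::Quaternion::IDENTITY ) );     // plane orientation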

Fixed BillboardChain, which wasn’t working

Thanks to user zxz for looking into this! See the forum post for details.

Added Blinn-Phong BRDF

We’ve added new BRDF options, for two reasons:

  1. ‘Default’ Blinn-Phong can be used instead of the Default GGX or Cook-Torrance BRDF, as it is much faster. This can help a lot on slower GPUs, such as the ones found in mobile devices (iPhone and Android), or if you want to support slow desktop GPUs such as Intel HD graphics or older/low-end GPUs from AMD and NVIDIA.
  2. ‘Legacy’ Blinn-Phong isn’t physically based and not very realistic either, but it looks much closer to what fixed function used to look like (i.e. Ogre 1.x), which can make porting to 2.1 easier. It’s also the fastest BRDF. See the sketch below.
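
Selecting a BRDF is done per datablock. A sketch, assuming the enum values mirror the names above (see OgreHlmsPbsDatablock.h for the real list):

    Ogre::HlmsPbsDatablock *datablock =
            static_cast<Ogre::HlmsPbsDatablock *>(
                    hlmsPbs->getDatablock( "MyMaterial" ) );

    // Switch from the default GGX to the faster 'Default' Blinn-Phong.
    datablock->setBrdf( Ogre::PbsBrdf::BlinnPhong );
    // Or the 'Legacy' variant for an Ogre 1.x-like look (name assumed):
    //datablock->setBrdf( Ogre::PbsBrdf::BlinnPhongLegacyMath );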

Added ASTC compression support for Metal

We’ve added a patch from David to support ASTC in Metal for iOS. Now you can use much less memory for your textures.

Ogre 2.2 progress

If you’ve been following our progress, you probably noticed our repository has a new branch labelled “v2.2-WIP”. The main focus of 2.2 is a complete overhaul of the Texture system, with an emphasis on lowering GPU RAM consumption, background texture streaming, asynchronous GPU-to-CPU texture transfers, and reducing out-of-GPU-memory scenarios (which are relatively easy to run into if your project is medium-to-large sized).

Background streaming, better RAM consumption

If a picture says a thousand words, then a video says a million:

Background streaming is now a reality! There’s still work left to do, though. In particular, the problem right now is that when a texture finishes loading, Ogre will very likely need a new shader, thus triggering a shader compile. If the shader isn’t in the cache, this causes a small stall, which is exactly what we were trying to prevent with background streaming.

The solution will be to keep a texture cache that stores metadata (basically resolution and pixel format) so we know this information before the texture is even loaded, and thus won’t need a new shader at all. The solution is easy and simple; I just need to write it. The exact nature of the problem is explained in a forum post.

Load Store semantics

Ogre 2.2 is much more mobile friendly. Metal introduced the concepts of “load” and “store” actions, and we follow that paradigm because it’s easy to implement and understand.

First we need to explain how mobile TBDR GPUs (Tile-Based Deferred Renderers) work. A regular immediate-mode GPU (any desktop GPU) just processes and draws triangles to the screen in the order they’re submitted, writing results to RAM and reading from RAM as needed: run the vertex shader, then the pixel shader, go on to the next triangle. The process is slightly more complex because there’s a lot of parallelization going on (i.e. multiple triangles worked on at the same time), but in the overall scheme of things desktop GPUs process things in order.

TBDRs work differently: they process all the vertices first (i.e. run all of the vertex shaders), and only then go through each tile (usually 8×8 pixels), find which triangles touch that tile, sort them front to back (unless alpha testing or alpha blending is used), and run the pixel shaders on all the triangles and pixels. Then they proceed to the next tile. This has the following advantages:

  1. Most pixels filled by opaque triangles will be shaded by a pixel shader only once.
  2. Early-Z is implicit. This also means a depth-prepass is unnecessary and only a waste of time.
  3. The whole tile stays in an on-chip cache (which has much lower latency, much greater bandwidth, and much lower power consumption). Once the tile is completely done, the cache gets flushed into RAM. In contrast, a desktop GPU could be constantly reading back and forth from main RAM (they have caches, but data doesn’t necessarily always stay in the cache).

The main disadvantage is that this does not scale well to a large number of vertices (since the GPU must store all of the processed vertices somewhere). There’s a performance cliff past a certain vertex count: exceed that threshold and your framerate drops rapidly with every extra vertex you add.

This is not usually a problem on mobile because, well… nobody expects a phone to process 4 million vertices or more per frame. You can also make up for it with richer pixel shaders (because of the Early Z you get for free).

In TBDRs, each tile is a self-contained unit of work that never flushes the cache until all the work has been done (unless the GPU has run out of space because we’ve submitted too much work, but let’s not dwell on that).

If you want a more in-depth explanation, read A look at the PowerVR graphics architecture: Tile-based rendering and Understanding PowerVR Series5XT: PowerVR, TBDR and architecture efficiency.

Now that we’ve explained how TBDRs work, we can explain load and store actions.

In an immediate renderer, clearing the colour and depth buffers means we instruct the GPU to basically memset the whole texture to a specific value. And then we render to it.

In TBDRs, this is inefficient, as the memset stores a value to RAM that later needs to be read back from RAM. TBDRs can instead:

  1. Clear the cache instead, rather than loading from RAM (i.e. set the cache to a specific value each time the GPU begins a new tile).
  2. If you don’t need the results of a particular buffer once you’re done rendering, you can discard them instead of flushing them to RAM. This saves bandwidth and power. For example, you may not need to save the depth buffer. Or you may only need the resolved result of an MSAA render, and can discard the contents of the MSAA surface.

The Metal RenderSystem in Ogre 2.1 tried to merge clears alongside draws as much as possible, but it didn’t always get it right; and it glitched when doing complex MSAA rendering.

Now in Ogre 2.2 you can explicitly specify what you want to do (see the sketch after these lists). For load actions you can choose:

  1. DontCare: The cache is not initialized. This is the fastest option, and only works glitch-free if you can guarantee you will render to all the pixels in the screen with opaque geometry.
  2. Clear: The cache is cleared to a particular value. This is fast.
  3. Load: Load whatever was on RAM. This is the slowest, but also the default as it is the safest option.

For store actions you also get:

  1. DontCare: Discard the contents after we’re done with the current pass. Useful if you only want colour and don’t care what happens with the depth & stencil buffers. Discarding contents improves framerate or battery life, and makes rendering friendlier to SLI/Crossfire.
  2. Store: Save results to RAM. This is the default.
  3. MultisampleResolve: Resolve MSAA rendering into resolve texture. Contents of MSAA texture are discarded.
  4. StoreAndMultisampleResolve: Resolve MSAA rendering into resolve texture. Contents of MSAA texture are kept.
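
To make that tangible, here is a sketch of how these semantics map onto a pass setup. The struct below just mirrors the options described above; it is not Ogre’s actual type:

    enum class LoadAction  { DontCare, Clear, Load };
    enum class StoreAction { DontCare, Store, MultisampleResolve,
                             StoreAndMultisampleResolve };

    struct PassAttachment
    {
        LoadAction  load  = LoadAction::Load;   // safest, thus the default
        StoreAction store = StoreAction::Store; // safest, thus the default
    };

    // Typical setup for an MSAA colour target plus a throwaway depth buffer:
    PassAttachment colour, depth;
    colour.load  = LoadAction::Clear;               // clear in-cache, no RAM read
    colour.store = StoreAction::MultisampleResolve; // keep only the resolved texture
    depth.load   = LoadAction::Clear;
    depth.store  = StoreAction::DontCare;           // never flushed to RAM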

This gives you a lot of power and control over how mobile GPUs control their caches in order to achieve maximum performance. But be careful: If you set a load or store action to “DontCare” and then you do need the values, then you’ll end up reading garbage every frame (uninitialized values), which can result in glitches.

These semantics can also help on Desktop. Whenever we can, we emulate this behavior (to make porting to mobile easier), but we also pass this information to the API whenever DX11 and GL allow us. This can mostly help with SLI and Crossfire.

More control over MSAA

Explicit MSAA finally arrived in Ogre 2.2, and thanks to load and store actions, you have a lot of control over what happens with MSAA and when, which can result in higher quality rendering by making MSAA a first-class citizen.

We’re also planning on telling you where the subsamples are located, which can help with certain algorithms that require this information (such as SMAA S2x). We also want to support programmable subsample locations, but unfortunately DX11 does not expose them unless we use vendor-specific extensions (which we’re actually evaluating supporting…), and not all GPUs support them either.

Programmable sample locations are a huge deal. For example, they make low-res particle FX work much better: by using a half-resolution MSAA 4x surface with the subsamples set to the pixel centers, it is possible to shade colour at half resolution without pixelization or halo artifacts at the borders, because depth discontinuities are evaluated at full resolution. Console games are already starting to do this, because both the PS4 and Xbox One support it.

There have been numerous other MSAA changes. In Ogre 2.1, MRT + MSAA did not work correctly except on D3D11 (which is why the SSR sample only works with MSAA in D3D11). Now it works everywhere.

Another example: in Ogre 2.1, all render textures in the compositor defaulted to using MSAA if the main target was using MSAA. In Ogre 2.2, we default to not using MSAA unless told otherwise. We found out that most textures do not need MSAA because they’re postprocessing FXs or similar, so their MSAA was only a waste of memory and performance (useless resolves); only a few render textures actually need it. This saves a lot of GPU RAM and some performance.


Well, that’s mostly it. I’m sure I missed a few features; there are a lot!!! And of course, there have been lots of bugfixes along the road.

Cheers and I’m off to a new year!

Happy New Year on behalf of the Ogre Team!