Removing bloat and reducing size in Ogre Next 2.4 by up to 20%

As we mentioned last time, Ogre-Next 2.4 will be mostly focusing on maintenance and fixing code debt.

Over a week we’ve fixed a lot of warnings.

With default settings on Clang on Linux, there are no errors (except on OgreMeshTool).

But we haven’t taken a look at deprecated warnings yet.

On Apple/XCode and MSVC there are very few warnings now.

GCC needs -Wconversion off because its implementation is incredibly dumb: it has too many false positives.
Fixing them would hurt readability too much, causing more harm than good.

Some CMake settings may cause warnings, e.g. we generate lots of warnings if OGRE_CONFIG_DOUBLE is on.

We also applied clang-format to everything, and added the C++11 override keyword everywhere.

We’ve also removed dead code and added an option to turn custom memory allocators off, which is the default.

We’ve removed Boost, thus CMake won’t waste its time (and that was a lot) looking for an optional dependency we didn’t really need anymore.

Overall, Debug builds are seeing a 20% reduction in binary size (incl. symbols)!!
Release builds are seeing reductions between 1% and 3%.

The only drawback is that merging code 2.3 -> 2.4 is now harder as it’s almost guaranteed to cause a merge conflict, since nearly every line got touched.

Nonetheless, it seems that merging with “merge using always theirs”, then applying clang-format, then reviewing the diff of what’s going to be merged is a good approach to prevent bad merges and avoid accidentally introducing bugs.

The numbers are in MBs
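A note on reading the Diff column: judging from the totals, it appears to be computed relative to the After size rather than the Before size. This is an inference from the numbers, not something the post states. For the MSVC 2019 Debug total (399,51 to 330,81), that convention gives about -20,77 %, whereas measuring against the Before size would give about -17,2 %. A minimal sketch of both calculations, with hypothetical function names:

```cpp
#include <cassert>

// Size change as a percentage of the *After* size. This convention is an
// assumption inferred from the table totals; the post does not state it.
inline double diffPercentOfAfter( double beforeMb, double afterMb )
{
    assert( afterMb > 0.0 );
    return ( afterMb - beforeMb ) / afterMb * 100.0;
}

// The same change expressed relative to the *Before* size, for comparison.
inline double diffPercentOfBefore( double beforeMb, double afterMb )
{
    assert( beforeMb > 0.0 );
    return ( afterMb - beforeMb ) / beforeMb * 100.0;
}
```

Small differences in the last digit against the tables are expected, since the tables were presumably computed from unrounded sizes.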

Linux Clang 9 – Debug

Lib Name                        Before (2.3)  After (2.4)  Diff
libOgreHlmsPbs_d.so             8,69          7,53         -15,48 %
libOgreHlmsUnlit_d.so           1,92          1,59         -20,48 %
libOgreMain_d.so                65,00         55,01        -18,17 %
libOgreMeshLodGenerator_d.so    4,27          3,71         -15,11 %
libOgreOverlay_d.so             3,87          3,30         -17,23 %
libOgreSamplesCommon_d.a        10,80         9,10         -18,64 %
libOgreSceneFormat_d.so         2,55          2,03         -25,44 %
Plugin_ParticleFX_d.so          1,79          1,63         -9,85 %
RenderSystem_GL3Plus_d.so       6,97          5,23         -33,25 %
RenderSystem_NULL_d.so          1,70          1,35         -25,87 %
RenderSystem_Vulkan_d.so        15,82         14,63        -8,18 %

Total                           123,39        105,12       -17,39 %

Linux Clang 9 – Release

Lib Name                        Before (2.3)  After (2.4)  Diff
libOgreHlmsPbs.so               0,80          0,78         -3,03 %
libOgreHlmsUnlit.so             0,19          0,19         -3,58 %
libOgreMain.so                  7,31          6,93         -5,61 %
libOgreMeshLodGenerator.so      0,24          0,22         -5,21 %
libOgreOverlay.so               1,00          1,00         -0,14 %
libOgreSamplesCommon.a          0,43          0,41         -3,85 %
libOgreSceneFormat.so           0,23          0,23         0,95 %
Plugin_ParticleFX.so            0,25          0,24         -5,38 %
RenderSystem_GL3Plus.so         0,85          0,76         -11,86 %
RenderSystem_NULL.so            0,16          0,16         -0,94 %
RenderSystem_Vulkan.so          7,93          7,91         -0,22 %

Total                           19,40         18,83        -3,02 %

MSVC 2019 – Debug

Lib Name                        Before (2.3)  After (2.4)  Diff
OgreHlmsPbs_d.dll               1,89          1,82         -3,45 %
OgreHlmsPbs_d.pdb               15,53         12,82        -21,15 %
OgreHlmsUnlit_d.dll             0,47          0,45         -4,50 %
OgreHlmsUnlit_d.pdb             9,51          6,89         -37,96 %
OgreMain_d.dll                  24,28         22,81        -6,45 %
OgreMain_d.pdb                  101,56        81,94        -23,94 %
OgreMeshLodGenerator_d.dll      1,02          0,98         -4,90 %
OgreMeshLodGenerator_d.pdb      10,57         8,64         -22,34 %
OgreMeshTool_d.pdb              11,95         8,66         -37,98 %
OgreOverlay_d.dll               1,77          1,75         -1,45 %
OgreOverlay_d.pdb               13,28         9,50         -39,70 %
OgreSceneFormat_d.dll           0,58          0,57         -1,78 %
OgreSceneFormat_d.pdb           11,07         8,04         -37,82 %
Plugin_ParticleFX_d.dll         0,39          0,37         -5,59 %
Plugin_ParticleFX_d.pdb         6,18          5,12         -20,59 %
RenderSystem_Direct3D11_d.dll   2,02          1,90         -6,32 %
RenderSystem_Direct3D11_d.pdb   17,13         14,11        -21,37 %
RenderSystem_GL3Plus_d.dll      2,00          1,69         -18,41 %
RenderSystem_GL3Plus_d.pdb      16,17         12,25        -32,03 %
RenderSystem_NULL_d.dll         0,39          0,36         -8,09 %
RenderSystem_NULL_d.pdb         8,96          6,26         -43,04 %
RenderSystem_Vulkan_d.dll       17,69         17,58        -0,66 %
RenderSystem_Vulkan_d.pdb       104,54        87,89        -18,94 %
OgreSamplesCommon_d.lib         12,83         11,63        -10,28 %
OgreSamplesCommon_d.pdb         7,74          6,78         -14,18 %

Total                           399,51        330,81       -20,77 %

MSVC 2019 – Release

Lib Name                        Before (2.3)  After (2.4)  Diff
OgreHlmsPbs.dll                 0,55          0,55         0,09 %
OgreHlmsUnlit.dll               0,14          0,13         -1,45 %
OgreMain.dll                    6,91          6,82         -1,32 %
OgreMeshLodGenerator.dll        0,20          0,21         2,12 %
OgreOverlay.dll                 0,71          0,71         -0,28 %
OgreSceneFormat.dll             0,17          0,17         0,00 %
Plugin_ParticleFX.dll           0,14          0,14         0,35 %
RenderSystem_Direct3D11.dll     0,52          0,51         -0,38 %
RenderSystem_GL3Plus.dll        0,59          0,54         -9,37 %
RenderSystem_NULL.dll           0,12          0,12         -2,54 %
RenderSystem_Vulkan.dll         3,81          3,80         -0,18 %
OgreSamplesCommon.lib           1,48          1,41         -4,72 %

Total                           15,33         15,11        -1,44 %

That’s all for now. Ogre 2.3 was released just a week ago and we wanted to share such an exciting development already happening on 2.4.

Discussion in forum thread.

Vulkan Progress Report

If you follow me on Twitter you may have seen I tweeted about it.

Or if you follow our Ogre repo, you may have seen some commits.

Yes, we’re working on Vulkan support.

So far we only got to a clear screen, so this is all you’re gonna get thus far:

It is working with 3 different drivers: AMDVLK, AMD RADV, and Intel Mesa, so that’s nice.
Only X11 (via xcb library) works for now, but more Windowing systems are planned for later.

A very low level library

Vulkan is very low level, and setting this up hasn’t been easy. The motto is that all commands are submitted in order, but they are not guaranteed to end in order unless they’re properly guarded.

Want to present on screen? You better set up a semaphore so the present command waits for the GPU to finish rendering to the backbuffer.

Submitted twice to the GPU? You better sync these two submissions or else they may be reordered.

On the plus side, a modern rendering library could take advantage of this to start rendering the next frame while e.g. compute postprocessing is happening on a separate queue on the current frame.

A lot of misinformation

There are a lot of samples out there, but many of them are wrong or incomplete.

For example, LunarG’s official samples are wrong because they acquire the backbuffer from the GPU using the same semaphore every time instead of using one semaphore per frame.

In many of the samples this is not a problem because they perform a full stall for demo purposes, but some of the more ‘real world’ samples do not.
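As a rough illustration of the “one semaphore per frame” idea, here is a minimal sketch of a ring that hands out a different acquire semaphore for each frame in flight. The class name and the SemaphoreHandle alias are hypothetical stand-ins so the example compiles without the Vulkan headers; in a real renderer the handles would be VkSemaphore objects and the value returned by current() would be the one passed to the acquire call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for VkSemaphore (an assumption, to keep the sketch header-free).
using SemaphoreHandle = std::uint64_t;

// Hypothetical helper: cycles through one "image acquired" semaphore per
// frame in flight instead of reusing a single semaphore for every frame.
template <std::size_t FramesInFlight>
class AcquireSemaphoreRing
{
public:
    explicit AcquireSemaphoreRing(
        const std::array<SemaphoreHandle, FramesInFlight> &semaphores ) :
        mSemaphores( semaphores ),
        mFrameIdx( 0u )
    {
        static_assert( FramesInFlight > 0u, "need at least one frame in flight" );
    }

    // Semaphore to use when acquiring the backbuffer for the current frame.
    SemaphoreHandle current() const
    {
        assert( mFrameIdx < FramesInFlight );
        return mSemaphores[mFrameIdx];
    }

    // Call once per frame, after that frame's work has been submitted.
    void advance() { mFrameIdx = ( mFrameIdx + 1u ) % FramesInFlight; }

private:
    std::array<SemaphoreHandle, FramesInFlight> mSemaphores;
    std::size_t mFrameIdx;
};
```

This only shows the bookkeeping. A real renderer would also have to make sure the GPU has finished with a frame’s semaphore before the ring wraps around to it again, for example by waiting on a per-frame fence.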

They also do not teach how to deal with systems where the present queue and the graphics rendering queue are different (I don’t know which systems have this setup, but I suspect it has to do with Optimus laptops and similar setups where the GPU doing the rendering is not the one hooked to the monitor).

Google’s samples are much better, but they still miss some stuff, such as inserting a barrier dependency on VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT so that the graphics queue doesn’t start rendering to a backbuffer before it has been fully acquired and is no longer being used for presentation.

This bug is hard to catch because often the race condition will never happen due to the nature of double and triple buffering. In the worst case it could result in tearing or similar artifacts (even if vsync is enabled).

Though there’s also the possibility that failing to insert this barrier results in severe artifacts on AMD GPUs, due to DCC compression on a render target being dirty while rendering to it. Godot’s renderer encountered this problem.

This is covered at the end of Synchronization Examples’ Swapchain Image Acquire and Present.

Last week, Khronos released a new set of official samples. So far these seem to perform all correct practices.

A VERY good resource on Vulkan Synchronization I found is Yet another blog explaining Vulkan synchronization. It is really, really good.

If I were to summarize Vulkan, it reminds me of JavaScript async/promises development: everything is asynchronous and has to be coded with promises.

Once you get into the async mindset, Vulkan makes sense.

Where to next?

There’s a lot that needs to be done: resizing the swapchain is not yet coded, separate Graphics and Present queues are not handled, and there’s zero buffer management, no textures, no shaders.

The next task I’ll be focusing on is shaders, because they are useful to show stuff on screen and see if things are working. Even if there are no vertex buffers yet, we can use gl_VertexID tricks to render triangles on screen.
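The post doesn’t say which gl_VertexID trick is meant. One common variant, assumed here, derives the three corners of a single oversized triangle that covers the whole screen from the vertex index alone, so no vertex buffer has to be bound. The sketch below emulates that vertex-shader arithmetic on the CPU; the struct and function names are hypothetical. (In Vulkan-flavoured GLSL the built-in is called gl_VertexIndex.)

```cpp
#include <cassert>
#include <cstdint>

struct ClipPos
{
    float x;
    float y;
};

// CPU-side emulation of the shader expression
//     vec2 uv = vec2( (id << 1) & 2, id & 2 );
//     pos     = uv * 2.0 - 1.0;
// which yields (-1,-1), (3,-1) and (-1,3) for ids 0, 1 and 2.
inline ClipPos fullscreenTriangleCorner( std::uint32_t vertexId )
{
    assert( vertexId < 3u );
    const float u = static_cast<float>( ( vertexId << 1u ) & 2u );
    const float v = static_cast<float>( vertexId & 2u );
    return ClipPos{ u * 2.0f - 1.0f, v * 2.0f - 1.0f };
}
```

The triangle overshoots the [-1, 1] clip range on purpose; the parts outside the viewport are clipped away, leaving exactly one screen’s worth of pixels.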

And once shaders are working, we can then test if vertex buffers work once they’re ready, and if textures work, etc.

So that’s all for now. Until next time!

Improvements in VR, morph animations, moving to Github and CI

Over the last few weeks a new sample appeared: Tutorial_OpenVR

We’ve integrated OpenVR into Ogre.

The tasks done to achieve this can be summarized into five:

(more…)

Progress Report July 2019

Hoo boy! This report was scheduled for January but couldn’t be released on time for various reasons.

We have another report coming, as this is old news already! It will mostly talk about VCT (Voxel Cone Tracing), which is a very interesting feature that has been in development during this year.

But until then, let’s talk about all the other features that have been implemented so far!

(more…)

Ogre Progress Report: December 2016

Merry XMas! If you don’t celebrate it, good wishes anyway!

It’s been 9 months since our last progress report. We think it’s time for a new one! Oh boy. So much has been done and still in the works.

 

Metal Support

In case you didn’t notice: Ogre 2.1 now runs on Metal, Apple’s graphics API. And it’s stable! It only works on iOS for the moment, since a few tweaks are required to make it work on macOS as well.

You’ll need to use the 2.1-pso branch in order to use Metal. The 2.1-pso branch is scheduled to be merged with 2.1, once testing of the branch 2-1-pso-cache-legacy finishes (which implements a PSO cache utility meant to help users port their immediate style rendering code such as GUIs to support PSOs without major/significant changes).

 

MSAA resolving for D3D11

MSAA for Render Textures has been broken on D3D11 since like…forever. Not anymore. D3D11 MSAA targets will now be resolved appropriately, according to our implicit resolve rules (explicit resolve support still pending, but in that regard OpenGL is in the same state).

 

Parallax Corrected Cubemaps

PCC for short, aka Local Cubemaps, Local reflections, Cube projection.

PCC reflections are very important to achieve accurate local reflections.

View post on imgur.com


Our PCC implementation has two modes of operation: Automatic & Manual. Both have their strengths and weaknesses.

Automatic “just works”. Probes get automatically blended together (based on camera position) and applied. However automatic may have trouble showing reflections from distant probes, and in some cases the blending may be too evident.

Manual solves the problem of distant reflections not showing up and the blending issue, but it requires you to explicitly set the probe to the material. Also if you don’t perfectly subdivide the geometry to fit the probe’s bounds, you may see gaps (since there is no blending happening at all).

You can actually mix automatic and manual behaviors.

Once the texture refactor is ready (keep reading) we may provide more powerful and superior automatic methods (by using Clustered Forward to select which probe to use and cube arrays to do the actual selection, which are only supported on DX11 class hardware or better).

For more information and experimentation you can look at our two samples LocalCubemaps and LocalCubemapsManualProbe.

Important note: The samples don’t yet make it obvious that the PCC system reserves one visibility mask + Render Queue for its internal computations (i.e. it stores its Items into a RenderQueue of your choosing, set with a visibility bit also of your choosing). If you accidentally try to render those items, it will look funny. Keep in mind they may affect other things too, such as ray picking and Instant Radiosity generation (remember to filter those objects out).
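Filtering those internal items out usually comes down to a bitmask test. The sketch below is a generic illustration with made-up types and a made-up bit value; it does not use Ogre’s actual classes, and the reserved bit is whichever one you handed to the PCC system.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical choice of reserved bit; pick your own when setting up PCC.
constexpr std::uint32_t kPccReservedBit = 1u << 30u;

// Hypothetical stand-in for a scene object with a visibility mask.
struct SceneObject
{
    int id;
    std::uint32_t visibilityMask;
};

// Returns the ids of the objects a query (e.g. ray picking) should consider,
// skipping anything that carries one of the reserved bits.
inline std::vector<int> filterOutReserved( const std::vector<SceneObject> &objects,
                                           std::uint32_t reservedBits )
{
    assert( reservedBits != 0u );
    std::vector<int> result;
    for( const SceneObject &obj : objects )
    {
        if( ( obj.visibilityMask & reservedBits ) == 0u )
            result.push_back( obj.id );
    }
    return result;
}
```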

 

Created Ogre 2.1 FAQ in the Wiki

We’ve addressed it in a news post already. We’ve written a wiki resource to address frequently asked questions regarding Ogre 2.1.

 

Texture Matrix Animation in Unlit

This has been requested a lot. Now you got it!

While we don’t yet provide easy ways to animate textures using material commands like old 1.x materials did, at least it’s now possible to animate them by hand if you need to.
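Animating “by hand” here means computing a texture matrix yourself every frame and handing it to the material. As a minimal sketch, the function below builds a scrolling UV matrix from a speed and a time value. The Matrix4 alias, the row-major layout with the translation in the last column, and the function name are assumptions for illustration; how the result is passed to the Unlit material is left out.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Stand-in 4x4 matrix: 16 floats, row-major, translation in the last column.
using Matrix4 = std::array<float, 16>;

// Builds a matrix that scrolls UVs at (speedU, speedV) units per second.
inline Matrix4 scrollingUvMatrix( float speedU, float speedV, float timeSeconds )
{
    assert( timeSeconds >= 0.0f );
    // Wrap the offsets into [0, 1) so precision doesn't degrade over time.
    const float rawU = speedU * timeSeconds;
    const float rawV = speedV * timeSeconds;
    const float offsetU = rawU - std::floor( rawU );
    const float offsetV = rawV - std::floor( rawV );

    const Matrix4 m = { { 1.0f, 0.0f, 0.0f, offsetU,
                          0.0f, 1.0f, 0.0f, offsetV,
                          0.0f, 0.0f, 1.0f, 0.0f,
                          0.0f, 0.0f, 0.0f, 1.0f } };
    return m;
}
```

Calling this once per frame with the accumulated time gives a continuously scrolling texture, assuming the texture uses a wrapping addressing mode.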

 

Global Illumination

We’re working on a technique called Instant Radiosity. The idea is very simple:

  1. Trace a lot of rays from the light.
  2. Generate a point light where the ray hits the surface. We’ll call this point light a VPL (Virtual Point Light)
  3. Cluster very close VPLs into one by summing their light contribution and averaging their locations.
  4. Use Forward3D or ForwardClustered (or Deferred Rendering) to use all these VPLs in scene.
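Step 3 can be sketched as follows. This is a simple greedy, single-pass clustering chosen for illustration; the post doesn’t describe which clustering scheme Ogre actually uses, and all type and function names are hypothetical. VPLs that fall within a merge radius of an existing cluster are folded into it by summing their light contribution and averaging their positions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3
{
    float x, y, z;
};

// A Virtual Point Light: where a ray hit, and how much light it carries.
struct Vpl
{
    Vec3 position;
    Vec3 colour;
};

inline Vec3 add( const Vec3 &a, const Vec3 &b )
{
    return Vec3{ a.x + b.x, a.y + b.y, a.z + b.z };
}

inline Vec3 scale( const Vec3 &v, float s ) { return Vec3{ v.x * s, v.y * s, v.z * s }; }

inline float distanceSquared( const Vec3 &a, const Vec3 &b )
{
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Greedy clustering (an assumption, not Ogre's documented algorithm):
// colours are summed, positions are averaged.
inline std::vector<Vpl> clusterVpls( const std::vector<Vpl> &vpls, float mergeRadius )
{
    assert( mergeRadius >= 0.0f );
    const float radiusSq = mergeRadius * mergeRadius;

    std::vector<Vpl> clusters;
    std::vector<Vec3> positionSums;   // un-averaged position sum per cluster
    std::vector<std::size_t> counts;  // number of VPLs merged per cluster

    for( const Vpl &vpl : vpls )
    {
        bool merged = false;
        for( std::size_t i = 0u; i < clusters.size() && !merged; ++i )
        {
            if( distanceSquared( clusters[i].position, vpl.position ) <= radiusSq )
            {
                clusters[i].colour = add( clusters[i].colour, vpl.colour );
                positionSums[i] = add( positionSums[i], vpl.position );
                ++counts[i];
                clusters[i].position =
                    scale( positionSums[i], 1.0f / static_cast<float>( counts[i] ) );
                merged = true;
            }
        }
        if( !merged )
        {
            clusters.push_back( vpl );
            positionSums.push_back( vpl.position );
            counts.push_back( 1u );
        }
    }
    return clusters;
}
```

The result depends on the order in which VPLs are visited, which is acceptable for a sketch but is one of the things a production implementation might handle differently.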

The technique is an approximation, but the results are very convincing, and there are lots of knobs to adjust for tweaking the results efficiently.

Instant Radiosity is not a very slow technique, but it is not a real time technique either. When it comes to generating the VPLs, we still need to raytrace. Even if the raytrace only takes e.g. 500 ms, that is not suitable for real time. It was chosen because it was easy to implement and offers a lot of flexibility when compared to other techniques (such as Light Propagation Volumes) due to all the settings that can be adjusted, while also illuminating dynamic objects. In other words, the cost benefit ratio was really good.

This image is only lit by a spotlight + Global Illumination:

View post on imgur.com

And now same scene with VPL debug markers turned on:

View post on imgur.com

Another angle with parameters exaggerated:

View post on imgur.com

Clustered Forward

Instant Radiosity made it obvious that Forward3D was under-performing. While it was original research done by me (dark_sylinc), it was clear it wasn’t performing as well as I had estimated and hoped.

So I just went ahead and implemented Clustered Forward. It’s both threaded (slices are assigned to different threads) and SIMD optimized. Also the Frustum vs. Spotlight and Frustum vs. Pointlight intersection tests are much (!) tighter than the ones we use for Forward3D.
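To give an idea of what “slices are assigned to different threads” can look like, here is a minimal sketch that splits a number of slices into contiguous ranges, one per worker thread. The function name and the even-split policy are assumptions for illustration; the post doesn’t describe how Ogre distributes the slices.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits numSlices into numThreads contiguous half-open ranges [first, second).
// When the division isn't exact, the first few threads get one extra slice.
inline std::vector<std::pair<std::size_t, std::size_t>> splitSlices( std::size_t numSlices,
                                                                     std::size_t numThreads )
{
    assert( numThreads > 0u );
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    ranges.reserve( numThreads );

    const std::size_t base = numSlices / numThreads;
    const std::size_t remainder = numSlices % numThreads;

    std::size_t begin = 0u;
    for( std::size_t t = 0u; t < numThreads; ++t )
    {
        const std::size_t count = base + ( t < remainder ? 1u : 0u );
        ranges.emplace_back( begin, begin + count );
        begin += count;
    }
    return ranges;
}
```

Each thread would then build the light lists only for the slices in its own range, so no two threads write to the same slice.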

In debug builds, having many slices may take a noticeable hit on CPU when compared to debug Forward3D. You could just use fewer slices during debug, though, or switch to Forward3D.

Clustered Forward lets you control the number of slices, which improves GPU speed, and the tight frustum tests mean shaders don’t waste precious cycles trying to shade with lights that aren’t actually visible. This leads to an average performance improvement of 33%, though your mileage may vary (it can be 2x faster, or show no gain at all if your lights have gigantic ranges).

 

Compositor improvements

  • Scene passes now have “enable_forwardplus” to explicitly turn off Forward3D and ForwardClustered in passes where you don’t need them. This will reduce CPU consumption by avoiding wasted cycles building light lists for something you won’t be using.
  • Compositor workspaces now support more than one input (i.e. not just the “final target”, which was usually but not always the RenderWindow). “connect_output” still exists, but it just does the same as “connect_external 0”. Useful when you want a workspace to produce a lot of results for you, not just one.
  • 2D Array textures, cubemaps in compositors: They can now be created via compositor scripts. See the manual for more information.
  • UAV Buffers: Instead of creating UAV textures, you can also create buffers. We allow creating buffers of fixed byte sizes, and width x height sizes. You can also create them from C++ and treat them as external buffers for the workspace (they work just like external textures). Useful mostly for compute and some advanced rendering algorithms. See the manual for more information.
  • Double-sided stencil: Some parameters have been moved to the “both” block. See Sample_StencilTest and the manual for more information.

 

Quadratic behavior when loading meshes

User 0xC0DEFACE noticed that loading 60,000 meshes was taking more than 15 minutes due to O(N^2) behavior in our code. He fixed it with a very trivial change and brought it down to 4 seconds. Kudos!
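The post doesn’t say what the quadratic code looked like. A typical way this kind of O(N^2) behavior sneaks in, shown here purely as a generic illustration with hypothetical function names, is a linear scan on every insertion; swapping the container for a hash set makes each insertion constant time on average.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Quadratic overall: every insertion scans all previously added names.
inline bool addUniqueLinear( std::vector<std::string> &names, const std::string &name )
{
    assert( !name.empty() );
    for( const std::string &existing : names )
    {
        if( existing == name )
            return false;
    }
    names.push_back( name );
    return true;
}

// Linear overall: each insertion is O(1) on average.
inline bool addUniqueHashed( std::unordered_set<std::string> &names, const std::string &name )
{
    assert( !name.empty() );
    return names.insert( name ).second;
}
```

With tens of thousands of entries, the first version performs on the order of N^2 / 2 string comparisons in total, while the second performs roughly N hash lookups.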

 

Merged PSO branch

PSO had been very stable for a long while now. Plus there had been a bunch of very important bugfixes (like stencil support, some edge case viewport glitches) that were only in PSO branch.

Furthermore I found myself very often merging and cherry-picking between the different 2.1 branches, which is a sign that they needed to be merged. So I did.

The last non-pso commit was 06631aef218d73fdc2ca323da626a53650d941be

 

GLES3 Progress

New user Hotshot5000 has stepped up to port the GLES3 RenderSystem to Ogre 2.1.

He’s claiming he is starting to see the light. We wish him good luck as we wait impatiently for more updates from him!

 

In Design: Texture Refactor

The texture refactor was announced. No coding work has been done yet, but we’re solidifying the foundations of how texture loading will work in the future.