Inconsistancy in floating-point modes can cause rare crashes

Post by **Jabberwocky** » Sat Jul 16, 2011 10:50 am

Yes, getting rid of the equals check altogether also fixes the crash. And it's simpler/cleaner, so that seems like the best fix.

The final code:

Code: Select all

        /// Comparator to order objects by descending camera distance
        struct DepthSortDescendingLess
        {
           const Camera* camera;

           DepthSortDescendingLess(const Camera* cam)
              : camera(cam)
           {
           }

           bool _OgreExport operator()(const RenderablePass& a, const RenderablePass& b) const
           {
              if (a.renderable == b.renderable)
              {
                 // Same renderable, sort by pass hash
                 return a.pass->getHash() < b.pass->getHash();
              }
              else
              {
                 // Different renderables, sort by depth
                 Real adepth = a.renderable->getSquaredViewDepth(camera);
                 Real bdepth = b.renderable->getSquaredViewDepth(camera);

                 // Sort DESCENDING by depth (i.e. far objects first)
                 return adepth > bdepth;
              }
           }
        };

Good eye, CABAListic.
Thanks to you and everyone else who commented on this problem. It's nice having some backup on tricky bugs like this.

Post by **Jabberwocky** » Fri Jul 22, 2011 9:30 pm

So, this is the bug that never dies.
Cabalistic, your suggested fix of removing the Math::RealEquals check alltogether did fix the crash, but introduced a new "depth fighting" style problem. It can be seen here, watch the rocket particles flicker in and out:
[vimeo]26777700[/vimeo]

Somehow, that fix doesn't seem to reliably sort depth equal renderables in the same order between frames. I went back to my more complex fix proposed earlier in the thread, and this problem disappears.

Now it's not totally clear to me why my fix doesn't have the depth fighting problem. Maybe it has to do with the ordering of the list we're sorting.

Anyway, I feel like I should summarize the issue, since this has gotten pretty complex.
- with the current 1.7.1 code, my game crashes in the stable_sort routine
- I have determined that it was because of non-deterministic results from the Math::RealEquals check of the 1.7.1 code
- this problem can be fixed by introducing a larger tolerance value into the Math::RealEquals check.

So, there are 2 potential fixes:
1. Leave the code as currently written, but add a proper tolerance value to Math::RealEqual. This is the least change, it fixes the crash, and everything else should operate as it currently does. However, earlier in this thread I outlined a theoretical situation in which this could still cause a stable_sort crash. Although I have not observed this in practice.
2. Use the fix I proposed a few posts up. This removes the pass pointer comparison check of the current code, of which there has been some criticism. However, there's a little bit of black magic as to why it doesn't suffer the same depth-sorting problem displayed in the video above.

Post by **CABAListic** » Fri Jul 22, 2011 10:54 pm

Interesting. stable_sort should produce reliable results, so I'm guessing the list in question is regenerated and populated in a different order.

I really wish the true equality test (==) were reliable as it really, really should be (after all, it's not a particularly difficult check). Then we could introduce a stable secondary sorting criterion without any RealEqual hacks, which are bound to break again at some point. Right now, I'd be most interested to know why the list is populated in different orders before the sorting and if there is a way to prevent that.

Post by **dark_sylinc** » Sat Jul 23, 2011 8:08 am

Edit 2: It's getting reaaaallly late here, but the conclusion about the Transitivity of equivalence is the same that Jabberwocky already came to. Gosh I need to rest.

I took a deep, carefull look at SGI's reference (here, here & here)

I think I know what's going on. The problem is in the RealEqual thing. It potentially breaks the Transitivity of equivalence invariant. Suppose you have these three values:
X = 1.0 - 1e-6f //0.999...
Y = 1.0;
Z = 1.0 + 1e-6f //1.000001

Math::RealEqual( X, Y ) will return true.
Math::RealEqual( Y, Z ) will return true.
Math::RealEqual( X, Z ) will return false ---> Ooops. Different code path. Depending on what the other "return a.pass < b.pass;" had returned and what the current "return (adepth > bdepth);" will return, we will potentially be violating transitivity of equivalence (if X == Y && Y== Z, then X == Z, which is being violated).

With a bigger epsilon, we just translate the problem to bigger numbers.

[s]If we remove RealEqual, and use the == operator (the original problem from this thread in 2007), then we are violating Irreflexivity invariant because f( x, x ) may return true.[/s]
Edit: Using the == operator will always return false, as it should. I'm still thinking it as it doesn't feel right (and obviously it doesn't work)

Removing the whole "if( Math::RealEqual(adepth, bdepth) ) return a.pass < b.pass;" block should fix it.
However, I think that's what's causing the flickering problem, right?

Well, the flickering problem might be being caused by:
a. Floating point imprecisions. Let's make use a very big epsilon to show the behavior:
In frame 0:
particle X = 0.99
particle Y = 1.00

In frame 1:
particle X = 1.02 //Order became reversed
particle Y = 1.01

In frame 2:
particle X = 1.03 //Order became reversed again!
particle Y = 1.04

b. Equivalence! If X == Y, then the order is unaltered by sort. If at frame 0 X is submitted to the queue before Y, and at frame 1 Y is submitted before X, then we'll get flickering. In order words, it stops being sorted, and now becomes a FIFO queue for those particles with equal (actually, equivalent) depth.

Solutions:
First, remove RealEqual(). It breaks Transitivity of equivalence (on extreme cases, but it does).
[s]Second, remove "return a.pass < b.pass;". It breaks Irreflexivity.[/s]

Alternatives (not mutually exclusive):
a. Quantize values. Round them, truncate them, convert to integers (all of them potentially slow, I'm afraid). This solves flickering caused by floating point imprecisions (assuming this is actually a valid cause! I'm guessing!) If my guess is wrong, then quantization wouldn't be necessary. In other words:

Code: Select all

return Quantize(adepth) > Quantize(bdepth);

b. To avoid flickering caused by equivalence, use some sort of hashing that guarantees the order is preserved regardless of how they were submitted to the queue, and this hash should be unique for each renderable. In other words:

Code: Select all

if( adepth == bdepth ) //Or the quantize() version
{
    // Must return deterministic result, but never treat two renderables as being equivalent
    return a.renderable->getHash() < b.renderable->getHash();
}
else
{
//...
}

If someone thinks I'm mistaken, I would be glad to be pointed out. I really hate writing predicators.

Cheers
Dark Sylinc

Post by **Jabberwocky** » Sun Jul 24, 2011 8:48 am

CABAListic wrote:Interesting. stable_sort should produce reliable results, so I'm guessing the list in question is regenerated and populated in a different order.

Agreed.

CABAListic wrote:Then we could introduce a stable secondary sorting criterion without any RealEqual hacks, which are bound to break again at some point.

Maybe.
I don't consider it a hack to calculate a proper tolerance value based on the magnitude of the values being compared. My particular implementation may be a bit hacky, but there's probably an established way of doing this.

CABAListic wrote:Right now, I'd be most interested to know why the list is populated in different orders before the sorting and if there is a way to prevent that.

I haven't looked into this code at all. But it's going to be dependent on some kind of traversal of the scene graph, which will be SceneManager type dependent. So even if we figure out why we're getting different ordering, or manage to fix it, I don't think it would be safe to assume other SceneManagers will provide this list in any kind of specific order.

Post by **Jabberwocky** » Sun Jul 24, 2011 9:02 am

@dark_sylinc:
Thanks for digging into things.

dark_sylinc wrote:I think I know what's going on. The problem is in the RealEqual thing. It potentially breaks the Transitivity of equivalence invariant.

I expressed the same theory here.
It turns out this was not the cause of the crash. But I still believe it could be a problem. Especially since you came to the same conclusion.

The actual problem was that Math::RealEquals alternately returned either true or false for the exact same 2 numbers. This was evident by the logging I added to the sort routine.

Weird, I know. Although the original bug reporters seem to have observed this exact same behaviour. Sinbad thought the Math::RealEquals check would fix it. It didn't, because of the incorrect (default) tolerance value being passed to that function.

Well, the flickering problem might be being caused by:

I'm pretty sure, and Cabalistic seems to agree, that the flickering is caused by the ordering of the list passed into the sort routine.
Example:
Let's say we have 4 renderables:
renderable A, depth 10
renderable B, depth 12
renderable C, depth 12
renderable D, depth 15

Frame 1: the list passed to stable_sort is B, C, A, D
stable_sort produces: A, B, C, D

Frame 2: the list passed to stable_sort is C, B, A, D
stable_sort produces: A, C, B, D

note the reversal of B <-> C

Likely, this is why the pass pointer comparison was added in the current code.

Post by **CABAListic** » Sun Jul 24, 2011 10:11 am

Ah no, I think dark_sylinc nailed it. The depth values change from frame to frame (curtesy of the particles moving in space), and if they are very close to each other, then inconsistencies can cause one to be in front in one frame and the other to be in front in the next frame. Would perfectly explain the different ordering and does suggest that we need a RealEqual check to get rid of it.

Post by **dark_sylinc** » Mon Jul 25, 2011 7:42 pm

I'm surprised someone understood me! I was so tired while writing it I was worried it would be incomprehensible for anyone.

Jabberwocky wrote: I expressed the same theory here.
It turns out this was not the cause of the crash. But I still believe it could be a problem. Especially since you came to the same conclusion.

Yeah, I missed your post and later saw you've already got there.

Jabberwocky wrote: The actual problem was that Math::RealEquals alternately returned either true or false for the exact same 2 numbers.

Weird, I know. Although the original bug reporters seem to have observed this exact same behaviour. Sinbad thought the Math::RealEquals check would fix it

The original problem was with RealEqual not being used and still crashed. I wonder if this problem was what was going on (God save our souls!!!).
The more complex is the floating point code, the greater chances the compiler will optimize it in such way that starts behaving like in the link.

Edit: Also Sinbad wrote that pointer comparison and thinking it might not ever end up inside that 'if' block. In other words, what happens inside that block has never been thoroughly tested. And note comparing two unrelated pointers with "< <= > >=" is undefined behavior by the C++ standard (and "==" & "!=" is a bit more complex).

CABAListic wrote:Would perfectly explain the different ordering and does suggest that we need a RealEqual check to get rid of it.

The only problem, is that like Jabberwocky also concluded, RealEqual poses a crash risk. We need some other way of grouping similar numbers together. Rounding could work. NOW, how to round is an entirely different deal (i.e. round( x * 10000.0f ) would work, but not for large values as it would quickly saturate. Perhaps round( log( x * 10000.0f ) ) ? ...sloow )

Ogre Forums

Inconsistancy in floating-point modes can cause rare crashes

Re: Inconsistancy in floating-point modes can cause rare cra

Re: Inconsistancy in floating-point modes can cause rare cra

Re: Inconsistancy in floating-point modes can cause rare cra

Re: Inconsistancy in floating-point modes can cause rare cra

Re: Inconsistancy in floating-point modes can cause rare cra

Re: Inconsistancy in floating-point modes can cause rare cra

Re: Inconsistancy in floating-point modes can cause rare cra

Re: Inconsistancy in floating-point modes can cause rare cra