[MAJOR ISSUE] Mogre/Ogre generalized memory bug

Kodachi_Garou

31-01-2008 11:58:14

For months I've been dealing with strange phenomena like access violations or creation functions and properties returning null references, creeping in from Mogre library and have not been able to pin them down to a simple bullet-proof example until now. Some related previous post references: http://www.ogre3d.org/phpBB2addons/viewtopic.php?t=5710 and http://www.ogre3d.org/phpBB2addons/view ... 7fc386b91b.

Ok, finally I've got it and it's bad, way BAD. There is indeed a bizarre bug with either Ogre or Mogre, but given the larger number of strange uses done with Ogre, for now I think it's the latter. I've found a simple bullet-proof condition for producing null reference pointers on Mogre function creation.

Just smack these two functions on your favorite Mogre sample and watch them blow:

public override void CreateFrameListener()
{
base.CreateFrameListener();
root.FrameStarted += new FrameListener.FrameStartedHandler(root_FrameStarted);

new Thread(delegate()
{
while (!shutDown)
{
// Choose your poison
/*/
byte[] arr = new byte[320 * 240 * 3];
arr[120] = 5;
/*/
GC.Collect();
/**/
Thread.Sleep(66);
}
}).Start();
}

bool root_FrameStarted(FrameEvent evt)
{
SceneNode node = sceneMgr.CreateSceneNode();
sceneMgr.DestroySceneNode(node.Name);
return true;
}


What could this mean? Here's my two cents:

Basically, if you have a Thread which is concurrent with the Mogre thread and is somewhat allocating memory, or worse, freeing memory, Mogre will snap. Keep in mind that it is likely that this will only happen on dual-core machines, but I haven't tested this claim thouroughly.

This could come from a variety of sources, including input handling buffers like keyboard, mouse, video camera threads, etc.

If there is a high memory manipulation from both the Mogre thread and the memory handling thread, Mogre will crash hard, even if the thread is completely independent with nothing to do with Mogre data structures.

If none of the current Mogre maintainers can provide a reasonable explanation for this, the next step is to breathe deep and delve into the arcane Mogre source code and try to grasp the issue at its root.

A possible hint to consider:

- Ogre is a non-MT safe library
- The garbage collector runs in an independent system thread
- Garbage collection could happen concurrently with the calling of an Ogre function.
- Some concurrency handling failure could lead to exceptions, crashes and generalized strange behaviour.

Best regards,

Gonçalo

Marioko

31-01-2008 16:47:04

u are right MOGRE and OGRE are Anti-multithreads framework, but in OGRE it can be solve it with good concurrent design. In MOGRE, threads have another problems added by .NET framework, because .NET manage its own threads and subthreads and MOGRE is a mixed code assembley bugs can appaear like rabbits :D

About "externals" Threads like input, video or sound i never had a null pointer using it togheter.. but using .NET thread model i had a lot like you.

[MAJOR ISSUE] means advanced knowledge of C++ threading and C++/CLI interop threading and pure .NET threading. If u can solve this issue please let us know..

In the other hands i try to find a solution :D

EDIT:
BTW create nodes in FrameStarted is a very bad practice that can generate a lots problems.. :evil:

PD: Sorry my ugly bad english :D

Kodachi_Garou

31-01-2008 16:57:38

Hi Marioko,

I'm already looking into Mogre source code and I'm currently identifying a lot of potentially dangerous code which could be highly related with this issue.

However, I've never built Mogre before so I can't test my hints just yet. I've heard that you have a compilation manual with the necessary steps for building Mogre.

If you have it, could you please send it to me? It's alright if it's in spanish, I can understand it (although I'm not spanish myself).

Thanks for the help. I think it's essential to solve this issue, since .NET applications are only expected to increase their number of threads in the future, with the current interest and rise in multi-core machines.

Best regards,

Gonçalo

P.S.: I know that creating SceneNodes every frame is not a good practice, but this example was just for illustrative purposes so that you wouldn't have to wait about 3 to 4 hours for the bug to popup :)

Kodachi_Garou

06-02-2008 18:56:45

Ok, I think I've figured out the reason why Mogre returns null references at arbitrary non-deterministic moments during multi-threaded memory heavy applications.

The remainder of this topic is for whoever understands the Mogre architecture. I'll basically walk you through the steps performed during the creation of a simple SceneNode and explain you why the currently implemented behavior results in a race condition leading to null references.

Lastly, I'll propose a solution, and we can then sit down and discuss the pros and cons of it. But rest assured, this is a F**** time-bomb waiting to happen. This bug needs to be squashed now for the good of Mogre's future.

Let us begin the voyage:

1. The managed code

node = sceneMgr.CreateSceneNode();

A simple directive at first, but let us step on how Mogre actually handles this:

2. The unmanaged surface

Mogre::SceneNode^ SceneManager::CreateSceneNode( )
{
return static_cast<Ogre::SceneManager*>(_native)->createSceneNode( );
}


A single line of code, but hiding so many mysteries. We are basically calling the createSceneNode() function from the NATIVE sceneManager instance and returning its result. However, notice that createSceneNode() originally returns Ogre::SceneNode* and NOT Mogre::SceneNode^. So how does the magic happen?

Basically, there are a set of implicit operators converting to and from managed and native types, defined in a ready-made MACRO expansion (debugging nightmare). Here they are:

3. The Macro Expansion nightmare

#define DEFINE_MANAGED_NATIVE_CONVERSIONS(T) \
static operator T^ (const Ogre::T* t) { \
RETURN_CLR_OBJECT(T, (const_cast<Ogre::T*>(t)) ) \
} \
static operator T^ (const Ogre::T& t) { \
RETURN_CLR_OBJECT(T, (&const_cast<Ogre::T&>(t)) ) \
} \
inline static operator Ogre::T* (T^ t) { \
return (t == CLR_NULL) ? 0 : static_cast<Ogre::T*>(t->_native); \
} \
inline static operator Ogre::T& (T^ t) { \
return *static_cast<Ogre::T*>(t->_native); \
}


These macros are defined for all MOGRE classes, and as such, for the Mogre::SceneNode class. From this whole set, we're basically interested in the first function:

static operator T^ (const Ogre::T* t) {
RETURN_CLR_OBJECT(T, (const_cast<Ogre::T*>(t)) )
}


The conversion function is itself implemented by yet another Macro expansion, defined as:

#define RETURN_CLR_OBJECT(T,n) \
if (n == 0) return nullptr; \
Object^ clr = *n; \
if (nullptr == clr) { \
n->_Init_CLRObject(); \
clr = *n; \
} \
return static_cast<T^>(clr);


Starting to get scared? Well, you should be, for this is exactly where the race condition responsible for the bug happens. Let's recap the execution up to this point to see where we stand.

The low-level createSceneNode() was called and the result passed on to the implicit conversion operator, and then to the above macro. The parameter 'n' of the macro now holds the Ogre::SceneNode* pointer.

Continuing:

if (n == 0) return nullptr;

The parameter 'n' is not zero, so this condition is cleared, leading us to:

Object^ clr = *n;


Yet another black magic happens. The Ogre::SceneNode* pointer is converted to a System::Object^ managed reference. As expected, this is done by another implicit conversion, but this time of a different nature.

4. The parallel inheritance hierarchies

As it stands, Mogre does its wrapping magic by defining two parallel hierarchies closely intertwined. On the one hand, there is the original Ogre unmanaged class hierarchy, with the twist that all classes inherit from a special top level class, the CLRObject.

On the other hand, there is Mogre's own class hierarchy, mirroring that of Ogre, but with all classes deriving from another special class, aptly named Wrapper.

So, how do these two interrelate? Let us step in the CLRObject class, to see a special implicit conversion operator:

inline operator Object^ () {
if (_handle == 0)
return nullptr;
else
return __VOIDPTR_TO_GCHANDLE(_handle).Target;
}


Returning to the line currently under analysis, we are now in a position to understand it:

Object^ clr = *n;

The parameter 'n' is of type Ogre::SceneNode* which in turn inherits from CLRObject, inheriting this implicit conversion operator to Object^. Basically, each CLRObject holds an internal reference, implemented via GCHandle instances, to managed objects.

When this conversion operator is called to, it basically uses this reference to access the managed reference and returns it. Of course, if the reference is 0, it has no choice but to return a null reference.

CLRObjects have their handles initialized to 0 in the default constructor, so the Ogre::SceneNode* instance gets converted to a null reference. The next macro lines show us how CLRObjects come to possess non-null references:

if (nullptr == clr) { \
n->_Init_CLRObject(); \
clr = *n; \
}


So basically if the CLRObject is yet to be initialized, Mogre initializes them, via the _Init_CLRObject() virtual method. What this method does is basically instantiate a new MOGRE class FROM the UNMANAGED side, via these Macro expansions:

//for subclasses of CLRObject
#define DECLARE_INIT_CLROBJECT_METHOD_OVERRIDE(T) virtual void _Init_CLRObject(void) { _Init_CLRObject_##T(this); }


and (for each derived class)

#define DEFINE_INIT_METHOD(T) \
void _Init_CLRObject_##T(CLRObject* pClrObj) \
{ \
*pClrObj = gcnew Mogre::T(pClrObj); \
}


If you're somehow still following me, hang strong, for we are getting very close to the hot spot.

So basically, Mogre calls these _Init_CLRObject_##T methods using instances of CLRObject and the 'pClrObj' parameter is the very pointer to the native CLRObject used to make the call. Confusing?

Following our example should make it clear. Basically, the Ogre::SceneNode* pointer is passed inside this function, where the constructor for the MANAGED counterpart to the class is called. In this case, the Mogre::SceneNode constructor is called.

And then, yet another black magic. This managed reference is assigned to the unmanaged instance passed to the function. How is this possible? The CLRObject class has another black magic operator to do this, this time an assignment operator:

inline CLRObject& operator=(Object^ t) {
if (_handle == 0)
_handle = __GCHANDLE_TO_VOIDPTR(GCHandle::Alloc(t, GCHandleType::Weak));
else
__VOIDPTR_TO_GCHANDLE(_handle).Target = t;

return *this;
}


What this does is create a GCHandle to the managed reference. A GCHandle is a special struct of .NET allowing unmanaged code to keep a reference to managed classes. By creating a GCHandle we can get a pointer which can be used to refetch the instance at some point later in time.

This is where the race-condition manifests. GCHandle has several modes of operation (Pinned, Normal, Weak and WeakTrackResurrection).

Pinned means that the garbage collector can't move the memory pointed to by the managed object.

Normal means that the garbage collector can't destroy the managed object until the GCHandle is freed.

Mogre uses Weak, so what does it mean? It basically means that the GCHandle reference doesn't stop the garbage collector from collecting the object. From that point on, accessing the Target property on the GCHandle simply returns null references.

And THAT is the reason why Mogre non-deterministically returns null references. Basically, in the _Init_CLRHandle method, ogre did this:

void _Init_CLRObject_##T(CLRObject* pClrObj) \
{ \
*pClrObj = gcnew Mogre::T(pClrObj); \
}


NO REFERENCE is kept anywhere from this method, and the CLRObject only keeps a WEAK reference to the created object. This means that from this point on, unless someone gets a hold of it, the garbage collector WILL mark this object for collection on the next opportunity.

Let us look again at the final lines of the RETURN_CLR_OBJECT macro:

if (nullptr == clr) { \
n->_Init_CLRObject(); \
clr = *n; \
}


The _Init_CLRObject() creates and immediately lets go of the managed object. The line

clr = *n;

will obtain it again by using the GCHandle reference. But what if something happens BETWEEN these two lines of code? More specifically, what happens if the GC starts functioning in between? Basically, the managed reference will be immediately collected, and from this point on, the GCHandle will only return null.

This CAN happen in any .NET application, since the GC runs on a separate thread. If a garbage collection happens between these two lines, Mogre is helpless and won't ever notice the trouble it's gotten itself into.

If anyone survived through this, my suggestion:

Change the GCHandle::Alloc to use a Normal reference and not a Weak one. We want Ogre classes to manage the .NET classes lifetime anyway, so this is the way it should be.

More specifically, this change SHOULD fix everything if my theory proves right:

inline CLRObject& operator=(Object^ t) {
if (_handle == 0)
_handle = __GCHANDLE_TO_VOIDPTR(GCHandle::Alloc(t, GCHandleType::Weak));
else
__VOIDPTR_TO_GCHANDLE(_handle).Target = t;

return *this;
}


change to

inline CLRObject& operator=(Object^ t) {
if (_handle == 0)
_handle = __GCHANDLE_TO_VOIDPTR(GCHandle::Alloc(t, GCHandleType::Normal));
else
__VOIDPTR_TO_GCHANDLE(_handle).Target = t;

return *this;
}


And it should be fixed. Unless there is some side-effect to changing it to Normal (I checked it and didn't find anything)....

I've tested it on my own error revealing sample (see above) and it has worked for one hour (so far) and with no memory leaks or crashes whatsoever :D.

I'm sorry for the HUGE rambling, but I just had to let my findings registered somewhere, and if they have even the smallest chance of inspiring someone to clean-up this architecture, I thought they should be made public.

Best regards,

Gonçalo

dunce

07-02-2008 11:38:35

Fundamental investigation! I hardly survived reading the whole article. :shock:

mkh42

08-02-2008 16:12:14

thank you for your great debugging.
your solution seems resonable and should imho be implemented into mogre.

pin

08-02-2008 19:48:52

Great find Kodachi_Garou. It's encouraging to know that someone else is also using Mogre and helps in spotting and fixing the bugs. Thanks for sharing this hopefully the Mogre maintainers will include the change in the next Mogre release.

Marioko

10-02-2008 19:50:33

wow very deep investigation... sorry for this later comment, i was busy this days. I have new computer and that take me a while make it ready for war (developing and gaming and everything else). :D

Ok.. i going to make that fix, and it will be released for public testing and use comming soon.

Talking about MOGRE design and arquitecture, i was reading about C++/CLI and there is NOT need to modifed or touch OGRE (c++) code for make a powerful wrapper, but that means a complete remake of MOGRE. Kodachi_Garou, we should talk about MOGRE future. You have a great knowledge about these things and may be we can make a TOTAL MAKEOVER of MOGRE.... is a great project, but is possible..

BTW i think that Bekas (MOGRE Author) come back from army next month, i am not sure, but it is, he can help us a lot..

thanks for all

dodongoxp

10-02-2008 21:58:38

that sounds really really cool and ambitious.. that remake surelly will help with the shoggoth update when the time comes 8)

Kodachi_Garou

12-02-2008 10:50:37

I think that rebuilding Mogre from scratch is indeed a great idea, although VERY ambitious. Ogre is a huge library to wrap.

Mogre has very confusing innards and everything could probably be solved in a much more elegant, and probably even more performant, way. Also, Mogre would surely benefit from following the .NET framework design guidelines.

Maybe we should start a thread to address this discussion. This week I'll be extremely busy, but starting the next I can probably find time to sit down and properly analyse this.

Best regards,

Gonçalo

Marioko

12-02-2008 16:59:39

I agree.. :D i am extremetly busy too :D

About MogreNG (next generation) :lol: project we already have a starting point. The cpp2java tool can make a very easy to read meta.xml file with all the description (namespace, classes,methods,structs and enum) of Ogre (c++) headers we "just" need build a new Autowrapper

tomorrow i goin to post new thread about this.. maybe someone else can help us 8)

Zool

18-02-2008 13:25:35

Hey, just pointing out the the change that Kodachi_Garou proposed (GCHandleType::Weak -> GCHandleType::Normal), should be applied to CLRHandle (of CLRHandle.h) too.

Marioko

18-02-2008 16:32:14

ok thanks.. i try to realease a new version (1.4.6.1) in this fixed... :D

Zool

18-02-2008 17:32:16

Hmm, I looked into it more and I think I was wrong about the need to make the change for CLRHandle too.

CLRHandle uses another way to create wrapper objects (Mogre: Marshalling.h):
#define DEFINE_MANAGED_NATIVE_CONVERSIONS_FOR_CLRHANDLE(T) \
static operator T^ (const Ogre::T* ct) { \
Ogre::T* t = const_cast<Ogre::T*>(ct); \
if (t) \
{ \
Object^ clr = t->_CLRHandle; \
if (CLR_NULL == clr) \
{ \
clr = gcnew T(t); \
t->_CLRHandle = clr; \
} \
return static_cast<T^>(clr); \
} \
else \
return nullptr; \
} \

This line
t->_CLRHandle = clr;
Goes through this one (Ogre: CLRHandle.h):
inline CLRHandle& operator=(Object^ t) {
if (_handle == 0)
_handle = __GCHANDLE_TO_VOIDPTR(GCHandle::Alloc(t, GCHandleType::Weak));
else
__VOIDPTR_TO_GCHANDLE(_handle).Target = t;

return *this;
}

So basically the managed object for CLRHandle is created on the Mogre side (unlike CLRObject which happens on the Ogre side), so there's no danger that it will fall out of scope and get garbage collected before the marshalling function returns it.

I think GCHandle::Weak is useful because if the managed wrappers goes out of use for the .NET application, there's no need to keep it in memory permanently until the native object is destroyed.

The program may call a function and get an object just to read a value or something, no need to keep every managed wrapper that is created around.

Bontakun

27-02-2008 12:16:10

Good work Kodachi_Garou

Bekas

15-05-2008 15:57:38

Mogre uses Weak, so what does it mean? It basically means that the GCHandle reference doesn't stop the garbage collector from collecting the object. From that point on, accessing the Target property on the GCHandle simply returns null references.

And THAT is the reason why Mogre non-deterministically returns null references. Basically, in the _Init_CLRHandle method, ogre did this:

void _Init_CLRObject_##T(CLRObject* pClrObj) \
{ \
*pClrObj = gcnew Mogre::T(pClrObj); \
}


NO REFERENCE is kept anywhere from this method, and the CLRObject only keeps a WEAK reference to the created object. This means that from this point on, unless someone gets a hold of it, the garbage collector WILL mark this object for collection on the next opportunity.

Whoa, that's a great find!
The intention of using weak references was to let the garbage collector collect a wrapper object that is no longer used but I totally missed that for a little while, in this line, the object doesn't have a reference pointed to it.
Definately a strong reference is needed.

About the MACRO mess:
The reasons Mogre is full of them are because..

-I initially started writing Mogre by hand, not automatically wrapping it, and to avoid code duplication..
-..I couldn't use templates because templates are not usable by other .NET languages, so I resorted to lots of awful macros.

Marioko

15-05-2008 16:31:05

great.. about template, there is managed templates than can be used in native way and can be use it in others .NET languages.. Is something like a mix between Native templates and Generics, but with the power of native templates and portability of Generics..