[SOLVED] Multi-threaded crashes


23-05-2014 22:10:04

I don't remember seeing this problem before, but for some reason I'm getting a lot of multi-threaded related crashes that are really confusing and hard to track down. I'm mostly curious about whether or not anyone has any ideas about what might be causing the problem or where to look.

The problem I'm seeing is that after letting the application run for a while there are a lot of sound objects being created and destroyed for playing sound effects of the various characters walking around (usually crashes somewhere around 800 or upwards of 3000 sounds in about a minute or so). Somehow in the midst of all this when OgreOggSoundManager::_performAction() goes to handle an action OgreOggSoundManager::hasSound() is returning that it does have the sound, but then immediately after when OgreOggSoundManager::getSound() is called it returns NULL. In some cases the sound is missing from the container and in other cases the sound is in the container but somehow OgreOggSoundManager::getSound() didn't find it. I assume the former is that it is being removed somehow between the calls to hasSound() and getSound(), and the later case I don't understand at all since it means that it was in the container when hasSound() was called, got removed before getSound() was called, and then got immediately added back before the returned NULL pointer is used.

All in all I'm extremely confused and really not sure how to approach this problem at all. I'm considering putting a mutex around all accesses of the mSoundMap and see if that catches any issues, but that's most likely going to slow down the library considerably so it's not really a good solution.

EDIT: On a related note, I'm also having issues where OgreOggSoundManager::_sortNearToFar() or OgreOggSoundManager::_sortFarToNear() are failing to sort properly. When it gets to the line where the mActiveSounds container is being sorted it fails with an error stating that sequence is not ordered. I think this is also a multi-threading issue where the position of the listener is being changed in the main thread while the sound thread is in the middle of sorting the container, which would cause the sorting to get confused. This one seems a lot harder to fix unfortunately. At the moment the only idea I have is to add some sort of flag to OgreOggSoundManager that gets set before mActiveSounds is sorted and cleared immediately after, which the main thread can use to check if it's safe to update the listener position. If the flag is set then it must postpone updating the listener position to avoid this problem. Thoughts? Any other ideas on how to fix this?

EDIT: Found yet another multi-threaded related bug. If you have a lot of temporary sounds it is possible to get into a state where a sound gets added to the mSoundsToDestroy container twice, which will crash when it gets to the second entry since it's already destroyed. The cause of this is the _reactivateQueuedSoundsImpl() functionality. Since it automatically stops sounds sometimes, stopping the sound causes temporary sounds to get added to the mSoundsToDestroy even though it shouldn't be since it's not technically stopped yet and will be resumed later once _reactivateQueuedSoundsImpl() is called. Not sure how to deal with this one yet.


24-05-2014 18:27:17

I thought of an alternative solution to the second problem (the listener position being changed out from under the sorting functor). Instead of adding this flag that the user has to know about and check before modifying the listener, I propose that we instead add a new mutex to the OgreOggListener class, which will be used to wrap all functions that modify or look at any of the various members. With this mutex in place the _sortNearToFar and _sortFarToNear functors will be changed so that they no longer call OgreOggSoundManager::getSingleton().getListener()->getPosition() directly, but are instead passed the Ogre::Vector3 listener position in their constructor. This way the code can query the listener for its position once, right before sorting, pass it to the functor, and then not have to worry about wondering if the value is going to change out from under it or not. Unfortunately, like the solution for the first issue, this means there is going to be more contention for resources with another mutex around all the listener data, but I don't see any other way to possibly fix this problem without more multi-threading issues. How does this sound? I haven't implemented this yet but it should be fairly easy to do.


24-05-2014 20:50:06

Found more information about the third crash. The problem is caused by temporary sounds being added to the mSoundsToReactivate container when OgreOggSoundManager::_requestSoundSource() is called. It is being added at the bottom of the else (approximately line 1125). As to the fix, this is my current idea:

  1. 1. Add a new value to the SoundState enum, SS_DESTROYED.
    2. Wherever _destroyTemporarySound() is being called, set the state to SS_DESTROYED.
    3. Back in OgreOggSoundManager::_requestSoundSource(), before adding the sound to the mSoundsToReactivate container, check to see if the sound state is SS_DESTROYED. If it is then do not add it to the mSoundsToReactivate since it is going to be destroyed sometime soon.[/list:u]

    To make absolutely sure the sound isn't being messed with after it has been queued up for destruction it might also be a good idea to add assertions in the various play/pause/stop functions (including the _impl versions) to check that the state is not SS_DESTROYED. This will help detect any further bugs that could cause problems since once _destroyTemporarySound() has been called on a sound its state should never be changed or else it will cause a crash.

    How does this sound?


29-05-2014 23:13:24

I've compiled all the fixes I have together into a single patch so they can be looked over. You can download the patch file here. Let me know if you need to change anything.


04-06-2014 21:54:08

Thanks for the detailed investigation and fixes, what you have proposed and supplied sounds good to me, haven't got a lot of time to thoroughly test this at the moment, I've committed the fixes but will note its potentially unstable.