Today I managed to test out my idea for a job-list based parallel processing class. When I got home I actually implemented it.
The idea is simple. There is a static class (a C++ namespace) called xiJobManager which holds a ring of jobs, and every time you prepare a job it gets placed on the manager's ring. If the slot at that time happens to be taken up by a job that hasn't been processed yet, that job is deleted and replaced with the new job (as this is for games, any out-of-date information about the game world is expected to be updated within a few milliseconds anyway).
When the manager is created, it spawns 4 xiWorkerThreads that constantly check the manager's job ring for new jobs. Because the threads aren't synchronised with each other, they each end up checking the ring at almost random positions.
Each thread increments the cursor on the ring. The idea is that once a thread takes a job, that cursor position has its job removed, so either the next new job is placed there or the cursor is moved on when the next thread checks for jobs. This way the newest jobs will ALWAYS be placed at the end of the list. The only negative is the empty cursor positions that get cycled through.
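To make the ring idea a bit more concrete, here is a minimal sketch of how such a ring could look. This is not the actual xiJobManager code: the names Job and JobRing are made up for illustration, and I am assuming separate write and read cursors guarded by a single mutex, which may differ from the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical job record: a plain function pointer plus its argument.
struct Job {
    void (*fn)(void*) = nullptr;
    void* arg = nullptr;
};

// Hypothetical fixed-size job ring (an assumption, not the real xiJobManager).
class JobRing {
public:
    explicit JobRing(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0 && "ring needs at least one slot");
    }

    // Place a job at the write cursor and advance it.
    // Returns true if an unprocessed job was dropped to make room.
    bool push(const Job& job) {
        std::lock_guard<std::mutex> lock(mutex_);
        const bool overwritten = slots_[write_].has_value();
        slots_[write_] = job;
        write_ = (write_ + 1) % slots_.size();
        return overwritten;
    }

    // Check the slot at the read cursor, clear it, and advance the cursor.
    // An empty slot yields std::nullopt, so empty positions still cost a call.
    std::optional<Job> take() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::optional<Job> job = slots_[read_];
        slots_[read_].reset();
        read_ = (read_ + 1) % slots_.size();
        return job;
    }

private:
    std::vector<std::optional<Job>> slots_;
    std::size_t write_ = 0;
    std::size_t read_ = 0;
    std::mutex mutex_;
};
```

A worker thread would call take() in a loop and run whatever job it gets back. The overwrite in push() is the "stale job gets replaced" behaviour described above.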
To add a job you simply call PrepareJob( functionPointer, functionArgumentPointer, functionReturnPointer ). All three are optional params, and the return pointer means it even supports getting the results of the call.
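As a rough illustration of that call shape, here is a sketch of what such a signature could look like. Only the name PrepareJob and the order of the three parameters come from the description above. The JobFunction type, the PreparedJob struct, RunJob and SquareJob are hypothetical, and in this sketch PrepareJob just returns the job record instead of placing it on the manager's ring.

```cpp
#include <cassert>

// Assumed job signature: the job reads its input through `argument`
// and writes its output through `result`.
using JobFunction = void (*)(void* argument, void* result);

// Hypothetical record of one prepared job.
struct PreparedJob {
    JobFunction function;
    void* argument;
    void* result;
};

// All three parameters default to nullptr, so each one is optional.
PreparedJob PrepareJob(JobFunction function = nullptr,
                       void* argument = nullptr,
                       void* result = nullptr) {
    return PreparedJob{function, argument, result};
}

// What a worker thread would do with a job it has taken.
void RunJob(const PreparedJob& job) {
    if (job.function != nullptr) {
        job.function(job.argument, job.result);
    }
}

// Example job: squares an int and stores it in the result pointer.
void SquareJob(void* argument, void* result) {
    assert(argument != nullptr && result != nullptr);
    const int value = *static_cast<int*>(argument);
    *static_cast<int*>(result) = value * value;
}
```

With this, RunJob(PrepareJob(SquareJob, &input, &output)) leaves the square of input in output. The caller has to keep both pointers alive until the job has actually run, which is the usual price of passing raw void pointers around.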
I did some tests, and sadly the constant running of the 4 threads is a bottleneck in itself. At the moment I don't have any intense processing beyond movement though, so I expect the benefits of this will shine through once I have collision detection, path finding and AI running in the game world.
I have also made it sort-of configurable. I default the engine to 4 worker threads (I will probably reduce this to 1 in the future) and you can recreate the job manager at any time with more threads. When you drop the job manager, it finishes all tasks before moving the workers back onto the main thread and cleaning up memory, so I can have a setting in my game's options menu where you can tweak how many cores you want the engine to run on.
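Here is a small sketch of the "configurable worker count, finish everything on shutdown" behaviour. It is not the engine's code. To keep it short I have swapped the job ring and the constant polling for a plain std::queue with a condition variable, and the class name JobManagerSketch and the choice to run jobs inline when there are zero workers are assumptions of mine.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the job manager (queue based, not ring based).
class JobManagerSketch {
public:
    explicit JobManagerSketch(unsigned workerCount) {
        for (unsigned i = 0; i < workerCount; ++i) {
            workers_.emplace_back([this] { WorkerLoop(); });
        }
    }

    // Dropping the manager: workers drain the queue, then get joined.
    ~JobManagerSketch() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& worker : workers_) {
            worker.join();
        }
    }

    JobManagerSketch(const JobManagerSketch&) = delete;
    JobManagerSketch& operator=(const JobManagerSketch&) = delete;

    void PrepareJob(std::function<void()> job) {
        assert(job && "job must be callable");
        if (workers_.empty()) {  // zero workers: run on the calling thread
            job();
            return;
        }
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) {
                    return;  // stopping and nothing left to do
                }
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can take jobs
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Changing the thread count from an options menu would then be a matter of destroying the current manager and constructing a new one with the chosen count. Because the destructor only returns once the queue is empty and every worker has been joined, no prepared job is lost in the switch.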
This stuff would be brilliant on high-core machines. On my desktop there is a good speed-up going from 1 worker thread to 4 (although 0 worker threads is still the fastest, due to that bottleneck I mentioned). At 5+ worker threads my machine starts to chug with the work load. My machine is quad-core with 8 hardware threads, so at 5+ at least one physical core ends up running 2 worker threads, which isn't good for the high-speed game world that they run in.
I expect that 8-core PCs will benefit a lot from this kind of code when they eventually reach the mass market. It also prepares my engine for possible console ports, which is good.
This also helps a LOT with high-resolution user input. My mouse suddenly felt massively more responsive with the input code running on another thread.