I've been playing around with concurrency problems in simple programs and thought this one would be a good genre to try to think about. The main problems with concurrency in this type of simulation are the following:
With that said I've come up with the following model that requires only 9 "sync points" for 4 threads in the main simulation.
(click for full size) Particles in the concurrent sections must obey a speed limit of 1/6 the board size to prevent having to worry about memory access. Things that generate instantaneous lines of particles or must move faster than that would need to be caught in the preproccessing stage at the beginning.
For normal simulations I'd expect this to give a ~2x performance boost over a single threaded implementation on a 4 core or higher machine. Before I go and try to see if this is feasible I was wondering if anyone else had come up with a clever way to squeeze out more concurrency with lower overhead?
Thanks