Both of these are in dire need. I suggest OpenCL because it operates on both classes of graphics card whereas CUDA is NVidia specific. Multithreading is also needed but only in the case of OpenCL being impossible.
The fact that adding it to the Toy at this stage would require extensive re-writing of the code only means that the code should be re-written from the ground up, not that it is complex or impossible. Start from scratch and you'll make something better than if you try to adapt the changes into an old Frankenstein's Monster of code.
OpenCL is for computing upon the more limited architexture. Implimented properly, it can do physics and heat-simulation and the like. In other words, it can handle much of the computing required, most likely at a much faster speed. It's very good for some tasks, which is why the bitcoin-mining world latched onto it for so long.
Given that I'm not the first to suggest OpenCL and I hope I'm not the first to suggest multithreading, that means a giant pile of "perfect timing" was just glossed over and avoided. Fine, the code was just rewritten. Hopefully that means adding some more bits in will be easy. If not? Then it was a wasted oppurtunity.
Actually, most of the simulation code was not rewritten, the old functions were just moved and renamed.
So then, it really is a frankenstein's monster of code?