TL;DR: For certain types of programs, you can take advantage of idiosyncrasies in the Python interpreter and the host operating system to create real shared memory between processes and get some pretty good parallelization. Premature optimization is the root of all evil. As a developer, you've probably heard this before, and what it means basically is that you shouldn't waste time optimizing code unless it's already doing what you want it to do. We also live in an era of seemingly unlimited resources with AWS/Google Compute, and often the easiest way to get higher throughput in your programs or service is just to pay for more instances. But sometimes it's fun to see what sort of performance we can get on a simple laptop (and save some cash at the same time). So anyway ... I've been working on this thing, and it took too damn long to run, and I needed to run it lots and lots of times ... so, it was time to optimize. Basic optimization has two main steps: 1) P...