Monday, August 15, 2016

Pushing Python Performance With Parallelization

TL;DR: For certain types of programs, you can take advantage of idiosyncrasies in the Python interpreter and the host operating system to create real shared memory between processes and get some pretty good parallelization.

Premature optimization is the root of all evil.

As a developer, you've probably heard this before, and what it means basically is that you shouldn't waste time optimizing code unless it's already doing what you want it to do. We also live in an era of seemingly unlimited resources with AWS/Google Compute, and often the easiest way to get higher throughput in your programs or service is just to pay for more instances. But sometimes it's fun to see what sort of performance we can get on a simple laptop (and save some cash at the same time).

So anyway ... I've been working on this thing, and it took too damn long to run, and I needed to run it lots and lots of times ... so, it was time to optimize. Basic optimization has two main steps:

1) Profile, to find out where your program is slow.
2) Speed up the parts of your program that run the most.

Seems pretty obvious, but there are also a lot of different reasons programs can be slow. Are you I/O bound, meaning most of the time is spent waiting for reads/writes to and from your hard disk? Are you network bound, with most of the time waiting for network responses? Are you memory bound, meaning your program's performance is dependent on how much data you can fit in RAM? Are you CPU bound? That last one is kinda the best, since the CPU is the fastest part of your computer (not counting GPU and tasks specifically suited for GPU programming), meaning usually the CPU is underused because you can't shove data through it fast enough.

Let's talk about a simple type of program that is very common. It has a big chunk of data to process sequentially, and it spits out a result. Can you fit it all in RAM? Great. Do that, and you've just drastically improved the performance of your program. But modern CPUs have multiple cores, so technically if you could run multiple processes with the same memory, you could get better performance.

But ... Python. Python is not designed for performance. If you really want a performant program with shared memory, you should probably be coding it in C++. But I enjoy the ease and speed of development with Python, and wanted to see what sort of performance I could get out of it before rewriting in C++, so I started digging. First, there's a problem with Python parallelization called the Global Interpreter Lock, or GIL, meaning if you make a program multithreaded, only one thread is actually running at the same time (due to the GIL), so it won't take advantage of multiple CPUs/cores. The solution is the multiprocessing module, which gets around the GIL program by using full-fledged processes, each with their own instance of the interpreter, rather than threads (so you have the memory overhead of an interpreter instance for each process, but the advantage that they can run simultaneously). It's quite simple to use, but this is only really useful to us if we can create shared memory for our big data blob, and have each process read that same chunk of shared memory.

First, let's take a look at multiprocessing. Here's a simple example that creates a bunch of processes and waits for them all to finish. The nice thing is that it's all synchronized and automated - when a process finishes, the pool will start another process, until all the jobs are complete.

As you can see, it's super fast. Mainly because it's not doing anything.

For our second test, let's simulate having a big chunk of data, and passing that to our workers, since that's the type of program we're trying to optimize.

What the ... why is it suddenly so slow?? As you can see, the workers each run nearly instantly (looping over a million items and doing nothing with them is not very hard), but the total time ballooned to above seven seconds. The reason is that the one million object list is a local variable that's passed as a function parameter to the workers. What the Python interpreter will do internally is pickle the object on the calling side, then unpickle it in the worker. So not only do you get a copy of the object for each worker, you also get the expense of pickling and unpickling. No good.

(Quick side note: Python actually provides "shared" and "managed" objects for creating shared memory among processes. However, for what we're doing, they're extremely slow, since every access is proxied and made process-safe. They also take a very long time to create.)

Fortunately, it turns out we can do much better. On Unix-based operating systems (i.e., Linux, Mac OSX), we can take advantage of the operating system's "copy-on-write" forking behavior. This basically means that when the process forks, rather than making a copy of the program and its context, it will just use the same copy of memory unless values are changed, in which case it will create a copy of the changed part in each process (that's very simplified, and OS profs are probably cringing, but it's the basic idea). But how do we get our data into this "context" that's shared? Easy - use the dreaded "global variable":

As you can see, it runs fast again. Note that now our "data" object has 128 million ints rather than a million, so we can get some process time in the workers. And it's not creating a copy for the forked processes, because that would take a really really long time to pickle/unpickle (trust me). But, we introduced a new problem. Now we have a huge global variable, which is really poor form, since it will be initialized if our module gets imported. We also probably want our data to be created based on some program args, so it should be protected by a function or something. But we saw that if we pass the data to our workers after it's created in __main__, it gets copied. How can we solve this problem?

There's actually a simple solution. If we define our worker function in the same scope that we build our data object, then the worker function will have access to it, and the Python interpreter will put this in the "global" context of the process. Thus, it won't be copied like our local variable from our earlier attempt. Let's also increase the number of "jobs" from 10 to 100, because it takes a bit of time for parallelization to "warm up" (likely due to caching, which is a whole other topic for performance optimization).

As you can see, we're getting true parallelization now! With one worker, it takes about 20 seconds, and when we double to two, we get nearly twice the performance! Also note that my Macbook has a dual-core CPU, but four virtual cores ("hyperthreading"), so going from two workers to four only gets us a small increase in performance.

And that's it! Now you have a highly parallel data processor using shared memory in Python. The above examples were specifically run with Python 2.7 and PyPy on OS X 10.10.5. Note that PyPy and CPython behave somewhat differently, and will crash at different places in the automatic pickling/unpickling of objects. And also, apparently this won't work at all on Windows, because Windows will copy the entire process before forking. But if your program is constrained in the way described above, and you're running on 'Nix, you can actually get some pretty decent performance!

Happy parallelizing!

No comments:

Post a Comment