
Pushing Python Performance With Parallelization

TL;DR: For certain types of programs, you can take advantage of idiosyncrasies in the Python interpreter and the host operating system to create real shared memory between processes and get some pretty good parallelization.

Premature optimization is the root of all evil.

As a developer, you've probably heard this before. What it basically means is that you shouldn't waste time optimizing code until it's already doing what you want it to do. We also live in an era of seemingly unlimited resources with AWS/Google Compute, and often the easiest way to get higher throughput from your program or service is just to pay for more instances. But sometimes it's fun to see what sort of performance you can squeeze out of a simple laptop (and save some cash at the same time).

So anyway ... I've been working on this thing, and it took too damn long to run, and I needed to run it lots and lots of times ... so, it was time to optimize. Basic optimization has two main steps:

1) Profile, to find out where your program is spending its time.
2) Speed up the parts of your program that take the most time.
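Step one is easy to try with the standard library alone. Here's a minimal sketch using `cProfile` and `pstats`; the `slow_part`/`fast_part` functions are hypothetical stand-ins for your own code:

```python
import cProfile
import io
import pstats

def slow_part():  # hypothetical hot path
    return sum(i * i for i in range(100000))

def fast_part():  # hypothetical cheap call
    return 42

def main():
    for _ in range(10):
        slow_part()
        fast_part()

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # the ten most expensive calls, by cumulative time
print(stream.getvalue())
```

The report will show `slow_part` dominating the cumulative time, which tells you where step two should focus.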

Seems pretty obvious, but there are also a lot of different reasons programs can be slow. Are you I/O bound, meaning most of the time is spent waiting for reads/writes to and from your hard disk? Are you network bound, with most of the time spent waiting for network responses? Are you memory bound, meaning your program's performance depends on how much data you can fit in RAM? Are you CPU bound? That last one is kinda the best, since the CPU is the fastest part of your computer (not counting the GPU, for tasks specifically suited to GPU programming) - usually the CPU is underused because you can't shove data through it fast enough.

Let's talk about a simple type of program that is very common. It has a big chunk of data to process sequentially, and it spits out a result. Can you fit it all in RAM? Great. Do that, and you've just drastically improved the performance of your program. But modern CPUs have multiple cores, so technically if you could run multiple processes with the same memory, you could get better performance.

But ... Python. Python is not designed for performance. If you really want a performant program with shared memory, you should probably be writing it in C++. But I enjoy the ease and speed of development in Python, and wanted to see what sort of performance I could get out of it before rewriting in C++, so I started digging. First, there's a problem with Python parallelization called the Global Interpreter Lock, or GIL: if you make a program multithreaded, only one thread actually executes Python code at a time, so threads won't take advantage of multiple CPUs/cores. The solution is the multiprocessing module, which gets around the GIL problem by using full-fledged processes rather than threads, each with its own instance of the interpreter (so you pay the memory overhead of an interpreter per process, but gain the ability to run truly simultaneously). It's quite simple to use, but it's only really useful to us if we can create shared memory for our big data blob and have each process read that same chunk of shared memory.

First, let's take a look at multiprocessing. Here's a simple example that creates a bunch of processes and waits for them all to finish. The nice thing is that it's all synchronized and automated - when a process finishes, the pool will start another process, until all the jobs are complete.
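A minimal sketch of that kind of example, in modern Python 3 syntax (the post originally used Python 2.7); the `worker` function and the job/process counts are illustrative:

```python
import multiprocessing
import time

def worker(job_id):
    return job_id  # a do-nothing job

if __name__ == "__main__":
    start = time.time()
    pool = multiprocessing.Pool(processes=4)
    # apply_async queues the jobs; the pool feeds them to free workers
    results = [pool.apply_async(worker, (i,)) for i in range(10)]
    pool.close()
    pool.join()  # block until every job has completed
    print(sorted(r.get() for r in results))
    print("Took %.3f seconds" % (time.time() - start))
```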

As you can see, it's super fast. Mainly because it's not doing anything.

For our second test, let's simulate having a big chunk of data, and passing that to our workers, since that's the type of program we're trying to optimize.
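A sketch of what this second test might look like; the list sizes and job counts here are illustrative, scaled for quick runs:

```python
import multiprocessing
import time

def worker(data):
    count = 0
    for _ in data:  # loop over the items, doing nothing with them
        count += 1
    return count

def run(num_jobs, size):
    data = list(range(size))  # a big local variable...
    pool = multiprocessing.Pool(processes=4)
    # ...which gets pickled on the calling side and unpickled in each worker
    results = [pool.apply_async(worker, (data,)) for _ in range(num_jobs)]
    pool.close()
    pool.join()
    return [r.get() for r in results]

if __name__ == "__main__":
    start = time.time()
    counts = run(num_jobs=10, size=500000)
    print("Each worker saw %d items" % counts[0])
    print("Total: %.3f seconds" % (time.time() - start))
```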

What the ... why is it suddenly so slow?? As you can see, the workers each run nearly instantly (looping over a million items and doing nothing with them is not very hard), but the total time ballooned to above seven seconds. The reason is that the one million object list is a local variable that's passed as a function parameter to the workers. What the Python interpreter will do internally is pickle the object on the calling side, then unpickle it in the worker. So not only do you get a copy of the object for each worker, you also get the expense of pickling and unpickling. No good.

(Quick side note: the multiprocessing module actually provides shared ctypes objects (Value/Array) and "managed" objects for sharing state among processes. However, for what we're doing, the managed versions are extremely slow, since every access is proxied through a separate manager process and made process-safe. They also take a very long time to create.)
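As a rough illustration of why the managed objects are slow for this workload, here is a small sketch using `multiprocessing.Manager` (names and sizes are illustrative); every element access in the worker is a round trip to the manager process:

```python
import multiprocessing

def worker(shared_list):
    total = 0
    for i in range(len(shared_list)):  # each lookup is IPC to the manager process
        total += shared_list[i]
    return total

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    shared = manager.list(range(1000))  # a proxied, process-safe list
    pool = multiprocessing.Pool(processes=2)
    results = [pool.apply_async(worker, (shared,)) for _ in range(4)]
    pool.close()
    pool.join()
    print([r.get() for r in results])
```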

Fortunately, it turns out we can do much better. On Unix-based operating systems (e.g., Linux, Mac OS X), we can take advantage of the operating system's "copy-on-write" forking behavior. This basically means that when a process forks, rather than making a copy of the program and its context, the child just uses the same physical memory until values are changed, at which point the OS copies only the changed pages into each process (that's very simplified, and OS profs are probably cringing, but it's the basic idea). But how do we get our data into this "context" that's shared? Easy - use the dreaded "global variable":
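A sketch of the global-variable version, scaled down from the post's 128 million ints so it runs quickly; the explicit `fork` context is an assumption on my part to make the copy-on-write requirement visible (on Linux, fork is already the default):

```python
import multiprocessing
import time

data = list(range(1000000))  # the dreaded global variable

def worker(job_id):
    total = 0
    for x in data:  # read from the parent's memory via copy-on-write
        total += x
    return total

if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")  # copy-on-write requires fork
    start = time.time()
    pool = ctx.Pool(processes=4)
    results = [pool.apply_async(worker, (i,)) for i in range(8)]
    pool.close()
    pool.join()
    print("All workers agree:", len(set(r.get() for r in results)) == 1)
    print("Took %.3f seconds" % (time.time() - start))
```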

As you can see, it runs fast again. Note that now our "data" object has 128 million ints rather than a million, so we can get some process time in the workers. And it's not creating a copy for the forked processes, because that would take a really really long time to pickle/unpickle (trust me). But, we introduced a new problem. Now we have a huge global variable, which is really poor form, since it will be initialized if our module gets imported. We also probably want our data to be created based on some program args, so it should be protected by a function or something. But we saw that if we pass the data to our workers after it's created in __main__, it gets copied. How can we solve this problem?

There's actually a simple solution. Code under the `if __name__ == '__main__':` guard still executes at module-global scope, so if we define our worker function there, in the same scope where we build our data object, both end up in the "global" context of the process, but only when the script is actually run, not when the module is merely imported. Thus the data won't be copied like the local variable from our earlier attempt. Let's also increase the number of "jobs" from 10 to 100, because it takes a bit of time for parallelization to "warm up" (likely due to caching, which is a whole other topic for performance optimization).

As you can see, we're getting true parallelization now! With one worker, it takes about 20 seconds, and when we double to two, we get nearly twice the performance! Also note that my MacBook has a dual-core CPU with four virtual cores ("hyperthreading"), so going from two workers to four only gets us a small additional increase.

And that's it! Now you have a highly parallel data processor using shared memory in Python. The above examples were originally run with Python 2.7 and PyPy on OS X 10.10.5. Note that PyPy and CPython behave somewhat differently, and will crash at different places in the automatic pickling/unpickling of objects. Also, this won't work at all on Windows, because Windows has no fork(): multiprocessing there must spawn a fresh interpreter for each worker and pickle everything it needs, so there's no copy-on-write sharing. But if your program is constrained in the way described above, and you're running on 'Nix, you can actually get some pretty decent performance!

Happy parallelizing!

