Skip to main content

Pushing Python Performance With Parallelization

TL;DR: For certain types of programs, you can take advantage of idiosyncrasies in the Python interpreter and the host operating system to create real shared memory between processes and get some pretty good parallelization.

Premature optimization is the root of all evil.

As a developer, you've probably heard this before, and what it means basically is that you shouldn't waste time optimizing code unless it's already doing what you want it to do. We also live in an era of seemingly unlimited resources with AWS/Google Compute, and often the easiest way to get higher throughput in your programs or service is just to pay for more instances. But sometimes it's fun to see what sort of performance we can get on a simple laptop (and save some cash at the same time).

So anyway ... I've been working on this thing, and it took too damn long to run, and I needed to run it lots and lots of times ... so, it was time to optimize. Basic optimization has two main steps:

1) Profile, to find out where your program is slow.
2) Speed up the parts of your program that run the most.

Seems pretty obvious, but there are also a lot of different reasons programs can be slow. Are you I/O bound, meaning most of the time is spent waiting for reads/writes to and from your hard disk? Are you network bound, with most of the time waiting for network responses? Are you memory bound, meaning your program's performance is dependent on how much data you can fit in RAM? Are you CPU bound? That last one is kinda the best, since the CPU is the fastest part of your computer (not counting GPU and tasks specifically suited for GPU programming), meaning usually the CPU is underused because you can't shove data through it fast enough.

Let's talk about a simple type of program that is very common. It has a big chunk of data to process sequentially, and it spits out a result. Can you fit it all in RAM? Great. Do that, and you've just drastically improved the performance of your program. But modern CPUs have multiple cores, so technically if you could run multiple processes with the same memory, you could get better performance.

But ... Python. Python is not designed for performance. If you really want a performant program with shared memory, you should probably be coding it in C++. But I enjoy the ease and speed of development with Python, and wanted to see what sort of performance I could get out of it before rewriting in C++, so I started digging. First, there's a problem with Python parallelization called the Global Interpreter Lock, or GIL, meaning if you make a program multithreaded, only one thread is actually running at the same time (due to the GIL), so it won't take advantage of multiple CPUs/cores. The solution is the multiprocessing module, which gets around the GIL program by using full-fledged processes, each with their own instance of the interpreter, rather than threads (so you have the memory overhead of an interpreter instance for each process, but the advantage that they can run simultaneously). It's quite simple to use, but this is only really useful to us if we can create shared memory for our big data blob, and have each process read that same chunk of shared memory.

First, let's take a look at multiprocessing. Here's a simple example that creates a bunch of processes and waits for them all to finish. The nice thing is that it's all synchronized and automated - when a process finishes, the pool will start another process, until all the jobs are complete.

As you can see, it's super fast. Mainly because it's not doing anything.

For our second test, let's simulate having a big chunk of data, and passing that to our workers, since that's the type of program we're trying to optimize.

What the ... why is it suddenly so slow?? As you can see, the workers each run nearly instantly (looping over a million items and doing nothing with them is not very hard), but the total time ballooned to above seven seconds. The reason is that the one million object list is a local variable that's passed as a function parameter to the workers. What the Python interpreter will do internally is pickle the object on the calling side, then unpickle it in the worker. So not only do you get a copy of the object for each worker, you also get the expense of pickling and unpickling. No good.

(Quick side note: Python actually provides "shared" and "managed" objects for creating shared memory among processes. However, for what we're doing, they're extremely slow, since every access is proxied and made process-safe. They also take a very long time to create.)

Fortunately, it turns out we can do much better. On Unix-based operating systems (i.e., Linux, Mac OSX), we can take advantage of the operating system's "copy-on-write" forking behavior. This basically means that when the process forks, rather than making a copy of the program and its context, it will just use the same copy of memory unless values are changed, in which case it will create a copy of the changed part in each process (that's very simplified, and OS profs are probably cringing, but it's the basic idea). But how do we get our data into this "context" that's shared? Easy - use the dreaded "global variable":

As you can see, it runs fast again. Note that now our "data" object has 128 million ints rather than a million, so we can get some process time in the workers. And it's not creating a copy for the forked processes, because that would take a really really long time to pickle/unpickle (trust me). But, we introduced a new problem. Now we have a huge global variable, which is really poor form, since it will be initialized if our module gets imported. We also probably want our data to be created based on some program args, so it should be protected by a function or something. But we saw that if we pass the data to our workers after it's created in __main__, it gets copied. How can we solve this problem?

There's actually a simple solution. If we define our worker function in the same scope that we build our data object, then the worker function will have access to it, and the Python interpreter will put this in the "global" context of the process. Thus, it won't be copied like our local variable from our earlier attempt. Let's also increase the number of "jobs" from 10 to 100, because it takes a bit of time for parallelization to "warm up" (likely due to caching, which is a whole other topic for performance optimization).

As you can see, we're getting true parallelization now! With one worker, it takes about 20 seconds, and when we double to two, we get nearly twice the performance! Also note that my Macbook has a dual-core CPU, but four virtual cores ("hyperthreading"), so going from two workers to four only gets us a small increase in performance.

And that's it! Now you have a highly parallel data processor using shared memory in Python. The above examples were specifically run with Python 2.7 and PyPy on OS X 10.10.5. Note that PyPy and CPython behave somewhat differently, and will crash at different places in the automatic pickling/unpickling of objects. And also, apparently this won't work at all on Windows, because Windows will copy the entire process before forking. But if your program is constrained in the way described above, and you're running on 'Nix, you can actually get some pretty decent performance!

Happy parallelizing!


  1. Great article with clear and easy explanation, right there!
    Hopefully, we can keep in touch in Korea.
    Please, visit our site below and it would be great if you send a feedback.


Post a Comment

Popular posts from this blog

Why Korean Is Hard For Native English Speakers

A couple of days ago, as an experiment, I wrote my first blog post ever in a non-English language. It was an attempt to explain some of the reasons that Korean is hard to learn for native English speakers, so I figured I might as well try to write it in Korean. Those of you who actually read Korean can see how awkward the attempt was =).

In any case, the post came from an email conversation I had with The Korean from Ask a Korean, a fantastically well-written blog about all things Korea from the perspective of a Korean who moved to the United States during high school. Since I tend to geek out on language things, I figured I might as well post part of that conversation. An edited version follows.


Out of the languages that I've attempted to learn so far, Korean has been the hardest. I've done a lot of meta thinking about learning Korean, and I think there are a number of reasons it's difficult for non-Koreans (and especially Westerners) to learn:

1) Obviously, the…

Don't Take Korean Language Advice From Kyopos

I'm not sure why it took me so long to figure this out, but the last people you should take Korean language advice from are kyopos (foreign-born or raised Koreans). That being said, if you do follow their advice, you will get many laughs from Koreans. Some of my personal favorites, all of which actually happened to me:

- When I first got to Korea, I was at some open-air event, and during a break I started talking to one of the hosts. He said he was only a part-time host, so I asked him what his full-time job was, and he said "백수" (which is slang for "unemployed guy"). I asked him what that was, and he replied, "Comedian". So then the next few people I met, I proudly told I was a baeksu. (Edit: Actually, this guy was Korean Korean, not kyopo.)

- Next, a kyopo who lived in the apartment I moved into back in 2010 asked me what I was doing in Korea, and I told him I was starting a company, and asked how to say that in Korean in case people ask. He told me…

영어가 모국어인 사람들은 왜 한국어를 배우기가 어려운 이유

이 포스트는 내 처음 한국어로 블로그 포스트인데, 한국어에 대하니까 잘 어울린다. =) 자, 시작합시다! 왜 외국사람에게 한국어를 배우기가 어렵다? 난 한국어를 배우고 있는 사람이라서 이 문제에 대해 많이 생각하고 있었다. 여러가지 이유가 있는데 오늘 몇 이유만 논할 것이다.

1. 분명히 한국어 문법은 영어에 비해 너무 많이 다른다. 영어는 “오른쪽으로 분지(分枝)의 언어"라고 하는데 한국어는 “왼쪽으로 분지의 언어"이다. 뜻이 무엇이나요? 예를 보면 이해할 수 있을 것이다. 간단한 문장만 말하면 (외국어를 말하는 남들은 간단한 문장의 수준을 지낼 수가 약간 드물다), 간단한 걸 기억해야 돼: 영어는 “SVO”인데 한국어는 “SOV”이다. “I’m going to school”라고 한국어로는 “저는 학교에 가요"라고 말한다. 영어로 똑바로 번역하면 “I’m school to go”이다. 두 언어 다르는 게 목적어와 동사의 곳을 교환해야 한다. 별로 어렵지 않다. 하지만, 조금 더 어렵게 만들자. “I went to the restaurant that we ate at last week.” 한국어로는 “전 우리 지난 주에 갔던 식당에 또 갔어요"라고 말한다. 영어로 똑바로 번역하면 “I we last week went to restaurant to again went”말이다. 한국어가 왼쪽으로 분지 언어라서 문장 중에 왼쪽으로 확대한다! 이렇게 좀 더 쉽게 볼 수 있다: “전 (우리 지난 주에 갔던 식당)에 또 갔어요”. 주제가 “전"이고 동사가 “갔다"이고 목적어가 “우리 지난 주에 갔던 식당"이다. 영어 문장은 오른쪽으로 확대한다: I (S) went (V) to (the restaurant (that we went to (last week))) (O). 그래서 두 숙어 문장 만들고 싶으면 생각속에서도 순서를 변해야 된다.

2. 첫 째 점이니까 다른 사람을 자기 말을 아라들게 하고 싶으면, 충분히 …