
Pushing Python Performance With Parallelization

TL;DR: For certain types of programs, you can take advantage of idiosyncrasies in the Python interpreter and the host operating system to create real shared memory between processes and get some pretty good parallelization.

Premature optimization is the root of all evil.

As a developer, you've probably heard this before, and what it means basically is that you shouldn't waste time optimizing code unless it's already doing what you want it to do. We also live in an era of seemingly unlimited resources with AWS/Google Compute, and often the easiest way to get higher throughput in your programs or service is just to pay for more instances. But sometimes it's fun to see what sort of performance we can get on a simple laptop (and save some cash at the same time).

So anyway ... I've been working on this thing, and it took too damn long to run, and I needed to run it lots and lots of times ... so, it was time to optimize. Basic optimization has two main steps:

1) Profile, to find out where your program is slow (there's a quick sketch of this below).
2) Speed up the parts where your program spends the most time.
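
For step 1, the standard library's cProfile module is usually all you need. A minimal sketch, where crunch() is just a stand-in for whatever your slow code actually does:

    import cProfile
    import pstats

    def crunch():
        # Stand-in for the code you actually want to profile.
        return sum(i * i for i in range(1000000))

    # Run the call under the profiler and dump the stats to a file.
    cProfile.run('crunch()', 'profile.stats')
    # Print the ten entries where the most cumulative time was spent.
    pstats.Stats('profile.stats').sort_stats('cumulative').print_stats(10)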

Seems pretty obvious, but there are also a lot of different reasons programs can be slow. Are you I/O bound, meaning most of the time is spent waiting for reads/writes to and from your hard disk? Are you network bound, with most of the time spent waiting for network responses? Are you memory bound, meaning your program's performance depends on how much data you can fit in RAM? Are you CPU bound? That last one is kinda the best place to be, since the CPU is the fastest part of your computer (not counting the GPU and tasks specifically suited to GPU programming); in the other cases the CPU usually sits underused because you can't shove data through it fast enough.

Let's talk about a simple type of program that is very common. It has a big chunk of data to process sequentially, and it spits out a result. Can you fit it all in RAM? Great. Do that, and you've just drastically improved the performance of your program. But modern CPUs have multiple cores, so technically if you could run multiple processes with the same memory, you could get better performance.

But ... Python. Python is not designed for performance. If you really want a performant program with shared memory, you should probably be writing it in C++. But I enjoy the ease and speed of development with Python, and wanted to see what sort of performance I could get out of it before rewriting in C++, so I started digging. First, there's a problem with Python parallelization called the Global Interpreter Lock, or GIL: if you make a program multithreaded, only one thread actually runs at a time, so it won't take advantage of multiple CPUs/cores. The solution is the multiprocessing module, which gets around the GIL problem by using full-fledged processes, each with its own instance of the interpreter, rather than threads (so you pay the memory overhead of an interpreter instance per process, but gain the ability to run them simultaneously). It's quite simple to use, but it's only really useful to us if we can create shared memory for our big data blob and have every process read that same chunk of shared memory.

First, let's take a look at multiprocessing. Here's a simple example that creates a bunch of processes and waits for them all to finish. The nice thing is that it's all synchronized and automated - when a process finishes, the pool will start another process, until all the jobs are complete.
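
Something along these lines - the pool size, job count, and do-nothing worker are just placeholders:

    import time
    from multiprocessing import Pool

    def worker(job_id):
        # Placeholder worker: does no actual work yet.
        return job_id

    if __name__ == '__main__':
        start = time.time()
        pool = Pool(processes=4)               # four worker processes
        results = pool.map(worker, range(10))  # blocks until all 10 jobs finish
        pool.close()
        pool.join()
        print('total time: %.3f seconds' % (time.time() - start))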

As you can see, it's super fast. Mainly because it's not doing anything.

For our second test, let's simulate having a big chunk of data, and passing that to our workers, since that's the type of program we're trying to optimize.
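
Something like the following, where each job receives the whole one-million-item list as a parameter (the per-worker timing is there to show how quick the loop itself is):

    import time
    from multiprocessing import Pool

    def worker(data):
        # Loop over a million items and do nothing with them.
        start = time.time()
        for item in data:
            pass
        return time.time() - start  # per-worker time: nearly instant

    if __name__ == '__main__':
        data = list(range(1000000))  # our "big" chunk of data: a million ints
        start = time.time()
        pool = Pool(processes=4)
        # Every job gets the whole list as an argument, so it has to be
        # shipped to each worker process separately.
        worker_times = pool.map(worker, [data] * 10)
        pool.close()
        pool.join()
        print('worker times: %s' % worker_times)
        print('total time: %.3f seconds' % (time.time() - start))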

What the ... why is it suddenly so slow?? As you can see, the workers each run nearly instantly (looping over a million items and doing nothing with them is not very hard), but the total time ballooned to above seven seconds. The reason is that the one-million-item list is a local variable that's passed as a function parameter to the workers. Under the hood, multiprocessing pickles the object on the calling side and unpickles it in the worker. So not only does each worker get its own copy of the object, you also pay the cost of pickling and unpickling it every single time. No good.

(Quick side note: Python actually provides "shared" and "managed" objects for creating shared memory among processes. However, for what we're doing, they're extremely slow, since every access is proxied and made process-safe. They also take a very long time to create.)
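
For reference, the managed version looks something like this; even building the list is noticeably slow, and every later access is a round-trip to the manager process:

    from multiprocessing import Manager

    if __name__ == '__main__':
        manager = Manager()
        # Every access to `shared` is proxied through the manager process
        # to make it process-safe, which is exactly what kills performance.
        shared = manager.list(range(1000000))
        print('%d items, first is %d' % (len(shared), shared[0]))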

Fortunately, it turns out we can do much better. On Unix-based operating systems (e.g., Linux, Mac OS X), we can take advantage of the operating system's "copy-on-write" forking behavior. This basically means that when a process forks, rather than making a copy of the program and its memory, the child uses the same memory as the parent, and only the parts that get modified are copied into each process (that's very simplified, and OS profs are probably cringing, but it's the basic idea). But how do we get our data into this shared context? Easy - use the dreaded "global variable":
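
A sketch of that version - the data is now a module-level global, built before the pool forks, and the worker body is just a stand-in that walks over it:

    import time
    from multiprocessing import Pool

    # The dreaded global: built once in the parent process, then inherited
    # by the forked workers via copy-on-write instead of being pickled.
    # Note: 128 million ints takes several GB of RAM under CPython 2;
    # PyPy stores them much more compactly.
    data = list(range(128 * 1000 * 1000))

    def worker(job_id):
        count = 0
        for item in data:  # read-only pass over the shared data
            count += 1
        return count

    if __name__ == '__main__':
        start = time.time()
        pool = Pool(processes=4)
        pool.map(worker, range(10))
        pool.close()
        pool.join()
        print('total time: %.3f seconds' % (time.time() - start))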

As you can see, it runs fast again. Note that our "data" object now has 128 million ints rather than a million, so the workers actually have some processing to do. And we can tell it's not creating a copy for each forked process, because pickling and unpickling that much data would take a really, really long time (trust me). But we've introduced a new problem: we now have a huge global variable, which is really poor form, since it will be initialized whenever our module gets imported. We also probably want our data to be built based on some program args, so it should be protected by a function or something. But we saw that if we pass the data to our workers after it's created in __main__, it gets copied. How can we solve this problem?

There's actually a simple solution. If we define our worker function in the same scope where we build our data object - inside the __main__ block - the worker function will have access to it, and since that block isn't a new scope, both names still live in the process's global context. Thus, the data won't be copied like the local variable from our earlier attempt. Let's also increase the number of "jobs" from 10 to 100, because parallelization takes a bit of time to "warm up" (likely due to caching, which is a whole other topic for performance optimization).
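
One way to arrange that (the worker body and the command-line handling are my own fill-ins):

    import sys
    import time
    from multiprocessing import Pool

    if __name__ == '__main__':
        # This block only runs when executed as a script, not on import, but
        # an "if" block isn't a new scope, so both names below are still
        # module-level globals that forked workers inherit copy-on-write.
        data = list(range(128 * 1000 * 1000))  # 128 million ints

        def worker(job_id):
            count = 0
            for item in data:  # read-only pass over the shared data
                count += 1
            return count

        num_workers = int(sys.argv[1]) if len(sys.argv) > 1 else 2
        start = time.time()
        pool = Pool(processes=num_workers)
        pool.map(worker, range(100))  # 100 jobs now instead of 10
        pool.close()
        pool.join()
        print('%d workers: %.1f seconds' % (num_workers, time.time() - start))

Running it with 1, 2, and 4 workers gives the comparison below.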

As you can see, we're getting true parallelization now! With one worker it takes about 20 seconds, and when we double to two workers we get nearly twice the performance. Also note that my MacBook has a dual-core CPU with four virtual cores ("hyperthreading"), so going from two workers to four only buys a small additional gain.

And that's it! Now you have a highly parallel data processor using shared memory in Python. The above examples were run with Python 2.7 and PyPy on OS X 10.10.5. Note that PyPy and CPython behave somewhat differently, and will crash at different places in the automatic pickling/unpickling of objects. Also, this won't work at all on Windows, because Windows doesn't fork - multiprocessing there spawns fresh processes and re-imports your module, so nothing is shared. But if your program is constrained in the way described above, and you're running on 'Nix, you can actually get some pretty decent performance!

Happy parallelizing!

