Python has a reputation for being notoriously slow, especially for numeric operations. Most of us use packages like NumPy and SciPy to mitigate this and make our heavy numeric code run in at least a bearable time. In this post, we’ll talk a bit about Numba, the famed package designed to speed up your Python code, sometimes even beyond the capabilities of NumPy, using a simple decorator. More importantly, we’ll learn about a long-awaited feature introduced in the latest release: support for NumPy’s new random module. This enables access to NumPy’s random number generation methods from within Numba functions, by allowing NumPy’s Generator objects to cross the JIT boundary.
Hey! I’m Kaustubh, a Computer Science student from India. During the summer of 2022, I interned at Anaconda and worked on developing the Numba project. This post is a (shamelessly self-promotional) summary of the work I did on the Numba library during my internship.
Now, before learning about what I did in Numba, let’s take a short tour of Python numeric libraries and how Numba fits into all of this:
Some of the qualities that make Python user-friendly and suitable for data science are the same qualities that make it slow, the biggest one being the fact that Python is an interpreted language. Traditionally, this problem was mitigated by writing computationally intensive algorithms in C or C++ and calling them from the outward-facing Python code. NumPy is an excellent example of this type of organization: its most computationally intensive parts are written in C/C++ and exposed to Python via binding code.
Fun fact: the NumPy arrays that we use most of the time are actually stored, indexed, and iterated using C code.
The other approach to improving the performance of your Python code is to use a Python-accelerating library, and Numba is a perfect example of this. Numba translates Python functions to highly optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can easily approach the speeds of C or FORTRAN thanks to the amount of low-level optimization applied to your code, especially if it contains looping logic. Additionally, Numba supports automatic parallelization of loops (which is different from loop optimization), generation of GPU-accelerated code (using CUDA), and creation of universal functions (ufuncs) and C callbacks. Libraries like these combine the relative ease of writing Python code with the performance of executing the operations in a different backend altogether.
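To make the decorator-based workflow concrete, here is a minimal sketch (the function and data are illustrative, not from the post). The `try`/`except` fallback is only there so the snippet also runs where Numba isn’t installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback: run as plain Python if Numba is not installed
    def njit(func):
        return func

@njit
def sum_of_squares(arr):
    # Explicit loops like this are exactly what Numba compiles
    # down to tight machine code
    total = 0.0
    for x in arr:
        total += x * x
    return total

data = np.arange(1_000_000, dtype=np.float64)
result = sum_of_squares(data)
```

The first call triggers compilation; subsequent calls with the same argument types reuse the cached machine code.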
Numba's support for NumPy methods
One of the advantages of using Numba is its near-seamless integration with NumPy functions, especially NumPy ndarrays. Numba detects whenever a NumPy function is used within code that is to be Just-In-Time (JIT)-compiled and internally dispatches the provided arguments to a corresponding Numba-based implementation that provides the same functionality as the original.
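As a small illustration of this dispatch (the `standardize` function is my own example, not from the post), NumPy calls inside a JIT-compiled function are replaced by Numba’s internal implementations with the same semantics:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without Numba installed
    def njit(func):
        return func

@njit
def standardize(arr):
    # Inside compiled code, np.mean and np.std dispatch to Numba's
    # own implementations rather than calling back into NumPy
    return (arr - np.mean(arr)) / np.std(arr)

x = np.array([1.0, 2.0, 3.0, 4.0])
z = standardize(x)
```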
However, a major annoyance when directly trying to JIT-compile complex code written mainly with NumPy is Numba’s lack of support for certain NumPy features. For instance, one notices a significant gap when trying to JIT-compile code that uses NumPy’s fancy/advanced array indexing, or the lack of the axis argument for several NumPy methods.
Another feature frequently requested by NumPy users was support for the new NumPy random module. For those who aren’t familiar with it, NumPy has a module dedicated to random number generation, including the generation of numbers following a particular random distribution, for example, the Poisson or Gaussian distributions. This is especially useful in scientific computing, where generating random data is necessary for statistical analyses and computations. These are also exactly the situations where Python-accelerating libraries are needed most, since computationally intensive algorithms make up a large part of the code logic. Hence, having random generation support within JIT-compiled code is a huge advantage in the majority of cases.
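For readers unfamiliar with the module, here is what its new-style API looks like in plain NumPy (the seed and distribution parameters are arbitrary examples):

```python
import numpy as np

# A Generator object is the entry point to the new random API
rng = np.random.default_rng(seed=42)

# Draw samples following specific distributions
poisson_counts = rng.poisson(lam=3.0, size=5)           # Poisson-distributed ints
gaussian_vals = rng.normal(loc=0.0, scale=1.0, size=5)  # Gaussian-distributed floats
```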
Previously, NumPy had a global ‘state array’: a sequence of bits stored as an array from which random numbers could be drawn. This state array could be initialized using a seed (an integer or an array of integers) and was updated repeatedly using the Mersenne Twister algorithm, a pseudo-random number algorithm. Algorithms of this kind have a cycle, so they eventually repeat; the advantage of the Mersenne Twister is that its cycle is especially long (its period is the Mersenne prime 2^19937 − 1, which is where the name comes from). Seeding was also helpful because, by setting the same initial seed, we could get exactly the same results (at least up to rounding errors) when rerunning the code, even though the numbers themselves look completely random: two systems with the same seed have the same state array, hence the subsequent random numbers generated are the same.

However, there were problems with having a global state. For example, when two independent threads on the same system run the same script and access the same global state array, it is challenging to get reproducible random number sequences from them. To mitigate this, NumPy introduced class-based random number generation using objects named Generators, and replaced the global state with a BitGenerator object that holds the state as an attribute of each Generator. This also had the added advantage of enabling faster and more convenient algorithms, like Lemire’s rejection algorithm, for random number generation.
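The difference between the two designs can be seen side by side (seeds chosen arbitrarily): the legacy API shares one hidden global state, while each Generator carries its own BitGenerator, so independent streams never interfere:

```python
import numpy as np

# Legacy API: one hidden global state behind np.random.*
np.random.seed(123)
legacy_a = np.random.random(3)
np.random.seed(123)            # reseeding the global state repeats the stream
legacy_b = np.random.random(3)

# New API: each Generator owns its BitGenerator state, so two
# independent streams (e.g. one per thread) cannot clobber each other
rng1 = np.random.default_rng(123)
rng2 = np.random.default_rng(123)
new_a = rng1.random(3)
new_b = rng2.random(3)
```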
Basic Support for NumPy Generators
With PR #8031, support for these Generator objects was added to Numba. There were a lot of challenges in implementing Generators as objects that could be lowered into Numba’s LLVM-based environment while keeping track of both the original Generator object and the underlying BitGenerator object. Fortunately, to make this work easier, the NumPy devs provide ctypes bindings to BitGenerators, which allowed us to draw bits directly from the C implementations of the object. (Thanks, NumPy devs!) The initial implementation was provided by Stuart Archibald from the Numba team and was improved upon with proper tracking of reference counts and error handling.
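From the user’s side, crossing the JIT boundary looks roughly like this (a sketch assuming Numba ≥ 0.56 with Generator support; the function name is mine, and the fallback only covers the case where Numba is absent entirely):

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch runs without Numba installed
    def njit(func):
        return func

@njit
def draw_normals(rng, n):
    # 'rng' is a NumPy Generator created in regular Python; it crosses
    # the JIT boundary and is used from within compiled code
    return rng.normal(0.0, 1.0, n)

rng = np.random.default_rng(0)
samples = draw_normals(rng, 4)
```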
One of the issues we faced while implementing Generator support was that we needed to maintain a reference to the original object, as a pointer, in case we ever needed to return it. One could easily do that by simply storing the pointer to the object somewhere accessible. However, there is a caveat here: Numba has its own runtime, the NRT (Numba Run Time), which manages its own bookkeeping such as reference counts. It is NOT responsible for maintaining the Python reference counts of objects outside the NRT. Hence, there could be a case where the object the pointer points to no longer exists, because it has been deleted in the original Python environment. This was mitigated by using MemInfo objects, which keep track of information about the memory at the pointer, including its references in the Python environment.
This PR was then followed by subsequent PRs that added distributions built on top of the Generator support. #8038 and #8040 added such distributions, while #8041 and #8042 added support for general methods of Generator objects such as shuffling and random integer generation (Lemire’s rejection algorithm). A majority of my internship was devoted to these PRs: half of it spent implementing them, and the other half spent tracking down a very sneaky bug within them. To understand this ‘bug’, let’s first understand where it originates.
While implementing #8038, we noticed that even though we had directly translated the algorithms from NumPy’s code, there were slight precision differences between the results produced by Numba and those produced by NumPy. More interestingly, this discrepancy was only observed on certain systems on the CI: it appeared on aarch64 and ppc64le systems but not on linux-64 systems. At first the differences were small enough for us to ignore, around 2000 ULPs (Units of Least Precision) for 64-bit floats, but as more and more complex distributions were introduced, the differences grew to the order of ~10,000s of ULPs, so we had to address the problem at its core. After multiple fruitless debugging sessions, we finally tracked the precision differences down to the assembly instructions causing them (thanks to Stuart again). They were caused by floating-point contraction: some instruction sets (such as those of ppc64le and aarch64) have instructions like fmadd and fmsub that fuse a multiply with an add or subtract into a single operation. In our case, the NumPy code was executed using these fused instructions while the Numba code wasn’t. This led to rounding differences, since a single fmadd rounds once while a separate multiply followed by an add rounds twice, producing slightly different results. Once the problem was identified, we came up with #8232 as a fix.
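For readers unfamiliar with ULPs, discrepancies like the ones above can be quantified with a small helper (my own illustrative sketch, valid for positive finite float64 values): reinterpreting a double’s bits as a signed integer gives an ordering in which adjacent integers are adjacent floats, so the integer difference counts ULPs.

```python
import numpy as np

def ulp_distance(a, b):
    # For positive, finite float64 values, the IEEE 754 bit patterns,
    # viewed as int64, are ordered the same way as the floats, and
    # consecutive integers correspond to consecutive representable doubles.
    ia = np.float64(a).view(np.int64)
    ib = np.float64(b).view(np.int64)
    return abs(int(ia) - int(ib))

# Adjacent doubles are exactly 1 ULP apart
ulp_distance(1.0, np.nextafter(1.0, 2.0))  # → 1
```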
Generator support was released in version 0.56.0, along with a load of other cool features. Subsequent releases will keep adding more distribution-related algorithms; we aim for full parity with the NumPy counterparts in both the latest and the legacy random number generator modules. In the future, the plan is to add more complementary features to this module, such as making Generator objects thread-safe and making it possible to construct them within Numba’s CPU-JITed code.
Another cool feature that I was working on during the summer is support for fancy/advanced indexing in Numba (starting with #8238). I’ll be posting a blog post about it soon, so stay tuned for further updates.
A note to my mentor and team:
Thank you to the Numba team for all your support, especially Valentin Haenel (my mentor during the internship) for his guidance and Stuart Archibald for the initial patches and reviews. I couldn’t have made it this far without you guys.
Days since last segfault: 0x0
Ich bin ein Berliner!!