Numba, Mojo, and Compilers for Numerical Computing

There have been a lot of articles written following Modular’s announcement of Mojo, a Python “successor” language that will bring high performance computing (especially for AI applications) to Python users. Here at Anaconda, we’ve been working on the problem of HPC and Python for over a decade now. As part of that work, we created Numba in 2012, a Python just-in-time compiler specifically for numerical computing, and have been improving it ever since. We’ve learned a lot during that time about what it means to compile and accelerate Python, and thought it would be useful to share some of that perspective here. Along the way, you’ll learn about some of the design tradeoffs Numba has had to navigate, and maybe get some context for the challenges that Mojo will face as it takes a different (but potentially complementary) approach to Python compilation.

When is Python useful for HPC?

The difficulty in deciding how Python should be best used for high performance computing (HPC) hinges on the difficulty in defining the purpose of HPC. HPC is often discussed in terms of maximizing execution speed of large, complex calculations. For these purposes, Python is not very useful, except maybe as a job configuration or control language delegating nearly all computation to highly optimized components. (This is the traditional “glue language” role that Python is very good at.)

However, the economies of scale in software can confuse our understanding of priorities. Some kinds of software, for example linear algebra algorithms, are so ubiquitous across disciplines and so frequently used that they merit spending a huge amount of developer effort on performance. There are almost no situations where you should implement matrix multiplication yourself. Instead, you should use MKL, or OpenBLAS, or cuBLAS, or some other library that has had a tremendous amount of effort from hardware experts invested into it. The time spent developing and debugging these sorts of libraries is tiny compared to the amount of time the library will be used by the entire software community. Additionally, it is often possible to find resources to fund these fundamental libraries from hardware vendors and non-profit funding sources because these applications are pivotal. This is the “HPC mass production” scenario.

But many software projects are not in this “mass production” category. They are research projects where a small team of developers needs to iterate quickly to discover a solution to a particular problem. Or they are established projects where maintenance costs and barriers to contribution need to be minimized because the team is small, or turnover (as in many academic research contexts) is high. The incredible diversity of use cases in the Python numerical computing ecosystem is its strength, but it also means that development processes optimized for “mass production” are not necessarily the best choice. Speed matters, but total time matters more, including developer onboarding time, feature development time, debugging time, maintenance time, packaging time, as well as execution time.

This is the vast world of “high-enough performance computing,” and is full of tradeoffs. Numba seeks to find a balance that reduces total person time in the numerical computing space, of which execution time is one component. Mojo is trying to do a similar thing, but with a different set of tradeoffs. (Literally anyone who isn’t hand-coding assembly language for critical code paths is trading execution speed for something.) Hopefully I can help illustrate some of these tradeoffs in the following sections.

Python isn’t a language, it is a community

When people talk about the characteristics (like “speed” or “ease of use”) of “Python,” they are usually mashing together a bunch of distinct parts that are important to think about separately when designing a strategy to improve Python in some way. These parts include:

  • The Python language specification
  • The Python interpreter (usually CPython for most users)
  • The Python standard library
  • The huge collection of third party packages
  • The network of users and developers who help create, maintain, and teach others about all of the above

When a new solution to “the Python performance problem” is announced, I find it useful to understand how it impacts each of these items separately. For example, Numba makes the following choices:

  • Python language: No changes. Numba uses standard Python language features, like decorators and context managers, to intercept execution and annotate code for compilation.

  • Python interpreter: No changes, although Numba is CPython-specific. Numba is specifically designed to work inside CPython as a regular module to minimize barriers to use and maximize application compatibility.

  • Python standard library: No changes, but Numba can only accelerate a small number of standard Python types and built-in functions. Applications are free to use the entire standard library, but only outside of Numba-optimized functions (with some exceptions).

  • Third party packages: Numba does not prevent usage of third party packages in an application because Numba is confined only to specific functions the developer chooses to compile. However, Numba cannot compile code that calls most third party libraries either. Numba includes specific optimization for usage of NumPy arrays and functions, and support for other libraries can be added using Numba’s extension mechanism (for example Awkward Array).

  • Users and developers: Numba tries to be easy to pick up for anyone familiar with Python and NumPy, although more advanced capabilities require learning a few new concepts. Numba also strives to be easy to integrate into other Python packages, avoiding a compile step during packaging, in some cases allowing packages to be distributed as pure Python source code which is compiled for the user’s specific hardware at runtime.

In general, Numba is trying to strike a balance of minimizing the barrier to adopting Numba in a numerical application by existing simply as a module that is compatible with the standard CPython interpreter. The application author does need to actively opt-in to Numba for each function that needs to be compiled, but we find this minimizes surprises and leads to more successful usage in the long run.

Mojo takes a different approach:

  • Python language: Mojo adds significant new syntax to the Python language, aspiring to be a “successor language” for Python that offers more low-level types and performance control. In particular, Mojo defines a new kind of function (using the “fn” keyword) that will be more strictly typed and less dynamic so that faster code can be generated. Regular functions (defined with the standard “def” keyword) have more dynamic and implicit behavior, which will make them feel more familiar to Python developers, but they are not “Python functions” in a strict sense.

  • Python interpreter: Mojo has its own runtime interpreter (called “mojo”) which must be used to call Mojo code, and a compiler for standalone binaries will be available as well. Mojo also links to the CPython runtime, which it uses to allow it to import Python modules and call methods on Python objects.

  • Python standard library: Mojo has its own standard library, which currently is very low-level (at least in the public documentation). Mojo modules can access the Python standard library, via the Python interop functionality.

  • Third party packages: As the Mojo compiler is not generally available, no one has made any Mojo packages. It is unclear how they will be distributed, or what the packaging infrastructure and tooling will look like. Mojo can access 3rd party Python packages via the interop interface, same as the Python standard library.

  • Users and developers: Mojo’s target audience currently seems to be developers of what I would call “application code.” Mojo is not yet targeting anyone creating reusable modules for redistribution, and not Python package developers either.

The best way I’ve found to think about Mojo is that it is trying to be a language that looks like Python, feels somewhat familiar to Python developers, and is designed to interoperate with Python out of the box. It is not a drop-in replacement for Python, however, unless you are writing a new application from scratch that makes limited use of other Python modules.

Mojo fn/def vs. Numba’s object/nopython modes

One of the more interesting decisions in Mojo is to create two kinds of functions:

  • def functions: These functions are designed to feel like standard Python functions. The dynamic behavior we expect from Python should continue to work here, although Mojo def functions can go beyond Python by creating immutable variables, defining type signatures using Mojo types, and so on.

  • fn functions: These functions have more strict rules about the contents of the function, such as required type declarations for arguments and local variables, and explicit declaration of exceptions.

It is clear from the restrictions that fn functions are designed to be statically typed by a compiler and can be compiled ahead of time with the performance one would expect from a statically typed language.

Numba takes a somewhat similar approach, and has two different compilation modes, which are called “object mode” and “nopython” mode. These two compiler modes have the following behavior:

  • object mode: Does not attempt type inference on the function, treating all values as opaque Python objects. Object mode simulates the Python interpreter at compile time, emitting calls to the CPython C API to perform all operations on the Python objects in the function. This results in a negligible performance improvement to the code. Due to limitations in the bytecode analysis front-end of Numba, there are some Python language constructs not supported. Execution is still subject to the Python Global Interpreter Lock (GIL).

  • nopython mode: This performs a complete type-inference pass on the function, and, for the subset of Python that is supported, eliminates all direct manipulation of Python objects. Scalar values are unboxed into native machine types, and operations on NumPy arrays directly modify the data buffer of those arrays. Most reference counting is completely eliminated, although a few instances may remain depending on whether new NumPy arrays are allocated. Numba, however, uses its own thread-safe reference counting internally, so the GIL can be released and multiple threads can execute Numba functions simultaneously.

In practice, we’ve found object mode is no longer very useful. In early Numba, nopython mode was very limited, so object mode was paired with a technique called “loop lifting.” The Numba compiler would find loops within a function that could be compiled to nopython mode and would “lift” them out of the function body, and the rest of the function would be compiled with object mode. As nopython mode has gained more capabilities (including memory allocation and access to typed Python-like containers), we find the use cases for object mode compilation of functions have diminished. In fact, we’ve even inverted the situation, and now have support for object mode blocks within a nopython mode function, which is useful for things like supporting Python callbacks.

Since Numba exists within the CPython interpreter, the best practice for functions that cannot be compiled with nopython mode is to not compile them at all! Thanks to the Faster CPython effort, these sorts of generic functions are getting faster with every release anyway, and Numba is most effective on the numerical core of the application, where the compiler can strip away all of the Python overhead. For this reason, Siu Kwan Lam (one of Numba’s creators) often calls Numba a “performance linter” because the limitations of nopython mode guide Numba users into using Python in a way that can be most effectively sped up by a compiler, usually between 2x and 200x. If we can’t achieve that, there isn’t much point to compilation in the first place.

At first glance, Numba’s object mode seems similar to Mojo’s def functions. The difference is that Mojo’s def functions feel like Python functions but are not Python functions. Typing is optional, but new Mojo declarations (like var and let) are allowed, and the types are still Mojo types, not Python types. Access to Python modules needs to happen via special wrappers created by the Mojo Python interface, which can import Python modules. Mojo uses CPython as a library to handle these calls, but the body of the def function itself is still executed by the Mojo runtime. In principle, this could make Mojo behave a lot like Cython, but we still don’t know much about how convenient it will be to use Python from Mojo in practice, and we know nothing about whether Python will be able to call Mojo functions easily.

Array Types and Operations

The most important part of any numerical computing system in 2023 is the multidimensional array. Multidimensional arrays are extremely useful and memory efficient containers for numerical data, and concepts like broadcasting, universal functions, and advanced indexing have dramatically simplified the application of numerical computing to many situations, including machine learning. Moreover, arrays are very compiler-friendly data structures, which has created opportunities for many compilers and compiler technologies. In the Python space, NumPy has become the standard array, and without NumPy, there would be no Numba.

The most surprising part of the Mojo announcement for me was the complete lack of a built-in multidimensional array type. Given Mojo’s target use case of machine learning workloads, I can only assume information about the multidimensional array was held back for a future update, because it absolutely has to exist for Mojo to be useful. The absence was doubly surprising because MLIR (which the Mojo documentation mentions as part of its core compiler design) includes a tensor type, which is very similar to the NumPy array, and also defines a generic linear algebra operation (linalg.generic) in one of the MLIR dialects that is extremely powerful. I would describe the linalg.generic operation as a superset of the generalized ufunc that both NumPy and Numba support (via the @guvectorize decorator), which can represent a surprisingly wide range of array operations in a very natural and compact way. Not seeing linalg.generic mentioned in the Mojo docs was very disappointing, and I hope we learn more about it in a future update.

With no array type to operate on, Mojo is currently lacking in all the array functions that one would expect from a numerical computing language. These can surely be added, and even implemented directly in the Mojo language, but they are not there now. Hopefully Modular will learn from the recently created Python Array API standard and consider using that as the basis for their user-facing array API.

Even beyond the array type and functions, it will be extremely important for Mojo to have a clear mechanism to pass array-like data between it and other languages, especially Python. NumPy’s C API is just as important as its Python API, as it enables passing data in and out of the Python runtime to 3rd party libraries written in other languages. Mojo will have a long road to broader adoption, so anything that eases interoperability with existing numerical computing libraries will be essential.
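
As a small illustration of the kind of zero-copy exchange we mean, the sketch below uses Python’s buffer protocol rather than the NumPy C API itself, with a standard library array standing in for memory owned by some other library (an assumption made purely for the example):

```python
import array

import numpy as np

# Stand-in for a block of memory owned by a non-NumPy library.
raw = array.array("d", [1.0, 2.0, 3.0])

# Wrap the same bytes as a NumPy array; no data is copied.
view = np.frombuffer(raw, dtype=np.float64)

# A write through the NumPy view is visible to the original owner.
view[0] = 10.0
shares_memory = raw[0] == 10.0
```

Whatever mechanism Mojo ends up with, this is the property that matters: both sides see the same memory, so large arrays can cross the boundary at essentially no cost.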

Extensibility and MLIR

Whether or not the Mojo compiler needs to be extensible is unclear. Many languages are self-contained, in that you are expected to “extend” the system by writing even more code in that language. That seems to be how Mojo is being implemented now, as most of the syntax and standard library of Mojo is focused on very low-level types and concepts, like SIMD operations and object lifetime management. With those ingredients, many of the essential higher level features needed in Mojo can be implemented in Mojo.

Python (and Numba) are not like this at all. Python’s superpower (and curse at times) is that it makes it very easy to extend Python with other languages, typically C or C++, although Rust has recently been growing in popularity as well. Every popular numerical computing package in Python either includes some non-Python code in it, or depends on one that does. For this reason, the C API of Python is very important for enabling extension of the interpreter to new types and modules. In a similar way, Numba’s limitations mean that there are some low-level operations that cannot be generated by Numba when it compiles a function with its @jit decorator. For these cases, Numba has an extension API which allows a variety of things, including:

  • Generation of machine code (technically LLVM IR) to implement specific low level operations.
  • Custom overrides of existing Python functions to define how they should be compiled for specific data types.
  • Definition of custom “boxing” and “unboxing” methods for passing new data types through Numba-compiled code.

While Mojo does not necessarily need these things due to its design, I believe the Python interoperability feature in Mojo will need an extensible mechanism to define how to translate data structures across the Python / Mojo boundary. We know from Numba experience that anything like a NumPy array is very easy to move back and forth, and we believe that Apache Arrow data structures offer a similar possibility for more heterogeneous data (dataframes and beyond). Numba has even created a few typed data containers specifically to facilitate movement of statically-typed lists and statically-typed dictionaries across the Numba / Python boundary. Manipulating Python data through a wrapper mechanism (as is currently described in the Mojo docs) is a fine general solution, but faster mechanisms will be needed if Mojo is going to coexist with non-trivial Python code.

Another area of extensibility that could be important to Mojo and is definitely important to the future of Numba is interoperability with other MLIR-based tools. MLIR is designed to be a high level compiler intermediate representation (IR) with some unique features. MLIR is extensible by defining new dialects, which allows the same IR to be used on a wide range of compiler use cases, and, in principle, permits different compiler tools that support MLIR to work together. The Numba team is very excited about the potential for MLIR and plans to incorporate it into the next generation of Numba. However, we can’t tell how Mojo plans to expose its MLIR-based internals. Would something like Numba be able to target the Mojo compiler pipeline directly, or possibly the reverse: could future Numba consume the functions produced by the Mojo frontend and parser?

Packaging and Composability

The final area of Mojo we are curious about is the packaging approach. One feature of Mojo that was emphasized was the ability of the Mojo compiler to produce fully self-contained executables that could be deployed wherever they were needed. Application distribution is a very important use case that has not gotten as much attention in the Python ecosystem as package distribution. Python’s plethora of packaging tools focus on allowing users to install Python components along with their complex web of dependencies. The Python Package Index (and adjacent package repositories, like conda-forge and the Anaconda Distribution) is truly one of Python’s superpowers, and if Mojo wants to jumpstart a broad community of developers, it will also need a packaging story. There are so many unanswered questions at this point:

  • Will hybrid Mojo-Python projects be possible, with both Mojo and Python dependencies?
  • Will Mojo packages only distribute source code, with the intent that all Mojo applications must be compiled by the application builder along with all of their dependencies?
  • Or, will Mojo allow precompiled modules to be distributed, in which case, how will inter-module optimization be enabled?

Aside: Although this may be a long shot, I think it would be great if Mojo would consider conda as a package manager, as it already has excellent Python support, but is also a cross-platform, multi-language packaging system. Mojo doesn’t need to build yet another package manager, and conda would be an excellent choice for the complex hybrid Mojo/Python projects that Modular hopes will exist in the future. Give us a call, Chris, if you want to chat. 😀

The packaging story for Numba today is much simpler, as it is just a Python package that does just-in-time compilation. Projects that use Numba compilation do so by taking Numba as a dependency, and then can choose to distribute their own package as pure source code. The translation from Python to machine code happens at runtime on the end user’s system. This dramatically simplifies the distribution of those packages downstream of Numba, who do not need to build wheels for each target platform unless they have other compiled code aside from Numba-decorated functions.

However, the Numba team definitely appreciates that JIT compilation is not suitable for all situations, either because of the increased startup time, or limitations of the target platform, which might forbid JIT compilation entirely. Because of this, in our planning for the next generation of Numba, we’ve decided to first focus on an ahead-of-time compiler (like Mojo) and then will treat just-in-time compilation as a special case. Ahead-of-time compilation is much trickier because you have to make choices at build time for which there are downstream consequences to consumers of your package. You have to worry about ABI stability, supporting multiple variants of input types, and different machine instruction set variants (like AVX, AVX2, AVX-512, etc). At the same time, we do not want ahead-of-time compilation to preclude JIT optimization across modules that were distributed separately. We have been working on techniques to handle all of these situations in a new library called PIXIE, which allows additional metadata to be inserted into library files to enable dynamic dispatch, CPU feature selection, and optional future recompilation of functions with a JIT.

Conclusions and a Wishlist for Mojo

I want to conclude by reiterating that my goal with this article is not to dampen enthusiasm for Mojo, but to help the reader be more curious about the details when they read about new Python and Python-adjacent compiler projects. Impressive looking speedup factors are not where the most difficult challenges are.

I think it is an interesting idea to try to blend a Python-like syntax with the capabilities of MLIR to target a wide range of potential hardware. Mojo’s Python interoperability features could expand the capabilities of the Python ecosystem, which would be a great thing. But, there is a tremendous amount of work left to do on Mojo and a lot of design details have not been shared. We simply don’t know enough to decide how Mojo will impact Python, or how best to interact with it.

Given that, here is the wishlist of things I hope we see in Mojo in the coming months:

  • A native multidimensional array type that exposes the same power as MLIR’s tensor type and linalg dialect. That will enable the same kind of array programming we are familiar with from NumPy.

  • More explanation and infrastructure for managing data motion across the Mojo / Python boundary. NumPy arrays should be usable by Mojo native code (hopefully that tensor type we’re asking for in the previous item) without data copies. Similarly, being able to use Apache Arrow data structures across the two languages would be very powerful. In general, we need something extensible so that new data translators can be added.

  • Permit an inversion of control, so that Mojo modules can be imported from the Python interpreter. Mojo could be an interesting alternative to Cython if that were possible, and if NumPy arrays could be passed to Mojo functions with minimal overhead.

  • Expose the MLIR internals of Mojo via some kind of interface that lets us retarget the compiler. A true Python-to-Mojo compiler pipeline could be possible then, and that would allow a lot more experimentation. How should we use Mojo with new or custom MLIR dialects?

  • More details about Mojo packaging in general. Will there be a package index for Mojo-specific modules by third parties? How should mixed Mojo and Python environments be handled? (Seriously, please take a look at conda. 🙂)

Of course, we need to see an open source Mojo compiler and runtime before many of these things will be possible. Hopefully we’ll get more details on that in the future as well.