There have been a lot of articles written following Modular’s announcement of
Mojo, a Python “successor” language that
will bring high performance computing (especially for AI applications) to
Python users. Here at Anaconda, we’ve been working on the problem of HPC and
Python for over a decade now. As part of that work, we created
Numba in 2012, a Python just-in-time compiler
specifically for numerical computing, and have been improving it ever since.
We’ve learned a lot during that time about what it means to compile and
accelerate Python, and thought it would be useful to share some of that
perspective here. Along the way, you’ll learn about some of the design
tradeoffs Numba has had to navigate, and maybe get some context for the
challenges that Mojo will face as it takes a different (but potentially
complementary) approach to Python compilation.
When is Python useful for HPC?
Deciding how Python is best used for high performance computing (HPC) is
hard because the purpose of HPC is itself hard to define. HPC
is often discussed in terms of maximizing execution speed of large, complex
calculations. For these purposes, Python is not very useful, except maybe as
a job configuration or control language delegating nearly all computation to
highly optimized components. (This is the traditional “glue language” role
that Python is very good at.)
However, the economies of scale in software can confuse our understanding of
priorities. Some kinds of software, for example linear algebra algorithms,
are so ubiquitous across disciplines and so frequently used that they merit
spending a huge amount of developer effort on performance. There are almost
no situations where you should implement matrix multiplication yourself.
Instead, you should use MKL, or OpenBLAS, or cuBLAS, or some other library
that has had a tremendous amount of effort from hardware experts invested into
it. The time spent developing and debugging these sorts of libraries is tiny
compared to the amount of time the library will be used by the entire software
community. Additionally, it is often possible to find resources to fund these
fundamental libraries from hardware vendors and non-profit funding sources
because these applications are pivotal. This is the “HPC mass production”
scenario.
But many software projects are not in this “mass production” category. They
are research projects where a small team of developers needs to iterate
quickly to discover a solution to a particular problem. Or they are
established projects where maintenance costs and barriers to contribution need
to be minimized because the team is small, or turnover (as in many academic
research contexts) is high. The incredible diversity of use cases in the
Python numerical computing ecosystem is its strength, but also means that
development processes optimized for “mass production” are not necessarily the
best choice. Speed matters, but total time matters more, including developer
onboarding time, feature development time, debugging time, maintenance time,
packaging time, as well as execution time.
This is the vast world of “high-enough performance computing,” and it is full of
tradeoffs. Numba seeks to find a balance that reduces total person time in
the numerical computing space, of which execution time is one component. Mojo
is trying to do a similar thing, but with a different set of tradeoffs.
(Literally anyone who isn’t hand-coding assembly language for critical code
paths is trading execution speed for something.) Hopefully I can help
illustrate some of these tradeoffs in the following sections.
Python isn’t a language, it is a community
When people talk about the characteristics (like “speed” or “ease of use”) of
“Python,” they are usually mashing together a bunch of distinct parts that are
important to think about separately when designing a strategy to improve
Python in some way. These parts include:
- The Python language specification
- The Python interpreter (CPython for most users)
- The Python standard library
- The huge collection of third party packages
- The network of users and developers who help create, maintain, and teach others about all of the above
When a new solution to “the Python performance problem” is announced, I find
it useful to understand how it impacts each of these items separately. For
example, Numba makes the following choices:
- Python language: No changes. Numba uses standard Python language
features, like decorators and context managers, to intercept execution and
annotate code for compilation.
- Python interpreter: No changes, although Numba is CPython-specific. Numba
is specifically designed to work inside CPython as a regular module to
minimize barriers to use and maximize application compatibility.
- Python standard library: No changes, but Numba can only accelerate a small
number of standard Python types and built-in functions. Applications are
free to use the entire standard library, but only outside of Numba-optimized
functions (with some exceptions).
- Third party packages: Numba does not prevent usage of third party packages
in an application because Numba is confined only to specific functions the
developer chooses to compile. However, Numba cannot compile code that calls
most third party libraries either. Numba includes specific optimization for
usage of NumPy arrays and functions, and support for
other libraries can be added using Numba’s extension mechanism (for example
Awkward Array).
- Users and developers: Numba tries to be easy to pick up for anyone
familiar with Python and NumPy, although more advanced capabilities require
learning a few new concepts. Numba also strives to be easy to integrate
into other Python packages, avoiding a compile step during packaging, in
some cases allowing packages to be distributed as pure Python source code
which is compiled for the user’s specific hardware at runtime.
In general, Numba tries to minimize the barrier to adoption in a numerical application by existing simply as a module that is compatible with the standard CPython interpreter. The application author does need to actively opt in to Numba for each function that needs to be compiled, but we find this minimizes surprises and leads to more successful usage in the long run.
Mojo takes a different approach:
- Python language: Mojo adds significant new syntax to the Python language, aspiring to be a “successor language” for Python that offers more low-level types and performance control. In particular, Mojo defines a new kind of function (using the “fn” keyword) that will be more strictly typed and less dynamic so that faster code can be generated. Regular functions (defined with the standard “def” keyword) have more dynamic and implicit behavior, which will make them feel more familiar to Python developers, but they are not “Python functions” in a strict sense.
- Python interpreter: Mojo has its own runtime interpreter (called “mojo”) which must be used to call Mojo code, and a compiler for standalone binaries will be available as well. Mojo also links to the CPython runtime, which it uses to allow it to import Python modules and call methods on Python objects.
- Python standard library: Mojo has its own standard library, which currently is very low-level (at least in the public documentation). Mojo modules can access the Python standard library, via the Python interop functionality.
- Third party packages: As the Mojo compiler is not generally available, no one has made any Mojo packages. It is unclear how they will be distributed, or what the packaging infrastructure and tooling will look like. Mojo can access 3rd party Python packages via the interop interface, same as the Python standard library.
- Users and developers: Mojo’s target audience currently seems to be developers of what I would call “application code.” Mojo is not yet targeting anyone creating reusable modules for redistribution, and not Python package developers either.
The best way I’ve found to think about Mojo is that it is trying to be a language that looks like Python, feels somewhat familiar to Python developers, and is designed to interoperate with Python out of the box. It is not a drop-in replacement for Python, however, unless you are writing a new application from scratch that makes limited use of other Python modules.
Mojo fn/def vs. Numba’s object/nopython modes
One of the more interesting decisions in Mojo is to create two kinds of functions:
- def functions: These functions are designed to feel like standard Python
functions. The dynamic behavior we expect from Python should continue to
work here, although Mojo def functions can go beyond Python by creating
immutable variables, defining type signatures using Mojo types, and so on.
- fn functions: These functions have more strict rules about the contents of
the function, such as required type declarations for arguments and local
variables, and explicit declaration of exceptions.
It is clear from the restrictions that fn functions are designed to be
statically typed by a compiler and can be compiled ahead of time with the
performance one would expect from a statically typed language.
Numba takes a somewhat similar approach, and has two different compilation
modes, which are called “object mode” and “nopython” mode. These two compiler
modes have the following behavior:
- object mode: Does not attempt type inference on the function, treating all
values as opaque Python objects. Object mode simulates the Python
interpreter at compile time, emitting calls to the CPython C API to perform
all operations on the Python objects in the function. This results in a
negligible performance improvement to the code. Due to limitations in the
bytecode analysis front-end of Numba, there are some Python language
constructs not supported. Execution is still subject to the Python Global
Interpreter Lock (GIL).
- nopython mode: This performs a complete type-inference pass on the
function, and, for the subset of Python that is supported, eliminates all
direct manipulation of Python objects. Scalar values are unboxed into
native machine types, and operations on NumPy arrays directly modify the
data buffer of those arrays. Most reference counting is completely
eliminated, although a few instances may remain depending on whether new
NumPy arrays are allocated. Numba, however, uses its own thread-safe
reference counting internally, so the GIL can be released and multiple
threads can execute Numba functions simultaneously.
In practice, we’ve found object mode is no longer very useful. In early
Numba, nopython mode was very limited, so object mode was paired with a
technique called “loop lifting.” The Numba compiler would find loops within a
function that could be compiled to nopython mode and would “lift” them out of
the function body, and the rest of the function would be compiled with object
mode. As nopython mode has gained more capabilities (including memory
allocation and access to typed Python-like containers), we find the use cases
for object mode compilation of functions have diminished. In fact, we’ve even
inverted the situation, and now have support for object mode blocks within a
nopython mode function, which is useful for things like supporting Python
callbacks.
Since Numba exists within the CPython interpreter, the best practice for
functions that cannot be compiled with nopython mode is to not compile them at
all! Thanks to the Faster CPython effort, these sorts of generic functions
are getting faster with every release anyway, and Numba is most effective on
the numerical core of the application, where the compiler can strip away all
of the Python overhead. For this reason, Siu Kwan Lam (one of Numba’s
creators) often calls Numba a “performance linter” because the limitations of
nopython mode guide Numba users into using Python in a way that can be most
effectively sped up by a compiler, usually between 2x and 200x. If we can’t
achieve that, there isn’t much point to compilation in the first place.
At first glance, Numba’s object mode seems like Mojo’s def functions, but the
difference is that Mojo’s def functions feel like Python functions, but are
not Python functions. Typing is optional, but new Mojo declarations (like var
and let) are allowed, and the types are still Mojo types, not Python types.
Access to Python modules needs to happen via special wrappers created by the
Mojo Python interface, which can import Python modules. Mojo uses CPython as
a library to handle these calls, but the body of the def function itself is
still executed by the Mojo runtime. In principle, this could make Mojo behave
a lot like Cython, but we still don’t know much about how convenient it will
be to use Python from Mojo in practice, and we know nothing about whether
Python will be able to call Mojo functions easily.
Array Types and Operations
The most important part of any numerical computing system in 2023 is the
multidimensional array. Multidimensional arrays are extremely useful and
memory efficient containers for numerical data, and concepts like
broadcasting, universal functions, and advanced indexing have dramatically
simplified the application of numerical computing to many situations,
including machine learning. Moreover, arrays are very compiler-friendly data
structures, which has created opportunities for many compilers and compiler
technologies. In the Python space, NumPy has become the standard array, and
without NumPy, there would be no Numba.
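For readers less familiar with these array concepts, here is a short NumPy sketch of the three mentioned above. The temperature data is invented purely for illustration.

```python
import numpy as np

temps = np.array([[10.0, 20.0, 30.0],
                  [15.0, 25.0, 35.0]])   # shape (2, 3)
offsets = np.array([0.5, 1.0, 1.5])      # shape (3,)

# Broadcasting: the (3,) vector is applied to every row of the (2, 3) array.
adjusted = temps + offsets

# Universal function: np.sqrt runs elementwise with no Python-level loop.
roots = np.sqrt(adjusted)

# Advanced indexing: a boolean mask selects the elements above a threshold.
hot = adjusted[adjusted > 25.0]
```

Each line states what to compute over whole arrays and leaves the looping to the library, which is what makes this style so approachable for a compiler.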
The most surprising part of the Mojo announcement for me was the complete lack
of a built-in multidimensional array type. Given Mojo’s target use case of
machine learning workloads, I can only assume information about the
multidimensional array was held back for a future update, because it
absolutely has to exist for Mojo to be useful. The absence was doubly
surprising because MLIR (which the Mojo documentation mentions as part of its
core compiler design) includes a tensor type, which is very similar to the
NumPy array, and also defines a generic linear algebra operation
(linalg.generic) in one of the MLIR dialects that is extremely powerful. I
would describe the linalg.generic operation as a superset of the generalized
ufunc that both NumPy and Numba support (via the @guvectorize decorator),
which can represent a surprisingly wide range of array operations in a very
natural and compact way. Not seeing linalg.generic mentioned in the Mojo docs
was very disappointing, and I hope we learn more about it in a future update.
With no array type to operate on, Mojo is currently lacking in all the array
functions that one would expect from a numerical computing language. These
can surely be added, and even implemented directly in the Mojo language, but
they are not there now. Hopefully Modular will learn from the recently
created Python Array API standard
and consider using that as the basis for their user-facing array API.
Even beyond the array type and functions, it will be extremely important for
Mojo to have a clear mechanism to pass array-like data between it and other
languages, especially Python. NumPy’s C API is just as
important as its Python API, as it enables passing data in and out of the
Python runtime to 3rd party libraries written in other languages. Mojo will
have a long road to broader adoption, so anything that eases interoperability
with existing numerical computing libraries will be essential.
Extensibility and MLIR
Whether or not the Mojo compiler needs to be extensible is unclear. Many
languages are self-contained, in that you are expected to “extend” the
system by writing even more code in that language. That seems to be how Mojo
is being implemented now, as most of the syntax and standard library of Mojo
is focused on very low-level types and concepts, like SIMD operations and
object lifetime management. With those ingredients, many of the essential
higher level features needed in Mojo can be implemented in Mojo.
Python (and Numba) are not like this at all. Python’s superpower (and curse
at times) is that it makes it very easy to extend Python with other languages,
typically C or C++, but recently Rust is also growing in popularity. Every
popular numerical computing package in Python either includes some non-Python
code in it, or depends on one that does. For this reason, the C API of Python
is very important for enabling extension of the interpreter to new types and
modules. In a similar way, Numba’s limitations mean that there are some
low-level operations that cannot be generated by Numba when it compiles a
function with its @jit decorator. For these cases, Numba has an extension API
which allows a variety of things, including:
- Generation of machine code (technically LLVM IR) to implement specific low level operations.
- Custom overrides of existing Python functions to define how they should be compiled for specific data types.
- Definition of custom “boxing” and “unboxing” methods for passing new data types through Numba-compiled code.
While Mojo does not necessarily need these things due to its design, I believe
the Python interoperability feature in Mojo will need an extensible mechanism
to define how to translate data structures across the Python / Mojo boundary.
We know from Numba experience that anything like a NumPy array is very easy to
move back and forth, and we believe that Apache Arrow data structures offer a
similar possibility for more heterogeneous data (dataframes and beyond). Numba
has even created a few typed data containers specifically to facilitate
movement of statically-typed lists and statically-typed dictionaries
across the Numba / Python boundary. Manipulating Python data through a wrapper
mechanism (as is currently described in the Mojo docs) is a fine general
solution, but faster mechanisms will be needed if Mojo is going to coexist
with non-trivial Python code.
Another area of extensibility that could be important to Mojo and is
definitely important to the future of Numba is interoperability with other
MLIR-based tools. MLIR is designed to be a high level compiler intermediate
representation (IR) with some unique features. MLIR is extensible by defining
new dialects,
which allows the same IR to be used on a wide range of compiler use cases,
and, in principle, permits different compiler tools that support MLIR to work
together. The Numba team is very excited about the potential for MLIR and
plans to incorporate it into the next generation of Numba. However, we can’t
tell how Mojo plans to expose its MLIR-based internals. Would something like
Numba be able to target the Mojo compiler pipeline directly, or possibly the
reverse: could future Numba consume the functions produced by the Mojo
frontend and parser?
Packaging and Composability
The final area of Mojo we are curious about is its packaging approach. One feature of Mojo that was emphasized was the ability of the Mojo compiler to produce fully self-contained executables that could be deployed wherever they were needed. Application distribution is a very important use case that has not gotten as much attention in the Python ecosystem as package distribution. Python’s plethora of packaging tools focus on allowing users to install Python components along with their complex web of dependencies. The Python Package Index (and adjacent package repositories, like conda-forge and the Anaconda Distribution) is truly one of Python’s superpowers, and if Mojo wants to jumpstart a broad community of developers, it will also need a packaging story. There are so many unanswered questions at this point:
- Will hybrid Mojo-Python projects be possible, with both Mojo and Python
dependencies?
- Will Mojo packages only distribute source code, with the intent that all
Mojo applications must be compiled by the application builder along with all
of their dependencies?
- Or, will Mojo allow precompiled modules to be distributed, in which case, how
will inter-module optimization be enabled?
Aside: Although this may be a long shot, I think it would be great if Mojo
would consider conda as a package manager, as it already has excellent Python
support, but is also a cross-platform and multi-language packaging system.
Mojo doesn’t need to
build yet another package manager, and conda would be an excellent choice for
the complex hybrid Mojo/Python projects that Modular hopes will exist in the
future. Give us a call, Chris, if you want to chat. 😀
The packaging story for Numba today is much simpler, as it is just a Python
package that does just-in-time compilation. Projects that use Numba
compilation do so by taking Numba as a dependency, and then can choose to
distribute their own package as pure source code. The translation from Python
to machine code happens at runtime on the end user’s system. This
dramatically simplifies distribution for the packages downstream of
Numba, whose maintainers do not need to build wheels for each target platform
unless they have other compiled code aside from Numba-decorated functions.
However, the Numba team definitely appreciates that JIT compilation is not
suitable for all situations, either because of the increased startup time, or
limitations of the target platform, which might forbid JIT compilation
entirely. Because of this, in our planning for the next generation of Numba,
we’ve decided to first focus on an ahead-of-time compiler (like Mojo) and then
will treat just-in-time compilation as a special case. Ahead-of-time
compilation is much trickier because you have to make choices at build time
for which there are downstream consequences to consumers of your package. You
have to worry about ABI stability, supporting multiple variants of input
types, and different machine instruction set variants (like AVX, AVX2,
AVX-512, etc). At the same time, we do not want ahead-of-time compilation to
preclude JIT optimization across modules that were distributed separately. We
have been working on techniques to handle all of these situations in a new
library called PIXIE,
which allows additional metadata to be inserted into library files to enable
dynamic dispatch, CPU feature selection, and optional future recompilation of
functions with a JIT.
Conclusions and a Wishlist for Mojo
I want to conclude by reiterating that my goal with this article is not to
dampen enthusiasm for Mojo, but to help the reader be more curious about the
details when they read about new Python and Python-adjacent compiler projects.
Impressive looking speedup factors are not where the most difficult challenges
are.
I think it is an interesting idea to try to blend a Python-like syntax with
the capabilities of MLIR to target a wide range of potential hardware. Mojo’s
Python interoperability features could expand the capabilities of the Python
ecosystem, which would be a great thing. But, there is a tremendous amount of
work left to do on Mojo and a lot of design details have not been shared. We
simply don’t know enough to decide how Mojo will impact Python, or how best to
interact with it.
Given that, I will close with the wishlist of things I hope we see in Mojo in the coming months:
- A native multidimensional array type that exposes the same power as MLIR’s tensor type and linalg dialect. That will enable the same kind of array programming we are familiar with from NumPy.
- More explanation and infrastructure for managing data motion across the Mojo / Python boundary. NumPy arrays should be usable by Mojo native code (hopefully that tensor type we’re asking for in the previous item) without data copies. Similarly, being able to use Apache Arrow data structures across the two languages would be very powerful. In general, we need something extensible so that new data translators can be added.
- Permit an inversion of control, so that Mojo modules can be imported from the Python interpreter. Mojo could be an interesting alternative to Cython if that were possible, and if NumPy arrays could be passed to Mojo functions with minimal overhead.
- Expose the MLIR internals of Mojo via some kind of interface that lets us retarget the compiler. A true Python-to-Mojo compiler pipeline could be possible then, and that would allow a lot more experimentation. How should we use Mojo with new or custom MLIR dialects?
- More details about Mojo packaging in general. Will there be a package index for Mojo-specific modules by third parties? How should mixed Mojo and Python environments be handled? (Seriously, please take a look at conda. 🙂)
Of course, we need to see an open source Mojo compiler and runtime before many
of these things will be possible. Hopefully we’ll get more details on that in
the future as well.