<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Anaconda Engineering Blog</title><link href="https://engineering.anaconda.com/" rel="alternate"/><link href="https://engineering.anaconda.com/feeds/all.atom.xml" rel="self"/><id>https://engineering.anaconda.com/</id><updated>2023-09-01T00:00:00-05:00</updated><entry><title>Numba, Mojo, and Compilers for Numerical Computing</title><link href="https://engineering.anaconda.com/2023/09/numba-mojo.html" rel="alternate"/><published>2023-09-01T00:00:00-05:00</published><updated>2023-09-01T00:00:00-05:00</updated><author><name>Stan Seibert</name></author><id>tag:engineering.anaconda.com,2023-09-01:/2023/09/numba-mojo.html</id><summary type="html">&lt;p&gt;A comparison of Numba and Mojo, and a Mojo wishlist&lt;/p&gt;</summary><content type="html">&lt;p&gt;There have been a lot of articles written following Modular’s announcement of
&lt;a href="https://docs.modular.com/mojo/"&gt;Mojo&lt;/a&gt;, a Python “successor” language that
will bring high performance computing (especially for AI applications) to
Python users. Here at Anaconda, we’ve been working on the problem of HPC and
Python for over a decade now. As part of that work, we created
&lt;a href="https://numba.pydata.org"&gt;Numba&lt;/a&gt; in 2012, a Python just-in-time compiler
specifically for numerical computing, and have been improving it ever since.
We’ve learned a lot during that time about what it means to compile and
accelerate Python, and thought it would be useful to share some of that
perspective here. Along the way, you’ll learn about some of the design
tradeoffs Numba has had to navigate, and maybe get some context for the
challenges that Mojo will face as it takes a different (but potentially
complementary) approach to Python compilation.&lt;/p&gt;
&lt;h2&gt;When is Python useful for HPC?&lt;/h2&gt;
&lt;p&gt;The difficulty in deciding how Python should be best used for high performance
computing (HPC) hinges on the difficulty in defining the purpose of HPC. HPC
is often discussed in terms of maximizing execution speed of large, complex
calculations. For these purposes, Python is not very useful, except maybe as
a job configuration or control language delegating nearly all computation to
highly optimized components. (This is the traditional “glue language” role
that Python is very good at.)&lt;/p&gt;
&lt;p&gt;However, the economies of scale in software can confuse our understanding of
priorities. Some kinds of software, for example linear algebra algorithms,
are so ubiquitous across disciplines and so frequently used that they merit
spending a huge amount of developer effort on performance. There are almost
no situations where you should implement matrix multiplication yourself.
Instead, you should use MKL, or OpenBLAS, or cuBLAS, or some other library
that has had a tremendous amount of effort from hardware experts invested into
it. The time spent developing and debugging these sorts of libraries is tiny
compared to the amount of time the library will be used by the entire software
community. Additionally, it is often possible to find resources to fund these
fundamental libraries from hardware vendors and non-profit funding sources
because these applications are pivotal. This is the “HPC mass production”
scenario.&lt;/p&gt;
&lt;p&gt;But many software projects are not in this “mass production” category. They
are research projects where a small team of developers needs to iterate
quickly to discover a solution to a particular problem. Or they are
established projects where maintenance costs and barriers to contribution need
to be minimized because the team is small, or turnover (as in many academic
research contexts) is high. The incredible diversity of use cases in the
Python numerical computing ecosystem is its strength, but also means that
development processes optimized for “mass production” are not necessarily the
best choice. Speed matters, but total time matters more, including developer
onboarding time, feature development time, debugging time, maintenance time,
and packaging time, as well as execution time.&lt;/p&gt;
&lt;p&gt;This is the vast world of “high-enough performance computing,” and is full of
tradeoffs. Numba seeks to find a balance that reduces total person time in
the numerical computing space, of which execution time is one component. Mojo
is trying to do a similar thing, but with a different set of tradeoffs.
(Literally anyone who isn’t hand-coding assembly language for critical code
paths is trading execution speed for something.) Hopefully I can help
illustrate some of these tradeoffs in the following sections.&lt;/p&gt;
&lt;h2&gt;Python isn’t a language, it is a community&lt;/h2&gt;
&lt;p&gt;When people talk about the characteristics (like “speed” or “ease of use”) of
“Python,” they are usually mashing together a bunch of distinct parts that are
important to think about separately when designing a strategy to improve
Python in some way. These parts include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Python language specification&lt;/li&gt;
&lt;li&gt;The Python interpreter (usually CPython for most users)&lt;/li&gt;
&lt;li&gt;The Python standard library&lt;/li&gt;
&lt;li&gt;The huge collection of third party packages&lt;/li&gt;
&lt;li&gt;The network of users and developers who help create, maintain, and teach others about all of the above&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a new solution to “the Python performance problem” is announced, I find
it useful to understand how it impacts each of these items separately. For
example, Numba makes the following choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python language&lt;/em&gt;: No changes. Numba uses standard Python language
  features, like decorators and context managers, to intercept execution and
  annotate code for compilation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python interpreter&lt;/em&gt;: No changes, although Numba is CPython-specific. Numba
  is specifically designed to work inside CPython as a regular module to
  minimize barriers to use and maximize application compatibility.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python standard library&lt;/em&gt;: No changes, but Numba can only accelerate a small
  number of standard Python types and built-in functions. Applications are
  free to use the entire standard library, but only outside of Numba-optimized
  functions (with some exceptions).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Third party packages&lt;/em&gt;: Numba does not prevent usage of third party packages
  in an application because Numba is confined only to specific functions the
  developer chooses to compile. However, Numba cannot compile code that calls
most third party libraries either. Numba includes specific optimizations for
&lt;a href="https://numpy.org/"&gt;NumPy&lt;/a&gt; arrays and functions, and support for
  other libraries can be added using Numba’s extension mechanism (for example
  &lt;a href="https://awkward-array.org/doc/main/"&gt;Awkward Array&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Users and developers&lt;/em&gt;: Numba tries to be easy to pick up for anyone
  familiar with Python and NumPy, although more advanced capabilities require
  learning a few new concepts. Numba also strives to be easy to integrate
  into other Python packages, avoiding a compile step during packaging, in
  some cases allowing packages to be distributed as pure Python source code
  which is compiled for the user’s specific hardware at runtime.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In general, Numba is trying to strike a balance of minimizing the barrier to adopting Numba in a numerical application by existing simply as a module that is compatible with the standard CPython interpreter. The application author does need to actively opt-in to Numba for each function that needs to be compiled, but we find this minimizes surprises and leads to more successful usage in the long run.&lt;/p&gt;
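&lt;p&gt;As a minimal sketch of that opt-in model (the function below is our illustration, not code from Numba’s docs), a single decorator marks the one hot function for compilation, and the rest of the program runs as ordinary CPython:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
from numba import njit  # njit is shorthand for jit(nopython=True)

@njit  # opting in: only this one function is compiled
def mean_abs_diff(x, y):
    total = 0.0
    for i in range(x.shape[0]):
        total += abs(x[i] - y[i])
    return total / x.shape[0]

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
mean_abs_diff(a, b)  # compiled on first call, specialized to these argument types
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
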
&lt;p&gt;Mojo takes a different approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python language&lt;/em&gt;: Mojo adds significant new syntax to the Python language, aspiring to be a “successor language” for Python that offers more low-level types and performance control. In particular, Mojo defines a new kind of function (using the “fn” keyword) that will be more strictly typed and less dynamic so that faster code can be generated. Regular functions (defined with the standard “def” keyword) have more dynamic and implicit behavior, which will make them feel more familiar to Python developers, but they are not “Python functions” in a strict sense.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python interpreter&lt;/em&gt;: Mojo has its own runtime interpreter (called “mojo”) which must be used to call Mojo code, and a compiler for standalone binaries will be available as well. Mojo also links to the CPython runtime, which it uses to allow it to import Python modules and call methods on Python objects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python standard library&lt;/em&gt;: Mojo has its own standard library, which currently is very low-level (at least in the public documentation). Mojo modules can access the Python standard library via the Python interop functionality.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Third party packages&lt;/em&gt;: As the Mojo compiler is not generally available, no one has made any Mojo packages. It is unclear how they will be distributed, or what the packaging infrastructure and tooling will look like. Mojo can access 3rd party Python packages via the interop interface, same as the Python standard library.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Users and developers&lt;/em&gt;: Mojo’s target audience currently seems to be developers of what I would call “application code.” Mojo is not yet targeting anyone creating reusable modules for redistribution, nor Python package developers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The best way I’ve found to think about Mojo is that it is trying to be a language that looks like Python, feels somewhat familiar to Python developers, and is designed to interoperate with Python out of the box. It is not a drop-in replacement for Python, however, unless you are writing a new application from scratch that makes limited use of other Python modules.&lt;/p&gt;
&lt;h2&gt;Mojo fn/def vs. Numba’s object/nopython modes&lt;/h2&gt;
&lt;p&gt;One of the more interesting decisions in Mojo is to create two kinds of functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;def&lt;/code&gt; functions: These functions are designed to feel like standard Python
  functions. The dynamic behavior we expect from Python should continue to
work here, although Mojo def functions can go beyond Python by creating
  immutable variables, defining type signatures using Mojo types, and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;fn&lt;/code&gt; functions: These functions have more strict rules about the contents of
  the function, such as required type declarations for arguments and local
  variables, and explicit declaration of exceptions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is clear from the restrictions that &lt;code&gt;fn&lt;/code&gt; functions are designed to be
statically typed by a compiler and can be compiled ahead of time with the
performance one would expect from a statically typed language.&lt;/p&gt;
&lt;p&gt;Numba takes a somewhat similar approach, and has two different compilation
modes, which are called “object mode” and “nopython” mode. These two compiler
modes have the following behavior:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;object mode&lt;/em&gt;: Does not attempt type inference on the function, treating all
  values as opaque Python objects. Object mode simulates the Python
  interpreter at compile time, emitting calls to the CPython C API to perform
  all operations on the Python objects in the function. This results in a
  negligible performance improvement to the code. Due to limitations in the
bytecode analysis front-end of Numba, some Python language constructs are
  not supported. Execution is still subject to the Python Global
  Interpreter Lock (GIL).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;nopython mode&lt;/em&gt;: This performs a complete type-inference pass on the
  function, and, for the subset of Python that is supported, eliminates all
  direct manipulation of Python objects. Scalar values are unboxed into
  native machine types, and operations on NumPy arrays directly modify the
  data buffer of those arrays. Most reference counting is completely
  eliminated, although a few instances may remain depending on whether new
  NumPy arrays are allocated. Numba, however, uses its own thread-safe
  reference counting internally, so the GIL can be released and multiple
  threads can execute Numba functions simultaneously.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In practice, we’ve found object mode is no longer very useful. In early
Numba, nopython mode was very limited, so object mode was paired with a
technique called “loop lifting.” The Numba compiler would find loops within a
function that could be compiled to nopython mode and would “lift” them out of
the function body, and the rest of the function would be compiled with object
mode. As nopython mode has gained more capabilities (including memory
allocation and access to typed Python-like containers), we find the use cases
for object mode compilation of functions have diminished. In fact, we’ve even
inverted the situation, and now have support for object mode blocks within a
nopython mode function, which is useful for things like supporting Python
callbacks.&lt;/p&gt;
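&lt;p&gt;To make that inversion concrete, here is a small sketch using the public &lt;code&gt;objmode&lt;/code&gt; context manager (the function body is an illustration, not a real workload):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from numba import njit, objmode

@njit
def sum_squares(x):
    acc = 0.0
    for v in x:
        acc += v * v
    with objmode():  # an object-mode block inside a nopython-mode function
        print("progress:", acc)  # arbitrary Python, e.g. a logging callback
    return acc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
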
&lt;p&gt;Since Numba exists within the CPython interpreter, the best practice for
functions that cannot be compiled with nopython mode is to not compile them at
all! Thanks to the Faster CPython effort, these sorts of generic functions
are getting faster with every release anyway, and Numba is most effective on
the numerical core of the application, where the compiler can strip away all
of the Python overhead. For this reason, Siu Kwan Lam (one of Numba’s
creators) often calls Numba a “performance linter” because the limitations of
nopython mode guide Numba users into using Python in a way that can be most
effectively sped up by a compiler, usually between 2x and 200x. If we can’t
achieve that, there isn’t much point to compilation in the first place.&lt;/p&gt;
&lt;p&gt;At first glance, Numba’s object mode seems like Mojo’s &lt;code&gt;def&lt;/code&gt; functions, but the
difference is that Mojo’s &lt;code&gt;def&lt;/code&gt; functions feel like Python functions while not
actually being Python functions. Typing is optional, but new Mojo declarations (like &lt;code&gt;var&lt;/code&gt;
and &lt;code&gt;let&lt;/code&gt;) are allowed, and the types are still Mojo types, not Python types.
Access to Python modules needs to happen via special wrappers created by the
Mojo Python interface, which can import Python modules. Mojo uses CPython as
a library to handle these calls, but the body of the &lt;code&gt;def&lt;/code&gt; function itself is
still executed by the Mojo runtime. In principle, this could make Mojo behave
a lot like Cython, but we still don’t know much about how convenient it will
be to use Python from Mojo in practice, and we know nothing about whether
Python will be able to call Mojo functions easily.&lt;/p&gt;
&lt;h2&gt;Array Types and Operations&lt;/h2&gt;
&lt;p&gt;The most important part of any numerical computing system in 2023 is the
multidimensional array. Multidimensional arrays are extremely useful and
memory efficient containers for numerical data, and concepts like
broadcasting, universal functions, and advanced indexing have dramatically
simplified the application of numerical computing to many situations,
including machine learning. Moreover, arrays are very compiler-friendly data
structures, which has created opportunities for many compilers and compiler
technologies. In the Python space, NumPy has become the standard array, and
without NumPy, there would be no Numba.&lt;/p&gt;
&lt;p&gt;The most surprising part of the Mojo announcement for me was the complete lack
of a built-in multidimensional array type. Given Mojo’s target use case of
machine learning workloads, I can only assume information about the
multidimensional array was held back for a future update, because it
absolutely has to exist for Mojo to be useful. The absence was doubly
surprising because &lt;a href="https://mlir.llvm.org/"&gt;MLIR&lt;/a&gt; (which the Mojo
documentation mentions as part of its core compiler design) includes a &lt;a href="https://mlir.llvm.org/docs/Dialects/TensorOps/"&gt;tensor
type&lt;/a&gt;, which is very similar
to the NumPy array, and also defines a &lt;a href="https://mlir.llvm.org/docs/Dialects/Linalg/"&gt;generic linear algebra
operation&lt;/a&gt; (&lt;code&gt;linalg.generic&lt;/code&gt;) in
one of the MLIR dialects that is extremely powerful. I would describe the
&lt;code&gt;linalg.generic&lt;/code&gt; operation as a superset of the &lt;a href="https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html"&gt;generalized
ufunc&lt;/a&gt;
that both NumPy and Numba support (via the
&lt;a href="https://numba.pydata.org/numba-doc/dev/user/vectorize.html#the-guvectorize-decorator"&gt;&lt;code&gt;@guvectorize&lt;/code&gt;&lt;/a&gt;
decorator), which can represent a surprisingly wide range of array operations
in a very natural and compact way. Not seeing &lt;code&gt;linalg.generic&lt;/code&gt; mentioned in
the Mojo docs was very disappointing, and I hope we learn more about it in a
future update.&lt;/p&gt;
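&lt;p&gt;For readers unfamiliar with generalized ufuncs, here is a small illustrative sketch using &lt;code&gt;@guvectorize&lt;/code&gt; (the kernel is our own example): the layout string declares a reduction over the last axis, and NumPy-style broadcasting handles all leading dimensions automatically.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
from numba import guvectorize

# "(n)-&amp;gt;()" declares a reduction over the last axis; broadcasting
# handles any number of leading dimensions automatically.
@guvectorize(["void(float64[:], float64[:])"], "(n)-&amp;gt;()")
def row_sum(row, out):
    acc = 0.0
    for i in range(row.shape[0]):
        acc += row[i]
    out[0] = acc  # scalar outputs are written through a 1-element view

x = np.arange(12.0).reshape(3, 4)
row_sum(x)  # array([ 6., 22., 38.])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
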
&lt;p&gt;With no array type to operate on, Mojo currently lacks all the array
functions that one would expect from a numerical computing language. These
can surely be added, and even implemented directly in the Mojo language, but
they are not there now. Hopefully Modular will learn from the recently
created &lt;a href="https://data-apis.org/array-api/2022.12/"&gt;Python Array API standard&lt;/a&gt;
and consider using that as the basis for their user-facing array API.&lt;/p&gt;
&lt;p&gt;Even beyond the array type and functions, it will be extremely important for
Mojo to have a clear mechanism to pass array-like data between it and other
languages, especially Python. &lt;a href="https://numpy.org/doc/stable/reference/c-api/index.html"&gt;NumPy’s C
API&lt;/a&gt; is just as
important as its Python API, as it enables passing data in and out of the
Python runtime to 3rd party libraries written in other languages. Mojo will
have a long road to broader adoption, so anything that eases interoperability
with existing numerical computing libraries will be essential.&lt;/p&gt;
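&lt;p&gt;As a Python-side illustration of the kind of interoperability we mean (the &lt;code&gt;bytearray&lt;/code&gt; below stands in for memory owned by any foreign runtime), the buffer protocol lets NumPy view external memory without copying:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

raw = bytearray(8 * 6)  # stand-in for memory owned by another language/runtime
arr = np.frombuffer(raw, dtype=np.float64)  # zero-copy view of that memory
arr[:] = [1, 2, 3, 4, 5, 6]
# both names alias the same bytes; nothing was serialized or copied
assert bytes(raw[:8]) == arr[:1].tobytes()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
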
&lt;h2&gt;Extensibility and MLIR&lt;/h2&gt;
&lt;p&gt;Whether or not the Mojo compiler needs to be extensible is unclear. Many
languages are self-contained, in that you are expected to “extend” the
system by writing even more code in that language. That seems to be how Mojo
is being implemented now, as most of the syntax and standard library of Mojo
is focused on very low-level types and concepts, like SIMD operations and
object lifetime management. With those ingredients, many of the essential
higher level features needed in Mojo can be implemented in Mojo.&lt;/p&gt;
&lt;p&gt;Python (and Numba) are not like this at all. Python’s superpower (and curse
at times) is that it makes it very easy to extend Python with other languages,
typically C or C++, but recently Rust is also growing in popularity. Every
popular numerical computing package in Python either includes some non-Python
code in it, or depends on one that does. For this reason, the C API of Python
is very important for enabling extension of the interpreter to new types and
modules. In a similar way, Numba’s limitations mean that there are some
low-level operations that cannot be generated by Numba when it compiles a
function with its @jit decorator. For these cases, Numba has an extension API
which allows a variety of things, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generation of machine code (technically LLVM IR) to implement specific low level operations.&lt;/li&gt;
&lt;li&gt;Custom overrides of existing Python functions to define how they should be compiled for specific data types (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Definition of custom “boxing” and “unboxing” methods for passing new data types through Numba-compiled code.&lt;/li&gt;
&lt;/ul&gt;
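&lt;p&gt;As a sketch of the second item (our example, using the public &lt;code&gt;numba.extending&lt;/code&gt; API), an &lt;code&gt;@overload&lt;/code&gt; registration teaches the compiler how to handle calls to an otherwise ordinary Python function:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from numba import njit
from numba.extending import overload

def clamp(x, lo, hi):  # ordinary Python function, used as-is by the interpreter
    return min(max(x, lo), hi)

@overload(clamp)  # teach the compiler how to handle calls to clamp()
def clamp_overload(x, lo, hi):
    def impl(x, lo, hi):
        return min(max(x, lo), hi)
    return impl  # the returned implementation is itself compiled

@njit
def f(v):
    return clamp(v, 0.0, 1.0)

f(3.5)  # 1.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
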
&lt;p&gt;While Mojo does not necessarily need these things due to its design, I believe
the Python interoperability feature in Mojo will need an extensible mechanism
to define how to translate data structures across the Python / Mojo boundary.
We know from Numba experience that anything like a NumPy array is very easy to
move back and forth, and we believe that Apache Arrow data structures offer a
similar possibility for more heterogeneous data (dataframes and beyond). Numba
has even created a few typed data containers specifically to facilitate
movement of &lt;a href="https://numba.readthedocs.io/en/stable/reference/pysupported.html#typed-list"&gt;statically-typed
lists&lt;/a&gt;
and &lt;a href="https://numba.readthedocs.io/en/stable/reference/pysupported.html#typed-dict"&gt;statically-typed
dictionaries&lt;/a&gt;
across the Numba / Python boundary. Manipulating Python data through a wrapper
mechanism (as is currently described in the Mojo docs) is a fine general
solution, but faster mechanisms will be needed if Mojo is going to coexist
with non-trivial Python code.&lt;/p&gt;
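&lt;p&gt;A brief sketch of those typed containers in use (the values are illustrative): a &lt;code&gt;numba.typed.Dict&lt;/code&gt; can be created in the interpreter, mutated inside compiled code, and read back, without per-element boxing of Python objects.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from numba import njit, types
from numba.typed import Dict

d = Dict.empty(key_type=types.unicode_type, value_type=types.float64)
d["x"] = 1.5  # usable from the interpreter...

@njit
def scale(d, factor):  # ...and inside compiled code, without boxing elements
    for k in d:
        d[k] *= factor

scale(d, 2.0)
d["x"]  # 3.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
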
&lt;p&gt;Another area of extensibility that could be important to Mojo and is
definitely important to the future of Numba is interoperability with other
MLIR-based tools. MLIR is designed to be a high level compiler &lt;a href="https://en.wikipedia.org/wiki/Intermediate_representation"&gt;intermediate
representation&lt;/a&gt;
(IR) with some unique features. MLIR is extensible by defining new dialects,
which allows the same IR to be used on a wide range of compiler use cases,
and, in principle, permits different compiler tools that support MLIR to work
together. The Numba team is very excited about the potential for MLIR and
plans to incorporate it into the next generation of Numba. However, we can’t
tell how Mojo plans to expose its MLIR-based internals. Would something like
Numba be able to target the Mojo compiler pipeline directly, or possibly the
reverse: could future Numba consume the functions produced by the Mojo
frontend and parser?&lt;/p&gt;
&lt;h2&gt;Packaging and Composability&lt;/h2&gt;
&lt;p&gt;The final area of Mojo we are curious about is its packaging approach. One feature of Mojo that was emphasized was the ability of the Mojo compiler to produce fully self-contained executables that could be deployed wherever they were needed. Application distribution is a very important use case that has not gotten as much attention in the Python ecosystem as package distribution. Python’s plethora of packaging tools focus on allowing users to install Python components along with their complex web of dependencies. The &lt;a href="https://pypi.org/"&gt;Python Package Index&lt;/a&gt; (and adjacent package repositories, like &lt;a href="https://conda-forge.org/"&gt;conda-forge&lt;/a&gt; and the &lt;a href="https://www.anaconda.com/download"&gt;Anaconda Distribution&lt;/a&gt;) is truly one of Python’s superpowers, and if Mojo wants to jumpstart a broad community of developers, it will also need a packaging story. There are so many unanswered questions at this point:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Will hybrid Mojo-Python projects be possible, with both Mojo and Python
  dependencies?&lt;/li&gt;
&lt;li&gt;Will Mojo packages only distribute source code, with the intent that all
  Mojo applications must be compiled by the application builder along with all
  of their dependencies?&lt;/li&gt;
&lt;li&gt;Or, will Mojo allow precompiled modules to be distributed, in which case, how
  will inter-module optimization be enabled?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Aside:&lt;/em&gt; Although this may be a long shot, I think it would be great if Mojo
would consider &lt;a href="https://conda.io/projects/conda/en/latest/index.html"&gt;conda&lt;/a&gt;
as a package manager, as it already has excellent Python support, but is also a
cross-platform and &lt;em&gt;multi-language&lt;/em&gt; packaging system. Mojo doesn't need to
build yet another package manager, and conda would be an excellent choice for
the complex hybrid Mojo/Python projects that Modular hopes will exist in the
future. Give us a call, Chris, if you want to chat. 😀&lt;/p&gt;
&lt;p&gt;The packaging story for Numba today is much simpler, as it is just a Python
package that does just-in-time compilation. Projects that use Numba
compilation do so by taking Numba as a dependency, and then can choose to
distribute their own package as pure source code. The translation from Python
to machine code happens at runtime on the end user’s system. This
dramatically simplifies the distribution of those packages downstream of
Numba, whose authors do not need to build wheels for each target platform unless they
have other compiled code aside from Numba-decorated functions.&lt;/p&gt;
&lt;p&gt;However, the Numba team definitely appreciates that JIT compilation is not
suitable for all situations, either because of the increased startup time, or
limitations of the target platform, which might forbid JIT compilation
entirely. Because of this, in our planning for the &lt;a href="https://numba.discourse.group/t/proposal-development-focus-for-2023h1/1773"&gt;next generation of
Numba&lt;/a&gt;,
we’ve decided to first focus on an ahead-of-time compiler (like Mojo) and then
will treat just-in-time compilation as a special case. Ahead-of-time
compilation is much trickier because you have to make choices at build time
for which there are downstream consequences to consumers of your package. You
have to worry about ABI stability, supporting multiple variants of input
types, and different machine instruction set variants (like AVX, AVX2,
AVX-512, etc.). At the same time, we do not want ahead-of-time compilation to
preclude JIT optimization across modules that were distributed separately. We
have been working on techniques to handle all of these situations in a new
library called
&lt;a href="https://github.com/numba/pixie/blob/poc/numba_mvp/mvp/pixie_demonstration.ipynb"&gt;PIXIE&lt;/a&gt;,
which allows additional metadata to be inserted into library files to enable
dynamic dispatch, CPU feature selection, and optional future recompilation of
functions with a JIT.&lt;/p&gt;
&lt;h2&gt;Conclusions and a Wishlist for Mojo&lt;/h2&gt;
&lt;p&gt;I want to conclude by reiterating that my goal with this article is not to
dampen enthusiasm for Mojo, but to help the reader be more curious about the
details when they read about new Python and Python-adjacent compiler projects.
Impressive looking speedup factors are not where the most difficult challenges
are.&lt;/p&gt;
&lt;p&gt;I think it is an interesting idea to try to blend a Python-like syntax with
the capabilities of MLIR to target a wide range of potential hardware. Mojo’s
Python interoperability features could expand the capabilities of the Python
ecosystem, which would be a great thing. But, there is a tremendous amount of
work left to do on Mojo and a lot of design details have not been shared. We
simply don’t know enough to decide how Mojo will impact Python, or how best to
interact with it.&lt;/p&gt;
&lt;p&gt;Given that, I want to conclude by reiterating the wishlist of things I hope we see in Mojo in the coming months:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A native multidimensional array type that exposes the same power as MLIR’s tensor type and linalg dialect. That will enable the same kind of array programming we are familiar with from NumPy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;More explanation and infrastructure for managing data motion across the Mojo / Python boundary. NumPy arrays should be usable by Mojo native code (hopefully that tensor type we’re asking for in the previous item) without data copies. Similarly, being able to use Apache Arrow data structures across the two languages would be very powerful. In general, we need something extensible so that new data translators can be added.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Permit an inversion of control, so that Mojo modules can be imported from the Python interpreter. Mojo could be an interesting alternative to Cython if that was possible, and if NumPy arrays could be passed to Mojo functions with minimal overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Expose the MLIR internals of Mojo via some kind of interface that lets us retarget the compiler. A true Python-to-Mojo compiler pipeline could be possible then, and that would allow a lot more experimentation. How should we use Mojo with new or custom MLIR dialects?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;More details about Mojo packaging in general. Will there be a package index for Mojo-specific modules by third parties? How should mixed Mojo and Python environments be handled? (Seriously, please take a look at conda. 🙂)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, we need to see an open source Mojo compiler and runtime before many
of these things will be possible. Hopefully we’ll get more details on that in
the future as well.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>Decimals type for pandas</title><link href="https://engineering.anaconda.com/2023/06/pandas-decimals.html" rel="alternate"/><published>2023-06-20T00:00:00-05:00</published><updated>2023-06-20T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-06-20:/2023/06/pandas-decimals.html</id><summary type="html">&lt;p&gt;Options for using fixed-precision decimals in pandas&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Some time &lt;a href="http://martindurant.github.io/blog/anaconda-hack/"&gt;ago&lt;/a&gt; I had a
go at implementing a "decimals" extension type for pandas. This followed
stumbling upon parquet data of that type, which pandas could not read:
&lt;code&gt;pyarrow&lt;/code&gt; would error and &lt;code&gt;fastparquet&lt;/code&gt; would convert to floats. The decimal
type, with known, fixed precision, is very important in real-world applications
such as finance, where exact equality of fractional values is required.
The following should succeed, but does not with standard Python or pandas:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;
&lt;span class="kc"&gt;False&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;
&lt;span class="kc"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Although one could solve this with python's builtin &lt;code&gt;decimal.Decimal&lt;/code&gt;
objects, we want something that supports vectorised compute, which will
be orders of magnitude faster.&lt;/p&gt;
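&lt;p&gt;For completeness, the builtin type does give the exact arithmetic we want, just slowly and one object at a time:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from decimal import Decimal
&amp;gt;&amp;gt;&amp;gt; Decimal("0.1") + Decimal("0.1") + Decimal("0.1") == Decimal("0.3")
True
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
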
&lt;h2&gt;Implementations&lt;/h2&gt;
&lt;p&gt;I set out to make an implementation for discussion. This made use of
pandas' "extension types" system, which allows non-standard dtypes and
array types to be registered. A decent proof of concept was pretty quick
to make (see &lt;a href="https://github.com/intake/pandas-decimal"&gt;the repo&lt;/a&gt;).
However, in conversations around it, it was brought to my attention
that the advent of more generalised &lt;code&gt;arrow&lt;/code&gt; type support in pandas would
expose their &lt;a href="https://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow4Type4type10DECIMAL128E"&gt;decimal&lt;/a&gt;
type for use in pandas too.&lt;/p&gt;
&lt;p&gt;So, if &lt;code&gt;arrow&lt;/code&gt; could handle this, then the need for &lt;code&gt;pandas-decimal&lt;/code&gt;
&lt;a href="https://github.com/intake/pandas-decimal/issues/3#issuecomment-1377694506"&gt;goes away&lt;/a&gt;.
I accepted this, but later conversation spurred me to look again. So let's
summarise the situation.&lt;/p&gt;
&lt;h2&gt;Comparisons&lt;/h2&gt;
&lt;h3&gt;Ease of use&lt;/h3&gt;
&lt;p&gt;From the python perspective, pyarrow interoperates with &lt;code&gt;decimal.Decimal&lt;/code&gt; objects; internally it holds
an optimised binary representation. Arithmetic with integers is automatic, and explicit conversion to/from
float is supported, but arithmetic with floats is not.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;decimal&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;dask.dataframe&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pa&lt;/span&gt;

&lt;span class="c1"&gt;# dtype object&lt;/span&gt;
&lt;span class="n"&gt;pa_dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decimal128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# creation from decimal objects&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8093.012&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8094.123&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8095.234&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8096.345&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8097.456&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8098.567&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArrowDtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# explicit conversion&lt;/span&gt;
&lt;span class="n"&gt;df_float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_float&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArrowDtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# fail to operate on floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# ArrowTypeError&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# ArrowTypeError&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Conversely, &lt;code&gt;pandas-decimal&lt;/code&gt; does not handle &lt;code&gt;decimal.Decimal&lt;/code&gt; objects at all, but instead implicitly converts
to and from floats on access. This means that arithmetic works as you would normally expect,
except that you lose the precision guarantees of the decimal representation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas_decimal&lt;/span&gt;

&lt;span class="c1"&gt;# creation with floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;8093.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8094.123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8095.234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8096.345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8097.456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8098.567&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal[3]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# conversion&lt;/span&gt;
&lt;span class="c1"&gt;# explicit conversion&lt;/span&gt;
&lt;span class="n"&gt;df_float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_float&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal[3]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# operating on floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# OK&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# OK&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ValueError&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Performance&lt;/h3&gt;
&lt;p&gt;Comparing the two vectorised array types to python objects and to each other, we see clearly that
pandas-decimal wins on all operations. I also include one plain-float calculation, showing that
pandas-decimal is even faster than that.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;8093.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8094.123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8095.234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8096.345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8097.456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8098.567&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;

&lt;span class="c1"&gt;# pyarrow&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArrowDtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mf"&gt;16.2&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;85.8&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;150&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;1.09&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mf"&gt;17.5&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;54.1&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Error&lt;/span&gt;

&lt;span class="c1"&gt;# pandas-decimal&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal[3]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mf"&gt;4.62&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;48.5&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
&lt;span class="mf"&gt;4.9&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;530&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mf"&gt;2.16&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;1.79&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;119&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;

&lt;span class="c1"&gt;# pure python&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;475&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;4.79&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;460&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;7.96&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;392&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;573&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Error&lt;/span&gt;

&lt;span class="c1"&gt;# inexact floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mf"&gt;7.6&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;115&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;IO&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;pyarrow&lt;/code&gt;'s decimal column can be saved to/loaded from parquet without conversion, and quickly. Conversion
to text (e.g., for CSV) is as slow as you would expect, since it doesn't gain from vectorisation.
&lt;code&gt;pandas-decimal&lt;/code&gt; is not integrated with any IO. In the original design, it was anticipated that integration
with &lt;code&gt;fastparquet&lt;/code&gt; would be trivial, and would not need any additional dependencies, but this has not been
implemented, since development of pandas-decimal halted.&lt;/p&gt;
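&lt;p&gt;A minimal round-trip sketch of the &lt;code&gt;pyarrow&lt;/code&gt; path (the file name is illustrative): the decimal column is written to and read from parquet with its type intact.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import decimal
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array([decimal.Decimal("8093.012")],
                                type=pa.decimal128(7, 3))})
pq.write_table(table, "dec.parquet")      # stored natively as DECIMAL
pq.read_table("dec.parquet")["x"].type    # decimal128(7, 3), no conversion
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
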
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;pyarrow&lt;/code&gt; does not provide the all-encompassing solution to fixed-precision decimals we would have hoped
for - at least not yet. One assumes that the functionality may well already exist within &lt;code&gt;arrow&lt;/code&gt; itself,
but is not well exposed to &lt;code&gt;pandas&lt;/code&gt; due to lack of interest/motivation. This surprised me somewhat,
since finance is such a big user of &lt;code&gt;pandas&lt;/code&gt;, and has traditionally required exact decimals, for instance
in its database models.&lt;/p&gt;
&lt;p&gt;For the time being, &lt;code&gt;pandas-decimal&lt;/code&gt;, a very small-effort proof-of-concept, shows clear advantages in
speed and flexibility. However, I still don't particularly recommend it to anyone, for three main reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it is a separate, non-standard install, whereas &lt;code&gt;pyarrow&lt;/code&gt; is likely already available in most environments&lt;/li&gt;
&lt;li&gt;it lacks testing or any support (the status of this in pandas/pyarrow is not known to me)&lt;/li&gt;
&lt;li&gt;the integration with parquet has not been done&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I am hoping that, by writing this post, I will motivate &lt;code&gt;arrow&lt;/code&gt; developers to improve decimal support, so
that &lt;code&gt;pandas-decimal&lt;/code&gt; can be retired even as a proof of concept.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>OSS Roundup</title><link href="https://engineering.anaconda.com/2023/05/oss-20230507.html" rel="alternate"/><published>2023-05-07T00:00:00-05:00</published><updated>2023-05-07T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-05-07:/2023/05/oss-20230507.html</id><summary type="html">&lt;p&gt;Weekly news from Anaconda's OSS engineers&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is our ongoing series of articles in which we will give weekly highlights
of open-source software (OSS) activities here at Anaconda. We will list various achievements
of the last week &lt;em&gt;or so&lt;/em&gt;,
link to interesting things and give brief details of ongoing work and plans. Each team
will only write something when they have something to say.&lt;/p&gt;
&lt;h3&gt;Conda (language-agnostic, multi-platform package management ecosystem)&lt;/h3&gt;
&lt;p&gt;We’ve released updates to conda-package-handling and conda-package-streaming to reduce
memory usage. This will be the first conda-package-handling release published to PyPI, once the
PyPI admins free up the name using PyPI’s new organization feature.&lt;/p&gt;
&lt;h3&gt;Distdatacats Team (remote bytes, file formats, catalogs and data processing)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;fsspec 2023.5.0 and friends are out&lt;/li&gt;
&lt;li&gt;Reference Filesystem can now write references directly to parquet, allowing parquet-to-parquet
  combining, which should have a much smaller memory footprint for very large reference sets
  (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;dask-awkward optimisations are finally in a good place: layer merging works and is fast, and we
  can run column optimisation on only one partition to avoid scaling issues. Upstream
  dask &lt;code&gt;cull()&lt;/code&gt; remains an issue, scaling with number of partitions, and we are looking
  at ways to avoid this. The high-energy physics workflows prompting this introspection
  are easily the biggest dask task graphs in existence.&lt;/li&gt;
&lt;li&gt;Article on &lt;a href="https://engineering.anaconda.com/2023/05/dask-parquet-s3.html"&gt;this very blog&lt;/a&gt;
  about benchmarking a particular dask-parquet-s3 workflow and what we learned.&lt;/li&gt;
&lt;/ul&gt;
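&lt;p&gt;As a rough sketch of what consuming parquet-backed references can look like (the path and
target options here are invented for illustration):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import fsspec

# references stored as parquet (hypothetical path); the data bytes live on s3
fs = fsspec.filesystem(
    "reference",
    fo="refs.parq",
    remote_protocol="s3",
    remote_options={"anon": True},
)
m = fs.get_mapper("")  # e.g. hand this mapper to zarr
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
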
&lt;h3&gt;Numba (JIT-compiling python code to make it fast)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://numba.discourse.group/t/ann-numba-0-57-0-and-llvmlite-0-40-0/1914"&gt;Numba 0.57.0 and llvmlite 0.40.0 have been released&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Python 3.11, NumPy 1.24, and LLVM 14 support. See &lt;a href="https://numba.readthedocs.io/en/0.57.0/release-notes.html#version-0-57-0-1-may-2023"&gt;Changelog&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;This release had 253 PRs from 47 different authors!&lt;/li&gt;
&lt;li&gt;Started to look at integrating numba_rvsdg into numba&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Jupyter (in-browser IDE for python and others)&lt;/h3&gt;
&lt;p&gt;The team has released the 1.0.0 version of &lt;a href="https://nbclassic.readthedocs.io/en/latest/nbclassic.html"&gt;nbclassic&lt;/a&gt;,
a package that allows the “classic”
Jupyter Notebook (equivalent to Notebook 6.5) to be installed and used alongside JupyterLab
or Notebook 7 in an environment. We’ve also put out a release of the
&lt;a href="https://github.com/Jupyter-contrib/jupyter_nbextensions_configurator"&gt;jupyter-nbextensions-configurator&lt;/a&gt;,
which fixes some compatibility issues with nbclassic.
This week the team will be at JupyterCon, so stop by the Anaconda booth and say hello!&lt;/p&gt;
&lt;h3&gt;BeeWare (deploy python projects to mobile and elsewhere)&lt;/h3&gt;
&lt;p&gt;This week, the BeeWare team has been cleaning up after the PyCon US sprints. The sprints
generated dozens of major and minor feature contributions; this week we’ve been able to
merge nearly all of them.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>Benchmarking a dask-parquet-s3 workflow</title><link href="https://engineering.anaconda.com/2023/05/dask-parquet-s3.html" rel="alternate"/><published>2023-05-04T00:00:00-05:00</published><updated>2023-05-04T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-05-04:/2023/05/dask-parquet-s3.html</id><summary type="html">&lt;p&gt;Benchmarking is still hard...&lt;/p&gt;</summary><content type="html">&lt;h3&gt;Foreword&lt;/h3&gt;
&lt;p&gt;Benchmarking is &lt;a href="https://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks"&gt;hard&lt;/a&gt;
and usually biased by the author's experience and viewpoint, which strongly affects
what they choose to benchmark and how many "tricks" they know to optimise performance.&lt;/p&gt;
&lt;p&gt;This article is an extension of benchmarking on the
&lt;a href="https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes"&gt;Coiled blog&lt;/a&gt;, which showed
that using PyArrow string rather than python string objects is beneficial for the
workflow presented. I don't disagree with that conclusion, but would like to explore
the performance details further.&lt;/p&gt;
&lt;p&gt;As a side-note, the possible
advantages of PyArrow string storage were part of the discussion when Pandas decided to
switch to using PyArrow as the default Parquet loading engine (a decision Dask later
followed). As the author of &lt;code&gt;fastparquet&lt;/code&gt;, the previous
default, I clearly have a vested interest in showing that the package is still a performance
contender, so take my observations with some salt.&lt;/p&gt;
&lt;h3&gt;Setup&lt;/h3&gt;
&lt;p&gt;As in the original benchmark, we will be timing the following workflow:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tipped&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tips&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hvfhs_license_num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tipped&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The timing will be done on the last line only, so the IO and shuffle-compute are
both contributing, but IO dominates. The data are 25GiB of parquet in 720 files in a
public access bucket on AWS S3. Only the first line (the one with the
ellipsis) will change between runs. Column "hvfhs_license_num" contains strings
with a small number of unique values, and "tips" is float32. All columns are SNAPPY
compressed.&lt;/p&gt;
&lt;p&gt;For the timings here, there were 10 dask workers with 1 thread and 4GB of memory
each, in a Kubernetes cluster via dask-gateway in AWS us-east-1. This is not the same
cluster setup as the original! The same cluster was used throughout, and there
was no significant memory pressure at any point.
All versions were at the current latest (dask 2023.4.0, pyarrow 11.0.0, pandas 2.0.1,
fastparquet 2023.4.0). Each time is the best of three repeats.&lt;/p&gt;
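&lt;p&gt;For context, a cluster like this can be created along the following lines (a sketch; the
available worker options depend on the particular dask-gateway deployment):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from dask_gateway import Gateway

gateway = Gateway()              # address/auth come from local configuration
cluster = gateway.new_cluster()  # worker size is set via deployment-specific options
cluster.scale(10)                # ten workers, as used for these timings
client = cluster.get_client()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
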
&lt;h3&gt;Runs&lt;/h3&gt;
&lt;h4&gt;1. Baseline (pyarrow with python strings)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.convert-string&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 4m3s&lt;/p&gt;
&lt;p&gt;Everything here is at its default settings.&lt;/p&gt;
&lt;h4&gt;2. Use fastparquet instead&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m40s&lt;/p&gt;
&lt;p&gt;Yes, here is my big bias. Whenever someone says something can be improved in arrow,
I try with &lt;code&gt;fastparquet&lt;/code&gt;. What do you know? It's magically much better in this case.
I can't say exactly why this is. Note that &lt;code&gt;fastparquet&lt;/code&gt;
produces columns of python string objects when the base parquet type is UTF8.&lt;/p&gt;
&lt;h4&gt;3. Fastparquet with categories&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m48s&lt;/p&gt;
&lt;p&gt;This was the original supposition I was interested in: the grouping column is actually
stored as dict-encoded in the files, so loading as pandas categorical should be much faster,
but it made... almost no difference at all. More on this below.&lt;/p&gt;
&lt;p&gt;Pyarrow does not allow you to load this data
as categorical.&lt;/p&gt;
&lt;h4&gt;4. Pyarrow and pyarrow string&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.convert-string&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pyarrow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m17s&lt;/p&gt;
&lt;p&gt;This was the thesis of the original article, that the new string storage mechanism
should be much faster. It provides a decent boost over PyArrow (with Python string),
and is also better than &lt;code&gt;fastparquet&lt;/code&gt;, above.
I do not have the knowledge to be able to tweak pyarrow further, but note that
this is still using s3fs as the library to fetch bytes (a discussion about using
pyarrow's own s3 implementation is another reason I wanted to chase this topic).&lt;/p&gt;
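&lt;p&gt;For the curious, "pyarrow's own s3 implementation" refers to something like the following
(a sketch, not benchmarked here; the region value is carried over from run 5 below):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(anonymous=True, region="us-east-2")
dataset = ds.dataset(
    "coiled-datasets/uber-lyft-tlc/", filesystem=s3, format="parquet"
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
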
&lt;h4&gt;5. Rust implementation of s3fs and fastparquet&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;fsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rfsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RustyS3FileSystem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clobber&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m5s&lt;/p&gt;
&lt;p&gt;This is the same as 2., but the S3 transfer is being done by
&lt;a href="https://github.com/martindurant/rfsspec"&gt;rfsspec&lt;/a&gt;, which is still very experimental.
Note that rfsspec requires the region to be provided, so some of the speed boost will
come from avoiding HTTP redirection in all S3 calls. Also, &lt;code&gt;rfsspec&lt;/code&gt; has a larger
default buffer size, so there might be fewer requests here.&lt;/p&gt;
&lt;p&gt;All the rest of the runs below use rfsspec.&lt;/p&gt;
&lt;h4&gt;6. Now with parquet-specific file access&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_file_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;precache_options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m5s&lt;/p&gt;
&lt;p&gt;Please see &lt;a href="https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/"&gt;this article&lt;/a&gt;
for a motivation and description of using the footer information of a parquet file to
know exactly which byte-ranges will be needed, and prospectively/concurrently fetching
them. We see that it makes no difference whatsoever, which is extremely fishy!&lt;/p&gt;
&lt;h4&gt;7. Next, specify the columns manually&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_file_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;precache_options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 49s&lt;/p&gt;
&lt;p&gt;Bingo! This has made the biggest difference so far. So, it seems that dask did not figure
out that we only needed those two columns from the source, and maybe all the bytes of every
file were being loaded every time. That also explains why the "precache-option" (6.) didn't
make any difference - we were still loading the whole thing - and why categorising
our groupby column (3.) didn't help - it was only affecting one column of many.&lt;/p&gt;
&lt;h4&gt;8. Add categories back in&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_file_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;precache_options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 39s&lt;/p&gt;
&lt;p&gt;This combines 7. and 3. Now we see that categorisation makes a decent difference, since
we don't need to make python strings, and grouping on a code number is much faster too.
This also has the best memory footprint (two bytes per value for the categorical column).&lt;/p&gt;
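&lt;p&gt;A toy illustration (not part of the benchmark above) of why grouping on integer codes beats
grouping on python string objects:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(["HV0002", "HV0003", "HV0005"], size=1_000_000))
%timeit s.groupby(s).size()                 # hashes python string objects
c = s.astype("category")
%timeit c.groupby(c, observed=True).size()  # groups on small integer codes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
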
&lt;h3&gt;Remaining low-hanging fruit?&lt;/h3&gt;
&lt;p&gt;If 720 files are being processed in 10 workers in 40s, that means very roughly half a
second each, assuming the final aggregations are a small fraction of the total. At this
point, latency talking to the remote store starts to matter, as the number of bytes
in the parquet headers and the actual data bytes in the two columns don't amount to much
against AWS-to-AWS throughput.&lt;/p&gt;
&lt;p&gt;Doing profiling, it turns out that each task calls &lt;code&gt;fs.info&lt;/code&gt;
three times per input file: once to check whether it is a file, once to get the file
size (so you can read the footer with the parquet metadata) and once when it is
opened again to fetch data. s3fs wants to have file information available, so that
it can require an ETag match on the target, and avoid corrupted data should the target
get overwritten during IO. However, we should be able to cache these details, at least for
a short time. Right now, about 20% of the running time (of worker thread time, according
to the Dask profile dashboard) is spent just running &lt;code&gt;info&lt;/code&gt;,
and we can cut that by a factor of 3. s3fs already caches directory listings, but
rfsspec does not, and info() bypasses that cache anyway - so there is some work to do.&lt;/p&gt;
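&lt;p&gt;The fix could be as simple as a short-lived cache around &lt;code&gt;info&lt;/code&gt; (an
illustrative sketch, not how s3fs actually implements anything):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import time

def cached_info(fs, ttl=10):
    """Reuse fs.info results for `ttl` seconds instead of re-requesting them."""
    cache = {}
    original = fs.info

    def info(path, **kwargs):
        now = time.time()
        hit = cache.get(path)
        if hit is not None and now - hit[0] &amp;lt; ttl:
            return hit[1]
        out = original(path, **kwargs)
        cache[path] = (now, out)
        return out

    fs.info = info
    return fs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
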
&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;In brief, here I show a few ways in which you can really push performance for a
relatively simple load-group-aggregate workflow on dask-dataframe. It turns out
that the new pyarrow-strings flag available in dask is not the biggest lever that
you can pull; in particular,
&lt;a href="https://dask.discourse.group/t/column-optimzation/1815"&gt;column selection&lt;/a&gt;
is critical.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/><category term="python"/><category term="dask"/><category term="s3"/><category term="parquet"/></entry><entry><title>OSS Roundup</title><link href="https://engineering.anaconda.com/2023/04/oss-20230429.html" rel="alternate"/><published>2023-04-29T00:00:00-05:00</published><updated>2023-04-29T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-04-29:/2023/04/oss-20230429.html</id><summary type="html">&lt;p&gt;Weekly news from Anaconda's OSS engineers&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is our ongoing series of articles in which we will give weekly highlights
of open-source software (OSS) activities here at Anaconda. We will list various achievements
of the last week &lt;em&gt;or so&lt;/em&gt;,
link to interesting things and give brief details of ongoing work and plans. Each team
will only write something when they have something to say.&lt;/p&gt;
&lt;h3&gt;Conda (language-agnostic, multi-platform package management ecosystem)&lt;/h3&gt;
&lt;p&gt;The conda team has officially launched &lt;a href="http://conda.org"&gt;conda.org&lt;/a&gt;! 🎉🚀 This website will be
home for the entire conda community. We are still very much actively developing
it and welcome any contributions. Have a great idea for a blog article or feature?
Get in touch with us by filing an issue at our GitHub project or stop by our
Matrix chat and say hello 👋.&lt;/p&gt;
&lt;h3&gt;Distdatacats Team (remote bytes, file formats, catalogs and data processing)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2023.4.0 release of fastparquet is out&lt;/li&gt;
&lt;li&gt;Please see the partner blog about the
  &lt;a href="https://engineering.anaconda.com/2023/04/intake-gets-wings-from-duckdb.html"&gt;transform capabilities&lt;/a&gt;
  of intake-duckdb&lt;/li&gt;
&lt;li&gt;In nice synergy with the work on duckdb,
  &lt;a href="https://github.com/intake/intake/pull/729"&gt;dataframe pipelines&lt;/a&gt; are coming to Intake core (full
  documentation yet to come)&lt;/li&gt;
&lt;li&gt;continued releases of dask-awkward and friends as we iron out bugs for large complex
  high-energy analysis workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Numba (JIT-compiling python code to make it fast)&lt;/h3&gt;
&lt;p&gt;After resolving a few bugs discovered in the Numba 0.57.0 release candidate
(including one very obscure LLVM bug!),
we have done the &lt;a href="https://numba.discourse.group/t/ann-numba-0-57-0-and-llvmlite-0-40-0/1914"&gt;final release&lt;/a&gt; this week.&lt;/p&gt;
&lt;h3&gt;Jupyter (in-browser IDE for python and others)&lt;/h3&gt;
&lt;p&gt;The Jupyter team released a new version of nbclassic (0.5.6), the component
that allows the classic Notebook (version 6.x) to coexist in an environment
with JupyterLab. This is in preparation for officially tagging version 1.0 of
nbclassic later this week.&lt;/p&gt;
&lt;p&gt;This has also been a big documentation week, as
we’ve been working on pull requests across a few Jupyter projects to explain
development conventions (like how semantic versioning is used) and to improve
readability for newer users and extension authors. We’re also busy preparing
for JupyterCon in just over a week, where the whole Anaconda Jupyter team will
be in attendance, along with several other Anaconda people. If you are planning
to attend, please stop by the booth and say hello, and check out the talk on
the &lt;a href="https://cfp.jupytercon.com/2023/talk/TWJMCN/"&gt;future of the Jupyter notebook&lt;/a&gt; we are co-presenting.&lt;/p&gt;
&lt;h3&gt;BeeWare (deploy python projects to mobile and elsewhere)&lt;/h3&gt;
&lt;p&gt;This past week the BeeWare team was at the PyCon Sprints. We peaked at ~25 contributors
working on issues, and handed out almost as many challenge coins to first time
contributors, tackling a range of issues from minor documentation cleanups to major
new features in Briefcase and Toga. For a full summary, check out our &lt;a href="https://beeware.org/news/buzz/april-2023-status-update/"&gt;April 2023 Status Update&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;PyScript&lt;/h3&gt;
&lt;p&gt;If you haven't seen the announcement from PyCon US, check out
&lt;a href="http://pyscript.com"&gt;pyscript.com&lt;/a&gt;!&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>Intake Gets Some Wings From DuckDB</title><link href="https://engineering.anaconda.com/2023/04/intake-gets-wings-from-duckdb.html" rel="alternate"/><published>2023-04-27T00:00:00-05:00</published><updated>2023-04-27T00:00:00-05:00</updated><author><name>Blake Rosenthal</name></author><id>tag:engineering.anaconda.com,2023-04-27:/2023/04/intake-gets-wings-from-duckdb.html</id><summary type="html">&lt;p&gt;Intake's new DuckDB plugin&lt;/p&gt;</summary><content type="html">&lt;p&gt;Loading data can be a pain. Let's say you have a 100GB folder of Gzipped CSVs sitting on Amazon S3 — what is the simplest way of converting this dataset into a Dask DataFrame that your data science team can work with? What about a Jupyter notebook that contains the five-ish lines of Python needed to handle imports, credentials, and loading? What if the source data changes, or the URL changes, or you want to include metadata or plots? Now those five-ish lines are broken and you need to somehow push the updates to all your users.&lt;/p&gt;
&lt;p&gt;Enter &lt;a href="https://intake.readthedocs.io/en/latest/"&gt;Intake&lt;/a&gt;. With Intake you can encode all the details of your data collection into a single &lt;a href="https://intake.readthedocs.io/en/latest/catalog.html"&gt;catalog&lt;/a&gt;. Remote files (including catalogs themselves) are a breeze thanks to &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/index.html"&gt;fsspec&lt;/a&gt;. Details such as filetypes, transfer protocols, chunk sizes are all abstracted away from the user who only needs to know enough to open a catalog and read the data sources. Intake makes it easy for data stewards to curate large, varied, and complex datasets into easily distributable and maintainable catalogs. Importantly, it places the onus of handling the data's particular eccentricities (access patterns, etc.) on the data steward rather than the end user who just wants to be handed a DataFrame so they can move on with their life. Intake is used by data science teams and data engineers, and plays a key role in &lt;a href="https://docs.anaconda.com/pro/anaconda-notebooks/notebook-data-catalog/"&gt;Anaconda Nucleus' new data catalogs service&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Intake aims to give you your data then get out of the way. It is &lt;em&gt;just&lt;/em&gt; a data loader, not a query engine. But what if you want to provide your users with a transformed, subsampled, or aggregated view of your dataset? Intake provides an interface for building custom plugins that will handle these sorts of things, but building plugins or extending existing plugins takes some time and know-how. What if you could simply write some SQL and provide a transformed data source directly, without needing to modify or copy the original data?&lt;/p&gt;
&lt;p&gt;Enter &lt;a href="https://duckdb.org/docs/"&gt;DuckDB&lt;/a&gt;. DuckDB wants to be your general purpose query engine for tabular data. It is a data format, a library of analytical tools, and an overall data science workhorse. Below we'll go into some of what makes DuckDB special, the problems it tries to solve, and why it is a natural extension to Intake's existing functionality.&lt;/p&gt;
&lt;h2&gt;Intake background&lt;/h2&gt;
&lt;p&gt;Intake organizes data sources with &lt;a href="https://intake.readthedocs.io/en/latest/api_base.html#intake.source.base.DataSource"&gt;&lt;code&gt;DataSource&lt;/code&gt;&lt;/a&gt; objects; a &lt;code&gt;DataSource&lt;/code&gt; is a wrapper around some container type, commonly a DataFrame, that has a bunch of metadata about the source data and a &lt;code&gt;.read()&lt;/code&gt; method for loading the actual data into memory. Data sources can point to remote files, integrate with &lt;a href="https://docs.dask.org/en/stable/"&gt;Dask&lt;/a&gt;, and even load pre-defined &lt;a href="https://intake.readthedocs.io/en/latest/plotting.html"&gt;plots&lt;/a&gt; with &lt;a href="https://hvplot.holoviz.org/index.html"&gt;hvPlot&lt;/a&gt;. An intake &lt;a href="https://intake.readthedocs.io/en/latest/api_base.html#intake.catalog.Catalog"&gt;&lt;code&gt;Catalog&lt;/code&gt;&lt;/a&gt; is a collection of data sources. Catalogs can even nest inside other catalogs.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# nyc_taxi.yaml&lt;/span&gt;
&lt;span class="nt"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;nyc_taxi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;NYC Taxi dataset&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;parquet&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;urlpath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;s3://datashader-data/nyc_taxi_wide.parq&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;datashade&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;intake&lt;span class="w"&gt; &lt;/span&gt;intake-parquet&lt;span class="w"&gt; &lt;/span&gt;s3fs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;intake&lt;/span&gt;

&lt;span class="n"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intake&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nyc_taxi.yaml&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nyc_taxi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Intake allows you to &lt;a href="https://intake.readthedocs.io/en/latest/making-plugins.html"&gt;extend&lt;/a&gt; the &lt;code&gt;DataSource&lt;/code&gt; and &lt;code&gt;Catalog&lt;/code&gt; classes and add them to the intake registry, wrapping your custom data structure, whether it's a file or database, in Intake's semantics. Now your user can call &lt;code&gt;df = intake.open_catalog(...).source_name.read()&lt;/code&gt; on anything you'd like. Many such plugins already exist and can be found at the &lt;a href="https://intake.readthedocs.io/en/latest/plugin-directory.html"&gt;plugin directory&lt;/a&gt;.&lt;/p&gt;
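&lt;p&gt;A minimal plugin might look like the following sketch (modeled on the making-plugins guide; the source itself is invented for illustration):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
import pandas as pd
from intake.source.base import DataSource, Schema

class RandomSource(DataSource):
    """Toy source producing a DataFrame of random numbers."""
    container = "dataframe"
    name = "random_numbers"  # the driver name referenced in catalogs
    version = "0.0.1"
    partition_access = False

    def __init__(self, n, metadata=None):
        self.n = n
        super().__init__(metadata=metadata)

    def _get_schema(self):
        return Schema(
            datashape=None,
            dtype={"x": "float64"},
            shape=(self.n,),
            npartitions=1,
            extra_metadata={},
        )

    def read(self):
        return pd.DataFrame({"x": np.random.random(self.n)})
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
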
&lt;h2&gt;Adding a query engine&lt;/h2&gt;
&lt;p&gt;The purpose of Intake is to make it easy to distribute existing data, but what if you want to share only part of a source, or an aggregation like a groupby? Well, you can perform the derivation yourself and save the result as a new source. That's not ideal. A better solution would be to use Intake's &lt;a href="https://intake.readthedocs.io/en/latest/transforms.html"&gt;Dataset Transforms&lt;/a&gt; which operate on an existing &lt;code&gt;DataSource&lt;/code&gt; and perform some sort of custom operation. The particular logic of the operation is up to the developer who needs to wrap the &lt;a href="https://intake.readthedocs.io/en/latest/transforms.html#intake.source.derived.DerivedSource"&gt;&lt;code&gt;DerivedSource&lt;/code&gt;&lt;/a&gt; class, write the code, and package the code for distribution. Intake provides the &lt;a href="https://intake.readthedocs.io/en/latest/transforms.html#class-example"&gt;&lt;code&gt;Columns&lt;/code&gt;&lt;/a&gt; transformation which returns just the source DataFrame's columns as an example.&lt;/p&gt;
&lt;p&gt;This sort of functionality is very useful, especially for teams and projects that know their data well and want to integrate Intake into a data pipeline, have sources that are complex derivations of other sources, or have non-DataFrame containers. See &lt;a href="https://github.com/intake/intake-xarray/blob/master/intake_xarray/derived.py#L38"&gt;intake-xarray&lt;/a&gt; for an example of a custom Xarray Dataset transform.&lt;/p&gt;
&lt;p&gt;For more general purpose transformations, the &lt;a href="https://intake-duckdb.readthedocs.io/en/latest/"&gt;intake-duckdb&lt;/a&gt; plugin leverages DuckDB's unique ability to query many types of tabular datasets as if they were database tables. Being an embeddable, single-file data format, DuckDB resembles SQLite but is optimized for analytics and aggregations. It can also operate directly on data that exists only in memory without copying anything. With a modest amount of work, Intake-DuckDB extends this capability to the humble &lt;code&gt;DataSource&lt;/code&gt; and any child class, as long as the container type is a DataFrame.&lt;/p&gt;
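&lt;p&gt;The core trick is easy to demonstrate outside Intake. In this standalone sketch, DuckDB treats an in-memory DataFrame as a queryable table by name, without copying it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import duckdb
import pandas as pd

df = pd.DataFrame({"zip": ["10001", "10002"], "crashes": [3, 5]})

# duckdb finds `df` in the local scope and queries it in place
out = duckdb.query("SELECT zip, crashes * 2 AS doubled FROM df").df()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
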
&lt;p&gt;Now you can provide filtered or modified versions of your data without needing to write a custom Intake plugin or copy any data around. Just write a little SQL. Tools like &lt;a href="https://ibis-project.org/"&gt;Ibis&lt;/a&gt; can even help with converting complex DataFrame operations into valid SQL.&lt;/p&gt;
&lt;h2&gt;Example&lt;/h2&gt;
&lt;p&gt;The following catalog aggregates and joins some data about vehicle crashes in New York in 2023. Notice that &lt;code&gt;ny_crashes_vs_registrations_2023&lt;/code&gt; uses the &lt;code&gt;duckdb_transform&lt;/code&gt; driver to perform a join on &lt;code&gt;ny_vehicle_registrations&lt;/code&gt; and &lt;code&gt;ny_crashes&lt;/code&gt; which are both &lt;code&gt;CSVSource&lt;/code&gt;s.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# ny_crashes.yaml&lt;/span&gt;
&lt;span class="nt"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;ny_vehicle_registrations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;New York vehicle registrations&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;csv&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;urlpath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://data.ny.gov/api/views/w4pv-hbkt/rows.csv&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;csv_kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;usecols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Zip&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;State&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Record Type&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Valid Date&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Expiration Date&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;parse_dates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Valid Date&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Expiration Date&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;blocksize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;ny_crashes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;New York traffic crashes&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;csv&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;urlpath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;csv_kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;usecols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ZIP CODE&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;CRASH DATE&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;parse_dates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;CRASH DATE&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{&lt;/span&gt;&lt;span class="nt"&gt;ZIP CODE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;object&lt;/span&gt;&lt;span class="p p-Indicator"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;blocksize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;ny_crashes_vs_registrations_2023&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Comparison of New York vehicle crashes vs. registrations by ZIP&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;code in 2023&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;duckdb_transform&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ny_vehicle_registrations&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ny_crashes&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;sql_expr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;SELECT r.zip, c.crash_count, r.registration_count&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;FROM (&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;SELECT &amp;quot;ZIP CODE&amp;quot; as zip, COUNT(*) as crash_count&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;FROM ny_crashes&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;WHERE &amp;quot;CRASH DATE&amp;quot; BETWEEN &amp;#39;2023-01-01&amp;#39; AND &amp;#39;2023-12-31&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;GROUP BY &amp;quot;ZIP CODE&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;) as c&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;JOIN (&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;SELECT Zip as zip, COUNT(*) as registration_count&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;FROM ny_vehicle_registrations&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;WHERE State = &amp;#39;NY&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="no"&gt;AND &amp;quot;Record Type&amp;quot; = &amp;#39;VEH&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="no"&gt;AND &amp;quot;Reg Valid Date&amp;quot; &amp;lt;= &amp;#39;2023-12-31&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="no"&gt;AND &amp;quot;Reg Expiration Date&amp;quot; &amp;gt;= &amp;#39;2023-01-01&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;GROUP BY Zip&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;) as r&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;ON c.zip = r.zip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;intake&lt;/span&gt;

&lt;span class="n"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intake&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ny_crashes.yaml&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ny_crashes_vs_registrations_2023&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/intake/intake-duckdb#readme"&gt;README&lt;/a&gt; for some additional drivers that can build Intake catalogs from embedded DuckDB files and query them directly.&lt;/p&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;p&gt;Intake-DuckDB is in early stage development, arguably still a prototype. There are few configuration options for &lt;code&gt;duckdb_transform&lt;/code&gt; sources, and interoperability with other Intake sources is largely untested. DuckDB contains a rich set of features, including running queries directly on remote datasets without needing to suck the whole thing into memory, and generating plots directly with the engine; Intake uses none of this, but could in the future. Intake could very well push more processing down to the Duck layer, or use DuckDB as a general purpose &lt;a href="https://intake.readthedocs.io/en/latest/persisting.html"&gt;persistence store&lt;/a&gt; for any source type. Questions and PRs are more than welcome over on &lt;a href="https://github.com/intake/intake-duckdb"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/><category term="python"/><category term="intake"/><category term="duckdb"/><category term="data-catalogs"/></entry><entry><title>OSS Roundup</title><link href="https://engineering.anaconda.com/2023/04/oss-20230417.html" rel="alternate"/><published>2023-04-17T00:00:00-05:00</published><updated>2023-04-17T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-04-17:/2023/04/oss-20230417.html</id><summary type="html">&lt;p&gt;Weekly news from Anaconda's OSS engineers&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is the first in an ongoing series of articles in which we will give weekly highlights
of open-source software (OSS) activities here at Anaconda. We will list various achievements
of the last week &lt;em&gt;or so&lt;/em&gt;,
link to interesting things and give brief details of ongoing work and plans. Each team
will only write something when they have something to say.&lt;/p&gt;
&lt;h3&gt;General&lt;/h3&gt;
&lt;p&gt;Keep an eye on PyCon US, where &lt;code&gt;pyscript.com&lt;/code&gt; officially launches this week. For
Europeans, we are also at PyCon DE/PyData Berlin, which has already started.&lt;/p&gt;
&lt;h3&gt;Numba Team (JIT-compiling python code to make it fast)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Numba 0.57.0rc1&lt;/strong&gt; and &lt;strong&gt;llvmlite 0.40.0rc1&lt;/strong&gt; are released. The new Numba and llvmlite releases
add support for Python 3.11 and NumPy 1.24, and upgrade to LLVM 14. After this release,
we will be dedicating more time to developing a new compiler pipeline. More details
can be found in the &lt;a href="https://numba.discourse.group/t/proposal-numba-2023-mvp/1792"&gt;Numba 2023 MVP proposal&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Distdatacats Team (remote bytes, file formats, catalogs and data processing)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The Intake project has a new driver leveraging DuckDB: &lt;a href="https://github.com/intake/intake-duckdb"&gt;intake-duckdb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;fsspec and friends release 2023.4.0 with greatly increased
  &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/copying.html"&gt;documentation&lt;/a&gt; and
  coverage around expectations for bulk file operations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rfsspec&lt;/code&gt;, the experimental Rust reimplementation of some async backends for fsspec,
  is now on PyPI as &lt;a href="https://pypi.org/project/rfsspec/"&gt;0.1.0&lt;/a&gt;. We will be using this
  for benchmarking IO-heavy workloads on dask clusters.&lt;/li&gt;
&lt;li&gt;the kerchunk project is trying to merge reference sets directly to parquet output,
  hoping to handle complete datasets with a radically smaller memory footprint&lt;/li&gt;
&lt;li&gt;awkward-array 2023.4.1 is out, and we have been working hard to improve our optimization
  code to deal with the extreme requirements of High-Energy Physics analysis, which has
  so many more operations and input files than any other data processing pipeline we've come
  across before.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Jupyter Team (in-browser IDE for python and others)&lt;/h3&gt;
&lt;p&gt;The Jupyter Team has been working on patching JupyterLab extensions in preparation
for the upcoming JupyterLab 4 release. We have also been working on pull requests
for a few Jupyter Notebook 6 extensions that were adversely affected by an update
of Marked.js in the notebook code to deal with a reported CVE. The team is preparing
for the upcoming JupyterCon in Paris, where we will be co-presenting a talk on the
past, present and future of the Jupyter Notebook.&lt;/p&gt;
&lt;h3&gt;BeeWare Team (deploy python projects to mobile and elsewhere)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;PyCon US 2023 is rapidly approaching&lt;/strong&gt;, and BeeWare will be there! There are two
BeeWare-related talks on the schedule, and we'll have a booth in the Community
section of the main floor. If you're going to be there, stop by - and if you're
a fan of the project and want to help us staff the booth - get in touch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Toga 0.3.1 has been released&lt;/strong&gt;. This fixes a number of layout issues, and completes
the documentation and implementation of 9 widgets (Widget, Button, Label,
ActivityIndicator, Box, Divider, ProgressBar, Switch and Slider). It also introduces
the use of Shoelace as a web component toolkit for the web backend.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Briefcase 0.3.14 has been released.&lt;/strong&gt; This is another big Briefcase release, adding
code signing for Windows, system packaging for Arch, ManyLinux-based AppImage builds,
faster Flatpak builds, and support for PyGame.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rubicon ObjC 0.4.6 has been released.&lt;/strong&gt; This is a minor bugfix release, mostly to
silence a warning that was raised when Rubicon was installed with
recent versions of Setuptools.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>My Summer at Anaconda: Porting NumPy's `random` module to Numba</title><link href="https://engineering.anaconda.com/2022/08/numba.html" rel="alternate"/><published>2022-08-12T00:00:00-05:00</published><updated>2022-08-12T00:00:00-05:00</updated><author><name>Kaustubh Chaudhari</name></author><id>tag:engineering.anaconda.com,2022-08-12:/2022/08/numba.html</id><summary type="html">&lt;p&gt;Summary of Kaustubh's work at Anaconda as a Software Engineer Intern working on porting NumPy's &lt;code&gt;random&lt;/code&gt; module to Numba.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Python has a reputation for being notoriously slow, especially for numeric operations. Most of us use packages like &lt;a href="https://numpy.org/"&gt;NumPy&lt;/a&gt; and &lt;a href="https://scipy.org/"&gt;SciPy&lt;/a&gt; to mitigate this and make huge numeric calculations run in at least a bearable time. In this post, we’ll talk a bit about &lt;a href="https://numba.pydata.org/"&gt;Numba&lt;/a&gt;, the famed package designed to speed up your Python code, even beyond the capabilities of NumPy, using a simple decorator. More importantly, we’ll learn about a long-awaited feature introduced in the latest release: support for NumPy’s new random module. This enables access to NumPy’s random number generator methods from within Numba functions, by allowing NumPy’s &lt;a href="https://numpy.org/doc/stable/reference/random/generator.html"&gt;Generator&lt;/a&gt; objects to cross the JIT boundary.&lt;/p&gt;
&lt;p&gt;Hey! I’m Kaustubh, a Computer Science student from India. During the summer of 2022, I interned at Anaconda and worked on developing the Numba project. This post is a (shamelessly self-promotional) summary of my work on the Numba library during my internship.&lt;/p&gt;
&lt;h4&gt;Introducing Numba&lt;/h4&gt;
&lt;p&gt;Before getting into what I did in Numba, let’s take a short tour of Python numeric libraries and how Numba fits into all this.
The same qualities that make Python user-friendly and suitable for data science also make it slow, the biggest being that Python is an interpreted language. Traditionally, this problem was mitigated by writing computationally intensive algorithms in C or C++ and calling them from the outward-facing Python code. NumPy is an excellent example of this type of organization: the most computationally intensive parts are written in C/C++ and exposed to Python via binding code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fun fact: the NumPy arrays that we use most of the time are actually stored, indexed, and iterated using C code.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The other approach to improving the performance of your Python code is to use a Python-accelerating library. Numba is a perfect example of this. Numba translates Python functions to highly optimized machine code at runtime using the industry-standard &lt;a href="https://llvm.org/"&gt;LLVM&lt;/a&gt; compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN because of the amount of low-level optimization applied to your code, especially if it contains looping logic. Additionally, Numba supports automatic parallelization of loops (which is different from loop optimization), generation of GPU-accelerated code (using CUDA), and creation of universal functions (ufuncs) and C callbacks. Libraries like these combine the relative ease of writing Python code with the performance of executing the operations in a different backend altogether.&lt;/p&gt;
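&lt;p&gt;To make that concrete, here is a minimal sketch (the function name and constants are ours, purely for illustration) of what using Numba looks like: decorating a plain Python loop with &lt;code&gt;@njit&lt;/code&gt; compiles it to machine code on first call.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from numba import njit

@njit
def monte_carlo_pi(n):
    # The whole function is compiled to machine code via LLVM on the
    # first call; the tight loop below then runs at native speed.
    inside = 0
    for _ in range(n):
        x, y = np.random.random(), np.random.random()
        if x * x + y * y &amp;lt;= 1.0:
            inside += 1
    return 4.0 * inside / n

print(monte_carlo_pi(10_000_000))  # later calls reuse the compiled code
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;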
&lt;h4&gt;Numba's support for NumPy methods&lt;/h4&gt;
&lt;p&gt;One of the advantages of using Numba is its near-seamless integration with NumPy functions, especially NumPy &lt;code&gt;ndarrays&lt;/code&gt;. Numba detects whenever a NumPy function is used within code being Just-In-Time (JIT) compiled and internally dispatches the provided arguments to a Numba-based implementation that provides the same functionality as the original.&lt;/p&gt;
&lt;p&gt;However, a major annoyance when directly JIT-compiling complex code written mainly with NumPy is that some NumPy capabilities are not supported within Numba. For instance, support is notably incomplete for NumPy’s infamous fancy/advanced array indexing, and the &lt;code&gt;axis&lt;/code&gt; argument is missing for several NumPy methods.&lt;/p&gt;
&lt;p&gt;Another feature NumPy users frequently requested in Numba was support for the new NumPy random module. For those who aren’t familiar with it, NumPy has a module dedicated to random number generation, including numbers drawn from particular distributions, such as the Poisson or Gaussian distributions. This is especially useful in scientific computing, where generating random data is necessary for statistical analyses and computations. These are also exactly the situations where Python-accelerating libraries are needed most, since computationally intensive algorithms make up a large part of the code logic. Hence, having random generation support within JIT-compiled code is a huge advantage in the majority of cases.&lt;/p&gt;
&lt;p&gt;Previously, NumPy had a global ‘state array’: a sequence of bits stored as an array from which random numbers could be drawn. This state array could be initialized from a seed (an integer or an array of integers). The array was updated repeatedly using the &lt;a href="https://en.wikipedia.org/wiki/Mersenne_Twister"&gt;Mersenne Twister&lt;/a&gt; algorithm, a pseudo-random number algorithm. These kinds of algorithms have a cycle, so they eventually repeat. The advantage of the Mersenne Twister is that it has an especially long cycle (its period is a large Mersenne prime, which is where it takes its name from). The global state was also helpful for reproducibility: by setting the same initial seed, we could get exactly the same results (at least up to rounding errors) when rerunning the code. Two systems with the same seed have the same state array, so the subsequent random numbers generated are the same. However, there were problems with a global state; for example, when two independent threads on the same system run the same scripts and access the same global state array, it is challenging to get reproducible results from them. To mitigate this, NumPy introduced class-based random number generation using objects named Generators, and replaced the global state with an object named BitGenerator that holds the state as an attribute. This also had the added advantage of enabling faster and more convenient algorithms for random number generation, like &lt;a href="https://lemire.me/en/publication/arxiv1805/"&gt;Lemire’s rejection&lt;/a&gt; algorithm.&lt;/p&gt;
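&lt;p&gt;As a quick illustration of the difference (a sketch with arbitrary seeds, not code from the PRs discussed below): the legacy API draws from one hidden global state, while the new API makes the state an explicit object you can pass around.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

# Legacy API: one hidden global state array (Mersenne Twister)
np.random.seed(42)
legacy_draws = np.random.normal(size=3)

# New API: the state lives in an explicit Generator object, so two
# threads can each own an independent, reproducible stream
rng = np.random.default_rng(42)
modern_draws = rng.normal(size=3)

print(legacy_draws, modern_draws)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;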
&lt;h4&gt;Basic Support for NumPy Generators&lt;/h4&gt;
&lt;p&gt;PR &lt;a href="https://github.com/numba/numba/pull/8031"&gt;#8031&lt;/a&gt; added support for these Generator objects to Numba. There were many challenges in implementing Generators as objects that could be lowered into Numba’s LLVM-based environment while keeping track of both the original Generator object and the underlying BitGenerator object. Fortunately, to make the work easier, the NumPy devs provide ctypes bindings to BitGenerators, which allowed us to draw bits directly from the C implementations of the object. (Thanks, NumPy devs!) The initial implementation was provided by Stuart Archibald from the Numba team and was improved with proper reference-count tracking and error handling.&lt;/p&gt;
&lt;p&gt;One of the issues we faced while implementing Generator support was that we needed to maintain a reference to the original object as a pointer, in case we ever needed to return it. One could easily do that by simply storing the pointer somewhere accessible, but there is a caveat: Numba has its own runtime, the NRT (Numba Run Time), which manages its own reference counts. It is NOT responsible for maintaining Python reference counts of objects outside the NRT. Hence, the object a pointer refers to may no longer exist, because it has been deleted in the original Python environment. This was mitigated by the use of MemInfo objects, which keep track of information about the memory behind the pointer, including its references in the Python environment.&lt;/p&gt;
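&lt;p&gt;Putting it together, here is a minimal sketch of what this support enables (assuming a Numba release that includes the Generator work): the Generator is created in ordinary Python code and then crosses the JIT boundary as an argument.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from numba import njit

@njit
def jitted_draws(rng, n):
    # rng is a NumPy Generator that crossed the JIT boundary; bits are
    # drawn via the ctypes bindings to the BitGenerator's C implementation
    return rng.normal(size=n)

rng = np.random.default_rng(1234)
print(jitted_draws(rng, 5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;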
&lt;p&gt;This PR was followed by subsequent PRs that added distributions built on top of the Generator support. &lt;a href="https://github.com/numba/numba/pull/8038"&gt;#8038&lt;/a&gt; and &lt;a href="https://github.com/numba/numba/pull/8040"&gt;#8040&lt;/a&gt; added such distributions, while &lt;a href="https://github.com/numba/numba/pull/8041"&gt;#8041&lt;/a&gt; and &lt;a href="https://github.com/numba/numba/pull/8042"&gt;#8042&lt;/a&gt; added support for general methods of Generator objects such as shuffling and random integer generation (Lemire’s rejection algorithm). A majority of my internship was devoted to these PRs: half of it implementing them and the other half tracking down a very sneaky bug within them. To understand this ‘bug’, let’s first look at where it originates.&lt;/p&gt;
&lt;p&gt;While implementing &lt;a href="https://github.com/numba/numba/pull/8038"&gt;#8038&lt;/a&gt;, we noticed that even though we had directly translated the algorithms from NumPy code, there were slight precision differences between the results produced by Numba and those produced by NumPy. More interestingly, the discrepancy was only observed on certain systems on the CI: it appeared on aarch64 and ppc64le systems but not on linux-64 systems. At first the differences were small enough to ignore, around 2000 ULPs (Units of Least Precision) for 64-bit results, but as more complex distributions were introduced, the differences grew to the order of tens of thousands of ULPs. Thus we had to address the problem at its core. After multiple fruitless debugging sessions, we finally tracked the precision differences down to the assembly instructions causing them. (Thanks to Stuart again.) The differences were caused by floating-point contraction: certain instruction sets (such as those of ppc64le and aarch64) have fused instructions like fmadd and fmsub that combine two operations, a multiply and an add/subtract, into one, rounding only once. In our case, the NumPy code was executed using these instructions while the Numba code wasn’t, and the numerical result of a single fmadd can differ slightly from that of separate multiply and add instructions. Once the problem was identified, we came up with &lt;a href="https://github.com/numba/numba/pull/8032"&gt;#8232&lt;/a&gt; as a fix.&lt;/p&gt;
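&lt;p&gt;The effect is easy to reproduce in plain Python (a sketch using the classic textbook constants; &lt;code&gt;math.fma&lt;/code&gt; requires Python 3.13+): a fused multiply-add rounds once, while a separate multiply and add round twice.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import math

x, y = 2.0**27 + 1.0, 2.0**27 - 1.0  # both exactly representable in float64
z = -(2.0**54)                       # but x * y == 2**54 - 1 is NOT

two_roundings = x * y + z         # multiply rounds up to 2**54, then add: 0.0
one_rounding = math.fma(x, y, z)  # fused: exact product, single rounding: -1.0
print(two_roundings, one_rounding)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Magnified across the many multiply-add steps inside a distribution algorithm, single-rounding differences like this add up to the ULP-scale discrepancies we saw.&lt;/p&gt;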
&lt;p&gt;Generator support was released in version 0.56.0, along with a load of other cool features. Subsequent releases will keep adding more distribution-related algorithms; we aim for full parity with the NumPy counterparts in both the latest and the legacy random number generator modules. In the future, the plan is to add more complementary features to this module, such as making Generator objects thread-safe and making it possible to construct them within Numba CPU-JITed code.&lt;/p&gt;
&lt;p&gt;Another cool feature I was working on during the summer is support for fancy/advanced indexing in Numba (starting with &lt;a href="https://github.com/numba/numba/pull/8038"&gt;#8238&lt;/a&gt;). I’ll be posting a blog post about it soon, so stay tuned for further updates.&lt;/p&gt;
&lt;h4&gt;A note to my mentor and team&lt;/h4&gt;
&lt;p&gt;Thank you to the Numba team for all of your support, especially Valentin Haenel (my mentor during the internship) for his guidance and Stuart Archibald for the initial patches and reviews. Couldn’t have made it this far without you guys.&lt;/p&gt;
&lt;p&gt;Days since last segfault: 0x0&lt;/p&gt;
&lt;p&gt;Ich bin ein Berliner!!&lt;/p&gt;</content><category term="Engineering"/><category term="numba"/><category term="random numbers"/><category term="open-source"/></entry><entry><title>Mahe's Internship Experience at Anaconda</title><link href="https://engineering.anaconda.com/2022/07/internship-experience.html" rel="alternate"/><published>2022-07-20T00:00:00-05:00</published><updated>2022-07-20T00:00:00-05:00</updated><author><name>Mahe Iram Khan</name></author><id>tag:engineering.anaconda.com,2022-07-20:/2022/07/internship-experience.html</id><summary type="html">&lt;p&gt;Mahe Iram Khan's experience at Anaconda as a Software Engineer Intern working on Grayskull, an open source conda recipe generator.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://engineering.anaconda.com/2022/07/grayskull.html"&gt;part one&lt;/a&gt; of this two-part blog series I wrote about conda and software packaging. I explained important terminology from the conda packaging ecosystem and discussed the features of the automatic recipe generator called Grayskull. In part two, I talk about my work during the internship at Anaconda, my experience here, and what I learned.&lt;/p&gt;
&lt;h2&gt;My Work During The Internship&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Adding CRAN Support to Grayskull&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Grayskull could already generate recipes for Python packages available on PyPI and GitHub. Another useful package origin is CRAN, given the popularity of the R language, so during this internship I worked on adding CRAN support to Grayskull.
  I studied the CRAN documentation and learnt how CRAN ships its packages and what sources are available for extracting the metadata of an R package.
  Through my research, I found that all R packages have a 'DESCRIPTION' file. This file contains metadata about the package.
  I began to map information in the DESCRIPTION file to the information in a conda recipe, and I realized that a number of fields in the conda recipe for an R package can be directly populated from the information available in that package's DESCRIPTION file. I was, therefore, able to generate R recipes through Grayskull. Of course, the DESCRIPTION file does not have all the information needed to write the entire recipe; additional layers were added (and more are to be added) to fill in the missing information.
  Presently we are only able to generate recipes for simple R packages, i.e. packages that do not need system-specific compilation. In the next iteration, we will try to also support complex packages, whose recipes must include compiler information. See &lt;a href="https://github.com/conda-incubator/grayskull/pull/349/files"&gt;here&lt;/a&gt;.&lt;/p&gt;
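&lt;p&gt;To give a flavor of what that mapping starts from, here is a hedged sketch (not Grayskull's actual code; the sample fields are illustrative) of reading a DESCRIPTION file, which uses 'Field: value' lines with indented continuations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;def parse_description(text):
    """Parse the 'Field: value' pairs of a CRAN DESCRIPTION file.
    Indented lines continue the value of the previous field."""
    fields, key = {}, None
    for line in text.splitlines():
        if line[:1].isspace() and key is not None:
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields

sample = """Package: jsonlite
Version: 1.8.7
License: MIT + file LICENSE
Description: A fast JSON parser and generator optimized
    for statistical data and the web."""

print(parse_description(sample)["License"])  # MIT + file LICENSE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;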
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Detecting Percentage Match of Licenses in Generated Recipes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every conda package is shipped with a license. Some standard licenses, such as MIT, Apache, and BSD, are widely used. Sometimes people add these licenses to their projects but modify them according to their needs. Conda recipes require the license information of the package, so it is important that during recipe generation, when Grayskull detects the package's license, the user is informed that the license has been modified and to what extent.
  The intent here is to detect and inform when subtle changes have been added to the original text (like one extra clause), making it a new license altogether. Grayskull uses the rapidfuzz Python library to fuzzy-match the package license against a list of standard licenses.
  I used the 'fuzz' module of this library to calculate the percentage match of the license and then display a warning to the user. This lets the user know if their included license deviates significantly from the standard version.&lt;/p&gt;
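&lt;p&gt;A minimal sketch of that idea (the file paths are hypothetical placeholders; Grayskull keeps its own set of reference license texts):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;from rapidfuzz import fuzz

# Compare the text of a standard license against the text shipped in a
# package (both paths are hypothetical)
reference = open("spdx_licenses/MIT.txt").read()
candidate = open("package/LICENSE").read()

score = fuzz.ratio(reference, candidate)  # similarity as a 0-100 percentage
if score &amp;lt; 100:
    print(f"Warning: license matches MIT at only {score:.1f}%; "
          "it may have been modified.")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;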
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Initializing Grayskull Documentation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;During one of the 'Hackdays' at Anaconda, I set up the initial documentation for Grayskull. As the work on Grayskull progresses, we will need proper documentation in place to keep track of the development. The documentation will also help new contributors to onboard with ease. The 'Hackdays' were a great motivation to do things fast!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Writing a Conda Enhancement Proposal&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, I wrote a CEP (Conda Enhancement Proposal) explaining why Grayskull is a great recipe generator and how we can make the migration from conda-skeleton to Grayskull possible.
  The CEP compares features between conda-skeleton and Grayskull and discusses what features need to be added to Grayskull to make it more versatile. The CEP serves as a single point for the community to interact and discuss the proposed changes and make valuable suggestions before a decision is made. You can check out the CEP &lt;a href="https://github.com/conda/ceps/pull/17/files?short_path=0f8fc8e#diff-0f8fc8e13bcbb7680d85f8120cbb7b42735da265ae8e690435467628c86ed6e3"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The Many Things That I Learned&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Leadership and Initiative-Taking Skills&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When my internship officially began we did not yet have a fixed plan about &lt;em&gt;what&lt;/em&gt; I was going to work on. Usually the internship project is decided based on the prior experiences, expertise and interests of the intern. My mentor, &lt;a href="https://twitter.com/jezdez"&gt;Jannis Leidel&lt;/a&gt;, encouraged me to explore the various projects ongoing within Conda and see if something interested me. I explored. I was already more than a week into the internship and still couldn't decide what I wanted to work on. This made me anxious. I needed the project to be challenging enough that it would help me grow, but not so difficult, so far above my current level of skill and knowledge, that I might get overwhelmed and drop it midway. I needed it to be at the sweet spot of being challenging while aligning with my previous experience and knowledge.
  I suggested that maybe I could work on adding more package origins to Grayskull, because that would take it a step further towards being a versatile conda recipe generator. Jannis welcomed my idea and built on it.
  Now we needed to decide which new package origin to add to Grayskull. There were several to choose from: PyProject, GitLab, CRAN, etc. I reached out to &lt;a href="https://twitter.com/chenghlee"&gt;Cheng Lee&lt;/a&gt;, who I knew had a lot of experience working on conda packaging, and requested that we set up a meeting to help me decide what to work on. He kindly agreed, and after some discussion we decided that it would be a good idea to add CRAN support to Grayskull, since R is a popular language and there are a number of R conda packages in the ecosystem.
  The problem, though, was that I had no prior experience dealing with R packages. Nor did we have any R experts on the team.
  But coming from conda-forge, I knew the power and resourcefulness of open source communities. I reached out to people in the conda-forge community who had experience working with R packages. &lt;a href="https://twitter.com/bjoerngruening"&gt;Björn Grüning&lt;/a&gt;, who wrote the R helper script for conda-forge (a script that runs over R recipes generated by conda-skeleton and modifies them to better suit conda-forge), was kind enough to talk with me, discuss my ideas for CRAN support in Grayskull and guide me whenever I experienced blockers.
  I also met with &lt;a href="https://twitter.com/ocefpaf"&gt;Filipe Fernandes&lt;/a&gt; and the developer of Grayskull, Marcelo Trevisani, to discuss my plans and ideas with them. They gave me their valuable insights and advice. Marcelo was also generous enough to agree to meet with me regularly during the term of my internship so that I could receive timely feedback on my progress and help whenever I needed it.&lt;/p&gt;
&lt;p&gt;This internship project pushed me to go out of my way to learn and acquire the information I needed to move forward. It forced me to push past my limits, take initiative and develop leadership qualities.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Thinking About and Planning Projects In a Sustainable Manner&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I met with my mentor, Jannis Leidel, twice a week. Through our meetings in the duration of three months, I learned many valuable lessons from him. One lesson that I would especially like to mention is 'thinking about projects sustainably'. Jannis would often ask me what I thought would be the future of Grayskull. Was I interested in continuing working on it after the internship? Did I see other people working on it? Jannis insisted that Grayskull development work should continue beyond the internship and that we have to figure out how. He also insisted that it was unfair to expect the original developers (in this case Marcelo) to continue investing time and effort into Grayskull unpaid. We have to figure out ways to promote Grayskull so that its development continues in an organic and sustainable manner.
  I realized that programs such as Google Summer of Code and Outreachy are very useful for this purpose. They provide visibility to a project and thereby invite new contributors to it. We plan to register Grayskull in such open source programs in the future.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Advantages of Daily Standups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My manager, &lt;a href="https://twitter.com/DanGoodTweet"&gt;Dan Meador&lt;/a&gt;, encourages us to write daily standups. A standup is where team members who work asynchronously (because of time zone differences) share with each other what they're currently working on, what they're planning to work on and if they're experiencing any blockers. I realized that writing standups helped me clarify my thoughts and solidified my goals for the day. This is something I really struggled with -- breaking bigger tasks into smaller ones and sorting what needs to be done first. However, through standups and other productivity hacks that my manager shared with me, I was able to learn this skill. I feel that breaking down tasks and deciding every morning (or the previous evening) what exactly you're going to work on today can really enhance your productivity.
  I still get lazy sometimes, and forget to plan ahead and then my days are not as productive as I'd like them to be, and then I feel super guilty for not being my best productive self. But one's gotta keep trying until good practices become habits.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connecting With People&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What I most enjoy doing in life is connecting with people; recognizing our shared human-ness despite our many apparent differences. And this internship provided me with plenty of opportunities to reach out to people, talk to them, discuss ideas, and develop friendships. I am grateful for the many new bonds I made with people within and outside Anaconda. &lt;br&gt; Jannis says that we must always remember that behind these computer screens there are &lt;em&gt;real&lt;/em&gt; human beings, with feelings and egos and insecurities. And as long as we treat each other with empathy, we can successfully create sustainable communities where people feel they belong and are cared for. I feel very strongly about this and I believe it is important to always keep in mind the &lt;em&gt;human element&lt;/em&gt; during all our professional interactions. At the end of the day we're all fragile, vulnerable beings trying to achieve great goals together with all the strength and grace we can muster.&lt;/p&gt;
&lt;p&gt;The last months have been fulfilling and exciting. I have learnt a lot and grown as a software engineer and as a person. I am truly grateful for this opportunity, for the mentorship I received, and for all the new friendships that came my way and made my life more meaningful.
Thank you, Anaconda. Thank you, The Spirit of Open Source.&lt;/p&gt;</content><category term="Engineering"/><category term="packaging"/><category term="conda"/><category term="recipes"/><category term="open-source"/><category term="internship"/><category term="mentorship"/></entry><entry><title>Grayskull - The Community-Developed conda Recipe Generator</title><link href="https://engineering.anaconda.com/2022/07/grayskull.html" rel="alternate"/><published>2022-07-19T00:00:00-05:00</published><updated>2022-07-19T00:00:00-05:00</updated><author><name>Mahe Iram Khan</name></author><id>tag:engineering.anaconda.com,2022-07-19:/2022/07/grayskull.html</id><summary type="html">&lt;p&gt;Grayskull is a community developed conda recipe generator that does away with a number of problems in conda-skeleton. Improving conda-skeleton is difficult due to its tight coupling with conda-build. Embracing the community developed Grayskull is easier and more sustainable.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;This is part one of a two blog series. In this blog I talk about software packaging and conda. Head to &lt;a href="https://engineering.anaconda.com/2022/07/internship-experience.html"&gt;part two&lt;/a&gt; to read about my work and experience during my internship at Anaconda.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Once again, attempts were made to clarify that conda is not Anaconda's nickname, this time through &lt;a href="https://twitter.com/anacondainc/status/1498691287851143173"&gt;tweets and memes&lt;/a&gt;. Today, let us understand, once and for all (as if such an ideal were possible!), what conda is.
Spoiler: It is an OS-agnostic package manager.&lt;/p&gt;
&lt;p&gt;My name is Mahe. I am a senior year Computer Engineering student from Delhi, now a Software Engineer at Anaconda.
I was very recently an intern here. During my internship I worked on a community-developed project called '&lt;a href="https://github.com/conda-incubator/grayskull"&gt;Grayskull&lt;/a&gt;'. In this blog post, I will talk about conda, conda-skeleton, and Grayskull. I will also discuss the disadvantages of tightly coupled projects/tools, and the advantages of embracing community innovations in open source ecosystems.&lt;/p&gt;
&lt;h2&gt;Concepts and Terminology&lt;/h2&gt;
&lt;p&gt;Before moving forward, let us quickly learn a few terms widely used in the Conda packaging ecosystem.
A &lt;strong&gt;Software Package&lt;/strong&gt; is simply a working piece of code that does something. Software packages are installable so that people can benefit from the code written by others.
&lt;strong&gt;Channels&lt;/strong&gt; are online locations where these packages live and can be downloaded from. Channels are warehouses of packages.
&lt;strong&gt;Conda&lt;/strong&gt; is an OS-agnostic package and environment manager for Python packages and data-science-adjacent libraries. It allows you to manage the environments and dependencies of your packages and generate the context your project needs to run successfully on a variety of machines.
&lt;strong&gt;Conda-build&lt;/strong&gt; is a set of commands and tools that lets you build your own Conda packages.
To create a package with conda-build, you need to provide a &lt;strong&gt;Recipe&lt;/strong&gt;, minimally a meta.yaml file that contains the packaging metadata and build instructions for that specific package. You can learn more about conda recipes &lt;a href="https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#meta-yaml"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Writing Recipes&lt;/h2&gt;
&lt;p&gt;For someone who is new to the packaging world, writing package recipes can seem quite intimidating. Even people who are not new to it would agree that writing package recipes is often boring and tiresome, not to mention highly error-prone.
Example recipes and templates help, sure, but one would rather have their package recipe generated automatically, and perfectly concisely.
There is conda-skeleton, an automatic conda recipe generator provided with conda-build.
conda-skeleton is a helpful tool indeed, but it falls short of being the perfect recipe generator for several reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It is slow in generating recipes.&lt;/li&gt;
&lt;li&gt;It cannot be deployed on systems without conda.&lt;/li&gt;
&lt;li&gt;It has a huge number of dependencies.&lt;/li&gt;
&lt;li&gt;The recipes it generates are not always concise.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These shortcomings in conda-skeleton led to the development of Grayskull in the conda-forge community by &lt;a href="https://twitter.com/mdtrevisani"&gt;Marcelo Trevisani&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Grayskull  -  The Community-Developed conda Recipe Generator&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/conda-incubator/grayskull#readme"&gt;Grayskull&lt;/a&gt; is an automatic conda recipe generator that generates concise conda recipes for Python packages available on PyPI and GitHub.
It significantly improves upon conda-skeleton in terms of speed, conciseness of the recipes, packaging environment specificity, and memory usage.
Grayskull has proved to be an extremely useful tool for the packaging ecosystem by generating very accurate recipes very quickly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A flow chart showing the words 'package name' connected to a gray colored skull via a rightward arrow, the skull is connected via a rightward arrow to a curled up paper with recipe content on it" src="https://engineering.anaconda.com/images/grayskull/grayskull_recipe.png" title="Grayskull generates conda recipes"&gt;&lt;/p&gt;
&lt;h2&gt;Grayskull  - An Improvement Upon conda-skeleton&lt;/h2&gt;
&lt;p&gt;Grayskull generates recipes that take into consideration the platform, Python version available, selectors, compilers (Fortran, C and C++), package constraints, license type, etc.
It uses metadata available from multiple sources to create the best recipe possible.&lt;/p&gt;
&lt;p&gt;The table below compares and contrasts the performance and mechanisms of Grayskull and conda-skeleton:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Grayskull&lt;/th&gt;
&lt;th&gt;conda-skeleton&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Detects when the recipe supports noarch:python&lt;/td&gt;
&lt;td&gt;Does not detect noarch:python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Always tries to detect compilers&lt;/td&gt;
&lt;td&gt;Does not detect compilers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standalone application, can be deployed on systems without conda&lt;/td&gt;
&lt;td&gt;Relies on conda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Light weight due to reduced dependencies&lt;/td&gt;
&lt;td&gt;Huge number of dependencies due to reliance on conda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pip installable&lt;/td&gt;
&lt;td&gt;Not pip installable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creates a small, temporary virtual env to simulate the installation of the package using the source tarball&lt;/td&gt;
&lt;td&gt;Creates a separate conda env and runs the solver, hence takes up a lot of time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generates concise recipes&lt;/td&gt;
&lt;td&gt;Sometimes mixes up dependencies and generates unnecessarily bloated recipes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Improving conda-skeleton is Tough&lt;/h2&gt;
&lt;p&gt;conda-skeleton, the recipe generator, is very tightly coupled with conda-build, the package builder. Because of this, it is very risky to try to change any functionality in conda-skeleton, as it could break something in conda-build itself. Also worth noting is that the conda-skeleton code is not very modular and does not contain many comments. This can make onboarding to conda-skeleton a difficult and time-consuming task.&lt;/p&gt;
&lt;p&gt;Grayskull, on the other hand, is a standalone tool. The code is modular, which makes it easier to add new functionality or update existing functionality. Grayskull also has ample comments describing each function in the code, which makes it easier for new people to onboard and understand the codebase.&lt;/p&gt;
&lt;h2&gt;Embracing Community Innovation&lt;/h2&gt;
&lt;p&gt;Anaconda, the company behind conda and conda-skeleton, graciously acknowledged the advantages of Grayskull over conda-skeleton, has been supporting Grayskull, and is making efforts to adopt it as the de facto conda recipe generator.
This also falls in line with the &lt;a href="https://twitter.com/condaproject/status/1498678697603256334"&gt;conda project&lt;/a&gt; efforts.&lt;/p&gt;
&lt;p&gt;During my internship at Anaconda I worked on Grayskull, adding more package origins to it, and taking it a step further to being a versatile conda recipe generator. In the &lt;a href="https://engineering.anaconda.com/2022/07/internship-experience.html"&gt;follow up blog&lt;/a&gt; I talk about my work during the internship and my experience at Anaconda.&lt;/p&gt;</content><category term="Engineering"/><category term="packaging"/><category term="conda"/><category term="recipes"/><category term="open-source"/></entry><entry><title>FOSS Fridays at Anaconda</title><link href="https://engineering.anaconda.com/2022/05/foss-friday-may-2022.html" rel="alternate"/><published>2022-05-06T00:00:00-05:00</published><updated>2022-05-06T00:00:00-05:00</updated><author><name>Princiya Sequeira</name></author><id>tag:engineering.anaconda.com,2022-05-06:/2022/05/foss-friday-may-2022.html</id><summary type="html">&lt;p&gt;FOSS Friday updates for May 2022&lt;/p&gt;</summary><content type="html">&lt;p&gt;Anaconda is an important part of the Open Source ecosystem. A significant percentage of our team contributes full-time to open-source projects, but it is a passion for most of the organization.&lt;/p&gt;
&lt;h2&gt;What are FOSS Fridays&lt;/h2&gt;
&lt;p&gt;The idea is to take a break from normal teamwork to contribute to any open-source project that we would like to support. We want to give an opportunity for everyone in technology to contribute to an open-source project we care about, Python or not, Anaconda-related or not.&lt;/p&gt;
&lt;p&gt;FOSS Fridays happen on the first Friday of each month. Jan 7, 2022 was our first FOSS Friday🤩.&lt;/p&gt;
&lt;h2&gt;FOSS Friday Updates - May 2022&lt;/h2&gt;
&lt;h3&gt;PyScript PyScript PyScript&lt;/h3&gt;
&lt;p&gt;&lt;a href="./2022/04/welcome-pyscript"&gt;PyScript&lt;/a&gt; had just launched and everyone at Anaconda were so excited. &lt;a href="https://anaconda.cloud/pyscript-pycon2022-peter-wang-keynote"&gt;Here&lt;/a&gt; is Peter Wang's keynote at Pycon 2022 on "PyScript - Programming for Everyone".&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/LtDan33"&gt;Dan Meador&lt;/a&gt; added Github issue forms for PyScript.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/jezdez"&gt;Jannis Leidel&lt;/a&gt; worked on the documentation infrastructure for PyScript.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/kathatherine"&gt;Katherine Kinnaman&lt;/a&gt; too added changes to the documentation infrastructure for PyScript.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="./author/kevin-goldsmith.html"&gt;Kevin Goldsmith&lt;/a&gt; rewrote the &lt;a href="https://github.com/pyscript/pyscript/blob/main/CONTRIBUTING.md"&gt;CONTRIBUTING page&lt;/a&gt; for PyScript to make it easier for folks to understand how to contribute to the project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/mattkram"&gt;Matt Kramer&lt;/a&gt; added &lt;code&gt;pre-commit&lt;/code&gt; hooks for PyScript. He also added new issues up for grabs to the &lt;a href="https://github.com/pyscript/pyscript-cli/issues"&gt;pyscript-cli&lt;/a&gt; project.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Improve performance for conda-forge&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/barabo"&gt;Carl Anderson&lt;/a&gt; and &lt;a href="https://github.com/dholth"&gt;Daniel Holth&lt;/a&gt; improved conda channel cloning performance for &lt;a href="https://conda-forge.org/"&gt;conda-forge&lt;/a&gt; by fixing a performance bug in conda-index. They also worked on a proof-of-concept rewrite of &lt;a href="https://github.com/conda-incubator/conda-index"&gt;conda-index&lt;/a&gt; which will reduce package propagation latency for cloned channels by using sqlite databases instead of relying on a filesystem cache of a channel.&lt;/p&gt;
&lt;h4&gt;conda-forge package sync improvements&lt;/h4&gt;
&lt;p&gt;&lt;img alt="image" src="https://user-images.githubusercontent.com/4342684/171940095-78f8a3cd-9eca-4315-84f6-2aa675c09826.png"&gt;
This graph shows the duration of &lt;code&gt;conda-forge&lt;/code&gt; channel cloning before and after the performance bug was fixed. The Y axis is measured in seconds and each point on the X axis is a point in time when the clone job ran.&lt;/p&gt;
&lt;h3&gt;BDD test automation framework&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/vijeshkumarr"&gt;Vijesh Kumar&lt;/a&gt; and &lt;a href="https://github.com/nishita-beeraka"&gt;Nishita Beeraka&lt;/a&gt; worked on a BDD test automation framework design to work closely with the manual QA automation team.&lt;/p&gt;
&lt;h3&gt;Parenthood and Leadership&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/princiya"&gt;Princiya Sequeira&lt;/a&gt; wrote a blog post on parenthood and leadership.&lt;/p&gt;</content><category term="FOSS"/><category term="opensource"/><category term="foss-fridays"/></entry><entry><title>Welcome to the world PyScript</title><link href="https://engineering.anaconda.com/2022/04/welcome-pyscript.html" rel="alternate"/><published>2022-04-30T00:00:00-05:00</published><updated>2022-04-30T00:00:00-05:00</updated><author><name>Fabio Pliger</name></author><id>tag:engineering.anaconda.com,2022-04-30:/2022/04/welcome-pyscript.html</id><summary type="html">&lt;p&gt;PyScript&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;py-env&gt;
- pandas
  &lt;/py-env&gt;&lt;/p&gt;
&lt;p&gt;One of the main reasons I joined Anaconda seven and a half years ago was the company’s commitment to the data science and Python communities by creating tools that enable people to do more with less.&lt;/p&gt;
&lt;p&gt;Today I'm happy to announce a new project that we’ve been working on here at Anaconda, one that we hope will take another serious step towards making programming and data science available and accessible to everyone.&lt;/p&gt;
&lt;h1&gt;What is PyScript&lt;/h1&gt;
&lt;p&gt;PyScript is a framework that allows users to run Python and create rich applications in the browser by simply using special HTML tags provided by the framework itself. Core features include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python in the browser&lt;/strong&gt;: Enable drop-in content, external file hosting (made possible by the Pyodide project, thank you!), and application hosting without the reliance on server-side configuration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python ecosystem&lt;/strong&gt;: Run many popular packages of Python and the scientific stack (such as numpy, pandas, scikit-learn, and more)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python with JavaScript&lt;/strong&gt;: Bi-directional communication between Python and Javascript objects and namespaces&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment management&lt;/strong&gt;: Allow users to define what packages and files to include for the page code to run&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual application development&lt;/strong&gt;: Use readily available curated UI components, such as buttons, containers, text boxes, and more&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible framework&lt;/strong&gt;: A flexible framework that can be leveraged to create and share new pluggable and extensible components directly in Python&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All that to say… PyScript is just HTML, only a bit (okay, maybe a lot) more powerful, thanks to the rich and accessible ecosystem of Python libraries.&lt;/p&gt;
&lt;h1&gt;Wait... what? Why?&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;: As an industry, we have focussed on making the impossible possible, rather than focussing on making the possible accessible to all.&lt;/p&gt;
&lt;p&gt;At some point in the 80s, personal computers became cheaper, which led to them becoming more popular. Most of the hardware (C64/ZX80/Apple II) gave the user direct access to BASIC: a programming interface ready to use and a language simple to learn. Later, as systems became more complex (and complicated), frameworks like Visual Basic and HyperCard made it easy to create and package/distribute visual applications. Even the web, when it started, was accessible! All you needed was a text editor and a way to upload your files somewhere, before we created CGI and heavier server-side logic/rendering, etc...&lt;/p&gt;
&lt;p&gt;It's somewhat unfortunate that in the last two or three decades, while we created simpler programming languages and made things faster, more scalable, and bigger, we also came to require an increasing amount of surrounding technology and infrastructure complexity to make things work. Today, in addition to the problem of packaging and distributing applications for different architectures and platforms, we added the complexity of the server/client separation, which requires an additional networking layer and so on... This leads to having to learn about servers, cloud vendors, web stacks, how to test code in a simulated production environment, how to deploy applications... All of a sudden, instead of the one problem users were initially trying to solve, they now have many problems!&lt;/p&gt;
&lt;p&gt;Similarly, modern HTML/CSS and JS are very powerful and can be used to create beautiful UIs, but they come with a significant learning curve before users become proficient. This is also true for native GUI applications. In fact, Python, the #1 most popular programming language in the world, doesn't have a straightforward story for building native GUI applications. Nor for making websites [entirely with Python, server + client]! Nor for packaging and distributing applications!&lt;/p&gt;
&lt;p&gt;We believe users should be spending their time thinking about and writing their applications and solving real problems. Let's make programming more fun and simpler, while keeping the right technology advancements we made over the past 20/30 years. The more we do, the more users will come.&lt;/p&gt;
&lt;h1&gt;So, how does it work?&lt;/h1&gt;
&lt;p&gt;Warning, we are about to get a little technical.... :)&lt;/p&gt;
&lt;p&gt;The core concept of PyScript, as a framework, is to provide a set of [opinionated] components and tools that allow users to quickly create and share their applications. We also don't want to reinvent the wheel and aim to reuse the great work that many others are already doing.&lt;/p&gt;
&lt;p&gt;With that in mind, let's start from the foundation...&lt;/p&gt;
&lt;h2&gt;The platform&lt;/h2&gt;
&lt;p&gt;One of the hottest problems people work on today is: how do we create an abstraction that allows users to ship their applications to multiple HW/SW platforms without having to rewrite and rebuild their code? Most of today's solutions tend to fall into one of two buckets: virtual machines or containers. Both are great but have limitations, depending on the type of application and how strong your need for abstracting a whole machine is.&lt;/p&gt;
&lt;p&gt;Instead of creating a whole new technology stack, we want to start from the best option the ecosystem provides today. So, which virtualization abstraction is the most popular and ubiquitous today? With a little bit of flexibility, we can claim that the browser is an excellent virtual machine, one that checks a lot of the boxes we are looking for. Browsers are everywhere (from laptops to tablets and phones), secure (browsers have been working on security and isolation from the underlying file system for decades), powerful (from hardware acceleration to the maturity of WebAssembly), and stable.&lt;/p&gt;
&lt;h2&gt;The Stack&lt;/h2&gt;
&lt;p&gt;Keeping in mind one of the premises above, we want to provide a reliable and fun experience to PyScript users (whether they are authoring or consuming an application), ultimately making the web a friendly and hackable place for users. For this reason, we need something beyond the current state of web development. Something that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;give users a first-class programming language that is less weird, more expressive, and easier to learn than Javascript.&lt;/li&gt;
&lt;li&gt;centralize: strip away most of the complexity of the client/server modern web by removing that distinction as much as possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Luckily for us, the ecosystem has been building the foundations of a very solid stack that we can build on top of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;WebAssembly/WASM: a portable binary-code format and text format for executable programs &amp;amp; software interfaces, enabling high-performance applications on web pages and in other environments&lt;/li&gt;
&lt;li&gt;Emscripten (https://emscripten.org/): an open source compiler toolchain to WebAssembly, practically allowing any portable C/C++ codebase to be compiled into WebAssembly&lt;/li&gt;
&lt;li&gt;Pyodide (https://pyodide.org/) and python-wasm (https://github.com/ethanhs/python-wasm): Python implementations compiled to WebAssembly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As Python found its success standing on the shoulders of giants and building out of the excellent work of many people, we can do that too!&lt;/p&gt;
&lt;h2&gt;The Interface&lt;/h2&gt;
&lt;p&gt;One of our highest goals is to make programming and the web a friendly and hackable place where anyone can create interesting things and still have fun.&lt;/p&gt;
&lt;p&gt;As hinted above, the presentation layer of the modern web is really powerful and actually not bad, &lt;strong&gt;if&lt;/strong&gt; you know what to do. That means that either you've been doing this for some time or you'll have to spend a considerable amount of time learning. Even then, the ecosystem moves so fast that it is often hard even for experts to keep up.&lt;/p&gt;
&lt;p&gt;Instead, we want a system that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;offers a clean and simple API&lt;/li&gt;
&lt;li&gt;supports standard HTML&lt;/li&gt;
&lt;li&gt;extends the HTML elements with custom components that are opinionated and predictable (they do fewer things, but do them as you'd expect)&lt;/li&gt;
&lt;li&gt;is extensible and offers an easy way for users to define their own new components&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To do this, PyScript defines a series of new HTML tags (web components). For instance, to write a simple program, one can just use the &lt;code&gt;&amp;lt;py-script&amp;gt;&lt;/code&gt; tag and write Python code inside the tag itself&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;py-script&amp;gt;&lt;/span&gt;
&amp;quot;Hello&lt;span class="w"&gt; &lt;/span&gt;World&amp;quot;
&lt;span class="nt"&gt;&amp;lt;/py-script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or, alternatively, pass the source file directly&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;py-script&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/my_own_file.py&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/py-script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;PyScript will read that code, run it on a Python interpreter, and handle the output accordingly.&lt;/p&gt;
&lt;p&gt;If I need to load (install) additional modules and packages needed by my application, I can just use the &lt;code&gt;&amp;lt;py-env&amp;gt;&lt;/code&gt; tag to specify my environment requirements&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;py-env&amp;gt;&lt;/span&gt;
-&lt;span class="w"&gt; &lt;/span&gt;bokeh
-&lt;span class="w"&gt; &lt;/span&gt;numpy
-&lt;span class="w"&gt; &lt;/span&gt;paths:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;/utils.py
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/py-env&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To add a REPL-like component to create an interactive experience, one can just use the &lt;code&gt;&amp;lt;py-repl&amp;gt;&lt;/code&gt; tag&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;py-repl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;my-repl&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;auto-generate=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;true&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/py-repl&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and it'll create a widget like the one below, that can be used to access everything loaded and executed by the other tags we mentioned before, such as &lt;code&gt;&amp;lt;py-script&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;py-env&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;py-repl id="my-repl" auto-generate="true"&gt; &lt;/py-repl&gt;&lt;/p&gt;
&lt;p&gt;Since we already loaded pandas and numpy for you, try copying, pasting and running (by hitting the green arrow) the code below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Voila!&lt;/p&gt;
&lt;p&gt;The point is that by registering new web components that are simple and very expressive, users don't need to waste their time learning CSS and other web-specific development technologies.&lt;/p&gt;
&lt;h1&gt;Where is PyScript today?&lt;/h1&gt;
&lt;p&gt;Today, April 30th, 2022, PyScript is just at its beginning and is very limited compared to the vision we have for the project. It's a demonstration that we can build the vision and the technology is mature enough for us to create a new way of programming, building, sharing, and deploying applications. Be advised that it's very unstable and limited, but it works and can be used to hack with and build experimental applications.&lt;/p&gt;
&lt;p&gt;We hope to make progress fast and that in a few weeks/months, this post will be outdated :).&lt;/p&gt;
&lt;p&gt;For more information about the available features and how to get started, visit the project documentation.&lt;/p&gt;
&lt;h1&gt;Where is PyScript going&lt;/h1&gt;
&lt;p&gt;One of the ways I like to think of PyScript is "the Minecraft for software development". A framework that provides basic blocks for users to create their own worlds [applications] or new blocks [PyScript components and widgets] that others can use. In that sense we want to build a framework that is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;extremely simple and expressive&lt;/li&gt;
&lt;li&gt;feels familiar to users&lt;/li&gt;
&lt;li&gt;extensible:
&lt;ul&gt;
&lt;li&gt;so users can create new widgets and share them with others&lt;/li&gt;
&lt;li&gt;so we can support multiple runtimes...&lt;/li&gt;
&lt;li&gt;... and multiple languages ...&lt;/li&gt;
&lt;li&gt;... that can interop with each other ...&lt;/li&gt;
&lt;li&gt;... and yet be controlled to also create secure namespaces&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;runs on both the browser and server/native side&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to all that, it's worth mentioning that with this project, we are exploring new horizons, and a lot of the old paradigms that are at the roots of "standard" server-side programming are not that untouchable anymore. For instance, I/O, network, and storage on the browser/client side are not the same as in traditional native systems. We'll save that topic for another post, but the point here is that we have the opportunity to innovate and explore, and that's what we want to do.&lt;/p&gt;
&lt;p&gt;It's also worth mentioning that a lot of the core technology used to build PyScript is itself recent and very vibrant. As these technologies mature and expose new functionalities, we want to extend PyScript and take all the advantages we can get.&lt;/p&gt;
&lt;h1&gt;Thanks&lt;/h1&gt;
&lt;p&gt;PyScript wouldn’t be here without the help of some incredible people. We’d really like to thank:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Peter Wang, Kevin Goldsmith, Philipp Rudiger, Antonio Cuni, Russell Keith-Magee, Mateusz Paprocki, Princiya Sequeira, Jannis Leidel, David Mason, Anna NG, Maria Genovese, Katherine Kinnaman, Kent Pribbernow, Albert DeFusco, Michael Verhulst and Chris Leonard for the contributions to the project and helping spin it up&lt;/li&gt;
&lt;li&gt;Special thanks to the Pyodide maintainers (Roman Yurchak, Hood Chatham and all the contributors)&lt;/li&gt;
&lt;/ul&gt;</content><category term="Engineering"/><category term="pyscript"/><category term="javascript"/><category term="html"/><category term="framework"/><category term="future"/><category term="python"/></entry><entry><title>SBOMs at Anaconda</title><link href="https://engineering.anaconda.com/2022/04/sboms-at-anaconda.html" rel="alternate"/><published>2022-04-05T00:00:00-05:00</published><updated>2022-04-05T00:00:00-05:00</updated><author><name>Paul Yim</name></author><id>tag:engineering.anaconda.com,2022-04-05:/2022/04/sboms-at-anaconda.html</id><summary type="html">&lt;p&gt;SBOMs at Anaconda&lt;/p&gt;</summary><content type="html">&lt;p&gt;Last fall, Anaconda &lt;a href="https://www.anaconda.com/press/anaconda-announces-collaboration-with-microsoft"&gt;launched&lt;/a&gt; a collaboration with Microsoft to create Software Bills of Material (SBOMs) for all packages in the “defaults” channels of our repository. We are excited to announce that we have achieved this goal and to share why and how we did it.&lt;/p&gt;
&lt;h2&gt;What are SBOMs and what value do they provide?&lt;/h2&gt;
&lt;p&gt;Following the discovery of the &lt;a href="https://www.crowdstrike.com/blog/sunspot-malware-technical-analysis/"&gt;SolarWinds supply chain hack&lt;/a&gt; in 2021, the White House issued the &lt;a href="https://www.whitehouse.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/"&gt;&lt;em&gt;Executive Order on Improving the Nation’s Cybersecurity&lt;/em&gt;&lt;/a&gt;, which detailed new requirements to help strengthen the security of the Federal Government’s software supply chain. SBOMs are a key part of this effort, functioning as “&lt;a href="https://ntia.gov/SBOM"&gt;a list of ingredients&lt;/a&gt;” that enable users to verify the components, licensing, and provenance of software installed on their systems.&lt;/p&gt;
&lt;p&gt;Anaconda’s SBOMs are built in accordance with &lt;a href="https://spdx.dev/"&gt;Software Package Data Exchange&lt;/a&gt; (SPDX) specifications, version 2.2.1, which specifies the checksum hash values of software down to the individual file level. When a new software vulnerability is discovered and made public (e.g. via the &lt;a href="https://nvd.nist.gov/"&gt;NVD database&lt;/a&gt;), we can check whether our packages contain any vulnerable components, identify their hash values in our SBOMs, and use those values to verify whether these vulnerable components are installed on our users’ systems. The licensing and provenance information can also help our users (particularly enterprise customers) determine whether certain packages meet their governance standards. Finally, we cryptographically sign each SBOM document so that recipients can verify it, ruling out tampering.&lt;/p&gt;
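&lt;p&gt;As an illustration of the per-file checksumming involved (a sketch, not &lt;code&gt;sbomtool&lt;/code&gt; itself; the package filename is a hypothetical placeholder):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import hashlib
import tarfile

def file_digests(pkg_path):
    """Yield (path, sha256) for every file inside a .tar.bz2 conda
    package: the per-file checksums an SPDX document records."""
    with tarfile.open(pkg_path, "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                yield member.name, hashlib.sha256(data).hexdigest()

for path, digest in file_digests("numpy-1.21.2-py39_0.tar.bz2"):
    print(digest, path)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;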
&lt;h2&gt;Building &lt;code&gt;sbomtool&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Anaconda has about 300,000 package artifacts (&lt;code&gt;.tar.bz2&lt;/code&gt; and &lt;code&gt;.conda&lt;/code&gt; files) in its “defaults” channels, and we needed to build a tool that could create SBOMs for all of them. This is &lt;code&gt;sbomtool&lt;/code&gt;. On a basic level, &lt;code&gt;sbomtool&lt;/code&gt; is a CLI application written in Python that ingests conda packages and outputs SBOM documents that follow the SPDX specification. Early on, we discovered &lt;a href="https://github.com/spdx/tools-python"&gt;a great Python package&lt;/a&gt; (built and maintained by the SPDX organization) that provided us with an easy-to-use API to build and validate SBOMs.&lt;/p&gt;
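&lt;p&gt;As a rough illustration of the ingestion step, the sketch below walks a &lt;code&gt;.tar.bz2&lt;/code&gt; conda package and computes a SHA256 digest for each regular file, which is the kind of per-file data an SPDX document records. The function name is hypothetical, and the real &lt;code&gt;sbomtool&lt;/code&gt; does much more, including metadata extraction and SPDX document assembly via the API mentioned above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib
import tarfile

def package_file_checksums(package_path):
    """Map each regular file in a .tar.bz2 conda package to its SHA256."""
    checksums = {}
    with tarfile.open(package_path, mode="r:bz2") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # directories and symlinks need separate handling
            fileobj = tar.extractfile(member)
            if fileobj is not None:
                checksums[member.name] = hashlib.sha256(fileobj.read()).hexdigest()
    return checksums
&lt;/code&gt;&lt;/pre&gt;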
&lt;p&gt;However, before we could start churning out SBOMs, there were a number of challenges we had to address. For example, we knew that license information hadn’t always been consistently available in our packages. SPDX &lt;a href="https://spdx.org/licenses/"&gt;maintains&lt;/a&gt; a standard for formatting common open source software license types, and we incorporated a mechanism to apply this standard in our SBOMs and backfill corrections for packages that have incorrect or missing license information. We researched and made such corrections for over 1,800 packages so that their SBOMs comply with the SPDX license type standard.&lt;/p&gt;
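&lt;p&gt;A simplified sketch of that kind of normalization appears below. The mapping entries are illustrative examples only; the actual corrections were researched package by package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative-only mapping from raw license strings found in package
# metadata to SPDX license identifiers (https://spdx.org/licenses/).
LICENSE_CORRECTIONS = {
    "BSD": "BSD-3-Clause",
    "Apache 2.0": "Apache-2.0",
    "GPL2": "GPL-2.0-only",
}

def normalize_license(raw):
    """Return an SPDX license identifier for a raw license string.

    Missing licenses map to NOASSERTION, SPDX's marker for "no claim
    made". Unrecognized strings pass through here; real tooling would
    validate them against the full SPDX license list.
    """
    if not raw:
        return "NOASSERTION"
    return LICENSE_CORRECTIONS.get(raw.strip(), raw.strip())
&lt;/code&gt;&lt;/pre&gt;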
&lt;p&gt;There were other challenges we did not anticipate that led us down some interesting and productive paths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The architecture of a conda package can normally be found under the &lt;code&gt;subdir&lt;/code&gt; key in the &lt;code&gt;index.json&lt;/code&gt; metadata file. But after reviewing a particularly big batch of SBOM failures related to this key, we discovered that in much older conda packages the architecture was listed under a &lt;code&gt;platform&lt;/code&gt; key instead. A bit of code archaeology revealed that the &lt;code&gt;subdir&lt;/code&gt; key was &lt;a href="https://github.com/conda/conda-build/pull/317/commits/1dfd191854ee26ebebf28cca06ee51a00329165a"&gt;added in 2015&lt;/a&gt; and &lt;a href="https://github.com/Anaconda-Platform/anaconda-client/pull/115"&gt;was implemented to enable use of the first “noarch” builds of conda packages&lt;/a&gt;. (A sketch of the fallback logic appears after this list.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When we started using &lt;a href="https://github.com/spdx/tools-python"&gt;&lt;code&gt;tools-python&lt;/code&gt;&lt;/a&gt; to build &lt;code&gt;sbomtool&lt;/code&gt;, the only checksum algorithm it supported was SHA1, &lt;a href="https://www.trendmicro.com/vinfo/us/security/news/vulnerabilities-and-exploits/sha-1-collision-signals-the-end-of-the-algorithm-s-viability"&gt;which is no longer accepted as secure&lt;/a&gt;. We wrote a patch and contributed &lt;a href="https://github.com/spdx/tools-python/pull/200#issuecomment-981758826"&gt;a PR&lt;/a&gt; to the upstream project to enable use of more secure checksum algorithms like SHA256 (which we use in our SBOMs).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Some conda packages contain symlinks to system files on host environments or other files within the package itself. SPDX did not have a prescribed method for specifying symlinks when we began work on &lt;code&gt;sbomtool&lt;/code&gt;, so &lt;a href="https://github.com/spdx/spdx-spec/issues/610"&gt;we raised an issue&lt;/a&gt; and started a discussion that may soon lead to a policy update.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
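&lt;p&gt;As promised above, here is a hedged sketch of reading a package's architecture with a fallback from &lt;code&gt;subdir&lt;/code&gt; to the older &lt;code&gt;platform&lt;/code&gt; key. The function name is hypothetical, and a real tool may also need to consult the &lt;code&gt;arch&lt;/code&gt; key, since old &lt;code&gt;platform&lt;/code&gt; values are coarser than modern &lt;code&gt;subdir&lt;/code&gt; values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import tarfile

def package_architecture(package_path):
    """Read a conda package's architecture from info/index.json.

    Prefers the modern "subdir" key and falls back to the pre-2015
    "platform" key found in much older packages.
    """
    with tarfile.open(package_path, mode="r:bz2") as tar:
        try:
            fileobj = tar.extractfile("info/index.json")
        except KeyError:
            return None  # package is missing its index metadata
        index = json.load(fileobj)
    return index.get("subdir") or index.get("platform")
&lt;/code&gt;&lt;/pre&gt;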
&lt;h2&gt;Deployment&lt;/h2&gt;
&lt;p&gt;On March 24, 2022, after months of building &lt;code&gt;sbomtool&lt;/code&gt;, fixing bugs, and resolving metadata issues, we reached 100% SBOM coverage, and we continue to build SBOMs for the new packages that are uploaded daily. We are now focusing on automating and integrating &lt;code&gt;sbomtool&lt;/code&gt; into our package build pipeline, and we look forward to making SBOMs available as a new tiered service offering for our customers.&lt;/p&gt;</content><category term="Engineering"/><category term="sbomtool"/><category term="sbom"/><category term="spdx"/><category term="python"/><category term="metadata"/><category term="license"/><category term="licensing"/></entry><entry><title>Welcome to the Anaconda Engineering Blog</title><link href="https://engineering.anaconda.com/2022/03/welcome-to-the-anaconda-engineering-blog.html" rel="alternate"/><published>2022-03-21T23:40:00-05:00</published><updated>2022-03-21T23:40:00-05:00</updated><author><name>Kevin Goldsmith</name></author><id>tag:engineering.anaconda.com,2022-03-21:/2022/03/welcome-to-the-anaconda-engineering-blog.html</id><summary type="html">&lt;p&gt;Welcome to our new blog.&lt;/p&gt;</summary><content type="html">&lt;p&gt;At Anaconda, we talk a lot about the problems of numerical computing. We discuss the issues that Data Scientists, Data Engineers, Analysts, and others face daily. Many of us come from these backgrounds ourselves. We share our knowledge on the main &lt;a href="https://www.anaconda.com/blog"&gt;Anaconda blog&lt;/a&gt; and &lt;a href="https://anaconda.cloud/"&gt;Anaconda Nucleus&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are also a software engineering company. We contribute to open source projects like &lt;a href="https://conda.io/"&gt;conda&lt;/a&gt;, &lt;a href="https://numba.pydata.org/"&gt;Numba&lt;/a&gt;, &lt;a href="https://dask.org/"&gt;Dask&lt;/a&gt;, &lt;a href="https://bokeh.org/"&gt;Bokeh&lt;/a&gt;, &lt;a href="https://www.pyston.org/"&gt;Pyston&lt;/a&gt;, and &lt;a href="https://beeware.org/"&gt;Beeware&lt;/a&gt;. We build web services, websites, and on-premises products. We create a Python distribution used by over 25 million developers worldwide.&lt;/p&gt;
&lt;p&gt;We build products and packages using Python, JavaScript, Terraform, C, C++, Go, R, Java, and Objective-C.&lt;/p&gt;
&lt;p&gt;We created this site to share the knowledge we gain from building our products with the broader software development community.&lt;/p&gt;</content><category term="Meta"/></entry></feed>