<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Anaconda Engineering Blog</title><link href="https://engineering.anaconda.com/" rel="alternate"/><link href="https://engineering.anaconda.com/feeds/all.atom.xml" rel="self"/><id>https://engineering.anaconda.com/</id><updated>2023-09-01T00:00:00-05:00</updated><entry><title>Numba, Mojo, and Compilers for Numerical Computing</title><link href="https://engineering.anaconda.com/2023/09/numba-mojo.html" rel="alternate"/><published>2023-09-01T00:00:00-05:00</published><updated>2023-09-01T00:00:00-05:00</updated><author><name>Stan Seibert</name></author><id>tag:engineering.anaconda.com,2023-09-01:/2023/09/numba-mojo.html</id><summary type="html">&lt;p&gt;A comparison of Numba and Mojo, and a Mojo wishlist&lt;/p&gt;</summary><content type="html">&lt;p&gt;There have been a lot of articles written following Modular’s announcement of
&lt;a href="https://docs.modular.com/mojo/"&gt;Mojo&lt;/a&gt;, a Python “successor” language that
will bring high performance computing (especially for AI applications) to
Python users. Here at Anaconda, we’ve been working on the problem of HPC and
Python for over a decade now. As part of that work, we created
&lt;a href="https://numba.pydata.org"&gt;Numba&lt;/a&gt; in 2012, a Python just-in-time compiler
specifically for numerical computing, and have been improving it ever since.
We’ve learned a lot during that time about what it means to compile and
accelerate Python, and thought it would be useful to share some of that
perspective here. Along the way, you’ll learn about some of the design
tradeoffs Numba has had to navigate, and maybe get some context for the
challenges that Mojo will face as it takes a different (but potentially
complementary) approach to Python compilation.&lt;/p&gt;
&lt;h2&gt;When is Python useful for HPC?&lt;/h2&gt;
&lt;p&gt;The difficulty in deciding how Python should be best used for high performance
computing (HPC) hinges on the difficulty in defining the purpose of HPC. HPC
is often discussed in terms of maximizing execution speed of large, complex
calculations. For these purposes, Python is not very useful, except maybe as
a job configuration or control language delegating nearly all computation to
highly optimized components. (This is the traditional “glue language” role
that Python is very good at.)&lt;/p&gt;
&lt;p&gt;However, the economies of scale in software can confuse our understanding of
priorities. Some kinds of software, for example linear algebra algorithms,
are so ubiquitous across disciplines and so frequently used that they merit
spending a huge amount of developer effort on performance. There are almost
no situations where you should implement matrix multiplication yourself.
Instead, you should use MKL, or OpenBLAS, or cuBLAS, or some other library
that has had a tremendous amount of effort from hardware experts invested into
it. The time spent developing and debugging these sorts of libraries is tiny
compared to the amount of time the library will be used by the entire software
community. Additionally, it is often possible to find resources to fund these
fundamental libraries from hardware vendors and non-profit funding sources
because these applications are pivotal. This is the “HPC mass production”
scenario.&lt;/p&gt;
&lt;p&gt;But many software projects are not in this “mass production” category. They
are research projects where a small team of developers needs to iterate
quickly to discover a solution to a particular problem. Or they are
established projects where maintenance costs and barriers to contribution need
to be minimized because the team is small, or turnover (as in many academic
research contexts) is high. The incredible diversity of use cases in the
Python numerical computing ecosystem is its strength, but also means that
development processes optimized for “mass production” are not necessarily the
best choice. Speed matters, but total time matters more, including developer
onboarding time, feature development time, debugging time, maintenance time,
and packaging time, as well as execution time.&lt;/p&gt;
&lt;p&gt;This is the vast world of “high-enough performance computing,” and is full of
tradeoffs. Numba seeks to find a balance that reduces total person time in
the numerical computing space, of which execution time is one component. Mojo
is trying to do a similar thing, but with a different set of tradeoffs.
(Literally anyone who isn’t hand-coding assembly language for critical code
paths is trading execution speed for something.) Hopefully I can help
illustrate some of these tradeoffs in the following sections.&lt;/p&gt;
&lt;h2&gt;Python isn’t a language, it is a community&lt;/h2&gt;
&lt;p&gt;When people talk about the characteristics (like “speed” or “ease of use”) of
“Python,” they are usually mashing together a bunch of distinct parts that are
important to think about separately when designing a strategy to improve
Python in some way. These parts include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Python language specification&lt;/li&gt;
&lt;li&gt;The Python interpreter (usually CPython for most users)&lt;/li&gt;
&lt;li&gt;The Python standard library&lt;/li&gt;
&lt;li&gt;The huge collection of third party packages&lt;/li&gt;
&lt;li&gt;The network of users and developers who help create, maintain, and teach others about all of the above&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When a new solution to “the Python performance problem” is announced, I find
it useful to understand how it impacts each of these items separately. For
example, Numba makes the following choices:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python language&lt;/em&gt;: No changes. Numba uses standard Python language
  features, like decorators and context managers, to intercept execution and
  annotate code for compilation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python interpreter&lt;/em&gt;: No changes, although Numba is CPython-specific. Numba
  is specifically designed to work inside CPython as a regular module to
  minimize barriers to use and maximize application compatibility.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python standard library&lt;/em&gt;: No changes, but Numba can only accelerate a small
  number of standard Python types and built-in functions. Applications are
  free to use the entire standard library, but only outside of Numba-optimized
  functions (with some exceptions).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Third party packages&lt;/em&gt;: Numba does not prevent usage of third party packages
  in an application because Numba is confined only to specific functions the
  developer chooses to compile. However, Numba cannot compile code that calls
most third party libraries either. Numba includes specific optimizations for
&lt;a href="https://numpy.org/"&gt;NumPy&lt;/a&gt; arrays and functions, and support for
  other libraries can be added using Numba’s extension mechanism (for example
  &lt;a href="https://awkward-array.org/doc/main/"&gt;Awkward Array&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Users and developers&lt;/em&gt;: Numba tries to be easy to pick up for anyone
  familiar with Python and NumPy, although more advanced capabilities require
  learning a few new concepts. Numba also strives to be easy to integrate
  into other Python packages, avoiding a compile step during packaging, in
  some cases allowing packages to be distributed as pure Python source code
  which is compiled for the user’s specific hardware at runtime.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In general, Numba is trying to strike a balance of minimizing the barrier to adopting Numba in a numerical application by existing simply as a module that is compatible with the standard CPython interpreter. The application author does need to actively opt-in to Numba for each function that needs to be compiled, but we find this minimizes surprises and leads to more successful usage in the long run.&lt;/p&gt;
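&lt;p&gt;As a minimal sketch of that opt-in model (the function below is our illustration, not code from Numba’s docs), a single decorator marks the one hot function for compilation, and the rest of the program runs as ordinary CPython:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
from numba import njit  # njit is shorthand for jit(nopython=True)

@njit  # opting in: only this one function is compiled
def mean_abs_diff(x, y):
    total = 0.0
    for i in range(x.shape[0]):
        total += abs(x[i] - y[i])
    return total / x.shape[0]

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
mean_abs_diff(a, b)  # compiled on first call, specialized to these argument types
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
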
&lt;p&gt;Mojo takes a different approach:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python language&lt;/em&gt;: Mojo adds significant new syntax to the Python language, aspiring to be a “successor language” for Python that offers more low-level types and performance control. In particular, Mojo defines a new kind of function (using the “fn” keyword) that will be more strictly typed and less dynamic so that faster code can be generated. Regular functions (defined with the standard “def” keyword) have more dynamic and implicit behavior, which will make them feel more familiar to Python developers, but they are not “Python functions” in a strict sense.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python interpreter&lt;/em&gt;: Mojo has its own runtime interpreter (called “mojo”) which must be used to call Mojo code, and a compiler for standalone binaries will be available as well. Mojo also links to the CPython runtime, which it uses to allow it to import Python modules and call methods on Python objects.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Python standard library&lt;/em&gt;: Mojo has its own standard library, which currently is very low-level (at least in the public documentation). Mojo modules can access the Python standard library via the Python interop functionality.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Third party packages&lt;/em&gt;: As the Mojo compiler is not generally available, no one has made any Mojo packages. It is unclear how they will be distributed, or what the packaging infrastructure and tooling will look like. Mojo can access 3rd party Python packages via the interop interface, same as the Python standard library.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Users and developers&lt;/em&gt;: Mojo’s target audience currently seems to be developers of what I would call “application code.” Mojo is not yet targeting anyone creating reusable modules for redistribution, nor Python package developers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The best way I’ve found to think about Mojo is that it is trying to be a language that looks like Python, feels somewhat familiar to Python developers, and is designed to interoperate with Python out of the box. It is not a drop-in replacement for Python, however, unless you are writing a new application from scratch that makes limited use of other Python modules.&lt;/p&gt;
&lt;h2&gt;Mojo fn/def vs. Numba’s object/nopython modes&lt;/h2&gt;
&lt;p&gt;One of the more interesting decisions in Mojo is to create two kinds of functions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;def&lt;/code&gt; functions: These functions are designed to feel like standard Python
  functions. The dynamic behavior we expect from Python should continue to
work here, although Mojo def functions can go beyond Python by creating
  immutable variables, defining type signatures using Mojo types, and so on.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;fn&lt;/code&gt; functions: These functions have more strict rules about the contents of
  the function, such as required type declarations for arguments and local
  variables, and explicit declaration of exceptions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is clear from the restrictions that &lt;code&gt;fn&lt;/code&gt; functions are designed to be
statically typed by a compiler and can be compiled ahead of time with the
performance one would expect from a statically typed language.&lt;/p&gt;
&lt;p&gt;Numba takes a somewhat similar approach, and has two different compilation
modes, which are called “object mode” and “nopython” mode. These two compiler
modes have the following behavior:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;object mode&lt;/em&gt;: Does not attempt type inference on the function, treating all
  values as opaque Python objects. Object mode simulates the Python
  interpreter at compile time, emitting calls to the CPython C API to perform
  all operations on the Python objects in the function. This results in a
  negligible performance improvement to the code. Due to limitations in the
bytecode analysis front-end of Numba, some Python language constructs are
  not supported. Execution is still subject to the Python Global
  Interpreter Lock (GIL).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;nopython mode&lt;/em&gt;: This performs a complete type-inference pass on the
  function, and, for the subset of Python that is supported, eliminates all
  direct manipulation of Python objects. Scalar values are unboxed into
  native machine types, and operations on NumPy arrays directly modify the
  data buffer of those arrays. Most reference counting is completely
  eliminated, although a few instances may remain depending on whether new
  NumPy arrays are allocated. Numba, however, uses its own thread-safe
  reference counting internally, so the GIL can be released and multiple
  threads can execute Numba functions simultaneously.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In practice, we’ve found object mode is no longer very useful. In early
Numba, nopython mode was very limited, so object mode was paired with a
technique called “loop lifting.” The Numba compiler would find loops within a
function that could be compiled to nopython mode and would “lift” them out of
the function body, and the rest of the function would be compiled with object
mode. As nopython mode has gained more capabilities (including memory
allocation and access to typed Python-like containers), we find the use cases
for object mode compilation of functions have diminished. In fact, we’ve even
inverted the situation, and now have support for object mode blocks within a
nopython mode function, which is useful for things like supporting Python
callbacks.&lt;/p&gt;
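&lt;p&gt;To make that inversion concrete, here is a small sketch using the public &lt;code&gt;objmode&lt;/code&gt; context manager (the function body is an illustration, not a real workload):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from numba import njit, objmode

@njit
def sum_squares(x):
    acc = 0.0
    for v in x:
        acc += v * v
    with objmode():  # an object-mode block inside a nopython-mode function
        print("progress:", acc)  # arbitrary Python, e.g. a logging callback
    return acc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
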
&lt;p&gt;Since Numba exists within the CPython interpreter, the best practice for
functions that cannot be compiled with nopython mode is to not compile them at
all! Thanks to the Faster CPython effort, these sorts of generic functions
are getting faster with every release anyway, and Numba is most effective on
the numerical core of the application, where the compiler can strip away all
of the Python overhead. For this reason, Siu Kwan Lam (one of Numba’s
creators) often calls Numba a “performance linter” because the limitations of
nopython mode guide Numba users into using Python in a way that can be most
effectively sped up by a compiler, usually between 2x and 200x. If we can’t
achieve that, there isn’t much point to compilation in the first place.&lt;/p&gt;
&lt;p&gt;At first glance, Numba’s object mode seems like Mojo’s &lt;code&gt;def&lt;/code&gt; functions, but the
difference is that Mojo’s &lt;code&gt;def&lt;/code&gt; functions feel like Python functions while not
actually being Python functions. Typing is optional, but new Mojo declarations (like &lt;code&gt;var&lt;/code&gt;
and &lt;code&gt;let&lt;/code&gt;) are allowed, and the types are still Mojo types, not Python types.
Access to Python modules needs to happen via special wrappers created by the
Mojo Python interface, which can import Python modules. Mojo uses CPython as
a library to handle these calls, but the body of the &lt;code&gt;def&lt;/code&gt; function itself is
still executed by the Mojo runtime. In principle, this could make Mojo behave
a lot like Cython, but we still don’t know much about how convenient it will
be to use Python from Mojo in practice, and we know nothing about whether
Python will be able to call Mojo functions easily.&lt;/p&gt;
&lt;h2&gt;Array Types and Operations&lt;/h2&gt;
&lt;p&gt;The most important part of any numerical computing system in 2023 is the
multidimensional array. Multidimensional arrays are extremely useful and
memory efficient containers for numerical data, and concepts like
broadcasting, universal functions, and advanced indexing have dramatically
simplified the application of numerical computing to many situations,
including machine learning. Moreover, arrays are very compiler-friendly data
structures, which has created opportunities for many compilers and compiler
technologies. In the Python space, NumPy has become the standard array, and
without NumPy, there would be no Numba.&lt;/p&gt;
&lt;p&gt;The most surprising part of the Mojo announcement for me was the complete lack
of a built-in multidimensional array type. Given Mojo’s target use case of
machine learning workloads, I can only assume information about the
multidimensional array was held back for a future update, because it
absolutely has to exist for Mojo to be useful. The absence was doubly
surprising because &lt;a href="https://mlir.llvm.org/"&gt;MLIR&lt;/a&gt; (which the Mojo
documentation mentions as part of its core compiler design) includes a &lt;a href="https://mlir.llvm.org/docs/Dialects/TensorOps/"&gt;tensor
type&lt;/a&gt;, which is very similar
to the NumPy array, and also defines a &lt;a href="https://mlir.llvm.org/docs/Dialects/Linalg/"&gt;generic linear algebra
operation&lt;/a&gt; (&lt;code&gt;linalg.generic&lt;/code&gt;) in
one of the MLIR dialects that is extremely powerful. I would describe the
&lt;code&gt;linalg.generic&lt;/code&gt; operation as a superset of the &lt;a href="https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html"&gt;generalized
ufunc&lt;/a&gt;
that both NumPy and Numba support (via the
&lt;a href="https://numba.pydata.org/numba-doc/dev/user/vectorize.html#the-guvectorize-decorator"&gt;&lt;code&gt;@guvectorize&lt;/code&gt;&lt;/a&gt;
decorator), which can represent a surprisingly wide range of array operations
in a very natural and compact way. Not seeing &lt;code&gt;linalg.generic&lt;/code&gt; mentioned in
the Mojo docs was very disappointing, and I hope we learn more about it in a
future update.&lt;/p&gt;
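&lt;p&gt;For readers unfamiliar with generalized ufuncs, here is a small illustrative sketch using &lt;code&gt;@guvectorize&lt;/code&gt; (the kernel is our own example): the layout string declares a reduction over the last axis, and NumPy-style broadcasting handles all leading dimensions automatically.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
from numba import guvectorize

# "(n)-&amp;gt;()" declares a reduction over the last axis; broadcasting
# handles any number of leading dimensions automatically.
@guvectorize(["void(float64[:], float64[:])"], "(n)-&amp;gt;()")
def row_sum(row, out):
    acc = 0.0
    for i in range(row.shape[0]):
        acc += row[i]
    out[0] = acc  # scalar outputs are written through a 1-element view

x = np.arange(12.0).reshape(3, 4)
row_sum(x)  # array([ 6., 22., 38.])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
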
&lt;p&gt;With no array type to operate on, Mojo currently lacks all the array
functions that one would expect from a numerical computing language. These
can surely be added, and even implemented directly in the Mojo language, but
they are not there now. Hopefully Modular will learn from the recently
created &lt;a href="https://data-apis.org/array-api/2022.12/"&gt;Python Array API standard&lt;/a&gt;
and consider using that as the basis for their user-facing array API.&lt;/p&gt;
&lt;p&gt;Even beyond the array type and functions, it will be extremely important for
Mojo to have a clear mechanism to pass array-like data between it and other
languages, especially Python. &lt;a href="https://numpy.org/doc/stable/reference/c-api/index.html"&gt;NumPy’s C
API&lt;/a&gt; is just as
important as its Python API, as it enables passing data in and out of the
Python runtime to 3rd party libraries written in other languages. Mojo will
have a long road to broader adoption, so anything that eases interoperability
with existing numerical computing libraries will be essential.&lt;/p&gt;
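&lt;p&gt;As a Python-side illustration of the kind of interoperability we mean (the &lt;code&gt;bytearray&lt;/code&gt; below stands in for memory owned by any foreign runtime), the buffer protocol lets NumPy view external memory without copying:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np

raw = bytearray(8 * 6)  # stand-in for memory owned by another language/runtime
arr = np.frombuffer(raw, dtype=np.float64)  # zero-copy view of that memory
arr[:] = [1, 2, 3, 4, 5, 6]
# both names alias the same bytes; nothing was serialized or copied
assert bytes(raw[:8]) == arr[:1].tobytes()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
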
&lt;h2&gt;Extensibility and MLIR&lt;/h2&gt;
&lt;p&gt;Whether or not the Mojo compiler needs to be extensible is unclear. Many
languages are self-contained, in that you are expected to “extend” the
system by writing even more code in that language. That seems to be how Mojo
is being implemented now, as most of the syntax and standard library of Mojo
is focused on very low-level types and concepts, like SIMD operations and
object lifetime management. With those ingredients, many of the essential
higher level features needed in Mojo can be implemented in Mojo.&lt;/p&gt;
&lt;p&gt;Python (and Numba) are not like this at all. Python’s superpower (and curse
at times) is that it makes it very easy to extend Python with other languages,
typically C or C++, but recently Rust is also growing in popularity. Every
popular numerical computing package in Python either includes some non-Python
code in it, or depends on one that does. For this reason, the C API of Python
is very important for enabling extension of the interpreter to new types and
modules. In a similar way, Numba’s limitations mean that there are some
low-level operations that cannot be generated by Numba when it compiles a
function with its @jit decorator. For these cases, Numba has an extension API
which allows a variety of things, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generation of machine code (technically LLVM IR) to implement specific low level operations.&lt;/li&gt;
&lt;li&gt;Custom overrides of existing Python functions to define how they should be compiled for specific data types (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Definition of custom “boxing” and “unboxing” methods for passing new data types through Numba-compiled code.&lt;/li&gt;
&lt;/ul&gt;
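&lt;p&gt;As a sketch of the second item (our example, using the public &lt;code&gt;numba.extending&lt;/code&gt; API), an &lt;code&gt;@overload&lt;/code&gt; registration teaches the compiler how to handle calls to an otherwise ordinary Python function:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from numba import njit
from numba.extending import overload

def clamp(x, lo, hi):  # ordinary Python function, used as-is by the interpreter
    return min(max(x, lo), hi)

@overload(clamp)  # teach the compiler how to handle calls to clamp()
def clamp_overload(x, lo, hi):
    def impl(x, lo, hi):
        return min(max(x, lo), hi)
    return impl  # the returned implementation is itself compiled

@njit
def f(v):
    return clamp(v, 0.0, 1.0)

f(3.5)  # 1.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
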
&lt;p&gt;While Mojo does not necessarily need these things due to its design, I believe
the Python interoperability feature in Mojo will need an extensible mechanism
to define how to translate data structures across the Python / Mojo boundary.
We know from Numba experience that anything like a NumPy array is very easy to
move back and forth, and we believe that Apache Arrow data structures offer a
similar possibility for more heterogeneous data (dataframes and beyond). Numba
has even created a few typed data containers specifically to facilitate
movement of &lt;a href="https://numba.readthedocs.io/en/stable/reference/pysupported.html#typed-list"&gt;statically-typed
lists&lt;/a&gt;
and &lt;a href="https://numba.readthedocs.io/en/stable/reference/pysupported.html#typed-dict"&gt;statically-typed
dictionaries&lt;/a&gt;
across the Numba / Python boundary. Manipulating Python data through a wrapper
mechanism (as is currently described in the Mojo docs) is a fine general
solution, but faster mechanisms will be needed if Mojo is going to coexist
with non-trivial Python code.&lt;/p&gt;
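&lt;p&gt;A brief sketch of those typed containers in use (the values are illustrative): a &lt;code&gt;numba.typed.Dict&lt;/code&gt; can be created in the interpreter, mutated inside compiled code, and read back, without per-element boxing of Python objects.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from numba import njit, types
from numba.typed import Dict

d = Dict.empty(key_type=types.unicode_type, value_type=types.float64)
d["x"] = 1.5  # usable from the interpreter...

@njit
def scale(d, factor):  # ...and inside compiled code, without boxing elements
    for k in d:
        d[k] *= factor

scale(d, 2.0)
d["x"]  # 3.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
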
&lt;p&gt;Another area of extensibility that could be important to Mojo and is
definitely important to the future of Numba is interoperability with other
MLIR-based tools. MLIR is designed to be a high level compiler &lt;a href="https://en.wikipedia.org/wiki/Intermediate_representation"&gt;intermediate
representation&lt;/a&gt;
(IR) with some unique features. MLIR is extensible by defining new dialects,
which allows the same IR to be used on a wide range of compiler use cases,
and, in principle, permits different compiler tools that support MLIR to work
together. The Numba team is very excited about the potential for MLIR and
plans to incorporate it into the next generation of Numba. However, we can’t
tell how Mojo plans to expose its MLIR-based internals. Would something like
Numba be able to target the Mojo compiler pipeline directly, or possibly the
reverse: could future Numba consume the functions produced by the Mojo
frontend and parser?&lt;/p&gt;
&lt;h2&gt;Packaging and Composability&lt;/h2&gt;
&lt;p&gt;The final area of Mojo we are curious about is its packaging approach. One feature of Mojo that was emphasized was the ability of the Mojo compiler to produce fully self-contained executables that could be deployed wherever they were needed. Application distribution is a very important use case that has not gotten as much attention in the Python ecosystem as package distribution. Python’s plethora of packaging tools focus on allowing users to install Python components along with their complex web of dependencies. The &lt;a href="https://pypi.org/"&gt;Python Package Index&lt;/a&gt; (and adjacent package repositories, like &lt;a href="https://conda-forge.org/"&gt;conda-forge&lt;/a&gt; and the &lt;a href="https://www.anaconda.com/download"&gt;Anaconda Distribution&lt;/a&gt;) is truly one of Python’s superpowers, and if Mojo wants to jumpstart a broad community of developers, it will also need a packaging story. There are so many unanswered questions at this point:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Will hybrid Mojo-Python projects be possible, with both Mojo and Python
  dependencies?&lt;/li&gt;
&lt;li&gt;Will Mojo packages only distribute source code, with the intent that all
  Mojo applications must be compiled by the application builder along with all
  of their dependencies?&lt;/li&gt;
&lt;li&gt;Or, will Mojo allow precompiled modules to be distributed, in which case, how
  will inter-module optimization be enabled?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Aside:&lt;/em&gt; Although this may be a long shot, I think it would be great if Mojo
would consider &lt;a href="https://conda.io/projects/conda/en/latest/index.html"&gt;conda&lt;/a&gt;
as a package manager, as it already has excellent Python support, but is also a
cross-platform and &lt;em&gt;multi-language&lt;/em&gt; packaging system. Mojo doesn't need to
build yet another package manager, and conda would be an excellent choice for
the complex hybrid Mojo/Python projects that Modular hopes will exist in the
future. Give us a call, Chris, if you want to chat. 😀&lt;/p&gt;
&lt;p&gt;The packaging story for Numba today is much simpler, as it is just a Python
package that does just-in-time compilation. Projects that use Numba
compilation do so by taking Numba as a dependency, and then can choose to
distribute their own package as pure source code. The translation from Python
to machine code happens at runtime on the end user’s system. This
dramatically simplifies the distribution of those packages downstream of
Numba, whose authors do not need to build wheels for each target platform unless they
have other compiled code aside from Numba-decorated functions.&lt;/p&gt;
&lt;p&gt;However, the Numba team definitely appreciates that JIT compilation is not
suitable for all situations, either because of the increased startup time, or
limitations of the target platform, which might forbid JIT compilation
entirely. Because of this, in our planning for the &lt;a href="https://numba.discourse.group/t/proposal-development-focus-for-2023h1/1773"&gt;next generation of
Numba&lt;/a&gt;,
we’ve decided to first focus on an ahead-of-time compiler (like Mojo) and then
will treat just-in-time compilation as a special case. Ahead-of-time
compilation is much trickier because you have to make choices at build time
for which there are downstream consequences to consumers of your package. You
have to worry about ABI stability, supporting multiple variants of input
types, and different machine instruction set variants (like AVX, AVX2,
AVX-512, etc.). At the same time, we do not want ahead-of-time compilation to
preclude JIT optimization across modules that were distributed separately. We
have been working on techniques to handle all of these situations in a new
library called
&lt;a href="https://github.com/numba/pixie/blob/poc/numba_mvp/mvp/pixie_demonstration.ipynb"&gt;PIXIE&lt;/a&gt;,
which allows additional metadata to be inserted into library files to enable
dynamic dispatch, CPU feature selection, and optional future recompilation of
functions with a JIT.&lt;/p&gt;
&lt;h2&gt;Conclusions and a Wishlist for Mojo&lt;/h2&gt;
&lt;p&gt;I want to conclude by reiterating that my goal with this article is not to
dampen enthusiasm for Mojo, but to help the reader be more curious about the
details when they read about new Python and Python-adjacent compiler projects.
Impressive looking speedup factors are not where the most difficult challenges
are.&lt;/p&gt;
&lt;p&gt;I think it is an interesting idea to try to blend a Python-like syntax with
the capabilities of MLIR to target a wide range of potential hardware. Mojo’s
Python interoperability features could expand the capabilities of the Python
ecosystem, which would be a great thing. But, there is a tremendous amount of
work left to do on Mojo and a lot of design details have not been shared. We
simply don’t know enough to decide how Mojo will impact Python, or how best to
interact with it.&lt;/p&gt;
&lt;p&gt;Given that, I want to conclude by reiterating the wishlist of things I hope we see in Mojo in the coming months:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A native multidimensional array type that exposes the same power as MLIR’s tensor type and linalg dialect. That will enable the same kind of array programming we are familiar with from NumPy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;More explanation and infrastructure for managing data motion across the Mojo / Python boundary. NumPy arrays should be usable by Mojo native code (hopefully that tensor type we’re asking for in the previous item) without data copies. Similarly, being able to use Apache Arrow data structures across the two languages would be very powerful. In general, we need something extensible so that new data translators can be added.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Permit an inversion of control, so that Mojo modules can be imported from the Python interpreter. Mojo could be an interesting alternative to Cython if that was possible, and if NumPy arrays could be passed to Mojo functions with minimal overhead.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Expose the MLIR internals of Mojo via some kind of interface that lets us retarget the compiler. A true Python-to-Mojo compiler pipeline could be possible then, and that would allow a lot more experimentation. How should we use Mojo with new or custom MLIR dialects?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;More details about Mojo packaging in general. Will there be a package index for Mojo-specific modules by third parties? How should mixed Mojo and Python environments be handled? (Seriously, please take a look at conda. 🙂)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, we need to see an open source Mojo compiler and runtime before many
of these things will be possible. Hopefully we’ll get more details on that in
the future as well.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>Decimals type for pandas</title><link href="https://engineering.anaconda.com/2023/06/pandas-decimals.html" rel="alternate"/><published>2023-06-20T00:00:00-05:00</published><updated>2023-06-20T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-06-20:/2023/06/pandas-decimals.html</id><summary type="html">&lt;p&gt;Options for using fixed-precision decimals in pandas&lt;/p&gt;</summary><content type="html">&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Some time &lt;a href="http://martindurant.github.io/blog/anaconda-hack/"&gt;ago&lt;/a&gt; I had a
go at implementing a "decimals" extension type for pandas. This followed
stumbling upon parquet data of that type, which pandas could not read:
&lt;code&gt;pyarrow&lt;/code&gt; would error and &lt;code&gt;fastparquet&lt;/code&gt; would convert to floats. The decimal
type, with known, fixed precision, is very important in real-world applications
such as finance, where exact equality of fractional values is required.
The following should succeed, but does not with standard Python or pandas:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;
&lt;span class="kc"&gt;False&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;
&lt;span class="kc"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Although one could solve this with python's builtin &lt;code&gt;decimal.Decimal&lt;/code&gt;
objects, we want something that supports vectorised compute, which will
be orders of magnitude faster.&lt;/p&gt;
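&lt;p&gt;For completeness, the builtin type does give the exact arithmetic we want, just slowly and one object at a time:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from decimal import Decimal
&amp;gt;&amp;gt;&amp;gt; Decimal("0.1") + Decimal("0.1") + Decimal("0.1") == Decimal("0.3")
True
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
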
&lt;h2&gt;Implementations&lt;/h2&gt;
&lt;p&gt;I set out to make an implementation for discussion. This made use of
pandas' "extension types" system, which allows non-standard dtypes and
array types to be registered. A decent proof of concept was pretty quick
to make (see &lt;a href="https://github.com/intake/pandas-decimal"&gt;the repo&lt;/a&gt;).
However, in conversations around it, it was brought to my attention
that the advent of more generalised &lt;code&gt;arrow&lt;/code&gt; type support in pandas would
expose their &lt;a href="https://arrow.apache.org/docs/cpp/api/datatype.html#_CPPv4N5arrow4Type4type10DECIMAL128E"&gt;decimal&lt;/a&gt;
type for use in pandas too.&lt;/p&gt;
&lt;p&gt;So, if &lt;code&gt;arrow&lt;/code&gt; could handle this, then the need for &lt;code&gt;pandas-decimal&lt;/code&gt;
&lt;a href="https://github.com/intake/pandas-decimal/issues/3#issuecomment-1377694506"&gt;goes away&lt;/a&gt;.
I accepted this, but later conversation spurred me to look again. So let's
summarise the situation.&lt;/p&gt;
&lt;h2&gt;Comparisons&lt;/h2&gt;
&lt;h3&gt;Ease of use&lt;/h3&gt;
&lt;p&gt;From the python perspective, pyarrow interoperates with &lt;code&gt;decimal.Decimal&lt;/code&gt; objects; internally it holds
an optimised binary representation. Arithmetic with integers is automatic, and explicit conversion to/from
float is supported, but arithmetic with floats is not.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;decimal&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;dask.dataframe&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;dd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyarrow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nn"&gt;pa&lt;/span&gt;

&lt;span class="c1"&gt;# dtype object&lt;/span&gt;
&lt;span class="n"&gt;pa_dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decimal128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# creation from decimal objects&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pa&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8093.012&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8094.123&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8095.234&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8096.345&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8097.456&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;8098.567&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArrowDtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# explicit conversion&lt;/span&gt;
&lt;span class="n"&gt;df_float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_float&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArrowDtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# fail to operate on floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# ArrowTypeError&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# ArrowTypeError&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# OK&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Conversely, &lt;code&gt;pandas-decimal&lt;/code&gt; does not handle &lt;code&gt;decimal.Decimal&lt;/code&gt; objects at all, but instead implicitly converts
to and from floats on access. This means that arithmetic works as you would normally expect,
except that you lose the precision guarantees of the decimal representation.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas_decimal&lt;/span&gt;

&lt;span class="c1"&gt;# creation with floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;8093.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8094.123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8095.234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8096.345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8097.456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8098.567&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal[3]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# conversion&lt;/span&gt;
&lt;span class="c1"&gt;# explicit conversion&lt;/span&gt;
&lt;span class="n"&gt;df_float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_float&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal[3]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;equals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# operating on floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# OK&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# OK&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ValueError&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Performance&lt;/h3&gt;
&lt;p&gt;Comparing the two vectorised array types to python objects and to each other, we see clearly that
pandas-decimal wins on all operations. I also include one plain-float calculation, showing that
pandas-decimal is even faster than that.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;8093.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8094.123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8095.234&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8096.345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8097.456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;8098.567&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;

&lt;span class="c1"&gt;# pyarrow&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArrowDtype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pa_dtype&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mf"&gt;16.2&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;85.8&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;150&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;1.09&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mf"&gt;17.5&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;54.1&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Error&lt;/span&gt;

&lt;span class="c1"&gt;# pandas-decimal&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;decimal[3]&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mf"&gt;4.62&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;48.5&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;
&lt;span class="mf"&gt;4.9&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;530&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mf"&gt;2.16&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;1.79&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;11&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;119&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;

&lt;span class="c1"&gt;# pure python&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="mi"&gt;475&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;4.79&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;decimal&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decimal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;460&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mf"&gt;7.96&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mi"&gt;392&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;573&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Error&lt;/span&gt;

&lt;span class="c1"&gt;# inexact floats&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;x&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;float&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="mf"&gt;7.6&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt; &lt;span class="err"&gt;±&lt;/span&gt; &lt;span class="mi"&gt;115&lt;/span&gt; &lt;span class="n"&gt;µs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;IO&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;pyarrow&lt;/code&gt;'s decimal column can be saved to/loaded from parquet without conversion, and quickly. Conversion
to text (e.g., for CSV) is as slow as you would expect, since it doesn't gain from vectorisation.
&lt;code&gt;pandas-decimal&lt;/code&gt; is not integrated with any IO. In the original design, it was anticipated that integration
with &lt;code&gt;fastparquet&lt;/code&gt; would be trivial, and would not need any additional dependencies, but this has not been
implemented, since development of pandas-decimal halted.&lt;/p&gt;
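&lt;p&gt;A minimal round-trip sketch of the &lt;code&gt;pyarrow&lt;/code&gt; path (the file name is illustrative): the decimal column is written to and read from parquet with its type intact.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import decimal
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array([decimal.Decimal("8093.012")],
                                type=pa.decimal128(7, 3))})
pq.write_table(table, "dec.parquet")      # stored natively as DECIMAL
pq.read_table("dec.parquet")["x"].type    # decimal128(7, 3), no conversion
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
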
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;pyarrow&lt;/code&gt; does not provide the all-encompassing solution to fixed-precision decimals we would have hoped
for - at least not yet. One assumes that the functionality may well already exist within &lt;code&gt;arrow&lt;/code&gt; itself,
but is not well exposed to &lt;code&gt;pandas&lt;/code&gt; due to lack of interest/motivation. This surprised me somewhat,
since finance is such a big user of &lt;code&gt;pandas&lt;/code&gt;, and has traditionally required exact decimals, for instance
in its database models.&lt;/p&gt;
&lt;p&gt;For the time being, &lt;code&gt;pandas-decimal&lt;/code&gt;, a very small-effort proof-of-concept, shows clear advantages in
speed and flexibility. However, I still don't particularly recommend it to anyone, for three main reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it is a separate, non-standard install, whereas &lt;code&gt;pyarrow&lt;/code&gt; is likely already available in most environments&lt;/li&gt;
&lt;li&gt;it lacks testing or any support (the status of this in pandas/pyarrow is not known to me)&lt;/li&gt;
&lt;li&gt;the integration with parquet has not been done&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I am hoping that, by writing this post, I will motivate &lt;code&gt;arrow&lt;/code&gt; developers to improve decimal support, so
that &lt;code&gt;pandas-decimal&lt;/code&gt; can be retired even as a proof of concept.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>OSS Roundup</title><link href="https://engineering.anaconda.com/2023/05/oss-20230507.html" rel="alternate"/><published>2023-05-07T00:00:00-05:00</published><updated>2023-05-07T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-05-07:/2023/05/oss-20230507.html</id><summary type="html">&lt;p&gt;Weekly news from Anaconda's OSS engineers&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is our ongoing series of articles in which we will give weekly highlights
of open-source software (OSS) activities here at Anaconda. We will list various achievements
of the last week &lt;em&gt;or so&lt;/em&gt;,
link to interesting things and give brief details of ongoing work and plans. Each team
will only write something when they have something to say.&lt;/p&gt;
&lt;h3&gt;Conda (language-agnostic, multi-platform package management ecosystem)&lt;/h3&gt;
&lt;p&gt;We’ve released updates to conda-package-handling and conda-package-streaming to reduce
memory usage. This will be the first conda-package-handling release published to PyPI, once the
PyPI admins free up the name using PyPI’s new organization feature.&lt;/p&gt;
&lt;h3&gt;Distdatacats Team (remote bytes, file formats, catalogs and data processing)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;fsspec 2023.5.0 and friends are out&lt;/li&gt;
&lt;li&gt;Reference Filesystem can now write references directly to parquet, allowing parquet-to-parquet
  combining, which should have a much smaller memory footprint for very large reference sets
  (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;dask-awkward optimisations are finally in a good place: layer merging works and is fast, and we
  can run column optimisation on only one partition to avoid scaling issues. Upstream
  dask &lt;code&gt;cull()&lt;/code&gt; remains an issue, scaling with number of partitions, and we are looking
  at ways to avoid this. The high-energy physics workflows prompting this introspection
  are easily the biggest dask task graphs in existence.&lt;/li&gt;
&lt;li&gt;Article on &lt;a href="https://engineering.anaconda.com/2023/05/dask-parquet-s3.html"&gt;this very blog&lt;/a&gt;
  about benchmarking a particular dask-parquet-s3 workflow and what we learned.&lt;/li&gt;
&lt;/ul&gt;
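&lt;p&gt;As a rough sketch of what consuming parquet-backed references can look like (the path and
target options here are invented for illustration):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import fsspec

# references stored as parquet (hypothetical path); the data bytes live on s3
fs = fsspec.filesystem(
    "reference",
    fo="refs.parq",
    remote_protocol="s3",
    remote_options={"anon": True},
)
m = fs.get_mapper("")  # e.g. hand this mapper to zarr
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
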
&lt;h3&gt;Numba (JIT-compiling python code to make it fast)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://numba.discourse.group/t/ann-numba-0-57-0-and-llvmlite-0-40-0/1914"&gt;Numba 0.57.0 and llvmlite 0.40.0 have been released&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Python 3.11, NumPy 1.24, and LLVM 14 support. See &lt;a href="https://numba.readthedocs.io/en/0.57.0/release-notes.html#version-0-57-0-1-may-2023"&gt;Changelog&lt;/a&gt; for more details.&lt;/li&gt;
&lt;li&gt;This release had 253 PRs from 47 different authors!&lt;/li&gt;
&lt;li&gt;Started to look at integrating numba_rvsdg into numba&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Jupyter (in-browser IDE for python and others)&lt;/h3&gt;
&lt;p&gt;The team has released the 1.0.0 version of &lt;a href="https://nbclassic.readthedocs.io/en/latest/nbclassic.html"&gt;nbclassic&lt;/a&gt;,
a package that allows the “classic”
Jupyter Notebook (equivalent to Notebook 6.5) to be installed and used alongside JupyterLab
or Notebook 7 in an environment. We’ve also put out a release of the
&lt;a href="https://github.com/Jupyter-contrib/jupyter_nbextensions_configurator"&gt;jupyter-nbextensions-configurator&lt;/a&gt;,
which fixes some compatibility issues with nbclassic.
This week the team will be at JupyterCon, so stop by the Anaconda booth and say hello!&lt;/p&gt;
&lt;h3&gt;BeeWare (deploy python projects to mobile and elsewhere)&lt;/h3&gt;
&lt;p&gt;This week, the BeeWare team has been cleaning up after the PyCon US sprints. The sprints
generated dozens of major and minor feature contributions; this week we’ve been able to
merge nearly all of them.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>Benchmarking a dask-parquet-s3 workflow</title><link href="https://engineering.anaconda.com/2023/05/dask-parquet-s3.html" rel="alternate"/><published>2023-05-04T00:00:00-05:00</published><updated>2023-05-04T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-05-04:/2023/05/dask-parquet-s3.html</id><summary type="html">&lt;p&gt;Benchmarking is still hard...&lt;/p&gt;</summary><content type="html">&lt;h3&gt;Foreword&lt;/h3&gt;
&lt;p&gt;Benchmarking is &lt;a href="https://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks"&gt;hard&lt;/a&gt;
and usually biased by the author's experience and viewpoint, which strongly affects
what they choose to benchmark and how many "tricks" they know to optimise performance.&lt;/p&gt;
&lt;p&gt;This article is an extension of benchmarking on the
&lt;a href="https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes"&gt;Coiled blog&lt;/a&gt;, which showed
that using PyArrow string rather than python string objects is beneficial for the
workflow presented. I don't disagree with that conclusion, but would like to explore
the performance details further.&lt;/p&gt;
&lt;p&gt;As a side-note, the possible
advantages of PyArrow string storage were part of the discussion when Pandas decided to
switch to using PyArrow as the default Parquet loading engine (a decision Dask later
followed). As the author of &lt;code&gt;fastparquet&lt;/code&gt;, the previous
default, I clearly have a vested interest in showing that the package is still a performance
contender, so take my observations with some salt.&lt;/p&gt;
&lt;h3&gt;Setup&lt;/h3&gt;
&lt;p&gt;As in the original benchmark, we will be timing the following workflow:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tipped&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tips&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hvfhs_license_num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tipped&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The timing will be done on the last line only, so the IO and shuffle-compute are
both contributing, but IO dominates. The data are 25GiB of parquet in 720 files in a
public access bucket on AWS S3. Only the first line (the one with the
ellipsis) will change between runs. Column "hvfhs_license_num" contains strings
with a small number of unique values, and "tips" is float32. All columns are SNAPPY
compressed.&lt;/p&gt;
&lt;p&gt;For the timings here, there were 10 dask workers with 1 thread and 4GB of memory
each, in a Kubernetes cluster via dask-gateway in AWS us-east-1. This is not the same
cluster setup as the original! The same cluster was used throughout, and there
was no significant memory pressure at any point.
All versions were at the current latest (dask 2023.4.0, pyarrow 11.0.0, pandas 2.0.1,
fastparquet 2023.4.0). Each time is the best of three repeats.&lt;/p&gt;
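&lt;p&gt;For context, a cluster like this can be created along the following lines (a sketch; the
available worker options depend on the particular dask-gateway deployment):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from dask_gateway import Gateway

gateway = Gateway()              # address/auth come from local configuration
cluster = gateway.new_cluster()  # worker size is set via deployment-specific options
cluster.scale(10)                # ten workers, as used for these timings
client = cluster.get_client()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
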
&lt;h3&gt;Runs&lt;/h3&gt;
&lt;h4&gt;1. Baseline (pyarrow with python strings)&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.convert-string&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 4m3s&lt;/p&gt;
&lt;p&gt;Everything here is at its default settings.&lt;/p&gt;
&lt;h4&gt;2. Use fastparquet instead&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m40s&lt;/p&gt;
&lt;p&gt;Yes, here is my big bias. Whenever someone says something can be improved in arrow,
I try with &lt;code&gt;fastparquet&lt;/code&gt;. What do you know? It's magically much better in this case.
I can't say exactly why this is. Note that &lt;code&gt;fastparquet&lt;/code&gt;
produces columns of python string objects when the base parquet type is UTF8.&lt;/p&gt;
&lt;h4&gt;3. Fastparquet with categories&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m48s&lt;/p&gt;
&lt;p&gt;This was the original supposition I was interested in: the grouping column is actually
stored as dict-encoded in the files, so loading as pandas categorical should be much faster,
but it made... almost no difference at all. More on this below.&lt;/p&gt;
&lt;p&gt;Pyarrow does not allow you to load this data
as categorical.&lt;/p&gt;
&lt;h4&gt;4. Pyarrow and pyarrow string&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dask&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;dataframe.convert-string&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;pyarrow&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m17s&lt;/p&gt;
&lt;p&gt;This was the thesis of the original article, that the new string storage mechanism
should be much faster. It provides a decent boost over PyArrow (with Python string),
and is also better than &lt;code&gt;fastparquet&lt;/code&gt;, above.
I do not have the knowledge to be able to tweak pyarrow further, but note that
this is still using s3fs as the library to fetch bytes (a discussion about using
pyarrow's own s3 implementation is another reason I wanted to chase this topic).&lt;/p&gt;
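&lt;p&gt;For the curious, "pyarrow's own s3 implementation" refers to something like the following
(a sketch, not benchmarked here; the region value is carried over from run 5 below):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(anonymous=True, region="us-east-2")
dataset = ds.dataset(
    "coiled-datasets/uber-lyft-tlc/", filesystem=s3, format="parquet"
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
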
&lt;h4&gt;5. Rust implementation of s3fs and fastparquet&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;fsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;register_implementation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;s3&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rfsspec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RustyS3FileSystem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clobber&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m5s&lt;/p&gt;
&lt;p&gt;This is the same as 2., but the S3 transfer is being done by
&lt;a href="https://github.com/martindurant/rfsspec"&gt;rfsspec&lt;/a&gt;, which is still very experimental.
Note that rfsspec requires the region to be provided, so some of the speed boost will
come from avoiding HTTP redirection in all S3 calls. Also, &lt;code&gt;rfsspec&lt;/code&gt; has a larger
default buffer size, so there might be fewer requests here.&lt;/p&gt;
&lt;p&gt;All the rest of the runs below use rfsspec.&lt;/p&gt;
&lt;h4&gt;6. Now with parquet-specific file access&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_file_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;precache_options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 2m5s&lt;/p&gt;
&lt;p&gt;Please see &lt;a href="https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/"&gt;this article&lt;/a&gt;
for a motivation and description of using the footer information of a parquet file to
know exactly which byte-ranges will be needed, and prospectively/concurrently fetching
them. We see that it makes no difference whatsoever, which is extremely fishy!&lt;/p&gt;
&lt;h4&gt;7. Next, specify the columns manually&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_file_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;precache_options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 49s&lt;/p&gt;
&lt;p&gt;Bingo! This has made the biggest difference so far. So, it seems that dask did not figure
out that we only needed those two columns from the source, and maybe all the bytes of every
file were being loaded every time. That also explains why the "precache-option" (6.) didn't
make any difference - we were still loading the whole thing - and why categorising
our groupby column (3.) didn't help - it was only affecting one column of many.&lt;/p&gt;
&lt;h4&gt;8. Add categories back in&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;s3://coiled-datasets/uber-lyft-tlc/&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;anon&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;region&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;us-east-2&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;tips&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;fastparquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;open_file_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;precache_options&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;method&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;parquet&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;hvfhs_license_num&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Time: 39s&lt;/p&gt;
&lt;p&gt;This combines 7. and 3. Now we see that categorisation makes a decent difference, since
we don't need to make python strings, and grouping on a code number is much faster too.
This also has the best memory footprint (two bytes per value for the categorical column).&lt;/p&gt;
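&lt;p&gt;A toy illustration (not part of the benchmark above) of why grouping on integer codes beats
grouping on python string objects:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
import pandas as pd

s = pd.Series(np.random.choice(["HV0002", "HV0003", "HV0005"], size=1_000_000))
%timeit s.groupby(s).size()                 # hashes python string objects
c = s.astype("category")
%timeit c.groupby(c, observed=True).size()  # groups on small integer codes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
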
&lt;h3&gt;Remaining low-hanging fruit?&lt;/h3&gt;
&lt;p&gt;If 720 files are being processed in 10 workers in 40s, that means very roughly half a
second each, assuming the final aggregations are a small fraction of the total. At this
point, latency talking to the remote store starts to matter, as the number of bytes
in the parquet headers and the actual data bytes in the two columns don't amount to much
against AWS-to-AWS throughput.&lt;/p&gt;
&lt;p&gt;Doing profiling, it turns out that each task calls &lt;code&gt;fs.info&lt;/code&gt;
three times per input file: once to check whether it is a file, once to get the file
size (so you can read the footer with the parquet metadata) and once when it is
opened again to fetch data. s3fs wants to have file information available, so that
it can require an ETag match on the target, and avoid corrupted data should the target
get overwritten during IO. However, we should be able to cache these details, at least for
a short time. Right now, about 20% of the running time (of worker thread time, according
to the Dask profile dashboard) is spent just running &lt;code&gt;info&lt;/code&gt;,
and we can cut that by a factor of 3. s3fs already caches directory listings, but
rfsspec does not, and info() bypasses that cache anyway - so there is some work to do.&lt;/p&gt;
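&lt;p&gt;The fix could be as simple as a short-lived cache around &lt;code&gt;info&lt;/code&gt; (an
illustrative sketch, not how s3fs actually implements anything):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import time

def cached_info(fs, ttl=10):
    """Reuse fs.info results for `ttl` seconds instead of re-requesting them."""
    cache = {}
    original = fs.info

    def info(path, **kwargs):
        now = time.time()
        hit = cache.get(path)
        if hit is not None and now - hit[0] &amp;lt; ttl:
            return hit[1]
        out = original(path, **kwargs)
        cache[path] = (now, out)
        return out

    fs.info = info
    return fs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
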
&lt;h3&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;In brief, here I show a few ways in which you can really push performance for a
relatively simple load-group-aggregate workflow on dask-dataframe. It turns out
that the new pyarrow-strings flag available in dask is not the biggest lever that
you can pull; in particular,
&lt;a href="https://dask.discourse.group/t/column-optimzation/1815"&gt;column selection&lt;/a&gt;
is critical.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/><category term="python"/><category term="dask"/><category term="s3"/><category term="parquet"/></entry><entry><title>OSS Roundup</title><link href="https://engineering.anaconda.com/2023/04/oss-20230429.html" rel="alternate"/><published>2023-04-29T00:00:00-05:00</published><updated>2023-04-29T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-04-29:/2023/04/oss-20230429.html</id><summary type="html">&lt;p&gt;Weekly news from Anaconda's OSS engineers&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is our ongoing series of articles in which we will give weekly highlights
of open-source software (OSS) activities here at Anaconda. We will list various achievements
of the last week &lt;em&gt;or so&lt;/em&gt;,
link to interesting things and give brief details of ongoing work and plans. Each team
will only write something when they have something to say.&lt;/p&gt;
&lt;h3&gt;Conda (language-agnostic, multi-platform package management ecosystem)&lt;/h3&gt;
&lt;p&gt;The conda team has officially launched &lt;a href="http://conda.org"&gt;conda.org&lt;/a&gt;! 🎉🚀 This website will be
home for the entire conda community. We are still very much actively developing
it and welcome any contributions. Have a great idea for a blog article or feature?
Get in touch with us by filing an issue at our GitHub project or stop by our
Matrix chat and say hello 👋.&lt;/p&gt;
&lt;h3&gt;Distdatacats Team (remote bytes, file formats, catalogs and data processing)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;2023.4.0 release of fastparquet is out&lt;/li&gt;
&lt;li&gt;Please see the partner blog about the
  &lt;a href="https://engineering.anaconda.com/2023/04/intake-gets-wings-from-duckdb.html"&gt;transform capabilities&lt;/a&gt;
  of intake-duckdb&lt;/li&gt;
&lt;li&gt;In nice synergy with the work on duckdb,
  &lt;a href="https://github.com/intake/intake/pull/729"&gt;dataframe pipelines&lt;/a&gt; are coming to Intake core (full
  documentation yet to come)&lt;/li&gt;
&lt;li&gt;continued releases of dask-awkward and friends as we iron out bugs for large complex
  high-energy analysis workflows.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Numba (JIT-compiling python code to make it fast)&lt;/h3&gt;
&lt;p&gt;After resolving a few bugs discovered in the Numba 0.57.0 release candidate
(including one very obscure LLVM bug!),
we have done the &lt;a href="https://numba.discourse.group/t/ann-numba-0-57-0-and-llvmlite-0-40-0/1914"&gt;final release&lt;/a&gt; this week.&lt;/p&gt;
&lt;h3&gt;Jupyter (in-browser IDE for python and others)&lt;/h3&gt;
&lt;p&gt;The Jupyter team released a new version of nbclassic (0.5.6), the component
that allows the classic Notebook (version 6.x) to coexist in an environment
with JupyterLab. This is in preparation for officially tagging version 1.0 of
nbclassic later this week.&lt;/p&gt;
&lt;p&gt;This has also been a big documentation week, as
we’ve been working on pull requests across a few Jupyter projects to explain
development conventions (like how semantic versioning is used) and to improve
readability for newer users and extension authors. We’re also busy preparing
for JupyterCon in just over a week, where the whole Anaconda Jupyter team will
be in attendance, along with several other Anaconda people. If you are planning
to attend, please stop by the booth and say hello, and check out the talk on
the &lt;a href="https://cfp.jupytercon.com/2023/talk/TWJMCN/"&gt;future of the Jupyter notebook&lt;/a&gt; we are co-presenting.&lt;/p&gt;
&lt;h3&gt;BeeWare (deploy python projects to mobile and elsewhere)&lt;/h3&gt;
&lt;p&gt;This past week the BeeWare team was at the PyCon Sprints. We peaked at ~25 contributors
working on issues, and handed out almost as many challenge coins to first time
contributors, tackling a range of issues from minor documentation cleanups to major
new features in Briefcase and Toga. For a full summary, check out our &lt;a href="https://beeware.org/news/buzz/april-2023-status-update/"&gt;April 2023 Status Update&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;PyScript&lt;/h3&gt;
&lt;p&gt;If you haven't seen the announcement from PyCon US, check out
&lt;a href="http://pyscript.com"&gt;pyscript.com&lt;/a&gt;!&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>Intake Gets Some Wings From DuckDB</title><link href="https://engineering.anaconda.com/2023/04/intake-gets-wings-from-duckdb.html" rel="alternate"/><published>2023-04-27T00:00:00-05:00</published><updated>2023-04-27T00:00:00-05:00</updated><author><name>Blake Rosenthal</name></author><id>tag:engineering.anaconda.com,2023-04-27:/2023/04/intake-gets-wings-from-duckdb.html</id><summary type="html">&lt;p&gt;Intake's new DuckDB plugin&lt;/p&gt;</summary><content type="html">&lt;p&gt;Loading data can be a pain. Let's say you have a 100GB folder of Gzipped CSVs sitting on Amazon S3 — what is the simplest way of converting this dataset into a Dask DataFrame that your data science team can work with? What about a Jupyter notebook that contains the five-ish lines of Python needed to handle imports, credentials, and loading? What if the source data changes, or the URL changes, or you want to include metadata or plots? Now those five-ish lines are broken and you need to somehow push the updates to all your users.&lt;/p&gt;
&lt;p&gt;Enter &lt;a href="https://intake.readthedocs.io/en/latest/"&gt;Intake&lt;/a&gt;. With Intake you can encode all the details of your data collection into a single &lt;a href="https://intake.readthedocs.io/en/latest/catalog.html"&gt;catalog&lt;/a&gt;. Remote files (including catalogs themselves) are a breeze thanks to &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/index.html"&gt;fsspec&lt;/a&gt;. Details such as filetypes, transfer protocols, chunk sizes are all abstracted away from the user who only needs to know enough to open a catalog and read the data sources. Intake makes it easy for data stewards to curate large, varied, and complex datasets into easily distributable and maintainable catalogs. Importantly, it places the onus of handling the data's particular eccentricities (access patterns, etc.) on the data steward rather than the end user who just wants to be handed a DataFrame so they can move on with their life. Intake is used by data science teams and data engineers, and plays a key role in &lt;a href="https://docs.anaconda.com/pro/anaconda-notebooks/notebook-data-catalog/"&gt;Anaconda Nucleus' new data catalogs service&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Intake aims to give you your data then get out of the way. It is &lt;em&gt;just&lt;/em&gt; a data loader, not a query engine. But what if you want to provide your users with a transformed, subsampled, or aggregated view of your dataset? Intake provides an interface for building custom plugins that will handle these sorts of things, but building plugins or extending existing plugins takes some time and know-how. What if you could simply write some SQL and provide a transformed data source directly, without needing to modify or copy the original data?&lt;/p&gt;
&lt;p&gt;Enter &lt;a href="https://duckdb.org/docs/"&gt;DuckDB&lt;/a&gt;. DuckDB wants to be your general purpose query engine for tabular data. It is a data format, a library of analytical tools, and an overall data science workhorse. Below we'll go into some of what makes DuckDB special, the problems it tries to solve, and why it is a natural extension to Intake's existing functionality.&lt;/p&gt;
&lt;h2&gt;Intake background&lt;/h2&gt;
&lt;p&gt;Intake organizes data sources with &lt;a href="https://intake.readthedocs.io/en/latest/api_base.html#intake.source.base.DataSource"&gt;&lt;code&gt;DataSource&lt;/code&gt;&lt;/a&gt; objects; a &lt;code&gt;DataSource&lt;/code&gt; is a wrapper around some container type, commonly a DataFrame, that has a bunch of metadata about the source data and a &lt;code&gt;.read()&lt;/code&gt; method for loading the actual data into memory. Data sources can point to remote files, integrate with &lt;a href="https://docs.dask.org/en/stable/"&gt;Dask&lt;/a&gt;, and even load pre-defined &lt;a href="https://intake.readthedocs.io/en/latest/plotting.html"&gt;plots&lt;/a&gt; with &lt;a href="https://hvplot.holoviz.org/index.html"&gt;hvPlot&lt;/a&gt;. An intake &lt;a href="https://intake.readthedocs.io/en/latest/api_base.html#intake.catalog.Catalog"&gt;&lt;code&gt;Catalog&lt;/code&gt;&lt;/a&gt; is a collection of data sources. Catalogs can even nest inside other catalogs.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# nyc_taxi.yaml&lt;/span&gt;
&lt;span class="nt"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;nyc_taxi&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;NYC Taxi dataset&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;parquet&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;urlpath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;s3://datashader-data/nyc_taxi_wide.parq&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;datashade&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;intake&lt;span class="w"&gt; &lt;/span&gt;intake-parquet&lt;span class="w"&gt; &lt;/span&gt;s3fs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;intake&lt;/span&gt;

&lt;span class="n"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intake&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;nyc_taxi.yaml&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nyc_taxi&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Intake allows you to &lt;a href="https://intake.readthedocs.io/en/latest/making-plugins.html"&gt;extend&lt;/a&gt; the &lt;code&gt;DataSource&lt;/code&gt; and &lt;code&gt;Catalog&lt;/code&gt; classes and add them to the intake registry, wrapping your custom data structure, whether it's a file or database, in Intake's semantics. Now your user can call &lt;code&gt;df = intake.open_catalog(...).source_name.read()&lt;/code&gt; on anything you'd like. Many such plugins already exist and can be found at the &lt;a href="https://intake.readthedocs.io/en/latest/plugin-directory.html"&gt;plugin directory&lt;/a&gt;.&lt;/p&gt;
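&lt;p&gt;A minimal plugin might look like the following sketch (modeled on the making-plugins guide; the source itself is invented for illustration):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import numpy as np
import pandas as pd
from intake.source.base import DataSource, Schema

class RandomSource(DataSource):
    """Toy source producing a DataFrame of random numbers."""
    container = "dataframe"
    name = "random_numbers"  # the driver name referenced in catalogs
    version = "0.0.1"
    partition_access = False

    def __init__(self, n, metadata=None):
        self.n = n
        super().__init__(metadata=metadata)

    def _get_schema(self):
        return Schema(
            datashape=None,
            dtype={"x": "float64"},
            shape=(self.n,),
            npartitions=1,
            extra_metadata={},
        )

    def read(self):
        return pd.DataFrame({"x": np.random.random(self.n)})
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
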
&lt;h2&gt;Adding a query engine&lt;/h2&gt;
&lt;p&gt;The purpose of Intake is to make it easy to distribute existing data, but what if you want to share only part of a source, or an aggregation like a groupby? Well, you can perform the derivation yourself and save the result as a new source. That's not ideal. A better solution would be to use Intake's &lt;a href="https://intake.readthedocs.io/en/latest/transforms.html"&gt;Dataset Transforms&lt;/a&gt; which operate on an existing &lt;code&gt;DataSource&lt;/code&gt; and perform some sort of custom operation. The particular logic of the operation is up to the developer who needs to wrap the &lt;a href="https://intake.readthedocs.io/en/latest/transforms.html#intake.source.derived.DerivedSource"&gt;&lt;code&gt;DerivedSource&lt;/code&gt;&lt;/a&gt; class, write the code, and package the code for distribution. Intake provides the &lt;a href="https://intake.readthedocs.io/en/latest/transforms.html#class-example"&gt;&lt;code&gt;Columns&lt;/code&gt;&lt;/a&gt; transformation which returns just the source DataFrame's columns as an example.&lt;/p&gt;
&lt;p&gt;This sort of functionality is very useful, especially for teams and projects that know their data well and want to integrate Intake into a data pipeline, have sources that are complex derivations of other sources, or have non-DataFrame containers. See &lt;a href="https://github.com/intake/intake-xarray/blob/master/intake_xarray/derived.py#L38"&gt;intake-xarray&lt;/a&gt; for an example of a custom Xarray Dataset transform.&lt;/p&gt;
&lt;p&gt;For more general purpose transformations, the &lt;a href="https://intake-duckdb.readthedocs.io/en/latest/"&gt;intake-duckdb&lt;/a&gt; plugin leverages DuckDB's unique ability to query many types of tabular datasets as if they were database tables. Being an embeddable, single-file data format, DuckDB resembles SQLite but is optimized for analytics and aggregations. It can also operate directly on data that exists only in memory without copying anything. With a modest amount of work, Intake-DuckDB extends this capability to the humble &lt;code&gt;DataSource&lt;/code&gt; and any child class, as long as the container type is a DataFrame.&lt;/p&gt;
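&lt;p&gt;The core trick is easy to demonstrate outside Intake. In this standalone sketch, DuckDB treats an in-memory DataFrame as a queryable table by name, without copying it:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import duckdb
import pandas as pd

df = pd.DataFrame({"zip": ["10001", "10002"], "crashes": [3, 5]})

# duckdb finds `df` in the local scope and queries it in place
out = duckdb.query("SELECT zip, crashes * 2 AS doubled FROM df").df()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
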
&lt;p&gt;Now you can provide filtered or modified versions of your data without needing to write a custom Intake plugin or copy any data around. Just write a little SQL. Tools like &lt;a href="https://ibis-project.org/"&gt;Ibis&lt;/a&gt; can even help with converting complex DataFrame operations into valid SQL.&lt;/p&gt;
&lt;h2&gt;Example&lt;/h2&gt;
&lt;p&gt;The following catalog aggregates and joins some data about vehicle crashes in New York in 2023. Notice that &lt;code&gt;ny_crashes_vs_registrations_2023&lt;/code&gt; uses the &lt;code&gt;duckdb_transform&lt;/code&gt; driver to perform a join on &lt;code&gt;ny_vehicle_registrations&lt;/code&gt; and &lt;code&gt;ny_crashes&lt;/code&gt; which are both &lt;code&gt;CSVSource&lt;/code&gt;s.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# ny_crashes.yaml&lt;/span&gt;
&lt;span class="nt"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;ny_vehicle_registrations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;New York vehicle registrations&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;csv&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;urlpath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://data.ny.gov/api/views/w4pv-hbkt/rows.csv&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;csv_kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;usecols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Zip&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;State&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Record Type&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Valid Date&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Expiration Date&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;parse_dates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Valid Date&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Reg Expiration Date&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;blocksize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;ny_crashes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;New York traffic crashes&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;csv&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;urlpath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;https://data.cityofnewyork.us/api/views/h9gi-nx95/rows.csv&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;csv_kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;usecols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ZIP CODE&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;CRASH DATE&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;parse_dates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;CRASH DATE&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;{&lt;/span&gt;&lt;span class="nt"&gt;ZIP CODE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;object&lt;/span&gt;&lt;span class="p p-Indicator"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;blocksize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;ny_crashes_vs_registrations_2023&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Comparison of New York vehicle crashes vs. registrations by ZIP&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;code in 2023&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;duckdb_transform&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ny_vehicle_registrations&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;ny_crashes&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;sql_expr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;|&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;SELECT r.zip, c.crash_count, r.registration_count&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;FROM (&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;SELECT &amp;quot;ZIP CODE&amp;quot; as zip, COUNT(*) as crash_count&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;FROM ny_crashes&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;WHERE &amp;quot;CRASH DATE&amp;quot; BETWEEN &amp;#39;2023-01-01&amp;#39; AND &amp;#39;2023-12-31&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;GROUP BY &amp;quot;ZIP CODE&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;) as c&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;JOIN (&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;SELECT Zip as zip, COUNT(*) as registration_count&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;FROM ny_vehicle_registrations&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;WHERE State = &amp;#39;NY&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="no"&gt;AND &amp;quot;Record Type&amp;quot; = &amp;#39;VEH&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="no"&gt;AND &amp;quot;Reg Valid Date&amp;quot; &amp;lt;= &amp;#39;2023-12-31&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;                &lt;/span&gt;&lt;span class="no"&gt;AND &amp;quot;Reg Expiration Date&amp;quot; &amp;gt;= &amp;#39;2023-01-01&amp;#39;&lt;/span&gt;
&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="no"&gt;GROUP BY Zip&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;) as r&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="no"&gt;ON c.zip = r.zip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;intake&lt;/span&gt;

&lt;span class="n"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intake&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;open_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;ny_crashes.yaml&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cat&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ny_crashes_vs_registrations_2023&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/intake/intake-duckdb#readme"&gt;README&lt;/a&gt; for some additional drivers that can build Intake catalogs from embedded DuckDB files and query them directly.&lt;/p&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;p&gt;Intake-DuckDB is in early stage development, arguably still a prototype. There are few configuration options for &lt;code&gt;duckdb_transform&lt;/code&gt; sources, and interoperability with other Intake sources is largely untested. DuckDB contains a rich set of features, including running queries directly on remote datasets without needing to suck the whole thing into memory, and generating plots directly with the engine; Intake uses none of this, but could in the future. Intake could very well push more processing down to the Duck layer, or use DuckDB as a general purpose &lt;a href="https://intake.readthedocs.io/en/latest/persisting.html"&gt;persistence store&lt;/a&gt; for any source type. Questions and PRs are more than welcome over on &lt;a href="https://github.com/intake/intake-duckdb"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/><category term="python"/><category term="intake"/><category term="duckdb"/><category term="data-catalogs"/></entry><entry><title>OSS Roundup</title><link href="https://engineering.anaconda.com/2023/04/oss-20230417.html" rel="alternate"/><published>2023-04-17T00:00:00-05:00</published><updated>2023-04-17T00:00:00-05:00</updated><author><name>Martin Durant</name></author><id>tag:engineering.anaconda.com,2023-04-17:/2023/04/oss-20230417.html</id><summary type="html">&lt;p&gt;Weekly news from Anaconda's OSS engineers&lt;/p&gt;</summary><content type="html">&lt;p&gt;This is the first in an ongoing series of articles in which we will give weekly highlights
of open-source software (OSS) activities here at Anaconda. We will list various achievements
of the last week &lt;em&gt;or so&lt;/em&gt;,
link to interesting things and give brief details of ongoing work and plans. Each team
will only write something when they have something to say.&lt;/p&gt;
&lt;h3&gt;General&lt;/h3&gt;
&lt;p&gt;Keep an eye on PyCon US, where &lt;code&gt;pyscript.com&lt;/code&gt; officially launches this week. For
Europeans, we are also at PyCon DE/PyData Berlin, which has already started.&lt;/p&gt;
&lt;h3&gt;Numba Team (JIT-compiling python code to make it fast)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Numba 0.57.0rc1&lt;/strong&gt; and &lt;strong&gt;llvmlite 0.40.0rc1&lt;/strong&gt; are released. The new Numba and llvmlite releases
add support for Python 3.11 and NumPy 1.24, and upgrade to LLVM 14. After this release,
we will be dedicating more time to developing a new compiler pipeline. More details
can be found in the &lt;a href="https://numba.discourse.group/t/proposal-numba-2023-mvp/1792"&gt;Numba 2023 MVP proposal&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Distdatacats Team (remote bytes, file formats, catalogs and data processing)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The Intake project has a new driver leveraging DuckDB: &lt;a href="https://github.com/intake/intake-duckdb"&gt;intake-duckdb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;fsspec and friends release 2023.4.0 with greatly increased
  &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/copying.html"&gt;documentation&lt;/a&gt; and
  coverage around expectations for bulk file operations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rfsspec&lt;/code&gt;, the experimental Rust reimplementation of some async backends for fsspec,
  is now on PyPI as &lt;a href="https://pypi.org/project/rfsspec/"&gt;0.1.0&lt;/a&gt;. We will be using this
  for benchmarking IO-heavy workloads on dask clusters.&lt;/li&gt;
&lt;li&gt;the kerchunk project is trying to merge reference sets directly to parquet output,
  hoping to handle complete datasets with a radically smaller memory footprint&lt;/li&gt;
&lt;li&gt;awkward-array 2023.4.1 is out, and we have been working hard to improve our optimization
  code to deal with the extreme requirements of High-Energy Physics analysis, which has
  so many more operations and input files than any other data processing pipeline we've come
  across before.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Jupyter Team (in-browser IDE for python and others)&lt;/h3&gt;
&lt;p&gt;The Jupyter Team has been working on patching JupyterLab extensions in preparation
for the upcoming JupyterLab 4 release. We have also been working on pull requests
for a few Jupyter Notebook 6 extensions that were adversely affected by an update
of Marked.js in the notebook code to deal with a reported CVE. The team is preparing
for the upcoming JupyterCon in Paris, where we will be co-presenting a talk on the
past, present and future of the Jupyter Notebook.&lt;/p&gt;
&lt;h3&gt;BeeWare Team (deploy python projects to mobile and elsewhere)&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;PyCon US 2023 is rapidly approaching&lt;/strong&gt;, and BeeWare will be there! There are two
BeeWare-related talks on the schedule, and we'll have a booth in the Community
section of the main floor. If you're going to be there, stop by - and if you're
a fan of the project and want to help us staff the booth - get in touch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Toga 0.3.1 has been released&lt;/strong&gt;. This fixes a number of layout issues, and completes
the documentation and implementation of 9 widgets (Widget, Button, Label,
ActivityIndicator, Box, Divider, ProgressBar, Switch and Slider). It also introduces
the use of Shoelace as a web component toolkit for the web backend.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Briefcase 0.3.14 has been released.&lt;/strong&gt; This is another big Briefcase release, adding
code signing for Windows, system packaging for Arch, ManyLinux-based AppImage builds,
faster Flatpak builds, and support for PyGame.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rubicon ObjC 0.4.6 has been released.&lt;/strong&gt; This is a minor bugfix release, mostly to
silence a warning that was raised when Rubicon was installed with
recent versions of Setuptools.&lt;/p&gt;</content><category term="Engineering"/><category term="open-source"/></entry><entry><title>My Summer at Anaconda: Porting NumPy's `random` module to Numba</title><link href="https://engineering.anaconda.com/2022/08/numba.html" rel="alternate"/><published>2022-08-12T00:00:00-05:00</published><updated>2022-08-12T00:00:00-05:00</updated><author><name>Kaustubh Chaudhari</name></author><id>tag:engineering.anaconda.com,2022-08-12:/2022/08/numba.html</id><summary type="html">&lt;p&gt;Summary of Kaustubh's work at Anaconda as a Software Engineer Intern working on porting NumPy's &lt;code&gt;random&lt;/code&gt; module to Numba.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Python has a reputation for being notoriously slow, especially for numeric operations. Most of us use packages like &lt;a href="https://numpy.org/"&gt;NumPy&lt;/a&gt; and &lt;a href="https://scipy.org/"&gt;SciPy&lt;/a&gt; to mitigate this and make huge numeric calculations run in at least a bearable time. In this post, we’ll talk a bit about &lt;a href="https://numba.pydata.org/"&gt;Numba&lt;/a&gt;, the famed package designed to speed up your Python code, even beyond the capabilities of NumPy, using a simple decorator. More importantly, we’ll learn about a long-awaited feature introduced in the latest release: support for NumPy’s new random module. This enables access to NumPy’s random number generator methods from within Numba functions, by allowing NumPy’s &lt;a href="https://numpy.org/doc/stable/reference/random/generator.html"&gt;Generator&lt;/a&gt; objects to cross the JIT boundary.&lt;/p&gt;
&lt;p&gt;Hey! I’m Kaustubh, a Computer Science student from India. During the summer of 2022, I interned at Anaconda and worked on developing the Numba project. This post is a (shamelessly self-promotional) summary of my work on the Numba library during my internship.&lt;/p&gt;
&lt;h4&gt;Introducing Numba&lt;/h4&gt;
&lt;p&gt;Before getting into what I did in Numba, let’s take a short tour of Python numeric libraries and how Numba fits into all this.
The same qualities that make Python user-friendly and suitable for data science also make it slow, the biggest being that Python is an interpreted language. Traditionally, this problem was mitigated by writing computationally intensive algorithms in C or C++ and calling them from the outward-facing Python code. NumPy is an excellent example of this type of organization: the most computationally intensive parts are written in C/C++ and exposed to Python via binding code.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fun fact: the NumPy arrays that we use most of the time are actually stored, indexed, and iterated using C code.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The other approach to improving the performance of your Python code is to use a Python-accelerating library. Numba is a perfect example of this. Numba translates Python functions to highly optimized machine code at runtime using the industry-standard &lt;a href="https://llvm.org/"&gt;LLVM&lt;/a&gt; compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN because of the amount of low-level optimization applied to your code, especially if it contains looping logic. Additionally, Numba supports automatic parallelization of loops (which is different from loop optimization), generation of GPU-accelerated code (using CUDA), and creation of universal functions (ufuncs) and C callbacks. Libraries like these combine the relative ease of writing Python code with the performance of executing the operations in a different backend altogether.&lt;/p&gt;
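&lt;p&gt;To make that concrete, here is a minimal sketch (the function name and constants are ours, purely for illustration) of what using Numba looks like: decorating a plain Python loop with &lt;code&gt;@njit&lt;/code&gt; compiles it to machine code on first call.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from numba import njit

@njit
def monte_carlo_pi(n):
    # The whole function is compiled to machine code via LLVM on the
    # first call; the tight loop below then runs at native speed.
    inside = 0
    for _ in range(n):
        x, y = np.random.random(), np.random.random()
        if x * x + y * y &amp;lt;= 1.0:
            inside += 1
    return 4.0 * inside / n

print(monte_carlo_pi(10_000_000))  # later calls reuse the compiled code
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;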
&lt;h4&gt;Numba's support for NumPy methods&lt;/h4&gt;
&lt;p&gt;One of the advantages of using Numba is its near-seamless integration with NumPy functions, especially NumPy &lt;code&gt;ndarrays&lt;/code&gt;. Numba detects whenever a NumPy function is used within code being Just-In-Time (JIT) compiled and internally dispatches the provided arguments to a Numba-based implementation that provides the same functionality as the original.&lt;/p&gt;
&lt;p&gt;However, a major annoyance when directly JIT-compiling complex code written mainly with NumPy is that some NumPy capabilities are not supported within Numba. For instance, support is notably incomplete for NumPy’s infamous fancy/advanced array indexing, and the &lt;code&gt;axis&lt;/code&gt; argument is missing for several NumPy methods.&lt;/p&gt;
&lt;p&gt;Another feature NumPy users frequently requested in Numba was support for the new NumPy random module. For those who aren’t familiar with it, NumPy has a module dedicated to random number generation, including numbers drawn from particular distributions, such as the Poisson or Gaussian distributions. This is especially useful in scientific computing, where generating random data is necessary for statistical analyses and computations. These are also exactly the situations where Python-accelerating libraries are needed most, since computationally intensive algorithms make up a large part of the code logic. Hence, having random generation support within JIT-compiled code is a huge advantage in the majority of cases.&lt;/p&gt;
&lt;p&gt;Previously, NumPy had a global ‘state array’: a sequence of bits stored as an array from which random numbers could be drawn. This state array could be initialized from a seed (an integer or an array of integers). The array was updated repeatedly using the &lt;a href="https://en.wikipedia.org/wiki/Mersenne_Twister"&gt;Mersenne Twister&lt;/a&gt; algorithm, a pseudo-random number algorithm. These kinds of algorithms have a cycle, so they eventually repeat. The advantage of the Mersenne Twister is that it has an especially long cycle (its period is a large Mersenne prime, which is where it takes its name from). The global state was also helpful for reproducibility: by setting the same initial seed, we could get exactly the same results (at least up to rounding errors) when rerunning the code. Two systems with the same seed have the same state array, so the subsequent random numbers generated are the same. However, there were problems with a global state; for example, when two independent threads on the same system run the same scripts and access the same global state array, it is challenging to get reproducible results from them. To mitigate this, NumPy introduced class-based random number generation using objects named Generators, and replaced the global state with an object named BitGenerator that holds the state as an attribute. This also had the added advantage of enabling faster and more convenient algorithms for random number generation, like &lt;a href="https://lemire.me/en/publication/arxiv1805/"&gt;Lemire’s rejection&lt;/a&gt; algorithm.&lt;/p&gt;
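&lt;p&gt;As a quick illustration of the difference (a sketch with arbitrary seeds, not code from the PRs discussed below): the legacy API draws from one hidden global state, while the new API makes the state an explicit object you can pass around.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np

# Legacy API: one hidden global state array (Mersenne Twister)
np.random.seed(42)
legacy_draws = np.random.normal(size=3)

# New API: the state lives in an explicit Generator object, so two
# threads can each own an independent, reproducible stream
rng = np.random.default_rng(42)
modern_draws = rng.normal(size=3)

print(legacy_draws, modern_draws)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;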
&lt;h4&gt;Basic Support for NumPy Generators&lt;/h4&gt;
&lt;p&gt;PR &lt;a href="https://github.com/numba/numba/pull/8031"&gt;#8031&lt;/a&gt; added support for these Generator objects to Numba. There were many challenges in implementing Generators as objects that could be lowered into Numba’s LLVM-based environment while keeping track of both the original Generator object and the underlying BitGenerator object. Fortunately, to make the work easier, the NumPy devs provide ctypes bindings to BitGenerators, which allowed us to draw bits directly from the C implementations of the object. (Thanks, NumPy devs!) The initial implementation was provided by Stuart Archibald from the Numba team and was improved with proper reference-count tracking and error handling.&lt;/p&gt;
&lt;p&gt;One of the issues we faced while implementing Generator support was that we needed to maintain a reference to the original object as a pointer, in case we ever needed to return it. One could easily do that by simply storing the pointer somewhere accessible, but there is a caveat: Numba has its own runtime, the NRT (Numba Run Time), which manages its own reference counts. It is NOT responsible for maintaining Python reference counts of objects outside the NRT. Hence, the object a pointer refers to may no longer exist, because it has been deleted in the original Python environment. This was mitigated by the use of MemInfo objects, which keep track of information about the memory behind the pointer, including its references in the Python environment.&lt;/p&gt;
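&lt;p&gt;Putting it together, here is a minimal sketch of what this support enables (assuming a Numba release that includes the Generator work): the Generator is created in ordinary Python code and then crosses the JIT boundary as an argument.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import numpy as np
from numba import njit

@njit
def jitted_draws(rng, n):
    # rng is a NumPy Generator that crossed the JIT boundary; bits are
    # drawn via the ctypes bindings to the BitGenerator's C implementation
    return rng.normal(size=n)

rng = np.random.default_rng(1234)
print(jitted_draws(rng, 5))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;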
&lt;p&gt;This PR was followed by subsequent PRs that added distributions built on top of the Generator support. &lt;a href="https://github.com/numba/numba/pull/8038"&gt;#8038&lt;/a&gt; and &lt;a href="https://github.com/numba/numba/pull/8040"&gt;#8040&lt;/a&gt; added such distributions, while &lt;a href="https://github.com/numba/numba/pull/8041"&gt;#8041&lt;/a&gt; and &lt;a href="https://github.com/numba/numba/pull/8042"&gt;#8042&lt;/a&gt; added support for general methods of Generator objects such as shuffling and random integer generation (Lemire’s rejection algorithm). A majority of my internship was devoted to these PRs: half of it implementing them and the other half tracking down a very sneaky bug within them. To understand this ‘bug’, let’s first look at where it originates.&lt;/p&gt;
&lt;p&gt;While implementing &lt;a href="https://github.com/numba/numba/pull/8038"&gt;#8038&lt;/a&gt;, we noticed that even though we had directly translated the algorithms from NumPy code, there were slight precision differences between the results produced by Numba and those produced by NumPy. More interestingly, the discrepancy was only observed on certain systems on the CI: it appeared on aarch64 and ppc64le systems but not on linux-64 systems. At first the differences were small enough to ignore, around 2000 ULPs (Units of Least Precision) for 64-bit results, but as more complex distributions were introduced, the differences grew to the order of tens of thousands of ULPs. Thus we had to address the problem at its core. After multiple fruitless debugging sessions, we finally tracked the precision differences down to the assembly instructions causing them. (Thanks to Stuart again.) The differences were caused by floating-point contraction: certain instruction sets (such as those of ppc64le and aarch64) have fused instructions like fmadd and fmsub that combine two operations, a multiply and an add/subtract, into one, rounding only once. In our case, the NumPy code was executed using these instructions while the Numba code wasn’t, and the numerical result of a single fmadd can differ slightly from that of separate multiply and add instructions. Once the problem was identified, we came up with &lt;a href="https://github.com/numba/numba/pull/8032"&gt;#8232&lt;/a&gt; as a fix.&lt;/p&gt;
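&lt;p&gt;The effect is easy to reproduce in plain Python (a sketch using the classic textbook constants; &lt;code&gt;math.fma&lt;/code&gt; requires Python 3.13+): a fused multiply-add rounds once, while a separate multiply and add round twice.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import math

x, y = 2.0**27 + 1.0, 2.0**27 - 1.0  # both exactly representable in float64
z = -(2.0**54)                       # but x * y == 2**54 - 1 is NOT

two_roundings = x * y + z         # multiply rounds up to 2**54, then add: 0.0
one_rounding = math.fma(x, y, z)  # fused: exact product, single rounding: -1.0
print(two_roundings, one_rounding)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Magnified across the many multiply-add steps inside a distribution algorithm, single-rounding differences like this add up to the ULP-scale discrepancies we saw.&lt;/p&gt;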
&lt;p&gt;Generator support was released in version 0.56.0, along with a load of other cool features. Subsequent releases will keep adding more distribution-related algorithms; we aim for full parity with the NumPy counterparts in both the latest and the legacy random number generator modules. In the future, the plan is to add more complementary features to this module, such as making Generator objects thread-safe and making it possible to construct them within Numba CPU-JITed code.&lt;/p&gt;
&lt;p&gt;Another cool feature I was working on during the summer is support for fancy/advanced indexing in Numba (starting with &lt;a href="https://github.com/numba/numba/pull/8038"&gt;#8238&lt;/a&gt;). I’ll be posting a blog post about it soon, so stay tuned for further updates.&lt;/p&gt;
&lt;h4&gt;A note to my mentor and team&lt;/h4&gt;
&lt;p&gt;Thank you to the Numba team for all of your support, especially Valentin Haenel (my mentor during the internship) for his guidance and Stuart Archibald for the initial patches and reviews. Couldn’t have made it this far without you guys.&lt;/p&gt;
&lt;p&gt;Days since last segfault: 0x0&lt;/p&gt;
&lt;p&gt;Ich bin ein Berliner!!&lt;/p&gt;</content><category term="Engineering"/><category term="numba"/><category term="random numbers"/><category term="open-source"/></entry><entry><title>Mahe's Internship Experience at Anaconda</title><link href="https://engineering.anaconda.com/2022/07/internship-experience.html" rel="alternate"/><published>2022-07-20T00:00:00-05:00</published><updated>2022-07-20T00:00:00-05:00</updated><author><name>Mahe Iram Khan</name></author><id>tag:engineering.anaconda.com,2022-07-20:/2022/07/internship-experience.html</id><summary type="html">&lt;p&gt;Mahe Iram Khan's experience at Anaconda as a Software Engineer Intern working on Grayskull, an open source conda recipe generator.&lt;/p&gt;</summary><content type="html">&lt;p&gt;In &lt;a href="https://engineering.anaconda.com/2022/07/grayskull.html"&gt;part one&lt;/a&gt; of this two-part blog series I wrote about conda and software packaging. I explained important terminology from the conda packaging ecosystem and discussed the features of the automatic recipe generator called Grayskull. In part two, I talk about my work during the internship at Anaconda, my experience here, and what I learned.&lt;/p&gt;
&lt;h2&gt;My Work During The Internship&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Adding CRAN Support to Grayskull&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Grayskull could already generate recipes for Python packages available on PyPI and GitHub. Another useful package origin is CRAN, given the popularity of the R language, so during this internship I worked on adding CRAN support to Grayskull.
  I studied the CRAN documentation and learnt how CRAN ships its packages and what sources are available for extracting the metadata of an R package.
  Through my research, I found that all R packages have a 'DESCRIPTION' file. This file contains metadata about the package.
  I began to map information in the DESCRIPTION file to the information in a conda recipe, and I realized that a number of fields in the conda recipe for an R package can be directly populated from the information available in that package's DESCRIPTION file. I was, therefore, able to generate R recipes through Grayskull. Of course, the DESCRIPTION file does not have all the information needed to write the entire recipe; additional layers were added (and more are to be added) to fill in the missing information.
  Presently we are only able to generate recipes for simple R packages, i.e. packages that do not need system-specific compilation. In the next iteration, we will try to also support complex packages, whose recipes must include compiler information. See &lt;a href="https://github.com/conda-incubator/grayskull/pull/349/files"&gt;here&lt;/a&gt;.&lt;/p&gt;
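&lt;p&gt;To give a flavor of what that mapping starts from, here is a hedged sketch (not Grayskull's actual code; the sample fields are illustrative) of reading a DESCRIPTION file, which uses 'Field: value' lines with indented continuations:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;def parse_description(text):
    """Parse the 'Field: value' pairs of a CRAN DESCRIPTION file.
    Indented lines continue the value of the previous field."""
    fields, key = {}, None
    for line in text.splitlines():
        if line[:1].isspace() and key is not None:
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields

sample = """Package: jsonlite
Version: 1.8.7
License: MIT + file LICENSE
Description: A fast JSON parser and generator optimized
    for statistical data and the web."""

print(parse_description(sample)["License"])  # MIT + file LICENSE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;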
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Detecting Percentage Match of Licenses in Generated Recipes&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every conda package is shipped with a license. Some standard licenses, such as MIT, Apache, and BSD, are widely used. Sometimes people add these licenses to their projects but modify them according to their needs. Conda recipes require the license information of the package, so it is important that during recipe generation, when Grayskull detects the package's license, the user is informed that the license has been modified and to what extent.
  The intent here is to detect and inform when subtle changes have been added to the original text (like one extra clause), making it a new license altogether. Grayskull uses the rapidfuzz Python library to fuzzy-match the package license against a list of standard licenses.
  I used the 'fuzz' module of this library to calculate the percentage match of the license and then display a warning to the user. This lets the user know if their included license deviates significantly from the standard version.&lt;/p&gt;
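&lt;p&gt;A minimal sketch of that idea (the file paths are hypothetical placeholders; Grayskull keeps its own set of reference license texts):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;from rapidfuzz import fuzz

# Compare the text of a standard license against the text shipped in a
# package (both paths are hypothetical)
reference = open("spdx_licenses/MIT.txt").read()
candidate = open("package/LICENSE").read()

score = fuzz.ratio(reference, candidate)  # similarity as a 0-100 percentage
if score &amp;lt; 100:
    print(f"Warning: license matches MIT at only {score:.1f}%; "
          "it may have been modified.")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;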
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Initializing Grayskull Documentation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;During one of the 'Hackdays' at Anaconda, I set up the initial documentation for Grayskull. As the work on Grayskull progresses, we will need proper documentation in place to keep track of the development. The documentation will also help new contributors to onboard with ease. The 'Hackdays' were a great motivation to do things fast!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Writing a Conda Enhancement Proposal&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finally, I wrote a CEP (Conda Enhancement Proposal) explaining why Grayskull is a great recipe generator and how we can make the migration from conda-skeleton to Grayskull possible.
  The CEP compares features between conda-skeleton and Grayskull and discusses what features need to be added to Grayskull to make it more versatile. The CEP serves as a single point for the community to interact and discuss the proposed changes and make valuable suggestions before a decision is made. You can check out the CEP &lt;a href="https://github.com/conda/ceps/pull/17/files?short_path=0f8fc8e#diff-0f8fc8e13bcbb7680d85f8120cbb7b42735da265ae8e690435467628c86ed6e3"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The Many Things That I Learned&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Leadership and Initiative-Taking Skills&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When my internship officially began we did not yet have a fixed plan about &lt;em&gt;what&lt;/em&gt; I was going to work on. Usually the internship project is decided based on the prior experiences, expertise and interests of the intern. My mentor, &lt;a href="https://twitter.com/jezdez"&gt;Jannis Leidel&lt;/a&gt;, encouraged me to explore the various projects ongoing within Conda and see if something interested me. I explored. I was already more than a week into the internship and still couldn't decide what I wanted to work on. This made me anxious. I needed the project to be challenging enough that it would help me grow, but not so difficult, so far above my current level of skill and knowledge, that I might get overwhelmed and drop it midway. I needed it to be at the sweet spot of being challenging while aligning with my previous experience and knowledge.
  I suggested that maybe I could work on adding more package origins to Grayskull, because that would take it a step further towards being a versatile conda recipe generator. Jannis welcomed my idea and built on it.
  Now we needed to decide which new package origin to add to Grayskull. There were several to choose from: PyProject, GitLab, CRAN, etc. I reached out to &lt;a href="https://twitter.com/chenghlee"&gt;Cheng Lee&lt;/a&gt;, who I knew had a lot of experience working on conda packaging, and requested that we set up a meeting to help me decide what to work on. He kindly agreed, and after some discussion we decided that it would be a good idea to add CRAN support to Grayskull, since R is a popular language and there are a number of R conda packages in the ecosystem.
  The problem, though, was that I had no prior experience dealing with R packages. Nor did we have any R experts on the team.
  But coming from conda-forge, I knew the power and resourcefulness of open source communities. I reached out to people in the conda-forge community who had experience working with R packages. &lt;a href="https://twitter.com/bjoerngruening"&gt;Björn Grüning&lt;/a&gt;, who wrote the R helper script for conda-forge (a script that runs over R recipes generated by conda-skeleton and modifies them to better suit conda-forge), was kind enough to talk with me, discuss my ideas for CRAN support in Grayskull and guide me whenever I experienced blockers.
  I also met with &lt;a href="https://twitter.com/ocefpaf"&gt;Filipe Fernandes&lt;/a&gt; and the developer of Grayskull, Marcelo Trevisani, to discuss my plans and ideas with them. They gave me their valuable insights and advice. Marcelo was also generous enough to agree to meet with me regularly during the term of my internship so that I could receive timely feedback on my progress and help whenever I needed it.&lt;/p&gt;
&lt;p&gt;This internship project pushed me to go out of my way to learn and acquire the information I needed to move forward. It forced me to push past my limits, take initiative and develop leadership qualities.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Thinking About and Planning Projects In a Sustainable Manner&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I met with my mentor, Jannis Leidel, twice a week. Through our meetings in the duration of three months, I learned many valuable lessons from him. One lesson that I would especially like to mention is 'thinking about projects sustainably'. Jannis would often ask me what I thought would be the future of Grayskull. Was I interested in continuing working on it after the internship? Did I see other people working on it? Jannis insisted that Grayskull development work should continue beyond the internship and that we have to figure out how. He also insisted that it was unfair to expect the original developers (in this case Marcelo) to continue investing time and effort into Grayskull unpaid. We have to figure out ways to promote Grayskull so that its development continues in an organic and sustainable manner.
  I realized that programs such as Google Summer of Code and Outreachy are very useful for this purpose. They provide visibility to a project and thereby invite new contributors to it. We plan to register Grayskull in such open source programs in the future.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The Advantages of Daily Standups&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My manager, &lt;a href="https://twitter.com/DanGoodTweet"&gt;Dan Meador&lt;/a&gt;, encourages us to write daily standups. A standup is where team members who work asynchronously (because of time zone differences) share with each other what they're currently working on, what they're planning to work on and if they're experiencing any blockers. I realized that writing standups helped me clarify my thoughts and solidified my goals for the day. This is something I really struggled with -- breaking bigger tasks into smaller ones and sorting what needs to be done first. However, through standups and other productivity hacks that my manager shared with me, I was able to learn this skill. I feel that breaking down tasks and deciding every morning (or the previous evening) what exactly you're going to work on today can really enhance your productivity.
  I still get lazy sometimes, and forget to plan ahead and then my days are not as productive as I'd like them to be, and then I feel super guilty for not being my best productive self. But one's gotta keep trying until good practices become habits.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Connecting With People&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What I most enjoy doing in life is connecting with people; recognizing our shared human-ness despite our many apparent differences. And this internship provided me with plenty of opportunities to reach out to people, talk to them, discuss ideas, and develop friendships. I am grateful for the many new bonds I made with people within and outside Anaconda. &lt;br&gt; Jannis says that we must always remember that behind these computer screens there are &lt;em&gt;real&lt;/em&gt; human beings, with feelings and egos and insecurities. And as long as we treat each other with empathy, we can successfully create sustainable communities where people feel they belong and are cared for. I feel very strongly about this and I believe it is important to always keep in mind the &lt;em&gt;human element&lt;/em&gt; during all our professional interactions. At the end of the day we're all fragile, vulnerable beings trying to achieve great goals together with all the strength and grace we can muster.&lt;/p&gt;
&lt;p&gt;The last months have been fulfilling and exciting. I have learnt a lot and grown as a software engineer and as a person. I am truly grateful for this opportunity, for the mentorship I received, and for all the new friendships that came my way and made my life more meaningful.
Thank you, Anaconda. Thank you, The Spirit of Open Source.&lt;/p&gt;</content><category term="Engineering"/><category term="packaging"/><category term="conda"/><category term="recipes"/><category term="open-source"/><category term="internship"/><category term="mentorship"/></entry><entry><title>Grayskull - The Community-Developed conda Recipe Generator</title><link href="https://engineering.anaconda.com/2022/07/grayskull.html" rel="alternate"/><published>2022-07-19T00:00:00-05:00</published><updated>2022-07-19T00:00:00-05:00</updated><author><name>Mahe Iram Khan</name></author><id>tag:engineering.anaconda.com,2022-07-19:/2022/07/grayskull.html</id><summary type="html">&lt;p&gt;Grayskull is a community developed conda recipe generator that does away with a number of problems in conda-skeleton. Improving conda-skeleton is difficult due to its tight coupling with conda-build. Embracing the community developed Grayskull is easier and more sustainable.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;em&gt;This is part one of a two blog series. In this blog I talk about software packaging and conda. Head to &lt;a href="https://engineering.anaconda.com/2022/07/internship-experience.html"&gt;part two&lt;/a&gt; to read about my work and experience during my internship at Anaconda.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Once again, attempts were made to clarify that conda is not Anaconda's nickname, this time through &lt;a href="https://twitter.com/anacondainc/status/1498691287851143173"&gt;tweets and memes&lt;/a&gt;. Today, let us understand, once and for all (as if such an ideal were possible!), what conda is.
Spoiler: It is an OS-agnostic package manager.&lt;/p&gt;
&lt;p&gt;My name is Mahe. I am a senior year Computer Engineering student from Delhi, now a Software Engineer at Anaconda.
I was very recently an intern here. During my internship I worked on a community-developed project called '&lt;a href="https://github.com/conda-incubator/grayskull"&gt;Grayskull&lt;/a&gt;'. In this blog post, I will talk about conda, conda-skeleton, and Grayskull. I will also discuss the disadvantages of tightly coupled projects/tools, and the advantages of embracing community innovations in open source ecosystems.&lt;/p&gt;
&lt;h2&gt;Concepts and Terminology&lt;/h2&gt;
&lt;p&gt;Before moving forward, let us quickly learn a few terms widely used in the Conda packaging ecosystem.
A &lt;strong&gt;Software Package&lt;/strong&gt; is simply a working piece of code that does something. Software packages are installable so that people can benefit from the code written by others.
&lt;strong&gt;Channels&lt;/strong&gt; are online locations where these packages live and can be downloaded from. Channels are warehouses of packages.
&lt;strong&gt;Conda&lt;/strong&gt; is an OS-agnostic package and environment manager for Python packages and data-science-adjacent libraries. It allows you to manage the environments and dependencies of your packages and generate the context your project needs to run successfully on a variety of machines.
&lt;strong&gt;Conda-build&lt;/strong&gt; is a set of commands and tools that lets you build your own Conda packages.
To create a package with conda-build, you need to provide a &lt;strong&gt;Recipe&lt;/strong&gt;, minimally a meta.yaml file that contains the packaging metadata and build instructions for that specific package. You can learn more about conda recipes &lt;a href="https://docs.conda.io/projects/conda-build/en/latest/resources/define-metadata.html#meta-yaml"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Writing Recipes&lt;/h2&gt;
&lt;p&gt;For someone who is new to the packaging world, writing package recipes can seem quite intimidating. Even people who are not new to it would agree that writing package recipes is often boring and tiresome, not to mention highly error-prone.
Example recipes and templates help, sure, but one would rather have their package recipe generated automatically, and perfectly concisely.
There is conda-skeleton, an automatic conda recipe generator provided with conda-build.
conda-skeleton is a helpful tool indeed, but it falls short of being the perfect recipe generator for several reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It is slow in generating recipes.&lt;/li&gt;
&lt;li&gt;It cannot be deployed on systems without conda.&lt;/li&gt;
&lt;li&gt;It has a huge number of dependencies.&lt;/li&gt;
&lt;li&gt;The recipes it generates are not always concise.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These shortcomings in conda-skeleton led to the development of Grayskull in the conda-forge community by &lt;a href="https://twitter.com/mdtrevisani"&gt;Marcelo Trevisani&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Grayskull  -  The Community-Developed conda Recipe Generator&lt;/h2&gt;
&lt;p&gt;&lt;a href="https://github.com/conda-incubator/grayskull#readme"&gt;Grayskull&lt;/a&gt; is an automatic conda recipe generator that generates concise conda recipes for Python packages available on PyPI and GitHub.
It significantly improves upon conda-skeleton in terms of speed, conciseness of the recipes, packaging environment specificity, and memory usage.
Grayskull has proved to be an extremely useful tool for the packaging ecosystem by generating very accurate recipes very quickly.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A flow chart showing the words 'package name' connected to a gray colored skull via a rightward arrow, the skull is connected via a rightward arrow to a curled up paper with recipe content on it" src="https://engineering.anaconda.com/images/grayskull/grayskull_recipe.png" title="Grayskull generates conda recipes"&gt;&lt;/p&gt;
&lt;h2&gt;Grayskull  - An Improvement Upon conda-skeleton&lt;/h2&gt;
&lt;p&gt;Grayskull generates recipes that take into consideration the platform, Python version available, selectors, compilers (Fortran, C and C++), package constraints, license type, etc.
It uses metadata available from multiple sources to create the best recipe possible.&lt;/p&gt;
&lt;p&gt;The table below compares and contrasts the performance and mechanisms of Grayskull and conda-skeleton:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Grayskull&lt;/th&gt;
&lt;th&gt;conda-skeleton&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Detects when the recipe supports noarch:python&lt;/td&gt;
&lt;td&gt;Does not detect noarch:python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Always tries to detect compilers&lt;/td&gt;
&lt;td&gt;Does not detect compilers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standalone application, can be deployed on systems without conda&lt;/td&gt;
&lt;td&gt;Relies on conda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Light weight due to reduced dependencies&lt;/td&gt;
&lt;td&gt;Huge number of dependencies due to reliance on conda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pip installable&lt;/td&gt;
&lt;td&gt;Not pip installable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creates a small, temporary virtual env to simulate the installation of the package using the source tarball&lt;/td&gt;
&lt;td&gt;Creates a separate conda env and runs the solver, hence takes up a lot of time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generates concise recipes&lt;/td&gt;
&lt;td&gt;Sometimes mixes up dependencies and generates unnecessarily bloated recipes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Improving conda-skeleton is Tough&lt;/h2&gt;
&lt;p&gt;conda-skeleton, the recipe generator, is very tightly coupled with conda-build, the package builder. Because of this, it is very risky to try to change any functionality in conda-skeleton, as it could break something in conda-build itself. Also worth noting is that the conda-skeleton code is not very modular and does not contain many comments. This can make onboarding to conda-skeleton a difficult and time-consuming task.&lt;/p&gt;
&lt;p&gt;Grayskull, on the other hand, is a standalone tool. The code is modular, which makes it easier to add new functionality or update existing functionality. Grayskull also has ample comments describing each function in the code, which makes it easier for new people to onboard and understand the codebase.&lt;/p&gt;
&lt;h2&gt;Embracing Community Innovation&lt;/h2&gt;
&lt;p&gt;Anaconda, the company behind conda and conda-skeleton, graciously acknowledged the advantages of Grayskull over conda-skeleton, has been supporting Grayskull, and is making efforts to adopt it as the de facto conda recipe generator.
This also falls in line with the &lt;a href="https://twitter.com/condaproject/status/1498678697603256334"&gt;conda project&lt;/a&gt; efforts.&lt;/p&gt;
&lt;p&gt;During my internship at Anaconda I worked on Grayskull, adding more package origins to it, and taking it a step further to being a versatile conda recipe generator. In the &lt;a href="https://engineering.anaconda.com/2022/07/internship-experience.html"&gt;follow up blog&lt;/a&gt; I talk about my work during the internship and my experience at Anaconda.&lt;/p&gt;</content><category term="Engineering"/><category term="packaging"/><category term="conda"/><category term="recipes"/><category term="open-source"/></entry><entry><title>FOSS Fridays at Anaconda</title><link href="https://engineering.anaconda.com/2022/05/foss-friday-may-2022.html" rel="alternate"/><published>2022-05-06T00:00:00-05:00</published><updated>2022-05-06T00:00:00-05:00</updated><author><name>Princiya Sequeira</name></author><id>tag:engineering.anaconda.com,2022-05-06:/2022/05/foss-friday-may-2022.html</id><summary type="html">&lt;p&gt;FOSS Friday updates for May 2022&lt;/p&gt;</summary><content type="html">&lt;p&gt;Anaconda is an important part of the Open Source ecosystem. A significant percentage of our team contributes full-time to open-source projects, but it is a passion for most of the organization.&lt;/p&gt;
&lt;h2&gt;What are FOSS Fridays&lt;/h2&gt;
&lt;p&gt;The idea is to take a break from normal teamwork to contribute to any open-source project that we would like to support. We want to give an opportunity for everyone in technology to contribute to an open-source project we care about, Python or not, Anaconda-related or not.&lt;/p&gt;
&lt;p&gt;FOSS Fridays happen on the first Friday of each month. Jan 7, 2022 was our first FOSS Friday🤩.&lt;/p&gt;
&lt;h2&gt;FOSS Friday Updates - May 2022&lt;/h2&gt;
&lt;h3&gt;PyScript PyScript PyScript&lt;/h3&gt;
&lt;p&gt;&lt;a href="./2022/04/welcome-pyscript"&gt;PyScript&lt;/a&gt; had just launched and everyone at Anaconda were so excited. &lt;a href="https://anaconda.cloud/pyscript-pycon2022-peter-wang-keynote"&gt;Here&lt;/a&gt; is Peter Wang's keynote at Pycon 2022 on "PyScript - Programming for Everyone".&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/LtDan33"&gt;Dan Meador&lt;/a&gt; added Github issue forms for PyScript.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/jezdez"&gt;Jannis Leidel&lt;/a&gt; worked on the documentation infrastructure for PyScript.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/kathatherine"&gt;Katherine Kinnaman&lt;/a&gt; too added changes to the documentation infrastructure for PyScript.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="./author/kevin-goldsmith.html"&gt;Kevin Goldsmith&lt;/a&gt; rewrote the &lt;a href="https://github.com/pyscript/pyscript/blob/main/CONTRIBUTING.md"&gt;CONTRIBUTING page&lt;/a&gt; for PyScript to make it easier for folks to understand how to contribute to the project.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href="https://github.com/mattkram"&gt;Matt Kramer&lt;/a&gt; added &lt;code&gt;pre-commit&lt;/code&gt; hooks for PyScript. He also added new issues up for grabs to the &lt;a href="https://github.com/pyscript/pyscript-cli/issues"&gt;pyscript-cli&lt;/a&gt; project.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Improve performance for conda-forge&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/barabo"&gt;Carl Anderson&lt;/a&gt; and &lt;a href="https://github.com/dholth"&gt;Daniel Holth&lt;/a&gt; improved conda channel cloning performance for &lt;a href="https://conda-forge.org/"&gt;conda-forge&lt;/a&gt; by fixing a performance bug in conda-index. They also worked on a proof-of-concept rewrite of &lt;a href="https://github.com/conda-incubator/conda-index"&gt;conda-index&lt;/a&gt; which will reduce package propagation latency for cloned channels by using sqlite databases instead of relying on a filesystem cache of a channel.&lt;/p&gt;
&lt;h4&gt;conda-forge package sync improvements&lt;/h4&gt;
&lt;p&gt;&lt;img alt="image" src="https://user-images.githubusercontent.com/4342684/171940095-78f8a3cd-9eca-4315-84f6-2aa675c09826.png"&gt;
This graph shows the duration of &lt;code&gt;conda-forge&lt;/code&gt; channel cloning before and after the performance bug was fixed. The Y axis is measured in seconds and each point on the X axis is a point in time when the clone job ran.&lt;/p&gt;
&lt;h3&gt;BDD test automation framework&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/vijeshkumarr"&gt;Vijesh Kumar&lt;/a&gt; and &lt;a href="https://github.com/nishita-beeraka"&gt;Nishita Beeraka&lt;/a&gt; worked on a BDD test automation framework design to work closely with the manual QA automation team.&lt;/p&gt;
&lt;h3&gt;Parenthood and Leadership&lt;/h3&gt;
&lt;p&gt;&lt;a href="https://github.com/princiya"&gt;Princiya Sequeira&lt;/a&gt; wrote a blog post on parenthood and leadership.&lt;/p&gt;</content><category term="FOSS"/><category term="opensource"/><category term="foss-fridays"/></entry><entry><title>Welcome to the world PyScript</title><link href="https://engineering.anaconda.com/2022/04/welcome-pyscript.html" rel="alternate"/><published>2022-04-30T00:00:00-05:00</published><updated>2022-04-30T00:00:00-05:00</updated><author><name>Fabio Pliger</name></author><id>tag:engineering.anaconda.com,2022-04-30:/2022/04/welcome-pyscript.html</id><summary type="html">&lt;p&gt;PyScript&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;py-env&gt;
- pandas
  &lt;/py-env&gt;&lt;/p&gt;
&lt;p&gt;One of the main reasons I joined Anaconda seven and a half years ago was the company’s commitment to the data science and Python communities by creating tools that enable people to do more with less.&lt;/p&gt;
&lt;p&gt;Today I'm happy to announce a new project that we’ve been working on here at Anaconda, one that we hope will take another serious step towards making programming and data science available and accessible to everyone.&lt;/p&gt;
&lt;h1&gt;What is PyScript&lt;/h1&gt;
&lt;p&gt;PyScript is a framework that allows users to run Python and create rich applications in the browser by simply using special HTML tags provided by the framework itself. Core features include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python in the browser&lt;/strong&gt;: Enable drop-in content, external file hosting (made possible by the Pyodide project, thank you!), and application hosting without the reliance on server-side configuration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python ecosystem&lt;/strong&gt;: Run many popular packages of Python and the scientific stack (such as numpy, pandas, scikit-learn, and more)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Python with JavaScript&lt;/strong&gt;: Bi-directional communication between Python and Javascript objects and namespaces&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Environment management&lt;/strong&gt;: Allow users to define what packages and files to include for the page code to run&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visual application development&lt;/strong&gt;: Use readily available curated UI components, such as buttons, containers, text boxes, and more&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexible framework&lt;/strong&gt;: A flexible framework that can be leveraged to create and share new pluggable and extensible components directly in Python&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All that to say… PyScript is just HTML, only a bit (okay, maybe a lot) more powerful, thanks to the rich and accessible ecosystem of Python libraries.&lt;/p&gt;
&lt;h1&gt;Wait... what? Why?&lt;/h1&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;: As an industry, we have focussed on making the impossible possible, rather than focussing on making the possible accessible to all.&lt;/p&gt;
&lt;p&gt;At some point in the 80s, personal computers became cheaper, which led to them becoming more popular. Most of the hardware (C64/ZX80/Apple II) gave the user direct access to BASIC: a programming interface ready to use and a language simple to learn. Later, as systems became more complex (and complicated), frameworks like Visual Basic and HyperCard made it easy to create and package/distribute visual applications. Even the web, when it started, was accessible! All you needed was a text editor and a way to upload your files somewhere, before we created CGI and heavier server-side logic/rendering, etc...&lt;/p&gt;
&lt;p&gt;It's somewhat unfortunate that in the last two or three decades, while we created simpler programming languages and made things faster, more scalable, and bigger, we also came to require an increasing amount of surrounding technology and infrastructure complexity to make things work. Today, in addition to the problem of packaging and distributing applications for different architectures and platforms, we added the complexity of the server/client separation, which requires an additional networking layer and so on... This leads to having to learn about servers, cloud vendors, web stacks, how to test code in a simulated production environment, how to deploy applications... All of a sudden, instead of the one problem users were initially trying to solve, they now have many problems!&lt;/p&gt;
&lt;p&gt;Similarly, modern HTML/CSS and JS are very powerful and can be used to create beautiful UIs, but they come with a significant learning curve before users become proficient. This is also true for native GUI applications. In fact, Python, the #1 most popular programming language in the world, doesn't have a straightforward story for building native GUI applications. Nor for making websites [entirely with Python, server + client]! Nor for packaging and distributing applications!&lt;/p&gt;
&lt;p&gt;We believe users should be spending their time thinking about and writing their applications and solving real problems. Let's make programming more fun and simpler, while keeping the right technology advancements we made over the past 20/30 years. The more we do, the more users will come.&lt;/p&gt;
&lt;h1&gt;So, how does it work?&lt;/h1&gt;
&lt;p&gt;Warning, we are about to get a little technical.... :)&lt;/p&gt;
&lt;p&gt;The core concept of PyScript, as a framework, is to provide a set of [opinionated] components and tools that allow users to quickly create and share their applications. We also don't want to reinvent the wheel and aim to reuse the great work that many others are already doing.&lt;/p&gt;
&lt;p&gt;With that in mind, let's start from the foundation...&lt;/p&gt;
&lt;h2&gt;The platform&lt;/h2&gt;
&lt;p&gt;One of the hottest problems people work on today is: how do we create an abstraction that allows users to ship their applications to multiple HW/SW platforms without having to rewrite and rebuild their code? Most of today's solutions tend to fall into one of two buckets: virtual machines or containers. Both are great but have limitations, depending on the type of application and how strong your need for abstracting a whole machine is.&lt;/p&gt;
&lt;p&gt;Instead of creating a whole new technology stack, we want to start from the best option the ecosystem provides today. So, which virtualization abstraction is the most popular and ubiquitous today? With a little bit of flexibility, we can claim that the browser is an excellent virtual machine, one that checks a lot of the boxes we are looking for. Browsers are everywhere (from laptops to tablets and phones), secure (browsers have been working on security and isolation from the underlying file system for decades), powerful (from hardware acceleration to the maturity of WebAssembly), and stable.&lt;/p&gt;
&lt;h2&gt;The Stack&lt;/h2&gt;
&lt;p&gt;Keeping in mind one of the premises above, we want to provide a reliable and fun experience to PyScript users (whether they are authoring or consuming an application), ultimately making the web a friendly and hackable place for users. For this reason, we need something beyond the current state of web development. Something that can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;give users a first-class programming language that is less weird, more expressive, and easier to learn than Javascript.&lt;/li&gt;
&lt;li&gt;centralize: strip away most of the complexity of the client/server modern web by removing that distinction as much as possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Luckily for us, the ecosystem has been building the foundations of a very solid stack that we can build on top of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;WebAssembly/WASM: a portable binary-code format and text format for executable programs &amp;amp; software interfaces, enabling high-performance applications on web pages and in other environments&lt;/li&gt;
&lt;li&gt;Emscripten (https://emscripten.org/): an open source compiler toolchain to WebAssembly, practically allowing any portable C/C++ codebase to be compiled into WebAssembly&lt;/li&gt;
&lt;li&gt;Pyodide (https://pyodide.org/) and python-wasm (https://github.com/ethanhs/python-wasm): Python implementations compiled to WebAssembly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As Python found its success standing on the shoulders of giants and building out of the excellent work of many people, we can do that too!&lt;/p&gt;
&lt;h2&gt;The Interface&lt;/h2&gt;
&lt;p&gt;One of our highest goals is to make programming and the web a friendly and hackable place where anyone can create interesting things and still have fun.&lt;/p&gt;
&lt;p&gt;As hinted above, the presentation layer of the modern web is really powerful and actually not bad, &lt;strong&gt;if&lt;/strong&gt; you know what to do. That means that either you've been doing this for some time or you'll have to spend a considerable amount of time learning. Even then, the ecosystem moves so fast that it is often hard even for experts to keep up.&lt;/p&gt;
&lt;p&gt;Instead, we want a system that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;offers a clean and simple API&lt;/li&gt;
&lt;li&gt;supports standard HTML&lt;/li&gt;
&lt;li&gt;extends the HTML elements with custom components that are opinionated and predictable (they do fewer things, but do them as you'd expect)&lt;/li&gt;
&lt;li&gt;is extensible and offers an easy way for users to define their own new components&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To do this, PyScript defines a series of new HTML tags (web components). For instance, to write a simple program, one can just use the &lt;code&gt;&amp;lt;py-script&amp;gt;&lt;/code&gt; tag and write Python code inside the tag itself&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;py-script&amp;gt;&lt;/span&gt;
&amp;quot;Hello&lt;span class="w"&gt; &lt;/span&gt;World&amp;quot;
&lt;span class="nt"&gt;&amp;lt;/py-script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;or, alternatively, pass the source file directly&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;py-script&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;/my_own_file.py&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/py-script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;PyScript will read that code, run it on a Python interpreter, and handle the output accordingly.&lt;/p&gt;
&lt;p&gt;If I need to load (install) additional modules and packages needed by my application, I can just use the &lt;code&gt;&amp;lt;py-env&amp;gt;&lt;/code&gt; tag to specify my environment requirements&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;py-env&amp;gt;&lt;/span&gt;
-&lt;span class="w"&gt; &lt;/span&gt;bokeh
-&lt;span class="w"&gt; &lt;/span&gt;numpy
-&lt;span class="w"&gt; &lt;/span&gt;paths:
&lt;span class="w"&gt;  &lt;/span&gt;-&lt;span class="w"&gt; &lt;/span&gt;/utils.py
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/py-env&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;To add a REPL-like component to create an interactive experience, one can just use the &lt;code&gt;&amp;lt;py-repl&amp;gt;&lt;/code&gt; tag&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;py-repl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;my-repl&amp;quot;&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;auto-generate=&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;true&amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/py-repl&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;and it'll create a widget like the one below, that can be used to access everything loaded and executed by the other tags we mentioned before, such as &lt;code&gt;&amp;lt;py-script&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;py-env&amp;gt;&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;py-repl id="my-repl" auto-generate="true"&gt; &lt;/py-repl&gt;&lt;/p&gt;
&lt;p&gt;Since we already loaded pandas and numpy for you, try copying, pasting and running (by hitting the green arrow) the code below:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pandas&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Voila!&lt;/p&gt;
&lt;p&gt;The point is that by registering new web components that are simple and very expressive, users don't need to waste their time learning CSS and other web-specific development technologies.&lt;/p&gt;
&lt;h1&gt;Where is PyScript today?&lt;/h1&gt;
&lt;p&gt;Today, April 30th, 2022, PyScript is just at its beginning and is very limited compared to the vision we have for the project. It's a demonstration that we can build the vision and the technology is mature enough for us to create a new way of programming, building, sharing, and deploying applications. Be advised that it's very unstable and limited, but it works and can be used to hack with and build experimental applications.&lt;/p&gt;
&lt;p&gt;We hope to make progress fast and that in a few weeks/months, this post will be outdated :).&lt;/p&gt;
&lt;p&gt;For more information about the available features and how to get started, visit the project documentation.&lt;/p&gt;
&lt;h1&gt;Where is PyScript going&lt;/h1&gt;
&lt;p&gt;One of the ways I like to think of PyScript is "the Minecraft for software development". A framework that provides basic blocks for users to create their own worlds [applications] or new blocks [PyScript components and widgets] that others can use. In that sense we want to build a framework that is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;extremely simple and expressive&lt;/li&gt;
&lt;li&gt;feels familiar to users&lt;/li&gt;
&lt;li&gt;extensible:
&lt;ul&gt;
&lt;li&gt;so users can create new widgets and share them with others&lt;/li&gt;
&lt;li&gt;so we can support multiple runtimes...&lt;/li&gt;
&lt;li&gt;... and multiple languages ...&lt;/li&gt;
&lt;li&gt;... that can interop with each other ...&lt;/li&gt;
&lt;li&gt;... and yet be controlled to also create secure namespaces&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;runs on both the browser and server/native side&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In addition to all that, it's worth mentioning that with this project, we are exploring new horizons, and a lot of the old paradigms that are at the roots of "standard" server-side programming are not that untouchable anymore. For instance, I/O, network, and storage on the browser/client side are not the same as in traditional native systems. We'll save that topic for another post, but the point here is that we have the opportunity to innovate and explore, and that's what we want to do.&lt;/p&gt;
&lt;p&gt;It's also worth mentioning that a lot of the core technology used to build PyScript is itself recent and very vibrant. As these technologies mature and expose new functionalities, we want to extend PyScript and take all the advantages we can get.&lt;/p&gt;
&lt;h1&gt;Thanks&lt;/h1&gt;
&lt;p&gt;PyScript wouldn’t be here without the help of some incredible people. We’d really like to thank:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Peter Wang, Kevin Goldsmith, Philipp Rudiger, Antonio Cuni, Russell Keith-Magee, Mateusz Paprocki, Princiya Sequeira, Jannis Leidel, David Mason, Anna NG, Maria Genovese, Katherine Kinnaman, Kent Pribbernow, Albert DeFusco, Michael Verhulst and Chris Leonard for the contributions to the project and helping spin it up&lt;/li&gt;
&lt;li&gt;Special thanks to the Pyodide maintainers (Roman Yurchak, Hood Chatham and all the contributors)&lt;/li&gt;
&lt;/ul&gt;</content><category term="Engineering"/><category term="pyscript"/><category term="javascript"/><category term="html"/><category term="framework"/><category term="future"/><category term="python"/></entry><entry><title>SBOMs at Anaconda</title><link href="https://engineering.anaconda.com/2022/04/sboms-at-anaconda.html" rel="alternate"/><published>2022-04-05T00:00:00-05:00</published><updated>2022-04-05T00:00:00-05:00</updated><author><name>Paul Yim</name></author><id>tag:engineering.anaconda.com,2022-04-05:/2022/04/sboms-at-anaconda.html</id><summary type="html">&lt;p&gt;SBOMs at Anaconda&lt;/p&gt;</summary><content type="html">&lt;p&gt;Last fall, Anaconda &lt;a href="https://www.anaconda.com/press/anaconda-announces-collaboration-with-microsoft"&gt;launched&lt;/a&gt; a collaboration with Microsoft to create Software Bills of Material (SBOMs) for all packages in the “defaults” channels of our repository. We are excited to announce that we have achieved this goal and to share why and how we did it.&lt;/p&gt;
&lt;h2&gt;What are SBOMs and what value do they provide?&lt;/h2&gt;
&lt;p&gt;Following the discovery of the &lt;a href="https://www.crowdstrike.com/blog/sunspot-malware-technical-analysis/"&gt;SolarWinds supply chain hack&lt;/a&gt; in 2021, the White House issued the &lt;a href="https://www.whitehouse.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/"&gt;&lt;em&gt;Executive Order on Improving the Nation’s Cybersecurity&lt;/em&gt;&lt;/a&gt;, which detailed new requirements to help strengthen the security of the Federal Government’s software supply chain. SBOMs are a key part of this effort, functioning as “&lt;a href="https://ntia.gov/SBOM"&gt;a list of ingredients&lt;/a&gt;” that enable users to verify the components, licensing, and provenance of software installed on their systems.&lt;/p&gt;
&lt;p&gt;Anaconda’s SBOMs are built in accordance with &lt;a href="https://spdx.dev/"&gt;Software Package Data Exchange&lt;/a&gt; (SPDX) specifications, version 2.2.1, which specifies the checksum hash values of software down to the individual file level. When a new software vulnerability is discovered and made public (e.g. via the &lt;a href="https://nvd.nist.gov/"&gt;NVD database&lt;/a&gt;), we can check whether our packages contain any vulnerable components, identify their hash values in our SBOMs, and use those values to verify whether these vulnerable components are installed on our users’ systems. The licensing and provenance information can also help our users (particularly enterprise customers) determine whether certain packages meet their governance standards. Finally, we cryptographically sign each SBOM document so that recipients can verify it, ruling out tampering.&lt;/p&gt;
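&lt;p&gt;As an illustration of the per-file checksumming involved (a sketch, not &lt;code&gt;sbomtool&lt;/code&gt; itself; the package filename is a hypothetical placeholder):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;code&gt;import hashlib
import tarfile

def file_digests(pkg_path):
    """Yield (path, sha256) for every file inside a .tar.bz2 conda
    package: the per-file checksums an SPDX document records."""
    with tarfile.open(pkg_path, "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                yield member.name, hashlib.sha256(data).hexdigest()

for path, digest in file_digests("numpy-1.21.2-py39_0.tar.bz2"):
    print(digest, path)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;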
&lt;h2&gt;Building &lt;code&gt;sbomtool&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Anaconda has about 300,000 package artifacts (&lt;code&gt;.tar.bz2&lt;/code&gt; and &lt;code&gt;.conda&lt;/code&gt; files) in its “defaults” channels, and we needed to build a tool that could create SBOMs for all of them. This is &lt;code&gt;sbomtool&lt;/code&gt;. On a basic level, &lt;code&gt;sbomtool&lt;/code&gt; is a CLI application written in Python that ingests conda packages and outputs SBOM documents that follow the SPDX specification. Early on, we discovered &lt;a href="https://github.com/spdx/tools-python"&gt;a great Python package&lt;/a&gt; (built and maintained by the SPDX organization) that provided us with an easy-to-use API to build and validate SBOMs.&lt;/p&gt;
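&lt;p&gt;As a rough illustration of the ingestion step, the sketch below walks a &lt;code&gt;.tar.bz2&lt;/code&gt; conda package and computes a SHA256 digest for each regular file, which is the kind of per-file data an SPDX document records. The function name is hypothetical, and the real &lt;code&gt;sbomtool&lt;/code&gt; does much more, including metadata extraction and SPDX document assembly via the API mentioned above:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib
import tarfile

def package_file_checksums(package_path):
    """Map each regular file in a .tar.bz2 conda package to its SHA256."""
    checksums = {}
    with tarfile.open(package_path, mode="r:bz2") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # directories and symlinks need separate handling
            fileobj = tar.extractfile(member)
            if fileobj is not None:
                checksums[member.name] = hashlib.sha256(fileobj.read()).hexdigest()
    return checksums
&lt;/code&gt;&lt;/pre&gt;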
&lt;p&gt;However, before we could start churning out SBOMs, there were a number of challenges we had to address. For example, we knew that license information hadn’t always been consistently available in our packages. SPDX &lt;a href="https://spdx.org/licenses/"&gt;maintains&lt;/a&gt; a standard for formatting common open source software license types, and we incorporated a mechanism to apply this standard in our SBOMs and backfill corrections for packages that have incorrect or missing license information. We researched and made such corrections for over 1,800 packages so that their SBOMs comply with the SPDX license type standard.&lt;/p&gt;
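&lt;p&gt;A simplified sketch of that kind of normalization appears below. The mapping entries are illustrative examples only; the actual corrections were researched package by package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative-only mapping from raw license strings found in package
# metadata to SPDX license identifiers (https://spdx.org/licenses/).
LICENSE_CORRECTIONS = {
    "BSD": "BSD-3-Clause",
    "Apache 2.0": "Apache-2.0",
    "GPL2": "GPL-2.0-only",
}

def normalize_license(raw):
    """Return an SPDX license identifier for a raw license string.

    Missing licenses map to NOASSERTION, SPDX's marker for "no claim
    made". Unrecognized strings pass through here; real tooling would
    validate them against the full SPDX license list.
    """
    if not raw:
        return "NOASSERTION"
    return LICENSE_CORRECTIONS.get(raw.strip(), raw.strip())
&lt;/code&gt;&lt;/pre&gt;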
&lt;p&gt;There were other challenges we did not anticipate that led us down some interesting and productive paths:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The architecture of a conda package can normally be found under the &lt;code&gt;subdir&lt;/code&gt; key in the &lt;code&gt;index.json&lt;/code&gt; metadata file. But after reviewing a particularly big batch of SBOM failures related to this key, we discovered that in much older conda packages the architecture was listed under a &lt;code&gt;platform&lt;/code&gt; key instead. A bit of code archaeology revealed that the &lt;code&gt;subdir&lt;/code&gt; key was &lt;a href="https://github.com/conda/conda-build/pull/317/commits/1dfd191854ee26ebebf28cca06ee51a00329165a"&gt;added in 2015&lt;/a&gt; and &lt;a href="https://github.com/Anaconda-Platform/anaconda-client/pull/115"&gt;was implemented to enable use of the first “noarch” builds of conda packages&lt;/a&gt;. (A sketch of the fallback logic appears after this list.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When we started using &lt;a href="https://github.com/spdx/tools-python"&gt;&lt;code&gt;tools-python&lt;/code&gt;&lt;/a&gt; to build &lt;code&gt;sbomtool&lt;/code&gt;, the only checksum algorithm it supported was SHA1, &lt;a href="https://www.trendmicro.com/vinfo/us/security/news/vulnerabilities-and-exploits/sha-1-collision-signals-the-end-of-the-algorithm-s-viability"&gt;which is no longer accepted as secure&lt;/a&gt;. We wrote a patch and contributed &lt;a href="https://github.com/spdx/tools-python/pull/200#issuecomment-981758826"&gt;a PR&lt;/a&gt; to the upstream project to enable use of more secure checksum algorithms like SHA256 (which we use in our SBOMs).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Some conda packages contain symlinks to system files on host environments or other files within the package itself. SPDX did not have a prescribed method for specifying symlinks when we began work on &lt;code&gt;sbomtool&lt;/code&gt;, so &lt;a href="https://github.com/spdx/spdx-spec/issues/610"&gt;we raised an issue&lt;/a&gt; and started a discussion that may soon lead to a policy update.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
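&lt;p&gt;As promised above, here is a hedged sketch of reading a package's architecture with a fallback from &lt;code&gt;subdir&lt;/code&gt; to the older &lt;code&gt;platform&lt;/code&gt; key. The function name is hypothetical, and a real tool may also need to consult the &lt;code&gt;arch&lt;/code&gt; key, since old &lt;code&gt;platform&lt;/code&gt; values are coarser than modern &lt;code&gt;subdir&lt;/code&gt; values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
import tarfile

def package_architecture(package_path):
    """Read a conda package's architecture from info/index.json.

    Prefers the modern "subdir" key and falls back to the pre-2015
    "platform" key found in much older packages.
    """
    with tarfile.open(package_path, mode="r:bz2") as tar:
        try:
            fileobj = tar.extractfile("info/index.json")
        except KeyError:
            return None  # package is missing its index metadata
        index = json.load(fileobj)
    return index.get("subdir") or index.get("platform")
&lt;/code&gt;&lt;/pre&gt;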
&lt;h2&gt;Deployment&lt;/h2&gt;
&lt;p&gt;On March 24, 2022, after months of building &lt;code&gt;sbomtool&lt;/code&gt;, fixing bugs, and resolving metadata issues, we reached 100% SBOM coverage, and we continue to build SBOMs for the new packages that are uploaded daily. We are now focusing on automating and integrating &lt;code&gt;sbomtool&lt;/code&gt; into our package build pipeline, and we look forward to making SBOMs available as a new tiered service offering for our customers.&lt;/p&gt;</content><category term="Engineering"/><category term="sbomtool"/><category term="sbom"/><category term="spdx"/><category term="python"/><category term="metadata"/><category term="license"/><category term="licensing"/></entry><entry><title>Welcome to the Anaconda Engineering Blog</title><link href="https://engineering.anaconda.com/2022/03/welcome-to-the-anaconda-engineering-blog.html" rel="alternate"/><published>2022-03-21T23:40:00-05:00</published><updated>2022-03-21T23:40:00-05:00</updated><author><name>Kevin Goldsmith</name></author><id>tag:engineering.anaconda.com,2022-03-21:/2022/03/welcome-to-the-anaconda-engineering-blog.html</id><summary type="html">&lt;p&gt;Welcome to our new blog.&lt;/p&gt;</summary><content type="html">&lt;p&gt;At Anaconda, we talk a lot about the problems of numerical computing. We discuss the issues that Data Scientists, Data Engineers, Analysts, and others face daily. Many of us come from these backgrounds ourselves. We share our knowledge on the main &lt;a href="https://www.anaconda.com/blog"&gt;Anaconda blog&lt;/a&gt; and &lt;a href="https://anaconda.cloud/"&gt;Anaconda Nucleus&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We are also a software engineering company. We contribute to open source projects like &lt;a href="https://conda.io/"&gt;conda&lt;/a&gt;, &lt;a href="https://numba.pydata.org/"&gt;Numba&lt;/a&gt;, &lt;a href="https://dask.org/"&gt;Dask&lt;/a&gt;, &lt;a href="https://bokeh.org/"&gt;Bokeh&lt;/a&gt;, &lt;a href="https://www.pyston.org/"&gt;Pyston&lt;/a&gt;, and &lt;a href="https://beeware.org/"&gt;Beeware&lt;/a&gt;. We build web services, websites, and on-premises products. We create a Python distribution used by over 25 million developers worldwide.&lt;/p&gt;
&lt;p&gt;We build products and packages using Python, JavaScript, Terraform, C, C++, Go, R, Java, and Objective-C.&lt;/p&gt;
&lt;p&gt;We created this site to share the knowledge we gain from building our products with the broader software development community.&lt;/p&gt;</content><category term="Meta"/></entry></feed>