Introduction
Some time ago I had a go at implementing a "decimals" extension type for pandas. This followed
stumbling upon parquet data of that type, which pandas could not read: pyarrow would error and
fastparquet would convert to floats. The decimal type, with known, fixed precision, is very
important in real-world applications such as finance, where exact equality of fractional values
is required.
The following should succeed, but with standard Python or pandas it does not:
>>> 0.1 + 0.1 + 0.1 == 0.3
False
>>> import pandas as pd
>>> pd.Series([0.1, 0.1, 0.1]).sum() == 0.3
False
Although one could solve this with Python's builtin decimal.Decimal
objects, we want something that supports vectorised compute, which will
be orders of magnitude faster.
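For comparison, the same check is exact with the standard-library Decimal (shown here only to
illustrate the point; these are plain Python objects, not a pandas dtype):
>>> import decimal
>>> decimal.Decimal("0.1") + decimal.Decimal("0.1") + decimal.Decimal("0.1") == decimal.Decimal("0.3")
True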
Implementations
I set out to make an implementation for discussion. This made use of
pandas' "extension types" system, which allows non-standard dtypes and
array types to be registered. A decent proof of concept was pretty quick
to make (see the repo).
However, in conversations around it, it was brought to my attention that the advent of more
generalised arrow type support in pandas would expose their decimal type for use in pandas too.
So, if arrow could handle this, then the need for pandas-decimal would go away.
I accepted this, but later conversation spurred me to look again. So let's
summarise the situation.
Comparisons
Ease of use
From the Python perspective, pyarrow interoperates with decimal.Decimal objects; internally it
holds an optimised binary representation. Arithmetic with integers is automatic and explicit
conversion to/from float is supported, but arithmetic with floats is not.
import decimal
import pandas as pd
import dask.dataframe as dd
import pyarrow as pa
# dtype object
pa_dtype = pa.decimal128(precision=7, scale=3)
# creation from decimal objects
data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
# explicit conversion
df_float = df.astype("float")
df2 = df_float.astype(pd.ArrowDtype(pa_dtype))
assert df.x.equals(df2.x)
# fail to operate on floats
df.x[0] = 1.1 # ArrowTypeError
df + 1.1 # ArrowTypeError
df + decimal.Decimal("1.1") # OK
Conversely, pandas-decimal does not handle decimal.Decimal objects at all; instead it implicitly
converts to and from floats on access. This means that arithmetic works as you would normally
expect, but you lose the precision guarantees of the decimal representation.
import pandas_decimal
# creation with floats
df = pd.DataFrame(
    {"x": [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567]},
    dtype="decimal[3]",
)
# explicit conversion
df_float = df.astype("float")
df2 = df_float.astype("decimal[3]")
assert df.x.equals(df2.x)
# operating on floats
df.x[0] = 1.1 # OK
df + 1.1 # OK
df + decimal.Decimal("1.1") # ValueError
Performance
Comparing the two vectorised array types with Python objects and with each other, we see clearly
that pandas-decimal wins on all operations. I also include a single float calculation for
reference, showing that pandas-decimal is faster even than plain floats.
data = [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567] * 1000000
# pyarrow
df = pd.DataFrame({"x": data).astype(dtype=pd.ArrowDtype(pa_dtype))
%timeit df.x + 1
16.2 ms ± 85.8 µs
%timeit df.x + decimal.Decimal(0.1)
150 ms ± 1.09 ms
%timeit df.x.mean()
17.5 ms ± 54.1 µs
%timeit df.x.std() # Error
# pandas-decimal
df = pd.DataFrame({"x": data}, dtype="decimal[3]")
%timeit df.x + 1
4.62 ms ± 48.5 µs
%timeit df.x + 0.1
4.9 ms ± 530 µs
%timeit df.x.mean()
2.16 ms ± 1.79 µs
%timeit df.x.std()
11 ms ± 119 µs
# pure python
df["x"] = df.x.map(lambda x: decimal.Decimal(str(x)))
%timeit df.x + 1
475 ms ± 4.79 ms
%timeit df.x + decimal.Decimal(0.1)
460 ms ± 7.96 ms
%timeit df.x.mean()
392 ms ± 573 µs
%timeit df.x.std() # Error
# inexact floats
df["x"] = df.x.astype("float")
%timeit df.x.mean()
7.6 ms ± 115 µs
IO
pyarrow's decimal column can be saved to/loaded from parquet without conversion, and quickly.
Conversion to text (e.g., for CSV) is as slow as you would expect, since it doesn't gain from
vectorisation.
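A minimal sketch of that parquet round-trip, assuming pandas 2.x with pyarrow installed (the
dtype_backend="pyarrow" argument on read is what keeps the column as an arrow-backed decimal
rather than converting to objects):
import decimal
import pandas as pd
import pyarrow as pa

pa_dtype = pa.decimal128(precision=7, scale=3)
df = pd.DataFrame(
    {"x": pa.array([decimal.Decimal("8093.012")], type=pa_dtype)},
    dtype=pd.ArrowDtype(pa_dtype),
)
# write and read back; the values never pass through floats
df.to_parquet("decimals.parquet")
df2 = pd.read_parquet("decimals.parquet", dtype_backend="pyarrow")
print(df2.x.dtype)  # decimal128(7, 3)[pyarrow]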
pandas-decimal is not integrated with any IO. In the original design, it was anticipated that
integration with fastparquet would be trivial and would not need any additional dependencies,
but this has not been implemented, since development of pandas-decimal halted.
Summary
pyarrow does not provide the all-encompassing solution to fixed-precision decimals we would have
hoped for - at least not yet. One assumes that the functionality may well already exist within
arrow itself, but is not well exposed to pandas due to lack of interest/motivation. This
surprised me somewhat, since finance is such a big user of pandas, and has traditionally
required exact decimals in its database models, for instance.
For the time being, pandas-decimal, a very small-effort proof-of-concept, shows clear advantages
in speed and flexibility. However, I still don't particularly recommend it to anyone, for three
main reasons:
- it is a separate, non-standard install, whereas pyarrow is likely already available in most environments
- it lacks testing or any support (the status of this in pandas/pyarrow is not known to me)
- the integration with parquet has not been done
I am hoping that, by writing this post, I will motivate arrow developers to improve decimal
support, so that pandas-decimal can be retired even as a proof of concept.