Decimals type for pandas

Introduction

Some time ago I had a go at implementing a "decimals" extension type for pandas, after stumbling upon parquet data of that type which pandas could not read: pyarrow would error, and fastparquet would convert to floats. The decimal type, with known, fixed precision, is very important in real-world applications such as finance, where exact equality of fractional values is required. The following comparisons should succeed, but with standard python or pandas they do not:

>>> import pandas as pd
>>> 0.1 + 0.1 + 0.1 == 0.3
False
>>> pd.Series([0.1, 0.1, 0.1]).sum() == 0.3
False

Although one could solve this with python's builtin decimal.Decimal objects, we want something that supports vectorised compute, which will be orders of magnitude faster.
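
For comparison, the stdlib already gets the arithmetic right, one python object at a time:

>>> import decimal
>>> decimal.Decimal("0.1") + decimal.Decimal("0.1") + decimal.Decimal("0.1") == decimal.Decimal("0.3")
True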

Implementations

I set out to make an implementation for discussion. This made use of pandas' "extension types" system, which allows non-standard dtypes and array types to be registered (sketched below). A decent proof of concept was pretty quick to make (see the repo). However, in conversations around it, it was brought to my attention that the advent of more generalised arrow type support in pandas would expose arrow's decimal type for use in pandas too.
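
To give a flavour of the mechanism, here is a much-simplified, hypothetical sketch of a fixed-scale decimal dtype. This is not pandas-decimal's actual code, and the class and dtype names here are made up; a real implementation needs more methods, missing-value handling, and arithmetic support.

import decimal

import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    register_extension_dtype,
)


@register_extension_dtype
class Decimal3Dtype(ExtensionDtype):
    name = "decimal3"        # the string users pass as dtype=
    type = decimal.Decimal   # scalar type returned on element access

    @classmethod
    def construct_array_type(cls):
        return Decimal3Array


class Decimal3Array(ExtensionArray):
    """Values held as int64, scaled by 10**3."""

    def __init__(self, scaled):
        self._data = np.asarray(scaled, dtype="int64")

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        # round incoming values to three decimal places
        return cls([round(float(x) * 1000) for x in scalars])

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return decimal.Decimal(int(self._data[item])).scaleb(-3)
        return type(self)(self._data[item])

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return Decimal3Dtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return np.zeros(len(self), dtype=bool)  # sketch: no missing values

    def take(self, indices, *, allow_fill=False, fill_value=None):
        return type(self)(self._data.take(indices))  # sketch: ignores fill

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([a._data for a in to_concat]))


# the registered name can now be used anywhere pandas accepts a dtype
s = pd.Series([0.1, 0.1, 0.1], dtype="decimal3")
assert s[0] + s[1] + s[2] == decimal.Decimal("0.3")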

So, if arrow could handle this, the need for pandas-decimal would go away. I accepted this at the time, but later conversation spurred me to look again. Let's summarise the situation.

Comparisons

Ease of use

From the python perspective, pyarrow interoperates with decimal.Decimal objects; internally it holds an optimised binary representation. Arithmetic with integers is automatic, and explicit conversion to/from float is supported, but arithmetic with floats is not.

import decimal
import pandas as pd
import pyarrow as pa

# dtype object
pa_dtype = pa.decimal128(precision=7, scale=3)

# creation from decimal objects
data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))

# explicit conversion
df_float = df.astype("float")
df2 = df_float.astype(pd.ArrowDtype(pa_dtype))
assert df.x.equals(df2.x)

# fail to operate on floats
df.x[0] = 1.1  # ArrowTypeError
df + 1.1  # ArrowTypeError
df + decimal.Decimal("1.1")  # OK

Conversely, pandas-decimal does not handle decimal.Decimal objects at all, but instead implicitly converts to and from floats on access. This means that arithmetic works as you would normally expect, except that you lose the precision guarantees of the decimal representation during such operations.

import pandas_decimal  # registers the "decimal[N]" dtype with pandas

# creation with floats
df = pd.DataFrame(
    {"x": [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567]},
    dtype="decimal[3]",
)

# explicit conversion
df_float = df.astype("float")
df2 = df_float.astype("decimal[3]")
assert df.x.equals(df2.x)

# operating on floats
df.x[0] = 1.1  # OK
df + 1.1  # OK
df + decimal.Decimal("1.1")  # ValueError

Performance

Comparing the two vectorised array types against python objects and against each other, we see clearly that pandas-decimal wins on every operation. I also include a single plain-float calculation for reference, showing that pandas-decimal is even faster than that.

data = [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567] * 1000000

# pyarrow
df = pd.DataFrame({"x": data}).astype(pd.ArrowDtype(pa_dtype))
%timeit df.x + 1
16.2 ms ± 85.8 µs
%timeit df.x + decimal.Decimal("0.1")
150 ms ± 1.09 ms
%timeit df.x.mean()
17.5 ms ± 54.1 µs
%timeit df.x.std()  # Error

# pandas-decimal
df = pd.DataFrame({"x": data}, dtype="decimal[3]")
%timeit df.x + 1
4.62 ms ± 48.5 µs
%timeit df.x + 0.1
4.9 ms ± 530 µs
%timeit df.x.mean()
2.16 ms ± 1.79 µs
%timeit df.x.std()
11 ms ± 119 µs

# pure python
df["x"] = df.x.map(lambda x: decimal.Decimal(str(x)))
%timeit df.x + 1
475 ms ± 4.79 ms
%timeit df.x + decimal.Decimal("0.1")
460 ms ± 7.96 ms
%timeit df.x.mean()
392 ms ± 573 µs
%timeit df.x.std()  # Error

# inexact floats
df["x"] = df.x.astype("float")
%timeit df.x.mean()
7.6 ms ± 115 µs
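
The speed difference is unsurprising: with a fixed scale, decimal arithmetic reduces to int64 arithmetic on the scaled values, which numpy vectorises natively. A sketch of the idea (an assumption about the storage scheme, not pandas-decimal's actual code):

>>> import numpy as np
>>> scaled = np.array([8093012, 8094123, 8095234], dtype="int64")  # values at scale 3, i.e., x * 10**3
>>> scaled + 1000  # adding 1 in value space means adding 10**3 scaled units
array([8094012, 8095123, 8096234])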

IO

pyarrow's decimal column can be saved to/loaded from parquet without conversion, and quickly (see below). Conversion to text (e.g., for CSV) is as slow as you would expect, since it doesn't gain from vectorisation. pandas-decimal is not integrated with any IO. In the original design, it was anticipated that integration with fastparquet would be trivial and would not need any additional dependencies, but this has not been implemented, since development of pandas-decimal halted.
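
For instance, the arrow-backed roundtrip looks like this (the filename is arbitrary, and the dtype_backend option requires pandas 2.x):

import decimal

import pandas as pd
import pyarrow as pa

pa_dtype = pa.decimal128(precision=7, scale=3)
df = pd.DataFrame(
    {"x": [decimal.Decimal("8093.012"), decimal.Decimal("8094.123")]},
    dtype=pd.ArrowDtype(pa_dtype),
)

# the decimal type survives the parquet roundtrip intact
df.to_parquet("decimals.parquet")
back = pd.read_parquet("decimals.parquet", dtype_backend="pyarrow")
assert back.x.equals(df.x)
assert back.x.dtype == df.x.dtype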

Summary

pyarrow does not provide the all-encompassing solution to fixed-precision decimals we would have hoped for - at least not yet. One assumes that the functionality may well already exist within arrow itself, but it is not well exposed to pandas due to lack of interest/motivation. This surprised me somewhat, since finance is such a big user of pandas, and has traditionally required exact decimals, for instance in its database models.

For the time being, pandas-decimal, a very small-effort proof-of-concept, shows clear advantages in speed and flexibility. However, I still don't particularly recommend it to anyone, for three main reasons:

  • it is a separate, non-standard install, whereas pyarrow is likely already available in most environments
  • it lacks testing or any support (the status of this in pandas/pyarrow is not known to me)
  • the integration with parquet has not been done

I am hoping that, by writing this post, I will motivate arrow developers to improve decimal support, so that pandas-decimal can be retired even as a proof of concept.