# Decimals type for pandas

## Introduction

Some time ago I had a go at implementing a "decimals" extension type for pandas, after stumbling upon parquet data of that type which pandas could not read: `pyarrow` would error and `fastparquet` would convert to floats. The decimal type, with known, fixed precision, is very important in real-world applications such as finance, where exact equality of fractional values is required. The following should succeed, but with standard python or pandas it does not:

``````
>>> 0.1 + 0.1 + 0.1 == 0.3
False
>>> pd.Series([0.1, 0.1, 0.1]).sum() == 0.3
False
``````

Although one could solve this with python's builtin `decimal.Decimal` objects, we want something that supports vectorised compute, which will be orders of magnitude faster.
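For illustration, the builtin `decimal` module already gives the exact result for the example above — just not in vectorised form:

```python
import decimal

# Exact base-10 arithmetic: no binary rounding error
d = decimal.Decimal("0.1")
print(d + d + d == decimal.Decimal("0.3"))  # True

# The float equivalent fails, because 0.1 has no exact binary representation
print(0.1 + 0.1 + 0.1 == 0.3)  # False
```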

## Implementations

I set out to make an implementation for discussion. This made use of pandas' "extension types" system, which allows non-standard dtypes and array types to be registered. A decent proof of concept was quick to make (see the repo). However, in conversations around it, it was brought to my attention that the advent of more generalised `arrow` type support in pandas would expose arrow's decimal type for use in pandas too.

So, if `arrow` could handle this, the need for `pandas-decimal` would go away. I accepted this, but a later conversation spurred me to look again. So let's summarise the situation.
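For context, the extension mechanism that pandas-decimal builds on can be sketched as follows. This is a hypothetical, heavily trimmed dtype (the names `Dec3Dtype`/`Dec3Array` are invented here, not the actual pandas-decimal code, and a production implementation needs many more methods): values are stored as scale-3 integers, i.e. `value * 1000`, so sums are computed exactly with fast integer arithmetic.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray, ExtensionDtype, register_extension_dtype,
)


@register_extension_dtype
class Dec3Dtype(ExtensionDtype):
    name = "dec3"   # users can now write dtype="dec3"
    type = float    # scalar type handed back to python

    @classmethod
    def construct_array_type(cls):
        return Dec3Array


class Dec3Array(ExtensionArray):
    """Stores scale-3 decimals as int64 (value * 1000)."""

    def __init__(self, scaled):
        self._data = np.asarray(scaled, dtype="int64")

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(np.round(np.asarray(scalars, dtype="float64") * 1000))

    @property
    def dtype(self):
        return Dec3Dtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def __len__(self):
        return len(self._data)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._data[item] / 1000
        return type(self)(self._data[item])

    def isna(self):
        return np.zeros(len(self), dtype=bool)

    def take(self, indices, *, allow_fill=False, fill_value=None):
        return type(self)(self._data.take(indices))

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([a._data for a in to_concat]))

    def _reduce(self, name, *, skipna=True, **kwargs):
        if name == "sum":
            # integer sum, then unscale: exact, unlike summing binary floats
            return self._data.sum() / 1000
        raise NotImplementedError(name)


# the registry resolves the string "dec3" to our dtype
s = pd.Series([0.1, 0.1, 0.1], dtype="dec3")
print(s.sum() == 0.3)  # True: 100 + 100 + 100 = 300 -> 0.3
```

The key point is the scaled-integer storage: the intro's failing comparison succeeds here because no binary fractions are ever summed.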

## Comparisons

### Ease of use

From the python perspective, pyarrow interoperates with `decimal.Decimal` objects; internally it holds an optimised binary representation. Arithmetic with integers is automatic and explicit conversion to/from float is supported, but arithmetic with floats is not supported.

``````
import decimal
import pandas as pd
import pyarrow as pa

# dtype object
pa_dtype = pa.decimal128(precision=7, scale=3)

# creation from decimal objects
data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))

# explicit conversion
df_float = df.astype("float")
df2 = df_float.astype(pd.ArrowDtype(pa_dtype))
assert df.x.equals(df2.x)

# fail to operate on floats
df.x = 1.1  # ArrowTypeError
df + 1.1  # ArrowTypeError
df + decimal.Decimal("1.1")  # OK
``````

Conversely, `pandas-decimal` does not handle `decimal.Decimal` objects at all, but implicitly converts to and from floats on access. This means that arithmetic works as you would normally expect, at the cost of losing the precision guarantees of the decimal representation whenever floats are involved.

``````
import pandas_decimal

# creation with floats
df = pd.DataFrame(
{"x": [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567]},
dtype="decimal",
)

# explicit conversion
df_float = df.astype("float")
df2 = df_float.astype("decimal")
assert df.x.equals(df2.x)

# operating on floats
df.x = 1.1  # OK
df + 1.1  # OK
df + decimal.Decimal("1.1")  # ValueError
``````

### Performance

Comparing the two vectorised array types to python objects and to each other, we see clearly that pandas-decimal wins on all operations. I also include a single plain-float calculation, showing that pandas-decimal is even faster than that.

``````
data = [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567] * 1000000

# pyarrow
df = pd.DataFrame({"x": data}).astype(pd.ArrowDtype(pa_dtype))
%timeit df.x + 1
16.2 ms ± 85.8 µs
%timeit df.x + decimal.Decimal(0.1)
150 ms ± 1.09 ms
%timeit df.x.mean()
17.5 ms ± 54.1 µs
%timeit df.x.std()  # Error

# pandas-decimal
df = pd.DataFrame({"x": data}, dtype="decimal")
%timeit df.x + 1
4.62 ms ± 48.5 µs
%timeit df.x + 0.1
4.9 ms ± 530 µs
%timeit df.x.mean()
2.16 ms ± 1.79 µs
%timeit df.x.std()
11 ms ± 119 µs

# pure python
df["x"] = df.x.map(lambda x: decimal.Decimal(str(x)))
%timeit df.x + 1
475 ms ± 4.79 ms
%timeit df.x + decimal.Decimal(0.1)
460 ms ± 7.96 ms
%timeit df.x.mean()
392 ms ± 573 µs
%timeit df.x.std()  # Error

# inexact floats
df["x"] = df.x.astype("float")
%timeit df.x.mean()
7.6 ms ± 115 µs
``````

### IO

`pyarrow`'s decimal column can be saved to/loaded from parquet without conversion, and quickly. Conversion to text (e.g., for CSV) is as slow as you would expect, since it doesn't gain from vectorisation. `pandas-decimal` is not integrated with any IO. In the original design, it was anticipated that integration with `fastparquet` would be trivial and would not need any additional dependencies, but this has not been implemented, since development of pandas-decimal halted.

## Summary

`pyarrow` does not provide the all-encompassing solution to fixed-precision decimals we would have hoped for - at least not yet. One assumes that the functionality may well already exist within `arrow` itself, but it is not well exposed to `pandas`, presumably for lack of interest or motivation. This surprised me somewhat, since finance is such a big user of `pandas` and has traditionally required exact decimals, for instance in database models.

For the time being, `pandas-decimal`, a very small-effort proof-of-concept, shows clear advantages in speed and flexibility. However, I still don't particularly recommend it to anyone, for three main reasons:

- it is a separate, non-standard install, whereas `pyarrow` is likely already available in the environment
- it lacks testing and any support (the status of this in pandas/pyarrow is not known to me)
- the integration with parquet has not been done

I hope that, by writing this post, I will motivate the `arrow` developers to improve decimal support, so that `pandas-decimal` can be retired even as a proof of concept.