## Introduction

Some time ago I had a go at implementing a "decimals" extension type for
pandas, after stumbling upon parquet data of that type, which pandas could not
read: `pyarrow` would error, and `fastparquet` would convert to floats. The
decimal type, with known, fixed precision, is very important in real-world
applications such as finance, where exact equality of fractional values is
required. The following should succeed, but with standard Python or pandas it
does not:

```
>>> 0.1 + 0.1 + 0.1 == 0.3
False
>>> pd.Series([0.1, 0.1, 0.1]).sum() == 0.3
False
```

Although one could solve this with Python's builtin `decimal.Decimal`
objects, we want something that supports vectorised compute, which will
be orders of magnitude faster.
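
For reference, the object-based approach does give exact results. A minimal
illustration, using only the standard library:

```
import decimal

# 0.1 is exactly representable as a Decimal, unlike as a binary float
tenth = decimal.Decimal("0.1")
print(tenth + tenth + tenth == decimal.Decimal("0.3"))  # True

# the float version fails because 0.1 has no exact binary representation
print(0.1 + 0.1 + 0.1 == 0.3)  # False
```

The catch, of course, is that every element is a full Python object, with all
the per-element overhead that implies.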

## Implementations

I set out to make an implementation for discussion. This made use of
pandas' "extension types" system, which allows non-standard dtypes and
array types to be registered. A decent proof of concept was pretty quick
to make (see the repo).
However, in conversations around it, it was brought to my attention
that the advent of more generalised `arrow` type support in pandas would
expose arrow's decimal type for use in pandas too.

So, if `arrow` could handle this, then the need for `pandas-decimal` would
go away.
I accepted this, but later conversation spurred me to look again. So let's
summarise the situation.

## Comparisons

### Ease of use

From the Python perspective, pyarrow interoperates with `decimal.Decimal`
objects; internally it holds an optimised binary representation. Arithmetic
with integers is automatic and explicit conversion to/from float is supported,
but arithmetic with floats is not.

```
import decimal
import pandas as pd
import pyarrow as pa

# dtype object
pa_dtype = pa.decimal128(precision=7, scale=3)
# creation from decimal objects
data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
# explicit conversion round-trips without loss
df_float = df.astype("float")
df2 = df_float.astype(pd.ArrowDtype(pa_dtype))
assert df.x.equals(df2.x)
# but operations involving floats fail
df.x[0] = 1.1  # ArrowTypeError
df + 1.1  # ArrowTypeError
df + decimal.Decimal("1.1")  # OK
```

Conversely, `pandas-decimal` does not handle `decimal.Decimal` objects at all,
but implicitly converts to and from floats on access. This means that
arithmetic works as you would normally expect, except that you lose the
precision guarantees of the decimal representation.

```
import pandas_decimal

# creation directly from floats
df = pd.DataFrame(
    {"x": [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567]},
    dtype="decimal[3]",
)
# explicit conversion round-trips without loss
df_float = df.astype("float")
df2 = df_float.astype("decimal[3]")
assert df.x.equals(df2.x)
# operating on floats works
df.x[0] = 1.1  # OK
df + 1.1  # OK
df + decimal.Decimal("1.1")  # ValueError
```

### Performance

Comparing the two vectorised array types to Python objects and to each other,
we see clearly that pandas-decimal wins on all operations. I also include one
plain-float calculation, showing that pandas-decimal is even faster than that.

```
data = [8093.012, 8094.123, 8095.234, 8096.345, 8097.456, 8098.567] * 1000000
# pyarrow
df = pd.DataFrame({"x": data}).astype(dtype=pd.ArrowDtype(pa_dtype))
%timeit df.x + 1
16.2 ms ± 85.8 µs
%timeit df.x + decimal.Decimal("0.1")
150 ms ± 1.09 ms
%timeit df.x.mean()
17.5 ms ± 54.1 µs
%timeit df.x.std()  # Error
# pandas-decimal
df = pd.DataFrame({"x": data}, dtype="decimal[3]")
%timeit df.x + 1
4.62 ms ± 48.5 µs
%timeit df.x + 0.1
4.9 ms ± 530 µs
%timeit df.x.mean()
2.16 ms ± 1.79 µs
%timeit df.x.std()
11 ms ± 119 µs
# pure python objects
df["x"] = df.x.map(lambda x: decimal.Decimal(str(x)))
%timeit df.x + 1
475 ms ± 4.79 ms
%timeit df.x + decimal.Decimal("0.1")
460 ms ± 7.96 ms
%timeit df.x.mean()
392 ms ± 573 µs
%timeit df.x.std()  # Error
# inexact floats
df["x"] = df.x.astype("float")
%timeit df.x.mean()
7.6 ms ± 115 µs
```

### IO

`pyarrow`'s decimal column can be saved to/loaded from parquet without conversion, and quickly. Conversion
to text (e.g., for CSV) is as slow as you would expect, since it doesn't gain from vectorisation.
`pandas-decimal` is not integrated with any IO. In the original design, it was anticipated that integration
with `fastparquet` would be trivial, and would not need any additional dependencies, but this has not been
implemented since development of pandas-decimal halted.

## Summary

`pyarrow` does not provide the all-encompassing solution to fixed-precision decimals we would have hoped
for - at least not yet. One assumes that the functionality may well already exist within `arrow` itself,
but is not well exposed to `pandas` due to lack of interest/motivation. This surprised me somewhat,
since finance is such a big user of `pandas`, and has traditionally required exact decimals, for instance
in its database models.

For the time being, `pandas-decimal`, a very small-effort proof-of-concept, shows clear advantages in
speed and flexibility. However, I still don't particularly recommend it to anyone, for three main reasons:

- it is a separate, non-standard install, whereas `pyarrow` is likely already available in the environment
- it lacks testing or any support (the status of this in pandas/pyarrow is not known to me)
- the integration with parquet has not been done

I am hoping that, by writing this post, I will motivate `arrow` developers to improve decimal support, so
that `pandas-decimal` can be retired even as a proof of concept.