
More efficient implementation of integers #42

Closed
markshannon opened this issue Apr 9, 2021 · 20 comments
Comments

@markshannon
Member

markshannon commented Apr 9, 2021

This might be a useful precursor to #7, since #7 will also need distinct paths for small integers vs. larger integers.

Currently, int is implemented as an array of digits, where a digit is (approx) half a machine word.

typedef struct {
    OBJECT_HEADER;
    Py_ssize_t size_and_sign;
    digit digits[1];
} PyLongObject;

I would suggest changing this to:

typedef struct {
    OBJECT_HEADER;
    Py_ssize_t tagged_bits;
}  PyIntObject;
typedef struct {
    PyIntObject header;
    digit digits[1];
} PyLongObject;

Values that, if represented as a PyLongObject, would have -2 <= size <= 2 would be represented as a PyIntObject with obj->tagged_bits = value << 2.
Other values would be represented as a PyLongObject, as they are now, except that the size would be stored as (size << 1) + 1.
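
For illustration, a minimal self-contained sketch of how such a tagged field could be decoded (the helper names, and intptr_t standing in for Py_ssize_t, are my own and not part of the proposal):

#include <stdint.h>

/* Sketch only. Low bit 0 => inline value, stored as value << 2;
   low bit 1 => heap PyLongObject, whose signed size is stored as (size << 1) + 1. */
static inline int is_inline_int(intptr_t tagged_bits) {
    return (tagged_bits & 1) == 0;
}
static inline intptr_t inline_value(intptr_t tagged_bits) {
    return tagged_bits >> 2;   /* arithmetic shift recovers the sign */
}
static inline intptr_t long_size(intptr_t tagged_bits) {
    return tagged_bits >> 1;   /* flooring shift drops the tag bit, keeps the signed size */
}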

Intrinsic functions for add-with-overflow will help with performance. GCC has them; MSVC has them only for unsigned values (I don't know why).

Even without intrinsics, the overflow checks can be implemented in only a few instructions, so will still be a lot faster than what we have now.
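
As a rough illustration of that fast path (a sketch under the proposed encoding, not actual CPython code; with GCC/Clang the overflow check is a single __builtin_add_overflow call):

#include <stdint.h>
#include <stdbool.h>

/* Add two already-untagged inline values. Returns false when the caller
   must fall back to the full PyLongObject path. */
static inline bool add_inline(intptr_t a, intptr_t b, intptr_t *result) {
    intptr_t sum;
    if (__builtin_add_overflow(a, b, &sum)) {
        return false;                        /* overflowed the machine word */
    }
    /* Two bits of the field are reserved for the tag, so the sum must also
       fit in the remaining 62 bits (on a 64-bit build). */
    if (sum > (INTPTR_MAX >> 2) || sum < (INTPTR_MIN >> 2)) {
        return false;
    }
    *result = sum;
    return true;
}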

@pitrou

pitrou commented May 13, 2021

Sidenote: portable-snippets is a useful source for portable overflow checks.
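
For comparison, the classic portable pre-condition test needs no compiler support at all (this is the general idiom, not necessarily how portable-snippets implements it):

#include <stdint.h>
#include <stdbool.h>

/* Overflow-safe signed addition without intrinsics. */
static inline bool checked_add(intptr_t a, intptr_t b, intptr_t *out) {
    if ((b > 0 && a > INTPTR_MAX - b) ||
        (b < 0 && a < INTPTR_MIN - b)) {
        return false;   /* a + b would overflow */
    }
    *out = a + b;
    return true;
}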

@heeres

heeres commented May 31, 2021

Hi,

I made a patch that produces a faster range iterator by reusing the PyLong object it returns: heeres/cpython@662b587. I know it's not correctly handling the number of digits in the PyLong yet, but that is a solvable issue.

Before patch:

./python -m timeit -c 'for i in range(10000000):pass'
2 loops, best of 5: 121 msec per loop
./python -m timeit -c 'for i in range(-5,200):pass'
200000 loops, best of 5: 1.46 usec per loop
./python -m timeit -c 'for i in range(1000):i+1'
10000 loops, best of 5: 28.6 usec per loop

After patch:

./python -m timeit -c 'for i in range(10000000):pass'
5 loops, best of 5: 58.4 msec per loop
./python -m timeit -c 'for i in range(-5,200):pass'
200000 loops, best of 5: 1.37 usec per loop
./python -m timeit -c 'for i in range(1000):i+1'
10000 loops, best of 5: 20.3 usec per loop

(so more than a factor of 2 faster on large ranges, and a bit faster on interned-only ints)

This also demonstrates that allocating/initializing/destroying a new PyLong object adds quite a bit of overhead (e.g. always going through PyObject_Alloc, except when interned). This will have an enormous impact on all basic long/double functions unless they are treated in a special way internally, such as the tagged pointer approach (#7). Even then you have to be careful not to require the creation of new PyLong objects all the time, or it might kill a lot of the benefit. In an ideal world I think ints and doubles would indeed be represented by tagged PyObject* values everywhere, but that seems like too big a change, as it would have far-reaching effects (though it would also bring a significant memory-usage advantage).

The allocation overhead is perhaps also incurred more often than strictly necessary because integers(/longs) and doubles are defined as immutable objects, so in-place operations are not available and a new object has to be allocated every time. There is probably lots of code around that relies on this behaviour, but would it be worth considering making longs mutable, i.e. implementing the nb_inplace_* methods?

Perhaps it would also be beneficial to re-introduce intobject.c from Python 2 (as int64_t, I guess, even on 16- or 32-bit systems) and return an upgraded PyLong only when required (which I assume is relatively rare)?

(On a side-note: floatobject.c and several other types (intobject.c previously) have a free_list, but longobject.c does not, although a patch to add it seems to have existed at some point: https://bugs.python.org/file9824/py3k_longfreelist2-gps02.patch. Is there a good reason not to have this? It might already help)

@gvanrossum
Collaborator

but would it be worth considering making longs mutable, i.e. implementing the nb_inplace_* methods?

Sorry, no. This is where I draw the line. We can optimize all we want, but the semantics should keep integers immutable. Otherwise we'd get into crazy situations like this:

x = 1
y = 1
x += 1
print(y)  # Would print "2" with your proposal

Perhaps it would also be beneficial to re-introduce the intobject.c [...]

People have been used to type(1) is type(2**100) for many years, and preserving that invariant would require at least one bit in the fixed-width (64-bit) int. There really isn't an extra bit available, unless you limit this to 63 bits -- which might work. IIRC, Mark's idea is to just store the value in the ob_size field. There are at least 62 bits available there (on a 64-bit machine) -- we need one bit for the sign and another bit to tell which form is used.

@heeres

heeres commented Jun 1, 2021

but would it be worth considering making longs mutable, i.e. implementing the nb_inplace_* methods?

Sorry, no. This is where I draw the line. We can optimize all we want, but the semantics should keep integers immutable. Otherwise we'd get into crazy situations like this:

x = 1
y = 1
x += 1
print(y)  # Would print "2" with your proposal

This would indeed not be desirable behavior, but there are a couple of options:
a. these integers would all be separate constants / objects (with a large memory-footprint penalty)
b. the in-place operators return a new object if they are interned longs. Or, more strictly: the in-place operators re-use the existing object only if there are no other references to it (sketched below).

I do think that having in-place (or any other) operations on longs return new objects accounts for a significant fraction of the overhead of these integer operations.
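
A toy illustration of the stricter form of option (b), using a made-up ToyInt type rather than real CPython objects: mutate in place only when the sole reference is held.

#include <stdlib.h>

/* Not CPython API; a minimal stand-in object with a reference count. */
typedef struct {
    long refcnt;
    long value;
} ToyInt;

static ToyInt *toyint_new(long v) {
    ToyInt *obj = malloc(sizeof(ToyInt));
    if (obj) { obj->refcnt = 1; obj->value = v; }
    return obj;
}

/* "x += other": reuse the object only when nobody else can observe it. */
static ToyInt *toyint_iadd(ToyInt *self, long other) {
    if (self->refcnt == 1) {
        self->value += other;                 /* no other reference exists */
        return self;
    }
    return toyint_new(self->value + other);   /* shared: must not mutate */
}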

@heeres

heeres commented Jun 2, 2021

Another attempt to speed things up a bit: more aggressive inlining of single digit PyLong creation (heeres/cpython@a10bd88).

Before:

./python -m timeit 'for i in range(1000000): i+1'
10 loops, best of 5: 30.1 msec per loop

After:

./python -m timeit 'for i in range(1000000): i+1'
10 loops, best of 5: 27.4 msec per loop

Or a 9% improvement. Not bad I would say (although the patch is not very pretty).

Again, I think object creation represents a large fraction of the overhead of integer operations, and therefore I would guess that the proposal at the top of this issue, with tagged integers at the PyLong/PyInt object level, might result in only small speed-ups.

There are very many checks on the sign of the size. Perhaps things would turn out slightly more efficient by storing the sign as a bit instead, i.e. sign = ob_size & 1, size = ob_size >> 1, but one would really have to try it to see if it makes a difference.

Maybe there is also something to gain from doubling the digit size to (almost) full machine words? Of course this makes multiplication and division more complicated, but for most operations it makes little difference (since additions and subtractions would still fit).

@isidentical

Did you try these in a debug build, @heeres? If so, please also check the PGO+LTO compiled binaries, since these sorts of micro-optimizations tend to have much less effect on those. Especially for small margins (<15%), that can render the optimization redundant.

@gvanrossum
Collaborator

Another attempt to speed things up a bit: more aggressive inlining of single digit PyLong creation (heeres/cpython@a10bd88).

Assuming the effect persists with PGO, this is pretty neat. I'd add comments to explain where the various bits are expanded from (e.g. _Py_NewReference() contains the tracemalloc stuff and what's after that), and add a comment to the effect of "If one of those ever changes, this needs to be changed, too".

@gvanrossum
Collaborator

(On a side-note: floatobject.c and several other types (intobject.c previously) have a free_list, but longobject.c does not, although a patch to add it seems to have existed at some point: https://bugs.python.org/file9824/py3k_longfreelist2-gps02.patch. Is there a good reason not to have this? It might already help)

I don't know the history here, but I suspect modern mallocs are getting better so that the benefit of a free list isn't as big as it used to be.

@gvanrossum
Collaborator

I made a patch that produces a faster range iterator by reusing the PyLong object it returns: heeres/cpython@662b587. I know it's not correctly handling the number of digits in the PyLong yet, but that is a solvable issue.

Looking at that patch, apart from the number of digits, the assumption that if the refcount is 2 the other reference will be overwritten with the new value seems to sink this idea. One could easily construct a counterexample using next(), e.g.

it = iter(range(1000, 2000))
i = next(it)
j = next(it)

Or am I missing something?

@heeres

heeres commented Jun 2, 2021

Did you try these in a debug build, @heeres? If so, please also check the PGO+LTO compiled binaries, since these sorts of micro-optimizations tend to have much less effect on those. Especially for small margins (<15%), that can render the optimization redundant.

This is with ./configure --enable-optimizations, i.e. OPT=-DNDEBUG -g -fwrapv -O3 -Wall

If I add -flto to OPT and *LDFLAGS, I get the following results:

Without patch:

./python -m timeit 'for i in range(1000000): i+1'
10 loops, best of 5: 29 msec per loop

With patch:

./python -m timeit 'for i in range(1000000): i+1'
10 loops, best of 5: 27.7 msec per loop

Less of a gain indeed, but still something. Note that, looking at the code, it's clear that we can remove some checks and operations that the linker really can't get rid of.

@isidentical

Just a reminder that timeit is not great for this sort of micro-benchmark. Use pyperf, with pyperf system tune. If you have the resources, the best results are obtained by running the benchmarks on isolated CPUs so nothing can interfere (~2-3% is generally considered noise, btw).
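
For example (assuming pyperf is installed; system tune typically needs root):

python -m pyperf system tune
python -m pyperf timeit 'for i in range(1000): i+1'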

@heeres

heeres commented Jun 2, 2021

I made a patch that produces a faster range iterator by reusing the PyLong object it returns: heeres/cpython@662b587. I know it's not correctly handling the number of digits in the PyLong yet, but that is a solvable issue.

Looking at that patch, apart from the number of digits, the assumption that if the refcount is 2 the other reference will be overwritten with the new value seems to sink this idea. One could easily construct a counterexample using next(), e.g.

it = iter(range(1000, 2000))
i = next(it)
j = next(it)

Or am I missing something?

Ah, good point. I had not considered that use case.

for i in range(0,100000,set_long_in_place=True)?

@gvanrossum
Collaborator

for i in range(0,100000,set_long_in_place=True)?

Sorry, no -- nobody is going to put that in their code, it will just clutter up the API.

@heeres

heeres commented Jun 2, 2021

for i in range(0,100000,set_long_in_place=True)?

Sorry, no -- nobody is going to put that in their code, it will just clutter up the API.

I understand and wasn't really serious :-) Perhaps the compiler or optimizer could detect the pattern where it would be suitable to do so, but that sounds rather involved so I'll let it go for now.

@gvanrossum
Collaborator

I understand and wasn't really serious :-)

Sorry, I didn't pick up on that.

Perhaps the compiler or optimizer could detect the pattern where it would be suitable to do so, but that sounds rather involved so I'll let it go for now.

It was worth a shot -- we need lots of ideas like these, and occasionally one sticks. You often need to write the code to see why an idea doesn't stick. The same technique of reusing an object that's nominally immutable does work for dict.items(), for example.

@heeres

heeres commented Jun 3, 2021

Perhaps the compiler or optimizer could detect the pattern where it would be suitable to do so, but that sounds rather involved so I'll let it go for now.

It was worth a shot -- we need lots of ideas like these, and occasionally one sticks. You often need to write the code to see why an idea doesn't stick. The same technique of reusing an object that's nominally immutable does work for dict.items(), for example.

Maybe all hope is not lost for this optimization: heeres/cpython@90c35d5

This approach makes fewer assumptions, and treats a FOR_ITER/STORE_FAST pair like a super-instruction. For a (to-be-replaced) local PyLong with a ref count > 1 it pre-decrements the ref count, so that the range iterator can be sure it holds the last reference.

Performance is more than 2 times better on for i in range(10000000):pass and a few percent better for small integers. I expect no penalty on other FOR_ITER/STORE_FAST combinations, and a negligible effect on different patterns involving FOR_ITER.

The number of digits should now also be set correctly.

@gvanrossum
Collaborator

This sounds like a perfect application of a super-instruction (#16) or even better a combined instruction (#36). The compiler can replace all FOR_ITER/STORE_FAST opcode pairs (in the peephole pass) and then you won't need the ugly flag variable.

@sweeneyde

(On a side-note: floatobject.c and several other types (intobject.c previously) have a free_list, but longobject.c does not, although a patch to add it seems to have existed at some point: https://bugs.python.org/file9824/py3k_longfreelist2-gps02.patch. Is there a good reason not to have this? It might already help)

I don't know the history here, but I suspect modern mallocs are getting better so that the benefit of a free list isn't as big as it used to be.

I tested something like Stefan Behnel's pylong_freelist patch from https://bugs.python.org/issue24076 (conceptually simpler, using a static array rather than a singly linked list) and got some promising results on for i in range(10_000): i+i:

Mean +- std dev: [range_10000_main] 432 us +- 14 us -> [range_10000_freelist] 311 us +- 3 us: 1.39x faster

The exact patch I applied:

diff --git a/Objects/longobject.c b/Objects/longobject.c
index 33fea6491b..ca6277f451 100644
--- a/Objects/longobject.c
+++ b/Objects/longobject.c
@@ -23,6 +23,15 @@ class int "PyObject *" "&PyLong_Type"
 #define NSMALLNEGINTS           _PY_NSMALLNEGINTS
 #define NSMALLPOSINTS           _PY_NSMALLPOSINTS

+#ifndef PyLong_MAXFREELIST
+#define PyLong_MAXFREELIST  20
+#endif
+
+#if PyLong_MAXFREELIST > 0
+static PyLongObject *free_list[PyLong_MAXFREELIST];
+static int numfree = 0;
+#endif
+
 _Py_IDENTIFIER(little);
 _Py_IDENTIFIER(big);

@@ -171,6 +180,18 @@ _PyLong_FromMedium(sdigit x)
     assert(!IS_SMALL_INT(x));
     assert(is_medium_int(x));
     /* We could use a freelist here */
+#if PyLong_MAXFREELIST > 0
+    if (numfree) {
+        PyLongObject *result = free_list[--numfree];
+        assert(result != NULL);
+        assert(Py_TYPE(result) == &PyLong_Type);
+        /* Inline PyObject_InitVar */
+        _Py_NewReference((PyObject *)result);
+        Py_SET_SIZE(result, x < 0 ? -1: 1);
+        result->ob_digit[0] = x < 0 ? -x : x;
+        return (PyObject *)result;
+    }
+#endif
     PyLongObject *v = PyObject_Malloc(sizeof(PyLongObject));
     if (v == NULL) {
         PyErr_NoMemory();
@@ -5644,6 +5665,18 @@ long_long_meth(PyObject *self, PyObject *Py_UNUSED(ignored))
     return long_long(self);
 }

+static void
+long_dealloc(PyObject *v)
+{
+#if PyLong_MAXFREELIST > 0
+    if (numfree < PyLong_MAXFREELIST && Py_TYPE(v) == &PyLong_Type && Py_SIZE(v) == 1) {
+        free_list[numfree++] = (PyLongObject*)v;
+        return;
+    }
+#endif
+    Py_TYPE(v)->tp_free(v);
+}
+
 static PyMethodDef long_methods[] = {
     {"conjugate",       long_long_meth, METH_NOARGS,
      "Returns self, the complex conjugate of any int."},
@@ -5743,7 +5776,7 @@ PyTypeObject PyLong_Type = {
     "int",                                      /* tp_name */
     offsetof(PyLongObject, ob_digit),           /* tp_basicsize */
     sizeof(digit),                              /* tp_itemsize */
-    0,                                          /* tp_dealloc */
+    long_dealloc,                               /* tp_dealloc */
     0,                                          /* tp_vectorcall_offset */
     0,                                          /* tp_getattr */
     0,                                          /* tp_setattr */
@@ -5842,6 +5875,23 @@ _PyLong_Init(PyInterpreterState *interp)
     return 0;
 }

+int
+PyLong_ClearFreeList(void)
+{
+#if PyLong_MAXFREELIST > 0
+    PyLongObject *op;
+    int ret = numfree;
+    while (numfree) {
+        op = free_list[--numfree];
+        assert(PyLong_CheckExact(op));
+        PyObject_Del(op);
+    }
+    return ret;
+#else
+    return 0;
+#endif
+}
+

 int
 _PyLong_InitTypes(void)
@@ -5861,4 +5911,7 @@ _PyLong_Fini(PyInterpreterState *interp)
     for (Py_ssize_t i = 0; i < NSMALLNEGINTS + NSMALLPOSINTS; i++) {
         Py_CLEAR(interp->small_ints[i]);
     }
+#if PyLong_MAXFREELIST > 0
+    (void)PyLong_ClearFreeList();
+#endif
 }

@markshannon
Member Author

markshannon commented Sep 28, 2021

It is almost certainly worth adding freelists for "medium" ints. However, what we don't want is another ad-hoc freelist implementation.

I've just created #89 to cover this.

@markshannon
Member Author

I'm closing this as it seems to have been hijacked into a discussion about range.
