More efficient implementation of integers #42
Sidenote: portable-snippets is a useful source for portable overflow checks.
Hi, I made a patch that produces a faster range iterator by reusing the PyLong object it returns: heeres/cpython@662b587. I know it's not correctly handling the number of digits in the PyLong yet, but that is a solvable issue. Before patch:
After patch:
(So more than a factor of 2 faster on large ranges, and a bit faster on interned-only ints.) This also demonstrates that allocating/initializing/destroying a new `PyLong` object adds quite a bit of overhead (e.g. always going through `PyObject_Alloc`, except when interned). This will have an enormous impact on all basic long/double functions unless they are treated in a special way internally, such as the tagged-pointer approach (#7). Even then you have to be careful not to require creation of new `PyLong` objects all the time, or it might kill a lot of the benefit.

In the ideal world, I think ints and doubles would indeed be represented by tagged `PyObject*` values everywhere, but that seems like too big a change, as it would have far-reaching effects (though also a significant memory-usage advantage).

The allocation overhead is perhaps also incurred more often than strictly required because integers(/longs) and doubles are defined as immutable objects, so in-place operations are not available and have to allocate a new object. There is probably lots of code around that relies on this behaviour, but would it be worth considering making longs mutable, i.e. implementing the `nb_inplace_*` methods? Perhaps it would also be beneficial to re-introduce `intobject.c` from Python 2 (as `int64_t`, I guess, even on 16- or 32-bit systems), and return an upgraded `PyLong` only when required, which I assume is relatively rare.

(On a side note: `floatobject.c` and several other types (previously including `intobject.c`) have a `free_list`, but `longobject.c` does not, although a patch to add one seems to have existed at some point: https://bugs.python.org/file9824/py3k_longfreelist2-gps02.patch. Is there a good reason not to have this? It might already help.)
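To make the reuse idea concrete, here is a toy, self-contained C sketch of the technique (not the actual heeres/cpython@662b587 patch -- `Box`, `RangeIter`, and the other names are invented for illustration): a refcounted box plus an iterator that mutates the previously returned box in place whenever it holds the only remaining reference, which is exactly the situation a plain `for i in range(...)` loop produces.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int refcnt; long value; } Box;   /* stand-in for PyLongObject */

static Box *box_new(long v) {
    Box *b = malloc(sizeof(Box));   /* error handling omitted in this toy */
    b->refcnt = 1;
    b->value = v;
    return b;
}

static void box_decref(Box *b) {
    if (--b->refcnt == 0)
        free(b);
}

typedef struct { long next, stop; Box *last; } RangeIter;

/* Yield the next value; the caller owns the returned reference. */
static Box *iter_next(RangeIter *it) {
    if (it->next >= it->stop)
        return NULL;
    long v = it->next++;
    Box *prev = it->last;
    if (prev != NULL && prev->refcnt == 1) {
        /* Only the iterator still references the previous box, so it can
         * be overwritten in place -- no dealloc/alloc round trip. */
        prev->value = v;
        prev->refcnt++;             /* fresh reference handed to the caller */
        return prev;
    }
    Box *b = box_new(v);            /* general path: allocate a new box */
    if (prev != NULL)
        box_decref(prev);
    b->refcnt++;                    /* the iterator keeps its own reference */
    it->last = b;
    return b;
}

int main(void) {
    RangeIter it = {0, 5, NULL};
    Box *b;
    while ((b = iter_next(&it)) != NULL) {
        printf("%ld\n", b->value);
        box_decref(b);              /* the "loop variable" drops its reference */
    }
    if (it.last != NULL)
        box_decref(it.last);
    return 0;
}
```

After the first iteration every call takes the in-place path, which is plausibly where the factor of 2 on large ranges comes from; the correctness question raised below (what if the caller keeps the value alive?) is exactly what the refcount test has to get right.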
Sorry, no. This is where I draw the line. We can optimize all we want, but the semantics should keep integers immutable. Otherwise we'd get into crazy situations where two names alias the same int object and an in-place operation through one silently changes the other (e.g. after `x = 10**6; y = x`, an in-place `x += 1` would change `y` too). People have been used to immutable integers for a very long time.
This would indeed not be desirable behavior, but there are several options. I do think that having in-place (or any other) operations on longs return new objects represents a significant fraction of the overhead of these integer operations.
Another attempt to speed things up a bit: more aggressive inlining of single digit PyLong creation (heeres/cpython@a10bd88). Before:
After:
Or a 9% improvement. Not bad, I would say (although the patch is not very pretty). Again, I think object creation represents a large fraction of the overhead on integer operations, and therefore I would guess that the proposal at the top of this issue, with tagged integers at the `PyLong`/`PyInt` object level, might result in only small speed-ups.

There are very many checks on the sign of the size; perhaps things would turn out slightly more efficient by storing the sign as a bit instead, i.e. `sign = ob_size & 1`, `size = ob_size >> 1`, but one would have to really try it to see if it makes a difference. Maybe there is also something to gain by trying to double the digit size to (almost) full machine words? Of course this makes multiplications and divisions more complicated, but for most operations it makes little difference (since additions and subtractions should still fit).
Did you try these in a debug build, @heeres? If so, please also check out PGO+LTO-compiled binaries, since these sorts of micro-optimizations tend to have much less effect on those. Especially for small margins (<15%), that might render the optimization redundant.
Assuming the effect persists with PGO, this is pretty neat. I'd add comments to explain where the various bits are expanded from.
I don't know the history here, but I suspect modern mallocs are getting better, so that the benefit of a free list isn't as big as it used to be.
Looking at that patch, apart from the number-of-digits issue, the assumption that a refcount of 2 means the other reference is about to be overwritten with the new value seems to sink this idea. One could easily construct a counterexample using `next()`: e.g. after `it = iter(range(10)); x = next(it)`, the following `next(it)` call sees a refcount of 2 (the iterator plus `x`) and would mutate the object in place, silently changing `x`. Or am I missing something?
This is with `./configure --enable-optimizations`, i.e. `OPT=-DNDEBUG -g -fwrapv -O3 -Wall`. If I add `-flto` to `OPT` and `*LDFLAGS`, I get the following results. Without patch:
With patch:
Less, indeed, but still something. Note that, looking at the code, it's clear that we can remove some checks and operations that the linker really can't get rid of.
Just a reminder that the values a range iterator produces can be kept alive by the caller -- not every use is a plain `for` loop over them.
Ah, good point. I had not considered that use-case. Perhaps `for i in range(0, 100000, set_long_in_place=True)`?
Sorry, no -- nobody is going to put that in their code; it will just clutter up the API.
I understand, and wasn't really serious :-) Perhaps the compiler or optimizer could detect the pattern where it would be suitable to do so, but that sounds rather involved, so I'll let it go for now.
Sorry, I didn't pick up on that.
It was worth a shot -- we need lots of ideas like these, and occasionally one sticks. You often need to write the code to see why an idea doesn't stick. The same technique of reusing an object that's nominally immutable does work for `dict.items()`, for example.
Maybe all hope is not lost for this optimization: heeres/cpython@90c35d5. This approach has fewer assumptions, and treats a `FOR_ITER`/`STORE_FAST` pair like a super-instruction. For a (to-be-replaced) local `PyLong` with refcount > 1, it pre-decrements the refcount, so that the range iterator can be sure it holds the last reference. Performance is more than 2 times better on the same range microbenchmark. The number of digits should now also be set correctly.
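In the same toy vocabulary as the earlier sketch (so again illustrative, not the actual heeres/cpython@90c35d5 patch), the pairing idea amounts to releasing the target slot's old reference before advancing the iterator, so that `iter_next()` can prove it holds the last reference:

```c
/* Toy FOR_ITER/STORE_FAST "super-instruction": reuses the Box/RangeIter
 * types from the sketch above. Returns 0 when the iterator is exhausted. */
static int for_iter_store(RangeIter *it, Box **local) {
    if (*local != NULL) {
        box_decref(*local);   /* this slot is about to be overwritten anyway */
        *local = NULL;
    }
    Box *b = iter_next(it);   /* now sees refcnt == 1 and takes the reuse path */
    if (b == NULL)
        return 0;
    *local = b;
    return 1;
}
```

An explicit `next()` call never goes through this combined path, so the earlier counterexample is sidestepped: reuse only happens when the interpreter knows the old value's reference is being overwritten.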
I tested something like Stefan Behnel's pylong_freelist patch from https://bugs.python.org/issue24076 (conceptually simpler: a static array rather than a singly linked list) and got some promising results.
The exact patch I applied:
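The patch body isn't reproduced above, but the shape of a static-array freelist is roughly the following (a conceptual sketch with invented names, not CPython code -- the real patch would hook the equivalent of `_PyLong_New` and the deallocator):

```c
#include <stdio.h>
#include <stdlib.h>

#define FREELIST_MAX 100

typedef struct { long payload[3]; } IntObj;   /* stand-in for a small PyLongObject */

static IntObj *free_list[FREELIST_MAX];
static int numfree = 0;

static IntObj *int_alloc(void) {
    if (numfree > 0)
        return free_list[--numfree];    /* pop a parked object: no allocator call */
    return malloc(sizeof(IntObj));
}

static void int_dealloc(IntObj *op) {
    if (numfree < FREELIST_MAX)
        free_list[numfree++] = op;      /* park it for reuse instead of freeing */
    else
        free(op);
}

int main(void) {
    IntObj *a = int_alloc();
    int_dealloc(a);
    IntObj *b = int_alloc();            /* gets the parked object straight back */
    printf("%s\n", a == b ? "reused" : "fresh");
    free(b);
    return 0;
}
```

The win is the same as in the range-iterator experiments: short-lived ints skip the general allocator entirely. The cost is a bounded amount of memory parked per type and, as noted below, yet another ad-hoc freelist to maintain.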
It is almost certainly worth adding freelists for "medium" ints. However, what we don't want is another ad-hoc freelist implementation. I've just created #89 to cover this.
I'm closing this, as it seems to have been hijacked into a discussion about micro-optimizing `range` rather than the original proposal.
This might be a useful precursor to #7, as we will need distinct paths for small integers vs. larger integers there as well.
Currently, `int` is implemented as an array of digits, where a digit is (approximately) half a machine word. I would suggest changing this to:

- Values that, if represented as a `PyLongObject`, would have `-2 <= size <= 2` would be represented as a `PyIntObject` with `obj->tagged_value_or_size_and_sign = value << 2`.
- Other values would be represented as a `PyLongObject`, as they are now, except that the size would be stored as `(size << 1) + 1`.
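As a small, hypothetical demonstration of the proposed encoding (helper names invented; the shifts on negative values rely on two's-complement arithmetic shifts, which ISO C leaves implementation-defined but which hold on every platform CPython targets -- note the `-fwrapv` in the build flags quoted earlier):

```c
#include <assert.h>
#include <stdint.h>

/* Small value packed into the tag itself: low two bits are 00. */
static intptr_t pack_small(intptr_t value) { return value << 2; }
static intptr_t unpack_small(intptr_t tag) { return tag >> 2; }

/* PyLongObject size field stored as (size << 1) + 1: low bit always 1. */
static intptr_t pack_size(intptr_t size)   { return (size << 1) + 1; }
static intptr_t unpack_size(intptr_t tag)  { return tag >> 1; }

/* A single bit test distinguishes the two representations. */
static int is_small(intptr_t tag)          { return (tag & 1) == 0; }

int main(void) {
    assert(is_small(pack_small(-17)) && unpack_small(pack_small(-17)) == -17);
    assert(!is_small(pack_size(-3)) && unpack_size(pack_size(-3)) == -3);
    return 0;
}
```

The point of the odd encoding for real `PyLongObject` sizes is that every code path can check one bit to decide which case it is handling before touching any memory.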
.Intrinsic functions for add-with-overflow will help with performance. GCC has those. Windows has them for unsigned values only (I don't know why).
Even without intrinsics, the overflow checks can be implemented in only a few instructions, so they will still be a lot faster than what we have now.
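For instance, a checked add can look like the following sketch -- `__builtin_add_overflow` is the real GCC/Clang intrinsic, while the fallback branch shows the kind of portable bit-trick portable-snippets collects:

```c
#include <stdint.h>
#include <stdio.h>

/* Add two int64_t values; returns nonzero if the result overflowed. */
static int add_overflows(int64_t a, int64_t b, int64_t *out) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_add_overflow(a, b, out);   /* typically an add plus a branch */
#else
    /* Do the add in unsigned arithmetic (well defined on wraparound); signed
     * overflow occurred iff both operands differ in sign from the result.
     * Converting the wrapped value back to int64_t is implementation-defined
     * in ISO C but behaves as two's complement on all relevant platforms. */
    uint64_t ur = (uint64_t)a + (uint64_t)b;
    *out = (int64_t)ur;
    return (int64_t)(((uint64_t)a ^ ur) & ((uint64_t)b ^ ur)) < 0;
#endif
}

int main(void) {
    int64_t r;
    printf("%d\n", add_overflows(INT64_MAX, 1, &r));   /* prints 1: overflow */
    printf("%d\n", add_overflows(40, 2, &r));          /* prints 0: r == 42 */
    return 0;
}
```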