GH-91719: Make MSVC generate somewhat faster switch code #91718

gvanrossum · 2022-04-20T01:28:20Z

Apparently a switch on an 8-bit quantity where all cases are
present generates a more efficient jump (doing only one indexed
memory load instead of two).

See faster-cpython/ideas#321 (comment)
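As an illustration only (not the code from this PR), here is a minimal sketch of a switch on an 8-bit quantity in which every value from 0 to 255 has a case label. The `run_op` function, its three toy operations, and the `CASE_n` helper macros are hypothetical names invented for this sketch. Whether a given compiler then emits the shorter jump sequence has to be confirmed by reading the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper macros: each one expands to a run of consecutive
   case labels, so all 256 values can be covered without listing them. */
#define CASE_1(n)   case (n):
#define CASE_2(n)   CASE_1(n)  CASE_1((n) + 1)
#define CASE_4(n)   CASE_2(n)  CASE_2((n) + 2)
#define CASE_8(n)   CASE_4(n)  CASE_4((n) + 4)
#define CASE_16(n)  CASE_8(n)  CASE_8((n) + 8)
#define CASE_32(n)  CASE_16(n) CASE_16((n) + 16)
#define CASE_64(n)  CASE_32(n) CASE_32((n) + 32)
#define CASE_128(n) CASE_64(n) CASE_64((n) + 64)

/* Toy interpreter step: three real operations, every other value is a no-op. */
int run_op(int opcode_word, int acc)
{
    /* The cast narrows the switch operand to 8 bits, so only 0..255 can occur. */
    switch ((uint8_t)opcode_word) {
    case 0: return acc + 1;
    case 1: return acc * 2;
    case 2: return -acc;
    /* Labels for 3..255: no value is left without a case. */
    case 3:
    CASE_4(4) CASE_8(8) CASE_16(16) CASE_32(32) CASE_64(64) CASE_128(128)
        break;
    }
    return acc; /* unused opcodes leave the accumulator unchanged */
}
```

With every value covered and the operand only 8 bits wide, the compiler has no out-of-range value to guard against, which is the property the description above relies on.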

markshannon · 2022-04-20T15:18:50Z

Would it make more sense to redefine opcode to be a uint8_t, rather than casting it?

We should probably make use_tracing an 8 bit unsigned integer as well.

gvanrossum · 2022-04-20T15:38:33Z

> Would it make more sense to redefine opcode to be a uint8_t, rather than casting it?

Yeah, I had considered that, it makes sense. I'll confirm that it has the same effect.

> We should probably make use_tracing an 8 bit unsigned integer as well.

I don't see why -- it's not used in a similar switch AFAICT, and it's not cramped for space in its struct. I assume for most other operations the cost of loading an int and loading a byte is effectively the same, since the CPU has to load a whole cache line (32 or 64 bytes) anyway.

gvanrossum · 2022-04-20T15:39:24Z

I don't believe this needs a news blurb.

markshannon · 2022-04-20T15:42:46Z

> I don't see why -- it's not used in a similar switch AFAICT, and it's not cramped for space in its struct. I assume for most other operations the cost of loading an int and loading a byte is effectively the same, since the CPU has to load a whole cache line (32 or 64 bytes) anyway.

The dispatch sequence includes opcode |= cframe.use_tracing.
If cframe.use_tracing is a 32 bit int, then the compiler needs to add a cast.
If it is the same type as opcode, it does not.
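A small sketch of the type difference being described, under stated assumptions: `demo_cframe_u8`, `demo_cframe_int`, and both functions are hypothetical stand-ins, not the real interpreter structs, and the flag is assumed to hold either 0 or 255. The C language promotes both operands to `int` in either version. The point is only that with matching 8-bit types the result needs no explicit narrowing, which gives the compiler the chance to emit a single byte-wide OR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the frame struct; only the flag matters here.
   Assumption: the flag is either 0 (not tracing) or 255 (tracing). */
typedef struct { uint8_t use_tracing; } demo_cframe_u8;
typedef struct { int use_tracing; } demo_cframe_int;

/* Flag has the same 8-bit type as opcode: a plain byte-wide OR. */
uint8_t merge_tracing_u8(uint8_t opcode, const demo_cframe_u8 *cframe)
{
    assert(cframe->use_tracing == 0 || cframe->use_tracing == 255);
    opcode |= cframe->use_tracing;
    return opcode;
}

/* Flag is a 32-bit int: the wide value has to be narrowed back to 8 bits. */
uint8_t merge_tracing_int(uint8_t opcode, const demo_cframe_int *cframe)
{
    assert(cframe->use_tracing == 0 || cframe->use_tracing == 255);
    return (uint8_t)(opcode | cframe->use_tracing);
}
```

Both functions return the same values. With the flag at 255, every opcode becomes 255, so a single switch case can catch the tracing path.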

gvanrossum · 2022-04-20T18:14:01Z

> The dispatch sequence includes opcode |= cframe.use_tracing.
> If cframe.use_tracing is a 32 bit int, then the compiler needs to add a cast.
> If it is the same type as opcode, it does not.

Okay, I'll make that change.

gvanrossum · 2022-04-20T19:07:47Z

@markshannon, please re-review. I confirmed that the switch still uses a single indirection (goto *(base + offset_table[opcode])). I also found that the opcode |= use_tracing is a single instruction (but maybe it always was one?).
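To make "single indirection" concrete, here is a hedged sketch that models the two dispatch shapes with ordinary function-pointer tables. It is one plausible reading of "one indexed memory load instead of two", not a transcript of the compiler's output. All names are hypothetical, and the opcode space is shrunk to 3 bits (8 values) to keep the tables short.

```c
#include <assert.h>
#include <stdint.h>

typedef int (*handler)(int);

static int op_inc(int x) { return x + 1; }
static int op_dbl(int x) { return x * 2; }
static int op_nop(int x) { return x; }

/* Two loads: a small index table maps each opcode to a slot,
   then a compact handler table maps the slot to the target. */
static const uint8_t index_table[8] = { 0, 1, 2, 2, 2, 2, 2, 2 };
static const handler compact_handlers[3] = { op_inc, op_dbl, op_nop };

int dispatch_two_loads(uint8_t opcode, int x)
{
    uint8_t slot = index_table[opcode & 7];   /* first indexed load */
    return compact_handlers[slot](x);         /* second indexed load */
}

/* One load: every opcode value has its own entry, so the opcode
   indexes the handler table directly. */
static const handler full_handlers[8] = {
    op_inc, op_dbl, op_nop, op_nop, op_nop, op_nop, op_nop, op_nop
};
static_assert(sizeof full_handlers / sizeof full_handlers[0] == 8,
              "one entry per opcode value");

int dispatch_one_load(uint8_t opcode, int x)
{
    return full_handlers[opcode & 7](x);      /* the only indexed load */
}
```

Both functions compute the same result for every opcode. The second trades a larger table for one fewer dependent load on the dispatch path.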

markshannon · 2022-04-21T09:12:26Z

Looks good to me

vstinner · 2022-04-25T10:00:49Z

Oh wow, that's a simple and clever optimization! Great that it helps MSVC to optimize Python on Windows!

vstinner · 2022-04-25T10:01:17Z

Follow-up to clean the public API: #91906

Make MSVC generate somewhat faster switch code

ece341c

gvanrossum requested a review from markshannon as a code owner April 20, 2022 01:28

bedevere-bot added the awaiting core review label Apr 20, 2022

gvanrossum changed the title ~~Make MSVC generate somewhat faster switch code~~ GH-91719: Make MSVC generate somewhat faster switch code Apr 20, 2022

gvanrossum mentioned this pull request Apr 20, 2022

Measure Windows performance (and improve if lacking) faster-cpython/ideas#321

Closed

gvanrossum added the skip news label Apr 20, 2022

Make opcode and use_tracing uint8_t

51ec108

erlend-aasland reviewed Apr 20, 2022

gvanrossum added 2 commits April 20, 2022 15:32

Fix/move comment about opcode and switch

6603d85

No need to check for opcode 255 any more

558d2c4

gvanrossum merged commit f8dc618 into python:main Apr 21, 2022

bedevere-bot removed the awaiting core review label Apr 21, 2022

gvanrossum mentioned this pull request Apr 23, 2022

Improve performance for switch in ceval.c when using MSVC #91719

Closed

neonene mentioned this pull request Apr 23, 2022

Performance regression 3.10b1: inlining issue in the big _PyEval_EvalFrameDefault() function with Visual Studio (MSC) #89279

Closed

gvanrossum deleted the dummy-cases branch August 7, 2022 16:01
