Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-91719: Make MSVC generate somewhat faster switch code #91718

Merged
merged 4 commits into from
Apr 21, 2022

Conversation

gvanrossum
Copy link
Member

Apparently a switch on an 8-bit quantity where all cases are
present generates a more efficient jump (doing only one indexed
memory load instead of two).

See faster-cpython/ideas#321 (comment)

Apparently a switch on an 8-bit quantity where all cases are
present generates a more efficient jump (doing only one indexed
memory load instead of two).

See faster-cpython/ideas#321 (comment)
@gvanrossum gvanrossum requested a review from markshannon as a code owner April 20, 2022 01:28
@gvanrossum gvanrossum changed the title Make MSVC generate somewhat faster switch code GH-91719: Make MSVC generate somewhat faster switch code Apr 20, 2022
@markshannon
Copy link
Member

Would it make more sense to redefine opcode to be a uint8_t, rather than casting it?

We should probably make use_tracing an 8 bit unsigned integer as well.

@gvanrossum
Copy link
Member Author

Would it make more sense to redefine opcode to be a uint8_t, rather than casting it?

Yeah, I had considered that, it makes sense. I'll confirm that it has the same effect.

We should probably make use_tracing an 8 bit unsigned integer as well.

I don't see why -- it's not used in a similar switch AFAICT, and it's not cramped for space in its struct. I assume for most other operations the cost of loading an int and loading a byte is effectively the same, since the CPU has to load a whole cache line (32 or 64 bytes) anyway.

@gvanrossum
Copy link
Member Author

I don't believe this needs a news blurb.

@markshannon
Copy link
Member

I don't see why -- it's not used in a similar switch AFAICT, and it's not cramped for space in its struct. I assume for most other operations the cost of loading an int and loading a byte is effectively the same, since the CPU has to load a whole cache line (32 or 64 bytes) anyway.

The dispatch sequence includes opcode |= cframe.use_tracing.
If cframe.use_tracing is a 32 bit int, then the compiler needs to add a cast.
If it is the same type as opcode, it does not.

@gvanrossum
Copy link
Member Author

The dispatch sequence includes opcode |= cframe.use_tracing.
If cframe.use_tracing is a 32 bit int, then the compiler needs to add a cast.
If it is the same type as opcode, it does not.

Okay, I'll make that change.

@gvanrossum
Copy link
Member Author

@markshannon, please re-review. I confirmed that the switch still uses a single indirection (goto *(base + offset_table[opcode])). I also found that the opcode |= use_tracing is a single instruction (but maybe it always was one?).

@markshannon
Copy link
Member

Looks good to me

@vstinner
Copy link
Member

Oh wow, that's a simple and clever optimization! Great that it helps MSVC to optimize Python on Windows!

@vstinner
Copy link
Member

Follow-up fo clean the public API: #91906

@gvanrossum gvanrossum deleted the dummy-cases branch August 7, 2022 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants