
Look into string sharing for deep-frozen code #218

Closed
gvanrossum opened this issue Jan 10, 2022 · 26 comments

Comments

@gvanrossum
Collaborator

gvanrossum commented Jan 10, 2022

When code objects are created by the compiler or by marshal, certain strings are interned using PyUnicode_InternInPlace(). But code objects obtained from the deep-freeze process do not do this.

While strings are deduped within a module, certain strings (e.g. dunders, 'name', 'self') occur frequently across different modules, so the deep-frozen code objects use up slightly more space, and certain operations are slightly slower (e.g. comparing two strings that are both interned is a simple pointer comparison).
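The interned fast path can be seen from pure Python with sys.intern — a minimal sketch (object identity for equal strings is CPython behavior, not a language guarantee):

```python
import sys

# Build two equal strings at runtime so the compiler cannot fold them
# into a single shared constant.
a = "".join(["__", "init", "__"])
b = "".join(["__", "init", "__"])

print(a == b)   # True: equal by value
print(a is b)   # False: distinct objects, comparison must walk the characters

# After interning, both names refer to the one canonical object, so an
# equality check reduces to a pointer comparison.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)   # True
```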

We could do a number of different things (or combine several):

  • Add PyUnicode_InternInPlace() calls to the "get toplevel" function in each deepfrozen .c file. This would still waste the space though.
  • Merge all deepfrozen files into a single file and dedupe strings when that file is written. Requires changing the deepfreeze build procedures for Win and Unix. An advantage is that we could dedupe other things (basically all constants, even bytecode) this way. (@markshannon seems to lean towards this one.)
  • Give strings that are likely candidates (i.e., that look like ASCII identifiers -- see all_name_chars() in codeobject.c) an external name with "weak linkage" so that the linker can dedupe them. (Props to @lpereira for this one.)
  • For strings that occur in the array of known _Py_Identifiers, replace the string with a reference into that array, for pure savings. (@ericsnowcurrently has more details; IIRC this array isn't in "main" yet.)

[I am not planning to attack this any time soon, so if somebody wants to tackle this, go ahead and assign to yourself.]
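The merge-and-dedupe option above can be sketched in Python (all names here are hypothetical; the real deepfreeze generator emits C, but the bookkeeping would look similar):

```python
# Toy sketch of cross-module constant deduplication: when all frozen
# modules are emitted into one generated file, each distinct constant
# is emitted once and later occurrences reuse the first symbol.
# The names dedup_constants / const_N are made up for illustration.

def dedup_constants(modules):
    emitted = {}   # constant key -> symbol name of its first emission
    layout = []    # (module, constant, symbol, is_new)
    for mod_name, constants in modules.items():
        for const in constants:
            key = (type(const).__name__, const)  # so 1 and 1.0 don't collide
            if key not in emitted:
                emitted[key] = f"const_{len(emitted)}"
                layout.append((mod_name, const, emitted[key], True))
            else:
                layout.append((mod_name, const, emitted[key], False))
    return layout

modules = {
    "os": ["__init__", "name", "self"],
    "abc": ["__init__", "self", "register"],
}
for mod, const, sym, is_new in dedup_constants(modules):
    print(mod, const, sym, "emit" if is_new else "reuse")
```

Here "__init__" and "self" are each emitted once and reused by the second module, which is the space saving being discussed.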

@oraluben

oraluben commented Jan 11, 2022

FYI: in our mmap mechanism, we observe a ~10% slowdown in some complex benchmarks when the fast path of the dict lookup fails because of copied strings. We're in the middle of adding PyUnicode_InternInPlace calls when loading the archive.

Update: some numbers generated by this simple test case https://github.com/oraluben/pyperformance/blob/cd7050101e0c32c0e3fa0359785a4cc28530e3ed/pyperformance/data-files/benchmarks/bm_module_load_attr/run_benchmark.py

*-raw.json refers to clean CPython and *-cds.json to our mmap result.

before:

$pyperf compare_to pyperformance-3.9-raw.json pyperformance-3.9-cds.json pyperformance-3.10-raw.json pyperformance-3.10-cds.json --table
+------------------+-----------------------+-----------------------+------------------------+------------------------+
| Benchmark        | pyperformance-3.9-raw | pyperformance-3.9-cds | pyperformance-3.10-raw | pyperformance-3.10-cds |
+==================+=======================+=======================+========================+========================+
| module_load_attr | 163 ms                | 183 ms: 1.12x slower  | 149 ms: 1.10x faster   | 181 ms: 1.11x slower   |
+------------------+-----------------------+-----------------------+------------------------+------------------------+

after (strings in mapped memory are re-interned):

$pyperf compare_to pyperformance-3.9-raw.json pyperformance-3.9-cds.json pyperformance-3.10-raw.json pyperformance-3.10-cds.json --table
+------------------+-----------------------+-----------------------+------------------------+------------------------+
| Benchmark        | pyperformance-3.9-raw | pyperformance-3.9-cds | pyperformance-3.10-raw | pyperformance-3.10-cds |
+==================+=======================+=======================+========================+========================+
| module_load_attr | 166 ms                | not significant       | 148 ms: 1.12x faster   | 150 ms: 1.11x faster   |
+------------------+-----------------------+-----------------------+------------------------+------------------------+

@kumaraditya303

kumaraditya303 commented Jan 12, 2022

Merge all deepfrozen files into a single file and dedupe strings when that file is written. Requires changing the deepfreeze build procedures for Win and Unix. An advantage is that we could dedupe other things (basically all constants, even bytecode) this way. (@markshannon seems to lean towards this one.)

I liked @markshannon's idea and got it working on Linux; it reduced the size of the generated files by around 1.2 Mb since it shares the constants etc. Here is the branch (https://github.com/kumaraditya303/cpython/tree/deepfreeze); next I'll see how to get it working on Windows. (EDIT: It should be Mb, not MB.)

@kumaraditya303

Status Update: @gvanrossum I have got it working on Linux and Windows, so should I create a PR?

@pxeger

pxeger commented Jan 12, 2022

@oraluben the line numbers in that link will break when dictobject.c changes. Here's a permalink (once you've highlighted the code range, click the three dots to the left of the line numbers and select Copy Permalink):

https://github.com/python/cpython/blob/be578e0c063dad1dbb273f86d5bc77e4e6f14583/Objects/dictobject.c#L794-L795

@gvanrossum
Collaborator Author

Status Update: @gvanrossum I have got it working on Linux and Windows, so should I create a PR?

Sounds great! Let's make it a draft PR so I can try it out.

@kumaraditya303

@gvanrossum Created PR here python/cpython#30572

@gvanrossum
Collaborator Author

I see you already submitted a PR and created an issue (https://bugs.python.org/issue46430) to intern strings.

We need another canonicalization: "small integers" should use their cached equivalent.
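The small-integer cache being referred to can be illustrated quickly — CPython preallocates the integers -5 through 256, which is an implementation detail rather than a language guarantee:

```python
# Any computation producing a value in [-5, 256] yields the one cached
# object on CPython, so identity comparison succeeds.
x = 200 + 56        # 256: inside the cached range
y = int("256")
print(x is y)       # True on CPython

# Outside the cache, equal values are typically distinct objects.
a = 1000 + 1
b = int("1001")
print(a is b)       # typically False on CPython
```

Deep-frozen code that emits its own integer objects for values in this range would miss the cache, which is what the canonicalization above fixes.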

@kumaraditya303

We need another canonicalization: "small integers" should use their cached equivalent.

That was much simpler. See python/cpython#30715

@kumaraditya303

Since most of what is mentioned in the issue is done, can the issue be updated with a checklist of all the items, marking what is already done?

@gvanrossum
Collaborator Author

Why don't you summarize what you think still needs to be done?

@gvanrossum
Collaborator Author

Thanks for the list! Go ahead with the tuple singleton. The other PR is still in flux, so we have to wait.

@shihai1991

  • Add PyUnicode_InternInPlace() calls to the "get toplevel" function in each deepfrozen .c file. This would still waste the space though.

I see you already submitted a PR and created an issue (https://bugs.python.org/issue46430) to intern strings.

I created a demo in python/cpython#31154 to test the benchmark.
Using the interned strings looks better :)

$ python3 -m pyperf compare_to main.json interned.json --table
+-------------------------+---------+-----------------------+
| Benchmark               | main    | interned              |
+=========================+=========+=======================+
| 2to3                    | 353 ms  | 316 ms: 1.12x faster  |
+-------------------------+---------+-----------------------+
| chaos                   | 98.8 ms | 83.9 ms: 1.18x faster |
+-------------------------+---------+-----------------------+
| deltablue               | 5.70 ms | 4.97 ms: 1.15x faster |
+-------------------------+---------+-----------------------+
| fannkuch                | 526 ms  | 449 ms: 1.17x faster  |
+-------------------------+---------+-----------------------+
| float                   | 107 ms  | 97.8 ms: 1.10x faster |
+-------------------------+---------+-----------------------+
| go                      | 190 ms  | 172 ms: 1.10x faster  |
+-------------------------+---------+-----------------------+
| hexiom                  | 8.60 ms | 7.59 ms: 1.13x faster |
+-------------------------+---------+-----------------------+
| json_dumps              | 15.1 ms | 13.8 ms: 1.09x faster |
+-------------------------+---------+-----------------------+
| json_loads              | 35.7 us | 28.8 us: 1.24x faster |
+-------------------------+---------+-----------------------+
| logging_format          | 9.01 us | 7.38 us: 1.22x faster |
+-------------------------+---------+-----------------------+
| logging_silent          | 139 ns  | 122 ns: 1.14x faster  |
+-------------------------+---------+-----------------------+
| logging_simple          | 7.96 us | 6.62 us: 1.20x faster |
+-------------------------+---------+-----------------------+
| mako                    | 13.5 ms | 12.0 ms: 1.12x faster |
+-------------------------+---------+-----------------------+
| meteor_contest          | 127 ms  | 118 ms: 1.08x faster  |
+-------------------------+---------+-----------------------+
| nbody                   | 124 ms  | 121 ms: 1.03x faster  |
+-------------------------+---------+-----------------------+
| nqueens                 | 113 ms  | 96.3 ms: 1.17x faster |
+-------------------------+---------+-----------------------+
| pathlib                 | 22.5 ms | 21.3 ms: 1.06x faster |
+-------------------------+---------+-----------------------+
| pickle                  | 14.1 us | 11.4 us: 1.24x faster |
+-------------------------+---------+-----------------------+
| pickle_dict             | 31.8 us | 27.7 us: 1.15x faster |
+-------------------------+---------+-----------------------+
| pickle_list             | 4.85 us | 4.35 us: 1.11x faster |
+-------------------------+---------+-----------------------+
| pickle_pure_python      | 433 us  | 380 us: 1.14x faster  |
+-------------------------+---------+-----------------------+
| pidigits                | 235 ms  | 220 ms: 1.07x faster  |
+-------------------------+---------+-----------------------+
| pyflate                 | 578 ms  | 517 ms: 1.12x faster  |
+-------------------------+---------+-----------------------+
| python_startup          | 11.5 ms | 15.7 ms: 1.36x slower |
+-------------------------+---------+-----------------------+
| python_startup_no_site  | 8.30 ms | 7.96 ms: 1.04x faster |
+-------------------------+---------+-----------------------+
| raytrace                | 425 ms  | 366 ms: 1.16x faster  |
+-------------------------+---------+-----------------------+
| regex_compile           | 182 ms  | 158 ms: 1.15x faster  |
+-------------------------+---------+-----------------------+
| regex_dna               | 221 ms  | 223 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| regex_effbot            | 3.66 ms | 3.44 ms: 1.06x faster |
+-------------------------+---------+-----------------------+
| regex_v8                | 29.5 ms | 28.0 ms: 1.06x faster |
+-------------------------+---------+-----------------------+
| richards                | 68.0 ms | 57.3 ms: 1.19x faster |
+-------------------------+---------+-----------------------+
| scimark_fft             | 448 ms  | 370 ms: 1.21x faster  |
+-------------------------+---------+-----------------------+
| scimark_lu              | 150 ms  | 130 ms: 1.15x faster  |
+-------------------------+---------+-----------------------+
| scimark_monte_carlo     | 88.2 ms | 79.1 ms: 1.12x faster |
+-------------------------+---------+-----------------------+
| scimark_sor             | 152 ms  | 140 ms: 1.09x faster  |
+-------------------------+---------+-----------------------+
| scimark_sparse_mat_mult | 6.40 ms | 4.91 ms: 1.30x faster |
+-------------------------+---------+-----------------------+
| spectral_norm           | 137 ms  | 112 ms: 1.22x faster  |
+-------------------------+---------+-----------------------+
| sqlite_synth            | 4.39 us | 4.03 us: 1.09x faster |
+-------------------------+---------+-----------------------+
| telco                   | 8.30 ms | 7.39 ms: 1.12x faster |
+-------------------------+---------+-----------------------+
| unpack_sequence         | 50.0 ns | 50.9 ns: 1.02x slower |
+-------------------------+---------+-----------------------+
| unpickle                | 19.5 us | 16.0 us: 1.22x faster |
+-------------------------+---------+-----------------------+
| unpickle_list           | 5.21 us | 5.96 us: 1.14x slower |
+-------------------------+---------+-----------------------+
| unpickle_pure_python    | 348 us  | 282 us: 1.23x faster  |
+-------------------------+---------+-----------------------+
| xml_etree_parse         | 194 ms  | 183 ms: 1.06x faster  |
+-------------------------+---------+-----------------------+
| xml_etree_iterparse     | 128 ms  | 123 ms: 1.05x faster  |
+-------------------------+---------+-----------------------+
| xml_etree_generate      | 111 ms  | 93.6 ms: 1.19x faster |
+-------------------------+---------+-----------------------+
| xml_etree_process       | 78.3 ms | 66.1 ms: 1.18x faster |
+-------------------------+---------+-----------------------+
| Geometric mean          | (ref)   | 1.11x faster          |
+-------------------------+---------+-----------------------+

@JelleZijlstra
Contributor

That's an amazing speed improvement. Are you sure those numbers are right?

@shihai1991

That's an amazing speed improvement. Are you sure those numbers are right?

I am sure that I used the same local VM, the same compiler commands, and the same benchmarks (using pyperformance) :).
Maybe you could check it again if you have free time ;)

And I tested some module import benchmarks separately. It looks like one reason the pyperformance benchmarks speed up is that module imports become quicker? (I am not sure.)

# ./python -m pyperf timeit 'import abc'
$ python3 -m pyperf compare_to main_import.json interned_import.json --table
+-----------+-------------+-----------------------+
| Benchmark | main_import | interned_import       |
+===========+=============+=======================+
| timeit    | 116 ns      | 75.9 ns: 1.52x faster |
+-----------+-------------+-----------------------+

# ./python -m pyperf timeit 'import os'
$ python3 -m pyperf compare_to main_import_os.json interned_import_os.json --table
+-----------+----------------+-----------------------+
| Benchmark | main_import_os | interned_import_os    |
+===========+================+=======================+
| timeit    | 116 ns         | 74.7 ns: 1.55x faster |
+-----------+----------------+-----------------------+

@JelleZijlstra
Contributor

Thanks! I don't have things set up to run benchmarks at the moment, sorry.

It's definitely plausible to me that this optimization makes importing frozen modules a lot faster. Note though that in your tests, the modules you're testing are almost certainly already imported, so really you're just testing the path where the module is already in sys.modules.
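The sys.modules shortcut described here can be checked directly — a minimal sketch showing that a repeated import is just a cache lookup:

```python
import sys
import abc                 # first import: finds and executes the module

cached = sys.modules["abc"]

# A second import statement does not run the finder/loader machinery
# again; it fetches the existing module object from the sys.modules
# cache and binds the name, which is why the timeit numbers are so low.
import abc
print(abc is cached)       # True: same module object both times
```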

But it's surprising that it makes the high-level benchmarks faster. For example, scimark_sparse_mat_mult is 1.30x faster in your result, but that benchmark is just doing a bunch of math on integers (https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_scimark/run_benchmark.py#L178).

@methane

methane commented Feb 6, 2022

Without deepfreeze, compile()+marshal() merged constants and interned names in pyc files.
If the patch really makes the benchmarks faster, it means deepfreeze caused a massive performance regression.
I can not believe it...

@gvanrossum
Collaborator Author

I don’t believe it; I will need to run the benchmark myself.

@methane

methane commented Feb 7, 2022

main (8726067) vs python/cpython#31154 (067ad2)

$ main/python -m pyperf compare_to main.json pr31154.json -G --min-speed=1
Slower (19):
- regex_effbot: 3.76 ms +- 0.03 ms -> 3.94 ms +- 0.02 ms: 1.05x slower
- regex_dna: 232 ms +- 1 ms -> 240 ms +- 1 ms: 1.04x slower
- logging_format: 10.1 us +- 0.1 us -> 10.4 us +- 0.2 us: 1.03x slower
- html5lib: 85.4 ms +- 3.3 ms -> 88.0 ms +- 3.3 ms: 1.03x slower
- sqlalchemy_imperative: 27.8 ms +- 1.2 ms -> 28.6 ms +- 1.5 ms: 1.03x slower
- pickle_dict: 38.9 us +- 0.2 us -> 39.9 us +- 0.2 us: 1.03x slower
- scimark_lu: 155 ms +- 1 ms -> 158 ms +- 3 ms: 1.02x slower
- pickle_list: 5.54 us +- 0.06 us -> 5.63 us +- 0.04 us: 1.02x slower
- unpickle: 18.9 us +- 0.2 us -> 19.2 us +- 0.9 us: 1.02x slower
- regex_v8: 29.0 ms +- 0.1 ms -> 29.4 ms +- 0.1 ms: 1.02x slower
- logging_silent: 141 ns +- 1 ns -> 143 ns +- 1 ns: 1.01x slower
- json_dumps: 16.3 ms +- 0.4 ms -> 16.5 ms +- 0.3 ms: 1.01x slower
- go: 180 ms +- 2 ms -> 182 ms +- 2 ms: 1.01x slower
- raytrace: 425 ms +- 3 ms -> 431 ms +- 5 ms: 1.01x slower
- logging_simple: 9.03 us +- 0.18 us -> 9.14 us +- 0.19 us: 1.01x slower
- meteor_contest: 124 ms +- 0 ms -> 125 ms +- 1 ms: 1.01x slower
- json_loads: 34.3 us +- 0.3 us -> 34.7 us +- 0.5 us: 1.01x slower
- chameleon: 9.13 ms +- 0.09 ms -> 9.24 ms +- 0.11 ms: 1.01x slower
- sympy_sum: 227 ms +- 2 ms -> 230 ms +- 2 ms: 1.01x slower

Faster (10):
- sqlite_synth: 3.32 us +- 0.09 us -> 3.23 us +- 0.07 us: 1.03x faster
- spectral_norm: 147 ms +- 4 ms -> 144 ms +- 1 ms: 1.02x faster
- unpickle_list: 5.97 us +- 0.04 us -> 5.87 us +- 0.05 us: 1.02x faster
- telco: 8.68 ms +- 0.31 ms -> 8.55 ms +- 0.21 ms: 1.01x faster
- xml_etree_process: 78.2 ms +- 1.0 ms -> 77.1 ms +- 0.8 ms: 1.01x faster
- chaos: 97.6 ms +- 0.8 ms -> 96.4 ms +- 0.9 ms: 1.01x faster
- django_template: 52.0 ms +- 0.6 ms -> 51.4 ms +- 0.6 ms: 1.01x faster
- dulwich_log: 99.5 ms +- 0.9 ms -> 98.4 ms +- 0.6 ms: 1.01x faster
- float: 95.2 ms +- 1.3 ms -> 94.2 ms +- 1.2 ms: 1.01x faster
- scimark_sor: 161 ms +- 2 ms -> 159 ms +- 1 ms: 1.01x faster

Benchmark hidden because not significant (30): 2to3, crypto_pyaes, deltablue, fannkuch, hexiom, mako, nbody, nqueens, pathlib, pickle, pickle_pure_python, pidigits, pyflate, python_startup, python_startup_no_site, regex_compile, richards, scimark_fft, scimark_monte_carlo, scimark_sparse_mat_mult, sqlalchemy_declarative, sympy_expand, sympy_integrate, sympy_str, tornado_http, unpack_sequence, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.00x slower

@kumaraditya303

kumaraditya303 commented Mar 12, 2022

All of the ideas in the to-do list are implemented by now.

@gvanrossum
Collaborator Author

Great. Even the “weak linkage” idea?

@kumaraditya303

Great. Even the “weak linkage” idea?

No, I was talking about this todo list #218 (comment)

@gvanrossum
Collaborator Author

Okay, thanks so much!

There is one new request: a flag one can pass to the Makefile or configure script to disable (deep)freezing (except freezing the importlib bootstrap), for rare platforms that need to squeeze space.

@gvanrossum
Collaborator Author

Closing this, declaring victory. Thanks @kumaraditya303! We can open a bpo issue for the flag to disable.

Repository owner moved this from Todo to Done in Fancy CPython Board Mar 14, 2022
@kumaraditya303

There is one new request: a flag one can pass to the Makefile or configure script to disable (deep)freezing (except freezing the importlib bootstrap), for rare platforms that need to squeeze space.

I don't have much experience with the automake/autoconf machinery, so I would defer it to whoever wants to implement it for their rare platform.

@gvanrossum
Collaborator Author

@kumaraditya303 All I was asking was a command-line flag for freeze_modules.py to skip all categories except the importlib bootstrap.
