
Look into string sharing for deep-frozen code #218

Closed
gvanrossum opened this issue Jan 10, 2022 · 26 comments

Comments

@gvanrossum
Collaborator

gvanrossum commented Jan 10, 2022

When code objects are created by the compiler or by marshal, certain strings are interned using PyUnicode_InternInPlace(). But code objects obtained from the deep-freeze process do not do this.

While strings are deduped within a module, certain strings (e.g. dunders, 'name', 'self') occur frequently across different modules, so the deep-frozen code objects use up slightly more space, and certain operations are slightly slower (e.g. comparing two strings that are both interned is a simple pointer comparison).
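The interned fast path can be seen from pure Python with sys.intern — a minimal sketch (object identity for equal strings is CPython behavior, not a language guarantee):

```python
import sys

# Build two equal strings at runtime so the compiler cannot fold them
# into a single shared constant.
a = "".join(["__", "init", "__"])
b = "".join(["__", "init", "__"])

print(a == b)   # True: equal by value
print(a is b)   # False: distinct objects, comparison must walk the characters

# After interning, both names refer to the one canonical object, so an
# equality check reduces to a pointer comparison.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)   # True
```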

We could do a number of different things (or combine several):

  • Add PyUnicode_InternInPlace() calls to the "get toplevel" function in each deepfrozen .c file. This would still waste the space though.
  • Merge all deepfrozen files into a single file and dedupe strings when that file is written. Requires changing the deepfreeze build procedures for Win and Unix. An advantage is that we could dedupe other things (basically all constants, even bytecode) this way. (@markshannon seems to lean towards this one.)
  • Give strings that are likely candidates (i.e., that look like ASCII identifiers -- see all_name_chars() in codeobject.c) an external name with "weak linkage" so that the linker can dedupe them. (Props to @lpereira for this one.)
  • For strings that occur in the array of known _Py_Identifiers, replace the string with a reference into that array, for pure savings. (@ericsnowcurrently has more details; IIRC this array isn't in "main" yet.)

[I am not planning to attack this any time soon, so if somebody wants to tackle this, go ahead and assign to yourself.]
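The merge-and-dedupe option above can be sketched in Python (all names here are hypothetical; the real deepfreeze generator emits C, but the bookkeeping would look similar):

```python
# Toy sketch of cross-module constant deduplication: when all frozen
# modules are emitted into one generated file, each distinct constant
# is emitted once and later occurrences reuse the first symbol.
# The names dedup_constants / const_N are made up for illustration.

def dedup_constants(modules):
    emitted = {}   # constant key -> symbol name of its first emission
    layout = []    # (module, constant, symbol, is_new)
    for mod_name, constants in modules.items():
        for const in constants:
            key = (type(const).__name__, const)  # so 1 and 1.0 don't collide
            if key not in emitted:
                emitted[key] = f"const_{len(emitted)}"
                layout.append((mod_name, const, emitted[key], True))
            else:
                layout.append((mod_name, const, emitted[key], False))
    return layout

modules = {
    "os": ["__init__", "name", "self"],
    "abc": ["__init__", "self", "register"],
}
for mod, const, sym, is_new in dedup_constants(modules):
    print(mod, const, sym, "emit" if is_new else "reuse")
```

Here "__init__" and "self" are each emitted once and reused by the second module, which is the space saving being discussed.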

@oraluben

oraluben commented Jan 11, 2022

FYI: in our mmap mechanism, we observe a ~10% slowdown in some complex benchmarks when the fast path of the dict lookup fails because of copied strings. We're in the middle of adding PyUnicode_InternInPlace calls when loading the archive.

Update: some numbers generated by this simple test case https://github.com/oraluben/pyperformance/blob/cd7050101e0c32c0e3fa0359785a4cc28530e3ed/pyperformance/data-files/benchmarks/bm_module_load_attr/run_benchmark.py

*-raw.json refers to clean CPython and *-cds.json to our mmap result.

before:

$pyperf compare_to pyperformance-3.9-raw.json pyperformance-3.9-cds.json pyperformance-3.10-raw.json pyperformance-3.10-cds.json --table
+------------------+-----------------------+-----------------------+------------------------+------------------------+
| Benchmark        | pyperformance-3.9-raw | pyperformance-3.9-cds | pyperformance-3.10-raw | pyperformance-3.10-cds |
+==================+=======================+=======================+========================+========================+
| module_load_attr | 163 ms                | 183 ms: 1.12x slower  | 149 ms: 1.10x faster   | 181 ms: 1.11x slower   |
+------------------+-----------------------+-----------------------+------------------------+------------------------+

after (strings in mapped memory are re-interned):

$pyperf compare_to pyperformance-3.9-raw.json pyperformance-3.9-cds.json pyperformance-3.10-raw.json pyperformance-3.10-cds.json --table
+------------------+-----------------------+-----------------------+------------------------+------------------------+
| Benchmark        | pyperformance-3.9-raw | pyperformance-3.9-cds | pyperformance-3.10-raw | pyperformance-3.10-cds |
+==================+=======================+=======================+========================+========================+
| module_load_attr | 166 ms                | not significant       | 148 ms: 1.12x faster   | 150 ms: 1.11x faster   |
+------------------+-----------------------+-----------------------+------------------------+------------------------+

@kumaraditya303

kumaraditya303 commented Jan 12, 2022

Merge all deepfrozen files into a single file and dedupe strings when that file is written. Requires changing the deepfreeze build procedures for Win and Unix. An advantage is that we could dedupe other things (basically all constants, even bytecode) this way. (@markshannon seems to lean towards this one.)

I liked @markshannon's idea and got it working on Linux; it reduced the size of the generated files by around 1.2 Mb since it shares the constants etc. Here is the branch (https://github.com/kumaraditya303/cpython/tree/deepfreeze); next I'll see how to get it working on Windows. (EDIT: It should be Mb, not MB.)

@kumaraditya303

Status Update: @gvanrossum I have got it working on Linux and Windows, so should I create a PR?

@pxeger

pxeger commented Jan 12, 2022

@oraluben the line numbers in that link will break when dictobject.c changes. Here's a permalink (once you've highlighted the code range, click the three dots to the left of the line numbers and select Copy Permalink):

https://github.com/python/cpython/blob/be578e0c063dad1dbb273f86d5bc77e4e6f14583/Objects/dictobject.c#L794-L795

@gvanrossum
Collaborator Author

Status Update: @gvanrossum I have got it working on Linux and Windows, so should I create a PR?

Sounds great! Let's make it a draft PR so I can try it out.

@kumaraditya303

@gvanrossum Created PR here python/cpython#30572

@gvanrossum
Collaborator Author

I see you already submitted a PR and created an issue (https://bugs.python.org/issue46430) to intern strings.

We need another canonicalization: "small integers" should use their cached equivalent.
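The small-integer cache being referred to can be illustrated quickly — CPython preallocates the integers -5 through 256, which is an implementation detail rather than a language guarantee:

```python
# Any computation producing a value in [-5, 256] yields the one cached
# object on CPython, so identity comparison succeeds.
x = 200 + 56        # 256: inside the cached range
y = int("256")
print(x is y)       # True on CPython

# Outside the cache, equal values are typically distinct objects.
a = 1000 + 1
b = int("1001")
print(a is b)       # typically False on CPython
```

Deep-frozen code that emits its own integer objects for values in this range would miss the cache, which is what the canonicalization above fixes.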

@kumaraditya303

We need another canonicalization: "small integers" should use their cached equivalent.

That was much simpler. See python/cpython#30715

@kumaraditya303

Since most of what is mentioned in the issue is done, can the issue be updated with a checklist of all the items, marking what is already done?

@gvanrossum
Collaborator Author

Why don't you summarize what you think still needs to be done?

@gvanrossum
Collaborator Author

Thanks for the list! Go ahead with the tuple singleton. The other PR is still in flux, so we have to wait.

@shihai1991

  • Add PyUnicode_InternInPlace() calls to the "get toplevel" function in each deepfrozen .c file. This would still waste the space though.

I see you already submitted a PR and created an issue (https://bugs.python.org/issue46430) to intern strings.

I created a demo in python/cpython#31154 to test the benchmark.
Using the interned strings looks better :)

$ python3 -m pyperf compare_to main.json interned.json --table
+-------------------------+---------+-----------------------+
| Benchmark               | main    | interned              |
+=========================+=========+=======================+
| 2to3                    | 353 ms  | 316 ms: 1.12x faster  |
+-------------------------+---------+-----------------------+
| chaos                   | 98.8 ms | 83.9 ms: 1.18x faster |
+-------------------------+---------+-----------------------+
| deltablue               | 5.70 ms | 4.97 ms: 1.15x faster |
+-------------------------+---------+-----------------------+
| fannkuch                | 526 ms  | 449 ms: 1.17x faster  |
+-------------------------+---------+-----------------------+
| float                   | 107 ms  | 97.8 ms: 1.10x faster |
+-------------------------+---------+-----------------------+
| go                      | 190 ms  | 172 ms: 1.10x faster  |
+-------------------------+---------+-----------------------+
| hexiom                  | 8.60 ms | 7.59 ms: 1.13x faster |
+-------------------------+---------+-----------------------+
| json_dumps              | 15.1 ms | 13.8 ms: 1.09x faster |
+-------------------------+---------+-----------------------+
| json_loads              | 35.7 us | 28.8 us: 1.24x faster |
+-------------------------+---------+-----------------------+
| logging_format          | 9.01 us | 7.38 us: 1.22x faster |
+-------------------------+---------+-----------------------+
| logging_silent          | 139 ns  | 122 ns: 1.14x faster  |
+-------------------------+---------+-----------------------+
| logging_simple          | 7.96 us | 6.62 us: 1.20x faster |
+-------------------------+---------+-----------------------+
| mako                    | 13.5 ms | 12.0 ms: 1.12x faster |
+-------------------------+---------+-----------------------+
| meteor_contest          | 127 ms  | 118 ms: 1.08x faster  |
+-------------------------+---------+-----------------------+
| nbody                   | 124 ms  | 121 ms: 1.03x faster  |
+-------------------------+---------+-----------------------+
| nqueens                 | 113 ms  | 96.3 ms: 1.17x faster |
+-------------------------+---------+-----------------------+
| pathlib                 | 22.5 ms | 21.3 ms: 1.06x faster |
+-------------------------+---------+-----------------------+
| pickle                  | 14.1 us | 11.4 us: 1.24x faster |
+-------------------------+---------+-----------------------+
| pickle_dict             | 31.8 us | 27.7 us: 1.15x faster |
+-------------------------+---------+-----------------------+
| pickle_list             | 4.85 us | 4.35 us: 1.11x faster |
+-------------------------+---------+-----------------------+
| pickle_pure_python      | 433 us  | 380 us: 1.14x faster  |
+-------------------------+---------+-----------------------+
| pidigits                | 235 ms  | 220 ms: 1.07x faster  |
+-------------------------+---------+-----------------------+
| pyflate                 | 578 ms  | 517 ms: 1.12x faster  |
+-------------------------+---------+-----------------------+
| python_startup          | 11.5 ms | 15.7 ms: 1.36x slower |
+-------------------------+---------+-----------------------+
| python_startup_no_site  | 8.30 ms | 7.96 ms: 1.04x faster |
+-------------------------+---------+-----------------------+
| raytrace                | 425 ms  | 366 ms: 1.16x faster  |
+-------------------------+---------+-----------------------+
| regex_compile           | 182 ms  | 158 ms: 1.15x faster  |
+-------------------------+---------+-----------------------+
| regex_dna               | 221 ms  | 223 ms: 1.01x slower  |
+-------------------------+---------+-----------------------+
| regex_effbot            | 3.66 ms | 3.44 ms: 1.06x faster |
+-------------------------+---------+-----------------------+
| regex_v8                | 29.5 ms | 28.0 ms: 1.06x faster |
+-------------------------+---------+-----------------------+
| richards                | 68.0 ms | 57.3 ms: 1.19x faster |
+-------------------------+---------+-----------------------+
| scimark_fft             | 448 ms  | 370 ms: 1.21x faster  |
+-------------------------+---------+-----------------------+
| scimark_lu              | 150 ms  | 130 ms: 1.15x faster  |
+-------------------------+---------+-----------------------+
| scimark_monte_carlo     | 88.2 ms | 79.1 ms: 1.12x faster |
+-------------------------+---------+-----------------------+
| scimark_sor             | 152 ms  | 140 ms: 1.09x faster  |
+-------------------------+---------+-----------------------+
| scimark_sparse_mat_mult | 6.40 ms | 4.91 ms: 1.30x faster |
+-------------------------+---------+-----------------------+
| spectral_norm           | 137 ms  | 112 ms: 1.22x faster  |
+-------------------------+---------+-----------------------+
| sqlite_synth            | 4.39 us | 4.03 us: 1.09x faster |
+-------------------------+---------+-----------------------+
| telco                   | 8.30 ms | 7.39 ms: 1.12x faster |
+-------------------------+---------+-----------------------+
| unpack_sequence         | 50.0 ns | 50.9 ns: 1.02x slower |
+-------------------------+---------+-----------------------+
| unpickle                | 19.5 us | 16.0 us: 1.22x faster |
+-------------------------+---------+-----------------------+
| unpickle_list           | 5.21 us | 5.96 us: 1.14x slower |
+-------------------------+---------+-----------------------+
| unpickle_pure_python    | 348 us  | 282 us: 1.23x faster  |
+-------------------------+---------+-----------------------+
| xml_etree_parse         | 194 ms  | 183 ms: 1.06x faster  |
+-------------------------+---------+-----------------------+
| xml_etree_iterparse     | 128 ms  | 123 ms: 1.05x faster  |
+-------------------------+---------+-----------------------+
| xml_etree_generate      | 111 ms  | 93.6 ms: 1.19x faster |
+-------------------------+---------+-----------------------+
| xml_etree_process       | 78.3 ms | 66.1 ms: 1.18x faster |
+-------------------------+---------+-----------------------+
| Geometric mean          | (ref)   | 1.11x faster          |
+-------------------------+---------+-----------------------+

@JelleZijlstra
Contributor

That's an amazing speed improvement. Are you sure those numbers are right?

@shihai1991

That's an amazing speed improvement. Are you sure those numbers are right?

I am sure that I used the same local VM, the same compiler commands, and the same benchmarks (using pyperformance) :).
Maybe you could check it again if you have free time ;)

And I tested some module import benchmarks separately. It looks like one reason the pyperformance benchmarks speed up is that module imports become quicker? (I am not sure.)

# ./python -m pyperf timeit 'import abc'
$ python3 -m pyperf compare_to main_import.json interned_import.json --table
+-----------+-------------+-----------------------+
| Benchmark | main_import | interned_import       |
+===========+=============+=======================+
| timeit    | 116 ns      | 75.9 ns: 1.52x faster |
+-----------+-------------+-----------------------+

# ./python -m pyperf timeit 'import os'
$ python3 -m pyperf compare_to main_import_os.json interned_import_os.json --table
+-----------+----------------+-----------------------+
| Benchmark | main_import_os | interned_import_os    |
+===========+================+=======================+
| timeit    | 116 ns         | 74.7 ns: 1.55x faster |
+-----------+----------------+-----------------------+

@JelleZijlstra
Contributor

Thanks! I don't have things set up to run benchmarks at the moment, sorry.

It's definitely plausible to me that this optimization makes importing frozen modules a lot faster. Note though that in your tests, the modules you're testing are almost certainly already imported, so really you're just testing the path where the module is already in sys.modules.
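The sys.modules shortcut described here can be checked directly — a minimal sketch showing that a repeated import is just a cache lookup:

```python
import sys
import abc                 # first import: finds and executes the module

cached = sys.modules["abc"]

# A second import statement does not run the finder/loader machinery
# again; it fetches the existing module object from the sys.modules
# cache and binds the name, which is why the timeit numbers are so low.
import abc
print(abc is cached)       # True: same module object both times
```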

But it's surprising that it makes the high-level benchmarks faster. For example, scimark_sparse_mat_mult is 1.30x faster in your result, but that benchmark is just doing a bunch of math on integers (https://github.com/python/pyperformance/blob/main/pyperformance/data-files/benchmarks/bm_scimark/run_benchmark.py#L178).

@methane

methane commented Feb 6, 2022

Without deepfreeze, compile()+marshal() merged constants and interned names in pyc files.
If the patch really makes the benchmarks faster, it means deepfreeze caused a massive performance regression.
I can not believe it...

@gvanrossum
Collaborator Author

I don’t believe it; I will need to run the benchmark myself.

@methane

methane commented Feb 7, 2022

main (8726067) vs python/cpython#31154 (067ad2)

$ main/python -m pyperf compare_to main.json pr31154.json -G --min-speed=1
Slower (19):
- regex_effbot: 3.76 ms +- 0.03 ms -> 3.94 ms +- 0.02 ms: 1.05x slower
- regex_dna: 232 ms +- 1 ms -> 240 ms +- 1 ms: 1.04x slower
- logging_format: 10.1 us +- 0.1 us -> 10.4 us +- 0.2 us: 1.03x slower
- html5lib: 85.4 ms +- 3.3 ms -> 88.0 ms +- 3.3 ms: 1.03x slower
- sqlalchemy_imperative: 27.8 ms +- 1.2 ms -> 28.6 ms +- 1.5 ms: 1.03x slower
- pickle_dict: 38.9 us +- 0.2 us -> 39.9 us +- 0.2 us: 1.03x slower
- scimark_lu: 155 ms +- 1 ms -> 158 ms +- 3 ms: 1.02x slower
- pickle_list: 5.54 us +- 0.06 us -> 5.63 us +- 0.04 us: 1.02x slower
- unpickle: 18.9 us +- 0.2 us -> 19.2 us +- 0.9 us: 1.02x slower
- regex_v8: 29.0 ms +- 0.1 ms -> 29.4 ms +- 0.1 ms: 1.02x slower
- logging_silent: 141 ns +- 1 ns -> 143 ns +- 1 ns: 1.01x slower
- json_dumps: 16.3 ms +- 0.4 ms -> 16.5 ms +- 0.3 ms: 1.01x slower
- go: 180 ms +- 2 ms -> 182 ms +- 2 ms: 1.01x slower
- raytrace: 425 ms +- 3 ms -> 431 ms +- 5 ms: 1.01x slower
- logging_simple: 9.03 us +- 0.18 us -> 9.14 us +- 0.19 us: 1.01x slower
- meteor_contest: 124 ms +- 0 ms -> 125 ms +- 1 ms: 1.01x slower
- json_loads: 34.3 us +- 0.3 us -> 34.7 us +- 0.5 us: 1.01x slower
- chameleon: 9.13 ms +- 0.09 ms -> 9.24 ms +- 0.11 ms: 1.01x slower
- sympy_sum: 227 ms +- 2 ms -> 230 ms +- 2 ms: 1.01x slower

Faster (10):
- sqlite_synth: 3.32 us +- 0.09 us -> 3.23 us +- 0.07 us: 1.03x faster
- spectral_norm: 147 ms +- 4 ms -> 144 ms +- 1 ms: 1.02x faster
- unpickle_list: 5.97 us +- 0.04 us -> 5.87 us +- 0.05 us: 1.02x faster
- telco: 8.68 ms +- 0.31 ms -> 8.55 ms +- 0.21 ms: 1.01x faster
- xml_etree_process: 78.2 ms +- 1.0 ms -> 77.1 ms +- 0.8 ms: 1.01x faster
- chaos: 97.6 ms +- 0.8 ms -> 96.4 ms +- 0.9 ms: 1.01x faster
- django_template: 52.0 ms +- 0.6 ms -> 51.4 ms +- 0.6 ms: 1.01x faster
- dulwich_log: 99.5 ms +- 0.9 ms -> 98.4 ms +- 0.6 ms: 1.01x faster
- float: 95.2 ms +- 1.3 ms -> 94.2 ms +- 1.2 ms: 1.01x faster
- scimark_sor: 161 ms +- 2 ms -> 159 ms +- 1 ms: 1.01x faster

Benchmark hidden because not significant (30): 2to3, crypto_pyaes, deltablue, fannkuch, hexiom, mako, nbody, nqueens, pathlib, pickle, pickle_pure_python, pidigits, pyflate, python_startup, python_startup_no_site, regex_compile, richards, scimark_fft, scimark_monte_carlo, scimark_sparse_mat_mult, sqlalchemy_declarative, sympy_expand, sympy_integrate, sympy_str, tornado_http, unpack_sequence, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate

Geometric mean: 1.00x slower

@kumaraditya303

kumaraditya303 commented Mar 12, 2022

All of the ideas in the to-do list are implemented by now.

@gvanrossum
Collaborator Author

Great. Even the “weak linkage” idea?

@kumaraditya303

Great. Even the “weak linkage” idea?

No, I was talking about this todo list #218 (comment)

@gvanrossum
Collaborator Author

Okay, thanks so much!

There is one new request: a flag one can pass to the Makefile or configure script to disable (deep)freezing (except freezing the importlib bootstrap), for rare platforms that need to squeeze space.

@gvanrossum
Collaborator Author

Closing this, declaring victory. Thanks @kumaraditya303! We can open a bpo issue for the flag to disable.

Repository owner moved this from Todo to Done in Fancy CPython Board Mar 14, 2022
@kumaraditya303

There is one new request: a flag one can pass to the Makefile or configure script to disable (deep)freezing (except freezing the importlib bootstrap), for rare platforms that need to squeeze space.

I don't have much experience with the automake/autoconf machinery, so I would defer it to whoever wants to implement it for their rare platform.

@gvanrossum
Collaborator Author

@kumaraditya303 All I was asking was a command-line flag for freeze_modules.py to skip all categories except the importlib bootstrap.
