-
Notifications
You must be signed in to change notification settings - Fork 571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed the handling of escape sequences in the ntriples and nquads parsers #1663
Fixed the handling of escape sequences in the ntriples and nquads parsers #1663
Conversation
@nicholascar this is okay to merge, but I think it has some conflicts with #1656 and #1656 also contains some fixes for xfail tests that I added here, so I will rebase this once that is finished. The fix will remain as is though, changes will only occur in |
I have done some performance testing in here In summary, the new function is about the same speed in cases with fewer escape sequences, but when it is processing strings with many string escape sequeces (e.g. The string escape sequences are the escape sequences that were not being handled correctly before, so it would make sense that most of the cost comes in here. Output of the performance tests can be seen below: $ poetry run pytest ./tests/test_unescape_performance.py --benchmark-group-by param:test_strings_key
============================================================================ test session starts ============================================================================
platform linux -- Python 3.10.1, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/iwana/sw/d/github.com/aucampia/rdflib-scratchpad, configfile: pyproject.toml
plugins: subtests-0.5.0, cov-3.0.0, benchmark-3.4.1, profiling-1.7.0
collected 19 items
tests/test_unescape_performance.py ................... [100%]
---------- coverage: platform linux, python 3.10.1-final-0 -----------
Name Stmts Miss Cover Missing
--------------------------------------------------------------------------
src/aucampia/__init__.py 2 0 100%
src/aucampia/rdflib/__init__.py 2 0 100%
src/aucampia/rdflib/scratchpad/__init__.py 0 0 100%
src/aucampia/rdflib/scratchpad/_version.py 1 1 0% 1
src/aucampia/rdflib/scratchpad/cli.py 35 35 0% 5-103
src/aucampia/rdflib/scratchpad/unquote.py 65 5 92% 86, 111, 114-117
--------------------------------------------------------------------------
TOTAL 105 41 61%
---------------------------------------------------------------------------------- benchmark 'test_strings_key=few_escapes': 3 tests -----------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_performance[unquote_new-few_escapes] 20.6550 (1.0) 63.6000 (1.0) 23.9169 (1.0) 3.3959 (1.0) 22.9790 (1.0) 2.5099 (2.79) 1505;1214 41.8115 (1.0) 30025 1
test_performance[unquote_old-few_escapes] 31.2369 (1.51) 86.9200 (1.37) 32.7835 (1.37) 4.4484 (1.31) 31.7130 (1.38) 0.9000 (1.0) 613;1130 30.5031 (0.73) 19029 1
test_performance[unquote_validate-few_escapes] 56.5629 (2.74) 136.6610 (2.15) 58.6242 (2.45) 6.3523 (1.87) 57.1000 (2.48) 1.3730 (1.53) 516;791 17.0578 (0.41) 13661 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------- benchmark 'test_strings_key=many_escapes': 3 tests --------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_performance[unquote_old-many_escapes] 157.8090 (1.0) 335.1220 (1.0) 168.2338 (1.0) 12.4257 (1.0) 165.3350 (1.0) 6.4884 (1.0) 360;458 5.9441 (1.0) 5371 1
test_performance[unquote_new-many_escapes] 181.9211 (1.15) 412.7750 (1.23) 203.8222 (1.21) 27.6217 (2.22) 196.4370 (1.19) 31.3795 (4.84) 259;113 4.9062 (0.83) 3663 1
test_performance[unquote_validate-many_escapes] 527.6541 (3.34) 1,207.7210 (3.60) 599.7562 (3.57) 125.5351 (10.10) 556.6350 (3.37) 90.8774 (14.01) 150;140 1.6673 (0.28) 1797 1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------- benchmark 'test_strings_key=many_escapes_narrow': 3 tests ---------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_performance[unquote_new-many_escapes_narrow] 186.9510 (1.0) 862.6611 (2.19) 224.9757 (1.13) 61.3487 (3.58) 202.4500 (1.05) 24.0203 (3.75) 567;783 4.4449 (0.89) 4875 1
test_performance[unquote_old-many_escapes_narrow] 190.0180 (1.02) 394.5540 (1.0) 199.5608 (1.0) 17.1156 (1.0) 193.3036 (1.0) 6.4040 (1.0) 461;616 5.0110 (1.0) 3986 1
test_performance[unquote_validate-many_escapes_narrow] 550.5960 (2.95) 2,762.7711 (7.00) 640.1752 (3.21) 118.2362 (6.91) 603.3650 (3.12) 70.2695 (10.97) 100;99 1.5621 (0.31) 1633 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------- benchmark 'test_strings_key=many_escapes_string': 3 tests ---------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_performance[unquote_old-many_escapes_string] 34.5800 (1.0) 2,135.3530 (7.51) 36.3306 (1.0) 15.6064 (1.80) 35.5910 (1.0) 0.9671 (1.44) 170;848 27.5250 (1.0) 18888 1
test_performance[unquote_new-many_escapes_string] 125.9911 (3.64) 284.2019 (1.0) 131.3483 (3.62) 8.6620 (1.0) 129.8441 (3.65) 0.6723 (1.0) 384;2398 7.6133 (0.28) 7209 1
test_performance[unquote_validate-many_escapes_string] 412.5390 (11.93) 1,590.4229 (5.60) 578.3370 (15.92) 175.2262 (20.23) 514.7040 (14.46) 88.4010 (131.50) 314;351 1.7291 (0.06) 2242 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------- benchmark 'test_strings_key=many_escapes_wide': 3 tests ---------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_performance[unquote_new-many_escapes_wide] 199.0810 (1.0) 566.6000 (1.68) 218.3829 (1.05) 35.5424 (3.44) 205.9960 (1.01) 11.8142 (2.01) 269;751 4.5791 (0.95) 4065 1
test_performance[unquote_old-many_escapes_wide] 200.5970 (1.01) 337.2300 (1.0) 207.4374 (1.0) 10.3271 (1.0) 203.2500 (1.0) 5.8776 (1.0) 525;668 4.8207 (1.0) 4545 1
test_performance[unquote_validate-many_escapes_wide] 577.1730 (2.90) 1,395.9260 (4.14) 653.9705 (3.15) 100.4474 (9.73) 629.0320 (3.09) 81.9499 (13.94) 72;60 1.5291 (0.32) 1371 1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------- benchmark 'test_strings_key=no_escapes': 3 tests --------------------------------------------------------------------------------------------
Name (time in ns) Min Max Mean StdDev Median IQR Outliers OPS (Kops/s) Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_performance[unquote_new-no_escapes] 308.7509 (1.0) 2,293.8999 (1.0) 368.6624 (1.0) 91.9060 (1.0) 337.7034 (1.0) 56.3450 (1.0) 10851;10725 2,712.5091 (1.0) 141744 20
test_performance[unquote_old-no_escapes] 437.9544 (1.42) 27,126.9819 (11.83) 574.8759 (1.56) 173.0340 (1.88) 554.0205 (1.64) 58.9062 (1.05) 5921;8709 1,739.5060 (0.64) 153422 1
test_performance[unquote_validate-no_escapes] 8,968.0543 (29.05) 41,804.0436 (18.22) 9,557.4152 (25.92) 1,250.9006 (13.61) 9,387.9644 (27.80) 118.9765 (2.11) 1348;8067 104.6308 (0.04) 71587 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Legend:
Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
OPS: Operations Per Second, computed as 1 / Mean
============================================================================ 19 passed in 19.46s ============================================================================ |
to provide an impression of the scope of the problems with the current function:
|
3cffd4d
to
47ac111
Compare
…sers. These parsers will now correctly handle strings like `"\\r"`. The time it takes for these parsers to parse strings with escape sequences will be increased, and the increase will be correlated with the amount of escape sequences that occur in a string. For strings with many escape sequences the parsing speed seems to be almost 4 times slower. Also: - Add graph variant test scaffolding. Multiple files representing the same graph can now easily be tested to be isomorphic by just adding them in `test/variants`. - Add more things to `testutils.GraphHelper`, including some methods that does asserts with better messages. Also include some tests for GraphHelper. - Add some extra files to test_roundtrip, set the default identifier when parsing, and change verbose flag to rather be based on debug logging. - move one test from `test/test_issue247.py` to variants. - Fix problems with `.editorconfig` which prevents it from working properly. - Add xfail tests for a couple of issues This includes xfails for the following issues: - RDFLib#1216 - RDFLib#1649
47ac111
to
8bb9dbd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good, thanks @aucampia
Thanks to KirkMcDonald on irc://libera.chat/python I did a comparison between using a dict lookup and str.translate, and str.translate for this use case (1 char string) is actually about twice as slow, so I'm going to change the str.translate to just dict lookups.
|
`str.translate` is not faster than dict lookup for this case, which is a one char lookup. For more info see RDFLib#1663 (comment)
done now. |
These parsers will now correctly handle strings like
"\\r"
.The time it takes for these parsers to parse strings with escape
sequences will be increased, and the increase will be correlated with
the amount of escape sequences that occur in a string.
For strings with many escape sequences the parsing speed seems to be
almost 4 times slower.
Also:
same graph can now easily be tested to be isomorphic by just adding
them in
test/variants
.testutils.GraphHelper
, including some methods that doesasserts with better messages. Also include some tests for GraphHelper.
when parsing, and change verbose flag to rather be based on debug
logging.
test/test_issue247.py
to variants..editorconfig
which prevents it from workingproperly.
This includes xfails for the following issues:
Fixes #1655