
Rewrite asyncio/sslproto.py #158

Closed
1st1 opened this issue May 30, 2018 · 30 comments

1st1 (Member) commented May 30, 2018

Problems with sslproto.py

  • The code is very complex and hard to reason about. At this point it's pretty much unmaintainable; many bugs on bugs.python.org linger unresolved simply because we don't have the capacity to debug them or review PRs.

  • It's very slow. SSL connections are 3-4x slower than regular connections.

  • It has many still-unresolved problems associated with SSL connection shutdown and cleanup:

  • We don't have enough functional tests for SSL in asyncio. Especially ones that test cancellation & timeouts thoroughly enough.

Requirements for the new implementation

  • Easy to understand code. Ideally, the state machine should be implemented right in the SSLProtocol (so no SSLPipe abstraction). A good example of a protocol state machine is asyncpg's CoreProtocol <-> Protocol <-> Connection mechanism.

  • Performance. Ideally, SSL connections shouldn't be more than 10% slower.

  • Tests. Nathaniel once mentioned that Trio's SSL tests are very robust and that he found bugs in CPython's ssl module with them. We may want to port them to asyncio.

  • SSL tunneled through SSL should be naturally supported.

Suggested steps towards the new implementation

  1. Try to benchmark the current asyncio/sslproto.py. Maybe we'll be able to pinpoint the exact bottleneck that makes it so slow; what if there's a bug in the ssl module? Most likely there's no single bottleneck, and the reason for the poor performance is just inefficient code.

  2. Implement a new SSL state machine/protocol in pure Python (I suggest using the new asyncio.BufferedProtocol from the beginning, though). Get it to the point that it passes asyncio's existing SSL tests. Add more tests.

  3. Once we have a working pure Python implementation, we can (a) contribute it to Python 3.8, (b) start optimizing it in Cython for uvloop.
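
For step 1, a benchmark run could start with something as simple as wrapping the event loop in cProfile. A minimal sketch, assuming the workload coroutine is a stand-in for the actual SSL echo client/server loop:

```python
import asyncio
import cProfile
import io
import pstats

async def workload():
    # Stand-in for the actual SSL echo-server benchmark traffic.
    await asyncio.sleep(0)

prof = cProfile.Profile()
prof.enable()
asyncio.run(workload())
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("tottime").print_stats(15)
report = out.getvalue()
print(report)
```

Sorting by tottime (as in the tables below) surfaces the functions that burn CPU themselves, rather than those that merely sit on the call path.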

cc @fantix

fantix (Member) commented May 31, 2018

@1st1 Thanks! I'm starting with a quick profile of asyncio/sslproto.py and will follow up with results here.

fantix (Member) commented May 31, 2018

Vanilla asyncio echo server on Python 3.7b3 without SSL:

[profiler output screenshot]

With SSL:

[profiler output screenshot]

I've repainted this in red (sslproto.py) and green (_ssl module):

[annotated profiler screenshot]

This gives a rough idea of the time breakdown:

  • It's true that there is no single bottleneck in sslproto.py.
  • The _ssl module takes quite some time, 90% of which is spent in read() and write() of _ssl.SSLSocket, and 9% in _ssl.MemoryBIO.

Top tottime detail data:

ncalls tottime percall cumtime percall filename:lineno(function)
1200003 10.24 8.537e-06 10.24 8.537e-06 ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
600006 6.573 1.095e-05 6.573 1.095e-05 ~:0(<method 'send' of '_socket.socket' objects>)
600009 4.789 7.982e-06 17.19 2.865e-05 sslproto.py:157(feed_ssldata)
600009 4.74 7.9e-06 4.74 7.9e-06 ~:0(<method 'recv' of '_socket.socket' objects>)
313269 3.829 1.222e-05 63.06 0.0002013 base_events.py:1642(_run_once)
600009 3.821 6.369e-06 18.51 3.085e-05 sslproto.py:641(_process_write_backlog)
313278 3.2 1.021e-05 3.2 1.021e-05 ~:0(<method 'control' of 'select.kqueue' objects>)
600000 2.86 4.767e-06 6.207 1.034e-05 sslproto.py:231(feed_appdata)
600009 2.432 4.054e-06 48.54 8.089e-05 selector_events.py:790(_read_ready__data_received)
600006 2.215 3.691e-06 41.36 6.894e-05 sslproto.py:502(data_received)
600000 2.115 3.525e-06 2.115 3.525e-06 ~:0(<method 'write' of '_ssl._SSLSocket' objects>)
313269 1.956 6.245e-06 5.571 1.778e-05 selectors.py:553(select)
600003 1.437 2.396e-06 20.16 3.361e-05 sslproto.py:569(_write_appdata)
600006 1.42 2.367e-06 8.121 1.353e-05 selector_events.py:830(write)
313268 1.112 3.55e-06 2.488 7.941e-06 selector_events.py:558(_process_events)
1200003 1.081 9.008e-07 11.33 9.438e-06 ssl.py:698(read)
5140518/5140477 0.9978 1.941e-07 0.9978 1.941e-07 ~:0()
600040 0.9382 1.564e-06 50.39 8.398e-05 events.py:86(_run)
600040 0.9134 1.522e-06 49.45 8.241e-05 ~:0(<method 'run' of 'Context' objects>)
600011 0.9081 1.514e-06 1.376 2.293e-06 base_events.py:1624(_add_callback)
600000 0.8536 1.423e-06 21.17 3.528e-05 sslproto.py:375(write)
600000 0.7963 1.327e-06 21.96 3.66e-05 echoserver.py:73(data_received)
600006 0.6575 1.096e-06 0.6575 1.096e-06 ~:0(<method 'write' of '_ssl.MemoryBIO' objects>)
2401300 0.637 2.653e-07 0.637 2.653e-07 ~:0()
600006 0.537 8.95e-07 0.537 8.95e-07 ~:0(<method 'read' of '_ssl.MemoryBIO' objects>)
2 0.4861 0.243 63.55 31.77 base_events.py:504(run_forever)
1801088 0.3929 2.181e-07 0.3929 2.181e-07 ~:0(<method 'append' of 'list' objects>)
600000 0.3147 5.245e-07 2.43 4.049e-06 ssl.py:710(write)
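
For readers unfamiliar with the machinery being profiled here: sslproto.py drives an ssl.SSLObject over two MemoryBIO byte buffers, with the event loop shuttling ciphertext between the BIOs and the socket. The plumbing can be demonstrated without any network; a minimal sketch (example.com is just a placeholder hostname):

```python
import ssl

incoming = ssl.MemoryBIO()   # ciphertext received from the transport
outgoing = ssl.MemoryBIO()   # ciphertext to be written to the transport

ctx = ssl.create_default_context()
sslobj = ctx.wrap_bio(incoming, outgoing, server_hostname="example.com")

try:
    sslobj.do_handshake()
except ssl.SSLWantReadError:
    # The handshake cannot complete without peer data, but the
    # ClientHello record has already been placed into `outgoing`.
    pass

client_hello = outgoing.read()
print(len(client_hello))
```

Every `_ssl.MemoryBIO` read/write in the table above is one hop of ciphertext through buffers like these.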

fantix (Member) commented May 31, 2018

Just made a very simple PoC; the improvement is not significant. I'll keep trying other approaches tomorrow, just posting the results here for now.

[profiler output screenshot]

[profiler output screenshot]

ncalls tottime percall cumtime percall filename:lineno(function)
1200000 10.04 8.365e-06 10.04 8.365e-06 ~:0(<method 'read' of '_ssl._SSLSocket' objects>)
600006 6.57 1.095e-05 6.57 1.095e-05 ~:0(<method 'send' of '_socket.socket' objects>)
600009 3.858 6.43e-06 3.858 6.43e-06 ~:0(<method 'recv_into' of '_socket.socket' objects>)
600000 3.519 5.865e-06 29.98 4.997e-05 sslproto.py:888(_do_read)
305014 3.079 1.009e-05 3.079 1.009e-05 ~:0(<method 'control' of 'select.kqueue' objects>)
305005 2.853 9.353e-06 53.28 0.0001747 base_events.py:1642(_run_once)
600000 2.651 4.419e-06 13.75 2.292e-05 sslproto.py:868(_do_write)
600006 2.325 3.874e-06 32.99 5.498e-05 sslproto.py:794(buffer_updated)
305005 2.124 6.964e-06 5.646 1.851e-05 selectors.py:553(select)
600009 1.954 3.256e-06 38.95 6.491e-05 selector_events.py:761(_read_ready__get_buffer)
305004 1.941 6.362e-06 3.643 1.195e-05 selector_events.py:558(_process_events)
600000 1.694 2.824e-06 1.694 2.824e-06 ~:0(<method 'write' of '_ssl._SSLSocket' objects>)
600006 1.618 2.696e-06 8.361 1.394e-05 selector_events.py:830(write)
600011 1.19 1.983e-06 1.703 2.838e-06 base_events.py:1624(_add_callback)
1200000 1.137 9.474e-07 11.17 9.312e-06 ssl.py:698(read)
600000 0.9503 1.584e-06 14.7 2.45e-05 sslproto.py:725(write)

Just curious about Trio, which uses the same stdlib _ssl module (MemoryBIO, read() and write()):

With SSL:

[profiler output screenshot]

Without SSL:

[profiler output screenshot]

1st1 (Member, Author) commented May 31, 2018

_ssl module is taking quite some time, 90% of which is on read() and write() of _ssl.SSLSocket, and 9% on _ssl.MemoryBIO.

Oh, that's unfortunate. I looked at _SSLObject.read and write implementations in Modules/_ssl.c and it looks like they are just tiny wrappers of OpenSSL functions. We can probably save some overhead by supplying our own buffer to read, but I wouldn't expect it to give us any significant improvements.
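
"Supplying our own buffer to read" could look roughly like the sketch below: SSLObject.read() accepts an optional preallocated buffer, which avoids allocating a fresh bytes object on every call. A hedged illustration; drain_appdata and the loop shape are assumptions, not an actual patch:

```python
import ssl

BUF = bytearray(256 * 1024)   # allocated once, reused for every read
VIEW = memoryview(BUF)

def drain_appdata(sslobj):
    """Pull all currently decryptable data out of `sslobj`, reading
    into the shared buffer instead of allocating per call.
    (Illustrative helper; `sslobj` is an ssl.SSLObject over MemoryBIOs.)"""
    chunks = []
    while True:
        try:
            nbytes = sslobj.read(len(BUF), BUF)  # fills BUF, returns count
        except ssl.SSLWantReadError:
            break  # no complete TLS record left in the incoming BIO
        if nbytes == 0:
            break  # empty read means the peer sent close_notify
        chunks.append(bytes(VIEW[:nbytes]))
    return b"".join(chunks)
```

This trims allocation overhead per call, though as noted it probably won't move the needle much when the cost is inside OpenSSL itself.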

1st1 (Member, Author) commented May 31, 2018

Maybe asyncio's read buffer is too small and we have too much overhead with WANT_READ.

elprans (Member) commented May 31, 2018

Throughput depends a lot on the chosen cipher. See e.g. https://jbp.io/2018/01/07/rustls-vs-openssl-performance-1.html, the difference can be quite large.

fantix (Member) commented Jun 1, 2018

All the tests above use the cipher ECDHE-RSA-AES256-GCM-SHA384; I'll quickly try ECDHE-RSA-AES128-GCM-SHA256.

Using ECDHE-RSA-AES128-GCM-SHA256 didn't make much difference: 8.53 vs 8.47. Let me try tweaking the buffer size.

1st1 (Member, Author) commented Jun 1, 2018

Or you can try uvloop, which has a 256 KB (or thereabouts) read buffer size by default.

fantix (Member) commented Jun 7, 2018

Sorry for the delay! Tweaking the buffer size didn't help. Actually:

  1. asyncio has a read buffer size similar to uvloop's, 256 KB vs 250 KB. I tried both 2560 KB with asyncio and master uvloop.
  2. The test data is small enough to fit in the read buffer for each request: there are 0.6M socket.send() calls and 1.2M _SSLSocket.read() calls. Judging from sslproto.py, half of the _SSLSocket.read() calls delivered all the decrypted data in one go, while the other half simply raised WANT_READ to break the loop in feed_ssldata().

I'm thinking of these TODOs next; please advise:

  1. Port the PoC to uvloop/sslproto.pyx to see how it works.
  2. Profile and improve cpython/Modules/_ssl.c (tried ltrace a bit, result is not yet convincing).
  3. Rewrite asyncio/sslproto.py with new _ssl.c perhaps.

(Similarly, quick-and-dirty PoCs first.)

1st1 (Member, Author) commented Jun 7, 2018

asyncio has a read buffer size similar to uvloop's, 256 KB vs 250 KB. I tried both 2560 KB with asyncio and master uvloop.

That's true only for 3.7, but I get your point.

  1. Port the PoC to uvloop/sslproto.pyx to see how it works.
  2. Profile and improve cpython/Modules/_ssl.c (tried ltrace a bit, result is not yet convincing).
  3. Rewrite asyncio/sslproto.py with new _ssl.c perhaps.

Let's re-sync after (1).

(2) and (3) will require waiting until 3.8, assuming your improvements to the ssl module are accepted.

fantix (Member) commented Jun 7, 2018

Got it, thanks! Meanwhile, there's also the possibility of making uvloop use pyOpenSSL, libuv-tls, or similar libs directly (as a last resort).

1st1 (Member, Author) commented Jun 7, 2018

Meanwhile, there's also the possibility of making uvloop use pyOpenSSL, libuv-tls, or similar libs directly (as a last resort).

Yes, I was thinking about that too until I looked at _ssl.c and saw a very thin wrapper around OpenSSL. I really don't think we can make it any faster. But this is just my first impression; maybe there are faster OpenSSL APIs that we can use or something. After we have a better SSLProtocol implementation, I'm totally open to profiling Python's ssl module to see if it can be improved.

1st1 (Member, Author) commented Jun 7, 2018

Christian Heimes also suggests trying to enable http://manpages.org/SSL_CTX_set_read_ahead/3

1st1 (Member, Author) commented Jun 8, 2018

We'll also need to think about offloading SSL encryption to a thread, in case it consumes significant CPU.
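
Offloading could go through loop.run_in_executor. A minimal sketch; the XOR "cipher" is a toy stand-in for the CPU-bound SSLObject.write() plus BIO-drain step, and a real version would have to serialize access to the SSLObject across threads:

```python
import asyncio

def cpu_bound_encrypt(data: bytes) -> bytes:
    # Toy stand-in for the CPU-heavy encryption step (NOT real crypto).
    return bytes(b ^ 0x5A for b in data)

async def write_offloaded(data: bytes) -> bytes:
    loop = asyncio.get_running_loop()
    # Run the encryption in the default ThreadPoolExecutor so the
    # event loop stays free to service other connections meanwhile.
    return await loop.run_in_executor(None, cpu_bound_encrypt, data)

async def main() -> bytes:
    return await write_offloaded(b"payload")

ciphertext = asyncio.run(main())
```

Since OpenSSL releases the GIL during bulk crypto, a thread can genuinely overlap encryption with event-loop work; the open question is whether the hand-off overhead pays for itself on small payloads.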

1st1 (Member, Author) commented Jun 9, 2018

Also need to investigate SSL_MODE_ENABLE_PARTIAL_WRITE; a comment from _ssl.c:

XXX should partial writes be enabled, SSL_MODE_ENABLE_PARTIAL_WRITE?

I've seen some comments in forums suggesting enabling it for non-blocking IO.

fantix (Member) commented Jun 11, 2018

Fixed two buffer size issues in the PoC, and tested again with:

  • Python 3.7.0b5
  • OpenSSL 1.1.0h
  • ECDHE-RSA-AES256-GCM-SHA384
  • Tested with both 1KB requests and 100KB requests
  • 5 tests for each combination

Results:

[benchmark results screenshot]

  • The standard deviation is fairly large, probably because of a noisy testing environment.
  • Still, the PoC sslproto.py is approximately 10%~20% faster; however, I doubt it can keep that advantage once it is completed with all the error checks, shutdown handling, etc.
  • The effect of applying the "read ahead" and "partial write" flags is not significant. "Partial write" may be helpful, but I haven't checked for or fixed any side effects the flags may bring.

I'm trying this in uvloop, as well as calling SSL-related methods in threads; will update soon. Results of uvloop compared to the same setup without uvloop:

[benchmark results screenshot]

fantix added a commit to fantix/cpython that referenced this issue Jun 11, 2018
fantix (Member) commented Jun 11, 2018

While testing with 100 kB requests, I found that SSL_read() returns at most 16 kB of data (the record size for SSLv3/TLSv1 according to the manual), which caused roughly 8x~16x more calls to _ssl.SSLSocket.read() than to sslproto.data_received().

The previous commit embeds the loop in _ssl.c and improved overall performance by approximately 30%~40%. Because the test subject is an echo server, making _ssl.SSLSocket.read() return more data leads to far fewer calls to echoserver.data_received(), and therefore fewer calls to transport.write().

[profiler output screenshot]

We could actually do the same in sslproto.py: combine decrypted data into a bigger buffer, then call the client's data_received(). UPDATE: tested and verified this conclusion; the C-level fix and the Python-level fix yield the same improvement.

This has no effect if the payload is small (like 1 kB).
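
The Python-level version of the fix amounts to draining every available record before invoking the application callback once. A hedged sketch; the helper name and loop shape are illustrative, not the actual patch:

```python
import ssl

TLS_MAX_RECORD = 16 * 1024  # SSL_read() returns at most one record's plaintext

def dispatch_decrypted(sslobj, app_protocol):
    """Drain all complete records from `sslobj`, then deliver ONE
    combined chunk to the app instead of one call per 16 kB record."""
    chunks = []
    while True:
        try:
            chunk = sslobj.read(TLS_MAX_RECORD)
        except ssl.SSLWantReadError:
            break  # no further complete record in the incoming BIO
        if not chunk:
            break  # empty read: peer sent close_notify
        chunks.append(chunk)
    if chunks:
        app_protocol.data_received(b"".join(chunks))
```

For a 100 kB payload this turns roughly seven data_received() calls into one, which is exactly where the echo server's savings come from.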

fantix added a commit to fantix/cpython that referenced this issue Jun 11, 2018
fantix added a commit to fantix/cpython that referenced this issue Jun 11, 2018
fantix added a commit to fantix/cpython that referenced this issue Jun 11, 2018
1st1 (Member, Author) commented Jun 11, 2018

Previous commit embeds the loop in _ssl.c, and improved the overall performance by approximately 30%~40%.

Nice.

fantix (Member) commented Jun 11, 2018

😃 Currently working on a more complete sslproto.py against latest cpython master; should be ready to show by tomorrow.

1st1 (Member, Author) commented Jun 11, 2018

Great. You can also try to use BufferedProtocol, but it's really up to you in what order you want to approach this.

fantix (Member) commented Jun 11, 2018

Fully buffered - also supporting client BufferedProtocol. 😄

fantix (Member) commented Jun 12, 2018

It's taking longer than expected; the previous commit shows the rough idea. I'm trying to separate out the flow control (FC) and extract a clearer state transition model tomorrow, in order to survive e.g. renegotiation.

1st1 (Member, Author) commented Jun 12, 2018

I'm trying to separate out the flow control (FC) and extract a clearer state transition model tomorrow, in order to survive e.g. renegotiation.

Take your time. And... nice diagram! We'll probably want to have a digital version of it to put somewhere in the docs ;)

fantix (Member) commented Jun 13, 2018

Sure thing!

Looks like renegotiation is a bit awful: Python doesn't expose it, OpenSSL didn't do it fully right, HTTP/2 forbids it, and TLS 1.3 replaced it with something else. I'll try to get it working and tested with minimal effort, perhaps copying some tests from Trio, as Nathaniel seems to have already walked through this.


[diagram]

Eventually I figured out how renegotiation works in Python; there are quite a few implications:

  • If the other peer initiates a renegotiation by sending a Hello message (probably together with other regular data messages), SSLObject.read() is required to consume the Hello message.
  • This call to SSLObject.read() will write a handshake message into the outgoing buffer.
  • SSLObject.write() won't do: it doesn't touch the incoming buffer at all. Therefore, calling SSLObject.write() before SSLObject.read() would only send app data as usual, with the Hello message staying in the incoming buffer.
  • However, calling SSLObject.write() after SSLObject.read() has consumed the Hello message raises an ssl.SSLSyscallError until the renegotiation is done.

Therefore, I came up with this process diagram:

[process diagram]

  • Whenever data arrives, always call SSLObject.read().
  • Because we cannot send anything during a renegotiation, downstream writes need to be buffered (in the write_backlog buffer).
  • Any SSLObject.read() may write data into the outgoing buffer or complete a renegotiation; therefore we should always try (the blue arrow) to write data from write_backlog into the SSLObject without waiting for the next downstream write (missing from current cpython), and flush the outgoing buffer if any (present in current cpython).
  • The flow control is done differently. Current cpython simply forwards events between upstream and downstream. I'd like to keep each SSLProtocol independent; water marks (high/low for incoming, and high/low for outgoing + backlog) can be tweaked separately in scenarios like SSL over SSL over SSL.
  • Explicit handshake and shutdown are not yet considered.
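
The bullets above can be sketched as a single ciphertext-arrival routine. A hedged illustration only; the attribute names on `proto` (_incoming, _outgoing, _sslobj, _write_backlog, _transport, _app_protocol) are assumptions, not the actual patch:

```python
import ssl

def process_incoming(proto, ciphertext):
    """Handle newly arrived ciphertext per the process diagram:
    always read first, then retry buffered writes, then flush."""
    proto._incoming.write(ciphertext)

    # 1. Always call read() first: it consumes a renegotiation Hello if
    #    one arrived, and may emit handshake records into the outgoing BIO.
    while True:
        try:
            chunk = proto._sslobj.read(16 * 1024)
        except ssl.SSLWantReadError:
            break
        if not chunk:
            break  # close_notify from the peer
        proto._app_protocol.data_received(chunk)

    # 2. Retry buffered app writes: a completed renegotiation may have
    #    unblocked them (the blue arrow in the diagram).
    while proto._write_backlog:
        try:
            proto._sslobj.write(proto._write_backlog[0])
        except ssl.SSLWantReadError:
            break  # still renegotiating; keep the data buffered
        del proto._write_backlog[0]

    # 3. Flush whatever ciphertext the SSLObject produced.
    if proto._outgoing.pending:
        proto._transport.write(proto._outgoing.read())
```

The key invariant is that step 1 runs unconditionally on every arrival, so a peer-initiated Hello can never sit unprocessed in the incoming buffer.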

Meanwhile, I'm still trying to fit this into a state transition diagram; states probably make more sense with explicit handshake and shutdown included.


[state transition diagram]

  • Written data from APP is buffered during DO_HANDSHAKE and SHUTDOWN, then processed afterwards accordingly.

Other than the 4 original states, I added 3:

  • The two PASSTHRU states are used between get_buffer() and buffer_updated() in UNWRAPPED mode to ignore any environment changes (very unlikely), when app reading is not paused and we can just pass data through without buffering locally.
  • NO_CONNECTION is used so that we don't have to check whether self._transport is None.

1st1 (Member, Author) commented Jun 19, 2018

Great progress! (GH doesn't send notifications on edit, so this is the first time I see the updates)


Python doesn't expose it, OpenSSL didn't do it fully right, HTTP/2 forbids it, and TLS 1.3 replaced it with something else. I'll try to get it working and tested with minimal effort, perhaps copying some tests from Trio, as Nathaniel seems to have already walked through this.

Is it so broken that it's impractical to support it? I mean it's a very niche requirement, so if its support comes at the expense of code clarity/maintainability I'd say let's not do it at all. Your call though.

TLS1.3 replaced it with something else

Can you share some details on that? Is that new TLS 1.3 renegotiation protocol easier to implement? Maybe we can get away with supporting just that TLS 1.3 thing?

fantix (Member) commented Jun 20, 2018

Thanks! No worries, I didn't want to make too much noise; I was just about to ask for your review and next steps.

Renegotiation support is actually done in the previous commit. It turns out that as long as we follow the process diagram when calling the OpenSSL functions, with those implications in mind, current TLS <= 1.2 renegotiation just works. It's manually tested; a unit test would require pyOpenSSL to be installed, though I'm not sure that is practical in the cpython source tree.

On the two renegotiation issues: 1) duplex traffic during renegotiation: it seems OpenSSL has already fixed this by refusing to send app data once it has sent the renegotiation Hello. I didn't find the actual fix, but I verified the behavior (the test code is a bit ugly), so it should just work fine now. 2) TLS 1.3 rekeying is still in OpenSSL 1.1.1 beta, so it's a future thing, I think.


Another finding while fixing cpython tests concerns canceling a handshake. SSL/TLS has a user_canceled alert to manually cancel an ongoing handshake. OpenSSL has this alert type too, but exposes no API for it (one would have to use the internal ssl3_send_alert() to send a user_canceled). Therefore, we should simply close the underlying connection in this case. (The state diagram above has been updated with a few transitions to NO_CONNECTION on errors.)


A side topic: we may be able to support arbitrary SSL wrap/unwrap within the same TCP connection without reconnecting (transitioning between UNWRAPPED and WRAPPED multiple times). The original sslproto.py doesn't stay in the UNWRAPPED state for long, but it still allowed unencrypted data to pass through for an extremely short while after a successful shutdown, before the connection was lost. (Also, current asyncio supports one-time upgrading by inserting an SSLProtocol before the user protocol via loop.start_tls().)

The worry is about being unwrapped by the peer: the application protocol needs to be aware of, or even in charge of, the SSL downgrade, or sensitive data may be sent in plaintext by mistake. For now, an SSL close_notify alert coming from the peer is received as an empty read (sslobj.read() returns b''). We could use a similar pattern to notify the app protocol that the connection is being unwrapped: if data_received(b'') happens, then all data afterwards will be in plaintext. We could also let the user configure the response to close_notify alerts: close the connection directly, downgrade immediately, or even let the user choose when to respond to the close_notify alert by calling unwrap().

I think this is getting too far ahead; sorry for the long story! It might be better to just close the connection immediately, as the first part of RFC 5246 section 7.2.1 states, and leave arbitrary wrap/unwrap for a new feature. If so, I'll simply clean up the pass-through logic (including the one from the original sslproto.py) and make sure all traffic is always encrypted.

fantix (Member) commented Jun 21, 2018

About the performance question: I just realized that I've been comparing encryption time against loopback network time, which is probably not a practical anchor point. So I compared the absolute throughput of OpenSSL (with the benchmark tool found in Elvis's reference) and cpython/_ssl.c (with the asyncio echo server and cProfile); the result is 2059 MB/s vs 1390 MB/s (68%). I think this is a reasonable result, considering the async overhead, BIO memory-copy overhead, and Python-wrapping overhead.

This probably means that our initial problem statement was incomplete: SSL connections are 3-4x slower than regular connections when the network time is about the same as the encryption/decryption time. If the encryption/decryption time were 10% of the network time, the current SSL implementation would be adding roughly another 15%~30% of overhead.

Taking the time consumed in the _ssl module as the anchor point (a unit of one), the first PoC sslproto.py was 34% faster (1.29 -> 0.85), and the decrypted-data aggregation fix was 70% faster (0.52 -> 0.15) on the 100 KB echo server. In this particular combined case, sslproto.py is improved to take only about 10% of the time spent in the _ssl module, i.e. 6.5% extra overhead overall compared to the previous 15%~30%, which is good enough, I think.

1st1 (Member, Author) commented Jun 21, 2018

I just realized that I've been comparing encryption time against loopback network time, which is probably not a practical anchor point.

Taking the time consumed in the _ssl module as the anchor point (a unit of one), the first PoC sslproto.py was 34% faster (1.29 -> 0.85), and the decrypted-data aggregation fix was 70% faster (0.52 -> 0.15) on the 100 KB echo server. In this particular combined case, sslproto.py is improved to take only about 10% of the time spent in the _ssl module, i.e. 6.5% extra overhead overall compared to the previous 15%~30%, which is good enough, I think.

Right. This is great. Perhaps we'll be able to squeeze out a few more percent when we Cythonize sslproto.py, but it seems the performance isn't that bad after all.

fantix added a commit to fantix/cpython that referenced this issue Jul 2, 2018
fantix mentioned this issue Jul 2, 2018
1st1 (Member, Author) commented Aug 14, 2018

What's the latest news here?

fantix (Member) commented Aug 14, 2018 via email

fantix added a commit to fantix/uvloop that referenced this issue Aug 22, 2018
fantix added a commit to fantix/uvloop that referenced this issue Sep 5, 2018
fantix added a commit to fantix/uvloop that referenced this issue Sep 7, 2018
1st1 closed this as completed in 9cba749 Sep 18, 2018