Reduce # of clients connected to the application db [#810] #944
Conversation
- Limit task concurrency to two per worker.
- Create one Engine per celery process, which opens up a connection pool. Create one Session per celery process and use that session across privacy requests.
- Close the session after the privacy request has finished executing. This just resets the session and returns connections back to the pool; it can be reused.
- Remove unnecessary places where the session is closed manually, because the session is being used as a context manager and is already closed through that.
- Pass the same Session that the privacy request is using through to TaskResources to be re-used to create ExecutionLogs, instead of opening up a new Session.
- Don't close the session when passing it into the ExecutionLog; wait until the entire privacy request is complete/exited.
For mypy's benefit, define that the session is a context manager.
…task.run_erasure, and for instantiating taskResources
One legitimate test failure here, representing a behavior change; I need to clear it with product.
Thanks @pattisdr — this is a huge improvement on the current state, and back to where we were before!
For the failed test:
… a request is in progress can cause related collections to be skipped once the current session is expired and the connection config has the most recent state. Because the same Session that is being used to run the PrivacyRequest is now being used for ExecutionLogs, the process of saving an ExecutionLog runs a session.commit() which expires the Session and causes the ConnectionConfig to have the most recent state the next time it is accessed.
@seanpreston yes, in part. The change is caused because the same Session that we use to run the PrivacyRequest is also now used to create Execution Logs, instead of creating a new Session for the Execution Logs. As part of saving the Execution Logs, we run a session.commit() (see https://docs.sqlalchemy.org/en/14/orm/session_basics.html#session-committing).
I cleared the behavior with @adriaaaa yesterday. She said we can go with it, and if we get feedback that it's causing issues we can revisit. Note that this doesn't spread the Session everywhere through the Traversal yet; it just passes it through far enough for the ExecutionLog creation to be able to access it.
```python
class DatabaseTask(Task):  # pylint: disable=W0223
    _session = None

    @property
    def session(self) -> ContextManager[Session]:
        """Creates Session once per process"""
        if self._session is None:
            SessionLocal = get_db_session(config)
            self._session = SessionLocal()

        return self._session
```
This allows us to share the Session across a Celery process. It also limits the number of times get_db_session is called per process; each call was creating a new Engine, which opened up a new Connection Pool whose connections weren't being reused.
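As a minimal, stdlib-only sketch of this memoization pattern (class and factory names here are illustrative stand-ins, with `session_factory` playing the role of `get_db_session(config)`):

```python
class CachedSessionTask:
    """Toy version of the DatabaseTask pattern: the first access to
    .session creates it via the factory; every later access reuses it,
    so each process builds its Engine/pool only once."""

    def __init__(self, session_factory):
        self._factory = session_factory
        self._session = None

    @property
    def session(self):
        # Lazily create the session once per process, then cache it.
        if self._session is None:
            self._session = self._factory()
        return self._session


calls = []
task = CachedSessionTask(lambda: calls.append(1) or object())
first, second = task.session, task.session
assert first is second   # the same session is reused across "requests"
assert len(calls) == 1   # the factory (think get_db_session) ran only once
```

The cache lives on the task instance, which Celery keeps alive for the lifetime of the worker process, so the Engine and its pool persist across privacy requests.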
```diff
@@ -190,7 +204,6 @@ def run_privacy_request(
         after_webhook_id=from_webhook_id,
     )
     if not proceed:
-        session.close()
```
These session.close calls were removed because they weren't doing anything: the Session is a context manager above (see `with self.session as session`), so it is closed automatically.
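To illustrate why the explicit close is redundant, here is a stdlib-only stand-in mimicking the Session's context-manager semantics in SQLAlchemy 1.4 (`FakeSession` is a toy class, not the real one):

```python
class FakeSession:
    """Mimics the relevant behavior: exiting the `with` block calls
    close(), so a manual session.close() inside the block is redundant."""

    def __init__(self):
        self.close_count = 0

    def close(self):
        self.close_count += 1

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
        return False  # don't swallow exceptions


s = FakeSession()
with s as session:
    pass  # do work with session; no manual session.close() needed
assert s.close_count == 1  # closed exactly once, by __exit__
```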
```python
with TaskResources(
    privacy_request, policy, connection_configs, session
) as resources:
```
We're passing the Session in to TaskResources here so we can use it to create the ExecutionLogs.
```diff
-        SessionLocal = get_db_session(config)
-        db = SessionLocal()
+        db = self.session
```
We're using the same Session that is being used for the PrivacyRequest, not creating a new one, which reduces the number of connections we're opening. It also has a side effect of causing other resources bound to the Session for the PrivacyRequest to get the most recent state when the ExecutionLog is saved, because we commit the Session.
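A toy illustration of that side effect (this is not SQLAlchemy — just a sketch of expire-on-commit semantics, with `DB` standing in for the application database):

```python
# Stand-in for the application database's current state.
DB = {"connection_config": {"disabled": False}}


class FakeSession:
    """Mimics expire_on_commit=True: after commit(), already-loaded
    instances reflect the latest database state on their next access."""

    def __init__(self):
        self._loaded = []

    def get(self, key):
        inst = dict(DB[key])  # load a snapshot into the session
        self._loaded.append((key, inst))
        return inst

    def commit(self):
        # Refresh every loaded instance from the backing store.
        for key, inst in self._loaded:
            inst.update(DB[key])


session = FakeSession()
config = session.get("connection_config")
DB["connection_config"]["disabled"] = True  # user disables the datastore mid-request
session.commit()                            # happens whenever an ExecutionLog is saved
assert config["disabled"] is True           # the running request now sees fresh state
```

Because ExecutionLogs are written throughout the traversal, each save acts as a refresh point for everything else bound to the shared Session.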
I think the main implication here is with request cancellation? i.e. if `.cancelled_at` / `paused_at` / `.status` were to change, we may be able to abort the traversal entirely?
Yes, I think this is a good example of where, in the future, we can better make decisions mid-traversal based on the resource's current state.
```diff
@@ -48,7 +48,7 @@ def _create_celery(config_path: str = config.execution.celery_config_path) -> Ce

 def start_worker() -> None:
     logger.info("Running Celery worker...")
-    celery_app.worker_main(argv=["worker", "--loglevel=info"])
+    celery_app.worker_main(argv=["worker", "--loglevel=info", "--concurrency=2"])
```
This could be made configurable in the future, but it matches our state before the move to Celery. The default concurrency is the number of cores on your machine; limiting the number of PrivacyRequests that can run simultaneously per worker has a lot of positive effects, including reducing the number of simultaneous connections to the application database and to customer-owned databases, and reducing simultaneous requests against their connected APIs.
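If we do make it configurable later, one lightweight approach is an environment-variable override when building the worker argv. A sketch (`FIDESOPS__WORKER_CONCURRENCY` and `build_worker_argv` are hypothetical names, not existing fidesops settings):

```python
import os


def build_worker_argv(default_concurrency: int = 2) -> list:
    """Build the Celery worker argv, allowing the concurrency cap to be
    overridden via an environment variable (hypothetical name below)."""
    concurrency = int(os.getenv("FIDESOPS__WORKER_CONCURRENCY", default_concurrency))
    return ["worker", "--loglevel=info", f"--concurrency={concurrency}"]


# With no override set, this matches the hard-coded argv in the diff above:
assert build_worker_argv() == ["worker", "--loglevel=info", "--concurrency=2"]
```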
Thanks, I've added a ticket to be addressed as a follow-up.
* Reduce number of open connections:
  - Limit task concurrency to two per worker.
  - Create one Engine per celery process which opens up a connection pool. Create one Session per celery process and use that session across privacy requests.
  - Close the session after the privacy request has finished executing. This just resets the session and returns connections back to the pool. It can be reused.
  - Remove unnecessary places where session is closed manually because the session is being used as a context manager and is already closed through that.
  - Pass the same Session that the privacy request is using through to TaskResources to be re-used to create ExecutionLogs instead of opening up a new Session.
  - Don't close the session when passing it into the Execution Log, wait until the entire privacy request is complete/exited.
* Define "self" for run_privacy_task - it's the task itself. For mypy's benefit, define that the session is a context manager.
* Make a session non-optional for graph_task.run_access_request, graph_task.run_erasure, and for instantiating taskResources
* Use missing db fixture.
* Add missing db resource.
* Update test to reflect new behavior that disabling a datasource while a request is in progress can cause related collections to be skipped once the current session is expired and the connection config has the most recent state. Because the same Session that is being used to run the PrivacyRequest is now being used for ExecutionLogs, the process of saving an ExecutionLog runs a session.commit() which expires the Session and causes the ConnectionConfig to have the most recent state the next time it is accessed.
* Update CHANGELOG.
Purpose
Since around the time of the move to celery, in local light scale testing, it's very easy to open up too many connections against the fidesops application database (current limit is 100). Investigate improvements and set some reasonable defaults.
Changes
Big picture, we now do three things:

1. Limit task concurrency to two per worker.
2. Create one Engine and one Session per celery process, sharing that Session across privacy requests.
3. Re-use the privacy request's Session to create ExecutionLogs instead of opening new Sessions for them.

More details below:
- Made run_privacy_request a bound celery task so we have access to the task itself within the function.
- Added a DatabaseTask which extends Task but allows caching a Session to be shared across a celery process. Each celery process will have an Engine created that maintains its own connection pool, separate from the other processes. We then create the Session that is shared. This Engine and Session should then be created once per process.
- Pass the same Session that run_privacy_request is using through to graph_task.run_access_request and graph_task.run_erasure, and then on to TaskResources, to be re-used to create ExecutionLogs instead of creating a new Session for each, because they were opening up their own additional connections.

Findings
Local experimentation was completed by monitoring the number of backends connected to the application database while submitting increasing numbers/rates of privacy requests against fidesops. The test setup uses a simple graph of just postgres collections from postgres_example_test_dataset, and I'm running access requests using the policy and other details laid out in the Postman collection. I had two celery workers running. A single POST to api/v1/privacy-request can create anywhere from 1-50 privacy requests at a time, so testing would start with creating just one at a time, then 6 at a time, then usually 6 at a time 6 times in rapid fashion, then 10 at a time, then 50 at a time.

Before
We were using the default SqlAlchemy Connection Pool (QueuePool) and not sharing connection pools between processes, which is good. Each pool size was 5 by default, with the default max_overflow of 10. But we were calling get_db_session for each privacy request without specifying an engine (which manages the connection pool), so it seemed like multiple connection pools were being created per process and connections weren't being re-used. Additionally, the process of writing execution logs would open up its own sessions, which would open up additional connections, compounding the problem.

The setup below seemed to quickly max out the 100 connection default and not reuse connections.
After
Still using a QueuePool of size 5 with max_overflow of 10, but each celery process gets one engine and one session. We use that session for the entirety of the request, even to create the execution logs, then close it at the end, which just resets the session and returns the connections to the connection pool. I also limit the number of privacy requests that can be executed simultaneously per worker to 2, which further reduces the number of connections we need to open at once.
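A back-of-the-envelope connection budget using the pool numbers above (the concurrent-request and worker counts here are illustrative assumptions, not measurements from the testing):

```python
# QueuePool defaults discussed in this PR.
pool_size, max_overflow = 5, 10
per_engine = pool_size + max_overflow  # max connections one Engine can open

# Before: each privacy request could end up with its own Engine, so e.g.
# 10 concurrent requests could, in the worst case, want:
before = 10 * per_engine

# After: one Engine per celery process; with --concurrency=2 and, say,
# 2 workers, that's 4 processes (and 4 Engines) total:
after = 2 * 2 * per_engine

assert (per_engine, before, after) == (15, 150, 60)
```

Under these assumptions the worst case drops from well over the 100-connection default to comfortably under it.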
Side Effects
Datastores that are disabled while a PrivacyRequest is in progress will have their current state picked up and remaining collections will be skipped.
ExecutionLog creation runs a Session.commit(), which expires the state of all instances in the Session so that the next time we access them, they have the most recent attributes. Our previous behavior queried everything up-front and then used that initial state for the remainder of the privacy request. This change means that attributes can change state mid-privacy request.
Future things to experiment with

- pool_size and max_overflow: making the pool slightly smaller or slightly larger only had small effects on total connections, but I didn't experiment with larger changes.

Checklist
- Update CHANGELOG.md file
  - Merge in main so the most recent CHANGELOG.md file is being appended to
  - Add a description within the Unreleased section in an appropriate category. Add a new category from the list at the top of the file if the needed one isn't already there.
- The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

Ticket
Fixes #810