This repository was archived by the owner on Nov 30, 2022. It is now read-only.

Reduce # of clients connected to the application db [#810] #944

Merged
merged 10 commits into main from fidesops_810_reduce_connections
Jul 27, 2022

Conversation

pattisdr
Contributor

@pattisdr pattisdr commented Jul 25, 2022

Purpose

Since around the time of the move to Celery, light local load testing shows it's very easy to open too many connections against the fidesops application database (the current limit is 100). This PR investigates improvements and sets some reasonable defaults.

worker_1            | [2022-07-06 13:03:43,882: WARNING/ForkPoolWorker-1]   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 584, in connect
worker_1            |     return self.dbapi.connect(*cargs, **cparams)
worker_1            | [2022-07-06 13:03:43,883: WARNING/ForkPoolWorker-1]   File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
worker_1            |     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
worker_1            | [2022-07-06 13:03:43,885: WARNING/ForkPoolWorker-1] sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  sorry, too many clients already
worker_1            | 

Changes

Big picture, we now do three things:

  1. Share a single Session within each Celery process
  2. Use that same Session to write execution logs for a privacy request, instead of creating a new one
  3. Limit task concurrency to two per worker

More details below:

  • Make run_privacy_request a bound celery task so we have access to the task itself within the function.
  • Create a custom celery Task class, DatabaseTask, which extends Task but caches a Session to be shared across a celery process. Each celery process gets an Engine that maintains its own connection pool, separate from the other processes, and the shared Session is created from that Engine. The Engine and Session are therefore created once per process (see the sketch just after this list).
    • Close the Session after each privacy request has finished executing. This just resets the session and returns its connections to the pool, so the Session can be reused (we were already doing this).
    • Remove unnecessary places where the Session is closed manually: the Session is used as a context manager, which already closes it.
  • Pass the same Session that run_privacy_request is using through to graph_task.run_access_request and graph_task.run_erasure, and on to TaskResources, so it can be reused to create ExecutionLogs instead of creating a new Session for each log, since those extra Sessions were opening their own additional connections.
    • Don't close the Session after creating ExecutionLogs; wait until the entire privacy request is complete/exited. Because we're now using the Session passed into the privacy request instead of creating a new one, we need to wait until the end to close it.
  • Limit task concurrency to two per worker. This could later become a configuration option; two was simply our previous limit. It reduces the number of privacy requests executing simultaneously, which also spreads out the queries to customers' owned databases and connected APIs.
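
As a rough sketch of how the bound task and the shared Session fit together (the import path and task body here are illustrative assumptions, not the PR's exact code; the real DatabaseTask is quoted verbatim in the review comments below):

```python
# A minimal sketch, assuming Celery + SQLAlchemy as used in fidesops.
from fidesops.tasks import celery_app, DatabaseTask  # hypothetical import path


@celery_app.task(base=DatabaseTask, bind=True)
def run_privacy_request(self: DatabaseTask, privacy_request_id: str) -> None:
    # self.session lazily creates one Engine (with its own connection pool)
    # and one Session the first time it's accessed in this worker process,
    # then returns the cached Session on every later task invocation.
    with self.session as session:
        ...  # execute the request, passing `session` down for ExecutionLogs
    # Exiting the `with` block closes the Session, returning its connections
    # to the process-local pool so the next privacy request can reuse them.
```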

Findings

Local experimentation consisted of monitoring the number of backends connected to the application database while submitting increasing numbers and rates of privacy requests against fidesops. The test setup used a simple graph of just the postgres collections from postgres_example_test_dataset, running access requests with the policy and other details laid out in the Postman collection, with two celery workers running. A single POST to api/v1/privacy-request can create anywhere from 1 to 50 privacy requests at a time, so testing started by creating one at a time, then 6 at a time, then 6 at a time repeated 6 times in rapid succession, then 10 at a time, then 50 at a time.
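
For reference, the connection counts in the tables below can be read from Postgres's pg_stat_activity view; a minimal monitoring sketch, assuming direct psycopg2 access to the application database (the DSN is a placeholder, not the PR's actual tooling):

```python
# A sketch of the kind of query used to watch backend counts.
import psycopg2

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/app")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT count(*) AS total,
               count(*) FILTER (WHERE state = 'idle') AS idle
        FROM pg_stat_activity
        WHERE datname = current_database()
        """
    )
    total, idle = cur.fetchone()
print(f"total connections: {total}, idle: {idle}")
conn.close()
```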

Before

We were using the default SQLAlchemy connection pool (QueuePool), and connection pools were not shared between processes, which is good. Each pool size was 5 by default, with the default max_overflow of 10. But we were calling get_db_session for each privacy request without specifying an Engine (Engines manage the connection pools), so multiple connection pools were apparently being created per process and connections weren't being reused. Additionally, writing execution logs opened separate Sessions, which opened additional connections, compounding the problem.
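
For context, these are the pooling knobs involved; a sketch of standard SQLAlchemy defaults, not this PR's code (the DSN is a placeholder):

```python
# Every create_engine() call builds its own QueuePool, so creating an Engine
# per privacy request multiplies pools instead of reusing connections.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(
    "postgresql://postgres:postgres@localhost:5432/app",  # placeholder DSN
    pool_size=5,      # QueuePool default: 5 persistent connections
    max_overflow=10,  # default burst allowance beyond pool_size
)
SessionLocal = sessionmaker(bind=engine)  # Sessions share the engine's pool
```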

The setup below seemed to quickly max out the 100-connection default and not reuse connections.

| Num Privacy Requests | Total Connections | Idle Connections |
|----------------------|-------------------|------------------|
| 1x1                  | 15                | 12               |
| 1x1                  | 23                | 21               |
| 1x1                  | 23                | 22               |
| 1x6                  | 59                | 56               |
| 1x6                  | 67                | 65               |
| 1x6                  | 77                | 74               |
| 1x6                  | 92                | 89               |
| 1x6                  | Maxed out         | 99               |

After

Still using a QueuePool of size 5 with max_overflow of 10, but each celery process now gets one Engine and one Session. We use that Session for the entirety of the request, even to create the execution logs, then close it at the end, which just resets the Session and returns its connections to the connection pool. I also limit the number of privacy requests that can execute simultaneously per worker to 2, which further reduces the number of connections we need to open at once.

| Num Privacy Requests | Total Connections | Idle Connections |
|----------------------|-------------------|------------------|
| 1x1                  | 5                 | 3                |
| 1x1                  | 7                 | 5                |
| 1x1                  | 7                 | 5                |
| 1x1                  | 9                 | 7                |
| 1x6                  | 13                | 11               |
| 1x6                  | 10                | 8                |
| 6x6                  | 15                | 12               |
| 6x6                  | 15                | 13               |
| 1x10                 | 17                | 15               |
| 1x50                 | 18                | 16               |

Side Effects

If a datastore is disabled while a PrivacyRequest is in progress, its new state will be picked up mid-request and its remaining collections will be skipped.

ExecutionLog creation runs a Session.commit(), which expires the state of all instances in the Session, so the next time we access them they have the most recent attributes. Previously, we queried everything up front and used that initial state for the remainder of the privacy request. With this change, resource state can change mid-privacy-request.
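
A small illustration of that expiry behavior, assuming a configured SessionLocal and the ConnectionConfig model (not code from this PR):

```python
# Sessions default to expire_on_commit=True, so commit() marks every loaded
# instance stale; the next attribute access re-SELECTs its current state.
session = SessionLocal()
connection_config = session.query(ConnectionConfig).first()

session.commit()  # e.g. triggered by saving an ExecutionLog mid-request

# This attribute access hits the database again, so a `disabled` flag flipped
# elsewhere while the privacy request was running is now visible here.
print(connection_config.disabled)
```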

Future things to experiment with

  • Increasing or decreasing the size of the connection pools with pool_size and max_overflow: making the pool slightly smaller or slightly larger had only small effects on total connections, but I didn't experiment with larger changes.
  • Increasing the total number of allowed connections against the fidesops database.
  • Allowing the number of workers to be configured: multiple workers managing a queue can sometimes perform better than a single worker. Two workers seemed to perform better than one, with concurrency of 2 and the default pool size/overflow.
  • Allowing the number of concurrent privacy requests to be configured: I am currently restricting this to 2 per worker to match where we had settled before celery went in.
  • Looking into celery thread-based pools (eventlet/gevent) instead of process-based pools, because a lot of our privacy request execution is IO-bound.

Checklist

  • Update CHANGELOG.md file
    • Merge in main so the most recent CHANGELOG.md file is being appended to
    • Add description within the Unreleased section in an appropriate category. Add a new category from the list at the top of the file if the needed one isn't already there.
    • Add a link to this PR at the end of the description with the PR number as the text. example: #1
  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram).
  • If docs updated (select one):
    • documentation complete, or draft/outline provided (tag docs-team to complete/review on this branch)
    • documentation issue created (tag docs-team to complete issue separately)
  • Good unit test/integration test coverage
  • This PR contains a DB migration. If checked, the reviewer should confirm with the author that the down_revision correctly references the previous migration before merging
  • The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

Ticket

Fixes #810

- Limit task concurrency to two per worker.
- Create one Engine per celery process, which opens up a connection pool. Create one Session per celery process and use that session across privacy requests.
- Close the session after the privacy request has finished executing. This just resets the session and returns connections to the pool so it can be reused.
- Remove unnecessary places where the session is closed manually, because the session is being used as a context manager and is already closed through that.
- Pass the same Session that the privacy request is using through to TaskResources, to be reused to create ExecutionLogs instead of opening a new Session.
- Don't close the session when passing it into the ExecutionLog; wait until the entire privacy request is complete/exited.
@pattisdr pattisdr changed the title [DRAFT] Reduce Number of Open Connections [#810] [DRAFT] Reduce # of clients connected to the application db [#810] Jul 25, 2022
@pattisdr
Contributor Author

One legitimate test failure here, representing a behavior change; I need to clear it with product.

@seanpreston
Contributor

Thanks @pattisdr — this is a huge improvement over the current state, and gets us back to where we were before!

@seanpreston
Contributor

One legitimate test failure here, representing a behavior change; I need to clear it with product.

For the failed test test_run_disabled_collections_in_progress, is it the case that because we have a live DB connection in the traversal, we're now able to update the execution plan to not run disabled collections for already in-progress requests?

pattisdr added 2 commits July 27, 2022 07:50
Update test to reflect new behavior that disabling a datasource while a request is in progress can cause related collections to be skipped once the current session is expired and the connection config has the most recent state.

Because the same Session that is being used to run the PrivacyRequest is now being used for ExecutionLogs, the process of saving an ExecutionLog runs a session.commit() which expires the Session and causes the ConnectionConfig to have the most recent state the next time it is accessed.
@pattisdr
Contributor Author

pattisdr commented Jul 27, 2022

@seanpreston yes, in part. The change is caused because the same Session that we use to run the PrivacyRequest is also now used to create Execution Logs, instead of creating a new Session for the Execution Logs. As part of saving the Execution Logs, we run a Session.commit(), which will cause the ConnectionConfig to have the most recent state the next time it is accessed:

https://docs.sqlalchemy.org/en/14/orm/session_basics.html#session-committing

Finally, all objects within the Session are expired as the transaction is closed out. This is so that when the instances are next accessed, either through attribute access or by them being present in the result of a SELECT, they receive the most recent state.

I cleared the behavior with @adriaaaa yesterday. She said we can go with it, and if we get feedback that it's causing issues we can revisit.

Note that this doesn't spread the Session everywhere through the Traversal yet, it just passes it through far enough for the ExecutionLog creation to be able to access it.

@pattisdr pattisdr changed the title [DRAFT] Reduce # of clients connected to the application db [#810] Reduce # of clients connected to the application db [#810] Jul 27, 2022
Comment on lines +154 to +164
class DatabaseTask(Task): # pylint: disable=W0223
_session = None

@property
def session(self) -> ContextManager[Session]:
"""Creates Session once per process"""
if self._session is None:
SessionLocal = get_db_session(config)
self._session = SessionLocal()

return self._session
Contributor Author

This allows us to share the Session across a Celery process.

This also limits the number of times get_db_session is called per process; previously it was creating new Engines, which opened new connection pools whose connections weren't being reused.

@@ -190,7 +204,6 @@ def run_privacy_request(
after_webhook_id=from_webhook_id,
)
if not proceed:
-        session.close()
Contributor Author

These session.close calls were removed because they weren't doing anything: the Session is used as a context manager above (see with self.session as session), so it is closed automatically.
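
A small sketch of that pattern, assuming SessionLocal from the earlier pooling sketch (SQLAlchemy 1.4+, where Session supports the context-manager protocol):

```python
# __exit__ calls session.close(), so an explicit close() inside is redundant.
from sqlalchemy import text


def do_work(proceed: bool) -> None:
    with SessionLocal() as session:
        if not proceed:
            return  # leaving the `with` block still closes the session
        session.execute(text("SELECT 1"))  # stand-in for real work
    # session.close() has already run here via the context manager
```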

Comment on lines +576 to +578
with TaskResources(
privacy_request, policy, connection_configs, session
) as resources:
Contributor Author

We're passing the Session into TaskResources here so we can use it to create the ExecutionLogs.

Comment on lines -160 to +161
- SessionLocal = get_db_session(config)
- db = SessionLocal()
+ db = self.session
Contributor Author

We're using the same Session that is being used for the PrivacyRequest rather than creating a new one, which reduces the number of connections we open. It also has a side effect: other resources bound to the PrivacyRequest's Session get the most recent state when the ExecutionLog is saved, because we commit the Session.

Contributor

I think the main implication here is with request cancellation? I.e., if .cancelled_at / .paused_at / .status were to change, we may be able to abort the traversal entirely?

Contributor Author

@pattisdr pattisdr Jul 27, 2022

Yes, I think this is a good example of where we can make better decisions mid-traversal in the future, based on a resource's current state.

@@ -48,7 +48,7 @@ def _create_celery(config_path: str = config.execution.celery_config_path) -> Ce

def start_worker() -> None:
logger.info("Running Celery worker...")
celery_app.worker_main(argv=["worker", "--loglevel=info"])
celery_app.worker_main(argv=["worker", "--loglevel=info", "--concurrency=2"])
Contributor Author

@pattisdr pattisdr Jul 27, 2022

This could be configurable in the future, but it matches our state before the move to Celery. The default is the number of cores on your machine, but limiting the number of PrivacyRequests that can run simultaneously per worker has a lot of positive effects: it reduces the number of simultaneous connections to the application database and to customers' owned databases, and it reduces simultaneous requests against their connected APIs.

Contributor

Thanks, I've added a ticket to be addressed as a follow-up.

@pattisdr pattisdr marked this pull request as ready for review July 27, 2022 12:07
@pattisdr pattisdr added the run unsafe ci checks Triggers running of unsafe CI checks label Jul 27, 2022
@seanpreston seanpreston self-assigned this Jul 27, 2022
@seanpreston seanpreston merged commit c8ba158 into main Jul 27, 2022
@seanpreston seanpreston deleted the fidesops_810_reduce_connections branch July 27, 2022 16:20
sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* Reduce number of open connections:

- Limit task concurrency to two per worker.
- Create one Engine per celery process, which opens up a connection pool. Create one Session per celery process and use that session across privacy requests.
- Close the session after the privacy request has finished executing. This just resets the session and returns connections to the pool so it can be reused.
- Remove unnecessary places where the session is closed manually, because the session is being used as a context manager and is already closed through that.
- Pass the same Session that the privacy request is using through to TaskResources, to be reused to create ExecutionLogs instead of opening a new Session.
- Don't close the session when passing it into the ExecutionLog; wait until the entire privacy request is complete/exited.

* Define "self" for run_privacy_task - it's the task itself.

For mypy's benefit, define that the session is a context manager.

* Make the session non-optional for graph_task.run_access_request, graph_task.run_erasure, and for instantiating TaskResources

* Use missing db fixture.

* Add missing db resource.

* Update test to reflect new behavior that disabling a datasource while a request is in progress can cause related collections to be skipped once the current session is expired and the connection config has the most recent state.

Because the same Session that is being used to run the PrivacyRequest is now being used for ExecutionLogs, the process of saving an ExecutionLog runs a session.commit() which expires the Session and causes the ConnectionConfig to have the most recent state the next time it is accessed.

* Update CHANGELOG.