
[Spike] Privacy Request Limit Concurrency / Too Many Open Connections (Timebox: 3 days) #810

Closed
pattisdr opened this issue Jul 6, 2022 · 1 comment · Fixed by #944

pattisdr commented Jul 6, 2022

Is your feature request related to a specific problem?

  • Locally, when executing multiple consecutive privacy requests, you can run into:

```
worker_1            | [2022-07-06 13:03:43,882: WARNING/ForkPoolWorker-1]   File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 584, in connect
worker_1            |     return self.dbapi.connect(*cargs, **cparams)
worker_1            | [2022-07-06 13:03:43,883: WARNING/ForkPoolWorker-1]   File "/usr/local/lib/python3.9/site-packages/psycopg2/__init__.py", line 122, in connect
worker_1            |     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
worker_1            | [2022-07-06 13:03:43,885: WARNING/ForkPoolWorker-1] sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  sorry, too many clients already
```

as we open up too many connections against the fidesops postgres database.

  • Also, some time after running all these privacy requests, I still have a lot of open connections. Investigate if there's been a regression in closing connections.

Describe the solution you'd like

  1. Add sane defaults for limiting how many privacy requests execute concurrently. We used to have this when we capped the number of concurrent threads under asyncio (Reduce Number of Concurrent Threads #145), but now that we've moved to Celery, we could use the --concurrency flag (see the sketch after this list).

  2. Verify we're closing connections after we open them.
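
A minimal sketch of what such a default could look like, assuming the Celery app is configured in Python (the app name and values below are illustrative, not the actual fidesops configuration):

```python
from celery import Celery

app = Celery("fidesops")  # hypothetical app name for illustration

app.conf.update(
    worker_concurrency=2,          # execute at most 2 tasks per worker
    worker_prefetch_multiplier=1,  # don't reserve extra tasks per child process
)
```

The same cap can also be passed on the command line via `celery -A <app> worker --concurrency=2`.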


Additional context

If we open too many connections to the datastore (mainly the application database) concurrently, it can't do its job properly. We introduced this issue when we introduced Celery. In the future we can tune this further with other methods to improve performance.

@pattisdr pattisdr added the enhancement New feature or request label Jul 6, 2022
@pattisdr pattisdr changed the title from "Privacy Request Limit Concurrency" to "Privacy Request Limit Concurrency / Too Many Open Connections" Jul 6, 2022
@pattisdr pattisdr assigned pattisdr and unassigned pattisdr Jul 11, 2022

pattisdr commented Jul 13, 2022

To test:

  1. In a terminal, run:

```
PGPASSWORD=<postgres password from toml file> watch 'psql -h localhost -p 5432 -U postgres -d app -c "SELECT sum(numbackends) FROM pg_stat_database"'
```

This polls the fidesops postgres database for its total connection count every 2 seconds (watch's default interval).

  2. In Postman, after running the prerequisite calls for "Minimum API calls to create an access request", send multiple requests to create privacy requests. To quickly exceed the default number of allowed connections (100), post several requests at once. Here, I am just running access requests for the same identity, and my graph only has the postgres collections in it.

POST {{host}}/privacy-request

```json
[
    {
        "requested_at": "2021-08-30T16:09:37.359Z",
        "identity": {"email": "[email protected]"},
        "policy_key": "{{policy_key}}"
    },
    {
        "requested_at": "2021-08-30T16:09:37.359Z",
        "identity": {"email": "[email protected]"},
        "policy_key": "{{policy_key}}"
    },
    {
        "requested_at": "2021-08-30T16:09:37.359Z",
        "identity": {"email": "[email protected]"},
        "policy_key": "{{policy_key}}"
    },
    {
        "requested_at": "2021-08-30T16:09:37.359Z",
        "identity": {"email": "[email protected]"},
        "policy_key": "{{policy_key}}"
    },
    {
        "requested_at": "2021-08-30T16:09:37.359Z",
        "identity": {"email": "[email protected]"},
        "policy_key": "{{policy_key}}"
    },
    {
        "requested_at": "2021-08-30T16:09:37.359Z",
        "identity": {"email": "[email protected]"},
        "policy_key": "{{policy_key}}"
    }
]
```
Early thoughts:

Connections are quickly swamped. pg_stat_activity shows numerous ROLLBACK entries from connections sitting idle, waiting on the client for another query. It seems like we're opening new connections rather than reusing existing ones. You can very quickly get over 100.

One such pg_stat_activity row, idle after a ROLLBACK and waiting on ClientRead:

```
16384 | app     | 472 |       10 | postgres |                  | 172.18.0.10 |                 |       60566 | 2022-07-12 20:20:39.683792+00 |                               | 2022-07-12 20:20:40.780926+00 | 2022-07-12 20:20:40.781194+00 | Client          | ClientRead          | idle   |             |              | ROLLBACK
```
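
To see at a glance where connections are piling up, the same stats can be grouped by state and last query; a quick sketch using psycopg2 (the host, database, and credential placeholders mirror the local dev setup above):

```python
import psycopg2

# Count open connections per (state, last query) to spot idle sessions
# left hanging after a ROLLBACK. Credentials are placeholders.
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="app",
    user="postgres", password="<postgres password from toml file>",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT state, query, count(*) FROM pg_stat_activity "
        "GROUP BY state, query ORDER BY count(*) DESC"
    )
    for state, query, count in cur.fetchall():
        print(count, state, query)
conn.close()
```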

I'm guessing we are using SQLAlchemy's default QueuePool. I haven't looked into what changed with the addition of Celery yet, but we need to make sure the connection pool is not being shared across several celery workers. Setting a concurrency limit on the celery workers should help, but I think we also need to make sure we're reusing connections (see the sketch below).
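
One way to guarantee the pool isn't shared, sketched under the assumption of Celery's default prefork pool: build the Engine (and therefore its QueuePool) inside each worker process after the fork, so child processes never inherit pooled connections from the parent (the DSN and pool sizes are placeholders):

```python
from celery.signals import worker_process_init
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

SessionLocal = None  # rebuilt in each worker process


@worker_process_init.connect
def init_db(**kwargs):
    """Create a fresh Engine after the fork so its connection pool is
    private to this worker process rather than shared across forks."""
    global SessionLocal
    engine = create_engine(
        "postgresql://postgres:<password>@db:5432/app",  # placeholder DSN
        pool_size=5,     # steady-state connections for this process
        max_overflow=5,  # temporary burst headroom
    )
    SessionLocal = sessionmaker(bind=engine)
```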

@pattisdr pattisdr self-assigned this Jul 13, 2022
@pattisdr pattisdr self-assigned this Jul 20, 2022
@adriaaaa adriaaaa changed the title from "Privacy Request Limit Concurrency / Too Many Open Connections" to "[Spike] Privacy Request Limit Concurrency / Too Many Open Connections (Timebox: 3 days)" Jul 20, 2022
seanpreston pushed a commit that referenced this issue Jul 27, 2022
* Reduce number of open connections:

- Limit task concurrency to two per worker.
- Create one Engine per celery process which opens up a connection pool.  Create one Session per celery process and use that session across privacy requests.
- Close the session after the privacy request has finished executing.  This just resets the session and returns its connections to the pool, so it can be reused.
- Remove unnecessary places where session is closed manually because the session is being used as a context manager and is already closed through that.
- Pass the same Session that the privacy request is using through to TaskResources to be re-used to create ExecutionLogs instead of opening up a new Session.
- Don't close the session when passing it into the Execution Log, wait until the entire privacy request is complete/exited.

* Define "self" for run_privacy_task - it's the task itself.

For mypy's benefit, declare that the session is a context manager.

* Make the session non-optional for graph_task.run_access_request, graph_task.run_erasure, and for instantiating TaskResources.

* Use missing db fixture.

* Add missing db resource.

* Update test to reflect new behavior that disabling a datasource while a request is in progress can cause related collections to be skipped once the current session is expired and the connection config has the most recent state.

Because the same Session that is being used to run the PrivacyRequest is now being used for ExecutionLogs, the process of saving an ExecutionLog runs a session.commit() which expires the Session and causes the ConnectionConfig to have the most recent state the next time it is accessed.

* Update CHANGELOG.
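
The session lifecycle those bullets describe might look roughly like this (a sketch, not the actual fidesops code; run_access_request is a stand-in for graph_task.run_access_request, and SessionLocal is the per-process sessionmaker):

```python
from sqlalchemy.orm import Session, sessionmaker


def run_access_request(session: Session) -> None:
    """Stand-in for graph_task.run_access_request: executes the access
    graph, reusing the same session for ExecutionLog writes."""
    ...


def run_privacy_request_task(SessionLocal: sessionmaker) -> None:
    # One Session per privacy request, used as a context manager
    # (SQLAlchemy 1.4+): exiting the block calls close(), which resets
    # the Session and returns its connections to the pool for reuse.
    with SessionLocal() as session:
        run_access_request(session)
        session.commit()
```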
sanders41 pushed a commit that referenced this issue Sep 22, 2022