This is intended to be a living document/checklist to aid you through your database upgrade process. Copy it to somewhere that allows you to easily check-off the checklist items, write values, edit as necessary, and share with your team to follow along in realtime. We used Notion.
For more information on this runbook, see our blog post.
Use this table as a centralized place to write down various values that will be needed throughout the upgrade process.
Item | Value |
---|---|
Jumpbox host | |
PgBouncer host | |
Added SG rules | |
green cluster writer host | e.g. medplumstack-backenddatabase-pg16-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com |
green subnet group name | |
green security group ID | |
blue create replication slot output | e.g. (my_replication_slot,0/F0407060) |
green initial LSN | e.g. 0/F09A41E0 |
extensions upgraded | e.g. pg_trgm 1.4 --> 1.6 |
ReaderDatabaseSecret value | e.g. arn:aws:secretsmanager:us-east-1:123412341234:secret:MedplumProdDatabaseReaderInstance-xxxxxx |
DB secret w/ blue | |
DB secret w/ pgbouncer | |
DB secret w/ green | |
reader DB secret w/ blue | |
reader DB secret w/ green |
-
Update cluster parameter group to enable logical replication
- Enable RDS Proxy
"rds.logical_replication": "1", "max_replication_slots": "25", "max_wal_senders": "25", "max_logical_replication_workers": "25", "max_worker_processes": "50",
- Restart database to apply
- Disable RDS Proxy
-
Provision PgBouncer server based on
medplum-ee/packages/pg-upgrade/README.md
-
Provision jumpbox server
- Install psql, Makefile, .pgpass, etc.
- Sync pg-upgrade package
- Test cutover script with expected failures since green does not exist
- Ensure SSH connectivity to PgBouncer
- Ensure Postgres connectivity to PgBouncer
-
Update Security Groups to allow connectivity
- Copy PgBouncer SG URL above
- Copy Jumpbox SG URL above
- Copy rules added to
Medplum<Name>-BackEndDatabaseClusterSecurityGroup
above
-
On blue, ensure all tables have PKEY or REPLICA IDENTITY FULL
- Run all
ALTER TABLE
statements output by the query below - Verify no rows are returned by the query below
SELECT 'ALTER TABLE "' || tablename || '" REPLICA IDENTITY FULL;' FROM pg_tables t WHERE t.schemaname IN ('public') AND NOT EXISTS ( SELECT 1 FROM pg_index i JOIN pg_constraint c ON i.indexrelid = c.conindid WHERE i.indrelid = (quote_ident(t.schemaname) || '.' || quote_ident(t.tablename))::regclass AND c.contype = 'p' -- 'p' for primary key ) AND EXISTS ( SELECT 1 FROM pg_class c WHERE c.oid = (quote_ident(t.schemaname) || '.' || quote_ident(t.tablename))::regclass AND c.relreplident = 'd' -- 'd' stands for default replica identity ) ORDER BY schemaname, tablename;
- Run all
-
On blue, create publication and replication slot:
SELECT * FROM pg_publication; CREATE PUBLICATION my_publication FOR ALL TABLES; SELECT * FROM pg_replication_slots; SELECT pg_create_logical_replication_slot('my_replication_slot', 'pgoutput');
-
Write output from
pg_create_logical_replication_slot
query intoblue create replication slot output
in table above -
Create Subnets Group named like
medplum<name>-backenddatabaseclusterpg16subnets
- Supply a description
- Use correct VPC
- Match AZs and Subnets from previous subnets group
- Copy subnet group name in table above
-
Create Security Group named like
Medplum<Name>-BackEndDatabaseCluterPG16SecurityGroup
- Inbound from MedplumSecurityGroup
- Use correct VPC
- Copy security group ID in table above
-
Create green cluster via Actions -> Create clone
- DB Instance Identifier
medplum<stackname>-backenddatabase-pg16
- Choose subnet group created above
- Choose security group created above
- Specify Initial database name as
medplum
- Log exports > check PostgreSQL log
- Enable deletion protection
- Write green cluster host into
green cluster writer host
in table above
- DB Instance Identifier
-
On green, obtain initial LSN of green writer BEFORE starting version upgrade
- Sanity check it is larger than value in
blue create replication slot output
from table above
# if on postgres <=12.9 ./get-start-lsn.sh <writer-db-instance-id> # if on postgres >=12.10 SELECT aurora_volume_logical_start_lsn();
- Write LSN into
green initial LSN
in table above
- Sanity check it is larger than value in
-
On green, delete replication slot and publication
SELECT * FROM pg_replication_slots; SELECT pg_drop_replication_slot('my_replication_slot'); SELECT * FROM pg_publication; DROP PUBLICATION my_publication;
-
On green, upgrade to desired Major version, e.g. 16.4
-
On green, upgrade extensions to newest versions
SELECT extname, extversion FROM pg_extension ORDER BY extname; SELECT name, max(version) FROM pg_available_extension_versions WHERE name IN (SELECT extname FROM pg_extension) GROUP BY name ORDER BY name; ALTER EXTENSION extension_name UPDATE TO 'new_version';
-
On green, create subscription. transaction and SET LOCALs to prevent password being logged
BEGIN; SET LOCAL log_statement='none'; SET LOCAL log_min_duration_statement=-1; CREATE SUBSCRIPTION my_subscription CONNECTION 'postgres://admin_user_name:admin_user_password@blue-host/database' PUBLICATION my_publication WITH (copy_data = false, create_slot = false, enabled = false, connect = true, slot_name = 'my_replication_slot'); COMMIT;
-
On green, advance replication origin
SELECT * FROM pg_replication_origin; -- replace <roname> with roname from pg_replication_origin query -- replace <initial LSN> with initial LSN value obtained above SELECT pg_replication_origin_advance('<roname>', '<initial LSN>');
-
On green, enable logical replication
ALTER SUBSCRIPTION my_subscription ENABLE;
-
On blue, monitor replication lag
SELECT now() AS CURRENT_TIME, slot_name, active, active_pid, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS diff_size, pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS diff_bytes FROM pg_replication_slots WHERE slot_type = 'logical';
- Wait for
diff_size
anddiff_bytes
reach 0, replication has caught up.
- Wait for
-
On green, analyze the database after replication has caught up. (staging took 20 minutes)
-- ANALYZE can take a while depending on the size of your database and server, so set a high timeout SET statement_timeout = '999min'; SELECT now(); ANALYZE VERBOSE; SELECT now(); -- 2024-10-22 01:56:00.307045+00 -- 2024-10-22 04:11:06.133202+00 SELECT now(), count(1), min(relname), max(relname) FROM pg_stat_all_tables WHERE schemaname = 'public' AND greatest(coalesce(last_autoanalyze, '2024-01-01'), coalesce(last_analyze, '2024-01-01')) < '2024-10-22 01:56:00.307045+00'; SELECT schemaname, relname, last_autoanalyze, last_analyze FROM pg_stat_all_tables WHERE schemaname = 'public' ORDER BY relname; SELECT count(1), min(relname), max(relname), string_agg(relname, '", "') FROM pg_stat_all_tables WHERE schemaname = 'public' AND COALESCE(last_autoanalyze, last_analyze) IS NULL;
- Add readers to green cluster
- Remove RDS Proxy,
/medplum/${name}/databaseProxyEndpoint
, from server config if present - Remove usage of readers from server config
- Copy arn stored in value of
/medplum/prod/ReaderDatabaseSecrets
to table above - delete
/medplum/prod/ReaderDatabaseSecrets
- delete
/medplum/prod/readonlyDatabase.disableConnectionConfiguration
if present - Redeploy
- Copy arn stored in value of
- Add PgBouncer to server config secret, NOT using
/medplum/${name}/databaseProxyEndpoint
since we needssl.rejectUnauthorized = false
- Update
DatabaseSecrets
secret with value fromDB secret w/ pgbouncer
in table above - Redeploy
- Update
-
Query CloudWatch logs for 500s
-
Run
npx tsx src/cli.ts
-
Run
npx tsx src/cutover.ts --dry-run
- Review output
-
Start k6 constant.js script
-
Wait for no elevated load, batch jobs, etc. before continuing
-
Run
npx tsx src/cutover.ts --no-dry-run
- Review output
-
Ensure server working as expected
-
Review Cloudwatch logs for 500s
-
On green, disable and drop subscription (w/o the ALTER statement, the replication slot on blue is also dropped)
-- ALTER SUBSCRIPTION my_subscription SET (slot_name = NONE); DROP SUBSCRIPTION my_subscription;
-
On blue, drop publication and verify replication slot is dropped
DROP PUBLICATION my_publication; SELECT * FROM pg_publication; SELECT * FROM pg_replication_slots;
-
On green, drop all REPLICA IDENTITY FULL
- Run all
ALTER TABLE
statements output by the query below - Verify no rows are returned by the query below
SELECT 'ALTER TABLE "' || tablename || '" REPLICA IDENTITY DEFAULT;' FROM pg_tables t WHERE t.schemaname IN ('public') AND EXISTS ( SELECT 1 FROM pg_class c WHERE c.oid = (quote_ident(t.schemaname) || '.' || quote_ident(t.tablename))::regclass AND c.relreplident = 'f' -- 'f' means REPLICA IDENTITY FULL ) ORDER BY schemaname, tablename;
- Run all
-
Remove PgBouncer from server config
- Update Secret pointed to by
/medplum/prod/DatabaseSecrets
with value fromDB secret w/ green
in table above - Redeploy
- Update Secret pointed to by
-
Add usage of GREEN readers to server config
- Update ReaderDatabaseSecrets with value from
reader DB secret w/ green
in table above - Create
/medplum/prod/ReaderDatabaseSecrets
with value ofReaderDatabaseSecret value
in table above - Create
/medplum/prod/readonlyDatabase.disableConnectionConfiguration
with valuetrue
- Redeploy
- Update ReaderDatabaseSecrets with value from
- Rectify CDK config with new RDS cluster (w/o any shutdowns/replacements)
- Shutdown/terminate blue database
- Shutdown/terminate PgBouncer server
- Shutdown/terminate jumpbox
- Clean up security groups
- Delete PgBouncer security group
- Delete Jumpbox security group
- Remove rules
- Update dashboards
- (optional) Disable logical replication on new cluster (requires reboot)
- Is there actually a performance implication of leaving it enabled?
WITH inactive_connections AS (
SELECT
pid,
datname,
usename,
application_name,
(current_timestamp - state_change) as idle_time,
rank() over (order by current_timestamp - state_change ASC) as rank
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
-- AND application_name ~ '(medplum-server)|(^$)'
AND datname = 'medplum'
AND usename = 'clusteradmin'
AND state in ('idle', 'idle in transaction', 'idle in transaction (aborted)', 'disabled')
AND current_timestamp - state_change >= interval '1 seconds'
ORDER BY idle_time
) SELECT pg_terminate_backend(pid) FROM inactive_connections WHERE rank >= 1;
SELECT pid,datname,usename,query FROM pg_stat_activity where state <> 'idle' AND pid<>pg_backend_pid();
- Prevent logging password
- Import resources into CDK stack
- PgBouncer config
- PgBouncer SSL setup percona crunchydata