
Optimize costs #1149

Open · snarfed opened this issue Jun 24, 2024 · 68 comments

snarfed commented Jun 24, 2024

Don't want to draw attention to this, I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.

I expect there's plenty of low hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.

[billing charts: month and year views]

snarfed added the now label Jun 25, 2024

snarfed commented Jun 26, 2024

related: #1152 (but I expect outbox is a small fraction of overall cost)


snarfed commented Jun 26, 2024

Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .


snarfed commented Jun 28, 2024

Got ~20h of datastore logs, digging into them in log analytics now. https://console.cloud.google.com/logs/analytics


snarfed commented Jun 28, 2024

...maybe promising? Eg this query breaks down by API method:

```sql
SELECT DISTINCT
  proto_payload.audit_log.method_name as method_name,
  count(*) as count FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
group by method_name
order by count desc
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
limit 200
```

I can't aggregate (GROUP BY) on fields inside proto_payload though; I get `Grouping by expressions of type JSON is not allowed`. Next step: try BigQuery to see if it can get around that.


snarfed commented Jun 28, 2024

It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.


snarfed commented Jun 28, 2024

Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.


snarfed commented Jul 25, 2024

Back to looking at this.


snarfed commented Jul 30, 2024

Some useful datastore API usage analytics below, and I'm already finding useful results: queries are ~40% of datastore API calls, and the vast majority of those are looking up original Objects and users for a given copy.

Query:

```sql
SELECT
  DISTINCT proto_payload.audit_log.method_name as method_name, count(*) as count
FROM bridgy-federated.logs._AllLogs
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  group by method_name
  order by count desc
```

| method_name | count |
| --- | --- |
| Lookup | 4374961 |
| RunQuery | 3596350 |
| Commit | 548773 |
| BeginTx | 353552 |
| Rollback | 23279 |
| ... | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) as num_composite_filters,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
group by kind, num_composite_filters
order by count(*) desc
```

| kind | num_composite_filters | count |
| --- | --- | --- |
| Object | | 767016 |
| ATProto | | 764103 |
| MagicKey | | 755585 |
| ActivityPub | | 754627 |
| UIProtocol | | 468614 |
| Follower | 2 | 49023 |
| AtpBlock | | 17479 |
| AtpRemoteBlob | | 6890 |
| AtpRepo | | 3566 |
| Follower | | 3147 |
| AtpRepo | 2 | |
| ATProto | 2 | |
| ... | | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  FilterStr(proto_payload.audit_log.request.query.filter) as filter,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
  and proto_payload.audit_log.request.query.filter.propertyFilter is not null
group by kind, filter
order by count(*) desc
```

| kind | filter | count |
| --- | --- | --- |
| ActivityPub | copies.uri EQUAL | 753344 |
| MagicKey | copies.uri EQUAL | 753290 |
| Object | copies.uri EQUAL | 753245 |
| ATProto | copies.uri EQUAL | 753243 |
| UIProtocol | copies.uri EQUAL | 468614 |
| AtpBlock | seq GREATER_THAN_OR_EQUAL | 17479 |
| Object | users EQUAL | 12535 |
| AtpRemoteBlob | cid EQUAL | 6890 |
| ATProto | handle EQUAL | 6516 |
| MagicKey | manual_opt_out EQ | 2295 |
| Follower | from EQUAL | 1575 |
| Follower | to EQUAL | 1572 |
| ATProto | enabled_protocols NOT_EQUAL | 1232 |
| ActivityPub | enabled_protocols NOT_EQUAL | 1231 |
| ... | | |


snarfed commented Jul 30, 2024

Now looking at lookups aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Query:

```sql
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) as num_keys,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
group by num_keys
order by count(*) desc
```

| num_keys | count |
| --- | --- |
| 1 | 4371422 |
| 2 | 999 |
| 3 | 519 |
| 4 | 365 |
| 5 | 229 |
| 6 | 171 |
| 7 | 101 |
| 8 | 83 |
| 9 | 74 |
| 100 | 68 |
| 12 | 67 |
| ... | |
Query:

```sql
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) as kind,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
group by kind
order by count(*) desc
```

| kind | count |
| --- | --- |
| Object | 4080965 |
| ATProto | 112141 |
| AtpBlock | 97120 |
| MagicKey | 38829 |
| ActivityPub | 29996 |
| AtpRepo | 7222 |
| AtpSequence | 3551 |
| AtpRemoteBlob | 1574 |
| Cursor | 24 |

Object lookups by id scheme:

Query:

```sql
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] as scheme,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
group by scheme
order by count(*) desc
```

| scheme | count |
| --- | --- |
| did | 3434021 |
| at | 413000 |
| https | 235446 |
| http | 85 |
| ... | |


snarfed commented Jul 31, 2024

I don't understand why the DID Object lookups aren't using memcache. The ndb contexts in all three services seem to have it configured right. Here are the top DIDs by number of lookups, in just a ~9h window. Almost all of them should have been cached. Why weren't they?

Query:

```sql
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) as name,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  and split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
group by name
order by count(*) desc
limit 100
```

| name | count |
| --- | --- |
| did:plc:ayutykgvyf4x7ev5ornltyzz | 176211 |
| did:plc:4mjwxpnhoeaknxqabwhf2n6i | 64569 |
| did:plc:p2ygpwluon3vrk5yecjq7wc5 | 56364 |
| did:plc:hm5cxb2g2q4ma4ucsks73tex | 53051 |
| did:plc:b5glolrcdfnaxwc4dbe4zbty | 43557 |
| did:plc:3rxcq5wacdop5thjoh3sny3p | 35103 |
| did:plc:6zfgmvghpidjvcn3cqtitnx5 | 28309 |
| did:plc:5syae7p7dr6zsfdoe4k6buky | 26882 |
| did:plc:dqgdfku26vpxdktkzne5x2xj | 26240 |
| did:plc:sese4fb6luojywziva3s7zjo | 24876 |
| ... | |
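As a sanity check on the cache layer itself, here's a minimal probe sketch (not from the repo) that talks to the same memcache host the services use via MEMCACHE_HOST, round-trips a sentinel key, and dumps the server's hit/miss counters:

```python
# Hypothetical diagnostic, not part of bridgy-fed: verify basic memcache
# connectivity and look at the server's counters.
import os
import pymemcache.client.base

client = pymemcache.client.base.Client(
    os.environ['MEMCACHE_HOST'],  # same env var common.py uses
    timeout=10, connect_timeout=10)

# round-trip a sentinel key
client.set('bridgy-fed-cache-probe', b'ok', expire=60)
print('probe readback:', client.get('bridgy-fed-cache-probe'))

# get_hits vs get_misses is a rough proxy for whether ndb's global cache
# traffic is landing on this server at all
print(client.stats())
```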

snarfed added a commit that referenced this issue Aug 1, 2024

snarfed commented Aug 1, 2024

Progress here! Managed to cut datastore lookups and queries both way down, 5-10x each.

[images: datastore lookup and query rate charts, before and after]

snarfed added a commit to snarfed/webutil that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024

snarfed commented Aug 2, 2024

Now looking at optimizing log storage. We were doing 250G/mo a bit ago; we're now down to ~150G/mo or so.

The main cost here is initial ingest. We get 50G/mo for free, then $.50/G after that. That includes 30d retention, and our current retention period is set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing
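Rough math on that, using the numbers above: with the 50G/mo free tier and $.50/G after, ~150G/mo of ingest is about (150 − 50) × $0.50 ≈ $50/mo, down from roughly $100/mo at the earlier 250G/mo rate.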

Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging

Logs in BigQuery: https://console.cloud.google.com/bigquery?invt=AbjFiQ&project=bridgy-federated&inv=1&ws=%211m0

Logs dashboard: https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated

[image: logs dashboard]


snarfed commented Aug 2, 2024

The next cost to look at is CPU. router is currently on four cores, atproto-hub on one. We should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs

[image: CPU usage chart]
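For reference, a minimal sketch of wiring up Cloud Profiler in a Python service; the service name and version label here are made up, and where exactly to call this in router/atproto-hub is an open question:

```python
# Sketch only: start the Cloud Profiler agent at process startup so we can see
# where router/atproto-hub actually spend CPU.
import logging

import googlecloudprofiler

try:
    googlecloudprofiler.start(
        service='router',        # hypothetical service name
        service_version='live',  # hypothetical version label
        verbose=1,               # 0=error, 1=warning, 2=info, 3=debug
    )
except (ValueError, NotImplementedError) as exc:
    # Profiling is best effort; don't break the service if the agent can't start.
    logging.warning('Cloud Profiler not started: %s', exc)
```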


snarfed commented Aug 3, 2024

Next step for datastore reads: snarfed/arroba#30


snarfed commented Aug 9, 2024

Results from the datastore reads optimization here have been disappointing. I cut them by ~3x, but spend on datastore read API calls only went down maybe 1/4-1/3, from $10-12/d to $7-8/d. Hrmph.

[images: billing cost breakdown charts]

(Datastore reads are the top blue part of the cost bars above.)


snarfed commented Aug 13, 2024

Another opportunity here could be reducing router and atproto-hub allocated CPU. Router CPU is down to mostly under two cores, atproto-hub to maybe half a core. Could either drop router from 4 to 2, or leave it at 4 and merge atproto-hub into it.

[image: router and atproto-hub CPU usage chart]


snarfed commented Aug 15, 2024

^ The difficulty with merging router and atproto-hub is that we currently run four WSGI workers for router, and we want to run the atproto-hub threads on just one of them. I had trouble running all the threads in just one worker on router a while back, but that may have been a red herring; it may actually work OK after all, and I doubt we're compute-bound in a way that we'd care about the GIL. Worth trying.


snarfed commented Aug 15, 2024

Another idea here: stop storing and fetching transient activities (create, update, delete, undo, accept/reject, etc) in the datastore entirely. Just use memcache for them.
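A rough sketch of what that could look like, assuming the pymemcache client from common.py; the activity-type check, key format, TTL, and function names are all made up here:

```python
# Hypothetical sketch, not the real Bridgy Fed code: keep transient activities
# out of the datastore and only cache them in memcache with a TTL.
import json

TRANSIENT_TYPES = {'create', 'update', 'delete', 'undo', 'accept', 'reject'}
TRANSIENT_TTL = 24 * 60 * 60  # seconds; arbitrary

def store_activity(memcache, id, activity):
    if activity.get('verb') in TRANSIENT_TYPES:
        # memcache keys max out at 250 bytes and can't contain spaces, so a
        # real version would probably hash the id
        memcache.set(f'activity:{id}', json.dumps(activity),
                     expire=TRANSIENT_TTL)
    else:
        ...  # fall through to the existing datastore write path

def load_activity(memcache, id):
    cached = memcache.get(f'activity:{id}')
    if cached is not None:
        return json.loads(cached)
    return None  # or fall back to the datastore for non-transient activities
```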


snarfed commented Aug 15, 2024

...but looking at our incoming activity types, that might not make a big difference. The bulk of incoming activities are likes, follows, and reposts, in that order.

Measure first, then optimize.

[image: incoming activity types chart]

(I should double check to make sure that count isn't after various filtering though!)


snarfed commented Dec 21, 2024

Here's everything I can think of that I've done to get ndb caching in memcache working, or to even check that it's working:

  • Configured ndb to use memcache for its global cache, with MEMCACHE_HOST: '10.126.144.3':

  bridgy-fed/common.py, lines 108 to 118 in 382b30e:

  ```python
  if memcache_host := os.environ.get('MEMCACHE_HOST'):
      logger.info(f'Using real memcache at {memcache_host}')
      memcache = pymemcache.client.base.PooledClient(
          memcache_host, timeout=10, connect_timeout=10,  # seconds
          allow_unicode_keys=True)
      pickle_memcache = pymemcache.client.base.PooledClient(
          memcache_host, timeout=10, connect_timeout=10,  # seconds
          serde=PickleSerde(), allow_unicode_keys=True)
      # ideally we'd use MemcacheCache.from_environment, but it doesn't let us
      # pass kwargs like serde to the pymemcache client constructor
      global_cache = MemcacheCache(memcache, strict_read=True)
  ```

  bridgy-fed/common.py, lines 484 to 491 in 382b30e:

  ```python
  NDB_CONTEXT_KWARGS = {
      # limited context-local cache. avoid full one due to this bug:
      # https://github.com/googleapis/python-ndb/issues/888
      'cache_policy': cache_policy,
      'global_cache': global_cache,
      'global_cache_policy': global_cache_policy,
      'global_cache_timeout_policy': global_cache_timeout_policy,
  }
  ```

  • Enabled ndb debug logging with NDB_DEBUG=true and logging.getLogger('google.cloud.ndb.global_cache').setLevel(logging.DEBUG).
  • Checked the build to confirm that it's using my ndb fork and branch with global_cache logging.
  • Ran unit tests locally with eg env MEMCACHE_HOST=localhost NDB_DEBUG=true python -m unittest -v ..., confirmed that I see google.cloud.ndb.global_cache output.

I still don't see google.cloud.ndb.global_cache log lines in prod, and I still see repeated datastore lookups for the same key.
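For reference, this is roughly how those kwargs are supposed to get applied per request with google-cloud-ndb; the request-handler shape here is simplified, not the app's actual wiring:

```python
# Simplified sketch of the expected setup: every request runs inside an ndb
# context created with the global cache kwargs above, so key lookups should hit
# memcache before the datastore.
from google.cloud import ndb

ndb_client = ndb.Client()

def handle_request():
    # NDB_CONTEXT_KWARGS is the dict from common.py quoted above
    with ndb_client.context(**NDB_CONTEXT_KWARGS):
        ...  # Model.get_by_id() etc. in here should consult the global cache first
```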


snarfed commented Jan 13, 2025

Finally got ndb logs working! The GCP Python SDKs disable propagation at the google.cloud logger level for some reason:

```python
    for logger_name in all_excluded_loggers:
        # prevent excluded loggers from propagating logs to handler
        logger = logging.getLogger(logger_name)
        logger.propagate = False
```

Fixed that with `logging.getLogger('google.cloud').propagate = True`.

...and that led to the big find here: we're overusing transactions, which skip the cache and always go to the datastore. I removed the transactions where we're fine with last writer wins, which was most of them, and datastore lookups are down by half!

[image: datastore lookup rate chart]
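To illustrate the difference (not the actual Bridgy Fed code; the model and property names are made up):

```python
# ndb transactions always go to the datastore and bypass the memcache-backed
# global cache, so a transactional read-modify-write means a billed lookup
# every time. Where last writer wins is acceptable, a plain get/put lets the
# read come from memcache.
from google.cloud import ndb

class User(ndb.Model):  # hypothetical model
    status = ndb.StringProperty()

# Before: transactional update; the get() inside skips the global cache.
@ndb.transactional()
def set_status_txn(key, status):
    user = key.get()
    user.status = status
    user.put()

# After: last writer wins; the get() can be served from memcache.
def set_status(key, status):
    user = key.get()
    user.status = status
    user.put()
```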


snarfed commented Jan 13, 2025

Another idea here: cache follower counts in memcache instead of instance memory. Not sure how much that would matter; I'll see when I dig into datastore usage.

Did this in 6388aab.


snarfed commented Jan 13, 2025

Next idea: use #552 to optimize Protocol.for_id, #1341. Maybe with Web.status == 'webfinger' entities in datastore?

snarfed added a commit that referenced this issue Jan 14, 2025
...because transactions don't read or write memcache. :/ Fortunately we don't really depend on atomicity for much, last writer wins is pretty much always fine.

for #1149
snarfed added a commit that referenced this issue Jan 15, 2025
snarfed added a commit that referenced this issue Jan 15, 2025
…no mf2

this hopefully stops us from re-fetching them repeatedly in Protocol.for_id or Web.load/fetch

for #1149

snarfed commented Jan 15, 2025

I ended up caching Protocol.for_id in memcache instead.
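Roughly what that looks like as a pattern, with a made-up key prefix and TTL (the real implementation lives in the codebase and may differ):

```python
# Hypothetical memcache memoization sketch for an id -> protocol-label lookup.
# `memcache` is assumed to be the pymemcache PooledClient from common.py.
import functools

def memoize_in_memcache(memcache, prefix, expire=2 * 60 * 60):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(id):
            # memcache keys can't contain spaces and max out at 250 bytes;
            # a real version would probably hash long ids
            key = f'{prefix}:{id}'.replace(' ', '%20')[:250]
            cached = memcache.get(key)
            if cached is not None:
                return cached.decode()
            result = fn(id)
            if result is not None:
                memcache.set(key, result.encode(), expire=expire)
            return result
        return wrapper
    return decorator

# usage sketch: wrap a function that maps an id to a protocol label string
# for_id_label = memoize_in_memcache(memcache, 'for-id')(lambda id: ...)
```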

snarfed added a commit that referenced this issue Jan 17, 2025

snarfed commented Jan 20, 2025

New milestone: finally shipped #1360 to stop storing CRUD activities! The initial deploy has been a bit bumpy (it switched more writes to transactions than I wanted to), but that's fixable. Regardless, it looks like it worked: it halved our datastore write rate overall. (!)

[image: datastore write rate chart]


snarfed commented Jan 21, 2025

Looking better now!

[image]


snarfed commented Jan 22, 2025

Datastore storage is currently almost 1T total, costing us >$5/d. I'm looking at cleaning that up in #1703. Good results so far; I should be able to cut it by at least half.
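Back of the envelope from those numbers: >$5/d on ~1T of storage is >$150/mo, i.e. roughly $0.15+/GiB-month, so cutting storage in half should save somewhere north of $75/mo.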


snarfed commented Jan 24, 2025

Making gradual progress here overall, costs are down 50% from our peak back in mid Nov.

[image: overall cost trend chart]


snarfed commented Jan 24, 2025

Similar view of datastore API calls:

[image: datastore API call trend chart]


snarfed commented Jan 26, 2025

Interesting: a new "Metric volume" SKU popped up on billing a couple of days ago. It's only $1-2/d, but still. The new outbound-http-requests log-based metric I added a bit ago is already the biggest one by volume, 50M so far, so that's probably the culprit; I'm guessing it put me over the free tier.

Amusingly the console's metric management tool (docs) doesn't show log-based metrics, whee.

snarfed added a commit that referenced this issue Jan 29, 2025
...since GAE's edge caching based on Cache-Control doesn't seem to be very effective :/

for #1149

snarfed commented Jan 31, 2025

We currently pay ~$60/mo for egress bandwidth.

(Aside: cloud egress bandwidth is such a racket! ~$.10/G, that's orders of magnitude above cost! And all the major cloud providers collude and keep their prices artificially inflated, all within $.08-$.12/G, so there's no real competition. Grr!)

I could reduce this some by downgrading from GCP's premium networking tier, which uses their own backbone for transit, to the standard tier, which uses the public internet. I can't find a way to do that with App Engine though; it may not be possible.

snarfed removed the now label Feb 25, 2025

snarfed commented Mar 12, 2025

Re #1149 (comment) :

Oh wow, I didn't realize that the datastore had separate pricing for multi-region vs single-region, https://cloud.google.com/datastore/pricing#location_pricing . Let's see if I can figure out how to switch us to single-region.

If we ever do this kind of migration, we should probably migrate all the way to native Firestore, i.e. not in Datastore mode, since its pricing is the same as single-region datastore: https://cloud.google.com/firestore/pricing
