Proposal: Sampling DHT RPC #53
Comments
Would you have a global rate limit on responses to these requests, to keep a sophisticated attacker from bombarding a node until they eventually obtain every infohash it has? Or would you instead compute that random set of infohashes only occasionally and return the same response for a long time, to deter crawlers from coming back? Or something else? |
Global rate limits would mean crawler A could exhaust the available limits on some keyspace region, not letting crawler B get any data. Incentives to do that don't seem to be strong, but it's still suboptimal. I like the precalculated and occasionally rotated subset approach more. Good idea. |
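A minimal sketch of that precalculated, occasionally rotated subset (in Python; the rotation interval, secret, and function names are illustrative assumptions, not part of the proposal): the sample is derived deterministically from a node-local secret and the current rotation window, so repeat queries within a window return the same answer and gain a crawler nothing.

```python
import hashlib
import random
import time

# Hypothetical rotation interval; the proposal doesn't fix a value.
ROTATION_INTERVAL = 6 * 3600

def sampled_infohashes(stored_hashes, max_samples, secret=b"node-local-secret"):
    """Return a deterministic pseudo-random subset of stored infohashes.

    Seeding the RNG from a node-local secret plus the current rotation
    window means the node gives the same answer for the whole window,
    deterring crawlers from re-querying it.
    """
    window = int(time.time() // ROTATION_INTERVAL)
    seed = hashlib.sha1(secret + window.to_bytes(8, "big")).digest()
    rng = random.Random(seed)
    k = min(max_samples, len(stored_hashes))
    return rng.sample(sorted(stored_hashes), k)
```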
Popular DHT clients already apply a fairly low (<10 KB/s) rate limit to all DHT traffic, so I don't think we need any new rate limiting for this. In principle I'm OK with this. I think the benefit that DHT indexers provide to regular users outweighs the potential harm to people who are relying on security through obscurity for their "private" torrents. |
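For illustration, the kind of global rate limit mentioned above is typically a token bucket; a rough sketch, where the ~10 KB/s figure comes from the comment and everything else is assumed:

```python
import time

class TokenBucket:
    """Sketch of a global DHT traffic limiter (~10 KB/s as cited above)."""

    def __init__(self, rate=10_000, burst=20_000):
        self.rate = rate          # refill rate in bytes per second
        self.capacity = burst     # maximum burst size in bytes
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, packet_len):
        """Spend tokens for a packet; False means drop or delay it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_len:
            self.tokens -= packet_len
            return True
        return False
```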
When the answer is truncated (the node has more hashes than fit in a packet), this should be indicated in some way. I suggest adding a field that contains the total number of hashes the sending node knows about. |
What would be the motivation behind that? |
A thought I had while working on storage encryption: Local rate limits don't do much to discourage indexers from being greedy (highly parallel) at a global scale. One possibility would be to add a memory-hard proof of work (e.g. bit prefix searching + argon2) to the requests with the computation based on the target IP. The responding node would then be free to verify the work or not. I hope it wouldn't be necessary in practice, but it's an option to consider. |
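A sketch of what that could look like (using the third-party argon2-cffi package; the difficulty and cost parameters are placeholders, not part of any spec): the requester searches for a nonce whose Argon2 digest bound to the target IP has a required number of leading zero bits, and the responder can verify it with a single hash evaluation, or skip verification entirely.

```python
import ipaddress
from argon2.low_level import Type, hash_secret_raw  # pip install argon2-cffi

def pow_digest(ip: str, nonce: int) -> bytes:
    # Bind the work to the target IP so precomputed solutions can't be
    # reused against other nodes. Cost parameters are illustrative.
    return hash_secret_raw(
        secret=nonce.to_bytes(8, "big"),
        salt=ipaddress.ip_address(ip).packed + b"\x00" * 12,  # pad to a valid salt length
        time_cost=1,
        memory_cost=64 * 1024,  # 64 MiB -- the memory-hard part
        parallelism=1,
        hash_len=32,
        type=Type.ID,
    )

def leading_zero_bits(digest: bytes) -> int:
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()

def solve(ip: str, difficulty: int = 8) -> int:
    """Requester side: brute-force a nonce meeting the bit-prefix target."""
    nonce = 0
    while leading_zero_bits(pow_digest(ip, nonce)) < difficulty:
        nonce += 1
    return nonce

def verify(ip: str, nonce: int, difficulty: int = 8) -> bool:
    """Responder side: one hash evaluation; free to not bother at all."""
    return leading_zero_bits(pow_digest(ip, nonce)) >= difficulty
```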
It's information that's easy to compute, and we don't want to restrict indexers in the algorithms they may use. 8472, are you in touch with any indexer authors? |
I have my own operational indexer implementation, but I'm not in contact with third parties who operate indexers.
Well, the only use-case that is obvious to me is that they query the same node repeatedly, which is something I intend to discourage (see the precomputed subset part in the proposal). |
> Well, the only use-case that is obvious to me is that they query the same
> node repeatedly, which is something I intend to discourage (see the
> precomputed subset part in the proposal).

Why do you want to discourage repeated querying? It doesn't do much harm to the network, and I expect that people will do it whether you discourage them or not.

Including the total count in the reply makes it easier to stop querying once you have a large enough subset of the data, so it will decrease the amount of traffic. |
An indexer creates a fairly unique load on the DHT, quite asymmetrical compared to what other nodes do. It basically has two goals: a) be as exhaustive as possible, and b) be as current as possible. Without limits, the optimum for an indexer with unmetered traffic would be querying millions of nodes every few minutes. And that's just one indexer; add multiple indexers into the equation and you're generating a lot of traffic across the entire DHT. If this becomes too much it would discourage clients from implementing this extension, and in the worst case it might also exhaust their internal rate limits. So there are several multipliers in here: the number of indexers, how frequently each one sweeps the keyspace, and how often they re-query individual nodes.
The 3rd one is easy to restrict by making additional queries useless (as I propose). The 2nd could be limited by making queries computationally expensive, as suggested in a previous comment. My assumption is that indexers will make repeated sweeps over the DHT to keep things fresh, say twice a day. So even if they miss a few samples during one sweep, they'll just get the data in the next one. Additionally, the redundancy of stores should mean that even with sampling they're likely to get a given hash from neighboring nodes.
I think the proposal includes a fairly effective way to do just that. (credit goes to @gubatron) |
solved with #54 |
The idea is fairly simple: add a DHT call that returns a random sample of the infohashes that a node has stored locally.
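One plausible KRPC shape for such a call, shown here as Python dicts before bencoding; all field names (`sample_infohashes`, `samples`, `num`, `nodes`) are illustrative, since the actual wire format is the subject of #54. A count field like `num` would also cover the truncation concern raised in the comments above.

```python
# Query: ask a node for a sample of its locally stored infohashes.
query = {
    b"t": b"aa",
    b"y": b"q",
    b"q": b"sample_infohashes",          # illustrative method name
    b"a": {
        b"id": b"<20-byte querying node ID>",
        b"target": b"<20-byte target ID>",
    },
}

# Response: the precomputed random subset plus the node's total count.
response = {
    b"t": b"aa",
    b"y": b"r",
    b"r": {
        b"id": b"<20-byte responding node ID>",
        b"samples": b"<n * 20 bytes, concatenated infohashes>",
        b"num": 1234,                    # total infohashes stored locally
        b"nodes": b"<compact node info, as in other DHT responses>",
    },
}
```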
This should make DHT indexing more efficient and provide a cleaner alternative to current approaches, which are highly incentivized to misbehave in order to increase the amount of data they can gather.
It would also democratize the indexing process, in the sense that it can be executed with moderate resources instead of advantaging those who have large IP blocks at their disposal.
Addressing privacy concerns: swarms using the DHT are already open, public, and de facto indexed. Additionally, using popular open trackers usually leads to the torrents being included in some of their data dumps, too. So in practice this would change very little.
But to offset those concerns anyway, I would suggest using encrypted torrents (#20, once I finish the spec and a reference implementation) to allow people to use the DHT while keeping the content secret.
Non-Goal: Enabling average end users to search the entire DHT network for content
Opinions?