
Proposal: Sampling DHT RPC #53

Closed
the8472 opened this issue Dec 14, 2016 · 11 comments

@the8472
Contributor

the8472 commented Dec 14, 2016

The idea is fairly simple: add a DHT call that returns a random sample of the infohashes that a node has stored locally.

This should make DHT indexing more efficient and provide a cleaner alternative to current approaches, which are strongly incentivized to misbehave in order to increase the amount of data they can gather.
It would also democratize the indexing process, in the sense that it could be done with moderate resources instead of advantaging those who have large IP address blocks at their disposal.
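
For concreteness, here is a minimal sketch of what such an exchange might look like, assuming bencoded KRPC messages in the style of BEP 5. The method name `sample_infohashes` and all key names are illustrative placeholders, not a finalized spec:

```python
import random

# Hypothetical KRPC messages; names are placeholders. Actual messages
# would be bencoded and sent over UDP per BEP 5.

def make_sample_query(node_id: bytes, target: bytes) -> dict:
    """Ask a node for a random sample of the infohashes it stores."""
    return {
        b"t": b"aa",                 # transaction id
        b"y": b"q",                  # message type: query
        b"q": b"sample_infohashes",  # hypothetical method name
        b"a": {b"id": node_id, b"target": target},
    }

def make_sample_reply(node_id: bytes, stored: list) -> dict:
    """Reply with a random sample of locally stored infohashes."""
    sample = random.sample(stored, min(len(stored), 20))  # cap to fit one UDP packet
    return {
        b"t": b"aa",
        b"y": b"r",                  # message type: response
        b"r": {
            b"id": node_id,
            b"samples": b"".join(sample),  # concatenated 20-byte infohashes
        },
    }
```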

Addressing privacy concerns: swarms using the DHT are already open, public, and de facto indexed. Additionally, using popular open trackers usually leads to the torrents being included in some of their data dumps, too. So in practice this would change very little.

But to offset those concerns anyway, I would suggest using encrypted torrents (#20, once I finish the spec and a reference implementation) to allow people to use the DHT while keeping the content secret.

Non-goal: enabling average end users to search the entire DHT network for content.

Opinions?

@gubatron
Contributor

Would you have a global rate limit on responding to these requests, to prevent a sophisticated attacker from bombarding you until they eventually get every infohash you have?

Or would you simply recompute that random set of infohashes infrequently and give the same response for a long time, in order to deter crawlers from coming back to you?

Or something else?

@the8472
Contributor Author

the8472 commented Dec 15, 2016

Global rate limits would mean crawler A could exhaust the available limits on some keyspace region, not letting crawler B get any data. Incentives to do that don't seem to be strong, but it's still suboptimal.

I like the precalculated and occasionally rotated subset approach more. Good idea.
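
To make that concrete, here is a minimal sketch of the rotation approach, assuming the node keeps its stored infohashes in memory. The sample size and the 6-hour rotation interval are made-up placeholders:

```python
import random
import time

SAMPLE_SIZE = 20               # how many infohashes fit comfortably in one reply
ROTATION_INTERVAL = 6 * 3600   # seconds between recomputing the sample (placeholder)

class SampleCache:
    """Serve the same precomputed random sample until the rotation interval
    elapses, so repeated queries from a crawler yield no extra information."""

    def __init__(self, stored_infohashes):
        self.stored = stored_infohashes
        self.sample = []
        self.computed_at = float("-inf")

    def get_sample(self):
        now = time.monotonic()
        if now - self.computed_at >= ROTATION_INTERVAL:
            k = min(SAMPLE_SIZE, len(self.stored))
            self.sample = random.sample(list(self.stored), k)
            self.computed_at = now
        return self.sample
```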

@ssiloti
Contributor

ssiloti commented Dec 15, 2016

Popular DHT clients already apply a pretty low (<10 KB/s) rate limit to all DHT traffic, so I don't think we need any new rate limiting for this.

In principle I'm OK with this. I think the benefit that DHT indexers provide to regular users outweighs the potential harm to people who are relying on security through obscurity for their "private" torrents.

the8472 mentioned this issue Dec 26, 2016
@jech
Contributor

jech commented Dec 26, 2016

When the answer is truncated (the node has more hashes than fit in a packet), this should be indicated in some way. I suggest adding a field that contains the total number of hashes the sending node knows about.
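
As a sketch, the reply from the earlier example could carry that count in an additional key (the name `num` is illustrative):

```python
import random

def make_sample_reply(node_id: bytes, stored: list) -> dict:
    """Reply including the total count, so receivers can detect truncation."""
    sample = random.sample(stored, min(len(stored), 20))
    return {
        b"t": b"aa",
        b"y": b"r",
        b"r": {
            b"id": node_id,
            b"samples": b"".join(sample),  # the (possibly truncated) sample
            b"num": len(stored),           # total infohashes stored locally
        },
    }
```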

@the8472
Contributor Author

the8472 commented Dec 26, 2016

What would be the motivation behind that?

@the8472
Contributor Author

the8472 commented Dec 28, 2016

A thought I had while working on storage encryption:

Local rate limits don't do much to discourage indexers from being greedy (highly parallel) at a global scale. One possibility would be to add a memory-hard proof of work (e.g. bit-prefix searching + argon2) to the requests, with the computation bound to the target IP. The responding node would then be free to verify the work or not.
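
A rough sketch of such a scheme, using scrypt from the Python standard library as a stand-in for a memory-hard function like argon2; the difficulty and cost parameters below are placeholders, not tuned values:

```python
import hashlib
import ipaddress
import os

DIFFICULTY_BITS = 8  # leading zero bits required; placeholder difficulty

def _pow_hash(nonce: bytes, target_ip: str) -> bytes:
    # Bind the work to the target node's IP so a proof can't be reused
    # against other nodes. scrypt stands in for argon2 here.
    ip_bytes = ipaddress.ip_address(target_ip).packed
    return hashlib.scrypt(nonce, salt=ip_bytes, n=2**14, r=8, p=1, dklen=32)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(target_ip: str) -> bytes:
    """Requester: search for a nonce whose hash has enough leading zero bits."""
    while True:
        nonce = os.urandom(16)
        if leading_zero_bits(_pow_hash(nonce, target_ip)) >= DIFFICULTY_BITS:
            return nonce

def verify(nonce: bytes, target_ip: str) -> bool:
    """Responder: a single hash invocation checks the claimed work."""
    return leading_zero_bits(_pow_hash(nonce, target_ip)) >= DIFFICULTY_BITS
```

The asymmetry is the point: the requester pays many memory-hard evaluations per query, while the responder pays at most one to verify.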

I hope it wouldn't be necessary in practice, but it's an option to consider.

@jech
Contributor

jech commented Jan 1, 2017

It's information that's easy to compute, and we don't want to restrict indexers in the algorithms they may use. 8472, are you in touch with any indexer authors?

@the8472
Contributor Author

the8472 commented Jan 1, 2017

@jech

> 8472, are you in touch with any indexer authors?

I have my own operational indexer implementation, but I'm not in contact with third parties who operate indexers.

> It's information that's easy to compute, and we don't want to restrict indexers in the algorithms they may use.

Well, the only use case that's obvious to me is querying the same node repeatedly, which is something I intend to discourage (see the precomputed-subset part of the proposal).

@jech
Contributor

jech commented Jan 1, 2017 via email

@the8472
Contributor Author

the8472 commented Jan 1, 2017

> Why do you want to discourage repeated querying?

An indexer creates a fairly unique load across the DHT, quite asymmetrical compared to what other nodes do. It basically has two goals: a) be as exhaustive as possible, and b) be as current as possible.

Without limits, the optimum for an indexer with unmetered traffic would be querying millions of nodes every few minutes. And that's just one indexer. Now add multiple indexers into the equation and you're generating a lot of traffic across the entire DHT. If this becomes too much, it would discourage clients from implementing this extension, and in the worst case it might also exhaust their internal rate limits.

So, there are several multipliers here:

  1. number of indexers
  2. maximum send rate of each indexer
  3. queries per target and unit of time
  4. number of nodes in the DHT

The 3rd is easy to restrict by making additional queries useless (as I propose). The 2nd could be limited by making queries computationally expensive, as suggested in a previous comment.
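
To illustrate how those multipliers compound, a back-of-envelope calculation; every figure below is a made-up placeholder, not a measurement:

```python
# All figures are hypothetical, purely for illustration.
indexers = 10                 # (1) number of indexers
queries_per_node_per_day = 2  # (3) capped by the rotated sample: extra queries are useless
nodes = 10_000_000            # (4) nodes in the DHT
response_bytes = 300          # rough size of one sampled reply on the wire

responses_per_day = indexers * queries_per_node_per_day * nodes
traffic_gb_per_day = responses_per_day * response_bytes / 1e9
print(f"{responses_per_day:,} responses/day, ~{traffic_gb_per_day:.0f} GB/day across the DHT")
# -> 200,000,000 responses/day, ~60 GB/day across the DHT
```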

My assumption is that indexers will make repeated sweeps over the DHT to keep things fresh. Let's say half-daily. So even if they miss a few samples during one sweep, they'll just get the data in the next one.

Additionally, the redundancy of stores should mean that even with sampling they're likely to get the hash from neighboring nodes.

> and I expect that people will do it whether you discourage them or not.

I think the proposal includes a fairly effective way to do just that (credit goes to @gubatron).

@the8472
Contributor Author

the8472 commented Jan 13, 2017

Solved with #54.

the8472 closed this as completed Jan 13, 2017