Proposal: Sampling DHT RPC #53
Comments
Would you have a global rate limit on responses to these requests, to keep a sophisticated attacker from bombarding a node until they eventually obtain every infohash it has? Or would you instead compute that random set of infohashes only occasionally and return the same response for a long time, to deter crawlers from coming back? Or something else? |
Global rate limits would mean crawler A could exhaust the available limits on some keyspace region, not letting crawler B get any data. Incentives to do that don't seem to be strong, but it's still suboptimal. I like the precalculated and occasionally rotated subset approach more. Good idea. |
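A minimal sketch of that precalculated, occasionally rotated subset (in Python; the rotation interval, secret, and function names are illustrative assumptions, not part of the proposal): the sample is derived deterministically from a node-local secret and the current rotation window, so repeat queries within a window return the same answer and gain a crawler nothing.

```python
import hashlib
import random
import time

# Hypothetical rotation interval; the proposal doesn't fix a value.
ROTATION_INTERVAL = 6 * 3600

def sampled_infohashes(stored_hashes, max_samples, secret=b"node-local-secret"):
    """Return a deterministic pseudo-random subset of stored infohashes.

    Seeding the RNG from a node-local secret plus the current rotation
    window means the node gives the same answer for the whole window,
    deterring crawlers from re-querying it.
    """
    window = int(time.time() // ROTATION_INTERVAL)
    seed = hashlib.sha1(secret + window.to_bytes(8, "big")).digest()
    rng = random.Random(seed)
    k = min(max_samples, len(stored_hashes))
    return rng.sample(sorted(stored_hashes), k)
```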
Popular DHT clients already apply a fairly low (<10 KB/s) rate limit to all DHT traffic, so I don't think we need any new rate limiting for this. In principle I'm OK with this. I think the benefit that DHT indexers provide to regular users outweighs the potential harm to people who are relying on security through obscurity for their "private" torrents. |
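For illustration, the kind of global rate limit mentioned above is typically a token bucket; a rough sketch, where the ~10 KB/s figure comes from the comment and everything else is assumed:

```python
import time

class TokenBucket:
    """Sketch of a global DHT traffic limiter (~10 KB/s as cited above)."""

    def __init__(self, rate=10_000, burst=20_000):
        self.rate = rate          # refill rate in bytes per second
        self.capacity = burst     # maximum burst size in bytes
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, packet_len):
        """Spend tokens for a packet; False means drop or delay it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_len:
            self.tokens -= packet_len
            return True
        return False
```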
When the answer is truncated (the node has more hashes than fit in a packet), this should be indicated in some way. I suggest adding a field that contains the total number of hashes the sending node knows about. |
What would be the motivation behind that? |
A thought I had while working on storage encryption: Local rate limits don't do much to discourage indexers from being greedy (highly parallel) at a global scale. One possibility would be to add a memory-hard proof of work (e.g. bit prefix searching + argon2) to the requests with the computation based on the target IP. The responding node would then be free to verify the work or not. I hope it wouldn't be necessary in practice, but it's an option to consider. |
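A sketch of what that could look like (using the third-party argon2-cffi package; the difficulty and cost parameters are placeholders, not part of any spec): the requester searches for a nonce whose Argon2 digest bound to the target IP has a required number of leading zero bits, and the responder can verify it with a single hash evaluation, or skip verification entirely.

```python
import ipaddress
from argon2.low_level import Type, hash_secret_raw  # pip install argon2-cffi

def pow_digest(ip: str, nonce: int) -> bytes:
    # Bind the work to the target IP so precomputed solutions can't be
    # reused against other nodes. Cost parameters are illustrative.
    return hash_secret_raw(
        secret=nonce.to_bytes(8, "big"),
        salt=ipaddress.ip_address(ip).packed + b"\x00" * 12,  # pad to a valid salt length
        time_cost=1,
        memory_cost=64 * 1024,  # 64 MiB -- the memory-hard part
        parallelism=1,
        hash_len=32,
        type=Type.ID,
    )

def leading_zero_bits(digest: bytes) -> int:
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()

def solve(ip: str, difficulty: int = 8) -> int:
    """Requester side: brute-force a nonce meeting the bit-prefix target."""
    nonce = 0
    while leading_zero_bits(pow_digest(ip, nonce)) < difficulty:
        nonce += 1
    return nonce

def verify(ip: str, nonce: int, difficulty: int = 8) -> bool:
    """Responder side: one hash evaluation; free to not bother at all."""
    return leading_zero_bits(pow_digest(ip, nonce)) >= difficulty
```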
It's information that's easy to compute, and we don't want to restrict indexers in the algorithms they may use. 8472, are you in touch with any indexer authors? |
I have my own operational indexer implementation, but I'm not in contact with third parties who operate indexers.
Well, the only use-case that is obvious to me is that they query the same node repeatedly, which is something I intend to discourage (see the precomputed subset part in the proposal). |
> Well, the only use-case that is obvious to me is that they query the same
> node repeatedly, which is something I intend to discourage (see the
> precomputed subset part in the proposal).

Why do you want to discourage repeated querying? It doesn't do much harm to the network, and I expect that people will do it whether you discourage them or not.

Including the total count in the reply makes it easier to stop querying once you have a large enough subset of the data, so it will decrease the amount of traffic. |
An indexer creates a fairly unique load on the DHT, quite asymmetrical compared to what other nodes do. It basically has two goals: a) be as exhaustive as possible, and b) be as current as possible. Without limits, the optimum for an indexer with unmetered traffic would be querying millions of nodes every few minutes. And that's just one indexer; add multiple indexers into the equation and you're generating a lot of traffic across the entire DHT. If this becomes too much it would discourage clients from implementing this extension, and in the worst case it might also exhaust their internal rate limits. So there are several multipliers in here: the number of indexers, how frequently each one sweeps the keyspace, and how often they re-query individual nodes.
The 3rd one is easy to restrict by making additional queries useless (as I propose). The 2nd could be limited by making queries computationally expensive, as suggested in a previous comment. My assumption is that indexers will make repeated sweeps over the DHT to keep things fresh, say twice a day. So even if they miss a few samples during one sweep, they'll just get the data in the next one. Additionally, the redundancy of stores should mean that even with sampling they're likely to get a given hash from neighboring nodes.
I think the proposal includes a fairly effective way to do just that. (credit goes to @gubatron) |
solved with #54 |
The idea is fairly simple: add a DHT call that returns a random sample of the infohashes that a node has stored locally.
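One plausible KRPC shape for such a call, shown here as Python dicts before bencoding; all field names (`sample_infohashes`, `samples`, `num`, `nodes`) are illustrative, since the actual wire format is the subject of #54. A count field like `num` would also cover the truncation concern raised in the comments above.

```python
# Query: ask a node for a sample of its locally stored infohashes.
query = {
    b"t": b"aa",
    b"y": b"q",
    b"q": b"sample_infohashes",          # illustrative method name
    b"a": {
        b"id": b"<20-byte querying node ID>",
        b"target": b"<20-byte target ID>",
    },
}

# Response: the precomputed random subset plus the node's total count.
response = {
    b"t": b"aa",
    b"y": b"r",
    b"r": {
        b"id": b"<20-byte responding node ID>",
        b"samples": b"<n * 20 bytes, concatenated infohashes>",
        b"num": 1234,                    # total infohashes stored locally
        b"nodes": b"<compact node info, as in other DHT responses>",
    },
}
```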
This should make DHT indexing more efficient and provide a cleaner alternative to current approaches, which are highly incentivized to misbehave in order to increase the amount of data they can gather.
It would also democratize the indexing process, in the sense that it can be executed with moderate resources instead of advantaging those who have large IP blocks at their disposal.
Addressing privacy concerns: swarms using the DHT are already open, public, and de facto indexed. Additionally, using popular open trackers usually leads to the torrents being included in some of their data dumps, too. So in practice this would change very little.
But to offset those concerns anyway, I would suggest using encrypted torrents (#20, once I finish the spec and a reference implementation) to allow people to use the DHT while keeping the content secret.
Non-Goal: Enabling average end users to search the entire DHT network for content
Opinions?