Add IPFS content provider (InterPlanetary File System) #1098
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗
While the py_cid package would be more accurate, it pulls in quite a few dependencies; for most cases, this regex-based approach should work just fine.
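For illustration, a regex-based CID check along these lines could look as follows (this is a sketch, not the PR's actual code; the exact patterns are assumptions based on the common CID encodings):

```python
import re

# Rough syntactic check for IPFS CIDs, avoiding the py_cid dependency.
# CIDv0 is a 46-character base58btc string starting with "Qm" (the base58
# alphabet omits 0, O, I, and l); CIDv1 is commonly base32 (lowercase,
# starting with a "b" multibase prefix). The length bound for CIDv1 here
# is a loose assumption.
CID_RE = re.compile(
    r"^Qm[1-9A-HJ-NP-Za-km-z]{44}$"   # CIDv0 (base58btc)
    r"|^b[a-z2-7]{20,}$"              # CIDv1 in base32
)

def looks_like_cid(s: str) -> bool:
    """Cheap syntactic check; py_cid would decode and validate properly."""
    return CID_RE.match(s) is not None
```

Unlike a proper decoder, this accepts some strings that are not valid CIDs (it never checks the multihash), which is the trade-off the comment alludes to.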
Tests are now passing, including a test which actually creates a Docker image from IPFS. There's still the open question regarding configurability of the list of IPFS gateways to try. In ipfsspec the environment variable
@yuvipanda what do you think?
A couple of technical issues:
A more general issue which ties in with @betatim's comment on jupyterhub/mybinder.org-deploy#2082 (comment)
Thanks @manics for the comments!

For the first point: yes, that's also my concern. I believe that IPFS (or any other content-addressable storage system) can be a very useful tool in science, but it is incredibly hard to get people on board initially, as there are several concepts which are new, creating a large initial burden. My take on the gateway issue would be to offer a fairly large list of public gateways by default, which should just work for anyone, and to offer a customization point for people wanting to improve performance. (The use of public gateways is easier in the case of this PR than for data, as code tends to be smaller and the entire thing can be downloaded in a single request, strongly reducing the pressure on the gateways.)

The second point also came to my mind after writing this. There's the

The third point is of course the hardest one, and maybe also a bit of a chicken-and-egg problem (see the gateway issue for example). Upfront: I believe that up to now there are not yet many researchers working with data on IPFS, but I hope this will change soon, and repo2docker / Binder would be great accelerators. The reason I started investing time into IPFS is that we've got a lot of data from a recent field campaign (https://eurec4a.eu/), which unfortunately is still distributed across the world, partly in very inaccessible places, and which should be made accessible for collaborative investigation. To do so, we figured that a basic requirement would be the availability of a simple-looking function which should work for all the data:
The main goal of this function would be that one can write an analysis script and the script "just works" on any other computer of any coworker. Thus, this function should work on any computer at any time, even when the primary server is offline, and without changing the identifier (over a long timespan). It should be fast, at least if the data is close by (possibly on a local disk). And the result of the function must not change over time, because otherwise my analysis wouldn't be reproducible at all.

Another point which we have to (and want to) deal with is the possibility of having copies of the data at our (and our collaborators') datacenters. This is partly in order to have better performance and redundancy, and partly for political reasons (some data must be held in certain countries or at certain institutions). For the larger datasets, we also need the possibility of efficient subsetting without downloading (in particular in the context of demonstration scripts), which is often not available at scientific data repositories.

For at least those reasons, I figure that a global content-addressable storage system would be very helpful (it provides verifiable data integrity, trivial caching, and consequently a simple-to-implement system of globally distributed copies). IPFS is of course only one possibility, but the implementation of a single global namespace without the immediate need to name the data provider's location in the dataset identifier (thanks to a lookup in a distributed hash table) is particularly helpful in the setting outlined above.

Based on this reasoning, I am primarily interested in having datasets on top of IPFS, and that's still true. Having code on IPFS would be a neat addition (a single CID would suffice to reference the whole analysis as well as all the data which went into it), but it's probably also fine to have that as a second step. I've made the PR mainly because @yuvipanda suggested it and it seemed relatively simple to implement.
Also, as the use of IPFS here is rather controlled (basically only a single HTTP request to a gateway), I assumed that the implementation is relatively unproblematic. Although I'm by now quite convinced that IPFS will provide a lot of benefits for scientific data storage, I'm also interested in other good and practical solutions. Thus, if there are public data (and code) repositories which cover all of the requirements above, I'd be glad to learn more about them, independent of this PR.
Definitely a chicken-and-egg problem 😃. Personally I don't think repo2docker should take the lead in promoting a particular new/upcoming technology; instead I see its role as supporting reproducible research using existing, well-known tools that are already in use by the community. For example, a very quick Google search brought up dat; should we also add support for that? Perhaps a long-term solution to this problem is to make all content providers into plugins, so it's easy to extend r2d and experiment with new optional providers. A similar idea has already been suggested for buildpacks.
I think the consensus here is to not add this. Personally, I've been disappointed with the progress IPFS has made in the last few years. So while I did initially champion this PR, I think we should close it for now. I'm really sorry, @d70-t!
This PR is adding an IPFS content provider (see #1096).
The following builds the requirements.txt example via IPFS:
Still open:
Likely one wants an option to configure the list of IPFS gateways to try. E.g. via an environment variable?
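One possible shape for such a configuration point, sketched here as an assumption (the variable name `IPFS_GATEWAYS` and the comma-separated format are illustrative, not an agreed interface):

```python
import os

# Assumed default set of public gateways; not from the PR.
DEFAULT_GATEWAYS = ["https://ipfs.io", "https://dweb.link"]

def configured_gateways() -> list:
    """Read a comma-separated gateway list from the environment,
    falling back to a default set of public gateways."""
    raw = os.environ.get("IPFS_GATEWAYS", "")
    gateways = [g.strip() for g in raw.split(",") if g.strip()]
    return gateways or DEFAULT_GATEWAYS
```

An environment variable keeps the default "just works" behaviour for casual users while letting institutions point builds at a local gateway for performance.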