Add IPFS content provider (InterPlanetary File System) #1098
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗
While the py_cid package would be more accurate, it pulls in quite a few dependencies; for most cases, this regex-based approach should work just fine.
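For illustration, a regex-based CID check along these lines could look as follows (this is a sketch, not the PR's actual code; the exact patterns are assumptions based on the common CID encodings):

```python
import re

# Rough syntactic check for IPFS CIDs, avoiding the py_cid dependency.
# CIDv0 is a 46-character base58btc string starting with "Qm" (the base58
# alphabet omits 0, O, I, and l); CIDv1 is commonly base32 (lowercase,
# starting with a "b" multibase prefix). The length bound for CIDv1 here
# is a loose assumption.
CID_RE = re.compile(
    r"^Qm[1-9A-HJ-NP-Za-km-z]{44}$"   # CIDv0 (base58btc)
    r"|^b[a-z2-7]{20,}$"              # CIDv1 in base32
)

def looks_like_cid(s: str) -> bool:
    """Cheap syntactic check; py_cid would decode and validate properly."""
    return CID_RE.match(s) is not None
```

Unlike a proper decoder, this accepts some strings that are not valid CIDs (it never checks the multihash), which is the trade-off the comment alludes to.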
Tests are now passing, including a test which actually creates a Docker image from IPFS. There's still the open question regarding configurability of the list of IPFS gateways to try. In ipfsspec the environment variable
@yuvipanda what do you think?
A couple of technical issues:
A more general issue which ties in with @betatim's comment on jupyterhub/mybinder.org-deploy#2082 (comment)
Thanks @manics for the comments!

For the first point: yes, that's also my concern. I believe that IPFS (or any other content-addressable storage system) can be a very useful tool in science, but it is incredibly hard to get people on board initially, as there are several concepts which are new, creating a large initial burden. My take on the gateway issue would be to offer a fairly large list of public gateways by default, which should just work for anyone, and to offer a customization point for people wanting to improve performance. (The use of public gateways is easier in the case of this PR than for data, as code tends to be smaller and the entire thing can be downloaded in a single request, strongly reducing the pressure on the gateways.)

The second point also came to my mind after writing this. There's the

The third point is of course the hardest one, and maybe also a bit of a chicken-and-egg problem (see the gateway issue for example). Upfront: I believe that up to now there are not yet many researchers working with data on IPFS, but I hope this will change soon, and repo2docker / Binder would be great accelerators. The reason I started investing time into IPFS is that we've got a lot of data from a recent field campaign (https://eurec4a.eu/), which unfortunately is still distributed across the world, partly in very inaccessible places, and which should be made accessible for collaborative investigation. To do so, we figured that a basic requirement would be the availability of a simple-looking function which should work for all the data:
The main goal of this function would be that one can write an analysis script and the script "just works" on any other computer of any coworker. Thus, this function should work on any computer at any time, even when the primary server is offline, and without changing the identifier (over a long timespan). It should be fast, at least if the data is close by (possibly on a local disk). And the result of the function must not change over time, because otherwise my analysis wouldn't be reproducible at all.

Another point which we have to (and want to) deal with is the possibility of having copies of the data at our (and our collaborators') datacenters. This is partly in order to have better performance and redundancy, and partly for political reasons (some data must be held in certain countries or at certain institutions). For the larger datasets, we also need the possibility of efficient subsetting without downloading (in particular in the context of demonstration scripts), which is often not available at scientific data repositories.

For at least those reasons, I figure that a global content-addressable storage system would be very helpful (it provides verifiable data integrity, trivial caching, and consequently a simple-to-implement system of globally distributed copies). IPFS is of course only one possibility, but the implementation of a single global namespace without the immediate need to name the data provider's location in the dataset identifier (thanks to a lookup in a distributed hash table) is particularly helpful in the setting outlined above.

Based on this reasoning, I am primarily interested in having datasets on top of IPFS, and that's still true. Having code on IPFS would be a neat addition (a single CID would suffice to reference the whole analysis as well as all the data which went into it), but it's probably also fine to have that as a second step. I've made the PR mainly because @yuvipanda suggested it and it seemed relatively simple to implement.
Also, as the use of IPFS here is rather controlled (basically only a single HTTP request to a gateway), I assumed that the implementation is relatively unproblematic. Although I'm by now quite convinced that IPFS will provide a lot of benefits for scientific data storage, I'm also interested in other good and practical solutions. Thus, if there are public data (and code) repositories which cover all of the requirements above, I'd be glad to learn more about them, independent of this PR.
Definitely a chicken-and-egg problem 😃. Personally I don't think repo2docker should take the lead in promoting a particular new/upcoming technology; instead I see its role as supporting reproducible research using existing, well-known tools that are already in use by the community. For example, a very quick Google search brought up dat; should we also add support for that? Perhaps a long-term solution to this problem is to make all content providers into plugins, so it's easy to extend r2d and experiment with new optional providers. A similar idea has already been suggested for buildpacks.
I think the consensus here is to not add this. Personally, I've been disappointed with the progress IPFS has made in the last few years. So while I did initially champion this PR, I think we should close it for now. I'm really sorry, @d70-t!
This PR is adding an IPFS content provider (see #1096).
The following builds the requirements.txt example via IPFS:
Still open:
Likely one wants an option to configure the list of IPFS gateways to try. E.g. via an environment variable?
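One possible shape for such a configuration point, sketched here as an assumption (the variable name `IPFS_GATEWAYS` and the comma-separated format are illustrative, not an agreed interface):

```python
import os

# Assumed default set of public gateways; not from the PR.
DEFAULT_GATEWAYS = ["https://ipfs.io", "https://dweb.link"]

def configured_gateways() -> list:
    """Read a comma-separated gateway list from the environment,
    falling back to a default set of public gateways."""
    raw = os.environ.get("IPFS_GATEWAYS", "")
    gateways = [g.strip() for g in raw.split(",") if g.strip()]
    return gateways or DEFAULT_GATEWAYS
```

An environment variable keeps the default "just works" behaviour for casual users while letting institutions point builds at a local gateway for performance.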