How are files formatted inside nodes? #5059

Closed
schomatis opened this issue Jun 1, 2018 · 9 comments

@schomatis
Contributor

Roughly speaking, a block is formatted as a node (with links/children) using the merkledag.pb package. On top of the DAG layer, the UnixFS format is used to represent files and directories (though these two layers are not as decoupled as the previous two). The unixfs package has (basically) three types of objects: Raw, File and Directory. The importers split files and arrange the chunks in DAG nodes that contain UnixFS objects of type Raw and File (using the DAG links to connect them).
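
To make the layering concrete, here is a minimal sketch of a one-level import. Every type and function name in it (FSObject, DAGNode, splitChunks, importFile) is a hypothetical simplification, not the real merkledag or unixfs API, and the fixed-size chunking is an assumption made only for illustration.

```go
package main

import "fmt"

// FSType mimics the three UnixFS object kinds mentioned above.
// Hypothetical stand-in, not the real unixfs enum.
type FSType int

const (
	Raw FSType = iota
	File
	Directory
)

// FSObject is the UnixFS-layer payload (hypothetical).
type FSObject struct {
	Type FSType
	Data []byte
}

// DAGNode is the DAG-layer container: a payload plus links to children
// (hypothetical, the real node stores serialized bytes and hashes).
type DAGNode struct {
	Payload FSObject
	Links   []*DAGNode
}

// splitChunks cuts data into pieces of at most size bytes.
// A non-positive size yields no chunks.
func splitChunks(data []byte, size int) [][]byte {
	if size <= 0 {
		return nil
	}
	var chunks [][]byte
	for len(data) > 0 {
		n := size
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

// importFile builds a one-level DAG: a File root linking to Raw leaves.
func importFile(data []byte, chunkSize int) *DAGNode {
	root := &DAGNode{Payload: FSObject{Type: File}}
	for _, c := range splitChunks(data, chunkSize) {
		leaf := &DAGNode{Payload: FSObject{Type: Raw, Data: c}}
		root.Links = append(root.Links, leaf)
	}
	return root
}

func main() {
	root := importFile([]byte("hello, unixfs"), 4)
	fmt.Println("leaves:", len(root.Links))
}
```

The sketch keeps the UnixFS object as a typed field of the DAG node only for readability; in the actual code the UnixFS object ends up serialized inside the node's data.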

I'll be studying the importer package (and related ones) and recording what I find here, to later convert that information into comments and small code refactorings (any feedback or code pointers are more than welcome).

What I'm having the most trouble understanding is when the File type is used in the UnixFS objects and when Raw is (it would seem that Raw is used in the leaves of the DAG and File for the rest, but if that is the case I'm not understanding why). I'm also unclear about the interaction between the ProtoNode layer and the FSNode layer (which I would like to rename to FSObject, leaving Node for the DAG layer), which are encapsulated in a UnixfsNode structure that seems to have many representations. Finally, how does the MFS root fit into all of this?

@schomatis schomatis self-assigned this Jun 1, 2018
@schomatis schomatis added topic/docs-ipfs Topic docs-ipfs topic/files Topic files labels Jun 1, 2018
@schomatis schomatis added this to the Files API Documentation milestone Jun 1, 2018
@schomatis
Contributor Author

I'm now realizing that there are two FSNode identifiers, one in the unixfs package which represents the UnixFS format (defined in unixfs.proto),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/unixfs/unixfs.go#L142-L147

and the other one is an interface in the mfs package implemented by the File and Directory structures (and somewhat indirectly by the Root structure which contains a FSNode member),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/mfs/system.go#L44-L49

Is this intentional? How closely related are they?

@schomatis
Contributor Author

/cc @Stebalien

@schomatis
Contributor Author

I'm following the trickle importer to understand how the DAG and UnixFS layers are connected. Most of the logic resides in the helpers package and is abstracted through the UnixfsNode structure,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L40-L48

which combines the different layers and exposes them through generically named functions, making it hard to tell when work is being done on the DAG node and when on the UnixFS object. As an example, I'll try to detail the basic trickle importer process and the problems a new user might face when trying to dig through the code.

The entry-point Layout function creates a "root" UnixfsNode (not to be confused with mfs.Root), "fills" it and "adds" it. This apparently simple process hides what (IMO) is the most important part of the logic of the IPFS information layers.

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/trickle/trickledag.go#L36-L48

NewUnixfsNode creates two entities, the dag.ProtoNode and the UnixFS FSNode object, which for now remain decoupled (although the former will eventually contain the latter). One thing that caught my attention is that at this (rather low-level) part of the code the ipld.Node interface is almost never used, in favor of its implementation, dag.ProtoNode (the difference between these two should be discussed in another issue).

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L115-L120

In the simplest case of a one-level DAG, fillTrickleRec will involve only a single call to FillNodeLayer,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L135-L147

The ambiguity of the UnixfsNode starts becoming apparent in a call like NumChildren(), which accesses FSNode.blocksizes so that the result can be compared with db.maxlinks, something that I would rather have associated with the DAG layer (instead of UnixFS) and the linking functionality it provides (this is subjective, of course),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L35-L36
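
The layer ambiguity described here can be sketched as follows. All names (dagNode, fsObject, unixfsNode, addChild, numChildren, numLinks) are hypothetical simplifications rather than the real helpers API; the only assumption taken from the discussion is that the child count is read from the UnixFS block sizes instead of the DAG links.

```go
package main

import "fmt"

// dagNode is a hypothetical stand-in for the DAG layer: it owns the links.
type dagNode struct {
	links []string
}

// fsObject is a hypothetical stand-in for the UnixFS layer: it records
// the size of each child, one entry per link.
type fsObject struct {
	blocksizes []uint64
}

// unixfsNode bundles both layers, as the helper structure does.
type unixfsNode struct {
	node *dagNode
	ufmt *fsObject
}

func newUnixfsNode() *unixfsNode {
	return &unixfsNode{node: &dagNode{}, ufmt: &fsObject{}}
}

// addChild updates both layers in lockstep.
func (n *unixfsNode) addChild(id string, size uint64) {
	n.node.links = append(n.node.links, id)
	n.ufmt.blocksizes = append(n.ufmt.blocksizes, size)
}

// numChildren answers from the UnixFS side (length of blocksizes).
func (n *unixfsNode) numChildren() int { return len(n.ufmt.blocksizes) }

// numLinks answers the same question from the DAG side.
func (n *unixfsNode) numLinks() int { return len(n.node.links) }

func main() {
	n := newUnixfsNode()
	n.addChild("child-a", 4)
	n.addChild("child-b", 2)
	fmt.Println(n.numChildren(), n.numLinks())
}
```

As long as every child is added through a single function, both counts agree, which is probably why reading either layer works; the sketch only shows why a reader might expect the DAG-side count instead.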

Going back to FillNodeLayer, the call to GetNextDataNode() gets the next chunk of data and stores it in a newUnixfsBlock(),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L188-L190

which creates a FSNode of type TRaw.

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L125-L130

First, the coupling of the two layers makes it hard to see that the difference resides in the UnixFS layer, where the type is now a raw object. Second, it is by no means apparent why the type has changed from TFile to TRaw in the UnixFS object of this DAG node (is this node a leaf?), especially since the call to a (raw) newUnixfs*Block*() comes from a function called GetNext*Data*Node().

After the (file) data is obtained and stored in the UnixFS object, its encapsulating node is added to its parent in AddChild,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L99-L112

Again the UnixfsNode confuses layers when asked for its FileSize(): depending on a boolean variable, raw, it will access either the UnixFS layer (ufmt.FileSize()) or a raw node (a third data entity stored in the UnixfsNode structure), which I'm associating with the DAG layer since it implements the ipld.Node interface.

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L133-L140
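
A rough sketch of that branching, under stated assumptions: the field names (raw, rawData, ufmt, subtotal) and both fileSize methods are hypothetical, and a plain byte slice stands in for the raw ipld.Node.

```go
package main

import "fmt"

// fsObject is a hypothetical UnixFS-layer object: its own data plus the
// accumulated size of its children.
type fsObject struct {
	data     []byte
	subtotal uint64
}

// fileSize is the UnixFS-side answer.
func (o *fsObject) fileSize() uint64 {
	return uint64(len(o.data)) + o.subtotal
}

// unixfsNode carries a raw flag that selects which entity is consulted.
type unixfsNode struct {
	raw     bool
	rawData []byte    // stands in for the raw node (assumption)
	ufmt    *fsObject // the UnixFS object, unused when raw is true
}

// fileSize hides which layer actually answers the question.
func (n *unixfsNode) fileSize() uint64 {
	if n.raw {
		return uint64(len(n.rawData))
	}
	return n.ufmt.fileSize()
}

func main() {
	wrapped := &unixfsNode{ufmt: &fsObject{data: []byte("abc"), subtotal: 10}}
	rawLeaf := &unixfsNode{raw: true, rawData: []byte("abcde")}
	fmt.Println(wrapped.fileSize(), rawLeaf.fileSize())
}
```

The caller sees one method, while the answer comes from two unrelated entities, which is the source of the confusion described above.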

Returning to AddChild, the most important function in the entire process (IMO), getBaseDagNode (called by GetDagNode), appears,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L173-L184

Only now does the relationship between the entities of UnixfsNode become clearer: the ipld.Node is used again (instead of its implementation), and it can be seen that the FSNode is formatted (through the use of protocol buffers) inside the DAG layer (in the ProtoNode.data field).

This getBaseDagNode function, which has no comments and is buried deep in the helpers package, is by no means easy to find, nor is its importance easy to understand.
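
The essential step, serializing the UnixFS object and storing the bytes in the DAG node's data field, can be sketched like this. It is not the real getBaseDagNode: all type and function names are hypothetical, and JSON is swapped in for protocol buffers purely to keep the example self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fsObject is a hypothetical UnixFS-layer object.
type fsObject struct {
	Type       string   `json:"type"`
	Data       []byte   `json:"data,omitempty"`
	Blocksizes []uint64 `json:"blocksizes,omitempty"`
}

// dagNode is a hypothetical DAG-layer node: opaque data plus links.
type dagNode struct {
	data  []byte
	links []string
}

// unixfsNode keeps both entities side by side until they are merged.
type unixfsNode struct {
	node *dagNode
	ufmt *fsObject
}

// baseDagNode serializes the UnixFS object (JSON here, an assumption
// standing in for protobuf) into the DAG node's data field.
func (n *unixfsNode) baseDagNode() (*dagNode, error) {
	b, err := json.Marshal(n.ufmt)
	if err != nil {
		return nil, err
	}
	n.node.data = b
	return n.node, nil
}

// decodeFSObject recovers the UnixFS object from a DAG node's data.
func decodeFSObject(b []byte) (*fsObject, error) {
	var o fsObject
	if err := json.Unmarshal(b, &o); err != nil {
		return nil, err
	}
	return &o, nil
}

func main() {
	n := &unixfsNode{
		node: &dagNode{links: []string{"leaf-0", "leaf-1"}},
		ufmt: &fsObject{Type: "File", Blocksizes: []uint64{4, 2}},
	}
	dn, err := n.baseDagNode()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(dn.data))
}
```

From this point on the DAG layer treats the data as opaque bytes; only code that knows about UnixFS decodes them again.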

@Stebalien
Member

it would seem that Raw is used in the leaves of the DAG and File for the rest, but if that is the case I'm not understanding why

I believe that is the case and I have no idea why either. IIRC, it's basically just an historical quirk.

Note: there are also DagRaw nodes. These are raw (binary) IPLD nodes that we use in the leaves of a file if the --raw-leaves option is specified on add. We'll make this the default when we release 1.0, we just don't want to constantly change hashes along the way.
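
A small illustration of why switching leaf formats changes hashes: the same chunk produces different block bytes, and therefore a different digest, depending on whether it is wrapped. The one-byte envelope and the plain SHA-256 digest are assumptions made for this sketch; the real leaves use a protobuf envelope and content identifiers.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// wrappedLeaf prepends a one-byte type tag to the chunk. This is a
// hypothetical stand-in for wrapping the chunk in a UnixFS object.
func wrappedLeaf(chunk []byte) []byte {
	out := make([]byte, 0, len(chunk)+1)
	out = append(out, 0x00)
	return append(out, chunk...)
}

// rawLeaf stores the chunk bytes as the block, with no envelope.
func rawLeaf(chunk []byte) []byte { return chunk }

// blockHash is a plain SHA-256 digest of the block bytes (assumption).
func blockHash(block []byte) [32]byte { return sha256.Sum256(block) }

func main() {
	chunk := []byte("same file bytes")
	fmt.Printf("wrapped: %x\n", blockHash(wrappedLeaf(chunk)))
	fmt.Printf("raw:     %x\n", blockHash(rawLeaf(chunk)))
}
```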

@schomatis
Contributor Author

Thanks @Stebalien. Do you have any idea why the FSNode identifier is used twice in the code? Is there a connection between the two?

@schomatis
Contributor Author

/cc @Stebalien ^^

@schomatis schomatis changed the title How are files and directories formatted inside nodes? How are files formatted inside nodes? Jun 12, 2018
@Stebalien
Member

This one fell through the cracks and I assume you may already know the answer to this but... they're just two different FSNodes. They both represent a single object in the filesystem (file, symlink, directory, etc.) but one is an interface used by MFS and the other is a concrete type used to store the actual data.
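
As a sketch of that distinction, the two identifiers could look roughly like this if placed side by side. Since a single Go file cannot hold two packages, the names carry a prefix (unixfsFSNode, mfsFSNode); the fields and the Kind method are hypothetical and not copied from the source.

```go
package main

import "fmt"

// unixfsFSNode stands in for the concrete type in the unixfs package:
// it holds the data that ends up stored inside a DAG node.
type unixfsFSNode struct {
	Data       []byte
	blocksizes []uint64
}

// mfsFSNode stands in for the interface in the mfs package: anything
// that behaves like an entry in the mutable filesystem.
type mfsFSNode interface {
	Kind() string // hypothetical method, for illustration only
}

// mfsFile and mfsDirectory implement the interface.
type mfsFile struct{ name string }

func (f *mfsFile) Kind() string { return "file" }

type mfsDirectory struct{ entries map[string]mfsFSNode }

func (d *mfsDirectory) Kind() string { return "directory" }

func main() {
	dir := &mfsDirectory{entries: map[string]mfsFSNode{
		"notes.txt": &mfsFile{name: "notes.txt"},
	}}
	var entry mfsFSNode = dir
	fmt.Println(entry.Kind(), dir.entries["notes.txt"].Kind())

	// The concrete type shares only the name; it is plain data.
	obj := unixfsFSNode{Data: []byte("abc"), blocksizes: []uint64{3}}
	fmt.Println(len(obj.Data), len(obj.blocksizes))
}
```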

@schomatis
Contributor Author

Yes, thanks. I would love to change the name of either of the two, or add 35 lines of comments, all of them saying: "watch it! this is not what you think it is.."

@schomatis
Contributor Author

I think this has been clarified enough by the balanced builder refactoring (which should be used as the reference for the question in this issue; the trickle builder will follow a similar process in ipfs/go-unixfs#10).
