How are files formatted inside nodes? #5059

schomatis · 2018-06-01T02:38:30Z

Roughly speaking, a block is formatted as a node (with links/children) using the merkledag.pb package. On top of the DAG layer, the UnixFS format is used to represent files and directories (but those two layers are not as decoupled as the previous two). The unixfs package has (basically) three types of objects: Raw, File and Directory. The importers split files and arrange the chunks in DAG nodes that contain UnixFS objects of type Raw and File (using the DAG links to connect them).
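As a rough illustration of the layering described above (not the actual go-ipfs code), the sketch below splits a byte slice into chunks, stores each chunk in a Raw leaf, and links the leaves from a File root. The names `unixfsObject`, `dagNode` and `importFile`, the fixed-size chunking, and the single-level layout are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// objType stands in for the three UnixFS object types mentioned above.
type objType int

const (
	typeRaw objType = iota
	typeFile
	typeDirectory
)

// unixfsObject is a hypothetical stand-in for a UnixFS object.
type unixfsObject struct {
	Type objType
	Data []byte
}

// dagNode is a hypothetical stand-in for a DAG node: it carries one
// UnixFS object and links to its children.
type dagNode struct {
	Object unixfsObject
	Links  []*dagNode
}

// importFile splits content into fixed-size chunks, puts each chunk in a
// Raw leaf, and connects the leaves to a single File root through links.
func importFile(content []byte, chunkSize int) (*dagNode, error) {
	if chunkSize <= 0 {
		return nil, errors.New("chunk size must be positive")
	}
	root := &dagNode{Object: unixfsObject{Type: typeFile}}
	for start := 0; start < len(content); start += chunkSize {
		end := start + chunkSize
		if end > len(content) {
			end = len(content)
		}
		leaf := &dagNode{Object: unixfsObject{Type: typeRaw, Data: content[start:end]}}
		root.Links = append(root.Links, leaf)
	}
	return root, nil
}

func main() {
	root, err := importFile([]byte("hello world"), 4)
	if err != nil {
		panic(err)
	}
	fmt.Println("root type:", root.Object.Type, "children:", len(root.Links))
	for _, leaf := range root.Links {
		fmt.Printf("leaf type: %d data: %q\n", leaf.Object.Type, leaf.Object.Data)
	}
}
```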

I'll be studying the importer package (and related ones) and recording what I find here, to later convert that information into comments and small code refactorings (any feedback or code pointers are more than welcome).

What I'm having the most trouble understanding is when the File type is used in the UnixFS objects and when Raw is (it would seem that Raw is used in the leaves of the DAG and File for the rest, but if that is the case I'm not understanding why). I'm also unclear about the interaction between the ProtoNode layer and the FSNode layer (which I would like to rename to FSObject, leaving Node for the DAG layer); both are encapsulated in a UnixfsNode structure which seems to have many representations. Finally, how does the MFS root fit into all of this?


schomatis · 2018-06-01T02:56:58Z

I'm now realizing that there are two FSNode identifiers, one in the unixfs package which represents the UnixFS format (defined in unixfs.proto),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/unixfs/unixfs.go#L142-L147

and the other one is an interface in the mfs package implemented by the File and Directory structures (and somewhat indirectly by the Root structure which contains a FSNode member),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/mfs/system.go#L44-L49

Is this intentional? How closely related are they?

schomatis · 2018-06-01T16:57:15Z

/cc @Stebalien

schomatis · 2018-06-01T18:29:33Z

I'm following the trickle importer to understand how the DAG and UnixFS layers are connected. Most of the logic resides in the helpers package and is abstracted through the UnixfsNode structure,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L40-L48

which combines the different layers and exposes them through generically named functions that make it hard to understand when the work is being done on the DAG node and when on the UnixFS object. As an example, I'll try to detail the basic trickle importer process and the problems a new user might face when trying to dig through the code.

The entry-point Layout function creates a "root" UnixfsNode (not to be confused with mfs.Root), "fills" it and "adds" it. This apparently simple process hides what (IMO) is the most important part of the logic of the IPFS information layers.

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/trickle/trickledag.go#L36-L48

NewUnixfsNode creates two entities, the dag.ProtoNode and the UnixFS FSNode object, which for now remain decoupled (although the former will eventually contain the latter). One thing that caught my attention is that at this (rather low-level) part of the code the ipld.Node interface is almost never used; its implementation, dag.ProtoNode, is used instead (the difference between these two should be discussed in another issue).

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L115-L120

In the simplest case of a single-layer DAG, fillTrickleRec will involve only a single call to FillNodeLayer,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L135-L147

The ambiguity of the UnixfsNode starts becoming apparent in a call like NumChildren(), which accesses FSNode.blocksizes to compare it with db.maxlinks, something that I would rather have associated with the DAG layer (instead of UnixFS) and the linking functionality it provides (this is subjective, of course),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L35-L36
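To make the layer mixing described above concrete, here is a simplified sketch (not the real helpers code) of a fill loop whose stop condition counts children through the UnixFS-side `blocksizes` slice rather than through the DAG links. The `builderNode` and `fsObject` types and the lower-case function names are hypothetical; only the idea of comparing the number of block sizes against a `maxlinks` limit comes from the discussion.

```go
package main

import "fmt"

// fsObject is a hypothetical UnixFS-layer object that records the size
// of each child block.
type fsObject struct {
	blocksizes []uint64
}

// builderNode is a hypothetical stand-in for UnixfsNode: it bundles the
// DAG-layer links with the UnixFS-layer object.
type builderNode struct {
	links []*builderNode // DAG layer
	ufs   fsObject       // UnixFS layer
	data  []byte
}

// numChildren answers a question about linking, but it does so by
// reading the UnixFS layer (blocksizes) instead of the DAG links.
func (n *builderNode) numChildren() int {
	return len(n.ufs.blocksizes)
}

// addChild updates both layers: a link in the DAG layer and a block
// size in the UnixFS layer.
func (n *builderNode) addChild(child *builderNode) {
	n.links = append(n.links, child)
	n.ufs.blocksizes = append(n.ufs.blocksizes, uint64(len(child.data)))
}

// fillNodeLayer adds data nodes to node until it has maxlinks children
// or the chunks run out, and returns the chunks that were not consumed.
func fillNodeLayer(node *builderNode, chunks [][]byte, maxlinks int) [][]byte {
	for node.numChildren() < maxlinks && len(chunks) > 0 {
		node.addChild(&builderNode{data: chunks[0]})
		chunks = chunks[1:]
	}
	return chunks
}

func main() {
	chunks := [][]byte{[]byte("aa"), []byte("bb"), []byte("cc"), []byte("dd")}
	node := &builderNode{}
	rest := fillNodeLayer(node, chunks, 3)
	fmt.Println("children:", node.numChildren(), "links:", len(node.links), "left over:", len(rest))
}
```

In this toy version the two counts always agree because `addChild` updates both layers together, which is exactly why it is hard to tell from the call site which layer is the source of truth.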

Going back to FillNodeLayer, the call to GetNextDataNode() gets the next chunk of data and stores it in a newUnixfsBlock(),

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L188-L190

which creates a FSNode of type TRaw.

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/dagbuilder.go#L125-L130

First, the coupling of the two layers makes it hard to see that the difference resides in the UnixFS layer, where the type is now a raw object. Second, it is by no means apparent why the type has changed from TFile to TRaw in the UnixFS object of this DAG node (is this node a leaf?), especially since the call to a (raw) newUnixfs*Block*() comes from a function called GetNext*Data*Node().

After the (file) data is obtained and stored in the UnixFS object its encapsulating node is added to its parent in AddChild,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L99-L112

Again the UnixfsNode confuses layers when asked for its FileSize(), which, depending on a boolean variable raw, will access either the UnixFS layer (ufmt.FileSize()) or a raw node (a third data entity stored in the UnixfsNode structure), which I'm associating with the DAG layer since it implements the ipld.Node interface.

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L133-L140
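The branching described above can be sketched as follows. This is a simplified model, not the real helpers code: the `rawNode`, `fsObject` and `builderNode` types are hypothetical, and computing the UnixFS file size as the inline data length plus the sum of the child block sizes is an assumption of the sketch.

```go
package main

import "fmt"

// rawNode is a hypothetical stand-in for a raw DAG-layer node that holds
// only bytes.
type rawNode struct {
	data []byte
}

// fsObject is a hypothetical stand-in for the UnixFS-layer object.
type fsObject struct {
	data       []byte
	blocksizes []uint64
}

// fileSize is assumed here to be the inline data plus all child blocks.
func (o *fsObject) fileSize() uint64 {
	total := uint64(len(o.data))
	for _, size := range o.blocksizes {
		total += size
	}
	return total
}

// builderNode is a hypothetical stand-in for UnixfsNode with its three
// entities reduced to two: a raw node and a UnixFS object.
type builderNode struct {
	raw     bool
	rawnode *rawNode
	ufmt    *fsObject
}

// fileSize answers from the DAG layer when raw is set and from the
// UnixFS layer otherwise.
func (n *builderNode) fileSize() uint64 {
	if n.raw {
		return uint64(len(n.rawnode.data))
	}
	return n.ufmt.fileSize()
}

func main() {
	leaf := &builderNode{raw: true, rawnode: &rawNode{data: []byte("hello")}}
	parent := &builderNode{ufmt: &fsObject{blocksizes: []uint64{5, 7}}}
	fmt.Println("leaf size:", leaf.fileSize(), "parent size:", parent.fileSize())
}
```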

Returning to AddChild, the most important function in the entire process (IMO), getBaseDagNode (called by GetDagNode), appears,

https://github.com/ipfs/go-ipfs/blob/7853e53860805e08a212d78c4baa5d59bff99ba8/importer/helpers/helpers.go#L173-L184

Only now does the relationship between the entities of UnixfsNode become clearer: the ipld.Node is used again (instead of its implementation), and it can be seen that the FSNode is formatted (through the use of protocol buffers) inside the DAG layer (in the ProtoNode.data field).

This getBaseDagNode function, which has no comments and is buried deep in the helpers package, is by no means easy to find, nor is its importance easy to understand.
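The relationship described above, a UnixFS object serialized into the data field of a DAG node, can be sketched like this. The types and function names are hypothetical, and JSON is used only as a readable stand-in for the protocol buffer encoding that the real code uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fsObject is a hypothetical stand-in for the UnixFS-layer object.
type fsObject struct {
	Type       string
	Data       []byte
	Blocksizes []uint64
}

// dagNode is a hypothetical stand-in for a DAG node: an opaque data
// field plus links to children (represented here as plain strings).
type dagNode struct {
	data  []byte
	links []string
}

// toDagNode serializes the UnixFS object and stores the result in the
// node's data field. NOTE: JSON replaces protobuf for illustration only.
func toDagNode(obj fsObject, links []string) (*dagNode, error) {
	encoded, err := json.Marshal(obj)
	if err != nil {
		return nil, err
	}
	return &dagNode{data: encoded, links: links}, nil
}

// fromDagNode recovers the UnixFS object from the node's data field.
func fromDagNode(n *dagNode) (fsObject, error) {
	var obj fsObject
	err := json.Unmarshal(n.data, &obj)
	return obj, err
}

func main() {
	obj := fsObject{Type: "File", Blocksizes: []uint64{4, 4, 3}}
	node, err := toDagNode(obj, []string{"child-0", "child-1", "child-2"})
	if err != nil {
		panic(err)
	}
	decoded, err := fromDagNode(node)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded.Type, decoded.Blocksizes, len(node.links))
}
```

The point of the sketch is that the DAG layer never needs to interpret its own data field; only the UnixFS layer knows how to decode it.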

Stebalien · 2018-06-01T23:04:27Z

> it would seem that Raw is used in the leaves of the DAG and File for the rest, but if that is the case I'm not understanding why

I believe that is the case and I have no idea why either. IIRC, it's basically just an historical quirk.

Note: there are also DagRaw nodes. These are raw (binary) IPLD nodes that we use in the leaves of a file if the --raw-leaves option is specified on add. We'll make this the default when we release 1.0, we just don't want to constantly change hashes along the way.
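A small sketch of why switching the default would change hashes: the bytes stored for a leaf differ depending on whether the chunk is wrapped or stored as-is. Everything here is an assumption made for illustration: `leafBlock` and `blockHash` are hypothetical helpers, the `unixfs-raw:` prefix merely stands in for the UnixFS/protobuf envelope, and plain SHA-256 stands in for the real content addressing.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafBlock returns the bytes that would be stored for a leaf. With
// rawLeaves set, the chunk is stored as-is; otherwise it is wrapped in
// a hypothetical envelope standing in for the UnixFS wrapper.
func leafBlock(chunk []byte, rawLeaves bool) []byte {
	if rawLeaves {
		return chunk
	}
	wrapped := append([]byte("unixfs-raw:"), chunk...)
	return wrapped
}

// blockHash hashes the stored bytes. SHA-256 is a stand-in for the real
// content identifier computation.
func blockHash(block []byte) [32]byte {
	return sha256.Sum256(block)
}

func main() {
	chunk := []byte("hello")
	wrapped := blockHash(leafBlock(chunk, false))
	raw := blockHash(leafBlock(chunk, true))
	fmt.Printf("wrapped: %x\n", wrapped)
	fmt.Printf("raw:     %x\n", raw)
	fmt.Println("same hash:", wrapped == raw)
}
```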

schomatis · 2018-06-02T00:05:42Z

Thanks @Stebalien, do you have any idea why the FSNode identifier is used twice in the code? Is there a connection between the two?

schomatis · 2018-06-06T12:29:17Z

/cc @Stebalien ^^

Stebalien · 2018-07-06T01:58:37Z

This one fell through the cracks and I assume you may already know the answer to this but... they're just two different FSNodes. They both represent a single object in the filesystem (file, symlink, directory, etc.) but one is an interface used by MFS and the other is a concrete type used to store the actual data.
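The distinction can be sketched in a single file by giving the two identifiers different names. This is only an illustration: `unixfsFSNode`, `mfsFSNode`, `mfsFile`, `mfsDirectory` and the `Kind` method are hypothetical and do not reproduce the real method set or fields of either package.

```go
package main

import "fmt"

// unixfsFSNode plays the role of the concrete type: it stores the
// actual data of one filesystem object.
type unixfsFSNode struct {
	kind string
	data []byte
}

// mfsFSNode plays the role of the interface: anything that behaves like
// a filesystem object can satisfy it.
type mfsFSNode interface {
	Kind() string
}

// mfsFile is a hypothetical implementation of the interface.
type mfsFile struct {
	content unixfsFSNode
}

func (f *mfsFile) Kind() string { return "file" }

// mfsDirectory is a second hypothetical implementation.
type mfsDirectory struct {
	entries map[string]mfsFSNode
}

func (d *mfsDirectory) Kind() string { return "directory" }

func main() {
	dir := &mfsDirectory{entries: map[string]mfsFSNode{
		"a.txt": &mfsFile{content: unixfsFSNode{kind: "file", data: []byte("hi")}},
	}}
	var node mfsFSNode = dir
	fmt.Println(node.Kind())
	for name, entry := range dir.entries {
		fmt.Println(name, entry.Kind())
	}
}
```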

schomatis · 2018-07-06T11:50:20Z

Yes, thanks. I would love to change the name of either of the two, or add 35 lines of comments, all of them saying: "watch it! this is not what you think it is..."

schomatis · 2018-09-26T00:41:50Z

I think this has been clarified enough with the balanced builder refactoring, which should be used as the reference to understand the question of this issue; the trickle builder will follow a similar process in ipfs/go-unixfs#10.

schomatis self-assigned this Jun 1, 2018

schomatis added the topic/docs-ipfs and topic/files labels Jun 1, 2018

schomatis added this to the Files API Documentation milestone Jun 1, 2018

schomatis mentioned this issue Jun 1, 2018

coreunix: simplify Adder structure #5062

Open

This was referenced Jun 2, 2018

Terminology: MFS vs UnixFS vs Files API #5051

Open

MFS: how are directory structures formatted inside nodes? #5081

Open

schomatis changed the title ~~How are files and directories formatted inside nodes?~~ How are files formatted inside nodes? Jun 12, 2018

schomatis mentioned this issue Jun 12, 2018

Revisiting the balanced builder #5106

Closed

schomatis mentioned this issue Aug 17, 2018

[community contribution] Files API milestone #5388

Open

schomatis closed this as completed Sep 26, 2018
