Generalizing to other objects than files

Hi, I recently learnt about pijul, mostly from the theoretical point of view, which interests me the most, even though I’m eager to actually use it to manage some code. Diving into the code (and the autogenerated doc), I noticed that directories seem to exist only implicitly (as some kind of “path” property of the “leaves”, the files). Is that true, or is there something I missed?

I’m asking this because I’m currently starting a distributed filesystem project which would bear much similarity to a DVCS, and I hope to build it upon libpijul as much as possible. More specifically, the structure I want to represent may contain things that look like files and things that look like directories, but I would like to think of directories as “structured files” that are just mappings from string keys (filenames) to the id of another structured file. As such, I like the git concept of object, which abstracts everything that is version-controllable, e.g. my structured files.

Translating this into pijul concepts, I think my structured files would map to the actual categorical objects of the file-and-patch category. Is that true? In particular I’m not sure about the categorical account of the concept of repository (are patches applied to a repository/set-of-files or to a file?). Supposing this is true, how much would it take to generalize the concept of file to a more abstract “object of any cocomplete category” (and is pijul interested in that)? In terms of libpijul, that would mean having a trait describing what it means for some type to be patchable (with an associated Patch type and some helpers to compute a pushout, to reduce a patch, etc.).
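To make the trait idea concrete, here is a minimal sketch of what such an interface could look like. Everything in it is hypothetical: the names Patchable, GrowSet, AddElements and grow_set are made up for illustration and are not part of libpijul. The instance shown is deliberately the easiest case, a set that only grows, where two concurrent patches can always be merged without conflict.

```rust
use std::collections::BTreeSet;

/// Hypothetical trait, not part of libpijul: a type whose values can be
/// changed by patches and whose concurrent patches can be merged.
pub trait Patchable: Sized {
    /// The kind of change this type supports.
    type Patch;

    /// A patch that turns `self` into `target`.
    fn diff(&self, target: &Self) -> Self::Patch;

    /// The state obtained by applying `patch` to `self`.
    fn apply(&self, patch: &Self::Patch) -> Self;

    /// Merge of two patches that both start from `self`
    /// (the role played by the pushout in the categorical account).
    fn merge(&self, left: &Self::Patch, right: &Self::Patch) -> Self;
}

/// Simplest possible instance: a set whose patches only add elements.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct GrowSet(pub BTreeSet<String>);

/// A patch on a `GrowSet`: the elements to add.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct AddElements(pub BTreeSet<String>);

impl Patchable for GrowSet {
    type Patch = AddElements;

    // Only additions are recorded; elements missing from `target` are ignored,
    // because this toy structure has no removal operation.
    fn diff(&self, target: &Self) -> AddElements {
        AddElements(target.0.difference(&self.0).cloned().collect())
    }

    fn apply(&self, patch: &AddElements) -> GrowSet {
        GrowSet(self.0.union(&patch.0).cloned().collect())
    }

    // Additions commute, so the merge is just "apply both".
    fn merge(&self, left: &AddElements, right: &AddElements) -> GrowSet {
        self.apply(left).apply(right)
    }
}

/// Helper to build a set from string literals.
pub fn grow_set(items: &[&str]) -> GrowSet {
    GrowSet(items.iter().map(|s| s.to_string()).collect())
}
```

Merging "add b" and "add c" on top of the set {a} gives {a, b, c} whichever patch is passed first. Structures with deletions or ordering (like lines in a file) would need a richer merge result that can represent conflicts, which this sketch does not attempt.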

Note that I’m mostly interested in quite concrete instances of this generalization. Pijul conceptually treats files as dynamic arrays of lines (which is probably easy enough to generalize to dynamic arrays of some abstract type T); I would like to add mappings from strings to an abstract type, stacks (e.g. an append-only log), fixed-length arrays, user-defined structs, sets, etc. I think that the concrete representation of the cocompletion of these respective structures (as categories where arrows are the supported operations) isn’t too complicated.
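As an illustration of one such concrete instance, here is a rough sketch of a string-keyed mapping that can hold conflicts. It is only one plausible representation, under the assumption that "a key bound to several candidate ids" is an acceptable conflict state; the names ConflictMap and MapPatch are hypothetical and unrelated to libpijul, and u64 stands in for whatever object id would really be used.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A directory-like mapping from names to object ids. Each name may hold
/// several candidate ids; more than one candidate means a conflict.
/// Hypothetical type, for illustration only.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct ConflictMap {
    entries: BTreeMap<String, BTreeSet<u64>>,
}

/// The supported operations on the mapping (the "arrows").
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum MapPatch {
    /// Bind `name` to `id`, keeping any existing bindings of `name`.
    Bind { name: String, id: u64 },
    /// Remove the binding of `name` to `id`, if present.
    Unbind { name: String, id: u64 },
}

impl ConflictMap {
    /// Apply one patch in place.
    pub fn apply(&mut self, patch: &MapPatch) {
        match patch {
            MapPatch::Bind { name, id } => {
                self.entries.entry(name.clone()).or_default().insert(*id);
            }
            MapPatch::Unbind { name, id } => {
                let now_empty = match self.entries.get_mut(name) {
                    Some(ids) => {
                        ids.remove(id);
                        ids.is_empty()
                    }
                    None => false,
                };
                // Drop names that no longer point at anything.
                if now_empty {
                    self.entries.remove(name);
                }
            }
        }
    }

    /// All ids currently bound to `name`, in ascending order.
    pub fn candidates(&self, name: &str) -> Vec<u64> {
        self.entries
            .get(name)
            .map(|ids| ids.iter().copied().collect())
            .unwrap_or_default()
    }

    /// Names that are bound to more than one id.
    pub fn conflicts(&self) -> Vec<String> {
        self.entries
            .iter()
            .filter(|(_, ids)| ids.len() > 1)
            .map(|(name, _)| name.clone())
            .collect()
    }
}
```

Two peers binding the same name to different ids end up in the same conflicted state whichever patch is applied first, and a later Unbind resolves the conflict. Whether this really is the cocompletion of the category of such mappings is not something the sketch proves.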

I’d be happy to hear any thoughts on this subject, whether other people are interested in this kind of development, etc.

I will take a stand on your thoughts and start with defining/clarifying some important concepts before bringing forward my own argument in this context.

A filesystem is a tree-shaped data structure. For this argument, assume it has no hardlinks. Softlinks have to be actively maintained and resolved anyhow and don’t destroy the tree structure. The leaves of the tree are files. Being a tree, a filesystem can always be flattened into a list of files.
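The flattening step can be sketched in a few lines. This is a toy model with made-up names (Node, flatten), not code from Git or Pijul; it only shows that directories survive as prefixes of the leaf paths.

```rust
/// Toy filesystem tree: files are leaves, directories hold children.
#[derive(Debug, Clone)]
pub enum Node {
    File(String),
    Dir(String, Vec<Node>),
}

/// Flatten a tree into the list of full paths of its files.
pub fn flatten(node: &Node) -> Vec<String> {
    let mut out = Vec::new();
    collect(node, "", &mut out);
    out
}

fn collect(node: &Node, prefix: &str, out: &mut Vec<String>) {
    match node {
        Node::File(name) => out.push(join(prefix, name)),
        Node::Dir(name, children) => {
            // A directory contributes nothing by itself, only a path prefix.
            let path = join(prefix, name);
            for child in children {
                collect(child, &path, out);
            }
        }
    }
}

fn join(prefix: &str, name: &str) -> String {
    if prefix.is_empty() {
        name.to_string()
    } else {
        format!("{}/{}", prefix, name)
    }
}
```

One consequence visible in the sketch: an empty directory disappears from the flattened list, since it has no leaf to carry its name.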

Git tracks content and it does so by storing it as blobs. That’s basically it. The rest is user convenience (tree objects) or implementation detail (commit objects etc.). This specifically means that Git has trouble tracking the name of content, by design. (See an early Torvalds Git talk at Google on that.) This seemed to be a good idea at the time. Turns out it is. Turns out again that we might be able to do better.

Pijul tracks the lifetime of change. Pijul objects are named changes. So, each change records a unique location in the filesystem and a change in content, including coming into existence and going out of existence.

So, in Git you can recalculate the location of content from a single tree and, by comparing two trees, calculate the change of content. In other words, in Git you derive the named change (a diff), while in Pijul the named change is the object you store directly.
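To illustrate the "derive it by comparing two trees" side, here is a small sketch that treats a snapshot as a flat map from path to content and computes the named changes between two snapshots. The types and names (Change, derive_changes, snapshot) are hypothetical and much simpler than what Git actually does; in a change-based system the resulting list would be what gets stored, rather than something recomputed.

```rust
use std::collections::BTreeMap;

/// A change to one named location, derived after the fact.
#[derive(Debug, PartialEq, Eq)]
pub enum Change {
    Added(String),
    Removed(String),
    Modified(String),
}

/// Compare two snapshots (path -> content) and derive the changes.
pub fn derive_changes(
    old: &BTreeMap<String, String>,
    new: &BTreeMap<String, String>,
) -> Vec<Change> {
    let mut changes = Vec::new();
    // Paths present in the new snapshot: either added, modified or untouched.
    for (path, content) in new {
        match old.get(path) {
            None => changes.push(Change::Added(path.clone())),
            Some(previous) if previous != content => {
                changes.push(Change::Modified(path.clone()))
            }
            Some(_) => {}
        }
    }
    // Paths that only exist in the old snapshot were removed.
    for path in old.keys() {
        if !new.contains_key(path) {
            changes.push(Change::Removed(path.clone()));
        }
    }
    changes
}

/// Helper to build a snapshot from (path, content) pairs.
pub fn snapshot(entries: &[(&str, &str)]) -> BTreeMap<String, String> {
    entries
        .iter()
        .map(|(path, content)| (path.to_string(), content.to_string()))
        .collect()
}
```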

In my perspective, this means that in Git, objects have a meaning outside a location. In Pijul they don’t. An object is always bound to a location.

In any practical case, you need to be able to address an object you want to see. So, you need to know the location of it anyhow (because it’s your only means of addressing it). There is no point in being able to store blobs that are not bound to a location (referenced by a tree object in Git context), which is why Git’s garbage collector throws those blobs away.

In both systems you only really care for the tree leaves, never for the directory hierarchy. This means you don’t need to care for directories per se, but pick them up as part of the location of an object.

With this being said, I understand your angle on structured files is that you want two files in the hierarchy to change dependently. So, a change in one file is only valid with a corresponding change in the second. For this, then, you could implement a support layer in Pijul, I suppose: an extra set of rules that is applied on top of calculating the content from changes. Is this what you want to do?

I would find that interesting, but that might impact user experience a lot. You would need a completely different interface for showing and interacting with these structured changes. And you would need the proper tools for producing such dependent changes in the first place. (Or you’d have a broken working tree in the meantime.)

Hi, thanks for taking the time. Actually now I realize I conflated two separate questions: the one about naming and the actual main one about structured files.

About named changes: ok, that’s the concept I was calling “implicit directories”, i.e. the fact that directories don’t exist per se; it’s just that the patches concerning a given file are all tagged with that file’s location. Yet I’m not sure I agree when you say that objects are always bound to a location: sure, every patch has a field giving the file’s location, but that’s only a metadata field. Since patches have hash-based IDs, the core system would work just as well without that field. Indeed, if I’m not mistaken, what’s going on at the high level is that we maintain a partition of all patches, with each part of that partition being the set of patches making up one file. Anyway, I think that part is clearer in my head now. What I’d like to ask about that part now is a clarification on the code: in libpijul, are directories nodes? I guess they are, since in src/backend/edge.rs there is the EdgeFlags definition containing const FOLDER_EDGE = 2;. So what do these directory nodes contain? Does a folder edge go from a directory node to the topmost (possibly several) line nodes of a given file, or to every line node of that file?

Note that I could answer the previous questions myself by spinning up some script linked to libpijul that inspects the internal data of the repository, but for now that’s too much for me, since a Sanakirja db is not just a sqlite db I can inspect with known tools. Actually I’d be interested in having some “plumbing” interface to the pijul database, possibly contributing it myself if others are willing to drive that.

With “structured files” I was referring to the last section of the paper that started this patch theory (emphasis mine):

We believe that the interest of our methodology lies in the fact that it adapts easily to other more complicated base categories L than the two investigated here: in future works, we should explain how to extend the model in order to cope with multiple files (which can be moved, deleted, etc.), different file types (containing text, or more structured data such as xml trees). Also, the structure of repositories (partially ordered sets of patches) is naturally modeled by event structures labeled by morphisms in P, which will be detailed in future works, as well as how to model usual operations on repositories: cherry-picking (importing only one patch from another repository), using branches, removing a patch, etc. It would also be interesting to explore axiomatically the addition of inverses for patches, following other works hinted at in the introduction.

What is being described here is that instead of tracking a single file, it’s doable to extend this to tracking several files (i.e. what is done in pijul), or to tracking other things that aren’t (text) files (not yet done, and what I would like).

A lot of the things mentioned here as future work have (afaik) been done already by pijul: multiple files, branches, repo operations, inverses. But pijul still only tracks changes on “named text files”. I’d be interested to hear how directories fit in that view and how much the pijul project would be interested in opening up the possibility of some kind of plugins adding new types of things to be tracked.

I’m gonna dare cc-ing @pmeunier, maybe you’re the one who has some answers, in particular with the libpijul overhaul that I heard was planned.

Of course, also keep in mind that I’m not really interested in changing the current pijul user interface. I’d like to bind to libpijul to make something that could go in place of the http protocol (or heck, ldap or ftp, any kind of distributed filesystem protocol): the web as a monorepo, where people browse by pulling changes, e.g. syncing some subtree with some peers, and make changes by doing them locally and again pushing the subtree to some peers.

I think you would be interested in running pijul info --debug on a test repository. This would answer all your questions. Folder edges link “basename vertices” of the graph to “inode vertices”, or inode vertices to basename vertices. Each basename vertex has a non-empty content, and exactly one parent and one child, whereas inode vertices have no content, and they represent a file or a directory. Non-folder edges are any edge internal to a file.
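For readers who want to picture that description, here is a toy model of the folder part of the graph. It is a sketch built only from the paragraph above and is not libpijul’s actual data structure: the names Vertex, FolderGraph, add_entry, basenames_well_formed and paths are all made up, and file contents (the non-folder edges) are left out entirely.

```rust
/// The two kinds of vertices described above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Vertex {
    /// No content; stands for a file or a directory.
    Inode,
    /// Non-empty content: the name of one entry.
    Basename(String),
}

/// Toy graph containing only folder edges, stored as (from, to) index pairs.
#[derive(Debug, Default)]
pub struct FolderGraph {
    vertices: Vec<Vertex>,
    folder_edges: Vec<(usize, usize)>,
}

impl FolderGraph {
    pub fn add_vertex(&mut self, vertex: Vertex) -> usize {
        self.vertices.push(vertex);
        self.vertices.len() - 1
    }

    /// Add an entry `name` under the inode `parent`: inode -> basename -> inode.
    /// Returns the index of the new inode.
    pub fn add_entry(&mut self, parent: usize, name: &str) -> usize {
        let basename = self.add_vertex(Vertex::Basename(name.to_string()));
        let inode = self.add_vertex(Vertex::Inode);
        self.folder_edges.push((parent, basename));
        self.folder_edges.push((basename, inode));
        inode
    }

    fn children(&self, vertex: usize) -> Vec<usize> {
        self.folder_edges
            .iter()
            .filter(|(from, _)| *from == vertex)
            .map(|(_, to)| *to)
            .collect()
    }

    /// Check that every basename vertex has non-empty content,
    /// exactly one parent and exactly one child.
    pub fn basenames_well_formed(&self) -> bool {
        self.vertices.iter().enumerate().all(|(i, vertex)| match vertex {
            Vertex::Inode => true,
            Vertex::Basename(name) => {
                let parents = self.folder_edges.iter().filter(|(_, to)| *to == i).count();
                !name.is_empty() && parents == 1 && self.children(i).len() == 1
            }
        })
    }

    /// Sorted paths of all entries reachable from the inode `root`.
    /// Assumes the folder edges contain no cycle.
    pub fn paths(&self, root: usize) -> Vec<String> {
        let mut out = Vec::new();
        let mut stack = vec![(root, String::new())];
        while let Some((inode, path)) = stack.pop() {
            for basename in self.children(inode) {
                if let Vertex::Basename(name) = &self.vertices[basename] {
                    let full = if path.is_empty() {
                        name.clone()
                    } else {
                        format!("{}/{}", path, name)
                    };
                    for child in self.children(basename) {
                        out.push(full.clone());
                        stack.push((child, full.clone()));
                    }
                }
            }
        }
        out.sort();
        out
    }
}
```

In this model a directory is simply an inode vertex that has basename children, so a path like src/lib.rs is recovered by walking inode, basename, inode, basename, inode from the root.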

The rewrite will not change the structure fundamentally, it’s just a smarter way of representing a repository on the disk.

Could this be extended? Maybe. But just like handling line deletions, it’s not as simple as what this paper claims.