Pijul

Generalizing to other objects than files

Hi, I recently learnt about Pijul, mostly from the theoretical point of view, which interests me the most, even though I’m eager to actually use it to manage some code. Diving into the code (and the autogenerated docs), I noticed that directories seem to exist only implicitly (as some kind of “path” property of the “leaves” – the files). Is that true, or is there something I missed?

I’m asking because I’m starting a distributed filesystem project that would bear much similarity to a DVCS, and I hope to build it upon libpijul as much as possible. More specifically, the structure I want to represent may contain things that look like files and things that look like directories, but I would like to think of directories as “structured files” that are just mappings from string keys (filenames) to the id of another structured file. As such, I like the Git concept of an object, which abstracts everything that is version-controllable, e.g. my structured files.

Translating this into Pijul concepts, I think my structured files would map to the actual categorical objects of the file-and-patch category. Is that true? In particular, I’m not sure about the categorical account of the concept of a repository (are patches applied to a repository/set of files or to a single file?). Supposing this is true, how much would it take to generalize the concept of a file to a more abstract “object of any cocomplete category” (and is Pijul interested in that)? In terms of libpijul, that would mean having a trait describing what it means for a type to be patchable (with an associated Patch type and some helpers to compute a pushout, to reduce a patch, etc.).
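To make the question concrete, here is a minimal sketch of what such a trait could look like. Everything in it is hypothetical: `Patchable`, `GrowSet` and `Insert` are names invented for illustration, not part of libpijul. The instance shown is the easiest case I could think of, a grow-only set, where merging two patches made against the same base is just the union of the insertions.

```rust
use std::collections::BTreeSet;

/// Hypothetical trait (not libpijul's API): a type whose values
/// can be changed by patches and merged after concurrent edits.
pub trait Patchable: Sized {
    /// A change that can be applied to a value of this type.
    type Patch;

    /// Apply a patch and return the resulting state.
    fn apply(&self, patch: &Self::Patch) -> Self;

    /// Merge two patches that were both made against `base`.
    /// This is the role the pushout plays in the categorical account.
    fn merge(base: &Self, left: &Self::Patch, right: &Self::Patch) -> Self;
}

/// Simplest instance: a set that only ever grows.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GrowSet(pub BTreeSet<String>);

/// A patch on a `GrowSet` is a set of elements to insert.
pub struct Insert(pub BTreeSet<String>);

impl Patchable for GrowSet {
    type Patch = Insert;

    fn apply(&self, patch: &Insert) -> Self {
        GrowSet(self.0.union(&patch.0).cloned().collect())
    }

    fn merge(base: &Self, left: &Insert, right: &Insert) -> Self {
        // Insertions commute, so the order of application does not matter.
        base.apply(left).apply(right)
    }
}
```

For this instance merging never conflicts. The interesting types are the ones where it can, and where the merged state has to be able to represent the conflict.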

Note that I’m mostly interested in quite concrete instances of this generalization. Pijul treats files conceptually as dynamic arrays of lines (which is probably easy enough to generalize to dynamic arrays of some abstract type T). I would like to add mappings from strings to an abstract type, stacks (e.g. an append-only log), fixed-length arrays, user-defined structs, sets, etc. I think that the concrete representation of the cocompletion of each of these structures (as categories whose arrows are the supported operations) isn’t too complicated.
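As an illustration of one such instance, here is a rough sketch of a mapping from strings to some value type `V`, with a three-way merge whose result can hold a conflict as an ordinary state. All names (`MapOp`, `Entry`, `apply`, `merge`) are made up for this example, and the merge rule is a plain three-way comparison per key, which is only loosely in the spirit of a cocompletion, not a faithful construction of one.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical operations on a directory-like mapping.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MapOp<V> {
    Set(String, V),
    Remove(String),
}

/// A merged entry: either a plain value, or a conflict that keeps
/// both sides (`None` means that side removed the key).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Entry<V> {
    Value(V),
    Conflict(Option<V>, Option<V>),
}

/// Apply a sequence of operations to a copy of `base`.
pub fn apply<V: Clone>(base: &BTreeMap<String, V>, ops: &[MapOp<V>]) -> BTreeMap<String, V> {
    let mut out = base.clone();
    for op in ops {
        match op {
            MapOp::Set(k, v) => {
                out.insert(k.clone(), v.clone());
            }
            MapOp::Remove(k) => {
                out.remove(k);
            }
        }
    }
    out
}

/// Three-way merge, key by key, of two edits made against the same base.
pub fn merge<V: Clone + PartialEq>(
    base: &BTreeMap<String, V>,
    left: &[MapOp<V>],
    right: &[MapOp<V>],
) -> BTreeMap<String, Entry<V>> {
    let l = apply(base, left);
    let r = apply(base, right);
    let keys: BTreeSet<&String> = base.keys().chain(l.keys()).chain(r.keys()).collect();
    let mut out = BTreeMap::new();
    for k in keys {
        let (b, lv, rv) = (base.get(k), l.get(k), r.get(k));
        let merged = if lv == rv || rv == b {
            // Both sides agree, or only the left side touched this key.
            lv.cloned().map(Entry::Value)
        } else if lv == b {
            // Only the right side touched this key.
            rv.cloned().map(Entry::Value)
        } else {
            // Both sides changed the key in different ways.
            Some(Entry::Conflict(lv.cloned(), rv.cloned()))
        };
        if let Some(entry) = merged {
            out.insert(k.clone(), entry);
        }
    }
    out
}
```

The point of the `Entry::Conflict` variant is that a conflicting merge still produces a value of the merged type instead of failing.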

I’d be happy to hear any thoughts on this subject, whether other people are interested in this kind of development, etc.

I will respond to your thoughts, starting by defining and clarifying some important concepts before bringing forward my own argument in this context.

A filesystem is a tree-shaped data structure. For this argument, assume it has no hardlinks. Softlinks have to be actively maintained and resolved anyhow and don’t destroy the tree structure. The leaves of the tree are files. Being a tree, a filesystem can always be flattened into a list of files.
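The flattening can be sketched in a few lines. This is a toy model only: `Node` and `flatten` are names invented here, and file content is just a byte vector.

```rust
/// A toy filesystem tree: directories hold named children, files hold bytes.
pub enum Node {
    File(Vec<u8>),
    Dir(Vec<(String, Node)>),
}

/// Flatten a tree into (path, content) pairs, one per file.
/// `prefix` is the path of `node` itself ("" for the root).
pub fn flatten(node: &Node, prefix: &str, out: &mut Vec<(String, Vec<u8>)>) {
    match node {
        Node::File(bytes) => out.push((prefix.to_string(), bytes.clone())),
        Node::Dir(children) => {
            for (name, child) in children {
                let path = if prefix.is_empty() {
                    name.clone()
                } else {
                    format!("{}/{}", prefix, name)
                };
                flatten(child, &path, out);
            }
        }
    }
}
```

Note that in this sketch an empty directory leaves no trace in the flattened list, since only files produce entries. Directories survive only as part of a file's path.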

Git tracks content and it does so by storing it as blobs. That’s basically it. The rest is user convenience (tree objects) or implementation detail (commit objects etc.). This specifically means that Git has trouble tracking the name of content, by design. (See an early Torvalds Git talk at Google on that.) This seemed to be a good idea at the time. It turns out it was. It also turns out that we might be able to do better.

Pijul tracks the lifetime of a change. Pijul objects are named changes. So, each change records a unique location in the filesystem and a change in content, including content coming into existence and going out of existence.

So, in Git you can recalculate the location of content within a single tree and, by comparing two trees, calculate the change of content. In other words, in Git you derive the named change (a diff), while in Pijul the named change is the immediate object you store.
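A rough sketch of the "derive the change by comparing two snapshots" side, under simplifying assumptions: a snapshot is modelled as a map from path to a content id (a plain string here), and `Change` and `diff` are illustrative names, not Git's or Pijul's actual data model.

```rust
use std::collections::BTreeMap;

/// A derived, per-path change between two snapshots.
#[derive(Debug, PartialEq, Eq)]
pub enum Change {
    Added(String),
    Removed(String),
    Modified(String),
}

/// Compare two snapshots (path -> content id) and derive the changes.
/// Removals and modifications come first, in path order, then additions.
pub fn diff(old: &BTreeMap<String, String>, new: &BTreeMap<String, String>) -> Vec<Change> {
    let mut changes = Vec::new();
    for (path, old_id) in old {
        match new.get(path) {
            None => changes.push(Change::Removed(path.clone())),
            Some(new_id) if new_id != old_id => changes.push(Change::Modified(path.clone())),
            Some(_) => {}
        }
    }
    for path in new.keys() {
        if !old.contains_key(path) {
            changes.push(Change::Added(path.clone()));
        }
    }
    changes
}
```

In this model a rename shows up as a `Removed` plus an `Added` with the same content id, so the fact that it was a rename has to be guessed after the fact. A system that stores the change itself can record it directly.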

From my perspective, this means that in Git, objects have a meaning outside a location. In Pijul they don’t. An object is always bound to a location.

In any practical case, you need to be able to address an object you want to see. So, you need to know the location of it anyhow (because it’s your only means of addressing it). There is no point in being able to store blobs that are not bound to a location (referenced by a tree object in Git context), which is why Git’s garbage collector throws those blobs away.

In both systems you only really care about the tree leaves, never about the directory hierarchy. This means you don’t need to care about directories per se, but pick them up as part of the location of an object.

With this being said, I understand your angle on structured files to be that you want two files in the hierarchy to change dependently. So, a change in one file is only valid with a corresponding change in the second. For this, you could implement a support layer in Pijul, I suppose: an extra set of rules that is applied on top of calculating the content from changes. Is this what you want to do?
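To show what I mean by an extra set of rules, here is a very small sketch. It assumes a change can be summarised as the set of paths it touches, and it invents a `Dependency` rule type and a `violations` check. Nothing like this exists in Pijul as far as I know.

```rust
use std::collections::BTreeSet;

/// Hypothetical rule: if `if_changed` is touched by a change,
/// then `then_also` must be touched by the same change.
#[derive(Debug, PartialEq, Eq)]
pub struct Dependency {
    pub if_changed: String,
    pub then_also: String,
}

/// Return the rules that a change (given as its set of touched paths) breaks.
pub fn violations<'a>(
    changed: &BTreeSet<String>,
    rules: &'a [Dependency],
) -> Vec<&'a Dependency> {
    rules
        .iter()
        .filter(|r| changed.contains(&r.if_changed) && !changed.contains(&r.then_also))
        .collect()
}
```

A change would be accepted only if `violations` returns an empty list. A real layer would have to look at the content of the changes, not just at which paths they touch.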

I would find that interesting, but it might impact user experience a lot. You would need a completely different interface for showing and interacting with these structured changes. And you would need the proper tools for producing this kind of dependent change in the first place. (Or you’d have a broken working tree in the meantime.)