Disk space reduction

While trying to clean the Nest with my new Nest cleaner (not yet ready, still pretty inefficient), I had the opportunity to measure disk space usage “in production”, i.e. for a real repository.

So, the pijul repository has 25 branches (there is one branch per user branch, and an extra one when the branch has pending patches), and uses 39 MB, which is a lot.

The current implementation of branches is (roughly) a table of type Db<String, (Db<PatchId, ()>, Graph)>. Moreover, forks happen in O(log n) time (where n is the number of Sanakirja pages referenced at least twice), but edits to one of the branches start duplicating its pages.
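
As a loose analogy (this is not Sanakirja's actual API, just a sketch of the copy-on-write behaviour described above, with made-up names):

```rust
use std::sync::Arc;

// Forking shares all pages cheaply; the first edit to the forked side
// forces the shared data to be copied.
#[derive(Clone)]
struct Graph {
    pages: Arc<Vec<u8>>,
}

fn fork(g: &Graph) -> Graph {
    // Cheap: only a reference count is bumped, nothing is copied yet.
    g.clone()
}

fn edit(g: &mut Graph, value: u8) {
    // `Arc::make_mut` clones the underlying Vec only if it is still shared,
    // mirroring how edits after a fork start duplicating pages.
    Arc::make_mut(&mut g.pages).push(value);
}

fn main() {
    let master = Graph { pages: Arc::new(vec![]) };
    let mut forked = fork(&master);
    edit(&mut forked, 42); // master's pages are untouched, forked's get copied
}
```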

My idea to solve this problem is to have just one graph containing all the edges, and branches would then be just sets of patches.

Only the patch sets would get forked (in the Sanakirja sense), and they are much smaller (and could be compressed further still).

This would additionally make it possible to apply or unapply a patch from another branch in time independent of the patch's size, and would make time travel as fast as a standard “revert”.
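
In other words, something like this in-memory sketch (names invented for illustration; the real patch sets would live in Sanakirja):

```rust
use std::collections::BTreeSet;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct PatchId(u64);

// In this model a branch owns no graph at all, only its patch set.
#[derive(Clone, Default)]
struct Branch {
    patches: BTreeSet<PatchId>,
}

impl Branch {
    // Forking copies only the (small) patch set.
    fn fork(&self) -> Branch {
        self.clone()
    }
    // Applying a patch whose edges are already in the shared graph is just
    // a set insertion, independent of the patch's size...
    fn apply(&mut self, p: PatchId) {
        self.patches.insert(p);
    }
    // ...and unapplying it is a set removal.
    fn unapply(&mut self, p: PatchId) {
        self.patches.remove(&p);
    }
}
```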

The tradeoff is that obsolete edges (which include old versions of deleted edges, among other things) would have to stay in the database.

Also, a conversion tool would be really easy to write.

What do you think?

As far as I know, git has similar issues and has a dedicated command (prune, I think) to deal with it. This is behaviour we could mimic.

That does sound like a lot of space, although 25 branches is also quite a lot. Do you know where the space is going (e.g. fragmentation, Sanakirja overhead, duplicated data)? Also, how big are the various databases?

Isn’t there also a tradeoff in the time it takes to (e.g.) check out a branch, or run ‘diff’ on a branch? Those are pretty common, so I wouldn’t want their time complexity to depend on the number of patches in all the other branches.

What if you had a command to mark a branch as “cold”, meaning that you wouldn’t have to store its graph but instead you could recreate it (possibly slowly) from a set of patches?

Well, 25 branches is a small number on the Nest: every time someone clicks “fork”, even if they don’t change anything, a new branch is created. This costs nothing immediately (neither in space nor time), but has the effect of “pinpointing” the current version, effectively creating a “snapshot” that progressively drifts away from the master branch over time.

With this proposal, “cold” branches would not be needed: the type of the repository would be (Graph, Db<BranchName, Db<PatchId, ()>>) instead of the current Db<BranchName, (Graph, Db<PatchId, ()>)>.
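
In Rust-like pseudocode, with placeholder definitions standing in for the real Sanakirja-backed types:

```rust
// `Db` stands in for an on-disk Sanakirja B-tree; the other names mirror
// the prose above and are only placeholders.
struct BranchName(String);
struct PatchId([u8; 32]);
struct Graph; // the graph of lines and edges
struct Db<K, V>(std::marker::PhantomData<(K, V)>);

// Current: each branch owns a full graph, duplicated as soon as it diverges.
type CurrentRepo = Db<BranchName, (Graph, Db<PatchId, ()>)>;

// Proposed: a single shared graph; a branch is only its (small) patch set.
type ProposedRepo = (Graph, Db<BranchName, Db<PatchId, ()>>);
```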

Neither would I, but currently, going back in time has a pretty high complexity (you need to unrecord all the patches successively).
With my new proposal, this would have the same complexity as the current “graph fetch”, which I believe is what you meant by “diff and checkout”.

My suggestion seems likely to increase the arity of some vertices, since we would have to keep many more pseudo-edges. As long as at least one of the following two holds, we should be good:

  • the arity is independent of the number of patches
  • the extra pseudo-edges can carry a flag that sorts after the flags of other edges (the ordering of edge flags is already used for a similar purpose in the current Pijul), so that we can skip them when traversing the outgoing edges of a vertex (see the sketch below).
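
For the second point, here is a rough sketch of what “sorting pseudo-edges last” buys us (flag values and names are invented, and a plain BTreeSet stands in for the on-disk tables):

```rust
use std::collections::BTreeSet;

type Vertex = u64;

// Made-up flag values: the real Pijul flags differ; the point is only that
// pseudo-edges coming from other branches sort after every "real" flag.
const FLAG_ALIVE: u8 = 0;
const FLAG_FOREIGN_PSEUDO: u8 = 128;

// Edges keyed by (source, flag, target); BTreeSet iteration is ordered,
// so a range scan over one vertex stops before its foreign pseudo-edges.
fn alive_children(
    edges: &BTreeSet<(Vertex, u8, Vertex)>,
    v: Vertex,
) -> impl Iterator<Item = Vertex> + '_ {
    edges
        .range((v, FLAG_ALIVE, Vertex::MIN)..(v, FLAG_FOREIGN_PSEUDO, Vertex::MIN))
        .map(|&(_, _, to)| to)
}
```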

Well tried!

But I would be pretty sad if Pijul relied on this kind of user-unfriendly command. A Pijul user should not have to know how Pijul is implemented in order to use it.

Let’s free the world from porcelain and plumbing!

Am I correct in understanding that Graph here would be the graph that you get from applying all the patches in the repository simultaneously? If so, I still don’t understand how to efficiently fetch the graph corresponding to a branch. For example, suppose that a branch b consists of half of the patches in a very large repository. Some of the lines in b might be deleted by a patch that is contained in the repository, but not in b. So in order to fetch the graph for b, don’t you need to look at all the edges in the giant repository graph, including deleted ones?
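
To make the worry concrete, the naive way to materialise b’s graph would look something like this (hypothetical names and a simplified edge representation, just to illustrate the question):

```rust
use std::collections::BTreeSet;

type PatchId = u64;
type Vertex = u64;

// Simplified edge: who introduced it, and (optionally) who deleted it.
struct Edge {
    from: Vertex,
    to: Vertex,
    introduced_by: PatchId,
    deleted_by: Option<PatchId>,
}

// Naive materialisation of branch `b`: scan every edge of the shared graph
// and keep those introduced by a patch of `b` and not deleted by one.
// The cost is proportional to the whole repository, not to `b`.
fn graph_of_branch<'a>(
    all_edges: &'a [Edge],
    b: &'a BTreeSet<PatchId>,
) -> impl Iterator<Item = (Vertex, Vertex)> + 'a {
    all_edges.iter().filter_map(move |e| {
        let introduced = b.contains(&e.introduced_by);
        let deleted = e.deleted_by.map_or(false, |p| b.contains(&p));
        if introduced && !deleted {
            Some((e.from, e.to))
        } else {
            None
        }
    })
}
```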