While trying to clean the Nest with my new Nest cleaner (not yet ready, still pretty inefficient), I had the opportunity to measure disk space usage “in production”, i.e. for a real repository.
So, the pijul repository has 25 branches (there is one branch per user branch, plus an extra one when a branch has pending patches), and uses 39MB, which is a lot.
The current implementation of branches is (roughly) a table of type Db<String, (Db<PatchId, ()>, Graph)>. Moreover, forks happen in O(log n) time (where n is the number of Sanakirja pages referenced at least twice), but edits to one of the branches start duplicating it.
My idea to solve this problem is to have just one graph containing all the edges, and branches would then be just sets of patches.
Only the patchsets would get forked (in the Sanakirja sense), and they are much smaller (and further compressible).
This would additionally allow applying or unapplying a patch from another branch in time independent of the patch's size, and would make time travel as fast as a standard “revert”.
The tradeoff is that obsolete edges (which include old versions of deleted edges, but not only those) would have to stay in the database.
Also, a conversion tool would be really easy to write.
That does sound like a lot of space, although 25 branches is also quite a lot. Do you know where the space is going (e.g. fragmentation, Sanakirja overhead, duplicated data)? Also, how big are the various databases?
Isn’t there also a tradeoff in the time it takes to (e.g.) checkout a branch, or run ‘diff’ on a branch? Those are pretty common, so I wouldn’t want their time complexity to depend on the number of patches in all the other branches.
What if you had a command to mark a branch as “cold”, meaning that you wouldn’t have to store its graph but instead you could recreate it (possibly slowly) from a set of patches?
Well, 25 branches is a small number on the Nest: every time someone clicks “fork”, even if they don’t change anything, a new branch is created. This costs nothing immediately (neither in space nor time), but has the effect of “pinpointing” the current version, effectively creating a “snapshot” which gets “separated” from the master branch progressively over time.
With this proposal, “cold” branches would not be needed: the type of the repository would be (Graph, Db<BranchName, Db<PatchId, ()>>) instead of the current Db<BranchName, (Graph, Db<PatchId, ()>)>.
Neither would I, but currently, going back in time has a pretty high complexity (you need to unrecord all the patches successively).
With my new proposal, this would have the same complexity as the current “graph fetch”, which I believe is what you meant by “diff and checkout”.
My suggestion seems likely to increase the arity of some vertices, since we would have to keep many more pseudo-edges. As long as at least one of the following two holds, we should be good:
- the arity is independent of the number of patches, or
- the extra pseudo-edges can have a flag larger than that of other edges (the order of edge flags is already used to achieve a similar thing in the current pijul), so that we can ignore them when traversing the outgoing edges of a vertex.
Am I correct in understanding that Graph here would be the graph that you get from applying all the patches in the repository simultaneously? If so, I still don’t understand how to efficiently fetch the graph corresponding to a branch. For example, suppose that a branch b consists of half of the patches in a very large repository. Some of the lines in b might be deleted by a patch that is contained in the repository, but not in b. So in order to fetch the graph for b, don’t you need to look at all the edges in the giant repository graph, including deleted ones?