In https://pijul.org/sanakirja/, I can see that the fork operation has complexity O(log n). But in https://pijul.org/sanakirja/doc/sanakirja/ I can see the database can be cloned in O(1). I suppose clone and fork is the same operation really, or am i wrong?
Excellent question. It is the same operation, and O(1) is a mistake for the absolute worst case complexity, and the real answer for average complexity is somewhere in between.
More specifically, the complexity of cloning depends on a tricky parameter: the number of pages referenced strictly more than once in the whole database.
In order to understand that parameter, you first have to understand that Sanakirja stores everything in B trees (+ exactly one linked list, explained below), and does all the edits in a memory-mapped file. The file is divided in blocks of 4k = 4096 bytes (an atomic block for reading and writing on many file systems and memory architectures).
You can store a number of B trees in the same file, actually as many as 64 bits integers in the first 4096 bytes page in the file (i.e. you can store up to 512 B trees, minus a small constant < 10, let’s say at most 500 trees).
The reason is, the address to the root of each of the trees in the file is stored on the first page of the file. This was chosen because synchronising a single page to disk is the only thing that’s guaranteed to be atomic on good file systems.
Now, imagine the case of a single B tree stored in a file. Sanakirja represents exactly two things for that tree: a list of unused pages, freed by previous transactions, and an actual B tree.
Now, every time you add or remove a key, unused pages will be allocated, others will be “temporarily freed” (i.e. free for future transactions to use, but not for the current one, since we want to be able to cancel the transactions, and also since concurrent transactions might be reading it, so we don’t want to overwrite it).
Now, since everything uses copy-on-write, cloning a tree just duplicates the pointer to the root, i.e. increases a reference counter to the root. When one of the clones is changed, the root will be duplicated (i.e. there will be two different pages holding a copy of the root), and all the children that have not changed will be pointed to by the two copies. Hence, all those children will need their reference count to be incremented, and since the root has been cloned, both copies are now referenced only once, their reference count has to be decremented.
The last thing to understand to understand the complexity story, is where these reference counts are stored. We cannot store them on the pages themselves, because that would not be transactional (how do you keep track of all the reference counts that have changed, and revert them all?). Instead, they are themselves stored in another B tree.
But that poses another problem, which is that operations on that B tree need to allocate and free pages, but allocating and freeing might need to change the reference counts stored in the same B tree, which could result in infinite loops.
The solution to that is to store only reference counts when they are at least 2, and assume all pages that have a pointer to them but have no reference counter to have reference count 1. Then, since the B tree storing the reference counts is never cloned, its doesn’t need to store the reference counts for its own pages.
Now, on to the complexity of cloning: it is in O of the size of that last B tree, which depends on how many pages the different existing clones have in common. That is not super easy to study, since immediately after the fork, they have everything in common, the reference counting B tree is of size 1. After some number of operations after the fork, they have no page in common anymore, and the reference counting B tree is of size 0. I believe it peaks when half of the leaves are referenced twice, where the reference counting B tree has O(n) entries. But then its operations are all in O(log n). So, that’s a worst-case complexity not seen often.
Thanks you very much for the detailed explanation.