Random Talks and Thoughts

If you want to share something with the pijul contributors but don’t feel it deserves its own thread, this is the perfect place. The main idea is to exchange random thoughts or ideas in an informal way.

For instance: I have just pushed a patch to add a NotImplementedYet error so that we can be lazy sometimes if the feature we are working on is too big d:.

I can’t create a proper PR in the Nest, so this is a good occasion to use this thread. I’ve tweaked the output of pijul -V a little to give the user more information. It now says whether pijul was compiled in debug or release mode.

# release
λ pijul-dev -V
pijul 0.6.0
2017-06-05 12:28:33.040423785 +02:00
# debug
λ target/debug/pijul -V
pijul 0.6.0 -- development version
2017-06-05 12:27:07.925990949 +02:00

The patch is on my repo and I’d like your feedback!
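For readers curious how such a distinction can be made, here is a minimal sketch. It is not the actual patch: `version_line` is a hypothetical helper, and it assumes the build kind is detected with the standard `cfg!(debug_assertions)` macro, which is on by default for `cargo build` and off by default for `cargo build --release`. The build timestamp shown above is left out, since it would need something like a build script.

```rust
/// Build the first line printed by a hypothetical `-V` handler.
/// `is_debug` is a parameter so that both variants can be exercised.
fn version_line(version: &str, is_debug: bool) -> String {
    if is_debug {
        format!("pijul {} -- development version", version)
    } else {
        format!("pijul {}", version)
    }
}

fn main() {
    // True for unoptimized builds, false for release builds (by default).
    let is_debug = cfg!(debug_assertions);
    println!("{}", version_line("0.6.0", is_debug));
}
```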

bincode is a bad choice for serialization for two reasons: ⓐ it doesn’t produce self-describing output, and ⓑ it is too closely tied to Rust. ⓐ is not a huge deal, but ⓑ could seriously limit future adoption. I am not questioning your choice of implementation language; in fact, based on what I know about Rust, it sounds like a great fit. However, once Pijul is more mature, it would help tremendously with adoption if it were practical to read and write Pijul repositories with tools written in other languages (without having to depend on parsing executable output or interfacing with a foreign library). The alternative implementations of git, for example, have made a lot of interesting software possible.

I want to acknowledge that I am late to the game here — I just now read the April 2nd blog post. Furthermore, I want to acknowledge that my conclusions may be misinformed¹ or simply wrong. I just wanted to make sure these concerns were taken into account, if only to be dismissed.

At the very least, it seems like a (machine-readable) specification of the binary format² would be a huge win for testing and maintaining compatibility across versions. Does bincode include support for versioning your data?

¹ I didn’t dig deep into bincode, so my understanding of it is somewhat superficial. Neither am I competent with Rust, so maybe the data structures are so well defined that this is not an issue.

² Even if post hoc

P.S. None of this is going to prevent me from trying it out; in fact, I’m in the process of getting Rust up and running on my system just for that purpose.

Thanks for the comments. In the beginning, we were using cbor, but the lack of unique representation forced us to change: different implementations of cbor in Rust were using different features.

Bincode is a great fit for this: it is extremely simple, super easy to define, and has unique encodings. It is not self-described, but you probably don’t really care, since Pijul patches store binary information, which you would probably not make sense of by yourself.
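To make "unique but not self-described" concrete, here is a hand-rolled sketch of a fixed-layout encoding. It is only a stand-in: it does not use the bincode crate, it does not claim to match bincode's wire format, and `Record` is a hypothetical type, not something from libpijul. The bytes carry no field names or type tags, so a reader needs the schema, but a given value always maps to exactly one byte sequence.

```rust
/// Hypothetical record; not a real Pijul type.
#[derive(Debug, PartialEq)]
struct Record {
    line: u64,
    text: String,
}

/// Layout: `line` as 8 little-endian bytes, then the text length as
/// 8 little-endian bytes, then the UTF-8 bytes of the text.
fn encode(r: &Record) -> Vec<u8> {
    let mut out = Vec::with_capacity(16 + r.text.len());
    out.extend_from_slice(&r.line.to_le_bytes());
    out.extend_from_slice(&(r.text.len() as u64).to_le_bytes());
    out.extend_from_slice(r.text.as_bytes());
    out
}

/// Read a little-endian u64 starting at offset `at`, if enough bytes exist.
fn read_u64(bytes: &[u8], at: usize) -> Option<u64> {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes.get(at..at + 8)?);
    Some(u64::from_le_bytes(buf))
}

/// Decode a record, rejecting truncated input and trailing bytes.
fn decode(bytes: &[u8]) -> Option<Record> {
    let line = read_u64(bytes, 0)?;
    let len = read_u64(bytes, 8)? as usize;
    let end = 16usize.checked_add(len)?;
    if bytes.len() != end {
        return None;
    }
    let text = String::from_utf8(bytes[16..end].to_vec()).ok()?;
    Some(Record { line, text })
}

fn main() {
    let record = Record { line: 7, text: "hi".to_string() };
    let bytes = encode(&record);
    println!("{:?}", bytes);
    println!("{:?}", decode(&bytes));
}
```

A layout this simple is also easy to write down as a specification, which is what a reader implemented in another language would need.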

Bincode is also extremely fast. Being tied to Rust is also not a big issue, since you’ll probably want to interact with Pijul via libpijul, itself written in Rust.

I very much like the idea of patch-based version control. I am not a dumb person, but I find it hard to picture what git is doing when I enter the (non-intuitive) git commands I find on the web. My question here is even more basic than the difference between patch-based and snapshot-based version control. Suppose a source file contains a loop, and someone forgot to increment a counter in that loop. For the sake of this discussion, suppose it does not really matter where in the loop body the counter gets incremented when this is corrected, as long as it is in the second half of the body. Now suppose person A develops a patch AA in which the counter is incremented right in the middle of the loop body, and person B develops a patch BB in which it is incremented at the end of the loop body. As the two patches touch different lines, it appears that both can be merged without conflict. However, the counter then gets incremented twice. This is a fundamental problem of automatic merging.
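The scenario can be reproduced with a toy model. This is only a sketch under simplifying assumptions: a patch is a single line insertion located by its context line, the `Insert` type and `apply` function are hypothetical, and none of this reflects how Pijul or git actually represent changes.

```rust
/// A toy patch: insert `new_line` right after the first line equal to `after`.
struct Insert {
    after: &'static str,
    new_line: &'static str,
}

/// Apply a patch; returns false if the context line is missing (a "conflict").
fn apply(file: &mut Vec<&'static str>, patch: &Insert) -> bool {
    match file.iter().position(|line| *line == patch.after) {
        Some(i) => {
            file.insert(i + 1, patch.new_line);
            true
        }
        None => false,
    }
}

/// Apply AA and BB to the same base file and return the merged result.
fn merged() -> Vec<&'static str> {
    let mut file = vec!["loop {", "    step_one();", "    step_two();", "}"];
    // Patch AA: increment in the middle of the loop body.
    let aa = Insert { after: "    step_one();", new_line: "    counter += 1;" };
    // Patch BB: increment at the end of the loop body.
    let bb = Insert { after: "    step_two();", new_line: "    counter += 1;" };
    // Both context lines are found, so neither patch reports a conflict.
    assert!(apply(&mut file, &aa));
    assert!(apply(&mut file, &bb));
    file
}

fn main() {
    let file = merged();
    for line in &file {
        println!("{}", line);
    }
    let increments = file.iter().filter(|l| l.trim() == "counter += 1;").count();
    println!("increments per iteration: {}", increments);
}
```

Both patches apply cleanly because their contexts do not overlap, yet the merged loop body contains the increment twice.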

Sure, but even if you managed to get a formal definition of the problem you would like to solve, that problem is likely to be undecidable. Instead, what Pijul does is to guarantee at least a few axioms to rely on: associativity, commutativity, inverses.

We’re not claiming much more than that, but it’s already much better than others (including Git).

Another “random idea” concerns the definition of a “patch”. To pin down the location of an edit, such a definition can use either line numbers or context (enough lines before and after the edit that they are unique in the file). However, I think it would help enormously if each line of a text file carried a unique ID number (which a text editor would hide). Insertions or deletions at other locations (through other patches) would shift line numbers but leave such an ID untouched. During copy-paste actions, the editor would have to generate new IDs for the pasted lines, to keep the IDs unique at all times. The downside is that it requires a slight extension of the text editors people use, to a) hide these IDs and b) replace existing IDs with new ones on paste. Still, it seems to me that such unique per-line IDs would enable a more powerful definition of the concept of a “patch”. https://stackoverflow.com/questions/45258674/would-it-help-source-code-version-control-systems-to-start-each-line-with-a-64-b
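A minimal sketch of the idea, with hypothetical `Line` and `Edit` types that are not taken from Pijul or any editor: each edit names the ID of the line it refers to, so an insertion elsewhere in the file cannot shift its target. In this toy case the two edits also give the same file in either order.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Line {
    id: u64, // unique and stable, unlike a line number
    text: &'static str,
}

enum Edit {
    InsertAfter { anchor: u64, line: Line },
    Delete { id: u64 },
}

/// Apply an edit; returns false if the line it refers to does not exist.
fn apply(file: &mut Vec<Line>, edit: &Edit) -> bool {
    match edit {
        Edit::InsertAfter { anchor, line } => {
            match file.iter().position(|l| l.id == *anchor) {
                Some(i) => {
                    file.insert(i + 1, line.clone());
                    true
                }
                None => false,
            }
        }
        Edit::Delete { id } => {
            let before = file.len();
            file.retain(|l| l.id != *id);
            file.len() < before
        }
    }
}

/// Apply the same two edits in either order and return the resulting text.
fn demo(insert_first: bool) -> Vec<&'static str> {
    let mut file = vec![
        Line { id: 1, text: "alpha" },
        Line { id: 2, text: "beta" },
        Line { id: 3, text: "gamma" },
    ];
    let insert = Edit::InsertAfter { anchor: 1, line: Line { id: 4, text: "new" } };
    // With line numbers this would be "delete line 3", which would hit
    // "beta" if the insertion had been applied first.
    let delete = Edit::Delete { id: 3 };
    let order = if insert_first { [&insert, &delete] } else { [&delete, &insert] };
    for edit in order.iter() {
        assert!(apply(&mut file, *edit));
    }
    file.iter().map(|l| l.text).collect()
}

fn main() {
    println!("{:?}", demo(true));
    println!("{:?}", demo(false));
}
```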

We have something like that in Pijul, but we also have a fairly robust theory proving what patches can and cannot do if we want to keep some guarantees. Moving text around while keeping line IDs is sometimes possible, sometimes not. In particular, it needs to work when several people do it in parallel.

With Victor Grishchenko’s ctre every character has identity spanning versions. This makes following changes to a text over time much easier because the identity of a string is not bound to whatever position it happens to be at in the current state of the repo.

https://github.com/gritzko/ctre/blob/master/doc/ws10.pdf

Hey all, I have written and submitted a small patch for pijul that makes pijul -V give a little more information. I would love to get some feedback!

https://nest.pijul.com/pijul_org/pijul/discussions/140

This is why we use tests, I guess. CI would maybe detect this error.
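As a sketch of what such a test could check, here is a toy interpreter for a loop body. Everything in it is hypothetical: `Stmt`, `run` and the check are made up for illustration. Each patch passes the check on its own, while a clean textual merge that keeps both increments fails it.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Stmt {
    Work,      // any statement that does not touch the counter
    Increment, // `counter += 1`
}

/// Run the loop body `iterations` times and return the final counter.
fn run(body: &[Stmt], iterations: u32) -> u32 {
    let mut counter = 0;
    for _ in 0..iterations {
        for stmt in body {
            if *stmt == Stmt::Increment {
                counter += 1;
            }
        }
    }
    counter
}

/// The property a test or CI job could check: one increment per iteration.
fn counter_matches_iterations(body: &[Stmt]) -> bool {
    run(body, 10) == 10
}

fn main() {
    use Stmt::{Increment, Work};
    let with_aa = [Work, Increment, Work]; // increment in the middle
    let with_bb = [Work, Work, Increment]; // increment at the end
    let merged = [Work, Increment, Work, Increment]; // both patches applied
    println!("AA alone passes: {}", counter_matches_iterations(&with_aa));
    println!("BB alone passes: {}", counter_matches_iterations(&with_bb));
    println!("merge passes: {}", counter_matches_iterations(&merged));
}
```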

On the other side of things, if you were using a dependently typed programming language, you could state the property “this counter gets incremented only once per step”, and something like an extension to Pijul that could query your compiler could say: “Hey, I know this merge looks fine, but it actually breaks this property”. But then again, if you were using such a language, you could introduce other kinds of bugs of a kind inexpressible in the language itself. So I don’t really know.

Tests for the win! Fight it with tests :slight_smile:

I’ve run into this very interesting reddit comment: https://www.reddit.com/r/rust/comments/8dx08d/new_release_of_pijul_010_more_stable_than_ever/dxsuabt

It seems to contain interesting benchmarks.

This is exactly my issue; I reported it on this forum a while back (title “bad performance”). If you read through it, you’ll find two Python tests that deal with the issue. They show:

  1. the .pijul folder quickly becomes ENORMOUS, while .git’s remains very compact
  2. pijul quickly becomes VERY slow, to the point where it freezes the computer by sucking up all the RAM it can find.

pmeunier said he expects the issue to be strongly mitigated once we move to Myers diff.

I wasn’t aware of this discussion, thanks for pointing it out!

I wonder whether switching to a better diff algorithm will tackle both 1. and 2., or only 2.

I don’t know either; it’s a bit too low-level for my skills.

I can obviously only answer from a theoretical point of view, since I haven’t implemented the fix to diff. I would definitely expect Pijul to take up more disk space than Git, although I don’t know by how much. One issue is that there is some redundancy in the storage format, which we could get rid of in the future. Some work on the backend could also help. I chose to use copy-on-write B-trees with skip lists in the nodes, which I believe to be optimal in terms of speed. But maybe there’s a better trade-off to be found!

Another problem is that line deletions tend to rewrite the database a lot, because of a naive implementation of something in the apply algorithm.

Since the only benchmark I have about Sanakirja is in its own tests, I cannot tell for sure that Sanakirja doesn’t leak, especially when it is used in Pijul (Pijul has some unsafe code). I should definitely try to write some more benchmarks in Pijul, by exposing the leak detection functions in the Sanakirja API.

@flobec, could you give me moderation privilege for the #pijul channel on IRC?

The poor thing is regularly flooded by spammers, and we need to do something about it if we want to have a proper discussion medium again. It also gives a poor image of the project to those who want to talk with us and only find… strange messages, to say the least.

I think you have the privilege now, but you should check.

It worked, thanks! I set the channel mode to +r, as it looks like a recommended mode to deal with spam.

@pmeunier @lthms: You might wish to consider switching the serialization format to bencode, and more specifically, bencode using the following EXCEEDINGLY excellent Rust library, which runs circles around pretty much all other implementations, for reasons you can read at the link below: