Pijul

Blake3 for the new hash function

Blake3 was mentioned in another thread, but I wanted to make a new thread.

I think that Blake3 would be an ideal choice for Pijul’s hash function:

  • Blake3 has a fist-class rust implementation, maintained by the authors of the hash function.
  • The hashes are 256 bits, so it’s significantly shorter than 512 bit SHA2. There was some mention of a possible security tradeoff going to 256 bits, but I think that 256 bit hashes are currently considered to be secure, for now and the indefinite future.
  • Blake3 is very, very fast. It both because it’s a merkle tree hash, which means that it’s highly amenable to parallelism, and because it uses Blake2s, which is a fast primitive.
  • Blake3 includes a number of useful modes, in addition to the main hash function, it supports a keyed hash, and a key derivation operation. These might be useful for collision resistant hashing, or key derivation for signature operations.

Blake3 is binary merkle tree hash with 16KiB leaf nodes, which enables a lot of neat possibilities for Pijul:

  • Corruption during transmission or storage could be identified and repaired at the block level, instead of having to throw away the entire object.
  • Large objects could be synced from multiple locations, by requesting disjoint sets of blocks.

Blake3 is derived from a hash called Bao, which was written by Jack O’Connor, and then later turned into Blake3. Jack is active on Twitter and a few other places. I’m sure he would be interested in seeing Blake3 adopted, and would likely be able to explain any technical points relating to Pijul adopting Blake3.

Thanks! Blake3 is already the hash in the next version of Pijul. There is also a new idea to make “commutative hashes”, in the sense that h(A, B) = h(B, A), but it’s hard to get either A or B from h(A, B), and hard to get B (resp A) from h(A, B) and A (resp B).

Sweet, that’s awesome!

What would commutative hashes enable? I’m guessing that h(xor(h(A), h(B))) doesn’t work for other reasons?

Can you explain what you mean by “and hard to get B (resp A) from h(A, B) and A (resp B).”? I’m not familiar with the phrase “(resp X)”.

Ah, I think I get it, hard to get B from h(A, B) and A, and hard to get A from h(A, B) and B.

That’s right.

Right. I forgot to add that I also want to compute h(A, B) from (h(A), B) or from (h(B), A). This property means that:

  • a state in Pijul can be described by a hash h(…), and the next state after applying patch B can be computed from h(…) and B.
  • Starting from state s, if Alice applies A then B (let’s write the state h(h(s, A), B)), she gets the same hash as Bob, who applied B then A (and hence computed h(h(s, B), A)).
1 Like

Hmm, I could see how that would be nice. It mirror’s the patch-based repository model, but for hashes.

I’m assuming that this is desired for efficient updates of the repository’s root hash, which multiple parties can agree on? I can imagine somewhat inefficient schemes for calculating a root hash (which don’t satisfy the properties you gave), like sorting the hashes of the patches in the repo, and taking a merkle tree hash over those hashes as leaves. Is efficiency the only consideration?

I’m not sure I understand why the properties you mentioned would be important for this. If you can get A or B (patch hashes) out of h(A, B) (the repo root hash) that doesn’t seem like a big deal.

I’m not very familiar with them, but this makes me think of cryptographic accumulators. There are a few types, the ones I know about hash-based, like utreexo and one based on prime numbers, which I can’t find a good link for.

I was thinking about this, and it’s not obvious to me why you couldn’t sum the patch hashes modulo 2**256 and use that as the state hash. (You could do xor, but that has odd properties like two patches xoring to zero.) Is there a practical reason why that wouldn’t work?

Two patches XORing to 0 is not really an issue: for a hash /h/, there’s only one hash /h’/ such that /h+h’ = 0/, and you have the exact same property in your other proposed scheme.

The problem with XOR is that you can use linear algebra to generate collisions, by generating a list of patches and solving the corresponding linear equation.

For the integer sum modulo 2^256, finding collisions amounts to solving the subset sum problem, which is known to be a weak cryptographic problem (https://en.wikipedia.org/wiki/Merkle–Hellman_knapsack_cryptosystem).

So, for both systems, a server can trick a client into believing they’re synchronised, even though they’re not. Even worse, the server can inject a patch in the log, and add a constant number of patches to recover the same state identifier (the same state “hash”).

I’m not sure if I see what the attack would be.

Let’s say a server wants a client to accept a bad patch, P. The client asks for some state S. The server creates a log that includes P, and also some number of additional hashes H1…Hn to make the log’s state be S. However, if the client requests the patches for H1…Hn, the server can’t actually return them to the client, because to do so the server would have to calculate pre-images that hash to H1…Hn, which is infeasible. So as long as the client actually verifies that all patches in the log exist, they can’t be tricked with an erroneous state.

That’s not how it works. If the state is a linear function of the patch hashes, it doesn’t matter what exact hashes you have, as long as you can generate enough “independent” vectors (hashes) to form a linear base. You need as many independent vectors as there are bits, here 256 vectors. Then, for example with XOR, you can generate any hash you want by taking the subset of these hashes given by the resolution of the linear equation (h_{i_0} + … + h_{i_k} = h).

Good point, I didn’t think of that!