Generalized birthday bound attack on version identifiers

I pointed out the generalized birthday bound attack on version identifiers back in 2022. The pijul manual seems to say that version identifiers are computed by exponentiation. I’ve no idea whether this has changed since my 2022 comment, but..

Exponentiations really are vulnerable to generalized birthday bound attacks, since the adversary knows their own hashes, and the discrete log problem is trivial in the exponent, which is exactly where your hashes land. In concrete terms, an adversary would compute the hashes for many buckets of possible patches, then identify some version identifier product which they could obtain through many different selections of one patch from each bucket.

We discussed two defense mechanisms before:

  • Cryptographic accumulators remain extremely expensive, but would prove that a bare identifier consists of accumulated hashes, without any patches being subtracted.

  • Addition of hashes-to-elliptic-curves avoids generalized birthday bound attacks nicely, while being extremely cheap, far cheaper than exponentiations or scalar-point multiplication.

Addition of hashes-to-elliptic-curves is not an accumulator, but afaik you’ll only care when faced with shallow clones taken from adversarial repositories. Addition of hashes-to-elliptic-curves should suffice (a) if shallow clones already involve some trust element anyways, (b) if shallow clones make no sense anyways, or maybe (c) depending upon what patch state shallow clones do contain.

Anyways..

Assuming addition of hashes-to-elliptic-curves suffices, I’d expect RistrettoPoint::from_hash and the ops::Add impl on RistrettoPoint in curve25519_dalek to yield about the fastest implementation, since few elliptic curves have such a nice base field.

Also, there is no real change in how you hash patches per se, only in how you compute the version identifiers from the set of patches.
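To make that interface concrete, here is a minimal sketch. A toy commutative group (addition mod a prime) stands in for elliptic-curve point addition; the toy group is NOT discrete-log hard and all names are hypothetical, it only shows that the patch hashes stay as they are and the identifier depends on the *set* of patches, not their order:

```python
# Interface sketch: patch hashing is unchanged; only the combiner changes.
# WARNING: addition mod P below is a placeholder for point addition on a
# real curve; it is not secure, it only illustrates order-independence.
import hashlib
import random

P = 2**255 - 19  # modulus borrowed from curve25519's field, purely cosmetic

def patch_hash(patch: bytes) -> int:
    # the per-patch hash, exactly as before
    return int.from_bytes(hashlib.sha256(patch).digest(), "big")

def version_identifier(patches) -> int:
    # commutative combine: any ordering of the same set gives the same id
    vid = 0
    for p in patches:
        vid = (vid + patch_hash(p)) % P
    return vid

patches = [b"patch-%d" % i for i in range(10)]
shuffled = patches[:]
random.shuffle(shuffled)
print(version_identifier(patches) == version_identifier(shuffled))
```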

Is the attack a spoof of a repo with malware? Or something else?
Do other VCSs have similar problems?

It’s collisions in the version identifier: An adversary could create multiple disjoint patch sets that yield the same version identifier.

The adversary creates k buckets B_i, each containing thousands of unrelated patches. Their attack algorithm identifies several distinct mappings f_j such that f_j(i) lies in B_i, but all the f_j yield the same combined product, aka H(f_j(0)) * H(f_j(1)) * … * H(f_j(k-1)) is the same for each j. This means each pair of f_j defines a colliding pair of patch sets.
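A toy demonstration of the bucket idea, with a deliberately tiny group so a collision appears instantly. This is plain birthday search over k = 2 buckets, not Wagner’s full k-tree algorithm, and all names and parameters here are illustrative:

```python
# Toy collision in a version-identifier product: two different selections
# of one patch per bucket yield the same combined value mod a tiny prime.
import hashlib

P = 10_007  # deliberately tiny prime so a collision is guaranteed

def toy_hash(patch: bytes) -> int:
    # stand-in for a patch hash, mapped into the multiplicative group mod P
    return int.from_bytes(hashlib.sha256(patch).digest(), "big") % (P - 1) + 1

def find_colliding_selections(bucket0, bucket1):
    # birthday search: index every product H(a) * H(b) mod P and report
    # two different selections with the same combined identifier
    seen = {}
    for a in bucket0:
        for b in bucket1:
            vid = (toy_hash(a) * toy_hash(b)) % P
            if vid in seen:
                return seen[vid], (a, b), vid
            seen[vid] = (a, b)
    return None

bucket0 = [b"good-patch-%d" % i for i in range(150)]
bucket1 = [b"other-patch-%d" % i for i in range(150)]
collision = find_colliding_selections(bucket0, bucket1)
print(collision)
```

With 150 × 150 = 22500 products landing in at most 10007 values, a collision is forced by pigeonhole; real groups are vastly larger, which is why the attacker needs many buckets and the k-th-root advantage.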

All the patches in B_i are good for i > 0, but half the patches in B_0 are malicious, so many of these f_j use all good patches, while many others contain exactly one malicious patch.

They submit one honest patch set that gets merged, after which they convince others that the dishonest patch set was merged, perhaps by signing or publicly posting the version identifier.

If you use David Wagner’s original attack, then you need roughly k=128 buckets, aka 128 patches in both the good and bad patch sets.
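For intuition on where k = 128 comes from: Wagner’s k-tree attack on an n-bit group uses lists of size about 2^(n/(1 + log2 k)), so assuming a 256-bit exponent (an assumption on my part, not something the manual states), k = 128 brings the per-list work down to roughly 2^32:

```python
# Standard list-size exponent for Wagner's k-tree attack,
# assuming (hypothetically) an n-bit exponent group.
import math

def wagner_list_size_bits(n_bits: int, k: int) -> float:
    # each of the k lists has about 2^(n / (1 + log2 k)) entries
    return n_bits / (1 + math.log2(k))

print(wagner_list_size_bits(256, 128))  # 32.0, i.e. ~2^32 work per list
print(wagner_list_size_bits(256, 16))   # 51.2
```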

There is a faster attack that needs only k=16 buckets, but comes with more constraints, which I’ve now forgotten.

Do other VCSs have similar problems?

Very unlikely. Afaik, only a patch based DVCS would employ some commutative algebraic version identifier, and other patch based systems like darcs might never have thought of doing so.

As I said, if you replace the version identifier by the addition of hashes-to-elliptic-curves, then your performance should improve dramatically over whatever exponentiation you’re doing now.

It’s just like the birthday paradox, where if you have 23 people in a room, then odds are 2 share a birthday, but instead of a square root, the attacker benefits from the k-th root.
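The classic bound is easy to check numerically:

```python
# Probability that at least two of n people share a birthday,
# computed exactly from the product of distinct-birthday odds.
def birthday_collision_prob(n: int, days: int = 365) -> float:
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

print(birthday_collision_prob(23))  # just over 0.5
print(birthday_collision_prob(22))  # just under 0.5
```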

We stop the attack by working in some group that has a hard discrete log problem, like a secure elliptic curve. We’ve no secrets here, so how we create the group elements matters: If we create curve points using scalar multiplication, or create large finite field elements using exponentiation like your manual describes, then the attacker can run the attack in the scalar field or in the exponent. If otoh we hash-to-curve, then nobody knows the scalars relating those group elements, so the attacker must now work in the elliptic curve itself, but if they had an algorithm that worked there, then they’d break the discrete log problem itself, so that’s impossible.

This is great for you because hash-to-curve is by far the fastest way to create a fresh elliptic curve point anyways. A “slow” hash-to-curve works by rejection sampling the x coordinate and solving the curve equation to see if some y coordinate exists. Each attempt succeeds half the time on a fast curve, so this is already a couple hundred times faster than a scalar multiplication, which itself should be 10x faster than your exponentiation. RistrettoPoint::from_hash should be somewhat faster still.
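A sketch of that “slow” rejection-sampling variant on a toy curve. The prime and curve coefficients here are illustrative only; a real implementation would use RistrettoPoint::from_hash or another standardized hash-to-curve:

```python
# "Slow" hash-to-curve by rejection sampling, on a toy short-Weierstrass
# curve y^2 = x^3 + A*x + B over F_P. Toy parameters, for illustration only.
import hashlib

P = 1_000_003   # toy prime with P % 4 == 3, so square roots are one pow()
A, B = 0, 7     # toy curve coefficients

def hash_to_curve(msg: bytes):
    # derive candidate x coordinates from the hash until x^3 + A*x + B
    # is a square mod P; each attempt succeeds about half the time
    ctr = 0
    while True:
        digest = hashlib.sha256(msg + ctr.to_bytes(4, "big")).digest()
        x = int.from_bytes(digest, "big") % P
        rhs = (x * x * x + A * x + B) % P
        if pow(rhs, (P - 1) // 2, P) <= 1:  # Euler's criterion: 0 or 1 => square
            y = pow(rhs, (P + 1) // 4, P)   # square root, valid since P % 4 == 3
            return (x, y)
        ctr += 1

point = hash_to_curve(b"example patch")
print(point)
```

The returned point always satisfies the curve equation, and no scalar relating it to any other hashed point is ever known to anyone, which is exactly what defeats the exponent-space attack.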

Also an unrelated issue..

Your shallow clones might already be vulnerable to some unrelated subtracted-patch attack. This fix would not change the status of that flavor of possible attack on shallow clones.

An accumulator would fix this possible attack, but they are extremely expensive. It’s likely cheaper to fix in other ways, like by having shallow clones do some validation that patches provided leave no “holes”, or going back to some signed check point of known good versions. Accumulators would provide no benefits except for shallow clones.
