All programming languages and human-readable text-based formats that I’m aware of use characters and substrings as tokens, not lines. Source control systems, however, are typically built around the arbitrary newline separator. They do this for performance: an algorithm whose cost scales with the number of lines is almost always faster than one whose cost scales with the number of characters, since a file has at most as many lines as characters (the two only coincide when the file is nothing but newlines).
From skimming the pijul and sanakirja source code, it looks like patches deal in LineIds. However, patches must initially be generated from flat text files, which means scanning for newlines character by character. Only afterwards can pijul and sanakirja get the performance benefits of working with LineIds alone.
I propose running these algorithms directly over a compressed repository format, without any decompression. This is known as compressed pattern matching (CPM), though the idea extends to stringology algorithms in general.
Compressed pattern matching saves space, but more interestingly it can speed up algorithms to run in time sub-linear in the length of the uncompressed text.
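As a toy illustration of the idea (not what pijul would actually use, and all names here are mine), here is substring search over a run-length-encoded text that never expands the runs, so its cost scales with the number of runs rather than with the uncompressed length:

```python
def rle(s):
    """Run-length encode a string into a list of (char, count) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def find_in_rle(runs, pattern):
    """Return the uncompressed start index of the first occurrence of
    pattern, examining only the runs, or -1 if it does not occur."""
    p = rle(pattern)
    if not p:
        return 0
    # Uncompressed offset at which each run starts.
    offsets, pos = [], 0
    for _, n in runs:
        offsets.append(pos)
        pos += n
    if len(p) == 1:
        # Pattern is a single run: any long-enough run of that char matches.
        ch, n = p[0]
        for i, (rch, rn) in enumerate(runs):
            if rch == ch and rn >= n:
                return offsets[i]
        return -1
    # Pattern spans several runs: the first and last pattern runs may
    # match partially, the middle runs must match exactly.
    for i in range(len(runs) - len(p) + 1):
        first_ok = runs[i][0] == p[0][0] and runs[i][1] >= p[0][1]
        last_ok = (runs[i + len(p) - 1][0] == p[-1][0]
                   and runs[i + len(p) - 1][1] >= p[-1][1])
        middle_ok = runs[i + 1:i + len(p) - 1] == p[1:-1]
        if first_ok and middle_ok and last_ok:
            return offsets[i] + runs[i][1] - p[0][1]
    return -1
```

For example, searching for "abbbc" in the encoding of "aaabbbcccbbb" touches only four runs instead of twelve characters; real CPM schemes get the same effect over stronger compressors than RLE.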
Lately I’ve been reading about data compression using anti-dictionaries. Anti-dictionary compression works by building a dictionary of words that do not occur in the source text.
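A minimal sketch of the idea, assuming a binary alphabet (the function names and this brute-force construction are mine, for illustration only): first collect the minimal forbidden words, i.e. words absent from the text although all their proper factors occur, then drop every bit whose value the anti-dictionary already forces.

```python
def factors(s, max_len):
    """All substrings of s of length up to max_len, plus the empty word."""
    out = {""}
    for i in range(len(s)):
        for j in range(i + 1, min(i + max_len, len(s)) + 1):
            out.add(s[i:j])
    return out

def minimal_forbidden_words(s, alphabet, max_len):
    """Words a+u+b that never occur in s, although a+u and u+b both do."""
    occurring = factors(s, max_len)
    mfw = set()
    for u in factors(s, max_len - 2):
        for a in alphabet:
            for b in alphabet:
                w = a + u + b
                if w not in occurring and a + u in occurring and u + b in occurring:
                    mfw.add(w)
    return mfw

def dca_compress(s, mfw):
    """Keep only the unpredictable bits: if the text read so far ends
    with w[:-1] for some forbidden word w, the next bit cannot be
    w[-1], so over a binary alphabet it is determined and is dropped."""
    kept = []
    for i, bit in enumerate(s):
        if not any(s[:i].endswith(w[:-1]) for w in mfw):
            kept.append(bit)
    return "".join(kept)
```

Running this on "0101" with max_len=3 yields the anti-dictionary {"00", "11"} (adjacent bits are never equal), and the compressor keeps only the first bit, since every later bit is forced by its predecessor. A decompressor holding the same anti-dictionary and the original length can reconstruct the text.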
Anti-dictionary compression is particularly well suited to CPM.
Anti-dictionaries may or may not be the best approach, but I have a feeling they would be a good fit for a version control system, given that they are an example of contextual compression.
If pijul were to use compressed pattern matching rather than LineIds, I think it could be faster in most cases while supporting more user-friendly, semantically relevant substring patches (and maybe even binary patches!).
I figured I would bring this up publicly before I (or anyone else) dive in and start working on this, because it would require yet another format change.