Fast characterwise (and binary?) patches

All programming languages and human-readable text-based formats that I’m aware of use characters and substrings as tokens, not lines. Source control systems, however, are typically built around the arbitrary newline separator. They do this for performance reasons: an algorithm whose cost depends on the number of lines is almost always faster than one whose cost depends on the number of characters (unless your file only contains newline characters).

From skimming the pijul and sanakirja source code, it looks like patches deal in LineIds. However, patches must initially be generated from flat text files, which requires scanning for newlines character by character. Only afterwards can pijul/sanakirja get the performance benefits of working with LineIds alone.

I propose running the algorithms over a compressed repository format directly, without any decompression. This is known as compressed pattern matching (CPM), but the approach is useful for stringology algorithms in general.

Compressed pattern matching saves space, but more interestingly it speeds algorithms up to running times sub-linear in the length of the uncompressed text.
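As a toy illustration of the idea (nothing pijul-specific; all names here are made up), here’s a sketch of counting pattern occurrences directly on a run-length-encoded text. The search walks the runs, never the decompressed characters, so its cost depends on the number of runs:

```python
def rle(s):
    """Run-length encode a string into (char, count) tuples."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

def count_matches_rle(text_runs, pattern):
    """Count (possibly overlapping) occurrences of pattern in the
    text represented by text_runs, without decompressing it."""
    p = rle(pattern)
    if not p:
        return 0
    total = 0
    if len(p) == 1:
        # Single-run pattern, e.g. "aa": each text run of the same
        # character with length n >= k contains n - k + 1 occurrences.
        ch, k = p[0]
        for c, n in text_runs:
            if c == ch and n >= k:
                total += n - k + 1
        return total
    # Multi-run pattern: the first pattern run must be a suffix of a
    # text run, the last a prefix, and the interior runs must match
    # exactly. Each such alignment yields exactly one occurrence.
    m = len(p)
    for i in range(len(text_runs) - m + 1):
        window = text_runs[i:i + m]
        first_ok = window[0][0] == p[0][0] and window[0][1] >= p[0][1]
        last_ok = window[-1][0] == p[-1][0] and window[-1][1] >= p[-1][1]
        mid_ok = all(window[j] == p[j] for j in range(1, m - 1))
        if first_ok and mid_ok and last_ok:
            total += 1
    return total
```

Real CPM schemes do this over far stronger compressors than RLE, but the principle is the same: work on codewords, not characters.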

Lately I’ve been reading about data compression using anti-dictionaries. Anti-dictionary compression works by building a dictionary of what is not in the source text.

Text Compression Using Antidictionaries
DCA (Data Compression using Antidictionaries)
Data Compression Using Antidictionaries
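For the curious: the antidictionary of a text is its set of minimal forbidden words, i.e. words that never occur in the text but whose every proper factor does. A brute-force sketch (hypothetical function names, small alphabet, and a length bound for simplicity):

```python
def factors(s, max_len):
    """All substrings of s up to max_len characters."""
    return {s[i:j] for i in range(len(s))
            for j in range(i + 1, min(len(s), i + max_len) + 1)}

def antidictionary(text, alphabet="ab", max_len=4):
    """Minimal forbidden words: w = a + u + b that is absent from
    the text, while both a + u and u + b do occur in it."""
    fs = factors(text, max_len)
    forbidden = set()
    for u in fs | {""}:
        for a in alphabet:
            for b in alphabet:
                w = a + u + b
                if (len(w) <= max_len and w not in fs
                        and a + u in fs and u + b in fs):
                    forbidden.add(w)
    return forbidden
```

The compressor then exploits these: whenever the text read so far is one character short of a forbidden word, the next character is fully determined and need not be stored.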

Anti-dictionary compression is particularly well suited to CPM.

Pattern Matching in Text Compressed by Using Antidictionaries

Antidictionaries may or may not be the best approach, but I have a feeling they would be a good fit for a version control system, given that they are an example of contextual compression.

If pijul were to use compressed pattern matching rather than LineIds, I think it could in most cases be faster, while supporting more user-friendly and semantically relevant substring patches (and maybe even binary patches!).

I figured I would bring this up publicly before I (or anyone else) dive in and start working on this, because it would require yet another format change.


So, actually, LineId is a misnomer: they are not really lines, they’re arbitrary chunks. Pijul patches can already be characterwise.

The only property you need to use Pijul is that files are totally ordered sets of chunks, where a chunk is a contiguous array of bytes.
I don’t know how to write patches for XML or programming languages.

If you want to work on word patches or sentence patches, go ahead! I know a number of people (including myself) who would probably be interested in merging paragraphs in LaTeX documents.

The only thing you need to change in Pijul is the diff algorithm, and I’m pretty sure there’s not much to change (the '\n' character is used quite explicitly).
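To illustrate how little the diff side needs to change, here’s a minimal sketch of swapping the tokenizer, using Python’s difflib rather than Pijul’s actual diff code; it splits on word boundaries instead of '\n':

```python
import re
from difflib import SequenceMatcher

def word_diff(old, new):
    """Diff two texts at word granularity instead of line granularity.

    Tokenizing into alternating word/whitespace runs (rather than
    splitting on '\n') is the only change versus a classic line diff;
    the matching algorithm itself is untouched.
    """
    a = re.findall(r"\S+|\s+", old)
    b = re.findall(r"\S+|\s+", new)
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":
            ops.append((tag, "".join(a[i1:i2]), "".join(b[j1:j2])))
    return ops
```

For example, changing one word in a sentence yields a single word-sized hunk rather than a whole replaced line.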

I think @pointfree’s point is quite different from just using a different diff algorithm. As far as I understand the proposal, he wants to run the diff algorithm not on raw data, but rather on a compressed representation of the data.

Am I right?

Indeed, @lthms. A CPM repo format would not use any hardcoded delimiter, and my point was as much about performance as about user-friendliness. Patches with characterwise (or binary) granularity would perform well with CPM. Because CPM references codewords, a CPM repo format would in most cases reference much larger chunks than individual lines, so it would also perform much better than a line-based one.