Fast characterwise (and binary?) patches

pointfree · June 2, 2017, 8:56pm

All programming languages and human-readable text-based formats that I’m aware use characters and substrings as tokens, not lines. However, source control systems are typically based around the arbitrary newline separator. They do this for performance reasons: performance dependent on the number of lines is always faster than performance dependent on the number of characters (unless your file only contains newline characters)

From skimming over the pijul and sanakirja source code, it looks like patches deal with LineId’s. However, patches must initially be generated from flat text files. This would need to involve scanning for newlines, characterwise. Only afterwards can pijul/sanakirja get the performance improvements of just using LineId’s.

I propose running the algorithms over a compressed repository format directly without any decompression. This is known as compressed pattern matching, but it’s good for any stringology algorithms.

Compressed pattern matching saves space, but more interestingly it speeds up algorithms to times sub-linear to the length of the uncompressed text.

Lately I’ve been reading about data compression using anti-dictionaries. Anti-dictionary compression works by building a dictionary of what is not in the source text.

Text Compression Using Antidictionaries
DCA (Data Compression using Antidictionaries)
Data Compression Using Antidictionaries

Anti-dictionary compression is particularly well suited to CPM.

Pattern Matching in Text Compressed by Using Antidictionaries

Antidictionaries may or may not be the best way, but I have a feeling that they would be a good choice for a version control system given that they are an example of contextual compression.

If pijul were to use compressed pattern matching rather than LineId’s, I think it could in most cases be faster while supporting more user-friendly and semantically relevant substring patches (and maybe even binary patches!)

I figured I would bring this up publicly before me (or anyone else) dives in and starts working on this because it would require yet another a format change.

Topic		Replies	Views
Bad performance Development	16	2796	December 11, 2018
Is having `replace` a sensible idea? Development	4	876	May 5, 2021
How to combine patches? Question	7	941	February 12, 2020
Random Talks and Thoughts Development	19	2491	December 24, 2019
AST-level diffs and merges Development	19	7007	May 22, 2024

Fast characterwise (and binary?) patches

Related topics