Fast characterwise (and binary?) patches

All programming languages and human-readable text-based formats that I’m aware of use characters and substrings as tokens, not lines. However, source control systems are typically built around the arbitrary newline separator. They do this for performance reasons: an algorithm whose cost depends on the number of lines is almost always faster than one whose cost depends on the number of characters (unless your file contains nothing but newline characters).

From skimming the pijul and sanakirja source code, it looks like patches deal with LineIds. However, patches must initially be generated from flat text files, which involves scanning for newlines character by character. Only afterwards can pijul/sanakirja get the performance benefits of working with LineIds alone.

I propose running the algorithms over a compressed repository format directly, without any decompression. This is known as compressed pattern matching (CPM), and the approach applies to stringology algorithms in general.

Compressed pattern matching saves space, but more interestingly it speeds up algorithms to running times sub-linear in the length of the uncompressed text.

Lately I’ve been reading about data compression using anti-dictionaries. Anti-dictionary compression works by building a dictionary of what is not in the source text.

  • Text Compression Using Antidictionaries
  • DCA (Data Compression using Antidictionaries)
  • Data Compression Using Antidictionaries
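To make the idea concrete, here is a self-contained sketch (my own illustration, not code from any of the papers above) of how an antidictionary is built: the minimal forbidden words of a text, i.e. words absent from the text whose longest proper prefix and suffix both occur in it. A DCA-style compressor exploits these to predict forced symbols.

```rust
use std::collections::HashSet;

// Collect every substring of `text` of length 1..=k.
fn substrings_up_to(text: &[u8], k: usize) -> HashSet<Vec<u8>> {
    let mut set = HashSet::new();
    for i in 0..text.len() {
        for len in 1..=k.min(text.len() - i) {
            set.insert(text[i..i + len].to_vec());
        }
    }
    set
}

// A word w is minimal forbidden iff w does not occur in the text, while
// both w without its last symbol and w without its first symbol do occur.
fn minimal_forbidden_words(text: &[u8], alphabet: &[u8], k: usize) -> Vec<Vec<u8>> {
    let present = substrings_up_to(text, k);
    let mut result = Vec::new();
    for len in 2..=k {
        for i in 0..text.len() {
            if i + len - 1 > text.len() {
                break;
            }
            // Candidate = an occurring factor of length len-1, extended by one symbol.
            let prefix = &text[i..i + len - 1];
            for &c in alphabet {
                let mut w = prefix.to_vec();
                w.push(c);
                if !present.contains(&w) && present.contains(&w[1..].to_vec()) {
                    if !result.contains(&w) {
                        result.push(w);
                    }
                }
            }
        }
    }
    result
}

fn main() {
    // For the binary text 11010001, the minimal forbidden words of length <= 3
    // are 111 and 011: neither occurs, but 11 and 01 do.
    for w in minimal_forbidden_words(b"11010001", b"01", 3) {
        println!("{}", String::from_utf8_lossy(&w));
    }
}
```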

Anti-dictionary compression is particularly well suited to CPM.

Pattern Matching in Text Compressed by Using Antidictionaries

Antidictionaries may or may not be the best way, but I have a feeling that they would be a good choice for a version control system given that they are an example of contextual compression.

If pijul were to use compressed pattern matching rather than LineIds, I think it could be faster in most cases while supporting more user-friendly and semantically relevant substring patches (and maybe even binary patches!).

I figured I would bring this up publicly before I (or anyone else) dive in and start working on this, because it would require yet another format change.


So, actually LineId is a misnomer: they are not really lines, they’re arbitrary chunks. Pijul can already do characterwise patches.

The only property you need to use Pijul is that files are totally ordered sets of chunks, where a chunk is a contiguous array of bytes.
I don’t know how to write patches for XML or programming languages.

If you want to work on word patches or sentence patches, go ahead! I know a number of people (including myself) who would probably be interested in merging paragraphs in LaTeX documents.

The only thing you need to change in Pijul is the diff algorithm, and I’m pretty sure there’s not much to change (the '\n' character is used quite explicitly).
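A minimal sketch of what that change amounts to (my own illustration with assumed function names, not Libpijul’s actual API): the chunking step of the diff splits on '\n' today, and a word-level diff would simply split on whitespace instead, leaving the totally-ordered-chunks machinery untouched.

```rust
// Current behaviour: one chunk per line, keeping the '\n' delimiter.
fn line_chunks(text: &str) -> Vec<&str> {
    text.split_inclusive('\n').collect()
}

// Hypothetical word-level variant: one chunk per whitespace-separated word.
fn word_chunks(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    println!("{:?}", line_chunks("a\nb\n"));
    println!("{:?}", word_chunks("merge  two words"));
}
```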

I think @pointfree’s point is quite different from just using a different diff algorithm. As far as I understand his proposal, he wants to run the diff algorithm not on raw data, but rather on a compressed representation of the data.

Am I right?

Indeed, @lthms. A CPM repo format would not use any hardcoded delimiter, and my point was as much about performance as about user-friendliness. Patches with characterwise (or binary) granularity would perform well with CPM. Because CPM references codewords, which in most cases cover much larger chunks than individual lines, a CPM repo format would also perform much better than a line-based one.

I just found this old discussion still open deep in my tabs… but it looks like a lot of progress has happened since I last looked! Crossposting for anyone who ends up here via Google or old links.

  • We now have an efficient implementation of binary diffs, based on a rolling checksum (like rsync) and my diffs crate on the result. This isn’t yet integrated into Pijul, but will soon be.
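For readers unfamiliar with rsync-style rolling checksums, here is a minimal sketch (my own illustration, not the actual code of the diffs crate): two 16-bit sums over a window of bytes that can be slid along the file one byte at a time in constant time, which is what makes block matching over a whole file linear-time.

```rust
// An rsync-style rolling checksum: a = sum of bytes in the window,
// b = position-weighted sum; both kept modulo 2^16.
struct Rolling {
    a: u32,
    b: u32,
    len: u32,
}

impl Rolling {
    fn new(window: &[u8]) -> Self {
        let (mut a, mut b) = (0u32, 0u32);
        for (i, &x) in window.iter().enumerate() {
            a = (a + x as u32) & 0xffff;
            // First byte gets weight len, last byte gets weight 1.
            b = (b + (window.len() - i) as u32 * x as u32) & 0xffff;
        }
        Rolling { a, b, len: window.len() as u32 }
    }

    // Slide the window one byte: drop `out`, append `inc`. O(1) per step.
    fn roll(&mut self, out: u8, inc: u8) {
        self.a = self.a.wrapping_sub(out as u32).wrapping_add(inc as u32) & 0xffff;
        self.b = self
            .b
            .wrapping_sub(self.len * out as u32)
            .wrapping_add(self.a)
            & 0xffff;
    }

    fn digest(&self) -> u32 {
        (self.b << 16) | self.a
    }
}

fn main() {
    // Rolling by one byte gives the same digest as recomputing from scratch.
    let data = b"the quick brown fox";
    let mut r = Rolling::new(&data[0..8]);
    r.roll(data[0], data[8]);
    println!("{}", r.digest() == Rolling::new(&data[1..9]).digest());
}
```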

The trick introduced in the last few versions of Libpijul, and tested and debugged more carefully in version 1.0.0-alpha.45, is to use almost the same idea to compare the old version of a file with a new one. We don’t even need the more careful hashes, since we can compare the blocks directly, with the same complexity as computing a strong hash.
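A hedged sketch of the block-comparison idea described above (assumed names, not Libpijul’s implementation): index the old file’s blocks in a hash map, then confirm candidate matches in the new file by direct byte comparison instead of a strong hash — a byte-for-byte check of a block costs no more than hashing it.

```rust
use std::collections::HashMap;

// Find aligned blocks of `old` that reappear verbatim in `new`.
// Returns (offset in old, offset in new) pairs.
fn matching_blocks(old: &[u8], new: &[u8], block: usize) -> Vec<(usize, usize)> {
    // Index every aligned block of the old file by its contents; lookups in
    // this map already compare bytes directly on collision.
    let mut index: HashMap<&[u8], usize> = HashMap::new();
    for (i, chunk) in old.chunks_exact(block).enumerate() {
        index.entry(chunk).or_insert(i * block);
    }
    // Scan the new file at every offset; on a match, jump past the block.
    let mut matches = Vec::new();
    let mut j = 0;
    while j + block <= new.len() {
        if let Some(&off) = index.get(&new[j..j + block]) {
            matches.push((off, j));
            j += block;
        } else {
            j += 1;
        }
    }
    matches
}

fn main() {
    // "abcd" and "efgh" from the old file reappear at offsets 2 and 6.
    println!("{:?}", matching_blocks(b"abcdefgh", b"xxabcdefgh", 4));
}
```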

Very sorry, editing isn’t available for my last post, but to carry forward from the original Darcs3 wishlist, I opened a discussion on word-by-word diffs!

https://nest.pijul.com/pijul/pijul/discussions/778

also on Zulip

Where on Zulip did you open that discussion? Please link to a specific topic or message. 🙂

Oh noo, sorry I don’t know what happened and once again the edit button is gone. Should have been

https://pijul.zulipchat.com/#narrow/stream/270693-general/topic/Word-by-word.20.28instead.20of.20line-based.29.20diffs.20for.20Wiki-backing/near/388011992

Well, I can edit this message. Apparently Discourse for some reason rewrites the link if I don’t force it using markdown…
