AST-level diffs and merges

AST diff and merge algorithms allow to resolve conflicts which cannot be resolved by usual git. For example changes in the same row or moving blocks of code into other places or variables renames. It may be useful to store not text diffs, but AST ones. Of course with fallback to raw text or binary when the lang is unknown.

https://github.com/koczkatamas/onelang - is an abstract ast for all langs with specifying languages in yaml and transpiling the code on them to each other
and
https://github.com/GumTreeDiff/gumtree - ast diff tool

3 Likes

Thanks for the constructive comment! Could you provide an explanation of why you think the current Pijul could not handle this, and how that could work with our current algorithm?

1 Like

In fact I have not tried pijul yet, all the things are from my git experience, and since, I guess (am I right?) that pijul is also line-oriented (as git is), I thought that it may have the same troubles git has. But since pijul is not mature and stable like enough, it can introduce breaking changes and innovations, like AST-based diffs.

Pijul is not line-oriented (at least in the sense that Git is) - that’s the key difference. Although it is not at level of understanding different programming for conflict resolution, the internal file representation has an explicit modeling of conflicts (that you can then resolve).

I’d recommend reading:

it’s a good explanation of how conflicts are handled in Pijul.

1 Like

I have already read this, thisI guess won’t help much for automatic conflict resolution.
version a

def a(b, d):
  return b+d

Version b

def a(b, c):
  return b+c

version c

def a(a, d):
  return a+d

a->c
a->b
c->d
b->d

How should d look like? I guess

def a(a, c):
  return a+c

I guess neither pijul nor git can deal with it.

Another example is code style change without changing meaning of the code and then merging 2 branches one before style change another one is after.

Another use case is subtree moving. We move a subtree into another tree. An then someone edits that subtree unmoved and tries to merge, obviously this will lead to conflict. But if we trac subtrees we can apply the changes he made to the moved subtree.

It’s an interesting idea, but I think it’s out of scope for Pijul (in its current form at least).

I could see it fitting in as a hook feature of sorts. If Pijul would provide a “merge” hook, then the conflict resolution scheme could employ knowledge of the AST to come with smarter suggestions.

1 Like

Storing the code as an AST (and history as a spatiotemporal graphc) internally has some benefits such as lesser size and immediate answer on queries “which patches have affected this symbol” or “which patches affect contol flow touching that symbol”.

That’s really great, everybody knows that. We could also write diff algorithms for other structures than trees: why not arbitrary graphs? Sets? Continuous functions?

The main problem here is, unless you have a concrete proposal for how to adapt our theory to trees, we cannot implement this now.

3 Likes

@krixano, I really want to answer your post, because you’re asking a very good question.

I’m speaking only for myself, but I’m sure Florent agrees, as well as all contributors I’ve met: we absolutely want Pijul to be as open and welcoming as possible. Our main goal in writing Pijul (for which I also wrote Sanakirja and Thrussh, spending hundreds of hours on each), is to write a version control system as easy as possible, as welcoming and friendly as possible, both to use and to contribute, especially for beginners in computer science, and non-mathematicians.

However, saying “your internal representation should work like this” without knowing anything about our current internal representation, or all the design choices we made, is a little bit bizarre.

A more proper way to ask the same thing is “does Pijul store patches on ASTs? Why not.”

4 Likes

The way I interpreted the person’s argument was that they just wanted a way to make their example work correctly in a VCS - changes in units smaller than a line and bigger than characters.

I can now see how it might have come off as bizarre (and possibly rude-ish) asking for a change in an internal representation without knowing what the current internal representation is in the first place. I just more focused on what the person wanted a VCS to be able to do rather than how they are thinking of it being able to do that.

Maybe what you’d like is an instance of the theory behind Pijul, that works with ASTs.

Then, when you start a new repo, every file is handled, say, via Pijul, unless the user gives you a “translator” (which should be a paraer + AST builder) which allows you to understand and diff at the AST level while keeping all the cleanliness of the patch theory. Then, you could have (for file types that have an existing translator) a VCS like Pijul, but that also understands ASTs and can do smarter work using their information.

Beware, for this idea (a) I think should work, but I’m not certain, (b) it would definitely need the implementor to understand parsers and AST transformations, and © for performance reasons, it would probably need languages that are friendly enough in grammatical complexity, in order for them to be parsed quickly.

I like your idea. I hope you can make it work if you try to implement it ^^

1 Like

I don’t feel so sure about whether this would work very well with a normal AST. HOWEVER… I suspect this could work very well against something like, for example, Scala 3’s Typed Abstract Syntax Tree format, about which you can read here:

https://www.scala-lang.org/blog/2018/04/30/in-a-nutshell.html

Ruber duck debugging my thoughts on this, I would appreciate feedback on whether the following explainer is correct:

The problem with tracking higher level data structures directly (such as an AST) is that those data structures are ill-defined, will change over time, and that isn’t how programmers make changes to their code anyway.

Take file renames: Pijul has to manually store an identifier for each file, lest duplicate file names or moving a file results in a conflict. Pijul could do something similar by tracking the symbols present in text files via some sort of database. However, renaming a file involves a user submitting edits to a database (i.e. the filesystem) whereas Pijul would be left reconstructing the changes from freeform text edits. The only way to track micro-level semantic changes of source code would be to force users into a projectional editor … but we can only boil one ocean at a time ; ).

Pijul instead deals with the base case: byte level groupings of changes. This is an improvement to Git’s object based approach, as changes can be broken up across byte boundaries that correspond to changes at whatever abstraction level the diffing algorithm wishes to operate at.

Notably, the Language Server Protocol also works by communicating ranges and offsets, outsourcing the job of building an AST and maintaining a symbol database. Why? Because no one can agree upon a universal AST.

Beyond tracking byte-level changes, commit metadata can describe what change was performed. Files are universal enough to be baked into Pijul proper, but I wouldn’t even support LSP refactors directly in Pijul because how a refactor is carried out will differ depending on the language server and version. At most, I would have informal namespacing (and maybe version information) so that refactors can be replayed at another time. But the process of reconstructing the semantics behind the change is largely orthogonal to how they are written down.

And that’s why an AST level source tracking doesn’t make much sense: you can layer AST transforms atop the protocol, but you don’t want to hardcode an AST into the protocol.

Does that sum it up correctly?

1 Like

actually this post helped me to realize what project I want to build and I started to develop it, it is called a Lisperanto - a structure editor for AST-based Lisp-like programming language Lisperanto

also in a future I will add migration support for some other languages like C# and Java
and I will create version control system based on AST

1 Like

I don’t think, the VCS is the problem here.
You could probably do this in git as well using hooks.

Before doing a commit, every file of a specific programming language could be converted to a simple AST representation, preferably S-expressions, then you write one token per line, which are either opening brackets, closing brackets or names.
So it would also work with line based algorithms.

For example the example from before:

def a(b, d):
    return b+d

As S-expressions:

(def a (b d)
  (return (+ b d)))

As line based format:

(
def
a
(
b
d
)
(
return
(
+
b
d
)
)
)

The main benefit I see in this: Increasing the indentation of some code block will only add a few lines before and after the block.

But something like this will be more reasonable, when all programming languages can compile to some S-expression IR, which can be almost lossless converted back to the language.

Preferably, there won’t be full programming languages, but instead every programming language will be split into syntax and semantics, where syntax just means, how the language maps to a s-expression representation, and semantics means, how these s-expressions need to be interpreted.
So you will be able to write rust in python syntax, python in lisp syntax (pure s-expressions), and lisp using rust syntax, for example. But that’s kind of off topic.

Please also see my post from the zulip chat on this topic (which I repeat below) and why I still think AST change tracking doesn’t work for a VCS, because it is based on wrong assumptions. I am sorry if I violate any cross-posting rules and restrictions. Link to zulip chat: Pijul

Message content:

This is impossible to solve, generally. [This refers to tracking identifier renaming.] In programming languages with real macros you can programmatically create new identifiers which would necessitate pijul to actually run the macros to identify the real identifiers and track refactors. Additionally, tracking refactors is an arbitrary interpretation layer in pijul. Before that, pijul tracked changes of bytes (with one major interpretation layer already implemented, the notion of “lines of text”). Now, pijul is supposed to track changes of “meaning”, which is not well-defined. Small examples:

  1. You refactor frob to frobnicate and introduce a new location with identifier frob in the same change. Did you messup your refactoring and actually meant to have frobnicate in the new location? Or is frob (after refactor) a different identifier from frob (before refactor)?
  2. You now get two new changes A and B. Both do not (technically) depend on the refactor change. Both make use of the frob identifier. Suppose, that change A actually missed the refactor and frob (A) now is intended to mean frobnicate while, change B intends to use frob (after). How does pijul distinguish between those? It cannot, the same way a non-introduced human also could not distinguish between those two cases, because there is genuine information missing.

One could argue that pijul tries to catch the 90% of use cases but I find that this introduces an unobvious “cleverness” which will lead to painful hard to track “bugs”, where pijul should generate a conflict where it opted to do not.

Also, as I said earlier, this will lead to pijul simulating the code to find the identifiers, which is an arbitrarily unsafe (and arbitrarily costly) operation.

1 Like

Because no one can agree upon a universal AST.

We have treesitter: Tree-sitter|Playground.

I think AST-level source control is probably the future. Pijul or Git will work with treesitter ASTs. The issue is all the tooling is not AST aware. Diffing tools would need to “render” the AST etc.

1 Like