Yestin L. Harrison

Theory

As a first approximation, one can think of a repository as a single file, represented by a directed graph $G = (V, E)$ where each vertex $v \in V$ is a line of text, and an edge from $u \in V$ to $v \in V$, labelled with a change (also called patch) number $c$, can be read as "according to change $c$, line $u$ comes before $v$".
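As a rough sketch of this model (the type and field names below are illustrative, not Pijul's actual ones), the graph could be represented as:

```rust
use std::collections::BTreeMap;

// A vertex stands for a line of text, identified by the change that
// introduced it and a position within that change.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Vertex {
    change: u64, // stand-in for the hash of the introducing change
    pos: u64,    // position of the line within that change
}

// An edge from `u` to `v` labelled `c` reads: "according to change `c`,
// line `u` comes before line `v`".
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Edge {
    to: Vertex,
    change: u64,
}

#[derive(Default)]
struct Graph {
    out: BTreeMap<Vertex, Vec<Edge>>,
}

impl Graph {
    fn add_line(&mut self, v: Vertex) {
        self.out.entry(v).or_default();
    }
    fn add_order(&mut self, u: Vertex, v: Vertex, change: u64) {
        self.out.entry(u).or_default().push(Edge { to: v, change });
    }
}
```

Because a vertex is (change, position) rather than the line's content, two identical lines introduced by different changes stay distinct.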

This means that changes may introduce both vertices and edges, as in the following example, where a line of text $D$ is introduced between $A$ and $B$:

Here, the thick arrow represents a change $c_0$ from a file containing the lines $A$, $B$, $C$ to a file which includes the line $D$. As mentioned above, the edges are labelled with the change that introduced them, in this case $c_0$. An important feature to note is that vertices are uniquely identified by the hash of the change that introduced them, along with a position in that change. This means that two lines of text with the same content, introduced by different changes, are different. It also means that a line keeps its identity even if the change is applied in a totally different context.

Moreover, this system is append-only, in the sense that deletions are handled by a more sophisticated labelling of the edges. In the example above, if we want to delete line $D$, we just need to make a change mapping the edge introduced by $c_0$ to a deleted edge, which is also labelled with the change in which it was introduced, this time $c_1$:

From now on, we call the full edges alive, and the dashed ones dead.

We have just described the two basic kinds of actions in Pijul; there are no others. One kind adds vertices to the graph, along with "alive" edges around them, and the other kind maps an existing edge label onto a different one. An edge label is fully described by two parameters: (1) its status (alive, deleted, and a few others related to multiple files and technical details explained below), and (2) the change that introduced it.
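These two kinds of actions can be sketched as follows; the `Status`, `Label` and `Action` names are illustrative, not Pijul's actual types, and the real change format is richer:

```rust
// An edge label is fully described by its status and the change
// that introduced it.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status {
    Alive,
    Deleted,
    // Pijul has a few more statuses, omitted here.
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Label {
    status: Status,
    change: u64, // the change that introduced this label
}

// The two basic kinds of actions: add a vertex with alive edges
// around it, or map an existing edge label onto a different one.
#[allow(dead_code)]
enum Action {
    NewVertex { vertex: u64, parents: Vec<u64>, children: Vec<u64>, change: u64 },
    EdgeMap { from: Label, to: Label },
}

// Applying an EdgeMap to a concrete edge: the current label must
// match the `from` side of the map.
fn apply_edge_map(current: Label, from: Label, to: Label) -> Option<Label> {
    if current == from { Some(to) } else { None }
}
```

Note that the `EdgeMap` action records both the old and the new label, which is what makes the map (and hence the whole change) invertible.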

Dependencies

This scheme allows one to define dependencies between changes:

Of course, this is just the minimal set of dependencies needed to make sense of the text edits. Hooks and scripts may add extra language-dependent dependencies based on semantics.

Are edge labels minimal?

Our goal is to find the smallest possible system, both for reasons of mathematical aesthetics (why store useless stuff?) and for performance. Therefore, one immediate question comes to mind: why even keep the change number on the edges?

In order to answer that question, suppose we don't keep the labels, meaning that the maps happen between statuses only. Then, consider the following two situations:

Conflicts and CRDTs

According to what we described so far, there are three types of conflicts in Pijul:

Moreover, it is easy to show that Pijul implements a conflict-free replicated data type (CRDT): indeed, we are just adding vertices and edges to a graph, or mapping edge labels which we know exist because of dependencies.

However, Pijul's data structure models, in a conflict-free way, the conflicts that can happen over a text file. In fact, Pijul would remain a CRDT with or without the design choices about edge labels explained above: for example, we could decide that the "deleted" status has higher precedence. But as shown above, that wouldn't model conflicts accurately.

Pseudo-edges

Moving back to the sequential (i.e. single-user) situation, suppose we start a file with many lines. Our user deletes all of them, adds one new line at the very beginning, and one at the very end, as shown in the following diagram:

If we represented this situation naively like in that diagram, the complexity of applying changes and outputting the repository would depend linearly on the size of the graph, as we would need to traverse the entire thing to know about line $C$, and to know that it comes after $B$.

The trick is to use what we call "pseudo-edges", which are not part of any change, but are just here to keep the "alive subgraph" (the subgraph of the alive vertices) connected. Every time we delete an edge, we add pseudo-edges between the source of the edge and all the descendants of the target, like the dotted edge in the graph below:
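A toy sketch of this bookkeeping, on a graph whose vertices are plain integers: when an edge $(u, v)$ is deleted (and $v$ with it), we connect $u$ to the nearest alive descendants of $v$ with pseudo-edges, which keeps the alive subgraph connected without traversing dead regions later.

```rust
use std::collections::{HashMap, HashSet};

struct Graph {
    alive: HashMap<u32, Vec<u32>>,  // alive edges
    pseudo: HashMap<u32, Vec<u32>>, // pseudo-edges (not part of any change)
    dead: HashSet<u32>,             // deleted vertices
}

impl Graph {
    // Delete the edge (u, v), marking v dead, and add pseudo-edges
    // from u to the alive descendants of v.
    fn delete_vertex(&mut self, u: u32, v: u32) {
        self.dead.insert(v);
        // Depth-first search from v, walking through dead vertices
        // until we reach alive ones.
        let mut stack = vec![v];
        let mut seen = HashSet::new();
        while let Some(x) = stack.pop() {
            for &y in self.alive.get(&x).into_iter().flatten() {
                if seen.insert(y) {
                    if !self.dead.contains(&y) {
                        // y is alive: a pseudo-edge keeps it
                        // reachable from u in the alive subgraph.
                        self.pseudo.entry(u).or_default().push(y);
                    } else {
                        // y is dead: keep searching below it.
                        stack.push(y);
                    }
                }
            }
        }
    }
}
```

In the example above, this turns the long chain of dead lines into a single pseudo-edge hop from the new first line to the new last one.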

Multiple files

The extension of this scheme to multiple files is done in the following way:

A less naive representation

We now revisit the first two diagrams in this post: adding and deleting lines. Vertices are now referenced by a change number (for example $c_0$) and a byte interval in that change (for example $[0, n[$, which means "bytes from offset $0$ included to offset $n$ excluded"). Note that vertices can now represent an arbitrary number of lines. Moreover, the size they occupy in memory is independent of $n$ (assuming $n < 2^{64}$).

Starting from a single vertex $c_0:[0, n[$ with $n$ bytes (containing an arbitrary number of lines), introduced by change $c_0$, adding a line is done by first splitting $c_0:[0, n[$ at some offset $i < n$, and inserting the new vertex just like before.
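Splitting can be sketched as follows, with a hypothetical `Vertex` type holding the change and byte interval. Both halves keep their identity, since they reference the same change and adjacent intervals:

```rust
// A vertex is a byte interval [start, end[ of a change's contents.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Vertex {
    change: u64, // stand-in for the hash of the introducing change
    start: u64,  // first byte, included
    end: u64,    // last byte, excluded
}

// Split c:[start, end[ at offset i into c:[start, i[ and c:[i, end[.
fn split(v: Vertex, i: u64) -> (Vertex, Vertex) {
    assert!(v.start < i && i < v.end, "offset must fall strictly inside the interval");
    (
        Vertex { change: v.change, start: v.start, end: i },
        Vertex { change: v.change, start: i, end: v.end },
    )
}
```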

This means in particular that the splitting of content into lines is done by the diff algorithm and is encoded in the changes, instead of being set in stone for all repositories. With a different diff algorithm, we could imagine splitting contents according to programming language structure.

Once we know how to split vertices, deletions are handled almost as before, as shown in the following diagram: we first apply the same change $c_1$ as in the previous example, then apply change $c_2$, which deletes the end of vertex $c_0:[0, i[$, from some $j < i$ to $i$, and the beginning of vertex $c_1:[0, m[$, from $0$ to some $k < m$.

One important difference with before is that our previous edges had two different roles, which were not clearly distinguishable from one another until now. One role was to order the lines, and the other was to carry the status. However, now that vertices can be split, the "status" role of edges becomes ambiguous: for example, a deleted edge pointing to some vertex $c:[i, j[$ means that bytes $[i, j[$ of change $c$ are deleted, but what if we split that vertex into $c:[i, k[$ and $c:[k, j[$? Should we add an extra deleted edge, and if so, where?

There is a simple solution: by introducing a new kind of edge label (named BLOCK in the source code) we can distinguish between "internal" or "implicit" edges that are only here to order the blocks, and "status" edges informing about the status of their target vertex. I'll probably explain more about this in a future blog post, or in the manual, or in a paper.

Version identifiers

Version identifiers need more sophistication in Pijul than in other version control systems, because any two patches $c_0$ and $c_1$ that could be produced independently commute, meaning that applying $c_1$ after $c_0$ yields the same repository as applying $c_0$ after $c_1$, and we want version identifiers to reflect that.

This means that meaningful version identifiers must be independent of the order in which changes were applied. While this could be achieved easily with any linear function of the hashes, for example taking the bitwise XOR of hashes, such naive schemes have the problem that servers could trick clients into accepting that versions are synchronised, even though they are not. The bitwise XOR described above has this problem, and so do other linear functions: since the hashes are random, there is a high probability that any $n$ hashes form a linearly independent basis of the vector space of hashes. The problem of forging a version identity then becomes easy, as it is just a matter of solving a linear equation.
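A sketch of both halves of that claim: XOR commutes, which gives order-independent identifiers, but its linearity means that given any target identifier, a single forged "hash" solves the equation:

```rust
// Order-independent version identifier: bitwise XOR of all change
// hashes (a linear function over GF(2), hence commutative).
fn xor_version(hashes: &[u64]) -> u64 {
    hashes.iter().fold(0, |acc, &h| acc ^ h)
}

// Forging: to make a set of hashes match any target identifier,
// just solve the linear equation x = target XOR (current version).
fn forge(hashes: &[u64], target: u64) -> u64 {
    target ^ xor_version(hashes)
}
```

This is why a linear function cannot double as a proof of synchronisation: a malicious server can always append one crafted value to hit any identifier.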

Our new scheme for version identifiers comes from the discrete log problem. The version identifier of an empty repository is always $e^{1}$, where $e$ is a fixed generator and the exponent starts at the identity element $1$, and applying a change with hash $h$ on top of version $V = e^{v}$ yields $V^h = e^{v\cdot h}$.

Now, because getting from $V$ to the exponent $v$ is hard (at least for a classical computer), forging a version identity is hard, since we would need to generate hashes whose product is $v$, without even knowing what $v$ is.
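The scheme can be sketched with toy numbers; a real implementation would use big integers and a cryptographic group, not the small prime and generator assumed here. The key property is that $(e^{h_1})^{h_2} = e^{h_1 h_2} = (e^{h_2})^{h_1}$, so identifiers commute:

```rust
// Square-and-multiply modular exponentiation (u128 intermediates
// avoid overflow for moduli below 2^64).
fn mod_pow(mut base: u128, mut exp: u128, modulus: u128) -> u128 {
    let mut result = 1 % modulus;
    base %= modulus;
    while exp > 0 {
        if exp & 1 == 1 {
            result = result * base % modulus;
        }
        base = base * base % modulus;
        exp >>= 1;
    }
    result
}

// Applying a change with hash h on top of version V is V^h (mod p).
fn apply(version: u128, hash: u128, p: u128) -> u128 {
    mod_pow(version, hash, p)
}
```

Recovering the accumulated exponent from the resulting identifier is the discrete log problem, which is what makes forging hard.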

As pointed out by Dan Robertson, this scheme is similar to homomorphic hashing.