Pijul

Bad performance


#1

Hello!
I’m a new user. While trying pijul as a replacement for git for my personal projects, I was quite surprised to hit some serious performance issues.
My use cases were:

  1. a repo with a single 4MB file
  2. a repo with a few hundred small files

In the first case committing felt laggy, but not too awful. But when I tried to diff a change (of just ~270 lines) it used up all my 8GB of RAM and crashed my system. With git it’s all instantaneous.

In the second case I noticed that while the repo itself was less than 20MB, the .pijul folder ran up to 200MB (whereas with git it’s a bit more than half the repo). Moreover, committing and diffing were awfully slow and CPU intensive.

I prepared a simple python test to demonstrate the second issue:

#!/usr/bin/python3
"""This test shows the strong performance hit of large repos with pijul, compared to git."""
import glob
import json
import os
import random
import shutil
import shlex
import string
import subprocess
import time


def get_size(start_path):
    """Calculate the size of a directory."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return str(total_size / 1000000) + ' MB'


def random_string(length=25):
    """Generate a random string."""
    pool = string.ascii_letters + string.digits
    return ''.join(random.choice(pool) for i in range(length))


startdir = os.getcwd()

pjl = 'pijul-test-performance-pjl/'
git = 'pijul-test-performance-git/'

os.makedirs(pjl)

def build_value():
    v = []
    for i in range(20):
        v.append(random_string())
    return v
  
      
def build_obj():
    return {random_string(): build_value(), random_string(): build_value(), random_string(): build_value()}


def build_data():
    """Build a list.

    We use random strings to simulate real-life data and avoid compression optimizations on identical data."""
    data = []
    for i in range(20):
        data.append(build_obj())
    return data


print('Wait a few seconds, generating data...')
for i in range(400):
    with open(pjl + str(i) + '.json', 'w') as f:
        json.dump(build_data(), f, indent=2)

shutil.copytree(pjl, git)

print('Total size of files: ' + get_size(pjl))

os.chdir(pjl)
a = time.time()
subprocess.run(shlex.split('pijul init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul record -a -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First PIJUL commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.pijul'))

# now we demonstrate that on subsequent operations, even on small commits, there's a huge performance hit
# also note that trying to diff a 4MB file with ~270 lines changed hung my system and forced me to hard kill my user session, whereas in git there's no problem at all
with open('1.json') as f:
    data = json.load(f)
data.append(random_string())
with open('1.json', 'w') as f:
    json.dump(data, f, indent=2)
a = time.time()
subprocess.run(shlex.split('pijul record -a -m second'), stdout=subprocess.DEVNULL)
b = time.time()

print('Second PIJUL commit took', str((b - a)), 'seconds')

os.chdir(startdir)

os.chdir(git)
a = time.time()
subprocess.run(shlex.split('git init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git commit -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First GIT commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.git'))

with open('1.json') as f:
    data = json.load(f)
data.append(random_string())
with open('1.json', 'w') as f:
    json.dump(data, f, indent=2)
a = time.time()
subprocess.run(shlex.split('git commit -am second'), stdout=subprocess.DEVNULL)
b = time.time()

print('Second GIT commit took', str((b - a)), 'seconds')

os.chdir(startdir)

I wonder whether this is just an implementation bug that will likely be solved when pijul hits 1.0, or if it comes from a design choice that is difficult to fix, as I hear darcs has awful performance too.


#2

I created a test case for case 1. Basically, just diffing a change of 135 non-contiguous lines in a 4MB file will eat all your RAM and crash the system. Also note that the repo size is 54 MB.
Please remember to kill the script before it freezes your system.

#!/usr/bin/python3
"""This test shows the strong performance hit of large repos with pijul, compared to git."""
import glob
import json
import os
import random
import shutil
import shlex
import string
import subprocess
import time


def get_size(start_path):
    """Calculate the size of a directory."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return str(total_size / 1000000) + ' MB'


def random_string(length=25):
    """Generate a random string."""
    pool = string.ascii_letters + string.digits
    return ''.join(random.choice(pool) for i in range(length))


startdir = os.getcwd()

pjl = 'pijul-test-performance2-pjl/'
git = 'pijul-test-performance2-git/'

os.makedirs(pjl)

def build_value():
    v = []
    for i in range(20):
        v.append(random_string())
    return v
  
      
def build_obj():
    return {random_string(): build_value(), random_string(): build_value(), random_string(): build_value()}


def build_data():
    """Build a list.

    We use random strings to simulate real-life data and avoid compression optimizations on identical data."""
    data = []
    for i in range(2000):
        data.append(build_obj())
    return data


print('Wait a few seconds, generating data...')

with open(pjl + 'big_data.json', 'w') as f:
    json.dump(build_data(), f, indent=2)

shutil.copytree(pjl, git)

print('Total size of files: ' + get_size(pjl))


def create_diff():
    with open('big_data.json') as f:
        lines = f.readlines()
    new = []
    changed = 0
    c = 0
    for l in lines:
        if c == 1000:
            # keep the line terminator so the replaced line doesn't merge with the next one
            l = random_string() + '\n'
            c = 0
            changed += 1
        else:
            c += 1
        new.append(l)
    with open('big_data.json', 'w') as f:
        f.writelines(new)
    print(str(changed), 'lines changed.')

# we do git first, because pijul will crash
os.chdir(git)
a = time.time()
subprocess.run(shlex.split('git init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git commit -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First GIT commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.git'))

create_diff()
a = time.time()
subprocess.run(shlex.split('git diff'), stdout=subprocess.DEVNULL)
b = time.time()

print('Diffing GIT took', str((b - a)), 'seconds')

os.chdir(startdir)

# DANGER! Kill the script before it eats up all your RAM
os.chdir(pjl)
a = time.time()
subprocess.run(shlex.split('pijul init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul record -a -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First PIJUL commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.pijul'))

create_diff()
a = time.time()
subprocess.run(shlex.split('pijul diff'), stdout=subprocess.DEVNULL)
b = time.time()

print('Diffing PIJUL took', str((b - a)), 'seconds')

os.chdir(startdir)

#3

Hi! Thanks for taking the time to run these benchmarks. I would tend to believe this is an implementation bug, as Pijul was designed specifically as a solution to Darcs’ performance problems.

There’s a tradeoff on pristine size, though: I would expect our pristines to be larger than Git’s, but not too large.

I would also expect the speed to be similar to Git’s most of the time, or just slightly slower (and sometimes faster in made-up instances).


#4

That’s good to know. I really like the spirit of pijul and sincerely hope it succeeds!


#5

By the way, there are three different areas in which Pijul performance has to be investigated:

  1. Diff (I believe that’s where your problems are coming from). IIRC, Git doesn’t need to call diff to create commits, but it still needs to read the files, so its running time is linear in the number of lines.

    On the other hand, Pijul runs the most expensive diff algorithm (minimal patches), which is quadratic in the number of lines, equivalent to diff -d.

    Our choice was the best one in the beginning, as we really needed smallest patches to make debugging easier, and in particular more reliable. There’s no strong reason to keep using that in the future, especially since patience diff is both faster and gives better diffs.

  2. Patch applications. This is supposed to be really fast, and even faster than Git (in the future). At the moment, the implementation is supposed to be correct, but maybe not optimal, and the algorithm might improve.

  3. System stuff, like mmap performance and fine-grained details of virtual memory. Even though I wrote Sanakirja, which forced me to learn loads about these low-level things, it’s not like I have been an engineer for 20 years in that particular area.
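
To illustrate why a minimal diff can be quadratic (point 1 above), here is a minimal sketch of the textbook LCS-based diff. It is not Pijul's implementation; the function names `lcs_table` and `minimal_diff` are made up for this example. The table has one cell per pair of lines, so a file of roughly 135,000 lines (the size implied by 135 changes at one per 1001 lines in the test from #2) compared against itself would need on the order of 10^10 cells, which is consistent with running out of RAM.

```python
def lcs_table(a, b):
    """Fill the classic longest-common-subsequence table.

    table[i][j] holds the LCS length of a[i:] and b[j:].
    Time and memory are both proportional to len(a) * len(b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def minimal_diff(a, b):
    """Return a shortest edit script as (op, line) pairs.

    op is ' ' for an unchanged line, '-' for a deletion, '+' for an insertion.
    """
    table = lcs_table(a, b)
    i = j = 0
    script = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            script.append((' ', a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            script.append(('-', a[i]))
            i += 1
        else:
            script.append(('+', b[j]))
            j += 1
    # whatever is left on either side is pure deletion or insertion
    script.extend(('-', line) for line in a[i:])
    script.extend(('+', line) for line in b[j:])
    return script


if __name__ == '__main__':
    old = ['a', 'b', 'c', 'd']
    new = ['a', 'x', 'c', 'd', 'e']
    for op, line in minimal_diff(old, new):
        print(op, line)
```

Myers-style and patience diffs avoid filling the whole table when the two files are mostly identical, which is the common case for a version control system.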


#6

Thanks for the follow-up. Sadly this is a field way beyond my skill level, as I’m just a hobbyist python programmer, but I’d like to contribute.
For now, I implemented a patch to automatically generate shell completions. I’ll send it as soon as the nest is up again.


#7

Or, another way to improve performance could be to write algorithms correctly. For instance, I just experienced the same problem 1000 times when trying to convert the repositories from the Nest, and investigated a little further.

It turns out libpijul 0.9.0 has a DFS written in exponential time (because I didn’t want to write it recursively, in order not to overflow the stack). I’ll release a fix in 0.9.1 today, and restore all repositories on the Nest.
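
For readers wondering how a non-recursive DFS can end up exponential, here is a generic sketch (not libpijul's code; the graph shape and the names `diamond_chain`, `dfs_order` and `expansions_without_marking` are assumptions made for this example). With an explicit stack and no visited set, every path to a node triggers a fresh expansion, so a chain of diamonds doubles the work at each layer. Marking nodes as visited brings it back to one expansion per node.

```python
def diamond_chain(n):
    """Build a DAG of n stacked diamonds: n_i -> a_i, b_i -> n_(i+1)."""
    graph = {}
    for i in range(n):
        graph[f'n{i}'] = [f'a{i}', f'b{i}']
        graph[f'a{i}'] = [f'n{i + 1}']
        graph[f'b{i}'] = [f'n{i + 1}']
    return graph


def expansions_without_marking(graph, start):
    """Explicit-stack traversal that never records visited nodes.

    Terminates on a DAG, but expands a node once per path leading to it.
    """
    count = 0
    stack = [start]
    while stack:
        node = stack.pop()
        count += 1
        stack.extend(graph.get(node, []))
    return count


def dfs_order(graph, start):
    """Iterative depth-first traversal; each node is expanded exactly once."""
    visited = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # push in reverse so neighbours are explored in their listed order
        for neighbour in reversed(graph.get(node, [])):
            if neighbour not in visited:
                stack.append(neighbour)
    return order


if __name__ == '__main__':
    g = diamond_chain(10)
    print('nodes reached with marking:', len(dfs_order(g, 'n0')))
    print('expansions without marking:', expansions_without_marking(g, 'n0'))
```

Both versions avoid recursion, so neither can overflow the call stack; only the visited set separates linear from exponential behaviour on this kind of graph.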


#8

I tested libpijul 0.9.1, but both of my test cases show the same performance failure as with 0.9.0.


#9

I’ve had a significant performance boost now that https://nest.pijul.com/pijul_org/pijul/discussions/294 is fixed


#10

This is great news! Thanks for helping us, and congrats @flobec for the fix.


#11

I tried the latest pijul from master, which uses Myers’ diff algorithm as implemented by @pmeunier. Excellent news!
Neither of my tests hangs anymore, and both complete fairly quickly, though much slower than git.

First test:

Total size of files: 17.8488 MB
First PIJUL commit took 6.657814979553223 seconds
Repo size: 234.906034 MB
Second PIJUL commit took 3.0688796043395996 seconds
First GIT commit took 0.8010790348052979 seconds
Repo size: 10.739978 MB
Second GIT commit took 0.020694971084594727 seconds

Second test:

Total size of files: 4.462002 MB
First GIT commit took 0.24826502799987793 seconds
Repo size: 2.704378 MB
135 lines changed.
Diffing GIT took 0.11530280113220215 seconds
First PIJUL commit took 1.6783769130706787 seconds
Repo size: 58.633307 MB
135 lines changed.
Diffing PIJUL took 0.8525295257568359 seconds

As you can see, the biggest challenge remains repo size, but performance made a huge leap forward. Kudos to @pmeunier.
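
To put the size figures above in proportion, a quick sketch that divides each repo size by the size of the working files. The numbers are copied from the two test outputs in this post; the dictionary layout and the name `size_ratios` are just choices made for this example.

```python
# Sizes in MB, copied from the two test outputs above.
files_mb = {'test 1': 17.8488, 'test 2': 4.462002}
repo_mb = {
    'test 1': {'pijul': 234.906034, 'git': 10.739978},
    'test 2': {'pijul': 58.633307, 'git': 2.704378},
}


def size_ratios():
    """Return repo size divided by working-file size, per test and tool."""
    return {
        test: {tool: size / files_mb[test] for tool, size in repos.items()}
        for test, repos in repo_mb.items()
    }


if __name__ == '__main__':
    for test, ratios in size_ratios().items():
        for tool, ratio in ratios.items():
            print(f'{test}: {tool} repo is {ratio:.1f}x the size of the files')
```

In both tests the .pijul folder comes out at about 13 times the size of the files, while .git stays at about 0.6 times.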


#12

I have just tried the latest version in the repository, and the pijul status command is now a lot better with the pijul repository (it becomes possible again to print the branch name in my prompt without having to wait seconds between each shell command).

Awesome work, @pmeunier!


#13

About repo size: the bulk of it goes into the pristine, whereas the patches themselves are tiny. If I understand correctly, the pristine is a sort of cache for speeding up operations; now that pijul is significantly faster than before, would it be possible to shrink it significantly?


#14

@yory: thanks for running benchmarks again! I’m glad you like the change. Unfortunately, the pristine is essential to guarantee the algorithmic complexity of patch applications. Without it we could either have consistency bugs when we apply the same set of patches in two different orders, or else really bad performance (linear or even exponential in the size of history), much worse than the diff problem we had.


#15

Please, please, do not focus on performance parity with Git.

Git is fast on Linux because it is optimized for Linux, by Linux developers. To achieve the same performance you will sacrifice portability. Yes, mmap is fast on Linux, but it’s pretty much guaranteed to break with any network attached file-systems, which eliminates portability to distributed operating systems. It also eliminates portability to non-monolithic systems that do not maintain a global cache hierarchy shared across all processes.

Quality software makes for fast software, but fast software does not make for quality software.


#16

Well, I believe that the speed we have now is already quite good, though improving it a bit more would benefit huge projects a lot, I think. I’m not sure Pijul would perform well for sizeable projects like a kernel or a web browser, though for those the first blocker is the size of the pristine, which becomes gargantuan very quickly.


#17

@ehmry yay, performance (as opposed to algorithmic complexity) was never a real focus for us. As hurling players say, focus on your points and your goals will come.

@yory: size problems are more important, and we need to keep them under control. It’s not too bad yet, but we have to watch it.