Bad performance


#1

Hello!
I’m a new user. While trying Pijul as a replacement for Git in my personal projects, I was surprised to hit some fairly serious performance issues.
My use cases were:

  1. a repo with a single 4MB file
  2. a repo with a few hundred small files

In the first case, committing felt laggy but not too awful. But when I tried to diff a change (of just ~270 lines), it used up all 8 GB of my RAM and crashed my system. With Git it’s all instantaneous.

In the second case, I noticed that while the repo itself was less than 20 MB, the .pijul folder grew to 200 MB (whereas with Git it’s a bit more than half the repo size). Moreover, committing and diffing were painfully slow and CPU-intensive.

I prepared a simple Python test to demonstrate the second issue:

#!/usr/bin/python3
"""This test shows the strong performance hit of large repos with pijul, compared to git."""
import glob
import json
import os
import random
import shlex
import shutil
import string
import subprocess
import time


def get_size(start_path):
    """Calculate the size of a directory."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return str(total_size / 1000000) + ' MB'


def random_string(length=25):
    """Generate a random string."""
    pool = string.ascii_letters + string.digits
    return ''.join(random.choice(pool) for i in range(length))


startdir = os.getcwd()

pjl = 'pijul-test-performance-pjl/'
git = 'pijul-test-performance-git/'

os.makedirs(pjl)

def build_value():
    v = []
    for i in range(20):
        v.append(random_string())
    return v


def build_obj():
    return {random_string(): build_value(),
            random_string(): build_value(),
            random_string(): build_value()}


def build_data():
    """Build a list.

    We use random strings to simulate real-life data and avoid compression optimizations on identical data."""
    data = []
    for i in range(20):
        data.append(build_obj())
    return data


print('Wait a few seconds, generating data...')
for i in range(400):
    with open(pjl + str(i) + '.json', 'w') as f:
        json.dump(build_data(), f, indent=2)

shutil.copytree(pjl, git)

print('Total size of files: ' + get_size(pjl))

os.chdir(pjl)
a = time.time()
subprocess.run(shlex.split('pijul init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul record -a -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First PIJUL commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.pijul'))

# Now we demonstrate that subsequent operations take a huge performance hit,
# even on small commits. Also note that trying to diff a 4MB file with ~270
# changed lines hung my system and forced me to hard-kill my user session,
# whereas Git has no problem at all.
with open('1.json') as f:
    data = json.load(f)
data.append(random_string())
with open('1.json', 'w') as f:
    json.dump(data, f, indent=2)
a = time.time()
subprocess.run(shlex.split('pijul record -a -m second'), stdout=subprocess.DEVNULL)
b = time.time()

print('Second PIJUL commit took', str((b - a)), 'seconds')

os.chdir(startdir)

os.chdir(git)
a = time.time()
subprocess.run(shlex.split('git init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git commit -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First GIT commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.git'))

with open('1.json') as f:
    data = json.load(f)
data.append(random_string())
with open('1.json', 'w') as f:
    json.dump(data, f, indent=2)
a = time.time()
subprocess.run(shlex.split('git commit -am second'), stdout=subprocess.DEVNULL)
b = time.time()

print('Second GIT commit took', str((b - a)), 'seconds')

os.chdir(startdir)

I wonder whether this is just an implementation bug that will likely be solved by the time Pijul hits 1.0, or whether it comes from a design choice that is hard to fix, as I hear Darcs has awful performance too.


#2

I created a test case for use case 1. Simply diffing a change of 135 non-contiguous lines in a 4 MB file will eat all your RAM and crash the system. Also note that the repo size is 54 MB.
Please remember to kill the script before it freezes your system.
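
One way to keep the Pijul step from freezing the whole machine is to cap the child process’s address space before it runs. This is a Linux/Unix-only sketch using the standard resource module; it is not part of the original test, and the 2 GB default cap is an arbitrary choice:

```python
import resource
import shlex
import subprocess


def limit_memory(max_bytes):
    """Build a preexec_fn that caps the child's address space (Unix only)."""
    def set_limit():
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    return set_limit


def run_capped(cmd, max_bytes=2 * 1024 ** 3):
    """Run cmd; a runaway allocation dies with MemoryError in the child
    instead of taking the whole system down."""
    return subprocess.run(shlex.split(cmd), preexec_fn=limit_memory(max_bytes))
```

With this, a call like run_capped('pijul diff') fails fast with a nonzero exit code instead of exhausting physical RAM.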

#!/usr/bin/python3
"""This test shows the strong performance hit of large repos with pijul, compared to git."""
import glob
import json
import os
import random
import shlex
import shutil
import string
import subprocess
import time


def get_size(start_path):
    """Calculate the size of a directory."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return str(total_size / 1000000) + ' MB'


def random_string(length=25):
    """Generate a random string."""
    pool = string.ascii_letters + string.digits
    return ''.join(random.choice(pool) for i in range(length))


startdir = os.getcwd()

pjl = 'pijul-test-performance2-pjl/'
git = 'pijul-test-performance2-git/'

os.makedirs(pjl)

def build_value():
    v = []
    for i in range(20):
        v.append(random_string())
    return v


def build_obj():
    return {random_string(): build_value(),
            random_string(): build_value(),
            random_string(): build_value()}


def build_data():
    """Build a list.

    We use random strings to simulate real-life data and avoid compression optimizations on identical data."""
    data = []
    for i in range(2000):
        data.append(build_obj())
    return data


print('Wait a few seconds, generating data...')

with open(pjl + 'big_data.json', 'w') as f:
    json.dump(build_data(), f, indent=2)

shutil.copytree(pjl, git)

print('Total size of files: ' + get_size(pjl))


def create_diff():
    with open('big_data.json') as f:
        lines = f.readlines()
    new = []
    changed = 0
    c = 0
    for l in lines:
        if c == 1000:
            l = random_string() + '\n'  # keep the newline so the changed line doesn't merge with the next one
            c = 0
            changed += 1
        else:
            c += 1
        new.append(l)
    with open('big_data.json', 'w') as f:
        f.writelines(new)
    print(str(changed), 'lines changed.')

# we do git first, because pijul will crash
os.chdir(git)
a = time.time()
subprocess.run(shlex.split('git init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('git commit -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First GIT commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.git'))

create_diff()
a = time.time()
subprocess.run(shlex.split('git diff'), stdout=subprocess.DEVNULL)
b = time.time()

print('Diffing GIT took', str((b - a)), 'seconds')

os.chdir(startdir)

# DANGER! Kill the script before it eats up all your RAM
os.chdir(pjl)
a = time.time()
subprocess.run(shlex.split('pijul init'), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul add ' + ' '.join(glob.glob('*'))), stdout=subprocess.DEVNULL)
subprocess.run(shlex.split('pijul record -a -m first'), stdout=subprocess.DEVNULL)
b = time.time()

print('First PIJUL commit took', str((b - a)), 'seconds')
print('Repo size:', get_size('.pijul'))

create_diff()
a = time.time()
subprocess.run(shlex.split('pijul diff'), stdout=subprocess.DEVNULL)
b = time.time()

print('Diffing PIJUL took', str((b - a)), 'seconds')

os.chdir(startdir)

#3

Hi! Thanks for taking the time to run these benchmarks. I would tend to believe this is an implementation bug, as Pijul was designed specifically as a solution to Darcs’ performance problems.

There’s a tradeoff on pristine size, though: I would expect our pristines to be larger than Git’s, but not by too much.

And I’d expect the speed to be similar most of the time, or just slightly slower than Git (and sometimes faster in contrived instances).


#4

That’s good to know. I really like the spirit of Pijul and sincerely hope it succeeds!


#5

By the way, there are three different areas in which Pijul performance has to be investigated:

  1. Diff (I believe that’s where your problems are coming from). IIRC, Git doesn’t need to call diff to create commits, but it still needs to read the files, so it’s linear in the number of lines.

    On the other hand, Pijul runs the most expensive diff algorithm (minimal patches), which is quadratic in the number of lines, equivalent to diff -d.

    Our choice was the best one in the beginning, as we really needed the smallest patches to make debugging easier and, in particular, more reliable. There’s no strong reason to keep it in the future, especially since patience diff is both faster and gives better diffs.

  2. Patch applications. This is supposed to be really fast, and even faster than Git (in the future). At the moment, the implementation is supposed to be correct, but maybe not optimal, and the algorithm might improve.

  3. System stuff, like mmap performance and fine-grained details of virtual memory. Even though I wrote Sanakirja, which forced me to learn loads about these low-level things, it’s not as if I’d spent 20 years as an engineer in that particular area.
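
To make the cost in item 1 concrete: a minimal diff is essentially an LCS computation over the two line sequences, whose dynamic-programming table has len(a) × len(b) cells, so a 4 MB file with tens of thousands of lines is far beyond what it can handle, while heuristic matchers such as Python’s stdlib difflib avoid filling the full table. This is only an illustrative sketch, not Pijul’s actual implementation:

```python
import difflib


def lcs_length(a, b):
    """Minimal-diff core: O(len(a) * len(b)) time for the DP table.
    On two ~60k-line files that is ~3.6e9 cell updates, hence the blowup."""
    # Keep only one previous row to save memory; time is still quadratic.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]


a = ['x', 'common1', 'common2', 'y']
b = ['common1', 'z', 'common2']
print(lcs_length(a, b))  # 2: 'common1' and 'common2' are the common lines

# A heuristic matcher (Ratcliff/Obershelp-style) skips the full table:
sm = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
print(sum(m.size for m in sm.get_matching_blocks()))  # also 2 here
```

Patience diff similarly sidesteps the full quadratic table by anchoring on lines that are unique in both files, which is why it can be both faster and produce more readable diffs.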


#6

Thanks for the follow-up. Sadly this field is way beyond my skill level, as I’m just a hobbyist Python programmer, but I’d like to contribute.
For now, I’ve implemented a patch to automatically generate shell completions. I’ll send it as soon as the Nest is up again.


#7

Or, another way to improve performance could be to write the algorithms correctly in the first place. For instance, I just ran into the same problem 1000 times while trying to convert the repositories from the Nest, and investigated a little further.

It turns out libpijul 0.9.0 has a DFS that runs in exponential time (because I didn’t want to write it recursively, in order not to overflow the stack). I’ll release a fix in 0.9.1 today and restore all repositories on the Nest.
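
For readers curious how an iterative DFS can go exponential: if a stack-based traversal never marks nodes as seen, every shared subgraph is re-explored once per path leading into it, so a chain of n "diamond" shapes costs O(2^n) instead of O(n). A toy illustration with hypothetical graph code, not libpijul’s actual data structures:

```python
def dfs_visits(graph, start, dedup):
    """Count node visits in an iterative DFS, with or without a visited set."""
    visits = 0
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if dedup:
            if node in seen:
                continue
            seen.add(node)
        visits += 1
        stack.extend(graph.get(node, ()))
    return visits


def diamond_chain(n):
    """A chain of n diamonds: i -> {a_i, b_i} -> i+1, sharing node i+1."""
    g = {}
    for i in range(n):
        g[i] = [('a', i), ('b', i)]
        g[('a', i)] = [i + 1]
        g[('b', i)] = [i + 1]
    return g


g = diamond_chain(15)
print(dfs_visits(g, 0, dedup=True))   # 46 visits: one per node (3*15 + 1)
print(dfs_visits(g, 0, dedup=False))  # 131069 visits (4 * 2**15 - 3)
```

The fix is just the seen-set check: with it the traversal stays linear in the graph size and still uses an explicit stack, so there is no recursion-depth risk either.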


#8

I tested libpijul 0.9.1, but both my test cases show the same performance failure as 0.9.0.


#9

I’ve had a significant performance boost now that https://nest.pijul.com/pijul_org/pijul/discussions/294 is fixed.


#10

This is great news! Thanks for helping us, and congrats @flobec for the fix.