git-annex diagnostics

Sat Mar 3 15:39:44 CET 2012

Thomas Koch wrote:
> first I just wanted to report that I have a git-annex repo that is really big 
> and slow and that this makes me kind of unhappy. Then I realized, that it may 
> be a good idea to add a "diagnostics" command to git-annex that will gather 
> all informations useful for you to improve git-annex, e.g. for my repo:

`git annex status` is essentially that, combined with the --debug flag
when there's a specific problem. There is also the ability to build
with `make PROFILE=1`, at which point the techniques described here can
be used to profile for time or space:
http://book.realworldhaskell.org/read/profiling-and-optimization.html

> find . -type l -a \( -path ".git" -prune -o -print \) | wc -l 
> 37738

This is the most relevant number, probably.

> find .git/objects -type f | wc -l
> 207864

This is surprisingly many. git's automatic gc typically keeps the number
of loose objects lower, packing when there are more than 6700. (I have 194.)
Packing does tend to improve git repository performance, since the
kernel can better buffer pack files, rather than seeking like mad among
many loose objects. I'd be curious how your fsck performs after packing.
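
To check the loose object count and try packing, something like the
following sketch could work. It assumes a git new enough to support
`git -C`; the function names `loose_objects` and `pack_and_report` are
made up for illustration and are not part of git or git-annex.

```shell
#!/bin/sh
# Print the number of loose (unpacked) objects in the repository at $1
# (defaults to the current directory). Uses the "count:" line of
# `git count-objects -v`.
loose_objects() {
    git -C "${1:-.}" count-objects -v | awk '/^count:/ { print $2 }'
}

# Pack the repository with `git gc` and report the loose object count
# before and after. Hypothetical helper, for illustration only.
pack_and_report() {
    repo="${1:-.}"
    echo "loose objects before: $(loose_objects "$repo")"
    git -C "$repo" gc --quiet
    echo "loose objects after:  $(loose_objects "$repo")"
}
```

Running `pack_and_report /path/to/repo` and then repeating the same
`git annex fsck --fast` timing would show whether packing helps.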

> time git annex fsck --fast | grep -A 10 -v "ok$"
> 1200.66s real  45.35s user  5.86s system  156 maxmem/kb  301856 nrInOps  4%

By comparison, I have a repo with 40 thousand files, and running fsck on
an SSD (on an otherwise 3 years out of date netbook) takes 10 minutes:

225.33user 59.37system 9:58.26elapsed 47%CPU (0avgtext+0avgdata 54448maxresident)k

Note the 47% CPU usage. The other half of the CPU was used by git cat-file,
which is looking up the location log for each file being fscked.

Indeed, as the number of files, rather than the size of files, increases,
the largest source of scalability problems is git itself. Some helpful
tips include:

* Use `git status .` to only check the status of the current subdirectory,
  rather than scanning the entire repository, and `git commit .` or
  staged commits, rather than `git commit -a`.
* Run git annex fsck in or on active directories; put inactive files in a
  different directory of the same repository. Even in a large repository
  git-annex will be fast if run in a relatively small (tens of thousands
  of files) subdirectory.
* If doing "git annex add" (or move, or drop) on a large number of files,
  consider setting `git config annex.alwayscommit false` with the newest
  version, to avoid running the slow git commit as much as possible.
* Use branches. Now that git-annex fully supports them, if there's a
  sensible branch strategy for your repository that can segment the
  files in a useful way, you can avoid performance issues, since
  the files in the non-checked-out branch are essentially "free".
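
The annex.alwayscommit tip could be wrapped up roughly like this. It is
only a sketch: `without_alwayscommit` is a hypothetical helper name, and
it assumes the setting was not already configured in the repository,
since it simply unsets it afterwards.

```shell
#!/bin/sh
# Run a command with annex.alwayscommit set to false in the repository
# at $1, then remove the setting again so later commands commit as usual.
# Assumption: no previous value of annex.alwayscommit needs preserving.
without_alwayscommit() {
    repo="$1"; shift
    git -C "$repo" config annex.alwayscommit false
    "$@"
    status=$?
    git -C "$repo" config --unset annex.alwayscommit
    return $status
}
```

For example, `without_alwayscommit . git annex add bigdir` from the top
of the repository (`bigdir` being a placeholder). As I understand it, the
changes journalled in the meantime are committed to the git-annex branch
by a later `git annex merge` or `git annex sync`.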

-- 
see shy jo