[announce] Sharebox, a FUSE filesystem relying on git-annex

Dieter Plaetinck dieter at plaetinck.be
Sun Apr 3 11:35:29 CEST 2011

On Sat, 2 Apr 2011 23:19:52 -0400
Joey Hess <joey at kitenet.net> wrote:

> Dieter Plaetinck wrote:
> > @Joey: you mentioned you think inotify might be a better
> > backend/paradigm for this than fuse, so do you think implementing
> > git-annex in something like dvcs-autosync is feasible? and/or
> > preferable?
> Feasible? Certainly. Preferable? I'm in the "let a thousand flowers
> bloom" phase. It's spring. :)
> As Christophe-Marie has pointed out, git-annex makes annexed files
> semi-immutable, and FUSE can hide that quirk, while inotify watching cannot.
> That could be confusing for certain users or use cases, if they are not
> aware of what is going on. Or it could be something quickly learned
> about how these special replicated directories work, that files have to
> be copied to be changed.
> This is also an area I hope to improve in git-annex, by using git smudge
> filters. So it might get a mode where files can be modified and git
> commit just annexes the new content. Last time I looked at this, git was
> not *quite* there to let it be done efficiently.

I think having support for this in git-annex would be very useful, even if it's not that efficient: if git-annex can deal with it, individual higher-level projects like sharebox and dvcs-autosync have fewer headaches. Not to mention that sharebox/dvcs-autosync would need to do really inefficient things to handle it anyway (they can't involve themselves in the actual git/dvcs tricks; they work at a higher level of abstraction), and it might also benefit users who work with git-annex manually.
How do you see this? How hard/cumbersome would it be to implement in git-annex?
Why is it inefficient? It's not really clear to me after reading the smudge filter documentation at http://www.kernel.org/pub/software/scm/git/docs/gitattributes.html
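For what it's worth, the inefficiency is probably that git pipes the whole file content through the filter's stdin and stdout every time the file is added or checked out, so a multi-gigabyte annexed file would be read and rewritten in full rather than just symlinked. The mechanism itself is declared roughly like this (the filter name and the clean/smudge commands are hypothetical; git-annex shipped no such filter at the time):

```shell
# .gitattributes — route matching files through a (hypothetical) "annex" filter:
#   *.bin filter=annex

# Declare the filter. The clean command runs on `git add` and could replace the
# content with an annex key; the smudge command runs on checkout and could
# resolve the key back to content. The command names are illustrative only.
git config filter.annex.clean  "git-annex-clean %f"
git config filter.annex.smudge "git-annex-smudge %f"
```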

> > I quite like dvcs-autosync (partially because inotify is more simple
> > than fuse, partially because it currently works already quite well) and I'm
> > interested in making it support space efficient storage of big files;
> > from what I've read it should be possible to do this with git-annex
> > (which should not even change how we currently deal with small files,
> > they would still be in git) but I'm still doing my first baby steps
> > with git-annex so I wouldn't know. Advice very welcome..
> All it probably needs at its simplest is something like this
> (excuse the haskell):
> 	toobig <- checkFileSize file
> 	if toobig
> 		then git_annex_add file
> 		else git_add file
> 	git_commit file
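Outside of Haskell, the quoted sketch amounts to something like this (the size cutoff is a hypothetical choice, and the helper names are mine):

```python
import os
import subprocess

SIZE_LIMIT = 10 * 1024 * 1024  # hypothetical cutoff: 10 MiB


def choose_add_command(size, limit=SIZE_LIMIT):
    """Pick the add command the quoted sketch describes:
    annex big files, plain git-add small ones."""
    return ["git", "annex", "add"] if size > limit else ["git", "add"]


def add_and_commit(path):
    """Add `path` via whichever command its size calls for, then commit it."""
    subprocess.check_call(choose_add_command(os.path.getsize(path)) + [path])
    subprocess.check_call(["git", "commit", "-m", "autosync", "--", path])
```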

Unfortunately, I don't think so:
- With dvcs-autosync we often commit "early": the file could still be in the process of being written to, or it could be modified again right after we added it.
  From what I understand, we would need to forbid our users from changing a file after it is added to git-annex, and worse: if git-annex does its "move file, replace file with symlink" trick while the user is still writing to it, this might break things.
- When remote A pulls in the changes from remote B, for dropbox-like behavior it should also automatically:
 * run `git annex get`
 * run `git commit .git-annex/*/*.log`
Does this seem about right?
- Deletes will also need to propagate automatically (see next paragraph); I still need to figure out how best to do that.
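A minimal sketch of that pull-side automation, assuming the sync layer can hand us the lists of changed and deleted paths after a merge (the function and argument names are hypothetical, and the `.git-annex/*/*.log` layout follows the one mentioned above):

```python
import subprocess


def commands_after_pull(changed_paths, deleted_paths):
    """Return the follow-up commands for a merge that touched `changed_paths`
    and removed `deleted_paths` (both repository-relative path lists)."""
    cmds = []
    for p in changed_paths:
        # fetch the annexed content for new or changed files
        cmds.append(["git", "annex", "get", p])
    # record updated location tracking (the .git-annex log files)
    cmds.append(["git", "add", ".git-annex"])
    cmds.append(["git", "commit", "-m", "update location logs"])
    for p in deleted_paths:
        # one possible approach to propagating deletes (see next paragraph)
        cmds.append(["git", "rm", p])
    return cmds


def run_all(cmds):
    for c in cmds:
        subprocess.check_call(c)
```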

> > Another note: files being tracked with git-annex through sharebox or
> > dvcs-autosync or whatever should always have at least 1 "backup copy",
> > so that if the file gets deleted everywhere, it still can be retrieved
> > from somewhere (which raises the interesting question: where will you
> > store this backup copy? introducing a node/repository which will hold
> > backup copies can be considered going to a centralized model; which is
> > something you (Christophe-Marie) try to explicitly avoid, but I think
> > this is not necessarily a problem)
> This is something git annex goes to large lengths to deal with.
> It will enforce N backup copies; it tracks which other repositories
> have which files; it can transfer wanted file contents from other
> repositories in either a decentralized or a centralized manner; the
> other repositories can be on other drives of the same computer, or
> accessible by ssh, or even, now, Amazon S3.

Note that dropbox-like behavior differs from what git-annex users usually expect:
* Usual git-annex behavior: every remote stands on its own; there is no forced "being in sync". Deletes happen only when initiated by the user, and this way you can prevent them from removing a file when it could be the last remaining copy.
* Dropbox-like behavior: remote A removes a file -> *all other remotes* should remove the file too, so that their "working copies" look the same. BUT the file should still be available *somewhere* so that a restore can be initiated (preferably from any of these nodes).

I see two solutions here:
- Centralized: have 1 (or more) remotes that always keep a copy of the files being removed on all other remotes. These would be backup nodes; they don't follow the strict "always in sync" rule that applies to the regular nodes, and instead follow the original git-annex idea more strictly.
- Decentralized: allow users to "remove" files by removing the symlink, but still keep the blob in .git-annex on at least one of the nodes, so that it can be restored from there.
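The decentralized option could be sketched as a policy check, assuming the location tracking data tells each node which nodes currently hold a blob (all names here are hypothetical, mirroring git-annex's numcopies idea):

```python
def may_drop_content(holders, numcopies=1):
    """A node may physically drop its copy only if more than `numcopies`
    nodes hold the content, so a restore stays possible elsewhere."""
    return len(holders) > numcopies


def propagate_delete(node, holders, numcopies=1):
    """Remove the symlink everywhere, but drop the blob on `node` only
    when enough other holders remain."""
    actions = ["remove-symlink"]          # every remote hides the file from its working copy
    if node in holders and may_drop_content(holders, numcopies):
        actions.append("drop-content")    # safe: the blob survives on another node
    return actions
```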


More information about the vcs-home mailing list