automating git-annex relationships

Fri Sep 12 00:55:39 CEST 2014

On Tue, Sep 09, 2014 at 05:48:52PM +0200, martin f krafft wrote:
> also sprach Joey Hess <joey at kitenet.net> [2014-09-09 17:20 +0200]:
> Unfortunately, the requirement for the remote to already exist kinda makes it
> hard to use, for two reasons:
>
>   1. The location depends on both sides of the equation, there's not
>      always a canonical one.

Correct.  That's why semi-intelligent custom code is required, and mr
is a pretty good framework to build that code upon.

> > I use mr to set up remotes.
> >
> > [lib/downloads]
> > checkout =
> >         git clone ssh://joey@git.kitenet.net/srv/git/downloads
> >         cd downloads
> >         git remote add website ssh://joey@git.kitenet.net/srv/web/downloads.kitenet.net/.git
>
> Yeah, like Adam.

Not quite - I rarely hardcode the remote URLs.

> With the use-case of two machines that are more or less identical,
> as well as plenty of SSH hosts out there, which only get a subset of
> the repos, I would have to keep a list of remotes centrally
> maintained, and possibly a different set for each host as remote URI
> depends on the relationship, not a single host. Maybe Git
> rewrite rules can help here, as Adam suggests, but it just gets
> messy.

It can get complex, but it doesn't have to get messy.  I found that my
repositories (around 200 of them and still growing weekly) largely
fall into the following categories:

  - personal configuration files / shortcuts / scripts
  - software projects I wrote
  - 3rd party software which I use or even develop
  - work-related repositories
  - self-organisation ("PIM") repositories
  - media: photos, music, videos etc.

(BTW, not all of them use git-annex, but that may change in the future
once I can get git-annex fulfilling more of my workflows.)

Of course there is some overlap between these categories.  But the
repositories in each category generally follow the same patterns
regarding which remotes they need, and it is possible to build shell
functions which implement those generalized rules, whilst still easily
allowing for exceptions.  In practice, my mr config is at the stage
where most repository definitions have the following line:

    remotes = auto_remotes

and the auto_remotes function just auto-detects the right set of
remotes:

    https://github.com/aspiers/mr-config/blob/master/sh.d/my-git-remotes#L75a
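
To illustrate the idea, here is a much simplified sketch of that kind of
category-driven rule set.  It is not the code at the URL above: the
function names (repo_category, remotes_for, auto_remotes_sketch), the
directory layout, the hosts and the user name are all hypothetical, and
only stand in for whatever conventions a given setup really uses.

```shell
#!/bin/sh
# Hypothetical sketch: the paths, hosts and names below are made up for
# illustration and do not come from any real mr configuration.

# Map a repository path (relative to $HOME) to a category.
repo_category () {
    case "$1" in
        .config/*)      echo config ;;
        src/mine/*)     echo own-software ;;
        src/3rdparty/*) echo third-party ;;
        work/*)         echo work ;;
        media/*)        echo media ;;
        *)              echo unknown ;;
    esac
}

# Print "name url" pairs, one per line, for a category and a repo name.
# Exceptions to the general rules could be added as extra case branches.
remotes_for () {
    category=$1
    name=$2
    case "$category" in
        config|own-software)
            echo "origin ssh://git.example.org/srv/git/$name.git"
            echo "github git@github.com:example-user/$name.git"
            ;;
        work)
            echo "origin ssh://git.work.example.com/srv/git/$name.git"
            ;;
        media)
            echo "nas ssh://nas.example.org/srv/annex/$name"
            ;;
    esac
}

# Run inside a repository: add every expected remote which is missing.
# $1 is the repository path relative to $HOME.
auto_remotes_sketch () {
    remotes_for "$(repo_category "$1")" "$(basename "$1")" |
    while read -r remote url; do
        git config --get "remote.$remote.url" >/dev/null 2>&1 ||
            git remote add "$remote" "$url"
    done
}
```

Keeping the rules in functions which only print "name url" pairs means
they can be inspected without touching any repository; for example,
`remotes_for media photos` just prints the single remote which a media
repository called photos would get under these made-up rules.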

> The reason I am worrying about this is because rather than having
> a single git-annex repo for everything in $HOME, I'd rather have
> a different repo for each project I am involved with

Absolutely!  A single repo sounds like an awful idea.

> and that's several dozens, so there's a lot of repetitive work
> ahead, and much redundancy to be created.

Dozens?  Pah ;-)  Like I said, I'm around the 200 mark now.  I have at
least 10 just for managing my mail (6 of which are predominantly for
configuring mutt).  I've definitely proven that mr+git can handle this
level of complexity (although I did need to extend mr in the process,
enhance GNU Stow, and write quite a bit of integration / automation
code).

> The reason for having separate annexes is quite simply that some of
> them are shared with others, while most are not.

That's one very good reason, but it's not the only one - in fact it's
#5 in my list below.  A few years ago I actually started documenting
my strategy for managing my personal files and infrastructure, and
here's a relevant extract:

  Archive boundaries affect or effect the following:

    1. replication
       - can mirror small archives quickly
       - automatic replication may act on entire archive
       - different data has different redundancy requirements
    2. archive browsing
    3. archive size (affects performance of DVCS)
       - especially git-annex, which can de-duplicate data
    4. quotas / accounting (e.g. df)
       - convenient if the partition is dedicated to the archive
       - awkward for an archive to span multiple partitions
    5. access control (UNIX file permissions, Apache config, intended audience etc.)
    6. isolation (damage limitation)

I probably missed some.

> So yes, I could have a single annex for $HOME and a few annexes for
> sharing, and use views (tags) to select which files appear where.
>
> But views don't (yet) update automatically¹ and there are strict
> limitations on filenames², which make views a nice query tool, but
> not really a tool for persistent use.
>
> ¹) https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=743820
> ²) https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=743794
>
> I would love to have real tagging because I have so many files that
> belong to more than one project… :/
>
> </mind mode="wander">

Yeah, it sounds like you're dreaming the same semantic file system
dream which people have been dreaming for years already:

  http://en.wikipedia.org/wiki/Semantic_file_system

There have been various attempts at solving it, but from what I've
seen, the technology simply isn't there yet.  Most implementations
seem to be user-space layers on top of traditional hierarchical
filesystems, rather than native kernel-based solutions, and that
inevitably limits the effectiveness.