using git to version large files w/o checking them in

Erich Heine sophacles at gmail.com
Thu Sep 30 15:50:30 CEST 2010


It seems to me that ssdeep (http://ssdeep.sourceforge.net/) would be a good
basis for such a tool. It computes hashes for blocks of data in a file,
so you can compare portions of that file to other files, or in this case,
to other versions of the same file[1]. That gives you an ordered list of
hashes for a file, presumably much smaller than the file itself, and in text
form.  If you store some metadata along with this list, like the full
path+filename, permissions and so on, you have everything you need to
describe the file.
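
To make that manifest idea concrete, here is a rough sketch in Python. It
uses fixed-size blocks and SHA-256 rather than ssdeep's context-triggered
piecewise hashing, so the block size, hash choice and manifest layout are
all just placeholders:

    import hashlib
    import os
    import stat

    BLOCK_SIZE = 1024 * 1024  # placeholder; ssdeep picks block boundaries adaptively

    def make_manifest(path):
        # Ordered list of block hashes for the file.
        hashes = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                hashes.append(hashlib.sha256(block).hexdigest())
        st = os.stat(path)
        # One metadata line (path, permissions, size), then the hashes.
        lines = ["%s %o %d" % (path, stat.S_IMODE(st.st_mode), st.st_size)]
        lines.extend(hashes)
        return "\n".join(lines) + "\n"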

Now we just need two components.  The first is a way to store blocks of data.
Since we already have hashes for those blocks, this is essentially a
key-value store. Fortunately such things are pretty popular these days, and
some experimenting should show which one is a good fit for the job.
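
The simplest possible version of that store is just a directory of files
named after their hashes (content-addressed, much like git's own object
store). Everything below, names included, is made up for illustration; a
real key-value store would slot in the same way:

    import os

    def put_block(store_dir, block_hash, data):
        # Content-addressed: the hash is the key, so duplicate blocks are free.
        dest = os.path.join(store_dir, block_hash[:2], block_hash)
        if os.path.exists(dest):
            return  # block already stored
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, 'wb') as f:
            f.write(data)

    def get_block(store_dir, block_hash):
        with open(os.path.join(store_dir, block_hash[:2], block_hash), 'rb') as f:
            return f.read()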

The second would be a set of scripts, run as scm pre- and post-(commit/checkout)
hooks, that convert between the large binaries and the metadata format
described above.  This keeps check-in time down because only blocks not
already in the store need to be added, and checkout could be cheap too
because of locally cached blocks.
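
The checkout side of such a hook could be as small as this, assuming the
(invented) manifest format and store layout from the sketches above:

    import os

    def restore_file(manifest_path, store_dir):
        # Read the manifest: one metadata line, then the ordered block hashes.
        with open(manifest_path) as f:
            header, *hashes = f.read().splitlines()
        path, mode, size = header.split()  # sketch only: assumes no spaces in path
        with open(path, 'wb') as out:
            for h in hashes:
                with open(os.path.join(store_dir, h[:2], h), 'rb') as blk:
                    out.write(blk.read())
        os.chmod(path, int(mode, 8))

The commit-side hook would do the reverse: chunk the file, put each block
into the store, and hand the small manifest to the scm instead of the
binary.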

Just a few gotchas I know about with this approach - any file format that
uses internal compression will have issues in the redundancy department,
since a small edit can change the whole compressed stream; that's just the
nature of compression.  Similarly, the hashing and lookup overhead of even
the fastest versions of this approach could make it too slow.  A final
gotcha is potential hash collisions, but that seems a remote possibility to
me, and there are probably ways around it.

None of those gotchas are enough to make me not want to try it... anyone
else interested?


[1] This is actually the basis for some data de-duplication techniques,
which is where most of this idea comes from. Instead of hashes, such a
thing could also be built on other common identifiers, such as those found
by compression techniques.  Whatever the algorithm for finding blocks, the
rest of the idea stays exactly the same :)

Regards,
Erich

On Wed, Sep 29, 2010 at 11:27 PM, Joey Hess <joey at kitenet.net> wrote:

> My pain point for checking files into git is around 64 mb, and it's
> partly a disk-space based pain, so stuff like bup or git large file
> patches don't help much. So despite having multi-gb git repos,
> I still have no way to version ie, videos.
>
> It occurs to me that I'd be reasonably happy with something that let me
> manage the filenames (and possibly in some cases content checksums) of
> large files without actually storing their content in .git/.
> So I could delete files, move them around, rename them, and add new
> ones, and commit the changes (plus take some other action to transfer
> file contents) to propagate those actions to other checkouts.
>
> Does anyone know of any tools in that space?
>
> --
> see shy jo
>