This is a place to discuss using git-annex. If you need help, advice, or anything, post about it here. (But post bug reports over here.)

[Installation] base-3.0.3.2 requires syb ==0.1.0.2 however syb-0.1.0.2 was excluded because json-0.5 requires syb >=0.3.3
Posted Sat Sep 22 03:36:59 2012

exporting annexed files
Posted Wed Sep 19 16:46:44 2012

Wishlist: automatic reinject
Posted Wed Sep 19 16:46:44 2012

Wishlist: getting the disk used by a subtree of files
Posted Wed Sep 19 16:46:44 2012

Wishlist: logging to file when running as a daemon (for the assistant)
Posted Wed Sep 19 16:46:44 2012

autobuilders for git-annex to aid development
Posted Wed Sep 19 16:46:44 2012

migration to git-annex and rsync
Posted Tue Jul 17 17:54:57 2012

post-copy/sync hook
Posted Tue Jul 17 17:54:57 2012

wishlist: do round robin downloading of data
Posted Tue Jul 17 17:54:57 2012

migrate existing git repository to git-annex
Posted Tue Jul 17 17:54:57 2012

How to expire old versions of files that have been edited?
Posted Tue Jul 17 17:54:57 2012

Error while adding a file "createSymbolicLink: already exists"
Posted Tue Jul 17 17:54:57 2012

Wishlist: Is it possible to "unlock" files without copying the file data?
Posted Tue Jul 17 17:54:57 2012

wishlist: define remotes that must have all files
Posted Tue Jul 17 17:54:57 2012

seems to build fine on haskell platform 2011
Posted Tue Jul 17 17:54:57 2012

Can I store normal files in the git-annex git repository?
Posted Tue Jul 17 17:54:57 2012

version 3 upgrade
Posted Tue Jul 17 17:54:57 2012

Auto archiving
Posted Tue Jul 17 17:54:57 2012

fsck gives false positives
Posted Tue Jul 17 17:54:57 2012

working without git-annex commits
Posted Tue Jul 17 17:54:57 2012

wishlist: traffic accounting for git-annex
Posted Tue Jul 17 17:54:57 2012

tips: special_remotes/hook with tahoe-lafs
Posted Tue Jul 17 17:54:57 2012

wishlist: git-annex replicate
Posted Tue Jul 17 17:54:57 2012

wishlist: push to cia.vc from the website's repo, not your personal one
Posted Tue Jul 17 17:54:57 2012

"permission denied" in fsck on shared repo
Posted Tue Jul 17 17:54:57 2012

Git Annex Transfer Protocols
Posted Tue Jul 17 17:54:57 2012

OSX's haskell-platform statically links things
Posted Tue Jul 17 17:54:57 2012

vlc and git-annex
Posted Tue Jul 17 17:54:57 2012

example of massively disconnected operation
Posted Tue Jul 17 17:54:57 2012

Debugging Git Annex
Posted Tue Jul 17 17:54:57 2012

Please fix compatibility with ghc 7.0
Posted Tue Jul 17 17:54:57 2012

wishlist: git annex status
Posted Tue Jul 17 17:54:57 2012

pure git-annex only workflow
Posted Tue Jul 17 17:54:57 2012

Moving older version's file content without doing checkout
Posted Tue Jul 17 17:54:57 2012

Is an automagic upgrade of the object directory safe?
Posted Tue Jul 17 17:54:57 2012

What can be done in case of conflict
Posted Tue Jul 17 17:54:57 2012

new microfeatures
Posted Tue Jul 17 17:54:57 2012

retrieving previous versions
Posted Tue Jul 17 17:54:57 2012

Wishlist: Ways of selecting files based on meta-information
Posted Tue Jul 17 17:54:57 2012

Need new build instructions for Debian stable
Posted Tue Jul 17 17:54:57 2012

unannex alternatives
Posted Tue Jul 17 17:54:57 2012

hashing objects directories
Posted Tue Jul 17 17:54:57 2012

windows port?
Posted Tue Jul 17 17:54:57 2012

Handling web special remote when content changes?
Posted Tue Jul 17 17:54:57 2012

git-annex communication channels
Posted Tue Jul 17 17:54:57 2012

cloud services to support
Posted Tue Jul 17 17:54:57 2012

Behaviour of fsck
Posted Tue Jul 17 17:54:57 2012

advantages of SHA* over WORM
Posted Tue Jul 17 17:54:57 2012

error in installation of base-4.5.0.0
Posted Tue Jul 17 17:54:57 2012

performance improvement: git on ssd, annex on spindle disk
Posted Tue Jul 17 17:54:57 2012

wishlist: git backend for git-annex
Posted Tue Jul 17 17:54:57 2012

Will git annex work on a FAT32 formatted key?
Posted Tue Jul 17 17:54:57 2012

Recommended number of repositories
Posted Tue Jul 17 17:54:57 2012

Windows support
Posted Tue Jul 17 17:54:57 2012

wishlist: command options changes
Posted Tue Jul 17 17:54:57 2012

Sharing annex with local clones
Posted Tue Jul 17 17:54:57 2012

getting git annex to do a force copy to a remote
Posted Tue Jul 17 17:54:57 2012

How to handle the git-annex branch?
Posted Tue Jul 17 17:54:57 2012

A really stupid question
Posted Tue Jul 17 17:54:57 2012

incompatible versions?
Posted Tue Jul 17 17:54:57 2012

using git annex to merge and synchronize 2 directories (like unison)
Posted Tue Jul 17 17:54:57 2012

brainstorming: git annex push & pull
Posted Tue Jul 17 17:54:57 2012

--print0 option as in "find"
Posted Tue Jul 17 17:54:57 2012

can git-annex replace ddm?
Posted Tue Jul 17 17:54:57 2012

confusion with remotes, map
Posted Tue Jul 17 17:54:57 2012

What happened to the walkthrough?
Posted Tue Jul 17 17:54:57 2012

location tracking cleanup
Posted Tue Jul 17 17:54:57 2012

unlock/lock always gets me
Posted Tue Jul 17 17:54:57 2012

wishlist: git annex put -- same as get, but for defaults
Posted Tue Jul 17 17:54:57 2012

wishlist: simpler gpg usage
Posted Tue Jul 17 17:54:57 2012

Preserving file access rights in directory tree below objects/
Posted Tue Jul 17 17:54:57 2012

git-annex on OSX
Posted Tue Jul 17 17:54:57 2012

syncing non-git trees with git-annex
Posted Tue Jul 17 17:54:57 2012

batch check on remote when using copy
Posted Tue Jul 17 17:54:57 2012

sparse git checkouts with annex
Posted Tue Jul 17 17:54:57 2012

nfs mounted repo results in errors on drop/move
Posted Tue Jul 17 17:54:57 2012

relying on git for numcopies
Posted Tue Jul 17 17:54:57 2012

git-subtree support?
Posted Tue Jul 17 17:54:57 2012

git annex ls / metadata in git annex whereis
Posted Tue Jul 17 17:54:57 2012

wishlist:alias system
Posted Tue Jul 17 17:54:57 2012

OSX's default sshd behaviour has limited paths set
Posted Tue Jul 17 17:54:57 2012

git pull remote git-annex
Posted Tue Jul 17 17:54:57 2012

tell us how you're using git-annex
Posted Tue Jul 17 17:54:57 2012 by Joey

Automatic commit messages for git annex sync
Posted Tue Jul 17 17:54:57 2012

rsync over ssh?
Posted Tue Jul 17 17:54:57 2012

Automatic `git annex get` after invalidation of local files due to external modification?
Posted Tue Jul 17 17:54:57 2012

fail to git annex add some files: getFileStatus: does not exist(v 3.20111231)
Posted Tue Jul 17 17:54:57 2012

Problem with bup: cannot lock refs
Posted Tue Jul 17 17:54:57 2012

wishlist: special remote for sftp or rsync
Posted Tue Jul 17 17:54:57 2012

"git annex lock" very slow for big repo
Posted Tue Jul 17 17:54:57 2012

git tag missing for 3.20111011
Posted Tue Jul 17 17:54:57 2012

Getting started with Amazon S3
Posted Tue Jul 17 17:54:57 2012

Making git-annex less necessary
Posted Tue Jul 17 17:54:57 2012

Problems with large numbers of files
Posted Tue Jul 17 17:54:57 2012

Podcast syncing use-case
Posted Tue Jul 17 17:54:57 2012

git annex add crash and subsequent recovery
Posted Tue Jul 17 17:54:57 2012

Running git checkout by hand is fine, of course.

The underlying problem is that git's index operations have O(N) scalability with regard to the number of files in the repo. So a repo with a whole lot of files will have a big index, and any operation that changes the index, like the git reset this needs to do, has to read in the entire index and write out a new, modified version. It seems that git could be much smarter about its index data structures here, but I confess I don't understand the index's data structures at all. I hope someone takes it on, as git's scalability to the number of files in the repo is becoming a new pain point, now that scalability to large files is "solved". ;)

Still, it is possible to speed this up at git-annex's level. Rather than doing a git reset followed by a git checkout, it can just run git checkout HEAD -- file, and since that's one command, it can be fed into the queueing machinery in git-annex (which exists mostly to work around this git malfeasance), so only a single git command needs to be run to lock multiple files.

I've just implemented the above. In my music repo, this changed locking a CD's worth of files from taking ctrl-c long to 1.75 seconds. Enjoy!

(Hey, this even speeds up the one file case greatly, since git reset -- file is slooooow -- it seems to scan the entire repository tree. Yipes.)
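The single-command behavior described above is easy to reproduce with plain git; this sketch (throwaway repo, made-up file names) shows one `git checkout HEAD -- <files>` restoring several modified files at once:

```shell
set -e
# Throwaway repo to demonstrate single-command restore of multiple files.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email demo@example.com && git config user.name demo
echo one > a.txt && echo two > b.txt
git add a.txt b.txt && git commit -q -m 'add files'
echo scratch > a.txt && echo scratch > b.txt   # simulate modified (unlocked) files
git checkout HEAD -- a.txt b.txt               # one git command restores both
cat a.txt b.txt
```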

Comment by http://joey.kitenet.net/ Tue May 31 18:51:13 2011

@joey

OK, I'll try increasing the stack size and see if that helps.

For reference, I was running:

git annex add .

on a directory containing about 100k files spread over many nested subdirectories. I actually have more than a dozen projects like this that I plan to keep in git annex, possibly in separate repositories if necessary. I could probably tar the data and then archive that, but I like the idea of being able to see the structure of my data even though the contents of the files are on a different machine.

After the crash, running:

git annex unannex

does nothing and returns instantly. What exactly is 'git annex add' doing? I know that it's moving files into the key-value store and adding symlinks, but I don't know what else it does.

--Justin


Are the files identical or different? Today I did something like that with similar, but not identical, directories containing media files, and git happily merged them. But there, the same files had the same content.

Also, make sure you use the same backend. In my case, one of the machines runs Debian stable, so I use the WORM backend, not the SHA backend.

Comment by http://www.joachim-breitner.de/ Sun Dec 18 13:57:33 2011
:)
Comment by http://joey.kitenet.net/ Thu Oct 13 15:36:59 2011

Thanks for the tips so far. I guess a bare-only repo helps, but it is also something that I don’t need (for my use case) and only have to do because git works like this.

Also, if I have a mobile device that I want to push to, then I’d have to have two repositories on the device, as I might not be able to reach my main bare repository when traveling, but I cannot push to the "real" repo on the mobile device from my computer. I guess I am spoiled by darcs, which will happily push to a checked-out remote repository, updating the checkout if possible without conflict.

If I introduce a central bare repository to push to and from, I’d still have to have the other non-bare repos as remotes, so that git-annex will know about them and their files, right?

I’d appreciate a "git annex sync" that does what you described (commit all, pull, merge, push). Especially if it comes in a "git annex sync --all" variant that syncs all reachable repositories.

Comment by http://www.joachim-breitner.de/ Sat Dec 10 16:28:29 2011

Right, I have thought about untrusting all but a few remotes to achieve something similar before and I'm sure it would kind of work. It would be more of an ugly workaround, however, because I would have to untrust remotes that are, in reality, at least semi-trusted. That's why an extra option/attribute for that kind of purpose/remote would be nice.

Obviously I didn't see the scalability problem though. Good Point. Maybe I can achieve the same thing by writing a log parsing script for myself?

Comment by gernot Sun Apr 24 11:20:05 2011

Can't you just use an underscore instead of a colon?

Would it be feasible to split directories dynamically? I.e. start with SHA1_123456789abcdef0123456789abcdef012345678/SHA1_123456789abcdef0123456789abcdef012345678 and, at a certain cut-off point, switch to shorter directory names? This could even be done per subdirectory and based purely on a locally-configured number. Different annexes on different file systems or with different file subsets might even have different thresholds. This would ensure scale while not forcing you to segment from the start. Also, while segmenting with longer directory names means a flatter tree, segments longer than four characters might not make too much sense. Segmenting too often could lead to some directories becoming too populated, bringing us back to the dynamic segmentation.

All of the above would make merging annexes by hand a lot harder, but I don't know if this is a valid use case. And if all else fails, one could merge everything with the unsegmented directory names and start again from there.

-- RichiH
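As a concrete illustration of the segmenting idea above, here is a hypothetical two-level split of a key into directory names; the split points (two characters per level) are made up for the example, not git-annex's actual scheme:

```shell
# Hypothetical segmentation: use the first two pairs of hex digits
# after the backend prefix as nested directory names.
key=SHA1_123456789abcdef0123456789abcdef012345678
hex=${key#SHA1_}                     # strip the backend prefix
d1=$(printf %s "$hex" | cut -c1-2)   # first segment: "12"
d2=$(printf %s "$hex" | cut -c3-4)   # second segment: "34"
dir="$d1/$d2/$key"
echo "$dir"
```

A dynamic threshold, as proposed, would simply change how many segments are produced based on a locally configured directory-size cutoff.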

windows support has everything I know about making a windows port. This badly needs someone who understands Windows to dive into it. The question of how to create a symbolic link (or the relevant Windows equivalent) from Haskell on Windows is a good starting point.

Comment by http://joey.kitenet.net/ Mon Mar 12 06:43:02 2012
I tend to agree that the default output of fsck is not quite right. I often use git annex fsck -q. A progress spinner display is a good idea.
Comment by http://joey.kitenet.net/ Thu Mar 24 17:45:08 2011

+1 for a generic user-configurable backend that a user can put shell commands in, with a disclaimer such that if users hang themselves with misconfiguration then it's their own fault :P

I would love to be able to quickly plugin an irods/sector set of put/get/delete/stat(get info) commands into git-annex to access my private clouds which aren't s3 compatible.

I think that's because the SSH was successful (I entered the password and let it connect), so it got the UUID and put that in the .dot instead. The same UUID (for psychosis) then ended up in two different "subgraph" stanzas, and Graphviz just plotted them together as one node.

Maybe this will clarify:

On psychosis, run "git annex map" and press ^C at the ssh password prompt: map-nossh.dot Map

On psychosis, run "git annex map" and type the correct password: map-goodssh.dot Map

As I see it:

  • psychosis ("localhost") connects to each of its remotes
  • some of them point back to ssh://psychosis
  • psychosis doesn't know that ssh://psychosis is itself, so it tries to connect
  • if successful:
    • psychosis gets put twice in the .dot as if it was two different hosts, one "local" and one "ssh://psychosis"
    • graphviz recognizes it as the same node because the UUID is the same, but graphviz still draws the extra connecting lines
  • if unsuccessful:
    • ssh://psychosis is shown as an additional host that can't be reached

windows support has everything I know about making a windows port. This badly needs someone who understands Windows to dive into it. The question of how to create a symbolic link (or the relevant Windows equivalent) from Haskell on Windows is a good starting point.

Comment by http://joey.kitenet.net/ Mon Mar 12 06:43:02 2012

This was already asked here, but I have a use case where I need to unlock with the files being hardlinked instead of copied (my fs does not support CoW), even though 'git annex lock' is now much faster ;-). The idea is that 1) I want the external world to see my repo "as if" it wasn't annexed (because of its own limitations in dealing with soft links), and 2) I know what I am doing, and am sure that files won't be written to, only read.

My case is: the repo contains a snapshot A1 of a certain remote directory. Later I want to rsync this dir into a new snapshot A2. Of course, I want to transfer only new or changed files, with the --copy-dest=A1 (or --compare-dest) rsync's options. Unfortunately, rsync won't recognize soft-links from git-annex, and will re-transfer everything.

Maybe I'm overusing git-annex ;-) but still, I find it is a legitimate use case, and even though there are workarounds (I don't even remember what I had to do), it would be much more straightforward to have 'git annex unlock --readonly' (or '--readonly-unsafe'?), ... or have rsync take soft-links into account, but I did not see the author ask for microfeatures ideas :) (it was discussed, and only some convoluted workarounds were proposed). Thanks.

The git tweak-fetch hook that I have been developing, and hope will be accepted into git soon, provides some abilities that could be used to make "git pull remote" always merge remote/master. Normally, git can only be configured to do that merge automatically for one remote (ie, origin). But the tweak-fetch hook can flag arbitrary branches as needing merge.

So, it could always flag tracking branches of the currently checked out branch for merge. This would be enabled by some setting, probably, since it's not necessarily the case that everyone wants to auto-merge when they pull like this. (Which is why git doesn't do it by default after all.)

(The tweak-fetch hook will also entirely eliminate the need to run git annex merge manually, since it can always take care of merging the git-annex branch.)

Comment by http://joey.kitenet.net/ Mon Dec 26 18:50:35 2011
http://xfs.org/index.php/XFS_FAQ#Q:Performance:mkfs.xfs_-n_size.3D64k_option
.1 cents: Having IRC would be really nice for seeking quick help. E.g., like I was trying to do just now: Google led me to this page.
You're right -- as long as nothing changes a file without letting the modification time update, editing WORM files is safe.
Comment by http://joey.kitenet.net/ Mon Aug 29 16:10:38 2011

@justin, I discovered that "git annex describe" did what I wanted

@joey, yep that is the behaviour of "tahoe ls", thanks for the tip on removing the file from the remote.

It seems to be working okay for now, the only concern is that on the remote everything is dumped into the same directory, but I can live with that, since I want to track biggish blobs and not lots of small little files.

It might make sense to put this functionality in git annex find. Perhaps a format string with a %s for example.
Comment by http://joey.kitenet.net/ Mon Nov 14 22:48:03 2011
P.S. I see you already fixed the docs - thanks! :)
Comment by http://adamspiers.myopenid.com/ Fri Dec 23 17:24:58 2011
the non-bare repository issue would go away if this was combined with the "alternate" approach to branching. (with the "fleshed out proposal" of branching, this would not work at all for lack of shared commits.)
Comment by http://christian.amsuess.com/chrysn Wed Feb 23 21:48:14 2011

Hey Jimmy: how's this working for you now? I would expect it to go slower and slower since Tahoe-LAFS has an O(N) algorithm for reading or updating directories.

Of course, if it is still fast enough for your uses then that's okay. :-)

(We're working on optimizations of this for future releases of Tahoe-LAFS.)

I'd like to understand the desired behavior of store-hook and retrieve-hook better, in order to see if there is a more efficient way to use Tahoe-LAFS for this.

Off to look for docs.

Regards,

Zooko

Comment by zooko Sat May 14 05:07:17 2011

If tahoe ls outputs only the key, on its own line, and exits nonzero if it's not present, then I think you did the right thing.

To remove a file, use git annex move file --from tahoe and then you can drop it locally.

Comment by http://joey.kitenet.net/ Fri Apr 29 15:24:56 2011
That's my fault, I made a change last night that caused the noop problem. Fixed now.
Comment by http://joey.kitenet.net/ Sun Apr 22 15:23:26 2012

Suppose you do that to repos A and B. Now, in A, you git annex drop a file that is only present in those repositories. A checks B to make sure it still has a copy of the file. It sees the (same) file there, so assumes it's safe to drop. The file is removed from A, also removing it from B, and losing data.

It is possible to configure A and B to mutually distrust one another and avoid this problem, but there will be other problems too.

Instead, git-annex supports using cp --reflink=auto, which on filesystems supporting Copy On Write (eg, btrfs), avoids duplicating contents when A and B are on the same filesystem.

Comment by http://joey.kitenet.net/ Mon Mar 19 18:23:13 2012
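The cp --reflink=auto behavior mentioned above is easy to try: with =auto, GNU cp falls back to a regular copy on filesystems without copy-on-write, so this sketch (throwaway paths) is safe to run anywhere GNU coreutils is available:

```shell
set -e
# Demonstrate cp --reflink=auto: shares disk blocks on CoW filesystems
# (e.g. btrfs), and silently falls back to a normal copy elsewhere.
tmp=$(mktemp -d)
echo "annexed content" > "$tmp/src"
cp --reflink=auto "$tmp/src" "$tmp/dst"
cmp -s "$tmp/src" "$tmp/dst" && echo "identical"
```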

My last comment is a bit confused. The "git fetch" command allows to get all the information from a remote, and it is then possible to merge while being offline (without access to the remote). I would like a "git annex fetch remote" command to be able to get all annexed files from remote, so that if I later merge with remote, all annexed files are already here. And "git annex fetch" could (optionally) call "git fetch" before getting the files.

It also seems that in my last post, I should have written "git annex get --from=remote" instead of "git annex copy --from=remote", because "annex copy --from" copies all files, even if the local repo already has them (is this the case? if yes, when is it useful?)

I've been longing for an automated way of removing references to a remote assuming I know the exact uuid that I want to remove. i.e. I have lost a portable HDD due to a destructive process, I now want to delete all references to copies of data that was on that disk. Unless this feature exists, I would love to see it implemented.
This is now almost completely implemented. See powerful file matching.
Comment by http://joey.kitenet.net/ Mon Sep 19 18:46:35 2011
Good idea! I've made git annex add recover when run a second time.
Comment by http://joey.kitenet.net/ Wed Dec 7 20:54:51 2011
I'm just answering myself: manually fixing symlinks doesn't always work. Sometimes the pre-commit hook will just rewrite the link to some wrong path.
Comment by http://mildred.pip.verisignlabs.com/ Thu Apr 12 15:46:54 2012
Broken last night during upgrade, fixed now, thanks for noticing.
Comment by http://joeyh.name/ Thu May 24 20:15:19 2012

This begs the question: What is the default remote? It's probably not the same repository that git's master branch is tracking (ie, origin/master). It seems there would have to be an annex.defaultremote setting.

BTW, mr can easily be configured on a per-repo basis so that "mr push" copies to somewhere: push = git push; git annex push wherever

Comment by http://joey.kitenet.net/ Mon Apr 4 18:13:46 2011

First, you need a bare git repository that you can push to, and pull from. This simplifies most git workflow.

Secondly, I use mr, with this in .mrconfig:

[DEFAULT]
lib =
        annexupdate() {
                git commit -a -m update || true
                git pull "$@"
                git annex merge
                git push || true
        }

[lib/sound]
update = annexupdate
[lib/big]
update = annexupdate

Which makes "mr update" in repositories where I rarely care about git details take care of syncing my changes.

I also make "mr update" do a "git annex get" of some files in some repositories that I want to always populate. git-annex and mr go well together. :)

Perhaps my annexupdate above should be available as "git annex sync"?

Comment by http://joey.kitenet.net/ Fri Dec 9 22:56:11 2011

Going one step further, a --min-copy could put all files so that numcopies is satisfied. --all could push to all available ones.

To take everything another step further, if it was possible to group remotes, one could act on the groups. "all" would be an obvious choice for a group that always exists, everything else would be set up by the user.

Git-annex's commit hook does not prevent unannex being used. The file you unannex will not be checked into git anymore and will be a regular file again, not a git-annex symlink.

For example, here's a transcript:

joey@gnu:~/tmp>mkdir demo
joey@gnu:~/tmp>cd demo
joey@gnu:~/tmp/demo>git init
Initialized empty Git repository in /home/joey/tmp/demo/.git/
joey@gnu:~/tmp/demo>git annex init demo
init demo ok
joey@gnu:~/tmp/demo>echo hi > file
joey@gnu:~/tmp/demo>git annex add file 
add file ok
(Recording state in git...)
joey@gnu:~/tmp/demo>git commit -m add
[master 64cf267] add
 2 files changed, 2 insertions(+), 0 deletions(-)
 create mode 100644 .git-annex/WORM:1296607093:3:file.log
 create mode 120000 file
joey@gnu:~/tmp/demo>git annex unannex file
unannex file ok
(Recording state in git...)
joey@gnu:~/tmp/demo>ls -l file
-rw-r--r-- 1 joey joey 3 Feb  1 20:38 file
joey@gnu:~/tmp/demo>git commit
[master 78a09cc] unannex
 2 files changed, 1 insertions(+), 2 deletions(-)
 delete mode 120000 file
joey@gnu:~/tmp/demo>ls -l file
-rw-r--r-- 1 joey joey 3 Feb  1 20:38 file
joey@gnu:~/tmp/demo>git status
# On branch master
# Untracked files:
#   (use "git add ..." to include in what will be committed)
#
#   file
nothing added to commit but untracked files present (use "git add" to track)
Comment by http://joey.kitenet.net/ Wed Feb 2 00:39:10 2011

My guess is that psychosis has not pulled the git-annex branch since bacon was set up (or that bacon's git-annex branch has not been pushed to origin). git-annex status only shows remotes present in git-annex:uuid.log. This may be a bug.

The duplicate links in the map I don't quite understand. I only see duplicate links in my maps when I have the same repository configured as two different git remotes (for example, because the same repository can be accessed two different ways). You don't seem to have that in your config.

Comment by http://joey.kitenet.net/ Mon Oct 17 19:01:21 2011

Yes, there is value in layering something over git-annex to use a policy to choose what goes where.

I use mr to update and manage all my repositories, and since mr can be made to run arbitrary commands when doing eg, an update, I use its config file as such a policy layer. For example, my podcasts are pulled into my sound repository in a subdirectory; boxes that consume podcasts run "git pull; git annex get podcasts --exclude="/out/"; git annex drop podcasts/*/out". I move podcasts to "out" directories once done with them (I have yet to teach mpd to do that for me..), and the next time I run "mr update" to update everything, it pulls down new ones and removes old ones.

I don't see any obstacle to doing what you want. May be that you'd need better querying facilities in git-annex (so the policy layer can know what is available where), or finer control (--exclude is a good enough hammer for me, but maybe not for you).

Comment by http://joey.kitenet.net/ Mon Feb 14 22:08:54 2011
It seems the pages that are supposed to be inlined are not being found even though they are in doc/walkthrough/.

That's awesome, I had not heard of git sparse checkouts before.

It does not make sense to tie the log files to the directory of the corresponding files, as then the logs would have to move when the files are moved, which would be a PITA and likely make merging log file changes very complex. Also, of course, multiple files in different locations can point at the same content, which has the same log file. And, to cap it off, git-annex can need to access the log file for a given key without having the slightest idea what file in the repository might point to it, and it would be very expensive to scan the whole repository to find out what that file is in order to lookup the filename of the log file.

The most likely change in git-annex that will make this better is in this todo item -- but it's unknown how to do it yet.

Comment by http://joey.kitenet.net/ Thu Apr 7 16:32:04 2011
My estimates were pretty close -- the new bup special remote type took 133 lines of code, and 2 hours to write. A testament to the flexibility of the special remote infrastructure. :)
Comment by http://joey.kitenet.net/ Fri Apr 8 20:59:37 2011
I would also like a git-annex channel. Would #git-annex on OFTC be ok?

This is an entirely reasonable way to go about it.

However, doing it this way causes files in B to always "win" -- if the same filename is in both repositories with differing content, the version added in B will supersede the version from A. If A has a file that is not in B, a git commit -a in B will commit a deletion of that file.

I might do it your way and look at the changes in B before (or even after) committing them to see if files from A were deleted or changed.

Or, I might just instead keep B in a separate subdirectory in the repository, set up like so:

mv b old_b
git clone a b
cd b
mv ../old_b .
git annex add old_b --not --exclude '*.avi'

Or, a third way would be to commit A to a branch like branchA and B to a separate branchB, and not merge the branches at all.

Comment by http://joey.kitenet.net/ Wed Dec 14 17:31:31 2011
Yes, contents are still considered used while tags or refs refer to them. Including remote tracking branches like remotes/origin/master
Comment by http://joey.kitenet.net/ Thu Feb 9 19:42:28 2012

I've committed the queue flush improvements, so it will buffer up to 10240 git actions, and then flush the queue.

There may be other memory leaks at scale (besides the two I mentioned earlier), but this seems promising. I'm well into running git annex add on a half million files and it's using 18 mb ram and has flushed the queue several times. This run will fail due to running out of inodes for the log files, not due to memory. :)

Comment by http://joey.kitenet.net/ Thu Apr 7 18:09:13 2011

Right, --in goes by git-annex's location tracking information; actually checking if a remote still has the files would make --in too expensive in many cases.

So you need to give gpodder-on-usbdisk current information. You can do that by going to usb-ariaz and doing a git annex fsck. That will find the deleted files and update the location information. Then, back on gpodder-on-usbdisk, git pull usb-ariaz, and then you can proceed with the commands you showed.

Comment by http://joey.kitenet.net/ Sun Nov 27 17:56:31 2011
Thanks, it worked now!
Actually, there is a hint that, while you ran the git annex map on psychosis, it decided to ssh to itself two times. That seems to be where the duplicate links came from, I guess you must have some git remotes you did not show.
Comment by http://joey.kitenet.net/ Mon Oct 17 19:02:50 2011
BTW, git-annex unused will have a problem that not all the symlinks are present. It will suggest dropping content belonging to the excluded symlinks.
Comment by http://joey.kitenet.net/ Thu Apr 7 16:33:30 2011

While having remotes redistribute introduces some obvious security concerns, I might use it.

As remotes support a cost factor already, you can basically implement bandwidth through that.

Extending git annex sync would be nice, although auto-commit does not suit every use case, so it would be better not to couple one to the other.
Comment by http://adamspiers.myopenid.com/ Fri Dec 23 17:14:03 2011

It's ok that git pull does not merge the git-annex branch. You can merge it with git annex merge, or it will be done automatically when you use other git-annex commands.

If you use git pull and git push without any options, the defaults will make git pull and push the git-annex branch automatically.

But if you're in the habit of doing git push origin master, that won't cause the git-annex branch to be pushed (use git push origin git-annex to manually push it then). Similarly, git pull origin master won't pull it. And also, the remote.origin.fetch setting in .git/config can be modified in ways that make git pull not automatically pull the git-annex branch. So those are the things to avoid after upgrade to v3, basically.

Comment by http://joey.kitenet.net/ Wed Aug 17 01:33:08 2011
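The explicit-push advice above can be exercised with plain git; in this sketch the git-annex branch is just a stand-in branch created by hand, and both repos are throwaway:

```shell
set -e
# Sketch: when pushing a named branch, list the git-annex branch too,
# or it will never reach the remote. Throwaway repos for illustration.
tmp=$(mktemp -d) && cd "$tmp"
git init -q --bare origin.git
git init -q work && cd work
git config user.email demo@example.com && git config user.name demo
git remote add origin ../origin.git
echo data > file && git add file && git commit -q -m add
git branch git-annex                  # stand-in for the metadata branch
br=$(git symbolic-ref --short HEAD)   # current branch name
git push -q origin "$br" git-annex    # push both branches explicitly
git ls-remote --heads origin
```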

hmmmm - I'm still not sure I get this.

If I'm using a whole bunch of distributed annexes with no central repo, then I cannot do a git pull remote without either specifying the branch to use or changing the default tracked remote via git branch --set-upstream. The former, as you note, doesn't pull the git-annex branch down; the latter only works one at a time.

The docs read to me as though I ought to be able to do a git pull remote ; git annex get . using any one of my distributed annexes.

Am I doing something wrong? Or is the above correct?

I'm not sure it is worth adding a command for such a small feature, but I would certainly use it: having something like "git annex fetch remote" do "git fetch remote && git annex copy --from=remote", and "git annex push remote" do "git push remote && git annex copy --to=remote". And maybe the same for a pull operation?
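The composition proposed above can be sketched as shell functions; annex_fetch and annex_push are hypothetical helper names, not real git-annex subcommands:

```shell
# Hypothetical wrappers for the proposed commands: sync git metadata
# first, then move annexed file contents in the same direction.
annex_fetch() {
    git fetch "$1" && git annex copy --from="$1"
}
annex_push() {
    git push "$1" && git annex copy --to="$1"
}
```

Usage would then be e.g. `annex_fetch myremote` before going offline, so a later merge finds all annexed contents already present.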
@jimmy: see what to do when you lose a repository. I have not seen a convincing argument that removing the location tracking data entirely serves any purpose
Comment by http://joey.kitenet.net/ Wed Jun 1 20:24:33 2011
using the location tracking information, it should be possible to show the status of other remotes as well. what about supporting --from=... or --all? (thus, among other things, one could determine if a remote has a complete checkout.)
Comment by http://christian.amsuess.com/chrysn Wed Jun 15 08:39:24 2011
See fat support. A bare git repo will have to be used to avoid symlink problems, at least for now. The other problem is that git-annex key files have colons in their filenames.
Comment by http://joey.kitenet.net/ Mon Mar 7 19:13:14 2011
Some other protocols such as S3 for special remotes.
Comment by http://joeyh.name/ Thu May 10 18:18:01 2012

With a lazy branch, I get "git-annex: no branch is checked out". Weird.. my best guess is that it's because this is running at the seek stage, which is unusual, and the value is not used until a later stage and so perhaps the git command gets reaped by some cleanup code before its output is read.

(pipeRead is lazy because often it's used to read large quantities of data from git that are processed progressively.)

I did make it merge both branches, separately. It would be possible to do one single merge, but it's probably harder for the user to recover if there are conflicts in an octopus merge. The order of the merges does not seem to me to matter much, barring conflicts it will work either way. Dealing with conflicts during sync is probably a weakness of all this; after the first conflict the rest of the sync will continue failing.

Comment by http://joey.kitenet.net/ Mon Jan 2 16:01:49 2012

I think:

  • The first extra edge is because bucket had "ssh://psychosis.foo.com/vid/", while bacon had "ssh://psychosis.foo.com/vid" with no trailing slash. That got lost in the hostname/path editing I did, sorry. Maybe those should be considered matching?
  • The second extra edge is because, when running "git annex map" from psychosis, it doesn't recognize the remote's remote URL as pointing back to itself.

For the second case, after the "spurious" SSH, it could still recognize that the repositories are the same by the duplicated annex uuid, which currently shows up in map.dot twice. I wonder what it would take to avoid the spurious SSH -- maybe some config that lists "alternate" URLs that should be considered the same as the current repository? Or actually list URLs in uuid.log? Fortunately, I think this only affects the map, so it's not a big problem.

I agree on the naming suggestions, and that it does not suit everybody. Maybe I’ll think some more about it. The point is: I’m trying to make life easy for those who do not want to manually create some complicated setup, so if it needs configuration, it is already off that track. But turning the current behavior into something people have to configure would also not be well received by users.

Given that "git annex sync" is a new command, maybe it is fine to have this as a default behavior, and offer an easy way out. The easy way out could be one of two flags that can be set for a repo (or a remote):

  • "central", which makes git annex sync only push and pull to and that repo (unless a different remote is given on the command line)
  • "unsynced", which makes git annex sync skip the repo.

Maybe central is enough.

Comment by http://www.joachim-breitner.de/ Sun Dec 18 12:08:51 2011
Either option should work fine, but git gc --aggressive will probably avoid most of git's seeking.
Comment by http://joey.kitenet.net/ Sat Apr 2 17:48:29 2011
git annex get/copy/drop all now support a --auto flag, which makes them only act on files that have not enough or too many copies. This allows for some crude replication; it doesn't take into account which repositories should be filled up more (beyond honoring annex.diskreserve), nor does it try to optimally use bandwidth (beyond honoring configured annex-cost). You have to run it in every repository that you want to participate in the replication, too. But it's probably a Good Enough solution. See automatically managing content.
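
A sketch of the crude replication described above (the numcopies value is just an example; run this in each repository that should participate):

```shell
git config annex.numcopies 2
git annex get --auto .    # fetch only files with fewer than 2 copies
git annex drop --auto .   # drop only files with more than 2 copies
```
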
Comment by http://joey.kitenet.net/ Mon Sep 19 18:54:46 2011
Personally, I would not mind a requirement to keep a local bup repo. I wouldn't want my data in unnecessarily complex setups, anyway. -- RichiH

It is unfortunately not possible to do system-dependent hashing, so long as git-annex stores symlinks to the content in git.

It might be possible to start without hashing, and add hashing for new files after a cutoff point. It would add complexity.

I'm currently looking at a 2 character hash directory segment, based on an md5sum of the key, which splits it into 1024 buckets. git uses just 256 buckets for its object directory, but then its objects tend to get packed away. I sorta hope that one level is enough, but guess I could go to 2 levels (objects/ab/cd/key), which would provide 1048576 buckets, probably plenty, as if you are storing more than a million files, you are probably using a modern enough system to have a filesystem that doesn't need hashing.
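
The idea can be sketched with plain md5sum (illustrative only: git-annex's real scheme uses its own 32-symbol alphabet, which is how 2 characters yield 1024 buckets; plain hex as below yields 256 per level):

```shell
# Derive two hash-directory segments from the md5 of the key.
key="SHA256-s33--5dc45521382f1c7974d9dbfcff1246370404b952"
a=$(printf '%s' "$key" | md5sum | cut -c1-2)
b=$(printf '%s' "$key" | md5sum | cut -c3-4)
echo ".git/annex/objects/$a/$b/$key"
```
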

Comment by http://joey.kitenet.net/ Wed Mar 16 03:13:39 2011

No matter what you end up doing, I would appreciate a git-annex-announce@ list.

I really like the persistence of ikiwiki, but it's not ideal for quick communication. I would be fine with IRC and/or ML. The advantage of a ML over ikiwiki is that it doesn't seem to be as "wasteful" to mix normal chat with actual problem-solving. But maybe that's merely my own perception.

Speaking of RSS: I thought I had added a wishlist item to ikiwiki about providing per-subsite RSS feeds. For example there is no (obvious) way to subscribe to changes in http://git-annex.branchable.com/forum/git-annex_communication_channels/ .

FWIW, I resorted to tagging my local clone of git-annex to keep track of what I've read, already.

-- RichiH

Seems to have a scalability problem: what happens when such a repository becomes full?

Another way to accomplish I think the same thing is to pick the repositories that you would include in such a set, and make all other repositories untrusted. And set numcopies as desired. Then git-annex will never remove files from the set of non-untrusted repositories, and fsck will warn if a file is present on only an untrusted repository.
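
A sketch of that setup (the remote name "offsite" is hypothetical):

```shell
git annex untrust offsite    # don't count this repo toward numcopies
git config annex.numcopies 2 # require 2 copies among trusted repos
git annex fsck               # warns if content exists only on untrusted repos
```
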

Comment by http://joey.kitenet.net/ Sat Apr 23 16:27:13 2011
Ok, after pushing the "git-annex" branch to origin, then "git annex status" knows all repositories on all hosts, so that part makes sense now. Thanks for the tip. But the "git annex map" output hasn't changed.

Well, it should only move files to .git/annex/bad/ if their filesize is wrong, or their checksum is wrong.

You can try moving a file out of .git/annex/bad/ and re-run fsck and see if it fails it again. (And if it does, paste in a log!)

To do that -- Suppose you have a file .git/annex/bad/SHA256-s33--5dc45521382f1c7974d9dbfcff1246370404b952 and you know that file foobar was supposed to have that content (you can check that foobar is a symlink to that SHA value). Then reinject it:

git annex reinject .git/annex/bad/SHA256-s33--5dc45521382f1c7974d9dbfcff1246370404b952 foobar
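
To go the other direction — finding which work-tree file points at a given key — something like this should work (a sketch; substitute your own key):

```shell
key=SHA256-s33--5dc45521382f1c7974d9dbfcff1246370404b952
# list work-tree symlinks whose target ends in that key, skipping .git:
find . -path ./.git -prune -o -type l -lname "*$key" -print
```
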

Comment by http://joey.kitenet.net/ Tue Feb 14 16:58:33 2012
I dunno about parallel downloads -- eek! -- but there is at least room for improvement of what "git annex get" does when there are multiple remotes that have a file, and the one it decides to use is not available, or very slow, or whatever.
Comment by http://joey.kitenet.net/ Sun Apr 3 16:39:35 2011

Another nice thing would be a summary of what is wrong. I.e.

% git annex fsck
[...]
git-annex: 100 total failed
  50 checksum failed
  50 not enough copies

And the same/similar for all other failure modes.

-- RichiH

Thanks for the quick reply :)

I wanted to look up the UUID of the current repo so that I can find out which repo is alive from the collection of repos with the same name. I could have looked for it in .git/config though, since it's pretty obvious. I just looked into the git-annex branch and didn't find it there. Thanks for the tip about using ".". By the way, could there be some kind of warning about using non-unique names for repos? That would make this scenario less likely. Or maybe that is a bad idea given the decentralized nature of git.

By the way, do the trust settings propagate to other repos? If I mark some UUID as untrusted on one computer does it become globally untrusted?

Thanks for the update, Joey. I think you forgot to change libghc-missingh-dev to libghc6-missingh-dev for the copy & paste instructions though.

Also, after having checked that I have everything installed I'm still getting this error:

...
[15 of 77] Compiling Annex            ( Annex.hs, Annex.o )

Annex.hs:19:35:
    Module `Control.Monad.State' does not export `state'
make[1]: *** [git-annex] Error 1
make[1]: Leaving directory `/home/gernot/dev/git-annex'
dh_auto_build: make -j1 returned exit code 2
make: *** [binary] Error 2
Comment by gernot Tue Apr 26 18:56:44 2011
Oh, you'll need profiling builds of various haskell libraries to build with profiling support. If that's not easily accomplished, if you could show me the form of the command you're running, and also how git annex unannex fails, that would be helpful for investigating.
Comment by http://joey.kitenet.net/ Tue Apr 5 18:02:05 2011
I've corrected the missing ANNEX_HASH_* oversight. (It also affected removal, btw.)
Comment by http://joey.kitenet.net/ Fri Apr 29 18:01:04 2011

For future reference, git can recover from a corrupted index file with rm .git/index; git reset --mixed.

Of course, you lose any staged changes that were in the old index file, and may need to re-stage some files.
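
A sandboxed demonstration of the recovery (everything happens in a throwaway repo; note the staged change survives in the work tree but must be re-staged):

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
echo hello > file
git add file
git -c user.name=t -c user.email=t@example.com commit -qm one
echo staged-change >> file && git add file   # something staged
echo garbage > .git/index                    # simulate corruption

rm .git/index
git reset --mixed -q    # rebuild the index from HEAD
git status --short      # "file" shows as modified, unstaged
```
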

Comment by http://joey.kitenet.net/ Sun Apr 3 01:48:57 2011

The .git-annex/ directory is what really needs hashing.

Consider that when git looks for changes in there, it has to scan every file in the directory. With hashing, it should be able to more quickly identify just the subdirectories that contained changed files, by the directory mtimes.

And the real kicker is that when committing there, git has to create a tree object containing every single file, even if only 1 file changed. That will be a lot of extra work; with hashed subdirs it will instead create just 2 or 3 small tree objects leading down to the changed file. (Probably these trees both pack down to similar size pack files, not sure.)

Comment by http://joey.kitenet.net/ Wed Mar 16 04:06:19 2011
Nevermind, found it. (git-annex 0.08)
Thanks a lot. I tried various howtos around the net, but none of them worked; yours did. (I tried it in one of the copies of the broken repo which I keep around for obvious reasons).
BTW re your Tweet.. I was so happy to be able to use 'c i a' in Crypto.hs. :)
Comment by http://joey.kitenet.net/ Sun Apr 29 02:41:38 2012

Ah - very good to know that recovery is easier than the method I used.

I wonder if it could be made a feature to automatically and safely recover/resume from an interrupted git add?

It would be clearer to call "git-annex-master" "synced/master" (or really "synced/$current_branch"). That does highlight that this method of syncing is not particularly specific to git-annex.

I think this would be annoying to those who do use a central bare repository, because of the unnecessary pushing and pulling to other repos, which could be expensive to do, especially if you have a lot of interconnected repos. So having a way to enable/disable it seems best.

Maybe you should work up a patch to Command/Sync.hs, since I know you know haskell :)

Comment by http://joey.kitenet.net/ Tue Dec 13 20:53:23 2011
On second thought maybe the current behaviour is better than what I am suggesting that the force command should do. I guess it's better to be safe than sorry.
+1 for this feature, I've been longing for something like this other than rolling my own perl/shell scripts to parse the outputs of "git annex whereis ." to see how many files are on my machine or not.

OMG, my first sizable haskell patch!

So trying this out..

In each repo I want to sync, I first git branch synced/master

Then in each repo, I found I had to pull from each of its remotes, to get the tracking branches that defaultSyncRemotes looks for to know those remotes are syncable. This was the surprising thing for me, I had expected sync to somehow work out which remotes were syncable without my explicit pull. And it was not very obvious that sync was not doing its thing before I did that, since it still does a lot of "stuff".

Once set up properly, git annex sync fetches from each remote, merges, and then pushes to each remote that has a synced branch. Changes propagate around even when some links are one-directional. Cool!

So it works fine, but I think more needs to be done to make setting up syncing easier. Ideally, all a user would need to do is run "git annex sync" and it syncs from all remotes, without needing to manually set up the synced/master branch.

While this would lose the ability to control which remotes are synced, I think that being able to git annex sync origin and only sync from/to origin is sufficient, for the centralized use case.


Code review:

Why did you make branch strict?

There is a bit of a bug in your use of Command.Merge.start. The git-annex branch merge code only runs once per git-annex run, and often this comes before sync fetches from the remotes, leading to a push conflict. I've fixed this in my "sync" branch, along with a few other minor things.

mergeRemote merges from refs/remotes/foo/synced/master. But that will only be up-to-date if git annex sync has recently been run there. Is there any reason it couldn't merge from refs/remotes/foo/master?

Comment by http://joey.kitenet.net/ Fri Dec 30 21:49:06 2011

git annex status now includes a list of all known repositories.

Yes, trust settings propagate on git push/pull like any other git-annex information.

Comment by http://joey.kitenet.net/ Fri Sep 30 16:47:27 2011
encryption=shared is now supported
Comment by http://joey.kitenet.net/ Sun Apr 29 18:04:13 2012
Cool! I just tried adding tahoe-lafs as a remote, and it wasn't too hard.
Sorry if I am not clear. What I meant to ask: if I have 2 git repositories which are not special remotes, and I am transferring annexed file content between them using a git-annex command (move or copy), which protocol is used to transfer the content? Does it use git-send-pack/git-receive-pack or some other protocol?

These are good examples; I think you've convinced me at least for upgrades going forward after v2. I'm not sure we have enough users and outdated git-annex installations to worry about it for v1.

(Hoping such upgrades are rare anyway.. Part of the point of changes made in v2 was to allow lots of changes to be made later w/o needing a v3.)

Update: Upgrades from v1 to v2 will no longer be handled automatically now.

Comment by http://joey.kitenet.net/ Fri Mar 18 00:38:51 2011

What a good idea!

150 lines of haskell later, I have this:

# git annex status
supported backends: WORM SHA1 SHA256 SHA512 SHA224 SHA384 SHA1E SHA256E SHA512E SHA224E SHA384E URL
supported remote types: git S3 bup directory rsync hook
local annex keys: 32
local annex size: 58 megabytes
total annex keys: 38158
total annex size: 6 terabytes (but 1632 keys have unknown size)
backend usage: 
    SHA1: 1789
    WORM: 36369
Comment by http://joey.kitenet.net/ Tue May 17 01:15:10 2011

Git can actually push into a non-bare repository, so long as the branch you change there is not a checked out one. Pushing into remotes/$foo/master and remotes/$foo/git-annex would work, however determining the value that the repository expects for $foo is something git cannot do on its own. And of course you'd still have to git merge remotes/$foo/master to get the changes.
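
A sandboxed sketch of such a push, using plain git only ("laptop" stands in for whatever name $foo the target repository uses for us, which you have to know in advance):

```shell
set -e
cd "$(mktemp -d)"
git init -q desktop                 # non-bare target
git init -q laptop && cd laptop
git symbolic-ref HEAD refs/heads/master
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m init
git remote add desktop ../desktop

# allowed, because refs/remotes/* is never the checked-out branch:
git push -q desktop master:refs/remotes/laptop/master

# on the desktop, the changes still need an explicit merge:
git -C ../desktop branch -r         # lists laptop/master
```
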

Yes, you still keep the non-bare repos as remotes when adding a bare repository, so git-annex knows how to get to them.

I've made git annex sync run the simple script above. Perhaps it can later be improved to sync all repositories.

Comment by http://joey.kitenet.net/ Sat Dec 10 19:43:04 2011

Cool, that seems to make things work as expected, here's an updated recipe

    git config annex.tahoe-store-hook 'tahoe mkdir tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2 && tahoe put $ANNEX_FILE tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY'
    git config annex.tahoe-retrieve-hook 'tahoe get tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY $ANNEX_FILE'
    git config annex.tahoe-remove-hook 'tahoe rm tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY'
    git config annex.tahoe-checkpresent-hook 'tahoe ls tahoe:$ANNEX_HASH_1/$ANNEX_HASH_2/$ANNEX_KEY 2>&1 || echo FAIL'
    git annex initremote library type=hook hooktype=tahoe encryption=none
    git annex describe 1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a library

It just needs some of the output redirected to /dev/null.

(I updated this comment to fix a bug. --Joey)

Thanks, that works perfectly!
Comment by http://cgray.myopenid.com/ Sun Nov 27 22:10:44 2011

It makes sense to have separate repositories when you have well-defined uses for them.

I have a separate repository just for music and podcasts, which I can put various places where I have no need of the overhead of a tree of other files.

If you're using it for whatever arbitrary large files you accumulate, I find it's useful to have them in one repository. This way I can rearrange things as makes sense. It might make sense to have "photos" and "isos" as categories today, but next year you might prefer to move those under 2011/{photos,isos}. It would certainly make sense to have different repositories for home, work, etc.

How to split repositories up for a home directory is a general problem that the vcs-home project has surely considered at one time or another.

Comment by http://joey.kitenet.net/ Fri Nov 4 19:59:24 2011

git annex sync only syncs git metadata, not file contents, and metadata is not stored on S3, so it does nothing (much).

git annex move . --to s3 or git annex copy . --to s3 is the right way to send the files to S3. I'm not sure why you say it's not working. I'd try it but Amazon is not letting me sign up for S3 again right now. Can you show what goes wrong with copy?

Comment by http://joeyh.name/ Tue May 29 19:09:50 2012

I've made git-annex-shell run the git hooks/annex-content after content is received or dropped.

Note that the clients need to be running at least git-annex version 3.20120227 , which runs git-annex-shell commit, which runs the hook.

Comment by http://joey.kitenet.net/ Wed Mar 14 16:23:25 2012

thanks Joey,

is it possible to run some git annex command that tells me, for a specific directory, which files are available in an other remote? (and which remote, and which filenames?) I guess I could run that, do my own policy thingie, and run git annex get for the files I want.

For your podcast use case (and some of my use cases), don't you think git [annex] might actually be overkill? What value does git annex give over a simple rsync/rm script? Such a script wouldn't even need a data store for its state, unlike git. It seems simpler and cleaner to me.

For the mpd thing, check http://alip.github.com/mpdcron/ (bad project name; it's a plugin-based "event handler"). You should be able to write a simple mpdcron plugin that does what you want (or even interface with mpd yourself from perl/python/.. to use its idle mode to get events).

Dieter

Comment by http://dieter-be.myopenid.com/ Wed Feb 16 21:32:04 2011

The symlinks are in the git repository. So if the rsync damaged one, git would see the change. And nothing that happens to the symlinks can affect fsck.

git-annex does not use hard links at all.

fsck corrects mangled file permissions. It is possible to screw up the permissions so badly that it cannot see the files at all (ie, chmod 000 on a file under .git/annex/objects), but then fsck will complain and give up, not move the files to bad. So I don't see how a botched rsync could result in fsck moving a file with correct content to bad.

Comment by http://joey.kitenet.net/ Wed Feb 15 15:22:56 2012
Thanks. Is git-annex using the same protocols as normal git to transfer content between normal git repositories?
If the subdirectory has a .git, then it's a separate git repo, and inside the directory, all git (and git-annex) commands in it will operate on that nested repo and ignore the outside one.
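
This is plain git behavior, easy to see in a sandbox:

```shell
# Commands inside a nested repo operate on it, not the outer one.
set -e
cd "$(mktemp -d)"
git init -q outer
cd outer
git init -q sub
cd sub
git rev-parse --show-toplevel   # path ends in outer/sub, not outer
```
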
Comment by http://joeyh.name/ Wed May 30 00:54:38 2012

I'm not currently planning to support sharedRepository perms on special remotes. I suppose I could be convinced otherwise, it's perhaps doable for the ones you mention (rsync might be tricky). (bup special remote already supports it of course.)

thanks for the use case!

Comment by http://joey.kitenet.net/ Mon Apr 23 14:35:39 2012

All that git annex fsck does is checksum the file and move it away if the checksum fails.

If bad data was somehow read from the disk that one time, what you describe could occur. I cannot think of any other way it could happen.

Comment by http://joey.kitenet.net/ Tue Feb 14 22:57:29 2012

Thanks for the fast response!

Unfortunately, I had another problem:

Building git-annex-3.20120419...
Utility/libdiskfree.c: In function ‘diskfree’:
Utility/libdiskfree.c:61:0: warning: ‘statfs64’ is deprecated (declared at /usr/include/sys/mount.h:379)
[  6 of 157] Compiling Build.SysConfig  ( Build/SysConfig.hs, dist/build/git-annex/git-annex-tmp/Build/SysConfig.o )
[ 15 of 157] Compiling Utility.Touch    ( dist/build/git-annex/git-annex-tmp/Utility/Touch.hs, dist/build/git-annex/git-annex-tmp/Utility/Touch.o )

Utility/Touch.hsc:118:21: Not in scope: `noop'

cabal: Error: some packages failed to install:
git-annex-3.20120419 failed during the building phase. The exception was:
ExitFailure 1

I also tried to look for information on the internet, and I did not find anything useful. Any idea of what happened?

Thanks again!

Before dropping unused items, sometimes I want to check the content of the files manually. But currently, from e.g. a sha1 key, I don't know how to find the corresponding file, except with 'find .git/annex/objects -type f -name 'SHA1-s1678--70...'', which is too slow (I'm in the case where "git log --stat -S'KEY'" won't work, either because it is too slow or it was never committed). By the way, is it documented somewhere how to determine the 2 (nested) sub-directories in which a given (by name) object is located?

So I would like 'git-annex unused' be able to give me the list of paths to the unused items. Also, I would really appreciate a command like 'git annex unused --log NUMBER [NUMBER2...]' which would do for me the suggested command "git log --stat -S'KEY'", where NUMBER is from the 'git annex unused' output. Thanks.
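
Until there's built-in support, a somewhat faster variant of that find is to match on the key directory instead of statting every file, since the object is stored as objects/XX/YY/KEY/KEY (sandboxed sketch; the key below is hypothetical):

```shell
set -e
cd "$(mktemp -d)"
key='SHA1-s1678--70aa30f1db55b2e5a5c87ea32b8ac70e'   # hypothetical key
mkdir -p ".git/annex/objects/ab/cd/$key"
touch ".git/annex/objects/ab/cd/$key/$key"

# stop at the key directory (depth 3) instead of scanning all files:
find .git/annex/objects -mindepth 3 -maxdepth 3 -type d -name 'SHA1-s1678--70*'
```
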

You say you started the repo with "git init --shared" .. but that's really meant for bare repositories, which can have several users pushing into them, not for a non-bare repository.

The strange mode on the directories "dr-x--S---" and files "-r--r-----" must be due to your umask setting though. My umask is 022 and the directories and files under .git/annex/objects are "drwxr-xr-x" and "-r--r--r--", which allows anyone to read them unless an upper directory blocks it -- and with this umask, none do unless I explicitly remove permissions from one to lock down a repository.

About mpd, the obvious fix is to run mpd not as a system user but as yourself. I put "@reboot mpd" in my crontab to do this.

Comment by http://joey.kitenet.net/ Mon Jan 23 19:00:40 2012

You get a regular git merge conflict, which can be resolved in any of the regular ways, except that conflicting files are just symlinks.

Example:

$ git pull
...
Auto-merging myfile
CONFLICT (add/add): Merge conflict in myfile
Automatic merge failed; fix conflicts and then commit the result.
$ git status
# On branch master
# Your branch and 'origin/master' have diverged,
# and have 1 and 1 different commit(s) each, respectively.
#
# Unmerged paths:
#   (use "git add/rm ..." as appropriate to mark resolution)
#
#   both added:         myfile
#
no changes added to commit (use "git add" and/or "git commit -a")
$ git add myfile
$ git commit -m "took local version of the conflicting file"
Comment by http://joey.kitenet.net/ Tue Dec 20 23:07:25 2011
FWIW, I wanted to suggest exactly the same thing.

This message comes from ghc's runtime memory manager. Apparently your ghc defaults to limiting the stack to 80 mb. Mine seems to limit it slightly higher -- I have seen haskell programs successfully grow as large as 350 mb, although generally not intentionally. :)

Here's how to adjust the limit at runtime, obviously you'd want a larger number:

# git-annex +RTS -K100 -RTS find
Stack space overflow: current size 100 bytes.
Use `+RTS -Ksize -RTS' to increase it.

I've tried to avoid git-annex using quantities of memory that scale with the number of files in the repo, and I think in general successfully -- I run it on 32 mb and 128 mb machines, FWIW. There are some tricky cases, and haskell makes it easy to accidentally write code that uses much more memory than would be expected.

One well known case is git annex unused, which has to build a structure of every annexed file. I have been considering using a bloom filter or something to avoid that.

Another possible case is when running a command like git annex add, and passing it a lot of files/directories. Some code tries to preserve the order of your input after passing it through git ls-files (which destroys ordering), and to do so it needs to buffer both the input and the result in ram.

It's possible to build git-annex with memory profiling and generate some quite helpful profiling data. Edit the Makefile and add this to GHCFLAGS: -prof -auto-all -caf-all -fforce-recomp then when running git-annex, add the parameters: +RTS -p -RTS , and look for the git-annex.prof file.

Comment by http://joey.kitenet.net/ Tue Apr 5 17:46:03 2011

Personally, I deal with this problem by having a directory, or directories where I put files that I want to have on my partial checkout laptop, and run git annex get in that directory.

It's not a perfect solution, but I don't know that a perfect solution exists.

Comment by http://joeyh.name/ Mon Jun 4 19:56:05 2012

Nice! So if I understand correctly, 'git reset -- file' was there to discard staged (but not committed) changes made to 'file' before checking out, so that it is equivalent to directly running 'git checkout HEAD -- file'? I'm curious about the "queueing machinery in git-annex": does it end up calling one git command with multiple files as arguments? Does it correspond to the message "(Recording state in git...)"? Thanks!

git-annex doesn't transfer git content between git repositories. You use git for that. Well, git-annex sync can run a few git commands for you to do it.
Comment by http://joeyh.name/ Thu May 10 18:51:56 2012
Everything is done over ssh unless both repos are on the same system (or unless you NFS mount a repo)
Comment by http://joey.kitenet.net/ Sun Mar 6 15:59:37 2011
They are not. See upgrades
Comment by http://joey.kitenet.net/ Wed Jun 8 00:40:54 2011

The web special remote will happily download files when you git annex get even if they don't have the same content that they did before.

git annex fsck will detect such mismatched content to the best ability of the backend (so checking the SHA1, or verifying the file size at least matches if you use WORM), and complain and move such mismatched content aside. git annex addurl --fast deserves a special mention; it uses a backend that only records the URL, and so if it's used, fsck cannot later detect such changes. Which might be what you want..

For most users, this is one of the reasons git annex untrust web is a recommended configuration. Once you untrust the web, any content you download from the web will be kept around in one of your own git-annex repositories, rather than the untrustworthy web being the only copy.

Comment by http://joey.kitenet.net/ Tue Jan 3 00:57:55 2012

git's code base makes lots of assumptions, hardcoding the size of the hash, etc. (grep its source for the magic numbers 40 and 42...). I'd like to see git get parameterised hashes. SHA1 insecurity may eventually push it in that direction. However, when I asked the git developers about this at the GitTogether last year, there were several ideas floated that would avoid parameterisation, and a lot of good thoughts about problems parameterised hashes would cause.

Moving data into git proper would still leave the problems unique to large data of not being able to store it all on every clone. Which means a git-annex like thing is needed to track where the data resides and move it around.

(BTW, in markdown, you separate paragraphs with blank lines. Like in email.)

Comment by http://joeyh.name/ Tue May 8 18:22:12 2012

I have no experience using git-subtree, but as long as the home repository has the work one as a git remote, it will automatically merge work's git-annex branch with its own git-annex branch, and so will know what files are present at work, and will be able to get them.

Probably you won't want to make work have home as a remote, so work's git-annex will not know which files home has, nor will it be able to copy files to home (but home will be able to copy files to work).

Comment by http://joey.kitenet.net/ Tue Jan 3 17:00:53 2012
Sorry for commenting on my own question ... I think I just figured out that git annex unused does in fact do what I want. When I tried it, it just didn't show the obsolete versions of the files I edited because I hadn't yet synchronized all repositories, so that was why the obsolete versions were still considered used.
Comment by http://peter-simons.myopenid.com/ Thu Feb 9 18:53:00 2012
I have updated the instructions.
Comment by http://joey.kitenet.net/ Tue Apr 26 15:27:49 2011

Not-so-subtle sarcasm taken and acknowledged :)

Arguably, git-annex should know about any local limits and not have them implemented via mr from the outside. I guess my concern boils down to having git-annex do the right thing all by itself with minimal user interaction. And while I really do appreciate the flexibility of chaining commands, I am a firm believer in exposing the common use cases as easily as possible.

And yes, I am fully aware that not all annexes are created equal. Case in point: I would never use git annex pull on my laptop, but I would use git annex push extensively.

Another option that would please the naive user without hindering the more advanced user: "git annex init", by default, creates a synced/master branch. "git annex sync" will pull from every synced/master branch it finds, and also push to any synced/master branch it finds, but will not create any. So by default (at least for new users), this provides simple one-step syncing.

Advanced users can disable this per-repo by just deleting the synced/master branch. Presumably the logic will be: Every repo that should not be pushed to, because it has access to some central repo, should not have a synced/master branch. Every other repo, including the (or one of the few) central repos, will have the branch.

This is not the most expressive solution, as it does not allow configuring syncing between arbitrary pairs of repos, but it feels like a good compromise between that and simplicity and transparency.

I think it's about time that I provide less talk and more code. I’ll see when I find the time :-)

Comment by http://www.joachim-breitner.de/ Mon Dec 19 22:56:26 2011

I've just tried to use the ANNEX_HASH_ variables; here is an example of my configuration:

    git config annex.tahoe-store-hook 'tahoe mkdir $ANNEX_HASH_1 && tahoe put $ANNEX_FILE tahoe:$ANNEX_HASH_1/$ANNEX_KEY'
    git config annex.tahoe-retrieve-hook 'tahoe get tahoe:$ANNEX_HASH_1/$ANNEX_KEY $ANNEX_FILE'
    git config annex.tahoe-remove-hook 'tahoe rm tahoe:$ANNEX_HASH_1/$ANNEX_KEY'
    git config annex.tahoe-checkpresent-hook 'tahoe ls tahoe:$ANNEX_HASH_1/$ANNEX_KEY 2>&1 || echo FAIL'
    git annex initremote library type=hook hooktype=tahoe encryption=none
    git annex describe 1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a library

It seems to work quite well for me now. I did run across this when I tried to drop a file locally, leaving the file on my remote:

    jtang@x00:/tmp/annex3 $ git annex drop .
    drop frink.sh (checking library...) (unsafe)
      Could only verify the existence of 0 out of 1 necessary copies
      Try making some of these repositories available:
        1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a  -- library
      (Use --force to override this check, or adjust annex.numcopies.)
    failed
    drop t/frink.jar (checking library...) (unsafe)
      Could only verify the existence of 0 out of 1 necessary copies
      Try making some of these repositories available:
        1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a  -- library
      (Use --force to override this check, or adjust annex.numcopies.)
    failed
    git-annex: 2 failed
    1|jtang@x00:/tmp/annex3 $

I do know that the files exist in my library, as I have just inserted them. It seemed to work when I didn't have the hashing; it appears that checkpresent doesn't pass the ANNEX_HASH_* variables (from the limited debugging I did).

Whups, the comment above got stuck in moderation queue for 27 days. I will try to check that more frequently.

In the meantime, I've implemented "git annex whereis" -- enjoy!

I find keeping my podcasts in the annex useful because it allows me to download individual episodes or podcasts easily when only low bandwidth is available (i.e., dialup), or over sneakernet. And it generally keeps everything organised.

Comment by http://joey.kitenet.net/ Wed Mar 16 03:01:17 2011

From what you say, it seems that vlc is following the symlink to the movie content, and then looking for subtitles next to the file the symlink points to. It would have to explicitly realpath the symlink to have this behavior, and this sounds like a misfeature.. perhaps you could point out to the vlc people the mistake in doing so?

There's a simple use-case where this behavior is obviously wrong, without involving git-annex. Suppose I have a movie, and one version of subtitles for it, in directory foo. I want to modify the subtitles, so I make a new directory bar, symlink the large movie file from foo to save space, and copy over and edit the subtitles from foo. Now I run vlc in bar to test my new subtitles. If it ignores the locally present subtitles and goes off looking for the ones in foo, I say this is broken behavior.

Comment by http://joey.kitenet.net/ Fri Dec 23 16:16:19 2011

After some thought, perhaps the default fsck output should be at least machine readable and copy-and-pasteable, i.e.:

    $ git annex fsck
    Files with errors:

        file1
        file2

So I can then copy the list of borked files and paste it into a for loop in my shell to recover the files. It's just an idea.

Specifying the UUID was supposed to work, I think I broke it a while ago. Fixed now in git.

I'm not sure why you need to look up the UUID of the current repository. You can always refer to the current repository as ".". Anyway, the UUID of the current repository is in .git/config, or use git config annex.uuid.

Comment by http://joey.kitenet.net/ Fri Sep 30 06:55:34 2011

Thanks! git annex addurl --fast does exactly what I want it to do.

Wow. Yet another special backend for me to play with. :-)

Comment by http://a-or-b.myopenid.com/ Tue Jan 3 02:49:18 2012

Probably more like 150 lines of haskell. Maybe just 50 lines if the bup repository is required to be on the same computer as the git-annex repository.

Since I do have some repositories where I'd appreciate this level of assurance that data not be lost, it's mostly a matter of me finding a free day.

Comment by http://joey.kitenet.net/ Mon Mar 28 20:05:13 2011
Push access to the non-code bits of git-annex' ikiwiki would be very welcome indeed. Given the choice, I would rather edit everything in Vim than in a browser. -- RichiH
Ok. This helped me a lot. Thank you

Hmm, so it seems there is almost a way to do this already.

I think the one thing that isn't currently possible is to have 'plain' ssh remotes: basically something just like the directory remote, but able to take an ssh user@host/path url. Something like sshfs could be used to fake this, but for things like fsck you would want to do the sha1 calculations on the remote host.

i'll comment on each of the points separately, well aware that even a single little leftover issue can show that my plan is faulty:

  • force removal: well, yes -- but the file that is currently force-removed on the laptop could just as well be the last of its kind itself. i see the problem, but am not sure if it's fatal (after all, if we rely on out-of-band knowledge when forcing something, we could just as well ask a little more)
  • non-bare repos: pushing is tricky with non-bare repos now just as well; a post-commit hook could auto-accept counter changes. (but pushing causes problems with counters anyway, doesn't it?)
  • merging: i'd have them auto-merge. git-annex will have to check the validity of the current state anyway, and a situation in which a counter-decrementing commit is not a fast-forward one would be reverted in the next step (or upon discovery, in case the next step never took place).
  • reverting: my wording was bad as "revert" is already taken in git-lingo. the correct term for what i was thinking of is "reset". (as the commit could not be pushed, it would be rolled back completely).
    • we might have to resort to reverting, though, if the commit has already been pushed to a first server of many.
  • hidden files: yes, this solves pre-removal dropping :-)
  • round trips: it's not the number of servers, it's the number of files (up to 30k in my case). it seems to me that an individual request was made for every single file i wanted to drop (that would be N*M roundtrips for N affected servers and M files, and N roundtrips with git managed numcopies)

all together, it seems to be a bit more complicated than i imagined, although not completely impossible. a combination of hidden files and maybe a simpler reduction of the number of requests might though achieve the important goals as well.

Comment by http://christian.amsuess.com/chrysn Wed Feb 23 16:43:59 2011

Besides the cost values, annex.diskreserve was recently added. (But is not available for special remotes.)

I have held off on adding high-level management stuff like this to git-annex, as it's hard to make it generic enough to cover all the use cases.

A low-level way to accomplish this would be to have a way for git annex get and/or copy to skip files when numcopies is already satisfied. Then cron jobs could be used.
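For instance, a cron job could then periodically top up an archive machine. This is only a hypothetical sketch: the --auto switch shown here (meaning "skip files whose numcopies is already satisfied") did not exist at the time of writing, and the /srv/annex path is made up.

```shell
# Hypothetical crontab entry for an archive machine; --auto is assumed to
# skip files whose numcopies target is already met.
0 3 * * * cd /srv/annex && git annex get --auto --quiet .
```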

Comment by http://joey.kitenet.net/ Sat Apr 23 16:22:07 2011

As joey points out the problem is B overwrites A, so that any files in A that aren't in B will be removed. But the suggestion to keep B in a separate subdirectory in the repository means I'll end up with duplicates of files in both A and B. What I want is to have the merged superset of all files from both A and B with only one copy of identical files.

The problem is that unique symlinks in A/master are deleted when B/master is merged in. To add back the deleted files after the merge you can do this:

    git checkout master~1 deleted_file_name                                      # check out a single deleted file
    git diff master~1 master --name-only --diff-filter=D                         # list all files deleted between master~1 and master
    git diff master~1 master --name-only --diff-filter=D | xargs git checkout master~1   # check out all the deleted files

Once the first merge has been done after set up, you can continue to make changes to A and B and future merges won't require accounting for deleted files in this way.

I'll give it a try as soon as I get rid of this:

    % git annex fsck
    fatal: index file smaller than expected
    fatal: index file smaller than expected
    % git status
    fatal: index file smaller than expected
    %

And no, I am not sure where that is coming from all of a sudden... (It might have to do with a hard lockup of the whole system due to a faulty hdd I tested, but I hadn't done anything to it for ages before that lock-up. So meh. Also, this is probably off topic here.)

Richard

Took me a minute to see this is not about descriptive commit messages in the git-annex branch, but about the "git-annex automatic sync" message that is used when committing any changes currently on the master branch before doing the rest of the sync.

So... it would be pretty easy to ls-files the relevant files before the commit and generate a message. Although this would roughly double the commit time in a large tree, since it would walk the whole tree again (git commit -a already does so once). Smarter approaches could be faster: perhaps it could find unstaged files, stage them, generate the message, and then git commit the staged changes.
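The stage-first idea could be sketched roughly like this (a toy illustration, not git-annex's implementation; the message format and file names are invented):

```shell
# Sketch: stage everything first, then derive the commit message from the
# files that were actually staged, avoiding a second full-tree walk.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo hi > a.txt
echo hi > b.txt
git add -A                       # stage unstaged changes (one tree walk)
msg="git-annex automatic sync: $(git diff --cached --name-only | tr '\n' ' ')"
git commit -q -m "$msg"          # commit with the generated summary
```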

But, would this really be useful? It's already easy to get git log to show a summary of the changes made in such a commit. So it's often seen as bad form to unnecessarily mention which files a commit changes in the commit message.

Perhaps more useful would be to expand the current message with details like where the sync is being committed, or what remotes it's going to sync from, or something like that.

Comment by http://joey.kitenet.net/ Fri Feb 24 20:52:51 2012
Okay, I see, but is git annex get --auto . going to import all those files from the work remote into my home if the master branch of that remote isn't merged?
Comment by http://peter-simons.myopenid.com/ Tue Jan 3 19:17:36 2012

My goal for git-annex merge is that users should not need to know about it, so it should not be doing expensive pulls.

I hope that git annex sync will grow some useful features to support fully distributed git usage, as being discussed in pure git-annex only workflow. I still use centralized git to avoid these problems myself.

Comment by http://joey.kitenet.net/ Fri Dec 23 16:50:26 2011

No extra remotes (that I'm aware of); that output was only edited to change hostnames.

On all three hosts, "git push origin" and "git pull origin" say everything is up to date.

I'm using git-annex 3.20111011 on all hosts (although some were running 3.20110928 when I created the repositories).

Regarding the multiple links, I've put a copy of the dot file here. It shows psychosis in three separate subgraphs that are just being rendered together as one, if that helps clarify anything.

Wait, I just realized you said "the git-annex branch". My origin only has "master". Do you mean the one specifically named "git-annex"? I thought that was something that gets managed automatically, or is it something I need to manually check out and deal with?

Any other info I could provide?

ps: concerning the command 'find .git/annex/objects -type f -name 'SHA1-s1678--70....' from my previous comment, it is "significantly" faster to search for the containing directory, which has the same name: 'find .git/annex/objects -maxdepth 2 -mindepth 2 -type d -name 'SHA1-s1678--70....'. I am just curious: why does each file object need to be in its own directory, itself nested under two more sub-directories?

Indeed, see add a git backend, where you and I have already discussed this idea. :)

With the new support for special remotes, which will be used by S3, it would be possible to make such a git repo, using bup, be a special remote. I think it would be pretty easy to implement now. Not a priority for me though.

Comment by http://joey.kitenet.net/ Mon Mar 28 16:01:30 2011
I got my answer on #vcs-home: Yes, git-annex and git get along fine.
Comment by http://peter-simons.myopenid.com/ Wed Jul 13 16:21:25 2011

I got a good laugh out of it :-)

Storing the key unencrypted would make things easier.. I think at least for my use-cases I don't require another layer of protection on top of the ssh keys that provide access to the encrypted remotes themselves.

Since subtitle files are typically pretty small, a workaround is to simply check them into git directly, and only use git-annex for the movies. (Or git annex unannex the ones you've already annexed.)
Comment by http://joey.kitenet.net/ Fri Dec 23 18:43:05 2011

Yes, it can read id3-tags and guess titles from movie filenames but it sometimes gets confused by the filename metadata provided by the WORM-backend.

I think I have a good enough solution to this problem. It's not efficient when it comes to renames but handles adding and deletion just fine

    rsync -vaL --delete source dest

The -L flag looks at symbolic links and copies the actual data they are pointing to. Of course "source" must have all data locally for this to work.

Hmm, I don't see the spurious ssh edge in the dot file -- that is, I don't see any ssh:// uris in it?
Comment by http://joey.kitenet.net/ Sat Oct 22 01:18:27 2011

I think what is happening with "git annex unannex" is that "git annex add" crashes before it can "git add" the symlinks. unannex only looks at files that "git ls-files" shows, and so files that are not added to git are not seen. So, this can be recovered from by looking at git status and manually adding the symlinks to git, and then unannex.

That also suggests that "git annex add ." has done something before crashing. That's consistent with you passing it < 2 parameters; it's not just running out of memory trying to expand and preserve order of its parameters (like it might if you ran "git annex add experiment-1/ experiment-2/")

I'm pretty sure I know where the space leak is now. git-annex builds up a queue of git commands, so that it can run git a minimum number of times. Currently, this queue is only flushed at the end. I had been meaning to work on having it flush the queue periodically to avoid it growing without bounds, and I will prioritize doing that.
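The bounded-flush idea can be pictured in shell terms with xargs, which runs the command once per fixed-size chunk of arguments instead of accumulating everything until the end (illustrative only; git-annex implements its queue internally in Haskell):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
for i in 1 2 3 4 5; do echo x > "file$i"; done
# "flush the queue" every 2 paths instead of queueing all 5 until the end;
# git still runs far fewer times than once per file
printf '%s\n' file* | xargs -n 2 git add
```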

(The only other thing that "git annex add" does is record location log information.)

Comment by http://joey.kitenet.net/ Thu Apr 7 16:41:00 2011

In my case, the remotes are the same, but adding a new option could make sense.

And while I can tell mr what to do explicitly, I would prefer if it did the right thing all by itself. Having to change configs in two separate places is less than ideal.

I am not sure what you mean by git annex push as that does not exist. Did you mean copy?

I have made a new autosync branch, where all that the user needs to do is run git annex sync and it automatically sets up the synced/master branch. I find this very easy to use, what do you think?

Note that autosync is also pretty smart about not running commands like "git merge" and "git push" when they would not do anything. So you may find git annex sync not showing all the steps you'd expect. The only step a sync always performs now is pulling from the remotes.

Comment by http://joey.kitenet.net/ Fri Dec 30 23:45:57 2011

How remote is REMOTE? If it's a directory on the same computer, then git-annex copy --to is actually quickly checking that each file is present on the remote, and when it is, skipping copying it again.

If the remote is ssh, git-annex copy talks to the remote to see if it has the file. This makes copy --to slow, as Rich complained before. :)

So, copy --to does not trust location tracking information (unless --fast is specified), which means that it should be doing exactly what you want it to do in your situation -- transferring every file that is really not present in the destination repository already.

Neither does copy --from, by the way. It always checks if each file is present in the current repository's annex before trying to download it.

Comment by http://joey.kitenet.net/ Sun Apr 3 16:49:01 2011

Doh! Total brain melt on my part. Thanks for the additional info. I wasn't taking my time and reading things properly; I kept assuming that the full remote pull failed due to the warning:

    You asked to pull from the remote 'rss', but did not specify
    a branch. Because this is not the default configured remote
    for your current branch, you must specify a branch on the command line.

Rookie mistake indeed.

You handle conflicts in annexed files the same as you would handle them in other binary files checked into git.

For example, you might choose to git rm or git add the file to resolve the conflict.

Previous discussion

Comment by http://joey.kitenet.net/ Mon Apr 23 14:29:03 2012

The rsync or directory special remotes would work if the media player uses metadata in the files, rather than directory locations.

Beyond that there is the smudge idea, which is hoped to be supported sometime.

Comment by http://joey.kitenet.net/ Thu Jul 7 15:27:28 2011

Well, the modes you show are wrong. Nothing in the annex should be writable. fsck needs to fix those. (It's true that it also always chmods even correct mode files/directories.. I've made a change avoiding that.)

I have not thought about, or tried, shared git-annex repos with multiple unix users writing to them. (Using gitolite with git-annex would be an alternative.) Seems to me that removing content from the annex would also be a problem, since the directory will need to be chmodded to allow deleting the content from it, and that will fail if it's owned by someone else. Perhaps git-annex needs to honor core.sharedRepository and avoid these nice safeguards on file modes then.

Comment by http://joey.kitenet.net/ Sat Apr 21 16:09:19 2012

Heh, cool, I was thinking of throwing about 28 million files at git-annex. Let me know how it goes; I suspect you have just run into a default OS X limits problem.

You probably just need to up some system limits (you will need to read the error messages that first appear) then do something like

    # these are runtime settings; to make them permanent, put them in /etc/sysctl.conf
    sudo sysctl -w kern.maxproc=2048
    sudo sysctl -w kern.maxprocperuid=1024

    # tell launchd about the higher limits (note: plain "sudo echo >> file" would
    # not work, since the redirection runs as the unprivileged user)
    echo "limit maxfiles 1024 unlimited" | sudo tee -a /etc/launchd.conf
    echo "limit maxproc 1024 2048" | sudo tee -a /etc/launchd.conf

There are other system limits, which you can check with "ulimit -a". Once you make the above changes, you will need to reboot for them to take effect. I am unsure if the above will help, as it is an example of what I did on 10.6.6 a few months ago to fix some forking issues. From the error you got, you will probably need to increase the stack size, or even make it unlimited if you feel lucky. The default stack size on OS X is 8192; try making it, say, 10 times that size first and see what happens.

Thanks, joey, but I still do not know, why the file that has been (and is) OK according to separate sha1 and sha256 checks, has been marked 'bad' by fsck and moved to .git/annex/bad. What could be a reason for that? Could have rsync caused it? I know too little about internal workings of git-annex to answer this question.

But one thing I know for certain - the false positives should not happen, unless something is wrong with the file. Otherwise, if it is unreliable, if I have to check twice, it is useless. I might as well just keep checksums of all the files and do all checks by hand...

Comment by antymat Tue Feb 14 22:48:37 2012

Well, lock could check for modifications and require --force to lose them. But the check could be expensive for large files.

But git annex lock is just a convenient way to run git checkout. And running git checkout or git reset --hard will lose your uncommitted file the same way obviously.
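The equivalence is easy to demonstrate with plain git: the uncommitted edit below is silently discarded by git checkout, which is exactly what git annex lock amounts to.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo v1 > file && git add file && git commit -q -m v1
echo v2 > file          # uncommitted modification
git checkout -- file    # modification is gone, no --force required
```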

Perhaps the best fix would be to get rid of lock entirely, and let the user use the underlying git commands same as they would to drop modifications to other files. It would then also make sense to remove unlock, leaving only edit.

Comment by http://joey.kitenet.net/ Sat Jan 7 17:15:31 2012

My experience is that modern filesystems are not going to have many issues with tens to hundreds of thousands of items in the directory. However, if a transition does happen for FAT support I will consider adding hashing. Although getting a good balanced hash in general without, say, checksumming the filename and taking part of the checksum, is difficult.

I prefer to keep all the metadata in the filename, as this eases recovery if the files end up in lost+found. So while "SHA/" is a nice workaround for the FAT colon problem, I'll be doing something else. (What I'm not sure yet.)

There is no point in creating unused hash directories on initialization. If anything, with a bad filesystem that just guarantees worst performance from the beginning..

Comment by http://joey.kitenet.net/ Mon Mar 14 16:12:49 2011

After some experimentation, this seems to work better:

    git commit -a -m 'git annex sync'
    git merge git-annex-master
    for remote in $(git remote)
    do
        git fetch $remote
        git merge $remote git-annex-master
    done
    git branch -f git-annex-master
    git annex merge
    for remote in $(git remote)
    do
        git push $remote git-annex git-annex-master
    done

Maybe this approach can be enhanced to skip stuff gracefully if there is no git-annex-master branch, and then be added to what "git annex sync" does; this way those who want to use the feature can do so by running "git branch git-annex-master" once. Or, if you like this and want to make it the default, just make git-annex init create the git-annex-master branch :-)

Comment by http://www.joachim-breitner.de/ Tue Dec 13 18:47:18 2011

Git-annex has really helped me with my media files. I have a big NAS drive where I keep all my music, tv, and movies files, each in their own git annex. I tend to keep the media that I want to watch or listen to on my laptop and then drop it when it is done. This way I don't have too much on my laptop at any one time, but I have a nice selection for when I'm traveling and don't have access to my NAS.

Additionally, I have an mp3 player that will format itself randomly every few months or so. I keep my podcasts on it in a git annex and in a git annex on my laptop. When I am done with a podcast, I can delete it from the mp3 player and then sync that information with my laptop. With this method, I have a backup of what should be on my mp3 player, so I don't need to worry about losing it all when the mp3 player decides it's had enough.

Comment by http://cgray.myopenid.com/ Sat Apr 14 01:18:53 2012

Thank you,

I imagined it was something like that. I'm just sorry I posted that on the forum and not in the bugs section (I hadn't discovered it at the time). But now, if people search for this error, they should find this.

Note for Fedora users: unfortunately GHC 7.4 will not be shipped with Fedora 17 (which is still not released). The feature page mentions it for Fedora 18. I feel like I am using Debian ... outdated packages the day of the release.

And many thanks for this wonderful piece of software.

Mildred

Comment by http://mildred.pip.verisignlabs.com/ Fri Apr 13 07:28:10 2012

I don't know how to approach this yet, but I support the idea -- it would be great if there was a tool that could punch files out of git history and put them in the annex. (Of course with typical git history rewriting caveats.)

Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?

Comment by http://joey.kitenet.net/ Fri Feb 25 05:16:48 2011
It's strange. I've done some testing on another machine, and this one, and the issue seems to be with adding only certain sub-directories of the git-annex directory. Would it cause an issue with git-annex if a sub-directory was a git repo?
You can make it a remote without merging its contents. Git will not merge its contents by default unless it's named "origin". git-annex will be perfectly happy with that.
Comment by http://joey.kitenet.net/ Tue Jan 3 18:42:08 2012

--to and --from seem to have different semantics than --source and --destination. Subtle, but still different.

That being said, I am not sure --from and --to are needed at all. Calling the local repo . and all remotes by their name, they are arguably redundant and removing them would make the syntax a lot prettier; mv and cp don't need them, either.

I am not sure changing syntax at this point is considered good style; though personally, I wouldn't mind adapting and would actually prefer it over using --to and --from.

-v and -q would be nice.

Richard

To get to a specific version of a file, you need to have a tag or a branch that includes that version of the file. Check out the branch and git annex get $file.

(Of course, even without a tag or branch, old file versions are retained, unless dropped with unused/dropunused. So you could even git checkout $COMMITID.)

Comment by http://joey.kitenet.net/ Tue Apr 24 21:14:15 2012
All right, I've made all the changes so it supports core.sharedRepository.
Comment by http://joey.kitenet.net/ Sat Apr 21 23:46:42 2012
Great! This was the only thing about git-annex which could have kept me from using it. --Michael
Comment by http://m-f-k.myopenid.com/ Sun Mar 6 16:33:19 2011

You're taking a very long and strange way to a place that you can reach as follows:

    git pull remote
    git annex get .

Which is just as shown in getting file content.

In particular, "git pull remote" first fetches all branches from the remote, including the git-annex branch. When you say "git pull remote master", you're preventing it from fetching the git-annex branch. If for some reason you want the slightly longer way around, it is:

    git pull remote master
    git fetch remote git-annex
    git annex get .

Or, equivalently but with fewer network connections:

    git fetch remote
    git merge remote/master
    git annex get .

BTW, notice that this is all bog-standard git branch pulling stuff, not specific to git-annex in the least. Consult your extensive and friendly git documentation for details. :)

Comment by http://joey.kitenet.net/ Tue Dec 6 16:43:29 2011
rsync over ssh is used to transfer file contents between repositories. (You can use the -d option to see the commands git-annex runs.)
Comment by http://joeyh.name/ Thu May 10 19:17:22 2012
So perhaps checking if git-status (or similar) complains about missing files is a possible solution for this?

And something else I've done: I symlinked the video/ directory from the media annex into the normal raid annex:

    ln -s ~/media/annex/video ~/annex

And it's working out great.

    ~annex $ git annex whereis video/series/episode1.avi
    whereis video/series/episode1.avi (1 copy)
            f210b45a-60d3-11e0-b593-3318d96f2520  -- Trantor - Media
    ok

I really like this. Perhaps it is a good idea to store all log files in every repo, but maybe there is a possibility to pack multiple log files into one single file, where not only the time, the presence bit, and the annex repository are stored, but also the file key. I don't know if that format would also be merged correctly by the union merge driver.

Here's another handy command-line which annexes all files in repo B which have already been annexed in repo A:

    git status --porcelain | sed -n '/^ T /{s///;p}' | xargs git annex add

The 'T' output by git status for these files indicates a type change: it's a symlink to the annex in repo A, but a normal file in repo B.

Comment by http://adamspiers.myopenid.com/ Thu Mar 29 21:41:54 2012
Ah, OK. Is there a configuration step to set this up, or is this included magic in a new enough git-annex client?

Yes, I think that adding an --all option is the right approach for this. It seems unlikely you'd have some files' hashes handy without having them checked out, but operating on all content makes sense.

That page discusses some problems implementing it for some commands, but should not pose a problem for move. It would also be possible to support get and copy, except --auto couldn't be used with --all. Even fsck could support it.

Comment by http://joeyh.name/ Tue May 15 17:00:10 2012
Ask and ye shalle receive with an Abbot on top: hook
Comment by http://joey.kitenet.net/ Thu Apr 28 21:22:03 2011
There is a ghc7.0 branch in git that is being maintained to work with that version.
Comment by http://joey.kitenet.net/ Sun Mar 11 15:50:11 2012
Very cool! Thank you for the explanation.
Comment by http://peter-simons.myopenid.com/ Tue Jan 3 19:47:11 2012

The encryption uses a symmetric cipher that is stored in the git repository already. It's just stored encrypted to the various gpg keys that have been configured to use it. It would certainly be possible to store the symmetric cipher unencrypted in the git repo.

I don't see your idea of gpg-options saving any work. It would still require you to do key distribution and run commands in each repo to set it up.

Comment by http://joey.kitenet.net/ Sun Apr 29 02:39:20 2012

The bug with newlines is now fixed.

Thought I'd mention how to clean up after interrupting git annex add. When you do that, it doesn't get a chance to git add the files it has added (this is normally done at the end, or sometimes at points in the middle when you're adding a lot of files). This is also why fsck, whereis, and unannex wouldn't operate on them, since they only deal with files in git.

So the first step is to manually use git add on any symlinks.

Then, git commit as usual.

At that point, git annex unannex would get you back to your starting state.
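The recovery steps above, sketched end to end (the symlink target below is a made-up stand-in for what an interrupted add actually leaves behind):

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
# an interrupted "git annex add" leaves a symlink that git doesn't know about yet
ln -s .git/annex/objects/XX/YY/KEY/KEY bigfile
# step 1: manually git add the symlinks
git status --porcelain | awk '$1 == "??" {print $2}' | xargs git add
# step 2: commit as usual
git commit -q -m 'finish interrupted git annex add'
# from here, "git annex unannex" would get back to the starting state
```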

Comment by http://joey.kitenet.net/ Tue Dec 6 17:08:37 2011

Maybe, otoh, part of the point of git-annex is that the data may be too large to pull down all of it.

I find mr useful as a policy layer over top of git-annex, so "mr update" can pull down appropriate quantities of data from appropriate locations.

Comment by http://joey.kitenet.net/ Tue Apr 5 18:05:00 2011
Good point. scp fixes this by using a colon, but as colons aren't needed in git-annex remotes' names... -- RichiH

git-annex is just amazing. I just started using it and for once, I have hope to be able to organize my files a little better than now.

Currently, I have a huge homedir. From time to time, I move files away to external hard drives, then forget about them. When I want to look at them again, I just can't, because I have forgotten where they are. I also have a ton of files on those drives that I can't access because they are not indexed. With git-annex I have hope of putting all of these files in a git repository. I will be able to see them everywhere, and find them when I need to.

I might stop losing files for once.

I might avoid having multiple copies of the same things over and over again without knowing it, and regain some disk space.

For the moment, I'm archiving my photographs. But there is one thing that might not go very well: directory hierarchies where everything is important (file owner, specific permissions, symlinks). I won't just be able to blindly annex all of those files. But for the moment I'll stick to archiving documents, and it should be amazing.

Mildred

Comment by http://mildred.fr/ Thu Apr 12 17:12:41 2012
The point of git-subtree is that I can import another repository into sub-directory, i.e. I can have a directory called "work" that contains all files from the annex I have at work. If I make the other annex a remote and merge its contents, then all contents is going to be merged at the top-level, which is somewhat undesirable in my particular case.
Comment by http://peter-simons.myopenid.com/ Tue Jan 3 18:11:37 2012

Sorry for not replying earlier, but my non-mailinglist-communications-workflows are suboptimal :-)

Then in each repo, I found I had to pull from each of its remotes, to get the tracking branches that defaultSyncRemotes looks for to know those remotes are syncable. This was the surprising thing for me, I had expected sync to somehow work out which remotes were syncable without my explicit pull. And it was not very obvious that sync was not doing its thing before I did that, since it still does a lot of "stuff".

Right. But "git fetch" ought to be enough.

Personally, I’d just pull and push everywhere, but you pointed out that it ought to be manageable. The existence of the synced/master branch is the flag that indicates this, so you need to propagate this once. Note that if the branch were already created by "git annex init", then this would not be a problem.

It is not required to use "git fetch" once; you can also call "git annex sync" once with the remote explicitly mentioned, as this involves a fetch.

While this would lose the ability to control which remotes are synced, I think that being able to git annex sync origin and only sync from/to origin is sufficient, for the centralized use case.

I’d leave this decision to you. But I see that you took the decision already, as your code now creates the synced/master branch when it does not exist (e290f4a8).

Why did you make branch strict?

Because it did not work otherwise :-). It uses pipeRead, which is lazy, and for some reason git and/or your utility functions did not like that the output of the command was not consumed before the next git command was called. I did not investigate further. For better code, I’d suggest adding a function like pipeRead that completely reads the git output before returning, thus avoiding any issues with lazy IO.

mergeRemote merges from refs/remotes/foo/synced/master. But that will only be up-to-date if git annex sync has recently been run there. Is there any reason it couldn't merge from refs/remotes/foo/master?

Hmm, good question. It is probably safe to merge from both, and push only to synced/master. But which one first? synced/master can be ahead if the repo was synced to from somewhere else; master can be ahead if there are local changes. Maybe git merge should be called on all remote heads simultaneously, thus generating only one commit for the merge. I don’t know how well that works in practice.

Thanks for including my code, Joachim

Comment by http://www.joachim-breitner.de/ Mon Jan 2 14:02:04 2012

Thank you for your comment! Indeed, setting the umask to, for example, 022 has the desired effect that annex/objects etc. are executable (and in this special case also writable), my previous umask setting was 077; the "strange" permissions on the git directories was probably due to --shared=all, and the mode of "440" on the files within the git-annex tree is correct (the original file was 640 and stripped of its write permission).

Using this umask setting and newgrp to switch the default group, I was successfully able to set up the repositories.

However, I would like to suggest adding the execute bit to the directories below .git/annex/objects/ per default, even if the umask of the current shell differs. As the correct rights are already preserved in the actual files (minus their write permission) together with correct owner and group, the files are still protected the same way as previously, and because +x does not allow directory listings, no additional information can leak out either. Not having to set the umask to something "sensible" before operating git-annex would be a huge plus, too :)

The reason why I am not running MPD as my user is that I am a bit wary of running an application even exposed to the local network as my main user, and I see nothing wrong with running it as its own user.

Thank you again for your help and the time you put into this project!

On the plus side, the past me wanted exactly what I had in mind.

On the meh side, I really forgot about this conversation :/

When you say this todo is not a priority, does that mean there's no ETA at all and that it will most likely sleep for a long time? Or the almost usual "what the heck, I will just wizard it up in two lines of haskell"?

-- RichiH

I see the following problems with this scheme:

  • Disallows removal of files when disconnected. It's currently safe to force that, as long as git-annex tells you enough other repos are believed to have the file, and as long as you only force on one machine (say your laptop). With your scheme, if you drop a file while disconnected, any other host could see that the counter is still at N (because your laptop had the file the last time it was online), decide to drop the file, and lose the last version.

  • pushing a changed counter commit to other repos is tricky, because they're not bare, and the network topology to get the commit pulled into the other repo could vary.

  • Merging counter files issues. If the counter file doesn't automerge, two repos dropping the same file will conflict. But, if it does automerge, it breaks the counter conflict detection.

  • Needing to revert commits is going to be annoying. An actual git revert could probably not reliably be done. It'd need to construct a revert and commit it as a new commit, and then try to push that to remotes. And what if that push conflicts?

  • I do like the pre-removal dropping somewhat as an alternative to trust checking. I think that can be done with current git-annex though, just remove the files from the location log, but keep them in-annex. Dropping a file only looks at repos that the location log says have a file; so other repos can have retained a copy of a file secretly like this, and can safely remove it at any time. I'd need to look into this a bit more to be 100% sure it's safe, but have started hidden files.

  • I don't see any reduced round trips. It still has to contact N other repos on drop. Now, rather than checking that they have a file, it needs to push a change to them.

Comment by http://joey.kitenet.net/ Tue Feb 22 18:44:28 2011
I think the forums/website is currently sufficient. I do at times wish there was a mailing list or anonymous git push to the wiki, as I find editing posts through the web browser sometimes tedious (the lack of !fmt or alt-q bugs me at times ;) ). The main advantage of keeping stuff on the site/forum is that everything gets saved and passed on to anyone who checks out the git repo of the code base.

I thought about this some more, and I think I have a pretty decent solution that avoids a central bare repository. Instead of pushing to master (which git does not like) or trying to guess the remote branch name on the other side, there is a well-known branch name, say git-annex-master. Then a sync command would do something like this (untested):

git commit -a -m 'git annex sync' # ideally with a description derived from the diff
git merge git-annex-master
git pull someremote git-annex-master # for all reachable remotes. Or better to use fetch and then merge everything in one command?
git branch -f git-annex-master # (or checkout git-annex-master, merge master, checkout master; but since we merged before, this should have the same effect)
git annex merge
git push someremote git-annex-master # for all reachable remotes

The nice things are: One can push to any remote repository, and thus avoid the issue of pushing to a portable device; the merging happens on the master branch, so if it fails to merge automatically, regular git foo can resolve it, and all changes eventually reach every repository.

What do you think?

Comment by http://www.joachim-breitner.de/ Tue Dec 13 18:16:08 2011
JFTR, pushing now happens automatically from branchable.
Comment by http://joey.kitenet.net/ Mon Sep 19 18:57:52 2011
When I want that, I ls -l the file and look at the symlink to the key. I.e., in SHA1-s10481423--efc7eec0d711212842cd6bb8f957e1628146d6ed the size is 10481423 bytes.
Comment by http://joey.kitenet.net/ Mon Nov 14 22:46:35 2011
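That size field can also be pulled out of the key name mechanically; a small sketch in plain shell parameter expansion, using the example key above:

```shell
# Extract the size field from an annex key of the form
# BACKEND-s<SIZE>--<checksum>, using the example key above.
key="SHA1-s10481423--efc7eec0d711212842cd6bb8f957e1628146d6ed"
size=${key#*-s}      # strip everything up to and including "-s"
size=${size%%-*}     # strip from the first remaining "-" onward
echo "$size"
```

This prints 10481423, the same number ls -l shows in the symlink target.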

Sorry for all the followups, but I see now that if you unannex, then add the file to git normally, and commit, the hook does misbehave.

This seems to be a bug. git-annex's hook thinks that you have used git annex unlock (or "git annex edit") on the file and are now committing a changed version, and the right thing to do there is to add the new content to the annex and update the symlink accordingly. I'll track this bug over at unannex vs unlock hook confusion.

So, committing after unannex, and before checking the file into git in the usual way, is a workaround. But only if you do a "git commit" to commit staged changes.

Anyway, this confusing point is fixed in git now!

Comment by http://joey.kitenet.net/ Wed Feb 2 00:46:00 2011

You should be able to fix the missing label by editing .git-annex/uuid.log and adding

1d1bc312-7243-11e0-a9ce-5f10c0ce9b0a tahoe

thanks, that's great. will there be a way to have sharedRepository work for shared remotes (rsync, directory) too, or is that better taken care of by acls?

@not thought of shared repos: we're having our family photo archive spread over our laptops, and backed up on our home storage server and on an rsync+encryption off-site server, with everyone naturally having their own accounts on all systems -- just if you need a use case.

Comment by http://christian.amsuess.com/chrysn Mon Apr 23 14:14:28 2012
Yes; git-annex uses the git-annex branch independently of the branch you have checked out. You may find internals interesting reading, but the short answer is it will work.
Comment by http://joey.kitenet.net/ Tue Jan 3 19:31:45 2012

I'd recommend using the SHA backend for this, the WORM backend would produce conflicts if the files' modification times changed.

syncing non-git trees with git-annex describes one way to do it.

Comment by http://joey.kitenet.net/ Mon Dec 19 18:24:59 2011

My current workflow looks like this (I'm still experimenting):

Create backup clone for migration

git clone original migrate
cd migrate
for branch in $(git branch -a | grep remotes/origin | grep -v HEAD); do git checkout --track $branch; done

Inject git annex initialization at repository base

git symbolic-ref HEAD refs/heads/newroot
git rm --cached *.rpm
git clean -f -d
git annex init master
git cherry-pick $(git rev-list --reverse master | head -1)
git rebase --onto newroot newroot master
git rebase master mybranch # how to automate this for all branches?
git branch -d newroot

Start migration with tree filter

echo \*.rpm annex.backend=SHA1 > .git/info/attributes
MYWORKDIR=$(pwd) git filter-branch --tree-filter ' \
    if [ ! -d .git-annex ]; then \
        mkdir .git-annex; \
        cp ${MYWORKDIR}/.git-annex/uuid.log .git-annex/; \
        cp ${MYWORKDIR}/.gitattributes .; \
    fi
    for rpm in $(git ls-files | grep "\.rpm$"); do \
        echo; \
        git annex add $rpm; \
        annexdest=$(readlink $rpm); \
        if [ -e .git-annex/$(basename $annexdest).log ]; then \
            echo "FOUND $(basename $annexdest).log"; \
        else \
            echo "COPY $(basename $annexdest).log"; \
            cp ${MYWORKDIR}/.git-annex/$(basename $annexdest).log .git-annex/; \
        fi; \
        ln -sf ${annexdest#../../} $rpm; \
    done; \
    git reset HEAD .git-rewrite; \
    : \
    ' -- $(git branch | cut -c 3-)
rm -rf .temp
git reset --hard

There are still some drawbacks:

  • git history shows that git annex log files are modified with each checkin
  • branches have to be rebased manually before starting migration
Comment by tyger Tue Mar 1 14:07:50 2011
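The open question in the script above ("how to automate this for all branches?") could be answered with a loop over refs. A rough sketch, exercised in a throwaway repository (branch and user names are made up; it assumes simple linear branches with no rebase conflicts):

```shell
# Rebase every local branch except master onto the rewritten master.
# Sketch only: assumes linear branches and no rebase conflicts.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q -b master .
git config user.name tyger; git config user.email tyger@example.invalid
echo a > a; git add a; git commit -q -m base
git branch topic
echo b > b; git add b; git commit -q -m rewritten
for b in $(git for-each-ref --format='%(refname:short)' refs/heads/ | grep -v '^master$'); do
    git rebase -q master "$b"
done
git merge-base --is-ancestor master topic && echo "topic now contains the rewritten master"
```

Rebases that hit conflicts would still need manual resolution, so this only removes the mechanical part of the drawback.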

Let's see..

  • -v is already an alias for --verbose

  • I don't find --source and --destination as easy to type or as clear as --from or --to.

  • -F is fast, so it cannot be used for --force. And I have no desire to make it easy to mistype a short option and enable --force; it can lose data.

@richard while it would be possible to support some syntax like "git annex copy . remote"; what is it supposed to do if there are local files named foo and bar, and remotes named foo and bar? Does "git annex copy foo bar" copy file foo to remote bar, or file bar from remote foo? I chose to use --from/--to to specify remotes independent of files to avoid such ambiguity, which plain old cp doesn't have, since it operates entirely on filesystem objects, not on both filesystem objects and abstract remotes.

Seems like nothing to do here. done --Joey

Comment by http://joey.kitenet.net/ Tue Apr 19 20:13:10 2011

OK, thanks. I was just wondering, since there are symlinks in git(-annex), and hard links too, whether the issue might have been caused by rsync.

I will keep my eye on that and run checks with my own checksum and fsck from time to time, and see what happens. I will post my results here, but the whole run (fsck or checksum) takes almost 2 days, so I will not do it too often... ;)

Comment by antymat Wed Feb 15 07:13:12 2012

Thinking about this more, I think minimally git-annex could support a

remote.<name>.gpg-options

or

remote.<name>.gpg-keyring

for options to be passed to gpg. I'm not sure how automatically setting it to $ANNEX_ROOT/.gnupg/.. would work.

I need to read the encryption code to fully understand it, but I also wonder if there is not also a way to just bypass gpg entirely and store the remote-encryption keys locally in plain text.
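For illustration, the proposed settings would just be ordinary git config entries; a sketch (the key name and the remote name "offsite" are the hypothetical proposal above, not options git-annex actually reads):

```shell
# Store the proposed (hypothetical) per-remote gpg settings in git config.
# git-annex does not read these keys; this only shows where they would live.
tmp=$(mktemp -d)
git init -q "$tmp"
git -C "$tmp" config remote.offsite.gpg-options "--no-default-keyring --keyring $HOME/.annex-keyring.gpg"
opts=$(git -C "$tmp" config --get remote.offsite.gpg-options)
echo "$opts"
rm -rf "$tmp"
```

git stores arbitrary keys in a remote's config section, so such a setting could coexist with the existing remote.<name>.* options.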

Remote as in "another physical machine". I assumed that

git annex copy --force --to REMOTE .

would not have trusted the contents of the current directory (or the remote that is being copied to), and would then just go off and re-download/upload all the files and overwrite what is already there. I expected that the combination of --force and copy --to would not bother to check whether the files are there or not, and just copy them regardless.

@joey thanks for the update in the previous comment, I had forgotten about updating it.

@zooko it's working okay for me right now, since I'm only putting fairly big blobs of stuff onto it, and only things that I really care about. On the performance side, it would be nicer if it ran faster :)

git-annex needs ghc 7.4, that's why it depends on that base version that comes with it. So you either need to upgrade your ghc, or you can build from the ghc7.0 branch in git, like this:

git clone git://git-annex.branchable.com/ git-annex
cd git-annex
git checkout ghc7.0
cabal update
cabal install --only-dependencies
cabal configure
cabal build
cabal install --bindir=$HOME/bin
Comment by http://joey.kitenet.net/ Sun Apr 22 05:39:28 2012
I had a similar question in forum/new_microfeatures/. I would like to fetch/copy all the annexed content from a repo, be it on the current branch, on another branch, or corresponding to an old version of a file. A command like "git annex copy --all --from=source [path]" would then ensure I have access to all the content I need, even if I later no longer have access to source. Sure, I could use rsync.

we could include the information about the current directory as well, if the command is not issued in the local git root directory. to avoid large numbers of similar lines, that could look like this:

Estimated annex size: B MiB (of C MiB; [B/C]%)
Estimated annex size in $PWD: B' MiB (of C' MiB; [B'/C']%)

with the percentages being replaced with "complete" if really all files are present (and not just many enough for the value to be rounded to 100%).
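the rounding rule in the last paragraph can be sketched in shell (a hypothetical helper, not git-annex code):

```shell
# Print a percentage, but reserve "complete" for a truly full annex:
# 9996/10000 rounds to 100% yet must not be labelled "complete".
annex_percent() {
    have=$1; total=$2
    if [ "$have" -eq "$total" ]; then
        echo "complete"
    else
        echo "$(( (200 * have / total + 1) / 2 ))%"   # round to nearest percent
    fi
}
annex_percent 9996 10000    # rounds up to 100%, but is not "complete"
annex_percent 10000 10000   # only now: "complete"
```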

Comment by http://christian.amsuess.com/chrysn Tue Apr 26 12:31:02 2011

I got bitten by this too. It seems that the user is expected to fetch remote git-annex branches themselves, but this is not documented anywhere.

The man page says of "git annex merge":

Automatically merges any changes from remotes into the git-annex
branch.

I am not a git newbie, but even so I had incorrectly assumed that git annex merge would take care of pulling the git-annex branch from the remote prior to merging, thereby ensuring all versions of the git-annex branch would be merged, and that the location tracking data would be synced across all peer repositories.

My master branches do not track any specific upstream branch, because I am operating in a decentralized fashion. Therefore the error message caused by git pull $remote succeeded in encouraging me to instead use git pull $remote master, and this excludes the git-annex branch from the fetch. Even worse, a git newbie might realise this and be tempted to do git pull $remote git-annex.

Therefore I think it needs to be explicitly documented that

git fetch $remote
git merge $remote/master

is required when the local branch doesn't track an upstream branch. Or maybe a --fetch option could be added to git annex merge to perform the fetch from all remotes before running the merge(s).

Comment by http://adamspiers.myopenid.com/ Fri Dec 23 14:04:44 2011
@Rafaël , you're correct on all counts.
Comment by http://joey.kitenet.net/ Tue May 31 21:54:23 2011
The repository at http://git.nomeata.de/?p=git-annex.git;a=summary contains changes to Commands/Sync.hs (and to the manpage) that implements this behavior. The functionality should be fine; the progress output is not very nice yet, but I’m not sure if I really understood the various Command types. It also should be more easily discoverable how to activate the behavior (by running "git branch synced/master") by providing a helpful message, at least unless git annex init creates the branch by default.
Comment by http://www.joachim-breitner.de/ Thu Dec 29 19:58:31 2011
Do you have a bug in git-annex that you need fixed, or are you just curious?
Comment by http://joeyh.name/ Mon Jun 4 19:49:46 2012

additional filter criteria could come from the git history:

  • git annex get --touched-in HEAD~5.. to fetch what has recently been worked on
  • git annex get --touched-by chrysn --touched-in version-1.0..HEAD to fetch what i've been working on recently (based on regexp or substring match in author; git experts could probably craft much more meaningful expressions)

these options could also apply to git annex find -- actually, looking at the normal file system tools for such tasks, that might even be sufficient (think git annex find --numcopies-gt 3 --present-on lanserver1 --drop, like find -iname '*foo*' -delete)

(i was about to open a new forum discussion for commit-based getting, but this is close enough to be usefully joined into one discussion)

Comment by http://christian.amsuess.com/chrysn Thu Jun 23 13:56:35 2011

If you can't segment the names retroactively, it's better to start with segmenting, imo.

As subdirectories are cheap, going with ab/cd/rest or even ab/cd/ef/rest by default wouldn't hurt.

Your point about git not needing to create as many tree objects is a kicker indeed. If I were you, I would default to segmentation.
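To illustrate the ab/cd/ layout being discussed (illustrative only: this reuses the key's own checksum for the prefixes, whereas git-annex's real scheme derives the directories from a hash of the whole key):

```shell
# Place a key under two levels of two-character prefix directories,
# taking the prefixes from the checksum part of the key.
key="SHA1-s10481423--efc7eec0d711212842cd6bb8f957e1628146d6ed"
hash=${key##*--}                       # checksum part of the key
p1=$(printf '%s' "$hash" | cut -c1-2)  # first segment: "ef"
p2=$(printf '%s' "$hash" | cut -c3-4)  # second segment: "c7"
echo "$p1/$p2/$key"
```

With two hex characters per level, each directory holds at most 256 subdirectories, which keeps tree objects small.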

I don't mind changing the behavior of git-annex sync, certainly..

Looking through git's documentation, I found some existing configuration that could be reused following your idea: there is a remote.name.skipDefaultUpdate and a remote.name.skipFetchAll, though both have to do with fetches, not pushes. Another approach might be to use git's remote group stuff.

Comment by http://joey.kitenet.net/ Mon Dec 19 18:29:01 2011

Ah HA! Looks like I found the cause of this.

[matt@rss01:~/files/matt_ford]0> git annex add mhs
add mhs/Accessing_Web_Manager_V10.pdf ok
....
add mhs/MAHSC Costing Request Form Dual
Organisations - FINAL v20 Oct 2010.xls git-annex: unknown response from git cat-file refs/heads/git-annex:8d5/ed4/WORM-s568832-m1323164214--MAHSC Costing Request Form Dual missing

Spot the file name with a newline character in it! This causes the error message above. It seems that the files preceding this badly named file are sym-linked but not registered.

Perhaps a bug?

I have merged my autosync branch, the improved sync command will be in this year's last git-annex release!
Comment by http://joey.kitenet.net/ Sat Dec 31 18:34:31 2011

@Jimmy mentioned anonymous git push -- that is now enabled for this wiki. Enjoy!

I may try to spend more time on #vcs-home -- or I can be summoned there from my other lurking places on irc, I guess.

Comment by http://joey.kitenet.net/ Thu May 19 19:21:51 2011

And following on to my transcript, you can then add the file to git in the regular git way, and it works fine:

joey@gnu:~/tmp/demo>git add file
joey@gnu:~/tmp/demo>git commit
[master 225ffc0] added as regular git file, not in annex
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 file
joey@gnu:~/tmp/demo>ls -l file
-rw-r--r-- 1 joey joey 3 Feb  1 20:38 file
joey@gnu:~/tmp/demo>git log file
commit 225ffc048f5af7c0466b3b1fe549a6d5e9a9e9fe
Author: Joey Hess 
Date:   Tue Feb 1 20:43:13 2011 -0400

    added as regular git file, not in annex

commit 78a09cc791b875c3b859ca9401e5b6472bf19d08
Author: Joey Hess 
Date:   Tue Feb 1 20:38:30 2011 -0400

    unannex

commit 64cf267734adae05c020d9fd4d5a7ff7c64390db
Author: Joey Hess 
Date:   Tue Feb 1 20:38:18 2011 -0400

    add
Comment by http://joey.kitenet.net/ Wed Feb 2 00:41:24 2011
Now it's fully supported, so long as you put a bare git repo on your key.
Comment by http://joey.kitenet.net/ Sat Mar 19 15:37:22 2011
Joey, that sounds reasonable; I'll try it. Thanks!

Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?

It should be sufficient to honor the GIT_DIR/GIT_WORK_TREE/GIT_INDEX_FILE environment variables. git filter-branch sets GIT_WORK_TREE to ., but this can be mitigated by starting the filter script with 'GIT_WORK_TREE=$(pwd $GIT_WORK_TREE)'. E.g. with GIT_DIR=/home/tyger/repo/.git and GIT_WORK_TREE=/home/tyger/repo/.git-rewrite/t, git annex should be able to compute the correct relative path, or maybe use absolute paths in symlinks.

Another problem I observed is that git annex add automatically commits the symlink; this behaviour doesn't work well with the tree filter. git annex commits the wrong path (.git-rewrite/t/LINK instead of LINK). Also, filter-branch doesn't expect the filter script to commit anything; new files in the temporary work tree will be committed by filter-branch on each iteration of the filter script (and missing files will be removed).

Comment by tyger Wed Mar 2 08:15:37 2011
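The environment-variable override mentioned above works like this in plain git (a minimal sketch with throwaway paths and a made-up user identity):

```shell
# Operate on a work tree that lives outside the repository by pointing
# GIT_DIR and GIT_WORK_TREE at the right places.
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
git -C "$tmp/repo" config user.name tyger
git -C "$tmp/repo" config user.email tyger@example.invalid
mkdir "$tmp/elsewhere"
echo data > "$tmp/elsewhere/file"
cd "$tmp/elsewhere"
export GIT_DIR="$tmp/repo/.git" GIT_WORK_TREE="$tmp/elsewhere"
git add file
git commit -q -m 'committed from a detached work tree'
git ls-files
```

The same mechanism is what filter-branch relies on, which is why honoring these variables would let git-annex run correctly inside a tree filter.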
Yes, it is really a minor point. And indeed, "git log --summary" is pretty good already. But I’d still think that at least the title deserves some love. Including the hostname or the name of the repository there is a good idea as well.
Comment by http://www.joachim-breitner.de/ Fri Feb 24 23:09:03 2012
We seem to be using #vcs-home @ OFTC for now. madduck is fine with it and joeyh pokes his head in there, as well. I just added a CIA bot to #vcs-home and this comment is a test if pushing works. -- RichiH
The git-annex branch is how the annex information for the different repositories is communicated around, so yes, you need to push/pull it.
Comment by http://joey.kitenet.net/ Tue Apr 10 16:05:40 2012

This bug was fixed in git-annex 3.20120230. You have a few options to get the fix:

  • Upgrade to ghc 7.4, the need for which is the cause of the cabal error message you pasted.
  • Manually download from git, and git checkout ghc7.0 -- that branch will build with your old ghc and has the fix.
  • cherry-pick commit 51338486dcf9ab86de426e41b1eb31af1d3a6c87
Comment by http://joey.kitenet.net/ Thu Apr 12 16:29:58 2012
Both problems fixed.
Comment by http://joey.kitenet.net/ Tue Apr 26 23:40:33 2011
The logging format could be improved, but the daemon already logs to .git/annex/daemon.log. It also automatically rotates the log file.
Comment by http://joeyh.name/ Sat Jun 23 14:30:22 2012
That nautilus behavior is a bad thing when trying to export files out, but it's a good thing when just moving files around inside your repository...
Comment by http://joeyh.name/ Sat Jun 16 03:26:37 2012

Use du -L for the disk space used locally. The other number is not currently available, but it would be nice to have. I also sometimes would like to have data on which backends are used how much, so making this git annex status --subdir is tempting. Unfortunately, its current implementation scans .git/annex/objects and not the disk tree (better for accurate numbers due to copies), so it would not be a very easy thing to add. Not massively hard, but not something I can pound out before I start work today.

Comment by http://joeyh.name/ Wed Jun 27 12:36:08 2012
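The du -L suggestion can be seen in a tiny mock-up of an annexed layout (throwaway paths; the exact sizes will vary by filesystem):

```shell
# du counts only the symlink; du -L follows it to the annexed content.
set -e
tmp=$(mktemp -d)
mkdir "$tmp/objects" "$tmp/tree"
head -c 8192 /dev/zero > "$tmp/objects/content"
ln -s ../objects/content "$tmp/tree/file"
with_content=$(du -skL "$tmp/tree" | cut -f1)
links_only=$(du -sk "$tmp/tree" | cut -f1)
echo "du -L: ${with_content}K  du: ${links_only}K"
rm -rf "$tmp"
```

The -L figure includes the 8K of content behind the symlink, while plain du sees only the link itself.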
Ah! I was fooled by nautilus, which is not able to properly handle symlinks when copying. It copies the links instead of their targets [[!gnomebug 623580]].
Comment by http://denis.laxalde.org/ Fri Jun 15 19:57:31 2012

Sure, you can simply:

cp annexedfile ~

Or just attach the file right from the git repository to an email, like any other file. Should work fine.

If you wanted to copy a whole directory to export, you'd need to use the -L flag to make cp follow the symlinks and copy the real contents:

cp -r -L annexeddirectory /media/usbdrive/
Comment by http://joeyh.name/ Fri Jun 15 19:25:59 2012

This is now about different build failure than the bug you reported, which was already fixed. Conflating the two is just confusing.

The error message about syb is because by using cabal-install on an Ubuntu system from 2010, you're mixing the very old versions of some haskell libraries in Ubuntu with the new versions cabal wants to install. The solution is to stop mixing two package management systems --

  • Either install git-annex without using cabal, and use apt-get to install all its dependencies from Ubuntu, assuming your distribution has all the necessary haskell libraries packaged.
  • Or apt-get remove ghc, and manually install a current version of The Haskell Platform and use cabal.
Comment by http://joey.kitenet.net/ Sun Jan 15 19:53:35 2012

Joey, thanks for your quick help! I'll try the manual haskell-platform install once I have quicker internet again, i.e. tomorrow.

And sorry for the mess-up; I split the post into two. Hope it's clearer now.

Comments on this page are closed.