fixing corrupted git-svn commit data


So recently we had a permissions problem on the SVN server at work (a whole other story) that caused me to import bad commit data into my git repository using git-svn. The commit messages were fine, but the diffs were empty for one particular development branch. Due to unfortunate limitations in SVN-over-HTTP, the workaround we came up with was to create a second repository root that points at the exact same repository under the hood. This let us specify a different and more correct set of permissions in the second repository (mpi-private instead of mpi).

Unfortunately, because of this, my standard git-svn fix-all trick of blowing away the metadata in .git/svn won’t work on its own in this case. Every commit imported by git-svn contains a breadcrumb recording where the commit originally came from, called the git-svn-id, that looks something like this:

git-svn-id: https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk@4262 a5d90c62-d51d-0410-9f91-bf5351168976

This is problematic if you want to move to a new repository URL, such as the one above with “mpi” replaced by “mpi-private”: every existing commit message still points at the old root. The fix in this case is a little more involved than I would like it to be, but at least it was doable. The basic game plan is to:

  1. Delete the dev branch reference, which I’ll call “foo” for now.
  2. Expire any references to any commits on that branch in order to make all commits unreachable.
  3. Repack the repository.
  4. Prune commits from the repository that the repack didn’t catch.
  5. Delete all the git-svn metadata in .git/svn.
  6. Rewrite all of the git-svn-ids to the new URL.
  7. Change the URL in the svn-remote section of the .git/config file.
  8. Rebuild all of the git-svn metadata and fetch the commit information for the “foo” branch.

The corresponding set of commands looks something like:

% git branch -d -r foo
% git reflog expire --expire-unreachable=0 --all
% git repack -ad
% git prune
% rm -rf .git/svn
% git filter-branch --msg-filter 'sed "s,/repos/mpi,/repos/mpi-private,"' -- --all
% git config svn-remote.svn.url https://svn.mcs.anl.gov/repos/mpi-private
% git svn fetch

That last command will take quite some time. If you know the first SVN revision of the “foo” branch (the svn log pipeline here digs it out), then you can tell the fetch command to start from that point:

% svn log --stop-on-copy $SVN_URL/path/to/foo | \
   egrep '^r[0-9]+.*lines?$' | \
   tail -n 1 | \
   cut -d' ' -f1
rXXXX
% git svn fetch -r XXXX:HEAD
... metadata rebuild messages ...
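
To see what that pipeline actually picks out, here is a small sketch that runs the same filter over canned svn log output instead of a live server. The function name first_rev and the sample log text are made up for illustration. I've used grep -E in place of egrep and stripped the leading "r" so the result could be handed straight to git svn fetch -r.

```shell
#!/bin/sh
# Hypothetical helper: reads `svn log --stop-on-copy` output on stdin and
# prints the oldest revision number (without the leading "r").
first_rev() {
    grep -E '^r[0-9]+.*lines?$' |   # keep only the "rNNNN | user | date | N lines" headers
        tail -n 1 |                 # svn log lists newest first, so the last header is the oldest
        cut -d' ' -f1 |             # first space-separated field, e.g. "r4100"
        sed 's/^r//'                # drop the "r" prefix
}

# Made-up sample output, for illustration only.
sample_log='------------------------------------------------------------------------
r4262 | alice | 2009-04-01 10:00:00 -0500 (Wed, 01 Apr 2009) | 1 line

fix typo
------------------------------------------------------------------------
r4100 | bob | 2009-03-15 09:30:00 -0500 (Sun, 15 Mar 2009) | 2 lines

create the foo branch
from trunk
------------------------------------------------------------------------'

printf '%s\n' "$sample_log" | first_rev
```

Because --stop-on-copy halts the log at the point where the branch was copied, the last header line should correspond to the revision that created the branch, which is the one to start the fetch from.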

You’ll find that you now have a whole bunch of extra references thanks to git-filter-branch. It stores all of the original branch ref heads in refs/original/refs/heads/BAR, where BAR is the original branch name. You can remove them with this command:

% git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d

My tags didn’t actually get updated, so I cleaned them up by hand since there were only a few. But based on the output from filter-branch, it looks like I could have done this automagically by including “--tag-name-filter cat” in the arguments to filter-branch. This thread was very helpful in figuring this all out, as was this thread.

This whole ordeal makes it even more compelling to bail from SVN to git completely (or something like it; maybe mercurial would work well enough). It would keep me from having to deal with the piece of junk that is SVN itself, as well as the somewhat finicky hack of git-svn. I’ll take git-svn over a regular SVN client any day at this point, but a truly native git experience would be much more satisfying and robust.
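
One caveat about the message filter used above: that sed expression matches /repos/mpi anywhere in a commit message, and running it a second time would turn mpi-private into mpi-private-private. A slightly more conservative sketch, assuming the URL layout shown in the git-svn-id example, only touches lines that start with git-svn-id: and requires the trailing slash after mpi. The function name rewrite_svn_id and the sample commit message are hypothetical.

```shell
#!/bin/sh
# Hypothetical message filter: rewrites the repository root only on
# git-svn-id lines, and only when the path component is exactly "mpi".
rewrite_svn_id() {
    sed '/^git-svn-id: /s,/repos/mpi/,/repos/mpi-private/,'
}

# Made-up commit message; the git-svn-id line is the one from this post.
msg='Fix a bug mentioned in /repos/mpi/notes.txt

git-svn-id: https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk@4262 a5d90c62-d51d-0410-9f91-bf5351168976'

# The first pass rewrites the breadcrumb; the second pass leaves it unchanged,
# and the mention of the old path in the message body is never touched.
printf '%s\n' "$msg" | rewrite_svn_id | rewrite_svn_id
```

The same sed expression could be dropped into the --msg-filter argument in place of the original one, with --tag-name-filter cat added if the tags should follow the rewritten commits. I haven't rerun the full conversion this way, so treat it as a starting point rather than a tested recipe.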