As part of the process of migrating NPanday to the Apache Incubator, I had to find a way to extract the Subversion repository from Codeplex. The challenge here is that it isn’t actually a Subversion repository, but a TFS server running SVNBridge to appear like one. Its occasional quirks and timeouts had been part of the reason we decided to move. Here is what I’ve learned.
The aim was to get it as a dump of a Subversion repository so that it could be loaded into the ASF Subversion repository, retaining the full history. I tried everything to get it out of there: svnsync, cloning it as a Git repository, cloning it as a Mercurial repository, and other similar tools. All would time out or freak out at some point due to the nature of SVNBridge. I made some progress with tfs2svn (which seems to be what Codeplex uses to migrate repositories to Mercurial when needed), but I found that it needed regular manual intervention and that the result wasn’t quite the same (e.g. Subversion properties came in as ..svnbridge hidden folders, and every comment had the original TFS revision number appended).
I had tried rsvndump earlier without a lot of success, but eventually gave it another try. While it needed some help, it was making the most progress and was the one that ended up being successful.
It wasn’t anywhere near smooth sailing, and rsvndump needed some modifications to handle the task. So I brushed off my dusty C skills and made the required changes, which can be found in my GitHub fork of the project (pull request pending).
The main problem was that rsvndump expected to be able to process the repository all in one hit. Given that Codeplex would time out on several requests, this was impossible. It did allow selecting a subset of revisions, but even then it would both do a full svn log (which wouldn’t succeed), and beyond that would traverse revisions to construct a path hash (I believe for detecting moves, etc.).
To assist with these, I added a --log-window-size option, similar to git-svn. This retrieves the logs in multiple requests, avoiding problems with timeouts. Next, I added a --first-rev argument, which would start retrieving logs and content from a later revision than 0. While it introduced some risk of crashing due to missing revisions, in many circumstances it allowed restarting a dump from a later revision in a much faster manner.
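The windowing idea can be illustrated outside rsvndump. The following is only a shell sketch of splitting a revision range into bounded requests, not the C change made in the fork; log_windows and REPO_URL are hypothetical names introduced here.

```shell
#!/bin/sh
# Print "START:END" ranges covering FIRST..LAST in windows of SIZE revisions.
# (log_windows is a hypothetical helper, not part of rsvndump.)
log_windows() {
    first=$1
    last=$2
    size=$3
    start=$first
    while [ "$start" -le "$last" ]; do
        end=$((start + size - 1))
        if [ "$end" -gt "$last" ]; then
            end=$last
        fi
        echo "$start:$end"
        start=$((end + 1))
    done
}

# Example use (not run here): one bounded `svn log` request per window,
# so a single timeout only costs one window. REPO_URL is an assumption.
# for range in $(log_windows 21102 60509 1000); do
#     svn log -q -r "$range" "$REPO_URL" || echo "window $range failed" >&2
# done
```

A failed window can then be retried on its own rather than restarting the whole log request.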
The next problem was that Codeplex shares a single TFS repository between several projects, so your own revision numbers are not sequential. NPanday started at revision 21102, and ended at 60509 with a lot of gaps, having only 1427 revisions of its own. This wasn’t too much of a problem – because rsvndump was designed to deal with subdirectories of a Subversion repository, it expected the gaps. The --first-rev argument helped deal with the big gap to the start. But another SVNBridge quirk was that svn copy operations copied from the (current revision - 1) – even when it didn’t exist! To correct this, I had to adjust the code to search backwards through the revision numbers until it found one that existed to make the copy operations correct.
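The backwards search is simple to state in isolation. This is a shell sketch of the logic only, not the actual C code in the fork; KNOWN_REVS, rev_exists and copy_source_rev are hypothetical names, and the revision numbers are made up for illustration.

```shell
#!/bin/sh
# Revisions that actually exist for the project (illustrative values only).
KNOWN_REVS="3 7 8 15"

# Succeeds if the given revision is in KNOWN_REVS.
rev_exists() {
    case " $KNOWN_REVS " in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

# Starting at (REV - 1), walk backwards until an existing revision is found.
# Prints that revision, or fails if there is none before REV.
copy_source_rev() {
    rev=$(($1 - 1))
    while [ "$rev" -ge 0 ]; do
        if rev_exists "$rev"; then
            echo "$rev"
            return 0
        fi
        rev=$((rev - 1))
    done
    return 1
}

copy_source_rev 15   # the reported source r14 does not exist, so this prints 8
```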
Finally, rsvndump added padding revisions into the dump file when a revision number was missing. This is helpful if you want to maintain the same numbers, but due to my use of --first-rev the numbers were already out, and I was importing into an existing repository, so I decided to strip these out. For that, I added another flag, --omit-padding-revs.
I ended up running a command like the following:
rsvndump --omit-padding-revs --adjust-missing-revnums \
--first $FIRST --log-window-size 1000 -v --incremental \
-r $REV1:$REV2 https://npanday.svn.codeplex.com/svn \
3>&1 >&2 2>&3 3>&- >$REV1-$REV2.dump | tee $REV1-$REV2.log.txt
The first few arguments are the customisations described above (and --adjust-missing-revnums to make the dumped revisions sequential). The next are the traditional svnadmin dump arguments that rsvndump honours. Finally, I redirected the output so that I could channel stdout to the dump file and stderr to a log file that I could also tee.
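The redirection is the least obvious part of that command, so here is a small self-contained demonstration. It uses a stand-in function (emit, hypothetical) in place of rsvndump, and also shows a shorter form that, as far as I can tell, has the same effect.

```shell
#!/bin/sh
# Stand-in for rsvndump (hypothetical): dump data on stdout, progress on stderr.
emit() {
    echo "dump-data"
    echo "progress" >&2
}

workdir=$(mktemp -d)

# The pipe is set up first, so stdout initially points at tee.
# 2>&1 then points stderr at the pipe, and >file moves stdout to the dump file.
emit 2>&1 >"$workdir/short.dump" | tee "$workdir/short.log"

# The descriptor-swapping form used in the command above, for comparison:
# fd 3 remembers the pipe, stderr is pointed at it, and stdout ends in the file.
emit 3>&1 >&2 2>&3 3>&- >"$workdir/swap.dump" | tee "$workdir/swap.log"
```

In both cases the .dump file receives only stdout, while stderr is shown on the terminal and saved to the log by tee.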
Other Codeplex SVNBridge Issues
With these changes in place, I was getting moderately successful dumps – but a few frustrating issues remained.
svn copy operations (such as creating a tag) were tracked file by file by Codeplex instead of at the top-level directory. This resulted in further timeouts that I couldn’t work around. We had already seen this on the Codeplex repository, where we were unable to even list the /tags/ directory. I didn’t attempt to correct this, instead manually applying the copy operation again after the preceding dump, then continuing.
svn cp \
-m '[maven-release-plugin] copy for tag npanday-project-1.2.1' \
svn ps svn:author "SND\jocaba_cp" --revprop -rHEAD $REPO
svn ps svn:date \
--revprop -rHEAD $REPO
If that revision appeared in the dump file (either incomplete or not able to be applied), I’d delete it by searching for Revision-number: xyzxyz and deleting the lines up until the next revision.
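That manual edit can also be scripted. The following is a minimal sketch, assuming a non-delta dump whose file contents never contain a line starting with "Revision-number: " (awk also may not cope with binary content); drop_revision is a hypothetical helper name.

```shell
#!/bin/sh
# Print the dump file with one revision record removed.
# Usage: drop_revision REV DUMPFILE > NEWDUMPFILE
drop_revision() {
    awk -v rev="$1" '
        # A new revision header decides whether we are inside the unwanted record.
        /^Revision-number: / { skip = ($2 == rev) }
        # Everything outside that record (including the dump header) is kept.
        !skip { print }
    ' "$2"
}

# Example use (not run here):
# drop_revision 1234 21102-30000.dump > 21102-30000.fixed.dump
```

It is worth diffing the result against the original before loading it, since a dump with binary files is safer to edit by hand.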
Between tags and a few other stubborn revisions that wouldn’t come across (including one where even svn log wouldn’t succeed), I manually reconstructed 100 revisions like that. The upside was that it provided an opportunity to clean out some botched releases (due to the SVNBridge /tags/ issues) and branches that had never been used.
So the process was to dump as many revisions as possible, then apply them to a test repository, check it out, make the required modifications, and repeat. I captured all of this in a shell script so that at any time I could recreate the work repository and reapply all of the dumps and modifications to date. This became useful a few times as I gradually identified inconsistencies with a checkout of the same revision from Codeplex, caused by having missed something.
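The original script is not reproduced here, but its general shape might look like the sketch below. The function name, directory layout and the assumption that dump files sort correctly by name are all mine; svnadmin create and svnadmin load are the standard commands.

```shell
#!/bin/sh
# Recreate a scratch repository and replay every dump file in name order.
# Usage: rebuild_work_repo REPO_DIR DUMP_DIR
# (hypothetical helper; assumes names like 21102-30000.dump sort correctly)
rebuild_work_repo() {
    repo=$1
    dumps=$2
    if [ -z "$repo" ] || [ -z "$dumps" ]; then
        echo "usage: rebuild_work_repo REPO_DIR DUMP_DIR" >&2
        return 2
    fi
    rm -rf "$repo"
    svnadmin create "$repo" || return 1
    for dump in "$dumps"/*.dump; do
        [ -e "$dump" ] || continue
        echo "loading $dump" >&2
        svnadmin load -q "$repo" <"$dump" || return 1
        # Any manual fix-ups that followed this dump would be replayed here.
    done
}

# Example use (not run here):
# rebuild_work_repo work-repo dumps
```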
This still uses a lot of bandwidth – starting at a given revision will both reconstruct the path hash for the whole repository at that revision, and fetch the “base revision”, which is a checkout of an entire revision, tags and all. So the process took a few days running intermittently. I also had to start the --first-rev at least one revision earlier and sometimes more, to avoid getting a cryptic Subversion error message about the “editor drive”.
Properties were also quite quirky on SVNBridge, due to the way they are apparently stored in TFS as described earlier. Some could not be removed (e.g. a bogus svn:mime-type), and some were set oddly (svn:ignore on a file, svn:eol-style on a directory). I chose to leave these alone and correct them after the import.
Some properties went missing, which was part of a larger problem on SVNBridge with copying from an existing revision. If you copy in the working copy and then make a modification before committing, this doesn’t show up as A+ in the svn log result later, but simply as M. The dumps know it was copied, but not that it was added, so they attempt to modify a non-existent file when applied. What’s more, this step wipes out some properties that are set on directories.
In some cases, I manually applied the revision, in others I made an edit in the dump file from:
Deleting directories hit snags as well. I’m unsure if this was a problem in SVNBridge or rsvndump, but it would dump deletions for every path and file like so:
When applying the dump, it would successfully delete the first then fail on the others that were already deleted in the first step. I ended up removing the nodes for all the later entries manually in these instances. You would take these out 4 lines at a time (including trailing whitespace of 2 lines):
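That removal could be automated along the following lines. This is only a sketch: it assumes each delete node is exactly a Node-path line, a Node-action: delete line and two blank lines, that the parent directory's deletion appears before its children within the revision, and that no file content in the dump contains a line starting with "Node-path: ". The name prune_child_deletes is hypothetical.

```shell
#!/bin/sh
# Print the dump with delete nodes removed when a parent directory was
# already deleted earlier in the same revision.
prune_child_deletes() {
    awk '
        # True if path p lies beneath a directory already deleted in this revision.
        function covered(p,   i) {
            for (i = 0; i < n; i++)
                if (index(p, gone[i] "/") == 1) return 1
            return 0
        }
        /^Revision-number: / { n = 0 }            # deletions are tracked per revision
        held != "" {                              # this is the line after a Node-path header
            if ($0 == "Node-action: delete") {
                if (covered(path)) { held = ""; blanks = 2; next }  # drop both header lines
                gone[n++] = path
            }
            print held; held = ""
        }
        blanks > 0 && $0 == "" { blanks--; next } # drop the two blank lines that follow
        { blanks = 0 }
        /^Node-path: / { held = $0; path = substr($0, 12); next }
        { print }
        END { if (held != "") print held }
    ' "$1"
}

# Example use (not run here):
# prune_child_deletes 21102-30000.dump > 21102-30000.fixed.dump
```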
Aside from Codeplex, for NPanday we needed to make some more manipulations. First, I changed the usernames to line up with their final accounts at the ASF, using repeated changes to the svn:author revision properties.
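A loop for that kind of change might look like the sketch below. The mapping shown is a placeholder rather than a real ASF account, and map_author and remap_authors are hypothetical names; svn propget and svn propset with --revprop are the standard commands, and the repository needs a pre-revprop-change hook that permits the change.

```shell
#!/bin/sh
# Map a Codeplex author name to its ASF account.
# The right-hand name is a placeholder, not a real account.
map_author() {
    case "$1" in
        'SND\jocaba_cp') printf '%s\n' "asf-account-placeholder" ;;
        *)               printf '%s\n' "$1" ;;   # leave unknown authors untouched
    esac
}

# Rewrite svn:author for revisions FIRST..LAST of the repository at URL.
# Usage: remap_authors URL FIRST LAST
remap_authors() {
    url=$1
    rev=$2
    last=$3
    while [ "$rev" -le "$last" ]; do
        old=$(svn propget svn:author --revprop -r "$rev" "$url") || return 1
        new=$(map_author "$old")
        if [ "$new" != "$old" ]; then
            svn propset svn:author --revprop -r "$rev" "$new" "$url" || return 1
        fi
        rev=$((rev + 1))
    done
}

# Example use (not run here); the URL is an assumption:
# remap_authors file:///path/to/work-repo 1 1427
```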
The dump was also loaded onto another partial Subversion repository that contained some intermediate history between originally leaving the incubator and arriving at Codeplex.
Loading to an existing Subversion repository and path
After all this was eventually done, and there was a repository that was matched with the history of the Codeplex one, it needed to be dumped to load into the ASF repository.
Normally, this would be a simple:
svnadmin dump --incremental --deltas work-repo >npanday.dump
However, the objective was to load this onto a path that already existed. This was because we sought to have continuity with the history from the point at which the project was forked from the incubator originally.
To achieve this, I identified the revision in the dump that matched the content in the ASF repository, which due to the initial creation of branches and tags, was revision 4. I then dumped it using:
svnadmin dump --incremental -r5:HEAD work-repo >npanday.dump
--deltas had been included to reduce the size, but we found that this caused checksum problems, possibly due to different line endings between r4 and the original in the ASF repository.
At last, this yielded a dump that could be loaded into the ASF repository, and the results can be seen here: http://svn.apache.org/viewvc/incubator/npanday. You can now see the historical continuity in files such as http://svn.apache.org/viewvc/incubator/npanday/trunk/pom.xml.
This took considerably more work than anticipated when we originally thought it would be a good idea to retain the history.
I found that there wasn’t a lot of information about these topics on the web, so I hope this post will help to expand that for those that might face this challenge in the future. I’ve also found that editing Subversion dump files (when not in delta-mode) is reasonably straightforward.
Interestingly, I’ve learned that Subversion 1.7 will include the ability to do a remote svnadmin dump. However, I don’t believe this will work when svnsync is not supported (as was the case here), or that it will support sub-paths as rsvndump does.