cached: git-SVN: Whys And Hows (by amitu)

15 Feb

NOTE: this is just a copy of a blog post originally written by Amit Upadhyay. Since I found it really useful, I was rather disappointed to notice one day that it was gone. Thankfully it was still in Google’s cache, so I decided to make a copy. The rest of the article is a full quote:

Git-SVN: Whys And Hows

One of the first things one sets up when starting a software startup is a version control system. When we started Vakow! we decided to use SVN because we had gained a lot of experience with it in previous work environments, because it is free and, more importantly, because the cheapest Dreamhost account allowed us to run a central SVN repository without any problems. Another reason to use it could be TortoiseSVN for Windows users. SVN just works, and the central repository and work flow are quite easy to fit in one’s mind: there is a repository, you check out, you work, you see what you have changed so far, you commit, and you update to get the changes done by others. There are tags and branches, but they are nothing but folders for SVN. The weak part of SVN is merging: the basic merge offered by SVN is doable if you keep track of merged revision numbers in the revision logs. This, though tedious, is usually manageable. I was pretty happy with the setup, and Bugzilla integration worked just fine, though thanks to endless procrastination I am yet to publish my SVN-Bugzilla integration script that works on Dreamhost. Soon I will. Promise. :-)

Another reason to use SVN for Vakow! when we started was that, well, Git wasn’t there then.

So what is the problem with SVN? As I said above, SVN gives you almost no help with merging. There is no native concept of merge in SVN. SVN is a linear history of one big folder, which is organised into trunk and branches, and there is no native support for trunk, branches or tags in SVN either. These were all cool decisions taken by the SVN developers that made it so easy for developers to grasp, and its simplicity means a robust implementation [without letting the Gods (= Linus et al) come into the picture], all of which led to widespread adoption. But I digress.

Because of the lack of merge capability, working with branches is difficult. The work flow for branches is this: you have some feature that will take some time to develop, and you do not want your customers to learn about the new feature prematurely, or maybe the new feature will destabilize the main code for some time before it is stable, so you branch off. In SVN you create a copy of your trunk in a new folder, which by convention resides under a folder called “branches”. You work on trunk, your main stable code, mostly on bug fixes and minor features, and you work in parallel on the new feature branch. SVN is excellent at letting you do this. But after the work is done, you ultimately have to combine the changes you have made in trunk and in the feature branch and move the result to trunk. This is what SVN is not good at. SVN does not know about branches; a branch is just a folder, so it can’t truly merge. What it can do is take the diff of two versions of any folder and give you a patch file, which you can then apply to some code to get a merge.

This is how it works: let’s say you branched out the feature branch at revision 100, and have been developing trunk and branch till revision 200, when you realize you want to create a build to give to testers, and you have to make sure the changes from rev 100 to 200 on trunk also get into the branch. So you create a diff from revision 100 to 200 on trunk, and apply it to the branch. But merging is not trivial: while working on the feature branch you may have changed the same files and the same lines that other developers changed in trunk. You have to resolve this manually, and it is a laborious process. But what happens if the testers say no go, and find 10 more bugs for you to fix? You could either revert the changes that came from trunk, to keep things clean, so that when you are on rev 300, let’s say, you can again take the changes on trunk from rev 100 to 300 and apply the patch to the branch. Or you can let the changes merged at rev 200 stay, and keep working separately on trunk and branch. In that case, when you have to merge the changes from trunk again in future, you have to remember your decision, so you must keep it logged in the SVN commit logs or somewhere else. The biggest issue is that when you merge in SVN, you lose history: you lose exactly how each file changed over time, and the person who merged is logged as the person who made all the changes. A terrible thing in my opinion.

What happens if more than one branch is involved and merges are brought back and forth between them? The method of merging I described above becomes too difficult to keep track of; remember, in real life the revision numbers are not as round as the 100 and 200 I used above. This leads to lots of uncertainty, and programmers hate uncertainty. This all leads to programmers’ general reluctance to use branches, to considering branches a necessary evil, and to a constant effort to keep the number of branches to a minimum, with a clearly defined head of each branch who is responsible for merging and for making sure nobody else is applying merges and messing with the revision numbers. A small mistake, let’s say merging revisions 1946:2045 instead of 1945:2045, may lead to an important bug fix getting lost in the process of merging. Headaches.

I managed with this at Vakow!: I almost never worked on any branch for any significant time, and given that we were just two people, of whom only one can be considered a real programmer, it was not really a big issue. And after all, until Git/Mercurial started to become fashionable about 6-8 months ago [or this is when I started to learn about them], this was the state of the art of version control for me.

So how does Git help? Well, the first difference between SVN and Git is that Git is distributed whereas SVN is centralized. What does that mean, and how does it make merging easier? I am not sure I am absolutely correct about this, but this is what I understand so far. This will make most sense to SVN veterans: in Git there is no central repository, and every “checkout” is “complete”. It not only contains the latest code, as checked-out code in SVN does, but also the complete revision history and all tags and branches. This might sound astounding: what if you had thousands of checkins and tens of branches, how much space would it all take? But the Gods did step in when Git came into existence, so they solved this issue, and a typical Git clone in all its glory compares well with an SVN checkout when it comes to disk space, and even network transfer. These are things I don’t usually bother much about as long as they are manageable, so don’t tell me if one of them is some percent faster or smaller than the other for some operation or another. Since the repository is with you in Git, lots of things become fast: checking the log is blazing fast even for the oldest commits, and so is creating branches and doing commits. But this is not why Git or other distributed version control systems shine. I digress again.

Because you have the whole revision history for each branch and trunk, you can do something cool when merging. In Git, a branch is not a branch of a folder as is the case in SVN, it’s a branch of a commit, and Git remembers where each commit came from: which branch, and which revision. Let’s take our original example: branch at 100, merge at 200. Of course Git does not use numbers like this, as it is distributed, and if it auto-incremented, both you and I could check in and get version number 101, and then when merging this number would serve no purpose. So Git relies on cryptographic hashes based on the commit’s changes and author info to get revision ids. Anyway, let’s say those ids were 100 and 200, and that when we are merging the branch is feature[trunk*100] (Git keeps track of the origin of a branch). This is what Git does to merge: it goes back to revision 100, when both trunk and branch had the same content. Then it starts applying changes in the order they happened: let’s say the first change happened on trunk, so it applies that, then the next change on the branch, which it merges, and so forth. This is possible because the entire change history is available to Git. If there were no conflicts, by the end of it you have all the changes on trunk applied to the branch, and Git commits by default. This makes the branch become feature[trunk*200], because now it’s effectively a branch of revision 200 of trunk. You did not have to remember the revision numbers. Branch-based coding heaven!

What happens if the 30th commit leads to a conflict? I am not sure about it. If I were designing Git, I would probably just skip that commit and go on, and so on for each conflict-causing commit, and at the end I would apply all the conflicting commits on top. I am just speculating. Conflicts will still cause problems, but because changes are applied in the sequence in which they happened, there are fewer of the conflicts that arise when one big SVN-style patch is applied to a branch that has moved really far into the future. Incremental merging will be less error prone than such bulk merging.

I just realized I was wrong, Git does something even better (I am glad I did not design it :-): it stops at the first conflict and lets you resolve it manually before proceeding. [Strictly, this one-commit-at-a-time behaviour is what “git rebase” gives you; a plain “git merge” does a single three-way merge against the common ancestor and reports its conflicts together.] By the end of it you will have all the changes merged cleanly, and at any time you will be resolving only one conflict, whereas in an SVN-style bulk merge you would have to resolve conflicts caused by several conflicting changes at once.
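The point about not having to remember revision numbers can be tried out in a throwaway repository. The following is only a minimal sketch: the file names, branch name and commit messages are made up for the demo, and plain Git stands in for the SVN-backed setup. It uses “git merge-base” to show that Git works out the branch point on its own.

```shell
# Throwaway repository; every name below is hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "stable code" > app.txt
git add app.txt
git commit -q -m "trunk: initial code"
branch_point=$(git rev-parse HEAD)        # plays the role of "revision 100"

git checkout -q -b feature                # branch off
echo "new feature" > feature.txt
git add feature.txt
git commit -q -m "feature: start work"

git checkout -q master                    # meanwhile a bug fix lands on master
echo "bug fix" >> app.txt
git commit -q -a -m "trunk: bug fix"

git checkout -q feature
base=$(git merge-base master feature)     # Git finds the common ancestor itself
git merge -q -m "Merge master into feature" master

echo "common ancestor: $base"
git log --oneline --graph
```

No revision range is typed anywhere: the merge command is given only the name of the other branch.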

Enough of theory. But this still does not solve the problem for Vakow!: we still have others who do not understand Git, who like the simplicity of SVN or are just used to it and consider learning one revision control system enough for a lifetime. And because I have not yet had time to rewrite and deploy my SVN-Bugzilla integration scripts, or get someone else’s, because I am not sure whether it will just work with Dreamhost, and because of the lack of a TortoiseSVN equivalent, etc., I am still not ready to switch to Git on the server. Next month maybe, not yet. And this is from a sysadmin and CTO who is completely convinced that the switch will be beneficial in the long run! There are other poor souls who are stuck with SVN, either because their startup/company is still using SVN and is going to for some time, or because their favorite open source project is stuck with SVN, whether because code.google.com/sf.net only support SVN or because of the excellent SVN-Trac integration that so many open source projects are so fond of. Or for other reasons: maybe they want to switch but cannot decide between Git, Mercurial, Bazaar and a few others. I would advise them to just move to Git, but then again. For one reason or another, people are going to be stuck with SVN for some time, and for them there is Git-SVN.

Getting started with Git and SVN

Git SVN is a cool two-way bridge between Git and SVN, to be used when you love Git but your company/upstream team is stuck with SVN. I learnt about it from this blog post, and I am writing up my comments after about a month of full-time Git SVN usage.

First thing is getting SVN history into local Git:

git svn clone https://svn.foo.com/svn/proj --trunk=trunk --branches=branches --tags=tags

One of the peculiarities of my SVN repository was that I did not have a trunk when I began coding. I just got the startup idea and was at the 80th revision by the time I realized I had not followed the usual layout, and then I restructured my SVN repository into the usual trunk, branches, tags hierarchy. This led to some problems. Initially when I tried that command, I skipped the parameters, as the man page told me that those were the default values anyway. Obviously enough I got an error, and then remembered my SVN history. Then I panicked a little bit. I tried checking out just the trunk portion, but that failed too, as trunk was not there in the beginning. So as a last resort, without much hope, I tried the full command, supplying the default values for --trunk etc. And Git went to work. It skipped the first 80 or so commits, but I was happy as it got the remaining 2000 of them. It kept stopping because of network issues, as my network was flaky, but it was robust enough that simply restarting the process continued from where it had stopped. I was already becoming a fan of its robustness. :-)

The first thing I did after this was to move into the directory and run gitk. This is a GUI log browser, and I was quite delighted to see all the revisions going back more than a year, with search and color-coded diffs. It was way better than my old solutions: a ViewSVN-based website for browsing history, which was terribly slow, or TortoiseSVN’s log feature, which again was terribly slow and had no provision to search or to highlight by author etc. This alone was my justification for keeping a Git clone of my SVN repository fresh for quite some time, just to see the logs.

One of the reasons I picked Git over Mercurial was the concept of the index in Git. On more than one occasion I committed more than I intended when using SVN, and Mercurial was going to be the same in this regard, but not Git. In SVN and all other decent version control systems, a file has to be added manually before the system starts keeping track of it. The problem is that many times during debugging I would change more than what was minimally needed to fix the issue, and would then have to be really careful to pick only the files I intended to commit. This is where TortoiseSVN shines: it made this process very robust, at least if you follow the best practices. On the command line, this led to errors. So I was quite interested in Git, in which you have to add the file again after every change, as Git does not track files, it tracks content, and it commits only the content that was there when you added the file using “git add”.

Anyway, if you prefer, you can get commit behavior very similar to SVN’s, but I like the Git default.

First things first. By the end of “git svn clone” this is what will have happened: you will have a folder named after your project, derived from the SVN path. This folder will contain the latest trunk.

Note: Git repositories are not cluttered with .svn-like folders all over; there is only one .git folder, in the top-level folder, which contains all Git-related data.

Now the work begins.

Let’s say you made some changes in trunk. You can view the changes with “git diff”. If you jump ahead and add a file that you have decided to commit by calling “git add filename”, “git diff” will stop showing the changes in that file, or more strictly, the changes made to that file up to the moment you added it. Those changes have gone into the “index”. To see the changes in the index you have to run “git diff --cached”.

You can always see the status of the files you have modified or added to the index for checkin by running “git status”.
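The way a change moves from the working tree into the index can be watched in a scratch repository. This is a minimal sketch with a made-up file name (notes.txt), not part of the original setup; it captures the output of “git diff” and “git diff --cached” right after “git add”, then makes one more edit that stays outside the index.

```shell
# Throwaway repository; notes.txt is a hypothetical file.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "line one" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "line two" >> notes.txt
git diff --stat                  # the edit shows up as an unstaged change

git add notes.txt
unstaged=$(git diff)             # nothing left to show: the edit is in the index
staged=$(git diff --cached)      # the edit is visible here instead

echo "line three" >> notes.txt   # made after git add, so it is not in the index
git status --short               # one file, with both staged and unstaged changes
```

Committing at this point would record “line two” but not “line three”, which is exactly the behaviour described above.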

The next thing we are going to do is commit. As you have seen already, just changing a file is not enough, you have to add it again before you can commit anything. You commit by running “git commit”, obviously enough, but if you are a command line warrior, you will miss/hate the fact that Git does not treat “git ci” as the same as “git commit”, as SVN does with “svn ci”. But if you are on a decent shell and operating system, the excellent tab completion won’t let you miss it all that much. Anyway. And yes, if you are coming from SVN, don’t be surprised by the speed of git commit: it’s nearly instantaneous because it is committing to your local branch. Your fellow developers using SVN will not see the commit yet. But you can go on committing while the net is not available.
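For those who do miss the short form, Git lets you define your own aliases through its configuration. The sketch below is one possible setup rather than anything the original workflow relied on; the alias name “ci” and the file name are just examples.

```shell
# Throwaway repository; hello.txt and the alias name are hypothetical choices.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Define the alias for this repository only; add --global to have it everywhere.
git config alias.ci commit

echo "hello" > hello.txt
git add hello.txt
git ci -q -m "Commit through the ci alias"
git log --oneline
```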

If you do not like the process of adding a file before committing, and prefer the SVN way, you can run “git commit -a”, which will pick up the changes in all files that are already being tracked.
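A small sketch of what “git commit -a” does and does not pick up, again in a scratch repository with made-up file names: the modified tracked file is committed without an explicit “git add”, while a brand-new file is left alone.

```shell
# Throwaway repository; both file names are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "v1" > tracked.txt
git add tracked.txt
git commit -q -m "Add tracked file"

echo "v2" > tracked.txt          # change to a file Git already tracks
echo "scratch" > untracked.txt   # new file that was never added

git commit -q -a -m "Update tracked file"   # no git add needed for tracked.txt
git status --short               # untracked.txt is still listed as untracked
```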

No point committing if nobody can see it. To push your changes upstream, into the real SVN repository, you have to run “git svn dcommit”. This will commit all the changes on the current branch that have not yet been committed to SVN.

A note about SVN pre-commit hooks: some places have SVN pre-commit hooks that do not let a commit through unless the log message mentions the bug number, or the file includes a copyright notice at the top, or the code conforms to the formatting rules, etc. In those cases the previous step may cause problems if you did not conform to those rules while committing. The obvious answer is to be careful, but that is not always enough. If possible you should learn about Git commit hooks and create ones matching your SVN repository’s commit hooks to ensure that such errors do not take place. This will mean, say, checking that the bug exists before each git commit happens, which slows down the whole blazing git commit experience, but this is how it is. If you want everything, you have to be really smart to avoid those pesky hooks altogether, but then if you don’t use tools and you look like us, most probably you are a chimp. For the purposes of this howto, just understand that with Git it is trivial to undo your commits and redo them if you want to fix some old commit, but spare yourself the trouble: write Git hooks, and get the tools working for you [if you have upstream SVN pre-commit hooks. Which BTW you should.].
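As an illustration of mirroring such a rule locally, here is a minimal sketch of a Git commit-msg hook. The rule it enforces (the message must contain something like “bug #123”) is an assumption made up for the demo; a real hook would have to copy whatever your SVN server actually checks.

```shell
# Throwaway repository; the "bug #<number>" convention is hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

mkdir -p .git/hooks
cat > .git/hooks/commit-msg <<'EOF'
#!/bin/sh
# Git passes the path of the file holding the proposed commit message as $1.
if ! grep -q -E 'bug #[0-9]+' "$1"; then
    echo "commit message must mention a bug number, e.g. 'bug #123'" >&2
    exit 1
fi
EOF
chmod +x .git/hooks/commit-msg

echo "fix" > fix.txt
git add fix.txt

# First attempt has no bug number, so the hook should refuse it.
if git commit -q -m "Fix crash on login"; then rejected=no; else rejected=yes; fi
echo "first commit rejected: $rejected"

# Second attempt follows the rule and goes through.
git commit -q -m "Fix crash on login (bug #123)"
```

The hook lives only in your local .git folder, so each developer using Git SVN has to install it in their own clone.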

The above step, “git svn dcommit”, will also update your code with SVN changes made by others. But that only happens if you have some changes to commit, and probably only the changes required to merge your commit will be brought in. So to robustly sync your trunk or branch with the one in the SVN repository, you should run “git svn rebase” on the branch from time to time.

Q: What is the equivalent of “svn revert file”? A: “git checkout file”.
Q: What is the equivalent of “svn copy”? A: None. Git will detect the copy; just copy the file and git add it before committing.
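The revert answer can be checked in a scratch repository. This is a minimal sketch with a made-up file name; it writes the command with a “--” separator, which tells Git that what follows is a file path and not a branch name.

```shell
# Throwaway repository; config.txt is a hypothetical file.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "original" > config.txt
git add config.txt
git commit -q -m "Add config"

echo "experiment gone wrong" > config.txt   # uncommitted local edit
git checkout -- config.txt                  # throw the edit away
restored=$(cat config.txt)
echo "file content is back to: $restored"
```

Like “svn revert”, this discards the uncommitted edit for good, so it is worth a “git diff” first.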

The wonder of Git Stash

One of the coolest things I find in Git is the “git stash” command. This takes all your uncommitted changes, puts them in a hidden location, and reverts to the last checked-in pristine state. Many operations, like “git svn dcommit”, “git svn rebase” etc., require that you have all changes checked in and no uncommitted changes lying around. You may have precious changes, like local settings files, that you don’t want to check in but don’t want to lose either. So you stash them before those operations. Think of a stash as a named patch managed by Git for you. You can apply the latest changes that you stashed by running “git stash apply”. Your typical work flow could be:

  • hack hack
  • git add
  • git commit
  • git stash
  • git svn dcommit
  • git stash apply
  • go to hack hack

Remember that every time you run “git stash” a new patch is created and stored for you, so you may want to run “git stash clear” from time to time to get rid of old stashes. To list the stored stashes, run “git stash list”. The name of each stash is pretty arcane, something like stash@{0}, and you have to type it in full to refer to a stored stash by name. If you are working with branches, you may have many stashes that you want to keep around containing changes meaningful to you, so you can give them a meaningful description by using the command “git stash save ‘my description’” instead of “git stash”. To apply a stash that is not on top of the list, run “git stash apply stash@{2}” or so, after getting the proper name from “git stash list”. Remember that the stash/patch is applied to the current branch.
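The stash round trip can be sketched without an SVN server. In the following demo the file names and the stash description are made up, and a comment marks the spot where “git svn dcommit” or “git svn rebase” would run in the real workflow.

```shell
# Throwaway repository; file names and the stash description are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "code" > app.txt
echo "debug=false" > settings.txt
git add app.txt settings.txt
git commit -q -m "Initial commit"

echo "debug=true" > settings.txt          # precious local change, not for checkin
git stash save -q "local settings"        # tuck it away under a description
clean=$(git status --porcelain)           # empty: the working tree is pristine

# ... this is where git svn dcommit or git svn rebase would run ...

git stash list                            # shows stash@{0} with the description
git stash apply -q                        # bring the local change back
cat settings.txt
```

Note that “git stash apply” leaves the stash in the list, which is why old entries pile up until “git stash clear” is run.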

Working with branches

Now the true wonder of Git. It confused me quite a bit initially, so hopefully this writeup will help a Git newbie.

Some basics: branches in Git are of two types, local and remote. You cannot work on remote branches directly; only by branching them locally can you commit any changes. The SVN trunk and other branches, and tags for that matter, are visible to Git as remote branches, and “git svn clone”, the first step in this howto, has created a local branch from trunk called master and checked it out for you.

To stay on top of branches, get into the habit of running “git branch”. This shows all local branches and indicates the current one. If you have followed this writeup, you should have a local Git branch called master, and “git branch” will output just “* master”, the * meaning that master is the currently checked out branch, whose content you can see in the current directory. “git branch -a” will show you all the branches, local and remote.

If you want to explore any SVN branch or tag, which is a remote branch in Git’s world, you can check it out:

git checkout b_web20

This command will bring the content of the current directory to the state at the HEAD of the b_web20 SVN branch. You can look but you cannot commit. If you run “git branch” now, it will show “* (no branch)”, as you are viewing a remote branch.

To start work on any of the branches or trunk, you have to create a local branch first, and that is done using “git checkout -b local_branch_name remote_branch_name”. So you can say “git checkout -b web20 b_web20” and it will create a branch for you and select it, so that the content of the current folder reflects that branch. Now if you run “git branch”, it will show “* web20”, and also “master”, since that was created by git svn clone and is still around, a copy of trunk.
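The difference between looking at a remote branch and working on a local one can be sketched without an SVN server. The demo below is an approximation: it fakes the remote branch by creating a ref named b_web20 under refs/remotes with “git update-ref”, standing in for what “git svn clone” would have set up. Newer Git versions word the “(no branch)” line differently, so the sketch checks the state with “git symbolic-ref” instead of parsing that text.

```shell
# Throwaway repository; the remote ref is faked to stand in for git svn's work.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "trunk code" > app.txt
git add app.txt
git commit -q -m "trunk commit"

# Assumption for the demo: pretend this remote branch came from SVN.
git update-ref refs/remotes/b_web20 HEAD

git checkout -q b_web20                   # look around; no local branch is selected
if git symbolic-ref -q HEAD >/dev/null; then on_branch=yes; else on_branch=no; fi
echo "on a local branch: $on_branch"

git checkout -q -b web20 b_web20          # create and select a local branch
current=$(git rev-parse --abbrev-ref HEAD)
git branch -a                             # lists master, web20 and the remote ref
```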

Note: There is one more idiosyncrasy that you will have to learn. Sometimes someone will create a new branch in SVN, and you will want to work on it, but you won’t find it when you run “git branch -a”, and neither “git svn rebase” nor “git svn dcommit” will help. You will have to execute “git svn fetch” to get the new branch. Why? Beats me. [I guess rebase only rebases the current branch, and dcommit only syncs new commits on the current branch; because both work with the current branch, they don’t care about other new branches. Programmers may be smart but they are seldom nice.]

So you have created lots of local branches reflecting the remote SVN branches. You can make changes and commit, and “git svn dcommit” will push the commits to the appropriate remote branch for you: commits in master <= trunk will go to trunk, and those in web20 <= b_web20 will go to b_web20.

Now comes the question of merging. The first use case is: you are working on branch web20, which is the local branch for remote b_web20, but changes have happened in trunk that you want to merge into web20. You have to run “git merge master” while you have branch web20 checked out. (Strictly, I am assuming b_web20 was created from trunk.) It will merge the changes and commit them for you to your local branch web20. You can run “git merge --no-commit master” to avoid the commit.

Note: “git commit --amend” can be used at any time to amend the log message of the previous commit. I often find this useful for tailoring the commit log when I accidentally run “git merge” without the “--no-commit” flag.
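The first merge scenario, including the two flags mentioned above, can be rehearsed in a scratch repository. This is a minimal sketch in plain Git with made-up file names and messages; master plays the role of trunk and no SVN server is involved.

```shell
# Throwaway repository; file names and commit messages are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "trunk: base"

git checkout -q -b web20                  # feature branch
echo "web 2.0 feature" > web20.txt
git add web20.txt
git commit -q -m "web20: feature work"

git checkout -q master                    # a fix lands on trunk in the meantime
echo "fix" > fix.txt
git add fix.txt
git commit -q -m "trunk: bug fix"

git checkout -q web20
git merge -q --no-commit master           # merge is staged but not committed yet
pending=$(git status --porcelain)         # non-empty: fix.txt is waiting in the index
git commit -q -m "Merge trunk into web20"

# Changed your mind about the wording? Amend the message of the last commit.
git commit -q --amend -m "Merge trunk bug fix into web20"
git log --oneline --graph
```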

The second scenario is: you are satisfied with the branch and you want to merge it into trunk. You can do so with “git pull . web20” while you have branch master, which was created from trunk, checked out. Be careful: if you run “git merge web20” instead, the local master branch will get associated with remote b_web20, and nothing will be merged. If that happens you can get another copy of trunk by running “git checkout -b master2 trunk” and running the proper “git pull” in it. This too will commit the change, and you may want to amend the commit log. Also remember that either of these merges will merge and commit in your local Git repository only; you will have to run “git svn dcommit” to push these changes to the SVN repository.

An unused branch can be deleted by running “git branch -d branchname”. Note that this will not delete the branch unless all local commits to it have been pulled or merged into some other branch.
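The safety check in “git branch -d” is easy to see in a scratch repository. In this minimal sketch (branch and file names are made up) the first delete attempt is refused because the branch has a commit that the current branch does not contain, and the second succeeds once that commit has been merged.

```shell
# Throwaway repository; the branch and file names are hypothetical.
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "base" > app.txt
git add app.txt
git commit -q -m "base"

git checkout -q -b experiment
echo "idea" > idea.txt
git add idea.txt
git commit -q -m "experiment: try an idea"

git checkout -q master
# Refused: experiment holds a commit that master does not have yet.
if git branch -d experiment 2>/dev/null; then deleted_early=yes; else deleted_early=no; fi
echo "deleted before merging: $deleted_early"

git merge -q experiment                   # bring the commit into master
git branch -d experiment                  # now the delete goes through
```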

PS: Vakow! is hiring, so if you want to work with a really cool startup in Mumbai, get in touch!

PS: Read more about git on my git page.
