The largest Git repo | Hacker News

Original source (news.ycombinator.com)
Tags: git version-control microsoft news.ycombinator.com
Clipped on: 2017-05-27

Windows, because of the size of the team and the nature of the work, often has VERY large merges across branches (10,000’s of changes with 1,000’s of conflicts).

At a former startup, our product was built on Chromium. As the build/release engineer, one of my daily responsibilities was merging Chromium's changes with ours.

Just performing the merge and conflict resolution was anywhere from 5 minutes to an hour of my time. Ensuring the code compiled was another 5 minutes to an hour. If someone on the Chromium team had significantly refactored a component, which typically occurred every couple weeks, I knew half my day was going to be spent dealing with the refactor.

The Chromium team at the time was many dozens of engineers, landing on the order of a hundred commits per day. Our team was a dozen engineers landing maybe a couple dozen commits daily. A large merge might have on the order of 100 conflicts, but typically it was just a dozen or so conflicts.

Which is to say: I don't understand how it's possible to deal with a merge that has 1k conflicts across 10k changes. How often does this occur? How many people are responsible for handling the merge? Do you have a way to distribute the conflict resolution across multiple engineers, and if so, how? And why don't you aim for more frequent merges so that the conflicts aren't so large?

(And also, your merge tool must be incredible. I assume it displays a three-way diff and provides an easy way to look at the history of both the left and right sides from the merge base up to the merge, along with showing which engineer(s) performed the change(s) on both sides. I found this essential many times for dealing with conflicts, and used a mix of the git CLI and Xcode's opendiff, which was one of the few at the time that would display a proper three-way diff.)

When you have that many conflicts, it's often due to massive renames, or just code moves.

If you use git-mediate[1], you can re-apply those massive changes on the conflicted state, run git-mediate - and the conflicts get resolved.

For example: if you have 300 conflicts due to some massive rename, you can type in:

  git-search-replace.py[2] -f oldGlobalName///newGlobalName
  git-mediate -d
  Successfully resolved 377 conflicts and failed resolving 1 conflict.
  <1 remaining conflict shown as 2 diffs here representing the 2 changes>

[1] https://medium.com/@yairchu/how-git-mediate-made-me-stop-fea...

[2] https://github.com/da-x/git-search-replace

Also git rerere

When maintaining multiple release lines and moving fixes between them:

Don't use a bad branching model. Things like "merging upwards" (=committing fixes to the oldest branch requiring the fix, then merging the oldest branch into the next older branch etc.), which seems to be somewhat popular, just don't scale, don't work very well, and produce near-unreadable histories. They also incentivise developing on maintenance branches (ick).

Instead, don't do merges between branches. Everything goes into master/dev, except stuff that really doesn't need to go there (e.g. a fix that only affects specific branches). Then cherry-pick the fixes into the maintenance branches.

Cherry picking hotfixes into maint branches is cool until you have stuff like diverging APIs or refactored modules between branches. I don't know of a better solution; it kind of requires understanding in detail what the fix does and how it does it, then knowing if that's directly applicable to every release which needs to be patched.

Use namespaces to separate API versions?

POST /v4/whatever

POST /v3/whatever

We version each of our individual resources, so a /v1/user might have many /v3/post. Seems to work for us as a smaller engineering team.

Yes, although this applies to all forms of porting changesets or patches between branches or releases.

I don't understand the hate for merging up. I've worked with the 'cherry-pick into release branches' model, and also with an automated merge-upwards model, and I found the automerge to be WAY easier to deal with. If you make sure your automerge is integrated into your build system, so a failing automerge is a red build that sends emails to the responsible engineers, I found that doing it this way removed a ton of the work that was necessary for cherry-picking. I can understand not liking the slightly-messier history you get, but IMO it was vastly better. Do you have other problems with it, or just 'unreadable' histories and work happening on release branches? Seems like a good trade to me.

As the branches diverge, merges take more and more time to do (up to a couple hours, at which point we abandoned the model)... they won't be done automatically. Since merges are basically context-free it's hard to determine the "logic" of changed lines. Since merges always contain a bunch of changes, all have to be resolved before they can be tested, and tracing failures back to specific conflict resolutions takes extra time. Reviewing a merge is seriously difficult. Mismerges are also far more likely to go unnoticed in a large merge compared to a smaller cherry pick. With cherry picking you are only considering one change, and you know which one. You only have to resolve that one change, and can then test, if you feel that's necessary, or move on to the next change.

Also: https://news.ycombinator.com/item?id=14413681

I also observed that getting rid of merging upwards moved the focus of development back to the actual development version (=master), where it belongs.

I just looked at git-mediate and I'm very confused. It appears that all it does is remove the conflict markers from the file after you've already manually fixed the conflict. Except you need to do more work than normal, because you need to apply one of the changes not only to the other branch's version but also to the base. What am I missing here, why would I actually want to use git-mediate when I'm already doing all the work of resolving the conflicts anyway?

It looks like git-mediate does one more important thing; it checks that the conflict is actually solved. In my experience it's very easy to miss something when manually resolving a conflict and often the choices the merge tools give you are not the ones you want.

IntelliJ IDEA default merger knows when all conflicts in a file are handled and shows you a handy notification on top "Save changes and finish merging".

TFS's conflict resolver also does this

Well, all it checks is that you modified the base case to look like one of the two other cases. That doesn't actually tell you if you resolved the conflict though, just that you copied one of the cases over the base case.

True, but if you follow a simple mechanical guideline: Apply the change to the 2 other versions (preferably the base one last) - then your conflict resolutions are going to be correct.

From experience with many developers using this method, conflict resolution errors went down to virtually zero, and conflict resolution time has improved by 5x-10x.

There's a much simpler mechanical guideline that works without git-mediate: Apply the change to the other version. git-mediate requires you to apply the change twice, but normal conflict resolution only requires you to apply it once.

Except you don't really know if you actually applied the full change to the other version. That's what applying it to the base is all about.

You often take the apparent diff, apply it to the other version, and then git-mediate tells you "oops, you forgot to also apply this change". And this is one of the big sources of bugs that stem from conflict resolutions.

Another nice thing about git-mediate is that it lets you safely do the conflict resolution automatically, e.g: via applying a big rename as I showed in the example, and seeing how many conflicts disappear. This is much safer than manually resolving.

Applying the change to the base doesn't prove that you applied the change to the other version. It only proves that you did the trivial thing of copying one version over the base. That's kinda the whole point of my complaint here, git-mediate is literally just having you do busy-work as a way of saying "I think I've applied this change", and that busy-work has literally no redeeming value because it's simply thrown away by git-mediate. Since git-mediate can't actually tell if you applied the change to the other version correctly, you're getting no real benefit compared to just deleting the conflict markers yourself.

The only scenario in which I can see git-mediate working is if you don't actually resolve conflicts at all but instead just do a project-wide search & replace, but that's only going to handle really trivial conflicts, and even then if you're not actually looking at the conflict you run the risk of having the search & replace not actually do what it's supposed to do (e.g. catching something it shouldn't).

That's why I showed an example of a rename. You write "manually fixed the conflict", where do you see that in the rename example?

You just re-apply either side of the changes in the conflict (Base->A, or Base->B) and the conflict is then detected as resolved. Reapplying (e.g: via automated rename) is much easier than what people typically mean by "manually resolving the conflict".

Also, as a pretty big productivity boost, it prints the conflicts in a way that lets many editors (sublime, emacs, etc) directly jump to conflicts, do the "git add" for you, etc. This converts your everyday editor into a powerful conflict resolution tool. Using the editing capabilities in most merge tools is tedious.
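A minimal sketch of the resolution rule discussed in this subthread, written in Python for illustration; it is not git-mediate's actual code. It assumes diff3-style conflict markers (side A, base, side B), and the name `resolve_trivial`, the regex, and the sample text are all hypothetical. A hunk counts as resolved once the base has been edited to match one side, in which case the other side is kept.

```python
import re

# One diff3-style conflict hunk: side A, then the base, then side B.
CONFLICT = re.compile(
    r"^<<<<<<<[^\n]*\n(?P<a>.*?)"
    r"^\|\|\|\|\|\|\|[^\n]*\n(?P<base>.*?)"
    r"^=======\n(?P<b>.*?)"
    r"^>>>>>>>[^\n]*\n",
    re.DOTALL | re.MULTILINE,
)

def resolve_trivial(text):
    """Return (new_text, resolved_count, unresolved_count)."""
    resolved = unresolved = 0

    def pick(match):
        nonlocal resolved, unresolved
        a, base, b = match.group("a"), match.group("base"), match.group("b")
        if a == base:        # A's change is now in the base: keep B
            resolved += 1
            return b
        if b == base:        # B's change is now in the base: keep A
            resolved += 1
            return a
        if a == b:           # both sides ended up identical
            resolved += 1
            return a
        unresolved += 1      # still a real conflict: leave the hunk untouched
        return match.group(0)

    return CONFLICT.sub(pick, text), resolved, unresolved

# Side A renamed a global; side B changed its value.
conflicted = (
    "int x;\n"
    "<<<<<<< HEAD\n"
    "newGlobalName = 1;\n"
    "||||||| base\n"
    "oldGlobalName = 1;\n"
    "=======\n"
    "oldGlobalName = 2;\n"
    ">>>>>>> theirs\n"
)
# Re-apply the rename everywhere, as in the search-and-replace example.
renamed = conflicted.replace("oldGlobalName", "newGlobalName")
merged, resolved, unresolved = resolve_trivial(renamed)
```

Before the rename is re-applied, none of the three parts match, so the hunk is left in place and reported as unresolved. This is the check that the thread argues about: it verifies that the base now equals one side, not that the surviving side is semantically correct.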

or worse, formatting changes

Not sure why companies don't develop great merge tools, looks like there is a big market for them.

Windows developed an extension that lets them do conflict resolution in the web. We have a server-side API that it calls into, but the extension isn't fundamentally different from using BeyondCompare or $YOUR_FAVORITE_MERGETOOL.

Is this CodeFlow you're talking about?

An extension to what? Could you open source it?

Extension to VSTS [1], sorry. We're working with them on making the extension available to the public. It's possible we could open source it as well; I'll poke around.

[1] https://www.visualstudio.com/en-us/docs/integrate/extensions...

For 3 way merging, I've had good luck with beyondcompare

Generally, it's the responsibility of whoever made the changes to resolve conflicts (basically, git blame conflicting lines and the people who changed them get notified to resolve conflicts in that file). Distributing the work like this makes the merges more reasonable.
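A rough sketch of how that distribution could be computed, assuming the conflicting lines and their blame data are already available. The function name `assign_conflicts`, the input shapes, and the sample paths and authors are hypothetical; in practice the blame mapping would come from something like git blame.

```python
from collections import defaultdict

def assign_conflicts(conflicts, blame):
    """Group conflicted files by the author who last touched the lines.

    conflicts: iterable of (path, line_number) pairs that are in conflict.
    blame:     dict mapping (path, line_number) -> author name.
    """
    todo = defaultdict(set)
    for path, line in conflicts:
        author = blame.get((path, line), "unassigned")
        todo[author].add(path)
    # Sorted lists make the notification emails deterministic.
    return {author: sorted(paths) for author, paths in todo.items()}

# Hypothetical sample data.
conflicts = [("net/socket.c", 10), ("net/socket.c", 42), ("ui/menu.c", 7)]
blame = {
    ("net/socket.c", 10): "alice",
    ("net/socket.c", 42): "bob",
    ("ui/menu.c", 7): "alice",
}
assignments = assign_conflicts(conflicts, blame)
```

Each author then receives only the files where their own lines conflict, which is what keeps a merge with thousands of conflicts tractable.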

Yep, this is the way I work. Always rebase against master, and fix conflicts there, so branches merging into master should always be up-to-date and have zero conflicts.

How does that work?

You first rebase against the master locally and push the merged feature branch after resolving all the conflicts yourself. Afterwards, you go to the master and merge it against the updated feature branch. The 2nd merge should not result in any conflicts.

This model can be annoying on running feature branches. Once you rebase, you have to force-push to the remote feature branch. It's not so bad if you use --force-with-lease to prevent blowing away work on the remote, but it still means a lot of rewriting history on anything other than one-off branches.

No no, you never force push to "published" branches, e.g., upstream master. What you're doing when you rebase onto the latest upstream is this: you're making your local history the _same_ as the upstream, plus your commits as the latest commits, which means if you push that, then you're NOT rewriting the upstream's history.

(In the Sun model one does rewrite project branch history, but one also leaves behind tags, and downstream developers use the equivalent of git rebase --onto. But the true upstream never rewrites its history.)

That's what I thought as well. But what happens when you need to rebase the feature branch against master? Won't you have to force push that rebase?

There's nothing stopping you from doing merges instead of rebases.

       * latest feature commit #3 (feature)
       * merge
     * | more master commits you wanted to include (master)
     | * feature commit #2
     | * merge
     * | master commits you wanted to include
     | * feature commit #1
     *   original master tip
     *   master history...

Then, when you're done with feature, if you really care about clean history, just squash the entire history of the feature branch into one or more commits based on the latest from master. I think git checkout -b newbranch master; git merge --squash feature; git commit does the trick here:

     *   feature commits #1, #2 and #3 (newbranch)
     | * latest feature commit (feature)
     | * merge
     * | more master commits you wanted to include (master)
     | * feature commit #2
     | * merge
     * | master commits you wanted to include
     | * feature commit #1
     *   original master tip
     *   master history...

Then checkout master, rebase newbranch, test it out and if you're all good, delete or ignore the original.

     * feature commits #1, #2 and #3 (master, newbranch)
     * more master commits you wanted to include
     * master commits you wanted to include
     * original master tip
     * master history...

I've described this. Downstreams of the feature branch rebase from their previous feature branch merge base (a tag for which is left behind to make it easy to find it) --onto the new feature branch head.

E.g., here's what the feature branch goes through:

feature$ git tag feature_05

<time passes; some downstreams push to feature branch/remote>

feature$ git fetch origin
feature$ git rebase origin/master
feature$ git tag feature_06

And here's what a downstream of the feature branch goes through:

downstream$ git fetch feature_remote

<time passes; this downstream does not push in time for the feature branch's rebase>

downstream$ git rebase --onto feature_remote/feature_06 feature_remote/feature_05

Easy peasy. The key is to make it easy to find the previous merge base and then use git rebase --onto to rebase from the old merge base to the new merge base.

Everybody rebases all the time. Everybody except the true master -- that one [almost] never rebases (at Sun it would happen once in a blue moon).
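A toy model of the git rebase --onto step above, in Python, assuming history can be treated as a simple list of commits. The function `rebase_onto` and the commit labels are hypothetical; a trailing apostrophe marks a commit that was re-created by a rebase.

```python
def rebase_onto(branch, old_base, new_base):
    """Replay the commits of `branch` that come after `old_base` onto `new_base`.

    branch, new_base: lists of commit labels, oldest first.
    old_base:         label of the previous merge base (the tag left behind).
    """
    cut = branch.index(old_base) + 1
    local = branch[cut:]                    # commits unique to the downstream
    return new_base + [c + "'" for c in local]

# The feature branch as tagged feature_05, and a downstream built on top of it.
feature_05 = ["m1", "m2", "f1", "f2"]
downstream = feature_05 + ["d1", "d2"]

# The feature branch after rebasing onto a newer master (tagged feature_06).
feature_06 = ["m1", "m2", "m3", "f1'", "f2'"]

rebased = rebase_onto(downstream, old_base="f2", new_base=feature_06)
```

The old tag is what makes the cut point easy to find: only d1 and d2 are replayed, and the stale copies of f1 and f2 are dropped instead of conflicting with their rebased versions.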

For 3-way merging the best tool I've found is Steve Losh's splice.vim (https://github.com/sjl/splice.vim/).

Yes back when using Linux, I used Meld a lot. I can recommend it - the directory comparison is good.

Also KDiff3. Struggling to remember the other ones I used to try unfortunately.

I spent the better part of a year in a team that was merging and bug fixing our company's engine releases into the customer (and owning company) code base. We also had to deal with different code repositories and version control systems.

We ended up with a script that created a new git repository and checked out the base version of the code there. It then created a branch for our updated release and another for their codebase, and attempted a merge between the two. For any files which couldn't be automatically merged, it created a set of 4 files: the original, then a triplet for doing a 3-way merge.

This is also when I bought myself a copy of Beyond Compare 4, which fits perfectly for the price and feature set for what we needed.

It's not nearly as bad when you have all the people who work on it with you.

You look at who caused conflicts and send out emails. Don't land the merge until all commits are resolved. People who don't resolve their changes get in trouble.

Not perfect, but there it is.

I worked on a Chromium-based product as well and had the exact same problem. Eventually we came up with a reasonable system for landing commits and just tried our best to build using Chromium, but not having actual patches. Worked okay, not great. Better than the old system of having people do it manually/just porting our changes to each release.

It's usually not so bad. Generally a big project is actually split up into lots of different smaller projects, each of which has an owner. And typically a dev won't touch code that isn't theirs except in unusual cases. Teams that have more closely related or dependent code would typically try to work closer to each other and share code more often than teams that are more separated.

Most changes don't touch hundreds/thousands of files. If you were to split the repo into many (as Windows used to be) then you'd still have the problem of huge projects having MANY conflicts, but worse: now you need to do your merge/rebase for each such repo.

In any case, rebasing is better than merging. (Rebasing is a series of merges, naturally, but still.)

Are you aware of KDiff3[1]? If you are why do you prefer Xcode's opendiff?

[1] http://kdiff3.sourceforge.net/

For me, I found I did not need to do 3 way merges that often and opendiff's native UI fits in better than kdiff3 (for me). I think kdiff3 was Qt? Despite Trolltech's best efforts, Qt does not feel native on a Mac.

This isn't to say that KDiff3 isn't great - it is.

I don't think KDiff3 looks native anywhere, and I don't think it's because of Qt. It uses weird fonts and icons, and for some reason their toolbar buttons just look wrong.

Still an extremely useful program.

God, that sounds hellish.

Archive Team is making a distributed backup of the Internet Archive. http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK Currently the method getting the most attention is to put the data into git-annex repos, and then have clients just download as many files as they have storage space for. But because of limitations with git, each repo can only handle about 100,000 files even if they are not "hydrated". http://git-annex.branchable.com/design/iabackup/ If git performance were improved for files that have not been modified, this restriction could be lifted and the manual work of dividing collections up into repos could be a lot lower.

Edit: If you're interested in helping out, e.g. porting the client to Windows, stop by the IRC channel #internetarchive.bak on efnet.

internet archive sounds like the best ever use case for IPFS

It was considered but it just didn't get enough attention from anyone to get it done. http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i...

Could bup be useful in addition to git-annex? https://github.com/bup/bup

Wouldn't IPFS be much, much more suitable for this purpose?

IIRC, there are permanence and equitable sharing guarantee concerns with IPFS. The former at least can be helped by pinning I think.

Ah, I see, yes. It would be probabilistic, unless there was some way to coordinate sharing (eg downloading the least shared file first).
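A small sketch of the "least shared file first" idea, assuming a coordinator publishes how many copies of each file already exist. The function `pick_files`, its inputs, and the file names are hypothetical and not part of IPFS or git-annex.

```python
def pick_files(replicas, sizes, capacity):
    """Greedily choose files to store, rarest first.

    replicas: dict file name -> number of known copies.
    sizes:    dict file name -> size (same unit as capacity).
    capacity: free space this client is willing to donate.
    """
    chosen, free = [], capacity
    # Fewest copies first; the name breaks ties so the order is stable.
    for name in sorted(replicas, key=lambda n: (replicas[n], n)):
        if sizes[name] <= free:
            chosen.append(name)
            free -= sizes[name]
    return chosen

# Hypothetical sample data.
replicas = {"a.warc": 5, "b.warc": 1, "c.warc": 2}
sizes = {"a.warc": 40, "b.warc": 70, "c.warc": 50}
selection = pick_files(replicas, sizes, capacity=120)
```

With this kind of coordination the least-replicated files get picked up first, instead of relying on chance to spread copies evenly.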

At Sun Microsystems, Inc., (RIP) we had many "gates" (repos) that made up Solaris. Cross-gate development was somewhat more involved, but still not bad. Basically: you installed the latest build of all of Solaris, then updated the bits from your clones of the gates in question. Still, a single repo is great if it can scale, and GVFS sounds great!

But that's not what I came in to say.

I came in to describe the rebase (not merge!) workflow we used at Sun, which I recommend to anyone running a project the size of Solaris (or larger, in the case of Windows), or, really, even to much smaller projects.

For single-developer projects, you just rebased onto the latest upstream periodically (and finally just before pushing).

For larger projects, the project would run their own upstream that developers would use. The project would periodically rebase onto the latest upstream. Developers would periodically rebase onto their upstream: the project's repo.

The result was clean, linear history in the master repository. By and large one never cared about intra-project history, though project repos were archived anyways so that where one needed to dig through project-internal history ("did they try a different alternative and found it didn't work well?"), one could.

I strongly recommend rebase workflows over merge workflows. In particular, I recommend it to Microsoft.

A problem with rebase workflows that I don't see addressed (here or in the replies) is: if I have, say, 20 local commits and am rebasing them on top of some upstream, I have to fix conflicts up to 20 times; in general I will have to stop to fix conflicts at least as many times as I would have to while merging (namely 0 or 1 times).

Moreover, resolution work during a rebase creates a fake history that does not reflect how the work was actually done, which is antithetical to the spirit of version control, in a sense.

A result of this is the loss of any ability to distinguish between bugs introduced in the original code (pre-rebase) vs. bugs introduced while resolving conflicts (which are arguably more likely in the rebase case since the total amount of conflict-resolving can be greater).

It comes down to Resolution Work is Real Work: your code is different before and after resolution (possibly in ways you didn't intend!), and rebasing to keep the illusion of a total ordering of commits is an outdated use (or misuse) of the abstractions we now have available, which can understand projects' evolution in a more sophisticated way.

I was a dedicated rebaser for many years but have since decided that merging is superior, though we're still at the early stages of having sufficient tooling and awareness to properly leverage the more powerful "merge" abstraction, imho.

Well, git rerere helps here, though, honestly, this never happens to me even when I have 20 commits. Also, this is what you want, as it makes your commits easier to understand by others. Otherwise, with thousands of developers your merge graph is going to be a pile of incomprehensible spaghetti, and good luck cherry-picking commits into old release patch branches!

Ah, right, that's another reason to rebase: because your history is clean, linear, and merge-free, it makes it easier to pick commits from the mainline into release maintenance branches.

The "fake history" argument is no good. Who wants to see your "fix typo" commits if you never pushed code that needed them in the first place? I truly don't care how you worked your commits. I only care about the end result. Besides, if you have thousands of developers, each on a branch, each merging, then the upstream history will have an incomprehensible (i.e., _useless_) merge graph. History needs to be useful to those who will need it. Keep it clean to make it easier on them.

Rebase _is_ the "more powerful merge abstraction", IMO.

rebase : centralized repo :: merge : decentralized repo

rebase : linked-list :: merge : DAG

If the work/repo is truly distributed and there isn't a single permanently-authoritative repo, a "clean, linear" history is nonsensical to even try to reason about.

In all cases it is a crutch: useful (and nice, and sufficient!) in simple settings, but restricting/misleading in more complex ones (to the point of causing many developers to not see the negative space).

You can get very far thinking of a project as a linked list, but there is a lot to be gained from being able to work effectively with DAGs when a more complex model would better fit the reality being modeled.

It's harder to grok the DAG world because the tooling is less mature, the abstractions are more complex (and powerful!), and almost all the time and money up to now has explored the hub-and-spoke model.

In many areas of technology, however, better tooling and socialization around moving from linked-lists (and even trees) to DAGs is going to unlock more advanced capabilities.

Final point: rebasing is just glorified cherry-picking. Cherry-picking definitely also has a role in a merge-focused/less-centralized world, but merges add something totally new on top of cherry-picking, which rebase does not.

As @zeckalpha says, rebase != centralized repo.

You can have a hierarchical repo system (as we did at Sun).

Or you can have multiple hierarchies, contributing different series of rebased patches up the chain in each hierarchy.

Another possibility is that you are not contributing patches upstream but still have multiple upstreams. Even in this case your best bet is as follows: drop your local patches (save them in a branch), merge one of the upstreams, merge the other, re-apply (cherry-pick, rebase) your commits on top of the new merged head. This is nice because it lets you merge just the upstreams first, then your commits, and you're always left in a situation where your commits are easy to ID: they're the ones on top.

Also, it's harder to grok merge history because we humans have a hard time with complexity, and merge history in a system with thousands of developers and multiple upstreams can get insanely complex. The only way to cut through that complexity is to make sure that each upstream ends up with linear history -- that is: to rebase downstreams.

Decentralization at scale can result in a linear chain, too.

IMO, VC comes down not to tracking what was actually done, but to creating snapshots of logical steps that are reasonable to roll back to and git bisect with.

And cherry-pick onto release maintenance branches.

A pain I have with rebase workflow is that it creates untested commits (because diffs were blindly applied to a new version of the code). If I rebase 100 commits, some of the commits will be subtly broken.

How do you deal with that?

With git rebase you can in fact build and test each commit. That's what the 'exec' directive is for (among other things) in rebase scripts!

Basically, if you pick a commit, and in the next line exec make && make check (or whatever) then that build & test command will run with the workspace HEAD at that commit. Add such an exec after every pick/squash/fixup and you'll build and test every commit.

Or you could use the "-x" option of git rebase, which runs a command after each commit it applies.
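A minimal sketch of the todo-list edit described above, in Python. The helper name `add_exec`, the default command, and the commit hashes are hypothetical; the sketch follows the comment literally and adds an exec line after every pick, squash, or fixup line.

```python
def add_exec(todo, command="make && make check"):
    """Insert an `exec` line after each pick/squash/fixup in a rebase todo list."""
    commit_actions = {"pick", "p", "squash", "s", "fixup", "f"}
    out = []
    for line in todo.splitlines():
        out.append(line)
        words = line.split()
        # Comment lines and blank lines are passed through unchanged.
        if words and words[0] in commit_actions:
            out.append(f"exec {command}")
    return "\n".join(out) + "\n"

# Hypothetical todo list as shown by an interactive rebase.
todo = (
    "pick 1111111 Add parser\n"
    "fixup 2222222 Fix typo in parser\n"
    "# a comment line\n"
    "pick 3333333 Add tests\n"
)
checked = add_exec(todo)
```

If the command fails at any step, the rebase stops at that commit, which is how the broken one gets identified.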

This is why, in git workflows with rebases, it's a good idea to create merge commits anyway, even if the master branch can be fast-forwarded.

That way, looking at the history, you know what commits are stable/tested by looking at merge commits. Others that were brought in since the last merge commit can be considered intermediary commits that don't need to be individually tested.

(Of course, there's also the rebase-and-squash workflow which I've personally never used, but it accomplishes the same thing by erasing any intermediary history altogether.)

Also, every commit upstream is stable by definition! Human failures aside, nothing should go upstream that isn't "stable/tested".

"Squashing" is just merging neighboring commits. I do that all the time!

Usually I work on something, commit incomplete work, work some more, commit, rinse, repeat; then when the whole thing is done I rewrite the history so that I have changes segregated into meaningful commits. E.g., I might be adding a feature and find and fix a few bugs in the process, add tests, fix docs, add a second, minor feature, debug my code, add commits to fix my own bugs, then rewrite the whole thing into N bug fix commits and 2 feature commits, plus as many test commits as needed if they have to be separate from related bug fix commits. I find it difficult to ignore some bug I noticed while coding a feature just so that I can produce clean history in one go without re-writing it! People who propose that one never rewrite local history are proposing to see a single merge commit from me for all that work, or else the original commits that make no logical sense.

Too, I use "WIP" commits as a way to make it easy to back up my work: commit extant changes, git log -p or git format-patch to save it on a different filesystem. Sure, I could use git diff and thus never commit anything until I'm certain my work is done so I can then write clean history once without having to rewrite. But that's silly -- the end result is what matters, not how many cups of coffee I needed to produce it.

I've toyed with the idea of using merge commits to record sets of commits as being... atoms.

Suppose you want to push regression tests first, then bug fixes, but both together: this is useful for showing that the test catches the bug and the bug fix fixes it. But now you need to document that they go together, in case they need to be reverted, or cherry-picked onto release maintenance branches.

I think branch push history is really something that should be a first-class feature. I could live with using merge commits (or otherwise empty-commits) to achieve this, but I'll be filtering them from history most of the time!

We use a rebase workflow in git at my current employer, and it is amazing.

My previous employer used a merge workflow (primarily because we didn't understand git very well at the time), and there were merge conflicts all the time when pulling new changes down or merging new changes in.

It was a headache to say the least. As the integration manager for one project, I usually spent the better part of an hour just going through the pull requests and merge conflicts from the previous day. I managed a team that was on the other side of the world, so there were always new changes when I started working in the morning.

Yes! One of the most important advantages of a rebase workflow is that you can see immediately what upstream commits you conflict with, as opposed to some massive merge where you have to go chasing branch history to figure out the semantics of the change in question.

"Amazing" is right. Sun was doing rebases in the 90s, and it never looked back.

My exact experience (in the context of "merging upwards"). Large merges are a huge pain to do, and are basically impossible to review, too.

Yes! Reviewing huge merges is infeasible. Besides, most CR tools are awful at capturing history, especially in multi-repo systems. So rebasing and keeping history clean and linear is a huge win there.

Though, of course, rebasing is a win in general, even if you happen to have an awesome CR tool (a unicorn I've yet to run into).

Another thing is that keeping your unpushed commits "on top" is a great aid in general (e.g., it makes it trivial to answer "what haven't I pushed here yet?"), but also is the source of rebasing's conflict resolution power.

Because your unpushed commits are on top, it's easy to isolate each set of merge conflicts (since you're going commit by commit) and to find the source of the conflicts upstream (with log/blame tools, without having to chase branch and merge histories).

We use a rebase workflow when working with third party source code. We keep all third party code on a main git branch and we create new branches off the main branch as we rebase our changes from third party code version to version.

Why wouldn't you use it for your own source?

A trip through "git rebase" search results at HN sure is cringe-inducing. So many people fail to get it.

Why is a clean linear history desirable? It's not reflective of how the product was built? Is it just for some naive desire of purity?

It's easier to see commits of a branch grouped together in most history viewers. Even though sorting commits topologically can help, most history viewers don't support that option.

When there is an undesired behavior that is hard to reason about, git-bisect can be used to determine the commit that first introduced it. With a normal merge, it will point to the merge commit, because it was the first time the 2 branches interacted. With a rebase, git bisect will point to one of the rebased commits, each of which already interacted with the branch coming before.
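A simplified sketch of the search that bisection performs on a linear history, in Python. It is not how git bisect is implemented (git also handles non-linear histories); the function `first_bad`, the commit labels, and the `is_bad` test are hypothetical.

```python
def first_bad(commits, is_bad):
    """Binary-search a linear history for the first bad commit.

    commits: labels ordered oldest first; the first is assumed good
             and the last is assumed bad.
    is_bad:  function that builds/tests a commit and returns True if it fails.
    """
    lo, hi = 0, len(commits) - 1    # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]

# Eight rebased commits; the regression was introduced in c5.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
tested = []

def is_bad(commit):
    tested.append(commit)           # record which commits had to be tested
    return int(commit[1:]) >= 5

culprit = first_bad(history, is_bad)
```

On a rebased, linear history the answer is a single small commit that was already applied on top of the upstream changes, rather than a merge commit that combines two whole branches.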

Resolving conflicts in a big merge commit vs in small rebased commits is like resolving conflicts in a distributed system by comparing only the final states, vs inspecting the actual sequences of changes.

Who cares about how a product's sub-projects were put together? What one should care about is how those sub-projects were put together into a final product. To be sure, the sub-projects' internal history can be archived, but it needn't pollute the upstream's history.

Analogously: who cares how you think? Aside from psychologists and such, that is. We care about what you say, write, do.

Can you describe it in a little more detail? Do you still use branches? If so, for what? For different versions?

Great question.

We basically had a single branch per repo, and every repo other than the master one was a fork (ala github). But that was dictated by the limitations of the VCS we used (Teamware) before the advent of Hg and git.

So "branches" were just a developer's or project's private playgrounds. When done you pushed to the master (or abandoned the "branch"). Project branches got archived though.

In a git world what this means is that you can have all the branches you want, and you can even push them to the master repo if that's ok with its maintainers, or else keep them in your forks (ala github) or in a repo meant for archival.

But! There is only one true repo/branch, and that's the master branch in the master repo, and there are no merge commits in there.

For developers working on large projects the workflow went like this:

- clone the project repo/branch

- work and commit, pulling --rebase periodically

- push to the project repo/branch

- when the project repo/branch rebases onto a newer upstream the developer has to rebase their downstream onto the rebased project repo/branch
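
The workflow above can be modeled with a toy sketch, assuming commits are plain strings and a branch is a list ordered oldest first. The `rebase` and `unpushed` helpers are hypothetical illustrations, not git's implementation (a real rebase replays diffs and can stop for conflicts).

```python
from typing import List


def unpushed(local: List[str], upstream: List[str]) -> List[str]:
    """Commits in local that upstream does not have yet, in order."""
    seen = set(upstream)
    return [c for c in local if c not in seen]


def rebase(local: List[str], upstream: List[str]) -> List[str]:
    """Replay local-only commits on top of the latest upstream.

    Toy model: a real rebase rewrites each replayed commit and may stop for
    conflict resolution; here commits are just copied.
    """
    return list(upstream) + unpushed(local, upstream)


upstream = ["u1", "u2"]
local = ["u1", "u2", "mine-1", "mine-2"]   # cloned at u2, then committed

upstream = upstream + ["u3", "u4"]          # others pushed in the meantime
local = rebase(local, upstream)             # roughly what `git pull --rebase` does

print(local)                      # ['u1', 'u2', 'u3', 'u4', 'mine-1', 'mine-2']
print(unpushed(local, upstream))  # ['mine-1', 'mine-2']
```

Because the local commits always end up last, "what haven't I pushed yet?" is simply the tail of the list.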

Project techleads or gatekeepers (larger projects could have someone be gatekeeper but not techlead) would be responsible for rebasing the project onto the latest upstream.

To simplify things the upstream did a bi-weekly "release" (for internal purposes) that projects would rebase onto on either a bi-weekly or monthly schedule. This minimizes the number of rebases to do periodically.

When the project nears the dev complete time, the project will start rebasing more frequently.

For very large projects the upstream repo would close to all other developers so that the project could rebase, build, test, and push without having to rinse and repeat.

(Elsewhere I've seen uni-repo systems where there is no closing of the upstream for large projects. There a push might have to restart many times because of other pushes finishing before it. This is a terrible problem. But manually having to "close" a repo is a pain too. I think that one could automate the process of prioritizing pushes so as to minimize the restarts.)

Are you saying that you use Git instead of Mercurial these days?

Not necessarily implied by you; just checking.

Me personally? Yes, I use git whenever I can. I still have to use Mercurial for some things.

I don't know what Oracle does nowadays with the gates that make up Solaris. My guess is that they still have a hodge podge, with some gates using git, some Mercurial, and some Teamware still. But that's just a guess. For all I know they may have done what Microsoft did and gone with a single repo for the whole thing.

I have tremendous respect for Microsoft pulling itself together over the past few years.

Such a relevant point, and I don't think they get enough props for it.

I don't believe for one second this was a quick turnaround for them either. I've spoken to MS dev evangelists at work stuff over the past few years and they've continually said "it's going to get better", usually with a wry smile.

It bloody did too. They're nowhere near perfect, and the different product branches remain as disjointed as ever, but I'm genuinely impressed at the sheer scale of the organisational change they've implemented.

Moving the entire code base from Source Depot (invented at Microsoft) to git (not MS) was a huge undertaking. I know many MS devs who hated git.

But this is seriously brave and well executed on their part.

I also know many MS devs who hated Source Depot :)

Technically - Source Depot is a fork of Perforce. Not entirely invented at MSFT :).

To give you a little idea of scale - they've been at this for at least 4 years. It started while I still worked in Windows.

This may be the thing that gets Google to switch. They like having every piece of code in a single repository which Git cannot handle.

Now that it is somewhat proven, maybe Google will leverage GVFS on Windows and create a FUSE solution for Linux.

I'd rather see google open up their monorepo as a platform, and compete with github. git is fine, but there's something compelling about a monorepo. Whether they do it one-monorepo-per-account, or one-global-monorepo, or some mix of the two, would be interesting to see how it shapes up.

Though as things are going, I wouldn't be surprised if Amazon goes from zero to production-quality public monorepo faster than Google gets from here to public beta. It's not in Google's blood.

And of course Google will shut it down in five years once they're bored of it.

Amazon doesn't do mono-repos. They have on the ORDER OF a million repos. They invested in excellent cross-repo meta-version and meta-build capabilities instead of going mono-repo.

"one-global-monorepo" caused me to envision a beautiful/horrifying Borg-like future where all code in the universe was in a single place and worked together.

This is how Linux (and BSD, and so on) distributions work. Of course there are proprietary and niche outliers, but you can't forbid those in the first place.

I felt like that when I first saw golang and how you can effortlessly use any repo from anywhere.

As the joke goes, Go assumes all your code lives in one place controlled by a harmonious organization. Rust assumes your dependencies are trying to kill you. This says a lot about the people who came up with each one.

> Rust assumes your dependencies are trying to kill you.

Would you mind unpacking this? I'm intrigued.

Cargo.lock for applications freezes the entire dependency graph incl. checksums of everything, for example.
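
A rough sketch of the idea, assuming a hypothetical in-memory lockfile that maps each dependency name to a pinned version and a SHA-256 checksum. This is not Cargo's actual file format or verification code; it only shows why a frozen checksum catches a dependency whose contents changed.

```python
import hashlib
from typing import Dict, Tuple

# Hypothetical lock data: name -> (pinned version, sha256 of the package bytes)
Lock = Dict[str, Tuple[str, str]]


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify(lock: Lock, name: str, version: str, payload: bytes) -> bool:
    """Accept a downloaded package only if version and checksum both match."""
    pinned = lock.get(name)
    if pinned is None:
        return False  # not part of the frozen dependency graph
    pinned_version, pinned_sum = pinned
    return version == pinned_version and sha256_hex(payload) == pinned_sum


# Made-up package name and contents, for illustration only.
original = b"fn add(a: i32, b: i32) -> i32 { a + b }"
lock: Lock = {"mathlib": ("1.0.0", sha256_hex(original))}

print(verify(lock, "mathlib", "1.0.0", original))                # True
print(verify(lock, "mathlib", "1.0.0", original + b" // evil"))  # False
print(verify(lock, "mathlib", "1.0.1", original))                # False
```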

This is the main thing I miss about subversion. You could check out any arbitrary subdirectory of a repository. On two projects the leads and full stack people had the whole thing checked out, everybody else just had the one submodule they were responsible for. Worked fairly well.

Mercurial has narrowspecs these days too! Facebook's monorepo lets you check out parts of the overall tree too. It's not like every Android engineer's laptop has all of fbios in it.

Git has submodules too and teams usually have access control on the main server used for sharing commits.

git submodules aren't seamless. None of the alternatives appear to be any better. A halfassed solution is no solution at all.

Would you mind explaining about this lack of seamlessness?

I can imagine that part of it could be the need of a git clone --recursive, and everybody omits the --recursive if they don't know there are submodules inside the repository. There is another command to pull the submodules later but I admit it's far from ideal.

What's wrong with git submodules?

Google already has a FUSE layer for source control: http://google-engtools.blogspot.com/2011/06/build-in-cloud-a...

Google used to have a Perforce frankenstein, but now they have their own VCS.

Piper is their Perforce-like server. You check out certain files, as you would with p4, work on the tree using git, with reviews, tests, etc. You periodically do a sync, which is like pull --rebase. Then you push your changes back into the Perforce-like monorepo.

also they do all the development in one branch / in the trunk https://arxiv.org/abs/1702.01715 ( I never understood the explanations as to why they do that )

Now the article says that with windows they do branches.

They use mercurial (or were), which is as good as git. In fact, I bet a lot of people at Google are happy to use mercurial instead of git, given git's bad reputation with its command line interface.

They don't use mercurial.

You're thinking of Facebook if I'm not mistaken.

I had seen several sources that affirmed that Google used Mercurial, but I'm not sure to what extent, so I will retract it :-)

I'm sure there are a few teams that use Mercurial incidentally somewhere, but our primary megarepo is all on a VCS called Piper. Piper has a Perforce-y interface and there are experimental efforts to use Mercurial with Piper. Also mentioned in the article below, there's limited interop with a Git client.

If you're curious what it all ends up looking like, read this article. It's a fairly good overview and reasonably up to date.


IIRC, the git client is deprecated; the mercurial one is meant to replace it for the use cases where you would want a DVCS client interfacing with Piper in the first place.

It's not deprecated. I use git-multi every day, much to the chagrin of my reviewers. Google thinks that long DIFFBASE chains are weird and exotic, and Google doesn't like weird and exotic as a rule.

I wonder why Windows is a single repository - why not split it into separate modules? I can imagine tools like Explorer, Internet Explorer/Edge, Notepad, Wordpad, Paint, etc. could each stay in their own repository. I can imagine you could split things up even further, like a kernel, a group of standard drivers, etc. If that is not already the case (separate repos, that is), are there plans to separate it in the future?

Really good question. Actually, splitting Windows up was the first approach we investigated. Full details here: https://www.visualstudio.com/learn/gvfs-design-history/


- Complicates daily life for every engineer

- Becomes hard to make cross-cutting changes

- Complicates releasing the product

- There's still a core of "stuff" that's not easy to tease apart, so at least one of the smaller Windows repos would still have been a similar order of magnitude in most dimensions

> - Becomes hard to make cross-cutting changes

This does seem like a negative, doesn't it?

But it's not. Making it hard to make cross-cutting changes is exactly the point of splitting up a repo.

It forces you to slow down, and—knowing that you can only rarely make cross-cutting changes—you have a strong incentive to move module boundaries to where they should be.

It puts pressure on you to really, actually separate concerns. Not just put "concerns" into a source file that can reach into any of a million other source files and twiddle the bits.

"Easy to make sweeping changes" really means "easy to limp along with a bad architecture."

I think that's one of the reasons why so much code rots: developers thinking it should be easy to make arbitrary changes.

No, it should be hard to make arbitrary changes. It should be easy to make changes with very few side effects, and hard to make changes that affect lots of other code. That's how you get modules that get smaller and smaller, and change less and less often, while still executing often. That's the opposite of code rot: code nirvana.

No, it should be hard to make arbitrary changes.

If you change the word "arbitrary" to "necessary" (implying a different bias than the one you went with) then all of a sudden this attitude sounds less helpful.

Similarly "easy to limp along with a bad architecture" could be re-written as "easy to work with the existing architecture".

At the end of the day, it's about getting work done, not making decisions that are the most "pure".

You have to balance getting work done vs. purity, and Microsoft has spent years trying to fix a bad balance.

Windows ME/Vista/8 were terrible and widely hated pieces of software because of "getting things done" instead of making good decisions. They made billions of dollars doing it, don't get me wrong, but they've also lost a lot of market share too and have been piling on bad sentiment for years. They've been pivoting and it has nothing to do with "getting work done" but by going back and making better decisions.

I assumed that Windows 8 was hated because it broke the Start Menu and tried to force users onto Metro.

It also broke a lot of working user interfaces, e.g. wireless connection management.

Those releases (well, Vista and 8 anyway, I don't know about ME) came out of a long and slow planning process - if they made bad decisions I don't think it was about not taking long enough to make them.

> At the end of the day, it's about getting work done, not making decisions that are the most "pure".

This attitude will lead to a total breakdown of the development process over the long term. You are privileging Work Done At The End Of The Day over everything else.

You need to consider work done at every relevant time scale.

How much can you get done today?

How much can you get done this month?

How much can you get done in 5 years?

Ignore any of these questions at your peril. I fundamentally agree with you about purity though. I'm not sure what in my piece made you think I think Purity Uber Alles is the right way to go.

> This attitude will lead to a total breakdown of the development process over the long term.

As evidenced by Microsoft following the one repo rule and not being able to release any new software.

Wait, what ?

The text I quoted had nothing to do with monolithic repos.

The linux codebase exists in stark contrast to your claim. Assuming your claim is that broken up repos is the better way.

No it doesn't. I think you are thinking of the kernel, which is separate from all the distros.

Linux by itself is just a kernel and won't do anything for you without the rest of the bits that make up an operating system.

Then I'll point to the wide success of monolithic utilities such as systemd as evidence that consolidating typically helps long term.

Which is to say, not shockingly, it is typically a tradeoff debate where there is no toggle between good and bad. Just a curve that constantly jumps back and forth between good and bad based on many many variables.

systemd is also completely useless on its own. It still needs a bootloader, a kernel, and user-space programs to run.

When it comes to process managers, there is obviously disagreement about how complex they should be, but systemd is still a system to manage and collect info about processes.

The hierarchical merging workflow used by the Linux kernel does mean that there's more friction for wide-ranging, across-the-whole-tree changes than changes isolated to one subsystem.

Isolated changes will always be easier than cross cutting ones. The question really comes down to whether or not you have successfully removed cross cutting changes. If you have, then more isolation almost certainly helps. If you were wrong, and you have a cross cutting change you want to push, excessive isolation (with repos, build systems, languages, whatever), adds to the work. Which typically increases the odds of failure.

Arguing about purity is only pointless and sanctimonious if the water isn't contaminated. Being unable to break a several hundred megabyte codebase into modules isn't a "tap water vs bottled" purity argument, it's a "let's not all die of cholera" purity argument.

As the linked article says, modularizing and living in separate repos was the plan of record for a while. But after evaluating the tradeoffs, we decided that Windows needs to optimize for big, rapid refactors at this stage in its development. "Easy to make sweeping changes" also means "easy to clean up architecture and refactor for cleaner boundaries".

The Windows build system is where component boundaries get enforced. Having version control introduce additional arbitrary boundaries makes the problem of good modularity harder to solve.

You say that, but it is very telling that every large company out there (Google and Facebook come to mind) go for the single-repository approach.

I'm sure that, when dealing with stakeholder structures where different organizations can depend on different bits and pieces, having multiple repositories with difficulty of making breaking and cross-cutting changes, becomes good.

From the view of a single organization where the only users of a component are other components in the same organization, it seems like there is consensus around single-repository.

It is very telling. Google has a cloud supercomputer doing nothing but building code to support their devs. I don't know about Facebook. (I really don't -- I'm constantly amazed that they have as many engineers as they do, what do they all work on?) Where I work (https://medium.com/salesforce-engineering/monolith-to-micros...) there's a big monolith but with more push towards breaking things up, at the architecture level and also on the code organization level. We also use and commit to open source projects (that use git), so integrating those with the core requires a bit more effort than if they were there already, but it's not a big burden and the benefits of having their tendrils be self-contained are big.

Which brings me to the point that in the open source world, you can't get away with a single-repository approach for your large system. And that also is telling, along with open source's successes. So which approach is better in the long run? I'd bet on the open source methods.

You forget Conway's law:

> organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations

Open source has a very different communication structure than a company. While the big three (MS, Google, FB) try to work towards good inter-department relations, in open source it is usually either

- a single person

- a small group

that are the gatekeepers for a small amount of code, typically encapsulated in a "project". They do commit a lot to their project, yet rarely touch other projects in comparison.

Also, collaboration is infinitely harder, as in the office you can simply walk up to someone, call them, or chat with them - in OSS a lot of communication works via Issues and PRs, which are a fundamentally different way to communicate.

This all is reflected by the structure of how a functionality is managed: Each set of gatekeepers gets their own repository, for their function.

Interestingly this even happens with bigger repositories: DefinitelyTyped is a repository for TypeScript "typings" for different JS libraries, which has hundreds of collaborators.

Yet, if you open a pull-request for an existing folder the ones that have previously made big-ish changes can approve / decline the PR, so each folder is its own little repo.


So: maybe the solution is big repos for closed companies, small repos for open-source?

Amazon doesn't.

From my experience with Java-style hard module dependencies, this makes it extremely difficult to refactor anything touching external interfaces.

You say this forces you to think ahead, but predicting the future is quite difficult. The result is that you limp along with known-broken code because it would take so much effort to make the breaking changes to clean it up.

For example, let's say you discover that people are frequently misusing a blocking function because they don't realize that it blocks.

Let's say that we have a function `bool doThing()`. We discover that the bool return type is underspecified: there's a number of not-exactly-failure not-exactly-success cases. In a monorepo, it's pretty easy to modify this so that `doThing()` can return a `Result` instead. With multiple repos and artifacts, you either bring up the transitive closure of projects, or you leave it for someone to do later. For a widely used function, this can be prohibitive. That makes people frequently choose the "rename and deprecate" model, which means you get an increasing pile of known-bad functions.
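
A sketch of that change, assuming a hypothetical `Result` enum and made-up outcome names; the comment does not say what the extra cases actually are. It is written in Python, with `do_thing` standing in for `doThing()`.

```python
from enum import Enum


class Result(Enum):
    # Hypothetical outcomes; the comment only says bool was underspecified.
    OK = "ok"
    ALREADY_DONE = "already_done"   # not exactly success
    RETRY_LATER = "retry_later"     # not exactly failure
    FAILED = "failed"


def do_thing_old(state: str) -> bool:
    """Old signature: collapses the in-between cases into True/False."""
    return state in ("fresh", "done")


def do_thing(state: str) -> Result:
    """New signature: every caller must handle the richer result."""
    return {
        "fresh": Result.OK,
        "done": Result.ALREADY_DONE,
        "busy": Result.RETRY_LATER,
    }.get(state, Result.FAILED)


def caller(state: str) -> str:
    # In a monorepo, call sites like this one are updated in the same change.
    result = do_thing(state)
    if result in (Result.OK, Result.ALREADY_DONE):
        return "proceed"
    if result is Result.RETRY_LATER:
        return "retry"
    return "abort"


print(caller("fresh"), caller("busy"), caller("broken"))  # proceed retry abort
```

With separate repos and published artifacts, `do_thing_old` tends to stay around next to `do_thing`, which is the "rename and deprecate" pile described above.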

Have you actually worked on a repository at Windows scale..? If not, how can you know that your guesses about the workflow are accurate?

Making something difficult even more difficult is not helpful to anyone.

And what happens if the inherent difficulty disappears or reduces? You're still left with the imposed external difficulty.

Remember when they said IE was an integral part of the OS? Yeah...

Facebook cited similar reasons for having a single large repository: https://code.facebook.com/posts/218678814984400/scaling-merc...

I remember that, in the 90's, you'd often get new UI elements in Office releases that would then eventually move into Windows. There was a technical reason - the cross-cutting - but there also seemed to be a marketing reason - the moment those UI elements became part of core Windows, all developers (yours truly included) would be able to use those elements, effectively denying Office its fresh look ahead of the competition.

That's a really interesting article. I wish I would have found it before going down a similar path with my team, recently.

These types of use cases seem so commonly encountered that there should be a list of best practices in the Git docs.

It's harder to share code between repos, though.

EDIT: like if something was to be shared between Windows and Office, for example.

This was a very interesting point. It sounds like there are some serious architectural limitations on Windows, and this makes me believe the same might be true for the NT kernel, and that MS might not be interested in doing heavy refactoring of it.

I'm not a frequent Windows user, or a Windows dev at all. Does anyone know of any consequences that MS's decision might mean, if this hypothesis is true?

The NT kernel is surprisingly small and well-factored - it is a lot closer to a 'pure' philosophy (e.g. microkernel) than something like Linux to begin with.

If you have a problem with Windows being overcomplicated or in need of refactor it is almost certainly something to do with not-the-kernel.

If you look at something like the Linux kernel, it's actually much larger than Windows. It needs to have every device driver known to man (except that one WiFi/GPU/Ethernet/Bluetooth driver you need) because internally the architecture is not cleanly defined and kernel changes also involve fixing all the broken drivers.

Small and well-factored the core kernel may be, but if you're parsing fonts in kernel mode, you ain't a microkernel.

( https://googleprojectzero.blogspot.com.au/2015/07/one-font-v... )

For sure, and Windows is not a microkernel, but it does have separated kernel-in-kernel and executive layers; it would approach being a microkernel architecture if the executive was moved into userland. This is similar to how macOS would be a microkernel, if everything wasn't run in kernel mode (mach, on which it is partially based, is a microkernel).

Of course the issue here is that after NT 4, GDI has been in kernel mode; this is necessary for performance reasons. Prior to that it was a part of the user mode Windows subsystem.

I'd be curious to see if GDI moved back to userland would be acceptable with modern hardware, but I suspect MS is not interested in that level of churn for minimal gain.

Could you please share a link to NT kernel sources so that I take a look?

It is not true that internal architecture of Linux drivers is not clearly defined. It is just a practical approach to maintenance of drivers (as an author of one Linux hardware driver I'm pretty sure the best possible). Reasoning is outlined in famous "Stable API nonsense" document http://elixir.free-electrons.com/linux/latest/source/Documen...

I don't think the Windows approach is worth praising here. It results in a lot of drivers ceasing to work after a few major Windows release upgrades. In Linux, available drivers can be maintained forever.

If you are sufficiently motivated, NT 4 leaked many years ago and you could find it; it even has interesting things like DEC Alpha support & various subsystems still included IIRC. Perhaps you could find a newer version like 5.2 on GitHub or another site, but beware, as a Linux dev/contributor, you probably don't want to have access to that.

FWIW, I've stumbled upon both of those things in my personal research while I had legitimate access to the 5.2 sources as a student. It turns out Bing will link you directly to the Windows source code if you search for <arcane MASM directive here>.

Yes, I'm in the process of reporting this to Microsoft and cleansing myself of that poison apple.

Yes! The Windows kernel is a much more "modern" microkernel architecture than any of the circa-1969 Unix-like architectures popular today. We use Windows 10 / Windows Server for everything at our company, and we have millions of simultaneously connected users on single boxes. No problems and easy to manage.

It seems presumptive to say that "If you can't use multiple repos, your architecture must be bad". I could just as easily counter with, "If you're using multiple repos, it must mean you have an unnecessarily complex and fragile microservice architecture"

I'm sorry that was implied. I simply want better insight into the kernel from people who have experience developing large kernels, and the decisions that are made as a consequence of architectural choices.

I feel like that those questions are valid, and are important in this field, not just kernel development. As someone who desires to continue learning, I will not yield to your counter.

Google also uses a single giant repo...

Why is that a "good reason" to do it?

Not really :) and I am not the OP. This article [1] provides a very good overview of the repository organization Google has and the reasons behind it.

I think the reason that this works well for Google is the amount of test automation that is in place, which seems to be very good at flagging any errors due to dependency changes before they get deployed. Not sure how many organizations have invested in an automated test infrastructure like the one Google has built.

1. https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

It radically simplifies everything. Every commit is reproducible across the entire tool chain and ecosystem.

It makes the entire system kind of pure-functional and stateless/predictable. Everything from computing which tests you need to run, to who to blame when something breaks, to caching build artifacts, or even sharing workspaces with co-workers.
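
As an illustration of the "which tests you need to run" point, here is a minimal sketch assuming a hypothetical dependency map from each target to the targets it depends on. The target names are made up, and this is not any company's build tool; it just walks reverse dependencies outward from the changed targets.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set


def affected_tests(deps: Dict[str, List[str]], changed: Iterable[str]) -> Set[str]:
    """Return test targets that transitively depend on any changed target."""
    # Invert the map: for each target, who depends on it?
    reverse = defaultdict(set)
    for target, needs in deps.items():
        for dep in needs:
            reverse[dep].add(target)

    # Breadth-first walk from the changed targets through their dependents.
    seen: Set[str] = set(changed)
    queue = deque(seen)
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    # Assumption for this sketch: test targets are named with a _test suffix.
    return {t for t in seen if t.endswith("_test")}


deps = {
    "strings": [],
    "net": ["strings"],
    "ui": ["strings"],
    "strings_test": ["strings"],
    "net_test": ["net"],
    "ui_test": ["ui"],
}

print(sorted(affected_tests(deps, ["net"])))      # ['net_test']
print(sorted(affected_tests(deps, ["strings"])))  # ['net_test', 'strings_test', 'ui_test']
```

This only works cleanly when the whole dependency graph is known at a single commit, which is the property a single repo gives you for free.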

While this could be implemented with multiple repos underneath, it would add much complexity.

I think this is like the "your data isn't big enough for HDFS" argument earlier this week. The point I take away is that at some stage of your growth, this will be a logical decision. I don't think it implies that the same model works for your organization.

And Facebook.

hey chummer, show me your elegant repos will ya?

Why split it into separate modules? Seeing that big companies are very successful with monorepos (Google, Facebook, Microsoft), has made me reconsider if repository modularization is actually worth it. There are a host of advantages to not modularizing repos, and I'm beginning to believe they outweigh those of modular repos.

mono repo only works if you have tooling. google's and facebook's tools are not opensource. also, ms tooling is windows only.

so for most of us the only reasonable path is to split into multiple repositories.

it is also easier to create tools that deal with many repositories.

than it is to create a tool that virtualizes a single large repo.

Microsoft's tooling is Windows only _today_. GVFS is open source _and_ we are actively hiring filesystem hackers for Linux/macOS.

but google's and facebook's repos are also orders of magnitude larger than what most people deal with, normal tools might work just fine in most cases.

> Why split it into separate modules?

Well, because of the unbelievable amount of engineering work involved in trying to get Git to operate at such insane scale? To say nothing of the risk involved in the alternative. This project in particular could easily have been a catastrophe.

Microsoft has about 50k developers. When you're dealing with an engineering organization of this size, you're looking at a run rate of $5B a year, or about $20M a day. It's a no-brainer to spend tens or even hundreds of millions of dollars on projects like this if you're going to get even a few percentage points of productivity.
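
The arithmetic behind those figures, as a quick check. The 50k developers and $5B a year come from the comment; the 250 working days per year and the 2% productivity gain are assumptions picked for illustration.

```python
developers = 50_000
annual_cost = 5_000_000_000      # $5B per year, from the comment
working_days = 250               # assumption: roughly 50 weeks of 5 days

cost_per_developer = annual_cost / developers
cost_per_day = annual_cost / working_days

gain = 0.02                      # assumption: a 2% productivity improvement
annual_value_of_gain = annual_cost * gain

print(f"${cost_per_developer:,.0f} per developer per year")     # $100,000
print(f"${cost_per_day:,.0f} per working day")                  # $20,000,000
print(f"${annual_value_of_gain:,.0f} per year from a 2% gain")  # $100,000,000
```

Under these assumptions, even a 2% gain is worth on the order of $100M a year, which is the scale of investment the comment describes.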

It can be hard to understand this as an individual engineer, but large tech companies are incredible engineering machines with vast, nearly infinite, capacity at their disposal. At this scale, all that matters is that the organization is moving towards the right goals. It doesn't matter what's in the way; all the implementation details that engineers worry about day-to-day are meaningless at this scale. The organization just paves over any obstacles in pursuit of its goal.

In this case Microsoft decided that Git was the way to go for source control, that it's fundamentally a good fit for their needs. There were just implementation details in the way. So they just... did it. At their scale, this was not an incredible amount of work. It's just the cost of doing business.

You haven't said anything at all about the risk.

If there's one thing large organizations are good at, it's managing risk. And if you read their post, they've done this.

They're running both source control systems in parallel, switching developers in blocks, and monitoring commit activity and feedback to watch for major issues. In the worst case, if GVFS failed or developers hated it, they could roll back to their old system.

Again, to my point above: there's a cost to doing this but it's negligible for very large organizations like Microsoft.

Wait so like, at google, Inbox and Android are in the same repo as ChromeOS and oh I dunno, Google Search? That doesn't make any sense at all...

Android and Chrome are different, but most of Google's code lives in a single repository.


It makes total sense when the expectation is that any engineer in the company can build any part of the stack at any time with a minimum of drama.

This just blew my mind. I'm gonna go home and see about combining all my projects. That seems very useful!

It makes sense to have separate repos for things that don't interact. But when your modules or services do, having them together cuts out some overhead.

You don't have the same requirements as Google.

What are my requirements such that this wouldn't work for me?

It makes things tricky if you want to opensource just one of your projects.

Well, there's always "git filter-branch".

Not that I'd want to run it on such a mega-repository; it takes long enough running it on an average one with a decade of history.

What's dramatic about copy and pasting a clone uri into a command?

A url? Not much.

100 urls? That's getting a bit annoying.

Yes, Inbox, Maps, Search, etc. are all in one repo in a specialized version control system.

Android, Chromium, and Linux (among others, I'm sure) are different in that they use git for version control so they are in their own separate repos.

Unsure if all their clients are in the same repo as well but even if…

Why doesn't this make sense.

I personally think of a repo as an index, not a filesystem. You check out what you need, but there is one global constant state - which can e.g. be used for continuous integration tests.

Android is in a separate repo.

their earlier blogpost goes into it a bit https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g... :

The first big debate was – how many repos do you have – one for the whole company at one extreme or one for each small component? A big spectrum. Git is proven to work extremely well for a very large number of modest repos so we spent a bunch of time exploring what it would take to factor our large codebases into lots of tenable repos. Hmm. Ever worked in a huge code base for 20 years? Ever tried to go back afterwards and decompose it into small repos? You can guess what we discovered. The code is very hard to decompose. The cost would be very high. The risk from that level of churn would be enormous. And, we really do have scenarios where a single engineer needs to make sweeping changes across a very large swath of code. Trying to coordinate that across hundreds of repos would be very problematic.

After much hand wringing we decided our strategy needed to be “the right number of repos based on the character of the code”. Some code is separable (like microservices) and is ideal for isolated repos. Some code is not (like Windows core) and needs to be treated like a single repo.

As Brian Harry alluded to, this is in fact very close to the previous system. As I recall it from my time at Microsoft:

The Windows source control system used to be organized as a large number (about 20 for product code alone, plus twice that number for test code and large test data files) of independent master source repositories.

The source checkouts from these repos would be arranged in a hierarchy on disk. For example, if root\ were the root of your working copy, files directly under root\ would come from the root repo, files under root\base\ from the base repo, root\testsrc\basetest from basetest, and so on.
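To make the layout above concrete, here is a minimal sketch of how a path in the working copy could be resolved to the repo that owns it: the deepest mount point that is an ancestor of the path wins. Only the three repo names and directories come from the description above; the function name, the table, and the file names in the usage note are hypothetical, and this is not how the actual tooling was written.

```python
from pathlib import PureWindowsPath

# Hypothetical mount table: working-copy directory -> source repo.
# Only "root", "base" and "basetest" come from the description above.
REPO_ROOTS = {
    PureWindowsPath("root"): "root",
    PureWindowsPath("root/base"): "base",
    PureWindowsPath("root/testsrc/basetest"): "basetest",
}


def owning_repo(path: str) -> str:
    """Return the repo whose mount point is the deepest ancestor of path."""
    p = PureWindowsPath(path)
    candidates = [m for m in REPO_ROOTS if m == p or m in p.parents]
    if not candidates:
        raise ValueError(f"{path} is outside the working copy")
    # The mount point with the most path components is the most specific one.
    deepest = max(candidates, key=lambda m: len(m.parts))
    return REPO_ROOTS[deepest]
```

With this table, `owning_repo(r"root\base\ntos\init.c")` resolves to `base`, while a file directly under `root\` falls back to the `root` repo.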

To do cross repo operations you used a tool called "qx" (where "q" was the name of the basic source control client). Qx was a bunch of Perl scripts, some of which wrapped q functions and others that implemented higher-level functions such as merging branches. However, qx did not try to atomically do operations across all affected repos.

(The closest analog to this in git land would be submodules.)

While source control was organized this way, build and test ran on all of Windows as a single unit. There was investigation into trying to more thoroughly modularize Windows in the past, but I think the cost was always judged too great.

Mark Lucovsky did a talk several years ago on this source control system, among other aspects of Windows development:


I believe it is still valid for folks not using GVFS to access Windows sources.

It'd probably be hell on earth for the developers/engineers if it was more than one repo.

I'm sure the build process is non-trivial and with 4,000 people working on it, the amount of updates it gets daily is probably insane. Any one person across teams trying to keep this all straight would surely fail.

Having done a lot of small git repos, I'm a big fan of one huge repo. It makes life easier for everyone, especially your QA team as it's less for them to worry about. In the future, anywhere I'm the technical lead I'm gonna push for one big repo. No submodules either. They're a big pain in the ass too.

Most of the work that is the issue would be solved by meta repos and tools to keep components up to date and integrated upward.

So, this is actually pretty common. I know that both Google and Facebook use a huge mono-repo for literally everything (except I think Facebook split out their Android code into a separate repo?). So, all of Facebook's and Google's code for front-end, back-end, tools, infrastructure, literally everything, lives in one repo.

It's news to me that Windows decided to go that route too. Personally, I think submodules and git sub-trees suck, so I'm all for putting things in a monorepo.

How does a mono-repo company manage open sourcing a single part of their infrastructure if things are in one large repo? For example, if everything lived in one repo, how does Facebook manage open sourcing React? Or if I personally wanted to switch to one private mono-repo, how would I share individual projects easily?

So open sourcing can mean three different things:

The bad one is just dumping snapshots of the code into a public repo every so often. You need to make sure your dependencies are open source, have tools that rewrite your code and build files accordingly, and put them in a staging directory for publishing.

The good one is developing that part publicly, and importing it periodically into your internal monorepo with the same (or similar) process to the one you use for importing any other third-party library.

There's also a hybrid approach which is to try and let internal developers use the internal tooling against the internal version of the code, and also external developers, with external tooling, against the public version. That one's harder, and you need a tool that does bidirectional syncing of change requests, commits, and issues.

We have an internal tool that allows us to mirror subdirectories of our monorepo into individual github repositories, and another tool that helps us sync our internal source code review tool with PRs etc.
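The internal tool itself isn't described here, so the following is only a sketch of the general idea of mirroring a subdirectory: project the monorepo history onto one path prefix, drop commits that never touch it, and rewrite the remaining paths relative to the prefix. It is similar in spirit to `git filter-branch --subdirectory-filter` or `git subtree split`. The `Commit` class and `mirror_subdirectory` function are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    message: str
    files: dict  # path -> content for the files changed by this commit


def mirror_subdirectory(history, prefix):
    """Project monorepo history onto one subdirectory.

    Commits that do not touch the prefix are dropped; paths in the
    remaining commits are rewritten relative to the prefix.
    """
    prefix = prefix.rstrip("/") + "/"
    mirrored = []
    for commit in history:
        kept = {
            path[len(prefix):]: content
            for path, content in commit.files.items()
            if path.startswith(prefix)
        }
        if kept:
            mirrored.append(Commit(commit.message, kept))
    return mirrored
```

A real mirror also has to handle commits that mix public and private paths, commit messages that mention internal details, and changes flowing back from the public repo, which is where most of the difficulty lies.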

An internal tool which manages commits between individual repos, etc.: does it not seem that this is a logical extension to git itself? A little like submodules, but able to publish only parts of the source tree. Maybe it would be impossible to keep any consistency without leaking information from the rest of the tree.

With difficulty.

No, seriously, that's the answer.

They have an internal mono-repo and public repos on GitHub that are mirrors of their mono-repo.

pros to big repo:

-dont have to spend time to think about defining interfaces

cons:

-history is full of crap you dont care about

-tests take forever to run

-tooling breaks down completely, though thanks to MS the limit was increased seriously

Are the big monorepo companies actually waiting for global test suite completion for every change? I'd doubt that, I'm sure they're using intelligent tools to figure out what tests to actually run. Compute for testing is massively expensive at that scale so it's an obvious place to optimize

Google's build and testing system is smart in which tests to run, as you suspect, but it still has a very, very large footprint.

Right. My point is that the monorepo almost certainly isn't a problem in this regard.

You still have to do something about internal interfaces. The problem is that the moment you want to make a backwards-incompatible change to an internal interface now you have to go find users of it, and there go the benefits of GVFS... Or you can let the build and test system tell you what breaks (take a long coffee break, repeat as many times as it takes; could be many times). Or use something like OpenGrok to find all those uses in last night's index of the source.

Defining what portions of the OS you'll have to look in for such changes helps a great deal.

As to building and testing... the system has to get much better about detecting which tests will need to be re-run for any particular change. That's difficult, but you can get 95% of the way there easily enough.
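One common way to detect which tests need re-running is to walk the build dependency graph backwards from the changed targets. The sketch below assumes such a graph is available as a plain dict; the function and target names are hypothetical, and it says nothing about how Google's or Microsoft's systems actually do it.

```python
from collections import deque


def affected_tests(changed, deps, tests):
    """Return the tests that transitively depend on any changed target.

    deps maps each target to the targets it depends on directly.
    """
    # Invert the graph: target -> targets that depend on it.
    dependents = {}
    for target, needs in deps.items():
        for need in needs:
            dependents.setdefault(need, set()).add(target)

    # Breadth-first walk from the changed targets through their dependents.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen & set(tests))
```

The hard part in practice is keeping the dependency graph accurate. Dependencies that the build files don't declare (runtime configuration, reflection, shared data files) are what account for the remaining few percent.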

-dont have to spend time to think about defining interfaces

That seems like a design and policy choice, orthogonal to repos.

Not really. It's easier to make a single atomic breaking change to how different components talk to each other if they are in the same repository.

If they are in different repos, the change is not atomic and you need to version interfaces or keep backwards compatibility in some other way.

It's very much really. The fact that it's easier doesn't really matter - a repo is about access to the source code and its history with some degree of convenience. The process and policy of how you control actual change is quite orthogonal. You can have a single repo and enforce inter-module interfaces very strongly. You can have 20 repos and not enforce them at all. Same goes for builds, tests, history, etc. The underlying technology can influence the process but it doesn't make it.

I have always wondered how they deal with acquisitions and sales. I guess a single system makes sense there too.

I have worked at a company using one repo per team and at Google which uses a big monorepo. I much prefer the latter. As long as you have the infrastructure to support it, I see no other downsides (obviously, Google does not use git).

> Before the move to Git, in Source Depot, it was spread across 40+ depots and we had a tool to manage operations that spanned them.

Coming from the days of CVS and SVN, git was a freaking miracle in terms of performance, so I have to put things into perspective when the topmost issue with git is performance. It's a testament to how huge the codebases we're dealing with are (Windows over there, but also Android, and surely countless others). The staggering amount of code we're wrangling these days and the level of collaboration are incredible, and I'm quite sure we would not have been able to do that (or at least not as nimbly and with such confidence) were it not for tools like git (and hg). There's a sense of scale regarding that growth across multiple dimensions that just puts me in awe.

Broadly speaking this is true, but note that in some ways CVS and SVN are better at scaling than Git.

- They support checking out a subdirectory without downloading the rest of the repo, as well as omitting directories in a checkout. Indeed, in SVN, branches are just subdirectories, so almost all checkouts are of subdirectories. You can't really do this in Git; you can do sparse checkouts (i.e. omitting things when copying a working tree out of .git), but .git itself has to contain the entire repo, making them mostly useless.

- They don't require downloading the entire history of a repo, so the download size doesn't increase over time. Indeed, they don't support downloading history: svn log and co. are always requests to the server. Unfortunately, Git is the opposite, and only supports accessing previously downloaded history, with no option to offload to a server. Git does have the option to make shallow clones with a limited amount of (or no) history, and unlike sparse checkouts, shallow clones truly avoid downloading the stuff you don't want. But if you have a shallow clone, git log, git blame, etc. just stop at the earliest commit you have history for, making it hard to perform common development tasks.

I don't miss SVN, but there's a reason big companies still use gnarly old systems like Perforce, and not just because legacy: they're genuinely much better at scaling to huge repos (as well as large files). Maybe GVFS fixes this; I haven't looked at its architecture. But as a separate codebase bolted on to near-stock Git, I bet it's a hack; in particular, I bet it doesn't work well if you're offline. I suspect the notion of "maybe present locally, maybe on a server" needs to be baked into the data model and all the tools, rather than using a virtual file system to just pretend remote data is local.

CVS and SVN are probably a bit better at scaling than (stock) Git. Perforce and TFVC _are certainly_ better at scaling than (again, stock, out-of-the-box) Git. That was their entire goal: handle very large source trees (Windows-sized source trees) effectively. That's why they have checkout/edit/checkin semantics, which is also one of the reasons that everybody hates using them.

GVFS intends to add the ability to scale to Git, through patches to Git itself and a custom driver. I don't think this is a hack - by no means is it the first version control system to introduce a filesystem level component. Git with GVFS works wonderfully while offline for any file that you already have fetched from the server.

If this sounds like a limitation, then remember that these systems like Perforce and TFVC _also_ have limitations when you're offline: you can continue to edit any file that you've checked out but you can't check out new files.

You can of course _force_ the issue with a checkout/edit/checkin but then you'll need to run some command to reconcile your changes once you return online. This seems increasingly less important as internet becomes ever more prevalent. I had wifi on my most recent trans-Atlantic flight.

I'm not sure what determines when something is "a hack" or not, but I'd certainly rather use Git with GVFS than a heavyweight centralized version control system if I could. Your mileage, as always, may vary.

GVFS won't fix this because you still need to lock opaque binary files, which is something Perforce supports.

Nothing about git prevents an "ask the remote" feature. It's just not there. I suspect that as git repos grow huge and shallow and partial cloning becomes more common, git will grow such a feature. Granted, it doesn't have it today. And the GVFS thing is... a bit of a hack around git not having that feature -- but it proves the point.

At the risk of sounding like a downer, this was a migration of an existing codebase.

I agree. I really think Linux needs a Nobel Prize

For what? Peace?

A handful of us from the product team are around for a few hours to discuss if you're interested.

Does the virtualization work equally well for lots of history as it does for large working copies?

I have a 100k commit svn repo I have been trying to migrate but the result is just too large. Partly this is due to tons of revisions of binary files that must be in the repo.

Does the virtualization also help provide a shallow set of recent commits locally but keep all history at the server (which is hundreds of gigs that is rarely used)?

GVFS helps in both dimensions, working copy and lots of history. For the lots of history case, the win is simply not downloading all the old content.

A GVFS clone will contain all of the commits and all of the trees but none of the blobs. This lets you operate on history as normal, so long as you don't need the file content. As soon as you touch file content, GVFS will download those blobs on demand.
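As a toy model of that split (not GVFS's actual implementation), imagine a store where the commit-to-tree mapping is always local and blob content is fetched the first time something reads it. All names below are hypothetical; `fetch_blob` stands in for a request to the server.

```python
class LazyBlobStore:
    """Commits and trees are local; blobs are fetched on first access."""

    def __init__(self, trees, fetch_blob):
        self.trees = trees            # commit id -> {path: blob id}, always local
        self.fetch_blob = fetch_blob  # callable: blob id -> bytes (the "server")
        self.blobs = {}               # local blob cache, empty right after clone
        self.downloads = 0

    def changed_paths(self, old, new):
        # Comparing blob ids is enough to see which paths changed,
        # so walking history downloads nothing.
        a, b = self.trees[old], self.trees[new]
        return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))

    def read(self, commit, path):
        # File content is needed here, so fetch the blob if it is not cached.
        blob_id = self.trees[commit][path]
        if blob_id not in self.blobs:
            self.blobs[blob_id] = self.fetch_blob(blob_id)
            self.downloads += 1
        return self.blobs[blob_id]
```

In this model, asking which paths changed between two commits never touches the server, while reading a file's content costs one download the first time and none afterwards.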

Thanks - that sounds perfect for lots of binary history as you never view history on the binaries, only the source files.

This is amazing, congrats. I worked on Windows briefly in 2005 (the same year git was released!) and was surprised at how well Source Depot worked, especially given the sheer size of the codebase and the other SCM tools at the time.

Is there anything people particularly miss about Source Depot? Something SD was good at, but git is not?

Another interesting complaint is one that we hear from a lot of people who move from CVCS to DVCS: there are too many steps to perform each action. For example, "why do I have to do so many steps to update my topic branch". While we find that people get better with these things over time, I do think it would be interesting to build a suite of wrapper commands that roll a bunch of these actions up.

I just got a request today for an API equivalent to `sd files`, which is not something Git is natively great at without a local copy of the repo.

How do you prevent data exfiltration? I mean, in theory you could restrict the visibility of repos to the user based on team membership/roles and so prevent a single person from unauditably exfiltrating the whole Windows source code tree. In contrast, with a monorepo there likely won't be any alerts triggered if someone does do a full git clone, except for someone saturating his switch port...

What would someone do with the source code for Windows? No one in open source would want to touch it. No large company would want to touch it. Grey/black hats are probably happier with their decompilers. Surely it would be easier to pirate than build (assuming their build system scales with most build systems I've observed in the wild). No small company would want to touch it.

Anyway, MS shares source with various third parties (governments at least, and I believe large customers in general), so any of these is a potential leak source.

This is all correct. Also, we'd notice someone grabbing the whole 300GB whether it's in 40 SD depots or a single Git repo.

Mind if I ask how you'd notice?

The article mentions relying on a windows filesystem driver. Two questions about that:

1) Why include it in default windows? It seems that 99.99% of users would never even know it existed, let alone use it

2) Does that mean GVFS isn't useable on *nix systems? Any plans to make it useable, if so?

1) The file system driver is called GvFlt. If it does get included in Windows by default, it'll be to make it easier for products like GVFS, but GvFlt on its own is not usable by end users directly.

2) GVFS is currently only available on Windows, but we are very interested in porting to other platforms.

What are your thoughts on implementing something more general like linux's FUSE[1] instead? A general Virtual Filesystem driver in Windows could be used for a wide range of things and means you don't just have a single-purpose driver sitting around.

[1] https://en.m.wikipedia.org/wiki/Filesystem_in_Userspace

A general FUSE-like API would be very useful to have, but unfortunately it can't meet our performance requirements.

The first internal version of GVFS was actually based on a 3rd party driver that looks a lot like FUSE. But because it is general purpose, and requires all IO to context switch from the kernel to user mode, we just couldn't make it fast enough.

Remember that our file system isn't going to be used just for git operations, but also to run builds. Once a file has been downloaded, we need to make sure that it can be read as fast as any local file would be, or dev productivity would tank.

With GvFlt, we're able to virtualize directory enumeration and first-time file reads, but after that get completely out of the way because the file becomes a normal NTFS file from that point on.

I'm curious what the cross-over between GvFlt may be and the return of virtual files for OneDrive in the Fall Creators Update? Is the work being coordinated between the efforts?

From your description it sounds like there could be usefulness in coordinating such efforts.

For use cases that fit that limited virtualization, is GvFlt something to consider? Or is that a bad idea?

It's not included in Windows, which is why they have a signed drop of the driver for you to install. Even internally we have to install the driver.

E- oops, missed that line. Cool, wonder if it will show up in more than Git

Yes, I was referring to the plan to include it in future Windows builds

Probably as an optional feature?

Why do you name it "GVFS" instead of something more descriptive like "GitVFS"?

This was discussed a bit in the comments in Brian Harry's last post on GVFS: https://blogs.msdn.microsoft.com/bharry/2017/02/03/scaling-g...

We're building a VFS (Virtual File System) for Git (G) so GVFS was a very natural name and it just kind of stuck once we came up with it.

Are you aware that the name was already taken[0] for something which also has to do with file systems?

[0] https://wiki.gnome.org/Projects/gvfs

Sure, a couple of questions:

1. How do you measure "largeness" of a git repo?

2. How are you confident that you have the largest?

3. How much technical debt does that translate to?

1. Saeed is writing a really nice series of articles starting here: https://www.visualstudio.com/learn/git-at-scale/ In the first one, he lays out how we think about small/medium/large repos. Summary: size at tip, size of history, file count at tip, number of refs, and number of developers.

2. Fairly confident, at least as far as usable repos go. Given how unusable the Windows repo is without GVFS and the other things we've built, it seems pretty unlikely anyone's out there using a bigger one. If you know of something bigger, we'd love to hear about it and learn how they solved the same problems!

3. Windows is a 30 year old codebase. There's a lot of stuff in there supporting a lot of scenarios.

Is it possible to checkout (if not build) something like Windows 3.11 or NT 4?

As far as I can recall, this is not possible using Windows source control, as its history only goes back to the lifecycle of Windows XP (when the source control tool prior to GVFS was adopted).

Microsoft does have an internal Source Code Archive, which does the moral equivalent of storing source code and binary artifacts for released software in an underground bunker. I used to have a bit of fun searching the NT 3.5 sources as taken from the Source Code Archive...

I recently heard a story that someone tried to push a 1TB repo to our university Gitlab, which then ran out of disk space. Sure, that might not have been a usable repo, only an experiment. Still, I would bet against the claim that 300GB is the largest one.

300 GB is not the size of _the repository_. It's the size of the code base - the checked out tree of source, tests, build tools, etc - without history.

It's certainly possible that somebody created a 1 TB source tree in Git, but what we've never heard of is somebody actually _using_ such a source tree, with 4000 or more developers, for their daily work in producing a product.

I say this with some certainty because if somebody had succeeded, they would have needed to make similar changes to Git to be successful, though of course they could have kept such changes secret.

1 TB of code?

I'd sure like to run that as my operating system, browser, virtual assistant, car automation system, and overall do-everything-for-me system...

I'm currently investigating using GitLFS for a large repo that has many binary and other large artifacts.

I'm curious, did you experiment with LFS prior to building GitVFS?

Also, I know that there is a (somewhat) active effort to port GitVFS to Linux. Do you know if any of the Git vendors (GitLab and/or GitHub) are planning to support GitVFS in their enterprise products?

Yes we did evaluate LFS. The thing about LFS is that while it does help reduce the clone size, it doesn't reduce the number of files in the repo at all. The biggest bottleneck when working with a repo of this size is that so many of your local git operations are linear on the number of files. One of the main values of GVFS is that it allows Git to only consider the files you're actually working with, not all 3M+ files in the repo.
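A rough way to picture that cost model: a status-like scan that compares every indexed file does work proportional to the size of the repo, while a scan limited to the files you've actually touched does work proportional to your working set. This sketch only illustrates the idea; the function and its arguments are hypothetical, and it is not how Git or GVFS are implemented.

```python
def status_scan(index, worktree, paths=None):
    """Return (modified paths, number of files examined).

    index and worktree map path -> content hash. When paths is None every
    indexed file is compared, which is the linear cost described above.
    """
    to_check = index.keys() if paths is None else paths
    examined = 0
    modified = []
    for path in to_check:
        examined += 1
        if worktree.get(path) != index[path]:
            modified.append(path)
    return sorted(modified), examined
```

With 1,000 indexed files and two touched files, the full scan examines 1,000 entries and the restricted scan examines two, and both report the same modification. At 3M+ files the gap is what separates a usable `git status` from an unusable one.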

> One of the main values of GVFS is that it allows Git to only consider the files you're actually working with, not all 3M+ files in the repo.

That is an excellent point. Thanks!

We at GitLab are looking at GitVFS but have not made a decision yet https://gitlab.com/gitlab-org/gitlab-ce/issues/27895

Very cool blog! As I understand it, you dynamically fetch a file from the remote git server the first time I open the file. Do you do any sort of pre-fetching of files? For example, if a file has an import and uses a few symbols from that file, do you also fetch the imported file beforehand, or just fetch it when you access it the first time?

For now, we're not that smart and simply fetch what's opened by the filesystem. With the cache servers in place, it's plenty fast. We do also have an optional prefetch to grab all the contents (at tip) for a folder or set of folders.

We don't currently do that sort of predictive prefetching, but it's a feature we've thought a lot about. For now, users can explicitly call "gvfs prefetch" if they want to, or just allow files to be downloaded on demand.

What's the PR review UI built in?

Custom JQuery-based framework, transitioning to React.

Actually, I think we finished the conversion to React :). So, React.

Taylor is the dev manager for that area so I'm inclined to believe his correction :)

What was the impetus for switching to git?

More or less:

- Availability of tools

- Familiarity of developers (both current and potential)

Any plans to port GVFS to Linux or macOS?

Why not TFS?

"A handful of us from the product team are around for a few hours to discuss if you're interested."


This is a little off-topic, but why can't Windows 10 users conclusively disable all telemetry?

(I consider the question only a little off-topic, because I have the impression that this story is part of an ongoing Microsoft charm-offensive.)

Haha, no answer, as expected. HN got butthurt as well lol.

I knew there was a risk of getting downvoted, but I was surprised that it went to "-4".
