Jay Taylor's notes


Announcing GVFS: Git Virtual File System | Hacker News

Original source (news.ycombinator.com)
Tags: Google version-control microsoft piper gvfs perforce citc news.ycombinator.com
Clipped on: 2017-05-27

This is similar to what Google uses internally. See http://cacm.acm.org/magazines/2016/7/204032-why-google-store...:

"Most developers access Piper through a system called Clients in the Cloud, or CitC, which consists of a cloud-based storage backend and a Linux-only FUSE13 file system. Developers see their workspaces as directories in the file system, including their changes overlaid on top of the full Piper repository. CitC supports code browsing and normal Unix tools with no need to clone or sync state locally. Developers can browse and edit files anywhere across the Piper repository, and only modified files are stored in their workspace. This structure means CitC workspaces typically consume only a small amount of storage (an average workspace has fewer than 10 files) while presenting a seamless view of the entire Piper codebase to the developer."

This is a very powerful model when dealing with large code bases, as it solves the issue of downloading all the code to each client. Kudos to Microsoft for open sourcing it, and under the MIT license no less.
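To make the overlay idea in the quoted passage concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class name `OverlayWorkspace` and the in-memory dictionaries are hypothetical, and the real CitC is a FUSE file system with a cloud backend, not anything like this. The point is just that reads fall through to a shared base while the workspace stores only what was modified.

```python
class OverlayWorkspace:
    """Toy overlay: reads fall through to a shared base, writes stay local.

    Hypothetical illustration only; not how CitC or GVFS is implemented.
    """

    def __init__(self, base):
        self.base = base        # shared, read-only view of the full repository
        self.modified = {}      # only the files this workspace has changed
        self.deleted = set()    # paths removed in this workspace

    def read(self, path):
        if path in self.deleted:
            raise FileNotFoundError(path)
        if path in self.modified:       # local change wins over the base
            return self.modified[path]
        if path in self.base:           # otherwise fall through to the base
            return self.base[path]
        raise FileNotFoundError(path)

    def write(self, path, content):
        self.deleted.discard(path)
        self.modified[path] = content   # the base is never touched

    def delete(self, path):
        self.modified.pop(path, None)
        self.deleted.add(path)

    def listing(self):
        # The workspace presents the whole tree, minus local deletions.
        return sorted((set(self.base) | set(self.modified)) - self.deleted)


base = {"lib/a.py": "A1", "lib/b.py": "B1", "app/main.py": "M1"}
ws = OverlayWorkspace(base)
ws.write("lib/a.py", "A2")
```

After the single write, `ws.modified` holds one entry while `ws.listing()` still shows all three files, which is the "small workspace, full view" property the quote describes.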

Holy cow, it sounds like they reinvented ClearCase!

If they get this right, this can be MASSIVE for Microsoft in Enterprise. ClearCase was the reason why IBM was able to charge $1000+ per developer license fees.

ClearCase did what a lot of Enterprise companies needed at the time, and most importantly, it created hooks, that were mostly too difficult to remove. Once you create deep integration with ClearCase, you are very much committed to using it long term.

I was planning on responding to the person that had replied, but their comment has since been removed, and because I can't edit my post, I'll add some insight, as to why I used ClearCase as an example.

For those who have never worked with/administered ClearCase before, you may not fully appreciate how insanely complex it is. In order to use it, you first have to apply kernel patches from IBM, which shows how committed you had to be. ClearCase provided something that others couldn't, which is why it was so expensive. With Git, everything has changed.

Since nobody owns Git and its implementation, the differentiating factor right now is mostly superficial. There really isn't anything, other than hosting repos at a massive scale, that can't be easily duplicated. Git hosting, in my opinion, is now officially a commodity product. And what differentiates GitHub, GitLab, Bitbucket, etc. is mostly marketing.

With GVFS, things could change. This could be the first step, in Microsoft owning the hard part, that can't be easily duplicated by others. I really don't know what is on their roadmap, but views in ClearCase were pretty powerful and if they are looking at the level of integration, then it could be tough for GitLab, GitHub and others to follow.

My "favorite" ClearCase server issue was one that took 2 weeks of uptime before it resulted in a crash on an up-to-date AIX installation. I had to wait until we had somewhat identified the timeline before engaging our ops team so they could log the crash and submit it back to IBM so they could investigate/fix.

Agreed, same w/GVFS IMO.

I ruminated about ccase/git elsewhere in this thread: https://news.ycombinator.com/item?id=13560108

Google's Piper is impressive (I used it), but it emulates Perforce. Having something Git-based is a lot more exciting. Hope someone ports it to platforms other than Windows...

Google is far more advanced than this. They have one giant monorepo (Piper) that's backed by Bigtable (or at least it was, when I was there). Piper was mostly created in response to Perforce's inability to scale and be fault tolerant. Until Piper came along, they would have to periodically restart The Giant Perforce Server in Mountain View. Piper is 24x7x365 and doesn't need any restarts at all. But the key bit here is not Piper per se. Unlike Microsoft, Google also has a distributed, caching, incremental build system (Blaze), and a distributed test system (Forge), and they are integrated with Piper. The vast majority of the code you depend on never actually ends up on your machine. Thanks to this, what takes hours at Microsoft takes seconds at Google. This enables pretty staggering productivity gains. You don't think twice about kicking off a build, and in most cases no more than a minute or two later you have your binaries, irrespective of the size of your transitive closure. Some projects take longer than that to build, most take less time. Tests are heavily parallelized. Dependencies are tracked (so tests can be re-run when dependencies change), there are large scale refactoring tools that let you make changes that affect the entire monorepo with confidence and without breaking anyone.

Google's dev infra is pretty amazing and it's at least a decade ahead of anything else I've seen. Every single ex-Googler misses it quite a bit.

I'm a Microsoft employee on the Git team. We do have a distributed, caching, incremental build system and a distributed test system. Right now, they're completely internal - like Google. They're called CloudBuild and CloudTest. They're very fast and no one thinks twice about kicking off a build.

Google employs a distributed, caching, incremental build system and a distributed test system across the majority of their code base. I worked in Windows Store and I can assure you that most people there don't use CloudBuild and CloudTest, let alone know what they are. I would be confident in saying the majority of people at Microsoft are in that boat.

Just to give others an idea of what this is like: I work on the Protocol Buffers team at Google. Almost all software at Google depends on my team's code. If I have a pending change, I can test it against the whole codebase before submitting (this is a "global presubmit"). If something breaks in global presubmit, I can build and run any target in the codebase with my change in a single command, and this will take O(10 minutes) to build from scratch.

This would be like if I worked on the core Windows SDKs and I could routinely test my changes against everything from Microsoft Flight Simulator to the Bing server code before I submit.

Is the ChromeOS / Android team like this because it really sounds like we're comparing Epeens.. i'd be surprised if android builds typically build world and do so in 10 minutes..

There are parts of the company that aren't in this ecosystem, usually for legacy reasons. But even if they were you'd find a lot of this stuff to still be shockingly fast.

Because the build server is centralized it can be aggressive about caching intermediate build steps. Incremental builds aren't just incremental for you, but incremental for everybody.
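A rough Python sketch of why a shared cache makes builds "incremental for everybody": if each build step is keyed by a hash of its command and inputs, the second person to request the same step gets the stored result. Everything here (`SharedBuildCache`, `action_key`, `fake_compile`) is a hypothetical name for illustration, not the Blaze or Forge design.

```python
import hashlib


class SharedBuildCache:
    """Toy cache of build step outputs, shared by every user of one server."""

    def __init__(self):
        self.results = {}      # action key -> output bytes
        self.executions = 0    # how many steps actually had to run

    @staticmethod
    def action_key(command, inputs):
        # Key a step by its command line plus the content of every input.
        h = hashlib.sha256()
        h.update(command.encode())
        for name in sorted(inputs):
            h.update(b"\0" + name.encode() + b"\0" + inputs[name])
        return h.hexdigest()

    def build(self, command, inputs, run):
        key = self.action_key(command, inputs)
        if key not in self.results:    # cache miss: someone has to do the work
            self.executions += 1
            self.results[key] = run(inputs)
        return self.results[key]       # cache hit: reuse the earlier result


def fake_compile(inputs):
    # Stand-in for a real compiler invocation (assumption for the sketch).
    return b"obj:" + b"".join(inputs[name] for name in sorted(inputs))


cache = SharedBuildCache()
alice = cache.build("cc -c foo.c", {"foo.c": b"int x;"}, fake_compile)
bob = cache.build("cc -c foo.c", {"foo.c": b"int x;"}, fake_compile)
carol = cache.build("cc -c foo.c", {"foo.c": b"int y;"}, fake_compile)
```

Here the second request reuses the first result, and only the third, whose input differs, triggers another execution.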

Dunno how it is now, but years ago it'd take them a _few weeks_ to just propagate commits into the stable branch through a series of elaborate branch integrations, so yeah, you couldn't change something and test it on a whim. Plus build of just windows alone would take overnight, and rebuilding everything to test a Windows change was not logistically, politically, or technically possible.

> you couldn't change something and test it on a whim

You could. It would just not leave your branch for a while. Around the scheduled merges it would run against the tests of progressively more of the larger organization.

Parts of this actually constituted a good way to prevent being distracted by the bugs of faraway teams. If something reached your branch, where you were working, it was vetted by the tests required to make it into winmain.

The downside was that people got fairly political about what goes into the branch and when, even for small things.

Forgot to mention and now it's too late to edit: the most common way to do quick tests during development was to only build a few DLLs or .sys files and replace them on a running system. Then your team would have a set of branches that build every night.

But that's just Windows. What you can get at Google is a full test run over _everything_ that your change affects. This lets you ensure that an obscure change in behavior will not break others, including products built using your library as a remote, transitive dependency. You also get to fix bugs globally. Say you had a really shitty internal API that was causing problems or slowing things down. You get to actually go in there and change that API, and update callsites in one atomic commit. You can also make sure that you're not breaking anyone's build or tests by introducing your change. There are teams at Google whose purpose in life is repo-wide "code cultivation". Finding issues and fixing them globally basically. This just doesn't happen at MS.
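One way to picture "everything that your change affects" is a walk over reverse dependencies. The Python sketch below is an assumption-laden illustration: the function `affected_targets` and the target names are made up, and it says nothing about how Google's tooling actually computes this.

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return every target that transitively depends on a changed target.

    `deps` maps each target to the list of targets it depends on.
    """
    # Invert the graph: for each target, who depends on it directly?
    rdeps = defaultdict(set)
    for target, needs in deps.items():
        for dep in needs:
            rdeps[dep].add(target)

    # Breadth-first walk from the changed targets along reverse edges.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in rdeps[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical dependency graph for illustration.
deps = {
    "protobuf": [],
    "rpc": ["protobuf"],
    "search_server": ["rpc"],
    "search_server_test": ["search_server"],
    "docs_tool": [],
}
affected = affected_targets(deps, ["protobuf"])
```

With this graph, a change to `protobuf` selects `rpc`, `search_server`, and `search_server_test`, while the unrelated `docs_tool` is left alone.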

+1. I know some insiders at MS and their build/test/deployment story is universally very crappy. Things barely work, held together by curse words and duct tape.

Googlers like to joke internally that Google looks like a race car from the outside and like the Moving Castle from Hayao Miyazaki's cartoon from the inside, but that's not the case at all. Comparatively speaking it's a race car inside and out, it's just that the insiders don't know how shitty things are elsewhere.

P.S. I heard Bing is different, but I have no visibility into it, so can't comment.

I worked in the cloud at MS roughly 7 years ago, and Bing was very different from anything else. The MS cloud services were a total mess. I mean, as in hard to overstate how bad. In contrast, Bing was a well-oiled machine. Simple build, simple deploy, simple devops, simple test, very consistent processes across all of their teams.

At the time, Azure was a joke (partly due to the fact that the initial teams were headed up by ex Office devs with no cloud experience, if I remember correctly). But Azure was cannibalizing the Bing team pretty hard. I hear that strategy worked and that Azure is in much more capable hands now.

> Every single ex-Googler misses it quite a bit.

I dunno. I don't miss the 1-minute incremental builds. (Maybe they've improved since I left, though.)

BTW Forge is not just the test runner, but the thing that runs all build tasks, farmed out to all servers. Blaze interprets the build language and does dependency tree analysis but then hands off the tasks to Forge. Blaze has been (partially) open sourced: https://bazel.build/

I know. :-)

> Google's dev infra is pretty amazing and it's at least a decade ahead of anything else I've seen. Every single ex-Googler misses it quite a bit.

This may be naive but why not recreate it as an open source project?

Blaze has been: https://bazel.build/

Forge and Piper are built on Google's internal tech stack and designed for Google's production infrastructure, so open sourcing them would be a very big project. I think it would be a lot more likely for them to be offered as a service -- and that might be more useful to users anyway, since you'd be able to share resources with everyone else doing builds, rather than try to get your own cluster running which might sit idle a lot of the time. Of course, there are privacy issues, etc.

(Disclaimer: I'm purely speculating. I left Google over four years ago, and have no idea what the tools people are up to today.)

Because very few people have a need to support a billion-LOC monorepo on which 30K engineers make tens of thousands of commits daily. That's where this system shines.

For smaller projects, Git+Bazel (open source, non-distributed version of Blaze) works fine if you're working with C++, and other build systems work OK as well, if you're working with other languages.

I think this would be better than piper+citc. While the virtual filesystem aspect is nice, the perforce model is far inferior to git's (IMO). Of course Google has tools on top of it. But it's not fair to compare a VCS and interface to it to a complete development infrastructure. Hell, the content addressing of git would even make it easy to build something similar.
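For readers unfamiliar with the content addressing mentioned here: Git names a blob by the SHA-1 of a short header (`blob <size>` followed by a NUL byte) plus the file content. The Python sketch below reproduces that calculation; the helper name `git_blob_id` is made up for the example, and the suggestion that such IDs could serve as cache keys for a build system is the commenter's idea rather than an existing feature.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw bytes of the file.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")   # the well-known ID of the empty blob
same_a = git_blob_id(b"int main() { return 0; }\n")
same_b = git_blob_id(b"int main() { return 0; }\n")
different = git_blob_id(b"int main() { return 1; }\n")
```

Identical content always yields the same ID regardless of path or author, so two clients can tell they hold the same file without comparing bytes.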

> perforce model is far inferior to git's

That's just, like, your opinion, man. There are other bits of infra that integrate with it quite nicely, and would integrate with something like Git quite poorly. One of those things is their code review system. The closest thing I could find to it outside Google is Gerrit, but it's a tremendous pain to set up and use, and it's but a pale shadow of Google's internal tool (Critique).

And also, one does not preclude another: Google has a git wrapper on top of Piper, so you can spend your entire Google career not even touching Piper directly if that's what you prefer. And Piper went beyond the "Perforce model" in ways I can't disclose here.

Check out Reviewable, it's influenced by my fond memories of Critique but designed specifically for git (and GitHub).

Reviewable is amazing, thanks for making that.

I've used lots of review tools and worked a bit on Google's review tool and on ReviewBoard in the past, and Reviewable is better than all of them in my opinion (or at least, better than when I last used the others).

I wonder how do you know what parts you CAN disclose?

Whatever is already public. You can watch a video about Piper and some other systems on YouTube, and read about other things from Google's own blogs, papers, etc.

There is a discussion thread on r/programming where the MS folks who implemented this answer questions. A lot of questions, like why not use multiple repos, why not git-lfs, why not git subtree, etc., are answered there.


Thanks for bringing this up, it was actually a more interesting read than this thread. Less trolling, more facts, and also interesting to read stuff I didn't happen to know. Like:

One of the core differences between Windows and Linux is process creation. It's slower - relatively - on Windows. Since Git is largely implemented as many Bash scripts that run as separate processes, the performance is slower on Windows. We’re working with the git community to move more of these scripts to native cross-platform components written in C, like we did with interactive rebase. This will make Git faster for all systems, including a big boost to performance on Windows.

> "We’re working with the git community to move more of these scripts to native cross-platform components written in C"

Sad. Rather than fix the root problem they rewrite the product in a less-agile language and require everyone to run opaque binaries.

They probably even think they're doing a good thing.

C is portable, bash scripts are not.

Bash is portable across other OSes... They could work on a good port. Or, remove some bash-isms from the code so it would work in another shell if that was an issue.

I understand they took the initially easy route. But it'll be harder for everyone to use that code now, including them.

It's interesting how all the cool things seem to come from Microsoft these days.

I still think we need something better than Git, though. It brought some very cool ideas and the inner workings are reasonably understandable, but the UI is atrociously complicated. And yes, dealing with large files is a very sore point.

I'd love to see a second attempt at a distributed version control system.

But I applaud MS's initiative. Git's got a lot of traction and mind share already and they'd probably be heavily criticized if they tried to invent their own thing, even if it was open sourced. It will take a long time to overcome their embrace, extend and extinguish history.

> I still think we need something better than Git, though. It brought some very cool ideas and the inner workings are reasonably understandable, but the UI is atrociously complicated. And yes, dealing with large files is a very sore point.

Note that Google and Facebook ran into the same problems Microsoft did, and their solution was to use Mercurial and build similar systems on top of it. Microsoft could've done that too, but instead decided to improve Git, which deserves some commendation. I'd rather Git and hg both got better rather than one "taking over".

Google uses some variant of perforce, just like MS has been doing.

They used to, but now they use Piper which is built on top of mercurial.

Piper is not built on top of Mercurial.

Sorry, you are right, I got it confused with some other work Google was doing to improve the scalability of mercurial. It's not based on perforce though either, appears to be entirely custom technology.

They first outgrew git, then outgrew perforce and implemented a custom server on top of their usual storage/database stack.

> Microsoft could've done that too, but instead decided to improve Git,

They didn't improve git, they only made this for themselves and for their product users. Git doesn't restrict you to a single operating system.

Look at Issue 4 on their GitHub repo, they want to port it also to Linux and macOS

> They didn't improve git, they only made this for themselves and for their product users. Git doesn't restrict you to a single operating system.

Given Microsoft's recent form, I'd expect this to appear on Linux before long, and possibly osx too. In any case, it's open source so you could always port it yourself.

I would be surprised. This sort of project is deeply OS-specific. If they wanted to eventually make it cross-platform, they would have started by implementing FUSE on Windows.

Several years ago a friend and I had a need to build a virtual file system that was portable between Linux and Windows. At the very least, we attempted to share as much code as possible. It proved to be pretty easy and we had a working prototype after about 30 hours of work. We used FUSE on Linux, and Dokan FUSE on Windows.

[1] https://dokan-dev.github.io/

[2] https://github.com/dokan-dev/dokany/wiki/FUSE

> In any case, it's open source so you could always port it yourself

Of course, but that would be me, and not Microsoft, who's improving git ;-)

> It's interesting how all the cool things seem to come from Microsoft these days.

I've assumed Microsoft have been making all this stuff all along, but keeping it internal then throwing it away on the probably false assumption that every bit of it is some sort of competitive advantage. I think they're coming around to the idea that at least appearing constructive and helpful to the developer community will help with trying to hire good developers.

Maybe something that has the data models of git but has a more consistent interface? Today on Git Merge there was a presentation about http://gitless.com/

For example one of the goals is to always allow you to switch branches. Stash and stash pop would happen automatically and it would even work if you're in the middle of a merge.

I'm still waiting for a decent GUI that takes full advantage of the simplicity of git's underlying data model. The CLI is okay and I've gotten really good with it, but fundamentally I think git's DAG is something that would be best represented and manipulated graphically.

[Reinventing the Git Interface][1] was written almost 3 years ago now and yet to my knowledge nobody's implemented anything quite like that yet.

[1]: http://tonsky.me/blog/reinventing-git-interface/

Git Kraken has some neat ideas around dragging bits of the DAG around to manipulate them.

Yeah, I've been meaning to try that for a while now. Unfortunately I can't use it at work because Kraken still doesn't support connecting to the internet through a proxy, and they won't let you use it offline.

How does Git Kraken fall short?

I quite love both the motivation and the implementation of gitless (and the choices they've made). I find it much more usable than git.

I'd never heard of Gitless, I'll check it out, thanks.

> I'd love to see a second attempt at a distributed version control system.

Out of curiosity, why a whole new attempt? Personally, I'd prefer the approach of "making our current tools better."

"Let a thousand flowers bloom." Competition helps both sides. Clang became good enough that it spurred GCC to become a lot better.

Until 1997, forking a project was considered a tragedy. I think things have improved since then :-).

Fair point.

What are your thoughts on Pijul? (https://pijul.org)

I'd love to dig into Pijul but sadly it's AGPL. Just looking at the code effectively taints me as a developer who works tangentially to VCS - it could be argued that my work is derivative and thus needs to be AGPLed. The viral nature of the license then creates an existential legal risk to my employer.

> I'd love to see a second attempt at a distributed version control system.

Git wasn't the first, and even then had several contemporaries at 2nd gen.

Yup. I remember when git came along the field was already pretty crowded (DVCS, Darcs, Bazaar, BitKeeper, Mercurial...). I've always suspected Linus wrote git in a panic simply to sidestep the months of flames that switching VCS again would have inevitably generated, once BitKeeper stopped being viable. I also remember people jumping at the occasion like they would have never done to improve someone else's tool.

The story of git is a good case-study for people interested in group dynamics.

I agree, the story of git is a good group dynamics case-study. I watched a small bit of it from mailing lists at times.

I was a heavy darcs user at the time and the impression I got was part of the name git in the first place was that it was intentionally the "dumb, dirty, get things done" answer to darcs' (sometimes problematic) smarts. (Remember, the British slang definition of git is "an unpleasant or contemptible person".)

It's also interesting that both Mercurial and git were spun out of the BitKeeper fiasco (BitKeeper was a commercial product that allowed free hosting for Open Source projects, up until the fiasco where they decided they were bored hosting Open Source) by Linux kernel members. Mercurial actually wound up with a lead and, if I recall correctly, was much more usable faster than git was. The problems with Mercurial were that it was written in Python, while git was led by Linus himself and written in the development environment apparently preferred by kernel hackers: C, perl, bash, awk, sed, spit, and duct tape.

The problem for Mercurial (with respect to group dynamics) was that it didn't have the built-in user base of Linux contributors right off the bat.

I briefly glossed over it, but Mercurial could have had the Linux contributors right off the bat. It was built faster than git and was built with the kernel team in mind. I do think it is an interesting bit of group dynamics that the kernel team as a whole didn't adopt Mercurial (because there were kernel team members on both sides of the Mercurial/git divide), and some of that reason seemed to be, from what I read at the time, simply a dislike of Python by a surprising number of kernel team members.

Indeed, but I also had a pause when I considered how heavily Microsoft depended on a system originally built by Linus Torvalds :)

Mercurial is very similar to git but more user friendly.

> It's interesting how all the cool things seem to come from Microsoft these days.

It's like a whole'nother company after they got rid of Steve Ballmer.

Meh. If I'm watching the 3E cycle right, they're currently in the Embrace phase and heading to Extend. And it's been a cycle for a number of repetitions - it doesn't take a genius to see where it goes next.

> but the UI is atrociously complicated

Linus himself admitted that he isn't good at UI. Anyway, I think git just wasn't designed to be used directly, but via another UI. For example, I use it within Visual Studio Code, and that covers about 90 percent of use cases, and then Git Extensions can take care of almost everything else. Sometimes the CLI is needed, though.

Git Extensions was a fantastic project back when the idea of Microsoft supporting Git would have seemed impossible. It made my life better every day when I was a C# developer. These days, I use the Git CLI on Windows, but VS integration with Git seems good, so I haven't felt the need to install Git Extensions.

I've still not yet seen a stand-alone GUI for Git that is better than the one that ships with Git Extensions, though.

What a change of cash-cow placement can do ...

Using git with large repos and large (binary blob) files has been a pain point for quite a while. There have been several attempts to solve the problem, none of which have really taken off. I think all the attempts have been (too) proprietary – without wide support, it doesn’t get adopted.

I'll be watching this to see if Microsoft can break the logjam. By open sourcing the client and protocol, there is potential...

Other attempts:

* https://github.com/blog/1986-announcing-git-large-file-stora...

* https://confluence.atlassian.com/bitbucketserver/git-large-f...

Article on GitHub’s implementation and issues (2015): https://medium.com/@megastep/github-s-large-file-storage-is-...

I think Joey Hess' attempt at "solving the problem" deserves a mention.

It is open source (GPLV3) licensed. [not proprietary]

Written in Haskell. [cool aid]

Currently has 1200+ stars on Github and is part of at least Ubuntu (http://packages.ubuntu.com/search?keywords=git-annex) since 12.04. [shows something for support and adoption]

edit: Link to Github https://github.com/joeyh/git-annex -- thanks dgellow

For the problem of large files I think Git LFS has largely won out over git annex, mostly because it's natively supported by GitHub and GitLab and requires no workflow changes to use.

Atlassian's Bitbucket and Microsoft's Visual Studio Team Services both also support Git LFS.

As of version 6 of git annex, the only thing an unlocked repo needs other than the usual workflow is a `git annex sync`, which could easily be configured as a push hook.

There's a small but important trap to people who might want to use git-annex as a backup tool, namely that you can't store a git repo in git-annex.


Having nested git repositories is a solved problem both in git and in git-annex: use submodules.


Link to the github mirror https://github.com/joeyh/git-annex

git-annex is, IMHO, by far the best solution.

Pros of git-annex:

- it is conceptually very simple: it uses symlinks instead of ad-hoc pointer files, virtual file systems, etc. to represent symbolic pointers to the actual blob file;

- you can add support for any backend storage you want. As long as it supports basic CRUD operations, git-annex can have it as a remote;

- you can quickly clone a huge repo by just cloning the metadata of the repo (--no-content in git-annex) and just download the necessary files on-demand;

And many other things that no other attempt even considers having, like client-side encryption, location tracking, etc.

Git Annex is only a partial solution, since it only solves issues with binary blobs. It doesn't solve problems with large repos.

That still only solves half the problem with large binary blobs.

The other half is that almost all of the binary formats can't be merged, so you need a mechanism to lock them to prevent people from wiping out other people's changes. Unfortunately that runs pretty much counter to the idea of a DVCS.

I always wonder why this never gets discussed much. We seem to have tons of solutions for storing large files outside the repo, but so what? OK, so I don't deny that storage still isn't cheap enough to just say "oh well" to the idea of a multi-TByte repo, so it's certainly solving a problem. But there's still another major problem left!

Don't their artists and designers use version control too? Maybe they just have one such person per team, or each person owns one file, or something like that. Hard to say.

Maybe it's like how I used to work on teams that never used branches - you have various problems that you figure there's probably a solution for, but there's never time to (a) figure out what the solution looks like, (b) shift the whole team over to a brand new workflow and set of tools, and (c) clean up the inevitable mess. So you just work around the problems the same way you always have - because at least that's a known quantity.

Yeah, best solution I've seen (but haven't had a chance to use in anger) is P4-Fusion (P4 backend, git views).

Perforce was always the gold standard for stuff like this. Did a great job at not only providing locking but stuff like seamless proxies and other solutions to common problems in that domain (like a usable UI).

When git-annex finds a conflict it can't solve, it gives back to you the two versions of the same file with the SHA of the original versions suffixed.

This way you can look at both and resolve the conflict.

That's the exact point I'm making: fundamentally, a large portion of these formats (PSD, JPG, PNG, MA, 3DS) can't be resolved.

If two people touch the same file at the same time, someone is going to drop their work on the floor and that's a bad thing(tm). You need to synchronize their work with a locking mechanism that informs the user at the edit (not sync) point in the workflow.
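A minimal Python sketch of what locking "at the edit point" could look like, as opposed to discovering the conflict at sync time. The `LockRegistry` class and its method names are hypothetical; Perforce and other tools implement exclusive checkouts in their own ways.

```python
class LockRegistry:
    """Toy central registry of exclusive edit locks on unmergeable files."""

    def __init__(self):
        self.owners = {}   # path -> user currently editing it

    def checkout_for_edit(self, path, user):
        # Refuse up front, before any work is done on the file.
        owner = self.owners.get(path)
        if owner is not None and owner != user:
            raise PermissionError(f"{path} is locked by {owner}")
        self.owners[path] = user

    def release(self, path, user):
        # Only the current owner can release the lock.
        if self.owners.get(path) == user:
            del self.owners[path]


locks = LockRegistry()
locks.checkout_for_edit("art/hero.psd", "alice")
try:
    locks.checkout_for_edit("art/hero.psd", "bob")
    bob_blocked = False
except PermissionError:
    bob_blocked = True   # bob learns about the clash before editing anything

locks.release("art/hero.psd", "alice")
locks.checkout_for_edit("art/hero.psd", "bob")   # now succeeds
```

The design point is that the second artist is told immediately rather than after hours of work on a file that cannot be merged.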

It's disappointing that all the comments are so negative. This is a great idea and solves a real problem for a lot of use cases.

I remember years ago Facebook said it had this problem. A lot of the comments centered on the idea that you could change your codebase to fit what git can do. I'm glad there's another option now.

Yes they did. They chose to scale out Mercurial to solve their problem. Wonder if they still use Mercurial?


Both Facebook and Google are continuing to contribute to Mercurial, so they both have some vested interest in it. If you poke around the commits on the repo[0] you'll see commits from people with @fb.com and @google.com email addresses. The mailing list also still has activity from both companies.

As well, the Mercurial team does quarterly sprints (I believe), and Google is hosting the next one[1].

[0] https://www.mercurial-scm.org/repo/hg

[1] https://www.mercurial-scm.org/wiki/4.2sprint

Sprints are twice a year, once in the US and once in Europe.

They do. Durham Goode (Tech Lead on Source Control at Facebook) just held a talk at Git-Merge about how they scaled Mercurial at Fb. They seem to be quite happy with it, albeit applying quite a few restrictions on their internal users that are not really transferable to the general (outside-corporate) usage of VCS (for example only rebases are allowed, directly committing to master all the time, etc.)

That's actually pretty comparable to how we tend to operate the Mercurial project, FYI. We tend to prefer rebase to merge for feature work.

Do you use Changeset Evolution?

Very small fraction of FB engineers use Changeset Evolution.

We have new workflows based on some of the underpinnings of Evolution, but without the UI confusion.

They did a couple of months ago so I assume they still do.

I also don't think that this is a good idea. Git is a distributed version control system (https://en.wikipedia.org/wiki/Distributed_version_control), the main benefit of which is that it "allows many software developers to work on a given project without requiring them to share a common network". It seems like with GVFS they are turning a DVCS into a CVS (https://en.wikipedia.org/wiki/Concurrent_Versions_System) again. What is the point? There are a lot of good CVS systems around. Is it just to give cool kids access to cool tools? I believe there are plenty bridges between CVS and git already implemented, which also allows you to checkout only part of the CVS tree.

At Splunk we had the same problem: our source code was stored in CVS (perforce), but we wanted to switch to git. And not only because we really wanted to use git, but to simplify our development process, mainly because of the much easier branching model (lightweight branching is also available in perforce, but to get it we still needed to do some upgrades on our servers). We also had the problem that at the beginning we had a very large working tree. I don't think it was 200-300 GB, I believe it was 10x less, but it actually required 4-5 seconds for git status. This was not acceptable for us, so we worked on our source code and release builds to split it into several git repos to make sure that git status would take no more than 0.x seconds.

My point is: use the right tools for the right jobs. 4-5 seconds for git status is still a huge problem; I would prefer to use CVS instead if that means I don't have to wait 5 seconds for each git status invocation.

> I believe there are plenty bridges between CVS and git already implemented, which also allows you to checkout only part of the CVS tree.

How many of them have you used? I've used a couple, to interact with large code bases on the rough order of 300GB. In my experience they don't work very well, because you have to be hygienic about the commands you run or some part of your Git state gets out of sync with some part of your state for the other source control system. So I gave up on those, and I use something similar to Microsoft's solution at work on a daily basis. It's a real pleasure by comparison, and in spite of that I still call myself a Git fan (about 10 years of heavy Git use now). At work the code base is monolithic and everyone commits directly to trunk (at a ridiculous rate, too).

I've heard horror stories about back when people had to do partial checkouts of the source code, and I'm glad that the tooling I use is better.

The idea of breaking up a repository merely because it is too large reminds me of the story of the drunkard looking for his keys under the streetlights. The right tools for the right job: sometimes you change the job to match the tools, and sometimes you change the tools to match the job.

A bit of a topic hijack, but I'm always curious how "many people committing very often on the same branch" works in practice. I'd expect there to be livelock-like scenarios.

Do you need to do anything special? Or is this just a non-issue? Do you push to master or do you use some sort of pull request GUI (like GitHub or Phabricator)?

Not sure what livelock you're talking about. If you have a bunch of people pushing to Git master, sure, you'll get conflicts and have to rebase (if we're talking about TBD). But the conflicts are always caused by someone successfully pushing to master, so some progress will always be made.

But I was always just using Git as an interface to something else, usually Perforce or something similar. When pushing with these tools, you'd only get conflicts if other people changed the same files. Git was just used to create a bunch of intermediate commits and failed experiments on your workstation, which is something that it really excels at.

The only real problem is when the file you're changing is modified by many people on different teams, which often means that it's used for operations, and when that becomes a bottleneck it'll get refactored into multiple files or the data will be moved out of source control.
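The "some progress will always be made" argument above can be sketched as a toy simulation. This is only an illustration, assuming a push is accepted exactly when the pusher built on the current tip (a compare-and-swap); `Remote`, `try_push`, and `simulate` are made-up names, not part of Git or any other tool.

```python
import random


class Remote:
    """Toy stand-in for a shared branch: the tip is just a counter."""

    def __init__(self):
        self.tip = 0

    def try_push(self, base):
        """Accept the push only if the pusher built on the current tip."""
        if base != self.tip:
            return False  # rejected: the pusher must rebase and retry
        self.tip += 1
        return True


def simulate(num_devs, seed=0):
    """Each dev has one change; every round all pending devs race to push.

    Returns the number of rounds until every change has landed.
    """
    rng = random.Random(seed)
    remote = Remote()
    bases = {dev: remote.tip for dev in range(num_devs)}
    rounds = 0
    while bases:
        rounds += 1
        order = list(bases)
        rng.shuffle(order)  # arbitrary arrival order at the server
        landed = [dev for dev in order if remote.try_push(bases[dev])]
        for dev in landed:
            del bases[dev]
        for dev in bases:
            bases[dev] = remote.tip  # the losers rebase onto the new tip
    return rounds
```

In this model exactly one push lands per round, so `simulate(5)` needs 5 rounds: contention costs retries, but every rejection is caused by someone else's success, so the loop always terminates.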

I used one for TFS, when I worked in Microsoft, and git p4 when I worked at Splunk. Certainly enjoying that we are 100% git now.

My point was that with GVFS they are not really solving the problem they had - git status still takes 4-5 seconds, and to me that is a lot.

So you're saying that GVFS isn't a good idea because it's not good enough for working on the Windows repository?

Well, yeah. It's pre-production, and let Microsoft worry about their own problems anyway. But it sounds like GVFS will be killer for people who have large repos that aren't as large as the Windows repo. Even if 4-5s for the 270GB/3.5M file repo is too long, 400-500ms for the 27GB repo is fantastic.

At some point you ask yourself, "Would I split this repo if my tools could handle the combined repo just fine?" If the answer is no, then you're going to be happy that the tools are getting better at handling big repos. Microsoft's choice to exploit the filesystem-as-API and funnel all filesystem interaction through the VCS is a smart choice and there are a ton of opportunities for optimization that don't exist when you're just writing to a plain filesystem.

> Seems like with GVFS they are making DVC to be a CVS again. What is the point?

It sounds like they answered that:

> In a repo that is this large, no developer builds the entire source tree. Instead, they typically download the build outputs from the most recent official build, and only build a small portion of the sources related to the area they are modifying. Therefore, even though there are over 3 million files in the repo, a typical developer will only need to download and use about 50-100K of those files.

Source will still be distributed among the developers that touch it. Seems like a decent compromise.

I'm curious to dig a bit further in, but from the blog post I get the impression that they are also still cloning the full commit history, just not the full file trees attached to the commits and definitely not the full worktree of HEAD, leaving those to be lazily fetched. If that is the case, that sounds like an interesting compromise on the git model and something verging on some of the speculative ideas I've seen about using something like IPFS to back git's trees, to the point where maybe you could use something like IPFS in tandem with this and have a good DVCS solution.

Based on the protocol https://github.com/Microsoft/gvfs/blob/master/Protocol.md#ge... I don't think that "still cloning the full commit history" is true.

> Seems like with GVFS they are making DVC to be a CVS

> just to give cool kids access to cool tools

Yes. DVCS with the huge code bases, large binary objects and large teams is hardly the optimal approach. But the "cool kids" are just used to use what they use. And now they can pretend to do it even when they have to be always connected, because the files are virtual and remain on the server until really used.

If Microsoft is giving the solution to the "cool kids," no reason to complain about the fact that Microsoft is willing to care for them.

And if you'd ask the "cool kids" why do they need git at all for such scenarios, have fun with the amount of arguments you'll get. Why this one "needs" vi and another "Emacs" etc. The same reasons. You'll find the arguments also in the comments here. Including mentions of Mercurial, the competition, just like "vi or Emacs". Because. Don't ask.

And no, as far as I understand, Google doesn't primarily "use Mercurial", they use something called Piper, and before they used a customized Perforce just like Microsoft did.


"Piper spans about 85 terabytes of data" "and Google’s 25,000 engineers make about 45,000 commits (changes) to the repository each day. That’s some serious activity. While the Linux open source operating system spans 15 million lines of code across 40,000 software files, Google engineers modify 15 million lines of code across 250,000 files each week."

It is not clear from the announcement nor the code, but in principle, I don't see a reason that it can't be a DVCS.

Sure, GVFS downloads files only when first read; but maybe it keeps them cached? Maybe you can still work on them and commit changes after you get offline? At least in principle, nothing prevents that.
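A minimal sketch of that "in principle" argument, assuming files are fetched on first read and then kept locally. This is not a description of how GVFS actually behaves; `LazyWorkTree`, `fetch`, and the file paths are hypothetical names for illustration only.

```python
class LazyWorkTree:
    """Toy model: file contents are fetched on first read, then kept locally."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable(path) -> bytes; stands in for a server call
        self._local = {}      # files already downloaded
        self.online = True

    def read(self, path):
        if path not in self._local:
            if not self.online:
                raise OSError(f"{path} was never read while online")
            self._local[path] = self._fetch(path)
        return self._local[path]

    def write(self, path, data):
        # Local edits need no network access in this model.
        self._local[path] = data


# Made-up server contents and a fetch function that records each download.
server = {"kernel/sched.c": b"int main;", "shell/explorer.c": b"void ui;"}
calls = []


def fetch(path):
    calls.append(path)
    return server[path]


tree = LazyWorkTree(fetch)
tree.read("kernel/sched.c")   # first read goes to the server
tree.read("kernel/sched.c")   # second read is served locally
tree.online = False
tree.write("kernel/sched.c", b"patched")  # still works offline
```

Under these assumptions, files touched while online stay usable offline, while a file that was never read (here `shell/explorer.c`) raises an error once the connection is gone.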

I was actually surprised that there was only as much negative sentiment as there is. Microsoft could cure cancer and the post to HN would be mostly negative. It's tribal. It doesn't even matter what they do at this point.

That being said, you can see more and more people getting off the "Microsoft is evil" train. It's super slow and every bone headed thing that Microsoft does resets the needle for lots of people.

I've always been surprised how much sympathy a company like IBM or Intel gets on HN. They both sue people over patents. They both contribute to non-free software. They were early backers of Linux, though, and that is what people care about superficially.

To be honest, I was pretty neutral about MS, for a long time now, carefully optimistic even: IE8 was fair enough (when it was new), Win8 was kinda okay, Azure is great...and just when you think they're a normal company, they take out the old guns and start shoving (first GWX and then) WinX down people's throats, never mind any consent.

So, I'm very, very, very sorry that I can't hear their words over the noise of their actions; and in the light of this, I eye each new gift-bearing Redmondian with suspicion.

to be fair, i cant say that i care if people are fair to a multinational corporation. whether linux fans are right or not they are still only doing whats best for their bottom line. should a company get a trophy for doing what its customers want?

I don't agree with the sentiment that doing something that benefits lots of people should be dismissed on the grounds that it was mutually beneficial.

that's a good point. its funny, though, that they have actually started doing a lot of things for PR purposes that i can only imagine that most of their customers couldn't really care less about.

for example, the majority of their money still comes from windows and office, but open source and hologram BS impress the most vocal anti-MS voices in the media.

my point, though, is that there are other companies that dont draw nearly as much ire that engage in the exact same practices. i think, that early antagonism between MS and Linux users has become a tribal signifier for some people. Microsoft people used to have the same kind of relationship with IBM. They also kept flogging that longer than it really made sense...just like linux and mac fans.

HN is at times astonishingly driven by brogrammer conventional wisdom. Look at all of the "why I'm ditching the Mac because I totally need a laptop with 64GB of RAM" stories that got posted after the latest MacBook Pro got introduced. Among the "creative" in New York there's the phenomenon of "why I left NYC and moved to LA" stories that some people—specifically dumb people—think are somehow representative of the zeitgeist.

yeah, microsoft fans used to think that IBM was literally the devil. turns out they were just another inept global company schlepping its way through history. "microsoft fans" isnt something that you hear that much anymore.

This is just more of their embrace, extend, extinguish campaign. This is the extend part.

yep. you got it in one, wile e. coyote.

> This is a great idea and solves a real problem for a lot of use cases.

I don't know if "a lot" is the right qualifier. Solitary repos of millions of files have scalability problems even outside the source control system (I mean: how long does it take your workstation to build that 3.5 million-file windows tree?)

A full Android system tree is roughly the same size and works fine with git via a small layer of indirection (the repo tool) to pull from multiple repositories. A complete linux distro is much larger still, and likewise didn't need to rework its tooling beyond putting a small layer of indirection between the upstream repository and the build system.

Honestly I'd say this GVFS gadget (which I'll fully admit is pretty cool) exists because Microsoft misapplied their source control regime.

It's because the 'problem' it solves is a corner case that's rarely encountered. I love their absurd examples of repos that take 12 hours to download. How many people have that problem, really?

All they did is create a caching layer.

> How many people have that problem, really?

An easy lower bound is 10s of thousands of engineers : developers at several large tech companies (e.g. MS, facebook, google, ?)

If you deal with code, the case is marginal for you.

If you deal with graphics, audio assets, etc, the binary-blob type of data, the case is central.

This is about code, and code history. Just insane volumes.

Well it's a problem for thousands of employees of Microsoft, isn't it? We had a much smaller repository (10GB IIRC) and it really was annoying how long everything took, even with various caches and what not enabled.

"I don't have this problem, so nobody does."

Lacking support for large binary blobs is, like, THE #1 reason that an engineer might have to use an alternative.

Ok, but you'll encounter similar git limitations with repos several orders of magnitude smaller than that too.

All you need is several hundred engineers and your monorepo becomes unwieldy for git to handle.

It's not a caching layer, it's lazy evaluation.

Shhhhh! It's Microsoft, we're not allowed to have a negative opinion of anything they do on hacker news.

We've already asked you to please stop this, so we've banned the account.

The recent Windows 10 thread was full of criticism for MS

And full of down votes for the negative comments. This website is very pro successful business, not pro tech.

Exactly, the top comment in this thread claims MS is the only company doing cool things (which is 100% not true). The top comments in the Windows 10 thread yesterday were mostly people claiming "it's sad to see such negativity toward MS" and there was maybe one or two negative comments followed by a circle jerk about how good MS and their software is.

Microsoft has historically been one of the worst tech companies for tech enthusiasts. We can ignore all the awful things they did in the 90's that stifled open standards (because apparently that doesn't matter anymore?) and just look at 2013, when they were exposed to have been participating in the NSA PRISM project. That means there is a whole team at MS that worked on a secret government project to help violate our fourth amendment rights. Even much of congress didn't know about NSA mass data collection, but Microsoft did.

People who trust MS these days are either naive or employed by them.

I'm immediately reminded of MVFS and clearcase. Lots of companies still use clearcase, but IMO it's not the best tool for the job. git is superior in most dimensions. From what this article says, it's not quite the same as clearcase but there's certainly some hints of similarities.

The biggest PITA with clearcase was keeping their lousy MVFS kernel module in sync with ever-advancing linux distros.

I really liked Clearcase in 1999, it was an incredible advancement over other offerings then. MVFS was like "yeah! this is how I'd design a sweet revision control system. Transparent revision access according to a ranked set of rules, read-only files until checked out." But with global collaborators, multi-site was too complex IMO. And overall, clearcase was so different from other revision control systems that training people on it was a headache. Performance for dynamic views would suffer for elements whose vtrees took a lot of branches. Derived objects no longer made sense -- just too slow. Local disk was cheap now, it got bigger much faster than object files.

> However, we also have a handful of teams with repos of unusual size! ... You can see that in action when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.

This seems like a way-out-there use case, but it's good to know that there's other solutions. I'd be tempted to partition the codebase by decades or something.

I used Clearcase (on Solaris) in 1999 and was not a fan. It slowed our build times by at least 10x. I'm sure it was probably set up wrong, but this was a Fortune 100 company with lots of dedicated resources.

Clearcase performance for many builds was specifically impacted by the very poor performance of stat(). You could make very real improvements on build times by reducing the number of calls to stat(). It was sort of amazing.

Clearcase also suffered, at least in my experience, from a clumsy and ugly merging process and deeply unintuitive command set which meant everyone who "used clearcase" actually tended to use some terrible homegrown wrapper scripts.

Still, considering it was the last remaining vestige of the Apollo Domain OS, not bad.

I used Clearcase a while back in an office in China. Every few days there'd be an email going round pleading with people to keep their antivirus software up-to-date; because someone had a virus somewhere, and apparently the virus would try to infect executables on the Clearcase virtual drive, at which point Clearcase would obligingly check the infected file into the VCS and distribute it to all clients...

I think they could have picked a name that doesn't conflict with GNOME Virtual File System (GVfs).

When they picked the "Windows" product name, they could have picked a name that didn't conflict with the use of windows. Picking on an obscure file system doesn't even register in comparison.

X Windows

NeWS - https://en.m.wikipedia.org/wiki/NeWS

Remember the mess on usenet?


comp.windows.news - not news about Microsoft Windows

In this particular case, the name is THE SAME (GVFS / GVFS). And they're both virtual filesystems, so there's lots of room for confusion.

I can imagine people at a forum:

"Hey, GVFS isn't working for me. It crashes with error -504 when I try to mount /nfs/company_data".

Try guessing which GVFS that is.

They used to choose very bad names like .NET or COM [1] (this predates Internet), which makes searching for information very tricky. MSDN doesn't help.

[1] https://en.wikipedia.org/wiki/Component_Object_Model

.NET is still rough naming-wise. We're porting a project to .NET Core Runtime which requires porting over to EF Core and ASP.NET Core (neither of which require being on Core Runtime).

Our internal libraries need to be compatible with the Core Runtime, so we have to have them target .NET Standard, which is compatible w/ the full .NET Framework or .NET Core. To target .NET Standard, you need the .NET Core SDK/CLI which includes the `dotnet` tool, which is almost never clarified as "the SDK/CLI" in documentation or in talks, but usually just ".NET Core".

Another minor annoyance: to build a .NET Standard-compatible library, you reference the "NETStandardLibrary" NuGet package. Makes a fair amount of sense, but is hard to talk about.

If you're running on Windows and want a smaller server footprint, you can use Windows Server Nano, which requires your apps to target .NET Core Runtime (not .NET Full Framework). Note that this requirement is not true for Windows Server Core. -_-

A few years ago I had to do some interfacing between python and some modelling software. I went through a COM interface, and it was a bloody nightmare to find docs.

I later found out I could have looked for "ActiveX" and found similar results.

A few years ago your best friend would have been https://www.codeproject.com . The issue of searching difficult questions using another keyword (e.g. ActiveX) is that you can miss the only answer available. For common questions (with answers!) you can find an answer with all the variations.

Their relational database is called SQL Server, which might otherwise be a colloquial generic name for an RDBMS.

They have a product in Azure named simply DocumentDB. I don't think "used to" is necessarily the best tense here (:

> NET or COM [1] (this predates Internet)

What could you possibly mean by that? The .com TLD was introduced in 1985, with microsoft.com registered already in 1991. Microsoft COM was created in 1993. (Of course, "the Internet" in any sense of the word predates all of this.)

.com was a command extension in DOS even before that.

And more fun, COM<number> is a special name, so you can't create e.g. a directory for COM2/3/etc. At least COM wasn't marketed like .NET.
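For anyone who hasn't run into this: a rough check for such names might look like the sketch below. The set is the commonly cited list of legacy DOS device names and is not guaranteed to be exhaustive; `is_reserved_windows_name` is a made-up helper, not a standard library function.

```python
# Commonly cited legacy device names; treat this list as an assumption.
RESERVED = {"CON", "PRN", "AUX", "NUL"}
RESERVED |= {f"COM{i}" for i in range(1, 10)}
RESERVED |= {f"LPT{i}" for i in range(1, 10)}


def is_reserved_windows_name(name):
    """True if `name` collides with a legacy DOS device name on Windows.

    The check ignores case and any extension, so "com2.txt" also counts.
    """
    stem = name.split(".", 1)[0].rstrip(" ")
    return stem.upper() in RESERVED
```

So `is_reserved_windows_name("COM2")` is true, while `"COMPUTER"` and `"COM10"` pass as ordinary names.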

That was my first thought. "dick move". But they probably didn't know about it.

This is Microsoft. Before announcing a product they have more than enough lawyers to check the name for any clashes.

They just came to the conclusion that GNOME's product is no threat and that they can just claim the name. Smaller companies [1] tried that before.

[1]: https://www.groupon.com/blog/cities/groupon-launches-gnome

I've built a personalized linux from the kernel up a couple times, and it never even popped into my head. Let's be fair to Microsoft, the nix world is vast, and not exactly easy to navigate if you don't live there. I live in all three worlds, and there are only so many letters. It drives me nuts when I just want a code name for a project, because I can no longer find unique words that aren't used by some project somewhere.

I mean, git itself did this in the beginning.

Microsoft, under Nadella has made me not hate Microsoft again, and that's a tall order because I'm over 40. This is an impressive move, and if they effectively execute all the bits that are possible here, this is just some great work.

(Oh, and I can't even use the word nix now as a catch all for all the POSIX/ POSIX(like) OSs because of nixOS.)

I think going forward, we just have to accept name collision.

> Microsoft, under Nadella has made me not hate Microsoft again, and that's a tall order because I'm over 40.

I'm over 40 as well, and I can honestly say I've never hated Microsoft or Bill Gates - what I hated (hate?) were/are their business practices.

I honestly wish I could (somehow) just get an apology from the company - something like "we were wrong, we're sorry, and we're working to make things right". Instead, it feels like a person you thought of as a friend who put you down and did bad things to you directly and behind your back, so you dropped them...and who years later starts doing nice things toward you and others, trying to get back into your "good graces" - but never once apologizing for their past actions.

I want to see Nadella's and Microsoft's actions in a good light, I want to see them as an unvoiced apology. At the same time, though, if it were a person doing this, I don't know if I could trust their motives, no matter how sincere or enticing it might look.

If you or anybody else could point me to a video of Nadella or someone else representing Microsoft making an apology regarding their past actions, it would go a long way toward me accepting their present behavior.

It probably won't make me ever install Windows 10, but I will probably see them in a better light.

Yeah let's apologize for corrupting governments. It will make it ok. Glory to the new microsoft, the proof you can bully your way to the top and with good PR, people will be ok with it in the end.

> But they probably didn't know about it.

GNOME Virtual Filesystem is first search result for "gvfs". Even if you use bing!

I understand that sometimes two products have the same name, but they usually have very different scopes/usages.

In this case, they're both called GVFS AND three of the letters have the same meaning, and they both do relatively similar things.

Even the tooling, and the output of `mount`, is bound to be incredibly confusing.

The article doesn't directly say it, but are they migrating the Windows source code repository to git? That seems like a big deal.

I seem to recall that Microsoft has previously used a custom Perforce "fork" for their larger code bases (Windows, Server, Office, etc.).

Source Depot. Forked years ago with tons of added features. Various Halo titles also used it and had easy to use integrations with most of the art and design pipeline tools.

Yes, Windows is migrating to Git.

Do you have a citation for that?

It was stated in Saeed Noursalehi's talk "Scaling Git at Microsoft", held at Git-Merge 2017. Until the conference recordings are available, here is the closest thing to a "source": https://twitter.com/no_more_ducks/status/827479795185364993

There is no citation but he seems to be pretty good source :)


If I understand this correctly, unlike git-annex and git lfs, this is not about extending the git format with special large files, but changing the algorithm for the current data format.

A custom filesystem is indeed the correct approach, and one that git itself should have probably supported long ago. In fact, there should really only be one "repo" per machine, name-spaced branches, and multiple mountpoints a la `git worktree`. In other words there should be a system daemon managing a single global object store.

I wonder/hope IPFS can benefit from this implementation on Windows, where FUSE isn't an option.

The blog post does mention that some changes have been made to git (in their fork)

I did a quick comparison of Microsoft's fork and it appears they have done quite a bit with it.

Microsoft's fork contains 67,522 commits. The official Git repo contains 45,810. It appears the bulk of the work started in 2010, with significant ramp up of development in 2015.


Looks like Microsoft only really introduced about 100 more new files.


Microsoft's repo contains 1712 contributors. Git's repo contains 1685 contributors. So it looks like 20-30 employees worked on Microsoft's fork.

https://gitsense.com/mgit-vs-git/mgit-contributors.png https://gitsense.com/mgit-vs-git/git-contributors.png

This is pretty big news. I know that when I was at Adobe, the only reason that Perforce was used for things like Acrobat, is because it was simply the only source control solution that could handle the size of the repo. Smaller projects were starting to use Git, but the big projects all stuck with Perforce.

I love this approach. From working at Google I appreciate the virtual filesystem; it makes a lot of things a lot easier. However, all my repos are small enough to fit on a single machine, so I wish there was a mode where it was backed by a local repository but the filesystem still allowed git to avoid tree scans.

Basically most operations in git are O(modified files), however there are a few that are O(working tree size). For example, checkout and status were mentioned by the article. However, these operations can be made O(modified files) if git doesn't have to scan the working tree for changes.
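To make the difference concrete, here is a toy comparison. It is a sketch under stated assumptions, not how git or GVFS is implemented: the index and worktree are plain dicts, deletions are ignored, and `status_by_scan` and `TrackedWorkTree` are hypothetical names.

```python
def status_by_scan(index, worktree):
    """O(working tree size): compare every file with the index."""
    return sorted(path for path, data in worktree.items() if index.get(path) != data)


class TrackedWorkTree:
    """Worktree whose writes go through a layer that records what changed."""

    def __init__(self, index):
        self.index = dict(index)
        self.files = dict(index)
        self.touched = set()  # paths written since the last commit

    def write(self, path, data):
        self.files[path] = data
        self.touched.add(path)

    def status(self):
        """O(modified files): only look at paths the write layer saw."""
        return sorted(p for p in self.touched if self.index.get(p) != self.files[p])


# 100,000 unmodified files and a single edit.
index = {f"src/file{i}.c": b"v1" for i in range(100_000)}
tree = TrackedWorkTree(index)
tree.write("src/file42.c", b"v2")
```

Both `tree.status()` and `status_by_scan(index, tree.files)` report only `src/file42.c`, but the first inspects one path while the second walks all 100,000. The catch is that every write has to pass through the tracking layer, which is what a virtual filesystem can provide.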

So pretty much I would be all over this if:

- It worked locally.

- It worked on Linux.

Maybe I'll see how it's implemented and see if I could add the features required. I'm really excited for the future of this project.

Assuming that the repo was this big in the beginning, I wonder why they ever migrated to git (I'm assuming they did, because they can tell how long it takes to checkout). At least when somebody "tries" to do the migration, wouldn't they realize that maybe git is not the right tool for them? Or did they actually migrate and then work with a "git status" that takes 10 minutes for some time until they realized they may need to change something?

Also, it would have been interesting if the article mentioned whether they tried other approaches taken by facebook (mercurial afaik) or google.

To me it sounds like these numbers are from a migration-in-progress. So they are trying, but instead of giving up and saying "not the right tool for us" they are trying to improve the tool.

Because of the productivity benefits of using public tools instead of internal ones. Devs are more familiar with them, more documentation and examples, morale benefit because skills are transferable to other jobs, etc.

> repos of unusual size

Sounds like they've almost solved the secrets of the fire swamp!

Repos of Unusual Size? I don't think they exist.

They can live there quite happily for some time.

Did they really need to make a name collision?


It won't affect their profits, so I doubt they care. Compared to Google's custom VCS, git vfs is amateur hour turned up to 11.

This sounds like a solid use case and a solid extension for that use case - but definitely not the end-all-be-all.

For one, it's not really distributed if you're only downloading when you need that specific file.

But that doesn't change the merits of this at all, I think.

My sysadmin: "we won't switch to git because it can't handle binary files and our code base is too big"

Our whole codebase is 800MB.

I hope that was a conversation from 5 years ago.

Otherwise, I hope you replaced your sysadmin.

Our codebase (latest tree) is similar, but when switching to git it's the total history size that is the problem. Our history is well over 25GB, which git doesn't handle very gracefully.

History shouldn't be a problem, you can do a shallow checkout. But you will have to store the working tree at least on your workstation.

This solves the next scaling problem of avoiding managing the whole working tree. (without requiring narrow clones which have significant downsides)

Yeah, the working tree works well to have locally, and that's what's done with svn currently.

The problem is that I also want a fast log/blame for any file back to the beginning of time - but I'm ok with that requiring devs connecting to the server containing the history (as with svn).

I also haven't found a way to make git work smoothly in shallow mode as the default, e.g. can I make checkout of a branch always remember it must be shallow? Can I make log use remote history when necessary etc? I don't want to fight the tool all the time because I'm using a nonstandard approach.

Isn't that the case that LFS solves? I've got 30ish gigs of binary blobs stored in my repos.

I appreciated the Princess Bride reference with "repos of unusual size"

I don't believe in them.

Just to make sure I have this right, this has to do with the _amount_ of files in their repo and not the _size_ of the files? So projects like git annex and LFS would not help the speed of the git repos?

That's how I read it, that this is about monorepos with file trees with large numbers of files where users don't necessarily need every single file in their local worktree to get work done.

I'd assume this GVFS would work hand in hand with Git LFS for the use case of large files.

> when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.

How on Earth can anybody work like that?

I'd have thought you may as well ditch git at that point, since nobody's going to be using it as a tool, surely?

    git commit -m 'Add today'\''s work - night all!' && git push; shutdown

You can't, and microsoft isn't. They built this so they can use git without those problems.

> How on Earth can anybody work like that?

Since it looks like they are still migrating, I don't think a lot of people actually did work like that. Maybe just a couple of times to figure out how long it would actually take. Or maybe those who really use it are actually doing shallow clones, which would probably take much less time. Actually, shallow clone is nice but doesn't seem to be known very well. I use it often if I know I won't ever need the full history anyway. Also great to shave time off CI builds.

Shallow clones are great, until they're not. I don't think I've ever (having tried a few times) cleanly cloned 'below' the graft point when I've needed to, or a different branch.

It's called Pomodoro++.

Or how about we start compartmentalizing the codebase so that you can, like, you know, organize your code and restore sanity to the known universe.

I think when the powers that be said that whole thing about geniuses and clutter, they were specifically talking about their living spaces and not their work...

Does anyone know how Microsoft's open source policy works internally? I'm thinking from a governance perspective, as I'm involved in a similar effort at $WORK.

I had a medium-sized Ruby on Rails project as a git repo inside a VM.

It was slow to do 'git status' and other common commands. Restarting the RoR app was also slow. I put the repo on a RAM disk, which made the whole experience at least a few times faster.

Since all was in vm that I rarely restarted I didn't have to recreate files on ram disk all that often. I was syncing changes with the persistent disk with rsync running periodically.

"For example, the Windows codebase has over 3.5 million files and is over 270 GB in size."

Okay, so this is a networking issue. Or is it a stick everything in the same branch issue?

Whatever the reason here the issue is pure size vs. network pipe, pure and simple. Hum, when can I get a laptop with a 10GBaseT interface?

One of the issues with the way they are doing this (only grab files when needed) is that you cannot really work offline anymore.

I'm no expert, but if most individual developers only use 5-10% of the codebase in their daily life, wouldn't it make sense to break the project into multiple codebases of about 5% each and use a build pipeline that combines them when needed?

Although I could definitely be wrong but this sounds a lot like monolith vs microservices to me.

Microsoft is moving away from Source Depot to Git, it seems. I think it's fantastic that a company like Microsoft is adopting Git for its big king and queen projects such as Office and Windows. Open sourcing the underlying magic also says a lot about the new Microsoft. They're really moving away from not-invented-here syndrome.

MS has been doing really neat stuff lately. I never worked on a project that takes hours to clone. The largest repository I regularly clone is the Linux repo. It still takes only a few minutes. Yet I can see the GVFS being beneficial for me as I spend most of the time just reading the code (so no need to compile) on my laptop.

Does this article imply that Microsoft itself is also moving towards Git? Instead of e.g. using their own product like TFS?

TFS has first-class support for Git repositories (in addition to the classic TFSVC repositories). So yes, they're moving more and more to Git. But no, they're not abandoning TFS.

Interestingly, however, most of their "open source" efforts (.NET, C#, and related) are all on GitHub rather than their own hosted offerings: CodePlex (which is basically dead) or "Visual Studio Team Services".

Not sure why Visual Studio Team Services is in scare quotes -- that's the product's name. And it's not an open source hosting service, which handily explains why Microsoft's open source isn't hosted there.

Disclosure: I'm a PM on VSTS/TFS, and I own part of version control.

Are those scare quotes or just regular old quotes? TFS and associated technologies have been through a lot of names (Visual Studio Team System, TFS, Team Services, and probably a few I can't remember).

Disclaimer: used to work on TFS team.

Ha, fair point :)

Just to say, I'm really liking modern VSTS.

Thank you!

I'm a Microsoft employee on the Git team.

Microsoft is moving to Git and we use Team Services / TFS as our Git server for all private repositories. GitHub is only used for OSS since that's where the OSS community is.

Any comments on why you picked GitHub (proprietary) instead of GitLab (FLOSS) for FLOSS projects?

I believe this pretty much sums up why:

> that's where the OSS community is

It's also not just the community: GitHub provides significantly better integration support than GitLab. Since GitHub has such a robust API, it's easier to create bots and whatnot to help manage large open source projects.

Reading the TFS release notes, they seem to be putting much more effort into improving integration with Git than into TFVC. This may be just to catch up and achieve parity with TFVC in TFS, but it was enough for me to abandon TFVC for Git in all new projects.

Team Foundation Server supports both git and TFSVC.

You can use TFS with Git. Git will act as the underlying SCM.

Could this also help a smaller repo but with long history, making the total repo size too large?

Every developer needs the whole repo, i.e. a sparse checkout is not possible, but there are many gigs of old versions of small binaries that I would prefer to keep only on the server until I need them (which is never).

And for all those who still try to stick to anything older:


"GVFS requires Windows 10 Anniversary Update or later."

Check out the GVFS back story and details here: https://news.ycombinator.com/item?id=13563439

I remember that a few years ago Git under Windows was very slow. Is that still true?

Git on Windows has gotten very fast and stable in the last few years. Microsoft employees themselves, among others of course, have directly contributed to a much better Git experience on Windows.

The reddit thread has quite a few people with opposing opinions, fwiw. Mostly "stuff that's ~instant on unix takes many seconds on Windows" and the like. It's true that Microsoft has contributed a lot (to the benefit of all), but from what I'm seeing it sounds like it's still lagging quite a bit.

I haven't touched Windows in quite a while, so I can't really make a claim either way.

I'm at least speaking from daily use in my anecdotes. Apples for apples, yes Windows is going to lag behind Linux. [1] That doesn't mean it isn't fast and stable from the perspective of day-to-day Windows usage, and definitely as I stated in the previous comment, it is much faster and more stable on Windows today compared to Windows a few years ago.

Several of the anecdotes on the reddit thread don't even seem to take into account what version the offending slowness was happening in, and anecdotally every time I've helped a Windows user experiencing slowness enough to complain about it, they've been years behind on their git version and installing the latest removed the complaints.

[1] ...and is just about guaranteed to in the many places in git where a command is still built as a tower of bash scripts calling perl scripts calling more bash scripts... If you read the changelogs, a lot of the performance optimizations that are helping every platform are the places where entire commands are getting replaced with C versions of themselves.

Ah, you're right, you were referring to on-windows progress.

And "years behind on their git version" is I think the norm for git users :) I pretty regularly have to recommend that coworkers / etc upgrade from git 1.7 (or 1.8 or something similar) to an even-remotely-modern version.

Quite nice use of C# and C++/CX for a virtual system implementation.

looks like C++/CLI (C++/CX reused its syntax and maybe parsing code, but they're still distinct)

Yes, which is why Microsoft had lots of trouble convincing developers who don't read documentation that the two are distinct and that C++/CX compiles to pure native code; those developers were spreading misinformation about it.

In any case, when C++/WinRT gets feature parity, I imagine it will eventually be deprecated, depending which one gets more developer love.

Don't believe in modular development with smaller repos?

Yeah, I see things like this, and I always wonder why they don't make a submodule tree.

It wasn't an option a couple years ago, but submodules work fine now. With a little bit of scripting to wrap common uses, they're practically pain-free.

Could you elaborate a little what has changed there? My understanding is that submodules are still considered a mess, but would be really nice if some actual improvements have happened.

https://github.com/blog/2104-working-with-submodules is a decent overview of what was available a couple years ago (things have improved a bit since then too), though it needs a tl;dr. So here's an attempt.

1) When you cd into a submodule, it's the same as if you had just cloned into there, all normal git commands work. need to update your submodule-lib? cd, do stuff, git push, at worst.

2) `git clone --recursive` instead of just `git clone`, no need to `git submodule update --init` / etc.

3) `git pull` will automatically pull submodules when the parent repo changes which commit it's using. `git push` should push any changes too, though the manpage isn't explicit (there's an identical flag/config value for push as for pull to control this). also solvable with `pushall` and `pullall` aliases, which is a very minor re-education.

4) submodules can track submodule-repo branches, not just commits. auto-updating ftw? if you want it.

5) there are some somewhat-unhappy defaults / you probably want `git diff --submodule=log` and `git config --global status.submoduleSummary true`, etc. these (and aliases) are easily fixed the same way as you probably already have for templated .gitignore / etc - just generate some company-wide defaults, and move on with your life.


A lot of the "you have to git submodule command everything all the time" is a thing of the past, the difficulty now is largely related to it being a minor conceptual difference from a monorepo. It's a repo in a repo, and you're manipulating the pointer to the version. There are more options because of this, but they exist for good reasons, and they're not too hard to wrap your head around.

https://git-scm.com/book/en/v2/Git-Tools-Submodules also has some nice examples, and e.g. `git submodule foreach` can simplify a lot if you actually dive into submodules and make changes across multiple simultaneously (big refactor maybe?).
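
The "little bit of scripting to wrap common uses" mentioned above could look roughly like the sketch below. The alias names (`cloneall`, `syncall`, `pullall`, `pushall`) and the `build`/`run` helpers are hypothetical; only the git flags themselves are real. It is one possible wrapper, not anything the commenters actually use.

```python
import subprocess

# Hypothetical alias names mapped to real git invocations.
COMMANDS = {
    # Clone a repo together with its submodules (expects a URL appended).
    "cloneall": ["git", "clone", "--recursive"],
    # Initialise and check out submodules in an existing clone.
    "syncall": ["git", "submodule", "update", "--init", "--recursive"],
    # Pull the parent repo and fetch/update its submodules.
    "pullall": ["git", "pull", "--recurse-submodules"],
    # Push submodule commits the parent refers to before pushing the parent.
    "pushall": ["git", "push", "--recurse-submodules=on-demand"],
}


def build(alias, *extra):
    """Return the argv list for an alias plus any extra arguments."""
    return COMMANDS[alias] + list(extra)


def run(alias, *extra, cwd=None):
    """Run the aliased command in cwd; raises CalledProcessError on failure."""
    return subprocess.run(build(alias, *extra), cwd=cwd, check=True)
```

Calling `run("pullall", cwd="path/to/repo")` would execute `git pull --recurse-submodules` in that directory, assuming git is installed and the path is a repository.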

Is it really that fucking hard to check if your package name is unique?

Here is another virtual filesystem with the exact same name: https://wiki.gnome.org/Projects/gvfs

Debian package for it: https://packages.debian.org/jessie/gvfs

So... what happens when one runs "git grep foo" on it?

It will be slow. Small steps. But in practice companies with large repos have other search solutions so that each user doesn't have to do a raw search on the entire working tree.

Does anybody know what Linus thinks about it?

Can we use this together with git LFS?

Couldn't they use git over IPFS?

No. The problem isn't only the storage or fetching of the files (this is the easy bit :) ), it's the operations that detect changes in the working tree. If you have a large tree scanning it becomes slow.

Using a vfs allows you to track which files have changed so that these operations no longer need to scan. Now they are O(changed files) which is generally small.
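
A toy illustration of that difference, as a minimal sketch: the `TrackedTree` class and its methods are made up for this example and have nothing to do with GVFS's real implementation. It assumes an in-memory "working tree" where every write goes through one layer that can record the touched path, and it ignores deletions.

```python
import hashlib


def _digest(content):
    return hashlib.sha1(content).hexdigest()


class TrackedTree:
    """Toy in-memory working tree that records writes as they happen."""

    def __init__(self, base_files):
        # base_files: path -> bytes at the checked-out revision.
        self._base = {path: _digest(data) for path, data in base_files.items()}
        self._files = dict(base_files)
        self._touched = set()  # paths written since checkout

    def write(self, path, content):
        self._files[path] = content
        self._touched.add(path)

    def status_by_scan(self):
        """Hash every file in the tree: O(all files)."""
        return sorted(
            path for path, data in self._files.items()
            if self._base.get(path) != _digest(data)
        )

    def status_by_tracking(self):
        """Hash only the paths that were written: O(changed files)."""
        return sorted(
            path for path in self._touched
            if self._base.get(path) != _digest(self._files[path])
        )
```

Both methods return the same list of modified paths; the tracked version just never looks at files nobody wrote to, which is the whole point when the tree has millions of entries.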

Now IPFS has a vfs, but it is just a simple read/write interface. This vfs needs slightly more logic to do things like change the base revision and track changes.

IPFS clearly does a lot more than storing and fetching files. Seriously, go have a read. A single hash can represent an arbitrarily large subtree of data (Microsoft's entire repo). Using an IPLD selector (in its simplest form, a path beyond the hash) an arbitrary sub component can be addressed. This can be used to avoid scanning entire subtrees (maintaining your O(changed files)). To commit your modifications is O(changed files + tree depth to the root of your modifications) you never need to do anything with the rest of the repo.
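
The "changed files + tree depth" cost can be sketched with a toy Merkle tree. This is not the IPFS/IPLD or git object format; the `Dir` class and `put` function are hypothetical and only show that a write rehashes the directories on its own path and leaves every sibling subtree untouched.

```python
import hashlib


class Dir:
    """Directory node whose hash covers the names and hashes of its children."""

    def __init__(self):
        self.entries = {}  # name -> Dir, or str (hash of a file's content)
        self.hash = None

    def rehash(self):
        h = hashlib.sha256()
        for name in sorted(self.entries):
            child = self.entries[name]
            child_hash = child.hash if isinstance(child, Dir) else child
            h.update(f"{name}:{child_hash}\n".encode())
        self.hash = h.hexdigest()
        return self.hash


def put(root, path, data):
    """Write a file, then rehash only the directories on its path.

    Returns the number of directory nodes that were rehashed.
    """
    parts = path.split("/")
    chain = [root]
    node = root
    for name in parts[:-1]:
        node = node.entries.setdefault(name, Dir())
        chain.append(node)
    node.entries[parts[-1]] = hashlib.sha256(data).hexdigest()
    # Deepest directory first, so each parent sees its child's new hash.
    for directory in reversed(chain):
        directory.rehash()
    return len(chain)
```

Changing `a/b/one.txt` in a tree that also contains `a/c/two.txt` rehashes `a/b`, `a`, and the root, while the hash of `a/c` stays exactly as it was.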

For tracking changes (i.e. mutable data) you can use IPNS and create a signed commit history. This will be built on IPFS eventually so it's only a matter of time.

It was explained in the talk at Git-Merge that their problem is not large files per se. The codebase is huge in the amount of source files alone. It was stated that the repo contains about 3.5 million files. Having IPFS here wouldn't help, would it?

yes, IPFS is designed to host the entire internet. You can selectively mount sub-graphs arbitrarily, which means only downloading locally exactly what you need.

Unfortunately that alone would not have allowed Git to be fast on such a huge repository. Normally (without tools such as sparse checkouts) Git reads all files on, for example, git status. IPFS would therefore also download all files locally, making it a moot addition.

You would probably still need the changes they made to Git itself. But fundamentally IPFS is also a filesystem virtualization layer (so should be able to do everything their file system virtualization is doing - if it doesn't already), and inherently has lazy checkouts.

The main added benefit is that if your friend on the LAN has also checked out the parts you need, you can get them directly from them rather than from some central repo, which could make a big difference in a company of tens of thousands of employees.

This is why I think that enhancing the protocol that GVFS uses for downloads with an IPFS backend might be an interesting solution to making everything distributed again.

Not sure why I'm being down voted. It was a serious question. IPFS solves the handling of large files (by chunking them), and works in a P2P way in which you can locally mount remote merkle trees (the core data structure of git). I believe this use case is also actually one of the original design goals of IPFS.


>the Windows codebase has over 3.5 million files and is over 270 GB in size

So instead of cleaning up this mess they decided to "fix" git? With this type of thing going on at MS it's no wonder Windows 10 is more buggy than my Linux box.

Also why would they keep the entirety of "Windows" in one git repo? The only reason I can think to do this is if very large parts of the ecosystem are so tightly coupled together, that they depend on each other for compilation. I know it's not UNIX but any basic programming course teaches you to decouple your programs (and the parts of your programs) to make them not dependent on each other. Is the developer of explorer.exe expected to clone the whole Windows repo? Do they have no idea what they're doing? If they seriously have one monolithic code base then that really explains a lot.

Sounds like it's amateur hour turned up to 11 over at Microsoft.

It's a normal practice, Google does it too: "The Google codebase includes approximately one billion files and has a history of approximately 35 million commits spanning Google's entire 18-year existence. The repository contains 86TB of data, including approximately two billion lines of code in nine million unique source files." [1]

[1] http://cacm.acm.org/magazines/2016/7/204032-why-google-store...

>The Git community strongly suggests and prefers developers have more and smaller repositories. A Git-clone operation requires copying all content to one's local machine, a procedure incompatible with a large repository.

They aren't copying all the files like you do with Git. They have a custom set up that sounds like it lets you checkout just the parts you need. I don't have time to read the whole thing, but it sounds like it works by breaking down a "super repo" into small "sub repos". This actually makes sense.

There is no way working with a 300gb git repo is fun or efficient, and they've probably been doing that for years at Microsoft.

> I don't have time to read the whole thing, but it sounds like it works by breaking down a "super repo" into small "sub repos".

They're explicitly not doing that. They have a massive, monolithic repo, and then tooling for interacting with that monolithic repo without having to grab the whole thing. They are not using Git. You just read the section titled "Alternatives".

GVFS, or something like it, is an important development for git. Facebook already implemented something similar for Mercurial[1]. At a glance, GVFS hints that git has become more extensible than it used to be.

The merits of a "monorepo" have been hashed out previously[2], it's more nuanced than "lol, M$".

[1] https://code.facebook.com/posts/218678814984400/scaling-merc...

[2] http://danluu.com/monorepo/

Hmmm.. Both Facebook and Google have giant monolithic repositories and use trunk-based development, and I'm guessing you wouldn't say they do it wrong. So I'm not surprised Windows does too.

This is the Internet. There's always someone who'll say "they're doing it wrong!"

While never offering a suggested improvement.

To be fair, every time we have a thread on HN about how Google or Facebook do version control, there are people who leap at the chance to say they're doing it wrong.

I wouldn't do it that way either, but they are not the only ones doing that. Google and Facebook have a similar approach:



They may know something we don't :)

I guess Microsoft, Google, Facebook, and Twitter are all run by amateurs, to name a few.

It works fine. We have really good code search tools.

The numbers include test code, utilities, two entire web browsers, UI frameworks, etc.

There is a reason you can only think of one reason for this - it's known as inexperience.

It seems you have no clue how very big companies work with source code management. Google e.g. is using for the most part only ONE single repository for their source code. So would you also say the same about them?


Big companies, even successful companies, aren't incapable of making, and continuing to make, stupid decisions.

I'm sure each company has a reason for using a single massive repo. I doubt I would agree with their reason, but I'm sure they have one.

The reasons seem to be fairly consistent regardless of the size of monorepo [1]: cross-cutting concerns (keeping commits/changes together that target a lot of different modules) and integration testing effort (testing a lot of different modules together).

These may not be problems that everyone feels as sharply, but they can be problems that nearly everyone might face at some point. Having these problems isn't even necessarily a sign of bad architecture: at some point all of your software likely has to play nice together on the same machine.

Certainly there are solutions beyond just monorepos, but monorepos are a very well understood, ancient solution (that seems to be making a comeback of sorts, despite many of the other solutions being easier and more powerful today than they were back when monorepos were about the only solution).

[1] I've seen tools like Lerna (https://github.com/lerna/lerna) used for managing several relatively small "monorepos" lately.

It ain't easy to transition a significant code base from centralized source control to distributed overnight, yet the benefits of distributed are desired.

> tightly coupled together


This is exactly what I was thinking when I read the article. There is no reason for any Git repo to be that big. It's not a bug in Git, but more like a reasonable limit... if your project exceeds it, you're doing it wrong.

I'm also curious as to how they used to do it without git, maybe using TFS? I wonder what the timings on that were.

Anyway, I don't think GVFS is the way to go, and I hope that it either doesn't get accepted or doesn't play a role outside of Windows. It's good to see more Git usage, but hacking away instead of fixing the problematic project seems somewhat idiotic. I can imagine other tools having problems with a single project that size, are they going to hack those as well?

Ah, yes. The mating call of the wild elite computer expert.

Microsoft: "We, only one of the most technologically advanced companies with only the 2nd or 3rd highest market cap of any public company on the planet depending on the day, had a problem with infrastructure trying to manage possibly the largest software project that anyone has ever made. And then we solved it."

You: "Stop doing what you're doing and pay attention to meeeeeeeeeeee."

I think it's more naive to believe that large companies always make the right decisions. We see Microsoft make mistake after mistake (look at windows 8, windows vista). The only time they fix their mistakes is when they're made public.

So there is no reason to fix their mistake of a code base.

It would be naive to assume large companies always make the right decisions, which is why few people do that.

However, when you see that a large company is doing something in a way that you think is silly or strange, the logical thing to ask first is "what do they understand that I don't understand?". It won't always be the case, of course, but most of the time it will turn out you were missing something.

Assuming off the bat that they are idiots and you know a better way is staggeringly naive.

>the logical thing to ask first is "what do they understand that I don't understand?".

There's even a name for this: "Going from D to C - from disparagement to curiosity". I think I first heard it from @patio11.

Doesn't sound like the way I would phrase the thought, so I checked: http://www.kalzumeus.com/2012/09/17/ramit-sethi-and-patrick-... <-- Ramit said it here, although I suspect it might be older.

It isn't the snappiest label, but it does capture the idea well.

But, it seems all large companies (Facebook, Google, Microsoft) are using a mono-repo, and no-one is publicly admitting to using many small repos.

If no company is using many small repos for truly massive projects, then it's hard to argue it would be a good idea. Could everyone who has looked at this problem make the wrong choice?

This is also mirrored by many not-so-big javascript projects, i.e. see the adoption of https://github.com/lerna/lerna

Note that Amazon is not on your list. They did not adopt the mono-repo approach. Their tools have their own advantages and disadvantages, to be sure, but mono-repo is not the only way.

I hadn't heard about Amazon. That's good to know.

Read Steve Yegge's Google vs Amazon rant: it's amazing.


> no-one is publicly admitting to using many small repos.

Amazon. They built an entire system around managing versions so that they can make it work.

The problem with this argument is that it doesn't really explain why large repos are wrong. It turns out the choice between monorepo and multi repo does not have a one-size-fits-all answer. Microsoft can make their own tooling, and internal tools don't have to run outside of Microsoft so they're easier to develop.

Google reportedly used to use Perforce for their monolithic codebase, and Facebook is supposed to use Mercurial with a bunch of modifications. They all have huge code bases mostly in one repo (I've heard Facebook had a >50GB Git repository, and Google's codebase is supposed to be in the TB range).

"640k is enough for anyone."

I don't see why there should be limits.

I recently ran some experiments, one file per experiment, one result file per experiment. At around 100,000 files, git started getting very upset. Why shouldn't I be allowed to have 100,000 files, or a million files, in a directory? Why should it be my job, as a user, to manually rearrange my data into a format my computer is happier with?

Being able to use git for things it's not designed for would be a strength, e.g. it works great as a database for certain kinds of projects.

I put all the records for https://www.findlectures.com in it, because then I can use diff tools for testing changes. Obviously this is nowhere near the size of the Windows codebase, but I could see a world where GVFS would be helpful for collaboration on this project.

