I still don't get Git (source control)

I was deliberately staying away from talking about Git monorepos. Controversial topic, personally I’m against them, but apparently Microsoft feels differently. I know they invested some very significant effort in custom tools to make it work. I’m not interested in that, but then again I’m not Microsoft.

edited: I had confused Google with Microsoft on this front. Google’s monorepo is something they built in-house.

I don’t find anything weird about using GitHub for this purpose. I considered it. It’s just another cloud service like Google Drive itself. And even my personal Perforce server is just abstracted away in my mind; it’s just something that exists and always works. I passed because I don’t want to share my stuff with the universe, and GH was (I thought) only free for open source projects, and I’m already paying for GD. However, looking again, it seems GH does support free private repositories (is that new?). So, maybe worth trying again.

It’s possible I screwed up something with the GD syncing, like not waiting long enough when switching, but it really did make a hash of the whole thing. I’m not sure I’m willing to risk getting it into a screwed up state, even if it works 99% of the time.

For a long time the free tier didn’t include private repos at all, but in 2019 they opened it up to unlimited free private repos. Private repos on the free tier were initially limited to 3 collaborators, but hey, free is free.

So, unrelated comment, but it’s good to have a GitHub account and be somewhat comfortable with it. Whenever I take a new job, part of the onboarding process is that they ask for my GitHub account name and add my account directly to their company org. Good thing to have readily available.

I’ve had an account and an open source project there since 2014. Never went much of anywhere, but it exists. Anyway, thanks for the info.

If @HMS_Irruncible and @Dr.Strangelove are going to get all reasonable, I may have to stop following this thread. I’m here for the cage fight!

I can stir up more controversy, if desired :slight_smile: . Google, Facebook, Microsoft, Uber, Airbnb, and Twitter all use monorepos. Google has their own internal source control, Facebook a heavily modified version of Mercurial, and Microsoft a modified version of Git. And the company I work for, which is not quite Google-sized but larger than the three smallest aforementioned companies combined (technically, we have a few extremely coarse-grained repos, but all of our software is under one of them).

The arguments against monorepos (that they don’t scale, that they don’t have decent access controls, and a few other things) are basically obsolete. They’ve been fixed, even for off-the-shelf solutions. Read-only servers, caching servers, etc. fix all the perf problems.

Amazon has a federated multi-repo setup with dependency tracking, but developers hate it.

More than any of the technical aspects, my opinion is that the cultural damage is the worst side effect of multi-repo. It does not take much for groups to take on a silo mentality, where people only work on their own stuff in their own little sandbox. This is absolutely deadly to a company as it grows larger. There is no piece of software that I won’t touch, in principle. In fact there are few modules that I haven’t touched at some point. Every developer should have some level of ownership across the entire codebase, even if there are hundreds of projects and thousands of developers.

Monorepos forever!

I almost missed a salient point from one of my own links. Facebook has 114 GitHub projects, you say? Guess what? They actually do all of their development from their internal custom monorepo and then just export the commits to GitHub with a tool called FBShipIt. Not sure this is the best example in favor of primary development happening with Git. Makes for a good code sharing tool, though.

So I’ve been working on a little side project that uses Git for everything. And so far, after getting used to it, it’s actually worse than I expected. I knew it had some limitations but I thought it would do a better job at the stuff it was optimized for.

It’s adequate for local source control, but I definitely miss Perforce here. So far, the biggest deficiency I’ve found is in its stash implementation.

Say you have some experimental changes and want to revert them, but keep the experiment around for future use. Git has a solution–the stash. You can take the current state and save it away to be restored for later use. You can keep as many stashed changes around as you’d like. It has a kind of useless push/pop model but I can deal with that.
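For reference, the basic stash flow is something like this (the message is just a placeholder):

```
# save the current working-tree changes under a descriptive message
git stash push -m "experimental change"
# see what's been stashed
git stash list
# re-apply the most recent stash and drop it from the list
git stash pop
# or re-apply a specific stash without dropping it
git stash apply stash@{1}
```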

Perforce has a similar feature called shelving. However, it’s significantly superior for a variety of reasons:

  • You can shelve an arbitrary subset of files. Git stashing defaults to all changed files (newer Git versions can stash specific paths, and git stash -p lets you pick changes interactively, but that gets tedious with many files)
  • You can easily add/remove files to a given shelf, split a shelf, and so on. You can also do a partial unshelve.
  • Shelved files are stored on the server, so there’s no risk of data loss if your machine goes kaput.
  • Because of this, it’s also trivial to move shelved changes between machines, or share them with others.
  • If you want to unshelve some files, but the branch has moved on since then, you can simply sync to that earlier point and do so. Git requires creating a new branch at that point (unless the changes can be applied without conflict).

I looked around, and people have come up with all kinds of ugly and IMO error-prone workarounds. But really the only clean answer is: don’t use the stash at all. Instead, create a new branch, commit the changes to that using the normal staging process, and cherry-pick the changes if you want them back. It’s still ugly, but less error-prone and fits better with the rest of the Git model.
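If I understand the suggested workaround correctly, it amounts to something like this (branch and file names are placeholders):

```
# park the experiment on its own branch instead of stashing it
git switch -c experiment-foo
git add -A
git commit -m "WIP: experimental change"
# go back to where you were
git switch main
# ...later, pull the experiment's commit back in when you want it
git cherry-pick experiment-foo
```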

On these kinds of matters, I do try to ask myself “Am I asking for the wrong thing here? Maybe the model has changed and I should be asking something else.” But stashes are a primary feature of Git and they suck compared to shelving. The answer does turn out to be to not use them, but why have the feature at all in that case?

The main version control system I use is based on Mercurial instead of Git, so take what I say with a grain of salt, but…

I’m always surprised at how many people I hear about using Git who make stashes a major part of their standard workflow. My impression was that stashes were kind of an “escape hatch” feature, like “I’m working on something and it’s not really in a committable state, but I need to work on something else right now without taking time to clean up my current work, so I stash it, and once the distracting crisis is over I unstash it and follow the normal workflow”.

My flow (again, not using Git) makes a branch for every feature, and when the feature is ready, rebases the branch (to solve any conflicts with stuff that has gone onto the trunk since I branched) and merges to the trunk. If I need to context switch, I just switch branches.

Maybe Git has some peculiarities that make that kind of flow less feasible, but from what I know of Git, I don’t see why it would.
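From what I understand of Git, that flow would translate to roughly this (branch names are placeholders):

```
# start a branch for the feature
git switch -c feature-x
# ...commit work as it progresses...
git commit -am "implement feature x"
# when the feature is ready, replay it on top of the current trunk
git fetch origin
git rebase origin/main
# then merge it back
git switch main
git merge feature-x
# context switching is just another branch switch
git switch other-feature
```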

Aside from the crisis part, this is how I work 100% of the time. I constantly have dozens of changelists going at a time, in various degrees of completion or likelihood of ever being checked in. Some of these are various tweaks that will never get checked in, but I want to keep around for future experiments. Some are actually just reminders of a sort; changes that don’t do anything and/or won’t compile, but come alongside some notes indicating a place I want to fix in the future. Some are changes where I’m waiting for some automated testing to complete before checking in. Some represent one of N possible solutions to a problem, and I haven’t yet decided which one is best (again possibly dependent on automation, or maybe a submission to human QA testing). I could go on.

In short, I have a critical need to have lots of changes going at once. I was using Perforce before they had shelving, and in that case we had scripts to bundle a changelist into a file and then restore it later. Janky, and about at the same level as Git stashes are at now (though easier to share with others). That said, I do keep a dozen Perforce “clients” around; i.e., separate copies of the tree that I can work on in parallel. That reduces the need for shelving since I can rotate between them if I want to work on something else, but shelving is still very handy.

That seems to be the standard approach with Git, too. It’s working ok with my side project because it’s really just a tiny project overall. The whole thing takes 5 seconds to compile, so switching branches isn’t very costly. Not true of my main job where I keep a bunch of clients around in parallel that keep their own sets of intermediate files. Otherwise, switching branches would entail 10-15 minutes to recompile everything from scratch.

Still, it’s weird that stashes exist when they are so limited.

So what I’m hearing is that your stashes typically contain intermediate build artifacts, so unstashing them reduces build time whereas a branch switch would not. Is that correct?

If that’s the case, I think the main difference between our setups is a difference in build systems/project layout. I’ve not found switching between branches onerous because there’s rarely a need on any branch to do more than an incremental build, fairly tightly scoped around the tests for the bits of code I’m changing.

Not quite. Build artifacts are separated by virtue of being on different clients–that is, completely separate trees on my disk. Shelved files are source only. I avoid full rebuilds by simply building from a different tree. Shelving allows me to swap experiments in and out on a given client, but I can also easily move them between trees if desired.

It would be nice if our codebase was a little better separated, so that changing a single header didn’t usually entail a full rebuild, but unfortunately that’s not the case. Much of this is unavoidable, though; even if things were perfectly abstracted I’d still be doing a lot of full rebuilds (I spend a lot of time adding or improving generic/common code).

Even incremental builds can be very slow. I do performance work and that often means optimized builds. Optimized builds use link-time code generation so as to inline functions across files, and perform other global optimizations. So touching a single cpp file and rebuilding can take many minutes.

I can’t avoid that link time, but if I have one experimental change and the associated artifacts, I don’t want to blow it away when I iterate on another change. On my test machine I pull down the binaries from whatever source. Sure, I could copy them manually but that’s a pain.

A few years later and I still hate Git.

I’ve been cloning repositories from HuggingFace lately. They’re big–sets of LLM weights. Tens to hundreds of GB.

I noticed that it was taking way more disk space than expected. Oh: Git is storing an exact duplicate of the data within the .git folder. Storing the history locally is pointless in the first place. But worse, it’s stupid enough that it duplicates all the data (in one particular case, 120 GB worth).

I guess that’s expected, since if I edit a file it needs a backup copy around to restore from if required. But the data is still available online–why do I need to store it locally? The vast majority of users will never need to edit the data, either, but we’re all paying double the local storage cost.

Here, the solution is easy–just delete everything .git related, since it’s useless. If my local files get corrupted or whatever, I’ll just clone from the server again. But it’s dumb, and as with so many other cases, the answer is “don’t use git”.

A better solution for this problem would store references to the server so that I could easily pull new versions, compare the local hashes vs. the canonical copies, and so on. But not store the data itself (aside from what’s actually in the tree). That would actually add value since it means I could do integrity checks and restore things if anything went wrong. But what git offers is not that, and in addition doubles the storage use.
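The closest thing I’ve found is a shallow clone plus skipping the automatic LFS download, though as far as I can tell the files you do pull still end up duplicated under .git/lfs (the repo URL here is made up):

```
# shallow clone: don't keep the full history in .git
# GIT_LFS_SKIP_SMUDGE=1 leaves the big files as small pointer files for now
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 https://huggingface.co/some-org/some-model
cd some-model
# download only the weight files you actually want
git lfs pull --include="*.safetensors"
```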

Maybe git is not the proper solution for these giant binaries. Well, sucks to be me or anyone else on the planet, since that’s what HuggingFace (and other weight repositories) use. Oh well.

Git is a terrible solution for data. Doesn’t stop way too many people from trying to bludgeon both git and their data into submission. They should stop.

For its intended purpose of versioning code, having a local copy of history is awesome. Before I had such a thing, I spent years trying to code on the subway and had to waste way too much time waiting until I had an internet connection so that I could commit and move on.

Strictly speaking, I agree that I can’t really criticize git on this basis, since I think we all agree that using it to distribute binaries like this is a bad idea. Nevertheless, git is not just a specific product but also an ecosystem where people actually do stuff in practice and behave in certain ways. And in practice, people do distribute binaries like this, and not just in isolated cases. It’s super common. HuggingFace just being one obvious example.

Perforce supports local/distributed/offline workflows these days, but I’ve never used that feature so I dunno if it sucks or not.

<Jamie Hyneman>Well, there’s your problem.</Jamie Hyneman>

If you solve your big binary data problem by committing it to a git repo, now you have two problems.

If I had a choice in the matter, I’d do something else. But their website doesn’t allow downloading a collection of files as a zip file or the like. I could do it one-by-one, but that’s super annoying. Cloning the repository is clearly the preferred method.

Well, I don’t dislike my hammer because it’s shit at driving screws.

I would reasonably resent being forced to hammer screws because someone else doesn’t understand “use the proper tool for the job.”

You are supposed to use “git-annex” (or similar) instead of trying to store huge binaries in a Git repository. If people on HuggingFace are not distributing their stuff that way, that is their fault, not yours.

Well, yes:
https://git-annex.branchable.com/walkthrough/
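The gist of the walkthrough, roughly (filenames are placeholders):

```
# turn a repo into an annex-enabled one
git init weights-repo && cd weights-repo
git annex init
# add a big file: git tracks a pointer, the content lives in the annex
git annex add model-weights.bin
git commit -m "add model weights"
# on another clone, fetch the actual content only when you need it
git annex get model-weights.bin
# and drop the local copy when you don't
git annex drop model-weights.bin
```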