I still don't get Git (source control)

I have read several beginners’ guides on Git. I have a GitHub page, and I have successfully uploaded source code and releases to it.
But I still just don’t understand it, and I am reluctant to use it. I’m still basically clueless how I could contribute to other people’s projects. I’m lucky I guess that I haven’t needed to use it in a professional setting.

I think one issue is that terms like “checkout” don’t seem to mean the same thing in Git as on other, server-based source control systems. As soon as the aforementioned guides start talking about “checking out” something that is only on my local machine, I’m lost, I don’t know what operation we’re talking about and why.

Git is a distributed version control system, meaning you do not necessarily need to connect to a server to check out files. There can be multiple instances of a repository even on your own workstation. When you are in a directory with a git repository, “checkout” takes a version of the files out of that local repository and puts it in your working directory so you can work with it.
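
For example, here’s a minimal local sketch (the branch and file names are just placeholders):

git checkout master          # switch your working files to the master branch
git checkout old-release     # switch them to a different branch entirely
git checkout HEAD~3 -- app.c # restore one file as it was three commits ago

All of this happens against the local repository on your disk; no server is involved.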

I would not focus on the literal command names (checkout, reset, clone, pull, etc.) and try to guess what they do from the name; a command with the same name can do something quite different in another revision control system. You need to learn what they actually do, so work through a tutorial if it is not clear, and it will not be clear unless you are already familiar with such tools.

ETA: the command names may or may not be intuitive; you are not the only one confused by them. But the names are not going to change any time soon, so you are stuck with checking out, branches, packs, and so on. It is not everyone’s hands-down favorite tool, but it gets the job done.

For your use case (contributing to other people’s projects), let’s take Kubernetes as an example. Click on the “Fork” button in the top right, and fork it to your account.

Next, in the folder where you want to store the code, run the terminal command:

git clone https://github.com/[YOUR ACCOUNT]/kubernetes.git

which copies the project to your local machine. Change into the new kubernetes folder; master should be the default branch.

git checkout -b feature

creates a new branch off master named feature, and switches you to that branch, in one step.

Now you can make whatever changes you wish. To save your progress at any point, stage the changed files with git add (e.g. git add .), then commit them with the command:

git commit -m "[COMMIT MESSAGE]" ← the quotation marks are necessary, but not the brackets.

Rinse and repeat for however long you need to finish your changes. If you want to push the code to your GitHub account, then:

  • If it is the first time you’re pushing the code, and the feature branch doesn’t exist yet in your GitHub account, run: git push -u origin feature, which creates the remote feature branch
  • For subsequent pushes, use git push origin feature

Finally, when you are ready to merge the code to the original project, go to your GitHub repo, select the “Pull requests” tab, and click on the green “New pull request” button. Since you originally forked off the master branch, choose that branch as the destination, and feature as the source.

Then you wait until the maintainers of the project agree to merge your changes, or shoot you down unceremoniously :stuck_out_tongue:

I often teach this curriculum to researchers with no experience with git. Many have no prior experience with version control at all, which it sounds like isn’t your situation. It takes a couple of hours of live instruction, and it goes over how to use git independently and how to use it to collaborate with others using the distributed version control model. Another related curriculum approaches things a little differently, and I find it can work better for people with a bit more experience.

I agree with @DPRK that you shouldn’t get too hung up on the literal command names, especially if you’re coming from something like svn. They’re just similar enough to be baffling.

As for checkout, maybe it’s helpful to think of there being a version control server someplace that you’re checking out from. It just happens to be in the .git/ directory inside your working directory and not off on SourceForge. Your working directory holds the version of the code you’re working on right now. The .git/ subdirectory inside your working directory holds the database of every version of the code.
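
A quick way to see this for yourself (the project name here is made up):

cd my-project
ls .git            # objects/, refs/, HEAD, config... the whole local "server"
git log --oneline  # the history, read straight out of that directory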

Even though git is a distributed version control system and is designed for every developer to have their own complete history of the repository and to push and pull between peers, it really is advantageous to have a central repository that everybody has access to, like at GitHub. Used that way, it’s not so different from server-based version control.

I remember being completely flummoxed by incantations like git pull origin master and could never remember what those tokens meant or what order they went in.

  • origin is the name of a remote repository that your local repository can exchange information with (push and pull). There is no magic to the name; “origin” is just the default that git uses when you clone from another repository.
  • master is the name of a branch in the remote repository. Again, there’s no magic to this name [*]; “master” is just the default name. In simple projects, you may never have any other branches. On the other hand, git branches are easy to create and usually merge cleanly, so many people use them a lot, unlike svn, where in my experience they’re avoided like the plague.

[*] There is so little special about this name that GitHub has recently changed to using main as the default name instead of master.
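
You can see for yourself that these are just names (the output will vary by repo, of course):

git remote -v                       # lists "origin" and the URL it points to
git branch -a                       # local branches, plus remote ones like origin/master
git remote rename origin upstream   # perfectly legal; "origin" is not special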

Finally:

I’ll try to summarize as briefly as I can, but you are correct, you should forget your current understanding of what “check out” means in other systems. In SVN it means retrieving a copy (this is “clone” in Git). In older systems it means locking the central copy of the file to prevent concurrent editing (Git does not do this AFAIK).

In Git, “checking out” mostly means switching your working copy to a branch (and, with -b, creating that branch as well). Git provides some powerful and fluent features for managing branching, and you have complete autonomy to do it locally. You should be using branches frequently in Git.

Git is also distributed, meaning that you have a small .git folder that contains the entire history of the repository. When you clone the repo, you are essentially a peer. You can do branching and tagging operations offline, and nobody will ever see them until you push. In fact, nobody ever needs to see what crimes you committed on your local branch while flailing around to make things work.
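
A small sketch of that last point, with made-up branch and tag names:

git checkout -b wild-experiment     # a new local branch, created offline
git commit -am "try something dubious"
git tag before-the-rewrite          # a local tag, also offline
git checkout master                 # wander back to the mainline
# none of the above is visible to anyone else until you explicitly push it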

Having set up those basic concepts, here’s a basic workflow:

  1. Retrieve the repository to my local machine: git clone repo tmpfolder
  2. Change directory to tmpfolder
  3. Create a new local branch so as not to pollute the mainline: git checkout -b tmpbranch
  4. Make some edits and run my tests
  5. Stage those edits to be committed using git add
  6. git commit
  7. Now my changes are part of the history of tmpbranch
  8. switch back to the mainline: git checkout master
  9. merge my changes to master: git merge tmpbranch
  10. run my tests again
  11. resolve any conflicts, add any unadded files, commit to master
  12. push my master changes to the shared master branch: git push

If this was to be a team branch instead of a personal branch, I would skip the last 5 steps and instead just push my branch to the shared repo.
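
As a single terminal session, that might look roughly like this (the repo URL and names are placeholders):

git clone https://example.com/project.git tmpfolder
cd tmpfolder
git checkout -b tmpbranch       # work on a private local branch
# ...edit files, run tests...
git add changed_file.c
git commit -m "describe the change"
git checkout master             # back to the mainline
git merge tmpbranch             # bring the change over
# ...run tests again, resolve any conflicts...
git push                        # publish master to the shared repo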

It’s awesome how Git is so decentralized and autonomous. You don’t need any kind of server to use it. It has completely replaced my previous local versioning system of file.txt, file.txt.1, file.txt.2, file.txt.2.good, file.txt.t.good.real and so forth.

Don’t feel bad. Git is a great tool with an astoundingly bad user interface. Lots of developers struggle with it for years. You might appreciate this website:

https://ohshitgit.com/

Thanks everyone for attempting to clone your understanding to my ageing brain.
I’m still digesting some of the details.

My first task is updating one of my projects on GitHub. I cheated with this one, in that the release exe is up to date, but the source code isn’t :grimacing:. And the machine with the up-to-date source code doesn’t have a git repository yet. So I’ll need to clone the master repo, make a local branch, manually copy the new files to that folder, then add, commit etc.
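
If I’ve understood the advice above correctly, that should look something like this (the names are placeholders for my actual project):

git clone https://github.com/[MY ACCOUNT]/myproject.git
cd myproject
git checkout -b update-source
# copy the newer source files over the old ones, then:
git add .
git commit -m "Bring the source up to date with the released exe"
git push -u origin update-source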

I’ll let you know how it goes.

Resurrecting this old thread because I found a great resource for learning Git and other software development concepts and platforms.

Syncfusion is a developer of software components and frameworks. They also publish a Succinctly series of ebooks that break down topics in under 100 pages. Completely free, no catch.
I had a number of penny-drop moments reading their book on Git, and the book on GitHub looks great too. Mucho recommendacion.

This isn’t spam; I do not work for Syncfusion :slight_smile:

I think this is something that those of us used to “industrial strength” versioning systems like Perforce should remember. Git isn’t really a replacement for those (though with enough extra support, it can act that way). Instead, it’s a replacement for the file.txt.2 approach, which I think every developer has used from time to time. And it’s absolutely superior to that.

But the losses compared to centralized systems are immense. Especially for very large repositories. Which isn’t to say git can’t handle large projects; obviously it works just fine for Linux. But Linux always had a distributed development model, which historically wasn’t much more than people emailing code patches to Linus or some other maintainer. Git was certainly an improvement over that. But compared to systems designed for lots of people working under one roof (so to speak), the distributed model isn’t so great.

Can you elaborate on this? I’ve worked with many SCM systems through the years (not Perforce but some real heavyweight battleship-class centralized systems). I haven’t found much of a feature gap at all with Git. Those I found are easily filled by using Git in tandem with GitHub (as most people do), and by adopting better development and architecture practices.

That last bit is really key, IMO. If you tell me you really need a big central system like Perforce then I can probably find a lot of sub-optimal practices in your organization, and it would be more long-term efficient to solve those.

IMO the only real drawbacks to Git are:

  1. Git requires a complete copy of the repository. Solution: manage your repository size better. Use more repos, use dependency managers, don’t store large binaries. Few people have a legit need for a 1GB repo. (A couple of partial workarounds built into git are sketched just after this list.)
  2. Many people are resistant to command-line. Solution: CL isn’t hard. I’ve found that whiners respond readily to threats and punishment.
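
For completeness, and not something claimed above: git itself has a couple of partial mitigations for point 1, which may or may not help depending on the workflow:

git count-objects -vH            # how big is this repository, really?
git clone --depth 1 <url>        # shallow clone: latest snapshot only, no full history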

git actually has a built-in gui: just type git gui
Nevertheless, I would not expect hackers to be the ones “resistant to command line”!

I do not know what are the most used version control systems today, but it would not surprise me if Git were way up there.

Interesting. It’s not for me, GUI just slows me down, but I’m sure some people benefit from it.

There’s a huge amount of diversity among developers. Many spend so much time in graphical IDEs and drag-and-drop tools that they’re not fluent at all in CLI. In fact I’ve met many prima-donna developers who think something’s wrong with the process if they have to use a CLI (eyeroll to that, but sometimes there’s a grain of truth. Sometimes.)

Git doesn’t require a server or account or fee or license or permission or anything. The amount of free online documentation and helpful developer community dwarfs anything else out there. With that being the case, if it’s not #1, I can’t imagine what is.

I’ve found that people who don’t like command line interfaces don’t deserve to have them. GUIs give some protection against idiots really messing things up, by simply not giving them the option to do so. I generally know what I’m doing with a CLI, but I have also royally screwed up my local repos.

For those resistant to command lines, TortoiseGit works great on Windows machines. It integrates into File Explorer and shows little check marks on each file and folder indicating their current status with respect to the repo. Great for little projects at home.

The best thing about Git for small projects is that it’s extremely light weight. The repo can simply be a folder on your local file system. You don’t need to run a server to have version control! The only time Git is active is when you run a command.
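
For example, putting a plain local folder under version control, with no server anywhere (file names are invented):

cd ~/notes
git init                    # creates the .git folder; that's the whole "server"
git add todo.txt
git commit -m "first version"
# later...
git diff                    # what changed since the last commit?
git log -- todo.txt         # the full history of that one file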

Git doesn’t require decentralization; it supports centralization just fine. My work uses BitBucket instead of GitHub, but I don’t think there’s much difference. It’s great to have a central location to manage branches and pull requests for large projects.

We’ve switched over to using “trunk” instead of “master”. It matches the theme of “branch”.

I dunno, this smacks of the typical “tabs vs spaces” developer snobbishness.

I’ve used plenty of GUIs and CLIs. In fact, I’m old enough to remember a time when the former wasn’t an option.

But when it comes to development work, yeah I want it to be as visual as possible; a nice cosy IDE, with syntax coloring and intelli-text, and ideally source control that looks much like an explorer window.
Also it feels safer…as much as I now “get” Git, the idea that changing the case of one character can change whether a command is going to throw away data or not is scary to me.

For people who prefer to do everything in the command line, even editing text with vim (shudder), that’s great, I actually understand the appeal. But don’t imply I’m inferior for not wanting to work that way myself.

First, I want to emphasize that this is just my opinion. There are plenty of people (and groups) in the company that use Git, and seem to like it (some are very evangelical about it). They have different needs and priorities.

Another thing is that some of Git’s limitations are fixed/worked around with third-party tools. GitHub is great. But it’s not Git, it’s Git plus something else. Or if you need large binaries, there’s Git LFS. Again: not Git. And you still need an external server like GitHub, too.

Nothing wrong with that per se, but after a while it seems one might be better off with an all-in-one system. Git is no longer really distributed if you end up dependent on a central server for daily work, so you’ve lost its entire reason for existence. And all these systems become harder for IT to manage, especially if every group in the company uses a different subset of them.

I would say the primary thing that Git enthusiasts seem to love is the easy way to enable/disable various sets of changes. For example:

  • You’re working on a few independent changes at once and want to switch between them
  • You have some debug code that you don’t want to check in, and need to occasionally enable/disable
  • Someone else gave you a change to try out, which you want to temporarily integrate into your tree

All fine things (and all doable with more traditional systems, though maybe with a tad more friction). But there’s a little problem: with big projects, you end up spending a ton of time just recompiling things. Change a common header, recompile the whole codebase. Intolerable. At least for C/C++ projects, recompilation is probably the single biggest contributor to wasted time.

So what’s the solution? Multiple local repos (“clients” in Perforce). Totally independent; they don’t step on each other at all. Because each repo only keeps the latest version of the file, and things like tools can be shared, there’s not really much hardware cost. And there’s zero developer cost, because switching changesets is literally as easy as bringing up the IDE window for the change you want to work in. No recompilation, no syncing, no reloading, no anything.

You can do this in Git too, of course. But then you’ve lost the advantages of Git, and paid a higher cost. So why not just use a centralized system?
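
For reference, one way to do the same in Git is multiple clones, or git worktree, which adds extra working directories that share one history (names here are invented, and the branch is assumed to exist):

git worktree add ../bugfix-tree bugfix-branch
# ../bugfix-tree is now a second working directory checked out on bugfix-branch,
# backed by the same repository, so the history is not duplicated
git worktree list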

Few people have a legit need for a 1GB repo.

LOL. I mean, maybe it’s true, but for me it’s a laughable statement. I looked at the source code size for just the piece of the codebase I work on, and it comes to 12 GB. A fair amount of that is generated files and third party files, but even without that it’s multiple gigabytes. And that is not counting tools (compiler binaries, etc.). It’s also just counting the latest version of the files, not the entire stored history. I don’t have a good means of estimating the real size of the history, but many files have several thousand changes.

If I include the code that I regularly need to look at, including code I need to touch while changing common code, it goes up to 45 GB. It’s >12 GB even if I count just .cpp and .h files. And that still doesn’t include everything.

Use more repos

Why? More repos are bad practice. More stuff to keep track of, more things to go wrong. A single unified view of the universe is good.

use dependency managers

Not acceptable. Yet another point of failure, particularly if the source repo is external. Hell, I’d consider that unacceptable for security reasons alone. We need very particular versions of third-party code that are static for long periods until a new version is needed, at which point it goes through a long qualification process. That code should be checked in alongside everything else.

don’t store large binaries

Absurd. Binaries are in need of just as much version control love as anything else. Perforce works great with large binaries. We have a large set of test binaries that are used with automated testing. They regularly get updated, because the underlying source data changed or the test generator changed. And sometimes that process itself causes bugs, so going back in time to figure out when the test binary broke is crucial.

I just checked and the largest of these test binaries is 11 GB. All are >1 GB, and there are hundreds total. They get revved on a regular basis, generally at least annually but often more like monthly.

And although a full history is crucial, it would be utterly stupid to require everyone to store the complete history of each one. As I mentioned, there’s Git LFS, which I haven’t used but AFAIK only stores the history on the server. It’s a fine example of how Git is insufficient and requires third-party support to be useful in many cases.

I can probably find a lot of sub-optimal practices in your organization

This is where I start to rant. In short: fuck that noise. That’s a lot of Steve Jobs “you’re holding it wrong” bullshit.

You mentioned a couple examples already of things that you apparently consider bad practice, but aren’t bad practice at all. They’re just bad practice for Git, because Git has some limitations. For many devs these limitations aren’t relevant, and for some the workarounds aren’t too high a cost to overcome, but for others they are. Again, going back to large binary support: the very fact that there are tools to make it kinda work illustrates that it’s a completely valid use case.

Microsoft uses Git internally. Except they don’t, they use VFS for Git (aka GVFS aka Scalar). To be honest I haven’t tried it, so I can’t really give my impression of it, besides pointing out that plain Git was so totally unsuitable that they had to virtualize the entire filesystem to make it work and have a whole team of people supporting it.

At any rate, this is a bit of a scattershot response; I don’t have the time or inclination to write a whole essay about it. But in summary: Git needs extra support to make it suitable for large and diverse codebases, which both makes it not-really-Git and eliminates much of its reason for existence. What’s left is a mildly confusing command line interface with the advantage that a lot of newbie devs are nevertheless familiar with it, which to be fair isn’t a small thing.

If you read the next three sentences after what you quoted from me, you’ll see that you’re agreeing with me. :smiley:

So, let us reframe this statement. Builds need traceability. For the integrity of the system, you need to be able to identify every change to every input.

One way to do this is, yes, to check all your binaries into SCM. That’s a quick-and-dirty sure shot solution, true. And here are the problems with that:

  1. It doesn’t scale. As you’ve seen, your repository mushrooms in size. It takes forever to download. It takes a ton of storage space. You need a high-powered system to manage it, and a specialist to manage that system. Eventually that specialist will tell you that you need a specialist like me to come in and re-engineer it into smaller chunks. Or, the project will become so difficult to release that management will just decide to rewrite it.
  2. Binary management is not what SCM systems are for. SCM is for recording changes. If you can’t diff a file and see what line of code changed, then it’s not a candidate for SCM! Of course all of your build dependencies need to be traceable, but SCM isn’t the tool for that. It’s turning a screw with a knife!

Traceability does not automatically imply that all of your build inputs need to be committed to source control. It’s sufficient to store a manifest with an incorruptible reference to the compiled asset (say, the checksum of a binary in some static file store). Put that in a manifest file, and then commit that into source control. There are tools that do that very well because that’s their sole responsibility.
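
A rough sketch of that idea (the file names and artifact URL are made up):

sha256sum widget-tests-2024.bin >> test-assets.manifest                    # pin the exact binary by checksum
echo "url: https://artifacts.example.com/widget-tests-2024.bin" >> test-assets.manifest
git add test-assets.manifest
git commit -m "Pin widget test binary by checksum"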

So, as I expected, you need Perforce because your codebase has some bad practices that require Perforce. Specifically, someone needs to get in there with a solid dependency management strategy. Pull out the binaries. Pull out the generated code. Manage those in a more suitable toolchain (I can help). See if your codebase can be decomposed into smaller units, to isolate things that change frequently from those that don’t. If your codebase is as big as you say, it’s long past time to do this.

Note… my comments above aren’t based on Git. I was troubleshooting build toolchains for years before Git ever came on the scene. But I do love Git, because it shows me how good an SCM tool can be when it’s not bloated with functions that ultimately aren’t the responsibility of SCM.

All your arguments are based around “this sucks in Git, therefore it sucks in general”.

So what if the repo is petabytes in size? If we want to maintain a history, which we do, we have to store it somewhere. We can have yet another toolchain that handles this somehow. Or we can check it into Perforce, which is actually faster than filesystem transfers; everyone has the toolset already, and everything works the way code works, from inspecting the history on down. If I need to tell someone where some data lives, I just say to sync a certain folder and they’re done. And since they already know the toolchain, they don’t have to ask me how to get the previous version of a file or anything like that.

It doesn’t scale.

It scales just fine. We have thousands of devs using the same repository. It’s extremely fast and devs only have to store the latest version of each file, not the entire history. When I sync new binaries, I’m limited by my gigabit connection to the server. Some of our systems have 10 gigabit connections and get a nice boost.

Replicating the history is the unscalable thing. The number of changes per day goes up with the number of devs, and if every dev also has to replicate every other dev’s changes, the total storage across the organization goes up with O(N^2), not O(N): with 1,000 devs each committing 10 changes a day, every one of those 1,000 clones absorbs 10,000 new changes daily.

It’s not like Git avoids the need for fast IT-managed servers, anyway. Those are always going to be necessary when you have a bunch of devs working out of the same internal codebase.

Binary management is not what SCM systems are for.

Yawn. Argument by definition. Large binaries work just fine in proper version control systems. Just not in baseline Git. Of course, this fact is well recognized which is why you can extend Git to handle large binaries.

BTW, I do binary diffs all the time. And image diffs. Source isn’t the only thing you can diff.

The thing is, we do actually have some binaries which are a little too big for Perforce, say >100 GB. Well, they aren’t too big, but they aren’t worth spending the centralized storage on compared to cheaper distributed storage systems. So we have separate systems for storing these moderately large binaries. But it sucks, because the toolset isn’t as robust, and the things we build for one system don’t work for the other. Having too many systems doing the same thing generally sucks, because you don’t get amplification in tool development.

I do wonder if some of your perceptions are based on things that were fixed years ago. Perforce had some serious problems here in the past, such as a global lock on some file ops. So it really didn’t scale to thousands of users unless you manually split the repos (which we did, though only at a very coarse granularity). But that time is long gone. It’s probably been a decade since the last “oh, Perforce is slow today” incident.

This is what Microsoft fixed with their VFS for Git program, incidentally. Git works as if it had the full history, but it’s virtualized away and is actually stored on the centralized server, fetched on demand. Seems to work well enough for them. But it’s not Git, it’s a centralized system with a Git front-end.