Git Internals
Introduction
There are many version control systems, but git is by far the most widely used, thanks in part to online social platforms such as GitHub and GitLab.
Yet, it is a tool that is still vastly misunderstood and feared. In this post I aim to take a look at some of the internal moving parts of git, primarily what’s inside the .git directory (including the various subdirectories and files).
My hope is that by better understanding how git works, and the concepts it is built upon, readers will feel more empowered and confident when working with git (especially when they have issues and would normally be unsure of what to do).
Note: this article isn’t an introduction to git, and does presume that the reader is familiar with (i.e. a user of) git.
General Concept
I wanted to take a quick moment just to clarify the terminology associated with the general concepts of how git works (so we’re all on the same page):
- Working Directory: your project files.
- Staging Area: a file that tracks the changes to your project files.
- Repository: the location where your project files are stored.
Note: these bullet points are just summaries, but I would like to expand on them slightly: your ‘working directory’ can change depending on what ‘version’ of the project you have ‘checked out’ from the git repository (i.e. this is what happens when you change your ‘branch’ with git checkout <branch_name>).
So for example, commands like git add will copy objects from the working directory into the staging area (aka the ‘index’), while git reset will remove objects from the staging area.
A command such as git diff compares your working directory to your staging area, while using the --staged flag will change this behaviour such that git will compare your staging area to the latest commit in your repository (i.e. HEAD).
Subcommands: Porcelain and Plumbing
The git version control system wasn’t initially designed to be a user-friendly interface, and so alongside the more commonly used subcommands are commands that can carry out very low-level operations.
This has resulted in much confusion around what commands are intended for use by general users and which commands exist for the purpose of internal use.
Note: although used internally, the low-level subcommands are also typically used by systems that require such granular operational control.
The git subcommands are generally split into one of two groups:
- Porcelain: the user-friendly interface (e.g. git checkout, git pull etc.)
- Plumbing: low-level interface (e.g. git cat-file, git rev-parse etc.)
Git Subcommands
Below is a list of the git subcommands (as of git version 2.22.0). Knowing which are meant to be ‘porcelain’ and which are meant to be ‘plumbing’ can be difficult.
$ man git-<tab>
git-add git-commit-tree git-fsck
git-am git-config git-fsck-objects
git-annotate git-count-objects git-gc
git-apply git-credential git-get-tar-commit-id
git-archimport git-credential-cache git-grep
git-archive git-credential-cache--daemon git-gui
git-bisect git-credential-store git-hash-object
git-blame git-cvsexportcommit git-help
git-branch git-cvsimport git-http-backend
git-bundle git-cvsserver git-http-fetch
git-cat-file git-daemon git-http-push
git-check-attr git-describe git-imap-send
git-check-ignore git-diff git-index-pack
git-check-mailmap git-diff-files git-init
git-check-ref-format git-diff-index git-init-db
git-checkout git-diff-tree git-instaweb
git-checkout-index git-difftool git-interpret-trailers
git-cherry git-fast-export git-log
git-cherry-pick git-fast-import git-ls-files
git-citool git-fetch git-ls-remote
git-clean git-fetch-pack git-ls-tree
git-clone git-filter-branch git-mailinfo
git-column git-fmt-merge-msg git-mailsplit
git-commit git-for-each-ref git-merge
git-commit-graph git-format-patch git-merge-base
git-merge-file git-rebase git-show-index
git-merge-index git-receive-pack git-show-ref
git-merge-one-file git-reflog git-stage
git-merge-tree git-remote git-stash
git-mergetool git-remote-ext git-status
git-mergetool--lib git-remote-fd git-stripspace
git-mktag git-remote-testgit git-submodule
git-mktree git-repack git-svn
git-multi-pack-index git-replace git-symbolic-ref
git-mv git-request-pull git-tag
git-name-rev git-rerere git-unpack-file
git-notes git-reset git-unpack-objects
git-p4 git-rev-list git-update-index
git-pack-objects git-rev-parse git-update-ref
git-pack-redundant git-revert git-update-server-info
git-pack-refs git-rm git-upload-archive
git-parse-remote git-send-email git-upload-pack
git-patch-id git-send-pack git-var
git-prune git-sh-i18n git-verify-commit
git-prune-packed git-sh-i18n--envsubst git-verify-pack
git-pull git-sh-setup git-verify-tag
git-push git-shell git-web--browse
git-quiltimport git-shortlog git-whatchanged
git-range-diff git-show git-worktree
git-read-tree git-show-branch git-write-tree
But there is a way to find out! Currently the man git page describes which commands are intended as porcelain and which are plumbing. Simply search for GIT COMMANDS and you’ll find the two groupings.
My own generalized way of making a distinction is to consider the day-to-day subcommands I use (e.g. git add, git diff) as being porcelain, and the more esoteric subcommands (e.g. git fsck, git multi-pack-index) as being more plumbing orientated.
In practice it doesn’t really matter which subcommands are porcelain and which are plumbing. If there’s a subcommand you feel you need to use, then go ahead and use it. My personal perspective on this is: if you’re unsure of what a subcommand does, you’re unlikely to reach for it in the first place.
Most users do not diverge from the well trodden path of: git add, git commit, git pull, git push, git diff (with an occasional git rebase).
What’s interesting about the plumbing subcommands is that some of them expose the same low-level operations git performs internally when you’re calling the porcelain subcommands (e.g. the work done by git read-tree, git update-index and git update-ref is, conceptually, what porcelain commands such as git add or git commit run under the hood).
Note: although we’ll be looking at a couple of plumbing commands in this article, I’ll refer you to the git book for a look at the different plumbing commands available and how they’re used.
The .git directory
When you start a new project that you want to use version control for, you’ll typically run the git init subcommand:
git init [dir]
Most people will know that there is now a .git directory created in the root of your project directory, but that’s about where their understanding of things stops.
Let’s see what’s initially inside the .git directory of a new project…
$ tree .git/
.git/
├── HEAD
├── config
├── description
├── hooks
│ ├── applypatch-msg.sample
│ ├── commit-msg.sample
│ ├── fsmonitor-watchman.sample
│ ├── post-update.sample
│ ├── pre-applypatch.sample
│ ├── pre-commit.sample
│ ├── pre-push.sample
│ ├── pre-rebase.sample
│ ├── pre-receive.sample
│ ├── prepare-commit-msg.sample
│ └── update.sample
├── info
│ └── exclude
├── objects
│ ├── info
│ └── pack
└── refs
├── heads
└── tags
8 directories, 15 files
OK, so there are some important directories and files here that we need to learn a bit about in order to appreciate how git works.
Note: I’m not going to explain every file and directory, only those necessary to understand the fundamentals.
Here are some interesting ones:
- HEAD: contains a pointer to the tip of the current branch.
- config: contains project-specific configuration options.
- info: contains a repository-wide exclude file †
- objects: contains four types of ‘objects’ (commit, tree, blob, tag).
- refs: contains pointers to ‘commit’ objects.
† this is separate from the project’s own .gitignore, and it is not committed or shared with other users.
References and Objects
The two most important concepts in git are: references and objects.
For example, your branches, tags and remotes are all references to commits, while your commits, your files and your directories are all objects.
References
Git is built upon the simple premise of using ‘pointers’ to data, and these pointers are typically referred to as ‘references’ (or ‘refs’ for short).
This is what the .git/refs directory stores: references.
As I mentioned earlier, these references all point to a ‘commit’ object…
remote   branch   tag
  |        |       |
  |        |       |
  |        V       |
  ------> commit <--
            |
            |
            V
           tree
            |
            |
            V
           blob
Note: you can see from the above ascii graph that the ‘commit’ object itself points to a ‘tree’ object, and that tree object points to a ‘blob’ object. We’ll dig into these ‘object’ types in more detail in the “Object Types” section.
It’s worth clarifying now that although we conceptually talk in terms of ‘branches’ in git, the internal directory structure (where references to branches are stored) uses the term ‘heads’ instead. It’s a terrible name (like most things in git’s lexicon), but it’s best to just accept it and move on.
The reason git uses ‘references’ is that they let users refer to a specific commit without having to remember the full SHA-1 hash.
Imagine wanting to checkout your master branch but instead of just executing git checkout master you had to remember the specific hash.
git checkout b5d34b608ce697f0d20d011ee569529bca3feee8
Not very practical heh.
The HEAD reference
If you recall from earlier, we said the HEAD file contains a pointer to the tip of the current branch.
If we were to look at the .git/HEAD file we would find that by default it has the following content:
ref: refs/heads/master
You can see it’s a pointer to another location (the reference .git/refs/heads/master), which means it’s a pointer to a pointer!
Remember that refs/heads/master is a reference file (which refers to our master branch), and the content of that file is a commit hash. So this is telling us that ultimately HEAD is pointing to our master branch.
But at this point in time I’ve only executed git init, and so I’ve not actually committed anything into git. This means that there isn’t actually a master file inside of the .git/refs/heads subdirectory.
If we look back at the earlier directory tree (which we printed after running git init), we’ll notice that although there is a .git/refs/heads directory, there is no master file. A file called master won’t exist in that subdirectory until I make my first commit.
Note: if you recall from earlier I said that the refs/heads subdirectory was essentially a synonym for ‘branches’ created locally for this project. Hence, the default file referenced by the HEAD file is master (because it’s referencing the master branch).
Let’s now create a commit so that we can see a refs/heads/master file and what it points to…
$ echo foo > foo.txt
$ git add foo.txt
$ git commit -m "foo"
[master (root-commit) b5d34b6] foo
1 file changed, 1 insertion(+)
create mode 100644 foo.txt
Once we do this we’ll find git has created a master file inside of .git/refs/heads and the contents of that file is the hash of my first commit (which indicates that the master reference file, or ‘branch’, is pointing at a specific commit snapshot):
b5d34b608ce697f0d20d011ee569529bca3feee8
When you execute a command (such as) git checkout master, internally git will resolve master into refs/heads/master and that is what tells git which commit object to now point to.
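To make the ‘pointer to a pointer’ idea concrete, here is a minimal Python sketch (not how git itself is implemented) that follows HEAD until it reaches a commit hash. The function name resolve_ref and the throwaway .git directory are my own assumptions for illustration, and the sketch only reads loose reference files (real repositories can also keep references in a packed-refs file, which is ignored here).

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_ref(git_dir: Path, name: str = "HEAD", max_depth: int = 5) -> Optional[str]:
    """Follow 'ref: ...' pointers until a plain hash (or nothing) is found."""
    for _ in range(max_depth):
        ref_file = git_dir / name
        if not ref_file.is_file():
            # e.g. refs/heads/master does not exist before the first commit
            return None
        content = ref_file.read_text().strip()
        if content.startswith("ref: "):
            # A symbolic reference: a pointer to another reference file
            name = content[len("ref: "):]
            continue
        return content  # a plain commit hash
    return None


# Hypothetical stand-in for a freshly initialised repository
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    print(resolve_ref(git_dir))  # None, because no commit exists yet

    # Simulate what the first commit does to the branch file
    (git_dir / "refs" / "heads" / "master").write_text(
        "b5d34b608ce697f0d20d011ee569529bca3feee8\n"
    )
    print(resolve_ref(git_dir))
```

Pointing the function at a real repository (e.g. resolve_ref(Path(".git"))) should behave the same way, as long as the branch has not been moved into packed-refs.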
Subcommands and References
Although a reference is a pointer to a commit hash, it doesn’t mean every form of a reference can be used interchangeably within every git subcommand.
Here is an example subcommand that works fine with a reference: git log. We can use git log origin/master, and git will know to internally resolve that reference to the fully qualified path .git/refs/remotes/origin/master.
Knowing that, we would also know that it is possible to use a partial reference path such as git log refs/remotes/origin/master or maybe git log remotes/origin/master.
All these variations work fine, but we typically use git log origin/master for convenience (because it’s less typing).
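The reason all of these variations resolve to the same file is that git tries a short name against a fixed list of locations, in order (the list is documented in man gitrevisions). Here is a rough Python sketch of that lookup. The names REF_SEARCH_ORDER and expand_ref are hypothetical, and the set of ‘existing’ references is made up for the example rather than read from a real repository.

```python
from typing import Iterable, List, Optional

# Lookup order for a short reference name, as listed in `man gitrevisions`
REF_SEARCH_ORDER = [
    "{}",
    "refs/{}",
    "refs/tags/{}",
    "refs/heads/{}",
    "refs/remotes/{}",
    "refs/remotes/{}/HEAD",
]


def candidate_paths(name: str) -> List[str]:
    """Every location (relative to .git) a short name could refer to."""
    return [pattern.format(name) for pattern in REF_SEARCH_ORDER]


def expand_ref(name: str, existing: Iterable[str]) -> Optional[str]:
    """Return the first candidate that actually exists, if any."""
    known = set(existing)
    for path in candidate_paths(name):
        if path in known:
            return path
    return None


# Hypothetical set of references present in a repository
refs = {"HEAD", "refs/heads/master", "refs/remotes/origin/master"}

for short_name in ("origin/master", "remotes/origin/master", "refs/remotes/origin/master"):
    print(short_name, "->", expand_ref(short_name, refs))
```

All three names in the loop end up at refs/remotes/origin/master, just via different entries in the search list.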
But these reference variations aren’t interchangeable with commands like git checkout and git pull, for different reasons. With git pull, if we look at man git-pull we see we need to provide a <repository> <refspec>, and that means the refspec we provide is resolved against the branches of the remote (the ones tracked locally under .git/refs/remotes/).
If I look at .git/refs/remotes/ I’ll see only a single directory origin, and inside of that are all the remote-tracking branches for the origin remote. So if I attempted to do something like git pull origin HEAD this wouldn’t do what I might expect, because there’s a HEAD file inside of that origin directory too (and it points to a different commit from our local HEAD in .git/HEAD)!
This means we’d end up trying to pull the changes from the remote master!! Which happens because HEAD on the remote is set up to point at the master branch…
$ git remote show origin
* remote origin
Fetch URL: git@github.com:example/repo.git
Push URL: git@github.com:example/repo.git
HEAD branch: master
Remote branches:
...
So subsequently doing git pull origin HEAD would bring in lots of unexpected changes to your local branch 😬
Note: using HEAD isn’t a problem when doing something like git push origin HEAD because it’s a fundamentally different operation and so git knows to reference the local HEAD file to get the commit range before pushing to the remote.
Similarly, the reference variations aren’t interchangeable with a command like git checkout, as its internal logic will cause a detached HEAD state when given a qualified path (e.g. if you were to do something like git checkout refs/heads/master instead of git checkout master).
Let’s now understand what a ‘detached HEAD’ means, and why git checkout would cause that when given a qualified reference path…
Detached HEAD
Internally git does recognize the reference and can resolve it to the appropriate .git/refs directory, but the behaviour of the checkout command changes when checking out a reference that is a qualified path such as refs/heads/master. What you would discover is you don’t checkout the branch but are placed into a ‘detached HEAD’ at the relevant commit.
Why is that? Well, if we look at the documentation for the checkout subcommand (man git-checkout) we would discover…
if it (the given branch name) refers to a branch (i.e., a name that, when prepended with “refs/heads/”, is a valid ref), then that branch is checked out. Otherwise, if it refers to a valid commit, your HEAD becomes “detached” and you are no longer on any branch.
Running git checkout master means you’ve given an identifier (i.e. master) that git can internally resolve to refs/heads/master and thus git will happily checkout that branch, while git checkout refs/heads/master is a direct reference that git first resolves to a commit.
Hence it’s like you had actually run the subcommand git checkout <commit-hash>, and so git puts you into a detached HEAD state.
If you’re unfamiliar with what a ‘detached HEAD’ state is, then it simply means the HEAD file no longer is pointing at a reference such as .git/refs/heads/master but directly to a commit hash. The purpose of a detached HEAD is to allow you to do work off a branch.
I’ve never had a need to work ‘off’ a branch (:shrugs:) and so I can only presume there are situations where you would want to do that.
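Since the difference between ‘on a branch’ and ‘detached’ comes down to what the HEAD file contains, it can be checked with a few lines of Python. This is only a sketch: describe_head is a name I’ve made up, and it classifies the text of .git/HEAD rather than asking git.

```python
def describe_head(head_content: str) -> str:
    """Classify the contents of a .git/HEAD file."""
    content = head_content.strip()
    branch_prefix = "ref: refs/heads/"
    if content.startswith(branch_prefix):
        # HEAD points at a branch reference file
        return "on branch " + content[len(branch_prefix):]
    if content.startswith("ref: "):
        # HEAD points at some other reference
        return "pointing at " + content[len("ref: "):]
    # HEAD holds a commit hash directly: the detached state
    return "detached at " + content


print(describe_head("ref: refs/heads/master\n"))
print(describe_head("b5d34b608ce697f0d20d011ee569529bca3feee8\n"))
```

With a real repository you could pass in the result of reading .git/HEAD as text.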
OK, now that we have our first commit let’s dig a little deeper into the ‘objects’ git defines, and how the .git directory structure has changed…
Object Types
There are four main types of objects in git:
- commit
- tree
- blob
- tag
Note: we’ll primarily be covering the first three object types.
Since we committed a single file into git there have been a few new files and directories created:
- index: a binary file containing a sorted list of path names.
- COMMIT_EDITMSG: temporary file used to store latest commit message.
- objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99: the foo.txt file (type: blob)
- objects/b5/d34b608ce697f0d20d011ee569529bca3feee8: commit message data (type: commit)
- objects/fc/f0be4d7e45f0ef9592682ad68e42270b0366b4: directory tree (type: tree)
You’ll notice that the new objects are stored in a subdirectory which uses the first two characters from the hash of the object’s contents.
For example, the foo.txt blob object’s content was hashed into 257cc5642cb1a054f08cc83f2d943e56fd3ebe99. Next git took the first two characters 25 and made a subdirectory, and then moved the object into that directory while naming the object file using the remaining characters (i.e. 7cc5642cb1a054f08cc83f2d943e56fd3ebe99).
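One detail worth spelling out is that git doesn’t hash the raw file contents alone: it first prepends a small header of the form ‘<type> <size>’ followed by a NUL byte, and then takes the SHA-1 of the whole thing. Here is a small Python sketch of that calculation and of the two-character directory split. The function names are my own, and the input is the ‘foo’ plus newline that echo foo > foo.txt wrote earlier.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0' followed by the object body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex characters become the directory, the rest the file name."""
    return f"objects/{object_id[:2]}/{object_id[2:]}"


blob_id = git_object_id("blob", b"foo\n")
print(blob_id)                     # 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
print(loose_object_path(blob_id))  # objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
```

Because the hash depends only on the type and the content, any file containing exactly ‘foo’ and a newline gets the same blob hash in any repository.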
In order to look at these files you’ll need a couple of different plumbing commands: git ls-files and git cat-file.
Let’s start with the index file.
The index is a binary file which tracks our working directory and our staging area (use the --stage flag with git ls-files to see the staged entries). The index enables fast comparisons between the tree object it defines and the working tree.
We’ll need to use git ls-files in order to read the contents:
$ git ls-files
foo.txt
It only has foo.txt tracked, which is correct. There are no other files or directories at this point in time (we’ll add more as we go).
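If you’re curious what ‘binary file’ means here, the index starts with a 12-byte header: the signature DIRC, a version number and the number of entries, all in network byte order. Below is a hedged Python sketch that parses just that header (the entries themselves are more involved and are not handled). The function name is hypothetical and the sample bytes are constructed by hand rather than read from a real index.

```python
import struct
from typing import Tuple


def parse_index_header(data: bytes) -> Tuple[int, int]:
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    if len(data) < 12:
        raise ValueError("index header needs at least 12 bytes")
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    return version, entry_count


# Hand-built header: version 2 index with a single entry (like our foo.txt)
sample_header = struct.pack(">4sII", b"DIRC", 2, 1)
print(parse_index_header(sample_header))  # (2, 1)
```

To try it on a real repository you would pass in the bytes of .git/index instead of the hand-built sample.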
To look at the different ‘objects’ we’ll use the git cat-file command which decompresses the file and displays the file contents (we’ll use the -t flag to return the ‘type’ and the -p flag to ‘print’ the contents).
Note: we don’t provide the path (e.g. objects/../...) as the argument, but the sha itself (shortened sha is acceptable too).
$ git cat-file -t 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
blob
$ git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
foo
$ git cat-file -t b5d34b608ce697f0d20d011ee569529bca3feee8
commit
$ git cat-file -p b5d34b608ce697f0d20d011ee569529bca3feee8
tree fcf0be4d7e45f0ef9592682ad68e42270b0366b4
author Integralist <example@gmail.com> 1585480397 +0100
committer Integralist <example@gmail.com> 1585480397 +0100
foo
$ git cat-file -t fcf0be4d7e45f0ef9592682ad68e42270b0366b4
tree
$ git cat-file -p fcf0be4d7e45f0ef9592682ad68e42270b0366b4
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 foo.txt
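For loose objects like these, what cat-file has to do is fairly small: the file on disk is zlib-compressed, and once inflated it is the ‘<type> <size>’ header, a NUL byte, and then the body. The following Python sketch round-trips an object in memory to show the layout. The function names are mine, and it deliberately ignores packfiles, which store objects differently.

```python
import zlib
from typing import Tuple


def encode_loose_object(obj_type: str, body: bytes) -> bytes:
    """Build the compressed bytes git would store for a loose object."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return zlib.compress(header + body)


def decode_loose_object(raw: bytes) -> Tuple[str, bytes]:
    """Inflate a loose object and split it into (type, body)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


stored = encode_loose_object("blob", b"foo\n")
obj_type, body = decode_loose_object(stored)
print(obj_type)       # blob  (what `cat-file -t` reports)
print(body.decode())  # foo   (what `cat-file -p` prints for a blob)
```

Reading the bytes of a file under .git/objects/ and passing them to decode_loose_object should give the same type that cat-file -t reports.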
What’s also interesting is that when you execute a command such as git add, git will ‘conceptually’ copy the file to your staging area, but internally it has created a ‘blob’ object, while a command such as git commit then creates the ‘commit’ and ‘tree’ objects that reference the already existing ‘blob’ object. I mention this because I wanted to be clear that these three objects don’t all get created at the same time.
Snapshots, Not Differences
We saw earlier an ascii graph that indicated the hierarchy of these objects. It showed that git reference types (e.g. remotes, branches and tags) all point to a ‘commit’ object. This commit object will include a pointer to a ‘tree’ object, and the tree object is a list of files (i.e. blobs) and directories (i.e. more trees).
It’s this graph that builds up the entire snapshot of the repository. This is why you shouldn’t think of a git commit as being a patch or set of changes to a bunch of files, but instead should see each commit as a complete snapshot of your entire project at a singular point in time.
If any files or directories change, then their hashes will change, and thus the new HEAD commit will consist of different tree and blob objects (resulting in a different hash-tree graph).
With that in mind, let’s start by looking at the commit object we have (git cat-file -p b5d34b6). We can see the first line says tree followed by a hash (all other information is the typical commit information you’re used to seeing when you run git log).
If we look at the tree object with git cat-file -p fcf0be4 (which the commit object linked to), then we can see it consists of a single line: a blob object with its hash and its filename foo.txt (this makes sense as our project only contains this single file).
Lastly, if we look at the blob object with git cat-file -p 257cc56 (which the tree object linked to), then we can see the contents of that blob object is the contents of the foo.txt file itself.
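The -p output for a tree is a friendly rendering; the stored body is binary. Each entry is the mode, a space, the file name, a NUL byte, and then the 20 raw bytes of the hash (not the 40-character hex string). Here is a Python sketch that builds and parses that layout for our single foo.txt entry. The helper names are assumptions of mine, and unlike git the builder does not sort entries.

```python
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex object id)


def build_tree_body(entries: List[Entry]) -> bytes:
    """Serialise entries as '<mode> <name>\\0' followed by 20 raw hash bytes."""
    body = b""
    for mode, name, hex_id in entries:
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    return body


def parse_tree_body(body: bytes) -> List[Entry]:
    """Walk the binary body and recover (mode, name, hex id) for each entry."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        raw_id = body[nul + 1:nul + 21]
        entries.append((mode, name, raw_id.hex()))
        pos = nul + 21
    return entries


tree_body = build_tree_body(
    [("100644", "foo.txt", "257cc5642cb1a054f08cc83f2d943e56fd3ebe99")]
)
print(len(tree_body))  # 35
print(parse_tree_body(tree_body))
```

Subdirectories use the same layout, with the mode stored as 40000 (cat-file -p pads it to 040000 for display) and the hash pointing at another tree.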
OK, so what happens if I add a new file bar.txt and a new subdirectory baz with another file qux.txt within that subdirectory…
$ tree
.
├── bar.txt
├── baz
│ └── qux.txt
└── foo.txt
1 directory, 3 files
Once I add the new files and commit them, I inspect the new objects in my .git/objects folder. From there I locate the commit object (I do that by looking at .git/refs/heads/master and seeing what commit hash it has) and once I cat-file -p that hash, I follow its tree pointer…
$ git cat-file -p edc6771b338b472d901358e530db7cede202c1c7
100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6 bar.txt
040000 tree 3d15e426c95bac2548d7255af9c5e240df786e03 baz
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99 foo.txt
$ git cat-file -p 3d15e426c95bac2548d7255af9c5e240df786e03
100644 blob 100b0dec8c53a40e4de7714b2c612dad5fad9985 qux.txt
We can see from the above output that the tree object not only includes my project files, but now a baz directory (itself a tree object). Looking at that tree object shows there is one file inside of it (a blob object for qux.txt).
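Notice that foo.txt has the same blob hash in this tree as it did in the first commit’s tree: nothing about it changed, so the new snapshot simply points at the existing object. That is also how a ‘diff’ can be derived from two snapshots, by comparing the hashes entry by entry. Here is a simplified Python sketch of that comparison using the two top-level trees from this article (the function name compare_snapshots is mine, and it only looks at one level rather than recursing into subtrees).

```python
from typing import Dict, List

# Top-level tree of the first commit (name -> object hash)
first_tree = {
    "foo.txt": "257cc5642cb1a054f08cc83f2d943e56fd3ebe99",
}

# Top-level tree of the second commit
second_tree = {
    "bar.txt": "5716ca5987cbf97d6bb54920bea6adde242d87e6",
    "baz": "3d15e426c95bac2548d7255af9c5e240df786e03",
    "foo.txt": "257cc5642cb1a054f08cc83f2d943e56fd3ebe99",
}


def compare_snapshots(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify entries by comparing object hashes between two trees."""
    return {
        "added": sorted(name for name in new if name not in old),
        "removed": sorted(name for name in old if name not in new),
        "changed": sorted(n for n in new if n in old and old[n] != new[n]),
        "unchanged": sorted(n for n in new if n in old and old[n] == new[n]),
    }


for status, names in compare_snapshots(first_tree, second_tree).items():
    print(status, names)
```

The unchanged entries cost nothing extra to store, which is why keeping full snapshots is cheaper than it first sounds.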
If we review the index file again we’ll see our new set of files/directories:
$ git ls-files
bar.txt
baz/qux.txt
foo.txt
Tags
Along the way I’ve been tagging my commits. A tag (as far as git internals are concerned) is another ‘object’ type. Let’s look at my tags:
$ git tag -n
v1 foo
v2 an anotated tag
So we can see I have two separate tags, and each one points at a different commit (the v1 tag was a lightweight tag and so the associated foo comes from the commit message, while the v2 tag was an annotated tag and so the message I gave at that point was displayed).
In order to see the commit that a tag is associated with, we’ll need another plumbing subcommand rev-list:
$ git rev-list -n 1 v1
b5d34b608ce697f0d20d011ee569529bca3feee8
$ git rev-list -n 1 v2
0b56156eba23ae9bee8c32137605397cf7c9e88e
But for us to see what the ‘tag’ object type looks like internally, we need to get the hash that the tag reference file is set to:
$ cat .git/refs/tags/v1
b5d34b608ce697f0d20d011ee569529bca3feee8
$ cat .git/refs/tags/v2
75d37b7c37173def7a0a8cd43d674edc8e9ce614
Once we have that hash we can use cat-file to see the ‘tag’ object:
$ git cat-file -t 75d37b7c37173def7a0a8cd43d674edc8e9ce614
tag
$ git cat-file -p 75d37b7c37173def7a0a8cd43d674edc8e9ce614
object 0b56156eba23ae9bee8c32137605397cf7c9e88e
type commit
tag v2
tagger Integralist <example@gmail.com> 1585592962 +0100
an anotated tag
OK, so you may have noticed I used cat-file on the v2 (annotated) tag, but not on the v1 (lightweight) tag. That was not an accidental omission.
A lightweight tag is just a reference to a commit hash, but an annotated tag is more complex and so a ‘tag object’ is created, and we can see that when we inspect the hash inside the v2 tag reference.
We can see the tag object includes a pointer to the ‘commit’ object (0b56156eba23ae9bee8c32137605397cf7c9e88e) as well as information about the ‘tagger’ (in this case me!)
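This is also why rev-list gave us the commit for v2 even though the reference file holds a different hash: an annotated tag has to be ‘peeled’ by following its object line to reach the commit. Below is a Python sketch of that step. The names parse_tag_object and peel_to_commit are hypothetical, the object store is just a dictionary filled with the hashes from this article, and I’m assuming the usual layout where a blank line separates the tag headers from the message.

```python
from typing import Dict, Tuple


def parse_tag_object(text: str) -> Tuple[Dict[str, str], str]:
    """Split a tag object's text into its header fields and its message."""
    header_part, _, message = text.partition("\n\n")
    headers = {}
    for line in header_part.splitlines():
        key, _, value = line.partition(" ")
        headers[key] = value
    return headers, message.strip()


def peel_to_commit(object_id: str, objects: Dict[str, Tuple[str, str]]) -> str:
    """Follow 'object' pointers through tag objects until a non-tag is reached."""
    obj_type, text = objects[object_id]
    while obj_type == "tag":
        headers, _ = parse_tag_object(text)
        object_id = headers["object"]
        obj_type, text = objects[object_id]
    return object_id


v2_tag_text = (
    "object 0b56156eba23ae9bee8c32137605397cf7c9e88e\n"
    "type commit\n"
    "tag v2\n"
    "tagger Integralist <example@gmail.com> 1585592962 +0100\n"
    "\n"
    "an anotated tag\n"
)

# Stand-in object store: hash -> (type, text); commit bodies are left empty here
objects = {
    "75d37b7c37173def7a0a8cd43d674edc8e9ce614": ("tag", v2_tag_text),
    "0b56156eba23ae9bee8c32137605397cf7c9e88e": ("commit", ""),
    "b5d34b608ce697f0d20d011ee569529bca3feee8": ("commit", ""),
}

# v2 (annotated): the ref holds a tag object, which is peeled to the commit
print(peel_to_commit("75d37b7c37173def7a0a8cd43d674edc8e9ce614", objects))
# v1 (lightweight): the ref already holds the commit hash
print(peel_to_commit("b5d34b608ce697f0d20d011ee569529bca3feee8", objects))
```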
Remotes
Let’s say you add a remote like so:
git remote add origin git@github.com:Integralist/dotfiles.git
We can now look at the configuration of our remote:
$ git remote show origin
* remote origin
Fetch URL: git@github.com:Integralist/dotfiles.git
Push URL: git@github.com:Integralist/dotfiles.git
HEAD branch: master
Remote branches:
linux new (next fetch will store in remotes/origin)
master new (next fetch will store in remotes/origin)
minimal-mac-version-of-linux-version new (next fetch will store in remotes/origin)
Local ref configured for 'git push':
master pushes to master (local out of date)
You might be confused though if you were to look at .git/refs and not see a remotes subdirectory. That subdirectory is created automatically if you clone an existing repository, but it’ll also be created when executing git fetch after manually adding a new remote to an existing repository.
I added my new origin remote (see above), but it was only once I had executed a git fetch that I was able to see a ‘remote’ reference:
refs/
| remotes/
| | origin/
| | | master
If I inspect the .git/refs/remotes/origin/master file, then I’ll see the latest commit my remote master branch is on. It’s also interesting to remember what we mentioned earlier about references that point to commits being interchangeable with commit hashes in various subcommands.
For example, git diff allows you to specify two branches to compare against each other (remember a branch is just a reference file that points to a commit hash), and so you might want to compare your local master against your remote master branch:
git diff master..origin/master
This is just a shortened way of doing:
git diff master..refs/remotes/origin/master
Which itself is just a shortened way of doing:
git diff master..c3865b72b019ced930cfc601b09b874685c29e72
Note: one last thing I wanted to mention (and there was no other place really to mention this) is that git comes with a UI! You can execute the command gitk to use it.