---
title: Git Internals
date: 2020-03-29
description: A deep dive into the .git directory, plumbing vs porcelain commands, and how git works under the hood.
tags: [git]
---

## Introduction

There are many version control systems, but [git](https://git-scm.com/) is undoubtedly the most popular and widely used, helped along by online social platforms such as [GitHub](https://www.github.com/) and [GitLab](https://gitlab.com/).

Yet, it is a tool that is still vastly misunderstood and feared. In this post I aim to take a look at some of the internal moving parts of git, primarily what's inside the `.git` directory (including the various subdirectories and files).

My hope is that by better understanding how git works, and the concepts it is built upon, readers will feel more empowered and confident when working with git (especially when they have issues and would normally be unsure of what to do).

> Note: this article isn't an introduction to git, and does presume that the reader is familiar with (i.e. a user of) git.

## General Concept

I wanted to take a quick moment just to clarify the terminology associated with the general concepts of how git works (so we're all on the same page):

- **Working Directory**: your project files.
- **Staging Area**: a file that tracks the changes to your project files.
- **Repository**: the location where your project files are stored.

> Note: these bullet points are just summaries, but I'd like to expand on them slightly: your 'working directory' can _change_ depending on which 'version' of the project you have 'checked out' from the git repository (i.e. this is what happens when you change your 'branch' with `git checkout <branch_name>`).

So for example, commands like `git add` will copy objects from the working directory into the staging area (aka the 'index'), while `git reset` will _remove_ objects from the staging area.

A command such as `git diff` compares your working directory to your staging area, while using the `--staged` flag will change this behaviour such that git will compare your staging area to your actual repository state.

## Subcommands: Porcelain and Plumbing

The git version control system wasn't initially designed to be a user-friendly interface, and so alongside the more commonly used subcommands are commands that can carry out very low-level operations.

This has resulted in much confusion around what commands are intended for use by general users and which commands exist for the purpose of internal use.

> Note: although used internally, the low-level subcommands are also typically used by systems that require such granular operational control.

The `git` subcommands are generally split into one of two groups:

- *Porcelain*: the user-friendly interface (e.g. `git checkout`, `git pull` etc.)
- *Plumbing*: low-level interface (e.g. `git cat-file`, `git rev-parse` etc.)

### Git Subcommands

Below is a list of the `git` subcommands (as of git version `2.22.0`). Knowing which are meant to be 'porcelain' and which are meant to be 'plumbing' can be difficult.

```txt
$ man git-<tab>

git-add                       git-commit-tree               git-fsck                      
git-am                        git-config                    git-fsck-objects              
git-annotate                  git-count-objects             git-gc                        
git-apply                     git-credential                git-get-tar-commit-id         
git-archimport                git-credential-cache          git-grep                      
git-archive                   git-credential-cache--daemon  git-gui                       
git-bisect                    git-credential-store          git-hash-object               
git-blame                     git-cvsexportcommit           git-help                      
git-branch                    git-cvsimport                 git-http-backend              
git-bundle                    git-cvsserver                 git-http-fetch                
git-cat-file                  git-daemon                    git-http-push                 
git-check-attr                git-describe                  git-imap-send                 
git-check-ignore              git-diff                      git-index-pack                
git-check-mailmap             git-diff-files                git-init                      
git-check-ref-format          git-diff-index                git-init-db                   
git-checkout                  git-diff-tree                 git-instaweb                  
git-checkout-index            git-difftool                  git-interpret-trailers        
git-cherry                    git-fast-export               git-log                       
git-cherry-pick               git-fast-import               git-ls-files                  
git-citool                    git-fetch                     git-ls-remote                 
git-clean                     git-fetch-pack                git-ls-tree                   
git-clone                     git-filter-branch             git-mailinfo                  
git-column                    git-fmt-merge-msg             git-mailsplit                 
git-commit                    git-for-each-ref              git-merge                     
git-commit-graph              git-format-patch              git-merge-base                
git-merge-file                git-rebase                    git-show-index
git-merge-index               git-receive-pack              git-show-ref
git-merge-one-file            git-reflog                    git-stage
git-merge-tree                git-remote                    git-stash
git-mergetool                 git-remote-ext                git-status
git-mergetool--lib            git-remote-fd                 git-stripspace
git-mktag                     git-remote-testgit            git-submodule
git-mktree                    git-repack                    git-svn
git-multi-pack-index          git-replace                   git-symbolic-ref
git-mv                        git-request-pull              git-tag
git-name-rev                  git-rerere                    git-unpack-file
git-notes                     git-reset                     git-unpack-objects
git-p4                        git-rev-list                  git-update-index
git-pack-objects              git-rev-parse                 git-update-ref
git-pack-redundant            git-revert                    git-update-server-info
git-pack-refs                 git-rm                        git-upload-archive
git-parse-remote              git-send-email                git-upload-pack
git-patch-id                  git-send-pack                 git-var
git-prune                     git-sh-i18n                   git-verify-commit
git-prune-packed              git-sh-i18n--envsubst         git-verify-pack
git-pull                      git-sh-setup                  git-verify-tag
git-push                      git-shell                     git-web--browse
git-quiltimport               git-shortlog                  git-whatchanged
git-range-diff                git-show                      git-worktree
git-read-tree                 git-show-branch               git-write-tree
```

But there _is_ a way to find out! Currently the `man git` page describes which commands are intended as porcelain and which are plumbing. Simply search for `GIT COMMANDS` and you'll find the two groupings.

My own generalized way of making the distinction is to consider the day-to-day subcommands I use (e.g. `git add`, `git diff`) as porcelain, and the more esoteric subcommands (e.g. `git fsck`, `git multi-pack-index`) as plumbing.

In practice it doesn't _really_ matter which subcommands are porcelain and which are plumbing. If there's a subcommand you feel you need to use, then go ahead and use it. My personal perspective on this is: if you're unsure of what a subcommand does, you're unlikely to reach for it in the first place.

Most users do not diverge from the well trodden path of: `git add`, `git commit`, `git pull`, `git push`, `git diff` (with an occasional `git rebase`).

What's interesting about the plumbing subcommands is that some of them are used internally by git when you're calling the porcelain subcommands (e.g. `git read-tree`, `git update-index`, `git update-ref` will be called by other porcelain commands such as `git add` or `git commit`).

> Note: although we'll be looking at a couple of plumbing commands in this article, I'll refer you to the [git book](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a look at the different plumbing commands available and how they're used.

## The `.git` directory

When you start a new project that you want to use version control for, you'll typically run the `git init` subcommand:

```
git init [dir]
```

Most people will know that there is now a `.git` directory created in the root of your project directory, but that's about where their understanding of things stops.

Let's see what's initially inside the `.git` directory of a new project...

```
$ tree .git/

.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

8 directories, 15 files
```

OK, so there are some important directories and files here that we need to learn a bit about in order to appreciate how git works.

> Note: I'm not going to explain _every_ file and directory, only those necessary to understand the fundamentals.

Here are some interesting ones:

- `HEAD`: contains a pointer to the tip of the _current_ branch.
- `config`: contains project-specific configuration options.
- `info`: contains a repository-wide `exclude` file †
- `objects`: contains four types of 'objects' (commit, tree, blob, tag).
- `refs`: contains pointers to 'commit' objects.

> † this works like a `.gitignore`, but it lives inside `.git` and so is never committed or shared with other users.

## References and Objects

The two most important concepts in git are: **references** and **objects**.

For example, your branches, tags and remotes are all references to commits, while your commits, files and directories are all objects.

### References

Git is built upon the simple premise of using 'pointers' to data, and these pointers are typically referred to as 'references' (or 'refs' for short).

This is what the `.git/refs` directory stores: references.

As I mentioned earlier, these references all point to a 'commit' object...

```
remote    branch     tag
  |          |        |
  |          |        |
  |          V        |
  ------> commit <-----
             |
             |
             V
           tree
             |
             |
             V
           blob
```

> Note: you can see from the above ascii graph that the 'commit' object itself points to a 'tree' object, and that tree object points to a 'blob' object. We'll dig into these 'object' types in more detail in the "[Object Types](#object-types)" section.

It's worth clarifying now that although we conceptually talk in terms of 'branches' in git, the internal directory structure (where references to branches are stored) uses the term 'heads' instead. It's a terrible name (like most things in git's lexicon), but it's best to just accept it and move on.

The reason git uses 'references' is that they let users refer to a specific commit without having to remember the full SHA-1 hash.

Imagine wanting to checkout your master branch but instead of just executing `git checkout master` you had to remember the specific hash.

```
git checkout b5d34b608ce697f0d20d011ee569529bca3feee8
```

Not very practical, heh.

### The HEAD reference

If you recall from earlier, we said the `HEAD` file contains a pointer to the tip of the _current_ branch.

If we were to look at the `.git/HEAD` file we would find that by default it has the following content:

```
ref: refs/heads/master
```

You can see it's a pointer to another location (the reference `.git/refs/heads/master`), which means it's a pointer to a pointer!

Remember that `refs/heads/master` is a reference file (which refers to our master branch), and the content of that file is a commit hash. So this is telling us that ultimately `HEAD` is pointing to our `master` branch.

But at this point in time I've only executed `git init`, and so I've not actually _committed_ anything into git. This means that there isn't actually a `master` file inside of the `.git/refs/heads` subdirectory.

If we look back at the earlier directory tree (which we printed after running `git init`), we'll notice that although there is a `.git/refs/heads` directory, there is no `master` file. A file called `master` won't exist in that subdirectory _until_ I make my first commit.

> Note: if you recall from earlier I said that the `refs/heads` subdirectory was essentially a synonym for 'branches' created locally for this project. Hence, the default file referenced by the `HEAD` file is `master` (because it's referencing the `master` branch).

Let's now create a commit so that we can see a `refs/heads/master` file and what it points to...

```
$ echo foo > foo.txt
$ git add foo.txt
$ git commit -m "foo"

[master (root-commit) b5d34b6] foo
 1 file changed, 1 insertion(+)
 create mode 100644 foo.txt
```

Once we do this we'll find git has created a `master` file inside of `.git/refs/heads` and the contents of that file is the hash of my first commit (which indicates that the `master` reference file, or 'branch', is pointing at a specific commit snapshot):

```
b5d34b608ce697f0d20d011ee569529bca3feee8
```

When you execute a command (such as) `git checkout master`, internally git will resolve `master` into `refs/heads/master` and that is what tells git which commit object to now point to.
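To make the 'pointer to a pointer' idea a bit more concrete, here is a minimal Python sketch of how the chain could be followed by hand. The `resolve_head` function is my own illustrative helper (not something git ships), and it assumes references are stored as loose files under `.git/refs` (it ignores `packed-refs`, which real repositories may also use).

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_head(git_dir: Path) -> Optional[str]:
    """Follow HEAD through any 'ref: ...' indirection to a commit hash."""
    content = (git_dir / "HEAD").read_text().strip()
    # A symbolic reference looks like "ref: refs/heads/master".
    while content.startswith("ref: "):
        ref_file = git_dir / content[len("ref: "):]
        if not ref_file.exists():
            # The branch has no commits yet (the state right after `git init`).
            return None
        content = ref_file.read_text().strip()
    return content


# Build a throwaway directory that mimics the relevant parts of `.git`.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    print(resolve_head(git_dir))  # no `master` file yet, so nothing to resolve

    heads = git_dir / "refs" / "heads"
    heads.mkdir(parents=True)
    (heads / "master").write_text("b5d34b608ce697f0d20d011ee569529bca3feee8\n")
    print(resolve_head(git_dir))  # now resolves to the first commit's hash
```

The first call returns `None` because the `master` file doesn't exist before the first commit, which mirrors what we saw straight after `git init`.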

### Subcommands and References

Although a reference is a pointer to a commit hash, it doesn't mean every form of a reference behaves the same way in every git subcommand.

Here is an example subcommand that works fine with a reference: `git log`. We can use `git log origin/master`, and git will know to internally resolve that reference to the fully qualified path `.git/refs/remotes/origin/master`.

Knowing that, we would also know that it is possible to use a partial reference path such as `git log refs/remotes/origin/master` or maybe `git log remotes/origin/master`.

All these variations work fine, but we typically use `git log origin/master` for convenience (because it's less typing).
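The reason all three spellings work is that git tries a short name against a fixed list of locations. The sketch below reproduces that list as I understand it from `man gitrevisions`; the `candidate_refs` and `first_match` helpers are hypothetical names of my own, and the lookup is simplified to a set of known reference paths rather than a real `.git` directory.

```python
from typing import List, Optional, Set


def candidate_refs(name: str) -> List[str]:
    """Paths (relative to .git) tried for a short name, in order."""
    return [
        name,
        f"refs/{name}",
        f"refs/tags/{name}",
        f"refs/heads/{name}",
        f"refs/remotes/{name}",
        f"refs/remotes/{name}/HEAD",
    ]


def first_match(name: str, existing: Set[str]) -> Optional[str]:
    """Return the first candidate that exists, or None."""
    for path in candidate_refs(name):
        if path in existing:
            return path
    return None


# Pretend these are the only reference files in the repository.
refs = {"refs/heads/master", "refs/remotes/origin/master"}

for spelling in ("origin/master", "remotes/origin/master", "refs/remotes/origin/master"):
    print(spelling, "->", first_match(spelling, refs))
```

Each spelling ends up at `refs/remotes/origin/master`, just via a different rule in the list.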

But a shortened 'reference' doesn't always mean the same thing in commands like `git checkout` and `git pull`, for different reasons. With `git pull`, if we look at `man git-pull` we see we need to provide a `<repository> <refspec>`, and that means the refspec we provide is resolved against the _remote_ repository's references (which our local `.git/refs/remotes/` directory mirrors after a fetch).

If I look at `.git/refs/remotes/` I'll see only a single directory `origin`, and inside of that are all the branches for the `origin` remote. So if I attempted to do something like `git pull origin HEAD` this wouldn't do what I might expect, because there's a `HEAD` file inside of that `origin` directory too (and it points to a different commit from our local `HEAD` in `.git/HEAD`)!

This means we'd end up pulling the changes from the remote `master`! That happens because `HEAD` on the remote is set up to point at the `master` branch...

```
$ git remote show origin

* remote origin
  Fetch URL: git@github.com:example/repo.git
  Push  URL: git@github.com:example/repo.git
  HEAD branch: master
  Remote branches:
    ...
```

So subsequently doing `git pull origin HEAD` would bring in _lots_ of unexpected changes to your local branch 😬

> Note: using `HEAD` isn't a problem when doing something like `git push origin HEAD` because it's a fundamentally different operation and so git knows to reference the local `HEAD` file to get the commit range before _pushing_ to the remote.

Similarly, a fully qualified 'reference' behaves differently with a command like `git checkout`, as its internal logic will cause a `detached HEAD` state (e.g. if you were to do something like `git checkout refs/heads/master` instead of `git checkout master`).

Let's now understand what a 'detached HEAD' means, and why `git checkout` would cause that when given a qualified reference...

### Detached HEAD

Internally git does recognize the reference and can resolve it to the appropriate `.git/refs` directory, but the _behaviour_ of the checkout command changes when checking out a reference that is a qualified path such as `refs/heads/master`. What you would discover is you don't checkout the branch but are placed into a 'detached HEAD' at the relevant commit.

Why is that? Well, if we look at the documentation for the checkout subcommand (`man git-checkout`) we would discover...

> if it (the given branch name) refers to a branch (i.e., a name that, when prepended with "refs/heads/", is a valid ref), then that branch is checked out. Otherwise, if it refers to a valid commit, your HEAD becomes "detached" and you are no longer on any branch.

Running `git checkout master` means you've given an identifier (i.e. `master`) that git can internally resolve to `refs/heads/master` and thus git will happily checkout that branch, while `git checkout refs/heads/master` is a _direct_ reference that git first resolves to a commit.

Hence it's like you had actually run the subcommand `git checkout <commit-hash>`, and so git puts you into a detached HEAD state.

If you're unfamiliar with what a 'detached HEAD' state is, then it simply means the `HEAD` file no longer is pointing at a reference such as `.git/refs/heads/master` but _directly_ to a commit hash. The purpose of a detached HEAD is to allow you to do work _off_ a branch.
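Since the difference comes down to what the `HEAD` file contains, it can be checked with a few lines of Python. This is only a sketch: `head_state` is a hypothetical helper of my own, and it works on the file's text rather than asking git.

```python
from typing import Tuple


def head_state(head_content: str) -> Tuple[str, str]:
    """Classify the content of .git/HEAD as attached or detached."""
    text = head_content.strip()
    if text.startswith("ref: "):
        # HEAD names a branch reference, e.g. refs/heads/master.
        return ("attached", text[len("ref: "):])
    # Otherwise HEAD holds a commit hash directly.
    return ("detached", text)


print(head_state("ref: refs/heads/master\n"))
print(head_state("b5d34b608ce697f0d20d011ee569529bca3feee8\n"))
```

The first call reports an attached HEAD on `refs/heads/master`, while the second reports a detached HEAD at the given commit.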

I've never had a need to work 'off' a branch (`:shrugs:`) and so I can only presume there are situations where you would want to do that.

OK, now that we have our first commit let's dig a little deeper into the 'objects' git defines, and how the `.git` directory structure has changed...

## Object Types

There are four main types of objects in git:

1. commit
1. tree
1. blob
1. tag

> Note: we'll primarily be covering the first three object types.

Since we committed a single file into git there have been a few new files and directories created:

- `index`: a binary file containing a sorted list of path names.

- `COMMIT_EDITMSG`: temporary file used to store latest commit message.

- `objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99`: the `foo.txt` file (type: blob)

- `objects/b5/d34b608ce697f0d20d011ee569529bca3feee8`: commit message data (type: commit)

- `objects/fc/f0be4d7e45f0ef9592682ad68e42270b0366b4`: directory tree (type: tree)

You'll notice that the new objects are stored in a subdirectory which uses the first two characters from the hash of the object's contents.

For example, the `foo.txt` blob object's content was hashed into `257cc5642cb1a054f08cc83f2d943e56fd3ebe99`. Next git took the first two characters `25` and made a subdirectory, and then moved the object into that directory while naming the object file using the remaining characters (i.e. `7cc5642cb1a054f08cc83f2d943e56fd3ebe99`).
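One detail worth spelling out is that git doesn't hash the raw file bytes alone: it prepends a small header (`<type> <size>` followed by a NUL byte) and hashes the result. Below is a small Python sketch of that calculation for a SHA-1 repository; `hash_object` and `object_path` are illustrative names of my own rather than git APIs.

```python
import hashlib


def hash_object(content: bytes, obj_type: str = "blob") -> str:
    """Hash content the way git names a loose object."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def object_path(sha: str) -> str:
    """Split a hash into the two-character directory and the file name."""
    return f"objects/{sha[:2]}/{sha[2:]}"


# `echo foo > foo.txt` writes the bytes b"foo\n" (note the trailing newline).
sha = hash_object(b"foo\n")
print(sha)               # 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
print(object_path(sha))  # objects/25/7cc5642cb1a054f08cc83f2d943e56fd3ebe99
```

This should agree with what `git hash-object foo.txt` reports for the same file.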

In order to look at these files you'll need a couple of different plumbing commands: `git ls-files` and `git cat-file`.

Let's start with the `index` file.

The `index` is a binary file which tracks our working directory and our staging area (use `--stage` flag to see staging area). The index enables fast comparisons between the tree object it defines and the working tree.

We'll need to use `git ls-files` in order to read the contents:

```
$ git ls-files

foo.txt
```

It only has `foo.txt` tracked, which is correct. There are no other files or directories at this point in time (we'll add more as we go).

To look at the different 'objects' we'll use the `git cat-file` command which decompresses the file and displays the file contents (we'll use the `-t` flag to return the 'type' and the `-p` flag to 'print' the contents).

> Note: we don't provide the path (e.g. `objects/../...`) as the argument, but the sha itself (shortened sha is acceptable too).

```
$ git cat-file -t 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
blob

$ git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
foo

$ git cat-file -t b5d34b608ce697f0d20d011ee569529bca3feee8
commit

$ git cat-file -p b5d34b608ce697f0d20d011ee569529bca3feee8
tree fcf0be4d7e45f0ef9592682ad68e42270b0366b4
author Integralist <example@gmail.com> 1585480397 +0100
committer Integralist <example@gmail.com> 1585480397 +0100

foo

$ git cat-file -t fcf0be4d7e45f0ef9592682ad68e42270b0366b4
tree

$ git cat-file -p fcf0be4d7e45f0ef9592682ad68e42270b0366b4
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99    foo.txt
```
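If you're curious what `git cat-file` is doing for a loose object, the following Python sketch approximates it: the file on disk is zlib-compressed, and once inflated it is the header, a NUL byte, and then the body. Both helpers are hypothetical names of my own, they need the full hash (no shortened form), and they know nothing about packfiles, so treat this as an illustration rather than a replacement for the real command.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path
from typing import Tuple


def write_loose_object(git_dir: Path, obj_type: str, body: bytes) -> str:
    """Store an object as a zlib-compressed file named after its hash."""
    store = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    sha = hashlib.sha1(store).hexdigest()
    path = git_dir / "objects" / sha[:2] / sha[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return sha


def read_loose_object(git_dir: Path, sha: str) -> Tuple[str, bytes]:
    """Inflate a loose object and split it into (type, body)."""
    raw = zlib.decompress((git_dir / "objects" / sha[:2] / sha[2:]).read_bytes())
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


# Use a throwaway directory in place of a real `.git`.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    sha = write_loose_object(git_dir, "blob", b"foo\n")
    obj_type, body = read_loose_object(git_dir, sha)
    print(sha)       # the object's name
    print(obj_type)  # roughly what `cat-file -t` shows
    print(body)      # roughly what `cat-file -p` shows for a blob
```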

What's also interesting is that when you execute a command such as `git add`, git will 'conceptually' copy the file to your staging area, but internally it has created a 'blob' object. A command such as `git commit` then creates the 'tree' and 'commit' objects that reference the already existing 'blob' object. I mention this because I wanted to be clear that these three objects don't all get created at the same time.

### Snapshots, Not Differences

We saw earlier an ascii graph that indicated the hierarchy of these objects. It showed that git reference types (e.g. remotes, branches and tags) all point to a 'commit' object. This commit object will include a pointer to a 'tree' object, and the tree object is a list of files (i.e. blobs) and directories (i.e. more trees).

It's this graph that builds up the entire snapshot of the repository. This is why you shouldn't think of a git commit as being a patch or set of changes to a bunch of files, but instead should see each commit as a complete snapshot of your entire project at a singular point in time.

**If any files or directories change, then their hashes will change, and thus the HEAD commit will consist of different `tree` and `blob` objects (resulting in a different hash-tree graph).**

With that in mind, let's start by looking at the commit object we have (`git cat-file -p b5d34b6`). We can see the first line says `tree` followed by a hash (all other information is the typical commit information you're used to seeing when you run `git log`).

If we look at the tree object `git cat-file -p fcf0be4` (which the commit object linked to), then we can see it consists of a single line: a blob object with its hash and its filename `foo.txt` (this makes sense as our project only contains this single file).

Lastly, if we look at the blob object `git cat-file -p 257cc56` (which the tree object linked to), we can see that its content is the content of the `foo.txt` file itself.

OK, so what happens if I add a new file `bar.txt` and a new subdirectory `baz` with another file `qux.txt` within that subdirectory...

```
$ tree
.
├── bar.txt
├── baz
│   └── qux.txt
└── foo.txt

1 directory, 3 files
```

Once I add `baz/qux.txt` and commit it I then inspect the new objects in my `.git/objects` folder. From there I locate the commit object (I do that by looking at the `.git/refs/heads/master` and seeing what commit hash it has) and once I `cat-file -p` that hash, I follow its `tree` pointer...

```
$ git cat-file -p edc6771b338b472d901358e530db7cede202c1c7

100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6    bar.txt
040000 tree 3d15e426c95bac2548d7255af9c5e240df786e03    baz
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99    foo.txt

$ git cat-file -p 3d15e426c95bac2548d7255af9c5e240df786e03

100644 blob 100b0dec8c53a40e4de7714b2c612dad5fad9985    qux.txt
```

We can see from the above output that the tree object not only includes my project files, but now a `baz` directory (itself a tree object). Looking at that tree object shows there is one file inside of it (a blob object for `qux.txt`).
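To show why a change anywhere in the project ripples up to a new root tree hash, here is a Python sketch of how I understand a tree object to be serialized: each entry is the mode, a space, the name, a NUL byte, and then the child's hash as 20 raw bytes. The `hash_tree` helper is my own illustration, and the sort key is a simplification of git's ordering (directories compare as if their name ended with `/`). Directory entries use the mode `40000` in the stored form, even though `cat-file -p` displays it padded as `040000`.

```python
import hashlib
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, hex hash of the child object)


def hash_tree(entries: List[Entry]) -> str:
    """Serialize tree entries and return the tree object's hash."""

    def sort_key(entry: Entry) -> bytes:
        mode, name, _ = entry
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, sha in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
    header = f"tree {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# The subdirectory is hashed first, and its hash becomes an entry in the parent.
baz = hash_tree([("100644", "qux.txt", "100b0dec8c53a40e4de7714b2c612dad5fad9985")])
root = hash_tree([
    ("100644", "bar.txt", "5716ca5987cbf97d6bb54920bea6adde242d87e6"),
    ("40000", "baz", baz),
    ("100644", "foo.txt", "257cc5642cb1a054f08cc83f2d943e56fd3ebe99"),
])
print(baz)
print(root)
```

If the sketch is faithful, the two printed hashes should line up with the `baz` tree and the root tree shown in the `cat-file` output above. Changing `qux.txt` would change its blob hash, which changes the `baz` tree hash, which in turn changes the root tree hash.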

If we review the `index` file again we'll see our new set of files/directories:

```
$ git ls-files

bar.txt
baz/qux.txt
foo.txt
```

## Tags

Along the way I've been tagging my commits. A tag (as far as git internals are concerned) is another 'object' type. Let's look at my tags:

```
$ git tag -n

v1  foo
v2  an anotated tag
```

So we can see I have two separate tags, and each one points at a different commit (the v1 tag was a lightweight tag and so the associated `foo` comes from the commit message, while the v2 tag was an annotated tag and so the message I gave at that point was displayed).

In order to see the commit that a tag is associated with, we'll need another plumbing subcommand `rev-list`:

```
$ git rev-list -n 1 v1

b5d34b608ce697f0d20d011ee569529bca3feee8

$ git rev-list -n 1 v2

0b56156eba23ae9bee8c32137605397cf7c9e88e
```

But for us to see what the 'tag' object type looks like internally, we need to get the hash that the tag reference file is set to:

```
$ cat .git/refs/tags/v1

b5d34b608ce697f0d20d011ee569529bca3feee8

$ cat .git/refs/tags/v2

75d37b7c37173def7a0a8cd43d674edc8e9ce614
```

Once we have that hash we can use `cat-file` to see the 'tag' object:

```
$ git cat-file -t 75d37b7c37173def7a0a8cd43d674edc8e9ce614

tag

$ git cat-file -p 75d37b7c37173def7a0a8cd43d674edc8e9ce614

object 0b56156eba23ae9bee8c32137605397cf7c9e88e
type commit
tag v2
tagger Integralist <example@gmail.com> 1585592962 +0100

an anotated tag
```

OK, so you may have noticed I used `cat-file` on the v2 (annotated) tag, but not on the v1 (lightweight) tag. That was not an accidental omission.

A lightweight tag is just a reference to a commit hash, but an annotated tag is more complex and so a 'tag object' is created, and we can see that when we inspect the hash inside the v2 tag reference.

We can see the tag object includes a pointer to the 'commit' object (`0b56156eba23ae9bee8c32137605397cf7c9e88e`) as well as information about the 'tagger' (in this case _me_!)
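Because the printed tag object is just a few `key value` header lines, a blank line, and then the message, it's easy to pick apart. The Python sketch below parses the text shown above; `parse_tag` is a hypothetical helper of my own and assumes an unsigned tag whose headers each fit on one line.

```python
from typing import Dict, Tuple


def parse_tag(text: str) -> Tuple[Dict[str, str], str]:
    """Split a printed tag object into its header fields and message."""
    head, _, message = text.partition("\n\n")
    fields: Dict[str, str] = {}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields, message.strip()


raw = """object 0b56156eba23ae9bee8c32137605397cf7c9e88e
type commit
tag v2
tagger Integralist <example@gmail.com> 1585592962 +0100

an anotated tag
"""

fields, message = parse_tag(raw)
print(fields["object"])  # the commit the tag points at
print(fields["type"])    # the type of the object being tagged
print(message)           # the annotation given when tagging
```

The `object` field is the extra hop that a lightweight tag doesn't have: a lightweight tag's reference file holds the commit hash directly.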

## Remotes

When you add a remote like so:

```
git remote add origin git@github.com:Integralist/dotfiles.git
```

We can now look at the configuration of our remote:

```
$ git remote show origin

* remote origin
  Fetch URL: git@github.com:Integralist/dotfiles.git
  Push  URL: git@github.com:Integralist/dotfiles.git
  HEAD branch: master
  Remote branches:
    linux                                new (next fetch will store in remotes/origin)
    master                               new (next fetch will store in remotes/origin)
    minimal-mac-version-of-linux-version new (next fetch will store in remotes/origin)
  Local ref configured for 'git push':
    master pushes to master (local out of date)
```

You might be confused though if you were to look at `.git/refs` and don't see a `remotes` subdirectory. This happens automatically if you _clone_ an existing repository, but it'll also be created when executing `git fetch` after manually adding a new remote to an _existing_ repository.

I added my new `origin` remote (see above), but it was only once I had executed a `git fetch` that I was able to see a 'remote' reference:

```
refs/
| remotes/
| | origin/
| | | master
```

If I inspect the `.git/refs/remotes/origin/master` file, then I'll see the latest commit my remote `master` branch is on. It's also interesting to remember what we mentioned earlier about references that point to commits being interchangeable with commit hashes in various subcommands.

For example, `git diff` allows you to specify two branches to compare against each other (remember a branch is just a reference file that points to a commit hash), and so you might want to compare your local `master` against your remote `master` branch:

```
git diff master..origin/master
```

This is just a shortened way of doing:

```
git diff master..refs/remotes/origin/master
```

Which itself is just a shortened way of doing:

```
git diff master..c3865b72b019ced930cfc601b09b874685c29e72
```

> Note: one last thing I wanted to mention (and there was no other place really to mention this) is that git comes with a UI! You can execute the command `gitk` to use it.
