- General Concept
- Subcommands: Porcelain and Plumbing
- References and Objects
- Object Types
Yet, it is a tool that is still vastly misunderstood and feared. In this post I aim to take a look at some of the internal moving parts of git, primarily what’s inside the
.git directory (including the various subdirectories and files).
My hope is that by better understanding how git works, and the concepts it is built upon, readers will feel more empowered and confident when working with git (especially when they have issues and would normally be unsure of what to do).
Note: this article isn’t an introduction to git, and does presume that the reader is familiar with (i.e. a user of) git.
I wanted to take a quick moment just to clarify the terminology associated with the general concepts of how git works (so we’re all on the same page):
- Working Directory: your project files.
- Staging Area: a file that tracks the changes to your project files.
- Repository: the location where your project files are stored.
Note: these bullet points are just summaries, and I would like to expand on them slightly: your ‘working directory’ can change depending on what ‘version’ of the project you have ‘checked out’ from the git repository (i.e. this is what happens when you change your ‘branch’ with
git checkout <branch_name>).
So for example, commands like
git add will copy objects from the working directory into the staging area (aka the ‘index’), while
git reset will remove objects from the staging area.
A command such as
git diff compares your working directory to your staging area, while using the
--staged flag will change this behaviour such that git will compare your staging area to your actual repository state.
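To make the three areas concrete, here is a minimal sketch that runs in a throwaway repository. The temp directory, file name and identity are illustrative assumptions; only standard git subcommands are used.

```shell
# Throwaway repository in a temp dir; file names and identity are illustrative.
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt && git commit -q -m "foo"

# 1. Change the file: the change lives only in the working directory.
echo bar >> foo.txt
unstaged_before="$(git diff --name-only)"          # working directory vs index
staged_before="$(git diff --staged --name-only)"   # index vs last commit

# 2. Stage it: the change is now recorded in the index.
git add foo.txt
unstaged_after="$(git diff --name-only)"
staged_after="$(git diff --staged --name-only)"

# 3. Unstage it: the index matches the last commit again.
git reset -q -- foo.txt
staged_reset="$(git diff --staged --name-only)"

echo "before add:  unstaged=[$unstaged_before] staged=[$staged_before]"
echo "after add:   unstaged=[$unstaged_after] staged=[$staged_after]"
echo "after reset: staged=[$staged_reset]"
```

The same file shows up in a different diff depending on whether the change has been staged, which is the whole point of the staging area.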
Subcommands: Porcelain and Plumbing
The git version control system wasn’t initially designed to be a user-friendly interface, and so alongside the more commonly used subcommands are commands that can carry out very low-level operations.
This has resulted in much confusion around what commands are intended for use by general users and which commands exist for the purpose of internal use.
Note: although used internally, the low-level subcommands are also typically used by systems that require such granular operational control.
git subcommands are generally split into one of two groups:
- Porcelain: the user-friendly interface (e.g.
- Plumbing: low-level interface (e.g.
Below is a list of the
git subcommands (as of git version
2.22.0), and knowing which are meant to be ‘porcelain’ and which are meant to be ‘plumbing’ can be difficult.
$ man git-<tab> git-add git-commit-tree git-fsck git-merge-file git-rebase git-show-index git-am git-config git-fsck-objects git-merge-index git-receive-pack git-show-ref git-annotate git-count-objects git-gc git-merge-one-file git-reflog git-stage git-apply git-credential git-get-tar-commit-id git-merge-tree git-remote git-stash git-archimport git-credential-cache git-grep git-mergetool git-remote-ext git-status git-archive git-credential-cache--daemon git-gui git-mergetool--lib git-remote-fd git-stripspace git-bisect git-credential-store git-hash-object git-mktag git-remote-testgit git-submodule git-blame git-cvsexportcommit git-help git-mktree git-repack git-svn git-branch git-cvsimport git-http-backend git-multi-pack-index git-replace git-symbolic-ref git-bundle git-cvsserver git-http-fetch git-mv git-request-pull git-tag git-cat-file git-daemon git-http-push git-name-rev git-rerere git-unpack-file git-check-attr git-describe git-imap-send git-notes git-reset git-unpack-objects git-check-ignore git-diff git-index-pack git-p4 git-rev-list git-update-index git-check-mailmap git-diff-files git-init git-pack-objects git-rev-parse git-update-ref git-check-ref-format git-diff-index git-init-db git-pack-redundant git-revert git-update-server-info git-checkout git-diff-tree git-instaweb git-pack-refs git-rm git-upload-archive git-checkout-index git-difftool git-interpret-trailers git-parse-remote git-send-email git-upload-pack git-cherry git-fast-export git-log git-patch-id git-send-pack git-var git-cherry-pick git-fast-import git-ls-files git-prune git-sh-i18n git-verify-commit git-citool git-fetch git-ls-remote git-prune-packed git-sh-i18n--envsubst git-verify-pack git-clean git-fetch-pack git-ls-tree git-pull git-sh-setup git-verify-tag git-clone git-filter-branch git-mailinfo git-push git-shell git-web--browse git-column git-fmt-merge-msg git-mailsplit git-quiltimport git-shortlog git-whatchanged git-commit git-for-each-ref git-merge git-range-diff git-show 
git-worktree git-commit-graph git-format-patch git-merge-base git-read-tree git-show-branch git-write-tree
But there is a way to find out! Currently the
man git page describes which commands are intended as porcelain and which are plumbing. Simply search for
GIT COMMANDS and you’ll find the two groupings.
My own generalized way of making a distinction is to consider the day-to-day subcommands I use (e.g.
git diff) as being porcelain, while the more esoteric subcommands (e.g.
git multi-pack-index) as being more plumbing orientated.
In practice it doesn’t really matter which subcommands are porcelain and which are plumbing. If there’s a subcommand you feel you need to use, then go ahead and use it. My personal perspective on this is: if you’re ever unsure of what it is you’re doing you’re unlikely to use a subcommand.
Most users do not diverge from the well trodden path of:
git diff (with an occasional
What’s interesting about the plumbing subcommands is that some of them are used internally by git when you’re calling the porcelain subcommands (e.g.
git update-ref will be called by other porcelain commands such as
git add or
Note: although we’ll be looking at a couple of plumbing commands in this article, I’ll refer you to the git book for a look at the different plumbing commands available and how they’re used.
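As a rough illustration of what the plumbing layer looks like in practice, the sketch below builds a commit by hand using only plumbing subcommands. It is an approximation of the steps the porcelain commands perform, not a claim about git's exact internal call sequence; the repository, file name and identity are illustrative assumptions.

```shell
# Throwaway repository; nothing here touches an existing project.
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

# Write the content "foo" into the object database as a blob.
blob="$(echo foo | git hash-object -w --stdin)"

# Record that blob in the index under the path foo.txt.
git update-index --add --cacheinfo 100644 "$blob" foo.txt

# Turn the current index into a tree object.
tree="$(git write-tree)"

# Wrap the tree in a commit object (the message is read from stdin).
commit="$(echo "foo" | git commit-tree "$tree")"

# Point the current branch at the new commit.
git update-ref HEAD "$commit"

echo "blob:   $blob"
echo "tree:   $tree"
echo "commit: $commit"
git log --oneline
```

Note that no file called foo.txt was ever created in the working directory here: the plumbing commands operate directly on the object database, the index and the refs.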
When you start a new project that you want to use version control for, you’ll typically run the
git init subcommand:
git init [dir]
Most people will know that there is now a
.git directory created in the root of your project directory, but that’s about where their understanding of things stops.
Let’s see what’s initially inside the
.git directory of a new project…
$ tree .git/
.git/
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

8 directories, 15 files
OK, so there are some important directories and files here that we need to learn a bit about in order to appreciate how git works.
Note: I’m not going to explain every file and directory, only those necessary to understand the fundamentals.
Here are some interesting ones:
HEAD: contains a pointer to the tip of the current branch.
config: contains project-specific configuration options.
info: contains a repository-wide exclude file (it applies to this clone only and is not committed) †
objects: contains four types of ‘objects’ (commit, tree, blob, tag).
refs: contains pointers to ‘commit’ objects.
† this is separate from a local user’s
References and Objects
The two most important concepts in git are: references and objects.
For example, your branches, tags and remotes are all references to commits, while your commits, your files and your directories are all objects.
Git is built upon the simple premise of using ‘pointers’ to data, and these pointers are typically referred to as ‘references’ (or ‘refs’ for short).
This is what the
.git/refs directory stores: references.
As I mentioned earlier, these references all point to a ‘commit’ object…
remote   branch   tag
  |        |       |
  |        |       |
  |        V       |
  ------> commit <-----
           |
           |
           V
          tree
           |
           |
           V
          blob
Note: you can see from the above ascii graph that the ‘commit’ object itself points to a ’tree’ object, and that tree object points to a ‘blob’ object. We’ll dig into these reference ‘object’ types in more detail in the “Object Types” section.
It’s worth clarifying now that although we conceptually talk in terms of ‘branches’ in git, the internal directory structure (where references to branches are stored) uses the term ‘heads’ instead. It’s a terrible name (like most things in git’s lexicon), but it’s best to just accept it and move on.
The reason git uses ‘references’ is that they let users refer to a specific commit without having to remember the full SHA-1 hash.
Imagine wanting to checkout your master branch but instead of just executing
git checkout master you had to remember the specific hash.
git checkout b5d34b608ce697f0d20d011ee569529bca3feee8
Not very practical heh.
The HEAD reference
If you recall from earlier, we said the
HEAD file contains a pointer to the tip of the current branch.
If we were to look at the
.git/HEAD file we would find that by default it has the following content:
ref: refs/heads/master
You can see it’s a pointer to another location (the reference
.git/refs/heads/master), which means it’s a pointer to a pointer!
refs/heads/master is a reference file (which refers to our master branch), and the content of that file is a commit hash. So this is telling us that ultimately
HEAD is pointing to our
But at this point in time I’ve only executed
git init, and so I’ve not actually committed anything into git. This means that there isn’t actually a
master file inside of the
If we look back at the earlier directory tree (which we printed after running
git init), we’ll notice that although there is a
.git/refs/heads directory, there is no
master file. A file called
master won’t exist in that subdirectory until I make my first commit.
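The state of a freshly initialised repository can be checked with a short sketch. One assumption to flag: newer versions of git may pick a different default branch name, so the snippet explicitly points HEAD at master to match the walkthrough here.

```shell
# Throwaway repository with no commits yet.
repo="$(mktemp -d)" && cd "$repo"
git init -q

# Assumption: force the unborn branch to be called 'master' so the output
# matches this article regardless of the local default branch setting.
git symbolic-ref HEAD refs/heads/master

head_content="$(cat .git/HEAD)"          # raw file content
head_target="$(git symbolic-ref HEAD)"   # the ref that HEAD points to

echo "$head_content"
echo "$head_target"

# The branch file itself does not exist until the first commit is made.
if [ -e .git/refs/heads/master ]; then
  branch_file="present"
else
  branch_file="missing"
fi
echo "refs/heads/master is $branch_file"
```

So HEAD can point at a branch that has no reference file yet, which is exactly the situation immediately after git init.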
Note: if you recall from earlier I said that the
refs/heads subdirectory was essentially a synonym for ‘branches’ created locally for this project. Hence, the default file referenced by the
master (because it’s referencing the
Let’s now create a commit so that we can see a
refs/heads/master file and what it points to…
$ echo foo > foo.txt
$ git add foo.txt
$ git commit -m "foo"
[master (root-commit) b5d34b6] foo
 1 file changed, 1 insertion(+)
 create mode 100644 foo.txt
Once we do this we’ll find git has created a
master file inside of
.git/refs/heads and the contents of that file is the hash of my first commit (which indicates that the
master reference file, or ‘branch’, is pointing at a specific commit snapshot):
When you execute a command (such as)
git checkout master, internally git will resolve
refs/heads/master and that is what tells git which commit object to now point to.
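The pointer chain described above (HEAD, then the branch file, then the commit) can be followed by hand. This is a minimal sketch in a throwaway repository; the identity and file name are illustrative, and the branch is forced to master as an assumption so the paths match this article.

```shell
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # assumption: match the article's branch name
git config user.name "Example" && git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt && git commit -q -m "foo"

ref_file="$(cat .git/refs/heads/master)"        # raw content of the branch file
resolved="$(git rev-parse refs/heads/master)"   # git resolving the same reference
head="$(git rev-parse HEAD)"                    # HEAD -> refs/heads/master -> commit

echo "branch file: $ref_file"
echo "rev-parse:   $resolved"
echo "HEAD:        $head"
```

All three values are the same commit hash: the branch file is nothing more than a text file holding that hash.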
Subcommands and References
Although a reference is a pointer to a commit hash, it doesn’t mean every form of a reference behaves the same way within every git subcommand.
Here is an example subcommand that works fine with a reference:
git log. We can use
git log origin/master, and git will know to internally resolve that reference to the fully qualified path
Knowing that, we would also know that it is possible to use a partial reference path such as
git log refs/remotes/origin/master or maybe
git log remotes/origin/master.
All these variations work fine, but we typically use
git log origin/master for convenience (because it’s less typing).
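To see that these spellings resolve to the same thing, here is a small sketch. As an assumption for the sake of a self-contained example, it fakes the remote-tracking reference locally with git update-ref rather than fetching from a real remote.

```shell
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt && git commit -q -m "foo"

# Assumption: create refs/remotes/origin/master by hand instead of fetching.
git update-ref refs/remotes/origin/master HEAD

short="$(git rev-parse origin/master)"
partial="$(git rev-parse remotes/origin/master)"
full="$(git rev-parse refs/remotes/origin/master)"

# Ask git which fully qualified reference the short name expands to.
full_name="$(git rev-parse --symbolic-full-name origin/master)"

echo "$short"
echo "$partial"
echo "$full"
echo "$full_name"
```

The three hashes are identical, and the last line shows the fully qualified name git settled on for the short form.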
But using a shortened ‘reference’ isn’t possible with commands like
git checkout and
git pull for different reasons. With
git pull if we look at
man git-pull we see we need to provide a
<repository> <refspec> and that means the refspec we provide will be scoped to
If I look at
.git/refs/remotes/ I’ll see only a single directory
origin, and inside of that are all the branches (i.e. refspecs) for the
origin remote. So if I attempted to do something like
git pull origin HEAD this wouldn’t work because there’s a
HEAD file inside of that
origin directory (and it points to a different commit from our local
This means we’d end up trying to pull the changes from the remote
master!! Which happens because
HEAD on the remote is set up to track the
$ git remote show origin
* remote origin
  Fetch URL: email@example.com:example/repo.git
  Push URL: firstname.lastname@example.org:example/repo.git
  HEAD branch: master
  Remote branches:
    ...
So subsequently doing
git pull origin HEAD would bring in lots of unexpected changes to your local branch 😬
HEAD isn’t a problem when doing something like
git push origin HEAD because it’s a fundamentally different operation and so git knows to reference the local
HEAD file to get the commit range before pushing to the remote.
Similarly, a fully qualified ‘reference’ path isn’t interchangeable with a branch name in a command like
git checkout as its internal logic will cause a
detached HEAD state (e.g. if you were to do something like
git checkout refs/heads/master instead of
git checkout master).
Let’s now understand what a ‘detached HEAD’ means, and why
git checkout would cause that when using a qualified reference path…
Internally git does recognize the reference and can resolve it to the appropriate
.git/refs directory, but the behaviour of the checkout command changes when checking out a reference that is a qualified path such as
refs/heads/master. What you would discover is you don’t checkout the branch but are placed into a ‘detached HEAD’ at the relevant commit.
Why is that? Well, if we look at the documentation for the checkout subcommand (
man git-checkout) we would discover…
if it (the given branch name) refers to a branch (i.e., a name that, when prepended with “refs/heads/”, is a valid ref), then that branch is checked out. Otherwise, if it refers to a valid commit, your HEAD becomes “detached” and you are no longer on any branch.
git checkout master means you’ve given an identifier (i.e.
master) that git can internally resolve to
refs/heads/master and thus git will happily checkout that branch, while
git checkout refs/heads/master is a direct reference that git first resolves to a commit.
Hence it’s like you had actually run the subcommand
git checkout <commit-hash>, and so git puts you into a detached HEAD state.
If you’re unfamiliar with what a ‘detached HEAD’ state is, then it simply means the
HEAD file no longer is pointing at a reference such as
.git/refs/heads/master but directly to a commit hash. The purpose of a detached HEAD is to allow you to do work off a branch.
I’ve never had a need to work ‘off’ a branch (
:shrugs:) and so I can only presume there are situations where you would want to do that.
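The difference between the two checkout forms can be observed directly. The following is a minimal sketch in a throwaway repository (identity and file name are illustrative, and the branch is forced to master as an assumption); git symbolic-ref -q HEAD fails when HEAD is not a symbolic reference, which is used here to detect the detached state.

```shell
repo="$(mktemp -d)" && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # assumption: match the article's branch name
git config user.name "Example" && git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt && git commit -q -m "foo"

# Qualified path: git resolves it to a commit, so HEAD becomes detached.
git checkout -q refs/heads/master
if git symbolic-ref -q HEAD >/dev/null; then
  state_qualified="attached"
else
  state_qualified="detached"
fi
head_raw="$(cat .git/HEAD)"   # now a bare commit hash, not "ref: ..."

# Branch name: HEAD points at refs/heads/master again.
git checkout -q master
if git symbolic-ref -q HEAD >/dev/null; then
  state_branch="attached"
else
  state_branch="detached"
fi

echo "git checkout refs/heads/master -> $state_qualified ($head_raw)"
echo "git checkout master            -> $state_branch ($(cat .git/HEAD))"
```

Both commands land on the same commit; the only difference is what the HEAD file contains afterwards.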
OK, now that we have our first commit let’s dig a little deeper into the ‘objects’ git defines, and how the
.git directory structure has changed…
There are four main types of objects in git:
Note: we’ll primarily be covering the first three object types.
Since we committed a single file into git there have been a few new files and directories created:
index: a binary file containing a sorted list of path names.
COMMIT_EDITMSG: temporary file used to store latest commit message.
foo.txt file (type: blob)
objects/b5/d34b608ce697f0d20d011ee569529bca3feee8: commit message data (type: commit)
objects/fc/f0be4d7e45f0ef9592682ad68e42270b0366b4: directory tree (type: tree)
You’ll notice that the new objects are stored in a subdirectory which uses the first two characters from the hash of the object’s contents.
For example, the
foo.txt blob object’s content was hashed into
257cc5642cb1a054f08cc83f2d943e56fd3ebe99. Next git took the first two characters
25 and made a subdirectory, and then moved the object into that directory while naming the object file using the remaining characters (i.e.
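The mapping from hash to file path can be reproduced with a short sketch. It writes the same content as foo.txt into a throwaway repository using git hash-object; the variable names are purely illustrative.

```shell
repo="$(mktemp -d)" && cd "$repo"
git init -q

# Store the content "foo" (plus newline) as a blob and capture its hash.
hash="$(echo foo | git hash-object -w --stdin)"

# First two characters name the subdirectory, the rest name the file.
dir="$(printf '%s' "$hash" | cut -c1-2)"
file="$(printf '%s' "$hash" | cut -c3-)"
path=".git/objects/$dir/$file"

echo "hash: $hash"
echo "path: $path"
ls "$path"
```

Because the hash depends only on the content (plus a small header git prepends), the same content always produces the same hash and therefore the same path in every repository.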
In order to look at these files you’ll need a couple different plumbing commands:
git ls-files and
Let’s start with the
index is a binary file which tracks our working directory and our staging area (use
--stage flag to see staging area). The index enables fast comparisons between the tree object it defines and the working tree.
We’ll need to use
git ls-files in order to read the contents:
$ git ls-files
foo.txt
It only has
foo.txt tracked, which is correct. There are no other files or directories at this point in time (we’ll add more as we go).
To look at the different ‘objects’ we’ll use the
git cat-file command which decompresses the file and displays the file contents (we’ll use the
-t flag to return the ’type’ and the
-p flag to ‘print’ the contents).
Note: we don’t provide the path (e.g.
objects/../...) as the argument, but the sha itself (shortened sha is acceptable too).
$ git cat-file -t 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
blob
$ git cat-file -p 257cc5642cb1a054f08cc83f2d943e56fd3ebe99
foo
$ git cat-file -t b5d34b608ce697f0d20d011ee569529bca3feee8
commit
$ git cat-file -p b5d34b608ce697f0d20d011ee569529bca3feee8
tree fcf0be4d7e45f0ef9592682ad68e42270b0366b4
author Integralist <email@example.com> 1585480397 +0100
committer Integralist <firstname.lastname@example.org> 1585480397 +0100

foo
$ git cat-file -t fcf0be4d7e45f0ef9592682ad68e42270b0366b4
tree
$ git cat-file -p fcf0be4d7e45f0ef9592682ad68e42270b0366b4
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99    foo.txt
What’s also interesting is that when you execute a command such as
git add, git will ‘conceptually’ copy the file to your staging area, but internally it has created a ‘blob’ object. A command such as
git commit then creates the ‘commit’ and ‘tree’ objects that reference the already existing ‘blob’ object. I mention this because I want to be clear that these three objects don’t all get created at the same time.
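The timing of object creation can be checked by counting the files under .git/objects after each step. This is a rough sketch in a fresh throwaway repository (identity and file name are illustrative); it assumes loose objects only, which is what a brand-new repository contains.

```shell
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt

# After 'git add': the blob already exists, but there is no commit or tree.
blob="$(git rev-parse :foo.txt)"   # hash recorded in the index for foo.txt
type_after_add="$(git cat-file -t "$blob")"
objects_after_add="$(find .git/objects -type f | wc -l | tr -d ' ')"

git commit -q -m "foo"

# After 'git commit': a tree and a commit have joined the existing blob.
objects_after_commit="$(find .git/objects -type f | wc -l | tr -d ' ')"

echo "after add:    $objects_after_add object(s), $blob is a $type_after_add"
echo "after commit: $objects_after_commit object(s)"
```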
Snapshots, Not Differences
We saw earlier an ascii graph that indicated the hierarchy of these objects. It showed that git reference types (e.g. remotes, branches and tags) all point to a ‘commit’ object. This commit object will include a pointer to a ’tree’ object, and the tree object is a list of files (i.e. blobs) and directories (i.e. more trees).
It’s this graph that builds up the entire snapshot of the repository. This is why you shouldn’t think of a git commit as being a patch or set of changes to a bunch of files, but instead should see each commit as a complete snapshot of your entire project at a singular point in time.
If any files or directories change, then the hashes of their objects will change, and thus the new HEAD commit will reference a different tree containing different
blob objects for the changed files (resulting in a different hash-tree graph).
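One consequence of the snapshot model is that an unchanged file keeps the same blob across commits, while the tree that lists it gets a new hash. A minimal sketch, using an illustrative throwaway repository with two commits:

```shell
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt && git commit -q -m "foo"

echo bar > bar.txt
git add bar.txt && git commit -q -m "bar"

# foo.txt did not change between the two commits...
foo_old="$(git rev-parse HEAD~1:foo.txt)"
foo_new="$(git rev-parse HEAD:foo.txt)"

# ...but the root tree did, because it now lists bar.txt as well.
tree_old="$(git rev-parse 'HEAD~1^{tree}')"
tree_new="$(git rev-parse 'HEAD^{tree}')"

echo "foo.txt blob: $foo_old -> $foo_new"
echo "root tree:    $tree_old -> $tree_new"
git ls-tree HEAD
```

Each commit is still a full snapshot, but snapshots are cheap because identical content is stored once and simply referenced again.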
With that in mind, let’s start by looking at the commit object we have (
git cat-file -p b5d34b6). We can see the first line says
tree followed by a hash (all other information is the typical commit information you’re used to seeing when you run
If we look at the tree object
git cat-file -p fcf0be4 (which the commit object linked to), then we can see it consists of a single line: a blob object with its hash and its filename
foo.txt (this makes sense as our project only contains this single file).
Lastly, let’s look at the blob object
git cat-file -p 257cc56 (which the tree object linked to), then we can see the contents of that blob object is the contents of the
foo.txt file itself.
OK, so what happens if I add a new file
bar.txt and a new subdirectory
baz with another file
qux.txt within that subdirectory…
$ tree
.
├── bar.txt
├── baz
│   └── qux.txt
└── foo.txt

1 directory, 3 files
Once I add
baz/qux.txt and commit it I then inspect the new objects in my
.git/objects folder. From there I locate the commit object (I do that by looking at the
.git/refs/heads/master and seeing what commit hash it has) and once I
cat-file -p that hash, I follow its
$ git cat-file -p edc6771b338b472d901358e530db7cede202c1c7
100644 blob 5716ca5987cbf97d6bb54920bea6adde242d87e6    bar.txt
040000 tree 3d15e426c95bac2548d7255af9c5e240df786e03    baz
100644 blob 257cc5642cb1a054f08cc83f2d943e56fd3ebe99    foo.txt
$ git cat-file -p 3d15e426c95bac2548d7255af9c5e240df786e03
100644 blob 100b0dec8c53a40e4de7714b2c612dad5fad9985    qux.txt
We can see from the above output that the tree object not only includes my project files, but now a
baz directory (itself a tree object). Looking at that tree object shows there is one file inside of it (a blob object for
If we review the
index file again we’ll see our new set of files/directories:
$ git ls-files
bar.txt
baz/qux.txt
foo.txt
Along the way I’ve been tagging my commits. A tag (as far as git internals are concerned) is another ‘object’ type. Let’s look at my tags:
$ git tag -n
v1              foo
v2              an anotated tag
So we can see I have two separate tags, and each one points at a different commit (the v1 tag was a lightweight tag and so the associated
foo comes from the commit message, while the v2 tag was an annotated tag and so the message I gave at that point was displayed).
In order to see the commit that a tag is associated with, we’ll need another plumbing subcommand
$ git rev-list -n 1 v1
b5d34b608ce697f0d20d011ee569529bca3feee8
$ git rev-list -n 1 v2
0b56156eba23ae9bee8c32137605397cf7c9e88e
But for us to see what the ’tag’ object type looks like internally, we need to get the hash that the tag reference file is set to:
$ cat .git/refs/tags/v1
b5d34b608ce697f0d20d011ee569529bca3feee8
$ cat .git/refs/tags/v2
75d37b7c37173def7a0a8cd43d674edc8e9ce614
Once we have that hash we can use
cat-file to see the ’tag’ object:
$ git cat-file -t 75d37b7c37173def7a0a8cd43d674edc8e9ce614
tag
$ git cat-file -p 75d37b7c37173def7a0a8cd43d674edc8e9ce614
object 0b56156eba23ae9bee8c32137605397cf7c9e88e
type commit
tag v2
tagger Integralist <email@example.com> 1585592962 +0100

an anotated tag
OK, so you may have noticed I used
cat-file on the v2 (annotated) tag, but not on the v1 (lightweight) tag. That was not an accidental omission.
A lightweight tag is just a reference to a commit hash, but an annotated tag is more complex and so a ’tag object’ is created, and we can see that when we inspect the hash inside the v2 tag reference.
We can see the tag object includes a pointer to the ‘commit’ object (
0b56156eba23ae9bee8c32137605397cf7c9e88e) as well as information about the ’tagger’ (in this case me!)
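The difference between the two kinds of tag is easy to reproduce. In this sketch both tags are placed on the same commit for brevity (unlike the repository above, where they sit on different commits); the repository, identity and tag message are illustrative assumptions.

```shell
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo foo > foo.txt
git add foo.txt && git commit -q -m "foo"
head="$(git rev-parse HEAD)"

git tag v1                             # lightweight: just a reference
git tag -a v2 -m "an annotated tag"    # annotated: creates a tag object

v1_type="$(git cat-file -t v1)"
v2_type="$(git cat-file -t v2)"

v1_ref="$(cat .git/refs/tags/v1)"      # holds the commit hash itself
v2_ref="$(cat .git/refs/tags/v2)"      # holds the hash of the tag object

# Peel the annotated tag back to the commit it points at.
v2_commit="$(git rev-parse 'v2^{commit}')"

echo "v1: type=$v1_type ref=$v1_ref"
echo "v2: type=$v2_type ref=$v2_ref -> commit $v2_commit"
```

The lightweight tag file contains the commit hash directly, whereas the annotated tag file contains the hash of a separate tag object that in turn points at the commit.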
When you add a remote like so:
git remote add origin firstname.lastname@example.org:Integralist/dotfiles.git
We can now look at the configuration of our remote:
$ git remote show origin
* remote origin
  Fetch URL: email@example.com:Integralist/dotfiles.git
  Push URL: firstname.lastname@example.org:Integralist/dotfiles.git
  HEAD branch: master
  Remote branches:
    linux                                new (next fetch will store in remotes/origin)
    master                               new (next fetch will store in remotes/origin)
    minimal-mac-version-of-linux-version new (next fetch will store in remotes/origin)
  Local ref configured for 'git push':
    master pushes to master (local out of date)
You might be confused though if you were to look at
.git/refs and don’t see a
remotes subdirectory. This happens automatically if you clone an existing repository, but it’ll also be created when executing
git fetch after manually adding a new remote to an existing repository.
I added my new
origin remote (see above), but it was only once I had executed a
git fetch that I was able to see a ‘remote’ reference:
refs/
| remotes/
| | origin/
| | | master
If I inspect the
.git/refs/remotes/origin/master file, then I’ll see the latest commit my remote
master branch is on. It’s also interesting to remember what we mentioned earlier about references that point to commits being interchangeable with commit hashes in various subcommands.
git diff allows you to specify two branches to compare against each other (remember a branch is just a reference file that points to a commit hash), and so you might want to compare your local
master against your remote
git diff master..origin/master
This is just a shortened way of doing:
git diff master..refs/remotes/origin/master
Which itself is just a shortened way of doing:
git diff master..c3865b72b019ced930cfc601b09b874685c29e72
Note: one last thing I wanted to mention (and there was no other place really to mention this) is that git comes with a UI! You can execute the command
gitk to use it.