Git Repository – Solving Problems with Bare Repos

I think I understand the practical difference between a bare and a non-bare repo in Git, but I really don't get why this distinction logically exists: why did Git have to implement the concept of bare and non-bare repos? I know there are already tons of threads and articles about the topic, but I am missing some concrete examples to fully understand it.

To recap, the practical difference (i.e. in terms of files) between a non-bare and a bare repo should be the following:

A non-bare repo is a combination of:
1. a .git folder: a "special" folder that Git uses to store all the data (e.g. blobs holding the contents of the files in the project it is versioning) and metadata (e.g. the commit history) it needs to work properly.
2. a working tree: the actual files and folders that make up our project. One crucial thing to keep in mind is that the working tree isn't the only place where the content of your project is stored. Git also saves the project data inside the .git folder, in a special format, in internal storage such as .git/objects. The working tree exists so that everything that is not Git can work on (hence the name working tree) and edit the project files and folders.
A bare repo is:
1. the contents of the .git subdirectory, placed directly in the main directory itself
2. no working tree
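To make the file-level difference concrete, here is a minimal sketch that creates one repository of each kind in a scratch directory and compares their layout. The directory names nonbare and bare.git are made up for the demo.

```shell
set -eu
work=$(mktemp -d)   # scratch directory; every name below is hypothetical
cd "$work"

git init -q nonbare           # working tree plus a nested .git folder
git init -q --bare bare.git   # only the Git data, no working tree

ls -A nonbare   # the only entry is the .git folder
ls bare.git     # HEAD, config, objects, refs, ... sit at the top level

nonbare_is_bare=$(git -C nonbare rev-parse --is-bare-repository)
bare_is_bare=$(git --git-dir=bare.git rev-parse --is-bare-repository)
echo "nonbare: $nonbare_is_bare, bare.git: $bare_is_bare"
```

The bare repository is addressed with --git-dir rather than by changing into it, which keeps the sketch working even where implicit discovery of bare repositories has been restricted.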

The question is: why do I need an intermediate bare repo to conveniently sync two non-bare repos? A lot of threads and articles answer that not having a central bare repo would lead the central working tree to go out of sync (see here). OK, but why? Can someone provide a concrete example?

The situation I can imagine is the following:

A is a local repo, B is a local repo and C is the central remote repo. All are non-bare.
A makes a commit c and pushes it to C. C updates its .git folder and its working tree according to the changes in c. Important: "updating the working tree" in my mind means replacing the C working tree (i.e. C's files and folders) with the A working tree.
B pulls the changes from C and updates its .git folder and working tree according to the changes in c.

How can a situation like the one described above make the working tree of C go out of sync? What does it even mean for C to go out of sync?

The only real advantage I have understood so far is that, for services like GitHub or GitLab, not maintaining a working tree (i.e. having a bare repo) for each repo and each branch saves a lot of storage space. They can reconstruct the working tree on the fly using Git's tools.

Best Answer

It's relatively simple, really. A bare repository has no working tree, therefore it cannot have an active checkout.¹ And, as you've seen elsewhere, the issue is that pushing to an active checkout of some branch results in an out-of-sync checkout. So Git forbids pushing to the checked-out branch.² By the fact that a bare repository has no working tree, and therefore no checked-out branch, a bare repository sidesteps the problem.

What does it even mean that [the non-bare central repo] goes out of sync?

Let's dispense with the third machine: we only need client, a non-bare repository, and server, the repository that should be but isn't bare.

On server, branch main is actively checked-out. Someone may or may not be logged in to server and editing files there.

Meanwhile, on client, you've made some new commit and you run git push and send the new commit to server. If the server accepts this commit, there are now two possibilities:

server's Git repository doesn't update the checked-out working tree, or
server's Git repository does update the checked-out working tree.

Both situations can produce a bad outcome. Before we start on that, let's explore Git's workings a bit.

¹This was true before git worktree add was added, and now it isn't. So the simplicity was there until Git 2.5, and now it isn't.

²This was true in original Git, before the invention of various configuration items. Now it isn't. So the simplicity was there once, and now it isn't. (The receive.denyCurrentBranch stuff happened before the git worktree command, but I don't recall offhand which version that was.)
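The refusal described above can be reproduced locally. The following is a sketch under a few assumptions: server and client are hypothetical directory names, both live on one machine, and receive.denyCurrentBranch is set to refuse explicitly (this is the documented default) so the result does not depend on local configuration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# "server": a non-bare repository with branch main checked out.
git init -q server
git -C server symbolic-ref HEAD refs/heads/main
echo v1 > server/file.txt
git -C server add file.txt
git -C server commit -q -m "initial commit"
git -C server config receive.denyCurrentBranch refuse

# "client": an ordinary clone that makes one new commit.
git clone -q server client
echo v2 > client/file.txt
git -C client commit -q -am "change from client"

# The push targets the branch that is checked out on the server.
if git -C client push -q origin main 2>push.err; then
    push_result=accepted
else
    push_result=rejected
fi
echo "push was $push_result"
cat push.err   # the server explains why it refused
```

Running the same push against a repository created with git init --bare goes through, because there is no checked-out branch to protect.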

Git is about commits; commits are numbered; branch names find commits

A Git repository consists mainly of two databases, one usually much larger than the other. The larger database contains commits and supporting Git objects. The smaller database contains names, such as branch and tag names.

The commit objects are numbered, with the numbers expressed as hexadecimal hash IDs. Git needs the hash ID to find the commit: the big database is indexed solely by hash ID.

A commit itself contains two things:

a full snapshot of every file, as of the state it should have for that commit; and
metadata: information about the commit itself, such as who made it, when, and why (a log message).

In the metadata for any given commit, Git stores the raw hash ID(s) of the parent or parents of that commit. A commit therefore has a list of previous-commit hash IDs, stored in its metadata. This forms the history in the repository.

To be able to obtain the latest commit for any given branch, Git stores, in the branch name (e.g., refs/heads/main), the raw hash ID of the latest commit. That commit contains, in its metadata, the hash ID of the previous (parent) commit, which in turn contains another hash ID for another parent, and so on.
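A small sketch of this chain, using a throwaway repository (the names repo and file.txt are made up): the branch name resolves to the latest commit, and that commit's metadata names its parent.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main

echo one > file.txt
git add file.txt
git commit -q -m "first"
first=$(git rev-parse HEAD)

echo two > file.txt
git commit -q -am "second"

# The branch name holds the hash ID of the latest commit ...
tip=$(git rev-parse refs/heads/main)
# ... and that commit's metadata stores the hash ID of its parent.
parent=$(git rev-parse "$tip^")

git cat-file -p "$tip"   # prints the tree, parent, author, committer and message
echo "main -> $tip"
echo "parent -> $parent"
```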

When we use git checkout or git switch with a branch name, we're telling Git: extract the latest commit for that branch. That's the one whose hash ID is stored in the branch name. So with git switch main, Git looks up refs/heads/main, finds a hash ID such as a123456..., and looks up that commit in the database. That commit has a set of files associated with it. Git copies those files out of the commit—the ones in the commit aren't generally usable by the OS, as they're in a read-only, compressed, Git-only, de-duplicated form—to your working tree.

But, Git also copies the files—or rather, information about the files (names and blob hash IDs)—into Git's index, which goes along with the working tree. This defines which files are tracked, helps Git go fast, and is generally needed to know what to put in the next commit.

Once this is all in place, Git sets up the special name HEAD to contain the branch name. (In original Git, this was a symbolic link to the refs/heads/main file, but as with many bits of Git, that was done away with more than a decade ago.)

There's now a group of well-defined, carefully-coordinated data:

HEAD contains the current branch name;
the branch name contains the hash ID;
Git's index contains the file names and blob hash IDs for the next commit, tracking the files in the working tree; and
the working tree contains the files copied out of the commit.
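Each of the four pieces listed above can be inspected with ordinary Git commands. This is only an illustrative sketch in a scratch repository; the names repo and file.txt are hypothetical.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
echo hello > file.txt
git add file.txt
git commit -q -m "first"

head_ref=$(git symbolic-ref HEAD)           # HEAD -> current branch name
branch_hash=$(git rev-parse "$head_ref")    # branch name -> commit hash ID
index_entry=$(git ls-files -s file.txt)     # index: mode, blob hash, stage, path
committed_blob=$(git rev-parse HEAD:file.txt)
worktree_blob=$(git hash-object file.txt)   # blob hash of the working-tree copy

echo "HEAD      : $head_ref"
echo "branch    : $branch_hash"
echo "index     : $index_entry"
echo "work tree : $worktree_blob"
```

Right after a checkout or commit, the blob hash recorded in the index, the one in the commit, and the one computed from the working-tree file all agree. That agreement is the coordination the rest of the answer relies on.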

You work on the files, run git add to tell Git to update what's in Git's index, and eventually run git commit. At this point Git:

reads HEAD and then the branch name to find the current commit;
collects all necessary metadata;
packages up the index's content;
writes all of this to a new commit, which gets a new unique hash ID; and
writes the hash ID into the branch name.
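The steps above can be approximated by hand with Git's plumbing commands. This is a rough sketch of what git commit does, not its actual implementation; the repository and file names are made up.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "first"

echo two > file.txt
git add file.txt                       # update what is in the index

head_ref=$(git symbolic-ref HEAD)      # read HEAD to find the branch name
old=$(git rev-parse "$head_ref")       # branch name -> current commit
tree=$(git write-tree)                 # package up the index's content
new=$(git commit-tree -p "$old" -m "second" "$tree")   # write the new commit
git update-ref "$head_ref" "$new" "$old"   # store the new hash ID in the branch name

git log --oneline
```

The third argument to git update-ref makes the update conditional on the branch still pointing at the old commit, which is one way to notice that the branch moved underneath us.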

The carefully-coordinated data lets Git do all of this, and is still carefully-coordinated.

If someone else commits while you're working ...

Now suppose we're on the server, working, and someone on the client commits. That's not a problem because the client Git repository has its own branch names. It gets a new commit in its commit database, and its branch name main stores a new hash ID. But over here on server, our Git databases are unchanged.

But if they now run git push origin main and send their commit, our Git has to either accept their commit or reject it. If we reject it, that's fine: our databases remain unchanged and everything is still coordinated.

Let's say that instead, though, we accept the push. The server Git updates refs/heads/main to store their commit hash ID. Our two possibilities are:

don't update index and working tree;
do update index and working tree.

If we choose option 1, then we have a "stale checkout": our files are from the previous commit, but the branch name holds the new commit hash ID. So we're out of sync. If we update any files and then commit, we'll revert the other person's work (remember that our Git software uses what's in our index, which matches our working tree). That's not great, so let's move on to option 2.
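The stale checkout can be reproduced by telling the server to accept the push without touching its index or working tree. A sketch, assuming local directories named server and client (both hypothetical) and receive.denyCurrentBranch set to ignore:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q server
git -C server symbolic-ref HEAD refs/heads/main
echo v1 > server/file.txt
git -C server add file.txt
git -C server commit -q -m "initial commit"
# Accept pushes to the checked-out branch, but leave index and working tree alone.
git -C server config receive.denyCurrentBranch ignore

git clone -q server client
echo v2 > client/file.txt
git -C client commit -q -am "change from client"
git -C client push -q origin main

server_tip=$(git -C server rev-parse main)
client_tip=$(git -C client rev-parse main)
server_file=$(cat server/file.txt)
staged=$(git -C server diff --cached --name-only)

echo "server branch now at : $server_tip"
echo "server file still has: $server_file"
echo "staged on server     : $staged"
```

The server's branch has moved to the client's commit, yet the file on disk is unchanged, and Git reports file.txt as staged. Committing on the server at this point would record the old content on top of the pushed commit, which is the revert described above.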

If we choose option 2, our files get ripped away from us and replaced. Our index and working tree are re-synchronized with the updated branch name. That's better ... except, if we're actively working on some file, what happens to our work? Maybe our editor notices that the underlying file has changed and gives us a chance to fix things. Maybe it just overwrites the underlying file. Either way, it's likely to be a problem.

So, updating the working tree of a server's repository is perhaps better than not doing that, and that's what receive.denyCurrentBranch's updateInstead setting does. It's not perfect, though. "Perfect" is simply not having a working tree, so that nothing can go wrong, and we get that with --bare.
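Here is a sketch of the updateInstead behaviour, assuming Git 2.3 or later and the same hypothetical server and client directories. It also shows one safeguard that, per the git config documentation, applies by default: the push is refused when the server's working tree or index differs from HEAD.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q server
git -C server symbolic-ref HEAD refs/heads/main
echo v1 > server/file.txt
git -C server add file.txt
git -C server commit -q -m "initial commit"
git -C server config receive.denyCurrentBranch updateInstead

git clone -q server client
echo v2 > client/file.txt
git -C client commit -q -am "change from client"
git -C client push -q origin main

after_push=$(cat server/file.txt)           # working tree followed the push
clean=$(git -C server status --porcelain)   # empty when nothing is out of sync
echo "server file after push: $after_push"

# Now someone edits a file on the server without committing ...
echo "local edit" >> server/file.txt
echo v3 > client/file.txt
git -C client commit -q -am "second change"
# ... and the next push is refused instead of overwriting that work.
if git -C client push -q origin main 2>/dev/null; then
    dirty_push=accepted
else
    dirty_push=rejected
fi
echo "push onto dirty working tree: $dirty_push"
```

So the setting trades silent overwrites for rejected pushes whenever someone has uncommitted work on the server. A bare repository avoids both outcomes.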

Related Solutions

Git – Why Can’t Push to Bare Repository

Yes, the problem is that there are no commits in "bare". This is a problem with the first commit only, if you create the repos in the order (bare,alice). Try doing:

git push --set-upstream origin master

This would only be required the first time. Afterwards it should work normally.
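A minimal sketch of that first push, with made-up directory names bare.git and alice. The branch is forced to master only to match the command above.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare bare.git
git clone -q bare.git alice    # Git warns that the cloned repository is empty
cd alice
git symbolic-ref HEAD refs/heads/master   # match the branch name used above

echo hello > readme.txt
git add readme.txt
git commit -q -m "first commit"

# First push: name the branch and record origin/master as its upstream.
git push -q --set-upstream origin master

upstream=$(git rev-parse --abbrev-ref "master@{upstream}")
echo "master now tracks $upstream"
```

Once the upstream is recorded, later pushes from this branch have a destination even when none is named on the command line.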

As Chris Johnsen pointed out, you would not have this problem if your push.default was customized. I like upstream/tracking.

Git Bare Repository – What is a bare repository and why would I need one?

Is there the whole data of a repository always within .git directory (or in a bare repo), in some kind of format which is able to render all files at any time?

Yes, those files and their complete history are stored in .git/packed-refs and .git/refs, and .git/objects.

When you clone a repo (bare or not), you always have the .git folder (or a folder with a .git extension for bare repo, by naming convention) with its Git administrative and control files. (see glossary)

Git can unpack at any time what it needs with git unpack-objects.

The trick is:

From a bare repo, you can query the logs (git log in a git bare repo works just fine: no need for a working tree), or list files in a bare repo.
Or show the content of a file from a bare repo.
That is how GitHub can render a page with files without having to check out the full repo.
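These queries can be tried on any bare clone. The sketch below builds one from a scratch repository (src, bare.git and file.txt are hypothetical names) and reads history, file names and file content without ever creating a working tree.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q src
git -C src symbolic-ref HEAD refs/heads/main
echo hello > src/file.txt
git -C src add file.txt
git -C src commit -q -m "first"

git clone -q --bare src bare.git

git --git-dir=bare.git log --oneline                              # query the logs
files=$(git --git-dir=bare.git ls-tree -r --name-only main)       # list files
content=$(git --git-dir=bare.git show main:file.txt)              # show one file

echo "files in main: $files"
echo "file.txt says: $content"
```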

I don't know that GitHub does exactly that, though, as the sheer number of repos forces the GitHub engineering team to do all kinds of optimizations.
See for instance how they optimized cloning/fetching a repo.
With DGit, those bare repos are actually replicated across multiple servers.

Is this the reason of bare repository, while working copy only has the files at a given time?

For GitHub, maintaining a working tree would cost too much in disk space and in updates (when each user requests a different branch). It is best to extract from the unique bare repo what you need to render a page.

In general (outside of GitHub constraint), a bare repo is used for pushing, in order to avoid having a working tree out of sync with what has just been pushed. See "but why do I need a bare repo?" for a concrete example.

That being said:

since Git 2.3, you can push to a non-bare repo (with receive.denyCurrentBranch set to updateInstead, which updates the working tree accordingly)
since Git 2.4, you can "push-to-deploy" (i.e. it works for an unborn branch as well)

But that would not be possible for GitHub, which cannot maintain one (or several) working tree(s) for each repo it has to store.

The article "Using a bare Git repo to get version control for my dot files" from Greg Owen, originally reported by aifusenno1 adds:

A bare repository is a Git repository that does not have a snapshot.
It just stores the history. It also happens to store the history in a slightly different way (directly at the project root), but that’s not nearly as important.

A bare repository will still store your files (remember, the history has enough data to reconstruct the state of your files at any commit).
You can even create a non-bare repository from a bare repository: if you git clone a bare repository, Git will automatically create a snapshot for you in the new repository (if you want a bare repository, use git clone --bare).

And Greg adds:

So why would we use a bare Git repository?

Almost every explanation I found of bare repositories mentioned that they’re used for centralized storage of a repository that you want to share between multiple users.

See Git repository layout:

A <project>.git directory that is a bare repository (i.e. without its own working tree), that is typically used for exchanging histories with others by pushing into it and fetching from it.

Basically, if you wanted to write your own GitHub/GitLab/BitBucket, your centralized service would store each repo as a bare repository.
But why? How does not having a snapshot connect to sharing?

The answer is that there’s no need to have a snapshot if the only service that’s interacting with your repo is Git.
Basically, the snapshot is a convenience for humans and non-Git tools, but Git only interacts with the history. Your centralized Git hosting service will only interact with the repos through Git commands, so why bother materializing snapshots all the time? The snapshots only take up extra space for no gain.

GitHub generates that snapshot on the fly when you access that page, rather than storing it permanently with the repo (this means that GitHub only needs to generate a snapshot when you ask for it, rather than keeping one updated every time anybody pushes any changes).

Git 2.38 (Q3 2022) introduces a safe.bareRepository configuration variable that allows users to forbid discovery of bare repositories.

See commit 8d1a744, commit 6061601, commit 5b3c650, commit 779ea93, commit 5f5af37 (14 Jul 2022) by Glen Choo (chooglen).
(Merged by Junio C Hamano -- gitster -- in commit 18bbc79, 22 Jul 2022)

setup.c: create safe.bareRepository

Signed-off-by: Glen Choo

There is a known social engineering attack that takes advantage of the fact that a working tree can include an entire bare repository, including a config file.
A user could run a Git command inside the bare repository thinking that the config file of the 'outer' repository would be used, but in reality, the bare repository's config file (which is attacker-controlled) is used, which may result in arbitrary code execution.
See this thread for a fuller description and deeper discussion.

A simple mitigation is to forbid bare repositories unless specified via --git-dir or GIT_DIR.
In environments that don't use bare repositories, this would be minimally disruptive.

Create a config variable, safe.bareRepository, that tells Git whether or not to die() when working with a bare repository.
This config is an enum of:

"all": allow all bare repositories (this is the default)

"explicit": only allow bare repositories specified via --git-dir or GIT_DIR.

If we want to protect users from such attacks by default, neither value will suffice - "all" provides no protection, but "explicit" is impractical for bare repository users.
A more usable default would be to allow only non-embedded bare repositories (this thread contains one such proposal), but detecting if a repository is embedded is potentially non-trivial, so this work is not implemented in this series.

git config now includes in its man page:

safe.bareRepository

Specifies which bare repositories Git will work with. The currently supported values are:

all: Git works with all bare repositories. This is the default.

explicit: Git only works with bare repositories specified via the top-level --git-dir command-line option, or the GIT_DIR environment variable.

If you do not use bare repositories in your workflow, then it may be beneficial to set safe.bareRepository to explicit in your global config. This will protect you from attacks that involve cloning a repository that contains a bare repository and running a Git command within that directory.

This config setting is only respected in protected configuration (see definition). This prevents the untrusted repository from tampering with this value.
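The effect of the setting can be observed without touching the real global configuration by passing it with -c, since the command line counts as protected configuration. This is a sketch with a made-up bare.git directory. The refusal of implicit discovery only applies on Git 2.38 or later, and older versions simply ignore the variable.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare bare.git

# Named explicitly via --git-dir: allowed under "explicit".
explicit=$(git -c safe.bareRepository=explicit --git-dir=bare.git \
    rev-parse --is-bare-repository)

# Discovered implicitly from the current directory: expected to be refused
# on Git 2.38+, and still allowed on older versions.
if git -c safe.bareRepository=explicit -C bare.git \
    rev-parse --git-dir >/dev/null 2>&1; then
    implicit=allowed
else
    implicit=refused
fi

echo "explicit --git-dir : $explicit"
echo "implicit discovery : $implicit"
```

For day-to-day use, the man page's suggestion corresponds to running git config --global safe.bareRepository explicit once.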

With Git 2.41 (Q2 2023), the tracing mechanism learned to notice and report when auto-discovered bare repositories are being used, since allowing this without the user explicitly stating the intent (by setting GIT_DIR, for example) can be exploited through social engineering as an attack vector.

See commit e35f202 (01 May 2023) by Glen Choo (chooglen).
(Merged by Junio C Hamano -- gitster -- in commit fa88934, 15 May 2023)

setup: trace bare repository setups

Signed-off-by: Glen Choo
Signed-off-by: Josh Steadmon

safe.bareRepository=explicit is a safer default mode of operation, since it guards against the embedded bare repository attack.
Most end users don't use bare repositories directly, so they should be able to set safe.bareRepository=explicit, with the expectation that they can reenable bare repositories by specifying GIT_DIR or --git-dir.

However, the user might use a tool that invokes Git on bare repositories without setting GIT_DIR (e.g. "go mod" will clone bare repositories, see go.dev/ref/mod), so even if a user wanted to use safe.bareRepository=explicit, it wouldn't be feasible until their tools learned to set GIT_DIR.

To make this transition easier, add a trace message to note when we attempt to set up a bare repository without setting GIT_DIR.
This allows users and tool developers to audit which of their tools are problematic and report/fix the issue.
When they are sufficiently confident, they would switch over to "safe.bareRepository=explicit".

Note that this uses trace2_data_string(), which isn't supported by the "normal" GIT_TRACE2 target, only _EVENT or _PERF.
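An audit along these lines might look like the sketch below. It assumes Git 2.41 or later for the trace entry to appear, and the key name implicit-bare-repository is my recollection of what Git's own tests look for, so treat it as an assumption. The directory and file names are hypothetical.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare bare.git

# trace2 targets given as file names must be absolute paths.
trace="$work/trace.perf"

# Run any command that discovers the bare repository implicitly.
# safe.bareRepository=all keeps the command from failing where the
# user has already switched to "explicit".
GIT_TRACE2_PERF="$trace" git -c safe.bareRepository=all -C bare.git \
    rev-parse --git-dir >/dev/null

# Assumed key name; only expected to be present on Git 2.41+.
if grep -q "implicit-bare-repository" "$trace"; then
    traced=yes
else
    traced=no
fi
echo "implicit bare repository use traced: $traced"
```

Pointing GIT_TRACE2_PERF (or GIT_TRACE2_EVENT) at a file while running a third-party tool is one way to find out whether that tool relies on implicit discovery.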

With Git 2.44 (Q1 2024), batch 12, the "disable repository discovery of a bare repository" check, triggered by setting safe.bareRepository configuration variable to 'explicit', has been loosened to exclude the ".git/" directory inside a non-bare repository from the check.
So you can do "cd .git && git cmd" to run a Git command that works on a bare repository without explicitly specifying $GIT_DIR now.

See commit 45bb916 (20 Jan 2024) by Kyle Lippincott (spectral54).
(Merged by Junio C Hamano -- gitster -- in commit a8bf3c0, 30 Jan 2024)

setup: allow cwd=.git w/ bareRepository=explicit

Signed-off-by: Kyle Lippincott

The safe.bareRepository setting can be set to 'explicit' to disallow implicit uses of bare repositories, preventing an attack where an artificial and malicious bare repository is embedded in another git repository.
Unfortunately, some tooling uses myrepo/.git/ as the cwd when executing commands, and this is blocked when safe.bareRepository=explicit.
Blocking is unnecessary, as git already prevents nested .git directories.

Teach git to not reject uses of Git inside of the .git directory: check if cwd is .git (or a subdirectory of it) and allow it even if safe.bareRepository=explicit.

With Git 2.45 (Q2 2024), batch 10, users with safe.bareRepository=explicit can still work from within $GIT_DIR of a secondary worktree (which resides at .git/worktrees/$name/) of the primary worktree without explicitly specifying the $GIT_DIR environment variable or the --git-dir=<path> option.

See commit 30b7c4b (09 Mar 2024) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit dc97afd, 21 Mar 2024)

setup: notice more types of implicit bare repositories

Helped-by: Kyle Lippincott
Helped-by: Kyle Meyer

Setting the safe.bareRepository configuration variable to explicit stops git from using a bare repository, unless the repository is explicitly specified, either by the "--git-dir=<path>" command line option, or by exporting $GIT_DIR environment variable.
This may be a reasonable measure to safeguard users from accidentally straying into a bare repository in unexpected places, but often gets in the way of users who need valid accesses to the repository.

Earlier, 45bb916 ("setup: allow cwd=.git w/ bareRepository=explicit", 2024-01-20, Git v2.44.0-rc0 -- merge listed in batch #12) loosened the rule such that being inside the ".git/" directory of a non-bare repository does not really count as accessing a "bare" repository.
The reason why such a loosening is needed is because often hooks and third-party tools run from within $GIT_DIR while working with a non-bare repository.

More importantly, the reason why this is safe is because a directory whose contents look like that of a "bare" repository cannot be a bare repository that came embedded within a checkout of a malicious project, as long as its directory name is ".git", because ".git" is not a name allowed for a directory in payload.

There are at least two other cases where tools have to work in a bare-repository looking directory that is not an embedded bare repository, and accesses to them are still not allowed by the recent change.

A secondary worktree (whose name is $name) has its $GIT_DIR inside "worktrees/$name/" subdirectory of the $GIT_DIR of the primary worktree of the same repository.

A submodule worktree (whose name is $name) has its $GIT_DIR inside "modules/$name/" subdirectory of the $GIT_DIR of its superproject.

As long as the primary worktree or the superproject in these cases are not bare, the pathname of these "looks like bare but not really" directories will have "/.git/worktrees/" and "/.git/modules/" as a substring in its leading part, and we can take advantage of the same security guarantee allow git to work from these places.
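The path layout the commit message relies on can be checked directly. A sketch, assuming Git 2.13 or later for --absolute-git-dir, with the made-up names primary and secondary:

```shell
set -eu
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q primary
git -C primary symbolic-ref HEAD refs/heads/main
echo hello > primary/file.txt
git -C primary add file.txt
git -C primary commit -q -m "first"

# Add a secondary worktree next to the primary one, on a new branch.
git -C primary worktree add -b side ../secondary >/dev/null 2>&1

primary_git_dir=$(git -C primary rev-parse --absolute-git-dir)
secondary_git_dir=$(git -C secondary rev-parse --absolute-git-dir)

echo "primary   : $primary_git_dir"
echo "secondary : $secondary_git_dir"
```

The secondary worktree's $GIT_DIR ends in /.git/worktrees/secondary, inside the primary worktree's .git directory, which is the substring the check looks for.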

Extend the earlier "in a directory called '.git' we are OK" logic used for the primary worktree to also cover the secondary worktree's and non-embedded submodule's $GIT_DIR, by moving the logic to a helper function "is_implicit_bare_repo()".
We deliberately exclude secondary worktrees and submodules of a bare repository, as these are exactly what safe.bareRepository=explicit setting is designed to forbid accesses to without an explicit GIT_DIR/--git-dir=<path>
