Is the whole data of a repository always kept within the .git directory (or in a bare repo), in some format from which all files can be rendered at any time?
Yes, those files and their complete history are stored in .git/packed-refs, .git/refs, and .git/objects.
When you clone a repo (bare or not), you always have the .git folder (or a folder with a .git extension for a bare repo, by naming convention) with its Git administrative and control files (see glossary).
Git can unpack at any time what it needs with git unpack-objects.
The trick is:
From a bare repo, you can query the logs (git log in a bare Git repo works just fine: no need for a working tree), or list files in a bare repo.
Or show the content of a file from a bare repo.
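The three queries above (logs, file list, file content) can be sketched as follows. This is a minimal illustration, assuming a throwaway repository; the names src, project.git and README.txt are hypothetical.

```shell
#!/bin/sh
# Sketch: inspect a bare repository without ever checking it out.
# src, project.git and README.txt are hypothetical names for this demo.
work=$(mktemp -d)
cd "$work" || exit 1

# Build a small ordinary repository with one commit.
git init -q src
git -C src symbolic-ref HEAD refs/heads/main
echo 'hello from the object database' > src/README.txt
git -C src add README.txt
git -C src -c user.name=Demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q -m 'Add README'

# A bare clone has objects and refs, but no working tree.
git clone -q --bare src project.git

# Query the history, list the files, and print one file's content,
# all directly from the bare repository.
log_line=$(git --git-dir=project.git log -1 --format=%s)
file_list=$(git --git-dir=project.git ls-tree -r --name-only HEAD)
content=$(git --git-dir=project.git show HEAD:README.txt)

printf 'last commit: %s\nfiles: %s\ncontent: %s\n' \
    "$log_line" "$file_list" "$content"
```

Nothing is written outside Git's own files here: README.txt never appears on disk inside project.git, yet its content can be printed on demand.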
That is how GitHub can render a page with files without having to check out the full repo.
I don't know that GitHub does exactly that, though, as the sheer number of repos forces the GitHub engineering team to do all kinds of optimization.
See for instance how they optimized cloning/fetching a repo.
With DGit, those bare repos are actually replicated across multiple servers.
Is this the reason for bare repositories, while a working copy only has the files at a given time?
For GitHub, maintaining a working tree would cost too much in disk space and in updates (since each user may request a different branch). It is best to extract from the unique bare repo what you need to render a page.
In general (outside of GitHub's constraints), a bare repo is used for pushing, in order to avoid having a working tree out of sync with what has just been pushed. See "but why do I need a bare repo?" for a concrete example.
That being said:
But that would not be possible for GitHub, which cannot maintain one (or several) working tree(s) for each repo it has to store.
The article "Using a bare Git repo to get version control for my dot files" from Greg Owen, originally reported by aifusenno1 adds:
A bare repository is a Git repository that does not have a snapshot.
It just stores the history. It also happens to store the history in a slightly different way (directly at the project root), but that’s not nearly as important.
A bare repository will still store your files (remember, the history has enough data to reconstruct the state of your files at any commit).
You can even create a non-bare repository from a bare repository: if you git clone a bare repository, Git will automatically create a snapshot for you in the new repository (if you want a bare repository, use git clone --bare).
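As a rough check of that last point, the sketch below clones the same bare repository twice, once normally and once with --bare, and asks Git which result is bare. The repository and file names (seed, central.git, checkout, mirror.git, notes.txt) are hypothetical.

```shell
#!/bin/sh
# Sketch: a plain clone of a bare repo gets a snapshot; a --bare clone does not.
# seed, central.git, checkout, mirror.git and notes.txt are hypothetical names.
work=$(mktemp -d)
cd "$work" || exit 1

# Seed a bare "central" repository from a small ordinary one.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/main
echo 'v1' > seed/notes.txt
git -C seed add notes.txt
git -C seed -c user.name=Demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q -m 'First commit'
git clone -q --bare seed central.git

# Clone the bare repository both ways.
git clone -q central.git checkout
git clone -q --bare central.git mirror.git

plain_is_bare=$(git -C checkout rev-parse --is-bare-repository)
mirror_is_bare=$(git --git-dir=mirror.git rev-parse --is-bare-repository)
echo "plain clone bare? $plain_is_bare / --bare clone bare? $mirror_is_bare"
```

The plain clone materializes notes.txt in its working tree, while mirror.git only holds the history.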
And Greg adds:
So why would we use a bare Git repository?
Almost every explanation I found of bare repositories mentioned that they’re used for centralized storage of a repository that you want to share between multiple users.
See Git repository layout:
A <project>.git directory that is a bare repository (i.e. without its own working tree), that is typically used for exchanging histories with others by pushing into it and fetching from it.
Basically, if you wanted to write your own GitHub/GitLab/BitBucket, your centralized service would store each repo as a bare repository.
But why? How does not having a snapshot connect to sharing?
The answer is that there’s no need to have a snapshot if the only service that’s interacting with your repo is Git.
Basically, the snapshot is a convenience for humans and non-Git tools, but Git only interacts with the history. Your centralized Git hosting service will only interact with the repos through Git commands, so why bother materializing snapshots all the time? The snapshots only take up extra space for no gain.
GitHub generates that snapshot on the fly when you access that page, rather than storing it permanently with the repo (this means that GitHub only needs to generate a snapshot when you ask for it, rather than keeping one updated every time anybody pushes any changes).
Git 2.38 (Q3 2022) introduces a safe.bareRepository configuration variable that allows users to forbid discovery of bare repositories.
See commit 8d1a744, commit 6061601, commit 5b3c650, commit 779ea93, commit 5f5af37 (14 Jul 2022) by Glen Choo (chooglen).
(Merged by Junio C Hamano -- gitster -- in commit 18bbc79, 22 Jul 2022)
setup.c: create safe.bareRepository
Signed-off-by: Glen Choo
There is a known social engineering attack that takes advantage of the fact that a working tree can include an entire bare repository, including a config file.
A user could run a Git command inside the bare repository thinking that the config file of the 'outer' repository would be used, but in reality, the bare repository's config file (which is attacker-controlled) is used, which may result in arbitrary code execution.
See this thread for a fuller description and deeper discussion.
A simple mitigation is to forbid bare repositories unless specified via --git-dir or GIT_DIR.
In environments that don't use bare repositories, this would be minimally disruptive.
Create a config variable, safe.bareRepository, that tells Git whether or not to die() when working with a bare repository.
This config is an enum of:
- "all": allow all bare repositories (this is the default)
- "explicit": only allow bare repositories specified via --git-dir or GIT_DIR.
If we want to protect users from such attacks by default, neither value will suffice - "all" provides no protection, but "explicit" is impractical for bare repository users.
A more usable default would be to allow only non-embedded bare repositories (this thread contains one such proposal), but detecting if a repository is embedded is potentially non-trivial, so this work is not implemented in this series.
git config now includes in its man page:
safe.bareRepository
Specifies which bare repositories Git will work with. The currently supported values are:
- all: Git works with all bare repositories. This is the default.
- explicit: Git only works with bare repositories specified via the top-level --git-dir command-line option, or the GIT_DIR environment variable.
If you do not use bare repositories in your workflow, then it may be beneficial to set safe.bareRepository to explicit in your global config. This will protect you from attacks that involve cloning a repository that contains a bare repository and running a Git command within that directory.
This config setting is only respected in protected configuration (see definition). This prevents the untrusted repository from tampering with this value.
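A minimal sketch of that setting in action, assuming Git 2.38 or later for the refusal to take effect (older versions simply ignore the variable). To avoid touching your real configuration, the demo points HOME at a scratch directory; that isolation step and the demo.git name are choices made for this example, not part of the feature.

```shell
#!/bin/sh
# Sketch: safe.bareRepository=explicit, tried in an isolated global config.
# demo.git and the scratch HOME are hypothetical choices for this demo.
work=$(mktemp -d)
cd "$work" || exit 1

# Use a scratch HOME so "git config --global" does not edit the real config.
mkdir home
export HOME="$work/home"
unset XDG_CONFIG_HOME

git init -q --bare demo.git
git config --global safe.bareRepository explicit

# Implicit discovery: running Git from inside the bare repository.
# Expected to be refused on Git 2.38+; older versions still allow it.
if git -C demo.git rev-parse --git-dir >/dev/null 2>&1; then
    implicit=allowed
else
    implicit=refused
fi

# Explicit use: naming the repository with --git-dir keeps working.
explicit=$(git --git-dir=demo.git rev-parse --is-bare-repository)

echo "implicit discovery: $implicit"
echo "explicit --git-dir is bare: $explicit"
```

Setting GIT_DIR=demo.git in the environment is the other explicit form named in the man page excerpt above.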
With Git 2.41 (Q2 2023), the tracing mechanism learned to notice and report when auto-discovered bare repositories are being used, as allowing so without explicitly stating the user intends to do so (with setting GIT_DIR for example) can be used with social engineering as an attack vector.
See commit e35f202 (01 May 2023) by Glen Choo (chooglen).
(Merged by Junio C Hamano -- gitster -- in commit fa88934, 15 May 2023)
setup: trace bare repository setups
Signed-off-by: Glen Choo
Signed-off-by: Josh Steadmon
safe.bareRepository=explicit is a safer default mode of operation, since it guards against the embedded bare repository attack.
Most end users don't use bare repositories directly, so they should be able to set safe.bareRepository=explicit, with the expectation that they can reenable bare repositories by specifying GIT_DIR or --git-dir.
However, the user might use a tool that invokes Git on bare repositories without setting GIT_DIR (e.g. "go mod" will clone bare repositories, see go.dev/ref/mod), so even if a user wanted to use safe.bareRepository=explicit, it wouldn't be feasible until their tools learned to set GIT_DIR.
To make this transition easier, add a trace message to note when we attempt to set up a bare repository without setting GIT_DIR.
This allows users and tool developers to audit which of their tools are problematic and report/fix the issue.
When they are sufficiently confident, they would switch over to "safe.bareRepository=explicit".
Note that this uses trace2_data_string(), which isn't supported by the "normal" GIT_TRACE2 target, only _EVENT or _PERF.
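The audit described in that commit message could look roughly like the sketch below. It writes a Trace2 event log while Git auto-discovers a bare repository, then searches the log. The data key implicit-bare-repository is my reading of that commit and should be treated as an assumption; it only appears with Git 2.41 or later. The demo.git name is hypothetical.

```shell
#!/bin/sh
# Sketch: audit implicit bare-repository use through the Trace2 event target.
# Assumption: the trace key is "implicit-bare-repository" (Git 2.41+).
work=$(mktemp -d)
cd "$work" || exit 1

git init -q --bare demo.git

# GIT_TRACE2_EVENT needs an absolute path when it names a file.
trace="$work/trace.json"

# Run a Git command that has to discover the bare repository on its own.
# It may be refused if safe.bareRepository=explicit is set; the trace is
# written either way.
GIT_TRACE2_EVENT="$trace" git -C demo.git rev-parse --git-dir >/dev/null 2>&1

if grep -q 'implicit-bare-repository' "$trace"; then
    found=yes
else
    found=no
fi
echo "implicit bare repository reported in trace: $found"
```

A tool developer could wrap a whole tool run this way (exporting GIT_TRACE2_EVENT once) and check the log before switching to safe.bareRepository=explicit.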
With Git 2.44 (Q1 2024), batch 12, the "disable repository discovery of a bare repository" check, triggered by setting safe.bareRepository configuration variable to 'explicit', has been loosened to exclude the ".git/" directory inside a non-bare repository from the check.
So you can do "cd .git && git cmd" to run a Git command that works on a bare repository without explicitly specifying $GIT_DIR now.
See commit 45bb916 (20 Jan 2024) by Kyle Lippincott (spectral54).
(Merged by Junio C Hamano -- gitster -- in commit a8bf3c0, 30 Jan 2024)
setup: allow cwd=.git w/ bareRepository=explicit
Signed-off-by: Kyle Lippincott
The safe.bareRepository setting can be set to 'explicit' to disallow implicit uses of bare repositories, preventing an attack where an artificial and malicious bare repository is embedded in another git repository.
Unfortunately, some tooling uses myrepo/.git/ as the cwd when executing commands, and this is blocked when safe.bareRepository=explicit.
Blocking is unnecessary, as git already prevents nested .git directories.
Teach git to not reject uses of Git inside of the .git directory: check if cwd is .git (or a subdirectory of it) and allow it even if safe.bareRepository=explicit.
With Git 2.45 (Q2 2024), batch 10, users with safe.bareRepository=explicit can still work from within $GIT_DIR of a secondary worktree (which resides at .git/worktrees/$name/) of the primary worktree without explicitly specifying the $GIT_DIR environment variable or the --git-dir=<path> option.
See commit 30b7c4b (09 Mar 2024) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit dc97afd, 21 Mar 2024)
setup: notice more types of implicit bare repositories
Helped-by: Kyle Lippincott
Helped-by: Kyle Meyer
Setting the safe.bareRepository configuration variable to explicit stops git from using a bare repository, unless the repository is explicitly specified, either by the "--git-dir=<path>" command line option, or by exporting $GIT_DIR environment variable.
This may be a reasonable measure to safeguard users from accidentally straying into a bare repository in unexpected places, but often gets in the way of users who need valid accesses to the repository.
Earlier, 45bb916 ("setup: allow cwd=.git w/ bareRepository=explicit", 2024-01-20, Git v2.44.0-rc0 -- merge listed in batch #12) loosened the rule such that being inside the ".git/" directory of a non-bare repository does not really count as accessing a "bare" repository.
The reason why such a loosening is needed is because often hooks and third-party tools run from within $GIT_DIR while working with a non-bare repository.
More importantly, the reason why this is safe is because a directory whose contents look like that of a "bare" repository cannot be a bare repository that came embedded within a checkout of a malicious project, as long as its directory name is ".git", because ".git" is not a name allowed for a directory in payload.
There are at least two other cases where tools have to work in a bare-repository looking directory that is not an embedded bare repository, and accesses to them are still not allowed by the recent change.
- A secondary worktree (whose name is $name) has its $GIT_DIR inside "worktrees/$name/" subdirectory of the $GIT_DIR of the primary worktree of the same repository.
- A submodule worktree (whose name is $name) has its $GIT_DIR inside "modules/$name/" subdirectory of the $GIT_DIR of its superproject.
As long as the primary worktree or the superproject in these cases are not bare, the pathname of these "looks like bare but not really" directories will have "/.git/worktrees/" and "/.git/modules/" as a substring in its leading part, and we can take advantage of the same security guarantee to allow git to work from these places.
Extend the earlier "in a directory called '.git' we are OK" logic used for the primary worktree to also cover the secondary worktree's and non-embedded submodule's $GIT_DIR, by moving the logic to a helper function "is_implicit_bare_repo()".
We deliberately exclude secondary worktrees and submodules of a bare repository, as these are exactly what safe.bareRepository=explicit setting is designed to forbid accesses to without an explicit GIT_DIR/--git-dir=<path>.
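The two relaxations (Git 2.44 for .git/, Git 2.45 for .git/worktrees/$name/) can be probed with a sketch like the one below. The outcome of the first two probes depends on the installed Git version, so the script only reports them. The names repo, wt and side, and the scratch HOME, are hypothetical choices for this demo.

```shell
#!/bin/sh
# Sketch: probe where Git still works under safe.bareRepository=explicit.
# repo, wt, side and the scratch HOME are hypothetical names for this demo.
work=$(mktemp -d)
cd "$work" || exit 1

# Isolate the global configuration from the real one.
mkdir home
export HOME="$work/home"
unset XDG_CONFIG_HOME

# A non-bare repository with one commit and one secondary worktree.
git init -q repo
git -C repo symbolic-ref HEAD refs/heads/main
echo base > repo/file.txt
git -C repo add file.txt
git -C repo -c user.name=Demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q -m base
git -C repo worktree add -b side "$work/wt" >/dev/null 2>&1

git config --global safe.bareRepository explicit

# Run a harmless Git command from the given directory and report the result.
probe() {
    if (cd "$1" && git rev-parse --git-dir >/dev/null 2>&1); then
        echo allowed
    else
        echo refused
    fi
}

in_checkout=$(probe "$work/repo")                          # always fine
in_dot_git=$(probe "$work/repo/.git")                      # allowed on 2.44+
in_worktree_gitdir=$(probe "$work/repo/.git/worktrees/wt") # allowed on 2.45+

echo "working tree:          $in_checkout"
echo ".git/:                 $in_dot_git"
echo ".git/worktrees/wt/:    $in_worktree_gitdir"
```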
Best Answer
It's relatively simple, really. A bare repository has no working tree, therefore it cannot have an active checkout.[1] And, as you've seen elsewhere, the issue is that pushing to an active checkout of some branch results in an out-of-sync checkout. So Git forbids pushing to the checked-out branch.[2] By the fact that a bare repository has no working tree, and therefore no checked-out branch, a bare repository sidesteps the problem.
Let's dispense with the third machine: we only need client, a non-bare repository, and server, the repository that should be but isn't bare.
On server, branch main is actively checked-out. Someone may or may not be logged in to server and editing files there.
Meanwhile, on client, you've made some new commit and you run git push and send the new commit to server. If the server accepts this commit, there are now two possibilities:
- server's Git repository doesn't update the checked-out working tree, or
- server's Git repository does update the checked-out working tree.
Both situations can produce a bad outcome. Before we start on that, let's explore Git's workings a bit.
[1] This was true before git worktree add was added, and now it isn't. So the simplicity was there until Git 2.5, and now it isn't.
[2] This was true in original Git, before the invention of various configuration items. Now it isn't. So the simplicity was there once, and now it isn't. (The receive.denyCurrentBranch stuff happened before the git worktree command, but I don't recall offhand which version that was.)
Git is about commits; commits are numbered; branch names find commits
A Git repository consists mainly of two databases, one usually much larger than the other. The larger database contains commits and supporting Git objects. The smaller database contains names, such as branch and tag names.
The commit objects are numbered, with the numbers expressed as hexadecimal hash IDs. Git needs the hash ID to find the commit: the big database is indexed solely by hash ID.
A commit itself contains two things: a full snapshot of every file, and some metadata about the commit.
In the metadata for any given commit, Git stores the raw hash ID(s) of the parent or parents of that commit. A commit therefore has a list of previous-commit hash IDs, stored in its metadata. This forms the history in the repository.
To be able to obtain the latest commit for any given branch, Git stores, in the branch name (e.g., refs/heads/main), the raw hash ID of the latest commit. That commit contains, in its metadata, the hash ID of the previous (parent) commit, which in turn contains another hash ID for another parent, and so on.
When we use git checkout or git switch with a branch name, we're telling Git: extract the latest commit for that branch. That's the one whose hash ID is stored in the branch name. So with git switch main, Git looks up refs/heads/main, finds a hash ID such as a123456..., and looks up that commit in the database. That commit has a set of files associated with it. Git copies those files out of the commit to your working tree (the ones in the commit aren't generally usable by the OS, as they're in a read-only, compressed, Git-only, de-duplicated form).
But Git also copies the files, or rather information about the files (names and blob hash IDs), into Git's index, which goes along with the working tree. This defines which files are tracked, helps Git go fast, and is generally needed to know what to put in the next commit.
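Once this is all in place, Git sets up the special name HEAD to contain the branch name. (In original Git, this was a symbolic link to the refs/heads/main file, but as with many bits of Git, that was done away with more than a decade ago.)
There's now a group of well-defined, carefully-coordinated data:
- HEAD contains the current branch name;
- the branch name contains the hash ID of the current commit;
- Git's index and your working tree hold the files extracted from that commit.
You work on the files, run git add to tell Git to update what's in Git's index, and eventually run git commit. At this point Git builds a new commit from what is in its index, records the current commit as the new commit's parent, and stores the new commit's hash ID in the current branch name.
The carefully-coordinated data lets Git do all of this, and is still carefully-coordinated.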
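The coordinated pieces described in this section can be inspected with plumbing commands. The sketch below is only an illustration on a throwaway repository; the names repo and file.txt and the commit helper function are hypothetical.

```shell
#!/bin/sh
# Sketch: look at HEAD, the branch name, the commit metadata and the index.
# repo, file.txt and the commit() helper are hypothetical names for this demo.
work=$(mktemp -d)
cd "$work" || exit 1
git init -q repo
cd repo || exit 1
git symbolic-ref HEAD refs/heads/main

# Helper so the demo does not depend on a configured identity.
commit() {
    git -c user.name=Demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q "$@"
}

echo one > file.txt; git add file.txt; commit -m 'first'
echo two > file.txt; git add file.txt; commit -m 'second'

# HEAD holds the current branch name.
head_target=$(git symbolic-ref HEAD)

# The branch name holds the hash ID of the latest commit.
tip=$(git rev-parse refs/heads/main)

# The commit's metadata records its parent's hash ID.
recorded_parent=$(git cat-file -p "$tip" | sed -n 's/^parent //p')
parent=$(git rev-parse "$tip^")

# The index lists each tracked file with its blob hash ID.
index_entry=$(git ls-files -s)

echo "HEAD -> $head_target"
echo "main -> $tip"
echo "parent recorded in commit: $recorded_parent"
echo "index: $index_entry"
```

Running git cat-file -p on the tip commit by hand also shows the tree (the snapshot) next to the parent, author and committer lines.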
If someone else commits while you're working ...
Now suppose we're on the server, working, and someone on the client commits. That's not a problem because the client Git repository has its own branch names. It gets a new commit in its commit database, and its branch name main stores a new hash ID. But over here on server, our Git databases are unchanged.
But if they now run git push main and send their commit, our Git has to either accept their commit or reject it. If we reject it, that's fine: our databases remain unchanged and everything is still coordinated.
Let's say that instead, though, we accept the push. The server Git updates refs/heads/main to store their commit hash ID. Our two possibilities are:
- possibility 1: our index and working tree are left alone, or
- possibility 2: our index and working tree are updated to match the new commit.
If we choose possibility #1, then we have a "stale checkout": our files are from the previous commit. But the branch name holds the new commit hash ID. So we're out of sync. If we update any files and then commit, we'll revert the other guy's work (remember that our Git software uses what's in our index, which matches our working tree). That's not great, so let's move on to option 2.
If we choose option 2, our files get ripped away from us and replaced. Our index and working tree are re-synchronized with the updated branch name. That's better ... except, if we're actively working on some file, what happens to our work? Maybe our editor notices that the underlying file has changed and gives us a chance to fix things. Maybe it just overwrites the underlying file. Either way, it's likely to be a problem.
So, updating the working tree of a server's repository is perhaps better than not doing that, and that's what receive.denyCurrentBranch's updateInstead setting does. It's not perfect, though. "Perfect" is just don't have a working tree so that nothing can go wrong, and we get that with --bare.
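To see the difference in practice, the sketch below pushes the same commit first to a non-bare repository whose main branch is checked out, then to a bare one. receive.denyCurrentBranch is set to refuse explicitly (this is Git's default) so the outcome does not depend on local configuration. All repository names are hypothetical.

```shell
#!/bin/sh
# Sketch: pushing to a checked-out branch is refused; pushing to a bare repo works.
# server-nonbare, server.git, client and file.txt are hypothetical names.
work=$(mktemp -d)
cd "$work" || exit 1

# A non-bare "server" repository with main checked out.
git init -q server-nonbare
git -C server-nonbare symbolic-ref HEAD refs/heads/main
echo base > server-nonbare/file.txt
git -C server-nonbare add file.txt
git -C server-nonbare -c user.name=Demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q -m base
# "refuse" is the default; set explicitly to keep the demo deterministic.
git -C server-nonbare config receive.denyCurrentBranch refuse

# A bare copy of the same history, and a client clone.
git clone -q --bare server-nonbare server.git
git clone -q server-nonbare client

echo change >> client/file.txt
git -C client -c user.name=Demo -c user.email=demo@example.com \
    -c commit.gpgsign=false commit -q -am 'client change'

# Push to the non-bare repository (origin): main is checked out there.
if git -C client push -q origin main 2>/dev/null; then
    nonbare_push=accepted
else
    nonbare_push=rejected
fi

# Push the same commit to the bare repository: no checkout to get out of sync.
if git -C client push -q "$work/server.git" main 2>/dev/null; then
    bare_push=accepted
else
    bare_push=rejected
fi

echo "push to non-bare server: $nonbare_push"
echo "push to bare server:     $bare_push"
```

Changing the setting to updateInstead on server-nonbare would exercise the second possibility discussed above instead of the refusal.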