Jay Taylor's notes
How to find/identify large commits in git history? - Stack Overflow
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5.6 million objects in just over a minute.
The Base Script
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
When you run the above code, you will get nice human-readable output like this:
...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4
macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes, or brew install coreutils.

Filtering
To achieve further filtering, insert any of the following lines before the sort line.

To exclude files that are present in HEAD, insert the following line:

grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |
To show only files exceeding a given size (e.g. 1 MiB = 2^20 B), insert the following line (a combined example with both filters is sketched below):
awk '$2 >= 2^20' |
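For illustration, here is a sketch of the base script with both optional filter lines spliced in before the sort line; nothing else changes:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |
  awk '$2 >= 2^20' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest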
Output for Computers
To generate output that's more suitable for further processing by computers, omit the last two lines of the base script. They do all the formatting. This will leave you with something like this:
...
0d99bb93129939b72069df14af0d0dbda7eb6dba   542455 path/to/some-image.jpg
2ba44098e28f8f66bac5e21210c2774085d2319b 12446815 path/to/hires-image.png
bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f 65183843 path/to/some-video-1080p.mp4
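If you redirect that raw output to a file, say large-blobs.txt (a name chosen here just for illustration), it is easy to post-process with standard tools. A small sketch:

# Total size, in bytes, of all listed blobs.
awk '{ total += $2 } END { print total }' large-blobs.txt

# The ten largest blobs (the list is sorted by size in ascending order).
tail -n 10 large-blobs.txt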
Appendix
File Removal
For the actual file removal, check out this SO question on the topic.
Understanding the meaning of the displayed file size
What this script displays is the size each file would have in the working directory. If you want to see how much space a file occupies if not checked out, you can use %(objectsize:disk) instead of %(objectsize).
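As a reference, this is the same pipeline with only the format string in the git cat-file stage changed to the on-disk size:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize:disk) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest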
However, mind that this metric also has its caveats, as is mentioned in the documentation.

More sophisticated size statistics
Sometimes a list of big files is just not enough to find out what the problem is. You would not spot directories or branches containing humongous numbers of small files, for example.
So if the script here does not cut it for you (and you have a decently recent version of git), look into git-filter-repo --analyze or git rev-list --disk-usage (examples).
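As a rough sketch of the latter (assuming Git 2.31 or newer, where --disk-usage is available):

# Total on-disk size of all objects reachable from any ref.
git rev-list --disk-usage --objects --all

# On-disk size of objects reachable from a branch but not from main
# ("my-branch" and "main" are placeholder names here).
git rev-list --disk-usage --objects main..my-branch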
#!/bin/bash
#set -x
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
output="size,pack,SHA,location"
allObjects=`git rev-list --all --objects`
for y in $objects
do
# extract the size in kB
size=$((`echo $y | cut -f 5 -d ' '`/1024))
# extract the compressed size in kB
compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
# extract the SHA
sha=`echo $y | cut -f 1 -d ' '`
# find the objects location in the repository tree
other=`echo "${allObjects}" | grep $sha`
#lineBreak=`echo -e "\n"`
output="${output}\n${size},${compressedSize},${other}"
done
echo -e $output | column -t -s ', '
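Note that git verify-pack only inspects objects that are already in a pack file; if the repository has many loose objects, you may want to pack them first (a hedged suggestion, not part of the original script):

# Pack loose objects so verify-pack can see them, then confirm a pack index exists.
git gc
ls .git/objects/pack/pack-*.idx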
That will give you the object name (SHA1sum) of the blob, and then you can use a script like this one:
... to find the commit that points to each of those blobs.
Source
Steps 1-3a were copied from Finding and Purging Big Files From Git History
EDIT
The article was deleted sometime in the second half of 2017, but an archived copy of it can still be accessed using the Wayback Machine.
Interesting to see a PowerShell version of my script! I have not tried it, but from the code it looks like you do not output the objectname field. I really think you should though, since the path:objectname relationship is n:m, not 1:1. – raphinesse, Mar 16, 2021 at 11:11

@raphinesse Yeah, my use case is to create an ignore-regex to migrate from TFVC to git without too many big files, so I was only interested in the paths of the files that I need to ignore ;) But you're right, I'll add it. Thanks for the edit, by the way :) – SvenS, Mar 16, 2021 at 15:21
Try:

git ls-files | xargs du -hs --threshold=1M
We use the command below in our CI pipeline; it halts if it finds any big files in the git repo:
test $(git ls-files | xargs du -hs --threshold=1M 2>/dev/null | tee /dev/stderr | wc -l) -gt 0 && { echo; echo "Aborting due to big files in the git repository."; exit 1; } || true
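For readability, here is a sketch of the same check written out as a small script (same logic as the one-liner, assuming GNU du with --threshold support):

#!/bin/bash
# List tracked files of at least 1 MiB; abort the pipeline if any are found.
big_files=$(git ls-files | xargs du -hs --threshold=1M 2>/dev/null)
if [ -n "${big_files}" ]; then
    echo "${big_files}" >&2
    echo "Aborting due to big files in the git repository." >&2
    exit 1
fi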
PowerShell solution for Windows git, to find the largest files:
git ls-tree -r -t -l --full-name HEAD | Where-Object {
    # ls-tree -l lines look like: <mode> <type> <objectname> <size> <path>;
    # tree entries have "-" instead of a numeric size and are filtered out here.
    $_ -match '(.+)\s+(.+)\s+(.+)\s+(\d+)\s+(.*)'
} | ForEach-Object {
    New-Object -Type PSObject -Property @{
        'mode'       = $matches[1]
        'type'       = $matches[2]
        'objectname' = $matches[3]
        'Size'       = [int]$matches[4]
        'path'       = $matches[5]
    }
} | sort -Property Size -Top 10 -Descending
I was unable to make use of the most popular answer because the --batch-check command-line switch to Git 1.8.3 (that I have to use) does not accept any arguments. The ensuing steps have been tried on CentOS 6.5 with Bash 4.1.2.
Key Concepts
In Git, the term blob refers to the contents of a file. Note that a commit might change the contents of a file or pathname. Thus, the same file could refer to a different blob depending on the commit. A certain file could be the biggest in the directory hierarchy in one commit, while not in another. Therefore, the question of finding large commits, instead of large files, puts matters in the correct perspective.
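To see this concretely, here is a small sketch that prints, for each commit that touched a given path, the blob that path points to in that commit (path/to/file is just a placeholder):

# For every commit that touched the path, print the commit hash and the
# blob hash the path resolves to in that commit.
git rev-list --all -- path/to/file | while read commit; do
    echo "${commit} $(git rev-parse "${commit}:path/to/file")"
done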
For The Impatient
The command to print the list of blobs in descending order of size is:
git cat-file --batch-check < <(git rev-list --all --objects |
awk '{print $1}') | grep blob | sort -n -r -k 3
Sample output:
3a51a45e12d4aedcad53d3a0d4cf42079c62958e blob 305971200
7c357f2c2a7b33f939f9b7125b155adbd7890be2 blob 289163620
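Since --batch-check on Git 1.8.3 cannot print paths, one way to look up the paths that refer to a given blob hash (using the first hash from the sample output above) is:

# rev-list --objects prints "<hash> <path>" lines for every object in history.
git rev-list --all --objects | grep 3a51a45e12d4aedcad53d3a0d4cf42079c62958e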
To remove such blobs, use the BFG Repo Cleaner, as mentioned in other answers. Given a file blobs.txt that just contains the blob hashes, for example:
3a51a45e12d4aedcad53d3a0d4cf42079c62958e
7c357f2c2a7b33f939f9b7125b155adbd7890be2
Do:
java -jar bfg.jar -bi blobs.txt <repo_dir>
The question is about finding the commits, which is more work than finding blobs. To find out how, please read on.
Further Work
Given a commit hash, a command that prints hashes of all objects associated with it, including blobs, is:
git ls-tree -r --full-tree <commit_hash>
So, if we have such output available for every commit in the repo, then, given a blob hash, the matching commits are the ones whose output contains that hash. This idea is encoded in the following script:
#!/bin/bash
DB_DIR='trees-db'
# Print the name of every cached tree listing (i.e. every commit) that contains the given object hash.
find_commit() {
cd ${DB_DIR}
for f in *; do
if grep -q $1 ${f}; then
echo ${f}
fi
done
cd - > /dev/null
}
# Cache the recursive tree listing of every commit, one file per commit, in DB_DIR.
create_db() {
local tfile='/tmp/commits.txt'
mkdir -p ${DB_DIR} && cd ${DB_DIR}
git rev-list --all > ${tfile}
while read commit_hash; do
if [[ ! -e ${commit_hash} ]]; then
git ls-tree -r --full-tree ${commit_hash} > ${commit_hash}
fi
done < ${tfile}
cd - > /dev/null
rm -f ${tfile}
}
create_db
# Read blob hashes from stdin, one per line, and print the commits that contain them.
while read id; do
find_commit ${id};
done
If the contents are saved in a file named find-commits.sh, then a typical invocation will be as follows:
cat blobs.txt | bash find-commits.sh
As earlier, the file blobs.txt lists blob hashes, one per line. The create_db() function saves a cache of all commit listings in a sub-directory in the current directory.
Some stats from my experiments on a system with two Intel(R) Xeon(R) CPU E5-2620 2.00GHz processors presented by the OS as 24 virtual cores:
- Total number of commits in the repo = almost 11,000
- File creation speed = 126 files/s. The script creates a single file per commit. This occurs only when the cache is being created for the first time.
- Cache creation overhead = 87 s.
- Average search speed = 522 commits/s. The cache optimization resulted in 80% reduction in running time.
Note that the script is single threaded. Therefore, only one core would be used at any one time.
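A possible way to use more of those cores, not part of the original answer, would be to search the cached tree listings for several blob hashes in parallel (assuming GNU xargs and that a prior run has already created the trees-db cache):

# Run up to 8 grep processes concurrently; each prints "trees-db/<commit-hash>"
# for every cached commit listing that contains the blob hash.
xargs -P 8 -I{} grep -lF {} trees-db/* < blobs.txt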
Use the --analyze feature of git-filter-repo like this:
$ cd my-repo-folder
$ git-filter-repo --analyze
$ less .git/filter-repo/analysis/path-all-sizes.txt
To get a feeling for the "diff size" of the last commits in the git history:

git log --stat

This will show the diff size in lines: lines added, lines removed.
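Building on that, a hedged sketch that ranks commits by total lines changed (insertions plus deletions), largest first; merge commits without a diff line are simply skipped:

git log --pretty=format:'%H' --shortstat |
  awk 'NF == 1 { hash = $1; next }
       /changed/ {
         ins = 0; del = 0
         for (i = 1; i <= NF; i++) {
           if ($(i+1) ~ /insertion/) ins = $i
           if ($(i+1) ~ /deletion/)  del = $i
         }
         print ins + del, hash
       }' |
  sort -rn | head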