Apr 26, 2023

Removing large files from the repository history

Git stores the complete history of a repository, so files are never truly lost: even after they are deleted, they remain available in earlier commits. That becomes a problem when large or sensitive files have been committed, because deleting them DOES NOT remove them from history. Recently one of the repos I work on became unexpectedly large, so here's how you can resolve that:

  1. Check the files in the repo. Any large files there are your suspects, but if nothing in the repo itself is large, check the pack files in .git\objects\pack. In my case, the packfile was ~1 GB (99.9% of the repo size). That means a large object was committed to the repo and later deleted, which still leaves it in the pack files.
  2. Listed all blobs in the repo history, sorted by size (largest last):
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

This returned one large file:

4c23d572ae62  963MiB docker/irishealth-2022.

Probably an IRIS image was accidentally committed (and later deleted) but remained in the repo history.
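The diagnosis so far (a large pack, and a blob that lives only in history) can be double-checked from the command line. This is a minimal sketch, assuming it is run inside the repository; the helper names pack_size and blob_in_head are made up for illustration and are not from the original investigation:

```shell
# Hypothetical helpers (names are illustrative, not from the original post).

# Print the total size of the pack files, e.g. to confirm step 1.
pack_size() {
  git count-objects -v -H | sed -n 's/^size-pack: //p'
}

# Report whether a blob ID (full or abbreviated) is still part of the
# currently checked-out commit, or survives only in history.
blob_in_head() {
  if git ls-tree -r HEAD |
    awk -v id="$1" 'index($3, id) == 1 { found = 1 } END { exit !found }'; then
    echo "present in HEAD"
  else
    echo "only in history"
  fi
}
```

For a file that was committed and then deleted before the checked-out commit, blob_in_head reports that the blob survives only in history, which is exactly the situation that inflates the pack files.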

  3. Searched for commits affecting the image file:
git log --all --full-history -- "*irishealth-2022.*"

commit abc
Date:   Thu Apr 20 10:47:46 2023

    Remove iris image

commit xyz
Date:   Thu Apr 20 10:07:12 2023

    Fix bug

  4. Using a GitHub search, I found the affected branch.
  5. As the commit was in one branch only (not merged into the main branch or anywhere else) and so had not been deployed to prod, rewriting history was deemed OK.
  6. Ran a repo cleaner to remove the large files from history.
  7. Force-pushed the new history.
  8. Cloned the repo again, and the clone completed in under a minute.
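The steps above do not say which cleaner or which flags were used, so the following is only a sketch of how finding the branch, cleaning, and force-pushing could look. It assumes the separate git-filter-repo tool is installed (BFG Repo-Cleaner would be another option); the function names, the 100M threshold, and the remote and branch arguments are all illustrative assumptions:

```shell
# Hypothetical helpers; names, threshold and arguments are assumptions.

# Local alternative to a GitHub search: list the local and remote-tracking
# branches whose history contains the given commit.
branches_containing() {
  git branch --all --contains "$1" --format='%(refname:short)'
}

# Drop every blob larger than the given size from all history.
# Requires git-filter-repo, which expects to run in a fresh clone
# and removes the "origin" remote when it finishes.
strip_large_blobs() {
  git filter-repo --strip-blobs-bigger-than "${1:-100M}"
}

# Overwrite a remote branch with the rewritten history.
# Usage: force_push_branch <remote-or-url> <branch>
force_push_branch() {
  git push --force "$1" "$2"
}
```

Because filter-repo removes the origin remote as a safety measure, the remote has to be added again (or a URL passed directly) before force-pushing.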

After doing this, every developer working on the repo needs to clone the repository again, because existing clones still contain the old history.
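In a fresh clone it is easy to confirm that the large object is really gone. This is a minimal sketch, assuming it is run inside the new clone; blob_is_gone is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds (exit status 0) when the given object ID
# does not exist in the repository in the current directory.
blob_is_gone() {
  ! git cat-file -e "$1" 2>/dev/null
}
```

For example, blob_is_gone 4c23d572ae62 && echo "history is clean" checks for the blob ID reported by the size listing.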
