Git Guts - Pack Files, The Index

This post kicks off a series that takes a dive into git’s object storage. I’ll do this by examining the files, their formats, and writing some Go code to poke around and find things. In this installment, I’ll start covering git’s pack files by taking a look at the pack index.

You can find the full code for this series on GitHub at rubyist/gitguts. I (intentionally) don’t have any kind of commenting system on this blog, so if you have feedback or questions I’m @rubyist on twitter or you can file an issue in the repo.

Git’s Object Storage

Git has a couple of different ways to store its objects. They can be stored individually as files on disk, called loose objects. These are found under the .git/objects/xx directory, where xx is a directory named after the first byte of the SHA-1 of the obect. As an example, a git object with the SHA-1 of 057b91e290dad96bad3ed447539b2d214b5ee09c would live at .git/objects/05/7b91e290dad96bad3ed447539b2d214b5ee09c as a loose object. These files are too simple to warrant their own post, but we’ll work with them in another article.

Git can also bundle objects together in what it calls pack files. In a repo, pack files live under .git/objects/pack, named with a .pack extension, and generally have an accompanying index file, named with a .idx extension. The official documentation for the pack and index formats can be found in the pack-format.txt file in the git repo.

You’ll want a git repo and a pack file to follow along here, so how do you get one? You may already have one in your repo under .git/objects/pack, so check there first. If there isn’t one there, running git gc on your repo ought to bundle up all your objects and stuff them into a pack file for you.

If for some reason you have a .pack and no .idx, you can generate an index with git index-pack yourfile.pack.

The Pack File Index

The pack file index tells us which objects are in its associated pack file and where they are located, as well as a CRC32 checksum for each object. There are two versions of the index format, I’ll only be working with version 2 here. Here’s a break down of what’s in the index file:

  • 4-byte signature, [0xFF, 0x74, 0x4F, 0x63]
  • 4-byte version number, [0x00, 0x00, 0x00, 0x02]
  • Table of 256 4-byte fanout entries
  • Table of 20-byte SHA-1 object names
  • Table of 4-byte CRC32 values
  • Table of 4-byte offset values (in network byte order)
  • Table of 8-byte offset values for pack files > 2GiB
  • 20-byte SHA-1 checksum of the corresponding pack file
  • 20-byte SHA-1 checksum of the above data

Let’s grab some tools and dig in.

The first thing we’ll look at is the header. That way we’ll know for sure we’re dealing with a pack file index. As noted above, we have a 4-byte signature and a 4-byte version number.

The signature will always be [0xFF, 0x74, 0x4F, 0x63]. According to the documentation, this value was chosen because it is an “unreasonable fanout[0] value”. You’ll understand what the fanout table is in the next section. Version 1 of the index format had no header and started with the fanout table, so a reasonable disambiguation was necessary.

The version, for us, will always be [0x00, 0x00, 0x00, 0x02] for version 2.

We’ll leverage the encoding/binary package to make reading fixed length values easy. Unless otherwise noted, this data is always big endian.

Let’s start by making a struct for our pack index and adding these two fields to it. These are 4-byte fields, so we’ll use the uint32 type. Reading this data in from the file is very easy with binary.Read().

type PackIndex struct {
	Signature uint32
	Version   uint32
}

var (
	packIndexSignature = []byte{0xFF, 0x74, 0x4F, 0x63}

	errInvalidPackIndex   = errors.New("invalid pack index")
	errInvalidPackVersion = errors.New("invalid pack version")
)

// OpenPackIndex opens a pack file index at path and verifies that it
// is a valid index file.
func OpenPackIndex(path string) (*PackIndex, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	idx := &PackIndex{}

	// Read the header into the idx objext
	if err := binary.Read(f, binary.BigEndian, idx); err != nil {
		return nil, err
	}

	// Verify the signature
	if bytes.Compare(idx.Signature[:], packIndexSignature) != 0 {
		return nil, errInvalidPackIndex
	}

	// We're just doing version 2 packs
	if idx.Version != 2 {
		return nil, errInvalidPackVersion
	}

	return idx, nil
}

As long as we have a fixed-sized value, binary.Read() can handle it. It uses the reflection package to figure out what you’re after, so it’s not the fastest way to read and decode binary data but it’s the easiest.

The other thing to note here is passing idx.Signature[:] to bytes.Compare(). We need to do this because the Signature field is a fixed size array rather than a slice of bytes. Using the [:] allows you to pass these to functions looking for []byte values.

Fanout Table, Level 1 - Object Counts

The fanout table is 256 entries of 4-byte values. This table keeps a cumulative count of the number of objects in the pack file, broken down by the first byte of their SHA-1 values. Working through an example will help us understand.

Say we’re trying to find the object with the SHA-1 of 057b91e290dad96bad3ed447539b2d214b5ee09c. The value of the first byte of that SHA-1 is 0x05, which is decimal 5. The table is an array, and the 4-byte value at index 5 is the number of objects in the pack file that start with 00 through 05. To get the number of objects that only start with 05, you subtract the value at index 4 from the value at index 5.

Why are the numbers cumulative? Wouldn’t it be easier to just store “42” at index 5 if there are 42 objects that start with “05”? It would be, but making it cumulative allows you to get the total number of objects in the pack file just by looking at the value at index 255. Because this value is at the same position in every pack index, it is very fast to look up. The total count is also important for parsing the rest of the file.

Given the following piece of a fanout table:

The value at index 4 is 0x15, or decimal 21 and the value at index 5 is 0x26, or decimal 38. This means there are 17 objects in the pack file with 0x05 as the first byte of their SHA-1 values.

The value at index 255 is 0x0116, or decimal 278, meaning there are 278 objects in total in this pack file.

Because this is a fixed-size table of fixed-size values, we can just slip this into our PackIndex structure and our existing binary.Read() call will pull them in with no extra work on our part.

Because they are 4-byte values, we will again use uint32 as the type.

type PackIndex struct {
	Signature uint32
	Version   uint32
	Fanout    [256]uint32
}

Fanout Table, Level 2 - Object Names

Next up is the table of object names. The values here are the 20-byte SHA-1 hashes of each object in the pack file. This table is sorted. The intention in sorting the list is to make it easy to search. A particular object can be located with the following steps:

  • Read in the level 1 fanout table
  • Calculate out how many objects start with the byte of the target object
  • Grab that section of the names table
  • Binary search those objects to find the index
  • Use that index in the offsets table

You’ll see in the next sections that the index of the object in the level 2 table is the same as the index in the level 3 and 4 tables. We’ll write code for all of that.

There’s a trade off to be made here. We can either read the entire set of names into memory or we can read only the ones we need from the file every time we need to find an object. Keeping the whole set in memory uses more memory, but searching is faster. Vice versa for reading sections of the name table as we need them.

The entries are only 20-bytes, so it would take a lot of objects in the pack file before we start running into memory problems. The code is also easier, so let’s go that route. I’ll leave it as an exercise for you to figure out other strategies.

The first thing we might notice is that this is not a fixed-size table. Each object is fixed-size, but the size of the table itself depends on the number of objects in the pack file. We’d like to keep the names table as part of PackIndex, but adding it will not work with our use of binary.Read(). What can we do?

One approach is to pull out the fixed portion of the header into another struct and embed it in PackIndex, then binary.Read() into that.

type PackIndex struct {
	packIndexHeader
	objects  [20]byte
}

type packIndexHeader struct {
	Signature [4]byte
	Version   uint32
	Fanout    [256]uint32
}

func OpenPackIndex(path string) (*PackIndex, error) {
	// ...

	idx := &PackIndex{}

	// Notice we read into the embedded struct here, not idx itself
	if err := binary.Read(f, binary.BigEndian, &idx.packIndexHeader); err != nil {
		return nil, err
	}

	// ...

	// The last entry of the fanout table tells us how many
	// objects are in the pack.
	numObjects := idx.Fanout[255]

	// Read in the object names table
	idx.objects = make([][20]byte, numObjects)
	if err := binary.Read(f, binary.BigEndian, idx.objects); err != nil {
		return nil, err
	}

	// ...

	return idx, nil
}

Fanout Table, Level 3 - Checksums

Next up is the table of CRC32 checksums. Git’s objects fly around over networks so there are checksums and redundancies all over the place to ensure data integrity. At this point there’s nothing we can really do with these checksums because they relate to the data in the pack files and we haven’t covered those yet. For now we’ll just skip them.

These are 4 bytes each, and there are numObjects of them, so we can just Seek() past them. We use io.SeekCurrent here to seek from the current position of the file rather than the start of the file.

func OpenPackIndex(path string) (*PackIndex, error) {
	// ...

	// Skip the CRC table and go to the offsets table
	if _, err := f.Seek(int64(numObjects*4), io.SeekCurrent); err != nil {
		return nil, err
	}

	// ...

	return idx, nil
}

Fanout Table, Level 4 - Object Offsets

Now we arrive at the all important offsets table. This table tells us where in a pack file a particular object resides. This table again is a table of 4-byte values and we will make the same trade off we made with the names table by storing it in memory.

type PackIndex struct {
	// ...
	offsets  []uint32
}

func OpenPackIndex(path string) (*PackIndex, error) {
	// ...

	// Read in the offsets table
	idx.offsets = make([]uint32, numObjects)
	if err := binary.Read(f, binary.BigEndian, idx.offsets); err != nil {
		return nil, err
	}

	// ...

	return idx, nil
}

That was easy, right? Not so fast. These are only 8 byte values, leaving us at most an offset in the pack file of 232 - 1. That puts a limit on the size of the pack files we can have, and we don’t like limits.

Thankfully, the clever minds who built git have a way around this. This is done by treating the lower 31-bits of the value as the offset in the packfile, while using the MSB as an indicator to look deeper.

If the pack file is less than 2GiB, the MSB will be 0 and we can just decode this as a regular number. If the pack file is larger than 2GiB and the object’s offset in the pack file needs more than 31 bits, the MSB is set to 1 and the lower 31-bits become an index into a fifth level table of 8 byte values.

Notice that I said if the packfile is > 2GiB and the object’s offset requires more than 31 bits. This means that there aren’t numObjects entries in the fifth level, so we can’t just snarf them in like before. We’re going to make the same memory trade off as before, so we’ll need to figure out how many objects need this secondary look up table.

type PackIndex struct {
	// ...
	offsets2 []uint64  // These are 8-byte values
}

func OpenPackIndex(path string) (*PackIndex, error) {
	// ...

	// Count the number of objects that have the MSB set to 1
	bigOffsets := 0
	for i := 0; i < len(idx.offsets); i++ {
		if idx.offsets[i]&0x8000 > 0 {
			bigOffsets++
		}
	}

	// Read in the fifth level table, if necessary
	if bigOffsets > 0 {
		idx.offsets2 = make([]uint64, bigOffsets)
		if err := binary.Read(f, binary.BigEndian, idx.offsets2); err != nil {
			return nil, err
		}
	}

	// ...

	return idx, nil
}

Git will tend to pack heavily used objects toward the front of the pack file which helps avoid lookups in the secondary table.

Pack and Index Checksums

The final 40 bytes of the pack file index contain 2 20-byte SHA1 checksums of the pack file itself and the previous contents of the index file, respectively.

We’re not going to do anything with these here, but you should absolutely read them in and use them to verify the pack and index files.

Locating Objects

Whew, we made it! You now, hopefully, understand and can parse git pack index files. What can we do with our newly crafted super power? Mostly, we can figure out whether or not an object is in a pack file and if it is, where it lives in that pack file. So let’s do that.

We’ll build a method on PackFile that uses the steps outlined in the level 2 section to find the offset of an object, given its SHA-1, in the pack file.

func (p *PackIndex) OffsetOf(oid [20]byte) (int, error) {
	// - Calculate out how many objects start with the byte of the target object
	oid0 := int(oid[0])
	before := uint32(0)
	if oid0 > 0 {
		before = p.Fanout[oid0-1]
	}
	numOIDs := p.Fanout[oid0] - before

	// - Grab that section of the level 2 table
	// - Binary search those objects
	offsetIndex := searchOIDs(p.objects[before:], 0, numOIDs-1, oid)
	if offsetIndex < 0 {
		return 0, errObjectNotFound
	}

	// This is the overall position of the object in the names table.
	// Use this position in the offsets table to get its offset
	offsetIndex += int(before)

	// - Calculate the objects index in the level 4 and 5 tables
	offset := p.offsets[offsetIndex]
	if offset&0x8000 > 0 {
		// Use the secondary table if necessary
		return int(p.offsets2[offset&0x7FFF]), nil
	}

	return int(p.offsets[offsetIndex]), nil
}

// searchOIDs perform a binary search on the set of OIDs to find the
// index of the oid.
func searchOIDs(oids [][20.byte, l, r uint32, oid [20]byte) int {
	for l <= r {
		m := l + (r-l)/2
		switch bytes.Compare(oids[m][:], oid[:]) {
		case 0:
			return int(m)
		case 1:
			r = m - 1
		default:
			l = m + 1
		}
	}
	return -1
}

Since we have the entire names table, we could skip the part where we figure out where the subset of objects we’re searching starts, and just search the whole thing. That would be a fine thing to do.

However, if we’re using another strategy for the names table, or we know it’s very large, we might need to narrow it down using the information from the fanout table. We’ve done that here as an illustration of how that could be worked through.

Conclusion

And that’s it for now! We’ve done all we can do with pack index files, without a pack file. Next time we’ll open up the pack files themselves and start pulling objects out of them. It’s deceptively simple at first, but there’s a lot going on in those things. Don’t miss it!