git filters & diff drivers: a technical overview

As I was investigating potential solutions for storing secret files in our git repos securely, I recently looked into how some little-known features of git, namely filters and diff drivers, work. While most git users will rarely need to know the technical details of how those work, I found that they can be pretty powerful in some specific use cases.

How git filters work

The goal of git filters is to act as processors that transform the content of files as they move between your working copy and the git object database: once when a file is checked in (staged with git add, then committed), and once when it is checked out. You define a filter in 2 places in a git repo:

In the .gitattributes file, you define file patterns and the name of the filter driver to apply to each of them.
In your git config, you define the filter driver itself, i.e. what command to run on the stored blob during checkout (the smudge command), and what command to run on the file content during checkin (the clean command). That filter driver can also indicate whether it is required to succeed for the content to be usable.

Example 1: a simplistic encryption filter

If we wanted to run a filter that encrypts/decrypts the content of some files on the fly, we can define a my-encrypt-filter filter driver like so in our git config:

[filter "my-encrypt-filter"]
	clean = openssl enc -aes-256-cbc -k "mysecretkey"
	smudge = openssl enc -d -aes-256-cbc -k "mysecretkey"
	required = true

And in our .gitattributes, declare which files should go through that filter:

secrets/*      filter=my-encrypt-filter
secret-*.json  filter=my-encrypt-filter
# …

In this example:

When we git add (and later commit) a secrets/foo.txt file, the content of that file will be run through the clean command of the filter, i.e. openssl enc …, which will encrypt the file content before it is stored in the repository. The encrypted blob is then what gets pushed to the remote.
When we (or another user having that same filter defined in their config) check out the file into the working copy, e.g. after a clone or a pull, its content will be processed by the smudge command of the filter, i.e. openssl enc -d …, to decrypt the encrypted blob stored in git before putting the decrypted content in our local working copy.

⚠️ The implementation in this example is a bit too simplistic in reality, as it would need some adjustments before being used for real-world applications to ensure idempotence.

As the .gitattributes documentation suggests, clean and smudge actions should be idempotent; i.e. “clean→clean” should be equivalent to “clean”, and “smudge→smudge→clean” should be equivalent to “clean” too.

This is clearly not the case in the simplified example above, as applying the clean action (i.e. openssl enc) twice in succession would run another encryption pass on top of the already-encrypted content.

In a real-world setup, you should thus replace those direct openssl … calls with e.g. a wrapper script that first checks whether the content is already encrypted before encrypting it (and vice-versa for decryption).

For example, you could imagine such a wrapper script would inject some “magic marker” as a prefix to the encrypted data during encryption. Then when the script is called to do a clean action, it’d only call openssl enc … on the input content (then add the marker) if that content does not already contain that magic prefix indicating it was already encrypted. Likewise, when called to do a smudge action, it’d only remove the prefix before passing the rest of the binary data to openssl enc -d … if the input binary content to smudge/decrypt actually starts with said prefix, and do nothing otherwise (i.e. if that content is already decrypted).

PS: As far as I could tell, that approach is basically what git-crypt (discussed below) is using internally.
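
The marker-based wrapper described above could be sketched as follows. This is a minimal illustration rather than a hardened implementation: the MAGIC prefix, the KEY variable, the function names and the script name are all made up for the example, and the -pbkdf2 flag assumes OpenSSL 1.1.1 or later.

```shell
#!/usr/bin/env bash
# encrypt-filter.sh (hypothetical name): idempotent clean/smudge wrapper around openssl.
# MAGIC, KEY and the function names are illustrative assumptions, not part of git.

MAGIC='MYENC1:'        # marker prepended to encrypted content
KEY='mysecretkey'      # same hard-coded demo key as in the config above

# Succeeds if the file at $1 starts with the marker (compared as hex, so binary input is safe).
has_marker() {
  [ "$(head -c "${#MAGIC}" "$1" | od -An -tx1)" = "$(printf '%s' "$MAGIC" | od -An -tx1)" ]
}

# clean: encrypt stdin and add the marker, unless the input is already marked.
clean_filter() {
  local tmp; tmp=$(mktemp)
  cat > "$tmp"
  if has_marker "$tmp"; then
    cat "$tmp"
  else
    printf '%s' "$MAGIC"
    openssl enc -aes-256-cbc -pbkdf2 -k "$KEY" < "$tmp"
  fi
  rm -f "$tmp"
}

# smudge: strip the marker and decrypt, or pass unmarked input through untouched.
smudge_filter() {
  local tmp; tmp=$(mktemp)
  cat > "$tmp"
  if has_marker "$tmp"; then
    tail -c "+$(( ${#MAGIC} + 1 ))" "$tmp" | openssl enc -d -aes-256-cbc -pbkdf2 -k "$KEY"
  else
    cat "$tmp"
  fi
  rm -f "$tmp"
}

# Dispatch on the first argument, so the git config can call "<script> clean" or "<script> smudge".
case "${1:-}" in
  clean)  clean_filter ;;
  smudge) smudge_filter ;;
esac
```

With a script like this, the filter driver in the git config would point its clean and smudge entries at the script (followed by clean or smudge respectively) instead of calling openssl directly. Running clean on already-marked content, or smudge on plain content, then leaves the data untouched.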

Example 2: Auto-format on commit

Another example use case is to use a filter to reformat a text file on checkin.

For example, we could imagine using jq to ensure that when a JSON file is committed, it’s indented in a human-readable format. We do that by declaring the following filter in our repo’s .git/config:

[filter "json-prettify"]
	clean = jq .

Then telling our .gitattributes to apply it to all JSON files:

*.json  filter=json-prettify

Then if we create a new JSON file e.g. echo '{"name":"Alice","age":42}' >user.json:

Once we git add user.json, its content will go through jq (as if we did cat user.json | jq .), and the output of that filter (i.e. the prettified JSON) is what will actually be committed, and later pushed, as the content of that file.
Then when another user checks out the repo, or if you delete the user.json file and git restore it, or if you look at the content of that file on the remote (e.g. github.com), you’ll see that the content of user.json has been committed pretty-formatted.
In that example, there’s no need for a smudge command, as we only want to format outgoing JSON before commit and don’t need to transform it back on checkout.

Notes

The filter driver may be defined at any level of your git config, e.g.:

In the repo’s .git/config file, which is local to your clone and can’t be committed.
In your global $HOME/.gitconfig, which applies to all repos on your machine.

A missing filter driver definition in the git config is not an error, and just makes the filter a no-op passthru. This means that:

Filter driver definitions are not committed in your repo (.git/* files are only local), so they are not automatically shared with other users of the repo. In that sense, it’s a bit similar to git hooks, which are defined in your local .git/hooks/* and not pushed to your remote to be shared with others either.
When someone else clones the repo, they likely won’t have the definition for that filter driver in their own git config, so the file just won’t be processed by the filter and its raw content will be kept as-is.

If you update the definition of your filter driver in your git config, you can use:

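git add --renormalize . (or git add --renormalize <files>) to ask git to re-apply the clean command of the filter driver assigned to each tracked file
git restore --source=HEAD -- . to check the files out again and have the smudge command of their filter driver applied. Note that git may skip files it considers unchanged in the working copy, so you might need to delete those files first to force a fresh checkout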

The latter is especially useful if you clone a repo that has some filter applications defined in its .gitattributes, and would only then add the filter driver definition in your local .git/config, as you’d then want the files to be re-processed by the filter after that.
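
To see the passthrough behaviour and --renormalize in action, here is a small sandbox session. It assumes git 2.16 or later (for --renormalize), and the upcase filter with its tr command is a made-up stand-in, chosen only because its effect is easy to see. It covers the clean side only.

```shell
#!/usr/bin/env bash
# Sandbox demo: an undefined filter is a passthrough until the driver is configured.
# The "upcase" filter name and its tr command are hypothetical stand-ins.

repo=$(mktemp -d)
git init -q "$repo"

# Assign the filter to *.txt files, but do NOT define the driver yet.
printf '*.txt filter=upcase\n' > "$repo/.gitattributes"
printf 'hello\n' > "$repo/note.txt"
git -C "$repo" add .gitattributes note.txt

# "git show :path" prints the blob as stored in the index, without any smudging.
before=$(git -C "$repo" show :note.txt)
echo "staged without driver:    $before"

# Now define the driver in the repo's local config and re-run clean on tracked files.
git -C "$repo" config filter.upcase.clean 'tr a-z A-Z'
git -C "$repo" add --renormalize .

after=$(git -C "$repo" show :note.txt)
echo "staged after renormalize: $after"
```

The working copy of note.txt is never modified here; only the staged blob changes, which is why the demo inspects the index with git show rather than reading the file on disk.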

Practical examples

Git-LFS

git-lfs (Git Large File Storage) aims to store large files (e.g. video files) outside of your git repo, so that such massive files don’t bloat your git history and object database.

This works by leveraging the git filters we just learned about, by declaring a filter driver similar to this in your ~/.gitconfig:

[filter "lfs"]
	clean = git-lfs clean -- %f
	smudge = git-lfs smudge -- %f
	required = true

Then, roughly speaking, for every file in your .gitattributes that is assigned filter=lfs:

On checkin, the git-lfs clean command takes care of replacing the actual content of the large file with a small piece of text representing a reference to that file’s current content version. (Note: %f is replaced by git with the path of the file when calling the command)
Then on git push, a custom pre-push hook uploads the large file’s real content to some external storage, while what’s actually pushed to the git remote is the tiny textual information that was swapped for the file content during step 1.
On checkout, the textual content that was stored in git for that file is passed as input to git-lfs smudge, which transforms that info back into the real file content by downloading it from the external storage and returning it as the filter output.

(Again simplifying and not going into the details here, but you get the idea)
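
To make the “tiny textual information” more concrete, here is a rough sketch of what a clean step could emit. It mimics the three-line layout of a Git LFS pointer file (version, oid, size), but it is not the real git-lfs implementation: the lfs_pointer_sketch function is hypothetical, it assumes sha256sum is available, and it skips the part where the real content is set aside for the later upload.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the output side of "git-lfs clean":
# read the real content on stdin, print a small textual pointer on stdout.

lfs_pointer_sketch() {
  local tmp oid size
  tmp=$(mktemp)
  cat > "$tmp"
  oid=$(sha256sum "$tmp" | cut -d' ' -f1)   # content hash identifying this version
  size=$(wc -c < "$tmp" | tr -d ' ')        # size of the real content in bytes
  printf 'version https://git-lfs.github.com/spec/v1\n'
  printf 'oid sha256:%s\n' "$oid"
  printf 'size %s\n' "$size"
  rm -f "$tmp"
}

printf 'pretend this is a huge video file\n' | lfs_pointer_sketch
```

Whatever the size of the input, the output stays at three short lines, which is what ends up in the git object database in place of the large file.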

git-crypt

git-crypt, a tool designed to ensure some files are stored encrypted in an otherwise-public repo, also relies on git filters for its implementation.

Basically, the high level concept is similar to the Example 1 we saw above: the filter’s clean command encrypts the file during checkin, and its smudge command decrypts it during checkout. Except that instead of calling openssl enc … like we did in that example above, it delegates the encryption/decryption to the git-crypt executable:

[filter "git-crypt"]
	smudge = "git-crypt" smudge
	clean = "git-crypt" clean
	required = true

During git-crypt init, git-crypt generates a symmetric key and stores it in the repo’s .git/git-crypt/keys/default file.
Then git-crypt clean reads that symmetric key and uses it to encrypt the input file content before it is stored in git (and thus before it ever reaches the remote), while git-crypt smudge uses it to decrypt the encrypted blob content on checkout.

For anyone who clones the repo and doesn’t have git-crypt installed or set up in their git config, the filter will just be a no-op, which means they’ll just see the raw, encrypted content that was stored in the git remote.

Where does git-crypt store the encryption key?

When a user calls git-crypt unlock "path/to/encryption/key/file", git-crypt copies that symmetric key file to the .git/git-crypt/keys/default file in your local clone, then decrypts all the encrypted files using that symmetric key. From that point on, the git-crypt filter will apply as normal (i.e. any file assigned filter=git-crypt in .gitattributes will be encrypted via git-crypt clean on checkin and decrypted via git-crypt smudge on checkout).

How does git-crypt handle its support of GPG keys?

This is getting a bit outside of the topic of git filters, but since I learned about it while looking into all this, I figured it was worth mentioning: git-crypt also supports git-crypt add-gpg-user <user-email-or-gpg-id> as a way to avoid sharing the symmetric key between users manually. This also gives us a good example of how far you can go when using a more involved script or executable as your git filter.

The way it works with git-crypt is that:

When you call git-crypt add-gpg-user, it encrypts the .git/git-crypt/keys/default symmetric key with that user’s public GPG key, and stores the result in a .git-crypt/keys/default/0/<gpg-key-fingerprint>.gpg file that it commits into the repo.
Since those .gpg files are encrypted with the user’s public key, only the user this GPG key belongs to (and thus who has the corresponding private GPG key) will be able to decrypt the corresponding .git-crypt/keys/default/0/<gpg-key-fingerprint>.gpg. This is also why it’s fine to commit those.
When someone clones the repo, there won’t be a .git/git-crypt/keys/default file containing the private symmetric key yet; so at that point the files will still be encrypted.
Then when the user runs git-crypt unlock (without providing an explicit path to a symmetric key file to use to unlock in that case, given in this scenario we precisely want to avoid having to share the symmetric key file manually amongst users), git-crypt will look at all the GPG private keys installed on the current user’s machine keyring (gpg --list-secret-keys), find the first .git-crypt/keys/default/0/<gpg-key-fingerprint>.gpg file matching one of those private keys, and decrypt that .gpg file with that private key. This will allow it to get back the original symmetric key, store it in .git/git-crypt/keys/default locally, and use it to decrypt the content of the repo like before.

So basically, when you add a GPG user to a git-crypt-managed repo, git-crypt encrypts the symmetric key using that user’s GPG public key and commits the result; only that user can then decrypt that GPG-encrypted file to restore the symmetric key and use it. The rest of the process is unchanged compared to when you use a symmetric key directly (and pass it around between users manually), without that extra GPG layer.
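
The idea of committing the symmetric key encrypted for one specific user can be illustrated without GPG. The sketch below swaps in plain RSA via the openssl CLI as a stand-in for GPG, so it shows the principle only and not what git-crypt actually runs; all file names are made up.

```shell
#!/usr/bin/env bash
# Principle only: wrap a symmetric key with a user's public key, unwrap it with the private key.
# RSA via openssl stands in for GPG here; every file name is hypothetical.

work=$(mktemp -d)

# The user's key pair (with GPG, this would already live in their keyring).
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
  -out "$work/user-private.pem" 2>/dev/null
openssl pkey -in "$work/user-private.pem" -pubout -out "$work/user-public.pem"

# The repo's symmetric key (the role played by .git/git-crypt/keys/default).
openssl rand -out "$work/symmetric.key" 32

# "Add user": encrypt the symmetric key with the PUBLIC key; the result is safe to commit.
openssl pkeyutl -encrypt -pubin -inkey "$work/user-public.pem" \
  -in "$work/symmetric.key" -out "$work/symmetric.key.wrapped"

# "Unlock": only the holder of the PRIVATE key can recover the symmetric key.
openssl pkeyutl -decrypt -inkey "$work/user-private.pem" \
  -in "$work/symmetric.key.wrapped" -out "$work/symmetric.key.recovered"

cmp -s "$work/symmetric.key" "$work/symmetric.key.recovered" && echo "symmetric key recovered"
```

Adding a second user would simply mean producing a second wrapped copy of the same symmetric key with that user’s public key, which matches the one-file-per-fingerprint layout described above.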

Git diff drivers & `textconv`

Git diff drivers are very similar to git filter drivers, in the sense that you define the diff drivers in your git config (e.g. .git/config), while you declare which files use which driver via diff=<drivername> in .gitattributes.

The use cases of diff drivers are a bit different from those of git filters though, as they are focused on customizing how git generates and shows diffs for those files, not on modifying the actual content of the files on checkin or checkout like git filters do.

In practice, “diff drivers” support a lot of features around generating diffs, like being able to specify a custom command to compute the diff itself, which could be useful e.g. if you have a custom diff program that is more suitable for particular file types. But in the context of this post, I want to focus on their textconv option.

The textconv = <command>… option of a git diff driver simply tells git to use that command to pre-process the file’s content before using the result as the input for doing the diff. One typical use case for this is to transform binary files into a textual representation more useful in a diff.
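
The effect of textconv is easy to observe in a throwaway repo. The sketch below uses od as the converter only because it ships virtually everywhere; the hex driver name and the *.bin pattern are invented for the demo.

```shell
#!/usr/bin/env bash
# Sandbox demo of a textconv diff driver. The "hex" driver name is hypothetical.

repo=$(mktemp -d)
git init -q "$repo"

# Route *.bin files through the "hex" diff driver, which dumps bytes as hex text.
printf '*.bin diff=hex\n' > "$repo/.gitattributes"
git -C "$repo" config diff.hex.textconv 'od -An -tx1'

# Stage a first version of a binary file, then change one byte in the working copy.
printf '\000\001\002' > "$repo/data.bin"
git -C "$repo" add .gitattributes data.bin
printf '\000\001\003' > "$repo/data.bin"

# git runs "od -An -tx1 <file>" on both versions and diffs the two text outputs.
diff_output=$(git -C "$repo" --no-pager diff -- data.bin)
printf '%s\n' "$diff_output"
```

Instead of a binary-files notice, the diff contains one removed and one added line of hex bytes. The file stored in git is untouched, since textconv only affects what is displayed.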

Example 3: Displaying metadata about an image

A standard git diff on a *.jpg or *.png file is not gonna be super useful, as it will just print “binary files differ” when they have changed.

But if we create a diff driver that extracts the metadata (width, height, alpha channel, …) from the image and use that as a textconv command, the diff becomes much more useful:

# In your .git/config
[diff "image"]
  textconv = exiftool

Where exiftool is a tool you can install via brew and which shows EXIF metadata of an image file. Then in our .gitattributes we can ask to use that driver when doing diffs on image files:

*.png diff=image
*.jpg diff=image
*.gif diff=image

Then if you modify an image file to change its dimensions or add/remove its alpha channel etc, those differences in metadata will appear in git diff file.png instead of “binary files differ”.

The issue with this is that if the change you made to the image file didn’t change its metadata (e.g. you flipped the image vertically or just changed some pixels without changing its size, colorspace or channels), then git status will still show the file as changed and needing to be git add-ed and committed, but you’ll see an empty git diff for it, as the output of exiftool will be the same before and after.

If that bothers you, you can solve this by adjusting the textconv to call a different command, maybe a custom script that not only prints the EXIF metadata, but also e.g. the MD5 of the file, that way at least the diff won’t be empty if the binary content of the file changed in any way even if the EXIF metadata stayed the same. For example, just update your diff driver in .git/config to use textconv = ./imageinfo.sh, then create that ./imageinfo.sh script with the following content:

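#!/bin/bash
# Usage: imageinfo.sh <image-file>; git passes the file to convert as "$1"

# Checksum first, so that any change to the binary content shows up in the diff
printf 'md5 checksum: '
md5sum "$1" | cut -d' ' -f1
# Then the metadata, excluding exiftool's File group (file name, dates, …)
echo "Image Metadata:"
exiftool -x File:All "$1"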

Example 4: Diffing ZIP archives

Similarly, a git diff on a .zip file won’t be super helpful other than an obscure “binary files differ”. But what if we told git to transform that binary .zip file into a listing of its content when showing it in diffs? This is as easy as adding this to your .git/config (see zipinfo man page):

[diff "zip"]
  textconv = zipinfo -l --h -t

Then adding *.zip diff=zip to your .gitattributes!

Now when you’re diffing 2 versions of a .zip file, git diff will not just show you “binary files differ” but will instead show you the diff between the two listings of those files, allowing you to see which files were added/removed/modified in that ZIP 🎉

💡 If you want this to be available for all the repos on your machine, just add this to your global git config (git config --global diff.zip.textconv 'zipinfo -l --h -t'), and add the *.zip diff=zip line to your global .gitattributes file (whose path you can set via git config --global core.attributesFile ~/.gitattributes for example).

Conclusion

That’s it for today!

There’s much more to be learned about git filters and diff drivers, and many possible applications; and there are many more gems to discover in git too of course (like merge drivers)! But I hope that this overview gave you a taste of not only how those work and what they can be useful for, but also how some tools like git-lfs and git-crypt make use of them internally!

Published by Olivier Halligon
