While investigating potential solutions for storing secret files securely in our git repos, I recently looked into how some little-known features of git, namely filters and diff drivers, work. While most git users will rarely need to know the technical details of how those work, I found that they can be pretty powerful in some specific use cases.
How git filters work
The goal of git filters is to transform the content of files when they are checked in to and checked out from a git repo. You define a filter in 2 places:
- In the .gitattributes file, you define file patterns and the name of the filter driver to apply to each of them.
- In your git config, you define the filter driver itself, i.e. what command to run on the incoming blob during checkout (smudge command), and what command to run on the file content during checkin (clean command). That filter driver can also indicate whether it’s required to succeed for the content to be usable.

Example 1: a simplistic encryption filter
If we wanted to run a filter that encrypts/decrypts the content of some files on the fly, we could define a my-encrypt-filter filter driver like so in our git config:
[filter "my-encrypt-filter"]
clean = openssl enc -aes-256-cbc -k "mysecretkey"
smudge = openssl enc -d -aes-256-cbc -k "mysecretkey"
required = true
And in our .gitattributes, declare which files should go through that filter:
secrets/* filter=my-encrypt-filter
secret-*.json filter=my-encrypt-filter
# …
In this example:
- When we git add and commit a secrets/foo.txt file, the content of that file will be run through the clean command of the filter, i.e. openssl enc …, which will encrypt the file content before it is stored in the repo and later pushed to the remote.
- When we (or another user having that same filter defined in their config) pull and check out the file from the remote into the working copy, its content will be processed by the smudge command of the filter, i.e. openssl enc -d, to decrypt the encrypted blob coming from the git remote before putting the decrypted content in our local working copy.
⚠️ The implementation in this example is a bit too simplistic in reality, as it would need some adjustments before being used for real-world applications to ensure idempotence.
As the .gitattributes documentation suggests, clean and smudge actions should be idempotent; i.e. “clean→clean” should be equivalent to “clean”, and “smudge→smudge→clean” should be equivalent to “clean” too.
This is clearly not the case in this simplified example above—as applying the clean action (aka openssl enc) two times in succession would result in doing another encryption pass on top of the already-encrypted content.
In a real-world example, you should thus replace those direct openssl … calls with e.g. a wrapper script that first checks if the file is already encrypted before trying to re-encrypt it again (and vice-versa for decryption).
For example, you could imagine such a wrapper script would inject some “magic marker” as a prefix to the encrypted data during encryption. Then when the script is called to do a clean action, it’d only call openssl enc … on the input content (then add the marker) if that content does not already contain that magic prefix indicating it was already encrypted. Likewise, when called to do a smudge action, it’d only remove the prefix before passing the rest of the binary data to openssl enc -d … if the input binary content to smudge/decrypt actually starts with said prefix, and do nothing otherwise (i.e. if that content is already decrypted).
PS: As far as I could tell, that approach is basically what git-crypt (discussed below) is using internally.
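To make the marker idea more concrete, here is a minimal sketch of such a wrapper, written as two shell functions. The MYENC1: marker, the function names and the MY_FILTER_KEY variable are all hypothetical names picked for illustration (they are not taken from git-crypt or any real tool), and the key handling is kept as naive as in Example 1.

```shell
#!/bin/sh
# Sketch of an idempotent clean/smudge wrapper (all names here are hypothetical).
MAGIC='MYENC1:'                       # made-up marker prepended to encrypted data
KEY="${MY_FILTER_KEY:-mysecretkey}"   # same naive key handling as Example 1

# Succeeds if the file given as $1 starts with the marker.
has_magic() {
  [ "$(head -c "${#MAGIC}" "$1" | tr -d '\0')" = "$MAGIC" ]
}

# clean: encrypt stdin, unless it already carries the marker.
filter_clean() {
  tmp="$(mktemp)"
  status=0
  cat > "$tmp"
  if has_magic "$tmp"; then
    cat "$tmp"                        # already encrypted: pass through
  else
    printf '%s' "$MAGIC"
    openssl enc -aes-256-cbc -k "$KEY" < "$tmp" || status=$?
  fi
  rm -f "$tmp"
  return "$status"
}

# smudge: decrypt stdin, but only if it carries the marker.
filter_smudge() {
  tmp="$(mktemp)"
  status=0
  cat > "$tmp"
  if has_magic "$tmp"; then
    # Skip the marker bytes, then decrypt the rest.
    tail -c "+$(( ${#MAGIC} + 1 ))" "$tmp" |
      openssl enc -d -aes-256-cbc -k "$KEY" || status=$?
  else
    cat "$tmp"                        # already plain text: pass through
  fi
  rm -f "$tmp"
  return "$status"
}

# Dispatch when invoked as `<script> clean` or `<script> smudge`.
case "${1:-}" in
  clean)  filter_clean ;;
  smudge) filter_smudge ;;
esac
```

Saved as, say, a secret-filter.sh script on your PATH (again a made-up name), this could be wired into the filter driver with clean = secret-filter.sh clean and smudge = secret-filter.sh smudge.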
Example 2: Auto-format on commit
Another example use case is to use a filter to reformat a text file on checkin.
For example, we could imagine using jq to ensure that when a JSON file is committed it’s indented in human-readable format. We do that by declaring the following filter in our repo’s .git/config:
[filter "json-prettify"]
clean = jq
Then telling our .gitattributes to apply it to all JSON files:
*.json filter=json-prettify
Then if we create a new JSON file, e.g. echo '{"name":"Alice","age":42}' >user.json:
- Once we git add user.json, its content will go through jq (as if we did cat user.json | jq), and the output of that filter (i.e. the prettified JSON) is what will actually be committed, and later pushed to the remote, as the content of that file.
- Then when another user checks out the repo, or if you delete the user.json file and git restore it, or if you look at the content of that file on the remote (e.g. github.com), you’ll see the content of that user.json file has been committed pretty-formatted.
- In that example, there’s no need for a smudge filter command, as we only want to format outgoing JSON before commit but don’t need to transform it back on checkout.
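If you want to see this for yourself without touching a real repo, the following sketch replays the steps in a throwaway repository and prints what git actually stored for the file. The demo_json_prettify function name is made up for this example, and it uses jq . as the explicit form of the clean command.

```shell
#!/bin/sh
# Sketch: stage a compact JSON file through a jq-based clean filter in a
# throwaway repo, then print what git actually stored for it.
demo_json_prettify() {
  repo="$(mktemp -d)"
  (
    cd "$repo" || exit 1
    git init -q .
    git config filter.json-prettify.clean 'jq .'   # explicit form of `clean = jq`
    echo '*.json filter=json-prettify' > .gitattributes
    echo '{"name":"Alice","age":42}' > user.json
    git add user.json
    git show :user.json    # the staged blob, i.e. the output of the clean filter
  )
  status=$?
  rm -rf "$repo"
  return "$status"
}
```

Calling demo_json_prettify should print the indented version of the object, even though the user.json file written to disk was a single compact line.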
Notes
The filter driver may be defined at any level of your git config, i.e.:
- In the .git/config file, which is local to your clone and can’t be committed
- In your global $HOME/.gitconfig, which applies to all repos on your machine.
A missing filter driver definition in the git config is not an error, and just makes the filter a no-op passthru. This means that:
- The definitions of filter drivers are not committed in your repo (.git/* files are only local), so they are not automatically shared with other users of the repo. In that sense, it’s a bit similar to git hooks, which are defined in your local .git/hooks/* and not committed to be shared with others either.
- When someone else clones the repo, they likely won’t have the definition for that filter driver in their own ~/.gitconfig; so the file just won’t be processed by the filter, and its raw content will be kept as-is instead.
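Because a missing driver fails silently, it can be handy to check for it explicitly. Here is a small sketch that lists the filter names assigned to tracked files for which your effective git config has neither a clean nor a smudge command. The missing_filter_drivers name is made up for this example, and it assumes it is run from inside the repo.

```shell
#!/bin/sh
# Sketch: list filter names referenced by tracked files (via .gitattributes)
# that have neither a clean nor a smudge command in the effective git config.
missing_filter_drivers() {
  git ls-files | git check-attr --stdin filter |
    awk -F': ' '{ print $NF }' | sort -u |
    while read -r name; do
      # Skip files with no filter assigned (or a valueless filter attribute).
      case "$name" in unspecified|unset|set) continue ;; esac
      if ! git config --get "filter.$name.clean" >/dev/null &&
         ! git config --get "filter.$name.smudge" >/dev/null; then
        echo "$name"
      fi
    done
}
```

An empty output means every filter referenced by a tracked file has at least one command defined somewhere in your config.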
If you update the definition of your filter driver in your git config, you can use:
- git add --renormalize . (or git add --renormalize <files>) to ask git to re-apply the clean commands of all filter drivers applied to each file
- git restore --source=HEAD -- . to reapply the smudge command of all filter drivers applied to each file
The latter is especially useful if you clone a repo that has filters assigned in its .gitattributes and only add the filter driver definition to your local .git/config afterwards, as you’d then want the files to be re-processed by the filter.
Practical examples
Git-LFS
git-lfs (Git Large File Storage) lets large files in your repo (e.g. video files) be stored outside of the git repo itself, so that such massive files don’t bloat your git history and object database.
This works by leveraging the git filters we just learned about, by declaring a filter driver similar to this in your ~/.gitconfig:
[filter "lfs"]
clean = git-lfs clean -- %f
smudge = git-lfs smudge -- %f
required = true
Then, roughly speaking, for every file in your .gitattributes that is assigned filter=lfs:
- On checkin, the git-lfs clean command takes care of replacing the actual content of the large file with textual information representing a reference to that file’s current content version. (Note: %f is replaced by git with the file name when calling the command)
- Then on git push, a custom pre-push hook uploads the large file’s real content to some external storage, while what’s actually pushed to the git remote is the tiny textual information that was swapped for the file content during checkin
- On checkout, the textual content that was stored in the repo for that file is passed as input to git-lfs smudge, which transforms that info back to the real file content by downloading the actual file content from the external storage and returning it as the filter output.
(Again simplifying and not going into the details here, but you get the idea)
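To give an idea of what that tiny textual information looks like, here is a rough imitation of what a clean step could emit, following the pointer file layout described in the Git LFS specification (a version line, the SHA-256 of the content, and its size in bytes). This is only a sketch for illustration: the lfs_like_pointer name is made up, it assumes sha256sum is available, and the real git-lfs clean also keeps the actual content in a local cache for the later upload, which this does not do.

```shell
#!/bin/sh
# Sketch only: print an LFS-style pointer for the content read from stdin.
lfs_like_pointer() {
  tmp="$(mktemp)"
  cat > "$tmp"
  oid="$(sha256sum "$tmp" | cut -d' ' -f1)"
  size="$(( $(wc -c < "$tmp") ))"    # arithmetic strips wc's padding, if any
  rm -f "$tmp"
  printf 'version https://git-lfs.github.com/spec/v1\n'
  printf 'oid sha256:%s\n' "$oid"
  printf 'size %s\n' "$size"
}
```

For example, printf 'hello\n' | lfs_like_pointer prints three short lines, whatever the size of the input; that small text is the kind of content that ends up in the git history in place of the large file.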
git-crypt
git-crypt, which is a tool dedicated to ensuring that some files are stored encrypted in an otherwise-public repo, also relies on git filters for its implementation.
Basically, the high level concept is similar to the Example 1 we saw above: the filter’s clean command encrypts the file during checkin, and its smudge command decrypts it during checkout. Except that instead of calling openssl enc … like we did in that example above, it delegates the encryption/decryption to the git-crypt executable:
[filter "git-crypt"]
smudge = "git-crypt" smudge
clean = "git-crypt" clean
required = true
- During git-crypt init, git-crypt generates a symmetric key, and stores it in the repo’s .git/git-crypt/keys/default file.
- Then git-crypt clean reads that symmetric key and uses it to encrypt the input file content before it is stored in the repo and sent to the git remote, while git-crypt smudge uses it to decrypt encrypted blob content coming from the remote.
For anyone who clones the repo and doesn’t have git-crypt installed or set up in their git config, the filter will just be a no-op, which means they’ll just see the raw, encrypted content that was stored in the git remote.
Where does git-crypt store the encryption key?
When a user calls git-crypt unlock "path/to/encryption/key/file", git-crypt copies that symmetric key file to the .git/git-crypt/keys/default file in your local working copy, then decrypts all the encrypted files using that symmetric key. From that point on, the git-crypt filter will apply as normal (i.e. any file assigned the filter=git-crypt in .gitattributes will be encrypted via git-crypt clean before being pushed to the remote and decrypted via git-crypt smudge when received from the remote)
How does git-crypt handle its support of GPG keys?
This is getting a bit outside of the topic of git filters, but I learned about it while looking into git-crypt and found it interesting: git-crypt also supports git-crypt add-gpg-user <user-email-or-gpg-id> as a way to avoid having to share the symmetric key between users manually. This also gives us a good example of how far you can go with using a more involved script or executable as your git filter to cover more features.
The way it works with git-crypt is that:
- When you call git-crypt add-gpg-user, it encrypts the .git/git-crypt/keys/default private symmetric key with that user’s public GPG key, and stores the result in a .git-crypt/keys/default/0/<gpg-key-fingerprint>.gpg file that it commits into the repo.
- Since those .gpg files are encrypted with the user’s public key, only the user this GPG key belongs to (and thus who has the corresponding private GPG key) will be able to decrypt the corresponding .git-crypt/keys/default/0/<gpg-key-fingerprint>.gpg. This is also why it’s fine to commit those.
- When someone clones the repo, there won’t be a .git/git-crypt/keys/default file containing the private symmetric key yet; so at that point the files will still be encrypted.
- Then when the user runs git-crypt unlock (without providing an explicit path to a symmetric key file in that case, given that in this scenario we precisely want to avoid having to share the symmetric key file manually amongst users), git-crypt will look at all the GPG private keys installed on the current user’s keyring (gpg --list-secret-keys), find the first .git-crypt/keys/default/0/<gpg-key-fingerprint>.gpg file matching one of those private keys, and decrypt that .gpg file with that private key. This will allow it to get back the original symmetric key, store it in .git/git-crypt/keys/default locally, and use it to decrypt the content of the repo like before.
So basically when you add a GPG user to a git-crypt-managed repo, it encrypts the symmetric key using that GPG public key and commits the result; then only that user can decrypt that GPG-encrypted file to restore the symmetric key and use it. The rest of the process is unchanged compared to when you only use a symmetric key directly (and pass it around between users manually) without that extra GPG layer.
Git diff drivers & textconv
Git diff drivers are very similar to git filter drivers, in the sense that you define the diff drivers in .git/config while you define which files use which diff=<drivername> in .gitattributes.
The use cases of diff drivers are a bit different from git filters though, as they are focused on customizing how git generates and shows diffs for those files, not modifying the actual content of the files on disk or in the repo like git filters do.
In practice, “diff drivers” support a lot of features around generating diffs, like being able to specify a custom command to use to compute the diff itself, which could be useful e.g. if you have a custom diff program that is more suitable for creating diffs of particular file types. But in the context of this post, I want to focus on their textconv option.
The textconv = <command>… option of a git diff driver simply tells git to use that command to pre-process the file’s content before using the result as the input for doing the diff. One typical use case for this is to transform binary files into a textual representation more useful in a diff.
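Conceptually, this is close to running the converter on both versions of the file and diffing the two outputs. The sketch below imitates that with plain files outside of git; textconv_diff and byte_count are made-up names for illustration, and real git takes care of more than this (extracting the blobs to temporary files, diff headers, and so on).

```shell
#!/bin/sh
# Sketch: imitate a textconv-based diff between two files on disk.
# $1 is the converter command, $2 and $3 are the old and new versions.
textconv_diff() {
  conv="$1"
  a="$(mktemp)"
  b="$(mktemp)"
  status=0
  $conv "$2" > "$a"    # unquoted on purpose so "cmd --flag" splits into words
  $conv "$3" > "$b"
  diff -u "$a" "$b" || status=$?
  rm -f "$a" "$b"
  return "$status"
}

# Example converter (hypothetical): describe a file by its size only.
byte_count() {
  echo "size: $(( $(wc -c < "$1") )) bytes"
}
```

With such a converter, textconv_diff byte_count old.bin new.bin prints a unified diff when the two sizes differ, and prints nothing when the files have different bytes but the same size, since the diff only ever sees the converter’s output.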
Example 3: Displaying metadata about an image
A standard git diff on a *.jpg or *.png file is not gonna be super useful, as it will just print “binary files differ” when they have changed.
But if we create a diff driver to extract the metadata (width, height, alpha channel, …) from the image and use that as a textconv command, that can make the diff quite a bit more useful:
# In your .git/config
[diff "image"]
textconv = exiftool
Where exiftool is a tool you can install via brew and which shows EXIF metadata of an image file. Then in our .gitattributes we can ask to use that driver when doing diffs on image files:
*.png diff=image
*.jpg diff=image
*.gif diff=image
Then if you modify an image file to change its dimensions or add/remove its alpha channel etc, those differences in metadata will appear in git diff file.png instead of “binary files differ”.
The issue with this is that if the change you made to the image file didn’t change its metadata (e.g. you flipped the image vertically, or just changed some pixels without changing its size, colorspace or channels), git status will still show the file as changed and needing to be git add-ed and committed, but you’ll see an empty git diff for it, as the output of exiftool will be the same before and after.
If that bothers you, you can solve this by adjusting the textconv to call a different command, maybe a custom script that not only prints the EXIF metadata, but also e.g. the MD5 of the file, that way at least the diff won’t be empty if the binary content of the file changed in any way even if the EXIF metadata stayed the same. For example, just update your diff driver in .git/config to use textconv = ./imageinfo.sh, then create that ./imageinfo.sh script with the following content:
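#!/bin/bash
# Print a checksum first, so that any byte-level change shows up in the diff.
printf 'md5 checksum: '
md5sum "$1" | cut -d' ' -f1
# Then print the metadata, excluding the File group (file name, dates, permissions).
echo "Image Metadata:"
exiftool -x File:All "$1"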
Example 4: Diffing ZIP archives
Similarly, a git diff on a .zip file won’t be super helpful other than an obscure “binary files differ”. But what if we told git to transform that binary .zip file into a listing of its content when showing it in diffs? This is as easy as adding this to your .git/config (see zipinfo man page):
[diff "zip"]
textconv = zipinfo -l --h -t
Then adding *.zip diff=zip to your .gitattributes!
Now when you’re diffing 2 versions of a .zip file, git diff will not just show you “binary files differ” but will instead show you the diff between the two listings of those files, allowing you to see which files were added/removed/modified in that ZIP 🎉
💡 If you want this to be available for all the repos on your machine, just add this to your global git config (git config --global diff.zip.textconv 'zipinfo -l --h -t'), and add the *.zip diff=zip line to your global .gitattributes file (whose path you can set via git config --global core.attributesFile ~/.gitattributes for example).
Conclusion
That’s it for today!
There’s much more to be learned about git filters and diff drivers, and many possible applications; and there are many more gems to learn about in git too of course (like merge drivers)! But I hope that this overview gave you a taste of not only how those work and what they can be useful for, but also how some tools like git-lfs and git-crypt make use of them internally!
