Mar 27, 2022


This week I deployed my first two PRs to production: a refactoring pulling some classes out into a new module, and a configuration change updating many, many cronjobs to run at the correct local time after today’s DST switch.

It’s kind of funny how DST was a problem at GOV.UK, with its ancient Icinga set-up for which I manually changed the “in-hours / out-of-hours” thresholds twice a year,1 and it’s still a problem at GoCardless, which is using a fancy modern Kubernetes set-up.
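
The underlying annoyance is easy to demonstrate. Cron schedules are typically written in a fixed timezone (often UTC), so a job meant to fire at a fixed *local* time needs its schedule shifted whenever the UTC offset changes. A quick Python sketch using `zoneinfo`, with `Europe/London` purely as an example:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def utc_hour_for_local(local_hour: int, year: int, month: int, day: int) -> int:
    """The UTC hour at which a job must fire to run at local_hour in London."""
    local = datetime(year, month, day, local_hour, tzinfo=ZoneInfo("Europe/London"))
    return local.astimezone(ZoneInfo("UTC")).hour

print(utc_hour_for_local(9, 2022, 3, 26))  # before the switch (GMT, UTC+0): 9
print(utc_hour_for_local(9, 2022, 3, 28))  # after the switch (BST, UTC+1): 8
```

So every UTC-scheduled cronjob that cares about local time has to move by an hour, twice a year.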

Some things never change.


This week I read:


This week I’ve focussed on observability and improving code quality. The two major new features are structured logging and Prometheus metrics.

The log output now looks like this:

{"level":"INFO","fields":{"message":"UDP request","peer":""},"target":"resolved"}
{"level":"INFO","fields":{"message":"ok","question":"barrucadu.co.uk. IN A","authoritative_hits":"0","override_hits":"0","blocked":"0","cache_hits":"1","cache_misses":"0","nameserver_hits":"0","nameserver_misses":"0","duration_seconds":"0.000094706"},"target":"resolved"}
{"level":"INFO","fields":{"message":"pruned cache","expired":"18","pruned":"0"},"target":"resolved"}

There are some other output options available too. For example, timestamps are included by default, but I’ve disabled them for the systemd unit, as the journal already records timestamps.
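
resolved isn’t written in Python, but just to illustrate the shape of that output: a single-line JSON formatter like the one above can be hand-rolled in a few lines. The `extra_fields` mechanism here is my own invention for the sketch, not anything resolved does:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Format records as single-line JSON, roughly mimicking the log lines above."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "fields": {
                "message": record.getMessage(),
                # `extra_fields` is a made-up convention for this sketch,
                # set via logging's `extra` keyword argument.
                **getattr(record, "extra_fields", {}),
            },
            "target": record.name,
        }, separators=(",", ":"))

logger = logging.getLogger("resolved")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("pruned cache", extra={"extra_fields": {"expired": "18", "pruned": "0"}})
```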

My dashboard, which uses most of the new metrics, looks like this:

DNS resolver dashboard

And it’s already giving some interesting insights!
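
For anyone wondering what’s behind a dashboard like this: Prometheus scrapes a plain-text `/metrics` endpoint and graphs the values over time. Here’s a toy, stdlib-only Python sketch of serving counters in the text exposition format; resolved itself isn’t Python, and its real metric names may differ:

```python
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counters, incremented by the resolver as it answers queries.
COUNTERS: Counter = Counter()

def render_metrics() -> str:
    """Render counters in the Prometheus text exposition format."""
    return "".join(f"resolved_{name} {value}\n" for name, value in sorted(COUNTERS.items()))

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To actually serve metrics, something like:
#   HTTPServer(("127.0.0.1", 9090), MetricsHandler).serve_forever()
```

Prometheus then scrapes that endpoint on an interval and the dashboard queries the stored time series.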

For example, my cache size limit is 1,000,000 records, but it only holds ~4,400 right now. It looks like new records are being added only a little faster than old ones are expiring. This makes me wonder whether it would be worth having some sort of automatic cache renewal for entries which get a lot of hits: when the expiry time gets close, pre-emptively make a request to the upstream nameservers and replace the cached entry, so that queries can keep hitting the cache.
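
The renewal idea could look something like this sketch, in Python for illustration; the `refresh_window` and `hot_threshold` knobs are hypothetical, not anything resolved implements:

```python
import time

class RefreshingCache:
    """Toy TTL cache that flags hot entries for pre-emptive refresh."""

    def __init__(self, refresh_window: float = 30.0, hot_threshold: int = 5):
        self.entries = {}  # name -> [record, expires_at, hits]
        self.refresh_window = refresh_window  # hypothetical knob: seconds before expiry
        self.hot_threshold = hot_threshold    # hypothetical knob: hits to count as "hot"

    def insert(self, name: str, record, ttl: float) -> None:
        self.entries[name] = [record, time.monotonic() + ttl, 0]

    def get(self, name: str):
        entry = self.entries.get(name)
        if entry is None or entry[1] <= time.monotonic():
            return None  # missing or expired
        entry[2] += 1  # count the hit
        return entry[0]

    def due_for_refresh(self) -> list:
        """Names that are popular and close to expiry: re-query these upstream."""
        now = time.monotonic()
        return [name for name, (_, expires, hits) in self.entries.items()
                if hits >= self.hot_threshold and 0 < expires - now <= self.refresh_window]
```

A background task could periodically call `due_for_refresh`, re-resolve those names upstream, and `insert` the fresh records before the old ones expire.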

Another thing is the upstream nameserver misses: these are queries which couldn’t be answered locally, got sent upstream, and still couldn’t be answered. They’re bad because they mean resolved (and the clients!) are wasting time on queries which won’t produce anything useful. On the dashboard, the gradient of that line suddenly becomes less steep, and the requests per second drop off pretty obviously too. That’s because I noticed a lot of queries for azathoth., the hostname of my desktop computer, being sent upstream. This turned out to be a syncthing misconfiguration, which I’ve now fixed, so those queries stopped.

This week I merged the following PRs:

And opened the following issues:


Back in 2019 I rewrote the script which generates this site. I wrote it from scratch in Python. No fancy features, really just a bit of plumbing around pandoc and jinja2.

Every time I’ve tried to use one of the big-name static site generator tools, I’ve found them to be overly complex and yet very restrictive at the same time. If you want to do something the developers didn’t anticipate, and that something could be as simple as “I want a blog without dates in filenames”, you have to write code. And it’s never straightforward code, because you have to hook into some complicated, sort-of-but-not-really general-purpose framework.

I’m not doing anything complex here. It feels like an off-the-shelf static site generator should be able to do what I want easily, but I’ve never really found that to be the case.

This week I made the biggest conceptual change ever to how this site is generated. I added one of the killer features of static site generators: a cache, so that editing one page doesn’t require recompiling the whole site.

I’d been putting this off for a while, but since I started writing these weeknotes the number of posts here has exploded. Waiting 4 minutes to build the site was just unwieldy, and it was hampering my writing.

Surely this was a big change, right? After all, this is one of the major reasons people use static site generators rather than write their own!

Well… 2 changed files with 31 additions and 18 deletions.

Er, why do people use those complicated tools, again?
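
For the curious, the general idea is small enough to sketch. This is an illustration, not my actual diff: hash each source file, and only rebuild pages whose hash has changed since the last run.

```python
import hashlib
import json
import pathlib

# Hypothetical cache file name, for illustration only.
CACHE_FILE = pathlib.Path(".buildcache.json")

def content_hash(path: pathlib.Path) -> str:
    """Hash a source file's contents, so renames and touch-es don't force rebuilds."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def pages_to_rebuild(sources: list, cache: dict) -> list:
    """Return only the source files whose content hash changed since the last build."""
    return [p for p in sources if cache.get(str(p)) != content_hash(p)]

def save_cache(sources: list) -> None:
    """Record the current hashes for next time."""
    CACHE_FILE.write_text(json.dumps({str(p): content_hash(p) for p in sources}))
```

Pages missing from the cache hash to nothing, so a fresh checkout still rebuilds everything.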

  1. Hey, if a GOV.UK person reads this you should check if that’s been done.↩︎