#184
Mar 27, 2022
Work
This week I deployed my first two PRs to production: a refactoring pulling out some classes into a new module and a configuration change updating many many cronjobs to run at the correct local time after today’s DST switch.
It’s kind of funny how DST was a problem at GOV.UK, with its ancient Icinga set-up for which I manually changed the “in-hours / out-of-hours” thresholds twice a year,[^1] and it’s still a problem at GoCardless, which is using a fancy modern Kubernetes set-up.
Some things never change.
Books
This week I read:
Volume 13 of So I’m a Spider, So What? by Okina Baba
Well, I’m caught up. The English translation of volume 14 isn’t out until June. It’s been a fun story so far. It definitely feels like it’s building up towards a climax, and I look forward to seeing what comes next now that the two timelines are in sync. I do think that worked really well as a storytelling device, and we now have the problem of how to persuade the human reincarnations that the monster reincarnations are right about what must be done, when the human reincarnations have been living a lie for so long. But oh wait, way back at the end of volume 5, the hero levelled up his Taboo skill, and all the truth of the world was poured into his brain…
Volumes 1 and 2 of Delicious in Dungeon by Ryoko Kui
Another fun story, this time about delving into a dungeon and eating the monstrous inhabitants therein. There is a story: one of the adventurers in the party got eaten by a dragon, and the others need to get them back before they’re digested, so they can be resurrected. Where does the monster-eating come into this, you ask? Well, the reason they had a bad time with the dragon was that they were all distracted by hunger, you see.
It’s a pretty thin premise, but it’s a comedy manga so that’s fine. There’s just enough story to keep things moving, but the story isn’t the point, the humour is.
Annoyingly, I can’t find an English translation of volume 3 in physical form. Volumes 1, 2, and then 4 onwards are easy. But 3? I might just have to read that one online, and leave a gap on my shelves. I don’t like leaving gaps.
resolved
This week I’ve focussed on observability and improving the code quality. The two major new features are structured logging and Prometheus metrics.
The log output now looks like this:
{"level":"INFO","fields":{"message":"UDP request","peer":"10.0.0.3:33602"},"target":"resolved"}
{"level":"INFO","fields":{"message":"ok","question":"barrucadu.co.uk. IN A","authoritative_hits":"0","override_hits":"0","blocked":"0","cache_hits":"1","cache_misses":"0","nameserver_hits":"0","nameserver_misses":"0","duration_seconds":"0.000094706"},"target":"resolved"}
{"level":"INFO","fields":{"message":"pruned cache","expired":"18","pruned":"0"},"target":"resolved"}
There are some other formats available too; for example, timestamps are included by default, but I’ve disabled them for the systemd unit as the journal already records its own.
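That JSON shape looks like the tracing_subscriber JSON formatter, so here’s a minimal sketch of how such a setup might be wired up. This is an assumption, not resolved’s actual code, and the `LOG_FORMAT` env var name is hypothetical:

```rust
// A minimal sketch, not resolved's actual code: it assumes the tracing /
// tracing_subscriber crates (which the JSON shape above suggests), and the
// LOG_FORMAT env var name is hypothetical.
use std::env;

fn init_logging() {
    let json = env::var("LOG_FORMAT").map(|v| v == "json").unwrap_or(false);
    if json {
        // systemd's journal already records timestamps, so drop them here
        tracing_subscriber::fmt().json().without_time().init();
    } else {
        tracing_subscriber::fmt().init();
    }
}

fn main() {
    init_logging();
    tracing::info!(peer = "10.0.0.3:33602", "UDP request");
}
```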
My dashboard uses most of the new metrics, and it’s already giving some interesting insights!
For example, my cache size limit is 1,000,000 records, but it only holds ~4,400 right now. It looks like new records are being added only a little faster than old records are expiring. That makes me wonder whether it would be worth having some sort of automatic cache renewal for entries which get a lot of hits: when the expiry time gets close, pre-emptively make a request to the upstream nameservers and replace the cached entry, so that queries can just continue hitting the cache.
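As a rough illustration of that idea (not resolved’s actual code: the types, field names, and thresholds here are all hypothetical), the “is this entry worth refreshing?” check might look something like this:

```rust
// A sketch of the pre-emptive cache renewal idea; the struct, fields, and
// thresholds are hypothetical, not resolved's real cache implementation.
use std::time::{Duration, Instant};

struct CachedRecord {
    expires_at: Instant,
    hits: u64,
}

const REFRESH_WINDOW: Duration = Duration::from_secs(30);
const HOT_THRESHOLD: u64 = 10;

/// Decide whether a cached entry is worth refreshing upstream before it
/// expires, so queries for popular names keep hitting the cache.
fn should_refresh(record: &CachedRecord, now: Instant) -> bool {
    let close_to_expiry = record.expires_at.saturating_duration_since(now) < REFRESH_WINDOW;
    let frequently_hit = record.hits >= HOT_THRESHOLD;
    close_to_expiry && frequently_hit
}
```

The refresh itself would then just be an ordinary upstream query whose answer overwrites the cached entry before the old one expires.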
Another thing is the upstream nameserver misses: these are queries which couldn’t be answered locally, got sent upstream, and still couldn’t be answered. These are bad because it means `resolved` (and the clients!) are wasting time on a query which won’t produce anything useful. See how the gradient of that line suddenly changes, becoming less steep? And how the requests per second dropped off pretty obviously too? That’s because I noticed there were a lot of queries for `azathoth.`, which is the hostname of my desktop computer, being sent upstream. This turned out to be from a syncthing misconfiguration I’ve now fixed, so those queries stopped.
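For context on what’s behind those graphs: they’re ordinary Prometheus counters. Here’s a minimal sketch of how hit/miss counters might be registered and exposed with the prometheus crate; the metric names are hypothetical and resolved’s real ones may differ:

```rust
// A sketch using the prometheus crate; metric names are hypothetical,
// not necessarily the ones resolved actually exports.
use prometheus::{register_int_counter, Encoder, IntCounter, TextEncoder};

fn main() {
    let hits: IntCounter = register_int_counter!(
        "resolver_nameserver_hits",
        "Upstream queries which produced a usable answer"
    )
    .unwrap();
    let misses: IntCounter = register_int_counter!(
        "resolver_nameserver_misses",
        "Upstream queries which still couldn't be answered"
    )
    .unwrap();

    // Increment these wherever the resolver gets (or fails to get) an answer.
    hits.inc();
    misses.inc();

    // Render everything registered so far in the Prometheus text format,
    // which is what a /metrics endpoint would serve.
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    println!("{}", String::from_utf8(buf).unwrap());
}
```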
This week I merged the following PRs:
- Introduce `Label` type for domain names & add note about RFC 4343 to README
- Implement RFC 2782: the SRV record type
- Add RFC 3597-compliant `Display` and `FromStr` instances to `RecordClass` and `RecordType`
- Add Prometheus metrics
- Gather Prometheus metrics from the resolvers
- Refactor: split `lib-dns-resolver` out of `bin-resolved`
- Implement logging
- Add env var to set log format
- Only log cache pruning if something is removed
And opened the following issues:
Miscellaneous
Back in 2019 I rewrote the script which generates this site. I wrote it from scratch in Python. No fancy features, really just a bit of plumbing around pandoc and jinja2.
Every time I’ve tried to use one of the big-name static site generator tools, I’ve found them to be both overly complex and yet very restrictive at the same time. If you want to do something the developers didn’t anticipate, and that something could be as simple as “I want a blog without dates in filenames”, you have to write code. And it’s never straightforward code, because you have to hook into this complicated sort-of-but-not-really general-purpose framework.
I’m not doing anything complex here. It feels like an off-the-shelf static site generator should be able to do what I want easily, but I’ve never really found that to be the case.
This week I made the biggest conceptual change to how generating this site works, ever. I added one of the killer features of static site generators: a cache, so that if you edit one page you don’t need to recompile the whole site.
I’d been putting this off for a while now, but since I started writing these weeknotes the number of posts here has exploded. Waiting 4 minutes to build the site was just unwieldy, and hampering my writing.
Surely this was a big change, right? After all, this is one of the major reasons people use static site generators rather than write their own!
Well… 2 changed files with 31 additions and 18 deletions.
Er, why do people use those complicated tools, again?
Link Roundup
- How to visualize Prometheus histograms in Grafana
- New in Grafana 7.2: `$__rate_interval` for Prometheus rate queries that just work
- Choosing a Rust web framework, 2020 edition
- Ready for changes with Hexagonal Architecture
- Herding elephants: Lessons learned from sharding Postgres at Notion
- PostgreSQL: Documentation: 14: 5.11. Table Partitioning
[^1]: Hey, if a GOV.UK person reads this, you should check if that’s been done.