#059
Nov 3, 2019
Work
We spotted an odd spelling suggestion, where the suggestion seemed too different to the original query. So I dug into how suggestions work. Elasticsearch uses a distance metric as part of its ranking for spelling suggestions. It compares the suggestion to the original query and comes up with a number. That number feeds into how it chooses the best suggestion.
We currently use the default metric, `internal`, which the Elasticsearch documentation describes as “the default, based on `damerau_levenshtein` but highly optimized for comparing string distance for terms inside the index”.
Damerau-Levenshtein distance and Levenshtein distance are classic distance metrics for ordered sequences. They count how many of the following operations you need to perform to transform one sequence into the other. The allowed operations are:
- Deleting a character
- Inserting a character
- Replacing one character with another
- And, in Damerau-Levenshtein but not Levenshtein, transposing two adjacent characters
For example, “cat” has a Levenshtein distance of 2 from “act”:
- Replace “c” with “a”
- Replace the second “a” with “c”
But it only has a Damerau-Levenshtein distance of 1:
- Transpose “c” and “a”
Note that transpositions only apply to adjacent characters: a pair like “dog” and “god” is a distance of 2 under both metrics, because the “d” and “g” aren’t next to each other.
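Here’s a quick sketch of both metrics in Python, just to sanity-check those distances. It implements the “optimal string alignment” variant of Damerau-Levenshtein (adjacent transpositions, no edits in between), and is illustrative only, certainly not what Lucene does internally:

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance; with transpositions=True, the "optimal string
    alignment" variant of Damerau-Levenshtein (adjacent swaps cost 1)."""
    # dist[i][j] is the distance between a[:i] and b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete everything in a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert everything in b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a[i-1]
                dist[i][j - 1] + 1,         # insert b[j-1]
                dist[i - 1][j - 1] + cost,  # replace (or match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # transpose the adjacent characters a[i-2] and a[i-1]
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[len(a)][len(b)]

assert edit_distance("cat", "act") == 2
assert edit_distance("cat", "act", transpositions=True) == 1
# transpositions must be adjacent, so "dog"/"god" is 2 either way:
assert edit_distance("dog", "god", transpositions=True) == 2
```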
Damerau-Levenshtein is generally good because it captures a lot of common typos. But we have some evidence to suggest that it’s not right for us.
Currently, if you search for “change adress”, the top suggestion is “change across”. This is not good. However, if I change the distance metric to `levenshtein`, the top suggestion is “change address”, and “change across” isn’t even in the top 5. This is good, but strange.
I don’t understand why it makes a difference in this case: according to both metrics, “change address” is only a distance of 1 from “change adress” (insert a “d”), whereas “change across” is a distance of 2 (replace “d” with “c”, replace “e” with “o”). Nevertheless, it does make a difference. I assume Elasticsearch is doing something special and not using a “pure” Damerau-Levenshtein. I also assume it’s something to do with us wanting to correct phrases as a whole, rather than just correcting individual terms: there are more pages matching “change across” than there are matching “change address”.
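For reference, here’s roughly what flipping that option looks like at the request level. This is a hand-written sketch using Elasticsearch’s term suggester, where `string_distance` is a documented option; the index and field names are made up for illustration, not taken from our real search configuration:

```python
import json

# Hypothetical suggest request: "spelling_text" is a made-up field name.
suggest_body = {
    "suggest": {
        "spelling-suggestions": {
            "text": "change adress",
            "term": {
                "field": "spelling_text",
                # The interesting bit: swap the default "internal" metric
                # (an optimised damerau_levenshtein) for plain "levenshtein".
                "string_distance": "levenshtein",
            },
        }
    }
}

# POST this to <elasticsearch>/<index>/_search to compare suggestions.
print(json.dumps(suggest_body, indent=2))
```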
We’ve not made the change in production yet. I think the next thing for us to do here is to change the metric in our integration environment and see how the suggestions for the top 10,000 or so queries change. We could even A/B test this and see how it affects the suggestion acceptance rate.
On GOV.UK search pages, there’s a sidebar with some expandable “facets” (or “filters”, depending on who you ask). For example, the /search/all page has “Show only”, “Organisation”, “Person”, and “World location”. They’re good tools for power users to narrow their search, but there are a lot of organisations, people, and world locations.
I put together a prototype of making those last three facets dynamic: they only show entries relevant to your current search. For example, if you select the Foreign and Commonwealth Office, then you’ll only get people and world locations for which there is content tagged both to the FCO and that person or location. If you select a person, the world locations will be narrowed further (and vice versa).
It meets the requirements and doesn’t introduce any new bugs, but now that there’s a working prototype it needs some product review to determine whether it’s actually a good thing to do.
Rails 6 came out recently, so it’s time to upgrade everything on GOV.UK. Thankfully there’s a migration script which handles most of it for you. I had a go at finder-frontend. I’ve been having flashbacks to upgrading Overleaf (well, Overleaf classic) from Rails 4 to Rails 5.
I deployed an A/B test for incorporating bigrams into ranking, which I investigated a few weeks ago in the team hackathon. Hopefully they’ll help as much with real queries as they did with our test queries.
Miscellaneous
I read Lord of Chaos (by Robert Jordan), the sixth book in the Wheel of Time series.
Fed up with not really understanding how CSS handles conflict resolution, I finally sat down and read the MDN docs, and then implemented the cascade algorithm myself. My code agrees with Firefox on the one example I tried, so I think that’s a win.
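The heart of the cascade is just a sort with a carefully chosen key: origin and importance first, then specificity, then source order. Here’s a minimal Python sketch of that idea, under some big simplifying assumptions (only user-agent and author origins; no inline styles, animations, or transitions); it’s an illustration rather than the code I actually wrote:

```python
from typing import NamedTuple

class Declaration(NamedTuple):
    prop: str
    value: str
    origin: str          # "user-agent" or "author"
    important: bool
    specificity: tuple   # (id count, class/attr/pseudo-class count, type count)
    source_order: int    # position across the stylesheets

# Normal declarations: user-agent loses to author.  Important declarations
# rank above all normal ones, and flip the origin order.
ORIGIN_RANK = {
    ("user-agent", False): 0,
    ("author", False): 1,
    ("author", True): 2,
    ("user-agent", True): 3,
}

def winning_declaration(declarations):
    """Pick the declaration the cascade applies for a single property."""
    return max(
        declarations,
        key=lambda d: (
            ORIGIN_RANK[(d.origin, d.important)],  # origin & importance
            d.specificity,                         # selector specificity
            d.source_order,                        # later rules win ties
        ),
    )

decls = [
    Declaration("color", "blue", "author", False, (0, 1, 0), 1),       # .para
    Declaration("color", "red", "author", False, (0, 0, 1), 2),        # p
    Declaration("color", "black", "user-agent", False, (0, 0, 1), 0),  # p (UA)
]
# Higher specificity beats later source order:
assert winning_declaration(decls).value == "blue"
```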
I updated all my servers so that Let’s Encrypt certificates wouldn’t start failing to renew.
Link Roundup
- Understanding searches better than ever before
- The Great(er) Bear - using Wikidata to generate better artwork
- Building personal search infrastructure for your knowledge and code
- The Extended Mind
- This Week in Rust 310
- Issue 183 :: Haskell Weekly
- Haskell2020 Is Dead, but All Hope Is Not Lost
- The reasonable effectiveness of the continuation monad
- A monad is just a submonad of the continuation monad, what’s the problem?
- Low-latency garbage collector merged for GHC 8.10