Metrics
(AI go-faster stripes)
“Oh Kent, you can use statistics to prove anything, 65% of people know that” - Homer Simpson
I have always felt ambivalent about this quote (in the Moonlighting sense): I think it is true in many situations, and entirely false in others. Properly constructed and monitored statistics are extraordinarily useful. Yet arbitrary and narrow statistics can lead to bizarre, and potentially dangerous, decisions.
An example is compute utilisation: a measure, usually expressed as a percentage, of how busy a CPU is. Depending on which hat you wear, you might want it as high as possible to sweat the asset, or low enough that peaks in activity don't push it past 100% and cause a slowdown - think of an online shop dealing with Black Friday.
It gets more complex than that. How often should we measure CPU usage - every second, every millisecond? What do we then do with that data: average it, or retain average and peak values over a given period? And what happens when Moore's law kicks in? CPU utilisation trends down, but so does unit cost. When comparing the utilisation of two machines we need to know what type of chips they use, how many they have and what the specifications actually are.
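The sampling question above matters more than it looks: a minimal sketch, assuming per-second samples rolled up into fixed windows, shows how averaging alone can hide exactly the peaks we care about (the sample data and window size here are illustrative, not from any real system):

```python
import statistics

def aggregate_utilisation(samples, window=60):
    """Roll per-second utilisation samples (0-100) into
    average and peak values per fixed-size window."""
    results = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        results.append({
            "avg": statistics.mean(chunk),
            "peak": max(chunk),
        })
    return results

# A mostly idle minute with one short burst: the average looks
# comfortable while the peak records the near-saturation spike.
samples = [5] * 50 + [95] * 10
print(aggregate_utilisation(samples))
```

Keeping both values per window is the cheap compromise: the average answers the "sweat the asset" question, the peak answers the "Black Friday" one.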
This is a real problem I had to deal with in several roles some years ago. Over the course of those roles I created a composite metric, eventually dubbed the Cato metric, which combined average utilisation, a benchmark called SPECint (provided by a third-party organisation, spec.org) and power consumption (from the manufacturer). This gave a reasonable proxy for implied cost and efficiency. It worked pretty well, and drove some of the right behaviours. There were a number of ways to game it, but generally it did its job.
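The actual formula isn't given here, so this is a hypothetical sketch of what a composite along those lines could look like - effective benchmark throughput per watt - with the function name, inputs and example figures all invented for illustration, not the real Cato metric:

```python
def composite_efficiency(avg_utilisation, spec_int_rate, watts):
    """Hypothetical composite: useful work delivered per watt.

    avg_utilisation: measured fraction of capacity in use (0.0-1.0)
    spec_int_rate:   published SPECint throughput for the chip
    watts:           manufacturer-rated power draw
    """
    effective_throughput = avg_utilisation * spec_int_rate
    return effective_throughput / watts

# Two illustrative boxes: a well-sweated older server vs a
# mostly idle newer one with a faster, lower-power chip.
old = composite_efficiency(0.80, 400, 500)   # ~0.64
new = composite_efficiency(0.10, 1200, 300)  # ~0.40
print(old > new)  # the busy older asset wins on this measure
```

The point of folding in the benchmark and power figures is that raw utilisation alone can't compare machines of different generations; a composite like this at least puts them on one scale, even if it inherits the gameability of its parts.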
Would I use it now? Not really. The advent of cloud, the increased importance of GPUs (which SPECint doesn't measure), the fact that the SPECint benchmark is not being maintained, and that CPU consumption is no longer a real limiting factor all mean it wouldn't be that useful any more.
Even more importantly, every metric ceases to be valuable after a while, as more and more people learn to game it - or, however charitably you want to view it, to tune towards it. Something which once helped encourage people in broadly the right direction, or measured trends heading broadly positively, becomes twisted into serving only as a metric for marketing, internal or external.
This brings us to the constant announcements about the best-ever LLM, based on an array of quoted metrics - and there are many. A quick aside here: I LOVE how many metrics there are. So many numbers, so pretty… But the problem is that the actual user experience, or value delivered, or, you know, anything actually real, does not correlate with the metrics at all.
Now, the cynic in me says the AI companies don't care because, through a combination of ego and paranoia, they're just competing with each other to rule the world. But I suspect, per Hanlon's razor, it's more that they're blinded by the chase for the metrics, which they are wrongly equating with ruling the world. Fundamentally they've lost sight of what it is they are trying to achieve, and the metrics at least give them something to pin 'progress' to. Unfortunately, many people aren't just letting them, they're encouraging it. All those pretty numbers must mean something!
The improvements we're seeing do not, as far as I can tell, point to, well, anything really. Breathless discussion of these stats has turned it into a big game of Top Trumps. But who cares if the random stat on the card says Claude beats Gemini? Will it actually deliver on any of the promises being made? Until we see something concrete, we should treat all record-breaking metrics with the suspicion they are due.
###



