Who measures the measurers
In March 2026, three AI labs published research that, taken together, exposes something the higher education sector hasn't reckoned with. OpenAI released its Learning Outcomes Measurement Suite (LOMS), built with the University of Tartu and Stanford's SCALE Initiative, validating it across nearly 20,000 students in Estonia and reporting a 15% exam performance gain in microeconomics. Anthropic published what it calls the largest qualitative study ever conducted: 81,000 people, 159 countries, 70 languages, with interviews conducted by Claude, classifications made by Claude, and representative quotes pulled by Claude. Anthropic also published a new metric, "observed exposure", which combines theoretical AI capability with usage data from Anthropic's own logs to map labour displacement. Google DeepMind, separately, published a cognitive taxonomy of ten abilities for measuring AGI progress.
The methods are genuinely sophisticated. The findings are useful. None of that is the issue.
Ben Williamson at the University of Edinburgh traced LOMS's data pipeline and found a closed loop: every measurement of student learning flows back into OpenAI's model development. The evaluation feeds the product. The product generates the data. The evaluation measures the product. Anthropic's 81,000-person study is simultaneously research, marketing, and product roadmap — you cannot separate them, because the methodology fuses all three. The labour displacement paper measures the damage using the weapon's telemetry.
Nick's framing on the episode: this isn't strictly new. For years, Pearson published research showing that Pearson products improved outcomes. Everyone rolled their eyes and moved on. What's different now is the scale and the absence of independent alternatives. There is no public-sector equivalent capable of interviewing 81,000 people in 70 languages. Vendor research isn't competing with independent evidence; it's the only evidence that exists.
The five DeepMind cognitive abilities currently lacking robust evaluations (learning, metacognition, attention, executive functions, social cognition) overlap almost exactly with the graduate capabilities universities claim to develop. Dale's question: will universities define those capabilities, or will the definition be written by the company building the technology that may replace them?
Three responses sit on the table. Universities could collectively build independent evaluation infrastructure; Stanford HAI has proposed something along these lines. Governments could mandate it through bodies like the OECD, whose recent work on the "mirage of false mastery" took years and was rigorous precisely because it was slow. Or the sector could stand up shared infrastructure of its own, a kind of PISA for AI in education. None of these exists yet. Nick's pushback is sharp: the sector hasn't agreed on a shared definition of academic integrity in three years. Building cross-institutional measurement frameworks at the speed AI is moving is, on current evidence, fantasy. Which is exactly why the vendor frameworks are winning by default.

