Can Mechanism Design Change the Scaling Laws for Oversight?
Two papers have inspired my thinking lately.
Scaling Laws for Scalable Oversight measures how oversight success scales with intelligence. The authors pit models against each other in adversarial games and fit scaling laws to the results. The finding that matters: everything depends on the slope ratio between guard and adversary. In debate, guards scale faster than adversaries. In code review, the reverse. That ratio determines whether oversight keeps pace with capability or falls behind.
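A toy illustration of why the slope ratio dominates (a logistic win model of my own, with made-up slopes, not the paper's fitted parameters): scale guard and adversary together and watch which way each game's asymptote goes.

```python
import math

def guard_win_prob(guard_elo, adv_elo, m_guard, m_adv, c=0.0):
    # Hypothetical logistic win model: the guard's win probability depends
    # on how fast each side's effective strength grows with raw Elo.
    z = m_guard * guard_elo - m_adv * adv_elo + c
    return 1.0 / (1.0 + math.exp(-z / 400.0))

# Scale both players together: asymptotically only the slope ratio matters.
for x in (1000, 2000, 4000):
    debate = guard_win_prob(x, x, m_guard=1.2, m_adv=1.0)  # guards scale faster
    review = guard_win_prob(x, x, m_guard=1.0, m_adv=1.2)  # adversaries scale faster
    print(x, round(debate, 3), round(review, 3))
```

With these made-up slopes, the debate-like game drifts toward certain oversight success as capability rises, and the review-like game drifts toward certain failure, even though both start near even.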
They also try chaining oversight recursively: a weak model oversees a slightly stronger one, passes trust forward, and the process repeats. For their hardest game, even optimal chaining reaches only 10% success across a 400-Elo gap. The slopes are treated as fixed constants of the game.
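The chaining idea can be sketched as splitting one large Elo gap into equal steps and multiplying per-step success probabilities, since trust must survive every link. Everything below is a toy of my own construction: the logistic form, the slope, and the guard-favorability offset `c` are assumptions, not the paper's fits.

```python
import math

def step_success(elo_gap, slope=1.0, c=300.0):
    # Probability the overseer wins one link of the chain: a logistic in the
    # Elo gap, shifted by c, a hypothetical guard-favorability offset.
    return 1.0 / (1.0 + math.exp(slope * (elo_gap - c) / 400.0))

def best_chain(total_gap, max_steps=20):
    # Split the total gap into n equal steps; chained success is the
    # per-step probability raised to the n-th power. Return the best n.
    best_n, best_p = 1, step_success(total_gap)
    for n in range(2, max_steps + 1):
        p = step_success(total_gap / n) ** n
        if p > best_p:
            best_n, best_p = n, p
    return best_n, best_p

print(best_chain(1200))
```

With these toy numbers, two steps beat a single leap across the 1200-Elo gap, but only barely; and if the game is neutral (`c=0`), chaining never helps at all, matching the intuition that each step must individually favor the guard for trust to compound.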
Knowledge Divergence and the Value of Debate proves that debate outperforms single-model evaluation only when the debating models know different things. Same training data, same knowledge — debate reduces to RLAIF. Same model debating itself adds nothing.
As frontier models converge on the same knowledge, their errors correlate and debate’s advantage vanishes. The thing that makes oversight work — models knowing different things — is what current training pipelines erode.
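A back-of-the-envelope simulation of the correlated-errors point (my own toy setup, not the paper's formalism): two judges each miss a flaw with some probability, and a knob controls how often the second judge simply inherits the first judge's blind spot. Debate's edge over a single judge shrinks to exactly zero as the knob approaches 1.

```python
import random

def debate_gain(error_corr, p_err=0.3, trials=100_000, seed=0):
    # Toy model: each judge misses a flaw with probability p_err; with
    # probability error_corr, judge B's error is copied from judge A
    # (a shared knowledge gap). Debate is assumed to surface the flaw
    # if either judge catches it.
    rng = random.Random(seed)
    solo_catch = debate_catch = 0
    for _ in range(trials):
        miss_a = rng.random() < p_err
        miss_b = miss_a if rng.random() < error_corr else (rng.random() < p_err)
        solo_catch += not miss_a
        debate_catch += not (miss_a and miss_b)
    return debate_catch / trials - solo_catch / trials

for corr in (0.0, 0.5, 1.0):
    print(corr, round(debate_gain(corr), 3))
```

At full correlation the two judges are one judge, so the gain is identically zero: the same collapse the paper derives when debaters share training data.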
Neither paper asks whether these parameters can be changed by design. But what if the scaling laws for oversight are functions of the incentive structure the models operate within?
I’m running an experiment that tests this directly: adversarial mechanism design applied to the same oversight games.
The question is whether the right mechanism steepens the slope of those scaling laws. If the slope changes, oversight can bootstrap and keep pace with capability. If only the intercept moves, mechanism design helps today but becomes irrelevant as systems grow stronger.
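Concretely, the slope-versus-intercept distinction looks like this in a toy logistic win model (the functional form and every parameter are my assumptions, not measurements): an intercept shift buys success at today's capability levels but decays away, while a slope change alters where the curve ends up.

```python
import math

def win_prob(elo, slope, intercept):
    # Guard win probability as guard and adversary scale together;
    # a hypothetical logistic form, not a fitted law from either paper.
    return 1.0 / (1.0 + math.exp(-(intercept + slope * elo / 400.0)))

baseline = dict(slope=-0.5, intercept=1.0)  # oversight loses ground as elo grows
shifted  = dict(slope=-0.5, intercept=3.0)  # mechanism moves only the intercept
steeper  = dict(slope=0.2,  intercept=1.0)  # mechanism flips the slope's sign

for elo in (1000, 4000, 16000):
    print(elo, *(round(win_prob(elo, **k), 3) for k in (baseline, shifted, steeper)))
```

The intercept-shifted curve beats the baseline at 1000 Elo and still goes to zero; only the steeper curve keeps pace in the limit. That asymmetry is what the experiment has to distinguish.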
I'll open-source the results soon.