Preprint

Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control

One-sentence summary

A matched-false-positive benchmark for recursive-collapse warning claims: the directional signature aligned across domains, but no detector reached an accepted operating point.

Abstract

Recursive systems can enter collapse-like regimes — self-reinforcing amplification, persistent recursion, and narrowing diversity that mask accelerating internal degradation — before overt failure becomes visible. We introduce Loopzero, a claim-bounded benchmark framework for testing whether recursive failures follow a directional telemetry pattern: rising gain (G), recursive persistence (p), and declining diversity (δ). The claim boundary is specified in Lean; the Lean artifact does not verify real telemetry, benchmark validity, or detector performance.

We evaluate the bridge on two frozen public-artifact benchmarks: a segmented public-markets benchmark (Volmageddon 2018, COVID MWCB 2020) and a MovieLens-25M offline deterministic recommender replay. Detectors are evaluated under a locked equal-false-positive contract (false-positive rate FP ∈ [0.03, 0.07], pre-registered) so that all configurations face the same alert budget. Neither the tested standard comparators nor Loopzero's pre-registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, with adjacent-horizon and row-level limitations disclosed. Digitized Shumailov et al. (2024) LLM training-loop trajectories are directionally consistent with the pattern; matched-FP evaluation in that domain is deferred.

The contribution is a reproducible, falsifiable benchmark framework for evaluating recursive-collapse warning claims under an explicit alert-budget contract, with non-acceptance reported as a first-class scientific outcome.

The result, stated plainly

Reported in full

On both flagship benchmarks, every tested detector failed the contract — the standard early-warning methods and Loopzero's own.

This is the load-bearing finding, not a footnote. A matched-false-positive contract is falsifiable in both directions: a comparator fails if its calibration grid places no operating point inside the band, and the pre-registered detector fails under the same rule. Here, both did. What that leaves standing is a framework others can test and a directional signature that held in the pre-collapse windows — not a finished detector.
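The failure rule above can be made concrete with a minimal sketch. It assumes the false-positive rate is the fraction of control segments on which a detector alerts, that an alert fires when a score strictly exceeds a threshold, and that each distinct control score is a candidate threshold. The function names and the strict inequality are illustrative assumptions, not the paper's implementation.

```python
def in_band_operating_points(control_scores, band=(0.03, 0.07)):
    """Return (threshold, fp_rate) pairs whose control FP rate lies inside the band.

    Assumption: an alert fires when score > threshold, and every distinct
    control score is tried as a candidate threshold.
    """
    n = len(control_scores)
    lo, hi = band
    points = []
    for thr in sorted(set(control_scores)):
        fp = sum(s > thr for s in control_scores) / n
        if lo <= fp <= hi:
            points.append((thr, fp))
    return points


def achievable_fp_counts(n_controls, band=(0.03, 0.07)):
    """False-alarm counts k for which k / n_controls falls inside the band."""
    lo, hi = band
    return [k for k in range(n_controls + 1) if lo <= k / n_controls <= hi]


# With 38 controls, only exactly two false alarms land in [0.03, 0.07].
print(achievable_fp_counts(38))

# A heavily tied score distribution can skip the band entirely:
# thresholds jump from 3/38 false alarms straight to 0/38.
print(in_band_operating_points([0] * 35 + [1] * 3))
```

Under these assumptions the band is narrow on a 38-control grid: 1/38 ≈ 0.026 falls below it and 3/38 ≈ 0.079 above it, so a detector is accepted only if some threshold produces exactly two control alerts (2/38 ≈ 0.053). A detector whose scores are tied or coarse can therefore fail without being uninformative, which is one reason non-acceptance is reported rather than tuned away.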

What would falsify it

  • Externally defined collapse events occurring without the predicted G↑ / p↑ / δ↓ pattern in the pre-collapse window.
  • Control periods showing the same pattern without any subsequent collapse.
  • A tested standard comparator recovering an accepted operating point under the locked contract on the canonical benchmarks.
  • The pre-registered detector recovering acceptance on benchmarks where the directional bridge signature is absent.
  • Alternative reasonable operationalizations of amplification, persistence, and contraction failing to agree directionally.
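A minimal sketch of how the first two criteria could be checked on a single window, assuming the directional pattern is operationalized as the sign of a least-squares trend in each witness series. The slope-sign test and the function names are hypothetical; the paper's witness construction is domain-adapted and is not reproduced here.

```python
def slope(values):
    """Ordinary least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def directionally_aligned(gain, persistence, diversity):
    """True when gain and persistence trend up and diversity trends down."""
    return slope(gain) > 0 and slope(persistence) > 0 and slope(diversity) < 0


# Hypothetical pre-collapse window: G rising, p rising, delta falling.
print(directionally_aligned([1.0, 1.2, 1.5], [0.1, 0.2, 0.4], [0.9, 0.7, 0.5]))
```

Applied to event windows, a False result counts against the first criterion; applied to control windows, a True result with no subsequent collapse counts against the second.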

Scope & limitations

Witness construction is still domain-adapted. The bridge is empirical, not an identification result. The set of comparators tested so far is limited and can be expanded. Component-level necessity of G, p, and δ is not yet established through ablations. Recommender negative controls, wild cluster bootstrap, and additional domains under full matched-false-positive evaluation are open next steps, intended as stress tests rather than rhetorical expansion.

Cite

@misc{mullett_recursive_collapse,
  author        = {Mullett, David},
  title         = {Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control},
  year          = {2026},
  eprint        = {2606.00329},
  archivePrefix = {arXiv},
  primaryClass  = {eess.SY},
  url           = {https://arxiv.org/abs/2606.00329}
}

Reproducibility & artifacts

  • Repository: github.com/davidmullett/loopzero-paper-public · release lean-v1.0
  • Lean toolchain: Lean v4.30.0-rc2 · Mathlib 3ba1ec58 · builds clean under lake build
  • Axiom audit: 3 obstruction theorems (no axiom dependencies) · 2 bridge theorems (propext, Classical.choice, Quot.sound only)
  • Markets benchmark: volmageddon_covid_public_v2 · 38 controls + 16 events · FP grid 1/38
  • Recsys benchmark: movielens25m_recursive_frontier_public_v1 · 40,339 user clusters · h=50
  • Source archive: MovieLens-25M · SHA-256 8b21cfb7… · MD5-verified vs GroupLens
  • Engine: item-item replay v1.0.0 · hash 56c1cff225d60c09
  • Detector: pre-registered q=95, k=3 · band [0.03, 0.07]
  • Seeds: 42 (segment grain) · 43 (scenario grain) · bootstrap 10,000/cell
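For readers reproducing the recommender benchmark, a minimal sketch of checking a downloaded source archive against the listed SHA-256 value. The digest above is truncated, so the sketch compares only a prefix; the archive file name is a hypothetical placeholder, and the full digest should be taken from the repository.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_prefix(path, expected_prefix):
    """True when the file's SHA-256 hex digest starts with expected_prefix."""
    return sha256_of_file(path).startswith(expected_prefix.lower())


# Example call (file name is a placeholder for the local archive copy):
# matches_prefix("ml-25m.zip", "8b21cfb7")
```

A prefix match is a weak check; comparing the full digest published with the release is the stronger test.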