Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control
Abstract
Recursive systems can enter collapse-like regimes — self-reinforcing amplification, persistent recursion, and narrowing diversity that mask accelerating internal degradation — before overt failure becomes visible. We introduce Loopzero, a claim-bounded benchmark framework for testing whether recursive failures follow a directional telemetry pattern: rising gain (G), recursive persistence (p), and declining diversity (δ). The claim boundary is specified in Lean; the Lean artifact does not verify real telemetry, benchmark validity, or detector performance.
We evaluate the bridge on two frozen public-artifact benchmarks: a segmented public-markets benchmark (Volmageddon 2018, COVID MWCB 2020) and a MovieLens-25M offline deterministic recommender replay. Detectors are evaluated under a locked, pre-registered equal-false-positive contract (FP ∈ [0.03, 0.07]), so all configurations face the same alert budget. Neither the tested standard comparators nor Loopzero's pre-registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, with adjacent-horizon and row-level limitations disclosed. Digitized Shumailov et al. (2024) LLM training-loop trajectories are directionally consistent with the pattern; matched-FP evaluation in that domain is deferred.
The contribution is a reproducible, falsifiable benchmark framework for evaluating recursive-collapse warning claims under an explicit alert-budget contract, with non-acceptance reported as a first-class scientific outcome.
The result, stated plainly
On both flagship benchmarks, every tested detector failed the contract — the standard early-warning methods and Loopzero's own.
This is the load-bearing finding, not a footnote. A matched-false-positive contract is falsifiable in both directions: a comparator fails if its calibration grid places no operating point inside the band, and the pre-registered detector fails under the same rule. Here, both did. What that leaves standing is a framework others can test and a directional signature that held in the pre-collapse windows — not a finished detector.
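The acceptance rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: it assumes each control segment yields one alert-or-no-alert decision from a scalar score and a threshold, and the names fp_rate, accepted_thresholds, and band_counts are hypothetical.

```python
def fp_rate(control_scores, threshold):
    """Fraction of control segments whose score triggers an alert (assumed rule: score >= threshold)."""
    alerts = sum(1 for s in control_scores if s >= threshold)
    return alerts / len(control_scores)


def accepted_thresholds(control_scores, grid, band=(0.03, 0.07)):
    """Thresholds on the calibration grid whose control FP rate lands inside the band."""
    lo, hi = band
    return [t for t in grid if lo <= fp_rate(control_scores, t) <= hi]


def band_counts(n_controls, band=(0.03, 0.07)):
    """False-alert counts k for which k / n_controls falls inside the band."""
    lo, hi = band
    return [k for k in range(n_controls + 1) if lo <= k / n_controls <= hi]


# With 38 controls the FP rate moves in steps of 1/38, so only k = 2
# (2/38, about 0.053) sits inside [0.03, 0.07].
print(band_counts(38))  # [2]

# Synthetic scores, for illustration only.
distinct = list(range(38))
print(accepted_thresholds(distinct, [35, 36, 37]))  # [36]

# Ties among the top control scores can make the grid jump over the band:
# the FP rate goes from 0/38 straight to 3/38.
tied = [0] * 35 + [1, 1, 1]
print(accepted_thresholds(tied, [1, 2]))  # []
```

Under these assumptions a detector has a narrow target: on a 38-control benchmark it must produce exactly two false alerts at some grid point, and a grid that steps from one or fewer to three or more fails the contract regardless of its hit rate.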
What would falsify it
- Externally defined collapse events occurring without the predicted G↑ / p↑ / δ↓ pattern in the pre-collapse window.
- Control periods showing the same pattern without any subsequent collapse.
- A tested standard comparator recovering an accepted operating point under the locked contract on the canonical benchmarks.
- The pre-registered detector recovering acceptance on benchmarks where the directional bridge signature is absent.
- Alternative reasonable operationalizations of amplification, persistence, and contraction failing to agree directionally.
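The first two conditions turn on whether a window shows the G↑ / p↑ / δ↓ pattern. One simple way to operationalize that check is the sign of a least-squares slope over the window. This is an assumption made for illustration: the source does not specify how direction is measured, and the function names below are hypothetical.

```python
def slope(values):
    """Ordinary least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def matches_directional_pattern(gain, persistence, diversity):
    """True when gain and persistence trend up and diversity trends down."""
    return slope(gain) > 0 and slope(persistence) > 0 and slope(diversity) < 0


# Synthetic windows, for illustration only.
pre_collapse = matches_directional_pattern(
    gain=[1.0, 1.2, 1.5, 1.9],
    persistence=[0.10, 0.20, 0.25, 0.40],
    diversity=[0.9, 0.8, 0.6, 0.5],
)
control = matches_directional_pattern(
    gain=[1.0, 1.2, 1.5, 1.9],
    persistence=[0.10, 0.20, 0.25, 0.40],
    diversity=[0.7, 0.7, 0.7, 0.7],  # flat diversity: the pattern is not met
)
print(pre_collapse, control)  # True False
```

In this sketch, a collapse event whose pre-collapse window returns False, or a control window that returns True with no collapse following, would count as evidence against the claim.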
Scope & limitations
Witness construction is still domain-adapted. The bridge is empirical, not an identification result. Comparator scope can expand. Component-level necessity of G, p, and δ is not yet established through ablations. Recommender negative controls, wild cluster bootstrap, and additional domains under full matched-false-positive evaluation are open next steps — stress-testing rather than rhetorical expansion.
Reproducibility & artifacts
- Repository: github.com/davidmullett/loopzero-paper-public · release lean-v1.0
- Lean toolchain: Lean v4.30.0-rc2 · Mathlib 3ba1ec58 · builds clean under lake build
- Axiom audit: 3 obstruction theorems: no axiom dependencies · 2 bridge theorems: propext, Classical.choice, Quot.sound only
- Markets benchmark: volmageddon_covid_public_v2 · 38 controls + 16 events · FP grid 1/38
- Recsys benchmark: movielens25m_recursive_frontier_public_v1 · 40,339 user clusters · h=50
- Source archive: MovieLens-25M · SHA-256 8b21cfb7… · MD5-verified vs GroupLens
- Engine: item-item replay v1.0.0 · hash 56c1cff225d60c09
- Detector: pre-registered q=95, k=3 · band [0.03, 0.07]
- Seeds: 42 (segment grain) · 43 (scenario grain) · bootstrap 10,000/cell
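A reader reproducing the recommender benchmark may want to check a downloaded archive against the SHA-256 prefix listed for the source archive. The sketch below uses only Python's standard hashlib; the helper names and the local file name are hypothetical. Because the digest above is truncated, only a prefix comparison is possible from this page.

```python
import hashlib


def sha256_hex(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path, expected_prefix):
    """True when the file's SHA-256 hex digest starts with the given prefix."""
    return sha256_hex(path).startswith(expected_prefix.lower())


# Example call (the local file name is an assumption):
# matches_prefix("ml-25m.zip", "8b21cfb7")
```

An eight-character prefix match is a sanity check rather than a full integrity guarantee; comparing against the complete digest published in the repository would be the stronger test.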