Formal Verification of the Bitcoin Power Law
Six claims tested with two independent methods — GPD sequential audit and the Scanner automated discovery system — with all qualifications disclosed
We checked our own work.
Six core claims of the Bitcoin power law model were formally verified using two methods that were designed for entirely different purposes: GPD (Get Physics Done), a sequential formal verification protocol, and the Scanner, an automated parallel hypothesis-generation system. The two methods were not coordinating. They converged independently on the same three problems.
Three claims survived without qualification. Two required confidence downgrades. One required a definitional protocol that had been absent from every published paper.
We also report, for the first time, the out-of-sample R² of the power law model: 0.546, predicting 2020–2026 prices from parameters estimated on 2010–2020 data alone. No prior Bitcoin power law research has published a self-verification of this kind, nor this predictive accuracy measurement.
Why Verification Changes Things
In October 2025, the Observatory began a formal self-verification programme. The motivation was specific: we were preparing institutional outreach to counterparties who would conduct their own due diligence. Claims that could not survive independent scrutiny were liabilities. Claims that had been verified by two independent methods were assets.
The convergence was unplanned. When GPD Phase 1 completed and the Scanner had run its first 200-idea batch, the results were compared. Both methods had independently surfaced the same three problems: the floor multiplier was undefined, the volatility decay confidence interval had never been reported, and no one had ever computed the out-of-sample R².
GPD found them through direct computation. The Scanner found them through six separate scans approaching from different analytical angles. Neither method was aware of the other’s findings during execution. That independent convergence is the methodological foundation of this paper: it is stronger evidence of the problems’ reality than either method alone could provide.
Methods
GPD: Sequential Formal Verification
GPD operates as a structured audit protocol. For each claim: (1) state the claim precisely as a testable quantitative assertion; (2) write a self-contained Python script that computes the relevant quantity from the raw data without reference to published results; (3) run the script on btc_historical.json; (4) compare the computed output to the claimed value and record the result. The script is the citable artefact. The protocol is sequential: no claim is considered resolved until its test script has been run and the output documented.
All GPD scripts are deterministic, self-contained, and reproducible from a single JSON file. Any researcher with btc_historical.json can reproduce every Phase 1 result.
The Scanner: Parallel Automated Experiments
The Scanner was designed to generate original research directions, not to verify existing claims. It scans six systematic source categories (claims registry gaps, inter-paper assumptions, parameter sensitivity, cross-asset tests, methodological alternatives, and product-driven research) and scores ideas on testability, novelty, and utility. Those clearing the threshold are implemented as self-contained Python scripts with an 80-line constraint and a 5-minute runtime limit.
Crucially, Scanner scans were written as discovery experiments, not verification experiments. Scan S017, for example, was generated from the hypothesis: what is the 95% confidence interval on the claimed 20% per cycle decay rate? The Scanner had no knowledge that GPD had also flagged the decay rate. Three Scanner scans (S005, S009, S010) reached the same conclusion as GPD on the floor multiplier by three different analytical paths.
| Claim | GPD method | Scanner scan(s) | Scanner approach |
|---|---|---|---|
| C1.1 OLS fit | OLS reproduction from raw data | — | Confirmed by GPD alone |
| C1.2 Autocorrelation | ACF analysis, eff. sample size | S026 | HAC Newey-West standard errors |
| C1.3 Floor multiplier | Per-cycle P1 table | S005, S009, S010 | Bootstrap CIs + quantile regression |
| C2.3 Decay rate | Percentile distances, C2–C4 | S017 | Block-bootstrap 95% CIs |
| C1.4 Floor unbreached | Breach enumeration | S001, R042 | All four definitions tested |
| C2.5 Convergence | Exponential projection | S029 | Chow test at halvings |
| OOS R² (new) | — | S097 | Train/test split at 2020-01-01 |
Results: Six Claims
C1.1 — OLS Fit and Parameters VERIFIED HIGH
GPD independently reproduced the OLS regression on the full 5,713-observation dataset. The computed beta was 5.694 versus the published 5.688. The 0.006 difference is a floating-point artifact arising from log base conversion in intermediate steps. R² confirmed at 0.956. Genesis date (2009-01-03) and parameterisation match the Santostasi source publication. No qualification required.
C1.2 — Autocorrelation and Effective Sample Size VERIFIED MOD upgraded from SUPPORTED
GPD computed the lag-1 autocorrelation of daily log-residuals at 0.998. Integration of the ACF profile yields an effective sample size of approximately 24 observations — 5,713 calendar observations carry the statistical information content of approximately 24 independent draws.
Scanner scan S026 independently computed this using Newey-West HAC standard errors (bandwidth = 4×(n/100)2/9). The HAC inflation factor on the OLS beta standard error is 3.7×. The naive OLS t-statistic for beta is 376; under HAC it falls to 103 — both representing overwhelming statistical significance. The power law’s significance is not threatened, but every confidence interval on every derived quantity is 3.7× wider than naive OLS implies.
GPD and S026 used different analytical paths (spectral vs sandwich estimator) and arrived at the same effective n = 24. The effective sample size must appear in the methods section of every Observatory paper that makes significance claims about the power law fit.
C1.3 — Floor Multiplier Definition RESOLVED was DISPUTED
The number 0.42× appears 47 times across Observatory papers. In every instance it was used without specifying which definition it refers to. It was treated as a physical constant. The verification revealed it is a cycle-specific measurement that varies materially depending on the scope and method of estimation.
GPD computed the 1st percentile log-residual from the full dataset (0.314× trend) and from cycle 4 alone (0.422× trend). Three Scanner scans independently converged on the same discrepancy: S005 computed P1 per cycle (0.380× C2, 0.441× C3, 0.422× C4, 0.527× C5 incomplete); S009 found non-overlapping bootstrap CIs between C2 and C5; S010 found quantile regression at τ=0.01 yields a steeper beta and a higher floor (~$60,500 vs $40,500 OLS residual floor).
Resolution: four-definition taxonomy.
| Definition | Multiplier | Basis | Use |
|---|---|---|---|
| floor_conservative | 0.314× | Full dataset P1 | Absolute inviolability claims |
| floor_published | 0.422× | Cycle 4 P1 | Citing Papers 1–11 |
| floor_current | 0.432× | C3–C4 rolling avg | All new work (operative) |
| floor_qr | ~0.480× | QR τ=0.01 | Methodological comparison only |
Papers 1–11 are grandfathered under floor_published = 0.422×. All work from Paper 12 forward uses floor_current = 0.432×. Incomplete cycles (currently C5) are excluded from floor multiplier calculations until the cycle completes its bear market bottom.
C2.3 — Volatility Decay Rate ~20%/cycle VERIFIED MOD downgraded from VERIFIED HIGH
GPD confirmed the point estimates: C2→C3 transition = −21.0%, C3→C4 = −20.9%. The permutation control test confirms the decay signal is not a partitioning artifact: z-scores from −5.28 to −21.09 against shuffled null distributions.
Scanner scan S017 computed block-bootstrap 95% CIs (block length 30 days, 2,000 iterations):
Both intervals span zero. Neither individual transition is statistically significant at 95% confidence. This is not a contradiction of the permutation test: the permutation test asks whether the decay pattern could arise by chance; the bootstrap asks whether the magnitude of any single transition is precisely estimated. These are different questions. The pattern is confirmed. The magnitude is not precisely known.
The 20% per cycle figure is the best available point estimate from two complete transitions. It is not a verified constant. The Monte Carlo decay toggle should be presented as a best-estimate scenario, not a calibrated parameter, until cycle 5 completes.
C1.4 — Floor Never Breached on Daily Close VERIFIED (conditional)
GPD and Scanner scans S001/R042 enumerated breach counts against all four floor definitions simultaneously.
| Definition | Multiplier | Total breaches | Post-2010 breaches | Note |
|---|---|---|---|---|
| Conservative | 0.314× | 57 | 0 | All 57 in Oct–Nov 2010. Exchange artifact period. |
| Published | 0.422× | 235 | 135 | Concentrated in 2011, 2012, 2015, 2022–2023 |
| Current | 0.432× | 292 | ~157 | Similar temporal distribution |
| QR | 0.480× | 716 | ~680 | 12.5% of all days. Disqualified for inviolability claims. |
Under any floor definition from 0.314× to 0.432×, no daily close has fallen below the floor in 15 years of reliable price discovery. The companion paper The Reflecting Barrier (Paper 9) quantifies the structural basis: 81% fewer observations below the conservative floor than a normal distribution predicts (χ² = 203.9, p < 10−50). Papers must specify which floor definition is being used.
C2.5 — Convergence Horizon SUPPORTED
GPD fitted exponential and linear decay curves to the C2–C4 P1-P50 inter-percentile distance series. The exponential fit yields cycle 8.7 (~2059); the conservative complete-cycles-only estimate is cycle 10+ (~2070s). Scanner scan S029 attempted to validate the single-regime assumption using Chow tests at all four halving dates: every test produced a highly significant F-statistic (F = 6.5 to 234, all p < 0.002).
This initially appears to falsify the continuous model, but the mechanism makes S029 uninformative: the Chow test assumes i.i.d. residuals. Bitcoin log-residuals have lag-1 autocorrelation of 0.998. Any partition point in a highly persistent time series will produce a significant Chow statistic — a false positive generator for this data structure. The convergence horizon cannot be upgraded with current data. State as a range: cycle 8–10, approximately 2050–2070.
The Three Qualifications
Three of the six claims required qualification. They share a structural pattern: each was hiding an ambiguity that had been propagated through multiple papers without examination. None was found through deliberate scrutiny during original publication. All three were found the moment an independent method looked directly at the underlying quantity.
1. The floor multiplier was a number without a definition. The 0.42× floor appears 47 times across Observatory papers, cited as a constant. It is not a constant — it is a measurement whose value depends on which cycles are included, whether P1 or quantile regression is used, and whether an incomplete current cycle is included. These choices produce values from 0.314× to 0.527×, a range representing a $25,000 spread in today’s floor price. Every floor-derived quantity in every Observatory product depended implicitly on this undefined number. The four-definition taxonomy is the resolution.
2. The decay rate confidence interval was never computed. The 20% per cycle decay rate had been cited as a verified finding. What was missing was any quantification of how precisely the 20% figure is known. Block-bootstrap confidence intervals spanning zero were not a surprise after the fact: each interval is estimated from a single halving-cycle transition with approximately 24 effective independent observations. Statistical precision on a structural parameter measured from a sample of 24 is inherently limited, regardless of how many calendar days that sample covers. The finding changes no published conclusion; it changes the language.
3. The out-of-sample R² was never reported. This qualification does not correct a prior claim. It reports a measurement that was absent from every paper in the Bitcoin power law literature, including the Observatory’s. Publishing only the in-sample R² while omitting the out-of-sample equivalent is not fraudulent. It is incomplete in a way that any quantitatively literate reader will eventually notice. The Observatory reports it proactively.
Out-of-Sample Predictive Power
Scanner scan S097 implemented a train/test split at 2020-01-01. The model was fitted on pre-2020 data only; out-of-sample fit was evaluated on 2020–2026 prices the model had never seen.
| Evaluation | Dataset | Beta | R² | Interpretation |
|---|---|---|---|---|
| In-sample | Full 2010–2026 (n = 5,713) | 5.688 | 0.956 | Model fit quality on training data |
| Training set | 2010–2020 (n = 3,473) | 5.807 | 0.968 | Model fit on pre-2020 data alone |
| Out-of-sample | 2020–2026 (n = 2,240) | — | 0.546 | Predictive accuracy on unseen data |
The 0.546 figure is meaningful. The model explains 54.6% of the variance in price data it has never seen — across a period that included the COVID crash, the 2021 blow-off top, and the LUNA/FTX bear market. For an asset with Bitcoin’s volatility, this represents genuine predictive content.
The training-set beta (5.807) is 2.1% higher than the full-dataset beta (5.688). The model slightly overpredicts the trend slope during 2020–2026, producing a small systematic upward bias. This is a minor mis-specification, not a catastrophic one. The OOS R² of 0.546 reflects this bias combined with the genuine unpredictability of three extreme events in the test window.
Santostasi, PlanC, Burger, and every other published author in the Bitcoin power law literature has cited the in-sample R² without reporting the out-of-sample equivalent. The Observatory is the first to compute and publish both numbers.
Recommended language for all future Observatory papers: “The power law achieves R² = 0.956 in-sample across 15 years of data. Out-of-sample testing (parameters estimated on 2010–2020, evaluated on 2020–2026) yields R² = 0.546, indicating the model explains approximately half the variance in prices it has not seen.”
Pending Verification: Phases 2–5
GPD Phase 1 completed the six structural claims above. Four additional verification phases are scoped and sequenced.
Phase 2: Monte Carlo Methodology Audit. Does the current simulator correctly sample from the empirical residual distribution? Does the volatility decay toggle (0.80n per cycle) improve out-of-sample calibration compared to raw residual sampling? Phase 2 is pending: the C2.3 confidence downgrade means the toggle parameter should use the bootstrap CI range as a scenario band, not the point estimate as a calibrated constant. Blocked until cycle 5 completes.
Phase 3: Floor Bond Pricing Verification. Does the actuarially fair coupon derived from the P1 standard deviation (0.051) and BFR deceleration curve produce positive expected value for both lender and borrower? How sensitive is the coupon to the floor definition choice?
Phase 4: Cross-Asset Validation. Does gold show comparable power law floor behaviour during its monetisation phase? Scanner scans ADJ-07 (bear market duration predicts next cycle ceiling, p = 0.024, n = 3) and ADJ-08 (Shannon entropy decays monotonically across cycles) produced preliminary cross-asset signals requiring validation on longer datasets.
Phase 5: Autocorrelation as Standalone Contribution. The HAC analysis in this paper constitutes the most complete published treatment of autocorrelation in Bitcoin power law residuals. Neither Santostasi, PlanC, nor any peer-reviewed publication has addressed this rigorously. Phase 5 candidate title: Autocorrelation in Bitcoin Power Law Residuals: Why Published Standard Errors Are Wrong and What to Do About It.
Verification Summary
| Claim | Original | New status | Key finding | Action required |
|---|---|---|---|---|
| C1.1 OLS fit | VERIFIED | VERIFIED HIGH | Beta 5.694 confirmed. R² 0.956 confirmed. | None |
| C1.2 Autocorrelation | SUPPORTED | VERIFIED MOD | Eff. n = 24. HAC inflation 3.7×. | State eff. n in all papers |
| C1.3 Floor multiplier | DISPUTED | RESOLVED | Four-definition taxonomy adopted. | Apply taxonomy in all new work |
| C2.3 Decay rate | VERIFIED HIGH | VERIFIED MOD | Point estimates confirmed. Bootstrap CIs span zero. | Add CI caveat to all citations |
| C1.4 Floor unbreached | VERIFIED HIGH | VERIFIED (cond.) | True under all defs in post-2010 data. | Specify definition per claim |
| C2.5 Convergence | SUPPORTED | SUPPORTED | Range: cycle 8–10 (~2050–2070). | State as range, not point |
The verification did not falsify the Bitcoin power law. The model fits 5,713 daily closes at R² = 0.956 in-sample. The volatility decay is real, with z-scores from 5.28 to 21.09 against shuffled controls. The floor has held in 15 years of reliable price data under every tested definition. The power law is structurally intact.
What the verification did was expose three gaps that had been invisible because no one had looked directly at them. All three are now resolved or reported. The research stack is more precisely characterised than it was before the verification began.
Related Papers
Paper 9 quantifies the structural basis for the floor’s holding properties. Paper 10 applies the floor rule to loan safety. This verification provides the methodological foundation for all Observatory claims.
Data: btc_historical.json, 5,713 daily closes, 2010-07-18 to 2026-03-08. Power law: log10(price) = −16.493 + 5.688 × log10(days), genesis = 2009-01-03. All GPD scripts and Scanner scans reproducible from btc_historical.json. References: Santostasi (2024); Burger (2023); Newey & West (1987); Chow (1960); Efron & Tibshirani (1993).