Important
We came in 4th position 🏅
Project Submission: View the official project page and hackathon submission on Apart Research
ExogenousAI quantifies the impact of external policy interventions on AI capability development by combining:
- Event Study Analysis: Measures abnormal returns in AI metrics around policy announcements
- Monte Carlo Simulation: Projects AGI timelines under different policy scenarios
- Meta-Forecasting: Compares results with existing literature forecasts
How do policy interventions (compute governance, export controls, collaboration frameworks) affect the timeline to transformative AI capabilities?
AGI Timeline Projections (95% MMLU-Pro threshold):
| Scenario | Median Year | Confidence Interval | Probability within 5 years |
|---|---|---|---|
| Status Quo | 2027 | 2026-2030 | 86.9% |
| Compute Governance | 2027 | 2026-2030 | 83.1% |
| Export Control Escalation | 2027 | 2026-2030 | 80.8% |
| Open Collaboration | 2027 | 2026-2030 | 90.8% |
Key Insight: All scenarios converge on a 2027 median, indicating that under the current strong growth trend (+25.5% annually), policy interventions shift the probability distributions but not the central estimates.
Baseline Trend (from MMLU-Pro real data):
- Growth Rate: +2.128% per month (+25.5% annually by simple annualization, 12 × 2.128%; compounding monthly would give about +28.7%)
- Volatility: 9.75%
- Data Points: 16 months with real benchmark scores (within July 2023 - December 2024)
Uncertainty Decomposition:
- Technical Uncertainty: 76.9%
- Economic Uncertainty: 23.1%
- Policy Uncertainty: 0.0% (scenarios don't differentiate medians)
| Forecast Source | Median Timeline | Range |
|---|---|---|
| ExogenousAI (all scenarios) | 2027 | 2026-2030 |
| EpochAI | 2036 | 2030-2050 |
| Bio Anchors | 2055 | 2040-2080 |
| AI 2027 Report | 2027 | 2025-2030 |
ExogenousAI aligns with aggressive near-term forecasts (AI 2027) due to observed 25.5% annual growth in MMLU-Pro benchmarks.
- Source: https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
- Models: 68 state-of-the-art LLMs
- Date Range: July 2023 - December 2024
- Score Range: 10.9% (SmolLM-360M) to 52.8% (Phi-4-mini)
- Top Performers:
  - Phi-4-mini (5.6B parameters): 52.8% (Dec 2024)
  - Phi-3.5-mini-instruct (3.8B): 47.9% (Aug 2024)
  - Phi3-mini-4k (3.8B): 45.7% (Apr 2024)
- Policy Events: 4,370 AI policy announcements (SERP API, 2020-2025)
- arXiv Submissions: 45 months of AI research papers (cs.AI, cs.LG, cs.CL)
- NVIDIA Stock: 46 months of NVDA prices (Alpha Vantage API)
- EpochAI Training Compute: 284 months of frontier model compute trends
✅ No interpolation: All missing values preserved as NaN
✅ No hardcoded values: Zero manually entered benchmarks
✅ No synthetic data: Only real leaderboard scores
Classical finance event study adapted for AI policy:
AR_it = R_it - E[R_it | Normal Period]
CAR_i = Σ AR_it (over event window)
- Event Window: [-6, +6 months] around policy announcement
- Normal Period: [-12, -7 months] pre-event baseline
- Metrics: Benchmark scores, arXiv velocity, stock returns
- Statistical Test: Two-sample t-test (p < 0.05)
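The AR and CAR definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code in event_study.py: it assumes a constant-mean model for E[R_it | Normal Period], monthly observations held in a plain list, and missing months stored as NaN. The function name `abnormal_returns` and its parameters are hypothetical.

```python
import math
from statistics import fmean


def abnormal_returns(returns, event_idx, event_window=(-6, 6), normal_window=(-12, -7)):
    """Return (AR list over the event window, CAR) for one event.

    `returns` is a list of monthly values (NaN = no observation) and
    `event_idx` is the position of the announcement month in that list.
    """
    n_lo, n_hi = normal_window
    e_lo, e_hi = event_window
    if event_idx + n_lo < 0 or event_idx + e_hi >= len(returns):
        raise ValueError("series too short for the requested windows")

    # Expected return: mean of the real (non-NaN) observations in the normal period.
    normal = [r for r in returns[event_idx + n_lo:event_idx + n_hi + 1]
              if not math.isnan(r)]
    if not normal:
        raise ValueError("no real observations in the normal period")
    expected = fmean(normal)

    # AR_it = R_it - E[R_it | Normal Period]; NaN inputs stay NaN.
    event = returns[event_idx + e_lo:event_idx + e_hi + 1]
    ar = [r - expected for r in event]

    # CAR_i = sum of AR_it over the event window, skipping missing months.
    car = sum(a for a in ar if not math.isnan(a))
    return ar, car
```

Skipping NaN months in the sum is one possible convention; a stricter variant would reject any event whose window contains a gap.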
Current Results:
- 6 events analyzed (most skipped due to sparse real benchmark data)
- 0 statistically significant abnormal returns
- Interpretation: Either (1) policy effects are delayed/diffuse, or (2) insufficient statistical power from sparse data
Geometric Brownian Motion with policy adjustments:
dS_t = μ_policy × S_t × dt + σ × S_t × dW_t
Where:
- μ_policy = μ_baseline × policy_factor (scenario-specific growth adjustment)
- σ = √(σ_historical² + σ_policy²) (combined technical + policy volatility)
- S_t = benchmark score at time t
Scenarios:
- Status Quo (policy_factor=1.0): Current trajectory continues
- Compute Governance (policy_factor=0.95): Modest slowdown from regulations
- Export Control Escalation (policy_factor=0.90): Significant compute restrictions
- Open Collaboration (policy_factor=1.05): Accelerated progress from cooperation
Simulation Parameters:
- Iterations: 10,000 per scenario
- Horizon: 60 months (5 years)
- Baseline: Exponential fit to 16 real MMLU-Pro data points
- AGI Threshold: 95% (adjusted for benchmark saturation; top MMLU-Pro scores now exceed 52%)
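For readers who want to see the mechanics, here is a minimal sketch of a Geometric Brownian Motion simulation with a policy factor. It is an illustration under stated assumptions, not the code in monte_carlo.py: it uses the exact log-space GBM update with a one-month step, takes the combined σ as a single input, and records the first month in which a path reaches the threshold. The names `simulate_crossing_months` and `summarize` are hypothetical.

```python
import math
import random


def simulate_crossing_months(s0, mu, sigma, factor=1.0, threshold=95.0,
                             horizon=60, n_paths=10_000, seed=0):
    """First month each path reaches `threshold` (None if never within horizon)."""
    rng = random.Random(seed)
    # mu_policy = mu_baseline * policy_factor; the log of a GBM drifts at mu - sigma^2 / 2.
    log_drift = mu * factor - 0.5 * sigma ** 2
    log_threshold = math.log(threshold)
    crossings = []
    for _ in range(n_paths):
        log_s = math.log(s0)
        hit = None
        for month in range(1, horizon + 1):
            log_s += log_drift + sigma * rng.gauss(0.0, 1.0)
            if log_s >= log_threshold:
                hit = month
                break
        crossings.append(hit)
    return crossings


def summarize(crossings):
    """Return (probability of crossing within the horizon, median crossing month).

    Paths that never cross count as infinitely late, so the median is
    math.inf when fewer than half of the paths reach the threshold.
    For an even number of paths this returns the upper median.
    """
    prob = sum(c is not None for c in crossings) / len(crossings)
    ordered = sorted(math.inf if c is None else c for c in crossings)
    return prob, ordered[len(ordered) // 2]
```

A call such as `summarize(simulate_crossing_months(52.8, 0.02128, 0.0975, factor=0.95))` would give the crossing probability and median month for a Compute Governance style scenario. The numbers it produces depend on the discretization and seed, so they need not match the table above.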
Variance decomposition across forecast sources:
Var_total = Var_policy + Var_technical + Var_economic
- Policy Variance: Spread across ExogenousAI scenarios
- Technical Variance: Within-scenario confidence intervals
- Economic Variance: Estimated from market volatility (30% of technical)
Finding: Policy variance = 0.0 because all scenario medians = 2027. This indicates that the strong baseline trend dominates policy effects in the central estimates.
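The decomposition can be reproduced with a short calculation. The sketch below assumes that policy variance is the population variance of the scenario medians and that economic variance is 30% of technical variance, as described above. The function name `variance_shares` is hypothetical, and `technical_var=1.0` is a placeholder: only the ratios matter for the shares.

```python
from statistics import pvariance


def variance_shares(scenario_medians, technical_var, economic_ratio=0.30):
    """Split Var_total = Var_policy + Var_technical + Var_economic into shares."""
    policy_var = float(pvariance(scenario_medians))  # spread across scenarios
    economic_var = economic_ratio * technical_var    # 30% of technical, per the text
    total = policy_var + technical_var + economic_var
    return {
        "policy": policy_var / total,
        "technical": technical_var / total,
        "economic": economic_var / total,
    }


# Four identical medians give zero policy variance, so the remaining
# variance splits 1 : 0.3 between technical and economic.
shares = variance_shares([2027, 2027, 2027, 2027], technical_var=1.0)
```

With identical medians the shares come out to roughly 76.9% technical and 23.1% economic, which matches the Uncertainty Decomposition figures reported earlier.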
- parse_mmlu_manual.py: Parses MMLU-Pro leaderboard from manual input (68 models with release dates)
- scrape_policy_events.py: SERP API scraper for policy announcements (4,370 events)
- scrape_arxiv.py: arXiv API for research paper counts
- scrape_stocks.py: Alpha Vantage for NVIDIA stock data
- clean_benchmarks.py: Validates benchmark data, removes duplicates, normalizes scores
- aggregate_monthly.py: Converts all time series to monthly frequency
- merge_datasets.py: Combines benchmarks, arXiv, stocks, policy events (NO INTERPOLATION)
- event_study.py: Calculates abnormal returns and cumulative abnormal returns (CAR)
- monte_carlo.py: Runs 10,000 simulations per scenario, estimates AGI timelines
- meta_forecasting.py: Compares ExogenousAI with EpochAI, Bio Anchors, AI 2027
- plot_events.py: Time series with policy event markers, CAR plots
- plot_scenarios.py: Monte Carlo trajectories, timeline distributions, uncertainty decomposition
- Python 3.11+
- Virtual environment tool (venv)
git clone https://github.com/XAheli/ExogenousAI.git
cd ExogenousAI
python3.11 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
Create secrets.toml file in project root:
SERP_API_KEY=your_serpapi_key_here
ALPHAVANTAGE_API_KEY=your_alphavantage_key_here
ARXIV_EMAIL=your_email@example.com
API Key Sources:
- SERP API: https://serpapi.com/ (free tier: 100 searches/month)
- Alpha Vantage: https://www.alphavantage.co/ (free tier: 500 calls/day)
- arXiv: Email for rate limiting compliance (no key needed)
EpochAI training compute data is included in data/raw/epochai/. If re-downloading:
# Manually download from https://epoch.ai/data
# Place CSVs in data/raw/epochai/ai_models/
Problem: The original plan was to scrape the MMLU, HumanEval, and MATH benchmarks from the Papers with Code API. However, we discovered:
- PwC API is deprecated/unstable
- Hugging Face Open LLM Leaderboard API endpoint returns 404
- No unified benchmark database with historical scores
Solution:
- Manually collected 68 models from TIGER-Lab/MMLU-Pro leaderboard
- Assigned release dates based on public model announcements
- Focused on single authoritative benchmark (MMLU-Pro) rather than mixing heterogeneous sources
Impact:
- Only 16 months of real benchmark data (vs 60 months desired)
- Sparse data → many event study analyses skipped
- But ensures 100% real data (no fabrication)
Problem: Initial implementation used linear interpolation to fill missing months, creating 86% synthetic data (60/70 months fabricated).
Solution:
- Removed all interpolation from merge_datasets.py
- Preserved NaN values for missing observations
- Updated analysis modules to skip windows with insufficient real data
Trade-off:
- Fewer statistically significant event study results
- But scientifically honest (only real measurements analyzed)
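To make the no-interpolation rule concrete, here is a minimal sketch of an outer join over monthly series that leaves gaps as NaN. It uses only the standard library and is not the code in merge_datasets.py. The helper names `merge_monthly` and `real_points` are hypothetical, and the arXiv counts in the example are made-up placeholders (the two benchmark scores are the Phi3-mini-4k and Phi-3.5-mini-instruct values listed above).

```python
import math


def merge_monthly(**series):
    """Outer-join monthly series keyed by 'YYYY-MM'; missing values stay NaN."""
    months = sorted(set().union(*(s.keys() for s in series.values())))
    return {
        month: {name: s.get(month, math.nan) for name, s in series.items()}
        for month in months
    }


def real_points(merged, column):
    """Only the real measurements in a column, in chronological order."""
    return [row[column] for row in merged.values() if not math.isnan(row[column])]


merged = merge_monthly(
    benchmark={"2024-04": 45.7, "2024-08": 47.9},
    arxiv={"2024-04": 100, "2024-05": 110, "2024-08": 120},  # placeholder counts
)
# The benchmark value for 2024-05 is NaN rather than an interpolated 46.25.
```

Downstream analyses can then call something like `real_points` to decide whether a window has enough real data before running.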
Problem: Current models exceed 50% on MMLU-Pro (Phi-4-mini: 52.8%), but the AGI threshold was historically set at 90% on the original MMLU.
Solution:
- Raised AGI threshold from 90% to 95% to account for:
  - MMLU-Pro is harder than original MMLU
  - Current pace suggests 95% achievable by 2027-2030
Rationale: Original MMLU has ceiling effects (GPT-4 at 86.4%), MMLU-Pro provides more headroom.
Problem: Meta-forecasting shows 0% policy contribution to variance.
Root Cause:
- All 4 scenarios predict same median year (2027)
- Strong baseline growth (+25.5% annually) overwhelms policy adjustments (±5-10%)
- Scenarios differ in probability distributions (80.8% - 90.8%) but not central estimates
Interpretation:
- This is mathematically correct, not a bug
- Indicates current trends are resilient to moderate policy interventions
- More aggressive policy scenarios (e.g., 50% slowdown) would differentiate the medians
Potential Fix (not implemented):
- Increase policy_factor differences (e.g., 0.5x - 2.0x range)
- Use mean years instead of medians (spreads 2027-2028)
- But this would require strong justification for extreme policy impacts
Problem: Only 6 events analyzable (out of 4,370 collected), 0 statistically significant results.
Causes:
- Sparse benchmark data: Only 16 months with measurements
- Event windows need [-6, +6 months] data → most events skipped
- Benchmark changes are discrete jumps at model releases, not gradual
Solution Attempted:
- Lowered minimum data requirements (2 pre-event, 1 post-event points)
- But this reduces statistical power
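The relaxed data requirement can be expressed as a small filter. This is a sketch under assumptions: it treats the event month itself as belonging to neither side, uses a 6-month half-width to match the [-6, +6] event window, and stores missing months as NaN. The function name `window_is_usable` is hypothetical and may not correspond to how event_study.py applies the rule.

```python
import math


def window_is_usable(values, event_idx, half_width=6, min_pre=2, min_post=1):
    """True if the event window has enough real (non-NaN) observations."""
    lo = max(0, event_idx - half_width)
    pre = values[lo:event_idx]
    post = values[event_idx + 1:event_idx + half_width + 1]

    def count_real(xs):
        return sum(1 for x in xs if not math.isnan(x))

    return count_real(pre) >= min_pre and count_real(post) >= min_post
```

With sparse monthly scores, an event late in the series can pass (two earlier scores, one later score) while an event near the start fails for lack of pre-event points.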
Alternative Approaches (future work):
- Use daily/weekly stock prices or arXiv counts (denser data)
- Shift to "model release events" rather than "policy events"
- Employ interrupted time series analysis (fewer data requirements)
Decision: Use single benchmark (MMLU-Pro) exclusively, rather than composite of MMLU, HumanEval, MATH.
Rationale:
- Consistency: Different benchmarks have different scales, difficulty, saturation rates
- Availability: MMLU-Pro has complete leaderboard data
- Recency: Models tested consistently on same benchmark version
- Academic Standard: MMLU-Pro is widely reported in model papers
Trade-off:
- Narrower capability measurement (doesn't capture coding, math)
- But avoids heterogeneity artifacts
Decision: Preserve all missing values as NaN, no linear/spline interpolation.
Rationale:
- Scientific Integrity: Interpolation creates fake data
- Event Study Bias: Interpolated values have artificial smooth trends
- Monte Carlo Accuracy: Baseline fit should reflect real volatility
- Reproducibility: Unclear how future researchers should interpolate
Trade-off:
- Sparser data → fewer analyzable events
- But maintains data provenance and quality
Decision: Fit exponential trend y = a × exp(b × t) rather than linear or polynomial.
Rationale:
- Scaling Laws: AI capabilities theoretically improve exponentially with compute/data
- Historical Precedent: ImageNet, MMLU, GPT losses all show exponential improvement
- Geometric Brownian Motion: Natural model for percentage growth
Alternative Considered:
- Logistic growth (S-curve approaching saturation)
- But current data shows no inflection point yet
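One simple way to fit y = a × exp(b × t) is ordinary least squares on ln(y), which turns the problem into a straight-line fit. The sketch below takes that route using only the standard library. It is an assumption that this matches the project's fitting procedure; a nonlinear least-squares fit on the raw scores would weight the points differently. The function name `fit_exponential` is hypothetical.

```python
import math


def fit_exponential(t, y):
    """Least-squares fit of ln(y) = ln(a) + b * t; returns (a, b).

    Points where y is NaN are skipped, so gaps are ignored rather than filled.
    """
    pairs = [(ti, math.log(yi)) for ti, yi in zip(t, y) if not math.isnan(yi)]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two real observations")
    mean_t = sum(ti for ti, _ in pairs) / n
    mean_ly = sum(ly for _, ly in pairs) / n
    sxx = sum((ti - mean_t) ** 2 for ti, _ in pairs)
    if sxx == 0:
        raise ValueError("all observations share the same time value")
    sxy = sum((ti - mean_t) * (ly - mean_ly) for ti, ly in pairs)
    b = sxy / sxx                       # growth rate per unit of t
    a = math.exp(mean_ly - b * mean_t)  # fitted level at t = 0
    return a, b
```

If t is measured in months, b is the continuous monthly growth rate, and the residuals of the log fit give one possible estimate of monthly volatility for the GBM model.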
Decision: Set AGI threshold at 95% MMLU-Pro score (vs traditional 90% MMLU).
Rationale:
- Benchmark Difficulty: MMLU-Pro is harder than MMLU (CoT required, more distractors)
- Saturation Adjustment: Current models score 52.8%, and 90% would likely be reached within 1-2 years
- Conservative Estimate: 95% ensures superhuman performance on graduate-level questions
Calibration:
- 95% MMLU-Pro ≈ 99th percentile human expert
- Aligns with "economically transformative" capability threshold
Observation: Status Quo, Compute Governance, Export Controls, and Open Collaboration all converge to median year 2027.
Explanation:
- Strong Baseline: +25.5% annual growth from empirical MMLU-Pro data
- Moderate Policy Adjustments: Scenarios modify growth by ±5-10%, not enough to shift median >1 year
- Short Runway: Starting from 52.8%, only ~42 percentage points to 95% threshold
- High Volatility: 9.75% monthly volatility creates large confidence intervals (2026-2030)
Implications:
- Not a failure of methodology: Reflects reality that current trends are strong
- Policy effects are subtle: Scenarios differ in probability (80.8% - 90.8%) not timing
- Intervention threshold: Would need >50% growth slowdown to meaningfully delay AGI
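A back-of-the-envelope calculation shows why ±5-10% adjustments barely move the crossing date. The sketch below ignores volatility entirely and treats +2.128% per month as a continuous growth rate, which is a simplifying assumption rather than the simulation's actual drift. The function name `months_to_threshold` is hypothetical.

```python
import math


def months_to_threshold(s0, threshold, monthly_rate, policy_factor=1.0):
    """Months for s0 * exp(rate * factor * t) to reach `threshold` (no noise)."""
    return math.log(threshold / s0) / (monthly_rate * policy_factor)


# Deterministic time from 52.8% to 95% under each growth multiplier.
for factor in (1.05, 1.0, 0.95, 0.90, 0.50):
    months = months_to_threshold(52.8, 95.0, 0.02128, factor)
    print(f"policy_factor={factor:.2f}: {months:.1f} months")
```

Under these assumptions the baseline takes roughly 28 months and a policy_factor of 0.90 adds only about 3 months, while halving the growth rate doubles the time to roughly 55 months. That is consistent with the observation that only a slowdown on the order of 50% shifts the timeline by years.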
Mathematical Reason:
Var_policy = Var([2027, 2027, 2027, 2027]) = 0
Conceptual Reason:
- Policy contribution measures variance across scenarios
- If all scenarios predict same outcome, contribution = 0
- This is correct, not erroneous
What It Means:
- Under current trends, policy interventions don't differentiate central estimates
- Policy affects probabilities and shapes of distributions
- To get >0% contribution, would need scenarios that predict 2025, 2027, 2030, 2035, etc.
Is This Realistic?
- Depends on whether policy can truly cause 5-10 year delays
- Current scenarios are conservative (±10% growth adjustments)
- Historical evidence: Export controls on GPUs (2022-2024) didn't stop frontier progress
- But: More aggressive interventions (compute caps, research moratoria) could matter more
- Sparse Benchmarks: Only 16 months of real MMLU-Pro scores
- Single Benchmark: Doesn't capture multimodal, robotic, or scientific capabilities
- Leaderboard Bias: Only publicly reported models (excludes proprietary/classified systems)
- Retrospective Dating: Model release dates are approximate, not exact evaluation dates
- Event Study Power: Insufficient data for robust statistical inference
- Policy Scenarios: Adjustments (±5-10%) are illustrative, not empirically validated
- Exponential Assumption: May not hold if scaling laws break down
- Threshold Arbitrariness: 95% MMLU-Pro as "AGI" is a proxy, not ground truth
- Capabilities ≠ Deployment: Timelines are for technical capability, not societal impact
- Safety Excluded: Doesn't model alignment, interpretability progress
- Expand Benchmarks: Add HumanEval+, MATH, GPQA, MMMU for multimodal
- Automate MMLU-Pro: Implement HTML scraper for TIGER-Lab Space
- Daily Indicators: Use stock prices, arXiv daily counts for denser event studies
- Robustness Checks: Jackknife, bootstrap for Monte Carlo confidence intervals
- Causal Inference: Diff-in-diff, synthetic controls for policy effects
- Regime-Switching Models: Detect trend breaks (e.g., if scaling laws fail)
- Multi-Benchmark Composite: PCA/factor analysis across MMLU-Pro, HumanEval, MATH
- Dynamic Scenarios: Policy adjustments that vary over time
- Endogenous Policy: Model feedback loops (progress → policy → progress)
- Alignment Timelines: Parallel forecast for AI safety benchmarks
- Economic Impact: Link capability timelines to GDP, employment effects
- Global Coordination: Multi-country game theory (US/China AI race)
If you use this framework in your research, please cite:
@misc{exogenousai2025,
title = {(HckPrj) ExogenousAI},
author = {Aheli Poddar},
year = {2025},
organization = {Apart Research},
note = {Hackathon research sprint submission},
url = {https://apartresearch.com/project/exogenousai}
}
Status: Research Prototype - Results are preliminary and should not guide high-stakes decisions.