ModelBall

Moneyball for LLMs

Can understanding ever-changing AI model biases give you an edge in football analytics?

Five AIs predict 104 World Cup matches. We're testing whether knowing their blind spots helps.

The research papers

April 21, 2026 · 26 pages · 511 KB

Moneyball for LLMs

Behavioral fingerprinting of frontier AI models in football talent evaluation and event prediction

Four frontier AI models tested across 12 talent dimensions and ~45,000 trials. Documents a League Prestige Discount (Cohen’s h = 1.18–1.41), unanimous across all four models, and demographic evaluation inconsistency with EU AI Act compliance implications.
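For readers unfamiliar with the effect size quoted here, Cohen's h measures the gap between two proportions on an arcsine-transformed scale; by Cohen's convention, values of 0.8 and above count as large. The sketch below shows the standard formula only. The function name `cohens_h` and the two example rates are illustrative assumptions, not figures from the paper.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two proportions after the arcsine transform."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Hypothetical rates (NOT from the paper): how often a model rates an
# otherwise identical player favourably in a high-prestige league versus
# a low-prestige league.
h = cohens_h(0.80, 0.25)
print(f"Cohen's h = {h:.2f}")
```

With these made-up rates the result is about 1.17, which gives a feel for how large a gap in proportions an h above 1 represents.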

April 27, 2026 · 18 pages · 353 KB

How can you improve the predictive power of LLMs in sports?

Two mechanisms for improving LLM football match predictions

979 matches across 18 leagues. A three-bias formula predicts model accuracy with r = 0.997 before any prediction is made. Bias-derived calibration improves Brier score by 4.6–7.3% per model.
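As a rough illustration of what a percentage improvement in Brier score means, the sketch below scores three-way match forecasts (home win, draw, away win) before and after a recalibration step. It assumes the multiclass form of the Brier score; the summary above does not say which variant the paper uses. All probabilities and results are invented, and `brier_score` and `relative_improvement` are hypothetical helper names.

```python
from typing import Sequence

OUTCOMES = ("home", "draw", "away")


def brier_score(probs: Sequence[Sequence[float]], results: Sequence[str]) -> float:
    """Mean multiclass Brier score over matches; lower is better."""
    if len(probs) != len(results) or not results:
        raise ValueError("need one probability triple per result")
    total = 0.0
    for triple, result in zip(probs, results):
        # Squared error against the one-hot encoding of the actual result.
        total += sum(
            (p - (1.0 if outcome == result else 0.0)) ** 2
            for p, outcome in zip(triple, OUTCOMES)
        )
    return total / len(results)


def relative_improvement(baseline: float, calibrated: float) -> float:
    """Percentage reduction in Brier score relative to the baseline."""
    return 100.0 * (baseline - calibrated) / baseline


# Invented data: two matches, raw model output vs. a recalibrated version.
results = ["home", "draw"]
raw = [(0.70, 0.20, 0.10), (0.60, 0.20, 0.20)]
calibrated = [(0.60, 0.25, 0.15), (0.50, 0.30, 0.20)]

raw_bs = brier_score(raw, results)
cal_bs = brier_score(calibrated, results)
print(f"raw={raw_bs:.4f} calibrated={cal_bs:.4f} "
      f"improvement={relative_improvement(raw_bs, cal_bs):.1f}%")
```

Because the Brier score penalises confident misses heavily, pulling overconfident probabilities toward the middle can lower the score even when the predicted winner does not change.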

THE KEY QUESTION

Does knowing AI blind spots improve predictions?
The Edge (bias-corrected blend): 57% correct
Simple Average (equal-weight blend): 57% correct
Based on 28 matches scored
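To make the comparison concrete, here is a minimal sketch of an equal-weight blend next to a weighted blend. The weighted version is only a stand-in: this page does not describe how The Edge derives its bias corrections, so the weights, the probabilities, and the function names below are all hypothetical. Only the model names Grok and GPT-5.4 come from the page.

```python
from typing import Dict, Tuple

Probs = Tuple[float, float, float]  # (home win, draw, away win)
OUTCOMES = ("home", "draw", "away")


def equal_weight_blend(predictions: Dict[str, Probs]) -> Probs:
    """Average each outcome probability across models with equal weight."""
    n = len(predictions)
    return tuple(sum(p[k] for p in predictions.values()) / n for k in range(3))


def weighted_blend(predictions: Dict[str, Probs], weights: Dict[str, float]) -> Probs:
    """Weighted average; weights are normalised so the result still sums to 1."""
    total = sum(weights[m] for m in predictions)
    return tuple(
        sum(weights[m] * predictions[m][k] for m in predictions) / total
        for k in range(3)
    )


def pick(probs: Probs) -> str:
    """Return the outcome with the highest blended probability."""
    return OUTCOMES[max(range(3), key=lambda k: probs[k])]


# Invented forecasts for a single match.
preds = {
    "Grok": (0.50, 0.30, 0.20),
    "GPT-5.4": (0.30, 0.30, 0.40),
    "model_c": (0.40, 0.30, 0.30),  # placeholder name
}
# Hypothetical weights, e.g. down-weighting a model judged to be biased here.
weights = {"Grok": 1.0, "GPT-5.4": 0.5, "model_c": 1.0}

simple = equal_weight_blend(preds)
weighted = weighted_blend(preds, weights)
print(pick(simple), pick(weighted))
```

Both blends can pick the same outcome even when their probabilities differ, which is one reason two methods may tie on accuracy over a small sample of matches.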

PREDICTION ACCURACY

1. Grok: 57% (16/28 correct)
2. The Edge: 57% (16/28 correct)
3. GPT-5.4: 54% (15/28 correct)
World Cup 2026 progress: 28 of 104 matches

Where the AIs disagree

Each spike shows a blind spot. Bigger spike = bigger bias.

All five models are given identical inputs. These differences come purely from how each model reasons about the same evidence.

Chart axes: League prestige, Club prestige, Demographics, Age curve, Temporal weight, Tournament pedigree, Attribute type, Role value, Risk tolerance, Media narrative, Tactical knowledge, Tactical context, Fixture difficulty, Home advantage, Upset ID, Narrative override, Odds integration, Form recency, Squad depth, Key player absence, Stakes & pressure, xG integration

Get predictions before each match

Receive all 6 predictions before kickoff. See where the AIs disagree and why.