A Fable 5 agent called both World Cup openers — to the exact score

Mick Brzeziński

  • AI agents
  • LLM
  • build in public

Before they blocked Fable, it wrote me the system that called the first two World Cup matches perfectly.

Mexico vs South Africa finished 2–0. My agent had called it 2–0 that morning. South Korea vs Czechia finished 2–1 — the agent had called 2–1.

Two opening matches of the World Cup. Two exact scorelines. In our office pool at createit.com, an exact score is worth 3 points and the winner alone is worth 1. Six out of six.

I’ve shipped software for fifteen years and build LLM agents for a living. So this isn’t “look what AI did.” It’s a teardown of why it worked — and, at the very bottom, an honest look at how it’s gone since. None of it is magic.

Timeline: Jun 9 Fable 5 ships, Jun 11 the agent calls both World Cup openers (Mexico 2–0, Korea 2–1) for 6/6 points, Jun 12 the US pulls Fable 5 for foreign nationals.

The agent does not “predict the score”

Prompt a model with “give me the score” and you get a number from its gut — untunable, untestable. So I split the question in two:

Judgment (the model’s job): who wins (home/away/draw) and by how much (slight / clear / mismatch). This is where form, the group table, and stakes get weighed — the part LLMs are actually good at.

Mapping (a deterministic matrix in code): it turns that judgment into a concrete scoreline, the same way every time. It never decides who wins — only what number the judgment maps to.

An LLM is brilliant at judgment and terrible at consistency. So I let it judge and forbade it from improvising the scoreline. Determinism is the feature.

How it works: match context → Fable 5 makes the judgment (who wins, by how much) → a deterministic scoreline matrix → an exact score; only the 'normal' profile is passed to a human to approve before the bet is placed.

Why you never bet 4–2

Scoring: exact = 3, winner only = 1, miss = 0. That gives one rule:

EV = 1 × P(direction) + 2 × P(exact score)

Lock the direction point, then within that direction pick the most common scoreline. From the 2022 group stage: 2–0 (23%), 1–0 (19%), 2–1 (17%), 0–0 (12%), 1–1 (6%). Exotic scores almost never hit exactly, so they waste the 2-point upside. The matrix only reaches for them on a real mismatch.

Mexico, clear favourite → most likely score is 2–0. It finished 2–0. Not a guess — the top of the distribution, placed coldly. The priors are backtested on 2018 and 2022, and the matrix is cross-validated so it doesn’t overfit one tournament.
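To make the rule concrete, here is a minimal Python sketch of the EV formula above. The function name and the probabilities in the example are illustrative assumptions, not figures from the backtest.

```python
def expected_points(p_direction: float, p_exact: float) -> float:
    """EV under 3/1/0 scoring: 1 point for the direction, plus 2 more for the exact score."""
    if not 0.0 <= p_exact <= p_direction <= 1.0:
        raise ValueError("need 0 <= p_exact <= p_direction <= 1")
    return 1 * p_direction + 2 * p_exact


# Hypothetical numbers for a clear home favourite (assumed, for illustration only).
P_HOME_WIN = 0.60
candidates = {
    "2-0": 0.14,  # a common scoreline within the home-win direction
    "4-2": 0.01,  # an exotic scoreline in the same direction
}

ev = {score: expected_points(P_HOME_WIN, p) for score, p in candidates.items()}
best = max(ev, key=ev.get)
print(ev, best)
```

Both candidates bank the same direction point, so the only thing separating them is how often the exact score lands. Under these assumed numbers the common scoreline wins comfortably.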

Nine agents, one bets for real

I run nine “characters.” All get the same judgment; they differ only in how aggressively they map it:

  • 🛡️ safe — the floor (1–0).
  • ⚖️ normal — modal scoreline by strength. The only one that bets for real.
  • ⚔️ brave — chases the exact (bigger margins).
  • 🤖 control — ignores strength; dumb 2–0 / 1–1 baseline.
  • 📉 minimalist — always the lowest score.
  • 🥅 btts — favourite wins, underdog scores (2–1 / 3–1).
  • 🎯 calibrated — risk dialled by confidence.
  • 🤖 agent_call / agent_free — a free-reasoning model instead of the matrix.

Eight are paper bets, scored in a file. Same judgment across all nine means character is the only variable, so at tournament’s end I won’t just know how many points I won, I’ll know which philosophy won.

The disciplined agent beats the creative one

Here’s the result I didn’t expect to enjoy this much.

Two of those nine profiles throw out the matrix entirely and let a free-reasoning model pick scores on instinct — momentum, narrative, nerve. Exactly the “just let the AI cook” approach everyone reaches for first.

It’s losing. The cautious, deterministic normal is ahead of the free-reasoning agent. The version that’s allowed to be clever is doing worse than the version I deliberately made boring. Turns out the value of an LLM here is its judgment, not its imagination — and the moment you let it freestyle the part that should be a lookup table, it gives points back. Discipline > vibes, even when the vibes come from a very good model.

Why I’m ignoring the hot start

Two perfect calls feels like a system that works. It mostly feels like that because two is a tiny number.

Twenty-four matches per round, scored 3/1/0, is loud with variance. A hot open is exactly what noise looks like before it regresses. So the rule is: re-judge every round from scratch, and do not retune the matrix off one round’s results — the real lever is judgment quality, not the table. There’s a checkpoint to re-fit it after the group stage, on this tournament’s own data, with ~72 judgments in hand. Not before. Believing your own hot streak is how you turn a good system into a curve-fit to luck.
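For a rough sense of scale, the sketch below computes the mean and standard deviation of one 24-match round under 3/1/0 scoring. It assumes matches are independent and uses made-up hit rates; neither the function name nor the numbers come from the backtest.

```python
import math


def round_stats(p_direction: float, p_exact: float, matches: int = 24):
    """Mean and standard deviation of total points for one round.

    Per match: 3 points with probability p_exact, 1 point with probability
    (p_direction - p_exact), 0 otherwise. Matches are assumed independent.
    """
    p_one = p_direction - p_exact
    mean = 3 * p_exact + 1 * p_one
    second_moment = 9 * p_exact + 1 * p_one
    variance = second_moment - mean ** 2
    return matches * mean, math.sqrt(matches * variance)


# Assumed hit rates, for illustration only.
mean_pts, sd_pts = round_stats(p_direction=0.50, p_exact=0.12)
print(f"expected {mean_pts:.1f} points per round, give or take {sd_pts:.1f}")
```

With those assumed rates, a single round swings by several points on luck alone, which is the argument for not retuning anything off one round.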

(How loud is that variance? Scroll to the bottom. It came for me fast.)

A human pulls the trigger

A bet is irreversible. So the agent prepares the whole round, shows me the slate as a dry run, and stops. Only my explicit OK releases the real bets. The agent does the work; the human owns the one action that can’t be undone.
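The gate can be sketched in a few lines. This illustrates the pattern rather than the actual script: `Bet`, `run_round` and the `submit` callback are hypothetical names, and the real submission step is stubbed out with a list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Bet:
    match: str
    score: str


def run_round(slate: List[Bet], approved: bool, submit: Callable[[Bet], None]) -> int:
    """Print the slate as a dry run; submit only when a human has approved.

    Returns the number of bets actually submitted.
    """
    for bet in slate:
        print(f"DRY RUN  {bet.match}: {bet.score}")
    if not approved:
        return 0  # stop here: nothing irreversible has happened
    for bet in slate:
        submit(bet)
    return len(slate)


slate = [Bet("Mexico vs South Africa", "2-0"), Bet("South Korea vs Czechia", "2-1")]
placed: List[Bet] = []  # stand-in for the real, irreversible submission

sent_without_ok = run_round(slate, approved=False, submit=placed.append)
sent_with_ok = run_round(slate, approved=True, submit=placed.append)
```

The point of the design is that approval is an explicit argument: there is no code path that submits by default.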

Under the hood

One Python file, standard library only, zero LLM calls in the code: the script is plumbing, the model is the brain, and they never blur. The model’s whole output per match is one short colon-delimited string:

1:home:clear:lean

(game · who wins · how big the gap · how sure of the direction). A deterministic table maps it to one scoreline for each table-driven profile: clear → normal 2-0, safe 1-0, brave 3-1, and so on.
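As a sketch of that mapping step, the parser below splits the judgment string into its four fields and looks up a scoreline. Only the `clear` row (normal 2-0, safe 1-0, brave 3-1) comes from the text above. The field names, the function names, and the choice to mirror the score for an away win are assumptions made for this illustration, and the other rows are left out rather than guessed.

```python
from typing import NamedTuple


class Judgment(NamedTuple):
    game: int
    winner: str      # "home", "away" or "draw"
    gap: str         # "slight", "clear" or "mismatch"
    confidence: str  # e.g. "lean"


# Scorelines written favourite-first. Only the "clear" row is quoted in the text.
MATRIX = {
    "clear": {"normal": (2, 0), "safe": (1, 0), "brave": (3, 1)},
}


def parse(token: str) -> Judgment:
    game, winner, gap, confidence = token.split(":")
    return Judgment(int(game), winner, gap, confidence)


def scoreline(j: Judgment, profile: str) -> str:
    """Look up the favourite-first score and orient it as home-away."""
    if j.winner not in ("home", "away"):
        raise ValueError("draw rows are not covered in this sketch")
    fav, dog = MATRIX[j.gap][profile]  # KeyError for rows this sketch omits
    home, away = (fav, dog) if j.winner == "home" else (dog, fav)
    return f"{home}-{away}"


j = parse("1:home:clear:lean")
print({p: scoreline(j, p) for p in ("normal", "safe", "brave")})
```

Because the lookup is a plain dictionary, the same judgment always yields the same scoreline, and an unknown row fails loudly instead of being improvised.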

The nugget I like: the calibrated profile maps confidence to risk, and it dials down, never up. Under flat 3/1/0 scoring, chasing the exact score is negative-EV however sure you are — so being confident is never a licence to gamble.

The two “creative” profiles skip the table for a frozen persona prompt — “a bold, instinct-driven bettor with ice in your veins… back your conviction with committed scorelines, return only JSON.” Per round the pipeline is just: prepare → I judge → derive → dry-run → I approve → submit → score.

Steal this

Strip out the football and the transferable pattern is just five lines:

  1. Split judgment from formatting. Let the LLM make the fuzzy call; don’t let it improvise the structured output.
  2. Make the deterministic part deterministic — a table or function in code, not a second prompt.
  3. Optimise for the real scoring function, not for looking smart. Write the EV down.
  4. Run shadow variants. One real, N on paper, same inputs — so you measure the strategy, not one lucky outcome.
  5. A human owns the irreversible action. The agent prepares; you approve.

So how’s it actually going

Right. About that hot start.

After going 2-for-2 on exact scores before dinner on Sunday, my agent has correctly predicted exactly zero exact scores since. Peak performance lasted roughly 180 minutes.

The full scorecard for normal, through 12 of Matchday 1’s 24 matches:

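| # | Match | Call | Result | Pts |
|---|---|---|---|---|
| 1 | Mexico vs South Africa | 2–0 | 2–0 | 3 (exact ✅) |
| 2 | South Korea vs Czechia | 2–1 | 2–1 | 3 (exact ✅) |
| 4 | United States vs Paraguay | 2–1 | 4–1 | 1 (direction) |
| 5 | Haiti vs Scotland | 0–2 | 0–1 | 1 (direction) |
| 10 | Germany vs Curaçao | 2–0 | 7–1 | 1 (direction) |
| 12 | Sweden vs Tunisia | 2–1 | 5–1 | 1 (direction) |
| 3 | Canada vs Bosnia & Herzegovina | 2–1 | 1–1 | 0 |
| 6 | Australia vs Turkey | 1–2 | 2–0 | 0 |
| 7 | Brazil vs Morocco | 2–1 | 1–1 | 0 |
| 8 | Qatar vs Switzerland | 0–2 | 1–1 | 0 |
| 9 | Ivory Coast vs Ecuador | 1–1 | 1–0 | 0 |
| 11 | Netherlands vs Japan | 2–1 | 2–2 | 0 |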

10 points across 12 matches. Two exact scorelines, both before 9pm Sunday. Six of twelve directions correct — a 50% hit rate, which is to say a coin would have matched my carefully engineered, backtested, cross-validated football oracle on who even wins. (It did call Germany to beat Curaçao. It did not call 7–1. In fairness, nobody called 7–1.)
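Those totals can be rechecked from the scorecard. The sketch below re-scores the twelve rows under the 3/1/0 rule; the helper names (`direction`, `points`) are illustrative and not taken from the actual script.

```python
def direction(home: int, away: int) -> str:
    """Return 'home', 'away' or 'draw' for a scoreline."""
    if home > away:
        return "home"
    if home < away:
        return "away"
    return "draw"


def points(call, result) -> int:
    """3 for the exact score, 1 for the right direction only, 0 otherwise."""
    if call == result:
        return 3
    return 1 if direction(*call) == direction(*result) else 0


# (match number, call, result) copied from the scorecard above.
ROWS = [
    (1, (2, 0), (2, 0)), (2, (2, 1), (2, 1)), (3, (2, 1), (1, 1)),
    (4, (2, 1), (4, 1)), (5, (0, 2), (0, 1)), (6, (1, 2), (2, 0)),
    (7, (2, 1), (1, 1)), (8, (0, 2), (1, 1)), (9, (1, 1), (1, 0)),
    (10, (2, 0), (7, 1)), (11, (2, 1), (2, 2)), (12, (2, 1), (5, 1)),
]

scores = [points(call, result) for _, call, result in ROWS]
total = sum(scores)
exact = scores.count(3)
directions_right = sum(1 for s in scores if s > 0)
print(total, exact, directions_right)  # prints: 10 2 6
```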

The good news: normal — the only one that bets for real — is top of all nine. The free-reasoning agent is dead last. Discipline > vibes.

Leaderboard through 12 matches: normal (the real bet) leads on 10 points; safe, brave, control, minimalist, btts, calibrated and agent_call all on 8; agent_free (free rein) last on 7.

104 matches in the tournament, 92 to go. I’ll report the numbers as they land — including the ones that make me look like this.


Footnote, re: the title. I built this with Claude Fable 5, which was public for about three days. It launched June 9; my agent called the openers June 11; on June 12 the US ordered Anthropic to cut Fable 5 off for all foreign nationals, and it was pulled worldwide within hours. I’m a Polish dev, so the model that engineered this was gone a day after it went 2-for-2. Access is fragile; the edge is knowing what to ask before someone decides you can’t.