TSEval: an open leaderboard you can check

Most forecasting numbers are impossible to check. A paper reports a result, a leaderboard reprints it, and almost no one re-runs it. TSEval is built the other way around: every row on the board is a submission you can open — the result, the agent's full experiment trajectory, and a readable report — that anyone can audit and reproduce. The board is not a table we maintain by hand; it is a function of the evidence, rebuilt from scratch the moment someone submits.

It is the open scoreboard for ModernTSF, the producer framework where the experiments actually run. ModernTSF is where you run; TSEval is where the run is shown — in the open, with everything needed to trust it attached.

The problem: numbers that don't add up

Open three time-series papers and you will find three leaderboards that look comparable and aren't. A slightly different split, a slightly different lookback window, one metric averaged a little differently — none of it dishonest, just the thousand small choices that pile up when every group rebuilds evaluation from scratch. Add them together and a 0.41 in one paper and a 0.39 in another tell you nothing about which method is better.
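To make "one metric averaged a little differently" concrete, here is a minimal sketch with made-up toy data (the series and values are hypothetical, not drawn from any paper or from TSEval). The same forecasts get two different MSE figures depending on whether errors are averaged per series first or pooled across all points.

```python
def mse(y_true, y_pred):
    """Mean squared error over one series."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data: series name -> (actuals, forecasts).
series = {
    "a": ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]),
    "b": ([10.0, 12.0], [12.0, 10.0]),
}

# Convention 1: compute MSE per series, then average the per-series scores.
per_series = [mse(y, yhat) for y, yhat in series.values()]
macro_mse = sum(per_series) / len(per_series)

# Convention 2: pool every point from every series, then take one MSE.
all_true = [t for y, _ in series.values() for t in y]
all_pred = [p for _, yhat in series.values() for p in yhat]
pooled_mse = mse(all_true, all_pred)

print(macro_mse, pooled_mse)  # 2.5 2.0
```

Both numbers are a legitimate "MSE" for the same model on the same data, which is why a fixed, shared protocol matters more than any single figure.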

Agents make this worse, not better: they write code faster than anyone can audit it, so the field now produces "similar but not the same" results at machine speed. A number with no traceable provenance — no fixed split, no recorded process — is not a result. It is a screenshot. A leaderboard is only worth reading if every row was produced the same way and can be walked back to the exact code, data, and process that made it. That is the whole point of TSEval.

A submission is evidence, not a number

This is what makes the board trustworthy. A submission is not a number you type into a form; it is a bundle of three things: a trajectory that records the agent's experiment at the CLI boundary (agent-agnostic: Claude Code, Codex, and OpenCode all serialize the same way), a single verified result carrying everything needed to place and trust it, and a short human-readable report. Their shape is fixed by a thin JSON Schema, so the leaderboard depends only on that schema, with zero Python coupling to the producer, and producer and consumer can each be rewritten without touching the other. Weights are not required: a row earns its place with its result and its process, not a multi-gigabyte checkpoint, though you can optionally archive trained weights for bit-level reproducibility.
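As a rough illustration of what a schema-only contract can look like, the sketch below checks a result against a small schema fragment. Everything in it is an assumption made for the example: the field names (`model`, `dataset`, `horizon`, `mse`), the `RESULT_SCHEMA` dict, and the hand-rolled `validate` helper are hypothetical and are not TSEval's actual schema or tooling. A real pipeline would more likely run a full JSON Schema validator, such as the `jsonschema` package, against the published schema.

```python
import json

# Hypothetical schema fragment; the real TSEval schema may differ entirely.
RESULT_SCHEMA = {
    "type": "object",
    "required": ["model", "dataset", "horizon", "mse"],
    "properties": {
        "model": {"type": "string"},
        "dataset": {"type": "string"},
        "horizon": {"type": "integer"},
        "mse": {"type": "number"},
    },
}

# Map the JSON Schema type names used above to Python types.
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float)}


def validate(result, schema=RESULT_SCHEMA):
    """Return a list of human-readable problems; an empty list means valid.

    Supports only the tiny subset of JSON Schema used in RESULT_SCHEMA.
    """
    if not isinstance(result, dict):
        return ["result is not a JSON object"]
    problems = []
    for key in schema["required"]:
        if key not in result:
            problems.append(f"missing required field: {key}")
    for key, spec in schema["properties"].items():
        if key in result:
            value = result[key]
            expected = _JSON_TYPES[spec["type"]]
            # bool is a subclass of int in Python, but JSON booleans are not numbers.
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(f"field {key} should be of type {spec['type']}")
    return problems


good = json.loads('{"model": "linear", "dataset": "toy", "horizon": 96, "mse": 0.41}')
bad = json.loads('{"model": "linear", "mse": "0.41"}')

print(validate(good))  # []
print(validate(bad))
```

Because the check consumes nothing but parsed JSON and a schema, the side that writes a result and the side that reads it share no code, only the contract.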

The board itself is built in a deterministic CI step that needs no torch and no GPU. It reads every submission, rejects anything incomplete or schema-invalid (with a reason, never silently), collates the rest, and ranks by MSE (lower is better). Given the same submissions, anyone gets the same board: the ranking is a pure function of public evidence, with no human editing a table. It spans both static benchmarks (fixed data and the same split every time, so today's number is comparable to next year's) and periodically refreshed real-time tracks on live data you can't overfit to. Static tracks measure the method; real-time tracks measure the method on data nobody has seen yet.
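The build step described above can be sketched as a pure function from submissions to a ranked board plus a list of rejections. This is a minimal illustration under stated assumptions, not TSEval's actual pipeline: the `build_board` function, the `id`, `model`, and `mse` field names, and the tie-break on `id` are all hypothetical choices made for the example.

```python
import math


def build_board(submissions, required=("id", "model", "mse")):
    """Split submissions into a ranked board and a list of (id, reason) rejections.

    Hypothetical sketch: field names and rules are assumptions for illustration.
    """
    accepted, rejected = [], []
    for sub in submissions:
        missing = [key for key in required if key not in sub]
        if missing:
            # Every rejection carries a reason; nothing is dropped silently.
            rejected.append((sub.get("id", "<unknown>"), "missing: " + ", ".join(missing)))
            continue
        score = sub["mse"]
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not math.isfinite(score):
            rejected.append((sub["id"], "mse is not a finite number"))
            continue
        accepted.append(sub)
    # Sort by (mse, id) so ties break the same way regardless of input order.
    board = sorted(accepted, key=lambda s: (s["mse"], s["id"]))
    rejected.sort()
    return board, rejected


subs = [
    {"id": "s3", "model": "c", "mse": 0.39},
    {"id": "s1", "model": "a", "mse": 0.41},
    {"id": "s2", "model": "b"},
    {"id": "s4", "model": "d", "mse": float("nan")},
]

board, rejected = build_board(subs)
print([row["id"] for row in board])  # ['s3', 's1']
print(rejected)
```

Feeding the same submissions in a different order yields the same board, which is the property that lets anyone rebuild and compare it.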

See the board — and add to it

The leaderboard is live, open, and growing. Every row links to its full evidence, and the rankings, the method-evolution view, and the per-track breakdowns are all on the site:

→ tseval.diaugeia.ai

Everything is open source. The leaderboard, its build pipeline, and every submission live in github.com/Diaugeia/TSEval; the datasets and an optional weights archive live on Hugging Face under the Diaugeia org.

Submitting is a pull request, not a request for access: run your experiment in ModernTSF, let the CLI package the evidence, and open a PR on the leaderboard repo. CI validates it against the schema and, if it passes, your row appears with its full evidence attached — see SUBMITTING.md for the details.

If you work on time series — or just want to see whether your model survives contact with a fixed protocol — clone ModernTSF, run something, and submit it. Or join us in building the board itself.