TS-Eval: an open, reproducible leaderboard for time-series forecasting
Every entry is a community submission (one agent trajectory plus one verified result), ranked transparently across tracks, datasets, and horizons. We open with a 135-model battle on CSI-300 and an honest finding: nothing separates at the top. The leaders are a dead heat of graph and per-series models that clearly beat the naive floor, yet their forecasts have near-zero correlation with the truth.
Keep research Simple and Stupid.
Today we are releasing TS-Eval — an open, reproducible leaderboard for time-series forecasting. Every entry is a community submission: one agent trajectory plus one verified result, ranked transparently across tracks, datasets, and horizons. It is the evaluation layer that sits on top of ModernTSF, the producer framework we released on June 10.
The problem: numbers that don't add up
Open three time-series papers and you will find three leaderboards that look comparable and aren't. The split is a little different. The lookback window is a little different. One drops the last incomplete batch and one doesn't. The metric is averaged a little differently. None of it is dishonest — it is just the thousand small choices that pile up when every group rebuilds evaluation from scratch. Add it together and a 0.41 in one paper and a 0.39 in another tell you nothing about which method is better.
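Two of these small choices are easy to make concrete. The sketch below uses made-up squared errors (not values from any benchmark) to show how per-sample averaging, averaging per-batch means, and dropping the incomplete last batch give three different "MSE" numbers for the same predictions. The batch size and helper names are illustrative assumptions.

```python
# Toy squared errors for 10 test samples (illustrative values, not from any benchmark).
squared_errors = [0.2, 0.4, 0.1, 0.3, 0.5, 0.6, 0.2, 0.1, 0.9, 1.1]
BATCH_SIZE = 4


def batches(values, size, drop_last=False):
    """Split values into consecutive batches; optionally drop an incomplete final batch."""
    out = [values[i:i + size] for i in range(0, len(values), size)]
    if drop_last and out and len(out[-1]) < size:
        out = out[:-1]
    return out


def mean(values):
    return sum(values) / len(values)


# Choice A: average over every sample.
mse_per_sample = mean(squared_errors)

# Choice B: average the per-batch means (the short last batch gets equal weight).
mse_batch_mean = mean([mean(b) for b in batches(squared_errors, BATCH_SIZE)])

# Choice C: same as B, but the incomplete last batch is dropped.
mse_drop_last = mean(
    [mean(b) for b in batches(squared_errors, BATCH_SIZE, drop_last=True)]
)

print(f"per-sample:          {mse_per_sample:.4f}")
print(f"mean of batch means: {mse_batch_mean:.4f}")
print(f"drop last batch:     {mse_drop_last:.4f}")
```

On this toy input the three conventions report 0.44, about 0.53, and 0.30. Nothing about the model changed, only the bookkeeping.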
Agents make this worse, not better. They write code faster than anyone can audit it, so the field now produces "similar but not the same" results at machine speed. A number with no traceable provenance — no fixed split, no recorded process, no archived weights — is not a result. It is a screenshot.
A leaderboard is only worth reading if every row on it was produced the same way, and if you can walk any row back to the exact code, data, and weights that made it. That is what TS-Eval is for.
What TS-Eval is
TS-Eval is a public, append-only leaderboard. It does not run your model for you and it does not trust your README. Instead, it accepts evidence bundles from the community and ranks them under one fixed protocol. Every entry carries the proof of how it was made, so the leaderboard is auditable end to end — by a human, or by an agent.
Three properties hold by construction:
- Comparable. Splits, lookback, horizon, and metric are fixed by the protocol, not by the submitter.
- Reproducible. Each result references the exact trained weights that produced it, by sha256.
- Open. Datasets, submissions, weights, and the leaderboard frontend all live in public repositories. Anyone can read them, audit them, or submit to them.
Two tiers of tracks
A leaderboard on a frozen dataset eventually gets memorized — the field overfits to the test set and the ranking stops meaning anything. So TS-Eval has two tiers.
- Static tracks. Fixed benchmark data, split by task mode into time_series, spatiotemporal, and covariate. This is the stable, citable backbone: the same data, the same split, every time, so a number from today is comparable to a number from next year.
- Realtime tracks. Periodically refreshed live datasets. The data keeps moving, so you cannot overfit to a fixed test set; you have to actually forecast. The first realtime track is the CSI-300 stock index (沪深300).
Static measures method. Realtime measures method on data nobody has seen yet. You want both.
The submission contract
This is the part that makes the leaderboard trustworthy. A submission is not a number — it is a bundle of three things:
- A trajectory (trajectory.jsonl) that captures the agent's experiment process at the CLI boundary. It records what was actually run to produce the result. It is agent-agnostic (Claude Code, Codex, or OpenCode all serialize to the same boundary), so the evidence does not depend on which tool you used.
- One verified result, a single schema-valid RunRecord. This is the number, plus everything needed to place and trust it: track, dataset, horizon, metric, seed, and the sha256 of the weights that produced it.
- A short report, human-readable, so a person can understand the submission without replaying the trajectory.
The shape of all three is defined by TSF-Core — a pydantic-only layer inside ModernTSF that exports a JSON Schema. This is the contract, and it is deliberately thin. The leaderboard consumer reads only that schema. It has zero Python coupling: no torch, no ModernTSF import, no model code. The producer (ModernTSF) and the consumer (the leaderboard) agree on a JSON Schema and nothing else. Either side can be rewritten without touching the other.
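To illustrate what "a JSON Schema and nothing else" can look like on the consumer side, here is a minimal sketch that checks a record using only the Python standard library. The field names (`track`, `dataset`, `horizon`, `metric`, `value`, `seed`, `weights_sha256`) are assumptions based on the list above, not the actual TSF-Core schema, and a real consumer would presumably validate against the exported JSON Schema rather than this hand-written table.

```python
import json
import re

# Hypothetical field names and types; the real RunRecord schema is defined by TSF-Core.
REQUIRED_FIELDS = {
    "track": str,
    "dataset": str,
    "horizon": int,
    "metric": str,
    "value": float,
    "seed": int,
    "weights_sha256": str,
}
SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")


def check_run_record(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the record passes this sketch."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}")
    digest = record.get("weights_sha256")
    if isinstance(digest, str) and not SHA256_HEX.match(digest):
        problems.append("weights_sha256 is not a 64-character hex digest")
    return problems


example = json.dumps({
    "track": "realtime", "dataset": "CSI-300", "horizon": 5,
    "metric": "mse", "value": 0.75, "seed": 2024,
    "weights_sha256": "0" * 64,  # placeholder digest, not a real checkpoint
})
print(check_run_record(example))           # well-formed example: no problems
print(check_run_record('{"track": "x"}'))  # lists the missing fields
```

The check imports no model code and no framework, which is the property the contract is meant to preserve.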
How the leaderboard is built
The build is a deterministic CI step — tsf leaderboard-build, no torch, no GPU. It does exactly four things:
- Read every submission in the Submissions dataset.
- Check each one: result present, trajectory present, schema-valid. Anything that fails is rejected with a reason — not silently dropped.
- Collate the survivors per (track, dataset, horizon).
- Rank by MSE, lower is better.
There is no judgment call in the build and no hidden state. Given the same submissions, anyone gets the same leaderboard. That is the point: the ranking is a pure function of public evidence.
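The four steps can be sketched as a pure function over plain dictionaries. This is an illustration of the idea, not the code behind tsf leaderboard-build: the dictionary keys, the rejection messages, and the tie-break by submission id are all assumptions made for the example.

```python
from collections import defaultdict


def build_leaderboard(submissions):
    """Pure function: same submissions in, same ranking and rejections out.

    Each submission is a plain dict in this sketch; the keys are hypothetical.
    """
    rejected = []
    groups = defaultdict(list)
    for sub in submissions:
        # Check: result present, trajectory present, schema-valid.
        if sub.get("result") is None:
            rejected.append((sub.get("id"), "result missing"))
        elif not sub.get("has_trajectory"):
            rejected.append((sub.get("id"), "trajectory missing"))
        elif not sub.get("schema_valid"):
            rejected.append((sub.get("id"), "schema validation failed"))
        else:
            r = sub["result"]
            # Collate per (track, dataset, horizon).
            key = (r["track"], r["dataset"], r["horizon"])
            groups[key].append((r["mse"], sub["id"]))

    # Rank by MSE, lower is better. Ties fall back to id (an assumption)
    # so the output order never depends on input order.
    board = {key: sorted(rows) for key, rows in sorted(groups.items())}
    return board, rejected


toy = [
    {"id": "a", "has_trajectory": True, "schema_valid": True,
     "result": {"track": "realtime", "dataset": "CSI-300", "horizon": 5, "mse": 0.80}},
    {"id": "b", "has_trajectory": True, "schema_valid": True,
     "result": {"track": "realtime", "dataset": "CSI-300", "horizon": 5, "mse": 0.75}},
    {"id": "c", "has_trajectory": False, "schema_valid": True,
     "result": {"track": "realtime", "dataset": "CSI-300", "horizon": 5, "mse": 0.70}},
]
board, rejected = build_leaderboard(toy)
print(board)
print(rejected)
```

Submission "c" has the lowest MSE but no trajectory, so it is rejected with a reason instead of being ranked or silently dropped.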
Reproducible by separated weights
Trained checkpoints do not live in the submission. They are archived separately, in the TSEval-Weights dataset, and each submission references its weights by sha256. So a number on the board is always traceable to the exact bytes that produced it — download those weights, re-run the recorded protocol, and you should land on the same value. Separating weights from evidence keeps the Submissions dataset small and append-only, while keeping every result fully reproducible.
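Checking that a downloaded checkpoint is the one a submission references comes down to hashing the file and comparing digests. A minimal sketch with Python's standard `hashlib` follows; the function names are illustrative, and the stand-in file takes the place of a real checkpoint from TSEval-Weights.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_match(path, recorded_sha256):
    """True if the bytes on disk hash to the digest the submission recorded."""
    return sha256_of_file(path) == recorded_sha256.lower()


# Demo with a stand-in file; a real check would point at a downloaded checkpoint.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in checkpoint bytes")
    demo_path = tmp.name

recorded = hashlib.sha256(b"stand-in checkpoint bytes").hexdigest()
ok = weights_match(demo_path, recorded)    # digest matches the file
mismatch = weights_match(demo_path, "0" * 64)  # digest does not match
os.remove(demo_path)

print(ok, mismatch)
```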
First results: CSI-300
To open the realtime track we ran the full model suite on CSI-300 (沪深300) index constituents: 135 models, 151 submissions in total (some models run on multiple seeds). The board spans two ways of seeing the same data — 124 submissions in time-series mode, where each stock is forecast largely on its own, and 27 in spatiotemporal / graph mode, where the ~300 stocks are nodes and the model learns the cross-sectional structure between them.
The protocol is fixed and simple: input seq_len = 20 trading days, predict pred_len = 5 trading days (horizon 5). Ranked by MSE, lower is better.
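As a rough picture of what seq_len = 20 and pred_len = 5 mean for a single series, the sketch below slices a toy sequence into (input, target) pairs. The stride of one day and the helper name `make_windows` are assumptions for illustration; the actual windowing and split are defined by the protocol, not by this snippet.

```python
SEQ_LEN = 20   # lookback, in trading days (from the protocol)
PRED_LEN = 5   # forecast horizon, in trading days


def make_windows(series, seq_len=SEQ_LEN, pred_len=PRED_LEN):
    """Slide over one series and return (input, target) pairs with stride 1."""
    pairs = []
    last_start = len(series) - seq_len - pred_len
    for start in range(last_start + 1):
        x = series[start:start + seq_len]
        y = series[start + seq_len:start + seq_len + pred_len]
        pairs.append((x, y))
    return pairs


toy_series = list(range(30))  # 30 fake trading days, values 0..29
windows = make_windows(toy_series)
first_x, first_y = windows[0]
print(len(windows))            # number of (input, target) pairs
print(first_x[0], first_x[-1])  # first and last day of the first input window
print(first_y)                 # the 5 days the model must predict
```

With 30 days of data this yields 6 windows: the first input covers days 0 to 19 and its target is days 20 to 24. A series shorter than 25 days yields no windows at all.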
Here is the top of the board:
| Rank | Model | MSE | Type |
|---|---|---|---|
| 1 | NBeats | 0.7483 | time-series |
| 2 | MTGNN | 0.7484 | graph |
| 3 | DFDGCN | 0.7485 | graph |
| 4 | STPGNN | 0.7487 | graph |
| 5 | HimNet | 0.7488 | graph |
| 6 | GWNet | 0.7489 | graph |
| 7 | STNorm | 0.7490 | graph |
| 8 | STGCN | 0.7497 | graph |
The honest finding is that nothing separates at the top. The leading models are packed into a razor-thin band: #1 NBeats at 0.7483 and #2 MTGNN at 0.7484 are 0.0001 apart at the displayed precision, and the whole top eight spans just 0.0014. Whatever ranks first this round is, for practical purposes, tied with the next dozen.
Two patterns survive that dead heat. First, models that exploit the cross-sectional graph structure between stocks — MTGNN, DFDGCN, GWNet, STPGNN, and the rest — fill most of the front: 15 of the top 20 are graph / spatiotemporal. Modeling the relationships between stocks tends to land you near the top. But — second — the single best result is a pure per-series model, NBeats, so it would be wrong to say the cross-section decisively wins: graph models cluster at the front, they do not run away with it.
And no model captures much signal. Correlation with the truth hovers near 0.04 across the leaders, which is essentially noise. What the learned models do clear is a real bar: a naive last-value baseline (HL) sits near the bottom of the board at MSE ≈ 1.50, while the leaders land at ~0.748, roughly half the naive error. So architecture buys a large jump over doing nothing; it just does not let any one model pull ahead of the others.
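For readers who want the two yardsticks spelled out, here is a minimal sketch of a last-value baseline, MSE, and Pearson correlation on toy numbers. The values are invented and are not CSI-300 data. How the board itself computes correlation is not specified here, and returning 0.0 for a constant forecast (where correlation is undefined) is a convention chosen for this example.

```python
import math


def naive_last_value(history, pred_len):
    """Repeat the last observed value for every step of the horizon."""
    return [history[-1]] * pred_len


def mse(pred, truth):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def pearson(pred, truth):
    """Pearson correlation; returns 0.0 by convention when either side is constant."""
    if len(set(pred)) == 1 or len(set(truth)) == 1:
        return 0.0
    mp, mt = sum(pred) / len(pred), sum(truth) / len(truth)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)


history = [1.0, 0.5, -0.2, 0.8]       # toy values, not CSI-300 data
truth = [0.1, -0.3, 0.4, 0.0, -0.2]   # toy future values
baseline = naive_last_value(history, pred_len=5)

print(mse(baseline, truth))
print(pearson(baseline, truth))
```

A flat forecast like this one carries no information about the direction of the next five days, which is why it works as a floor: a learned model has to beat its MSE before its ranking means anything.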
Keep the nuance. Below the top, models cluster tightly and the long tail is wide: best 0.7483, median 0.7856, worst 1.7141. Absolute predictability is low — near-zero correlation for most models — so CSI-300 forecasting is still genuinely hard. The takeaway is not that deep models magically solve stocks, nor that one architecture wins. It is narrower and more useful: on this data, learned models beat the naive floor by a wide margin, graph models cluster at the front, and no single model meaningfully separates from the pack.
A few caveats, stated plainly. This round is mostly single-seed (seed 2024), it is one horizon, and it is the first round. More datasets, more horizons, and the realtime refresh are all coming. This is a launch snapshot, not a final verdict. Read it as the first data point on an open board, not the last word on Chinese equities.
Where everything lives
Five public repositories under the Diaugeia org, plus the live board:
- Static datasets — TSEval-Static
- Realtime datasets (CSI-300) — TSEval-RealTime
- Submissions (append-only evidence bundles) — TSEval-Submissions
- Weights (trained checkpoints) — TSEval-Weights
- Leaderboard Space (frontend) — Diaugeia/TSEval
The live leaderboard on the site: diaugeia.ai/tseval. The producer framework: github.com/Diaugeia/ModernTSF.
How to submit
You do the experiment in ModernTSF; the CLI produces the evidence; one command opens the PR.
    git clone https://github.com/Diaugeia/ModernTSF.git
    cd ModernTSF
    # run your experiment via the tsf CLI, capturing the trajectory
    tsf trace -- tsf run <your-experiment>
    # open a community PR on the Submissions dataset
    tsf submit --push

tsf submit --push packages your RunRecord, your trajectory.jsonl, and your report, archives the weights by sha256, and opens a pull request on TSEval-Submissions. The deterministic build picks it up, validates it against the TSF-Core schema, and, if it passes, your row appears on the board with its full evidence attached.
What's next
More static tracks, more horizons, and the first scheduled CSI-300 refresh that makes the realtime track live in the real sense of the word. The contract stays thin and the board stays open.
If you work on time series — or you just want to see whether your model survives contact with a fixed protocol — clone ModernTSF, run something, and submit it. Or join us in building the board itself.
