TS-Eval is live
TS-Eval, our open and reproducible leaderboard for time-series forecasting, is now live — and the first round runs 135 models against the CSI-300 stock index, where the leaders are a dead heat of graph and per-series models with no clear winner.
Today we are launching TS-Eval — an open, reproducible leaderboard for time-series forecasting. Every entry is a community submission: one agent trajectory plus one verified result, ranked transparently across tracks, datasets, and horizons. It sits on top of ModernTSF, the producer framework we released a few days ago.
The first result
The opening round puts 135 models head-to-head on the Stock CSI-300 (沪深300) realtime track — 151 submissions, predicting 5 trading days from the prior 20, ranked by MSE (lower is better). The field spans two modes run on the same data: 124 submissions in pure time-series mode (each series forecast largely on its own) and 27 in graph/spatiotemporal mode (the ~300 stocks treated as nodes, with the cross-sectional structure between them modeled directly).
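To make the task concrete, here is a minimal sketch of the setup described above: slice a series into 20-day histories and 5-day targets, forecast with a naive last-value rule, and score by MSE. The function names and the synthetic series are illustrative assumptions; this is not the ModernTSF data loader or the official TS-Eval scoring code, and any preprocessing the leaderboard applies is not modeled here.

```python
import random

LOOKBACK = 20  # input window in trading days (from the post)
HORIZON = 5    # forecast window in trading days (from the post)


def make_windows(series, lookback=LOOKBACK, horizon=HORIZON):
    """Slice one series into (history, future) pairs with a stride of 1."""
    pairs = []
    for start in range(len(series) - lookback - horizon + 1):
        history = series[start:start + lookback]
        future = series[start + lookback:start + lookback + horizon]
        pairs.append((history, future))
    return pairs


def last_value_forecast(history, horizon=HORIZON):
    """Naive baseline: repeat the last observed value over the horizon."""
    return [history[-1]] * horizon


def mse(predictions, targets):
    """Mean squared error over every forecast step of every window."""
    squared = [
        (p - t) ** 2
        for pred, tgt in zip(predictions, targets)
        for p, t in zip(pred, tgt)
    ]
    return sum(squared) / len(squared)


# Synthetic stand-in for one stock's series (an assumption, not CSI-300 data).
rng = random.Random(0)
series = [rng.gauss(0.0, 1.0) for _ in range(200)]

pairs = make_windows(series)
predictions = [last_value_forecast(history) for history, _ in pairs]
targets = [future for _, future in pairs]
naive_mse = mse(predictions, targets)

print(f"{len(pairs)} windows, naive last-value MSE = {naive_mse:.4f}")
```

A learned model would replace `last_value_forecast` with its own predictions and be scored by the same `mse` call, which is what makes lower-is-better comparisons across entries possible.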
| Rank | Model | MSE | Type |
|---|---|---|---|
| 1 | NBeats | 0.7483 | time-series |
| 2 | MTGNN | 0.7484 | graph |
| 3 | DFDGCN | 0.7485 | graph |
| 4 | STPGNN | 0.7487 | graph |
| 5 | HimNet | 0.7488 | graph |
| 6 | GWNet | 0.7489 | graph |
| 7 | STNorm | 0.7490 | graph |
| 8 | STGCN | 0.7497 | graph |
The honest finding is that nothing separates at the top: #1 NBeats (0.7483) and #2 MTGNN (0.7484) are 0.0001 apart, and the rest of the top eight sit within 0.0014 of the lead, which amounts to a dead heat. Two patterns survive it. First, graph/spatiotemporal models that exploit the cross-sectional structure between stocks fill most of the front (15 of the top 20), yet the single best result is a pure per-series model (NBeats), so the cross-section clusters at the top without running away with it. Second, no model captures much signal: correlation with the truth sits near 0.04 across the leaders. What the field does clear is a real bar: a naive last-value baseline (HL) lands near the bottom at ≈1.50, while the leaders sit at ~0.748, roughly half the baseline's error. Across the full board the spread runs from a best of 0.7483 to a median of 0.7856 and a worst of 1.7141.
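The size of these gaps can be checked directly from the top-eight table. The sketch below hard-codes the table's values and uses 1.50 for the naive baseline, which the post gives only approximately, so the final ratio should be read as a rough figure rather than an exact result.

```python
# Top-eight rows copied from the table above: (model, MSE, type).
top8 = [
    ("NBeats", 0.7483, "time-series"),
    ("MTGNN", 0.7484, "graph"),
    ("DFDGCN", 0.7485, "graph"),
    ("STPGNN", 0.7487, "graph"),
    ("HimNet", 0.7488, "graph"),
    ("GWNet", 0.7489, "graph"),
    ("STNorm", 0.7490, "graph"),
    ("STGCN", 0.7497, "graph"),
]

best_mse = top8[0][1]

# Gap of each model to the leader, rounded to the table's 4 decimals.
gaps = {name: round(score - best_mse, 4) for name, score, _ in top8}

# How many of the top eight are graph/spatiotemporal models.
graph_count = sum(1 for _, _, kind in top8 if kind == "graph")

# Approximate naive last-value baseline (the post quotes it as about 1.50).
NAIVE_MSE_APPROX = 1.50
reduction = 1 - best_mse / NAIVE_MSE_APPROX

print(f"gap #1 to #2: {gaps['MTGNN']:.4f}")
print(f"gap #1 to #8: {gaps['STGCN']:.4f}")
print(f"graph models in top 8: {graph_count}")
print(f"error reduction vs naive baseline: about {reduction:.0%}")
```

The leader's margin over the runner-up is four orders of magnitude smaller than its margin over the naive baseline, which is why the top of the board reads as a tie while the field as a whole still clears the floor.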
Below the top, models cluster tightly and the long tail is wide; absolute predictability stays low (near-zero correlation for most), so CSI-300 forecasting is still genuinely hard. The takeaway is not that deep models magically solve stocks, nor that one architecture wins — it is that on this data, learned models beat the naive floor by a wide margin, graph models cluster at the front, and no single model meaningfully separates from the pack.
This is a launch snapshot, not a final verdict — mostly single-seed (seed 2024), one horizon, first round. More datasets, more horizons, and the realtime refresh are coming.
Where it lives
- Live leaderboard: diaugeia.ai/tseval
- Static datasets: Diaugeia/TSEval-Static
- RealTime datasets: Diaugeia/TSEval-RealTime
- Submissions: Diaugeia/TSEval-Submissions
- Weights: Diaugeia/TSEval-Weights
- Leaderboard Space: Diaugeia/TSEval
Submit
Clone ModernTSF, run your experiment via the tsf CLI, capture the run with tsf trace, then run tsf submit --push to open a community PR that puts your model on the board.
For the full story, the protocol, and the complete results table, read the research post.
