Diaugeia.AI team

TS-Eval is live

TS-Eval, our open and reproducible leaderboard for time-series forecasting, is now live. The first round runs 135 models on the CSI-300 stock index, where graph and per-series models finish in a dead heat at the top with no clear winner.

Announcement

Today we are launching TS-Eval — an open, reproducible leaderboard for time-series forecasting. Every entry is a community submission: one agent trajectory plus one verified result, ranked transparently across tracks, datasets, and horizons. It sits on top of ModernTSF, the producer framework we released a few days ago.

The first result

The opening round puts 135 models head-to-head on the Stock CSI-300 (沪深300) realtime track. It covers 151 submissions, each predicting the next 5 trading days from the prior 20, ranked by MSE (lower is better). The field spans two modes run on the same data: 124 submissions in pure time-series mode (each series forecast largely on its own) and 27 in graph/spatiotemporal mode (the ~300 stocks treated as nodes, with the cross-sectional structure between them modeled directly).
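The task setup above (20 trading days in, 5 out, scored by MSE) can be sketched in a few lines. This is a minimal illustration in plain Python, not the actual TS-Eval or ModernTSF pipeline; the function names `make_windows` and `mse` and the toy series are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

LOOKBACK = 20  # input length from the post: prior 20 trading days
HORIZON = 5    # forecast length from the post: next 5 trading days


def make_windows(series: Sequence[float],
                 lookback: int = LOOKBACK,
                 horizon: int = HORIZON) -> List[Tuple[List[float], List[float]]]:
    """Slide over one series and return (input, target) pairs."""
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        split = start + lookback
        pairs.append((list(series[start:split]),
                      list(series[split:split + horizon])))
    return pairs


def mse(predictions: Sequence[Sequence[float]],
        targets: Sequence[Sequence[float]]) -> float:
    """Mean squared error over every horizon step of every window."""
    squared = [(p - t) ** 2
               for pred, target in zip(predictions, targets)
               for p, t in zip(pred, target)]
    return sum(squared) / len(squared)


# Toy series of 30 values (hypothetical data, for illustration only).
toy = [float(i) for i in range(30)]
windows = make_windows(toy)
```

Each window pairs 20 consecutive values with the 5 that follow. A model's forecasts for those 5 steps are then compared with the targets through `mse`, where lower is better.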

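| Rank | Model  | MSE    | Type        |
|------|--------|--------|-------------|
| 1    | NBeats | 0.7483 | time-series |
| 2    | MTGNN  | 0.7484 | graph       |
| 3    | DFDGCN | 0.7485 | graph       |
| 4    | STPGNN | 0.7487 | graph       |
| 5    | HimNet | 0.7488 | graph       |
| 6    | GWNet  | 0.7489 | graph       |
| 7    | STNorm | 0.7490 | graph       |
| 8    | STGCN  | 0.7497 | graph       |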
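To make the size of the gaps concrete, the short script below recomputes each model's distance from the leader using only the MSE values listed in the table. It is an illustrative calculation; the variable names are arbitrary.

```python
# MSE values copied from the leaderboard table above.
top8 = [
    ("NBeats", 0.7483, "time-series"),
    ("MTGNN", 0.7484, "graph"),
    ("DFDGCN", 0.7485, "graph"),
    ("STPGNN", 0.7487, "graph"),
    ("HimNet", 0.7488, "graph"),
    ("GWNet", 0.7489, "graph"),
    ("STNorm", 0.7490, "graph"),
    ("STGCN", 0.7497, "graph"),
]

best = top8[0][1]

# Print each model's gap to the leader at the table's four-decimal precision.
for rank, (model, score, kind) in enumerate(top8, start=1):
    gap = score - best
    print(f"{rank} {model:<7} {score:.4f}  +{gap:.4f}  ({kind})")

# Spread between first and eighth place, and the number of graph models.
spread = round(top8[-1][1] - best, 4)
graph_count = sum(1 for _, _, kind in top8 if kind == "graph")
```

At the reported precision the whole top eight spans 0.0014 in MSE, and seven of the eight entries are graph models.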

The honest finding is that nothing separates at the top: #1 NBeats (0.7483) and #2 MTGNN (0.7484) differ by 0.0001 at the reported precision, and the rest of the top eight trail the leader by at most 0.0014, a dead heat. Two patterns survive it. First, graph/spatiotemporal models that exploit the cross-sectional structure between stocks fill most of the front, 15 of the top 20, yet the single best result is a pure per-series model (NBeats), so the cross-section clusters at the top without running away with it. Second, no model captures much signal: correlation with the truth sits near 0.04 across the leaders. What the field does clear is a real bar: a naive last-value baseline (HL) lands near the bottom at ≈1.50, while the leaders sit at ~0.748, roughly half the baseline's error. Across the full board, MSE runs from a best of 0.7483 through a median of 0.7856 to a worst of 1.7141.
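For readers unfamiliar with the two reference points used here, the sketch below shows one way a last-value baseline and a correlation score could be computed. It rests on assumptions: the post does not define HL beyond "naive last-value", so the code simply repeats the last observed value across the horizon, and it uses Pearson correlation although the post does not say which correlation is reported. The function names and sample numbers are hypothetical.

```python
import math
from typing import List, Sequence


def last_value_forecast(history: Sequence[float], horizon: int = 5) -> List[float]:
    """Repeat the last observed value for every step of the horizon."""
    return [history[-1]] * horizon


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs and targets, for illustration only.
history = [0.1, -0.3, 0.2, 0.4]
truth = [0.5, 0.1, -0.2, 0.3, 0.0]

naive = last_value_forecast(history)
naive_mse = sum((p - t) ** 2 for p, t in zip(naive, truth)) / len(truth)
naive_corr = pearson(naive, truth)
```

Within a single window a constant forecast has no variance, so `pearson` returns 0.0 for it. A meaningful correlation figure would be computed over predictions pooled across many windows and stocks.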

Below the top, models cluster tightly and the long tail is wide; absolute predictability stays low (near-zero correlation for most), so CSI-300 forecasting is still genuinely hard. The takeaway is not that deep models magically solve stocks, nor that one architecture wins — it is that on this data, learned models beat the naive floor by a wide margin, graph models cluster at the front, and no single model meaningfully separates from the pack.

This is a launch snapshot, not a final verdict — mostly single-seed (seed 2024), one horizon, first round. More datasets, more horizons, and the realtime refresh are coming.

Where it lives

Submit

Clone ModernTSF, run your experiment via the tsf CLI, capture the run with tsf trace, then run tsf submit --push to open a community PR that puts your model on the board.

For the full story, the protocol, and the complete results table, read the research post.