

Yeah, the cache hierarchy is behaving kinda wonky lately. Many AI workloads (and that’s what’s driving development lately) are constrained by bandwidth, and cache will only help you with a part of that. Cache will help with repeated access, not as much with streaming access to datasets much larger than the cache (i.e. many current AI models).
Intel already tried selling CPUs with both on-package HBM and slotted DDR-RAM. No one wanted it, as the performance gains of the expensive HBM evaporated completely as soon as you touched memory out-of-package. (Assuming workloads bound by memory bandwidth, which currently dominate the compute market)
To get good performance out of that, you may need to explicitly code the memory transfers to enable prefetch (preferably asynchronous) from the slower memory into the faster, á la classic GPU programming. YMMW.
Wrote a longer reply to someone else, but briefly, yes, you are correct. Kinda.
Caches won’t help with bandwidth-bound compute (read: ”AI”) it the streamed dataset is significantly larger than the cache. A cache will only speed up repeated access to a limited set of data.