- AMD's Bulldozer is an MCMT (MultiCluster MultiThreaded) microarchitecture. That's my baby!
- The only bad thing is that some guys I know at AMD say that Bulldozer is not really all that great a product, but is shipping just because AMD needs a model refresh. "Sometimes you just gotta ship what you got."
- I came up with MCMT in 1996-2000 while at the University of Wisconsin. It became public via presentations. I brought MCMT back to Intel in 2000, and to AMD in 2002. I was beginning to despair of MCMT ever seeing the light of day. I thought that when I left AMD in 2004, the MCMT ideas may have left with me.
- Of course, AMD has undoubtedly changed and evolved MCMT in many ways since I first proposed it to them. For example, I called the set of an integer scheduler, integer execution units, and an L1 data cache a "cluster", and the whole thing, consisting of shared front end, shared FP, and 2 or more clusters, a processor core. Apparently AMD is calling my clusters their cores, and my core their cluster. It has been suggested that this change of terminology is motivated by marketing, so that they can say they have twice as many cores.
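The composition described above can be sketched as a toy containment hierarchy. This is purely illustrative of the terminology — the names and fields are my own, not anything from AMD's design:

```python
# Toy sketch of the MCMT terminology in the text (illustrative only).
# In Glew's naming: a "cluster" = int scheduler + int units + L1 data
# cache; a "core" = shared front end + shared FP + 2 or more clusters.
# AMD's marketing swaps the labels, selling each cluster as a "core".
from dataclasses import dataclass, field

@dataclass
class Cluster:                  # what AMD marketing calls a "core"
    int_scheduler: str = "integer scheduler"
    int_units: str = "integer execution units"
    l1d: str = "L1 data cache"

@dataclass
class Core:                     # what AMD marketing calls the module
    front_end: str = "shared fetch/decode"
    fp: str = "shared FP unit"
    clusters: list = field(default_factory=lambda: [Cluster(), Cluster()])

module = Core()
print(len(module.clusters))     # 2 marketable "cores" per module
```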
- My original motivation for MCMT was to work around some of the limitations of Hyperthreading on Willamette. E.g. Willamette had a very small L0 data cache, 4K in some of the internal proposals, although it shipped at 8K. Two threads sharing such a tiny L0 data cache thrash. Indeed, this is one of the reasons why hyperthreading is disabled on many systems, including many current Nhm based machines with much larger closest-in caches.
- To avoid threads thrashing each other, I wanted to give each thread their own L0. But, you can't do so, and still keep sharing the execution units and scheduler - you can't just build a 2X larger array, or put two arrays side by side, and expect to have the same latency. Wires. Therefore, I had to replicate the execution units, and enough of the scheduler so that the "critical loop" of Scheduler->Execution->Data Cache was all isolated from the other thread/cluster. Hence, the form of multi-cluster multi-threading you see in Bulldozer.
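The thrashing argument above can be seen in a toy model. This is not a model of Willamette's actual L0 — the cache geometry and access patterns are made up for illustration — but it shows why two threads with conflicting working sets destroy each other in a shared direct-mapped cache, while private per-cluster caches leave only compulsory misses:

```python
# Toy direct-mapped cache model (illustrative only, not Willamette's L0).
def misses(accesses, num_lines):
    """Count misses for a stream of line addresses in a direct-mapped
    cache with num_lines lines (tag check only, no data)."""
    tags = [None] * num_lines
    miss = 0
    for addr in accesses:
        idx = addr % num_lines
        tag = addr // num_lines
        if tags[idx] != tag:
            miss += 1
            tags[idx] = tag
    return miss

LINES = 64                     # a tiny cache, e.g. 4 KB of 64 B lines
# Two threads, each looping over a 64-line working set that happens to
# map onto the same cache indices (different tags).
a = [0x0000 + (i % 64) for i in range(10_000)]
b = [0x8000 + (i % 64) for i in range(10_000)]

# Private L0 per cluster: each working set fits; only compulsory misses.
private = misses(a, LINES) + misses(b, LINES)

# Shared L0 with the two threads' accesses interleaved: every access
# evicts the other thread's line, so every access misses.
interleaved = [x for pair in zip(a, b) for x in pair]
shared = misses(interleaved, LINES)

print(private, shared)         # prints: 128 20000
```

With private caches the two threads take 64 compulsory misses each; sharing the same tiny cache turns all 20,000 accesses into misses — the thrashing that motivated giving each cluster its own L0.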
- True, there are differences, and I am sure more will become evident as more Bulldozer information becomes public. For example, although I came up with MCMT to make Willamette-style threading faster, I have always wanted to put SpMT, Speculative Multithreading, on such a substrate. SpMT has potential to speed up a single thread of execution, by splitting it up into separate threads and running the separate threads on different clusters, whereas Willamette-style hyperthreading, and Bulldozer-style MCMT (apparently), only speed up workloads that have existing independent threads.
- If I received arrows in my back for MCMT, I received 10 times as many arrows for SpMT. And yet still I have hope for it. Unfortunately, I am not currently working on SpMT. Haitham Akkary, the father of DMT, continues the work.
- Perhaps I should say here that my MCMT had a significant difference from clustering in, say, the Alpha 21264, http://www.hotchips.org/archives/hc10/2_Mon/HC10.S1/HC10.1.1.pdf [snip] Anyway: if it has an L0 or L1 data cache in the cluster, with or without the scheduler, it's my MCMT. If no cache in the cluster, not mine (although I have enumerated many such possibilities).
- Motivated by my work to use MCMT to speed up single threads, I often propose a shared L2 instruction scheduler, to load balance between the clusters dynamically. Although I admit that I only really figured out how to do that properly after I left AMD, and before I joined Intel. How to do this is part of the Multi-star microarchitecture, M*, that is my next step beyond MCMT.
- Also, although it is natural to have a single (explicit) thread per cluster in MCMT, I have also proposed allowing two threads per cluster. Mainly motivated by SpMT: I could fork to a "runt thread" running in the same cluster, and then migrate the runt thread to a different cluster. Intra-cluster forking is faster than inter-cluster forking, and does not disturb the parent thread. But, if you are not doing SpMT, there is much less motivation for multiple threads per cluster.
- With Willamette as background, I leaned towards a relatively small, L0, cache in the cluster. Also, such a small L0 can often be pitch-matched with the cluster execution unit datapath. A big L1, such as Bulldozer seems to have, nearly always has to lie out of the datapath, and requires wire turns. Wire turns waste area. I have, from time to time, proposed putting the alignment muxes and barrel shifters in the wire turn area. I'm surprised that a large cluster L1 makes sense, but that's the sort of thing that you can only really tell from layout.
- Some posters have been surprised by sharing the FP. Of course, AMD's K7 design, with separate clusters for integer and FP, was already half-way there. They only had to double the integer cluster. It would have been harder for Intel to go MCMT, since the P6 family had shared integer and FP. Willamette might have been easier to go MCMT, since it had separate FP.
- Anyway... of course, for FP threads you might like to have thread-private FP. But, in some ways, it is the advent of expensive FP, like Bulldozer's 2 sets of 128 bit, 4x32 bit, FMAs, that justify integer MCMT: the FP is so big that the overhead of replicating the integer cluster, including the OOO logic, is a drop in the bucket.
- You'd like to have per-cluster-thread FP, but such big FP workloads are often so memory intensive that they thrash the shared-between-clusters L2 cache: threading may be disabled anyways. As it is, you get good integer threads via MCMT, and you get 1 integer thread and 1 FP thread. Two FP threads may have some slowdown, although, again, if memory intensive they may be blocking on memory, and hence allowing the other FP thread to use the FP. But two purely computational FP threads will almost undoubtedly block, unless the schedulers are piss-poor and can't use all of the FP for a single thread (e.g. by being too small).
- I don't expect to get any credit for MCMT. In fact, I'm sure I'm going to get shit for this post. I don't care. I know. The people who were there, who saw my presentations and read my proposals, know. But, e.g. Chuck Moore wasn't there at start; he came in later. Even Mike Haertel, my usual collaborator, wasn't there; he was hired in later, although before Chuck. Besides, Mike Haertel thinks that MCMT is obvious. That's cool, although I ask: if MCMT is obvious, then why isn't Intel doing it? Companies like Intel and AMD need idea generating people like me about once every 10 years. In between, they don't need new ideas. They need new incremental improvements of existing ideas.
Anyway... It's cool to see MCMT becoming real. It gives me hope that my follow-on to MCMT, M*, may still, eventually, also become real.
- There were several K10s. While I wanted to work on low power when I went to AMD, I was hired to consult on low power and do high end CPU, since the low power project was already rolling and did not need a new chef. The first K10 that I knew at AMD was a low power part. When that was cancelled I was sent off on my lonesome, then with Mike Haertel, to work on a flagship, out-of-order, aggressive processor, while the original low power team did something else. When that other low-power project was cancelled, that team came over to the nascent K10 that I was working on. My K10 was MCMT, plus a few other things. I had actually had to promise Fred Weber that I would NOT do anything advanced for this K10 - no SpMT, just MCMT. But when the other guys came on board, I thought this meant that I could leave the easy stuff for them, while I tried to figure out how to do SpMT and/or any other way of using MCMT to speed up single threads.
- Some of us have done a lot of work on dynamic predication. (My resume includes an OOO Itanium, plus I have been working on VLIW and predication longer than OOO.) But since such work inside companies will never see the light of day, do not let that hold you back, since you are not so constrained by NDAs and trade secrets.