- AMD's Bulldozer is an MCMT (MultiCluster MultiThreaded) microarchitecture. That's my baby!
- The only bad thing is that some guys I know at AMD say that Bulldozer is not really all that great a product, but is shipping just because AMD needs a model refresh. "Sometimes you just gotta ship what you got."
- I came up with MCMT in 1996-2000 while at the University of Wisconsin. It became public via presentations. I brought MCMT back to Intel in 2000, and to AMD in 2002. I was beginning to despair of MCMT ever seeing the light of day. I thought that when I left AMD in 2004, the MCMT ideas may have left with me.
- Of course, AMD has undoubtedly changed and evolved MCMT in many ways since I first proposed it to them. For example, I called the set of an integer scheduler, integer execution units, and an L1 data cache a "cluster", and the whole thing, consisting of shared front end, shared FP, and 2 or more clusters, a processor core. Apparently AMD is calling my clusters their cores, and my core their cluster. It has been suggested that this change of terminology is motivated by marketing, so that they can say they have twice as many cores.
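The composition described above can be sketched as a toy containment hierarchy. This is purely illustrative of the terminology — the names and fields are my own, not anything from AMD's design:

```python
# Toy sketch of the MCMT terminology in the text (illustrative only).
# In Glew's naming: a "cluster" = int scheduler + int units + L1 data
# cache; a "core" = shared front end + shared FP + 2 or more clusters.
# AMD's marketing swaps the labels, selling each cluster as a "core".
from dataclasses import dataclass, field

@dataclass
class Cluster:                  # what AMD marketing calls a "core"
    int_scheduler: str = "integer scheduler"
    int_units: str = "integer execution units"
    l1d: str = "L1 data cache"

@dataclass
class Core:                     # what AMD marketing calls the module
    front_end: str = "shared fetch/decode"
    fp: str = "shared FP unit"
    clusters: list = field(default_factory=lambda: [Cluster(), Cluster()])

module = Core()
print(len(module.clusters))     # 2 marketable "cores" per module
```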
- My original motivation for MCMT was to work around some of the limitations of Hyperthreading on Willamette. E.g. Willamette had a very small L0 data cache, 4K in some of the internal proposals, although it shipped at 8K. Two threads sharing such a tiny L0 data cache thrash. Indeed, this is one of the reasons why hyperthreading is disabled on many systems, including many current Nhm based machines with much larger closest-in caches.
- To avoid threads thrashing each other, I wanted to give each thread their own L0. But, you can't do so, and still keep sharing the execution units and scheduler - you can't just build a 2X larger array, or put two arrays side by side, and expect to have the same latency. Wires. Therefore, I had to replicate the execution units, and enough of the scheduler so that the "critical loop" of Scheduler->Execution->Data Cache was all isolated from the other thread/cluster. Hence, the form of multi-cluster multi-threading you see in Bulldozer.
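The thrashing argument above can be seen in a toy model. This is not a model of Willamette's actual L0 — the cache geometry and access patterns are made up for illustration — but it shows why two threads with conflicting working sets destroy each other in a shared direct-mapped cache, while private per-cluster caches leave only compulsory misses:

```python
# Toy direct-mapped cache model (illustrative only, not Willamette's L0).
def misses(accesses, num_lines):
    """Count misses for a stream of line addresses in a direct-mapped
    cache with num_lines lines (tag check only, no data)."""
    tags = [None] * num_lines
    miss = 0
    for addr in accesses:
        idx = addr % num_lines
        tag = addr // num_lines
        if tags[idx] != tag:
            miss += 1
            tags[idx] = tag
    return miss

LINES = 64                     # a tiny cache, e.g. 4 KB of 64 B lines
# Two threads, each looping over a 64-line working set that happens to
# map onto the same cache indices (different tags).
a = [0x0000 + (i % 64) for i in range(10_000)]
b = [0x8000 + (i % 64) for i in range(10_000)]

# Private L0 per cluster: each working set fits; only compulsory misses.
private = misses(a, LINES) + misses(b, LINES)

# Shared L0 with the two threads' accesses interleaved: every access
# evicts the other thread's line, so every access misses.
interleaved = [x for pair in zip(a, b) for x in pair]
shared = misses(interleaved, LINES)

print(private, shared)         # prints: 128 20000
```

With private caches the two threads take 64 compulsory misses each; sharing the same tiny cache turns all 20,000 accesses into misses — the thrashing that motivated giving each cluster its own L0.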
- True, there are differences, and I am sure more will become evident as more Bulldozer information becomes public. For example, although I came up with MCMT to make Willamette-style threading faster, I have always wanted to put SpMT, Speculative Multithreading, on such a substrate. SpMT has potential to speed up a single thread of execution, by splitting it up into separate threads and running the separate threads on different clusters, whereas Willamette-style hyperthreading, and Bulldozer-style MCMT (apparently), only speed up workloads that have existing independent threads.
- If I received arrows in my back for MCMT, I received 10 times as many arrows for SpMT. And yet still I have hope for it. Unfortunately, I am not currently working on SpMT. Haitham Akkary, the father of DMT, continues the work.
- Perhaps I should say here that my MCMT had a significant difference from clustering in, say, the Alpha 21264, http://www.hotchips.org/archives/hc10/2_Mon/HC10.S1/HC10.1.1.pdf [snip] Anyway: if it has an L0 or L1 data cache in the cluster, with or without the scheduler, it's my MCMT. If no cache in the cluster, not mine (although I have enumerated many such possibilities).
- Motivated by my work to use MCMT to speed up single threads, I often propose a shared L2 instruction scheduler, to load balance between the clusters dynamically. Although I admit that I only really figured out how to do that properly after I left AMD, and before I joined Intel. How to do this is part of the Multi-star microarchitecture, M*, that is my next step beyond MCMT.
- Also, although it is natural to have a single (explicit) thread per cluster in MCMT, I have also proposed allowing two threads per cluster. Mainly motivated by SpMT: I could fork to a "runt thread" running in the same cluster, and then migrate the runt thread to a different cluster. Intra-cluster forking is faster than inter-cluster forking, and does not disturb the parent thread. But, if you are not doing SpMT, there is much less motivation for multiple threads per cluster.
- With Willamette as background, I leaned towards a relatively small, L0, cache in the cluster. Also, such a small L0 can often be pitch-matched with the cluster execution unit datapath. A big L1, such as Bulldozer seems to have, nearly always has to lie out of the datapath, and requires wire turns. Wire turns waste area. I have, from time to time, proposed putting the alignment muxes and barrel shifters in the wire turn area. I'm surprised that a large cluster L1 makes sense, but that's the sort of thing that you can only really tell from layout.
- Some posters have been surprised by sharing the FP. Of course, AMD's K7 design, with separate clusters for integer and FP, was already half-way there. They only had to double the integer cluster. It would have been harder for Intel to go MCMT, since the P6 family had shared integer and FP. Willamette might have been easier to go MCMT, since it had separate FP.
- Anyway... of course, for FP threads you might like to have thread-private FP. But, in some ways, it is the advent of expensive FP, like Bulldozer's 2 sets of 128 bit, 4x32 bit, FMAs, that justify integer MCMT: the FP is so big that the overhead of replicating the integer cluster, including the OOO logic, is a drop in the bucket.
- You'd like to have per-cluster-thread FP, but such big FP workloads are often so memory intensive that they thrash the shared-between-clusters L2 cache: threading may be disabled anyways. As it is, you get good integer threads via MCMT, and you get 1 integer thread and 1 FP thread. Two FP threads may have some slowdown, although, again, if memory intensive they may be blocking on memory, and hence allowing the other FP thread to use the FP. But two purely computational FP threads will almost undoubtedly block, unless the schedulers are piss-poor and can't use all of the FP for a single thread (e.g. by being too small).
- I don't expect to get any credit for MCMT. In fact, I'm sure I'm going to get shit for this post. I don't care. I know. The people who were there, who saw my presentations and read my proposals, know. But, e.g. Chuck Moore wasn't there at start; he came in later. Even Mike Haertel, my usual collaborator, wasn't there; he was hired in later, although before Chuck. Besides, Mike Haertel thinks that MCMT is obvious. That's cool, although I ask: if MCMT is obvious, then why isn't Intel doing it? Companies like Intel and AMD need idea generating people like me about once every 10 years. In between, they don't need new ideas. They need new incremental improvements of existing ideas.
Anyway... It's cool to see MCMT becoming real. It gives me hope that my follow-on to MCMT, M*, may still, eventually, also become real.
- There were several K10s. While I wanted to work on low power when I went to AMD, I was hired to consult on low power and do high end CPU, since the low power project was already rolling and did not need a new chef. The first K10 that I knew at AMD was a low power part. When that was cancelled I was sent off on my lonesome, then with Mike Haertel, to work on a flagship, out-of-order, aggressive processor, while the original low power team did something else. When that other low-power project was cancelled, that team came over to the nascent K10 that I was working on. My K10 was MCMT, plus a few other things. I had actually had to promise Fred Weber that I would NOT do anything advanced for this K10 - no SpMT, just MCMT. But when the other guys came on board, I thought this meant that I could leave the easy stuff for them, while I tried to figure out how to do SpMT and/or any other way of using MCMT to speed up single threads.
- Some of us have done a lot of work on dynamic predication. (My resume includes an OOO Itanium, plus I have been working on VLIW and predication longer than OOO.) But since such work inside companies will never see the light of day, do not let that hold you back, since you are not so constrained by NDAs and trade secrets.