T. P. Morgan of The Register has written a fairly detailed article on the POWER7 node for Blue Waters that was on display at SC09.
http://www.theregister.co.uk/2009/11/27/ibm_power7_hpc_server/
The photos of the 4-chip (= 32-core) MCM are interesting too, but there is an amusing remark about the motherboards...
--------------------
There are two monster motherboards underpinning the processors and their memory and the hub/switch and its interconnects. These mobos are manufactured by Japanese server maker Hitachi and Brenner said that these were the largest motherboards ever made.
--------------------
So Hitachi, which bolted from the Japanese Next-Generation Supercomputer (京速) project, was handling manufacture of the motherboards for Blue Waters nodes (lol).
It makes you think about what exactly a government ought to be doing to promote technology.
On your second point: why not read what the article above says about the POWER7 MCM's power consumption and package size, and then judge whether it is true?
-----------------------
Both chip packages have the same pin count at 5,336 pins (92 pins by 58 pins), according to Alan Brenner, a senior technical staff member of the server and network architecture team within IBM's Systems and Technology Group: … At 800 watts, the package is not cool by any means, but the Power7 IH MCM is delivering performance at 1.28 gigaflops per watt at the package level.
-----------------------
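As a rough sanity check on the quoted package figures, here is a small Python sketch. The pin grid, wattage and gigaflops-per-watt values come from the article; the 32 cores per MCM comes from earlier in the thread; the 8 double-precision flops per core per cycle is my own assumption and not something the article states.

```python
# Figures quoted from the article.
PIN_ROWS, PIN_COLS = 92, 58
PACKAGE_WATTS = 800
GFLOPS_PER_WATT = 1.28

# Assumptions for illustration only (not from the article):
CORES_PER_MCM = 32            # 4 chips x 8 cores, as noted earlier in the thread
FLOPS_PER_CORE_PER_CYCLE = 8  # assumed double-precision flops per core per cycle


def mcm_summary():
    """Derive package-level numbers from the quoted figures."""
    pins = PIN_ROWS * PIN_COLS
    package_gflops = PACKAGE_WATTS * GFLOPS_PER_WATT
    gflops_per_core = package_gflops / CORES_PER_MCM
    # Clock that would be needed to reach that per-core rate
    # under the assumed flops-per-cycle figure.
    implied_ghz = gflops_per_core / FLOPS_PER_CORE_PER_CYCLE
    return {
        "pins": pins,
        "package_gflops": package_gflops,
        "gflops_per_core": gflops_per_core,
        "implied_ghz": implied_ghz,
    }


if __name__ == "__main__":
    for key, value in mcm_summary().items():
        print(f"{key}: {value}")
```

The pin grid multiplies out to the quoted pin count, and 800 W at 1.28 gigaflops per watt works out to roughly a teraflop per package, which is the scale to keep in mind when weighing the claim.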
As everyone knows, details of POWER7 will be published in several papers at next year's ISSCC.
http://www.isscc.org/isscc/2010/ISSCCAP2010.pdf
------------------------
5.4 The Implementation of POWER7: A Highly Parallel and Scalable Multi-Core High-End Server Processor
5.5 A Wire-Speed Power Processor: 2.3GHz 45nm SOI with 16 Cores and 64 Threads (MACオタ note: probably a mainframe processor, corresponding to the POWER6 - z10 relationship)
9.3 POWER7 Local Clocking and Clocked Storage Elements
19.1 A 45nm SOI Embedded DRAM Macro for POWER7 32MB On-Chip L3 Cache
19.2 A 32kB 2R/1W L1 Data Cache in 45nm SOI Technology for the POWER7 Processor
------------------------
As a year-end wrap-up, I'll write up the technical information on POWER7 from material published since Hot Chips 21.
Another document published by Power.org also contains an interesting remark.
http://www.power.org/news/newsletter/Power.org_Q3_2009_Newsletter_final.pdf (p.11)
----------------------
The new POWER7 Core has a total of 18 execution units, including two fixed point pipelines bit aligned to the two LSU pipes.
----------------------
The number of execution units quoted for POWER7 differs from document to document, but 12, as I wrote in >>313, is the most common figure. I suspect that "two fixed point pipelines bit aligned to the two LSU pipes" does not refer to the FXUs but to integer units for address calculation, equivalent to x86 AGUs, added to the LSUs.
Other additional information:
・11 levels of metal layer
・The L3 also acts as a directory to reduce coherency traffic
Here is the Fermi situation as Digitimes picked it up from graphics card vendor sources. The launch is in March, but it will reportedly be hard to get hold of until April.
http://www.digitimes.com/news/a20100114PD202.html
----------------------
Nvidia may see drop in global discrete graphics chip market share in 1Q10
Monica Chen, Taipei; Joseph Tsai, DIGITIMES [Thursday 14 January 2010]
Nvidia is expected to see its share of the global discrete graphics chip market drop from 65% in 2009 to 60% or even lower due to strong competition from AMD, according to sources from graphics card makers.
Nvidia has refuted the claims saying it expects to see strong demand.
Although Nvidia plans to launch its 40nm Fermi-GF100 graphics chip in March 2010, mass shipments are unlikely to start until April, the sources noted. Nvidia responded saying its launch schedule remains unchanged.
On the other hand, AMD has already launched its DirectX 11-supporting 40nm ATI Radeon HD 5970, 5870, 5850 and 5750 GPUs and will launch HD 5670, 5570 and 5450 shortly. The company recently claimed to have shipped a total of two million DirectX 11-capable GPUs.
----------------------
Related to the above, Digitimes also ran an article a few days ago saying that yields on TSMC's 40nm process are not improving. A link to an archived copy is in the AMD next-generation thread, here.
http://pc11.2ch.net/test/read.cgi/jisaku/1263352294/91
-----------------
Foundry chipmakers, including Taiwan Semiconductor Manufacturing Company (TSMC), have been struggling to increase their yields on 40nm to over 70%, according to industry sources. The unsatisfactory yield rate has caused production for next-generation graphics processors and FPGA (field-programmable gate array) chips to run tight.
-----------------
In fact, memory bandwidth also seems to vary a lot with the measurement method: bit-tech.net's own Istanbul benchmark shows this result (lol)
http://www.bit-tech.net/hardware/2009/07/07/amd-opteron-2434-review/3
Here is bit-tech's excuse.
--------------------
We started by retesting the Xeon W5580, as a new version of Sandra, which supports Intel's implementation of NUMA, has been released since our original review. These new results show that the Xeon W5580 system has significantly more memory bandwidth and lower latency than either Opteron system - an important consideration if you're running lots of apps together such as a server used to power multiple virtual machines.
---------------------
Two noteworthy items from The Register today. First, on IBM's 2009 Q4 earnings conference call, IBM CFO Mark Loughridge indicated when POWER7 will be released.
http://www.theregister.co.uk/2010/01/20/ibm_power7_q1_launch/
-----------------------
"Later [in Q1], we'll introduce the next generation Power Systems, which will deliver two to three times the performance, in the same energy envelope," Loughridge told the assembled Wall Street multitudes on Tuesday.
-----------------------
・POWER7 products will be announced within this quarter
・The 45nm CPU process ramp went smoothly and was about five months shorter than for the 65nm generation
・POWER servers will be refreshed to the POWER7 generation within this year
The other is that China's domestically built supercomputer Dawning 6000 (曙光 6000), which uses the Loongson 3 (龍芯3号), is due to be completed this year.
http://www.theregister.co.uk/2010/01/20/china_ict_dawning_super/
---------------------
Weiwu Hu, chief architect of the Loongson processors developed by ICT, told Technology Review that the future Dawning 6000 super, presumably based on the quad-core Loongson-3 MIPS-style processor, would be finished by the middle of this year and operational by the end of 2010.
---------------------
The original source is this MIT Technology Review article.
http://www.technologyreview.com/computing/24374/
In summary:
・It was due last year but slipped to this year
・The production mask set taped out at the end of last December; volume production is to start at STMicro
・Given the delay, an 8 to 16 core version may appear in the 65nm generation
On IBM's results for the fourth quarter of last year: the Microelectronics division, which had been in long-term decline since 2006, when it made good money on design services for the CELL/B.E. and the Xbox 360 CPU, has reportedly picked up a little.
http://www.theregister.co.uk/2010/01/20/ibm_q42009_numbers/page2.html
---------------------
On the Microelectronics front, chip sales were up 2 per cent in the quarter, and Loughridge said that the 300mm wafer baker in East Fishkill, New York was nearing full utilization and that 45 nanometer output was sold out again this quarter. No doubt some of that wafer baking capacity is being pressed into action to crank out Power7 chips and probably the z11 mainframe engines too.
---------------------
The last sentence is only Morgan's speculation, but the 45nm line is apparently running at full capacity as well.
As it happens, the PPC476 popped up in the news again this week.
http://www.eetimes.com/news/semi/showArticle.jhtml?articleID=222301670
-------------------
LSI announced in September it helped IBM Corp. developed the multicore PowerPC 476FP. A four-core version running at up to 1.6 GHz is now available from LSI in TSMC's 40nm process.
-------------------
Whether or not this is a right that comes with the joint development is unclear, but it seems it can be manufactured at TSMC too. LSI also announced at the same time that it will offer 500MHz eDRAM for customer designs.
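By the way, to my mind a high-end network processor would be an SoC product built on cores like these, so I am looking forward to the upcoming ISSCC presentation to find out what the "Wire-Speed Power Processor" of >>312 really is.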
Slightly old news, but word is going around that the TSMC 40nm yield problem I wrote about around >>382-383 has now been resolved. The source is Digitimes, but since their articles quickly become unreadable I'll quote the DailyTech article instead.
http://www.dailytech.com/TSMC+Says+40nm+Problems+Resolved+Preparing+28nm+Fab+Production+/article17355c.htm
-----------------------
DailyTech spoke with a TSMC spokesperson yesterday, who stated that TSMC's 40nm yields are now "approximately at the same level" as the more mature 65nm process. Semiconductors are made in lithography chambers, and the process can be comprised of several hundred steps. Usually a new manufacturing process is developed and refined in a test fab and then transferred to production lines in a process called Chamber Matching. This theoretically ensures standard conformity and higher yields. There were several problems with chamber matching on TSMC's 40nm lines, leading to yield problems despite using the same process and recipes.
-----------------------
First, this presentation gives an easy-to-follow overview.
http://www.power.org/events/powercon09/taiwan09/IBM_Overview_PowerPC476FP.pdf
As for the Book-E APU (the specification for internal extensions such as execution units and registers), which should be what an HPC-oriented SoC uses, p.6 of the document says the following.
------------------------
・ High performance out-of-order auxiliary processor pipeline interface
- Support the floating point unit
- Support for future accelerator extensions such as VMX
------------------------
This makes it look all the more likely that it will be the basis for Sequoia.
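Judging from the aggressive roadmap too, it looks settled that IBM's future embedded cores will come from this family, and the PPE / PX cores used in the current generation of game consoles appear to have been put out to pasture. If there is a next-generation CELL/B.E., I would expect its POWER ISA control core to be a PPC470-family design as well.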
This is also news from last autumn: a product using AMCC's Titan core has been announced. Mind you, Titan itself was announced more than two years ago...
http://pc11.2ch.net/test/read.cgi/jisaku/1178140550/392
http://pc.watch.impress.co.jp/docs/2007/0531/mpf07.htm
AMCC's release is here.
http://investor.appliedmicro.com/phoenix.zhtml?c=78121&p=irol-newsArticle&ID=1342823&highlight=
-----------------------
The APM 83290 includes a processor subsystem that integrates two Titan cores based on Power Architecture technology, delivering frequencies of 1.5 GHz per core. The Titan core is a superscalar, dual-issue, out-of-order core designed to achieve industry leading single thread performance on a per clock basis. Along with high performance, innovative circuit design techniques enable the APM 83290 to deliver speeds of 1.5 GHz in 90nm bulk CMOS while comparable designs require 45nm SOI process technology to achieve similar operating speeds.
-----------------------
By now it is inferior to the PPC476 in every respect, but as the release says, reaching the same class of clock on a 90nm bulk process is perhaps creditable. Volume production is in Q1 this year, so it will certainly also arrive earlier than the 476.
In February, IBM will introduce the next generation Power Systems--the first of a family of systems and storage designed to meet the demands of a smarter planet. From the chip and virtualization capabilities all the way through to the operating system, middleware and energy management, Power Systems from IBM are integrated to help support the complex workloads and dynamic computing models of a new kind of world. Power Systems--the future of Unix servers. They're coming. Smarter systems for a Smarter Planet.
If the PPC A2 is the "Wire-Speed Power Processor" to be presented at ISSCC, the abstract says this.
https://submissions.miracd.com/ISSCC2010/WebAP/PDF/AP_Session5.pdf
--------------------
A 64-thread simultaneous multi-threaded processor uses architecture and implementation techniques to achieve high throughput at low power. Included are static VDD scaling, multi-voltage design, clock gating, multiple VT devices, dynamic thermal control, eDRAM and low-voltage circuit design. Power is reduced by >50% in a 428mm2 chip. Worst-case power is 65W at 2.0GHz, 0.85V.
--------------------
For a spec that looks larger than the PPU, 65W at 2GHz for the whole 16-core chip seems a realistic figure. Even so, assuming each core carries an APU of about 4 Flops/Cycle, that comes to roughly 2 GFlops/W at 2GHz (16 cores x 4 Flops/Cycle x 2GHz = 128 GFlops, over 65W). If that is the figure for the chip alone, it does not look like the processor for Sequoia, which is said to target 3 GFlops/W for the whole system, but we shall see.
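The estimate above can be reproduced with a few lines of Python. The 16 cores, 2.0GHz and 65W come from the abstract; the 4 Flops/Cycle per core and the 3 GFlops/W system target are the post's own assumptions, and the function names are just mine for illustration.

```python
def peak_gflops(cores, flops_per_cycle, ghz):
    """Peak GFlops = cores x flops per cycle per core x clock in GHz."""
    return cores * flops_per_cycle * ghz


def gflops_per_watt(cores, flops_per_cycle, ghz, watts):
    """Chip-level efficiency in GFlops per watt."""
    return peak_gflops(cores, flops_per_cycle, ghz) / watts


def flops_per_cycle_needed(target_gflops_per_watt, cores, ghz, watts):
    """Per-core flops per cycle required to hit a given efficiency."""
    return target_gflops_per_watt * watts / (cores * ghz)


if __name__ == "__main__":
    # Abstract figures plus the assumed 4 Flops/Cycle APU.
    eff = gflops_per_watt(cores=16, flops_per_cycle=4, ghz=2.0, watts=65)
    print(f"chip efficiency: {eff:.2f} GFlops/W")
    # What each core would need just to reach 3 GFlops/W at chip level.
    need = flops_per_cycle_needed(3.0, cores=16, ghz=2.0, watts=65)
    print(f"flops/cycle per core for 3 GFlops/W: {need:.2f}")
```

Under these assumptions each core would need a bit more than 6 Flops/Cycle merely to reach 3 GFlops/W at the chip level, before any memory, network or power-delivery overhead is counted.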
On The Register's ISSCC preview: Morgan seems to regard the "Wire-Speed Power" as an experimental part...
http://www.theregister.co.uk/2010/01/28/isscc_chip_preview/page2.html
--------------------
IBM's chip designers will be showing off another experimental Power7 derivative, an unnamed 2.3 GHz "wire-speed Power processor" that sports 16 cores and 64 threads.
--------------------
At the "Accelerated Computing" workshop hosted by RIKEN and NVIDIA,
https://reg-nvidia.jp/public/seminar/view/3
Prof. Makino apparently talked about the development status of the next-generation GRAPE-DR.
http://www.artcompsci.org/~makino/talks/roppongi201001xx.pdf (P.56)
------------------------
GRAPEs with eASIC
・Completed an experimental design of a programmable processor for quadruple-precision arithmetic. 6PEs in nominal 2.5Mgates.
・Started designing low-accuracy GRAPE hardware with 7.4Mgates chip.
Summary of planned specs:
・around 8-bit relative precision
・support for quadrupole moment in hardware
・100-200 pipelines, 300MHz, 2-4Tflops/chip
・small power consumption: single PCIe card can house 4 chips (10 Tflops, 50W in total)
------------------------
A 300MHz processor for HPC feels like rather too shoestring an approach, but it may fit the current trend, where power efficiency decides the contest. Then again, they may not be able to spare (design) resources for power management, so efficiency might not improve all that much...
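To put the planned specs in power-efficiency terms, here is a quick Python sketch using only the numbers on the slide (4 chips, 10 Tflops and 50W per card; 100-200 pipelines at 300MHz for 2-4 Tflops per chip). The "operations per pipeline per cycle" value is something I derive from those numbers, not a figure the slide gives.

```python
# Card-level figures from the slide.
CHIPS_PER_CARD = 4
CARD_TFLOPS = 10.0
CARD_WATTS = 50.0
CLOCK_HZ = 300e6


def card_summary():
    """Per-chip throughput and power efficiency of the planned card."""
    return {
        "tflops_per_chip": CARD_TFLOPS / CHIPS_PER_CARD,
        "watts_per_chip": CARD_WATTS / CHIPS_PER_CARD,
        "gflops_per_watt": CARD_TFLOPS * 1000.0 / CARD_WATTS,
    }


def ops_per_pipeline_per_cycle(chip_tflops, pipelines):
    """Operations each pipeline must be credited with per clock (derived)."""
    return chip_tflops * 1e12 / (pipelines * CLOCK_HZ)


if __name__ == "__main__":
    print(card_summary())
    # Both ends of the quoted range: 100 pipelines at 2 Tflops,
    # 200 pipelines at 4 Tflops.
    print(ops_per_pipeline_per_cycle(2.0, 100))
    print(ops_per_pipeline_per_cycle(4.0, 200))
```

Both ends of the quoted range imply the same number of operations counted per pipeline per cycle, so the Tflops figure presumably counts each pairwise interaction as many floating-point operations, as is usual for GRAPE-style pipelines.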
It looks like the product launch will take place alongside the POWER7 talk in the ISSCC processor session on 2/8 (see >>312).
http://www.theregister.co.uk/2010/02/01/ibm_power7_launch/
-------------------------
It looks like IBM's initial Power7-based servers are going to be launched in New York on February 8. Big Blue sent out the invitations today.
-------------------------
Would you believe it, Tukwila is apparently being announced on 2/8 as well.
http://www.theregister.co.uk/2010/02/02/intel_server_chip_launches/
--------------------
High-end server chip rivals Intel and IBM have picked the same day - next Monday, February 8 - to launch their respective quad-core "Tukwila" Itanium and eight-core Power7 processors.
--------------------
Shipments to customers have reportedly already begun, so, as is usual for Intel product launches, systems using it will presumably be unveiled at the same time.
>>520-522
Quite a few former P.A. Semi employees have reportedly already left Apple. The article is by Ashlee Vance, so I think it can be trusted.
http://www.nytimes.com/2010/02/02/technology/business-computing/02chip.html?ref=technology
------------------------
Some of the chip engineers Apple gained in its purchase of PA Semi appear to have already left the company. According to partial records on the job networking site LinkedIn, at least half a dozen former PA Semi engineers have left Apple and turned up at a start-up called Agnilux, based in San Jose. The company was co-founded by one of PA’s leading system architects, Mark Hayter.
Neither Mr. Hayter nor other onetime PA workers who left Apple for Agnilux were willing to discuss either company’s plans. According to two people with knowledge of the two companies, who were unwilling to be named because the matter is delicate, some PA engineers left Apple a few months after the acquisition because they were given grants of Apple stock at an unattractive price. ------------------------
Apparently news from last week: at the East Fishkill plant of IBM, which is riding high on POWER7, it has come out that the drinking water contains high levels of lead...
http://www.poughkeepsiejournal.com/apps/pbcs.dll/article?AID=2010100202008
----------------------
WICCOPEE -- Too-high levels of lead have been found in drinking water at IBM Corp.’s East Fishkill complex, prompting the company to provide alternate sources of water.
----------------------
Just what you would expect from penny-pinching IBM, which does not even regard factory workers as people.
Intel's Tukwila announcement has arrived too. It is the Itanium 9300 series.
http://www.intel.com/pressroom/archive/releases/2010/20100208comp.htm
----------------
The Intel Itanium processor 9300 series ranges in price from $946 to $3,838 in quantities of 1,000. OEM systems are expected to ship within 90 days.
----------------
So systems were not announced at the same time after all...
The product line is as follows.
http://download.intel.com/products/processor/itanium/318691.pdf
9350: 4-core, 1.73GHz, 24MB L3
9340: 4-core, 1.60GHz, 20MB L3
9330: 4-core, 1.46GHz, 20MB L3
9320: 4-core, 1.33GHz, 16MB L3
9310: 2-core, 1.60GHz, 10MB L3
Other points worth noting would be these, I suppose:
- As previously reported, it shares platform components with Nehalem-EP.
"share several platform ingredients, including the Intel(R) QuickPath Interconnect, the Intel Scalable Memory Interconnect, the Intel(R) 7500 Scalable Memory Buffer (to take advantage of industry standard DDR3 memory), and I/O hub (Intel(R) 7500 chipset). "
- "Foxton" Technology seems to have been folded into "Intel Turbo Boost Technology", the same brand Nehalem uses.
EETimes has a report on the "Wire-Speed POWER" talk (see >>318-329).
http://www.eetimes.com/news/semi/showArticle.jhtml?articleID=222700420
On the intended use, it reads as if a lot is being left between the lines.
--------------------
"It's not a network processor or a server processor but a middle ground, a blurring of the two worlds," Johnson said. The chips will be used in a range of standalone systems and PCI Express adapter cards in servers. It is mainly designed for use in IBM's own systems, however the company is willing to sell it on a merchant basis as well.
--------------------
Frankly, isn't "something midway between a server processor and an (embedded) network processor" just a desktop processor? Like the PowerPC G3/G4 of old.
Speaking of leaving things between the lines, the article closes like this.
--------------------
Johnson was chief architect of IBM's Power4 processor. He also designed IBM's portion of the processor in the Microsoft Xbox 3609 [MACオタ note: presumably a typo for Xbox 360] videogame console.
--------------------
Whether they simply mean to say it was developed by the same group as the PX/PPE, who knows...
I've got things in the wrong order, but the new information in the article is as follows.
- 64-bit
- 16-core, 1.43B transistors, 428mm^2 (POWER7 is 1.2B transistors, 567mm^2)
- 65W @ 16-core/2.3GHz, 20W @ 4-core/1.4GHz
- The 16-core version supports 8MB of on-chip cache
- Four integrated 10G Ethernet ports
- XML, regular-expression and encryption accelerators on board
- Glueless SMP capable
- To be sold externally as a processor product
- Development took five years
- Linux hypervisor support
- Tape-out (of the production version?) was a week ago. First silicon is to be tested within two weeks in (already built) systems.
- Just as in the discussion here, analysts are also questioning what it is for.
----------------
"That's a huge chip, bigger than most of the PC and server processor Intel makes and probably twice the size of many network processors out there, so cost-wise it will be tough for them to be competitive," Gwennap said.
----------------
Confirmation has arrived that "Wire-Speed Power Processor" = PowerPC A2. To be precise, the PPC A2 is the general-purpose processor core of the Wire-Speed Power Processor, which is an SoC product.
http://www.theregister.co.uk/2010/02/09/ibm_wire_speed_processor/
------------------
The processor's A2 cores are small, 64-bit PowerPC cores based on IBM's embedded architecture - "a little bit different from our server architecture," said Johnson. Full virtualization and hypervisor support is also included, along with some new instructions that allow for low-latency interaction with the processors' accelerators.
------------------
Other new information:
- 2.3GHz is merely the power-efficient frequency; it also runs at 3GHz.
- Power consumption varies in the 20-65W range with the number of active cores; about 55W on average.
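For what it is worth, the die and power figures reported above can be turned into a couple of comparable ratios. This is only a back-of-the-envelope Python sketch: the transistor counts, areas and wattages are the ones quoted from the EETimes report, and the dictionary keys and function names are mine.

```python
# Figures as quoted from the EETimes report.
CHIPS = {
    "Wire-Speed Power": {"transistors": 1.43e9, "area_mm2": 428.0},
    "POWER7": {"transistors": 1.2e9, "area_mm2": 567.0},
}

# (active cores, clock in GHz, watts) operating points from the same report.
OPERATING_POINTS = [(16, 2.3, 65.0), (4, 1.4, 20.0)]


def density_mtr_per_mm2(name):
    """Transistor density in millions of transistors per mm^2."""
    chip = CHIPS[name]
    return chip["transistors"] / chip["area_mm2"] / 1e6


def watts_per_core(cores, watts):
    """Naive chip power divided by active cores (includes uncore power)."""
    return watts / cores


if __name__ == "__main__":
    for name in CHIPS:
        print(f"{name}: {density_mtr_per_mm2(name):.2f} Mtransistors/mm^2")
    for cores, ghz, watts in OPERATING_POINTS:
        print(f"{cores} cores @ {ghz}GHz: {watts_per_core(cores, watts):.2f} W/core")
```

The per-core wattage is higher at the 4-core point even though the clock is lower, which suggests a sizeable fixed share for caches, accelerators and I/O, but that reading is my inference rather than anything the article says.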
LSI Corp., which co-developed the PPC476FP, has released its own network processor, "Axxia".
http://www.lsi.com/news/product_news/2010/2010_02_09b.html
-------------------
Axxia Communication Processors are capable of managing huge volumes of wireless traffic with low latency and no load on the CPU complex. The first member of the Axxia Communication Processor family, the ACP3448 processor, features four powerful PowerPC(TM) 476FP processor cores with a large 512KB L2 cache per core, 4 MB of system cache, integrated DDRIII memory controllers, and a wide array of intelligent offload engines, including industry-proven packet classification, traffic management, security processing and deep packet inspection. The on-chip processing elements are tied together using the new LSI Virtual Pipeline technology.
-------------------
The product page is here (with links to PDF documents).
http://www.lsi.com/networking_home/networking_products/multicore_comm_processors/axxia/index.html
・4-core, up to 1.8GHz
・512KB L2
・4MB eDRAM system cache (shared by the whole SoC, accelerators included)
・Dual DDR3 memory controllers
・Various accelerators (packet processing, security, regular expressions)
・45nm, SOI
According to the release:
--------------------
The first members of the Axxia family, designed to deliver 20 Gbps performance for today’s wireless infrastructure requirements, will be available in February of 2010.
--------------------
Given that the first products go on sale as early as this month, and given the manufacturing process above, I assume they are manufactured at IBM.
With Freescale already spun off this hardly matters any more, but the once-glorious Motorola is apparently going to be split in two yet again.
http://mediacenter.motorola.com/content/detail.aspx?ReleaseID=12429&NewsAreaID=2
-----------------------
SCHAUMBURG, Ill., February 11, 2010 -- Motorola, Inc. (NYSE: MOT) today announced the Company is targeting the first quarter of 2011 for its planned separation. Motorola intends to separate into two independent, publicly traded companies. One will include the Company’s Mobile Devices and Home businesses, and the other will include its Enterprise Mobility Solutions and Networks businesses.
-----------------------
- AMD's Bulldozer is an MCMT (MultiCluster MultiThreaded) microarchitecture. That's my baby!
Bulldozer is an implementation of the MCMT (MultiCluster-MultiThread) architecture that I came up with.
- The only bad thing is that some guys I know at AMD say that Bulldozer is not really all that great a product, but is shipping just because AMD needs a model refresh. "Sometimes you just gotta ship what you got."
But you know... according to my pals at AMD, AMD ends up shipping products because the product cycle is pressing. "Sometimes you just have to ship whatever you've got finished right now," they say.
- I came up with MCMT in 1996-2000 while at the University of Wisconsin. It became public via presentations. I brought MCMT back to Intel in 2000, and to AMD in 2002. I was beginning to despair of MCMT ever seeing the light of day. I thought that when I left AMD in 2004, the MCMT ideas may have left with me.
I originally thought up MCMT around 1996-2000, when I was at the University of Wisconsin. I proposed it when I went back to Intel in 2000, and promoted it like mad when I moved to AMD in 2002 as well. But it just would not see the light of day, and by the time I left AMD in 2004 I had completely given up.
- Of course, AMD has undoubtedly changed and evolved MCMT in many ways since I first proposed it to them. For example, I called the set of an integer scheduler, integer execution units, and an L1 data cache a "cluster", and the whole thing, consisting of shared front end, shared FP, and 2 or more clusters, a processor core. Apparently AMD is calling my clusters their cores, and my core their cluster. It has been suggested that this change of terminology is motivated by marketing, so that they can say they have twice as many cores.
Of course AMD has reworked my MCMT concept in various ways. For example, in the original idea an integer scheduler, integer units and an L1 cache together were called a "cluster", and two or more clusters plus a shared decoder and a shared FPU made up a "core". AMD, however, has named my "cluster" a core and calls my "core" a cluster. It is pretty obvious they want to make it look as if there are twice as many cores, for marketing's sake.
- My original motivation for MCMT was to work around some of the limitations of Hyperthreading on Willamette. E.g. Willamette had a very small L0 data cache, 4K in some of the internal proposals, although it shipped at 8K. Two threads sharing such a tiny L0 data cache thrash. Indeed, this is one of the reasons why hyperthreading is disabled on many systems, including many current Nhm based machines with much larger closest-in caches.
- To avoid threads thrashing each other, I wanted to give each thread their own L0. But, you can't do so, and still keep sharing the execution units and scheduler - you can't just build a 2X larger array, or put two arrays side by side, and expect to have the same latency. Wires. Therefore, I had to replicate the execution units, and enough of the scheduler so that the "critical loop" of Scheduler->Execution->Data Cache was all isolated from the other thread/cluster. Hence, the form of multi-cluster multi-threading you see in Bulldozer.
- True, there are differences, and I am sure more will become evident as more Bulldozer information becomes public. For example, although I came up with MCMT to make Willamette-style threading faster, I have always wanted to put SpMT, Speculative Multithreading, on such a substrate. SpMT has potential to speed up a single thread of execution, by splitting it up into separate threads and running the separate threads on different clusters, whereas Willamette-style hyperthreading, and Bulldozer-style MCMT (apparently), only speed up workloads that have existing independent threads.
- If I received arrows in my back for MCMT, I received 10 times as many arrows for SpMT. And yet still I have hope for it. Unfortunately, I am not currently working on SpMT. Haitham Akkary, the father of DMT, continues the work.
- Perhaps I should say here that my MCMT had a significant difference from clustering in, say, the Alpha 21264, http://www.hotchips.org/archives/hc10/2_Mon/HC10.S1/HC10.1.1.pdf [snip] Anyway: if it has an L0 or L1 data cache in the cluster, with or without the scheduler, it's my MCMT. If no cache in the cluster, not mine (although I have enumerated many such possibilities).
- Motivated by my work to use MCMT to speed up single threads, I often propose a shared L2 instruction scheduler, to load balance between the clusters dynamically. Although I admit that I only really figured out how to do that properly after I left AMD, and before I joined Intel. How to do this is part of the Multi-star microarchitecture, M*, that is my next step beyond MCMT.
- Also, although it is natural to have a single (explicit) thread per cluster in MCMT, I have also proposed allowing two threads per cluster. Mainly motivated by SpMT: I could fork to a "runt thread" running in the same cluster, and then migrate the runt thread to a different cluster. Intra-cluster forking is faster than inter-cluster forking, and does not disturb the parent thread. But, if you are not doing SpMT, there is much less motivation for multiple threads per cluster.
- With Willamette as background, I leaned towards a relatively small, L0, cache in the cluster. Also, such a small L0 can often be pitch-matched with the cluster execution unit datapath. A big L1, such as Bulldozer seems to have, nearly always has to lie out of the datapath, and requires wire turns. Wire turns waste area. I have, from time to time, proposed putting the alignment muxes and barrel shifters in the wire turn area. I'm surprised that a large cluster L1 makes sense, but that's the sort of thing that you can only really tell from layout.
- Some posters have been surprised by sharing the FP. Of course, AMD's K7 design, with separate clusters for integer and FP, was already half-way there. They only had to double the integer cluster. It would have been harder for Intel to go MCMT, since the P6 family had shared integer and FP. Willamette might have been easier to go MCMT, since it had separate FP.
- Anyway... of course, for FP threads you might like to have thread-private FP. But, in some ways, it is the advent of expensive FP, like Bulldozer's 2 sets of 128 bit, 4x32 bit, FMAs, that justify integer MCMT: the FP is so big that the overhead of replicating the integer cluster, including the OOO logic, is a drop in the bucket.
- You'd like to have per-cluster-thread FP, but such big FP workloads are often so memory intensive that they thrash the shared-between-clusters L2 cache: threading may be disabled anyways. As it is, you get good integer threads via MCMT, and you get 1 integer thread and 1 FP thread. Two FP threads may have some slowdown, although, again, if memory intensive they may be blocking on memory, and hence allowing the other FP thread to use the FP. But two purely computational FP threads will almost undoubtedly block, unless the schedulers are piss-poor and can't use all of the FP for a single thread (e.g. by being too small).
- I don't expect to get any credit for MCMT. In fact, I'm sure I'm going to get shit for this post. I don't care. I know. The people who were there, who saw my presentations and read my proposals, know. But, e.g. Chuck Moore wasn't there at start; he came in later. Even Mike Haertel, my usual collaborator, wasn't there; he was hired in later, although before Chuck. Besides, Mike Haertel thinks that MCMT is obvious. That's cool, although I ask: if MCMT is obvious, then why isn't Intel doing it? Companies like Intel and AMD need idea generating people like me about once every 10 years. In between, they don't need new ideas. They need new incremental improvements of existing ideas.
Anyway... It's cool to see MCMT becoming real. It gives me hope that my follow-on to MCMT, M* may still, eventually, also become real.
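The cluster/core naming issue described in the post can be made concrete with a tiny Python model. This is purely an illustration of the terminology as the post describes it: the class names, field names and the `count_cores` helper are hypothetical, and nothing here reflects actual Bulldozer internals beyond what the post states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """The post's 'cluster': integer scheduler, integer execution units
    and a data cache, all private to one thread."""
    int_scheduler: str = "private"
    int_exec_units: str = "private"
    data_cache: str = "private"


@dataclass
class MCMTCore:
    """The post's 'core': shared front end, shared FP, and 2 or more clusters."""
    clusters: List[Cluster] = field(default_factory=lambda: [Cluster(), Cluster()])
    front_end: str = "shared"
    fp_unit: str = "shared"


def count_cores(chip: List[MCMTCore], naming: str) -> int:
    """Count 'cores' on a chip under either naming convention."""
    if naming == "original":   # the post's terminology: one core per shared front end
        return len(chip)
    if naming == "amd":        # per the post, AMD calls each cluster a core
        return sum(len(core.clusters) for core in chip)
    raise ValueError(f"unknown naming convention: {naming}")


if __name__ == "__main__":
    chip = [MCMTCore() for _ in range(4)]
    print(count_cores(chip, "original"), count_cores(chip, "amd"))
```

With two clusters behind each shared front end, the same silicon reports twice as many "cores" under the second convention, which is exactly the point the post makes about marketing.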
- There were several K10s. While I wanted to work on low power when I went to AMD, I was hired to consult on low power and do high end CPU, since the low power project was already rolling and did not need a new chef. The first K10 that I knew at AMD was a low power part. When that was cancelled I was sent off on my lonesome, then with Mike Haertel, to work on a flagship, out-of-order, aggressive processor, while the original low power team did something else. When that other low-power project was cancelled, that team came over to the nascent K10 that I was working on. My K10 was MCMT, plus a few other things. I had actually had to promise Fred Weber that I would NOT do anything advanced for this K10 - no SpMT, just MCMT. But when the other guys came on board, I thought this meant that I could leave the easy stuff for them, while I tried to figure out how to do SpMT and/or any other way of using MCMT to speed up single threads.
- Some of us have done a lot of work on dynamic predication. (My resume includes an OOO Itanium, plus I have been working on VLIW and predication longer than OOO.) But since such work inside companies will never see the light of day, do not let that hold you back, since you are not so constrained by NDAs and trade secrets.
tpi.exe -T 2 -o pi.txt 128M Using 3.67GiB of RAM Computation to 128000000 digits, formula=Chudnovsky Output file=pi.txt, format=txt, binary result size=53.1MB Binary Splitting Depth=24, thread_level=1 mem max disk max operation compl lv 545M 545M 0 0 completed 100.0% 0 time = 63.601 s Compute P, Q 362M 545M 0 0 completed time = 0.836 s Division 599M 599M 0 0 completed time = 5.646 s Sqrt 528M 599M 0 0 completed time = 3.793 s Final multiplication 925M 925M 0 0 completed time = 2.353 s Total time (binary result) = 76.247 s Base conversion 523M 925M 0 0 completed time = 13.922 s Total time (base 10 result) = 90.170 s Writing result to 'pi.txt'
Using 3.67GiB of RAM Computation to 134217728 digits, formula=Chudnovsky Output file=pi.txt, format=txt, binary result size=55.7MB Binary Splitting Depth=24, thread_level=1 mem max disk max operation compl lv 571M 571M 0 0 completed 100.0% 0 time = 66.222 s Compute P, Q 377M 571M 0 0 completed time = 0.874 s Division 623M 623M 0 0 completed time = 6.115 s Sqrt 547M 623M 0 0 completed time = 4.134 s Final multiplication 966M 966M 0 0 completed time = 2.699 s Total time (binary result) = 80.044 s Base conversion 549M 966M 0 0 completed time = 14.836 s Total time (base 10 result) = 94.879 s Writing result to 'pi.txt'
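For readers wondering what "formula=Chudnovsky" with "Binary Splitting" refers to in the logs above, here is a minimal Python sketch of the textbook method. It is not how tpi itself is implemented (I have no knowledge of its internals); it only shows the same stages in miniature: binary splitting to get the big integers P, Q and T, one large square root, and a final multiplication and division. The function names are mine.

```python
from math import isqrt

C = 640320
C3_OVER_24 = C ** 3 // 24


def _binary_split(a, b):
    """Combine Chudnovsky series terms a..b-1 into integers (P, Q, T)."""
    if b - a == 1:
        if a == 0:
            p = q = 1
        else:
            p = (6 * a - 5) * (2 * a - 1) * (6 * a - 1)
            q = a * a * a * C3_OVER_24
        t = p * (13591409 + 545140134 * a)
        if a & 1:
            t = -t  # terms alternate in sign
        return p, q, t
    m = (a + b) // 2
    p1, q1, t1 = _binary_split(a, m)
    p2, q2, t2 = _binary_split(m, b)
    return p1 * p2, q1 * q2, q2 * t1 + p1 * t2


def pi_scaled(digits):
    """Return an integer approximating pi * 10**digits (last digits may be off)."""
    terms = digits // 14 + 1  # each term adds a little over 14 digits
    _, q, t = _binary_split(0, terms)
    sqrt_10005 = isqrt(10005 * 10 ** (2 * digits))
    return (q * 426880 * sqrt_10005) // t


if __name__ == "__main__":
    print(pi_scaled(50))
```

Python's built-in big integers make this workable only for modest digit counts; programs that reach 10^8 digits, as in the logs, depend on much faster large-number multiplication.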
The physical memory the PPC470 itself supports happens to be stated in the presentation above: 4TB.
---------------------
- Real memory support up to 4 terabytes
---------------------
Since 16 cores share 16GB, a 4GB-per-process limit may be workable, but code that shares all the memory in a node between threads, as on a PC-style 64-bit SMP, cannot be used.
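Since it will be running on Blue Gene anyway, perhaps nobody is thinking of a straight port from PC clusters, but this does raise the possibility that the 64-bit A2 core will be chosen...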
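The address-space arithmetic behind that argument, as a short Python sketch. The 4TB real-memory figure is from the presentation, and the 16 cores, 16GB per node and 4GB per-process limit are the numbers used in the post; treating the 4GB limit as a 32-bit effective address space, and using binary units throughout, are my assumptions.

```python
GIB = 2 ** 30
TIB = 2 ** 40


def address_bits(n_bytes):
    """Number of address bits needed to address n_bytes of memory."""
    return (n_bytes - 1).bit_length()


REAL_MEMORY = 4 * TIB          # from the presentation
PER_PROCESS_LIMIT = 4 * GIB    # assumed to mean 32-bit effective addresses
NODE_MEMORY = 16 * GIB         # node configuration discussed in the post
CORES_PER_NODE = 16


def node_summary():
    """Derive the quantities the post reasons about."""
    return {
        "real_address_bits": address_bits(REAL_MEMORY),
        "effective_address_bits": address_bits(PER_PROCESS_LIMIT),
        "gib_per_core": NODE_MEMORY // CORES_PER_NODE // GIB,
        # Fewest processes needed for the node's memory to be fully mapped.
        "min_processes_to_cover_node": -(-NODE_MEMORY // PER_PROCESS_LIMIT),
    }


if __name__ == "__main__":
    print(node_summary())
```

So the memory per core sits comfortably under the per-process limit, but no single process could map the whole node, which is the restriction the post is pointing at.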
Morgan of The Register has written about IBM's own CELL blade QS series.
http://www.theregister.co.uk/2010/03/23/ibm_kills_qs21_blade/
The story is that QS21 orders close on 6/25 this year and that the next-generation QS2Z has had its development cancelled, as reported last year, but he wraps up like this.
---------------------
The reason why the QS22's days are numbered is simple. IBM, say sources familiar with the company's plans, is to add specialty processing capabilities like those embodied in the SPEs in the Cell chip to the future Power chips beyond the current Power7 generation. Perhaps starting with Power7+ and definitely in full bloom with the Power8 generation.
---------------------
In other words, the reason development as CELL/B.E. was stopped is that the SPEs are being absorbed into the POWER8 architecture...
This had already been mentioned by Russia's iXBT and others, but on the GPU (graphics card) version of GF100, double-precision floating-point throughput is apparently limited to 1/8 of single precision... I was waiting for a clearer article, but judging by the Hexus review there seems to be no mistake.
http://www.hexus.net/content/item.php?item=24000&page=3
-----------------------
Delve a little deeper, handily not mentioned in any briefing, and NVIDIA is limiting the double-precision speed of the desktop GF100 part to one-eighth of single-precision throughput, rather than the one-fifth speed of the Radeon HD 5000-series. We'll have to wait for the Tesla parts before that's restored to the one-half speed the GF100 is capable of.
-----------------------
Bad luck for everyone who was planning to buy a cheap GPU and do GPGPU in the literal sense.
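What with the PS3 dropping Linux support too, these are becoming unpleasant times for HPC on a shoestring. The background seems to be that HPC, which can hardly be called a large market, has come to be recognized as a market in the middle of the recession.
http://www.theregister.co.uk/2010/03/25/idc_hpc_servers_2009/
=======================
The non-HPC portion of the server market was actually down 20.5 per cent, to $34.6bn - a decline that was nearly twice as steep as that in the HPC space.
=======================
The only hope is probably the AMD platform, which with the launch of Magny-Cours is aiming for a comeback in the server market through suicidal discounting...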
I have a backlog of things to cover, but first a Xilinx press release.
http://press.xilinx.com/phoenix.zhtml?c=212763&p=irol-newsArticle&ID=1409753&highlight=
--------------------
Xilinx, Inc. (Nasdaq: XLNX), the world's leading provider of programmable solutions, is applauded for its role in developing QPACE; a bespoke supercomputer developed to unlock the mysteries of Quantum Chromodynamics.
--------------------
It announces that Virtex-5 was used for the interconnect chip of QPACE, the PowerXCell-based supercomputer I also covered here in >>336 and >>631. Why the timing, one wonders...
==Fastest race===
1.Oracle Database 10G Enterprise Edition 824,164tpmC 8.28 US $(2003/07/30)
2.IBM DB2 UDB 8.1 763,898tpmC 8.31 US $(2003/06/30)
3.Microsoft SQL Server 2000 Enterprise Ed. 64-bit 707,102tpmC 14.96 US $(2003/05/20)
==Price/performance race===
1.Microsoft SQL Server 2000 Standard Ed. 20,108tpmC 2.28 US $(2003/07/14)
2.Microsoft SQL Server 2000 19,526tpmC 2.38 US $(2003/05/12)
3.Microsoft SQL Server 2000 Standard Ed. SP3 19,718tpmC 2.44 US $(2003/07/15)
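To see what these two rankings mean in absolute money, the dollar figure can be multiplied back by the throughput. The sketch below assumes the "US $" column is price per tpmC, as in published TPC-C results (the lists themselves do not label it), so the totals are implied system prices under that assumption. The list and function names are mine.

```python
# (label, tpmC, assumed US$ per tpmC), copied from the two lists above.
FASTEST = [
    ("Oracle Database 10G Enterprise Edition", 824164, 8.28),
    ("IBM DB2 UDB 8.1", 763898, 8.31),
    ("Microsoft SQL Server 2000 Enterprise Ed. 64-bit", 707102, 14.96),
]
CHEAPEST = [
    ("Microsoft SQL Server 2000 Standard Ed.", 20108, 2.28),
    ("Microsoft SQL Server 2000", 19526, 2.38),
    ("Microsoft SQL Server 2000 Standard Ed. SP3", 19718, 2.44),
]


def implied_system_price(tpmc, usd_per_tpmc):
    """Total priced configuration implied by throughput x price per tpmC."""
    return tpmc * usd_per_tpmc


def report(entries):
    """Return (label, implied total price in US$) for each entry."""
    return [(label, implied_system_price(tpmc, price)) for label, tpmc, price in entries]


if __name__ == "__main__":
    for label, total in report(FASTEST) + report(CHEAPEST):
        print(f"{label}: about US${total:,.0f}")
```

On that reading, the top of the speed ranking is a multi-million-dollar configuration while the top of the price/performance ranking costs tens of thousands, so the two lists describe very different classes of system.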
Google has reportedly acquired Agnilux, the start-up founded by the remnants of P.A. Semi.
http://www.washingtonpost.com/wp-dyn/content/article/2010/04/20/AR2010042004854.html
-------------------
Agnilux was founded by a few ex-Apple employees. More specifically, it was founded by Apple employees who came over in the PA Semi acquisition.
-------------------
I doubt it is merely to spite Apple, but is even Google really going to get into mobile processor development? It might be more fun if they went with PowerPC rather than ARM (lol)
Meanwhile, there is a rumor going around that Apple will acquire ARM itself.
http://www.reuters.com/article/idUSTRE63K1KG20100421
----------------------
The Financial Times reported renewed speculation that ARM could be a takeover target, mentioning Apple as a possible suitor.
----------------------
>>866 Being acquired does not mean existing contracts can simply be torn up. Even in the P.A. Semi case, Apple was made to promise that it would keep supplying PWRficient chips for at least three years.
http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=208808517
----------------------
Apple sent a letter to the DoD saying it will assure production of the 1.8 GHz PWRficient processor for three to five years, said one source who saw the letter but asked not to be named. The letter suggests Apple will explore selling the designs to a third party after that time.
----------------------
7.1. About the L1 memory system
The L1 memory system consists of separate instruction and data caches in a Harvard arrangement. The L1 memory system provides the core with:
* fixed line length of 64 bytes
* support for 16KB or 32KB caches
* two 32-entry fully associative ARMv7-A MMU
* data array with parity for error detection
* an instruction cache that is virtually indexed, IVIPT
* a data cache that is physically indexed, PIPT
* 4-way set associative cache structure
* random replacement policy
* nonblocking cache behavior for Advanced SIMD code
* blocking for integer code
* MBIST
* support for hardware reset of the L1 data cache valid RAM, see Hardware RAM array reset.
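The geometry implied by the list above (64-byte lines, 4 ways, 16KB or 32KB) can be worked out with a few lines of Python. This is only an arithmetic sketch: the function names `l1_geometry` and `split_address` are made up, and the textbook tag/index/offset split is assumed rather than taken from the manual.

```python
def l1_geometry(cache_bytes, line_bytes=64, ways=4):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes cache size, line size and way count are powers of two.
    """
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return sets, offset_bits, index_bits


def split_address(addr, cache_bytes, line_bytes=64, ways=4):
    """Split an address into (tag, index, offset) using the conventional layout.

    Whether the index bits come from the virtual or the physical address
    depends on the cache (virtually vs physically indexed in the list above).
    """
    sets, offset_bits, index_bits = l1_geometry(cache_bytes, line_bytes, ways)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    for kib in (16, 32):
        sets, off, idx = l1_geometry(kib * 1024)
        print(f"{kib}KB: {sets} sets, {off} offset bits, {idx} index bits")
```

With these parameters a 16KB cache has 64 sets and a 32KB cache has 128 sets, so the index grows from 6 to 7 bits while the 6-bit line offset stays the same.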
I wrote something a bit careless in >>843.
-------------------
In the future they seem to be heading toward FBC (floating-body cell), taking advantage of SOI...
-------------------
The press release on this came out last September, but IBM has presented a 32nm deep-trench-cell eDRAM at IEDM 2009. A detailed write-up is here. It seems safe to say that IBM's eDRAM will stay on this path for the time being...
http://www.semiconductor.net/article/354546-IBM_Readies_32_nm_eDRAM_With_Low_Latency.php
======================
The eDRAM is fully compatible with logic transistors, with no degradation in logic performance. It incorporates a deep trench capacitor structure, with a high-k dielectric and metal liner capacitor technology.
======================
Come to think of it, Innovative Silicon, of Z-RAM fame, also seems to be putting more effort into its bulk-silicon implementation for Hynix.
http://www.edn.com/article/CA6726490.html?nid=2551
======================
Hynix, Innovative Silicon show floating-body DRAM on bulk silicon
======================
That is an ambitious combination in its own right, FinFET + FBC, and I look forward to seeing where it goes.
GCC 4.5, released in the middle of last month, supports various PPC implementations that have come up in this very thread.
http://gcc.gnu.org/gcc-4.5/changes.html
---------------------------
・GCC now supports the Power ISA 2.06, which includes the VSX instructions that add vector 64-bit floating point support, new population count instructions, and conversions between floating point and unsigned types.
・Support for the power7 processor is now available through the -mcpu=power7 and -mtune=power7.
・GCC will now vectorize loops that contain simple math functions like copysign when generating code for altivec or VSX targets.
・Support for the A2 processor is now available through the -mcpu=a2 and -mtune=a2 options.
・Support for the 476 processor is now available through the -mcpu={476,476fp} and -mtune={476,476fp} options.
・Support for the e500mc64 processor is now available through the -mcpu=e500mc64 and -mtune=e500mc64 options.
・GCC can now be configured with options --with-cpu-32, --with-cpu-64, --with-tune-32 and --with-tune-64 to control the default optimization separately for 32-bit and 64-bit modes.
---------------------------
The one whose actual silicon has not been announced yet is Freescale's e500mc64, if I recall correctly? I expect it to be announced at this year's FTF.
http://www.freescale.com/webapp/sps/site/overview.jsp?nodeId=052577903689DC
Air Force may suffer collateral damage from PS3 firmware update
We checked in with the Air Force Research Laboratory, which noted its disappointment with the Sony decision.
the lab told Ars, but "this will make it difficult to replace systems that break or fail. We are aware of class-action lawsuits against Sony for taking away this option on systems that used to have it."