Examining Centaur CHA’s Die and Implementation Goals
April 30, 2022 · clamchowder
In our last article, we examined Centaur’s CNS architecture. Centaur had a long history as a third x86 CPU designer, and CNS was the last CPU architecture Centaur had in the works before they were bought by Intel. In this article, we’ll take a look at Centaur CHA’s physical implementation. CHA is Centaur’s name for their system-on-chip (SoC) that targets edge server inference workloads by integrating a variety of blocks. These include:
Eight x86-compatible CNS cores running at up to 2.5 GHz
NCore, a machine learning accelerator, also running at 2.5 GHz
16 MB of L3 cache
Quad channel DDR4 controller
44 PCIe lanes, and IO links for dual socket support
We’re going to examine how CHA allocates die area to various functions. From there, we’ll discuss how Centaur’s design goals influenced their x86 core implementation.

Compared to Haswell-E

Centaur says CNS has similar IPC to Haswell, so we’ll start with that obvious comparison. CHA is fabricated on TSMC’s 16nm FinFET process node, with a 194 mm2 die size. Haswell-E was fabricated on Intel’s 22nm FinFET process node and the 8 core die is 355 mm2. So by server and high end desktop standards, the CHA die is incredibly compact.

CHA ends up being just slightly larger than half of Haswell-E, even though both chips are broadly comparable in terms of core count and IO capabilities. Both chips have eight cores fed by a quad channel DDR4 controller, and support dual socket configurations. Differences from the CPU and IO side are minor. CHA has slightly more PCIe lanes, with 44 compared to Haswell-E’s 40. And Haswell has 20 MB of L3 cache, compared to 16 MB on CHA. Now, let’s break down die area distribution a bit. The Haswell cores themselves occupy just under a third of the die. Add the L3 slices and ring stops, and we’ve accounted for about 50% of die area. The rest of the chip is used to implement IO.

CHA also has eight cores and similar amounts of IO. Like with Haswell-E, IO takes up about half the die. But the eight CNS cores and their L3 cache occupy a third of the die, compared to half on Haswell-E. That leaves enough die area to fit Centaur’s NCore. This machine learning accelerator takes about as much area as the eight CNS cores, so it was definitely a high priority for CHA’s design.

Centaur and Intel clearly have different goals. Haswell-E serves as a high end desktop chip, where it runs well above 3 GHz to achieve higher CPU performance. That higher performance especially applies in applications that don’t have lots of parallelism. To make that happen, Intel dedicates a much larger percentage of Haswell-E’s die area towards CPU performance. Haswell cores use that area to implement large, high clocking circuits. They’re therefore much larger, even after a process node shrink to Intel’s 14nm process (in the form of Broadwell).

In contrast, Centaur aims to make their cores as small as possible. That maximizes compute density on a small die, and makes room for NCore. Unlike Intel, CNS doesn’t try to cover a wide range of bases. Centaur is only targeting edge servers. In those environments, high core counts and limited power budget usually force CPUs to run at conservative clock speeds anyway. It’s no surprise that clock speed takes a back seat to high density. The result is that CNS achieves roughly Haswell-like IPC in a much smaller core, albeit one that can’t scale to higher performance targets. We couldn’t get a Haswell based i7-4770 to perfectly hit 2.2 GHz, but you get the idea:

7-Zip has a lot of branches, and a lot of them seem to give branch predictors a hard time. In our previous article, we noted that Haswell’s branch predictor could track longer history lengths, and was faster with a lot of branches in play. That’s likely responsible for Haswell’s advantage in 7-Zip. Unfortunately, we don’t have documented performance counters for CNS, and can’t validate that hypothesis. Video encoding is a bit different. Branch performance takes a back seat to vector throughput. Our test uses libx264, which can use AVX-512. AVX-512 instructions account for about 14.67% of executed instructions, provided the CPU supports them. CNS roughly matches Haswell’s performance per clock in this scenario. But again, Haswell is miles ahead when the metaphorical handbrake is off:

CNS is most competitive when specialized AVX-512 instructions come into play. Y-Cruncher is an example of this, with 23.29% of executed instructions belonging to the AVX-512 ISA extension. VPMADD52LUQ (12.76%) and VPADDQ (9.06%) are the most common AVX-512 instructions. The former has no direct AVX2 equivalent as far as I know.
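For readers who haven’t run into it, here’s a minimal sketch of what VPMADD52LUQ does in intrinsic form (the values are placeholders, and it assumes a CPU and compiler with AVX512IFMA support, e.g. built with -O2 -mavx512f -mavx512ifma):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// Minimal sketch of a 52-bit limb multiply-accumulate, the pattern that
// y-cruncher style big-number code leans on.
int main() {
    __m512i acc = _mm512_set1_epi64(0);                      // accumulator limbs
    __m512i a   = _mm512_set1_epi64(0x000FFFFFFFFFFFFFULL);  // 52-bit operand
    __m512i b   = _mm512_set1_epi64(3);

    // VPMADD52LUQ: per 64-bit lane, acc += low 52 bits of (a * b)
    acc = _mm512_madd52lo_epu64(acc, a, b);

    uint64_t out[8];
    _mm512_storeu_si512(out, acc);
    printf("lane 0 = 0x%llx\n", (unsigned long long)out[0]);
    return 0;
}
```

Getting the same result with AVX2 means stitching it together from 32-bit multiplies, shifts, and adds, which is part of why this instruction features so heavily in the Y-Cruncher mix.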

Even though its vector units are not significantly more powerful than Haswell’s, CNS knocks this test out of the park. At similar clock speeds, Haswell gets left in the dust. It still manages to tie CNS at stock, showing the difference high clock speeds can make. We also can’t ignore Haswell’s SMT capability, which helps make up some of its density disadvantage.

Compared to Zeppelin and Coffee Lake

Perhaps comparisons with client dies make more sense, since they’re closer in area to CHA. AMD’s Zeppelin (Zen 1) and Intel’s Coffee Lake also contain eight cores with 16 MB of L3. Both are especially interesting because they’re implemented on process nodes that are vaguely similar to CHA’s.

Let’s start with Zeppelin. This AMD die implements eight Zen 1 cores in two clusters of four, each with 8 MB of L3. It’s about 9% larger than CHA, but has half as many DDR4 channels. Zeppelin has fewer PCIe lanes as well, with 32 against CHA’s 44.

L3 area is in the same ballpark on both Zeppelin and CHA, suggesting that TSMC 16nm and GlobalFoundries 14nm aren’t too far apart when high density SRAM is in play. Core area allocation falls between CHA and Haswell-E’s, although percentage-wise, it’s closer to the latter.

A lot of space on Zeppelin doesn’t belong to cores, cache, DDR4, or PCIe. That’s because Zeppelin is a building block for everything. Desktops are served with a single Zeppelin die that provides up to eight cores. Workstations and servers get multiple dies linked together to give more cores, more memory controller channels, and more PCIe. In the largest configuration, a single Epyc server chip uses four Zeppelins to give 32 cores, eight DDR4 channels, and 128 PCIe lanes.

Zeppelin’s flexibility lets AMD streamline production. But nothing in life is free, and die space is used to make that happen. For client applications, Zeppelin tries to bring a lot of traditional chipset functionality onto its die. It packs a quad port USB 3.1 controller. That takes space. Some PCIe lanes can operate in SATA mode to support multi-mode M.2 slots, which means extra SATA controller logic. Multi-die setups are enabled by cross-die links (called Infinity Fabric On Package, or IFOP), which cost area as well.

AMD’s Zen 1 core was probably designed with that in mind. It can’t clock low because it has to reach desktop performance levels, so density is achieved by sacrificing vector performance. Zen 1 uses 128-bit execution units and registers, with 256-bit AVX instructions split into two micro-ops.

Then, Zen 1 took aim at modest IPC goals, and gave up on extremely high clock speeds. That helps improve their compute density, and lets them fit eight cores in a die that has all kinds of other connectivity packed into it. Like Haswell, Zen uses SMT to boost multithreaded performance with very little die area overhead.

In its server configuration, Zen 1 can’t boost as high as it can on desktops. But it still has a significant clock speed advantage over CNS on low threaded parts of workloads. Like Haswell, Zen 1 delivers better performance per clock than CNS with 7-Zip compression.

With video encoding, CNS again falls behind. Part of that is down to clock speed. Zen 1’s architecture is also superior in some ways. AMD has a large non-scheduling queue for vector operations, which can prevent renamer stalls when lots of vector operations are in play. In this case, that extra “fake” scheduling capacity seems to be more valuable than CNS’s higher execution throughput.

Finally, CNS leaves Zen 1 in the dust if an application is very heavy on vector compute and can take advantage of AVX-512. In Y-Cruncher, Zen 1 is no match for CNS.

Intel’s Coffee Lake

Coffee Lake’s die area distribution looks very different. Raw CPU performance is front and center, with CPU cores and L3 cache taking more than half the die.

About a quarter of Coffee Lake’s die goes to an integrated GPU. In both absolute area and percentage terms, Coffee Lake’s iGPU takes up more space than Centaur’s NCore, showing the importance Intel places on integrated graphics. After the cores, cache, and iGPU are accounted for, there’s not much die area left. Most of it is used to implement modest IO capabilities. That makes Coffee Lake very area efficient. It packs more CPU power into a smaller die than CHA or Zeppelin, thanks to its narrow focus on client applications. Like Haswell and Zen, the Skylake core used in Coffee Lake looks designed to reach very high clocks, at the expense of density. It’s much larger than CNS and Zen 1, and clocks a lot higher too.

Intel also sacrifices some area to make Skylake “AVX-512 ready”, for lack of a better term. This lets the same basic core design go into everything from low power ultrabooks to servers and supercomputers, saving design effort.

Perhaps it’s good to talk about AVX-512 for a bit, since that’s a clear goal in both Skylake and CNS’s designs.

AVX-512 Implementation Choices

CNS’s AVX-512 implementation looks focused on minimizing area overhead, rather than maximizing performance gain when AVX-512 instructions are used. Intel has done the opposite with Skylake-X. To summarize some AVX-512 design choices:
| AVX-512 Implementation Detail | Intel Skylake-X | Centaur CNS | Comments |
|---|---|---|---|
| Vector Register File | Full width 512-bit registers (10.5 KB total) | 512-bit architectural registers take two 256-bit physical registers (4.5 KB total) | SKL-X has a lot more renaming capacity when 512-bit registers are used |
| Mask Register File | Separate physical register file with 128 renames (additional 1 KB of registers) | Shares scalar integer register file (GPRs) | Scalar integer code is still common in AVX-512 code, and SKL-X’s separate mask register file avoids contention |
| Floating Point Vector Execution | 2×512-bit FMA per cycle | 2×256-bit FMA per cycle. 512-bit instructions decode into two micro-ops | Intel’s going for maximum throughput, but those extra FP units are large |
| Integer Vector Execution | 2×512-bit vector integer add per cycle | 3×256-bit vector integer add per cycle | Intel is still better set up to get the most out of AVX-512 |
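To make the mask register row above concrete, here’s a minimal sketch of AVX-512 merge masking (the values are arbitrary; it only assumes an AVX-512F capable CPU and a compiler flag like -mavx512f):

```cpp
#include <immintrin.h>
#include <cstdio>

// Minimal sketch of AVX-512 merge masking: the __mmask8 value lives in a
// k register, which is what the "Mask Register File" row is about.
int main() {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[8] = {10, 10, 10, 10, 10, 10, 10, 10};
    double out[8];

    __m512d va = _mm512_loadu_pd(a);
    __m512d vb = _mm512_loadu_pd(b);
    __mmask8 k = 0b10101010;                    // only odd lanes enabled

    // Masked add: disabled lanes keep the value from va (merge masking)
    __m512d vc = _mm512_mask_add_pd(va, k, va, vb);
    _mm512_storeu_pd(out, vc);

    for (int i = 0; i < 8; i++) printf("%g ", out[i]);
    printf("\n");
    return 0;
}
```

How the core renames those k registers, a dedicated file on Skylake-X versus the scalar integer file on CNS per the table, directly affects how much masked code it can keep in flight.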
To me, CNS is very interesting because it shows how decent AVX-512 support can be implemented with as little cost as possible. Furthermore, CNS demonstrates that reasonably powerful vector units can be implemented in a core that takes up relatively little area. To be clear, this kind of strategy may not offer the best gains from AVX-512:

But it can still offer quite a bit of performance improvement in certain situations, as the benchmark above shows.

Thoughts about Centaur’s Design

Centaur aimed to create a low cost SoC that combines powerful inference capabilities with CPU performance appropriate for a server. Low cost means keeping die area down. But Centaur’s ML goals mean dedicating significant die area toward its NCore accelerator. Servers need plenty of PCIe lanes for high bandwidth network adapters, and that takes area too. All of that dictates a density oriented CPU design, in order to deliver the required core count within remaining space.

To achieve this goal, Centaur started by targeting low clocks. High clocking designs usually take more area, and that’s especially apparent on Intel’s designs. There are tons of reasons for this. To start, process libraries designed for high speeds occupy more area than ones designed for high density. For example, Samsung’s high performance SRAM bitcells occupy 0.032 μm2 on their 7nm node, while their high density SRAM takes 0.026 μm2. Then, higher clock speeds often require longer pipelines, and buffers between pipeline stages take space. On top of that, engineers might use more circuitry to reduce the number of dependent transitions that have to complete within a cycle. One example is using carry lookahead adders instead of simpler carry propagate adders, where additional gates are used to compute whether a carry will be needed at certain positions, rather than wait for a carry signal to propagate across all of a number’s binary digits.

Then, AVX-512 is supported with minimal area overhead. Centaur extended the renamer’s register alias table to handle AVX-512 mask registers. They tweaked the decoders to recognize the extra instructions. And they added some execution units for specialized instructions that they cared for. I doubt any of those took much die area. CNS definitely doesn’t follow Intel’s approach of trying to get the most out of AVX-512, which involves throwing core area at wider execution units and registers.

Finally, Centaur set modest IPC goals. They aimed for Haswell-like performance per clock, at a time when Skylake was Intel’s top of the line core. That avoids bloating out-of-order execution buffers and queues to chase diminishing IPC gains. To illustrate this, let’s look at simulated reorder buffer (ROB) utilization in an instruction trace of a 7-Zip compression workload. A CPU’s ROB tracks all instructions until their results are made final, so its capacity represents an upper bound for a CPU’s out of order execution window. We modified ChampSim to track ROB occupancies every cycle, to get an idea of how often a certain number of ROB entries are utilized.
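The bookkeeping involved is simple. A minimal sketch of the idea (not the actual ChampSim patch; the struct name and the 192-entry capacity are just illustrative):

```cpp
#include <array>
#include <cstdio>

// Per-cycle ROB occupancy histogram: count how many cycles the ROB spends at
// each occupancy, then report what fraction of cycles stayed under a limit.
struct RobOccupancyStats {
    static constexpr int ROB_SIZE = 192;                  // assumed ROB capacity
    std::array<unsigned long long, ROB_SIZE + 1> cycles_at{};
    unsigned long long total_cycles = 0;

    // Call once per simulated cycle with the current ROB occupancy.
    void sample(int occupancy) {
        cycles_at[occupancy]++;
        total_cycles++;
    }

    // Fraction of cycles where occupancy stayed below 'limit' entries.
    double fraction_below(int limit) const {
        if (limit > ROB_SIZE + 1) limit = ROB_SIZE + 1;
        unsigned long long n = 0;
        for (int i = 0; i < limit; i++) n += cycles_at[i];
        return total_cycles ? double(n) / double(total_cycles) : 0.0;
    }
};

int main() {
    RobOccupancyStats stats;
    for (int cyc = 0; cyc < 1000; cyc++)    // toy stand-in for the simulator loop
        stats.sample(cyc % (RobOccupancyStats::ROB_SIZE + 1));
    printf("cycles below 100 entries: %.1f%%\n", 100.0 * stats.fraction_below(100));
    return 0;
}
```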

Most cycles fall into one of two categories. We either see less than 200 entries used. Or the ROB is basically filling up, indicating it’s not big enough to absorb latency. We can also graph the percentage of cycles for which ROB utilization is below a certain point (below). Again, we see clearly diminishing returns from increased reordering capacity.

Increasing ROB size implies beefing up other out-of-order buffers too. The renamed register files, scheduler, and load/store queues will have to get bigger, or they’ll fill up before the ROB does. That means IPC increases require disproportionate area tradeoffs, and Centaur’s approach has understandably been conservative.

The resulting CNS architecture ends up being very dense and well suited to CHA’s target environment, at the expense of flexibility. Under a heavy AVX load, the chip draws about 65 W at 2.2 GHz, but power draw at 2.5 GHz goes up to 140 W. Such a sharp increase in power consumption is a sign that the architecture won’t clock much further. That makes CNS completely unsuitable for consumer applications, where single threaded performance is of paramount importance.

At a higher level, CHA also ends up being narrowly targeted. Even though CNS cores are small, each CHA socket only has eight of them. That’s because Centaur spends area on NCore and uses a very small die. CHA therefore falls behind in multi-threaded CPU throughput, and won’t be competitive in applications that don’t use NCore’s ML acceleration capabilities.

Perhaps the biggest takeaway from CNS is that it’s possible to implement powerful vector units in a physically small core, as long as clock speed and IPC targets are set with density in mind.
| | Core Width, Integer ALUs, and ROB Capacity | Vector Execution | Clock Speed |
|---|---|---|---|
| Centaur CNS | 4-wide, 4 ALUs, 192 entry ROB | 2×256-bit FMA, 3×256-bit vector integer | 2.5 GHz |
| Haswell | 4-wide, 4 ALUs, 192 entry ROB | 2×256-bit FMA, 2×256-bit vector integer | 4.4 GHz (client) |
| Zen 1 | 5-wide, 4 ALUs, 192 entry ROB | 2×128-bit FMA + 2×128-bit FADD, 3×128-bit vector integer | 4.2 GHz (HEDT) |
| Skylake (Client) | 4-wide, 4 ALUs, 224 entry ROB | 2×256-bit FMA, 3×256-bit vector integer | 5+ GHz |
| Tremont | 4-wide, 4 ALUs, 208 entry ROB | 1×128-bit FADD + 1×128-bit FMUL, 2×128-bit vector integer | 3 GHz |
Surface-level look at core width/reordering capacity, vector execution, and clock speeds

Centaur did all this with a team of around 100 people, on a last-generation process node. That means a density oriented design is quite achievable with limited resources. I wonder what AMD or Intel could do with their larger engineering teams and access to cutting edge process nodes, if they prioritized density.

What Could Have Been?

Instead of drawing a conclusion, let’s try something a little different. I’m going to speculate and daydream about how else CNS could have been employed.

Pre-2020 – More Cores?

CHA tries to be a server chip with the die area of a client CPU. Then it tries to stuff a machine learning accelerator on the same die. That’s crazy. CHA wound up with fewer cores than Intel or AMD server CPUs, even though CNS cores prioritized density. What if Centaur didn’t try to create a server processor with the area of a client die?

If we quadruple CHA’s 8-core CPU complex along with its cache, that would require about 246 mm2 of die area. Of course, IO and connectivity would require space as well. But it’s quite possible for a 32 core CNS chip to be implemented using a much smaller die than, say, a 28-core Skylake-X chip.
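The 246 mm2 figure follows directly from the die breakdown above, assuming the eight CNS cores plus their L3 measure roughly 61.5 mm2 (about 32% of the 194 mm2 die, in line with the “about a third” estimate earlier):

$$4 \times 61.5\,\mathrm{mm^2} \approx 246\,\mathrm{mm^2}$$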

A fully enabled 28 core Skylake-X chip would likely outperform a hypothetical 32 core CNS one. But Intel uses 677.98 mm2 to do that. A few years ago, Intel also suffered from a shortage of production capacity on their 14nm process. All of that pushes up prices for Skylake-X. That gives Centaur an opportunity to undercut Intel. Using a much smaller die on a previous generation TSMC node should make for a cheap chip. Before 2019, AMD could also offer 32 cores per socket with their top end Zen 1 Epyc SKUs. Against that, CNS would compete by offering AVX-512 support and better vector performance.

Post 2020 – A Die Shrink?

But AMD’s 2019 Zen 2 launch changes things. CNS’s ability to execute AVX-512 instructions definitely gives it an advantage. But Zen 2 has 256-bit execution units and registers just like CNS. More importantly, Zen 2 has a process node advantage. AMD uses that to implement a more capable out-of-order execution engine, more cache, and faster caches.
That puts CNS in a rough position.

Core for core, CNS stands no chance. Even if we had CNS running at 2.5 GHz, its performance simply isn’t comparable to Zen 2. It gets worse in file compression:

Even in Y-Cruncher, which is a best case for CNS, Zen 2’s higher clocks and SMT support let it pull ahead.

Worse, TSMC’s 7nm process lets AMD pack 64 of those cores into one socket. In my view, CNS has to be ported to a 7nm class process to have any chance after 2020. It’s hard to guess what that would look like, but I suspect a CNS successor on 7nm would fill a nice niche. It’d be the smallest CPU core with AVX-512 support. Maybe it could be an Ampere Altra competitor with better vector performance. Under Intel, CNS could bring AVX-512 support to E-Cores in Intel’s hybrid designs. That would fix Alder Lake’s mismatched ISA awkwardness. Compared to Gracemont, CNS probably wouldn’t perform as well in integer applications, but vector performance would be much better. And CNS’s small core area would be consistent with Gracemont’s area efficiency goal. Perhaps a future E-Core could combine Gracemont’s integer prowess with CNS’s vector units and AVX-512 implementation.
Centaur CHA’s Probably Unfinished Dual Socket Implementation
April 23, 2022 · clamchowder
Centaur’s CHA chip targets the server market with a low core count. Its dual socket capability is therefore quite important, because it’d allow up to 16 cores in a single CHA-based server. Unfortunately for Centaur, modern dual socket implementations are quite complicated. CPUs today use memory controllers integrated into the CPU chip itself, meaning that each CPU socket has its own pool of memory. If CPU cores on one socket want to access memory connected to another socket, they’ll have to go through a cross-socket link. That creates a setup with non-uniform memory access (NUMA). Crossing sockets will always increase latency and reduce bandwidth, but a good NUMA implementation will minimize those penalties.

Cross-Socket Latency

Here, we’ll test how much latency the cross-socket link adds by allocating memory on different nodes, and using cores on different nodes to access that memory. This is basically our latency test being run only at the 1 GB test size, because that size is large enough to spill out of any caches. And we’re using 2 MB pages to avoid TLB miss penalties. That’s not realistic for most consumer applications, which use 4 KB pages, but we’re trying to isolate NUMA-related latency penalties instead of showing memory latency that applications will see in practice.
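For the curious, here’s a rough sketch of what such a test boils down to (not our exact harness; the node numbers, iteration count, and the use of transparent hugepages via madvise are placeholders, and a real test randomizes the pointer chain to defeat prefetching). numa_alloc_onnode and numa_run_on_node are standard libnuma calls; link with -lnuma:

```cpp
#include <numa.h>
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>
#include <ctime>

// Pin the thread to one NUMA node, allocate a 1 GB buffer on another, and
// chase dependent pointers through it to measure load-to-use latency.
int main(int argc, char **argv) {
    int cpu_node = argc > 1 ? atoi(argv[1]) : 0;
    int mem_node = argc > 2 ? atoi(argv[2]) : 1;
    size_t size = 1ULL << 30;                       // 1 GB, far beyond any cache
    size_t count = size / sizeof(void *);

    numa_run_on_node(cpu_node);                     // run on the chosen node
    void **buf = (void **)numa_alloc_onnode(size, mem_node);
    if (!buf) return 1;
    madvise(buf, size, MADV_HUGEPAGE);              // ask for 2 MB pages (THP)

    // Simple strided pointer chain, one hop per 4 KB page
    size_t stride = 4096 / sizeof(void *);
    for (size_t i = 0; i < count; i += stride)
        buf[i] = &buf[(i + stride) % count];

    const long iters = 10000000;
    void *p = &buf[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) p = *(void **)p;   // dependent loads
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per load (cpu node %d -> memory node %d)\n",
           ns / iters, cpu_node, mem_node);
    return (p == nullptr);                          // keep the chase live
}
```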

Crossing sockets adds about 92 ns of additional latency, meaning that memory on a different socket takes almost twice as long to access. For comparison, Intel suffers less of a penalty.

On a dual socket Broadwell system, crossing sockets adds 42 ns of latency with the early snoop setting. Accessing remote memory takes 41.7% longer than hitting memory directly attached to the CPU. Compared to CNS, Intel is a mile ahead, partially because that early snoop mode is optimized for low latency. The other part is that Intel has plenty of experience working on multi-socket capable chips. If we go back over a decade to Intel’s Westmere based Xeon X5650, memory access latency on the same node is 70.3 ns, while remote memory access is 121.1 ns. The latency delta there is just above 50 ns. It’s worse than Broadwell, but still significantly better than Centaur’s showing.

Broadwell also supports a cluster-on-die setting, which creates two NUMA nodes per socket. In this mode, a NUMA node covers a single ring bus, connected to a dual channel DDR4 memory controller. This slightly reduces local memory access latency. But Intel has a much harder time with four pools of memory in play. Crossing sockets now takes almost as long as it does on CHA. Looking closer, we can see that memory latency jumps by nearly 70 ns when accessing “remote” memory connected to the same die. That’s bigger than the cross-socket latency delta, and suggests that Intel takes a lot longer to figure out where to send a memory request if it has three remote nodes to pick from.

Popping over to AMD, we have results from when we tested a Milan-X system on Azure. Like Broadwell’s cluster on die mode, AMD’s NPS2 mode creates two NUMA nodes within a socket. However, AMD seems to have very fast directories for figuring out which node is responsible for a memory address. Going from one half of a socket to another only adds 14.33 ns. The cross socket connection on Milan-X adds around 70-80 ns of latency, depending on which half of the remote socket you’re accessing.

To summarize, Centaur’s cross-node latency is mediocre. It’s worse than what we see from Intel or AMD, unless the Intel Broadwell system is juggling four NUMA nodes. But it’s not terrible for a company that has no experience in multi-socket designs.

Cross-Socket Bandwidth

Next, we’ll test bandwidth. Like with the latency test, we’re running our bandwidth test with different combinations of where memory is allocated and what CPU cores are used. The test size here is 3 GB, because that’s the largest size we have hardcoded into our bandwidth test. Size doesn’t really matter as long as it’s big enough to get out of caches. To keep things simple, we’re only testing read bandwidth.
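The bandwidth side looks similar in spirit: allocate on one node, stream reads from cores on another. A simplified sketch (again not the exact test; the real one uses hand-vectorized reads, and the node and thread counts here are placeholders). Link with -lnuma -pthread:

```cpp
#include <numa.h>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <thread>
#include <vector>

// Allocate a 3 GB buffer on mem_node, then have several threads running on
// cpu_node stream-read it while we time the whole pass.
int main() {
    const int cpu_node = 0, mem_node = 1, threads = 8;
    const size_t size = 3ULL << 30;                 // 3 GB
    const size_t n = size / sizeof(uint64_t);

    uint64_t *buf = (uint64_t *)numa_alloc_onnode(size, mem_node);
    if (!buf) return 1;
    for (size_t i = 0; i < n; i++) buf[i] = i;      // touch pages up front

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    std::vector<std::thread> pool;
    std::vector<uint64_t> sums(threads);
    for (int t = 0; t < threads; t++)
        pool.emplace_back([&, t] {
            numa_run_on_node(cpu_node);             // pin this thread's node
            size_t chunk = n / threads;
            size_t begin = t * chunk;
            size_t end = (t == threads - 1) ? n : begin + chunk;
            uint64_t s = 0;
            for (size_t i = begin; i < end; i++) s += buf[i];
            sums[t] = s;                            // keep the reads live
        });
    for (auto &th : pool) th.join();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("read bandwidth: %.1f GB/s\n", size / sec / 1e9);
    return 0;
}
```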

Centaur’s cross socket bandwidth is disastrously poor at just above 1.3 GB/s. When you can read faster from a good NVMe SSD, something is wrong. For comparison, Intel’s decade old Xeon X5650 can sustain 11.2 GB/s of cross socket bandwidth, even though its triple channel DDR3 setup only achieved 20.4 GB/s within a node. A newer design like Broadwell does even better.

With each socket represented by one NUMA node, Broadwell can get nearly 60 GB/s of read bandwidth from its four DDR4-2400 channels. Accessing that from a different socket drops bandwidth to 21.3 GB/s. That’s quite a jump over Westmere, showing the progress Intel has made over the past decade. If we switch Broadwell into cluster on die mode, each node of seven cores can still pull more cross-socket bandwidth than what Centaur can achieve. Curiously, Broadwell suffers a heavy penalty from crossing nodes within a die, with memory bandwidth cut approximately in half.

Finally, let’s have a look at AMD’s Milan-X:

Milan-X is a bandwidth monster compared to the other chips here. It has twice as many DDR4 channels as CHA and Broadwell, so high intra-node bandwidth comes as no surprise. Across nodes, AMD retains very good bandwidth when accessing the other half of the same socket. Across sockets, each NPS2 node can still pull over 40 GB/s, which isn’t far off CHA’s local memory bandwidth.

Core to Core Latency with Contested Atomics

Our last test evaluates cache coherency performance by using locked compare-and-exchange operations to modify data shared between two cores. Centaur does well here, with latencies around 90 to 130 ns when crossing sockets.
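A stripped-down version of this kind of test looks roughly like the following (a sketch, not the exact benchmark; the core numbers are placeholders you would set so the two threads land on different sockets). Build with g++ -O2 -pthread:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <thread>
#include <pthread.h>

// Two threads pinned to different cores take turns doing a locked
// compare-and-swap on the same cache line; we time the round trips.
std::atomic<uint64_t> flag{0};

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void bouncer(int cpu, uint64_t my_turn, uint64_t iters) {
    pin_to_cpu(cpu);
    for (uint64_t i = 0; i < iters; i++) {
        uint64_t expected = my_turn;
        // lock cmpxchg: spin until the other core hands the line back to us
        while (!flag.compare_exchange_weak(expected, my_turn + 1))
            expected = my_turn;
        my_turn += 2;
    }
}

int main() {
    const uint64_t iters = 1000000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    std::thread a(bouncer, 0, (uint64_t)0, iters);  // e.g. a core on socket 0
    std::thread b(bouncer, 8, (uint64_t)1, iters);  // e.g. a core on socket 1
    a.join(); b.join();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per handoff\n", ns / (2.0 * iters));
    return 0;
}
```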

The core to core latency plot above is similar to Ampere Altra’s, where cache coherency operations on a cache line homed to a remote socket require a round trip over the cross-socket interconnect, even when the two cores communicating with each other are on the same chip. However, absolute latencies on CHA are far lower, thanks to CHA having far fewer cores and a less complex topology. Intel’s Westmere architecture from 2010 is able to do better than CHA when cache coherency goes across sockets. They’re able to handle cache coherency operations within a die (likely at the L3 level) even if the cache line is homed to a remote socket.

But this sort of excellent cross socket performance isn’t typical. Westmere likely benefits because all off-core requests go through a centralized global queue. Compared to the distributed approach used since Sandy Bridge, that approach suffers from higher latency and low bandwidth for regular L3 accesses. But its simplicity and centralized nature likely enables excellent cross-socket cache coherency performance.

Broadwell’s cross-socket performance is similar to CHA’s. By looking at results from both cluster on die and early snoop modes, we can clearly see that the bulk of core to core cache coherence latency comes from how directory lookups are performed. If the transfer happens within a cluster on die node, coherency is handled via the inclusive L3 and its core valid bits. If the L3 is missed, the coherency mechanism is much slower. Intra-die, cross-node latencies are already over 100 ns. Crossing dies only adds another 10-20 ns. Early snoop mode shows that intra-die coherency can be quite fast. Latencies stay within the 50 ns range or under, even when rings are crossed. However, early snoop mode increases cross socket latency to about 140 ns, making it slightly worse than CNS’s.

We don’t have clean results from Milan-X because hypervisor core pinning on the cloud instance was a bit funky. But our results were roughly in line with Anandtech’s results on Epyc 7763. Intra-CCX latencies are very low. Cross-CCX latencies within a NPS2 node were in the 90 ns range. Crossing NPS2 nodes brought latencies to around 110 ns, and crossing sockets resulted in ~190 ns latency. Centaur’s cross socket performance in this category is therefore better than Epyc’s.

More generally, CHA puts in its best showing in this kind of test. It’s able to go toe to toe with AMD and Intel systems that smacked it around in our “clean” memory access tests. “Clean” here means that we don’t have multiple cores writing to the same cache line. Unfortunately for Centaur, we’ve seen exactly zero examples of applications that benefit from low core-to-core latency. Contested atomics just don’t seem to be very common in multithreaded code.

Final Words

Before CNS, Centaur focused on low power consumer CPUs with products like the VIA Nano. Some of that experience carries over into server designs. After all, low power consumption and small core size are common goals. But go beyond the CPU core, and servers are a different world. Servers require high core counts, lots of IO, and lots of memory bandwidth. They also need to support high memory capacity.

CHA delivers on some of those fronts. It can support hundreds of gigabytes of memory per socket. Its quad channel DDR4 memory controller and 44 PCIe lanes give adequate but not outstanding off-chip bandwidth. CHA is also the highest core count chip created by Centaur. But eight cores is a bit low for the server market today. Dual socket support could partially mitigate that.

Unfortunately, the dual socket work appears to be unfinished. CHA’s low cross socket bandwidth will cause serious problems, especially for NUMA-unaware workloads. It also sinks any possibility of using the system in interleaved mode, where accesses are striped across sockets to provide more bandwidth to NUMA-unaware applications at the expense of higher latency.

So what went wrong? Well, remember that cross socket accesses suffer extra latency. That’s common to all platforms. But achieving high bandwidth over a long latency connection requires being able to queue up a lot of outstanding requests. My guess is that Centaur implemented a queue in front of their cross socket link, but never got around to validating it. Centaur’s small staff and limited resources were probably busy covering all the new server-related technologies. What we’re seeing is probably a work in progress.
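A back-of-the-envelope Little’s Law estimate shows just how starved that link is for outstanding requests. Assuming a remote access takes on the order of 190 ns (local latency plus the ~92 ns penalty measured earlier), sustaining only 1.3 GB/s implies roughly four 64-byte cache lines in flight:

$$\frac{1.3\,\mathrm{GB/s} \times 190\,\mathrm{ns}}{64\,\mathrm{B}} \approx 3.9\ \text{lines in flight}$$

Getting Broadwell-like cross-socket bandwidth of around 20 GB/s over the same latency would need on the order of 60 lines in flight, which fits the guess that the request queue in front of the link never made it to a working state.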

Centaur has implemented the protocols and coherence directories necessary to make multiple sockets work. And they work with reasonably good latency. Unfortunately, the cross-socket work can’t be finished because Centaur doesn’t exist anymore, so we’ll never see CHA’s full potential in a dual socket setup. Special thanks to Brutus for setting the system up and running tests on it.