BROOM: An Open-Source Out-of-Order Processor With Resilient Low-Voltage Operation in 28-nm CMOS (Part 1)

2021-06-02 21:34 | Author: Iammyself001

BROOM: An Open-Source Out-of-Order Processor With Resilient Low-Voltage Operation in 28-nm CMOS

Abstract:

The Berkeley resilient out-of-order machine (BROOM) is a resilient, wide-voltage-range implementation of an open-source out-of-order (OoO) RISC-V processor implemented in an ASIC flow. A 28-nm test chip contains a BOOM OoO core and a 1-MiB level-2 (L2) cache, enhanced with architectural error tolerance for low-voltage operation. It was implemented using an agile design methodology, in which the initial OoO architecture was transformed to perform well in a high-performance, low-leakage CMOS process, informed by synthesis, place, and route data obtained with a foundry-provided standard-cell library and memory compiler. The productivity of the two-person team was improved in part thanks to a number of open-source artifacts: the Chisel hardware construction language, the RISC-V instruction set architecture, the Rocket-chip SoC generator, and the open-source BOOM core. The resulting chip, taped out in TSMC's 28-nm HPM process, runs at 1.0 GHz at 0.9 V and is able to operate down to 0.47 V.

Key Words: Open source software, Random access memory, Design methodology, CMOS process, Generators, Voltage control, Agile software development

RISC-V is an open-source instruction set architecture (ISA) that is gaining wide attention. There are several open-source and commercial in-order cores that implement the RISC-V ISA; however, there is a need for high-performance cores. BOOM is a synthesizable, parameterized, superscalar out-of-order (OoO) RISC-V core that was originally designed to serve as the prototypical baseline processor for future microarchitectural studies of OoO processors. Its original goal was to provide a readable, open-source implementation for use in education, research, and industry, and it had been evaluated through educational standard-cell libraries. The Berkeley Resilient OoO Machine (BROOM) contains an evolved version of BOOM: the core has been transformed to explore the design space in a representative process for high-performance mobile applications. It has been designed in an ASIC flow, which enabled a rapid evaluation of changes to the RTL and physical design to improve the performance of the processor. Figure 1 shows the block diagram of the BROOM processor. BROOM consists of a single BOOM core and a 1-MiB L2 cache, each in its own clock and voltage domain.

Figure 1: BROOM chip block diagram, annotated placed-and-routed chip plot, and feature summary.

An additional feature of the test chip is its set of architectural resiliency techniques for operating the cache over a wide voltage range, enabling the processor to operate with high efficiency at low voltages.

BROOM was implemented using LVT-based standard cells and a foundry-provided memory compiler. The entire chip measures less than 2 mm × 3 mm and contains 72 million transistors. It is composed of 417 000 standard cells and 73 SRAM macros, of which the core and L1 caches account for 310 000 cells and 20 SRAM macros. The final sign-off in the slow-slow corner was at 1.68 ns. Figure 1 shows the placed-and-routed chip plot.
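
To put the implementation numbers above in perspective, the following minimal sketch converts the 1.68-ns sign-off period into a clock frequency and computes the share of the chip taken by the core and L1 caches. The variable names are illustrative and not from the paper. Note that the slow-slow corner is a pessimistic timing corner, so the resulting figure is a sign-off bound rather than the measured operating frequency.

```python
# Figures quoted in the text above.
SIGNOFF_PERIOD_NS = 1.68   # slow-slow corner sign-off period
TOTAL_CELLS = 417_000      # standard cells on the whole chip
CORE_CELLS = 310_000       # standard cells in the core and L1 caches
TOTAL_SRAMS = 73           # SRAM macros on the whole chip
CORE_SRAMS = 20            # SRAM macros in the core and L1 caches

# Period in ns -> frequency in MHz (1 / 1 ns = 1000 MHz).
signoff_freq_mhz = 1000.0 / SIGNOFF_PERIOD_NS

# Fraction of standard cells and SRAM macros inside the core and L1 caches.
core_cell_share = CORE_CELLS / TOTAL_CELLS
core_sram_share = CORE_SRAMS / TOTAL_SRAMS

print(f"sign-off frequency: {signoff_freq_mhz:.1f} MHz")
print(f"core + L1 share of cells: {core_cell_share:.1%}")
print(f"core + L1 share of SRAM macros: {core_sram_share:.1%}")
```

The core and L1 caches thus hold roughly three quarters of the standard cells but only about a quarter of the SRAM macros, which is consistent with the 1-MiB L2 cache being dominated by memory macros.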

SECTION 2

LEVERAGING OPEN-SOURCE INFRASTRUCTURE

BOOM implements the open-source RISC-V ISA, which was designed from the ground up to enable technology-driven computer architecture research. The clean and simple design of RISC-V allows for a focus on the processor without getting weighed down by awkward instructions that demand undue attention, or spending extra effort managing software ports.

BOOM is written in Chisel, an open-source hardware construction language developed to enable advanced hardware design. Chisel allows designers to use concepts such as object orientation, functional programming, parameterized types, and type inference, which make it easier to implement highly parameterized hardware generators. However, Chisel is not a high-level synthesis language: the primitives it provides are, for example, registers, wires, and memories. One of Chisel's strengths is its focus on generating well-formed, synthesizable Verilog. This feature decreased design risk. Chisel also brings software-development-level productivity to RTL coding, and encourages focusing implementation effort on writing generators rather than a single design instance. For example, the open-source RISC-V Rocket-chip generator presents a template for designing systems-on-a-chip (SoCs). Rocket-chip supports coherent multilevel caches and standard interconnects. BOOM makes significant use of Rocket-chip as a library: the caches, the uncore, and the functional units all derive from Rocket. In total, over 11 500 lines of code (LOC) are instantiated by BOOM from the Rocket-chip repository.

SECTION 3

BOOM CORE

The initial BOOM architecture is inspired by the MIPS R10K and Alpha 21264 processors from the 1990s, whose design teams provided relatively detailed insight into their processors' microarchitectures.[1][2] However, both processors relied on custom, dynamic logic, which allowed them to achieve very high clock frequencies despite their very short pipelines. The seven-stage Alpha 21264 has 15 fanout-of-four (FO4) inverter delays. As a comparison, the synthesizable Tensilica Xtensa processor, fabricated in a 0.25-μm ASIC process and contemporary with the Alpha 21264, was estimated to have roughly 44 FO4 delays.[3]

As BOOM is a synthesizable processor, we must rely on microarchitecture-level techniques to address critical paths, and add more pipeline stages to trade off instructions per cycle (IPC), cycle time (frequency), and design complexity. However, as process nodes have become smaller, transistor leakage and variability have increased and power efficiency has become more restrictive, so many of the more aggressive custom techniques have become more difficult and expensive to apply.[4] Modern high-performance processors have largely limited their custom design efforts to more regular structures such as memories and register files.

We began our design efforts with BOOMv1, a version of BOOM whose implementation was informed by educational technology libraries and CACTI cache models. BOOMv1 follows the 6-stage pipeline structure of the MIPS R10K: fetch, decode/rename, issue/register-read, execute, memory, and writeback. For design simplicity, all uops are placed into a single unified issue window. Likewise, all physical registers (both integer and floating-point registers) are located in a single unified physical register file. BOOMv1 also used a short 2-stage front-end pipeline. Conditional branch prediction occurs after the branches have been decoded.

The design of BOOMv1 was partly informed by using educational technology libraries in conjunction with synthesis-only tools. BOOMv1 used CACTI [5] to analytically model the characteristics of memories; CACTI is oriented toward single-port, cache-sized SRAMs. However, BOOM makes use of a multitude of smaller, irregular SRAMs for modules such as branch predictor (BPD) tables and address target buffers. Figure 2 lists all of the SRAM macros used within the BOOM core.

Figure 2: Final BOOM core configuration used in the BROOM chip, as well as the configurations used for each of the SRAM macros used within the BOOM core.

Timing analysis of BOOMv1 in TSMC 28-nm HPM identified the following critical paths:

issue window select;
register rename busy-table read;
conditional BPD redirect;
register file read.

The last path (register-read) only showed up as critical during post-place-and-route analysis.

SECTION 4

BOOMv2: IMPROVING BOOM's QUALITY-OF-RESULTS

BOOMv2 is an update to BOOMv1 based on information collected through synthesis, place, and route using a commercial TSMC 28-nm process. We performed the design space exploration using standard single- and dual-ported memory compilers provided by the foundry, and by hand-crafting a standard-cell-based multiported register file.

Migration to BOOMv2 involved 4948 added and 2377 deleted LOC out of a total code base of 16 000 LOC. The following sections describe some of the major changes that comprise the BOOMv2 update.

4.1 Frontend (Instruction Fetch)

Processor performance is best when the frontend provides an uninterrupted stream of instructions. This requires the frontend to use branch prediction techniques to predict which path the instruction stream will take, long before the branch can be properly resolved. A number of different predictors are used, each trading off accuracy, area, critical path cost, and pipeline penalty when making a prediction.

The Branch Target Buffer (BTB) maintains a set of tables mapping instruction addresses to branch targets. Some hysteresis bits are used to help guide the taken/not-taken decision of the BTB in the case of a tag hit. The BTB is a very expensive structure, because each BTB entry contains both a tag and a target. The BTB also contains a return address stack for predicting function returns.

To improve a critical path and increase capacity, we replaced BOOMv1's fully tagged, fully associative BTB design with a partially tagged, set-associative BTB. We also implemented the new BTB using single-ported SRAM macros instead of flip-flops.
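
The idea of a partially tagged, set-associative BTB can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the BOOM implementation: the class name, the table sizes, the 4-byte instruction alignment, and the round-robin replacement policy are all hypothetical choices made for the example.

```python
class PartiallyTaggedBTB:
    """Toy set-associative BTB that keeps only a few tag bits per entry."""

    def __init__(self, num_sets=64, num_ways=2, tag_bits=8):
        assert num_sets & (num_sets - 1) == 0, "num_sets must be a power of two"
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.index_bits = num_sets.bit_length() - 1
        self.tag_mask = (1 << tag_bits) - 1
        # Each way holds a (partial_tag, target) pair, or None when empty.
        self.sets = [[None] * num_ways for _ in range(num_sets)]
        # Round-robin victim pointer per set (an assumption for this sketch).
        self.next_victim = [0] * num_sets

    def _split(self, pc):
        word = pc >> 2  # drop the byte offset, assuming 4-byte instructions
        index = word & (self.num_sets - 1)
        partial_tag = (word >> self.index_bits) & self.tag_mask
        return index, partial_tag

    def lookup(self, pc):
        """Return the predicted target on a (partial) tag hit, else None."""
        index, tag = self._split(pc)
        for entry in self.sets[index]:
            if entry is not None and entry[0] == tag:
                return entry[1]
        return None

    def update(self, pc, target):
        """Record a taken branch, overwriting a matching or victim way."""
        index, tag = self._split(pc)
        ways = self.sets[index]
        for way, entry in enumerate(ways):
            if entry is not None and entry[0] == tag:
                ways[way] = (tag, target)
                return
        victim = self.next_victim[index]
        ways[victim] = (tag, target)
        self.next_victim[index] = (victim + 1) % self.num_ways


btb = PartiallyTaggedBTB(num_sets=4, num_ways=2, tag_bits=2)
btb.update(0x1000, 0x2000)
hit = btb.lookup(0x1000)    # same address: hit
miss = btb.lookup(0x1004)   # different set: miss
alias = btb.lookup(0x1040)  # same set and same partial tag: false hit
```

Storing fewer tag bits makes each entry smaller, at the price that two different addresses can match the same entry, as the last lookup shows. A false hit only causes a misprediction, which the pipeline already has to recover from.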

The Conditional BPD maintains a set of prediction and hysteresis tables to make taken/not-taken predictions based on a look-up address. The BPD only makes taken/not-taken predictions, so it relies on some other agent to provide information on which instructions are branches and what their targets are. The BPD can either use the BTB for this information or wait and decode the instructions itself. Because the BPD does not store the branch targets, it can be much denser and more accurate than the BTB.

BOOM uses a global history predictor, which works by tracking the outcome of the last N branches in the program and hashing this history with the look-up address to compute an index into the prediction tables. BOOM's predictor tables are implemented using single-ported SRAMs. Although many prediction tables are conceptually "tall and skinny" matrices (thousands of 2- or 4-bit entries), a generator written in Chisel transforms the predictor tables into a square memory structure to best match the SRAMs provided by a memory compiler.
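
The two ideas in this paragraph, indexing by a hash of global history and address and reshaping a tall table into a square one, can be sketched as follows. This is an illustration only: the XOR hash, the 2-bit saturating counters, the table sizes, and the helper `square_shape` are assumptions, since the text does not specify BOOM's hash function or the Chisel generator's exact reshaping rule.

```python
class GlobalHistoryPredictor:
    """Toy predictor: 2-bit counters indexed by hash(global history, address)."""

    def __init__(self, history_bits=8, index_bits=10):
        self.history_mask = (1 << history_bits) - 1
        self.index_mask = (1 << index_bits) - 1
        self.history = 0  # outcomes of the last N branches, newest in bit 0
        self.counters = [1] * (1 << index_bits)  # start weakly not-taken

    def _index(self, pc):
        # XOR hash of address and history (an assumption for this sketch).
        return ((pc >> 2) ^ self.history) & self.index_mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Train the entry that was used for the prediction, then shift history.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def square_shape(num_entries, entry_bits):
    """Return (rows, row_width_bits) for a near-square array.

    Packs a power-of-two number of entries into each row until the row is
    at least as wide as the array is tall. Assumes num_entries is a power
    of two.
    """
    entries_per_row = 1
    while num_entries // entries_per_row > entries_per_row * entry_bits:
        entries_per_row *= 2
    return num_entries // entries_per_row, entries_per_row * entry_bits


predictor = GlobalHistoryPredictor(history_bits=4, index_bits=6)
for _ in range(10):
    predictor.update(0x400, True)       # a branch that is always taken
learned = predictor.predict(0x400)

rows, row_bits = square_shape(4096, 2)  # 4096 two-bit entries
```

In this sketch a 4096-entry, 2-bit table becomes 64 rows of 128 bits, so a read fetches one row and a column select picks the 2-bit entry out of it.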

We found a critical path in BOOMv1 to be the BPD making a prediction and redirecting the fetch instruction address, as the BPD must first decode the newly fetched instructions and compute potential branch targets before it can redirect fetch. For BOOMv2, we moved the BPD array access back a stage so that it now operates in parallel with instruction decode. The final prediction and redirection are then performed at the beginning of the following stage (see Figure 2). Moving the BPD redirection back a cycle also gave us the freedom to provide a full cycle for the hash indexing function, which takes the hashing off the critical path of Next-PC selection. However, pushing the BPD redirection back a stage comes at the cost of an extra bubble on BPD redirections.

4.2 Distributed Issue Windows

The issue window holds all in-flight, unexecuted micro-ops (uops). For BOOM, the issue window is implemented as a collapsing queue, which allows the oldest instructions to be compressed toward the top. For issue-select, a cascading priority encoder selects the oldest instruction that is ready to issue. This path is exacerbated either by increasing the number of entries or by increasing the number of issue ports. For BOOMv1, our synthesizable implementation of a 20-entry issue window with three issue ports was found to be too aggressive, so we switched to three distributed issue windows with 16 entries each (separate windows for integer, memory, and floating-point operations). This removes issue-select from the critical path while also increasing the total number of instructions that can be scheduled. However, to maintain the performance of executing two integer ALU instructions and one memory instruction per cycle, a common configuration of BOOM uses two issue-select ports on the integer issue window.
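
A behavioral sketch of a collapsing issue window with oldest-first selection is shown below. It is a software model for illustration, not the hardware: the class and method names are hypothetical, and the sequential loop stands in for the cascading priority encoder described above.

```python
class CollapsingIssueWindow:
    """Toy collapsing queue: index 0 always holds the oldest uop."""

    def __init__(self, num_entries=16, issue_ports=1):
        self.num_entries = num_entries
        self.issue_ports = issue_ports
        self.entries = []  # each entry: {"uop": name, "ready": bool}

    def dispatch(self, uop, ready=False):
        """Append a new uop at the young end; fail if the window is full."""
        if len(self.entries) >= self.num_entries:
            return False
        self.entries.append({"uop": uop, "ready": ready})
        return True

    def wakeup(self, uop):
        """Mark a waiting uop as ready (its operands have become available)."""
        for entry in self.entries:
            if entry["uop"] == uop:
                entry["ready"] = True

    def issue(self):
        """Select up to issue_ports ready uops, oldest first, then collapse."""
        selected = []
        remaining = []
        for entry in self.entries:  # scanning from oldest to youngest
            if entry["ready"] and len(selected) < self.issue_ports:
                selected.append(entry["uop"])
            else:
                remaining.append(entry)
        self.entries = remaining  # survivors shift toward the old end
        return selected


window = CollapsingIssueWindow(num_entries=4, issue_ports=2)
window.dispatch("a")                # oldest, still waiting on operands
window.dispatch("b", ready=True)
window.dispatch("c", ready=True)
window.dispatch("d", ready=True)
issued = window.issue()
left = [entry["uop"] for entry in window.entries]
```

The scan in `issue` visits every entry once per port decision, which mirrors why the hardware path grows with both the number of entries and the number of issue ports.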

4.3 Custom Bit-Array Register File Design

One of the critical components of an OoO processor, and the most challenging to synthesize in a standard ASIC flow, is the multiported register file. BOOM's register file required both microarchitectural adjustments and a semicustom physical design to achieve the desired performance. The design of a register file poses many challenges: reading data out of the register file is a critical path, and routing read data to functional units is a routing challenge. Both the number of registers and the number of ports further exacerbate the challenges of synthesizing the register file.

The first path to improving the register file design was purely microarchitectural. The issue-select and register-read stages were split into two separate stages, so that each now gets a full cycle to itself. The register count was lowered by splitting the unified physical register file into separate floating-point and integer register files. This split also reduces the read-port count by moving the three-operand fused multiply-add floating-point unit to the smaller floating-point register file.

The second path to improving the register file involved physical design. A significant problem in placing and routing a register file is routing many wires into a relatively dense regfile array. BOOMv2's 70-entry integer register file, with six read ports and three write ports, comes to 4480 bits, each needing 18 wires routed into and out of it. There is a mismatch between the synthesized array and the area needed to route all required wires, resulting in routing congestion.
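
The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below assumes that each port contributes two wires per bit, one enable or select wire and one data wire; that accounting is an inference that happens to reproduce the quoted 18 wires and is not stated in the text. The register width is likewise derived from the quoted totals.

```python
# Figures quoted in the text above.
NUM_REGISTERS = 70
READ_PORTS = 6
WRITE_PORTS = 3
TOTAL_BITS = 4480

# Register width implied by the totals: 4480 / 70 = 64 bits.
register_width = TOTAL_BITS // NUM_REGISTERS

# Assumption: each port needs one enable wire and one data wire per bit.
WIRES_PER_PORT = 2
wires_per_bit = WIRES_PER_PORT * (READ_PORTS + WRITE_PORTS)

# Total wire attachments the router has to make across the whole array.
total_wire_attachments = TOTAL_BITS * wires_per_bit

print(f"register width: {register_width} bits")
print(f"wires per bit cell: {wires_per_bit}")
print(f"wire attachments across the array: {total_wire_attachments}")
```

Under this assumption the array needs on the order of eighty thousand wire attachments packed into a small region, which gives a sense of why routing, rather than cell area, limits the design.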

The register file in this design was implemented by semicustom crafting of a register file bit out of foundry-provided standard cells (see Figure 3). The Chisel register file was blackboxed, and a lower level of hierarchy was manually described in structural Verilog, in which standard cells were instantiated to construct a bit-cell with its access ports. The bit-cells were preplaced, and the router then automatically routed the wires to complete the register file.

Figure 3: Register file bit manually crafted out of foundry-provided standard cells. Each read port provides a read-enable bit to signal a tri-state buffer to drive its port's read data line. The register file bits are laid out in an array, with placement guidance given to the place tools. The tools are then allowed to automatically route the 18 wires into and out of each bit block.

Although the register file bits are implemented in structural Verilog, the decode logic and peripheral circuitry are implemented in Chisel. We also implemented a behavioral model of the custom array in Chisel to verify the decode logic through RTL simulation, and then performed additional verification of the custom bit-array register file in gate-level simulation.

To support the target cycle time, the register file is implemented using hierarchical bitlines: the bits are divided into clusters, tristates drive the read ports inside each cluster, and muxes select the read data across clusters. This prevents the tristate buffers from having to drive each read wire across all 70 registers.
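
The hierarchical-bitline read can be modeled behaviorally for a single read port. This is a sketch only: the cluster size of 10 is a hypothetical value, since the text does not state how the 70 registers are grouped, and `None` is used to stand for an undriven (high-impedance) local bitline.

```python
def split_into_clusters(registers, cluster_size):
    """Group the register array into consecutive clusters."""
    return [registers[i:i + cluster_size]
            for i in range(0, len(registers), cluster_size)]


def hierarchical_read(registers, addr, cluster_size=10):
    """Read registers[addr] through local bitlines and a cross-cluster mux."""
    clusters = split_into_clusters(registers, cluster_size)
    cluster_id, offset = divmod(addr, cluster_size)

    # Stage 1: inside each cluster, at most one tristate drives the local
    # bitline. Clusters that do not hold the address leave theirs undriven.
    local_bitlines = []
    for cid, cluster in enumerate(clusters):
        value = None  # None models a high-impedance local bitline
        for i, reg in enumerate(cluster):
            read_enable = (cid == cluster_id and i == offset)
            if read_enable:
                value = reg
        local_bitlines.append(value)

    # Stage 2: a mux selects the driven local bitline across clusters.
    return local_bitlines[cluster_id]


regs = list(range(100, 170))  # 70 registers holding distinct values
value = hierarchical_read(regs, 37, cluster_size=10)

clusters = split_into_clusters(regs, 10)
tristates_per_bitline = max(len(cluster) for cluster in clusters)
```

With this grouping each local bitline is shared by 10 tristate buffers instead of 70, at the cost of one extra mux level across the 7 clusters.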

As a counterpoint, the smaller floating-point register file (three read ports, two write ports) is fully synthesized with no placement guidance. Aside from the integer register file and the SRAMs, no other logic in the Chisel design was implemented via Verilog blackboxes.
