HP-HP9000-PARISC-PA8000-CPU 001 (cropped)

The PA-8000 (PCX-U), code-named ''Onyx'', is a

microprocessor A microprocessor is a computer processor where the data processing logic and control is included on a single integrated circuit, or a small number of integrated circuits. The microprocessor contains the arithmetic, logic, and control circ ...

developed and fabricated by Hewlett-Packard (HP) that implemented the PA-RISC 2.0 instruction set architecture (ISA). Hunt 1995 It was a completely new design with no circuitry derived from previous PA-RISC microprocessors. The PA-8000 was introduced on 2 November 1995 when shipments began to members of the Precision RISC Organization (PRO). It was used exclusively by PRO members and was not sold on the merchant market. All follow-on PA-8x00 processors (PA-8200 to PA-8900, described further below) are based on the basic PA-8000 processor core. The PA-8000 was used by: * HP in its

HP 9000 HP 9000 is a line of workstation and server computer systems produced by the Hewlett-Packard (HP) Company. The native operating system for almost all HP 9000 systems is HP-UX, which is based on UNIX System V. The HP 9000 brand was introduced ...

workstations and servers *

NEC is a Japanese multinational information technology and electronics corporation, headquartered in Minato, Tokyo. The company was known as the Nippon Electric Company, Limited, before rebranding in 1983 as NEC. It provides IT and network soluti ...

in its TX7/P590 server * Stratus Technologies in its Continuum fault-tolerant servers

Description

The PA-8000 is a four-way superscalar microprocessor that executes instructions out-of-order and speculatively. These features were not found in previous PA-RISC implementations, making the PA-8000 the first PA-RISC CPU to break the tradition of using simple microarchitectures and high-clock rate implementation to attain performance.

Instruction fetch unit

The PA-8000 has a four-stage front-end. During the first two stages, four instructions are fetched from the instruction cache by the instruction fetch unit (IFU). The IFU contains the program counter, branch history table (BHT), branch target address cache (BTAC) and a four-entry

translation lookaside buffer A translation lookaside buffer (TLB) is a memory cache that stores the recent translations of virtual memory to physical memory. It is used to reduce the time taken to access a user memory location. It can be called an address-translation cache. ...

(TLB). The TLB is used to translate virtual address to physical addresses for accessing the instruction cache. In the event of a TLB miss, the translation is requested from the main TLB.

Branch prediction

The PA-8000 performs

branch prediction In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow ...

using static or dynamic methods. Which method the PA-8000 used was selected by a bit in each TLB entry. Static prediction considers most backwards branches as taken and forward branches as not taken. Static prediction also predicted the outcome of branches by examining hints encoded in the instructions themselves by the compiler. Dynamic prediction uses the recorded history of a branch to decide whether it is taken or not taken. A 256-entry BHT is where this information is stored. Each BHT entry is a three-bit

shift register A shift register is a type of digital circuit using a cascade of flip-flops where the output of one flip-flop is connected to the input of the next. They share a single clock signal, which causes the data stored in the system to shift from one loc ...

. The PA-8000 used a majority vote algorithm, a branch is taken if the majority of the three bits are set, and not taken if they are clear. A mispredicted branch causes a five-cycle penalty. The BHT is updated when the outcome of the branch is known. Although the PA-8000 can execute two branch instructions per cycle, only one of the outcomes is recorded as the BHT is not dual-ported to simplify its implementation. The PA-8000 has a two-cycle bubble for correctly predicted branches, as the target address of the branch must be calculated before it is sent to the instruction cache. To reduce the occurrence of this bubble, the PA-8000 uses a 32-entry fully associative BTAC. The BTAC caches a branch's target address. When the same branch is encountered, and is predicted as taken, the address is sent to the instruction cache immediately, allowing the fetch to begin without delay. To maximize the effectiveness of the BTAC, only the branch target of predicted-taken branches are cached. If a branch is predicted as not taken, but its target address is cached in the BTAC, its entry is deleted. In the event that the BTAC is full, and a new entry needs to be written, the entry that is replaced is selected using a round robin replacement policy.

Instruction cache

The instruction cache is external and supports a capacity of 256 KB to 4 MB. Instructions are pre-decoded before they enter the cache by adding five bits to each instruction. These bits reduce the amount of time required to decode the instruction later in the pipeline. The instruction cache is direct-mapped to avoid the complexity of set associative caches and is accessed via a 148-bit bus. The tags for the cache are also external. It is built from synchronous SRAMs (SSRAMs).

Decode, and the instruction reorder buffer

During the third stage, the instructions are decoded. In the fourth stage, they are placed in the instruction reorder buffer (IRB). The IRB's purpose is the implement

register renaming In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a partic ...

, out of order execution, speculative execution and to provide a temporary place for results to be stored until the instructions are retired. The IRB determines which instructions are issued during stage five. The IRB consists of two buffers, one for integer and floating-point instructions, the other for load and store instructions. Some instructions are placed into both buffers. These instructions are branch instructions and certain system instructions. Each buffer has 28 entries. Each buffer can accept up to four instructions per cycle and can issue up to two per a cycle to its functional units.

Execution

All instructions begin execution during stage six in the ten functional units. Integer instructions except for multiply are executed in two arithmetic logic units (ALUs) and two shift/merge units. All instructions executed in these units have a single-cycle latency and their results are written to the destination register in stage seven. Floating-point instructions and integer multiply instructions are executed in two fused multiply–accumulate (FMAC) units and two divide/square-root units. The FMAC units are pipelined and have a three-cycle latency. Multiplication is performed during stage six, addition in stage seven, rounding in stage eight and writeback in stage nine. There is no rounding between the multiply and accumulate stages. The FMAC units also execute individual multiply and add instructions, which also have a latency of three cycles for both

single-precision Single-precision floating-point format (sometimes called FP32 or float32) is a computer number format, usually occupying 32 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point. A floatin ...

and double-precision variants. The divide/square-root units are not pipelined and have a 17-cycle latency. One instruction can be issued to them per clock cycle due to register port limitations, but they can operate in parallel with each other and the FMAC units. Both integer and floating-point load and store instructions are executed by two dedicated address adders.

Translation lookaside buffer

The

(TLB) contains 96 entries and is dual-ported and full-associative. It can translate two virtual addresses per cycle. This TLB translates addresses for both instructions and data. When the IFU's TLB misses, this TLB provides the translation for it. Translation for loads and stores have a higher priority than those for instructions. Each TLB entry can be mapped to a

page Page most commonly refers to: * Page (paper), one side of a leaf of paper, as in a book Page, PAGE, pages, or paging may also refer to: Roles * Page (assistance occupation), a professional occupation * Page (servant), traditionally a young m ...

with a size between 4 KB to 16 MB, in increments that are powers of four.

Data cache

The PA-8000 has a data cache with a capacity up to 4 MB. The data cache is dual-ported, so two reads or writes can be performed during every cycle. It is dual-ported by implementing two banks of cache, thus it is not truly dual-ported because if two reads or writes reference the same bank, a conflict arises and only one operation can be performed. It is accessed by two 64-bit buses, one for each bank. The cache tags are external. There are two copies of the cache tags to allow independent accesses in each bank. The data cache is direct-mapped for the same reasons as the instruction cache. It is built from SSRAMs.

Runway bus

The external interface is the

Runway bus According to the International Civil Aviation Organization (ICAO), a runway is a "defined rectangular area on a land aerodrome prepared for the landing and takeoff of aircraft". Runways may be a man-made surface (often asphalt concrete, as ...

, a 64-bit address and data multiplexed bus. The PA-8000 uses a 40-bit

physical address In computing, a physical address (also real address, or binary address), is a memory address that is represented in the form of a binary number on the address bus circuitry in order to enable the data bus to access a ''particular'' storage cell ...

, thus it is able to address 1 TB of

physical memory Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a compute ...

Physical

HP-HP9000-PARISC-PA8000-CPU 010 (cropped)

The PA-8000 has 3.8 million transistors and measures 17.68 mm by 19.10 mm, Gaddis 1996, p. 1697 for an area of 337.69 mm². It was fabricated by HP in their CMOS-14C process, a 10% gate shrink of the CMOS-14 process. Gwennap 1994 The CMOS-14C process was a 0.5 μm, five-level aluminum interconnect, complementary metal–oxide–semiconductor (CMOS) process. The die has 704 solder bumps for signals and 1,200 for power or ground. It is packaged in a 1,085-pad

flip chip Flip chip, also known as controlled collapse chip connection or its abbreviation, C4, is a method for interconnecting dies such as semiconductor devices, IC chips, integrated passive devices and microelectromechanical systems (MEMS), to exter ...

alumina ceramic

land grid array The land grid array (LGA) is a type of surface-mount packaging for integrated circuits (ICs) that is notable for having the pins on the socket (when a socket is used) rather than the integrated circuit. An LGA can be electrically connected to a ...

(LGA). The PA-8000 uses a 3.3 V power supply.

PA-8200

The PA-8200 (PCX-U+), code-named ''Vulcan'', was a further development of the PA-8000. The first systems to use the PA-8200 became available in June 1997. The PA-8200 operated at 200 to 240 MHz and primarily competed with the

Alpha 21164 The Alpha 21164, also known by its code name, EV5, is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha instruction set architecture (ISA). It was introduced in January 1995, succeeding the Alp ...

. Improvements were made to branch prediction and the TLB. Scott 1997 Branch prediction was improved by quadrupling the number of BHT entries to 1,024, which required the use of a two-bit algorithm in order to fit without redesign of surrounding circuitry; and by implementing a write queue that enabled two branch outcomes to be recorded by the BHT instead of one. The number of TLB entries was increased to 120 entries from 96, which reduced TLB misses. The clock frequency was also improved through minor circuit redesign. The PA-8200's die was identical in size to the PA-8000 as improvements utilized empty areas of the die. It was fabricated in the CMOS-14C process.

PA-8500

The PA-8500 (PCX-W), code-named ''Barracuda'', is a further development of the PA-8200. It taped-out in early 1998 and was introduced in late-1998 within systems. Production versions operated at frequencies of 300 to 440 MHz, but it was designed to, and has, operated up to 500 MHz. Barnes 1999 The most notable improvements are the higher operating frequencies and the on-die integration of the primary caches. Gwennap 1997 The higher operating frequencies and the integration of the primary caches on the same die as the core was enabled by the migration to a 0.25 μm process. The PA-8500 core measured 10.8 mm by 11.4 mm (123.12 mm²) in the new process, less than half the area of the 0.5 μm PA-8200. This made area available that could be used for integrating the caches. The PA-8500 has a 512 KB instruction cache and a 1 MB data cache. Other improvements to the microarchitecture include a larger BHT containing 2,048 entries, twice the capacity of the PA-8200's, and a larger TLB containing 160 entries. The PA-8500 uses a new version of the

. The new version operates at 125 MHz and transfers data on both rising and falling edges of the clock signal (double data rate, or DDR) and yields 240 MT/s or 2 GB/s of bandwidth. As the Runway bus is used to transfer addresses and data, usable bandwidth is 80% that of 2 GB/s, or around 1.6 GB/s. The PA-8500 contains 140 million transistors and measures 21.3 mm by 22.0 mm (468.6 mm²). It was fabricated by Intel Corporation in a 0.25 μm CMOS process with five levels of aluminium interconnect. It uses a 2.0 V power supply. HP did not fabricate the PA-8500 themselves as they had ceased to upgrade their fabs to implement a process newer than CMOS-14C, which was used to fabricate previous PA-RISC microprocessors. The PA-8500 was packaged in a smaller 544-pad

(LGA) as the integration of the primary caches on die resulted in the removal of the two 128-bit buses which communicated with the external caches and their associated I/O pads.

PA-8600

The PA-8600 (PCX-W+), code-named ''Landshark'', is a further development of the PA-8500 introduced in January 2000. The PA-8600 was intended to be introduced in mid-2000.Wermer, "HP's PA-8600 processor earlier to ship than expected". It was a tweaked version of the PA-8500 to enable it to reach higher clock frequencies of 480 to 550 MHz. It improved the microarchitecture by using a quasi- least recently used (LRU) eviction policy for instruction cache. It was fabricated by Intel.

PA-8700

The PA-8700 (PCX-W2), code-named ''Piranha'', is a further development of the PA-8600. Introduced in August 2001, it operated at 625 to 750 MHz. Improvements were the implementation of data prefetching, a quasi-LRU replacement policy for the data cache, and a larger 44-bit physical address space to address 16 TB of physical memory. Krewell 2000 The PA-8700 also has larger instruction and data caches, increased in capacity by 50% to 0.75 MB and 1.5 MB, respectively. The PA-8700 was fabricated by IBM MicroelectronicsComputerWire 2002 in a 0.18 μm silicon on insulator (SOI) CMOS process with seven levels of copper interconnect and low-κ

dielectric In electromagnetism, a dielectric (or dielectric medium) is an electrical insulator that can be polarised by an applied electric field. When a dielectric material is placed in an electric field, electric charges do not flow through the mate ...

PA-8700+

The PA-8700+ was a further development of the PA-8700 introduced in systems in mid-2002. It operated at 875 MHz.

PA-8800

The PA-8800, code-named ''Mako'', is a further development of the PA-8700. Johnson 2001, p. 1 It was introduced in 2004 and was used by HP in their C8000 workstation and

HP 9000 Superdome The HP Superdome is a high-end server computer designed and manufactured by Hewlett Packard Enterprise (formerly Hewlett-Packard). The product's most recent version, "Superdome 2," was released in 2010. Superdome 2 supports 2 to 32 sockets (up to ...

servers. It was available at 0.8, 0.9 and 1.0 GHz. The PA-8800 was a

dual-core A multi-core processor is a microprocessor on a single integrated circuit with two or more separate processing units, called cores, each of which reads and executes program instructions. The instructions are ordinary CPU instructions (such ...

design consisting of two modified PA-8700+ microprocessors on a single die. Each core has a 768 KB instruction cache and a 768 KB data cache. The primary caches are smaller than those in the PA-8700 to enable both cores to fit on the same die. Improvements over the PA-8700 are improved branch prediction and the inclusion of an external 32 MB unified secondary cache. The secondary cache has a bandwidth of 10 GB/s and a latency of 40 cycles. It is 4-way set-associative, physically indexed and physically tagged with a line size of 128 bytes. The set-associativity was chosen to reduce the number of I/O pins. The L2 cache is implemented with using four 72 Mbit (9 MB) Enhanced Memory Systems Enhanced SRAM (ESRAM) chips, which despite its name, is an implementation of

1T-SRAM 1T-SRAM is a pseudo-static random-access memory (PSRAM) technology introduced by MoSys, Inc. in September 1998, which offers a high-density alternative to traditional static random-access memory (SRAM) in embedded memory applications. Mosys use ...

– dynamic random access memory (DRAM) with a SRAM-like interface. Access to this cache by each core is arbitrated by the on-die controller and the 1 MB of secondary cache tags also resides on-die as SRAM and is protected by ECC. The PA-8800 used the same

front side bus A front-side bus (FSB) is a computer communication interface ( bus) that was often used in Intel-chip-based computers during the 1990s and 2000s. The EV6 bus served the same function for competing AMD CPUs. Both typically carry data between the ...

as the McKinley

Itanium Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance comput ...

microprocessor, which yields 6.4 GB/s of bandwidth, and is compatible with HP's Itanium chipsets such as the zx1. It consisted of 300 million transistors, of which 25 million were for logic, on a 23.6 mm by 15.5 mm (365.8 mm²) die. It was fabricated by IBM in 0.13 μm SOI process with copper interconnects and low-κ dielectric. The PA-8800 is packaged in a ceramic ball grid array mounted on a printed circuit board (PCB) with the four ESRAMs, forming a module similar to those used by early Itanium microprocessors.

PA-8900

The PA-8900, code-named ''Shortfin'', was a derivative of the PA-8800. It was the last PA-RISC microprocessor to be developed and was introduced on 31 May 2005 when systems using the microprocessor became available. It was used in the

servers and the C8000 workstation. It operated at 0.8, 0.9, 1.0 and 1.1 GHz. It is not a

die shrink The term die shrink (sometimes optical shrink or process shrink) refers to the scaling of metal-oxide-semiconductor (MOS) devices. The act of shrinking a die is to create a somewhat identical circuit using a more advanced fabrication process, ...

of the PA-8800, as was earlier rumored. The L2 cache was doubled in capacity to 64 MB, has lower latency, and better error detection and correction on caches. It uses the McKinley system bus and was compatible with Itanium 2 chipsets such as the HP zx1. There were no microarchitecture changes, but the floating-point unit and on-die cache circuitry was redesigned to reduce power consumption, and each core subsequently dissipated approximately 35 W at 1.0 GHz.

Notes

References

* Barnes, Phillip (26 February 1999). "A 500 MHz 64 bit RISC CPU with 1.5Mbyte on chip Cache". ''Proceedings of the International Solid State Circuits Conference''. * ComputerWire (28 June 2002)
"HP readying dual-core PA-8800"
''

The Register ''The Register'' is a British technology news website co-founded in 1994 by Mike Magee, John Lettice and Ross Alderson. The online newspaper's masthead sublogo is "''Biting the hand that feeds IT''." Their primary focus is information te ...

''. * Gaddis, N.; Lotz, J. (November 1996). "A 64-b quad-issue CMOS RISC microprocessor". ''IEEE Journal of Solid-State Circuits'' 31 (11): pp. 1697–1702. * Gwennap, Linley (14 November 1994). "PA-8000 Combines Complexity and Speed". ''

Microprocessor Report ''Microprocessor Report'' is a newsletter covering the microprocessor industry. The publication is accessible only to paying subscribers. To avoid bias, it does not take advertisements. The publication provides extensive analysis of new high-perfo ...

'', Volume 8, Number 15. * Gwennap, Linley (28 October 1996). "HP Pumps Up PA-8x00 Family". ''

'', Volume 10, Number 14. * Gwennap, Linley (17 November 1997). "PA-8500's 1.5M Cache Aids Performance". ''

''. *

Hewlett-Packard Company The Hewlett-Packard Company, commonly shortened to Hewlett-Packard ( ) or HP, was an American multinational information technology company headquartered in Palo Alto, California. HP developed and provided a wide variety of hardware components ...

(2 November 1995). ''HP Announces Release of PA-8000 to PRO Partners''. (

Press Release A press release is an official statement delivered to members of the news media for the purpose of providing information, creating an official statement, or making an announcement directed for public release. Press releases are also considere ...

) * Hill, J. Michael and Lachman, Jonathan (2000)
"A 900MHz 2.25MByte Cache with On Chip CPU - Now in SOI/Cu"
''2000 International Solid-State Circuits Conference''. * Hunt, D. (1995). "Advanced performance features of the 64-bit PA-8000". ''Proceedings of CompCon''. pp. 123–128. * Johnson, David J. C. (16 October 2001). "HP's Mako Processor". ''2001 Microprocessor Forum''. * Krewell, Kevin (22 May 2000). "HP Extends PA-RISC With 8700". ''

''. * Kumar, Ashok (19 August 1996)
"The HP PA-8000 RISC CPU"
''Proceedings of Hot Chips VIII''. * Lesartre, Greg; Hunt, Doug (1997). "PA-8500: The Continuing Evolution of the PA-8000 Family". ''Proceedings of CompCon''. * Pountain, Dick (July 1995). "HP's Speedy RISC". ''

Byte The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable uni ...

''. * Scott, Anne P. et al. (August 1997). "Four-Way Superscalar PA-RISC Processors". ''Hewlett-Packard Journal''. * Tsai, Li C. (16 February 2001). "A 1GHz PA-RISC Processor". ''International Solid State Circuits Conference''. * Wermer, Sandra (8 March 1999). "HP's PA-8600 processor earlier to ship than expected". ''HOISe''.