The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It introduced the P6 microarchitecture and was intended to replace the original Pentium in a full range of applications. While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors respectively, the Pentium Pro contained 5.5 million transistors. It was later reduced to a narrower role as a server and high-end desktop processor, and was used in supercomputers like ASCI Red, the first computer to reach the teraFLOPS performance mark. The Pentium Pro was capable of both dual- and quad-processor configurations. It came in only one form factor, the large rectangular Socket 8, and was succeeded by the Pentium II Xeon in 1998. The lead architect of the Pentium Pro was Fred Pollack, who specialized in superscalar architecture and had been the lead engineer of the Intel iAPX 432. The Pentium Pro incorporated a new microarchitecture, different from the Pentium's P5 microarchitecture: a 14-stage superpipelined design that used an instruction pool.
The Pentium Pro featured many advanced concepts not found in the Pentium, although it was not the first or only x86 processor to implement them. Its pipeline had extra decode stages to dynamically translate IA-32 instructions into buffered micro-operation sequences, which could then be analysed and renamed in order to detect parallelizable operations that could be issued to more than one execution unit at once. The Pentium Pro thus featured out-of-order execution, including speculative execution via register renaming. It had a wider 36-bit address bus, allowing it to access up to 64 GB of memory. The Pentium Pro has an 8 KiB instruction cache, from which up to 16 bytes are fetched on each cycle and sent to the instruction decoders. There are three instruction decoders, which are not equal in capability: only one can decode any x86 instruction, while the other two can decode only simple x86 instructions. This restricts the Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution.
x86 instructions are decoded into 118-bit micro-operations. The micro-ops are RISC-like; the general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on memory can be processed only by the general decoder, as such an operation requires a minimum of three micro-ops; the simple decoders are limited to instructions that translate into a single micro-op. Instructions that require more than four micro-ops are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. The Pentium Pro was the first processor in the x86 family to support upgradeable microcode under BIOS and/or operating system control. Micro-ops exit the re-order buffer and enter a reservation station, where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one floating-point unit, a load unit, a store-address unit, and a store-data unit.
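The decode constraint above, one general decoder handling up to four micro-ops flanked by two simple decoders limited to one micro-op each, can be illustrated with a toy model. This is a deliberate simplification, not a cycle-accurate description; the function name and the in-order three-instruction window are assumptions for illustration only:

```python
def decode_cycle(uop_counts):
    """Micro-ops emitted in one cycle from a window of up to 3 instructions.

    uop_counts holds the number of micro-ops each pending x86 instruction
    decodes into, in program order. The first instruction may use the
    general decoder (up to 4 micro-ops); the next two may use the simple
    decoders only if they decode to a single micro-op each. Instructions
    needing more than 4 micro-ops go to the micro-op sequencer instead.
    Returns (micro_ops_emitted, instructions_consumed).
    """
    emitted = consumed = 0
    for i, uops in enumerate(uop_counts[:3]):
        if i == 0:
            if uops > 4:
                break  # handled by the sequencer over several cycles
            emitted += uops
        elif uops == 1:
            emitted += 1  # fits a simple decoder
        else:
            break  # complex instruction must wait for the general decoder
        consumed += 1
    return emitted, consumed

print(decode_cycle([2, 1, 1]))  # (4, 3): one complex plus two simple
print(decode_cycle([1, 2, 1]))  # (1, 1): the complex instruction stalls
```

The ordering sensitivity the model shows is real: compilers targeting P6-family cores commonly scheduled instructions so that each complex instruction was followed by simple ones, the so-called "4-1-1" pattern.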
One of the integer units shares a port with the FPU; therefore the Pentium Pro can dispatch only one integer micro-op and one floating-point micro-op, or two integer micro-ops, per cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only one has the full complement of functions such as a barrel shifter and divider; the second integer unit, which shares paths with the FPU, lacks these facilities and is limited to simple operations such as adds and the calculation of branch target addresses. The FPU executes floating-point operations. Addition and multiplication have latencies of three and five cycles, respectively. Division and square root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have latencies of 29 to 69 cycles; the smallest figure applies to single-precision floating-point numbers and the largest to extended-precision numbers. Division and square root can operate concurrently with adds and multiplies, blocking them only when the result has to be stored in the re-order buffer.
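The port-sharing constraint can likewise be sketched. The port names and the greedy assignment below are illustrative assumptions, not Intel's actual port numbering; the point is only that two integer micro-ops, or one integer plus one floating-point micro-op, can dispatch together, but two floating-point micro-ops cannot:

```python
def can_dispatch(uops):
    """Whether a set of micro-ops can all dispatch in the same cycle.

    uops is a list drawn from: "int", "fp", "load", "store_addr",
    "store_data". Each port serves at most one micro-op per cycle.
    """
    ports = {
        "int_full": {"int"},       # full integer unit (shifter, divider)
        "shared": {"int", "fp"},   # integer unit sharing paths with the FPU
        "load": {"load"},
        "store_addr": {"store_addr"},
        "store_data": {"store_data"},
    }
    free = set(ports)
    for uop in uops:
        # alphabetical order tries "int_full" before "shared", so integer
        # micro-ops leave the shared port free for floating point
        for port in sorted(free):
            if uop in ports[port]:
                free.discard(port)
                break
        else:
            return False  # no free port can serve this micro-op
    return True

print(can_dispatch(["int", "fp", "load", "store_addr", "store_data"]))  # True
print(can_dispatch(["fp", "fp"]))                                       # False
```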
After the microprocessor was released, a bug was discovered in the floating-point unit, commonly known as the "Pentium Pro and Pentium II FPU bug" and referred to by Intel as the "flag erratum". The bug occurs under some circumstances during floating point-to-integer conversion when the floating-point number will not fit into the smaller integer format, causing the FPU to deviate from its documented behaviour. The bug is considered minor and occurs under such special circumstances that few, if any, software programs are affected. The Pentium Pro's P6 microarchitecture was used in one form or another by Intel for more than a decade. The pipeline would scale from its initial 150 MHz start all the way up to 1.4 GHz with the "Tualatin" Pentium III. The design's various traits would continue after that in the derivative core called "Banias" in the Pentium M and Intel Core, which itself would evolve into the Core microarchitecture in 2006 and onward. Despite being advanced for the time, the Pentium Pro's out-of-order register-renaming architecture had trouble running 16-bit code and mixed code, as the use of partial registers caused frequent pipeline flushes.
Intel's i960 was a RISC-based microprocessor design that became popular during the early 1990s as an embedded microcontroller. It became a best-selling CPU in that segment, along with the competing AMD 29000. In spite of its success, Intel stopped marketing the i960 in the late 1990s as a result of a settlement with DEC whereby Intel received the rights to produce the StrongARM CPU; the processor continues to be used for a few military applications. The i960 design was begun in response to the failure of Intel's iAPX 432 design of the early 1980s. The iAPX 432 was intended to directly support high-level languages that used tagged, garbage-collected memory, such as Ada and Lisp, in hardware. Because of its instruction-set complexity, its multi-chip implementation, and design flaws, the iAPX 432 was slow in comparison to other processors of its time. In 1984, Intel and Siemens started a joint project called BiiN to create a high-end, fault-tolerant, object-oriented computer system programmed in Ada.
Many of the original i432 team members joined this project, although a new lead architect, Glenford Myers, was brought in from IBM. The intended market for the BiiN systems was high-reliability computer users such as banks, industrial systems, and nuclear power plants. Intel's major contribution to the BiiN system was a new processor design, influenced by the protected-memory concepts from the i432. The new design was to include a number of features to improve performance and avoid the problems that had led to the i432's downfall. The first 960 processors entered the final stage of design, known as tape-out, in October 1985 and were sent to manufacturing that month, with the first working chips arriving in late 1985 and early 1986. When the BiiN effort failed due to market forces, the 960MX was left without a use. Myers attempted to save the design by extracting several subsets of the full capability architecture created for the BiiN system. He tried to convince Intel management to market the i960 as a general-purpose processor, both in place of the Intel 80286 and i386 and in the emerging RISC market for Unix systems, including a pitch to Steve Jobs for use in the NeXT system.
Competition within and outside of Intel came not only from the i386 camp but also from the i860, yet another RISC processor design emerging within Intel at the time. Myers was unsuccessful at convincing Intel management to support the i960 as a general-purpose or Unix processor, but the chip found a ready market in early high-performance 32-bit embedded systems. The lead architect of the i960 was superscalarity specialist Fred Pollack, who had been the lead engineer of the Intel iAPX 432 and would later be the lead architect of the i686 chip, the Pentium Pro. To avoid the performance issues that plagued the i432, the central i960 instruction-set architecture was a RISC design, implemented in full only in the i960MX. The memory subsystem was 33 bits wide, to accommodate a 32-bit word plus a "tag" bit that implemented memory protection in hardware. In many ways, the i960 followed the original Berkeley RISC design, notably in its use of register windows, an implementation-specific number of on-chip caches for the per-subroutine registers that allowed for fast subroutine calls.
The competing Stanford University design, MIPS, did not use this system, instead relying on the compiler to generate optimal subroutine call and return code. In common with most 32-bit designs, the i960 has a flat 32-bit memory space with no memory segmentation. The i960 architecture anticipated a superscalar implementation, with instructions being dispatched to more than one unit within the processor. The "full" i960MX was never released for the non-military market, but the otherwise identical i960MC was used in high-end embedded applications; the i960MC included all of the features of the original BiiN system. A version of the RISC core without memory management or an FPU became the i960KA, and the RISC core with an FPU became the i960KB. The versions were identical internally; only the labeling was different. This meant the CPUs were much larger than necessary for the "actually supported" feature sets and, as a result, more expensive to manufacture than they needed to be. The i960KA became successful as a low-cost 32-bit processor for the laser-printer market, as well as for early graphics terminals and other embedded applications.
Its success paid for future generations. The i960CA, first announced in July 1989, was the first pure RISC implementation of the i960 architecture. It featured a newly designed superscalar RISC core and added an unusual addressable on-chip cache, but lacked an FPU and MMU, as it was intended for high-performance embedded applications. The i960CA is considered to have been the first single-chip superscalar RISC implementation. The C-series included only one ALU, but could dispatch and execute an arithmetic instruction, a memory reference, and a branch instruction at the same time, and could sustain two instructions per cycle under certain circumstances. The first versions released ran at 33 MHz, and Intel promoted the chip as capable of 66 MIPS. The i960CA microarchitecture was designed in 1987–1988 and formally announced on September 12, 1989. In May 1992 came the i960CF, which included a larger instruction cache and added 1 KB of data cache, but was still without an FPU or MMU. The 80960Jx is a processor for embedded applications.
It features a 32-bit multiplexed address/data bus, a data cache, and 1 KB of on-chip RAM.
x86 is a family of instruction set architectures based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was introduced in 1978 as a 16-bit extension of Intel's 8-bit 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186, 80286, 80386 and 80486 processors. Many additions and extensions have been added to the x86 instruction set over the years, almost always with full backward compatibility. The architecture has been implemented in processors from Intel, Cyrix, AMD, VIA and many other companies; of those, only Intel, AMD and VIA hold x86 architectural licenses and produce modern 64-bit designs. The term is not synonymous with IBM PC compatibility, as that implies a multitude of other computer hardware. As of 2018, the majority of personal computers and laptops sold are based on the x86 architecture, while other categories—especially high-volume mobile categories such as smartphones or tablets—are dominated by ARM.
In the 1980s and early 1990s, when the 8088 and 80286 were still in common use, the term x86 represented any 8086-compatible CPU. Today, however, x86 usually implies binary compatibility with the 32-bit instruction set of the 80386; this is because that instruction set has become something of a lowest common denominator for many modern operating systems, and also because the term became common after the introduction of the 80386 in 1985. A few years after the introduction of the 8086 and 8088, Intel added some complexity to its naming scheme and terminology, as the "iAPX" prefix of the ambitious but ill-fated Intel iAPX 432 processor was tried on the more successful 8086 family of chips, applied as a kind of system-level prefix. An 8086 system, including coprocessors such as the 8087 and 8089, as well as simpler Intel-specific system chips, was thereby described as an iAPX 86 system. There were also the terms iRMX, iSBC and iSBX, all together under the heading Microsystem 80. However, this naming scheme was quite temporary.
Although the 8086 was developed for embedded systems and small multi-user or single-user computers, largely as a response to the successful 8080-compatible Zilog Z80, the x86 line soon grew in features and processing power. Today, x86 is ubiquitous in both stationary and portable personal computers, and is used in midrange computers, workstations and most new supercomputer clusters of the TOP500 list. A large amount of software, including a long list of x86 operating systems, runs on x86-based hardware. Modern x86 is relatively uncommon in embedded systems, however, and small low-power applications as well as low-cost microprocessor markets, such as home appliances and toys, lack any significant x86 presence. Simple 8-bit and 16-bit architectures are common there, although the x86-compatible VIA C7, VIA Nano, AMD's Geode and Athlon Neo, and Intel Atom are examples of 32- and 64-bit designs used in some low-power and low-cost segments. There have been several attempts, including by Intel itself, to end the market dominance of the "inelegant" x86 architecture, which descends directly from the first simple 8-bit microprocessors.
Examples of such attempts include the iAPX 432, the Intel i960, the Intel i860 and the Intel/Hewlett-Packard Itanium architecture. However, the continuous refinement of x86 microarchitectures and semiconductor manufacturing has made it hard to replace x86 in many segments. AMD's 64-bit extension of x86 and the scalability of x86 chips such as the eight-core Intel Xeon and twelve-core AMD Opteron underline x86 as an example of how continuous refinement of established industry standards can resist competition from new architectures. The table below lists processor models and model series implementing variations of the x86 instruction set, in chronological order; each line item is characterized by improved or commercially successful processor microarchitecture designs. At various times, companies such as IBM, NEC, AMD, TI, STM, Fujitsu, OKI, Cyrix, Intersil, C&T, NexGen, UMC and DM&P started to design or manufacture x86 processors intended for personal computers as well as embedded systems. Such x86 implementations are seldom simple copies but often employ different internal microarchitectures as well as different solutions at the electronic and physical levels.
The earliest compatible microprocessors were 16-bit; 32-bit designs were developed much later. For the personal computer market, real quantities started to appear around 1990 with i386- and i486-compatible processors, often named similarly to Intel's original chips. Other companies that designed or manufactured x86 or x87 processors include ITT Corporation, National Semiconductor, ULSI System Technology and Weitek. Following the pipelined i486, Intel introduced the Pentium brand name for its new set of superscalar x86 designs.
A battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights and electric cars. When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode; the terminal marked negative is the source of electrons that will flow through an external electric circuit to the positive terminal. When a battery is connected to an external electric load, a redox reaction converts high-energy reactants to lower-energy products, and the free-energy difference is delivered to the external circuit as electrical energy. Historically, the term "battery" referred to a device composed of multiple cells; however, the usage has evolved to include devices composed of a single cell. Primary batteries are used once and discarded, as the electrode materials are irreversibly changed during discharge; common examples are the alkaline batteries used for flashlights and a multitude of portable electronic devices. Secondary batteries can be discharged and recharged multiple times using an applied electric current.
Examples include the lead-acid batteries used in vehicles and the lithium-ion batteries used for portable electronics such as laptops and smartphones. Batteries come in many shapes and sizes, from miniature cells used to power hearing aids and wristwatches, to small, thin cells used in smartphones, to large lead-acid or lithium-ion batteries in vehicles, and, at the largest extreme, huge battery banks the size of rooms that provide standby or emergency power for telephone exchanges and computer data centers. According to a 2005 estimate, the worldwide battery industry generates US$48 billion in sales each year, with 6% annual growth. Batteries have much lower specific energy than common fuels such as gasoline; in automobiles, this is somewhat offset by the higher efficiency of electric motors in converting chemical energy to mechanical work, compared to combustion engines. The usage of "battery" to describe a group of electrical devices dates to Benjamin Franklin, who in 1748 described multiple Leyden jars by analogy to a battery of cannon.
Italian physicist Alessandro Volta built and described the first electrochemical battery, the voltaic pile, in 1800. This was a stack of copper and zinc plates, separated by brine-soaked paper disks, that could produce a steady current for a considerable length of time. Volta did not understand that the voltage was due to chemical reactions; he thought that his cells were an inexhaustible source of energy, and that the associated corrosion effects at the electrodes were a mere nuisance, rather than an unavoidable consequence of their operation, as Michael Faraday showed in 1834. Although early batteries were of great value for experimental purposes, in practice their voltages fluctuated and they could not provide a large current for a sustained period. The Daniell cell, invented in 1836 by British chemist John Frederic Daniell, was the first practical source of electricity, becoming an industry standard and seeing widespread adoption as a power source for electrical telegraph networks. It consisted of a copper pot filled with a copper sulfate solution, in which was immersed an unglazed earthenware container filled with sulfuric acid and a zinc electrode.
These wet cells used liquid electrolytes, which were prone to leakage and spillage if not handled correctly. Many used glass jars to hold their components, which made them fragile and dangerous; these characteristics made wet cells unsuitable for portable appliances. Near the end of the nineteenth century, the invention of dry cell batteries, which replaced the liquid electrolyte with a paste, made portable electrical devices practical. Batteries convert chemical energy directly to electrical energy. In many cases, the electrical energy released is the difference in the cohesive or bond energies of the metals, oxides, or molecules undergoing the electrochemical reaction. For instance, energy can be stored in Zn or Li, which are high-energy metals because they are not stabilized by d-electron bonding, unlike transition metals. Batteries are designed such that the energetically favorable redox reaction can occur only if electrons move through the external part of the circuit. A battery consists of some number of voltaic cells; each cell consists of two half-cells connected in series by a conductive electrolyte containing metal cations.
One half-cell includes electrolyte and the negative electrode, the electrode to which anions migrate; the other half-cell includes electrolyte and the positive electrode, to which cations migrate. Cations are reduced at the cathode, while metal atoms are oxidized at the anode; some cells use different electrolytes for each half-cell. Each half-cell has an electromotive force (emf) relative to a standard; the net emf of the cell is the difference between the emfs of its half-cells. Thus, if the electrodes have emfs E1 and E2, the net emf is E2 − E1.
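As a worked example of the formula above, the Daniell cell described earlier provides the numbers. The snippet below is a sketch using tabulated standard electrode potentials; the function name and dictionary are illustrative, not from any chemistry library:

```python
# Standard electrode potentials in volts, relative to the standard
# hydrogen electrode (well-known tabulated reference values).
STANDARD_POTENTIALS = {
    "Cu2+/Cu": +0.34,  # copper half-cell: the cathode in a Daniell cell
    "Zn2+/Zn": -0.76,  # zinc half-cell: the anode in a Daniell cell
}

def net_emf(cathode, anode):
    """Net cell emf E2 - E1: cathode potential minus anode potential."""
    return STANDARD_POTENTIALS[cathode] - STANDARD_POTENTIALS[anode]

# The Daniell cell: 0.34 - (-0.76) = 1.10 V
print(round(net_emf("Cu2+/Cu", "Zn2+/Zn"), 2))  # 1.1
```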
The CDC 6600 was the flagship of the 6000 series of mainframe computer systems manufactured by Control Data Corporation. Considered to be the first successful supercomputer, it outperformed the industry's prior record holder, the IBM 7030 Stretch, by a factor of three. With performance of up to three megaFLOPS, the CDC 6600 was the world's fastest computer from 1964 to 1969, when it relinquished that status to its successor, the CDC 7600. The first CDC 6600s were delivered in 1965 to Los Alamos. They became a must-have system in scientific and mathematical computing circles, with systems being delivered to the Courant Institute of Mathematical Sciences, CERN, the Lawrence Radiation Laboratory, and many others; 50 were delivered in total. A CDC 6600 is on display at the Computer History Museum in California, and the only running CDC 6000 series machine has been restored by Living Computers: Museum + Labs. CDC's first products were based on the machines designed at ERA, which Seymour Cray had been asked to update after moving to CDC.
After an experimental machine known as the Little Character, in 1960 they delivered the CDC 1604, one of the first commercial transistor-based computers and one of the fastest machines on the market. Management was delighted and made plans for a new series of machines that were more tailored to business use. Cray was not interested in such a project and instead set himself the goal of producing a new machine that would be 50 times faster than the 1604. When asked to complete a detailed report on plans at one and five years into the future, he wrote back that his five-year goal was "to produce the largest computer in the world" ("largest" at that time being synonymous with "fastest"), and that his one-year plan was "to be one-fifth of the way". Taking his core team to new offices near the original CDC headquarters, they started to experiment with higher-quality versions of the "cheap" transistors Cray had used in the 1604. After much experimentation, they found that there was no way the germanium-based transistors could be run much faster than those used in the 1604.
The "business machine" that management had wanted, now forming as the CDC 3000 series, pushed those transistors about as far as they could go. Cray decided the solution was to work with the then-new silicon-based transistors from Fairchild Semiconductor, which were just coming onto the market and offered improved switching performance. During this period, CDC grew from a startup into a large company, and Cray became frustrated with what he saw as ridiculous management requirements. Things became more tense in 1962 when the new CDC 3600 neared production quality and appeared to be what management wanted, when they wanted it. Cray told CDC's CEO, William Norris, that something had to change or he would leave the company. Norris felt he was too important to lose and gave Cray the green light to set up a new laboratory wherever he wanted. After a short search, Cray decided to return to his home town of Chippewa Falls, where he purchased a block of land and started up a new laboratory. Although this process introduced a lengthy delay in the design of his new machine, once in the new laboratory, without management interference, things started to progress quickly.
By this time, the new transistors were becoming quite reliable, and modules built with them tended to work properly on the first try. The 6600 began to take form, with Cray working alongside Jim Thornton, system architect and "hidden genius" of the 6600. More than 100 CDC 6600s were sold over the machine's lifetime. Many of these went to various nuclear weapon-related laboratories, and quite a few found their way into university computing laboratories. Cray then turned his attention to its replacement, this time setting a goal of ten times the performance of the 6600, delivered as the CDC 7600. The later CDC Cyber 70 and 170 computers were similar to the CDC 6600 in overall design and were nearly backwards compatible. The 6600 was three times faster than the IBM 7030 Stretch; IBM's then-CEO Thomas Watson Jr. wrote a memo to his employees: "Last week, Control Data... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers...
Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer." Cray's reply was sardonic: "It seems like Mr. Watson has answered his own question." Typical machines of the era used a single CPU to drive the entire system. A typical program would first load data into memory, process it, and write it back out; this required the CPUs to be complex in order to handle the complete set of instructions they would be called on to perform. A complex CPU implied a large CPU, introducing signalling delays while information flowed between the individual modules making it up. These delays set an upper limit on performance, as the machine could only operate at a cycle speed that allowed the signals time to arrive at the next module. Cray took another approach. At the time, CPUs generally ran slower than the main memory to which they were attached. For instance, a processor might take 15 cycles to multiply two numbers, while each memory access took only one or two.
This meant that the memory sat idle for much of the time while the processor worked through its instructions, and it was this idle time that the 6600 exploited. The CDC 6600 used a simplified central processor that was designed to run mathematical and logic operations as rapidly as possible, leaving housekeeping tasks such as input/output to a set of simpler peripheral processors.
Arithmetic logic unit
An arithmetic logic unit (ALU) is a combinational digital electronic circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit, which operates on floating-point numbers. An ALU is a fundamental building block of many types of computing circuits, including the central processing unit of computers, FPUs, and graphics processing units. A single CPU, FPU or GPU may contain multiple ALUs. The inputs to an ALU are the data to be operated on, called operands, and a code indicating the operation to be performed. In many designs, the ALU also has status inputs or outputs, or both, which convey information about a previous operation or the current operation between the ALU and external status registers. An ALU has a variety of input and output nets, which are the electrical conductors used to convey digital signals between the ALU and external circuitry. When an ALU is operating, external circuits apply signals to the ALU inputs and, in response, the ALU produces and conveys signals to external circuitry via its outputs.
A basic ALU has three parallel data buses: two input operands (A and B) and a result output (Y). Each data bus is a group of signals that conveys one binary integer number; the A, B and Y bus widths are identical and match the native word size of the external circuitry. The opcode input is a parallel bus that conveys to the ALU an operation selection code, an enumerated value that specifies the desired arithmetic or logic operation to be performed by the ALU; the opcode size determines the maximum number of different operations the ALU can perform. An ALU opcode is not the same as a machine language opcode, though in some cases it may be directly encoded as a bit field within a machine language opcode. The status outputs are various individual signals that convey supplemental information about the result of the current ALU operation. General-purpose ALUs commonly have status signals such as: Carry-out, which conveys the carry resulting from an addition operation, the borrow resulting from a subtraction operation, or the overflow bit resulting from a binary shift operation. Zero, which indicates all bits of Y are logic zero.
Negative, which indicates the result of an arithmetic operation is negative. Overflow, which indicates the result of an arithmetic operation has exceeded the numeric range of Y. Parity, which indicates whether an even or odd number of bits in Y are logic one. At the end of each ALU operation, the status output signals are stored in external registers to make them available for future ALU operations or for controlling conditional branching; the collection of bit registers that store the status outputs is often treated as a single, multi-bit register, referred to as the "status register" or "condition code register". The status inputs allow additional information to be made available to the ALU when performing an operation; typically, this is a single "carry-in" bit that is the stored carry-out from a previous ALU operation. An ALU is a combinational logic circuit, meaning that its outputs will change asynchronously in response to input changes. In normal operation, stable signals are applied to all of the ALU inputs and, when enough time has passed for the signals to propagate through the ALU circuitry, the result of the ALU operation appears at the ALU outputs.
The external circuitry connected to the ALU is responsible for ensuring the stability of ALU input signals throughout the operation, for allowing sufficient time for the signals to propagate through the ALU before sampling the ALU result. In general, external circuitry controls an ALU by applying signals to its inputs; the external circuitry employs sequential logic to control the ALU operation, paced by a clock signal of a sufficiently low frequency to ensure enough time for the ALU outputs to settle under worst-case conditions. For example, a CPU begins an ALU addition operation by routing operands from their sources to the ALU's operand inputs, while the control unit applies a value to the ALU's opcode input, configuring it to perform addition. At the same time, the CPU routes the ALU result output to a destination register that will receive the sum; the ALU's input signals, which are held stable until the next clock, are allowed to propagate through the ALU and to the destination register while the CPU waits for the next clock.
When the next clock arrives, the destination register stores the ALU result and, since the ALU operation has completed, the ALU inputs may be set up for the next ALU operation. A number of basic arithmetic and bitwise logic functions are commonly supported by ALUs. Basic, general-purpose ALUs include these operations in their repertoires: Add: A and B are summed and the sum appears at Y and carry-out. Add with carry: A, B and carry-in are summed and the sum appears at Y and carry-out. Subtract: B is subtracted from A and the difference appears at Y and carry-out. For this function, carry-out is effectively a "borrow" indicator; this operation may be used to compare the magnitudes of A and B. Subtract with borrow: B is subtracted from A with borrow (carry-in) and the difference appears at Y and carry-out (borrow out).
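The four operations above, together with the carry, zero and negative status outputs, can be modelled in a few lines. This is a behavioural sketch of a hypothetical 8-bit ALU, not any particular hardware design; the opcode mnemonics are assumptions chosen for familiarity:

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1  # 0xFF for an 8-bit ALU

def alu(opcode, a, b, carry_in=0):
    """One ALU operation on WIDTH-bit operands A and B.

    Returns (result, flags); carry-out doubles as the borrow indicator
    for the subtraction operations, as described above.
    """
    if opcode == "add":
        raw = a + b
    elif opcode == "adc":          # add with carry
        raw = a + b + carry_in
    elif opcode == "sub":
        raw = a - b
    elif opcode == "sbb":          # subtract with borrow
        raw = a - b - carry_in
    else:
        raise ValueError("unsupported opcode: %r" % opcode)

    result = raw & MASK
    flags = {
        "carry": int(raw > MASK or raw < 0),      # carry-out / borrow-out
        "zero": int(result == 0),
        "negative": (result >> (WIDTH - 1)) & 1,  # sign bit of Y
    }
    return result, flags

print(alu("add", 200, 100))  # (44, {'carry': 1, 'zero': 0, 'negative': 0})
print(alu("sub", 5, 9))      # (252, {'carry': 1, 'zero': 0, 'negative': 1})
```

The comparison use mentioned above falls out directly: subtracting B from A and discarding the result, a set zero flag means A equals B, while a clear borrow means A is greater than or equal to B for unsigned operands.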
In computing, a vector processor or array processor is a central processing unit that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the 1990s. As of 2015, most commodity CPUs implement architectures that feature instructions for a form of vector processing on multiple data sets. Common examples include Intel x86's MMX, SSE and AVX instructions, AMD's 3DNow! extensions, Sparc's VIS extension, PowerPC's AltiVec and MIPS' MSA. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.
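The distinction between scalar and vector instructions can be sketched in a few lines. This is an illustrative model of the programming difference, not of any particular instruction set:

```python
def scalar_add(a, b):
    """A scalar instruction: one operation, one pair of data items."""
    return a + b

def vector_add(va, vb):
    """A vector instruction: one operation applied across whole
    one-dimensional arrays of operands, element by element."""
    return [x + y for x, y in zip(va, vb)]

# What a scalar ISA expresses as four separate add instructions
# collapses into a single vector operation:
print([scalar_add(x, y) for x, y in [(1, 10), (2, 20), (3, 30), (4, 40)]])
print(vector_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44] either way
```

The win on real hardware comes from issuing one instruction per vector rather than one per element: fetch, decode and loop overhead are amortized across the whole array.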
In 2000, IBM, Toshiba and Sony collaborated to create the Cell processor. Other CPU designs include multiple instructions for vector processing on multiple data sets, an approach known as MIMD and realized with VLIW; the Fujitsu FR-V VLIW/vector processor combines both technologies. Vector processing development began in the early 1960s at Westinghouse in their "Solomon" project. Solomon's goal was to increase math performance by using a large number of simple math co-processors under the control of a single master CPU; the CPU fed a single common instruction to all of the arithmetic logic units, one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array. In 1962, Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV. Their version of the design called for a 1 GFLOPS machine with 256 ALUs; when it was delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS.
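The Solomon arrangement described above can be sketched as a broadcast: a master control unit issues one common instruction per cycle, and every ALU applies it in lockstep to its own local data point. The instruction names and the operand form below are simplifying assumptions, not Solomon's actual instruction set.

```python
def broadcast(instruction, data_points, operand):
    """Apply one common instruction to every ALU's local data element.

    Each entry in data_points models the local value held by one of the
    simple co-processor ALUs; all of them step together on the same
    broadcast instruction.
    """
    ops = {
        "add": lambda x: x + operand,
        "mul": lambda x: x * operand,
    }
    step = ops[instruction]
    return [step(x) for x in data_points]   # all ALUs execute in lockstep
```

A single call such as `broadcast("add", data, 10)` thus applies one algorithm step to an entire array, which is exactly the property that let Solomon-style machines work on large data sets efficiently.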
Even so, it showed that the basic concept was sound; when used on data-intensive applications, such as computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs and is referred to under a separate category, massively parallel computing. A computer for operations with functions was presented and developed by Kartsev in 1967. The first successful implementations of vector processing appear to be the Control Data Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC). The basic ASC ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance of approximately 20 MFLOPS, achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain, and memory bandwidth was sufficient to support these expanded modes. The STAR was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks it could keep up while being much smaller and less expensive.
However, the machine took considerable time decoding the vector instructions and getting ready to run the process, so it required specific data sets to work on before it sped anything up. The vector technique was first fully exploited in 1976 by the famous Cray-1. Instead of leaving the data in memory like the STAR and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words each; the vector instructions were applied between registers, which is much faster than talking to main memory. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had separate pipelines for different instructions; for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into one another, a technique they called vector chaining. The Cray-1 had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS – far faster than any machine of the era.
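The effect of chaining can be sketched with generator pipelines: the add unit begins consuming results from the multiply unit element by element, as they emerge, instead of waiting for the complete product vector to be written back. This is a conceptual model only; the function names are invented and Python generators stand in for the Cray-1's hardware pipelines.

```python
def multiply_pipe(v1, v2):
    """Multiply pipeline: emits one product per 'cycle'."""
    for x, y in zip(v1, v2):
        yield x * y

def add_pipe(products, v3):
    """Add pipeline, chained onto the multiply pipeline's output."""
    for p, z in zip(products, v3):
        yield p + z          # consumes each product as it is produced

def chained_muladd(v1, v2, v3):
    """Compute v1*v2 + v3 element-wise with the two pipelines chained."""
    return list(add_pipe(multiply_pipe(v1, v2), v3))
```

Because the two pipelines overlap in time rather than running back to back, a chained multiply-add approaches twice the throughput of the two operations run separately, which is how the Cray-1 could exceed its nominal 80 MFLOPS when chains were active.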
Other examples followed. Control Data Corporation tried to re-enter the high-end market with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s, Japanese companies (Fujitsu and Nippon Electric Corporation) introduced register-based vector machines similar to the Cray-1, which were faster and much smaller. Oregon-based Floating Point Systems built add-on array processors for minicomputers, later building its own minisupercomputers. Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed the Virtual Vector Architecture for use in supercomputers, coupling several scalar processors to act as a vector processor.
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card.