Arithmetic logic unit
An arithmetic logic unit (ALU) is a combinational digital electronic circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating-point numbers. An ALU is a fundamental building block of many types of computing circuits, including the central processing unit (CPU) of computers, FPUs, and graphics processing units (GPUs). A single CPU, FPU or GPU may contain multiple ALUs. The inputs to an ALU are the data to be operated on, called operands, and a code indicating the operation to be performed; in many designs, the ALU also has status inputs or outputs, or both, which convey information about a previous operation or the current operation between the ALU and external status registers. An ALU has a variety of input and output nets, which are the electrical conductors used to convey digital signals between the ALU and external circuitry. When an ALU is operating, external circuits apply signals to the ALU inputs and, in response, the ALU produces and conveys signals to external circuitry via its outputs.
A basic ALU has three parallel data buses: two input operands (A and B) and a result output (Y). Each data bus is a group of signals that conveys one binary integer number; the A, B and Y bus widths are identical and match the native word size of the external circuitry. The opcode input is a parallel bus that conveys to the ALU an operation selection code, an enumerated value that specifies the desired arithmetic or logic operation to be performed by the ALU; the opcode size (its bus width) determines the maximum number of different operations the ALU can perform. An ALU opcode is not the same as a machine language opcode, though in some cases it may be directly encoded as a bit field within a machine language opcode. The status outputs are various individual signals that convey supplemental information about the result of the current ALU operation. General-purpose ALUs commonly have status signals such as:

- Carry-out, which conveys the carry resulting from an addition operation, the borrow resulting from a subtraction operation, or the overflow bit resulting from a binary shift operation.
- Zero, which indicates all bits of Y are logic zero.
- Negative, which indicates the result of an arithmetic operation is negative.
- Overflow, which indicates the result of an arithmetic operation has exceeded the numeric range of Y.
- Parity, which indicates whether an even or odd number of bits in Y are logic one.

At the end of each ALU operation, the status output signals are stored in external registers to make them available for future ALU operations or for controlling conditional branching; the collection of bit registers that store the status outputs is treated as a single, multi-bit register, referred to as the "status register" or "condition code register". The status inputs allow additional information to be made available to the ALU when performing an operation; typically, this is a single "carry-in" bit that is the stored carry-out from a previous ALU operation. An ALU is a combinational logic circuit, meaning that its outputs will change asynchronously in response to input changes. In normal operation, stable signals are applied to all of the ALU inputs and, when enough time has passed for the signals to propagate through the ALU circuitry, the result of the ALU operation appears at the ALU outputs.
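Returning to the status outputs listed above, the following is a minimal sketch, assuming an 8-bit ALU, of how such flags could be derived for an addition; the function name and flag conventions here are illustrative, not taken from any particular design.

```python
# A minimal sketch of deriving the status flags for an 8-bit addition.
def add_flags(a, b, width=8):
    mask = (1 << width) - 1
    raw = a + b                 # full-precision sum
    y = raw & mask              # the result bus Y, truncated to the bus width
    sign = width - 1            # position of the sign bit
    return y, {
        "carry":    raw > mask,                          # carry out of the top bit
        "zero":     y == 0,                              # all bits of Y are zero
        "negative": bool((y >> sign) & 1),               # sign bit of Y is set
        # signed overflow: both operands have the sign that the result lacks
        "overflow": bool((((a ^ y) & (b ^ y)) >> sign) & 1),
        "parity":   bin(y).count("1") % 2 == 0,          # even count of 1 bits
    }

print(add_flags(0x7F, 0x01))  # 127 + 1 wraps to 0x80: negative and overflow set
```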
The external circuitry connected to the ALU is responsible for ensuring the stability of ALU input signals throughout the operation, and for allowing sufficient time for the signals to propagate through the ALU before sampling the ALU result. In general, external circuitry controls an ALU by applying signals to its inputs. Typically, the external circuitry employs sequential logic to control the ALU operation, paced by a clock signal of a sufficiently low frequency to ensure enough time for the ALU outputs to settle under worst-case conditions. For example, a CPU begins an ALU addition operation by routing operands from their sources to the ALU's operand inputs, while the control unit applies a value to the ALU's opcode input, configuring it to perform addition. At the same time, the CPU routes the ALU result output to a destination register that will receive the sum. The ALU's input signals, which are held stable until the next clock, are allowed to propagate through the ALU and to the destination register while the CPU waits for the next clock.
When the next clock arrives, the destination register stores the ALU result and, since the ALU operation has completed, the ALU inputs may be set up for the next ALU operation. A number of basic arithmetic and bitwise logic functions are commonly supported by ALUs. Basic, general-purpose ALUs typically include these operations in their repertoires:

- Add: A and B are summed and the sum appears at Y and carry-out.
- Add with carry: A, B and carry-in are summed and the sum appears at Y and carry-out.
- Subtract: B is subtracted from A and the difference appears at Y and carry-out. For this function, carry-out is effectively a "borrow" indicator; this operation may also be used to compare the magnitudes of A and B.
- Subtract with borrow: B is subtracted from A with borrow (carry-in) and the difference appears at Y and carry-out.
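As a rough illustration of these four operations, here is a Python sketch of an 8-bit ALU returning Y and the carry-out. Treating subtraction as two's-complement addition, with carry-in acting as an inverted borrow, is an assumption of this sketch (one common convention), not something the text above mandates.

```python
MASK = 0xFF  # 8-bit word

def alu(opcode, a, b, carry_in=0):
    if opcode == "add":                  # Add
        raw = a + b
    elif opcode == "adc":                # Add with carry
        raw = a + b + carry_in
    elif opcode == "sub":                # Subtract: A - B via A + ~B + 1
        raw = a + (~b & MASK) + 1
    elif opcode == "sbb":                # Subtract with borrow (inverted carry)
        raw = a + (~b & MASK) + carry_in
    else:
        raise ValueError(opcode)
    return raw & MASK, raw > MASK        # (Y, carry-out)

lo, carry = alu("add", 0xFF, 0x01)          # low word of a two-word addition
hi, _ = alu("adc", 0x02, 0x03, int(carry))  # carry propagates to the high word
print(hi, lo)                               # 6 0, i.e. 0x0600 = 0x02FF + 0x0301
```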
A computer program is a collection of instructions that performs a specific task when executed by a computer. A computer requires programs to function. A computer program is written by a computer programmer in a programming language. From the program in its human-readable form of source code, a compiler can derive machine code—a form consisting of instructions that the computer can directly execute. Alternatively, a computer program may be executed with the aid of an interpreter. A collection of computer programs and related data is referred to as software. Computer programs may be categorized along functional lines, such as application software and system software. The underlying method used for some calculation or manipulation is known as an algorithm. The earliest programmable machines preceded the invention of the digital computer. In 1801, Joseph-Marie Jacquard devised a loom that would weave a pattern by following a series of perforated cards. Patterns could be repeated by arranging the cards.
In 1837, Charles Babbage was inspired by Jacquard's loom to attempt to build the Analytical Engine. The names of the components of the calculating device were borrowed from the textile industry, in which yarn was brought from the store to be milled. The device would have had a "store"—memory to hold 1,000 numbers of 40 decimal digits each. Numbers from the "store" would have been transferred to the "mill" for processing, with a "thread" being the execution of programmed instructions by the device. It was programmed using two sets of perforated cards—one to direct the operation and the other for the input variables. However, after the British government had spent more than £17,000, the thousands of cogged wheels and gears never worked together. During a nine-month period in 1842–43, Ada Lovelace translated the memoir of Italian mathematician Luigi Menabrea, which covered the Analytical Engine. The translation contained Note G, which detailed a method for calculating Bernoulli numbers using the Analytical Engine.
This note is recognized by some historians as the world's first written computer program. In 1936, Alan Turing introduced the Universal Turing machine, a theoretical device that can model every computation that can be performed on a Turing-complete computing machine; its control unit is a finite-state machine. The machine can move the tape back and forth, changing its contents as it performs an algorithm. The machine starts in the initial state, goes through a sequence of steps, and halts when it encounters the halt state. This machine is considered by some to be the origin of the stored-program computer—used by John von Neumann for the "Electronic Computing Instrument" that now bears the von Neumann architecture name. The Z3 computer, invented by Konrad Zuse in Germany, was a digital and programmable computer. A digital computer uses electricity as the calculating component; the Z3 contained 2,400 relays to create the circuits. The circuits provided a floating-point, nine-instruction computer. Programming the Z3 was through a specially designed keyboard and punched tape.
The Electronic Numerical Integrator And Computer (ENIAC) was a Turing-complete, general-purpose computer that used 17,468 vacuum tubes to create the circuits. At its core, it was a series of Pascalines wired together. Its 40 units weighed 30 tons, occupied 1,800 square feet, and consumed $650 per hour in electricity when idle. It had 20 base-10 accumulators. Programming the ENIAC took up to two months. Three function tables needed to be rolled to fixed function panels, to which they were connected using heavy black cables; each function table had 728 rotating knobs. Programming the ENIAC also involved setting some of the 3,000 switches. Debugging a program could take a week. The programmers of the ENIAC were women who were known collectively as the "ENIAC girls". The ENIAC featured parallel operations: different sets of accumulators could work on different algorithms. It used punched card machines for input and output, and it was controlled with a clock signal. It ran for eight years, calculating hydrogen bomb parameters, predicting weather patterns, and producing firing tables to aim artillery guns.
The Manchester Baby (June 1948) was a stored-program computer. Programming transitioned away from setting dials. Only three bits of memory were available to store each instruction, so it was limited to eight instructions, and 32 switches were available for programming. Computers manufactured until the 1970s had front-panel switches for manual programming; the computer program was written on paper for reference. An instruction was represented by a configuration of on/off settings. After setting the configuration, an execute button was pressed; this process was then repeated. Computer programs were also manually input via paper tape or punched cards. After the medium was loaded, the starting address was set via switches and the execute button was pressed. In 1961, the Burroughs B5000 was built to be programmed in the ALGOL 60 language; the hardware featured circuits to ease the compile phase. In 1964, IBM released the System/360, a line of six computers, each having the same instruction set architecture; the Model 30 was the least expensive. Customers could retain the same application software, and each System/360 model featured multiprogramming.
With operating system support, multiple programs could be in memory at once: when one was waiting for input/output, another could compute. Each model could also emulate other computers, so customers could upgrade to the System/360 and retain their existing application software.
Cache hierarchy, or multi-level caches, refers to a memory architecture that uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by central processing unit (CPU) cores. Cache hierarchy is a form and part of memory hierarchy, and can be considered a form of tiered storage. This design is intended to allow CPU cores to process faster despite the memory latency of main memory access. Accessing main memory can act as a bottleneck for CPU core performance as the CPU waits for data, while making all of main memory high-speed may be prohibitively expensive. High-speed caches are a compromise, allowing high-speed access to the data most used by the CPU and permitting a faster CPU clock. In the history of computer and electronic chip development, there was a period when increases in CPU speed outpaced the improvements in memory access speed; the gap between the speed of CPUs and memory meant that the CPU would often be idle.
CPUs were capable of running and executing larger amounts of instructions in a given time, but the time needed to access data from main memory prevented programs from fully benefiting from this capability. This issue motivated the creation of memory models with higher access rates in order to realize the potential of faster processors, and resulted in the concept of cache memory, first proposed by Maurice Wilkes, a British computer scientist at the University of Cambridge, in 1965. He called such memory models "slave memory". Between 1970 and 1990, papers and articles by Anant Agarwal, Alan Jay Smith, Mark D. Hill, Thomas R. Puzak, and others discussed better cache memory designs. The first cache memory models were implemented at the time, but even as researchers were investigating and proposing better designs, the need for faster memory models continued. This need resulted from the fact that although early cache models improved data access latency, with respect to cost and technical limitations it was not feasible for a computer system's cache to approach the size of main memory.
From 1990 onward, ideas such as adding another cache level as a backup for the first-level cache were proposed. Jean-Loup Baer, Wen-Hann Wang, Andrew W. Wilson, and others conducted research on this model. When several simulations and implementations demonstrated the advantages of two-level cache models, the concept of multi-level caches caught on as a new and better model of cache memories. Since 2000, multi-level cache models have received widespread attention and are implemented in many systems, such as the three-level caches that are present in Intel's Core i7 products. Accessing main memory for each instruction execution may result in slow processing, with the clock speed depending on the time required to find and fetch the data. In order to hide this memory latency from the processor, data caching is used. Whenever the data is required by the processor, it is fetched from the main memory and stored in the smaller memory structure called a cache. If there is any further need of that data, the cache is searched first before going to the main memory.
This structure resides closer to the processor in terms of the time taken to search and fetch data with respect to the main memory. The advantages of using cache can be shown by calculating the average access time (AAT) for the memory hierarchy with and without the cache. Caches, being small in size, may result in frequent misses – when a search of the cache does not provide the sought-after information – resulting in a call to main memory to fetch data. Hence, the AAT is affected by the miss rate of each structure from which it searches for the data:

AAT = hit time + (miss rate × miss penalty)

The AAT for main memory is given by hit time(main memory); the AAT for a cache is given by hit time(cache) + (miss rate(cache) × miss penalty(cache)). The hit time for caches is much less than the hit time for the main memory, so the AAT for data retrieval is significantly lower when accessing data through the cache rather than main memory. While using the cache may improve memory latency, it may not always result in the required improvement for the time taken to fetch data due to the way caches are organized and traversed. For example, direct-mapped caches that are the same size have a higher miss rate than fully associative caches.
This may also depend on the benchmark used to test the processor and on the pattern of instructions. But using a fully associative cache may result in more power consumption, as it has to search the whole cache every time. Due to this, the trade-off between power consumption and the size of the cache becomes critical in the cache design. In the case of a cache miss, the purpose of using such a structure is defeated and the computer has to go to the main memory to fetch the required data. However, with a multiple-level cache, if the computer misses the cache closest to the processor it will search through the next-closest level of cache and go to main memory only if these methods fail. The general trend is to keep the L1 cache small and at a distance of 1–2 CPU clock cycles from the processor, with the lower levels of caches increasing in size to store more data than L1, hence being more distant but with a lower miss rate. This results in a better AAT. The number of cache levels can be designed by architects according to their requirements after checking for trade-offs between cost, AATs, and size.
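As a rough worked example of the AAT formula above, the following Python snippet computes the average access time for a hypothetical two-level hierarchy in front of main memory; the latencies and miss rates are made up for illustration.

```python
# AAT = hit time + (miss rate * miss penalty), applied level by level.
def aat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

MAIN_MEMORY = 100  # cycles; the AAT of main memory is just its hit time
l2 = aat(hit_time=10, miss_rate=0.05, miss_penalty=MAIN_MEMORY)
l1 = aat(hit_time=2,  miss_rate=0.10, miss_penalty=l2)

print(f"L2 AAT: {l2:.1f} cycles")  # 15.0
print(f"L1 AAT: {l1:.1f} cycles")  # 3.5, versus 100 cycles with no cache at all
```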
In theoretical computer science, a pointer machine is an "atomistic" abstract computational machine model akin to the random-access machine. Depending on the type, a pointer machine may be called a linking automaton, a KU-machine, an SMM, an atomistic LISP machine, a tree-pointer machine, etc. At least three major varieties exist in the literature—the Kolmogorov–Uspenskii (KU) model, the Knuth linking automaton, and the Schönhage storage modification machine (SMM) model; the SMM seems to be the most common. From its "read-only tape" a pointer machine receives its input—bounded symbol-sequences made of at least two symbols, e.g. {0, 1}—and it writes output symbol-sequences on an output "write-only" tape. To transform an input symbol-sequence to an output symbol-sequence the machine is equipped with a "program"—a finite-state machine. Via its state machine the program reads the input symbols, operates on its storage structure—a collection of "nodes" interconnected by "edges" (pointers)—and writes symbols on the output tape. Pointer machines cannot do arithmetic.
Computation proceeds only by reading input symbols, performing various tests on its storage structure—the pattern of nodes and pointers—and outputting symbols based on the tests. "Information" is in the storage structure. Both Gurevich and Ben-Amram list a number of similar "atomistic" models of "abstract machines". This article will discuss the following three atomistic models in particular: Schönhage's storage modification machines, Kolmogorov–Uspenskii machines, and Knuth's "linking automaton". But Ben-Amram adds more: the atomistic pure-LISP machine, the atomistic full-LISP machine, general atomistic pointer machines, and Jones's I language. On the use of the model in complexity theory, van Emde Boas expresses concern that this form of abstract model is "an interesting theoretical model, but... its attractiveness as a fundamental model for complexity theory is questionable. Its time measure is based on uniform time in a context where this measure is known to underestimate the true time complexity; the same observation holds for the space measure for the machine." Gurevich (1988) observes: "Pragmatically speaking, the Schönhage model provides a good measure of time complexity at the current state of the art." The fact that, in §3 and §4, Schönhage himself demonstrates the real-time equivalences of his two random-access machine models "RAM0" and "RAM1" leads one to question the necessity of the SMM for complexity studies.
There are potential uses for the model, however: Schönhage demonstrates in his §6 that integer multiplication can be performed in linear time, and Gurevich wonders whether or not the "parallel KU machine" "resembles somewhat the human brain". Schönhage's SMM model seems to be the most common and most accepted. It is quite unlike the register machine model and other common computational models, e.g. the tape-based Turing machine or the labeled holes and indistinguishable pebbles of the counter machine. The computer consists of a fixed alphabet of input symbols and a mutable directed graph with its arrows labelled by alphabet symbols. Each node of the graph has exactly one outgoing arrow labelled with each symbol, although some of these may loop back into the original node. One fixed node of the graph is identified as the "active" node. Each word of symbols in the alphabet can be translated to a pathway through the machine; the path can, in turn, be identified with the resulting node, but this identification will change as the graph changes during the computation.
The machine can receive instructions. The basic instructions are the new w instruction, which creates a new node (the "result" of following the string w), and the set w to v instruction, which redirects an edge to a different node. Here w and v represent words; v is a former word—i.e. a previously-created string of symbols—so that the redirected edge will point "backwards" to an old node, the "result" of that string. new w: creates a new node, where w represents the new word. The machine reads the word w, following the path represented by the symbols of w until it comes to the last, "additional" symbol in the word. The additional symbol instead forces the last state to create a new node and "flip" its corresponding arrow from its old position so that it points to the new node. The new node in turn points all of its edges back to the old last-state, where they just "rest" until redirected by another new or set. In a sense the new nodes are "sleeping". In the case of the starting or center node we would begin with both of its edges pointing back to itself.
Example: Let "w" be 10110[1], where the final character is in brackets to denote its special status. We take the 1 edge of the node reached by 10110 and point it to a new 7th node; the two edges of this new node point "backward" to the 6th node of the path. set w to v: redirects an edge of the path represented by w so that it points to the node that is the "result" of the previously-created word v.
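The following Python sketch models this storage structure under the assumptions above: a binary alphabet, new nodes pointing all of their edges back at the node that created them. The class and method names are illustrative, not part of any standard formulation.

```python
# A minimal sketch of a Schönhage-style storage modification machine's
# mutable directed graph, with the "new" and "set" instructions.
class SMMStorage:
    def __init__(self):
        # The center (start) node begins with both edges looping back to itself.
        self.nodes = [{"0": 0, "1": 0}]  # node 0 is the center node

    def follow(self, word):
        """Return the node index reached by following word from the center."""
        node = 0
        for symbol in word:
            node = self.nodes[node][symbol]
        return node

    def new(self, word):
        """new w: follow all but the last symbol of w, then flip that last
        edge to a fresh node whose own edges point back ("sleeping")."""
        last = self.follow(word[:-1])
        fresh = len(self.nodes)
        self.nodes.append({"0": last, "1": last})
        self.nodes[last][word[-1]] = fresh
        return fresh

    def set(self, w, v):
        """set w to v: redirect the edge selected by the last symbol of w
        so that it points to the node that is the result of v."""
        self.nodes[self.follow(w[:-1])][w[-1]] = self.follow(v)

s = SMMStorage()
s.new("1")         # create a node reachable from the center via edge 1
s.new("10")        # extend the structure along the path 1, then 0
s.set("11", "10")  # point the 1-edge of node "1" at the node for "10"
print(s.follow("11") == s.follow("10"))  # True
```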
In computer science, computer engineering and programming language implementations, a stack machine is a type of computer; in some cases, the term refers to a software scheme such as a virtual machine. The main difference from other computers is that most of its instructions operate on a pushdown stack of numbers rather than on numbers held in registers. Most computer systems use a stack for subroutine linkage; this alone does not make these computers stack machines. The common alternative to a stack machine is a register machine, in which each instruction explicitly names specific registers for its operands and result. A "stack machine" is a computer that uses a last-in, first-out stack to hold short-lived temporary values. Most of its instructions assume that operands will be taken from the stack and results placed on the stack. For a typical instruction such as Add, the computer takes both operands from the topmost values of the stack and replaces those two values with the sum, which the computer calculates when it performs the Add instruction. The instruction's operands are "popped" off the stack, and its result is "pushed" back onto the stack, ready for the next instruction.
Most stack instructions have only an opcode commanding an operation, with no additional fields to identify a constant, register or memory cell. Because the stack can hold more than two inputs or more than one result, a rich set of operations can be computed. Integer constant operands are pushed by separate Load Immediate instructions. Memory is accessed by separate Load or Store instructions containing a memory address or calculating the address from values on the stack. For speed, a stack machine often implements some part of its stack with registers. To execute quickly, the operands of the arithmetic logic unit (ALU) may be the top two registers of the stack, and the result from the ALU is stored in the top register of the stack. Some stack machines have a stack of limited size, implemented as a register file; the ALU accesses this with an index. Some machines have a stack of unlimited size, implemented as an array in RAM accessed by a "top of stack" address register. This is slower, but the number of flip-flops is smaller, making a less-expensive, more compact CPU.
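A minimal Python sketch of the execution model just described may make it concrete: most instructions are a bare opcode, only the load-immediate carries an argument, and operand selection is implicit in instruction order. The instruction names are illustrative.

```python
# A toy stack machine: each instruction pops its operands from the stack
# and pushes its result back, ready for the next instruction.
def run(program):
    stack = []
    for instr in program:
        if instr[0] == "push":            # load immediate: push a constant
            stack.append(instr[1])
        elif instr[0] == "add":           # pop two operands, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif instr[0] == "mul":           # pop two operands, push their product
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack[-1]                      # the result is left on top of the stack

# Computes (2 + 3) * 4; no instruction names a register or memory cell.
print(run([("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]))  # 20
```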
The topmost N values of such a RAM-based stack may be cached in registers for speed. A few machines have both an expression stack in memory and a separate register stack. In this case, software, or an interrupt, may move data between them. The instruction set carries out most ALU actions with postfix operations that work only on the expression stack, not on data registers or main memory cells. This can be convenient for executing high-level languages, because most arithmetic expressions can be easily translated into postfix notation. In contrast, register machines hold temporary values in a small, fast array of registers. Accumulator machines have only one general-purpose register. Belt machines use a FIFO queue to hold temporary values. Memory-to-memory machines do not have any temporary registers usable by a programmer. Stack machines may have their expression stack and their call-return stack separated or as one integrated structure. If they are separated, the instructions of the stack machine can be pipelined with fewer interactions and less design complexity.
It can then run faster. Some technical handheld calculators use reverse Polish notation in their keyboard interface, instead of having parenthesis keys; this is a form of stack machine. The Plus key relies on its two operands already being at the correct topmost positions of the user-visible stack. Stack machines have much smaller instructions than the other styles of machines. But loads and stores to memory are separate, and so stack code requires roughly twice as many instructions as the equivalent code for register machines; the total code size is still less for stack machines, as the rough arithmetic after this paragraph illustrates. In stack machine code, the most frequent instructions consist of just an opcode selecting the operation; this can fit in 6 bits or less. Branches, load immediates, and load/store instructions require an argument field, but stack machines often arrange that the frequent cases of these still fit together with the opcode into a compact group of bits. The selection of operands from prior results is done implicitly by ordering the instructions. In contrast, register machines require two or three register-number fields per ALU instruction to select operands.
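A back-of-the-envelope illustration of that density claim, with assumed field widths (a 6-bit opcode and 5-bit register numbers; both figures are assumptions for the sake of the arithmetic, not taken from a specific machine):

```python
# Comparing per-instruction encoding cost for an ALU operation.
STACK_ADD_BITS    = 6            # stack machine: opcode only
REGISTER_ADD_BITS = 6 + 3 * 5    # register machine: opcode + rd, rs1, rs2

print(STACK_ADD_BITS, REGISTER_ADD_BITS)  # 6 vs 21 bits per ALU instruction
# Even if stack code needs twice as many instructions, 2 * 6 = 12 < 21 bits.
```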
The instructions for accumulator or memory-to-memory machines are not padded out with multiple register fields. Instead, they use compiler-managed anonymous variables for subexpression values. These temporary locations require extra memory reference instructions, which take more code space than for the stack machine, or even for compact register machines. All practical stack machines have variants of the load–store opcodes for accessing local variables and formal parameters without explicit address calculations; this can be by offsets from the current top-of-stack address, or by offsets from a stable frame-base register. Register machines handle this with a register-plus-offset address mode, but use a wider offset field. Dense machine code was valuable in the 1960s, when main memory was expensive and limited on mainframes. It became important again on the initially-tiny memories of minicomputers and microprocessors. Density remains important today, for smartphone applications, applications downloaded into browsers over slow Internet connections, and in ROMs for embedded applications.
A more general advantage of increased density is improved effectiveness of caches and instruction prefetch. Some of the density of Burroughs B6700 code was due to moving vital operand information elsewhere
Reduced instruction set computer
A reduced instruction set computer, or RISC, is one whose instruction set architecture allows it to have fewer cycles per instruction than a complex instruction set computer (CISC). Various suggestions have been made regarding a precise definition of RISC, but the general concept is that such a computer has a small set of simple and general instructions, rather than a large set of complex and specialized instructions. Another common RISC trait is the load/store architecture, in which memory is accessed through specific instructions rather than as a part of most instructions. Although a number of computers from the 1960s and '70s have been identified as forerunners of RISCs, the modern concept dates to the 1980s. In particular, two projects at Stanford University and the University of California, Berkeley are most associated with the popularization of this concept. Stanford's MIPS would go on to be commercialized as the successful MIPS architecture, while Berkeley's RISC gave its name to the entire concept and was commercialized as the SPARC.
Another success from this era was IBM's effort that led to the IBM POWER instruction set architecture, PowerPC, and Power ISA. As these projects matured, a wide variety of similar designs flourished in the late 1980s and the early 1990s, representing a major force in the Unix workstation market as well as in embedded processors in laser printers and similar products. The many varieties of RISC designs include ARC, Alpha, Am29000, ARM, Atmel AVR, Blackfin, i860, i960, M88000, MIPS, PA-RISC, Power ISA, RISC-V, SuperH, and SPARC. In the 21st century, the use of ARM architecture processors in smartphones and tablet computers such as the iPad and Android devices provided a wide user base for RISC-based systems. RISC processors are also used in supercomputers such as Summit, which, as of November 2018, is the world's fastest supercomputer as ranked by the TOP500 project. Alan Turing's 1946 Automatic Computing Engine design had many of the characteristics of a RISC architecture. A number of systems, going back to the 1960s, have been credited as the first RISC architecture based on their use of the load/store approach.
The term RISC was coined by David Patterson of the Berkeley RISC project, although somewhat similar concepts had appeared before. The CDC 6600, designed by Seymour Cray in 1964, used a load/store architecture with only two addressing modes and 74 operation codes, with the basic clock cycle being 10 times faster than the memory access time. Due to the optimized load/store architecture of the CDC 6600, Jack Dongarra says that it can be considered a forerunner of modern RISC systems, although a number of other technical barriers needed to be overcome for the development of a modern RISC system. Michael J. Flynn views the first RISC system as the IBM 801 design, which was begun in 1975 by John Cocke and completed in 1980. The 801 was produced in a single-chip form as the IBM ROMP in 1981, which stood for 'Research OPD Micro Processor'. As the name implies, this CPU was designed for "mini" tasks, and was used in the IBM RT PC in 1986, which turned out to be a commercial failure. But the 801 inspired several research projects, including new ones at IBM that would lead to the IBM POWER instruction set architecture.
The most public RISC designs, however, were the results of university research programs run with funding from the DARPA VLSI Program. The VLSI Program, practically unknown today, led to a huge number of advances in chip design and computer graphics. The Berkeley RISC project started in 1980 under the direction of David Patterson and Carlo H. Sequin. Berkeley RISC was based on gaining performance through the use of pipelining and an aggressive use of a technique known as register windowing. In a traditional CPU, one has a small number of registers, and a program can use any register at any time. In a CPU with register windows, there are a huge number of registers, e.g. 128, but programs can only use a small number of them, e.g. eight, at any one time. A program that limits itself to eight registers per procedure can make very fast procedure calls: the call moves the window "down" by eight, to the set of eight registers used by that procedure, and the return moves the window back.
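A toy Python sketch of register windowing, using the figures above (a 128-register file with eight visible registers per procedure), may help; the helper names are invented for illustration.

```python
# Register windowing: a call slides the visible window instead of saving
# registers to memory.
REGISTER_FILE = [0] * 128
window_base = 0  # index of the first register visible to the current procedure

def reg(i):
    """Read visible register i (0-7) of the current procedure."""
    return REGISTER_FILE[window_base + i]

def set_reg(i, value):
    REGISTER_FILE[window_base + i] = value

def call():
    """A procedure call just moves the window 'down' by eight."""
    global window_base
    window_base += 8

def ret():
    """The return moves the window back."""
    global window_base
    window_base -= 8

set_reg(0, 42)  # caller's r0
call()
set_reg(0, 7)   # callee's r0 lives in a different physical register
ret()
print(reg(0))   # 42: the caller's value was never saved or overwritten
```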
The Berkeley RISC project delivered the RISC-I processor in 1982. Consisting of only 44,420 transistors, RISC-I had only 32 instructions, and yet outperformed any other single-chip design. They followed this up with the 40,760 transistor, 39 instruction RISC-II in 1983, which ran over three times as fast as RISC-I. The MIPS project grew out of a graduate course by John L. Hennessy at Stanford University in 1981, resulted in a functioning system in 1983, and could run simple programs by 1984. The MIPS approach emphasized an aggressive clock cycle and the use of the pipeline, making sure it could be run as "full" as possible. The MIPS system was followed by the MIPS-X, and in 1984 Hennessy and his colleagues formed MIPS Computer Systems. The commercial venture resulted in a new architecture, also called MIPS, and the R2000 microprocessor in 1985. In the early 1980s, significant uncertainties surrounded the RISC concept, and it was uncertain if it could have a commercial future, but by the mid-1980s the concepts had matured enough to be seen as commercially viable. In 1986 Hewlett Packard started using an early implementation of their PA-RISC in some of their computers.
In the meantime, the Berkeley RISC effort had become so well known that it became the name for the entire concept, and in 1987 Sun Microsystems began shipping systems with the SPARC processor.
TRIPS was a microprocessor architecture designed by a team at the University of Texas at Austin in conjunction with IBM and Sun Microsystems. TRIPS uses an instruction set architecture designed to be easily broken down into large groups of instructions that can be run on independent processing elements. The design collects related data into graphs, attempting to avoid expensive data reads and writes and keeping the data in high-speed memory close to the processing elements. The prototype TRIPS processor contains 16 such elements. TRIPS hoped to reach 1 TFLOP on a single processor; papers on the design were published from 2003 to 2006. Computer programs consist of a series of instructions stored in memory. A processor runs a program by fetching these instructions from memory, examining them, and performing the actions the instructions call for. In early machines, the speed of main memory was on the same order of time as a basic operation within the processor. For instance, an instruction that adds two numbers might take three or four instruction cycles, while fetching the numbers from memory might take one or two cycles.
In these machines, there was no penalty for data being in main memory, and the instruction set architectures were designed to allow direct access: for instance, an add instruction might take a value from one location in memory, add it to a value from another, and store the result in a third location. The introduction of fast microprocessors and cheap-but-slower dynamic RAM changed this equation dramatically. In modern machines, fetching a value from main memory might take the equivalent of thousands of cycles. One of the key advances in the RISC concept was to include more processor registers than earlier designs, several dozen rather than two or three. Instructions that had been provided with memory locations were eliminated, replaced by ones that worked only on registers. Loading data into a register was explicit: a separate load action had to be performed, and the results had to be explicitly saved back. One could improve performance by eliminating as many of these memory instructions as possible. This technique reached its limits, and since the 1990s modern CPUs have added increasing amounts of CPU cache to increase local storage, although cache is slower than registers.
Since the late 1990s, performance gains have mostly been made through the use of additional "functional units", which allow some instructions to run in parallel. For instance, two addition instructions working on different data can be run at the same time, effectively doubling the speed of the program. Modern CPUs have dozens of such units: some for integer math and boolean logic, some for floating-point math, some for long-data words, and others for dealing with memory and other housekeeping chores. However, most programs do not work on independent data; instead, they use the outputs of one calculation as the input to another. This limits the set of instructions that can be run in parallel to some factor based on how many instructions the processor is able to examine on-the-fly. The level of instruction parallelism plateaued by the mid-2000s. One attempt to break out of this limit is the very long instruction word (VLIW) concept. VLIW hands the task of looking for instruction parallelism to the compiler, removing it from the processor itself.
In theory this allows the entire program to be examined for independent instructions, which can then be sent to the processor in the order that will make maximal use of the functional units. However, this has proven difficult in practice, and VLIW processors have not become widely popular. In any case, another problem has grown to become an issue. In all traditional designs, the data and instructions are handled by different parts of the CPU. When processing speeds were low this caused no problems, but as performance increased, the communication time from one side of the chip to the other grew to become a significant fraction of overall processing time. For further gains in performance, the registers should be distributed closer to their functional units. TRIPS is a processor based on the Explicit Data Graph Execution (EDGE) concept. EDGE attempts to bypass certain performance bottlenecks by having the processor better understand the instruction stream being sent to it: not seeing it as a linear stream of individual instructions, but rather as blocks of instructions related to a single task using isolated data.
EDGE attempts to run all of these instructions as a block, distributing them internally along with any data they need to process. The compilers find blocks of code that share information in a specific way; these are assembled into compiled "hyperblocks" and fed into the CPU. Since the compiler guarantees that these blocks have specific interdependencies between them, the processor can isolate the code in a single functional unit with its own local memory. Consider a simple example that adds two numbers from memory and then adds that result to another value in memory. In this case a traditional processor would have to notice the dependency and schedule the instructions to run one after the other, storing the intermediate result in the registers. In an EDGE processor, the interdependencies between the data in the code would be noticed by the compiler, which would compile these instructions into a single block. That block would be fed, along with all the data it needed to complete, into a single functional unit with its own private set of registers.
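As a loose illustration of that example, the following Python sketch models a block of two dependent adds in dataflow style: each instruction fires when its operands arrive and forwards its result directly to its consumer rather than through shared registers. This is a conceptual toy under assumed names, not the TRIPS ISA.

```python
# A toy dataflow "hyperblock": instructions fire when both operands arrive,
# forwarding results directly to their consumers.
from operator import add

class Instruction:
    def __init__(self, op, consumers):
        self.op = op                 # function applied to the two operands
        self.consumers = consumers   # (instruction, operand slot) targets
        self.operands = {}

    def receive(self, slot, value, results):
        self.operands[slot] = value
        if len(self.operands) == 2:  # both operands have arrived: fire
            result = self.op(self.operands[0], self.operands[1])
            for target, tslot in self.consumers:
                target.receive(tslot, result, results)
            if not self.consumers:   # no consumers: this is a block output
                results.append(result)

# Block for: t = a + b; y = t + c  (the dependency is encoded statically)
add2 = Instruction(add, consumers=[])
add1 = Instruction(add, consumers=[(add2, 0)])

results = []
add1.receive(0, 2, results)  # a = 2
add1.receive(1, 3, results)  # b = 3 -> add1 fires, forwards 5 to add2
add2.receive(1, 4, results)  # c = 4 -> add2 fires
print(results)               # [9]
```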
Feeding the block and its data into one unit in this way ensures that no additional memory fetching is required, as well as keeping the registers physically close to the functional unit that needs those values. Code that did not rel