Cellular architecture

A cellular architecture is a type of computer architecture prominent in parallel computing. Cellular architectures are relatively new, with IBM's Cell microprocessor being the first one to reach the market. Cellular architecture takes multi-core architecture design to its logical conclusion, by giving the programmer the ability to run large numbers of concurrent threads within a single processor. Each 'cell' is a compute node containing thread units and communication facilities. Speed-up is achieved by exploiting thread-level parallelism inherent in many applications. Cell, a cellular architecture containing 9 cores, is the processor used in the PlayStation 3. Another prominent cellular architecture is Cyclops64, a massively parallel architecture under development by IBM. Cellular architectures follow the low-level programming paradigm, which exposes the programmer to much of the underlying hardware; this allows the programmer to optimize their code for the platform, but at the same time makes software more difficult to develop.


Finite-state machine

A finite-state machine (FSM) or finite-state automaton, finite automaton, or simply state machine, is a mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number of states at any given time; the FSM can change from one state to another in response to some external inputs. An FSM is defined by a list of its states, its initial state, and the conditions for each transition. Finite state machines are of two types – deterministic finite state machines and non-deterministic finite state machines. A deterministic finite-state machine can be constructed equivalent to any non-deterministic one. The behavior of state machines can be observed in many devices in modern society that perform a predetermined sequence of actions depending on a sequence of events with which they are presented. Simple examples are vending machines, which dispense products when the proper combination of coins is deposited; elevators, whose sequence of stops is determined by the floors requested by riders; traffic lights, which change sequence when cars are waiting; and combination locks, which require the input of combination numbers in the proper order.

The finite state machine has less computational power than some other models of computation such as the Turing machine. The computational power distinction means there are computational tasks that a Turing machine can do but an FSM cannot; this is because an FSM's memory is limited by the number of states it has. FSMs are studied in the more general field of automata theory. An example of a simple mechanism that can be modeled by a state machine is a turnstile. A turnstile, used to control access to subways and amusement park rides, is a gate with three rotating arms at waist height, one across the entryway. Initially the arms are locked, blocking the entry and preventing patrons from passing through. Depositing a coin or token in a slot on the turnstile unlocks the arms, allowing a single customer to push through. After the customer passes through, the arms are locked again. Considered as a state machine, the turnstile has two possible states: Locked and Unlocked. There are two possible inputs that affect its state: putting a coin in the slot and pushing the arm.

In the locked state, pushing on the arm has no effect. Putting a coin in – that is, giving the machine a coin input – shifts the state from Locked to Unlocked. In the unlocked state, putting additional coins in has no effect. However, a customer pushing through the arms, giving a push input, shifts the state back to Locked. The turnstile state machine can be represented by a state transition table, showing for each possible state the new state and the output resulting from each input. The turnstile state machine can also be represented by a directed graph called a state diagram. Each state is represented by a node. Edges show the transitions from one state to another; each arrow is labeled with the input that triggers it. An input that doesn't cause a change of state is represented by a circular arrow returning to the original state. The arrow into the Locked node from the black dot indicates that it is the initial state. A state is a description of the status of a system that is waiting to execute a transition. A transition is a set of actions to be executed when a condition is fulfilled or when an event is received.
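The turnstile described above can be sketched directly from its state transition table. The following is a minimal illustration, not a fixed API; the state and input names simply follow the text.

```python
# The turnstile FSM as a transition table: (state, input) -> next state.
TRANSITIONS = {
    ("Locked", "coin"): "Unlocked",    # depositing a coin unlocks the arms
    ("Locked", "push"): "Locked",      # pushing while locked has no effect
    ("Unlocked", "coin"): "Unlocked",  # extra coins have no effect
    ("Unlocked", "push"): "Locked",    # passing through locks the arms again
}

def run(inputs, state="Locked"):
    """Feed a sequence of inputs to the machine and return the final state."""
    for symbol in inputs:
        state = TRANSITIONS[(state, symbol)]
    return state

print(run(["coin", "push"]))          # a normal visit ends back in Locked
print(run(["push", "coin", "coin"]))  # pushing first fails; coins unlock
```

The self-loops in the state diagram (inputs that cause no change) appear here as entries whose next state equals the current state.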

For example, when using an audio system to listen to the radio, receiving a "next" stimulus results in moving to the next station. When the system is in the "CD" state, the "next" stimulus results in moving to the next track. Identical stimuli trigger different actions depending on the current state. In some finite-state machine representations, it is also possible to associate actions with a state: an entry action, performed when entering the state, and an exit action, performed when exiting the state. Several state transition table types are used. In the most common representation, the combination of current state and input determines the next state. The complete action information is not directly described in the table and can only be added using footnotes. An FSM definition including the full action information is possible using state tables. The Unified Modeling Language has a notation for describing state machines. UML state machines overcome the limitations of traditional finite state machines while retaining their main benefits.

UML state machines introduce the new concepts of hierarchically nested states and orthogonal regions, while extending the notion of actions. UML state machines have the characteristics of both Mealy machines and Moore machines: they support actions that depend on both the state of the system and the triggering event, as in Mealy machines, as well as entry and exit actions, which are associated with states rather than transitions, as in Moore machines. The Specification and Description Language is a standard from ITU that includes graphical symbols to describe actions in the transition: sending an event, receiving an event, starting a timer, cancelling a timer, starting another concurrent state machine, and making a decision. SDL embeds basic data types called "Abstract Data Types", an action language, and an execution semantic in order to make the finite state machine executable. There are a large number of variants to represent an FSM such as the one in figure 3. In addition to their use in modeling reactive systems

Cache hierarchy

Cache hierarchy, or multi-level caches, refers to a memory architecture which uses a hierarchy of memory stores based on varying access speeds to cache data. Highly requested data is cached in high-speed access memory stores, allowing swifter access by central processing unit (CPU) cores. Cache hierarchy is a form and part of memory hierarchy and can be considered a form of tiered storage. This design was intended to allow CPU cores to process faster despite the memory latency of main memory access. Accessing main memory can act as a bottleneck for CPU core performance as the CPU waits for data, while making all of main memory high-speed may be prohibitively expensive. High-speed caches are a compromise allowing high-speed access to the data most used by the CPU, permitting a faster CPU clock. In the history of computer and electronic chip development, there was a period when increases in CPU speed outpaced the improvements in memory access speed; the gap between the speed of CPUs and memory meant that the CPU would often sit idle.

CPUs were capable of running and executing larger amounts of instructions in a given time, but the time needed to access data from main memory prevented programs from fully benefiting from this capability. This issue motivated the creation of memory models with higher access rates in order to realize the potential of faster processors, resulting in the concept of cache memory, first proposed by Maurice Wilkes, a British computer scientist at the University of Cambridge, in 1965. He called such memory models "slave memory". Between 1970 and 1990, papers and articles by Anant Agarwal, Alan Jay Smith, Mark D. Hill, Thomas R. Puzak, and others discussed better cache memory designs. The first cache memory models were implemented at the time, but as researchers were investigating and proposing better designs, the need for faster memory models continued. This need resulted from the fact that although early cache models improved data access latency, with respect to cost and technical limitations it was not feasible for a computer system's cache to approach the size of main memory.

From 1990 onward, ideas such as adding another cache level as a backup for the first-level cache were proposed. Jean-Loup Baer, Wen-Hann Wang, Andrew W. Wilson, and others have conducted research on this model. When several simulations and implementations demonstrated the advantages of two-level cache models, the concept of multi-level caches caught on as a new and better model of cache memories. Since 2000, multi-level cache models have received widespread attention and are implemented in many systems, such as the three-level caches that are present in Intel's Core i7 products. Accessing main memory for each instruction execution may result in slow processing, with the effective speed depending on the time required to find and fetch the data. In order to hide this memory latency from the processor, data caching is used. Whenever the data is required by the processor, it is fetched from the main memory and stored in the smaller memory structure called a cache. If there is any further need of that data, the cache is searched first before going to the main memory.

This structure resides closer to the processor in terms of the time taken to search and fetch data with respect to the main memory. The advantages of using cache can be proven by calculating the average access time (AAT) for the memory hierarchy with and without the cache. Caches, being small in size, may result in frequent misses – when a search of the cache does not provide the sought-after information – resulting in a call to main memory to fetch data. Hence, the AAT is affected by the miss rate of each structure from which it searches for the data: AAT = hit time + (miss rate × miss penalty). The AAT for main memory is given by hit time_main memory. The AAT for a cache is given by hit time_cache + (miss rate_cache × miss penalty_cache), where the miss penalty is the access time of the next level down. The hit time for caches is much less than the hit time for main memory, so the AAT for data retrieval is lower when accessing data through the cache rather than main memory. While using the cache may improve memory latency, it may not always result in the required improvement for the time taken to fetch data, due to the way caches are organized and traversed. For example, direct-mapped caches that are the same size have a higher miss rate than fully associative caches.
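The AAT formula above applies recursively down the hierarchy. The following sketch evaluates it for a two-level cache; the cycle counts and miss rates are made-up illustrative numbers, not measurements of any real processor.

```python
# Average access time: AAT = hit time + miss rate * (cost of going below).
def aat(hit_time, miss_rate, lower_level_aat):
    """All times in cycles; lower_level_aat is the AAT of the next level."""
    return hit_time + miss_rate * lower_level_aat

MAIN_MEMORY = 100.0                       # cycles; misses stop here
l2 = aat(hit_time=10.0, miss_rate=0.05, lower_level_aat=MAIN_MEMORY)
l1 = aat(hit_time=1.0, miss_rate=0.10, lower_level_aat=l2)

print(f"L2 AAT: {l2:.1f} cycles")         # 10 + 0.05 * 100 = 15.0
print(f"L1 AAT: {l1:.1f} cycles")         # 1 + 0.10 * 15 = 2.5
```

With these assumed numbers, the hierarchy delivers an average access of 2.5 cycles against 100 cycles for uncached main memory, which is the compromise the text describes.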

This may depend on the benchmark used to test the processor and on the pattern of instructions. But using a fully associative cache may result in more power consumption, as it has to search the whole cache every time. Due to this, the trade-off between power consumption and the size of the cache becomes critical in cache design. In the case of a cache miss, the purpose of using such a structure will be rendered useless and the computer will have to go to the main memory to fetch the required data. However, with a multiple-level cache, if the computer misses the cache closest to the processor it will search through the next-closest level of cache and go to main memory only if these lookups fail. The general trend is to keep the L1 cache small and at a distance of 1–2 CPU clock cycles from the processor, with the lower levels of caches increasing in size to store more data than L1, hence being more distant but with a lower miss rate. This results in a better AAT. The number of cache levels can be designed by architects according to their requirements after checking for trade-offs between cost, AATs, and size.

With the

Floating-point arithmetic

In computing, floating-point arithmetic is arithmetic using formulaic representation of real numbers as an approximation so as to support a trade-off between range and precision. For this reason, floating-point computation is often found in systems which include very small and very large real numbers, which require fast processing times. A number is, in general, represented approximately to a fixed number of significant digits and scaled using an exponent in some fixed base. A number that can be represented exactly is of the following form: significand × base^exponent, where significand is an integer, base is an integer greater than or equal to two, and exponent is an integer. For example: 1.2345 = 12345 × 10^−4, with significand 12345, base 10 and exponent −4. The term floating point refers to the fact that a number's radix point can "float"; its position is indicated by the exponent component, and thus the floating-point representation can be thought of as a kind of scientific notation. A floating-point system can be used to represent, with a fixed number of digits, numbers of different orders of magnitude: e.g. the distance between galaxies or the diameter of an atomic nucleus can be expressed with the same unit of length.

The result of this dynamic range is that the numbers that can be represented are not uniformly spaced. Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s the most commonly encountered representations are those defined by the IEEE. The speed of floating-point operations, measured in terms of FLOPS, is an important characteristic of a computer system for applications that involve intensive mathematical calculations. A floating-point unit is a part of a computer system specially designed to carry out operations on floating-point numbers. A number representation specifies some way of encoding a number as a string of digits. There are several mechanisms by which strings of digits can represent numbers. In common mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character there. If the radix point is not specified, the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit.

In fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345. In scientific notation, the given number is scaled by a power of 10, so that it lies within a certain range – typically between 1 and 10, with the radix point appearing immediately after the first digit. The scaling factor, as a power of ten, is then indicated separately at the end of the number. For example, the orbital period of Jupiter's moon Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047×10^5 seconds. Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of: a signed digit string of a given length in a given base. This digit string is referred to as the significand, mantissa, or coefficient. The length of the significand determines the precision. The radix point position is assumed always to be somewhere within the significand – often just after or just before the most significant digit, or to the right of the rightmost digit.

This article follows the convention that the radix point is set just after the most significant digit. The floating-point number also includes a signed integer exponent. To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent – to the right if the exponent is positive or to the left if the exponent is negative. Using base-10 as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1,528,535,047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10^5 to give 1.528535047×10^5, or 152,853.5047. In storing such a number, the base need not be stored, since it will be the same for the entire range of supported numbers and can thus be inferred. Symbolically, this final value is: s / b^(p−1) × b^e, where s is the
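The significand-and-exponent decomposition described above can be demonstrated with Python's standard library. math.frexp factors a float as m × 2^e with 0.5 ≤ |m| < 1, a binary analogue of the base-10 convention in the text; the base-10 figures below simply reassemble the article's Io example.

```python
import math

value = 152853.5047
m, e = math.frexp(value)          # value == m * 2**e, exactly
print(m, e)                       # binary significand and exponent
assert m * 2**e == value          # frexp/ldexp round-trip is exact

# The text's base-10 example, reassembled from significand and exponent:
significand, exponent = 1.528535047, 5
print(significand * 10**exponent) # ~152853.5047, up to float rounding
```

Note that the base-10 reconstruction is only approximate in binary floating point, which is exactly the range/precision trade-off the section is about; the base-2 decomposition, by contrast, is exact because the machine's own base is 2.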

Processor (computing)

In computing, a processor or processing unit is an electronic circuit which performs operations on some external data source, usually memory or some other data stream. The term is most often used to refer to the central processor in a system, but typical computer systems combine a number of specialised processors:

CPU – central processing unit. If designed conforming to the von Neumann architecture, it contains at least a control unit, an arithmetic logic unit (ALU) and processor registers. In some contexts, the ALU and registers together are called the processing unit.
GPU – graphics processing unit
VPU – video processing unit
TPU – tensor processing unit
NPU – neural processing unit
PPU – physics processing unit
DSP – digital signal processor
ISP – image signal processor
SPU or SPE – synergistic processing element in the Cell microprocessor
FPGA – field-programmable gate array
Sound chip

Post–Turing machine

The article Turing machine gives a general introduction to Turing machines, while this article covers a specific class of Turing machines. A Post–Turing machine is a "program formulation" of a simple type of Turing machine, comprising a variant of Emil Post's Turing-equivalent model of computation described below. A Post–Turing machine uses a binary alphabet, an infinite sequence of binary storage locations, and a primitive programming language with instructions for bi-directional movement among the storage locations and alteration of their contents one at a time. The names "Post–Turing program" and "Post–Turing machine" were used by Martin Davis in 1973–1974. In 1980, Davis used the name "Turing–Post program". In his 1936 paper "Finite Combinatory Processes—Formulation 1", Emil Post described a model of extreme simplicity which he conjectured is "logically equivalent to recursiveness", and which was later proved to be so; the quotes in the following are from this paper. Post's model of a computation differs from the Turing-machine model in a further "atomization" of the acts a human "computer" would perform during a computation. Post's model employs a "symbol space" consisting of a "two-way infinite sequence of spaces or boxes", each box capable of being in either of two possible conditions, namely "marked" and "unmarked".

Finitely many of the boxes are marked, the rest being unmarked. A "worker" is to move among the boxes, being in and operating in only one box at a time, according to a fixed finite "set of directions", which are numbered in order. Beginning at a box "singled out as the starting point", the worker is to follow the set of instructions one at a time, beginning with instruction 1. The instructions may require the worker to perform the following "basic acts" or "operations": marking the box he is in; erasing the mark in the box he is in; moving to the box on his right; moving to the box on his left; and determining whether the box he is in is or is not marked. The i-th "direction" given to the worker is to be one of the following forms: perform operation Oi and follow direction ji; perform the determining operation and, according as the answer is yes or no, correspondingly follow direction ji′ or ji″; or stop. Post remarks that this formulation is "in its initial stages" of development, and mentions several possibilities for "greater flexibility" in its final "definitive form", including replacing the infinity of boxes by a finite extensible symbol space, "extending the primitive operations to allow for the necessary extension of the given finite symbol space as the process proceeds", using an alphabet of more than two symbols, "having more than one way to mark a box", and introducing finitely many "physical objects to serve as pointers, which the worker can identify and move from box to box".

As mentioned in the article Turing machine, Post, in his paper of 1947, atomized the Turing 5-tuples to 4-tuples: "Our quadruplets are quintuplets in the Turing development. That is, where our standard instruction orders either a printing or motion, left or right, Turing's standard instruction always orders a printing and a motion, left, right, or none". Like Turing, he defined erasure as printing a symbol "S0", and so his model admitted quadruplets of only three types: qi Sj L ql, qi Sj R ql, qi Sj Sk ql. At this time he was still retaining the Turing state-machine convention – he had not formalized the notion of an assumed sequential execution of steps until a specific test of a symbol "branched" the execution elsewhere. For a further reduction – to only four instructions – of the Wang model presented here, see Wang B-machine. Wang is cited as the source of the "program formulation" of binary-tape Turing machines using numbered instructions from the set: write 0, write 1, move left, move right, if scanning 0 goto instruction i, if scanning 1 goto instruction j, where sequential execution is assumed and Post's single "if... then... else" has been "atomised" into two "if... then..." statements.

Wang noted the following: "Since there is no separate instruction for halt, it is understood that the machine will stop when it has arrived at a stage that the program contains no instruction telling the machine what to do next." "In contrast with Turing who uses a one-way infinite tape that has a beginning, we are following Post in the use of a 2-way infinite tape." Unconditional gotos are easily derived from the above instructions, so "we can use them too". Any binary-tape Turing machine can be converted to an equivalent "Wang program" using the above instructions. Martin Davis was an undergraduate student of Emil Post's. Along with Stephen Kleene, he completed his PhD under Alonzo Church. He presented the following model in a series of lectures to the Courant Institute at NYU in 1973–1974. This is the model to which Davis formally applied the name "Post–Turing m
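A Wang-style program over the instruction set listed above – write 0, write 1, move left, move right, and the two conditional gotos – can be simulated in a few lines. This is an illustrative sketch; the tuple encoding of instructions and the zero-indexed program counter are assumptions of this sketch, while the halting convention (stop when control runs past the program) follows Wang's remark quoted above.

```python
from collections import defaultdict

def run(program, tape=None):
    """Simulate a Wang program on a two-way infinite binary tape."""
    tape = defaultdict(int, tape or {})   # unwritten boxes read as 0
    head, pc = 0, 0
    while 0 <= pc < len(program):         # halt: no instruction remains
        op = program[pc]
        if op == "write0":   tape[head] = 0
        elif op == "write1": tape[head] = 1
        elif op == "left":   head -= 1
        elif op == "right":  head += 1
        elif op[0] == "if0":              # ("if0", i): goto i if scanning 0
            if tape[head] == 0:
                pc = op[1]; continue
        elif op[0] == "if1":              # ("if1", j): goto j if scanning 1
            if tape[head] == 1:
                pc = op[1]; continue
        pc += 1                           # sequential execution otherwise
    return dict(tape)

# Mark three boxes in a row, then halt by running off the program's end.
prog = ["write1", "right", "write1", "right", "write1"]
print(run(prog))   # {0: 1, 1: 1, 2: 1}
```

A loop needs only a conditional goto: the program [("if0", 4), "write0", "right", ("if1", 1)] erases a run of marked boxes, branching back to instruction 1 while it keeps scanning 1s.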

Vector processor

In computing, a vector processor or array processor is a central processing unit that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the 1990s. As of 2015, most commodity CPUs implement architectures that feature instructions for a form of vector processing on multiple data sets. Common examples include Intel x86's MMX, SSE and AVX instructions, AMD's 3DNow! extensions, Sparc's VIS extension, PowerPC's AltiVec and MIPS' MSA. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.
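The contrast between scalar and vector instruction styles can be illustrated in miniature. Here pure Python stands in for hardware instructions, which is of course only a conceptual analogy: the scalar version issues one add per data item, while the vector version expresses the whole one-dimensional array operation at once.

```python
def scalar_add(a, b):
    """Scalar style: one add 'instruction' per data item."""
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out

def vector_add(a, b):
    """Vector style: one conceptual operation over whole vectors."""
    return [x + y for x, y in zip(a, b)]

a, b = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
print(scalar_add(a, b))   # [11.0, 22.0, 33.0, 44.0]
print(vector_add(a, b))   # same result, expressed as one whole-array op
```

On real vector hardware the second form is what a single vector instruction encodes, letting the machine fetch and decode one instruction for many data elements instead of repeating the fetch-decode work per element.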

In 2000, IBM, Toshiba and Sony collaborated to create the Cell processor. Other CPU designs include some multiple instructions for vector processing on multiple data sets, typically known as MIMD and realized with VLIW; the Fujitsu FR-V VLIW/vector processor combines both technologies. Vector processing development began in the early 1960s at Westinghouse in their "Solomon" project. Solomon's goal was to increase math performance by using a large number of simple math co-processors under the control of a single master CPU. The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in the form of an array. In 1962, Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS.

It nevertheless showed that the basic concept was sound, and, when used on data-intensive applications such as computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs and is often referred to under a separate category, massively parallel computing. A computer for operations with functions was presented and developed by Kartsev in 1967. The first successful implementation of vector processing appears to be the Control Data Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer (ASC). The basic ASC ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching 20 MFLOPS, achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks they could keep up while being much smaller and less expensive.

However, the machine took considerable time decoding the vector instructions and getting ready to run the process, so it required specific data sets to work on before it actually sped anything up. The vector technique was first fully exploited in 1976 by the famous Cray-1. Instead of leaving the data in memory like the STAR and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had separate pipelines for different instructions; for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions themselves to be pipelined, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS – far faster than any machine of the era.

Other examples followed. Control Data Corporation tried to re-enter the high-end market with its ETA-10 machine, but it sold poorly, and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (Fujitsu and Nippon Electric Corporation) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon-based Floating Point Systems built add-on array processors for minicomputers, later building their own minisupercomputers. Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed the Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as a vector processor.

Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card t