Parallel slowdown is a phenomenon in parallel computing in which parallelization of a program beyond a certain point causes it to run slower. Parallel slowdown is typically the result of a communications bottleneck: as more processor nodes are added, each processing node spends progressively more time on communication than on useful processing. At some point, the communications overhead created by adding another processing node surpasses the increased processing power that node provides. Parallel slowdown occurs when the algorithm requires significant communication, particularly of intermediate results. Some problems, known as embarrassingly parallel problems, do not require such communication and so do not suffer parallel slowdown. The mythical man-month describes an analogous situation for a team of programmers, whose productivity is limited by human communication.
Synchronization (computer science)
In computer science, synchronization refers to one of two distinct but related concepts: synchronization of processes, and synchronization of data. Process synchronization refers to the idea that multiple processes are to join up or handshake at a certain point; data synchronization refers to the idea of keeping multiple copies of a dataset in coherence with one another, or of maintaining data integrity. Process synchronization primitives are commonly used to implement data synchronization. The need for synchronization does not arise merely in multi-processor systems but for any kind of concurrent processes, even in single-processor systems. Mentioned below are some of the needs for synchronization. Forks and joins: when a job arrives at a fork point, it is split into sub-jobs. After being serviced, each sub-job waits until all other sub-jobs are done processing; they are then joined again and leave the system. Thus, in parallel programming, we require synchronization, as parallel processes must wait for several other processes to occur. Producer-consumer: in a producer-consumer relationship, the consumer process is dependent on the producer process until the necessary data has been produced.
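The producer-consumer relationship described above can be sketched with a bounded, thread-safe queue. This is a minimal illustration (the item values, buffer size, and sentinel convention are arbitrary choices for the example), not a prescribed implementation:

```python
import queue
import threading

q = queue.Queue(maxsize=4)  # bounded buffer shared by both threads

def producer():
    for item in range(5):
        q.put(item)   # blocks if the buffer is full
    q.put(None)       # sentinel: nothing more will be produced

def consumer(results):
    while True:
        item = q.get()  # blocks until the producer has made data
        if item is None:
            break
        results.append(item * 2)  # "process" the item

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 2, 4, 6, 8]
```

The blocking `put` and `get` calls are the synchronization: the consumer cannot run ahead of the producer, and the producer cannot overrun the buffer.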
This reduces concurrency. Access by processes to a critical section is controlled by using synchronization techniques: when one thread starts executing the critical section, the other threads should wait until the first thread finishes. For example, suppose there are three processes, namely 1, 2, and 3, all executing concurrently and needing to share a common resource as shown in Figure 1. Synchronization should be used here to avoid any conflicts in accessing this shared resource: when Process 1 and Process 2 both try to access the resource, it should be assigned to only one process at a time. If it is assigned to Process 1, the other process needs to wait until Process 1 frees the resource. Another synchronization requirement which needs to be considered is the order in which particular processes or threads should be executed. For example, we cannot board a plane until we buy a ticket; similarly, we cannot check emails without validating our credentials; in the same way, an ATM will not provide any service until we provide it with a correct PIN.
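Protecting a critical section so that only one thread executes it at a time, as described above, can be sketched with a lock. A minimal Python illustration (the worker count and iteration count are arbitrary):

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:        # critical section: only one thread at a time
            counter += 1  # read-modify-write on the shared resource

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 30000 -- without the lock, concurrent updates could be lost
```

Without the lock, two threads could read the same value of `counter` and both write back the same incremented value, losing one update.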
Other than mutual exclusion, synchronization also deals with situations such as busy waiting, in which a process repeatedly polls to determine whether it may proceed; this frequent polling robs processing time from other processes. One of the challenges for exascale algorithm design is to minimize or reduce synchronization, which takes more time than computation, especially in distributed computing. Reducing synchronization has drawn attention from computer scientists for decades, and it has become an increasingly significant problem as the gap between the improvement of computation and that of latency increases. Experiments have shown that on distributed computers, synchronization takes a dominant share of the running time of a sparse iterative solver.
Distributed computing is a field of computer science that studies distributed systems. A distributed system is a model in which components located on networked computers communicate and coordinate their actions by passing messages; the components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications. A computer program that runs in a distributed system is called a distributed program. There are many alternatives for the message passing mechanism, including pure HTTP and RPC-like connectors. Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing. The terms are also used in a much wider sense, even referring to autonomous processes that run on the same physical computer.
The entities communicate with each other by message passing. A distributed system may have a common goal, such as solving a large computational problem; the user then perceives the collection of autonomous processors as a unit. Other typical properties of distributed systems include the following. The system has to tolerate failures in individual computers. The structure of the system is not known in advance, and the system may consist of different kinds of computers and network links. Each computer has only a limited, incomplete view of the system, and each computer may know only one part of the input. Distributed systems are groups of networked computers which share a common goal for their work. The terms concurrent computing, parallel computing, and distributed computing have a lot of overlap: the same system may be characterized both as parallel and distributed, and the processors in a typical distributed system run concurrently in parallel. Parallel computing may be seen as a tightly coupled form of distributed computing.
In distributed computing, each processor has its own private memory; information is exchanged by passing messages between the processors. The figure on the right illustrates the difference between distributed and parallel systems: the figure shows a parallel system in which each processor has direct access to a shared memory. The situation is further complicated by the traditional uses of the terms parallel algorithm and distributed algorithm, which do not quite match the above definitions of parallel and distributed systems. The use of concurrent processes that communicate by message passing has its roots in operating system architectures studied in the 1960s; the first widespread distributed systems were local-area networks such as Ethernet, which was invented in the 1970s.
In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common resource, problems may arise with incoherent data. In the illustration on the right, consider that both clients have a copy of a particular memory block from a previous read. Suppose the client on the bottom updates that memory block: the client on the top could then be left with a stale cache of memory, without any notification of the change. Cache coherence is intended to manage such conflicts by maintaining a coherent view of the values in multiple caches: when one of the copies of data is changed, the other copies must reflect that change. Cache coherence is the discipline which ensures that the changes in the values of shared operands are propagated throughout the system in a timely fashion. The following are the requirements for coherence. Write propagation: changes to the data in any cache must be propagated to the other copies in the peer caches. Transaction serialization: reads and writes to a single memory location must be seen by all processors in the same order.
Theoretically, coherence can be performed at the load/store granularity; in practice, however, it is generally performed at the granularity of cache blocks. Coherence defines the behavior of reads and writes to a single address location. In a multiprocessor system, consider that more than one processor has cached a copy of the memory location X. Propagating every write to X to all of these caches ensures that they maintain a coherent view of the memory; if processor P1 could read the old value of X even after the write by P2, the write propagation criterion required for cache coherence would be violated. Write propagation alone, however, is not sufficient, as it does not satisfy the transaction serialization condition. Suppose processor P1 changes the value of a location S to 10, following which processor P2 changes the value of S in its own cached copy to 20. If we ensure only write propagation, P3 and P4 will certainly see the changes made to S by P1 and P2, but they may observe the two writes in different orders; the processors P3 and P4 then have an incoherent view of the memory.
In other words, if location X received two different values A and B, in that order, from any two processors, the processors can never read location X as B and then later read it as A: the location X must be seen with values A and B in that order. The only difference between a cache-coherent system and a sequentially consistent system is in the number of address locations the definition talks about: cache coherence constrains the order of accesses to a single memory location, while sequential consistency constrains the order across all locations.
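The write-propagation requirement above can be illustrated with a toy software model of write-invalidate caches. This is purely illustrative (real coherence is implemented in hardware by protocols such as MESI, and this sketch ignores serialization among simultaneous writers):

```python
class ToyCache:
    """Toy write-invalidate model: a write to a location
    invalidates every peer cache's copy of that location."""

    def __init__(self, memory):
        self.memory = memory  # shared backing store
        self.peers = []       # the other caches in the system
        self.lines = {}       # locally cached address -> value

    def read(self, addr):
        if addr not in self.lines:    # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.memory[addr] = value     # write through to memory
        self.lines[addr] = value
        for peer in self.peers:       # propagate: invalidate stale copies
            peer.lines.pop(addr, None)

memory = {"X": 0}
c1, c2 = ToyCache(memory), ToyCache(memory)
c1.peers, c2.peers = [c2], [c1]

c1.read("X"); c2.read("X")  # both caches now hold X = 0
c2.write("X", 42)           # c1's stale copy is invalidated
print(c1.read("X"))         # 42, not the stale 0
```

Without the invalidation loop in `write`, `c1` would keep returning 0 after `c2`'s update, which is exactly the incoherence the text describes.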
Concurrency (computer science)
In computer science, concurrency is the decomposability property of a program, algorithm, or problem into order-independent or partially-ordered components or units. This means that even if the concurrent units of the program, algorithm, or problem are executed out of order or in partial order, the final outcome remains the same. This allows for parallel execution of the concurrent units, which can significantly improve the overall speed of execution on multi-processor and multi-core systems. The ensuing decades have seen a growing interest in concurrency, particularly in distributed systems. Looking back at the origins of the field, what stands out is the role played by Edsger Dijkstra. Concurrent use of shared resources can be a source of indeterminacy, leading to issues such as deadlocks and resource starvation. Concurrency theory has been an active field of research in theoretical computer science. One of the first proposals was Carl Adam Petri's seminal work on Petri nets in the early 1960s; in the years since, a wide variety of formalisms have been developed for modeling and reasoning about concurrency.
Some of these formalisms are based on message passing, while others have different mechanisms for concurrency. The proliferation of different models of concurrency has motivated some researchers to develop ways to unify these different theoretical models. Various types of temporal logic can be used to help reason about concurrent systems. Some of these logics, such as linear temporal logic and computational tree logic, allow assertions to be made about the sequences of states that a concurrent system can pass through; others, such as action computational tree logic, Hennessy-Milner logic, and Lamport's temporal logic of actions, build their assertions from sequences of actions. The principal application of these logics is in writing specifications for concurrent systems. Concurrent programming encompasses programming languages and algorithms used to implement concurrent systems. The base goals of concurrent programming include correctness and robustness; concurrent systems such as operating systems and database management systems are generally designed to operate indefinitely, including automatic recovery from failure, and not to terminate unexpectedly.
Because they use shared resources, concurrent systems in general require the inclusion of some kind of arbiter somewhere in their implementation. The use of arbiters introduces the possibility of indeterminacy in concurrent computation, which has major implications for practice, including correctness and performance. Some concurrent programming models include coprocesses and deterministic concurrency; in these models, threads of control explicitly yield their timeslices, either to the system or to another process.
Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data-level parallelism, but not concurrency: there are simultaneous computations, but only a single instruction at a given moment. SIMD is particularly applicable to common tasks like adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions in order to improve the performance of multimedia use. Vector processing was especially popularized by Cray in the 1970s and 1980s. The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1; these machines had many limited-functionality processors that would work in parallel. Supercomputing moved away from the SIMD approach when inexpensive scalar MIMD approaches based on commodity processors such as the Intel i860 XP became more powerful; the current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market.
Sun Microsystems introduced SIMD integer instructions in its VIS instruction set extensions in 1995; MIPS followed suit with its similar MDMX system. The first widely deployed desktop SIMD was Intel's MMX extensions to the x86 architecture in 1996, and this sparked the introduction of the much more powerful AltiVec system in the Motorola PowerPC and IBM POWER systems. Intel responded in 1999 by introducing the all-new SSE system; since then, there have been several extensions to the SIMD instruction sets for both architectures. A modern supercomputer is almost always a cluster of MIMD machines, while a modern desktop computer is often a multiprocessor MIMD machine where each processor can execute short-vector SIMD instructions. An application that may take advantage of SIMD is one where the same value is being added to a large number of data points. One example is changing the brightness of an image: each pixel of an image consists of three values for the brightness of the red, green, and blue portions of the color.
To change the brightness, the R, G, and B values are read from memory, a value is added to them, and the resulting values are written back to memory. With a SIMD processor there are two improvements to this process. First, the data is understood to be in blocks: instead of a series of instructions saying "retrieve this pixel, now retrieve the next pixel", a SIMD processor will have a single instruction that effectively says "retrieve n pixels". For a variety of reasons, this can take much less time than retrieving each pixel individually. Another advantage is that the instruction operates on all loaded data in a single operation: if the SIMD system works by loading up eight data points at once, the operation is applied to all eight values at the same time. Not all algorithms can be vectorized easily. Batch-pipeline systems are most advantageous for cache control when implemented with SIMD intrinsics, but they are not exclusive to SIMD features.
Parallel computing is a type of computation in which many calculations, or the execution of processes, are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture. Specialized parallel computer architectures are sometimes used alongside traditional processors for accelerating specific tasks. Communication and synchronization between the different subtasks are typically some of the greatest obstacles to getting good parallel program performance. A theoretical upper bound on the speed-up of a single program as a result of parallelization is given by Amdahl's law. Traditionally, computer software has been written for serial computation: to solve a problem, an algorithm is constructed and implemented as a serial stream of instructions.
These instructions are executed on a central processing unit on one computer; only one instruction may execute at a time, and after that instruction is finished, the next one is executed. Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, or specialized hardware. Frequency scaling was the dominant reason for improvements in performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Holding everything else constant, increasing the clock frequency decreases the average time it takes to execute an instruction; an increase in frequency thus decreases runtime for all compute-bound programs.
However, power consumption P by a chip is given by the equation P = C × V² × F, where C is the capacitance being switched per clock cycle, V is voltage, and F is the processor frequency; increases in frequency thus increase the amount of power used in a processor. Moore's law is the observation that the number of transistors in a microprocessor doubles every 18 to 24 months, and it has continued to hold despite power consumption issues and repeated predictions of its end. With the end of frequency scaling, these additional transistors can be used to add extra hardware for parallel computing. Optimally, the speedup from parallelization would be linear: doubling the number of processing elements should halve the runtime. However, very few parallel algorithms achieve optimal speedup.
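Amdahl's law, mentioned above as the theoretical upper bound on speedup, can be written S(n) = 1 / ((1 − p) + p/n), where p is the parallelizable fraction of the program and n the number of processing elements. A small illustrative calculation (the 0.95 fraction is an arbitrary example value):

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup when a fraction p of the work
    is parallelizable across n processing elements."""
    return 1.0 / ((1.0 - p) + p / n)

# With 95% of the program parallelizable, even unbounded numbers of
# processors cannot push the speedup past 1 / 0.05 = 20x.
print(round(amdahl_speedup(0.95, 8), 2))      # 5.93
print(round(amdahl_speedup(0.95, 10**9), 2))  # 20.0
```

This is why speedup is sublinear in practice: the serial fraction 1 − p dominates as n grows.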
A superscalar processor is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows more throughput than would otherwise be possible at a given clock rate. A superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. Each execution unit is not a separate processor but an execution resource within a single CPU, such as an arithmetic logic unit. In Flynn's taxonomy, a single-core superscalar processor is classified as an SISD processor, though many superscalar processors support short vector operations; a multi-core superscalar processor is classified as an MIMD processor. While a superscalar CPU is typically also pipelined, pipelining and superscalar execution are considered different performance enhancement techniques. The Motorola MC88100, the Intel i960CA, and the AMD 29000-series 29050 microprocessors were the first commercial single-chip superscalar microprocessors. RISC microprocessors like these were the first to have superscalar execution, because RISC architectures free up transistors and die area which can be used to include multiple execution units.
Except for CPUs used in low-power applications, embedded systems, and battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. The simplest processors are scalar processors: each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items; an analogy is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two: each instruction processes one data item, but there are multiple execution units within each CPU, so multiple instructions can be processing separate data items concurrently. Superscalar CPU design emphasizes improving the accuracy of the instruction dispatcher and allowing it to keep the multiple execution units in use at all times. This has become increasingly important as the number of units has increased: while early superscalar CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system suffers; a well-fed superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle.
In a superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel; a superscalar processor can therefore be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread. The performance improvement available from superscalar techniques is limited by three key areas: the degree of intrinsic parallelism in the instruction stream; the complexity and time cost of dependency-checking logic and register-renaming circuitry; and the processing of branch instructions. Existing binary executable programs have varying degrees of intrinsic parallelism.
In computer architecture, Gustafson's law gives the theoretical speedup in latency of the execution of a task at fixed execution time that can be expected of a system whose resources are improved. It is named after computer scientist John L. Gustafson and his colleague Edwin H. Barsis. Gustafson's law proposes that programmers tend to set the size of problems to fully exploit the computing power that becomes available as the resources improve; therefore, if faster equipment is available, larger problems can be solved within the same time. The impact of Gustafson's law was to shift research goals toward selecting or reformulating problems so that solving a larger problem in the same amount of time would be possible. In a way, the law redefines efficiency, due to the possibility that limitations imposed by the sequential part of a program may be countered by increasing the total amount of computation. Consider, for example, a computer program that processes files from disk: a part of that program may scan the directory of the disk and create a list of files internally in memory.
After that, another part of the program passes each file to a separate thread for processing. The part that scans the directory and creates the file list cannot be sped up on a parallel computer. The execution workload of the whole task before the improvement of the resources of the system is denoted W; it includes the execution workload of the part that does not benefit from the improvement of the resources and the execution workload of the part that does benefit from it. The fraction of the workload that would benefit from the improvement of the resources is denoted by p; the fraction concerning the part that would not benefit from it is therefore 1 − p. The execution of the part that benefits from the improvement of the resources is sped up by a factor s after the improvement, while the execution workload of the part that does not benefit from it remains the same. Under a fixed-size-problem view, an analysis of the data will simply take less time given more computing power; Gustafson, on the other hand, argues that more computing power will cause the data to be more carefully and fully analyzed, pixel by pixel or unit by unit.
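Using the symbols just defined, the theoretical scaled speedup under Gustafson's law is S = (1 − p) + s·p. A minimal sketch (the 0.95 fraction and factor of 64 are arbitrary example values):

```python
def gustafson_speedup(p, s):
    """Scaled speedup: the fraction p of the workload benefits from the
    resource improvement and is sped up by a factor s; the remaining
    fraction 1 - p is unchanged (symbols as defined in the text)."""
    return (1 - p) + s * p

# 95% of the (scaled) workload benefiting, sped up 64x:
print(round(gustafson_speedup(0.95, 64), 2))  # 60.85
```

Note the contrast with Amdahl's law: here the speedup grows almost linearly in s, because the problem size is assumed to grow with the available resources rather than staying fixed.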
Amdahl's law reveals a limitation in, for example, the ability of multiple cores to reduce the time it takes for a computer to boot to its operating system and be ready for use. Assuming the boot process was mostly parallel, quadrupling computing power on a system that took one minute to load might reduce the boot time to just over fifteen seconds, but greater and greater parallelization would eventually fail to make bootup go any faster. Gustafson's law argues that a fourfold increase in computing power would instead lead to a similar increase in expectations of what the system will be capable of: if the one-minute load time is acceptable to most users, that is a starting point from which to increase the features and functions of the system, while the time taken to boot remains the same, i.e. one minute. Some problems do not have fundamentally larger datasets; as an example, the load of processing one data point per world citizen grows at only a few percent per year.
Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s; the rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the 1990s. As of 2015, most commodity CPUs implement architectures that feature instructions for a form of vector processing on multiple data sets. Common examples include Intel x86's MMX, SSE, and AVX instructions, Sparc's VIS extension, and PowerPC's AltiVec; vector processing techniques also operate in video-game console hardware and in graphics accelerators. Other CPU designs may include multiple instructions for vector processing on multiple data sets, typically known as MIMD. Such designs are usually dedicated to a particular application and not commonly marketed for general-purpose computing. The Fujitsu FR-V VLIW/vector processor combines both technologies. Vector processing development began in the early 1960s at Westinghouse in their Solomon project.
Solomon's goal was to dramatically increase math performance by using a large number of simple math co-processors under the control of a single master CPU. The CPU fed a single instruction to all of the arithmetic logic units, one per cycle, allowing the Solomon machine to apply a single algorithm to a large data set. In 1962, Westinghouse cancelled the project, but the effort was restarted at the University of Illinois as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but the machine as delivered fell well short of that goal. Nevertheless, it showed that the concept was sound when used on data-intensive applications, such as computational fluid dynamics. The ILLIAC approach of using separate ALUs for each data element is not common to later designs. A computer for operations with functions was presented and developed by Kartsev in 1967. The first successful implementations of vector processing appear to be the Control Data Corporation STAR-100 and the Texas Instruments Advanced Scientific Computer. Expanded ALU configurations supported two pipes or four pipes with a corresponding 2X or 4X performance gain, and memory bandwidth was sufficient to support these expanded modes.
The STAR was otherwise slower than CDC's own supercomputers like the CDC 7600. The vector technique was first fully exploited in 1976 by the famous Cray-1: instead of leaving the data in memory like the STAR and ASC, the Cray design had eight vector registers, and the vector instructions were applied between registers, which is much faster than talking to main memory.
A supercomputer is a computer with a high level of computing performance compared to a general-purpose computer. Performance of a supercomputer is measured in floating-point operations per second (FLOPS) instead of instructions per second. As of 2015, there are supercomputers which can perform up to quadrillions of FLOPS. The Sunway TaihuLight tops the rankings in the TOP500 supercomputer list, and its emergence is notable for its use of indigenous chips; as of June 2016, for the first time, China had more computers on the TOP500 list than the United States. However, U.S.-built computers held ten of the top 20 positions, and in November 2016 the U.S. had five of the top 10. Throughout their history, supercomputers have been essential in the field of cryptanalysis. The use of multi-core processors combined with centralization is an emerging trend. The history of supercomputing goes back to the 1960s, with the Atlas at the University of Manchester and a series of computers at Control Data Corporation (CDC), designed by Seymour Cray.
These used innovative designs and parallelism to achieve superior computational peak performance. Cray left CDC in 1972 to form his own company, Cray Research; four years later, he delivered the 80 MHz Cray-1 in 1976. The Cray-2, released in 1985, was an 8-processor liquid-cooled computer in which Fluorinert was pumped as it operated; it performed at 1.9 gigaflops and was the second fastest after the M-13 supercomputer in Moscow. Fujitsu's Numerical Wind Tunnel supercomputer used 166 vector processors to gain the top spot in 1994, with a speed of 1.7 gigaFLOPS per processor. The Hitachi SR2201 obtained a performance of 600 GFLOPS in 1996 by using 2048 processors connected via a fast three-dimensional crossbar network. The Intel Paragon could have 1000 to 4000 Intel i860 processors in various configurations; the Paragon was a MIMD machine which connected processors via a high-speed two-dimensional mesh, allowing processes to execute on separate nodes, communicating via the Message Passing Interface.
Approaches to supercomputer architecture have taken dramatic turns since the earliest systems were introduced in the 1960s. Early supercomputer architectures pioneered by Seymour Cray relied on compact innovative designs and local parallelism to achieve superior computational peak performance. However, in time the demand for increased computational power ushered in the age of massively parallel systems: supercomputers of the 21st century can use over 100,000 processors connected by fast interconnects. The Connection Machine CM-5 supercomputer is a massively parallel processing computer capable of many billions of arithmetic operations per second. Throughout the decades, the management of heat density has remained a key issue for most centralized supercomputers; the large amount of heat generated by a system may also have other effects, e.g. reducing the lifetime of other system components. There have been diverse approaches to heat management, from pumping Fluorinert through the system to liquid and air cooling schemes. Systems with a massive number of processors generally take one of two paths.