Parallel programming made easy

New chip design makes parallel programs run many times faster and requires one-tenth the code.

Illustration: Christine Daniloff/MIT

Computer chips have stopped getting faster. For the past decade, performance improvements have come instead from multi-core processors. A multi-core processor is a single computing component with two or more independent processing units, called cores, which read and execute program instructions. The instructions are ordinary CPU instructions such as add, move, and divide, but because the cores can run multiple instructions at the same time, they increase overall speed for programs that are amenable to parallel computing.

Most computer programs, however, are sequential, and breaking them up so that parts of them can run in parallel causes all kinds of complications.

A few days ago, I wrote about ‘KiloCore: World’s first 1,000-Processor Chip’. KiloCore is a chip consisting of 1,000 independently programmable processors. It contains 621 million transistors and has a maximum computation rate of 1.78 trillion instructions per second. A major advantage of this chip is that it can execute instructions more than 100 times more efficiently.

Researchers from the Massachusetts Institute of Technology’s (MIT) Computer Science and Artificial Intelligence Laboratory (CSAIL) recently developed a new chip design known as ‘Swarm’. The design not only makes parallel programs more efficient but also makes them easier to write.

The researchers compared Swarm versions of six common algorithms with existing parallel versions, each of which had been developed separately by experienced software developers. The Swarm versions were between 3 and 18 times faster, yet they generally required only one-tenth as much code, or even less. In one case, Swarm achieved a 75-fold speedup on a program that computer scientists had so far failed to parallelize.

Daniel Sanchez, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science, said, “Multicore systems are really hard to program. You have to explicitly divide the work that you’re doing into tasks, and then you need to enforce some synchronization between tasks accessing shared data. What this architecture does, essentially, is to remove all sorts of explicit synchronization, to make parallel programming much easier. There’s an especially hard set of applications that have resisted parallelization for many, many years, and those are the kinds of applications we’ve focused on in this paper.”

Many of those applications involve exploring what computer scientists call graphs. A graph consists of nodes (typically depicted as circles) and edges (typically depicted as line segments connecting the nodes). Frequently, the edges have weights, which might represent the strength of correlations between data points in a dataset, or the distances between cities.

Graphs crop up in a wide range of computer science problems, but their most intuitive use may be to describe geographic relationships. One of the algorithms the researchers evaluated is the standard algorithm for finding the fastest driving route between two points.
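To make the idea of a weighted graph and a fastest-route search concrete, here is a minimal Python sketch. It assumes the "standard algorithm" is Dijkstra's shortest-path algorithm, which the article does not name; the road map and the function name fastest_route are made up for illustration.

```python
import heapq

# Hypothetical road map: node -> list of (neighbor, travel time in minutes).
ROADS = {
    "A": [("B", 7), ("C", 2)],
    "B": [("A", 7), ("C", 3), ("D", 1)],
    "C": [("A", 2), ("B", 3), ("D", 8)],
    "D": [("B", 1), ("C", 8)],
}


def fastest_route(graph, start, goal):
    """Return (total_weight, path) for the lowest-weight route, or None."""
    # The frontier is a priority queue ordered by the cost of the path so far.
    frontier = [(0, start, [start])]
    settled = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled:
            continue  # a cheaper path to this node was already processed
        settled.add(node)
        for neighbor, weight in graph[node]:
            if neighbor not in settled:
                heapq.heappush(
                    frontier, (cost + weight, neighbor, path + [neighbor])
                )
    return None  # goal is unreachable from start


route = fastest_route(ROADS, "A", "D")  # (6, ["A", "C", "B", "D"])
```

Note that the search always expands the cheapest known path first. That ordering is what makes the algorithm correct, and it is also what makes it awkward to split across cores.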

Setting priorities:

Exploring graphs would seem to be something that could be parallelized: different cores could explore different regions of a graph, or different paths through it, at the same time. The problem is that with most graph-exploring algorithms, it gradually becomes clear that whole regions of the graph are irrelevant to the problem at hand. If, right off the bat, cores are assigned to explore those regions, their efforts end up being fruitless.

Fruitless exploration of irrelevant regions is a problem for sequential graph-exploring algorithms too, not just parallel ones. Computer scientists have therefore developed a host of application-specific techniques for prioritizing graph exploration. An algorithm might start by exploring just those paths whose edges have the lowest weights, for example, or it might look first at the nodes with the lowest number of edges.

Swarm has extra circuitry for handling that type of prioritization. It time-stamps tasks according to their priorities and begins working on the highest-priority tasks in parallel. Higher-priority tasks may spawn their own lower-priority tasks, but Swarm slots those into its queue of tasks automatically.

Sometimes, tasks running in parallel come into conflict. For example, a task with a lower priority may write data to a particular memory location before a higher-priority task has read the same location. In such cases, Swarm automatically cancels the results of the lower-priority task. It thus maintains the synchronization between cores accessing the same data that programmers previously had to worry about themselves.
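The conflict case above can be imitated in software with an undo log. The Python sketch below is only a toy model and not how the chip is built: the class SpeculativeMemory and its methods are hypothetical, a smaller timestamp stands for a higher priority, and the model assumes at most one uncommitted writer per address.

```python
class SpeculativeMemory:
    """Toy memory that can undo writes made by a later-timestamped task."""

    def __init__(self, initial):
        self.data = dict(initial)
        self.writer = {}    # address -> timestamp of the uncommitted writer
        self.undo_log = {}  # timestamp -> [(address, previous value), ...]
        self.aborted = []   # timestamps of tasks that were rolled back

    def write(self, timestamp, address, value):
        # Remember the old value so the write can be cancelled later.
        # Simplification: at most one uncommitted writer per address.
        log = self.undo_log.setdefault(timestamp, [])
        log.append((address, self.data[address]))
        self.data[address] = value
        self.writer[address] = timestamp

    def read(self, timestamp, address):
        # If a lower-priority (later-timestamped) task already wrote here,
        # cancel that task's results before the read proceeds.
        later_writer = self.writer.get(address)
        if later_writer is not None and later_writer > timestamp:
            self.abort(later_writer)
        return self.data[address]

    def abort(self, timestamp):
        # Restore old values in reverse order of the task's writes.
        for address, previous in reversed(self.undo_log.pop(timestamp, [])):
            self.data[address] = previous
            self.writer.pop(address, None)
        self.aborted.append(timestamp)


memory = SpeculativeMemory({"x": 1})
memory.write(timestamp=20, address="x", value=99)  # lower priority runs ahead
seen = memory.read(timestamp=10, address="x")      # higher priority reads late
```

After these three calls, seen is 1 and memory.aborted is [20]. A fuller model would also re-run the aborted task later, which this sketch leaves out.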

From the programmer’s perspective, using Swarm is pretty painless. When the programmer defines a function, he or she simply adds a line of code that loads the function into Swarm’s queue of tasks. The programmer does have to specify the metric (such as edge weight or number of edges) that the program uses to prioritize tasks, but that would be necessary anyway. Generally, adapting an existing sequential algorithm to Swarm requires adding only a few lines of code.
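As a rough illustration of that programming model, the Python sketch below emulates a timestamp-ordered task queue in software. It runs tasks one at a time, so it shows only the ordering semantics and not the parallel execution. The names TaskQueue, enqueue, visit, and shortest_distances are hypothetical and are not Swarm's actual interface; the graph is made up.

```python
import heapq
import itertools


class TaskQueue:
    """Software stand-in for a timestamp-ordered task queue (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for equal timestamps

    def enqueue(self, timestamp, func, *args):
        heapq.heappush(self._heap, (timestamp, next(self._order), func, args))

    def run(self):
        # Run tasks in timestamp order; tasks may enqueue further tasks.
        while self._heap:
            timestamp, _, func, args = heapq.heappop(self._heap)
            func(self, timestamp, *args)


# Hypothetical weighted graph: node -> list of (neighbor, edge weight).
GRAPH = {
    "A": [("B", 7), ("C", 2)],
    "B": [("A", 7), ("C", 3), ("D", 1)],
    "C": [("A", 2), ("B", 3), ("D", 8)],
    "D": [("B", 1), ("C", 8)],
}


def visit(queue, timestamp, node, graph, distance):
    if node in distance:
        return  # an earlier-timestamped task already reached this node
    distance[node] = timestamp
    for neighbor, weight in graph[node]:
        # The one added line: hand the follow-up work to the queue, using
        # the path length so far as the priority metric.
        queue.enqueue(timestamp + weight, visit, neighbor, graph, distance)


def shortest_distances(graph, source):
    distance = {}
    queue = TaskQueue()
    queue.enqueue(0, visit, source, graph, distance)
    queue.run()
    return distance


distances = shortest_distances(GRAPH, "A")
```

Here the priority metric is the distance travelled so far. Each task only says what should happen next and at what timestamp, and the queue decides the order.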

Keeping tabs:

The Swarm chip has additional circuitry to store and manage its queue of tasks. It also has a circuit that records the memory addresses of all the data its cores are currently working on. That circuit implements a Bloom filter, which compresses data into a fixed amount of space and answers yes/no questions about its contents. If too many addresses are loaded into the filter, it will occasionally yield false positives, indicating “yes, I’m storing that address” when it is not, but it will never yield false negatives.
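A Bloom filter is easy to sketch in software. The Python version below is a minimal illustration, not the chip's circuit: the bit-array size, the number of hash functions, and the use of SHA-256 to derive bit positions are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size set summary: no false negatives, occasional false positives."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python integer used as a bit array

    def _positions(self, address):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{address}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, address):
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def might_contain(self, address):
        # True means "possibly stored"; False means "definitely not stored".
        return all((self.bits >> pos) & 1 for pos in self._positions(address))


in_use = BloomFilter()
for address in (0x1000, 0x2040, 0x3F80):
    in_use.add(address)
```

Every added address always answers True, because adding only sets bits and never clears them. As more addresses are added, more bits are set, so an address that was never added becomes more likely to answer True as well.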

The Bloom filter helps Swarm detect memory access conflicts. The researchers were also able to show that time-stamping makes synchronization between cores easier to enforce. For example, each data item is labeled with the time stamp of the last task that updated it, so tasks with later time stamps know they can read that data without bothering to determine who else is using it.

Finally, all the cores occasionally report the time stamps of the highest-priority tasks that they are still executing. If a core has finished tasks with earlier time stamps than any of those reported by the other cores, it knows it can write its results to memory without causing any conflicts.
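The commit rule in the last paragraph can be written down in a few lines. The Python sketch below is one reading of it under stated assumptions: smaller time stamps mean higher priority, the function name committable and the dictionary layout are hypothetical, and a core's own unfinished tasks are counted alongside those of the other cores.

```python
def committable(finished, executing):
    """Return, per core, the finished tasks that are safe to commit.

    finished:  core name -> time stamps of finished, uncommitted tasks
    executing: core name -> time stamps of tasks still running
    """
    running = [stamp for stamps in executing.values() for stamp in stamps]
    # Earliest time stamp still in flight anywhere; nothing earlier than
    # this can still be affected by a running task.
    horizon = min(running) if running else float("inf")
    return {
        core: sorted(stamp for stamp in stamps if stamp < horizon)
        for core, stamps in finished.items()
    }


finished = {"core0": [3, 9], "core1": [5]}
executing = {"core0": [12], "core1": [7]}
safe = committable(finished, executing)
```

With these made-up numbers, the earliest task still running has time stamp 7, so the tasks stamped 3 and 5 can be committed while the task stamped 9 has to wait.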

Luis Ceze, an associate professor of computer science and engineering at the University of Washington, said, “I think their architecture has just the right aspects of past work on transactional memory and thread-level speculation. ‘Transactional memory’ refers to a mechanism to make sure that multiple processors working in parallel don’t step on each other’s toes. It guarantees that updates to shared memory locations occur in an orderly way. Thread-level speculation is a related technique that uses transactional memory ideas for parallelization: Do it without being sure the task is parallel, and if it’s not, undo and re-execute serially. Sanchez’s architecture uses many good pieces of those ideas and technologies in a creative way.”