Sunday 30 March 2014

28/03/2014

VECTOR PROCESSING (CONT.)



MEMORY:


MORE MEMORY GIVES MORE POWER TO THE SYSTEM AND THE USER.


THE SYSTEM CAN DO MORE WORK AND CAN OPERATE ON LARGER DATA.


MORE MEMORY -> MORE PARALLEL PROCESSING -> MORE DATA CAN FLOW IN


FOR THIS:

1) MEMORY ORGANISATION IS NEEDED (E.G. IN VECTOR PROCESSING, BY INCREASING BANDWIDTH).


2) DATA PLACEMENT IN MEMORY MATTERS (E.G. IN MATRIX MULTIPLICATION ON A 4D HYPERCUBE, DATA PLACEMENT IS IMPORTANT).


MASKING:


IF SOME PROCESSORS ARE NOT REQUIRED, MASKING IS DONE SO THAT THOSE PROCESSORS REMAIN INACTIVE. THIS CAN PROVIDE BETTER PERFORMANCE IN SOME CASES.


EXAMPLE:


AN AFFINITY MASK IS A BIT MASK INDICATING WHAT PROCESSOR(S) A THREAD OR PROCESS SHOULD BE RUN ON BY THE SCHEDULER OF AN OPERATING SYSTEM. UNDER WINDOWS THERE ARE SEVERAL SYSTEM PROCESSES (ESPECIALLY ON DOMAIN CONTROLLERS) THAT ARE RESTRICTED TO THE FIRST CPU / CORE. SO, SETTING THE AFFINITY MASK FOR CERTAIN PROCESSES TO EXCLUDE THE FIRST CPU MIGHT LEAD TO BETTER APPLICATION PERFORMANCE. [SOURCE: WIKIPEDIA]


IN PREFIX COMPUTATION, THE NUMBER OF REQUIRED PROCESSORS DECREASES AT EACH STEP, SO SOME PROCESSORS MUST BE MASKED.


IN THE CASE OF A SPARSE MATRIX, NOT ALL VALUES NEED TO BE LOADED. THIS CAN BE EXPLOITED TO IMPROVE EFFICIENCY.



THIS CAN BE DONE BY USING A COMBINATION OF 'GATHER' AND 'SCATTER'.


IF X IS THE SPARSE VECTOR, Y IS THE DENSE MATRIX AND INDEX IS A VECTOR THAT STORES THE LOCATIONS OF THE NON-ZERO VALUES IN Y, THEN


GATHER:  ASSIGNS  X[i] = Y[INDEX[i]]

SCATTER: ASSIGNS  Y[INDEX[i]] = X[i]


(NOTE: THE NOTATION FOR X AND Y MIGHT BE CONFUSING. THE CONDENSED VECTOR X IS CALLED THE SPARSE VECTOR, WHILE THE MATRIX Y, WHICH HAS A VAST MAJORITY OF ZERO VALUES, IS CALLED THE DENSE VECTOR OR MATRIX. BEATS ME!!!)
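As an illustration (not from the lecture), the two operations can be sketched in plain Python, using the X, Y and INDEX names defined above:

```python
# Gather/scatter sketch for sparse-vector access (names follow the notes).
y = [0, 0, 5, 0, 7, 0, 0, 3]                    # mostly-zero vector in memory
index = [i for i, v in enumerate(y) if v != 0]  # locations of non-zeros

# GATHER: x[i] = y[index[i]] -- pack the non-zero values into a condensed vector
x = [y[j] for j in index]

# ... operate only on the condensed vector x ...
x = [v * 2 for v in x]

# SCATTER: y[index[i]] = x[i] -- write the results back to their original slots
for i, j in enumerate(index):
    y[j] = x[i]

print(y)  # -> [0, 0, 10, 0, 14, 0, 0, 6]
```

Only the non-zero elements are ever loaded into the working vector, which is exactly the efficiency gain described above.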


VECTOR PROCESSORS HAVE HARDWARE SUPPORT FOR GATHER-SCATTER OPERATIONS, PROVIDING INSTRUCTIONS SUCH AS LOAD VECTOR INDEXED FOR GATHER AND STORE VECTOR INDEXED FOR SCATTER.


NOTE:


THE ISA SHOULD BE SIMPLE. COMPLEX INSTRUCTIONS DIRECTLY IMPLY WASTED CHIP SPACE.

THE ISA SHOULD NOT BE CHANGED TO INCLUDE COMPLEX INSTRUCTIONS THAT WILL RARELY BE USED. EXAMPLE: IT WOULD NOT MAKE SENSE TO CHANGE THE ISA OF A NORMAL PROCESSOR TO INCLUDE GATHER AND SCATTER INSTRUCTIONS THAT WILL BE USED ONCE IN A BLUE MOON (MAAM'S WORDS).


USING MORE CACHE, MORE REGISTERS AND PIPELINING, AND REMOVING COMPLEX INSTRUCTIONS, REMOVES THE NEED FOR COMPLEX HARDWARE.

HARDWARE CATERING TO SUCH COMPLEX INSTRUCTIONS IS RARELY USED. GIVING THAT SILICON SPACE TO MEMORY IS A BETTER OPTION.


FOR EXAMPLE:      R[1:N] = (V1[1:N] + S) * V2[1:N]


[FIGURE: IMG_20140331_014446113.jpg]


AS SHOWN ABOVE, THIS METHOD IS COSTLY.


TO IMPROVE, THE COMPILER AND THE HARDWARE MUST WORK TOGETHER ON THE PROBLEM.


THE SAME WORK CAN BE DONE WITH FEWER PROGRAMMABLE REGISTERS. IF THE COMPILER COOPERATES AND THE HARDWARE HAS SPECIAL FEATURES, V3 WILL NOT BE REQUIRED.

IF SUCH A VECTORISING COMPILER IS NOT USED, A COSTLY ARCHITECTURE IS NEEDED, ALONG WITH A (HIGHLY PAID) PROGRAMMER WHO CAN THINK IN PARALLEL.
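A minimal Python sketch of this example, contrasting the naive two-pass form (which needs a temporary vector register V3) with the fused one-pass form the text describes (values here are arbitrary):

```python
# Computing R[1:N] = (V1[1:N] + S) * V2[1:N] in two ways.
N = 8
S = 10
V1 = list(range(N))
V2 = [2] * N

# Naive two-pass version: an extra vector register V3 holds the intermediate
V3 = [v + S for v in V1]
R_two_pass = [a * b for a, b in zip(V3, V2)]

# Fused one-pass version: each element flows through both operations,
# so no V3 is required -- this is what compiler/hardware cooperation buys
R_fused = [(v1 + S) * v2 for v1, v2 in zip(V1, V2)]

assert R_two_pass == R_fused
```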


VECTORISING A PROGRAM:


do i1 = x1, y1

    do i2 = x2, y2

        ...

            S1:  A[f1(i1, i2, ..., in), ..., fm(i1, i2, ..., in)] = ...

n – NUMBER OF LOOP INDICES (ITERATIONS)

m – NUMBER OF ARRAY DIMENSIONS

USUALLY, n = m

SUPPOSE THE STATEMENT READS

A[f1(i1, i2, ..., in), ..., fm(i1, i2, ..., in)] = B[g1(i1, ..., in), g2(i1, ..., in), ..., gm(i1, ..., in)]


NOW TAKING AN EXAMPLE


do i = 1 to 10

    A[i] = A[i-1]


THIS RESULTS IN SOMETHING LIKE A[1] = A[0]; A[2] = A[1]; A[3] = A[2]; AND SO ON.


THUS THERE IS A FLOW DEPENDENCE: A VALUE WRITTEN ON THE LHS IN ONE ITERATION IS READ ON THE RHS IN A LATER ITERATION.
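The dependence can be demonstrated with a short Python sketch (illustrative only): the sequential loop propagates one value down the chain, while a naive "all at once" version that reads the old values gives a different answer, which is why the compiler must detect the dependence before vectorising.

```python
# Sequential execution of: do i = 1 to 10: A[i] = A[i-1]
A = [42] + [0] * 10          # A[0] seeds the chain

for i in range(1, 11):       # i = 1 .. 10
    A[i] = A[i - 1]          # A[1]=A[0]; A[2]=A[1]; ... (flow dependence)

# A naive "vectorised" version that reads the OLD values of A in parallel:
A_old = [42] + [0] * 10
A_naive = [A_old[0]] + [A_old[i - 1] for i in range(1, 11)]

assert A == [42] * 11        # the chain propagates 42 everywhere
assert A_naive != A          # the naive parallel version is wrong
```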


WE CAN REPRESENT THIS IN ITERATION SPACE:

A[F(X)] = B[G(Y)]


X, Y = ITERATION VECTORS (IN LEXICOGRAPHIC ORDER)


F(X), G(Y) = SUBSCRIPT FUNCTIONS OF THE ITERATION VECTOR


A[F(X)] = THE DATA ELEMENT ACCESSED


Yi – Xi = DISTANCE VECTOR COMPONENT FOR EACH DIMENSION i


3 CASES POSSIBLE:

1) Xi < Yi

2) Xi > Yi

3) Xi = Yi


THE COMPILER NEEDS TO ADDRESS SUCH ISSUES.
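A hypothetical Python sketch of the per-dimension distance-vector test, covering the three cases above (the iteration vectors used are arbitrary examples):

```python
# X is the iteration that writes, Y the iteration that reads.
def distance_vector(X, Y):
    # Yi - Xi for every dimension i
    return tuple(yi - xi for xi, yi in zip(X, Y))

def classify(d):
    # the three cases listed above, per dimension
    return tuple("Xi < Yi" if di > 0 else "Xi > Yi" if di < 0 else "Xi = Yi"
                 for di in d)

d = distance_vector((1, 3), (2, 3))
print(d, classify(d))  # -> (1, 0) ('Xi < Yi', 'Xi = Yi')
```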


SUBMITTED BY :


SIDHANT JAIN ( 339/CO/11)

SIKANDER MAHAN ( 340/CO/11)




Thursday 27 March 2014

Summary of 27th March Class

CONCURRENT ACCESS


Concurrent computing is a form of computing in which several computations are executing during overlapping time periods – concurrently – instead of sequentially (one completing before the next starts). This is a property of a system – this may be an individual program, a computer, or a network – and there is a separate execution point or "thread of control" for each computation ("process"). A concurrent system is one where a computation can make progress without waiting for all other computations to complete – where more than one computation can make progress at "the same time".

The ability to gain admittance to a system or component by more than one user or process. For example, concurrent access to a computer means multiple users are interacting with the system simultaneously.

The lower-order 4 bits are used to select the memory module, while the higher-order bits are used to select the address within the module.

For example, if we select the 1st module and the 1st address, the equivalent is 0001 0001, which equals 17.

If we select the 2nd module and the 1st address, the equivalent is 0001 0010, which equals 18.
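This address split can be sketched in Python (assuming, as in the example, a 4-bit module field):

```python
MODULE_BITS = 4  # low-order bits select the module, per the example above

def split_address(addr, module_bits=MODULE_BITS):
    module = addr & ((1 << module_bits) - 1)  # low-order bits -> module number
    offset = addr >> module_bits              # high-order bits -> address in module
    return module, offset

def join_address(module, offset, module_bits=MODULE_BITS):
    return (offset << module_bits) | module

assert join_address(1, 1) == 17   # module 1, address 1 -> 0001 0001
assert join_address(2, 1) == 18   # module 2, address 1 -> 0001 0010
assert split_address(17) == (1, 1)
```

Consecutive addresses land in consecutive modules, which is what lets a vector of consecutive elements be fetched from the modules concurrently.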


References: Wikipedia



τ = time taken by one major cycle

m = number of modules

n = length of the vector


τ/m = time taken by one stage of the pipeline

(τ/m)(m + n − 1) = total time

(τ/m)(m + n − 1)(1/n) = average time per element

= (τ/m)((m − 1)/n + 1)

So, when n → ∞, average time → τ/m

When n = m, total time = (τ/m)(2m − 1) ≈ 2τ


Also, we haven't considered the time taken by instruction fetching. Accounting for it, the effective bandwidth becomes ¼ of the ideal.

New average time = 4τ/m
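The timing formulas above can be checked numerically with a small Python sketch (tau, m and n are the quantities defined above; the numeric values are arbitrary):

```python
# tau = one major cycle, m = number of modules (pipeline stages), n = vector length.
def total_time(tau, m, n):
    # (tau/m)(m + n - 1): fill the pipeline, then one result per stage time
    return (tau / m) * (m + n - 1)

def average_time(tau, m, n):
    # total time divided by the n elements produced
    return total_time(tau, m, n) / n

tau, m = 8.0, 4

# As n -> infinity, the average time per element approaches tau/m
assert average_time(tau, m, n=10**6) - tau / m < 1e-4

# When n = m, the total time is (tau/m)(2m - 1), i.e. approximately 2*tau
assert abs(total_time(tau, m, n=m) - 2 * tau) <= tau / m
```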


IMPLEMENTATION

A number of different methods can be used to implement concurrent programs, such as implementing each computational execution as an operating system process, or implementing the computational processes as a set of threads within a single operating system process.






SEQUENTIAL ACCESS

In computer science, sequential access means that a group of elements (such as data in a memory array or a disk file or on magnetic tape data storage) is accessed in a predetermined, ordered sequence. Sequential access is sometimes the only way of accessing the data, for example if it is on a tape. It may also be the access method of choice, for example if all that is wanted is to process a sequence of data elements in order.




STRIP MINING


When a vector is longer than the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip-mining. One vector segment is processed at a time. In the case of Cray computers, the segment length is 64 elements.


Until all the vector elements in a segment are processed, the vector register cannot be assigned to another vector operation. Strip mining is restricted by the number of available vector registers, and so is vector chaining.
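A Python sketch of strip-mining (illustrative; `VLEN = 64` follows the Cray segment length mentioned above):

```python
VLEN = 64  # vector register length: one segment per pass

def strip_mined_add(a, b):
    """Add two long vectors one fixed-length segment at a time."""
    n = len(a)
    out = [0] * n
    for start in range(0, n, VLEN):      # outer loop: one pass per segment
        end = min(start + VLEN, n)       # last segment may be shorter
        for i in range(start, end):      # inner loop: one vector operation
            out[i] = a[i] + b[i]
    return out

a = list(range(200))                     # 200 > 64, so four segments are needed
b = [1] * 200
assert strip_mined_add(a, b) == [x + 1 for x in a]
```

On real hardware the inner loop would be a single vector instruction operating on a full register; here it is an ordinary loop standing in for it.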


References: Kai Hwang





SUBMITTED BY:

Shubham Gupta 336/CO/11

Siddhant Sanjeev 337/CO/11


Wednesday 26 March 2014

Class Summary-26/3/2014


Review of Significant Concepts : 

1)SIMD : 

Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Thus, such machines exploit data-level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.



2) Type of parallelism :

     a) Spatial parallelism: duplicate hardware performs multiple tasks at once. (SIMD)


     b) Temporal parallelism: the task is broken into multiple stages. (Pipelining)


Example: Ben is baking cookies. It takes 5 minutes to roll a tray of cookies and 10 minutes to bake them. The simplest way for Ben to do this is to roll a tray, put it in the oven, wait, and take it out after 10 minutes. This is a sequential procedure and wastes a lot of time.


Q) What if Ben uses parallelism?

A) He could do so in two ways:

  1) Spatial parallelism: Ben asks his sister Alyssa to help using her own oven. The diagram for this scenario is:


  2) Temporal parallelism: Ben breaks the task into two stages: rolling and baking. While the first batch is baking, he rolls the next batch, and so on. The diagram for this scenario is:

Clearly, time is saved in both of the above processes, and hence parallelism is used in modern-day computing.

3) Pipelining of functional units :

Apart from instruction pipelining, functional units like floating-point adders can also be pipelined. The functional unit is broken down into stages and the same process is followed as in instruction pipelining. Let's take the example of a floating-point adder for adding two normalised floating-point numbers. The whole process can be divided into four stages:

  

1) Compare the exponents of the two inputs by subtraction.

2) Align the mantissas if the exponents are unequal.

3) Add the mantissas.

4) Normalise the result.

The diagram is as follows : 

The figure above is self-explanatory. If any doubts persist, please drop a comment.
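A toy Python sketch of the four stages, on (mantissa, exponent) pairs with the mantissa kept in [1, 2) — illustrative only, ignoring signs, rounding and special values:

```python
def fp_add(m1, e1, m2, e2):
    """Add m1*2^e1 + m2*2^e2 using the four pipeline stages above."""
    # Stage 1: compare exponents by subtraction
    diff = e1 - e2
    # Stage 2: align mantissas if the exponents are unequal (shift the smaller)
    if diff > 0:
        m2, e2 = m2 / (2 ** diff), e1
    elif diff < 0:
        m1, e1 = m1 / (2 ** -diff), e2
    # Stage 3: add mantissas
    m, e = m1 + m2, e1
    # Stage 4: normalise the result back into [1, 2)
    while m >= 2:
        m, e = m / 2, e + 1
    while 0 < m < 1:
        m, e = m * 2, e - 1
    return m, e

# 1.5 * 2^3 + 1.0 * 2^1 = 12 + 2 = 14 = 1.75 * 2^3
assert fp_add(1.5, 3, 1.0, 1) == (1.75, 3)
```

In a pipelined adder, each of the four stages works on a different pair of operands in the same cycle, so a new sum can be started every cycle.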


4) Road ahead : 

Our laptops, despite having these functional units, cannot be used to operate on vectors. The reason is the absence of vector registers. Even with pipelined functional units, their inputs come from scalar registers sequentially, which hinders fast computation on vectors.


Solution: what if we have an architecture with scalar as well as vector registers? This architecture is known as a vector processor and is explained below.

Vector Processors :

A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid rise in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the late 1990s.


Typical instruction set for the vector processors may include instructions like : 

1) V1 ← 0 : initialise all elements of vector V1 to 0. (V1 is a vector.)

2) V3 ← V1 × V2 : cross product of the two vectors V1 and V2, stored in V3. (V1, V2, V3 are all vectors.)

3) A ← V1 · V2 : dot product of the two vectors V1 and V2, stored in the scalar A. (V1, V2 are vectors and A is a scalar.)

and so on....
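These instruction semantics can be mimicked in Python, with lists standing in for 3-element vector registers (a hypothetical illustration, not actual vector-processor code):

```python
V1 = [1.0, 0.0, 0.0]   # unit vector along x
V2 = [0.0, 1.0, 0.0]   # unit vector along y

# V1 <- 0 : set every element of a vector register to zero
V0 = [0.0] * 3

# V3 <- V1 x V2 : cross product into another vector register
V3 = [V1[1] * V2[2] - V1[2] * V2[1],
      V1[2] * V2[0] - V1[0] * V2[2],
      V1[0] * V2[1] - V1[1] * V2[0]]

# A <- V1 . V2 : dot product into a scalar register
A = sum(a * b for a, b in zip(V1, V2))

assert V3 == [0.0, 0.0, 1.0]   # x cross y = z
assert A == 0.0                # x . y = 0 (orthogonal)
```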

Some Vector processors :


Motivation :

Maths problems involving physical processes present different difficulties for computation

Aerodynamics, seismology, meteorology

Continuous field simulation

High precision.

Repeated floating point calculations on large arrays of numbers

Supercomputers handle these types of problem

Hundreds of millions of flops

— costs $10-15 million.

Optimised for calculation rather than multitasking and I/O

Limited market

Research, government agencies, meteorology

Vector processor

Alternative to supercomputer

Just run vector portion of problems

HARDWARE NEEDS :

1) Vector registers, as well as ways to store vectors in memory.

2) Scalar registers, as well as ways to store scalars in memory.

3) Pipelined functional units.


Vector-access memory schemes :

The flow of vector operands between main memory and vector registers is usually pipelined with multiple access paths. We have so far discussed the S-access memory organisation.


S-Access memory organisation :

The higher-order address bits (b) are used to select among the various memory modules. The lower-order address bits (a) are used to access a particular element of the selected memory module. At the end of the memory cycle, m = 2^a words are latched into the data buffers simultaneously. The low-order a bits are then used to multiplex the m words out, one per minor cycle.

The diagram is as follows  :



For concurrent access, the lower-order bits are multiplexed while choosing the memory module. Please do read about C-access too.


SUBMITTED BY :-

SHASHANK GUPTA

SHAURYA SHARMA


Sources :

Advanced Computer Architecture by Kai Hwang