Tuesday, 25 February 2014

Insight of 25 February class lecture

Parity calculation by supposing : 8 bits are there and if bits are taken binary wise then seven processors are required .

Therefore In case of

·         PRAM model time calculated would be log(s)

·         Serial processor time calculated is = s

  Size of instruction=s

 

 

 

                                                                    

Hence speedup in PRAM = s/log(s)

 Now let us consider a realistic situation i.e for a linear array and n processors

 

Computation time = s/n

 

Total time O(n) to collect all results  therefore total time = s/n + n

 Differentiating with respect to n the above equation we get : n = under root s

Now substitute the value of n , total time = 2 ( under root of s)

 

 Scalability =

 speedup of realistic machine/ speedup of PRAM model = log(s)/2(under root of s) < 1 ; ideally it should be = 1

 on similar line we can calculate scalability of hypercube , mess ,  etc.

in case the expression of scalability has log(n) , this indicates a good scalability.

 

Decreasing order of scalability :

 Hypercube > Mess > linear array

  • Hypercube has the greatest scalability in this case due to its compact structure.
  • Mess has a lesser scalability because it has loose ends.

 

 

PIPELINING AND THE OVERHEADS INVOLVED

 In pipelining overheads don’t occur due to network or IPCs  .

 Speedup = kn/ [ k + (n-1)]

For efficiency divide the expression by k . this is done because additional latches are added to the chip.

Therefore efficiency = n / k+n-1

‘pipelining

 increases the CPU instruction throughput - the number of instructions completed per unit of time. But it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction due to overhead in the pipeline control. 
The increase in instruction throughput means that a program runs faster and has lower total execution time.
’’ Refer http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/perform.html

Throughput = [n/ k + n-1 ] x f ]

 

Maximum throughput can be achieved only if  instructions occur sequentially

Here    k =no of steps     , f = no. of frequency of operations

 

 

 

 

Overheads in instruction pipelining

  • Stalling
  • Branching

 

Branching

As soon as branching occurs all the instructions already in the pipeline are flushed out and its filled again . hence the first instruction now has to again go for K steps . this is for unconditional branching .

 

Let :

Q = probability of branch being successful

P = probability of branch ; b = variable no. of stage  ; for b=k-1 ( maximum stage)

  

Now speedup considering the overhead due to branching;

 n/ (k + n-1+ {PQnb})

 

Asymptotic throughput=  f/ 1 + PQk  ( therefore longer the pipeline , more the branching , lower the throughput )

 

FOR CONDITIONAL BRANCHING : MECHANISM CALLED PREDICTION IS DONE .

  • STATIC PREDICTION : all the branches will be taken . its advantage is less no. of wrong predictions .
  • Profile the direction of  branches by checking and stimulating .
  • Real time system which depends on inputs from the environment around it . therefore there can be no particular prediction as the direction will keep on changing . its dynamic in nature.  For this Branch Target Buffer  comes into play . it records history of every branching . hence the next branching possibility can be predicted with the help of state diagram.  Refer to the  following link       http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html
  • Speculation is another way where the system lets the branches and execute all the instructions. The instruction which  weren’t executed are poisoned  and hence not commit to it before write back .
  • Initiated by compiler : whenever a branch instruction comes insert  delay slots i.e.  NOP  which don’t depend on branch . Hence pipelined instructions will not be flushed.

 

 

SUPER SCALAR PIPELINED PROCESSOR  : multiple pipeline units in a single processor as in laptops.

 

To make the super scalar pipelined processor temporal increase the no. of stages and it becomes SUPER PIPELINED

Refer to wikipedia 

   http://en.wikipedia.org/wiki/Superscalar for more insights .

 

 

 BY TINA BISWAS  & SURAJ GUPTA

Monday, 24 February 2014

Insight into class lecture of 24th feb 2014

Summary of February 24th class

 

AMDAHL'S LAW

 

''Amdahl's law, also known as Amdahl's argument, is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used In parallel to predict the theoretical maximum speedup using multiple processors."

                                             

 Graph : Speedup vs Number of Processors


  FORMULAE DESCRIBING AMDAHL'S LAW:

 speed up = T(1)/T(n) =  [ I (B+ 1-B)/ rate ]/  [I( B + (1-B)/n ) /rate]

Here I = total no. of instructions,

B = ratio of sequential instructions to total instructions, n = no. of processors

Rate here is considered to be constant and no overheads have been considered in Amdahl's law

Therefore if a more realistic situation is considered the entire process would also include overheads, as a function of

  • Size of Network
  • Size of problem
  • No. of processors

 

 Hence the Speedup in such case would be

 = work at all modes / ((Wi / i) [i/n] + O(s,n))

Wi = work at ith mode, i = the ith mode, { [ ] denotes ceiling }

O(s,n) = function of size of problem and no. of processor

Therefore a graph is plotted for speed up versus B, a hyperbolic graph is obtained which proves that with increase of processors the system losses its parallism on a very fast rate  .

 To overcome this drawback Gustafson Law was introduced .

If the rates of n processors are r1 , r2, r3, r4 ,…..,rn.

Then the average rate is the harmonic mean of the individual rates.

Moreover, the average rate is kind of a weighted harmonic mean.

Rate avg  =  1/[(f1/r1) + (f2/r2) + (f3/r3) + …… + (fn/rn)]

         Where fi is the number of instructions executed by ith processor.

** It ws still not clear whether we should multiply the numerator with F or not        where F = total number of instructions.

 

Gustafson Law

 

''Gustafson's Law is a law in computer science which says that computations involving arbitrarily large data sets can be efficiently parallelized'''. Wikipedia

''Thus Gustafson's law seems to rescue parallel processing from Amdahl's law''Wikipedia

ExploreAmdahl's law

"Gustafsons law states that increasing processors gives linear speed up''eecis.udel.edu

ExploreSpeedup

According to Gustafson Law, proportionality of parallelism can be maintained by increasing the parallelizable workload as the work assigned gets distributed among the increasing no. of processors.

 Wikipidea

The graph shows linear increase and doesn't saturate unlike in case of Amdahl's law.

FORMULAE DESCRIBING Gustafson Law :

SPEEDUP = [B + n(1 -B)] / [B + n(1-B)/n]

Here

B= ratio of sequential instructions to total instructions , n = no. of processors

Efficiency = E = Speedup/ no. of processors

Redundency =R  = O(n) / O (1)

Utilization = E x R

Quality of Parallism = (speedup ) x E/R

 Now if we express the Quality of parallism in terms of time

Quality of parallism = T(1)

 

INSIGHTS PROVIDED:

  • FOR MAINTAINING THE PROPORTIONALITY OF PARALLISM ON INCREASING THE NO.OF PROCESSORS MORE WORK SHOULD BE ASSIGNED
  • FOR PURELY AMDAHL'S LAW , NOT WIE TO INCREASE PROCESSORS BEYOND A CERTAIN LIMIT.

 

 

Submitted by : Suraj Gupta and Tina Biswas

Thursday, 20 February 2014

Summary of 20th Feb class

Static networks VS Dynamic Networks

 

Static Networks:-

·         In a static network, the connection between input and output nodes is fixed and cannot be changed.

·         Static interconnection network cannot be reconfigured.

·         The examples of this type of network are linear array, ring, chordal ring, tree, star, fat tree, mesh, tours, systolic arrays, and hypercube.

·         This type of interconnection networks are more suitable for building computers where the communication pattern is more or less fixed, and can be implemented with static connections.

Dynamic Networks:-

·         In dynamic network the interconnection, pattern between inputs and outputs can be changed.

·         The interconnection pattern can be reconfigured according to the program demands.

·         Here, instead of fixed connections, the switches or arbiters are used.

·         Examples of such networks are buses, crossbar switches, and multistage networks.

·         The dynamic networks are normally used in shared memory(SM) multiprocessors.

 

Wormhole Routing:-

A property of a message passing system in which each part of a message is transmitted independently and one part can be forwarded to the next node before the whole message has been received.

 All parts of a single message follow the same route. The independent parts are normally small and this reduces latency and the storage  requirements on each node when compared with message switching where a node receives the whole message before it starts to forward it to the next node.

It is more complex than message switching because each node must keep track of the messages currently following through it.

Wormhole routing is applied to packets in packet switching system so that forwarding of a packet starts as soon as its destination is known, before the whole packet had arrived.

• A message is broken into multiple packets (each packet has header information that allows the receiver to re-construct the original message)

• A packet may itself be broken into flits – flits do not contain additional headers.

• Two packets can follow different paths to the destination. Flits are always ordered and follow the same path

• Such an architecture allows the use of a large packet size (low header overhead) and yet allows fine-grained resource allocation on a per-flit basis.

The characteristics of wormhole switching system include:

  • Large network packets are divided into small flits
  • Every router is assigned small buffers of one to a few flits.
  • The header flit defines the network path for all the other flits in the network pipeline.
  • The buffer sequence and links that are already occupied by a given packet of flits constitutes the wormhole system. Typically, the length of the network path is measured in accordance to the number of flits in a single packet.

Let the size of a Flit be F and the size of message be L. N are the no. of processors.

No. of flits = L/F

So, at each stage of a linear network,there would be transfer delay of F/W where W is the Bandwidth.(Propagation delay is neglected)

Total Time = F*N/W + (L-F)*1/W

                = F*(N-1)/W + L/W

This time is much less than the transfer time(N*L/W) obtained when L bits of data was passed through the n/w.

 

 

  • If the output channel is busy, the entire flits chain - including the header – can become stuck blocking communication via the transmission path.

The key advantages of wormhole switching include:

  • Working with small, cheap, simple and comparatively fast routers.
  • Suitable for long messages.
  • Employing only input buffering.
  • Throughput is snowballing because it does not buffer the entire network packet to move toward the next node.
  • Bandwidth and channel allocation are generally decoupled.

Disadvantages of wormhole switching include:

  • Blocks resources in cases of stuck network pipelines
  • Prone to deadlock.
  • Inability to support backtracking.

 

Why shared memory is faster than message passing:

In message passing suppose we have thousands or so instructions then steps taken to pass it to the another processors are

--First we need to prepare the message

--Then it need to be passed to the IO ,that is call to the OS

--Another processor will receive it in its IO, hence another call to the OS

And as we know that IO are handle through OS and drivers through some program or instructions. So in need to pass only a single message we saw many other instructions are also need to be execute. This is the case for only one instructions if this number increases, we can imagine how much time would increase but in case of shred memory every processor can access memory of any of the other processor no need to involve OS and device drivers or execute these kind of extra instructions

That is why message passing is slower than shared memory.

But yes there are problems with shared memory too

1).Inconsistency

2).Synchronization

However dynamic networks which we discussed before for shared memory can be use for message passing by using memory/buffer as channels to pass the message. Here two processors will communicate through these buffers. One processor sent the message to other processor through this intermediate buffer. But there should be control over which one is accessing to avoid situations like one is reading and another is writing simultaneously then some errors can occur.

To avoid this situation let us suppose we’ve 3 processors and 2 buffers. First processor will send message a message to its buffer , second buffer will act like a network will transfer this message to the buffer of another processor.

Similarly we can implement shared memory through message passing. But one can imagine the time taken as we know that if a processor need to communicate with the memory or another processor then it will be required to invoke OS calls and involve device drivers etc. hence it is not good idea to implement a shared memory over message passing.

 

Distributed shared memory :-

Distributed Shared Memory (DSM) is a form of memory architecture where the (physically separate) memories can be addressed as one (logically shared) address space.  Here, the term shared does not mean that there is a single centralized memory but shared essentially means that the address space is shared (same physical address on two processors refers to the same location in memory). Distributed Global Address Space (DGAS), is a similar term for a wide class of software and hardware implementations, in which each node of a cluster has access to shared memory in addition to each node's non-shared private memory.

 

Scalability:-

scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth. For example, it can refer to the capability of a system to increase its total output under an increased load when resources (typically hardware) are added. 

 

Scalability  = Speed up on PRAM model/Speed up on realistic machine

 

Submitted by:-

Uzma and Sweety Yadav


Wednesday, 19 February 2014

Lecture summary for 19th of Feb

Various types of topologies used for static

                                 networks

There are various types of static networks, all of which are characterized by their node degree;

Node degree  is the number of links (edges) connected to the node. Some well-known static networks are the following:

Degree 1: shared bus

Degree 2: linear array, ring

Degree 3: binary tree, fat tree, shuffle-exchange

Degree 4: two-dimensional mesh (Illiac, torus)

Varying degree: n-cube, n-dimensional mesh, k-ary n-cube

 

Now we would discuss some of the static networks studied in the class.

Ring Network:-

A ring is obtained by connecting the two terminal nodes of a linear array with one extra ring.Here degree of each node is two.

By increasing the degree of each node to d (where d is greater than 2) we can obtain a chordal ring. And for d = N – 1  , we obtain a completely connected network

Specifically we would study Chordal Ring Network.

No. of Links :- Degree of node * no. of nodes/2  (N represents no. of nodes)

For a completely connected network:-

Degree :- N-1

No. of Links :- N*(N-1)/2

Latency :- O(1)

Bisection Width :- N/2 * N/2

 

Barrel Shifter:-

The barrel shifter is obtained from the ring by adding extra links from each node to those nodes having a distance equal to an integer power of 2.That is node i is connected to node j if

                                        |j – i| = 2^r

Where r can 0 ,1 , 2 ,……..,n-1 and N = 2^n (N is number of processors)

A barrel shifter is sometimes called a plus-minus- 2i network.

Routing functions:

B+i (j ) = (j + 2i) mod N

Bi (j ) = (j – 2i) mod N

 where 0 ≤ N– 1 and 0 ≤ i < log2N.

 For example, a 16-node barrel-shifter network would have not just ±1 and ±4 shifts, like a mesh network, but also ±2 and ±8 shifts. 

Each node in an N-node barrel shifter is directly connected to 2 logN–1 different nodes.

Diameter:- The diameter of a barrel shifter is (log2N)/2.

 

Tree:-

TREE

Basic concept:



In a tree network there is only one path between any two nodes. The taller the tree, the higher is communication bottleneck at high levels of the tree. Two remedies are possible:

          


We may have (a) static tree networks, or (b) dynamic tree networks.

A binary tree with N nodes has diameter 2(h-1), where

h = ceil(logN) is the height of the tree. The binary tree has the advantages of being expandable and having a simple implementation. Nonetheless, it can still cause long communication delays between faraway leaf nodes. Leaf nodes farthest away from each other must ultimately pass their message through the root. Since traffic increases as the root is approached, leaf nodes farthest away from each other will spend the most amount of time waiting for a message to traverse the tree from source to destination.

Fat tree network:-

Alternatively, we may introduce fat tree networks (see below).


One problem with the binary tree is that there can be heavy traffic toward the root node. Consider that the root node acts as the single connection point between the left and right subtrees. As can be observed in Figure, all messages have to be passed through the root.To reduce the effect of such a problem, the fat trees were introduced. Fat trees are more like real trees in which the branches get thicker near the trunk.Proceeding up from the leaf nodes of a fat tree to the root, the number of communication links increases, and therefore the communication bandwidth increases.

The structure of the fat tree is based on a binary tree. Each edge of the binary tree corresponds to two channels of the fat tree. One of the channels is from parent to child, and the other is from child to parent. The number of communication links in each channel increases as we go up the tree from the leaves and is determined by the amount of hardware available.



Star topology of interconnection networks

The star is a two level tree with a high node degree at the central node of d = N-1 and a small constant diameter of 2 .

The star architecture has been used in systems with centralized supervisor node. It is shown in fig.


                          

Mesh and Torus  

A two-dimensional mesh consists of k1*k0 nodes, where ki >= 2 denotes the number of nodes along dimension i. In  general , a k-dimensional mesh with N = n^k nodes has an interior node degree of 2k and the network diameter K(n-1).

In a two-dimensional mesh network each node is connected to its north, south, east, and west neighbors(except the corner ones).


                     

                Routing path from node 6 to node 12.

The mesh is frequently used architecture which has been implemented in the Illiax IV , MPP , DAP and Intel Paragon with variations.

A variation of the mesh by allowing wraparound connections we can obtain illiac mesh


                 

                           A 16-node Illiac network.            

In general , an nxn iliac mesh should have a diameter of d = n-1, which only half of the diameter for a pure mesh.

To further reduce the diameter of a mesh network, another variation of this network, called torus (or two-dimensional  torus ), has also been proposed.

A torus is a combination of ring and mesh networks.

The diameter of a torus with N=n^2 nodes is 2*floor(n / 2) , which is the distance between the corner and the center node. Note that the diameter is further decreased from the mesh network.

 

                                    

 

 

K-ary n-cube Networks

 

Rings , mashes , tori ,binary n-cubes (2 nodes in each  dimension , hypercubes)  and omega networks are topologically isomorphic to a family of k-ary n-cube networks .

 

The parameter n is the dimension and k is the radix , or the number of nodes ( multiplicity ) along each dimension . These two numbers are relative to the numbers of nodes , N , in the network by :

                        N = k^n , (k = nth root of N , n = logN with base k)

 

                   

 

 

Cube connected cycles

 

This architecture is modified from the hypercube. In general one can construct k-cube-connected cycles(CCC) from a k-cube with n = 2^k cycles nodes. The idea is to replace each vertex of the k-dimensional hypercube by a ring of k-nodes. A k-cube can be thus transformed to a k-CCC with k*(2^k) nodes.

Diameter = 2*k

 

 

Summary of Static Network characteristics

 

Network type

  Node

degree,d

Network

diameter

No of

Links,l

Bisection

Width, B

Symmetry

Remarks on

Network size

Linear array

   2

  N - 1

N-1

1

No

N nodes

Ring

   2

Floor(N/2)

N

2

Yes

N nodes

Completely connected

  N - 1

   1

N(N-1)/2

(N/2)^2

Yes

N nodes

Binary tree

   3

   2(h-1)

N-1

1

No

Tree height, h =ceil(logN)

Star

  N - 1

   2

N-1

Floor(N/2)

No

N nodes

2-D  Mesh

   4

   2(r-1)

2N – 2*r

R

No

r*r mesh

where r = r^(1/2)

Illiac mesh

   4

   r - 1

2N

2r

No

Equivalent to a chordal ring of r = N^(1/2)

2-D torus

   4

2*floor(r/2)

2N

2r

Yes

r*r torus where r = N^(1/2)

Hypercube

   n

    n

nN/2

N/2

Yes

N nodes, n = logN (dimension)

CCC

   3

2k-1 +

floor(k/2)

3N/2

N/(2k)

Yes

N = k*2^k nodes with a cycles length k>=3

k-ary n-cube

   2n

n * floor(k/2)

nN

2k^(n-1)

Yes

N = k^n nodes

 

 

 

Hypercube interconnection has the following limitations :-

 

1.   Scalability:- because of change in degree on adding extra node . this can be overcome by using cyclic cube.

2.   Hot spot:- When algorithms are not symmetrical (alike matrix multiplication ), then some of the processors remain more busy then the others ,hence reducing the reliability. Because of this reason higher dimensional hypercubes are not so much in practice. This is overcome by using lower dimensional hypercubes.

3.   Latency :- This problem is overcome by worm-hole routing.

 

(We've tried to summarize all the topics discussed in class. If some of the

 topic are still uncovered, pls do comments )

Thanx .

Posted by Uzma and Sweety Yadav.