Tuesday, 25 March 2014

Summary of 24th March's Lecture


Memory Consistency Policies


P1 ->    1. A=1  

             2. PRINT(B,C)


P2 ->    1. B=1  

             2. PRINT(A,C)


P3 ->    1. C=1  

             2. PRINT(A,B)


If programs are allowed to execute instructions out of program order then there are 6! = 720 possible execution interleavings.


However, if each processor's individual program order is maintained, the total number of possible execution interleavings reduces to:


=> 720/(2*2*2) = 90.


Not all output combinations can result from these 90 interleavings. For example, the output 000000 is not possible if every processor executes its instructions in program order.
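As a quick check, here is a small Python sketch (my own illustration, not part of the lecture) that enumerates every interleaving respecting each processor's program order and confirms both the count of 90 and that the output 000000 never occurs:

```python
from itertools import permutations

# Step 0 of process p writes its variable; step 1 prints the other two.
writes = {1: "A", 2: "B", 3: "C"}
prints = {1: ("B", "C"), 2: ("A", "C"), 3: ("A", "B")}

# Distinct orderings of the multiset {1,1,2,2,3,3}: 6!/(2!*2!*2!) = 90.
schedules = set(permutations([1, 1, 2, 2, 3, 3]))

outcomes = set()
for sched in schedules:
    mem = {"A": 0, "B": 0, "C": 0}
    step = {1: 0, 2: 0, 3: 0}
    out = {1: "", 2: "", 3: ""}
    for p in sched:
        if step[p] == 0:
            mem[writes[p]] = 1                 # first instruction: write 1
        else:
            x, y = prints[p]
            out[p] = f"{mem[x]}{mem[y]}"       # second instruction: print
        step[p] += 1
    outcomes.add(out[1] + out[2] + out[3])

print(len(schedules))             # 90 program-order interleavings
print("000000" in outcomes)       # False: the all-zero output cannot occur
```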



Memory Consistency Issues



The behaviour of a shared-memory system as observed by the processors is called the memory model.


In general, choosing a memory model involves making a compromise between a strong model minimally restricting software and a weak model offering efficient implementation.



Sequential Consistency Model







Lamport (1979)


A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.



Release Consistency Model


One of the most relaxed memory models.

It requires that synchronization accesses in the program be identified and classified as either acquire or release.


Conditions for release consistency:


  • Before an ordinary read or write is performed, all previous acquires must be performed.

  • Before a release is performed, all previous reads and writes must be performed.



Lazy Release Consistency


It has the additional condition that before an acquire is performed, all previous reads and writes must be performed (i.e. propagation of updates can be deferred until the next acquire).
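To make these conditions concrete, here is a toy software-DSM sketch in Python (my own illustration with invented names, not a protocol given in the lecture): ordinary reads and writes touch only a local copy, release() publishes all writes made since the last acquire, and acquire() pulls the published state.

```python
class RCNode:
    """Toy DSM node: ordinary accesses hit only the local copy."""

    def __init__(self, home):
        self.home = home        # shared "home" copy (a plain dict)
        self.local = {}         # this node's cached copy
        self.dirty = set()      # locations written since the last acquire

    def acquire(self):
        # Condition 1: later reads/writes must see everything published
        # by earlier releases, so pull the home copy first.
        self.local = dict(self.home)
        self.dirty.clear()

    def write(self, var, val):
        self.local[var] = val
        self.dirty.add(var)

    def read(self, var):
        return self.local.get(var, 0)

    def release(self):
        # Condition 2: the release completes only after all previous
        # reads/writes, i.e. every dirty location is pushed to the home copy.
        for var in self.dirty:
            self.home[var] = self.local[var]
        self.dirty.clear()
        # A lazy variant would defer this push and let the *next* acquire
        # fetch the updates from the releasing node instead.

home = {}
p1, p2 = RCNode(home), RCNode(home)
p1.acquire(); p1.write("A", 1); p1.release()
p2.acquire(); print(p2.read("A"))   # 1: the update is visible after the acquire
```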



Dubois (1989)


The following two sufficient conditions were provided to achieve sequential consistency in shared-memory access:


(a) Before a load is allowed to perform with respect to any other processor, all previous load accesses must be globally performed and all previous store accesses must be performed with respect to all processors.


(b) Before a store is allowed to perform with respect to any other processor, all previous load accesses must be globally performed and all previous store accesses must be performed with respect to all processors.


Some other models that can be classified between SC and RC:


  • Sindhu (1992) - Total store order.

  • Goodman (1990) - Processor Consistency.






In the Total Store Order (TSO) model, the stores and swaps issued by a processor are placed in a dedicated store buffer for that processor, which is operated first-in, first-out. Thus the order in which memory executes these operations for a given processor is the same as the order in which the processor issued them (program order).


The memory order corresponds to the order in which the switch is thrown from one processor to another.


A load by a processor first checks its store buffer to see if it contains a store to the same location. If it does, the load returns the value of the most recent such store; otherwise, the load goes directly to memory. Since not all loads go to memory immediately, loads in general do not appear in memory order. A processor is logically blocked from issuing further operations until the load returns a value.
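The following Python sketch (an assumed illustration, not the formal TSO specification) models the store-buffer behaviour just described: stores enter a per-processor FIFO, a load forwards from the newest matching buffered store or else goes to memory, and committing a store models the memory switch accepting it.

```python
from collections import deque

class TSOProcessor:
    def __init__(self, memory):
        self.memory = memory
        self.buffer = deque()            # FIFO of pending (addr, value) stores

    def store(self, addr, value):
        self.buffer.append((addr, value))

    def load(self, addr):
        # Forward from the most recent buffered store to this address, if any.
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)  # otherwise the load goes to memory

    def commit_one(self):
        # "The switch is thrown": the oldest buffered store reaches memory.
        if self.buffer:
            a, v = self.buffer.popleft()
            self.memory[a] = v

mem = {}
p1, p2 = TSOProcessor(mem), TSOProcessor(mem)
p1.store("A", 1); p2.store("B", 1)
# Neither store has been committed to memory yet, so both loads can return 0 --
# an outcome that sequential consistency forbids but TSO allows.
print(p1.load("B"), p2.load("A"))   # 0 0
```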



BY : SAHIL AGGARWAL & SANDEEP SHOKEEN

Thursday, 13 March 2014

Summary of lectures on 12/03/2014 and 13/03/2014

MEMORY VS PERFORMANCE


  • Concept of Virtual Memory


The concept of virtual memory allows processes to reference more memory than the amount of physical memory actually available in the system. The operating system maps the virtual addresses used by a process onto real (physical) memory. Thus, to a process, the virtual memory appears as a contiguous block of memory (also called the virtual/logical address space) which is much larger than the available physical address space.


Image reference: chortle.ccsu.edu


As the number of processors increases, the virtual memory which each processor 'sees' increases. Therefore, each processor can now handle more work than it could by referencing only the physical memory. This is the basis of Sun and Ni's law. It is a superlinear model: as the number of processors n increases, the work that can be done grows as some function n^x (x > 1) rather than linearly with n.


  • Performance increases with increase in memory space


As we have more and more memory available, the overall performance of the system increases.

This behaviour can be attributed to more buffers, registers and cache.


With a larger number of registers available to the processor, resolving anti- (WAR) and output (WAW) dependencies through register renaming becomes easier. As the cache size increases, more recently used data can be stored and the chance of a hit increases, as mentioned by Shivam. Even increasing the main memory size (RAM) boosts the performance of the system, as mentioned by Sumit.

Following buffers increase the performance of the system:


  • Branch Target Buffer(BTB)

  • Loop buffer

  • Prefetch buffer

  • Reservation stations

  • Reorder buffer

  • Commit unit

  • Translation lookaside buffer (TLB)


Storing frequently executed sequences of decoded micro-operations in a trace cache also helps to increase performance.

Hence, increasing memory has a much greater impact on performance than increasing the number of functional units in the system.


TACKLING CONSISTENCY OF DATA


In earlier lectures, the following overheads of shared memory paradigm were discussed:


1. Need for synchronization between different processors which access the same memory.

2. Only one processor should execute its critical section at a time.


Another repercussion of having memory shared between processors is the possibility of inconsistency of data.

Inconsistency of data means that different processors having their own copy of the same data may have conflicting values of that data.


Suppose there are three processors, each having its own cache. They obtain a block of data from the main memory and then change the copy of the data in their cache after some processing. It may be possible that each processor now has a different value of that data and the main memory has an entirely different value of that data block. This leads to inconsistency in the same data.


  • Cache Coherence


Maintaining consistency among different copies of the same block of memory is known as cache coherence. We need to maintain coherence of caches when different processors write an updated value of data, or in some special circumstances like cache replacement or process switching (context switching). The cache-coherence mechanism depends on the type of cache used.


1. Write Through Cache:


In the write-through cache mechanism, the write operation is performed simultaneously on the cache block and on the main memory block that the cache block maps to. Therefore main memory always holds the valid, most recently updated data, as does the cache of the processor which performed the update; all other caches hold invalid data.

To maintain coherence in write-through caches, we use an invalidation policy. Each cache block has one extra bit which indicates whether the block holds valid or invalid data. Since there are only two possible states, Valid and Invalid, we need only one extra bit per cache block.

Each cache has a state diagram which is implemented using external hardware. Since all caches are identical they have the same state diagram. V is the valid state and I is the invalid state.


Image reference: people.engr.ncsu.edu


It is important to note that what gets invalidated is the copy of a particular main-memory address held in a cache block, not the physical cache block itself.


Invalidating every cache other than the one associated with the processor that executes the write increases the overhead, and therefore we use the Snoopy Protocol.

A 'snoopy bus' connects all the cache controllers, and each cache controller monitors (snoops on) the bus. When some cache performs a write, all other cache controllers invalidate their copy of the snooped memory address (if they have one). This reduces the operating-system overhead.
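Here is a small Python sketch of the Valid/Invalid write-through scheme with snoopy invalidation (a simplification of the state diagram above; the class and method names are my own):

```python
class WriteThroughCache:
    def __init__(self, memory, bus):
        self.memory = memory
        self.bus = bus
        self.lines = {}                 # addr -> (state, value), state in {"V", "I"}
        bus.append(self)

    def read(self, addr):
        state, value = self.lines.get(addr, ("I", None))
        if state == "I":                # miss or invalidated copy: refill from memory
            value = self.memory.get(addr, 0)
            self.lines[addr] = ("V", value)
        return value

    def write(self, addr, value):
        self.memory[addr] = value       # write-through: memory is updated immediately
        self.lines[addr] = ("V", value)
        for cache in self.bus:          # snoop: every other copy becomes Invalid
            if cache is not self and addr in cache.lines:
                cache.lines[addr] = ("I", None)

mem, bus = {}, []
c1, c2 = WriteThroughCache(mem, bus), WriteThroughCache(mem, bus)
print(c1.read("X"), c2.read("X"))   # 0 0 -- both caches now hold X in state V
c1.write("X", 7)                    # memory updated, c2's copy invalidated
print(c2.read("X"))                 # 7 -- c2 misses and refills from memory
```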


However the above procedure of maintaining cache coherence causes increased memory bus and snoopy bus traffic.


2. Write Back Cache:


In the write-back cache mechanism, the updated value is written only to the cache and not to main memory. The main memory block is updated when the modified cache block is eventually replaced (or when another cache needs its data).

To maintain coherence in write-back caches, we use the concept of 'ownership' of a memory block. The processor which writes to a main memory block holds the ownership of that block and invalidates all other caches through the bus. A processor which needs to read a block from main memory first snoops on the bus to check whether any other processor has ownership of that block; if so, it reads the valid data from that processor's cache. If more than one processor has a valid copy of the same data, then the main memory copy is also valid. The state diagram for the write-back coherence mechanism using the ownership protocol is shown below. We require two extra bits per cache block to encode its three states: RW (read-write), RO (read-only) and INV (invalid).


Ownership protocol reduces the main memory traffic although the snoopy bus traffic is increased.


Between the write-back and write-through mechanisms, it is better to use write-back when one or a few processors do most of the writing. However, if the write operations are spread evenly across all processors, write-through is the better choice. Write-through coherence is also easier to implement, as can be seen from its state diagram, and requires only one extra bit per block compared with the two extra bits required for write-back coherence.


3. Write Once Cache:


In this type of cache, ownership of a particular memory block is given to the cache which writes to that block twice in succession. After the first write the block goes into the 'Reserved' (exclusive) state, and after the second write it goes into the 'Dirty' (modified) state. The state diagram for write-once cache coherence is shown below. There are four possible states for each cache block, so we require two extra bits per block. This scheme corresponds to the MESI protocol (Modified, Exclusive, Shared, Invalid).




Image reference: Wikipedia
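As a rough illustration (not the exact diagram referenced above), the following Python table sketches MESI-style transitions for a single cache line; in write-once terms, Reserved corresponds roughly to Exclusive and Dirty to Modified.

```python
# Simplified MESI-style transitions for one cache line.
# (state, event) -> next state; the event names are my own shorthand.
MESI = {
    ("I", "local_read_no_sharers"): "E",   # read miss, no other cache has the block
    ("I", "local_read_shared"):     "S",   # read miss, another cache shares the block
    ("I", "local_write"):           "M",   # write miss: fetch, modify, invalidate others
    ("E", "local_write"):           "M",   # silent upgrade (write-once: Reserved -> Dirty)
    ("E", "bus_read"):              "S",   # another cache reads our exclusive copy
    ("S", "local_write"):           "M",   # upgrade: broadcast invalidate to sharers
    ("S", "bus_write"):             "I",   # another cache writes: our copy is stale
    ("M", "bus_read"):              "S",   # supply the dirty block, keep a shared copy
    ("M", "bus_write"):             "I",   # another cache writes: invalidate
}

def next_state(state, event):
    return MESI.get((state, event), state)   # unlisted events leave the state unchanged

print(next_state("I", "local_read_no_sharers"))  # E
print(next_state("E", "local_write"))            # M
```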


Other references: http://en.wikipedia.org/wiki/Virtual_memory

      http://en.wikipedia.org/wiki/Cache_(computing)

      http://en.wikipedia.org/wiki/Cache_coherence



      -Compiled by Riva Yadav & Rishul Aggarwal


Tuesday, 11 March 2014

Problem with comment

There is a problem with the blog: the links are not working and nobody is able to comment from the blog's homepage.

So either add this blog to your reading list on blogger.com
or
copy paste the following link
http://coeunitedparallelprocessing.blogspot.in/2014/03/superpipeline-and-superscalar.html 

Superpipeline and Superscalar




Summary of class on 10/03/2014

Super Scalar and Super Pipeline


Super Pipeline Processors:

In a superpipelined processor we increase the number of stages in the pipeline. These processors are based on the fact that many pipeline stages need less than half a clock cycle. The speed of the pipeline is limited by its slowest stage.


If we increase the number of pipeline stages by a factor of m:

T(single pipeline with k stages) = [k + (n-1)]*τ

                            where n = no. of instructions

T(superpipeline with mk stages) = [mk + (n-1)]*(τ/m)


Speedup = [(k + (n-1))*τ] / [(mk + (n-1))*(τ/m)]

         = [mk + m(n-1)] / [mk + (n-1)]

Asymptotic speedup (n -> ∞) = m


Here the clock period is reduced to (1/m) of the original clock period.

For example, each stage of a carry-lookahead adder can be separated by a latch.



Super Scalar Processors :

In superscalar processors we increase the number of pipelines. We can either have a wider instruction stream or a greater number of instruction streams. Superscalar processors can execute more than one instruction per clock cycle.

A prefetch buffer supplies instructions to the different pipelines.

Speedup = [(k + n - 1)*τ] / [(k + (n-m)/m)*τ]

         = [m(k + n - 1)] / [mk + n - m]

Asymptotic speedup = m
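A quick numerical sketch (with illustrative values of k, m and n, not taken from the lecture) shows both speedup expressions approaching their common asymptote m:

```python
def superpipeline_speedup(k, n, m):
    # original k-stage pipeline split into m*k stages with clock period tau/m
    return (k + n - 1) / ((m * k + n - 1) / m)

def superscalar_speedup(k, n, m):
    # m parallel k-stage pipelines issuing m instructions per cycle
    return (k + n - 1) / (k + (n - m) / m)

k, m = 5, 4
for n in (10, 100, 10_000):
    print(n,
          round(superpipeline_speedup(k, n, m), 2),
          round(superscalar_speedup(k, n, m), 2))
# As n grows, both ratios approach m = 4, the asymptotic speedup.
```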


But in addition to this ideal speedup, dependencies between the instructions in the pipelines reduce the overall speedup that is actually achieved.


The Pentium (P5) was the first superscalar x86 microprocessor. It has two pipelines: one that can execute any instruction and another for simple instructions only.


Superscalar vs Superpipeline :

Image reference www.uwgb.edu/breznayp/cs353/slides/ch_14.ppt


Superscalar + Superpipeline:

Modern day processors are both superscalar and superpipelined.

The Power Mac G5 has a superscalar, superpipelined execution core that can handle up to 216 in-flight instructions, and it uses a 128-bit, 162-instruction SIMD unit.

SpeedUp =SU(superscalar)*SU(superpipeline)

To improve the overall performance of the processor, we apply the following techniques:

->Hyper Threading

->Branch Target Predictor

->Speculation Unit

->Distributed Shared memory



Sun and Ni's law :

It is a generalization of Gustafson's and Amdahl's laws.

Speedup = [w1 + G(n)*wn] / [w1 + G(n)*wn/n]

where w1 is the sequential work, wn is the parallelizable work, and G(n) = n^x describes how the workload scales with the number of processors.

For Amdahl's law, x = 0 (G(n) = 1, fixed workload).

For Gustafson's law, x = 1 (G(n) = n, workload scales linearly).
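A small Python sketch (with an assumed split between sequential and parallelizable work) shows how the exponent x moves the Sun–Ni speedup between the Amdahl and Gustafson regimes:

```python
def sun_ni_speedup(w1, wn, n, x):
    g = n ** x                      # G(n) = n**x: how the workload scales with n
    return (w1 + g * wn) / (w1 + g * wn / n)

w1, wn = 10, 90                     # assumed sequential / parallelizable work
for x in (0, 1, 1.5):
    print(x, [round(sun_ni_speedup(w1, wn, n, x), 1) for n in (2, 8, 32)])
# x = 0 reproduces Amdahl's fixed-workload curve (saturates near (w1+wn)/w1),
# x = 1 reproduces Gustafson's scaled-workload curve (roughly linear in n),
# x > 1 models the memory-bounded case where the workload grows even faster.
```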


By Ravi Kumar Nimmi and Rishav Pipal

Tuesday, 25 February 2014

Insight of 25 February class lecture

Parity calculation example: suppose there are 8 bits; if the bits are combined pairwise (in a binary-tree fashion), then seven processors are required.

Therefore, in this case:

·         PRAM model: time = log(s)

·         Serial processor: time = s

where s = size of the input.

Hence speedup on the PRAM = s/log(s)

Now let us consider a realistic situation, i.e. a linear array of n processors.

Computation time = s/n

Collecting all the results takes O(n) time, therefore total time = s/n + n

Differentiating the above expression with respect to n and equating it to zero gives n = √s

Substituting this value of n, total time = 2√s

 

Scalability = (speedup of the realistic machine) / (speedup of the PRAM model) = log(s) / (2√s) < 1; ideally it should be 1.

Along similar lines we can calculate the scalability of a hypercube, a mesh, etc.

If the expression for scalability contains log(n), this indicates good scalability.
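A quick numerical sketch of the parity example (taking the logarithm to base 2, which is my assumption for the binary-tree combination) shows the scalability ratio shrinking as the problem size grows:

```python
import math

def scalability(s):
    pram_speedup = s / math.log2(s)        # serial time s vs. PRAM time log2(s)
    array_time = 2 * math.sqrt(s)          # s/n + n, minimised at n = sqrt(s)
    array_speedup = s / array_time         # = sqrt(s) / 2
    return array_speedup / pram_speedup    # = log2(s) / (2 * sqrt(s)) < 1

for s in (64, 1024, 2 ** 20):
    print(s, round(scalability(s), 4))
# The ratio shrinks as s grows: the linear array scales poorly compared with
# the ideal PRAM, whereas an ideal topology would keep the ratio near 1.
```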

 

Decreasing order of scalability :

 Hypercube > Mesh > Linear array

  • The hypercube has the greatest scalability in this case because of its compact (densely connected) structure.
  • The mesh has lower scalability because it has loose ends (open boundaries).

 

 

PIPELINING AND THE OVERHEADS INVOLVED

 In pipelining, overheads do not arise from the network or from IPCs (inter-processor communication).

 Speedup = kn / [k + (n-1)]

For efficiency, divide this expression by k, since the k stages (with their additional latches) are the hardware invested in the pipeline.

Therefore Efficiency = n / (k + n - 1)

"Pipelining increases the CPU instruction throughput - the number of instructions completed per unit of time. But it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction due to overhead in the pipeline control. The increase in instruction throughput means that a program runs faster and has lower total execution time."
Refer http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/perform.html

Throughput = [n / (k + n - 1)] x f

 

Maximum throughput (≈ f) can be achieved only when instructions flow through the pipeline in an unbroken sequential stream.

Here k = no. of pipeline stages, f = clock frequency of operation.
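A small Python sketch with illustrative values of k, n and f shows the speedup approaching k, the efficiency approaching 1 and the throughput approaching f:

```python
def pipeline_metrics(k, n, f):
    speedup = k * n / (k + n - 1)
    efficiency = n / (k + n - 1)       # speedup divided by the k stages
    throughput = efficiency * f        # instructions completed per unit time
    return speedup, efficiency, throughput

k, f = 5, 1e9                          # 5 stages, 1 GHz clock (assumed values)
for n in (10, 100, 10_000):
    s, e, t = pipeline_metrics(k, n, f)
    print(n, round(s, 2), round(e, 3), f"{t:.2e}")
# As n grows: speedup -> k, efficiency -> 1, throughput -> f.
```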

 

 

 

 

Overheads in instruction pipelining

  • Stalling
  • Branching

 

Branching

As soon as a branch is taken, all the instructions already in the pipeline are flushed out and the pipeline is filled again; hence the first instruction after the branch has to go through all k stages again. This is the case for unconditional branching.

 

Let:

P = probability that an instruction is a branch

Q = probability of the branch being successful (taken)

b = number of pipeline stages flushed by a taken branch (a variable; at most b = k - 1)

Now the speedup considering the overhead due to branching:

Speedup = kn / [k + n - 1 + PQnb]

 

Asymptotic throughput = f / (1 + PQk)   (therefore, the longer the pipeline and the more frequent the taken branches, the lower the throughput)
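A quick check with assumed values of P, Q, k and f (following the throughput expression implied by the asymptotic form above) shows how branch penalties lower the achievable throughput:

```python
def branch_throughput(k, n, f, P, Q, b=None):
    b = (k - 1) if b is None else b    # stages flushed per taken branch
    return n * f / (k + n - 1 + P * Q * n * b)

k, f = 5, 1e9
P, Q = 0.2, 0.6                        # 20% branches, 60% taken (assumed values)
for n in (100, 10_000, 1_000_000):
    print(n, f"{branch_throughput(k, n, f, P, Q):.3e}")
print(f"{f / (1 + P * Q * (k - 1)):.3e}")   # asymptotic limit as n -> infinity
# A deeper pipeline (larger k) or more taken branches (larger P*Q) lowers
# this ceiling, which is why long pipelines need good branch prediction.
```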

 

FOR CONDITIONAL BRANCHING, A MECHANISM CALLED BRANCH PREDICTION IS USED.

  • STATIC PREDICTION: assume that all branches will be taken; its advantage is a smaller number of wrong predictions from a very simple mechanism.
  • Profile-based prediction: determine the likely direction of each branch by profiling/simulating the program beforehand.
  • Dynamic prediction: a real-time system depends on inputs from the environment around it, so branch directions keep changing and no fixed prediction works. For this the Branch Target Buffer (BTB) comes into play: it records the history of every branch, and the next outcome is predicted with the help of a small state machine (a sketch of such a predictor follows this list). Refer to the following link: http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html
  • Speculation is another approach: the system follows the branch and executes the instructions past it; instructions on the path that should not have executed are poisoned and are not committed at write-back.
  • Compiler-initiated: whenever a branch instruction appears, the compiler inserts delay slots, i.e. instructions (or NOPs) that do not depend on the branch, so the pipelined instructions do not have to be flushed.
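As a concrete illustration of the state-machine idea behind a BTB entry (my own sketch, not the exact diagram used in class), here is the textbook 2-bit saturating-counter predictor:

```python
class TwoBitPredictor:
    """Textbook 2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counter = 2                 # start weakly "taken"

    def predict(self):
        return self.counter >= 2         # True means "predict taken"

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
history = [True, True, False, True, True, True, False, False, False]
hits = 0
for actual in history:                   # a toy branch-outcome history
    hits += (p.predict() == actual)
    p.update(actual)
print(f"{hits}/{len(history)} correct")  # 6/9 for this toy history
```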

 

 

SUPERSCALAR PIPELINED PROCESSOR: multiple pipeline units in a single processor, as in the processors used in laptops.

 

To add temporal parallelism to a superscalar pipelined processor, increase the number of stages; it then becomes SUPERPIPELINED as well.

Refer to wikipedia 

   http://en.wikipedia.org/wiki/Superscalar for more insights .

 

 

 BY TINA BISWAS  & SURAJ GUPTA

Monday, 24 February 2014

Insight into class lecture of 24th feb 2014

Summary of February 24th class

 

AMDAHL'S LAW

 

"Amdahl's law, also known as Amdahl's argument, is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors."

                                             

 Graph : Speedup vs Number of Processors


  FORMULAE DESCRIBING AMDAHL'S LAW:

 Speedup = T(1)/T(n) = [ I*(B + (1-B)) / rate ] / [ I*( B + (1-B)/n ) / rate ] = 1 / ( B + (1-B)/n )

Here I = total no. of instructions,

B = ratio of sequential instructions to total instructions, n = no. of processors

Rate here is considered to be constant and no overheads have been considered in Amdahl's law

Therefore if a more realistic situation is considered the entire process would also include overheads, as a function of

  • Size of Network
  • Size of problem
  • No. of processors

 

 Hence the speedup in such a case would be

 = (sum of Wi over all modes i) / ( sum over i of (Wi/i)*[i/n] + O(s,n) )

Wi = work executed in mode i (i.e. with degree of parallelism i); [ ] denotes the ceiling function

O(s,n) = overhead, a function of the problem size and the no. of processors

When speedup is plotted against B, a hyperbolic curve is obtained, which shows that as the sequential fraction B grows the system loses its parallelism very quickly, and adding more processors stops paying off.
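A quick numerical sketch (with an assumed sequential fraction B) shows the saturation behind that hyperbolic behaviour:

```python
def amdahl_speedup(B, n):
    return 1.0 / (B + (1 - B) / n)

B = 0.1                                # 10% sequential code (assumed value)
for n in (2, 8, 32, 128, 1024):
    print(n, round(amdahl_speedup(B, n), 2))
print("limit:", round(1 / B, 2))       # the speedup can never exceed 1/B
```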

 To overcome this drawback, Gustafson's Law was introduced.

If the rates of the n processors are r1, r2, r3, …, rn, then the average rate is the harmonic mean of the individual rates. More precisely, it is a weighted harmonic mean:

Rate_avg = 1 / [ (f1/r1) + (f2/r2) + (f3/r3) + …… + (fn/rn) ]

         where fi is the number of instructions executed by the ith processor.

** It was still not clear whether we should multiply the numerator by F or not, where F = total number of instructions.
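Here is a small sketch of the weighted harmonic mean, reading fi as the fraction of instructions run at rate ri (so the fi sum to 1); the open question about the extra factor F is exactly the question of whether the fi are fractions or raw counts.

```python
def average_rate(fractions, rates):
    # weighted harmonic mean: 1 / sum(fi / ri)
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

fractions = [0.5, 0.3, 0.2]            # assumed split of the instruction mix
rates = [100, 50, 25]                  # MIPS of each processor (illustrative)
print(round(average_rate(fractions, rates), 1))   # 52.6, pulled toward the slower ones
```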

 

Gustafson Law

 

"Gustafson's Law is a law in computer science which says that computations involving arbitrarily large data sets can be efficiently parallelized." - Wikipedia

"Thus Gustafson's law seems to rescue parallel processing from Amdahl's law." - Wikipedia

"Gustafson's law states that increasing processors gives linear speedup." - eecis.udel.edu

According to Gustafson's Law, the speedup can be kept proportional to the number of processors by increasing the parallelizable workload as the work gets distributed among the growing number of processors.

Image reference: Wikipedia

The graph shows a linear increase and does not saturate, unlike the curve for Amdahl's law.

FORMULAE DESCRIBING GUSTAFSON'S LAW:

SPEEDUP = [B + n(1 - B)] / [B + n(1 - B)/n] = B + n(1 - B), since the denominator equals 1.

Here

B= ratio of sequential instructions to total instructions , n = no. of processors

Efficiency: E = Speedup / no. of processors

Redundancy: R = O(n) / O(1)

Utilization: U = E x R

Quality of Parallelism: Q = Speedup x E / R

 Now, expressing the quality of parallelism in terms of time (with O(1) = T(1)):

Quality of parallelism = T(1)^3 / ( n * T(n)^2 * O(n) )
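A small Python sketch with assumed numbers evaluates the scaled speedup and the derived metrics listed above:

```python
def gustafson_metrics(B, n, ops_parallel, ops_serial):
    speedup = B + n * (1 - B)                # the denominator B + n(1-B)/n equals 1
    efficiency = speedup / n
    redundancy = ops_parallel / ops_serial   # R = O(n) / O(1)
    utilization = efficiency * redundancy
    quality = speedup * efficiency / redundancy
    return speedup, efficiency, redundancy, utilization, quality

B = 0.1                                      # assumed sequential fraction
for n in (8, 64, 1024):
    s, e, r, u, q = gustafson_metrics(B, n, ops_parallel=1.2, ops_serial=1.0)
    print(n, round(s, 1), round(e, 2), r, round(u, 2), round(q, 2))
# Unlike the Amdahl curve, the scaled speedup keeps growing linearly with n.
```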

 

INSIGHTS PROVIDED:

  • FOR MAINTAINING THE PROPORTIONALITY OF PARALLELISM WHILE INCREASING THE NO. OF PROCESSORS, MORE WORK SHOULD BE ASSIGNED.
  • FOR A PURELY AMDAHL'S-LAW WORKLOAD, IT IS NOT WISE TO INCREASE THE NUMBER OF PROCESSORS BEYOND A CERTAIN LIMIT.

 

 

Submitted by : Suraj Gupta and Tina Biswas