Thursday 3 April 2014

VARIOUS VECTORISATION TECHNIQUES


·        Both processes and threads are independent sequences of execution. The typical difference is that threads of the same process (often called lightweight processes) run in a shared memory space, while processes run in separate memory spaces. Hence, the OS can create and switch between threads far more cheaply than between processes.

·        Now, for instance:

   do i = 2,n

      do j = 1,n

         B[i,j] = B[i+1,j] + B[i-1,j]

      end do

   end do

          The loop nest has two dimensions (i, j); the dependency lies along the i dimension:

        B[2,1] = B[3,1] + B[1,1]    (1st iteration)

        B[2,2] = B[3,2] + B[1,2]    (2nd iteration)

        …

        B[3,1] = B[4,1] + B[2,1]    (Successive iterations)   

        …

        B[4,1] = B[5,1] + B[3,1]

 Hence, we find that both a flow dependency and an anti-dependency exist.
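In distance-vector terms, both dependencies have distance (1, 0): B[2,1] is written in iteration (i=2, j=1) and read as B[i-1,j] in iteration (i=3, j=1), a flow dependency; B[3,1] is read in iteration (2,1) before being written in iteration (3,1), an anti-dependency. The j component is zero in both, so no dependency is carried by the j loop.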

Here, the inner (j) loop is dependence-free and can be vectorised; the outer (i) loop carries both dependencies, so it cannot be parallelised directly.

Code after vectorising the inner loop:

  do i = 2,N

   B[i,1:N] = B[i+1,1:N] + B[i-1,1:N]  (the lower:upper limit 1:N runs as a vector)

  end do

As a general guideline, the outer loop is parallelised and the inner loop is vectorised, but we can target either loop once the dependencies allow it, irrespective of the architecture. Here the inner loop executes far more frequently than the outer loop, so vectorising it contributes to good pipeline utilisation. Both options are sketched below.
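A minimal Fortran sketch of both options (the array size, the padding row, and the OpenMP directive are illustrative assumptions, not from the lecture):

      program vec_options
        implicit none
        integer, parameter :: n = 8
        real :: B(n+1, n)        ! one extra row so B(i+1, 1:n) stays in bounds at i = n
        integer :: i, j

        B = 1.0
        ! Option (a): i stays outer and serial (it carries the dependency);
        ! the dependence-free j dimension is vectorised with array syntax.
        do i = 2, n
           B(i, 1:n) = B(i+1, 1:n) + B(i-1, 1:n)
        end do

        B = 1.0
        ! Option (b): interchange so j becomes the outer loop, then run the
        ! j iterations on separate threads; each column's i-chain stays serial.
        !$omp parallel do private(i)
        do j = 1, n
           do i = 2, n
              B(i, j) = B(i+1, j) + B(i-1, j)
           end do
        end do
        !$omp end parallel do

        print *, B(2, 1:4)
      end program vec_options

Compiled with gfortran -fopenmp, the outer j loop of option (b) runs on several threads; without OpenMP the directives are plain comments and the code still runs serially.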

 

 Our personal computers are UMA MIMD machines, which is the same as SMP (symmetric multiprocessors).

In SMP, two or more identical processors connect to a single shared main memory, have full access to all I/O devices, and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes.
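A minimal OpenMP sketch of this idea (the program is illustrative, not from the lecture): all threads run inside one address space, so they see the same variable, and one OS instance schedules them across the processors.

      program smp_demo
        implicit none
        integer :: shared_count
        shared_count = 0
        ! Every thread updates the same variable: one shared memory, one OS image.
        !$omp parallel
        !$omp atomic
        shared_count = shared_count + 1
        !$omp end parallel
        print *, 'threads sharing this address space:', shared_count
      end program smp_demo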




In this case:

            do i = 2,N

            B( i , 1:N ) = B( i+1 , 1:N ) + B( i-1 , 1:N )

            end do

the vector operands B( i±1 , 1:N ) sweep along rows of B. Since Fortran stores arrays in column-major order, these are strided memory accesses, so latency becomes a problem.

This is when we may want to interchange the inner and outer loops. Consider:

            do i = 1,N

            do j = 1,N

            a( i ) = a( i ) + b( i , j )

            end do

            end do

column major:-  b( 1 , 1 )  ->  b( 2 , 1 )  ->  b( 3 , 1 )  are adjacent in memory, yet the inner j loop strides across a row of b. After interchanging, the inner i loop runs down a column and can be vectorised:

            do j = 1,N

            a( 1:N ) = a( 1:N ) + b( 1:N , j )

            end do

Hence, we sometimes need to rearrange the loop variables to improve performance. This is usually expected to be done by the programmer, but in practice it is mostly done by the compiler.

  This vectorisation technique is called loop interchange.
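A small self-contained sketch of the interchange (the array size and fill values are illustrative assumptions): both nests compute the same row sums of b, but the second walks b down its columns, matching Fortran's column-major storage.

      program interchange_demo
        implicit none
        integer, parameter :: n = 4
        real :: a(n), b(n,n)
        integer :: i, j

        b = 2.0

        ! Original order: the inner j loop strides across a row of b.
        a = 0.0
        do i = 1, n
           do j = 1, n
              a(i) = a(i) + b(i,j)
           end do
        end do

        ! Interchanged order: the inner i dimension is contiguous, so it is
        ! written as one vectorisable array statement.
        a = 0.0
        do j = 1, n
           a(1:n) = a(1:n) + b(1:n,j)
        end do

        print *, a        ! each element is 2.0 * n = 8.0
      end program interchange_demo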

A systematic approach should be adopted to let the compiler do the loop transformations.

This systematic approach is known as UNIMODULAR transformation.

Transformation matrix :

           [ T ][ i  j ]^T        (the matrix T applied to the iteration vector (i, j))

The conditions that have to be fulfilled by the transformation matrix are:-

1.     It must be a square matrix.

2.     Its elements have to be integers, since iteration indices cannot be fractional.

3.     The absolute value of its determinant must be 1.

4.     T.d >= 0           [ condition for the transformation to be valid ]

Where, T = the transformation matrix

And, d = the dependence distance vector. The transformed distance T.d must remain lexicographically positive for every dependence.
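As a worked illustration, loop interchange corresponds to

           T = [ 0  1 ]
               [ 1  0 ]

Applied to the iteration vector, T.(i, j)^T = (j, i)^T. T is square, its elements are integers, and |det(T)| = |-1| = 1. For the first loop above, the dependence distance is d = (1, 0); T.d = (0, 1), which is still lexicographically positive, so the interchange is a valid unimodular transformation.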

Now the question arises: why should we interchange loop variables?

 There are mainly three reasons for this:-

1.     The number of cache misses becomes as small as possible, which greatly improves the performance of the system.

2.     The parallelism available to the compiler can increase after the interchange.

3.     The effective bandwidth of the memory also increases (more of it is spent on useful data rather than on overheads such as instruction fetches for branches).

 

Can this Fortran code be parallelized?

do i = 1,N

          a( i ) = a( i ) + 2*a( i+1 )

          end do

Solution: - This cannot be parallelized directly, because there is a dependency: each iteration reads a( i+1 ) before the next iteration overwrites it (an anti-dependency of distance 1). The loop also carries overheads (an index update and a branch on every iteration); to reduce them we use the technique known as loop unrolling. In this technique we increase the number of statements in the loop body, but this also increases the code size. Hence, there must be a compromise between the two.

Loop unrolled program:-

do i = 1,n,2

          a( i ) = a( i ) + 2*a( i+1 )

          a( i+1 ) = a( i+1 ) + 2*a( i+2 )

end do
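A minimal self-contained sketch (the value of n and the array contents are illustrative assumptions; n is taken even here, otherwise a clean-up loop is needed) checking that the unrolled loop reproduces the original:

      program unroll_demo
        implicit none
        integer, parameter :: n = 8
        real :: a(n+1), ref(n+1)
        integer :: i

        a = 1.0
        ! Original loop: one statement, with branch/index overhead every iteration.
        do i = 1, n
           a(i) = a(i) + 2*a(i+1)
        end do
        ref = a

        a = 1.0
        ! Unrolled by 2: half the iterations, two statements per body; the
        ! statement order inside the body preserves the original dependencies.
        do i = 1, n, 2
           a(i)   = a(i)   + 2*a(i+1)
           a(i+1) = a(i+1) + 2*a(i+2)
        end do

        print *, all(a == ref)      ! T: same result as the rolled loop
      end program unroll_demo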

Node Splitting:-

          do i = 2,N

          y( i ) = x( i-1 ) + x( i+1 )

          x( i ) = b( i ) + c( i )

          end do

Here y( i ) reads the new x( i-1 ) (a flow dependency) but must read the old x( i+1 ) (an anti-dependency). Node splitting copies the old values into a temporary array first, after which both statements can be vectorised:

          Temp(1:N-1) = x(3:N+1)

          x(2:N) = b(2:N) + c(2:N)

          y(2:N) = Temp(1:N-1) + x(1:N-1)

Note: - the temporary array must be of the appropriate size.
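A minimal self-contained sketch of node splitting (the sizes and fill values are illustrative assumptions), verifying that the split, vectorised form matches the serial loop:

      program node_split_demo
        implicit none
        integer, parameter :: n = 8
        real :: x(n+1), y(n), b(n), c(n), temp(n-1), yref(n)
        integer :: i

        b = 1.0
        c = 2.0
        x = [(real(i), i = 1, n+1)]
        yref = 0.0

        ! Serial loop: flow dependency on x(i-1), anti-dependency on x(i+1).
        do i = 2, n
           yref(i) = x(i-1) + x(i+1)
           x(i) = b(i) + c(i)
        end do

        x = [(real(i), i = 1, n+1)]     ! reset x to its original contents
        temp(1:n-1) = x(3:n+1)          ! save the old x(i+1) values first
        x(2:n) = b(2:n) + c(2:n)        ! now x can be updated as one vector op
        y = 0.0
        y(2:n) = temp(1:n-1) + x(1:n-1) ! old x(i+1) + new x(i-1)

        print *, all(y(2:n) == yref(2:n))   ! T: the results agree
      end program node_split_demo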

All these techniques are vectorisation techniques.

For reference: Advanced Computer Architecture (by Kai Hwang)

                                                                                                                   Submitted by:

                                                                                                         TANMAY VARSHNEY(351/CO/11)

                                                                                                         SUSHANT GODARA(349/CO/11)

 

       

 

11 comments:

  1. In the loop unrolling technique, in the example done in class, we took stride = 2. Can anybody tell me why it was taken as 2? Ma'am first said that the distance is 1, but then she said that it's only for this case (not general). So, why was the stride taken as 2? Does it have anything to do with the distance?

    2. Ma'am said that there is a space-time tradeoff involved. But, according to me, making the loop unrolling factor equal to 2 is not just the most feasible compromise; the distance vector definitely plays some role.
      See this example (given in the link below):
      http://www.csse.monash.edu.au/~davida/teaching/cse3304/Web/Chapter6/sld027.htm
      Here, the distance is 5, so the loop is unrolled to 5 statements.

  2. In the unrolled loop it should be a(i) = a(i) + 2a(i+1) and not a(1) + 2a(i+1).

    Another doubt: In interchanging loop variables, what is meant by 'bandwidth of memory increases'?

    1. Yes, Rishul, that was a typo. It's indeed a(i).

  3. Add to the blog:
    About the first loop considered here: its outer loop can't be parallelised directly; first we need to test for the dependencies. As it contains dependencies (of both types, flow and anti), the compiler will look for some technique to remove these dependencies first, and then parallelise it if possible.

  4. The increase in bandwidth here doesn't mean a physical increase; it means making the bandwidth more available for useful work. These techniques reduce overheads, one of which is branching to some other instruction, which obviously requires loading another instruction, making the bandwidth less available for data, which is the main concern here.

  5. do i = 2,n
    do j = 1,n
    B[i,j] = B[i+1,j] + B[i-1,j]
    End do
    End do
    (please write the end dos also!!)

    1. We have not got the first example right, :(
    The i-th dimension simply cannot be parallelized. The j-th dimension can be parallelized. Now there are two options:
    a) Let i-dim remain outer and j-dim remain inner. Vectorize inner:

    do i = 2, N
    B[i,1:N] = B[i+1,1:N] + B[i-1,1:N]
    End do

    Thus, i-dim remains serialized. This option allows vectorization of the more frequent dimension (j), thereby improving pipeline efficiency, and cache hits too if the language has row-major ordering.
    But vectorization has certain limitations. The pipeline stalls if there are dependencies between loop statements or if there are conditional branches which may be mis-predicted. Thus, we should have the option of parallelizing the outer loop. This is how:

    b) Parallel operation by threads: interchange the i and j iteration variables and parallelize the outer loop

    doall j = 1, N
    do i = 2,N
    B[i,j] = B[i+1,j] + B[i-1,j]
    End do
    End doall

    Now each outer-loop iteration is executed by a separate thread.

  6. There are other typos as well.

    The second example is not put up properly. Let me clarify. This gives another reason why we may want to interchange iteration variables.

    do i = 1,n
    do j=1,n
    a[i] = a[i] + b[i,j]
    end do
    end do

    The Fortran language performs column-major ordering. Therefore, in the above example (which incidentally is dependence-free), we may first interchange the loops
    do j = 1,n
    do i = 1,n
    a[i] = a[i] +b[i,j]
    end do
    end do

    and then vectorize the inner loop
    do j = 1,n
    a[1:n] = a[1:n] + b[1:n, j]
    end do
    Thus all elements of a column are accessed concurrently. Since that is how they are sequentially organized in memory, there would be far fewer cache misses.

    Loop interchange can thus facilitate cache utilization.

  7. Regarding the depth of unrolling, remember I said that we shall revisit this when we do the VLIW architectures?

    The moot point is this: if we unroll more statements, we give the compiler a greater chance to parallelize the statements inside the loop. The minimum the compiler can do is to unroll up to the distance. Including one more statement than the distance will surely expose a dependency.

    If the unrolled loop is free from dependencies then they can be executed in parallel.
