Wednesday 30 April 2014

Supercomputers and changing times



A supercomputer is a computer at the frontline of contemporary processing capacity, particularly speed of calculation, with individual operations completing on the scale of nanoseconds.
Supercomputers play an important role in the field of computational science, and are used for a wide range of computationally intensive tasks in various fields, including quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), and physical simulations (such as simulations of the early moments of the universe, airplane and spacecraft aerodynamics, the detonation of nuclear weapons, and nuclear fusion). 

Supercomputing, in the broadest sense, is about finding the perfect combination of speed and power, even as the definition of perfection changes as technology advances. But the single biggest challenge in high-performance computing (HPC) is now on the software side: Creating code that can keep up with the processors.

The history of supercomputing goes back to the 1960s. In those days the main idea was to use innovative designs and parallelism to achieve superior peak computational performance. While the supercomputers of the 1980s used only a few processors, in the 1990s machines with thousands of processors began to appear both in the United States and in Japan, setting new computational performance records. Moore's Law is a famous "rule" of computer science which states that processing power will double every 1.5-2 years, and it has held true for nearly 50 years. By the end of the 20th century, massively parallel supercomputers with thousands of "off-the-shelf" processors similar to those found in personal computers were constructed and broke through the teraflop computational barrier. Progress in the first decade of the 21st century was dramatic, and supercomputers with over 60,000 processors have appeared.

Here is the performance of the fastest supercomputer in the world over the past 15 years:

·  Top in 2010: 2.57 petaflops

·  Top in 2005: 280.6 teraflops

·  Top in 2000: 4.94 teraflops

·  Top in 1995: 170 gigaflops

 

 

One of the challenges to faster supercomputers is designing an operating system capable of handling that many calculations per second. Other important trends include an emphasis on open, rather than proprietary, systems, and a growing awareness of energy efficiency as a requirement.

According to top scientists, emphasis is shifting a little away from the speed/power paradigm and toward addressing software challenges. "What matters is not acquiring the iron, but being able to run code that matters". Rather than simply increasing the push to parallelize codes, the effort is now on using codes efficiently.



Sources:

http://en.wikipedia.org/wiki/History_of_supercomputing

http://royal.pingdom.com/2010/12/02/incredible-growth-supercomputing-performance-1995-2010/

http://research.microsoft.com/en-us/um/people/blampson/72-cstb-supercomputing/72-cstb-supercomputing.pdf

http://www.businessinsider.in/Soon-Well-Have-A-Supercomputer-In-Every-Living-Room/articleshow/21329777.cms

 

Varun Singla 363/CO/11

HOW DOES THE INTERNET WORK

 

The Internet is a global system of interconnected computer networks that use the standard Internet protocol suite (TCP/IP) to link several billion devices worldwide. It is a network of networks that consists of millions of private, public, academic, business, and government networks, of local to global scope, that are linked by a broad array of electronic, wireless, and optical networking technologies. The Internet carries an extensive range of information resources and services.

 

The Internet is a global network of computers, and each computer connected to the Internet must have a unique address. Internet addresses are of the form nnn.nnn.nnn.nnn, where nnn is a number from 0 to 255. This address is known as an IP address.
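To make the nnn.nnn.nnn.nnn format concrete, here is a small C sketch (not part of the original post; the address 1.2.3.4 is just the illustrative one used in the diagram below). It uses the standard inet_pton call to validate a dotted-decimal IPv4 address and then prints its four 0-255 fields.

#include <stdio.h>
#include <arpa/inet.h>   /* inet_pton: text form -> 32-bit binary address */

int main(void)
{
    const char *text = "1.2.3.4";      /* illustrative address only */
    struct in_addr addr;

    if (inet_pton(AF_INET, text, &addr) != 1) {
        printf("%s is not a valid IPv4 address\n", text);
        return 1;
    }

    /* The address is four 8-bit fields, each in the range 0-255. */
    unsigned char *b = (unsigned char *)&addr.s_addr;
    printf("%u.%u.%u.%u\n", b[0], b[1], b[2], b[3]);
    return 0;
}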

The picture below illustrates two computers connected to the Internet; our computer with IP address 1.2.3.4 and another computer with IP address 5.6.7.8. The Internet is represented as an abstract object in-between.



Diagram 1

 

If we connect to the Internet through an Internet Service Provider (ISP), we are usually assigned a temporary IP address for the duration of our dial-in session. If we connect to the Internet from a local area network (LAN) our computer might have a permanent IP address or it might obtain a temporary one from a DHCP (Dynamic Host Configuration Protocol) server. In any case, if we are connected to the Internet, our computer has a unique IP address.

Servers are where most of the information on the internet "lives". These are specialised computers which store information, share information with other servers, and make this information available to the general public.

Browsers are what people use to access the World Wide Web from any standard computer.

When we connect our computer to the internet, we are connecting to a special type of server which is provided and operated by our Internet Service Provider (ISP). The job of this "ISP Server" is to provide the link between our browser and the rest of the internet. A single ISP server handles the internet connections of many individual browsers - there may be thousands of other people connected to the same server that we are connected to right now.

 

The picture below shows a slightly larger slice of the internet:

 

 



The ISP maintains a pool of modems for their dial-in customers. This is managed by some form of computer which controls data flow from the modem pool to a backbone or dedicated line router. This setup may be referred to as a port server, as it 'serves' access to the network. Billing and usage information is usually collected here as well.

After our packets traverse the phone network and the ISP's local equipment, they are routed onto the ISP's backbone or a backbone the ISP buys bandwidth from. From here the packets will usually journey through several routers and over several backbones, dedicated lines, and other networks until they find their destination.

The information used to get packets to their destinations is contained in routing tables kept by each router connected to the Internet.

Routers are packet switches. A router is usually connected between networks to route packets between them. Each router knows about its sub-networks and which IP addresses they use.
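As a rough illustration of what a routing table does (a simplified sketch only, not how any real router stores or searches its tables; the entries and interface names are made up), the lookup below masks the destination address with each entry's network mask and forwards along the longest matching prefix:

#include <stdio.h>
#include <stdint.h>

/* One simplified routing-table entry: a network, its mask,
 * and the interface/next hop for packets that match it.
 * Addresses are written as plain 32-bit numbers for simplicity,
 * not in network byte order. */
struct route {
    uint32_t    network;
    uint32_t    mask;
    const char *next_hop;
};

/* Return the next hop for dest, preferring the longest matching prefix. */
static const char *lookup(const struct route *table, int n, uint32_t dest)
{
    const char *best = "default gateway";
    uint32_t best_mask = 0;
    for (int i = 0; i < n; i++) {
        if ((dest & table[i].mask) == table[i].network &&
            table[i].mask >= best_mask) {
            best = table[i].next_hop;
            best_mask = table[i].mask;
        }
    }
    return best;
}

int main(void)
{
    /* Hypothetical entries for illustration only. */
    struct route table[] = {
        { 0x01020300u, 0xFFFFFF00u, "interface A" },  /* 1.2.3.0/24 */
        { 0x05060000u, 0xFFFF0000u, "interface B" },  /* 5.6.0.0/16 */
    };
    uint32_t dest = 0x01020304u;                      /* 1.2.3.4    */
    printf("forward via %s\n", lookup(table, 2, dest));
    return 0;
}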


INTERNET PROTOCOL

TCP/IP is the basic communication language or protocol of the Internet. It can also be used as a communications protocol in a private network (either an intranet or an extranet). When we are set up with direct access to the Internet, our computer is provided with a copy of the TCP/IP program, just as every other computer that we may send messages to or get information from also has a copy of TCP/IP.

TCP/IP is a two-layer program. The higher layer, Transmission Control Protocol, manages the assembling of a message or file into smaller packets that are transmitted over the Internet and received by a TCP layer that reassembles the packets into the original message. The lower layer, Internet Protocol, handles the address part of each packet so that it gets to the right destination. Each gateway computer on the network checks this address to see where to forward the message. Even though some packets from the same message are routed differently than others, they'll be reassembled at the destination.
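To show where TCP/IP sits from a programmer's point of view, here is a minimal C client sketch (the host example.com and port 80 are placeholders for illustration). The application just writes a byte stream; the TCP layer splits it into packets and reassembles them in order at the far end, while IP handles the addressing of each packet.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_family   = AF_INET;      /* IPv4 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP  */

    if (getaddrinfo("example.com", "80", &hints, &res) != 0)
        return 1;

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;

    const char *msg = "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n";
    send(fd, msg, strlen(msg), 0);    /* TCP packetizes this byte stream  */

    char buf[512];
    ssize_t n = recv(fd, buf, sizeof buf - 1, 0);
    if (n > 0) {
        buf[n] = '\0';
        printf("%s\n", buf);          /* bytes arrive reassembled, in order */
    }
    close(fd);
    freeaddrinfo(res);
    return 0;
}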

 

 

 

 

http://www.mediacollege.com/internet/intro/thewww2.html

http://en.wikipedia.org/wiki/Internet

http://www.stanford.edu/class/msande91si/www-spr04/readings/week1/InternetWhitepaper.htm

http://searchnetworking.techtarget.com/definition/TCP-IP


Varun Singla 363/CO/11

 

CACHE COHERENCE AND SNOOPY BUS PROTOCOL

In computing, cache coherence refers to the consistency of data stored in local caches of a shared resource. When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system. In a memory hierarchy for a multiprocessor system, data inconsistency may occur between adjacent levels or within the same level. For example, the cache and main memory may contain inconsistent copies of the same memory block because multiple processors operate asynchronously and independently. When multiple processors maintain locally cached copies of a unique shared-memory location, any local modification of the location can result in a globally inconsistent view of memory. Cache coherence schemes prevent this problem by maintaining a uniform state for each cached block of data.

SNOOPY BUS PROTOCOL

In computing, a snoopy cache is a type of memory cache that performs bus sniffing. Such caches are used in systems where many processors or computers share the same memory and each has its own cache. Snoopy protocols achieve data consistency among the caches and shared memory through a bus-watching mechanism.

There are two basic approaches:

1.    write-invalidate – invalidate all other cached copies of a data object when the local cached copy is modified (invalidated items are sometimes called "dirty")

2.    write-update – broadcast a modified value of a data object to all other caches at the time of modification

Snoopy bus protocols achieve consistency among caches and shared primary memory by requiring the bus interfaces of processors to watch the bus for indications that require updating or invalidating locally cached objects.
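The difference between the two policies can be seen in a toy simulation like the one below (a sketch only, assuming a single shared block, a write-through bus and made-up values; real controllers track richer states such as MSI/MESI):

#include <stdio.h>

#define NPROC 3

/* Toy model: each processor caches one shared block, with a valid bit. */
struct cache { int value; int valid; };

static struct cache caches[NPROC];
static int memory_block = 10;

/* Processor p writes; every other cache snoops the write on the bus. */
static void bus_write(int p, int value, int update_policy)
{
    caches[p].value = value;
    caches[p].valid = 1;
    memory_block = value;                 /* write-through to memory      */
    for (int i = 0; i < NPROC; i++) {
        if (i == p) continue;
        if (update_policy)
            caches[i].value = value;      /* write-update: push new copy  */
        else
            caches[i].valid = 0;          /* write-invalidate: drop copy  */
    }
}

int main(void)
{
    for (int i = 0; i < NPROC; i++) caches[i] = (struct cache){10, 1};
    bus_write(1, 42, 0);                  /* write-invalidate by P1       */
    printf("P0 valid=%d  P1=%d  P2 valid=%d  mem=%d\n",
           caches[0].valid, caches[1].value, caches[2].valid, memory_block);

    for (int i = 0; i < NPROC; i++) caches[i] = (struct cache){10, 1};
    bus_write(1, 42, 1);                  /* write-update by P1           */
    printf("P0=%d  P1=%d  P2=%d  mem=%d\n",
           caches[0].value, caches[1].value, caches[2].value, memory_block);
    return 0;
}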

 

 

Cache consistency by bus watching mechanism

  • Each cache watches bus for write operations
  • Write through: forward write immediately to memory
    • Write-invalidate: cache copies are invalidated.
    • Write-update: cache copies are updated.
    • Disadvantage: high bus traffic.
  • Copy back: write the modified cache line back only if
    1. Line needs to be replaced, or
    2. Another processor reads data.

 

 

 

Initial state: consistent caches

 

 

After Write-Invalidate by P1


After Write-Update by P1

 

Sources:

http://www.risc.jku.at/education/courses/ws97/intropar/architectures/index_5.html

http://www.icsa.inf.ed.ac.uk/research/groups/hase/models/coherence/

http://www.info425.ece.mcgill.ca/lectures/L28-SnoopyCoherence.pdf

 

 

 

 

Varun Singla 363/CO/11

Graphs of Pthread Programs

1. Pi Mutex

pi_mutex.png

In this graph, the execution time initially increases with the number of threads, but then decreases, because as we add threads the work is divided among them and the time of execution drops. At some point it reaches a minimum, where the work is shared evenly by all threads and the overheads are lowest; this is the optimum number of threads. Beyond this point the graph gradually rises again, because the work per thread is now small while the number of threads is large, so the thread-management overheads dominate and the time of execution increases.
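The assignment code itself is not reproduced in the post, but a typical pi-with-mutex kernel looks roughly like the sketch below (the Leibniz series, the step count and the command-line thread count are assumptions of mine): each thread accumulates a partial sum locally and touches the mutex only once, so the measured time is dominated by how evenly the work divides versus the thread-creation overhead discussed above.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NSTEPS 10000000L

static long nthreads;
static double pi_sum = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums every nthreads-th term of the Leibniz series locally,
 * then adds its partial sum to the shared total under the mutex, once. */
static void *worker(void *arg)
{
    long id = (long)arg;
    double local = 0.0;
    for (long i = id; i < NSTEPS; i += nthreads)
        local += (i % 2 ? -1.0 : 1.0) / (2.0 * i + 1.0);

    pthread_mutex_lock(&lock);
    pi_sum += local;                 /* single critical section per thread */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(int argc, char **argv)
{
    nthreads = (argc > 1) ? atol(argv[1]) : 4;
    pthread_t *tid = malloc(nthreads * sizeof *tid);

    for (long i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    printf("threads=%ld  pi ~= %.10f\n", nthreads, 4.0 * pi_sum);
    free(tid);
    return 0;
}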


2. Pi Busywait

pi_busywait.png

Here, time increases as the number of threads increases, since every thread spins in a busy-wait loop. As the number of threads grows, the number of wasted CPU cycles grows, which increases the overall time.


3. Matrix Multiplication

matmul.png


Here, when the size of the array is small, the time rises very steeply as the number of threads increases, because there is no point in parallelizing such a small amount of work; the overheads dominate in this case.

When we increase the size of the array to a high value, the graph does not rise as fast as before, because now each thread works in parallel and the work is distributed equally among all of them, so the time of execution is lower. But when the number of threads is more than what is required, the time increases again because of overheads.


4. Barrier Semaphores

barrier_2sems.png

Here, the graph could not be seen properly because the number of threads could not be increased beyond a certain point; the OS would not allow it. So the actual graph could not be obtained, but it was seen that for a very small number of threads the time was almost zero, and at high thread counts the time increases to a great extent.


5. Barrier Mutex

barrier_mutex.png

Here, the graph increases linearly with the number of threads.


6. Conditional Barrier

cond.png

Here also, the graph could not be seen properly because the number of threads could not be increased beyond a certain point; the OS would not allow it. So the actual graph could not be obtained, but it was seen that for a very small number of threads the time was almost zero, and at high thread counts the time increases to a great extent.


7. Cout v/s Std

greet.png

Here, the graphs just show that std is faster than cout.


VIVEK SEHGAL 374/CO/11

VIREN GULATI 369/CO/11


Summary of 30/04/14

This blog talks about the basics of the PRAM model and its variants. We would also be discussing a very important algorithm dealing with locating the greatest entry in an array using parallel processing.


PRAM models are classified on the basis of the methods employed in handling the various conflicts, and the speed at which they can solve problems. Hence the variants of the PRAM model are:-

1. Exclusive Read Exclusive Write (EREW)

2. Exclusive Read Concurrent Write (ERCW)

3. Concurrent Read Exclusive Write (CREW)

4. Concurrent Read Concurrent Write (CRCW)


Here we would be focussing on the CRCW model only. Since multiple processors are interacting and accessing the memory at the same time, conflicts would be inevitable. In the case of reading the memory at the same time, no conflicts would be expected, because the processors need not enter the critical section when they are only reading concurrently. One way of achieving this would be making use of broadcasting, as do the crossbar and multistage networks. Here the broadcaster sends the duplicated value to the shared memory and these values can then be read by the processors. But in the case of concurrent write, since more than one processor tries to access the memory and write into the same memory location at the same time, conflicts are unavoidable and the need for algorithms to tackle these conflicts arises.


Now let's talk about the algorithm which employs parallel processing to locate the element with the greatest value in an array. We would be inputting an array s[ ] from the user, which is the array in which the element with the maximum value is to be located. Since this algorithm makes use of parallel processing, it is way faster than the other methods, but the hardware cost increases because of the use of nC2 processors. We are using nC2 processors because we need one processor per comparison and because the elements are being compared in pairs. For example, the processor p[1,2] compares the values of the first and the second elements. We would also be using another array t[ ] to locate the element we are searching for. In this array, the entry '1' for an element means that it hasn't been eliminated, and '0' means that it has been eliminated and cannot be the element having the greatest value. For example, when the processor p[1,2] is done comparing the first two elements, the entry in t[ ] corresponding to the element with the lower value is changed to 0, while the entry corresponding to the element with the greater value is left as 1.

The values of all the entries of t[ ] are calculated in a similar fashion and at the last stage of the algorithm, we are left with an array in which all the entries are 0 apart from the one we had planned on locating. Hence the algorithm can be summarized as follows :-

1. Reading the values from the input array

2. Comparing the values in pairs

3. Changing the values of the array t[] in the manner explained above.

4. Reading the array finally obtained to locate the entry whose value is 1; the corresponding element of s[ ] is the largest.


Algorithm :-

keytype parlargest(int n, keytype s[ ])
{
    index i, j;                      // the two indices assigned to this processor
    keytype first, second;           // local copies of s[i] and s[j]
    int t[n];                        // t[k] = 1 means s[k] has not been eliminated

    // Initialization stage: set every t[k] to 1.
    for every k, t[k] = 1;

    // Each processor p[i, j] (one per pair, i < j) then works in parallel:
    read s[i] into first and s[j] into second;

    // Eliminate the smaller of the two: its t entry becomes 0.
    if (first < second)
        t[i] = 0;
    else
        t[j] = 0;

    // After all comparisons, exactly one entry of t is still 1.
    if (t[i] == 1)
        output s[i];
    else if (t[j] == 1)
        output s[j];
}


VIVEK SEHGAL 374/CO/11

VIREN GULATI 369/CO/11


Inter-process communication

 

Inter-process communication (IPC) is a set of methods for the exchange of data among multiple threads in one or more processes. Processes may be running on one or more computers connected by a network. IPC methods are divided into methods for message passing, synchronization, shared memory, and remote procedure calls (RPC). The method of IPC used may vary based on the bandwidth and latency of communication between the threads, and the type of data being communicated.

 

Typically, applications that use IPC are categorized as clients or servers. A client is an application or a process that requests a service from some other application or process. A server is an application or a process that responds to a client request. Many applications act as both a client and a server, depending on the situation.

 

There are two types of IPCs:

 

·         LPC (local procedure call)    LPCs are used in multitasking operating systems to allow concurrently running tasks to talk to one another. They can share memory spaces, synchronize tasks, and send messages to one another.

 

·         RPC (remote procedure call)    RPCs are similar to the LPC but work over networks. RPCs provide mechanisms that clients use to communicate requests for services to another network-attached system such as a server. If you think of a client/server application as a program that has been split between front-end and back-end systems, the RPC can be viewed as the component that reintegrates them over the network. RPCs are sometimes called coupling mechanisms.

 

Various methods for inter-process communication used in most processors

 

        ·         Signals:

They are used to signal asynchronous events to one or more processes. A signal could be generated by a keyboard interrupt or an error condition such as the process attempting to access a non-existent location in its virtual memory. 

        ·         Sockets

Sockets are end points of inter-process communication flow across a network. Since communication between most computers is based on the Internet Protocol, most network sockets are Internet sockets.

 

  • Shared memory    Processes can exchange values in shared memory. The memory becomes a sort of bulletin board where processes can post status information and data that needs to be shared.

 

  • Queues    A queue IPC is a structured and ordered list of memory segments where processes store or retrieve data.

 

  • Semaphores    A semaphore provides a synchronizing mechanism for processes that are accessing the same resource. No data is passed with a semaphore-it simply coordinates access to shared resources.

 

  • Pipes    A pipe provides a way for processes to communicate with one another by exchanging messages. Named pipes provide a way for processes running on different computer systems to communicate over the network. A mail slot is a store-and-forward messaging system that doesn't require stations to synchronize with one another.
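As a small illustration of the pipe mechanism just described (a minimal parent/child sketch on a POSIX system, not a full IPC example), the parent writes a message into the pipe and the child reads it back:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int fd[2];                       /* fd[0] = read end, fd[1] = write end */
    if (pipe(fd) < 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                  /* child: reads the message */
        char buf[64];
        close(fd[1]);
        ssize_t n = read(fd[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("child received: %s\n", buf); }
        close(fd[0]);
        return 0;
    }

    /* parent: writes the message, then waits for the child */
    close(fd[0]);
    const char *msg = "hello from parent";
    write(fd[1], msg, strlen(msg));
    close(fd[1]);
    wait(NULL);
    return 0;
}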

 

Sources: http://en.wikipedia.org/wiki/Inter-process_communication

http://www.linktionary.com/i/ipc.html


By- Tarun Kumar


                            MESSAGE PASSING MECHANISMS

Message passing in any multicomputer network requires hardware and software support. There are both deterministic and adaptive routing algorithms for achieving deadlock-free message routing.

Message Formats:

A message is the logical unit for internode communication. It is assembled from an arbitrary number of fixed-length packets.

Packet: the packet is the basic unit containing the destination address for routing purposes. A sequence number is needed in each packet, as packets may arrive asynchronously, to allow the message to be reassembled properly. A packet can be further divided into a number of fixed-length flits (flow control digits). Routing information and the sequence number occupy the header flits; the remaining flits carry the data, and the flit length is often affected by the network size.
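To make the message/packet/flit terminology concrete, here is a hedged C sketch of how a fixed-length packet might be carved into flits; the flit width, packet size and header fields below are purely illustrative and not taken from any particular machine:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define FLIT_BYTES   4      /* illustrative flit width             */
#define PACKET_FLITS 8      /* fixed-length packet: 8 flits total  */

/* Header information carried by the leading flit(s). */
struct packet_header {
    uint16_t dest;          /* destination node address           */
    uint16_t seq;           /* sequence number for reassembly     */
};

struct packet {
    uint8_t flits[PACKET_FLITS][FLIT_BYTES];
};

/* Build a packet: the header occupies the first flit, the rest carry data. */
static void build_packet(struct packet *p, uint16_t dest, uint16_t seq,
                         const uint8_t *data, size_t len)
{
    struct packet_header h = { dest, seq };
    memset(p, 0, sizeof *p);
    memcpy(p->flits[0], &h, sizeof h);                /* header flit      */
    size_t room = (PACKET_FLITS - 1) * FLIT_BYTES;    /* data flit space  */
    memcpy(p->flits[1], data, len < room ? len : room);
}

int main(void)
{
    struct packet p;
    const uint8_t payload[] = "hello";
    build_packet(&p, 5, 1, payload, sizeof payload);
    printf("packet of %d flits, %d bytes each\n", PACKET_FLITS, FLIT_BYTES);
    return 0;
}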

                                       WORMHOLE ROUTING

 

  • The packets are split into flits which are snaked along the route in exactly the same pipelined way as in conflict-free VCT (virtual cut-through) switching. Hence, here too, transmission of different packets cannot be interleaved or multiplexed freely over one physical channel without additional architectural support.
  • Routers do not have buffers for the whole packets. Every router has small buffers for one or a few flits.
  • The header flit again builds a path in the network, which the other flits follow in pipeline. The sequence of buffers and links occupied by flits of a given packet forms the wormhole. However, the length of the path here is proportional to the number of flits in the packet. Typically, it spans the whole path between the source and destination.
  • If the header cannot proceed due to busy output channels, the whole chain of flits gets stalled, occupying flit buffers in routers on the path constructed so far and blocking other possible communications.

 

 

 

Source: http://pages.cs.wisc.edu/~tvrdik/7/html/Section7.html

              Wormhole switching of a packet

(a) The header is copied into the output buffer after the routing decision has been made. (b) The header flit is transferred to the second router and the other flits follow it. (c) The header flit arrives at a router with a busy output channel and the whole chain of flits along the path gets stalled, blocking all its channels. (d) Pipelining of flits in the case of conflict-free routing, establishing the wormhole across routers.

 

 

  • Wormhole routing allows simple, small, cheap, and fast routers. Therefore, it is the most common switching technique used nowadays in commercial machines.
  • Wormhole routers often use only input buffering.
  • Blocking resources in the case of stalled pipelines is the main drawback of this technique.
  • Since blocking chains of buffers can easily cause a snowball effect, WH switching is very deadlock-prone. Deadlock handling requires special care.
  • This issue is related to the concept of virtual channels. We cannot mix flits of different packets into one buffer (= one physical channel), since flits carry no identity flags. However, by adding some control logic, we can split one physical channel into several virtual channels. They will have their own buffers, but will share (time-multiplex) one single physical channel medium (similarly to processes with their own contexts sharing one physical processor).
Sources: Advanced Computer Architecture by Kai Hwang

 By- Tarun Kumar

                     VERY LONG INSTRUCTION WORD (VLIW)

                                          ARCHITECTURE

 

Very long instruction word or VLIW refers to a CPU architecture designed to take advantage of instruction-level parallelism (ILP). Unlike dynamically scheduled superscalar designs, the VLIW approach executes operations in parallel based on a fixed schedule determined when programs are compiled. Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.

 

                                  Comparison of CISC, RISC and VLIW



Source: http://www.site.uottawa.ca/~mbolic/elg6158/vliw1.ppt

The VLIW architecture is generalized from two well-established concepts: horizontal microcoding and superscalar processing. A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length. As illustrated in Figure-1, multiple functional units are used concurrently in a VLIW processor. All functional units share the use of a common large register file. The operations to be simultaneously executed by the functional units are synchronized in a VLIW instruction of, say, 256 or 1024 bits per instruction word, as implemented in the Multiflow computer models.


Fig-1, A  typical VLIW processor and instruction format

     Source: http://www.csbdu.in/virtual/DIGITAL%20MUP/5.3.php

 

Different fields of the long instruction word carry the opcodes to be dispatched to different functional units. Programs written in conventional short instruction words must be compacted together to form VLIW instructions, and this code compaction must be done by the compiler.
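A hedged sketch of the idea (the slot layout below is invented for illustration and does not correspond to any real VLIW machine): one long instruction word carries an independent operation field for each functional unit, and the hardware simply dispatches every slot in the same cycle, trusting the compiler's schedule.

#include <stdio.h>
#include <stdint.h>

/* One operation slot: opcode plus register operands.
 * Field widths are illustrative only. */
struct op {
    uint8_t opcode;     /* e.g. 0 = nop, 1 = add, 2 = mul, 3 = load */
    uint8_t dst, src1, src2;
};

/* A toy "long instruction word": one slot per functional unit, all
 * issued in the same cycle. Real VLIW words are hundreds of bits wide
 * and may hold many more slots. */
struct vliw_word {
    struct op alu;      /* integer unit slot       */
    struct op fpu;      /* floating-point unit slot */
    struct op mem;      /* load/store unit slot    */
};

static void issue(const struct vliw_word *w)
{
    /* The compiler has already guaranteed the slots are independent,
     * so the "hardware" dispatches all of them without checking. */
    printf("cycle: ALU op %u, FPU op %u, MEM op %u issued together\n",
           w->alu.opcode, w->fpu.opcode, w->mem.opcode);
}

int main(void)
{
    struct vliw_word w = {
        .alu = { 1, 3, 1, 2 },   /* add r3, r1, r2 */
        .fpu = { 2, 7, 5, 6 },   /* mul f7, f5, f6 */
        .mem = { 3, 4, 0, 0 },   /* load r4, [r0]  */
    };
    issue(&w);
    return 0;
}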

 

Sources: Advanced Computer Architecture by Kai Hwang

                  http://en.wikipedia.org/wiki/Very_long_instruction_word

                  http://www.csbdu.in/virtual/DIGITAL%20MUP/5.3.php

                  http://www.site.uottawa.ca/~mbolic/elg6158/vliw1.ppt

By- Tarun Kumar

                                                  


                                                  

Monday 28 April 2014

Uniprocessor parallelism

Yesterday I had a small conversation with ma'am and got my doubts cleared.
This may be helpful for you too, so I am posting it here.

I have a question: if we are talking about architectures, then where do pipeline, superpipeline and superscalar fit in? They are not architectures, they are techniques which can be used to increase DOP, and I think we can use these techniques within any architecture. Am I right?

(this was my question)

Reply by ma'am:
Pipelining per se is indeed a technique. The basic idea is to achieve temporal parallelism by overlapped operations.

But pipelined architecture is not a misnomer - Any feature that helps instructions to flow from one point of the hardware to another qualifies for a distinct architecture. Just think of architecture as the ISA plus the roadways for instructions and data to flow. 

That way, pipelined architecture determines how they flow within a single microprocessor. You are right - all three versions (P/L, superP/L and superscalar P/L) increase DOP, but within one processor. You can think of them as parallelism within a uni-processor. Hence they all support fine-grain Instruction Level Parallelism (ILP).

That decides where they come in architectures. Write them under uniprocessor parallelism. 

You can make a complete classification for uniprocessor parallelism. 

1) CISC:
- Pipelined, microcoded, Implicit parallelism. Earlier, CISC architectures were NOT pipelined, but now all high-performance processors (barring, say, the 8086, 6800, etc.) are pipelined.
- Simple compiler. 
- Parallelism mostly detected implicitly at run time by hardware support, but compiler optimizations also help. 

2) RISC: 
- Pipelined,hardwired control, Implicit parallelism. 
- Simpler hardware than CISC.
- But need more compiler support.
- Parallelism is detected mostly implicitly, but compiler support needed.

3) RISC at core but CISC outside:
- Pipelined, control unit uses a trace cache for frequently used simpler instructions + a microcode ROM for occasional complex instructions, Implicit parallelism.
- Parallelism mostly detected implicitly at run time by hardware support but compiler optimizations help. 
  
4) VLIW:
- Pipelined, horizontally microcoded, Explicit parallelism. 
- Even simpler hardware than RISC.
- But need even higher compiler support. 
-Parallelism detected explicitly by compiler but hardware support helps.

Note: All four architectures can be superpipelined & superscalared.

By:-
Uzma

Thursday 24 April 2014

Fwd: Summary for class on 21st April




Minimum Average Latency

 

The term Minimum Average Latency (henceforth referred to as MAL) comes into the picture when we consider non-linear pipelines.

 In non-linear pipelines, as we know, the stages can be used in different orders or even multiple stages can be used in the same cycle. So, it becomes difficult to determine when to initiate each successive task in order to obtain maximum performance without conflict.

 

Reservation Tables

Here we see a sample reservation table for a 3-stage pipeline and an 8-cycle task. The 'X' marks represent the corresponding pipeline stages used in different cycles by the task. For example: The task uses Stage S3 of the pipeline in the 3rd cycle.

 

Latency, Forbidden Latencies & Collision Vectors

Now, we must wait for some cycles before we can initiate another task on the pipeline without conflict. The no. of cycles between initiations is called latency.

 

Now, some latencies are forbidden because they lead to collision, i.e. a simultaneous need for the 'same stage' by two different tasks. These are called Forbidden Latencies.




For example: for the above reservation table, if we initiate another task after two cycles, then it'll lead to collision [ the stage S2 will then be used by the cycle 4 of task 1 & cycle 2 of task 2 ] and hence the initiation isn't possible, thus '2' is a forbidden latency.

 

There is a systematic way to represent all forbidden latencies for a reservation table, called the Collision Vector.

A Collision Vector is an n-bit vector, where n = (number of cycles needed for the task) - 1

C= (Cn Cn-1 Cn-2 …… C1), C is the collision vector

Now, Ci=1, if the latency 'i' causes collision & Ci=0, if the latency 'i' is permissible

The collision vector for the above reservation table is 1011010.
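The forbidden latencies and the collision vector can be computed mechanically from the reservation table. The sketch below encodes one table that is consistent with the details quoted in this post (S3 used in cycle 3, S2 used in cycles 2 and 4, collision vector 1011010); since the actual table appears only as an image, treat the marks as a reconstruction rather than the original. Running it prints 1011010.

#include <stdio.h>

#define STAGES 3
#define CYCLES 8

/* Reservation table (1 = stage used in that cycle). Reconstructed for
 * illustration to be consistent with the text; the post's table is an image. */
static const int rt[STAGES][CYCLES] = {
    {1, 0, 0, 0, 0, 1, 0, 1},   /* S1: cycles 1, 6, 8 */
    {0, 1, 0, 1, 0, 0, 0, 0},   /* S2: cycles 2, 4    */
    {0, 0, 1, 0, 0, 0, 1, 0},   /* S3: cycles 3, 7    */
};

int main(void)
{
    /* Latency p is forbidden if some stage is used in cycles c and c+p. */
    int forbidden[CYCLES] = {0};
    for (int s = 0; s < STAGES; s++)
        for (int c1 = 0; c1 < CYCLES; c1++)
            for (int c2 = c1 + 1; c2 < CYCLES; c2++)
                if (rt[s][c1] && rt[s][c2])
                    forbidden[c2 - c1] = 1;

    /* Collision vector C = (C(n-1) ... C1), Ci = 1 if latency i is forbidden. */
    printf("Collision vector: ");
    for (int i = CYCLES - 1; i >= 1; i--)
        printf("%d", forbidden[i]);
    printf("\n");
    return 0;
}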

 

State Transitions & Latency Cycles

State diagrams can be constructed to specify the permissible transitions among successive initiations of tasks.

[ Here 'state' means that, the Collision Vector gives us info. about the pipeline that which latencies are allowed and which are not, for task initiation; this info. is being referred to here as 'state'. So each CV is regarded as the state of the pipeline. ]

The next state of the pipeline at time t+p can be obtained as follows-

Right-shift the present Collision Vector for the reservation table by p bits and then logically OR the obtained Collision Vector with the original Collision Vector.

For example: We want to find the state of the pipeline after 1 latency (corresponding to the above Collision Vector and Reservation Table)

Original Collision Vector(CV): 1011010

1-bit right shifted CV: 0101101

New CV (corresponding to the new state): 1011010 |  0101101 = 1111111
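The same shift-and-OR rule is easy to express in code (a small sketch that stores the 7-bit collision vector in an unsigned integer with C1 in the least significant bit; latencies 1 and 3 are used only as examples). With the collision vector 1011010 it prints 1111111 for latency 1, matching the hand computation above.

#include <stdio.h>

#define N 7                         /* vector length for the 8-cycle task */

/* Next state after initiating a task with latency p:
 * right-shift the current state by p bits and OR in the initial CV. */
static unsigned next_state(unsigned state, unsigned initial_cv, int p)
{
    return (state >> p) | initial_cv;
}

static void print_bits(unsigned v)
{
    for (int i = N - 1; i >= 0; i--)
        putchar(v & (1u << i) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    unsigned cv = 0x5A;                  /* 1011010 = C7...C1, C1 in the LSB */
    print_bits(next_state(cv, cv, 1));   /* latency 1 -> 1111111             */
    print_bits(next_state(cv, cv, 3));   /* latency 3 (a permissible one)    */
    return 0;
}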

 

A Latency Cycle is a sequence of latencies which, when applied successively starting from a given state (CV) of the pipeline, brings the pipeline back to that same state, so the sequence can be repeated indefinitely.

For example: for the above Reservation Table

Original state: 1011010

State-1 after a 1-bit shift on original state: 1111111

State-2 after an 8-bit shift on State-1 (ORed with the original CV): 1011010

Now, here we see that we get the original state back after operating 2 latencies successively: 1 & 8. Now if we again apply 1 & 8, then we will keep getting the same state. Thus 1,8 is called a latency cycle.


  Other Latency cycles for the above reservation table.

 

Average Latency & Minimum Average Latency

Average latencies are defined for the different latency cycles of a reservation table, and the minimum of these average latencies gives us the Minimum Average Latency (MAL).

 

For example:

For the latency cycle 1,8 for the above  reservation table, the

Average Latency = Sum of latencies in the cycle / No. of latencies in the cycle.

Thus, average latency for cycle 1,8 = (1+8)/2 = 4.5

 


Bounds on Minimum Average Latency (MAL)

Lower Bound: Maximum no. of checkmarks in a row in the original reservation table

Upper Bound: No. of 1's in the original Collision Vector + 1

 


Pipeline Optimization Using Delays

To further optimize the pipeline usage even after finding MAL, we can use the concept of introducing 'non-compute delays' in the pipeline. Using these delays, we can change the reservation table of the pipeline so that collisions that were occurring earlier do not happen.

I'll try to explain it through an example:

Suppose we have a 3-stage pipeline with 5 cycles and the following reservation table:

 

                     

 

Original CV: 1011 and Minimum Average Latency: 3

Now, let's say we introduce a delay of one clock cycle after the 3rd cycle and the 5th cycle, then we get the following reservation table:

     

Now the new CV: 100010

For the above reservation table the MAL comes out to be: 2 (for the latency cycle 1,3)

Thus, here we see that the introduction of a delay in the pipeline results in even more optimization due to the reduction of MAL.


References:

Advanced Computer Architecture by Kai Hwang


#No direct copying of content done from any sources from the internet or otherwise.

 

Links For Further Study:

http://iris.nyit.edu/~mcolef/eeng633/extras/Pipelining&Superscalar_Techniques_ch6.ppt


http://vlsi.catholic.ac.kr/pds/data/class/advcom/adcom4.pdf

 

Compiled By:

          Vipul Gaur 368/CO/11

          Vijay 364/CO/11