Thursday 13 March 2014

Summary of lectures on 12/03/2014 and 13/03/2014

MEMORY VS PERFORMANCE


  • Concept of Virtual Memory


The concept of virtual memory allows processes to reference more memory than the amount of physical memory actually available in the system. The operating system, with hardware support, maps virtual addresses to physical memory. To a process, virtual memory therefore appears as a contiguous block of memory (also called the virtual or logical address space) that can be much larger than the available physical address space.


Image reference: chortle.ccsu.edu
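
As a rough illustration of this mapping, here is a minimal C sketch of virtual-to-physical address translation through a single-level page table. The page size, number of pages, and frame numbers are made-up values for illustration, not from the lecture.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096   /* assumed page size */
#define NUM_PAGES 8      /* assumed (tiny) virtual address space */

/* page_table[v] = physical frame backing virtual page v;
   a real table would also mark pages that are not resident. */
int page_table[NUM_PAGES] = { 3, 7, 1, 0, 5, 2, 6, 4 };

uint32_t translate(uint32_t vaddr) {
    uint32_t vpage  = vaddr / PAGE_SIZE;   /* virtual page number  */
    uint32_t offset = vaddr % PAGE_SIZE;   /* offset within a page */
    return (uint32_t)page_table[vpage] * PAGE_SIZE + offset;
}

int main(void) {
    uint32_t v = 2 * PAGE_SIZE + 42;       /* virtual page 2, offset 42 */
    printf("virtual 0x%x -> physical 0x%x\n", v, translate(v));
    return 0;
}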


As the number of processors increases, the total memory in the system, and hence the virtual memory each processor ‘sees’, increases. Therefore, each processor can now handle a larger problem than it could by referencing only its own physical memory. This is the basis of Sun and Ni’s Law, a memory-bounded speedup model in which, as the number of processors n increases, the amount of work that can be done scales with a function that can grow faster than n (for example as n^x with x > 1) rather than linearly with n, giving superlinear speedup.
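
One common statement of Sun and Ni's memory-bounded speedup model (following their original formulation, where f is the inherently sequential fraction of the work and G(n) is the factor by which the problem size grows when memory scales with n processors) is:

  S(n) = (f + (1 - f) G(n)) / (f + (1 - f) G(n) / n)

With G(n) = 1 this reduces to Amdahl's Law (fixed problem size); with G(n) = n it reduces to Gustafson's Law (fixed time); and when G(n) grows faster than n, the model predicts the superlinear behaviour described above.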


  • Performance increases with increase in memory space


As more memory becomes available, the overall performance of the system increases.

This behaviour can be attributed to having more buffers, more registers and larger caches.


With a larger number of registers available to the processor, resolving anti (write-after-read) and output (write-after-write) dependencies through register renaming becomes easier. As cache size increases, more recently used data can be stored and the chance of a hit increases, as mentioned by Shivam. Even increasing the main memory (RAM) size boosts the performance of the system, as mentioned by Sumit.
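
Here is a minimal C sketch of the renaming idea: every write to an architectural register is assigned a fresh physical register, so instructions that reuse the same register name no longer conflict. The register counts and the rename-table layout are illustrative assumptions, not a real pipeline design.

#include <stdio.h>

#define NUM_ARCH 8    /* assumed architectural registers R0..R7  */
#define NUM_PHYS 32   /* assumed physical register file P0..P31 */

int rename_map[NUM_ARCH];   /* architectural -> physical mapping */
int next_phys = NUM_ARCH;   /* next free physical register       */

/* Give the destination register of an instruction a fresh physical name. */
int rename_dest(int arch) {
    rename_map[arch] = next_phys++;
    return rename_map[arch];
}

int main(void) {
    for (int r = 0; r < NUM_ARCH; r++) rename_map[r] = r;
    /* I1: R1 = R2 + R3
       I2: R2 = R4 * R5   (anti/WAR: I2 overwrites R2, which I1 reads)
       I3: R2 = R6 - R7   (output/WAW: I3 and I2 both write R2)       */
    printf("I1 writes R1 -> P%d\n", rename_dest(1));
    printf("I2 writes R2 -> P%d\n", rename_dest(2));
    printf("I3 writes R2 -> P%d\n", rename_dest(3));
    /* After renaming, I1, I2 and I3 write three different physical
       registers, so the anti and output dependencies disappear. */
    return 0;
}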

The following buffers increase the performance of the system:


  • Branch target buffer (BTB), sketched below

  • Loop buffer

  • Prefetch buffer

  • Reservation stations

  • Reorder buffer

  • Commit unit

  • Translation lookaside buffer (TLB)
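
As an example of one of these structures, here is a minimal C sketch of a direct-mapped branch target buffer: it maps a branch instruction's address to its predicted target so the fetch unit can redirect immediately. The entry count and addresses are made up for illustration.

#include <stdio.h>
#include <stdint.h>

#define BTB_ENTRIES 16   /* assumed (tiny) BTB size */

typedef struct {
    uint32_t branch_pc;  /* address of the branch instruction */
    uint32_t target;     /* predicted branch target           */
    int      valid;
} BTBEntry;

BTBEntry btb[BTB_ENTRIES];   /* zero-initialised: all entries invalid */

/* Record a taken branch and its target. */
void btb_update(uint32_t pc, uint32_t target) {
    BTBEntry *e = &btb[pc % BTB_ENTRIES];
    e->branch_pc = pc;
    e->target    = target;
    e->valid     = 1;
}

/* Return the predicted target, or 0 if the branch is not in the BTB. */
uint32_t btb_lookup(uint32_t pc) {
    BTBEntry *e = &btb[pc % BTB_ENTRIES];
    return (e->valid && e->branch_pc == pc) ? e->target : 0;
}

int main(void) {
    btb_update(0x1000, 0x2000);  /* branch at 0x1000 was taken to 0x2000 */
    printf("predicted target: 0x%x\n", btb_lookup(0x1000));
    return 0;
}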


Storing traces of decoded micro-operations in a trace cache also helps to increase performance.

Hence, increasing memory has a much greater impact on performance than increasing the number of functional units in the system.


TACKLING CONSISTENCY OF DATA


In earlier lectures, the following overheads of the shared memory paradigm were discussed:


1. The need for synchronization between different processors that access the same memory.

2. Mutual exclusion: only one processor may execute its critical section at a time.


Another repercussion of sharing memory between processors is the possibility of inconsistent data.

Data is inconsistent when different processors, each holding its own copy of the same data, have conflicting values for it.


Suppose there are three processors, each with its own cache. Each obtains a block of data from main memory and then, after some processing, modifies the copy in its cache. It is now possible that each processor holds a different value for that data while main memory holds yet another value. For example, if memory initially holds x = 0, P1 may write 5 into its cached copy and P2 may write 9 into its own, leaving three conflicting values of x in the system. This is inconsistency of the same data.


  • Cache Coherence


Maintaining consistency among different copies of the same memory block is known as cache coherence. Coherence must be maintained when a processor writes an updated value, as well as in special circumstances such as cache block replacement or process switching (context switching). The coherence mechanism depends on the type of cache used.


1. Write Through Cache:


In the write-through cache mechanism, a write operation updates both the cache block and the corresponding block in main memory simultaneously. Therefore main memory always holds the most recently written data, as does the cache of the processor that performed the write; copies in all other caches become stale (invalid).

To maintain coherence in write-through caches, we use an invalidation policy. Each cache block carries one extra bit indicating whether it holds valid or invalid data. Since there are only two possible states, Valid and Invalid, one extra bit per cache block suffices.

Each cache has a state diagram, implemented in external hardware. Since all caches are identical, they share the same state diagram: V is the valid state and I is the invalid state.


Image reference: people.engr.ncsu.edu


It is important to note that what we invalidate is the cache block’s content, that is, the copy of the addressed main memory block, not the cache line itself, which remains usable for other blocks.


Invalidating every cache other than the one attached to the writing processor would add considerable overhead if done in software, so the Snoopy protocol is used instead.

A ‘snoopy bus’ connects all cache controllers, and each controller monitors (snoops on) the bus. When one cache performs a write, every other cache controller that holds a copy of the snooped memory address invalidates it. Because this is done entirely in hardware, no operating system involvement is needed, which reduces the overheads.
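
Putting the pieces together, here is a minimal C sketch of the write-through invalidate protocol with a snooped bus, simulating one shared memory word and per-processor valid bits. The function names and sizes are illustrative assumptions, not a standard API.

#include <stdio.h>

#define NPROC 3

typedef enum { I_STATE, V_STATE } State;  /* Invalid, Valid */

int memory = 0;       /* one shared memory word (assumed for simplicity) */
State state[NPROC];   /* per-processor valid/invalid bit */
int cache[NPROC];     /* each processor's cached copy */

/* Write by processor p: update the cache AND main memory (write-through),
   then invalidate every other copy over the snoopy bus. */
void cpu_write(int p, int value) {
    cache[p] = value;
    state[p] = V_STATE;
    memory = value;                 /* write-through to memory */
    for (int q = 0; q < NPROC; q++)
        if (q != p)
            state[q] = I_STATE;     /* snooped bus invalidation */
}

/* Read by processor p: on an invalid copy, reload from memory,
   which is always up to date under write-through. */
int cpu_read(int p) {
    if (state[p] == I_STATE) {
        cache[p] = memory;
        state[p] = V_STATE;
    }
    return cache[p];
}

int main(void) {
    for (int q = 0; q < NPROC; q++) state[q] = I_STATE;
    cpu_read(0);
    cpu_read(1);                          /* P0 and P1 both cache the word */
    cpu_write(0, 42);                     /* P0 writes; P1's copy is invalidated */
    printf("P1 reads %d\n", cpu_read(1)); /* P1 misses and reloads 42 */
    return 0;
}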


However, this procedure for maintaining cache coherence increases both memory bus and snoopy bus traffic.


2. Write Back Cache:


In the write-back cache mechanism, an updated value is written only to the cache, not to main memory. The main memory block is updated only when the modified (dirty) cache block is replaced or flushed.

To maintain coherence in write-back caches, we use the concept of ‘ownership’ of a memory block. The processor that writes to a main memory block takes ownership of it and invalidates all other copies through the bus. A processor that needs to read a block first snoops on the bus to check whether another processor owns it; if so, it reads the valid data from that processor’s cache. Once more than one processor holds a valid copy of the same data, the main memory copy is also validated. The state diagram for the write-back coherence mechanism using the ownership protocol is shown below. Two extra bits are required to maintain the three cache states: RW (read-write), RO (read-only) and INV (invalid).
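
Here is a matching C sketch of the ownership idea just described, using the three states from the notes (RW, RO, INV) and one shared memory word; again, the names and sizes are illustrative assumptions.

#include <stdio.h>

#define NPROC 3

typedef enum { INV, RO, RW } State;  /* invalid, read-only, read-write (owner) */

int memory = 0;       /* one shared memory word (assumed for simplicity) */
State state[NPROC];
int cache[NPROC];

/* Write by processor p: p takes ownership (RW) and the bus invalidates
   all other copies. Main memory is NOT updated (write-back). */
void cpu_write(int p, int value) {
    cache[p] = value;
    state[p] = RW;
    for (int q = 0; q < NPROC; q++)
        if (q != p)
            state[q] = INV;
}

/* Read by processor p: snoop the bus for an owner. If some q owns the
   block, q supplies the data, memory is validated (updated), and both
   copies drop to the read-only state. */
int cpu_read(int p) {
    if (state[p] == INV) {
        int owner = -1;
        for (int q = 0; q < NPROC; q++)
            if (state[q] == RW)
                owner = q;
        if (owner >= 0) {
            memory = cache[owner];   /* main memory copy validated too */
            state[owner] = RO;
            cache[p] = cache[owner];
        } else {
            cache[p] = memory;       /* no owner: memory holds valid data */
        }
        state[p] = RO;
    }
    return cache[p];
}

int main(void) {
    for (int q = 0; q < NPROC; q++) state[q] = INV;
    cpu_write(0, 7);                 /* P0 owns the block; memory is stale */
    int v = cpu_read(1);             /* P1 snoops, P0 supplies the data   */
    printf("P1 reads %d, memory now %d\n", v, memory);
    return 0;
}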


The ownership protocol reduces main memory traffic, although snoopy bus traffic increases.


Between the write-back and write-through mechanisms, write-back is preferable when one or a few processors do most of the writing. However, if write operations are evenly balanced across all processors, write-through is the better choice: as its state diagram shows, write-through coherence is simpler to implement and needs only one extra bit per cache block, compared with the two extra bits required for write-back coherence.


3. Write Once Cache:


In this type of cache, a cache takes ownership of a memory block by writing to it twice in succession. The first write is also written through to main memory and moves the block into the ‘Reserved’ state; a second write moves it into the ‘Dirty’ state, in which the cache holds the only valid (modified) copy and memory is stale. The state diagram for write-once coherence is shown below. There are four possible states per cache block, so two extra bits per block are required. Write-once is closely related to the MESI protocol (Modified, Exclusive, Shared, Invalid).
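
For reference, the usual correspondence between the four write-once states and the four MESI states, as commonly described in the literature, is: Dirty corresponds to Modified, Reserved to Exclusive, Valid to Shared, and Invalid to Invalid.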




Image reference: Wikipedia


Other references:
      http://en.wikipedia.org/wiki/Virtual_memory
      http://en.wikipedia.org/wiki/Cache_(computing)
      http://en.wikipedia.org/wiki/Cache_coherence



      -Compiled by Riva Yadav & Rishul Aggarwal


5 comments:

  1. Cogent compilation. Was expecting comments from the class and am waiting for individuals to share their knowledge / queries.

  2. To know more about virtual memory, visit this page:
    http://computer.howstuffworks.com/virtual-memory.htm

  3. Just a small doubt... Please comment if I am wrong.
    In the case of the write-back cache, the RW state means that only 'this' cache is valid and the others are invalid (or hold invalid data), while in the RO state other caches can also hold valid data. So, if Zo is the instruction, why does the state not change? Also, is it possible to have Zo in the RO state?


    "Once more than one processor holds a valid copy of the same data, the main memory copy is also validated."
    Can anyone please elaborate?

  4. @sweety, if one processor reads a block from another processor's cache because that processor holds ownership of the block, then both processors end up with a valid copy of the same data block. In this case the memory must also be validated. This is what I understood from the discussion in class.

  5. This is an issue of memory traffic versus snoop traffic.
    Normally, the first time a cache block is loaded by one or more processors for reading, both the memory and processor copies are valid.
    However, only one processor can have ownership while it continues to write to the cache block, thus saving memory traffic.
    Now a read request by another processor takes away the ownership and both copies become valid. The question is, should memory also be updated at this point?
    a) Yes: if memory is updated, then a third processor that needs this cache block can get it from memory directly, without having to snoop around to find out which processor(s) have valid copies.
    Moreover, this rule is then followed always, irrespective of whether the cache block was originally loaded from memory or from an owner.
    b) If not, then one memory write is saved, but you end up with more snoops to look for valid copies, because there are two different situations: (1) {M, P1...Pn} all valid, and (2) only {P1...Pn} valid.

    --------

    I would like to make a clarification regarding Z. This is an invalidation due to cache block replacement, not an invalidation due to the consistency protocol.
    To distinguish between the two causes of invalidation, let us rename Z to Replacement. That should clarify the situation of Zo.
