Sunday, December 15, 2013

Measuring Cache Performance

There are two techniques for improving cache performance.
i. The first reduces the miss rate by reducing the probability that two different memory blocks will contend for the same cache location.
ii. The second reduces the miss penalty by adding an additional level to the memory hierarchy.

CPU time can be divided into the clock cycles that the CPU spends executing
the program and the clock cycles that the CPU spends waiting for the memory
system. We assume that the costs of cache accesses that are hits are part
of the CPU execution cycles. So,

CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time

The memory stall clock cycles come mainly from cache misses.
In real processors, the stalls generated by reads and writes can be
quite complex, and accurate performance prediction usually requires
detailed simulations of the processor and the memory system.

Memory-stall clock cycles can be defined as the sum of the stall cycles coming
from reads plus those coming from writes:

Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty

We can also factor this as:
Memory-stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty
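To make these formulas concrete, here is a minimal Python sketch that simply plugs numbers into them. All parameter values are hypothetical and chosen only for illustration:

# Minimal sketch: plug hypothetical numbers into the formulas above.
instructions       = 1_000_000  # instruction count (assumed)
accesses_per_instr = 1.36       # 1 instruction fetch + 0.36 data accesses (assumed mix)
miss_rate          = 0.03       # combined miss rate (assumed)
miss_penalty       = 100        # cycles per miss (assumed)
base_cpi           = 2.0        # CPI without memory stalls (assumed)
cycle_time         = 0.5e-9     # seconds per clock cycle (assumed 2 GHz clock)

# Memory-stall clock cycles = (Memory accesses / Program) x Miss rate x Miss penalty
stall_cycles = instructions * accesses_per_instr * miss_rate * miss_penalty

# CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time
cpu_time = (instructions * base_cpi + stall_cycles) * cycle_time

print(f"Memory-stall cycles: {stall_cycles:,.0f}")
print(f"CPU time:            {cpu_time * 1e3:.3f} ms")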

Cache Performance Example

Assume the instruction cache miss rate for a program is 2% and the data cache miss rate is 4%. If a processor has a CPI of 2 without any memory stalls, the miss penalty is 100 cycles for all misses, and 36% of the instructions are loads or stores, determine how much faster the processor would run with a perfect cache that never missed.

Solution:
Let I = instruction count
Instruction miss cycles = I × 2% × 100 = 2.00 × I
Data miss cycles = I × 36% × 4% × 100 = 1.44 × I

The total number of memory stall cycles is 2.00 I + 1.44 I = 3.44 I. This is more than 3 cycles of memory stall per instruction. Accordingly, the CPI with memory stalls is 2 + 3.44 = 5.44. Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is

CPU time with stalls / CPU time with perfect cache
        = (I × CPIstall × Clock cycle) / (I × CPIperfect × Clock cycle)
        = CPIstall / CPIperfect
        = 5.44 / 2
        = 2.72

The performance with the perfect cache is better by a factor of 2.72.
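The same arithmetic can be reproduced with a few lines of Python; the 2%, 4%, 36%, 100-cycle and CPI-of-2 figures come directly from the example above:

# Reproduce the worked example: speedup from a perfect cache.
i_miss_rate  = 0.02   # instruction cache miss rate
d_miss_rate  = 0.04   # data cache miss rate
load_store   = 0.36   # fraction of instructions that are loads or stores
miss_penalty = 100    # cycles
base_cpi     = 2.0    # CPI with a perfect cache

instr_miss_cycles = i_miss_rate * miss_penalty               # 2.00 per instruction
data_miss_cycles  = load_store * d_miss_rate * miss_penalty  # 1.44 per instruction

cpi_stall = base_cpi + instr_miss_cycles + data_miss_cycles  # 5.44
speedup   = cpi_stall / base_cpi                             # 2.72

print(cpi_stall, speedup)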


Performance summary

i. As CPU performance increases, the miss penalty becomes more significant.
ii. Decreasing the base CPI makes the proportion of time spent on memory stalls greater.
iii. Increasing the clock rate makes memory stalls account for more CPU cycles.
iv. Cache behavior cannot be neglected when evaluating system performance.



Associative Caches

Fully associative
1.      A block in memory may be associated with any entry in the cache.
2.      To find a given block in a fully associative cache, all the entries in the cache must be searched because a block can be placed in any one.
3.      To make the search practical, it is done in parallel with a comparator associated with each cache entry. These comparators significantly increase the hardware cost, effectively making fully associative placement practical only for caches with small numbers of blocks.

n-way set associative
1.      Consists of a number of sets, each of which consists of n blocks.
2.       Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set.
3.      Remember that in a direct-mapped cache, the position of a memory block is given by :
                             (Block number) modulo (Number of cache blocks)
4.      In a set-associative cache, the set containing a memory block is given by :
                             (Block number) modulo (Number of sets in the cache)
5.      An n-way set-associative cache needs only n comparators per access, which is much less expensive than a fully associative cache (a short sketch of both placement rules follows this list).
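The two modulo mappings in items 3 and 4 are easy to see in a few lines of Python. The 8-block cache size below is just an assumed example value:

# Placement rules for an assumed cache of 8 blocks.
NUM_BLOCKS = 8

def direct_mapped_index(block_number):
    # (Block number) modulo (Number of cache blocks)
    return block_number % NUM_BLOCKS

def set_index(block_number, num_sets):
    # (Block number) modulo (Number of sets in the cache)
    return block_number % num_sets

print(direct_mapped_index(12))           # 4: direct mapped, 8 blocks
print(set_index(12, NUM_BLOCKS // 2))    # 0: 2-way set associative, 4 sets
print(set_index(12, NUM_BLOCKS // 8))    # 0: fully associative, 1 set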


Associative Cache Example

For a cache with 8 entries (figure omitted).

Compare 4-block caches:
i. Direct mapped
ii. 2-way set associative
iii. Fully associative

Block access sequence: 0, 8, 0, 6, 8
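A short Python simulation (assuming LRU replacement within each set, the usual textbook convention) replays this access sequence and counts the misses for each of the three organizations:

# Replay the block-access sequence 0, 8, 0, 6, 8 on a 4-block cache.
def count_misses(accesses, num_blocks=4, ways=1):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]  # each set keeps blocks in LRU order (oldest first)
    misses = 0
    for block in accesses:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)               # hit: just refresh its LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                  # evict the least recently used block
        s.append(block)
    return misses

seq = [0, 8, 0, 6, 8]
print("Direct mapped:    ", count_misses(seq, ways=1))  # 5 misses
print("2-way set assoc.: ", count_misses(seq, ways=2))  # 4 misses
print("Fully associative:", count_misses(seq, ways=4))  # 3 misses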

Direct mapped & 2-way set associative (tables omitted).

Set Associative Cache Organization (figure omitted).
