Measuring Cache Performance
There are two different techniques for improving cache performance:
i. The first reduces the miss rate by reducing the probability that two different memory blocks will contend for the same cache location.
ii. The second reduces the miss penalty by adding an additional level to the memory hierarchy.
CPU time can be divided into the clock cycles that the CPU spends executing the program and the clock cycles that the CPU spends waiting for the memory system. We assume that the costs of cache accesses that are hits are part of the CPU execution cycles. So,
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × Clock cycle time
The memory stall clock cycles come mainly from cache misses. In real processors, the stalls generated by reads and writes can be quite complex, and accurate performance prediction usually requires detailed simulations of the processor and the memory system.
Memory-stall clock cycles can be defined as the sum of the stall cycles coming from reads plus those coming from writes:
Memory stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
We can also factor this as:
Memory stall clock cycles = (Instructions / Program) × (Misses / Instruction) × Miss penalty
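The two equivalent factorings above can be sanity-checked with a short calculation. This is a minimal sketch; the function names and the sample numbers (one million accesses, a 2% miss rate, a 100-cycle penalty) are illustrative assumptions, not values from the text:

```python
# Memory stall cycles = Memory accesses x Miss rate x Miss penalty
def stalls_from_accesses(memory_accesses, miss_rate, miss_penalty):
    return memory_accesses * miss_rate * miss_penalty

# Equivalent factoring:
# Memory stall cycles = Instructions x (Misses / Instruction) x Miss penalty
def stalls_from_instructions(instructions, misses_per_instruction, miss_penalty):
    return instructions * misses_per_instruction * miss_penalty

# Illustrative numbers: 1,000,000 accesses, 2% miss rate, 100-cycle penalty.
print(stalls_from_accesses(1_000_000, 0.02, 100))      # 2000000.0
# Same total, factored as 0.02 misses per instruction.
print(stalls_from_instructions(1_000_000, 0.02, 100))  # 2000000.0
```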
Cache Performance Example
Assume the instruction cache miss rate for a program is 2% and the data cache miss rate is 4%, with 36% of instructions being loads or stores. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster the processor would run with a perfect cache that never missed.
Solution:
Let I = instruction count
Instruction miss cycles = I × 2% × 100 = 2.00 × I
Data miss cycles = I × 36% × 4% × 100 = 1.44 × I
The total number of memory stall cycles is 2.00 I + 1.44 I = 3.44 I. This is more than 3 cycles of memory stall per instruction. Accordingly, the CPI with memory stalls is 2 + 3.44 = 5.44. Since there is no change in instruction count or clock rate, the ratio of the CPU execution times is
CPU time with stalls / CPU time with perfect cache
= (I × CPIstall × Clock cycle) / (I × CPIperfect × Clock cycle)
= CPIstall / CPIperfect
= 5.44 / 2
= 2.72
The processor with the perfect cache is therefore faster by a factor of 2.72.
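The arithmetic of this example can be reproduced directly. A minimal sketch, with variable names chosen here for illustration:

```python
# Worked example: I-cache miss rate 2%, D-cache miss rate 4%,
# 36% of instructions are loads/stores, 100-cycle penalty, base CPI 2.
miss_penalty = 100
instr_miss_cycles_per_instr = 0.02 * miss_penalty         # 2.00 per instruction
data_miss_cycles_per_instr = 0.36 * 0.04 * miss_penalty   # 1.44 per instruction

cpi_perfect = 2
cpi_stall = cpi_perfect + instr_miss_cycles_per_instr + data_miss_cycles_per_instr

# Instruction count and clock cycle time cancel in the ratio,
# so the speedup is just the ratio of the CPIs.
print(round(cpi_stall, 2))                # 5.44
print(round(cpi_stall / cpi_perfect, 2))  # 2.72
```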
Performance summary
i. As CPU performance increases, the miss penalty becomes more significant.
ii. Decreasing the base CPI increases the proportion of time spent on memory stalls.
iii. Increasing the clock rate makes memory stalls account for more CPU cycles.
iv. Cache behavior cannot be neglected when evaluating system performance.
Associative Caches
Fully associative
1. A block in memory may be associated with any entry in the cache.
2. To find a given block, all the entries in the cache must be searched, because the block can be placed in any one of them.
3. To make the search practical, it is done in parallel with a comparator associated with each cache entry. These comparators significantly increase the hardware cost, effectively making fully associative placement practical only for caches with small numbers of blocks.
n-way set associative
1. Consists of a number of sets, each of which consists of n blocks.
2. Each block in memory maps to a unique set in the cache, given by the index field, and the block can be placed in any element of that set.
3. Remember that in a direct-mapped cache, the position of a memory block is given by:
(Block number) modulo (Number of cache blocks)
4. In a set-associative cache, the set containing a memory block is given by:
(Block number) modulo (Number of sets in the cache)
5. Searching one set requires only n comparators, which is less expensive than a fully associative search.
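Both placement rules are simple modulo operations. A minimal sketch, where block number 12 and the cache sizes below are illustrative assumptions:

```python
# Direct mapped: (Block number) modulo (Number of cache blocks)
def direct_mapped_index(block_number, num_cache_blocks):
    return block_number % num_cache_blocks

# Set associative: (Block number) modulo (Number of sets in the cache)
def set_index(block_number, num_sets):
    return block_number % num_sets

# An 8-block cache, organized three ways: direct mapped (8 "sets" of 1),
# 2-way set associative (4 sets), and 4-way set associative (2 sets).
block = 12
print(direct_mapped_index(block, 8))  # 4: only cache block 4 can hold it
print(set_index(block, 4))            # 0: either of the 2 blocks in set 0
print(set_index(block, 2))            # 0: any of the 4 blocks in set 0
```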
Associative Cache Example
Compare three 4-block caches:
i. Direct mapped
ii. 2-way set associative
iii. Fully associative
Block access sequence: 0, 8, 0, 6, 8
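The behavior of the three organizations on this access sequence can be replayed with a small simulator. This is a sketch under an assumed least-recently-used (LRU) replacement policy:

```python
def count_misses(sequence, num_sets, ways):
    # One LRU list per set; set index = block number % number of sets.
    sets = {i: [] for i in range(num_sets)}
    misses = 0
    for block in sequence:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)  # hit: refresh this block's LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)     # set is full: evict the least recently used
        s.append(block)      # most recently used goes at the end
    return misses

seq = [0, 8, 0, 6, 8]
# All three are 4-block caches, organized differently.
print(count_misses(seq, num_sets=4, ways=1))  # direct mapped: 5 misses
print(count_misses(seq, num_sets=2, ways=2))  # 2-way set associative: 4 misses
print(count_misses(seq, num_sets=1, ways=4))  # fully associative: 3 misses
```

Blocks 0 and 8 map to the same location in the direct-mapped cache, so every access misses; associativity lets them coexist and cuts the miss count.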