Cache Memory
· Cache is a small, fast random access memory (RAM) used by the central processing unit (CPU) to reduce the average time to access memory.
· Cache memory stores instructions that are repeatedly required to run programs, improving overall system speed, since cache memory is designed to accelerate memory access.
· The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations.
· The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer; data transfers are processed much faster because the bottleneck created by the system bus is avoided.
· When the computer reads from or writes to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
· If the item is found in the cache, it is called a "cache hit"; otherwise it is a "cache miss".
· A tag is added to each cache entry to determine whether the requested word is in the cache or not.
· Tag – the part of the address of the actual data fetched from main memory.
· Data block (cache line) – holds the actual data fetched from main memory.
· Flag bits:
1. Valid bit – shows whether the cache block has been loaded with valid data.
2. Dirty bit – shows whether the block has been changed since it was read from main memory.
· The size of the cache refers to the amount of main memory data it can hold.
Accessing Cache
· The total number of bits needed for a cache is a function of the cache size and the address size.
· This is because the cache includes storage for both the data and the tags.
· Eg 1:
32-bit byte address
Direct-mapped cache
Cache size = 2^n blocks, so n bits are used for the index
Block size = 2^m words (2^(m+2) bytes), so m bits are used for the word within the block and 2 bits for the byte part of the address
Size of tag field:
32 - (n + m + 2)
Total number of bits in the direct-mapped cache:
2^n x (block size + tag size + valid field size)
Since the block size is 2^m words (2^m x 32 bits) and a 1-bit valid field is needed, the total number of bits in this cache is
2^n x (2^m x 32 + (32 - n - m - 2) + 1) = 2^n x (2^m x 32 + 31 - n - m)
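As a minimal sketch (Python; the helper name total_cache_bits and the 32-bit word assumption are mine, simply following the formula above):

def total_cache_bits(n, m, addr_bits=32):
    # Total bits in a direct-mapped cache with 2^n blocks of 2^m 32-bit words each.
    data_bits = (2 ** m) * 32                  # data storage per block
    tag_bits = addr_bits - (n + m + 2)         # tag field per block
    valid_bits = 1                             # one valid bit per block
    return (2 ** n) * (data_bits + tag_bits + valid_bits)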
· Eg 2:
How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?
16 KB = 4K words (2^12 words)
Block size is 4 words (2^2)
So there are 1024 blocks (2^10)
Tag = 32 - 10 - 2 - 2 = 18 bits
Each block has 4 x 32 = 128 bits of data, plus the tag, plus a valid bit.
Total cache size:
2^10 x (4 x 32 + (32 - 10 - 2 - 2) + 1)
= 2^10 x 147
= 147 Kbits, or about 18.4 KB for 16 KB of data
Therefore the total number of bits in the cache is about 1.15 times as many as needed for the data storage alone.
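Using the hypothetical helper sketched above, the example can be checked directly:

print(total_cache_bits(10, 2))   # 1024 x 147 = 150528 bits, i.e. about 18.4 KB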
Direct-mapped cache
· Each location in main memory can go in exactly one entry in the cache.
· There is no choice of which cache entry's contents to evict, so no replacement policy is needed.
· Two locations that map to the same entry may collide with each other.
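A minimal sketch (Python, with names of my own choosing) of how a direct-mapped cache splits a byte address into tag, index and block offset; two addresses with the same index field collide in the same entry:

def split_address(addr, n, m):
    # n index bits, m word-offset bits, 2 byte-offset bits (4-byte words assumed).
    offset_bits = m + 2
    block_offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << n) - 1)
    tag = addr >> (offset_bits + n)
    return tag, index, block_offset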
· Larger blocks exploit spatial locality to lower the miss rate.
· Increasing the block size normally decreases the miss rate.
· However, the miss rate goes back up if the block size becomes a significant fraction of the cache size, because the number of blocks that fit in the cache becomes small and there is a great deal of competition for them; as a result, a block may be bumped out of the cache before many of its words have been accessed.
· The cost of a miss also increases if we simply increase the block size, because the miss penalty is determined by the time required to fetch the block from the next lower level of the hierarchy and load it into the cache.
· The time to fetch a block has two parts:
1. Latency to the first word
2. Transfer time for the rest of the block
· This problem can be overcome by changing the memory system, i.e. designing the memory to transfer larger blocks more efficiently, so that the block size can be increased for a further improvement in cache performance.
Handling Cache Miss
· The CPU must detect a miss and process it by fetching the requested data from memory.
· Cache miss handling is done in collaboration with the processor control unit and with a separate controller that initiates the memory access and refills the cache.
· Processing a cache miss creates a pipeline stall, as opposed to an interrupt, which would require saving the state of all the registers.
· On a cache miss, the entire processor is stalled, essentially freezing the contents of the temporary and programmer-visible registers while waiting for memory.
· How cache misses are handled:
- Instruct the lower level in the memory hierarchy to perform a read to get the proper instruction into the cache.
- The address of the instruction that generated the instruction cache miss is equal to the value of the program counter minus 4 (because the program counter is incremented in the first clock cycle of execution).
- After memory responds, the word containing the desired instruction is written into the cache.
Steps taken on instruction cache miss :
1. Original
PC value is sent to the memory
2. Main
memory is instructed to perform read and is waited to complete the access
3. Cache
entry is written, data from memory is put in data portion of entry, upper bits
of address(from ALU) is wrote into tag field and turned valid bit on.
4. Instruction
execution at first step is restarted, instruction will refetch, this time is
finding it in cahce.
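A minimal sketch of those four steps as they might look in a toy cache simulator; the memory.read_block call, the entry layout and the reuse of the split_address helper from earlier are assumptions of mine, not the textbook's code:

def handle_instruction_miss(cache, memory, pc, n, m):
    addr = pc - 4                               # step 1: send the original PC value to memory
    block = memory.read_block(addr)             # step 2: main memory performs the read; we wait
    tag, index, _ = split_address(addr, n, m)
    cache[index] = {"valid": True,              # step 3: write the cache entry -
                    "tag": tag,                 #         tag from the upper address bits,
                    "data": block}              #         data from memory, valid bit turned on
    return addr                                 # step 4: restart the fetch, which now hits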
Handling Writes
· The simplest way to keep main memory and the cache consistent is through a scheme called write-through.
· In a write-through cache, every write to the cache also causes a write to main memory.
· After a block is fetched from memory and placed into the cache, the word that caused the miss can be overwritten in the cache block. The word is also written to main memory using the full address.
· Writing through to main memory takes a long time (for example, 100 processor clock cycles) and slows the processor down considerably.
· This problem can be solved by using a write buffer that stores the data while it is waiting to be written to memory.
· After writing the data into the cache and into the write buffer, the processor continues execution.
· When a write to main memory completes, the entry in the write buffer is freed.
· If the write buffer is full when the processor reaches a write, the processor must stall until there is an empty position in the write buffer.
· Even if the rate at which writes are generated is less than the rate at which memory can accept them, stalls may still occur; this happens when the writes occur in bursts.
· To reduce the chance of such stalls, the depth of the write buffer is usually increased beyond a single entry.
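A minimal sketch of a write-through cache with a small write buffer (the class name, the buffer depth of 4 and the stall counting are illustrative assumptions, not from the text):

from collections import deque

class WriteThroughCache:
    def __init__(self, buffer_depth=4):
        self.data = {}                             # address -> value (tags/indexing omitted)
        self.write_buffer = deque()                # writes waiting to reach main memory
        self.buffer_depth = buffer_depth
        self.stall_cycles = 0

    def write(self, addr, value):
        if len(self.write_buffer) == self.buffer_depth:
            self.stall_cycles += 1                 # buffer full: processor stalls until a slot frees
            self.write_buffer.popleft()            # model main memory accepting one buffered write
        self.data[addr] = value                    # write the cache ...
        self.write_buffer.append((addr, value))    # ... and queue the write for main memory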
· A write-back scheme improves performance when the processor generates writes as fast as or faster than main memory can handle them.
· The write-back scheme is an alternative to write-through.
· In a write-back cache, writes are not immediately mirrored to main memory. Instead, the cache tracks which locations have been written over (these locations are marked dirty). The data is written back to main memory only when the block is evicted from the cache. Because of this, a miss in a write-back cache sometimes requires two memory accesses to service: one to write the dirty location back to memory and one to read the new location from memory.
· Write-back is more complex to implement than write-through.
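A minimal sketch of a write-back store, where the dirty bit decides whether an evicted block must first be written back; the memory.read_block / memory.write_block calls, the entry fields and the reuse of split_address are assumptions of mine:

def write_back_store(cache, memory, addr, value, n, m):
    tag, index, _ = split_address(addr, n, m)
    entry = cache[index]
    if not entry["valid"] or entry["tag"] != tag:                    # write miss
        if entry["valid"] and entry["dirty"]:
            memory.write_block(entry["tag"], index, entry["data"])   # first access: write the dirty block back
        entry["data"] = memory.read_block(addr)                      # second access: fetch the new block (write allocate)
        entry["tag"], entry["valid"] = tag, True
    entry["data"] = value                     # update the cached copy only; main memory is not written
    entry["dirty"] = True                     # main memory stays stale until this block is evicted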
· Write allocate – a block is allocated in the cache: the block is fetched from memory and then the appropriate portion of the block is overwritten.
· No write allocate – the portion of the block in main memory is updated but the block is not put into the cache (useful, for example, when the operating system zeros a page of memory).
· A write-back cache cannot simply overwrite the block before knowing whether the access is a hit, so stores either take two cycles (a cycle to check for a hit followed by a cycle to perform the write) or need a write buffer to hold the data, effectively allowing the store to take only one cycle by pipelining it.
· In a write-through cache, by contrast, writes can always be done in one cycle.
Designing Memory System to Support Caches
· The processor is connected to memory over a bus.
· The clock rate of the bus is usually much slower than that of the processor.
· The speed of the bus affects the miss penalty.
· Eg:
Assume:
1 memory bus clock cycle to send the address
15 memory bus clock cycles for each DRAM access initiated
1 memory bus clock cycle to send a word of data
- For a cache block of 4 words and a one-word-wide memory, the miss penalty = 1 + 4 x 15 + 4 x 1 = 65 memory bus clock cycles
- Number of bytes transferred per bus clock cycle = (4 x 4 bytes) / 65 ≈ 0.25
· From the example above, there are three options for designing the memory system (a sketch comparing them follows this list):
1. Assume the memory is one word wide, and all accesses are made sequentially.
2. Increase the bandwidth to memory by widening the memory and the buses between the processor and memory.
3. Increase the bandwidth by widening the memory but not the interconnection bus.
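A minimal sketch comparing the miss penalties of the three organizations for a 4-word block, assuming (these widths are my assumptions, not stated above) a four-word-wide memory and bus for option 2 and four interleaved one-word-wide banks for option 3:

send_addr, dram_access, send_word = 1, 15, 1   # bus cycles, from the example above
block_words = 4

option1 = send_addr + block_words * dram_access + block_words * send_word  # 1 + 60 + 4 = 65
option2 = send_addr + dram_access + send_word                              # 1 + 15 + 1 = 17
option3 = send_addr + dram_access + block_words * send_word                # 1 + 15 + 4 = 20 (banks overlap their DRAM accesses)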
· The cache must have a block size larger than one word for the extra bandwidth to matter.
· The use of larger blocks decreases the miss rate and improves the efficiency of the cache by reducing the amount of tag storage relative to the amount of data storage in the cache.
· However, a larger block size also means a higher miss penalty, which can lead to lower performance even with the lower miss rate.
· The bandwidth of main memory is therefore increased so that cache blocks can be transferred more efficiently.
· Common methods for increasing bandwidth external to the DRAM are making the memory wider and interleaving.
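As a small illustration of interleaving (the four-bank layout is an assumption for the example): consecutive word addresses fall in consecutive banks, so the DRAM accesses for one block can proceed in parallel while the words are sent one at a time over the bus.

def bank_of(word_address, num_banks=4):
    # With interleaving, consecutive words map to consecutive banks.
    return word_address % num_banks

# [bank_of(w) for w in range(8)] -> [0, 1, 2, 3, 0, 1, 2, 3]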
Measuring and Improving Cache Performance
· CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles it spends waiting for the memory system.
· Cache hits are part of the normal CPU execution cycles.
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x Clock cycle time
· Assume that memory-stall clock cycles primarily come from cache misses.
· Memory-stall clock cycles can be defined as the sum of read stalls and write stalls:
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
· It is difficult to give a simple equation for the write-stall cycles, because write buffer stalls depend on the proximity of writes, not just their frequency.
· With a reasonable write buffer depth in the system and a memory capable of accepting writes faster than they are generated on average, write buffer stalls are small and can be ignored.
· For a write-through cache, assuming write buffer stalls are negligible, read and write misses have the same penalty, so the stalls can be combined as:
Memory-stall clock cycles = (Memory accesses / Program) x Miss rate x Miss penalty
Assume the miss rate of an instruction cache is 2% and the miss rate of the data cache is 4%. If a processor has a CPI of 2 without any memory stalls and the miss penalty is 100 cycles for all misses, determine how much faster a processor would run with a perfect cache that never missed. Assume the frequency of all loads and stores is 36%.
Instruction miss cycles = I x 2% x 100 = 2.00 x I
Data miss cycles = I x 36% x 4% x 100 = 1.44 x I
Total memory-stall cycles = instruction miss cycles + data miss cycles = 2.00 I + 1.44 I = 3.44 I
Total CPI = 2 + 3.44 = 5.44
Since there is no change in instruction count or clock rate, the ratio of CPU execution times is 5.44 / 2 = 2.72, so the processor with a perfect cache would be 2.72 times faster.
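The arithmetic of the example above, written as a short Python check (the variable names are mine):

i_miss_rate = 0.02        # instruction cache miss rate
d_miss_rate = 0.04        # data cache miss rate
miss_penalty = 100        # cycles per miss
base_cpi = 2.0            # CPI with a perfect cache
ls_freq = 0.36            # fraction of instructions that are loads or stores

stall_cpi = i_miss_rate * miss_penalty + ls_freq * d_miss_rate * miss_penalty   # 2.00 + 1.44 = 3.44
total_cpi = base_cpi + stall_cpi                                                # 5.44
speedup_with_perfect_cache = total_cpi / base_cpi                               # 2.72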
· Increasing the clock rate without changing the memory system will increase the performance lost to cache misses.
· If the hit time increases, the total time to access a word increases, which can increase the processor cycle time.
· Increasing the cache size beyond a certain size makes it take longer to access, which eventually increases the hit time.
· Average memory access time (AMAT) is a way to examine alternative cache designs:
AMAT = Time for a hit + Miss rate x Miss penalty
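For example (hypothetical numbers, not from the text): with a 1-cycle hit time, a 5% miss rate and a 20-cycle miss penalty, AMAT = 1 + 0.05 x 20 = 2 clock cycles.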
Reducing Cache Misses
· Other than direct mapped, there is another extreme scheme called fully associative.
· In a fully associative cache, a block in memory may be placed in any entry of the cache.
· The middle range of designs between direct mapped and fully associative is called set associative.
· A set-associative cache has a fixed number of locations (a set) where each block can be placed.
· Set-associative placement in fact covers both extremes: a direct-mapped cache is simply a one-way set-associative cache, and a fully associative cache with m entries is an m-way set-associative cache.
Locating a Block in Cache
· Tag – used to choose the block by comparison with the tags of the blocks in the selected set.
· Index – used to select the set.
· Block offset – the address of the desired data within the block.
· All the tags in the selected set are searched in parallel.
· If the total cache size is kept constant, increasing the associativity increases the number of blocks per set. Each factor-of-2 increase in associativity doubles the number of blocks per set and halves the number of sets.
· The most commonly used scheme for choosing which block to replace in the requested position is least recently used (LRU).
· LRU – replace the block that has been unused for the longest time.
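A minimal sketch of a set-associative lookup with LRU replacement, using Python's OrderedDict so the least recently used tag sits at the front of each set (the class and method names are illustrative, not from the text):

from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, num_sets, ways):
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]   # one tag -> block map per set

    def access(self, index, tag):
        s = self.sets[index]
        if tag in s:                   # every tag in the set is compared (in parallel in hardware)
            s.move_to_end(tag)         # mark as most recently used
            return "hit"
        if len(s) == self.ways:
            s.popitem(last=False)      # evict the least recently used block in this set
        s[tag] = None                  # placeholder for the newly fetched block
        return "miss"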
Reducing Miss Penalty Using Multilevel Caches
· An additional level of cache closes the gap further between the fast clock rates of modern processors and the long time required to access DRAMs.
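As a hedged illustration with made-up numbers: if an L1 hit takes 1 cycle, the L1 miss rate is 5%, an L2 hit takes 10 cycles, 20% of L1 misses also miss in L2, and main memory costs 100 cycles, then AMAT = 1 + 0.05 x (10 + 0.20 x 100) = 2.5 cycles, versus 1 + 0.05 x 100 = 6 cycles with the L1 cache alone.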
· Cache built into the CPU itself is Level 1 (L1) cache.
· Cache that resides on a separate chip next to the CPU is Level 2 (L2) cache.
· Some CPUs have both the L1 and L2 cache built in and designate the separate cache chip as Level 3 (L3) cache.
· Cache that is built into the CPU is faster than a separate cache, since it runs at the speed of the microprocessor itself. However, a separate cache is still roughly twice as fast as random access memory (RAM).
· A disk cache contains the most recently read data from the hard disk. This cache is slower than RAM.
Level 1 Cache
· This cache is integrated right into the processor.
· The main motivation behind this concept is "locality of reference": a location just accessed by the CPU has a higher probability of being accessed again in the short term.
· The L1 cache holds the most recently used data, so when the data is needed again, the microprocessor checks this cache first and does not need to go out to main memory.
· This is usually about twice as fast as going to main memory.
Level 2 Cache
· A secondary cache, usually located on a memory card close to the processor.
· It links directly to the CPU and is controlled by a circuit integrated into the motherboard, called the L2 controller.
· The L2 cache stores recently used data that is not in the L1 cache.
· This enables the processor to get about 95% of the information it needs from cache memory.
Reference
Computer Organization and Design, by David A. Patterson and John L. Hennessy
Written by,
QUEK XIN YI
B031210203