Friday, 2 November 2012

Memory Organization - Cache Memory



Cache Memory
·        Cache is a small random access memory (RAM) used by the central processing unit (CPU) to reduce the average time to access memory.
·        Cache memory stores instructions that are repeatedly required to run programs, improving overall system speed, since cache memory is designed to accelerate memory access.
·        The cache is a smaller, faster memory that stores copies of the data from the most frequently used main memory locations.
·        The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer, so data transfer is much faster because the bottleneck created by the system bus is avoided.
·        When the computer reads from or writes to a location in main memory, it first checks whether a copy of that data is in the cache. If so, the processor immediately reads from or writes to the cache, which is much faster than reading from or writing to main memory.
·        If the item is in the cache, it is called a "cache hit"; otherwise it is a "cache miss".
·        A tag is added to each cache entry to determine whether the requested word is in the cache or not.
·        Tag – the part of the address of the actual data fetched from main memory.
·        Data block (cache line) – holds the actual data fetched from main memory.
·        Flag bits (see the sketch after this list):
1.     Valid bit – shows whether the cache block has been loaded with valid data.
2.     Dirty bit – shows whether the block has been changed (written to) since it was read from main memory.
·         The size of a cache refers to the amount of main memory data it can hold.
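As a rough sketch of these fields (the names and the 4-word block size below are my own illustration, not fixed by any particular machine), one cache entry can be pictured like this:

from dataclasses import dataclass, field

@dataclass
class CacheLine:
    valid: bool = False                 # has this entry been loaded with a block?
    dirty: bool = False                 # has the block been written since it was fetched?
    tag: int = 0                        # upper address bits of the block held here
    data: list = field(default_factory=lambda: [0] * 4)   # the data block (4 words here)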

Accessing Cache
·        The total number of bits needed for a cache is a function of cache size and address size.
·        This is because the cache includes storage for both the data and the tags.
·        Eg :
32-bit byte address
Direct-mapped cache
Cache size = 2^n blocks, so n bits are used for the index
Block size = 2^m words, so m bits are used for the word within the block, and 2 bits are used for the byte part of the address
          Size of tag field :
                   32 - (n + m + 2)
          Total number of bits in the direct-mapped cache :
                   2^n x (block size + tag size + valid field size)
          Since the block size is 2^m words (2^m x 32 bits) and 1 valid bit is needed, the total number of bits in this cache is
          2^n x (2^m x 32 + (32 - n - m - 2) + 1) = 2^n x (2^m x 32 + 31 - n - m)

·         Eg 2 :
How many total bits are required for a direct-mapped cache with 16 KB of data and 4-word blocks, assuming a 32-bit address?

16 KB = 4K words (2^12 words)
Block size = 4 words (2^2)
So, there are 1024 blocks (2^10)
Tag = 32 - 10 - 2 - 2 = 18 bits
Each block has 4 x 32 = 128 bits of data, plus the tag, plus the valid bit.

TOTAL cache size :
2^10 x (4 x 32 + (32 - 10 - 2 - 2) + 1)
= 2^10 x 147
= 147 Kbits, or about 18.4 KB, for 16 KB of data
Therefore the total number of bits in the cache is about 1.15 times as many as needed for data storage alone.
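The same arithmetic can be checked with a short Python sketch (my own illustration; it assumes a 32-bit byte address, 32-bit words, and one valid bit per block, as in the example):

def total_cache_bits(data_bytes, words_per_block, address_bits=32):
    words = data_bytes // 4                               # number of 32-bit data words
    blocks = words // words_per_block                     # 2^n blocks
    index_bits = blocks.bit_length() - 1                  # n
    word_offset_bits = words_per_block.bit_length() - 1   # m
    tag_bits = address_bits - index_bits - word_offset_bits - 2
    bits_per_block = words_per_block * 32 + tag_bits + 1  # data + tag + valid
    return blocks * bits_per_block

print(total_cache_bits(16 * 1024, 4))   # 150528 bits = 2^10 x 147 = 147 Kbits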

Direct-mapped cache

·        Each location in main memory can go in one entry in the cache.
·        There is no choice of which cache entry’s contents to evict, so no replacement policy.
·        Two locations may collide if they map to the same cache entry.
·        Larger blocks exploit spatial locality to lower miss rate.
·        Increasing block size normally decreases miss rate.
·        The miss rate increases if the block size becomes a significant fraction of the cache size, because there will be a great deal of competition for the small number of blocks; as a result, a block may be bumped out of the cache before many of its words are accessed.
·        The cost of a miss also increases if we simply increase the block size, because the miss penalty is determined by the time required to fetch the block from the next lower level of the hierarchy and load it into the cache.
·        Two parts for time to fetch block:
1.     Latency to first word
2.     Transfer time for the rest of the block.
·         This problem can be overcome by changing the memory system, or by designing the memory to transfer larger blocks more efficiently, so that the block size can be increased and a further improvement in cache performance obtained.

Handling Cache Miss
·        The CPU must detect a miss and process it by fetching the requested data from memory.
·        Cache miss handling is done in collaboration with the processor control unit and with a separate controller that initiates the memory access and refills the cache.
·        While a cache miss is being processed, a pipeline stall is created, as opposed to an interrupt, which would require saving the state of all registers.
·        For a cache miss, the entire processor is stalled, essentially freezing the contents of the temporary and programmer-visible registers while waiting for memory.
·        How are cache misses handled?
-         Instruct the lower level in the memory hierarchy to perform a read to get the proper instruction into the cache.
-         The address of the instruction that generates an instruction cache miss is equal to the value of the program counter minus 4 (because the program counter is incremented in the first clock cycle of execution).
-         After the memory responds, the word containing the desired instruction is written into the cache.
-         Steps taken on an instruction cache miss (a software sketch follows this list):
1.     The original PC value is sent to the memory.
2.     Main memory is instructed to perform a read, and we wait for it to complete the access.
3.     The cache entry is written: the data from memory is put in the data portion of the entry, the upper bits of the address (from the ALU) are written into the tag field, and the valid bit is turned on.
4.     Instruction execution is restarted at the first step, which refetches the instruction, this time finding it in the cache.
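The same sequence can be sketched in software. The small simulation below is my own illustration (one-word blocks, a 256-entry direct-mapped cache, and a word-addressed list standing in for main memory), not the book's hardware:

NUM_ENTRIES = 256                                    # illustrative cache size

memory = list(range(4096))                           # fake main memory, one word per word address
cache = [{"valid": False, "tag": 0, "data": 0} for _ in range(NUM_ENTRIES)]

def read_word(byte_address):
    """Return the word at a byte address, refilling the cache entry on a miss."""
    word_addr = byte_address >> 2                    # drop the 2 byte-offset bits
    index = word_addr % NUM_ENTRIES                  # entry this address maps to
    tag = word_addr // NUM_ENTRIES                   # upper bits kept for comparison
    line = cache[index]
    if line["valid"] and line["tag"] == tag:
        return line["data"]                          # cache hit
    # cache miss: the processor would stall here while memory performs the read
    line["data"] = memory[word_addr]                 # put the data in the entry
    line["tag"] = tag                                # write the tag field
    line["valid"] = True                             # turn the valid bit on
    return line["data"]

print(read_word(0x40))   # first access misses and fills the cache
print(read_word(0x40))   # second access hits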

Handling Writes
·        The simplest way to keep main memory and the cache consistent is through a scheme named write-through.
·        In a write-through cache, every write to the cache also causes a write to main memory.
·        After the block is fetched from memory and placed into the cache, the word that caused the miss can be overwritten in the cache block. The word is also written to main memory using the full address.
·        Write-through takes a long time (on the order of 100 processor clock cycles per write) and can slow the processor down considerably.
·        This problem can be solved by using a write buffer to store the data while it is waiting to be written to memory.
·        After writing the data into the cache and the write buffer, the processor continues execution.
·        When the write to main memory completes, the entry in the write buffer is freed.
·        If the write buffer is full when the processor reaches a write, the processor must stall until there is an empty position in the write buffer.
·        Even if the rate at which writes are generated is less than the rate at which memory can accept them, stalls may still occur when writes happen in bursts.
·        To reduce the chance of such stalls, the depth of the write buffer is often increased beyond a single entry.
·        A write-back scheme improves performance when the processor generates writes as fast as or faster than they can be handled by main memory.
·        Write-back is an alternative to write-through.
·        In write-back, writes are not immediately mirrored to main memory. Instead, the cache tracks which locations have been written over (these locations are marked dirty).
The data is written back to main memory only when the block is evicted from the cache. For this reason, a miss in a write-back cache sometimes requires two memory accesses to service: one to write the dirty location back to memory, and one to read the new location from memory.
·        Write-back is more complex to implement than write-through.
·        Write allocate – a block in the cache is assigned for the write: the block is fetched from memory and the appropriate portion of the block is overwritten.
·        No write allocate – the portion of the block in memory is updated, but the block is not put in the cache (useful, for example, when the operating system zeros a page of memory).
·        A write-back cache cannot simply overwrite a block on a store, so stores either need two cycles (one to check for a hit, one to perform the write) or need a write buffer to hold the data, effectively allowing the store to take only one cycle by pipelining it.
·        In a write-through cache, writes can always be done in one cycle (a small sketch of both policies follows this list).
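The contrast between the two policies can be sketched as follows (a simplified illustration that only handles write hits; write misses and the write buffer are left out). It reuses dictionary-style cache entries like the earlier sketch, with an extra dirty field:

def store_write_through(cache, memory, word_addr, value):
    index = word_addr % len(cache)
    cache[index]["data"] = value          # update the cache ...
    memory[word_addr] = value             # ... and main memory on every write

def store_write_back(cache, memory, word_addr, value):
    index = word_addr % len(cache)
    cache[index]["data"] = value
    cache[index]["dirty"] = True          # main memory is updated only when this block
                                          # is later evicted and found dirty

def evict_write_back(cache, memory, index, word_addr):
    # in real hardware the address would be rebuilt from the tag and index
    if cache[index]["dirty"]:             # dirty block: write it back before replacing it
        memory[word_addr] = cache[index]["data"]
        cache[index]["dirty"] = False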

Designing memory system to support caches
·        The processor is connected to memory over a bus.
·        The clock rate of the bus is usually much slower than that of the processor.
·        The speed of the bus affects the miss penalty.
·         Eg :
Assume :
1 memory bus clock cycle to send the address
15 memory bus clock cycles for each DRAM access initiated
1 memory bus clock cycle to send a word of data
-         Miss penalty (4-word block, one-word-wide memory) = 1 + 4 x 15 + 4 x 1 = 65 memory bus clock cycles
-         Number of bytes transferred per bus clock cycle = (4 x 4) / 65 ≈ 0.25
·          From the example above, there are three options for designing the memory system (compared in the sketch below):
1.     Memory is one word wide, and all accesses are made sequentially.
2.     Bandwidth to memory is increased by widening the memory and the buses between the processor and memory.
3.     Bandwidth is increased by widening the memory but not the interconnection bus.
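Using the timing assumptions from the example, a rough comparison of the three organizations for a 4-word block looks like this (the wide-memory and interleaved figures are my own estimates under those same assumptions):

ADDR, DRAM, XFER, WORDS = 1, 15, 1, 4     # cycles: send address, per DRAM access, per word on the bus

one_word_wide = ADDR + WORDS * DRAM + WORDS * XFER   # 1 + 60 + 4 = 65 cycles
four_word_wide = ADDR + DRAM + XFER                  # whole block read and sent at once: 17 cycles
interleaved_4_banks = ADDR + DRAM + WORDS * XFER     # banks read in parallel, words sent one by one: 20 cycles

for name, cycles in [("one-word-wide", one_word_wide),
                     ("four-word-wide", four_word_wide),
                     ("4-bank interleaved", interleaved_4_banks)]:
    print(f"{name}: {cycles} cycles, {WORDS * 4 / cycles:.2f} bytes per bus clock cycle")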

·        For these schemes to pay off, the cache must have a block larger than one word.
·        The use of larger blocks decreases the miss rate and improves the efficiency of the cache by reducing the amount of tag storage relative to the amount of data storage in the cache.
·        A larger block size lowers the miss rate but raises the miss penalty, which can lead to lower performance if the penalty grows faster than the miss rate falls.
·        The bandwidth of main memory is increased so that cache blocks can be transferred more efficiently.
·        Common methods for increasing bandwidth external to the DRAM are making the memory wider and interleaving it.

Measuring and Improving Cache Performance
·        CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles it spends waiting for the memory system.
·        Cache hits are part of normal CPU execution cycle.

CPU time = (CPU execution clock cycles + memory-stall clock cycles) x clock cycle time

·        Assume that memory-stall clock cycles come primarily from cache misses.
·        Memory-stall clock cycles can be defined as: (for write-back cache)

Memory-stall clock cycles = Read-stall cycles + write-stall cycles
·         It is not possible to give a simple equation for write-stall cycles, because write buffer stalls depend on the proximity of writes, not just their frequency.
·         With a reasonable write buffer depth and a memory capable of accepting writes faster than the average write frequency, write buffer stalls are small and can be ignored.
·        For a write-through cache (assuming write buffer stalls are negligible), reads and writes have the same miss penalty, so the stalls can be combined as:

Memory-stall clock cycles = (Memory accesses / Program) x Miss rate x Miss penalty

·        Example of calculating cache performance:
Assume the miss rate of an instruction cache is 2% and the miss rate of the data cache is 4%. If a processor has a CPI of 2 without any memory stalls, and the miss penalty is 100 cycles for all misses, determine how much faster the processor would run with a perfect cache that never missed.
Assume the frequency of all loads and stores is 36% of instructions.

Instruction miss cycles = I x 2% x 100 = 2.00 x I

Data miss cycles = I x 36% x 4% x 100 = 1.44 x I

Total memory-stall cycles = instruction miss cycles + data miss cycles
                          = 2.00 I + 1.44 I = 3.44 I

total CPI = 2 + 3.44 = 5.44

Since there is no change in instruction count or clock rate, the ratio of CPU execution times is:

CPU time with stalls / CPU time with perfect cache = 5.44 / 2 = 2.72

The processor with the perfect cache is therefore 2.72 times faster.
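The same calculation can be checked with a few lines of code, using exactly the assumptions stated in the example:

base_cpi = 2.0                    # CPI without memory stalls
miss_penalty = 100                # cycles per miss
instr_miss_rate = 0.02            # instruction cache miss rate
data_miss_rate = 0.04             # data cache miss rate
mem_ops = 0.36                    # loads and stores per instruction

instr_miss_cycles = instr_miss_rate * miss_penalty           # 2.00 per instruction
data_miss_cycles = mem_ops * data_miss_rate * miss_penalty   # 1.44 per instruction

total_cpi = base_cpi + instr_miss_cycles + data_miss_cycles  # 2 + 3.44 = 5.44
print(total_cpi, total_cpi / base_cpi)                       # 5.44, and 2.72 times slower than a perfect cache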

·         Increasing the clock rate without changing the memory system increases the performance lost to cache misses, since the same miss penalty in time costs more clock cycles.
·        If the hit time increases, the total time to access a word increases, possibly increasing the processor cycle time.
·        Increasing the cache size beyond a certain point makes the cache take longer to access, eventually increasing the hit time.
·        Average memory access time (AMAT) is a way to examine alternative cache designs.
                                                         
AMAT = Time for a hit + miss rate x miss penalty
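For example, with an assumed 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty (made-up numbers for illustration), AMAT = 1 + 0.05 x 20 = 2 clock cycles.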

Reducing Cache Misses
·        Other than direct-mapped, there is another extreme scheme called fully associative.
·        In a fully associative cache, a block in memory may be placed in any entry of the cache.
·        The middle range of designs between direct-mapped and fully associative is set associative.
·        A set-associative cache has a fixed number of locations (a set) where each block can be placed.
·        Direct-mapped and fully associative placement are really special cases of set-associative placement: a direct-mapped cache is a one-way set-associative cache, and a fully associative cache with m entries is an m-way set-associative cache.

Locating a Block in Cache  

·        Tag = used to choose the block, by comparison with the tags of the blocks in the selected set
·        Index = used to select the set
·        Block offset = the address of the desired data within the block
·        All tags in the selected set are searched in parallel.
·        If the total cache size is kept constant, increasing the associativity increases the number of blocks per set; each factor-of-two increase in associativity doubles the number of blocks per set and halves the number of sets.
·        The most commonly used scheme for choosing which block to replace is least recently used (LRU); a small sketch follows this list.
·        LRU – replace the block that has not been used for the longest time.
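A small sketch of a set-associative lookup with LRU replacement (my own illustration: 64 sets, 2 ways, and each set kept as an ordered list with the most recently used block last):

SETS, WAYS = 64, 2

cache = [[] for _ in range(SETS)]        # each way holds {"tag": ..., "data": ...}

def access(block_addr, block_from_memory):
    index = block_addr % SETS            # the index selects the set
    tag = block_addr // SETS             # the tag identifies the block within the set
    ways = cache[index]
    for way in ways:                     # all tags in the set are searched (in parallel in hardware)
        if way["tag"] == tag:
            ways.remove(way)             # hit: move to the most-recently-used position
            ways.append(way)
            return way["data"]
    if len(ways) == WAYS:                # miss with a full set: evict the least recently used block
        ways.pop(0)
    ways.append({"tag": tag, "data": block_from_memory})
    return block_from_memory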

Reducing Miss Penalty Using Multilevel Caches
·        An additional level of cache closes the gap further between the fast clock rate of a modern processor and the relatively long time required to access DRAM (see the sketch at the end of this list).
·        Cache built into the CPU itself is Level 1 (L1) cache.
·        Cache that resides on a separate chip next to the CPU is Level 2 (L2) cache.
·        Some CPUs have both L1 and L2 cache  built-in and designate the separate cache chip as Level 3 (L3) cache.
·        Cache that is built into the CPU is faster than a separate cache, since it runs at the speed of the microprocessor itself. However, a separate cache is still roughly twice as fast as main memory (RAM).
·        A disk cache contains the most recently read data from the hard disk. This cache is slower than RAM.
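A hedged sketch of why a second level of cache helps: with made-up figures (1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 access, 20% L2 local miss rate, 100-cycle DRAM access), the average memory access time falls noticeably:

l1_hit, l1_miss_rate = 1, 0.05
l2_hit, l2_miss_rate = 10, 0.20          # L2 access time and its local miss rate
memory_penalty = 100                     # cycles to reach DRAM

amat_l1_only = l1_hit + l1_miss_rate * memory_penalty                            # 1 + 5   = 6.0 cycles
amat_l1_l2 = l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * memory_penalty)    # 1 + 1.5 = 2.5 cycles
print(amat_l1_only, amat_l1_l2)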

Level 1 Cache

·        This cache is integrated right into the processor.

·        The main motivation behind this concept is "locality of reference": a location just accessed by the CPU has a higher probability of being visited again in the short term.
·        The L1 cache holds the most recent data, so when the data is needed again, the microprocessor checks this cache first and does not need to go to main memory.
·        This access is usually about two times faster than going to main memory.

Level 2 Cache

·        A secondary cache, usually located on a memory card close to the processor.
·        It links directly to the CPU, and a circuit integrated into the motherboard, called the L2 controller, controls it.
·        The L2 cache stores recently used data that is not in the L1 cache.
·        It enables the processor to get about 95% of the information it needs from cache memory.


Reference
Computer Organization and Design, by David A. Patterson and John L. Hennessy


Written by,
QUEK XIN YI
B031210203






