[20/20/20/20] <5.3> The performance of a snooping cache-coherent multiprocessor depends on many detailed implementation issues that determine how quickly a cache responds with data in an exclusive or M state block. In some implementations, a CPU read miss to a cache block that is exclusive in another processor's cache is faster than a miss to a block in memory. This is because caches are smaller, and thus faster, than main memory.

[Figure 5.35 Multicore (point-to-point) multiprocessor: the initial coherency state, address tag, and data for blocks B0–B3 of the caches of P0, P1, and P3, plus the memory contents, connected by an on-chip interconnect with a coherency manager.]

414 ■ Chapter Five Thread-Level Parallelism

Conversely, in some implementations, misses satisfied by memory are faster than those satisfied by caches. This is because caches are generally optimized for "front side" or CPU references, rather than "back side" or snooping accesses. For the multiprocessor illustrated in Figure 5.35, consider the execution of a sequence of operations on a single CPU where

■ CPU read and write hits generate no stall cycles.
■ CPU read and write misses generate Nmemory and Ncache stall cycles if satisfied by memory and cache, respectively.
■ CPU write hits that generate an invalidate incur Ninvalidate stall cycles.
■ A write-back of a block, due to either a conflict or another processor's request to an exclusive block, incurs an additional Nwriteback stall cycles.

Consider two implementations with different performance characteristics summarized in Figure 5.36. Consider the following sequence of operations assuming the initial cache state in Figure 5.35.
For simplicity, assume that the second operation begins after the first completes (even though they are on different processors):

P1: read 110
P3: read 110

For Implementation 1, the first read generates 50 stall cycles because the read is satisfied by P0's cache. P1 stalls for 40 cycles while it waits for the block, and P0 stalls for 10 cycles while it writes the block back to memory in response to P1's request. Thus, the second read by P3 generates 100 stall cycles because its miss is satisfied by memory, and this sequence generates a total of 150 stall cycles. For the following sequences of operations, how many stall cycles are generated by each implementation?
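The stall accounting in the worked example above can be sketched in code. This is a minimal model, not the book's full protocol: the Implementation 1 parameters are inferred from the example itself (a miss satisfied by another cache costs 40 cycles, the accompanying write-back costs 10 cycles charged to the supplier, and a miss satisfied by memory costs 100 cycles); the full Figure 5.36 values for both implementations are not reproduced here, and the `owner` bookkeeping is a simplifying assumption covering only read misses to clean or modified blocks.

```python
# Stall-cycle parameters for Implementation 1, inferred from the
# worked example in the text (assumed, not the full Figure 5.36 table).
N_MEMORY = 100     # stall cycles for a miss satisfied by memory
N_CACHE = 40       # stall cycles for a miss satisfied by another cache
N_WRITEBACK = 10   # extra stall cycles to write the supplied block back

def sequence_stalls(ops, modified_blocks):
    """Count total stall cycles for a list of (cpu, op, addr) read misses.

    modified_blocks maps an address to the CPU holding that block in the
    modified/exclusive state. When that cache supplies the block it also
    writes it back, so later misses to the same block go to memory.
    """
    total = 0
    owner = dict(modified_blocks)
    for cpu, op, addr in ops:
        if addr in owner and owner[addr] != cpu:
            # Miss satisfied by the owning cache; the owner stalls for
            # the write-back, the requester stalls for the transfer.
            total += N_CACHE + N_WRITEBACK
            del owner[addr]          # block is now clean in memory
        else:
            total += N_MEMORY        # miss satisfied by memory
    return total

# The sequence from the text: block 110 starts modified in P0's cache.
ops = [("P1", "read", 0x110), ("P3", "read", 0x110)]
print(sequence_stalls(ops, {0x110: "P0"}))  # → 150
```

The first miss contributes 40 + 10 = 50 cycles and leaves block 110 clean in memory, so the second miss contributes 100 cycles, matching the 150-cycle total in the example.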
