Monday, December 14, 2009

Cache

Fact
The VA-to-PA mapping is aligned to the page size, so any VA/PA pair has the same page offset; e.g. for a 4KB page size, bits 0-11 of the two addresses are identical.


Assumption
A 32KB, 4-way set-associative cache with 32B cache lines

  • 8KB per way = 2^13
  • 32B cache line = 2^5
  • Page size 4KB = 2^12

[wiki] Harvard architecture
http://en.wikipedia.org/wiki/Harvard_architecture

[wiki] Cache
http://en.wikipedia.org/wiki/Cache

[wiki] Cache coherence
http://en.wikipedia.org/wiki/Cache_coherency

[wiki] CPU cache
http://en.wikipedia.org/wiki/CPU_cache

A short note on cache aliases
http://www.cash.idv.tw/wordpress/?p=2273

Understanding Caching
http://www.linuxjournal.com/article/7105
kmalloc never allocates two regions that overlap in a cache line

(...........)
  • PIPT

    • take virt_addr, translate it to phy_addr, then look up the cache to get the data
    • slow; address translation and cache lookup have to happen sequentially
    • no aliasing
  • VIVT

    • take virt_addr, look up the cache directly to get the data
    • fast; no address translation is needed
    • aliasing; the cache cannot tell when multiple virt_addrs map to the same phy_addr
  • VIPT

    • take virt_addr, then do the (cache lookup) and the (address translation) at the same time
    • faster than PIPT; (cache lookup) and (address translation) can proceed in parallel
    • CAN be alias-free; an alias can be detected by seeing the same tag (phy_addr) at multiple indices (virt_addr), IF the hardware implements this check
    • larger tag; the tag has to hold the phy_addr
(...........)

The Aliasing Problem

Any time the kernel sets up more than one virtual mapping for the same physical page, cache line aliasing may occur. The kernel is careful to avoid aliasing, so it usually occurs only in one particular instance: when the user mmaps a file. Here, the kernel has one virtual address for pages of the file in the page cache, and the user may have one or more different virtual addresses. This is possible because nothing prevents the user from mmaping the file at multiple locations.

When a file is mmaped, the kernel adds the mapping to one of the inode's lists: i_mmap, for maps that cannot change the underlying data, or i_mmap_shared, for maps that can change the file's data. The API for bringing the cache aliases of a page into sync is:

void flush_dcache_page(struct page *page);

This API must be called every time data on the page is altered by the kernel, and it should be called before reading data from the page if page->mapping->i_mmap_shared is not empty. In architecture-specific code, flush_dcache_page loops over the i_mmap_shared list and flushes the cache data. It then loops over the i_mmap list and invalidates it, thus bringing all the aliases into sync.

Separate Instruction and Data Caches

In their quest for efficiency, processors often have separate caches for the instructions they execute and the data on which they operate. Often, these caches are separate mechanisms, and a data write may not be seen by the instruction cache. This causes problems if you are trying to execute instructions you just wrote into memory, for example, during module loading or when using a trampoline(?). You must use the following API:

void flush_icache_range(unsigned long start, unsigned long end);

to ensure that the instructions are seen by the instruction cache prior to execution. start and end are the starting and ending addresses, respectively, of the block of memory you modified to contain your instructions.

(...........)

(MY COMMENT)
On SMP systems there is USUALLY some hardware that keeps the L1 caches of all CPUs in sync, or otherwise maintains cache coherency, so that an invalidate/flush/clean on one CPU is reflected in the caches of all CPUs. On SOME systems where this is not the case (like the ARM11 MPCore), the operation has to be performed on EVERY CPU, via mechanisms like an IPI (Inter-Processor Interrupt), to ensure all CPUs are updated.

For more details on the MPCore, refer to:
ARM11 MPCore DMA DCache issue with Linux SMP
http://mkl-note.blogspot.com/2009/12/linux-arm11-mpcore-smp-cache-issue.html


The MIPS Cache Architecture
http://people.openrays.org/~comcat/mydoc/mips.cache.arch.pdf
(In Simplified Chinese; the author is also a kernel developer, so it is especially readable: the explanations are very clear and there are many concrete examples)
Writing data
  • Cache Hit
    • Write Through
      • write to cache and RAM
    • Write Back
      • write to cache only
  • Cache Miss
    • Write Allocate
      • allocate cache, read data from RAM, write data to cache
    • No Write Allocate
      • write to next level Cache/Memory directly
Read
  • Cache Hit
    • send data to CPU
  • Cache Miss
    • Read Allocate
      • allocate cache, read data from RAM, send data to CPU
    • No Read Allocate
      • read data from RAM and send data to CPU directly
(.............)

5. Cache Aliases Issue
For VIPT, Cache Alias Issue exists if WAY_SIZE > PAGE_SIZE.
Page address is always page-size-aligned, no matter VA or PA.
WAY_SIZE (8K, 2^13) = CACHE_SIZE (32K) / WAY (4)
PAGE_SIZE (4K, 2^12)


INDEX_SIZE=log2(WAY_SIZE) (13)
VA lower INDEX_SIZE bits are used as index.

color bit=log2(WAY_SIZE)-log2(PAGE_SIZE) = 13-12 = 1
if 2 VA could map to the same PA:
  • if they are of the same color, VIPT are all the same, so both will be in the same cache slot. No problem.
  • if they are of different color, they will be located in 2 different cache slot, which is the Cache Alias Issue

Solution
  • remove the color bit: make PAGE_SIZE = WAY_SIZE
  • use only VAs of the same color to map a given PA
    • somewhat complicated
  • flush the cache whenever a cache alias might exist

Focus (mostly occurs when...)
  • copy_to_user_page/copy_from_user_page
  • fork
    • during fork, the child process copies some of the parent's pages via COW (copy-on-write). While copying, the VA the child uses is usually different from the parent's VA, which introduces a cache alias.







ARM11 MPCore Processor, Technical Reference Manual, Revision: r2p0
Each MP11 CPU features:
‧ an integer unit with integral EmbeddedICE-RT logic
‧ an 8-stage pipeline
‧ branch prediction with return stack
‧ coprocessors 14 and 15
‧ Instruction and Data Memory Management Units (MMUs), managed using MicroTLB structures backed by a unified main TLB

(32KB, 4-way set-associative, for each of the I and D caches)

‧ Instruction and data caches, including a non-blocking data cache with Hit-Under-Miss (HUM)
‧ a data cache that is physically indexed, physically tagged (PIPT)
‧ a data cache that is write back, write allocate only
‧ an instruction cache that is virtually indexed, physically tagged (VIPT)
‧ 32-bit interface to the instruction cache and 64-bit interface to the data cache
‧ hardware support for data cache coherency (among CPUs, but not with DMA)
‧ Vector Floating-Point (VFP) coprocessor support
‧ JTAG-based debug.



write-allocate

CSE 141 Ungraded Homework #5 Answer Sheet
http://cseweb.ucsd.edu/~carter/141/hw5ans.html
NOTE FROM GREG: Last week in section we discussed how caches deal with stores. Specifically, we looked at caches that are write-back vs. write-through, and write-allocate vs. write around. I believe I may have oversimplified things and would like to provide some clarification. It is true that write-back vs. write-through deals with what happens when you write to the cache and find the data present in the cache. However, some students may have been a little confused in reading the solutions to HW #5 when we actually set the dirty bit on a cache miss. The natural question is, "Hey! I thought we only worry about write-back vs. write-through when we have a cache HIT." The potential tricky point here is what happens if your cache is write-back and write-allocate? In this situation, suppose you have a cache miss. The write-allocate policy of the cache will load the data in question into the cache, and the write-back policy will cause only the cache copy to be modified, also turning on the corresponding dirty bit. On the other hand, if your cache is write-through and write allocate, the same thing will happen, but then both the cache copy and the memory copy will be modified.

Secondly, during section we only spoke of a write-allocate cache in terms of a write-allocate and write-through cache. From the previous paragraph, just make note that it is possible for a write-allocate cache to also be write-back, in which case it is not necessarily true that both the cache copy and memory copy are updated on a write. Sorry for the confusion!



VIVT

[wiki] CPU cache: Virtual tags and vhints
http://en.wikipedia.org/wiki/CPU_cache#Virtual_tags_and_vhints

Looking at Linux cache handling from the ARM VIVT perspective
http://docs.google.com/Doc?id=dcbsxfpf_282csrs2pfn
http://blog.chinaunix.net/u2/79526/showart_1200081.html

ARM Architecture Support
http://msdn.microsoft.com/en-us/library/bb905767.aspx
On ARMv4 and ARMv5 processors, cache is organized as a virtual-indexed, virtual-tagged (VIVT) cache in which both the index and the tag are based on the virtual address. The main advantage of this method is that cache lookups are faster because the translation look-aside buffer (TLB) is not involved in matching cache lines for a virtual address. However, this caching method does require more frequent cache flushing because of cache aliasing, in which the same physical address can be mapped to multiple virtual addresses.

On ARMv6 and ARMv7 processors, cache is organized as a virtual-indexed, physical-tagged (VIPT) cache. The cache line index is derived from the virtual address. However, the tag is specified by using the physical address. The main advantage is that cache aliasing is not an issue because every physical address has a unique tag in the cache. However, a cache entry cannot be determined to be valid until the TLB has translated the virtual address to a physical address that matches the tag. Generally, the TLB lookup cost offsets the performance gain achieved by avoiding cache aliasing.

(......................)

For ARMv6 and ARMv7 processors, cache flushing in thread switching to a process other than the current active process is limited to the following instances:

* The hardware does not support VIPT I-cache: In ARMv6 and ARMv7, it is optional for I-cache to be VIPT. Data cache is VIPT or physically-indexed and physically-tagged (PIPT) in MPCore systems. If the hardware does not support VIPT I-cache, the OS flushes the I-cache.
* The system is out of address-space identifiers (ASIDs) for each virtual address: In this case, the OS flushes the whole TLB.

This means that, whereas on ARMv4 and ARMv5 processors the whole cache (I-cache, D-cache, and TLB) is flushed on every thread switch to a different process, on ARMv6 and ARMv7 processors the D-cache is never flushed on a thread switch. The I-cache is flushed only if the processor does not support a VIPT cache. The TLB is flushed only when all 255 supported ASIDs have been used. This reduction in cache flushes should improve overall system performance.

In addition, moving to VIPT has performance advantages for the following OS features in CE 6.0:

* Memory-mapped files: On an ARMv4 or ARMv5 system, all read/write views are marked as uncached to prevent aliasing. Marking the views as uncached affects overall system performance. However, in VIVT, you must prevent aliasing. On ARMv6 and ARMv7 systems, views are marked as cached.
* VirtualAllocCopyEx: In CE 6.0, if a kernel mode driver creates an explicit alias in which two virtual addresses map to the same physical address by using VirtualAllocCopyEx, the OS marks both the source and destination addresses as uncached to avoid cache aliasing on ARMv4 and ARMv5 systems. On ARMv6 and ARMv7 systems, source and destination addresses are marked as cached. Even though this function can be called only from kernel mode, this affects both kernel-mode and user-mode drivers. Device Manager copies the data only for user-mode drivers.
