Saturday, December 5, 2009

ARM11 MPCore Cache Coherency issue with DMA and SMP

Implementing DMA on ARM SMP Systems
http://infocenter.arm.com/help/topic/com.arm.doc.dai0228a/DAI228A_DMA_on_SMP_systems.pdf
In short, on an ARM11 MPCore with SMP enabled, cache maintenance operations (invalidate/clean/...) must be performed on ALL of the cores, otherwise stale data can still be accessed. The document provides four solutions; Solutions A and B are application dependent.

PERFORMANCE ISSUE
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005915.html
Linux 2.6.31.1, PCIe SATA adapter, ahci.c, reading a 1GB file

1 CPU:               13.56 / 13.56 / 13.60 (sec) ~ 75.5 MBps
SMP + IPI:           16.29 / 16.14 / 16.05 (sec) ~ 63.8 MBps
SMP + RFO/WFO:       21.71 / 21.72 / 21.70 (sec) ~ 47.18 MBps
SMP + RFO/WFO + pld: 21.63 / 21.46 / 21.41 (sec) ~ 47.82 MBps

DMA cache IPI vs. AHCI interrupt counts (number of IPI DMA cache operation interrupts / number of AHCI interrupts):
MYARCH_ahci: 98509/ 4505
pcie_ahci: 81792/ 4501

Both solutions suffer a performance drop. Drivers that accept cacheable buffers and perform DMA would be affected, e.g. network, USB, storage, etc., which unfortunately are usually important blocks. Currently I think ARM11 MPCore cannot be used for Linux SMP. There might be a chance for AMP.

Solution C. Read for ownership
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005854.html

* Drop my DMA broadcasting patch
* In the dma_cache_maint (and the contiguous one) do the following
based on direction:
* TO_DEVICE: read each cache line in the buffer (you can
read a long variable every 32 bytes) before the local
cache maintenance operations. This ensures that any
dirty cache lines on other CPUs are transferred to L2
(or main memory) or the current CPU and the cache
operation would have the intended effect. The cache
lines on other CPUs may change to a Shared state (see
the MESI protocol)
* FROM_DEVICE: we don't care about the buffer, just write
0 in each cache line (as above, you can only write a
long every 32 bytes). This ensures that the cache lines
become Exclusive to the current CPU (no other CPU has
any copies) and the invalidation would ensure that there
are no cache lines on any CPU
* BIDIRECTIONAL: read a word in a cache line and write it
back. After cache clean&invalidate, the cache lines
would be removed from all the existing CPUs
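A minimal sketch of this read/write-for-ownership pass as it might run just before the local cache maintenance, assuming a 32-byte cache line and a cache-line-aligned DMA buffer; the helper name and constant are illustrative, not taken from the actual patches:

/* Hypothetical helper: touch every cache line of the buffer so that the
 * current CPU becomes the owner before the normal local dmac_*_range
 * maintenance runs. */
#include <linux/dma-mapping.h>

#define L1_LINE	32	/* ARM11 MPCore D-cache line size */

static void dma_cache_own(const void *start, size_t size,
			  enum dma_data_direction dir)
{
	unsigned long addr = (unsigned long)start & ~(L1_LINE - 1UL);
	unsigned long end  = (unsigned long)start + size;

	for (; addr < end; addr += L1_LINE) {
		volatile unsigned long *p = (volatile unsigned long *)addr;

		switch (dir) {
		case DMA_TO_DEVICE:	/* read: remote dirty lines are written back */
			(void)*p;
			break;
		case DMA_FROM_DEVICE:	/* write: line becomes Exclusive here */
			*p = 0;
			break;
		case DMA_BIDIRECTIONAL:	/* read-modify-write */
			*p = *p;
			break;
		default:
			break;
		}
	}
}

After this pass no other CPU holds a modified copy of the lines, so the usual local clean/invalidate has the intended effect.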

A formal patch is not yet available; try Catalin Marinas's patch:
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005860.html
or my patch:
http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/aed3f989/attachment-0001.obj
Catalin Marinas's formal commit
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=8108d60829c2d10fe62aaa7b2fae10f0e4abad36

There will be a performance drop: in my case, SATA read performance was 64 MBps with the IPI patch and dropped to 44 MBps with the RFO patch, about 31% lower.
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005863.html
I would be very surprised if going down this route doesn't result in
block IO data performance (and network performance) dropping by more
than 60% of the DMA value (that's DMA performance * 0.4).





Solution D. Broadcast of cache maintenance operations
By IPI

Currently this solution causes a deadlock in ata_scsi_queuecmd. The deadlock situation and a patch that could fix it:
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/006051.html
http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/520c074d/attachment-0001.obj
However, it is not a one-shot solution; the deadlock has to be fixed wherever it is encountered. It also breaks the atomic context:
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005858.html

Having functions which enable interrupts while the parent is supposed to
be in an atomic context is definitely a recipe for things to go badly
wrong - this is not a solution anyone in their right mind should
contemplate.


The following patches use an IPI (Inter-Processor Interrupt) to broadcast the dma_cache_maint operation.
http://linux-arm.org/git?p=linux-2.6-stable.git;a=commitdiff;h=95298b1792121e7068258de451caec7f3dda0e78
linux-2.6-stable.git@linux-arm.org
commit 95298b1792121e7068258de451caec7f3dda0e78
Author: Catalin Marinas <catalin.marinas@arm.com>
Date: Tue Mar 10 10:22:54 2009 +0000

http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=f1c242dc5f326713578e469c9f5be647978ebe24
linux-2.6.git@linux-arm.org
commit f1c242dc5f326713578e469c9f5be647978ebe24
Author: Catalin Marinas <catalin.marinas@arm.com>
Date: Wed Oct 28 13:27:49 2009 +0000

http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005568.html
The updated patch, which also covers dma_cache_maint_contiguous:
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=6dd5056b9abe1e38fae3eb8d576e562f49452b0f


Even with the above patches, tests may still fail. A USB EHCI controller and a flash disk were used to verify this, by repeatedly reading a 1MB file from the flash disk and comparing it to the original file in a ramdisk:
without SMP: no failures.
SMP without L1 cache: no failures.
SMP with L1 cache: 13.26% of reads failed.
If dma_unmap_single is forced to do dma_cache_maint as well, the failure rate with SMP and L1 drops to 0.3~0.4%.
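A userspace sketch of that check (the file paths, iteration count and the page-cache drop are my assumptions, not the original test setup):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (1024 * 1024)	/* the 1MB test file */
#define RUNS 1000

static int read_file(const char *path, char *buf)
{
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	if (fread(buf, 1, SIZE, f) != SIZE) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	static char ref[SIZE], buf[SIZE];
	int i, fails = 0;

	if (read_file("/mnt/ramdisk/test.bin", ref))	/* reference copy */
		return 1;

	for (i = 0; i < RUNS; i++) {
		/* drop the page cache so every read really goes through DMA */
		system("echo 3 > /proc/sys/vm/drop_caches");
		if (read_file("/mnt/usb/test.bin", buf))	/* copy on the EHCI flash disk */
			return 1;
		if (memcmp(ref, buf, SIZE))
			fails++;
	}
	printf("%d/%d reads mismatched (%.2f%%)\n",
	       fails, RUNS, 100.0 * fails / RUNS);
	return 0;
}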

By FIQ
Laguna SMP Benchmarks 10001092-00.pdf
http://trac.gateworks.com/wiki/laguna%3Agw2388-4%3Asmp_benchmarks

linux-3.2 in src/linux/laguna – DD-WRT
http://svn.dd-wrt.com:8000/browser/src/linux/laguna/linux-3.2
http://svn.dd-wrt.com:8000/browser/src/linux/laguna/linux-3.2?rev=18083
first patch on the site
http://svn.dd-wrt.com:8000/browser/src/linux/laguna/linux-3.0?rev=17578 


[RFC PATCH] Broadcast the DMA cache operations on ARMv6 SMP hardware
http://lists.arm.linux.org.uk/lurker/message/20080620.124707.2bff9c7f.en.html
http://lists.arm.linux.org.uk/lurker/message/20080620.154546.aaa33d72.en.html
> By the way, besides these dmac cache functions,
> don't you think other cache maintain functions
> like clean_dcache_area() need broadcasting?


AFAIK, it was discussed some years ago and these operations don't
require broadcasting.
Basically, once a write to a memory location
occurs, the MESI protocol used by the SCU ensures that the owner of that
cache line is the CPU that did the writing. If the cache line exists on other
CPUs, it is invalidated. Therefore a cache cleaning operation on the CPU
that did the writing is enough since no other CPU has a valid cache
line.

The problem is slightly different with the DMA API since the driver
might invalidate an area of memory (dma_map_single(FROM_DEVICE)) without
reading or writing it before and hence the CPU is not the owner of those
lines. The same goes for cleaning or flushing since some drivers may run
dma_map_single(TO_DEVICE) in an interrupt routine handled on a CPU but
the buffer to be transmitted could have been written on a different CPU.

Regarding the I-cache invalidation (which, BTW, is completely missing
from the mainline kernel), the patch I proposed (posted again last week)
does this when a thread migrates to another CPU that it hadn't run on
before and there is no need for broadcasting as we track the CPU via
mm->cpu_vm_mask (see switch_mm in mmu_context.h).
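To make the dma_map_single(TO_DEVICE) scenario described above concrete, here is a hypothetical driver fragment (the function names and structure are mine, not taken from any real driver):

#include <linux/dma-mapping.h>
#include <linux/string.h>

/* Process context on CPU0: fill the buffer. The dirty cache lines now
 * sit only in CPU0's L1. */
static void fill_tx_buffer(void *buf, size_t len)
{
	memset(buf, 0xAB, len);
}

/* Interrupt later handled on CPU1: map the same buffer for DMA. On
 * ARM11 MPCore the clean performed here stays local to CPU1, so CPU0's
 * dirty lines never reach L2/RAM and the device transmits stale data. */
static dma_addr_t start_tx(struct device *dev, void *buf, size_t len)
{
	return dma_map_single(dev, buf, len, DMA_TO_DEVICE);
}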

The following patch fixes my issue: the change the broadcast patch makes to dma_cache_maint must also be applied to dma_cache_maint_contiguous, which is called by dma_cache_maint_page.
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index be56c43..15dafb6 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -584,15 +584,15 @@ static void dma_cache_maint_contiguous(struct page *page,

   switch (direction) {
   case DMA_FROM_DEVICE:           /* invalidate only */
-               inner_op = dmac_inv_range;
+               inner_op = smp_dma_inv_range;
           outer_op = outer_inv_range;
           break;
   case DMA_TO_DEVICE:             /* writeback only */
-               inner_op = dmac_clean_range;
+               inner_op = smp_dma_clean_range;
           outer_op = outer_clean_range;
           break;
   case DMA_BIDIRECTIONAL:         /* writeback and invalidate */
-               inner_op = dmac_flush_range;
+               inner_op = smp_dma_flush_range;
           outer_op = outer_flush_range;
           break;
   default:
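The smp_dma_*_range helpers come from the broadcast patch; a rough sketch of how such a wrapper might be built on top of the existing local dmac_*_range operations (illustrative only, not the exact code of the RFC patch):

#include <linux/smp.h>
#include <asm/cacheflush.h>

struct dma_cache_range {
	const void *start;
	const void *end;
};

/* Runs on every CPU: directly on the calling CPU, via IPI on the others */
static void ipi_dma_inv_range(void *arg)
{
	struct dma_cache_range *r = arg;

	dmac_inv_range(r->start, r->end);	/* local L1 invalidate */
}

static void smp_dma_inv_range(const void *start, const void *end)
{
	struct dma_cache_range r = { start, end };

	/* wait == 1: return only after every CPU has finished */
	on_each_cpu(ipi_dma_inv_range, &r, 1);
}

Because on_each_cpu()/smp_call_function() must not be called with interrupts disabled, reaching such a wrapper from atomic context is what produces the ata_scsi_queuecmd deadlock described under Solution D above.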




Call flush_dcache_page after PIO data transfer in libata-sff.c
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=026f474ca17dd758#patch1
When reading data from an ATA device using PIO, the kernel dirties the
D-cache but there is no flush_dcache_page() call in ata_pio_sector().
Since the VFS layer does not call this function either, a subsequent
update_mmu_cache() is not aware of the dirty page, which may lead to
cache incoherency in user space.


Call flush_dcache_page in usb_stor_access_xfer_buf
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=d0c91030c392ef4e
Transferring buffers using memcpy dirties the D-cache but there is no
corresponding flush_dcache_page call which leads to data corruption in
user-space.
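Both fixes follow the same pattern: whenever the CPU, rather than the DMA engine, copies device data into a page cache page, flush that page's D-cache before user space can map it. A sketch of the pattern (hypothetical helper, not the actual libata or usb-storage code):

#include <linux/highmem.h>	/* kmap_atomic(), flush_dcache_page() */
#include <linux/string.h>

/* Copy device data into 'page' by CPU (PIO loop or memcpy), then flush
 * the D-cache so a later update_mmu_cache() sees a coherent page. */
static void cpu_copy_to_page(struct page *page, unsigned int offset,
			     const void *src, size_t len)
{
	char *dst = kmap_atomic(page, KM_USER0);

	memcpy(dst + offset, src, len);
	kunmap_atomic(dst, KM_USER0);

	flush_dcache_page(page);	/* the CPU dirtied the D-cache */
}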


In setup_processor(), the global struct cpu_cache_fns cpu_cache is copied from __v6_proc_info.cache, which points to v6_cache_fns.
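A rough sketch of that wiring on a MULTI_CACHE kernel (paraphrased from arch/arm/kernel/setup.c and the proc-v6 tables; treat the details as an approximation):

#include <asm/cacheflush.h>	/* struct cpu_cache_fns, cpu_cache */
#include <asm/procinfo.h>	/* struct proc_info_list */

/* What setup_processor() effectively does on a MULTI_CACHE build: copy
 * the per-CPU cache function table into the global 'cpu_cache', which
 * is what dmac_inv_range()/dmac_clean_range()/dmac_flush_range()
 * expand to. */
static void wire_up_cache_fns(struct proc_info_list *list)
{
#ifdef MULTI_CACHE
	/* For ARM11 MPCore, list is __v6_proc_info and list->cache points
	 * to v6_cache_fns, so the v6 routines end up behind dmac_*. */
	cpu_cache = *list->cache;
#endif
}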
