Monday, December 28, 2009

Linux firmware

So udev must be brought up before using drivers that load their own firmware.

Documentation/firmware_class/README

request_firmware() hotplug interface:
------------------------------------

(..............)

High level behavior (mixed):
============================

kernel(driver): calls request_firmware(&fw_entry, $FIRMWARE, device)

userspace:
  • /sys/class/firmware/xxx/{loading,data} appear.
  • hotplug gets called with a firmware identifier in $FIRMWARE and the usual hotplug environment.
    • hotplug: echo 1 > /sys/class/firmware/xxx/loading

kernel: Discard any previous partial load.

userspace:
  • hotplug: cat appropriate_firmware_image > \
    /sys/class/firmware/xxx/data

kernel: grows a buffer in PAGE_SIZE increments to hold the image as it
comes in.

userspace:
  • hotplug: echo 0 > /sys/class/firmware/xxx/loading

kernel: request_firmware() returns and the driver has the firmware
image in fw_entry->{data,size}. If something went wrong
request_firmware() returns non-zero and fw_entry is set to
NULL.

kernel(driver): Driver code calls release_firmware(fw_entry) releasing
the firmware image and any related resource.
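
A minimal sketch of the driver side of this flow, assuming a made-up firmware name ("mydrv/fw.bin") and function name: request_firmware() sleeps until userspace completes the steps above, then the driver consumes fw_entry->{data,size} and releases it.

#include <linux/firmware.h>
#include <linux/device.h>

static int mydrv_load_fw(struct device *dev)
{
	const struct firmware *fw_entry;
	int ret;

	/* blocks until udev/hotplug feeds the image, or the request times out */
	ret = request_firmware(&fw_entry, "mydrv/fw.bin", dev);
	if (ret)
		return ret;

	/* fw_entry->data / fw_entry->size now hold the image; program the device here */

	release_firmware(fw_entry);
	return 0;
}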

Thursday, December 24, 2009

Linux kernel git sources

Check the MAINTAINERS first...

Russell King's git tree

http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm.git/


Catalin Marinas's git tree
git://linux-arm.org/linux-2.6-stable.git
git://linux-arm.org/linux-2.6.git

The kernel trees on linux-arm.org are all Catalin Marinas's (both master and devel branches).

Uwe Kleine-König
git://git.pengutronix.de/git/ukl/linux-2.6.git arm/booting


Anton Vorontsov
git://git.infradead.org/users/cbou/linux-cns3xxx.git master


Jeff Garzik (ata)
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

Linux Scheduler: forget the TV soap operas, watching CFS vs SD is where the real excitement is!!

CFS排班器初探 (A first look at the CFS scheduler)
http://zylix666.blogspot.com/2007/10/cfs.html

But behind it there was also a fight among the kernel developers. In short, mingo (the CFS author) out-maneuvered CK (the SD scheduler author), and then offered some insincere praise for the effort CK had contributed.

Annoyed, CK said a few more words on LKML, which irritated Linus, who then decided to stop taking the SD scheduler work. In grief and indignation, CK announced his withdrawal, insincerely claiming that he had other plans for his life anyway and that it had nothing to do with this fight...


It's rare that reading Linux material is as thrilling as watching a TV drama.
But it is really long; not only do you need to pull up a bench, you also need to bring popcorn...
(and a translation tool... Yahoo Dictionary will do~~)

It is like the software IP problem that FOSS keeps criticizing: some vendors rely on a handful of IP claims to suppress Linux, which is really unreasonable; but nothing should be taken to extremes either. I think this kind of person is a cockroach: stealing someone else's idea, implementing it yourself, and then approving it yourself; feeling pretty good about yourself? If CK had stopped working on it and mingo picked up his idea, I'd say that's OK; or if mingo had not been the reviewer and it had gone through a more open discussion, it would have looked better too.

Linux: The Completely Fair Scheduler
April 18, 2007
http://kerneltrap.org/node/8059

Linux: CFS Updates, -v20
August 23, 2007
http://kerneltrap.org/Linux/CFS_Updates_v20

CFS Updates
September 25, 2007
http://kerneltrap.org/Linux/CFS_Updates

Some issues while running Linux SMP on ARM11MPCore

http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/006650.html

I'm using an ARM11 MPCore with 2 CPUs, Linux-2.6.31.1, SMP enabled, L1 enabled, L2 disabled.

Under the SMP environment, I have observed the following issues:

  1. Sometimes the console becomes extremely slow, printing one character every 1-2 seconds.
    RVDS says that both CPUs are idling. The kernel seems fine because the messages in response to inserting a USB flash drive are quick and correct.
    fixed

  2. Sometimes the Linux console halts and cannot accept any input.
    RVDS says that both CPUs are idling. The kernel seems fine because the messages in response to inserting a USB flash drive are quick and correct.
    should be fixed with case 1

  3. Sometimes the test stops for no reason, or with some fault like a segmentation fault, and returns to the console prompt or login prompt.

  4. Sometimes the test stops for no reason but does not return to the console prompt. The console can accept input, but gives no further response, nor a prompt.

    RVDS says that one CPU is idling and the other is in IRQ context, at entry-armv.S(676) after __pabt_usr; it seems to keep getting prefetch aborts.



I can reproduce case 1 by repeatedly inserting a simple test module.
#include <linux/init.h>
#include <linux/module.h>

static int __init MYDRIVER_init(void)
{
	printk("%s: \n", __func__);
	return 0;
}

static void __exit MYDRIVER_exit(void)
{
	printk("%s: \n", __func__);
}

MODULE_AUTHOR("Mac Lin");
MODULE_DESCRIPTION("MYDRIVER");
MODULE_LICENSE("GPL");

module_init(MYDRIVER_init);
module_exit(MYDRIVER_exit);
Keep inserting and removing modules like below:
module=mydriver;modprobe ${module};rmmod ${module}; (...repeat many times...i have it 30 times..) modprobe ${module};rmmod ${module};
and keep issuing it, say, 10 times, without waiting for the previous command to complete.
Then at some point I get case 1.

The following command won't trigger it; it just keeps running.
module=mydriver;while : ; do modprobe ${module};rmmod ${module};done;


After some tracing, I suspected that CONFIG_LOCAL_TIMERS had strange behavior. I disabled it, and the situation changed: it is harder to get case 1, but there are still some issues. For example, it crashes like the following and becomes case 3:
[ 57.090000] MYDRIVER_exit:
[ 57.110000] MYDRIVER_init:
[ 57.150000] MYDRIVER_exit:
[ 57.180000] MYDRIVER_init:
[ 57.210000] MYDRIVER_exit:
[ 57.240000] MYDRIVER_init:
[ 57.270000] MYDRIVER_exit:
[ 57.300000] MYDRIVER_init:
[ 57.320000] sh: unhandled page fault (11) at 0x000b7dfc, code 0x017
[ 57.320000] pgd = c78b4000
[ 57.330000] [000b7dfc] *pgd=038f4031, *pte=00000000, *ppte=00000000
[ 57.350000]
[ 57.360000] Pid: 350, comm: sh
[ 57.370000] CPU: 1 Not tainted (2.6.31.1-XXXX1 #53)
[ 57.390000] PC is at 0x40058d04
[ 57.400000] LR is at 0xb7df8
[ 57.400000] pc : [<40058d04>] lr : [<000b7df8>] psr: 60000010
[ 57.400000] sp : bec8b6b8 ip : 0001d020 fp : 00000000
[ 57.440000] r10: 00000000 r9 : bec8b728 r8 : 00000002
[ 57.460000] r7 : 0009c038 r6 : 0001d028 r5 : 4009fe40 r4 : 400a02f8
[ 57.470000] r3 : 00000049 r2 : 0009add8 r1 : 0009add8 r0 : 00000049
[ 57.490000] Flags: nZCv IRQs on FIQs on Mode USER_32 ISA ARM Segment user
[ 57.520000] Control: 00c5787d Table: 078b400a DAC: 00000015
Segmentation fault


Without the DCache and CONFIG_LOCAL_TIMERS, I can repeat the above procedure for 216 seconds, then it halts as in case 4.

Case 1 also exists.

This means that disabling the DCache and CONFIG_LOCAL_TIMERS cannot avoid these issues, only mitigate them a little.

BTW, I have done a quick port to linux-2.6.33-rc1, branch master, based on commit f2d9a06. With the DCache and CONFIG_LOCAL_TIMERS enabled, I have seen case 1, which means this issue is not fixed yet.

Without SMP, I haven't seen such issues yet.

So currently all the clues lead to SMP.

http://lists.infradead.org/pipermail/linux-arm-kernel/2010-January/006901.html

(......................)
(case 1 and 2) These two sound like a problem with interrupts - userspace console IO is interrupt driven, whereas kernel console IO is not.
(......................)
It could be something to do with write allocate caches - we don't support these particularly well in the kernel, and I wouldn't be surprised if you've found some problem there.

The fact that it only happens in SMP mode rather points at that, because that's one of the few hardware configurations which does have write allocate caches. To confirm this, we need someone who can run your tests on a UP platform which has write allocate caches...


http://lists.infradead.org/pipermail/linux-arm-kernel/2010-January/006945.html
http://lists.infradead.org/pipermail/linux-arm-kernel/2010-January/006955.html
Neither a non-SMP kernel nor an SMP kernel with maxcpus=1 shows the same behavior.


Fix for case 1 and case 2
http://lists.infradead.org/pipermail/linux-arm-kernel/2010-January/007052.html
Thanks to Russell's advice, after some tracing I found that the IER (Interrupt Enable Register) of my serial port is 0 under case 1!!

Case 2 is actually the same as case 1. Case 1 comes first; if I don't keep typing and let it finish its slow printing, it then becomes case 2.

UART_BUG_THRE is detected and set on my platform, causing serial8250_backup_timeout to be used.

There are several places that do (get IER, clear IER, restore IER), such as serial8250_console_write (called by printk) and serial8250_backup_timeout. serial8250_backup_timeout is not protected by the port spinlock, causing a race condition that results in a wrong IER value.

The following patch fixes this issue. Cases 3 and 4 are still often seen, but not cases 1 and 2.
diff --git a/kernels/linux-2.6.31.1-X/drivers/serial/8250.c b/kernels/linux-2.6.31.1-X/drivers/serial/8250.c
index 288a0e4..55602c3 100644
--- a/kernels/linux-2.6.31.1-cavm1/drivers/serial/8250.c
+++ b/kernels/linux-2.6.31.1-cavm1/drivers/serial/8250.c
@@ -1752,6 +1758,8 @@ static void serial8250_backup_timeout(unsigned long data)
unsigned int iir, ier = 0, lsr;
unsigned long flags;

+
+ spin_lock_irqsave(&up->port.lock, flags);
/*
* Must disable interrupts or else we risk racing with the interrupt
* based handler.
@@ -1769,10 +1777,8 @@ static void serial8250_backup_timeout(unsigned long data)
* the "Diva" UART used on the management processor on many HP
* ia64 and parisc boxes.
*/
- spin_lock_irqsave(&up->port.lock, flags);
lsr = serial_in(up, UART_LSR);
up->lsr_saved_flags |= lsr & LSR_SAVE_FLAGS;
- spin_unlock_irqrestore(&up->port.lock, flags);
if ((iir & UART_IIR_NO_INT) && (up->ier & UART_IER_THRI) &&
(!uart_circ_empty(&up->port.info->xmit) || up->port.x_char) &&
(lsr & UART_LSR_THRE)) {
@@ -1780,12 +1786,14 @@ static void serial8250_backup_timeout(unsigned long data)
iir |= UART_IIR_THRI;
}

- if (!(iir & UART_IIR_NO_INT))
- serial8250_handle_port(up);
-
if (is_real_interrupt(up->port.irq))
serial_out(up, UART_IER, ier);

+ spin_unlock_irqrestore(&up->port.lock, flags);
+
+ if (!(iir & UART_IIR_NO_INT))
+ serial8250_handle_port(up);
+
/* Standard timer interval plus 0.2s to keep the port running */
mod_timer(&up->timer,
jiffies + poll_timeout(up->port.timeout) + HZ / 5);


SMP issues with 8250.c‏
http://old.nabble.com/SMP-issues-with-8250.c%E2%80%8F-to27090634.html
http://www.spinics.net/lists/linux-serial/msg02106.html

Wednesday, December 23, 2009

Tuesday, December 22, 2009

Kernel Images: from building to booting

I made these pictures about 3 years ago, but they still look the same, and they are quite useful!!!



Same picture, just in another layout.


bootpImage Layout



(KERNEL_PHYS and the 2nd step in the diagram are not in the vanilla kernel.)

arch/arm/boot/Makefile

ZRELADDR := $(zreladdr-y)
PARAMS_PHYS := $(params_phys-y)
INITRD_PHYS := $(initrd_phys-y)
KERNEL_PHYS := $(kernel_phys-y)

arch/arm/mach-ARCH/Makefile.boot
zreladdr-y := 0x00008000
params_phys-y := 0x00000100
initrd_phys-y := 0x00C00000
kernel_phys-y := 0x00600000

arch/arm/Makefile
textofs-y := 0x00008000
(...........)
TEXT_OFFSET := $(textofs-y)


Re: About TEXTADDR, ZTEXTADDR, PAGE_OFFSET etc...
http://lists.arm.linux.org.uk/lurker/message/20010723.185051.94ce743c.en.html

defining ZRELADDR as PHYS_OFFSET + TEXT_OFFSET
http://lists.arm.linux.org.uk/lurker/message/20100127.101228.78a1533e.en.html




Linux bluetooth for console-only devices

File transfer without X

I was using a notebook with Ubuntu, and a development board with an ARM CPU running ARM Debian, which currently has only a serial console.

bluetooth
bluez
bluez-gnome
obexftp
obexd-server

DUT to other device

  1. Insert the required modules
  2. If DBus/bluetoothd is not up yet, bring it up.
    /etc/init.d/dbus restart;
    /etc/init.d/bluetooth restart;
  3. hcitool dev
  4. l2test
    (For some unknown reason, I cannot find l2test in any Ubuntu package, so I downloaded bluez-4.58 and ran configure with --enable-test to get l2test. It works just fine.)
    l2test -I 2000 -r&
    l2test -s 00:11:67:8A:A9:26
    ./l2test -s 0:11:67:8A:A9:27
  5. start authorization agent
    bluetooth-agent 0000 &
  6. do file transfer

    bluetooth-sendto --dest=0E:70:24:91:66:01 video.mp4
    (requires X)

    obexftp -b 0E:70:24:91:66:01 -p 2008-10-09-211.jpg
    (doesn't require X)
other device to DUT
  1. enable inquiry scan
    hciconfig hci0 piscan
  2. enable file transfer server
    obexftpd -b /
    (doesn't require X)

    obex-data-server
    (requires X)


BlueZ
http://www.bluez.org/

Linux BlueZ Howto
http://jeremythompson.uklinux.net/RH8-0/bluezhowto.pdf

The Penguin with the BlueZ
http://fedoraproject.org/w/uploads/4/40/FUDCon_FUDCon2_FUDCon2MarcelHoltmann.pdf

BlueZ aware applications
http://wiki.bluez.org/wiki/UsingBluez

BlueZ: HOWTO/Authorization
http://wiki.bluez.org/wiki/HOWTO/Authorization

BlueZ: Bluetooth Services
http://wiki.bluez.org/wiki/Services

pastebin - collaborative debugging tool
http://pastebin.com

Tuesday, December 15, 2009

git-svn

Reconstructing git-svn metadata after a git clone
http://www.spinics.net/lists/git/msg130949.html

git clone git://git.webkit.org/WebKit.git WebKit
cd WebKit
git svn init -T trunk http://svn.webkit.org/repository/webkit
git update-ref refs/remotes/trunk origin/master
(there are other ways to get the svn metadata... some work, some don't...)
git svn clone --stdlayout file:///tmp/test/hello hello-git

git svn clone --username your-name -s https://your-project.googlecode.com/svn
# older versions of git: replace "-s" with "-Ttrunk -bbranches -ttags"
(perform git operations)
git svn rebase # think "svn update"
git svn dcommit # think "svn commit"



使用 git-svn 整合 git 與 svn (Using git-svn to integrate git and svn)
http://blog.kanru.info/archives/466/comment-page-1

git-svn(1) Manual Page
http://www.kernel.org/pub/software/scm/git/docs/git-svn.html

[Linux][軟體] Git-svn 使用簡單介紹 (A simple introduction to using git-svn)
http://antontw.blogspot.com/2008/05/linux-git-svn.html

Develop with Git on a Google Code Project
http://google-opensource.blogspot.com/2008/05/develop-with-git-on-google-code-project.html

Monday, December 14, 2009

Cache

Fact
The VA-to-PA mapping is aligned to the page size, so any VA/PA pair has the same page offset; e.g. for a 4KB page size, bits 0-11 are identical.


Assumption
Cache of 32KB, 4-way, 32B cache line

  • 8KB per way = 2^13
  • 32B cache line = 2^5
Page size 4KB = 2^12

[wiki] Harvard architecture
http://en.wikipedia.org/wiki/Harvard_architecture

[wiki] Cache
http://en.wikipedia.org/wiki/Cache

[wiki] Cache coherence
http://en.wikipedia.org/wiki/Cache_coherency

[wiki] CPU cache
http://en.wikipedia.org/wiki/CPU_cache

Cache Aliases 小註解 (A short note on cache aliases)
http://www.cash.idv.tw/wordpress/?p=2273

Understanding Caching
http://www.linuxjournal.com/article/7105
kmalloc never allocates two regions that overlap in a cache line

(...........)
  • PIPT

    • get virt_addr, address translation to get phy_addr, lookup in cache to get data
    • slow; address translation and lookup in cache have to be sequential
    • no alias
  • VIVT

    • get virt_addr, lookup in cache to get data
    • fast; does not need address translation
    • aliasing; it cannot tell whether multiple virt_addrs map to the same phy_addr
  • VIPT

    • get virt_addr, do (cache lookup) and (address translation)
    • faster than PIPT; (cache lookup) and (address translation) can be done in parallel
    • COULD be alias-free; aliases can be detected by seeing the same tag (phy_addr) at multiple indexes (virt_addr), IF the hardware implements this.
    • larger tag; for the phy_addr
(...........)

The Aliasing Problem

Any time the kernel sets up more than one virtual mapping for the same physical page, cache line aliasing may occur. The kernel is careful to avoid aliasing, so it usually occurs only in one particular instance: when the user mmaps a file. Here, the kernel has one virtual address for pages of the file in the page cache, and the user may have one or more different virtual addresses. This is possible because nothing prevents the user from mmaping the file at multiple locations.

When a file is mmaped, the kernel adds the mapping to one of the inode's lists: i_mmap, for maps that cannot change the underlying data, or i_mmap_shared, for maps that can change the file's data. The API for bringing the cache aliases of a page into sync is:

void flush_dcache_page(struct page *page);

This API must be called every time data on the page is altered by the kernel, and it should be called before reading data from the page if page->mapping->i_mmap_shared is not empty. In architecture-specific code, flush_dcache_page loops over the i_mmap_shared list and flushes the cache data. It then loops over the i_mmap list and invalidates it, thus bringing all the aliases into sync.
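
A minimal sketch of that rule (the helper below is hypothetical, not from the article): kernel code writes into a page-cache page through its kernel mapping, then calls flush_dcache_page() so any user-space mmap aliases are brought into sync.

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

/* hypothetical helper: fill a page-cache page from a kernel buffer */
static void fill_page(struct page *page, const void *src, size_t len)
{
	void *kaddr = kmap(page);		/* kernel alias of the page */

	memcpy(kaddr, src, len);		/* dirties the kernel alias only */
	kunmap(page);
	flush_dcache_page(page);		/* sync the user-space aliases */
}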

Separate Instruction and Data Caches

In their quest for efficiency, processors often have separate caches for the instructions they execute and the data on which they operate. Often, these caches are separate mechanisms, and a data write may not be seen by the instruction cache. This causes problems if you are trying to execute instructions you just wrote into memory, for example, during module loading or when using a trampoline(?). You must use the following API:

void
flush_icache_range(unsigned long start,
unsigned long end);

to ensure that the instructions are seen by the instruction cache prior to execution. start and end are the starting and ending addresses, respectively, of the block of memory you modified to contain your instructions.
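
A short sketch of how that API is typically used (the wrapper function is made up): copy the freshly written instructions, then flush the range so the I-cache sees them before anything jumps there.

#include <linux/string.h>
#include <asm/cacheflush.h>

static void install_code(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);			/* only the D-cache sees this */
	flush_icache_range((unsigned long)dst,
			   (unsigned long)dst + len);	/* sync I-cache before executing dst */
}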

(...........)

(MY COMMENT)
For SMP, USUALLY there is some "hardware" that syncs the L1 caches of all CPUs or otherwise maintains cache coherency, so that an inv/flush/clean on one CPU updates the caches of all CPUs. However, on SOME parts that don't (like the ARM11 MPCore), one has to perform the operation on EVERY CPU, via mechanisms like IPI (Inter-Processor Interrupt), to ensure all CPUs get updated.

For more details on the MPCore, refer to
ARM11 MPCore DMA DCache issue with Linux SMP
http://mkl-note.blogspot.com/2009/12/linux-arm11-mpcore-smp-cache-issue.html
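
A minimal sketch of that IPI idea (the names echo the smp_dma_*_range helpers used in a patch later in these notes, but this is only an illustration, not the actual patch): run the same range operation on every CPU via on_each_cpu(), since the ARM11 MPCore does not broadcast cache maintenance in hardware.

#include <linux/smp.h>
#include <asm/cacheflush.h>

struct cache_range {
	const void *start;
	const void *end;
};

static void local_inv_range(void *info)
{
	struct cache_range *r = info;

	dmac_inv_range(r->start, r->end);	/* per-CPU D-cache invalidate */
}

static void smp_dma_inv_range(const void *start, const void *end)
{
	struct cache_range r = { .start = start, .end = end };

	/* runs locally and IPIs all other CPUs, waiting for completion */
	on_each_cpu(local_inv_range, &r, 1);
}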


The MIPS Cache Architecture
http://people.openrays.org/~comcat/mydoc/mips.cache.arch.pdf
(Simplified Chinese; the author is also a kernel developer, so it is especially readable: the explanations are very clear and there are many examples.)
Writing data
  • Cache Hit
    • Write Through
      • write to cache and RAM
    • Write Back
      • write to cache only
  • Cache Miss
    • Write Allocate
      • allocate cache, read data from RAM, write data to cache
    • No Write Allocate
      • write to next level Cache/Memory directly
Read
  • Cache Hit
    • send data to CPU
  • Cache Miss
    • Read Allocate
      • allocate cache, read data from RAM, send data to CPU
    • No Read Allocate
      • read data from RAM and send data to CPU directly
(.............)

5. Cache Aliases Issue
For VIPT, the cache alias issue exists if WAY_SIZE > PAGE_SIZE.
A page address is always page-size-aligned, whether it is a VA or a PA.
WAY_SIZE (8K, 2^13) = CACHE_SIZE (32K) / WAY (4)
PAGE_SIZE (4K, 2^12)


INDEX_SIZE=log2(WAY_SIZE) (13)
VA lower INDEX_SIZE bits are used as index.

color bit=log2(WAY_SIZE)-log2(PAGE_SIZE) = 13-12 = 1
If 2 VAs map to the same PA:
  • if they have the same color, the VIPT index bits are identical, so both will land in the same cache set. No problem.
  • if they have different colors, they will be placed in 2 different cache sets, which is the cache alias issue.
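
A small user-space sketch of the color arithmetic above (cache geometry as assumed earlier, example VAs are arbitrary): the color is just the index bits that sit above the page offset, so two VAs mapping the same PA are only safe if those bits match.

#include <stdio.h>

#define CACHE_SIZE	(32 * 1024)		/* 32KB */
#define WAYS		4
#define WAY_SIZE	(CACHE_SIZE / WAYS)	/* 8KB */
#define PAGE_SZ		(4 * 1024)		/* 4KB */

/* index bits above the page offset = the "color" of a virtual address */
static unsigned long cache_color(unsigned long va)
{
	return (va & (WAY_SIZE - 1)) / PAGE_SZ;
}

int main(void)
{
	unsigned long va1 = 0x40001000UL, va2 = 0x40002000UL;	/* example VAs */

	printf("color(va1)=%lu color(va2)=%lu -> %s\n",
	       cache_color(va1), cache_color(va2),
	       cache_color(va1) == cache_color(va2) ? "same set" : "possible alias");
	return 0;
}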

Solution
  • remove color bit: let PAGE_SIZE = WAY_SIZE
  • only use VAs of the same color to map a PA
    • somewhat complicated
  • flush Cache if there might be Cache Alias

Focus (mostly occurs when...)
  • copy_to_user_page/copy_from_user_page
  • fork
    • while forking, the child process will COW (copy-on-write) some pages of the parent process. While copying, the VA the child process uses is usually different from the VA of the parent process, which introduces cache aliases.







ARM11 MPCore™ Processor, Technical Reference Manual, Revision: r2p0
Each MP11 CPU features:
‧ an integer unit with integral EmbeddedICE-RT logic
‧ an 8-stage pipeline
‧ branch prediction with return stack
‧ coprocessors 14 and 15
‧ Instruction and Data Memory Management Units (MMUs), managed using MicroTLB structures backed by a unified main TLB

(32KB, 4-way set-associative, for each of the I and D caches)

‧ Instruction and data caches, including a non-blocking data cache with Hit-Under-Miss (HUM)
‧ a data cache that is physically indexed, physically tagged (PIPT)
‧ a data cache that is write back, write allocate only
‧ an instruction cache that is virtually indexed, physically tagged (VIPT)
‧ 32-bit interface to the instruction cache and 64-bit interface to the data cache
‧ hardware support for data cache coherency (among CPUs, but not with DMA)
‧ Vector Floating-Point (VFP) coprocessor support
‧ JTAG-based debug.



write-allocate

CSE 141 Ungraded Homework #5 Answer Sheet
http://cseweb.ucsd.edu/~carter/141/hw5ans.html
NOTE FROM GREG: Last week in section we discussed how caches deal with stores. Specifically, we looked at caches that are write-back vs. write-through, and write-allocate vs. write around. I believe I may have oversimplified things and would like to provide some clarification. It is true that write-back vs. write-through deals with what happens when you write to the cache and find the data present in the cache. However, some students may have been a little confused in reading the solutions to HW #5 when we actually set the dirty bit on a cache miss. The natural question is, "Hey! I thought we only worry about write-back vs. write-through when we have a cache HIT." The potential tricky point here is what happens if your cache is write-back and write-allocate? In this situation, suppose you have a cache miss. The write-allocate policy of the cache will load the data in question into the cache, and the write-back policy will cause only the cache copy to be modified, also turning on the corresponding dirty bit. On the other hand, if your cache is write-through and write allocate, the same thing will happen, but then both the cache copy and the memory copy will be modified.

Secondly, during section we only spoke of a write-allocate cache in terms of a write-allocate and write-through cache. From the previous paragraph, just make note that it is possible for a write-allocate cache to also be write-back, in which case it is not necessarily true that both the cache copy and memory copy are updated on a write. Sorry for the confusion!



VIVT

[wiki] CPU cache: Virtual tags and vhints
http://en.wikipedia.org/wiki/CPU_cache#Virtual_tags_and_vhints

Looking at Linux's cache handling from the ARM VIVT perspective
http://docs.google.com/Doc?id=dcbsxfpf_282csrs2pfn
http://blog.chinaunix.net/u2/79526/showart_1200081.html

ARM Architecture Support
http://msdn.microsoft.com/en-us/library/bb905767.aspx
On ARMv4 and ARMv5 processors, cache is organized as a virtual-indexed, virtual-tagged (VIVT) cache in which both the index and the tag are based on the virtual address. The main advantage of this method is that cache lookups are faster because the translation look-aside buffer (TLB) is not involved in matching cache lines for a virtual address. However, this caching method does require more frequent cache flushing because of cache aliasing, in which the same physical address can be mapped to multiple virtual addresses.

On ARMv6 and ARMv7 processors, cache is organized as a virtual-indexed, physical-tagged (VIPT) cache. The cache line index is derived from the virtual address. However, the tag is specified by using the physical address. The main advantage is that cache aliasing is not an issue because every physical address has a unique tag in the cache. However, a cache entry cannot be determined to be valid until the TLB has translated the virtual address to a physical address that matches the tag. Generally, the TLB lookup cost offsets the performance gain achieved by avoiding cache aliasing.

(......................)

For ARMv6 and ARMv7 processors, cache flushing in thread switching to a process other than the current active process is limited to the following instances:

* The hardware does not support VIPT I-cache: In ARMv6 and ARMv7, it is optional for I-cache to be VIPT. Data cache is VIPT or physically-indexed and physically-tagged (PIPT) in MPCore systems. If the hardware does not support VIPT I-cache, the OS flushes the I-cache.
* The system is out of address-space identifiers (ASIDs) for each virtual address: In this case, the OS flushes the whole TLB.

This means that, whereas on ARMv4 and ARMv5 processors the whole cache (I-cache, D-cache, and TLB) is flushed on every thread switch to a different process, on ARMv6 and ARMv7 processors the D-cache is never flushed on a thread switch. The I-cache is flushed only if the processor does not support VIPT cache. The TLB is flushed only if all 255 supported ASIDs have been used. This reduction of cache flushes should improve overall system performance.

In addition, moving to VIPT has performance advantages for the following OS features in CE 6.0:

* Memory-mapped files: On an ARMv4 or ARMv5 system, all read/write views are marked as uncached to prevent aliasing. Marking the views as uncached affects overall system performance. However, in VIVT, you must prevent aliasing. On ARMv6 and ARMv7 systems, views are marked as cached.
* VirtualAllocCopyEx: In CE 6.0, if a kernel mode driver creates an explicit alias in which two virtual addresses map to the same physical address by using VirtualAllocCopyEx, the OS marks both the source and destination addresses as uncached to avoid cache aliasing on ARMv4 and ARMv5 systems. On ARMv6 and ARMv7 systems, source and destination addresses are marked as cached. Even though this function can be called only from kernel mode, this affects both kernel-mode and user-mode drivers. Device Manager copies the data only for user-mode drivers.

Linux barriers and ARM barriers

./Documentation/memory-barriers.txt

SMP barriers semantics
http://eeek.borgchat.net/lists/linux-arch/msg09402.html
http://marc.info/?l=linux-arch&m=126752718913718&w=2
http://www.spinics.net/lists/linux-arch/msg09402.html
http://article.gmane.org/gmane.linux.kernel.cross-arch/5250

http://www.spinics.net/lists/linux-arch/msg09406.html

The SMP barriers are only required to order cacheable accesses. The plain (non-SMP) barriers (mb, wmb, rmb) are required to order both cacheable and non-cacheable accesses.


ARM11 MPCore™ Processor r2p0 Technical Reference Manual
Data Synchronization Barrier
The Data Synchronization Barrier (DSB) operation acts as a special kind of memory barrier. In the program flow, the DSB occurs at the MCR instruction that performs the DSB. The DSB completes when:
  • all explicit reads and writes before the DSB complete
  • all Cache, Branch predictor and TLB maintenance operations before the DSB complete.
No instruction after the DSB can execute until the DSB completes.

Data Memory Barrier
The Data Memory Barrier (DMB) is a general memory barrier with the following behavior. This description considers the program flow as executing instructions in program order. The DMB occurs at the MCR instruction that performs the DMB.
  • Any explicit memory access by an instruction before the DMB is globally observed before any memory accesses caused by an instruction after the DMB.
  • The DMB has no effect on the ordering of any other instructions executing on the processor.
As such, DMB ensures the apparent order of the explicit memory operations before and after the DMB instruction, but does not ensure the completion of those memory operations. For more information see the ARM Architecture Reference Manual.


ARM: Change the mandatory barriers implementation
http://www.spinics.net/lists/arm-kernel/msg84605.html
The mandatory barriers (mb, rmb, wmb) are used even on uniprocessor
systems for things like ordering Normal Non-cacheable memory accesses
with DMA transfer (via Device memory writes). The current implementation
uses dmb() for mb() and friends but this is not sufficient. The DMB only
ensures the relative ordering of the observability of accesses by other
processors or devices acting as masters. In case of DMA transfers
started by writes to device memory, the relative ordering is not ensured
because accesses to slave ports of a device are not considered
observable by the DMB definition.

A DSB is required for the data to reach the main memory (even if mapped
as Normal Non-cacheable) before the device receives the notification to
begin the transfer. Furthermore, some L2 cache controllers (like L2x0 or
PL310) buffer stores to Normal Non-cacheable memory and this would need
to be drained with the outer_sync() function call.


The patch also allows platforms to define their own mandatory barriers
implementation by selecting CONFIG_ARCH_HAS_BARRIERS and providing a
mach/barriers.h file.

Note that the SMP barriers are unchanged (being DMBs as before) since
they are only guaranteed to work with Normal Cacheable memory.
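
A minimal sketch of the point above (the descriptor layout, register offset, and function are all made up): a DMA transfer started by a write to device memory needs a mandatory barrier between the descriptor update in Normal memory and the doorbell write; per the patch description, wmb() becomes a DSB (plus outer sync) rather than just a DMB.

#include <linux/io.h>
#include <linux/dma-mapping.h>

#define DESC_OWN_BY_HW	(1 << 31)	/* made-up descriptor flag */
#define DMA_START_REG	0x10		/* made-up register offset */

struct my_desc {
	u32 ctrl;
	u32 buf_addr;
};

static void start_dma(void __iomem *regs, struct my_desc *desc,
		      dma_addr_t desc_dma)
{
	desc->ctrl = DESC_OWN_BY_HW;		/* publish the descriptor in memory */

	wmb();		/* mandatory barrier: the descriptor must reach memory
			 * before the device is told to start fetching it */

	writel(desc_dma, regs + DMA_START_REG);	/* doorbell: device begins the DMA */
}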

dma_free_coherent cannot run with IRQs disabled..

dma_free_coherent cannot run with IRQs disabled,
but if something really has to be freed in IRQ context,
just move the actual free into a workqueue.

ref dev_kfree_skb_any, net_tx_action, completion_queue


#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/dma-mapping.h>
#include <linux/hardirq.h>

LIST_HEAD(tofree_list);
DEFINE_SPINLOCK(tofree_list_lock);

struct free_param {
	struct list_head list;

	void *addr;
	dma_addr_t dma_addr;
	uint32_t size;
};

static void free_list_agent_fn(struct work_struct *work)
{
	struct list_head free_list;
	struct free_param *cur, *next;
	unsigned long flags;

	/* detach all pending entries so the lock is held only briefly */
	spin_lock_irqsave(&tofree_list_lock, flags);
	list_add(&free_list, &tofree_list);
	list_del_init(&tofree_list);
	spin_unlock_irqrestore(&tofree_list_lock, flags);

	list_for_each_entry_safe(cur, next, &free_list, list) {
		dma_free_coherent(NULL, cur->size, cur->addr, cur->dma_addr);
		list_del(&cur->list);
		kfree(cur);
	}
}
DECLARE_WORK(free_list_agent, free_list_agent_fn);


void some_free_func(void)
{
	if (irqs_disabled()) {
		/* GFP_ATOMIC: we may be called in atomic context here */
		struct free_param *fp = kmalloc(sizeof(*fp), GFP_ATOMIC);

		fp->addr = desc_addr;
		fp->dma_addr = dma_desc_addr;
		fp->size = count * sizeof(dwc_otg_dma_desc_t);

		spin_lock(&tofree_list_lock);
		list_add(&fp->list, &tofree_list);
		spin_unlock(&tofree_list_lock);

		/* defer the real dma_free_coherent() to process context */
		schedule_work(&free_list_agent);
		return;
	}
	dma_free_coherent(blablabla);
}

Friday, December 11, 2009

Linux clock

clocksource is an interface that provides a counter for the kernel to read, plus the parameters for cyc2ns() to convert cycles into real time (nsec).

clockevent is what actually generates the interrupts that drive the Linux time subsystem.

To change HZ on ARM, change the default of HZ in arch/arm/Kconfig directly; at the same time, the timeout cycle count that drives the timer in the CLOCK_EVT_MODE_PERIODIC case of clockevent.set_mode() must be updated to the cycle count corresponding to (1/HZ) sec (hardware dependent). If hrtimers are used (CLOCK_EVT_MODE_ONESHOT), however, just set clockevent.mult to div_sc(timer_clk_in_hz, NSEC_PER_SEC, shift); it has nothing to do with HZ, because the time the kernel asks the clockevent to fire at is converted into cycles with clockevent.mult in clockevents_program_event() before being handed to clockevent.set_next_event().
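
A sketch of the clockevent side of this (timer frequency, names and callbacks are assumptions): mult is derived with div_sc() from the timer input clock only, and clockevents_program_event() later uses it to turn the requested nanoseconds into cycles for set_next_event(), independent of HZ.

#include <linux/init.h>
#include <linux/clockchips.h>
#include <linux/cpumask.h>

#define TIMER_CLK_HZ	50000000	/* assumed timer input clock */

static void my_timer_set_mode(enum clock_event_mode mode,
			      struct clock_event_device *evt)
{
	/* program the hardware here: PERIODIC loads (1/HZ) sec worth of
	 * cycles, ONESHOT leaves programming to set_next_event() */
}

static int my_timer_set_next_event(unsigned long cycles,
				   struct clock_event_device *evt)
{
	/* load 'cycles' into the timer and start it */
	return 0;
}

static struct clock_event_device my_clkevt = {
	.name		= "my-timer",
	.features	= CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
	.shift		= 32,
	.rating		= 300,
	.set_mode	= my_timer_set_mode,
	.set_next_event	= my_timer_set_next_event,
};

static void __init my_timer_clockevent_init(void)
{
	my_clkevt.mult		= div_sc(TIMER_CLK_HZ, NSEC_PER_SEC, my_clkevt.shift);
	my_clkevt.max_delta_ns	= clockevent_delta2ns(0xffffffff, &my_clkevt);
	my_clkevt.min_delta_ns	= clockevent_delta2ns(0xf, &my_clkevt);
	my_clkevt.cpumask	= cpumask_of(0);
	clockevents_register_device(&my_clkevt);
}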


Kernel Timer Systems - eLinux.org
http://elinux.org/Kernel_Timer_Systems#clock_events

第七章 Linux内核的时钟中断(中) (Chapter 7: The Linux kernel's timer interrupt, middle part)
http://www.myfaq.com.cn/2005September/2005-09-13/202037.html

Lab 7 timer interrupt
http://opencsl.openfoundry.org/Lab07_timer_interrupt.rst.html

hrtimer + clockevent + timekeeping
http://www.360doc.com/content/09/0715/19/74585_4282710.shtml
[精华] 研究下hrtimer及内核clock/timer子系统变化 ([featured] A look at hrtimer and changes in the kernel clock/timer subsystems)
http://www.unixresources.net/linux/clf/linuxK/archive/00/00/66/47/664730.html

struct clocksource
clocksource_register()

struct clock_event_device
clockevents_register_device()

The kernel currently has three ways of handling the tick interrupt: periodic, highres, and dynamic tick.

Times in Kernel
  1. system time
    nanoseconds since the system booted
    A monotonically increasing value that represents the amount of time the system has been running.
    A monotonically increasing running time; it can be computed from the time source, xtime, and wall_to_monotonic.

    system_time = xtime + cyc2ns(clock->read() - clock->cycle_last) + wall_to_monotonic;
  2. wall time (xtime)
    A value representing the human time of day, as seen on a wrist-watch. Real time. xtime.
  3. time source (clocksource)
    A representation of a free running counter running at a known frequency, usually in hardware, e.g. a GPT. The counter value is obtained via clocksource->read().
  4. tick
    A periodic interrupt generated by a hardware-timer, typically with a fixed interval

    defined by HZ: jiffies

real time is the number of nanoseconds since 1970:
real_time = xtime + cyc2ns(clock->read() - clock->cycle_last)

wall_to_monotonic records the xtime (real time) at boot:
system_time = xtime + cyc2ns(clock->read() - clock->cycle_last) + wall_to_monotonic;

2.6.31.1, getrawmonotonic():
system_time = clocksource->raw_time + cyc2ns(clock->read() - clock->cycle_last)

initialization:
xtime = read_persistent_clock(), which usually returns 0; the arch code might get the value from an RTC or flash.
wall_to_monotonic = -xtime
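
A small sketch tying the formulas above together (2.6.31-era field names; illustration only, not the real timekeeping code):

#include <linux/clocksource.h>
#include <linux/time.h>

static s64 sketch_system_time_ns(struct clocksource *clock,
				 struct timespec xtime,
				 struct timespec wall_to_monotonic)
{
	/* cycles elapsed since the last timekeeping update */
	cycle_t delta = (clock->read(clock) - clock->cycle_last) & clock->mask;
	/* cyc2ns(): cycles -> nanoseconds using the clocksource mult/shift */
	s64 delta_ns = ((s64)delta * clock->mult) >> clock->shift;

	/* system_time = xtime + cyc2ns(...) + wall_to_monotonic */
	return timespec_to_ns(&xtime) + delta_ns +
	       timespec_to_ns(&wall_to_monotonic);
}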

Tuesday, December 8, 2009

Using OProfile



Build Requirement
binutils
popt


Required Kernel Config
CONFIG_SMP cannot be enabled on ARM11 MPCore, because arch/arm/oprofile/op_model_mpcore.c is specific to arch/arm/mach-realview/.

Linux 2.6.35.12

General setup --->
[*] Profiling support
<M> OProfile system profiling


Kernel modules must be built with "-g" and not stripped. Since the compiler can optimize code to further improve performance, "-O3" is suggested.
Kernel hacking --->
[*] Kernel debugging
[*] Compile the kernel with debug info


To support call-graph profiling, the kernel and modules must NOT be built with gcc's -fomit-frame-pointer; enable CONFIG_FRAME_POINTER instead. Note that only x86 has CONFIG_ARCH_WANT_FRAME_POINTERS enabled on 2.6.35.
Kernel hacking --->
[*]Compile the kernel with frame pointers


Required tools
awk, objdump

Required files
vmlinux is required as input to OProfile.
An external disk may be required: the logs and samples stored under /var/lib/oprofile/, plus vmlinux, may take up to tens of MBs.
Commands
opcontrol --reset;
opcontrol --setup --vmlinux=<vmlinux_path>
opcontrol --start -V all;
opcontrol --shutdown;
opreport image:<vmlinux_path>,<module_path> -p <module_dir> -l -g -w

References
4. Configuration details
http://oprofile.sourceforge.net/doc/detailed-parameters.html
4.3. OProfile in timer interrupt mode
Note
This section applies to 2.6 kernels and above only.

In 2.6 kernels on CPUs without OProfile support for the hardware performance counters, the driver falls back to using the timer interrupt for profiling. Like the RTC mode in 2.4 kernels, this is not able to profile code that has interrupts disabled. Note that there are no configuration parameters for setting this, unlike the RTC and hardware performance counter setup.

You can force use of the timer interrupt by using the timer=1 module parameter (or oprofile.timer=1 on the boot command line if OProfile is built-in).


3. Interpreting call-graph profiles
http://oprofile.sourceforge.net/doc/interpreting-callgraph.html

OProfile usage
http://blog.chinaunix.net/space.php?uid=20585891&do=blog&cuid=1110505

使用oprofile分析性能瓶頸(1) (Using oprofile to analyze performance bottlenecks, part 1)
http://tw.myblog.yahoo.com/chimei-015/article?mid=1023&prev=1024&next=1022

oprofile抓不到采样数据问题和解决方法 (oprofile not capturing sample data: problem and solution)
http://blog.yufeng.info/archives/1283


warning: /no-vmlinux could not be found.

OProfile unable to find image.
http://old.nabble.com/OProfile-unable-to-find-image.-td29336543.html
The "--no-vmlinux" option is used when you are not interested in analyzing the samples from the kernel. The samples from the kernel are recorded in /novmlinux. However, the needed information used by opreport is missing.

CPU: CPU with timer interrupt, speed 0 MHz (estimated)

OProfile gets the CPU speed by parsing /proc/cpuinfo; it just didn't find what it expects there.
libutil/op_cpufreq.c
double op_cpu_frequency(void)
{
	(................)

	FILE * fp = op_try_open_file("/proc/cpuinfo", "r");

	(................)

	/* x86/parisc/ia64/x86_64 */
	if (sscanf(line, "cpu MHz : %lf", &fval) == 1)
		break;
	/* ppc/ppc64 */
	if (sscanf(line, "clock : %lfMHz", &fval) == 1)
		break;
	/* alpha */
	if (sscanf(line, "cycle frequency [Hz] : %lu", &uval) == 1) {
		fval = uval / 1E6;
		break;
	}
	/* sparc64 if CONFIG_SMP only */
	if (sscanf(line, "Cpu0ClkTck : %lx", &uval) == 1) {
		fval = uval / 1E6;
		break;
	}

Sunday, December 6, 2009

How all the secondary cores boot in MPCore

http://lists.arm.linux.org.uk/lurker/message/20080611.175337.ecfc6e1c.en.html#linux-arm

Author: Catalin Marinas
Date: 2008-06-12 01:53 +800
To: Charly Bechara
CC: linux-arm
Subject: Re: How all the secondary cores boot in MPCore
On Wed, 2008-06-11 at 15:33 +0000, Charly Bechara wrote:
> I am investigating the boot process of the ARM11 MPCore on the
> PB11MPCore (mach-realview) board.
>
> Initially, the start_kernel() function is executing on CPU0, it
> creates kernel_init thead which I assume it is also executing on CPU0
> or am I wrong?

That's correct, it runs on CPU 0.

> When kernel_init thread executes, it calls smp_prepare_cpus() in
> arch/arm/mach-realview/platsmp.c code, where it is supposed to start the
> secondary processors using secondary_startup() (head.S)

smp_prepare_cpus() calls poke_milo() which triggers the other CPUs to
execute realview_secondary_startup.

> After this stage, I am completely lost and I couldnt find any related
> documentation or understand the code :(

Maybe not that clear but it might help:

secondary_startup (in arch/arm/kernel/head.S) does a similar thing to
the initial CPU setup (stext in the same file), i.e. it looks up the
processor type and calls the processor initialisation function
(__v6_setup in arch/arm/mm/proc-v6.S). When returning from the setup
function, it gets into __enable_mmu followed by __turn_mmu_on. The
latter branches to __secondary_data_switch which branches to
secondary_start_kernel (in arch/arm/kernel/smp.c) after setting the
stack pointer to a thread structure allocated via cpu_up() called from
smp_init() called from kernel_init().

secondary_start_kernel (in arch/arm/kernel/smp.c) does some further
initialisation (local timers etc.) and calls cpu_idle() (not(e?) that
secondary_start_kernel is already considered a kernel thread as
described above). From this point, it is up to the scheduler to migrate
threads to the new CPUs since they are initially only executing the idle
task.

--
Catalin

Saturday, December 5, 2009

ARM11 MPCore Cache Coherency issue with DMA and SMP

Implementing DMA on ARM SMP Systems
http://infocenter.arm.com/help/topic/com.arm.doc.dai0228a/DAI228A_DMA_on_SMP_systems.pdf
In short, on ARM11 MPCore with SMP enabled, cache operations (inv/clean/...) should be done on ALL of the cores, or stale data could still be accessed. Four solutions are provided; solutions A and B are application dependent.

PERFORMANCE ISSUE
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005915.html
Linux 2.6.31.1, pcie sata adapter, ahci.c, read 1GB file

1CPU:13.56/13.56/13.60 (sec) ~ 75.5MBps
SMP+IPI:16.29/16.14/16.05 (sec) ~ 63.8MBps
SMP+RFO/WFO:21.71/21.72/21.70 (sec) ~ 47.18MBps
SMP+RFO/WFO/pld:21.63/21.46/21.41 (sec) ~ 47.82MBps

Interrupt of AHCI and dma cache IPI(int # of IPI DMA cache operation/Interrupt # of AHCI)
MYARCH_ahci: 98509/ 4505
pcie_ahci: 81792/ 4501

Both solutions suffer a performance drop. Drivers that accept cacheable buffers and do DMA are affected, e.g. network, USB, storage, etc., which unfortunately are usually the important blocks. Currently I think the ARM11 MPCore cannot be used for Linux SMP. There might be a chance for AMP.

Solution C. Read for ownership
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005854.html

* Drop my DMA broadcasting patch
* In the dma_cache_maint (and the contiguous one) do the following based on direction:
  * TO_DEVICE: read each cache line in the buffer (you can read a long variable every 32 bytes) before the local cache maintenance operations. This ensures that any dirty cache lines on other CPUs are transferred to L2 (or main memory) or the current CPU and the cache operation would have the intended effect. The cache lines on other CPUs may change to a Shared state (see the MESI protocol)
  * FROM_DEVICE: we don't care about the buffer, just write 0 in each cache line (as above, you can only write a long every 32 bytes). This ensures that the cache lines become Exclusive to the current CPU (no other CPU has any copies) and the invalidation would ensure that there are no cache lines on any CPU
  * BIDIRECTIONAL: read a word in a cache line and write it back. After cache clean&invalidate, the cache lines would be removed from all the existing CPUs
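
A minimal sketch of the TO_DEVICE half of this "read for ownership" idea (32-byte cache lines assumed, the helper is made up): touching one word per cache line pulls any dirty copies from the other CPU before the local maintenance operation runs.

#include <linux/types.h>

#define CACHE_LINE_SIZE	32

/* read one word per cache line across the DMA buffer (TO_DEVICE case) */
static void rfo_read(const void *start, size_t size)
{
	const char *p = start;
	const char *end = p + size;

	for (; p < end; p += CACHE_LINE_SIZE)
		(void)*(const volatile unsigned long *)p;
}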

formal patch not yet available, try Catalin Marinas's patch,
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005860.html
Or my patch:
http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/aed3f989/attachment-0001.obj)
Catalin Marinas's formal commit
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=8108d60829c2d10fe62aaa7b2fae10f0e4abad36

There is a performance drop: in my case, SATA read performance is 64MBps with the IPI patch and becomes 44MBps with the RFO patch, which is about 31% lower.
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005863.html
I would be very surprised if going down this route doesn't result in
block IO data performance (and network performance) dropping by more
than 60% of the DMA value (that's DMA performance * 0.4).





Solution D. Broadcast of cache maintenance operations
By IPI

Currently this solution would cause a deadlock in ata_scsi_queuecmd. The deadlock situation and the patch that could fix this:
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/006051.html
http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20091210/520c074d/attachment-0001.obj
However, it is not a one-shot solution; it has to be fixed wherever it is encountered. And it breaks the atomic context:
http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005858.html

Having functions which enable interrupts while the parent is supposed to
be in an atomic context is definitely a recipe for things to go badly
wrong - this is not a solution anyone in their right mind should
contemplate.


The following patch uses an IPI (Inter-Processor Interrupt) to broadcast the dma_cache_maint operation.
http://linux-arm.org/git?p=linux-2.6-stable.git;a=commitdiff;h=95298b1792121e7068258de451caec7f3dda0e78
linux-2.6-stable.git@linux-arm.org
commit 95298b1792121e7068258de451caec7f3dda0e78
Author: Catalin Marinas <catalin.marinas@arm.com>
Date: Tue Mar 10 10:22:54 2009 +0000

http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=f1c242dc5f326713578e469c9f5be647978ebe24
linux-2.6-git@linux-arm.org
commit f1c242dc5f326713578e469c9f5be647978ebe24
Author: Catalin Marinas <catalin.marinas@arm.com>
Date: Wed Oct 28 13:27:49 2009 +0000

http://lists.infradead.org/pipermail/linux-arm-kernel/2009-December/005568.html
The new updated patch, which includes the fix for dma_cache_maint_contiguous:
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=6dd5056b9abe1e38fae3eb8d576e562f49452b0f


Even with the above patch, tests may still fail. USB EHCI and a flash drive were used to verify this, by repeatedly reading a 1MB file from flash and comparing it to the original file in the ramdisk.
Without SMP, no failures.
SMP without L1, no failures.
SMP with L1, 13.26% failed.
If I force dma_unmap_single to do dma_cache_maint, the SMP-with-L1 failure rate drops to 0.3~0.4%.

By FIQ
Laguna SMP Benchmarks 10001092-00.pdf
http://trac.gateworks.com/wiki/laguna%3Agw2388-4%3Asmp_benchmarks

linux-3.2 in src/linux/laguna – DD-WRT
http://svn.dd-wrt.com:8000/browser/src/linux/laguna/linux-3.2
http://svn.dd-wrt.com:8000/browser/src/linux/laguna/linux-3.2?rev=18083
first patch on the site
http://svn.dd-wrt.com:8000/browser/src/linux/laguna/linux-3.0?rev=17578 


[RFC PATCH] Broadcast the DMA cache operations on ARMv6 SMP hardware
http://lists.arm.linux.org.uk/lurker/message/20080620.124707.2bff9c7f.en.html
http://lists.arm.linux.org.uk/lurker/message/20080620.154546.aaa33d72.en.html
> By the way, besides these dmac cache functions,
> don't you think other cache mantain functions
> like clean_dcache_area() need broadcasting?


AFAIK, it was discussed some years ago and these operations don't
require broadcasting.
Basically, once a write to a memory location
occurs, the MESI protocol used by the SCU ensures that the owner of that
cache line is CPU that did the writing. If the cacheline exists on other
CPUs, it is invalidated. Therefore a cache cleaning operation on the CPU
that did the writing is enough since no other CPU has a valid cache
line.

The problem is slightly different with the DMA API since the driver
might invalidate an area of memory (dma_map_singe(FROM_DEVICE)) without
reading or writing it before and hence the CPU is not the owner of those
lines. The same goes for cleaning or flushing since some drivers may run
dma_map_single(TO_DEVICE) in an interrupt routine handled on a CPU but
the buffer to be transmitted could have been written on a different CPU.

Regarding the I-cache invalidation (which, BTW, is completely missing
from the mainline kernel), the patch I proposed (posted again last week)
does this when a thread migrates to another CPU that it hadn't run on
before and there is no need for broadcasting as we track the CPU via
mm->cpu_vm_mask (see switch_mm in mmu_context.h).

The following patch fixes my issue. The change made to dma_cache_maint should also be applied to dma_cache_maint_contiguous, which is called by dma_cache_maint_page.
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index be56c43..15dafb6 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -584,15 +584,15 @@ static void dma_cache_maint_contiguous(struct page *page,

   switch (direction) {
   case DMA_FROM_DEVICE:           /* invalidate only */
-               inner_op = dmac_inv_range;
+               inner_op = smp_dma_inv_range;
           outer_op = outer_inv_range;
           break;
   case DMA_TO_DEVICE:             /* writeback only */
-               inner_op = dmac_clean_range;
+               inner_op = smp_dma_clean_range;
           outer_op = outer_clean_range;
           break;
   case DMA_BIDIRECTIONAL:         /* writeback and invalidate */
-               inner_op = dmac_flush_range;
+               inner_op = smp_dma_flush_range;
           outer_op = outer_flush_range;
           break;
   default:




Call flush_dcache_page after PIO data transfer in libata-sff.c
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=026f474ca17dd758#patch1
When reading data from an ATA device using PIO, the kernel dirties the
D-cache but there is no flush_dcache_page() call in ata_pio_sector().
Since neither the VFS layer calls this function, a subsequent
update_mmu_cache() is not aware of the dirty page which may lead to
cache incoherency in user space.


Call flush_dcache_page in usb_stor_access_xfer_buf
http://linux-arm.org/git?p=linux-2.6.git;a=commitdiff;h=d0c91030c392ef4e
Transferring buffers using memcpy dirties the D-cache but there is no
corresponding flush_dcache_page call which leads to data corruption in
user-space.


in setup_processor():
struct cpu_cache_fns* cpu_cache = __v6_proc_info.cache = v6_cache_fns