Thursday, September 1, 2016

Random abnormal high CPU sys usage related to timer

Test files and logs are available on GitHub:
https://github.com/mkl0301/abnormal-cpu-load


20160910 update:
I still don't know whether this is merely a display issue or whether something is actually being executed.
  1. Even while the issue is happening, the application seems to work normally.
  2. But in the earlier perf report, the sampling rate is higher when the issue occurs than when it does not.
I happened to find that on v3.13, enabling CONFIG_CONTEXT_TRACKING_FORCE fixes the issue. Unfortunately the kernel I'm using is 3.4, which predates the context tracking feature, so the hunt goes on...

Forcing context tracking only works up to 4.5; on 4.6 and later the issue still appears.



Random abnormal high CPU sys usage related to timer
https://lkml.org/lkml/2016/8/26/383

We are having an issue with our userspace application, which
__sometimes__ shows high CPU sys usage when it is executed. The high
sys CPU usage persists until the application is killed.

We simplified the application so that it just creates a timer and its
handler, then does nothing but loop and sleep, waiting for the timer to
be triggered. In top, the CPU running the application usually shows
almost 0% sys usage. But sometimes it shows a significant amount of sys
usage, most of the time reaching up to 100% on my embedded
device.
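
For readers who don't want to open the repository, here is a minimal sketch of what such a reduced program might look like. It is not the actual test file from GitHub: the use of setitimer() with SIGALRM and the names on_tick, start_periodic_timer, and wait_for_ticks are assumptions made for illustration only.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* Number of timer expirations seen so far (hypothetical name). */
static volatile sig_atomic_t tick_count = 0;

/* The handler does no real work; it only counts expirations. */
static void on_tick(int signo)
{
    (void)signo;
    tick_count++;
}

/*
 * Install the SIGALRM handler and arm a periodic real-time interval timer.
 * interval_ms must be greater than 0, otherwise the timer is disarmed.
 * Returns 0 on success, -1 on failure.
 */
static int start_periodic_timer(long interval_ms)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    struct itimerval it;
    it.it_value.tv_sec = interval_ms / 1000;
    it.it_value.tv_usec = (interval_ms % 1000) * 1000;
    it.it_interval = it.it_value; /* re-arm with the same period */
    return setitimer(ITIMER_REAL, &it, NULL);
}

/* Sleep in pause() until at least `ticks` expirations have been counted. */
static void wait_for_ticks(int ticks)
{
    while (tick_count < ticks)
        pause();
}
```

Calling start_periodic_timer(10) from main and then looping on pause() forever gives a process that should sit near 0% CPU, which is the baseline the abnormal runs are compared against.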

On my laptop (Intel Core i5-4200U, Ubuntu 14.04.2, Linux
3.13.0-45-generic and 4.4.0-34-generic), the issue can be reproduced
with lower sys usage (7~50%). It can also be reproduced with
buildroot + vanilla kernel 4.7 and 3.13.

Restarting the application can temporarily fix the issue, but it may
happen again.

/proc/timer_stats, /proc/interrupts, and perf didn't show any abnormal values or useful clues.
Comparing the logs of the good and failing cases shows the following:
  • /proc/timer_stats is almost the same, but the perf events show extra softirq/timer events.
  • perf collects many more samples in the failing case than in the good case, but the proportions of the sampled functions are basically the same.
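
When comparing good and failing runs it can help to log the per-CPU sys share directly instead of reading it off top. The sketch below assumes the usual /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal) and computes the system-time percentage between two samples of the same "cpuN" line. The struct and function names are hypothetical and not part of the test files.

```c
#include <stdio.h>
#include <string.h>

/* Cumulative tick counters from one "cpuN ..." line of /proc/stat. */
struct cpu_sample {
    unsigned long long user, nice, sys, idle, iowait, irq, softirq, steal;
};

/* Parse one "cpu" or "cpuN" line. Returns 0 on success, -1 otherwise. */
static int parse_cpu_line(const char *line, struct cpu_sample *s)
{
    memset(s, 0, sizeof(*s));
    if (strncmp(line, "cpu", 3) != 0)
        return -1;
    /* %*s skips the "cpuN" label; missing trailing fields stay 0. */
    int n = sscanf(line, "%*s %llu %llu %llu %llu %llu %llu %llu %llu",
                   &s->user, &s->nice, &s->sys, &s->idle,
                   &s->iowait, &s->irq, &s->softirq, &s->steal);
    return n >= 4 ? 0 : -1;
}

/* Sum of all counters in a sample. */
static unsigned long long total_ticks(const struct cpu_sample *s)
{
    return s->user + s->nice + s->sys + s->idle +
           s->iowait + s->irq + s->softirq + s->steal;
}

/* Percentage of elapsed ticks spent in system mode between two samples. */
static double sys_percent(const struct cpu_sample *before,
                          const struct cpu_sample *after)
{
    unsigned long long total = total_ticks(after) - total_ticks(before);
    if (total == 0)
        return 0.0;
    return 100.0 * (double)(after->sys - before->sys) / (double)total;
}
```

Reading the relevant line from /proc/stat once per second and feeding consecutive samples to sys_percent gives a number comparable to the "sy" column in top, so it can be logged next to the perf data.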


I found one report that seems related, but there was no further action on it.

Keystone II Linux: Random High CPU usage - userspace application using 1 full core - Linux forum - Linux - TI E2E Community
http://e2e.ti.com/support/embedded/linux/f/354/p/433791/1553204



Documentation/cpu-load.txt discusses a situation in which the CPU load reported by top may be underestimated. Its C code is very similar to ours, but it addresses a different topic.

Documentation/cpu-load.txt
https://www.kernel.org/doc/Documentation/cpu-load.txt
https://lkml.org/lkml/2007/2/12/6




Why you should avoid using SIGALRM for timer – Linux, Embedded, Android and Security blog
https://nativeguru.wordpress.com/2015/02/19/why-you-should-avoid-using-sigalrm-for-timer/

Linux timer | 菜鳥的三年成長史
https://wirelessr.gitbooks.io/working-life/content/linux_timer.html
The best timer | 菜鳥的三年成長史
https://wirelessr.gitbooks.io/working-life/content/the_best_timer.html


Linux Timers | Blog | Upvoid
https://upvoid.com/devblog/2014/05/linux-timers/
