如何用Trace32分析內(nèi)核死機(干貨~)
dmesg 初步分析
[ ?423.400073] Unable to handle kernel NULL pointer dereference at virtual address 00000008
[ ?423.400075] [silead finger_interrupt_handler 505]:S IRQ 19 , GPIO 12 state is 0
[ ?423.400083] [silead finger_interrupt_handler 506]:state is 0
[ ?423.400096] pgd = ffffffc0017eb000
[ ?423.400103] [00000008] *pgd=000000008ea0a003, *pud=000000008ea0a003, *pmd=000000008ea0b003, *pte=006000000b000707
[ ?423.400124] Internal error: Oops: 96000046 [#1] PREEMPT SMP
[ ?423.400132] Modules linked in: wlan(O)
[ ?423.400148] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G ? ? ? ?W ?O ? 3.18.31-perf #1
[ ?423.400155] Hardware name: Qualcomm Technologies, Inc. MSM8940-PMI8950 MTP (DT)
[ ?423.400164] task: ffffffc04eae4980 ti: ffffffc0b28bc000 task.ti: ffffffc0b28bc000
[ ?423.400182] PC is at run_timer_softirq+0x4ac/0x5ec
[ ?423.400192] LR is at run_timer_softirq+0x324/0x5ec
[ ?423.400199] pc : [<ffffffc0000feb98>] lr : [<ffffffc0000fea10>] pstate: 600001c5
[ ?423.400204] sp : ffffffc0b28bfb60
...
[ ?423.401490] Process swapper/4 (pid: 0, stack limit = 0xffffffc0b28bc058)
[ ?423.401496] Call trace:
[ ?423.401510] [<ffffffc0000feb98>] run_timer_softirq+0x4ac/0x5ec
[ ?423.401523] [<ffffffc0000a6864>] __do_softirq+0x178/0x350
[ ?423.401532] [<ffffffc0000a6c8c>] irq_exit+0x74/0xb0
[ ?423.401543] [<ffffffc0000edf18>] __handle_domain_irq+0xb4/0xec
[ ?423.401553] [<ffffffc00008254c>] gic_handle_irq+0x54/0x84
[ ?423.401560] Exception stack(0xffffffc0b28bfd40 to 0xffffffc0b28bfe60)
...
[ ?423.401671] [<ffffffc000085da8>] el1_irq+0x68/0xd4
[ ?423.401685] [<ffffffc000851480>] cpuidle_enter_state+0xd0/0x224
[ ?423.401695] [<ffffffc0008516ac>] cpuidle_enter+0x18/0x20
[ ?423.401706] [<ffffffc0000e1cc0>] cpu_startup_entry+0x288/0x384
[ ?423.401717] [<ffffffc000091d5c>] secondary_start_kernel+0x108/0x114
[ ?423.401728] Code: b90052a0 34000840 f9400321 f9400720 (f9000420)
[ ?423.401736] ---[ end trace d0daa1892c14378b ]---
[ ?423.401753] Kernel panic - not syncing: Fatal exception in interrupt
[ ?423.401774] CPU7: stopping
連上trace32,load vmlinux后通過list source看下匯編源碼混合顯示,如下:

FFFFFFC0000FE968: 當(dāng)前匯編指令的虛擬地址
F9000760: 匯編機器碼,ARM/ARM64的指令機器碼都是32位固定長度
str x0,[x27, #0x8]: 匯編指令,;后的是注釋
根據(jù)AAPCS(ARM二進(jìn)制過程調(diào)用標(biāo)準(zhǔn))參數(shù)傳遞規(guī)則,ARM64的 v0 - v7 參數(shù)直接由 x0 - x7 傳遞,其他參數(shù)由壓棧傳遞,子程序返回結(jié)果保存到x0。
那么這里可推導(dǎo)如下:
x0 == prev, x1 == next
指令:str x0 ,[x1,#0x8] x1+0x8 其實就是next+8個字節(jié)的偏移,看下:
struct list_head {
? ?struct list_head *next, *prev;
};
ARM體系結(jié)構(gòu)中,ARM64一個指針占8個字節(jié)內(nèi)存,也就是: [x1+0x8] == prev
所以這個str指令就是對應(yīng)上面的next->prev = prev
我們根據(jù)異常棧的寄存器值來看下:
[ ?423.400182] PC is at run_timer_softirq+0x4ac/0x5ec
[ ?423.400192] LR is at run_timer_softirq+0x324/0x5ec
[ ?423.400199] pc : [<ffffffc0000feb98>] lr : [<ffffffc0000fea10>] pstate: 600001c5
[ ?423.400204] sp : ffffffc0b28bfb60
[ ?423.400210] x29: ffffffc0b28bfb60 x28: ffffffc0b2619038
[ ?423.400219] x27: ffffffc000c9a000 x26: 0000000000000000
[ ?423.400228] x25: ffffffc001741120 x24: ffffffc0006e277c
[ ?423.400237] x23: ffffffc0b2619000 x22: ffffffc0b28bfbf8
[ ?423.400246] x21: ffffffc0b28bc000 x20: ffffffc0013d2000
[ ?423.400254] x19: ffffffc0b2618000 x18: 0000007f9588e080
[ ?423.400263] x17: 0000007f9a36d4b4 x16: ffffffc0001e4f6c
[ ?423.400272] x15: 003b9aca00000000 x14: 0000000000000001
[ ?423.400280] x13: 0000000000000000 x12: 0000000000000001
[ ?423.400289] x11: 000000000000000f x10: ffffffc000c9c3f4
[ ?423.400297] x9 : 0000000000000000 x8 : 0000000000000005
[ ?423.400305] x7 : 0000000000000000 x6 : 000000000001451c
[ ?423.400314] x5 : ffffffc0b2ae8000 x4 : 00135f1aa15bb200
[ ?423.400323] x3 : 0000000000000018 x2 : 0000000000000000
[ ?423.400331] x1 : 0000000000000000 x0 : ffffffc0b28bfbf8
上面可以看到,x1+0x8 ==0x0000000000000000+0x8==0x0000000000000008,這個和出錯時候報的地址一致“Unable to handle kernel NULL pointer dereference at virtual address 00000008”
因為ARM64內(nèi)核的虛擬地址空間范圍是0xFFFF_0000_0000_0000 =>0xFFFF_FFFF_FFFF_FFFF,很明顯這個值0x0000000000000008不在范圍內(nèi),它屬于用戶空間的虛擬地址空間,肯定會被MMU攔截掉上報data abort異常,所以此題的真正原因是程序跑飛訪問非法地址所致。
目前看來從kernel log上的信息無法直接分析出導(dǎo)致問題的具體源代碼,從dmesg的這些信息我們已經(jīng)知道出問題的是這個prev指針,但是比較難直接抓到導(dǎo)致異常的真兇源碼位置。
【文章福利】小編推薦自己的Linux內(nèi)核技術(shù)交流群:【891587639】整理了一些個人覺得比較好的學(xué)習(xí)書籍、視頻資料共享在群文件里面,有需要的可以自行添加哦?。。。ê曨l教程、電子書、實戰(zhàn)項目及代碼)? ? ?


Trace32 分析
利用dmesg我們分析了產(chǎn)生問題時候的來龍去脈,但是想要徹底解決還需要trace32進(jìn)一步分析。
輸入v.f,調(diào)出當(dāng)前的調(diào)用棧關(guān)系:

可以看到,異常時候的各種參數(shù)都顯示出來了,這樣就非常有利于我們debug了,這也是單純從dmesg無法得到的重要信息!注意inline類型的函數(shù)會被編譯器默認(rèn)優(yōu)化掉,所以inline類型的函數(shù)的參數(shù)不可見,需要通過讀匯編代碼,分析寄存器傳參推導(dǎo)。
輸入d.list 查看PC停止的位置,如下高亮:

分析Call Stack:
為方便查看,把調(diào)用棧信息復(fù)制出來,如下:
1. ...
-007|die(
? ?| ??,
? ?| ? ?regs = 0xFFFFFFC0B28BFA40 -> (
? ?| ? ? ?user_regs = (regs = (0xFFFFFFC0B28BFBF8, 0x0, 0x0, 0x18, 0x00135F1AA15BB200, 0xFFFFFFC0B2AE800
? ?| ? ? ?regs = (0xFFFFFFC0B28BFBF8, 0x0, 0x0, 0x18, 0x00135F1AA15BB200, 0xFFFFFFC0B2AE8000, 0x0001451C
? ?| ? ? ?sp = 0xFFFFFFC0B28BFB60,
? ?| ? ? ?pc = 0xFFFFFFC0000FEB98,
? ?| ? ? ?pstate = 0x600001C5,
? ?| ? ? ?orig_x0 = 0xFFFFFFC0B2618000,
? ?| ? ? ?syscallno = 0xFFFFFFC0000FE7D0),
? ?| ? ?err = 0x96000046)
? ?| ?flags = 0x01C0
? ?| ?ret = 0x1
? ?| ?tsk = 0xFFFFFFC04EAE4980
? ?| ?die_counter = 0x1
-008|__do_kernel_fault.part.5(
? ?| ? ?mm = 0x0,
? ?| ? ?addr = 0x8,
? ?| ? ?esr = 0x96000046,
? ?| ? ?regs = 0xFFFFFFC0B28BFA40)
-009|do_page_fault(
? ?| ? ?addr = 0x8,
? ?| ? ?esr = 0x96000046,
? ?| ? ?regs = 0xFFFFFFC0B28BFA40)
? ?| ?tsk = 0xFFFFFFC04EAE4980
? ?| ?mm = 0x0
? ?| ?vm_flags = 0xFFFFFFC000C9A000
? ?| ?vma = 0xFFFFFFC0B28BFA40
-010|do_translation_fault(
? ?| ? ?addr = 0x0A44,
? ?| ? ?esr = 0x0124F800,
? ?| ??)
-011|do_mem_abort(
? ?| ? ?addr = 0x8,
? ?| ? ?esr = 0x96000046,
? ?| ? ?regs = 0xFFFFFFC0B28BFA40)
? ?| ?inf = 0xFFFFFFC0013DC790 -> (
? ?| ? ?fn = 0xFFFFFFC000099A74,
? ?| ? ?sig = 0x0B,
? ?| ? ?code = 0x00030001,
? ?| ? ?name = 0xFFFFFFC0010DF250 -> 0x6C)
? ?| ?info = (
? ?| ? ?si_signo = 0x0032D110,
? ?| ? ?si_errno = 0xFFFFFFC0,
? ?| ? ?si_code = 0x01C0,
? ?| ? ?_sifields = (_pad = (0x7, 0x0, 0xB28BF9E0, 0xFFFFFFC0, 0x000A6D78, 0xFFFFFFC0, 0xB28BF9F0, 0xFFF
-012|el1_da(asm)
-->|exception
-013|__list_del(inline)
-013|detach_timer(inline)
-013|detach_expired_timer(inline)
-013|__run_timers(inline)
-013|run_timer_softirq(
? ?| ??)
? ?| ?base = 0xFFFFFFC0B2618000 -> (
? ?| ? ?lock = (rlock = (raw_lock = (owner = 0x6FCD, next = 0x6FCE))),
? ?| ? ?running_timer = 0xFFFFFFC001741120 -> (
? ?| ? ? ?entry = (next = 0xFFFFFFC0B27BC9B8, prev = 0xFFFFFFC0B27BC9B8),
? ?| ? ? ?expires = 0x0000000100003098,
? ?| ? ? ?base = 0xFFFFFFC0B27BC000,
? ?| ? ? ?function = 0xFFFFFFC0006E277C -> ,
? ?| ? ? ?data = 0x0,
? ?| ? ? ?slack = 0xFFFFFFFF,
? ?| ? ? ?start_pid = 0xFFFFFFFF,
? ?| ? ? ?start_site = 0x0,
? ?| ? ? ?start_comm = (0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0))
? ?| ? ?timer_jiffies = 0x0000000100003035,
? ?| ? ?next_timer = 0x0000000100003034,
? ?| ? ?active_timers = 0x7,
? ?| ? ?all_timers = 0x7,
? ?| ? ?cpu_=_0x4,
? ?| ? ?tv1 = (vec = ((next = 0xFFFFFFC0B2618038, prev = 0xFFFFFFC0B2618038), (next = 0xFFFFFFC0B2618048
? ?| ? ?tv2 = (vec = ((next = 0xFFFFFFC0B2619038, prev = 0xFFFFFFC0B2619038), (next = 0xFFFFFFC0B2619048
? ?| ? ?tv3 = (vec = ((next = 0xFFFFFFC0B2619438, prev = 0xFFFFFFC0B2619438), (next = 0xFFFFFFC0B2619448
? ?| ? ?tv4 = (vec = ((next = 0xFFFFFFC0B2619838, prev = 0xFFFFFFC0B2619838), (next = 0xFFFFFFC0B2619848
? ?| ? ?tv5 = (vec = ((next = 0xFFFFFFC0B2619C38, prev = 0xFFFFFFC0B2619C38), (next = 0xFFFFFFC0B2619C48
? ?| ?fn = 0xFFFFFFC0006E277C ->
? ?| ?data = 0x0
? ?| ?it_func_ptr = 0x0
...
看到這里,我們可以猜想是不是run_timer_softirq的參數(shù)出現(xiàn)了問題導(dǎo)致后面發(fā)生的一系列異常?可以從這個方向開始思考,我們先來看下這個函數(shù)的實現(xiàn):
static void run_timer_softirq(struct softirq_action *h)
{
? ?struct tvec_base *base = __this_cpu_read(tvec_bases);
? ?hrtimer_run_pending();
? ?__run_deferrable_timers();
? ?if (time_after_eq(jiffies, base->timer_jiffies))
? ? ? ?__run_timers(base);
}
我們看到這個函數(shù)最重要的參數(shù)變量就是這個base,傳入的h沒有使用,繼續(xù)來看下base的結(jié)構(gòu)tvec_base :
struct tvec_base {
? ?spinlock_t lock;
? ?struct timer_list *running_timer;
? ?unsigned long timer_jiffies;
? ?unsigned long next_timer;
? ?unsigned long active_timers;
? ?unsigned long all_timers;
? ?int cpu; ?// 跟蹤所在的CPU是哪個核,這里是CPU 4
? ?struct tvec_root tv1;
? ?struct tvec tv2;
? ?struct tvec tv3;
? ?struct tvec tv4;
? ?struct tvec tv5;
} ____cacheline_aligned;
這里就看到 tvec_base 的結(jié)構(gòu)里面有個 struct timer_list * 的結(jié)構(gòu),我們繼續(xù)看它的結(jié)構(gòu)是怎么樣的:
struct timer_list {
? ?/*
? ? * All fields that change during normal runtime grouped to the
? ? * same cacheline
? ? */
? ?struct list_head entry;
? ?unsigned long expires;
? ?struct tvec_base *base;
? ?void (*function)(unsigned long);
? ?unsigned long data;
? ?int slack;
...
首先我們查看 running_timer 的數(shù)據(jù)內(nèi)容,
工具欄調(diào)出:view -> Watch,輸入:(struct timer_list *)0xffffffc001741120

這個就是發(fā)生異常的那個timer的數(shù)據(jù)結(jié)構(gòu)實例,我們最希望的就是希望可以通過這里的數(shù)據(jù)信息找到它在源碼的位置,然后進(jìn)一步分析它,使用Trace32的 dump 分析功能就可以做到這點。
菜單欄調(diào)出:view -> dump

輸入地址 0xffffffc001741120 然后點OK

右擊高亮,選擇view info:

同理,還可以看function的位置(0xFFFFFFC0006E277C):

上面所示,出異常的timer實例就是:fp_drv/m_timer, callback = timer_out_handle,源碼位置也給出來了,那么就可以著手修復(fù)問題了。
