[Case Study] Handling CRC-Related Alerts on 3PAR Storage

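I. Symptoms
A customer reported that the IMC management software showed a large number of CRC-related alerts on two ports, 0:0:1 and 0:0:2, of a 3PAR 8200 storage array, and asked for the fault to be resolved to keep the storage links stable.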
II. Fault Analysis
8200 cli% showportlesb single 0:0:1
ID      ALPA    ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:1> 0x10a00 20010002AC03DE4F      203     3652      23      85    4152   5023
※ Port 0:0:1 shows a clear increase in CRC errors.
8200 cli% showportlesb single 0:0:2
ID      ALPA    ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:2> 0x20d00 20020002AC01BE4E        8     2480       8       0    4406      0
host0   0x20000 500143802855522A        0        0       0       0       0      0
host1   0x20100 5001438028439225        0        0       0       0       0      0
host10  0x20200 10000090FAC0F1D8        0        3       0       0      14      1
host6   0x20500 10000090FAC0B96A        0    25899       2       0   68678    424
host5   0x20400 10000090FAC0B78E        1     1233       2       0   61019     82
host11  0x20300 10000090FAC0B556        0        5       0       0      17      0
host8   0x20700 10000090FAC0B221        0   170797       2       0  195421   3083
host13  0x20600 10000090FAD0F665        1    11226       4       0  101409 491523
※ host13 has a large number of CRC errors.
8200 cli% showportlesb hist 0:0:1
ID      ALPA    ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:1> 0x10a00 20010002AC03DE4F      317     2153       4       0    5362   2856

ID      ALPA    ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:1> 0x10a00 20010002AC01BE4E      317     2153       4       0    5362   2968

ID      ALPA    ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:1> 0x10a00 20010002AC01BE4E      317     2153       4       0    5362   3023
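※ The CRC error count on 0:0:1 keeps growing (2856, 2968, 3023 across the three samples), which indicates a problem with the I/O port or the link.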
8200 cli% showportlesb hist 0:0:2
ID      ALPA    ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:2> 0x20d00 20020002AC01BE4E        8     2480       8       0    4406      0
host0   0x20000 500143802855522A        0        0       0       0       0      0
host1   0x20100 5001438028439225        0        0       0       0       0      0
host10  0x20200 10000090FAC0F1D8        0        3       0       0      14      1
host6   0x20500 10000090FAC0B96A        0    25899       2       0   68678    424
host5   0x20400 10000090FAC0B78E        1     1233       2       0   61019     82
host11  0x20300 10000090FAC0B556        0        5       0       0      17      0
host8   0x20700 10000090FAC0B221        0   170797       2       0  195421   3083
host13  0x20600 10000090FAD0F665        1    11226       4       0  101409 491523

ID      ALPA    ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:2> 0x20d00 20020002AC01BE4E        8     2480       8       0    4406      0
host0   0x20000 500143802855522A        0        0       0       0       0      0
host1   0x20100 5001438028439225        0        0       0       0       0      0
host10  0x20200 10000090FAC0F1D8        0        3       0       0      14      1
host6   0x20500 10000090FAC0B96A        0    25899       2       0   68678    424
host5   0x20400 10000090FAC0B78E        1     1233       2       0   61019     82
host11  0x20300 10000090FAC0B556        0        5       0       0      17      0
host8   0x20700 10000090FAC0B221        0   170797       2       0  195421   3083
host13  0x20600 10000090FAD0F665        1    11226       4       0  101409 841226
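※ The command output shows that host13 is reporting a large number of errors (its InvCRC count rose from 491523 to 841226 between the two samples), so its link fault is severe.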
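The comparison between two samples can also be done mechanically. The following is a minimal Python sketch, assuming the output has been saved as whitespace-separated text with the nine columns shown above. The helper names `parse_sample` and `counter_growth` are hypothetical and are not part of the 3PAR CLI; only three rows from each sample are included to keep the example short.

```python
# Hypothetical helpers for illustration; not part of the 3PAR CLI.
# Assumes whitespace-separated rows with the nine columns shown above.
COLUMNS = ["LinkFail", "LossSync", "LossSig", "PrimSeq", "InvWord", "InvCRC"]


def parse_sample(text):
    """Map each row ID (e.g. 'host13') to a dict of its six counters."""
    rows = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip header lines and anything that is not a 9-field data row.
        if len(fields) != 9 or fields[0] == "ID":
            continue
        rows[fields[0]] = dict(zip(COLUMNS, map(int, fields[3:])))
    return rows


def counter_growth(first, second, column="InvCRC"):
    """Return {row_id: increase} for rows whose counter grew between samples."""
    return {
        row_id: second[row_id][column] - counters[column]
        for row_id, counters in first.items()
        if row_id in second and second[row_id][column] > counters[column]
    }


SAMPLE_1 = """
ID ALPA ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:2> 0x20d00 20020002AC01BE4E 8 2480 8 0 4406 0
host8 0x20700 10000090FAC0B221 0 170797 2 0 195421 3083
host13 0x20600 10000090FAD0F665 1 11226 4 0 101409 491523
"""

SAMPLE_2 = """
ID ALPA ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWord InvCRC
<0:0:2> 0x20d00 20020002AC01BE4E 8 2480 8 0 4406 0
host8 0x20700 10000090FAC0B221 0 170797 2 0 195421 3083
host13 0x20600 10000090FAD0F665 1 11226 4 0 101409 841226
"""

growth = counter_growth(parse_sample(SAMPLE_1), parse_sample(SAMPLE_2))
print(growth)  # {'host13': 349703}
```

host8 has a non-zero InvCRC total, but it does not change between samples, so only host13 is reported as still accumulating errors.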

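Why CRC errors occur:
During transmission, data may suffer bit errors (a 0 becomes a 1, or a 1 becomes a 0) because of a faulty transmission medium or external interference, so the receiver gets incorrect data. To maximize the correctness of received data, the receiver runs error detection before accepting the data and accepts it only when the check passes.
There are several error-detection methods; common ones are parity checks, checksums, and CRC. They all work the same way: the sender computes a check code over the data with some algorithm and transmits the check code along with the data, and the receiver then verifies it to determine whether the data was altered in transit.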
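As an illustration of that send-and-verify principle, here is a small Python sketch using the standard library's `zlib.crc32`. The frame layout (payload followed by a 4-byte CRC) and the names `make_frame` and `check_frame` are assumptions made for this example; this is not the actual Fibre Channel frame format.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the received one."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received


frame = make_frame(b"3PAR test payload")

corrupted = bytearray(frame)
corrupted[0] ^= 0x01  # flip one bit to simulate a transmission error

print(check_frame(frame))             # True
print(check_frame(bytes(corrupted)))  # False
```

A receiver that sees a mismatch like the second case discards the frame and counts it as a CRC error, which is what the InvCRC column accumulates.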
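III. Resolution Approach
0:0:1 port: CRC errors occur intermittently. First replace the SFPs on the link between 3PAR port 0:0:1 and the peer Fibre Channel switch. After the replacement, run the showportlesb reset command to reset the port's error counters. If the fault persists, the next step is to replace the fibre cable between the SAN switch and the 3PAR.
0:0:2 port: eight hosts are connected through this port, and only host13 shows CRC growth. If the port or its cabling were faulty, every host behind the port would be affected, not just host13. The fault therefore lies somewhere on the path from the host HBA's SFP, through the fibre cable, to the switch SFP, and has to be isolated by elimination.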
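The reasoning for the two ports can be written down as a rough heuristic. The sketch below is only an illustration under stated assumptions: the function `locate_segment` and its return labels are hypothetical, and the input is the per-row InvCRC increase between two samples (167 for 0:0:1 from 2856 to 3023, and 349703 for host13 from 491523 to 841226).

```python
def locate_segment(port_id, crc_growth):
    """Guess the faulty segment from per-row InvCRC growth (a heuristic only).

    crc_growth maps each row ID to its InvCRC increase between two samples.
    """
    hosts = {row: n for row, n in crc_growth.items() if row != port_id}
    growing = sorted(row for row, n in hosts.items() if n > 0)

    # Errors on the array port's own row, or on every host behind it,
    # point at the shared segment between the array port and the switch.
    if crc_growth.get(port_id, 0) > 0 or (hosts and len(growing) == len(hosts)):
        return ("array port <-> switch", [port_id])
    # Errors on only some hosts point at those hosts' own links.
    if growing:
        return ("host HBA <-> switch", growing)
    return ("no CRC growth observed", [])


port_1 = locate_segment("<0:0:1>", {"<0:0:1>": 167})
port_2 = locate_segment(
    "<0:0:2>",
    {
        "<0:0:2>": 0, "host0": 0, "host1": 0, "host10": 0, "host6": 0,
        "host5": 0, "host11": 0, "host8": 0, "host13": 349703,
    },
)
print(port_1)  # ('array port <-> switch', ['<0:0:1>'])
print(port_2)  # ('host HBA <-> switch', ['host13'])
```

The result only narrows down which segment to work on; the individual component (SFP, cable, switch port, or HBA) still has to be found by replacing parts one at a time.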
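IV. Fault Handling
0:0:1: the SFP and the fibre cable on 0:0:1 were eventually replaced, which resolved the fault.
0:0:2: the HBA in host13 was eventually replaced, which resolved the fault.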
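V. Lessons Learned
When handling CRC errors, check the command output to determine whether the errors occur on the host-to-switch segment or on the switch-to-storage segment. Once that is confirmed, work through the following steps in order: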
1. Cable between the host and the SAN switch associated with the ports part of the 3PAR SAN Zone.
2. Replace the SFP’s on the SAN switch.
3. Replace the SFP’s on the HBA.
4. Use a different port on the SAN Switch.
5. Replace the HBA on the server.
* The English steps above are quoted from the HP 3PAR storage vendor guide.
--END--