- 苏萦
What to do when a Linux system hangs?
2007-09-24 14:59

06.9.21

http://lopsa.org/pipermail/discuss/2006-January/000670.html

1. Flash upgrading BIOS of RAID controllers
2. Replacement of RAID controller cards (both of them)
3. Moving RAID cards to different PCI slots
4. Monitored temperatures via lm_sensors until a lockup to verify no heat problems
5. Removal of unnecessary kernel modules (USB, pcmcia, etc)
6. Disabling of APIC in BIOS
7. Adjustment of IOMMU memory aperture hole size in BIOS
8. Booting with various combinations of kernel parameters "noapic", "acpi=0", "nmi_watchdog=1"
9. Various kernels including minimal custom compiled ones
10. Full fsck bad blocks check of each individual hard disk in the arrays (took 2.x days)
11. Discontinued usage of eth0 as it seemed to be on same interrupt as one of the RAID controllers.
12. Stood on one foot while rebooting, while simultaneously sacrificing a chicken and burning a copy of MS-Windows (okay not really but if it would fix it I would do it!).
13. A full flogging with memtest, although I am skeptical that this is the problem...the RAM is ECC and the problem seems to be caused from a specific IO pattern.
14. Installing and booting a non-SMP kernel (I've read of similar problems when using the non smp 3ware driver in an SMP system....I doubt this is the answer either as I have tried custom compiled kernels without any luck)

06.9.10

Conditions at customer sites are often harsh. There is no chance to debug the kernel, or at best you get a single shot at a fix: either you solve it or you are out. What can you do in that situation? You can only rely on "accumulated experience + luck". Luck cannot be controlled, but there are ways to build up experience, such as keeping track of kernel development and reading the Red Hat, SUSE, and kernel.org bugzillas.

06.7.30

netconsole sends kernel messages over the network to another machine. In other words, a UDP port on another machine becomes the destination for kernel output (for example, messages can go to a remote syslog). It is easy to set up; for details see:

/usr/src/linux/Documentation/networking/netconsole.txt

For setting up a serial console, see this blog post by Dave Jones:

http://kernelslacker.livejournal.com/42428.html

05.12.18

Settings to apply after a hang, so that at least some information can be captured if it happens again:

# turn off tty blackout
# setterm -blank 0
# turn on magical sysrq
# sysctl -w kernel.sysrq=1
# do NOT reboot when a panic happens
# sysctl -w kernel.panic=0
# do NOT panic (and reboot) when an oops happens
# sysctl -w kernel.panic_on_oops=0

05.12.17

Some references:

https://bugzilla.novell.com/bugreporting-faq/oops-reading.txt
https://bugzilla.novell.com/bugreporting-faq/kernel-debugging-intro.txt
http://kerneltrap.org/node/3648

05.01.24

Today I heard that Sun is going to open up its DTrace (http://www.osnews.com/story.php?news_id=9480). I have not seen it yet, but it looks like a very useful tool for system analysis and tuning, although it only supports Solaris. I wonder how it differs from LKST. Looking forward to it...

05.01.13

If a problem can be reproduced, it is already 80% solved. For an operating system kernel, having a way to reproduce the problem means it is 99% solved. The problem we usually face is a system that runs normally for a while and then hangs.

If the problem is hard to reproduce, the only option is to analyze whatever the hang left behind. If the system has not died completely, for example if disk interrupts and the file system still work, some log messages may have been saved to a file, but I have never been that lucky. If keyboard interrupts still respond (pressing Num Lock toggles the keyboard LED), you are lucky enough: you can bring out the magic SysRq keys. Press Alt-SysRq-T to get the stack trace of every process, Alt-SysRq-M to get memory allocation information, Alt-SysRq-W to get the current register information (on mainline kernels the key that dumps the current registers is P, so check the documentation for your kernel)... See linux/Documentation/sysrq.txt for details.

It is also best to turn off the console's automatic blanking, so that when the system dies you can at least read something off the screen. The settings are:

# echo 1 > /proc/sys/kernel/sysrq
# setterm -blank 0

Add both settings to a boot script (for example /etc/rc.d/rc.local) so that they run on every boot.

If you are unlucky and the keyboard is dead as well (worse, this is very common), waiting for the end is still not the only option. You can use a serial console to send the system's messages to another machine and locate the problem by analyzing them. Setup is as follows:

------------------------------------------------------------------------------

Preparation

1. A server to be monitored and a PC to do the monitoring.
2. A serial null-modem cable.

Configuration

1. On the server, add a new grub entry with the kernel parameters "console=ttyS0 console=tty1", for example:
   kernel /boot/vmlinuz-2.4.21-9.30AXsmp ro root=LABEL=/1 console=ttyS0 console=tty1
2. On the server, edit /etc/sysconfig/syslog and add the klogd option "-c 7" so that more kernel messages are printed, for example:
   KLOGD_OPTIONS="-x -c 7"
3. Reboot the server.
4. Connect the two machines with the serial cable and test:
   1) Run "cat /dev/ttyS0" on the PC and "echo hi > /dev/ttyS0" on the server, and check that "hi" appears on the PC.
   2) Run "cat /dev/ttyS0" on the PC and "echo w > /proc/sysrq-trigger" on the server, and check that the corresponding kernel messages appear on the PC.
   3) Run "cat /dev/ttyS0" on the PC and "modprobe loop" on the server, and check that the corresponding kernel messages appear on the PC.
5. If the tests pass, run on the PC:
   cat /dev/ttyS0 | tee /tmp/result

You can also capture the serial output with HyperTerminal on Windows.

That's it.

------------------------------------------------------------------------------

In addition, some kernels support debugging facilities such as LKCD and netdump, which are also worth a try.

Beyond that, it comes down to experience and luck. Common causes of Linux system hangs are:

System hardware problems (SCSI card, motherboard, RAID card, network card, hard disk...)
Peripheral hardware problems (KVM switch, network...)
Software problems
  Driver bugs (look for a newer driver)
  Kernel bugs (check LKML, or try a different kernel)
  System configuration

Finally, google it. Sometimes you can simply type in "What to do when a Linux system hangs?" or "My Dell PE6650 running Red Flag Server 2.1 hangs frequently" and see whether anyone has run into the same problem. Even finding nothing is a useful piece of information for the analysis: it at least suggests that your system may differ from everyone else's in some way.

Investigating Linux system hangs is both a science and an art. It draws on a wide range of hardware and software knowledge and experience, and it is a process of continual learning. Likewise, I will keep updating this piece.

-------------------------------

05.3.25

Of course, it should be added that sometimes the OS is not at fault. How do you tell whether the cause is a hardware error?

A number of bugs get reported that really don't make a lot of sense. They cause all sorts of head-scratching among kernel developers. Whilst most bug-reporters don't like to hear that their shiny new hardware may be broken/crap, sadly this is the case sometimes. Here are a few tips that may help root-cause hardware problems.

- Use the CPU clock speed that the CPU was rated for.
  Don't overclock. Don't overclock. Don't overclock. Even if Windows XP works fine on your 6GHz water-cooled Pentium4, this is no sign of stability. In some cases Linux can push the theoretical limits of the machine (by utilising all available memory bandwidth for sustained periods of time for example). Under such extreme load, the CPU will be generating a lot more heat than it will sitting idle on a Windows XP desktop.

- Make sure your power supply is adequate to power all the peripherals you have attached.
  The gotcha here is that whilst it may be adequate to get the OS booted, when it's actually doing some work (like a big compile, or running doom), it's going to use up more power than it would whilst idling. All of this power has to come from somewhere. If the PSU can't supply enough, something is going to be underpowered, which can result in very strange kernel panics.

- memtest86
  Yes it takes ages to run. Sometimes it takes at least a day before it shows up that there's a bit error in some DIMM. It's really worth the time testing though. If you don't do this test, and the problem really is flaky RAM, then the "bug" will never be fixed.

- Reset BIOS to safe defaults.
  A number of times, users have reported issues that manifest themselves as really obscure oopses that don't really make a lot of sense. They turned out to be things like "CAS timing" set too aggressive on systems with cheap RAM. (A number of times, these settings worked fine until the user added an extra DIMM). Interestingly, this problem didn't show up under memtest86 [although maybe it would if left to run long enough]

- Check cabling isn't obscuring airflow.
  Fans should be completely unobstructed, ensuring that air can circulate throughout the case.
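
The sysctl settings from the 05.12.18 entry are lost on reboot unless they are stored somewhere. A minimal sketch of one way to keep them: a function that prints the three values as a sysctl configuration fragment. The function name and the file name in the usage comment are made up for illustration; older systems may only have /etc/sysctl.conf.

```shell
#!/bin/sh
# Print the hang-debugging sysctl values from the 05.12.18 entry
# in sysctl.conf syntax ("key = value", "#" for comments).
# hang_debug_sysctl is a hypothetical helper name.
hang_debug_sysctl() {
    cat <<'EOF'
# enable the magic SysRq key
kernel.sysrq = 1
# do not reboot automatically after a panic
kernel.panic = 0
# do not turn an oops into a panic
kernel.panic_on_oops = 0
EOF
}

# Example, as root (the file name is an assumption):
#   hang_debug_sysctl > /etc/sysctl.d/90-hang-debug.conf
#   sysctl -p /etc/sysctl.d/90-hang-debug.conf
#   setterm -blank 0    # console blanking is not a sysctl, set it separately
```

The function only prints text, so it can be reviewed or diffed against an existing configuration before anything is written.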
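
For the netconsole setup mentioned in the 06.7.30 entry, the fiddly part is the parameter string, which netconsole.txt describes as [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The sketch below only assembles that string. The function name, the addresses in the usage comment, and the SRC_PORT/TGT_PORT environment variables are assumptions for illustration; 6665 and 6666 are the default ports named in the kernel documentation.

```shell
#!/bin/sh
# Build a netconsole= parameter string.
# Arguments: source IP, source device, target IP, optional target MAC.
# netconsole_param is a hypothetical helper name.
netconsole_param() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    tgt_mac=${4:-}                 # empty: the kernel falls back to broadcast
    src_port=${SRC_PORT:-6665}     # documented default source port
    tgt_port=${TGT_PORT:-6666}     # documented default target port
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# Example on the server, as root (addresses are made up):
#   modprobe netconsole "$(netconsole_param 10.0.0.1 eth1 10.0.0.2 12:34:56:78:9a:bc)"
```

On the receiving machine, a UDP listener on the target port (for example a netcat in UDP listen mode, whose exact flags vary between netcat variants) or a syslog daemon accepting remote messages collects the output.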
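
Step 1 of the serial console configuration (adding "console=ttyS0 console=tty1" to the grub kernel line) is easy to get wrong by hand when several entries need it. A small sketch, assuming the kernel line is available as a single string; the function name is hypothetical and the match is exact, so an existing variant such as "console=ttyS0,115200" is not recognized as already present.

```shell
#!/bin/sh
# Append the serial-console arguments to a kernel command line,
# skipping any argument that is already present (exact match only).
# add_console_args is a hypothetical helper name.
add_console_args() {
    line=$1
    for arg in console=ttyS0 console=tty1; do
        case " $line " in
            *" $arg "*) ;;                 # already there, leave it alone
            *) line="$line $arg" ;;        # missing, append it
        esac
    done
    printf '%s\n' "$line"
}

# Example:
#   add_console_args 'kernel /boot/vmlinuz-2.4.21-9.30AXsmp ro root=LABEL=/1'
```

The function only transforms a string; writing the result back into the grub configuration is left to the reader, since the file location differs between distributions.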
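
The capture command in step 5 of the serial console setup (cat /dev/ttyS0 | tee /tmp/result) records what was printed but not when. For a machine that hangs after days of uptime, the time of the last message is often the first thing you want. A sketch of a filter that prefixes each line with the monitoring PC's local time; the function name is hypothetical.

```shell
#!/bin/sh
# Prefix every line read from stdin with the local date and time.
# The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
# stamp_lines is a hypothetical helper name.
stamp_lines() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Example on the monitoring PC:
#   stamp_lines < /dev/ttyS0 | tee /tmp/result
```

Starting one date process per line is slow, but kernel console output over a serial line is usually low-volume enough that this should not matter.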
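
When the keyboard is out of reach but a shell still responds (over ssh, for example), the same SysRq dumps can be requested through /proc/sysrq-trigger, the file used in the serial console test above. A sketch, assuming root privileges and kernel.sysrq enabled; the function name and the SYSRQ_TRIGGER and SYSRQ_DELAY variables are invented here so that the target file and the pause can be overridden.

```shell
#!/bin/sh
# Write each given SysRq key letter to the trigger file, pausing between
# keys so the kernel output of one dump does not run into the next.
# sysrq_dump, SYSRQ_TRIGGER and SYSRQ_DELAY are hypothetical names.
sysrq_dump() {
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    for key in "$@"; do
        printf '%s\n' "$key" > "$trigger" || return 1
        sleep "${SYSRQ_DELAY:-1}"
    done
}

# Example, as root: task list, then memory information
#   sysrq_dump t m
```

The output goes to the kernel log and the console, so it is most useful together with a serial console or netconsole that carries it off the machine.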