
Symptoms

  • The system becomes unresponsive, hanging repeatedly for several seconds at a time over a period of a few minutes. Often, opening another ssh connection to the system "unlocks" it for a (split) second.
  • Except for these periods, the system will work fine. No reboot required to recover.
  • sometimes:
    • INFO: task ipmi-sensors:19315 blocked for more than 120 seconds.

    • INFO: task ipmi-sel:19317 blocked for more than 120 seconds.

  • sometimes: Clocksource tsc unstable. Switching to clocksource hpet.

  • once: SAS bus reset by controller

Hardware

  • affected:
    • Dell M610 with X5550 Nehalem
  • apparently not affected:
    • Dell 2950 with X5150 Woodcrest
    • Dell R510 with L5630 Westmere
    • Dell T3500 with W3503 Nehalem
    • Supermicro X8DTG-QF with X5667 Westmere

What didn't help

  • disabling MSI for the bnx2 driver
  • disabling iptables
  • SELinux is disabled anyway
  • booting with clocksource=hpet (it significantly reduced the frequency of the hangs though)

Probable Cause

Bugs in Nehalem deep C-States.

Links:

The Citrix document is currently unavailable but still in the Google cache. It describes lockups of XenServer 5.6, which supports these C-States. It suggests that the following errata in Nehalem and Westmere CPUs are the cause:

  • Nehalem 55xx: AAK120, described in http://www.intel.com/Assets/PDF/specupdate/321324.pdf :

    "Rapid Core C3/C6 Transition May Cause Unpredictable System Behavior
    
    Under a complex set of internal conditions, cores rapidly performing C3/C6 transitions in a system with Intel® Hyper-Threading Technology enabled may cause a machine check error (IA32_MCi_STATUS.MCACOD = 0x0106), system hang or unpredictable system behavior.
    
    This erratum may cause a machine check error, system hang or unpredictable system behavior."
  • Nehalem 35xx: AAM108, described in http://www.intel.com/Assets/PDF/specupdate/321333.pdf :

    "Rapid Core C3/C6 Transition May Cause Unpredictable System Behavior
    
    Under a complex set of internal conditions, cores rapidly performing C3/C6 transitions in a system with Intel® Hyper-Threading Technology enabled may cause a machine check error (IA32_MCi_STATUS.MCACOD = 0x0106), system hang or unpredictable system behavior.
    
    This erratum may cause a machine check error, system hang or unpredictable system behavior."
  • Westmere 56xx: BD59, described in http://www.intel.com/content/dam/doc/specification-update/xeon-5600-specification-update.pdf :

    "Package C3/C6 Transitions When Memory 2x Refresh is Enabled May Result in a System Hang
    
    If ASR_PRESENT (MC_CHANNEL_{0,1,2}_REFRESH_THROTTLE_SUPPORT CSR function 0, offset 68H, bit [0], Auto Self Refresh Present) is clear which indicates that high temperature operation is not supported on the DRAM, the memory controller will not enter self-refresh if software has REF_2X_NOW (bit 4 of the MC_CLOSED_LOOP CSR, function 3, offset 84H) set. This scenario may cause the system to hang during C3/C6 entry.
    
    Failure to enter self-refresh can delay C3/C6 power state transitions to the point that a system hang may result with CATERR being asserted. REF_2X_NOW is used to double the refresh rate when the DRAM is operating in extended temperature range. The ASR_PRESENT was intended to allow low power self refresh with DRAM that does not support automatic self refresh."

This may explain why we have only observed this problem on the M610: the W3503 in the workstations doesn't support Hyper-Threading, and the RAM in our Westmere systems is probably (hopefully) not operating in the "extended temperature range".
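
Since the Nehalem errata only apply with Hyper-Threading enabled, it can be useful to check its state on a given host. The following is a minimal sketch, not a definitive test: ht_status is a hypothetical helper name, and it assumes that a "siblings" count larger than the "cpu cores" count in /proc/cpuinfo means Hyper-Threading is active.

```shell
#!/bin/sh
# ht_status (hypothetical helper): print "enabled", "disabled" or "unknown".
# Assumption: "siblings" (logical CPUs per package) exceeding "cpu cores"
# (physical cores per package) indicates that Hyper-Threading is active.
# The optional argument points at a saved copy of /proc/cpuinfo.
ht_status() {
    cpuinfo="${1:-/proc/cpuinfo}"
    awk -F': *' '
        /^siblings/  { siblings = $2 }
        /^cpu cores/ { cores = $2 }
        END {
            # Either field missing: no statement can be made.
            if (siblings == "" || cores == "") { print "unknown"; exit }
            if (siblings + 0 > cores + 0) print "enabled"; else print "disabled"
        }
    ' "$cpuinfo"
}
```

Calling ht_status without an argument reads /proc/cpuinfo of the running system.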

Possible Workarounds

The Citrix document recommends disabling C-States in the BIOS. According to the discussion in the BZ, this won't help because the SL6 kernel finds them anyway.

The KB article recommends the kernel parameter intel_idle.max_cstate=2, which avoids C6; this is also the first recommendation in the BZ. One reporter in the BZ claims, though, that this reduces the hangs by 90% but does not eliminate them. The next recommendation is intel_idle.max_cstate=0 processor.max_cstate=1, which disables the Intel-specific code for entering deep C-states and limits the ACPI code to C1 (I believe... note that the nomenclature differs between "Intel C-states" and "ACPI C-states": ACPI C3 seems to be Intel C6). Alas, the latter parameters make the idle system consume significantly more power:

C-States.gif

Power consumption rises from 80W to more than 130W. Using the first parameter instead, it stays below 100W. For reference, the consumption under SL5 is about 115W.
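
The parameters go on the kernel command line (on SL6 presumably the kernel line in /boot/grub/grub.conf; that path is an assumption, not taken from the KB article). A minimal sketch for checking whether a running system was actually booted with a given parameter follows; has_kernel_param is a hypothetical helper name.

```shell
#!/bin/sh
# has_kernel_param (hypothetical helper): succeed if the exact parameter,
# e.g. intel_idle.max_cstate=2, appears on the kernel command line.
# The optional second argument points at a saved copy of /proc/cmdline.
has_kernel_param() {
    # One word per line, then look for an exact whole-line match.
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}
```

For example, "has_kernel_param intel_idle.max_cstate=2 && echo limited" prints "limited" only if the system was booted with that parameter.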

Disabling Hyper-Threading should be another workaround, since the Nehalem errata quoted above only apply with Hyper-Threading enabled.

How to verify that Workarounds are applied

Intel driver, unrestricted:

# dmesg|grep idle
using mwait in idle threads.
intel_idle: MWAIT substates: 0x1120
intel_idle: v0.4 model 0x1A
intel_idle: lapic_timer_reliable_states 0x2
ACPI: acpi_idle yielding to intel_idle
cpuidle: using governor ladder
cpuidle: using governor menu

# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/time:4483024
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage:11242
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:MWAIT 0x00
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:NHM-C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/power:1000
/sys/devices/system/cpu/cpu0/cpuidle/state1/time:452411903
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage:1346944
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:MWAIT 0x10
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:20
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:NHM-C3
/sys/devices/system/cpu/cpu0/cpuidle/state2/power:500
/sys/devices/system/cpu/cpu0/cpuidle/state2/time:1797289080
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage:2232648
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:MWAIT 0x20
/sys/devices/system/cpu/cpu0/cpuidle/state3/latency:200
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:NHM-C6
/sys/devices/system/cpu/cpu0/cpuidle/state3/power:350
/sys/devices/system/cpu/cpu0/cpuidle/state3/time:541786973905
/sys/devices/system/cpu/cpu0/cpuidle/state3/usage:19614317

Intel driver, limited to C3:

# dmesg|grep idle
Command line: ro root=UUID=894f2e3a-62c0-4eee-93d5-21360859b6b4 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us pci=bfsort crashkernel=auto intel_idle.max_cstate=2
Kernel command line: ro root=UUID=894f2e3a-62c0-4eee-93d5-21360859b6b4 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us pci=bfsort crashkernel=129M@0M intel_idle.max_cstate=2
using mwait in idle threads.
intel_idle: MWAIT substates: 0x1120
intel_idle: v0.4 model 0x1A
intel_idle: lapic_timer_reliable_states 0x2
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
intel_idle: max_cstate 2 reached
ACPI: acpi_idle yielding to intel_idle
cpuidle: using governor ladder
cpuidle: using governor menu

# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:C0
/sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/time:55080
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage:121
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:MWAIT 0x00
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:NHM-C1
/sys/devices/system/cpu/cpu0/cpuidle/state1/power:1000
/sys/devices/system/cpu/cpu0/cpuidle/state1/time:8614997
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage:79650
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:MWAIT 0x10
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:20
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:NHM-C3
/sys/devices/system/cpu/cpu0/cpuidle/state2/power:500
/sys/devices/system/cpu/cpu0/cpuidle/state2/time:10529011005
/sys/devices/system/cpu/cpu0/cpuidle/state2/usage:458335

Notice NHM-C6 is absent.
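
The same check can be scripted, for instance to run it across many blades. This is only a sketch: list_cstates and c6_absent are hypothetical helper names, and matching on a name ending in "C6" is an assumption based on the NHM-C6 name seen in the output above.

```shell
#!/bin/sh
# list_cstates (hypothetical helper): print the name of every cpuidle
# state exposed for one CPU, one per line.
# Arguments: CPU directory name (default cpu0), sysfs root (default below).
list_cstates() {
    cpu="${1:-cpu0}"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root/$cpu"/cpuidle/state*/name; do
        # Skip the literal pattern if no state directory exists.
        [ -r "$f" ] && cat "$f"
    done
    return 0
}

# c6_absent (hypothetical helper): succeed only if no state name ends
# in "C6" (such as NHM-C6).
c6_absent() {
    ! list_cstates "$@" | grep -q 'C6$'
}
```

With intel_idle.max_cstate=2 in effect, "c6_absent cpu0 && echo ok" should print "ok".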

With the more restrictive parameters:

Sep 25 14:07:24 blade8d kernel: Kernel command line: ro root=UUID=894f2e3a-62c0-4eee-93d5-21360859b6b4 rd_NO_LUKS rd_NO_LVM rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us pci=bfsort crashkernel=129M@0M intel_idle.max_cstate=0 processor.max_cstate=1
Sep 25 14:07:24 blade8d kernel: using mwait in idle threads.
Sep 25 14:07:24 blade8d kernel: intel_idle: disabled
Sep 25 14:07:24 blade8d kernel: ACPI: acpi_idle registered with cpuidle
Sep 25 14:07:24 blade8d kernel: cpuidle: using governor ladder
Sep 25 14:07:24 blade8d kernel: cpuidle: using governor menu

No clue how to verify the effectiveness of the processor.max_cstate=1 parameter, but the power consumption suggests it works...
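
One thing that might help here is looking at the deepest state that cpuidle exposes once acpi_idle is in charge. This is an untested sketch resting on an assumption: that acpi_idle lists its states in the same sysfs directory as intel_idle does and only lists states up to the configured limit. deepest_cstate is a hypothetical helper name, and the name acpi_idle reports for its C1 state is not known from the output above.

```shell
#!/bin/sh
# deepest_cstate (hypothetical helper): print the name of the
# highest-numbered cpuidle state exposed for one CPU.
# Arguments: CPU directory name (default cpu0), sysfs root (default below).
deepest_cstate() {
    cpu="${1:-cpu0}"
    root="${2:-/sys/devices/system/cpu}"
    i=0
    name=""
    # Walk state0, state1, ... numerically until one is missing.
    while [ -r "$root/$cpu/cpuidle/state$i/name" ]; do
        name=$(cat "$root/$cpu/cpuidle/state$i/name")
        i=$((i + 1))
    done
    # Print nothing and fail if no state was found at all.
    [ -n "$name" ] && printf '%s\n' "$name"
}
```

If processor.max_cstate=1 took effect, one would expect this to report a C1-level state rather than anything deeper.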

SL6 Development/Nehalem Hangs (last edited 2011-09-26 19:24:28 by StephanWiesand)