Friday, March 26, 2010

Does raising MAX_ORDER from 11 to 12 or higher cause e1000e errors?

(not fixed yet)

2.6.31.1
e1000e, WG82574L, v1.0.2-k2
After raising MAX_ORDER from 11 to 12 or higher, running iozone against a Samba share on the DUT produces the following dump during the 128M~2G tests. The e1000e link goes down temporarily and then comes back up, but the hang recurs again and again.

[  247.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  247.040000]   TDH                  <da>
[  247.040000]   TDT                  <8>
[  247.040000]   next_to_use          <8>
[  247.040000]   next_to_clean        <d7>
[  247.040000] buffer_info[next_to_clean]:
[  247.040000]   time_stamp           <ffffea9f>
[  247.040000]   next_to_watch        <da>
[  247.040000]   jiffies              <ffffeb50>
[  247.040000]   next_to_watch.status <0>
[  249.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  249.040000]   TDH                  <da>
[  249.040000]   TDT                  <8>
[  249.040000]   next_to_use          <8>
[  249.040000]   next_to_clean        <d7>
[  249.040000] buffer_info[next_to_clean]:
[  249.040000]   time_stamp           <ffffea9f>
[  249.040000]   next_to_watch        <da>
[  249.040000]   jiffies              <ffffec18>
[  249.040000]   next_to_watch.status <0>
[  251.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  251.040000]   TDH                  <da>
[  251.040000]   TDT                  <8>
[  251.040000]   next_to_use          <8>
[  251.040000]   next_to_clean        <d7>
[  251.040000] buffer_info[next_to_clean]:
[  251.040000]   time_stamp           <ffffea9f>
[  251.040000]   next_to_watch        <da>
[  251.040000]   jiffies              <ffffece0>
[  251.040000]   next_to_watch.status <0>
[  253.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  253.040000]   TDH                  <da>
[  253.040000]   TDT                  <8>
[  253.040000]   next_to_use          <8>
[  253.040000]   next_to_clean        <d7>
[  253.040000] buffer_info[next_to_clean]:
[  253.040000]   time_stamp           <ffffea9f>
[  253.040000]   next_to_watch        <da>
[  253.040000]   jiffies              <ffffeda8>
[  253.040000]   next_to_watch.status <0>
[  255.040000] ------------[ cut here ]------------
[  255.050000] WARNING: at net/sched/sch_generic.c:246 dev_watchdog+0x140/0x220()
[  255.060000] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
[  255.070000] Modules linked in: dwc_otg ohci_hcd ehci_hcd e1000e
[  255.090000] [<c0031bac>] (unwind_backtrace+0x0/0xdc) from [<c0045b4c>] (warn_slowpath_common+0x4c/0x80)
[  255.120000] [<c0045b4c>] (warn_slowpath_common+0x4c/0x80) from [<c0045bbc>] (warn_slowpath_fmt+0x28/0x38)
[  255.140000] [<c0045bbc>] (warn_slowpath_fmt+0x28/0x38) from [<c0223e14>] (dev_watchdog+0x140/0x220)
[  255.170000] [<c0223e14>] (dev_watchdog+0x140/0x220) from [<c004e660>] (run_timer_softirq+0x184/0x208)
[  255.200000] [<c004e660>] (run_timer_softirq+0x184/0x208) from [<c004a538>] (__do_softirq+0x78/0x100)
[  255.220000] [<c004a538>] (__do_softirq+0x78/0x100) from [<c002b070>] (_text+0x70/0x8c)
[  255.250000] [<c002b070>] (_text+0x70/0x8c) from [<c002ba58>] (__irq_svc+0x38/0x80)
[  255.270000] Exception stack(0xc037ff78 to 0xc037ffc0)
[  255.280000] ff60:                                                       c0383998 00000000 
[  255.300000] ff80: c037e000 00000000 c037e000 c0381c84 c03a5c8c c0381c78 00000000 410fb024 
[  255.330000] ffa0: 00022b2c 00000000 c037ff90 c037ffc0 c002ce1c c002ce20 60000013 ffffffff 
[  255.350000] [<c002ba58>] (__irq_svc+0x38/0x80) from [<c002ce20>] (default_idle+0x24/0x2c)
[  255.370000] [<c002ce20>] (default_idle+0x24/0x2c) from [<c002d2d8>] (cpu_idle+0x3c/0x78)
[  255.400000] [<c002d2d8>] (cpu_idle+0x3c/0x78) from [<c0008918>] (start_kernel+0x248/0x2ac)
[  255.420000] [<c0008918>] (start_kernel+0x248/0x2ac) from [<0000802c>] (0x802c)
[  255.440000] ---[ end trace d9e3a5e60e497ff6 ]---
[  255.450000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  255.450000]   TDH                  <da>
[  255.450000]   TDT                  <8>
[  255.450000]   next_to_use          <8>
[  255.450000]   next_to_clean        <d7>
[  255.450000] buffer_info[next_to_clean]:
[  255.450000]   time_stamp           <ffffea9f>
[  255.450000]   next_to_watch        <da>
[  255.450000]   jiffies              <ffffee99>
[  255.450000]   next_to_watch.status <0>
[  257.870000] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[  257.880000] 0000:00:01.0: eth0: 10/100 speed: disabling TSO
[  262.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  262.040000]   TDH                  <3>
[  262.040000]   TDT                  <4>
[  262.040000]   next_to_use          <4>
[  262.040000]   next_to_clean        <0>
[  262.040000] buffer_info[next_to_clean]:
[  262.040000]   time_stamp           <fffff00d>
[  262.040000]   next_to_watch        <3>
[  262.040000]   jiffies              <fffff12c>
[  262.040000]   next_to_watch.status <0>
[  264.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  264.040000]   TDH                  <3>
[  264.040000]   TDT                  <4>
[  264.040000]   next_to_use          <4>
[  264.040000]   next_to_clean        <0>
[  264.040000] buffer_info[next_to_clean]:
[  264.040000]   time_stamp           <fffff00d>
[  264.040000]   next_to_watch        <3>
[  264.040000]   jiffies              <fffff1f4>
[  264.040000]   next_to_watch.status <0>
[  266.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  266.040000]   TDH                  <3>
[  266.040000]   TDT                  <4>
[  266.040000]   next_to_use          <4>
[  266.040000]   next_to_clean        <0>
[  266.040000] buffer_info[next_to_clean]:
[  266.040000]   time_stamp           <fffff00d>
[  266.040000]   next_to_watch        <3>
[  266.040000]   jiffies              <fffff2bc>
[  266.040000]   next_to_watch.status <0>
[  268.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  268.040000]   TDH                  <3>
[  268.040000]   TDT                  <4>
[  268.040000]   next_to_use          <4>
[  268.040000]   next_to_clean        <0>
[  268.040000] buffer_info[next_to_clean]:
[  268.040000]   time_stamp           <fffff00d>
[  268.040000]   next_to_watch        <3>
[  268.040000]   jiffies              <fffff384>
[  268.040000]   next_to_watch.status <0>
[  270.450000] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[  270.460000] 0000:00:01.0: eth0: 10/100 speed: disabling TSO
[  275.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  275.040000]   TDH                  <5>
[  275.040000]   TDT                  <7>
[  275.040000]   next_to_use          <7>
[  275.040000]   next_to_clean        <2>
[  275.040000] buffer_info[next_to_clean]:
[  275.040000]   time_stamp           <fffff58d>
[  275.040000]   next_to_watch        <5>
[  275.040000]   jiffies              <fffff640>
[  275.040000]   next_to_watch.status <0>
[  277.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  277.040000]   TDH                  <5>
[  277.040000]   TDT                  <7>
[  277.040000]   next_to_use          <7>
[  277.040000]   next_to_clean        <2>
[  277.040000] buffer_info[next_to_clean]:
[  277.040000]   time_stamp           <fffff58d>
[  277.040000]   next_to_watch        <5>
[  277.040000]   jiffies              <fffff708>
[  277.040000]   next_to_watch.status <0>
[  279.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  279.040000]   TDH                  <5>
[  279.040000]   TDT                  <7>
[  279.040000]   next_to_use          <7>
[  279.040000]   next_to_clean        <2>
[  279.040000] buffer_info[next_to_clean]:
[  279.040000]   time_stamp           <fffff58d>
[  279.040000]   next_to_watch        <5>
[  279.040000]   jiffies              <fffff7d0>
[  279.040000]   next_to_watch.status <0>
[  281.040000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  281.040000]   TDH                  <5>
[  281.040000]   TDT                  <7>
[  281.040000]   next_to_use          <7>
[  281.040000]   next_to_clean        <2>
[  281.040000] buffer_info[next_to_clean]:
[  281.040000]   time_stamp           <fffff58d>
[  281.040000]   next_to_watch        <5>
[  281.040000]   jiffies              <fffff898>
[  281.040000]   next_to_watch.status <0>
[  283.450000] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[  283.460000] 0000:00:01.0: eth0: 10/100 speed: disabling TSO
[  304.000000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  304.000000]   TDH                  <6b>
[  304.000000]   TDT                  <a0>
[  304.000000]   next_to_use          <a0>
[  304.000000]   next_to_clean        <68>
[  304.000000] buffer_info[next_to_clean]:
[  304.000000]   time_stamp           <8d>
[  304.000000]   next_to_watch        <6b>
[  304.000000]   jiffies              <190>
[  304.000000]   next_to_watch.status <0>
[  306.000000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  306.000000]   TDH                  <6b>
[  306.000000]   TDT                  <a0>
[  306.000000]   next_to_use          <a0>
[  306.000000]   next_to_clean        <68>
[  306.000000] buffer_info[next_to_clean]:
[  306.000000]   time_stamp           <8d>
[  306.000000]   next_to_watch        <6b>
[  306.000000]   jiffies              <258>
[  306.000000]   next_to_watch.status <0>
[  308.000000] 0000:00:01.0: eth0: Detected Tx Unit Hang:
[  308.000000]   TDH                  <6b>
[  308.000000]   TDT                  <a0>
[  308.000000]   next_to_use          <a0>
[  308.000000]   next_to_clean        <68>
[  308.000000] buffer_info[next_to_clean]:
[  308.000000]   time_stamp           <8d>
[  308.000000]   next_to_watch        <6b>
[  308.000000]   jiffies              <320>
[  308.000000]   next_to_watch.status <0>
[  311.520000] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
[  311.530000] 0000:00:01.0: eth0: 10/100 speed: disabling TSO
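
To make the dumps above easier to read, here is a minimal Python sketch that pulls time_stamp and jiffies out of a few of those lines and works out how long the oldest descriptor has been waiting. The helper names (parse, age) are made up for illustration. HZ is not stated anywhere in the log, so the sketch estimates it from two dumps taken 2 seconds apart; the 32-bit mask is an assumption about the jiffies width on this ARM board.

```python
import re

# A few lines copied from the dumps above (other fields omitted).
LOG = """\
[  247.040000]   time_stamp           <ffffea9f>
[  247.040000]   jiffies              <ffffeb50>
[  249.040000]   time_stamp           <ffffea9f>
[  249.040000]   jiffies              <ffffec18>
[  304.000000]   time_stamp           <8d>
[  304.000000]   jiffies              <190>
"""

FIELD = re.compile(r"\[\s*([\d.]+)\]\s+(time_stamp|jiffies)\s+<([0-9a-f]+)>")


def parse(log):
    """Return one (seconds, time_stamp, jiffies) tuple per dump."""
    dumps, current = [], {}
    for line in log.splitlines():
        m = FIELD.search(line)
        if not m:
            continue
        secs, name, value = float(m.group(1)), m.group(2), int(m.group(3), 16)
        current[name] = value
        if name == "jiffies":  # jiffies is printed after time_stamp
            dumps.append((secs, current["time_stamp"], value))
            current = {}
    return dumps


def age(time_stamp, jiffies):
    """Jiffies elapsed since the buffer was queued (assumes 32-bit wraparound)."""
    return (jiffies - time_stamp) & 0xFFFFFFFF


dumps = parse(LOG)

# Estimate HZ from the first two dumps, which are 2 s apart in the log.
(t0, _, j0), (t1, _, j1) = dumps[0], dumps[1]
hz = ((j1 - j0) & 0xFFFFFFFF) / (t1 - t0)

for secs, ts, j in dumps:
    stalled = age(ts, j)
    print(f"t={secs:.2f}s  stalled {stalled} jiffies (~{stalled / hz:.2f} s)")
```

For the first dump the difference is 0xb1 = 177 jiffies. The jiffies value advances by 200 between dumps 2 seconds apart, which suggests HZ=100, so the buffer had already been waiting about 1.77 s when the hang was first reported.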


Intel Wired Ethernet: e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
http://sourceforge.net/tracker/download.php?group_id=42302&atid=447451&file_id=172710&aid=1460945

[Bug 30476] Re: e1000_clean_tx_irq: Detected Tx Unit Hang
https://lists.ubuntu.com/archives/kernel-bugs/2007-July/028088.html
Several NIC's with the 82573 chipset display "TX unit hang" messages
during normal operation with the linux e1000 driver. The issue appears
both with TSO enabled and disabled, and is caused by a power management
function that is enabled in the EEPROM. Early releases of the chipsets
to vendors had the EEPROM bit that enabled the feature. After the issue
was discovered newer adapters were released with the feature disabled in
the EEPROM.

See : http://e1000.sourceforge.net/wiki/index.php/Issues#82573.28V.2FL.2FE.29_TX_Unit_Hang_messages
Run the script fixeep-82573-dspd.sh in order to fix the problem !


e1000_clean_tx_irq: Detected Tx Unit Hang
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/30476
Please note, there are multiple causes of this symptom.

As posted by Dominique Gallot above, if you have an adapter (on board or standalone) based on the Intel 82573 chipset, this issue may be caused by an improper setting in the adapter's EEPROM. It will occur regardless of TSO setting, and even in the new 2.6.27 kernel shipping with Intrepid Ibex. The following document from Intel has additional information, under the section "82573(V/L/E) TX Unit Hang Messages".
http://downloadmirror.intel.com/9180/eng/README.txt

The script to automatically detect and permanently fix this issue can be downloaded at:
http://e1000.sourceforge.net/files/fixeep-82573-dspd.sh

Note: The link provided in the comment above is now broken.


http://downloadmirror.intel.com/9180/eng/README.txt
CAUTION: If you are using the Intel(R) PRO/1000 CT Network Connection (controller 82547), setting InterruptThrottleRate to a value greater than 75,000, may hang (stop transmitting) adapters under certain network conditions. If this occurs a NETDEV WATCHDOG message is logged in the system event log. In addition, the controller is automatically reset, restoring the network connection. To eliminate the potential for the hang, ensure that InterruptThrottleRate is set no greater than 75,000 and is not set to 0.

(.................)

CAUTION: When setting RxIntDelay to a value other than 0, adapters may hang (stop transmitting) under certain network conditions. If this occurs a NETDEV WATCHDOG message is logged in the system event log. In addition, the controller is automatically reset, restoring the network connection. To eliminate the potential for the hang ensure that RxIntDelay is set to 0.

(.................)

ignore_64bit_dma
----------------
Valid Range: 0-xxxxxxx (0=off)
Default Value: 0
Usage: insmod e1000.ko ignore_64bit_dma=1

When non zero the driver will only request DMA mapping of host memory
in the lower 4GB region. This provides a workaround for users of AMD platforms
GA-MA78G-DS3H & SM4021M-T2R+ that have reported TXHangs on system that have
>4GB RAM, suspected caused by some (no deep root cause) issue in the Dual
Address Cycle (DAC) DMA mechanism needed to access addresses above 4GB.
Setting ignore_64bit_dma to 1 activates the workaround.

This parameter is different than other parameters, in that it is a
single (not 1,1,1 etc.) parameter applied to all driver instances and
it is also available during runtime at
/sys/module/e1000/parameters/ignore_64bit_dma

(.................)

Detected Tx Unit Hang in Quad Port Adapters
-------------------------------------------
In some cases ports 3 and 4 don't pass traffic and report 'Detected Tx Unit
Hang' followed by 'NETDEV WATCHDOG: ethX: transmit timed out' errors. Ports
1 and 2 don't show any errors and will pass traffic.

This issue MAY be resolved by updating to the latest kernel and BIOS. The
user is encouraged to run an OS that fully supports MSI interrupts. You can
check your system's BIOS by downloading the Linux Firmware Developer Kit
that can be obtained at http://www.linuxfirmwarekit.org/

82573(V/L/E) TX Unit Hang Messages
----------------------------------
Several adapters with the 82573 chipset display "TX unit hang" messages
during normal operation with the e1000 driver. The issue appears both with
TSO enabled and disabled, and is caused by a power management function that
is enabled in the EEPROM. Early releases of the chipsets to vendors had the
EEPROM bit that enabled the feature. After the issue was discovered newer
adapters were released with the feature disabled in the EEPROM.

If you encounter the problem in an adapter, and the chipset is an 82573-based
one, you can verify that your adapter needs the fix by using ethtool:

# ethtool -e eth0
Offset Values
------ ------
0x0000 00 12 34 56 fe dc 30 0d 46 f7 f4 00 ff ff ff ff
0x0010 ff ff ff ff 6b 02 8c 10 d9 15 8c 10 86 80 de 83

The value at offset 0x001e (de) has bit 0 unset. This enables the problematic
power saving feature. In this case, the EEPROM needs to read "df" at offset
0x001e.
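
(My own note, not part of the README.) The bit test described above can be sketched in Python against the sample dump quoted in the README. The function name eeprom_bytes is hypothetical, and the parser assumes the row layout shown above (an offset followed by 16 hex bytes); actual ethtool output may be formatted slightly differently.

```python
# Sample `ethtool -e eth0` output, copied from the README excerpt above.
DUMP = """\
Offset Values
------ ------
0x0000 00 12 34 56 fe dc 30 0d 46 f7 f4 00 ff ff ff ff
0x0010 ff ff ff ff 6b 02 8c 10 d9 15 8c 10 86 80 de 83
"""


def eeprom_bytes(dump):
    """Map absolute EEPROM offset -> byte value for every data row."""
    data = {}
    for line in dump.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("0x"):
            continue  # skip the header and separator lines
        base = int(fields[0].rstrip(":"), 16)
        for i, byte in enumerate(fields[1:]):
            data[base + i] = int(byte, 16)
    return data


data = eeprom_bytes(DUMP)
value = data[0x1E]
needs_fix = (value & 0x01) == 0  # bit 0 unset: power saving feature enabled
fixed = value | 0x01             # the value the README says should be there

print(f"offset 0x1e = {value:#04x}, needs fix: {needs_fix}, "
      f"fixed value: {fixed:#04x}")
```

With the sample dump, offset 0x1e holds 0xde, bit 0 is clear, and setting it gives 0xdf, matching the value the script writes with ethtool -E.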

A one-time EEPROM fix is available as a shell script. This script will verify
that the adapter is applicable to the fix and if the fix is needed or not. If
the fix is required, it applies the change to the EEPROM and updates the
checksum. The user must reboot the system after applying the fix if changes
were made to the EEPROM.

Example output of the script:

# bash fixeep-82573-dspd.sh eth0
eth0: is a "82573E Gigabit Ethernet Controller"
This fixup is applicable to your hardware
executing command: ethtool -E eth0 magic 0x109a8086 offset 0x1e value 0xdf
Change made. You *MUST* reboot your machine before changes take effect!

The script can be downloaded at
http://e1000.sourceforge.net/files/fixeep-82573-dspd.sh

(.................)


(Not fixed yet...) Note that I'm using a WG82574L, not an 82573.
With MAX_ORDER raised from 11 to 12, pageblock_order = (MAX_ORDER-1) becomes 11. Defining pageblock_nr_pages as (1<<10) instead of (1<<pageblock_order) seems to avoid the problem.
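
The arithmetic behind the note above, as a small Python sketch. It only mirrors the two definitions mentioned there (pageblock_order = MAX_ORDER-1 and pageblock_nr_pages = 1<<pageblock_order). The 4 KiB page size is an assumption about this board, not something shown in the log.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pageblock_nr_pages(max_order):
    """Pages per pageblock when pageblock_order = MAX_ORDER - 1."""
    pageblock_order = max_order - 1
    return 1 << pageblock_order


for max_order in (11, 12):
    pages = pageblock_nr_pages(max_order)
    mib = pages * PAGE_SIZE // (1024 * 1024)
    print(f"MAX_ORDER={max_order}: pageblock_nr_pages={pages} ({mib} MiB)")
```

Under that assumption, MAX_ORDER=11 gives 1024 pages (4 MiB) per pageblock and MAX_ORDER=12 gives 2048 pages (8 MiB). Hard-coding (1<<10) therefore keeps the pageblock size at its old value while MAX_ORDER itself is larger.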
