
My Intel Linux NICs Have Developed A Nasty Habit Of Becoming Hung


  • #11
    Originally posted by cynic
    OT: nice background in 2nd screenshot
    I believe that's Michael's wife (Fatima?).

    On Topic: Sadly, that's not something I've run into (I've only got one machine running an e1000 NIC and it never sleeps and has a battery-backed power supply).

    I do wonder if rincebrain is on to something. Do you have many of the machines waking up from sleep at the same time? The sudden demand from multiple simultaneous machines waking up could be putting a strain on the electrical system and causing a bit of a brownout in the basement. Maybe e1000 NICs are more susceptible to that condition than others.

    Or yeah, some malformed packet from another machine could be messing the intel NIC driver up in a consistent manner across all of the machines.

    Can you take a look at the dmesg logs to see when the NICs started having trouble, and whether multiple machines experience the issue at the same time? Do you have any machines with the same NIC on an older kernel (e.g. 3.0.x, or even 2.6.*) that aren't experiencing issues?
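
For comparing logs across machines, something along these lines might help. It is only a sketch: `nic_events` is a made-up helper name, and the patterns are the driver messages quoted elsewhere in this thread, so other kernel or driver versions may word them differently.

```sh
# Filter a kernel log (read from stdin) down to the e1000/e1000e trouble
# lines seen in this thread, so timestamps can be compared between machines.
# Assumption: the message wording matches what posters here pasted.
nic_events() {
    grep -E 'e1000e?.*(Detected Hardware Unit Hang|Reset adapter unexpectedly|NIC Link is (Up|Down))|NETDEV WATCHDOG'
}

# Typical use on each affected machine (may need root to read the log):
#   dmesg | nic_events
```

Running it on each machine and lining up the timestamps should show whether the hangs cluster around the same moment.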



    • #12
      Had the same problem on my SuperMicro Xeon, this solved it: http://www.vxbus.com/software/linux/...k-problem.html



      • #13
        I've not seen this problem for quite some time.
        The solution I had to use was a program that tweaks the power-saving setting, which is a flag in the EEPROM attached to the Intel chip.
        I'll try and find the specific command.



        • #14
          OK look for the reply by Auke Kok


          sh fixeep eth0



          • #15
            Now that's 3 times exactly the same suggestion in a row



            • #16
              I began getting these resets with an Intel I217-LM (rev 05) after going from Ubuntu 14.04 LTS to 15.10. Traffic also increased, which might have contributed.

              Code:
              [51957.229980] ------------[ cut here ]------------
              [51957.229999] WARNING: CPU: 0 PID: 0 at /home/kernel/COD/linux/net/sched/sch_generic.c:303 dev_watchdog+0x237/0x240()
              [51957.230003] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
              [51957.230006] Modules linked in: nls_iso8859_1 intel_rapl snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel irqbypass snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event crct10dif_pclmul crc32_pclmul input_leds snd_rawmidi aesni_intel snd_seq hp_wmi sparse_keymap aes_x86_64 snd_seq_device lrw snd_timer gf128mul snd glue_helper ablk_helper soundcore cryptd serio_raw ie31200_edac edac_core lpc_ich tpm_infineon 8250_fintek mac_hid parport_pc ppdev lp parport autofs4 uas usb_storage hid_generic usbhid hid i915 video i2c_algo_bit drm_kms_helper e1000e syscopyarea sysfillrect sysimgblt ptp fb_sys_fops psmouse ahci drm libahci pps_core wmi fjes
              [51957.230086] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.1-040401-generic #201601311534
              [51957.230089] Hardware name: Hewlett-Packard HP Z230 SFF Workstation/1906, BIOS L51 v01.52 07/20/2015
              [51957.230092]  0000000000000000 77d783cfc7c4a470 ffff88012ea03d98 ffffffff813c8e14
              [51957.230098]  ffff88012ea03de0 ffff88012ea03dd0 ffffffff8107dba2 0000000000000000
              [51957.230103]  ffff880128624e80 0000000000000000 ffff8800351a4000 0000000000000001
              [51957.230108] Call Trace:
              [51957.230110]  <IRQ>  [<ffffffff813c8e14>] dump_stack+0x44/0x60
              [51957.230126]  [<ffffffff8107dba2>] warn_slowpath_common+0x82/0xc0
              [51957.230131]  [<ffffffff8107dc3c>] warn_slowpath_fmt+0x5c/0x80
              [51957.230138]  [<ffffffff8171ac87>] dev_watchdog+0x237/0x240
              [51957.230143]  [<ffffffff8171aa50>] ? qdisc_rcu_free+0x40/0x40
              [51957.230152]  [<ffffffff810e84a5>] call_timer_fn+0x35/0xf0
              [51957.230156]  [<ffffffff8171aa50>] ? qdisc_rcu_free+0x40/0x40
              [51957.230162]  [<ffffffff810e8dd1>] run_timer_softirq+0x221/0x2d0
              [51957.230169]  [<ffffffff81082446>] __do_softirq+0xf6/0x250
              [51957.230174]  [<ffffffff81082713>] irq_exit+0xa3/0xb0
              [51957.230183]  [<ffffffff81800652>] smp_apic_timer_interrupt+0x42/0x50
              [51957.230189]  [<ffffffff817fe922>] apic_timer_interrupt+0x82/0x90
              [51957.230191]  <EOI>  [<ffffffff816960f0>] ? cpuidle_enter_state+0x130/0x270
              [51957.230205]  [<ffffffff81696267>] cpuidle_enter+0x17/0x20
              [51957.230211]  [<ffffffff810c0202>] call_cpuidle+0x32/0x60
              [51957.230216]  [<ffffffff81696243>] ? cpuidle_select+0x13/0x20
              [51957.230222]  [<ffffffff810c0496>] cpu_startup_entry+0x266/0x320
              [51957.230229]  [<ffffffff817f16ac>] rest_init+0x7c/0x80
              [51957.230236]  [<ffffffff81f56011>] start_kernel+0x481/0x4a2
              [51957.230241]  [<ffffffff81f55120>] ? early_idt_handler_array+0x120/0x120
              [51957.230246]  [<ffffffff81f55339>] x86_64_start_reservations+0x2a/0x2c
              [51957.230251]  [<ffffffff81f55485>] x86_64_start_kernel+0x14a/0x16d
              [51957.230254] ---[ end trace 6efdd0998a531612 ]---
              [51957.230301] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
              [51963.143454] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
              [51983.226750] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
              [51987.092272] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
              I solved it by putting this in /etc/rc.local:
              Code:
              ethtool --offload eth0 sg off tso off ufo off gso off gro off lro off rxvlan off txvlan off
              It's probably not necessary to turn off as much as I did, but "tso off" alone wasn't enough.
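
One way to narrow down which offloads actually matter is to disable them one at a time and watch for resets after each step. The helper below is a hypothetical sketch (the name `offload_steps` is invented here): it only prints the `ethtool --offload` commands, so nothing changes until a line is run by hand. The feature names are the ones from the command above, and some of them may not exist on every kernel or driver.

```sh
# Print (not run) one ethtool command per step, each turning off one more
# offload than the previous step, in the order given on the command line.
# Usage: offload_steps DEVICE FEATURE...
offload_steps() {
    dev=$1; shift
    acc=
    for feat in "$@"; do
        acc="$acc $feat off"
        echo "ethtool --offload $dev$acc"
    done
}

# Example: offload_steps eth0 tso gso gro sg
```

`ethtool -k eth0` shows the current offload state, which is handy for checking what each step changed.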

              Oh, and I bought an Intel NIC for the igb driver.
              Last edited by _pLu_; 22 April 2016, 12:36 PM.



              • #17
                Originally posted by debianxfce

                I had intel p4 motherboard with intel networking. The pc was working fine except networking was unstable. Changing the old power unit to a new one solved the problem. Intel has not gone far with networking stability. I am so lucky that my Asus a88xm-e mobo is 100% intel free.
                I specifically didn't specify Intel in my last remark because it wasn't an Intel-specific issue - the chipset was AMD, the processor was a Phenom II X6, the problem was initially with the Realtek NIC onboard and then also manifested when I tried installing an external (Intel) PCI NIC, which was when I tried the PSU swap and the problem went away.



                • #18
                  I had a similar issue on the Asus Z170-A board with Intel I219V onboard NIC. I never did figure out how to fix it, and ended up getting a Realtek PCI-E add-on card. Really ticks me off, since I specifically go for motherboards with onboard Intel NICs. In any event, the Realtek is working perfectly and giving excellent performance, so I never researched it any further. But it would be nice to use the hardware that's actually on the motherboard.



                  • #19
                    Originally posted by rincebrain
                    I specifically didn't specify Intel in my last remark because it wasn't an Intel-specific issue - the chipset was AMD, the processor was a Phenom II X6, the problem was initially with the Realtek NIC onboard and then also manifested when I tried installing an external (Intel) PCI NIC, which was when I tried the PSU swap and the problem went away.
                    I had something similar happen recently. Funnily enough, pretty much all the details except the add-on card being an Intel NIC are identical: I started getting some weird system hangs and strange random failures after adding a couple of PCI-E cards to test something. It turned out they were the straw that broke the camel's back in terms of draw on the PSU, which I think was also getting old and just not putting out as much juice as it should, and once I swapped it for a newer, higher-wattage one everything was fine.

                    On the topic of the Intel NIC issue, Michael's problem concerns me, as I've usually seen good behavior from e1000e NICs on Linux. I help manage a few Linux systems that use dual-port on-board or add-on cards for their networking and have never seen an issue like this. The main difference is that most of those systems are older Dells and HPs, with a couple of custom-built machines in the mix. Besides what's already been suggested, I also had an issue with on-board USB in Linux, since solved, that some people indicated can also cause wonky NIC behavior (although I think all their examples had on-board Realtek cards), and it was related to IOMMU settings/handling. I had to make sure IOMMU was turned on in the BIOS as well as pass "iommu=soft" as a kernel option, since apparently this particular board has some quirk that requires the option to be on in the BIOS for USB support to work 100%, but also needs the board-supplied IOMMU table (? I'm recalling most details from memory, so I may be a little off) overridden with the kernel option to avoid a lot of page faults at boot and slow behavior.
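
To check whether an IOMMU option like the one above is actually in effect on a running system, the kernel command line can be inspected. A minimal sketch, assuming a helper named `iommu_opts` (invented for illustration) and that only the `iommu=`, `intel_iommu=` and `amd_iommu=` parameters are of interest:

```sh
# Print every IOMMU-related option found in a kernel command line string.
# With no argument, read the running kernel's command line from /proc/cmdline.
iommu_opts() {
    cmdline=${1-$(cat /proc/cmdline)}
    for opt in $cmdline; do
        case $opt in
            iommu=*|intel_iommu=*|amd_iommu=*) echo "$opt" ;;
        esac
    done
}

# Example: iommu_opts "root=/dev/sda1 iommu=soft quiet"
```

No output means none of those options were passed at boot, in which case the kernel is using its defaults.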



                    • #20
                      I see this issue on my Lenovo T431s a lot lately. I use VLAN, not sure if that is triggering it too.

                      Code:
                      [97558.373923] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
                                       TDH                  <54>
                                       TDT                  <5e>
                                       next_to_use          <5e>
                                       next_to_clean        <51>
                                     buffer_info[next_to_clean]:
                                       time_stamp           <101bd2eff>
                                       next_to_watch        <54>
                                       jiffies              <101bd30c6>
                                       next_to_watch.status <0>
                                     MAC Status             <80083>
                                     PHY Status             <796d>
                                     PHY 1000BASE-T Status  <38ff>
                                     PHY Extended Status    <3000>
                                     PCI Status             <10>
                      [97560.373977] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
                                       TDH                  <54>
                                       TDT                  <5e>
                                       next_to_use          <5e>
                                       next_to_clean        <51>
                                     buffer_info[next_to_clean]:
                                       time_stamp           <101bd2eff>
                                       next_to_watch        <54>
                                       jiffies              <101bd331e>
                                       next_to_watch.status <0>
                                     MAC Status             <80083>
                                     PHY Status             <796d>
                                     PHY 1000BASE-T Status  <3800>
                                     PHY Extended Status    <3000>
                                     PCI Status             <10>
                      [97562.374129] e1000e 0000:00:19.0 net0: Detected Hardware Unit Hang:
                                       TDH                  <54>
                                       TDT                  <5e>
                                       next_to_use          <5e>
                                       next_to_clean        <51>
                                     buffer_info[next_to_clean]:
                                       time_stamp           <101bd2eff>
                                       next_to_watch        <54>
                                       jiffies              <101bd3576>
                                       next_to_watch.status <0>
                                     MAC Status             <80083>
                                     PHY Status             <796d>
                                     PHY 1000BASE-T Status  <3800>
                                     PHY Extended Status    <3000>
                                     PCI Status             <10>
                      [97563.379998] ------------[ cut here ]------------
                      [97563.380046] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x235/0x240()
                      [97563.380053] NETDEV WATCHDOG: net0 (e1000e): transmit queue 0 timed out
                      [97563.380057] Modules linked in: rndis_wlan rndis_host cdc_ether visor uas usb_storage loop ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack mmc_block drbg ansi_cprng ecryptfs cbc sha256_ssse3 sha256_generic encrypted_keys mcryptd sha1_ssse3 sha1_generic hmac trusted tun fuse cmac ecb rfcomm 8021q mrp hid_logitech_hidpp hid_generic hid_logitech_dj usbhid hid ftdi_sio usbserial bnep uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 btusb btrtl videobuf2_core btbcm btintel joydev mousedev bluetooth videodev media cdc_mbim cdc_ncm cdc_wdm cdc_acm arc4 iwldvm mac80211 iwlwifi snd_hda_codec_hdmi cfg80211 snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp
                      [97563.380147]  kvm_intel nls_iso8859_1 kvm nls_cp437 vfat fat irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper input_leds cryptd pcspkr psmouse serio_raw i2c_i801 lpc_ich snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer shpchp e1000e mei_me mei ptp pps_core thermal wmi thinkpad_acpi nvram snd soundcore rfkill battery ac fjes evdev tpm_tis tpm mac_hid processor sch_fq_codel vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) nfsd nfs auth_rpcgss oid_registry nfs_acl lockd grace sunrpc fscache mcs7830 usbnet mii ip_tables x_tables atkbd libps2 sdhci_pci sdhci led_class mmc_core ehci_pci xhci_pci ehci_hcd xhci_hcd usbcore usb_common i8042 serio ext4 crc16 mbcache jbd2 sd_mod ahci libahci libata
                      [97563.380252]  scsi_mod i915 video button intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
                      [97563.380273] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.5.1-1-ARCH #1
                      [97563.380277] Hardware name: LENOVO 20AA000EMZ/20AA000EMZ, BIOS GHET23WW (1.08 ) 06/24/2013
                      [97563.380281]  0000000000000286 d69522a70574cb84 ffff88021e203d98 ffffffff812dae11
                      [97563.380287]  ffff88021e203de0 ffffffff81771380 ffff88021e203dd0 ffffffff81078e52
                      [97563.380293]  0000000000000000 ffff8800d5b2e680 0000000000000000 ffff8800d60d8000
                      [97563.380298] Call Trace:
                      [97563.380302]  <IRQ>  [<ffffffff812dae11>] dump_stack+0x63/0x82
                      [97563.380322]  [<ffffffff81078e52>] warn_slowpath_common+0x82/0xc0
                      [97563.380329]  [<ffffffff81078eec>] warn_slowpath_fmt+0x5c/0x80
                      [97563.380337]  [<ffffffff814ccc65>] dev_watchdog+0x235/0x240
                      [97563.380343]  [<ffffffff814cca30>] ? qdisc_rcu_free+0x40/0x40
                      [97563.380350]  [<ffffffff810e39c5>] call_timer_fn+0x35/0x150
                      [97563.380356]  [<ffffffff814cca30>] ? qdisc_rcu_free+0x40/0x40
                      [97563.380361]  [<ffffffff810e3d2c>] run_timer_softirq+0x24c/0x2f0
                      [97563.380367]  [<ffffffff8107d14d>] __do_softirq+0xcd/0x2d0
                      [97563.380372]  [<ffffffff8107d4c3>] irq_exit+0xa3/0xb0
                      [97563.380380]  [<ffffffff815b0112>] smp_apic_timer_interrupt+0x42/0x50
                      [97563.380386]  [<ffffffff815ae3f2>] apic_timer_interrupt+0x82/0x90
                      [97563.380388]  <EOI>  [<ffffffff8145fb7f>] ? cpuidle_enter_state+0x12f/0x2f0
                      [97563.380398]  [<ffffffff8145fd77>] cpuidle_enter+0x17/0x20
                      [97563.380404]  [<ffffffff810bad5a>] call_cpuidle+0x2a/0x40
                      [97563.380408]  [<ffffffff810bb155>] cpu_startup_entry+0x2c5/0x3a0
                      [97563.380415]  [<ffffffff815a0dd9>] rest_init+0x89/0x90
                      [97563.380422]  [<ffffffff8190c00f>] start_kernel+0x482/0x4a3
                      [97563.380429]  [<ffffffff8190b120>] ? early_idt_handler_array+0x120/0x120
                      [97563.380435]  [<ffffffff8190b339>] x86_64_start_reservations+0x2a/0x2c
                      [97563.380441]  [<ffffffff8190b485>] x86_64_start_kernel+0x14a/0x16d
                      [97563.380446] ---[ end trace ceb564f38ab018e5 ]---
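
The hex values in the "Detected Hardware Unit Hang" dumps can be turned into something more readable with a little arithmetic. This is a rough sketch: `hexdiff` is a made-up helper, it ignores ring wraparound, and it rests on the assumption that `time_stamp` is the jiffies value recorded when the stuck buffer was queued.

```sh
# Subtract two hex values written without the 0x prefix, exactly as they
# appear between the angle brackets in the dump. Prints the decimal result.
# Usage: hexdiff FROM TO   (computes TO - FROM; no ring-wraparound handling)
hexdiff() {
    echo $(( 0x$2 - 0x$1 ))
}

# Jiffies the buffer has been waiting:   hexdiff <time_stamp> <jiffies>
# Descriptors queued but not fetched:    hexdiff <TDH> <TDT>
```

For the first dump above, `hexdiff 101bd2eff 101bd30c6` gives 455 and `hexdiff 54 5e` gives 10. The jiffies value grows by 600 between dumps logged two seconds apart, which suggests HZ=300 on this kernel, so the buffer had been waiting about 1.5 seconds at the first report while TDH stayed at 54 in all three dumps.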

