Sun Quad NICs and x86_64 kernels
After my last post, in which I built and installed my new Dynamips server along with a set of Sun Quad NIC cards (501-4366, also known as HME or Happy Meal), I started to run into some issues.
After building a simple topology with one router connected to my 3550, each device showed up in the other's CDP table, which was good. The problem only appeared today when I tried to lab something up: two routers connected via that external 3550 (using 2 ports on the quad NIC) would not form a neighbourship. I checked everything, and sure enough they couldn't even ping each other.
I checked the ARP tables and the entries were all present and correct. When I ran a set of debugs between the routers I could see the packets arriving, but something was wrong with them, because the router was simply dropping them.
My next step was to rule out Dynamips and its Pcap wizardry and find out whether the card itself was at fault. After assigning an IP to the card and pinging across to the Vlan1 interface on the 3550, I was getting ping responses, but ping was complaining that the returned data differed from what it expected (see below):
david@ccie:~> ping 10.1.2.10
PING 10.1.2.10 (10.1.2.10) 56(84) bytes of data.
64 bytes from 10.1.2.10: icmp_seq=1 ttl=64 time=2.20 ms
wrong data byte #54 should be 0x36 but was 0xba
16     10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 
48     30 31 32 33 34 35 ba cc 
64 bytes from 10.1.2.10: icmp_seq=2 ttl=64 time=0.223 ms
wrong data byte #54 should be 0x36 but was 0x64
16     10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 
48     30 31 32 33 34 35 64 d8
After further troubleshooting I could see that the byte it complained about was always 2 less than the size of the datagram (byte 54 in this case, for a 56-byte datagram), so the last 2 bytes of the payload were coming back corrupted. I spent all afternoon scouring reports of this issue and finally came to the conclusion that it only occurs on systems that run a 64-bit kernel and have more than 2GB of RAM. My system meets both criteria, and if I remove 6GB to take it down to 2GB it works fine.
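If you want to check quickly whether a box meets both of those conditions, something like the sketch below should do. The function names `matches_hme_conditions` and `mem_total_kb` are made up for this example, and the "more than 2GB" threshold is my reading of the reports rather than a documented limit, so treat it as a rough indicator only.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when both conditions are met.
#   $1 = machine architecture, as printed by `uname -m`
#   $2 = total RAM in kB, as reported by MemTotal in /proc/meminfo
# The 2GB threshold is an assumption based on the behaviour described above.
matches_hme_conditions() {
    arch=$1
    mem_kb=$2
    [ "$arch" = "x86_64" ] && [ "$mem_kb" -gt $((2 * 1024 * 1024)) ]
}

# Read MemTotal (in kB) from /proc/meminfo on a Linux system.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Example use on the machine with the quad NIC:
#   if matches_hme_conditions "$(uname -m)" "$(mem_total_kb)"; then
#       echo "64-bit kernel with more than 2GB of RAM: sunhme may be affected"
#   fi
```

Keeping the comparison separate from the code that reads `/proc/meminfo` means you can also feed it numbers by hand, for example to see what would happen after adding or pulling RAM.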
How to fix the issue
I am not going to guide you through compiling your own kernel, as it is different for pretty much every distro. Make a friend of Google and search for something along the lines of ‘compile new kernel ubuntu’.
Once you have your kernel source, navigate to the ‘drivers/net/ethernet/sun’ folder, where you should find a sunhme.c file. Download the patch from here and apply it like this:
patch -p0 < sunhme.patch
Once you have done this it will ask you which file to patch. Tell it sunhme.c, and then you are ready to compile and reboot into your new custom kernel. After I compiled and booted into my new kernel the card worked perfectly, and I can now form neighbourships, ping without issues and do all the other fun L3 stuff you could possibly want to do 😛
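To confirm that replies really do come back intact on the new kernel, a sweep over a few payload sizes can be sketched like this. The function names `corrupt_offsets` and `check_sizes` and the sizes in the usage line are hypothetical. The only things taken from the output earlier in this post are the `wrong data byte #N` line format and the switch address. It assumes the usual Linux ping with its `-c` (count) and `-s` (payload size) options.

```shell
#!/bin/sh
# Print the offset from every "wrong data byte #N ..." line read on stdin.
# Lines without that message produce no output.
corrupt_offsets() {
    sed -n 's/.*wrong data byte #\([0-9][0-9]*\).*/\1/p'
}

# Send one ping per payload size and report the first corrupted offset, if any.
#   $1 = host to ping, remaining arguments = payload sizes in bytes
check_sizes() {
    host=$1
    shift
    for size in "$@"; do
        first=$(ping -c 1 -s "$size" "$host" 2>/dev/null | corrupt_offsets | head -n 1)
        echo "size=$size first_bad_offset=${first:-none}"
    done
}

# Example use against the 3550 from earlier:
#   check_sizes 10.1.2.10 56 100 200
```

On a healthy card every line should end in `first_bad_offset=none`. On the unpatched driver I would expect the offset to track the payload size, as it did in the output earlier in this post.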
I hope this makes the issue easier to resolve for anyone else who runs into it, as it took me quite some time to piece together what was wrong and the best way to go about fixing it.
thanks,
David