Data Transfers Lock System Up Completely

I have two laptops on which I've installed Ubuntu Studio 9.04. The first laptop (Acer) has a 32-bit Intel Centrino CPU and the second (HP) has a 64-bit dual-core Intel CPU. I'm running the 32-bit version of Ubuntu Studio on the Acer and the 64-bit version on the HP. While testing the installation on the Acer in August, I noticed that after about 100 or so bytes of session data had been transmitted over SSH (an 'ls' command, an scp session, etc.) the Acer would completely lock up. I created the session using gnome-terminal.

When I say "lock up" I mean a complete system failure. At first I thought it was Xorg freezing. Ctrl-Alt-F1 through F6 would not get me to a VT. Pinging the box returned nothing at this point. The only recourse is to press the power button until the system shuts off and then reboot. The system is fine as long as I don't do any SSH sessions which makes it kind of hard to use... Since this laptop was an Acer, I chalked it up to some instability in the system. Maybe a CPU overheat or bad RAM. It was a $279 refurb after all.

But I recently got a brand new HP HDX16t which is a really nice bit of hardware. To my surprise, SSH sessions do the exact same thing. So I just tested a little more. It has nothing to do with how long I remain connected. I let it sit for about five minutes and pings continued to work while I was logged in via SSH. I also checked the interface on the remote machine (the one I'm connected to the HP from) and watched to verify (roughly) the amount of data transmitted until the system stops responding to pings. It's still about 100 or so bytes and then the system is not responsive. Identical behavior to the Acer.
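For anyone wanting to watch the byte counters on the remote machine more precisely than by eye, here is a minimal sketch. It assumes a Linux box where `/proc/net/dev` is available; the function name `parse_net_dev` and the sample numbers are made up for illustration and are not taken from either laptop.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue  # not a normal 16-column statistics row
        # Column 0 is received bytes, column 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


# Hypothetical sample in the /proc/net/dev layout (values are invented).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:     800      10    0    0    0     0          0         0      800      10    0    0    0     0       0          0
  eth0:    1500      12    0    0    0     0          0         0      104       3    0    0    0     0       0          0
"""

rx, tx = parse_net_dev(SAMPLE)["eth0"]
print("eth0 rx=%d tx=%d" % (rx, tx))
```

On a live system you could feed it `open("/proc/net/dev").read()` before and after a test and subtract the two readings to see how many bytes moved before the pings stopped.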

The chances of me having two laptops that both have dodgy hardware are pretty slim. I've swapped network cables and switches (Netgear and Linksys) and the same problem occurs, so it doesn't appear to be network hardware. I tested with NFS for file transfers next to see if this might be a NIC driver issue or maybe a disk I/O driver issue. NFS and rsync both did the same thing. Sadly, since the system locks up, there's no real way I can think of to log the cause of the problem. It's likely that the system is already too far gone before it can log anything.

I also tried doing all of this with Xorg/gdm stopped straight from the console. Same behavior. It seems to only affect the system when it's transferring data out. If I pull in from somewhere everything is fine. I'm going to try and export a flash drive to see if it might be an I/O driver problem with the SATA drive...

---------- Post updated at 01:44 AM ---------- Previous update was at 01:30 AM ----------

Same thing with a flash drive exported via NFS. So it's probably not a disk I/O issue. That leads me to believe that the NIC driver might be problematic. The NIC is listed in lspci as "Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)". I'm pretty sure that's a fairly stable driver. Anything else I should check?

Did anything suspicious show up in /var/log/messages?
If I read it right, you have always used Ubuntu 9.04 on those notebooks?
What about trying a different OS, i.e. a distribution other than Ubuntu 9.04?

/var/log/messages doesn't contain anything suspicious. In another forum someone else suggested I try doing an rsync locally via 127.0.0.1 to see if that works. It did. So this seems to specifically have to do with going over the wired NIC. The loaded kernel module is r8169.

I'll test with Gentoo minimal or SystemRescueCD later to see if the problem exists cross distribution on the HP. On the Acer, I ran the 32-bit version of Studio64 (another Debian based AV production distro) and I had no trouble syncing data from one of my other desktops with rsync. Sadly I can't recall pushing data out of the Acer, so that's not much of an answer. I'll see what happens with one of these other distros and post back later.

---------- Post updated at 08:26 AM ---------- Previous update was at 08:18 AM ----------

I also tried stopping sysklogd, moving /var/log/messages to /var/log/message.old and then restarting before doing another rsync test. The system locks up and writes nothing to /var/log/messages. The only message before the crash is the restart of sysklogd. The next set of messages is from the reboot after the crash. So unfortunately the system seems to freeze well before it can save any info to the log.

---------- Post updated at 12:33 PM ---------- Previous update was at 08:26 AM ----------

OK... a little more testing and digging around and I've gotten some new info. It was recommended that I try doing the rsync to the laptop's eth0 IP address. So I did and the rsync worked fine. The only time it becomes a problem is when I try to transfer data out of the laptop over the network.

Since I have the laptop here at work with me now, and this is different networking hardware, I decided it was worth a shot to try it here just to rule out the switches at home. Same problem transferring to my workstation (which runs Gentoo).

Last night I left the system on while it was locked up and it just never came back. Today I went to put a CD in to boot a different Linux distro liveCD and test with that. When I opened the drive I hadn't turned the system off yet and I got kernel error messages:

ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata1.00: cmd 60/08:00:37:08:a9/00:00:10:00:00/40 tag 0 ncq 4096 in
res 40/00:fe:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed to set xfermode (err_mask=0x4)
ata1.00: failed to set xfermode (err_mask=0x4)
ata1.00: failed to set xfermode (err_mask=0x4)
end_request: I/O error, dev sda, sector 279513143
end_request: I/O error, dev sda, sector 279513143
end_request: I/O error, dev sda, sector 279513143
sd 0:0:0:0: [sda] Asking for cache data failed
sd 0:0:0:0: [sda] Assuming drive cache: write through
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]

So the kernel is still somewhat alive! I hesitate to say that the problem is the SATA controller or drive because if I don't do any data transfers out of the system over the network it can do other disk intensive work just fine. Audio and video editing work just great which is pretty disk intensive. Here is additional info about the system regarding the CPU and SATA portions of the system:

The CPU in the system is an Intel Core 2 Duo 7550 running at 2.26 GHz.
The SATA controller is an Intel Corporation ICH9M/M-E SATA AHCI Controller
There are six SATA ports detected (in dmesg) ata1 through ata6
The system drive is ata1.00 which is a Western Digital WD3200BEKT-60F3T1
The kernel is using libata version 3.0

I've done more web searches for soft lockups and SATA, but haven't found much that's been fruitful especially where network transfers are concerned. Still... this is progress in understanding the problem.
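When pasting kernel output like the above into a bug report, it can help to boil the repeats down to counts. The following is only a sketch: the two regular expressions are written against the message formats shown in this thread (they are an assumption for any other kernel version), and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in this thread.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize(log_text):
    """Count soft lockups per (cpu, task) and I/O errors per (device, sector)."""
    lockups = Counter()
    io_errors = Counter()
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m:
            lockups[(int(m.group(1)), m.group(3))] += 1
            continue
        m = IO_ERROR.search(line)
        if m:
            io_errors[(m.group(1), int(m.group(2)))] += 1
    return lockups, io_errors


# A shortened excerpt of the messages posted above.
SAMPLE_LOG = """\
ata1.00: failed to set xfermode (err_mask=0x4)
end_request: I/O error, dev sda, sector 279513143
end_request: I/O error, dev sda, sector 279513143
end_request: I/O error, dev sda, sector 279513143
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]
BUG: soft lockup - CPU#0 stuck for 61s! [ata_aux:46]
"""

lockups, io_errors = summarize(SAMPLE_LOG)
for (cpu, task), count in lockups.items():
    print("CPU#%d stuck in %s: %d reports" % (cpu, task, count))
for (dev, sector), count in io_errors.items():
    print("%s sector %d: %d I/O errors" % (dev, sector, count))
```

Seeing that every lockup report names the same task and every I/O error names the same sector makes it easier to describe the failure in one line.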

I should also note that I am having a lot of trouble booting various liveCDs. This is likely owing to the fact that the DVD drive is a SATA drive and the liveCDs are probably expecting IDE, so they can't find the CD-ROM for squashfs after the kernel hands off to whatever init process comes next.

Try upgrading your kernel; it may have something to do with IRQ sharing. You could also try booting with the irqpoll parameter, but that hurts performance. cat /proc/interrupts will show you which devices use which interrupts.
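To spot shared interrupt lines in that output without reading it row by row, something like the sketch below may help. It rests on assumptions: that shared handlers appear as a comma-separated list at the end of a row, and that handler names contain no spaces (names that do, such as some sound drivers, would be truncated to their last word). The sample text is invented for illustration and is not from the HP; `shared_irqs` is a hypothetical name.

```python
def shared_irqs(text):
    """Return {irq_number: [handler names]} for IRQs with more than one handler."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row lists CPU0, CPU1, ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip NMI, LOC, ERR and other non-numeric rows
        fields = rest.split(None, ncpus)
        if len(fields) <= ncpus:
            continue  # row has counts but no handler description
        # Remainder looks like "<chip type> <handler>, <handler>, ...".
        pieces = [p.strip() for p in fields[ncpus].split(",")]
        pieces[0] = pieces[0].split()[-1]  # drop the chip type prefix (heuristic)
        if len(pieces) > 1:
            shared[int(label)] = pieces
    return shared


# Invented sample in the /proc/interrupts layout for a two-CPU machine.
SAMPLE = """\
           CPU0       CPU1
  0:        123        456   IO-APIC-edge      timer
 16:       1000       2000   IO-APIC-fasteoi   ahci, eth0
 19:         50         60   IO-APIC-fasteoi   ehci_hcd:usb1
NMI:          0          0   Non-maskable interrupts
"""

for irq, handlers in shared_irqs(SAMPLE).items():
    print("IRQ %d is shared by: %s" % (irq, ", ".join(handlers)))
```

On a real system you would pass in `open("/proc/interrupts").read()`; if the NIC and the SATA controller turn up on the same line, that would be worth mentioning in any bug report.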

I booted up with the installation DVD and went into "rescue mode". I then activated the volume group for the root filesystem, mounted root and chrooted so that I essentially have the same system as when I boot normally. The only difference is the kernel. I tried rsync and ssh for tests again and this time they worked. The kernel from the installation DVD is linux-2.6.28-11-generic and my normal kernel is linux-2.6.28-3-rt with SMP and RT optimizations.

So I used aptitude to install the linux-2.6.28-11-generic kernel and headers and rebooted. The transfers now work. However, in the long run, this is not ideal since I need the RT optimizations for composing music, and SMP optimizations wouldn't hurt for audio and video work. So now I need to narrow the problem down a bit more so that I can file a bug report. Anyone know how to do that in Ubuntu land? :)

First I would see if the providers of the RT and SMP extensions have a newer kernel yet, and whether the issues persist in it. No sense reporting bugs already fixed.