FTP hangs

On Oracle Linux 5, 64 bit (a derivative of RHEL) ... I have a shell script that runs every Sunday and FTPs a bunch of files from server 'prod' to server 'test'. The script executes on 'test'. This has been running for years with no problem. Normally the FTP step takes about 1.5 to 2 hours (it pulls a lot of fairly large files). The only actual 'get' command is a single mget with a wildcard file spec, so it is never looking for a specific file, just everything in the source directory.

This last weekend we physically relocated the data center. We packed up the servers, SAN, etc., trucked it all a few miles, and put it back together. Everything from that seems to be working fine, but it needs to be said in the interest of full disclosure of "what's changed".

So, on Sunday, because of the move, I was watching things a lot closer, and this job was still in the FTP step at 17:30 -- about 2 hours overdue. The log file it writes (redirecting stdout) had a timestamp of 15:30, so nothing had been written to it in 2 hours. At that point I killed the job.

As a diagnostic, and to give real-time visibility to my network admin, I kicked off the job again yesterday afternoon just before leaving for the day. This morning it was still 'running'. It was stuck at the same file as the original run. This file is neither the first nor the last to be transferred. It is neither the first nor the last one created by the job that created it. The creating job reported no issues. Permissions on the file are the same as all the others. I'd think that if the file were internally corrupted, ftp wouldn't really know or care; it just reads whatever is there.

I'm not sure where to turn next and am open to any ideas.
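For reference, the transfer is driven non-interactively from the shell script. The sketch below is only a guess at its general shape -- the host name, login, directories, and log path are placeholders, not the real script:

#!/bin/sh
# Hypothetical sketch of the weekly pull -- not the actual script.
# 'prod', the login, and the directories are placeholder values.
ftp -i -n prod <<'EOF' > /var/log/weekly_pull.log 2>&1
user transferuser secret
binary
cd /exports/backupsets
lcd /imports/backupsets
mget *
bye
EOF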

Firewall?
Normally it needs port 21 (control) open from the client to the FTP server and, for active-mode transfers, port 20 (ftp-data) back from the server to the client.
Check with telnet:

ftpserver$ telnet ftpclient 20
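It is also worth checking the control connection in the other direction (host names here are just the same placeholders):

ftpclient$ telnet ftpserver 21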

Are the files transferring at all? If it's just very slow, I would look at network speed settings. If the switch and the card in your server do not agree, there will be serious problems. We had a server that was plugged into a switch that forced the port to be 100M full duplex. The card in the server was only capable of 10M half duplex.

Everything was fine until we tried an FTP and the volume of traffic clobbered it. Telnet users were getting 2-3 second response times to keystrokes and other horrible things, yet the server's CPU was fine. I had another case where we replicate data cross-site: after a power-down the switch lost its temporary config forcing the port speed to 1Gb and dropped it back to 100M, with similar consequences.

Can you find a command like entstat that can give you the detail about how your cards are configured? You can then discuss the speed settings with the network people. They may also need to check all the hops involved in the process.
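entstat is an AIX command; on Oracle Linux the equivalent detail comes from ethtool (the interface name below is only an example):

test$ /sbin/ethtool eth0     # look at the Speed: and Duplex: lines
prod$ /sbin/ethtool eth0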

Here is this site's man page for entstat

Robin

This occurs while processing the 'n'th file during an 'mget'. If it were a firewall/port issue, it would never get that far.


Yes, at the time of the hang, it has already processed well over a dozen files of similar size on the same 'mget' command. Up to that point, all files transferred in reasonable time.

One thing I hadn't noticed when I first posted .. on both runs it hung on the same file.

Could the file actually be a pipe (with no input process) or a link that is confusing it?
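A quick way to rule that out on the source server (the path and file name below are placeholders) would be:

prod$ ls -l /exports/backupsets/stuck_file.bkp    # first character: '-' regular file, 'l' symlink, 'p' FIFO
prod$ file /exports/backupsets/stuck_file.bkp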

Is the data transfer rate good up to that point?

Robin

I remember a problem where a certain file got stuck during transfer at a certain point, regardless of whether svn or ftp was used. It turned out the problem was with the MPLS provider, who finally solved it.

Just an update to all ... two batch runs (shell script) using 'mget' were able to successfully transfer a few dozen files before stalling out on the same file each time. After killing the ftp, the file in question is in the target directory and at first glance looks good: its file size is comparable to all of the others, in the 3.5 to 4.5 GB range. But comparison to the file on the source server shows the transferred file is still a few GB short of complete. Most of these files transfer in about 5 minutes.
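The comparison itself was just the usual size/checksum check on both sides, something along these lines (paths are placeholders):

prod$ ls -l /exports/backupsets/stuck_file.bkp
test$ ls -l /imports/backupsets/stuck_file.bkp
prod$ cksum /exports/backupsets/stuck_file.bkp    # only meaningful once the sizes match
test$ cksum /imports/backupsets/stuck_file.bkp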

Today I tried to pull it manually. Started ftp at a command line and did a simple 'get' on the one file. After 20 minutes I killed it. I also did a simple get on another file from the same set and it completed in about 4 minutes, exactly as expected. So it would appear that there is something about the individual file, but I don't know what it would be. It is part of an Oracle database backupset - created by oracle's rman utility. I ran an rman 'verify' against it and it came up clean from that standpoint.

So, I can conclude ..

  1. since it always hits on that one file and only that one file, I can eliminate transient network issues.
  2. since it occurs on that file whether as part of a batch transfer or just the single file, I can eliminate any 'ftp flooding' that I've seen reference to here and there.

At this point I'm just going to let the normal weekend process run again on schedule, but if anyone has any good theories that fit all of the observations so far, I'd be willing to entertain them. If not, I'll have to figure the problem has been found to be off-topic for this forum.

Ensure your ftp service does not have a (default = 5 min) timeout!
If the transfer stalls at a certain bit pattern, and it goes over a WAN, check with your WAN provider!
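If the server side happens to be vsftpd (the stock FTP daemon on RHEL/OL 5 -- an assumption, not something stated above), the timeouts live in /etc/vsftpd/vsftpd.conf:

# /etc/vsftpd/vsftpd.conf -- assuming vsftpd is the daemon on 'prod'
idle_session_timeout=600        # default is 300 seconds (5 minutes)
data_connection_timeout=600     # default is 300 seconds; a stalled data connection is dropped after this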

I've been called "a few GB short of complete" before too. :o

Could there be what is called a packet shaper on the network throttling high-use connections? Or perhaps there is something that flips to an alternate path after a few minutes and that failover is not working properly.

The default timeout is the most likely cause though. Have a go with a larger file if you can. It's probably best to make one with tar (grab something really large) and compress it (to squeeze out anything that your ftp/sftp might be optimising).
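Something like this would do for a test file, assuming enough scratch space (the paths are only examples):

prod$ tar cf - /some/large/directory | gzip > /tmp/ftp_test_file.tar.gz
prod$ ls -lh /tmp/ftp_test_file.tar.gz      # aim for the same 3.5-4.5 GB range as the problem file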

Can you run it manually but use the hash subcommand before the get? It will display progress markers and you can then get a better idea of exactly when it is stopping.
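For example (host, directory, and file name are placeholders):

test$ ftp prod
ftp> binary
ftp> hash                      # prints a '#' for each block transferred
ftp> cd /exports/backupsets
ftp> get stuck_file.bkp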

Nothing else leaps out at the moment I'm afraid.

Robin